All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC net-next 0/3] Proposal for VRF-lite
@ 2015-06-08 18:35 Shrijeet Mukherjee
  2015-06-08 18:35 ` [RFC net-next 1/3] Symbol preparation for VRF driver Shrijeet Mukherjee
                   ` (6 more replies)
  0 siblings, 7 replies; 36+ messages in thread
From: Shrijeet Mukherjee @ 2015-06-08 18:35 UTC (permalink / raw)
  To: hannes, nicolas.dichtel, dsahern, ebiederm, hadi, davem, stephen, netdev
  Cc: roopa, gospo, jtoppins, nikolay, Shrijeet Mukherjee

From: Shrijeet Mukherjee <shm@cumulusnetworks.com>

In the context of internet scale routing a requirement that always
comes up is the need to partition the available routing tables into
disjoint routing planes. A specific use case is the multi-tenancy
problem where each tenant has their own unique routing tables and in
the very least need different default gateways.

This is an attempt to build the ability to create virtual router
domains aka VRF's (VRF-lite to be specific) in the linux packet
forwarding stack. The main observation is that through the use of
rules and socket binding to interfaces, all the facilities that we
need are already present in the infrastructure. What is missing is a
handle that identifies a routing domain and can be used to gather
applicable rules/tables and uniqify neighbor selection. The scheme
used needs to preserves the notions of ECMP, and general routing
principles.

This driver is a cross between functionality that the IPVLAN driver
and the Team drivers provide where a device is created and packets
into/out of the routing domain are shuttled through this device. The
device is then used as a handle to identify the applicable rules. The
VRF device is thus the layer3 equivalent of a vlan device.

The very important point to note is that this is only a Layer3 concept
so LLDP like tools do not need to be run in each VRF, processes can
run in unaware mode or select a VRF to be talking through. Also the
behavioral model is a generalized application of the familiar VRF-Lite
model with some performance paths that need optimization. (Specifically
the output route selector that Roopa, Robert, Thomas and EricB are
currently discussing on the MPLS thread)

High Level points

1. Simple overlay driver (minimal changes to current stack)
   * uses the existing fib tables and fib rules infrastructure
2. Modelled closely after the ipvlan driver
3. Uses current API and infrastructure.
   * Applications can use SO_BINDTODEVICE or cmsg device indentifiers
     to pick VRF (ping, traceroute just work)
   * Standard IP Rules work, and since they are aggregated against the
     device, scale is manageable
4. Completely orthogonal to Namespaces and only provides separation in
   the routing plane (and ARP)
5. Debugging is built-in as tcpdump and counters on the VRF device
   works as is.

                                                 N2
           N1 (all configs here)          +---------------+
    +--------------+                      |               |
    |swp1 :10.0.1.1+----------------------+swp1 :10.0.1.2 |
    |              |                      |               |
    |swp2 :10.0.2.1+----------------------+swp2 :10.0.2.2 |
    |              |                      +---------------+
    | VRF 0        |
    | table 5      |
    |              |
    +---------------+
    |              |
    | VRF 1        |                             N3
    | table 6      |                      +---------------+
    |              |                      |               |
    |swp3 :10.0.2.1+----------------------+swp1 :10.0.2.2 |
    |              |                      |               |
    |swp4 :10.0.3.1+----------------------+swp2 :10.0.3.2 |
    +--------------+                      +---------------+


Given the topology above, the setup needed to get the basic VRF
functions working would be

# Create the VRF devices
ip link add vrf0 type vrf table 5
ip link add vrf1 type vrf table 6

# Enslave the routing member interfaces
ip link set swp1 master vrf0
ip link set swp2 master vrf0
ip link set swp3 master vrf1
ip link set swp4 master vrf1

ip link set vrf0 up
ip link set vrf1 up

# move the connected routes from the main table to the 
# correct table

# move vrf0 connected routes from main to table 5
ip route del 10.0.1.0/24
ip route del 10.0.2.0/24
ip route add 10.0.1.0/24 dev swp1 table 5
ip route add 10.0.2.0/24 dev swp2 table 5

# move vrf1 connected routes from main to table 6
ip route del 10.0.2.0/24
ip route del 10.0.3.0/24
ip route add 10.0.3.0/24 dev swp4 table 6
ip route add 10.0.2.0/24 dev swp3 table 6

# Install the lookup rules that map table to VRF domain
ip rule add pref 200 oif vrf0 lookup 5
ip rule add pref 200 iif vrf0 lookup 5
ip rule add pref 200 oif vrf1 lookup 6
ip rule add pref 200 iif vrf1 lookup 6

# ping using VRF0 is simply
ping -I vrf0 -I <optional-src-addr> 10.0.1.2

# tcp/udp applications specify the interface using SO_BINDTODEVICE or
cmsg hdr pointing to the desired vrf device.


Design Highlights

1. RX path

The Basic action here is that for IP traffic (arp_rcv, icmp_rcv and
ip_rcv) we check the incoming interface to see if it is enslaved. If
enslaved, then the master device is used as the device for all lookups
allowing the routing table for the lookup to be selected by the IIF
rule

1.a Forwarded Traffic

In ip_route_input_slow we move the IIF to be that of the master
device. This causes the IIF rule that maps to the VRF device to be
applied forcing the packet to be looked up by the table that is
associated with that device. For forwarded traffic the VRF device
provides a convenient hook to group the forwarding action for a group
of inbound ports.

1.b Locally terminated traffic

Packets are checked in arp_rcv, icmp_rcv and ip_rcv and the IIF is
moved to the VRF device if the current IIF is enslaved. for LOCAL
traffic this has two implications.

We need the LOCAL table entries in the actual VRF device's routing
table as well, and if that is present then we will match in the flow
hashes using the device which the socket is bound to. Since using
VRF's requires the socket to bind to an interface (netdev) that is
what receive hash is going to resolve to.  all incoming frames
destined to LOCAL will need to have it's iff changed

2. TX path

2.a Locally originated traffic

The Basic point is here the oif override option that already exists in
the linux kernel. Currently if the destination device exists (and thus
is local), the flow_output_key functions generate a route to send the
pkt towards the device. Leveraging that scheme (this can be
optimized), we send the outbound pkts directly to the VRF device's
xmit function. Since we can only specify one interface there is no
concerns over ECMP, or missing pkts. Once the packet lands, the xmit
function marks the pkt so that it hits the OIF rule for that VRF
device and the proper table lookup happens and the pkt is sent along
by normal fwding actions. The only change needed in the stack (with no
added cost) is to check a FL4 flag that indicates it was originated by
the VRF driver and the oif hint is to be ignored

2.b Overlapping neighbor entries

Since the outgoing packet (socket) needs to specify the VRF domain and
we reject fwding through a device that is not enslaved, by picking the
VRF device we decide which path the packet will go through.

Considerations

ARP

The LOCAL table here is a pain, and needs to be melded into the main
table that will require special handling of ARP. ARP replies are sent
to the generic stack, and need to be accepted by hitting a LOCAL
route. If the LOCAL route is in a VRF table, then ARP replies miss
classification and end up being fwded through to the default route. If
the ARP replies are redirected to be seen as received on the VRF
device then the ARP entry is registered against the VRF device and
final forwarding using physical ports will not complete. Currently
enslaving will install the "local" route into the table associated
with the VRF device. Un-enslave will put the local route back into
LOCAL table. Hannes has a plan to work this into a per VRF local table
concept. Update : fixed this with a specific change into the arp stack

Route Leaking and Policy Routing

Policy Routing needs standard rule precedence and using fw_mark to
selective apply policies just work.  Route Leaking is an interesting
angle. Since the Nexthops used in the final forwarding step all belong
to the same namespace, there is no restriction on which Nexthop can be
used in which table. The route lookup in the context of the VRF table
enforces that it does not forward through non-slave interfaces, so
that it does not accidentally leak.
However since we are using the standard fib rules on a mismatch in the
route table that a VRF is pointing to, an attempt will be made to fwd
the packet from the next routing table (in the example shown above it
will end up on the default route of the main table)

Connected route management is still a little fragile ..

Bug-Fixes,Ideas from Hannes, David Ahern, Roopa Prabhu, Jon Toppins
                     Jamal

Shrijeet Mukherjee (3):
  Symbol preparation for VRF driver
  VRF driver and needed infrastructure
  rcv path changes for vrf traffic

 drivers/net/Kconfig          |    6 +
 drivers/net/Makefile         |    1 +
 drivers/net/vrf.c            |  654 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/netdevice.h    |   10 +
 include/net/flow.h           |    1 +
 include/net/vrf.h            |   19 ++
 include/uapi/linux/if_link.h |    9 +
 net/ipv4/fib_frontend.c      |   15 +-
 net/ipv4/fib_trie.c          |    9 +-
 net/ipv4/icmp.c              |    6 +
 net/ipv4/route.c             |    3 +-
 11 files changed, 727 insertions(+), 6 deletions(-)
 create mode 100644 drivers/net/vrf.c
 create mode 100644 include/net/vrf.h

-- 
1.7.10.4

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [RFC net-next 1/3] Symbol preparation for VRF driver
  2015-06-08 18:35 [RFC net-next 0/3] Proposal for VRF-lite Shrijeet Mukherjee
@ 2015-06-08 18:35 ` Shrijeet Mukherjee
  2015-06-10 16:24   ` Alexander Duyck
  2015-06-08 18:35 ` [RFC net-next 2/3] VRF driver and needed infrastructure Shrijeet Mukherjee
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 36+ messages in thread
From: Shrijeet Mukherjee @ 2015-06-08 18:35 UTC (permalink / raw)
  To: hannes, nicolas.dichtel, dsahern, ebiederm, hadi, davem, stephen, netdev
  Cc: roopa, gospo, jtoppins, nikolay, Shrijeet Mukherjee

From: Shrijeet Mukherjee <shm@cumulusnetworks.com>

This change is needed for the following VRF driver which creates
a routing domain and allows applications to bind to the domain.

No active code path changes.

Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com>
---
 net/ipv4/fib_frontend.c |    1 +
 net/ipv4/fib_trie.c     |    2 ++
 2 files changed, 3 insertions(+)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 872494e..9d4cef4 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -108,6 +108,7 @@ struct fib_table *fib_new_table(struct net *net, u32 id)
 	hlist_add_head_rcu(&tb->tb_hlist, &net->ipv4.fib_table_hash[h]);
 	return tb;
 }
+EXPORT_SYMBOL_GPL(fib_new_table);
 
 /* caller must hold either rtnl or rcu read lock */
 struct fib_table *fib_get_table(struct net *net, u32 id)
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 01bce15..97fa62d 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1247,6 +1247,7 @@ out:
 err:
 	return err;
 }
+EXPORT_SYMBOL_GPL(fib_table_insert);
 
 static inline t_key prefix_mismatch(t_key key, struct key_vector *n)
 {
@@ -1535,6 +1536,7 @@ int fib_table_delete(struct fib_table *tb, struct fib_config *cfg)
 	alias_free_mem_rcu(fa_to_delete);
 	return 0;
 }
+EXPORT_SYMBOL_GPL(fib_table_delete);
 
 /* Scan for the next leaf starting at the provided key value */
 static struct key_vector *leaf_walk_rcu(struct key_vector **tn, t_key key)
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [RFC net-next 2/3] VRF driver and needed infrastructure
  2015-06-08 18:35 [RFC net-next 0/3] Proposal for VRF-lite Shrijeet Mukherjee
  2015-06-08 18:35 ` [RFC net-next 1/3] Symbol preparation for VRF driver Shrijeet Mukherjee
@ 2015-06-08 18:35 ` Shrijeet Mukherjee
  2015-06-08 19:08   ` David Ahern
                     ` (4 more replies)
  2015-06-08 18:35 ` [RFC net-next 3/3] rcv path changes for vrf traffic Shrijeet Mukherjee
                   ` (4 subsequent siblings)
  6 siblings, 5 replies; 36+ messages in thread
From: Shrijeet Mukherjee @ 2015-06-08 18:35 UTC (permalink / raw)
  To: hannes, nicolas.dichtel, dsahern, ebiederm, hadi, davem, stephen, netdev
  Cc: roopa, gospo, jtoppins, nikolay, Shrijeet Mukherjee

From: Shrijeet Mukherjee <shm@cumulusnetworks.com>

This driver borrows heavily from IPvlan and teaming drivers.

Routing domains (VRF-lite) are created by instantiating a device
and enslaving all routed interfaces that participate in the domain.
As part of the enslavement, all local routes pointing to enslaved
devices are re-pointed to the vrf device, thus forcing outgoing
sockets to bind to the vrf to function.

Standard FIB rules can then bind the VRF device to tables and regular
fib rule processing is followed.

Routed traffic through the box, is fwded by using the VRF device as
the IIF and following the IIF rule to a table which is mated with
the VRF.

Locally originated traffic is directed at the VRF device using
SO_BINDTODEVICE or cmsg headers. This in turn drops the packet into
the xmit function of the vrf driver, which then completes the ip lookup
and output.

This solution is completely orthogonal to namespaces and allow the L3
equivalent of vlans to exist allowing the routing space to be
partitioned.

Example use is
   ip link add vrf0 type vrf table 5
   ip link set eth1 master vrf0
   ip link set vrf0 up

   ip rule add iif vrf0 table 5
   ip rule add oif vrf0 table 5

TODO:
This changeset is for IPv4 only
Connected route management can be made much better, but is deferred to
user space for now.

Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com>
---
 drivers/net/Kconfig          |    6 +
 drivers/net/Makefile         |    1 +
 drivers/net/vrf.c            |  654 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/netdevice.h    |   10 +
 include/net/flow.h           |    1 +
 include/net/vrf.h            |   19 ++
 include/uapi/linux/if_link.h |    9 +
 7 files changed, 700 insertions(+)
 create mode 100644 drivers/net/vrf.c
 create mode 100644 include/net/vrf.h

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 019fcef..27a333c 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -283,6 +283,12 @@ config NLMON
 	  diagnostics, etc. This is mostly intended for developers or support
 	  to debug netlink issues. If unsure, say N.
 
+config NET_VRF
+	tristate "Virtual Routing and Forwarding (Lite)"
+	---help---
+          This option enables the support for mapping interfaces into VRF's. The
+          support enables VRF devices
+
 endif # NET_CORE
 
 config SUNGEM_PHY
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index c12cb22..ca16dd6 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -25,6 +25,7 @@ obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
 obj-$(CONFIG_VXLAN) += vxlan.o
 obj-$(CONFIG_GENEVE) += geneve.o
 obj-$(CONFIG_NLMON) += nlmon.o
+obj-$(CONFIG_NET_VRF) += vrf.o
 
 #
 # Networking Drivers
diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
new file mode 100644
index 0000000..08b3e79
--- /dev/null
+++ b/drivers/net/vrf.c
@@ -0,0 +1,654 @@
+/*
+ * vrf.c: device driver to encapsulate a VRF space
+ *
+ * Copyright (c) 2015 Cumulus Networks
+ *
+ * Based on dummy, team and ipvlan drivers
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/ip.h>
+#include <linux/init.h>
+#include <linux/moduleparam.h>
+#include <linux/rtnetlink.h>
+#include <net/rtnetlink.h>
+#include <net/arp.h>
+#include <linux/u64_stats_sync.h>
+#include <linux/hashtable.h>
+
+#include <linux/inetdevice.h>
+#include <net/ip.h>
+#include <net/ip_fib.h>
+#include <net/ip6_route.h>
+#include <net/rtnetlink.h>
+#include <net/route.h>
+#include <net/addrconf.h>
+#include <net/vrf.h>
+
+#define DRV_NAME	"vrf"
+#define DRV_VERSION	"1.0"
+
+#define vrf_is_slave(dev)   ((dev->flags & IFF_SLAVE) == IFF_SLAVE)
+#define vrf_is_master(dev)  ((dev->flags & IFF_MASTER) == IFF_MASTER)
+
+#define vrf_master_get_rcu(dev) \
+	((struct net_device *)rcu_dereference(dev->rx_handler_data))
+
+struct pcpu_dstats {
+	u64			tx_pkts;
+	u64			tx_bytes;
+	u64			tx_drps;
+	u64			rx_pkts;
+	u64			rx_bytes;
+	struct u64_stats_sync	syncp;
+};
+
+struct slave {
+	struct list_head	list;
+	struct net_device	*dev;
+	long			priority;
+};
+
+struct slave_queue {
+	spinlock_t		lock; /* lock for slave insert/delete */
+	struct list_head	all_slaves;
+	int			num_slaves;
+	struct net_device	*master_dev;
+};
+
+struct net_vrf {
+	struct slave_queue	queue;
+	struct fib_table        *tb;
+	u32                     tb_id;
+};
+
+static int is_ip_rx_frame(struct sk_buff *skb)
+{
+	switch (skb->protocol) {
+	case htons(ETH_P_IP):
+	case htons(ETH_P_IPV6):
+		return 1;
+	}
+	return 0;
+}
+
+/* note: already called with rcu_read_lock */
+static rx_handler_result_t vrf_handle_frame(struct sk_buff **pskb)
+{
+	struct sk_buff *skb = *pskb;
+
+	if (is_ip_rx_frame(skb)) {
+		struct net_device *dev = vrf_master_get_rcu(skb->dev);
+		struct pcpu_dstats *dstats = this_cpu_ptr(dev->dstats);
+
+		u64_stats_update_begin(&dstats->syncp);
+		dstats->rx_pkts++;
+		dstats->rx_bytes += skb->len;
+		u64_stats_update_end(&dstats->syncp);
+	}
+	return RX_HANDLER_PASS;
+}
+
+static struct rtnl_link_stats64 *vrf_get_stats64(
+	struct net_device *dev, struct rtnl_link_stats64 *stats)
+{
+	int i;
+
+	for_each_possible_cpu(i) {
+		const struct pcpu_dstats *dstats;
+		u64 tbytes, tpkts, tdrops, rbytes, rpkts;
+		unsigned int start;
+
+		dstats = per_cpu_ptr(dev->dstats, i);
+		do {
+			start = u64_stats_fetch_begin_irq(&dstats->syncp);
+			tbytes = dstats->tx_bytes;
+			tpkts = dstats->tx_pkts;
+			tdrops = dstats->tx_drps;
+			rbytes = dstats->rx_bytes;
+			rpkts = dstats->rx_pkts;
+		} while (u64_stats_fetch_retry_irq(&dstats->syncp, start));
+		stats->tx_bytes += tbytes;
+		stats->tx_packets += tpkts;
+		stats->tx_dropped += tdrops;
+		stats->rx_bytes += rbytes;
+		stats->rx_packets += rpkts;
+	}
+	return stats;
+}
+
+static netdev_tx_t vrf_process_v6_outbound(struct sk_buff *skb,
+					   struct net_device *dev)
+{
+	return 0;
+}
+
+static int _vrf_send_v4_prep(struct sk_buff *skb, struct flowi4 *fl4,
+			     struct net_device *vrf_dev)
+{
+	struct rtable *rt;
+	struct net_device *dev = skb->dev;
+
+	rt = ip_route_output_flow(dev_net(dev), fl4, NULL);
+
+	if (IS_ERR(rt))
+		goto err;
+
+	if ((rt->rt_type != RTN_UNICAST && rt->rt_type != RTN_LOCAL) ||
+	    !rt->dst.dev->vrf_ptr) {
+		ip_rt_put(rt);
+		goto err;
+	}
+
+	/* prevent slave cross reference */
+	if (rt->dst.dev->vrf_ptr->ifindex != vrf_dev->ifindex) {
+		ip_rt_put(rt);
+		goto err;
+	}
+
+	skb_dst_drop(skb);
+	skb_dst_set(skb, &rt->dst);
+
+	return 0;
+err:
+	return 1;
+}
+
+static netdev_tx_t vrf_process_v4_outbound(struct sk_buff *skb,
+					   struct net_device *vrf_dev)
+{
+	struct iphdr *ip4h = ip_hdr(skb);
+	int ret = NET_XMIT_DROP;
+	struct net_device *dev = skb->dev;
+	struct flowi4 fl4 = {
+		/* needed to match OIF rule */
+		.flowi4_oif = vrf_dev->ifindex,
+		.flowi4_iif = LOOPBACK_IFINDEX,
+		.flowi4_tos = RT_TOS(ip4h->tos),
+		.flowi4_flags = FLOWI_FLAG_ANYSRC | FLOWI_FLAG_VRFSRC,
+		.daddr = ip4h->daddr,
+		.saddr = 0,
+	};
+
+	if (_vrf_send_v4_prep(skb, &fl4, vrf_dev))
+		goto err;
+
+	dev = skb_dst(skb)->dev;
+	ip4h->saddr = inet_select_addr(dev, 0, RT_SCOPE_LINK);
+	ret = ip_local_out(skb);
+
+	if (unlikely(net_xmit_eval(ret)))
+		vrf_dev->stats.tx_errors++;
+	else
+		ret = NET_XMIT_SUCCESS;
+
+	goto out;
+err:
+	vrf_dev->stats.tx_errors++;
+	kfree_skb(skb);
+out:
+	return ret;
+}
+
+static netdev_tx_t is_ip_tx_frame(struct sk_buff *skb, struct net_device *dev)
+{
+	/* the frames we recv have full L2 headers, strip */
+	if (skb_mac_header_was_set(skb)) {
+		skb_pull(skb, sizeof(struct ethhdr));
+		skb->mac_header = (typeof(skb->mac_header))~0U;
+		skb_reset_network_header(skb);
+	}
+
+	switch (skb->protocol) {
+	case htons(ETH_P_IP):
+		return vrf_process_v4_outbound(skb, dev);
+	case htons(ETH_P_IPV6):
+		return vrf_process_v6_outbound(skb, dev);
+	default:
+		return NET_XMIT_DROP;
+	}
+}
+
+static netdev_tx_t vrf_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	netdev_tx_t ret = is_ip_tx_frame(skb, dev);
+
+	if (likely(ret == NET_XMIT_SUCCESS || ret == NET_XMIT_CN)) {
+		struct pcpu_dstats *dstats = this_cpu_ptr(dev->dstats);
+
+		u64_stats_update_begin(&dstats->syncp);
+		dstats->tx_pkts++;
+		dstats->tx_bytes += skb->len;
+		u64_stats_update_end(&dstats->syncp);
+	} else {
+		this_cpu_inc(dev->dstats->tx_drps);
+	}
+
+	return ret;
+}
+
+int vrf_output(struct sock *sk, struct sk_buff *skb)
+{
+	struct net_device *dev = skb_dst(skb)->dev;
+
+	return vrf_xmit(skb, dev);
+}
+
+/**************************** device handling ********************/
+
+/* queue->lock must be held */
+static struct slave *__vrf_find_slave_dev(struct slave_queue *queue,
+					  struct net_device *dev)
+{
+	struct list_head *this, *head;
+
+	head = &queue->all_slaves;
+	list_for_each(this, head) {
+		struct slave *slave = list_entry(this, struct slave, list);
+
+		if (slave->dev == dev)
+			return slave;
+	}
+
+	return NULL;
+}
+
+static void vrf_kill_one_slave(struct slave_queue *queue, struct slave *slave)
+{
+	list_del(&slave->list);
+	queue->num_slaves--;
+	slave->dev->flags &= ~IFF_SLAVE;
+	netdev_rx_handler_unregister(slave->dev);
+	kfree(slave->dev->vrf_ptr);
+	slave->dev->vrf_ptr = NULL;
+	dev_put(slave->dev);
+	kfree(slave);
+}
+
+/* queue->lock must be held */
+static int __vrf_insert_slave(struct slave_queue *queue, struct slave *slave,
+			      struct net_device *master)
+{
+	struct net_vrf *vrf = netdev_priv(master);
+	struct slave *duplicate_slave = NULL;
+	int master_ifindex = master->ifindex;
+	int err = 0;
+
+	duplicate_slave = __vrf_find_slave_dev(queue, slave->dev);
+	if (duplicate_slave)
+		vrf_kill_one_slave(queue, duplicate_slave);
+
+	dev_hold(slave->dev);
+	list_add(&slave->list, &queue->all_slaves);
+	queue->num_slaves++;
+	slave->dev->flags |= IFF_SLAVE;
+
+	slave->dev->vrf_ptr = kmalloc(sizeof(*slave->dev->vrf_ptr), GFP_KERNEL);
+	if (!slave->dev->vrf_ptr)
+		return -ENODEV;
+	slave->dev->vrf_ptr->ifindex = master_ifindex;
+	slave->dev->vrf_ptr->tb_id = vrf->tb_id;
+
+	/* register the packet handler for slave ports */
+	err = netdev_rx_handler_register(slave->dev, vrf_handle_frame,
+					 (void *)master);
+	if (err) {
+		netdev_err(slave->dev,
+			   "Device %s failed to register rx_handler\n",
+			   slave->dev->name);
+		return err;
+	}
+
+	return 0;
+}
+
+static void vrf_fib_magic(int cmd, int type, __be32 dst, int dst_len,
+			  struct in_ifaddr *ifa, struct net_device *vrf_dev)
+{
+	struct net_vrf *vrf = netdev_priv(vrf_dev);
+	struct net *net = dev_net(ifa->ifa_dev->dev);
+	struct net *vrf_net = dev_net(vrf_dev);
+	struct fib_table *tb, *lc_tb;
+	struct fib_config cfg = {
+		.fc_protocol = RTPROT_KERNEL,
+		.fc_type = type,
+		.fc_dst = dst,
+		.fc_dst_len = dst_len,
+		.fc_prefsrc = ifa->ifa_local,
+		.fc_nlflags = NLM_F_CREATE | NLM_F_APPEND,
+		.fc_nlinfo = {
+			.nl_net = net,
+		},
+	};
+
+	lc_tb = fib_new_table(dev_net(vrf_dev), RT_TABLE_LOCAL);
+	tb = vrf->tb;
+	if (!tb || !lc_tb)
+		return;
+
+	if (net != vrf_net)
+		return;
+
+	if (type != RTN_LOCAL)
+		cfg.fc_scope = RT_SCOPE_LINK;
+	else
+		cfg.fc_scope = RT_SCOPE_HOST;
+
+	if (cmd == RTM_NEWROUTE) {
+		cfg.fc_table = lc_tb->tb_id;
+		cfg.fc_oif = ifa->ifa_dev->dev->ifindex;
+		fib_table_delete(lc_tb, &cfg);
+		cfg.fc_table = tb->tb_id;
+		cfg.fc_oif = vrf_dev->ifindex;
+		fib_table_insert(tb, &cfg);
+	} else {
+		cfg.fc_table = tb->tb_id;
+		cfg.fc_oif = vrf_dev->ifindex;
+		fib_table_delete(tb, &cfg);
+		cfg.fc_table = lc_tb->tb_id;
+		cfg.fc_oif = ifa->ifa_dev->dev->ifindex;
+		fib_table_insert(lc_tb, &cfg);
+	}
+}
+
+static void vrf_move_local_routes(int cmd, struct net_device *dev,
+				  struct net_device *port_dev)
+{
+	struct in_device *in_dev = __in_dev_get_rcu(port_dev);
+
+	if (!in_dev)
+		return;
+
+	for_ifa(in_dev) {
+		__be32 addr;
+
+		addr = ifa->ifa_local;
+		vrf_fib_magic(cmd, RTN_LOCAL, addr, 32, ifa, dev);
+	} endfor_ifa(in_dev);
+}
+
+static int vrf_inetaddr_event(struct notifier_block *this, unsigned long event,
+			      void *ptr)
+{
+	struct in_ifaddr *ifa = (struct in_ifaddr *)ptr;
+	struct net_device *port_dev = ifa->ifa_dev->dev;
+	struct net *net = dev_net(port_dev);
+	int master = 0;
+
+	if (port_dev->vrf_ptr)
+		master = port_dev->vrf_ptr->ifindex;
+
+	if (master) {
+		struct net_device *dev = dev_get_by_index(net, master);
+
+		switch (event) {
+		case NETDEV_UP:
+			vrf_move_local_routes(RTM_NEWROUTE, dev, port_dev);
+			break;
+		case NETDEV_DOWN:
+			vrf_move_local_routes(RTM_DELROUTE, dev, port_dev);
+			break;
+		}
+	}
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block vrf_inetaddr_notifier = {
+	.notifier_call = vrf_inetaddr_event,
+};
+
+/* netlink lock is assumed here */
+static int vrf_add_slave(struct net_device *dev,
+			 struct net_device *port_dev)
+{
+	if (!dev || !port_dev)
+		return -ENODEV;
+
+	if (dev_net(dev) != dev_net(port_dev))
+		return -ENODEV;
+
+	if (!vrf_is_master(port_dev) && !vrf_is_slave(port_dev)) {
+		struct slave *s = kmalloc(sizeof(*s), GFP_KERNEL);
+		struct net_vrf *vrf = netdev_priv(dev);
+		int ret;
+
+		if (!s)
+			return -ENOMEM;
+
+		memset(s, 0, sizeof(*s));
+		s->dev = port_dev;
+
+		spin_lock_bh(&vrf->queue.lock);
+		ret = __vrf_insert_slave(&vrf->queue, s, dev);
+		if (ret)
+			kfree(s);
+
+		spin_unlock_bh(&vrf->queue.lock);
+
+		vrf_move_local_routes(RTM_NEWROUTE, dev, port_dev);
+		ret = netdev_master_upper_dev_link(port_dev, dev);
+		return ret;
+	}
+
+	return -EINVAL;
+}
+
+static int vrf_del_slave(struct net_device *dev,
+			 struct net_device *port_dev)
+{
+	struct net_vrf *vrf = netdev_priv(dev);
+	struct slave_queue *queue = &vrf->queue;
+	struct slave *slave = __vrf_find_slave_dev(queue, port_dev);
+
+	if (!slave)
+		return -EINVAL;
+
+	vrf_kill_one_slave(queue, slave);
+	vrf_move_local_routes(RTM_DELROUTE, dev, port_dev);
+	netdev_upper_dev_unlink(port_dev, dev);
+
+	return 0;
+}
+
+static int vrf_dev_init(struct net_device *dev)
+{
+	struct net_vrf *vrf = netdev_priv(dev);
+
+	spin_lock_init(&vrf->queue.lock);
+	INIT_LIST_HEAD(&vrf->queue.all_slaves);
+	vrf->queue.master_dev = dev;
+
+	dev->dstats = netdev_alloc_pcpu_stats(struct pcpu_dstats);
+	dev->flags  =  IFF_MASTER | IFF_NOARP;
+	if (!dev->dstats)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void vrf_dev_uninit(struct net_device *dev)
+{
+	free_percpu(dev->dstats);
+}
+
+static int vrf_dev_close(struct net_device *dev)
+{
+	struct net_vrf *vrf = netdev_priv(dev);
+	struct slave_queue *queue = &vrf->queue;
+	struct list_head *this, *head;
+
+	head = &queue->all_slaves;
+	list_for_each(this, head) {
+		struct slave *slave = list_entry(this, struct slave, list);
+
+		slave->dev->vrf_ptr->ifindex = 0;
+		slave->dev->vrf_ptr->tb_id = 0;
+	}
+
+	if (dev->flags & IFF_MASTER)
+		dev->flags &= ~IFF_UP;
+/* XXX does the table not need a free
+ * fib_table_delete(vrf->tb, cfg);
+ */
+	return 0;
+}
+
+static int vrf_dev_open(struct net_device *dev)
+{
+	struct net_vrf *vrf = netdev_priv(dev);
+	struct slave_queue *queue = &vrf->queue;
+	struct list_head *this, *head;
+	int err = 0;
+
+	head = &queue->all_slaves;
+	list_for_each(this, head) {
+		struct slave *slave = list_entry(this, struct slave, list);
+
+		slave->dev->vrf_ptr->ifindex = dev->ifindex;
+		slave->dev->vrf_ptr->tb_id = vrf->tb_id;
+	}
+
+	if (dev->flags & IFF_MASTER)
+		dev->flags |= IFF_UP;
+
+	if (!vrf->tb)
+		return -EINVAL;
+
+	return err;
+}
+
+static int vrf_neigh_create(struct neighbour *n)
+{
+	n->nud_state = NUD_REACHABLE;
+	n->dead      = 0;
+
+	return 0;
+}
+
+static const struct net_device_ops vrf_netdev_ops = {
+	.ndo_init		= vrf_dev_init,
+	.ndo_uninit		= vrf_dev_uninit,
+	.ndo_open		= vrf_dev_open,
+	.ndo_stop               = vrf_dev_close,
+	.ndo_start_xmit		= vrf_xmit,
+	.ndo_get_stats64	= vrf_get_stats64,
+	.ndo_add_slave          = vrf_add_slave,
+	.ndo_del_slave          = vrf_del_slave,
+	.ndo_neigh_construct    = vrf_neigh_create,
+};
+
+static void vrf_get_drvinfo(struct net_device *dev,
+			    struct ethtool_drvinfo *info)
+{
+	strlcpy(info->driver, DRV_NAME, sizeof(info->driver));
+	strlcpy(info->version, DRV_VERSION, sizeof(info->version));
+}
+
+static const struct ethtool_ops vrf_ethtool_ops = {
+	.get_drvinfo            = vrf_get_drvinfo,
+};
+
+static void vrf_setup(struct net_device *dev)
+{
+	ether_setup(dev);
+
+	/* Initialize the device structure. */
+	dev->netdev_ops = &vrf_netdev_ops;
+	dev->ethtool_ops = &vrf_ethtool_ops;
+	dev->destructor = free_netdev;
+
+	/* Fill in device structure with ethernet-generic values. */
+	dev->tx_queue_len = 0;
+	eth_hw_addr_random(dev);
+}
+
+static int vrf_validate(struct nlattr *tb[], struct nlattr *data[])
+{
+	if (tb[IFLA_ADDRESS]) {
+		if (nla_len(tb[IFLA_ADDRESS]) != ETH_ALEN)
+			return -EINVAL;
+		if (!is_valid_ether_addr(nla_data(tb[IFLA_ADDRESS])))
+			return -EADDRNOTAVAIL;
+	}
+	return 0;
+}
+
+static int vrf_newlink(struct net *src_net, struct net_device *dev,
+		       struct nlattr *tb[], struct nlattr *data[])
+{
+	int err;
+	struct net_vrf *vrf = netdev_priv(dev);
+
+	if (data && data[IFLA_VRF_TABLE]) {
+		vrf->tb_id = nla_get_u32(data[IFLA_VRF_TABLE]);
+		/* reserve a table for this VRF device */
+		vrf->tb = fib_new_table(dev_net(dev), vrf->tb_id);
+		if (!vrf->tb)
+			return -ERANGE;
+
+		dev->priv_flags |= IFF_VRF_MASTER;
+		err = register_netdevice(dev);
+		if (err) {
+			free_netdev(dev);
+			return -ENODEV;
+		}
+	}
+	return 0;
+}
+
+static void vrf_dellink(struct net_device *dev, struct list_head *head)
+{
+	/* Need to free the table ? */
+	unregister_netdev(dev);
+}
+
+static const struct nla_policy vrf_nl_policy[IFLA_VRF_MAX + 1] = {
+	[IFLA_VRF_TABLE] = { .type = NLA_U32 },
+};
+
+static struct rtnl_link_ops vrf_link_ops __read_mostly = {
+	.kind		= DRV_NAME,
+	.priv_size      = sizeof(struct net_vrf),
+	.policy         = vrf_nl_policy,
+	.newlink        = vrf_newlink,
+	.dellink        = vrf_dellink,
+	.setup		= vrf_setup,
+	.validate	= vrf_validate,
+	.maxtype        = IFLA_VRF_MAX,
+};
+
+static int __init vrf_init_module(void)
+{
+	int err = 0;
+
+	rtnl_lock();
+	err = __rtnl_link_register(&vrf_link_ops);
+	if (err < 0)
+		goto out;
+
+	register_inetaddr_notifier(&vrf_inetaddr_notifier);
+
+out:
+	rtnl_unlock();
+	return err;
+}
+
+static void __exit vrf_cleanup_module(void)
+{
+	unregister_inetaddr_notifier(&vrf_inetaddr_notifier);
+	rtnl_link_unregister(&vrf_link_ops);
+}
+
+module_init(vrf_init_module);
+module_exit(vrf_cleanup_module);
+MODULE_LICENSE("GPL");
+MODULE_ALIAS_RTNL_LINK(DRV_NAME);
+MODULE_VERSION(DRV_VERSION);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 51f8d2f..29febf3 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -51,6 +51,7 @@
 #include <linux/neighbour.h>
 #include <uapi/linux/netdevice.h>
 #include <uapi/linux/if_bonding.h>
+#include <net/vrf.h>
 
 struct netpoll_info;
 struct device;
@@ -1270,6 +1271,7 @@ enum netdev_priv_flags {
 	IFF_XMIT_DST_RELEASE_PERM	= 1<<22,
 	IFF_IPVLAN_MASTER		= 1<<23,
 	IFF_IPVLAN_SLAVE		= 1<<24,
+	IFF_VRF_MASTER		        = 1<<25,
 };
 
 #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
@@ -1297,6 +1299,7 @@ enum netdev_priv_flags {
 #define IFF_XMIT_DST_RELEASE_PERM	IFF_XMIT_DST_RELEASE_PERM
 #define IFF_IPVLAN_MASTER		IFF_IPVLAN_MASTER
 #define IFF_IPVLAN_SLAVE		IFF_IPVLAN_SLAVE
+#define IFF_VRF_MASTER		        IFF_VRF_MASTER
 
 /**
  *	struct net_device - The DEVICE structure.
@@ -1413,6 +1416,7 @@ enum netdev_priv_flags {
  *	@dn_ptr:	DECnet specific data
  *	@ip6_ptr:	IPv6 specific data
  *	@ax25_ptr:	AX.25 specific data
+ *	@vrf_ptr:	VRF specific data
  *	@ieee80211_ptr:	IEEE 802.11 specific data, assign before registering
  *
  *	@last_rx:	Time of last Rx
@@ -1625,6 +1629,7 @@ struct net_device {
 	struct dn_dev __rcu     *dn_ptr;
 	struct inet6_dev __rcu	*ip6_ptr;
 	void			*ax25_ptr;
+	struct net_vrf_dev      *vrf_ptr;
 	struct wireless_dev	*ieee80211_ptr;
 	struct wpan_dev		*ieee802154_ptr;
 #if IS_ENABLED(CONFIG_MPLS_ROUTING)
@@ -3776,6 +3781,11 @@ static inline bool netif_supports_nofcs(struct net_device *dev)
 	return dev->priv_flags & IFF_SUPP_NOFCS;
 }
 
+static inline bool netif_is_vrf(struct net_device *dev)
+{
+	return dev->priv_flags & IFF_VRF_MASTER;
+}
+
 /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
 static inline void netif_keep_dst(struct net_device *dev)
 {
diff --git a/include/net/flow.h b/include/net/flow.h
index 8109a15..69aaa99 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -29,6 +29,7 @@ struct flowi_common {
 	__u8	flowic_flags;
 #define FLOWI_FLAG_ANYSRC		0x01
 #define FLOWI_FLAG_KNOWN_NH		0x02
+#define FLOWI_FLAG_VRFSRC		0x04
 	__u32	flowic_secid;
 };
 
diff --git a/include/net/vrf.h b/include/net/vrf.h
new file mode 100644
index 0000000..11d7dbf8
--- /dev/null
+++ b/include/net/vrf.h
@@ -0,0 +1,19 @@
+/*
+ * include/net/net_vrf.h - adds vrf dev structure definitions
+ * Copyright (c) 2015 Cumulus Networks
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#ifndef __LINUX_NET_VRF_H
+#define __LINUX_NET_VRF_H
+
+struct net_vrf_dev {
+	int                     ifindex; /* ifindex of master dev */
+	u32                     tb_id;
+};
+
+#endif /* __LINUX_NET_VRF_H */
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index afccc93..b98a443 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -339,6 +339,15 @@ enum macvlan_macaddr_mode {
 
 #define MACVLAN_FLAG_NOPROMISC	1
 
+/* VRF section */
+enum {
+	IFLA_VRF_UNSPEC,
+	IFLA_VRF_TABLE,
+	__IFLA_VRF_MAX
+};
+
+#define IFLA_VRF_MAX (__IFLA_VRF_MAX - 1)
+
 /* IPVLAN section */
 enum {
 	IFLA_IPVLAN_UNSPEC,
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [RFC net-next 3/3] rcv path changes for vrf traffic
  2015-06-08 18:35 [RFC net-next 0/3] Proposal for VRF-lite Shrijeet Mukherjee
  2015-06-08 18:35 ` [RFC net-next 1/3] Symbol preparation for VRF driver Shrijeet Mukherjee
  2015-06-08 18:35 ` [RFC net-next 2/3] VRF driver and needed infrastructure Shrijeet Mukherjee
@ 2015-06-08 18:35 ` Shrijeet Mukherjee
  2015-06-08 19:58   ` Hannes Frederic Sowa
  2015-06-10 18:31   ` Alexander Duyck
  2015-06-08 18:35 ` [RFC iproute2] Add the ability to create a VRF device and specify it's table binding Shrijeet Mukherjee
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 36+ messages in thread
From: Shrijeet Mukherjee @ 2015-06-08 18:35 UTC (permalink / raw)
  To: hannes, nicolas.dichtel, dsahern, ebiederm, hadi, davem, stephen, netdev
  Cc: roopa, gospo, jtoppins, nikolay, Shrijeet Mukherjee

From: Shrijeet Mukherjee <shm@cumulusnetworks.com>

Incoming frames for IP protocol stacks need the IIF to be changed
from the actual interface to the VRF device. This allows the IIF
rule to be used to select tables (or do regular PBR)

This change selects the iif to be the VRF device if it exists and
the incoming iif is enslaved to the VRF device.

Since VRF aware sockets are always bound to the VRF device this
system allows return traffic to find the socket of origin.

changes are in the arp_rcv, icmp_rcv and ip_rcv paths

Question : I did not wrap the rcv modifications, in CONFIG_NET_VRF
as it would create code variations and the vrf_ptr check is there
I can make that whole thing modular.

Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com>
---
 net/ipv4/fib_frontend.c |   14 +++++++++++---
 net/ipv4/fib_trie.c     |    7 +++++--
 net/ipv4/icmp.c         |    6 ++++++
 net/ipv4/route.c        |    3 ++-
 4 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 9d4cef4..cf2d584 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -45,6 +45,7 @@
 #include <net/ip_fib.h>
 #include <net/rtnetlink.h>
 #include <net/xfrm.h>
+#include <net/vrf.h>
 
 #ifndef CONFIG_IP_MULTIPLE_TABLES
 
@@ -218,15 +219,18 @@ static inline unsigned int __inet_dev_addr_type(struct net *net,
 	struct fib_result	res;
 	unsigned int ret = RTN_BROADCAST;
 	struct fib_table *local_table;
+	int rt_table = RT_TABLE_LOCAL;
 
 	if (ipv4_is_zeronet(addr) || ipv4_is_lbcast(addr))
 		return RTN_BROADCAST;
 	if (ipv4_is_multicast(addr))
 		return RTN_MULTICAST;
 
-	rcu_read_lock();
+	if (dev && dev->vrf_ptr)
+		rt_table =  dev->vrf_ptr->tb_id;
 
-	local_table = fib_get_table(net, RT_TABLE_LOCAL);
+	rcu_read_lock();
+	local_table = fib_get_table(net, rt_table);
 	if (local_table) {
 		ret = RTN_UNICAST;
 		if (!fib_table_lookup(local_table, &fl4, &res, FIB_LOOKUP_NOREF)) {
@@ -309,7 +313,11 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
 	bool dev_match;
 
 	fl4.flowi4_oif = 0;
-	fl4.flowi4_iif = oif ? : LOOPBACK_IFINDEX;
+	if (dev->vrf_ptr)
+		fl4.flowi4_iif = dev->vrf_ptr ?
+			dev->vrf_ptr->ifindex : dev->ifindex;
+	else
+		fl4.flowi4_iif = oif ? : LOOPBACK_IFINDEX;
 	fl4.daddr = src;
 	fl4.saddr = dst;
 	fl4.flowi4_tos = tos;
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 97fa62d..515ff11 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1409,8 +1409,11 @@ found:
 
 			if (nh->nh_flags & RTNH_F_DEAD)
 				continue;
-			if (flp->flowi4_oif && flp->flowi4_oif != nh->nh_oif)
-				continue;
+			if (!(flp->flowi4_flags & FLOWI_FLAG_VRFSRC)) {
+				if (flp->flowi4_oif &&
+				    flp->flowi4_oif != nh->nh_oif)
+					continue;
+			}
 
 			if (!(fib_flags & FIB_LOOKUP_NOREF))
 				atomic_inc(&fi->fib_clntref);
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index f5203fb..61b7da3 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -96,6 +96,7 @@
 #include <net/xfrm.h>
 #include <net/inet_common.h>
 #include <net/ip_fib.h>
+#include <net/vrf.h>
 
 /*
  *	Build xmit assembly blocks
@@ -420,6 +421,8 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
 			daddr = icmp_param->replyopts.opt.opt.faddr;
 	}
 	memset(&fl4, 0, sizeof(fl4));
+	if (skb->dev && skb->dev->vrf_ptr)
+		fl4.flowi4_oif = skb->dev->vrf_ptr->ifindex;
 	fl4.daddr = daddr;
 	fl4.saddr = saddr;
 	fl4.flowi4_mark = mark;
@@ -458,6 +461,9 @@ static struct rtable *icmp_route_lookup(struct net *net,
 	fl4->flowi4_proto = IPPROTO_ICMP;
 	fl4->fl4_icmp_type = type;
 	fl4->fl4_icmp_code = code;
+	if (skb_in->dev && skb_in->dev->vrf_ptr)
+		fl4->flowi4_oif =  skb_in->dev->vrf_ptr->ifindex;
+
 	security_skb_classify_flow(skb_in, flowi4_to_flowi(fl4));
 	rt = __ip_route_output_key(net, fl4);
 	if (IS_ERR(rt))
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index f605598..8c37720 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -109,6 +109,7 @@
 #include <linux/kmemleak.h>
 #endif
 #include <net/secure_seq.h>
+#include <net/vrf.h>
 
 #define RT_FL_TOS(oldflp4) \
 	((oldflp4)->flowi4_tos & (IPTOS_RT_MASK | RTO_ONLINK))
@@ -1710,7 +1711,7 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	 *	Now we are ready to route packet.
 	 */
 	fl4.flowi4_oif = 0;
-	fl4.flowi4_iif = dev->ifindex;
+	fl4.flowi4_iif = dev->vrf_ptr ? dev->vrf_ptr->ifindex : dev->ifindex;
 	fl4.flowi4_mark = skb->mark;
 	fl4.flowi4_tos = tos;
 	fl4.flowi4_scope = RT_SCOPE_UNIVERSE;
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [RFC iproute2] Add the ability to create a VRF device and specify it's table binding.
  2015-06-08 18:35 [RFC net-next 0/3] Proposal for VRF-lite Shrijeet Mukherjee
                   ` (2 preceding siblings ...)
  2015-06-08 18:35 ` [RFC net-next 3/3] rcv path changes for vrf traffic Shrijeet Mukherjee
@ 2015-06-08 18:35 ` Shrijeet Mukherjee
  2015-06-08 19:13 ` [RFC net-next 0/3] Proposal for VRF-lite David Ahern
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 36+ messages in thread
From: Shrijeet Mukherjee @ 2015-06-08 18:35 UTC (permalink / raw)
  To: hannes, nicolas.dichtel, dsahern, ebiederm, hadi, davem, stephen, netdev
  Cc: roopa, gospo, jtoppins, nikolay, Shrijeet Mukherjee

From: Shrijeet Mukherjee <shm@cumulusnetworks.com>

Modified from the iplink_vlan implementation.

Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com>
---
 include/linux/if_link.h |    8 ++++
 ip/Makefile             |    2 +-
 ip/iplink_vrf.c         |   93 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 102 insertions(+), 1 deletion(-)
 create mode 100644 ip/iplink_vrf.c

diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index 3d0d613..25bf881 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -337,6 +337,14 @@ enum macvlan_macaddr_mode {
 
 #define MACVLAN_FLAG_NOPROMISC	1
 
+/* VRF section */
+enum {
+	IFLA_VRF_UNSPEC,
+	IFLA_VRF_TABLE,
+	__IFLA_VRF_MAX
+};
+
+#define IFLA_VRF_MAX (__IFLA_VRF_MAX - 1)
 /* IPVLAN section */
 enum {
 	IFLA_IPVLAN_UNSPEC,
diff --git a/ip/Makefile b/ip/Makefile
index 2c742f3..83536d6 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -6,7 +6,7 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o ipnetns.o \
     iplink_macvlan.o iplink_macvtap.o ipl2tp.o link_vti.o link_vti6.o \
     iplink_vxlan.o tcp_metrics.o iplink_ipoib.o ipnetconf.o link_ip6tnl.o \
     link_iptnl.o link_gre6.o iplink_bond.o iplink_bond_slave.o iplink_hsr.o \
-    iplink_bridge.o iplink_bridge_slave.o ipfou.o iplink_ipvlan.o
+    iplink_bridge.o iplink_bridge_slave.o ipfou.o iplink_ipvlan.o iplink_vrf.o
 
 RTMONOBJ=rtmon.o
 
diff --git a/ip/iplink_vrf.c b/ip/iplink_vrf.c
new file mode 100644
index 0000000..58443a5
--- /dev/null
+++ b/ip/iplink_vrf.c
@@ -0,0 +1,93 @@
+/* iplink_vrf.c	VRF device support
+ *
+ *              This program is free software; you can redistribute it and/or
+ *              modify it under the terms of the GNU General Public License
+ *              as published by the Free Software Foundation; either version
+ *              2 of the License, or (at your option) any later version.
+ *
+ * Authors:     Shrijeet Mukherjee <shm@cumulusnetworks.com>
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <linux/if_link.h>
+
+#include "rt_names.h"
+#include "utils.h"
+#include "ip_common.h"
+
+static void vrf_explain(FILE *f)
+{
+	fprintf(f, "Usage: ... vrf table TABLEID \n");
+}
+
+static void explain(void)
+{
+	vrf_explain(stderr);
+}
+
+static int table_arg(void)
+{
+	fprintf(stderr,"Error: argument of \"table\" must be 0-32767 and currently unused\n");
+	return -1;
+}
+
+static int vrf_parse_opt(struct link_util *lu, int argc, char **argv,
+			    struct nlmsghdr *n)
+{
+	while (argc > 0) {
+		if (matches(*argv, "table") == 0) {
+			__u32 table = 0;
+			NEXT_ARG();
+
+			table = atoi(*argv);
+			if (table < 0 || table > 32767)
+				return table_arg();
+			/* XXX need a table in-use check here */
+			fprintf(stderr, "adding table %d\n", table);
+			addattr32(n, 1024, IFLA_VRF_TABLE, table);
+		} else if (matches(*argv, "help") == 0) {
+			explain();
+			return -1;
+		} else {
+			fprintf(stderr, "vrf: unknown option \"%s\"?\n",
+				*argv);
+			explain();
+			return -1;
+		}
+		argc--, argv++;
+	}
+
+	return 0;
+}
+
+static void vrf_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
+{
+
+	if (!tb)
+		return;
+
+	if (tb[IFLA_VRF_TABLE]) {
+		if (RTA_PAYLOAD(tb[IFLA_VRF_TABLE]) == sizeof(__u32)) {
+			__u32 table = rta_getattr_u32(tb[IFLA_VRF_TABLE]);
+
+			fprintf(f, " table %d ", table);
+		}
+	}
+}
+
+static void vrf_print_help(struct link_util *lu, int argc, char **argv,
+			      FILE *f)
+{
+	vrf_explain(f);
+}
+
+struct link_util vrf_link_util = {
+	.id		= "vrf",
+	.maxattr	= IFLA_VRF_MAX,
+	.parse_opt	= vrf_parse_opt,
+	.print_opt	= vrf_print_opt,
+	.print_help	= vrf_print_help,
+};
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [RFC net-next 2/3] VRF driver and needed infrastructure
  2015-06-08 18:35 ` [RFC net-next 2/3] VRF driver and needed infrastructure Shrijeet Mukherjee
@ 2015-06-08 19:08   ` David Ahern
  2015-06-08 20:17   ` Hannes Frederic Sowa
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 36+ messages in thread
From: David Ahern @ 2015-06-08 19:08 UTC (permalink / raw)
  To: Shrijeet Mukherjee, hannes, nicolas.dichtel, ebiederm, hadi,
	davem, stephen, netdev
  Cc: roopa, gospo, jtoppins, nikolay

On 6/8/15 12:35 PM, Shrijeet Mukherjee wrote:
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index 019fcef..27a333c 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -283,6 +283,12 @@ config NLMON
>   	  diagnostics, etc. This is mostly intended for developers or support
>   	  to debug netlink issues. If unsure, say N.
>
> +config NET_VRF
> +	tristate "Virtual Routing and Forwarding (Lite)"
> +	---help---
> +          This option enables the support for mapping interfaces into VRF's. The
> +          support enables VRF devices
> +
>   endif # NET_CORE
>
>   config SUNGEM_PHY

I think you need:

depends on IP_MULTIPLE_TABLES && IPV6_MULTIPLE_TABLES

David

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC net-next 0/3] Proposal for VRF-lite
  2015-06-08 18:35 [RFC net-next 0/3] Proposal for VRF-lite Shrijeet Mukherjee
                   ` (3 preceding siblings ...)
  2015-06-08 18:35 ` [RFC iproute2] Add the ability to create a VRF device and specify it's table binding Shrijeet Mukherjee
@ 2015-06-08 19:13 ` David Ahern
  2015-06-08 19:51   ` Shrijeet Mukherjee
  2015-06-08 20:41   ` Hannes Frederic Sowa
  2015-06-09  8:58 ` Nicolas Dichtel
  2015-06-09 10:15 ` Thomas Graf
  6 siblings, 2 replies; 36+ messages in thread
From: David Ahern @ 2015-06-08 19:13 UTC (permalink / raw)
  To: Shrijeet Mukherjee, hannes, nicolas.dichtel, ebiederm, hadi,
	davem, stephen, netdev
  Cc: roopa, gospo, jtoppins, nikolay

On 6/8/15 12:35 PM, Shrijeet Mukherjee wrote:
> 5. Debugging is built-in as tcpdump and counters on the VRF device
>     works as is.

Is the intent that something like this

   tcpdump -i vrf0

can be used to see vrf traffic?

vrf_handle_frame only bumps counters; it does not switch skb->dev to the 
vrf device so for Rx path tcpdump will not get the packets. ie., tcpdump 
only shows outbound packets.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC net-next 0/3] Proposal for VRF-lite
  2015-06-08 19:13 ` [RFC net-next 0/3] Proposal for VRF-lite David Ahern
@ 2015-06-08 19:51   ` Shrijeet Mukherjee
  2015-06-08 20:41   ` Hannes Frederic Sowa
  1 sibling, 0 replies; 36+ messages in thread
From: Shrijeet Mukherjee @ 2015-06-08 19:51 UTC (permalink / raw)
  To: David Ahern
  Cc: Hannes Frederic Sowa, Nicolas Dichtel, Eric W. Biederman,
	Jamal Hadi Salim, David Miller, Stephen Hemminger, Netdev,
	Roopa Prabhu, Andy Gospodarek, Jon Toppins, Nikolay Aleksandrov

Good catch, as you know I used to have the device getting modified in
the RX path and that made it all work

generic ip_rcv will need a fix to make RX visible to tcpdump, but yes,
that is the goal.

On Mon, Jun 8, 2015 at 12:13 PM, David Ahern <dsahern@gmail.com> wrote:
> On 6/8/15 12:35 PM, Shrijeet Mukherjee wrote:
>>
>> 5. Debugging is built-in as tcpdump and counters on the VRF device
>>     works as is.
>
>
> Is the intent that something like this
>
>   tcpdump -i vrf0
>
> can be used to see vrf traffic?
>
> vrf_handle_frame only bumps counters; it does not switch skb->dev to the vrf
> device so for Rx path tcpdump will not get the packets. ie., tcpdump only
> shows outbound packets.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC net-next 3/3] rcv path changes for vrf traffic
  2015-06-08 18:35 ` [RFC net-next 3/3] rcv path changes for vrf traffic Shrijeet Mukherjee
@ 2015-06-08 19:58   ` Hannes Frederic Sowa
  2015-06-08 20:00     ` Hannes Frederic Sowa
                       ` (4 more replies)
  2015-06-10 18:31   ` Alexander Duyck
  1 sibling, 5 replies; 36+ messages in thread
From: Hannes Frederic Sowa @ 2015-06-08 19:58 UTC (permalink / raw)
  To: Shrijeet Mukherjee
  Cc: nicolas.dichtel, dsahern, ebiederm, hadi, davem, stephen, netdev,
	roopa, gospo, jtoppins, nikolay

Hi Shrijeet,

On Mo, 2015-06-08 at 11:35 -0700, Shrijeet Mukherjee wrote:
> From: Shrijeet Mukherjee <shm@cumulusnetworks.com>
> 
> Incoming frames for IP protocol stacks need the IIF to be changed
> from the actual interface to the VRF device. This allows the IIF
> rule to be used to select tables (or do regular PBR)
> 
> This change selects the iif to be the VRF device if it exists and
> the incoming iif is enslaved to the VRF device.
> 
> Since VRF aware sockets are always bound to the VRF device this
> system allows return traffic to find the socket of origin.
> 
> changes are in the arp_rcv, icmp_rcv and ip_rcv paths
> 
> Question : I did not wrap the rcv modifications, in CONFIG_NET_VRF
> as it would create code variations and the vrf_ptr check is there
> I can make that whole thing modular.

>From an architectural level I think the output path looks good. For the
input path I would also to propose my (I think) more flexible solution:

For rx layer I want to also propose my try:

[PATCH net-next RFC] net: ipv4: arp: strong end system model semantics by per-interface local table override

By allowing to direct routing table lookups to a specific table based
on the incoming interface for IPv4 and ARP, we start to behave like a
strong end host system without tweaking arp_* sysctl settings.

The main motivation behind this patch was input and forwarding support
in a VRF like model. Maybe it also helps for hardware offloading by
allowing reducing rule complexity.

An example:

$ ip rule flush
$ ip rule del
$ ip rule del
$ ip rule add inherit-table
0:      from all inherit-table

This by default still uses RT_TABLE_LOCAL until we set up per interface
route tables:

$ ip link set dev enp0s25 ipv4-rt-table-id 100
$ ip -d link ls dev enp0s25
2: enp0s25: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether e4:7f:b2:1b:4c:61 brd ff:ff:ff:ff:ff:ff promiscuity 0 ipv4-rt-table-id 100 addrgenmode none

This let's incoming and arp requests use routing table 100. The system
will stop responding to arp requests as we don't have any entries in
this routing table.

$ ip address add 192.168.88.223/24 dev enp0s25 table 100
$ ip -d address ls
2: enp0s25: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether e4:7f:b2:1b:4c:61 brd ff:ff:ff:ff:ff:ff promiscuity 0
    inet 192.168.88.223/24 scope global enp0s25 table 100
       valid_lft forever preferred_lft forever
$ ip route add 192.168.88.0/24 dev enp0s25 table 100
$ ip route add default via 192.168.88.1 table 100
$ ip route ls dev table 100
local 192.168.88.223 dev enp0s25  proto kernel  scope host  src 192.168.88.223
192.168.88.0/24 dev enp0s25  scope link
default via 192.168.88.1 dev enp0s25 proto static metric 600

Those changes direct arp lookups towards table 100 and the input route,
too. The local address is used for icmp source addresses and arp
replies. The connected route to steer icmp packets out of that interface.

This patch covers only the forwarding path.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 include/linux/inetdevice.h        | 19 ++++++++++++++++---
 include/net/flow.h                |  2 ++
 include/uapi/linux/fib_rules.h    |  1 +
 include/uapi/linux/if_addr.h      |  1 +
 include/uapi/linux/if_link.h      |  1 +
 net/core/fib_rules.c              | 12 +++++++++---
 net/ipv4/devinet.c                | 18 +++++++++++++++++-
 net/ipv4/fib_frontend.c           | 11 +++++++++--
 net/ipv4/fib_rules.c              |  7 ++++++-
 net/ipv4/fib_semantics.c          |  4 +++-
 net/ipv4/icmp.c                   |  1 +
 net/ipv4/netfilter/ipt_rpfilter.c |  1 +
 net/ipv4/route.c                  |  1 +
 13 files changed, 68 insertions(+), 11 deletions(-)

diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
index 0a21fbe..ed68f8e 100644
--- a/include/linux/inetdevice.h
+++ b/include/linux/inetdevice.h
@@ -25,19 +25,20 @@ struct in_device {
        atomic_t                refcnt;
        int                     dead;
        struct in_ifaddr        *ifa_list;      /* IP ifaddr chain              */
+       u32                     rt_table_id;
 
        struct ip_mc_list __rcu *mc_list;       /* IP multicast filter chain    */
        struct ip_mc_list __rcu * __rcu *mc_hash;
 
        int                     mc_count;       /* Number of installed mcasts   */
+       unsigned char           mr_qrv;
+       unsigned char           mr_gq_running;
+       unsigned char           mr_ifc_count;
        spinlock_t              mc_tomb_lock;
        struct ip_mc_list       *mc_tomb;
        unsigned long           mr_v1_seen;
        unsigned long           mr_v2_seen;
        unsigned long           mr_maxdelay;
-       unsigned char           mr_qrv;
-       unsigned char           mr_gq_running;
-       unsigned char           mr_ifc_count;
        struct timer_list       mr_gq_timer;    /* general query timer */
        struct timer_list       mr_ifc_timer;   /* interface change timer */
 
@@ -145,6 +146,7 @@ struct in_ifaddr {
        __u32                   ifa_preferred_lft;
        unsigned long           ifa_cstamp; /* created timestamp */
        unsigned long           ifa_tstamp; /* updated timestamp */
+       __u32                   ifa_rt_table; /* subnet route table */
 };
 
 int register_inetaddr_notifier(struct notifier_block *nb);
@@ -237,6 +239,17 @@ static inline void in_dev_put(struct in_device *idev)
 #define __in_dev_put(idev)  atomic_dec(&(idev)->refcnt)
 #define in_dev_hold(idev)   atomic_inc(&(idev)->refcnt)
 
+static inline u32 ipv4_idev_rt_table(const struct net_device *dev)
+{
+       u32 table_id;
+
+       rcu_read_lock();
+       table_id = __in_dev_get_rcu(dev)->rt_table_id;
+       rcu_read_unlock();
+
+       return table_id != RT_TABLE_UNSPEC ? table_id : RT_TABLE_LOCAL;
+}
+
 #endif /* __KERNEL__ */
 
 static __inline__ __be32 inet_make_mask(int logmask)
diff --git a/include/net/flow.h b/include/net/flow.h
index 8109a15..635e028 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -70,6 +70,8 @@ struct flowi4 {
        /* (saddr,daddr) must be grouped, same order as in IP header */
        __be32                  saddr;
        __be32                  daddr;
+       __u32                   rt_table_id;
+
 
        union flowi_uli         uli;
 #define fl4_sport              uli.ports.sport
diff --git a/include/uapi/linux/fib_rules.h b/include/uapi/linux/fib_rules.h
index 2b82d7e..da7c79a 100644
--- a/include/uapi/linux/fib_rules.h
+++ b/include/uapi/linux/fib_rules.h
@@ -64,6 +64,7 @@ enum {
        FR_ACT_BLACKHOLE,       /* Drop without notification */
        FR_ACT_UNREACHABLE,     /* Drop with ENETUNREACH */
        FR_ACT_PROHIBIT,        /* Drop with EACCES */
+       FR_ACT_TO_TBL_INHERIT_DEV,
        __FR_ACT_MAX,
 };
 
diff --git a/include/uapi/linux/if_addr.h b/include/uapi/linux/if_addr.h
index 4318ab1..af89016 100644
--- a/include/uapi/linux/if_addr.h
+++ b/include/uapi/linux/if_addr.h
@@ -32,6 +32,7 @@ enum {
        IFA_CACHEINFO,
        IFA_MULTICAST,
        IFA_FLAGS,
+       IFA_RT_TABLE,
        __IFA_MAX,
 };
 
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 1737b7a..7f4cdb2 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -163,6 +163,7 @@ enum {
 enum {
        IFLA_INET_UNSPEC,
        IFLA_INET_CONF,
+       IFLA_INET_RT_TABLE,
        __IFLA_INET_MAX,
 };
 
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 9a12668..2728873 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -556,11 +556,17 @@ static int fib_nl_fill_rule(struct sk_buff *skb, struct fib_rule *rule,
 
        frh = nlmsg_data(nlh);
        frh->family = ops->family;
-       frh->table = rule->table;
-       if (nla_put_u32(skb, FRA_TABLE, rule->table))
-               goto nla_put_failure;
+
+       /* table id is not valid if we inherit from interface */
+       if (rule->action != FR_ACT_TO_TBL_INHERIT_DEV) {
+               frh->table = rule->table;
+               if (nla_put_u32(skb, FRA_TABLE, rule->table))
+                       goto nla_put_failure;
+       }
+
        if (nla_put_u32(skb, FRA_SUPPRESS_PREFIXLEN, rule->suppress_prefixlen))
                goto nla_put_failure;
+
        frh->res1 = 0;
        frh->res2 = 0;
        frh->action = rule->action;
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 419d23c..91f074d 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -100,6 +100,7 @@ static const struct nla_policy ifa_ipv4_policy[IFA_MAX+1] = {
        [IFA_LABEL]             = { .type = NLA_STRING, .len = IFNAMSIZ - 1 },
        [IFA_CACHEINFO]         = { .len = sizeof(struct ifa_cacheinfo) },
        [IFA_FLAGS]             = { .type = NLA_U32 },
+       [IFA_RT_TABLE]          = { .type = NLA_U32 },
 };
 
 #define IN4_ADDR_HSIZE_SHIFT   8
@@ -244,6 +245,7 @@ static struct in_device *inetdev_init(struct net_device *dev)
                        sizeof(in_dev->cnf));
        in_dev->cnf.sysctl = NULL;
        in_dev->dev = dev;
+       in_dev->rt_table_id = RT_TABLE_UNSPEC;
        in_dev->arp_parms = neigh_parms_alloc(dev, &arp_tbl);
        if (!in_dev->arp_parms)
                goto out_kfree;
@@ -783,6 +785,11 @@ static struct in_ifaddr *rtm_to_ifaddr(struct net *net, struct nlmsghdr *nlh,
        if (!tb[IFA_ADDRESS])
                tb[IFA_ADDRESS] = tb[IFA_LOCAL];
 
+       if (tb[IFA_RT_TABLE])
+               ifa->ifa_rt_table = nla_get_u32(tb[IFA_RT_TABLE]);
+       else
+               ifa->ifa_rt_table = RT_TABLE_UNSPEC;
+
        INIT_HLIST_NODE(&ifa->hash);
        ifa->ifa_prefixlen = ifm->ifa_prefixlen;
        ifa->ifa_mask = inet_make_mask(ifm->ifa_prefixlen);
@@ -1549,6 +1556,7 @@ static int inet_fill_ifaddr(struct sk_buff *skb, struct in_ifaddr *ifa,
            (ifa->ifa_label[0] &&
             nla_put_string(skb, IFA_LABEL, ifa->ifa_label)) ||
            nla_put_u32(skb, IFA_FLAGS, ifa->ifa_flags) ||
+           nla_put_u32(skb, IFA_RT_TABLE, ifa->ifa_rt_table) ||
            put_cacheinfo(skb, ifa->ifa_cstamp, ifa->ifa_tstamp,
                          preferred, valid))
                goto nla_put_failure;
@@ -1652,7 +1660,8 @@ static size_t inet_get_link_af_size(const struct net_device *dev)
        if (!in_dev)
                return 0;
 
-       return nla_total_size(IPV4_DEVCONF_MAX * 4); /* IFLA_INET_CONF */
+       return nla_total_size(IPV4_DEVCONF_MAX * 4) +   /* IFLA_INET_CONF */
+              nla_total_size(sizeof(u32));             /* IFLA_INET_RT_TABLE */
 }
 
 static int inet_fill_link_af(struct sk_buff *skb, const struct net_device *dev)
@@ -1664,6 +1673,9 @@ static int inet_fill_link_af(struct sk_buff *skb, const struct net_device *dev)
        if (!in_dev)
                return -ENODATA;
 
+       if (nla_put_u32(skb, IFLA_INET_RT_TABLE, in_dev->rt_table_id) < 0)
+               return -EMSGSIZE;
+
        nla = nla_reserve(skb, IFLA_INET_CONF, IPV4_DEVCONF_MAX * 4);
        if (!nla)
                return -EMSGSIZE;
@@ -1676,6 +1688,7 @@ static int inet_fill_link_af(struct sk_buff *skb, const struct net_device *dev)
 
 static const struct nla_policy inet_af_policy[IFLA_INET_MAX+1] = {
        [IFLA_INET_CONF]        = { .type = NLA_NESTED },
+       [IFLA_INET_RT_TABLE]    = { .type = NLA_U32 },
 };
 
 static int inet_validate_link_af(const struct net_device *dev,
@@ -1723,6 +1736,9 @@ static int inet_set_link_af(struct net_device *dev, const struct nlattr *nla)
                        ipv4_devconf_set(in_dev, nla_type(a), nla_get_u32(a));
        }
 
+       if (tb[IFLA_INET_RT_TABLE])
+               in_dev->rt_table_id = nla_get_u32(tb[IFLA_INET_RT_TABLE]);
+
        return 0;
 }
 
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 872494e..56b2656 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -225,7 +225,7 @@ static inline unsigned int __inet_dev_addr_type(struct net *net,
 
        rcu_read_lock();
 
-       local_table = fib_get_table(net, RT_TABLE_LOCAL);
+       local_table = fib_get_table(net, dev ? ipv4_idev_rt_table(dev) : RT_TABLE_LOCAL);
        if (local_table) {
                ret = RTN_UNICAST;
                if (!fib_table_lookup(local_table, &fl4, &res, FIB_LOOKUP_NOREF)) {
@@ -277,6 +277,7 @@ __be32 fib_compute_spec_dst(struct sk_buff *skb)
                fl4.flowi4_iif = LOOPBACK_IFINDEX;
                fl4.daddr = ip_hdr(skb)->saddr;
                fl4.saddr = 0;
+               fl4.rt_table_id = ipv4_idev_rt_table(dev);
                fl4.flowi4_tos = RT_TOS(ip_hdr(skb)->tos);
                fl4.flowi4_scope = scope;
                fl4.flowi4_mark = IN_DEV_SRC_VMARK(in_dev) ? skb->mark : 0;
@@ -311,6 +312,7 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
        fl4.flowi4_iif = oif ? : LOOPBACK_IFINDEX;
        fl4.daddr = src;
        fl4.saddr = dst;
+       fl4.rt_table_id =  ipv4_idev_rt_table(dev);
        fl4.flowi4_tos = tos;
        fl4.flowi4_scope = RT_SCOPE_UNIVERSE;
 
@@ -774,7 +776,12 @@ static void fib_magic(int cmd, int type, __be32 dst, int dst_len, struct in_ifad
                },
        };
 
-       if (type == RTN_UNICAST)
+       /* if ifa_rt_table is different from default RT_TABLE_LOCAL
+        * use its value for all types of routes
+        */
+       if (ifa->ifa_rt_table != RT_TABLE_UNSPEC)
+               tb = fib_new_table(net, ifa->ifa_rt_table);
+       else if (type == RTN_UNICAST)
                tb = fib_new_table(net, RT_TABLE_MAIN);
        else
                tb = fib_new_table(net, RT_TABLE_LOCAL);
diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c
index 5615198..acb415c 100644
--- a/net/ipv4/fib_rules.c
+++ b/net/ipv4/fib_rules.c
@@ -75,9 +75,14 @@ static int fib4_rule_action(struct fib_rule *rule, struct flowi *flp,
 {
        int err = -EAGAIN;
        struct fib_table *tbl;
+       u32 table;
 
        switch (rule->action) {
+       case FR_ACT_TO_TBL_INHERIT_DEV:
+               table = flp->u.ip4.rt_table_id;
+               break;
        case FR_ACT_TO_TBL:
+               table = rule->table;
                break;
 
        case FR_ACT_UNREACHABLE:
@@ -93,7 +98,7 @@ static int fib4_rule_action(struct fib_rule *rule, struct flowi *flp,
 
        rcu_read_lock();
 
-       tbl = fib_get_table(rule->fr_net, rule->table);
+       tbl = fib_get_table(rule->fr_net, table);
        if (tbl)
                err = fib_table_lookup(tbl, &flp->u.ip4,
                                       (struct fib_result *)arg->result,
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 28ec3c1..afb0011 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -587,7 +587,7 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 {
        int err;
        struct net *net;
-       struct net_device *dev;
+       struct net_device *dev = NULL;
 
        net = cfg->fc_nlinfo.nl_net;
        if (nh->nh_gw) {
@@ -616,6 +616,8 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
                                .flowi4_scope = cfg->fc_scope + 1,
                                .flowi4_oif = nh->nh_oif,
                                .flowi4_iif = LOOPBACK_IFINDEX,
+                               .rt_table_id = dev ? ipv4_idev_rt_table(dev)
+                                              : RT_TABLE_LOCAL,
                        };
 
                        /* It is not necessary, but requires a bit of thinking */
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index f5203fb..36952c8 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -425,6 +425,7 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
        fl4.flowi4_mark = mark;
        fl4.flowi4_tos = RT_TOS(ip_hdr(skb)->tos);
        fl4.flowi4_proto = IPPROTO_ICMP;
+       fl4.rt_table_id = ipv4_idev_rt_table(skb->dev);
        security_skb_classify_flow(skb, flowi4_to_flowi(&fl4));
        rt = ip_route_output_key(net, &fl4);
        if (IS_ERR(rt))
diff --git a/net/ipv4/netfilter/ipt_rpfilter.c b/net/ipv4/netfilter/ipt_rpfilter.c
index 4bfaedf..c7c1407 100644
--- a/net/ipv4/netfilter/ipt_rpfilter.c
+++ b/net/ipv4/netfilter/ipt_rpfilter.c
@@ -93,6 +93,7 @@ static bool rpfilter_mt(const struct sk_buff *skb, struct xt_action_param *par)
        flow.flowi4_iif = LOOPBACK_IFINDEX;
        flow.daddr = iph->saddr;
        flow.saddr = rpfilter_get_saddr(iph->daddr);
+       flow.rt_table_id = ipv4_idev_rt_table(skb->dev);
        flow.flowi4_oif = 0;
        flow.flowi4_mark = info->flags & XT_RPFILTER_VALID_MARK ? skb->mark : 0;
        flow.flowi4_tos = RT_TOS(iph->tos);
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index f605598..eec1908 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1716,6 +1716,7 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
        fl4.flowi4_scope = RT_SCOPE_UNIVERSE;
        fl4.daddr = daddr;
        fl4.saddr = saddr;
+       fl4.rt_table_id = ipv4_idev_rt_table(dev);
        err = fib_lookup(net, &fl4, &res);
        if (err != 0) {
                if (!IN_DEV_FORWARD(in_dev))
-- 
2.4.2

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [RFC net-next 3/3] rcv path changes for vrf traffic
  2015-06-08 19:58   ` Hannes Frederic Sowa
@ 2015-06-08 20:00     ` Hannes Frederic Sowa
  2015-06-08 20:22     ` Shrijeet Mukherjee
                       ` (3 subsequent siblings)
  4 siblings, 0 replies; 36+ messages in thread
From: Hannes Frederic Sowa @ 2015-06-08 20:00 UTC (permalink / raw)
  To: Shrijeet Mukherjee
  Cc: nicolas.dichtel, dsahern, ebiederm, hadi, davem, stephen, netdev,
	roopa, gospo, jtoppins, nikolay

On Mo, 2015-06-08 at 21:58 +0200, Hannes Frederic Sowa wrote:
> Hi Shrijeet,
> 
> On Mo, 2015-06-08 at 11:35 -0700, Shrijeet Mukherjee wrote:
> > From: Shrijeet Mukherjee <shm@cumulusnetworks.com>
> > 
> > Incoming frames for IP protocol stacks need the IIF to be changed
> > from the actual interface to the VRF device. This allows the IIF
> > rule to be used to select tables (or do regular PBR)
> > 
> > This change selects the iif to be the VRF device if it exists and
> > the incoming iif is enslaved to the VRF device.
> > 
> > Since VRF aware sockets are always bound to the VRF device this
> > system allows return traffic to find the socket of origin.
> > 
> > changes are in the arp_rcv, icmp_rcv and ip_rcv paths
> > 
> > Question : I did not wrap the rcv modifications, in CONFIG_NET_VRF
> > as it would create code variations and the vrf_ptr check is there
> > I can make that whole thing modular.
> 
> From an architectural level I think the output path looks good. For 
> the
> input path I would also to propose my (I think) more flexible 
> solution:
> 
> For rx layer I want to also propose my try:
> 
> [PATCH net-next RFC] net: ipv4: arp: strong end system model 
> semantics by per-interface local table override
> 
> By allowing to direct routing table lookups to a specific table based
> on the incoming interface for IPv4 and ARP, we start to behave like a
> strong end host system without tweaking arp_* sysctl settings.
> 
> The main motivation behind this patch was input and forwarding 
> support
> in a VRF like model. Maybe it also helps for hardware offloading by
> allowing reducing rule complexity.
> 
> An example:
> 
> $ ip rule flush
> $ ip rule del
> $ ip rule del
> $ ip rule add inherit-table
> 0:      from all inherit-table
> 
> This by default still uses RT_TABLE_LOCAL until we set up per 
> interface
> route tables:
> 
> $ ip link set dev enp0s25 ipv4-rt-table-id 100
> $ ip -d link ls dev enp0s25
> 2: enp0s25: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel 
> state UP mode DEFAULT group default qlen 1000
>     link/ether e4:7f:b2:1b:4c:61 brd ff:ff:ff:ff:ff:ff promiscuity 0 
> ipv4-rt-table-id 100 addrgenmode none
> 
> This let's incoming and arp requests use routing table 100. The 
> system
> will stop responding to arp requests as we don't have any entries in
> this routing table.
> 
> $ ip address add 192.168.88.223/24 dev enp0s25 table 100
> $ ip -d address ls
> 2: enp0s25: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel 
> state UP group default qlen 1000
>     link/ether e4:7f:b2:1b:4c:61 brd ff:ff:ff:ff:ff:ff promiscuity 0
>     inet 192.168.88.223/24 scope global enp0s25 table 100
>        valid_lft forever preferred_lft forever
> $ ip route add 192.168.88.0/24 dev enp0s25 table 100
> $ ip route add default via 192.168.88.1 table 100
> $ ip route ls dev table 100
> local 192.168.88.223 dev enp0s25  proto kernel  scope host  src 
> 192.168.88.223
> 192.168.88.0/24 dev enp0s25  scope link
> default via 192.168.88.1 dev enp0s25 proto static metric 600
> 
> Those changes direct arp lookups towards table 100 and the input 
> route,
> too. The local address is used for icmp source addresses and arp
> replies. The connected route to steer icmp packets out of that 
> interface.
> 
> This patch covers only the forwarding path.

The iproute2 patch is currently here:

https://github.com/hannes/iproute2/commits/vrf

Bye,
Hannes

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC net-next 2/3] VRF driver and needed infrastructure
  2015-06-08 18:35 ` [RFC net-next 2/3] VRF driver and needed infrastructure Shrijeet Mukherjee
  2015-06-08 19:08   ` David Ahern
@ 2015-06-08 20:17   ` Hannes Frederic Sowa
  2015-06-09  9:19   ` Nicolas Dichtel
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 36+ messages in thread
From: Hannes Frederic Sowa @ 2015-06-08 20:17 UTC (permalink / raw)
  To: Shrijeet Mukherjee
  Cc: nicolas.dichtel, dsahern, ebiederm, hadi, davem, stephen, netdev,
	roopa, gospo, jtoppins, nikolay

Hi,

On Mo, 2015-06-08 at 11:35 -0700, Shrijeet Mukherjee wrote:
> From: Shrijeet Mukherjee <shm@cumulusnetworks.com>
> 
> This driver borrows heavily from IPvlan and teaming drivers.
> 
> Routing domains (VRF-lite) are created by instantiating a device
> and enslaving all routed interfaces that participate in the domain.
> As part of the enslavement, all local routes pointing to enslaved
> devices are re-pointed to the vrf device, thus forcing outgoing
> sockets to bind to the vrf to function.
> 
> Standard FIB rules can then bind the VRF device to tables and regular
> fib rule processing is followed.
> 
> Routed traffic through the box, is fwded by using the VRF device as
> the IIF and following the IIF rule to a table which is mated with
> the VRF.
> 
> Locally originated traffic is directed at the VRF device using
> SO_BINDTODEVICE or cmsg headers. This in turn drops the packet into
> the xmit function of the vrf driver, which then completes the ip lookup
> and output.
> 
> This solution is completely orthogonal to namespaces and allow the L3
> equivalent of vlans to exist allowing the routing space to be
> partitioned.
> 
> Example use is
>    ip link add vrf0 type vrf table 5
>    ip link set eth1 master vrf0
>    ip link set vrf0 up
> 
>    ip rule add iif vrf0 table 5
>    ip rule add oif vrf0 table 5
> 
> TODO:
> This changeset is for IPv4 only
> Connected route management can be made much better, but is deferred to
> user space for now.

One thing that got lost is that we should prohibit user space applications
to bind to devices which are vrf interfaces without having CAP_NET_ADMIN
capability, so user space programs can be in future restricted to a specific
VRF.

Bye,
Hannes

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC net-next 3/3] rcv path changes for vrf traffic
  2015-06-08 19:58   ` Hannes Frederic Sowa
  2015-06-08 20:00     ` Hannes Frederic Sowa
@ 2015-06-08 20:22     ` Shrijeet Mukherjee
  2015-06-08 20:33       ` Hannes Frederic Sowa
  2015-06-08 22:05     ` David Miller
                       ` (2 subsequent siblings)
  4 siblings, 1 reply; 36+ messages in thread
From: Shrijeet Mukherjee @ 2015-06-08 20:22 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Nicolas Dichtel, David Ahern, Eric W. Biederman,
	Jamal Hadi Salim, David Miller, Stephen Hemminger, Netdev,
	Roopa Prabhu, Andy Gospodarek, Jon Toppins, Nikolay Aleksandrov

On Mon, Jun 8, 2015 at 12:58 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> Hi Shrijeet,
>
> From an architectural level I think the output path looks good. For the
> input path I would also to propose my (I think) more flexible solution:
>
> For rx layer I want to also propose my try:
>
> [PATCH net-next RFC] net: ipv4: arp: strong end system model semantics by per-interface local table override
>
> By allowing to direct routing table lookups to a specific table based
> on the incoming interface for IPv4 and ARP, we start to behave like a
> strong end host system without tweaking arp_* sysctl settings.
>
> The main motivation behind this patch was input and forwarding support
> in a VRF like model. Maybe it also helps for hardware offloading by
> allowing reducing rule complexity.
>
> An example:
>
> $ ip rule flush
> $ ip rule del
> $ ip rule del
> $ ip rule add inherit-table
> 0:      from all inherit-table
>
> This by default still uses RT_TABLE_LOCAL until we set up per interface
> route tables:
>
> $ ip link set dev enp0s25 ipv4-rt-table-id 100
> $ ip -d link ls dev enp0s25
> 2: enp0s25: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
>     link/ether e4:7f:b2:1b:4c:61 brd ff:ff:ff:ff:ff:ff promiscuity 0 ipv4-rt-table-id 100 addrgenmode none
>
> This let's incoming and arp requests use routing table 100. The system
> will stop responding to arp requests as we don't have any entries in
> this routing table.


I like this model in general, as it addresses the issue that I have
not addressed around connected routes.

This would force local and directly connected host routes to be learnt
into the correct table.

It does bring the question up then.

1. The driver already knows the vrf device to table map
2. If the device also knows the final device to table map

then do we need to use fib_rules and just lookup the table directly.
It does make the configuration a little longer since each component
device now needs configuration when you add/del a member from a vrf.

If people generally agree and we want to skip the fib_rule lookup,
then I can make it such that enslaving already takes the dev-table id
as well, and then the process of enslaving in the nominal VRF case
becomes

ip link add vrf-dev type vrf table foo ipv4-rt-table-id bar
ip link set eth2 master vrf-dev

Does that work ?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC net-next 3/3] rcv path changes for vrf traffic
  2015-06-08 20:22     ` Shrijeet Mukherjee
@ 2015-06-08 20:33       ` Hannes Frederic Sowa
  2015-06-08 22:44         ` Shrijeet Mukherjee
  0 siblings, 1 reply; 36+ messages in thread
From: Hannes Frederic Sowa @ 2015-06-08 20:33 UTC (permalink / raw)
  To: Shrijeet Mukherjee
  Cc: Nicolas Dichtel, David Ahern, Eric W. Biederman,
	Jamal Hadi Salim, David Miller, Stephen Hemminger, Netdev,
	Roopa Prabhu, Andy Gospodarek, Jon Toppins, Nikolay Aleksandrov

On Mon, Jun 8, 2015, at 22:22, Shrijeet Mukherjee wrote:
> On Mon, Jun 8, 2015 at 12:58 PM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
> > Hi Shrijeet,
> >
> > From an architectural level I think the output path looks good. For the
> > input path I would also to propose my (I think) more flexible solution:
> >
> > For rx layer I want to also propose my try:
> >
> > [PATCH net-next RFC] net: ipv4: arp: strong end system model semantics by per-interface local table override
> >
> > By allowing to direct routing table lookups to a specific table based
> > on the incoming interface for IPv4 and ARP, we start to behave like a
> > strong end host system without tweaking arp_* sysctl settings.
> >
> > The main motivation behind this patch was input and forwarding support
> > in a VRF like model. Maybe it also helps for hardware offloading by
> > allowing reducing rule complexity.
> >
> > An example:
> >
> > $ ip rule flush
> > $ ip rule del
> > $ ip rule del
> > $ ip rule add inherit-table
> > 0:      from all inherit-table
> >
> > This by default still uses RT_TABLE_LOCAL until we set up per interface
> > route tables:
> >
> > $ ip link set dev enp0s25 ipv4-rt-table-id 100
> > $ ip -d link ls dev enp0s25
> > 2: enp0s25: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
> >     link/ether e4:7f:b2:1b:4c:61 brd ff:ff:ff:ff:ff:ff promiscuity 0 ipv4-rt-table-id 100 addrgenmode none
> >
> > This let's incoming and arp requests use routing table 100. The system
> > will stop responding to arp requests as we don't have any entries in
> > this routing table.
> 
> 
> I like this model in general, as it addresses the issue that I have
> not addressed around connected routes.
> 
> This would force local and directly connected host routes to be learnt
> into the correct table.
> 
> It does bring the question up then.
> 
> 1. The driver already knows the vrf device to table map
> 2. If the device also knows the final device to table map
> 
> then do we need to use fib_rules and just lookup the table directly.
> It does make the configuration a little longer since each component
> device now needs configuration when you add/del a member from a vrf.

This model is usable on its own, especially if one does not need routing
daemons
or user space software dealing with VRFs and sending out packets.

> If people generally agree and we want to skip the fib_rule lookup,
> then I can make it such that enslaving already takes the dev-table id
> as well, and then the process of enslaving in the nominal VRF case
> becomes
> 
> ip link add vrf-dev type vrf table foo ipv4-rt-table-id bar
> ip link set eth2 master vrf-dev

I think this would be great.

Last time I looked into the patches it was not yet clear if we can do
that
without holding strong references to the other interfaces. Hopefully
this can
be done by just passing down the table ids to the slaves during
initializing
and teardown of the master vrf interface.

Bye,
Hannes

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC net-next 0/3] Proposal for VRF-lite
  2015-06-08 19:13 ` [RFC net-next 0/3] Proposal for VRF-lite David Ahern
  2015-06-08 19:51   ` Shrijeet Mukherjee
@ 2015-06-08 20:41   ` Hannes Frederic Sowa
  1 sibling, 0 replies; 36+ messages in thread
From: Hannes Frederic Sowa @ 2015-06-08 20:41 UTC (permalink / raw)
  To: David Ahern, Shrijeet Mukherjee, Nicolas Dichtel,
	Eric W. Biederman, Jamal Hadi Salim, davem, Stephen Hemminger,
	netdev
  Cc: roopa, andy gospodarek, jtoppins, nikolay

On Mon, Jun 8, 2015, at 21:13, David Ahern wrote:
> On 6/8/15 12:35 PM, Shrijeet Mukherjee wrote:
> > 5. Debugging is built-in as tcpdump and counters on the VRF device
> >     works as is.
> 
> Is the intent that something like this
> 
>    tcpdump -i vrf0
> 
> can be used to see vrf traffic?
> 
> vrf_handle_frame only bumps counters; it does not switch skb->dev to the 
> vrf device so for Rx path tcpdump will not get the packets. ie., tcpdump 
> only shows outbound packets.

My hope initially was that the vrf interface type would be as slim as
possible. I
am not even sure if we need packet counters, as one could easily have
user
space handle that by looking up the relations and accumulating them.
Same
for VRF traffic.

But the current model does allow to add support for that easily, so why
not? It
depends on how far we can and want to move parts of the logic into the
core
stack in the end.

Would you see this as a requirement?

Thanks,
Hannes

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC net-next 3/3] rcv path changes for vrf traffic
  2015-06-08 19:58   ` Hannes Frederic Sowa
  2015-06-08 20:00     ` Hannes Frederic Sowa
  2015-06-08 20:22     ` Shrijeet Mukherjee
@ 2015-06-08 22:05     ` David Miller
  2015-06-08 22:13       ` Hannes Frederic Sowa
  2015-06-09  0:36     ` David Ahern
  2015-06-09  1:03     ` David Ahern
  4 siblings, 1 reply; 36+ messages in thread
From: David Miller @ 2015-06-08 22:05 UTC (permalink / raw)
  To: hannes
  Cc: shm, nicolas.dichtel, dsahern, ebiederm, hadi, stephen, netdev,
	roopa, gospo, jtoppins, nikolay

From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Mon, 08 Jun 2015 21:58:37 +0200

> +static inline u32 ipv4_idev_rt_table(const struct net_device *dev)
> +{
> +       u32 table_id;
> +
> +       rcu_read_lock();
> +       table_id = __in_dev_get_rcu(dev)->rt_table_id;
> +       rcu_read_unlock();
> +
> +       return table_id != RT_TABLE_UNSPEC ? table_id : RT_TABLE_LOCAL;
> +}

It's a real shame you have to do all of this RCU locking and inetdev
deref, because in more than half of the call sites the idev is already
available.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC net-next 3/3] rcv path changes for vrf traffic
  2015-06-08 22:05     ` David Miller
@ 2015-06-08 22:13       ` Hannes Frederic Sowa
  2015-06-08 22:21         ` David Miller
  0 siblings, 1 reply; 36+ messages in thread
From: Hannes Frederic Sowa @ 2015-06-08 22:13 UTC (permalink / raw)
  To: David Miller
  Cc: Shrijeet Mukherjee, Nicolas Dichtel, David Ahern,
	Eric W. Biederman, Jamal Hadi Salim, Stephen Hemminger, netdev,
	roopa, andy gospodarek, jtoppins, nikolay

On Tue, Jun 9, 2015, at 00:05, David Miller wrote:
> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Mon, 08 Jun 2015 21:58:37 +0200
> 
> > +static inline u32 ipv4_idev_rt_table(const struct net_device *dev)
> > +{
> > +       u32 table_id;
> > +
> > +       rcu_read_lock();
> > +       table_id = __in_dev_get_rcu(dev)->rt_table_id;
> > +       rcu_read_unlock();
> > +
> > +       return table_id != RT_TABLE_UNSPEC ? table_id : RT_TABLE_LOCAL;
> > +}
> 
> It's a real shame you have to do all of this RCU locking and inetdev
> deref, because in more than half of the call sites the idev is already
> available.

I agree, I was not happy with that either.

It is easy to move the rt_table_id to net_device and use the same one
for IPv6.
This would force people to build symmetric routing configurations. I was
striving for
maximum flexibility first but I don't really think this matters here.

Bye,
Hannes

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC net-next 3/3] rcv path changes for vrf traffic
  2015-06-08 22:13       ` Hannes Frederic Sowa
@ 2015-06-08 22:21         ` David Miller
  0 siblings, 0 replies; 36+ messages in thread
From: David Miller @ 2015-06-08 22:21 UTC (permalink / raw)
  To: hannes
  Cc: shm, nicolas.dichtel, dsahern, ebiederm, hadi, stephen, netdev,
	roopa, gospo, jtoppins, nikolay

From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Tue, 09 Jun 2015 00:13:08 +0200

> It is easy to move the rt_table_id to net_device and use the same
> one for IPv6.  This would force people to build symmetric routing
> configurations. I was striving for maximum flexibility first but I
> don't really think this matters here.

Alternatively you could have __ipv4_idev_rt_table(idev) and implement
ipv4_idev_rt_table() in terms of that.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC net-next 3/3] rcv path changes for vrf traffic
  2015-06-08 20:33       ` Hannes Frederic Sowa
@ 2015-06-08 22:44         ` Shrijeet Mukherjee
  2015-06-09  5:41           ` Hannes Frederic Sowa
  0 siblings, 1 reply; 36+ messages in thread
From: Shrijeet Mukherjee @ 2015-06-08 22:44 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Nicolas Dichtel, David Ahern, Eric W. Biederman,
	Jamal Hadi Salim, David Miller, Stephen Hemminger, Netdev,
	Roopa Prabhu, Andy Gospodarek, Jon Toppins, Nikolay Aleksandrov

On Mon, Jun 8, 2015 at 1:33 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> On Mon, Jun 8, 2015, at 22:22, Shrijeet Mukherjee wrote:
>> On Mon, Jun 8, 2015 at 12:58 PM, Hannes Frederic Sowa
>> <hannes@stressinduktion.org> wrote:
>> > Hi Shrijeet,
>> >
>> > This let's incoming and arp requests use routing table 100. The system
>> > will stop responding to arp requests as we don't have any entries in
>> > this routing table.
>>
>>
>> I like this model in general, as it addresses the issue that I have
>> not addressed around connected routes.
>>
>> This would force local and directly connected host routes to be learnt
>> into the correct table.
>>
>> It does bring the question up then.
>>
>> 1. The driver already knows the vrf device to table map
>> 2. If the device also knows the final device to table map
>>
>> then do we need to use fib_rules and just lookup the table directly.
>> It does make the configuration a little longer since each component
>> device now needs configuration when you add/del a member from a vrf.
>
> This model is usable on its own, especially if one does not need routing
> daemons
> or user space software dealing with VRFs and sending out packets.
>
>> If people generally agree and we want to skip the fib_rule lookup,
>> then I can make it such that enslaving already takes the dev-table id
>> as well, and then the process of enslaving in the nominal VRF case
>> becomes
>>
>> ip link add vrf-dev type vrf table foo ipv4-rt-table-id bar
>> ip link set eth2 master vrf-dev
>
> I think this would be great.
>
> Last time I looked into the patches it was not yet clear if we can do
> that
> without holding strong references to the other interfaces. Hopefully
> this can
> be done by just passing down the table ids to the slaves during
> initializing
> and teardown of the master vrf interface.
>
> Bye,
> Hannes

We can do that, and the hooks are all available. But do we want to cut
out the fib_rules ? this would close out the opportunity for someone
to insert a fib_rule to override the rule which directs to a VRF
device.

Personally don't have a strong opinion, but want to make sure we
understand that choice.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC net-next 3/3] rcv path changes for vrf traffic
  2015-06-08 19:58   ` Hannes Frederic Sowa
                       ` (2 preceding siblings ...)
  2015-06-08 22:05     ` David Miller
@ 2015-06-09  0:36     ` David Ahern
  2015-06-09  1:03     ` David Ahern
  4 siblings, 0 replies; 36+ messages in thread
From: David Ahern @ 2015-06-09  0:36 UTC (permalink / raw)
  To: Hannes Frederic Sowa, Shrijeet Mukherjee
  Cc: nicolas.dichtel, ebiederm, hadi, davem, stephen, netdev, roopa,
	gospo, jtoppins, nikolay

On 6/8/15 1:58 PM, Hannes Frederic Sowa wrote:
> Hi Shrijeet,
>
> On Mo, 2015-06-08 at 11:35 -0700, Shrijeet Mukherjee wrote:
>> From: Shrijeet Mukherjee <shm@cumulusnetworks.com>
>>
>> Incoming frames for IP protocol stacks need the IIF to be changed
>> from the actual interface to the VRF device. This allows the IIF
>> rule to be used to select tables (or do regular PBR)
>>
>> This change selects the iif to be the VRF device if it exists and
>> the incoming iif is enslaved to the VRF device.
>>
>> Since VRF aware sockets are always bound to the VRF device this
>> system allows return traffic to find the socket of origin.
>>
>> changes are in the arp_rcv, icmp_rcv and ip_rcv paths
>>
>> Question : I did not wrap the rcv modifications, in CONFIG_NET_VRF
>> as it would create code variations and the vrf_ptr check is there
>> I can make that whole thing modular.
>
>  From an architectural level I think the output path looks good. For the
> input path I would also to propose my (I think) more flexible solution:
>

Something is still not right on the output path. e.g., I see the wrong 
source address showing up on ping -I vrf0:

# ping -I vrf0 1.1.1.254
ping: Warning: source address might be selected on device other than vrf0.
PING 1.1.1.254 (1.1.1.254) from 172.16.1.52 vrf0: 56(84) bytes of data.
64 bytes from 1.1.1.254: icmp_seq=1 ttl=64 time=0.215 ms
...

The reason is because the datagram connect function fails to look up the 
outbound route in the vrf and falls back to the main table. (As an aside 
the fallback to other tables is something that should not be happening 
for VRFs; you want to use the table specific to the VRF.)

The route lookup fails because it passes in oif = vrf device (this VRF 
design relies on bind to device which sets oif in the flow). That is 
good for selecting the table to use for the lookups, but not good for 
selecting the route within the table.

This is one way to fix the connect problem:

diff --git a/include/net/route.h b/include/net/route.h
index fe22d03afb6a..a18798caec25 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -245,11 +245,18 @@ static inline void ip_route_connect_init(struct 
flowi4 *fl4, __be32 dst, __be32
                      __be16 sport, __be16 dport,
                      struct sock *sk)
  {
+   struct net_device *dev = dev_get_by_index(sock_net(sk), oif);
     __u8 flow_flags = 0;

     if (inet_sk(sk)->transparent)
         flow_flags |= FLOWI_FLAG_ANYSRC;

+   if (dev) {
+       if (netif_is_vrf(dev))
+           flow_flags |= FLOWI_FLAG_VRFSRC;
+       dev_put(dev);
+   }
+
     flowi4_init_output(fl4, oif, sk->sk_mark, tos, RT_SCOPE_UNIVERSE,
                protocol, flow_flags, dst, src, dport, sport);
  }


which essentially tells fib_table_lookup to drop the OIF comparison 
after selecting the table per this change made in the patch Shrijeet posted:

                         if (!(flp->flowi4_flags & FLOWI_FLAG_VRFSRC)) {
                                 if (flp->flowi4_oif &&
                                     flp->flowi4_oif != nh->nh_oif)
                                         continue;
                         }

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [RFC net-next 3/3] rcv path changes for vrf traffic
  2015-06-08 19:58   ` Hannes Frederic Sowa
                       ` (3 preceding siblings ...)
  2015-06-09  0:36     ` David Ahern
@ 2015-06-09  1:03     ` David Ahern
  2015-06-09  5:35       ` Hannes Frederic Sowa
  4 siblings, 1 reply; 36+ messages in thread
From: David Ahern @ 2015-06-09  1:03 UTC (permalink / raw)
  To: Hannes Frederic Sowa, Shrijeet Mukherjee
  Cc: nicolas.dichtel, ebiederm, hadi, davem, stephen, netdev, roopa,
	gospo, jtoppins, nikolay

On 6/8/15 1:58 PM, Hannes Frederic Sowa wrote:
> For rx layer I want to also propose my try:
>
> [PATCH net-next RFC] net: ipv4: arp: strong end system model semantics by per-interface local table override
>

I applied only the first 2 patches from Shrijeet and then tried to apply 
your patch; it doesn't apply. Way too many failures. What branch should 
it apply too?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC net-next 3/3] rcv path changes for vrf traffic
  2015-06-09  1:03     ` David Ahern
@ 2015-06-09  5:35       ` Hannes Frederic Sowa
  0 siblings, 0 replies; 36+ messages in thread
From: Hannes Frederic Sowa @ 2015-06-09  5:35 UTC (permalink / raw)
  To: David Ahern, Shrijeet Mukherjee
  Cc: Nicolas Dichtel, Eric W. Biederman, Jamal Hadi Salim, davem,
	Stephen Hemminger, netdev, roopa, andy gospodarek, jtoppins,
	nikolay

Hi,

On Tue, Jun 9, 2015, at 03:03, David Ahern wrote:
> On 6/8/15 1:58 PM, Hannes Frederic Sowa wrote:
> > For rx layer I want to also propose my try:
> >
> > [PATCH net-next RFC] net: ipv4: arp: strong end system model semantics by per-interface local table override
> >
> 
> I applied only the first 2 patches from Shrijeet and then tried to apply 
> your patch; it doesn't apply. Way too many failures. What branch should 
> it apply too?

The patch is currently stand-alone and should apply ontop of net-next
commit 6da8253bdd3945b81377e4908d6d395a9956f8af
Author: Florian Fainelli <f.fainelli@gmail.com>
Date:   Mon Jun 8 11:05:20 2015 -0700

net: phy: bcm7xxx: update workaround to fix 100BaseT corner cases

Shrijeet and me will consolidate that soon.

Bye,
Hannes

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC net-next 3/3] rcv path changes for vrf traffic
  2015-06-08 22:44         ` Shrijeet Mukherjee
@ 2015-06-09  5:41           ` Hannes Frederic Sowa
  0 siblings, 0 replies; 36+ messages in thread
From: Hannes Frederic Sowa @ 2015-06-09  5:41 UTC (permalink / raw)
  To: Shrijeet Mukherjee
  Cc: Nicolas Dichtel, David Ahern, Eric W. Biederman,
	Jamal Hadi Salim, David Miller, Stephen Hemminger, Netdev,
	Roopa Prabhu, Andy Gospodarek, Jon Toppins, Nikolay Aleksandrov

On Tue, Jun 9, 2015, at 00:44, Shrijeet Mukherjee wrote:
> On Mon, Jun 8, 2015 at 1:33 PM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
> > On Mon, Jun 8, 2015, at 22:22, Shrijeet Mukherjee wrote:
> >> On Mon, Jun 8, 2015 at 12:58 PM, Hannes Frederic Sowa
> >> <hannes@stressinduktion.org> wrote:
> >> > Hi Shrijeet,
> >> >
> >> > This let's incoming and arp requests use routing table 100. The system
> >> > will stop responding to arp requests as we don't have any entries in
> >> > this routing table.
> >>
> >>
> >> I like this model in general, as it addresses the issue that I have
> >> not addressed around connected routes.
> >>
> >> This would force local and directly connected host routes to be learnt
> >> into the correct table.
> >>
> >> It does bring the question up then.
> >>
> >> 1. The driver already knows the vrf device to table map
> >> 2. If the device also knows the final device to table map
> >>
> >> then do we need to use fib_rules and just lookup the table directly.
> >> It does make the configuration a little longer since each component
> >> device now needs configuration when you add/del a member from a vrf.
> >
> > This model is usable on its own, especially if one does not need routing
> > daemons
> > or user space software dealing with VRFs and sending out packets.
> >
> >> If people generally agree and we want to skip the fib_rule lookup,
> >> then I can make it such that enslaving already takes the dev-table id
> >> as well, and then the process of enslaving in the nominal VRF case
> >> becomes
> >>
> >> ip link add vrf-dev type vrf table foo ipv4-rt-table-id bar
> >> ip link set eth2 master vrf-dev
> >
> > I think this would be great.
> >
> > Last time I looked into the patches it was not yet clear if we can do
> > that
> > without holding strong references to the other interfaces. Hopefully
> > this can
> > be done by just passing down the table ids to the slaves during
> > initializing
> > and teardown of the master vrf interface.
> >
> > Bye,
> > Hannes
> 
> We can do that, and the hooks are all available. But do we want to cut
> out the fib_rules ? this would close out the opportunity for someone
> to insert a fib_rule to override the rule which directs to a VRF
> device.
> 
> Personally don't have a strong opinion, but want to make sure we
> understand that choice.

Hmm, wouldn't that still work with a target I added in my patch?

The only problem I see is that people might build up rules which are not
symmetric and thus vrf behavior differs from input and output path. One
addition to ease this is to add a interface selector which matches on
both, iif and oif. Also we must still keep in mind that rules are
matched linearly by using a list, hundreds of vrfs would thus first have
match hundreds of ip rules.

Bye,
Hannes

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC net-next 0/3] Proposal for VRF-lite
  2015-06-08 18:35 [RFC net-next 0/3] Proposal for VRF-lite Shrijeet Mukherjee
                   ` (4 preceding siblings ...)
  2015-06-08 19:13 ` [RFC net-next 0/3] Proposal for VRF-lite David Ahern
@ 2015-06-09  8:58 ` Nicolas Dichtel
  2015-06-09 14:21   ` David Ahern
  2015-06-09 10:15 ` Thomas Graf
  6 siblings, 1 reply; 36+ messages in thread
From: Nicolas Dichtel @ 2015-06-09  8:58 UTC (permalink / raw)
  To: Shrijeet Mukherjee, hannes, dsahern, ebiederm, hadi, davem,
	stephen, netdev
  Cc: roopa, gospo, jtoppins, nikolay

Le 08/06/2015 20:35, Shrijeet Mukherjee a écrit :
> From: Shrijeet Mukherjee <shm@cumulusnetworks.com>
>
> In the context of internet scale routing a requirement that always
> comes up is the need to partition the available routing tables into
> disjoint routing planes. A specific use case is the multi-tenancy
> problem where each tenant has their own unique routing tables and in
> the very least need different default gateways.
>
> This is an attempt to build the ability to create virtual router
> domains aka VRF's (VRF-lite to be specific) in the linux packet
> forwarding stack. The main observation is that through the use of
[snip]
>   drivers/net/vrf.c            |  654 ++++++++++++++++++++++++++++++++++++++++++

I'm not really in favor of the name 'vrf'. This term is very controversial and
having a consensus of what is/contains a 'vrf' is quite impossible.
There was already a lot of discussions about this topic on quagga ml that show
that everybody has a different opinion about this term ;-)

I know you call this 'MRF' internally, why not using this name instead?


Regards,
Nicolas

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC net-next 2/3] VRF driver and needed infrastructure
  2015-06-08 18:35 ` [RFC net-next 2/3] VRF driver and needed infrastructure Shrijeet Mukherjee
  2015-06-08 19:08   ` David Ahern
  2015-06-08 20:17   ` Hannes Frederic Sowa
@ 2015-06-09  9:19   ` Nicolas Dichtel
  2015-06-09 12:35   ` Nikolay Aleksandrov
  2015-06-10 18:20   ` Alexander Duyck
  4 siblings, 0 replies; 36+ messages in thread
From: Nicolas Dichtel @ 2015-06-09  9:19 UTC (permalink / raw)
  To: Shrijeet Mukherjee, hannes, dsahern, ebiederm, hadi, davem,
	stephen, netdev
  Cc: roopa, gospo, jtoppins, nikolay

Le 08/06/2015 20:35, Shrijeet Mukherjee a écrit :
> From: Shrijeet Mukherjee <shm@cumulusnetworks.com>
[snip]
> --- /dev/null
> +++ b/drivers/net/vrf.c
[snip]
> +
> +static struct rtnl_link_ops vrf_link_ops __read_mostly = {
> +	.kind		= DRV_NAME,
> +	.priv_size      = sizeof(struct net_vrf),
> +	.policy         = vrf_nl_policy,
> +	.newlink        = vrf_newlink,
> +	.dellink        = vrf_dellink,
> +	.setup		= vrf_setup,
> +	.validate	= vrf_validate,
> +	.maxtype        = IFLA_VRF_MAX,
> +};
It would be good to also have a .fill_info callback.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC net-next 0/3] Proposal for VRF-lite
  2015-06-08 18:35 [RFC net-next 0/3] Proposal for VRF-lite Shrijeet Mukherjee
                   ` (5 preceding siblings ...)
  2015-06-09  8:58 ` Nicolas Dichtel
@ 2015-06-09 10:15 ` Thomas Graf
  2015-06-09 12:30   ` Nicolas Dichtel
       [not found]   ` <CAJmoNQHRTJwdMjziQiPBX07sZKrYd3Z1ASNi1xQZdgJ1Vs6bGg@mail.gmail.com>
  6 siblings, 2 replies; 36+ messages in thread
From: Thomas Graf @ 2015-06-09 10:15 UTC (permalink / raw)
  To: Shrijeet Mukherjee
  Cc: hannes, nicolas.dichtel, dsahern, ebiederm, hadi, davem, stephen,
	netdev, roopa, gospo, jtoppins, nikolay

On 06/08/15 at 11:35am, Shrijeet Mukherjee wrote:
[...]
> model with some performance paths that need optimization. (Specifically
> the output route selector that Roopa, Robert, Thomas and EricB are
> currently discussing on the MPLS thread)

Thanks for posting these patches just in time. This explains how
you intent to deploy Roopa's patches in a scalable manner.

> High Level points
> 
> 1. Simple overlay driver (minimal changes to current stack)
>    * uses the existing fib tables and fib rules infrastructure
> 2. Modelled closely after the ipvlan driver
> 3. Uses current API and infrastructure.
>    * Applications can use SO_BINDTODEVICE or cmsg device indentifiers
>      to pick VRF (ping, traceroute just work)

I like the aspect of reusing existing user interfaces. We might
need to introduce a more fine grained capability than CAP_NET_RAW
to give containers the privileges to bind to a VRF without
allowing them to inject raw frames.

Given I understand this correctly: If my intent was to run a
process in multiple VRFs, then I would need to run that process
in the host network namespace which contains the VRF devices
which would also contain the physical devices. While I might want
to grant my process the ability to bind to VRFs, I may not want
to give it the privileges to bind to any device. So we could
consider introducing CAP_NET_VRF which would allow to bind to
VRF devices.

>    * Standard IP Rules work, and since they are aggregated against the
>      device, scale is manageable
> 4. Completely orthogonal to Namespaces and only provides separation in
>    the routing plane (and ARP)
> 5. Debugging is built-in as tcpdump and counters on the VRF device
>    works as is.
> 
>                                                  N2
>            N1 (all configs here)          +---------------+
>     +--------------+                      |               |
>     |swp1 :10.0.1.1+----------------------+swp1 :10.0.1.2 |
>     |              |                      |               |
>     |swp2 :10.0.2.1+----------------------+swp2 :10.0.2.2 |
>     |              |                      +---------------+
>     | VRF 0        |
>     | table 5      |
>     |              |
>     +---------------+
>     |              |
>     | VRF 1        |                             N3
>     | table 6      |                      +---------------+
>     |              |                      |               |
>     |swp3 :10.0.2.1+----------------------+swp1 :10.0.2.2 |
>     |              |                      |               |
>     |swp4 :10.0.3.1+----------------------+swp2 :10.0.3.2 |
>     +--------------+                      +---------------+

Do I understand this correctly that swp* represent veth pairs?
Why do you have distinct addresses on each peer of the pair?
Are the addresses in N2 and N3 considered private and NATed?

[...]

> # Install the lookup rules that map table to VRF domain
> ip rule add pref 200 oif vrf0 lookup 5
> ip rule add pref 200 iif vrf0 lookup 5
> ip rule add pref 200 oif vrf1 lookup 6
> ip rule add pref 200 iif vrf1 lookup 6

I think this is a good start but we all know the scalability
constraints of this. Depending on the number of L3 domains,
an eBPF classifier utilizing a map to translate origin to
routing table and vice versa might address the scale requirement
long term.

[...]

I will comment on the implementation specifics once I have a
good understanding of your desired end state looks like.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC net-next 0/3] Proposal for VRF-lite
  2015-06-09 10:15 ` Thomas Graf
@ 2015-06-09 12:30   ` Nicolas Dichtel
  2015-06-09 12:43     ` Hannes Frederic Sowa
       [not found]   ` <CAJmoNQHRTJwdMjziQiPBX07sZKrYd3Z1ASNi1xQZdgJ1Vs6bGg@mail.gmail.com>
  1 sibling, 1 reply; 36+ messages in thread
From: Nicolas Dichtel @ 2015-06-09 12:30 UTC (permalink / raw)
  To: Thomas Graf, Shrijeet Mukherjee
  Cc: hannes, dsahern, ebiederm, hadi, davem, stephen, netdev, roopa,
	gospo, jtoppins, nikolay

Le 09/06/2015 12:15, Thomas Graf a écrit :
> On 06/08/15 at 11:35am, Shrijeet Mukherjee wrote:
> [...]
>> model with some performance paths that need optimization. (Specifically
>> the output route selector that Roopa, Robert, Thomas and EricB are
>> currently discussing on the MPLS thread)
>
> Thanks for posting these patches just in time. This explains how
> you intent to deploy Roopa's patches in a scalable manner.
>
>> High Level points
>>
>> 1. Simple overlay driver (minimal changes to current stack)
>>     * uses the existing fib tables and fib rules infrastructure
>> 2. Modelled closely after the ipvlan driver
>> 3. Uses current API and infrastructure.
>>     * Applications can use SO_BINDTODEVICE or cmsg device indentifiers
>>       to pick VRF (ping, traceroute just work)
>
> I like the aspect of reusing existing user interfaces. We might
> need to introduce a more fine grained capability than CAP_NET_RAW
> to give containers the privileges to bind to a VRF without
> allowing them to inject raw frames.
>
> Given I understand this correctly: If my intent was to run a
> process in multiple VRFs, then I would need to run that process
> in the host network namespace which contains the VRF devices
> which would also contain the physical devices. While I might want
> to grant my process the ability to bind to VRFs, I may not want
> to give it the privileges to bind to any device. So we could
> consider introducing CAP_NET_VRF which would allow to bind to
> VRF devices.

If I understand correctly, all existing applications should also be modified
if I want to run them into a VRF/MRF (see my previous email)?

ssh, dhcp, httpd, etc should be runnable per MRF without modifications of
their source code. So, it becomes a netns. What's about an IKE dameon?

It makes sense to have both: netns and MRF ; each can have their own logics
of VRF-like behavior depending on how a VRF is defined by the end users.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC net-next 2/3] VRF driver and needed infrastructure
  2015-06-08 18:35 ` [RFC net-next 2/3] VRF driver and needed infrastructure Shrijeet Mukherjee
                     ` (2 preceding siblings ...)
  2015-06-09  9:19   ` Nicolas Dichtel
@ 2015-06-09 12:35   ` Nikolay Aleksandrov
  2015-06-10  2:11     ` Shrijeet Mukherjee
  2015-06-10 18:20   ` Alexander Duyck
  4 siblings, 1 reply; 36+ messages in thread
From: Nikolay Aleksandrov @ 2015-06-09 12:35 UTC (permalink / raw)
  To: Shrijeet Mukherjee
  Cc: hannes, nicolas.dichtel, dsahern, ebiederm, hadi, David Miller,
	Stephen Hemminger, Netdev, Roopa Prabhu, Andy Gospodarek,
	Jon Toppins

On Mon, Jun 8, 2015 at 8:35 PM, Shrijeet Mukherjee
<shm@cumulusnetworks.com> wrote:
> From: Shrijeet Mukherjee <shm@cumulusnetworks.com>
>
> This driver borrows heavily from IPvlan and teaming drivers.
>
> Routing domains (VRF-lite) are created by instantiating a device
> and enslaving all routed interfaces that participate in the domain.
> As part of the enslavement, all local routes pointing to enslaved
> devices are re-pointed to the vrf device, thus forcing outgoing
> sockets to bind to the vrf to function.
>
> Standard FIB rules can then bind the VRF device to tables and regular
> fib rule processing is followed.
>
> Routed traffic through the box, is fwded by using the VRF device as
> the IIF and following the IIF rule to a table which is mated with
> the VRF.
>
> Locally originated traffic is directed at the VRF device using
> SO_BINDTODEVICE or cmsg headers. This in turn drops the packet into
> the xmit function of the vrf driver, which then completes the ip lookup
> and output.
>
> This solution is completely orthogonal to namespaces and allow the L3
> equivalent of vlans to exist allowing the routing space to be
> partitioned.
>
> Example use is
>    ip link add vrf0 type vrf table 5
>    ip link set eth1 master vrf0
>    ip link set vrf0 up
>
>    ip rule add iif vrf0 table 5
>    ip rule add oif vrf0 table 5
>
> TODO:
> This changeset is for IPv4 only
> Connected route management can be made much better, but is deferred to
> user space for now.
>
> Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com>
> ---
>  drivers/net/Kconfig          |    6 +
>  drivers/net/Makefile         |    1 +
>  drivers/net/vrf.c            |  654 ++++++++++++++++++++++++++++++++++++++++++
>  include/linux/netdevice.h    |   10 +
>  include/net/flow.h           |    1 +
>  include/net/vrf.h            |   19 ++
>  include/uapi/linux/if_link.h |    9 +
>  7 files changed, 700 insertions(+)
>  create mode 100644 drivers/net/vrf.c
>  create mode 100644 include/net/vrf.h
>
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index 019fcef..27a333c 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -283,6 +283,12 @@ config NLMON
>           diagnostics, etc. This is mostly intended for developers or support
>           to debug netlink issues. If unsure, say N.
>
> +config NET_VRF
> +       tristate "Virtual Routing and Forwarding (Lite)"
> +       ---help---
> +          This option enables the support for mapping interfaces into VRF's. The
> +          support enables VRF devices
> +
>  endif # NET_CORE
>
>  config SUNGEM_PHY
> diff --git a/drivers/net/Makefile b/drivers/net/Makefile
> index c12cb22..ca16dd6 100644
> --- a/drivers/net/Makefile
> +++ b/drivers/net/Makefile
> @@ -25,6 +25,7 @@ obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
>  obj-$(CONFIG_VXLAN) += vxlan.o
>  obj-$(CONFIG_GENEVE) += geneve.o
>  obj-$(CONFIG_NLMON) += nlmon.o
> +obj-$(CONFIG_NET_VRF) += vrf.o
>
>  #
>  # Networking Drivers
> diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
> new file mode 100644
> index 0000000..08b3e79
> --- /dev/null
> +++ b/drivers/net/vrf.c
> @@ -0,0 +1,654 @@
> +/*
> + * vrf.c: device driver to encapsulate a VRF space
> + *
> + * Copyright (c) 2015 Cumulus Networks
> + *
> + * Based on dummy, team and ipvlan drivers
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/kernel.h>
> +#include <linux/netdevice.h>
> +#include <linux/etherdevice.h>
> +#include <linux/ip.h>
> +#include <linux/init.h>
> +#include <linux/moduleparam.h>
> +#include <linux/rtnetlink.h>
> +#include <net/rtnetlink.h>
> +#include <net/arp.h>
> +#include <linux/u64_stats_sync.h>
> +#include <linux/hashtable.h>
> +
> +#include <linux/inetdevice.h>
> +#include <net/ip.h>
> +#include <net/ip_fib.h>
> +#include <net/ip6_route.h>
> +#include <net/rtnetlink.h>
> +#include <net/route.h>
> +#include <net/addrconf.h>
> +#include <net/vrf.h>
> +
> +#define DRV_NAME       "vrf"
> +#define DRV_VERSION    "1.0"
> +
> +#define vrf_is_slave(dev)   ((dev->flags & IFF_SLAVE) == IFF_SLAVE)
> +#define vrf_is_master(dev)  ((dev->flags & IFF_MASTER) == IFF_MASTER)

nit: I think you can drop the "==" check here.

> +
> +#define vrf_master_get_rcu(dev) \
> +       ((struct net_device *)rcu_dereference(dev->rx_handler_data))
> +
> +struct pcpu_dstats {
> +       u64                     tx_pkts;
> +       u64                     tx_bytes;
> +       u64                     tx_drps;
> +       u64                     rx_pkts;
> +       u64                     rx_bytes;
> +       struct u64_stats_sync   syncp;
> +};
> +
> +struct slave {
> +       struct list_head        list;
> +       struct net_device       *dev;
> +       long                    priority;
> +};
> +
> +struct slave_queue {
> +       spinlock_t              lock; /* lock for slave insert/delete */
> +       struct list_head        all_slaves;
> +       int                     num_slaves;
> +       struct net_device       *master_dev;
> +};
> +
> +struct net_vrf {
> +       struct slave_queue      queue;
> +       struct fib_table        *tb;
> +       u32                     tb_id;
> +};
> +
> +static int is_ip_rx_frame(struct sk_buff *skb)
> +{
> +       switch (skb->protocol) {
> +       case htons(ETH_P_IP):
> +       case htons(ETH_P_IPV6):
> +               return 1;
> +       }
> +       return 0;
> +}
> +
> +/* note: already called with rcu_read_lock */
> +static rx_handler_result_t vrf_handle_frame(struct sk_buff **pskb)
> +{
> +       struct sk_buff *skb = *pskb;
> +
> +       if (is_ip_rx_frame(skb)) {
> +               struct net_device *dev = vrf_master_get_rcu(skb->dev);
> +               struct pcpu_dstats *dstats = this_cpu_ptr(dev->dstats);
> +
> +               u64_stats_update_begin(&dstats->syncp);
> +               dstats->rx_pkts++;
> +               dstats->rx_bytes += skb->len;
> +               u64_stats_update_end(&dstats->syncp);
> +       }
> +       return RX_HANDLER_PASS;
> +}
> +
> +static struct rtnl_link_stats64 *vrf_get_stats64(
> +       struct net_device *dev, struct rtnl_link_stats64 *stats)
> +{
> +       int i;
> +
> +       for_each_possible_cpu(i) {
> +               const struct pcpu_dstats *dstats;
> +               u64 tbytes, tpkts, tdrops, rbytes, rpkts;
> +               unsigned int start;
> +
> +               dstats = per_cpu_ptr(dev->dstats, i);
> +               do {
> +                       start = u64_stats_fetch_begin_irq(&dstats->syncp);
> +                       tbytes = dstats->tx_bytes;
> +                       tpkts = dstats->tx_pkts;
> +                       tdrops = dstats->tx_drps;
> +                       rbytes = dstats->rx_bytes;
> +                       rpkts = dstats->rx_pkts;
> +               } while (u64_stats_fetch_retry_irq(&dstats->syncp, start));
> +               stats->tx_bytes += tbytes;
> +               stats->tx_packets += tpkts;
> +               stats->tx_dropped += tdrops;
> +               stats->rx_bytes += rbytes;
> +               stats->rx_packets += rpkts;
> +       }
> +       return stats;
> +}
> +
> +static netdev_tx_t vrf_process_v6_outbound(struct sk_buff *skb,
> +                                          struct net_device *dev)
> +{
> +       return 0;
> +}
> +
> +static int _vrf_send_v4_prep(struct sk_buff *skb, struct flowi4 *fl4,
> +                            struct net_device *vrf_dev)
> +{
> +       struct rtable *rt;
> +       struct net_device *dev = skb->dev;
> +
> +       rt = ip_route_output_flow(dev_net(dev), fl4, NULL);
> +
> +       if (IS_ERR(rt))
> +               goto err;
> +
> +       if ((rt->rt_type != RTN_UNICAST && rt->rt_type != RTN_LOCAL) ||
> +           !rt->dst.dev->vrf_ptr) {
> +               ip_rt_put(rt);
> +               goto err;
> +       }
> +
> +       /* prevent slave cross reference */
> +       if (rt->dst.dev->vrf_ptr->ifindex != vrf_dev->ifindex) {
> +               ip_rt_put(rt);
> +               goto err;
> +       }
> +
> +       skb_dst_drop(skb);
> +       skb_dst_set(skb, &rt->dst);
> +
> +       return 0;
> +err:
> +       return 1;
> +}
> +
> +static netdev_tx_t vrf_process_v4_outbound(struct sk_buff *skb,
> +                                          struct net_device *vrf_dev)
> +{
> +       struct iphdr *ip4h = ip_hdr(skb);
> +       int ret = NET_XMIT_DROP;
> +       struct net_device *dev = skb->dev;
> +       struct flowi4 fl4 = {
> +               /* needed to match OIF rule */
> +               .flowi4_oif = vrf_dev->ifindex,
> +               .flowi4_iif = LOOPBACK_IFINDEX,
> +               .flowi4_tos = RT_TOS(ip4h->tos),
> +               .flowi4_flags = FLOWI_FLAG_ANYSRC | FLOWI_FLAG_VRFSRC,
> +               .daddr = ip4h->daddr,
> +               .saddr = 0,
> +       };
> +
> +       if (_vrf_send_v4_prep(skb, &fl4, vrf_dev))
> +               goto err;
> +
> +       dev = skb_dst(skb)->dev;
> +       ip4h->saddr = inet_select_addr(dev, 0, RT_SCOPE_LINK);
> +       ret = ip_local_out(skb);
> +
> +       if (unlikely(net_xmit_eval(ret)))
> +               vrf_dev->stats.tx_errors++;
> +       else
> +               ret = NET_XMIT_SUCCESS;
> +
> +       goto out;
> +err:
> +       vrf_dev->stats.tx_errors++;
> +       kfree_skb(skb);
> +out:
> +       return ret;
> +}
> +
> +static netdev_tx_t is_ip_tx_frame(struct sk_buff *skb, struct net_device *dev)
> +{
> +       /* the frames we recv have full L2 headers, strip */
> +       if (skb_mac_header_was_set(skb)) {
> +               skb_pull(skb, sizeof(struct ethhdr));
> +               skb->mac_header = (typeof(skb->mac_header))~0U;
> +               skb_reset_network_header(skb);
> +       }
> +
> +       switch (skb->protocol) {
> +       case htons(ETH_P_IP):
> +               return vrf_process_v4_outbound(skb, dev);
> +       case htons(ETH_P_IPV6):
> +               return vrf_process_v6_outbound(skb, dev);
> +       default:
> +               return NET_XMIT_DROP;
> +       }
> +}
> +
> +static netdev_tx_t vrf_xmit(struct sk_buff *skb, struct net_device *dev)
> +{
> +       netdev_tx_t ret = is_ip_tx_frame(skb, dev);
> +
> +       if (likely(ret == NET_XMIT_SUCCESS || ret == NET_XMIT_CN)) {
> +               struct pcpu_dstats *dstats = this_cpu_ptr(dev->dstats);
> +
> +               u64_stats_update_begin(&dstats->syncp);
> +               dstats->tx_pkts++;
> +               dstats->tx_bytes += skb->len;
> +               u64_stats_update_end(&dstats->syncp);
> +       } else {
> +               this_cpu_inc(dev->dstats->tx_drps);
> +       }
> +
> +       return ret;
> +}
> +
> +int vrf_output(struct sock *sk, struct sk_buff *skb)
> +{
> +       struct net_device *dev = skb_dst(skb)->dev;
> +
> +       return vrf_xmit(skb, dev);
> +}
> +
> +/**************************** device handling ********************/
> +
> +/* queue->lock must be held */
> +static struct slave *__vrf_find_slave_dev(struct slave_queue *queue,
> +                                         struct net_device *dev)
> +{
> +       struct list_head *this, *head;
> +
> +       head = &queue->all_slaves;
> +       list_for_each(this, head) {
> +               struct slave *slave = list_entry(this, struct slave, list);
> +
> +               if (slave->dev == dev)
> +                       return slave;
> +       }
> +
> +       return NULL;
> +}
> +
> +static void vrf_kill_one_slave(struct slave_queue *queue, struct slave *slave)
> +{
> +       list_del(&slave->list);
> +       queue->num_slaves--;
> +       slave->dev->flags &= ~IFF_SLAVE;
> +       netdev_rx_handler_unregister(slave->dev);
> +       kfree(slave->dev->vrf_ptr);
> +       slave->dev->vrf_ptr = NULL;
> +       dev_put(slave->dev);
> +       kfree(slave);
> +}
> +
> +/* queue->lock must be held */
> +static int __vrf_insert_slave(struct slave_queue *queue, struct slave *slave,
> +                             struct net_device *master)
> +{
> +       struct net_vrf *vrf = netdev_priv(master);
> +       struct slave *duplicate_slave = NULL;
> +       int master_ifindex = master->ifindex;
> +       int err = 0;
> +
> +       duplicate_slave = __vrf_find_slave_dev(queue, slave->dev);
> +       if (duplicate_slave)
> +               vrf_kill_one_slave(queue, duplicate_slave);

vrf_kill_one_slave() calls netdev_rx_handler_unregister() which does
synchronize_rcu() and
here you're running with the queue spinlock held and softirqs disabled.

> +
> +       dev_hold(slave->dev);
> +       list_add(&slave->list, &queue->all_slaves);
> +       queue->num_slaves++;
> +       slave->dev->flags |= IFF_SLAVE;
> +
> +       slave->dev->vrf_ptr = kmalloc(sizeof(*slave->dev->vrf_ptr), GFP_KERNEL);

Again this runs with a spinlock and softirqs disabled, GFP_KERNEL can sleep.

> +       if (!slave->dev->vrf_ptr)
> +               return -ENODEV;
> +       slave->dev->vrf_ptr->ifindex = master_ifindex;
> +       slave->dev->vrf_ptr->tb_id = vrf->tb_id;
> +
> +       /* register the packet handler for slave ports */
> +       err = netdev_rx_handler_register(slave->dev, vrf_handle_frame,
> +                                        (void *)master);
> +       if (err) {
> +               netdev_err(slave->dev,
> +                          "Device %s failed to register rx_handler\n",
> +                          slave->dev->name);
> +               return err;
> +       }
> +
> +       return 0;
> +}
> +
> +static void vrf_fib_magic(int cmd, int type, __be32 dst, int dst_len,
> +                         struct in_ifaddr *ifa, struct net_device *vrf_dev)
> +{
> +       struct net_vrf *vrf = netdev_priv(vrf_dev);
> +       struct net *net = dev_net(ifa->ifa_dev->dev);
> +       struct net *vrf_net = dev_net(vrf_dev);
> +       struct fib_table *tb, *lc_tb;
> +       struct fib_config cfg = {
> +               .fc_protocol = RTPROT_KERNEL,
> +               .fc_type = type,
> +               .fc_dst = dst,
> +               .fc_dst_len = dst_len,
> +               .fc_prefsrc = ifa->ifa_local,
> +               .fc_nlflags = NLM_F_CREATE | NLM_F_APPEND,
> +               .fc_nlinfo = {
> +                       .nl_net = net,
> +               },
> +       };
> +
> +       lc_tb = fib_new_table(dev_net(vrf_dev), RT_TABLE_LOCAL);
> +       tb = vrf->tb;
> +       if (!tb || !lc_tb)
> +               return;
> +
> +       if (net != vrf_net)
> +               return;
> +
> +       if (type != RTN_LOCAL)
> +               cfg.fc_scope = RT_SCOPE_LINK;
> +       else
> +               cfg.fc_scope = RT_SCOPE_HOST;
> +
> +       if (cmd == RTM_NEWROUTE) {
> +               cfg.fc_table = lc_tb->tb_id;
> +               cfg.fc_oif = ifa->ifa_dev->dev->ifindex;
> +               fib_table_delete(lc_tb, &cfg);
> +               cfg.fc_table = tb->tb_id;
> +               cfg.fc_oif = vrf_dev->ifindex;
> +               fib_table_insert(tb, &cfg);
> +       } else {
> +               cfg.fc_table = tb->tb_id;
> +               cfg.fc_oif = vrf_dev->ifindex;
> +               fib_table_delete(tb, &cfg);
> +               cfg.fc_table = lc_tb->tb_id;
> +               cfg.fc_oif = ifa->ifa_dev->dev->ifindex;
> +               fib_table_insert(lc_tb, &cfg);
> +       }
> +}
> +
> +static void vrf_move_local_routes(int cmd, struct net_device *dev,
> +                                 struct net_device *port_dev)
> +{
> +       struct in_device *in_dev = __in_dev_get_rcu(port_dev);
> +
> +       if (!in_dev)
> +               return;
> +
> +       for_ifa(in_dev) {
> +               __be32 addr;
> +
> +               addr = ifa->ifa_local;
> +               vrf_fib_magic(cmd, RTN_LOCAL, addr, 32, ifa, dev);
> +       } endfor_ifa(in_dev);
> +}
> +
> +static int vrf_inetaddr_event(struct notifier_block *this, unsigned long event,
> +                             void *ptr)
> +{
> +       struct in_ifaddr *ifa = (struct in_ifaddr *)ptr;
> +       struct net_device *port_dev = ifa->ifa_dev->dev;
> +       struct net *net = dev_net(port_dev);
> +       int master = 0;
> +
> +       if (port_dev->vrf_ptr)
> +               master = port_dev->vrf_ptr->ifindex;
> +
> +       if (master) {
> +               struct net_device *dev = dev_get_by_index(net, master);
> +
> +               switch (event) {
> +               case NETDEV_UP:
> +                       vrf_move_local_routes(RTM_NEWROUTE, dev, port_dev);
> +                       break;
> +               case NETDEV_DOWN:
> +                       vrf_move_local_routes(RTM_DELROUTE, dev, port_dev);
> +                       break;
> +               }
> +       }
> +       return NOTIFY_DONE;
> +}
> +
> +static struct notifier_block vrf_inetaddr_notifier = {
> +       .notifier_call = vrf_inetaddr_event,
> +};
> +
> +/* netlink lock is assumed here */
> +static int vrf_add_slave(struct net_device *dev,
> +                        struct net_device *port_dev)
> +{
> +       if (!dev || !port_dev)
> +               return -ENODEV;
> +
> +       if (dev_net(dev) != dev_net(port_dev))
> +               return -ENODEV;
> +
> +       if (!vrf_is_master(port_dev) && !vrf_is_slave(port_dev)) {
> +               struct slave *s = kmalloc(sizeof(*s), GFP_KERNEL);
> +               struct net_vrf *vrf = netdev_priv(dev);
> +               int ret;
> +
> +               if (!s)
> +                       return -ENOMEM;
> +
> +               memset(s, 0, sizeof(*s));

You can use kzalloc for "s".

> +               s->dev = port_dev;
> +
> +               spin_lock_bh(&vrf->queue.lock);
> +               ret = __vrf_insert_slave(&vrf->queue, s, dev);
> +               if (ret)
> +                       kfree(s);
> +
> +               spin_unlock_bh(&vrf->queue.lock);
> +
> +               vrf_move_local_routes(RTM_NEWROUTE, dev, port_dev);
> +               ret = netdev_master_upper_dev_link(port_dev, dev);
> +               return ret;
> +       }
> +
> +       return -EINVAL;
> +}
> +
> +static int vrf_del_slave(struct net_device *dev,
> +                        struct net_device *port_dev)
> +{
> +       struct net_vrf *vrf = netdev_priv(dev);
> +       struct slave_queue *queue = &vrf->queue;
> +       struct slave *slave = __vrf_find_slave_dev(queue, port_dev);
> +
> +       if (!slave)
> +               return -EINVAL;
> +
> +       vrf_kill_one_slave(queue, slave);
> +       vrf_move_local_routes(RTM_DELROUTE, dev, port_dev);
> +       netdev_upper_dev_unlink(port_dev, dev);
> +
> +       return 0;
> +}
> +
> +static int vrf_dev_init(struct net_device *dev)
> +{
> +       struct net_vrf *vrf = netdev_priv(dev);
> +
> +       spin_lock_init(&vrf->queue.lock);
> +       INIT_LIST_HEAD(&vrf->queue.all_slaves);
> +       vrf->queue.master_dev = dev;
> +
> +       dev->dstats = netdev_alloc_pcpu_stats(struct pcpu_dstats);
> +       dev->flags  =  IFF_MASTER | IFF_NOARP;
> +       if (!dev->dstats)
> +               return -ENOMEM;
> +
> +       return 0;
> +}
> +
> +static void vrf_dev_uninit(struct net_device *dev)
> +{
> +       free_percpu(dev->dstats);
> +}
> +
> +static int vrf_dev_close(struct net_device *dev)
> +{
> +       struct net_vrf *vrf = netdev_priv(dev);
> +       struct slave_queue *queue = &vrf->queue;
> +       struct list_head *this, *head;
> +
> +       head = &queue->all_slaves;
> +       list_for_each(this, head) {
> +               struct slave *slave = list_entry(this, struct slave, list);
> +
> +               slave->dev->vrf_ptr->ifindex = 0;
> +               slave->dev->vrf_ptr->tb_id = 0;
> +       }
> +
> +       if (dev->flags & IFF_MASTER)
> +               dev->flags &= ~IFF_UP;
> +/* XXX does the table not need a free
> + * fib_table_delete(vrf->tb, cfg);
> + */
> +       return 0;
> +}
> +
> +static int vrf_dev_open(struct net_device *dev)
> +{
> +       struct net_vrf *vrf = netdev_priv(dev);
> +       struct slave_queue *queue = &vrf->queue;
> +       struct list_head *this, *head;
> +       int err = 0;
> +
> +       head = &queue->all_slaves;
> +       list_for_each(this, head) {
> +               struct slave *slave = list_entry(this, struct slave, list);
> +
> +               slave->dev->vrf_ptr->ifindex = dev->ifindex;
> +               slave->dev->vrf_ptr->tb_id = vrf->tb_id;
> +       }
> +
> +       if (dev->flags & IFF_MASTER)
> +               dev->flags |= IFF_UP;
> +
> +       if (!vrf->tb)
> +               return -EINVAL;
> +
> +       return err;
> +}
> +
> +static int vrf_neigh_create(struct neighbour *n)
> +{
> +       n->nud_state = NUD_REACHABLE;
> +       n->dead      = 0;
> +
> +       return 0;
> +}
> +
> +static const struct net_device_ops vrf_netdev_ops = {
> +       .ndo_init               = vrf_dev_init,
> +       .ndo_uninit             = vrf_dev_uninit,
> +       .ndo_open               = vrf_dev_open,
> +       .ndo_stop               = vrf_dev_close,
> +       .ndo_start_xmit         = vrf_xmit,
> +       .ndo_get_stats64        = vrf_get_stats64,
> +       .ndo_add_slave          = vrf_add_slave,
> +       .ndo_del_slave          = vrf_del_slave,
> +       .ndo_neigh_construct    = vrf_neigh_create,
> +};
> +
> +static void vrf_get_drvinfo(struct net_device *dev,
> +                           struct ethtool_drvinfo *info)
> +{
> +       strlcpy(info->driver, DRV_NAME, sizeof(info->driver));
> +       strlcpy(info->version, DRV_VERSION, sizeof(info->version));
> +}
> +
> +static const struct ethtool_ops vrf_ethtool_ops = {
> +       .get_drvinfo            = vrf_get_drvinfo,
> +};
> +
> +static void vrf_setup(struct net_device *dev)
> +{
> +       ether_setup(dev);
> +
> +       /* Initialize the device structure. */
> +       dev->netdev_ops = &vrf_netdev_ops;
> +       dev->ethtool_ops = &vrf_ethtool_ops;
> +       dev->destructor = free_netdev;
> +
> +       /* Fill in device structure with ethernet-generic values. */
> +       dev->tx_queue_len = 0;
> +       eth_hw_addr_random(dev);
> +}
> +
> +static int vrf_validate(struct nlattr *tb[], struct nlattr *data[])
> +{
> +       if (tb[IFLA_ADDRESS]) {
> +               if (nla_len(tb[IFLA_ADDRESS]) != ETH_ALEN)
> +                       return -EINVAL;
> +               if (!is_valid_ether_addr(nla_data(tb[IFLA_ADDRESS])))
> +                       return -EADDRNOTAVAIL;
> +       }
> +       return 0;
> +}
> +
> +static int vrf_newlink(struct net *src_net, struct net_device *dev,
> +                      struct nlattr *tb[], struct nlattr *data[])
> +{
> +       int err;
> +       struct net_vrf *vrf = netdev_priv(dev);
> +
> +       if (data && data[IFLA_VRF_TABLE]) {
> +               vrf->tb_id = nla_get_u32(data[IFLA_VRF_TABLE]);
> +               /* reserve a table for this VRF device */
> +               vrf->tb = fib_new_table(dev_net(dev), vrf->tb_id);
> +               if (!vrf->tb)
> +                       return -ERANGE;
> +
> +               dev->priv_flags |= IFF_VRF_MASTER;
> +               err = register_netdevice(dev);
> +               if (err) {
> +                       free_netdev(dev);
> +                       return -ENODEV;
> +               }
> +       }
> +       return 0;
> +}
> +
> +static void vrf_dellink(struct net_device *dev, struct list_head *head)
> +{
> +       /* Need to free the table ? */
> +       unregister_netdev(dev);

I think ->dellink() runs with rtnl held and unregister_netdev() tries
to acquire rtnl
so you'll probably deadlock here.

> +}
> +
> +static const struct nla_policy vrf_nl_policy[IFLA_VRF_MAX + 1] = {
> +       [IFLA_VRF_TABLE] = { .type = NLA_U32 },
> +};
> +
> +static struct rtnl_link_ops vrf_link_ops __read_mostly = {
> +       .kind           = DRV_NAME,
> +       .priv_size      = sizeof(struct net_vrf),
> +       .policy         = vrf_nl_policy,
> +       .newlink        = vrf_newlink,
> +       .dellink        = vrf_dellink,
> +       .setup          = vrf_setup,
> +       .validate       = vrf_validate,
> +       .maxtype        = IFLA_VRF_MAX,
> +};
> +
> +static int __init vrf_init_module(void)
> +{
> +       int err = 0;

Initialization is unnecessary, it's always assigned below.

> +
> +       rtnl_lock();
> +       err = __rtnl_link_register(&vrf_link_ops);
> +       if (err < 0)
> +               goto out;
> +
> +       register_inetaddr_notifier(&vrf_inetaddr_notifier);
> +
> +out:
> +       rtnl_unlock();
> +       return err;
> +}
> +
> +static void __exit vrf_cleanup_module(void)
> +{
> +       unregister_inetaddr_notifier(&vrf_inetaddr_notifier);
> +       rtnl_link_unregister(&vrf_link_ops);
> +}
> +
> +module_init(vrf_init_module);
> +module_exit(vrf_cleanup_module);
> +MODULE_LICENSE("GPL");
> +MODULE_ALIAS_RTNL_LINK(DRV_NAME);
> +MODULE_VERSION(DRV_VERSION);
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 51f8d2f..29febf3 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -51,6 +51,7 @@
>  #include <linux/neighbour.h>
>  #include <uapi/linux/netdevice.h>
>  #include <uapi/linux/if_bonding.h>
> +#include <net/vrf.h>
>
>  struct netpoll_info;
>  struct device;
> @@ -1270,6 +1271,7 @@ enum netdev_priv_flags {
>         IFF_XMIT_DST_RELEASE_PERM       = 1<<22,
>         IFF_IPVLAN_MASTER               = 1<<23,
>         IFF_IPVLAN_SLAVE                = 1<<24,
> +       IFF_VRF_MASTER                  = 1<<25,
>  };
>
>  #define IFF_802_1Q_VLAN                        IFF_802_1Q_VLAN
> @@ -1297,6 +1299,7 @@ enum netdev_priv_flags {
>  #define IFF_XMIT_DST_RELEASE_PERM      IFF_XMIT_DST_RELEASE_PERM
>  #define IFF_IPVLAN_MASTER              IFF_IPVLAN_MASTER
>  #define IFF_IPVLAN_SLAVE               IFF_IPVLAN_SLAVE
> +#define IFF_VRF_MASTER                 IFF_VRF_MASTER
>
>  /**
>   *     struct net_device - The DEVICE structure.
> @@ -1413,6 +1416,7 @@ enum netdev_priv_flags {
>   *     @dn_ptr:        DECnet specific data
>   *     @ip6_ptr:       IPv6 specific data
>   *     @ax25_ptr:      AX.25 specific data
> + *     @vrf_ptr:       VRF specific data
>   *     @ieee80211_ptr: IEEE 802.11 specific data, assign before registering
>   *
>   *     @last_rx:       Time of last Rx
> @@ -1625,6 +1629,7 @@ struct net_device {
>         struct dn_dev __rcu     *dn_ptr;
>         struct inet6_dev __rcu  *ip6_ptr;
>         void                    *ax25_ptr;
> +       struct net_vrf_dev      *vrf_ptr;
>         struct wireless_dev     *ieee80211_ptr;
>         struct wpan_dev         *ieee802154_ptr;
>  #if IS_ENABLED(CONFIG_MPLS_ROUTING)
> @@ -3776,6 +3781,11 @@ static inline bool netif_supports_nofcs(struct net_device *dev)
>         return dev->priv_flags & IFF_SUPP_NOFCS;
>  }
>
> +static inline bool netif_is_vrf(struct net_device *dev)
> +{
> +       return dev->priv_flags & IFF_VRF_MASTER;
> +}
> +
>  /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
>  static inline void netif_keep_dst(struct net_device *dev)
>  {
> diff --git a/include/net/flow.h b/include/net/flow.h
> index 8109a15..69aaa99 100644
> --- a/include/net/flow.h
> +++ b/include/net/flow.h
> @@ -29,6 +29,7 @@ struct flowi_common {
>         __u8    flowic_flags;
>  #define FLOWI_FLAG_ANYSRC              0x01
>  #define FLOWI_FLAG_KNOWN_NH            0x02
> +#define FLOWI_FLAG_VRFSRC              0x04
>         __u32   flowic_secid;
>  };
>
> diff --git a/include/net/vrf.h b/include/net/vrf.h
> new file mode 100644
> index 0000000..11d7dbf8
> --- /dev/null
> +++ b/include/net/vrf.h
> @@ -0,0 +1,19 @@
> +/*
> + * include/net/net_vrf.h - adds vrf dev structure definitions
> + * Copyright (c) 2015 Cumulus Networks
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + */
> +
> +#ifndef __LINUX_NET_VRF_H
> +#define __LINUX_NET_VRF_H
> +
> +struct net_vrf_dev {
> +       int                     ifindex; /* ifindex of master dev */
> +       u32                     tb_id;
> +};
> +
> +#endif /* __LINUX_NET_VRF_H */
> diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
> index afccc93..b98a443 100644
> --- a/include/uapi/linux/if_link.h
> +++ b/include/uapi/linux/if_link.h
> @@ -339,6 +339,15 @@ enum macvlan_macaddr_mode {
>
>  #define MACVLAN_FLAG_NOPROMISC 1
>
> +/* VRF section */
> +enum {
> +       IFLA_VRF_UNSPEC,
> +       IFLA_VRF_TABLE,
> +       __IFLA_VRF_MAX
> +};
> +
> +#define IFLA_VRF_MAX (__IFLA_VRF_MAX - 1)
> +
>  /* IPVLAN section */
>  enum {
>         IFLA_IPVLAN_UNSPEC,
> --
> 1.7.10.4
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC net-next 0/3] Proposal for VRF-lite
  2015-06-09 12:30   ` Nicolas Dichtel
@ 2015-06-09 12:43     ` Hannes Frederic Sowa
  0 siblings, 0 replies; 36+ messages in thread
From: Hannes Frederic Sowa @ 2015-06-09 12:43 UTC (permalink / raw)
  To: Nicolas Dichtel, Thomas Graf, Shrijeet Mukherjee
  Cc: David Ahern, Eric W. Biederman, Jamal Hadi Salim, davem,
	Stephen Hemminger, netdev, roopa, andy gospodarek, jtoppins,
	nikolay

On Tue, Jun 9, 2015, at 14:30, Nicolas Dichtel wrote:
> Le 09/06/2015 12:15, Thomas Graf a écrit :
> > On 06/08/15 at 11:35am, Shrijeet Mukherjee wrote:
> > [...]
> >> model with some performance paths that need optimization. (Specifically
> >> the output route selector that Roopa, Robert, Thomas and EricB are
> >> currently discussing on the MPLS thread)
> >
> > Thanks for posting these patches just in time. This explains how
> > you intent to deploy Roopa's patches in a scalable manner.
> >
> >> High Level points
> >>
> >> 1. Simple overlay driver (minimal changes to current stack)
> >>     * uses the existing fib tables and fib rules infrastructure
> >> 2. Modelled closely after the ipvlan driver
> >> 3. Uses current API and infrastructure.
> >>     * Applications can use SO_BINDTODEVICE or cmsg device indentifiers
> >>       to pick VRF (ping, traceroute just work)
> >
> > I like the aspect of reusing existing user interfaces. We might
> > need to introduce a more fine grained capability than CAP_NET_RAW
> > to give containers the privileges to bind to a VRF without
> > allowing them to inject raw frames.
> >
> > Given I understand this correctly: If my intent was to run a
> > process in multiple VRFs, then I would need to run that process
> > in the host network namespace which contains the VRF devices
> > which would also contain the physical devices. While I might want
> > to grant my process the ability to bind to VRFs, I may not want
> > to give it the privileges to bind to any device. So we could
> > consider introducing CAP_NET_VRF which would allow to bind to
> > VRF devices.
> 
> If I understand correctly, all existing applications should also be
> modified
> if I want to run them into a VRF/MRF (see my previous email)?
> 
> ssh, dhcp, httpd, etc should be runnable per MRF without modifications of
> their source code. So, it becomes a netns. What's about an IKE dameon?
> 
> It makes sense to have both: netns and MRF ; each can have their own
> logics
> of VRF-like behavior depending on how a VRF is defined by the end users.

Agreed, the idea is to have a prctl in the end which gets inherited by
fork. current->rt_table_id or some kind of vrf specifier in task_struct
would make that possible then.

A helper tool like ip route exec table 100 /bin/bash would then start a
session bound to a specific routing instance.

Bye,
Hannes

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC net-next 0/3] Proposal for VRF-lite
  2015-06-09  8:58 ` Nicolas Dichtel
@ 2015-06-09 14:21   ` David Ahern
  2015-06-09 14:55     ` Nicolas Dichtel
  0 siblings, 1 reply; 36+ messages in thread
From: David Ahern @ 2015-06-09 14:21 UTC (permalink / raw)
  To: nicolas.dichtel, Shrijeet Mukherjee, hannes, ebiederm, hadi,
	davem, stephen, netdev
  Cc: roopa, gospo, jtoppins, nikolay

Hi Nicolas:

On 6/9/15 2:58 AM, Nicolas Dichtel wrote:
> I'm not really in favor of the name 'vrf'. This term is very
> controversial and
> having a consensus of what is/contains a 'vrf' is quite impossible.
> There was already a lot of discussions about this topic on quagga ml
> that show
> that everybody has a different opinion about this term ;-)

Are you referring to this thread?
https://lists.quagga.net/pipermail/quagga-dev/2014-November/011795.html

I could see differing opinions regarding the implementation of a VRF; is 
there really a controversy on what a VRF is?

David

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC net-next 0/3] Proposal for VRF-lite
  2015-06-09 14:21   ` David Ahern
@ 2015-06-09 14:55     ` Nicolas Dichtel
  2015-06-09 17:14       ` Shrijeet Mukherjee
  0 siblings, 1 reply; 36+ messages in thread
From: Nicolas Dichtel @ 2015-06-09 14:55 UTC (permalink / raw)
  To: David Ahern, Shrijeet Mukherjee, hannes, ebiederm, hadi, davem,
	stephen, netdev
  Cc: roopa, gospo, jtoppins, nikolay

Le 09/06/2015 16:21, David Ahern a écrit :
> Hi Nicolas:
>
> On 6/9/15 2:58 AM, Nicolas Dichtel wrote:
>> I'm not really in favor of the name 'vrf'. This term is very
>> controversial and
>> having a consensus of what is/contains a 'vrf' is quite impossible.
>> There was already a lot of discussions about this topic on quagga ml
>> that show
>> that everybody has a different opinion about this term ;-)
>
> Are you referring to this thread?
> https://lists.quagga.net/pipermail/quagga-dev/2014-November/011795.html
No, there were recent discussions on quagga about that subject. Here is some
non-exhaustive pointers:
https://lists.quagga.net/pipermail/quagga-dev/2015-May/012581.html
https://lists.quagga.net/pipermail/quagga-dev/2015-May/012630.html
https://lists.quagga.net/pipermail/quagga-dev/2015-June/012715.html

Note the last pointer also explains why it was called MRF by Cumulus.


Regards,
Nicolas

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC net-next 0/3] Proposal for VRF-lite
  2015-06-09 14:55     ` Nicolas Dichtel
@ 2015-06-09 17:14       ` Shrijeet Mukherjee
  0 siblings, 0 replies; 36+ messages in thread
From: Shrijeet Mukherjee @ 2015-06-09 17:14 UTC (permalink / raw)
  To: Nicolas Dichtel
  Cc: David Ahern, Hannes Frederic Sowa, Eric W. Biederman,
	Jamal Hadi Salim, David Miller, Stephen Hemminger, Netdev,
	Roopa Prabhu, Andy Gospodarek, Jon Toppins, Nikolay Aleksandrov

On Tue, Jun 9, 2015 at 7:55 AM, Nicolas Dichtel
<nicolas.dichtel@6wind.com> wrote:
> Le 09/06/2015 16:21, David Ahern a écrit :
>>
>> Hi Nicolas:
>>
>> On 6/9/15 2:58 AM, Nicolas Dichtel wrote:
>>>
>>> I'm not really in favor of the name 'vrf'. This term is very
>>> controversial and
>>> having a consensus of what is/contains a 'vrf' is quite impossible.
>>> There was already a lot of discussions about this topic on quagga ml
>>> that show
>>> that everybody has a different opinion about this term ;-)
>>
>>
>> Are you referring to this thread?
>> https://lists.quagga.net/pipermail/quagga-dev/2014-November/011795.html
>
> No, there were recent discussions on quagga about that subject. Here is some
> non-exhaustive pointers:
> https://lists.quagga.net/pipermail/quagga-dev/2015-May/012581.html
> https://lists.quagga.net/pipermail/quagga-dev/2015-May/012630.html
> https://lists.quagga.net/pipermail/quagga-dev/2015-June/012715.html
>
> Note the last pointer also explains why it was called MRF by Cumulus.
>
>

Agreed, I used the term VRF for this series to make sure we had the
right context, but clearly MRF is a term we are happier to use ..

> Regards,
> Nicolas

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC net-next 2/3] VRF driver and needed infrastructure
  2015-06-09 12:35   ` Nikolay Aleksandrov
@ 2015-06-10  2:11     ` Shrijeet Mukherjee
  0 siblings, 0 replies; 36+ messages in thread
From: Shrijeet Mukherjee @ 2015-06-10  2:11 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: Hannes Frederic Sowa, Nicolas Dichtel, David Ahern,
	Eric W. Biederman, Jamal Hadi Salim, David Miller,
	Stephen Hemminger, Netdev, Roopa Prabhu, Andy Gospodarek,
	Jon Toppins

On Tue, Jun 9, 2015 at 5:35 AM, Nikolay Aleksandrov
<nikolay@cumulusnetworks.com> wrote:
> On Mon, Jun 8, 2015 at 8:35 PM, Shrijeet Mukherjee
> <shm@cumulusnetworks.com> wrote:
>
> vrf_kill_one_slave() calls netdev_rx_handler_unregister() which does
> synchronize_rcu() and
> here you're running with the queue spinlock held and softirqs disabled.
>




>> +
>> +       dev_hold(slave->dev);
>> +       list_add(&slave->list, &queue->all_slaves);
>> +       queue->num_slaves++;
>> +       slave->dev->flags |= IFF_SLAVE;
>> +
>> +       slave->dev->vrf_ptr = kmalloc(sizeof(*slave->dev->vrf_ptr), GFP_KERNEL);
>
> Again this runs with a spinlock and softirqs disabled, GFP_KERNEL can sleep.
>

Good catches, also the one below .. will spin a rev2 patch shortly
with these fixes.

Also working on incorporating Hannes's and Dahern's patches ..


>> +static void vrf_dellink(struct net_device *dev, struct list_head *head)
>> +{
>> +       /* Need to free the table ? */
>> +       unregister_netdev(dev);
>
> I think ->dellink() runs with rtnl held and unregister_netdev() tries
> to acquire rtnl
> so you'll probably deadlock here.
>
>> +}

yup.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC net-next 1/3] Symbol preparation for VRF driver
  2015-06-08 18:35 ` [RFC net-next 1/3] Symbol preparation for VRF driver Shrijeet Mukherjee
@ 2015-06-10 16:24   ` Alexander Duyck
  0 siblings, 0 replies; 36+ messages in thread
From: Alexander Duyck @ 2015-06-10 16:24 UTC (permalink / raw)
  To: Shrijeet Mukherjee, hannes, nicolas.dichtel, dsahern, ebiederm,
	hadi, davem, stephen, netdev
  Cc: roopa, gospo, jtoppins, nikolay

On 06/08/2015 11:35 AM, Shrijeet Mukherjee wrote:
> From: Shrijeet Mukherjee <shm@cumulusnetworks.com>
>
> This change is needed for the following VRF driver which creates
> a routing domain and allows applications to bind to the domain.
>
> No active code path changes.
>
> Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com>
> ---
>   net/ipv4/fib_frontend.c |    1 +
>   net/ipv4/fib_trie.c     |    2 ++
>   2 files changed, 3 insertions(+)
>
> diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
> index 872494e..9d4cef4 100644
> --- a/net/ipv4/fib_frontend.c
> +++ b/net/ipv4/fib_frontend.c
> @@ -108,6 +108,7 @@ struct fib_table *fib_new_table(struct net *net, u32 id)
>   	hlist_add_head_rcu(&tb->tb_hlist, &net->ipv4.fib_table_hash[h]);
>   	return tb;
>   }
> +EXPORT_SYMBOL_GPL(fib_new_table);
>
>   /* caller must hold either rtnl or rcu read lock */
>   struct fib_table *fib_get_table(struct net *net, u32 id)

So this block of code is wrapped in a #ifdef for 
CONFIG_IP_MULTIPLE_TABLES so anything that references it will need to 
make sure that it is defined.

> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> index 01bce15..97fa62d 100644
> --- a/net/ipv4/fib_trie.c
> +++ b/net/ipv4/fib_trie.c
> @@ -1247,6 +1247,7 @@ out:
>   err:
>   	return err;
>   }
> +EXPORT_SYMBOL_GPL(fib_table_insert);
>
>   static inline t_key prefix_mismatch(t_key key, struct key_vector *n)
>   {
> @@ -1535,6 +1536,7 @@ int fib_table_delete(struct fib_table *tb, struct fib_config *cfg)
>   	alias_free_mem_rcu(fa_to_delete);
>   	return 0;
>   }
> +EXPORT_SYMBOL_GPL(fib_table_delete);
>
>   /* Scan for the next leaf starting at the provided key value */
>   static struct key_vector *leaf_walk_rcu(struct key_vector **tn, t_key key)
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC net-next 2/3] VRF driver and needed infrastructure
  2015-06-08 18:35 ` [RFC net-next 2/3] VRF driver and needed infrastructure Shrijeet Mukherjee
                     ` (3 preceding siblings ...)
  2015-06-09 12:35   ` Nikolay Aleksandrov
@ 2015-06-10 18:20   ` Alexander Duyck
  4 siblings, 0 replies; 36+ messages in thread
From: Alexander Duyck @ 2015-06-10 18:20 UTC (permalink / raw)
  To: Shrijeet Mukherjee, hannes, nicolas.dichtel, dsahern, ebiederm,
	hadi, davem, stephen, netdev
  Cc: roopa, gospo, jtoppins, nikolay

On 06/08/2015 11:35 AM, Shrijeet Mukherjee wrote:
> From: Shrijeet Mukherjee <shm@cumulusnetworks.com>
>
> This driver borrows heavily from IPvlan and teaming drivers.
>
> Routing domains (VRF-lite) are created by instantiating a device
> and enslaving all routed interfaces that participate in the domain.
> As part of the enslavement, all local routes pointing to enslaved
> devices are re-pointed to the vrf device, thus forcing outgoing
> sockets to bind to the vrf to function.
>
> Standard FIB rules can then bind the VRF device to tables and regular
> fib rule processing is followed.
>
> Routed traffic through the box, is fwded by using the VRF device as
> the IIF and following the IIF rule to a table which is mated with
> the VRF.
>
> Locally originated traffic is directed at the VRF device using
> SO_BINDTODEVICE or cmsg headers. This in turn drops the packet into
> the xmit function of the vrf driver, which then completes the ip lookup
> and output.
>
> This solution is completely orthogonal to namespaces and allow the L3
> equivalent of vlans to exist allowing the routing space to be
> partitioned.
>
> Example use is
>     ip link add vrf0 type vrf table 5
>     ip link set eth1 master vrf0
>     ip link set vrf0 up
>
>     ip rule add iif vrf0 table 5
>     ip rule add oif vrf0 table 5
>
> TODO:
> This changeset is for IPv4 only
> Connected route management can be made much better, but is deferred to
> user space for now.
>
> Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com>
> ---
>   drivers/net/Kconfig          |    6 +
>   drivers/net/Makefile         |    1 +
>   drivers/net/vrf.c            |  654 ++++++++++++++++++++++++++++++++++++++++++
>   include/linux/netdevice.h    |   10 +
>   include/net/flow.h           |    1 +
>   include/net/vrf.h            |   19 ++
>   include/uapi/linux/if_link.h |    9 +
>   7 files changed, 700 insertions(+)
>   create mode 100644 drivers/net/vrf.c
>   create mode 100644 include/net/vrf.h
>
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index 019fcef..27a333c 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -283,6 +283,12 @@ config NLMON
>   	  diagnostics, etc. This is mostly intended for developers or support
>   	  to debug netlink issues. If unsure, say N.
>
> +config NET_VRF
> +	tristate "Virtual Routing and Forwarding (Lite)"
> +	---help---
> +          This option enables the support for mapping interfaces into VRF's. The
> +          support enables VRF devices
> +
>   endif # NET_CORE
>
>   config SUNGEM_PHY
> diff --git a/drivers/net/Makefile b/drivers/net/Makefile
> index c12cb22..ca16dd6 100644
> --- a/drivers/net/Makefile
> +++ b/drivers/net/Makefile
> @@ -25,6 +25,7 @@ obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
>   obj-$(CONFIG_VXLAN) += vxlan.o
>   obj-$(CONFIG_GENEVE) += geneve.o
>   obj-$(CONFIG_NLMON) += nlmon.o
> +obj-$(CONFIG_NET_VRF) += vrf.o
>
>   #
>   # Networking Drivers
> diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
> new file mode 100644
> index 0000000..08b3e79
> --- /dev/null
> +++ b/drivers/net/vrf.c
> @@ -0,0 +1,654 @@
> +/*
> + * vrf.c: device driver to encapsulate a VRF space
> + *
> + * Copyright (c) 2015 Cumulus Networks
> + *
> + * Based on dummy, team and ipvlan drivers
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/kernel.h>
> +#include <linux/netdevice.h>
> +#include <linux/etherdevice.h>
> +#include <linux/ip.h>
> +#include <linux/init.h>
> +#include <linux/moduleparam.h>
> +#include <linux/rtnetlink.h>
> +#include <net/rtnetlink.h>
> +#include <net/arp.h>
> +#include <linux/u64_stats_sync.h>
> +#include <linux/hashtable.h>
> +
> +#include <linux/inetdevice.h>
> +#include <net/ip.h>
> +#include <net/ip_fib.h>
> +#include <net/ip6_route.h>
> +#include <net/rtnetlink.h>
> +#include <net/route.h>
> +#include <net/addrconf.h>
> +#include <net/vrf.h>
> +
> +#define DRV_NAME	"vrf"
> +#define DRV_VERSION	"1.0"
> +
> +#define vrf_is_slave(dev)   ((dev->flags & IFF_SLAVE) == IFF_SLAVE)
> +#define vrf_is_master(dev)  ((dev->flags & IFF_MASTER) == IFF_MASTER)
> +
> +#define vrf_master_get_rcu(dev) \
> +	((struct net_device *)rcu_dereference(dev->rx_handler_data))
> +
> +struct pcpu_dstats {
> +	u64			tx_pkts;
> +	u64			tx_bytes;
> +	u64			tx_drps;
> +	u64			rx_pkts;
> +	u64			rx_bytes;
> +	struct u64_stats_sync	syncp;
> +};

Do you really need a new pcpu_dstats or could you just use 
pcpu_sw_netstats?  I'm just wondering if you are expecting to see that 
many Tx packets dropped.  If not you could probably just move that to 
the standard netdev stat instead.

> +struct slave {
> +	struct list_head	list;
> +	struct net_device	*dev;
> +	long			priority;
> +};
> +
> +struct slave_queue {
> +	spinlock_t		lock; /* lock for slave insert/delete */
> +	struct list_head	all_slaves;
> +	int			num_slaves;
> +	struct net_device	*master_dev;
> +};
> +
> +struct net_vrf {
> +	struct slave_queue	queue;
> +	struct fib_table        *tb;
> +	u32                     tb_id;
> +};
> +
> +static int is_ip_rx_frame(struct sk_buff *skb)
> +{
> +	switch (skb->protocol) {
> +	case htons(ETH_P_IP):
> +	case htons(ETH_P_IPV6):
> +		return 1;
> +	}
> +	return 0;
> +}
> +

This could probably be an inline to save you some function call 
overhead, but the compiler may already be taking care of it assuming it 
isn't doing anything stupid.

> +/* note: already called with rcu_read_lock */
> +static rx_handler_result_t vrf_handle_frame(struct sk_buff **pskb)
> +{
> +	struct sk_buff *skb = *pskb;
> +
> +	if (is_ip_rx_frame(skb)) {
> +		struct net_device *dev = vrf_master_get_rcu(skb->dev);
> +		struct pcpu_dstats *dstats = this_cpu_ptr(dev->dstats);
> +
> +		u64_stats_update_begin(&dstats->syncp);
> +		dstats->rx_pkts++;
> +		dstats->rx_bytes += skb->len;
> +		u64_stats_update_end(&dstats->syncp);
> +	}
> +	return RX_HANDLER_PASS;
> +}
> +
> +static struct rtnl_link_stats64 *vrf_get_stats64(
> +	struct net_device *dev, struct rtnl_link_stats64 *stats)
> +{
> +	int i;
> +
> +	for_each_possible_cpu(i) {
> +		const struct pcpu_dstats *dstats;
> +		u64 tbytes, tpkts, tdrops, rbytes, rpkts;
> +		unsigned int start;
> +
> +		dstats = per_cpu_ptr(dev->dstats, i);
> +		do {
> +			start = u64_stats_fetch_begin_irq(&dstats->syncp);
> +			tbytes = dstats->tx_bytes;
> +			tpkts = dstats->tx_pkts;
> +			tdrops = dstats->tx_drps;
> +			rbytes = dstats->rx_bytes;
> +			rpkts = dstats->rx_pkts;
> +		} while (u64_stats_fetch_retry_irq(&dstats->syncp, start));
> +		stats->tx_bytes += tbytes;
> +		stats->tx_packets += tpkts;
> +		stats->tx_dropped += tdrops;
> +		stats->rx_bytes += rbytes;
> +		stats->rx_packets += rpkts;
> +	}
> +	return stats;
> +}
> +
> +static netdev_tx_t vrf_process_v6_outbound(struct sk_buff *skb,
> +					   struct net_device *dev)
> +{
> +	return 0;
> +}
> +
> +static int _vrf_send_v4_prep(struct sk_buff *skb, struct flowi4 *fl4,
> +			     struct net_device *vrf_dev)
> +{
> +	struct rtable *rt;
> +	struct net_device *dev = skb->dev;
> +
> +	rt = ip_route_output_flow(dev_net(dev), fl4, NULL);
> +
> +	if (IS_ERR(rt))
> +		goto err;
> +
> +	if ((rt->rt_type != RTN_UNICAST && rt->rt_type != RTN_LOCAL) ||
> +	    !rt->dst.dev->vrf_ptr) {
> +		ip_rt_put(rt);
> +		goto err;
> +	}
> +
> +	/* prevent slave cross reference */
> +	if (rt->dst.dev->vrf_ptr->ifindex != vrf_dev->ifindex) {
> +		ip_rt_put(rt);
> +		goto err;
> +	}
> +
> +	skb_dst_drop(skb);
> +	skb_dst_set(skb, &rt->dst);
> +
> +	return 0;
> +err:
> +	return 1;
> +}
> +
> +static netdev_tx_t vrf_process_v4_outbound(struct sk_buff *skb,
> +					   struct net_device *vrf_dev)
> +{
> +	struct iphdr *ip4h = ip_hdr(skb);
> +	int ret = NET_XMIT_DROP;
> +	struct net_device *dev = skb->dev;
> +	struct flowi4 fl4 = {
> +		/* needed to match OIF rule */
> +		.flowi4_oif = vrf_dev->ifindex,
> +		.flowi4_iif = LOOPBACK_IFINDEX,
> +		.flowi4_tos = RT_TOS(ip4h->tos),
> +		.flowi4_flags = FLOWI_FLAG_ANYSRC | FLOWI_FLAG_VRFSRC,
> +		.daddr = ip4h->daddr,
> +		.saddr = 0,
> +	};
> +
> +	if (_vrf_send_v4_prep(skb, &fl4, vrf_dev))
> +		goto err;
> +
> +	dev = skb_dst(skb)->dev;
> +	ip4h->saddr = inet_select_addr(dev, 0, RT_SCOPE_LINK);
> +	ret = ip_local_out(skb);
> +
> +	if (unlikely(net_xmit_eval(ret)))
> +		vrf_dev->stats.tx_errors++;
> +	else
> +		ret = NET_XMIT_SUCCESS;
> +
> +	goto out;
> +err:
> +	vrf_dev->stats.tx_errors++;
> +	kfree_skb(skb);
> +out:
> +	return ret;
> +}
> +
> +static netdev_tx_t is_ip_tx_frame(struct sk_buff *skb, struct net_device *dev)
> +{
> +	/* the frames we recv have full L2 headers, strip */
> +	if (skb_mac_header_was_set(skb)) {
> +		skb_pull(skb, sizeof(struct ethhdr));
> +		skb->mac_header = (typeof(skb->mac_header))~0U;
> +		skb_reset_network_header(skb);
> +	}
> +
> +	switch (skb->protocol) {
> +	case htons(ETH_P_IP):
> +		return vrf_process_v4_outbound(skb, dev);
> +	case htons(ETH_P_IPV6):
> +		return vrf_process_v6_outbound(skb, dev);
> +	default:
> +		return NET_XMIT_DROP;
> +	}
> +}
> +
> +static netdev_tx_t vrf_xmit(struct sk_buff *skb, struct net_device *dev)
> +{
> +	netdev_tx_t ret = is_ip_tx_frame(skb, dev);
> +
> +	if (likely(ret == NET_XMIT_SUCCESS || ret == NET_XMIT_CN)) {
> +		struct pcpu_dstats *dstats = this_cpu_ptr(dev->dstats);
> +
> +		u64_stats_update_begin(&dstats->syncp);
> +		dstats->tx_pkts++;
> +		dstats->tx_bytes += skb->len;
> +		u64_stats_update_end(&dstats->syncp);
> +	} else {
> +		this_cpu_inc(dev->dstats->tx_drps);
> +	}
> +
> +	return ret;
> +}
> +

So if vrf_dev->stats.tx_errors is incremented already in this path, why 
not just use the dev->stats.tx_dropped instead of the pcpu stat?

> +int vrf_output(struct sock *sk, struct sk_buff *skb)
> +{
> +	struct net_device *dev = skb_dst(skb)->dev;
> +
> +	return vrf_xmit(skb, dev);
> +}
> +
> +/**************************** device handling ********************/
> +
> +/* queue->lock must be held */
> +static struct slave *__vrf_find_slave_dev(struct slave_queue *queue,
> +					  struct net_device *dev)
> +{
> +	struct list_head *this, *head;
> +
> +	head = &queue->all_slaves;
> +	list_for_each(this, head) {
> +		struct slave *slave = list_entry(this, struct slave, list);
> +
> +		if (slave->dev == dev)
> +			return slave;
> +	}
> +
> +	return NULL;
> +}
> +
> +static void vrf_kill_one_slave(struct slave_queue *queue, struct slave *slave)
> +{
> +	list_del(&slave->list);
> +	queue->num_slaves--;
> +	slave->dev->flags &= ~IFF_SLAVE;
> +	netdev_rx_handler_unregister(slave->dev);
> +	kfree(slave->dev->vrf_ptr);
> +	slave->dev->vrf_ptr = NULL;
> +	dev_put(slave->dev);
> +	kfree(slave);
> +}
> +
> +/* queue->lock must be held */
> +static int __vrf_insert_slave(struct slave_queue *queue, struct slave *slave,
> +			      struct net_device *master)
> +{
> +	struct net_vrf *vrf = netdev_priv(master);
> +	struct slave *duplicate_slave = NULL;
> +	int master_ifindex = master->ifindex;
> +	int err = 0;
> +
> +	duplicate_slave = __vrf_find_slave_dev(queue, slave->dev);
> +	if (duplicate_slave)
> +		vrf_kill_one_slave(queue, duplicate_slave);
> +
> +	dev_hold(slave->dev);
> +	list_add(&slave->list, &queue->all_slaves);
> +	queue->num_slaves++;
> +	slave->dev->flags |= IFF_SLAVE;
> +
> +	slave->dev->vrf_ptr = kmalloc(sizeof(*slave->dev->vrf_ptr), GFP_KERNEL);
> +	if (!slave->dev->vrf_ptr)
> +		return -ENODEV;
> +	slave->dev->vrf_ptr->ifindex = master_ifindex;
> +	slave->dev->vrf_ptr->tb_id = vrf->tb_id;
> +
> +	/* register the packet handler for slave ports */
> +	err = netdev_rx_handler_register(slave->dev, vrf_handle_frame,
> +					 (void *)master);
> +	if (err) {
> +		netdev_err(slave->dev,
> +			   "Device %s failed to register rx_handler\n",
> +			   slave->dev->name);
> +		return err;
> +	}
> +
> +	return 0;
> +}
> +
> +static void vrf_fib_magic(int cmd, int type, __be32 dst, int dst_len,
> +			  struct in_ifaddr *ifa, struct net_device *vrf_dev)
> +{
> +	struct net_vrf *vrf = netdev_priv(vrf_dev);
> +	struct net *net = dev_net(ifa->ifa_dev->dev);
> +	struct net *vrf_net = dev_net(vrf_dev);
> +	struct fib_table *tb, *lc_tb;
> +	struct fib_config cfg = {
> +		.fc_protocol = RTPROT_KERNEL,
> +		.fc_type = type,
> +		.fc_dst = dst,
> +		.fc_dst_len = dst_len,
> +		.fc_prefsrc = ifa->ifa_local,
> +		.fc_nlflags = NLM_F_CREATE | NLM_F_APPEND,
> +		.fc_nlinfo = {
> +			.nl_net = net,
> +		},
> +	};
> +
> +	lc_tb = fib_new_table(dev_net(vrf_dev), RT_TABLE_LOCAL);
> +	tb = vrf->tb;
> +	if (!tb || !lc_tb)
> +		return;
> +

Depending on which way you are going you should be calling fib_get_table 
on the table that is about to have the rule stolen, and fib_new_table on 
the table that is about to have a rule inserted.  Generally you don't 
want to call fib_new_table unless you are actually inserting a route.

Also you should probably just store the table ID instead of the table 
pointer itself.

> +	if (net != vrf_net)
> +		return;
> +
> +	if (type != RTN_LOCAL)
> +		cfg.fc_scope = RT_SCOPE_LINK;
> +	else
> +		cfg.fc_scope = RT_SCOPE_HOST;
> +
> +	if (cmd == RTM_NEWROUTE) {
> +		cfg.fc_table = lc_tb->tb_id;
> +		cfg.fc_oif = ifa->ifa_dev->dev->ifindex;
> +		fib_table_delete(lc_tb, &cfg);
> +		cfg.fc_table = tb->tb_id;
> +		cfg.fc_oif = vrf_dev->ifindex;
> +		fib_table_insert(tb, &cfg);
> +	} else {
> +		cfg.fc_table = tb->tb_id;
> +		cfg.fc_oif = vrf_dev->ifindex;
> +		fib_table_delete(tb, &cfg);
> +		cfg.fc_table = lc_tb->tb_id;
> +		cfg.fc_oif = ifa->ifa_dev->dev->ifindex;
> +		fib_table_insert(lc_tb, &cfg);
> +	}
> +}
> +
> +static void vrf_move_local_routes(int cmd, struct net_device *dev,
> +				  struct net_device *port_dev)
> +{
> +	struct in_device *in_dev = __in_dev_get_rcu(port_dev);
> +
> +	if (!in_dev)
> +		return;
> +
> +	for_ifa(in_dev) {
> +		__be32 addr;
> +
> +		addr = ifa->ifa_local;
> +		vrf_fib_magic(cmd, RTN_LOCAL, addr, 32, ifa, dev);
> +	} endfor_ifa(in_dev);
> +}
> +
> +static int vrf_inetaddr_event(struct notifier_block *this, unsigned long event,
> +			      void *ptr)
> +{
> +	struct in_ifaddr *ifa = (struct in_ifaddr *)ptr;
> +	struct net_device *port_dev = ifa->ifa_dev->dev;
> +	struct net *net = dev_net(port_dev);
> +	int master = 0;
> +
> +	if (port_dev->vrf_ptr)
> +		master = port_dev->vrf_ptr->ifindex;
> +
> +	if (master) {
> +		struct net_device *dev = dev_get_by_index(net, master);
> +
> +		switch (event) {
> +		case NETDEV_UP:
> +			vrf_move_local_routes(RTM_NEWROUTE, dev, port_dev);
> +			break;
> +		case NETDEV_DOWN:
> +			vrf_move_local_routes(RTM_DELROUTE, dev, port_dev);
> +			break;
> +		}
> +	}
> +	return NOTIFY_DONE;
> +}
> +
> +static struct notifier_block vrf_inetaddr_notifier = {
> +	.notifier_call = vrf_inetaddr_event,
> +};
> +
> +/* netlink lock is assumed here */
> +static int vrf_add_slave(struct net_device *dev,
> +			 struct net_device *port_dev)
> +{
> +	if (!dev || !port_dev)
> +		return -ENODEV;
> +
> +	if (dev_net(dev) != dev_net(port_dev))
> +		return -ENODEV;
> +
> +	if (!vrf_is_master(port_dev) && !vrf_is_slave(port_dev)) {
> +		struct slave *s = kmalloc(sizeof(*s), GFP_KERNEL);
> +		struct net_vrf *vrf = netdev_priv(dev);
> +		int ret;
> +
> +		if (!s)
> +			return -ENOMEM;
> +
> +		memset(s, 0, sizeof(*s));
> +		s->dev = port_dev;
> +
> +		spin_lock_bh(&vrf->queue.lock);
> +		ret = __vrf_insert_slave(&vrf->queue, s, dev);
> +		if (ret)
> +			kfree(s);
> +
> +		spin_unlock_bh(&vrf->queue.lock);
> +
> +		vrf_move_local_routes(RTM_NEWROUTE, dev, port_dev);
> +		ret = netdev_master_upper_dev_link(port_dev, dev);
> +		return ret;
> +	}
> +
> +	return -EINVAL;
> +}
> +
> +static int vrf_del_slave(struct net_device *dev,
> +			 struct net_device *port_dev)
> +{
> +	struct net_vrf *vrf = netdev_priv(dev);
> +	struct slave_queue *queue = &vrf->queue;
> +	struct slave *slave = __vrf_find_slave_dev(queue, port_dev);
> +
> +	if (!slave)
> +		return -EINVAL;
> +
> +	vrf_kill_one_slave(queue, slave);
> +	vrf_move_local_routes(RTM_DELROUTE, dev, port_dev);
> +	netdev_upper_dev_unlink(port_dev, dev);
> +
> +	return 0;
> +}
> +
> +static int vrf_dev_init(struct net_device *dev)
> +{
> +	struct net_vrf *vrf = netdev_priv(dev);
> +
> +	spin_lock_init(&vrf->queue.lock);
> +	INIT_LIST_HEAD(&vrf->queue.all_slaves);
> +	vrf->queue.master_dev = dev;
> +
> +	dev->dstats = netdev_alloc_pcpu_stats(struct pcpu_dstats);
> +	dev->flags  =  IFF_MASTER | IFF_NOARP;
> +	if (!dev->dstats)
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +
> +static void vrf_dev_uninit(struct net_device *dev)
> +{
> +	free_percpu(dev->dstats);
> +}
> +
> +static int vrf_dev_close(struct net_device *dev)
> +{
> +	struct net_vrf *vrf = netdev_priv(dev);
> +	struct slave_queue *queue = &vrf->queue;
> +	struct list_head *this, *head;
> +
> +	head = &queue->all_slaves;
> +	list_for_each(this, head) {
> +		struct slave *slave = list_entry(this, struct slave, list);
> +
> +		slave->dev->vrf_ptr->ifindex = 0;
> +		slave->dev->vrf_ptr->tb_id = 0;
> +	}
> +
> +	if (dev->flags & IFF_MASTER)
> +		dev->flags &= ~IFF_UP;
> +/* XXX does the table not need a free
> + * fib_table_delete(vrf->tb, cfg);
> + */
> +	return 0;
> +}
> +

The call would probably be fib_free_table, not fib_table_delete since 
delete drops a rule as I recall.  I believe the behavior is to leave the 
table there once it is created.  It will be empty, but I don't believe 
there is much that actually frees the tables themsleves other than 
unloading the network namespace.

> +static int vrf_dev_open(struct net_device *dev)
> +{
> +	struct net_vrf *vrf = netdev_priv(dev);
> +	struct slave_queue *queue = &vrf->queue;
> +	struct list_head *this, *head;
> +	int err = 0;
> +
> +	head = &queue->all_slaves;
> +	list_for_each(this, head) {
> +		struct slave *slave = list_entry(this, struct slave, list);
> +
> +		slave->dev->vrf_ptr->ifindex = dev->ifindex;
> +		slave->dev->vrf_ptr->tb_id = vrf->tb_id;
> +	}
> +
> +	if (dev->flags & IFF_MASTER)
> +		dev->flags |= IFF_UP;
> +
> +	if (!vrf->tb)
> +		return -EINVAL;
> +
> +	return err;
> +}
> +
> +static int vrf_neigh_create(struct neighbour *n)
> +{
> +	n->nud_state = NUD_REACHABLE;
> +	n->dead      = 0;
> +
> +	return 0;
> +}
> +
> +static const struct net_device_ops vrf_netdev_ops = {
> +	.ndo_init		= vrf_dev_init,
> +	.ndo_uninit		= vrf_dev_uninit,
> +	.ndo_open		= vrf_dev_open,
> +	.ndo_stop               = vrf_dev_close,
> +	.ndo_start_xmit		= vrf_xmit,
> +	.ndo_get_stats64	= vrf_get_stats64,
> +	.ndo_add_slave          = vrf_add_slave,
> +	.ndo_del_slave          = vrf_del_slave,
> +	.ndo_neigh_construct    = vrf_neigh_create,
> +};
> +
> +static void vrf_get_drvinfo(struct net_device *dev,
> +			    struct ethtool_drvinfo *info)
> +{
> +	strlcpy(info->driver, DRV_NAME, sizeof(info->driver));
> +	strlcpy(info->version, DRV_VERSION, sizeof(info->version));
> +}
> +
> +static const struct ethtool_ops vrf_ethtool_ops = {
> +	.get_drvinfo            = vrf_get_drvinfo,
> +};
> +
> +static void vrf_setup(struct net_device *dev)
> +{
> +	ether_setup(dev);
> +
> +	/* Initialize the device structure. */
> +	dev->netdev_ops = &vrf_netdev_ops;
> +	dev->ethtool_ops = &vrf_ethtool_ops;
> +	dev->destructor = free_netdev;
> +
> +	/* Fill in device structure with ethernet-generic values. */
> +	dev->tx_queue_len = 0;
> +	eth_hw_addr_random(dev);
> +}
> +
> +static int vrf_validate(struct nlattr *tb[], struct nlattr *data[])
> +{
> +	if (tb[IFLA_ADDRESS]) {
> +		if (nla_len(tb[IFLA_ADDRESS]) != ETH_ALEN)
> +			return -EINVAL;
> +		if (!is_valid_ether_addr(nla_data(tb[IFLA_ADDRESS])))
> +			return -EADDRNOTAVAIL;
> +	}
> +	return 0;
> +}
> +
> +static int vrf_newlink(struct net *src_net, struct net_device *dev,
> +		       struct nlattr *tb[], struct nlattr *data[])
> +{
> +	int err;
> +	struct net_vrf *vrf = netdev_priv(dev);
> +
> +	if (data && data[IFLA_VRF_TABLE]) {
> +		vrf->tb_id = nla_get_u32(data[IFLA_VRF_TABLE]);
> +		/* reserve a table for this VRF device */
> +		vrf->tb = fib_new_table(dev_net(dev), vrf->tb_id);
> +		if (!vrf->tb)
> +			return -ERANGE;
> +
> +		dev->priv_flags |= IFF_VRF_MASTER;
> +		err = register_netdevice(dev);
> +		if (err) {
> +			free_netdev(dev);
> +			return -ENODEV;
> +		}
> +	}
> +	return 0;
> +}
> +

You probably don't need to create the new table here.  This can wait 
until the first new route is added to the table.

> +static void vrf_dellink(struct net_device *dev, struct list_head *head)
> +{
> +	/* Need to free the table ? */
> +	unregister_netdev(dev);
> +}
> +
> +static const struct nla_policy vrf_nl_policy[IFLA_VRF_MAX + 1] = {
> +	[IFLA_VRF_TABLE] = { .type = NLA_U32 },
> +};
> +
> +static struct rtnl_link_ops vrf_link_ops __read_mostly = {
> +	.kind		= DRV_NAME,
> +	.priv_size      = sizeof(struct net_vrf),
> +	.policy         = vrf_nl_policy,
> +	.newlink        = vrf_newlink,
> +	.dellink        = vrf_dellink,
> +	.setup		= vrf_setup,
> +	.validate	= vrf_validate,
> +	.maxtype        = IFLA_VRF_MAX,
> +};
> +
> +static int __init vrf_init_module(void)
> +{
> +	int err = 0;
> +
> +	rtnl_lock();
> +	err = __rtnl_link_register(&vrf_link_ops);
> +	if (err < 0)
> +		goto out;
> +
> +	register_inetaddr_notifier(&vrf_inetaddr_notifier);
> +
> +out:
> +	rtnl_unlock();
> +	return err;
> +}
> +
> +static void __exit vrf_cleanup_module(void)
> +{
> +	unregister_inetaddr_notifier(&vrf_inetaddr_notifier);
> +	rtnl_link_unregister(&vrf_link_ops);
> +}
> +
> +module_init(vrf_init_module);
> +module_exit(vrf_cleanup_module);
> +MODULE_LICENSE("GPL");
> +MODULE_ALIAS_RTNL_LINK(DRV_NAME);
> +MODULE_VERSION(DRV_VERSION);
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 51f8d2f..29febf3 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -51,6 +51,7 @@
>   #include <linux/neighbour.h>
>   #include <uapi/linux/netdevice.h>
>   #include <uapi/linux/if_bonding.h>
> +#include <net/vrf.h>
>
>   struct netpoll_info;
>   struct device;
> @@ -1270,6 +1271,7 @@ enum netdev_priv_flags {
>   	IFF_XMIT_DST_RELEASE_PERM	= 1<<22,
>   	IFF_IPVLAN_MASTER		= 1<<23,
>   	IFF_IPVLAN_SLAVE		= 1<<24,
> +	IFF_VRF_MASTER		        = 1<<25,
>   };
>
>   #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
> @@ -1297,6 +1299,7 @@ enum netdev_priv_flags {
>   #define IFF_XMIT_DST_RELEASE_PERM	IFF_XMIT_DST_RELEASE_PERM
>   #define IFF_IPVLAN_MASTER		IFF_IPVLAN_MASTER
>   #define IFF_IPVLAN_SLAVE		IFF_IPVLAN_SLAVE
> +#define IFF_VRF_MASTER		        IFF_VRF_MASTER
>
>   /**
>    *	struct net_device - The DEVICE structure.
> @@ -1413,6 +1416,7 @@ enum netdev_priv_flags {
>    *	@dn_ptr:	DECnet specific data
>    *	@ip6_ptr:	IPv6 specific data
>    *	@ax25_ptr:	AX.25 specific data
> + *	@vrf_ptr:	VRF specific data
>    *	@ieee80211_ptr:	IEEE 802.11 specific data, assign before registering
>    *
>    *	@last_rx:	Time of last Rx
> @@ -1625,6 +1629,7 @@ struct net_device {
>   	struct dn_dev __rcu     *dn_ptr;
>   	struct inet6_dev __rcu	*ip6_ptr;
>   	void			*ax25_ptr;
> +	struct net_vrf_dev      *vrf_ptr;
>   	struct wireless_dev	*ieee80211_ptr;
>   	struct wpan_dev		*ieee802154_ptr;
>   #if IS_ENABLED(CONFIG_MPLS_ROUTING)
> @@ -3776,6 +3781,11 @@ static inline bool netif_supports_nofcs(struct net_device *dev)
>   	return dev->priv_flags & IFF_SUPP_NOFCS;
>   }
>
> +static inline bool netif_is_vrf(struct net_device *dev)
> +{
> +	return dev->priv_flags & IFF_VRF_MASTER;
> +}
> +
>   /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
>   static inline void netif_keep_dst(struct net_device *dev)
>   {
> diff --git a/include/net/flow.h b/include/net/flow.h
> index 8109a15..69aaa99 100644
> --- a/include/net/flow.h
> +++ b/include/net/flow.h
> @@ -29,6 +29,7 @@ struct flowi_common {
>   	__u8	flowic_flags;
>   #define FLOWI_FLAG_ANYSRC		0x01
>   #define FLOWI_FLAG_KNOWN_NH		0x02
> +#define FLOWI_FLAG_VRFSRC		0x04
>   	__u32	flowic_secid;
>   };
>
> diff --git a/include/net/vrf.h b/include/net/vrf.h
> new file mode 100644
> index 0000000..11d7dbf8
> --- /dev/null
> +++ b/include/net/vrf.h
> @@ -0,0 +1,19 @@
> +/*
> + * include/net/net_vrf.h - adds vrf dev structure definitions
> + * Copyright (c) 2015 Cumulus Networks
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + */
> +
> +#ifndef __LINUX_NET_VRF_H
> +#define __LINUX_NET_VRF_H
> +
> +struct net_vrf_dev {
> +	int                     ifindex; /* ifindex of master dev */
> +	u32                     tb_id;
> +};
> +
> +#endif /* __LINUX_NET_VRF_H */
> diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
> index afccc93..b98a443 100644
> --- a/include/uapi/linux/if_link.h
> +++ b/include/uapi/linux/if_link.h
> @@ -339,6 +339,15 @@ enum macvlan_macaddr_mode {
>
>   #define MACVLAN_FLAG_NOPROMISC	1
>
> +/* VRF section */
> +enum {
> +	IFLA_VRF_UNSPEC,
> +	IFLA_VRF_TABLE,
> +	__IFLA_VRF_MAX
> +};
> +
> +#define IFLA_VRF_MAX (__IFLA_VRF_MAX - 1)
> +
>   /* IPVLAN section */
>   enum {
>   	IFLA_IPVLAN_UNSPEC,
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC net-next 3/3] rcv path changes for vrf traffic
  2015-06-08 18:35 ` [RFC net-next 3/3] rcv path changes for vrf traffic Shrijeet Mukherjee
  2015-06-08 19:58   ` Hannes Frederic Sowa
@ 2015-06-10 18:31   ` Alexander Duyck
  1 sibling, 0 replies; 36+ messages in thread
From: Alexander Duyck @ 2015-06-10 18:31 UTC (permalink / raw)
  To: Shrijeet Mukherjee, hannes, nicolas.dichtel, dsahern, ebiederm,
	hadi, davem, stephen, netdev
  Cc: roopa, gospo, jtoppins, nikolay

On 06/08/2015 11:35 AM, Shrijeet Mukherjee wrote:
> From: Shrijeet Mukherjee <shm@cumulusnetworks.com>
>
> Incoming frames for IP protocol stacks need the IIF to be changed
> from the actual interface to the VRF device. This allows the IIF
> rule to be used to select tables (or do regular PBR)
>
> This change selects the iif to be the VRF device if it exists and
> the incoming iif is enslaved to the VRF device.
>
> Since VRF aware sockets are always bound to the VRF device this
> system allows return traffic to find the socket of origin.
>
> changes are in the arp_rcv, icmp_rcv and ip_rcv paths
>
> Question : I did not wrap the rcv modifications, in CONFIG_NET_VRF
> as it would create code variations and the vrf_ptr check is there
> I can make that whole thing modular.
>
> Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com>
> ---
>   net/ipv4/fib_frontend.c |   14 +++++++++++---
>   net/ipv4/fib_trie.c     |    7 +++++--
>   net/ipv4/icmp.c         |    6 ++++++
>   net/ipv4/route.c        |    3 ++-
>   4 files changed, 24 insertions(+), 6 deletions(-)
>
> diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
> index 9d4cef4..cf2d584 100644
> --- a/net/ipv4/fib_frontend.c
> +++ b/net/ipv4/fib_frontend.c
> @@ -45,6 +45,7 @@
>   #include <net/ip_fib.h>
>   #include <net/rtnetlink.h>
>   #include <net/xfrm.h>
> +#include <net/vrf.h>
>
>   #ifndef CONFIG_IP_MULTIPLE_TABLES
>
> @@ -218,15 +219,18 @@ static inline unsigned int __inet_dev_addr_type(struct net *net,
>   	struct fib_result	res;
>   	unsigned int ret = RTN_BROADCAST;
>   	struct fib_table *local_table;
> +	int rt_table = RT_TABLE_LOCAL;
>
>   	if (ipv4_is_zeronet(addr) || ipv4_is_lbcast(addr))
>   		return RTN_BROADCAST;
>   	if (ipv4_is_multicast(addr))
>   		return RTN_MULTICAST;
>
> -	rcu_read_lock();
> +	if (dev && dev->vrf_ptr)
> +		rt_table =  dev->vrf_ptr->tb_id;
>
> -	local_table = fib_get_table(net, RT_TABLE_LOCAL);
> +	rcu_read_lock();
> +	local_table = fib_get_table(net, rt_table);
>   	if (local_table) {
>   		ret = RTN_UNICAST;
>   		if (!fib_table_lookup(local_table, &fl4, &res, FIB_LOOKUP_NOREF)) {

I would prefer it if you left a blank line between rcu_read_lock and the 
local_table assignment line.  It would help to make this patch and the 
code in general a bit more readable.

> @@ -309,7 +313,11 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
>   	bool dev_match;
>
>   	fl4.flowi4_oif = 0;
> -	fl4.flowi4_iif = oif ? : LOOPBACK_IFINDEX;
> +	if (dev->vrf_ptr)
> +		fl4.flowi4_iif = dev->vrf_ptr ?
> +			dev->vrf_ptr->ifindex : dev->ifindex;
> +	else
> +		fl4.flowi4_iif = oif ? : LOOPBACK_IFINDEX;
>   	fl4.daddr = src;
>   	fl4.saddr = dst;
>   	fl4.flowi4_tos = tos;

This code is kind of redundant.  What is the point of the dev->vfr_ptr 
ternary operator when you just checked it in the if statement.

> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> index 97fa62d..515ff11 100644
> --- a/net/ipv4/fib_trie.c
> +++ b/net/ipv4/fib_trie.c
> @@ -1409,8 +1409,11 @@ found:
>
>   			if (nh->nh_flags & RTNH_F_DEAD)
>   				continue;
> -			if (flp->flowi4_oif && flp->flowi4_oif != nh->nh_oif)
> -				continue;
> +			if (!(flp->flowi4_flags & FLOWI_FLAG_VRFSRC)) {
> +				if (flp->flowi4_oif &&
> +				    flp->flowi4_oif != nh->nh_oif)
> +					continue;
> +			}
>
>   			if (!(fib_flags & FIB_LOOKUP_NOREF))
>   				atomic_inc(&fi->fib_clntref);

You might break this up so that both the flowi4_flags and flowi4_oif 
checks are in the first if statement, and the != is in the next.  I 
would argue that the flowi4_oif check might be done first as you can 
avoid having to test for any masks since you are testing for a 0 value.

> diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
> index f5203fb..61b7da3 100644
> --- a/net/ipv4/icmp.c
> +++ b/net/ipv4/icmp.c
> @@ -96,6 +96,7 @@
>   #include <net/xfrm.h>
>   #include <net/inet_common.h>
>   #include <net/ip_fib.h>
> +#include <net/vrf.h>
>
>   /*
>    *	Build xmit assembly blocks
> @@ -420,6 +421,8 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
>   			daddr = icmp_param->replyopts.opt.opt.faddr;
>   	}
>   	memset(&fl4, 0, sizeof(fl4));
> +	if (skb->dev && skb->dev->vrf_ptr)
> +		fl4.flowi4_oif = skb->dev->vrf_ptr->ifindex;
>   	fl4.daddr = daddr;
>   	fl4.saddr = saddr;
>   	fl4.flowi4_mark = mark;
> @@ -458,6 +461,9 @@ static struct rtable *icmp_route_lookup(struct net *net,
>   	fl4->flowi4_proto = IPPROTO_ICMP;
>   	fl4->fl4_icmp_type = type;
>   	fl4->fl4_icmp_code = code;
> +	if (skb_in->dev && skb_in->dev->vrf_ptr)
> +		fl4->flowi4_oif =  skb_in->dev->vrf_ptr->ifindex;
> +
>   	security_skb_classify_flow(skb_in, flowi4_to_flowi(fl4));
>   	rt = __ip_route_output_key(net, fl4);
>   	if (IS_ERR(rt))
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index f605598..8c37720 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -109,6 +109,7 @@
>   #include <linux/kmemleak.h>
>   #endif
>   #include <net/secure_seq.h>
> +#include <net/vrf.h>
>
>   #define RT_FL_TOS(oldflp4) \
>   	((oldflp4)->flowi4_tos & (IPTOS_RT_MASK | RTO_ONLINK))
> @@ -1710,7 +1711,7 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
>   	 *	Now we are ready to route packet.
>   	 */
>   	fl4.flowi4_oif = 0;
> -	fl4.flowi4_iif = dev->ifindex;
> +	fl4.flowi4_iif = dev->vrf_ptr ? dev->vrf_ptr->ifindex : dev->ifindex;
>   	fl4.flowi4_mark = skb->mark;
>   	fl4.flowi4_tos = tos;
>   	fl4.flowi4_scope = RT_SCOPE_UNIVERSE;
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC net-next 0/3] Proposal for VRF-lite
       [not found]   ` <CAJmoNQHRTJwdMjziQiPBX07sZKrYd3Z1ASNi1xQZdgJ1Vs6bGg@mail.gmail.com>
@ 2015-06-12  9:46     ` Thomas Graf
  0 siblings, 0 replies; 36+ messages in thread
From: Thomas Graf @ 2015-06-12  9:46 UTC (permalink / raw)
  To: Shrijeet Mukherjee
  Cc: Hannes Frederic Sowa, Nicolas Dichtel, David Ahern,
	Eric W. Biederman, Jamal Hadi Salim, David Miller,
	Stephen Hemminger, Netdev, Roopa Prabhu, Andy Gospodarek,
	Jon Toppins, Nikolay Aleksandrov

On 06/10/15 at 01:43pm, Shrijeet Mukherjee wrote:
> On Tue, Jun 9, 2015 at 3:15 AM, Thomas Graf <tgraf@suug.ch> wrote:
> > Do I understand this correctly that swp* represent veth pairs?
> > Why do you have distinct addresses on each peer of the pair?
> > Are the addresses in N2 and N3 considered private and NATed?
> >
> > [...]
> >
> >
> ???These are physical boxes in the picture not veth pairs or NAT's :)???

I see. So if I translate this to a virtual world with veths where
the guest facing peer is in its own netns, the host facing veth
peer would get attached to a vrf device and we should be good.

> ???Are you worried about ip rule scale ? this reduces the scale to number of
> L3 domains, which should be not that large. I do think we need to speed up
> rule lookup from the linear walk we have right now.

I definitely have more L3 domains than what a linear search can
handle.

> A generic classifier seems like a bigger hammer, but if that is the way to
> replace rules it is a worthy concept.
> 
> That said, the patches from Hannes et al, will make it such that the table
> lookup maybe from the driver directly and thus will skip past the fib rule
> lookup.

The approach from Hannes definitely works for the physical world
but is undesirable for overlays, logical or encapsulations, where
we want to avoid maintaining a net_device for every virtual network.

As I said, I think this is something that can be resolved later on
with a programmable classifier.

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2015-06-12  9:47 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-08 18:35 [RFC net-next 0/3] Proposal for VRF-lite Shrijeet Mukherjee
2015-06-08 18:35 ` [RFC net-next 1/3] Symbol preparation for VRF driver Shrijeet Mukherjee
2015-06-10 16:24   ` Alexander Duyck
2015-06-08 18:35 ` [RFC net-next 2/3] VRF driver and needed infrastructure Shrijeet Mukherjee
2015-06-08 19:08   ` David Ahern
2015-06-08 20:17   ` Hannes Frederic Sowa
2015-06-09  9:19   ` Nicolas Dichtel
2015-06-09 12:35   ` Nikolay Aleksandrov
2015-06-10  2:11     ` Shrijeet Mukherjee
2015-06-10 18:20   ` Alexander Duyck
2015-06-08 18:35 ` [RFC net-next 3/3] rcv path changes for vrf traffic Shrijeet Mukherjee
2015-06-08 19:58   ` Hannes Frederic Sowa
2015-06-08 20:00     ` Hannes Frederic Sowa
2015-06-08 20:22     ` Shrijeet Mukherjee
2015-06-08 20:33       ` Hannes Frederic Sowa
2015-06-08 22:44         ` Shrijeet Mukherjee
2015-06-09  5:41           ` Hannes Frederic Sowa
2015-06-08 22:05     ` David Miller
2015-06-08 22:13       ` Hannes Frederic Sowa
2015-06-08 22:21         ` David Miller
2015-06-09  0:36     ` David Ahern
2015-06-09  1:03     ` David Ahern
2015-06-09  5:35       ` Hannes Frederic Sowa
2015-06-10 18:31   ` Alexander Duyck
2015-06-08 18:35 ` [RFC iproute2] Add the ability to create a VRF device and specify it's table binding Shrijeet Mukherjee
2015-06-08 19:13 ` [RFC net-next 0/3] Proposal for VRF-lite David Ahern
2015-06-08 19:51   ` Shrijeet Mukherjee
2015-06-08 20:41   ` Hannes Frederic Sowa
2015-06-09  8:58 ` Nicolas Dichtel
2015-06-09 14:21   ` David Ahern
2015-06-09 14:55     ` Nicolas Dichtel
2015-06-09 17:14       ` Shrijeet Mukherjee
2015-06-09 10:15 ` Thomas Graf
2015-06-09 12:30   ` Nicolas Dichtel
2015-06-09 12:43     ` Hannes Frederic Sowa
     [not found]   ` <CAJmoNQHRTJwdMjziQiPBX07sZKrYd3Z1ASNi1xQZdgJ1Vs6bGg@mail.gmail.com>
2015-06-12  9:46     ` Thomas Graf

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.