All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC net-next 0/6] Proposal for VRF-lite - v2
@ 2015-07-06 15:03 David Ahern
  2015-07-06 15:03 ` [RFC net-next 1/6] fib: export symbols David Ahern
                   ` (8 more replies)
  0 siblings, 9 replies; 32+ messages in thread
From: David Ahern @ 2015-07-06 15:03 UTC (permalink / raw)
  To: netdev
  Cc: shm, roopa, gospo, jtoppins, nikolay, ddutt, hannes,
	nicolas.dichtel, stephen, hadi, ebiederm, davem, David Ahern

In the context of internet scale routing a requirement that always
comes up is the need to partition the available routing tables into
disjoint routing planes. A specific use case is the multi-tenancy
problem where each tenant has their own unique routing tables and in
the very least need different default gateways.

This is an attempt to build the ability to create virtual router
domains aka VRF's (VRF-lite to be specific) in the linux packet
forwarding stack. The main observation is that through the use of
rules and socket binding to interfaces, all the facilities that we
need are already present in the infrastructure. What is missing is a
handle that identifies a routing domain and can be used to gather
applicable rules/tables and uniqify neighbor selection. The scheme
used needs to preserves the notions of ECMP, and general routing
principles.

This driver is a cross between functionality that the IPVLAN driver
and the Team drivers provide where a device is created and packets
into/out of the routing domain are shuttled through this device. The
device is then used as a handle to identify the applicable rules. The
VRF device is thus the layer3 equivalent of a vlan device.

The very important point to note is that this is only a Layer3 concept
so LLDP like tools do not need to be run in each VRF, processes can
run in unaware mode or select a VRF to be talking through. Also the
behavioral model is a generalized application of the familiar VRF-Lite
model with some performance paths that need optimization. (Specifically
the output route selector that Roopa, Robert, Thomas and EricB are
currently discussing on the MPLS thread)

High Level points
=================
1. Simple overlay driver (minimal changes to current stack)
   * uses the existing fib tables and fib rules infrastructure
2. Modelled closely after the ipvlan driver
3. Uses current API and infrastructure.
   * Applications can use SO_BINDTODEVICE or cmsg device indentifiers
     to pick VRF (ping, traceroute just work)
   * Standard IP Rules work, and since they are aggregated against the
     device, scale is manageable
4. Completely orthogonal to Namespaces and only provides separation in
   the routing plane (and ARP)

                                                 N2
           N1 (all configs here)          +---------------+
    +--------------+                      |               |
    |swp1 :10.0.1.1+----------------------+swp1 :10.0.1.2 |
    |              |                      |               |
    |swp2 :10.0.2.1+----------------------+swp2 :10.0.2.2 |
    |              |                      +---------------+
    | VRF 1        |
    | table 5      |
    |              |
    +---------------+
    |              |
    | VRF 2        |                             N3
    | table 6      |                      +---------------+
    |              |                      |               |
    |swp3 :10.0.2.1+----------------------+swp1 :10.0.2.2 |
    |              |                      |               |
    |swp4 :10.0.3.1+----------------------+swp2 :10.0.3.2 |
    +--------------+                      +---------------+


Given the topology above, the setup needed to get the basic VRF
functions working would be

Create the VRF devices and associate with a table
    ip link add vrf1 type vrf table 5
    ip link add vrf2 type vrf table 6

Install the lookup rules that map table to VRF domain
    ip rule add pref 200 oif vrf1 lookup 5
    ip rule add pref 200 iif vrf1 lookup 5
    ip rule add pref 200 oif vrf2 lookup 6
    ip rule add pref 200 iif vrf2 lookup 6

    ip link set vrf1 up
    ip link set vrf2 up

Enslave the routing member interfaces
    ip link set swp1 master vrf1
    ip link set swp2 master vrf1
    ip link set swp3 master vrf2
    ip link set swp4 master vrf2

In this version connected routes are automatically moved from main table
to VRF table.

ping using VRF0 is simply
    ping -I vrf0 -I <optional-src-addr> 10.0.1.2

Or using the task context and a command such as the example chvrf in
patch 6 unmodified applications are run in a VRF context using:
   chvrf -v 1 ping 10.0.1.2


Design Highlights
=================
If a device is enslaved to a VRF device (ie., associated with a VRF)
then:
1. Rx path
   The master device index is used as the iif for all lookups.

2. Tx path
   Similarly, for Tx the VRF device oif is used in the flow to direct
   lookups to the table associated with the VRF via its rule. From there
   the FLOWI_FLAG_VRFSRC flag is used to indicate that the oif should
   not be used for FIB table lookups.

3. Connected and local routes
   On link up for a device, connected and local routes are added to the
   table associated with the VRF device, rather than the local and main
   tables.

4. Socket lookups
   Socket lookups use the VRF device for comparison with sk_bound_dev_if.
   If a socket is not bound to a device a socket match can happen based
   on destination address, port and protocol in which case a VRF global
   or agnostic process handles the connection (ie., this allows 1 listener
   socket to handle connections across VRFs). The child socket becomes
   bound to the VRF (sk_bound_dev_if is set to the VRF device).

5. Neighbor entries
   Neighbor entries are not impacted by the VRF device. Entries are
   associated with a particular interface; the VRF association is indirect
   via the interface-to-VRF device enslavement.

TO-DO
=====
1. ipv4 multicast

2. ICMP and error path handling on connection attempts
   - e.g., connection attempt to a port with no listener

3. IPv6

4. netfilter integration

5. listen filter to restrict VRF connections
   - i.e., bind to VRF's a, b, c only or NOT VRFs e, f, g


Bug-Fixes and ideas from Hannes, Roopa Prabhu, Jon Toppins, Jamal

Patches can also be pulled from:
    https://github.com/dsahern/linux.git, vrf-dev-4.1 branch
    https://github.com/dsahern/iproute2,  vrf-dev-4.1 branch


Shrijeet Mukherjee and David Ahern (6):
  fib: export symbols
  net: Preparation for vrf device
  net: Introduce VRF device driver - v2
  net: Modifications to ipv4 stack for VRF devices
  net: Add sk_bind_dev_if to task_struct
  net: Add chvrf command

 drivers/net/Kconfig           |   7 +
 drivers/net/Makefile          |   1 +
 drivers/net/vrf.c             | 486 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/netdevice.h     |  21 ++
 include/linux/sched.h         |   3 +
 include/net/flow.h            |   1 +
 include/net/inet_hashtables.h |   9 +-
 include/net/route.h           |   4 +
 include/net/vrf.h             |  71 ++++++
 include/uapi/linux/if_link.h  |   9 +
 include/uapi/linux/prctl.h    |   4 +
 kernel/fork.c                 |   2 +
 kernel/sys.c                  |  35 +++
 net/ipv4/af_inet.c            |   1 +
 net/ipv4/fib_frontend.c       |  31 ++-
 net/ipv4/fib_semantics.c      |  25 ++-
 net/ipv4/fib_trie.c           |   8 +-
 net/ipv4/icmp.c               |   4 +
 net/ipv4/ping.c               |   3 +-
 net/ipv4/raw.c                |   5 +-
 net/ipv4/route.c              |  12 +-
 net/ipv4/syncookies.c         |   4 +-
 net/ipv4/tcp_input.c          |   6 +-
 net/ipv4/tcp_ipv4.c           |   6 +-
 net/ipv4/udp.c                |   2 +
 net/ipv6/af_inet6.c           |   1 +
 tools/net/Makefile            |   6 +-
 tools/net/chvrf.c             | 225 +++++++++++++++++++
 28 files changed, 962 insertions(+), 30 deletions(-)
 create mode 100644 drivers/net/vrf.c
 create mode 100644 include/net/vrf.h
 create mode 100644 tools/net/chvrf.c

-- 
2.3.2 (Apple Git-55)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [RFC net-next 1/6] fib: export symbols
  2015-07-06 15:03 [RFC net-next 0/6] Proposal for VRF-lite - v2 David Ahern
@ 2015-07-06 15:03 ` David Ahern
  2015-07-06 15:03 ` [RFC net-next 2/6] net: Preparation for vrf device David Ahern
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: David Ahern @ 2015-07-06 15:03 UTC (permalink / raw)
  To: netdev
  Cc: shm, roopa, gospo, jtoppins, nikolay, ddutt, hannes,
	nicolas.dichtel, stephen, hadi, ebiederm, davem, David Ahern

This change is needed for the following VRF driver.

No active code path changes.

Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 net/ipv4/fib_frontend.c | 1 +
 net/ipv4/fib_trie.c     | 1 +
 2 files changed, 2 insertions(+)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 6bbc54940eb4..974fa51effca 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -108,6 +108,7 @@ struct fib_table *fib_new_table(struct net *net, u32 id)
 	hlist_add_head_rcu(&tb->tb_hlist, &net->ipv4.fib_table_hash[h]);
 	return tb;
 }
+EXPORT_SYMBOL_GPL(fib_new_table);
 
 /* caller must hold either rtnl or rcu read lock */
 struct fib_table *fib_get_table(struct net *net, u32 id)
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 15d32612e3c6..ac2d828c6daa 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1887,6 +1887,7 @@ void fib_free_table(struct fib_table *tb)
 {
 	call_rcu(&tb->rcu, __trie_free_rcu);
 }
+EXPORT_SYMBOL_GPL(fib_free_table);
 
 static int fn_trie_dump_leaf(struct key_vector *l, struct fib_table *tb,
 			     struct sk_buff *skb, struct netlink_callback *cb)
-- 
2.3.2 (Apple Git-55)

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC net-next 2/6] net: Preparation for vrf device
  2015-07-06 15:03 [RFC net-next 0/6] Proposal for VRF-lite - v2 David Ahern
  2015-07-06 15:03 ` [RFC net-next 1/6] fib: export symbols David Ahern
@ 2015-07-06 15:03 ` David Ahern
  2015-07-08  8:37   ` Nicolas Dichtel
  2015-07-06 15:03 ` [RFC net-next 3/6] net: Introduce VRF device driver - v2 David Ahern
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: David Ahern @ 2015-07-06 15:03 UTC (permalink / raw)
  To: netdev
  Cc: shm, roopa, gospo, jtoppins, nikolay, ddutt, hannes,
	nicolas.dichtel, stephen, hadi, ebiederm, davem, David Ahern

Add a VRF_MASTER flag for interfaces and helper functions for determining
if a device is a VRF_MASTER.

Also, add link attribute for passing VRF_TABLE id.

Both are used in the following patch that adds a VRF device driver.

Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 include/linux/netdevice.h    | 21 +++++++++++++++++++++
 include/uapi/linux/if_link.h |  9 +++++++++
 2 files changed, 30 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e20979dfd6a9..142cb64f139c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1274,6 +1274,7 @@ enum netdev_priv_flags {
 	IFF_XMIT_DST_RELEASE_PERM	= 1<<22,
 	IFF_IPVLAN_MASTER		= 1<<23,
 	IFF_IPVLAN_SLAVE		= 1<<24,
+	IFF_VRF_MASTER		        = 1<<25,
 };
 
 #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
@@ -1301,6 +1302,7 @@ enum netdev_priv_flags {
 #define IFF_XMIT_DST_RELEASE_PERM	IFF_XMIT_DST_RELEASE_PERM
 #define IFF_IPVLAN_MASTER		IFF_IPVLAN_MASTER
 #define IFF_IPVLAN_SLAVE		IFF_IPVLAN_SLAVE
+#define IFF_VRF_MASTER		        IFF_VRF_MASTER
 
 /**
  *	struct net_device - The DEVICE structure.
@@ -1417,6 +1419,7 @@ enum netdev_priv_flags {
  *	@dn_ptr:	DECnet specific data
  *	@ip6_ptr:	IPv6 specific data
  *	@ax25_ptr:	AX.25 specific data
+ *	@vrf_ptr:	VRF specific data
  *	@ieee80211_ptr:	IEEE 802.11 specific data, assign before registering
  *
  *	@last_rx:	Time of last Rx
@@ -1629,6 +1632,7 @@ struct net_device {
 	struct dn_dev __rcu     *dn_ptr;
 	struct inet6_dev __rcu	*ip6_ptr;
 	void			*ax25_ptr;
+	struct net_vrf_dev      *vrf_ptr;
 	struct wireless_dev	*ieee80211_ptr;
 	struct wpan_dev		*ieee802154_ptr;
 #if IS_ENABLED(CONFIG_MPLS_ROUTING)
@@ -3781,6 +3785,23 @@ static inline bool netif_supports_nofcs(struct net_device *dev)
 	return dev->priv_flags & IFF_SUPP_NOFCS;
 }
 
+static inline bool netif_is_vrf(struct net_device *dev)
+{
+	return dev->priv_flags & IFF_VRF_MASTER;
+}
+
+static inline bool netif_idx_is_vrf(struct net *net, int idx)
+{
+	struct net_device *dev = dev_get_by_index(net, idx);
+	bool rc = false;
+
+	if (dev) {
+		rc = netif_is_vrf(dev);
+		dev_put(dev);
+	}
+	return rc;
+}
+
 /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
 static inline void netif_keep_dst(struct net_device *dev)
 {
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 2c7e8e3d3981..bfbb4d8eeec2 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -339,6 +339,15 @@ enum macvlan_macaddr_mode {
 
 #define MACVLAN_FLAG_NOPROMISC	1
 
+/* VRF section */
+enum {
+	IFLA_VRF_UNSPEC,
+	IFLA_VRF_TABLE,
+	__IFLA_VRF_MAX
+};
+
+#define IFLA_VRF_MAX (__IFLA_VRF_MAX - 1)
+
 /* IPVLAN section */
 enum {
 	IFLA_IPVLAN_UNSPEC,
-- 
2.3.2 (Apple Git-55)

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC net-next 3/6] net: Introduce VRF device driver - v2
  2015-07-06 15:03 [RFC net-next 0/6] Proposal for VRF-lite - v2 David Ahern
  2015-07-06 15:03 ` [RFC net-next 1/6] fib: export symbols David Ahern
  2015-07-06 15:03 ` [RFC net-next 2/6] net: Preparation for vrf device David Ahern
@ 2015-07-06 15:03 ` David Ahern
  2015-07-06 15:42   ` Nicolas Dichtel
                     ` (3 more replies)
  2015-07-06 15:03 ` [RFC net-next 4/6] net: Modifications to ipv4 stack for VRF devices David Ahern
                   ` (5 subsequent siblings)
  8 siblings, 4 replies; 32+ messages in thread
From: David Ahern @ 2015-07-06 15:03 UTC (permalink / raw)
  To: netdev
  Cc: shm, roopa, gospo, jtoppins, nikolay, ddutt, hannes,
	nicolas.dichtel, stephen, hadi, ebiederm, davem, David Ahern

This driver borrows heavily from IPvlan and teaming drivers.

Routing domains (VRF-lite) are created by instantiating a device
and enslaving all routed interfaces that participate in the domain.
As part of the enslavement, all local routes pointing to enslaved
devices are re-pointed to the vrf device, thus forcing outgoing
sockets to bind to the vrf to function.

Standard FIB rules can then bind the VRF device to tables and regular
fib rule processing is followed.

Routed traffic through the box, is fwded by using the VRF device as
the IIF and following the IIF rule to a table which is mated with
the VRF.

Locally originated traffic is directed at the VRF device using
SO_BINDTODEVICE or cmsg headers. This in turn drops the packet into
the xmit function of the vrf driver, which then completes the ip lookup
and output.

This solution is completely orthogonal to namespaces and allow the L3
equivalent of vlans to exist allowing the routing space to be
partitioned.

Example:

   Create vrf 1:
     ip link add vrf1 type vrf table 5
     ip rule add iif vrf1 table 5
     ip rule add oif vrf1 table 5
     ip route add table 5 prohibit default
     ip link set vrf1 up

   Add interface to vrf 1:
     ip link set eth1 master vrf1

Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>

v2:
- addressed comments from first RFC
- significant changes to improve simplicity of implementation
---
 drivers/net/Kconfig  |   7 +
 drivers/net/Makefile |   1 +
 drivers/net/vrf.c    | 486 +++++++++++++++++++++++++++++++++++++++++++++++++++
 include/net/vrf.h    |  71 ++++++++
 4 files changed, 565 insertions(+)
 create mode 100644 drivers/net/vrf.c
 create mode 100644 include/net/vrf.h

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 019fceffc9e5..b040aa233408 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -283,6 +283,13 @@ config NLMON
 	  diagnostics, etc. This is mostly intended for developers or support
 	  to debug netlink issues. If unsure, say N.
 
+config NET_VRF
+	tristate "Virtual Routing and Forwarding (Lite)"
+	depends on IP_MULTIPLE_TABLES && IPV6_MULTIPLE_TABLES
+	---help---
+          This option enables the support for mapping interfaces into VRF's. The
+          support enables VRF devices
+
 endif # NET_CORE
 
 config SUNGEM_PHY
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index c12cb22478a7..ca16dd689b36 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -25,6 +25,7 @@ obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
 obj-$(CONFIG_VXLAN) += vxlan.o
 obj-$(CONFIG_GENEVE) += geneve.o
 obj-$(CONFIG_NLMON) += nlmon.o
+obj-$(CONFIG_NET_VRF) += vrf.o
 
 #
 # Networking Drivers
diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
new file mode 100644
index 000000000000..b9f9ae68388d
--- /dev/null
+++ b/drivers/net/vrf.c
@@ -0,0 +1,487 @@
+/*
+ * vrf.c: device driver to encapsulate a VRF space
+ *
+ * Copyright (c) 2015 Cumulus Networks
+ *
+ * Based on dummy, team and ipvlan drivers
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/ip.h>
+#include <linux/init.h>
+#include <linux/moduleparam.h>
+#include <linux/rtnetlink.h>
+#include <net/rtnetlink.h>
+#include <net/arp.h>
+#include <linux/u64_stats_sync.h>
+#include <linux/hashtable.h>
+
+#include <linux/inetdevice.h>
+#include <net/ip.h>
+#include <net/ip_fib.h>
+#include <net/ip6_route.h>
+#include <net/rtnetlink.h>
+#include <net/route.h>
+#include <net/addrconf.h>
+#include <net/vrf.h>
+
+#define DRV_NAME	"vrf"
+#define DRV_VERSION	"1.0"
+
+#define vrf_is_slave(dev)   ((dev)->flags & IFF_SLAVE)
+#define vrf_is_master(dev)  ((dev)->flags & IFF_MASTER)
+
+#define vrf_master_get_rcu(dev) \
+	((struct net_device *)rcu_dereference(dev->rx_handler_data))
+
+struct pcpu_dstats {
+	u64			tx_pkts;
+	u64			tx_bytes;
+	u64			tx_drps;
+	u64			rx_pkts;
+	u64			rx_bytes;
+	struct u64_stats_sync	syncp;
+};
+
+struct slave {
+	struct list_head	list;
+	struct net_device	*dev;
+	long			priority;
+};
+
+struct slave_queue {
+	spinlock_t		lock; /* lock for slave insert/delete */
+	struct list_head	all_slaves;
+	int			num_slaves;
+	struct net_device	*master_dev;
+};
+
+struct net_vrf {
+	struct slave_queue	queue;
+	struct fib_table        *tb;
+	u32                     tb_id;
+};
+
+static int is_ip_rx_frame(struct sk_buff *skb)
+{
+	switch (skb->protocol) {
+	case htons(ETH_P_IP):
+	case htons(ETH_P_IPV6):
+		return 1;
+	}
+	return 0;
+}
+
+/* note: already called with rcu_read_lock */
+static rx_handler_result_t vrf_handle_frame(struct sk_buff **pskb)
+{
+	struct sk_buff *skb = *pskb;
+
+	if (is_ip_rx_frame(skb)) {
+		struct net_device *dev = vrf_master_get_rcu(skb->dev);
+		struct pcpu_dstats *dstats = this_cpu_ptr(dev->dstats);
+
+		u64_stats_update_begin(&dstats->syncp);
+		dstats->rx_pkts++;
+		dstats->rx_bytes += skb->len;
+		u64_stats_update_end(&dstats->syncp);
+	}
+	return RX_HANDLER_PASS;
+}
+
+static struct rtnl_link_stats64 *vrf_get_stats64(
+	struct net_device *dev, struct rtnl_link_stats64 *stats)
+{
+	int i;
+
+	for_each_possible_cpu(i) {
+		const struct pcpu_dstats *dstats;
+		u64 tbytes, tpkts, tdrops, rbytes, rpkts;
+		unsigned int start;
+
+		dstats = per_cpu_ptr(dev->dstats, i);
+		do {
+			start = u64_stats_fetch_begin_irq(&dstats->syncp);
+			tbytes = dstats->tx_bytes;
+			tpkts = dstats->tx_pkts;
+			tdrops = dstats->tx_drps;
+			rbytes = dstats->rx_bytes;
+			rpkts = dstats->rx_pkts;
+		} while (u64_stats_fetch_retry_irq(&dstats->syncp, start));
+		stats->tx_bytes += tbytes;
+		stats->tx_packets += tpkts;
+		stats->tx_dropped += tdrops;
+		stats->rx_bytes += rbytes;
+		stats->rx_packets += rpkts;
+	}
+	return stats;
+}
+
+static netdev_tx_t vrf_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	return NET_XMIT_DROP;
+}
+
+/**************************** device handling ********************/
+
+/* queue->lock must be held */
+static struct slave *__vrf_find_slave_dev(struct slave_queue *queue,
+					  struct net_device *dev)
+{
+	struct list_head *this, *head;
+
+	head = &queue->all_slaves;
+	list_for_each(this, head) {
+		struct slave *slave = list_entry(this, struct slave, list);
+
+		if (slave->dev == dev)
+			return slave;
+	}
+
+	return NULL;
+}
+
+static void __vrf_kill_slave(struct slave_queue *queue, struct slave *slave)
+{
+	list_del(&slave->list);
+	queue->num_slaves--;
+	slave->dev->flags &= ~IFF_SLAVE;
+	netdev_rx_handler_unregister(slave->dev);
+	kfree(slave->dev->vrf_ptr);
+	slave->dev->vrf_ptr = NULL;
+	dev_put(slave->dev);
+	kfree(slave);
+}
+
+/* queue->lock must be held */
+static void __vrf_insert_slave(struct slave_queue *queue, struct slave *slave,
+			       struct net_device *master)
+{
+	struct slave *duplicate_slave = NULL;
+
+	duplicate_slave = __vrf_find_slave_dev(queue, slave->dev);
+	if (duplicate_slave)
+		__vrf_kill_slave(queue, duplicate_slave);
+
+	dev_hold(slave->dev);
+	list_add(&slave->list, &queue->all_slaves);
+	queue->num_slaves++;
+}
+
+/* netlink lock is assumed here */
+static int vrf_add_slave(struct net_device *dev,
+			 struct net_device *port_dev)
+{
+	if (!dev || !port_dev || dev_net(dev) != dev_net(port_dev))
+		return -ENODEV;
+
+	if (!vrf_is_master(port_dev) && !vrf_is_slave(port_dev)) {
+		struct slave *s = kzalloc(sizeof(*s), GFP_KERNEL);
+		struct net_vrf *vrf = netdev_priv(dev);
+		struct slave_queue *queue = &vrf->queue;
+		bool is_running = netif_running(port_dev);
+		unsigned int flags = port_dev->flags;
+		int ret;
+
+		if (!s)
+			return -ENOMEM;
+
+		s->dev = port_dev;
+
+		spin_lock_bh(&queue->lock);
+		__vrf_insert_slave(queue, s, dev);
+		spin_unlock_bh(&queue->lock);
+
+		port_dev->vrf_ptr = kmalloc(sizeof(*port_dev->vrf_ptr),
+					    GFP_KERNEL);
+		if (!port_dev->vrf_ptr)
+			return -ENOMEM;
+
+		port_dev->vrf_ptr->ifindex = dev->ifindex;
+		port_dev->vrf_ptr->tb_id = vrf->tb_id;
+
+		/* register the packet handler for slave ports */
+		ret = netdev_rx_handler_register(port_dev, vrf_handle_frame,
+						 (void *)dev);
+		if (ret) {
+			netdev_err(port_dev,
+				   "Device %s failed to register rx_handler\n",
+				   port_dev->name);
+			kfree(port_dev->vrf_ptr);
+			kfree(s);
+			return ret;
+		}
+
+		if (is_running) {
+			ret = dev_change_flags(port_dev, flags & ~IFF_UP);
+			if (ret < 0)
+				goto out_fail;
+		}
+
+		ret = netdev_master_upper_dev_link(port_dev, dev);
+		if (ret < 0)
+			goto out_fail;
+
+		if (is_running) {
+			ret = dev_change_flags(port_dev, flags);
+			if (ret < 0)
+				goto out_fail;
+		}
+
+		port_dev->flags |= IFF_SLAVE;
+
+		return 0;
+
+out_fail:
+		spin_lock_bh(&queue->lock);
+		__vrf_kill_slave(queue, s);
+		spin_unlock_bh(&queue->lock);
+
+		return ret;
+	}
+
+	return -EINVAL;
+}
+
+static int vrf_del_slave(struct net_device *dev,
+			 struct net_device *port_dev)
+{
+	struct net_vrf *vrf = netdev_priv(dev);
+	struct slave_queue *queue = &vrf->queue;
+	struct slave *slave = __vrf_find_slave_dev(queue, port_dev);
+	bool is_running = netif_running(port_dev);
+	unsigned int flags = port_dev->flags;
+	int ret = 0;
+
+	if (!slave)
+		return -EINVAL;
+
+	if (is_running)
+		ret = dev_change_flags(port_dev, flags & ~IFF_UP);
+
+	spin_lock_bh(&queue->lock);
+	__vrf_kill_slave(queue, slave);
+	spin_unlock_bh(&queue->lock);
+
+	netdev_upper_dev_unlink(port_dev, dev);
+
+	if (is_running)
+		ret = dev_change_flags(port_dev, flags);
+
+	return 0;
+}
+
+static int vrf_dev_init(struct net_device *dev)
+{
+	struct net_vrf *vrf = netdev_priv(dev);
+
+	spin_lock_init(&vrf->queue.lock);
+	INIT_LIST_HEAD(&vrf->queue.all_slaves);
+	vrf->queue.master_dev = dev;
+
+	dev->dstats = netdev_alloc_pcpu_stats(struct pcpu_dstats);
+	dev->flags  =  IFF_MASTER | IFF_NOARP;
+	if (!dev->dstats)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void vrf_dev_uninit(struct net_device *dev)
+{
+	free_percpu(dev->dstats);
+}
+
+static int vrf_dev_close(struct net_device *dev)
+{
+	struct net_vrf *vrf = netdev_priv(dev);
+	struct slave_queue *queue = &vrf->queue;
+	struct list_head *this, *head;
+
+	head = &queue->all_slaves;
+	list_for_each(this, head) {
+		struct slave *slave = list_entry(this, struct slave, list);
+
+		slave->dev->vrf_ptr->ifindex = 0;
+		slave->dev->vrf_ptr->tb_id = 0;
+	}
+
+	if (dev->flags & IFF_MASTER)
+		dev->flags &= ~IFF_UP;
+
+	return 0;
+}
+
+static int vrf_dev_open(struct net_device *dev)
+{
+	struct net_vrf *vrf = netdev_priv(dev);
+	struct slave_queue *queue = &vrf->queue;
+	struct list_head *this, *head;
+
+	head = &queue->all_slaves;
+	list_for_each(this, head) {
+		struct slave *slave = list_entry(this, struct slave, list);
+
+		slave->dev->vrf_ptr->ifindex = dev->ifindex;
+		slave->dev->vrf_ptr->tb_id = vrf->tb_id;
+	}
+
+	if (dev->flags & IFF_MASTER)
+		dev->flags |= IFF_UP;
+
+	if (!vrf->tb)
+		return -EINVAL;
+
+	return 0;
+}
+
+static const struct net_device_ops vrf_netdev_ops = {
+	.ndo_init		= vrf_dev_init,
+	.ndo_uninit		= vrf_dev_uninit,
+	.ndo_open		= vrf_dev_open,
+	.ndo_stop               = vrf_dev_close,
+	.ndo_start_xmit		= vrf_xmit,
+	.ndo_get_stats64	= vrf_get_stats64,
+	.ndo_add_slave          = vrf_add_slave,
+	.ndo_del_slave          = vrf_del_slave,
+};
+
+static void vrf_get_drvinfo(struct net_device *dev,
+			    struct ethtool_drvinfo *info)
+{
+	strlcpy(info->driver, DRV_NAME, sizeof(info->driver));
+	strlcpy(info->version, DRV_VERSION, sizeof(info->version));
+}
+
+static const struct ethtool_ops vrf_ethtool_ops = {
+	.get_drvinfo            = vrf_get_drvinfo,
+};
+
+static void vrf_setup(struct net_device *dev)
+{
+	ether_setup(dev);
+
+	/* Initialize the device structure. */
+	dev->netdev_ops = &vrf_netdev_ops;
+	dev->ethtool_ops = &vrf_ethtool_ops;
+	dev->destructor = free_netdev;
+
+	/* Fill in device structure with ethernet-generic values. */
+	dev->tx_queue_len = 0;
+	eth_hw_addr_random(dev);
+}
+
+static int vrf_validate(struct nlattr *tb[], struct nlattr *data[])
+{
+	if (tb[IFLA_ADDRESS]) {
+		if (nla_len(tb[IFLA_ADDRESS]) != ETH_ALEN)
+			return -EINVAL;
+		if (!is_valid_ether_addr(nla_data(tb[IFLA_ADDRESS])))
+			return -EADDRNOTAVAIL;
+	}
+	return 0;
+}
+
+static int vrf_newlink(struct net *src_net, struct net_device *dev,
+		       struct nlattr *tb[], struct nlattr *data[])
+{
+	struct net_vrf *vrf = netdev_priv(dev);
+	int err;
+
+	if (!data || !data[IFLA_VRF_TABLE])
+		return -EINVAL;
+
+	vrf->tb_id = nla_get_u32(data[IFLA_VRF_TABLE]);
+
+	/* reserve a table for this VRF device */
+	err = -ERANGE;
+	vrf->tb = fib_new_table(dev_net(dev), vrf->tb_id);
+	if (!vrf->tb)
+		goto out_fail;
+
+	dev->priv_flags |= IFF_VRF_MASTER;
+
+	err = -ENOMEM;
+	dev->vrf_ptr = kmalloc(sizeof(*dev->vrf_ptr), GFP_KERNEL);
+	if (!dev->vrf_ptr)
+		goto out_fail;
+
+	dev->vrf_ptr->ifindex = dev->ifindex;
+	dev->vrf_ptr->tb_id = vrf->tb_id;
+
+	err = register_netdevice(dev);
+	if (err < 0)
+		goto out_fail;
+
+	return 0;
+
+out_fail:
+	kfree(dev->vrf_ptr);
+	free_netdev(dev);
+	return err;
+}
+
+static void vrf_dellink(struct net_device *dev, struct list_head *head)
+{
+	struct net_vrf *vrf = netdev_priv(dev);
+
+	kfree(dev->vrf_ptr);
+	fib_free_table(vrf->tb);
+}
+
+static size_t vrf_nl_getsize(const struct net_device *dev)
+{
+	return nla_total_size(sizeof(u32));  /* IFLA_VRF_TABLE */
+}
+
+static int vrf_fillinfo(struct sk_buff *skb,
+			const struct net_device *dev)
+{
+	struct net_vrf *vrf = netdev_priv(dev);
+
+	return nla_put_u32(skb, IFLA_VRF_TABLE, vrf->tb_id);
+}
+
+static const struct nla_policy vrf_nl_policy[IFLA_VRF_MAX + 1] = {
+	[IFLA_VRF_TABLE] = { .type = NLA_U32 },
+};
+
+static struct rtnl_link_ops vrf_link_ops __read_mostly = {
+	.kind		= DRV_NAME,
+	.priv_size      = sizeof(struct net_vrf),
+
+	.get_size       = vrf_nl_getsize,
+	.policy         = vrf_nl_policy,
+	.validate	= vrf_validate,
+	.fill_info      = vrf_fillinfo,
+
+	.newlink        = vrf_newlink,
+	.dellink        = vrf_dellink,
+	.setup		= vrf_setup,
+	.maxtype        = IFLA_VRF_MAX,
+};
+
+static int __init vrf_init_module(void)
+{
+	return rtnl_link_register(&vrf_link_ops);
+}
+
+static void __exit vrf_cleanup_module(void)
+{
+	rtnl_link_unregister(&vrf_link_ops);
+}
+
+module_init(vrf_init_module);
+module_exit(vrf_cleanup_module);
+MODULE_AUTHOR("Shrijeet Mukherjee, David Ahern");
+MODULE_LICENSE("GPL");
+MODULE_ALIAS_RTNL_LINK(DRV_NAME);
+MODULE_VERSION(DRV_VERSION);
diff --git a/include/net/vrf.h b/include/net/vrf.h
new file mode 100644
index 000000000000..3ab1e332c781
--- /dev/null
+++ b/include/net/vrf.h
@@ -0,0 +1,71 @@
+/*
+ * include/net/net_vrf.h - adds vrf dev structure definitions
+ * Copyright (c) 2015 Cumulus Networks
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#ifndef __LINUX_NET_VRF_H
+#define __LINUX_NET_VRF_H
+
+struct net_vrf_dev {
+	int                     ifindex; /* ifindex of master dev */
+	u32                     tb_id;   /* table id for VRF */
+};
+
+#if IS_ENABLED(CONFIG_NET_VRF)
+static inline int vrf_master_dev_idx(const struct net_device *dev)
+{
+	int idx = 0;
+
+	if (dev && dev->vrf_ptr)
+		idx = dev->vrf_ptr->ifindex;
+
+	return idx;
+}
+
+static inline int vrf_get_master_dev_idx(struct net *net, int idx)
+{
+	int rc = 0;
+
+	if (idx) {
+		struct net_device *dev = dev_get_by_index(net, idx);
+
+		if (dev) {
+			rc = vrf_master_dev_idx(dev);
+			dev_put(dev);
+		}
+	}
+	return rc;
+}
+
+static inline int vrf_dev_table(const struct net_device *dev)
+{
+	int tb_id = 0;
+
+	if (dev && dev->vrf_ptr)
+		tb_id = dev->vrf_ptr->tb_id;
+
+	return tb_id;
+}
+#else
+static inline int vrf_master_dev_idx(const struct net_device *dev)
+{
+	return 0;
+}
+
+static inline int vrf_get_master_dev_idx(struct net *net, int idx)
+{
+	return 0;
+}
+
+static inline int vrf_dev_table(const struct net_device *dev)
+{
+	return 0;
+}
+#endif
+
+#endif /* __LINUX_NET_VRF_H */
-- 
2.3.2 (Apple Git-55)

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC net-next 4/6] net: Modifications to ipv4 stack for VRF devices
  2015-07-06 15:03 [RFC net-next 0/6] Proposal for VRF-lite - v2 David Ahern
                   ` (2 preceding siblings ...)
  2015-07-06 15:03 ` [RFC net-next 3/6] net: Introduce VRF device driver - v2 David Ahern
@ 2015-07-06 15:03 ` David Ahern
  2015-07-06 15:03 ` [RFC net-next 5/6] net: Add sk_bind_dev_if to task_struct David Ahern
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: David Ahern @ 2015-07-06 15:03 UTC (permalink / raw)
  To: netdev
  Cc: shm, roopa, gospo, jtoppins, nikolay, ddutt, hannes,
	nicolas.dichtel, stephen, hadi, ebiederm, davem, David Ahern

With the following tweaks to the IPv4 stack:
- enslaving devices to a VRF device automatically moves routes to the
  VRF table; removing the VRF master moves routes back to the main table

- the following use cases work for both Rx and Tx:
  + ICMP (ping -I <vrf-device> <ip>)
  + TCP server and client bound to VRF device
  + TCP server not bound to VRF device but working through it
    * client connections are bound to VRF device
  + UDP server and client bound to VRF device

Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 include/net/flow.h            |  1 +
 include/net/inet_hashtables.h |  9 +++++++--
 include/net/route.h           |  4 ++++
 net/ipv4/fib_frontend.c       | 30 ++++++++++++++++++++----------
 net/ipv4/fib_semantics.c      | 25 ++++++++++++++++++++-----
 net/ipv4/fib_trie.c           |  7 +++++--
 net/ipv4/icmp.c               |  4 ++++
 net/ipv4/ping.c               |  3 ++-
 net/ipv4/raw.c                |  5 +++--
 net/ipv4/route.c              | 12 ++++++++++--
 net/ipv4/syncookies.c         |  4 +++-
 net/ipv4/tcp_input.c          |  6 +++++-
 net/ipv4/tcp_ipv4.c           |  6 ++++--
 net/ipv4/udp.c                |  2 ++
 14 files changed, 90 insertions(+), 28 deletions(-)

diff --git a/include/net/flow.h b/include/net/flow.h
index 8109a159d1b3..69aaa99fdeb8 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -29,6 +29,7 @@ struct flowi_common {
 	__u8	flowic_flags;
 #define FLOWI_FLAG_ANYSRC		0x01
 #define FLOWI_FLAG_KNOWN_NH		0x02
+#define FLOWI_FLAG_VRFSRC		0x04
 	__u32	flowic_secid;
 };
 
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index b73c88a19dd4..e26c43823a13 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -31,6 +31,7 @@
 #include <net/route.h>
 #include <net/tcp_states.h>
 #include <net/netns/hash.h>
+#include <net/vrf.h>
 
 #include <linux/atomic.h>
 #include <asm/byteorder.h>
@@ -300,10 +301,14 @@ static inline struct sock *__inet_lookup(struct net *net,
 					 struct inet_hashinfo *hashinfo,
 					 const __be32 saddr, const __be16 sport,
 					 const __be32 daddr, const __be16 dport,
-					 const int dif)
+					 int dif)
 {
 	u16 hnum = ntohs(dport);
-	struct sock *sk = __inet_lookup_established(net, hashinfo,
+	struct sock *sk;
+
+	dif = vrf_get_master_dev_idx(net, dif) ? : dif;
+
+	sk = __inet_lookup_established(net, hashinfo,
 				saddr, sport, daddr, hnum, dif);
 
 	return sk ? : __inet_lookup_listener(net, hashinfo, saddr, sport,
diff --git a/include/net/route.h b/include/net/route.h
index fe22d03afb6a..460333bab217 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -188,6 +188,7 @@ void ipv4_sk_redirect(struct sk_buff *skb, struct sock *sk);
 void ip_rt_send_redirect(struct sk_buff *skb);
 
 unsigned int inet_addr_type(struct net *net, __be32 addr);
+unsigned int inet_addr_type_table(struct net *net, __be32 addr, int tb_id);
 unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev,
 				__be32 addr);
 void ip_rt_multicast_event(struct in_device *);
@@ -250,6 +251,9 @@ static inline void ip_route_connect_init(struct flowi4 *fl4, __be32 dst, __be32
 	if (inet_sk(sk)->transparent)
 		flow_flags |= FLOWI_FLAG_ANYSRC;
 
+	if (netif_idx_is_vrf(sock_net(sk), oif))
+		flow_flags |= FLOWI_FLAG_VRFSRC;
+
 	flowi4_init_output(fl4, oif, sk->sk_mark, tos, RT_SCOPE_UNIVERSE,
 			   protocol, flow_flags, dst, src, dport, sport);
 }
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 974fa51effca..7c73eb058c91 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -45,6 +45,7 @@
 #include <net/ip_fib.h>
 #include <net/rtnetlink.h>
 #include <net/xfrm.h>
+#include <net/vrf.h>
 
 #ifndef CONFIG_IP_MULTIPLE_TABLES
 
@@ -212,7 +213,7 @@ void fib_flush_external(struct net *net)
  */
 static inline unsigned int __inet_dev_addr_type(struct net *net,
 						const struct net_device *dev,
-						__be32 addr)
+						__be32 addr, int rt_table)
 {
 	struct flowi4		fl4 = { .daddr = addr };
 	struct fib_result	res;
@@ -225,8 +226,7 @@ static inline unsigned int __inet_dev_addr_type(struct net *net,
 		return RTN_MULTICAST;
 
 	rcu_read_lock();
-
-	local_table = fib_get_table(net, RT_TABLE_LOCAL);
+	local_table = fib_get_table(net, rt_table);
 	if (local_table) {
 		ret = RTN_UNICAST;
 		if (!fib_table_lookup(local_table, &fl4, &res, FIB_LOOKUP_NOREF)) {
@@ -239,16 +239,24 @@ static inline unsigned int __inet_dev_addr_type(struct net *net,
 	return ret;
 }
 
+unsigned int inet_addr_type_table(struct net *net, __be32 addr, int tb_id)
+{
+	return __inet_dev_addr_type(net, NULL, addr, tb_id);
+}
+EXPORT_SYMBOL(inet_addr_type_table);
+
 unsigned int inet_addr_type(struct net *net, __be32 addr)
 {
-	return __inet_dev_addr_type(net, NULL, addr);
+	return __inet_dev_addr_type(net, NULL, addr, RT_TABLE_LOCAL);
 }
 EXPORT_SYMBOL(inet_addr_type);
 
 unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev,
 				__be32 addr)
 {
-	return __inet_dev_addr_type(net, dev, addr);
+	int rt_table = vrf_dev_table(dev) ? : RT_TABLE_LOCAL;
+
+	return __inet_dev_addr_type(net, dev, addr, rt_table);
 }
 EXPORT_SYMBOL(inet_dev_addr_type);
 
@@ -309,7 +317,9 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
 	bool dev_match;
 
 	fl4.flowi4_oif = 0;
-	fl4.flowi4_iif = oif ? : LOOPBACK_IFINDEX;
+	fl4.flowi4_iif = vrf_master_dev_idx(dev);
+	if (!fl4.flowi4_iif)
+		fl4.flowi4_iif = oif ? : LOOPBACK_IFINDEX;
 	fl4.daddr = src;
 	fl4.saddr = dst;
 	fl4.flowi4_tos = tos;
@@ -761,6 +771,7 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 static void fib_magic(int cmd, int type, __be32 dst, int dst_len, struct in_ifaddr *ifa)
 {
 	struct net *net = dev_net(ifa->ifa_dev->dev);
+	int tb_id = vrf_dev_table(ifa->ifa_dev->dev);
 	struct fib_table *tb;
 	struct fib_config cfg = {
 		.fc_protocol = RTPROT_KERNEL,
@@ -775,11 +786,10 @@ static void fib_magic(int cmd, int type, __be32 dst, int dst_len, struct in_ifad
 		},
 	};
 
-	if (type == RTN_UNICAST)
-		tb = fib_new_table(net, RT_TABLE_MAIN);
-	else
-		tb = fib_new_table(net, RT_TABLE_LOCAL);
+	if (!tb_id)
+		tb_id = (type == RTN_UNICAST) ? RT_TABLE_MAIN : RT_TABLE_LOCAL;
 
+	tb = fib_new_table(net, tb_id);
 	if (!tb)
 		return;
 
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 3bfccd83551c..3c3e2006ce72 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -760,6 +760,23 @@ __be32 fib_info_update_nh_saddr(struct net *net, struct fib_nh *nh)
 	return nh->nh_saddr;
 }
 
+static bool fib_valid_prefsrc(struct fib_config *cfg, __be32 fib_prefsrc)
+{
+	if (cfg->fc_type != RTN_LOCAL || !cfg->fc_dst ||
+	    fib_prefsrc != cfg->fc_dst) {
+		int tb_id = cfg->fc_table;
+
+		if (tb_id == RT_TABLE_MAIN)
+			tb_id = RT_TABLE_LOCAL;
+
+		if (inet_addr_type_table(cfg->fc_nlinfo.nl_net,
+					 fib_prefsrc, tb_id) != RTN_LOCAL) {
+			return false;
+		}
+	}
+	return true;
+}
+
 struct fib_info *fib_create_info(struct fib_config *cfg)
 {
 	int err;
@@ -940,11 +957,9 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
 			fi->fib_flags |= RTNH_F_LINKDOWN;
 	}
 
-	if (fi->fib_prefsrc) {
-		if (cfg->fc_type != RTN_LOCAL || !cfg->fc_dst ||
-		    fi->fib_prefsrc != cfg->fc_dst)
-			if (inet_addr_type(net, fi->fib_prefsrc) != RTN_LOCAL)
-				goto err_inval;
+
+	if (fi->fib_prefsrc && !fib_valid_prefsrc(cfg, fi->fib_prefsrc)) {
+		goto err_inval;
 	}
 
 	change_nexthops(fi) {
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index ac2d828c6daa..7da901c56e35 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1421,8 +1421,11 @@ int fib_table_lookup(struct fib_table *tb, const struct flowi4 *flp,
 			    nh->nh_flags & RTNH_F_LINKDOWN &&
 			    !(fib_flags & FIB_LOOKUP_IGNORE_LINKSTATE))
 				continue;
-			if (flp->flowi4_oif && flp->flowi4_oif != nh->nh_oif)
-				continue;
+			if (!(flp->flowi4_flags & FLOWI_FLAG_VRFSRC)) {
+				if (flp->flowi4_oif &&
+				    flp->flowi4_oif != nh->nh_oif)
+					continue;
+			}
 
 			if (!(fib_flags & FIB_LOOKUP_NOREF))
 				atomic_inc(&fi->fib_clntref);
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index f5203fba6236..115d3c1c548f 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -96,6 +96,7 @@
 #include <net/xfrm.h>
 #include <net/inet_common.h>
 #include <net/ip_fib.h>
+#include <net/vrf.h>
 
 /*
  *	Build xmit assembly blocks
@@ -425,6 +426,7 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
 	fl4.flowi4_mark = mark;
 	fl4.flowi4_tos = RT_TOS(ip_hdr(skb)->tos);
 	fl4.flowi4_proto = IPPROTO_ICMP;
+	fl4.flowi4_oif = vrf_master_dev_idx(skb->dev) ? : skb->dev->ifindex;
 	security_skb_classify_flow(skb, flowi4_to_flowi(&fl4));
 	rt = ip_route_output_key(net, &fl4);
 	if (IS_ERR(rt))
@@ -458,6 +460,8 @@ static struct rtable *icmp_route_lookup(struct net *net,
 	fl4->flowi4_proto = IPPROTO_ICMP;
 	fl4->fl4_icmp_type = type;
 	fl4->fl4_icmp_code = code;
+	fl4->flowi4_oif = vrf_master_dev_idx(skb_in->dev) ? : skb_in->dev->ifindex;
+
 	security_skb_classify_flow(skb_in, flowi4_to_flowi(fl4));
 	rt = __ip_route_output_key(net, fl4);
 	if (IS_ERR(rt))
diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index 05ff44b758df..685fada659f5 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -44,6 +44,7 @@
 #include <net/route.h>
 #include <net/inet_common.h>
 #include <net/checksum.h>
+#include <net/vrf.h>
 
 #if IS_ENABLED(CONFIG_IPV6)
 #include <linux/in6.h>
@@ -174,7 +175,7 @@ static struct sock *ping_lookup(struct net *net, struct sk_buff *skb, u16 ident)
 	struct sock *sk = NULL;
 	struct inet_sock *isk;
 	struct hlist_nulls_node *hnode;
-	int dif = skb->dev->ifindex;
+	int dif = vrf_master_dev_idx(skb->dev) ? : skb->dev->ifindex;
 
 	if (skb->protocol == htons(ETH_P_IP)) {
 		pr_debug("try to find: num = %d, daddr = %pI4, dif = %d\n",
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 561cd4b8fc6e..95ef2834533d 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -72,6 +72,7 @@
 #include <net/inet_common.h>
 #include <net/checksum.h>
 #include <net/xfrm.h>
+#include <net/vrf.h>
 #include <linux/rtnetlink.h>
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
@@ -171,6 +172,7 @@ static int raw_v4_input(struct sk_buff *skb, const struct iphdr *iph, int hash)
 	struct hlist_head *head;
 	int delivered = 0;
 	struct net *net;
+	int idx = vrf_master_dev_idx(skb->dev) ? : skb->dev->ifindex;
 
 	read_lock(&raw_v4_hashinfo.lock);
 	head = &raw_v4_hashinfo.ht[hash];
@@ -179,8 +181,7 @@ static int raw_v4_input(struct sk_buff *skb, const struct iphdr *iph, int hash)
 
 	net = dev_net(skb->dev);
 	sk = __raw_v4_lookup(net, __sk_head(head), iph->protocol,
-			     iph->saddr, iph->daddr,
-			     skb->dev->ifindex);
+			     iph->saddr, iph->daddr, idx);
 
 	while (sk) {
 		delivered = 1;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index d0362a2de3d3..c66fdeb3a101 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -109,6 +109,7 @@
 #include <linux/kmemleak.h>
 #endif
 #include <net/secure_seq.h>
+#include <net/vrf.h>
 
 #define RT_FL_TOS(oldflp4) \
 	((oldflp4)->flowi4_tos & (IPTOS_RT_MASK | RTO_ONLINK))
@@ -1710,7 +1711,7 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	 *	Now we are ready to route packet.
 	 */
 	fl4.flowi4_oif = 0;
-	fl4.flowi4_iif = dev->ifindex;
+	fl4.flowi4_iif = vrf_master_dev_idx(dev) ? : dev->ifindex;
 	fl4.flowi4_mark = skb->mark;
 	fl4.flowi4_tos = tos;
 	fl4.flowi4_scope = RT_SCOPE_UNIVERSE;
@@ -2089,6 +2090,9 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
 		if (!dev_out)
 			goto out;
 
+		if (netif_is_vrf(dev_out))
+			fl4->flowi4_flags |= FLOWI_FLAG_VRFSRC;
+
 		/* RACE: Check return value of inet_select_addr instead. */
 		if (!(dev_out->flags & IFF_UP) || !__in_dev_get_rcu(dev_out)) {
 			rth = ERR_PTR(-ENETUNREACH);
@@ -2273,8 +2277,12 @@ struct dst_entry *ipv4_blackhole_route(struct net *net, struct dst_entry *dst_or
 struct rtable *ip_route_output_flow(struct net *net, struct flowi4 *flp4,
 				    struct sock *sk)
 {
-	struct rtable *rt = __ip_route_output_key(net, flp4);
+	struct rtable *rt;
+
+	if (netif_idx_is_vrf(net, flp4->flowi4_oif))
+		flp4->flowi4_flags |= FLOWI_FLAG_VRFSRC;
 
+	rt = __ip_route_output_key(net, flp4);
 	if (IS_ERR(rt))
 		return rt;
 
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index d70b1f603692..120f4406ba7a 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -348,7 +348,9 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 	treq->snt_synack	= tcp_opt.saw_tstamp ? tcp_opt.rcv_tsecr : 0;
 	treq->tfo_listener	= false;
 
-	ireq->ir_iif = sk->sk_bound_dev_if;
+	ireq->ir_iif = vrf_get_master_dev_idx(sock_net(sk), skb->skb_iif);
+	if (!ireq->ir_iif)
+		ireq->ir_iif = sk->sk_bound_dev_if;
 
 	/* We throwed the options of the initial SYN away, so we hope
 	 * the ACK carries the same options again (see RFC1122 4.2.3.8)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 684f095d196e..3018b4f795eb 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -72,6 +72,7 @@
 #include <net/dst.h>
 #include <net/tcp.h>
 #include <net/inet_common.h>
+#include <net/vrf.h>
 #include <linux/ipsec.h>
 #include <asm/unaligned.h>
 #include <linux/errqueue.h>
@@ -6138,7 +6139,10 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 	tcp_openreq_init(req, &tmp_opt, skb, sk);
 
 	/* Note: tcp_v6_init_req() might override ir_iif for link locals */
-	inet_rsk(req)->ir_iif = sk->sk_bound_dev_if;
+	inet_rsk(req)->ir_iif = vrf_get_master_dev_idx(sock_net(sk),
+						       skb->skb_iif);
+	if (!inet_rsk(req)->ir_iif)
+		inet_rsk(req)->ir_iif = sk->sk_bound_dev_if;
 
 	af_ops->init_req(req, sk, skb);
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index d7d4c2b79cf2..c03e28477275 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -682,6 +682,8 @@ static void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb)
 	 */
 	if (sk)
 		arg.bound_dev_if = sk->sk_bound_dev_if;
+	if (!arg.bound_dev_if)
+		arg.bound_dev_if = vrf_master_dev_idx(skb_dst(skb)->dev);
 
 	arg.tos = ip_hdr(skb)->tos;
 	ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
@@ -766,8 +768,8 @@ static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
 				      ip_hdr(skb)->saddr, /* XXX */
 				      arg.iov[0].iov_len, IPPROTO_TCP, 0);
 	arg.csumoffset = offsetof(struct tcphdr, check) / 2;
-	if (oif)
-		arg.bound_dev_if = oif;
+	arg.bound_dev_if = oif ? : vrf_master_dev_idx(skb_dst(skb)->dev);
+
 	arg.tos = tos;
 	ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
 			      skb, &TCP_SKB_CB(skb)->header.h4.opt,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 83aa604f9273..cf706d7898a2 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -501,6 +501,8 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
 	int score, badness, matches = 0, reuseport = 0;
 	u32 hash = 0;
 
+	dif = vrf_get_master_dev_idx(net, dif) ? : dif;
+
 	rcu_read_lock();
 	if (hslot->count > 10) {
 		hash2 = udp4_portaddr_hash(net, daddr, hnum);
-- 
2.3.2 (Apple Git-55)

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC net-next 5/6] net: Add sk_bind_dev_if to task_struct
  2015-07-06 15:03 [RFC net-next 0/6] Proposal for VRF-lite - v2 David Ahern
                   ` (3 preceding siblings ...)
  2015-07-06 15:03 ` [RFC net-next 4/6] net: Modifications to ipv4 stack for VRF devices David Ahern
@ 2015-07-06 15:03 ` David Ahern
  2015-07-06 15:03 ` [RFC net-next 6/6] net: Add chvrf command David Ahern
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: David Ahern @ 2015-07-06 15:03 UTC (permalink / raw)
  To: netdev
  Cc: shm, roopa, gospo, jtoppins, nikolay, ddutt, hannes,
	nicolas.dichtel, stephen, hadi, ebiederm, davem, David Ahern

Allow tasks to have a default device index for binding sockets. If set
the value is passed to all AF_INET/AF_INET6 sockets when they are created.

The task setting is passed parent to child on fork, but can be set or
changed after task creation using prctl (if task has CAP_NET_ADMIN
permissions). The setting for a socket can be retrieved using prctl().
This option allows an administrator to restrict a task to only send/receive
packets through the specified device. In the case of VRF devices this
option restricts tasks to a specific VRF.

Correlation of the device index to a specific VRF, ie.,
   ifindex --> VRF device --> VRF id
is left to userspace.

Example using VRF devices:
1. vrf1 is created and assigned to table 5
2. eth2 is enslaved to vrf1
3. eth2 is given the address 1.1.1.1/24

$ ip route ls table 5
prohibit default
1.1.1.0/24 dev eth2  scope link
local 1.1.1.1 dev eth2  proto kernel  scope host  src 1.1.1.1

With out setting a VRF context ping, tcp and udp attempts fail. e.g,
$ ping 1.1.1.254
connect: Network is unreachable

After binding the task to the vrf device ping succeeds:
$ ./chvrf -v 1 ping -c1 1.1.1.254
PING 1.1.1.254 (1.1.1.254) 56(84) bytes of data.
64 bytes from 1.1.1.254: icmp_seq=1 ttl=64 time=2.32 ms
---
 include/linux/sched.h      |  3 +++
 include/uapi/linux/prctl.h |  4 ++++
 kernel/fork.c              |  2 ++
 kernel/sys.c               | 35 +++++++++++++++++++++++++++++++++++
 net/ipv4/af_inet.c         |  1 +
 net/ipv6/af_inet6.c        |  1 +
 6 files changed, 46 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6633e83e608a..0b6ab0e2ea57 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1543,6 +1543,9 @@ struct task_struct {
 	struct files_struct *files;
 /* namespaces */
 	struct nsproxy *nsproxy;
+/* network */
+	/* if set INET/INET6 sockets are bound to given dev index on create */
+	int sk_bind_dev_if;
 /* signal handlers */
 	struct signal_struct *signal;
 	struct sighand_struct *sighand;
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 31891d9535e2..1ef45195d146 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -190,4 +190,8 @@ struct prctl_mm_map {
 # define PR_FP_MODE_FR		(1 << 0)	/* 64b FP registers */
 # define PR_FP_MODE_FRE		(1 << 1)	/* 32b compatibility */
 
+/* get/set network interface sockets are bound to by default */
+#define PR_SET_SK_BIND_DEV_IF   47
+#define PR_GET_SK_BIND_DEV_IF   48
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 0bb88b555550..d2c7f32370ef 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -375,6 +375,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
 	tsk->splice_pipe = NULL;
 	tsk->task_frag.page = NULL;
 
+	tsk->sk_bind_dev_if = orig->sk_bind_dev_if;
+
 	account_kernel_stack(ti, 1);
 
 	return tsk;
diff --git a/kernel/sys.c b/kernel/sys.c
index 8571296b7ddb..7e56fb9dbf8e 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -52,6 +52,7 @@
 #include <linux/rcupdate.h>
 #include <linux/uidgid.h>
 #include <linux/cred.h>
+#include <linux/netdevice.h>
 
 #include <linux/kmsg_dump.h>
 /* Move somewhere else to avoid recompiling? */
@@ -2243,6 +2244,40 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_FP_MODE:
 		error = GET_FP_MODE(me);
 		break;
+#ifdef CONFIG_NET
+	case PR_SET_SK_BIND_DEV_IF:
+	{
+		struct net_device *dev;
+		int idx = (int) arg2;
+
+		if (!capable(CAP_NET_ADMIN))
+			return -EPERM;
+
+		if (idx) {
+			dev = dev_get_by_index(me->nsproxy->net_ns, idx);
+			if (!dev)
+				return -EINVAL;
+			dev_put(dev);
+		}
+		me->sk_bind_dev_if = idx;
+		break;
+	}
+	case PR_GET_SK_BIND_DEV_IF:
+	{
+		struct task_struct *tsk;
+		int sk_bind_dev_if = -EINVAL;
+
+		rcu_read_lock();
+		tsk = find_task_by_vpid(arg2);
+		if (tsk)
+			sk_bind_dev_if = tsk->sk_bind_dev_if;
+		rcu_read_unlock();
+		if (tsk != me && !capable(CAP_NET_ADMIN))
+			return -EPERM;
+		error = sk_bind_dev_if;
+		break;
+	}
+#endif
 	default:
 		error = -EINVAL;
 		break;
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 9532ee87151f..a3b24f14e378 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -350,6 +350,7 @@ static int inet_create(struct net *net, struct socket *sock, int protocol,
 	sk->sk_destruct	   = inet_sock_destruct;
 	sk->sk_protocol	   = protocol;
 	sk->sk_backlog_rcv = sk->sk_prot->backlog_rcv;
+	sk->sk_bound_dev_if = current->sk_bind_dev_if;
 
 	inet->uc_ttl	= -1;
 	inet->mc_loop	= 1;
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 7de52b65173f..165bc4d9f987 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -189,6 +189,7 @@ static int inet6_create(struct net *net, struct socket *sock, int protocol,
 	sk->sk_destruct		= inet_sock_destruct;
 	sk->sk_family		= PF_INET6;
 	sk->sk_protocol		= protocol;
+	sk->sk_bound_dev_if	= current->sk_bind_dev_if;
 
 	sk->sk_backlog_rcv	= answer->prot->backlog_rcv;
 
-- 
2.3.2 (Apple Git-55)

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC net-next 6/6] net: Add chvrf command
  2015-07-06 15:03 [RFC net-next 0/6] Proposal for VRF-lite - v2 David Ahern
                   ` (4 preceding siblings ...)
  2015-07-06 15:03 ` [RFC net-next 5/6] net: Add sk_bind_dev_if to task_struct David Ahern
@ 2015-07-06 15:03 ` David Ahern
  2015-07-06 15:03 ` [RFC PATCH] iproute2: Add support for VRF device David Ahern
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: David Ahern @ 2015-07-06 15:03 UTC (permalink / raw)
  To: netdev
  Cc: shm, roopa, gospo, jtoppins, nikolay, ddutt, hannes,
	nicolas.dichtel, stephen, hadi, ebiederm, davem, David Ahern

Example of how to use the default bind to interface option for tasks and
correlate with VRF devices.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 tools/net/Makefile |   6 +-
 tools/net/chvrf.c  | 225 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 229 insertions(+), 2 deletions(-)
 create mode 100644 tools/net/chvrf.c

diff --git a/tools/net/Makefile b/tools/net/Makefile
index ee577ea03ba5..c13f11f5637a 100644
--- a/tools/net/Makefile
+++ b/tools/net/Makefile
@@ -10,7 +10,7 @@ YACC = bison
 %.lex.c: %.l
 	$(LEX) -o $@ $<
 
-all : bpf_jit_disasm bpf_dbg bpf_asm
+all : bpf_jit_disasm bpf_dbg bpf_asm chvrf
 
 bpf_jit_disasm : CFLAGS = -Wall -O2 -DPACKAGE='bpf_jit_disasm'
 bpf_jit_disasm : LDLIBS = -lopcodes -lbfd -ldl
@@ -25,8 +25,10 @@ bpf_asm : LDLIBS =
 bpf_asm : bpf_asm.o bpf_exp.yacc.o bpf_exp.lex.o
 bpf_exp.lex.o : bpf_exp.yacc.c
 
+chvrf : CFLAGS = -Wall -O2
+
 clean :
-	rm -rf *.o bpf_jit_disasm bpf_dbg bpf_asm bpf_exp.yacc.* bpf_exp.lex.*
+	rm -rf *.o bpf_jit_disasm bpf_dbg bpf_asm bpf_exp.yacc.* bpf_exp.lex.* chvrf
 
 install :
 	install bpf_jit_disasm $(prefix)/bin/bpf_jit_disasm
diff --git a/tools/net/chvrf.c b/tools/net/chvrf.c
new file mode 100644
index 000000000000..71cc925fd101
--- /dev/null
+++ b/tools/net/chvrf.c
@@ -0,0 +1,225 @@
+/*
+ * chvrf.c - Example of how to use the default bind-to-device option for
+ *           tasks and correlate to VRFs via the VRF device.
+ *
+ * Copyright (c) 2015 Cumulus Networks
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+#include <sys/ioctl.h>
+#include <sys/prctl.h>
+#include <sys/socket.h>
+#include <signal.h>
+#include <string.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <netinet/in.h>
+#include <net/if.h> /* for struct ifreq  */
+#include <libgen.h>
+#include <errno.h>
+
+#ifndef PR_SET_SK_BIND_DEV_IF
+#define PR_SET_SK_BIND_DEV_IF   47
+#endif
+#ifndef PR_GET_SK_BIND_DEV_IF
+#define PR_GET_SK_BIND_DEV_IF   48
+#endif
+
+static int vrf_to_device(int vrf)
+{
+	struct ifreq ifdata;
+	int sd, rc;
+
+	memset(&ifdata, 0, sizeof(ifdata));
+	snprintf(ifdata.ifr_name, sizeof(ifdata.ifr_name) - 1, "vrf%d", vrf);
+
+	sd = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP);
+	if (sd < 0) {
+		perror("socket failed");
+		return -1;
+	}
+
+	/* Get the index for the specified interface */
+	rc = ioctl(sd, SIOCGIFINDEX, (char *)&ifdata);
+	close(sd);
+	if (rc != 0) {
+		perror("ioctl(SIOCGIFINDEX) failed");
+		return -1;
+	}
+
+	return ifdata.ifr_ifindex;
+}
+
+static int device_to_vrf(int idx)
+{
+	struct ifreq ifdata;
+	int sd, vrf, rc;
+
+	memset(&ifdata, 0, sizeof(ifdata));
+	ifdata.ifr_ifindex = idx;
+
+	sd = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP);
+	if (sd < 0) {
+		perror("socket failed");
+		return -1;
+	}
+
+	/* Get the index for the specified interface */
+	rc = ioctl(sd, SIOCGIFNAME, (char *)&ifdata);
+	close(sd);
+	if (rc != 0) {
+		perror("ioctl(SIOCGIFNAME) failed");
+		return -1;
+	}
+
+	if (sscanf(ifdata.ifr_name, "vrf%d", &vrf) != 1) {
+		fprintf(stderr, "Unexpected device name (%s)\n", ifdata.ifr_name);
+		vrf = -1;
+	}
+
+	return vrf;
+}
+
+static int set_vrf(int vrf)
+{
+	int idx;
+	long err;
+
+	/* convert vrf to device index */
+	idx = vrf_to_device(vrf);
+	if (idx < 0) {
+		fprintf(stderr, "Failed to get device index for vrf %d\n", vrf);
+		return -1;
+	}
+
+	/* set default device bind */
+	err = prctl(PR_SET_SK_BIND_DEV_IF, idx);
+	if (err < 0) {
+		fprintf(stderr, "prctl failed to device index: %d\n", errno);
+		return -1;
+	}
+
+	return 0;
+}
+
+/* get vrf context for given process id */
+static int get_vrf(pid_t pid)
+{
+	int vrf;
+	long err;
+
+	/* lookup device index pid is tied to */
+	err = prctl(PR_GET_SK_BIND_DEV_IF, pid);
+	if (err < 0) {
+		fprintf(stderr, "prctl failed: %d\n", errno);
+		return -1;
+	}
+
+	if (err == 0)
+		return 0;
+
+	/* convert device index to vrf id */
+	vrf = device_to_vrf((int)err);
+	if (vrf < 0) {
+		fprintf(stderr, "Failed to get device index for vrf %d\n", vrf);
+		return -1;
+	}
+
+	return vrf;
+}
+
+static int run_vrf(char **argv, int vrf)
+{
+	char *cmd;
+
+	if (set_vrf(vrf) != 0) {
+		fprintf(stderr, "Failed to set vrf context\n");
+		return 1;
+	}
+
+	cmd = strdup(argv[0]);
+	if (!cmd) {
+		fprintf(stderr, "Failed to set command\n");
+		return 1;
+	}
+	argv[0] = basename(argv[0]);
+	if (execvp(cmd, argv) < 0)
+		perror("Failed to exec command\n");
+
+	return 1;
+}
+
+static int show_vrf(pid_t pid)
+{
+	int vrf = get_vrf(pid);
+
+	switch (vrf) {
+	case -1:
+		fprintf(stderr, "Failed to get vrf context for pid %d\n", pid);
+		if (kill(pid, 0) < 0) {
+			if (errno == ESRCH)
+				fprintf(stderr, "No process with given pid\n");
+		}
+		break;
+	case 0:
+		printf("Process %d is not running in a VRF context\n", pid);
+		break;
+	default:
+		printf("Process %d is running in VRF %d\n", pid, vrf);
+	}
+	return vrf < 0 ? 1 : 0;
+}
+
+static void usage(char *_prog)
+{
+	const char *prog = basename(_prog);
+
+	fprintf(stderr, "usage:\n");
+	fprintf(stderr, "\nShow VRF context for given pid\n");
+	fprintf(stderr, "\t%s -p pid\n", prog);
+	fprintf(stderr, "\nRun command in given VRF context\n");
+	fprintf(stderr, "\t%s -v vrf <command>\n", prog);
+}
+
+int main(int argc, char *argv[])
+{
+	int rc;
+	pid_t pid = 0;
+	int vrf = 0;
+
+	extern char *optarg;
+	extern int optind;
+
+	while ((rc = getopt(argc, argv, "+:p:v:")) != -1) {
+		switch (rc) {
+		case 'p':
+			pid = atoi(optarg);
+			break;
+		case 'v':
+			vrf = atoi(optarg);
+			break;
+		default:
+			usage(argv[0]);
+			return 1;
+		}
+	}
+
+	if ((pid && vrf) || (!pid && !vrf)) {
+		usage(argv[0]);
+		return 1;
+	}
+
+	if (pid)
+		return show_vrf(pid);
+
+	if (optind == argc) {
+		usage(argv[0]);
+		return 1;
+	}
+
+	return run_vrf(&argv[optind], vrf);
+}
-- 
2.3.2 (Apple Git-55)

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC PATCH] iproute2: Add support for VRF device
  2015-07-06 15:03 [RFC net-next 0/6] Proposal for VRF-lite - v2 David Ahern
                   ` (5 preceding siblings ...)
  2015-07-06 15:03 ` [RFC net-next 6/6] net: Add chvrf command David Ahern
@ 2015-07-06 15:03 ` David Ahern
  2015-07-06 15:40 ` [RFC net-next 0/6] Proposal for VRF-lite - v2 Nicolas Dichtel
  2015-07-10  5:14 ` Scott Feldman
  8 siblings, 0 replies; 32+ messages in thread
From: David Ahern @ 2015-07-06 15:03 UTC (permalink / raw)
  To: netdev
  Cc: shm, roopa, gospo, jtoppins, nikolay, ddutt, hannes,
	nicolas.dichtel, stephen, hadi, ebiederm, davem, David Ahern

Allow user to create a vrf device and specify its table binding.
Based on the iplink_vlan implementation.

Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
 include/linux/if_link.h |  8 +++++
 ip/Makefile             |  2 +-
 ip/iplink.c             |  2 +-
 ip/iplink_vrf.c         | 88 +++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 98 insertions(+), 2 deletions(-)
 create mode 100644 ip/iplink_vrf.c

diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index 8df6a8466839..28872fbf6814 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -337,6 +337,14 @@ enum macvlan_macaddr_mode {
 
 #define MACVLAN_FLAG_NOPROMISC	1
 
+/* VRF section */
+enum {
+	IFLA_VRF_UNSPEC,
+	IFLA_VRF_TABLE,
+	__IFLA_VRF_MAX
+};
+
+#define IFLA_VRF_MAX (__IFLA_VRF_MAX - 1)
 /* IPVLAN section */
 enum {
 	IFLA_IPVLAN_UNSPEC,
diff --git a/ip/Makefile b/ip/Makefile
index 77653ecc5785..d8b38ac2e44b 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -7,7 +7,7 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o ipnetns.o \
     iplink_vxlan.o tcp_metrics.o iplink_ipoib.o ipnetconf.o link_ip6tnl.o \
     link_iptnl.o link_gre6.o iplink_bond.o iplink_bond_slave.o iplink_hsr.o \
     iplink_bridge.o iplink_bridge_slave.o ipfou.o iplink_ipvlan.o \
-    iplink_geneve.o
+    iplink_geneve.o iplink_vrf.o
 
 RTMONOBJ=rtmon.o
 
diff --git a/ip/iplink.c b/ip/iplink.c
index e296e6f611b8..892e8bc8808b 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -93,7 +93,7 @@ void iplink_usage(void)
 		fprintf(stderr, "TYPE := { vlan | veth | vcan | dummy | ifb | macvlan | macvtap |\n");
 		fprintf(stderr, "          bridge | bond | ipoib | ip6tnl | ipip | sit | vxlan |\n");
 		fprintf(stderr, "          gre | gretap | ip6gre | ip6gretap | vti | nlmon |\n");
-		fprintf(stderr, "          bond_slave | ipvlan | geneve }\n");
+		fprintf(stderr, "          bond_slave | ipvlan | geneve | vrf }\n");
 	}
 	exit(-1);
 }
diff --git a/ip/iplink_vrf.c b/ip/iplink_vrf.c
new file mode 100644
index 000000000000..8d66802cf940
--- /dev/null
+++ b/ip/iplink_vrf.c
@@ -0,0 +1,88 @@
+/* iplink_vrf.c	VRF device support
+ *
+ *              This program is free software; you can redistribute it and/or
+ *              modify it under the terms of the GNU General Public License
+ *              as published by the Free Software Foundation; either version
+ *              2 of the License, or (at your option) any later version.
+ *
+ * Authors:     Shrijeet Mukherjee <shm@cumulusnetworks.com>
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <linux/if_link.h>
+
+#include "rt_names.h"
+#include "utils.h"
+#include "ip_common.h"
+
+static void vrf_explain(FILE *f)
+{
+	fprintf(f, "Usage: ... vrf table TABLEID \n");
+}
+
+static void explain(void)
+{
+	vrf_explain(stderr);
+}
+
+static int table_arg(void)
+{
+	fprintf(stderr,"Error: argument of \"table\" must be 0-32767 and currently unused\n");
+	return -1;
+}
+
+static int vrf_parse_opt(struct link_util *lu, int argc, char **argv,
+			    struct nlmsghdr *n)
+{
+	while (argc > 0) {
+		if (matches(*argv, "table") == 0) {
+			__u32 table = 0;
+			NEXT_ARG();
+
+			table = atoi(*argv);
+			if (table < 0 || table > 32767)
+				return table_arg();
+			/* XXX need a table in-use check here */
+			fprintf(stderr, "adding table %d\n", table);
+			addattr32(n, 1024, IFLA_VRF_TABLE, table);
+		} else if (matches(*argv, "help") == 0) {
+			explain();
+			return -1;
+		} else {
+			fprintf(stderr, "vrf: unknown option \"%s\"?\n",
+				*argv);
+			explain();
+			return -1;
+		}
+		argc--, argv++;
+	}
+
+	return 0;
+}
+
+static void vrf_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
+{
+printf("vrf_print_opt ...\n");
+	if (!tb)
+		return;
+
+	if (tb[IFLA_VRF_TABLE])
+		fprintf(f, "table %u ", rta_getattr_u32(tb[IFLA_VRF_TABLE]));
+}
+
+static void vrf_print_help(struct link_util *lu, int argc, char **argv,
+			      FILE *f)
+{
+	vrf_explain(f);
+}
+
+struct link_util vrf_link_util = {
+	.id		= "vrf",
+	.maxattr	= IFLA_VRF_MAX,
+	.parse_opt	= vrf_parse_opt,
+	.print_opt	= vrf_print_opt,
+	.print_help	= vrf_print_help,
+};
-- 
2.3.2 (Apple Git-55)

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 0/6] Proposal for VRF-lite - v2
  2015-07-06 15:03 [RFC net-next 0/6] Proposal for VRF-lite - v2 David Ahern
                   ` (6 preceding siblings ...)
  2015-07-06 15:03 ` [RFC PATCH] iproute2: Add support for VRF device David Ahern
@ 2015-07-06 15:40 ` Nicolas Dichtel
  2015-07-06 17:53   ` Shrijeet Mukherjee
  2015-07-10  5:14 ` Scott Feldman
  8 siblings, 1 reply; 32+ messages in thread
From: Nicolas Dichtel @ 2015-07-06 15:40 UTC (permalink / raw)
  To: David Ahern, netdev
  Cc: shm, roopa, gospo, jtoppins, nikolay, ddutt, hannes, stephen,
	hadi, ebiederm, davem

Le 06/07/2015 17:03, David Ahern a écrit :
> In the context of internet scale routing a requirement that always
> comes up is the need to partition the available routing tables into
> disjoint routing planes. A specific use case is the multi-tenancy
> problem where each tenant has their own unique routing tables and in
> the very least need different default gateways.
>
> This is an attempt to build the ability to create virtual router
> domains aka VRF's (VRF-lite to be specific) in the linux packet
> forwarding stack. The main observation is that through the use of
> rules and socket binding to interfaces, all the facilities that we
> need are already present in the infrastructure. What is missing is a
> handle that identifies a routing domain and can be used to gather
> applicable rules/tables and uniqify neighbor selection. The scheme
> used needs to preserves the notions of ECMP, and general routing
> principles.
[snip]
>   drivers/net/vrf.c             | 486 ++++++++++++++++++++++++++++++++++++++++++
[snip]

I'm still opposed to name this 'vrf', see the v1 thread:
  - http://www.spinics.net/lists/netdev/msg332357.html
  - http://www.spinics.net/lists/netdev/msg332376.html

Shrijeet seemed to agree to rename it, is there a problem?


Regards,
Nicolas

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 3/6] net: Introduce VRF device driver - v2
  2015-07-06 15:03 ` [RFC net-next 3/6] net: Introduce VRF device driver - v2 David Ahern
@ 2015-07-06 15:42   ` Nicolas Dichtel
  2015-07-06 16:37   ` Nikolay Aleksandrov
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 32+ messages in thread
From: Nicolas Dichtel @ 2015-07-06 15:42 UTC (permalink / raw)
  To: David Ahern, netdev
  Cc: shm, roopa, gospo, jtoppins, nikolay, ddutt, hannes, stephen,
	hadi, ebiederm, davem

Le 06/07/2015 17:03, David Ahern a écrit :
> This driver borrows heavily from IPvlan and teaming drivers.
>
> Routing domains (VRF-lite) are created by instantiating a device
> and enslaving all routed interfaces that participate in the domain.
> As part of the enslavement, all local routes pointing to enslaved
> devices are re-pointed to the vrf device, thus forcing outgoing
> sockets to bind to the vrf to function.
>
> Standard FIB rules can then bind the VRF device to tables and regular
> fib rule processing is followed.
>
> Routed traffic through the box, is fwded by using the VRF device as
> the IIF and following the IIF rule to a table which is mated with
> the VRF.
>
> Locally originated traffic is directed at the VRF device using
> SO_BINDTODEVICE or cmsg headers. This in turn drops the packet into
> the xmit function of the vrf driver, which then completes the ip lookup
> and output.
>
> This solution is completely orthogonal to namespaces and allow the L3
> equivalent of vlans to exist allowing the routing space to be
> partitioned.
>
> Example:
>
>     Create vrf 1:
>       ip link add vrf1 type vrf table 5
>       ip rule add iif vrf1 table 5
>       ip rule add oif vrf1 table 5
>       ip route add table 5 prohibit default
>       ip link set vrf1 up
>
>     Add interface to vrf 1:
>       ip link set eth1 master vrf1
>
> Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com>
> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
>
> v2:
> - addressed comments from first RFC
> - significant changes to improve simplicity of implementation
History should be put after the '---'.

> ---
ie here.

>   drivers/net/Kconfig  |   7 +
>   drivers/net/Makefile |   1 +
>   drivers/net/vrf.c    | 486 +++++++++++++++++++++++++++++++++++++++++++++++++++
>   include/net/vrf.h    |  71 ++++++++
>   4 files changed, 565 insertions(+)
>   create mode 100644 drivers/net/vrf.c
>   create mode 100644 include/net/vrf.h

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 3/6] net: Introduce VRF device driver - v2
  2015-07-06 15:03 ` [RFC net-next 3/6] net: Introduce VRF device driver - v2 David Ahern
  2015-07-06 15:42   ` Nicolas Dichtel
@ 2015-07-06 16:37   ` Nikolay Aleksandrov
  2015-07-06 16:46     ` David Ahern
  2015-07-08  9:27   ` Nicolas Dichtel
  2015-07-08 18:34   ` Sowmini Varadhan
  3 siblings, 1 reply; 32+ messages in thread
From: Nikolay Aleksandrov @ 2015-07-06 16:37 UTC (permalink / raw)
  To: David Ahern, netdev
  Cc: shm, roopa, gospo, jtoppins, ddutt, hannes, nicolas.dichtel,
	stephen, hadi, ebiederm, davem

On 07/06/2015 05:03 PM, David Ahern wrote:
> This driver borrows heavily from IPvlan and teaming drivers.
> 
> Routing domains (VRF-lite) are created by instantiating a device
> and enslaving all routed interfaces that participate in the domain.
> As part of the enslavement, all local routes pointing to enslaved
> devices are re-pointed to the vrf device, thus forcing outgoing
> sockets to bind to the vrf to function.
> 
> Standard FIB rules can then bind the VRF device to tables and regular
> fib rule processing is followed.
> 
> Routed traffic through the box, is fwded by using the VRF device as
> the IIF and following the IIF rule to a table which is mated with
> the VRF.
> 
> Locally originated traffic is directed at the VRF device using
> SO_BINDTODEVICE or cmsg headers. This in turn drops the packet into
> the xmit function of the vrf driver, which then completes the ip lookup
> and output.
> 
> This solution is completely orthogonal to namespaces and allow the L3
> equivalent of vlans to exist allowing the routing space to be
> partitioned.
> 
> Example:
> 
>    Create vrf 1:
>      ip link add vrf1 type vrf table 5
>      ip rule add iif vrf1 table 5
>      ip rule add oif vrf1 table 5
>      ip route add table 5 prohibit default
>      ip link set vrf1 up
> 
>    Add interface to vrf 1:
>      ip link set eth1 master vrf1
> 
> Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com>
> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
> 
> v2:
> - addressed comments from first RFC
> - significant changes to improve simplicity of implementation
> ---
>  drivers/net/Kconfig  |   7 +
>  drivers/net/Makefile |   1 +
>  drivers/net/vrf.c    | 486 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  include/net/vrf.h    |  71 ++++++++
>  4 files changed, 565 insertions(+)
>  create mode 100644 drivers/net/vrf.c
>  create mode 100644 include/net/vrf.h
> 
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index 019fceffc9e5..b040aa233408 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -283,6 +283,13 @@ config NLMON
>  	  diagnostics, etc. This is mostly intended for developers or support
>  	  to debug netlink issues. If unsure, say N.
>  
> +config NET_VRF
> +	tristate "Virtual Routing and Forwarding (Lite)"
> +	depends on IP_MULTIPLE_TABLES && IPV6_MULTIPLE_TABLES
> +	---help---
> +          This option enables the support for mapping interfaces into VRF's. The
> +          support enables VRF devices
> +
>  endif # NET_CORE
>  
>  config SUNGEM_PHY
> diff --git a/drivers/net/Makefile b/drivers/net/Makefile
> index c12cb22478a7..ca16dd689b36 100644
> --- a/drivers/net/Makefile
> +++ b/drivers/net/Makefile
> @@ -25,6 +25,7 @@ obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
>  obj-$(CONFIG_VXLAN) += vxlan.o
>  obj-$(CONFIG_GENEVE) += geneve.o
>  obj-$(CONFIG_NLMON) += nlmon.o
> +obj-$(CONFIG_NET_VRF) += vrf.o
>  
>  #
>  # Networking Drivers
> diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
> new file mode 100644
> index 000000000000..b9f9ae68388d
> --- /dev/null
> +++ b/drivers/net/vrf.c
> @@ -0,0 +1,487 @@
> +/*
> + * vrf.c: device driver to encapsulate a VRF space
> + *
> + * Copyright (c) 2015 Cumulus Networks
> + *
> + * Based on dummy, team and ipvlan drivers
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/kernel.h>
> +#include <linux/netdevice.h>
> +#include <linux/etherdevice.h>
> +#include <linux/ip.h>
> +#include <linux/init.h>
> +#include <linux/moduleparam.h>
> +#include <linux/rtnetlink.h>
> +#include <net/rtnetlink.h>
> +#include <net/arp.h>
> +#include <linux/u64_stats_sync.h>
> +#include <linux/hashtable.h>
> +
> +#include <linux/inetdevice.h>
> +#include <net/ip.h>
> +#include <net/ip_fib.h>
> +#include <net/ip6_route.h>
> +#include <net/rtnetlink.h>
> +#include <net/route.h>
> +#include <net/addrconf.h>
> +#include <net/vrf.h>
> +
> +#define DRV_NAME	"vrf"
> +#define DRV_VERSION	"1.0"
> +
> +#define vrf_is_slave(dev)   ((dev)->flags & IFF_SLAVE)
> +#define vrf_is_master(dev)  ((dev)->flags & IFF_MASTER)
> +
> +#define vrf_master_get_rcu(dev) \
> +	((struct net_device *)rcu_dereference(dev->rx_handler_data))
> +
> +struct pcpu_dstats {
> +	u64			tx_pkts;
> +	u64			tx_bytes;
> +	u64			tx_drps;
> +	u64			rx_pkts;
> +	u64			rx_bytes;
> +	struct u64_stats_sync	syncp;
> +};
> +
> +struct slave {
> +	struct list_head	list;
> +	struct net_device	*dev;
> +	long			priority;
> +};
> +
> +struct slave_queue {
> +	spinlock_t		lock; /* lock for slave insert/delete */
> +	struct list_head	all_slaves;
> +	int			num_slaves;
> +	struct net_device	*master_dev;
> +};
> +
> +struct net_vrf {
> +	struct slave_queue	queue;
> +	struct fib_table        *tb;
> +	u32                     tb_id;
> +};
> +
> +static int is_ip_rx_frame(struct sk_buff *skb)
> +{
> +	switch (skb->protocol) {
> +	case htons(ETH_P_IP):
> +	case htons(ETH_P_IPV6):
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +/* note: already called with rcu_read_lock */
> +static rx_handler_result_t vrf_handle_frame(struct sk_buff **pskb)
> +{
> +	struct sk_buff *skb = *pskb;
> +
> +	if (is_ip_rx_frame(skb)) {
> +		struct net_device *dev = vrf_master_get_rcu(skb->dev);
> +		struct pcpu_dstats *dstats = this_cpu_ptr(dev->dstats);
> +
> +		u64_stats_update_begin(&dstats->syncp);
> +		dstats->rx_pkts++;
> +		dstats->rx_bytes += skb->len;
> +		u64_stats_update_end(&dstats->syncp);
> +	}
> +	return RX_HANDLER_PASS;
> +}
> +
> +static struct rtnl_link_stats64 *vrf_get_stats64(
> +	struct net_device *dev, struct rtnl_link_stats64 *stats)
> +{
> +	int i;
> +
> +	for_each_possible_cpu(i) {
> +		const struct pcpu_dstats *dstats;
> +		u64 tbytes, tpkts, tdrops, rbytes, rpkts;
> +		unsigned int start;
> +
> +		dstats = per_cpu_ptr(dev->dstats, i);
> +		do {
> +			start = u64_stats_fetch_begin_irq(&dstats->syncp);
> +			tbytes = dstats->tx_bytes;
> +			tpkts = dstats->tx_pkts;
> +			tdrops = dstats->tx_drps;
> +			rbytes = dstats->rx_bytes;
> +			rpkts = dstats->rx_pkts;
> +		} while (u64_stats_fetch_retry_irq(&dstats->syncp, start));
> +		stats->tx_bytes += tbytes;
> +		stats->tx_packets += tpkts;
> +		stats->tx_dropped += tdrops;
> +		stats->rx_bytes += rbytes;
> +		stats->rx_packets += rpkts;
> +	}
> +	return stats;
> +}
> +
> +static netdev_tx_t vrf_xmit(struct sk_buff *skb, struct net_device *dev)
> +{
> +	return NET_XMIT_DROP;
> +}
> +
> +/**************************** device handling ********************/
> +
> +/* queue->lock must be held */
> +static struct slave *__vrf_find_slave_dev(struct slave_queue *queue,
> +					  struct net_device *dev)
> +{
> +	struct list_head *this, *head;
> +
> +	head = &queue->all_slaves;
> +	list_for_each(this, head) {
> +		struct slave *slave = list_entry(this, struct slave, list);
> +
> +		if (slave->dev == dev)
> +			return slave;
> +	}
> +
> +	return NULL;
> +}
> +
> +static void __vrf_kill_slave(struct slave_queue *queue, struct slave *slave)
> +{
> +	list_del(&slave->list);
> +	queue->num_slaves--;
> +	slave->dev->flags &= ~IFF_SLAVE;
> +	netdev_rx_handler_unregister(slave->dev);
> +	kfree(slave->dev->vrf_ptr);
> +	slave->dev->vrf_ptr = NULL;
> +	dev_put(slave->dev);
> +	kfree(slave);
> +}
> +
> +/* queue->lock must be held */
> +static void __vrf_insert_slave(struct slave_queue *queue, struct slave *slave,
> +			       struct net_device *master)
> +{
> +	struct slave *duplicate_slave = NULL;
> +
> +	duplicate_slave = __vrf_find_slave_dev(queue, slave->dev);
> +	if (duplicate_slave)
> +		__vrf_kill_slave(queue, duplicate_slave);
> +
> +	dev_hold(slave->dev);
> +	list_add(&slave->list, &queue->all_slaves);
> +	queue->num_slaves++;
> +}
> +
> +/* netlink lock is assumed here */
> +static int vrf_add_slave(struct net_device *dev,
> +			 struct net_device *port_dev)
> +{
> +	if (!dev || !port_dev || dev_net(dev) != dev_net(port_dev))
> +		return -ENODEV;
> +
> +	if (!vrf_is_master(port_dev) && !vrf_is_slave(port_dev)) {
> +		struct slave *s = kzalloc(sizeof(*s), GFP_KERNEL);
> +		struct net_vrf *vrf = netdev_priv(dev);
> +		struct slave_queue *queue = &vrf->queue;
> +		bool is_running = netif_running(port_dev);
> +		unsigned int flags = port_dev->flags;
> +		int ret;
> +
> +		if (!s)
> +			return -ENOMEM;
> +
> +		s->dev = port_dev;
> +
> +		spin_lock_bh(&queue->lock);
> +		__vrf_insert_slave(queue, s, dev);
> +		spin_unlock_bh(&queue->lock);
> +
> +		port_dev->vrf_ptr = kmalloc(sizeof(*port_dev->vrf_ptr),
> +					    GFP_KERNEL);
> +		if (!port_dev->vrf_ptr)
> +			return -ENOMEM;
                        ^^^^^^^^^
I believe you'll have a slave in the list with inconsistent state which could
even lead to null ptr derefernce if vrf_ptr is used, also __vrf_insert_slave
does dev_hold so the dev refcnt will be incorrect as well.

> +
> +		port_dev->vrf_ptr->ifindex = dev->ifindex;
> +		port_dev->vrf_ptr->tb_id = vrf->tb_id;
> +
> +		/* register the packet handler for slave ports */
> +		ret = netdev_rx_handler_register(port_dev, vrf_handle_frame,
> +						 (void *)dev);
> +		if (ret) {
> +			netdev_err(port_dev,
> +				   "Device %s failed to register rx_handler\n",
> +				   port_dev->name);
> +			kfree(port_dev->vrf_ptr);
> +			kfree(s);
> +			return ret;
                        ^^^^^^^^^^
The slave is being freed while on the list here, device's refcnt will be wrong etc.

> +		}
> +
> +		if (is_running) {
> +			ret = dev_change_flags(port_dev, flags & ~IFF_UP);
> +			if (ret < 0)
> +				goto out_fail;
> +		}
> +
> +		ret = netdev_master_upper_dev_link(port_dev, dev);
> +		if (ret < 0)
> +			goto out_fail;
> +
> +		if (is_running) {
> +			ret = dev_change_flags(port_dev, flags);
> +			if (ret < 0)
> +				goto out_fail;
> +		}
> +
> +		port_dev->flags |= IFF_SLAVE;
> +
> +		return 0;
> +
> +out_fail:
> +		spin_lock_bh(&queue->lock);
> +		__vrf_kill_slave(queue, s);
> +		spin_unlock_bh(&queue->lock);

__vrf_kill_slave() doesn't do upper device unlink and the device can be linked
if we fail in the dev_change_flags above.

> +
> +		return ret;
> +	}
> +
> +	return -EINVAL;
> +}
^^^^
In my opinion the structure of the above function should change to something more
straightforward with proper exit labels and cleanup upon failure, also a level of
indentation can be avoided.

> +
> +static int vrf_del_slave(struct net_device *dev,
> +			 struct net_device *port_dev)
> +{
> +	struct net_vrf *vrf = netdev_priv(dev);
> +	struct slave_queue *queue = &vrf->queue;
> +	struct slave *slave = __vrf_find_slave_dev(queue, port_dev);
> +	bool is_running = netif_running(port_dev);
> +	unsigned int flags = port_dev->flags;
> +	int ret = 0;

ret seems unused/unchecked in this function

> +
> +	if (!slave)
> +		return -EINVAL;
> +
> +	if (is_running)
> +		ret = dev_change_flags(port_dev, flags & ~IFF_UP);
> +
> +	spin_lock_bh(&queue->lock);
> +	__vrf_kill_slave(queue, slave);
> +	spin_unlock_bh(&queue->lock);
> +
> +	netdev_upper_dev_unlink(port_dev, dev);
> +
> +	if (is_running)
> +		ret = dev_change_flags(port_dev, flags);
> +
> +	return 0;
> +}
> +
> +static int vrf_dev_init(struct net_device *dev)
> +{
> +	struct net_vrf *vrf = netdev_priv(dev);
> +
> +	spin_lock_init(&vrf->queue.lock);
> +	INIT_LIST_HEAD(&vrf->queue.all_slaves);
> +	vrf->queue.master_dev = dev;
> +
> +	dev->dstats = netdev_alloc_pcpu_stats(struct pcpu_dstats);
> +	dev->flags  =  IFF_MASTER | IFF_NOARP;
> +	if (!dev->dstats)
> +		return -ENOMEM;
        ^^^^^
nit: I'd suggest moving the check after the allocation

> +
> +	return 0;
> +}
> +
> +static void vrf_dev_uninit(struct net_device *dev)
> +{
> +	free_percpu(dev->dstats);
> +}
> +
> +static int vrf_dev_close(struct net_device *dev)
> +{
> +	struct net_vrf *vrf = netdev_priv(dev);
> +	struct slave_queue *queue = &vrf->queue;
> +	struct list_head *this, *head;
> +
> +	head = &queue->all_slaves;
> +	list_for_each(this, head) {
> +		struct slave *slave = list_entry(this, struct slave, list);
> +
> +		slave->dev->vrf_ptr->ifindex = 0;
> +		slave->dev->vrf_ptr->tb_id = 0;
> +	}
> +
> +	if (dev->flags & IFF_MASTER)
> +		dev->flags &= ~IFF_UP;
> +
> +	return 0;
> +}
> +
> +static int vrf_dev_open(struct net_device *dev)
> +{
> +	struct net_vrf *vrf = netdev_priv(dev);
> +	struct slave_queue *queue = &vrf->queue;
> +	struct list_head *this, *head;
> +
> +	head = &queue->all_slaves;
> +	list_for_each(this, head) {
> +		struct slave *slave = list_entry(this, struct slave, list);
> +
> +		slave->dev->vrf_ptr->ifindex = dev->ifindex;
> +		slave->dev->vrf_ptr->tb_id = vrf->tb_id;
> +	}
> +
> +	if (dev->flags & IFF_MASTER)
> +		dev->flags |= IFF_UP;
> +
> +	if (!vrf->tb)
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +static const struct net_device_ops vrf_netdev_ops = {
> +	.ndo_init		= vrf_dev_init,
> +	.ndo_uninit		= vrf_dev_uninit,
> +	.ndo_open		= vrf_dev_open,
> +	.ndo_stop               = vrf_dev_close,
> +	.ndo_start_xmit		= vrf_xmit,
> +	.ndo_get_stats64	= vrf_get_stats64,
> +	.ndo_add_slave          = vrf_add_slave,
> +	.ndo_del_slave          = vrf_del_slave,
> +};
> +
> +static void vrf_get_drvinfo(struct net_device *dev,
> +			    struct ethtool_drvinfo *info)
> +{
> +	strlcpy(info->driver, DRV_NAME, sizeof(info->driver));
> +	strlcpy(info->version, DRV_VERSION, sizeof(info->version));
> +}
> +
> +static const struct ethtool_ops vrf_ethtool_ops = {
> +	.get_drvinfo            = vrf_get_drvinfo,
> +};
> +
> +static void vrf_setup(struct net_device *dev)
> +{
> +	ether_setup(dev);
> +
> +	/* Initialize the device structure. */
> +	dev->netdev_ops = &vrf_netdev_ops;
> +	dev->ethtool_ops = &vrf_ethtool_ops;
> +	dev->destructor = free_netdev;
> +
> +	/* Fill in device structure with ethernet-generic values. */
> +	dev->tx_queue_len = 0;
> +	eth_hw_addr_random(dev);
> +}
> +
> +static int vrf_validate(struct nlattr *tb[], struct nlattr *data[])
> +{
> +	if (tb[IFLA_ADDRESS]) {
> +		if (nla_len(tb[IFLA_ADDRESS]) != ETH_ALEN)
> +			return -EINVAL;
> +		if (!is_valid_ether_addr(nla_data(tb[IFLA_ADDRESS])))
> +			return -EADDRNOTAVAIL;
> +	}
> +	return 0;
> +}
> +
> +static int vrf_newlink(struct net *src_net, struct net_device *dev,
> +		       struct nlattr *tb[], struct nlattr *data[])
> +{
> +	struct net_vrf *vrf = netdev_priv(dev);
> +	int err;
> +
> +	if (!data || !data[IFLA_VRF_TABLE])
> +		return -EINVAL;
> +
> +	vrf->tb_id = nla_get_u32(data[IFLA_VRF_TABLE]);
> +
> +	/* reserve a table for this VRF device */
> +	err = -ERANGE;
> +	vrf->tb = fib_new_table(dev_net(dev), vrf->tb_id);
> +	if (!vrf->tb)
> +		goto out_fail;
> +
> +	dev->priv_flags |= IFF_VRF_MASTER;
> +
> +	err = -ENOMEM;
> +	dev->vrf_ptr = kmalloc(sizeof(*dev->vrf_ptr), GFP_KERNEL);
> +	if (!dev->vrf_ptr)
> +		goto out_fail;
> +
> +	dev->vrf_ptr->ifindex = dev->ifindex;
> +	dev->vrf_ptr->tb_id = vrf->tb_id;
> +
> +	err = register_netdevice(dev);
> +	if (err < 0)
> +		goto out_fail;
> +
> +	return 0;
> +
> +out_fail:
> +	kfree(dev->vrf_ptr);
> +	free_netdev(dev);
> +	return err;
> +}
> +
> +static void vrf_dellink(struct net_device *dev, struct list_head *head)
> +{
> +	struct net_vrf *vrf = netdev_priv(dev);
> +
> +	kfree(dev->vrf_ptr);
> +	fib_free_table(vrf->tb);
> +}
> +
> +static size_t vrf_nl_getsize(const struct net_device *dev)
> +{
> +	return nla_total_size(sizeof(u32));  /* IFLA_VRF_TABLE */
> +}
> +
> +static int vrf_fillinfo(struct sk_buff *skb,
> +			const struct net_device *dev)
> +{
> +	struct net_vrf *vrf = netdev_priv(dev);
> +
> +	return nla_put_u32(skb, IFLA_VRF_TABLE, vrf->tb_id);
> +}
> +
> +static const struct nla_policy vrf_nl_policy[IFLA_VRF_MAX + 1] = {
> +	[IFLA_VRF_TABLE] = { .type = NLA_U32 },
> +};
> +
> +static struct rtnl_link_ops vrf_link_ops __read_mostly = {
> +	.kind		= DRV_NAME,
> +	.priv_size      = sizeof(struct net_vrf),
> +
> +	.get_size       = vrf_nl_getsize,
> +	.policy         = vrf_nl_policy,
> +	.validate	= vrf_validate,
> +	.fill_info      = vrf_fillinfo,
> +
> +	.newlink        = vrf_newlink,
> +	.dellink        = vrf_dellink,
> +	.setup		= vrf_setup,
> +	.maxtype        = IFLA_VRF_MAX,
> +};
> +
> +static int __init vrf_init_module(void)
> +{
> +	return rtnl_link_register(&vrf_link_ops);
> +}
> +
> +static void __exit vrf_cleanup_module(void)
> +{
> +	rtnl_link_unregister(&vrf_link_ops);
> +}
> +
> +module_init(vrf_init_module);
> +module_exit(vrf_cleanup_module);
> +MODULE_AUTHOR("Shrijeet Mukherjee, David Ahern");
> +MODULE_LICENSE("GPL");
> +MODULE_ALIAS_RTNL_LINK(DRV_NAME);
> +MODULE_VERSION(DRV_VERSION);
> diff --git a/include/net/vrf.h b/include/net/vrf.h
> new file mode 100644
> index 000000000000..3ab1e332c781
> --- /dev/null
> +++ b/include/net/vrf.h
> @@ -0,0 +1,71 @@
> +/*
> + * include/net/net_vrf.h - adds vrf dev structure definitions
> + * Copyright (c) 2015 Cumulus Networks
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + */
> +
> +#ifndef __LINUX_NET_VRF_H
> +#define __LINUX_NET_VRF_H
> +
> +struct net_vrf_dev {
> +	int                     ifindex; /* ifindex of master dev */
> +	u32                     tb_id;   /* table id for VRF */
> +};
> +
> +#if IS_ENABLED(CONFIG_NET_VRF)
> +static inline int vrf_master_dev_idx(const struct net_device *dev)
> +{
> +	int idx = 0;
> +
> +	if (dev && dev->vrf_ptr)
> +		idx = dev->vrf_ptr->ifindex;
> +
> +	return idx;
> +}
> +
> +static inline int vrf_get_master_dev_idx(struct net *net, int idx)
> +{
> +	int rc = 0;
> +
> +	if (idx) {
> +		struct net_device *dev = dev_get_by_index(net, idx);
> +
> +		if (dev) {
> +			rc = vrf_master_dev_idx(dev);
> +			dev_put(dev);
> +		}
> +	}
> +	return rc;
> +}
> +
> +static inline int vrf_dev_table(const struct net_device *dev)
> +{
> +	int tb_id = 0;
> +
> +	if (dev && dev->vrf_ptr)
> +		tb_id = dev->vrf_ptr->tb_id;
> +
> +	return tb_id;
> +}
> +#else
> +static inline int vrf_master_dev_idx(const struct net_device *dev)
> +{
> +	return 0;
> +}
> +
> +static inline int vrf_get_master_dev_idx(struct net *net, int idx)
> +{
> +	return 0;
> +}
> +
> +static inline int vrf_dev_table(const struct net_device *dev)
> +{
> +	return 0;
> +}
> +#endif
> +
> +#endif /* __LINUX_NET_VRF_H */
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 3/6] net: Introduce VRF device driver - v2
  2015-07-06 16:37   ` Nikolay Aleksandrov
@ 2015-07-06 16:46     ` David Ahern
  0 siblings, 0 replies; 32+ messages in thread
From: David Ahern @ 2015-07-06 16:46 UTC (permalink / raw)
  To: Nikolay Aleksandrov, netdev
  Cc: shm, roopa, gospo, jtoppins, ddutt, hannes, nicolas.dichtel,
	stephen, hadi, ebiederm, davem

On 7/6/15 10:37 AM, Nikolay Aleksandrov wrote:
>> +static int vrf_add_slave(struct net_device *dev,
>> +			 struct net_device *port_dev)
>> +{
>> +	if (!dev || !port_dev || dev_net(dev) != dev_net(port_dev))
>> +		return -ENODEV;
>> +
>> +	if (!vrf_is_master(port_dev) && !vrf_is_slave(port_dev)) {
>> +		struct slave *s = kzalloc(sizeof(*s), GFP_KERNEL);
>> +		struct net_vrf *vrf = netdev_priv(dev);
>> +		struct slave_queue *queue = &vrf->queue;
>> +		bool is_running = netif_running(port_dev);
>> +		unsigned int flags = port_dev->flags;
>> +		int ret;
>> +
>> +		if (!s)
>> +			return -ENOMEM;
>> +
>> +		s->dev = port_dev;
>> +
>> +		spin_lock_bh(&queue->lock);
>> +		__vrf_insert_slave(queue, s, dev);
>> +		spin_unlock_bh(&queue->lock);
>> +
>> +		port_dev->vrf_ptr = kmalloc(sizeof(*port_dev->vrf_ptr),
>> +					    GFP_KERNEL);
>> +		if (!port_dev->vrf_ptr)
>> +			return -ENOMEM;
>                          ^^^^^^^^^
> I believe you'll have a slave in the list with inconsistent state which could
> even lead to null ptr derefernce if vrf_ptr is used, also __vrf_insert_slave
> does dev_hold so the dev refcnt will be incorrect as well.

Right. Good catch, will fix.

>
>> +
>> +		port_dev->vrf_ptr->ifindex = dev->ifindex;
>> +		port_dev->vrf_ptr->tb_id = vrf->tb_id;
>> +
>> +		/* register the packet handler for slave ports */
>> +		ret = netdev_rx_handler_register(port_dev, vrf_handle_frame,
>> +						 (void *)dev);
>> +		if (ret) {
>> +			netdev_err(port_dev,
>> +				   "Device %s failed to register rx_handler\n",
>> +				   port_dev->name);
>> +			kfree(port_dev->vrf_ptr);
>> +			kfree(s);
>> +			return ret;
>                          ^^^^^^^^^^
> The slave is being freed while on the list here, device's refcnt will be wrong etc.

ack. Will fix.

>
>> +		}
>> +
>> +		if (is_running) {
>> +			ret = dev_change_flags(port_dev, flags & ~IFF_UP);
>> +			if (ret < 0)
>> +				goto out_fail;
>> +		}
>> +
>> +		ret = netdev_master_upper_dev_link(port_dev, dev);
>> +		if (ret < 0)
>> +			goto out_fail;
>> +
>> +		if (is_running) {
>> +			ret = dev_change_flags(port_dev, flags);
>> +			if (ret < 0)
>> +				goto out_fail;
>> +		}
>> +
>> +		port_dev->flags |= IFF_SLAVE;
>> +
>> +		return 0;
>> +
>> +out_fail:
>> +		spin_lock_bh(&queue->lock);
>> +		__vrf_kill_slave(queue, s);
>> +		spin_unlock_bh(&queue->lock);
>
> __vrf_kill_slave() doesn't do upper device unlink and the device can be linked
> if we fail in the dev_change_flags above.

will fix.

>
>> +
>> +		return ret;
>> +	}
>> +
>> +	return -EINVAL;
>> +}
> ^^^^
> In my opinion the structure of the above function should change to something more
> straightforward with proper exit labels and cleanup upon failure, also a level of
> indentation can be avoided.

Sure. The indentation comes after the pointer checks so locals can be 
intialized when declared. Will work on the clean up/simplification for 
next rev.

>
>> +
>> +static int vrf_del_slave(struct net_device *dev,
>> +			 struct net_device *port_dev)
>> +{
>> +	struct net_vrf *vrf = netdev_priv(dev);
>> +	struct slave_queue *queue = &vrf->queue;
>> +	struct slave *slave = __vrf_find_slave_dev(queue, port_dev);
>> +	bool is_running = netif_running(port_dev);
>> +	unsigned int flags = port_dev->flags;
>> +	int ret = 0;
>
> ret seems unused/unchecked in this function

It is used but not checked. I struggled with what to do on the error 
path. Do we want netdev_err() on a failure?

>
>> +
>> +	if (!slave)
>> +		return -EINVAL;
>> +
>> +	if (is_running)
>> +		ret = dev_change_flags(port_dev, flags & ~IFF_UP);
>> +
>> +	spin_lock_bh(&queue->lock);
>> +	__vrf_kill_slave(queue, slave);
>> +	spin_unlock_bh(&queue->lock);
>> +
>> +	netdev_upper_dev_unlink(port_dev, dev);
>> +
>> +	if (is_running)
>> +		ret = dev_change_flags(port_dev, flags);
>> +
>> +	return 0;
>> +}
>> +
>> +static int vrf_dev_init(struct net_device *dev)
>> +{
>> +	struct net_vrf *vrf = netdev_priv(dev);
>> +
>> +	spin_lock_init(&vrf->queue.lock);
>> +	INIT_LIST_HEAD(&vrf->queue.all_slaves);
>> +	vrf->queue.master_dev = dev;
>> +
>> +	dev->dstats = netdev_alloc_pcpu_stats(struct pcpu_dstats);
>> +	dev->flags  =  IFF_MASTER | IFF_NOARP;
>> +	if (!dev->dstats)
>> +		return -ENOMEM;
>          ^^^^^
> nit: I'd suggest moving the check after the allocation

agreed.

David

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 0/6] Proposal for VRF-lite - v2
  2015-07-06 15:40 ` [RFC net-next 0/6] Proposal for VRF-lite - v2 Nicolas Dichtel
@ 2015-07-06 17:53   ` Shrijeet Mukherjee
  2015-07-08  9:30     ` Nicolas Dichtel
  0 siblings, 1 reply; 32+ messages in thread
From: Shrijeet Mukherjee @ 2015-07-06 17:53 UTC (permalink / raw)
  To: Nicolas Dichtel
  Cc: David Ahern, Netdev, Roopa Prabhu, Andy Gospodarek, Jon Toppins,
	Nikolay Aleksandrov, Dinesh Dutt, Hannes Frederic Sowa,
	Stephen Hemminger, Jamal Hadi Salim, Eric W. Biederman,
	David Miller

No no problem,

Just trying to get the functional aspects worked out. the global
search replace will be easy.

Was hoping to see some more responses on the naming suggestions here
from the community. If there is not disagreement we can spin patches
with MRF as the name.


On Mon, Jul 6, 2015 at 8:40 AM, Nicolas Dichtel
<nicolas.dichtel@6wind.com> wrote:
> Le 06/07/2015 17:03, David Ahern a écrit :
>>
>> In the context of internet scale routing a requirement that always
>> comes up is the need to partition the available routing tables into
>> disjoint routing planes. A specific use case is the multi-tenancy
>> problem where each tenant has their own unique routing tables and in
>> the very least need different default gateways.
>>
>> This is an attempt to build the ability to create virtual router
>> domains aka VRF's (VRF-lite to be specific) in the linux packet
>> forwarding stack. The main observation is that through the use of
>> rules and socket binding to interfaces, all the facilities that we
>> need are already present in the infrastructure. What is missing is a
>> handle that identifies a routing domain and can be used to gather
>> applicable rules/tables and uniqify neighbor selection. The scheme
>> used needs to preserves the notions of ECMP, and general routing
>> principles.
>
> [snip]
>>
>>   drivers/net/vrf.c             | 486
>> ++++++++++++++++++++++++++++++++++++++++++
>
> [snip]
>
> I'm still opposed to name this 'vrf', see the v1 thread:
>  - http://www.spinics.net/lists/netdev/msg332357.html
>  - http://www.spinics.net/lists/netdev/msg332376.html
>
> Shrijeet seemed to agree to rename it, is there a problem?
>
>
> Regards,
> Nicolas

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 2/6] net: Preparation for vrf device
  2015-07-06 15:03 ` [RFC net-next 2/6] net: Preparation for vrf device David Ahern
@ 2015-07-08  8:37   ` Nicolas Dichtel
  2015-07-08  8:40     ` Nicolas Dichtel
  2015-07-08 16:10     ` David Ahern
  0 siblings, 2 replies; 32+ messages in thread
From: Nicolas Dichtel @ 2015-07-08  8:37 UTC (permalink / raw)
  To: David Ahern, netdev
  Cc: shm, roopa, gospo, jtoppins, nikolay, ddutt, hannes, stephen,
	hadi, ebiederm, davem

Le 06/07/2015 17:03, David Ahern a écrit :
> Add a VRF_MASTER flag for interfaces and helper functions for determining
> if a device is a VRF_MASTER.
>
> Also, add link attribute for passing VRF_TABLE id.
>
> Both are used in the following patch that adds a VRF device driver.
>
> Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com>
> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
> ---
>   include/linux/netdevice.h    | 21 +++++++++++++++++++++
>   include/uapi/linux/if_link.h |  9 +++++++++
>   2 files changed, 30 insertions(+)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index e20979dfd6a9..142cb64f139c 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1274,6 +1274,7 @@ enum netdev_priv_flags {
>   	IFF_XMIT_DST_RELEASE_PERM	= 1<<22,
>   	IFF_IPVLAN_MASTER		= 1<<23,
>   	IFF_IPVLAN_SLAVE		= 1<<24,
> +	IFF_VRF_MASTER		        = 1<<25,
nit: use tab instead space here  ^^^^^^^

Also, why calling this '_MASTER', is there a notion of SLAVE?

>   };
>
>   #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
> @@ -1301,6 +1302,7 @@ enum netdev_priv_flags {
>   #define IFF_XMIT_DST_RELEASE_PERM	IFF_XMIT_DST_RELEASE_PERM
>   #define IFF_IPVLAN_MASTER		IFF_IPVLAN_MASTER
>   #define IFF_IPVLAN_SLAVE		IFF_IPVLAN_SLAVE
> +#define IFF_VRF_MASTER		        IFF_VRF_MASTER
nit: use tab instead space here          ^^^^^^^

>
>   /**
>    *	struct net_device - The DEVICE structure.
> @@ -1417,6 +1419,7 @@ enum netdev_priv_flags {
>    *	@dn_ptr:	DECnet specific data
>    *	@ip6_ptr:	IPv6 specific data
>    *	@ax25_ptr:	AX.25 specific data
> + *	@vrf_ptr:	VRF specific data
>    *	@ieee80211_ptr:	IEEE 802.11 specific data, assign before registering
>    *
>    *	@last_rx:	Time of last Rx
> @@ -1629,6 +1632,7 @@ struct net_device {
>   	struct dn_dev __rcu     *dn_ptr;
>   	struct inet6_dev __rcu	*ip6_ptr;
>   	void			*ax25_ptr;
> +	struct net_vrf_dev      *vrf_ptr;
nit: use tab here         ^^^^^^

>   	struct wireless_dev	*ieee80211_ptr;
>   	struct wpan_dev		*ieee802154_ptr;
>   #if IS_ENABLED(CONFIG_MPLS_ROUTING)
> @@ -3781,6 +3785,23 @@ static inline bool netif_supports_nofcs(struct net_device *dev)
>   	return dev->priv_flags & IFF_SUPP_NOFCS;
>   }
>
> +static inline bool netif_is_vrf(struct net_device *dev)
> +{
> +	return dev->priv_flags & IFF_VRF_MASTER;
> +}
> +
> +static inline bool netif_idx_is_vrf(struct net *net, int idx)
Usally, the index of an interface is named 'ifindex', it eases code reading
to keep the same name. Something like:
netif_index_is_vrf(struct net *net, int ifindex)

> +{
> +	struct net_device *dev = dev_get_by_index(net, idx);
> +	bool rc = false;
> +
> +	if (dev) {
> +		rc = netif_is_vrf(dev);
> +		dev_put(dev);
> +	}
> +	return rc;
> +}
> +
[snip]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 2/6] net: Preparation for vrf device
  2015-07-08  8:37   ` Nicolas Dichtel
@ 2015-07-08  8:40     ` Nicolas Dichtel
  2015-07-08 16:10     ` David Ahern
  1 sibling, 0 replies; 32+ messages in thread
From: Nicolas Dichtel @ 2015-07-08  8:40 UTC (permalink / raw)
  To: David Ahern, netdev
  Cc: shm, roopa, gospo, jtoppins, nikolay, ddutt, hannes, stephen,
	hadi, ebiederm, davem

Le 08/07/2015 10:37, Nicolas Dichtel a écrit :
[snip]
> Also, why calling this '_MASTER', is there a notion of SLAVE?
Ok, just got it in the next patch ;-)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 3/6] net: Introduce VRF device driver - v2
  2015-07-06 15:03 ` [RFC net-next 3/6] net: Introduce VRF device driver - v2 David Ahern
  2015-07-06 15:42   ` Nicolas Dichtel
  2015-07-06 16:37   ` Nikolay Aleksandrov
@ 2015-07-08  9:27   ` Nicolas Dichtel
  2015-07-08 16:38     ` David Ahern
  2015-07-08 18:34   ` Sowmini Varadhan
  3 siblings, 1 reply; 32+ messages in thread
From: Nicolas Dichtel @ 2015-07-08  9:27 UTC (permalink / raw)
  To: David Ahern, netdev
  Cc: shm, roopa, gospo, jtoppins, nikolay, ddutt, hannes, stephen,
	hadi, ebiederm, davem

Le 06/07/2015 17:03, David Ahern a écrit :
> This driver borrows heavily from IPvlan and teaming drivers.
>
> Routing domains (VRF-lite) are created by instantiating a device
> and enslaving all routed interfaces that participate in the domain.
> As part of the enslavement, all local routes pointing to enslaved
> devices are re-pointed to the vrf device, thus forcing outgoing
> sockets to bind to the vrf to function.
>
> Standard FIB rules can then bind the VRF device to tables and regular
> fib rule processing is followed.
>
> Routed traffic through the box, is fwded by using the VRF device as
> the IIF and following the IIF rule to a table which is mated with
> the VRF.
>
> Locally originated traffic is directed at the VRF device using
> SO_BINDTODEVICE or cmsg headers. This in turn drops the packet into
> the xmit function of the vrf driver, which then completes the ip lookup
> and output.
>
> This solution is completely orthogonal to namespaces and allow the L3
> equivalent of vlans to exist allowing the routing space to be
> partitioned.
>
> Example:
>
>     Create vrf 1:
>       ip link add vrf1 type vrf table 5
>       ip rule add iif vrf1 table 5
>       ip rule add oif vrf1 table 5
>       ip route add table 5 prohibit default
>       ip link set vrf1 up
>
>     Add interface to vrf 1:
>       ip link set eth1 master vrf1
>
> Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com>
> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
>
> v2:
> - addressed comments from first RFC
> - significant changes to improve simplicity of implementation
> ---
>   drivers/net/Kconfig  |   7 +
>   drivers/net/Makefile |   1 +
>   drivers/net/vrf.c    | 486 +++++++++++++++++++++++++++++++++++++++++++++++++++
>   include/net/vrf.h    |  71 ++++++++
>   4 files changed, 565 insertions(+)
>   create mode 100644 drivers/net/vrf.c
>   create mode 100644 include/net/vrf.h
>
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index 019fceffc9e5..b040aa233408 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -283,6 +283,13 @@ config NLMON
>   	  diagnostics, etc. This is mostly intended for developers or support
>   	  to debug netlink issues. If unsure, say N.
>
> +config NET_VRF
> +	tristate "Virtual Routing and Forwarding (Lite)"
> +	depends on IP_MULTIPLE_TABLES && IPV6_MULTIPLE_TABLES
> +	---help---
> +          This option enables the support for mapping interfaces into VRF's. The
> +          support enables VRF devices
    ^^^^^^^^
nit: use tab instead space for the last two lines.

> +
>   endif # NET_CORE
>
>   config SUNGEM_PHY
> diff --git a/drivers/net/Makefile b/drivers/net/Makefile
> index c12cb22478a7..ca16dd689b36 100644
> --- a/drivers/net/Makefile
> +++ b/drivers/net/Makefile
> @@ -25,6 +25,7 @@ obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
>   obj-$(CONFIG_VXLAN) += vxlan.o
>   obj-$(CONFIG_GENEVE) += geneve.o
>   obj-$(CONFIG_NLMON) += nlmon.o
> +obj-$(CONFIG_NET_VRF) += vrf.o
>
>   #
>   # Networking Drivers
> diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
> new file mode 100644
> index 000000000000..b9f9ae68388d
> --- /dev/null
> +++ b/drivers/net/vrf.c
> @@ -0,0 +1,487 @@
> +/*
> + * vrf.c: device driver to encapsulate a VRF space
> + *
> + * Copyright (c) 2015 Cumulus Networks
> + *
> + * Based on dummy, team and ipvlan drivers
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/kernel.h>
> +#include <linux/netdevice.h>
> +#include <linux/etherdevice.h>
> +#include <linux/ip.h>
> +#include <linux/init.h>
> +#include <linux/moduleparam.h>
> +#include <linux/rtnetlink.h>
> +#include <net/rtnetlink.h>
> +#include <net/arp.h>
> +#include <linux/u64_stats_sync.h>
> +#include <linux/hashtable.h>
> +
> +#include <linux/inetdevice.h>
> +#include <net/ip.h>
> +#include <net/ip_fib.h>
> +#include <net/ip6_route.h>
> +#include <net/rtnetlink.h>
> +#include <net/route.h>
> +#include <net/addrconf.h>
> +#include <net/vrf.h>
> +
> +#define DRV_NAME	"vrf"
> +#define DRV_VERSION	"1.0"
> +
> +#define vrf_is_slave(dev)   ((dev)->flags & IFF_SLAVE)
> +#define vrf_is_master(dev)  ((dev)->flags & IFF_MASTER)
> +
> +#define vrf_master_get_rcu(dev) \
> +	((struct net_device *)rcu_dereference(dev->rx_handler_data))
> +
> +struct pcpu_dstats {
> +	u64			tx_pkts;
> +	u64			tx_bytes;
> +	u64			tx_drps;
> +	u64			rx_pkts;
> +	u64			rx_bytes;
> +	struct u64_stats_sync	syncp;
> +};
Why not using 'struct pcpu_sw_netstats' (dev->tstats), like most virtual
interfaces? Not sure that it's really needed to have tx_drps per cpu (and
I don't see anyone using it into this patch).

> +
> +struct slave {
> +	struct list_head	list;
> +	struct net_device	*dev;
> +	long			priority;
> +};
> +
> +struct slave_queue {
> +	spinlock_t		lock; /* lock for slave insert/delete */
> +	struct list_head	all_slaves;
> +	int			num_slaves;
> +	struct net_device	*master_dev;
> +};
> +
> +struct net_vrf {
> +	struct slave_queue	queue;
> +	struct fib_table        *tb;
nit: tabs instead spaces^^^^^^^^

> +	u32                     tb_id;
tabs       ^^^^^^^^^^^^^^^^^^^^^

> +};
> +
> +static int is_ip_rx_frame(struct sk_buff *skb)
bool instead of int?

> +{
> +	switch (skb->protocol) {
> +	case htons(ETH_P_IP):
> +	case htons(ETH_P_IPV6):
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +/* note: already called with rcu_read_lock */
> +static rx_handler_result_t vrf_handle_frame(struct sk_buff **pskb)
> +{
> +	struct sk_buff *skb = *pskb;
> +
> +	if (is_ip_rx_frame(skb)) {
> +		struct net_device *dev = vrf_master_get_rcu(skb->dev);
> +		struct pcpu_dstats *dstats = this_cpu_ptr(dev->dstats);
> +
> +		u64_stats_update_begin(&dstats->syncp);
> +		dstats->rx_pkts++;
> +		dstats->rx_bytes += skb->len;
> +		u64_stats_update_end(&dstats->syncp);
> +	}
> +	return RX_HANDLER_PASS;
> +}
> +
> +static struct rtnl_link_stats64 *vrf_get_stats64(
> +	struct net_device *dev, struct rtnl_link_stats64 *stats)
> +{
> +	int i;
> +
> +	for_each_possible_cpu(i) {
> +		const struct pcpu_dstats *dstats;
> +		u64 tbytes, tpkts, tdrops, rbytes, rpkts;
> +		unsigned int start;
> +
> +		dstats = per_cpu_ptr(dev->dstats, i);
> +		do {
> +			start = u64_stats_fetch_begin_irq(&dstats->syncp);
> +			tbytes = dstats->tx_bytes;
> +			tpkts = dstats->tx_pkts;
> +			tdrops = dstats->tx_drps;
> +			rbytes = dstats->rx_bytes;
> +			rpkts = dstats->rx_pkts;
> +		} while (u64_stats_fetch_retry_irq(&dstats->syncp, start));
> +		stats->tx_bytes += tbytes;
> +		stats->tx_packets += tpkts;
> +		stats->tx_dropped += tdrops;
> +		stats->rx_bytes += rbytes;
> +		stats->rx_packets += rpkts;
> +	}
> +	return stats;
> +}
> +
> +static netdev_tx_t vrf_xmit(struct sk_buff *skb, struct net_device *dev)
> +{
> +	return NET_XMIT_DROP;
> +}
> +
> +/**************************** device handling ********************/
> +
> +/* queue->lock must be held */
> +static struct slave *__vrf_find_slave_dev(struct slave_queue *queue,
> +					  struct net_device *dev)
> +{
> +	struct list_head *this, *head;
> +
> +	head = &queue->all_slaves;
> +	list_for_each(this, head) {
> +		struct slave *slave = list_entry(this, struct slave, list);
> +
> +		if (slave->dev == dev)
> +			return slave;
> +	}
> +
> +	return NULL;
> +}
> +
> +static void __vrf_kill_slave(struct slave_queue *queue, struct slave *slave)
> +{
> +	list_del(&slave->list);
> +	queue->num_slaves--;
> +	slave->dev->flags &= ~IFF_SLAVE;
> +	netdev_rx_handler_unregister(slave->dev);
> +	kfree(slave->dev->vrf_ptr);
> +	slave->dev->vrf_ptr = NULL;
> +	dev_put(slave->dev);
> +	kfree(slave);
> +}
> +
> +/* queue->lock must be held */
> +static void __vrf_insert_slave(struct slave_queue *queue, struct slave *slave,
> +			       struct net_device *master)
> +{
> +	struct slave *duplicate_slave = NULL;
> +
> +	duplicate_slave = __vrf_find_slave_dev(queue, slave->dev);
> +	if (duplicate_slave)
> +		__vrf_kill_slave(queue, duplicate_slave);
I miss the point here. Why removing the slave if it is already there?

> +
> +	dev_hold(slave->dev);
> +	list_add(&slave->list, &queue->all_slaves);
> +	queue->num_slaves++;
> +}
> +
> +/* netlink lock is assumed here */
ASSERT_RTNL()?

> +static int vrf_add_slave(struct net_device *dev,
> +			 struct net_device *port_dev)
> +{
> +	if (!dev || !port_dev || dev_net(dev) != dev_net(port_dev))
> +		return -ENODEV;
> +
> +	if (!vrf_is_master(port_dev) && !vrf_is_slave(port_dev)) {
> +		struct slave *s = kzalloc(sizeof(*s), GFP_KERNEL);
> +		struct net_vrf *vrf = netdev_priv(dev);
> +		struct slave_queue *queue = &vrf->queue;
> +		bool is_running = netif_running(port_dev);
> +		unsigned int flags = port_dev->flags;
> +		int ret;
> +
> +		if (!s)
> +			return -ENOMEM;
> +
> +		s->dev = port_dev;
> +
> +		spin_lock_bh(&queue->lock);
> +		__vrf_insert_slave(queue, s, dev);
> +		spin_unlock_bh(&queue->lock);
> +
> +		port_dev->vrf_ptr = kmalloc(sizeof(*port_dev->vrf_ptr),
> +					    GFP_KERNEL);
> +		if (!port_dev->vrf_ptr)
> +			return -ENOMEM;
> +
> +		port_dev->vrf_ptr->ifindex = dev->ifindex;
> +		port_dev->vrf_ptr->tb_id = vrf->tb_id;
> +
> +		/* register the packet handler for slave ports */
> +		ret = netdev_rx_handler_register(port_dev, vrf_handle_frame,
> +						 (void *)dev);
So, it won't be possible to add a slave which already has a
macvlan or ipvlan (or team?) interface registered.

> +		if (ret) {
> +			netdev_err(port_dev,
> +				   "Device %s failed to register rx_handler\n",
> +				   port_dev->name);
> +			kfree(port_dev->vrf_ptr);
> +			kfree(s);
> +			return ret;
> +		}
> +
> +		if (is_running) {
> +			ret = dev_change_flags(port_dev, flags & ~IFF_UP);
> +			if (ret < 0)
> +				goto out_fail;
> +		}
> +
> +		ret = netdev_master_upper_dev_link(port_dev, dev);
> +		if (ret < 0)
> +			goto out_fail;
> +
> +		if (is_running) {
> +			ret = dev_change_flags(port_dev, flags);
> +			if (ret < 0)
> +				goto out_fail;
> +		}
> +
> +		port_dev->flags |= IFF_SLAVE;
> +
> +		return 0;
> +
> +out_fail:
> +		spin_lock_bh(&queue->lock);
> +		__vrf_kill_slave(queue, s);
> +		spin_unlock_bh(&queue->lock);
> +
> +		return ret;
> +	}
> +
> +	return -EINVAL;
> +}
> +
> +static int vrf_del_slave(struct net_device *dev,
> +			 struct net_device *port_dev)
> +{
> +	struct net_vrf *vrf = netdev_priv(dev);
> +	struct slave_queue *queue = &vrf->queue;
> +	struct slave *slave = __vrf_find_slave_dev(queue, port_dev);
> +	bool is_running = netif_running(port_dev);
> +	unsigned int flags = port_dev->flags;
> +	int ret = 0;
> +
> +	if (!slave)
> +		return -EINVAL;
> +
> +	if (is_running)
> +		ret = dev_change_flags(port_dev, flags & ~IFF_UP);
> +
> +	spin_lock_bh(&queue->lock);
> +	__vrf_kill_slave(queue, slave);
> +	spin_unlock_bh(&queue->lock);
> +
> +	netdev_upper_dev_unlink(port_dev, dev);
> +
> +	if (is_running)
> +		ret = dev_change_flags(port_dev, flags);
> +
> +	return 0;
> +}
> +
> +static int vrf_dev_init(struct net_device *dev)
> +{
> +	struct net_vrf *vrf = netdev_priv(dev);
> +
> +	spin_lock_init(&vrf->queue.lock);
> +	INIT_LIST_HEAD(&vrf->queue.all_slaves);
> +	vrf->queue.master_dev = dev;
> +
> +	dev->dstats = netdev_alloc_pcpu_stats(struct pcpu_dstats);
> +	dev->flags  =  IFF_MASTER | IFF_NOARP;
> +	if (!dev->dstats)
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +
> +static void vrf_dev_uninit(struct net_device *dev)
> +{
> +	free_percpu(dev->dstats);
> +}
> +
> +static int vrf_dev_close(struct net_device *dev)
> +{
> +	struct net_vrf *vrf = netdev_priv(dev);
> +	struct slave_queue *queue = &vrf->queue;
> +	struct list_head *this, *head;
> +
> +	head = &queue->all_slaves;
> +	list_for_each(this, head) {
> +		struct slave *slave = list_entry(this, struct slave, list);
> +
> +		slave->dev->vrf_ptr->ifindex = 0;
> +		slave->dev->vrf_ptr->tb_id = 0;
> +	}
> +
> +	if (dev->flags & IFF_MASTER)
> +		dev->flags &= ~IFF_UP;
> +
> +	return 0;
> +}
> +
> +static int vrf_dev_open(struct net_device *dev)
> +{
> +	struct net_vrf *vrf = netdev_priv(dev);
> +	struct slave_queue *queue = &vrf->queue;
> +	struct list_head *this, *head;
> +
> +	head = &queue->all_slaves;
> +	list_for_each(this, head) {
> +		struct slave *slave = list_entry(this, struct slave, list);
> +
> +		slave->dev->vrf_ptr->ifindex = dev->ifindex;
> +		slave->dev->vrf_ptr->tb_id = vrf->tb_id;
> +	}
> +
> +	if (dev->flags & IFF_MASTER)
> +		dev->flags |= IFF_UP;
> +
> +	if (!vrf->tb)
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +static const struct net_device_ops vrf_netdev_ops = {
> +	.ndo_init		= vrf_dev_init,
> +	.ndo_uninit		= vrf_dev_uninit,
> +	.ndo_open		= vrf_dev_open,
> +	.ndo_stop               = vrf_dev_close,
nit: tabs        ^^^^^^^^^^^^^^^

> +	.ndo_start_xmit		= vrf_xmit,
> +	.ndo_get_stats64	= vrf_get_stats64,
> +	.ndo_add_slave          = vrf_add_slave,
nit: tabs             ^^^^^^^^^^

> +	.ndo_del_slave          = vrf_del_slave,
nit: tabs             ^^^^^^^^^^

> +};
> +
> +static void vrf_get_drvinfo(struct net_device *dev,
> +			    struct ethtool_drvinfo *info)
> +{
> +	strlcpy(info->driver, DRV_NAME, sizeof(info->driver));
> +	strlcpy(info->version, DRV_VERSION, sizeof(info->version));
> +}
> +
> +static const struct ethtool_ops vrf_ethtool_ops = {
> +	.get_drvinfo            = vrf_get_drvinfo,
nit: tabs           ^^^^^^^^^^^^

> +};
> +
> +static void vrf_setup(struct net_device *dev)
> +{
> +	ether_setup(dev);
> +
> +	/* Initialize the device structure. */
> +	dev->netdev_ops = &vrf_netdev_ops;
> +	dev->ethtool_ops = &vrf_ethtool_ops;
> +	dev->destructor = free_netdev;
> +
> +	/* Fill in device structure with ethernet-generic values. */
> +	dev->tx_queue_len = 0;
> +	eth_hw_addr_random(dev);
> +}
> +
> +static int vrf_validate(struct nlattr *tb[], struct nlattr *data[])
> +{
> +	if (tb[IFLA_ADDRESS]) {
> +		if (nla_len(tb[IFLA_ADDRESS]) != ETH_ALEN)
> +			return -EINVAL;
> +		if (!is_valid_ether_addr(nla_data(tb[IFLA_ADDRESS])))
> +			return -EADDRNOTAVAIL;
> +	}
> +	return 0;
> +}
> +
> +static int vrf_newlink(struct net *src_net, struct net_device *dev,
> +		       struct nlattr *tb[], struct nlattr *data[])
> +{
> +	struct net_vrf *vrf = netdev_priv(dev);
> +	int err;
> +
> +	if (!data || !data[IFLA_VRF_TABLE])
> +		return -EINVAL;
> +
> +	vrf->tb_id = nla_get_u32(data[IFLA_VRF_TABLE]);
> +
> +	/* reserve a table for this VRF device */
> +	err = -ERANGE;
> +	vrf->tb = fib_new_table(dev_net(dev), vrf->tb_id);
> +	if (!vrf->tb)
> +		goto out_fail;
> +
> +	dev->priv_flags |= IFF_VRF_MASTER;
> +
> +	err = -ENOMEM;
> +	dev->vrf_ptr = kmalloc(sizeof(*dev->vrf_ptr), GFP_KERNEL);
> +	if (!dev->vrf_ptr)
> +		goto out_fail;
> +
> +	dev->vrf_ptr->ifindex = dev->ifindex;
> +	dev->vrf_ptr->tb_id = vrf->tb_id;
> +
> +	err = register_netdevice(dev);
> +	if (err < 0)
> +		goto out_fail;
> +
> +	return 0;
> +
> +out_fail:
> +	kfree(dev->vrf_ptr);
> +	free_netdev(dev);
> +	return err;
> +}
> +
> +static void vrf_dellink(struct net_device *dev, struct list_head *head)
> +{
> +	struct net_vrf *vrf = netdev_priv(dev);
> +
> +	kfree(dev->vrf_ptr);
> +	fib_free_table(vrf->tb);
> +}
> +
> +static size_t vrf_nl_getsize(const struct net_device *dev)
> +{
> +	return nla_total_size(sizeof(u32));  /* IFLA_VRF_TABLE */
> +}
> +
> +static int vrf_fillinfo(struct sk_buff *skb,
> +			const struct net_device *dev)
> +{
> +	struct net_vrf *vrf = netdev_priv(dev);
> +
> +	return nla_put_u32(skb, IFLA_VRF_TABLE, vrf->tb_id);
> +}
> +
> +static const struct nla_policy vrf_nl_policy[IFLA_VRF_MAX + 1] = {
> +	[IFLA_VRF_TABLE] = { .type = NLA_U32 },
> +};
> +
> +static struct rtnl_link_ops vrf_link_ops __read_mostly = {
> +	.kind		= DRV_NAME,
> +	.priv_size      = sizeof(struct net_vrf),
nit: tabs         ^^^^^^
> +
> +	.get_size       = vrf_nl_getsize,
nit: tabs        ^^^^^^^
> +	.policy         = vrf_nl_policy,
nit: tabs      ^^^^^^^^^
> +	.validate	= vrf_validate,
> +	.fill_info      = vrf_fillinfo,
nit: tabs         ^^^^^^
> +
> +	.newlink        = vrf_newlink,
nit: tabs       ^^^^^^^^
> +	.dellink        = vrf_dellink,
nit: tabs       ^^^^^^^^
> +	.setup		= vrf_setup,
> +	.maxtype        = IFLA_VRF_MAX,
nit: tabs       ^^^^^^^^
> +};
> +
> +static int __init vrf_init_module(void)
> +{
> +	return rtnl_link_register(&vrf_link_ops);
> +}
> +
> +static void __exit vrf_cleanup_module(void)
> +{
> +	rtnl_link_unregister(&vrf_link_ops);
> +}
> +
> +module_init(vrf_init_module);
> +module_exit(vrf_cleanup_module);
> +MODULE_AUTHOR("Shrijeet Mukherjee, David Ahern");
> +MODULE_LICENSE("GPL");
> +MODULE_ALIAS_RTNL_LINK(DRV_NAME);
> +MODULE_VERSION(DRV_VERSION);
> diff --git a/include/net/vrf.h b/include/net/vrf.h
> new file mode 100644
> index 000000000000..3ab1e332c781
> --- /dev/null
> +++ b/include/net/vrf.h
> @@ -0,0 +1,71 @@
> +/*
> + * include/net/net_vrf.h - adds vrf dev structure definitions
> + * Copyright (c) 2015 Cumulus Networks
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + */
> +
> +#ifndef __LINUX_NET_VRF_H
> +#define __LINUX_NET_VRF_H
> +
> +struct net_vrf_dev {
> +	int                     ifindex; /* ifindex of master dev */
> +	u32                     tb_id;   /* table id for VRF */
> +};
> +
> +#if IS_ENABLED(CONFIG_NET_VRF)
> +static inline int vrf_master_dev_idx(const struct net_device *dev)
> +{
> +	int idx = 0;
> +
> +	if (dev && dev->vrf_ptr)
> +		idx = dev->vrf_ptr->ifindex;
> +
> +	return idx;
> +}
> +
> +static inline int vrf_get_master_dev_idx(struct net *net, int idx)
ifindex instead idx for the whole file?


Regards,
Nicolas

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 0/6] Proposal for VRF-lite - v2
  2015-07-06 17:53   ` Shrijeet Mukherjee
@ 2015-07-08  9:30     ` Nicolas Dichtel
  0 siblings, 0 replies; 32+ messages in thread
From: Nicolas Dichtel @ 2015-07-08  9:30 UTC (permalink / raw)
  To: Shrijeet Mukherjee
  Cc: David Ahern, Netdev, Roopa Prabhu, Andy Gospodarek, Jon Toppins,
	Nikolay Aleksandrov, Dinesh Dutt, Hannes Frederic Sowa,
	Stephen Hemminger, Jamal Hadi Salim, Eric W. Biederman,
	David Miller

Le 06/07/2015 19:53, Shrijeet Mukherjee a écrit :
> No no problem,
>
> Just trying to get the functional aspects worked out. the global
> search replace will be easy.
>
> Was hoping to see some more responses on the naming suggestions here
> from the community. If there is not disagreement we can spin patches
> with MRF as the name.
For me, it's ok.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 2/6] net: Preparation for vrf device
  2015-07-08  8:37   ` Nicolas Dichtel
  2015-07-08  8:40     ` Nicolas Dichtel
@ 2015-07-08 16:10     ` David Ahern
  1 sibling, 0 replies; 32+ messages in thread
From: David Ahern @ 2015-07-08 16:10 UTC (permalink / raw)
  To: nicolas.dichtel, netdev
  Cc: shm, roopa, gospo, jtoppins, nikolay, ddutt, hannes, stephen,
	hadi, ebiederm, davem


On 7/8/15 2:37 AM, Nicolas Dichtel wrote:
> Le 06/07/2015 17:03, David Ahern a écrit :
>> Add a VRF_MASTER flag for interfaces and helper functions for determining
>> if a device is a VRF_MASTER.
>>
>> Also, add link attribute for passing VRF_TABLE id.
>>
>> Both are used in the following patch that adds a VRF device driver.
>>
>> Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com>
>> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
>> ---
>>   include/linux/netdevice.h    | 21 +++++++++++++++++++++
>>   include/uapi/linux/if_link.h |  9 +++++++++
>>   2 files changed, 30 insertions(+)
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index e20979dfd6a9..142cb64f139c 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -1274,6 +1274,7 @@ enum netdev_priv_flags {
>>       IFF_XMIT_DST_RELEASE_PERM    = 1<<22,
>>       IFF_IPVLAN_MASTER        = 1<<23,
>>       IFF_IPVLAN_SLAVE        = 1<<24,
>> +    IFF_VRF_MASTER                = 1<<25,
> nit: use tab instead space here  ^^^^^^^
>
> Also, why calling this '_MASTER', is there a notion of SLAVE?
>
>>   };
>>
>>   #define IFF_802_1Q_VLAN            IFF_802_1Q_VLAN
>> @@ -1301,6 +1302,7 @@ enum netdev_priv_flags {
>>   #define IFF_XMIT_DST_RELEASE_PERM    IFF_XMIT_DST_RELEASE_PERM
>>   #define IFF_IPVLAN_MASTER        IFF_IPVLAN_MASTER
>>   #define IFF_IPVLAN_SLAVE        IFF_IPVLAN_SLAVE
>> +#define IFF_VRF_MASTER                IFF_VRF_MASTER
> nit: use tab instead space here          ^^^^^^^
>
>>
>>   /**
>>    *    struct net_device - The DEVICE structure.
>> @@ -1417,6 +1419,7 @@ enum netdev_priv_flags {
>>    *    @dn_ptr:    DECnet specific data
>>    *    @ip6_ptr:    IPv6 specific data
>>    *    @ax25_ptr:    AX.25 specific data
>> + *    @vrf_ptr:    VRF specific data
>>    *    @ieee80211_ptr:    IEEE 802.11 specific data, assign before
>> registering
>>    *
>>    *    @last_rx:    Time of last Rx
>> @@ -1629,6 +1632,7 @@ struct net_device {
>>       struct dn_dev __rcu     *dn_ptr;
>>       struct inet6_dev __rcu    *ip6_ptr;
>>       void            *ax25_ptr;
>> +    struct net_vrf_dev      *vrf_ptr;
> nit: use tab here         ^^^^^^
>
>>       struct wireless_dev    *ieee80211_ptr;
>>       struct wpan_dev        *ieee802154_ptr;
>>   #if IS_ENABLED(CONFIG_MPLS_ROUTING)
>> @@ -3781,6 +3785,23 @@ static inline bool netif_supports_nofcs(struct
>> net_device *dev)
>>       return dev->priv_flags & IFF_SUPP_NOFCS;
>>   }
>>
>> +static inline bool netif_is_vrf(struct net_device *dev)
>> +{
>> +    return dev->priv_flags & IFF_VRF_MASTER;
>> +}
>> +
>> +static inline bool netif_idx_is_vrf(struct net *net, int idx)
> Usally, the index of an interface is named 'ifindex', it eases code reading
> to keep the same name. Something like:
> netif_index_is_vrf(struct net *net, int ifindex)
>
>> +{
>> +    struct net_device *dev = dev_get_by_index(net, idx);
>> +    bool rc = false;
>> +
>> +    if (dev) {
>> +        rc = netif_is_vrf(dev);
>> +        dev_put(dev);
>> +    }
>> +    return rc;
>> +}
>> +
> [snip]

ack on all comments. Updated patch.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 3/6] net: Introduce VRF device driver - v2
  2015-07-08  9:27   ` Nicolas Dichtel
@ 2015-07-08 16:38     ` David Ahern
  0 siblings, 0 replies; 32+ messages in thread
From: David Ahern @ 2015-07-08 16:38 UTC (permalink / raw)
  To: nicolas.dichtel, netdev
  Cc: shm, roopa, gospo, jtoppins, nikolay, ddutt, hannes, stephen,
	hadi, ebiederm, davem

On 7/8/15 3:27 AM, Nicolas Dichtel wrote:
>> +
>> +struct pcpu_dstats {
>> +    u64            tx_pkts;
>> +    u64            tx_bytes;
>> +    u64            tx_drps;
>> +    u64            rx_pkts;
>> +    u64            rx_bytes;
>> +    struct u64_stats_sync    syncp;
>> +};
> Why not using 'struct pcpu_sw_netstats' (dev->tstats), like most virtual
> interfaces? Not sure that it's really needed to have tx_drps per cpu (and
> I don't see anyone using it into this patch).

Alex asked the same question on the first RFC. Shrijeet had opinions on 
why this version versus netdev's. Shrijeet?

-----8<-----

>> +/* queue->lock must be held */
>> +static void __vrf_insert_slave(struct slave_queue *queue, struct
>> slave *slave,
>> +                   struct net_device *master)
>> +{
>> +    struct slave *duplicate_slave = NULL;
>> +
>> +    duplicate_slave = __vrf_find_slave_dev(queue, slave->dev);
>> +    if (duplicate_slave)
>> +        __vrf_kill_slave(queue, duplicate_slave);
> I miss the point here. Why removing the slave if it is already there?

not sure. I'll investigate.

>
>> +
>> +    dev_hold(slave->dev);
>> +    list_add(&slave->list, &queue->all_slaves);
>> +    queue->num_slaves++;
>> +}
>> +
>> +/* netlink lock is assumed here */
> ASSERT_RTNL()?

done.

>
>> +static int vrf_add_slave(struct net_device *dev,
>> +             struct net_device *port_dev)
>> +{
>> +    if (!dev || !port_dev || dev_net(dev) != dev_net(port_dev))
>> +        return -ENODEV;
>> +
>> +    if (!vrf_is_master(port_dev) && !vrf_is_slave(port_dev)) {
>> +        struct slave *s = kzalloc(sizeof(*s), GFP_KERNEL);
>> +        struct net_vrf *vrf = netdev_priv(dev);
>> +        struct slave_queue *queue = &vrf->queue;
>> +        bool is_running = netif_running(port_dev);
>> +        unsigned int flags = port_dev->flags;
>> +        int ret;
>> +
>> +        if (!s)
>> +            return -ENOMEM;
>> +
>> +        s->dev = port_dev;
>> +
>> +        spin_lock_bh(&queue->lock);
>> +        __vrf_insert_slave(queue, s, dev);
>> +        spin_unlock_bh(&queue->lock);
>> +
>> +        port_dev->vrf_ptr = kmalloc(sizeof(*port_dev->vrf_ptr),
>> +                        GFP_KERNEL);
>> +        if (!port_dev->vrf_ptr)
>> +            return -ENOMEM;
>> +
>> +        port_dev->vrf_ptr->ifindex = dev->ifindex;
>> +        port_dev->vrf_ptr->tb_id = vrf->tb_id;
>> +
>> +        /* register the packet handler for slave ports */
>> +        ret = netdev_rx_handler_register(port_dev, vrf_handle_frame,
>> +                         (void *)dev);
> So, it won't be possible to add a slave which already has a
> macvlan or ipvlan (or team?) interface registered.
>

Shrijeet, thoughts?

-----8<-----

>> +
>> +static struct rtnl_link_ops vrf_link_ops __read_mostly = {
>> +    .kind        = DRV_NAME,
>> +    .priv_size      = sizeof(struct net_vrf),
> nit: tabs         ^^^^^^
>> +
>> +    .get_size       = vrf_nl_getsize,
> nit: tabs        ^^^^^^^
>> +    .policy         = vrf_nl_policy,
> nit: tabs      ^^^^^^^^^
>> +    .validate    = vrf_validate,
>> +    .fill_info      = vrf_fillinfo,
> nit: tabs         ^^^^^^
>> +
>> +    .newlink        = vrf_newlink,
> nit: tabs       ^^^^^^^^
>> +    .dellink        = vrf_dellink,
> nit: tabs       ^^^^^^^^
>> +    .setup        = vrf_setup,
>> +    .maxtype        = IFLA_VRF_MAX,
> nit: tabs       ^^^^^^^^
>> +};

ACK on all tab comments; fixed. Ditto for bool on is_tx_frame.

-----8<-----

>> +#if IS_ENABLED(CONFIG_NET_VRF)
>> +static inline int vrf_master_dev_idx(const struct net_device *dev)
>> +{
>> +    int idx = 0;
>> +
>> +    if (dev && dev->vrf_ptr)
>> +        idx = dev->vrf_ptr->ifindex;
>> +
>> +    return idx;
>> +}
>> +
>> +static inline int vrf_get_master_dev_idx(struct net *net, int idx)
> ifindex instead idx for the whole file?

done.

Thanks for the review.

David

PS. comments addressed while consuming a croque-monsieur (my daughter 
just returned from a European trip; loves the sandwich)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 3/6] net: Introduce VRF device driver - v2
  2015-07-06 15:03 ` [RFC net-next 3/6] net: Introduce VRF device driver - v2 David Ahern
                     ` (2 preceding siblings ...)
  2015-07-08  9:27   ` Nicolas Dichtel
@ 2015-07-08 18:34   ` Sowmini Varadhan
  2015-07-09 17:19     ` David Ahern
  3 siblings, 1 reply; 32+ messages in thread
From: Sowmini Varadhan @ 2015-07-08 18:34 UTC (permalink / raw)
  To: David Ahern
  Cc: netdev, shm, roopa, Andy Gospodarek, jtoppins, nikolay, ddutt,
	Hannes Frederic Sowa, Nicolas Dichtel, Stephen Hemminger, hadi,
	Eric W. Biederman, David Miller

On Mon, Jul 6, 2015 at 5:03 PM, David Ahern <dsa@cumulusnetworks.com> wrote:
> This driver borrows heavily from IPvlan and teaming drivers.
>
> Routing domains (VRF-lite) are created by instantiating a device
> and enslaving all routed interfaces that participate in the domain.
> As part of the enslavement, all local routes pointing to enslaved
> devices are re-pointed to the vrf device, thus forcing outgoing
> sockets to bind to the vrf to function.
>
> Standard FIB rules can then bind the VRF device to tables and regular
> fib rule processing is followed.

Perhaps I misunderstand the design proposal here, but a switch's VRF
is essentially just a separate routing table, whose input and output interfaces
are exclusively bound to the VRF.

Can an application in the model above get visibiltiy into the (enslaved?)
interfaces in the vrf? Can an application use IP_PKTINFO to send a packet out of
a specific interface on a selected VRF? What about receiving
IP_PKTINFO about input interface?
What about setting ipsec policy for interfaces in the vrf?

> Routed traffic through the box, is fwded by using the VRF device as
> the IIF and following the IIF rule to a table which is mated with
> the VRF.
>
> Locally originated traffic is directed at the VRF device using
> SO_BINDTODEVICE or cmsg headers. This in turn drops the packet into
> the xmit function of the vrf driver, which then completes the ip lookup
> and output.
>
> This solution is completely orthogonal to namespaces and allow the L3
> equivalent of vlans to exist allowing the routing space to be
> partitioned.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 3/6] net: Introduce VRF device driver - v2
  2015-07-08 18:34   ` Sowmini Varadhan
@ 2015-07-09 17:19     ` David Ahern
  2015-07-09 17:28       ` Sowmini Varadhan
  0 siblings, 1 reply; 32+ messages in thread
From: David Ahern @ 2015-07-09 17:19 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: netdev, shm, roopa, Andy Gospodarek, jtoppins, nikolay, ddutt,
	Hannes Frederic Sowa, Nicolas Dichtel, Stephen Hemminger, hadi,
	Eric W. Biederman, David Miller

Hi Sowmini:

On 7/8/15 12:34 PM, Sowmini Varadhan wrote:
> Perhaps I misunderstand the design proposal here, but a switch's VRF
> is essentially just a separate routing table, whose input and output interfaces
> are exclusively bound to the VRF.

yes, and this model follows that.

>
> Can an application in the model above get visibiltiy into the (enslaved?)
> interfaces in the vrf?

yes. e.g., 'ip link show' prints the vrf device it is enslaved to:

6: swp3: <BROADCAST,MULTICAST,SLAVE> mtu 1500 qdisc noop master vrf10 
state DOWN mode DEFAULT group default qlen 1000
     link/ether 52:54:00:12:35:03 brd ff:ff:ff:ff:ff:ff
7: swp4: <BROADCAST,MULTICAST,SLAVE> mtu 1500 qdisc noop master vrf10 
state DOWN mode DEFAULT group default qlen 1000
     link/ether 52:54:00:12:35:04 brd ff:ff:ff:ff:ff:ff
8: swp5: <BROADCAST,MULTICAST,SLAVE> mtu 1500 qdisc noop master vrf10 
state DOWN mode DEFAULT group default qlen 1000
     link/ether 52:54:00:12:35:05 brd ff:ff:ff:ff:ff:ff
9: swp6: <BROADCAST,MULTICAST,SLAVE> mtu 1500 qdisc noop master vrf10 
state DOWN mode DEFAULT group default qlen 1000


> Can an application use IP_PKTINFO to send a packet out of
> a specific interface on a selected VRF? What about receiving
> IP_PKTINFO about input interface?

On the to-do list to use cmsg to specify a VRF for outbound packets 
using non-connected sockets. I do not believe it is going to work, but 
need to look into it.

> What about setting ipsec policy for interfaces in the vrf?
>

similarly, need to look at ipsec use cases with this vrf model.

David

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 3/6] net: Introduce VRF device driver - v2
  2015-07-09 17:19     ` David Ahern
@ 2015-07-09 17:28       ` Sowmini Varadhan
  2015-07-10  1:36         ` Eric W. Biederman
  2015-07-10  2:39         ` David Ahern
  0 siblings, 2 replies; 32+ messages in thread
From: Sowmini Varadhan @ 2015-07-09 17:28 UTC (permalink / raw)
  To: David Ahern
  Cc: netdev, Shrijeet Mukherjee, Roopa Prabhu, Andy Gospodarek,
	jtoppins, nikolay, ddutt, Hannes Frederic Sowa, Nicolas Dichtel,
	Stephen Hemminger, hadi, Eric W. Biederman, David Miller

On Thu, Jul 9, 2015 at 7:19 PM, David Ahern <dsa@cumulusnetworks.com> wrote:

> On the to-do list to use cmsg to specify a VRF for outbound packets using
> non-connected sockets. I do not believe it is going to work, but need to
> look into it.
>
>> What about setting ipsec policy for interfaces in the vrf?

>From a purely parochial standpoint, how would rds sockets work in this model?
Would the tcp encaps happen before or after the the vrf "driver" output?
Same problem for NFS.

>From a non-parochial standpoint. There are a *lot* of routing apps that actually
need more visibility into many details about the "slave" interface: e.g., OSPF,
ARP snoop, IPSLA.. the list is pretty long.

I think it's a bad idea to use a "driver" to represent a table lookup. Too many
hacks will become necessary.

--Sowmini

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 3/6] net: Introduce VRF device driver - v2
  2015-07-09 17:28       ` Sowmini Varadhan
@ 2015-07-10  1:36         ` Eric W. Biederman
  2015-07-10  2:12           ` David Ahern
  2015-07-10  2:39         ` David Ahern
  1 sibling, 1 reply; 32+ messages in thread
From: Eric W. Biederman @ 2015-07-10  1:36 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: David Ahern, netdev, Shrijeet Mukherjee, Roopa Prabhu,
	Andy Gospodarek, jtoppins, nikolay, ddutt, Hannes Frederic Sowa,
	Nicolas Dichtel, Stephen Hemminger, hadi, David Miller

Sowmini Varadhan <sowmini05@gmail.com> writes:

> On Thu, Jul 9, 2015 at 7:19 PM, David Ahern <dsa@cumulusnetworks.com> wrote:
>
>> On the to-do list to use cmsg to specify a VRF for outbound packets using
>> non-connected sockets. I do not believe it is going to work, but need to
>> look into it.
>>
>>> What about setting ipsec policy for interfaces in the vrf?
>
> From a purely parochial standpoint, how would rds sockets work in this model?
> Would the tcp encaps happen before or after the the vrf "driver" output?
> Same problem for NFS.
>
> From a non-parochial standpoint. There are a *lot* of routing apps that actually
> need more visibility into many details about the "slave" interface: e.g., OSPF,
> ARP snoop, IPSLA.. the list is pretty long.
>
> I think it's a bad idea to use a "driver" to represent a table lookup. Too many
> hacks will become necessary.

With respect to sockets there is also the issue that ip addresses are
not per vrf.  Which means things like packet fragmentation reassembly
can easily do the wrong thing.  Similarly things like the xfrm for ipsec
tunnels are not hooked into this mix.

So I really do not see how this VRF/MRF thing as designed can support
general purpose sockets.  I am not certain it can correctly support any
kind of socket except perhaps SOCK_RAW.

Which strongly suggests to me you can leave off the magic network device
and simply add the fib_lookup logic that finds the base fib table by
looking at the in coming network device.  That seems a whole lot simpler
and a whole lot less surprising.  The better routing will work and
people playing with sockets will clearly see the dangers.

Eric

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 3/6] net: Introduce VRF device driver - v2
  2015-07-10  1:36         ` Eric W. Biederman
@ 2015-07-10  2:12           ` David Ahern
  2015-07-10  3:55             ` Eric W. Biederman
  0 siblings, 1 reply; 32+ messages in thread
From: David Ahern @ 2015-07-10  2:12 UTC (permalink / raw)
  To: Eric W. Biederman, Sowmini Varadhan
  Cc: netdev, Shrijeet Mukherjee, Roopa Prabhu, Andy Gospodarek,
	jtoppins, nikolay, ddutt, Hannes Frederic Sowa, Nicolas Dichtel,
	Stephen Hemminger, hadi, David Miller

On 7/9/15 7:36 PM, Eric W. Biederman wrote:
> Sowmini Varadhan <sowmini05@gmail.com> writes:
>
>> On Thu, Jul 9, 2015 at 7:19 PM, David Ahern <dsa@cumulusnetworks.com> wrote:
>>
>>> On the to-do list to use cmsg to specify a VRF for outbound packets using
>>> non-connected sockets. I do not believe it is going to work, but need to
>>> look into it.
>>>
>>>> What about setting ipsec policy for interfaces in the vrf?
>>
>>  From a purely parochial standpoint, how would rds sockets work in this model?
>> Would the tcp encaps happen before or after the the vrf "driver" output?
>> Same problem for NFS.
>>
>>  From a non-parochial standpoint. There are a *lot* of routing apps that actually
>> need more visibility into many details about the "slave" interface: e.g., OSPF,
>> ARP snoop, IPSLA.. the list is pretty long.
>>
>> I think it's a bad idea to use a "driver" to represent a table lookup. Too many
>> hacks will become necessary.
>
> With respect to sockets there is also the issue that ip addresses are
> not per vrf.

IP addresses are per interface and interfaces are uniquely assigned to a 
VRF so why do you think IP addresses are not per VRF?

>  Which means things like packet fragmentation reassembly
> can easily do the wrong thing.  Similarly things like the xfrm for ipsec
> tunnels are not hooked into this mix.
>
> So I really do not see how this VRF/MRF thing as designed can support
> general purpose sockets.  I am not certain it can correctly support any
> kind of socket except perhaps SOCK_RAW.

Sockets bound to the VRF device work properly. Why do you think they won't?

>
> Which strongly suggests to me you can leave off the magic network device
> and simply add the fib_lookup logic that finds the base fib table by
> looking at the in coming network device.  That seems a whole lot simpler
> and a whole lot less surprising.  The better routing will work and
> people playing with sockets will clearly see the dangers.

I believe Hannes is looking at that approach. Hopefully patches will be 
available soon to have more meaningful conversations (e.g., LPC). As we 
move along with working patch sets we can compare and contrast the 2 
models -- what features do each enable and at what cost.

David

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 3/6] net: Introduce VRF device driver - v2
  2015-07-09 17:28       ` Sowmini Varadhan
  2015-07-10  1:36         ` Eric W. Biederman
@ 2015-07-10  2:39         ` David Ahern
  2015-07-10  3:28           ` Sowmini Varadhan
  1 sibling, 1 reply; 32+ messages in thread
From: David Ahern @ 2015-07-10  2:39 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: netdev, Shrijeet Mukherjee, Roopa Prabhu, Andy Gospodarek,
	jtoppins, nikolay, ddutt, Hannes Frederic Sowa, Nicolas Dichtel,
	Stephen Hemminger, hadi, Eric W. Biederman, David Miller

On 7/9/15 11:28 AM, Sowmini Varadhan wrote:
> On Thu, Jul 9, 2015 at 7:19 PM, David Ahern <dsa@cumulusnetworks.com> wrote:
>
>> On the to-do list to use cmsg to specify a VRF for outbound packets using
>> non-connected sockets. I do not believe it is going to work, but need to
>> look into it.
>>
>>> What about setting ipsec policy for interfaces in the vrf?
>
>  From a purely parochial standpoint, how would rds sockets work in this model?
> Would the tcp encaps happen before or after the the vrf "driver" output?
> Same problem for NFS.

If I set the VRF context (ie., set the SO_BINDTODEVICE for all sockets) 
of any RDS, NFS or any other socket app it runs in that VRF context and 
works just fine.

>
>  From a non-parochial standpoint. There are a *lot* of routing apps that actually
> need more visibility into many details about the "slave" interface: e.g., OSPF,
> ARP snoop, IPSLA.. the list is pretty long.
>
> I think it's a bad idea to use a "driver" to represent a table lookup. Too many
> hacks will become necessary.

Most of the changes needed to the networking stack are to address which 
table is used for FIB lookups. The stack has a strong preference to the 
local and main tables. I have a new patch set which better explains 
patch 4 in this version. I'll send it out in the next few days, but you 
can get a preview here:

   https://github.com/dsahern/linux.git, vrf-dev-4.1-v2 branch

David

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 3/6] net: Introduce VRF device driver - v2
  2015-07-10  2:39         ` David Ahern
@ 2015-07-10  3:28           ` Sowmini Varadhan
  2015-07-10  3:44             ` David Ahern
  0 siblings, 1 reply; 32+ messages in thread
From: Sowmini Varadhan @ 2015-07-10  3:28 UTC (permalink / raw)
  To: David Ahern
  Cc: netdev, Shrijeet Mukherjee, Roopa Prabhu, Andy Gospodarek,
	jtoppins, nikolay, ddutt, Hannes Frederic Sowa, Nicolas Dichtel,
	Stephen Hemminger, hadi, Eric W. Biederman, David Miller

On Fri, Jul 10, 2015 at 4:39 AM, David Ahern <dsa@cumulusnetworks.com> wrote:

> If I set the VRF context (ie., set the SO_BINDTODEVICE for all sockets) of
> any RDS, NFS or any other socket app it runs in that VRF context and works
> just fine

What if the application wants to do SO_BINDTODEVICE?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 3/6] net: Introduce VRF device driver - v2
  2015-07-10  3:28           ` Sowmini Varadhan
@ 2015-07-10  3:44             ` David Ahern
  0 siblings, 0 replies; 32+ messages in thread
From: David Ahern @ 2015-07-10  3:44 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: netdev, Shrijeet Mukherjee, Roopa Prabhu, Andy Gospodarek,
	jtoppins, nikolay, ddutt, Hannes Frederic Sowa, Nicolas Dichtel,
	Stephen Hemminger, hadi, Eric W. Biederman, David Miller

On 7/9/15 9:28 PM, Sowmini Varadhan wrote:
> On Fri, Jul 10, 2015 at 4:39 AM, David Ahern <dsa@cumulusnetworks.com> wrote:
>
>> If I set the VRF context (ie., set the SO_BINDTODEVICE for all sockets) of
>> any RDS, NFS or any other socket app it runs in that VRF context and works
>> just fine
>
> What if the application wants to do SO_BINDTODEVICE?
>

We have knowledge of both -- vrf master and vrf slave. Do the checks 
from local up ... is there a socket bound to slave device? If yes, 
match. If no, check for socket bound to master (VRF device). If yes, 
match. If no, check global (only works for connected sockets).

David

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 3/6] net: Introduce VRF device driver - v2
  2015-07-10  2:12           ` David Ahern
@ 2015-07-10  3:55             ` Eric W. Biederman
  2015-07-10  4:20               ` David Ahern
  0 siblings, 1 reply; 32+ messages in thread
From: Eric W. Biederman @ 2015-07-10  3:55 UTC (permalink / raw)
  To: David Ahern
  Cc: Sowmini Varadhan, netdev, Shrijeet Mukherjee, Roopa Prabhu,
	Andy Gospodarek, jtoppins, nikolay, ddutt, Hannes Frederic Sowa,
	Nicolas Dichtel, Stephen Hemminger, hadi, David Miller

David Ahern <dsa@cumulusnetworks.com> writes:

> On 7/9/15 7:36 PM, Eric W. Biederman wrote:
>> Sowmini Varadhan <sowmini05@gmail.com> writes:
>>
>>> On Thu, Jul 9, 2015 at 7:19 PM, David Ahern <dsa@cumulusnetworks.com> wrote:
>>>
>>>> On the to-do list to use cmsg to specify a VRF for outbound packets using
>>>> non-connected sockets. I do not believe it is going to work, but need to
>>>> look into it.
>>>>
>>>>> What about setting ipsec policy for interfaces in the vrf?
>>>
>>>  From a purely parochial standpoint, how would rds sockets work in this model?
>>> Would the tcp encaps happen before or after the the vrf "driver" output?
>>> Same problem for NFS.
>>>
>>>  From a non-parochial standpoint. There are a *lot* of routing apps that actually
>>> need more visibility into many details about the "slave" interface: e.g., OSPF,
>>> ARP snoop, IPSLA.. the list is pretty long.
>>>
>>> I think it's a bad idea to use a "driver" to represent a table lookup. Too many
>>> hacks will become necessary.
>>
>> With respect to sockets there is also the issue that ip addresses are
>> not per vrf.
>
> IP addresses are per interface and interfaces are uniquely assigned to
> a VRF so why do you think IP addresses are not per VRF?

I have read large swaths of the linux networking code over the years.

Further I was thinking more about non-local addresses ip addresses, but
I would not be surprised if there are also issues with local addresses.

>>  Which means things like packet fragmentation reassembly
>> can easily do the wrong thing.  Similarly things like the xfrm for ipsec
>> tunnels are not hooked into this mix.
>>
>> So I really do not see how this VRF/MRF thing as designed can support
>> general purpose sockets.  I am not certain it can correctly support any
>> kind of socket except perhaps SOCK_RAW.
>
> Sockets bound to the VRF device work properly. Why do you think they won't?

Because there are many locations in the network stack (like fragment
reassembly) that make the assumption that ip addresses are unique and
do not bother looking at network device or anything else.  If fragments
manage to come into play I don't expect it would be hard to poision a
connections with fragments from another routing domain with overlapping
ip addresses.

I guess if we are talking about SO_BINDTODEVICE which requires
CAP_NET_RAW we aren't really talking ordinary applications so there is
already a big helping of buyer beware.

Still a blanket statement that sockets just work and there is nothing
to be concerned about is just wrong.

Eric

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 3/6] net: Introduce VRF device driver - v2
  2015-07-10  3:55             ` Eric W. Biederman
@ 2015-07-10  4:20               ` David Ahern
  2015-07-10  4:56                 ` Eric W. Biederman
  0 siblings, 1 reply; 32+ messages in thread
From: David Ahern @ 2015-07-10  4:20 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Sowmini Varadhan, netdev, Shrijeet Mukherjee, Roopa Prabhu,
	Andy Gospodarek, jtoppins, nikolay, ddutt, Hannes Frederic Sowa,
	Nicolas Dichtel, Stephen Hemminger, hadi, David Miller

On 7/9/15 9:55 PM, Eric W. Biederman wrote:
>> IP addresses are per interface and interfaces are uniquely assigned to
>> a VRF so why do you think IP addresses are not per VRF?
>
> I have read large swaths of the linux networking code over the years.
>
> Further I was thinking more about non-local addresses ip addresses, but
> I would not be surprised if there are also issues with local addresses.

Well, if someone has a specific example I'll take a look.

>
>>>   Which means things like packet fragmentation reassembly
>>> can easily do the wrong thing.  Similarly things like the xfrm for ipsec
>>> tunnels are not hooked into this mix.
>>>
>>> So I really do not see how this VRF/MRF thing as designed can support
>>> general purpose sockets.  I am not certain it can correctly support any
>>> kind of socket except perhaps SOCK_RAW.
>>
>> Sockets bound to the VRF device work properly. Why do you think they won't?
>
> Because there are many locations in the network stack (like fragment
> reassembly) that make the assumption that ip addresses are unique and
> do not bother looking at network device or anything else.  If fragments
> manage to come into play I don't expect it would be hard to poision a
> connections with fragments from another routing domain with overlapping
> ip addresses.

If that is true it is a problem with the networking stack today and is 
completely independent of this VRF proposal.


> I guess if we are talking about SO_BINDTODEVICE which requires
> CAP_NET_RAW we aren't really talking ordinary applications so there is
> already a big helping of buyer beware.
>
> Still a blanket statement that sockets just work and there is nothing
> to be concerned about is just wrong.

If you have examples of something that does not work I will be happy to 
look into it. As it stands I have a growing suite of test cases where my 
comment is true.

David

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 3/6] net: Introduce VRF device driver - v2
  2015-07-10  4:20               ` David Ahern
@ 2015-07-10  4:56                 ` Eric W. Biederman
  2015-07-10 18:42                   ` David Ahern
  0 siblings, 1 reply; 32+ messages in thread
From: Eric W. Biederman @ 2015-07-10  4:56 UTC (permalink / raw)
  To: David Ahern
  Cc: Sowmini Varadhan, netdev, Shrijeet Mukherjee, Roopa Prabhu,
	Andy Gospodarek, jtoppins, nikolay, ddutt, Hannes Frederic Sowa,
	Nicolas Dichtel, Stephen Hemminger, hadi, David Miller

David Ahern <dsa@cumulusnetworks.com> writes:

> On 7/9/15 9:55 PM, Eric W. Biederman wrote:
>>> IP addresses are per interface and interfaces are uniquely assigned to
>>> a VRF so why do you think IP addresses are not per VRF?
>>
>> I have read large swaths of the linux networking code over the years.
>>
>> Further I was thinking more about non-local addresses ip addresses, but
>> I would not be surprised if there are also issues with local addresses.
>
> Well, if someone has a specific example I'll take a look.

I have given specific areas of concern, and explained myself and you are
blowing me off.

Besides the fragment reassembly and xfrm there are things like the
ineetpeer cache.

>>>>   Which means things like packet fragmentation reassembly
>>>> can easily do the wrong thing.  Similarly things like the xfrm for ipsec
>>>> tunnels are not hooked into this mix.
>>>>
>>>> So I really do not see how this VRF/MRF thing as designed can support
>>>> general purpose sockets.  I am not certain it can correctly support any
>>>> kind of socket except perhaps SOCK_RAW.
>>>
>>> Sockets bound to the VRF device work properly. Why do you think they won't?
>>
>> Because there are many locations in the network stack (like fragment
>> reassembly) that make the assumption that ip addresses are unique and
>> do not bother looking at network device or anything else.  If fragments
>> manage to come into play I don't expect it would be hard to poision a
>> connections with fragments from another routing domain with overlapping
>> ip addresses.
>
> If that is true it is a problem with the networking stack today and is
> completely independent of this VRF proposal.

Not at all.  It is required functionality for reassembly of ip fragments
when the packets come in via different paths.  This is required to
support multi-path ip reception.

This only becomes a bug in the scenario you have proposed.

Eric

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 0/6] Proposal for VRF-lite - v2
  2015-07-06 15:03 [RFC net-next 0/6] Proposal for VRF-lite - v2 David Ahern
                   ` (7 preceding siblings ...)
  2015-07-06 15:40 ` [RFC net-next 0/6] Proposal for VRF-lite - v2 Nicolas Dichtel
@ 2015-07-10  5:14 ` Scott Feldman
  8 siblings, 0 replies; 32+ messages in thread
From: Scott Feldman @ 2015-07-10  5:14 UTC (permalink / raw)
  To: David Ahern
  Cc: Netdev, Shrijeet Mukherjee, Roopa Prabhu, Andy Gospodarek,
	Jonathan Toppins, Nikolay Aleksandrov, ddutt,
	Hannes Frederic Sowa, Nicolas Dichtel, stephen, Jamal Hadi Salim,
	ebiederm, David S. Miller

On Mon, Jul 6, 2015 at 8:03 AM, David Ahern <dsa@cumulusnetworks.com> wrote:
> In the context of internet scale routing a requirement that always
> comes up is the need to partition the available routing tables into
> disjoint routing planes. A specific use case is the multi-tenancy
> problem where each tenant has their own unique routing tables and in
> the very least need different default gateways.

Based on this problem statement, netns would be the answer: to
partition the physical router into N virtual routers.  If routing is
offloaded, the offload device is netns-aware to preserve the
partitioning down to the HW level.

I see from earlier discussions on VRF that netns is no good because
it's an inefficient use of resources.  I wonder if that's true in a
practical way?  If I have a 48-port router, I could create 24 2-port
virtual routers using netns, each running routing stuff (bgp, lldp,
ospf, etc).  Is the netns overhead plus the routing sw duplication not
going to fit on a Cumulus-class router?

In other words, if noone had ever heard of VRF, we'd conclude netns
given the problem statement.  And then focus on inefficiencies in
netns, if the implementation didn't fit a particular target.

So my C in RFC is what's wrong with using netns?  And can those wrongs be fixed?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC net-next 3/6] net: Introduce VRF device driver - v2
  2015-07-10  4:56                 ` Eric W. Biederman
@ 2015-07-10 18:42                   ` David Ahern
  0 siblings, 0 replies; 32+ messages in thread
From: David Ahern @ 2015-07-10 18:42 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Sowmini Varadhan, netdev, Shrijeet Mukherjee, Roopa Prabhu,
	Andy Gospodarek, jtoppins, nikolay, ddutt, Hannes Frederic Sowa,
	Nicolas Dichtel, Stephen Hemminger, hadi, David Miller

On 7/9/15 10:56 PM, Eric W. Biederman wrote:
> I have given specific areas of concern, and explained myself and you are
> blowing me off.

You have not had answered my question with any additional details or 
code references -- ie., a specific example. Asking you for clarification 
and details is not blowing you off.

To recap:

Eric: "With respect to sockets there is also the issue that ip addresses 
are not per vrf."

David: "IP addresses are per interface and interfaces are uniquely 
assigned to a VRF so why do you think IP addresses are not per VRF?"

Eric: "I have read large swaths of the linux networking code over the 
years. Further I was thinking more about non-local addresses ip 
addresses, but I would not be surprised if there are also issues with 
local addresses."

David: "Well, if someone has a specific example I'll take a look."

So, let me try this again: All of the IPv4 and IPv6 addresses I am aware 
of are held in structs linked to a specific netdevice. Can you give me a 
specific example of what you mean here? I can't respond to your feedback 
based on the little information you have given me.

>
> Besides the fragment reassembly and xfrm there are things like the
> ineetpeer cache.

noted.

>
>>>>>    Which means things like packet fragmentation reassembly
>>>>> can easily do the wrong thing.  Similarly things like the xfrm for ipsec
>>>>> tunnels are not hooked into this mix.
>>>>>
>>>>> So I really do not see how this VRF/MRF thing as designed can support
>>>>> general purpose sockets.  I am not certain it can correctly support any
>>>>> kind of socket except perhaps SOCK_RAW.
>>>>
>>>> Sockets bound to the VRF device work properly. Why do you think they won't?
>>>
>>> Because there are many locations in the network stack (like fragment
>>> reassembly) that make the assumption that ip addresses are unique and
>>> do not bother looking at network device or anything else.  If fragments
>>> manage to come into play I don't expect it would be hard to poision a
>>> connections with fragments from another routing domain with overlapping
>>> ip addresses.
>>
>> If that is true it is a problem with the networking stack today and is
>> completely independent of this VRF proposal.
>
> Not at all.  It is required functionality for reassembly of ip fragments
> when the packets come in via different paths.  This is required to
> support multi-path ip reception.
>
> This only becomes a bug in the scenario you have proposed.

Can you please point to a specific line in one of these patches that 
impacts fragmentation?

This patch -- patch 3 -- adds a VRF driver. It has not mucked with 
packets at all.

Patch 4 (again I have a better split in a forthcoming revision) tweaks a 
few places in the IPv4 stack with respect to socket lookups (dif 
modified) and FIB table lookups (specifying a table to use or tweaking 
oif/iif).

Since the VRF device has not touched the packet and does not introduce a 
tunnel how has its use/existence impacted fragmentation?

I do plan tests to include ipsec for example and fragmentation; it's a 
matter of time. If you have suggestions on a setup/test case I should 
include please let me know.

David

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2015-07-10 18:42 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-07-06 15:03 [RFC net-next 0/6] Proposal for VRF-lite - v2 David Ahern
2015-07-06 15:03 ` [RFC net-next 1/6] fib: export symbols David Ahern
2015-07-06 15:03 ` [RFC net-next 2/6] net: Preparation for vrf device David Ahern
2015-07-08  8:37   ` Nicolas Dichtel
2015-07-08  8:40     ` Nicolas Dichtel
2015-07-08 16:10     ` David Ahern
2015-07-06 15:03 ` [RFC net-next 3/6] net: Introduce VRF device driver - v2 David Ahern
2015-07-06 15:42   ` Nicolas Dichtel
2015-07-06 16:37   ` Nikolay Aleksandrov
2015-07-06 16:46     ` David Ahern
2015-07-08  9:27   ` Nicolas Dichtel
2015-07-08 16:38     ` David Ahern
2015-07-08 18:34   ` Sowmini Varadhan
2015-07-09 17:19     ` David Ahern
2015-07-09 17:28       ` Sowmini Varadhan
2015-07-10  1:36         ` Eric W. Biederman
2015-07-10  2:12           ` David Ahern
2015-07-10  3:55             ` Eric W. Biederman
2015-07-10  4:20               ` David Ahern
2015-07-10  4:56                 ` Eric W. Biederman
2015-07-10 18:42                   ` David Ahern
2015-07-10  2:39         ` David Ahern
2015-07-10  3:28           ` Sowmini Varadhan
2015-07-10  3:44             ` David Ahern
2015-07-06 15:03 ` [RFC net-next 4/6] net: Modifications to ipv4 stack for VRF devices David Ahern
2015-07-06 15:03 ` [RFC net-next 5/6] net: Add sk_bind_dev_if to task_struct David Ahern
2015-07-06 15:03 ` [RFC net-next 6/6] net: Add chvrf command David Ahern
2015-07-06 15:03 ` [RFC PATCH] iproute2: Add support for VRF device David Ahern
2015-07-06 15:40 ` [RFC net-next 0/6] Proposal for VRF-lite - v2 Nicolas Dichtel
2015-07-06 17:53   ` Shrijeet Mukherjee
2015-07-08  9:30     ` Nicolas Dichtel
2015-07-10  5:14 ` Scott Feldman

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.