* [PATCH net-next v3 0/7] switchdev: add IPv4 routing offload
@ 2015-03-03 23:31 sfeldma
  2015-03-03 23:31 ` [PATCH net-next v3 1/7] rtnetlink: add RTNH_F_EXTERNAL flag for fib offload sfeldma
                   ` (7 more replies)
  0 siblings, 8 replies; 19+ messages in thread
From: sfeldma @ 2015-03-03 23:31 UTC (permalink / raw)
  To: netdev, davem, jiri, roopa

From: Scott Feldman <sfeldma@gmail.com>

v3:

Changes based on v2 review comments:

  - Move check for custom rules up earlier in patch set, to keep git bisect
    safe.
  - Simplify route add/modify failure handling: keep installing routes
    until one fails, then undo everything.  The switchdev driver returns
    err when a route that could normally be installed to the device fails
    to install for one reason or another (no space left on device, etc).
    On failure, uninstall all routes from the device, punting forwarding
    for all routes back to the kernel.
  - Scan route's full nexthop list, ensuring all nexthop devs belong
    to the same switchdev device, otherwise don't try to install route
    to device.

v2:

Changes based on v1 review comments and discussions at netconf:

  - Allow route modification, but use same ndo op used for adding route.
    Driver/device is expected to modify route in-place, if it can, to avoid
    interruption of service.
  - Add new RTNH_F_EXTERNAL flag to mark FIB entries offloaded externally.
  - Don't offload routes if using custom IP rules.  If routes are already
    offloaded, and custom IP rules are turned on, flush routes from offload
    device.  (Offloaded routes are marked with RTNH_F_EXTERNAL).
  - Use kernel's neigh resolution code to resolve route's nexthops' neigh
    MAC addrs.  (Thanks davem, works great!).
  - Use fib->fib_priority in rocker driver to give priorities to routes in
    OF-DPA unicast route table.

v1:

This patch set adds L3 routing offload support for IPv4 routes.  The idea is to
mirror routes installed in the kernel's FIB down to a hardware switch device to
offload the data forwarding path for L3.  Only the data forwarding path is
intercepted.  Control and management of the kernel's FIB remains with the
kernel.

Scott Feldman (7):
  rtnetlink: add RTNH_F_EXTERNAL flag for fib offload
  netdevice: add IPv4 fib add/del ops
  switchdev: add IPv4 fib ndo ops wrappers
  switchdev: don't support custom ip rules, for now
  switchdev: implement IPv4 fib ndo wrappers
  fib: hook IPv4 fib for hardware offload
  rocker: implement IPv4 fib offloading

 drivers/net/ethernet/rocker/rocker.c |  527 +++++++++++++++++++++++++++++++---
 include/linux/netdevice.h            |   22 ++
 include/net/ip_fib.h                 |    2 +
 include/net/switchdev.h              |   19 ++
 include/uapi/linux/rtnetlink.h       |    1 +
 net/ipv4/fib_frontend.c              |   13 +
 net/ipv4/fib_rules.c                 |    3 +
 net/ipv4/fib_trie.c                  |   63 +++-
 net/switchdev/switchdev.c            |  127 ++++++++
 9 files changed, 729 insertions(+), 48 deletions(-)

-- 
1.7.10.4


* [PATCH net-next v3 1/7] rtnetlink: add RTNH_F_EXTERNAL flag for fib offload
  2015-03-03 23:31 [PATCH net-next v3 0/7] switchdev: add IPv4 routing offload sfeldma
@ 2015-03-03 23:31 ` sfeldma
  2015-03-03 23:31 ` [PATCH net-next v3 2/7] netdevice: add IPv4 fib add/del ops sfeldma
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: sfeldma @ 2015-03-03 23:31 UTC (permalink / raw)
  To: netdev, davem, jiri, roopa

From: Scott Feldman <sfeldma@gmail.com>

Add new RTNH_F_EXTERNAL flag to mark fib entries offloaded externally, for
example to a switchdev switch device.
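
As a rough userspace illustration of how the new bit might be tested, here
is a minimal sketch.  The flag values are copied from the hunk below;
struct nh_model and nh_is_offloaded() are hypothetical names used only for
this example and are not kernel API.

```c
#include <stdbool.h>

/* Flag values as they appear in the rtnetlink.h hunk of this patch. */
#define RTNH_F_DEAD		1
#define RTNH_F_PERVASIVE	2
#define RTNH_F_ONLINK		4
#define RTNH_F_EXTERNAL		8

/* Hypothetical stand-in for an object that carries nexthop flags. */
struct nh_model {
	unsigned int flags;
};

/* True if the entry has been marked as offloaded externally. */
static bool nh_is_offloaded(const struct nh_model *nh)
{
	return (nh->flags & RTNH_F_EXTERNAL) != 0;
}
```

Since the flag is a distinct bit, it can be combined with the existing
RTNH_F_* flags without disturbing them.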

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
---
 include/uapi/linux/rtnetlink.h |    1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 5cc5d66..b476e86 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -332,6 +332,7 @@ struct rtnexthop {
 #define RTNH_F_DEAD		1	/* Nexthop is dead (used by multipath)	*/
 #define RTNH_F_PERVASIVE	2	/* Do recursive gateway lookup	*/
 #define RTNH_F_ONLINK		4	/* Gateway is forced on link	*/
+#define RTNH_F_EXTERNAL		8	/* Route installed externally	*/
 
 /* Macros to handle hexthops */
 
-- 
1.7.10.4


* [PATCH net-next v3 2/7] netdevice: add IPv4 fib add/del ops
  2015-03-03 23:31 [PATCH net-next v3 0/7] switchdev: add IPv4 routing offload sfeldma
  2015-03-03 23:31 ` [PATCH net-next v3 1/7] rtnetlink: add RTNH_F_EXTERNAL flag for fib offload sfeldma
@ 2015-03-03 23:31 ` sfeldma
  2015-03-03 23:31 ` [PATCH net-next v3 3/7] switchdev: add IPv4 fib ndo ops wrappers sfeldma
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: sfeldma @ 2015-03-03 23:31 UTC (permalink / raw)
  To: netdev, davem, jiri, roopa

From: Scott Feldman <sfeldma@gmail.com>

Add two new ndo ops for IPv4 fib offload support, add and del.  Add uses
modify semantics if fib entry already offloaded.  Drivers implementing the new
ndo ops will return err<0 if programming device fails, for example if device's
tables are full.
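
The semantics described above (add doubles as modify, err<0 when the
device cannot be programmed) can be sketched in userspace roughly as
follows.  This is only a model under stated assumptions: struct model_dev,
model_fib_ipv4_add() and model_fib_ipv4_del() are hypothetical names, the
table is a tiny fixed-size array, and -ENOSPC is just one plausible error
a driver might return.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_TBL_SIZE 2	/* tiny on purpose, to show the "table full" case */

struct model_route {
	uint32_t dst;
	int dst_len;
	uint32_t priority;
	int in_use;
};

/* Hypothetical device with a fixed-size route table. */
struct model_dev {
	struct model_route tbl[MODEL_TBL_SIZE];
};

static struct model_route *model_find(struct model_dev *dev,
				      uint32_t dst, int dst_len)
{
	size_t i;

	for (i = 0; i < MODEL_TBL_SIZE; i++) {
		struct model_route *r = &dev->tbl[i];

		if (r->in_use && r->dst == dst && r->dst_len == dst_len)
			return r;
	}
	return NULL;
}

/* Add a route; if it is already present, modify it in place. */
static int model_fib_ipv4_add(struct model_dev *dev, uint32_t dst,
			      int dst_len, uint32_t priority)
{
	struct model_route *r = model_find(dev, dst, dst_len);
	size_t i;

	if (r) {
		r->priority = priority;
		return 0;
	}

	for (i = 0; i < MODEL_TBL_SIZE; i++) {
		r = &dev->tbl[i];
		if (!r->in_use) {
			r->dst = dst;
			r->dst_len = dst_len;
			r->priority = priority;
			r->in_use = 1;
			return 0;
		}
	}

	return -ENOSPC;	/* device table is full */
}

/* Delete a route; returns -ENOENT if it was never installed. */
static int model_fib_ipv4_del(struct model_dev *dev, uint32_t dst,
			      int dst_len)
{
	struct model_route *r = model_find(dev, dst, dst_len);

	if (!r)
		return -ENOENT;
	r->in_use = 0;
	return 0;
}
```

Re-adding an existing prefix updates the entry rather than consuming a
second slot, which is the "modify in place" behavior the op is expected to
provide.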

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
---
 include/linux/netdevice.h |   22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5897b4e..73b2766 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -769,6 +769,8 @@ struct netdev_phys_item_id {
 typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
 				       struct sk_buff *skb);
 
+struct fib_info;
+
 /*
  * This structure defines the management hooks for network devices.
  * The following hooks can be defined; unless noted otherwise, they are
@@ -1032,6 +1034,14 @@ typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
  * int (*ndo_switch_port_stp_update)(struct net_device *dev, u8 state);
  *	Called to notify switch device port of bridge port STP
  *	state change.
+ * int (*ndo_switch_fib_ipv4_add)(struct net_device *dev, __be32 dst,
+ *				  int dst_len, struct fib_info *fi,
+ *				  u8 tos, u8 type, u32 tb_id);
+ *	Called to add/modify IPv4 route to switch device.
+ * int (*ndo_switch_fib_ipv4_del)(struct net_device *dev, __be32 dst,
+ *				  int dst_len, struct fib_info *fi,
+ *				  u8 tos, u8 type, u32 tb_id);
+ *	Called to delete IPv4 route from switch device.
  */
 struct net_device_ops {
 	int			(*ndo_init)(struct net_device *dev);
@@ -1193,6 +1203,18 @@ struct net_device_ops {
 							    struct netdev_phys_item_id *psid);
 	int			(*ndo_switch_port_stp_update)(struct net_device *dev,
 							      u8 state);
+	int			(*ndo_switch_fib_ipv4_add)(struct net_device *dev,
+							   __be32 dst,
+							   int dst_len,
+							   struct fib_info *fi,
+							   u8 tos, u8 type,
+							   u32 tb_id);
+	int			(*ndo_switch_fib_ipv4_del)(struct net_device *dev,
+							   __be32 dst,
+							   int dst_len,
+							   struct fib_info *fi,
+							   u8 tos, u8 type,
+							   u32 tb_id);
 #endif
 };
 
-- 
1.7.10.4


* [PATCH net-next v3 3/7] switchdev: add IPv4 fib ndo ops wrappers
  2015-03-03 23:31 [PATCH net-next v3 0/7] switchdev: add IPv4 routing offload sfeldma
  2015-03-03 23:31 ` [PATCH net-next v3 1/7] rtnetlink: add RTNH_F_EXTERNAL flag for fib offload sfeldma
  2015-03-03 23:31 ` [PATCH net-next v3 2/7] netdevice: add IPv4 fib add/del ops sfeldma
@ 2015-03-03 23:31 ` sfeldma
  2015-03-03 23:31 ` [PATCH net-next v3 4/7] switchdev: don't support custom ip rules, for now sfeldma
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: sfeldma @ 2015-03-03 23:31 UTC (permalink / raw)
  To: netdev, davem, jiri, roopa

From: Scott Feldman <sfeldma@gmail.com>

Add IPv4 fib ndo wrapper funcs and stub them out for now.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
---
 include/net/switchdev.h   |   19 +++++++++++++++++++
 net/switchdev/switchdev.c |   39 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 58 insertions(+)

diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index cfcdac2..8d2ac66 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -51,6 +51,11 @@ int ndo_dflt_netdev_switch_port_bridge_dellink(struct net_device *dev,
 					       struct nlmsghdr *nlh, u16 flags);
 int ndo_dflt_netdev_switch_port_bridge_setlink(struct net_device *dev,
 					       struct nlmsghdr *nlh, u16 flags);
+int netdev_switch_fib_ipv4_add(u32 dst, int dst_len, struct fib_info *fi,
+			       u8 tos, u8 type, u32 tb_id);
+int netdev_switch_fib_ipv4_del(u32 dst, int dst_len, struct fib_info *fi,
+			       u8 tos, u8 type, u32 tb_id);
+
 #else
 
 static inline int netdev_switch_parent_id_get(struct net_device *dev,
@@ -109,6 +114,20 @@ static inline int ndo_dflt_netdev_switch_port_bridge_setlink(struct net_device *
 	return 0;
 }
 
+static inline int netdev_switch_fib_ipv4_add(u32 dst, int dst_len,
+					     struct fib_info *fi,
+					     u8 tos, u8 type, u32 tb_id)
+{
+	return 0;
+}
+
+static inline int netdev_switch_fib_ipv4_del(u32 dst, int dst_len,
+					     struct fib_info *fi,
+					     u8 tos, u8 type, u32 tb_id)
+{
+	return 0;
+}
+
 #endif
 
 #endif /* _LINUX_SWITCHDEV_H_ */
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index 8c1e558..3c090f8 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -14,6 +14,7 @@
 #include <linux/mutex.h>
 #include <linux/notifier.h>
 #include <linux/netdevice.h>
+#include <net/ip_fib.h>
 #include <net/switchdev.h>
 
 /**
@@ -225,3 +226,41 @@ int ndo_dflt_netdev_switch_port_bridge_dellink(struct net_device *dev,
 	return ret;
 }
 EXPORT_SYMBOL(ndo_dflt_netdev_switch_port_bridge_dellink);
+
+/**
+ *	netdev_switch_fib_ipv4_add - Add IPv4 route entry to switch
+ *
+ *	@dst: route's IPv4 destination address
+ *	@dst_len: destination address length (prefix length)
+ *	@fi: route FIB info structure
+ *	@tos: route TOS
+ *	@type: route type
+ *	@tb_id: route table ID
+ *
+ *	Add IPv4 route entry to switch device.
+ */
+int netdev_switch_fib_ipv4_add(u32 dst, int dst_len, struct fib_info *fi,
+			       u8 tos, u8 type, u32 tb_id)
+{
+	return 0;
+}
+EXPORT_SYMBOL(netdev_switch_fib_ipv4_add);
+
+/**
+ *	netdev_switch_fib_ipv4_del - Delete IPv4 route entry from switch
+ *
+ *	@dst: route's IPv4 destination address
+ *	@dst_len: destination address length (prefix length)
+ *	@fi: route FIB info structure
+ *	@tos: route TOS
+ *	@type: route type
+ *	@tb_id: route table ID
+ *
+ *	Delete IPv4 route entry from switch device.
+ */
+int netdev_switch_fib_ipv4_del(u32 dst, int dst_len, struct fib_info *fi,
+			       u8 tos, u8 type, u32 tb_id)
+{
+	return 0;
+}
+EXPORT_SYMBOL(netdev_switch_fib_ipv4_del);
-- 
1.7.10.4


* [PATCH net-next v3 4/7] switchdev: don't support custom ip rules, for now
  2015-03-03 23:31 [PATCH net-next v3 0/7] switchdev: add IPv4 routing offload sfeldma
                   ` (2 preceding siblings ...)
  2015-03-03 23:31 ` [PATCH net-next v3 3/7] switchdev: add IPv4 fib ndo ops wrappers sfeldma
@ 2015-03-03 23:31 ` sfeldma
  2015-03-03 23:31 ` [PATCH net-next v3 5/7] switchdev: implement IPv4 fib ndo wrappers sfeldma
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: sfeldma @ 2015-03-03 23:31 UTC (permalink / raw)
  To: netdev, davem, jiri, roopa

From: Scott Feldman <sfeldma@gmail.com>

Keep switchdev FIB offload model simple for now and don't allow custom ip
rules.
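
A minimal userspace sketch of the intended behavior, for illustration
only: routes are not offloaded while custom rules are in use, and turning
custom rules on flushes whatever was already offloaded.  All names here
(struct model_net, model_route_add(), model_rule_configure(),
model_flush_external()) are hypothetical and only model the logic; in the
real series the external mark lives in fib_info flags.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_ROUTES 4

struct model_fib_entry {
	bool in_use;
	bool external;	/* models RTNH_F_EXTERNAL */
};

struct model_net {
	bool has_custom_rules;
	struct model_fib_entry routes[MODEL_MAX_ROUTES];
};

/* Drop the external mark from every route: forwarding returns to the kernel. */
static void model_flush_external(struct model_net *net)
{
	size_t i;

	for (i = 0; i < MODEL_MAX_ROUTES; i++)
		net->routes[i].external = false;
}

/* Install a route; offload it only when no custom rules are in use.
 * Returns true if the route ended up offloaded.
 */
static bool model_route_add(struct model_net *net, size_t idx)
{
	if (idx >= MODEL_MAX_ROUTES)
		return false;

	net->routes[idx].in_use = true;
	net->routes[idx].external = !net->has_custom_rules;
	return net->routes[idx].external;
}

/* Configuring a custom rule disables offload and flushes the device. */
static void model_rule_configure(struct model_net *net)
{
	net->has_custom_rules = true;
	model_flush_external(net);
}
```

Note that the flush only removes routes from the (modeled) device; the
routes themselves stay in the FIB.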

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
---
 include/net/ip_fib.h      |    2 ++
 net/ipv4/fib_frontend.c   |   13 +++++++++++++
 net/ipv4/fib_rules.c      |    3 +++
 net/ipv4/fib_trie.c       |   27 +++++++++++++++++++++++++++
 net/switchdev/switchdev.c |    4 ++++
 5 files changed, 49 insertions(+)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index cba4b7c..894a75c 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -195,6 +195,7 @@ int fib_table_delete(struct fib_table *, struct fib_config *);
 int fib_table_dump(struct fib_table *table, struct sk_buff *skb,
 		   struct netlink_callback *cb);
 int fib_table_flush(struct fib_table *table);
+void fib_table_flush_external(struct fib_table *table);
 void fib_free_table(struct fib_table *tb);
 
 
@@ -294,6 +295,7 @@ static inline int fib_num_tclassid_users(struct net *net)
 	return 0;
 }
 #endif
+void fib_flush_external(struct net *net);
 
 /* Exported by fib_semantics.c */
 int ip_fib_check_default(__be32 gw, struct net_device *dev);
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 57be71d..c33c19a 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -146,6 +146,19 @@ static void fib_flush(struct net *net)
 		rt_cache_flush(net);
 }
 
+void fib_flush_external(struct net *net)
+{
+	struct fib_table *tb;
+	struct hlist_head *head;
+	unsigned int h;
+
+	for (h = 0; h < FIB_TABLE_HASHSZ; h++) {
+		head = &net->ipv4.fib_table_hash[h];
+		hlist_for_each_entry(tb, head, tb_hlist)
+			fib_table_flush_external(tb);
+	}
+}
+
 /*
  * Find address type as if only "dev" was present in the system. If
  * on_dev is NULL then all interfaces are taken into consideration.
diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c
index d3db718..190d0d0 100644
--- a/net/ipv4/fib_rules.c
+++ b/net/ipv4/fib_rules.c
@@ -209,6 +209,8 @@ static int fib4_rule_configure(struct fib_rule *rule, struct sk_buff *skb,
 	rule4->tos = frh->tos;
 
 	net->ipv4.fib_has_custom_rules = true;
+	fib_flush_external(rule->fr_net);
+
 	err = 0;
 errout:
 	return err;
@@ -224,6 +226,7 @@ static void fib4_rule_delete(struct fib_rule *rule)
 		net->ipv4.fib_num_tclassid_users--;
 #endif
 	net->ipv4.fib_has_custom_rules = true;
+	fib_flush_external(rule->fr_net);
 }
 
 static int fib4_rule_compare(struct fib_rule *rule, struct fib_rule_hdr *frh,
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index f485345..32c0117 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1494,6 +1494,23 @@ int fib_table_delete(struct fib_table *tb, struct fib_config *cfg)
 	return 0;
 }
 
+static void trie_flush_leaf_external(struct fib_table *tb, struct tnode *l)
+{
+	struct hlist_node *tmp;
+	struct fib_alias *fa;
+
+	hlist_for_each_entry_safe(fa, tmp, &l->leaf, fa_list) {
+		struct fib_info *fi = fa->fa_info;
+
+		if (fi && (fi->fib_flags & RTNH_F_EXTERNAL)) {
+			netdev_switch_fib_ipv4_del(l->key,
+						   KEYLENGTH - fa->fa_slen,
+						   fi, fa->fa_tos,
+						   fa->fa_type, tb->tb_id);
+		}
+	}
+}
+
 static int trie_flush_leaf(struct tnode *l)
 {
 	struct hlist_node *tmp;
@@ -1616,6 +1633,16 @@ int fib_table_flush(struct fib_table *tb)
 	return found;
 }
 
+/* Caller must hold RTNL */
+void fib_table_flush_external(struct fib_table *tb)
+{
+	struct trie *t = (struct trie *)tb->tb_data;
+	struct tnode *l;
+
+	for (l = trie_firstleaf(t); l; l = trie_nextleaf(l))
+		trie_flush_leaf_external(tb, l);
+}
+
 void fib_free_table(struct fib_table *tb)
 {
 #ifdef CONFIG_IP_FIB_TRIE_STATS
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index 3c090f8..81c4c02 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -242,6 +242,10 @@ EXPORT_SYMBOL(ndo_dflt_netdev_switch_port_bridge_dellink);
 int netdev_switch_fib_ipv4_add(u32 dst, int dst_len, struct fib_info *fi,
 			       u8 tos, u8 type, u32 tb_id)
 {
+	/* Don't offload route if using custom ip rules */
+	if (fi->fib_net->ipv4.fib_has_custom_rules)
+		return 0;
+
 	return 0;
 }
 EXPORT_SYMBOL(netdev_switch_fib_ipv4_add);
-- 
1.7.10.4


* [PATCH net-next v3 5/7] switchdev: implement IPv4 fib ndo wrappers
  2015-03-03 23:31 [PATCH net-next v3 0/7] switchdev: add IPv4 routing offload sfeldma
                   ` (3 preceding siblings ...)
  2015-03-03 23:31 ` [PATCH net-next v3 4/7] switchdev: don't support custom ip rules, for now sfeldma
@ 2015-03-03 23:31 ` sfeldma
  2015-03-03 23:31 ` [PATCH net-next v3 6/7] fib: hook IPv4 fib for hardware offload sfeldma
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: sfeldma @ 2015-03-03 23:31 UTC (permalink / raw)
  To: netdev, davem, jiri, roopa

From: Scott Feldman <sfeldma@gmail.com>

Flesh out ndo wrappers to call into device driver.  To call into device driver,
the wrapper must iterate over route's nexthops to ensure all nexthop devs
belong to the same switch device.  Currently, there is no support for route's
nexthops spanning offloaded and non-offloaded devices, or spanning ports of
multiple offload devices.

Since switch device ports may be stacked under virtual interfaces (bonds and/or
bridges), and the route's nexthop may be on the virtual interface, the wrapper
will traverse the nexthop dev down to the base dev.  It's the base dev that's
passed to the switchdev driver's ndo ops.
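
The two lookups described above can be modeled in userspace along these
lines.  This is a sketch, not the kernel code: struct model_netdev and the
model_* helpers are hypothetical, lower devices are a plain array instead
of the kernel's adjacency lists, and each parent ID is compared against
the first nexthop's ID rather than the previous one (equivalent for an
equality check).

```c
#include <stddef.h>
#include <string.h>

#define MODEL_ID_MAX	8
#define MODEL_MAX_LOWER	4

/* Hypothetical netdev: either a switch port or a stacked virtual dev. */
struct model_netdev {
	int is_switch_port;
	unsigned char parent_id[MODEL_ID_MAX];
	size_t parent_id_len;
	struct model_netdev *lower[MODEL_MAX_LOWER];
	size_t n_lower;
};

/* Recursively search down the stack until a switch port dev is found. */
static struct model_netdev *model_lowest_dev(struct model_netdev *dev)
{
	size_t i;

	if (dev->is_switch_port)
		return dev;

	for (i = 0; i < dev->n_lower; i++) {
		struct model_netdev *port = model_lowest_dev(dev->lower[i]);

		if (port)
			return port;
	}

	return NULL;
}

/* Return a base dev only if all nexthop devs sit on the same switch. */
static struct model_netdev *model_dev_by_nhs(struct model_netdev **nh_devs,
					     size_t n)
{
	struct model_netdev *first = NULL;
	struct model_netdev *dev = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!nh_devs[i])
			return NULL;

		dev = model_lowest_dev(nh_devs[i]);
		if (!dev)
			return NULL;

		if (!first) {
			first = dev;
			continue;
		}

		if (first->parent_id_len != dev->parent_id_len)
			return NULL;
		if (memcmp(first->parent_id, dev->parent_id,
			   dev->parent_id_len))
			return NULL;
	}

	return dev;
}
```

A route whose nexthops mix switch ports with non-switch devices, or ports
of two different switches, yields NULL here, which corresponds to the
wrapper declining to offload the route.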

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
---
 net/switchdev/switchdev.c |   88 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 86 insertions(+), 2 deletions(-)

diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index 81c4c02..8701541 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -227,6 +227,64 @@ int ndo_dflt_netdev_switch_port_bridge_dellink(struct net_device *dev,
 }
 EXPORT_SYMBOL(ndo_dflt_netdev_switch_port_bridge_dellink);
 
+static struct net_device *netdev_switch_get_lowest_dev(struct net_device *dev)
+{
+	const struct net_device_ops *ops = dev->netdev_ops;
+	struct net_device *lower_dev;
+	struct net_device *port_dev;
+	struct list_head *iter;
+
+	/* Recursively search down until we find a sw port dev.
+	 * (A sw port dev supports ndo_switch_parent_id_get).
+	 */
+
+	if (ops->ndo_switch_parent_id_get)
+		return dev;
+
+	netdev_for_each_lower_dev(dev, lower_dev, iter) {
+		port_dev = netdev_switch_get_lowest_dev(lower_dev);
+		if (port_dev)
+			return port_dev;
+	}
+
+	return NULL;
+}
+
+static struct net_device *netdev_switch_get_dev_by_nhs(struct fib_info *fi)
+{
+	struct netdev_phys_item_id psid;
+	struct netdev_phys_item_id prev_psid;
+	struct net_device *dev = NULL;
+	int nhsel;
+
+	/* For this route, all nexthop devs must be on the same switch. */
+
+	for (nhsel = 0; nhsel < fi->fib_nhs; nhsel++) {
+		const struct fib_nh *nh = &fi->fib_nh[nhsel];
+
+		if (!nh->nh_dev)
+			return NULL;
+
+		dev = netdev_switch_get_lowest_dev(nh->nh_dev);
+		if (!dev)
+			return NULL;
+
+		if (netdev_switch_parent_id_get(dev, &psid))
+			return NULL;
+
+		if (nhsel > 0) {
+			if (prev_psid.id_len != psid.id_len)
+				return NULL;
+			if (memcmp(prev_psid.id, psid.id, psid.id_len))
+				return NULL;
+		}
+
+		prev_psid = psid;
+	}
+
+	return dev;
+}
+
 /**
  *	netdev_switch_fib_ipv4_add - Add IPv4 route entry to switch
  *
@@ -242,11 +300,24 @@ EXPORT_SYMBOL(ndo_dflt_netdev_switch_port_bridge_dellink);
 int netdev_switch_fib_ipv4_add(u32 dst, int dst_len, struct fib_info *fi,
 			       u8 tos, u8 type, u32 tb_id)
 {
+	struct net_device *dev;
+	const struct net_device_ops *ops;
+	int err = 0;
+
 	/* Don't offload route if using custom ip rules */
 	if (fi->fib_net->ipv4.fib_has_custom_rules)
 		return 0;
 
-	return 0;
+	dev = netdev_switch_get_dev_by_nhs(fi);
+	if (!dev)
+		return 0;
+	ops = dev->netdev_ops;
+
+	if (ops->ndo_switch_fib_ipv4_add)
+		err = ops->ndo_switch_fib_ipv4_add(dev, htonl(dst), dst_len,
+						   fi, tos, type, tb_id);
+
+	return err;
 }
 EXPORT_SYMBOL(netdev_switch_fib_ipv4_add);
 
@@ -265,6 +336,19 @@ EXPORT_SYMBOL(netdev_switch_fib_ipv4_add);
 int netdev_switch_fib_ipv4_del(u32 dst, int dst_len, struct fib_info *fi,
 			       u8 tos, u8 type, u32 tb_id)
 {
-	return 0;
+	struct net_device *dev;
+	const struct net_device_ops *ops;
+	int err = 0;
+
+	dev = netdev_switch_get_dev_by_nhs(fi);
+	if (!dev)
+		return 0;
+	ops = dev->netdev_ops;
+
+	if (ops->ndo_switch_fib_ipv4_del)
+		err = ops->ndo_switch_fib_ipv4_del(dev, htonl(dst), dst_len,
+						   fi, tos, type, tb_id);
+
+	return err;
 }
 EXPORT_SYMBOL(netdev_switch_fib_ipv4_del);
-- 
1.7.10.4


* [PATCH net-next v3 6/7] fib: hook IPv4 fib for hardware offload
  2015-03-03 23:31 [PATCH net-next v3 0/7] switchdev: add IPv4 routing offload sfeldma
                   ` (4 preceding siblings ...)
  2015-03-03 23:31 ` [PATCH net-next v3 5/7] switchdev: implement IPv4 fib ndo wrappers sfeldma
@ 2015-03-03 23:31 ` sfeldma
  2015-03-04  0:01   ` Alexander Duyck
  2015-03-05  7:03   ` John Fastabend
  2015-03-03 23:32 ` [PATCH net-next v3 7/7] rocker: implement IPv4 fib offloading sfeldma
  2015-03-04  5:38 ` [PATCH net-next v3 0/7] switchdev: add IPv4 routing offload David Miller
  7 siblings, 2 replies; 19+ messages in thread
From: sfeldma @ 2015-03-03 23:31 UTC (permalink / raw)
  To: netdev, davem, jiri, roopa

From: Scott Feldman <sfeldma@gmail.com>

Call into the switchdev driver any time an IPv4 fib entry is
added/modified/deleted from the kernel's FIB.  The switchdev driver may or
may not install the route to the offload device.  In the case where the
driver tries to install the route and something goes wrong (device's routing
table is full, etc), then all of the offloaded routes will be flushed from the
device, and route forwarding falls back to the kernel.

We can refine this fail-over logic in subsequent patches.  For now, use the
simplest model of offloading routes up to the point of failure, and then on
failure, undo everything.
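
As an illustration of the "offload until failure, then undo everything"
model, here is a small userspace sketch.  It is an assumption-laden model,
not the kernel code: struct model_fib, model_hw_add(),
model_flush_external() and model_table_insert() are hypothetical names,
and the device is reduced to a counter with a fixed capacity.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_HW_CAPACITY	2	/* hypothetical device table size */
#define MODEL_FIB_SIZE		4

struct model_route {
	bool in_kernel;
	bool offloaded;
};

struct model_fib {
	struct model_route routes[MODEL_FIB_SIZE];
	size_t hw_used;
};

/* Try to program one route into the device; fails when the table is full. */
static int model_hw_add(struct model_fib *fib, size_t idx)
{
	if (fib->hw_used >= MODEL_HW_CAPACITY)
		return -ENOSPC;

	fib->hw_used++;
	fib->routes[idx].offloaded = true;
	return 0;
}

/* Remove every offloaded route from the device. */
static void model_flush_external(struct model_fib *fib)
{
	size_t i;

	for (i = 0; i < MODEL_FIB_SIZE; i++)
		fib->routes[i].offloaded = false;
	fib->hw_used = 0;
}

/* Insert a route: offload first, and on failure flush all offloaded routes. */
static int model_table_insert(struct model_fib *fib, size_t idx)
{
	int err;

	if (idx >= MODEL_FIB_SIZE)
		return -EINVAL;

	err = model_hw_add(fib, idx);
	if (err) {
		model_flush_external(fib);
		return err;
	}

	fib->routes[idx].in_kernel = true;
	return 0;
}
```

In this model, as in the hunks below, the failing insert returns the
driver's error, while routes inserted earlier stay in the kernel FIB and
are simply no longer offloaded.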

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
---
 net/ipv4/fib_trie.c |   36 +++++++++++++++++++++++++++++++++---
 1 file changed, 33 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 32c0117..668f09b 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -79,6 +79,7 @@
 #include <net/tcp.h>
 #include <net/sock.h>
 #include <net/ip_fib.h>
+#include <net/switchdev.h>
 #include "fib_lookup.h"
 
 #define MAX_STAT_DEPTH 32
@@ -1161,7 +1162,18 @@ int fib_table_insert(struct fib_table *tb, struct fib_config *cfg)
 			new_fa->fa_state = state & ~FA_S_ACCESSED;
 			new_fa->fa_slen = fa->fa_slen;
 
+			err = netdev_switch_fib_ipv4_add(key, plen, fi,
+							 new_fa->fa_tos,
+							 cfg->fc_type,
+							 tb->tb_id);
+			if (err) {
+				fib_flush_external(fi->fib_net);
+				kmem_cache_free(fn_alias_kmem, new_fa);
+				goto out;
+			}
+
 			hlist_replace_rcu(&fa->fa_list, &new_fa->fa_list);
+
 			alias_free_mem_rcu(fa);
 
 			fib_release_info(fi_drop);
@@ -1197,12 +1209,20 @@ int fib_table_insert(struct fib_table *tb, struct fib_config *cfg)
 	new_fa->fa_state = 0;
 	new_fa->fa_slen = slen;
 
+	/* (Optionally) offload fib entry to switch hardware. */
+	err = netdev_switch_fib_ipv4_add(key, plen, fi, tos,
+					 cfg->fc_type, tb->tb_id);
+	if (err) {
+		fib_flush_external(fi->fib_net);
+		goto out_free_new_fa;
+	}
+
 	/* Insert new entry to the list. */
 	if (!l) {
 		l = fib_insert_node(t, key, plen);
 		if (unlikely(!l)) {
 			err = -ENOMEM;
-			goto out_free_new_fa;
+			goto out_sw_fib_del;
 		}
 	}
 
@@ -1217,6 +1237,8 @@ int fib_table_insert(struct fib_table *tb, struct fib_config *cfg)
 succeeded:
 	return 0;
 
+out_sw_fib_del:
+	netdev_switch_fib_ipv4_del(key, plen, fi, tos, cfg->fc_type, tb->tb_id);
 out_free_new_fa:
 	kmem_cache_free(fn_alias_kmem, new_fa);
 out:
@@ -1475,6 +1497,10 @@ int fib_table_delete(struct fib_table *tb, struct fib_config *cfg)
 		return -ESRCH;
 
 	fa = fa_to_delete;
+
+	netdev_switch_fib_ipv4_del(key, plen, fa->fa_info, tos,
+				   cfg->fc_type, tb->tb_id);
+
 	rtmsg_fib(RTM_DELROUTE, htonl(key), fa, plen, tb->tb_id,
 		  &cfg->fc_nlinfo, 0);
 
@@ -1511,7 +1537,7 @@ static void trie_flush_leaf_external(struct fib_table *tb, struct tnode *l)
 	}
 }
 
-static int trie_flush_leaf(struct tnode *l)
+static int trie_flush_leaf(struct fib_table *tb, struct tnode *l)
 {
 	struct hlist_node *tmp;
 	unsigned char slen = 0;
@@ -1522,6 +1548,10 @@ static int trie_flush_leaf(struct tnode *l)
 		struct fib_info *fi = fa->fa_info;
 
 		if (fi && (fi->fib_flags & RTNH_F_DEAD)) {
+			netdev_switch_fib_ipv4_del(l->key,
+						   KEYLENGTH - fa->fa_slen,
+						   fi, fa->fa_tos,
+						   fa->fa_type, tb->tb_id);
 			hlist_del_rcu(&fa->fa_list);
 			fib_release_info(fa->fa_info);
 			alias_free_mem_rcu(fa);
@@ -1610,7 +1640,7 @@ int fib_table_flush(struct fib_table *tb)
 	int found = 0;
 
 	for (l = trie_firstleaf(t); l; l = trie_nextleaf(l)) {
-		found += trie_flush_leaf(l);
+		found += trie_flush_leaf(tb, l);
 
 		if (ll) {
 			if (hlist_empty(&ll->leaf))
-- 
1.7.10.4


* [PATCH net-next v3 7/7] rocker: implement IPv4 fib offloading
  2015-03-03 23:31 [PATCH net-next v3 0/7] switchdev: add IPv4 routing offload sfeldma
                   ` (5 preceding siblings ...)
  2015-03-03 23:31 ` [PATCH net-next v3 6/7] fib: hook IPv4 fib for hardware offload sfeldma
@ 2015-03-03 23:32 ` sfeldma
  2015-03-04  5:38 ` [PATCH net-next v3 0/7] switchdev: add IPv4 routing offload David Miller
  7 siblings, 0 replies; 19+ messages in thread
From: sfeldma @ 2015-03-03 23:32 UTC (permalink / raw)
  To: netdev, davem, jiri, roopa

From: Scott Feldman <sfeldma@gmail.com>

The driver implements ndo_switch_fib_ipv4_add/del ops to add/del/mod IPv4
routes to/from switchdev device.  Once a route is added to the device, and the
route's nexthops are resolved to neighbor MAC address, the device will forward
matching pkts rather than the kernel.  This offloads the L3 forwarding path
from the kernel to the device.  Note that control and management planes are
still managed by Linux; only the data plane is offloaded.  Standard routing
control protocols such as OSPF and BGP run on Linux and manage the kernel's FIB
via standard rtm netlink msgs...nothing changes here.

A new hash table is added to rocker to track neighbors.  The driver listens for
neighbor update events using netevent notifier NETEVENT_NEIGH_UPDATE.  Any ARP
table updates for ports on this device are recorded in this table.  Routes
installed to the device with nexthops that reference neighbors in this table
are "qualified".  In the case of a route with nexthops not resolved in the
table, the kernel is asked to resolve the nexthop.
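
The neighbor tracking described above might be modeled in userspace like
this.  It is a deliberately simplified sketch: a linear array stands in for
the driver's hash table, there is no locking or ref counting, and
struct model_neigh_tbl, model_neigh_update() and model_nexthop_qualified()
are hypothetical names rather than the driver's functions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_ETH_ALEN	6
#define MODEL_NEIGH_MAX	8

struct model_neigh {
	bool in_use;
	uint32_t ip_addr;	/* key */
	unsigned char eth_dst[MODEL_ETH_ALEN];
};

struct model_neigh_tbl {
	struct model_neigh entries[MODEL_NEIGH_MAX];
};

static struct model_neigh *model_neigh_find(struct model_neigh_tbl *tbl,
					    uint32_t ip_addr)
{
	size_t i;

	for (i = 0; i < MODEL_NEIGH_MAX; i++) {
		struct model_neigh *n = &tbl->entries[i];

		if (n->in_use && n->ip_addr == ip_addr)
			return n;
	}
	return NULL;
}

/* Record a neighbor update (IP -> MAC); returns -1 if the table is full. */
static int model_neigh_update(struct model_neigh_tbl *tbl, uint32_t ip_addr,
			      const unsigned char *eth_dst)
{
	struct model_neigh *n = model_neigh_find(tbl, ip_addr);
	size_t i;

	for (i = 0; !n && i < MODEL_NEIGH_MAX; i++) {
		if (!tbl->entries[i].in_use)
			n = &tbl->entries[i];
	}
	if (!n)
		return -1;

	n->in_use = true;
	n->ip_addr = ip_addr;
	memcpy(n->eth_dst, eth_dst, MODEL_ETH_ALEN);
	return 0;
}

/* A nexthop is "qualified" once its neighbor is present in the table. */
static bool model_nexthop_qualified(struct model_neigh_tbl *tbl,
				    uint32_t nh_ip_addr)
{
	return model_neigh_find(tbl, nh_ip_addr) != NULL;
}
```

A caller that finds a nexthop unqualified would, per the description
above, ask the kernel to resolve it and wait for the corresponding
neighbor update to arrive.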

The driver uses fib_info->fib_priority for the priority field in rocker's
unicast routing table.

The device can only forward pkts matching a route dst to resolved nexthops.
Currently, the device only supports single-path routes (i.e. routes with one
nexthop).  Equal Cost Multipath (ECMP) route support will be added in followup
patches.

This patch is driver support for unicast IPv4 routing only.  Followup patches
will add driver and infrastructure for IPv6 routing and multicast routing.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 drivers/net/ethernet/rocker/rocker.c |  527 +++++++++++++++++++++++++++++++---
 1 file changed, 482 insertions(+), 45 deletions(-)

diff --git a/drivers/net/ethernet/rocker/rocker.c b/drivers/net/ethernet/rocker/rocker.c
index e5a15a4..569d3d8 100644
--- a/drivers/net/ethernet/rocker/rocker.c
+++ b/drivers/net/ethernet/rocker/rocker.c
@@ -32,6 +32,9 @@
 #include <linux/bitops.h>
 #include <net/switchdev.h>
 #include <net/rtnetlink.h>
+#include <net/ip_fib.h>
+#include <net/netevent.h>
+#include <net/arp.h>
 #include <asm-generic/io-64-nonatomic-lo-hi.h>
 #include <generated/utsrelease.h>
 
@@ -111,9 +114,10 @@ struct rocker_flow_tbl_key {
 
 struct rocker_flow_tbl_entry {
 	struct hlist_node entry;
-	u32 ref_count;
+	u32 cmd;
 	u64 cookie;
 	struct rocker_flow_tbl_key key;
+	size_t key_len;
 	u32 key_crc32; /* key */
 };
 
@@ -161,6 +165,16 @@ struct rocker_internal_vlan_tbl_entry {
 	__be16 vlan_id;
 };
 
+struct rocker_neigh_tbl_entry {
+	struct hlist_node entry;
+	__be32 ip_addr; /* key */
+	struct net_device *dev;
+	u32 ref_count;
+	u32 index;
+	u8 eth_dst[ETH_ALEN];
+	bool ttl_check;
+};
+
 struct rocker_desc_info {
 	char *data; /* mapped */
 	size_t data_size;
@@ -234,6 +248,9 @@ struct rocker {
 	unsigned long internal_vlan_bitmap[ROCKER_INTERNAL_VLAN_BITMAP_LEN];
 	DECLARE_HASHTABLE(internal_vlan_tbl, 8);
 	spinlock_t internal_vlan_tbl_lock;
+	DECLARE_HASHTABLE(neigh_tbl, 16);
+	spinlock_t neigh_tbl_lock;
+	u32 neigh_tbl_next_index;
 };
 
 static const u8 zero_mac[ETH_ALEN]   = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 };
@@ -256,7 +273,6 @@ enum {
 	ROCKER_PRIORITY_VLAN = 1,
 	ROCKER_PRIORITY_TERM_MAC_UCAST = 0,
 	ROCKER_PRIORITY_TERM_MAC_MCAST = 1,
-	ROCKER_PRIORITY_UNICAST_ROUTING = 1,
 	ROCKER_PRIORITY_BRIDGING_VLAN_DFLT_EXACT = 1,
 	ROCKER_PRIORITY_BRIDGING_VLAN_DFLT_WILD = 2,
 	ROCKER_PRIORITY_BRIDGING_VLAN = 3,
@@ -1940,8 +1956,7 @@ static int rocker_cmd_flow_tbl_add(struct rocker *rocker,
 	struct rocker_tlv *cmd_info;
 	int err = 0;
 
-	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_CMD_TYPE,
-			       ROCKER_TLV_CMD_TYPE_OF_DPA_FLOW_ADD))
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_CMD_TYPE, entry->cmd))
 		return -EMSGSIZE;
 	cmd_info = rocker_tlv_nest_start(desc_info, ROCKER_TLV_CMD_INFO);
 	if (!cmd_info)
@@ -1998,8 +2013,7 @@ static int rocker_cmd_flow_tbl_del(struct rocker *rocker,
 	const struct rocker_flow_tbl_entry *entry = priv;
 	struct rocker_tlv *cmd_info;
 
-	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_CMD_TYPE,
-			       ROCKER_TLV_CMD_TYPE_OF_DPA_FLOW_DEL))
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_CMD_TYPE, entry->cmd))
 		return -EMSGSIZE;
 	cmd_info = rocker_tlv_nest_start(desc_info, ROCKER_TLV_CMD_INFO);
 	if (!cmd_info)
@@ -2168,9 +2182,9 @@ static int rocker_cmd_group_tbl_del(struct rocker *rocker,
 	return 0;
 }
 
-/*****************************************
- * Flow, group, FDB, internal VLAN tables
- *****************************************/
+/***************************************************
+ * Flow, group, FDB, internal VLAN and neigh tables
+ ***************************************************/
 
 static int rocker_init_tbls(struct rocker *rocker)
 {
@@ -2186,6 +2200,9 @@ static int rocker_init_tbls(struct rocker *rocker)
 	hash_init(rocker->internal_vlan_tbl);
 	spin_lock_init(&rocker->internal_vlan_tbl_lock);
 
+	hash_init(rocker->neigh_tbl);
+	spin_lock_init(&rocker->neigh_tbl_lock);
+
 	return 0;
 }
 
@@ -2196,6 +2213,7 @@ static void rocker_free_tbls(struct rocker *rocker)
 	struct rocker_group_tbl_entry *group_entry;
 	struct rocker_fdb_tbl_entry *fdb_entry;
 	struct rocker_internal_vlan_tbl_entry *internal_vlan_entry;
+	struct rocker_neigh_tbl_entry *neigh_entry;
 	struct hlist_node *tmp;
 	int bkt;
 
@@ -2219,16 +2237,22 @@ static void rocker_free_tbls(struct rocker *rocker)
 			   tmp, internal_vlan_entry, entry)
 		hash_del(&internal_vlan_entry->entry);
 	spin_unlock_irqrestore(&rocker->internal_vlan_tbl_lock, flags);
+
+	spin_lock_irqsave(&rocker->neigh_tbl_lock, flags);
+	hash_for_each_safe(rocker->neigh_tbl, bkt, tmp, neigh_entry, entry)
+		hash_del(&neigh_entry->entry);
+	spin_unlock_irqrestore(&rocker->neigh_tbl_lock, flags);
 }
 
 static struct rocker_flow_tbl_entry *
 rocker_flow_tbl_find(struct rocker *rocker, struct rocker_flow_tbl_entry *match)
 {
 	struct rocker_flow_tbl_entry *found;
+	size_t key_len = match->key_len ? match->key_len : sizeof(found->key);
 
 	hash_for_each_possible(rocker->flow_tbl, found,
 			       entry, match->key_crc32) {
-		if (memcmp(&found->key, &match->key, sizeof(found->key)) == 0)
+		if (memcmp(&found->key, &match->key, key_len) == 0)
 			return found;
 	}
 
@@ -2241,42 +2265,34 @@ static int rocker_flow_tbl_add(struct rocker_port *rocker_port,
 {
 	struct rocker *rocker = rocker_port->rocker;
 	struct rocker_flow_tbl_entry *found;
+	size_t key_len = match->key_len ? match->key_len : sizeof(found->key);
 	unsigned long flags;
-	bool add_to_hw = false;
-	int err = 0;
 
-	match->key_crc32 = crc32(~0, &match->key, sizeof(match->key));
+	match->key_crc32 = crc32(~0, &match->key, key_len);
 
 	spin_lock_irqsave(&rocker->flow_tbl_lock, flags);
 
 	found = rocker_flow_tbl_find(rocker, match);
 
 	if (found) {
-		kfree(match);
+		match->cookie = found->cookie;
+		hash_del(&found->entry);
+		kfree(found);
+		found = match;
+		found->cmd = ROCKER_TLV_CMD_TYPE_OF_DPA_FLOW_MOD;
 	} else {
 		found = match;
 		found->cookie = rocker->flow_tbl_next_cookie++;
-		hash_add(rocker->flow_tbl, &found->entry, found->key_crc32);
-		add_to_hw = true;
+		found->cmd = ROCKER_TLV_CMD_TYPE_OF_DPA_FLOW_ADD;
 	}
 
-	found->ref_count++;
+	hash_add(rocker->flow_tbl, &found->entry, found->key_crc32);
 
 	spin_unlock_irqrestore(&rocker->flow_tbl_lock, flags);
 
-	if (add_to_hw) {
-		err = rocker_cmd_exec(rocker, rocker_port,
-				      rocker_cmd_flow_tbl_add,
-				      found, NULL, NULL, nowait);
-		if (err) {
-			spin_lock_irqsave(&rocker->flow_tbl_lock, flags);
-			hash_del(&found->entry);
-			spin_unlock_irqrestore(&rocker->flow_tbl_lock, flags);
-			kfree(found);
-		}
-	}
-
-	return err;
+	return rocker_cmd_exec(rocker, rocker_port,
+			       rocker_cmd_flow_tbl_add,
+			       found, NULL, NULL, nowait);
 }
 
 static int rocker_flow_tbl_del(struct rocker_port *rocker_port,
@@ -2285,29 +2301,26 @@ static int rocker_flow_tbl_del(struct rocker_port *rocker_port,
 {
 	struct rocker *rocker = rocker_port->rocker;
 	struct rocker_flow_tbl_entry *found;
+	size_t key_len = match->key_len ? match->key_len : sizeof(found->key);
 	unsigned long flags;
-	bool del_from_hw = false;
 	int err = 0;
 
-	match->key_crc32 = crc32(~0, &match->key, sizeof(match->key));
+	match->key_crc32 = crc32(~0, &match->key, key_len);
 
 	spin_lock_irqsave(&rocker->flow_tbl_lock, flags);
 
 	found = rocker_flow_tbl_find(rocker, match);
 
 	if (found) {
-		found->ref_count--;
-		if (found->ref_count == 0) {
-			hash_del(&found->entry);
-			del_from_hw = true;
-		}
+		hash_del(&found->entry);
+		found->cmd = ROCKER_TLV_CMD_TYPE_OF_DPA_FLOW_DEL;
 	}
 
 	spin_unlock_irqrestore(&rocker->flow_tbl_lock, flags);
 
 	kfree(match);
 
-	if (del_from_hw) {
+	if (found) {
 		err = rocker_cmd_exec(rocker, rocker_port,
 				      rocker_cmd_flow_tbl_del,
 				      found, NULL, NULL, nowait);
@@ -2467,6 +2480,31 @@ static int rocker_flow_tbl_bridge(struct rocker_port *rocker_port,
 	return rocker_flow_tbl_do(rocker_port, flags, entry);
 }
 
+static int rocker_flow_tbl_ucast4_routing(struct rocker_port *rocker_port,
+					  __be16 eth_type, __be32 dst,
+					  __be32 dst_mask, u32 priority,
+					  enum rocker_of_dpa_table_id goto_tbl,
+					  u32 group_id, int flags)
+{
+	struct rocker_flow_tbl_entry *entry;
+
+	entry = kzalloc(sizeof(*entry), rocker_op_flags_gfp(flags));
+	if (!entry)
+		return -ENOMEM;
+
+	entry->key.tbl_id = ROCKER_OF_DPA_TABLE_ID_UNICAST_ROUTING;
+	entry->key.priority = priority;
+	entry->key.ucast_routing.eth_type = eth_type;
+	entry->key.ucast_routing.dst4 = dst;
+	entry->key.ucast_routing.dst4_mask = dst_mask;
+	entry->key.ucast_routing.goto_tbl = goto_tbl;
+	entry->key.ucast_routing.group_id = group_id;
+	entry->key_len = offsetof(struct rocker_flow_tbl_key,
+				  ucast_routing.group_id);
+
+	return rocker_flow_tbl_do(rocker_port, flags, entry);
+}
+
 static int rocker_flow_tbl_acl(struct rocker_port *rocker_port,
 			       int flags, u32 in_pport,
 			       u32 in_pport_mask,
@@ -2554,7 +2592,6 @@ static int rocker_group_tbl_add(struct rocker_port *rocker_port,
 	struct rocker *rocker = rocker_port->rocker;
 	struct rocker_group_tbl_entry *found;
 	unsigned long flags;
-	int err = 0;
 
 	spin_lock_irqsave(&rocker->group_tbl_lock, flags);
 
@@ -2574,12 +2611,9 @@ static int rocker_group_tbl_add(struct rocker_port *rocker_port,
 
 	spin_unlock_irqrestore(&rocker->group_tbl_lock, flags);
 
-	if (found->cmd)
-		err = rocker_cmd_exec(rocker, rocker_port,
-				      rocker_cmd_group_tbl_add,
-				      found, NULL, NULL, nowait);
-
-	return err;
+	return rocker_cmd_exec(rocker, rocker_port,
+			       rocker_cmd_group_tbl_add,
+			       found, NULL, NULL, nowait);
 }
 
 static int rocker_group_tbl_del(struct rocker_port *rocker_port,
@@ -2675,6 +2709,244 @@ static int rocker_group_l2_flood(struct rocker_port *rocker_port,
 				       group_id);
 }
 
+static int rocker_group_l3_unicast(struct rocker_port *rocker_port,
+				   int flags, u32 index, u8 *src_mac,
+				   u8 *dst_mac, __be16 vlan_id,
+				   bool ttl_check, u32 pport)
+{
+	struct rocker_group_tbl_entry *entry;
+
+	entry = kzalloc(sizeof(*entry), rocker_op_flags_gfp(flags));
+	if (!entry)
+		return -ENOMEM;
+
+	entry->group_id = ROCKER_GROUP_L3_UNICAST(index);
+	if (src_mac)
+		ether_addr_copy(entry->l3_unicast.eth_src, src_mac);
+	if (dst_mac)
+		ether_addr_copy(entry->l3_unicast.eth_dst, dst_mac);
+	entry->l3_unicast.vlan_id = vlan_id;
+	entry->l3_unicast.ttl_check = ttl_check;
+	entry->l3_unicast.group_id = ROCKER_GROUP_L2_INTERFACE(vlan_id, pport);
+
+	return rocker_group_tbl_do(rocker_port, flags, entry);
+}
+
+static struct rocker_neigh_tbl_entry *
+	rocker_neigh_tbl_find(struct rocker *rocker, __be32 ip_addr)
+{
+	struct rocker_neigh_tbl_entry *found;
+
+	hash_for_each_possible(rocker->neigh_tbl, found, entry, ip_addr)
+		if (found->ip_addr == ip_addr)
+			return found;
+
+	return NULL;
+}
+
+static void _rocker_neigh_add(struct rocker *rocker,
+			      struct rocker_neigh_tbl_entry *entry)
+{
+	entry->index = rocker->neigh_tbl_next_index++;
+	entry->ref_count++;
+	hash_add(rocker->neigh_tbl, &entry->entry, entry->ip_addr);
+}
+
+static void _rocker_neigh_del(struct rocker *rocker,
+			      struct rocker_neigh_tbl_entry *entry)
+{
+	if (--entry->ref_count == 0) {
+		hash_del(&entry->entry);
+		kfree(entry);
+	}
+}
+
+static void _rocker_neigh_update(struct rocker *rocker,
+				 struct rocker_neigh_tbl_entry *entry,
+				 u8 *eth_dst, bool ttl_check)
+{
+	if (eth_dst) {
+		ether_addr_copy(entry->eth_dst, eth_dst);
+		entry->ttl_check = ttl_check;
+	} else {
+		entry->ref_count++;
+	}
+}
+
+static int rocker_port_ipv4_neigh(struct rocker_port *rocker_port,
+				  int flags, __be32 ip_addr, u8 *eth_dst)
+{
+	struct rocker *rocker = rocker_port->rocker;
+	struct rocker_neigh_tbl_entry *entry;
+	struct rocker_neigh_tbl_entry *found;
+	unsigned long lock_flags;
+	__be16 eth_type = htons(ETH_P_IP);
+	enum rocker_of_dpa_table_id goto_tbl =
+		ROCKER_OF_DPA_TABLE_ID_ACL_POLICY;
+	u32 group_id;
+	u32 priority = 0;
+	bool adding = !(flags & ROCKER_OP_FLAG_REMOVE);
+	bool updating;
+	bool removing;
+	int err = 0;
+
+	entry = kzalloc(sizeof(*entry), rocker_op_flags_gfp(flags));
+	if (!entry)
+		return -ENOMEM;
+
+	spin_lock_irqsave(&rocker->neigh_tbl_lock, lock_flags);
+
+	found = rocker_neigh_tbl_find(rocker, ip_addr);
+
+	updating = found && adding;
+	removing = found && !adding;
+	adding = !found && adding;
+
+	if (adding) {
+		entry->ip_addr = ip_addr;
+		entry->dev = rocker_port->dev;
+		ether_addr_copy(entry->eth_dst, eth_dst);
+		entry->ttl_check = true;
+		_rocker_neigh_add(rocker, entry);
+	} else if (removing) {
+		memcpy(entry, found, sizeof(*entry));
+		_rocker_neigh_del(rocker, found);
+	} else if (updating) {
+		_rocker_neigh_update(rocker, found, eth_dst, true);
+		memcpy(entry, found, sizeof(*entry));
+	} else {
+		err = -ENOENT;
+	}
+
+	spin_unlock_irqrestore(&rocker->neigh_tbl_lock, lock_flags);
+
+	if (err)
+		goto err_out;
+
+	/* For each active neighbor, we have an L3 unicast group and
+	 * a /32 route to the neighbor, which uses the L3 unicast
+	 * group.  The L3 unicast group can also be referred to by
+	 * other routes' nexthops.
+	 */
+
+	err = rocker_group_l3_unicast(rocker_port, flags,
+				      entry->index,
+				      rocker_port->dev->dev_addr,
+				      entry->eth_dst,
+				      rocker_port->internal_vlan_id,
+				      entry->ttl_check,
+				      rocker_port->pport);
+	if (err) {
+		netdev_err(rocker_port->dev,
+			   "Error (%d) L3 unicast group index %d\n",
+			   err, entry->index);
+		goto err_out;
+	}
+
+	if (adding || removing) {
+		group_id = ROCKER_GROUP_L3_UNICAST(entry->index);
+		err = rocker_flow_tbl_ucast4_routing(rocker_port,
+						     eth_type, ip_addr,
+						     inet_make_mask(32),
+						     priority, goto_tbl,
+						     group_id, flags);
+
+		if (err)
+			netdev_err(rocker_port->dev,
+				   "Error (%d) /32 unicast route %pI4 group 0x%08x\n",
+				   err, &entry->ip_addr, group_id);
+	}
+
+err_out:
+	if (!adding)
+		kfree(entry);
+
+	return err;
+}
+
+static int rocker_port_ipv4_resolve(struct rocker_port *rocker_port,
+				    __be32 ip_addr)
+{
+	struct net_device *dev = rocker_port->dev;
+	struct neighbour *n = __ipv4_neigh_lookup(dev, ip_addr);
+	int err = 0;
+
+	if (!n)
+		n = neigh_create(&arp_tbl, &ip_addr, dev);
+	if (IS_ERR(n))
+		return PTR_ERR(n);
+
+	/* If the neigh is already resolved, then go ahead and
+	 * install the entry, otherwise start the ARP process to
+	 * resolve the neigh.
+	 */
+
+	if (n->nud_state & NUD_VALID)
+		err = rocker_port_ipv4_neigh(rocker_port, 0, ip_addr, n->ha);
+	else
+		neigh_event_send(n, NULL);
+
+	return err;
+}
+
+static int rocker_port_ipv4_nh(struct rocker_port *rocker_port, int flags,
+			       __be32 ip_addr, u32 *index)
+{
+	struct rocker *rocker = rocker_port->rocker;
+	struct rocker_neigh_tbl_entry *entry;
+	struct rocker_neigh_tbl_entry *found;
+	unsigned long lock_flags;
+	bool adding = !(flags & ROCKER_OP_FLAG_REMOVE);
+	bool updating;
+	bool removing;
+	bool resolved = true;
+	int err = 0;
+
+	entry = kzalloc(sizeof(*entry), rocker_op_flags_gfp(flags));
+	if (!entry)
+		return -ENOMEM;
+
+	spin_lock_irqsave(&rocker->neigh_tbl_lock, lock_flags);
+
+	found = rocker_neigh_tbl_find(rocker, ip_addr);
+	if (found)
+		*index = found->index;
+
+	updating = found && adding;
+	removing = found && !adding;
+	adding = !found && adding;
+
+	if (adding) {
+		entry->ip_addr = ip_addr;
+		entry->dev = rocker_port->dev;
+		_rocker_neigh_add(rocker, entry);
+		*index = entry->index;
+		resolved = false;
+	} else if (removing) {
+		_rocker_neigh_del(rocker, found);
+	} else if (updating) {
+		_rocker_neigh_update(rocker, found, NULL, false);
+		resolved = !is_zero_ether_addr(found->eth_dst);
+	} else {
+		err = -ENOENT;
+	}
+
+	spin_unlock_irqrestore(&rocker->neigh_tbl_lock, lock_flags);
+
+	if (!adding)
+		kfree(entry);
+
+	if (err)
+		return err;
+
+	/* Resolved means neigh ip_addr is resolved to neigh mac. */
+
+	if (!resolved)
+		err = rocker_port_ipv4_resolve(rocker_port, ip_addr);
+
+	return err;
+}
+
 static int rocker_port_vlan_flood_group(struct rocker_port *rocker_port,
 					int flags, __be16 vlan_id)
 {
@@ -3429,6 +3701,84 @@ not_found:
 	spin_unlock_irqrestore(&rocker->internal_vlan_tbl_lock, lock_flags);
 }
 
+static int rocker_port_fib_ipv4(struct rocker_port *rocker_port, __be32 dst,
+				int dst_len, struct fib_info *fi, u32 tb_id,
+				int flags)
+{
+	struct fib_nh *nh;
+	__be16 eth_type = htons(ETH_P_IP);
+	__be32 dst_mask = inet_make_mask(dst_len);
+	__be16 internal_vlan_id = rocker_port->internal_vlan_id;
+	u32 priority = fi->fib_priority;
+	enum rocker_of_dpa_table_id goto_tbl =
+		ROCKER_OF_DPA_TABLE_ID_ACL_POLICY;
+	u32 group_id;
+	bool nh_on_port;
+	bool has_gw;
+	u32 index;
+	int err;
+
+	/* XXX support ECMP */
+
+	nh = fi->fib_nh;
+	nh_on_port = (fi->fib_dev == rocker_port->dev);
+	has_gw = !!nh->nh_gw;
+
+	if (has_gw && nh_on_port) {
+		err = rocker_port_ipv4_nh(rocker_port, flags,
+					  nh->nh_gw, &index);
+		if (err)
+			return err;
+
+		group_id = ROCKER_GROUP_L3_UNICAST(index);
+	} else {
+		/* Send to CPU for processing */
+		group_id = ROCKER_GROUP_L2_INTERFACE(internal_vlan_id, 0);
+	}
+
+	err = rocker_flow_tbl_ucast4_routing(rocker_port, eth_type, dst,
+					     dst_mask, priority, goto_tbl,
+					     group_id, flags);
+	if (err)
+		netdev_err(rocker_port->dev, "Error (%d) IPv4 route %pI4\n",
+			   err, &dst);
+
+	return err;
+}
+
+static bool rocker_port_fib_ipv4_skip(struct net_device *dev,
+				      __be32 dst, int dst_len,
+				      struct fib_info *fi,
+				      u8 tos, u8 type, u32 tb_id)
+{
+	if (fi->fib_flags & RTM_F_CLONED)
+		return true;
+
+	if (tb_id != RT_TABLE_MAIN && tb_id != RT_TABLE_LOCAL)
+		return true;
+
+	if (type != RTN_UNICAST && type != RTN_BLACKHOLE &&
+	    type != RTN_UNREACHABLE && type != RTN_LOCAL &&
+	    type != RTN_BROADCAST)
+		return true;
+
+	if (tb_id == RT_TABLE_MAIN && type != RTN_UNICAST &&
+	    type != RTN_BLACKHOLE && type != RTN_UNREACHABLE)
+		return true;
+
+	if (tos != 0)
+		return true;
+
+	if (ipv4_is_loopback(dst))
+		return true;
+
+	/* XXX not handling ECMP right now */
+	if (fi->fib_nhs != 1)
+		return true;
+
+	return false;
+}
+
 /*****************
  * Net device ops
  *****************/
@@ -3830,6 +4180,46 @@ static int rocker_port_switch_port_stp_update(struct net_device *dev, u8 state)
 	return rocker_port_stp_update(rocker_port, state);
 }
 
+static int rocker_port_switch_fib_ipv4_add(struct net_device *dev,
+					   __be32 dst, int dst_len,
+					   struct fib_info *fi,
+					   u8 tos, u8 type, u32 tb_id)
+{
+	struct rocker_port *rocker_port = netdev_priv(dev);
+	bool skip;
+	int flags = 0;
+	int err;
+
+	skip = rocker_port_fib_ipv4_skip(dev, dst, dst_len, fi,
+					 tos, type, tb_id);
+	if (skip)
+		return 0;
+
+	err = rocker_port_fib_ipv4(rocker_port, dst, dst_len,
+				   fi, tb_id, flags);
+	if (!err)
+		fi->fib_flags |= RTNH_F_EXTERNAL;
+
+	return err;
+}
+
+static int rocker_port_switch_fib_ipv4_del(struct net_device *dev,
+					   __be32 dst, int dst_len,
+					   struct fib_info *fi,
+					   u8 tos, u8 type, u32 tb_id)
+{
+	struct rocker_port *rocker_port = netdev_priv(dev);
+	int flags = ROCKER_OP_FLAG_REMOVE;
+	int err;
+
+	err = rocker_port_fib_ipv4(rocker_port, dst, dst_len,
+				   fi, tb_id, flags);
+	if (!err)
+		fi->fib_flags &= ~RTNH_F_EXTERNAL;
+
+	return err;
+}
+
 static const struct net_device_ops rocker_port_netdev_ops = {
 	.ndo_open			= rocker_port_open,
 	.ndo_stop			= rocker_port_stop,
@@ -3844,6 +4234,8 @@ static const struct net_device_ops rocker_port_netdev_ops = {
 	.ndo_bridge_getlink		= rocker_port_bridge_getlink,
 	.ndo_switch_parent_id_get	= rocker_port_switch_parent_id_get,
 	.ndo_switch_port_stp_update	= rocker_port_switch_port_stp_update,
+	.ndo_switch_fib_ipv4_add	= rocker_port_switch_fib_ipv4_add,
+	.ndo_switch_fib_ipv4_del	= rocker_port_switch_fib_ipv4_del,
 };
 
 /********************
@@ -4544,6 +4936,48 @@ static struct notifier_block rocker_netdevice_nb __read_mostly = {
 	.notifier_call = rocker_netdevice_event,
 };
 
+/************************************
+ * Net event notifier event handler
+ ************************************/
+
+static int rocker_neigh_update(struct net_device *dev, struct neighbour *n)
+{
+	struct rocker_port *rocker_port = netdev_priv(dev);
+	int flags = (n->nud_state & NUD_VALID) ? 0 : ROCKER_OP_FLAG_REMOVE;
+	__be32 ip_addr = *(__be32 *)n->primary_key;
+
+	return rocker_port_ipv4_neigh(rocker_port, flags, ip_addr, n->ha);
+}
+
+static int rocker_netevent_event(struct notifier_block *unused,
+				 unsigned long event, void *ptr)
+{
+	struct net_device *dev;
+	struct neighbour *n = ptr;
+	int err;
+
+	switch (event) {
+	case NETEVENT_NEIGH_UPDATE:
+		if (n->tbl != &arp_tbl)
+			return NOTIFY_DONE;
+		dev = n->dev;
+		if (!rocker_port_dev_check(dev))
+			return NOTIFY_DONE;
+		err = rocker_neigh_update(dev, n);
+		if (err)
+			netdev_warn(dev,
+				    "failed to handle neigh update (err %d)\n",
+				    err);
+		break;
+	}
+
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block rocker_netevent_nb __read_mostly = {
+	.notifier_call = rocker_netevent_event,
+};
+
 /***********************
  * Module init and exit
  ***********************/
@@ -4553,18 +4987,21 @@ static int __init rocker_module_init(void)
 	int err;
 
 	register_netdevice_notifier(&rocker_netdevice_nb);
+	register_netevent_notifier(&rocker_netevent_nb);
 	err = pci_register_driver(&rocker_pci_driver);
 	if (err)
 		goto err_pci_register_driver;
 	return 0;
 
 err_pci_register_driver:
+	unregister_netevent_notifier(&rocker_netevent_nb);
 	unregister_netdevice_notifier(&rocker_netdevice_nb);
 	return err;
 }
 
 static void __exit rocker_module_exit(void)
 {
+	unregister_netevent_notifier(&rocker_netevent_nb);
 	unregister_netdevice_notifier(&rocker_netdevice_nb);
 	pci_unregister_driver(&rocker_pci_driver);
 }
-- 
1.7.10.4

* Re: [PATCH net-next v3 6/7] fib: hook IPv4 fib for hardware offload
  2015-03-03 23:31 ` [PATCH net-next v3 6/7] fib: hook IPv4 fib for hardware offload sfeldma
@ 2015-03-04  0:01   ` Alexander Duyck
  2015-03-04  3:16     ` Scott Feldman
  2015-03-05  7:03   ` John Fastabend
  1 sibling, 1 reply; 19+ messages in thread
From: Alexander Duyck @ 2015-03-04  0:01 UTC (permalink / raw)
  To: sfeldma, netdev, davem, jiri, roopa


On 03/03/2015 03:31 PM, sfeldma@gmail.com wrote:
> From: Scott Feldman <sfeldma@gmail.com>
>
> Call into the switchdev driver any time an IPv4 fib entry is
> added/modified/deleted from the kernel's FIB.  The switchdev driver may or
> may not install the route to the offload device.  In the case where the
> driver tries to install the route and something goes wrong (device's routing
> table is full, etc), then all of the offloaded routes will be flushed from the
> device, and route forwarding falls back to the kernel.
>
> We can refine this fail-over logic in subsequent patches.  For now, use the
> simplest model of offloading routes up to the point of failure, and then on
> failure, undo everything.
>
> Signed-off-by: Scott Feldman <sfeldma@gmail.com>
> ---
>   net/ipv4/fib_trie.c |   36 +++++++++++++++++++++++++++++++++---
>   1 file changed, 33 insertions(+), 3 deletions(-)
>
> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> index 32c0117..668f09b 100644
> --- a/net/ipv4/fib_trie.c
> +++ b/net/ipv4/fib_trie.c
> @@ -79,6 +79,7 @@
>   #include <net/tcp.h>
>   #include <net/sock.h>
>   #include <net/ip_fib.h>
> +#include <net/switchdev.h>
>   #include "fib_lookup.h"
>   
>   #define MAX_STAT_DEPTH 32
> @@ -1161,7 +1162,18 @@ int fib_table_insert(struct fib_table *tb, struct fib_config *cfg)
>   			new_fa->fa_state = state & ~FA_S_ACCESSED;
>   			new_fa->fa_slen = fa->fa_slen;
>   
> +			err = netdev_switch_fib_ipv4_add(key, plen, fi,
> +							 new_fa->fa_tos,
> +							 cfg->fc_type,
> +							 tb->tb_id);
> +			if (err) {
> +				fib_flush_external(fi->fib_net);
> +				kmem_cache_free(fn_alias_kmem, new_fa);
> +				goto out;
> +			}
> +
>   			hlist_replace_rcu(&fa->fa_list, &new_fa->fa_list);
> +
>   			alias_free_mem_rcu(fa);
>   
>   			fib_release_info(fi_drop);
> @@ -1197,12 +1209,20 @@ int fib_table_insert(struct fib_table *tb, struct fib_config *cfg)
>   	new_fa->fa_state = 0;
>   	new_fa->fa_slen = slen;
>   
> +	/* (Optionally) offload fib entry to switch hardware. */
> +	err = netdev_switch_fib_ipv4_add(key, plen, fi, tos,
> +					 cfg->fc_type, tb->tb_id);
> +	if (err) {
> +		fib_flush_external(fi->fib_net);
> +		goto out_free_new_fa;
> +	}
> +
>   	/* Insert new entry to the list. */
>   	if (!l) {
>   		l = fib_insert_node(t, key, plen);
>   		if (unlikely(!l)) {
>   			err = -ENOMEM;
> -			goto out_free_new_fa;
> +			goto out_sw_fib_del;
>   		}
>   	}
>   

Wouldn't it make more sense to insert the route in the trie first, and 
then notify the hardware of the new route?  It seems like it would be 
much easier to pull the route back out of the trie on failure rather 
than having to delete a route from the hardware on a failure to allocate 
memory to store it.

- Alex
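
A minimal userspace sketch of the ordering suggested above: insert into the software table first, then notify the hardware, so that the only unwind path is a cheap software removal. Everything named toy_* is hypothetical and stands in for the kernel trie and the switchdev hook; the flush of already-offloaded routes that the patch performs on error is not modeled here.

```c
#include <stddef.h>
#include <errno.h>

#define TOY_TABLE_SIZE 4

/* Hypothetical stand-in for a software route table (not the kernel trie). */
struct toy_fib {
	unsigned int routes[TOY_TABLE_SIZE];
	size_t count;
};

/* Hypothetical hardware hook: returns 0 on success or a negative errno. */
typedef int (*toy_hw_add_fn)(unsigned int dst);

static int toy_sw_insert(struct toy_fib *fib, unsigned int dst)
{
	if (fib->count == TOY_TABLE_SIZE)
		return -ENOMEM;
	fib->routes[fib->count++] = dst;
	return 0;
}

static void toy_sw_remove_last(struct toy_fib *fib)
{
	if (fib->count)
		fib->count--;
}

/* Software first, hardware second: a failed allocation never leaves a
 * route behind in the device, and a failed hardware install is undone
 * by removing the entry that was just inserted in software.
 */
static int toy_route_add(struct toy_fib *fib, unsigned int dst,
			 toy_hw_add_fn hw_add)
{
	int err;

	err = toy_sw_insert(fib, dst);
	if (err)
		return err;

	err = hw_add(dst);
	if (err)
		toy_sw_remove_last(fib);

	return err;
}

/* Two fake devices: one that always accepts, one that is always full. */
static int toy_hw_ok(unsigned int dst) { (void)dst; return 0; }
static int toy_hw_full(unsigned int dst) { (void)dst; return -ENOSPC; }
```

With toy_hw_full the software table ends up unchanged after the failed add, which is the property the reordering is meant to buy.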

* Re: [PATCH net-next v3 6/7] fib: hook IPv4 fib for hardware offload
  2015-03-04  0:01   ` Alexander Duyck
@ 2015-03-04  3:16     ` Scott Feldman
  0 siblings, 0 replies; 19+ messages in thread
From: Scott Feldman @ 2015-03-04  3:16 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Netdev, David S. Miller, Jiří Pírko, Roopa Prabhu

On Tue, Mar 3, 2015 at 4:01 PM, Alexander Duyck
<alexander.h.duyck@redhat.com> wrote:
>
> On 03/03/2015 03:31 PM, sfeldma@gmail.com wrote:
>>
>> From: Scott Feldman <sfeldma@gmail.com>
>>
>> Call into the switchdev driver any time an IPv4 fib entry is
>> added/modified/deleted from the kernel's FIB.  The switchdev driver may or
>> may not install the route to the offload device.  In the case where the
>> driver tries to install the route and something goes wrong (device's routing
>> table is full, etc), then all of the offloaded routes will be flushed from the
>> device, and route forwarding falls back to the kernel.
>>
>> We can refine this fail-over logic in subsequent patches.  For now, use the
>> simplest model of offloading routes up to the point of failure, and then on
>> failure, undo everything.
>>
>> Signed-off-by: Scott Feldman <sfeldma@gmail.com>
>> ---
>>   net/ipv4/fib_trie.c |   36 +++++++++++++++++++++++++++++++++---
>>   1 file changed, 33 insertions(+), 3 deletions(-)
>>
>> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
>> index 32c0117..668f09b 100644
>> --- a/net/ipv4/fib_trie.c
>> +++ b/net/ipv4/fib_trie.c
>> @@ -79,6 +79,7 @@
>>   #include <net/tcp.h>
>>   #include <net/sock.h>
>>   #include <net/ip_fib.h>
>> +#include <net/switchdev.h>
>>   #include "fib_lookup.h"
>>
>>   #define MAX_STAT_DEPTH 32
>> @@ -1161,7 +1162,18 @@ int fib_table_insert(struct fib_table *tb, struct fib_config *cfg)
>>   			new_fa->fa_state = state & ~FA_S_ACCESSED;
>>   			new_fa->fa_slen = fa->fa_slen;
>>
>> +			err = netdev_switch_fib_ipv4_add(key, plen, fi,
>> +							 new_fa->fa_tos,
>> +							 cfg->fc_type,
>> +							 tb->tb_id);
>> +			if (err) {
>> +				fib_flush_external(fi->fib_net);
>> +				kmem_cache_free(fn_alias_kmem, new_fa);
>> +				goto out;
>> +			}
>> +
>>   			hlist_replace_rcu(&fa->fa_list, &new_fa->fa_list);
>> +
>>   			alias_free_mem_rcu(fa);
>>
>>   			fib_release_info(fi_drop);
>> @@ -1197,12 +1209,20 @@ int fib_table_insert(struct fib_table *tb, struct fib_config *cfg)
>>   	new_fa->fa_state = 0;
>>   	new_fa->fa_slen = slen;
>>
>> +	/* (Optionally) offload fib entry to switch hardware. */
>> +	err = netdev_switch_fib_ipv4_add(key, plen, fi, tos,
>> +					 cfg->fc_type, tb->tb_id);
>> +	if (err) {
>> +		fib_flush_external(fi->fib_net);
>> +		goto out_free_new_fa;
>> +	}
>> +
>>   	/* Insert new entry to the list. */
>>   	if (!l) {
>>   		l = fib_insert_node(t, key, plen);
>>   		if (unlikely(!l)) {
>>   			err = -ENOMEM;
>> -			goto out_free_new_fa;
>> +			goto out_sw_fib_del;
>>   		}
>>   	}
>>
>
>
> Wouldn't it make more sense to insert the route in the trie first, and then
> notify the hardware of the new route?  It seems like it would be much easier
> to pull the route back out of the trie on failure rather than having to
> delete a route from the hardware on a failure to allocate memory to store it.

Six of one, half a dozen of the other.  At one point I wanted the return
code from the hw install to indicate whether the route gets installed to
the kernel FIB, so I needed to try the hw install first.  Now that we've
simplified the logic, the order could be reversed.  Maybe let's keep it the
way it is so we can get fancy with the hw install return code later.

* Re: [PATCH net-next v3 0/7] switchdev: add IPv4 routing offload
  2015-03-03 23:31 [PATCH net-next v3 0/7] switchdev: add IPv4 routing offload sfeldma
                   ` (6 preceding siblings ...)
  2015-03-03 23:32 ` [PATCH net-next v3 7/7] rocker: implement IPv4 fib offloading sfeldma
@ 2015-03-04  5:38 ` David Miller
  2015-03-04  7:28   ` Scott Feldman
  7 siblings, 1 reply; 19+ messages in thread
From: David Miller @ 2015-03-04  5:38 UTC (permalink / raw)
  To: sfeldma; +Cc: netdev, jiri, roopa

From: sfeldma@gmail.com
Date: Tue,  3 Mar 2015 15:31:53 -0800

> v3:
> 
> Changes based on v2 review comments:
> 
>   - Move check for custom rules up earlier in patch set, to keep git bisect
>     safe.
>   - Simplify the route add/modify failure handling to simple try until
>     failure, and then on failure, undo everything.  The switchdev driver
>     will return err when route can normally be installed to device, but
>     the install fails for one reason or another (no space left on device,
>     etc).  If a failure happens, uninstall all routes from the device,
>     punting forwarding for all routes back to the kernel.
>   - Scan route's full nexthop list, ensuring all nexthop devs belong
>     to the same switchdev device, otherwise don't try to install route
>     to device.

Getting really close.

I think you have to make some minor adjustments to the
fib_flush_external() cases.

First of all, that code that unloads the entries from the hardware
should clear the RTNH_F_EXTERNAL flag.

Secondly, if you call fib_flush_external() because an add returned an
error, you have to set some boolean state which prevents the next new
route insert from loading only that new route into the hardware
because that's exactly what will happen with your current
implementation.

Thanks.
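
A small sketch of the boolean state described above, assuming a hypothetical per-context "disabled" flag. Once a hardware install fails and the offloaded routes are flushed, later adds stay in software instead of landing in the device alone. The struct, its fields and the toy_* functions are illustrative only, and whether the failed add is also reported to the caller as an error is left out.

```c
#include <stdbool.h>
#include <errno.h>

/* Hypothetical offload context; the thread does not name such a struct. */
struct toy_offload {
	bool disabled;			/* set once a hardware install has failed */
	unsigned int hw_routes;		/* routes currently held by the device */
	unsigned int hw_capacity;	/* device table size */
};

static int toy_hw_install(struct toy_offload *o)
{
	if (o->hw_routes >= o->hw_capacity)
		return -ENOSPC;
	o->hw_routes++;
	return 0;
}

/* Remove every offloaded route and remember that offload is now off. */
static void toy_flush_external(struct toy_offload *o)
{
	o->hw_routes = 0;
	o->disabled = true;
}

/* Returns true if the route went to hardware, false if it is handled
 * in software only.
 */
static bool toy_route_add(struct toy_offload *o)
{
	if (o->disabled)
		return false;

	if (toy_hw_install(o)) {
		toy_flush_external(o);
		return false;
	}

	return true;
}
```

Without the disabled check, the add that follows a flush would succeed against the now-empty device table and leave a single route offloaded, which is the case being warned about.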

* Re: [PATCH net-next v3 0/7] switchdev: add IPv4 routing offload
  2015-03-04  5:38 ` [PATCH net-next v3 0/7] switchdev: add IPv4 routing offload David Miller
@ 2015-03-04  7:28   ` Scott Feldman
  2015-03-04 21:06     ` David Miller
  0 siblings, 1 reply; 19+ messages in thread
From: Scott Feldman @ 2015-03-04  7:28 UTC (permalink / raw)
  To: David Miller; +Cc: Netdev, Jiří Pírko, Roopa Prabhu

On Tue, Mar 3, 2015 at 9:38 PM, David Miller <davem@davemloft.net> wrote:
> From: sfeldma@gmail.com
> Date: Tue,  3 Mar 2015 15:31:53 -0800
>
>> v3:
>>
>> Changes based on v2 review comments:
>>
>>   - Move check for custom rules up earlier in patch set, to keep git bisect
>>     safe.
>>   - Simplify the route add/modify failure handling to simple try until
>>     failure, and then on failure, undo everything.  The switchdev driver
>>     will return err when route can normally be installed to device, but
>>     the install fails for one reason or another (no space left on device,
>>     etc).  If a failure happens, uninstall all routes from the device,
>>     punting forwarding for all routes back to the kernel.
>>   - Scan route's full nexthop list, ensuring all nexthop devs belong
>>     to the same switchdev device, otherwise don't try to install route
>>     to device.
>
> Getting really close.
>
> I think you have to make some minor adjustments to the
> fib_flush_external() cases.
>
> First of all, that code that unloads the entries from the hardware
> should clear the RTNH_F_EXTERNAL flag.

In v3, the setting and clearing of RTNH_F_EXTERNAL moved to the
driver, the implementer of the add/del ndo ops.  So RTNH_F_EXTERNAL
does get cleared by the driver on fib_flush_external().  We could add
an additional clear above the driver, just in case the driver screwed
up and forgot to clear it.  Driver bug in that case; not sure where to
draw the line.

> Secondly, if you call fib_flush_external() because an add returned an
> error, you have to set some boolean state which prevents the next new
> route insert from loading only that new route into the hardware
> because that's exactly what will happen with your current
> implementation.

I guess we could add a net.ipv4.fib_hw_screwed.  But that kills other
innocent switch devices on same netns.  Or is it a private driver
bool, which gets set on first install err, and is checked on
subsequent installs?

* Re: [PATCH net-next v3 0/7] switchdev: add IPv4 routing offload
  2015-03-04  7:28   ` Scott Feldman
@ 2015-03-04 21:06     ` David Miller
  2015-03-05  4:50       ` Scott Feldman
  2015-03-05  7:18       ` John Fastabend
  0 siblings, 2 replies; 19+ messages in thread
From: David Miller @ 2015-03-04 21:06 UTC (permalink / raw)
  To: sfeldma; +Cc: netdev, jiri, roopa

From: Scott Feldman <sfeldma@gmail.com>
Date: Tue, 3 Mar 2015 23:28:06 -0800

> On Tue, Mar 3, 2015 at 9:38 PM, David Miller <davem@davemloft.net> wrote:
> In v3, the setting and clearing of RTNH_F_EXTERNAL moved to the
> driver, the implementer of the add/del ndo ops.  So RTNH_F_EXTERNAL
> does get cleared by the driver on fib_flush_external().  We could add
> an additional clear above the driver, just in case the driver screwed
> up and forgot to clear it.  Driver bug in that case; not sure where to
> draw the line.

I'd rather the state bit get managed by net/ipv4/*.c rather than
duplicate this into every driver, that's error prone and duplicates
logic unnecessarily.
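
A rough sketch of what managing the bit above the driver could look like, assuming hypothetical core wrappers around the driver's add/del ops. The driver only reports success or failure; the wrappers own the flag. TOY_F_EXTERNAL is a stand-in for RTNH_F_EXTERNAL with an illustrative value, and the toy_* structs and functions are assumptions, not the switchdev API.

```c
#include <stddef.h>
#include <errno.h>

#define TOY_F_EXTERNAL 0x8	/* stand-in for RTNH_F_EXTERNAL; value is illustrative */

struct toy_fib_info {
	unsigned int flags;
};

/* Hypothetical driver ops: they program the device and report the
 * result, but never touch the flag themselves.
 */
struct toy_switch_ops {
	int (*fib_add)(struct toy_fib_info *fi);
	int (*fib_del)(struct toy_fib_info *fi);
};

static int toy_core_fib_add(const struct toy_switch_ops *ops,
			    struct toy_fib_info *fi)
{
	int err;

	if (!ops || !ops->fib_add)
		return 0;	/* no offload support: software route only */

	err = ops->fib_add(fi);
	if (!err)
		fi->flags |= TOY_F_EXTERNAL;

	return err;
}

static int toy_core_fib_del(const struct toy_switch_ops *ops,
			    struct toy_fib_info *fi)
{
	int err;

	if (!(fi->flags & TOY_F_EXTERNAL))
		return 0;	/* never offloaded, nothing to undo */
	if (!ops || !ops->fib_del)
		return 0;

	err = ops->fib_del(fi);
	if (!err)
		fi->flags &= ~TOY_F_EXTERNAL;

	return err;
}

/* Two fake driver callbacks for exercising the wrappers. */
static int toy_drv_ok(struct toy_fib_info *fi) { (void)fi; return 0; }
static int toy_drv_fail(struct toy_fib_info *fi) { (void)fi; return -ENOSPC; }
```

Keeping the set and clear in one place means a driver that forgets the flag cannot leave a stale mark on a route.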

>> Secondly, if you call fib_flush_external() because an add returned an
>> error, you have to set some boolean state which prevents the next new
>> route insert from loading only that new route into the hardware
>> because that's exactly what will happen with your current
>> implementation.
> 
> I guess we could add a net.ipv4.fib_hw_screwed.  But that kills other
> innocent switch devices on same netns.  Or is it a private driver
> bool, which gets set on first install err, and is checked on
> subsequent installs?

You can make it per-netdevice if you want.  Put it into the inetdevice
area perhaps.

* Re: [PATCH net-next v3 0/7] switchdev: add IPv4 routing offload
  2015-03-04 21:06     ` David Miller
@ 2015-03-05  4:50       ` Scott Feldman
  2015-03-05  5:04         ` David Miller
  2015-03-05  5:07         ` David Miller
  2015-03-05  7:18       ` John Fastabend
  1 sibling, 2 replies; 19+ messages in thread
From: Scott Feldman @ 2015-03-05  4:50 UTC (permalink / raw)
  To: David Miller; +Cc: Netdev, Jiří Pírko, Roopa Prabhu

On Wed, Mar 4, 2015 at 1:06 PM, David Miller <davem@davemloft.net> wrote:
>
> From: Scott Feldman <sfeldma@gmail.com>
> Date: Tue, 3 Mar 2015 23:28:06 -0800
>
> > On Tue, Mar 3, 2015 at 9:38 PM, David Miller <davem@davemloft.net> wrote:
> > In v3, the setting and clearing of RTNH_F_EXTERNAL moved to the
> > driver, the implementer of the add/del ndo ops.  So RTNH_F_EXTERNAL
> > does get cleared by the driver on fib_flush_external().  We could add
> > an additional clear above the driver, just in case the driver screwed
> > up and forgot to clear it.  Driver bug in that case; not sure where to
> > draw the line.
>
> I'd rather the state bit get managed by net/ipv4/*.c rather than
> duplicate this into every driver, that's error prone and duplicates
> logic unnecessarily.

That's the way it was in v2 :(

In v3, net/ipv4/*.c doesn't know if the driver skipped installing a
route to hw.  The driver returns 0 as if it was installed. So the
driver has to mark the ones actually installed.

This is why v2 had the return code -EOPNOTSUPP for routes that are
skipped by the driver.  For example, rocker currently skips ECMP routes.
It's not an err condition.
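
To make the three outcomes concrete, here is a tiny sketch of how a caller might tell them apart if skipped routes were reported with -EOPNOTSUPP, as described for v2. The enum and toy_classify() are hypothetical names used only for illustration.

```c
#include <errno.h>

enum toy_offload_result {
	TOY_OFFLOADED,	/* route is in hardware */
	TOY_SKIPPED,	/* driver declined; software forwarding, not an error */
	TOY_FAILED,	/* driver tried and failed; caller should unwind */
};

/* Map a driver return code onto one of the three outcomes. */
static enum toy_offload_result toy_classify(int driver_ret)
{
	if (driver_ret == 0)
		return TOY_OFFLOADED;
	if (driver_ret == -EOPNOTSUPP)
		return TOY_SKIPPED;
	return TOY_FAILED;
}
```

With a return of 0 meaning both "installed" and "skipped", as in v3, the caller cannot distinguish the first two cases.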

But I see that a driver could skip and not skip the wrong combination of
routes such that we get a prefix split, for example.  We can't trust
the driver.

I don't know what to do for v4.  Bummer, we got pretty close.  My only
ideas right now involve more code to handle unwinding some routes
from hw when a related route fails to install to hw.

-scott


* Re: [PATCH net-next v3 0/7] switchdev: add IPv4 routing offload
  2015-03-05  4:50       ` Scott Feldman
@ 2015-03-05  5:04         ` David Miller
  2015-03-05  5:07         ` David Miller
  1 sibling, 0 replies; 19+ messages in thread
From: David Miller @ 2015-03-05  5:04 UTC (permalink / raw)
  To: sfeldma; +Cc: netdev, jiri, roopa

From: Scott Feldman <sfeldma@gmail.com>
Date: Wed, 4 Mar 2015 20:50:28 -0800

> In v3, net/ipv4/*.c doesn't know if the driver skipped installing a
> route to hw.  The driver returns 0 as if it was installed. So the
> driver has to mark the ones actually installed.
> 
> This is why v2 had the return code -EOPNOTSUPP for routes that are
> skipped by the driver.  For example, rocker currently skips ECMP routes.
> It's not an err condition.

The driver should say what it did.

> But I see that a driver could skip some routes and install others in
> the wrong combination, such that we get a prefix split, for example.
> We can't trust the driver.

If the driver said "I did this" then we have to trust it, I don't
understand why this is an issue.


* Re: [PATCH net-next v3 0/7] switchdev: add IPv4 routing offload
  2015-03-05  4:50       ` Scott Feldman
  2015-03-05  5:04         ` David Miller
@ 2015-03-05  5:07         ` David Miller
  1 sibling, 0 replies; 19+ messages in thread
From: David Miller @ 2015-03-05  5:07 UTC (permalink / raw)
  To: sfeldma; +Cc: netdev, jiri, roopa

From: Scott Feldman <sfeldma@gmail.com>
Date: Wed, 4 Mar 2015 20:50:28 -0800

> I don't know what to do for v4.  Bummer, we got pretty close.  My only
> ideas right now involve more code to handle unwinding some routes
> from hw when a related route fails to install to hw.

Scott:

1) route ADD

	driver has NULL ndo_op, do nothing, install sw route

	ndo_op SUCCESS, return 0, all is good and the caller sets the
	external bit

	ndo_op FAILURE, returns any error code, any and all external
	routes are removed with DEL ndo_op, all external bits are
	cleared

2) route DEL

	driver has NULL ndo_op, do nothing

	ndo_op SUCCESS, return 0, clear external bit

	ndo_op FAILURE, pass failure up the delete call stack
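
The ADD/DEL rules above could be modelled along these lines.  This is
only a user-space sketch under stated assumptions: struct route,
struct fib, struct offload_ops, the mock_* driver and its two-entry
capacity are all hypothetical, and returning the error from route_add()
is one possible choice that the rules above do not spell out.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's types. */
struct route {
	bool external;                     /* models RTNH_F_EXTERNAL */
};

struct fib {
	struct route *routes;
	size_t n;
};

struct offload_ops {
	int (*fib_add)(struct route *r);   /* may be NULL */
	int (*fib_del)(struct route *r);   /* may be NULL */
};

/* Remove every offloaded route from the device and clear its bit. */
static void flush_external(const struct offload_ops *ops, struct fib *fib)
{
	for (size_t i = 0; i < fib->n; i++) {
		struct route *r = &fib->routes[i];

		if (!r->external)
			continue;
		if (ops->fib_del)
			ops->fib_del(r);
		r->external = false;
	}
}

/* 1) route ADD */
static int route_add(const struct offload_ops *ops, struct fib *fib,
		     struct route *r)
{
	int err;

	if (!ops->fib_add)
		return 0;                  /* NULL op: software route only */
	err = ops->fib_add(r);
	if (err) {
		flush_external(ops, fib);  /* any failure: undo everything */
		return err;
	}
	r->external = true;                /* the caller sets the bit */
	return 0;
}

/* 2) route DEL */
static int route_del(const struct offload_ops *ops, struct route *r)
{
	int err;

	if (!ops->fib_del)
		return 0;                  /* NULL op: nothing to do */
	err = ops->fib_del(r);
	if (err)
		return err;                /* pass failure up the call stack */
	r->external = false;
	return 0;
}

/* Hypothetical mock driver backed by a two-entry table. */
static int hw_used, hw_cap = 2;

static int mock_add(struct route *r)
{
	(void)r;
	if (hw_used >= hw_cap)
		return -ENOSPC;
	hw_used++;
	return 0;
}

static int mock_del(struct route *r)
{
	(void)r;
	hw_used--;
	return 0;
}
```

Note that the external bit is only ever touched in route_add(),
route_del() and flush_external(), never in the mock driver, which is the
point of keeping the state in the core.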


* Re: [PATCH net-next v3 6/7] fib: hook IPv4 fib for hardware offload
  2015-03-03 23:31 ` [PATCH net-next v3 6/7] fib: hook IPv4 fib for hardware offload sfeldma
  2015-03-04  0:01   ` Alexander Duyck
@ 2015-03-05  7:03   ` John Fastabend
  2015-03-05  7:05     ` David Miller
  1 sibling, 1 reply; 19+ messages in thread
From: John Fastabend @ 2015-03-05  7:03 UTC (permalink / raw)
  To: sfeldma; +Cc: netdev, davem, jiri, roopa

On 03/03/2015 03:31 PM, sfeldma@gmail.com wrote:
> From: Scott Feldman <sfeldma@gmail.com>
>
> Call into the switchdev driver any time an IPv4 fib entry is
> added/modified/deleted from the kernel's FIB.  The switchdev driver may or
> may not install the route to the offload device.  In the case where the
> driver tries to install the route and something goes wrong (device's routing
> table is full, etc), then all of the offloaded routes will be flushed from the
> device, and route forwarding falls back to the kernel.
>
> We can refine this fail-over logic in subsequent patches.  For now, use the
> simplest model of offloading routes up to the point of failure, and then on
> failure, undo everything.
>
> Signed-off-by: Scott Feldman <sfeldma@gmail.com>
> ---

[...]

> @@ -1197,12 +1209,20 @@ int fib_table_insert(struct fib_table *tb, struct fib_config *cfg)
>   	new_fa->fa_state = 0;
>   	new_fa->fa_slen = slen;
>
> +	/* (Optionally) offload fib entry to switch hardware. */
> +	err = netdev_switch_fib_ipv4_add(key, plen, fi, tos,
> +					 cfg->fc_type, tb->tb_id);
> +	if (err) {
> +		fib_flush_external(fi->fib_net);
> +		goto out_free_new_fa;
> +	}
> +


Don't you need something to disable further fib entries from being added
once you get a failure and flush the table? Maybe I'm just not seeing it.

The case being you add a set of entries, you get a failure and flush the
table, then add another entry. The last entry is successfully inserted
but now you're out of sync.
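
One way to picture the guard being asked about is a per-device flag
that latches on the first failure.  The sketch below is a user-space
model only: struct switch_dev, offload_disabled, hw_install(),
hw_flush() and offload_route() are hypothetical names, and the
capacity-based failure is just an example of an install error.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-device state. */
struct switch_dev {
	int hw_used;            /* routes currently in the device */
	int hw_cap;             /* device table capacity */
	bool offload_disabled;  /* latched after the first install failure */
};

/* Hypothetical hardware install: fails when the table is full. */
static int hw_install(struct switch_dev *dev)
{
	if (dev->hw_used >= dev->hw_cap)
		return -ENOSPC;
	dev->hw_used++;
	return 0;
}

/* Hypothetical flush: drop every route from the device. */
static void hw_flush(struct switch_dev *dev)
{
	dev->hw_used = 0;
}

/* Try to offload one route; *offloaded reports whether it reached hw. */
static int offload_route(struct switch_dev *dev, bool *offloaded)
{
	int err;

	*offloaded = false;
	if (dev->offload_disabled)
		return 0;       /* after a flush, everything stays in software */

	err = hw_install(dev);
	if (err) {
		hw_flush(dev);
		dev->offload_disabled = true;
		return err;
	}
	*offloaded = true;
	return 0;
}
```

Without the offload_disabled check, the add that follows a flush would
land in an otherwise empty device table, which is the out-of-sync case
described above.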

>   	/* Insert new entry to the list. */
>   	if (!l) {
>   		l = fib_insert_node(t, key, plen);
>   		if (unlikely(!l)) {
>   			err = -ENOMEM;
> -			goto out_free_new_fa;
> +			goto out_sw_fib_del;
>   		}
>   	}
>

Thanks.


-- 
John Fastabend         Intel Corporation


* Re: [PATCH net-next v3 6/7] fib: hook IPv4 fib for hardware offload
  2015-03-05  7:03   ` John Fastabend
@ 2015-03-05  7:05     ` David Miller
  0 siblings, 0 replies; 19+ messages in thread
From: David Miller @ 2015-03-05  7:05 UTC (permalink / raw)
  To: john.fastabend; +Cc: sfeldma, netdev, jiri, roopa

From: John Fastabend <john.fastabend@gmail.com>
Date: Wed, 04 Mar 2015 23:03:49 -0800

> Don't you need something to disable further fib entries from being
> added once you get a failure and flush the table? Maybe I'm just not
> seeing it.
> 
> The case being you add a set of entries, you get a failure and flush
> the table, then add another entry. The last entry is successfully
> inserted but now you're out of sync.

Yes, he knows, we've been discussing exactly this problem.


* Re: [PATCH net-next v3 0/7] switchdev: add IPv4 routing offload
  2015-03-04 21:06     ` David Miller
  2015-03-05  4:50       ` Scott Feldman
@ 2015-03-05  7:18       ` John Fastabend
  1 sibling, 0 replies; 19+ messages in thread
From: John Fastabend @ 2015-03-05  7:18 UTC (permalink / raw)
  To: David Miller; +Cc: sfeldma, netdev, jiri, roopa

On 03/04/2015 01:06 PM, David Miller wrote:
> From: Scott Feldman <sfeldma@gmail.com>
> Date: Tue, 3 Mar 2015 23:28:06 -0800
>
>> On Tue, Mar 3, 2015 at 9:38 PM, David Miller <davem@davemloft.net> wrote:
>> In v3, the setting and clearing of RTNH_F_EXTERNAL moved to the
>> driver, the implementer of the add/del ndo ops.  So RTNH_F_EXTERNAL
>> does get cleared by the driver on fib_flush_external().  We could add
>> an additional clear above the driver, just in case the driver screwed
>> up and forgot to clear it.  Driver bug in that case; not sure where to
>> draw the line.
>
> I'd rather the state bit get managed by net/ipv4/*.c rather than
> duplicate this into every driver, that's error prone and duplicates
> logic unnecessarily.
>
>>> Secondly, if you call fib_flush_external() because an add returned an
>>> error, you have to set some boolean state which prevents the next new
>>> route insert from loading only that new route into the hardware
>>> because that's exactly what will happen with your current
>>> implementation.
>>
>> I guess we could add a net.ipv4.fib_hw_screwed.  But that kills other
>> innocent switch devices on same netns.  Or is it a private driver
>> bool, which gets set on first install err, and is checked on
>> subsequent installs?
>
> You can make it per-netdevice if you want.  Put it into the inetdevice
> area perhaps.

Also at some point we will need a way to get it out of the
fib_hw_screwed state. But I guess we can lump this into the "better
policy" later.

Sorry for the noise on the last post...

.John

-- 
John Fastabend         Intel Corporation


