* [patch net-next RFC 00/12] introduce rocker switch driver with openvswitch hardware accelerated datapath
@ 2014-08-21 16:18 Jiri Pirko
  2014-08-21 16:18 ` [patch net-next RFC 02/12] net: rename netdev_phys_port_id to more generic name Jiri Pirko
                   ` (9 more replies)
  0 siblings, 10 replies; 87+ messages in thread
From: Jiri Pirko @ 2014-08-21 16:18 UTC (permalink / raw)
  To: netdev
  Cc: davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet, jhs, sfeldma, f.fainelli, roopa,
	linville, dev, jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a,
	buytenh, aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye

This patchset can be divided into 3 main sections:
- introduce switchdev api for implementing switch drivers
- add hardware acceleration bits into the openvswitch datapath; this uses
  the previously mentioned switchdev api
- introduce rocker switch driver which implements switchdev api

More info in separate patches.

So now it is possible, out of the box, to create an ovs bridge over rocker
switch ports, and the flows will be offloaded into hardware.
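As a rough usage sketch (assuming a rocker switch whose ports show up as netdevs named sw1p1/sw1p2 — the names here are hypothetical — and standard Open vSwitch userspace tools):

```shell
# Create an OVS bridge and attach two rocker switch ports to it.
# With this patchset, flows installed on br0 whose match and actions
# the hardware can express are offloaded to the rocker switch.
ovs-vsctl add-br br0
ovs-vsctl add-port br0 sw1p1
ovs-vsctl add-port br0 sw1p2
```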

Jiri Pirko (12):
  openvswitch: split flow structures into ovs specific and generic ones
  net: rename netdev_phys_port_id to more generic name
  net: introduce generic switch devices support
  rtnl: expose physical switch id for particular device
  net-sysfs: expose physical switch id for particular device
  net: introduce dummy switch
  dsa: implement ndo_swdev_get_id
  net: introduce netdev_phys_item_ids_match helper
  openvswitch: introduce vport_op get_netdev
  openvswitch: add support for datapath hardware offload
  sw_flow: add misc section to key with in_port_ifindex field
  rocker: introduce rocker switch driver

 Documentation/networking/switchdev.txt           |   53 +
 MAINTAINERS                                      |    6 +
 drivers/net/Kconfig                              |   15 +
 drivers/net/Makefile                             |    3 +
 drivers/net/dummyswitch.c                        |  131 +
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c |    2 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c      |    2 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c   |    2 +-
 drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c |    2 +-
 drivers/net/rocker.c                             | 3446 ++++++++++++++++++++++
 drivers/net/rocker.h                             |  465 +++
 include/linux/netdevice.h                        |   51 +-
 include/linux/sw_flow.h                          |  115 +
 include/linux/switchdev.h                        |   44 +
 include/uapi/linux/if_link.h                     |   10 +
 net/Kconfig                                      |    6 +
 net/core/Makefile                                |    1 +
 net/core/dev.c                                   |    2 +-
 net/core/net-sysfs.c                             |   26 +-
 net/core/rtnetlink.c                             |   30 +-
 net/core/switchdev.c                             |  173 ++
 net/dsa/Kconfig                                  |    2 +-
 net/dsa/slave.c                                  |   16 +
 net/openvswitch/Makefile                         |    3 +-
 net/openvswitch/actions.c                        |    3 +-
 net/openvswitch/datapath.c                       |  109 +-
 net/openvswitch/datapath.h                       |    7 +-
 net/openvswitch/dp_notify.c                      |    7 +-
 net/openvswitch/flow.c                           |    6 +-
 net/openvswitch/flow.h                           |  102 +-
 net/openvswitch/flow_netlink.c                   |   53 +-
 net/openvswitch/flow_netlink.h                   |   10 +-
 net/openvswitch/flow_table.c                     |  119 +-
 net/openvswitch/flow_table.h                     |   30 +-
 net/openvswitch/hw_offload.c                     |  258 ++
 net/openvswitch/hw_offload.h                     |   22 +
 net/openvswitch/vport-gre.c                      |    4 +-
 net/openvswitch/vport-internal_dev.c             |   56 +-
 net/openvswitch/vport-netdev.c                   |   19 +
 net/openvswitch/vport-netdev.h                   |   12 -
 net/openvswitch/vport-vxlan.c                    |    2 +-
 net/openvswitch/vport.c                          |    2 +-
 net/openvswitch/vport.h                          |    6 +-
 43 files changed, 5145 insertions(+), 288 deletions(-)
 create mode 100644 Documentation/networking/switchdev.txt
 create mode 100644 drivers/net/dummyswitch.c
 create mode 100644 drivers/net/rocker.c
 create mode 100644 drivers/net/rocker.h
 create mode 100644 include/linux/sw_flow.h
 create mode 100644 include/linux/switchdev.h
 create mode 100644 net/core/switchdev.c
 create mode 100644 net/openvswitch/hw_offload.c
 create mode 100644 net/openvswitch/hw_offload.h

-- 
1.9.3


* [patch net-next RFC 01/12] openvswitch: split flow structures into ovs specific and generic ones
       [not found] ` <1408637945-10390-1-git-send-email-jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
@ 2014-08-21 16:18   ` Jiri Pirko
  2014-08-21 16:18   ` [patch net-next RFC 03/12] net: introduce generic switch devices support Jiri Pirko
  2014-08-21 16:18   ` [patch net-next RFC 06/12] net: introduce dummy switch Jiri Pirko
  2 siblings, 0 replies; 87+ messages in thread
From: Jiri Pirko @ 2014-08-21 16:18 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, andy-QlMahl40kYEqcZcGjlUOXw,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, ronye-VPRAkNaXOzVWk0Htik3J/w,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, ben-/+tVBieCtBitmTQ+vhA3Yw,
	buytenh-OLH4Qvv75CYX/NnBR394Jw,
	roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
	jhs-jkUAjuhPggJWk0Htik3J/w, aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, nhorman-2XuSBdqkA4R54TAoqtyWWQ,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	dborkman-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q

After this, flow-related structures can be used by other code.

Signed-off-by: Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
---
 include/linux/sw_flow.h        |  98 ++++++++++++++++++++++++++++++++++
 net/openvswitch/actions.c      |   3 +-
 net/openvswitch/datapath.c     |  74 +++++++++++++-------------
 net/openvswitch/datapath.h     |   4 +-
 net/openvswitch/flow.c         |   6 +--
 net/openvswitch/flow.h         | 102 +++++++----------------------------
 net/openvswitch/flow_netlink.c |  53 +++++++++---------
 net/openvswitch/flow_netlink.h |  10 ++--
 net/openvswitch/flow_table.c   | 118 ++++++++++++++++++++++-------------------
 net/openvswitch/flow_table.h   |  30 +++++------
 net/openvswitch/vport-gre.c    |   4 +-
 net/openvswitch/vport-vxlan.c  |   2 +-
 net/openvswitch/vport.c        |   2 +-
 net/openvswitch/vport.h        |   2 +-
 14 files changed, 275 insertions(+), 233 deletions(-)
 create mode 100644 include/linux/sw_flow.h

diff --git a/include/linux/sw_flow.h b/include/linux/sw_flow.h
new file mode 100644
index 0000000..b622fde
--- /dev/null
+++ b/include/linux/sw_flow.h
@@ -0,0 +1,98 @@
+/*
+ * Copyright (c) 2007-2012 Nicira, Inc.
+ * Copyright (c) 2014 Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#ifndef _LINUX_SW_FLOW_H_
+#define _LINUX_SW_FLOW_H_
+
+struct sw_flow_key_ipv4_tunnel {
+	__be64 tun_id;
+	__be32 ipv4_src;
+	__be32 ipv4_dst;
+	__be16 tun_flags;
+	u8   ipv4_tos;
+	u8   ipv4_ttl;
+};
+
+struct sw_flow_key {
+	struct sw_flow_key_ipv4_tunnel tun_key;  /* Encapsulating tunnel key. */
+	struct {
+		u32	priority;	/* Packet QoS priority. */
+		u32	skb_mark;	/* SKB mark. */
+		u16	in_port;	/* Input switch port (or DP_MAX_PORTS). */
+	} __packed phy; /* Safe when right after 'tun_key'. */
+	struct {
+		u8     src[ETH_ALEN];	/* Ethernet source address. */
+		u8     dst[ETH_ALEN];	/* Ethernet destination address. */
+		__be16 tci;		/* 0 if no VLAN, VLAN_TAG_PRESENT set otherwise. */
+		__be16 type;		/* Ethernet frame type. */
+	} eth;
+	struct {
+		u8     proto;		/* IP protocol or lower 8 bits of ARP opcode. */
+		u8     tos;		/* IP ToS. */
+		u8     ttl;		/* IP TTL/hop limit. */
+		u8     frag;		/* One of OVS_FRAG_TYPE_*. */
+	} ip;
+	struct {
+		__be16 src;		/* TCP/UDP/SCTP source port. */
+		__be16 dst;		/* TCP/UDP/SCTP destination port. */
+		__be16 flags;		/* TCP flags. */
+	} tp;
+	union {
+		struct {
+			struct {
+				__be32 src;	/* IP source address. */
+				__be32 dst;	/* IP destination address. */
+			} addr;
+			struct {
+				u8 sha[ETH_ALEN];	/* ARP source hardware address. */
+				u8 tha[ETH_ALEN];	/* ARP target hardware address. */
+			} arp;
+		} ipv4;
+		struct {
+			struct {
+				struct in6_addr src;	/* IPv6 source address. */
+				struct in6_addr dst;	/* IPv6 destination address. */
+			} addr;
+			__be32 label;			/* IPv6 flow label. */
+			struct {
+				struct in6_addr target;	/* ND target address. */
+				u8 sll[ETH_ALEN];	/* ND source link layer address. */
+				u8 tll[ETH_ALEN];	/* ND target link layer address. */
+			} nd;
+		} ipv6;
+	};
+} __aligned(BITS_PER_LONG/8); /* Ensure that we can do comparisons as longs. */
+
+struct sw_flow_key_range {
+	unsigned short int start;
+	unsigned short int end;
+};
+
+struct sw_flow_mask {
+	struct sw_flow_key_range range;
+	struct sw_flow_key key;
+};
+
+struct sw_flow_action {
+};
+
+struct sw_flow_actions {
+	unsigned count;
+	struct sw_flow_action actions[0];
+};
+
+struct sw_flow {
+	struct sw_flow_key key;
+	struct sw_flow_key unmasked_key;
+	struct sw_flow_mask *mask;
+	struct sw_flow_actions *actions;
+};
+
+#endif /* _LINUX_SW_FLOW_H_ */
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index fe5cda0..cb6d242 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -605,8 +605,9 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
 /* Execute a list of actions against 'skb'. */
 int ovs_execute_actions(struct datapath *dp, struct sk_buff *skb)
 {
-	struct sw_flow_actions *acts = rcu_dereference(OVS_CB(skb)->flow->sf_acts);
+	struct ovs_flow_actions *acts;
 
+	acts = rcu_dereference(OVS_CB(skb)->flow->sf_acts);
 	OVS_CB(skb)->tun_key = NULL;
 	return do_execute_actions(dp, skb, acts->actions, acts->actions_len);
 }
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 7228ec3..683d6cd 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -240,7 +240,7 @@ void ovs_dp_detach_port(struct vport *p)
 void ovs_dp_process_received_packet(struct vport *p, struct sk_buff *skb)
 {
 	struct datapath *dp = p->dp;
-	struct sw_flow *flow;
+	struct ovs_flow *flow;
 	struct dp_stats_percpu *stats;
 	struct sw_flow_key key;
 	u64 *stats_counter;
@@ -505,9 +505,9 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info)
 {
 	struct ovs_header *ovs_header = info->userhdr;
 	struct nlattr **a = info->attrs;
-	struct sw_flow_actions *acts;
+	struct ovs_flow_actions *acts;
 	struct sk_buff *packet;
-	struct sw_flow *flow;
+	struct ovs_flow *flow;
 	struct datapath *dp;
 	struct ethhdr *eth;
 	int len;
@@ -544,11 +544,11 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info)
 	if (IS_ERR(flow))
 		goto err_kfree_skb;
 
-	err = ovs_flow_extract(packet, -1, &flow->key);
+	err = ovs_flow_extract(packet, -1, &flow->flow.key);
 	if (err)
 		goto err_flow_free;
 
-	err = ovs_nla_get_flow_metadata(flow, a[OVS_PACKET_ATTR_KEY]);
+	err = ovs_nla_get_flow_metadata(&flow->flow, a[OVS_PACKET_ATTR_KEY]);
 	if (err)
 		goto err_flow_free;
 	acts = ovs_nla_alloc_flow_actions(nla_len(a[OVS_PACKET_ATTR_ACTIONS]));
@@ -557,15 +557,15 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info)
 		goto err_flow_free;
 
 	err = ovs_nla_copy_actions(a[OVS_PACKET_ATTR_ACTIONS],
-				   &flow->key, 0, &acts);
+				   &flow->flow.key, 0, &acts);
 	rcu_assign_pointer(flow->sf_acts, acts);
 	if (err)
 		goto err_flow_free;
 
 	OVS_CB(packet)->flow = flow;
-	OVS_CB(packet)->pkt_key = &flow->key;
-	packet->priority = flow->key.phy.priority;
-	packet->mark = flow->key.phy.skb_mark;
+	OVS_CB(packet)->pkt_key = &flow->flow.key;
+	packet->priority = flow->flow.key.phy.priority;
+	packet->mark = flow->flow.key.phy.skb_mark;
 
 	rcu_read_lock();
 	dp = get_dp(sock_net(skb->sk), ovs_header->dp_ifindex);
@@ -648,7 +648,7 @@ static void get_dp_stats(struct datapath *dp, struct ovs_dp_stats *stats,
 	}
 }
 
-static size_t ovs_flow_cmd_msg_size(const struct sw_flow_actions *acts)
+static size_t ovs_flow_cmd_msg_size(const struct ovs_flow_actions *acts)
 {
 	return NLMSG_ALIGN(sizeof(struct ovs_header))
 		+ nla_total_size(key_attr_size()) /* OVS_FLOW_ATTR_KEY */
@@ -660,7 +660,7 @@ static size_t ovs_flow_cmd_msg_size(const struct sw_flow_actions *acts)
 }
 
 /* Called with ovs_mutex or RCU read lock. */
-static int ovs_flow_cmd_fill_info(const struct sw_flow *flow, int dp_ifindex,
+static int ovs_flow_cmd_fill_info(const struct ovs_flow *flow, int dp_ifindex,
 				  struct sk_buff *skb, u32 portid,
 				  u32 seq, u32 flags, u8 cmd)
 {
@@ -684,7 +684,8 @@ static int ovs_flow_cmd_fill_info(const struct sw_flow *flow, int dp_ifindex,
 	if (!nla)
 		goto nla_put_failure;
 
-	err = ovs_nla_put_flow(&flow->unmasked_key, &flow->unmasked_key, skb);
+	err = ovs_nla_put_flow(&flow->flow.unmasked_key,
+			       &flow->flow.unmasked_key, skb);
 	if (err)
 		goto error;
 	nla_nest_end(skb, nla);
@@ -693,7 +694,7 @@ static int ovs_flow_cmd_fill_info(const struct sw_flow *flow, int dp_ifindex,
 	if (!nla)
 		goto nla_put_failure;
 
-	err = ovs_nla_put_flow(&flow->key, &flow->mask->key, skb);
+	err = ovs_nla_put_flow(&flow->flow.key, &flow->flow.mask->key, skb);
 	if (err)
 		goto error;
 
@@ -725,7 +726,7 @@ static int ovs_flow_cmd_fill_info(const struct sw_flow *flow, int dp_ifindex,
 	 */
 	start = nla_nest_start(skb, OVS_FLOW_ATTR_ACTIONS);
 	if (start) {
-		const struct sw_flow_actions *sf_acts;
+		const struct ovs_flow_actions *sf_acts;
 
 		sf_acts = rcu_dereference_ovsl(flow->sf_acts);
 		err = ovs_nla_put_actions(sf_acts->actions,
@@ -752,9 +753,9 @@ error:
 }
 
 /* May not be called with RCU read lock. */
-static struct sk_buff *ovs_flow_cmd_alloc_info(const struct sw_flow_actions *acts,
-					       struct genl_info *info,
-					       bool always)
+static struct sk_buff *
+ovs_flow_cmd_alloc_info(const struct ovs_flow_actions *acts,
+			struct genl_info *info, bool always)
 {
 	struct sk_buff *skb;
 
@@ -769,7 +770,7 @@ static struct sk_buff *ovs_flow_cmd_alloc_info(const struct sw_flow_actions *act
 }
 
 /* Called with ovs_mutex. */
-static struct sk_buff *ovs_flow_cmd_build_info(const struct sw_flow *flow,
+static struct sk_buff *ovs_flow_cmd_build_info(const struct ovs_flow *flow,
 					       int dp_ifindex,
 					       struct genl_info *info, u8 cmd,
 					       bool always)
@@ -793,12 +794,12 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
 {
 	struct nlattr **a = info->attrs;
 	struct ovs_header *ovs_header = info->userhdr;
-	struct sw_flow *flow, *new_flow;
+	struct ovs_flow *flow, *new_flow;
 	struct sw_flow_mask mask;
 	struct sk_buff *reply;
 	struct datapath *dp;
-	struct sw_flow_actions *acts;
-	struct sw_flow_match match;
+	struct ovs_flow_actions *acts;
+	struct ovs_flow_match match;
 	int error;
 
 	/* Must have key and actions. */
@@ -818,13 +819,14 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
 	}
 
 	/* Extract key. */
-	ovs_match_init(&match, &new_flow->unmasked_key, &mask);
+	ovs_match_init(&match, &new_flow->flow.unmasked_key, &mask);
 	error = ovs_nla_get_match(&match,
 				  a[OVS_FLOW_ATTR_KEY], a[OVS_FLOW_ATTR_MASK]);
 	if (error)
 		goto err_kfree_flow;
 
-	ovs_flow_mask_key(&new_flow->key, &new_flow->unmasked_key, &mask);
+	ovs_flow_mask_key(&new_flow->flow.key,
+			  &new_flow->flow.unmasked_key, &mask);
 
 	/* Validate actions. */
 	acts = ovs_nla_alloc_flow_actions(nla_len(a[OVS_FLOW_ATTR_ACTIONS]));
@@ -832,8 +834,8 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
 	if (IS_ERR(acts))
 		goto err_kfree_flow;
 
-	error = ovs_nla_copy_actions(a[OVS_FLOW_ATTR_ACTIONS], &new_flow->key,
-				     0, &acts);
+	error = ovs_nla_copy_actions(a[OVS_FLOW_ATTR_ACTIONS],
+				     &new_flow->flow.key, 0, &acts);
 	if (error) {
 		OVS_NLERR("Flow actions may not be safe on all matching packets.\n");
 		goto err_kfree_acts;
@@ -852,7 +854,7 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
 		goto err_unlock_ovs;
 	}
 	/* Check if this is a duplicate flow */
-	flow = ovs_flow_tbl_lookup(&dp->table, &new_flow->unmasked_key);
+	flow = ovs_flow_tbl_lookup(&dp->table, &new_flow->flow.unmasked_key);
 	if (likely(!flow)) {
 		rcu_assign_pointer(new_flow->sf_acts, acts);
 
@@ -873,7 +875,7 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
 		}
 		ovs_unlock();
 	} else {
-		struct sw_flow_actions *old_acts;
+		struct ovs_flow_actions *old_acts;
 
 		/* Bail out if we're not allowed to modify an existing flow.
 		 * We accept NLM_F_CREATE in place of the intended NLM_F_EXCL
@@ -932,12 +934,12 @@ static int ovs_flow_cmd_set(struct sk_buff *skb, struct genl_info *info)
 	struct nlattr **a = info->attrs;
 	struct ovs_header *ovs_header = info->userhdr;
 	struct sw_flow_key key, masked_key;
-	struct sw_flow *flow;
+	struct ovs_flow *flow;
 	struct sw_flow_mask mask;
 	struct sk_buff *reply = NULL;
 	struct datapath *dp;
-	struct sw_flow_actions *old_acts = NULL, *acts = NULL;
-	struct sw_flow_match match;
+	struct ovs_flow_actions *old_acts = NULL, *acts = NULL;
+	struct ovs_flow_match match;
 	int error;
 
 	/* Extract key. */
@@ -1039,9 +1041,9 @@ static int ovs_flow_cmd_get(struct sk_buff *skb, struct genl_info *info)
 	struct ovs_header *ovs_header = info->userhdr;
 	struct sw_flow_key key;
 	struct sk_buff *reply;
-	struct sw_flow *flow;
+	struct ovs_flow *flow;
 	struct datapath *dp;
-	struct sw_flow_match match;
+	struct ovs_flow_match match;
 	int err;
 
 	if (!a[OVS_FLOW_ATTR_KEY]) {
@@ -1087,9 +1089,9 @@ static int ovs_flow_cmd_del(struct sk_buff *skb, struct genl_info *info)
 	struct ovs_header *ovs_header = info->userhdr;
 	struct sw_flow_key key;
 	struct sk_buff *reply;
-	struct sw_flow *flow;
+	struct ovs_flow *flow;
 	struct datapath *dp;
-	struct sw_flow_match match;
+	struct ovs_flow_match match;
 	int err;
 
 	if (likely(a[OVS_FLOW_ATTR_KEY])) {
@@ -1120,7 +1122,7 @@ static int ovs_flow_cmd_del(struct sk_buff *skb, struct genl_info *info)
 	ovs_flow_tbl_remove(&dp->table, flow);
 	ovs_unlock();
 
-	reply = ovs_flow_cmd_alloc_info((const struct sw_flow_actions __force *) flow->sf_acts,
+	reply = ovs_flow_cmd_alloc_info((const struct ovs_flow_actions __force *) flow->sf_acts,
 					info, false);
 	if (likely(reply)) {
 		if (likely(!IS_ERR(reply))) {
@@ -1160,7 +1162,7 @@ static int ovs_flow_cmd_dump(struct sk_buff *skb, struct netlink_callback *cb)
 
 	ti = rcu_dereference(dp->table.ti);
 	for (;;) {
-		struct sw_flow *flow;
+		struct ovs_flow *flow;
 		u32 bucket, obj;
 
 		bucket = cb->args[0];
diff --git a/net/openvswitch/datapath.h b/net/openvswitch/datapath.h
index 701b573..291f5a0 100644
--- a/net/openvswitch/datapath.h
+++ b/net/openvswitch/datapath.h
@@ -100,9 +100,9 @@ struct datapath {
  * packet is not being tunneled.
  */
 struct ovs_skb_cb {
-	struct sw_flow		*flow;
+	struct ovs_flow		*flow;
 	struct sw_flow_key	*pkt_key;
-	struct ovs_key_ipv4_tunnel  *tun_key;
+	struct sw_flow_key_ipv4_tunnel  *tun_key;
 };
 #define OVS_CB(skb) ((struct ovs_skb_cb *)(skb)->cb)
 
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index d07ab53..40949a5 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -61,7 +61,7 @@ u64 ovs_flow_used_time(unsigned long flow_jiffies)
 
 #define TCP_FLAGS_BE16(tp) (*(__be16 *)&tcp_flag_word(tp) & htons(0x0FFF))
 
-void ovs_flow_stats_update(struct sw_flow *flow, __be16 tcp_flags,
+void ovs_flow_stats_update(struct ovs_flow *flow, __be16 tcp_flags,
 			   struct sk_buff *skb)
 {
 	struct flow_stats *stats;
@@ -123,7 +123,7 @@ unlock:
 }
 
 /* Must be called with rcu_read_lock or ovs_mutex. */
-void ovs_flow_stats_get(const struct sw_flow *flow,
+void ovs_flow_stats_get(const struct ovs_flow *flow,
 			struct ovs_flow_stats *ovs_stats,
 			unsigned long *used, __be16 *tcp_flags)
 {
@@ -152,7 +152,7 @@ void ovs_flow_stats_get(const struct sw_flow *flow,
 }
 
 /* Called with ovs_mutex. */
-void ovs_flow_stats_clear(struct sw_flow *flow)
+void ovs_flow_stats_clear(struct ovs_flow *flow)
 {
 	int node;
 
diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index 5e5aaed..90ce2ea 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -32,26 +32,18 @@
 #include <linux/time.h>
 #include <linux/flex_array.h>
 #include <net/inet_ecn.h>
+#include <linux/sw_flow.h>
 
 struct sk_buff;
 
-/* Used to memset ovs_key_ipv4_tunnel padding. */
+/* Used to memset sw_flow_key_ipv4_tunnel padding. */
 #define OVS_TUNNEL_KEY_SIZE					\
-	(offsetof(struct ovs_key_ipv4_tunnel, ipv4_ttl) +	\
-	FIELD_SIZEOF(struct ovs_key_ipv4_tunnel, ipv4_ttl))
-
-struct ovs_key_ipv4_tunnel {
-	__be64 tun_id;
-	__be32 ipv4_src;
-	__be32 ipv4_dst;
-	__be16 tun_flags;
-	u8   ipv4_tos;
-	u8   ipv4_ttl;
-} __packed __aligned(4); /* Minimize padding. */
-
-static inline void ovs_flow_tun_key_init(struct ovs_key_ipv4_tunnel *tun_key,
-					 const struct iphdr *iph, __be64 tun_id,
-					 __be16 tun_flags)
+	(offsetof(struct sw_flow_key_ipv4_tunnel, ipv4_ttl) +	\
+	FIELD_SIZEOF(struct sw_flow_key_ipv4_tunnel, ipv4_ttl))
+
+static inline void
+ovs_flow_tun_key_init(struct sw_flow_key_ipv4_tunnel *tun_key,
+		      const struct iphdr *iph, __be64 tun_id, __be16 tun_flags)
 {
 	tun_key->tun_id = tun_id;
 	tun_key->ipv4_src = iph->saddr;
@@ -65,76 +57,20 @@ static inline void ovs_flow_tun_key_init(struct ovs_key_ipv4_tunnel *tun_key,
 	       sizeof(*tun_key) - OVS_TUNNEL_KEY_SIZE);
 }
 
-struct sw_flow_key {
-	struct ovs_key_ipv4_tunnel tun_key;  /* Encapsulating tunnel key. */
-	struct {
-		u32	priority;	/* Packet QoS priority. */
-		u32	skb_mark;	/* SKB mark. */
-		u16	in_port;	/* Input switch port (or DP_MAX_PORTS). */
-	} __packed phy; /* Safe when right after 'tun_key'. */
-	struct {
-		u8     src[ETH_ALEN];	/* Ethernet source address. */
-		u8     dst[ETH_ALEN];	/* Ethernet destination address. */
-		__be16 tci;		/* 0 if no VLAN, VLAN_TAG_PRESENT set otherwise. */
-		__be16 type;		/* Ethernet frame type. */
-	} eth;
-	struct {
-		u8     proto;		/* IP protocol or lower 8 bits of ARP opcode. */
-		u8     tos;		/* IP ToS. */
-		u8     ttl;		/* IP TTL/hop limit. */
-		u8     frag;		/* One of OVS_FRAG_TYPE_*. */
-	} ip;
-	struct {
-		__be16 src;		/* TCP/UDP/SCTP source port. */
-		__be16 dst;		/* TCP/UDP/SCTP destination port. */
-		__be16 flags;		/* TCP flags. */
-	} tp;
-	union {
-		struct {
-			struct {
-				__be32 src;	/* IP source address. */
-				__be32 dst;	/* IP destination address. */
-			} addr;
-			struct {
-				u8 sha[ETH_ALEN];	/* ARP source hardware address. */
-				u8 tha[ETH_ALEN];	/* ARP target hardware address. */
-			} arp;
-		} ipv4;
-		struct {
-			struct {
-				struct in6_addr src;	/* IPv6 source address. */
-				struct in6_addr dst;	/* IPv6 destination address. */
-			} addr;
-			__be32 label;			/* IPv6 flow label. */
-			struct {
-				struct in6_addr target;	/* ND target address. */
-				u8 sll[ETH_ALEN];	/* ND source link layer address. */
-				u8 tll[ETH_ALEN];	/* ND target link layer address. */
-			} nd;
-		} ipv6;
-	};
-} __aligned(BITS_PER_LONG/8); /* Ensure that we can do comparisons as longs. */
-
-struct sw_flow_key_range {
-	unsigned short int start;
-	unsigned short int end;
-};
-
-struct sw_flow_mask {
+struct ovs_flow_mask {
 	int ref_count;
 	struct rcu_head rcu;
 	struct list_head list;
-	struct sw_flow_key_range range;
-	struct sw_flow_key key;
+	struct sw_flow_mask mask;
 };
 
-struct sw_flow_match {
+struct ovs_flow_match {
 	struct sw_flow_key *key;
 	struct sw_flow_key_range range;
 	struct sw_flow_mask *mask;
 };
 
-struct sw_flow_actions {
+struct ovs_flow_actions {
 	struct rcu_head rcu;
 	u32 actions_len;
 	struct nlattr actions[];
@@ -148,17 +84,15 @@ struct flow_stats {
 	__be16 tcp_flags;		/* Union of seen TCP flags. */
 };
 
-struct sw_flow {
+struct ovs_flow {
 	struct rcu_head rcu;
 	struct hlist_node hash_node[2];
 	u32 hash;
 	int stats_last_writer;		/* NUMA-node id of the last writer on
 					 * 'stats[0]'.
 					 */
-	struct sw_flow_key key;
-	struct sw_flow_key unmasked_key;
-	struct sw_flow_mask *mask;
-	struct sw_flow_actions __rcu *sf_acts;
+	struct sw_flow flow;
+	struct ovs_flow_actions __rcu *sf_acts;
 	struct flow_stats __rcu *stats[]; /* One for each NUMA node.  First one
 					   * is allocated at flow creation time,
 					   * the rest are allocated on demand
@@ -180,11 +114,11 @@ struct arp_eth_header {
 	unsigned char       ar_tip[4];		/* target IP address        */
 } __packed;
 
-void ovs_flow_stats_update(struct sw_flow *, __be16 tcp_flags,
+void ovs_flow_stats_update(struct ovs_flow *, __be16 tcp_flags,
 			   struct sk_buff *);
-void ovs_flow_stats_get(const struct sw_flow *, struct ovs_flow_stats *,
+void ovs_flow_stats_get(const struct ovs_flow *, struct ovs_flow_stats *,
 			unsigned long *used, __be16 *tcp_flags);
-void ovs_flow_stats_clear(struct sw_flow *);
+void ovs_flow_stats_clear(struct ovs_flow *);
 u64 ovs_flow_used_time(unsigned long flow_jiffies);
 
 int ovs_flow_extract(struct sk_buff *, u16 in_port, struct sw_flow_key *);
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index d757848..1eb5054 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -48,7 +48,7 @@
 
 #include "flow_netlink.h"
 
-static void update_range__(struct sw_flow_match *match,
+static void update_range__(struct ovs_flow_match *match,
 			   size_t offset, size_t size, bool is_mask)
 {
 	struct sw_flow_key_range *range = NULL;
@@ -105,7 +105,7 @@ static u16 range_n_bytes(const struct sw_flow_key_range *range)
 	return range->end - range->start;
 }
 
-static bool match_validate(const struct sw_flow_match *match,
+static bool match_validate(const struct ovs_flow_match *match,
 			   u64 key_attrs, u64 mask_attrs)
 {
 	u64 key_expected = 1 << OVS_KEY_ATTR_ETHERNET;
@@ -327,7 +327,7 @@ static int parse_flow_nlattrs(const struct nlattr *attr,
 }
 
 static int ipv4_tun_from_nlattr(const struct nlattr *attr,
-				struct sw_flow_match *match, bool is_mask)
+				struct ovs_flow_match *match, bool is_mask)
 {
 	struct nlattr *a;
 	int rem;
@@ -416,8 +416,8 @@ static int ipv4_tun_from_nlattr(const struct nlattr *attr,
 }
 
 static int ipv4_tun_to_nlattr(struct sk_buff *skb,
-			      const struct ovs_key_ipv4_tunnel *tun_key,
-			      const struct ovs_key_ipv4_tunnel *output)
+			      const struct sw_flow_key_ipv4_tunnel *tun_key,
+			      const struct sw_flow_key_ipv4_tunnel *output)
 {
 	struct nlattr *nla;
 
@@ -451,7 +451,7 @@ static int ipv4_tun_to_nlattr(struct sk_buff *skb,
 }
 
 
-static int metadata_from_nlattrs(struct sw_flow_match *match,  u64 *attrs,
+static int metadata_from_nlattrs(struct ovs_flow_match *match,  u64 *attrs,
 				 const struct nlattr **a, bool is_mask)
 {
 	if (*attrs & (1 << OVS_KEY_ATTR_PRIORITY)) {
@@ -489,7 +489,7 @@ static int metadata_from_nlattrs(struct sw_flow_match *match,  u64 *attrs,
 	return 0;
 }
 
-static int ovs_key_from_nlattrs(struct sw_flow_match *match, u64 attrs,
+static int ovs_key_from_nlattrs(struct ovs_flow_match *match, u64 attrs,
 				const struct nlattr **a, bool is_mask)
 {
 	int err;
@@ -730,7 +730,7 @@ static void sw_flow_mask_set(struct sw_flow_mask *mask,
  * @mask: Optional. Netlink attribute holding nested %OVS_KEY_ATTR_* Netlink
  * attribute specifies the mask field of the wildcarded flow.
  */
-int ovs_nla_get_match(struct sw_flow_match *match,
+int ovs_nla_get_match(struct ovs_flow_match *match,
 		      const struct nlattr *key,
 		      const struct nlattr *mask)
 {
@@ -849,11 +849,11 @@ int ovs_nla_get_match(struct sw_flow_match *match,
 int ovs_nla_get_flow_metadata(struct sw_flow *flow,
 			      const struct nlattr *attr)
 {
-	struct ovs_key_ipv4_tunnel *tun_key = &flow->key.tun_key;
+	struct sw_flow_key_ipv4_tunnel *tun_key = &flow->key.tun_key;
 	const struct nlattr *a[OVS_KEY_ATTR_MAX + 1];
 	u64 attrs = 0;
 	int err;
-	struct sw_flow_match match;
+	struct ovs_flow_match match;
 
 	flow->key.phy.in_port = DP_MAX_PORTS;
 	flow->key.phy.priority = 0;
@@ -1070,9 +1070,9 @@ nla_put_failure:
 
 #define MAX_ACTIONS_BUFSIZE	(32 * 1024)
 
-struct sw_flow_actions *ovs_nla_alloc_flow_actions(int size)
+struct ovs_flow_actions *ovs_nla_alloc_flow_actions(int size)
 {
-	struct sw_flow_actions *sfa;
+	struct ovs_flow_actions *sfa;
 
 	if (size > MAX_ACTIONS_BUFSIZE)
 		return ERR_PTR(-EINVAL);
@@ -1087,19 +1087,19 @@ struct sw_flow_actions *ovs_nla_alloc_flow_actions(int size)
 
 /* Schedules 'sf_acts' to be freed after the next RCU grace period.
  * The caller must hold rcu_read_lock for this to be sensible. */
-void ovs_nla_free_flow_actions(struct sw_flow_actions *sf_acts)
+void ovs_nla_free_flow_actions(struct ovs_flow_actions *sf_acts)
 {
 	kfree_rcu(sf_acts, rcu);
 }
 
-static struct nlattr *reserve_sfa_size(struct sw_flow_actions **sfa,
+static struct nlattr *reserve_sfa_size(struct ovs_flow_actions **sfa,
 				       int attr_len)
 {
 
-	struct sw_flow_actions *acts;
+	struct ovs_flow_actions *acts;
 	int new_acts_size;
 	int req_size = NLA_ALIGN(attr_len);
-	int next_offset = offsetof(struct sw_flow_actions, actions) +
+	int next_offset = offsetof(struct ovs_flow_actions, actions) +
 					(*sfa)->actions_len;
 
 	if (req_size <= (ksize(*sfa) - next_offset))
@@ -1127,7 +1127,8 @@ out:
 	return  (struct nlattr *) ((unsigned char *)(*sfa) + next_offset);
 }
 
-static int add_action(struct sw_flow_actions **sfa, int attrtype, void *data, int len)
+static int add_action(struct ovs_flow_actions **sfa, int attrtype,
+		      void *data, int len)
 {
 	struct nlattr *a;
 
@@ -1145,7 +1146,7 @@ static int add_action(struct sw_flow_actions **sfa, int attrtype, void *data, in
 	return 0;
 }
 
-static inline int add_nested_action_start(struct sw_flow_actions **sfa,
+static inline int add_nested_action_start(struct ovs_flow_actions **sfa,
 					  int attrtype)
 {
 	int used = (*sfa)->actions_len;
@@ -1158,7 +1159,7 @@ static inline int add_nested_action_start(struct sw_flow_actions **sfa,
 	return used;
 }
 
-static inline void add_nested_action_end(struct sw_flow_actions *sfa,
+static inline void add_nested_action_end(struct ovs_flow_actions *sfa,
 					 int st_offset)
 {
 	struct nlattr *a = (struct nlattr *) ((unsigned char *)sfa->actions +
@@ -1169,7 +1170,7 @@ static inline void add_nested_action_end(struct sw_flow_actions *sfa,
 
 static int validate_and_copy_sample(const struct nlattr *attr,
 				    const struct sw_flow_key *key, int depth,
-				    struct sw_flow_actions **sfa)
+				    struct ovs_flow_actions **sfa)
 {
 	const struct nlattr *attrs[OVS_SAMPLE_ATTR_MAX + 1];
 	const struct nlattr *probability, *actions;
@@ -1226,7 +1227,7 @@ static int validate_tp_port(const struct sw_flow_key *flow_key)
 	return -EINVAL;
 }
 
-void ovs_match_init(struct sw_flow_match *match,
+void ovs_match_init(struct ovs_flow_match *match,
 		    struct sw_flow_key *key,
 		    struct sw_flow_mask *mask)
 {
@@ -1243,9 +1244,9 @@ void ovs_match_init(struct sw_flow_match *match,
 }
 
 static int validate_and_copy_set_tun(const struct nlattr *attr,
-				     struct sw_flow_actions **sfa)
+				     struct ovs_flow_actions **sfa)
 {
-	struct sw_flow_match match;
+	struct ovs_flow_match match;
 	struct sw_flow_key key;
 	int err, start;
 
@@ -1267,7 +1268,7 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
 
 static int validate_set(const struct nlattr *a,
 			const struct sw_flow_key *flow_key,
-			struct sw_flow_actions **sfa,
+			struct ovs_flow_actions **sfa,
 			bool *set_tun)
 {
 	const struct nlattr *ovs_key = nla_data(a);
@@ -1381,7 +1382,7 @@ static int validate_userspace(const struct nlattr *attr)
 }
 
 static int copy_action(const struct nlattr *from,
-		       struct sw_flow_actions **sfa)
+		       struct ovs_flow_actions **sfa)
 {
 	int totlen = NLA_ALIGN(from->nla_len);
 	struct nlattr *to;
@@ -1397,7 +1398,7 @@ static int copy_action(const struct nlattr *from,
 int ovs_nla_copy_actions(const struct nlattr *attr,
 			 const struct sw_flow_key *key,
 			 int depth,
-			 struct sw_flow_actions **sfa)
+			 struct ovs_flow_actions **sfa)
 {
 	const struct nlattr *a;
 	int rem, err;
diff --git a/net/openvswitch/flow_netlink.h b/net/openvswitch/flow_netlink.h
index 4401510..296b126 100644
--- a/net/openvswitch/flow_netlink.h
+++ b/net/openvswitch/flow_netlink.h
@@ -37,24 +37,24 @@
 
 #include "flow.h"
 
-void ovs_match_init(struct sw_flow_match *match,
+void ovs_match_init(struct ovs_flow_match *match,
 		    struct sw_flow_key *key, struct sw_flow_mask *mask);
 
 int ovs_nla_put_flow(const struct sw_flow_key *,
 		     const struct sw_flow_key *, struct sk_buff *);
 int ovs_nla_get_flow_metadata(struct sw_flow *flow,
 			      const struct nlattr *attr);
-int ovs_nla_get_match(struct sw_flow_match *match,
+int ovs_nla_get_match(struct ovs_flow_match *match,
 		      const struct nlattr *,
 		      const struct nlattr *);
 
 int ovs_nla_copy_actions(const struct nlattr *attr,
 			 const struct sw_flow_key *key, int depth,
-			 struct sw_flow_actions **sfa);
+			 struct ovs_flow_actions **sfa);
 int ovs_nla_put_actions(const struct nlattr *attr,
 			int len, struct sk_buff *skb);
 
-struct sw_flow_actions *ovs_nla_alloc_flow_actions(int actions_len);
-void ovs_nla_free_flow_actions(struct sw_flow_actions *);
+struct ovs_flow_actions *ovs_nla_alloc_flow_actions(int actions_len);
+void ovs_nla_free_flow_actions(struct ovs_flow_actions *);
 
 #endif /* flow_netlink.h */
diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
index cf2d853..e7d9a41 100644
--- a/net/openvswitch/flow_table.c
+++ b/net/openvswitch/flow_table.c
@@ -73,9 +73,9 @@ void ovs_flow_mask_key(struct sw_flow_key *dst, const struct sw_flow_key *src,
 		*d++ = *s++ & *m++;
 }
 
-struct sw_flow *ovs_flow_alloc(void)
+struct ovs_flow *ovs_flow_alloc(void)
 {
-	struct sw_flow *flow;
+	struct ovs_flow *flow;
 	struct flow_stats *stats;
 	int node;
 
@@ -84,7 +84,7 @@ struct sw_flow *ovs_flow_alloc(void)
 		return ERR_PTR(-ENOMEM);
 
 	flow->sf_acts = NULL;
-	flow->mask = NULL;
+	flow->flow.mask = NULL;
 	flow->stats_last_writer = NUMA_NO_NODE;
 
 	/* Initialize the default stat node. */
@@ -135,11 +135,11 @@ static struct flex_array *alloc_buckets(unsigned int n_buckets)
 	return buckets;
 }
 
-static void flow_free(struct sw_flow *flow)
+static void flow_free(struct ovs_flow *flow)
 {
 	int node;
 
-	kfree((struct sw_flow_actions __force *)flow->sf_acts);
+	kfree((struct ovs_flow_actions __force *)flow->sf_acts);
 	for_each_node(node)
 		if (flow->stats[node])
 			kmem_cache_free(flow_stats_cache,
@@ -149,12 +149,12 @@ static void flow_free(struct sw_flow *flow)
 
 static void rcu_free_flow_callback(struct rcu_head *rcu)
 {
-	struct sw_flow *flow = container_of(rcu, struct sw_flow, rcu);
+	struct ovs_flow *flow = container_of(rcu, struct ovs_flow, rcu);
 
 	flow_free(flow);
 }
 
-void ovs_flow_free(struct sw_flow *flow, bool deferred)
+void ovs_flow_free(struct ovs_flow *flow, bool deferred)
 {
 	if (!flow)
 		return;
@@ -232,7 +232,7 @@ static void table_instance_destroy(struct table_instance *ti, bool deferred)
 		goto skip_flows;
 
 	for (i = 0; i < ti->n_buckets; i++) {
-		struct sw_flow *flow;
+		struct ovs_flow *flow;
 		struct hlist_head *head = flex_array_get(ti->buckets, i);
 		struct hlist_node *n;
 		int ver = ti->node_ver;
@@ -257,10 +257,10 @@ void ovs_flow_tbl_destroy(struct flow_table *table, bool deferred)
 	table_instance_destroy(ti, deferred);
 }
 
-struct sw_flow *ovs_flow_tbl_dump_next(struct table_instance *ti,
-				       u32 *bucket, u32 *last)
+struct ovs_flow *ovs_flow_tbl_dump_next(struct table_instance *ti,
+					u32 *bucket, u32 *last)
 {
-	struct sw_flow *flow;
+	struct ovs_flow *flow;
 	struct hlist_head *head;
 	int ver;
 	int i;
@@ -291,7 +291,8 @@ static struct hlist_head *find_bucket(struct table_instance *ti, u32 hash)
 				(hash & (ti->n_buckets - 1)));
 }
 
-static void table_instance_insert(struct table_instance *ti, struct sw_flow *flow)
+static void table_instance_insert(struct table_instance *ti,
+				  struct ovs_flow *flow)
 {
 	struct hlist_head *head;
 
@@ -310,7 +311,7 @@ static void flow_table_copy_flows(struct table_instance *old,
 
 	/* Insert in new table. */
 	for (i = 0; i < old->n_buckets; i++) {
-		struct sw_flow *flow;
+		struct ovs_flow *flow;
 		struct hlist_head *head;
 
 		head = flex_array_get(old->buckets, i);
@@ -397,21 +398,21 @@ static bool flow_cmp_masked_key(const struct sw_flow *flow,
 	return cmp_key(&flow->key, key, key_start, key_end);
 }
 
-bool ovs_flow_cmp_unmasked_key(const struct sw_flow *flow,
-			       struct sw_flow_match *match)
+bool ovs_flow_cmp_unmasked_key(const struct ovs_flow *flow,
+			       struct ovs_flow_match *match)
 {
 	struct sw_flow_key *key = match->key;
 	int key_start = flow_key_start(key);
 	int key_end = match->range.end;
 
-	return cmp_key(&flow->unmasked_key, key, key_start, key_end);
+	return cmp_key(&flow->flow.unmasked_key, key, key_start, key_end);
 }
 
-static struct sw_flow *masked_flow_lookup(struct table_instance *ti,
-					  const struct sw_flow_key *unmasked,
-					  struct sw_flow_mask *mask)
+static struct ovs_flow *masked_flow_lookup(struct table_instance *ti,
+					   const struct sw_flow_key *unmasked,
+					   struct sw_flow_mask *mask)
 {
-	struct sw_flow *flow;
+	struct ovs_flow *flow;
 	struct hlist_head *head;
 	int key_start = mask->range.start;
 	int key_end = mask->range.end;
@@ -422,50 +423,50 @@ static struct sw_flow *masked_flow_lookup(struct table_instance *ti,
 	hash = flow_hash(&masked_key, key_start, key_end);
 	head = find_bucket(ti, hash);
 	hlist_for_each_entry_rcu(flow, head, hash_node[ti->node_ver]) {
-		if (flow->mask == mask && flow->hash == hash &&
-		    flow_cmp_masked_key(flow, &masked_key,
-					  key_start, key_end))
+		if (flow->flow.mask == mask && flow->hash == hash &&
+		    flow_cmp_masked_key(&flow->flow, &masked_key,
+					key_start, key_end))
 			return flow;
 	}
 	return NULL;
 }
 
-struct sw_flow *ovs_flow_tbl_lookup_stats(struct flow_table *tbl,
-				    const struct sw_flow_key *key,
-				    u32 *n_mask_hit)
+struct ovs_flow *ovs_flow_tbl_lookup_stats(struct flow_table *tbl,
+					   const struct sw_flow_key *key,
+					   u32 *n_mask_hit)
 {
 	struct table_instance *ti = rcu_dereference_ovsl(tbl->ti);
-	struct sw_flow_mask *mask;
-	struct sw_flow *flow;
+	struct ovs_flow_mask *mask;
+	struct ovs_flow *flow;
 
 	*n_mask_hit = 0;
 	list_for_each_entry_rcu(mask, &tbl->mask_list, list) {
 		(*n_mask_hit)++;
-		flow = masked_flow_lookup(ti, key, mask);
+		flow = masked_flow_lookup(ti, key, &mask->mask);
 		if (flow)  /* Found */
 			return flow;
 	}
 	return NULL;
 }
 
-struct sw_flow *ovs_flow_tbl_lookup(struct flow_table *tbl,
-				    const struct sw_flow_key *key)
+struct ovs_flow *ovs_flow_tbl_lookup(struct flow_table *tbl,
+				     const struct sw_flow_key *key)
 {
 	u32 __always_unused n_mask_hit;
 
 	return ovs_flow_tbl_lookup_stats(tbl, key, &n_mask_hit);
 }
 
-struct sw_flow *ovs_flow_tbl_lookup_exact(struct flow_table *tbl,
-					  struct sw_flow_match *match)
+struct ovs_flow *ovs_flow_tbl_lookup_exact(struct flow_table *tbl,
+					   struct ovs_flow_match *match)
 {
 	struct table_instance *ti = rcu_dereference_ovsl(tbl->ti);
-	struct sw_flow_mask *mask;
-	struct sw_flow *flow;
+	struct ovs_flow_mask *mask;
+	struct ovs_flow *flow;
 
 	/* Always called under ovs-mutex. */
 	list_for_each_entry(mask, &tbl->mask_list, list) {
-		flow = masked_flow_lookup(ti, match->key, mask);
+		flow = masked_flow_lookup(ti, match->key, &mask->mask);
 		if (flow && ovs_flow_cmp_unmasked_key(flow, match))  /* Found */
 			return flow;
 	}
@@ -474,7 +475,7 @@ struct sw_flow *ovs_flow_tbl_lookup_exact(struct flow_table *tbl,
 
 int ovs_flow_tbl_num_masks(const struct flow_table *table)
 {
-	struct sw_flow_mask *mask;
+	struct ovs_flow_mask *mask;
 	int num = 0;
 
 	list_for_each_entry(mask, &table->mask_list, list)
@@ -489,7 +490,7 @@ static struct table_instance *table_instance_expand(struct table_instance *ti)
 }
 
 /* Remove 'mask' from the mask list, if it is not needed any more. */
-static void flow_mask_remove(struct flow_table *tbl, struct sw_flow_mask *mask)
+static void flow_mask_remove(struct flow_table *tbl, struct ovs_flow_mask *mask)
 {
 	if (mask) {
 		/* ovs-lock is required to protect mask-refcount and
@@ -507,9 +508,12 @@ static void flow_mask_remove(struct flow_table *tbl, struct sw_flow_mask *mask)
 }
 
 /* Must be called with OVS mutex held. */
-void ovs_flow_tbl_remove(struct flow_table *table, struct sw_flow *flow)
+void ovs_flow_tbl_remove(struct flow_table *table, struct ovs_flow *flow)
 {
 	struct table_instance *ti = ovsl_dereference(table->ti);
+	struct ovs_flow_mask *mask = container_of(flow->flow.mask,
+						  struct ovs_flow_mask,
+						  mask);
 
 	BUG_ON(table->count == 0);
 	hlist_del_rcu(&flow->hash_node[ti->node_ver]);
@@ -518,12 +522,12 @@ void ovs_flow_tbl_remove(struct flow_table *table, struct sw_flow *flow)
 	/* RCU delete the mask. 'flow->mask' is not NULLed, as it should be
 	 * accessible as long as the RCU read lock is held.
 	 */
-	flow_mask_remove(table, flow->mask);
+	flow_mask_remove(table, mask);
 }
 
-static struct sw_flow_mask *mask_alloc(void)
+static struct ovs_flow_mask *mask_alloc(void)
 {
-	struct sw_flow_mask *mask;
+	struct ovs_flow_mask *mask;
 
 	mask = kmalloc(sizeof(*mask), GFP_KERNEL);
 	if (mask)
@@ -543,15 +547,16 @@ static bool mask_equal(const struct sw_flow_mask *a,
 		&& (memcmp(a_, b_, range_n_bytes(&a->range)) == 0);
 }
 
-static struct sw_flow_mask *flow_mask_find(const struct flow_table *tbl,
-					   const struct sw_flow_mask *mask)
+static struct ovs_flow_mask *flow_mask_find(const struct flow_table *tbl,
+					    const struct sw_flow_mask *mask)
 {
 	struct list_head *ml;
 
 	list_for_each(ml, &tbl->mask_list) {
-		struct sw_flow_mask *m;
-		m = container_of(ml, struct sw_flow_mask, list);
-		if (mask_equal(mask, m))
+		struct ovs_flow_mask *m;
+
+		m = container_of(ml, struct ovs_flow_mask, list);
+		if (mask_equal(mask, &m->mask))
 			return m;
 	}
 
@@ -559,30 +564,31 @@ static struct sw_flow_mask *flow_mask_find(const struct flow_table *tbl,
 }
 
 /* Add 'mask' into the mask list, if it is not already there. */
-static int flow_mask_insert(struct flow_table *tbl, struct sw_flow *flow,
+static int flow_mask_insert(struct flow_table *tbl, struct ovs_flow *flow,
 			    struct sw_flow_mask *new)
 {
-	struct sw_flow_mask *mask;
+	struct ovs_flow_mask *mask;
+
 	mask = flow_mask_find(tbl, new);
 	if (!mask) {
 		/* Allocate a new mask if none exsits. */
 		mask = mask_alloc();
 		if (!mask)
 			return -ENOMEM;
-		mask->key = new->key;
-		mask->range = new->range;
+		mask->mask.key = new->key;
+		mask->mask.range = new->range;
 		list_add_rcu(&mask->list, &tbl->mask_list);
 	} else {
 		BUG_ON(!mask->ref_count);
 		mask->ref_count++;
 	}
 
-	flow->mask = mask;
+	flow->flow.mask = &mask->mask;
 	return 0;
 }
 
 /* Must be called with OVS mutex held. */
-int ovs_flow_tbl_insert(struct flow_table *table, struct sw_flow *flow,
+int ovs_flow_tbl_insert(struct flow_table *table, struct ovs_flow *flow,
 			struct sw_flow_mask *mask)
 {
 	struct table_instance *new_ti = NULL;
@@ -593,8 +599,8 @@ int ovs_flow_tbl_insert(struct flow_table *table, struct sw_flow *flow,
 	if (err)
 		return err;
 
-	flow->hash = flow_hash(&flow->key, flow->mask->range.start,
-			flow->mask->range.end);
+	flow->hash = flow_hash(&flow->flow.key, flow->flow.mask->range.start,
+			       flow->flow.mask->range.end);
 	ti = ovsl_dereference(table->ti);
 	table_instance_insert(ti, flow);
 	table->count++;
@@ -620,7 +626,7 @@ int ovs_flow_init(void)
 	BUILD_BUG_ON(__alignof__(struct sw_flow_key) % __alignof__(long));
 	BUILD_BUG_ON(sizeof(struct sw_flow_key) % sizeof(long));
 
-	flow_cache = kmem_cache_create("sw_flow", sizeof(struct sw_flow)
+	flow_cache = kmem_cache_create("ovs_flow", sizeof(struct ovs_flow)
 				       + (num_possible_nodes()
 					  * sizeof(struct flow_stats *)),
 				       0, 0, NULL);
diff --git a/net/openvswitch/flow_table.h b/net/openvswitch/flow_table.h
index 5918bff..d57d6b5 100644
--- a/net/openvswitch/flow_table.h
+++ b/net/openvswitch/flow_table.h
@@ -57,29 +57,29 @@ extern struct kmem_cache *flow_stats_cache;
 int ovs_flow_init(void);
 void ovs_flow_exit(void);
 
-struct sw_flow *ovs_flow_alloc(void);
-void ovs_flow_free(struct sw_flow *, bool deferred);
+struct ovs_flow *ovs_flow_alloc(void);
+void ovs_flow_free(struct ovs_flow *, bool deferred);
 
 int ovs_flow_tbl_init(struct flow_table *);
 int ovs_flow_tbl_count(struct flow_table *table);
 void ovs_flow_tbl_destroy(struct flow_table *table, bool deferred);
 int ovs_flow_tbl_flush(struct flow_table *flow_table);
 
-int ovs_flow_tbl_insert(struct flow_table *table, struct sw_flow *flow,
+int ovs_flow_tbl_insert(struct flow_table *table, struct ovs_flow *flow,
 			struct sw_flow_mask *mask);
-void ovs_flow_tbl_remove(struct flow_table *table, struct sw_flow *flow);
+void ovs_flow_tbl_remove(struct flow_table *table, struct ovs_flow *flow);
 int  ovs_flow_tbl_num_masks(const struct flow_table *table);
-struct sw_flow *ovs_flow_tbl_dump_next(struct table_instance *table,
-				       u32 *bucket, u32 *idx);
-struct sw_flow *ovs_flow_tbl_lookup_stats(struct flow_table *,
-				    const struct sw_flow_key *,
-				    u32 *n_mask_hit);
-struct sw_flow *ovs_flow_tbl_lookup(struct flow_table *,
-				    const struct sw_flow_key *);
-struct sw_flow *ovs_flow_tbl_lookup_exact(struct flow_table *tbl,
-					  struct sw_flow_match *match);
-bool ovs_flow_cmp_unmasked_key(const struct sw_flow *flow,
-			       struct sw_flow_match *match);
+struct ovs_flow *ovs_flow_tbl_dump_next(struct table_instance *table,
+					u32 *bucket, u32 *idx);
+struct ovs_flow *ovs_flow_tbl_lookup_stats(struct flow_table *,
+					   const struct sw_flow_key *,
+					   u32 *n_mask_hit);
+struct ovs_flow *ovs_flow_tbl_lookup(struct flow_table *,
+				     const struct sw_flow_key *);
+struct ovs_flow *ovs_flow_tbl_lookup_exact(struct flow_table *tbl,
+					   struct ovs_flow_match *match);
+bool ovs_flow_cmp_unmasked_key(const struct ovs_flow *flow,
+			       struct ovs_flow_match *match);
 
 void ovs_flow_mask_key(struct sw_flow_key *dst, const struct sw_flow_key *src,
 		       const struct sw_flow_mask *mask);
diff --git a/net/openvswitch/vport-gre.c b/net/openvswitch/vport-gre.c
index f49148a..fda79eb 100644
--- a/net/openvswitch/vport-gre.c
+++ b/net/openvswitch/vport-gre.c
@@ -63,7 +63,7 @@ static __be16 filter_tnl_flags(__be16 flags)
 static struct sk_buff *__build_header(struct sk_buff *skb,
 				      int tunnel_hlen)
 {
-	const struct ovs_key_ipv4_tunnel *tun_key = OVS_CB(skb)->tun_key;
+	const struct sw_flow_key_ipv4_tunnel *tun_key = OVS_CB(skb)->tun_key;
 	struct tnl_ptk_info tpi;
 
 	skb = gre_handle_offloads(skb, !!(tun_key->tun_flags & TUNNEL_CSUM));
@@ -92,7 +92,7 @@ static __be64 key_to_tunnel_id(__be32 key, __be32 seq)
 static int gre_rcv(struct sk_buff *skb,
 		   const struct tnl_ptk_info *tpi)
 {
-	struct ovs_key_ipv4_tunnel tun_key;
+	struct sw_flow_key_ipv4_tunnel tun_key;
 	struct ovs_net *ovs_net;
 	struct vport *vport;
 	__be64 key;
diff --git a/net/openvswitch/vport-vxlan.c b/net/openvswitch/vport-vxlan.c
index d8b7e24..b7edf47 100644
--- a/net/openvswitch/vport-vxlan.c
+++ b/net/openvswitch/vport-vxlan.c
@@ -58,7 +58,7 @@ static inline struct vxlan_port *vxlan_vport(const struct vport *vport)
 /* Called with rcu_read_lock and BH disabled. */
 static void vxlan_rcv(struct vxlan_sock *vs, struct sk_buff *skb, __be32 vx_vni)
 {
-	struct ovs_key_ipv4_tunnel tun_key;
+	struct sw_flow_key_ipv4_tunnel tun_key;
 	struct vport *vport = vs->data;
 	struct iphdr *iph;
 	__be64 key;
diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
index 6d8f2ec..7df5234 100644
--- a/net/openvswitch/vport.c
+++ b/net/openvswitch/vport.c
@@ -438,7 +438,7 @@ u32 ovs_vport_find_upcall_portid(const struct vport *p, struct sk_buff *skb)
  * skb->data should point to the Ethernet header.
  */
 void ovs_vport_receive(struct vport *vport, struct sk_buff *skb,
-		       struct ovs_key_ipv4_tunnel *tun_key)
+		       struct sw_flow_key_ipv4_tunnel *tun_key)
 {
 	struct pcpu_sw_netstats *stats;
 
diff --git a/net/openvswitch/vport.h b/net/openvswitch/vport.h
index 35f89d8..8409e06 100644
--- a/net/openvswitch/vport.h
+++ b/net/openvswitch/vport.h
@@ -210,7 +210,7 @@ static inline struct vport *vport_from_priv(void *priv)
 }
 
 void ovs_vport_receive(struct vport *, struct sk_buff *,
-		       struct ovs_key_ipv4_tunnel *);
+		       struct sw_flow_key_ipv4_tunnel *);
 
 /* List of statically compiled vport implementations.  Don't forget to also
  * add yours to the list at the top of vport.c. */
-- 
1.9.3


* [patch net-next RFC 02/12] net: rename netdev_phys_port_id to more generic name
  2014-08-21 16:18 [patch net-next RFC 00/12] introduce rocker switch driver with openvswitch hardware accelerated datapath Jiri Pirko
@ 2014-08-21 16:18 ` Jiri Pirko
       [not found]   ` <1408637945-10390-3-git-send-email-jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
  2014-08-21 16:18 ` [patch net-next RFC 04/12] rtnl: expose physical switch id for particular device Jiri Pirko
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 87+ messages in thread
From: Jiri Pirko @ 2014-08-21 16:18 UTC (permalink / raw)
  To: netdev
  Cc: davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet, jhs, sfeldma, f.fainelli, roopa,
	linville, dev, jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a,
	buytenh, aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye

Rename it so the structure can be reused for identification of other "items" as well.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c |  2 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c      |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c   |  2 +-
 drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c |  2 +-
 include/linux/netdevice.h                        | 16 ++++++++--------
 net/core/dev.c                                   |  2 +-
 net/core/net-sysfs.c                             |  2 +-
 net/core/rtnetlink.c                             |  6 +++---
 8 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index c13364b..56ae8d0 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -12389,7 +12389,7 @@ static int bnx2x_validate_addr(struct net_device *dev)
 }
 
 static int bnx2x_get_phys_port_id(struct net_device *netdev,
-				  struct netdev_phys_port_id *ppid)
+				  struct netdev_phys_item_id *ppid)
 {
 	struct bnx2x *bp = netdev_priv(netdev);
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index eddec6b..58f2bb2 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -7219,7 +7219,7 @@ static void i40e_del_vxlan_port(struct net_device *netdev,
 
 #endif
 static int i40e_get_phys_port_id(struct net_device *netdev,
-				 struct netdev_phys_port_id *ppid)
+				 struct netdev_phys_item_id *ppid)
 {
 	struct i40e_netdev_priv *np = netdev_priv(netdev);
 	struct i40e_pf *pf = np->vsi->back;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index bb536aa..edf3040 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -2276,7 +2276,7 @@ static int mlx4_en_set_vf_link_state(struct net_device *dev, int vf, int link_st
 
 #define PORT_ID_BYTE_LEN 8
 static int mlx4_en_get_phys_port_id(struct net_device *dev,
-				    struct netdev_phys_port_id *ppid)
+				    struct netdev_phys_item_id *ppid)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	struct mlx4_dev *mdev = priv->mdev->dev;
diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
index cf08b2d..a9836b7 100644
--- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
+++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
@@ -454,7 +454,7 @@ static void qlcnic_82xx_cancel_idc_work(struct qlcnic_adapter *adapter)
 }
 
 static int qlcnic_get_phys_port_id(struct net_device *netdev,
-				   struct netdev_phys_port_id *ppid)
+				   struct netdev_phys_item_id *ppid)
 {
 	struct qlcnic_adapter *adapter = netdev_priv(netdev);
 	struct qlcnic_hardware_context *ahw = adapter->ahw;
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3837739..39294b9 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -739,13 +739,13 @@ struct netdev_fcoe_hbainfo {
 };
 #endif
 
-#define MAX_PHYS_PORT_ID_LEN 32
+#define MAX_PHYS_ITEM_ID_LEN 32
 
-/* This structure holds a unique identifier to identify the
- * physical port used by a netdevice.
+/* This structure holds a unique identifier to identify some
+ * physical item (port for example) used by a netdevice.
  */
-struct netdev_phys_port_id {
-	unsigned char id[MAX_PHYS_PORT_ID_LEN];
+struct netdev_phys_item_id {
+	unsigned char id[MAX_PHYS_ITEM_ID_LEN];
 	unsigned char id_len;
 };
 
@@ -961,7 +961,7 @@ typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
  *	USB_CDC_NOTIFY_NETWORK_CONNECTION) should NOT implement this function.
  *
  * int (*ndo_get_phys_port_id)(struct net_device *dev,
- *			       struct netdev_phys_port_id *ppid);
+ *			       struct netdev_phys_item_id *ppid);
  *	Called to get ID of physical port of this device. If driver does
  *	not implement this, it is assumed that the hw is not able to have
  *	multiple net devices on single physical port.
@@ -1129,7 +1129,7 @@ struct net_device_ops {
 	int			(*ndo_change_carrier)(struct net_device *dev,
 						      bool new_carrier);
 	int			(*ndo_get_phys_port_id)(struct net_device *dev,
-							struct netdev_phys_port_id *ppid);
+							struct netdev_phys_item_id *ppid);
 	void			(*ndo_add_vxlan_port)(struct  net_device *dev,
 						      sa_family_t sa_family,
 						      __be16 port);
@@ -2753,7 +2753,7 @@ void dev_set_group(struct net_device *, int);
 int dev_set_mac_address(struct net_device *, struct sockaddr *);
 int dev_change_carrier(struct net_device *, bool new_carrier);
 int dev_get_phys_port_id(struct net_device *dev,
-			 struct netdev_phys_port_id *ppid);
+			 struct netdev_phys_item_id *ppid);
 int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 			struct netdev_queue *txq);
 int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
diff --git a/net/core/dev.c b/net/core/dev.c
index b65a505..217247f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5651,7 +5651,7 @@ EXPORT_SYMBOL(dev_change_carrier);
  *	Get device physical port ID
  */
 int dev_get_phys_port_id(struct net_device *dev,
-			 struct netdev_phys_port_id *ppid)
+			 struct netdev_phys_item_id *ppid)
 {
 	const struct net_device_ops *ops = dev->netdev_ops;
 
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 9dd0669..55dc4da 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -387,7 +387,7 @@ static ssize_t phys_port_id_show(struct device *dev,
 		return restart_syscall();
 
 	if (dev_isalive(netdev)) {
-		struct netdev_phys_port_id ppid;
+		struct netdev_phys_item_id ppid;
 
 		ret = dev_get_phys_port_id(netdev, &ppid);
 		if (!ret)
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index f0493e3..80c135a 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -868,7 +868,7 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + rtnl_port_size(dev, ext_filter_mask) /* IFLA_VF_PORTS + IFLA_PORT_SELF */
 	       + rtnl_link_get_size(dev) /* IFLA_LINKINFO */
 	       + rtnl_link_get_af_size(dev) /* IFLA_AF_SPEC */
-	       + nla_total_size(MAX_PHYS_PORT_ID_LEN); /* IFLA_PHYS_PORT_ID */
+	       + nla_total_size(MAX_PHYS_ITEM_ID_LEN); /* IFLA_PHYS_PORT_ID */
 }
 
 static int rtnl_vf_ports_fill(struct sk_buff *skb, struct net_device *dev)
@@ -952,7 +952,7 @@ static int rtnl_port_fill(struct sk_buff *skb, struct net_device *dev,
 static int rtnl_phys_port_id_fill(struct sk_buff *skb, struct net_device *dev)
 {
 	int err;
-	struct netdev_phys_port_id ppid;
+	struct netdev_phys_item_id ppid;
 
 	err = dev_get_phys_port_id(dev, &ppid);
 	if (err) {
@@ -1196,7 +1196,7 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
 	[IFLA_PROMISCUITY]	= { .type = NLA_U32 },
 	[IFLA_NUM_TX_QUEUES]	= { .type = NLA_U32 },
 	[IFLA_NUM_RX_QUEUES]	= { .type = NLA_U32 },
-	[IFLA_PHYS_PORT_ID]	= { .type = NLA_BINARY, .len = MAX_PHYS_PORT_ID_LEN },
+	[IFLA_PHYS_PORT_ID]	= { .type = NLA_BINARY, .len = MAX_PHYS_ITEM_ID_LEN },
 	[IFLA_CARRIER_CHANGES]	= { .type = NLA_U32 },  /* ignored */
 };
 
-- 
1.9.3


* [patch net-next RFC 03/12] net: introduce generic switch devices support
       [not found] ` <1408637945-10390-1-git-send-email-jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
  2014-08-21 16:18   ` [patch net-next RFC 01/12] openvswitch: split flow structures into ovs specific and generic ones Jiri Pirko
@ 2014-08-21 16:18   ` Jiri Pirko
  2014-08-21 16:41     ` Ben Hutchings
                       ` (3 more replies)
  2014-08-21 16:18   ` [patch net-next RFC 06/12] net: introduce dummy switch Jiri Pirko
  2 siblings, 4 replies; 87+ messages in thread
From: Jiri Pirko @ 2014-08-21 16:18 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, andy-QlMahl40kYEqcZcGjlUOXw,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, ronye-VPRAkNaXOzVWk0Htik3J/w,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, ben-/+tVBieCtBitmTQ+vhA3Yw,
	buytenh-OLH4Qvv75CYX/NnBR394Jw,
	roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
	jhs-jkUAjuhPggJWk0Htik3J/w, aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, nhorman-2XuSBdqkA4R54TAoqtyWWQ,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	dborkman-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q

The goal of this patch is to make it possible to support various switch
chips. Drivers should implement the relevant ndos to do so. A couple of
ndos are defined so far:
- one for getting the physical switch id,
- others for working with flows.

Note that the user can use any port netdevice of the switch to access it.

Signed-off-by: Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
---
 Documentation/networking/switchdev.txt |  53 +++++++++++
 include/linux/netdevice.h              |  28 ++++++
 include/linux/switchdev.h              |  44 +++++++++
 net/Kconfig                            |   6 ++
 net/core/Makefile                      |   1 +
 net/core/switchdev.c                   | 163 +++++++++++++++++++++++++++++++++
 6 files changed, 295 insertions(+)
 create mode 100644 Documentation/networking/switchdev.txt
 create mode 100644 include/linux/switchdev.h
 create mode 100644 net/core/switchdev.c

diff --git a/Documentation/networking/switchdev.txt b/Documentation/networking/switchdev.txt
new file mode 100644
index 0000000..435746a
--- /dev/null
+++ b/Documentation/networking/switchdev.txt
@@ -0,0 +1,53 @@
+Switch device drivers HOWTO
+===========================
+
+First, let's describe the topology a bit. Imagine the following example:
+
+       +----------------------------+    +---------------+
+       |     SOME switch chip       |    |      CPU      |
+       +----------------------------+    +---------------+
+       port1 port2 port3 port4 MNGMNT    |     PCI-E     |
+         |     |     |     |     |       +---------------+
+        PHY   PHY    |     |     |         |  NIC0 NIC1
+                     |     |     |         |   |    |
+                     |     |     +- PCI-E -+   |    |
+                     |     +------- MII -------+    |
+                     +------------- MII ------------+
+
+In this example, there are two independent lines between the switch silicon
+and the CPU. The NIC0 and NIC1 drivers are not aware of the switch's presence;
+they are separate from the switch driver. The SOME switch chip is managed by a
+driver via the PCI-E device MNGMNT. Note that the MNGMNT device, NIC0 and NIC1
+may be connected via some other type of bus.
+
+Now, here is how the previous example is represented in the kernel:
+
+       +----------------------------+    +---------------+
+       |     SOME switch chip       |    |      CPU      |
+       +----------------------------+    +---------------+
+       sw0p0 sw0p1 sw0p2 sw0p3 MNGMNT    |     PCI-E     |
+         |     |     |     |     |       +---------------+
+        PHY   PHY    |     |     |         |  eth0 eth1
+                     |     |     |         |   |    |
+                     |     |     +- PCI-E -+   |    |
+                     |     +------- MII -------+    |
+                     +------------- MII ------------+
+
+Let's call the example switch driver for the SOME switch chip "SOMEswitch".
+This driver takes care of the PCI-E device MNGMNT. A netdevice instance sw0pX
+is created for each port of the switch. These netdevices are instances
+of the "SOMEswitch" driver. The sw0pX netdevices serve as a "representation"
+of the switch chip. eth0 and eth1 are instances of some other existing driver.
+
+The only difference between a switch-port netdevice and an ordinary netdevice
+is that it implements a couple more NDOs:
+
+	ndo_swdev_get_id - Returns the same ID for any two port netdevices of
+			   the same physical switch chip. This is mandatory for
+			   all switch drivers to implement and lets the caller
+			   recognize a port netdevice.
+	ndo_swdev_* - Functions that manipulate the switch chip itself. They
+		      are not port-specific. The caller may use an arbitrary
+		      port netdevice of the same switch and it will make no
+		      difference.
+	ndo_swportdev_* - Functions that serve port-specific manipulation.
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 39294b9..8b5d14c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -49,6 +49,8 @@
 
 #include <linux/netdev_features.h>
 #include <linux/neighbour.h>
+#include <linux/sw_flow.h>
+
 #include <uapi/linux/netdevice.h>
 
 struct netpoll_info;
@@ -997,6 +999,24 @@ typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
  *	Callback to use for xmit over the accelerated station. This
  *	is used in place of ndo_start_xmit on accelerated net
  *	devices.
+ *
+ * int (*ndo_swdev_get_id)(struct net_device *dev,
+ *			   struct netdev_phys_item_id *psid);
+ *	Called to get an ID of the switch chip this port is part of.
+ *	If driver implements this, it indicates that it represents a port
+ *	of a switch chip.
+ *
+ * int (*ndo_swdev_flow_insert)(struct net_device *dev,
+ *				const struct sw_flow *flow);
+ *	Called to insert a flow into switch device. If driver does
+ *	not implement this, it is assumed that the hw does not have
+ *	a capability to work with flows.
+ *
+ * int (*ndo_swdev_flow_remove)(struct net_device *dev,
+ *				const struct sw_flow *flow);
+ *	Called to remove a flow from switch device. If driver does
+ *	not implement this, it is assumed that the hw does not have
+ *	a capability to work with flows.
  */
 struct net_device_ops {
 	int			(*ndo_init)(struct net_device *dev);
@@ -1146,6 +1166,14 @@ struct net_device_ops {
 							struct net_device *dev,
 							void *priv);
 	int			(*ndo_get_lock_subclass)(struct net_device *dev);
+#ifdef CONFIG_NET_SWITCHDEV
+	int			(*ndo_swdev_get_id)(struct net_device *dev,
+						    struct netdev_phys_item_id *psid);
+	int			(*ndo_swdev_flow_insert)(struct net_device *dev,
+							 const struct sw_flow *flow);
+	int			(*ndo_swdev_flow_remove)(struct net_device *dev,
+							 const struct sw_flow *flow);
+#endif
 };
 
 /**
diff --git a/include/linux/switchdev.h b/include/linux/switchdev.h
new file mode 100644
index 0000000..ba77a68
--- /dev/null
+++ b/include/linux/switchdev.h
@@ -0,0 +1,44 @@
+/*
+ * include/linux/switchdev.h - Switch device API
+ * Copyright (c) 2014 Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+#ifndef _LINUX_SWITCHDEV_H_
+#define _LINUX_SWITCHDEV_H_
+
+#include <linux/netdevice.h>
+#include <linux/sw_flow.h>
+
+#ifdef CONFIG_NET_SWITCHDEV
+
+int swdev_get_id(struct net_device *dev, struct netdev_phys_item_id *psid);
+int swdev_flow_insert(struct net_device *dev, const struct sw_flow *flow);
+int swdev_flow_remove(struct net_device *dev, const struct sw_flow *flow);
+
+#else
+
+static inline int swdev_get_id(struct net_device *dev,
+			       struct netdev_phys_item_id *psid)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int swdev_flow_insert(struct net_device *dev,
+				    const struct sw_flow *flow)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int swdev_flow_remove(struct net_device *dev,
+				    const struct sw_flow *flow)
+{
+	return -EOPNOTSUPP;
+}
+
+#endif
+
+#endif /* _LINUX_SWITCHDEV_H_ */
diff --git a/net/Kconfig b/net/Kconfig
index 4051fdf..40f729f 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -290,6 +290,12 @@ config NET_FLOW_LIMIT
 	  with many clients some protection against DoS by a single (spoofed)
 	  flow that greatly exceeds average workload.
 
+config NET_SWITCHDEV
+	boolean "Switch device support"
+	depends on INET
+	---help---
+	  This option provides support for hardware switch chips.
+
 menu "Network testing"
 
 config NET_PKTGEN
diff --git a/net/core/Makefile b/net/core/Makefile
index 71093d9..8583c38 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -24,3 +24,4 @@ obj-$(CONFIG_NETWORK_PHY_TIMESTAMPING) += timestamping.o
 obj-$(CONFIG_NET_PTP_CLASSIFY) += ptp_classifier.o
 obj-$(CONFIG_CGROUP_NET_PRIO) += netprio_cgroup.o
 obj-$(CONFIG_CGROUP_NET_CLASSID) += netclassid_cgroup.o
+obj-$(CONFIG_NET_SWITCHDEV) += switchdev.o
diff --git a/net/core/switchdev.c b/net/core/switchdev.c
new file mode 100644
index 0000000..4fad097
--- /dev/null
+++ b/net/core/switchdev.c
@@ -0,0 +1,163 @@
+/*
+ * net/core/switchdev.c - Switch device API
+ * Copyright (c) 2014 Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/init.h>
+#include <linux/netdevice.h>
+#include <linux/switchdev.h>
+
+/**
+ *	swdev_get_id - Get ID of a switch
+ *	@dev: port device
+ *	@psid: switch ID
+ *
+ *	Get ID of a switch this port is part of.
+ */
+int swdev_get_id(struct net_device *dev, struct netdev_phys_item_id *psid)
+{
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	if (!ops->ndo_swdev_get_id)
+		return -EOPNOTSUPP;
+	return ops->ndo_swdev_get_id(dev, psid);
+}
+EXPORT_SYMBOL(swdev_get_id);
+
+static void print_flow_key_tun(const char *prefix,
+			       const struct sw_flow_key *key)
+{
+	pr_debug("%s tun  { id %08llx, s %pI4, d %pI4, f %02x, tos %x, ttl %x }\n",
+		 prefix,
+		 be64_to_cpu(key->tun_key.tun_id), &key->tun_key.ipv4_src,
+		 &key->tun_key.ipv4_dst, ntohs(key->tun_key.tun_flags),
+		 key->tun_key.ipv4_tos, key->tun_key.ipv4_ttl);
+}
+
+static void print_flow_key_phy(const char *prefix,
+			       const struct sw_flow_key *key)
+{
+	pr_debug("%s phy  { prio %04x, mark %04x, in_port %02x }\n",
+		 prefix,
+		 key->phy.priority, key->phy.skb_mark, key->phy.in_port);
+}
+
+static void print_flow_key_eth(const char *prefix,
+			       const struct sw_flow_key *key)
+{
+	pr_debug("%s eth  { sm %pM, dm %pM, tci %04x, type %04x }\n",
+		 prefix,
+		 key->eth.src, key->eth.dst, ntohs(key->eth.tci),
+		 ntohs(key->eth.type));
+}
+
+static void print_flow_key_ip(const char *prefix,
+			      const struct sw_flow_key *key)
+{
+	pr_debug("%s ip   { proto %02x, tos %02x, ttl %02x }\n",
+		 prefix,
+		 key->ip.proto, key->ip.tos, key->ip.ttl);
+}
+
+static void print_flow_key_ipv4(const char *prefix,
+				const struct sw_flow_key *key)
+{
+	pr_debug("%s ipv4 { si %pI4, di %pI4, sm %pM, dm %pM }\n",
+		 prefix,
+		 &key->ipv4.addr.src, &key->ipv4.addr.dst,
+		 key->ipv4.arp.sha, key->ipv4.arp.tha);
+}
+
+static void print_flow_actions(struct sw_flow_actions *actions)
+{
+	int i;
+
+	pr_debug("  actions:\n");
+	if (!actions)
+		return;
+	for (i = 0; i < actions->count; i++) {
+		struct sw_flow_action *action = &actions->actions[i];
+
+		switch (action->type) {
+		case SW_FLOW_ACTION_TYPE_OUTPUT:
+			pr_debug("    output    { dev %s }\n",
+				 action->output_dev->name);
+			break;
+		case SW_FLOW_ACTION_TYPE_VLAN_PUSH:
+			pr_debug("    vlan push { proto %04x, tci %04x }\n",
+				 ntohs(action->vlan.vlan_proto),
+				 ntohs(action->vlan.vlan_tci));
+			break;
+		case SW_FLOW_ACTION_TYPE_VLAN_POP:
+			pr_debug("    vlan pop\n");
+			break;
+		}
+	}
+}
+
+#define PREFIX_NONE "      "
+#define PREFIX_MASK "  mask"
+
+static void print_flow(const struct sw_flow *flow, struct net_device *dev,
+		       const char *comment)
+{
+	pr_debug("%s flow %s (%x-%x):\n", dev->name, comment,
+		 flow->mask->range.start, flow->mask->range.end);
+	print_flow_key_tun(PREFIX_NONE, &flow->key);
+	print_flow_key_tun(PREFIX_MASK, &flow->mask->key);
+	print_flow_key_phy(PREFIX_NONE, &flow->key);
+	print_flow_key_phy(PREFIX_MASK, &flow->mask->key);
+	print_flow_key_eth(PREFIX_NONE, &flow->key);
+	print_flow_key_eth(PREFIX_MASK, &flow->mask->key);
+	print_flow_key_ip(PREFIX_NONE, &flow->key);
+	print_flow_key_ip(PREFIX_MASK, &flow->mask->key);
+	print_flow_key_ipv4(PREFIX_NONE, &flow->key);
+	print_flow_key_ipv4(PREFIX_MASK, &flow->mask->key);
+	print_flow_actions(flow->actions);
+}
+
+/**
+ *	swdev_flow_insert - Insert a flow into switch
+ *	@dev: port device
+ *	@flow: flow descriptor
+ *
+ *	Insert a flow into switch this port is part of.
+ */
+int swdev_flow_insert(struct net_device *dev, const struct sw_flow *flow)
+{
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	print_flow(flow, dev, "insert");
+	if (!ops->ndo_swdev_flow_insert)
+		return -EOPNOTSUPP;
+	WARN_ON(!ops->ndo_swdev_get_id);
+	BUG_ON(!flow->actions);
+	return ops->ndo_swdev_flow_insert(dev, flow);
+}
+EXPORT_SYMBOL(swdev_flow_insert);
+
+/**
+ *	swdev_flow_remove - Remove a flow from switch
+ *	@dev: port device
+ *	@flow: flow descriptor
+ *
+ *	Remove a flow from switch this port is part of.
+ */
+int swdev_flow_remove(struct net_device *dev, const struct sw_flow *flow)
+{
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	print_flow(flow, dev, "remove");
+	if (!ops->ndo_swdev_flow_remove)
+		return -EOPNOTSUPP;
+	WARN_ON(!ops->ndo_swdev_get_id);
+	return ops->ndo_swdev_flow_remove(dev, flow);
+}
+EXPORT_SYMBOL(swdev_flow_remove);
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [patch net-next RFC 04/12] rtnl: expose physical switch id for particular device
  2014-08-21 16:18 [patch net-next RFC 00/12] introduce rocker switch driver with openvswitch hardware accelerated datapath Jiri Pirko
  2014-08-21 16:18 ` [patch net-next RFC 02/12] net: rename netdev_phys_port_id to more generic name Jiri Pirko
@ 2014-08-21 16:18 ` Jiri Pirko
       [not found]   ` <1408637945-10390-5-git-send-email-jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
  2014-08-21 16:18 ` [patch net-next RFC 05/12] net-sysfs: " Jiri Pirko
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 87+ messages in thread
From: Jiri Pirko @ 2014-08-21 16:18 UTC (permalink / raw)
  To: netdev
  Cc: davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet, jhs, sfeldma, f.fainelli, roopa,
	linville, dev, jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a,
	buytenh, aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye

If a netdevice represents a port in a switch, it will expose the
IFLA_PHYS_SWITCH_ID value via rtnl. Two netdevices with the same value
belong to the same physical switch.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 include/uapi/linux/if_link.h |  1 +
 net/core/rtnetlink.c         | 26 +++++++++++++++++++++++++-
 2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index ff95760..fe6c4c5 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -145,6 +145,7 @@ enum {
 	IFLA_CARRIER,
 	IFLA_PHYS_PORT_ID,
 	IFLA_CARRIER_CHANGES,
+	IFLA_PHYS_SWITCH_ID,
 	__IFLA_MAX
 };
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 80c135a..2c08fe4 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -43,6 +43,7 @@
 
 #include <linux/inet.h>
 #include <linux/netdevice.h>
+#include <linux/switchdev.h>
 #include <net/ip.h>
 #include <net/protocol.h>
 #include <net/arp.h>
@@ -868,7 +869,8 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + rtnl_port_size(dev, ext_filter_mask) /* IFLA_VF_PORTS + IFLA_PORT_SELF */
 	       + rtnl_link_get_size(dev) /* IFLA_LINKINFO */
 	       + rtnl_link_get_af_size(dev) /* IFLA_AF_SPEC */
-	       + nla_total_size(MAX_PHYS_ITEM_ID_LEN); /* IFLA_PHYS_PORT_ID */
+	       + nla_total_size(MAX_PHYS_ITEM_ID_LEN) /* IFLA_PHYS_PORT_ID */
+	       + nla_total_size(MAX_PHYS_ITEM_ID_LEN); /* IFLA_PHYS_SWITCH_ID */
 }
 
 static int rtnl_vf_ports_fill(struct sk_buff *skb, struct net_device *dev)
@@ -967,6 +969,24 @@ static int rtnl_phys_port_id_fill(struct sk_buff *skb, struct net_device *dev)
 	return 0;
 }
 
+static int rtnl_phys_switch_id_fill(struct sk_buff *skb, struct net_device *dev)
+{
+	int err;
+	struct netdev_phys_item_id psid;
+
+	err = swdev_get_id(dev, &psid);
+	if (err) {
+		if (err == -EOPNOTSUPP)
+			return 0;
+		return err;
+	}
+
+	if (nla_put(skb, IFLA_PHYS_SWITCH_ID, psid.id_len, psid.id))
+		return -EMSGSIZE;
+
+	return 0;
+}
+
 static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 			    int type, u32 pid, u32 seq, u32 change,
 			    unsigned int flags, u32 ext_filter_mask)
@@ -1039,6 +1059,9 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 	if (rtnl_phys_port_id_fill(skb, dev))
 		goto nla_put_failure;
 
+	if (rtnl_phys_switch_id_fill(skb, dev))
+		goto nla_put_failure;
+
 	attr = nla_reserve(skb, IFLA_STATS,
 			sizeof(struct rtnl_link_stats));
 	if (attr == NULL)
@@ -1198,6 +1221,7 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
 	[IFLA_NUM_RX_QUEUES]	= { .type = NLA_U32 },
 	[IFLA_PHYS_PORT_ID]	= { .type = NLA_BINARY, .len = MAX_PHYS_ITEM_ID_LEN },
 	[IFLA_CARRIER_CHANGES]	= { .type = NLA_U32 },  /* ignored */
+	[IFLA_PHYS_SWITCH_ID]	= { .type = NLA_BINARY, .len = MAX_PHYS_ITEM_ID_LEN },
 };
 
 static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
-- 
1.9.3


* [patch net-next RFC 05/12] net-sysfs: expose physical switch id for particular device
  2014-08-21 16:18 [patch net-next RFC 00/12] introduce rocker switch driver with openvswitch hardware accelerated datapath Jiri Pirko
  2014-08-21 16:18 ` [patch net-next RFC 02/12] net: rename netdev_phys_port_id to more generic name Jiri Pirko
  2014-08-21 16:18 ` [patch net-next RFC 04/12] rtnl: expose physical switch id for particular device Jiri Pirko
@ 2014-08-21 16:18 ` Jiri Pirko
       [not found] ` <1408637945-10390-1-git-send-email-jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 87+ messages in thread
From: Jiri Pirko @ 2014-08-21 16:18 UTC (permalink / raw)
  To: netdev
  Cc: davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet, jhs, sfeldma, f.fainelli, roopa,
	linville, dev, jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a,
	buytenh, aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 net/core/net-sysfs.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 55dc4da..69e3d64 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -12,6 +12,7 @@
 #include <linux/capability.h>
 #include <linux/kernel.h>
 #include <linux/netdevice.h>
+#include <linux/switchdev.h>
 #include <linux/if_arp.h>
 #include <linux/slab.h>
 #include <linux/nsproxy.h>
@@ -399,6 +400,28 @@ static ssize_t phys_port_id_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(phys_port_id);
 
+static ssize_t phys_switch_id_show(struct device *dev,
+				   struct device_attribute *attr, char *buf)
+{
+	struct net_device *netdev = to_net_dev(dev);
+	ssize_t ret = -EINVAL;
+
+	if (!rtnl_trylock())
+		return restart_syscall();
+
+	if (dev_isalive(netdev)) {
+		struct netdev_phys_item_id ppid;
+
+		ret = swdev_get_id(netdev, &ppid);
+		if (!ret)
+			ret = sprintf(buf, "%*phN\n", ppid.id_len, ppid.id);
+	}
+	rtnl_unlock();
+
+	return ret;
+}
+static DEVICE_ATTR_RO(phys_switch_id);
+
 static struct attribute *net_class_attrs[] = {
 	&dev_attr_netdev_group.attr,
 	&dev_attr_type.attr,
@@ -423,6 +446,7 @@ static struct attribute *net_class_attrs[] = {
 	&dev_attr_flags.attr,
 	&dev_attr_tx_queue_len.attr,
 	&dev_attr_phys_port_id.attr,
+	&dev_attr_phys_switch_id.attr,
 	NULL,
 };
 ATTRIBUTE_GROUPS(net_class);
-- 
1.9.3


* [patch net-next RFC 06/12] net: introduce dummy switch
       [not found] ` <1408637945-10390-1-git-send-email-jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
  2014-08-21 16:18   ` [patch net-next RFC 01/12] openvswitch: split flow structures into ovs specific and generic ones Jiri Pirko
  2014-08-21 16:18   ` [patch net-next RFC 03/12] net: introduce generic switch devices support Jiri Pirko
@ 2014-08-21 16:18   ` Jiri Pirko
       [not found]     ` <1408637945-10390-7-git-send-email-jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
  2 siblings, 1 reply; 87+ messages in thread
From: Jiri Pirko @ 2014-08-21 16:18 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, andy-QlMahl40kYEqcZcGjlUOXw,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, ronye-VPRAkNaXOzVWk0Htik3J/w,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, ben-/+tVBieCtBitmTQ+vhA3Yw,
	buytenh-OLH4Qvv75CYX/NnBR394Jw,
	roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
	jhs-jkUAjuhPggJWk0Htik3J/w, aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, nhorman-2XuSBdqkA4R54TAoqtyWWQ,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	dborkman-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q

Dummy switch implementation using the switchdev interface.

Signed-off-by: Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
---
 drivers/net/Kconfig          |   7 +++
 drivers/net/Makefile         |   1 +
 drivers/net/dummyswitch.c    | 131 +++++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/if_link.h |   9 +++
 4 files changed, 148 insertions(+)
 create mode 100644 drivers/net/dummyswitch.c

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index c6f6f69..7822c74 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -71,6 +71,13 @@ config DUMMY
 	  To compile this driver as a module, choose M here: the module
 	  will be called dummy.
 
+config NET_DUMMY_SWITCH
+	tristate "Dummy switch net driver support"
+	depends on NET_SWITCHDEV
+	---help---
+	  To compile this driver as a module, choose M here: the module
+	  will be called dummyswitch.
+
 config EQUALIZER
 	tristate "EQL (serial line load balancing) support"
 	---help---
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index 61aefdd..3c835ba 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -7,6 +7,7 @@
 #
 obj-$(CONFIG_BONDING) += bonding/
 obj-$(CONFIG_DUMMY) += dummy.o
+obj-$(CONFIG_NET_DUMMY_SWITCH) += dummyswitch.o
 obj-$(CONFIG_EQUALIZER) += eql.o
 obj-$(CONFIG_IFB) += ifb.o
 obj-$(CONFIG_MACVLAN) += macvlan.o
diff --git a/drivers/net/dummyswitch.c b/drivers/net/dummyswitch.c
new file mode 100644
index 0000000..9d80c14
--- /dev/null
+++ b/drivers/net/dummyswitch.c
@@ -0,0 +1,131 @@
+/*
+ * drivers/net/dummyswitch.c - Dummy switch device
+ * Copyright (c) 2014 Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/etherdevice.h>
+#include <linux/switchdev.h>
+
+#include <net/rtnetlink.h>
+
+struct dummyswport_priv {
+	struct netdev_phys_item_id psid;
+};
+
+static netdev_tx_t dummyswport_start_xmit(struct sk_buff *skb,
+					  struct net_device *dev)
+{
+	dev_kfree_skb(skb);
+	return NETDEV_TX_OK;
+}
+
+static int dummyswport_swdev_get_id(struct net_device *dev,
+				    struct netdev_phys_item_id *psid)
+{
+	struct dummyswport_priv *dsp = netdev_priv(dev);
+
+	memcpy(psid, &dsp->psid, sizeof(*psid));
+	return 0;
+}
+
+static int dummyswport_change_carrier(struct net_device *dev, bool new_carrier)
+{
+	if (new_carrier)
+		netif_carrier_on(dev);
+	else
+		netif_carrier_off(dev);
+	return 0;
+}
+
+static const struct net_device_ops dummyswport_netdev_ops = {
+	.ndo_start_xmit		= dummyswport_start_xmit,
+	.ndo_swdev_get_id	= dummyswport_swdev_get_id,
+	.ndo_change_carrier	= dummyswport_change_carrier,
+};
+
+static void dummyswport_setup(struct net_device *dev)
+{
+	ether_setup(dev);
+
+	/* Initialize the device structure. */
+	dev->netdev_ops = &dummyswport_netdev_ops;
+	dev->destructor = free_netdev;
+
+	/* Fill in device structure with ethernet-generic values. */
+	dev->tx_queue_len = 0;
+	dev->flags |= IFF_NOARP;
+	dev->flags &= ~IFF_MULTICAST;
+	dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
+	dev->features	|= NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_TSO;
+	dev->features	|= NETIF_F_HW_CSUM | NETIF_F_HIGHDMA | NETIF_F_LLTX;
+	eth_hw_addr_random(dev);
+}
+
+static int dummyswport_validate(struct nlattr *tb[], struct nlattr *data[])
+{
+	if (tb[IFLA_ADDRESS])
+		return -EINVAL;
+	if (!data || !data[IFLA_DUMMYSWPORT_PHYS_SWITCH_ID])
+		return -EINVAL;
+	return 0;
+}
+
+static int dummyswport_newlink(struct net *src_net, struct net_device *dev,
+			       struct nlattr *tb[], struct nlattr *data[])
+{
+	struct dummyswport_priv *dsp = netdev_priv(dev);
+	int err;
+
+	dsp->psid.id_len = nla_len(data[IFLA_DUMMYSWPORT_PHYS_SWITCH_ID]);
+	memcpy(dsp->psid.id, nla_data(data[IFLA_DUMMYSWPORT_PHYS_SWITCH_ID]),
+	       dsp->psid.id_len);
+
+	err = register_netdevice(dev);
+	if (err)
+		return err;
+
+	netif_carrier_on(dev);
+
+	return 0;
+}
+
+static const struct nla_policy dummyswport_policy[IFLA_DUMMYSWPORT_MAX + 1] = {
+	[IFLA_DUMMYSWPORT_PHYS_SWITCH_ID] = { .type = NLA_BINARY,
+					      .len = MAX_PHYS_ITEM_ID_LEN },
+};
+
+static struct rtnl_link_ops dummyswport_link_ops __read_mostly = {
+	.kind		= "dummyswport",
+	.priv_size	= sizeof(struct dummyswport_priv),
+	.setup		= dummyswport_setup,
+	.validate	= dummyswport_validate,
+	.newlink	= dummyswport_newlink,
+	.policy		= dummyswport_policy,
+	.maxtype	= IFLA_DUMMYSWPORT_MAX,
+};
+
+static int __init dummysw_module_init(void)
+{
+	return rtnl_link_register(&dummyswport_link_ops);
+}
+
+static void __exit dummysw_module_exit(void)
+{
+	rtnl_link_unregister(&dummyswport_link_ops);
+}
+
+module_init(dummysw_module_init);
+module_exit(dummysw_module_exit);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>");
+MODULE_DESCRIPTION("Dummy switch device");
+MODULE_ALIAS_RTNL_LINK("dummyswport");
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index fe6c4c5..e7e122b 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -562,4 +562,13 @@ enum {
 
 #define IFLA_HSR_MAX (__IFLA_HSR_MAX - 1)
 
+/* DUMMYSWPORT section */
+enum {
+	IFLA_DUMMYSWPORT_UNSPEC,
+	IFLA_DUMMYSWPORT_PHYS_SWITCH_ID,
+	__IFLA_DUMMYSWPORT_MAX,
+};
+	__IFLA_DUMMYSWPORT_MAX,
+};
+
+#define IFLA_DUMMYSWPORT_MAX (__IFLA_DUMMYSWPORT_MAX - 1)
+
 #endif /* _UAPI_LINUX_IF_LINK_H */
-- 
1.9.3


* [patch net-next RFC 07/12] dsa: implement ndo_swdev_get_id
  2014-08-21 16:18 [patch net-next RFC 00/12] introduce rocker switch driver with openvswitch hardware accelerated datapath Jiri Pirko
                   ` (3 preceding siblings ...)
       [not found] ` <1408637945-10390-1-git-send-email-jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
@ 2014-08-21 16:19 ` Jiri Pirko
  2014-08-21 16:38   ` Ben Hutchings
  2014-08-21 16:56   ` Florian Fainelli
  2014-08-21 16:19 ` [patch net-next RFC 08/12] net: introduce netdev_phys_item_ids_match helper Jiri Pirko
                   ` (4 subsequent siblings)
  9 siblings, 2 replies; 87+ messages in thread
From: Jiri Pirko @ 2014-08-21 16:19 UTC (permalink / raw)
  To: netdev
  Cc: davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet, jhs, sfeldma, f.fainelli, roopa,
	linville, dev, jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a,
	buytenh, aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 net/dsa/Kconfig |  2 +-
 net/dsa/slave.c | 16 ++++++++++++++++
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/net/dsa/Kconfig b/net/dsa/Kconfig
index f5eede1..66c445a 100644
--- a/net/dsa/Kconfig
+++ b/net/dsa/Kconfig
@@ -1,6 +1,6 @@
 config HAVE_NET_DSA
 	def_bool y
-	depends on NETDEVICES && !S390
+	depends on NETDEVICES && NET_SWITCHDEV && !S390
 
 # Drivers must select NET_DSA and the appropriate tagging format
 
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 45a1e34..e069ba3 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -171,6 +171,19 @@ static int dsa_slave_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
 	return -EOPNOTSUPP;
 }
 
+static int dsa_slave_swdev_get_id(struct net_device *dev,
+				  struct netdev_phys_item_id *psid)
+{
+	struct dsa_slave_priv *p = netdev_priv(dev);
+	struct dsa_switch *ds = p->parent;
+	u64 tmp = (u64) ds;
+
+	/* TODO: add more sophisticated id generation */
+	memcpy(&psid->id, &tmp, sizeof(tmp));
+	psid->id_len = sizeof(tmp);
+
+	return 0;
+}
 
 /* ethtool operations *******************************************************/
 static int
@@ -303,6 +316,7 @@ static const struct net_device_ops dsa_netdev_ops = {
 	.ndo_set_rx_mode	= dsa_slave_set_rx_mode,
 	.ndo_set_mac_address	= dsa_slave_set_mac_address,
 	.ndo_do_ioctl		= dsa_slave_ioctl,
+	.ndo_swdev_get_id	= dsa_slave_swdev_get_id,
 };
 #endif
 #ifdef CONFIG_NET_DSA_TAG_EDSA
@@ -315,6 +329,7 @@ static const struct net_device_ops edsa_netdev_ops = {
 	.ndo_set_rx_mode	= dsa_slave_set_rx_mode,
 	.ndo_set_mac_address	= dsa_slave_set_mac_address,
 	.ndo_do_ioctl		= dsa_slave_ioctl,
+	.ndo_swdev_get_id	= dsa_slave_swdev_get_id,
 };
 #endif
 #ifdef CONFIG_NET_DSA_TAG_TRAILER
@@ -327,6 +342,7 @@ static const struct net_device_ops trailer_netdev_ops = {
 	.ndo_set_rx_mode	= dsa_slave_set_rx_mode,
 	.ndo_set_mac_address	= dsa_slave_set_mac_address,
 	.ndo_do_ioctl		= dsa_slave_ioctl,
+	.ndo_swdev_get_id	= dsa_slave_swdev_get_id,
 };
 #endif
 
-- 
1.9.3


* [patch net-next RFC 08/12] net: introduce netdev_phys_item_ids_match helper
  2014-08-21 16:18 [patch net-next RFC 00/12] introduce rocker switch driver with openvswitch hardware accelerated datapath Jiri Pirko
                   ` (4 preceding siblings ...)
  2014-08-21 16:19 ` [patch net-next RFC 07/12] dsa: implement ndo_swdev_get_id Jiri Pirko
@ 2014-08-21 16:19 ` Jiri Pirko
  2014-08-21 16:19 ` [patch net-next RFC 09/12] openvswitch: introduce vport_op get_netdev Jiri Pirko
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 87+ messages in thread
From: Jiri Pirko @ 2014-08-21 16:19 UTC (permalink / raw)
  To: netdev
  Cc: davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet, jhs, sfeldma, f.fainelli, roopa,
	linville, dev, jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a,
	buytenh, aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Scott Feldman <sfeldma@cumulusnetworks.com>
---
 include/linux/netdevice.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 8b5d14c..b48028d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -751,6 +751,13 @@ struct netdev_phys_item_id {
 	unsigned char id_len;
 };
 
+static inline bool netdev_phys_item_ids_match(struct netdev_phys_item_id *id1,
+					      struct netdev_phys_item_id *id2)
+{
+	return id1->id_len == id2->id_len &&
+	       !memcmp(id1->id, id2->id, id1->id_len);
+}
+
 typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
 				       struct sk_buff *skb);
 
-- 
1.9.3


* [patch net-next RFC 09/12] openvswitch: introduce vport_op get_netdev
  2014-08-21 16:18 [patch net-next RFC 00/12] introduce rocker switch driver with openvswitch hardware accelerated datapath Jiri Pirko
                   ` (5 preceding siblings ...)
  2014-08-21 16:19 ` [patch net-next RFC 08/12] net: introduce netdev_phys_item_ids_match helper Jiri Pirko
@ 2014-08-21 16:19 ` Jiri Pirko
  2014-08-21 16:19 ` [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload Jiri Pirko
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 87+ messages in thread
From: Jiri Pirko @ 2014-08-21 16:19 UTC (permalink / raw)
  To: netdev
  Cc: davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet, jhs, sfeldma, f.fainelli, roopa,
	linville, dev, jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a,
	buytenh, aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye

This will allow callers to easily query whether a vport has a netdev.
It also allows netdev_vport_priv and struct netdev_vport to be unexposed.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 net/openvswitch/datapath.c           |  2 +-
 net/openvswitch/dp_notify.c          |  7 ++---
 net/openvswitch/vport-internal_dev.c | 56 ++++++++++++++++++++++++------------
 net/openvswitch/vport-netdev.c       | 16 +++++++++++
 net/openvswitch/vport-netdev.h       | 12 --------
 net/openvswitch/vport.h              |  2 ++
 6 files changed, 59 insertions(+), 36 deletions(-)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 683d6cd..75bb07f 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -171,7 +171,7 @@ static int get_dpifindex(struct datapath *dp)
 
 	local = ovs_vport_rcu(dp, OVSP_LOCAL);
 	if (local)
-		ifindex = netdev_vport_priv(local)->dev->ifindex;
+		ifindex = local->ops->get_netdev(local)->ifindex;
 	else
 		ifindex = 0;
 
diff --git a/net/openvswitch/dp_notify.c b/net/openvswitch/dp_notify.c
index 2c631fe..d2cc24b 100644
--- a/net/openvswitch/dp_notify.c
+++ b/net/openvswitch/dp_notify.c
@@ -58,13 +58,12 @@ void ovs_dp_notify_wq(struct work_struct *work)
 			struct hlist_node *n;
 
 			hlist_for_each_entry_safe(vport, n, &dp->ports[i], dp_hash_node) {
-				struct netdev_vport *netdev_vport;
+				struct net_device *dev;
 
 				if (vport->ops->type != OVS_VPORT_TYPE_NETDEV)
 					continue;
-
-				netdev_vport = netdev_vport_priv(vport);
-				if (!(netdev_vport->dev->priv_flags & IFF_OVS_DATAPATH))
+				dev = vport->ops->get_netdev(vport);
+				if (!(dev->priv_flags & IFF_OVS_DATAPATH))
 					dp_detach_port_notify(vport);
 			}
 		}
diff --git a/net/openvswitch/vport-internal_dev.c b/net/openvswitch/vport-internal_dev.c
index 8451612..6be7928 100644
--- a/net/openvswitch/vport-internal_dev.c
+++ b/net/openvswitch/vport-internal_dev.c
@@ -32,6 +32,17 @@
 #include "vport-internal_dev.h"
 #include "vport-netdev.h"
 
+struct internal_dev_vport {
+	struct rcu_head rcu;
+	struct net_device *dev;
+};
+
+static struct internal_dev_vport *
+internal_dev_vport_priv(const struct vport *vport)
+{
+	return vport_priv(vport);
+}
+
 struct internal_dev {
 	struct vport *vport;
 };
@@ -154,49 +165,50 @@ static void do_setup(struct net_device *netdev)
 static struct vport *internal_dev_create(const struct vport_parms *parms)
 {
 	struct vport *vport;
-	struct netdev_vport *netdev_vport;
+	struct internal_dev_vport *int_vport;
 	struct internal_dev *internal_dev;
+	struct net_device *dev;
 	int err;
 
-	vport = ovs_vport_alloc(sizeof(struct netdev_vport),
+	vport = ovs_vport_alloc(sizeof(struct internal_dev_vport),
 				&ovs_internal_vport_ops, parms);
 	if (IS_ERR(vport)) {
 		err = PTR_ERR(vport);
 		goto error;
 	}
 
-	netdev_vport = netdev_vport_priv(vport);
+	int_vport = internal_dev_vport_priv(vport);
 
-	netdev_vport->dev = alloc_netdev(sizeof(struct internal_dev),
-					 parms->name, NET_NAME_UNKNOWN,
-					 do_setup);
-	if (!netdev_vport->dev) {
+	dev = alloc_netdev(sizeof(struct internal_dev), parms->name,
+			   NET_NAME_UNKNOWN, do_setup);
+	if (!dev) {
 		err = -ENOMEM;
 		goto error_free_vport;
 	}
+	int_vport->dev = dev;
 
-	dev_net_set(netdev_vport->dev, ovs_dp_get_net(vport->dp));
-	internal_dev = internal_dev_priv(netdev_vport->dev);
+	dev_net_set(dev, ovs_dp_get_net(vport->dp));
+	internal_dev = internal_dev_priv(dev);
 	internal_dev->vport = vport;
 
 	/* Restrict bridge port to current netns. */
 	if (vport->port_no == OVSP_LOCAL)
-		netdev_vport->dev->features |= NETIF_F_NETNS_LOCAL;
+		dev->features |= NETIF_F_NETNS_LOCAL;
 
 	rtnl_lock();
-	err = register_netdevice(netdev_vport->dev);
+	err = register_netdevice(dev);
 	if (err)
 		goto error_free_netdev;
 
-	dev_set_promiscuity(netdev_vport->dev, 1);
+	dev_set_promiscuity(dev, 1);
 	rtnl_unlock();
-	netif_start_queue(netdev_vport->dev);
+	netif_start_queue(dev);
 
 	return vport;
 
 error_free_netdev:
 	rtnl_unlock();
-	free_netdev(netdev_vport->dev);
+	free_netdev(dev);
 error_free_vport:
 	ovs_vport_free(vport);
 error:
@@ -205,21 +217,21 @@ error:
 
 static void internal_dev_destroy(struct vport *vport)
 {
-	struct netdev_vport *netdev_vport = netdev_vport_priv(vport);
+	struct internal_dev_vport *int_vport = internal_dev_vport_priv(vport);
 
-	netif_stop_queue(netdev_vport->dev);
+	netif_stop_queue(int_vport->dev);
 	rtnl_lock();
-	dev_set_promiscuity(netdev_vport->dev, -1);
+	dev_set_promiscuity(int_vport->dev, -1);
 
 	/* unregister_netdevice() waits for an RCU grace period. */
-	unregister_netdevice(netdev_vport->dev);
+	unregister_netdevice(int_vport->dev);
 
 	rtnl_unlock();
 }
 
 static int internal_dev_recv(struct vport *vport, struct sk_buff *skb)
 {
-	struct net_device *netdev = netdev_vport_priv(vport)->dev;
+	struct net_device *netdev = internal_dev_vport_priv(vport)->dev;
 	int len;
 
 	len = skb->len;
@@ -238,12 +250,18 @@ static int internal_dev_recv(struct vport *vport, struct sk_buff *skb)
 	return len;
 }
 
+static struct net_device *internal_dev_get_netdev(struct vport *vport)
+{
+	return internal_dev_vport_priv(vport)->dev;
+}
+
 const struct vport_ops ovs_internal_vport_ops = {
 	.type		= OVS_VPORT_TYPE_INTERNAL,
 	.create		= internal_dev_create,
 	.destroy	= internal_dev_destroy,
 	.get_name	= ovs_netdev_get_name,
 	.send		= internal_dev_recv,
+	.get_netdev	= internal_dev_get_netdev,
 };
 
 int ovs_is_internal_dev(const struct net_device *netdev)
diff --git a/net/openvswitch/vport-netdev.c b/net/openvswitch/vport-netdev.c
index d21f77d..aaf3d14 100644
--- a/net/openvswitch/vport-netdev.c
+++ b/net/openvswitch/vport-netdev.c
@@ -33,6 +33,16 @@
 #include "vport-internal_dev.h"
 #include "vport-netdev.h"
 
+struct netdev_vport {
+	struct rcu_head rcu;
+	struct net_device *dev;
+};
+
+static struct netdev_vport *netdev_vport_priv(const struct vport *vport)
+{
+	return vport_priv(vport);
+}
+
 /* Must be called with rcu_read_lock. */
 static void netdev_port_receive(struct vport *vport, struct sk_buff *skb)
 {
@@ -224,10 +234,16 @@ struct vport *ovs_netdev_get_vport(struct net_device *dev)
 		return NULL;
 }
 
+static struct net_device *netdev_get_netdev(struct vport *vport)
+{
+	return netdev_vport_priv(vport)->dev;
+}
+
 const struct vport_ops ovs_netdev_vport_ops = {
 	.type		= OVS_VPORT_TYPE_NETDEV,
 	.create		= netdev_create,
 	.destroy	= netdev_destroy,
 	.get_name	= ovs_netdev_get_name,
 	.send		= netdev_send,
+	.get_netdev	= netdev_get_netdev,
 };
diff --git a/net/openvswitch/vport-netdev.h b/net/openvswitch/vport-netdev.h
index 8df01c11..f03d41d 100644
--- a/net/openvswitch/vport-netdev.h
+++ b/net/openvswitch/vport-netdev.h
@@ -26,18 +26,6 @@
 
 struct vport *ovs_netdev_get_vport(struct net_device *dev);
 
-struct netdev_vport {
-	struct rcu_head rcu;
-
-	struct net_device *dev;
-};
-
-static inline struct netdev_vport *
-netdev_vport_priv(const struct vport *vport)
-{
-	return vport_priv(vport);
-}
-
 const char *ovs_netdev_get_name(const struct vport *);
 void ovs_netdev_detach_dev(struct vport *);
 
diff --git a/net/openvswitch/vport.h b/net/openvswitch/vport.h
index 8409e06..f434271 100644
--- a/net/openvswitch/vport.h
+++ b/net/openvswitch/vport.h
@@ -164,6 +164,8 @@ struct vport_ops {
 	const char *(*get_name)(const struct vport *);
 
 	int (*send)(struct vport *, struct sk_buff *);
+
+	struct net_device *(*get_netdev)(struct vport *);
 };
 
 enum vport_err_type {
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
  2014-08-21 16:18 [patch net-next RFC 00/12] introduce rocker switch driver with openvswitch hardware accelerated datapath Jiri Pirko
                   ` (6 preceding siblings ...)
  2014-08-21 16:19 ` [patch net-next RFC 09/12] openvswitch: introduce vport_op get_netdev Jiri Pirko
@ 2014-08-21 16:19 ` Jiri Pirko
       [not found]   ` <1408637945-10390-11-git-send-email-jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
       [not found]   ` <20140904090447.GB3176@vergenet.net>
  2014-08-21 16:19 ` [patch net-next RFC 11/12] sw_flow: add misc section to key with in_port_ifindex field Jiri Pirko
  2014-08-21 16:19 ` [patch net-next RFC 12/12] rocker: introduce rocker switch driver Jiri Pirko
  9 siblings, 2 replies; 87+ messages in thread
From: Jiri Pirko @ 2014-08-21 16:19 UTC (permalink / raw)
  To: netdev
  Cc: davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet, jhs, sfeldma, f.fainelli, roopa,
	linville, dev, jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a,
	buytenh, aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye

Take advantage of the ability to work with flows in switch devices and
use the swdev API to offload the flow datapath.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 include/linux/sw_flow.h        |  14 +++
 net/openvswitch/Makefile       |   3 +-
 net/openvswitch/datapath.c     |  33 ++++++
 net/openvswitch/datapath.h     |   3 +
 net/openvswitch/flow_table.c   |   1 +
 net/openvswitch/hw_offload.c   | 235 +++++++++++++++++++++++++++++++++++++++++
 net/openvswitch/hw_offload.h   |  22 ++++
 net/openvswitch/vport-netdev.c |   3 +
 net/openvswitch/vport.h        |   2 +
 9 files changed, 315 insertions(+), 1 deletion(-)
 create mode 100644 net/openvswitch/hw_offload.c
 create mode 100644 net/openvswitch/hw_offload.h

diff --git a/include/linux/sw_flow.h b/include/linux/sw_flow.h
index b622fde..079d065 100644
--- a/include/linux/sw_flow.h
+++ b/include/linux/sw_flow.h
@@ -80,7 +80,21 @@ struct sw_flow_mask {
 	struct sw_flow_key key;
 };
 
+enum sw_flow_action_type {
+	SW_FLOW_ACTION_TYPE_OUTPUT,
+	SW_FLOW_ACTION_TYPE_VLAN_PUSH,
+	SW_FLOW_ACTION_TYPE_VLAN_POP,
+};
+
 struct sw_flow_action {
+	enum sw_flow_action_type type;
+	union {
+		struct net_device *output_dev;
+		struct {
+			__be16 vlan_proto;
+			u16 vlan_tci;
+		} vlan;
+	};
 };
 
 struct sw_flow_actions {
diff --git a/net/openvswitch/Makefile b/net/openvswitch/Makefile
index 3591cb5..5152437 100644
--- a/net/openvswitch/Makefile
+++ b/net/openvswitch/Makefile
@@ -13,7 +13,8 @@ openvswitch-y := \
 	flow_table.o \
 	vport.o \
 	vport-internal_dev.o \
-	vport-netdev.o
+	vport-netdev.o \
+	hw_offload.o
 
 ifneq ($(CONFIG_OPENVSWITCH_VXLAN),)
 openvswitch-y += vport-vxlan.o
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 75bb07f..3e43e1d 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -57,6 +57,7 @@
 #include "flow_netlink.h"
 #include "vport-internal_dev.h"
 #include "vport-netdev.h"
+#include "hw_offload.h"
 
 int ovs_net_id __read_mostly;
 
@@ -864,6 +865,9 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
 			acts = NULL;
 			goto err_unlock_ovs;
 		}
+		error = ovs_hw_flow_insert(dp, new_flow);
+		if (error)
+			pr_warn("failed to insert flow into hw\n");
 
 		if (unlikely(reply)) {
 			error = ovs_flow_cmd_fill_info(new_flow,
@@ -896,10 +900,18 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
 				goto err_unlock_ovs;
 			}
 		}
+		error = ovs_hw_flow_remove(dp, flow);
+		if (error)
+			pr_warn("failed to remove flow from hw\n");
+
 		/* Update actions. */
 		old_acts = ovsl_dereference(flow->sf_acts);
 		rcu_assign_pointer(flow->sf_acts, acts);
 
+		error = ovs_hw_flow_insert(dp, flow);
+		if (error)
+			pr_warn("failed to insert flow into hw\n");
+
 		if (unlikely(reply)) {
 			error = ovs_flow_cmd_fill_info(flow,
 						       ovs_header->dp_ifindex,
@@ -993,9 +1005,17 @@ static int ovs_flow_cmd_set(struct sk_buff *skb, struct genl_info *info)
 
 	/* Update actions, if present. */
 	if (likely(acts)) {
+		error = ovs_hw_flow_remove(dp, flow);
+		if (error)
+			pr_warn("failed to remove flow from hw\n");
+
 		old_acts = ovsl_dereference(flow->sf_acts);
 		rcu_assign_pointer(flow->sf_acts, acts);
 
+		error = ovs_hw_flow_insert(dp, flow);
+		if (error)
+			pr_warn("failed to insert flow into hw\n");
+
 		if (unlikely(reply)) {
 			error = ovs_flow_cmd_fill_info(flow,
 						       ovs_header->dp_ifindex,
@@ -1109,6 +1129,9 @@ static int ovs_flow_cmd_del(struct sk_buff *skb, struct genl_info *info)
 	}
 
 	if (unlikely(!a[OVS_FLOW_ATTR_KEY])) {
+		err = ovs_hw_flow_flush(dp);
+		if (err)
+			pr_warn("failed to flush flows from hw\n");
 		err = ovs_flow_tbl_flush(&dp->table);
 		goto unlock;
 	}
@@ -1120,6 +1143,9 @@ static int ovs_flow_cmd_del(struct sk_buff *skb, struct genl_info *info)
 	}
 
 	ovs_flow_tbl_remove(&dp->table, flow);
+	err = ovs_hw_flow_remove(dp, flow);
+	if (err)
+		pr_warn("failed to remove flow from hw\n");
 	ovs_unlock();
 
 	reply = ovs_flow_cmd_alloc_info((const struct ovs_flow_actions __force *) flow->sf_acts,
@@ -1368,6 +1394,8 @@ static int ovs_dp_cmd_new(struct sk_buff *skb, struct genl_info *info)
 	for (i = 0; i < DP_VPORT_HASH_BUCKETS; i++)
 		INIT_HLIST_HEAD(&dp->ports[i]);
 
+	INIT_LIST_HEAD(&dp->swdev_rep_list);
+
 	/* Set up our datapath device. */
 	parms.name = nla_data(a[OVS_DP_ATTR_NAME]);
 	parms.type = OVS_VPORT_TYPE_INTERNAL;
@@ -1431,6 +1459,7 @@ err:
 static void __dp_destroy(struct datapath *dp)
 {
 	int i;
+	int err;
 
 	for (i = 0; i < DP_VPORT_HASH_BUCKETS; i++) {
 		struct vport *vport;
@@ -1448,6 +1477,10 @@ static void __dp_destroy(struct datapath *dp)
 	 */
 	ovs_dp_detach_port(ovs_vport_ovsl(dp, OVSP_LOCAL));
 
+	err = ovs_hw_flow_flush(dp);
+	if (err)
+		pr_warn("failed to flush flows from hw\n");
+
 	/* RCU destroy the flow table */
 	ovs_flow_tbl_destroy(&dp->table, true);
 
diff --git a/net/openvswitch/datapath.h b/net/openvswitch/datapath.h
index 291f5a0..9dc11a6 100644
--- a/net/openvswitch/datapath.h
+++ b/net/openvswitch/datapath.h
@@ -90,6 +90,9 @@ struct datapath {
 #endif
 
 	u32 user_features;
+
+	/* List of switchdev representative ports */
+	struct list_head swdev_rep_list;
 };
 
 /**
diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
index e7d9a41..c01e4cb 100644
--- a/net/openvswitch/flow_table.c
+++ b/net/openvswitch/flow_table.c
@@ -85,6 +85,7 @@ struct ovs_flow *ovs_flow_alloc(void)
 
 	flow->sf_acts = NULL;
 	flow->flow.mask = NULL;
+	flow->flow.actions = NULL;
 	flow->stats_last_writer = NUMA_NO_NODE;
 
 	/* Initialize the default stat node. */
diff --git a/net/openvswitch/hw_offload.c b/net/openvswitch/hw_offload.c
new file mode 100644
index 0000000..edb8a68
--- /dev/null
+++ b/net/openvswitch/hw_offload.c
@@ -0,0 +1,235 @@
+/*
+ * Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/kernel.h>
+#include <linux/netdevice.h>
+#include <linux/sw_flow.h>
+#include <linux/switchdev.h>
+
+#include "datapath.h"
+#include "vport-netdev.h"
+
+static int sw_flow_action_create(struct datapath *dp,
+				 struct sw_flow_actions **p_actions,
+				 struct ovs_flow_actions *acts)
+{
+	const struct nlattr *attr = acts->actions;
+	int len = acts->actions_len;
+	const struct nlattr *a;
+	int rem;
+	struct sw_flow_actions *actions;
+	struct sw_flow_action *cur;
+	size_t count = 0;
+	int err;
+
+	for (a = attr, rem = len; rem > 0; a = nla_next(a, &rem))
+		count++;
+
+	actions = kzalloc(sizeof(struct sw_flow_actions) +
+			  sizeof(struct sw_flow_action) * count,
+			  GFP_KERNEL);
+	if (!actions)
+		return -ENOMEM;
+	actions->count = count;
+
+	cur = actions->actions;
+	for (a = attr, rem = len; rem > 0; a = nla_next(a, &rem)) {
+		switch (nla_type(a)) {
+		case OVS_ACTION_ATTR_OUTPUT:
+			{
+				struct vport *vport;
+
+				vport = ovs_vport_ovsl_rcu(dp, nla_get_u32(a));
+				cur->type = SW_FLOW_ACTION_TYPE_OUTPUT;
+				cur->output_dev = vport->ops->get_netdev(vport);
+			}
+			break;
+
+		case OVS_ACTION_ATTR_PUSH_VLAN:
+			{
+				const struct ovs_action_push_vlan *vlan;
+
+				vlan = nla_data(a);
+				cur->type = SW_FLOW_ACTION_TYPE_VLAN_PUSH;
+				cur->vlan.vlan_proto = vlan->vlan_tpid;
+				cur->vlan.vlan_tci = vlan->vlan_tci;
+			}
+			break;
+
+		case OVS_ACTION_ATTR_POP_VLAN:
+			cur->type = SW_FLOW_ACTION_TYPE_VLAN_POP;
+			break;
+
+		default:
+			err = -EOPNOTSUPP;
+			goto errout;
+		}
+		cur++;
+	}
+	*p_actions = actions;
+	return 0;
+
+errout:
+	kfree(actions);
+	return err;
+}
+
+int ovs_hw_flow_insert(struct datapath *dp, struct ovs_flow *flow)
+{
+	struct sw_flow_actions *actions;
+	struct vport *vport;
+	struct net_device *dev;
+	int err;
+
+	ASSERT_OVSL();
+	BUG_ON(flow->flow.actions);
+
+	err = sw_flow_action_create(dp, &actions, flow->sf_acts);
+	if (err)
+		return err;
+	flow->flow.actions = actions;
+
+	list_for_each_entry(vport, &dp->swdev_rep_list, swdev_rep_list) {
+		dev = vport->ops->get_netdev(vport);
+		BUG_ON(!dev);
+		err = swdev_flow_insert(dev, &flow->flow);
+		if (err == -ENODEV) /* out device is not in this switch */
+			continue;
+		if (err)
+			break;
+	}
+
+	if (err) {
+		kfree(actions);
+		flow->flow.actions = NULL;
+	}
+	return err;
+}
+
+int ovs_hw_flow_remove(struct datapath *dp, struct ovs_flow *flow)
+{
+	struct vport *vport;
+	struct net_device *dev;
+	int err = 0;
+
+	ASSERT_OVSL();
+	list_for_each_entry(vport, &dp->swdev_rep_list, swdev_rep_list) {
+		dev = vport->ops->get_netdev(vport);
+		BUG_ON(!dev);
+		err = swdev_flow_remove(dev, &flow->flow);
+		if (err == -ENODEV) /* out device is not in this switch */
+			continue;
+		if (err)
+			break;
+	}
+	kfree(flow->flow.actions);
+	flow->flow.actions = NULL;
+	return err;
+}
+
+int ovs_hw_flow_flush(struct datapath *dp)
+{
+	struct table_instance *ti;
+	int i;
+	int ver;
+	int err;
+
+	ti = ovsl_dereference(dp->table.ti);
+	ver = ti->node_ver;
+
+	for (i = 0; i < ti->n_buckets; i++) {
+		struct ovs_flow *flow;
+		struct hlist_head *head = flex_array_get(ti->buckets, i);
+
+		hlist_for_each_entry(flow, head, hash_node[ver]) {
+			err = ovs_hw_flow_remove(dp, flow);
+			if (err)
+				return err;
+		}
+	}
+	return 0;
+}
+
+static bool __is_vport_in_swdev_rep_list(struct datapath *dp,
+					 struct vport *vport)
+{
+	struct vport *cur_vport;
+
+	list_for_each_entry(cur_vport, &dp->swdev_rep_list, swdev_rep_list) {
+		if (cur_vport == vport)
+			return true;
+	}
+	return false;
+}
+
+static struct vport *__find_vport_by_swdev_id(struct datapath *dp,
+					      struct vport *vport)
+{
+	struct net_device *dev;
+	struct vport *cur_vport;
+	struct netdev_phys_item_id id;
+	struct netdev_phys_item_id cur_id;
+	int i;
+	int err;
+
+	err = swdev_get_id(vport->ops->get_netdev(vport), &id);
+	if (err)
+		return ERR_PTR(err);
+
+	for (i = 0; i < DP_VPORT_HASH_BUCKETS; i++) {
+		hlist_for_each_entry(cur_vport, &dp->ports[i], dp_hash_node) {
+			if (cur_vport->ops->type != OVS_VPORT_TYPE_NETDEV)
+				continue;
+			if (cur_vport == vport)
+				continue;
+			dev = cur_vport->ops->get_netdev(cur_vport);
+			if (!dev)
+				continue;
+			err = swdev_get_id(dev, &cur_id);
+			if (err)
+				continue;
+			if (netdev_phys_item_ids_match(&id, &cur_id))
+				return cur_vport;
+		}
+	}
+	return ERR_PTR(-ENOENT);
+}
+
+void ovs_hw_port_add(struct datapath *dp, struct vport *vport)
+{
+	struct vport *found_vport;
+
+	ASSERT_OVSL();
+	/* The representative list always contains one port per switch dev id */
+	found_vport = __find_vport_by_swdev_id(dp, vport);
+	if (IS_ERR(found_vport) && PTR_ERR(found_vport) == -ENOENT) {
+		list_add(&vport->swdev_rep_list, &dp->swdev_rep_list);
+		pr_debug("%s added to rep_list\n", vport->ops->get_name(vport));
+	}
+}
+
+void ovs_hw_port_del(struct datapath *dp, struct vport *vport)
+{
+	struct vport *found_vport;
+
+	ASSERT_OVSL();
+	if (!__is_vport_in_swdev_rep_list(dp, vport))
+		return;
+
+	list_del(&vport->swdev_rep_list);
+	pr_debug("%s deleted from rep_list\n", vport->ops->get_name(vport));
+	found_vport = __find_vport_by_swdev_id(dp, vport);
+	if (!IS_ERR(found_vport)) {
+		list_add(&found_vport->swdev_rep_list, &dp->swdev_rep_list);
+		pr_debug("%s added to rep_list instead\n",
+			 found_vport->ops->get_name(found_vport));
+	}
+}
diff --git a/net/openvswitch/hw_offload.h b/net/openvswitch/hw_offload.h
new file mode 100644
index 0000000..83972d7
--- /dev/null
+++ b/net/openvswitch/hw_offload.h
@@ -0,0 +1,22 @@
+/*
+ * Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#ifndef HW_OFFLOAD_H
+#define HW_OFFLOAD_H 1
+
+#include "datapath.h"
+#include "flow.h"
+
+int ovs_hw_flow_insert(struct datapath *dp, struct ovs_flow *flow);
+int ovs_hw_flow_remove(struct datapath *dp, struct ovs_flow *flow);
+int ovs_hw_flow_flush(struct datapath *dp);
+void ovs_hw_port_add(struct datapath *dp, struct vport *vport);
+void ovs_hw_port_del(struct datapath *dp, struct vport *vport);
+
+#endif
diff --git a/net/openvswitch/vport-netdev.c b/net/openvswitch/vport-netdev.c
index aaf3d14..c5953de 100644
--- a/net/openvswitch/vport-netdev.c
+++ b/net/openvswitch/vport-netdev.c
@@ -32,6 +32,7 @@
 #include "datapath.h"
 #include "vport-internal_dev.h"
 #include "vport-netdev.h"
+#include "hw_offload.h"
 
 struct netdev_vport {
 	struct rcu_head rcu;
@@ -136,6 +137,7 @@ static struct vport *netdev_create(const struct vport_parms *parms)
 	dev_set_promiscuity(netdev_vport->dev, 1);
 	netdev_vport->dev->priv_flags |= IFF_OVS_DATAPATH;
 	rtnl_unlock();
+	ovs_hw_port_add(vport->dp, vport);
 
 	return vport;
 
@@ -176,6 +178,7 @@ static void netdev_destroy(struct vport *vport)
 {
 	struct netdev_vport *netdev_vport = netdev_vport_priv(vport);
 
+	ovs_hw_port_del(vport->dp, vport);
 	rtnl_lock();
 	if (netdev_vport->dev->priv_flags & IFF_OVS_DATAPATH)
 		ovs_netdev_detach_dev(vport);
diff --git a/net/openvswitch/vport.h b/net/openvswitch/vport.h
index f434271..c28604a 100644
--- a/net/openvswitch/vport.h
+++ b/net/openvswitch/vport.h
@@ -110,6 +110,8 @@ struct vport {
 
 	spinlock_t stats_lock;
 	struct vport_err_stats err_stats;
+
+	struct list_head swdev_rep_list;
 };
 
 /**
-- 
1.9.3


* [patch net-next RFC 11/12] sw_flow: add misc section to key with in_port_ifindex field
  2014-08-21 16:18 [patch net-next RFC 00/12] introduce rocker switch driver with openvswitch hardware accelerated datapath Jiri Pirko
                   ` (7 preceding siblings ...)
  2014-08-21 16:19 ` [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload Jiri Pirko
@ 2014-08-21 16:19 ` Jiri Pirko
  2014-08-21 16:19 ` [patch net-next RFC 12/12] rocker: introduce rocker switch driver Jiri Pirko
  9 siblings, 0 replies; 87+ messages in thread
From: Jiri Pirko @ 2014-08-21 16:19 UTC (permalink / raw)
  To: netdev
  Cc: davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet, jhs, sfeldma, f.fainelli, roopa,
	linville, dev, jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a,
	buytenh, aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 include/linux/sw_flow.h      |  3 +++
 net/core/switchdev.c         | 10 ++++++++++
 net/openvswitch/hw_offload.c | 23 +++++++++++++++++++++++
 3 files changed, 36 insertions(+)

diff --git a/include/linux/sw_flow.h b/include/linux/sw_flow.h
index 079d065..e2ee54a 100644
--- a/include/linux/sw_flow.h
+++ b/include/linux/sw_flow.h
@@ -68,6 +68,9 @@ struct sw_flow_key {
 			} nd;
 		} ipv6;
 	};
+	struct {
+		u32	in_port_ifindex; /* Input switch port ifindex (or 0). */
+	} misc;
 } __aligned(BITS_PER_LONG/8); /* Ensure that we can do comparisons as longs. */
 
 struct sw_flow_key_range {
diff --git a/net/core/switchdev.c b/net/core/switchdev.c
index 4fad097..6d271a0 100644
--- a/net/core/switchdev.c
+++ b/net/core/switchdev.c
@@ -75,6 +75,14 @@ static void print_flow_key_ipv4(const char *prefix,
 		 key->ipv4.arp.sha, key->ipv4.arp.tha);
 }
 
+static void print_flow_key_misc(const char *prefix,
+				const struct sw_flow_key *key)
+{
+	pr_debug("%s misc  { in_port_ifindex %08x }\n",
+		 prefix,
+		 key->misc.in_port_ifindex);
+}
+
 static void print_flow_actions(struct sw_flow_actions *actions)
 {
 	int i;
@@ -120,6 +128,8 @@ static void print_flow(const struct sw_flow *flow, struct net_device *dev,
 	print_flow_key_ip(PREFIX_MASK, &flow->mask->key);
 	print_flow_key_ipv4(PREFIX_NONE, &flow->key);
 	print_flow_key_ipv4(PREFIX_MASK, &flow->mask->key);
+	print_flow_key_misc(PREFIX_NONE, &flow->key);
+	print_flow_key_misc(PREFIX_MASK, &flow->mask->key);
 	print_flow_actions(flow->actions);
 }
 
diff --git a/net/openvswitch/hw_offload.c b/net/openvswitch/hw_offload.c
index edb8a68..ac3997d 100644
--- a/net/openvswitch/hw_offload.c
+++ b/net/openvswitch/hw_offload.c
@@ -82,6 +82,24 @@ errout:
 	return err;
 }
 
+void ovs_hw_flow_adjust(struct datapath *dp, struct ovs_flow *flow)
+{
+	struct vport *vport;
+
+	flow->flow.key.misc.in_port_ifindex = 0;
+	flow->flow.mask->key.misc.in_port_ifindex = 0;
+	vport = ovs_vport_ovsl(dp, flow->flow.key.phy.in_port);
+	if (vport && vport->ops->type == OVS_VPORT_TYPE_NETDEV) {
+		struct net_device *dev;
+
+		dev = vport->ops->get_netdev(vport);
+		if (dev) {
+			flow->flow.key.misc.in_port_ifindex = dev->ifindex;
+			flow->flow.mask->key.misc.in_port_ifindex = 0xFFFFFFFF;
+		}
+	}
+}
+
 int ovs_hw_flow_insert(struct datapath *dp, struct ovs_flow *flow)
 {
 	struct sw_flow_actions *actions;
@@ -92,6 +110,8 @@ int ovs_hw_flow_insert(struct datapath *dp, struct ovs_flow *flow)
 	ASSERT_OVSL();
 	BUG_ON(flow->flow.actions);
 
+	ovs_hw_flow_adjust(dp, flow);
+
 	err = sw_flow_action_create(dp, &actions, flow->sf_acts);
 	if (err)
 		return err;
@@ -121,6 +141,9 @@ int ovs_hw_flow_remove(struct datapath *dp, struct ovs_flow *flow)
 	int err = 0;
 
 	ASSERT_OVSL();
+
+	ovs_hw_flow_adjust(dp, flow);
+
 	list_for_each_entry(vport, &dp->swdev_rep_list, swdev_rep_list) {
 		dev = vport->ops->get_netdev(vport);
 		BUG_ON(!dev);
-- 
1.9.3


* [patch net-next RFC 12/12] rocker: introduce rocker switch driver
  2014-08-21 16:18 [patch net-next RFC 00/12] introduce rocker switch driver with openvswitch hardware accelerated datapath Jiri Pirko
                   ` (8 preceding siblings ...)
  2014-08-21 16:19 ` [patch net-next RFC 11/12] sw_flow: add misc section to key with in_port_ifindex field Jiri Pirko
@ 2014-08-21 16:19 ` Jiri Pirko
  2014-08-21 17:19   ` Florian Fainelli
  2014-08-23 14:04   ` Thomas Graf
  9 siblings, 2 replies; 87+ messages in thread
From: Jiri Pirko @ 2014-08-21 16:19 UTC (permalink / raw)
  To: netdev
  Cc: davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet, jhs, sfeldma, f.fainelli, roopa,
	linville, dev, jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a,
	buytenh, aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye

This patch introduces the first driver to benefit from the switchdev
infrastructure and to implement the newly introduced switch ndos. It is a
driver for the emulated switch chip implemented in QEMU:
https://github.com/sfeldma/qemu-rocker/

This patch is a result of joint work with Scott Feldman.

Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 MAINTAINERS          |    6 +
 drivers/net/Kconfig  |    8 +
 drivers/net/Makefile |    2 +
 drivers/net/rocker.c | 3446 ++++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/net/rocker.h |  465 +++++++
 5 files changed, 3927 insertions(+)
 create mode 100644 drivers/net/rocker.c
 create mode 100644 drivers/net/rocker.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 2f85f55..55b9fd2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7607,6 +7607,12 @@ F:	drivers/hid/hid-roccat*
 F:	include/linux/hid-roccat*
 F:	Documentation/ABI/*/sysfs-driver-hid-roccat*
 
+ROCKER DRIVER
+M:	Jiri Pirko <jiri@resnulli.us>
+L:	netdev@vger.kernel.org
+S:	Supported
+F:	drivers/net/rocker.*
+
 ROCKETPORT DRIVER
 P:	Comtrol Corp.
 W:	http://www.comtrol.com
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 7822c74..64914a8 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -366,4 +366,12 @@ config VMXNET3
 
 source "drivers/net/hyperv/Kconfig"
 
+config ROCKER
+	tristate "Rocker switch driver"
+	depends on PCI && INET && NET_SWITCHDEV
+	help
+	  This driver supports the Rocker switch device.
+	  To compile this driver as a module, choose M here: the
+	  module will be called rocker.
+
 endif # NETDEVICES
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index 3c835ba..8fe0ea1 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -66,3 +66,5 @@ obj-$(CONFIG_USB_NET_DRIVERS) += usb/
 
 obj-$(CONFIG_HYPERV_NET) += hyperv/
 obj-$(CONFIG_NTB_NETDEV) += ntb_netdev.o
+
+obj-$(CONFIG_ROCKER) += rocker.o
diff --git a/drivers/net/rocker.c b/drivers/net/rocker.c
new file mode 100644
index 0000000..7426db5
--- /dev/null
+++ b/drivers/net/rocker.c
@@ -0,0 +1,3446 @@
+/*
+ * drivers/net/rocker.c - Rocker switch device driver
+ * Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us>
+ * Copyright (c) 2014 Scott Feldman <sfeldma@cumulusnetworks.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/interrupt.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/spinlock.h>
+#include <linux/hashtable.h>
+#include <linux/crc32.h>
+#include <linux/sort.h>
+#include <linux/random.h>
+#include <linux/netdevice.h>
+#include <linux/skbuff.h>
+#include <linux/socket.h>
+#include <linux/etherdevice.h>
+#include <linux/ethtool.h>
+#include <linux/if_ether.h>
+#include <linux/if_vlan.h>
+#include <linux/sw_flow.h>
+#include <net/rtnetlink.h>
+#include <asm-generic/io-64-nonatomic-lo-hi.h>
+#include <generated/utsrelease.h>
+
+#include "rocker.h"
+
+static const char rocker_driver_name[] = "rocker";
+
+static const struct pci_device_id rocker_pci_id_table[] = {
+	{PCI_VDEVICE(REDHAT, PCI_DEVICE_ID_REDHAT_ROCKER), 0},
+	{0, }
+};
+
+enum rocker_op {
+	ROCKER_OP_ADD,
+	ROCKER_OP_DEL,
+};
+
+struct rocker_flow_tbl_key {
+	u32 priority;
+	enum rocker_of_dpa_table_id tbl_id;
+	union {
+		struct {
+			u32 in_lport;
+			u32 in_lport_mask;
+			enum rocker_of_dpa_table_id goto_tbl;
+		} ig_port;
+		struct {
+			u32 in_lport;
+			__be16 vlan_id;
+			__be16 vlan_id_mask;
+			enum rocker_of_dpa_table_id goto_tbl;
+			bool untagged;
+			__be16 new_vlan_id;
+		} vlan;
+		struct {
+			/* TODO */
+		} term_mac;
+		struct {
+			u8 eth_dst[ETH_ALEN];
+			u8 eth_dst_mask[ETH_ALEN];
+			int has_eth_dst;
+			int has_eth_dst_mask;
+			__be16 vlan_id;
+			u32 tunnel_id;
+			enum rocker_of_dpa_table_id goto_tbl;
+			u32 group_id;
+		} bridge;
+		struct {
+			u32 in_lport;
+			u32 in_lport_mask;
+			u8 eth_src[ETH_ALEN];
+			u8 eth_src_mask[ETH_ALEN];
+			u8 eth_dst[ETH_ALEN];
+			u8 eth_dst_mask[ETH_ALEN];
+			__be16 eth_type;
+			__be16 vlan_id;
+			__be16 vlan_id_mask;
+			u32 group_id;
+		} acl;
+	};
+};
+
+struct rocker_flow_tbl_entry {
+	struct hlist_node entry;
+	u32 ref_count;
+	u64 cookie;
+	struct rocker_flow_tbl_key key;
+	u32 key_crc32;
+};
+
+struct rocker_group_tbl_entry {
+	struct hlist_node entry;
+	u32 ref_count;
+	u32 group_id;
+	u16 group_count;
+	u32 *group_ids;
+	union {
+		struct {
+			u8 pop_vlan;
+		} l2_interface;
+	};
+};
+
+struct rocker_desc_info {
+	char *data; /* mapped */
+	size_t data_size;
+	size_t tlv_size;
+	struct rocker_desc *desc;
+	DEFINE_DMA_UNMAP_ADDR(mapaddr);
+};
+
+struct rocker_dma_ring_info {
+	size_t size;
+	u32 head;
+	u32 tail;
+	struct rocker_desc *desc; /* mapped */
+	dma_addr_t mapaddr;
+	struct rocker_desc_info *desc_info;
+	unsigned int type;
+};
+
+struct rocker;
+
+struct rocker_port {
+	struct net_device *dev;
+	unsigned int prev_flags;
+	struct rocker *rocker;
+	unsigned port_number;
+	struct napi_struct napi_tx;
+	struct napi_struct napi_rx;
+	struct rocker_dma_ring_info tx_ring;
+	struct rocker_dma_ring_info rx_ring;
+};
+
+struct rocker {
+	struct pci_dev *pdev;
+	u8 __iomem *hw_addr;
+	struct msix_entry *msix_entries;
+	unsigned port_count;
+	struct rocker_port **ports;
+	struct {
+		u64 id;
+	} hw;
+	spinlock_t cmd_ring_lock;
+	struct rocker_dma_ring_info cmd_ring;
+	struct rocker_dma_ring_info event_ring;
+	DECLARE_HASHTABLE(flow_tbl, 16);
+	spinlock_t flow_tbl_lock;
+	u64 flow_tbl_next_cookie;
+	DECLARE_HASHTABLE(group_tbl, 16);
+	spinlock_t group_tbl_lock;
+	u16 group_index_next;
+};
+
+struct rocker_wait {
+	wait_queue_head_t wait;
+	bool done;
+};
+
+static const u8 zero_mac[ETH_ALEN] = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 };
+static const u8 ff_mac[ETH_ALEN] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };
+static const u8 lldp_mac[ETH_ALEN] = { 0x01, 0x80, 0xc2, 0x00, 0x00, 0x0e };
+
+/* Rocker priority levels for flow table entries.  Higher
+ * priority match takes precedence over lower priority match.
+ */
+
+enum {
+	ROCKER_PRIORITY_UNKNOWN = 0,
+	ROCKER_PRIORITY_IG_PORT = 1,
+	ROCKER_PRIORITY_VLAN = 1,
+	ROCKER_PRIORITY_BRIDGING_VLAN_DFLT_EXACT = 1,
+	ROCKER_PRIORITY_BRIDGING_VLAN_DFLT_WILD = 2,
+	ROCKER_PRIORITY_BRIDGING_VLAN = 3,
+	ROCKER_PRIORITY_BRIDGING_TENANT_DFLT_EXACT = 1,
+	ROCKER_PRIORITY_BRIDGING_TENANT_DFLT_WILD = 2,
+	ROCKER_PRIORITY_BRIDGING_TENANT = 3,
+	ROCKER_PRIORITY_ACL_PORT_PROMISC = 1,
+	ROCKER_PRIORITY_ACL = 2,
+};
+
+static u32 rocker_port_to_lport(struct rocker_port *rocker_port)
+{
+	return rocker_port->port_number + 1;
+}
+
+static void rocker_wait_reset(struct rocker_wait *wait)
+{
+	wait->done = false;
+}
+
+static void rocker_wait_init(struct rocker_wait *wait)
+{
+	init_waitqueue_head(&wait->wait);
+	rocker_wait_reset(wait);
+}
+
+static bool rocker_wait_event_timeout(struct rocker_wait *wait,
+				      unsigned long timeout)
+{
+	wait_event_timeout(wait->wait, wait->done, timeout);
+	if (!wait->done)
+		return false;
+	return true;
+}
+
+static void rocker_wait_wake_up(struct rocker_wait *wait)
+{
+	wait->done = true;
+	wake_up(&wait->wait);
+}
+
+static u32 rocker_msix_vector(struct rocker *rocker, unsigned vector)
+{
+	return rocker->msix_entries[vector].vector;
+}
+
+static u32 rocker_msix_tx_vector(struct rocker_port *rocker_port)
+{
+	return rocker_msix_vector(rocker_port->rocker,
+				  ROCKER_MSIX_VEC_TX(rocker_port->port_number));
+}
+
+static u32 rocker_msix_rx_vector(struct rocker_port *rocker_port)
+{
+	return rocker_msix_vector(rocker_port->rocker,
+				  ROCKER_MSIX_VEC_RX(rocker_port->port_number));
+}
+
+#define rocker_write32(rocker, reg, val)	\
+	writel((val), (rocker)->hw_addr + (ROCKER_ ## reg))
+#define rocker_read32(rocker, reg)	\
+	readl((rocker)->hw_addr + (ROCKER_ ## reg))
+#define rocker_write64(rocker, reg, val)	\
+	writeq((val), (rocker)->hw_addr + (ROCKER_ ## reg))
+#define rocker_read64(rocker, reg)	\
+	readq((rocker)->hw_addr + (ROCKER_ ## reg))
+
+/*****************************
+ * HW basic testing functions
+ *****************************/
+
+static int rocker_reg_test(struct rocker *rocker)
+{
+	struct pci_dev *pdev = rocker->pdev;
+	u64 test_reg;
+	u64 rnd;
+
+	rnd = prandom_u32();
+	rnd >>= 1;
+	rocker_write32(rocker, TEST_REG, rnd);
+	test_reg = rocker_read32(rocker, TEST_REG);
+	if (test_reg != rnd * 2) {
+		dev_err(&pdev->dev, "unexpected 32bit register value %08llx, expected %08llx\n",
+			test_reg, rnd * 2);
+		return -EIO;
+	}
+
+	rnd = prandom_u32();
+	rnd <<= 31;
+	rnd |= prandom_u32();
+	rocker_write64(rocker, TEST_REG64, rnd);
+	test_reg = rocker_read64(rocker, TEST_REG64);
+	if (test_reg != rnd * 2) {
+		dev_err(&pdev->dev, "unexpected 64bit register value %16llx, expected %16llx\n",
+			test_reg, rnd * 2);
+		return -EIO;
+	}
+
+	return 0;
+}
+
+static int rocker_dma_test_one(struct rocker *rocker, struct rocker_wait *wait,
+			       u32 test_type, dma_addr_t dma_handle,
+			       unsigned char *buf, unsigned char *expect,
+			       size_t size)
+{
+	struct pci_dev *pdev = rocker->pdev;
+	int i;
+
+	rocker_wait_reset(wait);
+	rocker_write32(rocker, TEST_DMA_CTRL, test_type);
+
+	if (!rocker_wait_event_timeout(wait, HZ / 10)) {
+		dev_err(&pdev->dev, "no interrupt received within a timeout\n");
+		return -EIO;
+	}
+
+	for (i = 0; i < size; i++) {
+		if (buf[i] != expect[i]) {
+			dev_err(&pdev->dev, "unexpected memory content %02x at byte %x, %02x expected\n",
+				buf[i], i, expect[i]);
+			return -EIO;
+		}
+	}
+	return 0;
+}
+
+#define ROCKER_TEST_DMA_BUF_SIZE (PAGE_SIZE * 4)
+#define ROCKER_TEST_DMA_FILL_PATTERN 0x96
+
+static int rocker_dma_test_offset(struct rocker *rocker,
+				  struct rocker_wait *wait, int offset)
+{
+	struct pci_dev *pdev = rocker->pdev;
+	unsigned char *alloc;
+	unsigned char *buf;
+	unsigned char *expect;
+	dma_addr_t dma_handle;
+	int i;
+	int err;
+
+	alloc = kzalloc(ROCKER_TEST_DMA_BUF_SIZE * 2 + offset,
+			GFP_KERNEL | GFP_DMA);
+	if (!alloc)
+		return -ENOMEM;
+	buf = alloc + offset;
+	expect = buf + ROCKER_TEST_DMA_BUF_SIZE;
+
+	dma_handle = pci_map_single(pdev, buf, ROCKER_TEST_DMA_BUF_SIZE,
+				    PCI_DMA_BIDIRECTIONAL);
+	if (pci_dma_mapping_error(pdev, dma_handle)) {
+		err = -EIO;
+		goto free_alloc;
+	}
+
+	rocker_write64(rocker, TEST_DMA_ADDR, dma_handle);
+	rocker_write32(rocker, TEST_DMA_SIZE, ROCKER_TEST_DMA_BUF_SIZE);
+
+	memset(expect, ROCKER_TEST_DMA_FILL_PATTERN, ROCKER_TEST_DMA_BUF_SIZE);
+	err = rocker_dma_test_one(rocker, wait, ROCKER_TEST_DMA_CTRL_FILL,
+				  dma_handle, buf, expect,
+				  ROCKER_TEST_DMA_BUF_SIZE);
+	if (err)
+		goto unmap;
+
+	memset(expect, 0, ROCKER_TEST_DMA_BUF_SIZE);
+	err = rocker_dma_test_one(rocker, wait, ROCKER_TEST_DMA_CTRL_CLEAR,
+				  dma_handle, buf, expect,
+				  ROCKER_TEST_DMA_BUF_SIZE);
+	if (err)
+		goto unmap;
+
+	prandom_bytes(buf, ROCKER_TEST_DMA_BUF_SIZE);
+	for (i = 0; i < ROCKER_TEST_DMA_BUF_SIZE; i++)
+		expect[i] = ~buf[i];
+	err = rocker_dma_test_one(rocker, wait, ROCKER_TEST_DMA_CTRL_INVERT,
+				  dma_handle, buf, expect,
+				  ROCKER_TEST_DMA_BUF_SIZE);
+
+unmap:
+	pci_unmap_single(pdev, dma_handle, ROCKER_TEST_DMA_BUF_SIZE,
+			 PCI_DMA_BIDIRECTIONAL);
+free_alloc:
+	kfree(alloc);
+
+	return err;
+}
+
+static int rocker_dma_test(struct rocker *rocker, struct rocker_wait *wait)
+{
+	int i;
+	int err;
+
+	for (i = 0; i < 8; i++) {
+		err = rocker_dma_test_offset(rocker, wait, i);
+		if (err)
+			return err;
+	}
+	return 0;
+}
+
+static irqreturn_t rocker_test_irq_handler(int irq, void *dev_id)
+{
+	struct rocker_wait *wait = dev_id;
+
+	rocker_wait_wake_up(wait);
+
+	return IRQ_HANDLED;
+}
+
+static int rocker_basic_hw_test(struct rocker *rocker)
+{
+	struct pci_dev *pdev = rocker->pdev;
+	struct rocker_wait wait;
+	int err;
+
+	err = rocker_reg_test(rocker);
+	if (err) {
+		dev_err(&pdev->dev, "reg test failed\n");
+		return err;
+	}
+
+	err = request_irq(rocker_msix_vector(rocker, ROCKER_MSIX_VEC_TEST),
+			  rocker_test_irq_handler, 0,
+			  rocker_driver_name, &wait);
+	if (err) {
+		dev_err(&pdev->dev, "cannot assign test irq\n");
+		return err;
+	}
+
+	rocker_wait_init(&wait);
+	rocker_write32(rocker, TEST_IRQ, ROCKER_MSIX_VEC_TEST);
+
+	if (!rocker_wait_event_timeout(&wait, HZ / 10)) {
+		dev_err(&pdev->dev, "no interrupt received within a timeout\n");
+		err = -EIO;
+		goto free_irq;
+	}
+
+	err = rocker_dma_test(rocker, &wait);
+	if (err)
+		dev_err(&pdev->dev, "dma test failed\n");
+
+free_irq:
+	free_irq(rocker_msix_vector(rocker, ROCKER_MSIX_VEC_TEST), &wait);
+	return err;
+}
+
+/******
+ * TLV
+ ******/
+
+#define ROCKER_TLV_ALIGNTO 8U
+#define ROCKER_TLV_ALIGN(len) \
+	(((len) + ROCKER_TLV_ALIGNTO - 1) & ~(ROCKER_TLV_ALIGNTO - 1))
+#define ROCKER_TLV_HDRLEN ROCKER_TLV_ALIGN(sizeof(struct rocker_tlv))
+
+/*  <------- ROCKER_TLV_HDRLEN -------> <--- ROCKER_TLV_ALIGN(payload) --->
+ * +-----------------------------+- - -+- - - - - - - - - - - - - - -+- - -+
+ * |             Header          | Pad |           Payload           | Pad |
+ * |      (struct rocker_tlv)    | ing |                             | ing |
+ * +-----------------------------+- - -+- - - - - - - - - - - - - - -+- - -+
+ *  <--------------------------- tlv->len -------------------------->
+ */
+
+static struct rocker_tlv *rocker_tlv_next(const struct rocker_tlv *tlv,
+					  int *remaining)
+{
+	int totlen = ROCKER_TLV_ALIGN(tlv->len);
+
+	*remaining -= totlen;
+	return (struct rocker_tlv *) ((char *) tlv + totlen);
+}
+
+static int rocker_tlv_ok(const struct rocker_tlv *tlv, int remaining)
+{
+	return remaining >= (int) ROCKER_TLV_HDRLEN &&
+	       tlv->len >= ROCKER_TLV_HDRLEN &&
+	       tlv->len <= remaining;
+}
+
+#define rocker_tlv_for_each(pos, head, len, rem)	\
+	for (pos = head, rem = len;			\
+	     rocker_tlv_ok(pos, rem);			\
+	     pos = rocker_tlv_next(pos, &(rem)))
+
+#define rocker_tlv_for_each_nested(pos, tlv, rem)	\
+	rocker_tlv_for_each(pos, rocker_tlv_data(tlv),	\
+			    rocker_tlv_len(tlv), rem)
+
+static int rocker_tlv_attr_size(int payload)
+{
+	return ROCKER_TLV_HDRLEN + payload;
+}
+
+static int rocker_tlv_total_size(int payload)
+{
+	return ROCKER_TLV_ALIGN(rocker_tlv_attr_size(payload));
+}
+
+static int rocker_tlv_padlen(int payload)
+{
+	return rocker_tlv_total_size(payload) - rocker_tlv_attr_size(payload);
+}
+
+static int rocker_tlv_type(const struct rocker_tlv *tlv)
+{
+	return tlv->type;
+}
+
+static void *rocker_tlv_data(const struct rocker_tlv *tlv)
+{
+	return (char *) tlv + ROCKER_TLV_HDRLEN;
+}
+
+static int rocker_tlv_len(const struct rocker_tlv *tlv)
+{
+	return tlv->len - ROCKER_TLV_HDRLEN;
+}
+
+static u8 rocker_tlv_get_u8(const struct rocker_tlv *tlv)
+{
+	return *(u8 *) rocker_tlv_data(tlv);
+}
+
+static u16 rocker_tlv_get_u16(const struct rocker_tlv *tlv)
+{
+	return *(u16 *) rocker_tlv_data(tlv);
+}
+
+static u32 rocker_tlv_get_u32(const struct rocker_tlv *tlv)
+{
+	return *(u32 *) rocker_tlv_data(tlv);
+}
+
+static u64 rocker_tlv_get_u64(const struct rocker_tlv *tlv)
+{
+	return *(u64 *) rocker_tlv_data(tlv);
+}
+
+static void rocker_tlv_parse(struct rocker_tlv **tb, int maxtype,
+			     const char *buf, int buf_len)
+{
+	const struct rocker_tlv *tlv;
+	const struct rocker_tlv *head = (const struct rocker_tlv *) buf;
+	int rem;
+
+	memset(tb, 0, sizeof(struct rocker_tlv *) * (maxtype + 1));
+
+	rocker_tlv_for_each(tlv, head, buf_len, rem) {
+		u32 type = rocker_tlv_type(tlv);
+
+		if (type > 0 && type <= maxtype)
+			tb[type] = (struct rocker_tlv *) tlv;
+	}
+}
+
+static void rocker_tlv_parse_nested(struct rocker_tlv **tb, int maxtype,
+				    const struct rocker_tlv *tlv)
+{
+	rocker_tlv_parse(tb, maxtype, rocker_tlv_data(tlv),
+			 rocker_tlv_len(tlv));
+}
+
+static void rocker_tlv_parse_desc(struct rocker_tlv **tb, int maxtype,
+				  struct rocker_desc_info *desc_info)
+{
+	rocker_tlv_parse(tb, maxtype, desc_info->data,
+			 desc_info->desc->tlv_size);
+}
+
+static struct rocker_tlv *rocker_tlv_start(struct rocker_desc_info *desc_info)
+{
+	return (struct rocker_tlv *) ((char *) desc_info->data +
+					       desc_info->tlv_size);
+}
+
+static int rocker_tlv_put(struct rocker_desc_info *desc_info,
+			  int attrtype, int attrlen, const void *data)
+{
+	int tail_room = desc_info->data_size - desc_info->tlv_size;
+	int total_size = rocker_tlv_total_size(attrlen);
+	struct rocker_tlv *tlv;
+
+	if (unlikely(tail_room < total_size))
+		return -EMSGSIZE;
+
+	tlv = rocker_tlv_start(desc_info);
+	desc_info->tlv_size += total_size;
+	tlv->type = attrtype;
+	tlv->len = rocker_tlv_attr_size(attrlen);
+	memcpy(rocker_tlv_data(tlv), data, attrlen);
+	memset((char *) tlv + tlv->len, 0, rocker_tlv_padlen(attrlen));
+	return 0;
+}
+
+static int rocker_tlv_put_u8(struct rocker_desc_info *desc_info,
+			     int attrtype, u8 value)
+{
+	return rocker_tlv_put(desc_info, attrtype, sizeof(u8), &value);
+}
+
+static int rocker_tlv_put_u16(struct rocker_desc_info *desc_info,
+			      int attrtype, u16 value)
+{
+	return rocker_tlv_put(desc_info, attrtype, sizeof(u16), &value);
+}
+
+static int rocker_tlv_put_u32(struct rocker_desc_info *desc_info,
+			      int attrtype, u32 value)
+{
+	return rocker_tlv_put(desc_info, attrtype, sizeof(u32), &value);
+}
+
+static int rocker_tlv_put_u64(struct rocker_desc_info *desc_info,
+			      int attrtype, u64 value)
+{
+	return rocker_tlv_put(desc_info, attrtype, sizeof(u64), &value);
+}
+
+static struct rocker_tlv *
+rocker_tlv_nest_start(struct rocker_desc_info *desc_info, int attrtype)
+{
+	struct rocker_tlv *start = rocker_tlv_start(desc_info);
+
+	if (rocker_tlv_put(desc_info, attrtype, 0, NULL) < 0)
+		return NULL;
+
+	return start;
+}
+
+static void rocker_tlv_nest_end(struct rocker_desc_info *desc_info,
+				struct rocker_tlv *start)
+{
+	start->len = (char *) rocker_tlv_start(desc_info) - (char *) start;
+}
+
+static void rocker_tlv_nest_cancel(struct rocker_desc_info *desc_info,
+				   struct rocker_tlv *start)
+{
+	desc_info->tlv_size = (char *) start - desc_info->data;
+}
+
+/******************************************
+ * DMA rings and descriptors manipulations
+ ******************************************/
+
+static u32 __pos_inc(u32 pos, size_t limit)
+{
+	return ++pos == limit ? 0 : pos;
+}
+
+static int rocker_desc_err(struct rocker_desc_info *desc_info)
+{
+	return -(desc_info->desc->comp_err & ~ROCKER_DMA_DESC_COMP_ERR_GEN);
+}
+
+static void rocker_desc_gen_clear(struct rocker_desc_info *desc_info)
+{
+	desc_info->desc->comp_err &= ~ROCKER_DMA_DESC_COMP_ERR_GEN;
+}
+
+static bool rocker_desc_gen(struct rocker_desc_info *desc_info)
+{
+	u32 comp_err = desc_info->desc->comp_err;
+
+	return comp_err & ROCKER_DMA_DESC_COMP_ERR_GEN ? true : false;
+}
+
+static void *rocker_desc_cookie_ptr_get(struct rocker_desc_info *desc_info)
+{
+	return (void *) desc_info->desc->cookie;
+}
+
+static void rocker_desc_cookie_ptr_set(struct rocker_desc_info *desc_info,
+				       void *ptr)
+{
+	desc_info->desc->cookie = (long) ptr;
+}
+
+static struct rocker_desc_info *
+rocker_desc_head_get(struct rocker_dma_ring_info *info)
+{
+	struct rocker_desc_info *desc_info;
+	u32 head = __pos_inc(info->head, info->size);
+
+	desc_info = &info->desc_info[info->head];
+	if (head == info->tail)
+		return NULL; /* ring full */
+	desc_info->tlv_size = 0;
+	return desc_info;
+}
+
+static void rocker_desc_commit(struct rocker_desc_info *desc_info)
+{
+	desc_info->desc->buf_size = desc_info->data_size;
+	desc_info->desc->tlv_size = desc_info->tlv_size;
+}
+
+static void rocker_desc_head_set(struct rocker *rocker,
+				 struct rocker_dma_ring_info *info,
+				 struct rocker_desc_info *desc_info)
+{
+	u32 head = __pos_inc(info->head, info->size);
+
+	BUG_ON(head == info->tail);
+	rocker_desc_commit(desc_info);
+	info->head = head;
+	rocker_write32(rocker, DMA_DESC_HEAD(info->type), head);
+}
+
+static struct rocker_desc_info *
+rocker_desc_tail_get(struct rocker_dma_ring_info *info)
+{
+	struct rocker_desc_info *desc_info;
+
+	if (info->tail == info->head)
+		return NULL; /* nothing to do between head and tail */
+	desc_info = &info->desc_info[info->tail];
+	if (!rocker_desc_gen(desc_info))
+		return NULL; /* gen bit not set, desc is not ready yet */
+	info->tail = __pos_inc(info->tail, info->size);
+	desc_info->tlv_size = desc_info->desc->tlv_size;
+	return desc_info;
+}
+
+static void rocker_dma_ring_credits_set(struct rocker *rocker,
+					struct rocker_dma_ring_info *info,
+					u32 credits)
+{
+	if (credits)
+		rocker_write32(rocker, DMA_DESC_CREDITS(info->type), credits);
+}
+
+static unsigned long rocker_dma_ring_size_fix(size_t size)
+{
+	return max(ROCKER_DMA_SIZE_MIN,
+		   min(roundup_pow_of_two(size), ROCKER_DMA_SIZE_MAX));
+}
+
+static int rocker_dma_ring_create(struct rocker *rocker,
+				  unsigned int type,
+				  size_t size,
+				  struct rocker_dma_ring_info *info)
+{
+	int i;
+
+	BUG_ON(size != rocker_dma_ring_size_fix(size));
+	info->size = size;
+	info->type = type;
+	info->head = 0;
+	info->tail = 0;
+	info->desc_info = kcalloc(info->size, sizeof(*info->desc_info),
+				  GFP_KERNEL);
+	if (!info->desc_info)
+		return -ENOMEM;
+
+	info->desc = pci_alloc_consistent(rocker->pdev,
+					  info->size * sizeof(*info->desc),
+					  &info->mapaddr);
+	if (!info->desc) {
+		kfree(info->desc_info);
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < info->size; i++)
+		info->desc_info[i].desc = &info->desc[i];
+
+	rocker_write32(rocker, DMA_DESC_CTRL(info->type),
+		       ROCKER_DMA_DESC_CTRL_RESET);
+	rocker_write64(rocker, DMA_DESC_ADDR(info->type), info->mapaddr);
+	rocker_write32(rocker, DMA_DESC_SIZE(info->type), info->size);
+
+	return 0;
+}
+
+static void rocker_dma_ring_destroy(struct rocker *rocker,
+				    struct rocker_dma_ring_info *info)
+{
+	rocker_write64(rocker, DMA_DESC_ADDR(info->type), 0);
+
+	pci_free_consistent(rocker->pdev,
+			    info->size * sizeof(struct rocker_desc),
+			    info->desc, info->mapaddr);
+	kfree(info->desc_info);
+}
+
+static void rocker_dma_ring_pass_to_producer(struct rocker *rocker,
+					     struct rocker_dma_ring_info *info)
+{
+	int i;
+
+	BUG_ON(info->head || info->tail);
+
+	/* When the ring is a consumer ring, we need to advance the head
+	 * for each desc.  That tells the hw that the desc is ready to be
+	 * used by it.
+	 */
+	for (i = 0; i < info->size - 1; i++)
+		rocker_desc_head_set(rocker, info, &info->desc_info[i]);
+	rocker_desc_commit(&info->desc_info[i]);
+}
+
+static int rocker_dma_ring_bufs_alloc(struct rocker *rocker,
+				      struct rocker_dma_ring_info *info,
+				      int direction, size_t buf_size)
+{
+	struct pci_dev *pdev = rocker->pdev;
+	int i;
+	int err;
+
+	for (i = 0; i < info->size; i++) {
+		struct rocker_desc_info *desc_info = &info->desc_info[i];
+		struct rocker_desc *desc = &info->desc[i];
+		dma_addr_t dma_handle;
+		char *buf;
+
+		buf = kzalloc(buf_size, GFP_KERNEL | GFP_DMA);
+		if (!buf) {
+			err = -ENOMEM;
+			goto rollback;
+		}
+
+		dma_handle = pci_map_single(pdev, buf, buf_size, direction);
+		if (pci_dma_mapping_error(pdev, dma_handle)) {
+			kfree(buf);
+			err = -EIO;
+			goto rollback;
+		}
+
+		desc_info->data = buf;
+		desc_info->data_size = buf_size;
+		dma_unmap_addr_set(desc_info, mapaddr, dma_handle);
+
+		desc->buf_addr = dma_handle;
+		desc->buf_size = buf_size;
+	}
+	return 0;
+
+rollback:
+	for (i--; i >= 0; i--) {
+		struct rocker_desc_info *desc_info = &info->desc_info[i];
+
+		pci_unmap_single(pdev, dma_unmap_addr(desc_info, mapaddr),
+				 desc_info->data_size, direction);
+		kfree(desc_info->data);
+	}
+	return err;
+}
+
+static void rocker_dma_ring_bufs_free(struct rocker *rocker,
+				      struct rocker_dma_ring_info *info,
+				      int direction)
+{
+	struct pci_dev *pdev = rocker->pdev;
+	int i;
+
+	for (i = 0; i < info->size; i++) {
+		struct rocker_desc_info *desc_info = &info->desc_info[i];
+		struct rocker_desc *desc = &info->desc[i];
+
+		desc->buf_addr = 0;
+		desc->buf_size = 0;
+		pci_unmap_single(pdev, dma_unmap_addr(desc_info, mapaddr),
+				 desc_info->data_size, direction);
+		kfree(desc_info->data);
+	}
+}
+
+static int rocker_dma_rings_init(struct rocker *rocker)
+{
+	struct pci_dev *pdev = rocker->pdev;
+	int err;
+
+	err = rocker_dma_ring_create(rocker, ROCKER_DMA_CMD,
+				     ROCKER_DMA_CMD_DEFAULT_SIZE,
+				     &rocker->cmd_ring);
+	if (err) {
+		dev_err(&pdev->dev, "failed to create command dma ring\n");
+		return err;
+	}
+
+	spin_lock_init(&rocker->cmd_ring_lock);
+
+	err = rocker_dma_ring_bufs_alloc(rocker, &rocker->cmd_ring,
+					 PCI_DMA_BIDIRECTIONAL, PAGE_SIZE);
+	if (err) {
+		dev_err(&pdev->dev, "failed to alloc command dma ring buffers\n");
+		goto err_dma_cmd_ring_bufs_alloc;
+	}
+
+	err = rocker_dma_ring_create(rocker, ROCKER_DMA_EVENT,
+				     ROCKER_DMA_EVENT_DEFAULT_SIZE,
+				     &rocker->event_ring);
+	if (err) {
+		dev_err(&pdev->dev, "failed to create event dma ring\n");
+		goto err_dma_event_ring_create;
+	}
+
+	err = rocker_dma_ring_bufs_alloc(rocker, &rocker->event_ring,
+					 PCI_DMA_FROMDEVICE, PAGE_SIZE);
+	if (err) {
+		dev_err(&pdev->dev, "failed to alloc event dma ring buffers\n");
+		goto err_dma_event_ring_bufs_alloc;
+	}
+	rocker_dma_ring_pass_to_producer(rocker, &rocker->event_ring);
+	return 0;
+
+err_dma_event_ring_bufs_alloc:
+	rocker_dma_ring_destroy(rocker, &rocker->event_ring);
+err_dma_event_ring_create:
+	rocker_dma_ring_bufs_free(rocker, &rocker->cmd_ring,
+				  PCI_DMA_BIDIRECTIONAL);
+err_dma_cmd_ring_bufs_alloc:
+	rocker_dma_ring_destroy(rocker, &rocker->cmd_ring);
+	return err;
+}
+
+static void rocker_dma_rings_fini(struct rocker *rocker)
+{
+	rocker_dma_ring_bufs_free(rocker, &rocker->event_ring,
+				  PCI_DMA_FROMDEVICE);
+	rocker_dma_ring_destroy(rocker, &rocker->event_ring);
+	rocker_dma_ring_bufs_free(rocker, &rocker->cmd_ring,
+				  PCI_DMA_BIDIRECTIONAL);
+	rocker_dma_ring_destroy(rocker, &rocker->cmd_ring);
+}
+
+static int rocker_dma_rx_ring_skb_map(struct rocker *rocker,
+				      struct rocker_port *rocker_port,
+				      struct rocker_desc_info *desc_info,
+				      struct sk_buff *skb, size_t buf_len)
+{
+	struct pci_dev *pdev = rocker->pdev;
+	dma_addr_t dma_handle;
+
+	dma_handle = pci_map_single(pdev, skb->data, buf_len,
+				    PCI_DMA_FROMDEVICE);
+	if (pci_dma_mapping_error(pdev, dma_handle))
+		return -EIO;
+	if (rocker_tlv_put_u64(desc_info, ROCKER_TLV_RX_FRAG_ADDR, dma_handle))
+		goto tlv_put_failure;
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_RX_FRAG_MAX_LEN, buf_len))
+		goto tlv_put_failure;
+	return 0;
+
+tlv_put_failure:
+	pci_unmap_single(pdev, dma_handle, buf_len, PCI_DMA_FROMDEVICE);
+	desc_info->tlv_size = 0;
+	return -EMSGSIZE;
+}
+
+static size_t rocker_port_rx_buf_len(struct rocker_port *rocker_port)
+{
+	return rocker_port->dev->mtu + ETH_HLEN + ETH_FCS_LEN + VLAN_HLEN;
+}
+
+static int rocker_dma_rx_ring_skb_alloc(struct rocker *rocker,
+					struct rocker_port *rocker_port,
+					struct rocker_desc_info *desc_info)
+{
+	struct net_device *dev = rocker_port->dev;
+	struct sk_buff *skb;
+	size_t buf_len = rocker_port_rx_buf_len(rocker_port);
+	int err;
+
+	/* Ensure that hw will see tlv_size zero in case of an error.
+	 * That tells hw to use another descriptor.
+	 */
+	rocker_desc_cookie_ptr_set(desc_info, NULL);
+	desc_info->tlv_size = 0;
+
+	skb = netdev_alloc_skb_ip_align(dev, buf_len);
+	if (!skb)
+		return -ENOMEM;
+	err = rocker_dma_rx_ring_skb_map(rocker, rocker_port, desc_info,
+					 skb, buf_len);
+	if (err) {
+		dev_kfree_skb_any(skb);
+		return err;
+	}
+	rocker_desc_cookie_ptr_set(desc_info, skb);
+	return 0;
+}
+
+static void rocker_dma_rx_ring_skb_unmap(struct rocker *rocker,
+					 struct rocker_tlv **attrs)
+{
+	struct pci_dev *pdev = rocker->pdev;
+	dma_addr_t dma_handle;
+	size_t len;
+
+	if (!attrs[ROCKER_TLV_RX_FRAG_ADDR] ||
+	    !attrs[ROCKER_TLV_RX_FRAG_MAX_LEN])
+		return;
+	dma_handle = rocker_tlv_get_u64(attrs[ROCKER_TLV_RX_FRAG_ADDR]);
+	len = rocker_tlv_get_u16(attrs[ROCKER_TLV_RX_FRAG_MAX_LEN]);
+	pci_unmap_single(pdev, dma_handle, len, PCI_DMA_FROMDEVICE);
+}
+
+static void rocker_dma_rx_ring_skb_free(struct rocker *rocker,
+					struct rocker_desc_info *desc_info)
+{
+	struct rocker_tlv *attrs[ROCKER_TLV_RX_MAX + 1];
+	struct sk_buff *skb = rocker_desc_cookie_ptr_get(desc_info);
+
+	if (!skb)
+		return;
+	rocker_tlv_parse_desc(attrs, ROCKER_TLV_RX_MAX, desc_info);
+	rocker_dma_rx_ring_skb_unmap(rocker, attrs);
+	dev_kfree_skb_any(skb);
+}
+
+static int rocker_dma_rx_ring_skbs_alloc(struct rocker *rocker,
+					 struct rocker_port *rocker_port)
+{
+	struct rocker_dma_ring_info *rx_ring = &rocker_port->rx_ring;
+	int i;
+	int err;
+
+	for (i = 0; i < rx_ring->size; i++) {
+		err = rocker_dma_rx_ring_skb_alloc(rocker, rocker_port,
+						   &rx_ring->desc_info[i]);
+		if (err)
+			goto rollback;
+	}
+	return 0;
+
+rollback:
+	for (i--; i >= 0; i--)
+		rocker_dma_rx_ring_skb_free(rocker, &rx_ring->desc_info[i]);
+	return err;
+}
+
+static void rocker_dma_rx_ring_skbs_free(struct rocker *rocker,
+					 struct rocker_port *rocker_port)
+{
+	struct rocker_dma_ring_info *rx_ring = &rocker_port->rx_ring;
+	int i;
+
+	for (i = 0; i < rx_ring->size; i++)
+		rocker_dma_rx_ring_skb_free(rocker, &rx_ring->desc_info[i]);
+}
+
+static int rocker_port_dma_rings_init(struct rocker_port *rocker_port)
+{
+	struct rocker *rocker = rocker_port->rocker;
+	int err;
+
+	err = rocker_dma_ring_create(rocker,
+				     ROCKER_DMA_TX(rocker_port->port_number),
+				     ROCKER_DMA_TX_DEFAULT_SIZE,
+				     &rocker_port->tx_ring);
+	if (err) {
+		netdev_err(rocker_port->dev, "failed to create tx dma ring\n");
+		return err;
+	}
+
+	err = rocker_dma_ring_bufs_alloc(rocker, &rocker_port->tx_ring,
+					 PCI_DMA_TODEVICE,
+					 ROCKER_DMA_TX_DESC_SIZE);
+	if (err) {
+		netdev_err(rocker_port->dev, "failed to alloc tx dma ring buffers\n");
+		goto err_dma_tx_ring_bufs_alloc;
+	}
+
+	err = rocker_dma_ring_create(rocker,
+				     ROCKER_DMA_RX(rocker_port->port_number),
+				     ROCKER_DMA_RX_DEFAULT_SIZE,
+				     &rocker_port->rx_ring);
+	if (err) {
+		netdev_err(rocker_port->dev, "failed to create rx dma ring\n");
+		goto err_dma_rx_ring_create;
+	}
+
+	err = rocker_dma_ring_bufs_alloc(rocker, &rocker_port->rx_ring,
+					 PCI_DMA_BIDIRECTIONAL,
+					 ROCKER_DMA_RX_DESC_SIZE);
+	if (err) {
+		netdev_err(rocker_port->dev, "failed to alloc rx dma ring buffers\n");
+		goto err_dma_rx_ring_bufs_alloc;
+	}
+
+	err = rocker_dma_rx_ring_skbs_alloc(rocker, rocker_port);
+	if (err) {
+		netdev_err(rocker_port->dev, "failed to alloc rx dma ring skbs\n");
+		goto err_dma_rx_ring_skbs_alloc;
+	}
+	rocker_dma_ring_pass_to_producer(rocker, &rocker_port->rx_ring);
+
+	return 0;
+
+err_dma_rx_ring_skbs_alloc:
+	rocker_dma_ring_bufs_free(rocker, &rocker_port->rx_ring,
+				  PCI_DMA_BIDIRECTIONAL);
+err_dma_rx_ring_bufs_alloc:
+	rocker_dma_ring_destroy(rocker, &rocker_port->rx_ring);
+err_dma_rx_ring_create:
+	rocker_dma_ring_bufs_free(rocker, &rocker_port->tx_ring,
+				  PCI_DMA_TODEVICE);
+err_dma_tx_ring_bufs_alloc:
+	rocker_dma_ring_destroy(rocker, &rocker_port->tx_ring);
+	return err;
+}
+
+static void rocker_port_dma_rings_fini(struct rocker_port *rocker_port)
+{
+	struct rocker *rocker = rocker_port->rocker;
+
+	rocker_dma_rx_ring_skbs_free(rocker, rocker_port);
+	rocker_dma_ring_bufs_free(rocker, &rocker_port->rx_ring,
+				  PCI_DMA_BIDIRECTIONAL);
+	rocker_dma_ring_destroy(rocker, &rocker_port->rx_ring);
+	rocker_dma_ring_bufs_free(rocker, &rocker_port->tx_ring,
+				  PCI_DMA_TODEVICE);
+	rocker_dma_ring_destroy(rocker, &rocker_port->tx_ring);
+}
+
+static void rocker_port_set_enable(struct rocker_port *rocker_port, bool enable)
+{
+	u64 val = rocker_read64(rocker_port->rocker, PORT_PHYS_ENABLE);
+
+	if (enable)
+		val |= 1 << rocker_port_to_lport(rocker_port);
+	else
+		val &= ~(1 << rocker_port_to_lport(rocker_port));
+	rocker_write64(rocker_port->rocker, PORT_PHYS_ENABLE, val);
+}
+
+/********************************
+ * Interrupt handler and helpers
+ ********************************/
+
+static irqreturn_t rocker_cmd_irq_handler(int irq, void *dev_id)
+{
+	struct rocker *rocker = dev_id;
+	struct rocker_desc_info *desc_info;
+	struct rocker_wait *wait;
+	u32 credits = 0;
+
+	spin_lock(&rocker->cmd_ring_lock);
+	while ((desc_info = rocker_desc_tail_get(&rocker->cmd_ring))) {
+		wait = rocker_desc_cookie_ptr_get(desc_info);
+		rocker_wait_wake_up(wait);
+		credits++;
+	}
+	spin_unlock(&rocker->cmd_ring_lock);
+	rocker_dma_ring_credits_set(rocker, &rocker->cmd_ring, credits);
+
+	return IRQ_HANDLED;
+}
+
+static void rocker_port_link_up(struct rocker_port *rocker_port)
+{
+	netif_carrier_on(rocker_port->dev);
+	netdev_info(rocker_port->dev, "Link is up\n");
+}
+
+static void rocker_port_link_down(struct rocker_port *rocker_port)
+{
+	netif_carrier_off(rocker_port->dev);
+	netdev_info(rocker_port->dev, "Link is down\n");
+}
+
+static int rocker_event_process(struct rocker *rocker,
+				struct rocker_desc_info *desc_info)
+{
+	struct rocker_tlv *attrs[ROCKER_TLV_EVENT_MAX + 1];
+	struct rocker_tlv *info_attrs[ROCKER_TLV_EVENT_LINK_CHANGED_MAX + 1];
+	u16 type;
+	unsigned port_number;
+	bool link_up;
+	struct rocker_port *rocker_port;
+
+	rocker_tlv_parse_desc(attrs, ROCKER_TLV_EVENT_MAX, desc_info);
+	if (!attrs[ROCKER_TLV_EVENT_TYPE] ||
+	    !attrs[ROCKER_TLV_EVENT_INFO])
+		return -EIO;
+
+	type = rocker_tlv_get_u16(attrs[ROCKER_TLV_EVENT_TYPE]);
+	if (!type)
+		return -EOPNOTSUPP;
+
+	rocker_tlv_parse_nested(info_attrs, ROCKER_TLV_EVENT_LINK_CHANGED_MAX,
+				attrs[ROCKER_TLV_EVENT_INFO]);
+	if (!info_attrs[ROCKER_TLV_EVENT_LINK_CHANGED_LPORT] ||
+	    !info_attrs[ROCKER_TLV_EVENT_LINK_CHANGED_LINKUP])
+		return -EIO;
+	port_number = rocker_tlv_get_u32(info_attrs[ROCKER_TLV_EVENT_LINK_CHANGED_LPORT]) - 1;
+	link_up = rocker_tlv_get_u8(info_attrs[ROCKER_TLV_EVENT_LINK_CHANGED_LINKUP]);
+
+	if (port_number >= rocker->port_count)
+		return -EINVAL;
+
+	rocker_port = rocker->ports[port_number];
+	if (netif_carrier_ok(rocker_port->dev) != link_up) {
+		if (link_up)
+			rocker_port_link_up(rocker_port);
+		else
+			rocker_port_link_down(rocker_port);
+	}
+	return 0;
+}
+
+static irqreturn_t rocker_event_irq_handler(int irq, void *dev_id)
+{
+	struct rocker *rocker = dev_id;
+	struct pci_dev *pdev = rocker->pdev;
+	struct rocker_desc_info *desc_info;
+	u32 credits = 0;
+	int err;
+
+	while ((desc_info = rocker_desc_tail_get(&rocker->event_ring))) {
+		err = rocker_desc_err(desc_info);
+		if (err) {
+			dev_err(&pdev->dev, "event desc received with err %d\n",
+				err);
+		} else {
+			err = rocker_event_process(rocker, desc_info);
+			if (err)
+				dev_err(&pdev->dev, "event processing failed with err %d\n",
+					err);
+		}
+		rocker_desc_gen_clear(desc_info);
+		rocker_desc_head_set(rocker, &rocker->event_ring, desc_info);
+		credits++;
+	}
+	rocker_dma_ring_credits_set(rocker, &rocker->event_ring, credits);
+
+	return IRQ_HANDLED;
+}
+
+static irqreturn_t rocker_tx_irq_handler(int irq, void *dev_id)
+{
+	struct rocker_port *rocker_port = dev_id;
+
+	napi_schedule(&rocker_port->napi_tx);
+	return IRQ_HANDLED;
+}
+
+static irqreturn_t rocker_rx_irq_handler(int irq, void *dev_id)
+{
+	struct rocker_port *rocker_port = dev_id;
+
+	napi_schedule(&rocker_port->napi_rx);
+	return IRQ_HANDLED;
+}
+
+/********************
+ * Command interface
+ ********************/
+
+typedef int (*rocker_cmd_cb_t)(struct rocker *rocker,
+			       struct rocker_port *rocker_port,
+			       struct rocker_desc_info *desc_info,
+			       void *priv);
+
+static int rocker_cmd_exec(struct rocker *rocker,
+			   struct rocker_port *rocker_port,
+			   rocker_cmd_cb_t prepare, void *prepare_priv,
+			   rocker_cmd_cb_t process, void *process_priv)
+{
+	struct rocker_desc_info *desc_info;
+	struct rocker_wait wait;
+	unsigned long flags;
+	int err;
+
+	spin_lock_irqsave(&rocker->cmd_ring_lock, flags);
+	desc_info = rocker_desc_head_get(&rocker->cmd_ring);
+	if (!desc_info) {
+		spin_unlock_irqrestore(&rocker->cmd_ring_lock, flags);
+		return -EAGAIN;
+	}
+	err = prepare(rocker, rocker_port, desc_info, prepare_priv);
+	if (err) {
+		spin_unlock_irqrestore(&rocker->cmd_ring_lock, flags);
+		return err;
+	}
+	rocker_desc_cookie_ptr_set(desc_info, &wait);
+	rocker_wait_init(&wait);
+	rocker_desc_head_set(rocker, &rocker->cmd_ring, desc_info);
+	spin_unlock_irqrestore(&rocker->cmd_ring_lock, flags);
+
+	if (!rocker_wait_event_timeout(&wait, HZ / 10))
+		return -EIO;
+
+	err = rocker_desc_err(desc_info);
+	if (err)
+		return err;
+
+	if (process)
+		err = process(rocker, rocker_port, desc_info, process_priv);
+
+	rocker_desc_gen_clear(desc_info);
+	return err;
+}
+
+static int
+rocker_cmd_get_port_settings_prep(struct rocker *rocker,
+				  struct rocker_port *rocker_port,
+				  struct rocker_desc_info *desc_info,
+				  void *priv)
+{
+	struct rocker_tlv *cmd_info;
+
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_CMD_TYPE,
+			       ROCKER_TLV_CMD_TYPE_GET_PORT_SETTINGS))
+		return -EMSGSIZE;
+	cmd_info = rocker_tlv_nest_start(desc_info, ROCKER_TLV_CMD_INFO);
+	if (!cmd_info)
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u32(desc_info, ROCKER_TLV_CMD_PORT_SETTINGS_LPORT,
+			       rocker_port_to_lport(rocker_port)))
+		return -EMSGSIZE;
+	rocker_tlv_nest_end(desc_info, cmd_info);
+	return 0;
+}
+
+static int
+rocker_cmd_get_port_settings_ethtool_proc(struct rocker *rocker,
+					  struct rocker_port *rocker_port,
+					  struct rocker_desc_info *desc_info,
+					  void *priv)
+{
+	struct ethtool_cmd *ecmd = priv;
+	struct rocker_tlv *attrs[ROCKER_TLV_CMD_MAX + 1];
+	struct rocker_tlv *info_attrs[ROCKER_TLV_CMD_PORT_SETTINGS_MAX + 1];
+	u32 speed;
+	u8 duplex;
+	u8 autoneg;
+
+	rocker_tlv_parse_desc(attrs, ROCKER_TLV_CMD_MAX, desc_info);
+	if (!attrs[ROCKER_TLV_CMD_INFO])
+		return -EIO;
+
+	rocker_tlv_parse_nested(info_attrs, ROCKER_TLV_CMD_PORT_SETTINGS_MAX,
+				attrs[ROCKER_TLV_CMD_INFO]);
+	if (!info_attrs[ROCKER_TLV_CMD_PORT_SETTINGS_SPEED] ||
+	    !info_attrs[ROCKER_TLV_CMD_PORT_SETTINGS_DUPLEX] ||
+	    !info_attrs[ROCKER_TLV_CMD_PORT_SETTINGS_AUTONEG])
+		return -EIO;
+
+	speed = rocker_tlv_get_u32(info_attrs[ROCKER_TLV_CMD_PORT_SETTINGS_SPEED]);
+	duplex = rocker_tlv_get_u8(info_attrs[ROCKER_TLV_CMD_PORT_SETTINGS_DUPLEX]);
+	autoneg = rocker_tlv_get_u8(info_attrs[ROCKER_TLV_CMD_PORT_SETTINGS_AUTONEG]);
+
+	ecmd->transceiver = XCVR_INTERNAL;
+	ecmd->supported = SUPPORTED_TP;
+	ecmd->phy_address = 0xff;
+	ecmd->port = PORT_TP;
+	ethtool_cmd_speed_set(ecmd, speed);
+	ecmd->duplex = duplex ? DUPLEX_FULL : DUPLEX_HALF;
+	ecmd->autoneg = autoneg ? AUTONEG_ENABLE : AUTONEG_DISABLE;
+
+	return 0;
+}
+
+static int
+rocker_cmd_get_port_settings_macaddr_proc(struct rocker *rocker,
+					  struct rocker_port *rocker_port,
+					  struct rocker_desc_info *desc_info,
+					  void *priv)
+{
+	unsigned char *macaddr = priv;
+	struct rocker_tlv *attrs[ROCKER_TLV_CMD_MAX + 1];
+	struct rocker_tlv *info_attrs[ROCKER_TLV_CMD_PORT_SETTINGS_MAX + 1];
+	struct rocker_tlv *attr;
+
+	rocker_tlv_parse_desc(attrs, ROCKER_TLV_CMD_MAX, desc_info);
+	if (!attrs[ROCKER_TLV_CMD_INFO])
+		return -EIO;
+
+	rocker_tlv_parse_nested(info_attrs, ROCKER_TLV_CMD_PORT_SETTINGS_MAX,
+				attrs[ROCKER_TLV_CMD_INFO]);
+	attr = info_attrs[ROCKER_TLV_CMD_PORT_SETTINGS_MACADDR];
+	if (!attr)
+		return -EIO;
+
+	if (rocker_tlv_len(attr) != ETH_ALEN)
+		return -EINVAL;
+
+	ether_addr_copy(macaddr, rocker_tlv_data(attr));
+	return 0;
+}
+
+static int
+rocker_cmd_get_port_settings_mode_proc(struct rocker *rocker,
+				       struct rocker_port *rocker_port,
+				       struct rocker_desc_info *desc_info,
+				       void *priv)
+{
+	enum rocker_port_mode *mode = priv;
+	struct rocker_tlv *attrs[ROCKER_TLV_CMD_MAX + 1];
+	struct rocker_tlv *info_attrs[ROCKER_TLV_CMD_PORT_SETTINGS_MAX + 1];
+
+	rocker_tlv_parse_desc(attrs, ROCKER_TLV_CMD_MAX, desc_info);
+	if (!attrs[ROCKER_TLV_CMD_INFO])
+		return -EIO;
+
+	rocker_tlv_parse_nested(info_attrs, ROCKER_TLV_CMD_PORT_SETTINGS_MAX,
+				attrs[ROCKER_TLV_CMD_INFO]);
+	if (!info_attrs[ROCKER_TLV_CMD_PORT_SETTINGS_MODE])
+		return -EIO;
+
+	*mode = rocker_tlv_get_u8(info_attrs[ROCKER_TLV_CMD_PORT_SETTINGS_MODE]);
+
+	return 0;
+}
+
+static int
+rocker_cmd_set_port_settings_ethtool_prep(struct rocker *rocker,
+					  struct rocker_port *rocker_port,
+					  struct rocker_desc_info *desc_info,
+					  void *priv)
+{
+	struct ethtool_cmd *ecmd = priv;
+	struct rocker_tlv *cmd_info;
+
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_CMD_TYPE,
+			       ROCKER_TLV_CMD_TYPE_SET_PORT_SETTINGS))
+		return -EMSGSIZE;
+	cmd_info = rocker_tlv_nest_start(desc_info, ROCKER_TLV_CMD_INFO);
+	if (!cmd_info)
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u32(desc_info, ROCKER_TLV_CMD_PORT_SETTINGS_LPORT,
+			       rocker_port_to_lport(rocker_port)))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u32(desc_info, ROCKER_TLV_CMD_PORT_SETTINGS_SPEED,
+			       ethtool_cmd_speed(ecmd)))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u8(desc_info, ROCKER_TLV_CMD_PORT_SETTINGS_DUPLEX,
+			      ecmd->duplex))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u8(desc_info, ROCKER_TLV_CMD_PORT_SETTINGS_AUTONEG,
+			      ecmd->autoneg))
+		return -EMSGSIZE;
+	rocker_tlv_nest_end(desc_info, cmd_info);
+	return 0;
+}
+
+static int
+rocker_cmd_set_port_settings_macaddr_prep(struct rocker *rocker,
+					  struct rocker_port *rocker_port,
+					  struct rocker_desc_info *desc_info,
+					  void *priv)
+{
+	unsigned char *macaddr = priv;
+	struct rocker_tlv *cmd_info;
+
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_CMD_TYPE,
+			       ROCKER_TLV_CMD_TYPE_SET_PORT_SETTINGS))
+		return -EMSGSIZE;
+	cmd_info = rocker_tlv_nest_start(desc_info, ROCKER_TLV_CMD_INFO);
+	if (!cmd_info)
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u32(desc_info, ROCKER_TLV_CMD_PORT_SETTINGS_LPORT,
+			       rocker_port_to_lport(rocker_port)))
+		return -EMSGSIZE;
+	if (rocker_tlv_put(desc_info, ROCKER_TLV_CMD_PORT_SETTINGS_MACADDR,
+			   ETH_ALEN, macaddr))
+		return -EMSGSIZE;
+	rocker_tlv_nest_end(desc_info, cmd_info);
+	return 0;
+}
+
+static int
+rocker_cmd_set_port_settings_mode_prep(struct rocker *rocker,
+				       struct rocker_port *rocker_port,
+				       struct rocker_desc_info *desc_info,
+				       void *priv)
+{
+	enum rocker_port_mode *mode = priv;
+	struct rocker_tlv *cmd_info;
+
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_CMD_TYPE,
+			       ROCKER_TLV_CMD_TYPE_SET_PORT_SETTINGS))
+		return -EMSGSIZE;
+	cmd_info = rocker_tlv_nest_start(desc_info, ROCKER_TLV_CMD_INFO);
+	if (!cmd_info)
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u32(desc_info, ROCKER_TLV_CMD_PORT_SETTINGS_LPORT,
+			       rocker_port_to_lport(rocker_port)))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u8(desc_info, ROCKER_TLV_CMD_PORT_SETTINGS_MODE,
+			      *mode))
+		return -EMSGSIZE;
+	rocker_tlv_nest_end(desc_info, cmd_info);
+	return 0;
+}
+
+static int rocker_cmd_get_port_settings_ethtool(struct rocker_port *rocker_port,
+						struct ethtool_cmd *ecmd)
+{
+	return rocker_cmd_exec(rocker_port->rocker, rocker_port,
+			       rocker_cmd_get_port_settings_prep, NULL,
+			       rocker_cmd_get_port_settings_ethtool_proc,
+			       ecmd);
+}
+
+static int rocker_cmd_get_port_settings_macaddr(struct rocker_port *rocker_port,
+						unsigned char *macaddr)
+{
+	return rocker_cmd_exec(rocker_port->rocker, rocker_port,
+			       rocker_cmd_get_port_settings_prep, NULL,
+			       rocker_cmd_get_port_settings_macaddr_proc,
+			       macaddr);
+}
+
+static int rocker_cmd_get_port_settings_mode(struct rocker_port *rocker_port,
+					     enum rocker_port_mode *mode)
+{
+	return rocker_cmd_exec(rocker_port->rocker, rocker_port,
+			       rocker_cmd_get_port_settings_prep, NULL,
+			       rocker_cmd_get_port_settings_mode_proc,
+			       mode);
+}
+
+static int rocker_cmd_set_port_settings_ethtool(struct rocker_port *rocker_port,
+						struct ethtool_cmd *ecmd)
+{
+	return rocker_cmd_exec(rocker_port->rocker, rocker_port,
+			       rocker_cmd_set_port_settings_ethtool_prep,
+			       ecmd, NULL, NULL);
+}
+
+static int rocker_cmd_set_port_settings_macaddr(struct rocker_port *rocker_port,
+						unsigned char *macaddr)
+{
+	return rocker_cmd_exec(rocker_port->rocker, rocker_port,
+			       rocker_cmd_set_port_settings_macaddr_prep,
+			       macaddr, NULL, NULL);
+}
+
+static int rocker_cmd_set_port_settings_mode(struct rocker_port *rocker_port,
+					     enum rocker_port_mode mode)
+{
+	return rocker_cmd_exec(rocker_port->rocker, rocker_port,
+			       rocker_cmd_set_port_settings_mode_prep,
+			       &mode, NULL, NULL);
+}
+
+static int rocker_cmd_flow_tbl_add_ig_port(struct rocker_desc_info *desc_info,
+					   struct rocker_flow_tbl_entry *entry)
+{
+	if (rocker_tlv_put_u32(desc_info, ROCKER_TLV_OF_DPA_IN_LPORT,
+			       entry->key.ig_port.in_lport))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u32(desc_info, ROCKER_TLV_OF_DPA_IN_LPORT_MASK,
+			       entry->key.ig_port.in_lport_mask))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_OF_DPA_GOTO_TABLE_ID,
+			       entry->key.ig_port.goto_tbl))
+		return -EMSGSIZE;
+
+	return 0;
+}
+
+static int rocker_cmd_flow_tbl_add_vlan(struct rocker_desc_info *desc_info,
+					struct rocker_flow_tbl_entry *entry)
+{
+	if (rocker_tlv_put_u32(desc_info, ROCKER_TLV_OF_DPA_IN_LPORT,
+			       entry->key.vlan.in_lport))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_OF_DPA_VLAN_ID,
+			       entry->key.vlan.vlan_id))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_OF_DPA_VLAN_ID_MASK,
+			       entry->key.vlan.vlan_id_mask))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_OF_DPA_GOTO_TABLE_ID,
+			       entry->key.vlan.goto_tbl))
+		return -EMSGSIZE;
+	if (entry->key.vlan.untagged &&
+	    rocker_tlv_put_u16(desc_info, ROCKER_TLV_OF_DPA_NEW_VLAN_ID,
+			       entry->key.vlan.new_vlan_id))
+		return -EMSGSIZE;
+
+	return 0;
+}
+
+static int rocker_cmd_flow_tbl_add_bridge(struct rocker_desc_info *desc_info,
+					  struct rocker_flow_tbl_entry *entry)
+{
+	if (entry->key.bridge.has_eth_dst &&
+	    rocker_tlv_put(desc_info, ROCKER_TLV_OF_DPA_DST_MAC,
+			   ETH_ALEN, entry->key.bridge.eth_dst))
+		return -EMSGSIZE;
+	if (entry->key.bridge.has_eth_dst_mask &&
+	    rocker_tlv_put(desc_info, ROCKER_TLV_OF_DPA_DST_MAC_MASK,
+			   ETH_ALEN, entry->key.bridge.eth_dst_mask))
+		return -EMSGSIZE;
+	if (entry->key.bridge.vlan_id &&
+	    rocker_tlv_put_u16(desc_info, ROCKER_TLV_OF_DPA_VLAN_ID,
+			       entry->key.bridge.vlan_id))
+		return -EMSGSIZE;
+	if (entry->key.bridge.tunnel_id &&
+	    rocker_tlv_put_u32(desc_info, ROCKER_TLV_OF_DPA_TUNNEL_ID,
+			       entry->key.bridge.tunnel_id))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_OF_DPA_GOTO_TABLE_ID,
+			       entry->key.bridge.goto_tbl))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u32(desc_info, ROCKER_TLV_OF_DPA_GROUP_ID,
+			       entry->key.bridge.group_id))
+		return -EMSGSIZE;
+
+	return 0;
+}
+
+static int rocker_cmd_flow_tbl_add_acl(struct rocker_desc_info *desc_info,
+				       struct rocker_flow_tbl_entry *entry)
+{
+	if (rocker_tlv_put_u32(desc_info, ROCKER_TLV_OF_DPA_IN_LPORT,
+			       entry->key.acl.in_lport))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u32(desc_info, ROCKER_TLV_OF_DPA_IN_LPORT_MASK,
+			       entry->key.acl.in_lport_mask))
+		return -EMSGSIZE;
+	if (rocker_tlv_put(desc_info, ROCKER_TLV_OF_DPA_SRC_MAC,
+			   ETH_ALEN, entry->key.acl.eth_src))
+		return -EMSGSIZE;
+	if (rocker_tlv_put(desc_info, ROCKER_TLV_OF_DPA_SRC_MAC_MASK,
+			   ETH_ALEN, entry->key.acl.eth_src_mask))
+		return -EMSGSIZE;
+	if (rocker_tlv_put(desc_info, ROCKER_TLV_OF_DPA_DST_MAC,
+			   ETH_ALEN, entry->key.acl.eth_dst))
+		return -EMSGSIZE;
+	if (rocker_tlv_put(desc_info, ROCKER_TLV_OF_DPA_DST_MAC_MASK,
+			   ETH_ALEN, entry->key.acl.eth_dst_mask))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_OF_DPA_ETHERTYPE,
+			       entry->key.acl.eth_type))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_OF_DPA_VLAN_ID,
+			       entry->key.acl.vlan_id))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_OF_DPA_VLAN_ID_MASK,
+			       entry->key.acl.vlan_id_mask))
+		return -EMSGSIZE;
+	if (entry->key.acl.group_id != ROCKER_GROUP_NONE &&
+	    rocker_tlv_put_u32(desc_info, ROCKER_TLV_OF_DPA_GROUP_ID,
+			       entry->key.acl.group_id))
+		return -EMSGSIZE;
+
+	return 0;
+}
+
+static int rocker_cmd_flow_tbl_add(struct rocker *rocker,
+				   struct rocker_port *rocker_port,
+				   struct rocker_desc_info *desc_info,
+				   void *priv)
+{
+	struct rocker_flow_tbl_entry *entry = priv;
+	struct rocker_tlv *cmd_info;
+	int err = 0;
+
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_CMD_TYPE,
+			       ROCKER_TLV_CMD_TYPE_OF_DPA_FLOW_ADD))
+		return -EMSGSIZE;
+	cmd_info = rocker_tlv_nest_start(desc_info, ROCKER_TLV_CMD_INFO);
+	if (!cmd_info)
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_OF_DPA_TABLE_ID,
+			       entry->key.tbl_id))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u32(desc_info, ROCKER_TLV_OF_DPA_PRIORITY,
+			       entry->key.priority))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u32(desc_info, ROCKER_TLV_OF_DPA_HARDTIME, 0))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u64(desc_info, ROCKER_TLV_OF_DPA_COOKIE,
+			       entry->cookie))
+		return -EMSGSIZE;
+
+	switch (entry->key.tbl_id) {
+	case ROCKER_OF_DPA_TABLE_ID_INGRESS_PORT:
+		err = rocker_cmd_flow_tbl_add_ig_port(desc_info, entry);
+		break;
+	case ROCKER_OF_DPA_TABLE_ID_VLAN:
+		err = rocker_cmd_flow_tbl_add_vlan(desc_info, entry);
+		break;
+	case ROCKER_OF_DPA_TABLE_ID_BRIDGING:
+		err = rocker_cmd_flow_tbl_add_bridge(desc_info, entry);
+		break;
+	case ROCKER_OF_DPA_TABLE_ID_ACL_POLICY:
+		err = rocker_cmd_flow_tbl_add_acl(desc_info, entry);
+		break;
+	default:
+		err = -EOPNOTSUPP;
+		break;
+	}
+
+	if (err)
+		return err;
+
+	rocker_tlv_nest_end(desc_info, cmd_info);
+
+	return 0;
+}
+
+static int rocker_cmd_flow_tbl_del(struct rocker *rocker,
+				   struct rocker_port *rocker_port,
+				   struct rocker_desc_info *desc_info,
+				   void *priv)
+{
+	const struct rocker_flow_tbl_entry *entry = priv;
+	struct rocker_tlv *cmd_info;
+
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_CMD_TYPE,
+			       ROCKER_TLV_CMD_TYPE_OF_DPA_FLOW_DEL))
+		return -EMSGSIZE;
+	cmd_info = rocker_tlv_nest_start(desc_info, ROCKER_TLV_CMD_INFO);
+	if (!cmd_info)
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u64(desc_info, ROCKER_TLV_OF_DPA_COOKIE,
+			       entry->cookie))
+		return -EMSGSIZE;
+	rocker_tlv_nest_end(desc_info, cmd_info);
+
+	return 0;
+}
+
+static int
+rocker_cmd_group_tbl_add_l2_interface(struct rocker_desc_info *desc_info,
+				      struct rocker_group_tbl_entry *entry)
+{
+	if (rocker_tlv_put_u32(desc_info, ROCKER_TLV_OF_DPA_OUT_LPORT,
+			       ROCKER_GROUP_PORT_GET(entry->group_id)))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u8(desc_info, ROCKER_TLV_OF_DPA_POP_VLAN,
+			      entry->l2_interface.pop_vlan))
+		return -EMSGSIZE;
+
+	return 0;
+}
+
+static int
+rocker_cmd_group_tbl_add_group_ids(struct rocker_desc_info *desc_info,
+				   struct rocker_group_tbl_entry *entry)
+{
+	int i;
+	struct rocker_tlv *group_ids;
+
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_OF_DPA_GROUP_COUNT,
+			       entry->group_count))
+		return -EMSGSIZE;
+
+	group_ids = rocker_tlv_nest_start(desc_info,
+					  ROCKER_TLV_OF_DPA_GROUP_IDS);
+	if (!group_ids)
+		return -EMSGSIZE;
+
+	for (i = 0; i < entry->group_count; i++) {
+		/* Note TLV array is 1-based */
+		if (rocker_tlv_put_u32(desc_info, i + 1, entry->group_ids[i]))
+			return -EMSGSIZE;
+	}
+
+	rocker_tlv_nest_end(desc_info, group_ids);
+
+	return 0;
+}
+
+static int rocker_cmd_group_tbl_add(struct rocker *rocker,
+				    struct rocker_port *rocker_port,
+				    struct rocker_desc_info *desc_info,
+				    void *priv)
+{
+	struct rocker_group_tbl_entry *entry = priv;
+	struct rocker_tlv *cmd_info;
+	int err = 0;
+
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_CMD_TYPE,
+			       ROCKER_TLV_CMD_TYPE_OF_DPA_GROUP_ADD))
+		return -EMSGSIZE;
+	cmd_info = rocker_tlv_nest_start(desc_info, ROCKER_TLV_CMD_INFO);
+	if (!cmd_info)
+		return -EMSGSIZE;
+
+	if (rocker_tlv_put_u32(desc_info, ROCKER_TLV_OF_DPA_GROUP_ID,
+			       entry->group_id))
+		return -EMSGSIZE;
+
+	switch (ROCKER_GROUP_TYPE_GET(entry->group_id)) {
+	case ROCKER_OF_DPA_GROUP_TYPE_L2_INTERFACE:
+		err = rocker_cmd_group_tbl_add_l2_interface(desc_info, entry);
+		break;
+	case ROCKER_OF_DPA_GROUP_TYPE_L2_MCAST:
+		err = rocker_cmd_group_tbl_add_group_ids(desc_info, entry);
+		break;
+	default:
+		err = -EOPNOTSUPP;
+		break;
+	}
+
+	if (err)
+		return err;
+
+	rocker_tlv_nest_end(desc_info, cmd_info);
+
+	return 0;
+}
+
+static int rocker_cmd_group_tbl_del(struct rocker *rocker,
+				    struct rocker_port *rocker_port,
+				    struct rocker_desc_info *desc_info,
+				    void *priv)
+{
+	const struct rocker_group_tbl_entry *entry = priv;
+	struct rocker_tlv *cmd_info;
+
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_CMD_TYPE,
+			       ROCKER_TLV_CMD_TYPE_OF_DPA_GROUP_DEL))
+		return -EMSGSIZE;
+	cmd_info = rocker_tlv_nest_start(desc_info, ROCKER_TLV_CMD_INFO);
+	if (!cmd_info)
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u32(desc_info, ROCKER_TLV_OF_DPA_GROUP_ID,
+			       entry->group_id))
+		return -EMSGSIZE;
+	rocker_tlv_nest_end(desc_info, cmd_info);
+
+	return 0;
+}
+
+/************************
+ * Flow and group tables
+ ************************/
+
+static struct rocker_flow_tbl_entry *rocker_flow_tbl_find(
+	struct rocker *rocker, struct rocker_flow_tbl_entry *match)
+{
+	struct rocker_flow_tbl_entry *found;
+
+	hash_for_each_possible(rocker->flow_tbl, found, entry, match->key_crc32)
+		if (memcmp(&found->key, &match->key, sizeof(found->key)) == 0)
+			return found;
+
+	return NULL;
+}
+
+static int rocker_flow_tbl_add(struct rocker_port *rocker_port,
+			       struct rocker_flow_tbl_entry *match)
+{
+	struct rocker *rocker = rocker_port->rocker;
+	struct rocker_flow_tbl_entry *found;
+	unsigned long flags;
+	bool add_to_hw = false;
+	int err = 0;
+
+	match->key_crc32 = crc32(~0, &match->key, sizeof(match->key));
+
+	netdev_info(rocker_port->dev, "flow tbl_add match->key_crc32 0x%08x\n",
+		    match->key_crc32);
+	spin_lock_irqsave(&rocker->flow_tbl_lock, flags);
+
+	found = rocker_flow_tbl_find(rocker, match);
+	netdev_info(rocker_port->dev, "flow tbl_add found %p\n", found);
+
+	if (found) {
+		kfree(match);
+	} else {
+		found = match;
+		found->cookie = rocker->flow_tbl_next_cookie++;
+		hash_add(rocker->flow_tbl, &found->entry, found->key_crc32);
+		add_to_hw = true;
+	}
+
+	found->ref_count++;
+
+	spin_unlock_irqrestore(&rocker->flow_tbl_lock, flags);
+
+	if (add_to_hw) {
+		err = rocker_cmd_exec(rocker, rocker_port,
+				      rocker_cmd_flow_tbl_add,
+				      found, NULL, NULL);
+		netdev_info(rocker_port->dev, "flow tbl_add err %d\n", err);
+		if (err) {
+			spin_lock_irqsave(&rocker->flow_tbl_lock, flags);
+			hash_del(&found->entry);
+			spin_unlock_irqrestore(&rocker->flow_tbl_lock, flags);
+			kfree(found);
+		}
+	}
+
+	return err;
+}
+
+static int rocker_flow_tbl_del(struct rocker_port *rocker_port,
+			       struct rocker_flow_tbl_entry *match)
+{
+	struct rocker *rocker = rocker_port->rocker;
+	struct rocker_flow_tbl_entry *found;
+	unsigned long flags;
+	bool del_from_hw = false;
+	int err = 0;
+
+	match->key_crc32 = crc32(~0, &match->key, sizeof(match->key));
+
+	netdev_info(rocker_port->dev, "flow tbl_del match->key_crc32 0x%08x\n",
+		    match->key_crc32);
+	spin_lock_irqsave(&rocker->flow_tbl_lock, flags);
+
+	found = rocker_flow_tbl_find(rocker, match);
+	netdev_info(rocker_port->dev, "flow tbl_del found %p\n", found);
+
+	if (found) {
+		found->ref_count--;
+		if (found->ref_count == 0) {
+			hash_del(&found->entry);
+			del_from_hw = true;
+		}
+	}
+
+	spin_unlock_irqrestore(&rocker->flow_tbl_lock, flags);
+
+	kfree(match);
+
+	if (del_from_hw) {
+		err = rocker_cmd_exec(rocker, rocker_port,
+				      rocker_cmd_flow_tbl_del,
+				      found, NULL, NULL);
+		netdev_info(rocker_port->dev, "flow tbl_del err %d\n", err);
+		kfree(found);
+	}
+
+	return err;
+}
+
+static int rocker_flow_op_do(struct rocker_port *rocker_port,
+			     enum rocker_op op,
+			     struct rocker_flow_tbl_entry *entry)
+{
+	switch (op) {
+	case ROCKER_OP_ADD:
+		return rocker_flow_tbl_add(rocker_port, entry);
+	case ROCKER_OP_DEL:
+		return rocker_flow_tbl_del(rocker_port, entry);
+	}
+
+	return -EOPNOTSUPP;
+}
+
+static int rocker_flow_tbl_ig_port(struct rocker_port *rocker_port,
+				   enum rocker_op op,
+				   u32 in_lport, u32 in_lport_mask,
+				   enum rocker_of_dpa_table_id goto_tbl)
+{
+	struct rocker_flow_tbl_entry *entry;
+
+	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+	if (!entry)
+		return -ENOMEM;
+
+	entry->key.priority = ROCKER_PRIORITY_IG_PORT;
+	entry->key.tbl_id = ROCKER_OF_DPA_TABLE_ID_INGRESS_PORT;
+	entry->key.ig_port.in_lport = in_lport;
+	entry->key.ig_port.in_lport_mask = in_lport_mask;
+	entry->key.ig_port.goto_tbl = goto_tbl;
+
+	netdev_info(rocker_port->dev, "flow ig_port\n");
+
+	return rocker_flow_op_do(rocker_port, op, entry);
+}
+
+static int rocker_flow_tbl_vlan(struct rocker_port *rocker_port,
+				enum rocker_op op, u32 in_lport,
+				__be16 vlan_id, __be16 vlan_id_mask,
+				enum rocker_of_dpa_table_id goto_tbl,
+				bool untagged, __be16 new_vlan_id)
+{
+	struct rocker_flow_tbl_entry *entry;
+
+	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+	if (!entry)
+		return -ENOMEM;
+
+	entry->key.priority = ROCKER_PRIORITY_VLAN;
+	entry->key.tbl_id = ROCKER_OF_DPA_TABLE_ID_VLAN;
+	entry->key.vlan.in_lport = in_lport;
+	entry->key.vlan.vlan_id = vlan_id;
+	entry->key.vlan.vlan_id_mask = vlan_id_mask;
+	entry->key.vlan.goto_tbl = goto_tbl;
+
+	entry->key.vlan.untagged = untagged;
+	entry->key.vlan.new_vlan_id = new_vlan_id;
+
+	netdev_info(rocker_port->dev, "flow vlan\n");
+
+	return rocker_flow_op_do(rocker_port, op, entry);
+}
+
+static int rocker_flow_tbl_bridge(struct rocker_port *rocker_port,
+				  enum rocker_op op,
+				  const u8 *eth_dst, const u8 *eth_dst_mask,
+				  __be16 vlan_id, u32 tunnel_id,
+				  enum rocker_of_dpa_table_id goto_tbl,
+				  u32 group_id)
+{
+	struct rocker_flow_tbl_entry *entry;
+	u32 priority;
+	bool vlan_bridging = !!vlan_id;
+	bool dflt = !eth_dst || eth_dst_mask;
+	bool wild = false;
+
+	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+	if (!entry)
+		return -ENOMEM;
+
+	entry->key.tbl_id = ROCKER_OF_DPA_TABLE_ID_BRIDGING;
+
+	if (eth_dst) {
+		entry->key.bridge.has_eth_dst = 1;
+		ether_addr_copy(entry->key.bridge.eth_dst, eth_dst);
+	}
+	if (eth_dst_mask) {
+		entry->key.bridge.has_eth_dst_mask = 1;
+		ether_addr_copy(entry->key.bridge.eth_dst_mask, eth_dst_mask);
+		if (memcmp(eth_dst_mask, zero_mac, ETH_ALEN))
+			wild = true;
+	}
+
+	priority = ROCKER_PRIORITY_UNKNOWN;
+	if (vlan_bridging && dflt && wild)
+		priority = ROCKER_PRIORITY_BRIDGING_VLAN_DFLT_WILD;
+	else if (vlan_bridging && dflt && !wild)
+		priority = ROCKER_PRIORITY_BRIDGING_VLAN_DFLT_EXACT;
+	else if (vlan_bridging && !dflt)
+		priority = ROCKER_PRIORITY_BRIDGING_VLAN;
+	else if (!vlan_bridging && dflt && wild)
+		priority = ROCKER_PRIORITY_BRIDGING_TENANT_DFLT_WILD;
+	else if (!vlan_bridging && dflt && !wild)
+		priority = ROCKER_PRIORITY_BRIDGING_TENANT_DFLT_EXACT;
+	else if (!vlan_bridging && !dflt)
+		priority = ROCKER_PRIORITY_BRIDGING_TENANT;
+
+	entry->key.priority = priority;
+	entry->key.bridge.vlan_id = vlan_id;
+	entry->key.bridge.tunnel_id = tunnel_id;
+	entry->key.bridge.goto_tbl = goto_tbl;
+	entry->key.bridge.group_id = group_id;
+
+	netdev_info(rocker_port->dev, "flow bridge\n");
+
+	return rocker_flow_op_do(rocker_port, op, entry);
+}
+
+static int rocker_flow_tbl_acl(struct rocker_port *rocker_port,
+			       enum rocker_op op,
+			       u32 priority, u32 in_lport,
+			       u32 in_lport_mask,
+			       const u8 *eth_src, const u8 *eth_src_mask,
+			       const u8 *eth_dst, const u8 *eth_dst_mask,
+			       __be16 eth_type,
+			       __be16 vlan_id, __be16 vlan_id_mask,
+			       u32 group_id)
+{
+	struct rocker_flow_tbl_entry *entry;
+
+	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+	if (!entry)
+		return -ENOMEM;
+
+	entry->key.priority = priority;
+	entry->key.tbl_id = ROCKER_OF_DPA_TABLE_ID_ACL_POLICY;
+	entry->key.acl.in_lport = in_lport;
+	entry->key.acl.in_lport_mask = in_lport_mask;
+
+	if (eth_src)
+		ether_addr_copy(entry->key.acl.eth_src, eth_src);
+	if (eth_src_mask)
+		ether_addr_copy(entry->key.acl.eth_src_mask, eth_src_mask);
+	if (eth_dst)
+		ether_addr_copy(entry->key.acl.eth_dst, eth_dst);
+	if (eth_dst_mask)
+		ether_addr_copy(entry->key.acl.eth_dst_mask, eth_dst_mask);
+
+	entry->key.acl.eth_type = eth_type;
+	entry->key.acl.vlan_id = vlan_id;
+	entry->key.acl.vlan_id_mask = vlan_id_mask;
+	entry->key.acl.group_id = group_id;
+
+	netdev_info(rocker_port->dev, "flow acl\n");
+
+	return rocker_flow_op_do(rocker_port, op, entry);
+}
+
+static struct rocker_group_tbl_entry *rocker_group_tbl_find(
+	struct rocker *rocker, struct rocker_group_tbl_entry *match)
+{
+	struct rocker_group_tbl_entry *found;
+	u8 type = ROCKER_GROUP_TYPE_GET(match->group_id);
+	u16 index;
+	int bkt;
+
+	switch (type) {
+	case ROCKER_OF_DPA_GROUP_TYPE_L2_INTERFACE:
+		/* search for match by group_id */
+		hash_for_each_possible(rocker->group_tbl, found,
+				       entry, match->group_id)
+			if (found->group_id == match->group_id)
+				return found;
+		break;
+	case ROCKER_OF_DPA_GROUP_TYPE_L2_MCAST:
+		/* search for match by group_ids */
+		hash_for_each(rocker->group_tbl, bkt, found, entry) {
+			if (type != ROCKER_GROUP_TYPE_GET(found->group_id))
+				continue;
+			if (found->group_count != match->group_count)
+				continue;
+			if (memcmp(found->group_ids, match->group_ids,
+				   found->group_count * sizeof(u32)) == 0)
+				return found;
+		}
+		/* no match: create new unique group_id */
+		index = rocker->group_index_next++;
+		match->group_id &= ~ROCKER_GROUP_INDEX_MASK;
+		match->group_id |= ROCKER_GROUP_INDEX_SET(index);
+		break;
+	default:
+		break;
+	}
+
+	return NULL;
+}
+
+static void rocker_group_tbl_entry_free(struct rocker_group_tbl_entry *entry)
+{
+	switch (ROCKER_GROUP_TYPE_GET(entry->group_id)) {
+	case ROCKER_OF_DPA_GROUP_TYPE_L2_MCAST:
+		kfree(entry->group_ids);
+		break;
+	default:
+		break;
+	}
+	kfree(entry);
+}
+
+static int rocker_group_tbl_add(struct rocker_port *rocker_port,
+				struct rocker_group_tbl_entry *match,
+				u32 *group_id)
+{
+	struct rocker *rocker = rocker_port->rocker;
+	struct rocker_group_tbl_entry *found;
+	unsigned long flags;
+	bool add_to_hw = false;
+	int err = 0;
+
+	netdev_info(rocker_port->dev, "group tbl_add\n");
+	spin_lock_irqsave(&rocker->group_tbl_lock, flags);
+
+	found = rocker_group_tbl_find(rocker, match);
+	netdev_info(rocker_port->dev, "group tbl_add found %p\n", found);
+
+	if (found) {
+		rocker_group_tbl_entry_free(match);
+	} else {
+		found = match;
+		hash_add(rocker->group_tbl, &found->entry, found->group_id);
+		add_to_hw = true;
+	}
+
+	found->ref_count++;
+
+	spin_unlock_irqrestore(&rocker->group_tbl_lock, flags);
+
+	*group_id = found->group_id;
+
+	if (add_to_hw) {
+		err = rocker_cmd_exec(rocker, rocker_port,
+				      rocker_cmd_group_tbl_add,
+				      found, NULL, NULL);
+		netdev_info(rocker_port->dev, "group tbl_add err %d\n", err);
+		if (err) {
+			spin_lock_irqsave(&rocker->group_tbl_lock, flags);
+			hash_del(&found->entry);
+			spin_unlock_irqrestore(&rocker->group_tbl_lock, flags);
+			rocker_group_tbl_entry_free(found);
+		}
+	}
+
+	return err;
+}
+
+static int rocker_group_tbl_del(struct rocker_port *rocker_port,
+				struct rocker_group_tbl_entry *match,
+				u32 *group_id)
+{
+	struct rocker *rocker = rocker_port->rocker;
+	struct rocker_group_tbl_entry *found;
+	unsigned long flags;
+	bool del_from_hw = false;
+	int err = 0;
+
+	netdev_info(rocker_port->dev, "group tbl_del\n");
+	spin_lock_irqsave(&rocker->group_tbl_lock, flags);
+
+	found = rocker_group_tbl_find(rocker, match);
+	netdev_info(rocker_port->dev, "group tbl_del found %p\n", found);
+
+	if (found) {
+		*group_id = found->group_id;
+		found->ref_count--;
+		if (found->ref_count == 0) {
+			hash_del(&found->entry);
+			del_from_hw = true;
+		}
+	}
+
+	spin_unlock_irqrestore(&rocker->group_tbl_lock, flags);
+
+	rocker_group_tbl_entry_free(match);
+
+	if (del_from_hw) {
+		err = rocker_cmd_exec(rocker, rocker_port,
+				      rocker_cmd_group_tbl_del,
+				      found, NULL, NULL);
+		netdev_info(rocker_port->dev, "group tbl_del err %d\n", err);
+		rocker_group_tbl_entry_free(found);
+	}
+
+	return err;
+}
+
+static int rocker_group_op_do(struct rocker_port *rocker_port,
+			      enum rocker_op op,
+			      struct rocker_group_tbl_entry *entry,
+			      u32 *group_id)
+{
+	switch (op) {
+	case ROCKER_OP_ADD:
+		return rocker_group_tbl_add(rocker_port, entry, group_id);
+	case ROCKER_OP_DEL:
+		return rocker_group_tbl_del(rocker_port, entry, group_id);
+	}
+
+	return -EOPNOTSUPP;
+}
+
+static int rocker_group_l2_interface(struct rocker_port *rocker_port,
+				     enum rocker_op op, u32 group_id,
+				     int pop_vlan)
+{
+	struct rocker_group_tbl_entry *entry;
+
+	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+	if (!entry)
+		return -ENOMEM;
+
+	entry->group_id = group_id;
+	entry->l2_interface.pop_vlan = pop_vlan;
+
+	netdev_info(rocker_port->dev, "group l2 interface\n");
+
+	return rocker_group_op_do(rocker_port, op, entry, &group_id);
+}
+
+static int rocker_group_l2_mcast(struct rocker_port *rocker_port,
+				 enum rocker_op op, __be16 vlan_id,
+				 u16 group_count, u32 *group_ids,
+				 u32 *group_id)
+{
+	struct rocker_group_tbl_entry *entry;
+
+	*group_id = 0;
+
+	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+	if (!entry)
+		return -ENOMEM;
+
+	entry->group_id = ROCKER_GROUP_L2_MCAST(vlan_id, 0);
+	entry->group_count = group_count;
+	entry->group_ids = group_ids;
+
+	netdev_info(rocker_port->dev, "group l2 mcast\n");
+
+	return rocker_group_op_do(rocker_port, op, entry, group_id);
+}
+
+static int rocker_group_id_compare(const void *a, const void *b)
+{
+	return memcmp(a, b, sizeof(u32));
+}
+
+static bool rocker_port_dev_check(struct net_device *dev);
+
+static u32 *rocker_flow_get_group_ids(const struct sw_flow *flow,
+				      __be16 vlan_id, u16 *count)
+{
+	struct net_device *out_port_dev;
+	u32 *group_ids = NULL;
+	u32 *tmp;
+	u32 out_lport;
+	bool send_up = false;
+	int i;
+
+	*count = 0;
+
+	for (i = 0; i < flow->actions->count; i++) {
+		out_port_dev = flow->actions->actions[i].output_dev;
+		if (rocker_port_dev_check(out_port_dev)) {
+			out_lport =
+				rocker_port_to_lport(netdev_priv(out_port_dev));
+		} else if (!send_up) {
+			send_up = true;
+			out_lport = 0;
+		} else {
+			continue;
+		}
+		tmp = krealloc(group_ids, ++(*count) * sizeof(u32),
+			       GFP_KERNEL);
+		if (!tmp)
+			goto err_out;
+		group_ids = tmp;
+		/* Index by the number of ids collected so far, not by
+		 * the action index, since actions may be skipped.
+		 */
+		group_ids[*count - 1] = ROCKER_GROUP_L2_INTERFACE(vlan_id,
+								  out_lport);
+	}
+
+	sort(group_ids, *count, sizeof(u32), rocker_group_id_compare, NULL);
+
+	return group_ids;
+
+err_out:
+	kfree(group_ids);
+	*count = 0;
+	return NULL;
+}
+
+#define ROCKER_FLOW_WITHIN(flow, field) \
+	((offsetof(struct sw_flow_key, field) >= (flow)->mask->range.start) && \
+	 (offsetof(struct sw_flow_key, field) <= (flow)->mask->range.end))
+
+static int rocker_bridging_vlan_ucast(struct rocker_port *rocker_port,
+				      const struct sw_flow *flow,
+				      enum rocker_op op,
+				      __be16 vlan_id, bool pop_vlan)
+{
+	struct net_device *out_port_dev;
+	u32 out_lport;
+	u32 tunnel_id = 0;
+	u32 group_l2_interface;
+	int err;
+
+	/* L2 interface group for output */
+
+	if (flow->actions->count == 0) {
+		out_lport = 0; /* send it up */
+	} else if (flow->actions->count == 1) {
+		out_port_dev = flow->actions->actions[0].output_dev;
+		if (rocker_port_dev_check(out_port_dev))
+			out_lport = rocker_port_to_lport(
+				netdev_priv(out_port_dev));
+		else
+			out_lport = 0; /* send it up */
+	} else {
+		netdev_err(rocker_port->dev, "Trying to install unicast bridge vlan flow with more than one output device\n");
+		return -EINVAL;
+	}
+
+	group_l2_interface = ROCKER_GROUP_L2_INTERFACE(vlan_id, out_lport);
+	err = rocker_group_l2_interface(rocker_port, op,
+					group_l2_interface, pop_vlan);
+	netdev_info(rocker_port->dev, "ucast bridge l2 interface group err %d\n",
+		    err);
+	if (err)
+		return err;
+
+	/* VLAN unicast bridge table entry */
+
+	err = rocker_flow_tbl_bridge(rocker_port, op,
+				     flow->key.eth.dst, NULL,
+				     vlan_id, tunnel_id,
+				     ROCKER_OF_DPA_TABLE_ID_ACL_POLICY,
+				     group_l2_interface);
+	netdev_info(rocker_port->dev, "ucast bridge err %d\n", err);
+
+	return err;
+}
+
+static int rocker_bridging_vlan_mcast(struct rocker_port *rocker_port,
+				      const struct sw_flow *flow,
+				      enum rocker_op op,
+				      __be16 vlan_id, bool pop_vlan)
+{
+	u32 tunnel_id = 0;
+	u32 group_l2_mcast;
+	u16 group_count;
+	u32 *group_ids;
+	int err;
+	int i;
+
+	/* Get sorted list of output L2 interface group ids;
+	 * if there are none, there is nothing to forward in HW,
+	 * so we're done.
+	 */
+
+	group_ids = rocker_flow_get_group_ids(flow, vlan_id,
+					      &group_count);
+	if (!group_ids)
+		return 0;
+
+	/* L2 interface groups for each out_lport */
+
+	for (i = 0; i < group_count; i++) {
+		err = rocker_group_l2_interface(rocker_port, op,
+						group_ids[i], pop_vlan);
+		netdev_info(rocker_port->dev, "mcast bridge l2 interface group err %d\n",
+			    err);
+		if (err)
+			goto err_free_group_ids;
+	}
+
+	/* L2 multicast group entry */
+
+	err = rocker_group_l2_mcast(rocker_port, op,
+				    vlan_id, group_count,
+				    group_ids, &group_l2_mcast);
+	netdev_info(rocker_port->dev, "group l2 mcast group_id 0x%08x err %d\n",
+		    group_l2_mcast, err);
+	if (err)
+		goto err_free_group_ids;
+
+	/* VLAN multicast bridge table entry */
+
+	err = rocker_flow_tbl_bridge(rocker_port, op,
+				     flow->key.eth.dst, NULL,
+				     vlan_id, tunnel_id,
+				     ROCKER_OF_DPA_TABLE_ID_ACL_POLICY,
+				     group_l2_mcast);
+	netdev_info(rocker_port->dev, "mcast bridge err %d\n", err);
+
+	return err;
+
+err_free_group_ids:
+	kfree(group_ids);
+	return err;
+}
+
+static int rocker_flow_parse(struct rocker_port *rocker_port,
+			     const struct sw_flow *flow,
+			     enum rocker_op op)
+{
+	struct net_device *in_port_dev;
+	u32 in_lport;
+	u32 in_lport_mask;
+	__be16 vlan_id;
+	__be16 vlan_id_mask;
+	__be16 new_vlan_id;
+	__be16 outer_vlan_id;
+	u16 bridge_id;
+	u32 tunnel_id;
+	bool untagged;
+	bool unicast;
+	bool eth_dst_exact;
+	int err;
+
+	enum {
+		BRIDGING_MODE_UNKNOWN,
+		BRIDGING_MODE_VLAN_UCAST,
+		BRIDGING_MODE_VLAN_MCAST,
+		BRIDGING_MODE_VLAN_DFLT,
+		BRIDGING_MODE_TUNNEL_UCAST,
+		BRIDGING_MODE_TUNNEL_MCAST,
+		BRIDGING_MODE_TUNNEL_DFLT,
+	} bridging_mode = BRIDGING_MODE_UNKNOWN;
+
+	tunnel_id = 0; /* XXX for now */
+
+	/* A note about value masks: sw_flow uses mask bit value of
+	 * 0 for "don't care", whereas OF-DPA HW uses mask bit value
+	 * of 1 for "don't care", so the sw_flow mask value must be
+	 * inverted before passing to OF-DPA HW.  To summarize:
+	 *
+	 *      mask bit   sw_flow         OF-DPA
+	 *      -------------------------------------
+	 *      0          don't care      care
+	 *      1          care            don't care
+	 */
+
+	/* Get lport for in_port.  Skip sw_flows if in_port is not a
+	 * rocker port in our network namespace.
+	 */
+
+	if (!ROCKER_FLOW_WITHIN(flow, phy.in_port))
+		return 0;
+
+	in_port_dev = dev_get_by_index(dev_net(rocker_port->dev),
+				       flow->key.misc.in_port_ifindex);
+	if (!in_port_dev)
+		return 0;
+	if (!rocker_port_dev_check(in_port_dev)) {
+		dev_put(in_port_dev);
+		return 0;
+	}
+
+	in_lport = rocker_port_to_lport(netdev_priv(in_port_dev));
+	in_lport_mask = 0;
+	dev_put(in_port_dev);
+
+	/* Determine outer VLAN ID.  If untagged, use bridge VLAN ID,
+	 * otherwise use tagged VLAN ID for outer VLAN ID.
+	 */
+
+	if (!ROCKER_FLOW_WITHIN(flow, eth.tci))
+		return 0;
+
+	if (flow->key.eth.tci == htons(0) &&
+	    flow->mask->key.eth.tci == htons(0xffff)) {
+		vlan_id = flow->key.eth.tci;
+		vlan_id_mask = htons(0x0fff);
+		untagged = true;
+	} else {
+		/* XXX For now, fail any vlan except untagged vlan 0 */
+		netdev_warn(rocker_port->dev,
+			    "Can't parse vlan info, vlan 0x%04x mask 0x%04x\n",
+			    ntohs(flow->key.eth.tci),
+			    ntohs(flow->mask->key.eth.tci));
+		return 0;
+	}
+
+	bridge_id = 0; /* XXX for now, need unique ID for each bridge */
+	new_vlan_id = htons(bridge_id << 8 | in_lport);
+	outer_vlan_id = untagged ? new_vlan_id : vlan_id;
+
+	/* Ingress port table entry */
+
+	err = rocker_flow_tbl_ig_port(rocker_port, op,
+				      in_lport, in_lport_mask,
+				      ROCKER_OF_DPA_TABLE_ID_VLAN);
+	netdev_info(rocker_port->dev, "flow parse ig port err %d\n", err);
+	if (err)
+		return err;
+
+	/* VLAN table entry */
+
+	err = rocker_flow_tbl_vlan(rocker_port, op,
+				   in_lport, vlan_id, vlan_id_mask,
+				   ROCKER_OF_DPA_TABLE_ID_TERMINATION_MAC,
+				   untagged, new_vlan_id);
+	if (err)
+		return err;
+
+	/* XXX Determine if sw_flow wants L2 bridging or L3 routing.
+	 * XXX If wanting L3 routing, need to add termination mac
+	 * XXX table entry to catch L3 routing prefixes.
+	 * XXX For now, just doing L2 bridging, so skip term mac tbl
+	 * XXX (miss on term mac tbl goes to bridge tbl).
+	 */
+
+	unicast = !is_multicast_ether_addr(flow->key.eth.dst);
+	eth_dst_exact = ether_addr_equal(flow->mask->key.eth.dst, ff_mac);
+
+	if (outer_vlan_id && unicast && eth_dst_exact)
+		bridging_mode = BRIDGING_MODE_VLAN_UCAST;
+	else if (outer_vlan_id && !unicast && eth_dst_exact)
+		bridging_mode = BRIDGING_MODE_VLAN_MCAST;
+
+	switch (bridging_mode) {
+	case BRIDGING_MODE_VLAN_UCAST:
+		err = rocker_bridging_vlan_ucast(rocker_port, flow, op,
+						 outer_vlan_id, untagged);
+		break;
+	case BRIDGING_MODE_VLAN_MCAST:
+		err = rocker_bridging_vlan_mcast(rocker_port, flow, op,
+						 outer_vlan_id, untagged);
+		break;
+	default:
+		netdev_info(rocker_port->dev, "Unknown bridging mode\n");
+		err = -EOPNOTSUPP;
+		break;
+	}
+
+	if (err)
+		return err;
+
+	/* ACL table entry */
+
+	err = rocker_flow_tbl_acl(rocker_port, op,
+				  ROCKER_PRIORITY_ACL,
+				  in_lport, in_lport_mask,
+				  flow->key.eth.src, zero_mac,
+				  flow->key.eth.dst, zero_mac,
+				  flow->key.eth.type,
+				  outer_vlan_id, vlan_id_mask,
+				  ROCKER_GROUP_NONE);
+	netdev_info(rocker_port->dev, "mcast bridge acl err %d\n", err);
+
+	return err;
+}
+
+static int rocker_flow_add(struct rocker_port *rocker_port,
+			   const struct sw_flow *flow)
+{
+	return rocker_flow_parse(rocker_port, flow, ROCKER_OP_ADD);
+}
+
+static int rocker_flow_del(struct rocker_port *rocker_port,
+			   const struct sw_flow *flow)
+{
+	return rocker_flow_parse(rocker_port, flow, ROCKER_OP_DEL);
+}
+
+/*****************
+ * Net device ops
+ *****************/
+
+static int rocker_port_open(struct net_device *dev)
+{
+	struct rocker_port *rocker_port = netdev_priv(dev);
+	int err;
+
+	err = rocker_port_dma_rings_init(rocker_port);
+	if (err)
+		return err;
+
+	err = request_irq(rocker_msix_tx_vector(rocker_port),
+			  rocker_tx_irq_handler, 0,
+			  rocker_driver_name, rocker_port);
+	if (err) {
+		netdev_err(rocker_port->dev, "cannot assign tx irq\n");
+		goto err_request_tx_irq;
+	}
+
+	err = request_irq(rocker_msix_rx_vector(rocker_port),
+			  rocker_rx_irq_handler, 0,
+			  rocker_driver_name, rocker_port);
+	if (err) {
+		netdev_err(rocker_port->dev, "cannot assign rx irq\n");
+		goto err_request_rx_irq;
+	}
+
+	napi_enable(&rocker_port->napi_tx);
+	napi_enable(&rocker_port->napi_rx);
+	rocker_port_set_enable(rocker_port, true);
+	netif_start_queue(dev);
+	return 0;
+
+err_request_rx_irq:
+	free_irq(rocker_msix_tx_vector(rocker_port), rocker_port);
+err_request_tx_irq:
+	rocker_port_dma_rings_fini(rocker_port);
+	return err;
+}
+
+static int rocker_port_stop(struct net_device *dev)
+{
+	struct rocker_port *rocker_port = netdev_priv(dev);
+
+	netif_stop_queue(dev);
+	rocker_port_set_enable(rocker_port, false);
+	napi_disable(&rocker_port->napi_rx);
+	napi_disable(&rocker_port->napi_tx);
+	free_irq(rocker_msix_rx_vector(rocker_port), rocker_port);
+	free_irq(rocker_msix_tx_vector(rocker_port), rocker_port);
+	rocker_port_dma_rings_fini(rocker_port);
+
+	return 0;
+}
+
+static void rocker_tx_desc_frags_unmap(struct rocker_port *rocker_port,
+				       struct rocker_desc_info *desc_info)
+{
+	struct rocker *rocker = rocker_port->rocker;
+	struct pci_dev *pdev = rocker->pdev;
+	struct rocker_tlv *attrs[ROCKER_TLV_TX_MAX + 1];
+	struct rocker_tlv *attr;
+	int rem;
+
+	rocker_tlv_parse_desc(attrs, ROCKER_TLV_TX_MAX, desc_info);
+	if (!attrs[ROCKER_TLV_TX_FRAGS])
+		return;
+	rocker_tlv_for_each_nested(attr, attrs[ROCKER_TLV_TX_FRAGS], rem) {
+		struct rocker_tlv *frag_attrs[ROCKER_TLV_TX_FRAG_ATTR_MAX + 1];
+		dma_addr_t dma_handle;
+		size_t len;
+
+		if (rocker_tlv_type(attr) != ROCKER_TLV_TX_FRAG)
+			continue;
+		rocker_tlv_parse_nested(frag_attrs, ROCKER_TLV_TX_FRAG_ATTR_MAX,
+					attr);
+		if (!frag_attrs[ROCKER_TLV_TX_FRAG_ATTR_ADDR] ||
+		    !frag_attrs[ROCKER_TLV_TX_FRAG_ATTR_LEN])
+			continue;
+		dma_handle = rocker_tlv_get_u64(frag_attrs[ROCKER_TLV_TX_FRAG_ATTR_ADDR]);
+		len = rocker_tlv_get_u16(frag_attrs[ROCKER_TLV_TX_FRAG_ATTR_LEN]);
+		pci_unmap_single(pdev, dma_handle, len, DMA_TO_DEVICE);
+	}
+}
+
+static int rocker_tx_desc_frag_map_put(struct rocker_port *rocker_port,
+				       struct rocker_desc_info *desc_info,
+				       char *buf, size_t buf_len)
+{
+	struct rocker *rocker = rocker_port->rocker;
+	struct pci_dev *pdev = rocker->pdev;
+	dma_addr_t dma_handle;
+	struct rocker_tlv *frag;
+
+	dma_handle = pci_map_single(pdev, buf, buf_len, DMA_TO_DEVICE);
+	if (unlikely(pci_dma_mapping_error(pdev, dma_handle))) {
+		if (net_ratelimit())
+			netdev_err(rocker_port->dev, "failed to dma map tx frag\n");
+		return -EIO;
+	}
+	frag = rocker_tlv_nest_start(desc_info, ROCKER_TLV_TX_FRAG);
+	if (!frag)
+		goto unmap_frag;
+	if (rocker_tlv_put_u64(desc_info, ROCKER_TLV_TX_FRAG_ATTR_ADDR,
+			       dma_handle))
+		goto nest_cancel;
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_TX_FRAG_ATTR_LEN,
+			       buf_len))
+		goto nest_cancel;
+	rocker_tlv_nest_end(desc_info, frag);
+	return 0;
+
+nest_cancel:
+	rocker_tlv_nest_cancel(desc_info, frag);
+unmap_frag:
+	pci_unmap_single(pdev, dma_handle, buf_len, DMA_TO_DEVICE);
+	return -EMSGSIZE;
+}
+
+static netdev_tx_t rocker_port_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct rocker_port *rocker_port = netdev_priv(dev);
+	struct rocker *rocker = rocker_port->rocker;
+	struct rocker_desc_info *desc_info;
+	struct rocker_tlv *frags;
+	int i;
+	int err;
+
+	desc_info = rocker_desc_head_get(&rocker_port->tx_ring);
+	if (unlikely(!desc_info)) {
+		if (net_ratelimit())
+			netdev_err(dev, "tx ring full when queue awake\n");
+		return NETDEV_TX_BUSY;
+	}
+
+	rocker_desc_cookie_ptr_set(desc_info, skb);
+
+	frags = rocker_tlv_nest_start(desc_info, ROCKER_TLV_TX_FRAGS);
+	if (!frags)
+		goto out;
+	if (skb_shinfo(skb)->nr_frags > ROCKER_TX_FRAGS_MAX)
+		goto nest_cancel;
+	err = rocker_tx_desc_frag_map_put(rocker_port, desc_info,
+					  skb->data, skb_headlen(skb));
+	if (err)
+		goto nest_cancel;
+
+	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+		const skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+
+		err = rocker_tx_desc_frag_map_put(rocker_port, desc_info,
+						  skb_frag_address(frag),
+						  skb_frag_size(frag));
+		if (err)
+			goto unmap_frags;
+	}
+	rocker_tlv_nest_end(desc_info, frags);
+
+	rocker_desc_gen_clear(desc_info);
+	rocker_desc_head_set(rocker, &rocker_port->tx_ring, desc_info);
+
+	desc_info = rocker_desc_head_get(&rocker_port->tx_ring);
+	if (!desc_info)
+		netif_stop_queue(dev);
+
+	return NETDEV_TX_OK;
+
+unmap_frags:
+	rocker_tx_desc_frags_unmap(rocker_port, desc_info);
+nest_cancel:
+	rocker_tlv_nest_cancel(desc_info, frags);
+out:
+	dev_kfree_skb(skb);
+	return NETDEV_TX_OK;
+}
+
+static struct rocker_promisc_acl {
+	__be16 eth_type;
+	const u8 *eth_src;
+	const u8 *eth_src_mask;
+	const u8 *eth_dst;
+	const u8 *eth_dst_mask;
+} rocker_promisc_acls[] = {
+	{
+		/* allow any ARP pkts */
+		.eth_type = htons(ETH_P_ARP),
+		.eth_src = zero_mac,
+		.eth_src_mask = ff_mac,
+		.eth_dst = zero_mac,
+		.eth_dst_mask = ff_mac,
+	},
+	{
+		/* allow any IP pkts */
+		.eth_type = htons(ETH_P_IP),
+		.eth_src = zero_mac,
+		.eth_src_mask = ff_mac,
+		.eth_dst = zero_mac,
+		.eth_dst_mask = ff_mac,
+	},
+	{
+		/* allow LLDP pkts (ethertype 0x88cc) */
+		.eth_type = htons(0x88cc),
+		.eth_src = zero_mac,
+		.eth_src_mask = ff_mac,
+		.eth_dst = lldp_mac,
+		.eth_dst_mask = zero_mac,
+	},
+	{
+		/* mark end of list */
+		.eth_type = 0,
+	},
+};
+
+static int rocker_port_set_promisc(struct rocker_port *rocker_port,
+				   enum rocker_op op)
+{
+	u32 in_lport = rocker_port_to_lport(rocker_port);
+	u32 in_lport_mask = 0;
+	u32 out_lport;
+	u16 bridge_id;
+	__be16 vlan_id;
+	__be16 vlan_id_mask;
+	__be16 new_vlan_id;
+	struct rocker_promisc_acl *acl;
+	u32 group_l2_interface;
+	bool untagged;
+	bool pop_vlan;
+	int err;
+
+	/* ingress port table entry */
+
+	err = rocker_flow_tbl_ig_port(rocker_port, op,
+				      in_lport, in_lport_mask,
+				      ROCKER_OF_DPA_TABLE_ID_VLAN);
+	if (err)
+		return err;
+
+	/* VLAN table entry for untagged traffic */
+
+	vlan_id = 0;
+	vlan_id_mask = htons(0x0fff);
+	untagged = true;
+	bridge_id = 0; /* XXX for now, need a unique ID for each bridge */
+	new_vlan_id = htons(bridge_id << 8 | in_lport);
+
+	err = rocker_flow_tbl_vlan(rocker_port, op,
+				   in_lport, vlan_id, vlan_id_mask,
+				   ROCKER_OF_DPA_TABLE_ID_TERMINATION_MAC,
+				   untagged, new_vlan_id);
+	if (err)
+		return err;
+
+	/* L2 interface group entry for bridge (port 0) */
+
+	out_lport = 0;
+	pop_vlan = untagged;
+
+	group_l2_interface = ROCKER_GROUP_L2_INTERFACE(new_vlan_id, out_lport);
+	err = rocker_group_l2_interface(rocker_port, op, group_l2_interface,
+					pop_vlan);
+	if (err)
+		return err;
+
+	/* ACL table entries for acceptable pkts */
+
+	for (acl = rocker_promisc_acls; acl->eth_type; acl++) {
+		err = rocker_flow_tbl_acl(rocker_port, op,
+					  ROCKER_PRIORITY_ACL_PORT_PROMISC,
+					  in_lport, in_lport_mask,
+					  acl->eth_src, acl->eth_src_mask,
+					  acl->eth_dst, acl->eth_dst_mask,
+					  acl->eth_type,
+					  new_vlan_id, vlan_id_mask,
+					  group_l2_interface);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
+static void rocker_port_set_rx_mode(struct net_device *dev)
+{
+	struct rocker_port *rocker_port = netdev_priv(dev);
+	int prev_promisc = (rocker_port->prev_flags & IFF_PROMISC) ? 1 : 0;
+	int promisc = (dev->flags & IFF_PROMISC) ? 1 : 0;
+	enum rocker_op op = promisc ? ROCKER_OP_ADD : ROCKER_OP_DEL;
+
+	if (promisc != prev_promisc)
+		rocker_port_set_promisc(rocker_port, op);
+
+	rocker_port->prev_flags = dev->flags;
+}
+
+static int rocker_port_set_mac_address(struct net_device *dev, void *p)
+{
+	struct sockaddr *addr = p;
+	struct rocker_port *rocker_port = netdev_priv(dev);
+	int err;
+
+	if (!is_valid_ether_addr(addr->sa_data))
+		return -EADDRNOTAVAIL;
+
+	err = rocker_cmd_set_port_settings_macaddr(rocker_port, addr->sa_data);
+	if (err)
+		return err;
+	memcpy(dev->dev_addr, addr->sa_data, dev->addr_len);
+	return 0;
+}
+
+static int rocker_port_swdev_get_id(struct net_device *dev,
+				    struct netdev_phys_item_id *psid)
+{
+	struct rocker_port *rocker_port = netdev_priv(dev);
+	struct rocker *rocker = rocker_port->rocker;
+
+	psid->id_len = sizeof(rocker->hw.id);
+	memcpy(&psid->id, &rocker->hw.id, psid->id_len);
+	return 0;
+}
+
+static int rocker_port_swdev_flow_insert(struct net_device *dev,
+					 const struct sw_flow *flow)
+{
+	struct rocker_port *rocker_port = netdev_priv(dev);
+
+	return rocker_flow_add(rocker_port, flow);
+}
+
+static int rocker_port_swdev_flow_remove(struct net_device *dev,
+					 const struct sw_flow *flow)
+{
+	struct rocker_port *rocker_port = netdev_priv(dev);
+
+	return rocker_flow_del(rocker_port, flow);
+}
+
+static const struct net_device_ops rocker_port_netdev_ops = {
+	.ndo_open		= rocker_port_open,
+	.ndo_stop		= rocker_port_stop,
+	.ndo_start_xmit		= rocker_port_xmit,
+	.ndo_set_rx_mode	= rocker_port_set_rx_mode,
+	.ndo_set_mac_address	= rocker_port_set_mac_address,
+	.ndo_swdev_get_id	= rocker_port_swdev_get_id,
+	.ndo_swdev_flow_insert	= rocker_port_swdev_flow_insert,
+	.ndo_swdev_flow_remove	= rocker_port_swdev_flow_remove,
+};
+
+static bool rocker_port_dev_check(struct net_device *dev)
+{
+	return dev->netdev_ops == &rocker_port_netdev_ops;
+}
+
+/********************
+ * ethtool interface
+ ********************/
+
+static int rocker_port_get_settings(struct net_device *dev,
+				    struct ethtool_cmd *ecmd)
+{
+	struct rocker_port *rocker_port = netdev_priv(dev);
+
+	return rocker_cmd_get_port_settings_ethtool(rocker_port, ecmd);
+}
+
+static int rocker_port_set_settings(struct net_device *dev,
+				    struct ethtool_cmd *ecmd)
+{
+	struct rocker_port *rocker_port = netdev_priv(dev);
+
+	return rocker_cmd_set_port_settings_ethtool(rocker_port, ecmd);
+}
+
+static void rocker_port_get_drvinfo(struct net_device *dev,
+				    struct ethtool_drvinfo *drvinfo)
+{
+	strlcpy(drvinfo->driver, rocker_driver_name, sizeof(drvinfo->driver));
+	strlcpy(drvinfo->version, UTS_RELEASE, sizeof(drvinfo->version));
+}
+
+static const struct ethtool_ops rocker_port_ethtool_ops = {
+	.get_settings		= rocker_port_get_settings,
+	.set_settings		= rocker_port_set_settings,
+	.get_drvinfo		= rocker_port_get_drvinfo,
+	.get_link		= ethtool_op_get_link,
+};
+
+/*****************
+ * NAPI interface
+ *****************/
+
+static struct rocker_port *rocker_port_napi_tx_get(struct napi_struct *napi)
+{
+	return container_of(napi, struct rocker_port, napi_tx);
+}
+
+static int rocker_port_poll_tx(struct napi_struct *napi, int budget)
+{
+	struct rocker_port *rocker_port = rocker_port_napi_tx_get(napi);
+	struct rocker *rocker = rocker_port->rocker;
+	struct rocker_desc_info *desc_info;
+	u32 credits = 0;
+	int err;
+
+	/* Cleanup tx descriptors */
+	while ((desc_info = rocker_desc_tail_get(&rocker_port->tx_ring))) {
+		err = rocker_desc_err(desc_info);
+		if (err && net_ratelimit())
+			netdev_err(rocker_port->dev, "tx desc received with err %d\n",
+				   err);
+		rocker_tx_desc_frags_unmap(rocker_port, desc_info);
+		dev_kfree_skb_any(rocker_desc_cookie_ptr_get(desc_info));
+		credits++;
+	}
+
+	if (credits && netif_queue_stopped(rocker_port->dev))
+		netif_wake_queue(rocker_port->dev);
+
+	napi_complete(napi);
+	rocker_dma_ring_credits_set(rocker, &rocker_port->tx_ring, credits);
+
+	return 0;
+}
+
+static int rocker_port_rx_proc(struct rocker *rocker,
+			       struct rocker_port *rocker_port,
+			       struct rocker_desc_info *desc_info)
+{
+	struct rocker_tlv *attrs[ROCKER_TLV_RX_MAX + 1];
+	struct sk_buff *skb = rocker_desc_cookie_ptr_get(desc_info);
+	size_t rx_len;
+
+	if (!skb)
+		return -ENOENT;
+
+	rocker_tlv_parse_desc(attrs, ROCKER_TLV_RX_MAX, desc_info);
+	if (!attrs[ROCKER_TLV_RX_FRAG_LEN])
+		return -EINVAL;
+
+	rocker_dma_rx_ring_skb_unmap(rocker, attrs);
+
+	rx_len = rocker_tlv_get_u16(attrs[ROCKER_TLV_RX_FRAG_LEN]);
+	skb_put(skb, rx_len);
+	skb->protocol = eth_type_trans(skb, rocker_port->dev);
+	netif_receive_skb(skb);
+
+	return rocker_dma_rx_ring_skb_alloc(rocker, rocker_port, desc_info);
+}
+
+static struct rocker_port *rocker_port_napi_rx_get(struct napi_struct *napi)
+{
+	return container_of(napi, struct rocker_port, napi_rx);
+}
+
+static int rocker_port_poll_rx(struct napi_struct *napi, int budget)
+{
+	struct rocker_port *rocker_port = rocker_port_napi_rx_get(napi);
+	struct rocker *rocker = rocker_port->rocker;
+	struct rocker_desc_info *desc_info;
+	u32 credits = 0;
+	int err;
+
+	/* Process rx descriptors */
+	while (credits < budget &&
+	       (desc_info = rocker_desc_tail_get(&rocker_port->rx_ring))) {
+		err = rocker_desc_err(desc_info);
+		if (err) {
+			if (net_ratelimit())
+				netdev_err(rocker_port->dev, "rx desc received with err %d\n",
+					   err);
+		} else {
+			err = rocker_port_rx_proc(rocker, rocker_port,
+						  desc_info);
+			if (err && net_ratelimit())
+				netdev_err(rocker_port->dev, "rx processing failed with err %d\n",
+					   err);
+		}
+		rocker_desc_gen_clear(desc_info);
+		rocker_desc_head_set(rocker, &rocker_port->rx_ring, desc_info);
+		credits++;
+	}
+
+	if (credits < budget)
+		napi_complete(napi);
+
+	rocker_dma_ring_credits_set(rocker, &rocker_port->rx_ring, credits);
+
+	return credits;
+}
+
+/*****************
+ * PCI driver ops
+ *****************/
+
+static void rocker_carrier_init(struct rocker_port *rocker_port)
+{
+	struct rocker *rocker = rocker_port->rocker;
+	u64 link_status = rocker_read64(rocker, PORT_PHYS_LINK_STATUS);
+	bool link_up;
+
+	link_up = link_status & (1ULL << rocker_port_to_lport(rocker_port));
+	if (link_up)
+		netif_carrier_on(rocker_port->dev);
+	else
+		netif_carrier_off(rocker_port->dev);
+}
+
+static void rocker_remove_ports(struct rocker *rocker)
+{
+	int i;
+
+	for (i = 0; i < rocker->port_count; i++)
+		unregister_netdev(rocker->ports[i]->dev);
+	kfree(rocker->ports);
+}
+
+static void rocker_port_dev_addr_init(struct rocker *rocker,
+				      struct rocker_port *rocker_port)
+{
+	struct pci_dev *pdev = rocker->pdev;
+	int err;
+
+	err = rocker_cmd_get_port_settings_macaddr(rocker_port,
+						   rocker_port->dev->dev_addr);
+	if (err) {
+		dev_warn(&pdev->dev, "failed to get mac address, using random\n");
+		eth_hw_addr_random(rocker_port->dev);
+	}
+}
+
+static int rocker_probe_port(struct rocker *rocker, unsigned int port_number)
+{
+	struct pci_dev *pdev = rocker->pdev;
+	struct rocker_port *rocker_port;
+	struct net_device *dev;
+	int err;
+
+	dev = alloc_etherdev(sizeof(struct rocker_port));
+	if (!dev)
+		return -ENOMEM;
+	rocker_port = netdev_priv(dev);
+	rocker_port->dev = dev;
+	rocker_port->rocker = rocker;
+	rocker_port->port_number = port_number;
+
+	rocker_port_dev_addr_init(rocker, rocker_port);
+	dev->netdev_ops = &rocker_port_netdev_ops;
+	dev->ethtool_ops = &rocker_port_ethtool_ops;
+	netif_napi_add(dev, &rocker_port->napi_tx, rocker_port_poll_tx,
+		       NAPI_POLL_WEIGHT);
+	netif_napi_add(dev, &rocker_port->napi_rx, rocker_port_poll_rx,
+		       NAPI_POLL_WEIGHT);
+	rocker_carrier_init(rocker_port);
+
+	err = register_netdev(dev);
+	if (err) {
+		dev_err(&pdev->dev, "register_netdev failed\n");
+		goto free_netdev;
+	}
+	rocker->ports[port_number] = rocker_port;
+	return 0;
+
+free_netdev:
+	free_netdev(dev);
+	return err;
+}
+
+static int rocker_probe_ports(struct rocker *rocker)
+{
+	int i;
+	size_t alloc_size;
+	int err;
+
+	alloc_size = sizeof(struct rocker_port *) * rocker->port_count;
+	rocker->ports = kmalloc(alloc_size, GFP_KERNEL);
+	if (!rocker->ports)
+		return -ENOMEM;
+	for (i = 0; i < rocker->port_count; i++) {
+		err = rocker_probe_port(rocker, i);
+		if (err)
+			goto remove_ports;
+	}
+	return 0;
+
+remove_ports:
+	rocker_remove_ports(rocker);
+	return err;
+}
+
+static int rocker_msix_init(struct rocker *rocker)
+{
+	struct pci_dev *pdev = rocker->pdev;
+	int msix_entries;
+	int i;
+	int err;
+
+	msix_entries = pci_msix_vec_count(pdev);
+	if (msix_entries < 0)
+		return msix_entries;
+
+	if (msix_entries != ROCKER_MSIX_VEC_COUNT(rocker->port_count))
+		return -EINVAL;
+
+	rocker->msix_entries = kmalloc_array(msix_entries,
+					     sizeof(struct msix_entry),
+					     GFP_KERNEL);
+	if (!rocker->msix_entries)
+		return -ENOMEM;
+
+	for (i = 0; i < msix_entries; i++)
+		rocker->msix_entries[i].entry = i;
+
+	err = pci_enable_msix_exact(pdev, rocker->msix_entries, msix_entries);
+	if (err < 0)
+		goto err_enable_msix;
+
+	return 0;
+
+err_enable_msix:
+	kfree(rocker->msix_entries);
+	return err;
+}
+
+static void rocker_msix_fini(struct rocker *rocker)
+{
+	pci_disable_msix(rocker->pdev);
+	kfree(rocker->msix_entries);
+}
+
+static int rocker_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+{
+	struct rocker *rocker;
+	int err;
+
+	rocker = kzalloc(sizeof(*rocker), GFP_KERNEL);
+	if (!rocker)
+		return -ENOMEM;
+
+	err = pci_enable_device(pdev);
+	if (err) {
+		dev_err(&pdev->dev, "pci_enable_device failed\n");
+		goto err_pci_enable_device;
+	}
+
+	err = pci_request_regions(pdev, rocker_driver_name);
+	if (err) {
+		dev_err(&pdev->dev, "pci_request_regions failed\n");
+		goto err_pci_request_regions;
+	}
+
+	err = pci_set_dma_mask(pdev, DMA_BIT_MASK(64));
+	if (!err) {
+		err = pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(64));
+		if (err) {
+			dev_err(&pdev->dev, "pci_set_consistent_dma_mask failed\n");
+			goto err_pci_set_dma_mask;
+		}
+	} else {
+		err = pci_set_dma_mask(pdev, DMA_BIT_MASK(32));
+		if (err) {
+			dev_err(&pdev->dev, "pci_set_dma_mask failed\n");
+			goto err_pci_set_dma_mask;
+		}
+	}
+
+	if (pci_resource_len(pdev, 0) < ROCKER_PCI_BAR0_SIZE) {
+		dev_err(&pdev->dev, "invalid PCI region size\n");
+		err = -EINVAL;
+		goto err_pci_resource_len_check;
+	}
+
+	rocker->hw_addr = ioremap(pci_resource_start(pdev, 0),
+				  pci_resource_len(pdev, 0));
+	if (!rocker->hw_addr) {
+		dev_err(&pdev->dev, "ioremap failed\n");
+		err = -EIO;
+		goto err_ioremap;
+	}
+	pci_set_master(pdev);
+
+	rocker->pdev = pdev;
+	pci_set_drvdata(pdev, rocker);
+
+	rocker->port_count = rocker_read32(rocker, PORT_PHYS_COUNT);
+
+	err = rocker_msix_init(rocker);
+	if (err) {
+		dev_err(&pdev->dev, "MSI-X init failed\n");
+		goto err_msix_init;
+	}
+
+	err = rocker_basic_hw_test(rocker);
+	if (err) {
+		dev_err(&pdev->dev, "basic hw test failed\n");
+		goto err_basic_hw_test;
+	}
+
+	rocker_write32(rocker, CONTROL, ROCKER_CONTROL_RESET);
+
+	err = rocker_dma_rings_init(rocker);
+	if (err)
+		goto err_dma_rings_init;
+
+	err = request_irq(rocker_msix_vector(rocker, ROCKER_MSIX_VEC_CMD),
+			  rocker_cmd_irq_handler, 0,
+			  rocker_driver_name, rocker);
+	if (err) {
+		dev_err(&pdev->dev, "cannot assign cmd irq\n");
+		goto err_request_cmd_irq;
+	}
+
+	err = request_irq(rocker_msix_vector(rocker, ROCKER_MSIX_VEC_EVENT),
+			  rocker_event_irq_handler, 0,
+			  rocker_driver_name, rocker);
+	if (err) {
+		dev_err(&pdev->dev, "cannot assign event irq\n");
+		goto err_request_event_irq;
+	}
+
+	rocker->hw.id = rocker_read64(rocker, SWITCH_ID);
+
+	err = rocker_probe_ports(rocker);
+	if (err) {
+		dev_err(&pdev->dev, "failed to probe ports\n");
+		goto err_probe_ports;
+	}
+
+	hash_init(rocker->flow_tbl);
+	spin_lock_init(&rocker->flow_tbl_lock);
+
+	hash_init(rocker->group_tbl);
+	spin_lock_init(&rocker->group_tbl_lock);
+
+	dev_info(&pdev->dev, "Rocker switch with id %016llx\n", rocker->hw.id);
+
+	return 0;
+
+err_probe_ports:
+	free_irq(rocker_msix_vector(rocker, ROCKER_MSIX_VEC_EVENT), rocker);
+err_request_event_irq:
+	free_irq(rocker_msix_vector(rocker, ROCKER_MSIX_VEC_CMD), rocker);
+err_request_cmd_irq:
+	rocker_dma_rings_fini(rocker);
+err_dma_rings_init:
+err_basic_hw_test:
+	rocker_msix_fini(rocker);
+err_msix_init:
+	iounmap(rocker->hw_addr);
+err_ioremap:
+err_pci_resource_len_check:
+err_pci_set_dma_mask:
+	pci_release_regions(pdev);
+err_pci_request_regions:
+	pci_disable_device(pdev);
+err_pci_enable_device:
+	kfree(rocker);
+	return err;
+}
+
+static void rocker_remove(struct pci_dev *pdev)
+{
+	struct rocker *rocker = pci_get_drvdata(pdev);
+
+	rocker_remove_ports(rocker);
+	free_irq(rocker_msix_vector(rocker, ROCKER_MSIX_VEC_EVENT), rocker);
+	free_irq(rocker_msix_vector(rocker, ROCKER_MSIX_VEC_CMD), rocker);
+	rocker_dma_rings_fini(rocker);
+	rocker_msix_fini(rocker);
+	iounmap(rocker->hw_addr);
+	pci_release_regions(rocker->pdev);
+	pci_disable_device(rocker->pdev);
+	kfree(rocker);
+}
+
+static struct pci_driver rocker_pci_driver = {
+	.name		= rocker_driver_name,
+	.id_table	= rocker_pci_id_table,
+	.probe		= rocker_probe,
+	.remove		= rocker_remove,
+};
+
+/************************************
+ * Net device notifier event handler
+ ************************************/
+
+static int rocker_port_master_changed(struct net_device *dev)
+{
+	struct rocker_port *rocker_port = netdev_priv(dev);
+	enum rocker_port_mode newmode = ROCKER_PORT_MODE_L2L3;
+	enum rocker_port_mode oldmode;
+	struct net_device *master = netdev_master_upper_dev_get(dev);
+	int err;
+
+	if (master && master->rtnl_link_ops &&
+	    !strcmp(master->rtnl_link_ops->kind, "openvswitch"))
+		newmode = ROCKER_PORT_MODE_OF_DPA;
+	err = rocker_cmd_get_port_settings_mode(rocker_port, &oldmode);
+	if (err)
+		return err;
+	if (newmode == oldmode)
+		return 0;
+	err = rocker_cmd_set_port_settings_mode(rocker_port, newmode);
+	if (err)
+		return err;
+	netdev_info(dev, "port mode changed from %d to %d\n", oldmode, newmode);
+	return err;
+}
+
+static int rocker_device_event(struct notifier_block *unused,
+			       unsigned long event, void *ptr)
+{
+	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
+	int err;
+
+	if (!rocker_port_dev_check(dev))
+		return NOTIFY_DONE;
+
+	switch (event) {
+	case NETDEV_CHANGEUPPER:
+		err = rocker_port_master_changed(dev);
+		if (err)
+			netdev_warn(dev, "failed to reflect master change (err %d)\n",
+				    err);
+		break;
+	}
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block rocker_notifier_block __read_mostly = {
+	.notifier_call = rocker_device_event,
+};
+
+/***********************
+ * Module init and exit
+ ***********************/
+
+static int __init rocker_module_init(void)
+{
+	int err;
+
+	err = register_netdevice_notifier(&rocker_notifier_block);
+	if (err)
+		return err;
+	err = pci_register_driver(&rocker_pci_driver);
+	if (err)
+		goto err_pci_register_driver;
+	return 0;
+
+err_pci_register_driver:
+	unregister_netdevice_notifier(&rocker_notifier_block);
+	return err;
+}
+
+static void __exit rocker_module_exit(void)
+{
+	unregister_netdevice_notifier(&rocker_notifier_block);
+	pci_unregister_driver(&rocker_pci_driver);
+}
+
+module_init(rocker_module_init);
+module_exit(rocker_module_exit);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Jiri Pirko <jiri@resnulli.us>");
+MODULE_AUTHOR("Scott Feldman <sfeldma@cumulusnetworks.com>");
+MODULE_DESCRIPTION("Rocker switch device driver");
+MODULE_DEVICE_TABLE(pci, rocker_pci_id_table);
diff --git a/drivers/net/rocker.h b/drivers/net/rocker.h
new file mode 100644
index 0000000..74836a4
--- /dev/null
+++ b/drivers/net/rocker.h
@@ -0,0 +1,465 @@
+/*
+ * drivers/net/rocker.h - Rocker switch device driver
+ * Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us>
+ * Copyright (c) 2014 Scott Feldman <sfeldma@cumulusnetworks.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#ifndef _ROCKER_H
+#define _ROCKER_H
+
+#include <linux/types.h>
+
+#define PCI_VENDOR_ID_REDHAT		0x1b36
+#define PCI_DEVICE_ID_REDHAT_ROCKER	0x0006
+
+#define ROCKER_PCI_BAR0_SIZE		0x2000
+
+/* MSI-X vectors */
+enum {
+	ROCKER_MSIX_VEC_CMD,
+	ROCKER_MSIX_VEC_EVENT,
+	ROCKER_MSIX_VEC_TEST,
+	ROCKER_MSIX_VEC_RESERVED0,
+	__ROCKER_MSIX_VEC_TX,
+	__ROCKER_MSIX_VEC_RX,
+#define ROCKER_MSIX_VEC_TX(port) \
+	(__ROCKER_MSIX_VEC_TX + ((port) * 2))
+#define ROCKER_MSIX_VEC_RX(port) \
+	(__ROCKER_MSIX_VEC_RX + ((port) * 2))
+#define ROCKER_MSIX_VEC_COUNT(portcnt) \
+	(ROCKER_MSIX_VEC_RX((portcnt - 1)) + 1)
+};
+
+/* Rocker bogus registers */
+#define ROCKER_BOGUS_REG0		0x0000
+#define ROCKER_BOGUS_REG1		0x0004
+#define ROCKER_BOGUS_REG2		0x0008
+#define ROCKER_BOGUS_REG3		0x000c
+
+/* Rocker test registers */
+#define ROCKER_TEST_REG			0x0010
+#define ROCKER_TEST_REG64		0x0018  /* 8-byte */
+#define ROCKER_TEST_IRQ			0x0020
+#define ROCKER_TEST_DMA_ADDR		0x0028  /* 8-byte */
+#define ROCKER_TEST_DMA_SIZE		0x0030
+#define ROCKER_TEST_DMA_CTRL		0x0034
+
+/* Rocker test register ctrl */
+#define ROCKER_TEST_DMA_CTRL_CLEAR	(1 << 0)
+#define ROCKER_TEST_DMA_CTRL_FILL	(1 << 1)
+#define ROCKER_TEST_DMA_CTRL_INVERT	(1 << 2)
+
+/* Rocker DMA ring register offsets */
+#define ROCKER_DMA_DESC_ADDR(x)		(0x1000 + (x) * 32)  /* 8-byte */
+#define ROCKER_DMA_DESC_SIZE(x)		(0x1008 + (x) * 32)
+#define ROCKER_DMA_DESC_HEAD(x)		(0x100c + (x) * 32)
+#define ROCKER_DMA_DESC_TAIL(x)		(0x1010 + (x) * 32)
+#define ROCKER_DMA_DESC_CTRL(x)		(0x1014 + (x) * 32)
+#define ROCKER_DMA_DESC_CREDITS(x)	(0x1018 + (x) * 32)
+#define ROCKER_DMA_DESC_RES1(x)		(0x101c + (x) * 32)
+
+/* Rocker dma ctrl register bits */
+#define ROCKER_DMA_DESC_CTRL_RESET	(1 << 0)
+
+/* Rocker DMA ring types */
+enum rocker_dma_type {
+	ROCKER_DMA_CMD,
+	ROCKER_DMA_EVENT,
+	__ROCKER_DMA_TX,
+	__ROCKER_DMA_RX,
+#define ROCKER_DMA_TX(port) (__ROCKER_DMA_TX + (port) * 2)
+#define ROCKER_DMA_RX(port) (__ROCKER_DMA_RX + (port) * 2)
+};
+
+/* Rocker DMA ring size limits and default sizes */
+#define ROCKER_DMA_SIZE_MIN		2ul
+#define ROCKER_DMA_SIZE_MAX		65536ul
+#define ROCKER_DMA_CMD_DEFAULT_SIZE	32ul
+#define ROCKER_DMA_EVENT_DEFAULT_SIZE	32ul
+#define ROCKER_DMA_TX_DEFAULT_SIZE	64ul
+#define ROCKER_DMA_TX_DESC_SIZE		256
+#define ROCKER_DMA_RX_DEFAULT_SIZE	64ul
+#define ROCKER_DMA_RX_DESC_SIZE		256
+
+/* Rocker DMA descriptor struct */
+struct rocker_desc {
+	u64 buf_addr;
+	u64 cookie;
+	u16 buf_size;
+	u16 tlv_size;
+	u16 resv[5];
+	u16 comp_err;
+} __packed __aligned(8);
+
+#define ROCKER_DMA_DESC_COMP_ERR_GEN	(1 << 15)
+
+/* Rocker DMA TLV struct */
+struct rocker_tlv {
+	u32 type;
+	u16 len;
+} __packed __aligned(8);
+
+/* TLVs */
+enum {
+	ROCKER_TLV_CMD_UNSPEC,
+	ROCKER_TLV_CMD_TYPE,	/* u16 */
+	ROCKER_TLV_CMD_INFO,	/* nest */
+
+	__ROCKER_TLV_CMD_MAX,
+	ROCKER_TLV_CMD_MAX = __ROCKER_TLV_CMD_MAX - 1,
+};
+
+enum {
+	ROCKER_TLV_CMD_TYPE_UNSPEC,
+	ROCKER_TLV_CMD_TYPE_GET_PORT_SETTINGS,
+	ROCKER_TLV_CMD_TYPE_SET_PORT_SETTINGS,
+	ROCKER_TLV_CMD_TYPE_OF_DPA_FLOW_ADD,
+	ROCKER_TLV_CMD_TYPE_OF_DPA_FLOW_MOD,
+	ROCKER_TLV_CMD_TYPE_OF_DPA_FLOW_DEL,
+	ROCKER_TLV_CMD_TYPE_OF_DPA_FLOW_GET_STATS,
+	ROCKER_TLV_CMD_TYPE_OF_DPA_GROUP_ADD,
+	ROCKER_TLV_CMD_TYPE_OF_DPA_GROUP_MOD,
+	ROCKER_TLV_CMD_TYPE_OF_DPA_GROUP_DEL,
+	ROCKER_TLV_CMD_TYPE_OF_DPA_GROUP_GET_STATS,
+	ROCKER_TLV_CMD_TYPE_TRUNK,
+	ROCKER_TLV_CMD_TYPE_BRIDGE,
+
+	__ROCKER_TLV_CMD_TYPE_MAX,
+	ROCKER_TLV_CMD_TYPE_MAX = __ROCKER_TLV_CMD_TYPE_MAX - 1,
+};
+
+enum {
+	ROCKER_TLV_CMD_PORT_SETTINGS_UNSPEC,
+	ROCKER_TLV_CMD_PORT_SETTINGS_LPORT,		/* u32 */
+	ROCKER_TLV_CMD_PORT_SETTINGS_SPEED,		/* u32 */
+	ROCKER_TLV_CMD_PORT_SETTINGS_DUPLEX,		/* u8 */
+	ROCKER_TLV_CMD_PORT_SETTINGS_AUTONEG,		/* u8 */
+	ROCKER_TLV_CMD_PORT_SETTINGS_MACADDR,		/* binary */
+	ROCKER_TLV_CMD_PORT_SETTINGS_MODE,		/* u8 */
+
+	__ROCKER_TLV_CMD_PORT_SETTINGS_MAX,
+	ROCKER_TLV_CMD_PORT_SETTINGS_MAX =
+			__ROCKER_TLV_CMD_PORT_SETTINGS_MAX - 1,
+};
+
+enum rocker_port_mode {
+	ROCKER_PORT_MODE_OF_DPA,
+	ROCKER_PORT_MODE_L2L3,
+};
+
+enum {
+	ROCKER_TLV_EVENT_UNSPEC,
+	ROCKER_TLV_EVENT_TYPE,	/* u16 */
+	ROCKER_TLV_EVENT_INFO,	/* nest */
+
+	__ROCKER_TLV_EVENT_MAX,
+	ROCKER_TLV_EVENT_MAX = __ROCKER_TLV_EVENT_MAX - 1,
+};
+
+enum {
+	ROCKER_TLV_EVENT_TYPE_UNSPEC,
+	ROCKER_TLV_EVENT_TYPE_LINK_CHANGED,
+
+	__ROCKER_TLV_EVENT_TYPE_MAX,
+	ROCKER_TLV_EVENT_TYPE_MAX = __ROCKER_TLV_EVENT_TYPE_MAX - 1,
+};
+
+enum {
+	ROCKER_TLV_EVENT_LINK_CHANGED_UNSPEC,
+	ROCKER_TLV_EVENT_LINK_CHANGED_LPORT,	/* u32 */
+	ROCKER_TLV_EVENT_LINK_CHANGED_LINKUP,	/* u8 */
+
+	__ROCKER_TLV_EVENT_LINK_CHANGED_MAX,
+	ROCKER_TLV_EVENT_LINK_CHANGED_MAX =
+			__ROCKER_TLV_EVENT_LINK_CHANGED_MAX - 1,
+};
+
+enum {
+	ROCKER_TLV_RX_UNSPEC,
+	ROCKER_TLV_RX_FLAGS,		/* u16, see ROCKER_RX_FLAGS_ */
+	ROCKER_TLV_RX_CSUM,		/* u16 */
+	ROCKER_TLV_RX_FRAG_ADDR,	/* u64 */
+	ROCKER_TLV_RX_FRAG_MAX_LEN,	/* u16 */
+	ROCKER_TLV_RX_FRAG_LEN,		/* u16 */
+
+	__ROCKER_TLV_RX_MAX,
+	ROCKER_TLV_RX_MAX = __ROCKER_TLV_RX_MAX - 1,
+};
+
+#define ROCKER_RX_FLAGS_IPV4			(1 << 0)
+#define ROCKER_RX_FLAGS_IPV6			(1 << 1)
+#define ROCKER_RX_FLAGS_CSUM_CALC		(1 << 2)
+#define ROCKER_RX_FLAGS_IPV4_CSUM_GOOD		(1 << 3)
+#define ROCKER_RX_FLAGS_IP_FRAG			(1 << 4)
+#define ROCKER_RX_FLAGS_TCP			(1 << 5)
+#define ROCKER_RX_FLAGS_UDP			(1 << 6)
+#define ROCKER_RX_FLAGS_TCP_UDP_CSUM_GOOD	(1 << 7)
+
+enum {
+	ROCKER_TLV_TX_UNSPEC,
+	ROCKER_TLV_TX_OFFLOAD,		/* u8, see ROCKER_TX_OFFLOAD_ */
+	ROCKER_TLV_TX_L3_CSUM_OFF,	/* u16 */
+	ROCKER_TLV_TX_TSO_MSS,		/* u16 */
+	ROCKER_TLV_TX_TSO_HDR_LEN,	/* u16 */
+	ROCKER_TLV_TX_FRAGS,		/* array */
+
+	__ROCKER_TLV_TX_MAX,
+	ROCKER_TLV_TX_MAX = __ROCKER_TLV_TX_MAX - 1,
+};
+
+#define ROCKER_TX_OFFLOAD_NONE		0
+#define ROCKER_TX_OFFLOAD_IP_CSUM	1
+#define ROCKER_TX_OFFLOAD_TCP_UDP_CSUM	2
+#define ROCKER_TX_OFFLOAD_L3_CSUM	3
+#define ROCKER_TX_OFFLOAD_TSO		4
+
+#define ROCKER_TX_FRAGS_MAX		16
+
+enum {
+	ROCKER_TLV_TX_FRAG_UNSPEC,
+	ROCKER_TLV_TX_FRAG,		/* nest */
+
+	__ROCKER_TLV_TX_FRAG_MAX,
+	ROCKER_TLV_TX_FRAG_MAX = __ROCKER_TLV_TX_FRAG_MAX - 1,
+};
+
+enum {
+	ROCKER_TLV_TX_FRAG_ATTR_UNSPEC,
+	ROCKER_TLV_TX_FRAG_ATTR_ADDR,	/* u64 */
+	ROCKER_TLV_TX_FRAG_ATTR_LEN,	/* u16 */
+
+	__ROCKER_TLV_TX_FRAG_ATTR_MAX,
+	ROCKER_TLV_TX_FRAG_ATTR_MAX = __ROCKER_TLV_TX_FRAG_ATTR_MAX - 1,
+};
+
+/* cmd info nested for OF-DPA msgs */
+enum {
+	ROCKER_TLV_OF_DPA_UNSPEC,
+	ROCKER_TLV_OF_DPA_TABLE_ID,		/* u16 */
+	ROCKER_TLV_OF_DPA_PRIORITY,		/* u32 */
+	ROCKER_TLV_OF_DPA_HARDTIME,		/* u32 */
+	ROCKER_TLV_OF_DPA_IDLETIME,		/* u32 */
+	ROCKER_TLV_OF_DPA_COOKIE,		/* u64 */
+	ROCKER_TLV_OF_DPA_IN_LPORT,		/* u32 */
+	ROCKER_TLV_OF_DPA_IN_LPORT_MASK,	/* u32 */
+	ROCKER_TLV_OF_DPA_OUT_LPORT,		/* u32 */
+	ROCKER_TLV_OF_DPA_GOTO_TABLE_ID,	/* u16 */
+	ROCKER_TLV_OF_DPA_GROUP_ID,		/* u32 */
+	ROCKER_TLV_OF_DPA_GROUP_COUNT,		/* u16 */
+	ROCKER_TLV_OF_DPA_GROUP_IDS,		/* u32 array */
+	ROCKER_TLV_OF_DPA_VLAN_ID,		/* __be16 */
+	ROCKER_TLV_OF_DPA_VLAN_ID_MASK,		/* __be16 */
+	ROCKER_TLV_OF_DPA_VLAN_PCP,		/* __be16 */
+	ROCKER_TLV_OF_DPA_VLAN_PCP_MASK,	/* __be16 */
+	ROCKER_TLV_OF_DPA_VLAN_PCP_ACTION,	/* u8 */
+	ROCKER_TLV_OF_DPA_NEW_VLAN_ID,		/* __be16 */
+	ROCKER_TLV_OF_DPA_NEW_VLAN_PCP,		/* u8 */
+	ROCKER_TLV_OF_DPA_TUNNEL_ID,		/* u32 */
+	ROCKER_TLV_OF_DPA_TUN_LOG_LPORT,	/* u32 */
+	ROCKER_TLV_OF_DPA_ETHERTYPE,		/* __be16 */
+	ROCKER_TLV_OF_DPA_DST_MAC,		/* binary */
+	ROCKER_TLV_OF_DPA_DST_MAC_MASK,		/* binary */
+	ROCKER_TLV_OF_DPA_SRC_MAC,		/* binary */
+	ROCKER_TLV_OF_DPA_SRC_MAC_MASK,		/* binary */
+	ROCKER_TLV_OF_DPA_IP_PROTO,		/* __be16 */
+	ROCKER_TLV_OF_DPA_IP_PROTO_MASK,	/* __be16 */
+	ROCKER_TLV_OF_DPA_DSCP,			/* __be16 */
+	ROCKER_TLV_OF_DPA_DSCP_MASK,		/* __be16 */
+	ROCKER_TLV_OF_DPA_DSCP_ACTION,		/* u8 */
+	ROCKER_TLV_OF_DPA_NEW_DSCP,		/* u8 */
+	ROCKER_TLV_OF_DPA_ECN,			/* __be16 */
+	ROCKER_TLV_OF_DPA_ECN_MASK,		/* __be16 */
+	ROCKER_TLV_OF_DPA_DST_IP,		/* __be32 */
+	ROCKER_TLV_OF_DPA_DST_IP_MASK,		/* __be32 */
+	ROCKER_TLV_OF_DPA_SRC_IP,		/* __be32 */
+	ROCKER_TLV_OF_DPA_SRC_IP_MASK,		/* __be32 */
+	ROCKER_TLV_OF_DPA_DST_IPV6,		/* binary */
+	ROCKER_TLV_OF_DPA_DST_IPV6_MASK,	/* binary */
+	ROCKER_TLV_OF_DPA_SRC_IPV6,		/* binary */
+	ROCKER_TLV_OF_DPA_SRC_IPV6_MASK,	/* binary */
+	ROCKER_TLV_OF_DPA_SRC_ARP_IP,		/* __be32 */
+	ROCKER_TLV_OF_DPA_SRC_ARP_IP_MASK,	/* __be32 */
+	ROCKER_TLV_OF_DPA_L4_DST_PORT,		/* __be16 */
+	ROCKER_TLV_OF_DPA_L4_DST_PORT_MASK,	/* __be16 */
+	ROCKER_TLV_OF_DPA_L4_SRC_PORT,		/* __be16 */
+	ROCKER_TLV_OF_DPA_L4_SRC_PORT_MASK,	/* __be16 */
+	ROCKER_TLV_OF_DPA_ICMP_TYPE,		/* u8 */
+	ROCKER_TLV_OF_DPA_ICMP_TYPE_MASK,	/* u8 */
+	ROCKER_TLV_OF_DPA_ICMP_CODE,		/* u8 */
+	ROCKER_TLV_OF_DPA_ICMP_CODE_MASK,	/* u8 */
+	ROCKER_TLV_OF_DPA_IPV6_LABEL,		/* __be32 */
+	ROCKER_TLV_OF_DPA_IPV6_LABEL_MASK,	/* __be32 */
+	ROCKER_TLV_OF_DPA_QUEUE_ID_ACTION,	/* u8 */
+	ROCKER_TLV_OF_DPA_NEW_QUEUE_ID,		/* u8 */
+	ROCKER_TLV_OF_DPA_CLEAR_ACTIONS,	/* u32 */
+	ROCKER_TLV_OF_DPA_POP_VLAN,		/* u8 */
+
+	__ROCKER_TLV_OF_DPA_MAX,
+	ROCKER_TLV_OF_DPA_MAX = __ROCKER_TLV_OF_DPA_MAX - 1,
+};
+
+/* OF-DPA table IDs */
+
+enum rocker_of_dpa_table_id {
+	ROCKER_OF_DPA_TABLE_ID_INGRESS_PORT = 0,
+	ROCKER_OF_DPA_TABLE_ID_VLAN = 10,
+	ROCKER_OF_DPA_TABLE_ID_TERMINATION_MAC = 20,
+	ROCKER_OF_DPA_TABLE_ID_UNICAST_ROUTING = 30,
+	ROCKER_OF_DPA_TABLE_ID_MULTICAST_ROUTING = 40,
+	ROCKER_OF_DPA_TABLE_ID_BRIDGING = 50,
+	ROCKER_OF_DPA_TABLE_ID_ACL_POLICY = 60,
+};
+
+/* OF_DPA_xxx nest */
+enum {
+	ROCKER_TLV_OF_DPA_INFO_UNSPEC,
+	ROCKER_TLV_OF_DPA_INFO_IN_LPORT,		/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_IN_LPORT_MASK,		/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_OUT_LPORT,		/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_GOTO_TABLE_ID,		/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_GROUP_ID,		/* u32 */
+	ROCKER_TLV_OF_DPA_INFO_VLAN_ID,			/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_VLAN_ID_MASK,		/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_VLAN_PCP,		/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_VLAN_PCP_MASK,		/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_VLAN_PCP_ACTION,		/* u8 */
+	ROCKER_TLV_OF_DPA_INFO_NEW_VLAN_ID,		/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_NEW_VLAN_PCP,		/* u8 */
+	ROCKER_TLV_OF_DPA_INFO_TUNNEL_ID,		/* u32 */
+	ROCKER_TLV_OF_DPA_INFO_TUN_LOG_LPORT,		/* u32 */
+	ROCKER_TLV_OF_DPA_INFO_ETHERTYPE,		/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_DST_MAC,			/* binary */
+	ROCKER_TLV_OF_DPA_INFO_DST_MAC_MASK,		/* binary */
+	ROCKER_TLV_OF_DPA_INFO_SRC_MAC,			/* binary */
+	ROCKER_TLV_OF_DPA_INFO_SRC_MAC_MASK,		/* binary */
+	ROCKER_TLV_OF_DPA_INFO_IP_PROTO,		/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_IP_PROTO_MASK,		/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_DSCP,			/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_DSCP_MASK,		/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_DSCP_ACTION,		/* u8 */
+	ROCKER_TLV_OF_DPA_INFO_NEW_DSCP,		/* u8 */
+	ROCKER_TLV_OF_DPA_INFO_ECN,			/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_ECN_MASK,		/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_DST_IP,			/* binary */
+	ROCKER_TLV_OF_DPA_INFO_DST_IP_MASK,		/* binary */
+	ROCKER_TLV_OF_DPA_INFO_SRC_IP,			/* binary */
+	ROCKER_TLV_OF_DPA_INFO_SRC_IP_MASK,		/* binary */
+	ROCKER_TLV_OF_DPA_INFO_DST_IPV6,		/* binary */
+	ROCKER_TLV_OF_DPA_INFO_DST_IPV6_MASK,		/* binary */
+	ROCKER_TLV_OF_DPA_INFO_SRC_IPV6,		/* binary */
+	ROCKER_TLV_OF_DPA_INFO_SRC_IPV6_MASK,		/* binary */
+	ROCKER_TLV_OF_DPA_INFO_SRC_ARP_IP,		/* u32 */
+	ROCKER_TLV_OF_DPA_INFO_SRC_ARP_IP_MASK,		/* u32 */
+	ROCKER_TLV_OF_DPA_INFO_L4_DST_PORT,		/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_L4_DST_PORT_MASK,	/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_L4_SRC_PORT,		/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_L4_SRC_PORT_MASK,	/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_ICMP_TYPE,		/* u8 */
+	ROCKER_TLV_OF_DPA_INFO_ICMP_TYPE_MASK,		/* u8 */
+	ROCKER_TLV_OF_DPA_INFO_ICMP_CODE,		/* u8 */
+	ROCKER_TLV_OF_DPA_INFO_ICMP_CODE_MASK,		/* u8 */
+	ROCKER_TLV_OF_DPA_INFO_IPV6_LABEL,		/* u32 */
+	ROCKER_TLV_OF_DPA_INFO_IPV6_LABEL_MASK,		/* u32 */
+	ROCKER_TLV_OF_DPA_INFO_QUEUE_ID_ACTION,		/* u8 */
+	ROCKER_TLV_OF_DPA_INFO_NEW_QUEUE_ID,		/* u8 */
+	ROCKER_TLV_OF_DPA_INFO_CLEAR_ACTIONS,		/* u32 */
+
+	__ROCKER_TLV_OF_DPA_INFO_MAX,
+	ROCKER_TLV_OF_DPA_INFO_MAX = __ROCKER_TLV_OF_DPA_INFO_MAX - 1,
+};
+
+/* OF-DPA flow stats */
+enum {
+	ROCKER_TLV_OF_DPA_FLOW_STAT_UNSPEC,
+	ROCKER_TLV_OF_DPA_FLOW_STAT_DURATION,	/* u32 */
+	ROCKER_TLV_OF_DPA_FLOW_STAT_RX_PKTS,	/* u64 */
+	ROCKER_TLV_OF_DPA_FLOW_STAT_TX_PKTS,	/* u64 */
+
+	__ROCKER_TLV_OF_DPA_FLOW_STAT_MAX,
+	ROCKER_TLV_OF_DPA_FLOW_STAT_MAX = __ROCKER_TLV_OF_DPA_FLOW_STAT_MAX - 1,
+};
+
+/* OF-DPA group types */
+enum rocker_of_dpa_group_type {
+	ROCKER_OF_DPA_GROUP_TYPE_L2_INTERFACE = 0,
+	ROCKER_OF_DPA_GROUP_TYPE_L2_REWRITE,
+	ROCKER_OF_DPA_GROUP_TYPE_L3_UCAST,
+	ROCKER_OF_DPA_GROUP_TYPE_L2_MCAST,
+	ROCKER_OF_DPA_GROUP_TYPE_L2_FLOOD,
+	ROCKER_OF_DPA_GROUP_TYPE_L3_INTERFACE,
+	ROCKER_OF_DPA_GROUP_TYPE_L3_MCAST,
+	ROCKER_OF_DPA_GROUP_TYPE_L3_ECMP,
+	ROCKER_OF_DPA_GROUP_TYPE_L2_OVERLAY,
+};
+
+/* OF-DPA group L2 overlay types */
+enum rocker_of_dpa_overlay_type {
+	ROCKER_OF_DPA_OVERLAY_TYPE_FLOOD_UCAST = 0,
+	ROCKER_OF_DPA_OVERLAY_TYPE_FLOOD_MCAST,
+	ROCKER_OF_DPA_OVERLAY_TYPE_MCAST_UCAST,
+	ROCKER_OF_DPA_OVERLAY_TYPE_MCAST_MCAST,
+};
+
+/* OF-DPA group ID encoding */
+#define ROCKER_GROUP_TYPE_SHIFT 28
+#define ROCKER_GROUP_TYPE_MASK 0xf0000000
+#define ROCKER_GROUP_VLAN_SHIFT 16
+#define ROCKER_GROUP_VLAN_MASK 0x0fff0000
+#define ROCKER_GROUP_PORT_SHIFT 0
+#define ROCKER_GROUP_PORT_MASK 0x0000ffff
+#define ROCKER_GROUP_TUNNEL_ID_SHIFT 12
+#define ROCKER_GROUP_TUNNEL_ID_MASK 0x0ffff000
+#define ROCKER_GROUP_SUBTYPE_SHIFT 10
+#define ROCKER_GROUP_SUBTYPE_MASK 0x00000c00
+#define ROCKER_GROUP_INDEX_SHIFT 0
+#define ROCKER_GROUP_INDEX_MASK 0x0000ffff
+#define ROCKER_GROUP_INDEX_LONG_SHIFT 0
+#define ROCKER_GROUP_INDEX_LONG_MASK 0x0fffffff
+
+#define ROCKER_GROUP_TYPE_GET(group_id) \
+	(((group_id) & ROCKER_GROUP_TYPE_MASK) >> ROCKER_GROUP_TYPE_SHIFT)
+#define ROCKER_GROUP_TYPE_SET(type) \
+	(((type) << ROCKER_GROUP_TYPE_SHIFT) & ROCKER_GROUP_TYPE_MASK)
+#define ROCKER_GROUP_VLAN_GET(group_id) \
+	(((group_id) & ROCKER_GROUP_VLAN_MASK) >> ROCKER_GROUP_VLAN_SHIFT)
+#define ROCKER_GROUP_VLAN_SET(vlan_id) \
+	(((vlan_id) << ROCKER_GROUP_VLAN_SHIFT) & ROCKER_GROUP_VLAN_MASK)
+#define ROCKER_GROUP_PORT_GET(group_id) \
+	(((group_id) & ROCKER_GROUP_PORT_MASK) >> ROCKER_GROUP_PORT_SHIFT)
+#define ROCKER_GROUP_PORT_SET(port) \
+	(((port) << ROCKER_GROUP_PORT_SHIFT) & ROCKER_GROUP_PORT_MASK)
+#define ROCKER_GROUP_INDEX_GET(group_id) \
+	(((group_id) & ROCKER_GROUP_INDEX_MASK) >> ROCKER_GROUP_INDEX_SHIFT)
+#define ROCKER_GROUP_INDEX_SET(index) \
+	(((index) << ROCKER_GROUP_INDEX_SHIFT) & ROCKER_GROUP_INDEX_MASK)
+#define ROCKER_GROUP_INDEX_LONG_GET(group_id) \
+	(((group_id) & ROCKER_GROUP_INDEX_LONG_MASK) >> \
+	 ROCKER_GROUP_INDEX_LONG_SHIFT)
+#define ROCKER_GROUP_INDEX_LONG_SET(index) \
+	(((index) << ROCKER_GROUP_INDEX_LONG_SHIFT) & \
+	 ROCKER_GROUP_INDEX_LONG_MASK)
+
+#define ROCKER_GROUP_NONE 0
+#define ROCKER_GROUP_L2_INTERFACE(vlan_id, port) \
+	(ROCKER_GROUP_TYPE_SET(ROCKER_OF_DPA_GROUP_TYPE_L2_INTERFACE) |\
+	 ROCKER_GROUP_VLAN_SET(vlan_id) | ROCKER_GROUP_PORT_SET(port))
+#define ROCKER_GROUP_L2_MCAST(vlan_id, index) \
+	(ROCKER_GROUP_TYPE_SET(ROCKER_OF_DPA_GROUP_TYPE_L2_MCAST) |\
+	 ROCKER_GROUP_VLAN_SET(vlan_id) | ROCKER_GROUP_INDEX_SET(index))
+
+/* Rocker general purpose registers */
+#define ROCKER_CONTROL			0x0300
+#define ROCKER_PORT_PHYS_COUNT		0x0304
+#define ROCKER_PORT_PHYS_LINK_STATUS	0x0310 /* 8-byte */
+#define ROCKER_PORT_PHYS_ENABLE		0x0318 /* 8-byte */
+#define ROCKER_SWITCH_ID		0x0320 /* 8-byte */
+
+/* Rocker control bits */
+#define ROCKER_CONTROL_RESET		(1 << 0)
+
+#endif
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 07/12] dsa: implement ndo_swdev_get_id
  2014-08-21 16:19 ` [patch net-next RFC 07/12] dsa: implement ndo_swdev_get_id Jiri Pirko
@ 2014-08-21 16:38   ` Ben Hutchings
  2014-08-21 16:56   ` Florian Fainelli
  1 sibling, 0 replies; 87+ messages in thread
From: Ben Hutchings @ 2014-08-21 16:38 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse,
	pshelar, azhou, stephen, jeffrey.t.kirsher, vyasevic,
	xiyou.wangcong, john.r.fastabend, edumazet, jhs, sfeldma,
	f.fainelli, roopa, linville, dev, jasowang, ebiederm,
	nicolas.dichtel, ryazanov.s.a, buytenh, aviadr, nbd,
	alexei.starovoitov, Neil.Jerram, ronye


On Thu, 2014-08-21 at 18:19 +0200, Jiri Pirko wrote:
[...]
> --- a/net/dsa/slave.c
> +++ b/net/dsa/slave.c
> @@ -171,6 +171,19 @@ static int dsa_slave_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
>  	return -EOPNOTSUPP;
>  }
>  
> +static int dsa_slave_swdev_get_id(struct net_device *dev,
> +				  struct netdev_phys_item_id *psid)
> +{
> +	struct dsa_slave_priv *p = netdev_priv(dev);
> +	struct dsa_switch *ds = p->parent;
> +	u64 tmp = (u64) ds;
> +
> +	/* TODO: add more sophisticated id generation */
> +	memcpy(&psid->id, &tmp, sizeof(tmp));
[...]

Right, you must not expose kernel addresses to userland.

Ben.

-- 
Ben Hutchings
If at first you don't succeed, you're doing about average.



* Re: [patch net-next RFC 03/12] net: introduce generic switch devices support
  2014-08-21 16:18   ` [patch net-next RFC 03/12] net: introduce generic switch devices support Jiri Pirko
@ 2014-08-21 16:41     ` Ben Hutchings
  2014-08-21 17:03       ` Jiri Pirko
       [not found]       ` <1408639283.13073.3.camel-nDn/Rdv9kqW9Jme8/bJn5UCKIB8iOfG2tUK59QYPAWc@public.gmane.org>
       [not found]     ` <1408637945-10390-4-git-send-email-jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
                       ` (2 subsequent siblings)
  3 siblings, 2 replies; 87+ messages in thread
From: Ben Hutchings @ 2014-08-21 16:41 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse,
	pshelar, azhou, stephen, jeffrey.t.kirsher, vyasevic,
	xiyou.wangcong, john.r.fastabend, edumazet, jhs, sfeldma,
	f.fainelli, roopa, linville, dev, jasowang, ebiederm,
	nicolas.dichtel, ryazanov.s.a, buytenh, aviadr, nbd,
	alexei.starovoitov, Neil.Jerram, ronye


On Thu, 2014-08-21 at 18:18 +0200, Jiri Pirko wrote:
> The goal of this is to make it possible to support various switch
> chips. Drivers should implement the relevant ndos to do so. A couple
> of ndos are defined now:
> - for getting the physical switch id,
> - for working with flows.
> 
> Note that the user can use a random port netdevice to access the switch.
[...]

Why isn't the switch treated as a real device (not necessarily a net
device) that's included in the device model and that the port devices
refer to?

Ben.

-- 
Ben Hutchings
If at first you don't succeed, you're doing about average.



* Re: [patch net-next RFC 07/12] dsa: implement ndo_swdev_get_id
  2014-08-21 16:19 ` [patch net-next RFC 07/12] dsa: implement ndo_swdev_get_id Jiri Pirko
  2014-08-21 16:38   ` Ben Hutchings
@ 2014-08-21 16:56   ` Florian Fainelli
       [not found]     ` <CAGVrzcbs1yGb5RW++XZ=2PFsqUjZGVGfWx5=QQYcEX6x4WOq9Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 87+ messages in thread
From: Florian Fainelli @ 2014-08-21 16:56 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, David Miller, Neil Horman, Andy Gospodarek, tgraf,
	dborkman, ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, Jeff Kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Jamal Hadi Salim, Scott Feldman,
	Roopa Prabhu, John Linville, dev, jasowang, Eric W. Biederman

2014-08-21 9:19 GMT-07:00 Jiri Pirko <jiri@resnulli.us>:
> Signed-off-by: Jiri Pirko <jiri@resnulli.us>
> ---
>  net/dsa/Kconfig |  2 +-
>  net/dsa/slave.c | 16 ++++++++++++++++
>  2 files changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/net/dsa/Kconfig b/net/dsa/Kconfig
> index f5eede1..66c445a 100644
> --- a/net/dsa/Kconfig
> +++ b/net/dsa/Kconfig
> @@ -1,6 +1,6 @@
>  config HAVE_NET_DSA
>         def_bool y
> -       depends on NETDEVICES && !S390
> +       depends on NETDEVICES && NET_SWITCHDEV && !S390
>
>  # Drivers must select NET_DSA and the appropriate tagging format
>
> diff --git a/net/dsa/slave.c b/net/dsa/slave.c
> index 45a1e34..e069ba3 100644
> --- a/net/dsa/slave.c
> +++ b/net/dsa/slave.c
> @@ -171,6 +171,19 @@ static int dsa_slave_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
>         return -EOPNOTSUPP;
>  }
>
> +static int dsa_slave_swdev_get_id(struct net_device *dev,
> +                                 struct netdev_phys_item_id *psid)
> +{
> +       struct dsa_slave_priv *p = netdev_priv(dev);
> +       struct dsa_switch *ds = p->parent;
> +       u64 tmp = (u64) ds;
> +
> +       /* TODO: add more sophisticated id generation */
> +       memcpy(&psid->id, &tmp, sizeof(tmp));
> +       psid->id_len = sizeof(tmp);

There is already a unique id generated, which is the index in the
switch tree and is stored in struct dsa_switch, so this could
probably be simplified to:

psid->id = ds->index
--
Florian


* Re: [patch net-next RFC 03/12] net: introduce generic switch devices support
  2014-08-21 16:41     ` Ben Hutchings
@ 2014-08-21 17:03       ` Jiri Pirko
       [not found]       ` <1408639283.13073.3.camel-nDn/Rdv9kqW9Jme8/bJn5UCKIB8iOfG2tUK59QYPAWc@public.gmane.org>
  1 sibling, 0 replies; 87+ messages in thread
From: Jiri Pirko @ 2014-08-21 17:03 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: netdev, davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse,
	pshelar, azhou, stephen, jeffrey.t.kirsher, vyasevic,
	xiyou.wangcong, john.r.fastabend, edumazet, jhs, sfeldma,
	f.fainelli, roopa, linville, dev, jasowang, ebiederm,
	nicolas.dichtel, ryazanov.s.a, buytenh, aviadr, nbd,
	alexei.starovoitov, Neil.Jerram, ronye

Thu, Aug 21, 2014 at 06:41:23PM CEST, ben@decadent.org.uk wrote:
>On Thu, 2014-08-21 at 18:18 +0200, Jiri Pirko wrote:
>> The goal of this is to make it possible to support various switch
>> chips. Drivers should implement the relevant ndos to do so. A couple
>> of ndos are defined now:
>> - for getting the physical switch id,
>> - for working with flows.
>> 
>> Note that the user can use a random port netdevice to access the switch.
>[...]
>
>Why isn't the switch treated as a real device (not necessarily a net
>device) that's included in the device model and that the port devices
>refer to?

That is certainly possible, but so far there has been no need for it. I
guess it would probably be a good idea at least to represent it in the
sysfs structure.

Noted, thanks Ben.

>
>Ben.
>
>-- 
>Ben Hutchings
>If at first you don't succeed, you're doing about average.


* Re: [patch net-next RFC 03/12] net: introduce generic switch devices support
       [not found]     ` <1408637945-10390-4-git-send-email-jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
@ 2014-08-21 17:05       ` Florian Fainelli
       [not found]         ` <CAGVrzcYtnpcP4pfCJ0GSya01LTk0WwbSV1f+voF2K=S5CR3Arg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 87+ messages in thread
From: Florian Fainelli @ 2014-08-21 17:05 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Sergey Ryazanov, jasowang-H+wXaHxf7aLQT0dZR+AlfA, John Fastabend,
	Neil Jerram, Eric Dumazet, Andy Gospodarek, dev, Felix Fietkau,
	ronye-VPRAkNaXOzVWk0Htik3J/w, Jeff Kirsher, ogerlitz,
	Ben Hutchings, Lennert Buytenhek, Roopa Prabhu, Jamal Hadi Salim,
	Aviad Raveh, Nicolas Dichtel, vyasevic, Neil Horman, netdev,
	Stephen Hemminger, dborkman, Eric W. Biederman, David

2014-08-21 9:18 GMT-07:00 Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>:
> The goal of this is to make it possible to support various switch
> chips. Drivers should implement the relevant ndos to do so. A couple
> of ndos are defined now:
> - for getting the physical switch id,
> - for working with flows.
>
> Note that the user can use a random port netdevice to access the switch.

I read through this patch set, and I still think that DSA is the
generic switch infrastructure we already have because it does provide
the following:

- taking a generic platform data structure (C struct or Device Tree),
validating and parsing it, and mapping it to internal kernel structures
- instantiate per-port network devices based on the configuration data provided
- delegate netdev_ops to the switch driver and/or the CPU NIC when relevant
- provide support for hooking RX and TX traffic coming from the CPU NIC

I would rather we build on the existing DSA infrastructure and add the
flow-related netdev_ops than have the two remain disconnected while
flow-oriented switch drivers get progressively added. I guess I should
take a closer look at the rocker driver to see how hard that would be
for you.

What do you think?

>
> Signed-off-by: Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
> ---
>  Documentation/networking/switchdev.txt |  53 +++++++++++
>  include/linux/netdevice.h              |  28 ++++++
>  include/linux/switchdev.h              |  44 +++++++++
>  net/Kconfig                            |   6 ++
>  net/core/Makefile                      |   1 +
>  net/core/switchdev.c                   | 163 +++++++++++++++++++++++++++++++++
>  6 files changed, 295 insertions(+)
>  create mode 100644 Documentation/networking/switchdev.txt
>  create mode 100644 include/linux/switchdev.h
>  create mode 100644 net/core/switchdev.c
>
> diff --git a/Documentation/networking/switchdev.txt b/Documentation/networking/switchdev.txt
> new file mode 100644
> index 0000000..435746a
> --- /dev/null
> +++ b/Documentation/networking/switchdev.txt
> @@ -0,0 +1,53 @@
> +Switch device drivers HOWTO
> +===========================
> +
> +First, let's describe the topology a bit. Imagine the following example:
> +
> +       +----------------------------+    +---------------+
> +       |     SOME switch chip       |    |      CPU      |
> +       +----------------------------+    +---------------+
> +       port1 port2 port3 port4 MNGMNT    |     PCI-E     |
> +         |     |     |     |     |       +---------------+
> +        PHY   PHY    |     |     |         |  NIC0 NIC1
> +                     |     |     |         |   |    |
> +                     |     |     +- PCI-E -+   |    |
> +                     |     +------- MII -------+    |
> +                     +------------- MII ------------+
> +
> +In this example, there are two independent lines between the switch silicon
> +and the CPU. The NIC0 and NIC1 drivers are not aware of the switch's
> +presence; they are separate from the switch driver. The SOME switch chip is
> +managed by a driver via the PCI-E device MNGMNT. Note that the MNGMNT
> +device, NIC0 and NIC1 may be connected via some other type of bus.
> +
> +Now, for the previous example, this is the representation in the kernel:
> +
> +       +----------------------------+    +---------------+
> +       |     SOME switch chip       |    |      CPU      |
> +       +----------------------------+    +---------------+
> +       sw0p0 sw0p1 sw0p2 sw0p3 MNGMNT    |     PCI-E     |
> +         |     |     |     |     |       +---------------+
> +        PHY   PHY    |     |     |         |  eth0 eth1
> +                     |     |     |         |   |    |
> +                     |     |     +- PCI-E -+   |    |
> +                     |     +------- MII -------+    |
> +                     +------------- MII ------------+
> +
> +Let's call the example switch driver for the SOME switch chip "SOMEswitch".
> +This driver takes care of the PCI-E device MNGMNT. A netdevice instance
> +sw0pX is created for each port of the switch. These netdevices are
> +instances of the "SOMEswitch" driver and serve as a "representation"
> +of the switch chip. eth0 and eth1 are instances of some other existing driver.
> +
> +The only difference between a switch-port netdevice and an ordinary
> +netdevice is that it implements a couple more NDOs:
> +
> +       ndo_swdev_get_id - This returns the same ID for two port netdevices of
> +                          the same physical switch chip. It is mandatory for
> +                          all switch drivers to implement and lets the caller
> +                          recognize a port netdevice.
> +       ndo_swdev_* - Functions that serve for manipulation of the switch chip
> +                     itself. They are not port-specific. The caller may use
> +                     an arbitrary port netdevice of the same switch; it makes
> +                     no difference which one.
> +       ndo_swportdev_* - Functions that serve for a port-specific manipulation.
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 39294b9..8b5d14c 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -49,6 +49,8 @@
>
>  #include <linux/netdev_features.h>
>  #include <linux/neighbour.h>
> +#include <linux/sw_flow.h>
> +
>  #include <uapi/linux/netdevice.h>
>
>  struct netpoll_info;
> @@ -997,6 +999,24 @@ typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
>   *     Callback to use for xmit over the accelerated station. This
>   *     is used in place of ndo_start_xmit on accelerated net
>   *     devices.
> + *
> + * int (*ndo_swdev_get_id)(struct net_device *dev,
> + *                        struct netdev_phys_item_id *psid);
> + *     Called to get an ID of the switch chip this port is part of.
> + *     If driver implements this, it indicates that it represents a port
> + *     of a switch chip.
> + *
> + * int (*ndo_swdev_flow_insert)(struct net_device *dev,
> + *                             const struct sw_flow *flow);
> + *     Called to insert a flow into the switch device. If the driver does
> + *     not implement this, it is assumed that the hw does not have
> + *     the capability to work with flows.
> + *
> + * int (*ndo_swdev_flow_remove)(struct net_device *dev,
> + *                             const struct sw_flow *flow);
> + *     Called to remove a flow from the switch device. If the driver does
> + *     not implement this, it is assumed that the hw does not have
> + *     the capability to work with flows.
>   */
>  struct net_device_ops {
>         int                     (*ndo_init)(struct net_device *dev);
> @@ -1146,6 +1166,14 @@ struct net_device_ops {
>                                                         struct net_device *dev,
>                                                         void *priv);
>         int                     (*ndo_get_lock_subclass)(struct net_device *dev);
> +#ifdef CONFIG_NET_SWITCHDEV
> +       int                     (*ndo_swdev_get_id)(struct net_device *dev,
> +                                                   struct netdev_phys_item_id *psid);
> +       int                     (*ndo_swdev_flow_insert)(struct net_device *dev,
> +                                                        const struct sw_flow *flow);
> +       int                     (*ndo_swdev_flow_remove)(struct net_device *dev,
> +                                                        const struct sw_flow *flow);
> +#endif
>  };
>
>  /**
> diff --git a/include/linux/switchdev.h b/include/linux/switchdev.h
> new file mode 100644
> index 0000000..ba77a68
> --- /dev/null
> +++ b/include/linux/switchdev.h
> @@ -0,0 +1,44 @@
> +/*
> + * include/linux/switchdev.h - Switch device API
> + * Copyright (c) 2014 Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + */
> +#ifndef _LINUX_SWITCHDEV_H_
> +#define _LINUX_SWITCHDEV_H_
> +
> +#include <linux/netdevice.h>
> +#include <linux/sw_flow.h>
> +
> +#ifdef CONFIG_NET_SWITCHDEV
> +
> +int swdev_get_id(struct net_device *dev, struct netdev_phys_item_id *psid);
> +int swdev_flow_insert(struct net_device *dev, const struct sw_flow *flow);
> +int swdev_flow_remove(struct net_device *dev, const struct sw_flow *flow);
> +
> +#else
> +
> +static inline int swdev_get_id(struct net_device *dev,
> +                              struct netdev_phys_item_id *psid)
> +{
> +       return -EOPNOTSUPP;
> +}
> +
> +static inline int swdev_flow_insert(struct net_device *dev,
> +                                   const struct sw_flow *flow)
> +{
> +       return -EOPNOTSUPP;
> +}
> +
> +static inline int swdev_flow_remove(struct net_device *dev,
> +                                   const struct sw_flow *flow)
> +{
> +       return -EOPNOTSUPP;
> +}
> +
> +#endif
> +
> +#endif /* _LINUX_SWITCHDEV_H_ */
> diff --git a/net/Kconfig b/net/Kconfig
> index 4051fdf..40f729f 100644
> --- a/net/Kconfig
> +++ b/net/Kconfig
> @@ -290,6 +290,12 @@ config NET_FLOW_LIMIT
>           with many clients some protection against DoS by a single (spoofed)
>           flow that greatly exceeds average workload.
>
> +config NET_SWITCHDEV
> +       boolean "Switch device support"
> +       depends on INET
> +       ---help---
> +         This module provides support for hardware switch chips.
> +
>  menu "Network testing"
>
>  config NET_PKTGEN
> diff --git a/net/core/Makefile b/net/core/Makefile
> index 71093d9..8583c38 100644
> --- a/net/core/Makefile
> +++ b/net/core/Makefile
> @@ -24,3 +24,4 @@ obj-$(CONFIG_NETWORK_PHY_TIMESTAMPING) += timestamping.o
>  obj-$(CONFIG_NET_PTP_CLASSIFY) += ptp_classifier.o
>  obj-$(CONFIG_CGROUP_NET_PRIO) += netprio_cgroup.o
>  obj-$(CONFIG_CGROUP_NET_CLASSID) += netclassid_cgroup.o
> +obj-$(CONFIG_NET_SWITCHDEV) += switchdev.o
> diff --git a/net/core/switchdev.c b/net/core/switchdev.c
> new file mode 100644
> index 0000000..4fad097
> --- /dev/null
> +++ b/net/core/switchdev.c
> @@ -0,0 +1,163 @@
> +/*
> + * net/core/switchdev.c - Switch device API
> + * Copyright (c) 2014 Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/types.h>
> +#include <linux/init.h>
> +#include <linux/netdevice.h>
> +#include <linux/switchdev.h>
> +
> +/**
> + *     swdev_get_id - Get ID of a switch
> + *     @dev: port device
> + *     @psid: switch ID
> + *
> + *     Get ID of a switch this port is part of.
> + */
> +int swdev_get_id(struct net_device *dev, struct netdev_phys_item_id *psid)
> +{
> +       const struct net_device_ops *ops = dev->netdev_ops;
> +
> +       if (!ops->ndo_swdev_get_id)
> +               return -EOPNOTSUPP;
> +       return ops->ndo_swdev_get_id(dev, psid);
> +}
> +EXPORT_SYMBOL(swdev_get_id);
> +
> +static void print_flow_key_tun(const char *prefix,
> +                              const struct sw_flow_key *key)
> +{
> +       pr_debug("%s tun  { id %08llx, s %pI4, d %pI4, f %02x, tos %x, ttl %x }\n",
> +                prefix,
> +                be64_to_cpu(key->tun_key.tun_id), &key->tun_key.ipv4_src,
> +                &key->tun_key.ipv4_dst, ntohs(key->tun_key.tun_flags),
> +                key->tun_key.ipv4_tos, key->tun_key.ipv4_ttl);
> +}
> +
> +static void print_flow_key_phy(const char *prefix,
> +                              const struct sw_flow_key *key)
> +{
> +       pr_debug("%s phy  { prio %04x, mark %04x, in_port %02x }\n",
> +                prefix,
> +                key->phy.priority, key->phy.skb_mark, key->phy.in_port);
> +}
> +
> +static void print_flow_key_eth(const char *prefix,
> +                              const struct sw_flow_key *key)
> +{
> +       pr_debug("%s eth  { sm %pM, dm %pM, tci %04x, type %04x }\n",
> +                prefix,
> +                key->eth.src, key->eth.dst, ntohs(key->eth.tci),
> +                ntohs(key->eth.type));
> +}
> +
> +static void print_flow_key_ip(const char *prefix,
> +                             const struct sw_flow_key *key)
> +{
> +       pr_debug("%s ip   { proto %02x, tos %02x, ttl %02x }\n",
> +                prefix,
> +                key->ip.proto, key->ip.tos, key->ip.ttl);
> +}
> +
> +static void print_flow_key_ipv4(const char *prefix,
> +                               const struct sw_flow_key *key)
> +{
> +       pr_debug("%s ipv4 { si %pI4, di %pI4, sm %pM, dm %pM }\n",
> +                prefix,
> +                &key->ipv4.addr.src, &key->ipv4.addr.dst,
> +                key->ipv4.arp.sha, key->ipv4.arp.tha);
> +}
> +
> +static void print_flow_actions(struct sw_flow_actions *actions)
> +{
> +       int i;
> +
> +       pr_debug("  actions:\n");
> +       if (!actions)
> +               return;
> +       for (i = 0; i < actions->count; i++) {
> +               struct sw_flow_action *action = &actions->actions[i];
> +
> +               switch (action->type) {
> +               case SW_FLOW_ACTION_TYPE_OUTPUT:
> +                       pr_debug("    output    { dev %s }\n",
> +                                action->output_dev->name);
> +                       break;
> +               case SW_FLOW_ACTION_TYPE_VLAN_PUSH:
> +                       pr_debug("    vlan push { proto %04x, tci %04x }\n",
> +                                ntohs(action->vlan.vlan_proto),
> +                                ntohs(action->vlan.vlan_tci));
> +                       break;
> +               case SW_FLOW_ACTION_TYPE_VLAN_POP:
> +                       pr_debug("    vlan pop\n");
> +                       break;
> +               }
> +       }
> +}
> +
> +#define PREFIX_NONE "      "
> +#define PREFIX_MASK "  mask"
> +
> +static void print_flow(const struct sw_flow *flow, struct net_device *dev,
> +                      const char *comment)
> +{
> +       pr_debug("%s flow %s (%x-%x):\n", dev->name, comment,
> +                flow->mask->range.start, flow->mask->range.end);
> +       print_flow_key_tun(PREFIX_NONE, &flow->key);
> +       print_flow_key_tun(PREFIX_MASK, &flow->mask->key);
> +       print_flow_key_phy(PREFIX_NONE, &flow->key);
> +       print_flow_key_phy(PREFIX_MASK, &flow->mask->key);
> +       print_flow_key_eth(PREFIX_NONE, &flow->key);
> +       print_flow_key_eth(PREFIX_MASK, &flow->mask->key);
> +       print_flow_key_ip(PREFIX_NONE, &flow->key);
> +       print_flow_key_ip(PREFIX_MASK, &flow->mask->key);
> +       print_flow_key_ipv4(PREFIX_NONE, &flow->key);
> +       print_flow_key_ipv4(PREFIX_MASK, &flow->mask->key);
> +       print_flow_actions(flow->actions);
> +}
> +
> +/**
> + *     swdev_flow_insert - Insert a flow into switch
> + *     @dev: port device
> + *     @flow: flow descriptor
> + *
> + *     Insert a flow into switch this port is part of.
> + */
> +int swdev_flow_insert(struct net_device *dev, const struct sw_flow *flow)
> +{
> +       const struct net_device_ops *ops = dev->netdev_ops;
> +
> +       print_flow(flow, dev, "insert");
> +       if (!ops->ndo_swdev_flow_insert)
> +               return -EOPNOTSUPP;
> +       WARN_ON(!ops->ndo_swdev_get_id);
> +       BUG_ON(!flow->actions);
> +       return ops->ndo_swdev_flow_insert(dev, flow);
> +}
> +EXPORT_SYMBOL(swdev_flow_insert);
> +
> +/**
> + *     swdev_flow_remove - Remove a flow from switch
> + *     @dev: port device
> + *     @flow: flow descriptor
> + *
> + *     Remove a flow from switch this port is part of.
> + */
> +int swdev_flow_remove(struct net_device *dev, const struct sw_flow *flow)
> +{
> +       const struct net_device_ops *ops = dev->netdev_ops;
> +
> +       print_flow(flow, dev, "remove");
> +       if (!ops->ndo_swdev_flow_remove)
> +               return -EOPNOTSUPP;
> +       WARN_ON(!ops->ndo_swdev_get_id);
> +       return ops->ndo_swdev_flow_remove(dev, flow);
> +}
> +EXPORT_SYMBOL(swdev_flow_remove);
> --
> 1.9.3
>



-- 
Florian

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 07/12] dsa: implement ndo_swdev_get_id
       [not found]     ` <CAGVrzcbs1yGb5RW++XZ=2PFsqUjZGVGfWx5=QQYcEX6x4WOq9Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-08-21 17:06       ` Jiri Pirko
       [not found]         ` <20140821170645.GB10633-6KJVSR23iU5sFDB2n11ItA@public.gmane.org>
  0 siblings, 1 reply; 87+ messages in thread
From: Jiri Pirko @ 2014-08-21 17:06 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Sergey Ryazanov, jasowang-H+wXaHxf7aLQT0dZR+AlfA, John Fastabend,
	Neil Jerram, Eric Dumazet, Andy Gospodarek, dev, Felix Fietkau,
	ronye-VPRAkNaXOzVWk0Htik3J/w, Jeff Kirsher, ogerlitz,
	Ben Hutchings, Lennert Buytenhek, Roopa Prabhu, Jamal Hadi Salim,
	Aviad Raveh, Nicolas Dichtel, vyasevic, Neil Horman, netdev,
	Stephen Hemminger, dborkman, Eric W. Biederman, David

Thu, Aug 21, 2014 at 06:56:13PM CEST, f.fainelli-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org wrote:
>2014-08-21 9:19 GMT-07:00 Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>:
>> Signed-off-by: Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
>> ---
>>  net/dsa/Kconfig |  2 +-
>>  net/dsa/slave.c | 16 ++++++++++++++++
>>  2 files changed, 17 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/dsa/Kconfig b/net/dsa/Kconfig
>> index f5eede1..66c445a 100644
>> --- a/net/dsa/Kconfig
>> +++ b/net/dsa/Kconfig
>> @@ -1,6 +1,6 @@
>>  config HAVE_NET_DSA
>>         def_bool y
>> -       depends on NETDEVICES && !S390
>> +       depends on NETDEVICES && NET_SWITCHDEV && !S390
>>
>>  # Drivers must select NET_DSA and the appropriate tagging format
>>
>> diff --git a/net/dsa/slave.c b/net/dsa/slave.c
>> index 45a1e34..e069ba3 100644
>> --- a/net/dsa/slave.c
>> +++ b/net/dsa/slave.c
>> @@ -171,6 +171,19 @@ static int dsa_slave_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
>>         return -EOPNOTSUPP;
>>  }
>>
>> +static int dsa_slave_swdev_get_id(struct net_device *dev,
>> +                                 struct netdev_phys_item_id *psid)
>> +{
>> +       struct dsa_slave_priv *p = netdev_priv(dev);
>> +       struct dsa_switch *ds = p->parent;
>> +       u64 tmp = (u64) ds;
>> +
>> +       /* TODO: add more sophisticated id generation */
>> +       memcpy(&psid->id, &tmp, sizeof(tmp));
>> +       psid->id_len = sizeof(tmp);
>
>There is already an unique id generated, which is the index in the
>switch tree, and which is stored in struct dsa_switch, so this could
>probably be simplified to:
>
>psid->id = ds->index

That index is 0..n if I understand correctly. That is not enough.
The point is to have a unique id for every chip in the system. If we
just used 0, 1, 2..., collisions would be very likely.

>--
>Florian


* Re: [patch net-next RFC 07/12] dsa: implement ndo_swdev_get_id
       [not found]         ` <20140821170645.GB10633-6KJVSR23iU5sFDB2n11ItA@public.gmane.org>
@ 2014-08-21 17:12           ` Florian Fainelli
       [not found]             ` <CAGVrzcb=vkqPw2LUc4YO4Bs-eady2=1uN-jkG=kW2RnGx=24PQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2014-08-23 11:33           ` Eric W. Biederman
  1 sibling, 1 reply; 87+ messages in thread
From: Florian Fainelli @ 2014-08-21 17:12 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Sergey Ryazanov, jasowang-H+wXaHxf7aLQT0dZR+AlfA, John Fastabend,
	Neil Jerram, Eric Dumazet, Andy Gospodarek, dev, Felix Fietkau,
	ronye-VPRAkNaXOzVWk0Htik3J/w, Jeff Kirsher, ogerlitz,
	Ben Hutchings, Lennert Buytenhek, Roopa Prabhu, Jamal Hadi Salim,
	Aviad Raveh, Nicolas Dichtel, vyasevic, Neil Horman, netdev,
	Stephen Hemminger, dborkman, Eric W. Biederman, David

2014-08-21 10:06 GMT-07:00 Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>:
> Thu, Aug 21, 2014 at 06:56:13PM CEST, f.fainelli-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org wrote:
>>2014-08-21 9:19 GMT-07:00 Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>:
>>> Signed-off-by: Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
>>> ---
>>>  net/dsa/Kconfig |  2 +-
>>>  net/dsa/slave.c | 16 ++++++++++++++++
>>>  2 files changed, 17 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/net/dsa/Kconfig b/net/dsa/Kconfig
>>> index f5eede1..66c445a 100644
>>> --- a/net/dsa/Kconfig
>>> +++ b/net/dsa/Kconfig
>>> @@ -1,6 +1,6 @@
>>>  config HAVE_NET_DSA
>>>         def_bool y
>>> -       depends on NETDEVICES && !S390
>>> +       depends on NETDEVICES && NET_SWITCHDEV && !S390
>>>
>>>  # Drivers must select NET_DSA and the appropriate tagging format
>>>
>>> diff --git a/net/dsa/slave.c b/net/dsa/slave.c
>>> index 45a1e34..e069ba3 100644
>>> --- a/net/dsa/slave.c
>>> +++ b/net/dsa/slave.c
>>> @@ -171,6 +171,19 @@ static int dsa_slave_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
>>>         return -EOPNOTSUPP;
>>>  }
>>>
>>> +static int dsa_slave_swdev_get_id(struct net_device *dev,
>>> +                                 struct netdev_phys_item_id *psid)
>>> +{
>>> +       struct dsa_slave_priv *p = netdev_priv(dev);
>>> +       struct dsa_switch *ds = p->parent;
>>> +       u64 tmp = (u64) ds;
>>> +
>>> +       /* TODO: add more sophisticated id generation */
>>> +       memcpy(&psid->id, &tmp, sizeof(tmp));
>>> +       psid->id_len = sizeof(tmp);
>>
>>There is already an unique id generated, which is the index in the
>>switch tree, and which is stored in struct dsa_switch, so this could
>>probably be simplified to:
>>
>>psid->id = ds->index
>
> That index is 0..n if I understand that correctly. That is not enough.
> The point is to have unique id for every chip in the system. If we would
> have 0,1,2... the collision is very likely.

Good point. A unique index for DSA switches could look like the
DSA platform device id plus the switch index in the tree, but then
we would need something like (pdev->id << N) | switch index, and that
would not give a consistent naming scheme across different devices.

Maybe we are just better off using the Linux IDR API in include/linux/idr.h?
-- 
Florian


* Re: [patch net-next RFC 12/12] rocker: introduce rocker switch driver
  2014-08-21 16:19 ` [patch net-next RFC 12/12] rocker: introduce rocker switch driver Jiri Pirko
@ 2014-08-21 17:19   ` Florian Fainelli
  2014-08-23 14:04   ` Thomas Graf
  1 sibling, 0 replies; 87+ messages in thread
From: Florian Fainelli @ 2014-08-21 17:19 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, David Miller, Neil Horman, Andy Gospodarek, tgraf,
	dborkman, ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, Jeff Kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Jamal Hadi Salim, Scott Feldman,
	Roopa Prabhu, John Linville, dev, jasowang, Eric W. Biederman

2014-08-21 9:19 GMT-07:00 Jiri Pirko <jiri@resnulli.us>:
> This patch introduces the first driver to benefit from the switchdev
> infrastructure and to implement newly introduced switch ndos. This is a
> driver for emulated switch chip implemented in qemu:
> https://github.com/sfeldma/qemu-rocker/
>
> This patch is a result of joint work with Scott Feldman.

You could eliminate a lot of the boilerplate code that delegates
ethtool/netdev operations from the network interface to the switch
driver if you made this a DSA driver, either without registering a tag
protocol or by reworking that part.

Other than that, this is a really nice piece of driver you have here;
it's definitely a clean design. Thanks!
--
Florian


* Re: [patch net-next RFC 07/12] dsa: implement ndo_swdev_get_id
       [not found]             ` <CAGVrzcb=vkqPw2LUc4YO4Bs-eady2=1uN-jkG=kW2RnGx=24PQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-08-22  9:05               ` David Laight
  0 siblings, 0 replies; 87+ messages in thread
From: David Laight @ 2014-08-22  9:05 UTC (permalink / raw)
  To: 'Florian Fainelli', Jiri Pirko
  Cc: Sergey Ryazanov, jasowang-H+wXaHxf7aLQT0dZR+AlfA, John Fastabend,
	Neil Jerram, Eric Dumazet, Andy Gospodarek, dev, Felix Fietkau,
	ronye-VPRAkNaXOzVWk0Htik3J/w, Jeff Kirsher, ogerlitz,
	Ben Hutchings, Lennert Buytenhek, Roopa Prabhu, Jamal Hadi Salim,
	Aviad Raveh, Nicolas Dichtel, vyasevic, Neil Horman, netdev,
	Stephen Hemminger, dborkman, Eric W. Biederman

From: Florian Fainelli
> 2014-08-21 10:06 GMT-07:00 Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>:
> > Thu, Aug 21, 2014 at 06:56:13PM CEST, f.fainelli-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org wrote:
> >>2014-08-21 9:19 GMT-07:00 Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>:
> >>> Signed-off-by: Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
> >>> ---
> >>>  net/dsa/Kconfig |  2 +-
> >>>  net/dsa/slave.c | 16 ++++++++++++++++
> >>>  2 files changed, 17 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/net/dsa/Kconfig b/net/dsa/Kconfig
> >>> index f5eede1..66c445a 100644
> >>> --- a/net/dsa/Kconfig
> >>> +++ b/net/dsa/Kconfig
> >>> @@ -1,6 +1,6 @@
> >>>  config HAVE_NET_DSA
> >>>         def_bool y
> >>> -       depends on NETDEVICES && !S390
> >>> +       depends on NETDEVICES && NET_SWITCHDEV && !S390
> >>>
> >>>  # Drivers must select NET_DSA and the appropriate tagging format
> >>>
> >>> diff --git a/net/dsa/slave.c b/net/dsa/slave.c
> >>> index 45a1e34..e069ba3 100644
> >>> --- a/net/dsa/slave.c
> >>> +++ b/net/dsa/slave.c
> >>> @@ -171,6 +171,19 @@ static int dsa_slave_ioctl(struct net_device *dev, struct ifreq *ifr, int
> cmd)
> >>>         return -EOPNOTSUPP;
> >>>  }
> >>>
> >>> +static int dsa_slave_swdev_get_id(struct net_device *dev,
> >>> +                                 struct netdev_phys_item_id *psid)
> >>> +{
> >>> +       struct dsa_slave_priv *p = netdev_priv(dev);
> >>> +       struct dsa_switch *ds = p->parent;
> >>> +       u64 tmp = (u64) ds;
> >>> +
> >>> +       /* TODO: add more sophisticated id generation */
> >>> +       memcpy(&psid->id, &tmp, sizeof(tmp));
> >>> +       psid->id_len = sizeof(tmp);
> >>
> >>There is already an unique id generated, which is the index in the
> >>switch tree, and which is stored in struct dsa_switch, so this could
> >>probably be simplified to:
> >>
> >>psid->id = ds->index
> >
> > That index is 0..n if I understand that correctly. That is not enough.
> > The point is to have unique id for every chip in the system. If we would
> > have 0,1,2... the collision is very likely.
> 
> Good point, so an unique index for DSA switches could look like the
> DSA platform device id plus the switch index in the tree..., but then
> we would need something like (pdev->id << N) | switch index, so that
> would not give a consistent naming scheme across different devices.

Do you also need to worry about the 'lifetime' of these ids?
In which case some of the high bits need to be used as 'generation number'.

	David

> Maybe we are just better with using the Linux IDR API in include/linux/idr.h?
> --
> Florian


* Re: [patch net-next RFC 03/12] net: introduce generic switch devices support
       [not found]         ` <CAGVrzcYtnpcP4pfCJ0GSya01LTk0WwbSV1f+voF2K=S5CR3Arg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-08-22 12:42           ` Jamal Hadi Salim
  2014-08-22 12:56             ` Jiri Pirko
  0 siblings, 1 reply; 87+ messages in thread
From: Jamal Hadi Salim @ 2014-08-22 12:42 UTC (permalink / raw)
  To: Florian Fainelli, Jiri Pirko
  Cc: Sergey Ryazanov, jasowang-H+wXaHxf7aLQT0dZR+AlfA, John Fastabend,
	Neil Jerram, Eric Dumazet, Andy Gospodarek, dev, Felix Fietkau,
	ronye-VPRAkNaXOzVWk0Htik3J/w, Jeff Kirsher, ogerlitz,
	Ben Hutchings, Lennert Buytenhek, Roopa Prabhu, Aviad Raveh,
	Nicolas Dichtel, vyasevic, Neil Horman, netdev,
	Stephen Hemminger, dborkman, Eric W. Biederman, David Miller

On 08/21/14 13:05, Florian Fainelli wrote:
> 2014-08-21 9:18 GMT-07:00 Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>:
>> The goal of this is to provide a possibility to suport various switch
>> chips. Drivers should implement relevant ndos to do so. Now there is a
>> couple of ndos defines:
>> - for getting physical switch id is in place.
>> - for work with flows.
>>
>> Note that user can use random port netdevice to access the switch.
>
> I read through this patch set, and I still think that DSA is the
> generic switch infrastructure we already have because it does provide
> the following:
>
> - taking a generic platform data structure (C struct or Device Tree),
> validate, parse it and map it to internal kernel structures
> - instantiate per-port network devices based on the configuration data provided
> - delegate netdev_ops to the switch driver and/or the CPU NIC when relevant
> - provide support for hooking RX and TX traffic coming from the CPU NIC
>
> I would rather we build on the existing DSA infrastructure and add the
> flow-related netdev_ops rather than having the two remain in
> disconnect while flow-oriented switches driver get progressively
> added. I guess I should take a closer look at the rocker driver to see
> how hard would that be for you.
>
> What do you think?


I thought we had concluded that DSA was a good path forward?  Or maybe 
at this stage we need several alternative approaches that eventually 
converge?

cheers,
jamal


* Re: [patch net-next RFC 03/12] net: introduce generic switch devices support
  2014-08-22 12:42           ` Jamal Hadi Salim
@ 2014-08-22 12:56             ` Jiri Pirko
  2014-08-22 19:14               ` John Fastabend
       [not found]               ` <20140822125655.GB1916-6KJVSR23iU488b5SBfVpbw@public.gmane.org>
  0 siblings, 2 replies; 87+ messages in thread
From: Jiri Pirko @ 2014-08-22 12:56 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Florian Fainelli, netdev, David Miller, Neil Horman,
	Andy Gospodarek, tgraf, dborkman, ogerlitz, jesse, pshelar,
	azhou, Ben Hutchings, Stephen Hemminger, Jeff Kirsher, vyasevic,
	Cong Wang, John Fastabend, Eric Dumazet, Scott Feldman,
	Roopa Prabhu, John Linville, dev, jasowang, Eric W. Biederman

Fri, Aug 22, 2014 at 02:42:04PM CEST, jhs@mojatatu.com wrote:
>On 08/21/14 13:05, Florian Fainelli wrote:
>>2014-08-21 9:18 GMT-07:00 Jiri Pirko <jiri@resnulli.us>:
>>>The goal of this is to provide a possibility to suport various switch
>>>chips. Drivers should implement relevant ndos to do so. Now there is a
>>>couple of ndos defines:
>>>- for getting physical switch id is in place.
>>>- for work with flows.
>>>
>>>Note that user can use random port netdevice to access the switch.
>>
>>I read through this patch set, and I still think that DSA is the
>>generic switch infrastructure we already have because it does provide
>>the following:
>>
>>- taking a generic platform data structure (C struct or Device Tree),
>>validate, parse it and map it to internal kernel structures
>>- instantiate per-port network devices based on the configuration data provided
>>- delegate netdev_ops to the switch driver and/or the CPU NIC when relevant
>>- provide support for hooking RX and TX traffic coming from the CPU NIC
>>
>>I would rather we build on the existing DSA infrastructure and add the
>>flow-related netdev_ops rather than having the two remain in
>>disconnect while flow-oriented switches driver get progressively
>>added. I guess I should take a closer look at the rocker driver to see
>>how hard would that be for you.
>>
>>What do you think?
>
>
>I thought we had concluded that DSA was a good path forward?  Or maybe at
>this stage we need to have several alternative approaches
>and we eventually converge?

That is true. I'm still unsure how to fit this onto DSA, or how to change DSA
so that this fits. That is my quest now. Will report back in a week or so.

>
>cheers,
>jamal
>
>


* Re: [patch net-next RFC 04/12] rtnl: expose physical switch id for particular device
       [not found]   ` <1408637945-10390-5-git-send-email-jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
@ 2014-08-22 19:08     ` John Fastabend
       [not found]       ` <53F79537.20207-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 87+ messages in thread
From: John Fastabend @ 2014-08-22 19:08 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, andy-QlMahl40kYEqcZcGjlUOXw,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, ronye-VPRAkNaXOzVWk0Htik3J/w,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, ben-/+tVBieCtBitmTQ+vhA3Yw,
	buytenh-OLH4Qvv75CYX/NnBR394Jw,
	roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
	jhs-jkUAjuhPggJWk0Htik3J/w, aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, nhorman-2XuSBdqkA4R54TAoqtyWWQ,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	dborkman-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q

On 08/21/2014 09:18 AM, Jiri Pirko wrote:
> The netdevice represents a port in a switch, it will expose
> IFLA_PHYS_SWITCH_ID value via rtnl. Two netdevices with the same value
> belong to one physical switch.
>
> Signed-off-by: Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>

What is the relation between phys_port_id and phys_switch_id?

phys_port_id was intended to identify a set of ports that belong
to a single uplink port,


	eth0     eth1    eth2   eth3      (host facing)
           |       |       |      |
           |       |       |      |
       +---+-------+-------+------+---+
       |      embedded switch         |
       +------------------------------+
                      |
                     MAC                   (network)

In the NIC case there is a simple switch with a port to the
network, which we currently don't represent with a netdev. Any
netdevs that report the same phys_switch_id are behind the same
embedded switch.

In the switch id case we are indicating the port is attached to
the same embedded switch as well.

          eth0 eth1 eth2 eth3
           |    |    |    |
      +----+----+----+----+----+
      |         switch         |
      +----+----+----+----+----+

but they do not share an uplink port? So in this case each ethx
has a unique phys_port_id but the same phys_switch_id?

In the first case, both phys_port_id and phys_switch_id should
be equal for all interfaces, correct?

Is that clear/useful at all? We need to document this somewhere
if/when the patches are submitted; otherwise I doubt we will get it
consistently right across drivers. There could, for example, be
somewhat strange devices with virtual functions hanging off of the
switch.

Thanks,
John

-- 
John Fastabend         Intel Corporation


* Re: [patch net-next RFC 03/12] net: introduce generic switch devices support
  2014-08-22 12:56             ` Jiri Pirko
@ 2014-08-22 19:14               ` John Fastabend
       [not found]                 ` <53F7969C.1060509-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
       [not found]               ` <20140822125655.GB1916-6KJVSR23iU488b5SBfVpbw@public.gmane.org>
  1 sibling, 1 reply; 87+ messages in thread
From: John Fastabend @ 2014-08-22 19:14 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Jamal Hadi Salim, Florian Fainelli, netdev, David Miller,
	Neil Horman, Andy Gospodarek, tgraf, dborkman, ogerlitz, jesse,
	pshelar, azhou, Ben Hutchings, Stephen Hemminger, Jeff Kirsher,
	vyasevic, Cong Wang, John Fastabend, Eric Dumazet, Scott Feldman,
	Roopa Prabhu, John Linville, dev, jasowang,

On 08/22/2014 05:56 AM, Jiri Pirko wrote:
> Fri, Aug 22, 2014 at 02:42:04PM CEST, jhs@mojatatu.com wrote:
>> On 08/21/14 13:05, Florian Fainelli wrote:
>>> 2014-08-21 9:18 GMT-07:00 Jiri Pirko <jiri@resnulli.us>:
>>>> The goal of this is to provide a possibility to suport various switch
>>>> chips. Drivers should implement relevant ndos to do so. Now there is a
>>>> couple of ndos defines:
>>>> - for getting physical switch id is in place.
>>>> - for work with flows.
>>>>
>>>> Note that user can use random port netdevice to access the switch.
>>>
>>> I read through this patch set, and I still think that DSA is the
>>> generic switch infrastructure we already have because it does provide
>>> the following:
>>>
>>> - taking a generic platform data structure (C struct or Device Tree),
>>> validate, parse it and map it to internal kernel structures
>>> - instantiate per-port network devices based on the configuration data provided
>>> - delegate netdev_ops to the switch driver and/or the CPU NIC when relevant
>>> - provide support for hooking RX and TX traffic coming from the CPU NIC
>>>
>>> I would rather we build on the existing DSA infrastructure and add the
>>> flow-related netdev_ops rather than having the two remain in
>>> disconnect while flow-oriented switches driver get progressively
>>> added. I guess I should take a closer look at the rocker driver to see
>>> how hard would that be for you.
>>>
>>> What do you think?
>>
>>
>> I thought we had concluded that DSA was a good path forward?  Or maybe at
>> this stage we need to have several alternative approaches
>> and we eventually converge?
>
> That is true. I'm still unsure how to fit this on to DSA or how to change DSA
> the way this fits. This is my quest now. Will report back in a week or so.
>

I would like to use the flow ops in some of our NICs that have
a limited flow table in hardware. It might be easier to use the
NICs as the first implementers of the API, even though their flow
tables are usually not as capable or large as those in some of the
larger switch ASICs.

In my opinion it can replace the ioctl flow director APIs, although
I don't like how it is tied to OVS in some of the later RFC patches;
but we can work on that.

.John

-- 
John Fastabend         Intel Corporation


* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
       [not found]   ` <1408637945-10390-11-git-send-email-jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
@ 2014-08-22 19:39     ` John Fastabend
       [not found]       ` <53F79C54.5050701-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 87+ messages in thread
From: John Fastabend @ 2014-08-22 19:39 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, andy-QlMahl40kYEqcZcGjlUOXw,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, ronye-VPRAkNaXOzVWk0Htik3J/w,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, ben-/+tVBieCtBitmTQ+vhA3Yw,
	buytenh-OLH4Qvv75CYX/NnBR394Jw,
	roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
	jhs-jkUAjuhPggJWk0Htik3J/w, aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, nhorman-2XuSBdqkA4R54TAoqtyWWQ,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	dborkman-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q

On 08/21/2014 09:19 AM, Jiri Pirko wrote:
> Benefit from the possibility to work with flows in switch devices and
> use the swdev api to offload flow datapath.

We should add a description here of the strategy being used.

If I read this correctly this will try to add any flow to the
hardware along with the actions and duplicate it in software.

There are a couple things I don't like,

  - this requires OVS to be loaded to work. If all I want is
    direct access to the hardware flow tables, requiring openvswitch.ko
    shouldn't be needed IMO. For example, I may want to use the
    hardware flow tables with something that is not openvswitch, and we
    shouldn't preclude that.

  - Also there is no programmatic way to learn which flows are
    in hardware and which in software. There is a pr_warn but
    that doesn't help when interacting with the hardware remotely.
    I need some mechanism to dump the set of hardware tables and
    the set of software tables.

  - Simply duplicating the software flow/action into
    hardware may not use the hardware tables optimally, if I have
    a TCAM in hardware for instance. (This is how I read the patch;
    let me know if I missed something.)

  - I need a way to specify put this flow/action in hardware,
    put this flow/action in software, or put this in both software
    and hardware.

    We did this with a bitmask in the fdb L2 stuff and it seems to
    work reasonable well so maybe something like that would help.

    For example, if I don't have this, what happens if I have an
    entry to decrement TTL in both hardware and software? If the
    flow hits both the hardware path and the software path, the TTL
    gets decremented twice. Here userspace needs to indicate where to
    do the decrement to avoid the duplication.

I think if we can pull this out OVS and add the hw/sw bitmask (or
maybe a better implementation of that idea) then this should work
for the stuff I'm looking at. I want to try and get it working on
the i40e driver as a fdir replacement but it might take me a bit
to get to it.


Thanks,
John


-- 
John Fastabend         Intel Corporation


* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
       [not found]       ` <53F79C54.5050701-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2014-08-22 22:53         ` Scott Feldman
       [not found]           ` <464DB0A8-0073-4CE0-9483-0F36B73A53A1-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR@public.gmane.org>
  0 siblings, 1 reply; 87+ messages in thread
From: Scott Feldman @ 2014-08-22 22:53 UTC (permalink / raw)
  To: John Fastabend
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, andy-QlMahl40kYEqcZcGjlUOXw,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, ronye-VPRAkNaXOzVWk0Htik3J/w,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, ben-/+tVBieCtBitmTQ+vhA3Yw,
	buytenh-OLH4Qvv75CYX/NnBR394Jw, Jiri Pirko,
	roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
	jhs-jkUAjuhPggJWk0Htik3J/w, aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, nhorman-2XuSBdqkA4R54TAoqtyWWQ,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	dborkman-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q


On Aug 22, 2014, at 12:39 PM, John Fastabend <john.fastabend-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> On 08/21/2014 09:19 AM, Jiri Pirko wrote:
>> Benefit from the possibility to work with flows in switch devices and
>> use the swdev api to offload flow datapath.
> 
> we should add a description here on the strategy being used.
> 
> If I read this correctly this will try to add any flow to the
> hardware along with the actions and duplicate it in software.
> 
> There are a couple things I don't like,
> 
> - this requires OVS to be loaded to work. If all I want is
>   direct access to the hardware flow tables requiring openvswitch.ko
>   shouldn't be needed IMO. For example I may want to use the
>   hardware flow tables with something not openvswitch and we
>   shouldn't preclude that.
> 

The intent is to use openvswitch.ko's struct sw_flow to program hardware via the ndo_swdev_flow_* ops, but otherwise be independent of OVS.  The upper layer of the driver is struct sw_flow, and any module above the driver can construct a struct sw_flow and push it down via ndo_swdev_flow_*.  So your non-OVS use-case should be handled; OVS is just another use-case.  struct sw_flow should not be OVS-aware, but rather a generic flow match/action description sufficient to offload the data plane to HW.

> - Also there is no programmatic way to learn which flows are
>   in hardware and which in software. There is a pr_warn but
>   that doesn't help when interacting with the hardware remotely.
>   I need some mechanism to dump the set of hardware tables and
>   the set of software tables.

Agreed, we need a way to annotate which flows are installed in hardware.

> - Simply duplicating the software flow/action into
>   hardware may not optimally use the hardware tables. If I have
>   a TCAM in hardware for instance. (This is how I read the patch
>   let me know if I missed something)

The hardware-specific driver is the right place to handle optimizing the flow/action in hardware since only the driver can know the size/shape of the device.  struct sw_flow is a generic flow description; how (or if) a flow gets programmed into hardware must be handled in the swdev driver.  If the device driver can’t make the sw_flow fit into HW because of resource limitations or the flow simply can’t be represented in HW, then the flow is SW only.  

In the rocker driver posted in this patch set, the steps are to parse the struct sw_flow to figure out what type of flow match/action we’re dealing with (L2 or L3 or L4, ucast or mcast, ipv4 or ipv6, etc) and then install the correct entries into the corresponding device tables within the constraints of the device’s pipeline.  Any optimizations, like coalescing HW entries, are something only the driver can do.

> 
> - I need a way to specify put this flow/action in hardware,
>   put this flow/action in software, or put this in both software
>   and hardware.
> 

This seems above the swdev layer.  In other words, don’t call ndo_swdev_flow_* if you don’t want the flow match/action installed in HW.

>   We did this with a bitmask in the fdb L2 stuff and it seems to
>   work reasonable well so maybe something like that would help.
> 
>   For example if I don't have this what happens if I have an
>   entry to decrement TTL in both hardware and software. If the
>   flow hits both the hardware path and software path the TTL
>   gets decremented. Here userspace needs to indicate where to
>   do the decrement to avoid the duplication.

I’m not following why a flow would hit both HW and SW paths.  That seems bad, negating the effort of offloading the flow to HW in the first place.  My simple view is that if a flow hits the HW path, the SW path is unaware of it.  Clearly work is needed to provide a coherent view to the user with respect to stat counters and such, but I believe it's do-able.

> 
> I think if we can pull this out OVS and add the hw/sw bitmask (or
> maybe a better implementation of that idea) then this should work
> for the stuff I'm looking at. I want to try and get it working on
> the i40e driver as a fdir replacement but it might take me a bit
> to get to it.

That sounds cool and would really help get the interface in place.  Take another look at the way Jiri has busted out sw_flow.h and see if this works for you outside of an OVS context.  If not, we need to fix it.


> 
> Thanks,
> John
> 
> 
> -- 
> John Fastabend         Intel Corporation
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-scott

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 03/12] net: introduce generic switch devices support
       [not found]                 ` <53F7969C.1060509-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2014-08-22 23:12                   ` Scott Feldman
  0 siblings, 0 replies; 87+ messages in thread
From: Scott Feldman @ 2014-08-22 23:12 UTC (permalink / raw)
  To: John Fastabend
  Cc: Sergey Ryazanov, jasowang-H+wXaHxf7aLQT0dZR+AlfA, John Fastabend,
	Neil Jerram, Eric Dumazet, Andy Gospodarek, dev, Felix Fietkau,
	Florian Fainelli, ronye-VPRAkNaXOzVWk0Htik3J/w, Jeff Kirsher,
	ogerlitz, Ben Hutchings, Lennert Buytenhek, Jiri Pirko,
	Roopa Prabhu, Jamal Hadi Salim, Aviad Raveh, Nicolas Dichtel,
	vyasevic, Neil Horman, netdev, Stephen Hemminger, dborkman


On Aug 22, 2014, at 12:14 PM, John Fastabend <john.fastabend-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> On 08/22/2014 05:56 AM, Jiri Pirko wrote:
>> Fri, Aug 22, 2014 at 02:42:04PM CEST, jhs-jkUAjuhPggJWk0Htik3J/w@public.gmane.org wrote:
>>> On 08/21/14 13:05, Florian Fainelli wrote:
>>>> 2014-08-21 9:18 GMT-07:00 Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>:
>>>>> The goal of this is to provide a possibility to suport various switch
>>>>> chips. Drivers should implement relevant ndos to do so. Now there is a
>>>>> couple of ndos defines:
>>>>> - for getting physical switch id is in place.
>>>>> - for work with flows.
>>>>> 
>>>>> Note that user can use random port netdevice to access the switch.
>>>> 
>>>> I read through this patch set, and I still think that DSA is the
>>>> generic switch infrastructure we already have because it does provide
>>>> the following:
>>>> 
>>>> - taking a generic platform data structure (C struct or Device Tree),
>>>> validate, parse it and map it to internal kernel structures
>>>> - instantiate per-port network devices based on the configuration data provided
>>>> - delegate netdev_ops to the switch driver and/or the CPU NIC when relevant
>>>> - provide support for hooking RX and TX traffic coming from the CPU NIC
>>>> 
>>>> I would rather we build on the existing DSA infrastructure and add the
>>>> flow-related netdev_ops rather than having the two remain in
>>>> disconnect while flow-oriented switches driver get progressively
>>>> added. I guess I should take a closer look at the rocker driver to see
>>>> how hard would that be for you.
>>>> 
>>>> What do you think?
>>> 
>>> 
>>> I thought we had concluded that DSA was a good path forward?  Or maybe at
>>> this stage we need to have several alternative approaches
>>> and we eventually converge?
>> 
>> That is true. I'm still unsure how to fit this on to DSA or how to change DSA
>> the way this fits. This is my quest now. Will report back in a week or so.
>> 
> 
> I would like to use the flow ops in some of our NICs that have
> a limited flow table in hardware. It might be easier to use the
> NICs as the first implementers of the API even though they are
> usually not as capable or large as flow tables in some of the
> larger switch asics.
> 
> In my opinion it can replace the ioctl flow director APIs although
> I don't like how it is tied to OVS in the some of the later RFC patches
> but we can work on that.

I think parallel efforts to get flow ops working on a real NIC with limited capabilities as well as on our fake enterprise-class rocker switch with full capabilities will really help solidify the swdev API.  Maybe someone in the background who has access to a real enterprise switch can even play along ;).  I hope you’re convinced from my other reply that the intent of swdev is to be independent of OVS.  If the implementation in the RFC doesn’t match the intent, we should fix it.  The elements of the swdev API we need convergence on are:

	- ndo_swdev_* ops to identify switch port and add/del flow match/action
	- struct sw_flow as generic flow match/action description
	- port netdevs to represent physical device ports

And maybe:

	- new device class to represent the switch itself
	- capabilities to express HW capabilities (limits, constraints, etc)


-scott

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 03/12] net: introduce generic switch devices support
       [not found]               ` <20140822125655.GB1916-6KJVSR23iU488b5SBfVpbw@public.gmane.org>
@ 2014-08-23  1:02                 ` Florian Fainelli
       [not found]                   ` <CAGVrzcZS=Y2stxSNMfVjWTpPT8GoDOpOD9tExnDnoF0jj_owoQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 87+ messages in thread
From: Florian Fainelli @ 2014-08-23  1:02 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Sergey Ryazanov, jasowang-H+wXaHxf7aLQT0dZR+AlfA, John Fastabend,
	Neil Jerram, Eric Dumazet, Andy Gospodarek, dev, Felix Fietkau,
	ronye-VPRAkNaXOzVWk0Htik3J/w, Jeff Kirsher, ogerlitz,
	Ben Hutchings, Lennert Buytenhek, Roopa Prabhu, Jamal Hadi Salim,
	Aviad Raveh, Nicolas Dichtel, vyasevic, Neil Horman, netdev,
	Stephen Hemminger, dborkman, Eric W. Biederman, David

2014-08-22 5:56 GMT-07:00 Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>:
> Fri, Aug 22, 2014 at 02:42:04PM CEST, jhs-jkUAjuhPggJWk0Htik3J/w@public.gmane.org wrote:
>>On 08/21/14 13:05, Florian Fainelli wrote:
>>>2014-08-21 9:18 GMT-07:00 Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>:
>>>>The goal of this is to provide a possibility to suport various switch
>>>>chips. Drivers should implement relevant ndos to do so. Now there is a
>>>>couple of ndos defines:
>>>>- for getting physical switch id is in place.
>>>>- for work with flows.
>>>>
>>>>Note that user can use random port netdevice to access the switch.
>>>
>>>I read through this patch set, and I still think that DSA is the
>>>generic switch infrastructure we already have because it does provide
>>>the following:
>>>
>>>- taking a generic platform data structure (C struct or Device Tree),
>>>validate, parse it and map it to internal kernel structures
>>>- instantiate per-port network devices based on the configuration data provided
>>>- delegate netdev_ops to the switch driver and/or the CPU NIC when relevant
>>>- provide support for hooking RX and TX traffic coming from the CPU NIC
>>>
>>>I would rather we build on the existing DSA infrastructure and add the
>>>flow-related netdev_ops rather than having the two remain in
>>>disconnect while flow-oriented switches driver get progressively
>>>added. I guess I should take a closer look at the rocker driver to see
>>>how hard would that be for you.
>>>
>>>What do you think?
>>
>>
>>I thought we had concluded that DSA was a good path forward?  Or maybe at
>>this stage we need to have several alternative approaches
>>and we eventually converge?
>
> That is true. I'm still unsure how to fit this on to DSA or how to change DSA
> the way this fits. This is my quest now. Will report back in a week or so.

I don't want to hold off this patch series, so let's proceed with your
submission, since I believe John Fastabend would also directly benefit
from this.

In the meantime, I will keep working on DSA, and prototype changes
with the rocker driver.

Once we are confident we have bridged the gap, we can unify things.
How does that sound?
-- 
Florian

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 03/12] net: introduce generic switch devices support
       [not found]                   ` <CAGVrzcZS=Y2stxSNMfVjWTpPT8GoDOpOD9tExnDnoF0jj_owoQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-08-23  9:17                     ` Jiri Pirko
  0 siblings, 0 replies; 87+ messages in thread
From: Jiri Pirko @ 2014-08-23  9:17 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Sergey Ryazanov, jasowang-H+wXaHxf7aLQT0dZR+AlfA, John Fastabend,
	Neil Jerram, Eric Dumazet, Andy Gospodarek, dev, Felix Fietkau,
	ronye-VPRAkNaXOzVWk0Htik3J/w, Jeff Kirsher, ogerlitz,
	Ben Hutchings, Lennert Buytenhek, Roopa Prabhu, Jamal Hadi Salim,
	Aviad Raveh, Nicolas Dichtel, vyasevic, Neil Horman, netdev,
	Stephen Hemminger, dborkman, Eric W. Biederman, David

Sat, Aug 23, 2014 at 03:02:10AM CEST, f.fainelli-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org wrote:
>2014-08-22 5:56 GMT-07:00 Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>:
>> Fri, Aug 22, 2014 at 02:42:04PM CEST, jhs-jkUAjuhPggJWk0Htik3J/w@public.gmane.org wrote:
>>>On 08/21/14 13:05, Florian Fainelli wrote:
>>>>2014-08-21 9:18 GMT-07:00 Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>:
>>>>>The goal of this is to provide a possibility to suport various switch
>>>>>chips. Drivers should implement relevant ndos to do so. Now there is a
>>>>>couple of ndos defines:
>>>>>- for getting physical switch id is in place.
>>>>>- for work with flows.
>>>>>
>>>>>Note that user can use random port netdevice to access the switch.
>>>>
>>>>I read through this patch set, and I still think that DSA is the
>>>>generic switch infrastructure we already have because it does provide
>>>>the following:
>>>>
>>>>- taking a generic platform data structure (C struct or Device Tree),
>>>>validate, parse it and map it to internal kernel structures
>>>>- instantiate per-port network devices based on the configuration data provided
>>>>- delegate netdev_ops to the switch driver and/or the CPU NIC when relevant
>>>>- provide support for hooking RX and TX traffic coming from the CPU NIC
>>>>
>>>>I would rather we build on the existing DSA infrastructure and add the
>>>>flow-related netdev_ops rather than having the two remain in
>>>>disconnect while flow-oriented switches driver get progressively
>>>>added. I guess I should take a closer look at the rocker driver to see
>>>>how hard would that be for you.
>>>>
>>>>What do you think?
>>>
>>>
>>>I thought we had concluded that DSA was a good path forward?  Or maybe at
>>>this stage we need to have several alternative approaches
>>>and we eventually converge?
>>
>> That is true. I'm still unsure how to fit this on to DSA or how to change DSA
>> the way this fits. This is my quest now. Will report back in a week or so.
>
>I don't want to hold off this patch series, so let's proceed with your
>submission, since I believe John Fastabend would also directly benefit
>from this.
>
>In the meantime, I will keep working on DSA, and prototype changes
>with the rocker driver.
>
>Once we are confident we have bridged the gap, we can unify things.
>How does that sound?

Sounds good. 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
       [not found]           ` <464DB0A8-0073-4CE0-9483-0F36B73A53A1-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR@public.gmane.org>
@ 2014-08-23  9:24             ` Jiri Pirko
  2014-08-23 14:51               ` Thomas Graf
  2014-08-24  1:53             ` Jamal Hadi Salim
  1 sibling, 1 reply; 87+ messages in thread
From: Jiri Pirko @ 2014-08-23  9:24 UTC (permalink / raw)
  To: Scott Feldman
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	ronye-VPRAkNaXOzVWk0Htik3J/w, jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, andy-QlMahl40kYEqcZcGjlUOXw,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, John Fastabend,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, ben-/+tVBieCtBitmTQ+vhA3Yw,
	buytenh-OLH4Qvv75CYX/NnBR394Jw,
	roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
	jhs-jkUAjuhPggJWk0Htik3J/w, aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, nhorman-2XuSBdqkA4R54TAoqtyWWQ,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	dborkman-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q

Sat, Aug 23, 2014 at 12:53:34AM CEST, sfeldma@cumulusnetworks.com wrote:
>
>On Aug 22, 2014, at 12:39 PM, John Fastabend <john.fastabend@gmail.com> wrote:
>
>> On 08/21/2014 09:19 AM, Jiri Pirko wrote:
>>> Benefit from the possibility to work with flows in switch devices and
>>> use the swdev api to offload flow datapath.
>> 
>> we should add a description here on the strategy being used.
>> 
>> If I read this correctly this will try to add any flow to the
>> hardware along with the actions and duplicate it in software.
>> 
>> There are a couple things I don't like,
>> 
>> - this requires OVS to be loaded to work. If all I want is
>>   direct access to the hardware flow tables requiring openvswitch.ko
>>   shouldn't be needed IMO. For example I may want to use the
>>   hardware flow tables with something not openvswitch and we
>>   shouldn't preclude that.
>> 
>
>The intent is to use openvswitch.ko’s struct sw_flow to program hardware via the ndo_swdev_flow_* ops, but otherwise be independent of OVS.  So the upper layer of the driver is struct sw_flow and any module above the driver can construct a struct sw_flow and push it down via ndo_swdev_flow_*.  So your non-OVS use-case should be handled.  OVS is another use-case.  struct sw_flow should not be OVS-aware, but rather a generic flow match/action sufficient to offload the data plane to HW.

Yes. I was thinking about a simple Netlink API that would expose direct
sw_flow manipulation (an ndo_swdev_flow_* wrapper) to userspace. I will
think about that more and perhaps add it to my next patchset version.

>
>> - Also there is no programmatic way to learn which flows are
>>   in hardware and which in software. There is a pr_warn but
>>   that doesn't help when interacting with the hardware remotely.
>>   I need some mechanism to dump the set of hardware tables and
>>   the set of software tables.
>
>Agreed, we need a way to annotate which flows are installed in hardware.

Yes, we discussed that already. We need to make the OVS daemon hw-offload
aware, indicating which flows it wants/prefers to be offloaded. This is,
I believe, an easily extendable feature and can be added when the
time is right.

>
>> - Simply duplicating the software flow/action into
>>   hardware may not optimally use the hardware tables. If I have
>>   a TCAM in hardware for instance. (This is how I read the patch
>>   let me know if I missed something)
>
>The hardware-specific driver is the right place to handle optimizing the flow/action in hardware since only the driver can know the size/shape of the device.  struct sw_flow is a generic flow description; how (or if) a flow gets programmed into hardware must be handled in the swdev driver.  If the device driver can’t make the sw_flow fit into HW because of resource limitations or the flow simply can’t be represented in HW, then the flow is SW only.  
>
>In the rocker driver posted in this patch set, the steps are to parse the struct sw_flow to figure out what type of flow match/action we’re dealing with (L2 or L3 or L4, ucast or mcast, ipv4 or ipv6, etc) and then install the correct entries into the corresponding device tables within the constraints of the device’s pipeline.  Any optimizations, like coalescing HW entries, is something only the driver can do.
>
>> 
>> - I need a way to specify put this flow/action in hardware,
>>   put this flow/action in software, or put this in both software
>>   and hardware.
>> 
>
>This seems above the swdev layer.  In other words, don’t call ndo_swdev_flow_* if you don’t want flow match/action install in HW.
>
>>   We did this with a bitmask in the fdb L2 stuff and it seems to
>>   work reasonable well so maybe something like that would help.
>> 
>>   For example if I don't have this what happens if I have an
>>   entry to decrement TTL in both hardware and software. If the
>>   flow hits both the hardware path and software path the TTL
>>   gets decremented. Here userspace needs to indicate where to
>>   do the decrement to avoid the duplication.
>
>I’m not following why a flow would hit both HW and SW paths.  That seems bad, negating the effort of offloading the flow to HW in the first place.  My simple view is that if a flow hits the HW path, the SW path is unaware of it.  Clearly work is needed to provide a coherent view to the user with respect to stat counters and such, but I believe it's do-able.
>
>> 
>> I think if we can pull this out OVS and add the hw/sw bitmask (or
>> maybe a better implementation of that idea) then this should work
>> for the stuff I'm looking at. I want to try and get it working on
>> the i40e driver as a fdir replacement but it might take me a bit
>> to get to it.
>
>That sounds cool and would really help get the interface in place.  Take another look at the way Jiri has busted out sw_flow.h and see if this works for you outside of an OVS context.  If not, we need to fix it.

Great. John, please keep us posted about your progress.
Let me know if you need any help.

>
>
>> 
>> Thanks,
>> John
>> 
>> 
>> -- 
>> John Fastabend         Intel Corporation
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>-scott
>
>
>
_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 07/12] dsa: implement ndo_swdev_get_id
       [not found]         ` <20140821170645.GB10633-6KJVSR23iU5sFDB2n11ItA@public.gmane.org>
  2014-08-21 17:12           ` Florian Fainelli
@ 2014-08-23 11:33           ` Eric W. Biederman
  1 sibling, 0 replies; 87+ messages in thread
From: Eric W. Biederman @ 2014-08-23 11:33 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Sergey Ryazanov, jasowang-H+wXaHxf7aLQT0dZR+AlfA, John Fastabend,
	Neil Jerram, Eric Dumazet, Andy Gospodarek, dev, Felix Fietkau,
	Florian Fainelli, ronye-VPRAkNaXOzVWk0Htik3J/w, Jeff Kirsher,
	ogerlitz, Ben Hutchings, Lennert Buytenhek, Roopa Prabhu,
	Jamal Hadi Salim, Aviad Raveh, Nicolas Dichtel, vyasevic,
	Neil Horman, netdev, Stephen Hemminger, dborkman, David

Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org> writes:

> Thu, Aug 21, 2014 at 06:56:13PM CEST, f.fainelli-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org wrote:
>>2014-08-21 9:19 GMT-07:00 Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>:
>>> Signed-off-by: Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
>>> ---
>>>  net/dsa/Kconfig |  2 +-
>>>  net/dsa/slave.c | 16 ++++++++++++++++
>>>  2 files changed, 17 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/net/dsa/Kconfig b/net/dsa/Kconfig
>>> index f5eede1..66c445a 100644
>>> --- a/net/dsa/Kconfig
>>> +++ b/net/dsa/Kconfig
>>> @@ -1,6 +1,6 @@
>>>  config HAVE_NET_DSA
>>>         def_bool y
>>> -       depends on NETDEVICES && !S390
>>> +       depends on NETDEVICES && NET_SWITCHDEV && !S390
>>>
>>>  # Drivers must select NET_DSA and the appropriate tagging format
>>>
>>> diff --git a/net/dsa/slave.c b/net/dsa/slave.c
>>> index 45a1e34..e069ba3 100644
>>> --- a/net/dsa/slave.c
>>> +++ b/net/dsa/slave.c
>>> @@ -171,6 +171,19 @@ static int dsa_slave_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
>>>         return -EOPNOTSUPP;
>>>  }
>>>
>>> +static int dsa_slave_swdev_get_id(struct net_device *dev,
>>> +                                 struct netdev_phys_item_id *psid)
>>> +{
>>> +       struct dsa_slave_priv *p = netdev_priv(dev);
>>> +       struct dsa_switch *ds = p->parent;
>>> +       u64 tmp = (u64) ds;
>>> +
>>> +       /* TODO: add more sophisticated id generation */
>>> +       memcpy(&psid->id, &tmp, sizeof(tmp));
>>> +       psid->id_len = sizeof(tmp);
>>
>>There is already an unique id generated, which is the index in the
>>switch tree, and which is stored in struct dsa_switch, so this could
>>probably be simplified to:
>>
>>psid->id = ds->index
>
> That index is 0..n if I understand that correctly. That is not enough.
> The point is to have unique id for every chip in the system. If we would
> have 0,1,2... the collision is very likely.

I am just kibitzing but ethernet switches capable of speaking stp
require a mac address per port.  So if you want a unique id I would
pick one of your mac addresses, which should be unique.

I can understand low end devices where sophisticated things will never
happen fudging on the mac address requirements but by the time you care
I expect you have a mac address.

Eric

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 12/12] rocker: introduce rocker switch driver
  2014-08-21 16:19 ` [patch net-next RFC 12/12] rocker: introduce rocker switch driver Jiri Pirko
  2014-08-21 17:19   ` Florian Fainelli
@ 2014-08-23 14:04   ` Thomas Graf
  2014-08-29  7:06     ` Jiri Pirko
  1 sibling, 1 reply; 87+ messages in thread
From: Thomas Graf @ 2014-08-23 14:04 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, nhorman, andy, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet, jhs, sfeldma, f.fainelli, roopa,
	linville, dev, jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a,
	buytenh, aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye

On 08/21/14 at 06:19pm, Jiri Pirko wrote:
> This patch introduces the first driver to benefit from the switchdev
> infrastructure and to implement newly introduced switch ndos. This is a
> driver for emulated switch chip implemented in qemu:
> https://github.com/sfeldma/qemu-rocker/

The design looks very clean. I noticed that the TLV API is almost an
exact duplicate of the Netlink attributes API. Any specific reason for
not reusing lib/nlattr.c and adding what is missing?

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
  2014-08-23  9:24             ` Jiri Pirko
@ 2014-08-23 14:51               ` Thomas Graf
       [not found]                 ` <20140823145126.GB24116-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
  0 siblings, 1 reply; 87+ messages in thread
From: Thomas Graf @ 2014-08-23 14:51 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Scott Feldman, John Fastabend, netdev, davem, nhorman, andy,
	dborkman, ogerlitz, jesse, pshelar, azhou, ben, stephen,
	jeffrey.t.kirsher, vyasevic, xiyou.wangcong, john.r.fastabend,
	edumazet, jhs, f.fainelli, roopa, linville, dev, jasowang,
	ebiederm, nicolas.dichtel, ryazanov.s.a, buytenh, aviadr, nbd,
	alexei.starovoitov, Neil.Jerram, ronye

On 08/23/14 at 11:24am, Jiri Pirko wrote:
> Sat, Aug 23, 2014 at 12:53:34AM CEST, sfeldma@cumulusnetworks.com wrote:
> >
> >On Aug 22, 2014, at 12:39 PM, John Fastabend <john.fastabend@gmail.com> wrote:
> >> - this requires OVS to be loaded to work. If all I want is
> >>   direct access to the hardware flow tables requiring openvswitch.ko
> >>   shouldn't be needed IMO. For example I may want to use the
> >>   hardware flow tables with something not openvswitch and we
> >>   shouldn't preclude that.
> >> 
> >
> >The intent is to use openvswitch.ko’s struct sw_flow to program hardware via the ndo_swdev_flow_* ops, but otherwise be independent of OVS.  So the upper layer of the driver is struct sw_flow and any module above the driver can construct a struct sw_flow and push it down via ndo_swdev_flow_*.  So your non-OVS use-case should be handled.  OVS is another use-case.  struct sw_flow should not be OVS-aware, but rather a generic flow match/action sufficient to offload the data plane to HW.
> 
> Yes. I was thinking about simple Netlink API that would expose direct
> sw_flow manipulation (ndo_swdev_flow_* wrapper) to userspace. I will
> think abou that more and perhaps add it to my next patchset version.

I agree that this might help to give a better API consumption example
for everyone not familiar with OVS.

> >> - Also there is no programmatic way to learn which flows are
> >>   in hardware and which in software. There is a pr_warn but
> >>   that doesn't help when interacting with the hardware remotely.
> >>   I need some mechanism to dump the set of hardware tables and
> >>   the set of software tables.
> >
> >Agreed, we need a way to annotate which flows are installed hardware.
> 
> Yes, we discussed that already. We need to make OVS daemon hw-offload
> aware indicating which flow it want/prefers to be offloaded. This is I
> believe easily extentable feature and can be added whenever the right
> time is.

I think the swdev flow API is good as-is. The bitmask specifying the
offload preference with all the needed granularity (offload-or-fail,
try-to-offload, never-offload) can be added later, either in
OVS only or in swdev itself.

What is unclear in this patch is how OVS user space can know which
flows are offloaded and which aren't. A status field would help here
which indicates either: flow inserted and offloaded, flow inserted but
not offloaded. Given that, the API consumer can easily keep track of
which flows are currently offloaded.

Also, I'm not sure whether flow expiration is something the API must
take care of. The current proposal assumes that HW flows are only
ever removed by the API itself. Could the switch CPU run code which
removes flows as well? That would call for Netlink notifications.
Not that it's needed at this stage of the code but maybe worth
considering for the API design.

> >> - Simply duplicating the software flow/action into
> >>   hardware may not optimally use the hardware tables. If I have
> >>   a TCAM in hardware for instance. (This is how I read the patch
> >>   let me know if I missed something)
> >
> >The hardware-specific driver is the right place to handle optimizing the flow/action in hardware since only the driver can know the size/shape of the device.  struct sw_flow is a generic flow description; how (or if) a flow gets programmed into hardware must be handled in the swdev driver.  If the device driver can’t make the sw_flow fit into HW because of resource limitations or the flow simply can’t be represented in HW, then the flow is SW only.  
> >
> >In the rocker driver posted in this patch set, the steps are to parse the struct sw_flow to figure out what type of flow match/action we’re dealing with (L2 or L3 or L4, ucast or mcast, ipv4 or ipv6, etc) and then install the correct entries into the corresponding device tables within the constraints of the device’s pipeline.  Any optimizations, like coalescing HW entries, is something only the driver can do.

The later examples definitely make sense and I'm not arguing against
that. There is also a non hardware capabilities perspective that I
would like to present:

1) TCAM capacity is limited, so we offload based on some priority assigned
to flows.  Some are critical and need to be in HW, others are best effort,
others never go into hardware. An API user will likely want to offload
best-effort flows until some watermark is reached and then switch to
critical flows only. The driver is not the right place for high-level
optimization like this. The kernel API might be, but it doesn't really
have to be either, because that would mean we need APIs to transfer all
of the context needed for the decision into the kernel. It might be
easier to expose the hardware context to user space instead and handle
these kinds of optimizations in something like Quagga.

2) There is definitely a desire to allow adapting the software flow table
based on the hardware capabilities. Example, given a route like this:

   20.1.0.0/16, mark=50, tos=0x12, actions: output:eth1

The hardware can satisfy everything except the mark=50 match. Given a
blind 1:1 copy between hardware and software we cannot offload,
because the match would be illegal. With the full context available
north of the API, this could be translated into something like this:

  HW: 20.1.0.0/16, tos=0x12, actions: meta=1, output:cpu
  SW: meta=1, mark=50, output:eth1

This will allow for partial offloads to bypass expensive masked flow
table lookups by converting them into efficient flat exact match
tables, offload TC classifiers, nftables or even the existing L2 and
L3 forwarding path.

In summary, I think the swdev API as proposed is a good start as the
in-kernel flow abstraction is sufficient for many API users but we
should consider enabling the model described above as well once we
have the basic model put in place. I will be very interested in helping
out on this for both existing classifiers and OVS flow tables.


> >> - I need a way to specify put this flow/action in hardware,
> >>   put this flow/action in software, or put this in both software
> >>   and hardware.
> >> 
> >
> >This seems above the swdev layer.  In other words, don’t call ndo_swdev_flow_* if you don’t want flow match/action install in HW.

It can certainly be done northbound, but this seems like a basic
requirement, and we might as well avoid the code duplication and
extend the API instead.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
       [not found]                 ` <20140823145126.GB24116-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
@ 2014-08-23 17:09                   ` John Fastabend
       [not found]                     ` <53F8CAB9.8080407-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 87+ messages in thread
From: John Fastabend @ 2014-08-23 17:09 UTC (permalink / raw)
  To: Thomas Graf
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, andy-QlMahl40kYEqcZcGjlUOXw,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, ronye-VPRAkNaXOzVWk0Htik3J/w,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, ben-/+tVBieCtBitmTQ+vhA3Yw,
	buytenh-OLH4Qvv75CYX/NnBR394Jw, Jiri Pirko,
	roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
	jhs-jkUAjuhPggJWk0Htik3J/w, aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, nhorman-2XuSBdqkA4R54TAoqtyWWQ,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	dborkman-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q

On 08/23/2014 07:51 AM, Thomas Graf wrote:
> On 08/23/14 at 11:24am, Jiri Pirko wrote:
>> Sat, Aug 23, 2014 at 12:53:34AM CEST, sfeldma-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR@public.gmane.org wrote:
>>>
>>> On Aug 22, 2014, at 12:39 PM, John Fastabend <john.fastabend-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>> - this requires OVS to be loaded to work. If all I want is
>>>>    direct access to the hardware flow tables requiring openvswitch.ko
>>>>    shouldn't be needed IMO. For example I may want to use the
>>>>    hardware flow tables with something not openvswitch and we
>>>>    shouldn't preclude that.
>>>>
>>>
>>> The intent is to use openvswitch.ko’s struct sw_flow to program hardware via the ndo_swdev_flow_* ops, but otherwise be independent of OVS.  So the upper layer of the driver is struct sw_flow and any module above the driver can construct a struct sw_flow and push it down via ndo_swdev_flow_*.  So your non-OVS use-case should be handled.  OVS is another use-case.  struct sw_flow should not be OVS-aware, but rather a generic flow match/action sufficient to offload the data plane to HW.
>>
>> Yes. I was thinking about a simple Netlink API that would expose direct
>> sw_flow manipulation (an ndo_swdev_flow_* wrapper) to userspace. I will
>> think about that more and perhaps add it to my next patchset version.
>
> I agree that this might help to give a better API consumption example
> for everyone not familiar with OVS.

Yep and it solves one of my simple cases where I have macvlan configured
with SR-IOV or the l2-dfwd-offload bit set and want to push some basic
static ACLs into the flow table. If you have to bring the port into the
OVS framework I'm not sure how to make this coexist.

>
>>>> - Also there is no programmatic way to learn which flows are
>>>>    in hardware and which in software. There is a pr_warn but
>>>>    that doesn't help when interacting with the hardware remotely.
>>>>    I need some mechanism to dump the set of hardware tables and
>>>>    the set of software tables.
>>>
>>> Agreed, we need a way to annotate which flows are installed in hardware.
>>
>> Yes, we discussed that already. We need to make the OVS daemon hw-offload
>> aware, indicating which flows it wants/prefers to be offloaded. This is,
>> I believe, an easily extendable feature and can be added whenever the
>> time is right.
>
> I think the swdev flow API is good as-is. The bitmask specifying the
> offload preference with all the granularity (offload-or-fail,
> try-to-offload, never-offload) needed can be added later, either in
> OVS only or in swdev itself.
>
> What is unclear in this patch is how OVS user space can know which
> flows are offloaded and which aren't. A status field would help here
> which indicates either: flow inserted and offloaded, flow inserted but
> not offloaded. Given that, the API consumer can easily keep track of
> which flows are currently offloaded.
>

Right. I think this is basically what Jiri and I discussed when he
originally posted the series. For my use cases this is one of the
more interesting pieces. If no one else is looking at it I can try
it on some of the already existing open source drivers that have some
very simple support for ingress flow tables (read: flow director).

> Also, I'm not sure whether flow expiration is something the API must
> take care of. The current proposal assumes that HW flows are only
> ever removed by the API itself. Could the switch CPU run code which
> removes flows as well? That would call for Netlink notifications.
> Not that it's needed at this stage of the code but maybe worth
> considering for the API design.

I think this will be very useful when we get to a point where we
can use this on some of the switch silicon that supports bigger tables
with more capabilities. Like you say we probably don't need it in
the first draft but having a path to support it is needed.

>
>>>> - Simply duplicating the software flow/action into
>>>>    hardware may not optimally use the hardware tables. If I have
>>>>    a TCAM in hardware for instance. (This is how I read the patch
>>>>    let me know if I missed something)
>>>
>>> The hardware-specific driver is the right place to handle optimizing the flow/action in hardware since only the driver can know the size/shape of the device.  struct sw_flow is a generic flow description; how (or if) a flow gets programmed into hardware must be handled in the swdev driver.  If the device driver can’t make the sw_flow fit into HW because of resource limitations or the flow simply can’t be represented in HW, then the flow is SW only.
>>>
>>> In the rocker driver posted in this patch set, the steps are to parse the struct sw_flow to figure out what type of flow match/action we’re dealing with (L2 or L3 or L4, ucast or mcast, ipv4 or ipv6, etc) and then install the correct entries into the corresponding device tables within the constraints of the device’s pipeline.  Any optimizations, like coalescing HW entries, is something only the driver can do.
>
> The latter examples definitely make sense and I'm not arguing against
> that. There is also a non hardware capabilities perspective that I
> would like to present:
>
> 1) TCAM capacity is limited, we offload based on some priority assigned
> to flows.  Some are critical and need to be in HW, others are best effort,
> others never go into hardware. An API user will likely want to offload
> best-effort flows until some watermark is reached and then switch to
> critical flows only. The driver is not the right place for high level
> optimization like this. The kernel API might be, but doesn't really have
> to be either, because it would mean we need APIs to transfer all of the
> needed context for the decision in the kernel. It might be easier to
> expose the hardware context to user space instead and handle these
> kind of optimizations in something like Quagga.
>
> 2) There is definitely a desire to allow adapting the software flow table
> based on the hardware capabilities. For example, given a route like this:
>
>     20.1.0.0/16, mark=50, tos=0x12, actions: output:eth1
>
> The hardware can satisfy everything except the mark=50 match. Given
> a blind 1:1 copy between hardware and software we cannot offload
> because such a match would be illegal. With the full context available
> north of the API, this could be translated into something like this:
>
>    HW: 20.1.0.0/16, tos=0x12, actions: meta=1, output:cpu
>    SW: meta=1, mark=50, output:eth1
>
> This will allow partial offloads to bypass expensive masked flow
> table lookups by converting them into efficient flat exact-match
> tables, and to offload TC classifiers, nftables, or even the existing
> L2 and L3 forwarding paths.

Thanks. This is exactly what I was trying to hint at and why the
optimization can not be done in the driver. The driver shouldn't
have to know about the cost models of SW vs HW rules or how to
break up rules into sets of complementary hw/sw rules.

The other thing I've been thinking about is how to handle hardware
with multiple flow tables. We could let the driver handle this
but if I ever want to employ a new optimization strategy then I
need to rewrite the driver. To me this looks a lot like policy
which should not be driven by the kernel. We can probably ignore
this case for the moment until we get some of the other things
addressed.

>
> In summary, I think the swdev API as proposed is a good start as the
> in-kernel flow abstraction is sufficient for many API users but we
> should consider enabling the model described above as well once we
> have the basic model put in place. I will be very interested in helping
> out on this for both existing classifiers and OVS flow tables.
>
>
>>>> - I need a way to specify put this flow/action in hardware,
>>>>    put this flow/action in software, or put this in both software
>>>>    and hardware.
>>>>
>>>
>>> This seems above the swdev layer.  In other words, don’t call ndo_swdev_flow_* if you don’t want flow match/action install in HW.
>
> It can certainly be done northbound but this seems like a basic
> requirement and we might end up avoiding code duplication by
> extending the API instead.
>

IMO extending the API is the easiest route, but the best way to
resolve this is to try and write the code. I'll take a stab at it
next week.

By the way, Jiri, I think the patches are a great start.

Thanks,
John

-- 
John Fastabend         Intel Corporation

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
       [not found]           ` <464DB0A8-0073-4CE0-9483-0F36B73A53A1-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR@public.gmane.org>
  2014-08-23  9:24             ` Jiri Pirko
@ 2014-08-24  1:53             ` Jamal Hadi Salim
  2014-08-24 11:12               ` Thomas Graf
  1 sibling, 1 reply; 87+ messages in thread
From: Jamal Hadi Salim @ 2014-08-24  1:53 UTC (permalink / raw)
  To: Scott Feldman, John Fastabend
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, andy-QlMahl40kYEqcZcGjlUOXw,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, ronye-VPRAkNaXOzVWk0Htik3J/w,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, ben-/+tVBieCtBitmTQ+vhA3Yw,
	buytenh-OLH4Qvv75CYX/NnBR394Jw, Jiri Pirko,
	roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
	aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, nhorman-2XuSBdqkA4R54TAoqtyWWQ,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	dborkman-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q

On 08/22/14 18:53, Scott Feldman wrote:

Ok, Scott - now i have looked at the patches on the plane and i am
still not convinced ;->

> The intent is to use openvswitch.ko’s struct sw_flow to program hardware via the
>ndo_swdev_flow_* ops, but otherwise be independent of OVS.  So the upper layer of
>the driver is struct sw_flow and any module above the driver can construct a struct
>sw_flow and push it down via ndo_swdev_flow_*.  So your non-OVS use-case should be
>handled.  OVS is another use-case.  struct sw_flow should not be OVS-aware, but
>rather a generic flow match/action sufficient to offload the data plane to HW.


There is a legitimate case to be made for offloading OVS but *not*
a basis for making it the offload interface.
My suggestion is to make all OVS stuff a separate patchset.
This thing needs to stand alone without OVS and we dont need
to confuse the two.

Having said that:
I believe in starting simple - by solving the basic functions of
L2/3 offload first because those are well understood and fundamental.
There is the simplicity of those network functions and then the
need to deal with tons of quirks that surround them....
I think getting that right will help in understanding the issues and
make this interface better. This is where i am going to focus my effort.

Here's my view on flows in the patchset:
What we need is ability to specify different types of classifiers.
But leave L2 and 3 out of that - that should be part of the basic
feature set.
Your 15-tuple classifier should be one of those classifiers.
This is because you *cannot possibly* have a universal classifier.
The tc classifier/action API has got this part right. There is
no ONE flow classifier but rather it has flexibility to add as many
as you want.
IOW:
I should be able to specify a classifier that matches the
definition of the openflow thing you are using. But then i should also
be able to create one based on 32 bit value/masks, one that classifies
strings, one that classifies metadata, my own pigeon observer
classifier etc. And be able to attach them in combinations
to select different things within the packet and act differently.

Lets pick an example of the u32 classifier (or i could pick nftables).
Using your scheme i have to incur penalties to translating u32 to your
classifier and only achieve basic functionality; and now in addition
i cant do 90% of my u32 features. And u32 is very implementable
in hardware.

cheers,
jamal

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
  2014-08-24  1:53             ` Jamal Hadi Salim
@ 2014-08-24 11:12               ` Thomas Graf
       [not found]                 ` <20140824111218.GA32741-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
  0 siblings, 1 reply; 87+ messages in thread
From: Thomas Graf @ 2014-08-24 11:12 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Scott Feldman, John Fastabend, Jiri Pirko, netdev, davem,
	nhorman, andy, dborkman, ogerlitz, jesse, pshelar, azhou, ben,
	stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet, f.fainelli, roopa, linville, dev,
	jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a, buytenh,
	aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye

On 08/23/14 at 09:53pm, Jamal Hadi Salim wrote:
> On 08/22/14 18:53, Scott Feldman wrote:
> 
> Ok, Scott - now i have looked at the patches on the plane and i am
> still not convinced ;->
> 
> >The intent is to use openvswitch.ko’s struct sw_flow to program hardware via the
> >ndo_swdev_flow_* ops, but otherwise be independent of OVS.  So the upper layer of
> >the driver is struct sw_flow and any module above the driver can construct a struct
> >sw_flow and push it down via ndo_swdev_flow_*.  So your non-OVS use-case should be
> >handled.  OVS is another use-case.  struct sw_flow should not be OVS-aware, but
> >rather a generic flow match/action sufficient to offload the data plane to HW.
> 
> 
> There is a legitimate case to be made for offloading OVS but *not*
> a basis for making it the offload interface.
> My suggestion is to make all OVS stuff a separate patchset.
> This thing needs to stand alone without OVS and we dont need
> to confuse the two.

I get what you are saying but I don't see that to be the case here. I
don't see how this series proposes the OVS case as *the* interface.
It proposes *a* interface which in this case is flow based with mask
support to accomodate the typical ntuple filter API in HW. OVS happens
to be one of the easiest to use examples as a consumer because it
already provides a flat flow representation.

That said, I already mentioned that I see a lot of value in having a
non OVS API example ASAP and I will be glad to help out John to achieve
that.

> Having said that:
> I believe in starting simple - by solving the basic functions of
> L2/3 offload first because those are well understood and fundamental.
> There is the simplicity of those network functions and then the
> need to deal with tons of quirks that surround them....
> I think getting that right will help in understanding the issues and
> make this interface better. This is where i am going to focus my effort.

I thought this is exactly what is happening here. The flow key/mask
based API as proposed focuses on basic forwarding for L2-L4.

> Here's my view on flows in the patchset:
> What we need is ability to specify different types of classifiers.
> But leave L2 and 3 out of that - that should be part of the basic
> feature set.
>
> Your 15-tuple classifier should be one of those classifiers.
> This is because you *cannot possibly* have a universal classifier.
> The tc classifier/action API has got this part right. There is
> no ONE flow classifier but rather it has flexibility to add as many
> as you want.

Exactly and I never saw Jiri claim that swdev_flow_insert() would be
the only offload capability exposed by the API. I see no reason why
it could not also provide swdev_offset_match_insert() or
swdev_ebpf_insert() for the 2*next generation HW. I don't think it
makes sense to focus entirely on finding a single common denominator
and channel everything through a single function to represent all the
different generic and less generic offload capabilities. I believe
that doing so will raise the minimal HW requirements barrier too
much. I think we should start somewhere, learn and evolve.

> IOW:
> I should be able to specify a classifier that matches the
> definition of the openflow thing you are using. But then i should also
> be able to create one based on 32 bit value/masks, one that classifies
> strings, one that classifies metadata, my own pigeon observer
> classifier etc. And be able to attach them in combinations
> to select different things within the packet and act differently.

So essentially what you are saying is that the tc interface
(in particular cls and act) could be used as an API to achieve offloads.
Yes! I thought this was very clear and a given. I don't think that it
makes sense to force every offload API consumer through the tc interface
though. This comes back to my statements in a previous email. I don't
think we should require that all the offload decision complexity *has*
to live in the kernel. Quagga, nft, or OVS should be given an API to
influence this more directly (with the hardware complexity properly
abstracted). In-kernel users such as bridge, l3 (especially rules),
and tc itself could be handled through a cls/act derived API internally.

> Lets pick an example of the u32 classifier (or i could pick nftables).
> Using your scheme i have to incur penalties to translating u32 to your
> classifier and only achieve basic functionality; and now in addition
> i cant do 90% of my u32 features. And u32 is very implementable
> in hardware.

I don't fully understand the last claim. Given the specific ntuple
capabilities of a lot of hardware out there (let's assume a typical
5-tuple capability with N capacity for exact matches and M capacity for
wildcard matches) supporting a generic u32 offset-len-mask is not exactly
trivial at all and I don't see how you can get around converting the
generic offset into a ntuple filter *at some point* to verify if the HW
can fullfil the generic offset match request or not. Could you share
what kind of HW you regard as a minimal requirement to base the offload
API on? Personally I'm highly interested in the existing limited tuple
filters and flow directors of NICs already available and their next
successors. I think that the code that Jiri proposes and what John is
planning to do makes a lot of sense in that context.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
       [not found]                     ` <53F8CAB9.8080407-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2014-08-24 11:32                       ` Thomas Graf
  0 siblings, 0 replies; 87+ messages in thread
From: Thomas Graf @ 2014-08-24 11:32 UTC (permalink / raw)
  To: John Fastabend
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, andy-QlMahl40kYEqcZcGjlUOXw,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, ronye-VPRAkNaXOzVWk0Htik3J/w,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, ben-/+tVBieCtBitmTQ+vhA3Yw,
	buytenh-OLH4Qvv75CYX/NnBR394Jw, Jiri Pirko,
	roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
	jhs-jkUAjuhPggJWk0Htik3J/w, aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, nhorman-2XuSBdqkA4R54TAoqtyWWQ,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	dborkman-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q

On 08/23/14 at 10:09am, John Fastabend wrote:
> Right. I think this is basically what Jiri and I discussed when he
> originally posted the series. For my use cases this is one of the
> more interesting pieces. If no one else is looking at it I can try
> it on some of the already existing open source drivers that have some
> very simple support for ingress flow tables (read: flow director).

Awesome. I'm definitely very interested in helping out on this part
as well.

> Thanks. This is exactly what I was trying to hint at and why the
> optimization can not be done in the driver. The driver shouldn't
> have to know about the cost models of SW vs HW rules or how to
> break up rules into sets of complementary hw/sw rules.

That's an excellent summary of what I wanted to say.

> the other thing I've been thinking about is how to handle hardware
> with multiple flow tables. We could let the driver handle this
> but if I ever want to employ a new optimization strategy then I
> need to rewrite the driver. To me this looks a lot like policy
> which should not be driven by the kernel. We can probably ignore
> this case for the moment until we get some of the other things
> addressed.

Agreed, this sounds like something to handle a bit later.
It is potentially very interesting as it would allow offloading at
least partial pipelines but it obviously adds a new dimension to
the API. I strongly feel that the API as proposed could be extended
in this direction though. It will require a notion of tables for
swdev_flow_insert() and we'll likely need an API to set default
table policies although that is likely even needed for single table
support. We might also have to introduce a concept of bundles at
some point to provide atomic updates across multiple tables for
consistency.

> IMO I think extending the API is the easiest route but the best
> way to resolve this is to try and write the code. I'll take a
> stab at it next week.

I'm absolutely interested in writing code for this as well. If we
can find consensus on merging at least the core API bits in some
form then that would allow for more people to get involved. Maybe
we can skip the OVS bits in the first merge and continue that work
in a separate git tree. I'm also definitely very interested in hearing
Pravin's and Jesse's thoughts on the overall API ;-)

John's flow director API replacement idea can definitely serve as an
excellent first in-tree consumer as it looks even simpler.

> by the way Jiri I think the patches are a great start.

+1

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 03/12] net: introduce generic switch devices support
  2014-08-21 16:18   ` [patch net-next RFC 03/12] net: introduce generic switch devices support Jiri Pirko
  2014-08-21 16:41     ` Ben Hutchings
       [not found]     ` <1408637945-10390-4-git-send-email-jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
@ 2014-08-24 11:46     ` Thomas Graf
       [not found]       ` <20140824114605.GC32741-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
  2014-08-27 22:19     ` Cong Wang
  3 siblings, 1 reply; 87+ messages in thread
From: Thomas Graf @ 2014-08-24 11:46 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, nhorman, andy, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet, jhs, sfeldma, f.fainelli, roopa,
	linville, dev, jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a,
	buytenh, aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye

On 08/21/14 at 06:18pm, Jiri Pirko wrote:
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 39294b9..8b5d14c 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -49,6 +49,8 @@
>  
>  #include <linux/netdev_features.h>
>  #include <linux/neighbour.h>
> +#include <linux/sw_flow.h>
> +
>  #include <uapi/linux/netdevice.h>
>  
>  struct netpoll_info;
> @@ -997,6 +999,24 @@ typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
> + * int (*ndo_swdev_flow_insert)(struct net_device *dev,
> + *				const struct sw_flow *flow);
> + *	Called to insert a flow into switch device. If driver does
> + *	not implement this, it is assumed that the hw does not have
> + *	a capability to work with flows.

I assume you are planning to add an additional expandable struct
parameter to handle insertion parameters when the first one is
introduced, to avoid having to touch every driver every time.

> +/**
> + *	swdev_flow_insert - Insert a flow into switch
> + *	@dev: port device
> + *	@flow: flow descriptor
> + *
> + *	Insert a flow into switch this port is part of.
> + */
> +int swdev_flow_insert(struct net_device *dev, const struct sw_flow *flow)
> +{
> +	const struct net_device_ops *ops = dev->netdev_ops;
> +
> +	print_flow(flow, dev, "insert");
> +	if (!ops->ndo_swdev_flow_insert)
> +		return -EOPNOTSUPP;
> +	WARN_ON(!ops->ndo_swdev_get_id);
> +	BUG_ON(!flow->actions);
> +	return ops->ndo_swdev_flow_insert(dev, flow);
> +}
> +EXPORT_SYMBOL(swdev_flow_insert);

Splitting the flow specific API into a separate file (maybe
swdev_flow.c?) might help resolve some of the concerns around the
focus on flows. It would make it clear that it's one of multiple
models to be supported.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
       [not found]                 ` <20140824111218.GA32741-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
@ 2014-08-24 15:15                   ` Jamal Hadi Salim
       [not found]                     ` <53FA01AC.10507-jkUAjuhPggJWk0Htik3J/w@public.gmane.org>
  2014-08-25 14:54                     ` Thomas Graf
  0 siblings, 2 replies; 87+ messages in thread
From: Jamal Hadi Salim @ 2014-08-24 15:15 UTC (permalink / raw)
  To: Thomas Graf
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	ronye-VPRAkNaXOzVWk0Htik3J/w, jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, andy-QlMahl40kYEqcZcGjlUOXw,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, John Fastabend,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, ben-/+tVBieCtBitmTQ+vhA3Yw,
	buytenh-OLH4Qvv75CYX/NnBR394Jw, Jiri Pirko,
	roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
	aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, nhorman-2XuSBdqkA4R54TAoqtyWWQ,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	dborkman-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q

On 08/24/14 07:12, Thomas Graf wrote:
> On 08/23/14 at 09:53pm, Jamal Hadi Salim wrote:

>
> I get what you are saying but I don't see that to be the case here. I
> don't see how this series proposes the OVS case as *the* interface.

The focus of the patches is on offloading flows (using the
OVS or, shall i say, the Broadcom OF-DPA API, which is one
vendor's view of the world).

Yes, people are going to deploy more hardware which knows how to do
a lot of flows (but today that is in the tiny tiny minority)

I would have liked to see more focus on L2/3 as a first step because
they are more predominantly deployed than anything with flows. And
they are well understood from a functional perspective.
Then that would bring to the front API issues since you have
a large sample space of deployments and we can refactor as needed.
i.e
The hard part is dealing with 10 different chips which have a slightly
different meaning of (example) how to do L3 in their implementation.
I dont see such a focus in these patches because they start with a
premise "the world is about flows".

> It proposes *an* interface which in this case is flow based with mask
> support to accommodate the typical ntuple filter API in HW. OVS happens
> to be one of the easiest to use examples as a consumer because it
> already provides a flat flow representation.
>

In other words, there is a direct 1-1 map between this approach and OVS.
That is a contentious point.

> I thought this is exactly what is happening here. The flow key/mask
> based API as proposed focuses on basic forwarding for L2-L4.
>

Not at all.
I gave an example earlier with u32, but lets pick the other extreme
of well understood functions, say L3 (I could pick L2 as well).
This openflow api tries to describe different header
fields in the packet. That is not the challenge for such an
API. The challenge is dealing with the quirks.
Some chips implement FIB and NH conjoined; others implement
them separately.
I dont see how this is even being remotely touched on.


>
> Exactly and I never saw Jiri claim that swdev_flow_insert() would be
> the only offload capability exposed by the API. I see no reason why
> it could not also provide swdev_offset_match_insert() or
> swdev_ebpf_insert() for the 2*next generation HW. I don't think it
> makes sense to focus entirely on finding a single common denominator
> and channel everything through a single function to represent all the
> different generic and less generic offload capabilities. I believe
> that doing so will raise the minimal HW requirements barrier HW too
> much. I think we should start somewhere, learn and evolve.
>

You are asking me to go and add a new ndo() every time i have a new 
network function? That is not scalable. I have no problem with
the approach that was posted - I have a problem that it is
focused on flows (and is lacking the ability to specify different
classifiers). It should not be called xxx_flow_xxx


> So essentially what you are saying is that the tc interface
> (in particular cls and act) could be used as an API to achieve offloads.

I am pointing to it as an example of something that is *done right* in
terms of not picking a universal classifier. Something the current
OVS posted/used api lacks (and to be frank OF never cared about because
it had a niche use case; lets not make that niche use case the centre
of gravity).

> Yes! I thought this was very clear and a given. I don't think that it
> makes sense to force every offload API consumer through the tc interface
> though.

If you looked at all my presentations I have never laid such
claim but i have always said I want everything described in
iproute2 to work. I dont think anyone disagreed.
I dont expect tc to be used as *the interface*; but on the same
token i dont expect OVS to be used as *the interface*.
Lets start with hardware abstraction. Lets map to existing Linux APIs
and then see where some massaging maybe needed.

> This comes back to my statements in a previous email. I don't
> think we should require that all the offload decision complexity *has*
> to live in the kernel.

Agreed. Move policy decisions out of the kernel, and also
any complex acrobatics that are use-case specific.

> Quagga, nft, or OVS should be given an API to
> influence this more directly (with the hardware complexity properly
> abstracted). In-kernel users such as bridge, l3 (especially rules),
> and tc itself could be handled through a cls/act derived API internally.
>

This abstraction gives OVS a 1-1 mapping, which is something I object to.
You want to penalize me for the sake of getting the OVS API in place?
Beginning with flows and laying claim that one would be able to
cover everything is a non-starter.

>> Let's pick an example: the u32 classifier (or I could pick nftables).
>> Using your scheme I have to incur penalties translating u32 to your
>> classifier and only achieve basic functionality; and now in addition
>> I can't do 90% of my u32 features. And u32 is very implementable
>> in hardware.
>
> I don't fully understand the last claim.


I will simplify:
You can't possibly implement the u32 classifier completely using the
posted hard-coded 15-tuple classifier. It is an NP-complete problem.
There are *a lot* of use cases which can be specified with u32 that are
not possible to specify with the tuples the posted patches propose.
The reverse is not true: you can fully specify the OVS classifier
with u32.
So if you want the closest thing to a universal grammar for
specifying a classifier - use u32 and create templates for your
classifier.
There are some cases where that approach doesn't make sense:
for example, if I wanted to specify a string classifier, etc.
But if we are talking about packet header classifiers - it is flexible.
There are also good reasons to specify a universal 5-tuple classifier,
as there are good reasons to specify your latest OF classifier.
But that OF classifier being the starting point is not pragmatic.
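To make the u32-versus-fixed-tuple argument concrete, here is a small userspace sketch of the offset/mask/value grammar u32 uses. Every name in it is illustrative, not the actual tc u32 or swdev structures:

```c
/* Sketch (not kernel code): why an offset/mask/value grammar like u32
 * subsumes a fixed-tuple classifier.  Names are illustrative only. */
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct u32_key {            /* one 32-bit match word, u32 style */
    int      off;           /* byte offset into the packet */
    uint32_t mask;          /* which bits matter */
    uint32_t val;           /* expected value after masking */
};

/* A rule is just a list of such words ANDed together. */
static int u32_match(const uint8_t *pkt, size_t len,
                     const struct u32_key *keys, int nkeys)
{
    for (int i = 0; i < nkeys; i++) {
        uint32_t w;
        if (keys[i].off + 4 > (int)len)
            return 0;
        memcpy(&w, pkt + keys[i].off, 4);   /* host order for the sketch */
        if ((w & keys[i].mask) != keys[i].val)
            return 0;
    }
    return 1;
}
```

A fixed ntuple rule is then just a canned template of such words over the IP/TCP header offsets, while a match deep in the payload has no fixed-tuple equivalent - which is the point above: the tuple grammar is a strict subset of this one.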

Sorry - I cut the email a little, because people with short attention
spans are probably not following by this time.

I may be slower in responding since I will be offline.

cheers,
jamal

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
       [not found]                     ` <53FA01AC.10507-jkUAjuhPggJWk0Htik3J/w@public.gmane.org>
@ 2014-08-25  2:24                       ` Scott Feldman
       [not found]                         ` <A67C7591-19BF-4431-9119-F61361F5E618-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR@public.gmane.org>
  0 siblings, 1 reply; 87+ messages in thread
From: Scott Feldman @ 2014-08-25  2:24 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	ronye-VPRAkNaXOzVWk0Htik3J/w, jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, Andy Gospodarek,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, John Fastabend,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w, ogerlitz,
	ben-/+tVBieCtBitmTQ+vhA3Yw, buytenh-OLH4Qvv75CYX/NnBR394Jw,
	Jiri Pirko, roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
	aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, Neil Horman, netdev,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ, dborkman,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, David Miller


On Aug 24, 2014, at 8:15 AM, Jamal Hadi Salim <jhs-jkUAjuhPggJWk0Htik3J/w@public.gmane.org> wrote:

> On 08/24/14 07:12, Thomas Graf wrote:
>> On 08/23/14 at 09:53pm, Jamal Hadi Salim wrote:
> 
>> 
>> I get what you are saying but I don't see that to be the case here. I
>> don't see how this series proposes the OVS case as *the* interface.
> 
> The focus of the patches is on offloading flows (uses the
> ovs or shall i say the broadcom OF-DPA API, which is one
> vendor's view of the world).
> 
> Yes, people are going to deploy more hardware which knows how to do
> a lot of flows (but today that is in the tiny tiny minority)
> 
> I would have liked to see more focus on L2/3 as a first step because
> they are more predominantly deployed than anything with flows. And
> they are well understood from a functional perspective.
> Then that would bring to the front API issues since you have
> a large sample space of deployments and we can refactor as needed.

With respect to focus on L2/L3, I have a pretty *good* hunch someone could write a kernel module that gleans from the L2/L3 netlink echoes already flying around and translates to sw_flows and in turn into ndo_swdev_flow_* calls.  So the existing linux bonds and bridges and vlans and ROUTEs and NEIGHs and LINKs and ADDRs work as normal, unchanged, with iproute2 still originating the netlink msgs from the user.  The new kernel module (let’s call it “dagger” after the white spy from spy vs. spy MAD comic) can figure out what forwarding gets offloaded to HW just from the netlink echoes.  If someone wrote a dagger module in parallel with the other efforts being discussed here, I think we’d have a pretty good idea what the API needs to look like, at least to cover the existing L2/L3 world we're all familiar with.  Gleaning netlink msgs isn’t ideal for several reasons (and probably making more than a few in the audience squeamish), but it would be a quick way to get us closer to the answer we’re seeking, which is the swdev driver model.
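A minimal userspace sketch of the dagger translation step, ignoring all the netlink plumbing; struct sw_flow_lite and the helper below are hypothetical stand-ins, not the structures from this patchset. The point is only that one gleaned RTM_NEWNEIGH object maps mechanically to one exact-match flow:

```c
/* Sketch of the "dagger" idea: take an L3 neighbour entry as it would
 * arrive in a netlink echo and rewrite it as a flow insert.  All
 * structures here are hypothetical, not the posted swdev API. */
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct neigh_entry {        /* distilled from an RTM_NEWNEIGH message */
    uint32_t ifindex;
    uint32_t ip;            /* IPv4 address (network order in real life) */
    uint8_t  lladdr[6];
};

struct sw_flow_lite {       /* hypothetical exact-match flow */
    uint32_t out_ifindex;
    uint32_t match_dst_ip;
    uint8_t  set_dst_mac[6];
};

/* dagger core: one netlink-observed object -> one HW flow */
static struct sw_flow_lite dagger_neigh_to_flow(const struct neigh_entry *n)
{
    struct sw_flow_lite f;
    f.out_ifindex  = n->ifindex;
    f.match_dst_ip = n->ip;
    memcpy(f.set_dst_mac, n->lladdr, 6);
    return f;
}
```

The real module would subscribe to the rtnetlink groups and also handle withdrawals, but the translation itself stays this mechanical.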

-scott


* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
       [not found]                         ` <A67C7591-19BF-4431-9119-F61361F5E618-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR@public.gmane.org>
@ 2014-08-25  2:42                           ` John Fastabend
       [not found]                             ` <53FAA2A2.7070801-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2014-08-25 13:42                           ` Jamal Hadi Salim
  1 sibling, 1 reply; 87+ messages in thread
From: John Fastabend @ 2014-08-25  2:42 UTC (permalink / raw)
  To: Scott Feldman
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, Andy Gospodarek,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, ronye-VPRAkNaXOzVWk0Htik3J/w,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w, ogerlitz,
	ben-/+tVBieCtBitmTQ+vhA3Yw, buytenh-OLH4Qvv75CYX/NnBR394Jw,
	Jiri Pirko, roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
	Jamal Hadi Salim, aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, Neil Horman, netdev,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ, dborkman,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, David Miller

On 08/24/2014 07:24 PM, Scott Feldman wrote:
>
> On Aug 24, 2014, at 8:15 AM, Jamal Hadi Salim <jhs-jkUAjuhPggJWk0Htik3J/w@public.gmane.org> wrote:
>
>> On 08/24/14 07:12, Thomas Graf wrote:
>>> On 08/23/14 at 09:53pm, Jamal Hadi Salim wrote:
>>
>>>
>>> I get what you are saying but I don't see that to be the case here. I
>>> don't see how this series proposes the OVS case as *the* interface.
>>
>> The focus of the patches is on offloading flows (uses the
>> ovs or shall i say the broadcom OF-DPA API, which is one
>> vendor's view of the world).
>>
>> Yes, people are going to deploy more hardware which knows how to do
>> a lot of flows (but today that is in the tiny tiny minority)
>>
>> I would have liked to see more focus on L2/3 as a first step because
>> they are more predominantly deployed than anything with flows. And
>> they are well understood from a functional perspective.
>> Then that would bring to the front API issues since you have
>> a large sample space of deployments and we can refactor as needed.
>
> With respect to focus on L2/L3, I have a pretty *good* hunch someone
> could write a kernel module that gleans from the L2/L3 netlink echoes
> already flying around and translates to sw_flows and in turn into
> ndo_swdev_flow_* calls. So the existing linux bonds and bridges and
> vlans and ROUTEs and NEIGHs and LINKs and ADDRs work as normal,
> unchanged, with iproute2 still originating the netlink msgs from the
> user. The new kernel module (let’s call it “dagger” after the white spy
> from spy vs. spy MAD comic) can figure out what forwarding gets
> offloaded to HW just from the netlink echoes. If someone wrote a dagger
> module in parallel with the other efforts being discussed here, I think
> we’d have a pretty good idea what the API needs to look like, at least
> to cover the existing L2/L3 world we're all familiar with. Gleaning
> netlink msgs isn’t ideal for several reasons (and probably making more
> than a few in the audience squeamish), but it would be a quick way to
> get us closer to the answer we’re seeking, which is the swdev driver model.
>

In the L2 case we already have the fdb_add and fdb_del semantics that
are being used today by NICs with embedded switches. And we have a DSA
patch we could dig out of patchwork for those drivers.

So I think it makes more sense to use the explicit interface rather
than put another shim layer in the kernel. It's simpler and more to the
point IMO. I suspect the resulting code will be smaller and easier to
read. I'm the squeamish one in the audience here.

.John


-- 
John Fastabend         Intel Corporation


* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
       [not found]                         ` <A67C7591-19BF-4431-9119-F61361F5E618-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR@public.gmane.org>
  2014-08-25  2:42                           ` John Fastabend
@ 2014-08-25 13:42                           ` Jamal Hadi Salim
  1 sibling, 0 replies; 87+ messages in thread
From: Jamal Hadi Salim @ 2014-08-25 13:42 UTC (permalink / raw)
  To: Scott Feldman
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	ronye-VPRAkNaXOzVWk0Htik3J/w, jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, Andy Gospodarek,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, John Fastabend,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w, ogerlitz,
	ben-/+tVBieCtBitmTQ+vhA3Yw, buytenh-OLH4Qvv75CYX/NnBR394Jw,
	Jiri Pirko, roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
	aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, Neil Horman, netdev,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ, dborkman,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, David Miller

On 08/24/14 22:24, Scott Feldman wrote:
>
> On Aug 24, 2014, at 8:15 AM, Jamal Hadi Salim <jhs-jkUAjuhPggJWk0Htik3J/w@public.gmane.org> wrote:
>

>
> With respect to focus on L2/L3, I have a pretty *good* hunch someone
> could write a kernel module that gleans from the L2/L3 netlink echoes
> already flying around and translates to sw_flows and in turn into
> ndo_swdev_flow_* calls.

And herein lies the fundamental disagreement.
I don't believe the flow interface is appropriate for either L2 or L3.
We already have some L2 interfaces (they are incomplete if you want
to capture everything a large switch does).
L3 should go the same path.

And as I said earlier, I don't think the flow interface
is appropriate as a universal classifier interface either.
You need to allow for different types of classifiers. You
need to allow for a mix and match of classifiers (although
the latter could be evolutionary).

cheers,
jamal


* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
       [not found]                             ` <53FAA2A2.7070801-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2014-08-25 13:53                               ` Jamal Hadi Salim
  2014-08-25 14:17                                 ` Thomas Graf
  0 siblings, 1 reply; 87+ messages in thread
From: Jamal Hadi Salim @ 2014-08-25 13:53 UTC (permalink / raw)
  To: John Fastabend, Scott Feldman
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, Andy Gospodarek,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, ronye-VPRAkNaXOzVWk0Htik3J/w,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w, ogerlitz,
	ben-/+tVBieCtBitmTQ+vhA3Yw, buytenh-OLH4Qvv75CYX/NnBR394Jw,
	Jiri Pirko, roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
	aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, Neil Horman, netdev,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ, dborkman,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, David Miller

On 08/24/14 22:42, John Fastabend wrote:

> In the L2 case we already have the fdb_add and fdb_del semantics that
> are being used today by NICs with embedded switches. And we have a DSA
> patch we could dig out of patchwork for those drivers.

Indeed. That is an excellent starting point of something that is proven
to work. I was hoping L3 would follow this path. For L2,
there is the mess of claiming unicast NIC addresses as part of
the fdb, but that can almost be ignored.
Caveat: fdb_XXX works well for NICs as well as different larger
ASICs - but some quirk handling is going to be needed for the tinier
openwrt type devices.
For larger switches we are going to need more for L2:
really, the bridge already covers everything large switches
offer (VLAN filtering, bridge port settings, etc). We just need
to offload that...

> So I think it makes more sense to use the explicit interface rather
> than put another shim layer in the kernel. It's simpler and more to the
> point IMO. I suspect the resulting code will be smaller and easier to
> read. I'm the squeamish one in the audience here.
>

L2/3 should be left alone and go this path. My concern about excessive
NDOs was more with the flow view of the world.

cheers,
jamal

> .John
>
>


* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
  2014-08-25 13:53                               ` Jamal Hadi Salim
@ 2014-08-25 14:17                                 ` Thomas Graf
       [not found]                                   ` <20140825141754.GA30140-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
  0 siblings, 1 reply; 87+ messages in thread
From: Thomas Graf @ 2014-08-25 14:17 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: John Fastabend, Scott Feldman, Jiri Pirko, netdev, David Miller,
	Neil Horman, Andy Gospodarek, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet, f.fainelli, roopa, linville, dev,
	jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a, buytenh,
	aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye

On 08/25/14 at 09:53am, Jamal Hadi Salim wrote:
> On 08/24/14 22:42, John Fastabend wrote:
> 
> >In the L2 case we already have the fdb_add and fdb_del semantics that
> >are being used today by NICs with embedded switches. And we have a DSA
> >patch we could dig out of patchwork for those drivers.
> 
> Indeed. That is an excellent starting point of something that is proven
> to work. I was hoping L3 would follow this path. For L2,
> there is the mess of claiming unicast NIC addresses as part of
> the fdb, but that can almost be ignored.
> Caveat: fdb_XXX works well for NICs as well as different larger
> ASICs - but some quirk handling is going to be needed for the tinier
> openwrt type devices.
> For larger switches we are going to need more for L2:
> really, the bridge already covers everything large switches
> offer (VLAN filtering, bridge port settings, etc). We just need
> to offload that...

fdb_add() *is* flow based. At least in my understanding, the whole
point here is to extend the idea of fdb_add() and make it understand
L2-L4 in a more generic way for the most common protocols.

The reason fdb_add() is not reused is because it is Netlink specific
and only suitable for User -> HW offload. Kernel -> HW offload is
technically possible but not clean.

The only reason swdev is needed at all is to represent the port model
and to allow for non flow based models built on top of the same
hardware abstraction. I see no reason why br_fdb cannot be represented
through swdev as soon as the code is stable.
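A sketch of that claim with illustrative structures (not the kernel's br_fdb or the posted swdev code): an FDB entry is an exact match on (destination MAC, VLAN) with a single forward action, i.e. already a flow.

```c
/* Sketch: an FDB entry recast as an exact-match flow rule.
 * All structures here are illustrative, not kernel code. */
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct fdb_entry { uint8_t mac[6]; uint16_t vid; uint32_t port; };

struct flow_key  { uint8_t dmac[6]; uint16_t vid; };
struct flow_rule { struct flow_key key; uint32_t out_port; };

/* fdb_add() semantics expressed as a flow insert */
static struct flow_rule fdb_as_flow(const struct fdb_entry *e)
{
    struct flow_rule r;
    memcpy(r.key.dmac, e->mac, 6);
    r.key.vid  = e->vid;
    r.out_port = e->port;
    return r;
}
```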

> >So I think it makes more sense to use the explicit interface rather
> >than put another shim layer in the kernel. It's simpler and more to the
> >point IMO. I suspect the resulting code will be smaller and easier to
> >read. I'm the squeamish one in the audience here.
> >
> 
> L2/3 should be left alone and go this path. My concern about excessive
> NDOs was more with the flow view of the world.

The point I was trying to make earlier is that it is very hard to
program both protocol aware and generic filtering hardware through
a single NDO. It will make the driver specific part complex.

If you are saying we need yet another classifier model in the kernel
then I'm not sure that is needed in the presence of cls/act, iptables,
and nftables. They seem suitable to represent non flow based models
and I see nothing that would prevent an offload through swdev for them.


* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
  2014-08-24 15:15                   ` Jamal Hadi Salim
       [not found]                     ` <53FA01AC.10507-jkUAjuhPggJWk0Htik3J/w@public.gmane.org>
@ 2014-08-25 14:54                     ` Thomas Graf
       [not found]                       ` <20140825145449.GB30140-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
  1 sibling, 1 reply; 87+ messages in thread
From: Thomas Graf @ 2014-08-25 14:54 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Scott Feldman, John Fastabend, Jiri Pirko, netdev, davem,
	nhorman, andy, dborkman, ogerlitz, jesse, pshelar, azhou, ben,
	stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet, f.fainelli, roopa, linville, dev,
	jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a, buytenh,
	aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye

On 08/24/14 at 11:15am, Jamal Hadi Salim wrote:
> The focus of the patches is on offloading flows (uses the
> ovs or shall i say the broadcom OF-DPA API, which is one
> vendor's view of the world).

Let's keep vendors out of this discussion. I have no affiliation
with this vendor. In fact I'm personally more interested in the
host use case with the biggest concerns/focus on integration with
existing APIs.

> >It proposes *an* interface which in this case is flow based with mask
> >support to accommodate the typical ntuple filter API in HW. OVS happens
> >to be one of the easiest to use examples as a consumer because it
> >already provides a flat flow representation.
> >
> 
> In other words, there is a direct 1-1 map between this approach and OVS.
> That is a contentious point.

That is simply not the case. The fact that John is using this model
to replace the flow director ioctl API should prove this.

> Not at all.
> I gave an example earlier with u32, but let's pick the other extreme
> of well understood functions, say L3 (I could pick L2 as well).
> This OpenFlow API tries to describe different header

There is not a single bit specific to OpenFlow and there is absolutely
no awareness of OF within the kernel in OVS.

> fields in the packet. That is not the challenge for such an
> API. The challenge is dealing with the quirks.
> Some chips implement FIB and NH conjoined; others implement
> them separately.
> I dont see how this is even being remotely touched on.

First of all, that sounds exactly like something that should
be handled in the driver specific portion of the API. Secondly,
can you provide additional information on these specific pieces of
hardware so we take it into account?

> You are asking me to go and add a new ndo() every time I have a new
> network function? That is not scalable. I have no problem with
> the approach that was posted - I have a problem that it is
> focused on flows (and lacks the ability to specify different
> classifiers). It should not be called xxx_flow_xxx

Realistically there will only be a handful, maybe something
like:

flow_insert / flow_remove
p4_add / p4_remove
[...]

Maybe you can share some information on the specific API you have
in mind?

> If you look at my presentations, I have never made such a
> claim; I have always said I want everything described in
> iproute2 to work. I don't think anyone disagreed.
> I don't expect tc to be used as *the interface*; but by the
> same token I don't expect OVS to be used as *the interface*.

Agreed, I don't think anybody expects anything else.

> Let's start with hardware abstraction, let's map to existing Linux APIs
> and then see where some massaging may be needed.

That's what's being done. HW offload is being mapped to OVS and
to an existing ioctl interface. Those are existing Linux APIs.
Can you explain why swdev as proposed is not suitable for the
other existing Linux APIs? They don't *have* to use flow_insert();
they are free to extend the API to represent more generic programmable
hardware.

> This abstraction gives OVS a 1-1 mapping, which is something I object to.
> You want to penalize me for the sake of getting the OVS API in place?

I don't understand this.

> Beginning with flows and laying claim that one would be able to
> cover everything is a non-starter.

Nobody claims that. In fact, I'm very interested in seeing the API
extended for non flow based models. I'm actually convinced that flow
based models are not the ultimate answer at the HW level, but a vast
majority of hardware understands some form of protocol aware exact
match or wildcard filters of limited capacity. This category of
hardware is being addressed with the flow_insert() API.

> I will simplify:
> You can't possibly implement the u32 classifier completely using the
> posted hard-coded 15-tuple classifier. It is an NP-complete problem.
> There are *a lot* of use cases which can be specified with u32 that are
> not possible to specify with the tuples the posted patches propose.
> The reverse is not true: you can fully specify the OVS classifier
> with u32.
> So if you want the closest thing to a universal grammar for
> specifying a classifier - use u32 and create templates for your
> classifier.

Completely agreed, this is why we have cls/act and nftables.

> There are some cases where that approach doesn't make sense:
> for example, if I wanted to specify a string classifier, etc.
> But if we are talking about packet header classifiers - it is flexible.
> There are also good reasons to specify a universal 5-tuple classifier,
> as there are good reasons to specify your latest OF classifier.
> But that OF classifier being the starting point is not pragmatic.

So you agree that at least at the driver level some form of ntuple
awareness must be given because the hardware has limited capabilities.
This is exactly what flow_insert() is: a generic ntuple
classifier which can implement a subset of the 15 tuples in HW. So
instead of adding a separate NDO for each fixed tuple, a generic
NDO can handle the different levels of offloads. Very similar to how
the xmit path to the NIC can handle various protocol offloads already.

What is being proposed is a generic ntuple with masking support to
describe filtering needs. What is missing is a capabilities reporting
channel so API users can know in advance what is supported and can
implement partial offloads.
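A sketch of what such a capabilities channel could look like, with hypothetical field bits and structures: the driver advertises which tuple fields it can match on, and a rule whose mask touches anything else is rejected so the caller keeps it in software.

```c
/* Sketch of masked-ntuple insertion with a capability check.
 * Field bits and structures are hypothetical, not the posted API. */
#include <assert.h>
#include <stdint.h>

#define F_DMAC    (1u << 0)
#define F_SRC_IP  (1u << 1)
#define F_DST_IP  (1u << 2)
#define F_L4_PORT (1u << 3)

struct ntuple_rule {
    uint32_t fields_used;   /* which fields this rule's mask touches */
    /* ... masked values would follow ... */
};

/* 0 on success, -1 if the HW cannot express the rule */
static int flow_insert_checked(uint32_t hw_caps, const struct ntuple_rule *r)
{
    if (r->fields_used & ~hw_caps)
        return -1;          /* partial offload: caller keeps it in SW */
    return 0;               /* would program the HW here */
}
```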


* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
       [not found]                                   ` <20140825141754.GA30140-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
@ 2014-08-25 16:15                                     ` Jamal Hadi Salim
       [not found]                                       ` <53FB6122.2040901-jkUAjuhPggJWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 87+ messages in thread
From: Jamal Hadi Salim @ 2014-08-25 16:15 UTC (permalink / raw)
  To: Thomas Graf
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	ronye-VPRAkNaXOzVWk0Htik3J/w, jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, Andy Gospodarek,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, John Fastabend,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w, ogerlitz,
	ben-/+tVBieCtBitmTQ+vhA3Yw, buytenh-OLH4Qvv75CYX/NnBR394Jw,
	Jiri Pirko, roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
	aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, Neil Horman, netdev,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ, dborkman,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, David Miller

On 08/25/14 10:17, Thomas Graf wrote:
> On 08/25/14 at 09:53am, Jamal Hadi Salim wrote:

> fdb_add() *is* flow based. At least in my understanding, the whole
> point here is to extend the idea of fdb_add() and make it understand
> L2-L4 in a more generic way for the most common protocols.
>
> The reason fdb_add() is not reused is because it is Netlink specific
> and only suitable for User -> HW offload. Kernel -> HW offload is
> technically possible but not clean.
>

I don't think we have a problem handling any of this today.


> The only reason swdev is needed at all is to represent the port model
> and to allow for non flow based models built on top of the same
> hardware abstraction. I see no reason why br_fdb cannot be represented
> through swdev as soon as the code is stable.
>

This is where our (shall I say strong) disagreement is.
I think you will find it non-trivial to show me how you can
actually take the simple L2 bridge and map it to a "flow".
Since your starting point is "everything can be represented via a flow
and some table" - we are at a crossroads.

> The point I was trying to make earlier is that it is very hard to
> program both protocol aware and generic filtering hardware through
> a single NDO. It will make the driver specific part complex.
>

The tc filter API seems to be doing just that.
You have different types of classifiers - the h/w may not be able
to support some classifier types - but that is a capability discovery
challenge.

> If you are saying we need yet another classifier model in the kernel
> then I'm not sure that is needed in the presence of cls/act, iptables,
> and nftables. They seem suitable to represent non flow based models
> and I see nothing that would prevent an offload through swdev for them.
>

I am saying two things:
1) There are a few "fundamental" interfaces, L2 and L3 being some.
Add crypto offload and a few I mentioned in my presentation. We
know how to do those; for example, there is nothing I can't do with
the rtmsg for L3, or the fdb/port/vlan filter for L2.
This flow thing should stay out of those.

2) The flow thing should allow a variety of classifiers to be
handled. Again, capability discovery would take care of differences.
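A sketch of point 2, with illustrative names only: the flow NDO carries a classifier *type*, and per-device capability discovery decides which grammars are offloadable.

```c
/* Sketch: one entry point, many classifier grammars, with capability
 * discovery per device.  Enum values and ops are hypothetical. */
#include <assert.h>
#include <stdint.h>

enum cls_type { CLS_NTUPLE, CLS_U32, CLS_NFT, CLS_MAX };

struct swdev_caps {
    uint32_t cls_supported;  /* bitmap over enum cls_type */
};

/* 0 on success, -1 if this device cannot offload the grammar */
static int swdev_cls_insert(const struct swdev_caps *c,
                            enum cls_type type, const void *rule)
{
    (void)rule;
    if (!(c->cls_supported & (1u << type)))
        return -1;          /* fall back to software classification */
    return 0;               /* dispatch to a per-type handler here */
}
```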

cheers,
jamal


* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
       [not found]                       ` <20140825145449.GB30140-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
@ 2014-08-25 16:48                         ` Jamal Hadi Salim
  2014-08-25 22:11                           ` Thomas Graf
  0 siblings, 1 reply; 87+ messages in thread
From: Jamal Hadi Salim @ 2014-08-25 16:48 UTC (permalink / raw)
  To: Thomas Graf
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	ronye-VPRAkNaXOzVWk0Htik3J/w, jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, andy-QlMahl40kYEqcZcGjlUOXw,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, John Fastabend,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, ben-/+tVBieCtBitmTQ+vhA3Yw,
	buytenh-OLH4Qvv75CYX/NnBR394Jw, Jiri Pirko,
	roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
	aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, nhorman-2XuSBdqkA4R54TAoqtyWWQ,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	dborkman-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q

On 08/25/14 10:54, Thomas Graf wrote:
> On 08/24/14 at 11:15am, Jamal Hadi Salim wrote:

> Let's keep vendors out of this discussion.

The API is from a vendor. It is clearly labelled as an OF API.
It does a good job of abstracting that vendor's SDK to enable OF. That
is relevant info.
If it covers all other vendors (which is where
the quirk handling comes in), I will be fine with it.
I don't believe it does.

> That is simply not the case. The fact that John is using this model
> to replace the flow director ioctl API should prove this.

It depends which NIC classifier John is mapping to. The Intels have
about 4-5 different types of classifiers on different hardware
with different interfaces.
If it is a "flow type" - yes. I think you could wing in
RSS (and somehow announce you can't handle UDP). You may be able
to tie in RSS.
I am not sure about VMDq; neither am I sure about what happens
when you need to deal with a combination of 2 or more classifiers,
which I believe is part of the lookups in such hardware.
So that aside:
if you are telling me John is going to also map the L2 fdb here, we are
going to have a strong disagreement.
And back to my earlier argument:
allow for multiple classifiers to be expressed, not THE ONE.
If I wanted to support CLASSIFIER_RSS from tc I could write one
and I could use tc to configure it. Or I could write one for nftables.
In general I probably should be able to wing it with some small
acrobatics.


> There is not a single bit specific to OpenFlow and there is absolutely
> no awareness of OF within the kernel in OVS.
>

The API is for OF support in a vendor ASIC.

>> fields in the packet. That is not the challenge for such an
> API. The challenge is dealing with the quirks.
>> Some chips implement FIB and NH conjoined; others implement
>> them separately.
>> I dont see how this is even being remotely touched on.
>
> First of all, that sounds exactly like something that should
> be handled in the driver specific portion of the API. Secondly,
> can you provide additional information on these specific pieces of
> hardware so we take it into account?
>

I gave a simple example.
There are a hell of a lot more quirks than that.
There are cases where there are multiple tables in terms of net masks,
etc.
Yes, this should be handled in the driver. The input is the route
message we already specify and not some XXX_flow_XXX struct.


> Realistically there will only be a handful, maybe something
> like:
>
> flow_insert / flow_remove
> p4_add / p4_remove
> [...]
>
> Maybe you can share some information on the specific API you have
> in mind?
>

I would be tagging along with you guys for flows if you:
a) allow for different classifiers. This allows me to implement
u32 and offload it.
b) allow for different actions (I think this part is not controversial;
you seem to have it already).
c) stay out of L2/3. We know how to do this already. We have
representative data structures that *completely* define those.


> Agreed, I don't think anybody expects anything else.
>

I understand the intent may be that. That is not the reality
when you start with OF-DPA as the API.


>> Let's start with hardware abstraction, let's map to existing Linux APIs
>> and then see where some massaging may be needed.
>
> That's what's being done. HW offload is being mapped to OVS and
> to an existing ioctl interface. Those are existing Linux APIs.
> Can you explain why swdev as proposed is not suitable for the
> other existing Linux APIs? They don't *have* to use the flow_insert(),
> they are free to extend the API to represent more generic programmable
> hardware.
>

I would like XXX_flow_XXX to allow for multiple types of classifiers.
nftables may express one, and a driver which is capable can offload it.

>> This abstraction gives OVS 1-1 mapping which is something i object to.
>> You want to penalize me for the sake of getting the OVS api in place?
>
> I don't understand this.
>

Refer to my comments earlier.

>> Beginning with flows and laying claim that one would be able to
>> cover everything is a non-starter.
>
> Nobody claims that. In fact, I'm very interested in seeing the API
> extended for non flow based models. I'm actually convinced that flow
> based models are not the ultimate answers on HW level but a vast majority
> of hardware understands some form of protocol aware exact match or
> wildcard filters of limited capacity. This category of hardware is
> being addressed with the flow_insert() API.
>

And make that also take the classifier type as input.


>> There are some cases where that approach doesn't make sense:
>> example if i wanted to specify a string classifier etc.
>> But if we are talking packet header classifier - it is flexible.
>> There are also good reasons to specify a universal 5 tuple classifier.
>> As there are good reasons to specify your latest OF classifier.
>> But that OF classifier being the starting point is not pragmatic.
>
> So you agree that at least on the driver level some form of ntuple
> awareness must be given because the hardware has limited capabilities.

Yes, there is a classifier *type* for which the 15-tuple makes sense.

> This is exactly what flow_insert() is, it is a generic ntuple
> classifier which can implement a subset of the 15 tuple in HW. So
> instead of adding a separate NDO for each fixed tuple, a generic
> NDO can handle the different levels of offloads. Very similar to how
> the xmit to the NIC can handle various protocol offloads already.
>
> What is being proposed is a generic ntuple with masking support to
> describe filtering needs. What is missing is a capabilities reporting
> channel so API users can know in advance what is supported to
> implement partial offloads.
>


The 15-tuple itself needs to be one of several classifiers.
Creating a universal classifier is problematic. Look at the tc
classifier approach (which I know you understand well).

Sorry - I am under time constraints and may not be as responsive.

cheers,
jamal

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
  2014-08-25 16:48                         ` Jamal Hadi Salim
@ 2014-08-25 22:11                           ` Thomas Graf
  2014-08-26 14:00                             ` Jamal Hadi Salim
  0 siblings, 1 reply; 87+ messages in thread
From: Thomas Graf @ 2014-08-25 22:11 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Scott Feldman, John Fastabend, Jiri Pirko, netdev, davem,
	nhorman, andy, dborkman, ogerlitz, jesse, pshelar, azhou, ben,
	stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet, f.fainelli, roopa, linville, dev,
	jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a, buytenh,
	aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye

First of all, thanks for the animated discussion, wouldn't
want to miss our arguments ;-)

On 08/25/14 at 12:48pm, Jamal Hadi Salim wrote:
> On 08/25/14 10:54, Thomas Graf wrote:
> >On 08/24/14 at 11:15am, Jamal Hadi Salim wrote:
> 
> >Let's keep vendors out of this discussion.
> 
> The API is from a vendor. It is clearly labelled as an OF API.
> It covers abstracting that vendor's SDK to enable OF well. That
> is relevant info.
> If it covers all other vendors (which is where
> the quirk handling comes in), I will be fine with it.
> I don't believe it does.

If I understand you correctly you are referring to the rocker
patch here. That is not part of the API.
 
> >That is simply not the case. The fact that John is using this model
> >to replace the flow director ioctl API should prove this.
> 
> depends what NIC classifier John is mapping to. The Intels have
> about 4-5 different types of classifier on different hardware

Sorry for not addressing this but I think John should speak for
himself here, I don't want to misrepresent his plans.

> I gave a simple example.
> There are a hell of a lot more quirks than that.
> There are cases where there are multiple tables in terms of net masks
> etc.
> Yes, this should be handled in the driver. The input is the route
> message we already specify and not some XXX_Flow_XXx struct.

I would argue that swflow is a superset of a Netlink route. It
> may in fact be very useful to extend the API with something that
understands the Netlink representation of a route and have the
API translate that to a classifier that can be offloaded.

> I would be tagging along with you guys for flows if you:
> a) allow for different classifiers. This allows me to implement
> u32 and offload it.

Agreed. What you seem to disagree on is:

 - ndo_add_type1([...])
 - ndo_add_type2([...])
 - ndo_add_type3([...])

vs.

 - ndo_add_classifier(type, [...])

I honestly have little against the 2nd. It sounds a bit like an
ioctl interface though, where a giant switch statement will cast
the data to a classifier-specific struct, which is why I slightly
dislike it.

It looks to me that a specific chip may either work in a flow/filter
mode, in a generic programmable mode or by providing a list of very
specific filters without a generic flow -> action relation. Having
multiple classifier types for all of them gives the impression that
an API user could use them in any combination which I would say will
typically not be the case.

> b) different actions (I think this part is not controversial, you
> seem to be having it already).

Agreed


* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
       [not found]                                       ` <53FB6122.2040901-jkUAjuhPggJWk0Htik3J/w@public.gmane.org>
@ 2014-08-25 22:50                                         ` Thomas Graf
       [not found]                                           ` <20140825225057.GD30140-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
  2014-08-26 14:26                                           ` Jamal Hadi Salim
  0 siblings, 2 replies; 87+ messages in thread
From: Thomas Graf @ 2014-08-25 22:50 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	ronye-VPRAkNaXOzVWk0Htik3J/w, jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, Andy Gospodarek,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, John Fastabend,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w, ogerlitz,
	ben-/+tVBieCtBitmTQ+vhA3Yw, buytenh-OLH4Qvv75CYX/NnBR394Jw,
	Jiri Pirko, roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
	aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, Neil Horman, netdev,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ, dborkman,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, David Miller

On 08/25/14 at 12:15pm, Jamal Hadi Salim wrote:
> On 08/25/14 10:17, Thomas Graf wrote:
> >On 08/25/14 at 09:53am, Jamal Hadi Salim wrote:
> 
> >fdb_add() *is* flow based. At least in my understanding, the whole
> >point here is to extend the idea of fdb_add() and make it understand
> >L2-L4 in a more generic way for the most common protocols.
> >
> >The reason fdb_add() is not reused is because it is Netlink specific
> >and only suitable for User -> HW offload. Kernel -> HW offload is
> >technically possible but not clean.
> >
> 
> I dont think we have a problem handling any of this today.

Yes we do. It's restricted to L2 and we can't extend it easily
because it is based on NDA_*. The use of Netlink makes in-kernel
usage a pain. To me this is the sole reason for not using fdb_add()
in the first place. It seems absolutely clear though that fdb_add()
should be removed after the more generic ndo is in place providing
a superset of what fdb_add() can do today.

> This is where our (shall i say strong) disagreement is.
> I think you will find it non-trivial to show me how you can
> actually take the simple L2 bridge and map it to a "flow".
> Since your starting point is "everything can be represented via a flow
> and some table" - we are at a crosspath.

OK, let me do the conversion for you:

NDA_DST		unused
NDA_LLADDR	sw_flow_key.eth.dst
NDA_CACHEINFO	unused
NDA_PROBES	unused
NDA_VLAN	sw_flow_key.eth.tci
NDA_PORT	unused
NDA_VNI		sw_flow_key.tun_key.tun_id
NDA_IFINDEX	sw_flow_key.phys.in_port
NDA_MASTER	unused

> The tc filter API seems to be doing just that.
> You have different types of classifiers - the h/w may not be able
> to support some classifier types - but that is a capability discovery
> challenge.

Agreed but tc is only one out of many possible existing interfaces
we have. macvtap (given we want to extend beyond L2), routing,
OVS, bridge and eventually even things like a team device can and
should make use of offloads. 

> I am saying two things:
> 1) There are a few "fundamental" interfaces; L2 and L3 being some.
> Add crypto offload and a few i mentioned in  my presentation. We

Can you share that preso? I was not present.

> know how to do those. example; there is nothing i cant do with
> the rtmsg that is L3. or the fdb/port/vlan filter for L2.
> This flow thing should stay out of those.

Let me remind you about the name of the structure behind all L3
forwarding decisions:

        struct flowi4 {   
		[...]
	}

Adding a route means adding a flow. Can we please stop the flow
bashing? The concept of a flow is very generic, well known and already
very present in the kernel.

The sw_flow_key proposed comes close to flowi4. Some fields are
different. They can eventually get merged. The strict IPv4/IPv6
separation is what makes it non obvious and probably why Jiri chose
the OVS representation. If you say rtmsg is complete then that clearly
is not the case. In particular VTEP fields, ARP, and TCP flags are
clearly missing for many uses.

Again, I'm not saying flow is the ultimate answer to everything. It
is not. But a lot of hardware out there is aware of flows in combination
with some form of action execution. Non flow based hardware can have
their own classifier.

> 2) The flow thing should allow a variety of classifiers to be
> handled. Again capability discovery would take care of differences.

So you want the flow to represent something that is not a flow. Again,
this comes back to the conversation in the other email. If this is
all about having a single ndo I'm sure we can find common grounds on
that.


* Re: [patch net-next RFC 04/12] rtnl: expose physical switch id for particular device
       [not found]       ` <53F79537.20207-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2014-08-26  8:32         ` Jiri Pirko
  0 siblings, 0 replies; 87+ messages in thread
From: Jiri Pirko @ 2014-08-26  8:32 UTC (permalink / raw)
  To: John Fastabend
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, andy-QlMahl40kYEqcZcGjlUOXw,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, ronye-VPRAkNaXOzVWk0Htik3J/w,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, ben-/+tVBieCtBitmTQ+vhA3Yw,
	buytenh-OLH4Qvv75CYX/NnBR394Jw,
	roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
	jhs-jkUAjuhPggJWk0Htik3J/w, aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, nhorman-2XuSBdqkA4R54TAoqtyWWQ,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	dborkman-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q

Fri, Aug 22, 2014 at 09:08:39PM CEST, john.fastabend-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org wrote:
>On 08/21/2014 09:18 AM, Jiri Pirko wrote:
>>The netdevice represents a port in a switch, it will expose
>>IFLA_PHYS_SWITCH_ID value via rtnl. Two netdevices with the same value
>>belong to one physical switch.
>>
>>Signed-off-by: Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
>
>What is the relation between phys_port_id and phys_switch_id?
>
>phys_port_id was intended to identify a set of ports that belong
>to a single uplink port,
>
>
>	eth0     eth1    eth2   eth3      (host facing)
>          |       |       |      |
>          |       |       |      |
>      +---+-------+-------+------+---+
>      |      embedded switch         |
>      +------------------------------+
>                     |
>                    MAC                   (network)
>
>In the NIC case there is a simply switch with a port to the
>network which we currently don't represent with a netdev. Any
>netdev where the phys_switch_id's are behind the same embedded
>switch.

I think that the MAC in your picture should be represented as a netdev
(switch port). Also, the other ports connected to eth0-3 should be
represented as netdevs. All of these + the MAC should have the same
switch id.

>
>In the switch id case we are indicating the port is attached to
>the same embedded switch as well.
>
>         eth0 eth1 eth2 eth3
>          |    |    |    |
>     +----+----+----+----+----+
>     |         switch         |
>     +----+----+----+----+----+
>
>but they do not share an uplink port? So in this case each ethx
>has a unique phys_port_id but the same phys_switch_id?

Yes.

>
>In the first case both phys_port_id and phys_switch_id should
>be equal for all interfaces correct?

See above. In the case of an embedded switch on a NIC, I believe that
eth0-eth3 should have the same port_id and no switch_id, as they are
not switch ports (their counterparts in the switch, marked as "+" in
your picture, are the switch ports).

>
>Is that clear/useful at all? We need to document this somewhere
>if/when the patches are submitted otherwise I doubt we will get it
>consistently right across drivers. There could for example be
>somewhat strange devices with virtual functions hanging off of the
>switch.

I will extend the documentation to my "net: introduce generic switch
devices support" patch.

>
>Thanks,
>John
>
>-- 
>John Fastabend         Intel Corporation


* Re: [patch net-next RFC 03/12] net: introduce generic switch devices support
       [not found]       ` <20140824114605.GC32741-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
@ 2014-08-26  8:34         ` Jiri Pirko
  0 siblings, 0 replies; 87+ messages in thread
From: Jiri Pirko @ 2014-08-26  8:34 UTC (permalink / raw)
  To: Thomas Graf
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, andy-QlMahl40kYEqcZcGjlUOXw,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, ronye-VPRAkNaXOzVWk0Htik3J/w,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, ben-/+tVBieCtBitmTQ+vhA3Yw,
	buytenh-OLH4Qvv75CYX/NnBR394Jw,
	roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
	jhs-jkUAjuhPggJWk0Htik3J/w, aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, nhorman-2XuSBdqkA4R54TAoqtyWWQ,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	dborkman-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q

Sun, Aug 24, 2014 at 01:46:05PM CEST, tgraf-G/eBtMaohhA@public.gmane.org wrote:
>On 08/21/14 at 06:18pm, Jiri Pirko wrote:
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index 39294b9..8b5d14c 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -49,6 +49,8 @@
>>  
>>  #include <linux/netdev_features.h>
>>  #include <linux/neighbour.h>
>> +#include <linux/sw_flow.h>
>> +
>>  #include <uapi/linux/netdevice.h>
>>  
>>  struct netpoll_info;
>> @@ -997,6 +999,24 @@ typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
>> + * int (*ndo_swdev_flow_insert)(struct net_device *dev,
>> + *				const struct sw_flow *flow);
>> + *	Called to insert a flow into switch device. If driver does
>> + *	not implement this, it is assumed that the hw does not have
>> + *	a capability to work with flows.
>
>I assume you are planning to add an additional expandable struct
>parameter to handle insertion parameters when the first one is
>introduced, to avoid requiring to touch every driver every time.

Sure. That is the way to go.

>
>> +/**
>> + *	swdev_flow_insert - Insert a flow into switch
>> + *	@dev: port device
>> + *	@flow: flow descriptor
>> + *
>> + *	Insert a flow into switch this port is part of.
>> + */
>> +int swdev_flow_insert(struct net_device *dev, const struct sw_flow *flow)
>> +{
>> +	const struct net_device_ops *ops = dev->netdev_ops;
>> +
>> +	print_flow(flow, dev, "insert");
>> +	if (!ops->ndo_swdev_flow_insert)
>> +		return -EOPNOTSUPP;
>> +	WARN_ON(!ops->ndo_swdev_get_id);
>> +	BUG_ON(!flow->actions);
>> +	return ops->ndo_swdev_flow_insert(dev, flow);
>> +}
>> +EXPORT_SYMBOL(swdev_flow_insert);
>
>Splitting the flow specific API into a separate file (maybe
>swdev_flow.c?) might help resolve some of the concerns around the
>focus on flows. It would make it clear that it's one of multiple
>models to be supported.

I understand your point. But the file is tiny as it is. I would keep all
in one file for now.


* Re: [patch net-next RFC 02/12] net: rename netdev_phys_port_id to more generic name
       [not found]   ` <1408637945-10390-3-git-send-email-jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
@ 2014-08-26 12:23     ` Or Gerlitz
       [not found]       ` <53FC7C3C.3090901-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 87+ messages in thread
From: Or Gerlitz @ 2014-08-26 12:23 UTC (permalink / raw)
  To: Jiri Pirko, netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, andy-QlMahl40kYEqcZcGjlUOXw,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, ronye-VPRAkNaXOzVWk0Htik3J/w,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	ben-/+tVBieCtBitmTQ+vhA3Yw, buytenh-OLH4Qvv75CYX/NnBR394Jw,
	roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
	jhs-jkUAjuhPggJWk0Htik3J/w, aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, nhorman-2XuSBdqkA4R54TAoqtyWWQ,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	dborkman-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q

On 21/08/2014 19:18, Jiri Pirko wrote:
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -868,7 +868,7 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
>   	       + rtnl_port_size(dev, ext_filter_mask) /* IFLA_VF_PORTS + IFLA_PORT_SELF */
>   	       + rtnl_link_get_size(dev) /* IFLA_LINKINFO */
>   	       + rtnl_link_get_af_size(dev) /* IFLA_AF_SPEC */
> -	       + nla_total_size(MAX_PHYS_PORT_ID_LEN); /* IFLA_PHYS_PORT_ID */
> +	       + nla_total_size(MAX_PHYS_ITEM_ID_LEN); /* IFLA_PHYS_PORT_ID */
>   }
>
>   static int rtnl_vf_ports_fill(struct sk_buff *skb, struct net_device *dev)
> @@ -952,7 +952,7 @@ static int rtnl_port_fill(struct sk_buff *skb, struct net_device *dev,
>   static int rtnl_phys_port_id_fill(struct sk_buff *skb, struct net_device *dev)
>   {
>   	int err;
> -	struct netdev_phys_port_id ppid;
> +	struct netdev_phys_item_id ppid;
>
>   	err = dev_get_phys_port_id(dev, &ppid);
>   	if (err) {
> @@ -1196,7 +1196,7 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
>   	[IFLA_PROMISCUITY]	= { .type = NLA_U32 },
>   	[IFLA_NUM_TX_QUEUES]	= { .type = NLA_U32 },
>   	[IFLA_NUM_RX_QUEUES]	= { .type = NLA_U32 },
> -	[IFLA_PHYS_PORT_ID]	= { .type = NLA_BINARY, .len = MAX_PHYS_PORT_ID_LEN },
> +	[IFLA_PHYS_PORT_ID]	= { .type = NLA_BINARY, .len = MAX_PHYS_ITEM_ID_LEN },
>   	[IFLA_CARRIER_CHANGES]	= { .type = NLA_U32 },  /* ignored */
>   };
>

just a nit, but if this approach/patch goes in, any reason not to change 
IFLA_PHYS_PORT_ID to IFLA_PHYS_ITEM_ID?

Or.


* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
       [not found]                                           ` <20140825225057.GD30140-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
@ 2014-08-26 13:50                                             ` Roopa Prabhu
  2014-08-26 14:06                                               ` Jiri Pirko
  2014-08-26 15:01                                               ` Scott Feldman
  0 siblings, 2 replies; 87+ messages in thread
From: Roopa Prabhu @ 2014-08-26 13:50 UTC (permalink / raw)
  To: Thomas Graf
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	ronye-VPRAkNaXOzVWk0Htik3J/w, jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, Andy Gospodarek,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, Shrijeet Mukherjee,
	John Fastabend, jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	ogerlitz, ben-/+tVBieCtBitmTQ+vhA3Yw,
	buytenh-OLH4Qvv75CYX/NnBR394Jw, Jiri Pirko, Jamal Hadi Salim,
	aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, Neil Horman, netdev,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ, dborkman,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, David Miller

On 8/25/14, 3:50 PM, Thomas Graf wrote:
> On 08/25/14 at 12:15pm, Jamal Hadi Salim wrote:
>> On 08/25/14 10:17, Thomas Graf wrote:
>>> On 08/25/14 at 09:53am, Jamal Hadi Salim wrote:
>>> fdb_add() *is* flow based. At least in my understanding, the whole
>>> point here is to extend the idea of fdb_add() and make it understand
>>> L2-L4 in a more generic way for the most common protocols.
>>>
>>> The reason fdb_add() is not reused is because it is Netlink specific
>>> and only suitable for User -> HW offload. Kernel -> HW offload is
>>> technically possible but not clean.
>>>
>> I dont think we have a problem handling any of this today.
> Yes we do. It's restricted to L2 and we can't extend it easily
> because it is based on NDA_*. The use of Netlink makes in-kernel
> usage a pain. To me this is the sole reason for not using fdb_add()
> in the first place. It seems absolutely clear though that fdb_add()
> should be removed after the more generic ndo is in place providing
> a superset of what fdb_add() can do today.
>
>> This is where our (shall i say strong) disagreement is.
>> I think you will find it non-trivial to show me how you can
>> actually take the simple L2 bridge and map it to a "flow".
>> Since your starting point is "everything can be represented via a flow
>> and some table" - we are at a crosspath.
> OK, let me do the conversion for you:
>
> NDA_DST		unused
> NDA_LLADDR	sw_flow_key.eth.dst
> NDA_CACHEINFO	unused
> NDA_PROBES	unused
> NDA_VLAN	sw_flow_key.eth.tci
> NDA_PORT	unused
> NDA_VNI		sw_flow_key.tun_key.tun_id
> NDA_IFINDEX	sw_flow_key.phys.in_port
> NDA_MASTER	unused
>
>> The tc filter API seems to be doing just that.
>> You have different types of classifiers - the h/w may not be able
>> to support some classifier types - but that is a capability discovery
>> challenge.
> Agreed but tc is only one out of many possible existing interfaces
> we have. macvtap (given we want to extend beyond L2), routing,
> OVS, bridge and eventually even things like a team device can and
> should make use of offloads.
>
>> I am saying two things:
>> 1) There are a few "fundamental" interfaces; L2 and L3 being some.
>> Add crypto offload and a few i mentioned in  my presentation. We
> Can you share that preso? I was not present.
>
>> know how to do those. example; there is nothing i cant do with
>> the rtmsg that is L3. or the fdb/port/vlan filter for L2.
>> This flow thing should stay out of those.
> Let me remind you about the name of the structure behind all L3
> forwarding decisions:
>
>          struct flowi4 {
> 		[...]
> 	}
>
> Adding a route means adding a flow. Can we please stop the flow
> bashing? The concept of a flow is very generic, well known and already
> very present in the kernel.
>
> The sw_flow_key proposed comes close to flowi4. Some fields are
> different. They can eventually get merged. The strict IPv4/IPv6
> separation is what makes it non obvious and probably why Jiri chose
> the OVS representation. If you say rtmsg is complete then that clearly
> is not the case. In particular VTEP fields, ARP, and TCP flags are
> clearly missing for many uses.
>
> Again, I'm not saying flow is the ultimate answer to everything. It
> is not. But a lot of hardware out there is aware of flows in combination
> with some form of action execution. Non flow based hardware can have
> their own classifier.
>
>> 2) The flow thing should allow a variety of classifiers to be
>> handled. Again capability discovery would take care of differences.
> So you want the flow to represent something that is not a flow. Again,
> this comes back to the conversation in the other email. If this is
> all about having a single ndo I'm sure we can find common grounds on
> that.

 From what I understood (trying to summarize here for my own benefit):
the switchdev api currently under review proposes to abstract every
switch asic offload as a flow. It does not mandate this via code,
however, there seems to be some discussion along those lines.

The switchdev api flow ndo's need to stay for switch asic drivers that
support flows directly, or that possibly want all their hw offload
abstraction to be represented by the flow abstraction (openvswitch,
the rocker dev). The details of how the flow is mapped to hw lie in
the corresponding switch driver code.

We think rtnetlink is the api to model switch asic hw tables.
We have a working model (Cumulus) that maps rtnetlink to switch
asic hw tables (via snooping rtnetlink msgs). This can be done by
extending the switchdev api with new ndo's for l2 and l3.

Example:
   new switchdev ndo's for fdb_add/fdb_del
   new switchdev ndo's for l3

Now we only need working patches that implement switchdev api ndo ops
for l2/l3 (this is in the works).

As long as the current patches under review allow the extension of the
api to cover non-flow based l2/l3 switch asic offloads, we might be
good (?).

Thanks,
Roopa


* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
  2014-08-25 22:11                           ` Thomas Graf
@ 2014-08-26 14:00                             ` Jamal Hadi Salim
  2014-08-26 14:20                               ` Thomas Graf
  0 siblings, 1 reply; 87+ messages in thread
From: Jamal Hadi Salim @ 2014-08-26 14:00 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Scott Feldman, John Fastabend, Jiri Pirko, netdev, davem,
	nhorman, andy, dborkman, ogerlitz, jesse, pshelar, azhou, ben,
	stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet, f.fainelli, roopa, linville, dev,
	jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a, buytenh,
	aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye

On 08/25/14 18:11, Thomas Graf wrote:
> First of all, thanks for the animated discussion, wouldn't
> want to miss our arguments ;-)
>

Passion is key my friend;-> It is said that the ancient Greeks
would ask of a person whose funeral they were thinking of attending
"was s/he passionate in life?" And if the answer was negative
they simply didn't show up;->

> On 08/25/14 at 12:48pm, Jamal Hadi Salim wrote:
>> On 08/25/14 10:54, Thomas Graf wrote:

> I would argue that swflow is a superset of a Netlink route. It
> may infact be very useful to extend the API with something that
> understands the Netlink representation of a route and have the
> API translate that to a classifier that can be offloaded.
>

Sorry Thomas, I disagree.
A route has a lot more knobs than just a simple flow representation.
We are talking next hops (of which there could be multiple) etc.
There is no way you can boil that down to a simple flow representation.

>> I would be tagging along with you guys for flows if you:
>> a) allow for different classifiers. This allows me to implement
>> u32 and offload it.
>
> Agreed. What you seem to disagree on is:
>
>   - ndo_add_type1([...])
>   - ndo_add_type2([...])
>   - ndo_add_type3([...])
>
> vs.
>
>   - ndo_add_classifier(type, [...])
>

Only for what you call a "flow" - mostly because you have decided
on the universal classifier (let's call it THEONE).
Implementation-wise, you don't have to pass a type. It could be
a sub-ops() function pointer.

cheers,
jamal


* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
  2014-08-26 13:50                                             ` Roopa Prabhu
@ 2014-08-26 14:06                                               ` Jiri Pirko
  2014-08-26 14:58                                                 ` Jamal Hadi Salim
  2014-08-26 15:01                                               ` Scott Feldman
  1 sibling, 1 reply; 87+ messages in thread
From: Jiri Pirko @ 2014-08-26 14:06 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: Thomas Graf, Jamal Hadi Salim, John Fastabend, Scott Feldman,
	netdev, David Miller, Neil Horman, Andy Gospodarek, dborkman,
	ogerlitz, jesse, pshelar, azhou, ben, stephen, jeffrey.t.kirsher,
	vyasevic, xiyou.wangcong, john.r.fastabend, edumazet, f.fainelli,
	linville, dev, jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a,
	buytenh, aviadr, nbd, alexei.starovoitov, Neil.Jerram, ron

Tue, Aug 26, 2014 at 03:50:21PM CEST, roopa@cumulusnetworks.com wrote:
>On 8/25/14, 3:50 PM, Thomas Graf wrote:
>>On 08/25/14 at 12:15pm, Jamal Hadi Salim wrote:
>>>On 08/25/14 10:17, Thomas Graf wrote:
>>>>On 08/25/14 at 09:53am, Jamal Hadi Salim wrote:
>>>>fdb_add() *is* flow based. At least in my understanding, the whole
>>>>point here is to extend the idea of fdb_add() and make it understand
>>>>L2-L4 in a more generic way for the most common protocols.
>>>>
>>>>The reason fdb_add() is not reused is because it is Netlink specific
>>>>and only suitable for User -> HW offload. Kernel -> HW offload is
>>>>technically possible but not clean.
>>>>
>>>I dont think we have a problem handling any of this today.
>>Yes we do. It's restricted to L2 and we can't extend it easily
>>because it is based on NDA_*. The use of Netlink makes in-kernel
>>usage a pain. To me this is the sole reason for not using fdb_add()
>>in the first place. It seems absolutely clear though that fdb_add()
>>should be removed after the more generic ndo is in place providing
>>a superset of what fdb_add() can do today.
>>
>>>This is where our (shall i say strong) disagreement is.
>>>I think you will find it non-trivial to show me how you can
>>>actually take the simple L2 bridge and map it to a "flow".
>>>Since your starting point is "everything can be represented via a flow
>>>and some table" - we are at a crosspath.
>>OK, let me do the conversion for you:
>>
>>NDA_DST		unused
>>NDA_LLADDR	sw_flow_key.eth.dst
>>NDA_CACHEINFO	unused
>>NDA_PROBES	unused
>>NDA_VLAN	sw_flow_key.eth.tci
>>NDA_PORT	unused
>>NDA_VNI		sw_flow_key.tun_key.tun_id
>>NDA_IFINDEX	sw_flow_key.phys.in_port
>>NDA_MASTER	unused
>>
>>>The tc filter API seems to be doing just that.
>>>You have different types of classifiers - the h/w may not be able
>>>to support some classifier types - but that is a capability discovery
>>>challenge.
>>Agreed but tc is only one out of many possible existing interfaces
>>we have. macvtap (given we want to extend beyond L2), routing,
>>OVS, bridge and eventually even things like a team device can and
>>should make use of offloads.
>>
>>>I am saying two things:
>>>1) There are a few "fundamental" interfaces; L2 and L3 being some.
>>>Add crypto offload and a few i mentioned in  my presentation. We
>>Can you share that preso? I was not present.
>>
>>>know how to do those. example; there is nothing i cant do with
>>>the rtmsg that is L3. or the fdb/port/vlan filter for L2.
>>>This flow thing should stay out of those.
>>Let me remind you about the name of the structure behind all L3
>>forwarding decisions:
>>
>>         struct flowi4 {
>>		[...]
>>	}
>>
>>Adding a route means adding a flow. Can we please stop the flow
>>bashing? The concept of a flow is very generic, well known and already
>>very present in the kernel.
>>
>>The sw_flow_key proposed comes close to flowi4. Some fields are
>>different. They can eventually get merged. The strict IPv4/IPv6
>>separation is what makes it non obvious and probably why Jiri chose
>>the OVS representation. If you say rtmsg is complete then that clearly
>>is not the case. In particular VTEP fields, ARP, and TCP flags are
>>clearly missing for many uses.
>>
>>Again, I'm not saying flow is the ultimate answer to everything. It
>>is not. But a lot of hardware out there is aware of flows in combination
>>with some form of action execution. Non flow based hardware can have
>>their own classifier.
>>
>>>2) The flow thing should allow a variety of classifiers to be
>>>handled. Again capability discovery would take care of differences.
>>So you want the flow to represent something that is not a flow. Again,
>>this comes back to the conversation in the other email. If this is
>>all about having a single ndo I'm sure we can find common grounds on
>>that.
>
>From what i understood (trying to summarize here for my own benefit):
>the switchdev api currently under review proposes every switch asic offload
>abstraction as a flow.
>It does not mandate this via code, however, there seems to be some discussion
>along those lines.
>
>The switchdev api flow ndo's need to stay for switch asic drivers that
>support flows directly or
>possibly want all their hw offload abstraction to be represented by the flow
>abstraction (openvswitch, the rocker dev ). The details of how the flow is
>mapped to hw lies in the corresponding switch driver code.

Nod.

>
>We think rtnetlink is the api to model switch asic hw tables.
>We have a working model (Cumulus) that maps rtnetlink to switch
>asic hw tables (via snooping rtnetlink msgs). This can be done by extending
>the switchdev api
>with new ndo's for l2 and l3.
>
>Example:
>  new switchdev ndo's for fdb_add/fdb_del
>  new switchdev ndo's for l3

Nod.

>
>Now we only need working patches that implement switchdev api ndo ops for
>l2/l3 (this is in the works).
>
>As long as the current patches under review allow the extension of the api to
>cover non-flow based l2/l3 switch asic offloads, we might be good (?).


Yes. Flows are phase one. The api will be extended for whatever is
needed for l2/l3, as you said. I also see a possibility to implement the
l2/l3 use case with flows as well. But generally, as with every in-kernel
api, we can extend it and change it.


>
>
>
>--
>To unsubscribe from this list: send the line "unsubscribe netdev" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 02/12] net: rename netdev_phys_port_id to more generic name
       [not found]       ` <53FC7C3C.3090901-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2014-08-26 14:10         ` Jiri Pirko
  2014-08-26 17:14         ` Stephen Hemminger
  1 sibling, 0 replies; 87+ messages in thread
From: Jiri Pirko @ 2014-08-26 14:10 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, andy-QlMahl40kYEqcZcGjlUOXw,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, ronye-VPRAkNaXOzVWk0Htik3J/w,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	ben-/+tVBieCtBitmTQ+vhA3Yw, buytenh-OLH4Qvv75CYX/NnBR394Jw,
	roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
	jhs-jkUAjuhPggJWk0Htik3J/w, aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, nhorman-2XuSBdqkA4R54TAoqtyWWQ,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	dborkman-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q

Tue, Aug 26, 2014 at 02:23:24PM CEST, ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org wrote:
>On 21/08/2014 19:18, Jiri Pirko wrote:
>>--- a/net/core/rtnetlink.c
>>+++ b/net/core/rtnetlink.c
>>@@ -868,7 +868,7 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
>>  	       + rtnl_port_size(dev, ext_filter_mask) /* IFLA_VF_PORTS + IFLA_PORT_SELF */
>>  	       + rtnl_link_get_size(dev) /* IFLA_LINKINFO */
>>  	       + rtnl_link_get_af_size(dev) /* IFLA_AF_SPEC */
>>-	       + nla_total_size(MAX_PHYS_PORT_ID_LEN); /* IFLA_PHYS_PORT_ID */
>>+	       + nla_total_size(MAX_PHYS_ITEM_ID_LEN); /* IFLA_PHYS_PORT_ID */
>>  }
>>
>>  static int rtnl_vf_ports_fill(struct sk_buff *skb, struct net_device *dev)
>>@@ -952,7 +952,7 @@ static int rtnl_port_fill(struct sk_buff *skb, struct net_device *dev,
>>  static int rtnl_phys_port_id_fill(struct sk_buff *skb, struct net_device *dev)
>>  {
>>  	int err;
>>-	struct netdev_phys_port_id ppid;
>>+	struct netdev_phys_item_id ppid;
>>
>>  	err = dev_get_phys_port_id(dev, &ppid);
>>  	if (err) {
>>@@ -1196,7 +1196,7 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
>>  	[IFLA_PROMISCUITY]	= { .type = NLA_U32 },
>>  	[IFLA_NUM_TX_QUEUES]	= { .type = NLA_U32 },
>>  	[IFLA_NUM_RX_QUEUES]	= { .type = NLA_U32 },
>>-	[IFLA_PHYS_PORT_ID]	= { .type = NLA_BINARY, .len = MAX_PHYS_PORT_ID_LEN },
>>+	[IFLA_PHYS_PORT_ID]	= { .type = NLA_BINARY, .len = MAX_PHYS_ITEM_ID_LEN },
>>  	[IFLA_CARRIER_CHANGES]	= { .type = NLA_U32 },  /* ignored */
>>  };
>>
>
>just a nit, but if this approach/patch goes in, any reason not to change
>IFLA_PHYS_PORT_ID to IFLA_PHYS_ITEM_ID?

It would still be port_id. No change there. I changed the in-kernel struct
name "port"->"item" so it can be reused for switch_id as well.

>
>Or.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
  2014-08-26 14:00                             ` Jamal Hadi Salim
@ 2014-08-26 14:20                               ` Thomas Graf
  0 siblings, 0 replies; 87+ messages in thread
From: Thomas Graf @ 2014-08-26 14:20 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Scott Feldman, John Fastabend, Jiri Pirko, netdev, davem,
	nhorman, andy, dborkman, ogerlitz, jesse, pshelar, azhou, ben,
	stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet, f.fainelli, roopa, linville, dev,
	jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a, buytenh,
	aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye

On 08/26/14 at 10:00am, Jamal Hadi Salim wrote:
> >I would argue that swflow is a superset of a Netlink route. It
> >may infact be very useful to extend the API with something that
> >understands the Netlink representation of a route and have the
> >API translate that to a classifier that can be offloaded.
> >
> 
> Sorry Thomas, I disagree.
> A route has a lot more knobs than just a simple flow representation.
> We are talking next hops (of which there could be multiple) etc.
> There is no way you can boil that down to a simple flow representation.

I guess we could argue forever.

To answer your specific statement: if per-(wildcard-)flow nexthop
balancing behaviour is good enough, the current flow insert is likely
sufficient. If the hardware should do the balancing, an additional
API function is probably needed to cleanly represent that.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
  2014-08-25 22:50                                         ` Thomas Graf
       [not found]                                           ` <20140825225057.GD30140-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
@ 2014-08-26 14:26                                           ` Jamal Hadi Salim
  1 sibling, 0 replies; 87+ messages in thread
From: Jamal Hadi Salim @ 2014-08-26 14:26 UTC (permalink / raw)
  To: Thomas Graf
  Cc: John Fastabend, Scott Feldman, Jiri Pirko, netdev, David Miller,
	Neil Horman, Andy Gospodarek, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet, f.fainelli, roopa, linville, dev,
	jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a, buytenh,
	aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye

On 08/25/14 18:50, Thomas Graf wrote:
> On 08/25/14 at 12:15pm, Jamal Hadi Salim wrote:
>> On 08/25/14 10:17, Thomas Graf wrote:

>> I dont think we have a problem handling any of this today.
>
> Yes we do. It's restricted to L2 and we can't extend it easily

It is restricted to L2 because it is L2 processing;->
i.e. a fixed function that is widely deployed and well understood.
Possible new extensions that are added are still L2
(for example, I think if you were to add TRILL support, you would
likely need to inherit and extend the bridge, then add new TLVs).

> because it is based on NDA_*. The use of Netlink makes in-kernel
> usage a pain.

Ok, I understand what you mean by "in kernel" now.
I believe we have representations that are complete today at L3.
The offloader just feeds on that.
L2 needs some work because we have only been offloading the fdb.

>To me this is the sole reason for not using fdb_add()
> in the first place. It seems absolutely clear though that fdb_add()
> should be removed after the more generic ndo is in place providing
> a superset of what fdb_add() can do today.
>

It is by no means complete, as I pointed out in my other email.
We need to worry about bridge ports, vlan filtering, igmp snooping,
possibly STP parametrization, and other knobs of control (flood control,
learning control, etc).

> OK, let me do the convertion for you:
>
> NDA_DST		unused
> NDA_LLADDR	sw_flow_key.eth.dst
> NDA_CACHEINFO	unused
> NDA_PROBES	unused
> NDA_VLAN	sw_flow_key.eth.tci
> NDA_PORT	unused
> NDA_VNI		sw_flow_key.tun_key.tun_id
> NDA_IFINDEX	sw_flow_key.phys.in_port
> NDA_MASTER	unused
>

You are waaaay oversimplifying;->.
You need to worry about the rest of the other knobs that
are relevant when one offloads the bridge (refer above to
some of the things I said are missing from the current fdb()
interface).

> Agreed but tc is only one out of many possible existing interfaces
> we have. macvtap (given we want to extend beyond L2), routing,
> OVS, bridge and eventually even things like a team device can and
> should make use of offloads.
>

Sure. I just want my cookies. I want it such that if I use a tc filter,
and that filter is offloadable, and there exists a device capable
of offloading in my system - then it should work.


> Can you share that preso? I was not present.
>

I think it should be posted on the netconf site.
Also refer to my earlier presentation in the online meeting,
which you were present at.

> Let me remind you about the name of the structure behind all L3
> forwarding decisions:
>
>          struct flowi4 {
> 		[...]
> 	}
>
> Adding a route means adding a flow.

Come on, Thomas;->
It is called the "flowi" structure - but it represents a much more complex
thing than your definition of "flow".

>Can we please stop the flow bashing?

Let me get out my club and bash it some more ;->
I am going to start a newsgroup called alt.bash.bash.flow
Any postings from stanford will be censored by the banana republic
dictator.

cheers,
jamal

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
  2014-08-26 14:06                                               ` Jiri Pirko
@ 2014-08-26 14:58                                                 ` Jamal Hadi Salim
  2014-08-26 15:22                                                   ` Jiri Pirko
  0 siblings, 1 reply; 87+ messages in thread
From: Jamal Hadi Salim @ 2014-08-26 14:58 UTC (permalink / raw)
  To: Jiri Pirko, Roopa Prabhu
  Cc: Thomas Graf, John Fastabend, Scott Feldman, netdev, David Miller,
	Neil Horman, Andy Gospodarek, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet, f.fainelli, linville, dev, jasowang,
	ebiederm, nicolas.dichtel, ryazanov.s.a, buytenh, aviadr, nbd,
	alexei.starovoitov, Neil.Jerram, ronye, Shrijeet Mukherjee

On 08/26/14 10:06, Jiri Pirko wrote:

> Yes. Flows are phase one. The api will be extended in for whatever is
> needed for l2/l3 as you said. Also I see a possibility to implement the
> l2/l3 use case with flows as well.

And as a note: this is where I have the disagreement.
It is good there is acknowledgement that you are handling flows for now -
or whatever tuples you defined as "flow". I don't think L2 or L3 fit
in that. If that's not what you are saying, then we are in agreement.

cheers,
jamal

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
  2014-08-26 13:50                                             ` Roopa Prabhu
  2014-08-26 14:06                                               ` Jiri Pirko
@ 2014-08-26 15:01                                               ` Scott Feldman
       [not found]                                                 ` <D891A8EC-548C-453E-AC70-8431DAC4B8C4-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR@public.gmane.org>
  1 sibling, 1 reply; 87+ messages in thread
From: Scott Feldman @ 2014-08-26 15:01 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: Thomas Graf, Jamal Hadi Salim, John Fastabend, Jiri Pirko,
	netdev, David Miller, Neil Horman, Andy Gospodarek, dborkman,
	ogerlitz, jesse, pshelar, azhou, ben, stephen, jeffrey.t.kirsher,
	vyasevic, xiyou.wangcong, john.r.fastabend, edumazet, f.fainelli,
	linville, dev, jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a,
	buytenh, aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye


On Aug 26, 2014, at 6:50 AM, Roopa Prabhu <roopa@cumulusnetworks.com> wrote:

> On 8/25/14, 3:50 PM, Thomas Graf wrote:
>> On 08/25/14 at 12:15pm, Jamal Hadi Salim wrote:
>>> On 08/25/14 10:17, Thomas Graf wrote:
>>>> On 08/25/14 at 09:53am, Jamal Hadi Salim wrote:
>>>> fdb_add() *is* flow based. At least in my understanding, the whole
>>>> point here is to extend the idea of fdb_add() and make it understand
>>>> L2-L4 in a more generic way for the most common protocols.
>>>> 
>>>> The reason fdb_add() is not reused is because it is Netlink specific
>>>> and only suitable for User -> HW offload. Kernel -> HW offload is
>>>> technically possible but not clean.
>>>> 
>>> I dont think we have a problem handling any of this today.
>> Yes we do. It's restricted to L2 and we can't extend it easily
>> because it is based on NDA_*. The use of Netlink makes in-kernel
>> usage a pain. To me this is the sole reason for not using fdb_add()
>> in the first place. It seems absolutely clear though that fdb_add()
>> should be removed after the more generic ndo is in place providing
>> a superset of what fdb_add() can do today.
>> 
>>> This is where our (shall i say strong) disagreement is.
>>> I think you will find it non-trivial to show me how you can
>>> actually take the simple L2 bridge and map it to a "flow".
>>> Since your starting point is "everything can be represented via a flow
>>> and some table" - we are at a crosspath.
>> OK, let me do the convertion for you:
>> 
>> NDA_DST		unused
>> NDA_LLADDR	sw_flow_key.eth.dst
>> NDA_CACHEINFO	unused
>> NDA_PROBES	unused
>> NDA_VLAN	sw_flow_key.eth.tci
>> NDA_PORT	unused
>> NDA_VNI		sw_flow_key.tun_key.tun_id
>> NDA_IFINDEX	sw_flow_key.phys.in_port
>> NDA_MASTER	unused
>> 
>>> The tc filter API seems to be doing just that.
>>> You have different types of classifiers - the h/w may not be able
>>> to support some classifier types - but that is a capability discovery
>>> challenge.
>> Agreed but tc is only one out of many possible existing interfaces
>> we have. macvtap (given we want to extend beyond L2), routing,
>> OVS, bridge and eventually even things like a team device can and
>> should make use of offloads.
>> 
>>> I am saying two things:
>>> 1) There are a few "fundamental" interfaces; L2 and L3 being some.
>>> Add crypto offload and a few i mentioned in  my presentation. We
>> Can you share that preso? I was not present.
>> 
>>> know how to do those. example; there is nothing i cant do with
>>> the rtmsg that is L3. or the fdb/port/vlan filter for L2.
>>> This flow thing should stay out of those.
>> Let me remind you about the name of the structure behind all L3
>> forwarding decisions:
>> 
>>         struct flowi4 {
>> 		[...]
>> 	}
>> 
>> Adding a route means adding a flow. Can we please stop the flow
>> bashing? The concept of a flow is very generic, well known and already
>> very present in the kernel.
>> 
>> The sw_flow_key proposed comes close to flowi4. Some fields are
>> different. They can eventually get merged. The strict IPv4/IPv6
>> separation is what makes it non obvious and probably why Jiri chose
>> the OVS representation. If you say rtmsg is complete then that clearly
>> is not the case. In particular VTEP fields, ARP, and TCP flags are
>> clearly missing for many uses.
>> 
>> Again, I'm not saying flow is the ultimate answer to everything. It
>> is not. But a lot of hardware out there is aware of flows in combination
>> with some form of action execution. Non flow based hardware can have
>> their own classifier.
>> 
>>> 2) The flow thing should allow a variety of classifiers to be
>>> handled. Again capability discovery would take care of differences.
>> So you want the flow to represent something that is not a flow. Again,
>> this comes back to the conversation in the other email. If this is
>> all about having a single ndo I'm sure we can find common grounds on
>> that.
> 
> From what i understood (trying to summarize here for my own benefit):
> the switchdev api currently under review proposes every switch asic offload abstraction as a flow.
> It does not mandate this via code, however, there seems to be some discussion along those lines.
> 
> The switchdev api flow ndo's need to stay for switch asic drivers that support flows directly or
> possibly want all their hw offload abstraction to be represented by the flow abstraction (openvswitch, the rocker dev ). The details of how the flow is mapped to hw lies in the corresponding switch driver code.
> 
> We think rtnetlink is the api to model switch asic hw tables.
> We have a working model (Cumulus) that maps rtnetlink to switch
> asic hw tables (via snooping rtnetlink msgs). This can be done by extending the switchdev api
> with new ndo's for l2 and l3.
> 

I don't see it that way.  I believe sw_flow can be the intermediary representation spanning flow-based and non-flow-based HW, bridging the flow-based world and the traditional l2/l3 world.


> Example:
>  new switchdev ndo's for fdb_add/fdb_del
>  new switchdev ndo's for l3
> 
> Now we only need working patches that implement switchdev api ndo ops for l2/l3 (this is in the works).
> 
> As long as the current patches under review allow the extension of the api to cover non-flow based l2/l3 switch asic offloads, we might be good (?).
> 
> Thanks,
> Roopa
> 
> 
> 


-scott

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
       [not found]                                                 ` <D891A8EC-548C-453E-AC70-8431DAC4B8C4-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR@public.gmane.org>
@ 2014-08-26 15:12                                                   ` Jamal Hadi Salim
  0 siblings, 0 replies; 87+ messages in thread
From: Jamal Hadi Salim @ 2014-08-26 15:12 UTC (permalink / raw)
  To: Scott Feldman, Roopa Prabhu
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	ronye-VPRAkNaXOzVWk0Htik3J/w, jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, Andy Gospodarek,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, Shrijeet Mukherjee,
	John Fastabend, jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	ogerlitz, ben-/+tVBieCtBitmTQ+vhA3Yw,
	buytenh-OLH4Qvv75CYX/NnBR394Jw, Jiri Pirko,
	aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, Neil Horman, netdev,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ, dborkman,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, David Miller

On 08/26/14 11:01, Scott Feldman wrote:
>

>
> I don’t see it that way.  I believe sw_flow can be the intermediary representation to span flow-based and non-flow-based HW,
> and from flow-based world and traditional l2/l3 world.
>
>

Is there more magic to this than what Thomas just presented in this thread?

cheers,
jamal

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
  2014-08-26 14:58                                                 ` Jamal Hadi Salim
@ 2014-08-26 15:22                                                   ` Jiri Pirko
       [not found]                                                     ` <20140826152217.GA1843-6KJVSR23iU5sFDB2n11ItA@public.gmane.org>
  0 siblings, 1 reply; 87+ messages in thread
From: Jiri Pirko @ 2014-08-26 15:22 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Roopa Prabhu, Thomas Graf, John Fastabend, Scott Feldman, netdev,
	David Miller, Neil Horman, Andy Gospodarek, dborkman, ogerlitz,
	jesse, pshelar, azhou, ben, stephen, jeffrey.t.kirsher, vyasevic,
	xiyou.wangcong, john.r.fastabend, edumazet, f.fainelli, linville,
	dev, jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a, buytenh,
	aviadr, nbd, alexei.starovoitov, Neil.Jerram

Tue, Aug 26, 2014 at 04:58:54PM CEST, jhs@mojatatu.com wrote:
>On 08/26/14 10:06, Jiri Pirko wrote:
>
>>Yes. Flows are phase one. The api will be extended in for whatever is
>>needed for l2/l3 as you said. Also I see a possibility to implement the
>>l2/l3 use case with flows as well.
>
>And as a note: This is where i have the disagreement.
>It is good there is acknowledgement you are handling flows for now.
>Or whatever tuples you defined as "flow". I dont think L2 or 3 fit
>in that. If thats not what you are saying then we are in agreement.

I do not think that really matters. Phase one is flows. After that we
can focus on l2/l3. If we are able to fit it in flows (some drivers
may), then ok. If not, we extend the api with a couple more l2/l3
related ndos. I see no problem there.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
       [not found]                                                     ` <20140826152217.GA1843-6KJVSR23iU5sFDB2n11ItA@public.gmane.org>
@ 2014-08-26 15:29                                                       ` Jamal Hadi Salim
  2014-08-26 15:44                                                         ` Jiri Pirko
  0 siblings, 1 reply; 87+ messages in thread
From: Jamal Hadi Salim @ 2014-08-26 15:29 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	ronye-VPRAkNaXOzVWk0Htik3J/w, jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, Andy Gospodarek,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, Shrijeet Mukherjee,
	John Fastabend, jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	ogerlitz, ben-/+tVBieCtBitmTQ+vhA3Yw,
	buytenh-OLH4Qvv75CYX/NnBR394Jw, Roopa Prabhu,
	aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, Neil Horman, netdev,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ, dborkman,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, David Miller

On 08/26/14 11:22, Jiri Pirko wrote:

> I do not think that really matters. Phase one is flows. After that we
> can focus on l2/l3. If we would be able to fit in in flows (some drivers
> may), then ok. If not, we extend the api with couple of more l2/l3
> related ndos. I see no problem there.

Well, it matters because we are proceeding to implement L2/3,
i.e. the simple stuff first. We don't have anything to show yet - but
we will hopefully have some useful bits by Plumbers.
So maybe the best path forward is we talk then and see how we can merge
efforts, since we can't seem to agree at this point.

cheers,
jamal

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
  2014-08-26 15:29                                                       ` Jamal Hadi Salim
@ 2014-08-26 15:44                                                         ` Jiri Pirko
       [not found]                                                           ` <20140826154459.GB1843-6KJVSR23iU5sFDB2n11ItA@public.gmane.org>
  0 siblings, 1 reply; 87+ messages in thread
From: Jiri Pirko @ 2014-08-26 15:44 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Roopa Prabhu, Thomas Graf, John Fastabend, Scott Feldman, netdev,
	David Miller, Neil Horman, Andy Gospodarek, dborkman, ogerlitz,
	jesse, pshelar, azhou, ben, stephen, jeffrey.t.kirsher, vyasevic,
	xiyou.wangcong, john.r.fastabend, edumazet, f.fainelli, linville,
	dev, jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a, buytenh,
	aviadr, nbd, alexei.starovoitov, Neil.Jerram

Tue, Aug 26, 2014 at 05:29:10PM CEST, jhs@mojatatu.com wrote:
>On 08/26/14 11:22, Jiri Pirko wrote:
>
>>I do not think that really matters. Phase one is flows. After that we
>>can focus on l2/l3. If we would be able to fit in in flows (some drivers
>>may), then ok. If not, we extend the api with couple of more l2/l3
>>related ndos. I see no problem there.
>
>Well, it matters because we are proceeding to implement L2/3.
>i.e the simple stuff first. We dont have anything to show yet - but
>we will hopefully have some useful bit by Plumbers.
>So maybe best path forward is we talk then and see how we can merge
>efforts since we cant seem to agree at this point.

I think we are in agreement. We have two worlds: flows and l2/3. We need
both, for sure. And my patchset adds the initial part of the first one.
The second one can be added later. I do not see any issues with that.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
       [not found]                                                           ` <20140826154459.GB1843-6KJVSR23iU5sFDB2n11ItA@public.gmane.org>
@ 2014-08-26 15:54                                                             ` Andy Gospodarek
       [not found]                                                               ` <20140826155426.GA5275-Me9pkO/C/lgvPfuUPAiksl6hYfS7NtTn@public.gmane.org>
  0 siblings, 1 reply; 87+ messages in thread
From: Andy Gospodarek @ 2014-08-26 15:54 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	ronye-VPRAkNaXOzVWk0Htik3J/w, jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, Andy Gospodarek,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, Shrijeet Mukherjee,
	John Fastabend, jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	ogerlitz, ben-/+tVBieCtBitmTQ+vhA3Yw,
	buytenh-OLH4Qvv75CYX/NnBR394Jw, Roopa Prabhu, Jamal Hadi Salim,
	aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, Neil Horman, netdev,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ, dborkman,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, David Miller

On Tue, Aug 26, 2014 at 05:44:59PM +0200, Jiri Pirko wrote:
> Tue, Aug 26, 2014 at 05:29:10PM CEST, jhs-jkUAjuhPggJWk0Htik3J/w@public.gmane.org wrote:
> >On 08/26/14 11:22, Jiri Pirko wrote:
> >
> >>I do not think that really matters. Phase one is flows. After that we
> >>can focus on l2/l3. If we would be able to fit in in flows (some drivers
> >>may), then ok. If not, we extend the api with couple of more l2/l3
> >>related ndos. I see no problem there.
> >
> >Well, it matters because we are proceeding to implement L2/3.
> >i.e the simple stuff first. We dont have anything to show yet - but
> >we will hopefully have some useful bit by Plumbers.
> >So maybe best path forward is we talk then and see how we can merge
> >efforts since we cant seem to agree at this point.
> 
> I think we are in agreement. We have two worlds: flows and l2/3. We need
> both for sure. And my patchset adds an initial part of the first one.
> The second one can be added later. I do not see any issues in that.

It is easy to *say* it could be added later, but connecting to software
forwarding in the kernel outside of OVS (which is important to some)
would take significant effort since this set only connects switch
hardware to OVS.

It may be that all software-based forwarding is done via OVS in the
future, but it feels like we are long way from that future for those
that do not want to use an external controller.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
       [not found]                                                               ` <20140826155426.GA5275-Me9pkO/C/lgvPfuUPAiksl6hYfS7NtTn@public.gmane.org>
@ 2014-08-26 16:19                                                                 ` Thomas Graf
       [not found]                                                                   ` <20140826161956.GA15316-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
  0 siblings, 1 reply; 87+ messages in thread
From: Thomas Graf @ 2014-08-26 16:19 UTC (permalink / raw)
  To: Andy Gospodarek
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	ronye-VPRAkNaXOzVWk0Htik3J/w, jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, Andy Gospodarek,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, Shrijeet Mukherjee,
	John Fastabend, jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	ogerlitz, ben-/+tVBieCtBitmTQ+vhA3Yw,
	buytenh-OLH4Qvv75CYX/NnBR394Jw, Jiri Pirko, Roopa Prabhu,
	Jamal Hadi Salim, aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, Neil Horman, netdev,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ, dborkman,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, David Miller

On 08/26/14 at 11:54am, Andy Gospodarek wrote:
> It is easy to *say* it could be added later, but connecting to software
> forwarding in the kernel outside of OVS (which is important to some)
> would take significant effort since this set only connects switch
> hardware to OVS.

Can you explain why that effort is more significant if a flow API is
added first? I'm not saying it is easy to offload the existing
forwarding path, otherwise it would have been done already, but
I don't understand how the proposal makes this any more difficult.

> It may be that all software-based forwarding is done via OVS in the
> future, but it feels like we are a long way from that future for those
> who do not want to use an external controller.

Wait... I don't want to use OpenFlow to configure my laptop ;-)

We should leave the controller out of this discussion though. A
controller is not required to run OVS at all. OpenStack Neutron
is a very good example for that. There are even applications which
use the OVS kernel datapath but not the OVS user space portion.
We have a wide set of APIs serving different purposes and need to
account for all of them. I'm as much interested in an offloaded
nftables and tc command as you.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 02/12] net: rename netdev_phys_port_id to more generic name
       [not found]       ` <53FC7C3C.3090901-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2014-08-26 14:10         ` Jiri Pirko
@ 2014-08-26 17:14         ` Stephen Hemminger
  1 sibling, 0 replies; 87+ messages in thread
From: Stephen Hemminger @ 2014-08-26 17:14 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, andy-QlMahl40kYEqcZcGjlUOXw,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, ronye-VPRAkNaXOzVWk0Htik3J/w,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	ben-/+tVBieCtBitmTQ+vhA3Yw, buytenh-OLH4Qvv75CYX/NnBR394Jw,
	Jiri Pirko, roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
	jhs-jkUAjuhPggJWk0Htik3J/w, aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, nhorman-2XuSBdqkA4R54TAoqtyWWQ,
	netdev-u79uwXL29TY76Z2rM5mHXA, dborkman-H+wXaHxf7aLQT0dZR+AlfA,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, davem-fT/PcQaiUtIeIZ0/mPfg9Q

On Tue, 26 Aug 2014 15:23:24 +0300
Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:

> just a nit, but if this approach/patch goes in, any reason not to change 
> IFLA_PHYS_PORT_ID to IFLA_PHYS_ITEM_ID?

Userspace API

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
       [not found]                                                                   ` <20140826161956.GA15316-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
@ 2014-08-26 18:41                                                                     ` Andy Gospodarek
  2014-08-26 20:13                                                                     ` Alexei Starovoitov
  1 sibling, 0 replies; 87+ messages in thread
From: Andy Gospodarek @ 2014-08-26 18:41 UTC (permalink / raw)
  To: Thomas Graf
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	ronye-VPRAkNaXOzVWk0Htik3J/w, jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, Andy Gospodarek,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, Shrijeet Mukherjee,
	John Fastabend, jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	ogerlitz, ben-/+tVBieCtBitmTQ+vhA3Yw,
	buytenh-OLH4Qvv75CYX/NnBR394Jw, Jiri Pirko, Roopa Prabhu,
	Jamal Hadi Salim, aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, Neil Horman, netdev,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ, dborkman,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, David Miller

On Tue, Aug 26, 2014 at 05:19:56PM +0100, Thomas Graf wrote:
> On 08/26/14 at 11:54am, Andy Gospodarek wrote:
> > It is easy to *say* it could be added later, but connecting to software
> > forwarding in the kernel outside of OVS (which is important to some)
> > would take significant effort since this set only connects switch
> > hardware to OVS.
> 
> Can you explain why that effort is more significant if a flow API
> is added first? I'm not saying it is easy to offload the existing
> forwarding path, otherwise it would have been done already, but
> I don't understand how the proposal makes this any more difficult.
Sorry if I introduced any confusion.  My intent was not to imply that
there would be any specific increase in the technical effort required
to implement other software forwarding elements in hardware if a
flow-based API is added first.

> > It may be that all software-based forwarding is done via OVS in the
> > future, but it feels like we are a long way from that future for those
> > who do not want to use an external controller.
> 
> Wait... I don't want to use OpenFlow to configure my laptop ;-)
You don't?  I do.  ;-)

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 06/12] net: introduce dummy switch
       [not found]     ` <1408637945-10390-7-git-send-email-jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
@ 2014-08-26 19:14       ` Andy Gospodarek
       [not found]         ` <20140826191420.GC5275-Me9pkO/C/lgvPfuUPAiksl6hYfS7NtTn@public.gmane.org>
  0 siblings, 1 reply; 87+ messages in thread
From: Andy Gospodarek @ 2014-08-26 19:14 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, andy-QlMahl40kYEqcZcGjlUOXw,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, ronye-VPRAkNaXOzVWk0Htik3J/w,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, ben-/+tVBieCtBitmTQ+vhA3Yw,
	buytenh-OLH4Qvv75CYX/NnBR394Jw,
	roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
	jhs-jkUAjuhPggJWk0Htik3J/w, aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, nhorman-2XuSBdqkA4R54TAoqtyWWQ,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	dborkman-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q

On Thu, Aug 21, 2014 at 06:18:59PM +0200, Jiri Pirko wrote:
> Dummy switch implementation using switchdev interface
> 
[...]
> +	if (!data || !data[IFLA_DYMMYSWPORT_PHYS_SWITCH_ID])
[...]
> +	dsp->psid.id_len = nla_len(data[IFLA_DYMMYSWPORT_PHYS_SWITCH_ID]);
> +	memcpy(dsp->psid.id, nla_data(data[IFLA_DYMMYSWPORT_PHYS_SWITCH_ID]),
[...]
> +	[IFLA_DYMMYSWPORT_PHYS_SWITCH_ID] = { .type = NLA_BINARY,
[...]
> +	IFLA_DYMMYSWPORT_PHYS_SWITCH_ID,
I realize this does compile, but I suspect this was a typo?

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
       [not found]                                                                   ` <20140826161956.GA15316-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
  2014-08-26 18:41                                                                     ` Andy Gospodarek
@ 2014-08-26 20:13                                                                     ` Alexei Starovoitov
  2014-08-26 20:54                                                                       ` Thomas Graf
  1 sibling, 1 reply; 87+ messages in thread
From: Alexei Starovoitov @ 2014-08-26 20:13 UTC (permalink / raw)
  To: Thomas Graf
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	ronye-VPRAkNaXOzVWk0Htik3J/w, jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ, Eric Dumazet,
	Andy Gospodarek, dev-yBygre7rU0TnMu66kgdUjQ,
	nbd-p3rKhJxN3npAfugRpC6u6w, Florian Fainelli, Andy Gospodarek,
	Shrijeet Mukherjee, John Fastabend,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w, ogerlitz,
	ben-/+tVBieCtBitmTQ+vhA3Yw, buytenh-OLH4Qvv75CYX/NnBR394Jw,
	Jiri Pirko, Roopa Prabhu, Jamal Hadi Salim,
	aviadr-VPRAkNaXOzVWk0Htik3J/w, Nicolas Dichtel,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, Neil Horman, netdev,
	Stephen Hemminger, dborkman

On Tue, Aug 26, 2014 at 9:19 AM, Thomas Graf <tgraf-G/eBtMaohhA@public.gmane.org> wrote:
>
> Wait... I don't want to use OpenFlow to configure my laptop ;-)

+1

> We should leave the controller out of this discussion though. A
> controller is not required to run OVS at all. OpenStack Neutron
> is a very good example for that. There are even applications which
> use the OVS kernel datapath but not the OVS user space portion.
> We have a wide set of APIs serving different purposes and need to
> account for all of them. I'm as much interested in an offloaded
> nftables and tc command as you.

I think it's an important distinction. In-kernel OVS is not OF.
It's a networking function that has a hard-coded packet parser,
N-tuple match and programmable actions.
There were times when HW vendors were using the OF check-box
to sell more chips, but in the end there is not a single piece of HW
that is fully OF compliant. The OF brand is still around, but
OF 2.0 is not tcam+action anymore.
Imo trying to standardize a HW offload interface based on OF 1.x
principles is strange. Does anyone have performance data
that shows that hard-parser+N-tuple-match offload actually speeds
up real-life applications?
Why are we designing kernel offload based on the 'rocker' emulator?
The enterprise silicon I've seen doesn't look like it...
I'm not saying that the kernel should not have a driver for rocker.
It should, but it shouldn't be the golden model for HW offload.

The "straw-man proposal for OF 2.0" paper has some very
interesting ideas:
http://arxiv.org/pdf/1312.1719.pdf
Sooner or later off-the-shelf NICs will have similar functionality.

In Linux we already have the bridge, which is a perfect abstraction of
L2 network functions. OF 1.x has to use a 'tcam' to implement a bridge
and in-kernel OVS has to fall back to 'mega-flows', but HW has proper
exact-match tables and HW mac learning,
so OF 1.x principles just don't fit L2 offloading.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
  2014-08-26 20:13                                                                     ` Alexei Starovoitov
@ 2014-08-26 20:54                                                                       ` Thomas Graf
  2014-08-29 14:20                                                                         ` Jamal Hadi Salim
  0 siblings, 1 reply; 87+ messages in thread
From: Thomas Graf @ 2014-08-26 20:54 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Gospodarek, Jiri Pirko, Jamal Hadi Salim, Roopa Prabhu,
	John Fastabend, Scott Feldman, netdev, David Miller, Neil Horman,
	Andy Gospodarek, dborkman, ogerlitz, Jesse Gross, Pravin Shelar,
	Andy Zhou, ben, Stephen Hemminger, jeffrey.t.kirsher, vyasevic,
	Cong Wang, john.r.fastabend, Eric Dumazet, Florian Fainelli,
	John Linville, dev@openvswitch.org

On 08/26/14 at 01:13pm, Alexei Starovoitov wrote:
> I think it's an important distinction. In-kernel OVS is not OF.
> It's a networking function that has a hard-coded packet parser,
> N-tuple match and programmable actions.
> There were times when HW vendors were using the OF check-box
> to sell more chips, but in the end there is not a single piece of HW
> that is fully OF compliant. The OF brand is still around, but
> OF 2.0 is not tcam+action anymore.
> Imo trying to standardize a HW offload interface based on OF 1.x
> principles is strange. Does anyone have performance data
> that shows that hard-parser+N-tuple-match offload actually speeds
> up real-life applications?

That really depends on your definition of application. A pure
switch application will obviously benefit. The host case is more
complicated: offloading packet switching doesn't buy you anything
by itself, but it does allow using SR-IOV in a broader context. If
I have to choose between a DPDK host stack bypass and a well
abstracted and flexible SR-IOV bypass, I would favour the VF approach.
Especially once things like P4 make their way into hardware and
the flexibility of hardware becomes less of an issue. My interest
is mainly driven by this perspective.

> Why are we designing kernel offload based on 'rocker' emulator?
> Enterprise silicon I've seen doesn't look like it...
> I'm not saying that kernel should not have a driver for rocker.
> It should, but it shouldn't be a golden model for HW offload.

I agree that it shouldn't be the golden model. Is that even
happening? Given the earlier discussions, the only reason rocker
exists seems to be the lack of public hardware specs. I doubt
that Jiri and Scott would be doing this exercise if they didn't
have to.
 
> The "straw-man proposal for OF 2.0" paper has some very
> interesting ideas:
> http://arxiv.org/pdf/1312.1719.pdf
> Sooner or later off-the-shelf NICs will have similar functionality.
> 
> In Linux we already have the bridge, which is a perfect abstraction of
> L2 network functions. OF 1.x has to use a 'tcam' to implement a bridge
> and in-kernel OVS has to fall back to 'mega-flows', but HW has proper
> exact-match tables and HW mac learning,
> so OF 1.x principles just don't fit L2 offloading.

Agreed, but I don't think we should restrict this to L2. As you say,
that problem is solved and we have fdb_add() to take care of the
offload.

What we're facing as minimal requirements are L2-L3 forwarding needs
combined with encap and encryption. Does it have to be implemented
with OF 1.x flows? No. In fact I would love to use eBPF in the
software path and will support you all the way ;-) P4 hardware will
make that even sweeter. But using a flow model does seem kind of
straightforward for allowing offload with today's available HW n-tuple
filters.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 03/12] net: introduce generic switch devices support
       [not found]       ` <1408639283.13073.3.camel-nDn/Rdv9kqW9Jme8/bJn5UCKIB8iOfG2tUK59QYPAWc@public.gmane.org>
@ 2014-08-27  2:45         ` Tom Herbert
  0 siblings, 0 replies; 87+ messages in thread
From: Tom Herbert @ 2014-08-27  2:45 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w, Jason Wang, John Fastabend,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ, Eric Dumazet,
	andy-QlMahl40kYEqcZcGjlUOXw, dev-yBygre7rU0TnMu66kgdUjQ,
	Felix Fietkau, Florian Fainelli, ronye-VPRAkNaXOzVWk0Htik3J/w,
	Jeff Kirsher, Or Gerlitz, Lennert Buytenhek, Jiri Pirko,
	roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR, Jamal Hadi Salim,
	aviadr-VPRAkNaXOzVWk0Htik3J/w, Nicolas Dichtel,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, Neil Horman, Linux Netdev List,
	Stephen Hemminger, Daniel Borkmann, Eric W. Biederman,
	David Miller

On Thu, Aug 21, 2014 at 9:41 AM, Ben Hutchings <ben-/+tVBieCtBitmTQ+vhA3Yw@public.gmane.org> wrote:
> On Thu, 2014-08-21 at 18:18 +0200, Jiri Pirko wrote:
>> The goal of this is to provide a possibility to support various switch
>> chips. Drivers should implement the relevant ndos to do so. For now a
>> couple of ndos are defined:
>> - for getting the physical switch id.
>> - for working with flows.
>>
>> Note that the user can use a random port netdevice to access the switch.
> [...]
>
> Why isn't the switch treated as a real device (not necessarily a net
> device) that's included in the device model and that the port devices
> refer to?
>
+1. In the same way that a software switch (e.g. OVS) is mostly
implemented outside of the core stack, I think a hardware switch should
follow the same model. Besides, if you define this as part of a network
interface then you'll need to resolve unpleasant issues like accounting
for bytes and packets switched (e.g. are these accounted in interface
stats?), what happens if the netdevice is put in promiscuous mode (e.g.
do we need a tap for all switched packets?), etc. Another thing to
consider is that some vendors will undoubtedly implement an OpenFlow
agent in the NIC to manage the switch, possibly even autonomously from
the host -- a host might see the device but not be able to manage it
other than adding interfaces as local ports.

> Ben.
>
> --
> Ben Hutchings
> If at first you don't succeed, you're doing about average.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 03/12] net: introduce generic switch devices support
  2014-08-21 16:18   ` [patch net-next RFC 03/12] net: introduce generic switch devices support Jiri Pirko
                       ` (2 preceding siblings ...)
  2014-08-24 11:46     ` Thomas Graf
@ 2014-08-27 22:19     ` Cong Wang
  3 siblings, 0 replies; 87+ messages in thread
From: Cong Wang @ 2014-08-27 22:19 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, David Miller, Neil Horman, Andy Gospodarek, Thomas Graf,
	Daniel Borkmann, Or Gerlitz, Jesse Gross, Pravin B Shelar, azhou,
	Ben Hutchings, Stephen Hemminger, jeffrey.t.kirsher, vyasevic,
	Cong Wang, john.r.fastabend, Eric Dumazet, Jamal Hadi Salim,
	Scott Feldman, f.fainelli, roopa, linville, dev, jasowang,
	Eric W. Biederman, Nicolas Dichtel, ryazanov.s.a

On Thu, Aug 21, 2014 at 9:18 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> diff --git a/include/linux/switchdev.h b/include/linux/switchdev.h
> new file mode 100644
> index 0000000..ba77a68
> --- /dev/null
> +++ b/include/linux/switchdev.h

It should be in include/net/ instead, since it is never
used outside of networking.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 06/12] net: introduce dummy switch
       [not found]         ` <20140826191420.GC5275-Me9pkO/C/lgvPfuUPAiksl6hYfS7NtTn@public.gmane.org>
@ 2014-08-29  7:00           ` Jiri Pirko
  0 siblings, 0 replies; 87+ messages in thread
From: Jiri Pirko @ 2014-08-29  7:00 UTC (permalink / raw)
  To: Andy Gospodarek
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, andy-QlMahl40kYEqcZcGjlUOXw,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, ronye-VPRAkNaXOzVWk0Htik3J/w,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, ben-/+tVBieCtBitmTQ+vhA3Yw,
	buytenh-OLH4Qvv75CYX/NnBR394Jw,
	roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
	jhs-jkUAjuhPggJWk0Htik3J/w, aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, nhorman-2XuSBdqkA4R54TAoqtyWWQ,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	dborkman-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q

Tue, Aug 26, 2014 at 09:14:20PM CEST, gospo-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR@public.gmane.org wrote:
>On Thu, Aug 21, 2014 at 06:18:59PM +0200, Jiri Pirko wrote:
>> Dummy switch implementation using switchdev interface
>> 
>[...]
>> +	if (!data || !data[IFLA_DYMMYSWPORT_PHYS_SWITCH_ID])
>[...]
>> +	dsp->psid.id_len = nla_len(data[IFLA_DYMMYSWPORT_PHYS_SWITCH_ID]);
>> +	memcpy(dsp->psid.id, nla_data(data[IFLA_DYMMYSWPORT_PHYS_SWITCH_ID]),
>[...]
>> +	[IFLA_DYMMYSWPORT_PHYS_SWITCH_ID] = { .type = NLA_BINARY,
>[...]
>> +	IFLA_DYMMYSWPORT_PHYS_SWITCH_ID,
>I realize this does compile, but I suspect this was a typo?

Fixed. Thanks.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 12/12] rocker: introduce rocker switch driver
  2014-08-23 14:04   ` Thomas Graf
@ 2014-08-29  7:06     ` Jiri Pirko
  0 siblings, 0 replies; 87+ messages in thread
From: Jiri Pirko @ 2014-08-29  7:06 UTC (permalink / raw)
  To: Thomas Graf
  Cc: netdev, davem, nhorman, andy, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet, jhs, sfeldma, f.fainelli, roopa,
	linville, dev, jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a,
	buytenh, aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye

Sat, Aug 23, 2014 at 04:04:50PM CEST, tgraf@suug.ch wrote:
>On 08/21/14 at 06:19pm, Jiri Pirko wrote:
>> This patch introduces the first driver to benefit from the switchdev
>> infrastructure and to implement newly introduced switch ndos. This is a
>> driver for emulated switch chip implemented in qemu:
>> https://github.com/sfeldma/qemu-rocker/
>
>The design looks very clean. I noticed that the TLV API is almost an
>exact dupliate of the Netlink attributes API. Any specific reason for
>not reusing lib/nlattr.c and add what is missing?

Well, the API is almost the same, but the implementation is different.
See for example rocker_tlv_put: it works directly with the desc info,
whereas nla_put works with an skb, which is not convenient for rocker.

Also, struct nlattr is two u16 fields, but struct rocker_tlv is a u32
plus a u16.
^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
  2014-08-26 20:54                                                                       ` Thomas Graf
@ 2014-08-29 14:20                                                                         ` Jamal Hadi Salim
       [not found]                                                                           ` <54008C47.5040503-jkUAjuhPggJWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 87+ messages in thread
From: Jamal Hadi Salim @ 2014-08-29 14:20 UTC (permalink / raw)
  To: Thomas Graf, Alexei Starovoitov
  Cc: Andy Gospodarek, Jiri Pirko, Roopa Prabhu, John Fastabend,
	Scott Feldman, netdev, David Miller, Neil Horman,
	Andy Gospodarek, dborkman, ogerlitz, Jesse Gross, Pravin Shelar,
	Andy Zhou, ben, Stephen Hemminger, jeffrey.t.kirsher, vyasevic,
	Cong Wang, john.r.fastabend, Eric Dumazet, Florian Fainelli,
	John Linville, dev, jasowang

On 08/26/14 16:54, Thomas Graf wrote:
> On 08/26/14 at 01:13pm, Alexei Starovoitov wrote:
>> I think it's important distinction. In-kernel OVS is not OF.
>> It's a networking function that has hard-coded packet parser,
>> N-tuple match and programmable actions.
>> There were times when HW vendors were using OF check-box
>> to sell more chips, but at the end there is not a single HW
>> that is fully OF compliant. OF brand is still around, but
>> OF 2.0 is not tcam+action anymore.
>> Imo trying to standardize HW offload interface based on OF 1.x
>> principles is strange.


I actually have no issues with whatever classifier someone decides
to use. To each their poison. But mandating the specified classifier
as THE CLASSIFIER, as in this case, is where I start taking issue.
I have a few things that I offload to hardware with specialized
classifiers, so I object strongly to the approach this driver has
taken.

cheers,
jamal

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
       [not found]                                                                           ` <54008C47.5040503-jkUAjuhPggJWk0Htik3J/w@public.gmane.org>
@ 2014-09-01  8:13                                                                             ` Simon Horman
       [not found]                                                                               ` <20140901081343.GC12731-IxS8c3vjKQDk1uMJSBkQmQ@public.gmane.org>
  0 siblings, 1 reply; 87+ messages in thread
From: Simon Horman @ 2014-09-01  8:13 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	ronye-VPRAkNaXOzVWk0Htik3J/w, jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ, Eric Dumazet,
	Andy Gospodarek, dev-yBygre7rU0TnMu66kgdUjQ,
	nbd-p3rKhJxN3npAfugRpC6u6w, Florian Fainelli, Andy Gospodarek,
	Shrijeet Mukherjee, John Fastabend,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w, ogerlitz,
	ben-/+tVBieCtBitmTQ+vhA3Yw, buytenh-OLH4Qvv75CYX/NnBR394Jw,
	Jiri Pirko, Roopa Prabhu, aviadr-VPRAkNaXOzVWk0Htik3J/w,
	Nicolas Dichtel, vyasevic-H+wXaHxf7aLQT0dZR+AlfA, Neil Horman,
	netdev, Stephen Hemminger, dborkman, Eric W. Biederman

On Fri, Aug 29, 2014 at 10:20:55AM -0400, Jamal Hadi Salim wrote:
> On 08/26/14 16:54, Thomas Graf wrote:
> >On 08/26/14 at 01:13pm, Alexei Starovoitov wrote:
> >>I think it's important distinction. In-kernel OVS is not OF.
> >>It's a networking function that has hard-coded packet parser,
> >>N-tuple match and programmable actions.
> >>There were times when HW vendors were using OF check-box
> >>to sell more chips, but at the end there is not a single HW
> >>that is fully OF compliant. OF brand is still around, but
> >>OF 2.0 is not tcam+action anymore.
> >>Imo trying to standardize HW offload interface based on OF 1.x
> >>principles is strange.
> 
> 
> I actually have no issues with whatever classifier someone decides
> to use. To each their poison. But mandating the specified classifier
> as THE CLASSIFIER, as in this case, is where I start taking issue.
> I have a few things that I offload to hardware with specialized
> classifiers, so I object strongly to the approach this driver has
> taken.

My reading of this thread is that allowing different classifiers
is not under dispute.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
       [not found]                                                                               ` <20140901081343.GC12731-IxS8c3vjKQDk1uMJSBkQmQ@public.gmane.org>
@ 2014-09-01 16:37                                                                                 ` Jamal Hadi Salim
  2014-09-01 20:28                                                                                   ` Jiri Pirko
  0 siblings, 1 reply; 87+ messages in thread
From: Jamal Hadi Salim @ 2014-09-01 16:37 UTC (permalink / raw)
  To: Simon Horman
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	ronye-VPRAkNaXOzVWk0Htik3J/w, jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ, Eric Dumazet,
	Andy Gospodarek, dev-yBygre7rU0TnMu66kgdUjQ,
	nbd-p3rKhJxN3npAfugRpC6u6w, Florian Fainelli, Andy Gospodarek,
	Shrijeet Mukherjee, John Fastabend,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w, ogerlitz,
	ben-/+tVBieCtBitmTQ+vhA3Yw, buytenh-OLH4Qvv75CYX/NnBR394Jw,
	Jiri Pirko, Roopa Prabhu, aviadr-VPRAkNaXOzVWk0Htik3J/w,
	Nicolas Dichtel, vyasevic-H+wXaHxf7aLQT0dZR+AlfA, Neil Horman,
	netdev, Stephen Hemminger, dborkman, Eric W. Biederman

On 09/01/14 04:13, Simon Horman wrote:
> On Fri, Aug 29, 2014 at 10:20:55AM -0400, Jamal Hadi Salim wrote:

>> I actually have no issues with whatever classifier someone decides
>> to use. To each their poison. But mandating the specified classifier
>> as THE CLASSIFIER, as in this case, is where I start taking issue.
>> I have a few things that I offload to hardware with specialized
>> classifiers, so I object strongly to the approach this driver has
>> taken.
>
> My reading of this thread is that allowing different classifiers
> is not under dispute.


I am not sure how you reached that conclusion by reading this thread;->
But I would be glad if that was the conclusion and I missed it.

cheers,
jamal

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
  2014-09-01 16:37                                                                                 ` Jamal Hadi Salim
@ 2014-09-01 20:28                                                                                   ` Jiri Pirko
  2014-09-02  1:08                                                                                     ` Jamal Hadi Salim
  0 siblings, 1 reply; 87+ messages in thread
From: Jiri Pirko @ 2014-09-01 20:28 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Simon Horman, Thomas Graf, Alexei Starovoitov, Andy Gospodarek,
	Roopa Prabhu, John Fastabend, Scott Feldman, netdev,
	David Miller, Neil Horman, Andy Gospodarek, dborkman, ogerlitz,
	Jesse Gross, Pravin Shelar, Andy Zhou, ben, Stephen Hemminger,
	jeffrey.t.kirsher, vyasevic, Cong Wang, john.r.fastabend,
	Eric Dumazet, Florian Fainelli, John

Mon, Sep 01, 2014 at 06:37:05PM CEST, jhs@mojatatu.com wrote:
>On 09/01/14 04:13, Simon Horman wrote:
>>On Fri, Aug 29, 2014 at 10:20:55AM -0400, Jamal Hadi Salim wrote:
>
>>>I actually have no issues with whatever classifier someone decides
>>>to use. To each their poison. But mandating the specified classifier
>>>as THE CLASSIFIER, as in this case, is where I start taking issue.
>>>I have a few things that I offload to hardware with specialized
>>>classifiers, so I object strongly to the approach this driver has
>>>taken.
>>
>>My reading of this thread is that allowing different classifiers
>>is not under dispute.
>
>
>I am not sure how you reached that conclusion by reading this thread;->
>But i would be glad if that was the conclusion and i missed it.

Jamal, please be assured that no one I know of is against different
classifiers in the future.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
  2014-09-01 20:28                                                                                   ` Jiri Pirko
@ 2014-09-02  1:08                                                                                     ` Jamal Hadi Salim
  0 siblings, 0 replies; 87+ messages in thread
From: Jamal Hadi Salim @ 2014-09-02  1:08 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Simon Horman, Thomas Graf, Alexei Starovoitov, Andy Gospodarek,
	Roopa Prabhu, John Fastabend, Scott Feldman, netdev,
	David Miller, Neil Horman, Andy Gospodarek, dborkman, ogerlitz,
	Jesse Gross, Pravin Shelar, Andy Zhou, ben, Stephen Hemminger,
	jeffrey.t.kirsher, vyasevic, Cong Wang, john.r.fastabend,
	Eric Dumazet, Florian Fainelli, John

On 09/01/14 16:28, Jiri Pirko wrote:

> Jamal, please be ensured that no one I know of is against future
> different classifiers.
>

Ok, glad to hear that.
The patches and/or some of the discussion were not projecting that
view. Even for the flow case, I am pretty sure we are going to
need a few iterations before we settle on a general consensus.

cheers,
jamal

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
       [not found]     ` <20140904090447.GB3176-IxS8c3vjKQDk1uMJSBkQmQ@public.gmane.org>
@ 2014-09-04 16:30       ` Scott Feldman
       [not found]         ` <F4498A89-C1D6-4C5A-A6F0-942015D36B77-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR@public.gmane.org>
  0 siblings, 1 reply; 87+ messages in thread
From: Scott Feldman @ 2014-09-04 16:30 UTC (permalink / raw)
  To: Simon Horman
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, andy-QlMahl40kYEqcZcGjlUOXw,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, ronye-VPRAkNaXOzVWk0Htik3J/w,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, ben-/+tVBieCtBitmTQ+vhA3Yw,
	buytenh-OLH4Qvv75CYX/NnBR394Jw, Jiri Pirko,
	roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
	jhs-jkUAjuhPggJWk0Htik3J/w, aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, nhorman-2XuSBdqkA4R54TAoqtyWWQ,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	dborkman-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q


On Sep 4, 2014, at 2:04 AM, Simon Horman <simon.horman-wFxRvT7yatFl57MIdRCFDg@public.gmane.org> wrote:

> 
> 
> [snip]
> 
> In relation to ports and datapaths it seems to me that the API that
> has been developed accommodates a model where a port may belong to a
> switch device; that this topology is fixed before any API calls are made
> and that all all ports belonging to the same switch belong to the same
> datapath.
> 
> This makes sense in the case of hardware that looks a lot like a switch.
> But I think that other scenarios are possible. For example hardware that
> is able to handle the same abstractions handled by the datapath: datapaths
> may be created or destroyed; vports may be added and removed from datapaths.
> 
> So one might have a piece of hardware that is configured with more than one
> datapath, and its different ports might be associated with different
> datapaths.

I’ve tested multiple datapaths on one switch hardware with the current patch set and it works fine, without the need to push down any datapath id in the API.  It works because a switch port can’t belong to more than one datapath.  Datapaths can be created/destroyed and ports added/removed from datapaths dynamically and the right sw_flows are added/removed to program HW.
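
The invariant described above — a switch port belongs to at most one datapath, so a flow that matches on in_port implicitly identifies exactly one datapath — can be sketched as a toy model. This is illustrative only, not the actual switchdev/OVS code; the class and method names are made up:

```python
# Toy model (not the real switchdev/OVS API): each port belongs to at
# most one datapath, so a flow keyed on in_port maps to one datapath.

class Switch:
    def __init__(self):
        self.dp_of_port = {}  # port number -> datapath id

    def add_port_to_dp(self, port, dp):
        if port in self.dp_of_port:
            raise ValueError("port already belongs to a datapath")
        self.dp_of_port[port] = dp

    def del_port_from_dp(self, port):
        self.dp_of_port.pop(port, None)

    def flow_datapath(self, in_port):
        # A flow matching on in_port resolves to a single datapath,
        # so no datapath id needs to be pushed down in the API.
        return self.dp_of_port[in_port]

sw = Switch()
sw.add_port_to_dp(1, "dp0")
sw.add_port_to_dp(2, "dp0")
sw.add_port_to_dp(3, "dp1")  # a second datapath on the same switch
```

Under these assumptions, ports can be moved between datapaths dynamically (delete, then re-add), and adding a port to a second datapath while it is still a member of the first fails.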

> 
> Or we might have hardware that is able to offload a tunnel vport.
> 
> In short I am thinking in terms of API callbacks to manipulate datapaths
> and vports. Although I have not thought about it in detail, I believe
> that the current model you have could be implemented using such a scheme,
> because the scheme I am suggesting maps to that of the datapath and you
> have implemented your model there.


-scott

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
       [not found]         ` <F4498A89-C1D6-4C5A-A6F0-942015D36B77-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR@public.gmane.org>
@ 2014-09-05  4:08           ` Simon Horman
       [not found]             ` <20140905040810.GB32481-IxS8c3vjKQDk1uMJSBkQmQ@public.gmane.org>
  0 siblings, 1 reply; 87+ messages in thread
From: Simon Horman @ 2014-09-05  4:08 UTC (permalink / raw)
  To: Scott Feldman
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, andy-QlMahl40kYEqcZcGjlUOXw,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, ronye-VPRAkNaXOzVWk0Htik3J/w,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, ben-/+tVBieCtBitmTQ+vhA3Yw,
	buytenh-OLH4Qvv75CYX/NnBR394Jw, Jiri Pirko,
	roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
	jhs-jkUAjuhPggJWk0Htik3J/w, aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, nhorman-2XuSBdqkA4R54TAoqtyWWQ,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	dborkman-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q

On Thu, Sep 04, 2014 at 09:30:45AM -0700, Scott Feldman wrote:
> 
> On Sep 4, 2014, at 2:04 AM, Simon Horman <simon.horman@netronome.com> wrote:
> 
> > 
> > 
> > [snip]
> > 
> > In relation to ports and datapaths it seems to me that the API that
> > has been developed accommodates a model where a port may belong to a
> > switch device; that this topology is fixed before any API calls are made
> > and that all ports belonging to the same switch belong to the same
> > datapath.
> > 
> > This makes sense in the case of hardware that looks a lot like a switch.
> > But I think that other scenarios are possible. For example hardware that
> > is able to handle the same abstractions handled by the datapath: datapaths
> > may be created or destroyed; vports may be added and removed from datapaths.
> > 
> > So one might have a piece of hardware that is configured with more than one
> > datapath, and its different ports might be associated with different
> > datapaths.
> 
> I’ve tested multiple datapaths on one switch hardware with the current patch set and it works fine, without the need to push down any datapath id in the API.  It works because a switch port can’t belong to more than one datapath.  Datapaths can be created/destroyed and ports added/removed from datapaths dynamically and the right sw_flows are added/removed to program HW.

And the flows added to a switch always match the in port, so that
a given flow is only ever for one in-port and thus one datapath?

> > Or we might have hardware that is able to offload a tunnel vport.

I think tunnel vports is still an unsolved part of the larger puzzle.

> > In short I am thinking in terms of API callbacks to manipulate datapaths
> > and vports. Although I have not thought about it in detail, I believe
> > that the current model you have could be implemented using such a scheme,
> > because the scheme I am suggesting maps to that of the datapath and you
> > have implemented your model there.
_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
       [not found]             ` <20140905040810.GB32481-IxS8c3vjKQDk1uMJSBkQmQ@public.gmane.org>
@ 2014-09-05  7:02               ` Scott Feldman
       [not found]                 ` <E3C7797F-081E-484F-918E-937C705B43D6-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR@public.gmane.org>
  0 siblings, 1 reply; 87+ messages in thread
From: Scott Feldman @ 2014-09-05  7:02 UTC (permalink / raw)
  To: Simon Horman
  Cc: Sergey Ryazanov, jasowang-H+wXaHxf7aLQT0dZR+AlfA, John Fastabend,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ, Eric Dumazet,
	Andy Gospodarek, dev-yBygre7rU0TnMu66kgdUjQ, Felix Fietkau,
	Florian Fainelli, ronye-VPRAkNaXOzVWk0Htik3J/w, Jeff Kirsher,
	ogerlitz, Ben Hutchings, Lennert Buytenhek, Jiri Pirko,
	Roopa Prabhu, Jamal Hadi Salim, Aviad Raveh,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w, Vlad Yasevich,
	Neil Horman, netdev, Stephen Hemminger, dborkman,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, David Miller


On Sep 4, 2014, at 9:08 PM, Simon Horman <simon.horman-wFxRvT7yatFl57MIdRCFDg@public.gmane.org> wrote:

> On Thu, Sep 04, 2014 at 09:30:45AM -0700, Scott Feldman wrote:
>> 
>> On Sep 4, 2014, at 2:04 AM, Simon Horman <simon.horman-wFxRvT7yatFl57MIdRCFDg@public.gmane.org> wrote:
>> 
>>> 
>>> 
>>> [snip]
>>> 
>>> In relation to ports and datapaths it seems to me that the API that
>>> has been developed accommodates a model where a port may belong to a
>>> switch device; that this topology is fixed before any API calls are made
>>> and that all ports belonging to the same switch belong to the same
>>> datapath.
>>> 
>>> This makes sense in the case of hardware that looks a lot like a switch.
>>> But I think that other scenarios are possible. For example hardware that
>>> is able to handle the same abstractions handled by the datapath: datapaths
>>> may be created or destroyed; vports may be added and removed from datapaths.
>>> 
>>> So one might have a piece of hardware that is configured with more than one
>>> datapath, and its different ports might be associated with different
>>> datapaths.
>> 
>> I’ve tested multiple datapaths on one switch hardware with the current patch set and it works fine, without the need to push down any datapath id in the API.  It works because a switch port can’t belong to more than one datapath.  Datapaths can be created/destroyed and ports added/removed from datapaths dynamically and the right sw_flows are added/removed to program HW.
> 
> And the flows added to a switch always match the in port, so that
> a given flow is only ever for one in-port and thus one datapath?

Correct, for the particular switch implementation we’re working with.  If another implementation can’t match on in_port then it seems datapath_id may be needed to partition flows.  
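
A hypothetical sketch of that fallback: hardware that cannot match on in_port would need flows tagged with an explicit datapath id to keep the datapaths partitioned. The names below are illustrative, not from the patch set:

```python
# Hypothetical fallback for hardware that can't match on in_port:
# each flow carries an explicit datapath id, so flows from different
# datapaths on the same switch stay partitioned.

class FlowTable:
    def __init__(self):
        self.flows = []  # list of (datapath_id, match, action)

    def add_flow(self, datapath_id, match, action):
        self.flows.append((datapath_id, match, action))

    def flows_for_dp(self, datapath_id):
        # Only flows tagged with this datapath id are visible to it.
        return [(m, a) for d, m, a in self.flows if d == datapath_id]

tbl = FlowTable()
# The same match can exist in two datapaths without colliding:
tbl.add_flow("dp0", {"eth_dst": "00:11:22:33:44:55"}, "output:2")
tbl.add_flow("dp1", {"eth_dst": "00:11:22:33:44:55"}, "output:7")
```

With in_port in the match key (as in the switch discussed above) the tag is redundant; without it, the tag is what keeps identical matches in different datapaths apart.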


>>> Or we might have hardware that is able to offload a tunnel vport.
> 
> I think tunnel vports is still an unsolved part of the larger puzzle.

Agreed, offloading tunnel vports is TBD.  The current implementation only looks at VLAN vports so far.

> 
>>> In short I am thinking in terms of API callbacks to manipulate datapaths
>>> and vports. Although I have not thought about it in detail, I believe
>>> that the current model you have could be implemented using such a scheme,
>>> because the scheme I am suggesting maps to that of the datapath and you
>>> have implemented your model there.


-scott

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
       [not found]                 ` <E3C7797F-081E-484F-918E-937C705B43D6-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR@public.gmane.org>
@ 2014-09-05 10:46                   ` Jamal Hadi Salim
  2014-09-08  0:02                   ` Simon Horman
  1 sibling, 0 replies; 87+ messages in thread
From: Jamal Hadi Salim @ 2014-09-05 10:46 UTC (permalink / raw)
  To: Scott Feldman, Simon Horman
  Cc: Sergey Ryazanov, jasowang-H+wXaHxf7aLQT0dZR+AlfA, John Fastabend,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ, Eric Dumazet,
	Andy Gospodarek, dev-yBygre7rU0TnMu66kgdUjQ, Felix Fietkau,
	Florian Fainelli, ronye-VPRAkNaXOzVWk0Htik3J/w, Jeff Kirsher,
	ogerlitz, Ben Hutchings, Lennert Buytenhek, Jiri Pirko,
	Roopa Prabhu, Aviad Raveh,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w, Vlad Yasevich,
	Neil Horman, netdev, Stephen Hemminger, dborkman,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, David Miller

On 09/05/14 03:02, Scott Feldman wrote:

>> On Thu, Sep 04, 2014 at 09:30:45AM -0700, Scott Feldman wrote:
>>>

> Correct, for the particular switch implementation we’re working with.

Do you have L2/3 working with this interface on said switch?
I am interested.

cheers,
jamal

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload
       [not found]                 ` <E3C7797F-081E-484F-918E-937C705B43D6-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR@public.gmane.org>
  2014-09-05 10:46                   ` Jamal Hadi Salim
@ 2014-09-08  0:02                   ` Simon Horman
  1 sibling, 0 replies; 87+ messages in thread
From: Simon Horman @ 2014-09-08  0:02 UTC (permalink / raw)
  To: Scott Feldman
  Cc: Sergey Ryazanov, jasowang-H+wXaHxf7aLQT0dZR+AlfA, John Fastabend,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ, Eric Dumazet,
	Andy Gospodarek, dev-yBygre7rU0TnMu66kgdUjQ, Felix Fietkau,
	Florian Fainelli, ronye-VPRAkNaXOzVWk0Htik3J/w, Jeff Kirsher,
	ogerlitz, Ben Hutchings, Lennert Buytenhek, Jiri Pirko,
	Roopa Prabhu, Jamal Hadi Salim, Aviad Raveh,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w, Vlad Yasevich,
	Neil Horman, netdev, Stephen Hemminger, dborkman,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, David Miller

On Fri, Sep 05, 2014 at 12:02:03AM -0700, Scott Feldman wrote:
> 
> On Sep 4, 2014, at 9:08 PM, Simon Horman <simon.horman@netronome.com> wrote:
> 
> > On Thu, Sep 04, 2014 at 09:30:45AM -0700, Scott Feldman wrote:
> >> 
> >> On Sep 4, 2014, at 2:04 AM, Simon Horman <simon.horman@netronome.com> wrote:
> >> 
> >>> 
> >>> 
> >>> [snip]
> >>> 
> >>> In relation to ports and datapaths it seems to me that the API that
> >>> has been developed accommodates a model where a port may belong to a
> >>> switch device; that this topology is fixed before any API calls are made
> >>> and that all ports belonging to the same switch belong to the same
> >>> datapath.
> >>> 
> >>> This makes sense in the case of hardware that looks a lot like a switch.
> >>> But I think that other scenarios are possible. For example hardware that
> >>> is able to handle the same abstractions handled by the datapath: datapaths
> >>> may be created or destroyed; vports may be added and removed from datapaths.
> >>> 
> >>> So one might have a piece of hardware that is configured with more than one
> >>> datapath, and its different ports might be associated with different
> >>> datapaths.
> >> 
> >> I’ve tested multiple datapaths on one switch hardware with the current patch set and it works fine, without the need to push down any datapath id in the API.  It works because a switch port can’t belong to more than one datapath.  Datapaths can be created/destroyed and ports added/removed from datapaths dynamically and the right sw_flows are added/removed to program HW.
> > 
> > And the flows added to a switch always match the in port, so that
> > a given flow is only ever for one in-port and thus one datapath?
> 
> Correct, for the particular switch implementation we’re working with.  If
> another implementation can’t match on in_port then it seems datapath_id
> may be needed to partition flows.

Thanks, I understand and agree.

> >>> Or we might have hardware that is able to offload a tunnel vport.
> > 
> > I think tunnel vports is still an unsolved part of the larger puzzle.
> 
> Agreed, offloading tunnel vports is TBD.  The current implementation only looks at VLAN vports so far.
> 
> > 
> >>> In short I am thinking in terms of API callbacks to manipulate datapaths
> >>> and vports. Although I have not thought about it in detail, I believe
> >>> that the current model you have could be implemented using such a scheme,
> >>> because the scheme I am suggesting maps to that of the datapath and you
> >>> have implemented your model there.
> 
> 
> -scott
> 
> 
> 

^ permalink raw reply	[flat|nested] 87+ messages in thread

end of thread, other threads:[~2014-09-08  0:02 UTC | newest]

Thread overview: 87+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-21 16:18 [patch net-next RFC 00/12] introduce rocker switch driver with openvswitch hardware accelerated datapath Jiri Pirko
2014-08-21 16:18 ` [patch net-next RFC 02/12] net: rename netdev_phys_port_id to more generic name Jiri Pirko
     [not found]   ` <1408637945-10390-3-git-send-email-jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
2014-08-26 12:23     ` Or Gerlitz
     [not found]       ` <53FC7C3C.3090901-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2014-08-26 14:10         ` Jiri Pirko
2014-08-26 17:14         ` Stephen Hemminger
2014-08-21 16:18 ` [patch net-next RFC 04/12] rtnl: expose physical switch id for particular device Jiri Pirko
     [not found]   ` <1408637945-10390-5-git-send-email-jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
2014-08-22 19:08     ` John Fastabend
     [not found]       ` <53F79537.20207-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2014-08-26  8:32         ` Jiri Pirko
2014-08-21 16:18 ` [patch net-next RFC 05/12] net-sysfs: " Jiri Pirko
     [not found] ` <1408637945-10390-1-git-send-email-jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
2014-08-21 16:18   ` [patch net-next RFC 01/12] openvswitch: split flow structures into ovs specific and generic ones Jiri Pirko
2014-08-21 16:18   ` [patch net-next RFC 03/12] net: introduce generic switch devices support Jiri Pirko
2014-08-21 16:41     ` Ben Hutchings
2014-08-21 17:03       ` Jiri Pirko
     [not found]       ` <1408639283.13073.3.camel-nDn/Rdv9kqW9Jme8/bJn5UCKIB8iOfG2tUK59QYPAWc@public.gmane.org>
2014-08-27  2:45         ` Tom Herbert
     [not found]     ` <1408637945-10390-4-git-send-email-jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
2014-08-21 17:05       ` Florian Fainelli
     [not found]         ` <CAGVrzcYtnpcP4pfCJ0GSya01LTk0WwbSV1f+voF2K=S5CR3Arg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-08-22 12:42           ` Jamal Hadi Salim
2014-08-22 12:56             ` Jiri Pirko
2014-08-22 19:14               ` John Fastabend
     [not found]                 ` <53F7969C.1060509-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2014-08-22 23:12                   ` Scott Feldman
     [not found]               ` <20140822125655.GB1916-6KJVSR23iU488b5SBfVpbw@public.gmane.org>
2014-08-23  1:02                 ` Florian Fainelli
     [not found]                   ` <CAGVrzcZS=Y2stxSNMfVjWTpPT8GoDOpOD9tExnDnoF0jj_owoQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-08-23  9:17                     ` Jiri Pirko
2014-08-24 11:46     ` Thomas Graf
     [not found]       ` <20140824114605.GC32741-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
2014-08-26  8:34         ` Jiri Pirko
2014-08-27 22:19     ` Cong Wang
2014-08-21 16:18   ` [patch net-next RFC 06/12] net: introduce dummy switch Jiri Pirko
     [not found]     ` <1408637945-10390-7-git-send-email-jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
2014-08-26 19:14       ` Andy Gospodarek
     [not found]         ` <20140826191420.GC5275-Me9pkO/C/lgvPfuUPAiksl6hYfS7NtTn@public.gmane.org>
2014-08-29  7:00           ` Jiri Pirko
2014-08-21 16:19 ` [patch net-next RFC 07/12] dsa: implement ndo_swdev_get_id Jiri Pirko
2014-08-21 16:38   ` Ben Hutchings
2014-08-21 16:56   ` Florian Fainelli
     [not found]     ` <CAGVrzcbs1yGb5RW++XZ=2PFsqUjZGVGfWx5=QQYcEX6x4WOq9Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-08-21 17:06       ` Jiri Pirko
     [not found]         ` <20140821170645.GB10633-6KJVSR23iU5sFDB2n11ItA@public.gmane.org>
2014-08-21 17:12           ` Florian Fainelli
     [not found]             ` <CAGVrzcb=vkqPw2LUc4YO4Bs-eady2=1uN-jkG=kW2RnGx=24PQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-08-22  9:05               ` David Laight
2014-08-23 11:33           ` Eric W. Biederman
2014-08-21 16:19 ` [patch net-next RFC 08/12] net: introduce netdev_phys_item_ids_match helper Jiri Pirko
2014-08-21 16:19 ` [patch net-next RFC 09/12] openvswitch: introduce vport_op get_netdev Jiri Pirko
2014-08-21 16:19 ` [patch net-next RFC 10/12] openvswitch: add support for datapath hardware offload Jiri Pirko
     [not found]   ` <1408637945-10390-11-git-send-email-jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
2014-08-22 19:39     ` John Fastabend
     [not found]       ` <53F79C54.5050701-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2014-08-22 22:53         ` Scott Feldman
     [not found]           ` <464DB0A8-0073-4CE0-9483-0F36B73A53A1-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR@public.gmane.org>
2014-08-23  9:24             ` Jiri Pirko
2014-08-23 14:51               ` Thomas Graf
     [not found]                 ` <20140823145126.GB24116-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
2014-08-23 17:09                   ` John Fastabend
     [not found]                     ` <53F8CAB9.8080407-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2014-08-24 11:32                       ` Thomas Graf
2014-08-24  1:53             ` Jamal Hadi Salim
2014-08-24 11:12               ` Thomas Graf
     [not found]                 ` <20140824111218.GA32741-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
2014-08-24 15:15                   ` Jamal Hadi Salim
     [not found]                     ` <53FA01AC.10507-jkUAjuhPggJWk0Htik3J/w@public.gmane.org>
2014-08-25  2:24                       ` Scott Feldman
     [not found]                         ` <A67C7591-19BF-4431-9119-F61361F5E618-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR@public.gmane.org>
2014-08-25  2:42                           ` John Fastabend
     [not found]                             ` <53FAA2A2.7070801-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2014-08-25 13:53                               ` Jamal Hadi Salim
2014-08-25 14:17                                 ` Thomas Graf
     [not found]                                   ` <20140825141754.GA30140-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
2014-08-25 16:15                                     ` Jamal Hadi Salim
     [not found]                                       ` <53FB6122.2040901-jkUAjuhPggJWk0Htik3J/w@public.gmane.org>
2014-08-25 22:50                                         ` Thomas Graf
     [not found]                                           ` <20140825225057.GD30140-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
2014-08-26 13:50                                             ` Roopa Prabhu
2014-08-26 14:06                                               ` Jiri Pirko
2014-08-26 14:58                                                 ` Jamal Hadi Salim
2014-08-26 15:22                                                   ` Jiri Pirko
     [not found]                                                     ` <20140826152217.GA1843-6KJVSR23iU5sFDB2n11ItA@public.gmane.org>
2014-08-26 15:29                                                       ` Jamal Hadi Salim
2014-08-26 15:44                                                         ` Jiri Pirko
     [not found]                                                           ` <20140826154459.GB1843-6KJVSR23iU5sFDB2n11ItA@public.gmane.org>
2014-08-26 15:54                                                             ` Andy Gospodarek
     [not found]                                                               ` <20140826155426.GA5275-Me9pkO/C/lgvPfuUPAiksl6hYfS7NtTn@public.gmane.org>
2014-08-26 16:19                                                                 ` Thomas Graf
     [not found]                                                                   ` <20140826161956.GA15316-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
2014-08-26 18:41                                                                     ` Andy Gospodarek
2014-08-26 20:13                                                                     ` Alexei Starovoitov
2014-08-26 20:54                                                                       ` Thomas Graf
2014-08-29 14:20                                                                         ` Jamal Hadi Salim
     [not found]                                                                           ` <54008C47.5040503-jkUAjuhPggJWk0Htik3J/w@public.gmane.org>
2014-09-01  8:13                                                                             ` Simon Horman
     [not found]                                                                               ` <20140901081343.GC12731-IxS8c3vjKQDk1uMJSBkQmQ@public.gmane.org>
2014-09-01 16:37                                                                                 ` Jamal Hadi Salim
2014-09-01 20:28                                                                                   ` Jiri Pirko
2014-09-02  1:08                                                                                     ` Jamal Hadi Salim
2014-08-26 15:01                                               ` Scott Feldman
     [not found]                                                 ` <D891A8EC-548C-453E-AC70-8431DAC4B8C4-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR@public.gmane.org>
2014-08-26 15:12                                                   ` Jamal Hadi Salim
2014-08-26 14:26                                           ` Jamal Hadi Salim
2014-08-25 13:42                           ` Jamal Hadi Salim
2014-08-25 14:54                     ` Thomas Graf
     [not found]                       ` <20140825145449.GB30140-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
2014-08-25 16:48                         ` Jamal Hadi Salim
2014-08-25 22:11                           ` Thomas Graf
2014-08-26 14:00                             ` Jamal Hadi Salim
2014-08-26 14:20                               ` Thomas Graf
     [not found]   ` <20140904090447.GB3176@vergenet.net>
     [not found]     ` <20140904090447.GB3176-IxS8c3vjKQDk1uMJSBkQmQ@public.gmane.org>
2014-09-04 16:30       ` Scott Feldman
     [not found]         ` <F4498A89-C1D6-4C5A-A6F0-942015D36B77-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR@public.gmane.org>
2014-09-05  4:08           ` Simon Horman
     [not found]             ` <20140905040810.GB32481-IxS8c3vjKQDk1uMJSBkQmQ@public.gmane.org>
2014-09-05  7:02               ` Scott Feldman
     [not found]                 ` <E3C7797F-081E-484F-918E-937C705B43D6-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR@public.gmane.org>
2014-09-05 10:46                   ` Jamal Hadi Salim
2014-09-08  0:02                   ` Simon Horman
2014-08-21 16:19 ` [patch net-next RFC 11/12] sw_flow: add misc section to key with in_port_ifindex field Jiri Pirko
2014-08-21 16:19 ` [patch net-next RFC 12/12] rocker: introduce rocker switch driver Jiri Pirko
2014-08-21 17:19   ` Florian Fainelli
2014-08-23 14:04   ` Thomas Graf
2014-08-29  7:06     ` Jiri Pirko
