netdev.vger.kernel.org archive mirror
* [patch net-next v2 0/9] introduce rocker switch driver with hardware accelerated datapath api
@ 2014-09-19 13:49 Jiri Pirko
  2014-09-19 13:49 ` [patch net-next v2 1/9] net: rename netdev_phys_port_id to more generic name Jiri Pirko
                   ` (8 more replies)
  0 siblings, 9 replies; 67+ messages in thread
From: Jiri Pirko @ 2014-09-19 13:49 UTC
  To: netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, andy-QlMahl40kYEqcZcGjlUOXw,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, ronye-VPRAkNaXOzVWk0Htik3J/w,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, ben-/+tVBieCtBitmTQ+vhA3Yw,
	buytenh-OLH4Qvv75CYX/NnBR394Jw,
	alexander.h.duyck-ral2JQCrhuEAvxtiuMwx3w,
	simon.horman-wFxRvT7yatFl57MIdRCFDg,
	roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
	jhs-jkUAjuhPggJWk0Htik3J/w, aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, nhorman-2XuSBdqkA4R54TAoqtyWWQ,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	dborkman-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q

This patchset can be divided into 3 main sections:
- introduce switchdev api for implementing switch drivers
- introduce switchdev generic netlink api for userspace manipulation
- introduce rocker switch driver which implements switchdev api

More info in separate patches.

So now it is possible to create an ovs bridge over rocker
switch ports. The ovs daemon can decide which flows to offload to hw
and use the switchdev genl api to tell the driver. For now the easiest
way is to do it by hand via the "sw" tool (https://github.com/jpirko/switchdev).

v1->v2 changes:
- removed DSA phys switch id implementation for now per Florian's request
- introduced own match key structure so the internal ovs flow struct
  stays untouched
- extended the flow match in order to easily add more match types (hope that
  Jamal will like this :)
- per ovs maintainers' request, removed ovs offload bits - that will be handled
  in ovs userspace using switchdev genl interface.
- added switchdev features so that driver can indicate what it supports
- little renames/fixes here and there

RFC->v1 changes:
- moved include/linux/*.h -> include/net/
- moved net/core/switchdev.c -> net/switchdev/
- moved drivers/net/rocker.* -> drivers/net/ethernet/rocker/
- fixed a couple of little bugs and typos
- in dsa the switch id is generated randomly
- fixed rocker schedule in atomic context bug in rocker_port_set_rx_mode
- added switchdev Netlink API

Jiri Pirko (9):
  net: rename netdev_phys_port_id to more generic name
  net: introduce generic switch devices support
  rtnl: expose physical switch id for particular device
  net-sysfs: expose physical switch id for particular device
  net: introduce dummy switch
  switchdev: add basic support for flow matching and actions
  switchdev: add swdev features
  switchdev: introduce Netlink API
  rocker: introduce rocker switch driver

 Documentation/networking/switchdev.txt           |   53 +
 MAINTAINERS                                      |   14 +
 drivers/net/Kconfig                              |    7 +
 drivers/net/Makefile                             |    1 +
 drivers/net/dummyswitch.c                        |  130 +
 drivers/net/ethernet/Kconfig                     |    1 +
 drivers/net/ethernet/Makefile                    |    1 +
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c |    2 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c      |    2 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c   |    2 +-
 drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c |    2 +-
 drivers/net/ethernet/rocker/Kconfig              |   29 +
 drivers/net/ethernet/rocker/Makefile             |    5 +
 drivers/net/ethernet/rocker/rocker.c             | 3561 ++++++++++++++++++++++
 drivers/net/ethernet/rocker/rocker.h             |  465 +++
 include/linux/netdevice.h                        |   49 +-
 include/net/switchdev.h                          |  160 +
 include/uapi/linux/if_link.h                     |   10 +
 include/uapi/linux/switchdev.h                   |  113 +
 net/Kconfig                                      |    1 +
 net/Makefile                                     |    3 +
 net/core/dev.c                                   |    2 +-
 net/core/net-sysfs.c                             |   26 +-
 net/core/rtnetlink.c                             |   30 +-
 net/switchdev/Kconfig                            |   20 +
 net/switchdev/Makefile                           |    6 +
 net/switchdev/switchdev.c                        |  188 ++
 net/switchdev/switchdev_netlink.c                |  441 +++
 28 files changed, 5307 insertions(+), 17 deletions(-)
 create mode 100644 Documentation/networking/switchdev.txt
 create mode 100644 drivers/net/dummyswitch.c
 create mode 100644 drivers/net/ethernet/rocker/Kconfig
 create mode 100644 drivers/net/ethernet/rocker/Makefile
 create mode 100644 drivers/net/ethernet/rocker/rocker.c
 create mode 100644 drivers/net/ethernet/rocker/rocker.h
 create mode 100644 include/net/switchdev.h
 create mode 100644 include/uapi/linux/switchdev.h
 create mode 100644 net/switchdev/Kconfig
 create mode 100644 net/switchdev/Makefile
 create mode 100644 net/switchdev/switchdev.c
 create mode 100644 net/switchdev/switchdev_netlink.c

-- 
1.9.3


* [patch net-next v2 1/9] net: rename netdev_phys_port_id to more generic name
  2014-09-19 13:49 [patch net-next v2 0/9] introduce rocker switch driver with hardware accelerated datapath api Jiri Pirko
@ 2014-09-19 13:49 ` Jiri Pirko
       [not found]   ` <1411134590-4586-2-git-send-email-jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
  2014-09-19 13:49 ` [patch net-next v2 3/9] rtnl: expose physical switch id for particular device Jiri Pirko
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 67+ messages in thread
From: Jiri Pirko @ 2014-09-19 13:49 UTC
  To: netdev
  Cc: davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet, jhs, sfeldma, f.fainelli, roopa,
	linville, dev, jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a,
	buytenh, aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye,
	simon.horman, alexander.h.duyck

So this can be reused for identification of other "items" as well.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c |  2 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c      |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c   |  2 +-
 drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c |  2 +-
 include/linux/netdevice.h                        | 16 ++++++++--------
 net/core/dev.c                                   |  2 +-
 net/core/net-sysfs.c                             |  2 +-
 net/core/rtnetlink.c                             |  6 +++---
 8 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index 32e2444..770e3d0 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -12425,7 +12425,7 @@ static int bnx2x_validate_addr(struct net_device *dev)
 }
 
 static int bnx2x_get_phys_port_id(struct net_device *netdev,
-				  struct netdev_phys_port_id *ppid)
+				  struct netdev_phys_item_id *ppid)
 {
 	struct bnx2x *bp = netdev_priv(netdev);
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index ed5f1c1..ab2f3e1 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -7341,7 +7341,7 @@ static void i40e_del_vxlan_port(struct net_device *netdev,
 
 #endif
 static int i40e_get_phys_port_id(struct net_device *netdev,
-				 struct netdev_phys_port_id *ppid)
+				 struct netdev_phys_item_id *ppid)
 {
 	struct i40e_netdev_priv *np = netdev_priv(netdev);
 	struct i40e_pf *pf = np->vsi->back;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index abddcf8..f45e161 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -2249,7 +2249,7 @@ static int mlx4_en_set_vf_link_state(struct net_device *dev, int vf, int link_st
 
 #define PORT_ID_BYTE_LEN 8
 static int mlx4_en_get_phys_port_id(struct net_device *dev,
-				    struct netdev_phys_port_id *ppid)
+				    struct netdev_phys_item_id *ppid)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	struct mlx4_dev *mdev = priv->mdev->dev;
diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
index f5e29f7..6e514d2 100644
--- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
+++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
@@ -460,7 +460,7 @@ static void qlcnic_82xx_cancel_idc_work(struct qlcnic_adapter *adapter)
 }
 
 static int qlcnic_get_phys_port_id(struct net_device *netdev,
-				   struct netdev_phys_port_id *ppid)
+				   struct netdev_phys_item_id *ppid)
 {
 	struct qlcnic_adapter *adapter = netdev_priv(netdev);
 	struct qlcnic_hardware_context *ahw = adapter->ahw;
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 28d4378..13d765f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -739,13 +739,13 @@ struct netdev_fcoe_hbainfo {
 };
 #endif
 
-#define MAX_PHYS_PORT_ID_LEN 32
+#define MAX_PHYS_ITEM_ID_LEN 32
 
-/* This structure holds a unique identifier to identify the
- * physical port used by a netdevice.
+/* This structure holds a unique identifier to identify some
+ * physical item (port for example) used by a netdevice.
  */
-struct netdev_phys_port_id {
-	unsigned char id[MAX_PHYS_PORT_ID_LEN];
+struct netdev_phys_item_id {
+	unsigned char id[MAX_PHYS_ITEM_ID_LEN];
 	unsigned char id_len;
 };
 
@@ -961,7 +961,7 @@ typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
  *	USB_CDC_NOTIFY_NETWORK_CONNECTION) should NOT implement this function.
  *
  * int (*ndo_get_phys_port_id)(struct net_device *dev,
- *			       struct netdev_phys_port_id *ppid);
+ *			       struct netdev_phys_item_id *ppid);
  *	Called to get ID of physical port of this device. If driver does
  *	not implement this, it is assumed that the hw is not able to have
  *	multiple net devices on single physical port.
@@ -1129,7 +1129,7 @@ struct net_device_ops {
 	int			(*ndo_change_carrier)(struct net_device *dev,
 						      bool new_carrier);
 	int			(*ndo_get_phys_port_id)(struct net_device *dev,
-							struct netdev_phys_port_id *ppid);
+							struct netdev_phys_item_id *ppid);
 	void			(*ndo_add_vxlan_port)(struct  net_device *dev,
 						      sa_family_t sa_family,
 						      __be16 port);
@@ -2820,7 +2820,7 @@ void dev_set_group(struct net_device *, int);
 int dev_set_mac_address(struct net_device *, struct sockaddr *);
 int dev_change_carrier(struct net_device *, bool new_carrier);
 int dev_get_phys_port_id(struct net_device *dev,
-			 struct netdev_phys_port_id *ppid);
+			 struct netdev_phys_item_id *ppid);
 struct sk_buff *validate_xmit_skb(struct sk_buff *skb, struct net_device *dev);
 struct sk_buff *dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 				    struct netdev_queue *txq, int *ret);
diff --git a/net/core/dev.c b/net/core/dev.c
index e916ba8..38ce5b3 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5768,7 +5768,7 @@ EXPORT_SYMBOL(dev_change_carrier);
  *	Get device physical port ID
  */
 int dev_get_phys_port_id(struct net_device *dev,
-			 struct netdev_phys_port_id *ppid)
+			 struct netdev_phys_item_id *ppid)
 {
 	const struct net_device_ops *ops = dev->netdev_ops;
 
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 9dd0669..55dc4da 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -387,7 +387,7 @@ static ssize_t phys_port_id_show(struct device *dev,
 		return restart_syscall();
 
 	if (dev_isalive(netdev)) {
-		struct netdev_phys_port_id ppid;
+		struct netdev_phys_item_id ppid;
 
 		ret = dev_get_phys_port_id(netdev, &ppid);
 		if (!ret)
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index a688268..1087c6d 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -868,7 +868,7 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + rtnl_port_size(dev, ext_filter_mask) /* IFLA_VF_PORTS + IFLA_PORT_SELF */
 	       + rtnl_link_get_size(dev) /* IFLA_LINKINFO */
 	       + rtnl_link_get_af_size(dev) /* IFLA_AF_SPEC */
-	       + nla_total_size(MAX_PHYS_PORT_ID_LEN); /* IFLA_PHYS_PORT_ID */
+	       + nla_total_size(MAX_PHYS_ITEM_ID_LEN); /* IFLA_PHYS_PORT_ID */
 }
 
 static int rtnl_vf_ports_fill(struct sk_buff *skb, struct net_device *dev)
@@ -952,7 +952,7 @@ static int rtnl_port_fill(struct sk_buff *skb, struct net_device *dev,
 static int rtnl_phys_port_id_fill(struct sk_buff *skb, struct net_device *dev)
 {
 	int err;
-	struct netdev_phys_port_id ppid;
+	struct netdev_phys_item_id ppid;
 
 	err = dev_get_phys_port_id(dev, &ppid);
 	if (err) {
@@ -1196,7 +1196,7 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
 	[IFLA_PROMISCUITY]	= { .type = NLA_U32 },
 	[IFLA_NUM_TX_QUEUES]	= { .type = NLA_U32 },
 	[IFLA_NUM_RX_QUEUES]	= { .type = NLA_U32 },
-	[IFLA_PHYS_PORT_ID]	= { .type = NLA_BINARY, .len = MAX_PHYS_PORT_ID_LEN },
+	[IFLA_PHYS_PORT_ID]	= { .type = NLA_BINARY, .len = MAX_PHYS_ITEM_ID_LEN },
 	[IFLA_CARRIER_CHANGES]	= { .type = NLA_U32 },  /* ignored */
 };
 
-- 
1.9.3


* [patch net-next v2 2/9] net: introduce generic switch devices support
       [not found] ` <1411134590-4586-1-git-send-email-jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
@ 2014-09-19 13:49   ` Jiri Pirko
  2014-09-19 14:15   ` [patch net-next v2 0/9] introduce rocker switch driver with hardware accelerated datapath api David Laight
  1 sibling, 0 replies; 67+ messages in thread
From: Jiri Pirko @ 2014-09-19 13:49 UTC
  To: netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, andy-QlMahl40kYEqcZcGjlUOXw,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, ronye-VPRAkNaXOzVWk0Htik3J/w,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, ben-/+tVBieCtBitmTQ+vhA3Yw,
	buytenh-OLH4Qvv75CYX/NnBR394Jw,
	alexander.h.duyck-ral2JQCrhuEAvxtiuMwx3w,
	simon.horman-wFxRvT7yatFl57MIdRCFDg,
	roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
	jhs-jkUAjuhPggJWk0Htik3J/w, aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, nhorman-2XuSBdqkA4R54TAoqtyWWQ,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	dborkman-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q

The goal of this is to make it possible to support various switch
chips. Drivers should implement the relevant ndos to do so. For now
there is only one ndo defined:
- ndo_swdev_id_get, for getting the physical switch id.

Note that the user can use an arbitrary port netdevice to access the switch.

Signed-off-by: Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
---
 Documentation/networking/switchdev.txt | 53 ++++++++++++++++++++++++++++++++++
 MAINTAINERS                            |  7 +++++
 include/linux/netdevice.h              | 12 ++++++++
 include/net/switchdev.h                | 29 +++++++++++++++++++
 net/Kconfig                            |  1 +
 net/Makefile                           |  3 ++
 net/switchdev/Kconfig                  |  9 ++++++
 net/switchdev/Makefile                 |  5 ++++
 net/switchdev/switchdev.c              | 32 ++++++++++++++++++++
 9 files changed, 151 insertions(+)
 create mode 100644 Documentation/networking/switchdev.txt
 create mode 100644 include/net/switchdev.h
 create mode 100644 net/switchdev/Kconfig
 create mode 100644 net/switchdev/Makefile
 create mode 100644 net/switchdev/switchdev.c
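
For illustration only (not part of the patch): a minimal userspace sketch of
the optional-ndo dispatch described above. The structs are cut-down stand-ins
for the kernel types, and someswitch_swdev_id_get, someswitch_port_ops,
plain_nic_ops and the chip ID bytes are hypothetical names and values invented
for the example. It only shows that a port netdevice reports an ID while a
device without the ndo yields -EOPNOTSUPP.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MAX_PHYS_ITEM_ID_LEN 32

/* Userspace stand-ins for the kernel types touched by this patch. */
struct netdev_phys_item_id {
	unsigned char id[MAX_PHYS_ITEM_ID_LEN];
	unsigned char id_len;
};

struct net_device;

struct net_device_ops {
	int (*ndo_swdev_id_get)(struct net_device *dev,
				struct netdev_phys_item_id *psid);
};

struct net_device {
	const struct net_device_ops *netdev_ops;
};

/* Same dispatch logic as swdev_id_get() in net/switchdev/switchdev.c. */
static int swdev_id_get(struct net_device *dev,
			struct netdev_phys_item_id *psid)
{
	const struct net_device_ops *ops = dev->netdev_ops;

	if (!ops->ndo_swdev_id_get)
		return -EOPNOTSUPP;
	return ops->ndo_swdev_id_get(dev, psid);
}

/* Hypothetical driver callback: every port reports the same chip ID. */
static int someswitch_swdev_id_get(struct net_device *dev,
				   struct netdev_phys_item_id *psid)
{
	static const unsigned char chip_id[] = { 0x00, 0x11, 0x22, 0x33 };

	(void)dev;
	assert(sizeof(chip_id) <= MAX_PHYS_ITEM_ID_LEN);
	memcpy(psid->id, chip_id, sizeof(chip_id));
	psid->id_len = sizeof(chip_id);
	return 0;
}

/* A switch port sets the ndo; an ordinary NIC leaves it unset. */
static const struct net_device_ops someswitch_port_ops = {
	.ndo_swdev_id_get = someswitch_swdev_id_get,
};

static const struct net_device_ops plain_nic_ops = { 0 };
```

Callers such as rtnetlink can treat -EOPNOTSUPP as "this is not a switch
port" and carry on, which is why the ndo can stay optional.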

diff --git a/Documentation/networking/switchdev.txt b/Documentation/networking/switchdev.txt
new file mode 100644
index 0000000..435746a
--- /dev/null
+++ b/Documentation/networking/switchdev.txt
@@ -0,0 +1,53 @@
+Switch device drivers HOWTO
+===========================
+
+First let's describe the topology a bit. Imagine the following example:
+
+       +----------------------------+    +---------------+
+       |     SOME switch chip       |    |      CPU      |
+       +----------------------------+    +---------------+
+       port1 port2 port3 port4 MNGMNT    |     PCI-E     |
+         |     |     |     |     |       +---------------+
+        PHY   PHY    |     |     |         |  NIC0 NIC1
+                     |     |     |         |   |    |
+                     |     |     +- PCI-E -+   |    |
+                     |     +------- MII -------+    |
+                     +------------- MII ------------+
+
+In this example, there are two independent lines between the switch silicon
+and CPU. NIC0 and NIC1 drivers are not aware of a switch presence. They are
+separate from the switch driver. SOME switch chip is managed by a driver
+via PCI-E device MNGMNT. Note that MNGMNT device, NIC0 and NIC1 may be
+connected to some other type of bus.
+
+Now, for the previous example, here is the representation in the kernel:
+
+       +----------------------------+    +---------------+
+       |     SOME switch chip       |    |      CPU      |
+       +----------------------------+    +---------------+
+       sw0p0 sw0p1 sw0p2 sw0p3 MNGMNT    |     PCI-E     |
+         |     |     |     |     |       +---------------+
+        PHY   PHY    |     |     |         |  eth0 eth1
+                     |     |     |         |   |    |
+                     |     |     +- PCI-E -+   |    |
+                     |     +------- MII -------+    |
+                     +------------- MII ------------+
+
+Let's call the example switch driver for SOME switch chip "SOMEswitch". This
+driver takes care of PCI-E device MNGMNT. There is a netdevice instance sw0pX
+created for each port of a switch. These netdevices are instances
+of "SOMEswitch" driver. sw0pX netdevices serve as a "representation"
+of the switch chip. eth0 and eth1 are instances of some other existing driver.
+
+The only difference between a switch-port netdevice and an ordinary netdevice
+is that it implements a couple more NDOs:
+
+	ndo_swdev_id_get - This returns the same ID for two port netdevices of
+			   the same physical switch chip. This is mandatory to
+			   be implemented by all switch drivers and serves
+			   the caller for recognition of a port netdevice.
+	ndo_swdev_* - Functions that serve for a manipulation of the switch chip
+		      itself. They are not port-specific. Caller might use
+		      arbitrary port netdevice of the same switch and it will
+		      make no difference.
+	ndo_swportdev_* - Functions that serve for a port-specific manipulation.
diff --git a/MAINTAINERS b/MAINTAINERS
index 5e3709e..f1f26db 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8826,6 +8826,13 @@ F:	lib/swiotlb.c
 F:	arch/*/kernel/pci-swiotlb.c
 F:	include/linux/swiotlb.h
 
+SWITCHDEV
+M:	Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
+L:	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
+S:	Supported
+F:	net/switchdev/
+F:	include/net/switchdev.h
+
 SYNOPSYS ARC ARCHITECTURE
 M:	Vineet Gupta <vgupta-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
 S:	Supported
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 13d765f..b290dcf 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -752,6 +752,8 @@ struct netdev_phys_item_id {
 typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
 				       struct sk_buff *skb);
 
+#include <net/switchdev.h>
+
 /*
  * This structure defines the management hooks for network devices.
  * The following hooks can be defined; unless noted otherwise, they are
@@ -997,6 +999,12 @@ typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
  *	Callback to use for xmit over the accelerated station. This
  *	is used in place of ndo_start_xmit on accelerated net
  *	devices.
+ *
+ * int (*ndo_swdev_id_get)(struct net_device *dev,
+ *			   struct netdev_phys_item_id *psid);
+ *	Called to get an ID of the switch chip this port is part of.
+ *	If driver implements this, it indicates that it represents a port
+ *	of a switch chip.
  */
 struct net_device_ops {
 	int			(*ndo_init)(struct net_device *dev);
@@ -1146,6 +1154,10 @@ struct net_device_ops {
 							struct net_device *dev,
 							void *priv);
 	int			(*ndo_get_lock_subclass)(struct net_device *dev);
+#ifdef CONFIG_NET_SWITCHDEV
+	int			(*ndo_swdev_id_get)(struct net_device *dev,
+						    struct netdev_phys_item_id *psid);
+#endif
 };
 
 /**
diff --git a/include/net/switchdev.h b/include/net/switchdev.h
new file mode 100644
index 0000000..af30f75
--- /dev/null
+++ b/include/net/switchdev.h
@@ -0,0 +1,29 @@
+/*
+ * include/net/switchdev.h - Switch device API
+ * Copyright (c) 2014 Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+#ifndef _LINUX_SWITCHDEV_H_
+#define _LINUX_SWITCHDEV_H_
+
+#include <linux/netdevice.h>
+
+#ifdef CONFIG_NET_SWITCHDEV
+
+int swdev_id_get(struct net_device *dev, struct netdev_phys_item_id *psid);
+
+#else
+
+static inline int swdev_id_get(struct net_device *dev,
+			       struct netdev_phys_item_id *psid)
+{
+	return -EOPNOTSUPP;
+}
+
+#endif
+
+#endif /* _LINUX_SWITCHDEV_H_ */
diff --git a/net/Kconfig b/net/Kconfig
index 4051fdf..89a7fec 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -226,6 +226,7 @@ source "net/vmw_vsock/Kconfig"
 source "net/netlink/Kconfig"
 source "net/mpls/Kconfig"
 source "net/hsr/Kconfig"
+source "net/switchdev/Kconfig"
 
 config RPS
 	boolean
diff --git a/net/Makefile b/net/Makefile
index 7ed1970..95fc694 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -73,3 +73,6 @@ obj-$(CONFIG_OPENVSWITCH)	+= openvswitch/
 obj-$(CONFIG_VSOCKETS)	+= vmw_vsock/
 obj-$(CONFIG_NET_MPLS_GSO)	+= mpls/
 obj-$(CONFIG_HSR)		+= hsr/
+ifneq ($(CONFIG_NET_SWITCHDEV),)
+obj-y				+= switchdev/
+endif
diff --git a/net/switchdev/Kconfig b/net/switchdev/Kconfig
new file mode 100644
index 0000000..20e8ed2
--- /dev/null
+++ b/net/switchdev/Kconfig
@@ -0,0 +1,9 @@
+#
+# Configuration for Switch device support
+#
+
+config NET_SWITCHDEV
+	boolean "Switch device support (EXPERIMENTAL)"
+	depends on INET
+	---help---
+	  This module provides support for hardware switch chips.
diff --git a/net/switchdev/Makefile b/net/switchdev/Makefile
new file mode 100644
index 0000000..5ed63ed
--- /dev/null
+++ b/net/switchdev/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for the Switch device API
+#
+
+obj-$(CONFIG_NET_SWITCHDEV) += switchdev.o
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
new file mode 100644
index 0000000..14a5fc9
--- /dev/null
+++ b/net/switchdev/switchdev.c
@@ -0,0 +1,32 @@
+/*
+ * net/switchdev/switchdev.c - Switch device API
+ * Copyright (c) 2014 Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/init.h>
+#include <linux/netdevice.h>
+#include <net/switchdev.h>
+
+/**
+ *	swdev_id_get - Get ID of a switch
+ *	@dev: port device
+ *	@psid: switch ID
+ *
+ *	Get ID of a switch this port is part of.
+ */
+int swdev_id_get(struct net_device *dev, struct netdev_phys_item_id *psid)
+{
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	if (!ops->ndo_swdev_id_get)
+		return -EOPNOTSUPP;
+	return ops->ndo_swdev_id_get(dev, psid);
+}
+EXPORT_SYMBOL(swdev_id_get);
-- 
1.9.3


* [patch net-next v2 3/9] rtnl: expose physical switch id for particular device
  2014-09-19 13:49 [patch net-next v2 0/9] introduce rocker switch driver with hardware accelerated datapath api Jiri Pirko
  2014-09-19 13:49 ` [patch net-next v2 1/9] net: rename netdev_phys_port_id to more generic name Jiri Pirko
@ 2014-09-19 13:49 ` Jiri Pirko
  2014-09-19 13:49 ` [patch net-next v2 4/9] net-sysfs: " Jiri Pirko
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 67+ messages in thread
From: Jiri Pirko @ 2014-09-19 13:49 UTC
  To: netdev
  Cc: davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet, jhs, sfeldma, f.fainelli, roopa,
	linville, dev, jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a,
	buytenh, aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye,
	simon.horman, alexander.h.duyck

If the netdevice represents a port in a switch, it will expose the
IFLA_PHYS_SWITCH_ID value via rtnl. Two netdevices with the same value
belong to one physical switch.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 include/uapi/linux/if_link.h |  1 +
 net/core/rtnetlink.c         | 26 +++++++++++++++++++++++++-
 2 files changed, 26 insertions(+), 1 deletion(-)
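
For illustration only (not part of the patch): a sketch of how a userspace
consumer might compare two IFLA_PHYS_SWITCH_ID payloads once it has copied
them out of the netlink messages. struct phys_switch_id and both helpers are
hypothetical, and using id_len == 0 to mean "attribute absent" is an
assumption of this sketch, not something the patch defines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_PHYS_ITEM_ID_LEN 32

/* Hypothetical userspace copy of a received IFLA_PHYS_SWITCH_ID payload. */
struct phys_switch_id {
	unsigned char id[MAX_PHYS_ITEM_ID_LEN];
	unsigned char id_len;	/* 0: attribute was absent (assumption) */
};

/* Store a raw attribute payload of 'len' bytes. */
static void phys_switch_id_set(struct phys_switch_id *psid,
			       const void *data, size_t len)
{
	assert(len <= MAX_PHYS_ITEM_ID_LEN);
	memcpy(psid->id, data, len);
	psid->id_len = (unsigned char)len;
}

/* Two ports belong to one physical switch iff their IDs are byte-equal. */
static bool same_physical_switch(const struct phys_switch_id *a,
				 const struct phys_switch_id *b)
{
	if (a->id_len == 0 || b->id_len == 0)
		return false;	/* at least one device is not a switch port */
	if (a->id_len != b->id_len)
		return false;
	return memcmp(a->id, b->id, a->id_len) == 0;
}
```

The length is compared before the bytes because the attribute is NLA_BINARY
with a variable length up to MAX_PHYS_ITEM_ID_LEN.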

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index c80f95f..c5ca3b9 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -145,6 +145,7 @@ enum {
 	IFLA_CARRIER,
 	IFLA_PHYS_PORT_ID,
 	IFLA_CARRIER_CHANGES,
+	IFLA_PHYS_SWITCH_ID,
 	__IFLA_MAX
 };
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 1087c6d..8947297 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -43,6 +43,7 @@
 
 #include <linux/inet.h>
 #include <linux/netdevice.h>
+#include <net/switchdev.h>
 #include <net/ip.h>
 #include <net/protocol.h>
 #include <net/arp.h>
@@ -868,7 +869,8 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + rtnl_port_size(dev, ext_filter_mask) /* IFLA_VF_PORTS + IFLA_PORT_SELF */
 	       + rtnl_link_get_size(dev) /* IFLA_LINKINFO */
 	       + rtnl_link_get_af_size(dev) /* IFLA_AF_SPEC */
-	       + nla_total_size(MAX_PHYS_ITEM_ID_LEN); /* IFLA_PHYS_PORT_ID */
+	       + nla_total_size(MAX_PHYS_ITEM_ID_LEN) /* IFLA_PHYS_PORT_ID */
+	       + nla_total_size(MAX_PHYS_ITEM_ID_LEN); /* IFLA_PHYS_SWITCH_ID */
 }
 
 static int rtnl_vf_ports_fill(struct sk_buff *skb, struct net_device *dev)
@@ -967,6 +969,24 @@ static int rtnl_phys_port_id_fill(struct sk_buff *skb, struct net_device *dev)
 	return 0;
 }
 
+static int rtnl_phys_switch_id_fill(struct sk_buff *skb, struct net_device *dev)
+{
+	int err;
+	struct netdev_phys_item_id psid;
+
+	err = swdev_id_get(dev, &psid);
+	if (err) {
+		if (err == -EOPNOTSUPP)
+			return 0;
+		return err;
+	}
+
+	if (nla_put(skb, IFLA_PHYS_SWITCH_ID, psid.id_len, psid.id))
+		return -EMSGSIZE;
+
+	return 0;
+}
+
 static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 			    int type, u32 pid, u32 seq, u32 change,
 			    unsigned int flags, u32 ext_filter_mask)
@@ -1039,6 +1059,9 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 	if (rtnl_phys_port_id_fill(skb, dev))
 		goto nla_put_failure;
 
+	if (rtnl_phys_switch_id_fill(skb, dev))
+		goto nla_put_failure;
+
 	attr = nla_reserve(skb, IFLA_STATS,
 			sizeof(struct rtnl_link_stats));
 	if (attr == NULL)
@@ -1198,6 +1221,7 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
 	[IFLA_NUM_RX_QUEUES]	= { .type = NLA_U32 },
 	[IFLA_PHYS_PORT_ID]	= { .type = NLA_BINARY, .len = MAX_PHYS_ITEM_ID_LEN },
 	[IFLA_CARRIER_CHANGES]	= { .type = NLA_U32 },  /* ignored */
+	[IFLA_PHYS_SWITCH_ID]	= { .type = NLA_BINARY, .len = MAX_PHYS_ITEM_ID_LEN },
 };
 
 static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
-- 
1.9.3


* [patch net-next v2 4/9] net-sysfs: expose physical switch id for particular device
  2014-09-19 13:49 [patch net-next v2 0/9] introduce rocker switch driver with hardware accelerated datapath api Jiri Pirko
  2014-09-19 13:49 ` [patch net-next v2 1/9] net: rename netdev_phys_port_id to more generic name Jiri Pirko
  2014-09-19 13:49 ` [patch net-next v2 3/9] rtnl: expose physical switch id for particular device Jiri Pirko
@ 2014-09-19 13:49 ` Jiri Pirko
  2014-09-19 13:49 ` [patch net-next v2 5/9] net: introduce dummy switch Jiri Pirko
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 67+ messages in thread
From: Jiri Pirko @ 2014-09-19 13:49 UTC
  To: netdev
  Cc: davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet, jhs, sfeldma, f.fainelli, roopa,
	linville, dev, jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a,
	buytenh, aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye,
	simon.horman, alexander.h.duyck

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 net/core/net-sysfs.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)
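
For illustration only (not part of the patch): a userspace approximation of
the string the new phys_switch_id attribute would produce. To my understanding
the kernel's "%*phN" format prints each byte as two lowercase hex digits with
no separator; format_phys_id is a hypothetical helper that mimics that, and
the sample bytes in the comment are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Approximates sprintf(buf, "%*phN\n", len, id) from the sysfs show
 * handler: two lowercase hex digits per byte, no separator, then '\n'.
 * For example, the bytes 00 1a ff would come out as "001aff\n".
 * Returns the number of characters written, excluding the NUL.
 */
static int format_phys_id(char *buf, size_t size,
			  const unsigned char *id, size_t len)
{
	size_t i;

	/* Room for two digits per byte, the newline and the NUL. */
	assert(size >= 2 * len + 2);
	for (i = 0; i < len; i++)
		snprintf(buf + 2 * i, 3, "%02x", id[i]);
	buf[2 * len] = '\n';
	buf[2 * len + 1] = '\0';
	return (int)(2 * len + 1);
}
```

Reading the attribute on two ports of one switch should therefore give
identical strings, so a plain string compare is enough in scripts.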

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 55dc4da..87b97bc 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -12,6 +12,7 @@
 #include <linux/capability.h>
 #include <linux/kernel.h>
 #include <linux/netdevice.h>
+#include <net/switchdev.h>
 #include <linux/if_arp.h>
 #include <linux/slab.h>
 #include <linux/nsproxy.h>
@@ -399,6 +400,28 @@ static ssize_t phys_port_id_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(phys_port_id);
 
+static ssize_t phys_switch_id_show(struct device *dev,
+				   struct device_attribute *attr, char *buf)
+{
+	struct net_device *netdev = to_net_dev(dev);
+	ssize_t ret = -EINVAL;
+
+	if (!rtnl_trylock())
+		return restart_syscall();
+
+	if (dev_isalive(netdev)) {
+		struct netdev_phys_item_id ppid;
+
+		ret = swdev_id_get(netdev, &ppid);
+		if (!ret)
+			ret = sprintf(buf, "%*phN\n", ppid.id_len, ppid.id);
+	}
+	rtnl_unlock();
+
+	return ret;
+}
+static DEVICE_ATTR_RO(phys_switch_id);
+
 static struct attribute *net_class_attrs[] = {
 	&dev_attr_netdev_group.attr,
 	&dev_attr_type.attr,
@@ -423,6 +446,7 @@ static struct attribute *net_class_attrs[] = {
 	&dev_attr_flags.attr,
 	&dev_attr_tx_queue_len.attr,
 	&dev_attr_phys_port_id.attr,
+	&dev_attr_phys_switch_id.attr,
 	NULL,
 };
 ATTRIBUTE_GROUPS(net_class);
-- 
1.9.3


* [patch net-next v2 5/9] net: introduce dummy switch
  2014-09-19 13:49 [patch net-next v2 0/9] introduce rocker switch driver with hardware accelerated datapath api Jiri Pirko
                   ` (2 preceding siblings ...)
  2014-09-19 13:49 ` [patch net-next v2 4/9] net-sysfs: " Jiri Pirko
@ 2014-09-19 13:49 ` Jiri Pirko
       [not found]   ` <1411134590-4586-6-git-send-email-jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
  2014-09-19 13:49 ` [patch net-next v2 6/9] switchdev: add basic support for flow matching and actions Jiri Pirko
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 67+ messages in thread
From: Jiri Pirko @ 2014-09-19 13:49 UTC
  To: netdev
  Cc: davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet, jhs, sfeldma, f.fainelli, roopa,
	linville, dev, jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a,
	buytenh, aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye,
	simon.horman, alexander.h.duyck

Dummy switch implementation using the switchdev interface.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 drivers/net/Kconfig          |   7 +++
 drivers/net/Makefile         |   1 +
 drivers/net/dummyswitch.c    | 130 +++++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/if_link.h |   9 +++
 4 files changed, 147 insertions(+)
 create mode 100644 drivers/net/dummyswitch.c
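
Each dummyswport device stores whatever ID arrives in
IFLA_DUMMYSWPORT_PHYS_SWITCH_ID and reports it through ndo_swdev_id_get, so
ports created with the same ID should appear as ports of one switch. Below is
a minimal userspace sketch of that idea, not driver code. struct phys_item_id
is a stand-in for struct netdev_phys_item_id, PHYS_ITEM_ID_MAX is assumed to
mirror MAX_PHYS_ITEM_ID_LEN, and both helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define PHYS_ITEM_ID_MAX 32	/* assumption: mirrors MAX_PHYS_ITEM_ID_LEN */

/* Userspace stand-in for struct netdev_phys_item_id. */
struct phys_item_id {
	unsigned char id[PHYS_ITEM_ID_MAX];
	unsigned char id_len;
};

/*
 * Bounded copy of a variable-length ID. The driver relies on the
 * netlink policy for the upper bound; this sketch checks it itself.
 * Returns 0 on success, -1 if the ID does not fit.
 */
static int phys_item_id_set(struct phys_item_id *psid,
			    const void *data, size_t len)
{
	assert(psid != NULL && data != NULL);

	if (len > PHYS_ITEM_ID_MAX)
		return -1;

	memset(psid, 0, sizeof(*psid));
	memcpy(psid->id, data, len);
	psid->id_len = (unsigned char)len;
	return 0;
}

/* Two ports belong to the same switch if their IDs are byte-identical. */
static bool phys_item_id_same(const struct phys_item_id *a,
			      const struct phys_item_id *b)
{
	return a->id_len == b->id_len &&
	       memcmp(a->id, b->id, a->id_len) == 0;
}
```

Comparing both the length and the bytes matters here: an ID of 01 02 and an
ID of 01 02 03 share a prefix but would name different switches.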

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index c6f6f69..7822c74 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -71,6 +71,13 @@ config DUMMY
 	  To compile this driver as a module, choose M here: the module
 	  will be called dummy.
 
+config NET_DUMMY_SWITCH
+	tristate "Dummy switch net driver support"
+	depends on NET_SWITCHDEV
+	---help---
+	  To compile this driver as a module, choose M here: the module
+	  will be called dummyswitch.
+
 config EQUALIZER
 	tristate "EQL (serial line load balancing) support"
 	---help---
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index 61aefdd..3c835ba 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -7,6 +7,7 @@
 #
 obj-$(CONFIG_BONDING) += bonding/
 obj-$(CONFIG_DUMMY) += dummy.o
+obj-$(CONFIG_NET_DUMMY_SWITCH) += dummyswitch.o
 obj-$(CONFIG_EQUALIZER) += eql.o
 obj-$(CONFIG_IFB) += ifb.o
 obj-$(CONFIG_MACVLAN) += macvlan.o
diff --git a/drivers/net/dummyswitch.c b/drivers/net/dummyswitch.c
new file mode 100644
index 0000000..e7a48f4
--- /dev/null
+++ b/drivers/net/dummyswitch.c
@@ -0,0 +1,130 @@
+/*
+ * drivers/net/dummyswitch.c - Dummy switch device
+ * Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <net/rtnetlink.h>
+
+struct dummyswport_priv {
+	struct netdev_phys_item_id psid;
+};
+
+static netdev_tx_t dummyswport_start_xmit(struct sk_buff *skb,
+					  struct net_device *dev)
+{
+	dev_kfree_skb(skb);
+	return NETDEV_TX_OK;
+}
+
+static int dummyswport_swdev_id_get(struct net_device *dev,
+				    struct netdev_phys_item_id *psid)
+{
+	struct dummyswport_priv *dsp = netdev_priv(dev);
+
+	memcpy(psid, &dsp->psid, sizeof(*psid));
+	return 0;
+}
+
+static int dummyswport_change_carrier(struct net_device *dev, bool new_carrier)
+{
+	if (new_carrier)
+		netif_carrier_on(dev);
+	else
+		netif_carrier_off(dev);
+	return 0;
+}
+
+static const struct net_device_ops dummyswport_netdev_ops = {
+	.ndo_start_xmit		= dummyswport_start_xmit,
+	.ndo_swdev_id_get	= dummyswport_swdev_id_get,
+	.ndo_change_carrier	= dummyswport_change_carrier,
+};
+
+static void dummyswport_setup(struct net_device *dev)
+{
+	ether_setup(dev);
+
+	/* Initialize the device structure. */
+	dev->netdev_ops = &dummyswport_netdev_ops;
+	dev->destructor = free_netdev;
+
+	/* Fill in device structure with ethernet-generic values. */
+	dev->tx_queue_len = 0;
+	dev->flags |= IFF_NOARP;
+	dev->flags &= ~IFF_MULTICAST;
+	dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
+	dev->features	|= NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_TSO;
+	dev->features	|= NETIF_F_HW_CSUM | NETIF_F_HIGHDMA | NETIF_F_LLTX;
+	eth_hw_addr_random(dev);
+}
+
+static int dummyswport_validate(struct nlattr *tb[], struct nlattr *data[])
+{
+	if (tb[IFLA_ADDRESS])
+		return -EINVAL;
+	if (!data || !data[IFLA_DUMMYSWPORT_PHYS_SWITCH_ID])
+		return -EINVAL;
+	return 0;
+}
+
+static int dummyswport_newlink(struct net *src_net, struct net_device *dev,
+			       struct nlattr *tb[], struct nlattr *data[])
+{
+	struct dummyswport_priv *dsp = netdev_priv(dev);
+	int err;
+
+	dsp->psid.id_len = nla_len(data[IFLA_DUMMYSWPORT_PHYS_SWITCH_ID]);
+	memcpy(dsp->psid.id, nla_data(data[IFLA_DUMMYSWPORT_PHYS_SWITCH_ID]),
+	       dsp->psid.id_len);
+
+	err = register_netdevice(dev);
+	if (err)
+		return err;
+
+	netif_carrier_on(dev);
+
+	return 0;
+}
+
+static const struct nla_policy dummyswport_policy[IFLA_DUMMYSWPORT_MAX + 1] = {
+	[IFLA_DUMMYSWPORT_PHYS_SWITCH_ID] = { .type = NLA_BINARY,
+					      .len = MAX_PHYS_ITEM_ID_LEN },
+};
+
+static struct rtnl_link_ops dummyswport_link_ops __read_mostly = {
+	.kind		= "dummyswport",
+	.priv_size	= sizeof(struct dummyswport_priv),
+	.setup		= dummyswport_setup,
+	.validate	= dummyswport_validate,
+	.newlink	= dummyswport_newlink,
+	.policy		= dummyswport_policy,
+	.maxtype	= IFLA_DUMMYSWPORT_MAX,
+};
+
+static int __init dummysw_module_init(void)
+{
+	return rtnl_link_register(&dummyswport_link_ops);
+}
+
+static void __exit dummysw_module_exit(void)
+{
+	rtnl_link_unregister(&dummyswport_link_ops);
+}
+
+module_init(dummysw_module_init);
+module_exit(dummysw_module_exit);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Jiri Pirko <jiri@resnulli.us>");
+MODULE_DESCRIPTION("Dummy switch device");
+MODULE_ALIAS_RTNL_LINK("dummyswport");
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index c5ca3b9..bd24d69 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -574,4 +574,13 @@ enum {
 
 #define IFLA_HSR_MAX (__IFLA_HSR_MAX - 1)
 
+/* DUMMYSWPORT section */
+enum {
+	IFLA_DUMMYSWPORT_UNSPEC,
+	IFLA_DUMMYSWPORT_PHYS_SWITCH_ID,
+	__IFLA_DUMMYSWPORT_MAX,
+};
+
+#define IFLA_DUMMYSWPORT_MAX (__IFLA_DUMMYSWPORT_MAX - 1)
+
 #endif /* _UAPI_LINUX_IF_LINK_H */
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [patch net-next v2 6/9] switchdev: add basic support for flow matching and actions
  2014-09-19 13:49 [patch net-next v2 0/9] introduce rocker switch driver with hardware accelerated datapath api Jiri Pirko
                   ` (3 preceding siblings ...)
  2014-09-19 13:49 ` [patch net-next v2 5/9] net: introduce dummy switch Jiri Pirko
@ 2014-09-19 13:49 ` Jiri Pirko
  2014-09-20  5:32   ` Florian Fainelli
  2014-09-19 13:49 ` [patch net-next v2 7/9] switchdev: add swdev features Jiri Pirko
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 67+ messages in thread
From: Jiri Pirko @ 2014-09-19 13:49 UTC (permalink / raw)
  To: netdev
  Cc: davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet, jhs, sfeldma, f.fainelli, roopa,
	linville, dev, jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a,
	buytenh, aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye,
	simon.horman, alexander.h.duyck

This patch adds basic support for flows. The infrastructure is designed so
that other flow matching types can be added easily. So far, only the key
match type is implemented.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 include/linux/netdevice.h |  16 ++++++
 include/net/switchdev.h   | 113 ++++++++++++++++++++++++++++++++++++++++++
 net/switchdev/switchdev.c | 123 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 252 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b290dcf..034baca 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1005,6 +1005,18 @@ typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
  *	Called to get an ID of the switch chip this port is part of.
  *	If driver implements this, it indicates that it represents a port
  *	of a switch chip.
+ *
+ * int (*ndo_swdev_flow_insert)(struct net_device *dev,
+ *				const struct swdev_flow *flow);
+ *	Called to insert a flow into switch device. If driver does
+ *	not implement this, it is assumed that the hw does not have
+ *	a capability to work with flows.
+ *
+ * int (*ndo_swdev_flow_remove)(struct net_device *dev,
+ *				const struct swdev_flow *flow);
+ *	Called to remove a flow from switch device. If driver does
+ *	not implement this, it is assumed that the hw does not have
+ *	a capability to work with flows.
  */
 struct net_device_ops {
 	int			(*ndo_init)(struct net_device *dev);
@@ -1157,6 +1169,10 @@ struct net_device_ops {
 #ifdef CONFIG_NET_SWITCHDEV
 	int			(*ndo_swdev_id_get)(struct net_device *dev,
 						    struct netdev_phys_item_id *psid);
+	int			(*ndo_swdev_flow_insert)(struct net_device *dev,
+							 const struct swdev_flow *flow);
+	int			(*ndo_swdev_flow_remove)(struct net_device *dev,
+							 const struct swdev_flow *flow);
 #endif
 };
 
diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index af30f75..060d3fc 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -12,9 +12,110 @@
 
 #include <linux/netdevice.h>
 
+struct swdev_flow_match_key {
+	struct {
+		u32	priority;	/* Packet QoS priority. */
+		u32	in_port_ifindex; /* Input switch port ifindex (or 0). */
+	} phy;
+	struct {
+		u8     src[ETH_ALEN];	/* Ethernet source address. */
+		u8     dst[ETH_ALEN];	/* Ethernet destination address. */
+		__be16 tci;		/* 0 if no VLAN, VLAN_TAG_PRESENT set otherwise. */
+		__be16 type;		/* Ethernet frame type. */
+	} eth;
+	struct {
+		u8     proto;		/* IP protocol or lower 8 bits of ARP opcode. */
+		u8     tos;		/* IP ToS. */
+		u8     ttl;		/* IP TTL/hop limit. */
+		u8     frag;		/* One of OVS_FRAG_TYPE_*. */
+	} ip;
+	struct {
+		__be16 src;		/* TCP/UDP/SCTP source port. */
+		__be16 dst;		/* TCP/UDP/SCTP destination port. */
+		__be16 flags;		/* TCP flags. */
+	} tp;
+	union {
+		struct {
+			struct {
+				__be32 src;	/* IP source address. */
+				__be32 dst;	/* IP destination address. */
+			} addr;
+			struct {
+				u8 sha[ETH_ALEN];	/* ARP source hardware address. */
+				u8 tha[ETH_ALEN];	/* ARP target hardware address. */
+			} arp;
+		} ipv4;
+		struct {
+			struct {
+				struct in6_addr src;	/* IPv6 source address. */
+				struct in6_addr dst;	/* IPv6 destination address. */
+			} addr;
+			__be32 label;			/* IPv6 flow label. */
+			struct {
+				struct in6_addr target;	/* ND target address. */
+				u8 sll[ETH_ALEN];	/* ND source link layer address. */
+				u8 tll[ETH_ALEN];	/* ND target link layer address. */
+			} nd;
+		} ipv6;
+	};
+};
+
+enum swdev_flow_match_type {
+	SW_FLOW_MATCH_TYPE_KEY,
+};
+
+struct swdev_flow_match {
+	enum swdev_flow_match_type			type;
+	union {
+		struct {
+			struct swdev_flow_match_key	key;
+			struct swdev_flow_match_key	key_mask;
+		};
+	};
+};
+
+enum swdev_flow_action_type {
+	SW_FLOW_ACTION_TYPE_OUTPUT,
+	SW_FLOW_ACTION_TYPE_VLAN_PUSH,
+	SW_FLOW_ACTION_TYPE_VLAN_POP,
+};
+
+struct swdev_flow_action {
+	enum swdev_flow_action_type	type;
+	union {
+		u32			out_port_ifindex;
+		struct {
+			__be16		proto;
+			__be16		tci;
+		} vlan;
+	};
+};
+
+struct swdev_flow {
+	struct swdev_flow_match		match;
+	unsigned			action_count;
+	struct swdev_flow_action	action[0];
+};
+
+static inline struct swdev_flow *swdev_flow_alloc(unsigned action_count,
+						  gfp_t flags)
+{
+	struct swdev_flow *flow;
+
+	flow = kzalloc(sizeof(struct swdev_flow) +
+		       sizeof(struct swdev_flow_action) * action_count,
+		       flags);
+	if (!flow)
+		return NULL;
+	flow->action_count = action_count;
+	return flow;
+}
+
 #ifdef CONFIG_NET_SWITCHDEV
 
 int swdev_id_get(struct net_device *dev, struct netdev_phys_item_id *psid);
+int swdev_flow_insert(struct net_device *dev, const struct swdev_flow *flow);
+int swdev_flow_remove(struct net_device *dev, const struct swdev_flow *flow);
 
 #else
 
@@ -24,6 +125,18 @@ static inline int swdev_id_get(struct net_device *dev,
 	return -EOPNOTSUPP;
 }
 
+static inline int swdev_flow_insert(struct net_device *dev,
+				    const struct swdev_flow *flow)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int swdev_flow_remove(struct net_device *dev,
+				    const struct swdev_flow *flow)
+{
+	return -EOPNOTSUPP;
+}
+
 #endif
 
 #endif /* _LINUX_SWITCHDEV_H_ */
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index 14a5fc9..90bc5e4 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -30,3 +30,126 @@ int swdev_id_get(struct net_device *dev, struct netdev_phys_item_id *psid)
 	return ops->ndo_swdev_id_get(dev, psid);
 }
 EXPORT_SYMBOL(swdev_id_get);
+
+static void print_flow_key_phy(const char *prefix,
+			       const struct swdev_flow_match_key *key)
+{
+	pr_debug("%s phy  { prio %08x, in_port_ifindex %08x }\n",
+		 prefix,
+		 key->phy.priority, key->phy.in_port_ifindex);
+}
+
+static void print_flow_key_eth(const char *prefix,
+			       const struct swdev_flow_match_key *key)
+{
+	pr_debug("%s eth  { sm %pM, dm %pM, tci %04x, type %04x }\n",
+		 prefix,
+		 key->eth.src, key->eth.dst, ntohs(key->eth.tci),
+		 ntohs(key->eth.type));
+}
+
+static void print_flow_key_ip(const char *prefix,
+			      const struct swdev_flow_match_key *key)
+{
+	pr_debug("%s ip   { proto %02x, tos %02x, ttl %02x }\n",
+		 prefix,
+		 key->ip.proto, key->ip.tos, key->ip.ttl);
+}
+
+static void print_flow_key_ipv4(const char *prefix,
+				const struct swdev_flow_match_key *key)
+{
+	pr_debug("%s ipv4 { si %pI4, di %pI4, sm %pM, dm %pM }\n",
+		 prefix,
+		 &key->ipv4.addr.src, &key->ipv4.addr.dst,
+		 key->ipv4.arp.sha, key->ipv4.arp.tha);
+}
+
+static void print_flow_actions(const struct swdev_flow_action *action,
+			       unsigned action_count)
+{
+	int i;
+
+	pr_debug("  actions:\n");
+	for (i = 0; i < action_count; i++) {
+		switch (action->type) {
+		case SW_FLOW_ACTION_TYPE_OUTPUT:
+			pr_debug("    output    { ifindex %u }\n",
+				 action->out_port_ifindex);
+			break;
+		case SW_FLOW_ACTION_TYPE_VLAN_PUSH:
+			pr_debug("    vlan push { proto %04x, tci %04x }\n",
+				 ntohs(action->vlan.proto),
+				 ntohs(action->vlan.tci));
+			break;
+		case SW_FLOW_ACTION_TYPE_VLAN_POP:
+			pr_debug("    vlan pop\n");
+			break;
+		}
+		action++;
+	}
+}
+
+#define PREFIX_NONE "      "
+#define PREFIX_MASK "  mask"
+
+static void print_flow_match(const struct swdev_flow_match *match)
+{
+	switch (match->type) {
+	case SW_FLOW_MATCH_TYPE_KEY:
+		print_flow_key_phy(PREFIX_NONE, &match->key);
+		print_flow_key_phy(PREFIX_MASK, &match->key_mask);
+		print_flow_key_eth(PREFIX_NONE, &match->key);
+		print_flow_key_eth(PREFIX_MASK, &match->key_mask);
+		print_flow_key_ip(PREFIX_NONE, &match->key);
+		print_flow_key_ip(PREFIX_MASK, &match->key_mask);
+		print_flow_key_ipv4(PREFIX_NONE, &match->key);
+		print_flow_key_ipv4(PREFIX_MASK, &match->key_mask);
+	}
+}
+
+static void print_flow(const struct swdev_flow *flow, struct net_device *dev,
+		       const char *comment)
+{
+	pr_debug("%s flow %s:\n", dev->name, comment);
+	print_flow_match(&flow->match);
+	print_flow_actions(flow->action, flow->action_count);
+}
+
+/**
+ *	swdev_flow_insert - Insert a flow into switch
+ *	@dev: port device
+ *	@flow: flow descriptor
+ *
+ *	Insert a flow into switch this port is part of.
+ */
+int swdev_flow_insert(struct net_device *dev, const struct swdev_flow *flow)
+{
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	print_flow(flow, dev, "insert");
+	if (!ops->ndo_swdev_flow_insert)
+		return -EOPNOTSUPP;
+	WARN_ON(!ops->ndo_swdev_id_get);
+	return ops->ndo_swdev_flow_insert(dev, flow);
+}
+EXPORT_SYMBOL(swdev_flow_insert);
+
+/**
+ *	swdev_flow_remove - Remove a flow from switch
+ *	@dev: port device
+ *	@flow: flow descriptor
+ *
+ *	Remove a flow from switch this port is part of.
+ */
+int swdev_flow_remove(struct net_device *dev, const struct swdev_flow *flow)
+{
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	print_flow(flow, dev, "remove");
+	if (!ops->ndo_swdev_flow_remove)
+		return -EOPNOTSUPP;
+	WARN_ON(!ops->ndo_swdev_id_get);
+	return ops->ndo_swdev_flow_remove(dev, flow);
+}
+EXPORT_SYMBOL(swdev_flow_remove);
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [patch net-next v2 7/9] switchdev: add swdev features
  2014-09-19 13:49 [patch net-next v2 0/9] introduce rocker switch driver with hardware accelerated datapath api Jiri Pirko
                   ` (4 preceding siblings ...)
  2014-09-19 13:49 ` [patch net-next v2 6/9] switchdev: add basic support for flow matching and actions Jiri Pirko
@ 2014-09-19 13:49 ` Jiri Pirko
  2014-09-19 13:49 ` [patch net-next v2 8/9] switchdev: introduce Netlink API Jiri Pirko
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 67+ messages in thread
From: Jiri Pirko @ 2014-09-19 13:49 UTC (permalink / raw)
  To: netdev
  Cc: davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet, jhs, sfeldma, f.fainelli, roopa,
	linville, dev, jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a,
	buytenh, aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye,
	simon.horman, alexander.h.duyck

A driver should define ndo_swdev_features_get to indicate which switch
features it supports.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 include/linux/netdevice.h |  5 +++++
 include/net/switchdev.h   | 18 ++++++++++++++++++
 net/switchdev/switchdev.c | 33 +++++++++++++++++++++++++++++++++
 3 files changed, 56 insertions(+)
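
The feature bits follow the same bit-index-plus-mask scheme as
netdev_features_t. The sketch below is a userspace rendering of the macros
and of the check that swdev_flow_insert() and swdev_flow_remove() perform
before calling into the driver. Names without the swdev prefix are
hypothetical copies made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

typedef uint64_t features_t;

enum {
	F_FLOW_MATCH_KEY_BIT,	/* supports fixed key match */
	F_COUNT
};

#define F_BIT(bit)		((features_t)1 << (bit))
#define F_FLOW_MATCH_KEY	F_BIT(F_FLOW_MATCH_KEY_BIT)

enum match_type {
	MATCH_TYPE_KEY,
};

/*
 * Mirrors check_match_type_features(): a key match is refused unless
 * the switch advertises the corresponding feature bit.
 */
static int check_match_type(features_t supported, enum match_type type)
{
	/* All bit indexes must fit into the 64-bit feature word. */
	assert(F_COUNT <= 64);

	if (type == MATCH_TYPE_KEY && !(supported & F_FLOW_MATCH_KEY))
		return -EOPNOTSUPP;
	return 0;
}
```

A driver that returns 0 from its features callback would therefore have every
key-type flow rejected with -EOPNOTSUPP before its insert callback is reached.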

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 034baca..b87d0cc 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1006,6 +1006,10 @@ typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
  *	If driver implements this, it indicates that it represents a port
  *	of a switch chip.
  *
+ * swdev_features_t (*ndo_swdev_features_get)(struct net_device *dev);
+ *	Called to get the set of features supported by the switch chip this
+ *	port is part of.
+ *
  * int (*ndo_swdev_flow_insert)(struct net_device *dev,
  *				const struct swdev_flow *flow);
  *	Called to insert a flow into switch device. If driver does
@@ -1169,6 +1173,7 @@ struct net_device_ops {
 #ifdef CONFIG_NET_SWITCHDEV
 	int			(*ndo_swdev_id_get)(struct net_device *dev,
 						    struct netdev_phys_item_id *psid);
+	swdev_features_t	(*ndo_swdev_features_get)(struct net_device *dev);
 	int			(*ndo_swdev_flow_insert)(struct net_device *dev,
 							 const struct swdev_flow *flow);
 	int			(*ndo_swdev_flow_remove)(struct net_device *dev,
diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index 060d3fc..91cdb47 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -12,6 +12,18 @@
 
 #include <linux/netdevice.h>
 
+typedef u64 swdev_features_t;
+
+enum {
+	SWDEV_F_FLOW_MATCH_KEY_BIT,	/* Supports fixed key match */
+	/**/SWDEV_FEATURE_COUNT
+};
+
+#define __SWDEV_F_BIT(bit)	((swdev_features_t)1 << (bit))
+#define __SWDEV_F(name)		__SWDEV_F_BIT(SWDEV_F_##name##_BIT)
+
+#define SWDEV_F_FLOW_MATCH_KEY	__SWDEV_F(FLOW_MATCH_KEY)
+
 struct swdev_flow_match_key {
 	struct {
 		u32	priority;	/* Packet QoS priority. */
@@ -114,6 +126,7 @@ static inline struct swdev_flow *swdev_flow_alloc(unsigned action_count,
 #ifdef CONFIG_NET_SWITCHDEV
 
 int swdev_id_get(struct net_device *dev, struct netdev_phys_item_id *psid);
+swdev_features_t swdev_features_get(struct net_device *dev);
 int swdev_flow_insert(struct net_device *dev, const struct swdev_flow *flow);
 int swdev_flow_remove(struct net_device *dev, const struct swdev_flow *flow);
 
@@ -125,6 +138,11 @@ static inline int swdev_id_get(struct net_device *dev,
 	return -EOPNOTSUPP;
 }
 
+static inline swdev_features_t swdev_features_get(struct net_device *dev)
+{
+	return 0;
+}
+
 static inline int swdev_flow_insert(struct net_device *dev,
 				    const struct swdev_flow *flow)
 {
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index 90bc5e4..5348125 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -31,6 +31,22 @@ int swdev_id_get(struct net_device *dev, struct netdev_phys_item_id *psid)
 }
 EXPORT_SYMBOL(swdev_id_get);
 
+/**
+ *	swdev_features_get - Get list of features switch supports
+ *	@dev: port device
+ *
+ *	Get the features supported by the switch this port is part of.
+ */
+swdev_features_t swdev_features_get(struct net_device *dev)
+{
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	if (!ops->ndo_swdev_features_get)
+		return 0;
+	return ops->ndo_swdev_features_get(dev);
+}
+EXPORT_SYMBOL(swdev_features_get);
+
 static void print_flow_key_phy(const char *prefix,
 			       const struct swdev_flow_match_key *key)
 {
@@ -116,6 +132,15 @@ static void print_flow(const struct swdev_flow *flow, struct net_device *dev,
 	print_flow_actions(flow->action, flow->action_count);
 }
 
+static int check_match_type_features(struct net_device *dev,
+				     const struct swdev_flow *flow)
+{
+	if (flow->match.type == SW_FLOW_MATCH_TYPE_KEY &&
+	    !(swdev_features_get(dev) & SWDEV_F_FLOW_MATCH_KEY))
+		return -EOPNOTSUPP;
+	return 0;
+}
+
 /**
  *	swdev_flow_insert - Insert a flow into switch
  *	@dev: port device
@@ -126,10 +151,14 @@ static void print_flow(const struct swdev_flow *flow, struct net_device *dev,
 int swdev_flow_insert(struct net_device *dev, const struct swdev_flow *flow)
 {
 	const struct net_device_ops *ops = dev->netdev_ops;
+	int err;
 
 	print_flow(flow, dev, "insert");
 	if (!ops->ndo_swdev_flow_insert)
 		return -EOPNOTSUPP;
+	err = check_match_type_features(dev, flow);
+	if (err)
+		return err;
 	WARN_ON(!ops->ndo_swdev_id_get);
 	return ops->ndo_swdev_flow_insert(dev, flow);
 }
@@ -145,10 +174,14 @@ EXPORT_SYMBOL(swdev_flow_insert);
 int swdev_flow_remove(struct net_device *dev, const struct swdev_flow *flow)
 {
 	const struct net_device_ops *ops = dev->netdev_ops;
+	int err;
 
 	print_flow(flow, dev, "remove");
 	if (!ops->ndo_swdev_flow_remove)
 		return -EOPNOTSUPP;
+	err = check_match_type_features(dev, flow);
+	if (err)
+		return err;
 	WARN_ON(!ops->ndo_swdev_id_get);
 	return ops->ndo_swdev_flow_remove(dev, flow);
 }
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-19 13:49 [patch net-next v2 0/9] introduce rocker switch driver with hardware accelerated datapath api Jiri Pirko
                   ` (5 preceding siblings ...)
  2014-09-19 13:49 ` [patch net-next v2 7/9] switchdev: add swdev features Jiri Pirko
@ 2014-09-19 13:49 ` Jiri Pirko
  2014-09-19 15:25   ` Jamal Hadi Salim
  2014-09-19 13:49 ` [patch net-next v2 9/9] rocker: introduce rocker switch driver Jiri Pirko
       [not found] ` <1411134590-4586-1-git-send-email-jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
  8 siblings, 1 reply; 67+ messages in thread
From: Jiri Pirko @ 2014-09-19 13:49 UTC (permalink / raw)
  To: netdev
  Cc: davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet, jhs, sfeldma, f.fainelli, roopa,
	linville, dev, jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a,
	buytenh, aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye,
	simon.horman, alexander.h.duyck

This patch exposes the switchdev API using generic Netlink.
An example userspace utility is available here:
https://github.com/jpirko/switchdev

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 MAINTAINERS                       |   1 +
 include/uapi/linux/switchdev.h    | 113 ++++++++++
 net/switchdev/Kconfig             |  11 +
 net/switchdev/Makefile            |   1 +
 net/switchdev/switchdev_netlink.c | 441 ++++++++++++++++++++++++++++++++++++++
 5 files changed, 567 insertions(+)
 create mode 100644 include/uapi/linux/switchdev.h
 create mode 100644 net/switchdev/switchdev_netlink.c
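
The nested attribute layout documented in include/uapi/linux/switchdev.h can
be illustrated by laying out the attributes by hand. The sketch below builds
only the attribute payload (no nlmsghdr or genlmsghdr) for an ifindex plus a
flow whose match key holds a single Ethernet type. struct attr_hdr mirrors
the standard netlink attribute header (16-bit length, 16-bit type, 4-byte
alignment); the X_* constants are local copies of the enum values from the
patch, and all helper names are hypothetical. A message the kernel side would
accept may need further attributes (key mask, action list) that are omitted
here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ATTR_HDRLEN 4u

/* Local copies of enum values from the uapi header in this patch. */
enum {
	X_ATTR_IFINDEX = 1,
	X_ATTR_FLOW = 2,
	X_ATTR_FLOW_MATCH_KEY = 1,
	X_ATTR_FLOW_MATCH_KEY_ETH_TYPE = 6,
};

struct attr_hdr {		/* mirrors struct nlattr */
	uint16_t len;		/* header + payload, without padding */
	uint16_t type;
};

struct attr_buf {
	uint8_t data[256];
	size_t used;
};

static size_t attr_align(size_t len)
{
	return (len + 3) & ~(size_t)3;
}

/* Append one attribute, zero-padding it to a 4-byte boundary. */
static int attr_put(struct attr_buf *b, uint16_t type,
		    const void *payload, size_t plen)
{
	struct attr_hdr hdr;
	size_t total = attr_align(ATTR_HDRLEN + plen);

	if (b->used + total > sizeof(b->data))
		return -1;
	hdr.len = (uint16_t)(ATTR_HDRLEN + plen);
	hdr.type = type;
	memset(b->data + b->used, 0, total);
	memcpy(b->data + b->used, &hdr, sizeof(hdr));
	if (plen)
		memcpy(b->data + b->used + ATTR_HDRLEN, payload, plen);
	b->used += total;
	return 0;
}

/* Open a nest: emit an empty header and remember where it starts. */
static int attr_nest_start(struct attr_buf *b, uint16_t type, size_t *start)
{
	*start = b->used;
	return attr_put(b, type, NULL, 0);
}

/* Close a nest: its length covers everything appended since the start. */
static void attr_nest_end(struct attr_buf *b, size_t start)
{
	struct attr_hdr hdr;

	assert(start + ATTR_HDRLEN <= b->used);
	memcpy(&hdr, b->data + start, sizeof(hdr));
	hdr.len = (uint16_t)(b->used - start);
	memcpy(b->data + start, &hdr, sizeof(hdr));
}

/* eth_type_be is expected in network byte order already. */
static int build_flow_attrs(struct attr_buf *b, uint32_t ifindex,
			    uint16_t eth_type_be)
{
	size_t flow, key;

	b->used = 0;
	if (attr_put(b, X_ATTR_IFINDEX, &ifindex, sizeof(ifindex)) ||
	    attr_nest_start(b, X_ATTR_FLOW, &flow) ||
	    attr_nest_start(b, X_ATTR_FLOW_MATCH_KEY, &key) ||
	    attr_put(b, X_ATTR_FLOW_MATCH_KEY_ETH_TYPE,
		     &eth_type_be, sizeof(eth_type_be)))
		return -1;
	attr_nest_end(b, key);
	attr_nest_end(b, flow);
	return 0;
}
```

Closing the nests from the innermost outward keeps each parent length in step
with its children, including their padding. In practice a library such as
libnl or libmnl would do this bookkeeping.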

diff --git a/MAINTAINERS b/MAINTAINERS
index f1f26db..0fe2822 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8832,6 +8832,7 @@ L:	netdev@vger.kernel.org
 S:	Supported
 F:	net/switchdev/
 F:	include/net/switchdev.h
+F:	include/uapi/linux/switchdev.h
 
 SYNOPSYS ARC ARCHITECTURE
 M:	Vineet Gupta <vgupta@synopsys.com>
diff --git a/include/uapi/linux/switchdev.h b/include/uapi/linux/switchdev.h
new file mode 100644
index 0000000..f945b57
--- /dev/null
+++ b/include/uapi/linux/switchdev.h
@@ -0,0 +1,113 @@
+/*
+ * include/uapi/linux/switchdev.h - Netlink interface to Switch device
+ * Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#ifndef _UAPI_LINUX_SWITCHDEV_H_
+#define _UAPI_LINUX_SWITCHDEV_H_
+
+enum {
+	SWDEV_CMD_NOOP,
+	SWDEV_CMD_FLOW_INSERT,
+	SWDEV_CMD_FLOW_REMOVE,
+};
+
+enum {
+	SWDEV_ATTR_UNSPEC,
+	SWDEV_ATTR_IFINDEX,			/* u32 */
+	SWDEV_ATTR_FLOW,			/* nest */
+
+	__SWDEV_ATTR_MAX,
+	SWDEV_ATTR_MAX = (__SWDEV_ATTR_MAX - 1),
+};
+
+enum {
+	SWDEV_ATTR_FLOW_MATCH_KEY_UNSPEC,
+	SWDEV_ATTR_FLOW_MATCH_KEY_PHY_PRIORITY,		/* u32 */
+	SWDEV_ATTR_FLOW_MATCH_KEY_PHY_IN_PORT,		/* u32 (ifindex) */
+	SWDEV_ATTR_FLOW_MATCH_KEY_ETH_SRC,		/* ETH_ALEN */
+	SWDEV_ATTR_FLOW_MATCH_KEY_ETH_DST,		/* ETH_ALEN */
+	SWDEV_ATTR_FLOW_MATCH_KEY_ETH_TCI,		/* be16 */
+	SWDEV_ATTR_FLOW_MATCH_KEY_ETH_TYPE,		/* be16 */
+	SWDEV_ATTR_FLOW_MATCH_KEY_IP_PROTO,		/* u8 */
+	SWDEV_ATTR_FLOW_MATCH_KEY_IP_TOS,		/* u8 */
+	SWDEV_ATTR_FLOW_MATCH_KEY_IP_TTL,		/* u8 */
+	SWDEV_ATTR_FLOW_MATCH_KEY_IP_FRAG,		/* u8 */
+	SWDEV_ATTR_FLOW_MATCH_KEY_TP_SRC,		/* be16 */
+	SWDEV_ATTR_FLOW_MATCH_KEY_TP_DST,		/* be16 */
+	SWDEV_ATTR_FLOW_MATCH_KEY_TP_FLAGS,		/* be16 */
+	SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ADDR_SRC,	/* be32 */
+	SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ADDR_DST,	/* be32 */
+	SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ARP_SHA,		/* ETH_ALEN */
+	SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ARP_THA,		/* ETH_ALEN */
+	SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ADDR_SRC,	/* struct in6_addr */
+	SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ADDR_DST,	/* struct in6_addr */
+	SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_LABEL,		/* be32 */
+	SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ND_TARGET,	/* struct in6_addr */
+	SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ND_SLL,		/* ETH_ALEN */
+	SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ND_TLL,		/* ETH_ALEN */
+
+	__SWDEV_ATTR_FLOW_MATCH_KEY_MAX,
+	SWDEV_ATTR_FLOW_MATCH_KEY_MAX = (__SWDEV_ATTR_FLOW_MATCH_KEY_MAX - 1),
+};
+
+enum {
+	SWDEV_FLOW_ACTION_TYPE_OUTPUT,
+	SWDEV_FLOW_ACTION_TYPE_VLAN_PUSH,
+	SWDEV_FLOW_ACTION_TYPE_VLAN_POP,
+};
+
+enum {
+	SWDEV_ATTR_FLOW_ACTION_UNSPEC,
+	SWDEV_ATTR_FLOW_ACTION_TYPE,		/* u32 */
+	SWDEV_ATTR_FLOW_ACTION_OUT_PORT,	/* u32 (ifindex) */
+	SWDEV_ATTR_FLOW_ACTION_VLAN_PROTO,	/* be16 */
+	SWDEV_ATTR_FLOW_ACTION_VLAN_TCI,	/* u16 */
+
+	__SWDEV_ATTR_FLOW_ACTION_MAX,
+	SWDEV_ATTR_FLOW_ACTION_MAX = (__SWDEV_ATTR_FLOW_ACTION_MAX - 1),
+};
+
+enum {
+	SWDEV_ATTR_FLOW_ITEM_UNSPEC,
+	SWDEV_ATTR_FLOW_ITEM_ACTION,		/* nest */
+
+	__SWDEV_ATTR_FLOW_ITEM_MAX,
+	SWDEV_ATTR_FLOW_ITEM_MAX = (__SWDEV_ATTR_FLOW_ITEM_MAX - 1),
+};
+
+enum {
+	SWDEV_ATTR_FLOW_UNSPEC,
+	SWDEV_ATTR_FLOW_MATCH_KEY,		/* nest */
+	SWDEV_ATTR_FLOW_MATCH_KEY_MASK,		/* nest */
+	SWDEV_ATTR_FLOW_LIST_ACTION,		/* nest */
+
+	__SWDEV_ATTR_FLOW_MAX,
+	SWDEV_ATTR_FLOW_MAX = (__SWDEV_ATTR_FLOW_MAX - 1),
+};
+
+/* Nested layout of flow insert/remove command message:
+ *
+ *	[SWDEV_ATTR_IFINDEX]
+ *	[SWDEV_ATTR_FLOW]
+ *		[SWDEV_ATTR_FLOW_MATCH_KEY]
+ *			[SWDEV_ATTR_FLOW_MATCH_KEY_*], ...
+ *		[SWDEV_ATTR_FLOW_MATCH_KEY_MASK]
+ *			[SWDEV_ATTR_FLOW_MATCH_KEY_*], ...
+ *		[SWDEV_ATTR_FLOW_LIST_ACTION]
+ *			[SWDEV_ATTR_FLOW_ITEM_ACTION]
+ *				[SWDEV_ATTR_FLOW_ACTION_*], ...
+ *			[SWDEV_ATTR_FLOW_ITEM_ACTION]
+ *				[SWDEV_ATTR_FLOW_ACTION_*], ...
+ *			...
+ */
+
+#define SWITCHDEV_GENL_NAME "switchdev"
+#define SWITCHDEV_GENL_VERSION 0x1
+
+#endif /* _UAPI_LINUX_SWITCHDEV_H_ */
diff --git a/net/switchdev/Kconfig b/net/switchdev/Kconfig
index 20e8ed2..4470d6e 100644
--- a/net/switchdev/Kconfig
+++ b/net/switchdev/Kconfig
@@ -7,3 +7,14 @@ config NET_SWITCHDEV
 	depends on INET
 	---help---
 	  This module provides support for hardware switch chips.
+
+config NET_SWITCHDEV_NETLINK
+	tristate "Netlink interface to Switch device"
+	depends on NET_SWITCHDEV
+	default m
+	---help---
+	  This module provides a Generic Netlink interface to hardware switch
+	  chips.
+
+	  To compile this code as a module, choose M here: the
+	  module will be called switchdev_netlink.
diff --git a/net/switchdev/Makefile b/net/switchdev/Makefile
index 5ed63ed..0695b53 100644
--- a/net/switchdev/Makefile
+++ b/net/switchdev/Makefile
@@ -3,3 +3,4 @@
 #
 
 obj-$(CONFIG_NET_SWITCHDEV) += switchdev.o
+obj-$(CONFIG_NET_SWITCHDEV_NETLINK) += switchdev_netlink.o
diff --git a/net/switchdev/switchdev_netlink.c b/net/switchdev/switchdev_netlink.c
new file mode 100644
index 0000000..d97db8b
--- /dev/null
+++ b/net/switchdev/switchdev_netlink.c
@@ -0,0 +1,441 @@
+/*
+ * net/switchdev/switchdev_netlink.c - Netlink interface to Switch device
+ * Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <net/switchdev.h>
+#include <net/netlink.h>
+#include <net/genetlink.h>
+#include <net/netlink.h>
+#include <uapi/linux/switchdev.h>
+
+static struct genl_family swdev_nl_family = {
+	.id		= GENL_ID_GENERATE,
+	.name		= SWITCHDEV_GENL_NAME,
+	.version	= SWITCHDEV_GENL_VERSION,
+	.maxattr	= SWDEV_ATTR_MAX,
+	.netnsok	= true,
+};
+
+static const struct nla_policy swdev_nl_flow_policy[SWDEV_ATTR_FLOW_MAX + 1] = {
+	[SWDEV_ATTR_FLOW_UNSPEC]		= { .type = NLA_UNSPEC, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY]		= { .type = NLA_NESTED },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_MASK]	= { .type = NLA_NESTED },
+	[SWDEV_ATTR_FLOW_LIST_ACTION]		= { .type = NLA_NESTED },
+};
+
+#define __IN6_ALEN sizeof(struct in6_addr)
+
+static const struct nla_policy
+swdev_nl_flow_match_key_policy[SWDEV_ATTR_FLOW_MATCH_KEY_MAX + 1] = {
+	[SWDEV_ATTR_FLOW_MATCH_KEY_UNSPEC]		= { .type = NLA_UNSPEC, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_PHY_PRIORITY]	= { .type = NLA_U32, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_PHY_IN_PORT]	= { .type = NLA_U32, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_ETH_SRC]		= { .len  = ETH_ALEN, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_ETH_DST]		= { .len  = ETH_ALEN, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_ETH_TCI]		= { .type = NLA_U16, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_ETH_TYPE]		= { .type = NLA_U16, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_IP_PROTO]		= { .type = NLA_U8, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_IP_TOS]		= { .type = NLA_U8, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_IP_TTL]		= { .type = NLA_U8, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_IP_FRAG]		= { .type = NLA_U8, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_TP_SRC]		= { .type = NLA_U16, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_TP_DST]		= { .type = NLA_U16, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_TP_FLAGS]		= { .type = NLA_U16, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ADDR_SRC]	= { .type = NLA_U32, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ADDR_DST]	= { .type = NLA_U32, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ARP_SHA]	= { .len  = ETH_ALEN, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ARP_THA]	= { .len  = ETH_ALEN, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ADDR_SRC]	= { .len  = __IN6_ALEN, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ADDR_DST]	= { .len  = __IN6_ALEN, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_LABEL]	= { .type = NLA_U32, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ND_TARGET]	= { .len  = __IN6_ALEN, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ND_SLL]	= { .len  = ETH_ALEN },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ND_TLL]	= { .len  = ETH_ALEN },
+};
+
+static const struct nla_policy
+swdev_nl_flow_action_policy[SWDEV_ATTR_FLOW_ACTION_MAX + 1] = {
+	[SWDEV_ATTR_FLOW_ACTION_UNSPEC]		= { .type = NLA_UNSPEC, },
+	[SWDEV_ATTR_FLOW_ACTION_TYPE]		= { .type = NLA_U32, },
+	[SWDEV_ATTR_FLOW_ACTION_VLAN_PROTO]	= { .type = NLA_U16, },
+	[SWDEV_ATTR_FLOW_ACTION_VLAN_TCI]	= { .type = NLA_U16, },
+};
+
+static int swdev_nl_cmd_noop(struct sk_buff *skb, struct genl_info *info)
+{
+	struct sk_buff *msg;
+	void *hdr;
+	int err;
+
+	msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!msg)
+		return -ENOMEM;
+
+	hdr = genlmsg_put(msg, info->snd_portid, info->snd_seq,
+			  &swdev_nl_family, 0, SWDEV_CMD_NOOP);
+	if (!hdr) {
+		err = -EMSGSIZE;
+		goto err_msg_put;
+	}
+
+	genlmsg_end(msg, hdr);
+
+	return genlmsg_unicast(genl_info_net(info), msg, info->snd_portid);
+
+err_msg_put:
+	nlmsg_free(msg);
+
+	return err;
+}
+
+static int swdev_nl_parse_flow_match_key(struct nlattr *key_attr,
+					 struct swdev_flow_match_key *key)
+{
+	struct nlattr *attrs[SWDEV_ATTR_FLOW_MATCH_KEY_MAX + 1];
+	int err;
+
+	err = nla_parse_nested(attrs, SWDEV_ATTR_FLOW_MATCH_KEY_MAX,
+			       key_attr, swdev_nl_flow_match_key_policy);
+	if (err)
+		return err;
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_PHY_PRIORITY])
+		key->phy.priority =
+			nla_get_u32(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_PHY_PRIORITY]);
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_PHY_IN_PORT])
+		key->phy.in_port_ifindex =
+			nla_get_u32(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_PHY_IN_PORT]);
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_ETH_SRC])
+		ether_addr_copy(key->eth.src,
+				nla_data(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_ETH_SRC]));
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_ETH_DST])
+		ether_addr_copy(key->eth.dst,
+				nla_data(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_ETH_DST]));
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_ETH_TCI])
+		key->eth.tci =
+			nla_get_be16(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_ETH_TCI]);
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_ETH_TYPE])
+		key->eth.type =
+			nla_get_be16(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_ETH_TYPE]);
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IP_PROTO])
+		key->ip.proto =
+			nla_get_u8(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IP_PROTO]);
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IP_TOS])
+		key->ip.tos =
+			nla_get_u8(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IP_TOS]);
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IP_TTL])
+		key->ip.ttl =
+			nla_get_u8(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IP_TTL]);
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IP_FRAG])
+		key->ip.frag =
+			nla_get_u8(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IP_FRAG]);
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_TP_SRC])
+		key->tp.src =
+			nla_get_be16(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_TP_SRC]);
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_TP_DST])
+		key->tp.dst =
+			nla_get_be16(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_TP_DST]);
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_TP_FLAGS])
+		key->tp.flags =
+			nla_get_be16(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_TP_FLAGS]);
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ADDR_SRC])
+		key->ipv4.addr.src =
+			nla_get_be32(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ADDR_SRC]);
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ADDR_DST])
+		key->ipv4.addr.dst =
+			nla_get_be32(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ADDR_DST]);
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ARP_SHA])
+		ether_addr_copy(key->ipv4.arp.sha,
+				nla_data(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ARP_SHA]));
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ARP_THA])
+		ether_addr_copy(key->ipv4.arp.tha,
+				nla_data(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ARP_THA]));
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ADDR_SRC])
+		memcpy(&key->ipv6.addr.src,
+		       nla_data(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ADDR_SRC]),
+		       sizeof(key->ipv6.addr.src));
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ADDR_DST])
+		memcpy(&key->ipv6.addr.dst,
+		       nla_data(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ADDR_DST]),
+		       sizeof(key->ipv6.addr.dst));
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_LABEL])
+		key->ipv6.label =
+			nla_get_be32(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_LABEL]);
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ND_TARGET])
+		memcpy(&key->ipv6.nd.target,
+		       nla_data(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ND_TARGET]),
+		       sizeof(key->ipv6.nd.target));
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ND_SLL])
+		ether_addr_copy(key->ipv6.nd.sll,
+				nla_data(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ND_SLL]));
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ND_TLL])
+		ether_addr_copy(key->ipv6.nd.tll,
+				nla_data(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ND_TLL]));
+
+	return 0;
+}
+
+static int swdev_nl_parse_flow_action(struct nlattr *action_attr,
+				      struct swdev_flow_action *flow_action)
+{
+	struct nlattr *attrs[SWDEV_ATTR_FLOW_ACTION_MAX + 1];
+	int err;
+
+	err = nla_parse_nested(attrs, SWDEV_ATTR_FLOW_ACTION_MAX,
+			       action_attr, swdev_nl_flow_action_policy);
+	if (err)
+		return err;
+
+	if (!attrs[SWDEV_ATTR_FLOW_ACTION_TYPE])
+		return -EINVAL;
+
+	switch (nla_get_u32(attrs[SWDEV_ATTR_FLOW_ACTION_TYPE])) {
+	case SWDEV_FLOW_ACTION_TYPE_OUTPUT:
+		if (!attrs[SWDEV_ATTR_FLOW_ACTION_OUT_PORT])
+			return -EINVAL;
+		flow_action->out_port_ifindex =
+			nla_get_u32(attrs[SWDEV_ATTR_FLOW_ACTION_OUT_PORT]);
+		flow_action->type = SW_FLOW_ACTION_TYPE_OUTPUT;
+		break;
+	case SWDEV_FLOW_ACTION_TYPE_VLAN_PUSH:
+		if (!attrs[SWDEV_ATTR_FLOW_ACTION_VLAN_PROTO] ||
+		    !attrs[SWDEV_ATTR_FLOW_ACTION_VLAN_TCI])
+			return -EINVAL;
+		flow_action->vlan.proto =
+			nla_get_be16(attrs[SWDEV_ATTR_FLOW_ACTION_VLAN_PROTO]);
+		flow_action->vlan.tci =
+			nla_get_u16(attrs[SWDEV_ATTR_FLOW_ACTION_VLAN_TCI]);
+		flow_action->type = SW_FLOW_ACTION_TYPE_VLAN_PUSH;
+		break;
+	case SWDEV_FLOW_ACTION_TYPE_VLAN_POP:
+		flow_action->type = SW_FLOW_ACTION_TYPE_VLAN_POP;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int swdev_nl_parse_flow_actions(struct nlattr *actions_attr,
+				       struct swdev_flow_action *action)
+{
+	struct swdev_flow_action *cur;
+	struct nlattr *action_attr;
+	int rem;
+	int err;
+
+	cur = action;
+	nla_for_each_nested(action_attr, actions_attr, rem) {
+		err = swdev_nl_parse_flow_action(action_attr, cur);
+		if (err)
+			return err;
+		cur++;
+	}
+	return 0;
+}
+
+static int swdev_nl_parse_flow_action_count(struct nlattr *actions_attr,
+					    unsigned *p_action_count)
+{
+	struct nlattr *action_attr;
+	int rem;
+	int count = 0;
+
+	nla_for_each_nested(action_attr, actions_attr, rem) {
+		if (nla_type(action_attr) != SWDEV_ATTR_FLOW_ITEM_ACTION)
+			return -EINVAL;
+		count++;
+	}
+	*p_action_count = count;
+	return 0;
+}
+
+static void swdev_nl_free_flow(struct swdev_flow *flow)
+{
+	kfree(flow);
+}
+
+static int swdev_nl_parse_flow(struct nlattr *flow_attr, struct swdev_flow **p_flow)
+{
+	struct swdev_flow *flow;
+	struct nlattr *attrs[SWDEV_ATTR_FLOW_MAX + 1];
+	unsigned action_count;
+	int err;
+
+	err = nla_parse_nested(attrs, SWDEV_ATTR_FLOW_MAX,
+			       flow_attr, swdev_nl_flow_policy);
+	if (err)
+		return err;
+
+	if (!attrs[SWDEV_ATTR_FLOW_MATCH_KEY] ||
+	    !attrs[SWDEV_ATTR_FLOW_MATCH_KEY_MASK] ||
+	    !attrs[SWDEV_ATTR_FLOW_LIST_ACTION])
+		return -EINVAL;
+
+	err = swdev_nl_parse_flow_action_count(attrs[SWDEV_ATTR_FLOW_LIST_ACTION],
+					       &action_count);
+	if (err)
+		return err;
+	flow = swdev_flow_alloc(action_count, GFP_KERNEL);
+	if (!flow)
+		return -ENOMEM;
+
+	err = swdev_nl_parse_flow_match_key(attrs[SWDEV_ATTR_FLOW_MATCH_KEY],
+					    &flow->match.key);
+	if (err)
+		goto out;
+
+	err = swdev_nl_parse_flow_match_key(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_MASK],
+					    &flow->match.key_mask);
+	if (err)
+		goto out;
+
+	err = swdev_nl_parse_flow_actions(attrs[SWDEV_ATTR_FLOW_LIST_ACTION],
+					  flow->action);
+	if (err)
+		goto out;
+
+	*p_flow = flow;
+	return 0;
+
+out:
+	kfree(flow);
+	return err;
+}
+
+static struct net_device *swdev_nl_dev_get(struct genl_info *info)
+{
+	struct net *net = genl_info_net(info);
+	int ifindex;
+
+	if (!info->attrs[SWDEV_ATTR_IFINDEX])
+		return NULL;
+
+	ifindex = nla_get_u32(info->attrs[SWDEV_ATTR_IFINDEX]);
+	return dev_get_by_index(net, ifindex);
+}
+
+static void swdev_nl_dev_put(struct net_device *dev)
+{
+	dev_put(dev);
+}
+
+static int swdev_nl_cmd_flow_insert(struct sk_buff *skb, struct genl_info *info)
+{
+	struct net_device *dev;
+	struct swdev_flow *flow;
+	int err;
+
+	if (!info->attrs[SWDEV_ATTR_FLOW])
+		return -EINVAL;
+
+	dev = swdev_nl_dev_get(info);
+	if (!dev)
+		return -EINVAL;
+
+	err = swdev_nl_parse_flow(info->attrs[SWDEV_ATTR_FLOW], &flow);
+	if (err)
+		goto dev_put;
+
+	err = swdev_flow_insert(dev, flow);
+	swdev_nl_free_flow(flow);
+dev_put:
+	swdev_nl_dev_put(dev);
+	return err;
+}
+
+static int swdev_nl_cmd_flow_remove(struct sk_buff *skb, struct genl_info *info)
+{
+	struct net_device *dev;
+	struct swdev_flow *flow;
+	int err;
+
+	if (!info->attrs[SWDEV_ATTR_FLOW])
+		return -EINVAL;
+
+	dev = swdev_nl_dev_get(info);
+	if (!dev)
+		return -EINVAL;
+
+	err = swdev_nl_parse_flow(info->attrs[SWDEV_ATTR_FLOW], &flow);
+	if (err)
+		goto dev_put;
+
+	err = swdev_flow_remove(dev, flow);
+	swdev_nl_free_flow(flow);
+dev_put:
+	swdev_nl_dev_put(dev);
+	return err;
+}
+
+static const struct genl_ops swdev_nl_ops[] = {
+	{
+		.cmd = SWDEV_CMD_NOOP,
+		.doit = swdev_nl_cmd_noop,
+	},
+	{
+		.cmd = SWDEV_CMD_FLOW_INSERT,
+		.doit = swdev_nl_cmd_flow_insert,
+		.policy = swdev_nl_flow_policy,
+		.flags = GENL_ADMIN_PERM,
+	},
+	{
+		.cmd = SWDEV_CMD_FLOW_REMOVE,
+		.doit = swdev_nl_cmd_flow_remove,
+		.policy = swdev_nl_flow_policy,
+		.flags = GENL_ADMIN_PERM,
+	},
+};
+
+static int __init swdev_nl_module_init(void)
+{
+	return genl_register_family_with_ops(&swdev_nl_family, swdev_nl_ops);
+}
+
+static void __exit swdev_nl_module_fini(void)
+{
+	genl_unregister_family(&swdev_nl_family);
+}
+
+module_init(swdev_nl_module_init);
+module_exit(swdev_nl_module_fini);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Jiri Pirko <jiri@resnulli.us>");
+MODULE_DESCRIPTION("Netlink interface to Switch device");
+MODULE_ALIAS_GENL_FAMILY(SWITCHDEV_GENL_NAME);
-- 
1.9.3

* [patch net-next v2 9/9] rocker: introduce rocker switch driver
  2014-09-19 13:49 [patch net-next v2 0/9] introduce rocker switch driver with hardware accelerated datapath api Jiri Pirko
                   ` (6 preceding siblings ...)
  2014-09-19 13:49 ` [patch net-next v2 8/9] switchdev: introduce Netlink API Jiri Pirko
@ 2014-09-19 13:49 ` Jiri Pirko
       [not found] ` <1411134590-4586-1-git-send-email-jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
  8 siblings, 0 replies; 67+ messages in thread
From: Jiri Pirko @ 2014-09-19 13:49 UTC (permalink / raw)
  To: netdev
  Cc: davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet, jhs, sfeldma, f.fainelli, roopa,
	linville, dev, jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a,
	buytenh, aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye,
	simon.horman, alexander.h.duyck

This patch introduces the first driver to benefit from the switchdev
infrastructure and to implement the newly introduced switch ndos. This is
a driver for an emulated switch chip implemented in qemu:
https://github.com/sfeldma/qemu-rocker/

This patch is a result of joint work with Scott Feldman.

Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 MAINTAINERS                          |    6 +
 drivers/net/ethernet/Kconfig         |    1 +
 drivers/net/ethernet/Makefile        |    1 +
 drivers/net/ethernet/rocker/Kconfig  |   29 +
 drivers/net/ethernet/rocker/Makefile |    5 +
 drivers/net/ethernet/rocker/rocker.c | 3561 ++++++++++++++++++++++++++++++++++
 drivers/net/ethernet/rocker/rocker.h |  465 +++++
 7 files changed, 4068 insertions(+)
 create mode 100644 drivers/net/ethernet/rocker/Kconfig
 create mode 100644 drivers/net/ethernet/rocker/Makefile
 create mode 100644 drivers/net/ethernet/rocker/rocker.c
 create mode 100644 drivers/net/ethernet/rocker/rocker.h
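
The TLV helpers in rocker.c pad both the header and each attribute to an
8-byte boundary. As a rough illustration of that size arithmetic, here is a
minimal userspace sketch. It is not the driver code: struct tlv_hdr,
tlv_put() and tlv_count() are hypothetical names, and the header layout
(u32 type followed by u16 len) is an assumption, since rocker.h is not
shown in this part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed header layout; the real struct rocker_tlv is defined in rocker.h. */
struct tlv_hdr {
	uint32_t type;
	uint16_t len;	/* header + payload, excluding trailing padding */
};

#define TLV_ALIGNTO ((size_t)8)
#define TLV_ALIGN(len) (((len) + TLV_ALIGNTO - 1) & ~(TLV_ALIGNTO - 1))
#define TLV_HDRLEN TLV_ALIGN(sizeof(struct tlv_hdr))

/* Bytes covered by tlv->len: aligned header plus unpadded payload. */
size_t tlv_attr_size(size_t payload)
{
	return TLV_HDRLEN + payload;
}

/* Bytes the attribute occupies in the buffer, padding included. */
size_t tlv_total_size(size_t payload)
{
	return TLV_ALIGN(tlv_attr_size(payload));
}

/* Trailing padding that follows the payload. */
size_t tlv_padlen(size_t payload)
{
	return tlv_total_size(payload) - tlv_attr_size(payload);
}

/* Append one attribute at offset *used; returns 0 on success, -1 if it
 * does not fit in the buffer or in the 16-bit length field.
 */
int tlv_put(unsigned char *buf, size_t buf_size, size_t *used,
	    uint32_t type, const void *data, size_t len)
{
	size_t total = tlv_total_size(len);
	struct tlv_hdr hdr;

	assert(*used <= buf_size);
	if (tlv_attr_size(len) > UINT16_MAX || buf_size - *used < total)
		return -1;

	/* Zero the whole slot first so all padding bytes are defined. */
	memset(buf + *used, 0, total);
	memset(&hdr, 0, sizeof(hdr));
	hdr.type = type;
	hdr.len = (uint16_t)tlv_attr_size(len);
	memcpy(buf + *used, &hdr, sizeof(hdr));
	if (len)
		memcpy(buf + *used + TLV_HDRLEN, data, len);
	*used += total;
	return 0;
}

/* Walk the buffer the way a TLV iterator would and count the attributes. */
int tlv_count(const unsigned char *buf, size_t used)
{
	size_t pos = 0;
	int n = 0;

	while (used - pos >= TLV_HDRLEN) {
		struct tlv_hdr hdr;
		size_t step;

		memcpy(&hdr, buf + pos, sizeof(hdr));
		if (hdr.len < TLV_HDRLEN || hdr.len > used - pos)
			break;	/* malformed or truncated attribute */
		n++;
		step = TLV_ALIGN((size_t)hdr.len);
		pos += step < used - pos ? step : used - pos;
	}
	return n;
}
```

Under these assumptions a one-byte payload occupies 16 bytes in the
buffer: 8 for the aligned header, 1 for the payload and 7 of padding, while
the stored length is 9.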

diff --git a/MAINTAINERS b/MAINTAINERS
index 0fe2822..703f482 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7656,6 +7656,12 @@ F:	drivers/hid/hid-roccat*
 F:	include/linux/hid-roccat*
 F:	Documentation/ABI/*/sysfs-driver-hid-roccat*
 
+ROCKER DRIVER
+M:	Jiri Pirko <jiri@resnulli.us>
+L:	netdev@vger.kernel.org
+S:	Supported
+F:	drivers/net/ethernet/rocker/
+
 ROCKETPORT DRIVER
 P:	Comtrol Corp.
 W:	http://www.comtrol.com
diff --git a/drivers/net/ethernet/Kconfig b/drivers/net/ethernet/Kconfig
index dc7406c..61c9cc4 100644
--- a/drivers/net/ethernet/Kconfig
+++ b/drivers/net/ethernet/Kconfig
@@ -153,6 +153,7 @@ source "drivers/net/ethernet/qlogic/Kconfig"
 source "drivers/net/ethernet/realtek/Kconfig"
 source "drivers/net/ethernet/renesas/Kconfig"
 source "drivers/net/ethernet/rdc/Kconfig"
+source "drivers/net/ethernet/rocker/Kconfig"
 
 config S6GMAC
 	tristate "S6105 GMAC ethernet support"
diff --git a/drivers/net/ethernet/Makefile b/drivers/net/ethernet/Makefile
index 224a018..51ff723 100644
--- a/drivers/net/ethernet/Makefile
+++ b/drivers/net/ethernet/Makefile
@@ -63,6 +63,7 @@ obj-$(CONFIG_NET_VENDOR_QLOGIC) += qlogic/
 obj-$(CONFIG_NET_VENDOR_REALTEK) += realtek/
 obj-$(CONFIG_SH_ETH) += renesas/
 obj-$(CONFIG_NET_VENDOR_RDC) += rdc/
+obj-$(CONFIG_NET_VENDOR_ROCKER) += rocker/
 obj-$(CONFIG_S6GMAC) += s6gmac.o
 obj-$(CONFIG_NET_VENDOR_SAMSUNG) += samsung/
 obj-$(CONFIG_NET_VENDOR_SEEQ) += seeq/
diff --git a/drivers/net/ethernet/rocker/Kconfig b/drivers/net/ethernet/rocker/Kconfig
new file mode 100644
index 0000000..0441932
--- /dev/null
+++ b/drivers/net/ethernet/rocker/Kconfig
@@ -0,0 +1,29 @@
+#
+# Rocker device configuration
+#
+
+config NET_VENDOR_ROCKER
+	bool "Rocker devices"
+	default y
+	---help---
+	  If you have a network (Ethernet) card belonging to this class, say Y
+	  and read the Ethernet-HOWTO, available from
+	  <http://www.tldp.org/docs.html#howto>.
+
+	  Note that the answer to this question doesn't directly affect the
+	  kernel: saying N will just cause the configurator to skip all
+	  the questions about Rocker devices. If you say Y, you will be asked for
+	  your specific card in the following questions.
+
+if NET_VENDOR_ROCKER
+
+config ROCKER
+	tristate "Rocker switch driver (EXPERIMENTAL)"
+	depends on PCI && NET_SWITCHDEV
+	---help---
+	  This driver supports the Rocker switch device.
+
+	  To compile this driver as a module, choose M here: the
+	  module will be called rocker.
+
+endif # NET_VENDOR_ROCKER
diff --git a/drivers/net/ethernet/rocker/Makefile b/drivers/net/ethernet/rocker/Makefile
new file mode 100644
index 0000000..f85fb12
--- /dev/null
+++ b/drivers/net/ethernet/rocker/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for the Rocker network device drivers.
+#
+
+obj-$(CONFIG_ROCKER) += rocker.o
diff --git a/drivers/net/ethernet/rocker/rocker.c b/drivers/net/ethernet/rocker/rocker.c
new file mode 100644
index 0000000..c8984bb
--- /dev/null
+++ b/drivers/net/ethernet/rocker/rocker.c
@@ -0,0 +1,3561 @@
+/*
+ * drivers/net/ethernet/rocker/rocker.c - Rocker switch device driver
+ * Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us>
+ * Copyright (c) 2014 Scott Feldman <sfeldma@cumulusnetworks.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/interrupt.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/spinlock.h>
+#include <linux/hashtable.h>
+#include <linux/crc32.h>
+#include <linux/sort.h>
+#include <linux/random.h>
+#include <linux/netdevice.h>
+#include <linux/skbuff.h>
+#include <linux/socket.h>
+#include <linux/etherdevice.h>
+#include <linux/ethtool.h>
+#include <linux/if_ether.h>
+#include <linux/if_vlan.h>
+#include <net/switchdev.h>
+#include <net/rtnetlink.h>
+#include <asm-generic/io-64-nonatomic-lo-hi.h>
+#include <generated/utsrelease.h>
+
+#include "rocker.h"
+
+static const char rocker_driver_name[] = "rocker";
+
+static const struct pci_device_id rocker_pci_id_table[] = {
+	{PCI_VDEVICE(REDHAT, PCI_DEVICE_ID_REDHAT_ROCKER), 0},
+	{0, }
+};
+
+struct rocker_flow_tbl_key {
+	u32 priority;
+	enum rocker_of_dpa_table_id tbl_id;
+	union {
+		struct {
+			u32 in_lport;
+			u32 in_lport_mask;
+			enum rocker_of_dpa_table_id goto_tbl;
+		} ig_port;
+		struct {
+			u32 in_lport;
+			__be16 vlan_id;
+			__be16 vlan_id_mask;
+			enum rocker_of_dpa_table_id goto_tbl;
+			bool untagged;
+			__be16 new_vlan_id;
+		} vlan;
+		struct {
+			/* TODO */
+		} term_mac;
+		struct {
+			u8 eth_dst[ETH_ALEN];
+			u8 eth_dst_mask[ETH_ALEN];
+			int has_eth_dst;
+			int has_eth_dst_mask;
+			__be16 vlan_id;
+			u32 tunnel_id;
+			enum rocker_of_dpa_table_id goto_tbl;
+			u32 group_id;
+		} bridge;
+		struct {
+			u32 in_lport;
+			u32 in_lport_mask;
+			u8 eth_src[ETH_ALEN];
+			u8 eth_src_mask[ETH_ALEN];
+			u8 eth_dst[ETH_ALEN];
+			u8 eth_dst_mask[ETH_ALEN];
+			__be16 eth_type;
+			__be16 vlan_id;
+			__be16 vlan_id_mask;
+			u8 ip_proto;
+			u8 ip_proto_mask;
+			u8 ip_tos;
+			u8 ip_tos_mask;
+			u32 group_id;
+		} acl;
+	};
+};
+
+struct rocker_flow_tbl_entry {
+	struct hlist_node entry;
+	u32 ref_count;
+	u64 cookie;
+	struct rocker_flow_tbl_key key;
+	u32 key_crc32;
+};
+
+struct rocker_group_tbl_entry {
+	struct hlist_node entry;
+	u32 ref_count;
+	u32 group_id;
+	u16 group_count;
+	u32 *group_ids;
+	union {
+		struct {
+			u8 pop_vlan;
+		} l2_interface;
+	};
+};
+
+struct rocker_desc_info {
+	char *data; /* mapped */
+	size_t data_size;
+	size_t tlv_size;
+	struct rocker_desc *desc;
+	DEFINE_DMA_UNMAP_ADDR(mapaddr);
+};
+
+struct rocker_dma_ring_info {
+	size_t size;
+	u32 head;
+	u32 tail;
+	struct rocker_desc *desc; /* mapped */
+	dma_addr_t mapaddr;
+	struct rocker_desc_info *desc_info;
+	unsigned int type;
+};
+
+struct rocker;
+
+struct rocker_port {
+	struct net_device *dev;
+	unsigned int prev_flags;
+	struct rocker *rocker;
+	unsigned port_number;
+	struct napi_struct napi_tx;
+	struct napi_struct napi_rx;
+	struct rocker_dma_ring_info tx_ring;
+	struct rocker_dma_ring_info rx_ring;
+};
+
+struct rocker {
+	struct pci_dev *pdev;
+	u8 __iomem *hw_addr;
+	struct msix_entry *msix_entries;
+	unsigned port_count;
+	struct rocker_port **ports;
+	struct {
+		u64 id;
+	} hw;
+	spinlock_t cmd_ring_lock;
+	struct rocker_dma_ring_info cmd_ring;
+	struct rocker_dma_ring_info event_ring;
+	DECLARE_HASHTABLE(flow_tbl, 16);
+	spinlock_t flow_tbl_lock;
+	u64 flow_tbl_next_cookie;
+	DECLARE_HASHTABLE(group_tbl, 16);
+	spinlock_t group_tbl_lock;
+	u16 group_index_next;
+};
+
+struct rocker_wait {
+	wait_queue_head_t wait;
+	bool done;
+	bool nowait;
+};
+
+static const u8 zero_mac[ETH_ALEN] = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 };
+static const u8 ff_mac[ETH_ALEN] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };
+static const u8 lldp_mac[ETH_ALEN] = { 0x01, 0x80, 0xc2, 0x00, 0x00, 0x0e };
+
+/* Rocker priority levels for flow table entries.  Higher
+ * priority match takes precedence over lower priority match.
+ */
+
+enum {
+	ROCKER_PRIORITY_UNKNOWN = 0,
+	ROCKER_PRIORITY_IG_PORT = 1,
+	ROCKER_PRIORITY_VLAN = 1,
+	ROCKER_PRIORITY_BRIDGING_VLAN_DFLT_EXACT = 1,
+	ROCKER_PRIORITY_BRIDGING_VLAN_DFLT_WILD = 2,
+	ROCKER_PRIORITY_BRIDGING_VLAN = 3,
+	ROCKER_PRIORITY_BRIDGING_TENANT_DFLT_EXACT = 1,
+	ROCKER_PRIORITY_BRIDGING_TENANT_DFLT_WILD = 2,
+	ROCKER_PRIORITY_BRIDGING_TENANT = 3,
+	ROCKER_PRIORITY_ACL_PORT_PROMISC = 1,
+	ROCKER_PRIORITY_ACL = 2,
+};
+
+static u32 rocker_port_to_lport(struct rocker_port *rocker_port)
+{
+	return rocker_port->port_number + 1;
+}
+
+static void rocker_wait_reset(struct rocker_wait *wait)
+{
+	wait->done = false;
+	wait->nowait = false;
+}
+
+static void rocker_wait_init(struct rocker_wait *wait)
+{
+	init_waitqueue_head(&wait->wait);
+	rocker_wait_reset(wait);
+}
+
+static struct rocker_wait *rocker_wait_create(gfp_t gfp)
+{
+	struct rocker_wait *wait;
+
+	wait = kmalloc(sizeof(*wait), gfp);
+	if (!wait)
+		return NULL;
+	rocker_wait_init(wait);
+	return wait;
+}
+
+static void rocker_wait_destroy(struct rocker_wait *work)
+{
+	kfree(work);
+}
+
+static bool rocker_wait_event_timeout(struct rocker_wait *wait,
+				      unsigned long timeout)
+{
+	wait_event_timeout(wait->wait, wait->done, timeout);
+	if (!wait->done)
+		return false;
+	return true;
+}
+
+static void rocker_wait_wake_up(struct rocker_wait *wait)
+{
+	wait->done = true;
+	wake_up(&wait->wait);
+}
+
+static u32 rocker_msix_vector(struct rocker *rocker, unsigned vector)
+{
+	return rocker->msix_entries[vector].vector;
+}
+
+static u32 rocker_msix_tx_vector(struct rocker_port *rocker_port)
+{
+	return rocker_msix_vector(rocker_port->rocker,
+				  ROCKER_MSIX_VEC_TX(rocker_port->port_number));
+}
+
+static u32 rocker_msix_rx_vector(struct rocker_port *rocker_port)
+{
+	return rocker_msix_vector(rocker_port->rocker,
+				  ROCKER_MSIX_VEC_RX(rocker_port->port_number));
+}
+
+#define rocker_write32(rocker, reg, val)	\
+	writel((val), (rocker)->hw_addr + (ROCKER_ ## reg))
+#define rocker_read32(rocker, reg)	\
+	readl((rocker)->hw_addr + (ROCKER_ ## reg))
+#define rocker_write64(rocker, reg, val)	\
+	writeq((val), (rocker)->hw_addr + (ROCKER_ ## reg))
+#define rocker_read64(rocker, reg)	\
+	readq((rocker)->hw_addr + (ROCKER_ ## reg))
+
+/*****************************
+ * HW basic testing functions
+ *****************************/
+
+static int rocker_reg_test(struct rocker *rocker)
+{
+	struct pci_dev *pdev = rocker->pdev;
+	u64 test_reg;
+	u64 rnd;
+
+	rnd = prandom_u32();
+	rnd >>= 1;
+	rocker_write32(rocker, TEST_REG, rnd);
+	test_reg = rocker_read32(rocker, TEST_REG);
+	if (test_reg != rnd * 2) {
+		dev_err(&pdev->dev, "unexpected 32bit register value %08llx, expected %08llx\n",
+			test_reg, rnd * 2);
+		return -EIO;
+	}
+
+	rnd = prandom_u32();
+	rnd <<= 31;
+	rnd |= prandom_u32();
+	rocker_write64(rocker, TEST_REG64, rnd);
+	test_reg = rocker_read64(rocker, TEST_REG64);
+	if (test_reg != rnd * 2) {
+		dev_err(&pdev->dev, "unexpected 64bit register value %016llx, expected %016llx\n",
+			test_reg, rnd * 2);
+		return -EIO;
+	}
+
+	return 0;
+}
+
+static int rocker_dma_test_one(struct rocker *rocker, struct rocker_wait *wait,
+			       u32 test_type, dma_addr_t dma_handle,
+			       unsigned char *buf, unsigned char *expect,
+			       size_t size)
+{
+	struct pci_dev *pdev = rocker->pdev;
+	int i;
+
+	rocker_wait_reset(wait);
+	rocker_write32(rocker, TEST_DMA_CTRL, test_type);
+
+	if (!rocker_wait_event_timeout(wait, HZ / 10)) {
+		dev_err(&pdev->dev, "no interrupt received within a timeout\n");
+		return -EIO;
+	}
+
+	for (i = 0; i < size; i++) {
+		if (buf[i] != expect[i]) {
+			dev_err(&pdev->dev, "unexpected memory content %02x at byte %x, %02x expected\n",
+				buf[i], i, expect[i]);
+			return -EIO;
+		}
+	}
+	return 0;
+}
+
+#define ROCKER_TEST_DMA_BUF_SIZE (PAGE_SIZE * 4)
+#define ROCKER_TEST_DMA_FILL_PATTERN 0x96
+
+static int rocker_dma_test_offset(struct rocker *rocker,
+				  struct rocker_wait *wait, int offset)
+{
+	struct pci_dev *pdev = rocker->pdev;
+	unsigned char *alloc;
+	unsigned char *buf;
+	unsigned char *expect;
+	dma_addr_t dma_handle;
+	int i;
+	int err;
+
+	alloc = kzalloc(ROCKER_TEST_DMA_BUF_SIZE * 2 + offset,
+			GFP_KERNEL | GFP_DMA);
+	if (!alloc)
+		return -ENOMEM;
+	buf = alloc + offset;
+	expect = buf + ROCKER_TEST_DMA_BUF_SIZE;
+
+	dma_handle = pci_map_single(pdev, buf, ROCKER_TEST_DMA_BUF_SIZE,
+				    PCI_DMA_BIDIRECTIONAL);
+	if (pci_dma_mapping_error(pdev, dma_handle)) {
+		err = -EIO;
+		goto free_alloc;
+	}
+
+	rocker_write64(rocker, TEST_DMA_ADDR, dma_handle);
+	rocker_write32(rocker, TEST_DMA_SIZE, ROCKER_TEST_DMA_BUF_SIZE);
+
+	memset(expect, ROCKER_TEST_DMA_FILL_PATTERN, ROCKER_TEST_DMA_BUF_SIZE);
+	err = rocker_dma_test_one(rocker, wait, ROCKER_TEST_DMA_CTRL_FILL,
+				  dma_handle, buf, expect,
+				  ROCKER_TEST_DMA_BUF_SIZE);
+	if (err)
+		goto unmap;
+
+	memset(expect, 0, ROCKER_TEST_DMA_BUF_SIZE);
+	err = rocker_dma_test_one(rocker, wait, ROCKER_TEST_DMA_CTRL_CLEAR,
+				  dma_handle, buf, expect,
+				  ROCKER_TEST_DMA_BUF_SIZE);
+	if (err)
+		goto unmap;
+
+	prandom_bytes(buf, ROCKER_TEST_DMA_BUF_SIZE);
+	for (i = 0; i < ROCKER_TEST_DMA_BUF_SIZE; i++)
+		expect[i] = ~buf[i];
+	err = rocker_dma_test_one(rocker, wait, ROCKER_TEST_DMA_CTRL_INVERT,
+				  dma_handle, buf, expect,
+				  ROCKER_TEST_DMA_BUF_SIZE);
+	if (err)
+		goto unmap;
+
+unmap:
+	pci_unmap_single(pdev, dma_handle, ROCKER_TEST_DMA_BUF_SIZE,
+			 PCI_DMA_BIDIRECTIONAL);
+free_alloc:
+	kfree(alloc);
+
+	return err;
+}
+
+static int rocker_dma_test(struct rocker *rocker, struct rocker_wait *wait)
+{
+	int i;
+	int err;
+
+	for (i = 0; i < 8; i++) {
+		err = rocker_dma_test_offset(rocker, wait, i);
+		if (err)
+			return err;
+	}
+	return 0;
+}
+
+static irqreturn_t rocker_test_irq_handler(int irq, void *dev_id)
+{
+	struct rocker_wait *wait = dev_id;
+
+	rocker_wait_wake_up(wait);
+
+	return IRQ_HANDLED;
+}
+
+static int rocker_basic_hw_test(struct rocker *rocker)
+{
+	struct pci_dev *pdev = rocker->pdev;
+	struct rocker_wait wait;
+	int err;
+
+	err = rocker_reg_test(rocker);
+	if (err) {
+		dev_err(&pdev->dev, "reg test failed\n");
+		return err;
+	}
+
+	rocker_wait_init(&wait);
+	err = request_irq(rocker_msix_vector(rocker, ROCKER_MSIX_VEC_TEST),
+			  rocker_test_irq_handler, 0,
+			  rocker_driver_name, &wait);
+	if (err) {
+		dev_err(&pdev->dev, "cannot assign test irq\n");
+		return err;
+	}
+
+	rocker_write32(rocker, TEST_IRQ, ROCKER_MSIX_VEC_TEST);
+
+	if (!rocker_wait_event_timeout(&wait, HZ / 10)) {
+		dev_err(&pdev->dev, "no interrupt received within a timeout\n");
+		err = -EIO;
+		goto free_irq;
+	}
+
+	err = rocker_dma_test(rocker, &wait);
+	if (err)
+		dev_err(&pdev->dev, "dma test failed\n");
+
+free_irq:
+	free_irq(rocker_msix_vector(rocker, ROCKER_MSIX_VEC_TEST), &wait);
+	return err;
+}
+
+/******
+ * TLV
+ ******/
+
+#define ROCKER_TLV_ALIGNTO 8U
+#define ROCKER_TLV_ALIGN(len) \
+	(((len) + ROCKER_TLV_ALIGNTO - 1) & ~(ROCKER_TLV_ALIGNTO - 1))
+#define ROCKER_TLV_HDRLEN ROCKER_TLV_ALIGN(sizeof(struct rocker_tlv))
+
+/*  <------- ROCKER_TLV_HDRLEN -------> <--- ROCKER_TLV_ALIGN(payload) --->
+ * +-----------------------------+- - -+- - - - - - - - - - - - - - -+- - -+
+ * |             Header          | Pad |           Payload           | Pad |
+ * |      (struct rocker_tlv)    | ing |                             | ing |
+ * +-----------------------------+- - -+- - - - - - - - - - - - - - -+- - -+
+ *  <--------------------------- tlv->len -------------------------->
+ */
+
+static struct rocker_tlv *rocker_tlv_next(const struct rocker_tlv *tlv,
+					  int *remaining)
+{
+	int totlen = ROCKER_TLV_ALIGN(tlv->len);
+
+	*remaining -= totlen;
+	return (struct rocker_tlv *) ((char *) tlv + totlen);
+}
+
+static int rocker_tlv_ok(const struct rocker_tlv *tlv, int remaining)
+{
+	return remaining >= (int) ROCKER_TLV_HDRLEN &&
+	       tlv->len >= ROCKER_TLV_HDRLEN &&
+	       tlv->len <= remaining;
+}
+
+#define rocker_tlv_for_each(pos, head, len, rem)	\
+	for (pos = head, rem = len;			\
+	     rocker_tlv_ok(pos, rem);			\
+	     pos = rocker_tlv_next(pos, &(rem)))
+
+#define rocker_tlv_for_each_nested(pos, tlv, rem)	\
+	rocker_tlv_for_each(pos, rocker_tlv_data(tlv),	\
+			    rocker_tlv_len(tlv), rem)
+
+static int rocker_tlv_attr_size(int payload)
+{
+	return ROCKER_TLV_HDRLEN + payload;
+}
+
+static int rocker_tlv_total_size(int payload)
+{
+	return ROCKER_TLV_ALIGN(rocker_tlv_attr_size(payload));
+}
+
+static int rocker_tlv_padlen(int payload)
+{
+	return rocker_tlv_total_size(payload) - rocker_tlv_attr_size(payload);
+}
+
+static int rocker_tlv_type(const struct rocker_tlv *tlv)
+{
+	return tlv->type;
+}
+
+static void *rocker_tlv_data(const struct rocker_tlv *tlv)
+{
+	return (char *) tlv + ROCKER_TLV_HDRLEN;
+}
+
+static int rocker_tlv_len(const struct rocker_tlv *tlv)
+{
+	return tlv->len - ROCKER_TLV_HDRLEN;
+}
+
+static u8 rocker_tlv_get_u8(const struct rocker_tlv *tlv)
+{
+	return *(u8 *) rocker_tlv_data(tlv);
+}
+
+static u16 rocker_tlv_get_u16(const struct rocker_tlv *tlv)
+{
+	return *(u16 *) rocker_tlv_data(tlv);
+}
+
+static u32 rocker_tlv_get_u32(const struct rocker_tlv *tlv)
+{
+	return *(u32 *) rocker_tlv_data(tlv);
+}
+
+static u64 rocker_tlv_get_u64(const struct rocker_tlv *tlv)
+{
+	return *(u64 *) rocker_tlv_data(tlv);
+}
+
+static void rocker_tlv_parse(struct rocker_tlv **tb, int maxtype,
+			     const char *buf, int buf_len)
+{
+	const struct rocker_tlv *tlv;
+	const struct rocker_tlv *head = (const struct rocker_tlv *) buf;
+	int rem;
+
+	memset(tb, 0, sizeof(struct rocker_tlv *) * (maxtype + 1));
+
+	rocker_tlv_for_each(tlv, head, buf_len, rem) {
+		u32 type = rocker_tlv_type(tlv);
+
+		if (type > 0 && type <= maxtype)
+			tb[type] = (struct rocker_tlv *) tlv;
+	}
+}
+
+static void rocker_tlv_parse_nested(struct rocker_tlv **tb, int maxtype,
+				    const struct rocker_tlv *tlv)
+{
+	rocker_tlv_parse(tb, maxtype, rocker_tlv_data(tlv),
+			 rocker_tlv_len(tlv));
+}
+
+static void rocker_tlv_parse_desc(struct rocker_tlv **tb, int maxtype,
+				  struct rocker_desc_info *desc_info)
+{
+	rocker_tlv_parse(tb, maxtype, desc_info->data,
+			 desc_info->desc->tlv_size);
+}
+
+static struct rocker_tlv *rocker_tlv_start(struct rocker_desc_info *desc_info)
+{
+	return (struct rocker_tlv *) ((char *) desc_info->data +
+					       desc_info->tlv_size);
+}
+
+static int rocker_tlv_put(struct rocker_desc_info *desc_info,
+			  int attrtype, int attrlen, const void *data)
+{
+	int tail_room = desc_info->data_size - desc_info->tlv_size;
+	int total_size = rocker_tlv_total_size(attrlen);
+	struct rocker_tlv *tlv;
+
+	if (unlikely(tail_room < total_size))
+		return -EMSGSIZE;
+
+	tlv = rocker_tlv_start(desc_info);
+	desc_info->tlv_size += total_size;
+	tlv->type = attrtype;
+	tlv->len = rocker_tlv_attr_size(attrlen);
+	memcpy(rocker_tlv_data(tlv), data, attrlen);
+	memset((char *) tlv + tlv->len, 0, rocker_tlv_padlen(attrlen));
+	return 0;
+}
+
+static int rocker_tlv_put_u8(struct rocker_desc_info *desc_info,
+			     int attrtype, u8 value)
+{
+	return rocker_tlv_put(desc_info, attrtype, sizeof(u8), &value);
+}
+
+static int rocker_tlv_put_u16(struct rocker_desc_info *desc_info,
+			      int attrtype, u16 value)
+{
+	return rocker_tlv_put(desc_info, attrtype, sizeof(u16), &value);
+}
+
+static int rocker_tlv_put_u32(struct rocker_desc_info *desc_info,
+			      int attrtype, u32 value)
+{
+	return rocker_tlv_put(desc_info, attrtype, sizeof(u32), &value);
+}
+
+static int rocker_tlv_put_u64(struct rocker_desc_info *desc_info,
+			      int attrtype, u64 value)
+{
+	return rocker_tlv_put(desc_info, attrtype, sizeof(u64), &value);
+}
+
+static struct rocker_tlv *
+rocker_tlv_nest_start(struct rocker_desc_info *desc_info, int attrtype)
+{
+	struct rocker_tlv *start = rocker_tlv_start(desc_info);
+
+	if (rocker_tlv_put(desc_info, attrtype, 0, NULL) < 0)
+		return NULL;
+
+	return start;
+}
+
+static void rocker_tlv_nest_end(struct rocker_desc_info *desc_info,
+				struct rocker_tlv *start)
+{
+	start->len = (char *) rocker_tlv_start(desc_info) - (char *) start;
+}
+
+static void rocker_tlv_nest_cancel(struct rocker_desc_info *desc_info,
+				   struct rocker_tlv *start)
+{
+	desc_info->tlv_size = (char *) start - desc_info->data;
+}
+
+/******************************************
+ * DMA rings and descriptors manipulations
+ ******************************************/
+
+static u32 __pos_inc(u32 pos, size_t limit)
+{
+	return ++pos == limit ? 0 : pos;
+}
+
+static int rocker_desc_err(struct rocker_desc_info *desc_info)
+{
+	return -(desc_info->desc->comp_err & ~ROCKER_DMA_DESC_COMP_ERR_GEN);
+}
+
+static void rocker_desc_gen_clear(struct rocker_desc_info *desc_info)
+{
+	desc_info->desc->comp_err &= ~ROCKER_DMA_DESC_COMP_ERR_GEN;
+}
+
+static bool rocker_desc_gen(struct rocker_desc_info *desc_info)
+{
+	u32 comp_err = desc_info->desc->comp_err;
+
+	return comp_err & ROCKER_DMA_DESC_COMP_ERR_GEN ? true : false;
+}
+
+static void *rocker_desc_cookie_ptr_get(struct rocker_desc_info *desc_info)
+{
+	return (void *)(uintptr_t) desc_info->desc->cookie;
+}
+
+static void rocker_desc_cookie_ptr_set(struct rocker_desc_info *desc_info,
+				       void *ptr)
+{
+	desc_info->desc->cookie = (uintptr_t) ptr;
+}
+
+static struct rocker_desc_info *
+rocker_desc_head_get(struct rocker_dma_ring_info *info)
+{
+	struct rocker_desc_info *desc_info;
+	u32 head = __pos_inc(info->head, info->size);
+
+	desc_info = &info->desc_info[info->head];
+	if (head == info->tail)
+		return NULL; /* ring full */
+	desc_info->tlv_size = 0;
+	return desc_info;
+}
+
+static void rocker_desc_commit(struct rocker_desc_info *desc_info)
+{
+	desc_info->desc->buf_size = desc_info->data_size;
+	desc_info->desc->tlv_size = desc_info->tlv_size;
+}
+
+static void rocker_desc_head_set(struct rocker *rocker,
+				 struct rocker_dma_ring_info *info,
+				 struct rocker_desc_info *desc_info)
+{
+	u32 head = __pos_inc(info->head, info->size);
+
+	BUG_ON(head == info->tail);
+	rocker_desc_commit(desc_info);
+	info->head = head;
+	rocker_write32(rocker, DMA_DESC_HEAD(info->type), head);
+}
+
+static struct rocker_desc_info *
+rocker_desc_tail_get(struct rocker_dma_ring_info *info)
+{
+	struct rocker_desc_info *desc_info;
+
+	if (info->tail == info->head)
+		return NULL; /* nothing to be done between head and tail */
+	desc_info = &info->desc_info[info->tail];
+	if (!rocker_desc_gen(desc_info))
+		return NULL; /* gen bit not set, desc is not ready yet */
+	info->tail = __pos_inc(info->tail, info->size);
+	desc_info->tlv_size = desc_info->desc->tlv_size;
+	return desc_info;
+}
+
+static void rocker_dma_ring_credits_set(struct rocker *rocker,
+					struct rocker_dma_ring_info *info,
+					u32 credits)
+{
+	if (credits)
+		rocker_write32(rocker, DMA_DESC_CREDITS(info->type), credits);
+}
+
+static unsigned long rocker_dma_ring_size_fix(size_t size)
+{
+	return max(ROCKER_DMA_SIZE_MIN,
+		   min(roundup_pow_of_two(size), ROCKER_DMA_SIZE_MAX));
+}
+
+static int rocker_dma_ring_create(struct rocker *rocker,
+				  unsigned int type,
+				  size_t size,
+				  struct rocker_dma_ring_info *info)
+{
+	int i;
+
+	BUG_ON(size != rocker_dma_ring_size_fix(size));
+	info->size = size;
+	info->type = type;
+	info->head = 0;
+	info->tail = 0;
+	info->desc_info = kcalloc(info->size, sizeof(*info->desc_info),
+				  GFP_KERNEL);
+	if (!info->desc_info)
+		return -ENOMEM;
+
+	info->desc = pci_alloc_consistent(rocker->pdev,
+					  info->size * sizeof(*info->desc),
+					  &info->mapaddr);
+	if (!info->desc) {
+		kfree(info->desc_info);
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < info->size; i++)
+		info->desc_info[i].desc = &info->desc[i];
+
+	rocker_write32(rocker, DMA_DESC_CTRL(info->type),
+		       ROCKER_DMA_DESC_CTRL_RESET);
+	rocker_write64(rocker, DMA_DESC_ADDR(info->type), info->mapaddr);
+	rocker_write32(rocker, DMA_DESC_SIZE(info->type), info->size);
+
+	return 0;
+}
+
+static void rocker_dma_ring_destroy(struct rocker *rocker,
+				    struct rocker_dma_ring_info *info)
+{
+	rocker_write64(rocker, DMA_DESC_ADDR(info->type), 0);
+
+	pci_free_consistent(rocker->pdev,
+			    info->size * sizeof(struct rocker_desc),
+			    info->desc, info->mapaddr);
+	kfree(info->desc_info);
+}
+
+static void rocker_dma_ring_pass_to_producer(struct rocker *rocker,
+					     struct rocker_dma_ring_info *info)
+{
+	int i;
+
+	BUG_ON(info->head || info->tail);
+
+	/* When the ring is a consumer, we need to advance head for each desc.
+	 * That tells hw that the desc is ready to be used by it.
+	 */
+	for (i = 0; i < info->size - 1; i++)
+		rocker_desc_head_set(rocker, info, &info->desc_info[i]);
+	rocker_desc_commit(&info->desc_info[i]);
+}
+
+static int rocker_dma_ring_bufs_alloc(struct rocker *rocker,
+				      struct rocker_dma_ring_info *info,
+				      int direction, size_t buf_size)
+{
+	struct pci_dev *pdev = rocker->pdev;
+	int i;
+	int err;
+
+	for (i = 0; i < info->size; i++) {
+		struct rocker_desc_info *desc_info = &info->desc_info[i];
+		struct rocker_desc *desc = &info->desc[i];
+		dma_addr_t dma_handle;
+		char *buf;
+
+		buf = kzalloc(buf_size, GFP_KERNEL | GFP_DMA);
+		if (!buf) {
+			err = -ENOMEM;
+			goto rollback;
+		}
+
+		dma_handle = pci_map_single(pdev, buf, buf_size, direction);
+		if (pci_dma_mapping_error(pdev, dma_handle)) {
+			kfree(buf);
+			err = -EIO;
+			goto rollback;
+		}
+
+		desc_info->data = buf;
+		desc_info->data_size = buf_size;
+		dma_unmap_addr_set(desc_info, mapaddr, dma_handle);
+
+		desc->buf_addr = dma_handle;
+		desc->buf_size = buf_size;
+	}
+	return 0;
+
+rollback:
+	for (i--; i >= 0; i--) {
+		struct rocker_desc_info *desc_info = &info->desc_info[i];
+
+		pci_unmap_single(pdev, dma_unmap_addr(desc_info, mapaddr),
+				 desc_info->data_size, direction);
+		kfree(desc_info->data);
+	}
+	return err;
+}
+
+static void rocker_dma_ring_bufs_free(struct rocker *rocker,
+				      struct rocker_dma_ring_info *info,
+				      int direction)
+{
+	struct pci_dev *pdev = rocker->pdev;
+	int i;
+
+	for (i = 0; i < info->size; i++) {
+		struct rocker_desc_info *desc_info = &info->desc_info[i];
+		struct rocker_desc *desc = &info->desc[i];
+
+		desc->buf_addr = 0;
+		desc->buf_size = 0;
+		pci_unmap_single(pdev, dma_unmap_addr(desc_info, mapaddr),
+				 desc_info->data_size, direction);
+		kfree(desc_info->data);
+	}
+}
+
+static int rocker_dma_rings_init(struct rocker *rocker)
+{
+	struct pci_dev *pdev = rocker->pdev;
+	int err;
+
+	err = rocker_dma_ring_create(rocker, ROCKER_DMA_CMD,
+				     ROCKER_DMA_CMD_DEFAULT_SIZE,
+				     &rocker->cmd_ring);
+	if (err) {
+		dev_err(&pdev->dev, "failed to create command dma ring\n");
+		return err;
+	}
+
+	spin_lock_init(&rocker->cmd_ring_lock);
+
+	err = rocker_dma_ring_bufs_alloc(rocker, &rocker->cmd_ring,
+					 PCI_DMA_BIDIRECTIONAL, PAGE_SIZE);
+	if (err) {
+		dev_err(&pdev->dev, "failed to alloc command dma ring buffers\n");
+		goto err_dma_cmd_ring_bufs_alloc;
+	}
+
+	err = rocker_dma_ring_create(rocker, ROCKER_DMA_EVENT,
+				     ROCKER_DMA_EVENT_DEFAULT_SIZE,
+				     &rocker->event_ring);
+	if (err) {
+		dev_err(&pdev->dev, "failed to create event dma ring\n");
+		goto err_dma_event_ring_create;
+	}
+
+	err = rocker_dma_ring_bufs_alloc(rocker, &rocker->event_ring,
+					 PCI_DMA_FROMDEVICE, PAGE_SIZE);
+	if (err) {
+		dev_err(&pdev->dev, "failed to alloc event dma ring buffers\n");
+		goto err_dma_event_ring_bufs_alloc;
+	}
+	rocker_dma_ring_pass_to_producer(rocker, &rocker->event_ring);
+	return 0;
+
+err_dma_event_ring_bufs_alloc:
+	rocker_dma_ring_destroy(rocker, &rocker->event_ring);
+err_dma_event_ring_create:
+	rocker_dma_ring_bufs_free(rocker, &rocker->cmd_ring,
+				  PCI_DMA_BIDIRECTIONAL);
+err_dma_cmd_ring_bufs_alloc:
+	rocker_dma_ring_destroy(rocker, &rocker->cmd_ring);
+	return err;
+}
+
+static void rocker_dma_rings_fini(struct rocker *rocker)
+{
+	rocker_dma_ring_bufs_free(rocker, &rocker->event_ring,
+				  PCI_DMA_FROMDEVICE);
+	rocker_dma_ring_destroy(rocker, &rocker->event_ring);
+	rocker_dma_ring_bufs_free(rocker, &rocker->cmd_ring,
+				  PCI_DMA_BIDIRECTIONAL);
+	rocker_dma_ring_destroy(rocker, &rocker->cmd_ring);
+}
+
+static int rocker_dma_rx_ring_skb_map(struct rocker *rocker,
+				      struct rocker_port *rocker_port,
+				      struct rocker_desc_info *desc_info,
+				      struct sk_buff *skb, size_t buf_len)
+{
+	struct pci_dev *pdev = rocker->pdev;
+	dma_addr_t dma_handle;
+
+	dma_handle = pci_map_single(pdev, skb->data, buf_len,
+				    PCI_DMA_FROMDEVICE);
+	if (pci_dma_mapping_error(pdev, dma_handle))
+		return -EIO;
+	if (rocker_tlv_put_u64(desc_info, ROCKER_TLV_RX_FRAG_ADDR, dma_handle))
+		goto tlv_put_failure;
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_RX_FRAG_MAX_LEN, buf_len))
+		goto tlv_put_failure;
+	return 0;
+
+tlv_put_failure:
+	pci_unmap_single(pdev, dma_handle, buf_len, PCI_DMA_FROMDEVICE);
+	desc_info->tlv_size = 0;
+	return -EMSGSIZE;
+}
+
+static size_t rocker_port_rx_buf_len(struct rocker_port *rocker_port)
+{
+	return rocker_port->dev->mtu + ETH_HLEN + ETH_FCS_LEN + VLAN_HLEN;
+}
+
+static int rocker_dma_rx_ring_skb_alloc(struct rocker *rocker,
+					struct rocker_port *rocker_port,
+					struct rocker_desc_info *desc_info)
+{
+	struct net_device *dev = rocker_port->dev;
+	struct sk_buff *skb;
+	size_t buf_len = rocker_port_rx_buf_len(rocker_port);
+	int err;
+
+	/* Ensure that hw will see tlv_size zero in case of an error.
+	 * That tells hw to use another descriptor.
+	 */
+	rocker_desc_cookie_ptr_set(desc_info, NULL);
+	desc_info->tlv_size = 0;
+
+	skb = netdev_alloc_skb_ip_align(dev, buf_len);
+	if (!skb)
+		return -ENOMEM;
+	err = rocker_dma_rx_ring_skb_map(rocker, rocker_port, desc_info,
+					 skb, buf_len);
+	if (err) {
+		dev_kfree_skb_any(skb);
+		return err;
+	}
+	rocker_desc_cookie_ptr_set(desc_info, skb);
+	return 0;
+}
+
+static void rocker_dma_rx_ring_skb_unmap(struct rocker *rocker,
+					 struct rocker_tlv **attrs)
+{
+	struct pci_dev *pdev = rocker->pdev;
+	dma_addr_t dma_handle;
+	size_t len;
+
+	if (!attrs[ROCKER_TLV_RX_FRAG_ADDR] ||
+	    !attrs[ROCKER_TLV_RX_FRAG_MAX_LEN])
+		return;
+	dma_handle = rocker_tlv_get_u64(attrs[ROCKER_TLV_RX_FRAG_ADDR]);
+	len = rocker_tlv_get_u16(attrs[ROCKER_TLV_RX_FRAG_MAX_LEN]);
+	pci_unmap_single(pdev, dma_handle, len, PCI_DMA_FROMDEVICE);
+}
+
+static void rocker_dma_rx_ring_skb_free(struct rocker *rocker,
+					struct rocker_desc_info *desc_info)
+{
+	struct rocker_tlv *attrs[ROCKER_TLV_RX_MAX + 1];
+	struct sk_buff *skb = rocker_desc_cookie_ptr_get(desc_info);
+
+	if (!skb)
+		return;
+	rocker_tlv_parse_desc(attrs, ROCKER_TLV_RX_MAX, desc_info);
+	rocker_dma_rx_ring_skb_unmap(rocker, attrs);
+	dev_kfree_skb_any(skb);
+}
+
+static int rocker_dma_rx_ring_skbs_alloc(struct rocker *rocker,
+					 struct rocker_port *rocker_port)
+{
+	struct rocker_dma_ring_info *rx_ring = &rocker_port->rx_ring;
+	int i;
+	int err;
+
+	for (i = 0; i < rx_ring->size; i++) {
+		err = rocker_dma_rx_ring_skb_alloc(rocker, rocker_port,
+						   &rx_ring->desc_info[i]);
+		if (err)
+			goto rollback;
+	}
+	return 0;
+
+rollback:
+	for (i--; i >= 0; i--)
+		rocker_dma_rx_ring_skb_free(rocker, &rx_ring->desc_info[i]);
+	return err;
+}
+
+static void rocker_dma_rx_ring_skbs_free(struct rocker *rocker,
+					 struct rocker_port *rocker_port)
+{
+	struct rocker_dma_ring_info *rx_ring = &rocker_port->rx_ring;
+	int i;
+
+	for (i = 0; i < rx_ring->size; i++)
+		rocker_dma_rx_ring_skb_free(rocker, &rx_ring->desc_info[i]);
+}
+
+static int rocker_port_dma_rings_init(struct rocker_port *rocker_port)
+{
+	struct rocker *rocker = rocker_port->rocker;
+	int err;
+
+	err = rocker_dma_ring_create(rocker,
+				     ROCKER_DMA_TX(rocker_port->port_number),
+				     ROCKER_DMA_TX_DEFAULT_SIZE,
+				     &rocker_port->tx_ring);
+	if (err) {
+		netdev_err(rocker_port->dev, "failed to create tx dma ring\n");
+		return err;
+	}
+
+	err = rocker_dma_ring_bufs_alloc(rocker, &rocker_port->tx_ring,
+					 PCI_DMA_TODEVICE,
+					 ROCKER_DMA_TX_DESC_SIZE);
+	if (err) {
+		netdev_err(rocker_port->dev, "failed to alloc tx dma ring buffers\n");
+		goto err_dma_tx_ring_bufs_alloc;
+	}
+
+	err = rocker_dma_ring_create(rocker,
+				     ROCKER_DMA_RX(rocker_port->port_number),
+				     ROCKER_DMA_RX_DEFAULT_SIZE,
+				     &rocker_port->rx_ring);
+	if (err) {
+		netdev_err(rocker_port->dev, "failed to create rx dma ring\n");
+		goto err_dma_rx_ring_create;
+	}
+
+	err = rocker_dma_ring_bufs_alloc(rocker, &rocker_port->rx_ring,
+					 PCI_DMA_BIDIRECTIONAL,
+					 ROCKER_DMA_RX_DESC_SIZE);
+	if (err) {
+		netdev_err(rocker_port->dev, "failed to alloc rx dma ring buffers\n");
+		goto err_dma_rx_ring_bufs_alloc;
+	}
+
+	err = rocker_dma_rx_ring_skbs_alloc(rocker, rocker_port);
+	if (err) {
+		netdev_err(rocker_port->dev, "failed to alloc rx dma ring skbs\n");
+		goto err_dma_rx_ring_skbs_alloc;
+	}
+	rocker_dma_ring_pass_to_producer(rocker, &rocker_port->rx_ring);
+
+	return 0;
+
+err_dma_rx_ring_skbs_alloc:
+	rocker_dma_ring_bufs_free(rocker, &rocker_port->rx_ring,
+				  PCI_DMA_BIDIRECTIONAL);
+err_dma_rx_ring_bufs_alloc:
+	rocker_dma_ring_destroy(rocker, &rocker_port->rx_ring);
+err_dma_rx_ring_create:
+	rocker_dma_ring_bufs_free(rocker, &rocker_port->tx_ring,
+				  PCI_DMA_TODEVICE);
+err_dma_tx_ring_bufs_alloc:
+	rocker_dma_ring_destroy(rocker, &rocker_port->tx_ring);
+	return err;
+}
+
+static void rocker_port_dma_rings_fini(struct rocker_port *rocker_port)
+{
+	struct rocker *rocker = rocker_port->rocker;
+
+	rocker_dma_rx_ring_skbs_free(rocker, rocker_port);
+	rocker_dma_ring_bufs_free(rocker, &rocker_port->rx_ring,
+				  PCI_DMA_BIDIRECTIONAL);
+	rocker_dma_ring_destroy(rocker, &rocker_port->rx_ring);
+	rocker_dma_ring_bufs_free(rocker, &rocker_port->tx_ring,
+				  PCI_DMA_TODEVICE);
+	rocker_dma_ring_destroy(rocker, &rocker_port->tx_ring);
+}
+
+static void rocker_port_set_enable(struct rocker_port *rocker_port, bool enable)
+{
+	u64 val = rocker_read64(rocker_port->rocker, PORT_PHYS_ENABLE);
+
+	if (enable)
+		val |= 1ULL << rocker_port_to_lport(rocker_port);
+	else
+		val &= ~(1ULL << rocker_port_to_lport(rocker_port));
+	rocker_write64(rocker_port->rocker, PORT_PHYS_ENABLE, val);
+}
+
+/********************************
+ * Interrupt handler and helpers
+ ********************************/
+
+static irqreturn_t rocker_cmd_irq_handler(int irq, void *dev_id)
+{
+	struct rocker *rocker = dev_id;
+	struct rocker_desc_info *desc_info;
+	struct rocker_wait *wait;
+	u32 credits = 0;
+
+	spin_lock(&rocker->cmd_ring_lock);
+	while ((desc_info = rocker_desc_tail_get(&rocker->cmd_ring))) {
+		wait = rocker_desc_cookie_ptr_get(desc_info);
+		if (wait->nowait) {
+			rocker_desc_gen_clear(desc_info);
+			rocker_wait_destroy(wait);
+		} else {
+			rocker_wait_wake_up(wait);
+		}
+		credits++;
+	}
+	spin_unlock(&rocker->cmd_ring_lock);
+	rocker_dma_ring_credits_set(rocker, &rocker->cmd_ring, credits);
+
+	return IRQ_HANDLED;
+}
+
+static void rocker_port_link_up(struct rocker_port *rocker_port)
+{
+	netif_carrier_on(rocker_port->dev);
+	netdev_info(rocker_port->dev, "Link is up\n");
+}
+
+static void rocker_port_link_down(struct rocker_port *rocker_port)
+{
+	netif_carrier_off(rocker_port->dev);
+	netdev_info(rocker_port->dev, "Link is down\n");
+}
+
+static int rocker_event_process(struct rocker *rocker,
+				struct rocker_desc_info *desc_info)
+{
+	struct rocker_tlv *attrs[ROCKER_TLV_EVENT_MAX + 1];
+	struct rocker_tlv *info_attrs[ROCKER_TLV_EVENT_LINK_CHANGED_MAX + 1];
+	u16 type;
+	unsigned int port_number;
+	bool link_up;
+	struct rocker_port *rocker_port;
+
+	rocker_tlv_parse_desc(attrs, ROCKER_TLV_EVENT_MAX, desc_info);
+	if (!attrs[ROCKER_TLV_EVENT_TYPE] ||
+	    !attrs[ROCKER_TLV_EVENT_INFO])
+		return -EIO;
+
+	type = rocker_tlv_get_u16(attrs[ROCKER_TLV_EVENT_TYPE]);
+	if (!type)
+		return -EOPNOTSUPP;
+
+	rocker_tlv_parse_nested(info_attrs, ROCKER_TLV_EVENT_LINK_CHANGED_MAX,
+				attrs[ROCKER_TLV_EVENT_INFO]);
+	if (!info_attrs[ROCKER_TLV_EVENT_LINK_CHANGED_LPORT] ||
+	    !info_attrs[ROCKER_TLV_EVENT_LINK_CHANGED_LINKUP])
+		return -EIO;
+	port_number = rocker_tlv_get_u32(info_attrs[ROCKER_TLV_EVENT_LINK_CHANGED_LPORT]) - 1;
+	link_up = rocker_tlv_get_u8(info_attrs[ROCKER_TLV_EVENT_LINK_CHANGED_LINKUP]);
+
+	if (port_number >= rocker->port_count)
+		return -EINVAL;
+
+	rocker_port = rocker->ports[port_number];
+	if (netif_carrier_ok(rocker_port->dev) != link_up) {
+		if (link_up)
+			rocker_port_link_up(rocker_port);
+		else
+			rocker_port_link_down(rocker_port);
+	}
+	return 0;
+}
+
+static irqreturn_t rocker_event_irq_handler(int irq, void *dev_id)
+{
+	struct rocker *rocker = dev_id;
+	struct pci_dev *pdev = rocker->pdev;
+	struct rocker_desc_info *desc_info;
+	u32 credits = 0;
+	int err;
+
+	while ((desc_info = rocker_desc_tail_get(&rocker->event_ring))) {
+		err = rocker_desc_err(desc_info);
+		if (err) {
+			dev_err(&pdev->dev, "event desc received with err %d\n",
+				err);
+		} else {
+			err = rocker_event_process(rocker, desc_info);
+			if (err)
+				dev_err(&pdev->dev, "event processing failed with err %d\n",
+					err);
+		}
+		rocker_desc_gen_clear(desc_info);
+		rocker_desc_head_set(rocker, &rocker->event_ring, desc_info);
+		credits++;
+	}
+	rocker_dma_ring_credits_set(rocker, &rocker->event_ring, credits);
+
+	return IRQ_HANDLED;
+}
+
+static irqreturn_t rocker_tx_irq_handler(int irq, void *dev_id)
+{
+	struct rocker_port *rocker_port = dev_id;
+
+	napi_schedule(&rocker_port->napi_tx);
+	return IRQ_HANDLED;
+}
+
+static irqreturn_t rocker_rx_irq_handler(int irq, void *dev_id)
+{
+	struct rocker_port *rocker_port = dev_id;
+
+	napi_schedule(&rocker_port->napi_rx);
+	return IRQ_HANDLED;
+}
+
+/********************
+ * Command interface
+ ********************/
+
+typedef int (*rocker_cmd_cb_t)(struct rocker *rocker,
+			       struct rocker_port *rocker_port,
+			       struct rocker_desc_info *desc_info,
+			       void *priv);
+
+static int rocker_cmd_exec(struct rocker *rocker,
+			   struct rocker_port *rocker_port,
+			   rocker_cmd_cb_t prepare, void *prepare_priv,
+			   rocker_cmd_cb_t process, void *process_priv,
+			   bool nowait)
+{
+	struct rocker_desc_info *desc_info;
+	struct rocker_wait *wait;
+	unsigned long flags;
+	int err;
+
+	wait = rocker_wait_create(nowait ? GFP_ATOMIC : GFP_KERNEL);
+	if (!wait)
+		return -ENOMEM;
+	wait->nowait = nowait;
+
+	spin_lock_irqsave(&rocker->cmd_ring_lock, flags);
+	desc_info = rocker_desc_head_get(&rocker->cmd_ring);
+	if (!desc_info) {
+		spin_unlock_irqrestore(&rocker->cmd_ring_lock, flags);
+		err = -EAGAIN;
+		goto out;
+	}
+	err = prepare(rocker, rocker_port, desc_info, prepare_priv);
+	if (err) {
+		spin_unlock_irqrestore(&rocker->cmd_ring_lock, flags);
+		goto out;
+	}
+	rocker_desc_cookie_ptr_set(desc_info, wait);
+	rocker_desc_head_set(rocker, &rocker->cmd_ring, desc_info);
+	spin_unlock_irqrestore(&rocker->cmd_ring_lock, flags);
+
+	if (nowait)
+		return 0;
+
+	if (!rocker_wait_event_timeout(wait, HZ / 10))
+		return -EIO;
+
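+	/* The descriptor has completed, so even on error fall through to
+	 * clear the gen bit and free the wait object instead of leaking it.
+	 */
+	err = rocker_desc_err(desc_info);
+	if (!err && process)
+		err = process(rocker, rocker_port, desc_info, process_priv);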
+
+	rocker_desc_gen_clear(desc_info);
+out:
+	rocker_wait_destroy(wait);
+	return err;
+}
+
+static int
+rocker_cmd_get_port_settings_prep(struct rocker *rocker,
+				  struct rocker_port *rocker_port,
+				  struct rocker_desc_info *desc_info,
+				  void *priv)
+{
+	struct rocker_tlv *cmd_info;
+
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_CMD_TYPE,
+			       ROCKER_TLV_CMD_TYPE_GET_PORT_SETTINGS))
+		return -EMSGSIZE;
+	cmd_info = rocker_tlv_nest_start(desc_info, ROCKER_TLV_CMD_INFO);
+	if (!cmd_info)
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u32(desc_info, ROCKER_TLV_CMD_PORT_SETTINGS_LPORT,
+			       rocker_port_to_lport(rocker_port)))
+		return -EMSGSIZE;
+	rocker_tlv_nest_end(desc_info, cmd_info);
+	return 0;
+}
+
+static int
+rocker_cmd_get_port_settings_ethtool_proc(struct rocker *rocker,
+					  struct rocker_port *rocker_port,
+					  struct rocker_desc_info *desc_info,
+					  void *priv)
+{
+	struct ethtool_cmd *ecmd = priv;
+	struct rocker_tlv *attrs[ROCKER_TLV_CMD_MAX + 1];
+	struct rocker_tlv *info_attrs[ROCKER_TLV_CMD_PORT_SETTINGS_MAX + 1];
+	u32 speed;
+	u8 duplex;
+	u8 autoneg;
+
+	rocker_tlv_parse_desc(attrs, ROCKER_TLV_CMD_MAX, desc_info);
+	if (!attrs[ROCKER_TLV_CMD_INFO])
+		return -EIO;
+
+	rocker_tlv_parse_nested(info_attrs, ROCKER_TLV_CMD_PORT_SETTINGS_MAX,
+				attrs[ROCKER_TLV_CMD_INFO]);
+	if (!info_attrs[ROCKER_TLV_CMD_PORT_SETTINGS_SPEED] ||
+	    !info_attrs[ROCKER_TLV_CMD_PORT_SETTINGS_DUPLEX] ||
+	    !info_attrs[ROCKER_TLV_CMD_PORT_SETTINGS_AUTONEG])
+		return -EIO;
+
+	speed = rocker_tlv_get_u32(info_attrs[ROCKER_TLV_CMD_PORT_SETTINGS_SPEED]);
+	duplex = rocker_tlv_get_u8(info_attrs[ROCKER_TLV_CMD_PORT_SETTINGS_DUPLEX]);
+	autoneg = rocker_tlv_get_u8(info_attrs[ROCKER_TLV_CMD_PORT_SETTINGS_AUTONEG]);
+
+	ecmd->transceiver = XCVR_INTERNAL;
+	ecmd->supported = SUPPORTED_TP;
+	ecmd->phy_address = 0xff;
+	ecmd->port = PORT_TP;
+	ethtool_cmd_speed_set(ecmd, speed);
+	ecmd->duplex = duplex ? DUPLEX_FULL : DUPLEX_HALF;
+	ecmd->autoneg = autoneg ? AUTONEG_ENABLE : AUTONEG_DISABLE;
+
+	return 0;
+}
+
+static int
+rocker_cmd_get_port_settings_macaddr_proc(struct rocker *rocker,
+					  struct rocker_port *rocker_port,
+					  struct rocker_desc_info *desc_info,
+					  void *priv)
+{
+	unsigned char *macaddr = priv;
+	struct rocker_tlv *attrs[ROCKER_TLV_CMD_MAX + 1];
+	struct rocker_tlv *info_attrs[ROCKER_TLV_CMD_PORT_SETTINGS_MAX + 1];
+	struct rocker_tlv *attr;
+
+	rocker_tlv_parse_desc(attrs, ROCKER_TLV_CMD_MAX, desc_info);
+	if (!attrs[ROCKER_TLV_CMD_INFO])
+		return -EIO;
+
+	rocker_tlv_parse_nested(info_attrs, ROCKER_TLV_CMD_PORT_SETTINGS_MAX,
+				attrs[ROCKER_TLV_CMD_INFO]);
+	attr = info_attrs[ROCKER_TLV_CMD_PORT_SETTINGS_MACADDR];
+	if (!attr)
+		return -EIO;
+
+	if (rocker_tlv_len(attr) != ETH_ALEN)
+		return -EINVAL;
+
+	ether_addr_copy(macaddr, rocker_tlv_data(attr));
+	return 0;
+}
+
+static int
+rocker_cmd_get_port_settings_mode_proc(struct rocker *rocker,
+				       struct rocker_port *rocker_port,
+				       struct rocker_desc_info *desc_info,
+				       void *priv)
+{
+	enum rocker_port_mode *mode = priv;
+	struct rocker_tlv *attrs[ROCKER_TLV_CMD_MAX + 1];
+	struct rocker_tlv *info_attrs[ROCKER_TLV_CMD_PORT_SETTINGS_MAX + 1];
+
+	rocker_tlv_parse_desc(attrs, ROCKER_TLV_CMD_MAX, desc_info);
+	if (!attrs[ROCKER_TLV_CMD_INFO])
+		return -EIO;
+
+	rocker_tlv_parse_nested(info_attrs, ROCKER_TLV_CMD_PORT_SETTINGS_MAX,
+				attrs[ROCKER_TLV_CMD_INFO]);
+	if (!info_attrs[ROCKER_TLV_CMD_PORT_SETTINGS_MODE])
+		return -EIO;
+
+	*mode = rocker_tlv_get_u8(info_attrs[ROCKER_TLV_CMD_PORT_SETTINGS_MODE]);
+
+	return 0;
+}
+
+static int
+rocker_cmd_set_port_settings_ethtool_prep(struct rocker *rocker,
+					  struct rocker_port *rocker_port,
+					  struct rocker_desc_info *desc_info,
+					  void *priv)
+{
+	struct ethtool_cmd *ecmd = priv;
+	struct rocker_tlv *cmd_info;
+
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_CMD_TYPE,
+			       ROCKER_TLV_CMD_TYPE_SET_PORT_SETTINGS))
+		return -EMSGSIZE;
+	cmd_info = rocker_tlv_nest_start(desc_info, ROCKER_TLV_CMD_INFO);
+	if (!cmd_info)
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u32(desc_info, ROCKER_TLV_CMD_PORT_SETTINGS_LPORT,
+			       rocker_port_to_lport(rocker_port)))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u32(desc_info, ROCKER_TLV_CMD_PORT_SETTINGS_SPEED,
+			       ethtool_cmd_speed(ecmd)))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u8(desc_info, ROCKER_TLV_CMD_PORT_SETTINGS_DUPLEX,
+			      ecmd->duplex))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u8(desc_info, ROCKER_TLV_CMD_PORT_SETTINGS_AUTONEG,
+			      ecmd->autoneg))
+		return -EMSGSIZE;
+	rocker_tlv_nest_end(desc_info, cmd_info);
+	return 0;
+}
+
+static int
+rocker_cmd_set_port_settings_macaddr_prep(struct rocker *rocker,
+					  struct rocker_port *rocker_port,
+					  struct rocker_desc_info *desc_info,
+					  void *priv)
+{
+	unsigned char *macaddr = priv;
+	struct rocker_tlv *cmd_info;
+
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_CMD_TYPE,
+			       ROCKER_TLV_CMD_TYPE_SET_PORT_SETTINGS))
+		return -EMSGSIZE;
+	cmd_info = rocker_tlv_nest_start(desc_info, ROCKER_TLV_CMD_INFO);
+	if (!cmd_info)
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u32(desc_info, ROCKER_TLV_CMD_PORT_SETTINGS_LPORT,
+			       rocker_port_to_lport(rocker_port)))
+		return -EMSGSIZE;
+	if (rocker_tlv_put(desc_info, ROCKER_TLV_CMD_PORT_SETTINGS_MACADDR,
+			   ETH_ALEN, macaddr))
+		return -EMSGSIZE;
+	rocker_tlv_nest_end(desc_info, cmd_info);
+	return 0;
+}
+
+static int
+rocker_cmd_set_port_settings_mode_prep(struct rocker *rocker,
+				       struct rocker_port *rocker_port,
+				       struct rocker_desc_info *desc_info,
+				       void *priv)
+{
+	enum rocker_port_mode *mode = priv;
+	struct rocker_tlv *cmd_info;
+
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_CMD_TYPE,
+			       ROCKER_TLV_CMD_TYPE_SET_PORT_SETTINGS))
+		return -EMSGSIZE;
+	cmd_info = rocker_tlv_nest_start(desc_info, ROCKER_TLV_CMD_INFO);
+	if (!cmd_info)
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u32(desc_info, ROCKER_TLV_CMD_PORT_SETTINGS_LPORT,
+			       rocker_port_to_lport(rocker_port)))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u8(desc_info, ROCKER_TLV_CMD_PORT_SETTINGS_MODE,
+			      *mode))
+		return -EMSGSIZE;
+	rocker_tlv_nest_end(desc_info, cmd_info);
+	return 0;
+}
+
+static int rocker_cmd_get_port_settings_ethtool(struct rocker_port *rocker_port,
+						struct ethtool_cmd *ecmd)
+{
+	return rocker_cmd_exec(rocker_port->rocker, rocker_port,
+			       rocker_cmd_get_port_settings_prep, NULL,
+			       rocker_cmd_get_port_settings_ethtool_proc,
+			       ecmd, false);
+}
+
+static int rocker_cmd_get_port_settings_macaddr(struct rocker_port *rocker_port,
+						unsigned char *macaddr)
+{
+	return rocker_cmd_exec(rocker_port->rocker, rocker_port,
+			       rocker_cmd_get_port_settings_prep, NULL,
+			       rocker_cmd_get_port_settings_macaddr_proc,
+			       macaddr, false);
+}
+
+static int rocker_cmd_get_port_settings_mode(struct rocker_port *rocker_port,
+					     enum rocker_port_mode *mode)
+{
+	return rocker_cmd_exec(rocker_port->rocker, rocker_port,
+			       rocker_cmd_get_port_settings_prep, NULL,
+			       rocker_cmd_get_port_settings_mode_proc,
+			       mode, false);
+}
+
+static int rocker_cmd_set_port_settings_ethtool(struct rocker_port *rocker_port,
+						struct ethtool_cmd *ecmd)
+{
+	return rocker_cmd_exec(rocker_port->rocker, rocker_port,
+			       rocker_cmd_set_port_settings_ethtool_prep,
+			       ecmd, NULL, NULL, false);
+}
+
+static int rocker_cmd_set_port_settings_macaddr(struct rocker_port *rocker_port,
+						unsigned char *macaddr)
+{
+	return rocker_cmd_exec(rocker_port->rocker, rocker_port,
+			       rocker_cmd_set_port_settings_macaddr_prep,
+			       macaddr, NULL, NULL, false);
+}
+
+static int rocker_cmd_set_port_settings_mode(struct rocker_port *rocker_port,
+					     enum rocker_port_mode mode)
+{
+	return rocker_cmd_exec(rocker_port->rocker, rocker_port,
+			       rocker_cmd_set_port_settings_mode_prep,
+			       &mode, NULL, NULL, false);
+}
+
+static int rocker_cmd_flow_tbl_add_ig_port(struct rocker_desc_info *desc_info,
+					   struct rocker_flow_tbl_entry *entry)
+{
+	if (rocker_tlv_put_u32(desc_info, ROCKER_TLV_OF_DPA_IN_LPORT,
+			       entry->key.ig_port.in_lport))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u32(desc_info, ROCKER_TLV_OF_DPA_IN_LPORT_MASK,
+			       entry->key.ig_port.in_lport_mask))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_OF_DPA_GOTO_TABLE_ID,
+			       entry->key.ig_port.goto_tbl))
+		return -EMSGSIZE;
+
+	return 0;
+}
+
+static int rocker_cmd_flow_tbl_add_vlan(struct rocker_desc_info *desc_info,
+					struct rocker_flow_tbl_entry *entry)
+{
+	if (rocker_tlv_put_u32(desc_info, ROCKER_TLV_OF_DPA_IN_LPORT,
+			       entry->key.vlan.in_lport))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_OF_DPA_VLAN_ID,
+			       entry->key.vlan.vlan_id))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_OF_DPA_VLAN_ID_MASK,
+			       entry->key.vlan.vlan_id_mask))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_OF_DPA_GOTO_TABLE_ID,
+			       entry->key.vlan.goto_tbl))
+		return -EMSGSIZE;
+	if (entry->key.vlan.untagged &&
+	    rocker_tlv_put_u16(desc_info, ROCKER_TLV_OF_DPA_NEW_VLAN_ID,
+			       entry->key.vlan.new_vlan_id))
+		return -EMSGSIZE;
+
+	return 0;
+}
+
+static int rocker_cmd_flow_tbl_add_bridge(struct rocker_desc_info *desc_info,
+					  struct rocker_flow_tbl_entry *entry)
+{
+	if (entry->key.bridge.has_eth_dst &&
+	    rocker_tlv_put(desc_info, ROCKER_TLV_OF_DPA_DST_MAC,
+			   ETH_ALEN, entry->key.bridge.eth_dst))
+		return -EMSGSIZE;
+	if (entry->key.bridge.has_eth_dst_mask &&
+	    rocker_tlv_put(desc_info, ROCKER_TLV_OF_DPA_DST_MAC_MASK,
+			   ETH_ALEN, entry->key.bridge.eth_dst_mask))
+		return -EMSGSIZE;
+	if (entry->key.bridge.vlan_id &&
+	    rocker_tlv_put_u16(desc_info, ROCKER_TLV_OF_DPA_VLAN_ID,
+			       entry->key.bridge.vlan_id))
+		return -EMSGSIZE;
+	if (entry->key.bridge.tunnel_id &&
+	    rocker_tlv_put_u32(desc_info, ROCKER_TLV_OF_DPA_TUNNEL_ID,
+			       entry->key.bridge.tunnel_id))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_OF_DPA_GOTO_TABLE_ID,
+			       entry->key.bridge.goto_tbl))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u32(desc_info, ROCKER_TLV_OF_DPA_GROUP_ID,
+			       entry->key.bridge.group_id))
+		return -EMSGSIZE;
+
+	return 0;
+}
+
+static int rocker_cmd_flow_tbl_add_acl(struct rocker_desc_info *desc_info,
+				       struct rocker_flow_tbl_entry *entry)
+{
+	if (rocker_tlv_put_u32(desc_info, ROCKER_TLV_OF_DPA_IN_LPORT,
+			       entry->key.acl.in_lport))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u32(desc_info, ROCKER_TLV_OF_DPA_IN_LPORT_MASK,
+			       entry->key.acl.in_lport_mask))
+		return -EMSGSIZE;
+	if (rocker_tlv_put(desc_info, ROCKER_TLV_OF_DPA_SRC_MAC,
+			   ETH_ALEN, entry->key.acl.eth_src))
+		return -EMSGSIZE;
+	if (rocker_tlv_put(desc_info, ROCKER_TLV_OF_DPA_SRC_MAC_MASK,
+			   ETH_ALEN, entry->key.acl.eth_src_mask))
+		return -EMSGSIZE;
+	if (rocker_tlv_put(desc_info, ROCKER_TLV_OF_DPA_DST_MAC,
+			   ETH_ALEN, entry->key.acl.eth_dst))
+		return -EMSGSIZE;
+	if (rocker_tlv_put(desc_info, ROCKER_TLV_OF_DPA_DST_MAC_MASK,
+			   ETH_ALEN, entry->key.acl.eth_dst_mask))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_OF_DPA_ETHERTYPE,
+			       entry->key.acl.eth_type))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_OF_DPA_VLAN_ID,
+			       entry->key.acl.vlan_id))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_OF_DPA_VLAN_ID_MASK,
+			       entry->key.acl.vlan_id_mask))
+		return -EMSGSIZE;
+
+	switch (ntohs(entry->key.acl.eth_type)) {
+	case ETH_P_IP:
+	case ETH_P_IPV6:
+		if (rocker_tlv_put_u8(desc_info, ROCKER_TLV_OF_DPA_IP_PROTO,
+				      entry->key.acl.ip_proto))
+			return -EMSGSIZE;
+		if (rocker_tlv_put_u8(desc_info,
+				      ROCKER_TLV_OF_DPA_IP_PROTO_MASK,
+				      entry->key.acl.ip_proto_mask))
+			return -EMSGSIZE;
+		if (rocker_tlv_put_u8(desc_info, ROCKER_TLV_OF_DPA_IP_DSCP,
+				      entry->key.acl.ip_tos & 0x3f))
+			return -EMSGSIZE;
+		if (rocker_tlv_put_u8(desc_info,
+				      ROCKER_TLV_OF_DPA_IP_DSCP_MASK,
+				      entry->key.acl.ip_tos_mask & 0x3f))
+			return -EMSGSIZE;
+		if (rocker_tlv_put_u8(desc_info, ROCKER_TLV_OF_DPA_IP_ECN,
+				      (entry->key.acl.ip_tos & 0xc0) >> 6))
+			return -EMSGSIZE;
+		if (rocker_tlv_put_u8(desc_info,
+				      ROCKER_TLV_OF_DPA_IP_ECN_MASK,
+				      (entry->key.acl.ip_tos_mask & 0xc0) >> 6))
+			return -EMSGSIZE;
+		break;
+	}
+
+	if (entry->key.acl.group_id != ROCKER_GROUP_NONE &&
+	    rocker_tlv_put_u32(desc_info, ROCKER_TLV_OF_DPA_GROUP_ID,
+			       entry->key.acl.group_id))
+		return -EMSGSIZE;
+
+	return 0;
+}
+
+static int rocker_cmd_flow_tbl_add(struct rocker *rocker,
+				   struct rocker_port *rocker_port,
+				   struct rocker_desc_info *desc_info,
+				   void *priv)
+{
+	struct rocker_flow_tbl_entry *entry = priv;
+	struct rocker_tlv *cmd_info;
+	int err = 0;
+
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_CMD_TYPE,
+			       ROCKER_TLV_CMD_TYPE_OF_DPA_FLOW_ADD))
+		return -EMSGSIZE;
+	cmd_info = rocker_tlv_nest_start(desc_info, ROCKER_TLV_CMD_INFO);
+	if (!cmd_info)
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_OF_DPA_TABLE_ID,
+			       entry->key.tbl_id))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u32(desc_info, ROCKER_TLV_OF_DPA_PRIORITY,
+			       entry->key.priority))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u32(desc_info, ROCKER_TLV_OF_DPA_HARDTIME, 0))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u64(desc_info, ROCKER_TLV_OF_DPA_COOKIE,
+			       entry->cookie))
+		return -EMSGSIZE;
+
+	switch (entry->key.tbl_id) {
+	case ROCKER_OF_DPA_TABLE_ID_INGRESS_PORT:
+		err = rocker_cmd_flow_tbl_add_ig_port(desc_info, entry);
+		break;
+	case ROCKER_OF_DPA_TABLE_ID_VLAN:
+		err = rocker_cmd_flow_tbl_add_vlan(desc_info, entry);
+		break;
+	case ROCKER_OF_DPA_TABLE_ID_BRIDGING:
+		err = rocker_cmd_flow_tbl_add_bridge(desc_info, entry);
+		break;
+	case ROCKER_OF_DPA_TABLE_ID_ACL_POLICY:
+		err = rocker_cmd_flow_tbl_add_acl(desc_info, entry);
+		break;
+	default:
+		err = -ENOTSUPP;
+		break;
+	}
+
+	if (err)
+		return err;
+
+	rocker_tlv_nest_end(desc_info, cmd_info);
+
+	return 0;
+}
+
+static int rocker_cmd_flow_tbl_del(struct rocker *rocker,
+				   struct rocker_port *rocker_port,
+				   struct rocker_desc_info *desc_info,
+				   void *priv)
+{
+	const struct rocker_flow_tbl_entry *entry = priv;
+	struct rocker_tlv *cmd_info;
+
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_CMD_TYPE,
+			       ROCKER_TLV_CMD_TYPE_OF_DPA_FLOW_DEL))
+		return -EMSGSIZE;
+	cmd_info = rocker_tlv_nest_start(desc_info, ROCKER_TLV_CMD_INFO);
+	if (!cmd_info)
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u64(desc_info, ROCKER_TLV_OF_DPA_COOKIE,
+			       entry->cookie))
+		return -EMSGSIZE;
+	rocker_tlv_nest_end(desc_info, cmd_info);
+
+	return 0;
+}
+
+static int
+rocker_cmd_group_tbl_add_l2_interface(struct rocker_desc_info *desc_info,
+				      struct rocker_group_tbl_entry *entry)
+{
+	if (rocker_tlv_put_u32(desc_info, ROCKER_TLV_OF_DPA_OUT_LPORT,
+			       ROCKER_GROUP_PORT_GET(entry->group_id)))
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u8(desc_info, ROCKER_TLV_OF_DPA_POP_VLAN,
+			      entry->l2_interface.pop_vlan))
+		return -EMSGSIZE;
+
+	return 0;
+}
+
+static int
+rocker_cmd_group_tbl_add_group_ids(struct rocker_desc_info *desc_info,
+				   struct rocker_group_tbl_entry *entry)
+{
+	int i;
+	struct rocker_tlv *group_ids;
+
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_OF_DPA_GROUP_COUNT,
+			       entry->group_count))
+		return -EMSGSIZE;
+
+	group_ids = rocker_tlv_nest_start(desc_info,
+					  ROCKER_TLV_OF_DPA_GROUP_IDS);
+	if (!group_ids)
+		return -EMSGSIZE;
+
+	for (i = 0; i < entry->group_count; i++)
+		/* Note TLV array is 1-based */
+		if (rocker_tlv_put_u32(desc_info, i + 1, entry->group_ids[i]))
+			return -EMSGSIZE;
+
+	rocker_tlv_nest_end(desc_info, group_ids);
+
+	return 0;
+}
+
+static int rocker_cmd_group_tbl_add(struct rocker *rocker,
+				    struct rocker_port *rocker_port,
+				    struct rocker_desc_info *desc_info,
+				    void *priv)
+{
+	struct rocker_group_tbl_entry *entry = priv;
+	struct rocker_tlv *cmd_info;
+	int err = 0;
+
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_CMD_TYPE,
+			       ROCKER_TLV_CMD_TYPE_OF_DPA_GROUP_ADD))
+		return -EMSGSIZE;
+	cmd_info = rocker_tlv_nest_start(desc_info, ROCKER_TLV_CMD_INFO);
+	if (!cmd_info)
+		return -EMSGSIZE;
+
+	if (rocker_tlv_put_u32(desc_info, ROCKER_TLV_OF_DPA_GROUP_ID,
+			       entry->group_id))
+		return -EMSGSIZE;
+
+	switch (ROCKER_GROUP_TYPE_GET(entry->group_id)) {
+	case ROCKER_OF_DPA_GROUP_TYPE_L2_INTERFACE:
+		err = rocker_cmd_group_tbl_add_l2_interface(desc_info, entry);
+		break;
+	case ROCKER_OF_DPA_GROUP_TYPE_L2_MCAST:
+		err = rocker_cmd_group_tbl_add_group_ids(desc_info, entry);
+		break;
+	default:
+		err = -ENOTSUPP;
+		break;
+	}
+
+	if (err)
+		return err;
+
+	rocker_tlv_nest_end(desc_info, cmd_info);
+
+	return 0;
+}
+
+static int rocker_cmd_group_tbl_del(struct rocker *rocker,
+				    struct rocker_port *rocker_port,
+				    struct rocker_desc_info *desc_info,
+				    void *priv)
+{
+	const struct rocker_group_tbl_entry *entry = priv;
+	struct rocker_tlv *cmd_info;
+
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_CMD_TYPE,
+			       ROCKER_TLV_CMD_TYPE_OF_DPA_GROUP_DEL))
+		return -EMSGSIZE;
+	cmd_info = rocker_tlv_nest_start(desc_info, ROCKER_TLV_CMD_INFO);
+	if (!cmd_info)
+		return -EMSGSIZE;
+	if (rocker_tlv_put_u32(desc_info, ROCKER_TLV_OF_DPA_GROUP_ID,
+			       entry->group_id))
+		return -EMSGSIZE;
+	rocker_tlv_nest_end(desc_info, cmd_info);
+
+	return 0;
+}
+
+/************************
+ * Flow and group tables
+ ************************/
+
+static struct rocker_flow_tbl_entry *rocker_flow_tbl_find(
+	struct rocker *rocker, struct rocker_flow_tbl_entry *match)
+{
+	struct rocker_flow_tbl_entry *found;
+
+	hash_for_each_possible(rocker->flow_tbl, found, entry, match->key_crc32)
+		if (memcmp(&found->key, &match->key, sizeof(found->key)) == 0)
+			return found;
+
+	return NULL;
+}
+
+static int rocker_flow_tbl_add(struct rocker_port *rocker_port,
+			       struct rocker_flow_tbl_entry *match,
+			       bool nowait)
+{
+	struct rocker *rocker = rocker_port->rocker;
+	struct rocker_flow_tbl_entry *found;
+	unsigned long flags;
+	bool add_to_hw = false;
+	int err = 0;
+
+	match->key_crc32 = crc32(~0, &match->key, sizeof(match->key));
+
+	spin_lock_irqsave(&rocker->flow_tbl_lock, flags);
+
+	found = rocker_flow_tbl_find(rocker, match);
+
+	if (found) {
+		kfree(match);
+	} else {
+		found = match;
+		found->cookie = rocker->flow_tbl_next_cookie++;
+		hash_add(rocker->flow_tbl, &found->entry, found->key_crc32);
+		add_to_hw = true;
+	}
+
+	found->ref_count++;
+
+	spin_unlock_irqrestore(&rocker->flow_tbl_lock, flags);
+
+	if (add_to_hw) {
+		err = rocker_cmd_exec(rocker, rocker_port,
+				      rocker_cmd_flow_tbl_add,
+				      found, NULL, NULL, nowait);
+		if (err) {
+			spin_lock_irqsave(&rocker->flow_tbl_lock, flags);
+			hash_del(&found->entry);
+			spin_unlock_irqrestore(&rocker->flow_tbl_lock, flags);
+			kfree(found);
+		}
+	}
+
+	return err;
+}
+
+static int rocker_flow_tbl_del(struct rocker_port *rocker_port,
+			       struct rocker_flow_tbl_entry *match,
+			       bool nowait)
+{
+	struct rocker *rocker = rocker_port->rocker;
+	struct rocker_flow_tbl_entry *found;
+	unsigned long flags;
+	bool del_from_hw = false;
+	int err = 0;
+
+	match->key_crc32 = crc32(~0, &match->key, sizeof(match->key));
+
+	spin_lock_irqsave(&rocker->flow_tbl_lock, flags);
+
+	found = rocker_flow_tbl_find(rocker, match);
+
+	if (found) {
+		found->ref_count--;
+		if (found->ref_count == 0) {
+			hash_del(&found->entry);
+			del_from_hw = true;
+		}
+	}
+
+	spin_unlock_irqrestore(&rocker->flow_tbl_lock, flags);
+
+	kfree(match);
+
+	if (del_from_hw) {
+		err = rocker_cmd_exec(rocker, rocker_port,
+				      rocker_cmd_flow_tbl_del,
+				      found, NULL, NULL, nowait);
+		kfree(found);
+	}
+
+	return err;
+}
+
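The add/del pair above is a find-or-insert table with reference counts: only the first add and the last del of a given key reach the hardware. A minimal userspace sketch of that bookkeeping follows, assuming a small fixed array in place of the kernel hash table and no locking. The return value stands in for the add_to_hw / del_from_hw decision, and every name here is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TBL_SIZE 8	/* arbitrary capacity for the sketch */

struct key { unsigned int tbl_id; unsigned int priority; };

struct slot {
	bool used;
	struct key key;
	unsigned int ref_count;
};

static struct slot tbl[TBL_SIZE];

static struct slot *tbl_find(const struct key *key)
{
	int i;

	for (i = 0; i < TBL_SIZE; i++)
		if (tbl[i].used &&
		    memcmp(&tbl[i].key, key, sizeof(*key)) == 0)
			return &tbl[i];
	return NULL;
}

/* Returns 1 if the caller must program hardware, 0 if the entry
 * already existed, -1 if the table is full.
 */
static int tbl_add(const struct key *key)
{
	struct slot *s = tbl_find(key);
	int i;

	if (s) {
		s->ref_count++;
		return 0;
	}
	for (i = 0; i < TBL_SIZE; i++) {
		if (!tbl[i].used) {
			tbl[i].used = true;
			tbl[i].key = *key;
			tbl[i].ref_count = 1;
			return 1;
		}
	}
	return -1;
}

/* Returns 1 if the last reference went away and hardware must be
 * told to remove the entry, 0 otherwise (including "not found").
 */
static int tbl_del(const struct key *key)
{
	struct slot *s = tbl_find(key);

	if (!s)
		return 0;
	assert(s->ref_count > 0);
	if (--s->ref_count)
		return 0;
	s->used = false;
	return 1;
}
```

Comparing keys with memcmp() only works if padding bytes are equal too. The driver gets that from kzalloc(); the sketch sidesteps it by using a key struct without padding.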
+#define ROCKER_OP_FLAG_REMOVE		(1 << 0)
+#define ROCKER_OP_FLAG_NOWAIT		(1 << 1)
+
+static gfp_t rocker_op_flags_gfp(int flags)
+{
+	return flags & ROCKER_OP_FLAG_NOWAIT ? GFP_ATOMIC : GFP_KERNEL;
+}
+
+static int rocker_flow_tbl_do(struct rocker_port *rocker_port,
+			      int flags, struct rocker_flow_tbl_entry *entry)
+{
+	bool nowait = flags & ROCKER_OP_FLAG_NOWAIT;
+
+	if (flags & ROCKER_OP_FLAG_REMOVE)
+		return rocker_flow_tbl_del(rocker_port, entry, nowait);
+	else
+		return rocker_flow_tbl_add(rocker_port, entry, nowait);
+}
+
+static int rocker_flow_tbl_ig_port(struct rocker_port *rocker_port,
+				   int flags, u32 in_lport, u32 in_lport_mask,
+				   enum rocker_of_dpa_table_id goto_tbl)
+{
+	struct rocker_flow_tbl_entry *entry;
+
+	entry = kzalloc(sizeof(*entry), rocker_op_flags_gfp(flags));
+	if (!entry)
+		return -ENOMEM;
+
+	entry->key.priority = ROCKER_PRIORITY_IG_PORT;
+	entry->key.tbl_id = ROCKER_OF_DPA_TABLE_ID_INGRESS_PORT;
+	entry->key.ig_port.in_lport = in_lport;
+	entry->key.ig_port.in_lport_mask = in_lport_mask;
+	entry->key.ig_port.goto_tbl = goto_tbl;
+
+	return rocker_flow_tbl_do(rocker_port, flags, entry);
+}
+
+static int rocker_flow_tbl_vlan(struct rocker_port *rocker_port,
+				int flags, u32 in_lport,
+				__be16 vlan_id, __be16 vlan_id_mask,
+				enum rocker_of_dpa_table_id goto_tbl,
+				bool untagged, __be16 new_vlan_id)
+{
+	struct rocker_flow_tbl_entry *entry;
+
+	entry = kzalloc(sizeof(*entry), rocker_op_flags_gfp(flags));
+	if (!entry)
+		return -ENOMEM;
+
+	entry->key.priority = ROCKER_PRIORITY_VLAN;
+	entry->key.tbl_id = ROCKER_OF_DPA_TABLE_ID_VLAN;
+	entry->key.vlan.in_lport = in_lport;
+	entry->key.vlan.vlan_id = vlan_id;
+	entry->key.vlan.vlan_id_mask = vlan_id_mask;
+	entry->key.vlan.goto_tbl = goto_tbl;
+
+	entry->key.vlan.untagged = untagged;
+	entry->key.vlan.new_vlan_id = new_vlan_id;
+
+	return rocker_flow_tbl_do(rocker_port, flags, entry);
+}
+
+static int rocker_flow_tbl_bridge(struct rocker_port *rocker_port,
+				  int flags,
+				  const u8 *eth_dst, const u8 *eth_dst_mask,
+				  __be16 vlan_id, u32 tunnel_id,
+				  enum rocker_of_dpa_table_id goto_tbl,
+				  u32 group_id)
+{
+	struct rocker_flow_tbl_entry *entry;
+	u32 priority;
+	bool vlan_bridging = !!vlan_id;
+	bool dflt = !eth_dst || (eth_dst && eth_dst_mask);
+	bool wild = false;
+
+	entry = kzalloc(sizeof(*entry), rocker_op_flags_gfp(flags));
+	if (!entry)
+		return -ENOMEM;
+
+	entry->key.tbl_id = ROCKER_OF_DPA_TABLE_ID_BRIDGING;
+
+	if (eth_dst) {
+		entry->key.bridge.has_eth_dst = 1;
+		ether_addr_copy(entry->key.bridge.eth_dst, eth_dst);
+	}
+	if (eth_dst_mask) {
+		entry->key.bridge.has_eth_dst_mask = 1;
+		ether_addr_copy(entry->key.bridge.eth_dst_mask, eth_dst_mask);
+		if (memcmp(eth_dst_mask, zero_mac, ETH_ALEN))
+			wild = true;
+	}
+
+	priority = ROCKER_PRIORITY_UNKNOWN;
+	if (vlan_bridging && dflt && wild)
+		priority = ROCKER_PRIORITY_BRIDGING_VLAN_DFLT_WILD;
+	else if (vlan_bridging && dflt && !wild)
+		priority = ROCKER_PRIORITY_BRIDGING_VLAN_DFLT_EXACT;
+	else if (vlan_bridging && !dflt)
+		priority = ROCKER_PRIORITY_BRIDGING_VLAN;
+	else if (!vlan_bridging && dflt && wild)
+		priority = ROCKER_PRIORITY_BRIDGING_TENANT_DFLT_WILD;
+	else if (!vlan_bridging && dflt && !wild)
+		priority = ROCKER_PRIORITY_BRIDGING_TENANT_DFLT_EXACT;
+	else if (!vlan_bridging && !dflt)
+		priority = ROCKER_PRIORITY_BRIDGING_TENANT;
+
+	entry->key.priority = priority;
+	entry->key.bridge.vlan_id = vlan_id;
+	entry->key.bridge.tunnel_id = tunnel_id;
+	entry->key.bridge.goto_tbl = goto_tbl;
+	entry->key.bridge.group_id = group_id;
+
+	return rocker_flow_tbl_do(rocker_port, flags, entry);
+}
+
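The priority chain in rocker_flow_tbl_bridge() is a decision over three booleans. Its six branches cover all eight combinations, so the ROCKER_PRIORITY_UNKNOWN default is never what ends up in the entry. A compact sketch of the same decision follows; the enum names are placeholders for illustration and do not carry the driver's actual priority values.

```c
#include <assert.h>
#include <stdbool.h>

enum bridge_prio {	/* placeholder names, not the driver's values */
	PRIO_UNKNOWN,
	PRIO_VLAN_DFLT_WILD,
	PRIO_VLAN_DFLT_EXACT,
	PRIO_VLAN,
	PRIO_TENANT_DFLT_WILD,
	PRIO_TENANT_DFLT_EXACT,
	PRIO_TENANT,
};

/* Same outcome as the if/else chain, grouped by the "dflt" flag first. */
static enum bridge_prio bridge_prio_select(bool vlan_bridging, bool dflt,
					   bool wild)
{
	if (!dflt)
		return vlan_bridging ? PRIO_VLAN : PRIO_TENANT;
	if (vlan_bridging)
		return wild ? PRIO_VLAN_DFLT_WILD : PRIO_VLAN_DFLT_EXACT;
	return wild ? PRIO_TENANT_DFLT_WILD : PRIO_TENANT_DFLT_EXACT;
}
```

When dflt is false the wild flag has no effect on the result, which matches the two branches of the chain that do not test it.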
+static int rocker_flow_tbl_acl(struct rocker_port *rocker_port,
+			       int flags, u32 priority, u32 in_lport,
+			       u32 in_lport_mask,
+			       const u8 *eth_src, const u8 *eth_src_mask,
+			       const u8 *eth_dst, const u8 *eth_dst_mask,
+			       __be16 eth_type,
+			       __be16 vlan_id, __be16 vlan_id_mask,
+			       u8 ip_proto, u8 ip_proto_mask,
+			       u8 ip_tos, u8 ip_tos_mask,
+			       u32 group_id)
+{
+	struct rocker_flow_tbl_entry *entry;
+
+	entry = kzalloc(sizeof(*entry), rocker_op_flags_gfp(flags));
+	if (!entry)
+		return -ENOMEM;
+
+	entry->key.priority = priority;
+	entry->key.tbl_id = ROCKER_OF_DPA_TABLE_ID_ACL_POLICY;
+	entry->key.acl.in_lport = in_lport;
+	entry->key.acl.in_lport_mask = in_lport_mask;
+
+	if (eth_src)
+		ether_addr_copy(entry->key.acl.eth_src, eth_src);
+	if (eth_src_mask)
+		ether_addr_copy(entry->key.acl.eth_src_mask, eth_src_mask);
+	if (eth_dst)
+		ether_addr_copy(entry->key.acl.eth_dst, eth_dst);
+	if (eth_dst_mask)
+		ether_addr_copy(entry->key.acl.eth_dst_mask, eth_dst_mask);
+
+	entry->key.acl.eth_type = eth_type;
+	entry->key.acl.vlan_id = vlan_id;
+	entry->key.acl.vlan_id_mask = vlan_id_mask;
+	entry->key.acl.ip_proto = ip_proto;
+	entry->key.acl.ip_proto_mask = ip_proto_mask;
+	entry->key.acl.ip_tos = ip_tos;
+	entry->key.acl.ip_tos_mask = ip_tos_mask;
+	entry->key.acl.group_id = group_id;
+
+	return rocker_flow_tbl_do(rocker_port, flags, entry);
+}
+
+static struct rocker_group_tbl_entry *rocker_group_tbl_find(
+	struct rocker *rocker, struct rocker_group_tbl_entry *match)
+{
+	struct rocker_group_tbl_entry *found;
+	u8 type = ROCKER_GROUP_TYPE_GET(match->group_id);
+	u16 index;
+	int bkt;
+
+	switch (type) {
+	case ROCKER_OF_DPA_GROUP_TYPE_L2_INTERFACE:
+		/* search for match by group_id */
+		hash_for_each_possible(rocker->group_tbl, found,
+				       entry, match->group_id)
+			if (found->group_id == match->group_id)
+				return found;
+		break;
+	case ROCKER_OF_DPA_GROUP_TYPE_L2_MCAST:
+		/* search for match by group_ids */
+		hash_for_each(rocker->group_tbl, bkt, found, entry) {
+			if (type != ROCKER_GROUP_TYPE_GET(found->group_id))
+				continue;
+			if (found->group_count != match->group_count)
+				continue;
+			if (memcmp(found->group_ids, match->group_ids,
+				   found->group_count * sizeof(u32)) == 0)
+				return found;
+		}
+		/* no match: create new unique group_id */
+		index = rocker->group_index_next++;
+		match->group_id &= ~ROCKER_GROUP_INDEX_MASK;
+		match->group_id |= ROCKER_GROUP_INDEX_SET(index);
+		break;
+	default:
+		break;
+	}
+
+	return NULL;
+}
+
+static void rocker_group_tbl_entry_free(struct rocker_group_tbl_entry *entry)
+{
+	switch (ROCKER_GROUP_TYPE_GET(entry->group_id)) {
+	case ROCKER_OF_DPA_GROUP_TYPE_L2_MCAST:
+		kfree(entry->group_ids);
+		break;
+	default:
+		break;
+	}
+	kfree(entry);
+}
+
+static int rocker_group_tbl_add(struct rocker_port *rocker_port,
+				struct rocker_group_tbl_entry *match,
+				u32 *group_id, bool nowait)
+{
+	struct rocker *rocker = rocker_port->rocker;
+	struct rocker_group_tbl_entry *found;
+	unsigned long flags;
+	bool add_to_hw = false;
+	int err = 0;
+
+	spin_lock_irqsave(&rocker->group_tbl_lock, flags);
+
+	found = rocker_group_tbl_find(rocker, match);
+
+	if (found) {
+		rocker_group_tbl_entry_free(match);
+	} else {
+		found = match;
+		hash_add(rocker->group_tbl, &found->entry, found->group_id);
+		add_to_hw = true;
+	}
+
+	found->ref_count++;
+
+	spin_unlock_irqrestore(&rocker->group_tbl_lock, flags);
+
+	*group_id = found->group_id;
+
+	if (add_to_hw) {
+		err = rocker_cmd_exec(rocker, rocker_port,
+				      rocker_cmd_group_tbl_add,
+				      found, NULL, NULL, nowait);
+		if (err) {
+			spin_lock_irqsave(&rocker->group_tbl_lock, flags);
+			hash_del(&found->entry);
+			spin_unlock_irqrestore(&rocker->group_tbl_lock, flags);
+			rocker_group_tbl_entry_free(found);
+		}
+	}
+
+	return err;
+}
+
+static int rocker_group_tbl_del(struct rocker_port *rocker_port,
+				struct rocker_group_tbl_entry *match,
+				u32 *group_id, bool nowait)
+{
+	struct rocker *rocker = rocker_port->rocker;
+	struct rocker_group_tbl_entry *found;
+	unsigned long flags;
+	bool del_from_hw = false;
+	int err = 0;
+
+	spin_lock_irqsave(&rocker->group_tbl_lock, flags);
+
+	found = rocker_group_tbl_find(rocker, match);
+
+	if (found) {
+		*group_id = found->group_id;
+		found->ref_count--;
+		if (found->ref_count == 0) {
+			hash_del(&found->entry);
+			del_from_hw = true;
+		}
+	}
+
+	spin_unlock_irqrestore(&rocker->group_tbl_lock, flags);
+
+	rocker_group_tbl_entry_free(match);
+
+	if (del_from_hw) {
+		err = rocker_cmd_exec(rocker, rocker_port,
+				      rocker_cmd_group_tbl_del,
+				      found, NULL, NULL, nowait);
+		rocker_group_tbl_entry_free(found);
+	}
+
+	return err;
+}
+
+static int rocker_group_tbl_do(struct rocker_port *rocker_port,
+			       int flags, struct rocker_group_tbl_entry *entry,
+			       u32 *group_id)
+{
+	bool nowait = flags & ROCKER_OP_FLAG_NOWAIT;
+
+	if (flags & ROCKER_OP_FLAG_REMOVE)
+		return rocker_group_tbl_del(rocker_port, entry,
+					    group_id, nowait);
+	else
+		return rocker_group_tbl_add(rocker_port, entry,
+					    group_id, nowait);
+}
+
+static int rocker_group_l2_interface(struct rocker_port *rocker_port,
+				     int flags, u32 group_id,
+				     int pop_vlan)
+{
+	struct rocker_group_tbl_entry *entry;
+
+	entry = kzalloc(sizeof(*entry), rocker_op_flags_gfp(flags));
+	if (!entry)
+		return -ENOMEM;
+
+	entry->group_id = group_id;
+	entry->l2_interface.pop_vlan = pop_vlan;
+
+	return rocker_group_tbl_do(rocker_port, flags, entry, &group_id);
+}
+
+static int rocker_group_l2_mcast(struct rocker_port *rocker_port,
+				 int flags, __be16 vlan_id,
+				 u16 group_count, u32 *group_ids,
+				 u32 *group_id)
+{
+	struct rocker_group_tbl_entry *entry;
+
+	*group_id = 0;
+
+	entry = kzalloc(sizeof(*entry), rocker_op_flags_gfp(flags));
+	if (!entry)
+		return -ENOMEM;
+
+	entry->group_id = ROCKER_GROUP_L2_MCAST(vlan_id, 0);
+	entry->group_count = group_count;
+	entry->group_ids = group_ids;
+
+	return rocker_group_tbl_do(rocker_port, flags, entry, group_id);
+}
+
+static int rocker_group_id_compare(const void *a, const void *b)
+{
+	return memcmp(a, b, sizeof(u32));
+}
+
+static struct rocker_port *rocker_port_get_by_ifindex(struct rocker *rocker,
+						      int ifindex)
+{
+	int i;
+
+	for (i = 0; i < rocker->port_count; i++)
+		if (rocker->ports[i]->dev->ifindex == ifindex)
+			return rocker->ports[i];
+	return NULL;
+}
+
+static u32 *rocker_flow_get_group_ids(struct rocker_port *rocker_port,
+				      const struct swdev_flow *flow, int flags,
+				      __be16 vlan_id, u16 *count)
+{
+	struct rocker_port *out_port;
+	u32 *group_ids = NULL;
+	u32 *tmp;
+	u32 out_lport;
+	bool send_up = false;
+	int i;
+
+	*count = 0;
+
+	for (i = 0; i < flow->action_count; i++) {
+		int ifindex = flow->action[i].out_port_ifindex;
+
+		out_port = rocker_port_get_by_ifindex(rocker_port->rocker,
+						      ifindex);
+		if (out_port) {
+			out_lport = rocker_port_to_lport(out_port);
+		} else if (!send_up) {
+			/* not a rocker port: send it up, but only once */
+			send_up = true;
+			out_lport = 0;
+		} else {
+			continue;
+		}
+
+		/* krealloc() leaves the old buffer intact on failure */
+		tmp = krealloc(group_ids, (*count + 1) * sizeof(u32),
+			       rocker_op_flags_gfp(flags));
+		if (!tmp)
+			goto err_out;
+		group_ids = tmp;
+		/* index by entries stored so far; actions may be skipped */
+		group_ids[(*count)++] = ROCKER_GROUP_L2_INTERFACE(vlan_id,
+								  out_lport);
+	}
+
+	sort(group_ids, *count, sizeof(u32), rocker_group_id_compare, NULL);
+
+	return group_ids;
+
+err_out:
+	kfree(group_ids);
+	*count = 0;
+	return NULL;
+}
+
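The group id list is sorted so that two flows with the same set of outputs produce byte-identical arrays, which is what the memcmp() lookup in rocker_group_tbl_find() relies on. A userspace sketch of that collect-and-canonicalize step follows, assuming realloc()/qsort() in place of krealloc()/sort(). make_group_id() is a hypothetical stand-in for ROCKER_GROUP_L2_INTERFACE(), whose bit layout is defined elsewhere in the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical packing, for illustration only. */
static uint32_t make_group_id(uint16_t vlan, uint16_t lport)
{
	return ((uint32_t)vlan << 16) | lport;
}

static int id_compare(const void *a, const void *b)
{
	/* Byte-wise order: not numeric on little-endian hosts, but any
	 * consistent total order is enough to canonicalize the list.
	 */
	return memcmp(a, b, sizeof(uint32_t));
}

/* lports[i] < 0 marks an output that is not one of our ports; all such
 * outputs collapse into a single "send up" entry (lport 0).
 * Returns a malloc'ed sorted array, or NULL with *count == 0.
 */
static uint32_t *collect_group_ids(const int *lports, size_t n,
				   uint16_t vlan, size_t *count)
{
	uint32_t *ids = NULL, *tmp;
	int send_up = 0;
	size_t i;

	*count = 0;
	for (i = 0; i < n; i++) {
		uint16_t lport;

		if (lports[i] >= 0) {
			lport = (uint16_t)lports[i];
		} else if (!send_up) {
			send_up = 1;
			lport = 0;
		} else {
			continue;	/* "send up" already recorded */
		}
		tmp = realloc(ids, (*count + 1) * sizeof(*ids));
		if (!tmp) {
			free(ids);
			*count = 0;
			return NULL;
		}
		ids = tmp;
		ids[(*count)++] = make_group_id(vlan, lport);
	}
	if (*count)
		qsort(ids, *count, sizeof(*ids), id_compare);
	return ids;
}
```

In the sketch the write index follows the number of entries stored rather than the loop counter, because skipped outputs leave a gap between the two.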
+static int rocker_bridging_vlan_ucast(struct rocker_port *rocker_port,
+				      const struct swdev_flow *flow,
+				      int flags, __be16 vlan_id, bool pop_vlan)
+{
+	struct rocker_port *out_port;
+	u32 out_lport;
+	u32 tunnel_id = 0;
+	u32 group_l2_interface;
+	int err;
+
+	/* L2 interface group for output */
+
+	if (flow->action_count == 0) {
+		out_lport = 0; /* send it up */
+	} else if (flow->action_count == 1) {
+		int ifindex = flow->action[0].out_port_ifindex;
+
+		out_port = rocker_port_get_by_ifindex(rocker_port->rocker,
+						      ifindex);
+		if (out_port)
+			out_lport = rocker_port_to_lport(out_port);
+		else
+			out_lport = 0; /* send it up */
+	} else {
+		netdev_err(rocker_port->dev, "Trying to install unicast bridge vlan flow with more than one output device\n");
+		return -EINVAL;
+	}
+
+	group_l2_interface = ROCKER_GROUP_L2_INTERFACE(vlan_id, out_lport);
+	err = rocker_group_l2_interface(rocker_port, flags,
+					group_l2_interface, pop_vlan);
+	if (err) {
+		netdev_err(rocker_port->dev, "Error (%d) L2 interface group\n",
+			   err);
+		return err;
+	}
+
+	/* VLAN unicast bridge table entry */
+
+	err = rocker_flow_tbl_bridge(rocker_port, flags,
+				     flow->match.key.eth.dst, NULL,
+				     vlan_id, tunnel_id,
+				     ROCKER_OF_DPA_TABLE_ID_ACL_POLICY,
+				     group_l2_interface);
+
+	if (err)
+		netdev_err(rocker_port->dev, "Error (%d) VLAN unicast bridging table entry\n",
+			   err);
+
+	return err;
+}
+
+static int rocker_bridging_vlan_mcast(struct rocker_port *rocker_port,
+				      const struct swdev_flow *flow,
+				      int flags, __be16 vlan_id, bool pop_vlan)
+{
+	u32 tunnel_id = 0;
+	u32 group_l2_mcast;
+	u16 group_count;
+	u32 *group_ids;
+	int err;
+	int i;
+
+	/* Get sorted list of output L2 interface group ids;
+	 * if there are none, there is nothing to forward in HW,
+	 * so we're done.
+	 */
+
+	group_ids = rocker_flow_get_group_ids(rocker_port, flow, flags, vlan_id,
+					      &group_count);
+	if (!group_ids)
+		return 0;
+
+	/* L2 interface groups for each out_lport */
+
+	for (i = 0; i < group_count; i++) {
+		err = rocker_group_l2_interface(rocker_port, flags,
+						group_ids[i], pop_vlan);
+		if (err) {
+			netdev_err(rocker_port->dev, "Error (%d) L2 interface group\n",
+				   err);
+			goto err_free_group_ids;
+		}
+	}
+
+	/* L2 multicast group entry */
+
+	err = rocker_group_l2_mcast(rocker_port, flags,
+				    vlan_id, group_count,
+				    group_ids, &group_l2_mcast);
+	if (err) {
+		netdev_err(rocker_port->dev, "Error (%d) L2 mcast group\n",
+			   err);
+		goto err_free_group_ids;
+	}
+
+	/* VLAN multicast bridge table entry */
+
+	err = rocker_flow_tbl_bridge(rocker_port, flags,
+				     flow->match.key.eth.dst, NULL,
+				     vlan_id, tunnel_id,
+				     ROCKER_OF_DPA_TABLE_ID_ACL_POLICY,
+				     group_l2_mcast);
+
+	if (err)
+		netdev_err(rocker_port->dev, "Error (%d) VLAN mcast bridging\n",
+			   err);
+
+	return err;
+
+err_free_group_ids:
+	kfree(group_ids);
+	return err;
+}
+
+static int rocker_flow_parse(struct rocker_port *rocker_port,
+			     const struct swdev_flow *flow,
+			     int flags)
+{
+	struct rocker_port *in_port;
+	u32 in_lport;
+	u32 in_lport_mask;
+	__be16 vlan_id;
+	__be16 vlan_id_mask;
+	__be16 new_vlan_id;
+	__be16 outer_vlan_id;
+	u16 bridge_id;
+	u32 tunnel_id;
+	bool untagged;
+	bool unicast;
+	bool eth_dst_exact;
+	const struct swdev_flow_match_key *key = &flow->match.key;
+	const struct swdev_flow_match_key *key_mask = &flow->match.key_mask;
+	int err;
+
+	enum {
+		BRIDGING_MODE_UNKNOWN,
+		BRIDGING_MODE_VLAN_UCAST,
+		BRIDGING_MODE_VLAN_MCAST,
+		BRIDGING_MODE_VLAN_DFLT,
+		BRIDGING_MODE_TUNNEL_UCAST,
+		BRIDGING_MODE_TUNNEL_MCAST,
+		BRIDGING_MODE_TUNNEL_DFLT,
+	} bridging_mode = BRIDGING_MODE_UNKNOWN;
+
+	tunnel_id = 0; /* XXX for now */
+
+	/* A note about value masks: sw_flow uses mask bit value of
+	 * 0 for "don't care", whereas OF-DPA HW uses mask bit value
+	 * of 1 for "don't care", so sw_flow mask value must be
+	 * inverted before passing to OF-DPA HW.  To summarize:
+	 *
+	 *      mask bit   sw_flow         OF-DPA
+	 *      -------------------------------------
+	 *      0          don't care      care
+	 *      1          care            don't care
+	 */
+
+	/* Get lport for in_port.  Skip sw_flows if in_port is not a
+	 * rocker port in our network namespace.
+	 */
+
+	in_port = rocker_port_get_by_ifindex(rocker_port->rocker,
+					     key->phy.in_port_ifindex);
+	if (!in_port)
+		return 0;
+
+	in_lport = rocker_port_to_lport(in_port);
+	in_lport_mask = 0;
+
+	/* Determine outer VLAN ID.  If untagged, use bridge VLAN ID,
+	 * otherwise use tagged VLAN ID for outer VLAN ID.
+	 */
+
+	if (key->eth.tci == htons(0) &&
+	    key_mask->eth.tci == htons(0xffff)) {
+		vlan_id = key->eth.tci;
+		vlan_id_mask = htons(0x0fff);
+		untagged = true;
+	} else {
+		/* XXX For now, fail any vlan except untagged vlan 0 */
+		netdev_warn(rocker_port->dev,
+			    "Can't parse vlan info, vlan 0x%04x mask 0x%04x\n",
+			    ntohs(key->eth.tci),
+			    ntohs(key_mask->eth.tci));
+		return 0;
+	}
+
+	bridge_id = 0; /* XXX for now, need unique ID for each bridge */
+	new_vlan_id = htons(bridge_id << 8 | in_lport);
+	outer_vlan_id = untagged ? new_vlan_id : vlan_id;
+
+	/* Ingress port table entry */
+
+	err = rocker_flow_tbl_ig_port(rocker_port, flags,
+				      in_lport, in_lport_mask,
+				      ROCKER_OF_DPA_TABLE_ID_VLAN);
+	if (err) {
+		netdev_err(rocker_port->dev, "Error (%d) ingress port table entry\n",
+			   err);
+		return err;
+	}
+
+	/* VLAN table entry */
+
+	err = rocker_flow_tbl_vlan(rocker_port, flags,
+				   in_lport, vlan_id, vlan_id_mask,
+				   ROCKER_OF_DPA_TABLE_ID_TERMINATION_MAC,
+				   untagged, new_vlan_id);
+	if (err) {
+		netdev_err(rocker_port->dev, "Error (%d) VLAN table entry\n",
+			   err);
+		return err;
+	}
+
+	/* XXX Determine if sw_flow wants L2 bridging or L3 routing.
+	 * XXX If wanting L3 routing, need to add termination mac
+	 * XXX table entry to catch L3 routing prefixes.
+	 * XXX For now, just doing L2 bridging, so skip term mac tbl
+	 * XXX (miss on term mac tbl goes to bridge tbl).
+	 */
+
+	unicast = (key->eth.dst[0] & 0x01) == 0x00;
+	eth_dst_exact = memcmp(key_mask->eth.dst, ff_mac, ETH_ALEN) == 0;
+
+	if (outer_vlan_id && unicast && eth_dst_exact)
+		bridging_mode = BRIDGING_MODE_VLAN_UCAST;
+	else if (outer_vlan_id && !unicast && eth_dst_exact)
+		bridging_mode = BRIDGING_MODE_VLAN_MCAST;
+
+	switch (bridging_mode) {
+	case BRIDGING_MODE_VLAN_UCAST:
+		err = rocker_bridging_vlan_ucast(rocker_port, flow, flags,
+						 outer_vlan_id, untagged);
+		break;
+	case BRIDGING_MODE_VLAN_MCAST:
+		err = rocker_bridging_vlan_mcast(rocker_port, flow, flags,
+						 outer_vlan_id, untagged);
+		break;
+	default:
+		netdev_err(rocker_port->dev, "Unknown bridging mode\n");
+		err = -ENOTSUPP;
+		break;
+	}
+
+	if (err) {
+		netdev_err(rocker_port->dev, "Error (%d) bridging table entry\n",
+			   err);
+		return err;
+	}
+
+	/* ACL table entry */
+
+	err = rocker_flow_tbl_acl(rocker_port, flags,
+				  ROCKER_PRIORITY_ACL,
+				  in_lport, in_lport_mask,
+				  key->eth.src, zero_mac,
+				  key->eth.dst, zero_mac,
+				  key->eth.type,
+				  outer_vlan_id, vlan_id_mask,
+				  key->ip.proto, ~key_mask->ip.proto,
+				  key->ip.tos, ~key_mask->ip.tos,
+				  ROCKER_GROUP_NONE);
+
+	if (err)
+		netdev_err(rocker_port->dev, "Error (%d) ACL table entry\n",
+			   err);
+
+	return err;
+}
+
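rocker_flow_parse() depends on two small conventions: the mask polarity described in its comment (sw_flow uses 1 for "care", OF-DPA uses 1 for "don't care") and the unicast test on the first destination octet. The sketch below spells both out in userspace C. The helper names are hypothetical, and ETH_ALEN is redefined locally so the block stands alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6

/* sw_flow: mask bit 1 = care.  OF-DPA: mask bit 1 = don't care. */
static uint8_t to_ofdpa_mask8(uint8_t sw_flow_mask)
{
	return (uint8_t)~sw_flow_mask;
}

static void to_ofdpa_mac_mask(uint8_t out[ETH_ALEN],
			      const uint8_t sw_flow_mask[ETH_ALEN])
{
	int i;

	for (i = 0; i < ETH_ALEN; i++)
		out[i] = (uint8_t)~sw_flow_mask[i];
}

/* The I/G bit (least significant bit of the first octet) is clear
 * for unicast destinations.
 */
static bool mac_is_unicast(const uint8_t mac[ETH_ALEN])
{
	return (mac[0] & 0x01) == 0;
}

/* Exact match in sw_flow terms: every mask bit set. */
static bool sw_flow_mask_is_exact(const uint8_t mask[ETH_ALEN])
{
	static const uint8_t ff[ETH_ALEN] = {
		0xff, 0xff, 0xff, 0xff, 0xff, 0xff
	};

	return memcmp(mask, ff, ETH_ALEN) == 0;
}
```

An all-ones sw_flow mask therefore becomes an all-zeroes OF-DPA mask, which is the exact-match case the function passes as zero_mac.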
+static int rocker_flow_add(struct rocker_port *rocker_port,
+			   const struct swdev_flow *flow)
+{
+	return rocker_flow_parse(rocker_port, flow, 0);
+}
+
+static int rocker_flow_del(struct rocker_port *rocker_port,
+			   const struct swdev_flow *flow)
+{
+	return rocker_flow_parse(rocker_port, flow, ROCKER_OP_FLAG_REMOVE);
+}
+
+/*****************
+ * Net device ops
+ *****************/
+
+static int rocker_port_open(struct net_device *dev)
+{
+	struct rocker_port *rocker_port = netdev_priv(dev);
+	int err;
+
+	err = rocker_port_dma_rings_init(rocker_port);
+	if (err)
+		return err;
+
+	err = request_irq(rocker_msix_tx_vector(rocker_port),
+			  rocker_tx_irq_handler, 0,
+			  rocker_driver_name, rocker_port);
+	if (err) {
+		netdev_err(rocker_port->dev, "cannot assign tx irq\n");
+		goto err_request_tx_irq;
+	}
+
+	err = request_irq(rocker_msix_rx_vector(rocker_port),
+			  rocker_rx_irq_handler, 0,
+			  rocker_driver_name, rocker_port);
+	if (err) {
+		netdev_err(rocker_port->dev, "cannot assign rx irq\n");
+		goto err_request_rx_irq;
+	}
+
+	napi_enable(&rocker_port->napi_tx);
+	napi_enable(&rocker_port->napi_rx);
+	rocker_port_set_enable(rocker_port, true);
+	netif_start_queue(dev);
+	return 0;
+
+err_request_rx_irq:
+	free_irq(rocker_msix_tx_vector(rocker_port), rocker_port);
+err_request_tx_irq:
+	rocker_port_dma_rings_fini(rocker_port);
+	return err;
+}
+
+static int rocker_port_stop(struct net_device *dev)
+{
+	struct rocker_port *rocker_port = netdev_priv(dev);
+
+	netif_stop_queue(dev);
+	rocker_port_set_enable(rocker_port, false);
+	napi_disable(&rocker_port->napi_rx);
+	napi_disable(&rocker_port->napi_tx);
+	free_irq(rocker_msix_rx_vector(rocker_port), rocker_port);
+	free_irq(rocker_msix_tx_vector(rocker_port), rocker_port);
+	rocker_port_dma_rings_fini(rocker_port);
+
+	return 0;
+}
+
+static void rocker_tx_desc_frags_unmap(struct rocker_port *rocker_port,
+				       struct rocker_desc_info *desc_info)
+{
+	struct rocker *rocker = rocker_port->rocker;
+	struct pci_dev *pdev = rocker->pdev;
+	struct rocker_tlv *attrs[ROCKER_TLV_TX_MAX + 1];
+	struct rocker_tlv *attr;
+	int rem;
+
+	rocker_tlv_parse_desc(attrs, ROCKER_TLV_TX_MAX, desc_info);
+	if (!attrs[ROCKER_TLV_TX_FRAGS])
+		return;
+	rocker_tlv_for_each_nested(attr, attrs[ROCKER_TLV_TX_FRAGS], rem) {
+		struct rocker_tlv *frag_attrs[ROCKER_TLV_TX_FRAG_ATTR_MAX + 1];
+		dma_addr_t dma_handle;
+		size_t len;
+
+		if (rocker_tlv_type(attr) != ROCKER_TLV_TX_FRAG)
+			continue;
+		rocker_tlv_parse_nested(frag_attrs, ROCKER_TLV_TX_FRAG_ATTR_MAX,
+					attr);
+		if (!frag_attrs[ROCKER_TLV_TX_FRAG_ATTR_ADDR] ||
+		    !frag_attrs[ROCKER_TLV_TX_FRAG_ATTR_LEN])
+			continue;
+		dma_handle = rocker_tlv_get_u64(frag_attrs[ROCKER_TLV_TX_FRAG_ATTR_ADDR]);
+		len = rocker_tlv_get_u16(frag_attrs[ROCKER_TLV_TX_FRAG_ATTR_LEN]);
+		pci_unmap_single(pdev, dma_handle, len, DMA_TO_DEVICE);
+	}
+}
+
+static int rocker_tx_desc_frag_map_put(struct rocker_port *rocker_port,
+				       struct rocker_desc_info *desc_info,
+				       char *buf, size_t buf_len)
+{
+	struct rocker *rocker = rocker_port->rocker;
+	struct pci_dev *pdev = rocker->pdev;
+	dma_addr_t dma_handle;
+	struct rocker_tlv *frag;
+
+	dma_handle = pci_map_single(pdev, buf, buf_len, DMA_TO_DEVICE);
+	if (unlikely(pci_dma_mapping_error(pdev, dma_handle))) {
+		if (net_ratelimit())
+			netdev_err(rocker_port->dev, "failed to dma map tx frag\n");
+		return -EIO;
+	}
+	frag = rocker_tlv_nest_start(desc_info, ROCKER_TLV_TX_FRAG);
+	if (!frag)
+		goto unmap_frag;
+	if (rocker_tlv_put_u64(desc_info, ROCKER_TLV_TX_FRAG_ATTR_ADDR,
+			       dma_handle))
+		goto nest_cancel;
+	if (rocker_tlv_put_u16(desc_info, ROCKER_TLV_TX_FRAG_ATTR_LEN,
+			       buf_len))
+		goto nest_cancel;
+	rocker_tlv_nest_end(desc_info, frag);
+	return 0;
+
+nest_cancel:
+	rocker_tlv_nest_cancel(desc_info, frag);
+unmap_frag:
+	pci_unmap_single(pdev, dma_handle, buf_len, DMA_TO_DEVICE);
+	return -EMSGSIZE;
+}
+
+static netdev_tx_t rocker_port_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct rocker_port *rocker_port = netdev_priv(dev);
+	struct rocker *rocker = rocker_port->rocker;
+	struct rocker_desc_info *desc_info;
+	struct rocker_tlv *frags;
+	int i;
+	int err;
+
+	desc_info = rocker_desc_head_get(&rocker_port->tx_ring);
+	if (unlikely(!desc_info)) {
+		if (net_ratelimit())
+			netdev_err(dev, "tx ring full when queue awake\n");
+		return NETDEV_TX_BUSY;
+	}
+
+	rocker_desc_cookie_ptr_set(desc_info, skb);
+
+	frags = rocker_tlv_nest_start(desc_info, ROCKER_TLV_TX_FRAGS);
+	if (!frags)
+		goto out;
+	err = rocker_tx_desc_frag_map_put(rocker_port, desc_info,
+					  skb->data, skb_headlen(skb));
+	if (err)
+		goto nest_cancel;
+	if (skb_shinfo(skb)->nr_frags > ROCKER_TX_FRAGS_MAX)
+		goto unmap_frags; /* head is already DMA-mapped */
+
+	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+		const skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+
+		err = rocker_tx_desc_frag_map_put(rocker_port, desc_info,
+						  skb_frag_address(frag),
+						  skb_frag_size(frag));
+		if (err)
+			goto unmap_frags;
+	}
+	rocker_tlv_nest_end(desc_info, frags);
+
+	rocker_desc_gen_clear(desc_info);
+	rocker_desc_head_set(rocker, &rocker_port->tx_ring, desc_info);
+
+	desc_info = rocker_desc_head_get(&rocker_port->tx_ring);
+	if (!desc_info)
+		netif_stop_queue(dev);
+
+	return NETDEV_TX_OK;
+
+unmap_frags:
+	rocker_tx_desc_frags_unmap(rocker_port, desc_info);
+nest_cancel:
+	rocker_tlv_nest_cancel(desc_info, frags);
+out:
+	dev_kfree_skb(skb);
+	return NETDEV_TX_OK;
+}
+
+static struct rocker_promisc_acl {
+	__be16 eth_type;
+	const u8 *eth_src;
+	const u8 *eth_src_mask;
+	const u8 *eth_dst;
+	const u8 *eth_dst_mask;
+	u8 ip_proto;
+	u8 ip_proto_mask;
+	u8 ip_tos;
+	u8 ip_tos_mask;
+} rocker_promisc_acls[] = {
+	{
+		/* allow any ARP pkts */
+		.eth_type = htons(ETH_P_ARP),
+		.eth_src = zero_mac,
+		.eth_src_mask = ff_mac,
+		.eth_dst = zero_mac,
+		.eth_dst_mask = ff_mac,
+	},
+	{
+		/* allow any IP pkts */
+		.eth_type = htons(ETH_P_IP),
+		.eth_src = zero_mac,
+		.eth_src_mask = ff_mac,
+		.eth_dst = zero_mac,
+		.eth_dst_mask = ff_mac,
+		.ip_proto = 0,
+		.ip_proto_mask = 0xff,
+		.ip_tos = 0,
+		.ip_tos_mask = 0xff,
+	},
+	{
+		/* allow LLDP pkts */
+		.eth_type = htons(0x88cc),
+		.eth_src = zero_mac,
+		.eth_src_mask = ff_mac,
+		.eth_dst = lldp_mac,
+		.eth_dst_mask = zero_mac,
+	},
+	{
+		/* allow any IPv6 pkts */
+		.eth_type = htons(ETH_P_IPV6),
+		.eth_src = zero_mac,
+		.eth_src_mask = ff_mac,
+		.eth_dst = zero_mac,
+		.eth_dst_mask = ff_mac,
+		.ip_proto = 0,
+		.ip_proto_mask = 0xff,
+		.ip_tos = 0,
+		.ip_tos_mask = 0xff,
+	},
+	{
+		/* mark end of list */
+		.eth_type = 0,
+	},
+};
+
+static int rocker_port_set_promisc(struct rocker_port *rocker_port,
+				   int flags)
+{
+	u32 in_lport = rocker_port_to_lport(rocker_port);
+	u32 in_lport_mask = 0;
+	u32 out_lport;
+	u16 bridge_id;
+	__be16 vlan_id;
+	__be16 vlan_id_mask;
+	__be16 new_vlan_id;
+	struct rocker_promisc_acl *acl;
+	u32 group_l2_interface;
+	bool untagged;
+	bool pop_vlan;
+	int err;
+
+	/* ingress port table entry */
+
+	err = rocker_flow_tbl_ig_port(rocker_port, flags,
+				      in_lport, in_lport_mask,
+				      ROCKER_OF_DPA_TABLE_ID_VLAN);
+	if (err) {
+		netdev_err(rocker_port->dev, "Error (%d) ingress port table entry\n",
+			   err);
+		return err;
+	}
+
+	/* VLAN table entry for untagged traffic */
+
+	vlan_id = 0;
+	vlan_id_mask = htons(0x0fff);
+	untagged = true;
+	bridge_id = 0; /* XXX for now, need a unique ID for each bridge */
+	new_vlan_id = htons(bridge_id << 8 | in_lport);
+
+	err = rocker_flow_tbl_vlan(rocker_port, flags,
+				   in_lport, vlan_id, vlan_id_mask,
+				   ROCKER_OF_DPA_TABLE_ID_TERMINATION_MAC,
+				   untagged, new_vlan_id);
+	if (err) {
+		netdev_err(rocker_port->dev, "Error (%d) VLAN table entry\n",
+			   err);
+		return err;
+	}
+
+	/* L2 interface group entry for bridge (port 0) */
+
+	out_lport = 0;
+	pop_vlan = untagged;
+
+	group_l2_interface = ROCKER_GROUP_L2_INTERFACE(new_vlan_id, out_lport);
+	err = rocker_group_l2_interface(rocker_port, flags, group_l2_interface,
+					pop_vlan);
+	if (err) {
+		netdev_err(rocker_port->dev, "Error (%d) L2 interface group\n",
+			   err);
+		return err;
+	}
+
+	/* ACL table entries for acceptable pkts */
+
+	for (acl = rocker_promisc_acls; acl->eth_type; acl++) {
+		err = rocker_flow_tbl_acl(rocker_port, flags,
+					  ROCKER_PRIORITY_ACL_PORT_PROMISC,
+					  in_lport, in_lport_mask,
+					  acl->eth_src, acl->eth_src_mask,
+					  acl->eth_dst, acl->eth_dst_mask,
+					  acl->eth_type,
+					  new_vlan_id, vlan_id_mask,
+					  acl->ip_proto, acl->ip_proto_mask,
+					  acl->ip_tos, acl->ip_tos_mask,
+					  group_l2_interface);
+		if (err) {
+			netdev_err(rocker_port->dev, "Error (%d) ACL table entry\n",
+				   err);
+			return err;
+		}
+	}
+
+	return 0;
+}
+
+static void rocker_port_set_rx_mode(struct net_device *dev)
+{
+	struct rocker_port *rocker_port = netdev_priv(dev);
+	int prev_promisc = (rocker_port->prev_flags & IFF_PROMISC) ? 1 : 0;
+	int promisc = (dev->flags & IFF_PROMISC) ? 1 : 0;
+	int op_flags = ROCKER_OP_FLAG_NOWAIT;
+
+	if (!promisc)
+		op_flags |= ROCKER_OP_FLAG_REMOVE;
+
+	if (promisc != prev_promisc)
+		rocker_port_set_promisc(rocker_port, op_flags);
+
+	rocker_port->prev_flags = dev->flags;
+}
+
+static int rocker_port_set_mac_address(struct net_device *dev, void *p)
+{
+	struct sockaddr *addr = p;
+	struct rocker_port *rocker_port = netdev_priv(dev);
+	int err;
+
+	if (!is_valid_ether_addr(addr->sa_data))
+		return -EADDRNOTAVAIL;
+
+	err = rocker_cmd_set_port_settings_macaddr(rocker_port, addr->sa_data);
+	if (err)
+		return err;
+	memcpy(dev->dev_addr, addr->sa_data, dev->addr_len);
+	return 0;
+}
+
+static int rocker_port_swdev_id_get(struct net_device *dev,
+				    struct netdev_phys_item_id *psid)
+{
+	struct rocker_port *rocker_port = netdev_priv(dev);
+	struct rocker *rocker = rocker_port->rocker;
+
+	psid->id_len = sizeof(rocker->hw.id);
+	memcpy(&psid->id, &rocker->hw.id, psid->id_len);
+	return 0;
+}
+
+static swdev_features_t rocker_port_swdev_features_get(struct net_device *dev)
+{
+	return SWDEV_F_FLOW_MATCH_KEY;
+}
+
+static int rocker_port_swdev_flow_insert(struct net_device *dev,
+					 const struct swdev_flow *flow)
+{
+	struct rocker_port *rocker_port = netdev_priv(dev);
+
+	return rocker_flow_add(rocker_port, flow);
+}
+
+static int rocker_port_swdev_flow_remove(struct net_device *dev,
+					 const struct swdev_flow *flow)
+{
+	struct rocker_port *rocker_port = netdev_priv(dev);
+
+	return rocker_flow_del(rocker_port, flow);
+}
+
+static const struct net_device_ops rocker_port_netdev_ops = {
+	.ndo_open		= rocker_port_open,
+	.ndo_stop		= rocker_port_stop,
+	.ndo_start_xmit		= rocker_port_xmit,
+	.ndo_set_rx_mode	= rocker_port_set_rx_mode,
+	.ndo_set_mac_address	= rocker_port_set_mac_address,
+	.ndo_swdev_id_get	= rocker_port_swdev_id_get,
+	.ndo_swdev_features_get	= rocker_port_swdev_features_get,
+	.ndo_swdev_flow_insert	= rocker_port_swdev_flow_insert,
+	.ndo_swdev_flow_remove	= rocker_port_swdev_flow_remove,
+};
+
+static bool rocker_port_dev_check(struct net_device *dev)
+{
+	return dev->netdev_ops == &rocker_port_netdev_ops;
+}
+
+/********************
+ * ethtool interface
+ ********************/
+
+static int rocker_port_get_settings(struct net_device *dev,
+				    struct ethtool_cmd *ecmd)
+{
+	struct rocker_port *rocker_port = netdev_priv(dev);
+
+	return rocker_cmd_get_port_settings_ethtool(rocker_port, ecmd);
+}
+
+static int rocker_port_set_settings(struct net_device *dev,
+				    struct ethtool_cmd *ecmd)
+{
+	struct rocker_port *rocker_port = netdev_priv(dev);
+
+	return rocker_cmd_set_port_settings_ethtool(rocker_port, ecmd);
+}
+
+static void rocker_port_get_drvinfo(struct net_device *dev,
+				    struct ethtool_drvinfo *drvinfo)
+{
+	strlcpy(drvinfo->driver, rocker_driver_name, sizeof(drvinfo->driver));
+	strlcpy(drvinfo->version, UTS_RELEASE, sizeof(drvinfo->version));
+}
+
+static const struct ethtool_ops rocker_port_ethtool_ops = {
+	.get_settings		= rocker_port_get_settings,
+	.set_settings		= rocker_port_set_settings,
+	.get_drvinfo		= rocker_port_get_drvinfo,
+	.get_link		= ethtool_op_get_link,
+};
+
+/*****************
+ * NAPI interface
+ *****************/
+
+static struct rocker_port *rocker_port_napi_tx_get(struct napi_struct *napi)
+{
+	return container_of(napi, struct rocker_port, napi_tx);
+}
+
+static int rocker_port_poll_tx(struct napi_struct *napi, int budget)
+{
+	struct rocker_port *rocker_port = rocker_port_napi_tx_get(napi);
+	struct rocker *rocker = rocker_port->rocker;
+	struct rocker_desc_info *desc_info;
+	u32 credits = 0;
+	int err;
+
+	/* Cleanup tx descriptors */
+	while ((desc_info = rocker_desc_tail_get(&rocker_port->tx_ring))) {
+		err = rocker_desc_err(desc_info);
+		if (err && net_ratelimit())
+			netdev_err(rocker_port->dev, "tx desc received with err %d\n",
+				   err);
+		rocker_tx_desc_frags_unmap(rocker_port, desc_info);
+		dev_kfree_skb_any(rocker_desc_cookie_ptr_get(desc_info));
+		credits++;
+	}
+
+	if (credits && netif_queue_stopped(rocker_port->dev))
+		netif_wake_queue(rocker_port->dev);
+
+	napi_complete(napi);
+	rocker_dma_ring_credits_set(rocker, &rocker_port->tx_ring, credits);
+
+	return 0;
+}
+
+static int rocker_port_rx_proc(struct rocker *rocker,
+			       struct rocker_port *rocker_port,
+			       struct rocker_desc_info *desc_info)
+{
+	struct rocker_tlv *attrs[ROCKER_TLV_RX_MAX + 1];
+	struct sk_buff *skb = rocker_desc_cookie_ptr_get(desc_info);
+	size_t rx_len;
+
+	if (!skb)
+		return -ENOENT;
+
+	rocker_tlv_parse_desc(attrs, ROCKER_TLV_RX_MAX, desc_info);
+	if (!attrs[ROCKER_TLV_RX_FRAG_LEN])
+		return -EINVAL;
+
+	rocker_dma_rx_ring_skb_unmap(rocker, attrs);
+
+	rx_len = rocker_tlv_get_u16(attrs[ROCKER_TLV_RX_FRAG_LEN]);
+	skb_put(skb, rx_len);
+	skb->protocol = eth_type_trans(skb, rocker_port->dev);
+	netif_receive_skb(skb);
+
+	return rocker_dma_rx_ring_skb_alloc(rocker, rocker_port, desc_info);
+}
+
+static struct rocker_port *rocker_port_napi_rx_get(struct napi_struct *napi)
+{
+	return container_of(napi, struct rocker_port, napi_rx);
+}
+
+static int rocker_port_poll_rx(struct napi_struct *napi, int budget)
+{
+	struct rocker_port *rocker_port = rocker_port_napi_rx_get(napi);
+	struct rocker *rocker = rocker_port->rocker;
+	struct rocker_desc_info *desc_info;
+	u32 credits = 0;
+	int err;
+
+	/* Process rx descriptors */
+	while (credits < budget &&
+	       (desc_info = rocker_desc_tail_get(&rocker_port->rx_ring))) {
+		err = rocker_desc_err(desc_info);
+		if (err) {
+			if (net_ratelimit())
+				netdev_err(rocker_port->dev, "rx desc received with err %d\n",
+					   err);
+		} else {
+			err = rocker_port_rx_proc(rocker, rocker_port,
+						  desc_info);
+			if (err && net_ratelimit())
+				netdev_err(rocker_port->dev, "rx processing failed with err %d\n",
+					   err);
+		}
+		rocker_desc_gen_clear(desc_info);
+		rocker_desc_head_set(rocker, &rocker_port->rx_ring, desc_info);
+		credits++;
+	}
+
+	if (credits < budget)
+		napi_complete(napi);
+
+	rocker_dma_ring_credits_set(rocker, &rocker_port->rx_ring, credits);
+
+	return credits;
+}
+
+/*****************
+ * PCI driver ops
+ *****************/
+
+static void rocker_carrier_init(struct rocker_port *rocker_port)
+{
+	struct rocker *rocker = rocker_port->rocker;
+	u64 link_status = rocker_read64(rocker, PORT_PHYS_LINK_STATUS);
+	bool link_up;
+
+	link_up = link_status & (1ULL << rocker_port_to_lport(rocker_port));
+	if (link_up)
+		netif_carrier_on(rocker_port->dev);
+	else
+		netif_carrier_off(rocker_port->dev);
+}
+
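+static void rocker_remove_ports(struct rocker *rocker)
+{
+	struct rocker_port *rocker_port;
+	int i;
+
+	for (i = 0; i < rocker->port_count; i++) {
+		rocker_port = rocker->ports[i];
+		if (!rocker_port)
+			continue;
+		unregister_netdev(rocker_port->dev);
+		free_netdev(rocker_port->dev);
+	}
+	kfree(rocker->ports);
+}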
+
+static void rocker_port_dev_addr_init(struct rocker *rocker,
+				      struct rocker_port *rocker_port)
+{
+	struct pci_dev *pdev = rocker->pdev;
+	int err;
+
+	err = rocker_cmd_get_port_settings_macaddr(rocker_port,
+						   rocker_port->dev->dev_addr);
+	if (err) {
+		dev_warn(&pdev->dev, "failed to get mac address, using random\n");
+		eth_hw_addr_random(rocker_port->dev);
+	}
+}
+
+static int rocker_probe_port(struct rocker *rocker, unsigned port_number)
+{
+	struct pci_dev *pdev = rocker->pdev;
+	struct rocker_port *rocker_port;
+	struct net_device *dev;
+	int err;
+
+	dev = alloc_etherdev(sizeof(struct rocker_port));
+	if (!dev)
+		return -ENOMEM;
+	rocker_port = netdev_priv(dev);
+	rocker_port->dev = dev;
+	rocker_port->rocker = rocker;
+	rocker_port->port_number = port_number;
+
+	rocker_port_dev_addr_init(rocker, rocker_port);
+	dev->netdev_ops = &rocker_port_netdev_ops;
+	dev->ethtool_ops = &rocker_port_ethtool_ops;
+	netif_napi_add(dev, &rocker_port->napi_tx, rocker_port_poll_tx,
+		       NAPI_POLL_WEIGHT);
+	netif_napi_add(dev, &rocker_port->napi_rx, rocker_port_poll_rx,
+		       NAPI_POLL_WEIGHT);
+	rocker_carrier_init(rocker_port);
+
+	err = register_netdev(dev);
+	if (err) {
+		dev_err(&pdev->dev, "register_netdev failed\n");
+		goto free_netdev;
+	}
+	rocker->ports[port_number] = rocker_port;
+	return 0;
+
+free_netdev:
+	free_netdev(dev);
+	return err;
+}
+
+static int rocker_probe_ports(struct rocker *rocker)
+{
+	int i;
+	size_t alloc_size;
+	int err;
+
+	alloc_size = sizeof(struct rocker_port *) * rocker->port_count;
+	rocker->ports = kzalloc(alloc_size, GFP_KERNEL);
+	if (!rocker->ports)
+		return -ENOMEM;
+	for (i = 0; i < rocker->port_count; i++) {
+		err = rocker_probe_port(rocker, i);
+		if (err)
+			goto remove_ports;
+	}
+	return 0;
+
+remove_ports:
+	rocker_remove_ports(rocker);
+	return err;
+}
+
+static int rocker_msix_init(struct rocker *rocker)
+{
+	struct pci_dev *pdev = rocker->pdev;
+	int msix_entries;
+	int i;
+	int err;
+
+	msix_entries = pci_msix_vec_count(pdev);
+	if (msix_entries < 0)
+		return msix_entries;
+
+	if (msix_entries != ROCKER_MSIX_VEC_COUNT(rocker->port_count))
+		return -EINVAL;
+
+	rocker->msix_entries = kmalloc_array(msix_entries,
+					     sizeof(struct msix_entry),
+					     GFP_KERNEL);
+	if (!rocker->msix_entries)
+		return -ENOMEM;
+
+	for (i = 0; i < msix_entries; i++)
+		rocker->msix_entries[i].entry = i;
+
+	err = pci_enable_msix_exact(pdev, rocker->msix_entries, msix_entries);
+	if (err < 0)
+		goto err_enable_msix;
+
+	return 0;
+
+err_enable_msix:
+	kfree(rocker->msix_entries);
+	return err;
+}
+
+static void rocker_msix_fini(struct rocker *rocker)
+{
+	pci_disable_msix(rocker->pdev);
+	kfree(rocker->msix_entries);
+}
+
+static int rocker_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+{
+	struct rocker *rocker;
+	int err;
+
+	rocker = kzalloc(sizeof(*rocker), GFP_KERNEL);
+	if (!rocker)
+		return -ENOMEM;
+
+	err = pci_enable_device(pdev);
+	if (err) {
+		dev_err(&pdev->dev, "pci_enable_device failed\n");
+		goto err_pci_enable_device;
+	}
+
+	err = pci_request_regions(pdev, rocker_driver_name);
+	if (err) {
+		dev_err(&pdev->dev, "pci_request_regions failed\n");
+		goto err_pci_request_regions;
+	}
+
+	err = pci_set_dma_mask(pdev, DMA_BIT_MASK(64));
+	if (!err) {
+		err = pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(64));
+		if (err) {
+			dev_err(&pdev->dev, "pci_set_consistent_dma_mask failed\n");
+			goto err_pci_set_dma_mask;
+		}
+	} else {
+		err = pci_set_dma_mask(pdev, DMA_BIT_MASK(32));
+		if (err) {
+			dev_err(&pdev->dev, "pci_set_dma_mask failed\n");
+			goto err_pci_set_dma_mask;
+		}
+	}
+
+	if (pci_resource_len(pdev, 0) < ROCKER_PCI_BAR0_SIZE) {
+		dev_err(&pdev->dev, "invalid PCI region size\n");
+		err = -EINVAL;
+		goto err_pci_resource_len_check;
+	}
+
+	rocker->hw_addr = ioremap(pci_resource_start(pdev, 0),
+				  pci_resource_len(pdev, 0));
+	if (!rocker->hw_addr) {
+		dev_err(&pdev->dev, "ioremap failed\n");
+		err = -EIO;
+		goto err_ioremap;
+	}
+	pci_set_master(pdev);
+
+	rocker->pdev = pdev;
+	pci_set_drvdata(pdev, rocker);
+
+	rocker->port_count = rocker_read32(rocker, PORT_PHYS_COUNT);
+
+	err = rocker_msix_init(rocker);
+	if (err) {
+		dev_err(&pdev->dev, "MSI-X init failed\n");
+		goto err_msix_init;
+	}
+
+	err = rocker_basic_hw_test(rocker);
+	if (err) {
+		dev_err(&pdev->dev, "basic hw test failed\n");
+		goto err_basic_hw_test;
+	}
+
+	rocker_write32(rocker, CONTROL, ROCKER_CONTROL_RESET);
+
+	err = rocker_dma_rings_init(rocker);
+	if (err)
+		goto err_dma_rings_init;
+
+	err = request_irq(rocker_msix_vector(rocker, ROCKER_MSIX_VEC_CMD),
+			  rocker_cmd_irq_handler, 0,
+			  rocker_driver_name, rocker);
+	if (err) {
+		dev_err(&pdev->dev, "cannot assign cmd irq\n");
+		goto err_request_cmd_irq;
+	}
+
+	err = request_irq(rocker_msix_vector(rocker, ROCKER_MSIX_VEC_EVENT),
+			  rocker_event_irq_handler, 0,
+			  rocker_driver_name, rocker);
+	if (err) {
+		dev_err(&pdev->dev, "cannot assign event irq\n");
+		goto err_request_event_irq;
+	}
+
+	rocker->hw.id = rocker_read64(rocker, SWITCH_ID);
+
+	err = rocker_probe_ports(rocker);
+	if (err) {
+		dev_err(&pdev->dev, "failed to probe ports\n");
+		goto err_probe_ports;
+	}
+
+	hash_init(rocker->flow_tbl);
+	spin_lock_init(&rocker->flow_tbl_lock);
+
+	hash_init(rocker->group_tbl);
+	spin_lock_init(&rocker->group_tbl_lock);
+
+	dev_info(&pdev->dev, "Rocker switch with id %016llx\n", rocker->hw.id);
+
+	return 0;
+
+err_probe_ports:
+	free_irq(rocker_msix_vector(rocker, ROCKER_MSIX_VEC_EVENT), rocker);
+err_request_event_irq:
+	free_irq(rocker_msix_vector(rocker, ROCKER_MSIX_VEC_CMD), rocker);
+err_request_cmd_irq:
+	rocker_dma_rings_fini(rocker);
+err_dma_rings_init:
+err_basic_hw_test:
+	rocker_msix_fini(rocker);
+err_msix_init:
+	iounmap(rocker->hw_addr);
+err_ioremap:
+err_pci_resource_len_check:
+err_pci_set_dma_mask:
+	pci_release_regions(pdev);
+err_pci_request_regions:
+	pci_disable_device(pdev);
+err_pci_enable_device:
+	kfree(rocker);
+	return err;
+}
+
+static void rocker_remove(struct pci_dev *pdev)
+{
+	struct rocker *rocker = pci_get_drvdata(pdev);
+
+	rocker_remove_ports(rocker);
+	free_irq(rocker_msix_vector(rocker, ROCKER_MSIX_VEC_EVENT), rocker);
+	free_irq(rocker_msix_vector(rocker, ROCKER_MSIX_VEC_CMD), rocker);
+	rocker_dma_rings_fini(rocker);
+	rocker_msix_fini(rocker);
+	iounmap(rocker->hw_addr);
+	pci_release_regions(rocker->pdev);
+	pci_disable_device(rocker->pdev);
+	kfree(rocker);
+}
+
+static struct pci_driver rocker_pci_driver = {
+	.name		= rocker_driver_name,
+	.id_table	= rocker_pci_id_table,
+	.probe		= rocker_probe,
+	.remove		= rocker_remove,
+};
+
+/************************************
+ * Net device notifier event handler
+ ************************************/
+
+static int rocker_port_master_changed(struct net_device *dev)
+{
+	struct rocker_port *rocker_port = netdev_priv(dev);
+	enum rocker_port_mode newmode = ROCKER_PORT_MODE_L2L3;
+	enum rocker_port_mode oldmode;
+	struct net_device *master = netdev_master_upper_dev_get(dev);
+	int err;
+
+	if (master && master->rtnl_link_ops &&
+	    !strcmp(master->rtnl_link_ops->kind, "openvswitch"))
+		newmode = ROCKER_PORT_MODE_OF_DPA;
+	err = rocker_cmd_get_port_settings_mode(rocker_port, &oldmode);
+	if (err)
+		return err;
+	if (newmode == oldmode)
+		return 0;
+	err = rocker_cmd_set_port_settings_mode(rocker_port, newmode);
+	if (err)
+		return err;
+	netdev_info(dev, "port mode changed from %d to %d\n", oldmode, newmode);
+	return err;
+}
+
+static int rocker_device_event(struct notifier_block *unused,
+			       unsigned long event, void *ptr)
+{
+	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
+	int err;
+
+	if (!rocker_port_dev_check(dev))
+		return NOTIFY_DONE;
+
+	switch (event) {
+	case NETDEV_CHANGEUPPER:
+		err = rocker_port_master_changed(dev);
+		if (err)
+			netdev_warn(dev, "failed to reflect master change (err %d)\n",
+				    err);
+		break;
+	}
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block rocker_notifier_block __read_mostly = {
+	.notifier_call = rocker_device_event,
+};
+
+/***********************
+ * Module init and exit
+ ***********************/
+
+static int __init rocker_module_init(void)
+{
+	int err;
+
+	register_netdevice_notifier(&rocker_notifier_block);
+	err = pci_register_driver(&rocker_pci_driver);
+	if (err)
+		goto err_pci_register_driver;
+	return 0;
+
+err_pci_register_driver:
+	unregister_netdevice_notifier(&rocker_notifier_block);
+	return err;
+}
+
+static void __exit rocker_module_exit(void)
+{
+	unregister_netdevice_notifier(&rocker_notifier_block);
+	pci_unregister_driver(&rocker_pci_driver);
+}
+
+module_init(rocker_module_init);
+module_exit(rocker_module_exit);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Jiri Pirko <jiri@resnulli.us>");
+MODULE_AUTHOR("Scott Feldman <sfeldma@cumulusnetworks.com>");
+MODULE_DESCRIPTION("Rocker switch device driver");
+MODULE_DEVICE_TABLE(pci, rocker_pci_id_table);
diff --git a/drivers/net/ethernet/rocker/rocker.h b/drivers/net/ethernet/rocker/rocker.h
new file mode 100644
index 0000000..fc08592
--- /dev/null
+++ b/drivers/net/ethernet/rocker/rocker.h
@@ -0,0 +1,465 @@
+/*
+ * drivers/net/ethernet/rocker/rocker.h - Rocker switch device driver
+ * Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us>
+ * Copyright (c) 2014 Scott Feldman <sfeldma@cumulusnetworks.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#ifndef _ROCKER_H
+#define _ROCKER_H
+
+#include <linux/types.h>
+
+#define PCI_VENDOR_ID_REDHAT		0x1b36
+#define PCI_DEVICE_ID_REDHAT_ROCKER	0x0006
+
+#define ROCKER_PCI_BAR0_SIZE		0x2000
+
+/* MSI-X vectors */
+enum {
+	ROCKER_MSIX_VEC_CMD,
+	ROCKER_MSIX_VEC_EVENT,
+	ROCKER_MSIX_VEC_TEST,
+	ROCKER_MSIX_VEC_RESERVED0,
+	__ROCKER_MSIX_VEC_TX,
+	__ROCKER_MSIX_VEC_RX,
+#define ROCKER_MSIX_VEC_TX(port) \
+	(__ROCKER_MSIX_VEC_TX + ((port) * 2))
+#define ROCKER_MSIX_VEC_RX(port) \
+	(__ROCKER_MSIX_VEC_RX + ((port) * 2))
+#define ROCKER_MSIX_VEC_COUNT(portcnt) \
+	(ROCKER_MSIX_VEC_RX((portcnt) - 1) + 1)
+};
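+
+/* Worked example of the vector layout above, assuming a two-port device
+ * and a zero-based port index as the macro argument:
+ *   vectors 0..3: CMD, EVENT, TEST, RESERVED0
+ *   ROCKER_MSIX_VEC_TX(0) = 4, ROCKER_MSIX_VEC_RX(0) = 5
+ *   ROCKER_MSIX_VEC_TX(1) = 6, ROCKER_MSIX_VEC_RX(1) = 7
+ *   ROCKER_MSIX_VEC_COUNT(2) = ROCKER_MSIX_VEC_RX(1) + 1 = 8
+ */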
+
+/* Rocker bogus registers */
+#define ROCKER_BOGUS_REG0		0x0000
+#define ROCKER_BOGUS_REG1		0x0004
+#define ROCKER_BOGUS_REG2		0x0008
+#define ROCKER_BOGUS_REG3		0x000c
+
+/* Rocker test registers */
+#define ROCKER_TEST_REG			0x0010
+#define ROCKER_TEST_REG64		0x0018  /* 8-byte */
+#define ROCKER_TEST_IRQ			0x0020
+#define ROCKER_TEST_DMA_ADDR		0x0028  /* 8-byte */
+#define ROCKER_TEST_DMA_SIZE		0x0030
+#define ROCKER_TEST_DMA_CTRL		0x0034
+
+/* Rocker test register ctrl */
+#define ROCKER_TEST_DMA_CTRL_CLEAR	(1 << 0)
+#define ROCKER_TEST_DMA_CTRL_FILL	(1 << 1)
+#define ROCKER_TEST_DMA_CTRL_INVERT	(1 << 2)
+
+/* Rocker DMA ring register offsets */
+#define ROCKER_DMA_DESC_ADDR(x)		(0x1000 + (x) * 32)  /* 8-byte */
+#define ROCKER_DMA_DESC_SIZE(x)		(0x1008 + (x) * 32)
+#define ROCKER_DMA_DESC_HEAD(x)		(0x100c + (x) * 32)
+#define ROCKER_DMA_DESC_TAIL(x)		(0x1010 + (x) * 32)
+#define ROCKER_DMA_DESC_CTRL(x)		(0x1014 + (x) * 32)
+#define ROCKER_DMA_DESC_CREDITS(x)	(0x1018 + (x) * 32)
+#define ROCKER_DMA_DESC_RES1(x)		(0x101c + (x) * 32)
+
+/* Rocker dma ctrl register bits */
+#define ROCKER_DMA_DESC_CTRL_RESET	(1 << 0)
+
+/* Rocker DMA ring types */
+enum rocker_dma_type {
+	ROCKER_DMA_CMD,
+	ROCKER_DMA_EVENT,
+	__ROCKER_DMA_TX,
+	__ROCKER_DMA_RX,
+#define ROCKER_DMA_TX(port) (__ROCKER_DMA_TX + (port) * 2)
+#define ROCKER_DMA_RX(port) (__ROCKER_DMA_RX + (port) * 2)
+};
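+
+/* Worked example, assuming a zero-based port index as the macro argument:
+ * the ring indexes are ROCKER_DMA_CMD = 0, ROCKER_DMA_EVENT = 1,
+ * ROCKER_DMA_TX(0) = 2, ROCKER_DMA_RX(0) = 3, ROCKER_DMA_TX(1) = 4 and
+ * ROCKER_DMA_RX(1) = 5. Each ring has a 32-byte register window starting
+ * at 0x1000, so for instance
+ *   ROCKER_DMA_DESC_HEAD(ROCKER_DMA_TX(0)) = 0x100c + 2 * 32 = 0x104c
+ */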
+
+/* Rocker DMA ring size limits and default sizes */
+#define ROCKER_DMA_SIZE_MIN		2ul
+#define ROCKER_DMA_SIZE_MAX		65536ul
+#define ROCKER_DMA_CMD_DEFAULT_SIZE	32ul
+#define ROCKER_DMA_EVENT_DEFAULT_SIZE	32ul
+#define ROCKER_DMA_TX_DEFAULT_SIZE	64ul
+#define ROCKER_DMA_TX_DESC_SIZE		256
+#define ROCKER_DMA_RX_DEFAULT_SIZE	64ul
+#define ROCKER_DMA_RX_DESC_SIZE		256
+
+/* Rocker DMA descriptor struct */
+struct rocker_desc {
+	u64 buf_addr;
+	u64 cookie;
+	u16 buf_size;
+	u16 tlv_size;
+	u16 resv[5];
+	u16 comp_err;
+} __packed __aligned(8);
+
+#define ROCKER_DMA_DESC_COMP_ERR_GEN	(1 << 15)
+
+/* Rocker DMA TLV struct */
+struct rocker_tlv {
+	u32 type;
+	u16 len;
+} __packed __aligned(8);
+
+/* TLVs */
+enum {
+	ROCKER_TLV_CMD_UNSPEC,
+	ROCKER_TLV_CMD_TYPE,	/* u16 */
+	ROCKER_TLV_CMD_INFO,	/* nest */
+
+	__ROCKER_TLV_CMD_MAX,
+	ROCKER_TLV_CMD_MAX = __ROCKER_TLV_CMD_MAX - 1,
+};
+
+enum {
+	ROCKER_TLV_CMD_TYPE_UNSPEC,
+	ROCKER_TLV_CMD_TYPE_GET_PORT_SETTINGS,
+	ROCKER_TLV_CMD_TYPE_SET_PORT_SETTINGS,
+	ROCKER_TLV_CMD_TYPE_OF_DPA_FLOW_ADD,
+	ROCKER_TLV_CMD_TYPE_OF_DPA_FLOW_MOD,
+	ROCKER_TLV_CMD_TYPE_OF_DPA_FLOW_DEL,
+	ROCKER_TLV_CMD_TYPE_OF_DPA_FLOW_GET_STATS,
+	ROCKER_TLV_CMD_TYPE_OF_DPA_GROUP_ADD,
+	ROCKER_TLV_CMD_TYPE_OF_DPA_GROUP_MOD,
+	ROCKER_TLV_CMD_TYPE_OF_DPA_GROUP_DEL,
+	ROCKER_TLV_CMD_TYPE_OF_DPA_GROUP_GET_STATS,
+	ROCKER_TLV_CMD_TYPE_TRUNK,
+	ROCKER_TLV_CMD_TYPE_BRIDGE,
+
+	__ROCKER_TLV_CMD_TYPE_MAX,
+	ROCKER_TLV_CMD_TYPE_MAX = __ROCKER_TLV_CMD_TYPE_MAX - 1,
+};
+
+enum {
+	ROCKER_TLV_CMD_PORT_SETTINGS_UNSPEC,
+	ROCKER_TLV_CMD_PORT_SETTINGS_LPORT,		/* u32 */
+	ROCKER_TLV_CMD_PORT_SETTINGS_SPEED,		/* u32 */
+	ROCKER_TLV_CMD_PORT_SETTINGS_DUPLEX,		/* u8 */
+	ROCKER_TLV_CMD_PORT_SETTINGS_AUTONEG,		/* u8 */
+	ROCKER_TLV_CMD_PORT_SETTINGS_MACADDR,		/* binary */
+	ROCKER_TLV_CMD_PORT_SETTINGS_MODE,		/* u8 */
+
+	__ROCKER_TLV_CMD_PORT_SETTINGS_MAX,
+	ROCKER_TLV_CMD_PORT_SETTINGS_MAX =
+			__ROCKER_TLV_CMD_PORT_SETTINGS_MAX - 1,
+};
+
+enum rocker_port_mode {
+	ROCKER_PORT_MODE_OF_DPA,
+	ROCKER_PORT_MODE_L2L3,
+};
+
+enum {
+	ROCKER_TLV_EVENT_UNSPEC,
+	ROCKER_TLV_EVENT_TYPE,	/* u16 */
+	ROCKER_TLV_EVENT_INFO,	/* nest */
+
+	__ROCKER_TLV_EVENT_MAX,
+	ROCKER_TLV_EVENT_MAX = __ROCKER_TLV_EVENT_MAX - 1,
+};
+
+enum {
+	ROCKER_TLV_EVENT_TYPE_UNSPEC,
+	ROCKER_TLV_EVENT_TYPE_LINK_CHANGED,
+
+	__ROCKER_TLV_EVENT_TYPE_MAX,
+	ROCKER_TLV_EVENT_TYPE_MAX = __ROCKER_TLV_EVENT_TYPE_MAX - 1,
+};
+
+enum {
+	ROCKER_TLV_EVENT_LINK_CHANGED_UNSPEC,
+	ROCKER_TLV_EVENT_LINK_CHANGED_LPORT,	/* u32 */
+	ROCKER_TLV_EVENT_LINK_CHANGED_LINKUP,	/* u8 */
+
+	__ROCKER_TLV_EVENT_LINK_CHANGED_MAX,
+	ROCKER_TLV_EVENT_LINK_CHANGED_MAX =
+			__ROCKER_TLV_EVENT_LINK_CHANGED_MAX - 1,
+};
+
+enum {
+	ROCKER_TLV_RX_UNSPEC,
+	ROCKER_TLV_RX_FLAGS,		/* u16, see ROCKER_RX_FLAGS_ */
+	ROCKER_TLV_RX_CSUM,		/* u16 */
+	ROCKER_TLV_RX_FRAG_ADDR,	/* u64 */
+	ROCKER_TLV_RX_FRAG_MAX_LEN,	/* u16 */
+	ROCKER_TLV_RX_FRAG_LEN,		/* u16 */
+
+	__ROCKER_TLV_RX_MAX,
+	ROCKER_TLV_RX_MAX = __ROCKER_TLV_RX_MAX - 1,
+};
+
+#define ROCKER_RX_FLAGS_IPV4			(1 << 0)
+#define ROCKER_RX_FLAGS_IPV6			(1 << 1)
+#define ROCKER_RX_FLAGS_CSUM_CALC		(1 << 2)
+#define ROCKER_RX_FLAGS_IPV4_CSUM_GOOD		(1 << 3)
+#define ROCKER_RX_FLAGS_IP_FRAG			(1 << 4)
+#define ROCKER_RX_FLAGS_TCP			(1 << 5)
+#define ROCKER_RX_FLAGS_UDP			(1 << 6)
+#define ROCKER_RX_FLAGS_TCP_UDP_CSUM_GOOD	(1 << 7)
+
+enum {
+	ROCKER_TLV_TX_UNSPEC,
+	ROCKER_TLV_TX_OFFLOAD,		/* u8, see ROCKER_TX_OFFLOAD_ */
+	ROCKER_TLV_TX_L3_CSUM_OFF,	/* u16 */
+	ROCKER_TLV_TX_TSO_MSS,		/* u16 */
+	ROCKER_TLV_TX_TSO_HDR_LEN,	/* u16 */
+	ROCKER_TLV_TX_FRAGS,		/* array */
+
+	__ROCKER_TLV_TX_MAX,
+	ROCKER_TLV_TX_MAX = __ROCKER_TLV_TX_MAX - 1,
+};
+
+#define ROCKER_TX_OFFLOAD_NONE		0
+#define ROCKER_TX_OFFLOAD_IP_CSUM	1
+#define ROCKER_TX_OFFLOAD_TCP_UDP_CSUM	2
+#define ROCKER_TX_OFFLOAD_L3_CSUM	3
+#define ROCKER_TX_OFFLOAD_TSO		4
+
+#define ROCKER_TX_FRAGS_MAX		16
+
+enum {
+	ROCKER_TLV_TX_FRAG_UNSPEC,
+	ROCKER_TLV_TX_FRAG,		/* nest */
+
+	__ROCKER_TLV_TX_FRAG_MAX,
+	ROCKER_TLV_TX_FRAG_MAX = __ROCKER_TLV_TX_FRAG_MAX - 1,
+};
+
+enum {
+	ROCKER_TLV_TX_FRAG_ATTR_UNSPEC,
+	ROCKER_TLV_TX_FRAG_ATTR_ADDR,	/* u64 */
+	ROCKER_TLV_TX_FRAG_ATTR_LEN,	/* u16 */
+
+	__ROCKER_TLV_TX_FRAG_ATTR_MAX,
+	ROCKER_TLV_TX_FRAG_ATTR_MAX = __ROCKER_TLV_TX_FRAG_ATTR_MAX - 1,
+};
+
+/* cmd info nested for OF-DPA msgs */
+enum {
+	ROCKER_TLV_OF_DPA_UNSPEC,
+	ROCKER_TLV_OF_DPA_TABLE_ID,		/* u16 */
+	ROCKER_TLV_OF_DPA_PRIORITY,		/* u32 */
+	ROCKER_TLV_OF_DPA_HARDTIME,		/* u32 */
+	ROCKER_TLV_OF_DPA_IDLETIME,		/* u32 */
+	ROCKER_TLV_OF_DPA_COOKIE,		/* u64 */
+	ROCKER_TLV_OF_DPA_IN_LPORT,		/* u32 */
+	ROCKER_TLV_OF_DPA_IN_LPORT_MASK,	/* u32 */
+	ROCKER_TLV_OF_DPA_OUT_LPORT,		/* u32 */
+	ROCKER_TLV_OF_DPA_GOTO_TABLE_ID,	/* u16 */
+	ROCKER_TLV_OF_DPA_GROUP_ID,		/* u32 */
+	ROCKER_TLV_OF_DPA_GROUP_COUNT,		/* u16 */
+	ROCKER_TLV_OF_DPA_GROUP_IDS,		/* u32 array */
+	ROCKER_TLV_OF_DPA_VLAN_ID,		/* __be16 */
+	ROCKER_TLV_OF_DPA_VLAN_ID_MASK,		/* __be16 */
+	ROCKER_TLV_OF_DPA_VLAN_PCP,		/* __be16 */
+	ROCKER_TLV_OF_DPA_VLAN_PCP_MASK,	/* __be16 */
+	ROCKER_TLV_OF_DPA_VLAN_PCP_ACTION,	/* u8 */
+	ROCKER_TLV_OF_DPA_NEW_VLAN_ID,		/* __be16 */
+	ROCKER_TLV_OF_DPA_NEW_VLAN_PCP,		/* u8 */
+	ROCKER_TLV_OF_DPA_TUNNEL_ID,		/* u32 */
+	ROCKER_TLV_OF_DPA_TUN_LOG_LPORT,	/* u32 */
+	ROCKER_TLV_OF_DPA_ETHERTYPE,		/* __be16 */
+	ROCKER_TLV_OF_DPA_DST_MAC,		/* binary */
+	ROCKER_TLV_OF_DPA_DST_MAC_MASK,		/* binary */
+	ROCKER_TLV_OF_DPA_SRC_MAC,		/* binary */
+	ROCKER_TLV_OF_DPA_SRC_MAC_MASK,		/* binary */
+	ROCKER_TLV_OF_DPA_IP_PROTO,		/* u8 */
+	ROCKER_TLV_OF_DPA_IP_PROTO_MASK,	/* u8 */
+	ROCKER_TLV_OF_DPA_IP_DSCP,		/* u8 */
+	ROCKER_TLV_OF_DPA_IP_DSCP_MASK,		/* u8 */
+	ROCKER_TLV_OF_DPA_IP_DSCP_ACTION,	/* u8 */
+	ROCKER_TLV_OF_DPA_NEW_IP_DSCP,		/* u8 */
+	ROCKER_TLV_OF_DPA_IP_ECN,		/* u8 */
+	ROCKER_TLV_OF_DPA_IP_ECN_MASK,		/* u8 */
+	ROCKER_TLV_OF_DPA_DST_IP,		/* __be32 */
+	ROCKER_TLV_OF_DPA_DST_IP_MASK,		/* __be32 */
+	ROCKER_TLV_OF_DPA_SRC_IP,		/* __be32 */
+	ROCKER_TLV_OF_DPA_SRC_IP_MASK,		/* __be32 */
+	ROCKER_TLV_OF_DPA_DST_IPV6,		/* binary */
+	ROCKER_TLV_OF_DPA_DST_IPV6_MASK,	/* binary */
+	ROCKER_TLV_OF_DPA_SRC_IPV6,		/* binary */
+	ROCKER_TLV_OF_DPA_SRC_IPV6_MASK,	/* binary */
+	ROCKER_TLV_OF_DPA_SRC_ARP_IP,		/* __be32 */
+	ROCKER_TLV_OF_DPA_SRC_ARP_IP_MASK,	/* __be32 */
+	ROCKER_TLV_OF_DPA_L4_DST_PORT,		/* __be16 */
+	ROCKER_TLV_OF_DPA_L4_DST_PORT_MASK,	/* __be16 */
+	ROCKER_TLV_OF_DPA_L4_SRC_PORT,		/* __be16 */
+	ROCKER_TLV_OF_DPA_L4_SRC_PORT_MASK,	/* __be16 */
+	ROCKER_TLV_OF_DPA_ICMP_TYPE,		/* u8 */
+	ROCKER_TLV_OF_DPA_ICMP_TYPE_MASK,	/* u8 */
+	ROCKER_TLV_OF_DPA_ICMP_CODE,		/* u8 */
+	ROCKER_TLV_OF_DPA_ICMP_CODE_MASK,	/* u8 */
+	ROCKER_TLV_OF_DPA_IPV6_LABEL,		/* __be32 */
+	ROCKER_TLV_OF_DPA_IPV6_LABEL_MASK,	/* __be32 */
+	ROCKER_TLV_OF_DPA_QUEUE_ID_ACTION,	/* u8 */
+	ROCKER_TLV_OF_DPA_NEW_QUEUE_ID,		/* u8 */
+	ROCKER_TLV_OF_DPA_CLEAR_ACTIONS,	/* u32 */
+	ROCKER_TLV_OF_DPA_POP_VLAN,		/* u8 */
+
+	__ROCKER_TLV_OF_DPA_MAX,
+	ROCKER_TLV_OF_DPA_MAX = __ROCKER_TLV_OF_DPA_MAX - 1,
+};
+
+/* OF-DPA table IDs */
+
+enum rocker_of_dpa_table_id {
+	ROCKER_OF_DPA_TABLE_ID_INGRESS_PORT = 0,
+	ROCKER_OF_DPA_TABLE_ID_VLAN = 10,
+	ROCKER_OF_DPA_TABLE_ID_TERMINATION_MAC = 20,
+	ROCKER_OF_DPA_TABLE_ID_UNICAST_ROUTING = 30,
+	ROCKER_OF_DPA_TABLE_ID_MULTICAST_ROUTING = 40,
+	ROCKER_OF_DPA_TABLE_ID_BRIDGING = 50,
+	ROCKER_OF_DPA_TABLE_ID_ACL_POLICY = 60,
+};
+
+/* OF_DPA_xxx nest */
+enum {
+	ROCKER_TLV_OF_DPA_INFO_UNSPEC,
+	ROCKER_TLV_OF_DPA_INFO_IN_LPORT,		/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_IN_LPORT_MASK,		/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_OUT_LPORT,		/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_GOTO_TABLE_ID,		/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_GROUP_ID,		/* u32 */
+	ROCKER_TLV_OF_DPA_INFO_VLAN_ID,			/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_VLAN_ID_MASK,		/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_VLAN_PCP,		/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_VLAN_PCP_MASK,		/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_VLAN_PCP_ACTION,		/* u8 */
+	ROCKER_TLV_OF_DPA_INFO_NEW_VLAN_ID,		/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_NEW_VLAN_PCP,		/* u8 */
+	ROCKER_TLV_OF_DPA_INFO_TUNNEL_ID,		/* u32 */
+	ROCKER_TLV_OF_DPA_INFO_TUN_LOG_LPORT,		/* u32 */
+	ROCKER_TLV_OF_DPA_INFO_ETHERTYPE,		/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_DST_MAC,			/* binary */
+	ROCKER_TLV_OF_DPA_INFO_DST_MAC_MASK,		/* binary */
+	ROCKER_TLV_OF_DPA_INFO_SRC_MAC,			/* binary */
+	ROCKER_TLV_OF_DPA_INFO_SRC_MAC_MASK,		/* binary */
+	ROCKER_TLV_OF_DPA_INFO_IP_PROTO,		/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_IP_PROTO_MASK,		/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_DSCP,			/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_DSCP_MASK,		/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_DSCP_ACTION,		/* u8 */
+	ROCKER_TLV_OF_DPA_INFO_NEW_DSCP,		/* u8 */
+	ROCKER_TLV_OF_DPA_INFO_ECN,			/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_ECN_MASK,		/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_DST_IP,			/* binary */
+	ROCKER_TLV_OF_DPA_INFO_DST_IP_MASK,		/* binary */
+	ROCKER_TLV_OF_DPA_INFO_SRC_IP,			/* binary */
+	ROCKER_TLV_OF_DPA_INFO_SRC_IP_MASK,		/* binary */
+	ROCKER_TLV_OF_DPA_INFO_DST_IPV6,		/* binary */
+	ROCKER_TLV_OF_DPA_INFO_DST_IPV6_MASK,		/* binary */
+	ROCKER_TLV_OF_DPA_INFO_SRC_IPV6,		/* binary */
+	ROCKER_TLV_OF_DPA_INFO_SRC_IPV6_MASK,		/* binary */
+	ROCKER_TLV_OF_DPA_INFO_SRC_ARP_IP,		/* u32 */
+	ROCKER_TLV_OF_DPA_INFO_SRC_ARP_IP_MASK,		/* u32 */
+	ROCKER_TLV_OF_DPA_INFO_L4_DST_PORT,		/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_L4_DST_PORT_MASK,	/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_L4_SRC_PORT,		/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_L4_SRC_PORT_MASK,	/* u16 */
+	ROCKER_TLV_OF_DPA_INFO_ICMP_TYPE,		/* u8 */
+	ROCKER_TLV_OF_DPA_INFO_ICMP_TYPE_MASK,		/* u8 */
+	ROCKER_TLV_OF_DPA_INFO_ICMP_CODE,		/* u8 */
+	ROCKER_TLV_OF_DPA_INFO_ICMP_CODE_MASK,		/* u8 */
+	ROCKER_TLV_OF_DPA_INFO_IPV6_LABEL,		/* u32 */
+	ROCKER_TLV_OF_DPA_INFO_IPV6_LABEL_MASK,		/* u32 */
+	ROCKER_TLV_OF_DPA_INFO_QUEUE_ID_ACTION,		/* u8 */
+	ROCKER_TLV_OF_DPA_INFO_NEW_QUEUE_ID,		/* u8 */
+	ROCKER_TLV_OF_DPA_INFO_CLEAR_ACTIONS,		/* u32 */
+
+	__ROCKER_TLV_OF_DPA_INFO_MAX,
+	ROCKER_TLV_OF_DPA_INFO_MAX = __ROCKER_TLV_OF_DPA_INFO_MAX - 1,
+};
+
+/* OF-DPA flow stats */
+enum {
+	ROCKER_TLV_OF_DPA_FLOW_STAT_UNSPEC,
+	ROCKER_TLV_OF_DPA_FLOW_STAT_DURATION,	/* u32 */
+	ROCKER_TLV_OF_DPA_FLOW_STAT_RX_PKTS,	/* u64 */
+	ROCKER_TLV_OF_DPA_FLOW_STAT_TX_PKTS,	/* u64 */
+
+	__ROCKER_TLV_OF_DPA_FLOW_STAT_MAX,
+	ROCKER_TLV_OF_DPA_FLOW_STAT_MAX = __ROCKER_TLV_OF_DPA_FLOW_STAT_MAX - 1,
+};
+
+/* OF-DPA group types */
+enum rocker_of_dpa_group_type {
+	ROCKER_OF_DPA_GROUP_TYPE_L2_INTERFACE = 0,
+	ROCKER_OF_DPA_GROUP_TYPE_L2_REWRITE,
+	ROCKER_OF_DPA_GROUP_TYPE_L3_UCAST,
+	ROCKER_OF_DPA_GROUP_TYPE_L2_MCAST,
+	ROCKER_OF_DPA_GROUP_TYPE_L2_FLOOD,
+	ROCKER_OF_DPA_GROUP_TYPE_L3_INTERFACE,
+	ROCKER_OF_DPA_GROUP_TYPE_L3_MCAST,
+	ROCKER_OF_DPA_GROUP_TYPE_L3_ECMP,
+	ROCKER_OF_DPA_GROUP_TYPE_L2_OVERLAY,
+};
+
+/* OF-DPA group L2 overlay types */
+enum rocker_of_dpa_overlay_type {
+	ROCKER_OF_DPA_OVERLAY_TYPE_FLOOD_UCAST = 0,
+	ROCKER_OF_DPA_OVERLAY_TYPE_FLOOD_MCAST,
+	ROCKER_OF_DPA_OVERLAY_TYPE_MCAST_UCAST,
+	ROCKER_OF_DPA_OVERLAY_TYPE_MCAST_MCAST,
+};
+
+/* OF-DPA group ID encoding */
+#define ROCKER_GROUP_TYPE_SHIFT 28
+#define ROCKER_GROUP_TYPE_MASK 0xf0000000
+#define ROCKER_GROUP_VLAN_SHIFT 16
+#define ROCKER_GROUP_VLAN_MASK 0x0fff0000
+#define ROCKER_GROUP_PORT_SHIFT 0
+#define ROCKER_GROUP_PORT_MASK 0x0000ffff
+#define ROCKER_GROUP_TUNNEL_ID_SHIFT 12
+#define ROCKER_GROUP_TUNNEL_ID_MASK 0x0ffff000
+#define ROCKER_GROUP_SUBTYPE_SHIFT 10
+#define ROCKER_GROUP_SUBTYPE_MASK 0x00000c00
+#define ROCKER_GROUP_INDEX_SHIFT 0
+#define ROCKER_GROUP_INDEX_MASK 0x0000ffff
+#define ROCKER_GROUP_INDEX_LONG_SHIFT 0
+#define ROCKER_GROUP_INDEX_LONG_MASK 0x0fffffff
+
+#define ROCKER_GROUP_TYPE_GET(group_id) \
+	(((group_id) & ROCKER_GROUP_TYPE_MASK) >> ROCKER_GROUP_TYPE_SHIFT)
+#define ROCKER_GROUP_TYPE_SET(type) \
+	(((type) << ROCKER_GROUP_TYPE_SHIFT) & ROCKER_GROUP_TYPE_MASK)
+#define ROCKER_GROUP_VLAN_GET(group_id) \
+	(((group_id) & ROCKER_GROUP_VLAN_MASK) >> ROCKER_GROUP_VLAN_SHIFT)
+#define ROCKER_GROUP_VLAN_SET(vlan_id) \
+	(((vlan_id) << ROCKER_GROUP_VLAN_SHIFT) & ROCKER_GROUP_VLAN_MASK)
+#define ROCKER_GROUP_PORT_GET(group_id) \
+	(((group_id) & ROCKER_GROUP_PORT_MASK) >> ROCKER_GROUP_PORT_SHIFT)
+#define ROCKER_GROUP_PORT_SET(port) \
+	(((port) << ROCKER_GROUP_PORT_SHIFT) & ROCKER_GROUP_PORT_MASK)
+#define ROCKER_GROUP_INDEX_GET(group_id) \
+	(((group_id) & ROCKER_GROUP_INDEX_MASK) >> ROCKER_GROUP_INDEX_SHIFT)
+#define ROCKER_GROUP_INDEX_SET(index) \
+	(((index) << ROCKER_GROUP_INDEX_SHIFT) & ROCKER_GROUP_INDEX_MASK)
+#define ROCKER_GROUP_INDEX_LONG_GET(group_id) \
+	(((group_id) & ROCKER_GROUP_INDEX_LONG_MASK) >> \
+	 ROCKER_GROUP_INDEX_LONG_SHIFT)
+#define ROCKER_GROUP_INDEX_LONG_SET(index) \
+	(((index) << ROCKER_GROUP_INDEX_LONG_SHIFT) & \
+	 ROCKER_GROUP_INDEX_LONG_MASK)
+
+#define ROCKER_GROUP_NONE 0
+#define ROCKER_GROUP_L2_INTERFACE(vlan_id, port) \
+	(ROCKER_GROUP_TYPE_SET(ROCKER_OF_DPA_GROUP_TYPE_L2_INTERFACE) |\
+	 ROCKER_GROUP_VLAN_SET(vlan_id) | ROCKER_GROUP_PORT_SET(port))
+#define ROCKER_GROUP_L2_MCAST(vlan_id, index) \
+	(ROCKER_GROUP_TYPE_SET(ROCKER_OF_DPA_GROUP_TYPE_L2_MCAST) |\
+	 ROCKER_GROUP_VLAN_SET(vlan_id) | ROCKER_GROUP_INDEX_SET(index))
+
+/* Rocker general purpose registers */
+#define ROCKER_CONTROL			0x0300
+#define ROCKER_PORT_PHYS_COUNT		0x0304
+#define ROCKER_PORT_PHYS_LINK_STATUS	0x0310 /* 8-byte */
+#define ROCKER_PORT_PHYS_ENABLE		0x0318 /* 8-byte */
+#define ROCKER_SWITCH_ID		0x0320 /* 8-byte */
+
+/* Rocker control bits */
+#define ROCKER_CONTROL_RESET		(1 << 0)
+
+#endif
-- 
1.9.3
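
For readers following the group ID encoding in the header above, here is a
minimal stand-alone sketch of how the 32-bit ID packs its fields (type in
bits 31:28, VLAN in bits 27:16, port or index in bits 15:0). The demo_ and
DEMO_ names are illustrative assumptions and are not part of the driver;
only the shifts, masks and group type values are taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Field layout copied from the ROCKER_GROUP_* defines in the patch. */
#define DEMO_GROUP_TYPE_SHIFT	28
#define DEMO_GROUP_TYPE_MASK	0xf0000000u
#define DEMO_GROUP_VLAN_SHIFT	16
#define DEMO_GROUP_VLAN_MASK	0x0fff0000u
#define DEMO_GROUP_PORT_SHIFT	0
#define DEMO_GROUP_PORT_MASK	0x0000ffffu

/* Group type values as listed in enum rocker_of_dpa_group_type. */
#define DEMO_GROUP_TYPE_L2_INTERFACE	0u
#define DEMO_GROUP_TYPE_L2_MCAST	3u

/* Shift a value into place and mask off anything that does not fit. */
static inline uint32_t demo_field_set(uint32_t val, unsigned int shift,
				      uint32_t mask)
{
	return (val << shift) & mask;
}

/* Mask a field out of a group ID and shift it down to bit 0. */
static inline uint32_t demo_field_get(uint32_t id, unsigned int shift,
				      uint32_t mask)
{
	return (id & mask) >> shift;
}

/* Equivalent of ROCKER_GROUP_L2_INTERFACE(vlan_id, port). */
static inline uint32_t demo_group_l2_interface(uint32_t vlan_id, uint32_t port)
{
	return demo_field_set(DEMO_GROUP_TYPE_L2_INTERFACE,
			      DEMO_GROUP_TYPE_SHIFT, DEMO_GROUP_TYPE_MASK) |
	       demo_field_set(vlan_id, DEMO_GROUP_VLAN_SHIFT,
			      DEMO_GROUP_VLAN_MASK) |
	       demo_field_set(port, DEMO_GROUP_PORT_SHIFT,
			      DEMO_GROUP_PORT_MASK);
}

/* Equivalent of ROCKER_GROUP_L2_MCAST(vlan_id, index); the index shares
 * bits 15:0 with the port field. */
static inline uint32_t demo_group_l2_mcast(uint32_t vlan_id, uint32_t index)
{
	return demo_field_set(DEMO_GROUP_TYPE_L2_MCAST,
			      DEMO_GROUP_TYPE_SHIFT, DEMO_GROUP_TYPE_MASK) |
	       demo_field_set(vlan_id, DEMO_GROUP_VLAN_SHIFT,
			      DEMO_GROUP_VLAN_MASK) |
	       demo_field_set(index, DEMO_GROUP_PORT_SHIFT,
			      DEMO_GROUP_PORT_MASK);
}

static inline uint32_t demo_group_type(uint32_t id)
{
	return demo_field_get(id, DEMO_GROUP_TYPE_SHIFT, DEMO_GROUP_TYPE_MASK);
}

static inline uint32_t demo_group_vlan(uint32_t id)
{
	return demo_field_get(id, DEMO_GROUP_VLAN_SHIFT, DEMO_GROUP_VLAN_MASK);
}

static inline uint32_t demo_group_port(uint32_t id)
{
	return demo_field_get(id, DEMO_GROUP_PORT_SHIFT, DEMO_GROUP_PORT_MASK);
}
```

With this layout, VLAN 100 on port 5 gives an L2 interface group ID of
0x00640005, and the getters recover the VLAN and port from that value.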

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [patch net-next v2 1/9] net: rename netdev_phys_port_id to more generic name
       [not found]   ` <1411134590-4586-2-git-send-email-jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
@ 2014-09-19 13:54     ` Jeff Kirsher
  0 siblings, 0 replies; 67+ messages in thread
From: Jeff Kirsher @ 2014-09-19 13:54 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, andy-QlMahl40kYEqcZcGjlUOXw,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, ronye-VPRAkNaXOzVWk0Htik3J/w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, ben-/+tVBieCtBitmTQ+vhA3Yw,
	buytenh-OLH4Qvv75CYX/NnBR394Jw,
	alexander.h.duyck-ral2JQCrhuEAvxtiuMwx3w,
	simon.horman-wFxRvT7yatFl57MIdRCFDg,
	roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
	jhs-jkUAjuhPggJWk0Htik3J/w, aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, nhorman-2XuSBdqkA4R54TAoqtyWWQ,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	dborkman-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q

On Fri, 2014-09-19 at 15:49 +0200, Jiri Pirko wrote:
> So this can be reused for identification of other "items" as well.
> 
> Signed-off-by: Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
> ---
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c |  2 +-
>  drivers/net/ethernet/intel/i40e/i40e_main.c      |  2 +-
>  drivers/net/ethernet/mellanox/mlx4/en_netdev.c   |  2 +-
>  drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c |  2 +-
>  include/linux/netdevice.h                        | 16
> ++++++++--------
>  net/core/dev.c                                   |  2 +-
>  net/core/net-sysfs.c                             |  2 +-
>  net/core/rtnetlink.c                             |  6 +++---
>  8 files changed, 17 insertions(+), 17 deletions(-)

Acked-by: Jeff Kirsher <jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

With regards to the i40e changes.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [patch net-next v2 0/9] introduce rocker switch driver with hardware accelerated datapath api
       [not found] ` <1411134590-4586-1-git-send-email-jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
  2014-09-19 13:49   ` [patch net-next v2 2/9] net: introduce generic switch devices support Jiri Pirko
@ 2014-09-19 14:15   ` David Laight
       [not found]     ` <063D6719AE5E284EB5DD2968C1650D6D17495CC6-VkEWCZq2GCInGFn1LkZF6NBPR1lH4CV8@public.gmane.org>
  1 sibling, 1 reply; 67+ messages in thread
From: David Laight @ 2014-09-19 14:15 UTC (permalink / raw)
  To: 'Jiri Pirko', netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, andy-QlMahl40kYEqcZcGjlUOXw,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, ronye-VPRAkNaXOzVWk0Htik3J/w,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, ben-/+tVBieCtBitmTQ+vhA3Yw,
	buytenh-OLH4Qvv75CYX/NnBR394Jw,
	alexander.h.duyck-ral2JQCrhuEAvxtiuMwx3w,
	simon.horman-wFxRvT7yatFl57MIdRCFDg,
	roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
	jhs-jkUAjuhPggJWk0Htik3J/w@public.gmane.org

From: Jiri Pirko
> This patchset can be divided into 3 main sections:
> - introduce switchdev api for implementing switch drivers
> - introduce switchdev generic netlink api for userspace manipulation
> - introduce rocker switch driver which implements switchdev api

Perhaps you should be including the name of what you are switching
in the name of the API?

Is this for interfacing to ethernet switches, TDM switches or
mechanical ones?
It isn't really clear from any of these commit messages.

	David

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [patch net-next v2 0/9] introduce rocker switch driver with hardware accelerated datapath api
       [not found]     ` <063D6719AE5E284EB5DD2968C1650D6D17495CC6-VkEWCZq2GCInGFn1LkZF6NBPR1lH4CV8@public.gmane.org>
@ 2014-09-19 14:20       ` Jiri Pirko
  2014-09-20  5:37         ` Florian Fainelli
  0 siblings, 1 reply; 67+ messages in thread
From: Jiri Pirko @ 2014-09-19 14:20 UTC (permalink / raw)
  To: David Laight
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, andy-QlMahl40kYEqcZcGjlUOXw,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, ronye-VPRAkNaXOzVWk0Htik3J/w,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, ben-/+tVBieCtBitmTQ+vhA3Yw,
	buytenh-OLH4Qvv75CYX/NnBR394Jw,
	alexander.h.duyck-ral2JQCrhuEAvxtiuMwx3w,
	simon.horman-wFxRvT7yatFl57MIdRCFDg,
	roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
	jhs-jkUAjuhPggJWk0Htik3J/w@public.gmane.org

Fri, Sep 19, 2014 at 04:15:32PM CEST, David.Laight-ZS65k/vG3HxXrIkS9f7CXA@public.gmane.org wrote:
>From: Jiri Pirko
>> This patchset can be divided into 3 main sections:
>> - introduce switchdev api for implementing switch drivers
>> - introduce switchdev generic netlink api for userspace manipulation
>> - introduce rocker switch driver which implements switchdev api
>
>Perhaps you should be including the name of what you are switching
>in the name of the API?
>
>Is this for interfacing to ethernet switches, TDM switches or
>mechanical ones?
>It isn't really clear from any of these commit messages.

I thought that wasn't necessary; I assumed it was clear that this is about
ethernet switches.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-19 13:49 ` [patch net-next v2 8/9] switchdev: introduce Netlink API Jiri Pirko
@ 2014-09-19 15:25   ` Jamal Hadi Salim
  2014-09-19 15:49     ` Jiri Pirko
  0 siblings, 1 reply; 67+ messages in thread
From: Jamal Hadi Salim @ 2014-09-19 15:25 UTC (permalink / raw)
  To: Jiri Pirko, netdev
  Cc: davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet, sfeldma, f.fainelli, roopa, linville,
	dev, jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a, buytenh,
	aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye,
	simon.horman, alexander.h.duyck

On 09/19/14 09:49, Jiri Pirko wrote:
> This patch exposes switchdev API using generic Netlink.
> Example userspace utility is here:
> https://github.com/jpirko/switchdev
>

Is this just a temporary test tool? Otherwise I don't see a reason
for its existence (or for the API that it feeds on).

cheers,
jamal

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-19 15:25   ` Jamal Hadi Salim
@ 2014-09-19 15:49     ` Jiri Pirko
  2014-09-19 17:57       ` Jamal Hadi Salim
  2014-09-20  3:41       ` Roopa Prabhu
  0 siblings, 2 replies; 67+ messages in thread
From: Jiri Pirko @ 2014-09-19 15:49 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: netdev, davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse,
	pshelar, azhou, ben, stephen, jeffrey.t.kirsher, vyasevic,
	xiyou.wangcong, john.r.fastabend, edumazet, sfeldma, f.fainelli,
	roopa, linville, dev, jasowang, ebiederm, nicolas.dichtel,
	ryazanov.s.a, buytenh, aviadr, nbd, alexei.starovoitov,
	Neil.Jerram, ronye, simon.horman, alexander.h.duyck

Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
>On 09/19/14 09:49, Jiri Pirko wrote:
>>This patch exposes switchdev API using generic Netlink.
>>Example userspace utility is here:
>>https://github.com/jpirko/switchdev
>>
>
>Is this just a temporary test tool? Otherwise i dont see reason
>for its existence (or the API that it feeds on).

Please read the conversation I had with Pravin and Jesse in the v1 thread.
Long story short, they would like to have the api separated from the ovs
datapath so the ovs daemon can use it to communicate directly with the
driver. Also, John Fastabend requested a way to work with driver flows
without using ovs; that was the original reason I created the switchdev
genl api.

Regarding the "sw" tool: yes, it is for testing purposes for now. The ovs
daemon will use the switchdev genl api directly.

I hope this clears it up.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-19 15:49     ` Jiri Pirko
@ 2014-09-19 17:57       ` Jamal Hadi Salim
  2014-09-19 22:12         ` John Fastabend
  2014-09-20  3:41       ` Roopa Prabhu
  1 sibling, 1 reply; 67+ messages in thread
From: Jamal Hadi Salim @ 2014-09-19 17:57 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse,
	pshelar, azhou, ben, stephen, jeffrey.t.kirsher, vyasevic,
	xiyou.wangcong, john.r.fastabend, edumazet, sfeldma, f.fainelli,
	roopa, linville, dev, jasowang, ebiederm, nicolas.dichtel,
	ryazanov.s.a, buytenh, aviadr, nbd, alexei.starovoitov,
	Neil.Jerram, ronye, simon.horman, alexander.h.duyck

On 09/19/14 11:49, Jiri Pirko wrote:
> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:

>> Is this just a temporary test tool? Otherwise i dont see reason
>> for its existence (or the API that it feeds on).
>
> Please read the conversation I had with Pravin and Jesse in v1 thread.
> Long story short they like to have the api separated from ovs datapath
> so ovs daemon can use it to directly communicate with driver. Also John
> Fastabend requested a way to work with driver flows without using ovs ->
> that was the original reason I created switchdev genl api.
>
> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
> will use directly switchdev genl api.
>
> I hope I cleared this out.
>

It is - thanks Jiri.

cheers,
jamal

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-19 17:57       ` Jamal Hadi Salim
@ 2014-09-19 22:12         ` John Fastabend
  2014-09-19 22:18           ` Jamal Hadi Salim
                             ` (2 more replies)
  0 siblings, 3 replies; 67+ messages in thread
From: John Fastabend @ 2014-09-19 22:12 UTC (permalink / raw)
  To: Jamal Hadi Salim, Jiri Pirko
  Cc: netdev, davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse,
	pshelar, azhou, ben, stephen, jeffrey.t.kirsher, vyasevic,
	xiyou.wangcong, edumazet, sfeldma, f.fainelli, roopa, linville,
	dev, jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a, buytenh,
	aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye,
	simon.horman, alexander.h.duyck

On 09/19/2014 10:57 AM, Jamal Hadi Salim wrote:
> On 09/19/14 11:49, Jiri Pirko wrote:
>> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
> 
>>> Is this just a temporary test tool? Otherwise i dont see reason
>>> for its existence (or the API that it feeds on).
>>
>> Please read the conversation I had with Pravin and Jesse in v1 thread.
>> Long story short they like to have the api separated from ovs datapath
>> so ovs daemon can use it to directly communicate with driver. Also John
>> Fastabend requested a way to work with driver flows without using ovs ->
>> that was the original reason I created switchdev genl api.
>>
>> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
>> will use directly switchdev genl api.
>>
>> I hope I cleared this out.
>>
> 
> It is - thanks Jiri.
> 
> cheers,
> jamal

Hi Jiri,

I was considering a slightly different approach where the
device would report via netlink the fields/actions it
supported rather than creating pre-defined enums for every
possible key.

I already need to have an API to report the fields/matches
that are supported, so why not have the device report
the headers as header fields (len, offset) and the
associated parse graph the hardware uses? Vendors should
have this already to describe/design their real hardware.

As always, it's better to have code, and when I get some
time I'll try to write it up. Maybe it's just a separate
classifier, although I don't actually want two hardware
flow APIs.

I see you dropped the RFC tag. Are you proposing we include
this now?

.John
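
To make the idea of device-reported header fields a bit more concrete,
here is a rough sketch of what a (len, offset) field description could
look like. Everything in it is hypothetical: the demo_ structures and
names do not exist in this patchset or in any netlink API, and the parse
graph (which header may follow which) is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one header field, in bits. */
struct demo_hdr_field {
	const char *name;
	uint16_t offset_bits;	/* offset from the start of the header */
	uint16_t len_bits;	/* field length */
};

/* Hypothetical description of one header as a list of fields. */
struct demo_hdr {
	const char *name;
	const struct demo_hdr_field *fields;
	size_t n_fields;
};

/* What a device might report for a plain Ethernet header. */
static const struct demo_hdr_field demo_eth_fields[] = {
	{ "dst",   0, 48 },
	{ "src",  48, 48 },
	{ "type", 96, 16 },
};

static const struct demo_hdr demo_eth = {
	"ethernet", demo_eth_fields,
	sizeof(demo_eth_fields) / sizeof(demo_eth_fields[0]),
};

/* Extract a field as a big-endian integer using only its description.
 * For simplicity this handles byte-aligned fields of up to 64 bits.
 * Returns 0 on success, -1 if the field is unsupported or out of range. */
static inline int demo_field_extract(const struct demo_hdr_field *f,
				     const uint8_t *hdr, size_t hdr_len,
				     uint64_t *out)
{
	size_t off = f->offset_bits / 8;
	size_t len = f->len_bits / 8;
	uint64_t v = 0;
	size_t i;

	if (f->offset_bits % 8 || f->len_bits % 8)
		return -1;
	if (len == 0 || len > 8 || off + len > hdr_len)
		return -1;

	for (i = 0; i < len; i++)
		v = (v << 8) | hdr[off + i];
	*out = v;
	return 0;
}
```

The point of the sketch is that a generic matcher only needs the reported
descriptions, not a pre-defined enum per field, to locate a key such as
the EtherType in a packet.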

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-19 22:12         ` John Fastabend
@ 2014-09-19 22:18           ` Jamal Hadi Salim
  2014-09-20  5:39             ` Florian Fainelli
  2014-09-20  8:17             ` Jiri Pirko
  2014-09-20  5:36           ` Florian Fainelli
  2014-09-20  8:14           ` Jiri Pirko
  2 siblings, 2 replies; 67+ messages in thread
From: Jamal Hadi Salim @ 2014-09-19 22:18 UTC (permalink / raw)
  To: John Fastabend, Jiri Pirko
  Cc: netdev, davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse,
	pshelar, azhou, ben, stephen, jeffrey.t.kirsher, vyasevic,
	xiyou.wangcong, edumazet, sfeldma, f.fainelli, roopa, linville,
	dev, jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a, buytenh,
	aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye,
	simon.horman, alexander.h.duyck

On 09/19/14 18:12, John Fastabend wrote:
> On 09/19/2014 10:57 AM, Jamal Hadi Salim wrote:
>> On 09/19/14 11:49, Jiri Pirko wrote:
>>> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
>>
>>>> Is this just a temporary test tool? Otherwise i dont see reason
>>>> for its existence (or the API that it feeds on).
>>>
>>> Please read the conversation I had with Pravin and Jesse in v1 thread.
>>> Long story short they like to have the api separated from ovs datapath
>>> so ovs daemon can use it to directly communicate with driver. Also John
>>> Fastabend requested a way to work with driver flows without using ovs ->
>>> that was the original reason I created switchdev genl api.
>>>
>>> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
>>> will use directly switchdev genl api.
>>>
>>> I hope I cleared this out.
>>>
>>
>> It is - thanks Jiri.
>>
>> cheers,
>> jamal
>
> Hi Jiri,
>
> I was considering a slightly different approach where the
> device would report via netlink the fields/actions it
> supported rather than creating pre-defined enums for every
> possible key.
>
> I already need to have an API to report fields/matches
> that are being supported why not have the device report
> the headers as header fields (len, offset) and the
> associated parse graph the hardware uses? Vendors should
> have this already to describe/design their real hardware.
>
> As always its better to have code and when I get some
> time I'll try to write it up. Maybe its just a separate
> classifier although I don't actually want two hardware
> flow APIs.
>
> I see you dropped the RFC tag are you proposing we include
> this now?
>


Actually, I just realized I missed something very basic that
Jiri said. I think I understand the tool being there for testing,
but I had assumed the same about the generic netlink api.
Jiri, are you saying that the generic netlink api is there to
stay?

cheers,
jamal

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-19 15:49     ` Jiri Pirko
  2014-09-19 17:57       ` Jamal Hadi Salim
@ 2014-09-20  3:41       ` Roopa Prabhu
  2014-09-20  8:09         ` Jiri Pirko
  2014-09-20  8:10         ` Scott Feldman
  1 sibling, 2 replies; 67+ messages in thread
From: Roopa Prabhu @ 2014-09-20  3:41 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Jamal Hadi Salim, netdev, davem, nhorman, andy, tgraf, dborkman,
	ogerlitz, jesse, pshelar, azhou, ben, stephen, jeffrey.t.kirsher,
	vyasevic, xiyou.wangcong, john.r.fastabend, edumazet, sfeldma,
	f.fainelli, linville, dev, jasowang, ebiederm, nicolas.dichtel,
	ryazanov.s.a, buytenh, aviadr, nbd, alexei.starovoitov,
	Neil.Jerram, ronye, simon.horman, alexander.h.duyck,
	Shrijeet Mukherjee

On 9/19/14, 8:49 AM, Jiri Pirko wrote:
> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
>> On 09/19/14 09:49, Jiri Pirko wrote:
>>> This patch exposes switchdev API using generic Netlink.
>>> Example userspace utility is here:
>>> https://github.com/jpirko/switchdev
>>>
>> Is this just a temporary test tool? Otherwise i dont see reason
>> for its existence (or the API that it feeds on).
> Please read the conversation I had with Pravin and Jesse in v1 thread.
> Long story short they like to have the api separated from ovs datapath
> so ovs daemon can use it to directly communicate with driver. Also John
> Fastabend requested a way to work with driver flows without using ovs ->
> that was the original reason I created switchdev genl api.
>
> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
> will use directly switchdev genl api.
>
> I hope I cleared this out.

We already have all the needed rtnetlink kernel api and userspace tools
around it to support all switching asic features, i.e., the rtnetlink api
is the switchdev api. We can do l2, l3 and acls with it.
It's unclear to me why we need another new netlink api, which would mean
that none of the existing tools to create bridges etc. will work on a
switchdev.
That seems like going in exactly the opposite direction to what we had
discussed earlier.

If a non-ovs flow interface is needed from userspace, we can extend the
existing interface to include flows.
I don't understand why we should replace the existing rtnetlink
switchdev api to accommodate flows.

Thanks,
Roopa

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [patch net-next v2 5/9] net: introduce dummy switch
       [not found]   ` <1411134590-4586-6-git-send-email-jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
@ 2014-09-20  5:21     ` Florian Fainelli
  2014-09-20  7:37       ` Jiri Pirko
  0 siblings, 1 reply; 67+ messages in thread
From: Florian Fainelli @ 2014-09-20  5:21 UTC (permalink / raw)
  To: Jiri Pirko, netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	jasowang-H+wXaHxf7aLQT0dZR+AlfA,
	john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, andy-QlMahl40kYEqcZcGjlUOXw,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	ronye-VPRAkNaXOzVWk0Htik3J/w,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, ben-/+tVBieCtBitmTQ+vhA3Yw,
	buytenh-OLH4Qvv75CYX/NnBR394Jw,
	alexander.h.duyck-ral2JQCrhuEAvxtiuMwx3w,
	simon.horman-wFxRvT7yatFl57MIdRCFDg,
	roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
	jhs-jkUAjuhPggJWk0Htik3J/w, aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, nhorman-2XuSBdqkA4R54TAoqtyWWQ,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	dborkman-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q

On 09/19/14 06:49, Jiri Pirko wrote:
> Dummy switch implementation using switchdev interface

This really looks like a DSA driver that has 0 ports and is not
attached to a useful network interface, and which registers its
own set of rtnl operations for a purpose that is unclear to me.

I think registering these rtnl ops is misleading, as it gives the false
impression that this is allowed and that people are actually encouraged
to do that for custom switch drivers, which completely defeats the
purpose of coming up with a generic API.

If we are to go that route anyway, I really prefer the way Felix did it
in swconfig, where the fake switch driver did something useful by being
attached to the loopback interface:

http://lists.openwall.net/netdev/2013/10/22/103
http://patchwork.ozlabs.org/patch/285478/

>
> Signed-off-by: Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
> ---
>   drivers/net/Kconfig          |   7 +++
>   drivers/net/Makefile         |   1 +
>   drivers/net/dummyswitch.c    | 130 +++++++++++++++++++++++++++++++++++++++++++
>   include/uapi/linux/if_link.h |   9 +++
>   4 files changed, 147 insertions(+)
>   create mode 100644 drivers/net/dummyswitch.c
>
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index c6f6f69..7822c74 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -71,6 +71,13 @@ config DUMMY
>   	  To compile this driver as a module, choose M here: the module
>   	  will be called dummy.
>
> +config NET_DUMMY_SWITCH
> +	tristate "Dummy switch net driver support"
> +	depends on NET_SWITCHDEV
> +	---help---
> +	  To compile this driver as a module, choose M here: the module
> +	  will be called dummyswitch.
> +
>   config EQUALIZER
>   	tristate "EQL (serial line load balancing) support"
>   	---help---
> diff --git a/drivers/net/Makefile b/drivers/net/Makefile
> index 61aefdd..3c835ba 100644
> --- a/drivers/net/Makefile
> +++ b/drivers/net/Makefile
> @@ -7,6 +7,7 @@
>   #
>   obj-$(CONFIG_BONDING) += bonding/
>   obj-$(CONFIG_DUMMY) += dummy.o
> +obj-$(CONFIG_NET_DUMMY_SWITCH) += dummyswitch.o
>   obj-$(CONFIG_EQUALIZER) += eql.o
>   obj-$(CONFIG_IFB) += ifb.o
>   obj-$(CONFIG_MACVLAN) += macvlan.o
> diff --git a/drivers/net/dummyswitch.c b/drivers/net/dummyswitch.c
> new file mode 100644
> index 0000000..e7a48f4
> --- /dev/null
> +++ b/drivers/net/dummyswitch.c
> @@ -0,0 +1,130 @@
> +/*
> + * drivers/net/dummyswitch.c - Dummy switch device
> + * Copyright (c) 2014 Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/kernel.h>
> +#include <linux/init.h>
> +#include <linux/netdevice.h>
> +#include <linux/etherdevice.h>
> +#include <net/rtnetlink.h>
> +
> +struct dummyswport_priv {
> +	struct netdev_phys_item_id psid;
> +};
> +
> +static netdev_tx_t dummyswport_start_xmit(struct sk_buff *skb,
> +					  struct net_device *dev)
> +{
> +	dev_kfree_skb(skb);
> +	return NETDEV_TX_OK;
> +}
> +
> +static int dummyswport_swdev_id_get(struct net_device *dev,
> +				    struct netdev_phys_item_id *psid)
> +{
> +	struct dummyswport_priv *dsp = netdev_priv(dev);
> +
> +	memcpy(psid, &dsp->psid, sizeof(*psid));
> +	return 0;
> +}
> +
> +static int dummyswport_change_carrier(struct net_device *dev, bool new_carrier)
> +{
> +	if (new_carrier)
> +		netif_carrier_on(dev);
> +	else
> +		netif_carrier_off(dev);
> +	return 0;
> +}
> +
> +static const struct net_device_ops dummyswport_netdev_ops = {
> +	.ndo_start_xmit		= dummyswport_start_xmit,
> +	.ndo_swdev_id_get	= dummyswport_swdev_id_get,
> +	.ndo_change_carrier	= dummyswport_change_carrier,
> +};
> +
> +static void dummyswport_setup(struct net_device *dev)
> +{
> +	ether_setup(dev);
> +
> +	/* Initialize the device structure. */
> +	dev->netdev_ops = &dummyswport_netdev_ops;
> +	dev->destructor = free_netdev;
> +
> +	/* Fill in device structure with ethernet-generic values. */
> +	dev->tx_queue_len = 0;
> +	dev->flags |= IFF_NOARP;
> +	dev->flags &= ~IFF_MULTICAST;
> +	dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
> +	dev->features	|= NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_TSO;
> +	dev->features	|= NETIF_F_HW_CSUM | NETIF_F_HIGHDMA | NETIF_F_LLTX;
> +	eth_hw_addr_random(dev);
> +}
> +
> +static int dummyswport_validate(struct nlattr *tb[], struct nlattr *data[])
> +{
> +	if (tb[IFLA_ADDRESS])
> +		return -EINVAL;
> +	if (!data || !data[IFLA_DUMMYSWPORT_PHYS_SWITCH_ID])
> +		return -EINVAL;
> +	return 0;
> +}
> +
> +static int dummyswport_newlink(struct net *src_net, struct net_device *dev,
> +			       struct nlattr *tb[], struct nlattr *data[])
> +{
> +	struct dummyswport_priv *dsp = netdev_priv(dev);
> +	int err;
> +
> +	dsp->psid.id_len = nla_len(data[IFLA_DUMMYSWPORT_PHYS_SWITCH_ID]);
> +	memcpy(dsp->psid.id, nla_data(data[IFLA_DUMMYSWPORT_PHYS_SWITCH_ID]),
> +	       dsp->psid.id_len);
> +
> +	err = register_netdevice(dev);
> +	if (err)
> +		return err;
> +
> +	netif_carrier_on(dev);
> +
> +	return 0;
> +}
> +
> +static const struct nla_policy dummyswport_policy[IFLA_DUMMYSWPORT_MAX + 1] = {
> +	[IFLA_DUMMYSWPORT_PHYS_SWITCH_ID] = { .type = NLA_BINARY,
> +					      .len = MAX_PHYS_ITEM_ID_LEN },
> +};
> +
> +static struct rtnl_link_ops dummyswport_link_ops __read_mostly = {
> +	.kind		= "dummyswport",
> +	.priv_size	= sizeof(struct dummyswport_priv),
> +	.setup		= dummyswport_setup,
> +	.validate	= dummyswport_validate,
> +	.newlink	= dummyswport_newlink,
> +	.policy		= dummyswport_policy,
> +	.maxtype	= IFLA_DUMMYSWPORT_MAX,
> +};
> +
> +static int __init dummysw_module_init(void)
> +{
> +	return rtnl_link_register(&dummyswport_link_ops);
> +}
> +
> +static void __exit dummysw_module_exit(void)
> +{
> +	rtnl_link_unregister(&dummyswport_link_ops);
> +}
> +
> +module_init(dummysw_module_init);
> +module_exit(dummysw_module_exit);
> +
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR("Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>");
> +MODULE_DESCRIPTION("Dummy switch device");
> +MODULE_ALIAS_RTNL_LINK("dummyswport");
> diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
> index c5ca3b9..bd24d69 100644
> --- a/include/uapi/linux/if_link.h
> +++ b/include/uapi/linux/if_link.h
> @@ -574,4 +574,13 @@ enum {
>
>   #define IFLA_HSR_MAX (__IFLA_HSR_MAX - 1)
>
> +/* DUMMYSWPORT section */
> +enum {
> +	IFLA_DUMMYSWPORT_UNSPEC,
> +	IFLA_DUMMYSWPORT_PHYS_SWITCH_ID,
> +	__IFLA_DUMMYSWPORT_MAX,
> +};
> +
> +#define IFLA_DUMMYSWPORT_MAX (__IFLA_DUMMYSWPORT_MAX - 1)
> +
>   #endif /* _UAPI_LINUX_IF_LINK_H */
>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [patch net-next v2 6/9] switchdev: add basic support for flow matching and actions
  2014-09-19 13:49 ` [patch net-next v2 6/9] switchdev: add basic support for flow matching and actions Jiri Pirko
@ 2014-09-20  5:32   ` Florian Fainelli
  2014-09-20  7:28     ` Jiri Pirko
  0 siblings, 1 reply; 67+ messages in thread
From: Florian Fainelli @ 2014-09-20  5:32 UTC (permalink / raw)
  To: Jiri Pirko, netdev
  Cc: davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet, jhs, sfeldma, roopa, linville, dev,
	jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a, buytenh,
	aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye,
	simon.horman, alexander.h.duyck

On 09/19/14 06:49, Jiri Pirko wrote:
> This patch adds basic support for flows. The infrastructure is prepared
> to easily add another flow matching types. So far, only the key one is
> implemented.
>
> Signed-off-by: Jiri Pirko <jiri@resnulli.us>
> ---

[snip]

>
> +struct swdev_flow_match_key {
> +	struct {
> +		u32	priority;	/* Packet QoS priority. */
> +		u32	in_port_ifindex; /* Input switch port ifindex (or 0). */
> +	} phy;
> +	struct {
> +		u8     src[ETH_ALEN];	/* Ethernet source address. */
> +		u8     dst[ETH_ALEN];	/* Ethernet destination address. */
> +		__be16 tci;		/* 0 if no VLAN, VLAN_TAG_PRESENT set otherwise. */

Humm, how about QinQ here? I would provision two more 16-bit fields so
we can do all sorts of VLAN matching.

You might want to allow for a 4 to 8 byte hardware switch tag as well.

> +		__be16 type;		/* Ethernet frame type. */
> +	} eth;
> +	struct {
> +		u8     proto;		/* IP protocol or lower 8 bits of ARP opcode. */
> +		u8     tos;		/* IP ToS. */
> +		u8     ttl;		/* IP TTL/hop limit. */
> +		u8     frag;		/* One of OVS_FRAG_TYPE_*. */

Options might be missing?

[snip]

> +
> +static void print_flow(const struct swdev_flow *flow, struct net_device *dev,
> +		       const char *comment)
> +{
> +	pr_debug("%s flow %s:\n", dev->name, comment);
> +	print_flow_match(&flow->match);
> +	print_flow_actions(flow->action, flow->action_count);
> +}

I am really not sure how much of this is valuable beyond early (as in,
right now) debugging. Don't we rather want a generic way to dump a given
flow in its native netlink format? Does that code have to be here in
the first place?
--
Florian
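
As an illustration of the QinQ remark on the tci field above, here is a
small sketch of a key that keeps the outer and inner VLAN TCI separately,
together with a parser that fills it from a raw frame. The demo_ names
are made up for this example and are not proposed for the patch; the key
stores values in host byte order and does not model VLAN_TAG_PRESENT.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_ETH_P_8021Q	0x8100	/* C-VLAN TPID */
#define DEMO_ETH_P_8021AD	0x88a8	/* S-VLAN (QinQ outer) TPID */

/* Hypothetical match key with room for two VLAN tags. */
struct demo_vlan_key {
	uint16_t outer_tci;	/* 0 if the frame has no VLAN tag */
	uint16_t inner_tci;	/* 0 if the frame has fewer than two tags */
	uint16_t type;		/* EtherType following the parsed tags */
	uint8_t n_tags;		/* number of tags parsed, 0..2 */
};

/* Read a big-endian 16-bit value. */
static inline uint16_t demo_rd16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/* Parse up to two VLAN tags from an Ethernet frame.
 * Returns 0 on success, -1 if the frame is truncated. */
static inline int demo_parse_vlans(const uint8_t *frame, size_t len,
				   struct demo_vlan_key *key)
{
	size_t off = 12;	/* skip destination and source MAC */
	uint16_t type;

	key->outer_tci = 0;
	key->inner_tci = 0;
	key->type = 0;
	key->n_tags = 0;

	if (len < off + 2)
		return -1;
	type = demo_rd16(frame + off);

	while ((type == DEMO_ETH_P_8021Q || type == DEMO_ETH_P_8021AD) &&
	       key->n_tags < 2) {
		/* Need TPID (2) + TCI (2) + the next type field (2). */
		if (len < off + 6)
			return -1;
		if (key->n_tags == 0)
			key->outer_tci = demo_rd16(frame + off + 2);
		else
			key->inner_tci = demo_rd16(frame + off + 2);
		key->n_tags++;
		off += 4;
		type = demo_rd16(frame + off);
	}

	key->type = type;
	return 0;
}
```

A single __be16 tci can only describe one of the two tags in a
double-tagged frame, which is the gap the extra fields would close.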

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-19 22:12         ` John Fastabend
  2014-09-19 22:18           ` Jamal Hadi Salim
@ 2014-09-20  5:36           ` Florian Fainelli
  2014-09-20  8:14           ` Jiri Pirko
  2 siblings, 0 replies; 67+ messages in thread
From: Florian Fainelli @ 2014-09-20  5:36 UTC (permalink / raw)
  To: John Fastabend, Jamal Hadi Salim, Jiri Pirko
  Cc: netdev, davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse,
	pshelar, azhou, ben, stephen, jeffrey.t.kirsher, vyasevic,
	xiyou.wangcong, edumazet, sfeldma, roopa, linville, dev,
	jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a, buytenh,
	aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye,
	simon.horman, alexander.h.duyck

On 09/19/14 15:12, John Fastabend wrote:
> On 09/19/2014 10:57 AM, Jamal Hadi Salim wrote:
>> On 09/19/14 11:49, Jiri Pirko wrote:
>>> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
>>
>>>> Is this just a temporary test tool? Otherwise i dont see reason
>>>> for its existence (or the API that it feeds on).
>>>
>>> Please read the conversation I had with Pravin and Jesse in v1 thread.
>>> Long story short they like to have the api separated from ovs datapath
>>> so ovs daemon can use it to directly communicate with driver. Also John
>>> Fastabend requested a way to work with driver flows without using ovs ->
>>> that was the original reason I created switchdev genl api.
>>>
>>> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
>>> will use directly switchdev genl api.
>>>
>>> I hope I cleared this out.
>>>
>>
>> It is - thanks Jiri.
>>
>> cheers,
>> jamal
>
> Hi Jiri,
>
> I was considering a slightly different approach where the
> device would report via netlink the fields/actions it
> supported rather than creating pre-defined enums for every
> possible key.
>
> I already need to have an API to report fields/matches
> that are being supported why not have the device report
> the headers as header fields (len, offset) and the
> associated parse graph the hardware uses? Vendors should
> have this already to describe/design their real hardware.

Hmm, would that not go slightly against coming up with a netlink API that
is generic? Surely we could pay close attention when reviewing what is
being added and spot when a common API needs to be introduced...

This might become very similar to the private ioctl(), private wireless
extensions, and nl80211 testmode, and, well, those are not extremely pretty.

>
> As always its better to have code and when I get some
> time I'll try to write it up. Maybe its just a separate
> classifier although I don't actually want two hardware
> flow APIs.
>
> I see you dropped the RFC tag are you proposing we include
> this now?
>
> .John
>

* Re: [patch net-next v2 0/9] introduce rocker switch driver with hardware accelerated datapath api
  2014-09-19 14:20       ` Jiri Pirko
@ 2014-09-20  5:37         ` Florian Fainelli
  0 siblings, 0 replies; 67+ messages in thread
From: Florian Fainelli @ 2014-09-20  5:37 UTC (permalink / raw)
  To: Jiri Pirko, David Laight
  Cc: netdev, davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse,
	pshelar, azhou, ben, stephen, jeffrey.t.kirsher, vyasevic,
	xiyou.wangcong, john.r.fastabend, edumazet, jhs,
	sfeldma@cumulusnetworks.com

On 09/19/14 07:20, Jiri Pirko wrote:
> Fri, Sep 19, 2014 at 04:15:32PM CEST, David.Laight@ACULAB.COM wrote:
>> From: Jiri Pirko
>>> This patchset can be divided into 3 main sections:
>>> - introduce switchdev api for implementing switch drivers
>>> - introduce switchdev generic netlink api for userspace manipulation
>>> - introduce rocker switch driver which implements switchdev api
>>
>> Perhaps you should be including the name of what you are switching
>> in the name of the API?
>>
>> Is this for interfacing to ethernet switches, TDM switches or
>> mechanical ones?
>> It isn't really clear from any of these commit messages.
>
> I thought that is isn't necessary, that it is clear this is about ethernet
> switches.
>

How about putting some of this code in net/ethernet/* to make it very, very
obvious that this is what this thing is about?
--
Florian

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-19 22:18           ` Jamal Hadi Salim
@ 2014-09-20  5:39             ` Florian Fainelli
  2014-09-20  8:25               ` Jiri Pirko
  2014-09-20  8:17             ` Jiri Pirko
  1 sibling, 1 reply; 67+ messages in thread
From: Florian Fainelli @ 2014-09-20  5:39 UTC (permalink / raw)
  To: Jamal Hadi Salim, John Fastabend, Jiri Pirko
  Cc: netdev, davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse,
	pshelar, azhou, ben, stephen, jeffrey.t.kirsher, vyasevic,
	xiyou.wangcong, edumazet, sfeldma, roopa, linville, dev,
	jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a, buytenh,
	aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye,
	simon.horman, alexander.h.duyck

On 09/19/14 15:18, Jamal Hadi Salim wrote:
> On 09/19/14 18:12, John Fastabend wrote:
>> On 09/19/2014 10:57 AM, Jamal Hadi Salim wrote:
>>> On 09/19/14 11:49, Jiri Pirko wrote:
>>>> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
>>>
>>>>> Is this just a temporary test tool? Otherwise i dont see reason
>>>>> for its existence (or the API that it feeds on).
>>>>
>>>> Please read the conversation I had with Pravin and Jesse in v1 thread.
>>>> Long story short they like to have the api separated from ovs datapath
>>>> so ovs daemon can use it to directly communicate with driver. Also John
>>>> Fastabend requested a way to work with driver flows without using
>>>> ovs ->
>>>> that was the original reason I created switchdev genl api.
>>>>
>>>> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
>>>> will use directly switchdev genl api.
>>>>
>>>> I hope I cleared this out.
>>>>
>>>
>>> It is - thanks Jiri.
>>>
>>> cheers,
>>> jamal
>>
>> Hi Jiri,
>>
>> I was considering a slightly different approach where the
>> device would report via netlink the fields/actions it
>> supported rather than creating pre-defined enums for every
>> possible key.
>>
>> I already need to have an API to report fields/matches
>> that are being supported why not have the device report
>> the headers as header fields (len, offset) and the
>> associated parse graph the hardware uses? Vendors should
>> have this already to describe/design their real hardware.
>>
>> As always its better to have code and when I get some
>> time I'll try to write it up. Maybe its just a separate
>> classifier although I don't actually want two hardware
>> flow APIs.
>>
>> I see you dropped the RFC tag are you proposing we include
>> this now?
>>
>
>
> Actually I just realized i missed something very basic that
> Jiri said. I think i understand the tool being there for testing
> but i am assumed the same about the genlink api.
> Jiri, are you saying that genlink api is there to
> stay?

So, I really have mixed feelings about this netlink API, in particular
because it is not clear to me where the line is between what should be a
network device ndo operation, what should be an ethtool command, what
should be a netlink message, and the rest.

I can certainly acknowledge the fact that manipulating flows is not
ideal with the current set of tools, but really, once we are there with
netlink, how far are we from not having any network devices at all, and
how does that differ from OpenWrt's swconfig in the end [1]?

[1]: https://lwn.net/Articles/571390/
--
Florian

* Re: [patch net-next v2 6/9] switchdev: add basic support for flow matching and actions
  2014-09-20  5:32   ` Florian Fainelli
@ 2014-09-20  7:28     ` Jiri Pirko
  0 siblings, 0 replies; 67+ messages in thread
From: Jiri Pirko @ 2014-09-20  7:28 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: netdev, davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse,
	pshelar, azhou, ben, stephen, jeffrey.t.kirsher, vyasevic,
	xiyou.wangcong, john.r.fastabend, edumazet, jhs, sfeldma, roopa,
	linville, dev, jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a,
	buytenh, aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye,
	simon.horman, alexander.h.duyck

Sat, Sep 20, 2014 at 07:32:08AM CEST, f.fainelli@gmail.com wrote:
>On 09/19/14 06:49, Jiri Pirko wrote:
>>This patch adds basic support for flows. The infrastructure is prepared
>>to easily add another flow matching types. So far, only the key one is
>>implemented.
>>
>>Signed-off-by: Jiri Pirko <jiri@resnulli.us>
>>---
>
>[snip]
>
>>
>>+struct swdev_flow_match_key {
>>+	struct {
>>+		u32	priority;	/* Packet QoS priority. */
>>+		u32	in_port_ifindex; /* Input switch port ifindex (or 0). */
>>+	} phy;
>>+	struct {
>>+		u8     src[ETH_ALEN];	/* Ethernet source address. */
>>+		u8     dst[ETH_ALEN];	/* Ethernet destination address. */
>>+		__be16 tci;		/* 0 if no VLAN, VLAN_TAG_PRESENT set otherwise. */
>
>Humm, how about QinQ here? I would provision two more 16 bits fields so we
>can do all sorts of VLAN matching.
>
>You might want to allow for a 4 to 8 bytes hardware switch tag as well.

Note that this structure is not carved in stone and can easily be adjusted
at any time. So when the changes you are describing are needed, we can
make them.
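
For illustration only, the QinQ suggestion could look roughly like the userspace sketch below. None of this is from the patch: the example_* names, the second TCI field, and the use of host byte order are all assumptions made to keep the sketch short and runnable outside the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_ETH_ALEN	6
#define EXAMPLE_VID_MASK	0x0fff	/* low 12 bits of a TCI hold the VLAN ID */
#define EXAMPLE_TAG_PRESENT	0x1000	/* hypothetical stand-in for a "tag present" marker */

/* Hypothetical Ethernet part of a match key with room for two VLAN tags. */
struct example_eth_key {
	uint8_t  src[EXAMPLE_ETH_ALEN];	/* Ethernet source address. */
	uint8_t  dst[EXAMPLE_ETH_ALEN];	/* Ethernet destination address. */
	uint16_t outer_tci;		/* Outer (S-tag) TCI, 0 if absent. Host order here. */
	uint16_t inner_tci;		/* Inner (C-tag) TCI, 0 if absent. Host order here. */
	uint16_t type;			/* Ethernet frame type. */
};

/* Return true only if both tags are present and both VLAN IDs match. */
static bool example_qinq_match(const struct example_eth_key *key,
			       uint16_t outer_vid, uint16_t inner_vid)
{
	assert(key != NULL);

	if (!(key->outer_tci & EXAMPLE_TAG_PRESENT) ||
	    !(key->inner_tci & EXAMPLE_TAG_PRESENT))
		return false;

	return (key->outer_tci & EXAMPLE_VID_MASK) == outer_vid &&
	       (key->inner_tci & EXAMPLE_VID_MASK) == inner_vid;
}
```

A real key would presumably keep the fields as __be16 and pair the key with a mask, so that either tag can be wildcarded.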


>
>>+		__be16 type;		/* Ethernet frame type. */
>>+	} eth;
>>+	struct {
>>+		u8     proto;		/* IP protocol or lower 8 bits of ARP opcode. */
>>+		u8     tos;		/* IP ToS. */
>>+		u8     ttl;		/* IP TTL/hop limit. */
>>+		u8     frag;		/* One of OVS_FRAG_TYPE_*. */
>
>Options might be missing?
>
>[snip]
>
>>+
>>+static void print_flow(const struct swdev_flow *flow, struct net_device *dev,
>>+		       const char *comment)
>>+{
>>+	pr_debug("%s flow %s:\n", dev->name, comment);
>>+	print_flow_match(&flow->match);
>>+	print_flow_actions(flow->action, flow->action_count);
>>+}
>
>I am really not sure how much of this valuable besides early (as in, right
>now) debugging, don't we rather want a generic way to dump a given flow under
>a its native netlink format, does that code has to be here in the first
>place?

Hmm, I think you have a point here, let me think about that.

>--
>Florian

* Re: [patch net-next v2 5/9] net: introduce dummy switch
  2014-09-20  5:21     ` Florian Fainelli
@ 2014-09-20  7:37       ` Jiri Pirko
  0 siblings, 0 replies; 67+ messages in thread
From: Jiri Pirko @ 2014-09-20  7:37 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: netdev, davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse,
	pshelar, azhou, ben, stephen, jeffrey.t.kirsher, vyasevic,
	xiyou.wangcong, john.r.fastabend, edumazet, jhs, sfeldma, roopa,
	linville, dev, jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a,
	buytenh, aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye,
	simon.horman, alexander.h.duyck

Sat, Sep 20, 2014 at 07:21:00AM CEST, f.fainelli@gmail.com wrote:
>On 09/19/14 06:49, Jiri Pirko wrote:
>>Dummy switch implementation using switchdev interface
>
>This really looks like a DSA driver that has 0 ports, and is not attached to
>an useful network interface, and which is registering its own set of rtnl
>operations for a purpose that is unclear to me.
>
>I think registering these rtnl ops is misleading as it leads to a false idea
>that this is allowed, and that people are actually encouraged to do that for
>custom switch drivers, and this completely defeats the purpose of coming up
>with a generic API.
>
>If we are to go that route anyway, I really prefer the way Felix did it in
>swconfig, and the fake switch driver did do something useful being attached
>to the loopback interface:
>
>http://lists.openwall.net/netdev/2013/10/22/103
>http://patchwork.ozlabs.org/patch/285478/

I will drop dummyswitch. It serves primarily as an example, and since
rocker is here, it is no longer needed.


>
>>
>>Signed-off-by: Jiri Pirko <jiri@resnulli.us>
>>---
>>  drivers/net/Kconfig          |   7 +++
>>  drivers/net/Makefile         |   1 +
>>  drivers/net/dummyswitch.c    | 130 +++++++++++++++++++++++++++++++++++++++++++
>>  include/uapi/linux/if_link.h |   9 +++
>>  4 files changed, 147 insertions(+)
>>  create mode 100644 drivers/net/dummyswitch.c
>>
>>diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
>>index c6f6f69..7822c74 100644
>>--- a/drivers/net/Kconfig
>>+++ b/drivers/net/Kconfig
>>@@ -71,6 +71,13 @@ config DUMMY
>>  	  To compile this driver as a module, choose M here: the module
>>  	  will be called dummy.
>>
>>+config NET_DUMMY_SWITCH
>>+	tristate "Dummy switch net driver support"
>>+	depends on NET_SWITCHDEV
>>+	---help---
>>+	  To compile this driver as a module, choose M here: the module
>>+	  will be called dummyswitch.
>>+
>>  config EQUALIZER
>>  	tristate "EQL (serial line load balancing) support"
>>  	---help---
>>diff --git a/drivers/net/Makefile b/drivers/net/Makefile
>>index 61aefdd..3c835ba 100644
>>--- a/drivers/net/Makefile
>>+++ b/drivers/net/Makefile
>>@@ -7,6 +7,7 @@
>>  #
>>  obj-$(CONFIG_BONDING) += bonding/
>>  obj-$(CONFIG_DUMMY) += dummy.o
>>+obj-$(CONFIG_NET_DUMMY_SWITCH) += dummyswitch.o
>>  obj-$(CONFIG_EQUALIZER) += eql.o
>>  obj-$(CONFIG_IFB) += ifb.o
>>  obj-$(CONFIG_MACVLAN) += macvlan.o
>>diff --git a/drivers/net/dummyswitch.c b/drivers/net/dummyswitch.c
>>new file mode 100644
>>index 0000000..e7a48f4
>>--- /dev/null
>>+++ b/drivers/net/dummyswitch.c
>>@@ -0,0 +1,130 @@
>>+/*
>>+ * drivers/net/dummyswitch.c - Dummy switch device
>>+ * Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us>
>>+ *
>>+ * This program is free software; you can redistribute it and/or modify
>>+ * it under the terms of the GNU General Public License as published by
>>+ * the Free Software Foundation; either version 2 of the License, or
>>+ * (at your option) any later version.
>>+ */
>>+
>>+#include <linux/module.h>
>>+#include <linux/kernel.h>
>>+#include <linux/init.h>
>>+#include <linux/netdevice.h>
>>+#include <linux/etherdevice.h>
>>+#include <net/rtnetlink.h>
>>+
>>+struct dummyswport_priv {
>>+	struct netdev_phys_item_id psid;
>>+};
>>+
>>+static netdev_tx_t dummyswport_start_xmit(struct sk_buff *skb,
>>+					  struct net_device *dev)
>>+{
>>+	dev_kfree_skb(skb);
>>+	return NETDEV_TX_OK;
>>+}
>>+
>>+static int dummyswport_swdev_id_get(struct net_device *dev,
>>+				    struct netdev_phys_item_id *psid)
>>+{
>>+	struct dummyswport_priv *dsp = netdev_priv(dev);
>>+
>>+	memcpy(psid, &dsp->psid, sizeof(*psid));
>>+	return 0;
>>+}
>>+
>>+static int dummyswport_change_carrier(struct net_device *dev, bool new_carrier)
>>+{
>>+	if (new_carrier)
>>+		netif_carrier_on(dev);
>>+	else
>>+		netif_carrier_off(dev);
>>+	return 0;
>>+}
>>+
>>+static const struct net_device_ops dummyswport_netdev_ops = {
>>+	.ndo_start_xmit		= dummyswport_start_xmit,
>>+	.ndo_swdev_id_get	= dummyswport_swdev_id_get,
>>+	.ndo_change_carrier	= dummyswport_change_carrier,
>>+};
>>+
>>+static void dummyswport_setup(struct net_device *dev)
>>+{
>>+	ether_setup(dev);
>>+
>>+	/* Initialize the device structure. */
>>+	dev->netdev_ops = &dummyswport_netdev_ops;
>>+	dev->destructor = free_netdev;
>>+
>>+	/* Fill in device structure with ethernet-generic values. */
>>+	dev->tx_queue_len = 0;
>>+	dev->flags |= IFF_NOARP;
>>+	dev->flags &= ~IFF_MULTICAST;
>>+	dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
>>+	dev->features	|= NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_TSO;
>>+	dev->features	|= NETIF_F_HW_CSUM | NETIF_F_HIGHDMA | NETIF_F_LLTX;
>>+	eth_hw_addr_random(dev);
>>+}
>>+
>>+static int dummyswport_validate(struct nlattr *tb[], struct nlattr *data[])
>>+{
>>+	if (tb[IFLA_ADDRESS])
>>+		return -EINVAL;
>>+	if (!data || !data[IFLA_DUMMYSWPORT_PHYS_SWITCH_ID])
>>+		return -EINVAL;
>>+	return 0;
>>+}
>>+
>>+static int dummyswport_newlink(struct net *src_net, struct net_device *dev,
>>+			       struct nlattr *tb[], struct nlattr *data[])
>>+{
>>+	struct dummyswport_priv *dsp = netdev_priv(dev);
>>+	int err;
>>+
>>+	dsp->psid.id_len = nla_len(data[IFLA_DUMMYSWPORT_PHYS_SWITCH_ID]);
>>+	memcpy(dsp->psid.id, nla_data(data[IFLA_DUMMYSWPORT_PHYS_SWITCH_ID]),
>>+	       dsp->psid.id_len);
>>+
>>+	err = register_netdevice(dev);
>>+	if (err)
>>+		return err;
>>+
>>+	netif_carrier_on(dev);
>>+
>>+	return 0;
>>+}
>>+
>>+static const struct nla_policy dummyswport_policy[IFLA_DUMMYSWPORT_MAX + 1] = {
>>+	[IFLA_DUMMYSWPORT_PHYS_SWITCH_ID] = { .type = NLA_BINARY,
>>+					      .len = MAX_PHYS_ITEM_ID_LEN },
>>+};
>>+
>>+static struct rtnl_link_ops dummyswport_link_ops __read_mostly = {
>>+	.kind		= "dummyswport",
>>+	.priv_size	= sizeof(struct dummyswport_priv),
>>+	.setup		= dummyswport_setup,
>>+	.validate	= dummyswport_validate,
>>+	.newlink	= dummyswport_newlink,
>>+	.policy		= dummyswport_policy,
>>+	.maxtype	= IFLA_DUMMYSWPORT_MAX,
>>+};
>>+
>>+static int __init dummysw_module_init(void)
>>+{
>>+	return rtnl_link_register(&dummyswport_link_ops);
>>+}
>>+
>>+static void __exit dummysw_module_exit(void)
>>+{
>>+	rtnl_link_unregister(&dummyswport_link_ops);
>>+}
>>+
>>+module_init(dummysw_module_init);
>>+module_exit(dummysw_module_exit);
>>+
>>+MODULE_LICENSE("GPL v2");
>>+MODULE_AUTHOR("Jiri Pirko <jiri@resnulli.us>");
>>+MODULE_DESCRIPTION("Dummy switch device");
>>+MODULE_ALIAS_RTNL_LINK("dummyswport");
>>diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
>>index c5ca3b9..bd24d69 100644
>>--- a/include/uapi/linux/if_link.h
>>+++ b/include/uapi/linux/if_link.h
>>@@ -574,4 +574,13 @@ enum {
>>
>>  #define IFLA_HSR_MAX (__IFLA_HSR_MAX - 1)
>>
>>+/* DUMMYSWPORT section */
>>+enum {
>>+	IFLA_DUMMYSWPORT_UNSPEC,
>>+	IFLA_DUMMYSWPORT_PHYS_SWITCH_ID,
>>+	__IFLA_DUMMYSWPORT_MAX,
>>+};
>>+
>>+#define IFLA_DUMMYSWPORT_MAX (__IFLA_DUMMYSWPORT_MAX - 1)
>>+
>>  #endif /* _UAPI_LINUX_IF_LINK_H */
>>
>

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-20  3:41       ` Roopa Prabhu
@ 2014-09-20  8:09         ` Jiri Pirko
  2014-09-20 12:39           ` Roopa Prabhu
  2014-09-20  8:10         ` Scott Feldman
  1 sibling, 1 reply; 67+ messages in thread
From: Jiri Pirko @ 2014-09-20  8:09 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: Jamal Hadi Salim, netdev, davem, nhorman, andy, tgraf, dborkman,
	ogerlitz, jesse, pshelar, azhou, ben, stephen, jeffrey.t.kirsher,
	vyasevic, xiyou.wangcong, john.r.fastabend, edumazet, sfeldma,
	f.fainelli, linville, dev, jasowang, ebiederm, nicolas.dichtel,
	ryazanov.s.a, buytenh, aviadr, nbd, alexei.starovoitov,
	Neil.Jerram, ronye, simon.horman, alexander.h.duyck,
	Shrijeet Mukherjee

Sat, Sep 20, 2014 at 05:41:16AM CEST, roopa@cumulusnetworks.com wrote:
>On 9/19/14, 8:49 AM, Jiri Pirko wrote:
>>Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
>>>On 09/19/14 09:49, Jiri Pirko wrote:
>>>>This patch exposes switchdev API using generic Netlink.
>>>>Example userspace utility is here:
>>>>https://github.com/jpirko/switchdev
>>>>
>>>Is this just a temporary test tool? Otherwise i dont see reason
>>>for its existence (or the API that it feeds on).
>>Please read the conversation I had with Pravin and Jesse in v1 thread.
>>Long story short they like to have the api separated from ovs datapath
>>so ovs daemon can use it to directly communicate with driver. Also John
>>Fastabend requested a way to work with driver flows without using ovs ->
>>that was the original reason I created switchdev genl api.
>>
>>Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
>>will use directly switchdev genl api.
>>
>>I hope I cleared this out.
>We already have all the needed rtnetlink kernel api and userspace tools
>around it to support all
>switching asic features. ie, the rtnetlink api is the switchdev api. We can
>do l2, l3, acl's with it.
>Its unclear to me why we need another new netlink api. Which will mean none
>of the existing tools to
>create bridges etc will work on a switchdev.

No one is proposing such an API. Note that what I'm trying to solve in my
patchset is the flow world. There is only one API there, ovs genl. But
using it for hw offload purposes was nacked by the ovs maintainer. Plus, a
couple of people wanted to run the offloading independently of any ovs
instance. Therefore I introduced the switchdev genl, which takes care of
that. There is no plan to extend it to the other things you mentioned,
just flows.


>Which seems like going in the direction exactly opposite to what we had
>discussed earlier.

Nope. The previous discussion ignored flows.


>
>If a non-ovs flow interface is needed from userspace, we can extend the
>existing interface to include flows.

How? You mean to extend rtnetlink? What advantage would it bring
compared to a separate genl iface?


>I don't understand why we should replace the existing rtnetlink switchdev api
>to accommodate flows.

Sorry, I do not understand what "existing rtnetlink switchdev api" you
have in mind. Would you care to explain?


>
>Thanks,
>Roopa
>

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-20  3:41       ` Roopa Prabhu
  2014-09-20  8:09         ` Jiri Pirko
@ 2014-09-20  8:10         ` Scott Feldman
  2014-09-20 10:31           ` Jamal Hadi Salim
       [not found]           ` <DDC24110-C3F5-470F-B9BE-1D1792415D1E-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR@public.gmane.org>
  1 sibling, 2 replies; 67+ messages in thread
From: Scott Feldman @ 2014-09-20  8:10 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: Jiri Pirko, Jamal Hadi Salim, netdev, davem, nhorman, andy,
	tgraf, dborkman, ogerlitz, jesse, pshelar, azhou, ben, stephen,
	jeffrey.t.kirsher, vyasevic, xiyou.wangcong, john.r.fastabend,
	edumazet, f.fainelli, linville, dev, jasowang, ebiederm,
	nicolas.dichtel, ryazanov.s.a, buytenh, aviadr, nbd,
	alexei.starovoitov, Neil.Jerram, ronye, simon.horman,
	alexander.h.duyck, Shrijeet Mukherjee


On Sep 19, 2014, at 8:41 PM, Roopa Prabhu <roopa@cumulusnetworks.com> wrote:

> On 9/19/14, 8:49 AM, Jiri Pirko wrote:
>> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
>>> On 09/19/14 09:49, Jiri Pirko wrote:
>>>> This patch exposes switchdev API using generic Netlink.
>>>> Example userspace utility is here:
>>>> https://github.com/jpirko/switchdev
>>>> 
>>> Is this just a temporary test tool? Otherwise i dont see reason
>>> for its existence (or the API that it feeds on).
>> Please read the conversation I had with Pravin and Jesse in v1 thread.
>> Long story short they like to have the api separated from ovs datapath
>> so ovs daemon can use it to directly communicate with driver. Also John
>> Fastabend requested a way to work with driver flows without using ovs ->
>> that was the original reason I created switchdev genl api.
>> 
>> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
>> will use directly switchdev genl api.
>> 
>> I hope I cleared this out.
> We already have all the needed rtnetlink kernel api and userspace tools around it to support all
> switching asic features. ie, the rtnetlink api is the switchdev api. We can do l2, l3, acl's with it.
> Its unclear to me why we need another new netlink api. Which will mean none of the existing tools to
> create bridges etc will work on a switchdev.
> Which seems like going in the direction exactly opposite to what we had discussed earlier.

Existing rtnetlink isn’t available to swdev without some kind of snooping the echoes from the various kernel components (bridge, fib, etc).  With swdev_flow, as Jiri has defined it, there is an additional conversion needed to bridge the gap (bad expression, I know) between rtnetlink and swdev_flow.  This conversion happens in the kernel components.  For example, the bridge module, still driven from userspace by existing rtnetlink, will formulate the necessary swdev_flow insert/remove calls to the swdev driver such that HW will offload the fwd path.

You have:
    user -> rtnetlink -> kernel -> netlink echo -> [some process] -> [some driver] -> HW

Jiri has:
    user -> rtnetlink -> kernel -> swdev_* -> swdev driver -> HW
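
The second path can be sketched roughly as follows. This is a userspace illustration only, not code from the patchset: every example_* name is hypothetical, the flow structure is a much simpler stand-in for swdev_flow, and the "hardware" is just an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EXAMPLE_ETH_ALEN	6
#define EXAMPLE_HW_FLOWS	8

/* Hypothetical, heavily simplified stand-in for a flow: match + one action. */
struct example_flow {
	uint8_t  dst[EXAMPLE_ETH_ALEN];	/* match: destination MAC */
	uint16_t vid;			/* match: VLAN ID */
	uint32_t out_ifindex;		/* action: output port ifindex */
};

/* Hypothetical driver operations, playing the role of the swdev_* ndos. */
struct example_swdev_ops {
	int (*flow_insert)(void *priv, const struct example_flow *flow);
};

/* Mock hardware table owned by the mock driver. */
struct example_hw {
	struct example_flow table[EXAMPLE_HW_FLOWS];
	int count;
};

/* Mock driver: "programs" the flow by appending it to the table. */
static int example_hw_flow_insert(void *priv, const struct example_flow *flow)
{
	struct example_hw *hw = priv;

	if (hw->count >= EXAMPLE_HW_FLOWS)
		return -1;	/* table full */
	hw->table[hw->count++] = *flow;
	return 0;
}

static const struct example_swdev_ops example_ops = {
	.flow_insert = example_hw_flow_insert,
};

/*
 * Mock kernel component (think: bridge). It is driven by the existing
 * userspace interface, converts an FDB entry into a flow, and hands the
 * flow to the driver. No userspace process sits between it and the driver.
 */
static int example_bridge_fdb_add(const struct example_swdev_ops *ops,
				  void *priv, const uint8_t *mac,
				  uint16_t vid, uint32_t port_ifindex)
{
	struct example_flow flow;

	assert(ops != NULL && ops->flow_insert != NULL);

	memset(&flow, 0, sizeof(flow));
	memcpy(flow.dst, mac, EXAMPLE_ETH_ALEN);
	flow.vid = vid;
	flow.out_ifindex = port_ifindex;

	return ops->flow_insert(priv, &flow);
}
```

The point of the sketch is only where the conversion lives: inside the kernel component, rather than in a process listening to netlink echoes.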

> If a non-ovs flow interface is needed from userspace, we can extend the existing interface to include flows.
> I don't understand why we should replace the existing rtnetlink switchdev api to accommodate flows.
> 
> Thanks,
> Roopa
> 


-scott

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-19 22:12         ` John Fastabend
  2014-09-19 22:18           ` Jamal Hadi Salim
  2014-09-20  5:36           ` Florian Fainelli
@ 2014-09-20  8:14           ` Jiri Pirko
  2014-09-20 10:53             ` Thomas Graf
  2 siblings, 1 reply; 67+ messages in thread
From: Jiri Pirko @ 2014-09-20  8:14 UTC (permalink / raw)
  To: John Fastabend
  Cc: Jamal Hadi Salim, netdev, davem, nhorman, andy, tgraf, dborkman,
	ogerlitz, jesse, pshelar, azhou, ben, stephen, jeffrey.t.kirsher,
	vyasevic, xiyou.wangcong, edumazet, sfeldma, f.fainelli, roopa,
	linville, dev, jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a,
	buytenh, aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye,
	simon.horman, alexander.h.duyck

Sat, Sep 20, 2014 at 12:12:12AM CEST, john.r.fastabend@intel.com wrote:
>On 09/19/2014 10:57 AM, Jamal Hadi Salim wrote:
>> On 09/19/14 11:49, Jiri Pirko wrote:
>>> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
>> 
>>>> Is this just a temporary test tool? Otherwise i dont see reason
>>>> for its existence (or the API that it feeds on).
>>>
>>> Please read the conversation I had with Pravin and Jesse in v1 thread.
>>> Long story short they like to have the api separated from ovs datapath
>>> so ovs daemon can use it to directly communicate with driver. Also John
>>> Fastabend requested a way to work with driver flows without using ovs ->
>>> that was the original reason I created switchdev genl api.
>>>
>>> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
>>> will use directly switchdev genl api.
>>>
>>> I hope I cleared this out.
>>>
>> 
>> It is - thanks Jiri.
>> 
>> cheers,
>> jamal
>
>Hi Jiri,
>
>I was considering a slightly different approach where the
>device would report via netlink the fields/actions it
>supported rather than creating pre-defined enums for every
>possible key.
>
>I already need to have an API to report fields/matches
>that are being supported why not have the device report
>the headers as header fields (len, offset) and the
>associated parse graph the hardware uses? Vendors should
>have this already to describe/design their real hardware.

Hmm, let me think about this a bit more. I will try to figure out how to
handle that. Sound logic though. Will try to incorporate the idea in the
patchset.
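
As a rough idea of what "header fields (len, offset) plus a parse graph" could mean in practice, here is a small userspace sketch. It is not John's proposal and not code from the patchset: all example_* names, the bit-based offsets, and the tiny Ethernet/VLAN/IPv4 graph are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A field is described only by its position and width inside its header. */
struct example_hdr_field {
	const char *name;
	uint16_t offset_bits;
	uint16_t len_bits;
};

struct example_hdr_node;

/* Edge of the parse graph: if the select field equals value, go to next. */
struct example_hdr_edge {
	uint32_t value;
	const struct example_hdr_node *next;
};

struct example_hdr_node {
	const char *name;
	uint16_t len_bytes;			/* fixed header length (simplification) */
	struct example_hdr_field select;	/* field that picks the next header */
	const struct example_hdr_edge *edges;
	size_t n_edges;
};

/* Extract a field of up to 32 bits, most significant bit first. */
static uint32_t example_field_extract(const uint8_t *hdr,
				      const struct example_hdr_field *f)
{
	uint32_t val = 0;
	unsigned int i;

	assert(f->len_bits <= 32);
	for (i = 0; i < f->len_bits; i++) {
		unsigned int bit = f->offset_bits + i;

		val = (val << 1) | ((hdr[bit / 8] >> (7 - bit % 8)) & 1);
	}
	return val;
}

static const struct example_hdr_node example_ipv4 = {
	.name = "ipv4", .len_bytes = 20, .select = { "proto", 72, 8 },
};
static const struct example_hdr_edge example_vlan_edges[] = {
	{ 0x0800, &example_ipv4 },
};
static const struct example_hdr_node example_vlan = {
	.name = "vlan", .len_bytes = 4, .select = { "type", 16, 16 },
	.edges = example_vlan_edges, .n_edges = 1,
};
static const struct example_hdr_edge example_eth_edges[] = {
	{ 0x8100, &example_vlan },
	{ 0x0800, &example_ipv4 },
};
static const struct example_hdr_node example_eth = {
	.name = "ethernet", .len_bytes = 14, .select = { "type", 96, 16 },
	.edges = example_eth_edges, .n_edges = 2,
};

/* Walk the graph over a packet; return the last header recognised,
 * or NULL if the packet is too short for the header being parsed. */
static const struct example_hdr_node *
example_walk(const struct example_hdr_node *node, const uint8_t *pkt, size_t len)
{
	size_t off = 0;

	for (;;) {
		const struct example_hdr_node *next = NULL;
		uint32_t sel;
		size_t i;

		if (off + node->len_bytes > len)
			return NULL;
		sel = example_field_extract(pkt + off, &node->select);
		for (i = 0; i < node->n_edges; i++)
			if (node->edges[i].value == sel)
				next = node->edges[i].next;
		if (!next)
			return node;
		off += node->len_bytes;
		node = next;
	}
}
```

With tables like these, a match key could refer to (header, field) pairs reported by the device instead of a fixed enum per protocol field.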


>
>As always its better to have code and when I get some
>time I'll try to write it up. Maybe its just a separate
>classifier although I don't actually want two hardware
>flow APIs.

Understood.

>
>I see you dropped the RFC tag are you proposing we include
>this now?

v11 is my bet :)

>
>.John

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-19 22:18           ` Jamal Hadi Salim
  2014-09-20  5:39             ` Florian Fainelli
@ 2014-09-20  8:17             ` Jiri Pirko
  2014-09-20 10:19               ` Jamal Hadi Salim
  1 sibling, 1 reply; 67+ messages in thread
From: Jiri Pirko @ 2014-09-20  8:17 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: John Fastabend, netdev, davem, nhorman, andy, tgraf, dborkman,
	ogerlitz, jesse, pshelar, azhou, ben, stephen, jeffrey.t.kirsher,
	vyasevic, xiyou.wangcong, edumazet, sfeldma, f.fainelli, roopa,
	linville, dev, jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a,
	buytenh, aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye,
	simon.horman, alexander.h.duyck

Sat, Sep 20, 2014 at 12:18:02AM CEST, jhs@mojatatu.com wrote:
>On 09/19/14 18:12, John Fastabend wrote:
>>On 09/19/2014 10:57 AM, Jamal Hadi Salim wrote:
>>>On 09/19/14 11:49, Jiri Pirko wrote:
>>>>Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
>>>
>>>>>Is this just a temporary test tool? Otherwise i dont see reason
>>>>>for its existence (or the API that it feeds on).
>>>>
>>>>Please read the conversation I had with Pravin and Jesse in v1 thread.
>>>>Long story short they like to have the api separated from ovs datapath
>>>>so ovs daemon can use it to directly communicate with driver. Also John
>>>>Fastabend requested a way to work with driver flows without using ovs ->
>>>>that was the original reason I created switchdev genl api.
>>>>
>>>>Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
>>>>will use directly switchdev genl api.
>>>>
>>>>I hope I cleared this out.
>>>>
>>>
>>>It is - thanks Jiri.
>>>
>>>cheers,
>>>jamal
>>
>>Hi Jiri,
>>
>>I was considering a slightly different approach where the
>>device would report via netlink the fields/actions it
>>supported rather than creating pre-defined enums for every
>>possible key.
>>
>>I already need to have an API to report fields/matches
>>that are being supported why not have the device report
>>the headers as header fields (len, offset) and the
>>associated parse graph the hardware uses? Vendors should
>>have this already to describe/design their real hardware.
>>
>>As always its better to have code and when I get some
>>time I'll try to write it up. Maybe its just a separate
>>classifier although I don't actually want two hardware
>>flow APIs.
>>
>>I see you dropped the RFC tag are you proposing we include
>>this now?
>>
>
>
>Actually I just realized i missed something very basic that
>Jiri said. I think i understand the tool being there for testing
>but i am assumed the same about the genlink api.
>Jiri, are you saying that genlink api is there to
>stay?

Yes, that is what I am saying. It is needed for flow manipulation, because
such an api does not exist. As I stated earlier, I do not want to use
switchdev genl for anything other than flow manipulation.

>
>cheers,
>jamal

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-20  5:39             ` Florian Fainelli
@ 2014-09-20  8:25               ` Jiri Pirko
  0 siblings, 0 replies; 67+ messages in thread
From: Jiri Pirko @ 2014-09-20  8:25 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Jamal Hadi Salim, John Fastabend, netdev, davem, nhorman, andy,
	tgraf, dborkman, ogerlitz, jesse, pshelar, azhou, ben, stephen,
	jeffrey.t.kirsher, vyasevic, xiyou.wangcong, edumazet, sfeldma,
	roopa, linville, dev, jasowang, ebiederm, nicolas.dichtel,
	ryazanov.s.a, buytenh, aviadr, nbd, alexei.starovoitov,
	Neil.Jerram, ronye, simon.horman, alexander.h.duyck

Sat, Sep 20, 2014 at 07:39:51AM CEST, f.fainelli@gmail.com wrote:
>On 09/19/14 15:18, Jamal Hadi Salim wrote:
>>On 09/19/14 18:12, John Fastabend wrote:
>>>On 09/19/2014 10:57 AM, Jamal Hadi Salim wrote:
>>>>On 09/19/14 11:49, Jiri Pirko wrote:
>>>>>Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
>>>>
>>>>>>Is this just a temporary test tool? Otherwise i dont see reason
>>>>>>for its existence (or the API that it feeds on).
>>>>>
>>>>>Please read the conversation I had with Pravin and Jesse in v1 thread.
>>>>>Long story short they like to have the api separated from ovs datapath
>>>>>so ovs daemon can use it to directly communicate with driver. Also John
>>>>>Fastabend requested a way to work with driver flows without using
>>>>>ovs ->
>>>>>that was the original reason I created switchdev genl api.
>>>>>
>>>>>Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
>>>>>will use directly switchdev genl api.
>>>>>
>>>>>I hope I cleared this out.
>>>>>
>>>>
>>>>It is - thanks Jiri.
>>>>
>>>>cheers,
>>>>jamal
>>>
>>>Hi Jiri,
>>>
>>>I was considering a slightly different approach where the
>>>device would report via netlink the fields/actions it
>>>supported rather than creating pre-defined enums for every
>>>possible key.
>>>
>>>I already need to have an API to report fields/matches
>>>that are being supported why not have the device report
>>>the headers as header fields (len, offset) and the
>>>associated parse graph the hardware uses? Vendors should
>>>have this already to describe/design their real hardware.
>>>
>>>As always its better to have code and when I get some
>>>time I'll try to write it up. Maybe its just a separate
>>>classifier although I don't actually want two hardware
>>>flow APIs.
>>>
>>>I see you dropped the RFC tag are you proposing we include
>>>this now?
>>>
>>
>>
>>Actually I just realized i missed something very basic that
>>Jiri said. I think i understand the tool being there for testing
>>but i am assumed the same about the genlink api.
>>Jiri, are you saying that genlink api is there to
>>stay?
>
>So, I really have mixed feelings about this netlink API, in particular
>because it is not clear to me where is the line between what should be a
>network device ndo operation, what should be an ethtool command, what should
>be a netlink message, and the rest.

Well as I said, this api should serve for flow manipulation only,
therefore swdev flow related ndos are used.


>
>I can certainly acknowledge the fact that manipulating flows is not ideal
>with the current set of tools, but really once we are there with netlink, how
>far are we from not having any network devices at all, and how does that
>differ from OpenWrt's swconfig in the end [1]?


I'm all ears for proposals on how to make flow manipulation better.


>
>[1]: https://lwn.net/Articles/571390/
>--
>Florian

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-20  8:17             ` Jiri Pirko
@ 2014-09-20 10:19               ` Jamal Hadi Salim
  2014-09-20 11:01                 ` Thomas Graf
  0 siblings, 1 reply; 67+ messages in thread
From: Jamal Hadi Salim @ 2014-09-20 10:19 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: John Fastabend, netdev, davem, nhorman, andy, tgraf, dborkman,
	ogerlitz, jesse, pshelar, azhou, ben, stephen, jeffrey.t.kirsher,
	vyasevic, xiyou.wangcong, edumazet, sfeldma, f.fainelli, roopa,
	linville, dev, jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a,
	buytenh, aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye,
	simon.horman, alexander.h.duyck

On 09/20/14 04:17, Jiri Pirko wrote:

> Yes, that I say. It is needed for flow manipulation, because such api does
> not exist.

Come on Jiri!
The ovs guys are against this and now no *api exists*?
Write a 15-tuple tc classifier and use it. I will be more
than happy to help you. I will get to it when we have basic L2 working
on real devices.

>As I stated earlier, I do not want to use switchdev genl for
> anything other than flow manipulation.


Totally unacceptable in my book. If the OVS guys want some way out
to be able to ride on some vendor SDKs then that is their problem.
We shouldn't allow for such loopholes. This is why/how TOE never made it
into the kernel.

cheers,
jamal

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-20  8:10         ` Scott Feldman
@ 2014-09-20 10:31           ` Jamal Hadi Salim
       [not found]           ` <DDC24110-C3F5-470F-B9BE-1D1792415D1E-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR@public.gmane.org>
  1 sibling, 0 replies; 67+ messages in thread
From: Jamal Hadi Salim @ 2014-09-20 10:31 UTC (permalink / raw)
  To: Scott Feldman, Roopa Prabhu
  Cc: Jiri Pirko, netdev, davem, nhorman, andy, tgraf, dborkman,
	ogerlitz, jesse, pshelar, azhou, ben, stephen, jeffrey.t.kirsher,
	vyasevic, xiyou.wangcong, john.r.fastabend, edumazet, f.fainelli,
	linville, dev, jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a,
	buytenh, aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye,
	simon.horman, alexander.h.duyck, Shrijeet Mukherjee

On 09/20/14 04:10, Scott Feldman wrote:
>
> On Sep 19, 2014, at 8:41 PM, Roopa Prabhu <roopa@cumulusnetworks.com> wrote:
>

>
> Existing rtnetlink isn’t available to swdev without some kind of snooping the echoes from the various kernel components
>(bridge, fib, etc).

You have made this claim before, Scott, and I am still not following.
Why do we need to echo things to get FDB or FIB to work? Device ops for
FDB offload, for example, already exist. I think they need to be
revamped, but that consensus can be reasonably reached. Why do we
need this flow api for such activities?

cheers,
jamal

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-20  8:14           ` Jiri Pirko
@ 2014-09-20 10:53             ` Thomas Graf
  2014-09-20 22:50               ` Alexei Starovoitov
  0 siblings, 1 reply; 67+ messages in thread
From: Thomas Graf @ 2014-09-20 10:53 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: John Fastabend, Jamal Hadi Salim, netdev, davem, nhorman, andy,
	dborkman, ogerlitz, jesse, pshelar, azhou, ben, stephen,
	jeffrey.t.kirsher, vyasevic, xiyou.wangcong, edumazet, sfeldma,
	f.fainelli, roopa, linville, dev, jasowang, ebiederm,
	nicolas.dichtel, ryazanov.s.a, buytenh, aviadr, nbd,
	alexei.starovoitov, Neil.Jerram, ronye, simon.horman,
	alexander.h.duyck

On 09/20/14 at 10:14am, Jiri Pirko wrote:
> Sat, Sep 20, 2014 at 12:12:12AM CEST, john.r.fastabend@intel.com wrote:
> >I was considering a slightly different approach where the
> >device would report via netlink the fields/actions it
> >supported rather than creating pre-defined enums for every
> >possible key.
> >
> >I already need to have an API to report fields/matches
> >that are being supported why not have the device report
> >the headers as header fields (len, offset) and the
> >associated parse graph the hardware uses? Vendors should
> >have this already to describe/design their real hardware.
> 
> Hmm, let me think about this a bit more. I will try to figure out how to
> handle that. Sound logic though. Will try to incorporate the idea in the
> patchset.

I think this is the right track.

I agree with Jamal that there is no need for a new permanent and
separate Netlink interface for this. I think this would best be described
as a structure of nested Netlink attributes in the form John proposes
which is then embedded into existing Netlink interfaces such as rtnetlink
and OVS genl.

OVS can register new genl ops to check capabilities and insert
hardware flows, which allows implementation of the offload decision in
user space and allows for arbitrary combinations of hardware and software
flows. It also allows running an eBPF software data path in combination
with a hardware flow setup.

rtnetlink can embed the nested attribute structure into existing APIs
to allow feature capability detection from user space, statistics
reporting, and optional direct hardware offload if a transparent
offload is not feasible. Would that work for you, John?

I think we should focus on getting the layering right and make it
generic enough so we allow evolving naturally.
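
As a rough illustration of the nested-attribute idea discussed above, the
sketch below packs a header description (fields given as offset and
length, as John proposes) into nested Netlink attributes. Only the generic
attribute layout (16-bit length, 16-bit type, payload padded to 4 bytes,
NLA_F_NESTED flag) follows the kernel's Netlink attribute format. The
SWDEV_ATTR_* type numbers and the describe_header() helper are
hypothetical and are not taken from the patchset.

```python
import struct

NLA_ALIGNTO = 4
NLA_HDRLEN = 4            # u16 nla_len + u16 nla_type
NLA_F_NESTED = 1 << 15
NLA_TYPE_MASK = 0x3FFF    # strips NLA_F_NESTED and NLA_F_NET_BYTEORDER

# Hypothetical attribute types, invented for this sketch only.
SWDEV_ATTR_HEADER = 1
SWDEV_ATTR_HEADER_NAME = 2
SWDEV_ATTR_FIELD = 3
SWDEV_ATTR_FIELD_OFFSET = 4
SWDEV_ATTR_FIELD_LEN = 5


def nla_align(length):
    """Round length up to the Netlink attribute alignment."""
    return (length + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1)


def nla_put(attr_type, payload):
    """Encode one attribute: header, payload, zero padding."""
    length = NLA_HDRLEN + len(payload)
    pad = nla_align(length) - length
    return struct.pack("=HH", length, attr_type) + payload + b"\x00" * pad


def nla_nest(attr_type, children):
    """Encode a nested attribute whose payload is other attributes."""
    return nla_put(attr_type | NLA_F_NESTED, b"".join(children))


def nla_parse(buf):
    """Yield (type, payload) for each attribute found in buf."""
    off = 0
    while off + NLA_HDRLEN <= len(buf):
        length, attr_type = struct.unpack_from("=HH", buf, off)
        if length < NLA_HDRLEN or off + length > len(buf):
            raise ValueError("malformed attribute")
        yield attr_type & NLA_TYPE_MASK, buf[off + NLA_HDRLEN:off + length]
        off += nla_align(length)


def describe_header(name, fields):
    """Describe one protocol header as a list of (bit offset, bit length)."""
    children = [nla_put(SWDEV_ATTR_HEADER_NAME, name.encode() + b"\x00")]
    for offset, length in fields:
        children.append(nla_nest(SWDEV_ATTR_FIELD, [
            nla_put(SWDEV_ATTR_FIELD_OFFSET, struct.pack("=I", offset)),
            nla_put(SWDEV_ATTR_FIELD_LEN, struct.pack("=I", length)),
        ]))
    return nla_nest(SWDEV_ATTR_HEADER, children)


# Ethernet: dst MAC, src MAC, ethertype (offsets and lengths in bits).
msg = describe_header("ethernet", [(0, 48), (48, 48), (96, 16)])
```

Because the description is just a blob of nested attributes, the same
bytes could in principle be carried inside an rtnetlink message, an OVS
genl message or a tc classifier without defining a separate family.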

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-20 10:19               ` Jamal Hadi Salim
@ 2014-09-20 11:01                 ` Thomas Graf
  2014-09-20 11:32                   ` Jamal Hadi Salim
  0 siblings, 1 reply; 67+ messages in thread
From: Thomas Graf @ 2014-09-20 11:01 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Jiri Pirko, John Fastabend, netdev, davem, nhorman, andy,
	dborkman, ogerlitz, jesse, pshelar, azhou, ben, stephen,
	jeffrey.t.kirsher, vyasevic, xiyou.wangcong, edumazet, sfeldma,
	f.fainelli, roopa, linville, dev, jasowang, ebiederm,
	nicolas.dichtel, ryazanov.s.a, buytenh, aviadr, nbd,
	alexei.starovoitov, Neil.Jerram, ronye, simon.horman,
	alexander.h.duyck

On 09/20/14 at 06:19am, Jamal Hadi Salim wrote:
> The ovs guys are against this and now no *api exists*?
> Write a 15-tuple tc classifier and use it. I will be more
> than happy to help you. I will get to it when we have basic L2 working
> on real devices.

Nothing speaks against having such a tc classifier. In fact, having
the interface consist of only an embedded Netlink attribute structure
would allow for such a classifier in a very straightforward way.

That doesn't mean everybody should be forced to use the stateful
tc interface.

> Totally unacceptable in my books. If the OVS guys want some way out
> to be able to ride on some vendor sdks then that is their problem.
> We shouldnt allow for such loopholes. This is why/how TOE never made it
> in the kernel.

No need for false accusations here. Nobody ever mentioned vendor SDKs.

The statement was that the requirement of deriving hardware flows from
software flows *in the kernel* is not flexible enough for the future
for reasons such as:

1) The OVS software data path might be based on eBPF in the future and
   it is unclear how we could derive hardware flows from that
   transparently.

2) Depending on hardware capabilities, hardware flows might need to be
   assisted by software flow counterparts and it is believed that it
   is the wrong approach to push all the necessary context for the
   decision down into the kernel. This can be argued about and I don't
   feel strongly either way.
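
One way to picture the user-space decision described in point 2: the
daemon compares each flow with the capabilities the device reports and
places in hardware only what the device can express, keeping a software
counterpart otherwise. This is a minimal sketch under stated assumptions.
The Flow and DeviceCaps types, the field and action names, and the
place_flow() helper are all hypothetical; none of them comes from OVS or
from the switchdev patchset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    match: frozenset    # names of the fields the flow matches on
    actions: frozenset  # names of the actions the flow applies


@dataclass(frozen=True)
class DeviceCaps:
    match_fields: frozenset  # fields the hardware can match on
    actions: frozenset       # actions the hardware can perform


def place_flow(flow, caps):
    """Return 'hw', 'hw+sw' or 'sw' for a flow on a given device."""
    match_ok = flow.match <= caps.match_fields
    actions_ok = flow.actions <= caps.actions
    if match_ok and actions_ok:
        return "hw"      # fully expressible in hardware
    if match_ok:
        return "hw+sw"   # hardware classifies, software assists
    return "sw"          # hardware cannot even match the flow


caps = DeviceCaps(
    match_fields=frozenset({"in_port", "eth_dst", "vlan_id"}),
    actions=frozenset({"output", "drop"}),
)
l2_flow = Flow(frozenset({"eth_dst", "vlan_id"}), frozenset({"output"}))
tunnel_flow = Flow(frozenset({"eth_dst"}), frozenset({"encap", "output"}))
l4_flow = Flow(frozenset({"tcp_dst"}), frozenset({"drop"}))
```

With these inputs place_flow() yields "hw" for l2_flow, "hw+sw" for
tunnel_flow and "sw" for l4_flow. Keeping this logic in user space means
the kernel only needs to report capabilities and accept flows.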

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-20 11:01                 ` Thomas Graf
@ 2014-09-20 11:32                   ` Jamal Hadi Salim
  2014-09-20 11:51                     ` Thomas Graf
  2014-09-22  7:53                     ` Jiri Pirko
  0 siblings, 2 replies; 67+ messages in thread
From: Jamal Hadi Salim @ 2014-09-20 11:32 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Jiri Pirko, John Fastabend, netdev, davem, nhorman, andy,
	dborkman, ogerlitz, jesse, pshelar, azhou, ben, stephen,
	jeffrey.t.kirsher, vyasevic, xiyou.wangcong, edumazet, sfeldma,
	f.fainelli, roopa, linville, dev, jasowang, ebiederm,
	nicolas.dichtel, ryazanov.s.a, buytenh, aviadr, nbd,
	alexei.starovoitov, Neil.Jerram, ronye, simon.horman,
	alexander.h.duyck

On 09/20/14 07:01, Thomas Graf wrote:

> Nothing speaks against having such a tc classifier. In fact, having
> the interface consist of only an embedded Netlink attribute structure
> would allow for such a classifier in a very straightforward way.
>
> That doesn't mean everybody should be forced to use the stateful
> tc interface.
>


Agreed. The response was to Jiri's strange statement that now that
he can't use OVS, there is no such api. I point to tc as very capable of
such usage.

> No need for false accusations here. Nobody ever mentioned vendor SDKs.
>

I am sorry to have tied the two together. Maybe not OVS but the approach
described is heaven for vendor SDKs.

> The statement was that the requirement of deriving hardware flows from
> software flows *in the kernel* is not flexible enough for the future
> for reasons such as:
>
> 1) The OVS software data path might be based on eBPF in the future and
>     it is unclear how we could derive hardware flows from that
>     transparently.
>

Who says you can't put BPF in hardware?
And why is OVS defining how BPF should evolve or how it should be used?

> 2) Depending on hardware capabilities. Hardware flows might need to be
>     assisted by software flow counterparts and it is believed that it
>     is the wrong approach to push all the necessary context for the
>     decision down into the kernel. This can be argued about and I don't
>     feel strongly either way.
>

Pointing to the current FDB offload: You can select to bypass
and not use software.

cheers,
jamal

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-20 11:32                   ` Jamal Hadi Salim
@ 2014-09-20 11:51                     ` Thomas Graf
       [not found]                       ` <20140920115140.GA3777-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
  2014-09-22  7:53                     ` Jiri Pirko
  1 sibling, 1 reply; 67+ messages in thread
From: Thomas Graf @ 2014-09-20 11:51 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Jiri Pirko, John Fastabend, netdev, davem, nhorman, andy,
	dborkman, ogerlitz, jesse, pshelar, azhou, ben, stephen,
	jeffrey.t.kirsher, vyasevic, xiyou.wangcong, edumazet, sfeldma,
	f.fainelli, roopa, linville, dev, jasowang, ebiederm,
	nicolas.dichtel, ryazanov.s.a, buytenh, aviadr, nbd,
	alexei.starovoitov, Neil.Jerram, ronye, simon.horman,
	alexander.h.duyck

On 09/20/14 at 07:32am, Jamal Hadi Salim wrote:
> I am sorry to have tied the two together. Maybe not OVS but the approach
> described is heaven for vendor SDKs.

I fail to see the connection. You can use a switch vendor SDK no matter
how we define the kernel APIs. They already exist and have been
designed in a way to be completely independent of the kernel.

Are you referring to vendor specific decisions in user space in
general? I believe that the whole point of swdev is to provide *that*
level of abstraction so decisions can be made in a vendor neutral way.

> >The statement was that the requirement of deriving hardware flows from
> >software flows *in the kernel* is not flexible enough for the future
> >for reasons such as:
> >
> >1) The OVS software data path might be based on eBPF in the future and
> >    it is unclear how we could derive hardware flows from that
> >    transparently.
> >
> 
> Who says you cant put BPF in hardware?

I don't think anybody is saying that. P4 is likely a reality soon. But
we definitely want hardware offload in a BPF world even if the hardware
can't do BPF yet.

> And why is OVS defining how BPF should evolve or how it should be used?

Not sure I understand. OVS would be a user of eBPF just like tracing,
xt_BPF, socket filter, ...

> >2) Depending on hardware capabilities. Hardware flows might need to be
> >    assisted by software flow counterparts and it is believed that it
> >    is the wrong approach to push all the necessary context for the
> >    decision down into the kernel. This can be argued about and I don't
> >    feel strongly either way.
> >
> 
> Pointing to the current FDB offload: You can select to bypass
> and not use s/ware.

As I said, this can be argued about. It would require pushing a lot of
context into the kernel though. The FDB offload is relatively trivial
in comparison to the complexity OVS user space can handle. I can't think
of any reason to complicate the kernel further with OVS-specific
knowledge as long as we can guarantee the vendor abstraction.

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
       [not found]                       ` <20140920115140.GA3777-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
@ 2014-09-20 12:35                         ` Jamal Hadi Salim
  0 siblings, 0 replies; 67+ messages in thread
From: Jamal Hadi Salim @ 2014-09-20 12:35 UTC (permalink / raw)
  To: Thomas Graf
  Cc: ryazanov.s.a, jasowang, John Fastabend, Neil.Jerram, edumazet,
	andy, dev, nbd, f.fainelli, ronye, jeffrey.t.kirsher, ogerlitz, ben,
	buytenh, alexander.h.duyck, Jiri Pirko, simon.horman, roopa, aviadr,
	nicolas.dichtel, vyasevic, nhorman, netdev, stephen, dborkman,
	ebiederm, davem

On 09/20/14 07:51, Thomas Graf wrote:

> I fail to see the connection. You can use a switch vendor SDK no matter
> how we define the kernel APIs. They already exist and have been
> designed in a way to be completely independent of the kernel.
>
> Are you referring to vendor specific decisions in user space in
> general? I believe that the whole point of swdev is to provide *that*
> level of abstraction so decisions can be made in a vendor neutral way.
>

I am not against the swdev idea. I think we disagree on how the
general classification/action interface should look, but that is
resolvable with correct interfaces.
The vendor-neutral way *already exists* via current netlink
abstractions that existing tools use. When we need to add new
interfaces then we should.

> I don't think anybody is saying that. P4 is likely a reality soon. But
> we definitely want hardware offload in a BPF world even if the hardware
> can't do BPF yet.
>


I don't think we have contradictions. We are speaking past each other.
You implied that in the future the OVS s/w path might be based on BPF.
I implied BPF itself could be offloaded and stands on its own merit
and should work if we have the correct interface. As an example,
I don't care about P4 or OVS, but I have no problem if they use
the common interfaces provided by Linux, i.e.
if I want to build a little cpu running the BPF instruction set
and use that as my offload then that interface should work, and if
it doesn't I should provide extensions.

> Not sure I understand. OVS would be a user of eBPF just like tracing,
> xt_BPF, socket filter, ...
>

Ok, we are on the same page then.

> As I said, this can be argued about. It would require to push a lot of
> context into the kernel though. The FDB offload is relatively trivial
> in comparison to the complexity OVS user space can handle. I can't think
> of any reasons why to complicate the kernel further with OVS specific
> knowledge as long as we can guarantee the vendor abstraction.
>

I disagree. OVS may be complex in that sense (I am sorry, I am making
an assumption based on what you are saying) but I don't think there is
any other kernel subsystem that has this challenge.
Note: I am pointing to fdb only because it carries the concept of "put
this in hardware and/or software". I agree the fdb may be reasonably
simpler.

cheers,
jamal
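
The "put this in hardware and/or software" concept mentioned for the FDB
can be sketched with the NTF_SELF and NTF_MASTER neighbour flags, which
`bridge fdb add ... self` and `bridge fdb add ... master` set. The flag
values below match include/uapi/linux/neighbour.h. The fdb_targets()
helper is a simplified, hypothetical model of how the request is
dispatched (assuming a port enslaved to a bridge); it is not kernel code,
and reading "self" as "hardware" assumes a switch port driver behind the
netdev.

```python
NTF_SELF = 0x02    # operate on the device's own FDB (port driver)
NTF_MASTER = 0x04  # operate on the master device's FDB (software bridge)


def fdb_targets(ndm_flags):
    """Return the list of places an FDB entry would be installed."""
    targets = []
    # No flags at all is treated like "master" in this model.
    if ndm_flags == 0 or ndm_flags & NTF_MASTER:
        targets.append("bridge (software)")
    if ndm_flags & NTF_SELF:
        targets.append("port driver (hardware)")
    return targets
```

Under this model, passing NTF_SELF alone bypasses the software bridge,
and passing both flags installs the entry in both places.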

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-20  8:09         ` Jiri Pirko
@ 2014-09-20 12:39           ` Roopa Prabhu
  0 siblings, 0 replies; 67+ messages in thread
From: Roopa Prabhu @ 2014-09-20 12:39 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Jamal Hadi Salim, netdev, davem, nhorman, andy, tgraf, dborkman,
	ogerlitz, jesse, pshelar, azhou, ben, stephen, jeffrey.t.kirsher,
	vyasevic, xiyou.wangcong, john.r.fastabend, edumazet, sfeldma,
	f.fainelli, linville, dev, jasowang, ebiederm, nicolas.dichtel,
	ryazanov.s.a, buytenh, aviadr, nbd, alexei.starovoitov,
	Neil.Jerram, ronye, simon.horman, alexander.h.duyck,
	Shrijeet Mukherjee

On 9/20/14, 1:09 AM, Jiri Pirko wrote:
> Sat, Sep 20, 2014 at 05:41:16AM CEST, roopa@cumulusnetworks.com wrote:
>> On 9/19/14, 8:49 AM, Jiri Pirko wrote:
>>> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
>>>> On 09/19/14 09:49, Jiri Pirko wrote:
>>>>> This patch exposes switchdev API using generic Netlink.
>>>>> Example userspace utility is here:
>>>>> https://github.com/jpirko/switchdev
>>>>>
>>>> Is this just a temporary test tool? Otherwise i dont see reason
>>>> for its existence (or the API that it feeds on).
>>> Please read the conversation I had with Pravin and Jesse in v1 thread.
>>> Long story short they like to have the api separated from ovs datapath
>>> so ovs daemon can use it to directly communicate with driver. Also John
>>> Fastabend requested a way to work with driver flows without using ovs ->
>>> that was the original reason I created switchdev genl api.
>>>
>>> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
>>> will use directly switchdev genl api.
>>>
>>> I hope I cleared this out.
>> We already have all the needed rtnetlink kernel api and userspace tools
>> around it to support all
>> switching asic features. ie, the rtnetlink api is the switchdev api. We can
>> do l2, l3, acl's with it.
>> Its unclear to me why we need another new netlink api. Which will mean none
>> of the existing tools to
>> create bridges etc will work on a switchdev.
> No one is proposing such API. Note that what I'm trying to solve in my
> patchset is FLOW world. There is only one API there, ovs genl. But the
> usage of that for hw offload purposes was nacked by ovs maintainer. Plus
> couple of people wanted to run the offloading independently on ovs
> instance. Therefore I introduced the switchdev genl, which takes care of
> that. No plan to extend it for other things you mentioned, just flows.
OK, that was not clear to me. Introducing a new genl api and calling it the
switchdev api can result in non-flow creep into it in the future.
>
>
>> Which seems like going in the direction exactly opposite to what we had
>> discussed earlier.
> Nope. The previous discussion ignored flows.
>> If a non-ovs flow interface is needed from userspace, we can extend the
>> existing interface to include flows.
> How? You mean to extend rtnetlink? What advantage it would bring
> comparing to separate genl iface?
Yes. The advantage would be that we don't have yet another parallel
switchdev netlink api.


>> I don't understand why we should replace the existing rtnetlink switchdev api
>> to accommodate flows.
> Sorry, I do not understand what "existing rtnetlink switchdev api" you
> have in mind. Would you care to explain?

I am talking about the existing rtnetlink api that bridge and ip link use
to talk l2 and l3 to the kernel:
RTM_NEWROUTE etc.

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
       [not found]           ` <DDC24110-C3F5-470F-B9BE-1D1792415D1E-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR@public.gmane.org>
@ 2014-09-20 12:51             ` Roopa Prabhu
  2014-09-20 17:21               ` Scott Feldman
  0 siblings, 1 reply; 67+ messages in thread
From: Roopa Prabhu @ 2014-09-20 12:51 UTC (permalink / raw)
  To: Scott Feldman
  Cc: ryazanov.s.a, jasowang, john.r.fastabend, Neil.Jerram, edumazet,
	andy, dev, nbd, f.fainelli, Shrijeet Mukherjee, ronye,
	jeffrey.t.kirsher, ogerlitz, ben, buytenh, alexander.h.duyck,
	Jiri Pirko, simon.horman, Jamal Hadi Salim, aviadr, nicolas.dichtel,
	vyasevic, nhorman, netdev, stephen, dborkman, ebiederm, davem

On 9/20/14, 1:10 AM, Scott Feldman wrote:
> On Sep 19, 2014, at 8:41 PM, Roopa Prabhu <roopa@cumulusnetworks.com> wrote:
>
>> On 9/19/14, 8:49 AM, Jiri Pirko wrote:
>>> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
>>>> On 09/19/14 09:49, Jiri Pirko wrote:
>>>>> This patch exposes switchdev API using generic Netlink.
>>>>> Example userspace utility is here:
>>>>> https://github.com/jpirko/switchdev
>>>>>
>>>> Is this just a temporary test tool? Otherwise i dont see reason
>>>> for its existence (or the API that it feeds on).
>>> Please read the conversation I had with Pravin and Jesse in v1 thread.
>>> Long story short they like to have the api separated from ovs datapath
>>> so ovs daemon can use it to directly communicate with driver. Also John
>>> Fastabend requested a way to work with driver flows without using ovs ->
>>> that was the original reason I created switchdev genl api.
>>>
>>> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
>>> will use directly switchdev genl api.
>>>
>>> I hope I cleared this out.
>> We already have all the needed rtnetlink kernel api and userspace tools around it to support all
>> switching asic features. ie, the rtnetlink api is the switchdev api. We can do l2, l3, acl's with it.
>> Its unclear to me why we need another new netlink api. Which will mean none of the existing tools to
>> create bridges etc will work on a switchdev.
>> Which seems like going in the direction exactly opposite to what we had discussed earlier.
> Existing rtnetlink isn’t available to swdev without some kind of snooping the echoes from the various kernel components (bridge, fib, etc).  With swdev_flow, as Jiri has defined it, there is an additional conversion needed to bridge the gap (bad expression, I know) between rtnetlink and swdev_flow.  This conversion happens in the kernel components.  For example, the bridge module, still driven from userspace by existing rtnetlink, will formulate the necessary swdev_flow insert/remove calls to the swdev driver such that HW will offload the fwd path.
>
> You have:
>      user -> rtnetlink -> kernel -> netlink echo -> [some process] -> [some driver] -> HW
>
> Jiri has:
>      user -> rtnetlink -> kernel -> swdev_* -> swdev driver -> HW
>
Keeping the goal to not change or not add a new userspace API in mind,

I have :
     user -> rtnetlink -> kernel -> ndo_op -> swdev driver  -> HW

Jiri has:
     user -> genl (newapi) -> kernel -> swdev_* -> swdev driver -> HW

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-20 12:51             ` Roopa Prabhu
@ 2014-09-20 17:21               ` Scott Feldman
  2014-09-20 17:38                 ` Jiri Pirko
  0 siblings, 1 reply; 67+ messages in thread
From: Scott Feldman @ 2014-09-20 17:21 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: Jiri Pirko, Jamal Hadi Salim, netdev, David Miller, Neil Horman,
	Andy Gospodarek, Thomas Graf, dborkman, ogerlitz, jesse, pshelar,
	azhou, Ben Hutchings, Stephen Hemminger, Jeff Kirsher,
	Vlad Yasevich, Cong Wang, John Fastabend, Eric Dumazet,
	Florian Fainelli, John W. Linville, dev, jasowang, ebiederm,
	nicolas.dichtel, Sergey Ryazanov


On Sep 20, 2014, at 5:51 AM, Roopa Prabhu <roopa@cumulusnetworks.com> wrote:

> On 9/20/14, 1:10 AM, Scott Feldman wrote:
>> On Sep 19, 2014, at 8:41 PM, Roopa Prabhu <roopa@cumulusnetworks.com>
>>  wrote:
>> 
>> 
>>> On 9/19/14, 8:49 AM, Jiri Pirko wrote:
>>> 
>>>> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com
>>>>  wrote:
>>>> 
>>>>> On 09/19/14 09:49, Jiri Pirko wrote:
>>>>> 
>>>>>> This patch exposes switchdev API using generic Netlink.
>>>>>> Example userspace utility is here:
>>>>>> 
>>>>>> https://github.com/jpirko/switchdev
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> Is this just a temporary test tool? Otherwise i dont see reason
>>>>> for its existence (or the API that it feeds on).
>>>>> 
>>>> Please read the conversation I had with Pravin and Jesse in v1 thread.
>>>> Long story short they like to have the api separated from ovs datapath
>>>> so ovs daemon can use it to directly communicate with driver. Also John
>>>> Fastabend requested a way to work with driver flows without using ovs ->
>>>> that was the original reason I created switchdev genl api.
>>>> 
>>>> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
>>>> will use directly switchdev genl api.
>>>> 
>>>> I hope I cleared this out.
>>>> 
>>> We already have all the needed rtnetlink kernel api and userspace tools around it to support all
>>> switching asic features. ie, the rtnetlink api is the switchdev api. We can do l2, l3, acl's with it.
>>> Its unclear to me why we need another new netlink api. Which will mean none of the existing tools to
>>> create bridges etc will work on a switchdev.
>>> Which seems like going in the direction exactly opposite to what we had discussed earlier.
>>> 
>> Existing rtnetlink isn’t available to swdev without some kind of snooping the echoes from the various kernel components (bridge, fib, etc).  With swdev_flow, as Jiri has defined it, there is an additional conversion needed to bridge the gap (bad expression, I know) between rtnetlink and swdev_flow.  This conversion happens in the kernel components.  For example, the bridge module, still driven from userspace by existing rtnetlink, will formulate the necessary swdev_flow insert/remove calls to the swdev driver such that HW will offload the fwd path.
>> 
>> You have:
>>     user -> rtnetlink -> kernel -> netlink echo -> [some process] -> [some driver] -> HW
>> 
>> 
>> Jiri has:
>>     user -> rtnetlink -> kernel -> swdev_* -> swdev driver -> HW
>> 
>> 
> Keeping the goal to not change or not add a new userspace API in mind,
> I have :
>     user -> rtnetlink -> kernel -> ndo_op -> swdev driver  -> HW
> 

Then you have the same as Jiri, for the traditional L2/L3 case.

> Jiri has:
>     user -> genl (newapi) -> kernel -> swdev_* -> swdev driver -> HW

Jiri’s genl is for userspace apps that are talking rtnetlink, like OVS.  It’s not a substitute for rtnetlink, it’s an alternative.  The complete picture is:

user -> swdev genl -----
                        \
                         \
                          -------> kernel -> ndo_swdev_* -> swdev driver -> HW
                         /
                        /
user -> rtnetlink ------


-scott
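
The diagram above can be read as two front ends funnelling into one set
of driver operations. The sketch below models only that shape. The names
SwdevDriver, flow_insert, handle_genl_flow_insert and
handle_rtnetlink_fdb_add are hypothetical stand-ins and do not reflect
the ndo_swdev_* signatures in the patchset.

```python
class SwdevDriver:
    """Stand-in for a switch driver; records what would reach the HW."""

    def __init__(self):
        self.hw_table = []

    def flow_insert(self, match, action):
        # The single driver entry point shared by both front ends.
        self.hw_table.append((match, action))


def handle_genl_flow_insert(driver, match, action):
    """user -> swdev genl -> kernel -> driver op (flow given directly)."""
    driver.flow_insert(match, action)


def handle_rtnetlink_fdb_add(driver, mac, vlan, port):
    """user -> rtnetlink -> kernel component -> driver op.

    A kernel component such as the bridge converts the L2 request into
    a flow before calling the same driver op.
    """
    match = {"eth_dst": mac, "vlan": vlan}
    driver.flow_insert(match, ("output", port))


driver = SwdevDriver()
handle_genl_flow_insert(driver, {"in_port": 1}, ("output", 2))
handle_rtnetlink_fdb_add(driver, "00:11:22:33:44:55", 10, 3)
```

Both calls end up in driver.hw_table, which is the point of the diagram:
the driver does not need to know which user-space API originated a flow.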

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-20 17:21               ` Scott Feldman
@ 2014-09-20 17:38                 ` Jiri Pirko
  2014-09-21  1:30                   ` Roopa Prabhu
  0 siblings, 1 reply; 67+ messages in thread
From: Jiri Pirko @ 2014-09-20 17:38 UTC (permalink / raw)
  To: Scott Feldman
  Cc: Roopa Prabhu, Jamal Hadi Salim, netdev, David Miller,
	Neil Horman, Andy Gospodarek, Thomas Graf, dborkman, ogerlitz,
	jesse, pshelar, azhou, Ben Hutchings, Stephen Hemminger,
	Jeff Kirsher, Vlad Yasevich, Cong Wang, John Fastabend,
	Eric Dumazet, Florian Fainelli, John W. Linville, dev, jasowang,
	ebiederm, nicolas.dichtel, Sergey

Sat, Sep 20, 2014 at 07:21:10PM CEST, sfeldma@cumulusnetworks.com wrote:
>
>On Sep 20, 2014, at 5:51 AM, Roopa Prabhu <roopa@cumulusnetworks.com> wrote:
>
>> On 9/20/14, 1:10 AM, Scott Feldman wrote:
>>> On Sep 19, 2014, at 8:41 PM, Roopa Prabhu <roopa@cumulusnetworks.com>
>>>  wrote:
>>> 
>>> 
>>>> On 9/19/14, 8:49 AM, Jiri Pirko wrote:
>>>> 
>>>>> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com
>>>>>  wrote:
>>>>> 
>>>>>> On 09/19/14 09:49, Jiri Pirko wrote:
>>>>>> 
>>>>>>> This patch exposes switchdev API using generic Netlink.
>>>>>>> Example userspace utility is here:
>>>>>>> 
>>>>>>> https://github.com/jpirko/switchdev
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> Is this just a temporary test tool? Otherwise i dont see reason
>>>>>> for its existence (or the API that it feeds on).
>>>>>> 
>>>>> Please read the conversation I had with Pravin and Jesse in v1 thread.
>>>>> Long story short they like to have the api separated from ovs datapath
>>>>> so ovs daemon can use it to directly communicate with driver. Also John
>>>>> Fastabend requested a way to work with driver flows without using ovs ->
>>>>> that was the original reason I created switchdev genl api.
>>>>> 
>>>>> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
>>>>> will use directly switchdev genl api.
>>>>> 
>>>>> I hope I cleared this out.
>>>>> 
>>>> We already have all the needed rtnetlink kernel api and userspace tools around it to support all
>>>> switching asic features. ie, the rtnetlink api is the switchdev api. We can do l2, l3, acl's with it.
>>>> Its unclear to me why we need another new netlink api. Which will mean none of the existing tools to
>>>> create bridges etc will work on a switchdev.
>>>> Which seems like going in the direction exactly opposite to what we had discussed earlier.
>>>> 
>>> Existing rtnetlink isn’t available to swdev without some kind of snooping the echoes from the various kernel components (bridge, fib, etc).  With swdev_flow, as Jiri has defined it, there is an additional conversion needed to bridge the gap (bad expression, I know) between rtnetlink and swdev_flow.  This conversion happens in the kernel components.  For example, the bridge module, still driven from userspace by existing rtnetlink, will formulate the necessary swdev_flow insert/remove calls to the swdev driver such that HW will offload the fwd path.
>>> 
>>> You have:
>>>     user -> rtnetlink -> kernel -> netlink echo -> [some process] -> [some driver] -> HW
>>> 
>>> 
>>> Jiri has:
>>>     user -> rtnetlink -> kernel -> swdev_* -> swdev driver -> HW
>>> 
>>> 
>> Keeping the goal to not change or not add a new userspace API in mind,
>> I have :
>>     user -> rtnetlink -> kernel -> ndo_op -> swdev driver  -> HW
>> 
>
>Then you have the same as Jiri, for the traditional L2/L3 case.
>
>> Jiri has:
>>     user -> genl (newapi) -> kernel -> swdev_* -> swdev driver -> HW
>
>Jiri’s genl is for userspace apps that are talking rtnetlink, like OVS.  It’s not a substitute for rtnetlink, it’s an alternative.  The complete picture is:

Not an alternative, an addition.

>
>user -> swdev genl -----
>                        \
>                         \
>                          -------> kernel -> ndo_swdev_* -> swdev driver -> HW
>                         /
>                        /
>user -> rtnetlink ------

It is true that, as Thomas pointed out, we can probably nest this into
rtnl_link messages. That might work.
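
The diagram above can be modelled as a small userspace C sketch: two front
ends (a flow-aware application over genl, and the bridge driven by
rtnetlink) converge on one driver callback. Every name below (swdev_flow,
swdev_ops, swdev_flow_insert and the toy driver) is hypothetical and only
illustrates the shape of the idea; it is not the API from the patchset.

```c
#include <assert.h>

/* Hypothetical flow description; the field names are illustrative only. */
struct swdev_flow {
	unsigned int in_port;
	unsigned int out_port;
};

/* Hypothetical driver callbacks, standing in for the ndo_swdev_* ops. */
struct swdev_ops {
	int (*flow_insert)(void *priv, const struct swdev_flow *flow);
};

struct swdev {
	const struct swdev_ops *ops;
	void *priv;		/* driver private state */
};

/* Single kernel-side entry point shared by every front end. */
static int swdev_flow_insert(struct swdev *dev, const struct swdev_flow *flow)
{
	assert(dev && dev->ops && flow);
	if (!dev->ops->flow_insert)
		return -1;	/* driver cannot offload flows */
	return dev->ops->flow_insert(dev->priv, flow);
}

/* Front end 1: a flow-aware app (OVS-like) hands over a flow directly. */
static int genl_flow_request(struct swdev *dev, unsigned int in,
			     unsigned int out)
{
	struct swdev_flow flow = { .in_port = in, .out_port = out };

	return swdev_flow_insert(dev, &flow);
}

/* Front end 2: the bridge, driven by rtnetlink, derives the flow itself. */
static int bridge_fdb_learned(struct swdev *dev, unsigned int src,
			      unsigned int dst)
{
	struct swdev_flow flow = { .in_port = src, .out_port = dst };

	return swdev_flow_insert(dev, &flow);
}

/* Toy driver: counts how many flows reached the "hardware". */
static int toy_flow_insert(void *priv, const struct swdev_flow *flow)
{
	(void)flow;
	++*(int *)priv;
	return 0;
}
```

In this model the driver cannot tell which front end produced a flow, which
is the property the diagram is meant to convey.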

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-20 10:53             ` Thomas Graf
@ 2014-09-20 22:50               ` Alexei Starovoitov
  2014-09-22  8:13                 ` Thomas Graf
  0 siblings, 1 reply; 67+ messages in thread
From: Alexei Starovoitov @ 2014-09-20 22:50 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Jiri Pirko, John Fastabend, Jamal Hadi Salim, netdev,
	David S. Miller, Neil Horman, Andy Gospodarek, Daniel Borkmann,
	Or Gerlitz, Jesse Gross, Pravin Shelar, Andy Zhou, ben,
	Stephen Hemminger, jeffrey.t.kirsher, Vladislav Yasevich,
	Cong Wang, Eric Dumazet, Scott Feldman, Florian Fainelli,
	Roopa Prabhu, John Linville, dev@openvswitch.org

On Sat, Sep 20, 2014 at 3:53 AM, Thomas Graf <tgraf@suug.ch> wrote:
> On 09/20/14 at 10:14am, Jiri Pirko wrote:
>> Sat, Sep 20, 2014 at 12:12:12AM CEST, john.r.fastabend@intel.com wrote:
>> >I was considering a slightly different approach where the
>> >device would report via netlink the fields/actions it
>> >supported rather than creating pre-defined enums for every
>> >possible key.
>> >
>> >I already need to have an API to report fields/matches
>> >that are being supported why not have the device report
>> >the headers as header fields (len, offset) and the
>> >associated parse graph the hardware uses? Vendors should
>> >have this already to describe/design their real hardware.
>>
>> Hmm, let me think about this a bit more. I will try to figure out how to
>> handle that. Sound logic though. Will try to incorporate the idea in the
>> patchset.
>
> I think this is the right track.

I agree with John and Thomas here.
I think HW should not be limited by SW abstractions whether
these abstractions are called flows, n-tuples, bridge or else.
Really looking forward to see "device reporting the headers as
header fields (len, offset) and the associated parse graph"
as the first step.
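
As a rough illustration of what "header fields (len, offset) and the
associated parse graph" could look like as data, here is a userspace C
sketch. The struct names and the tiny Ethernet table are assumptions made
for this example; nothing here is taken from a real driver or from the
patchset. Offsets and lengths are in bits.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One matchable field inside a header. */
struct hdr_field {
	const char *name;
	unsigned int offset;	/* bits from the start of the header */
	unsigned int len;	/* bits */
};

/* Parse-graph edge: when `field` equals `value`, the parser moves to `next`. */
struct parse_edge {
	const char *field;
	unsigned int value;
	const char *next;
};

struct hdr_node {
	const char *name;
	const struct hdr_field *fields;
	size_t n_fields;
	const struct parse_edge *edges;
	size_t n_edges;
};

/* What a device might report for Ethernet (illustrative subset). */
static const struct hdr_field eth_fields[] = {
	{ "dst_mac", 0, 48 }, { "src_mac", 48, 48 }, { "ethertype", 96, 16 },
};
static const struct parse_edge eth_edges[] = {
	{ "ethertype", 0x0800, "ipv4" },
};
static const struct hdr_node eth_hdr = {
	"ethernet", eth_fields, 3, eth_edges, 1,
};

/* Look up a field by name; NULL means the device cannot match on it. */
static const struct hdr_field *find_field(const struct hdr_node *hdr,
					  const char *name)
{
	for (size_t i = 0; i < hdr->n_fields; i++)
		if (strcmp(hdr->fields[i].name, name) == 0)
			return &hdr->fields[i];
	return NULL;
}

/* Follow the parse graph; NULL means the hardware stops parsing here. */
static const char *next_header(const struct hdr_node *hdr, const char *field,
			       unsigned int value)
{
	for (size_t i = 0; i < hdr->n_edges; i++)
		if (strcmp(hdr->edges[i].field, field) == 0 &&
		    hdr->edges[i].value == value)
			return hdr->edges[i].next;
	return NULL;
}
```

With a description like this, userspace can ask whether a match is
expressible before trying to install it, instead of relying on a fixed enum
of keys.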

Another topic that this discussion didn't cover yet is how this
all connects to tunnels and what is 'tunnel offloading'.
imo flow offloading by itself serves only academic interest.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-20 17:38                 ` Jiri Pirko
@ 2014-09-21  1:30                   ` Roopa Prabhu
  0 siblings, 0 replies; 67+ messages in thread
From: Roopa Prabhu @ 2014-09-21  1:30 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Scott Feldman, Jamal Hadi Salim, netdev, David Miller,
	Neil Horman, Andy Gospodarek, Thomas Graf, dborkman, ogerlitz,
	jesse, pshelar, azhou, Ben Hutchings, Stephen Hemminger,
	Jeff Kirsher, Vlad Yasevich, Cong Wang, John Fastabend,
	Eric Dumazet, Florian Fainelli, John W. Linville, dev, jasowang,
	ebiederm, nicolas.dichtel, Sergey

On 9/20/14, 10:38 AM, Jiri Pirko wrote:
> Sat, Sep 20, 2014 at 07:21:10PM CEST, sfeldma@cumulusnetworks.com wrote:
>> On Sep 20, 2014, at 5:51 AM, Roopa Prabhu <roopa@cumulusnetworks.com> wrote:
>>
>>> On 9/20/14, 1:10 AM, Scott Feldman wrote:
>>>> On Sep 19, 2014, at 8:41 PM, Roopa Prabhu <roopa@cumulusnetworks.com>
>>>>   wrote:
>>>>
>>>>
>>>>> On 9/19/14, 8:49 AM, Jiri Pirko wrote:
>>>>>
>>>>>> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com
>>>>>>   wrote:
>>>>>>
>>>>>>> On 09/19/14 09:49, Jiri Pirko wrote:
>>>>>>>
>>>>>>>> This patch exposes switchdev API using generic Netlink.
>>>>>>>> Example userspace utility is here:
>>>>>>>>
>>>>>>>> https://github.com/jpirko/switchdev
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> Is this just a temporary test tool? Otherwise i dont see reason
>>>>>>> for its existence (or the API that it feeds on).
>>>>>>>
>>>>>> Please read the conversation I had with Pravin and Jesse in v1 thread.
>>>>>> Long story short they like to have the api separated from ovs datapath
>>>>>> so ovs daemon can use it to directly communicate with driver. Also John
>>>>>> Fastabend requested a way to work with driver flows without using ovs ->
>>>>>> that was the original reason I created switchdev genl api.
>>>>>>
>>>>>> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
>>>>>> will use directly switchdev genl api.
>>>>>>
>>>>>> I hope I cleared this out.
>>>>>>
>>>>> We already have all the needed rtnetlink kernel api and userspace tools around it to support all
>>>>> switching asic features. ie, the rtnetlink api is the switchdev api. We can do l2, l3, acl's with it.
>>>>> Its unclear to me why we need another new netlink api. Which will mean none of the existing tools to
>>>>> create bridges etc will work on a switchdev.
>>>>> Which seems like going in the direction exactly opposite to what we had discussed earlier.
>>>>>
>>>> Existing rtnetlink isn’t available to swdev without some kind of snooping the echoes from the various kernel components (bridge, fib, etc).  With swdev_flow, as Jiri has defined it, there is an additional conversion needed to bridge the gap (bad expression, I know) between rtnetlink and swdev_flow.  This conversion happens in the kernel components.  For example, the bridge module, still driven from userspace by existing rtnetlink, will formulate the necessary swdev_flow insert/remove calls to the swdev driver such that HW will offload the fwd path.
>>>>
>>>> You have:
>>>>      user -> rtnetlink -> kernel -> netlink echo -> [some process] -> [some driver] -> HW
>>>>
>>>>
>>>> Jiri has:
>>>>      user -> rtnetlink -> kernel -> swdev_* -> swdev driver -> HW
>>>>
>>>>
>>> Keeping the goal to not change or not add a new userspace API in mind,
>>> I have :
>>>      user -> rtnetlink -> kernel -> ndo_op -> swdev driver  -> HW
>>>
>> Then you have the same as Jiri, for the traditional L2/L3 case.
>>
>>> Jiri has:
>>>      user -> genl (newapi) -> kernel -> swdev_* -> swdev driver -> HW
>> Jiri’s genl is for userspace apps that are talking rtnetlink, like OVS.  It’s not a substitute for rtnetlink, it’s an alternative.  The complete picture is:
> Not an alternative, an addition.
>
>> user -> swdev genl -----
>>                         \
>>                          \
>>                           -------> kernel -> ndo_swdev_* -> swdev driver -> HW
>>                          /
>>                         /
>> user -> rtnetlink ------
> It is true that, as Thomas pointed out, we can probably nest this into
> rtnl_link messages. That might work.
That's what I was hinting at as well. You can extend the existing
infrastructure/API instead of adding a new one.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-20 11:32                   ` Jamal Hadi Salim
  2014-09-20 11:51                     ` Thomas Graf
@ 2014-09-22  7:53                     ` Jiri Pirko
       [not found]                       ` <20140922075337.GA1828-6KJVSR23iU488b5SBfVpbw@public.gmane.org>
  1 sibling, 1 reply; 67+ messages in thread
From: Jiri Pirko @ 2014-09-22  7:53 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Thomas Graf, John Fastabend, netdev, davem, nhorman, andy,
	dborkman, ogerlitz, jesse, pshelar, azhou, ben, stephen,
	jeffrey.t.kirsher, vyasevic, xiyou.wangcong, edumazet, sfeldma,
	f.fainelli, roopa, linville, dev, jasowang, ebiederm,
	nicolas.dichtel, ryazanov.s.a, buytenh, aviadr, nbd,
	alexei.starovoitov, Neil.Jerram, ronye, simon.horman,
	alexander.h.duyck

Sat, Sep 20, 2014 at 01:32:30PM CEST, jhs@mojatatu.com wrote:
>On 09/20/14 07:01, Thomas Graf wrote:
>
>>Nothing speaks against having such a tc classifier. In fact, having
>>the interface consist of only an embedded Netlink attribute structure
>>would allow for such a classifier in a very straight forward way.
>>
>>That doesn't mean everybody should be forced to use the stateful
>>tc interface.
>>
>
>
>Agreed. The response was to Jiri's strange statement that now that
>he cant use OVS, there is no such api. I point to tc as very capable of
>such usage.

Jamal, would you please give us some examples of how to use tc to work
with flows? I have a feeling that you see something other people do not.
Let's get on the same page now.
Thanks.


>
>>No need for false accusations here. Nobody ever mentioned vendor SDKs.
>>
>
>I am sorry to have tied the two together. Maybe not OVS but the approach
>described is heaven for vendor SDKs.
>
>>The statement was that the requirement of deriving hardware flows from
>>software flows *in the kernel* is not flexible enough for the future
>>for reasons such as:
>>
>>1) The OVS software data path might be based on eBPF in the future and
>>    it is unclear how we could derive hardware flows from that
>>    transparently.
>>
>
>Who says you cant put BPF in hardware?
>And why is OVS defining how BPF should evolve or how it should be used?
>
>>2) Depending on hardware capabilities. Hardware flows might need to be
>>    assisted by software flow counterparts and it is believed that it
>>    is the wrong approach to push all the necessary context for the
>>    decision down into the kernel. This can be argued about and I don't
>>    feel strongly either way.
>>
>
>Pointing to the current FDB offload: You can select to bypass
>and not use s/ware.
>
>cheers,
>jamal

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-20 22:50               ` Alexei Starovoitov
@ 2014-09-22  8:13                 ` Thomas Graf
  2014-09-22 15:10                   ` Tom Herbert
  0 siblings, 1 reply; 67+ messages in thread
From: Thomas Graf @ 2014-09-22  8:13 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jiri Pirko, John Fastabend, Jamal Hadi Salim, netdev,
	David S. Miller, Neil Horman, Andy Gospodarek, Daniel Borkmann,
	Or Gerlitz, Jesse Gross, Pravin Shelar, Andy Zhou, ben,
	Stephen Hemminger, jeffrey.t.kirsher, Vladislav Yasevich,
	Cong Wang, Eric Dumazet, Scott Feldman, Florian Fainelli,
	Roopa Prabhu, John Linville, dev@openvswitch.org

On 09/20/14 at 03:50pm, Alexei Starovoitov wrote:
> I think HW should not be limited by SW abstractions whether
> these abstractions are called flows, n-tuples, bridge or else.
> Really looking forward to see "device reporting the headers as
> header fields (len, offset) and the associated parse graph"
> as the first step.
> 
> Another topic that this discussion didn't cover yet is how this
> all connects to tunnels and what is 'tunnel offloading'.
> imo flow offloading by itself serves only academic interest.

We haven't touched encryption yet either ;-)

Certainly true for the host case. The Linux on TOR case is less
dependent on this and L2/L3 offload w/o encap already has value.

I'm with you though, all of this has little value on the host in
the DC if stateful encap offload is not incorporated. I expect the
HW to provide filters on the outer header plus metadata in the
encap. Actually, this was a follow-up question I had for John as
this is not easily describable with offset/len filters. How would
we represent such capabilities?

The TX side of this was one of the reasons why I initially thought
it would be beneficial to implement a cache like offload as we could
serve an initial encap in SW, do the FIB lookup and offload it
transparently to avoid replicating the FIB in user space.

What seems most feasible to me right now is to separate the offload
of the encap action from the IP -> dev mapping decision. The eSwitch
would send the first encap for an unknown dest IP to the CPU due
to a miss in the IP mapping table, the CPU would do the FIB lookup,
update the table and send it back.
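
The miss-then-install sequence described above can be sketched as a small
userspace C model. The table size, the struct names and the fib_lookup()
stand-in are all hypothetical; a real FIB lookup and a real eSwitch table
look nothing like this, the sketch only shows the control flow.

```c
#include <assert.h>
#include <stdint.h>

#define MAP_SLOTS 4
#define DEV_NONE (-1)

struct ip_map_entry {
	uint32_t dst_ip;
	int dev;
	int valid;
};

struct eswitch {
	struct ip_map_entry map[MAP_SLOTS];	/* IP -> dev mapping table */
	unsigned int misses;			/* packets punted to the CPU */
};

/* Stand-in for the software FIB: an arbitrary rule for the demo. */
static int fib_lookup(uint32_t dst_ip)
{
	return (int)(dst_ip & 1);
}

static int hw_lookup(const struct eswitch *sw, uint32_t dst_ip)
{
	for (int i = 0; i < MAP_SLOTS; i++)
		if (sw->map[i].valid && sw->map[i].dst_ip == dst_ip)
			return sw->map[i].dev;
	return DEV_NONE;
}

static void hw_install(struct eswitch *sw, uint32_t dst_ip, int dev)
{
	for (int i = 0; i < MAP_SLOTS; i++) {
		if (!sw->map[i].valid) {
			sw->map[i] = (struct ip_map_entry){ dst_ip, dev, 1 };
			return;
		}
	}
	/* Table full: this destination simply stays on the slow path. */
}

/* Returns the output device for an encapsulated packet to dst_ip. */
static int encap_forward(struct eswitch *sw, uint32_t dst_ip)
{
	int dev;

	assert(sw);
	dev = hw_lookup(sw, dst_ip);
	if (dev != DEV_NONE)
		return dev;		/* hit: the CPU is not involved */

	sw->misses++;			/* miss: first packet goes to the CPU */
	dev = fib_lookup(dst_ip);	/* CPU does the FIB lookup */
	hw_install(sw, dst_ip, dev);	/* CPU updates the mapping table */
	return dev;			/* and the packet is sent back */
}
```

Only the first packet to a given destination takes the slow path here, so
the FIB itself never has to be replicated outside the kernel.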

What do you have in mind?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
       [not found]                       ` <20140922075337.GA1828-6KJVSR23iU488b5SBfVpbw@public.gmane.org>
@ 2014-09-22 11:48                         ` Jamal Hadi Salim
  0 siblings, 0 replies; 67+ messages in thread
From: Jamal Hadi Salim @ 2014-09-22 11:48 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w,
	jasowang-H+wXaHxf7aLQT0dZR+AlfA, John Fastabend,
	Neil.Jerram-QnUH15yq9NYqDJ6do+/SaQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, andy-QlMahl40kYEqcZcGjlUOXw,
	dev-yBygre7rU0TnMu66kgdUjQ, nbd-p3rKhJxN3npAfugRpC6u6w,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w, ronye-VPRAkNaXOzVWk0Htik3J/w,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, ben-/+tVBieCtBitmTQ+vhA3Yw,
	buytenh-OLH4Qvv75CYX/NnBR394Jw,
	alexander.h.duyck-ral2JQCrhuEAvxtiuMwx3w,
	simon.horman-wFxRvT7yatFl57MIdRCFDg,
	roopa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR,
	aviadr-VPRAkNaXOzVWk0Htik3J/w,
	nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	vyasevic-H+wXaHxf7aLQT0dZR+AlfA, nhorman-2XuSBdqkA4R54TAoqtyWWQ,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	dborkman-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q

On 09/22/14 03:53, Jiri Pirko wrote:

> Jamal, would you please give us some examples on how to use tc to work
> with flows? I have a feeling that you see something other people do not.

I will be a little verbose so as to avoid knowledge assumption.

Let's talk about the tc classifier/action subsystem because that is what
would take advantage of flows. We could also talk about qdiscs, i.e.
schedulers and queue objects, because the two are often related
(the default classification action is "classid" which typically
maps to a queue class).

tc classification/action subsystem allows you to specify arbitrary
classifiers and actions.
You can then specify (using a precise BNF grammar) how filters and
actions are to be related.
Look at iproute2/f_*.c to see the currently defined ones.

Each classifier has a name/id and attributes/options specific to
itself. Classifiers dont necessarily have to filter on packet
headers; they could filter on metadata for example.
Each classifier running in software may be offloaded. I think that
simple model would allow usable tools.
The classifier you have defined currently in your patches could
be realized via the u32 classifier but i think that would
require knowledge of u32. So for usability reasons I would
suggest to write a brand new classifier. For lack of a better
name, lets call it "multi-tuple classifier".
I would expect this classifier to be usable in software tc as
well without necessarily being offloaded.

There are two important details to note:
1) many different types of classifiers exist. This would very
likely depend on hardware implementation. It is academic bullshit
(i.e not pragmatic) to claim all hardware offload can use the
same classification language. As i was telling Thomas
I dont see why one wouldnt offload the defined bpf classifier.
 From an API level, this means your ->flow_add/del/get would have
to support ability to define different classifiers.

2) Each classifier will have different semantics.
 From a device API level this means you have to allow the different
classifiers to pass attributes specific to them. This means
each classifier may override the ops(). I am indifferent how
it is achieved. So while you could pass one big structure
such as your flow struct, one should be able to do u32
kind of semantics.
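
Points 1) and 2) above could be sketched roughly as follows, in userspace C.
The classifier ids, key layouts and the dev_flow_add() dispatcher are all
hypothetical and chosen only for illustration; the real ->flow_add/del/get
signatures are whatever the patchset ends up defining.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical classifier ids; not taken from the patchset. */
enum cls_type { CLS_U32, CLS_MULTI_TUPLE, CLS_TYPE_MAX };

struct cls_u32_key {		/* u32 style: match a masked 32-bit word */
	uint32_t offset, mask, value;
};

struct cls_tuple_key {		/* "multi-tuple" style: named fields */
	uint32_t src_ip, dst_ip;
	uint16_t dst_port;
};

struct flow_rule {
	enum cls_type type;	/* selects which union member is valid */
	union {
		struct cls_u32_key u32;
		struct cls_tuple_key tuple;
	} key;
	uint32_t classid;	/* default action: map to a queue class */
};

/* Per-classifier ops, so each classifier keeps its own semantics. */
struct cls_ops {
	int (*flow_add)(const struct flow_rule *rule);
};

struct offload_dev {
	const struct cls_ops *cls[CLS_TYPE_MAX];	/* NULL = unsupported */
};

static int dev_flow_add(const struct offload_dev *dev,
			const struct flow_rule *rule)
{
	const struct cls_ops *ops;

	assert(dev && rule && rule->type < CLS_TYPE_MAX);
	ops = dev->cls[rule->type];
	if (!ops || !ops->flow_add)
		return -1;	/* caller keeps the rule in software only */
	return ops->flow_add(rule);
}

/* Toy multi-tuple backend: accepts any rule with a nonzero classid. */
static int toy_tuple_add(const struct flow_rule *rule)
{
	return rule->classid ? 0 : -1;
}

static const struct cls_ops toy_tuple_ops = { .flow_add = toy_tuple_add };
```

A device that only registers toy_tuple_ops rejects u32 rules at the
dispatcher, so the software classifier remains the fallback.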

We also need to discover which device supports which classifiers
and what constraints exist in the hardware implementation
(we can talk about that because it is important). For example,
if one supports u32, how many u32 rules can be offloaded, etc.
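
One possible shape for such discovery, as a userspace C sketch: the device
reports a capability record per classifier, including a rule limit, and the
core checks it before offloading. The cls_caps struct and both helpers are
hypothetical names made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-classifier capability record reported by a device. */
struct cls_caps {
	const char *name;		/* e.g. "u32" */
	unsigned int max_rules;		/* hardware table size */
	unsigned int used_rules;	/* rules currently offloaded */
};

/* Returns the record for `name`, or NULL if the device lacks support. */
static struct cls_caps *caps_find(struct cls_caps *caps, size_t n,
				  const char *name)
{
	assert(caps && name);
	for (size_t i = 0; i < n; i++)
		if (strcmp(caps[i].name, name) == 0)
			return &caps[i];
	return NULL;
}

/* Claims one hardware rule slot; -1 means keep the rule in software. */
static int caps_reserve(struct cls_caps *c)
{
	if (!c || c->used_rules >= c->max_rules)
		return -1;
	c->used_rules++;
	return 0;
}
```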

As to how it is to be implemented:
I like the semantics of the current bridge code. I have always
wondered why we didnt use that scheme for offloading qdiscs.
Each device supporting FDB offload has an ->fdb_add/del/get
(dont quote me on the naming). User space describes what
it wants. If something is to be offloaded we already know the
netdev the user is pointing to. We invoke the appropriate
->flow() calls with appropriately cooked structures.
I am not sure i like that we pass the netlink structure as Scott
often seems to point to; i think that passing the internal
structure we would install in s/ware may be the better approach
since:
a) we would need to parse the data anyways for validation etc
b) each hardware offload will likely need to translate further in
internal format
c) we have a well-defined mapping between user and offload,
the generic structure will be very close to hardware.
note: that is what the fdb offload does.

Note: I described this using tc, but i dont see why nftable
couldnt follow the same approach. My angle is that we dont
impede other users by over-focussing on ovs and whatever
other things that surround it.
cheers,
jamal

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-22  8:13                 ` Thomas Graf
@ 2014-09-22 15:10                   ` Tom Herbert
  2014-09-22 22:17                     ` Thomas Graf
  2014-09-23  1:54                     ` Alexei Starovoitov
  0 siblings, 2 replies; 67+ messages in thread
From: Tom Herbert @ 2014-09-22 15:10 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Alexei Starovoitov, Jiri Pirko, John Fastabend, Jamal Hadi Salim,
	netdev, David S. Miller, Neil Horman, Andy Gospodarek,
	Daniel Borkmann, Or Gerlitz, Jesse Gross, Pravin Shelar,
	Andy Zhou, Ben Hutchings, Stephen Hemminger, Jeff Kirsher,
	Vladislav Yasevich, Cong Wang, Eric Dumazet, Scott Feldman,
	Florian Fainelli, Roopa Prabhu

On Mon, Sep 22, 2014 at 1:13 AM, Thomas Graf <tgraf@suug.ch> wrote:
> On 09/20/14 at 03:50pm, Alexei Starovoitov wrote:
>> I think HW should not be limited by SW abstractions whether
>> these abstractions are called flows, n-tuples, bridge or else.
>> Really looking forward to see "device reporting the headers as
>> header fields (len, offset) and the associated parse graph"
>> as the first step.
>>
>> Another topic that this discussion didn't cover yet is how this
>> all connects to tunnels and what is 'tunnel offloading'.
>> imo flow offloading by itself serves only academic interest.
>
> We haven't touched encryption yet either ;-)
>
> Certainly true for the host case. The Linux on TOR case is less
> dependent on this and L2/L3 offload w/o encap already has value.
>
Thomas, can you (or someone else) quantify what the host case is? I
suppose there may be merit in using a switch on NIC for kernel bypass
scenarios, but I'm still having a hard time understanding how this
could be integrated into the host stack with benefits that outweigh
complexity. The history of stateful offloads in NICs is not great, and
encapsulation (stuffing a few bytes of header into a packet) is in
itself not nearly an expensive enough operation to warrant offloading
to the NIC. Personally, if NIC vendors are going to focus on
stateful offload, I would rather see it be for encryption, which I believe
currently does warrant offload at 40G and higher speeds.

Tom

> I'm with you though, all of this has little value on the host in
> the DC if stateful encap offload is not incorporated. I expect the
> HW to provide filters on the outer header plus metadata in the
> encap. Actually, this was a follow-up question I had for John as
> this is not easily describable with offset/len filters. How would
> we represent such capabilities?
>
> The TX side of this was one of the reasons why I initially thought
> it would be beneficial to implement a cache like offload as we could
> serve an initial encap in SW, do the FIB lookup and offload it
> transparently to avoid replicating the FIB in user space.
>
> What seems most feasible to me right now is to separate the offload
> of the encap action from the IP -> dev mapping decision. The eSwitch
> would send the first encap for an unknown dest IP to the CPU due
> to a miss in the IP mapping table, the CPU would do the FIB lookup,
> update the table and send it back.
>
> What do you have in mind?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-22 15:10                   ` Tom Herbert
@ 2014-09-22 22:17                     ` Thomas Graf
       [not found]                       ` <20140922221727.GA4708-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
  2014-09-23  1:54                     ` Alexei Starovoitov
  1 sibling, 1 reply; 67+ messages in thread
From: Thomas Graf @ 2014-09-22 22:17 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Alexei Starovoitov, Jiri Pirko, John Fastabend, Jamal Hadi Salim,
	netdev, David S. Miller, Neil Horman, Andy Gospodarek,
	Daniel Borkmann, Or Gerlitz, Jesse Gross, Pravin Shelar,
	Andy Zhou, Ben Hutchings, Stephen Hemminger, Jeff Kirsher,
	Vladislav Yasevich, Cong Wang, Eric Dumazet, Scott Feldman,
	Florian Fainelli, Roopa Prabhu

On 09/22/14 at 08:10am, Tom Herbert wrote:
> Thomas, can you (or someone else) quantify what the host case is. I
> suppose there may be merit in using a switch on NIC for kernel bypass
> scenarios, but I'm still having a hard time understanding how this
> could be integrated into the host stack with benefits that outweigh

Personally my primary interest is on lxc and vm based workloads w/
end to end encryption, encap, distributed L3 and NAT, and policy
enforcement including service graphs which imply both east-west
and north-south traffic patterns on a host. The usual I guess ;-)

> complexity. The history of stateful offloads in NICs is not great, and
> encapsulation (stuffing a few bytes of header into a packet) is in
> itself not nearly an expensive enough operation to warrant offloading

No argument here. The direct benchmark comparisons I've measured showed
only around 2% improvement.

What makes stateful offload interesting to me is that the final
destination of a packet is known at RX and can be redirected to a
queue or VF. This allows building packet batches on shared pages
while preserving the security model.

Will the gains outweigh complexity? I hope so but I don't know for
sure. If you have insights, let me know. What I know for sure is that
I don't want to rely on a kernel bypass for the above.

> to the NIC. Personally, if NIC vendors are going to focus on
> stateful offload, I would rather see it be for encryption, which I believe
> currently does warrant offload at 40G and higher speeds.

Agreed. I would like to see a focus on both.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
       [not found]                       ` <20140922221727.GA4708-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
@ 2014-09-22 22:40                         ` Tom Herbert
  2014-09-22 22:53                           ` Thomas Graf
       [not found]                           ` <CA+mtBx9ZVQ5r5Hzy9-uEnk+iu+HKkOP4+VANC06Xf8VvTxktwQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 2 replies; 67+ messages in thread
From: Tom Herbert @ 2014-09-22 22:40 UTC (permalink / raw)
  To: Thomas Graf
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w, Jason Wang, John Fastabend,
	Neil Jerram, Eric Dumazet, Andy Gospodarek,
	dev-yBygre7rU0TnMu66kgdUjQ, Felix Fietkau, Florian Fainelli,
	ronye-VPRAkNaXOzVWk0Htik3J/w, Jeff Kirsher, Or Gerlitz,
	Ben Hutchings, Lennert Buytenhek, Alexander Duyck, Jiri Pirko,
	simon.horman-wFxRvT7yatFl57MIdRCFDg, Roopa Prabhu,
	Jamal Hadi Salim, aviadr-VPRAkNaXOzVWk0Htik3J/w, Nicolas Dichtel,
	Vladislav Yasevich, Neil Horman,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Mon, Sep 22, 2014 at 3:17 PM, Thomas Graf <tgraf-G/eBtMaohhA@public.gmane.org> wrote:
> On 09/22/14 at 08:10am, Tom Herbert wrote:
>> Thomas, can you (or someone else) quantify what the host case is. I
>> suppose there may be merit in using a switch on NIC for kernel bypass
>> scenarios, but I'm still having a hard time understanding how this
>> could be integrated into the host stack with benefits that outweigh
>
> Personally my primary interest is on lxc and vm based workloads w/
> end to end encryption, encap, distributed L3 and NAT, and policy
> enforcement including service graphs which imply both east-west
> and north-south traffic patterns on a host. The usual I guess ;-)
>
>> complexity. The history of stateful offloads in NICs is not great, and
>> encapsulation (stuffing a few bytes of header into a packet) is in
>> itself not nearly an expensive enough operation to warrant offloading
>
> No argument here. The direct benchmark comparisons I've measured showed
> only around 2% improvement.
>
> What makes stateful offload interesting to me is that the final
> destination of a packet is known at RX and can be redirected to a
> queue or VF. This allows building packet batches on shared pages
> while preserving the security model.
>
How is this different from what rx-filtering already does?

> Will the gains outweigh complexity? I hope so but I don't know for
> sure. If you have insights, let me know. What I know for sure is that
> I don't want to rely on a kernel bypass for the above.
>
>> to the NIC. Personally, if NIC vendors are going to focus on
>> stateful offload, I would rather see it be for encryption, which I believe
>> currently does warrant offload at 40G and higher speeds.
>
> Agreed. I would like to see a focus on both.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-22 22:40                         ` Tom Herbert
@ 2014-09-22 22:53                           ` Thomas Graf
  2014-09-22 23:07                             ` Tom Herbert
       [not found]                           ` <CA+mtBx9ZVQ5r5Hzy9-uEnk+iu+HKkOP4+VANC06Xf8VvTxktwQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 67+ messages in thread
From: Thomas Graf @ 2014-09-22 22:53 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Alexei Starovoitov, Jiri Pirko, John Fastabend, Jamal Hadi Salim,
	netdev, David S. Miller, Neil Horman, Andy Gospodarek,
	Daniel Borkmann, Or Gerlitz, Jesse Gross, Pravin Shelar,
	Andy Zhou, Ben Hutchings, Stephen Hemminger, Jeff Kirsher,
	Vladislav Yasevich, Cong Wang, Eric Dumazet, Scott Feldman,
	Florian Fainelli, Roopa Prabhu

On 09/22/14 at 03:40pm, Tom Herbert wrote:
> On Mon, Sep 22, 2014 at 3:17 PM, Thomas Graf <tgraf@suug.ch> wrote:
> > What makes stateful offload interesting to me is that the final
> > destination of a packet is known at RX and can be redirected to a
> > queue or VF. This allows building packet batches on shared pages
> > while preserving the security model.
> >
> How is this different from what rx-filtering already does?

Without stateful offload I can't know where the packet is destined
to until after I've allocated an skb and parsed the packet in
software. I might be missing what you refer to here specifically.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-22 22:53                           ` Thomas Graf
@ 2014-09-22 23:07                             ` Tom Herbert
  2014-09-23  1:36                               ` John Fastabend
  0 siblings, 1 reply; 67+ messages in thread
From: Tom Herbert @ 2014-09-22 23:07 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Alexei Starovoitov, Jiri Pirko, John Fastabend, Jamal Hadi Salim,
	netdev, David S. Miller, Neil Horman, Andy Gospodarek,
	Daniel Borkmann, Or Gerlitz, Jesse Gross, Pravin Shelar,
	Andy Zhou, Ben Hutchings, Stephen Hemminger, Jeff Kirsher,
	Vladislav Yasevich, Cong Wang, Eric Dumazet, Scott Feldman,
	Florian Fainelli, Roopa Prabhu

On Mon, Sep 22, 2014 at 3:53 PM, Thomas Graf <tgraf@suug.ch> wrote:
> On 09/22/14 at 03:40pm, Tom Herbert wrote:
>> On Mon, Sep 22, 2014 at 3:17 PM, Thomas Graf <tgraf@suug.ch> wrote:
>> > What makes stateful offload interesting to me is that the final
>> > destination of a packet is known at RX and can be redirected to a
>> > queue or VF. This allows building packet batches on shared pages
>> > while preserving the security model.
>> >
>> How is this different from what rx-filtering already does?
>
> Without stateful offload I can't know where the packet is destined
> to until after I've allocated an skb and parsed the packet in
> software. I might be missing what you refer to here specifically.

n-tuple filtering as exposed by ethtool.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-22 23:07                             ` Tom Herbert
@ 2014-09-23  1:36                               ` John Fastabend
  2014-09-23  7:19                                 ` Thomas Graf
  2014-09-23 11:09                                 ` Jamal Hadi Salim
  0 siblings, 2 replies; 67+ messages in thread
From: John Fastabend @ 2014-09-23  1:36 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Thomas Graf, Alexei Starovoitov, Jiri Pirko, John Fastabend,
	Jamal Hadi Salim, netdev, David S. Miller, Neil Horman,
	Andy Gospodarek, Daniel Borkmann, Or Gerlitz, Jesse Gross,
	Pravin Shelar, Andy Zhou, Ben Hutchings, Stephen Hemminger,
	Jeff Kirsher, Vladislav Yasevich, Cong Wang, Eric Dumazet,
	Scott Feldman, Florian Fainelli, Roopa

On 09/22/2014 04:07 PM, Tom Herbert wrote:
> On Mon, Sep 22, 2014 at 3:53 PM, Thomas Graf <tgraf@suug.ch> wrote:
>> On 09/22/14 at 03:40pm, Tom Herbert wrote:
>>> On Mon, Sep 22, 2014 at 3:17 PM, Thomas Graf <tgraf@suug.ch> wrote:
>>>> What makes stateful offload interesting to me is that the final
>>>> destination of a packet is known at RX and can be redirected to a
>>>> queue or VF. This allows building packet batches on shared pages
>>>> while preserving the security model.
>>>>
>>> How is this different from what rx-filtering already does?
>>
>> Without stateful offload I can't know where the packet is destined
>> to until after I've allocated an skb and parsed the packet in
>> software. I might be missing what you refer to here specifically.
>
> n-tuple filtering as exposed by ethtool.

n-tuple has some deficiencies:

	- it's not possible to get the capabilities to learn what
	  fields are supported by the device, what actions, etc.

	- it's ioctl based so we have to poll the device

	- only supports a single table, where we have devices with
	  multiple tables

	- sort of the same as above but it doesn't allow creating new
	  tables or destroying old tables.

I probably missed a few others, but those are the main ones that I
would like to address. Granted, other than the ioctl line, the rest could
be solved by extending the existing interface. However, I would just
as soon port it to ndo_ops and netlink rather than extend the existing ioctl,
seeing as it needs a reasonable overhaul to support the above anyway.

We could port the ethtool ops over to the new interface to
simplify drivers.

.John

-- 
John Fastabend         Intel Corporation

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-22 15:10                   ` Tom Herbert
  2014-09-22 22:17                     ` Thomas Graf
@ 2014-09-23  1:54                     ` Alexei Starovoitov
  2014-09-23  2:16                       ` Tom Herbert
  1 sibling, 1 reply; 67+ messages in thread
From: Alexei Starovoitov @ 2014-09-23  1:54 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Thomas Graf, Jiri Pirko, John Fastabend, Jamal Hadi Salim,
	netdev, David S. Miller, Neil Horman, Andy Gospodarek,
	Daniel Borkmann, Or Gerlitz, Jesse Gross, Pravin Shelar,
	Andy Zhou, Ben Hutchings, Stephen Hemminger, Jeff Kirsher,
	Vladislav Yasevich, Cong Wang, Eric Dumazet, Scott Feldman,
	Florian Fainelli, Roopa Prabhu, John Linville

On Mon, Sep 22, 2014 at 8:10 AM, Tom Herbert <therbert@google.com> wrote:
> On Mon, Sep 22, 2014 at 1:13 AM, Thomas Graf <tgraf@suug.ch> wrote:
>> On 09/20/14 at 03:50pm, Alexei Starovoitov wrote:
>>> I think HW should not be limited by SW abstractions whether
>>> these abstractions are called flows, n-tuples, bridge or else.
>>> Really looking forward to see "device reporting the headers as
>>> header fields (len, offset) and the associated parse graph"
>>> as the first step.
>>>
>>> Another topic that this discussion didn't cover yet is how this
>>> all connects to tunnels and what is 'tunnel offloading'.

> encapsulation (stuffing a few bytes of header into a packet) is in
> itself not nearly an expensive enough operation to warrant offloading
> to the NIC. Personally, I wish if NIC vendors are going to focus on

On the contrary, generic tunneling is the most important one to get
right when we're talking offloads.
Adding an encap header is easy to do in hw, but it breaks all other
offloads if the hw is not generic. Consider a gso packet coming from a vm.
A generic tunnel allows sw to add inner headers, outer headers and
set up offload offsets, so that HW does segmentation, checksumming
of the inner packet, adjusts inner headers and adds the final outer encap.
And this is just tx offload. On rx, smart tunnel offload in HW parses
the encap and goes all the way to the inner headers to verify checksums;
it also steers based on inner headers.
Try Mellanox NICs with and without vxlan offload to see
the difference.
It looks like fm10k will be just as good, but existing encaps are
not going to last forever, so RX should be improved the way John
is saying. There has got to be a 'parse graph' for HW to see past
variable length encap and into inner headers.
checksum_complete style of offloading checksum verification
is not efficient. The cost of adjusting it over and over while
parsing encaps is too high. Plus cpu steering based on outer
headers is just too slow when speeds are in the 40G range.
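
As a rough illustration of what a device-reported parse graph could
look like, here is a minimal Python sketch. It is not taken from any
patch in this series: the PARSE_GRAPH layout, walk() and build_example()
are hypothetical names, and only the idea of fields described as
(offset, length) plus edges between headers comes from the discussion.
Header sizes follow the usual Ethernet/IPv4/UDP/VXLAN layouts.

```python
# Hypothetical parse graph: every header lists its fields as
# (offset, length) in bytes and names the field that selects the next header.
PARSE_GRAPH = {
    "ethernet": {
        "hdr_len": lambda pkt, off: 14,
        "fields": {"dst": (0, 6), "src": (6, 6), "ethertype": (12, 2)},
        "next_key": "ethertype",
        "edges": {0x0800: "ipv4"},
    },
    "ipv4": {
        # IHL (low nibble of the first byte) counts 32-bit words, so the
        # header length has to be read from the packet itself.
        "hdr_len": lambda pkt, off: (pkt[off] & 0x0F) * 4,
        "fields": {"proto": (9, 1), "src": (12, 4), "dst": (16, 4)},
        "next_key": "proto",
        "edges": {17: "udp"},
    },
    "udp": {
        "hdr_len": lambda pkt, off: 8,
        "fields": {"sport": (0, 2), "dport": (2, 2)},
        "next_key": "dport",
        "edges": {4789: "vxlan"},  # IANA-assigned VXLAN port
    },
    "vxlan": {
        "hdr_len": lambda pkt, off: 8,
        "fields": {"vni": (4, 3)},
        "next_key": None,            # inner frame follows unconditionally
        "edges": {None: "ethernet"},
    },
}


def walk(pkt, start="ethernet", max_depth=8):
    """Return [(header_name, offset)] for each header the graph recognises."""
    path, name, off = [], start, 0
    for _ in range(max_depth):
        node = PARSE_GRAPH[name]
        hdr_len = node["hdr_len"](pkt, off)
        if off + hdr_len > len(pkt):
            break                    # truncated packet, stop parsing
        path.append((name, off))
        key = node["next_key"]
        if key is None:
            value = None
        else:
            foff, flen = node["fields"][key]
            value = int.from_bytes(pkt[off + foff:off + foff + flen], "big")
        name = node["edges"].get(value)
        if name is None:
            break                    # no edge for this value: unknown payload
        off += hdr_len
    return path


def build_example():
    """Outer eth/ipv4(with options)/udp/vxlan around an inner eth/ipv4."""
    eth = bytes(12) + (0x0800).to_bytes(2, "big")
    outer_ip = bytearray(24)
    outer_ip[0] = 0x46               # version 4, IHL 6 (one option word)
    outer_ip[9] = 17                 # UDP
    udp = (12345).to_bytes(2, "big") + (4789).to_bytes(2, "big") + bytes(4)
    vxlan = bytes([0x08]) + bytes(3) + (42).to_bytes(3, "big") + bytes(1)
    inner_ip = bytearray(20)
    inner_ip[0] = 0x45
    inner_ip[9] = 6                  # TCP, which this toy graph does not know
    return eth + bytes(outer_ip) + udp + vxlan + eth + bytes(inner_ip)


if __name__ == "__main__":
    for name, off in walk(build_example()):
        print(name, off)
```

The outer IPv4 header in build_example() carries options, so the
position of everything behind it can only be found by reading the
length out of the packet. That is the variable-length case a parse
graph is meant to cover.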

> stateful offload I rather see it be for encryption which I believe
> currently does warrant offload at 40G and higher speeds.

encryption offload is badly needed as well. Unfortunately it's
not seen as a NIC feature yet.


* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-23  1:54                     ` Alexei Starovoitov
@ 2014-09-23  2:16                       ` Tom Herbert
  2014-09-23  4:11                         ` Andy Gospodarek
  2014-09-23  9:52                         ` Thomas Graf
  0 siblings, 2 replies; 67+ messages in thread
From: Tom Herbert @ 2014-09-23  2:16 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Thomas Graf, Jiri Pirko, John Fastabend, Jamal Hadi Salim,
	netdev, David S. Miller, Neil Horman, Andy Gospodarek,
	Daniel Borkmann, Or Gerlitz, Jesse Gross, Pravin Shelar,
	Andy Zhou, Ben Hutchings, Stephen Hemminger, Jeff Kirsher,
	Vladislav Yasevich, Cong Wang, Eric Dumazet, Scott Feldman,
	Florian Fainelli, Roopa Prabhu, John Linville

On Mon, Sep 22, 2014 at 6:54 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Mon, Sep 22, 2014 at 8:10 AM, Tom Herbert <therbert@google.com> wrote:
>> On Mon, Sep 22, 2014 at 1:13 AM, Thomas Graf <tgraf@suug.ch> wrote:
>>> On 09/20/14 at 03:50pm, Alexei Starovoitov wrote:
>>>> I think HW should not be limited by SW abstractions whether
>>>> these abstractions are called flows, n-tuples, bridge or else.
>>>> Really looking forward to see "device reporting the headers as
>>>> header fields (len, offset) and the associated parse graph"
>>>> as the first step.
>>>>
>>>> Another topic that this discussion didn't cover yet is how this
>>>> all connects to tunnels and what is 'tunnel offloading'.
>
>> encapsulation (stuffing a few bytes of header into a packet) is in
>> itself not nearly an expensive enough operation to warrant offloading
>> to the NIC. Personally, I wish if NIC vendors are going to focus on
>
> On the contrary, generic tunneling is the most important one to get
> right when we're talking offloads.
> Adding an encap header is easy to do in hw, but it breaks all other
> offloads if the hw is not generic. Consider a gso packet coming from a vm.
> A generic tunnel allows sw to add inner headers, outer headers and
> set up offload offsets, so that HW does segmentation, checksumming
> of the inner packet, adjusts inner headers and adds the final outer encap.

As I pointed out on a previous thread, we already have a sufficiently
generic interface to allow HW to do encapsulated TSO
(SKB_GSO_UDP_TUNNEL and SKB_GSO_UDP_TUNNEL_CSUM with the inner
headers). If properly implemented, HW can implement a whole bunch of
UDP encap protocols without knowing how to parse them. I don't see how
a switch on the NIC helps this...

> And this is just tx offload. On rx smart tunnel offload in HW parses
> encap and goes all the way to inner headers to verify checksums,
> it also steers based on inner headers.
> Try Mellanox NICs with and without vxlan offload to see
> the difference.

Turn on UDP RSS on the device and I bet you'll see those differences
go away! Once we moved to UDP encapsulation, there's really little
value in looking at inner headers for RSS or ECMP; UDP RSS should be
sufficient. Sure, someone might want to parse the inner headers for
some sort of advanced RX steering, but again this implies rx-filtering
and not switch functionality.

Alexei, I believe you previously said that SW should not dictate
HW models. I agree with this, but also believe the converse is true--
HW shouldn't dictate the SW model. This is really why I'm raising the
question of what it means to integrate a switch into the host stack.
If this is something that doesn't require any model change to the
stack and is just a clever backend for rx-filters or tc, then I'm fine
with that!

Thanks,
Tom

> It looks like fm10k will be just as good, but existing encaps are
> not going to last forever, so RX should be improved the way John
> is saying. There has got to be a 'parse graph' for HW to see past
> variable length encap and into inner headers.
> checksum_complete style of offloading checksum verification
> is not efficient. The cost of adjusting it over and over while
> parsing encaps is too high. Plus cpu steering based on outer
> headers is just too slow when speeds are in the 40G range.
>
>> stateful offload I rather see it be for encryption which I believe
>> currently does warrant offload at 40G and higher speeds.
>
> encryption offload is badly needed as well. Unfortunately it's
> not seen as a NIC feature yet.


* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-23  2:16                       ` Tom Herbert
@ 2014-09-23  4:11                         ` Andy Gospodarek
  2014-09-23 10:11                           ` Thomas Graf
  2014-09-23 15:32                           ` Or Gerlitz
  2014-09-23  9:52                         ` Thomas Graf
  1 sibling, 2 replies; 67+ messages in thread
From: Andy Gospodarek @ 2014-09-23  4:11 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Alexei Starovoitov, Thomas Graf, Jiri Pirko, John Fastabend,
	Jamal Hadi Salim, netdev, David S. Miller, Neil Horman,
	Andy Gospodarek, Daniel Borkmann, Or Gerlitz, Jesse Gross,
	Pravin Shelar, Andy Zhou, Ben Hutchings, Stephen Hemminger,
	Jeff Kirsher, Vladislav Yasevich, Cong Wang, Eric Dumazet,
	Scott Feldman, Florian Fainelli, Roopa

On Mon, Sep 22, 2014 at 07:16:47PM -0700, Tom Herbert wrote:
[...]
> 
> Alexei, I believe you previously said that SW should not dictate
> HW models. I agree with this, but also believe the converse is true--
> HW shouldn't dictate SW model. This is really why I'm raising the
> question of what it means to integrate a switch into the host stack.

Tom, when I read this I cannot help but remind myself of the
intentions/hopes/dreams of those on this thread and how different their
views can be on what it means to add additional 'offload support' to the
kernel.

There are clearly some that are most interested in how an eSwitch on an
SR-IOV capable NIC can be controlled to provide traditional forwarding
help as well as offload the various technologies they hope to terminate
at/inside their endpoint (host/guest/container) -- Thomas's _simple_
use-case demonstrates this. ;)  This is a logical extension/increase in
functionality that is offered in many eSwitches but was previously
hidden from the user with the first generation SR-IOV capable network
devices on hosts/servers.

Others (like Florian who has been working to extend DSA or those pushing
hardware vendors to make SDKs more open) want the existing bridging/
routing/offload code to take advantage of the hardware offload/encap
available in merchant silicon.  The general idea seems to be to add the
knowledge of offload hardware to the kernel -- either via new ndo_ops or
netlink.  This gives users who have this hardware the ability to have a
solution for their router/switch that makes it feel like Linux is
actually helping make forwarding decisions -- rather than just being the
kernel chosen to provide an environment where some other non-community
code runs that makes all of the decisions.

And now we also have the patchset that spawned what I think has been
more excellent discussion.  Jiri and Scott's patches bring up another,
more generic model that, while not currently backed by hardware, provides
an example/vision for what could be done if such hardware existed and
how to consider interacting with that driver/hardware (that clearly has
been met with some resistance, but the discussion has been great).
Their ultimate goals appear to be similar to those that want full
offload/forwarding support for a device, but via a different method
than what some would consider standard.

I am personally hopeful that most who are passionate about this will be
able to get together next month at LPC (or send someone to represent
them!) so that those interested can sit in the same room and try to
better understand each other's desires and start to form some concrete
direction towards a solution that seems to meet the needs of most while
not being an architectural disaster.

Of course that may be way too optimistic for this crowd!  :-D


* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-23  1:36                               ` John Fastabend
@ 2014-09-23  7:19                                 ` Thomas Graf
  2014-09-23 11:09                                 ` Jamal Hadi Salim
  1 sibling, 0 replies; 67+ messages in thread
From: Thomas Graf @ 2014-09-23  7:19 UTC (permalink / raw)
  To: John Fastabend
  Cc: Tom Herbert, Alexei Starovoitov, Jiri Pirko, John Fastabend,
	Jamal Hadi Salim, netdev, David S. Miller, Neil Horman,
	Andy Gospodarek, Daniel Borkmann, Or Gerlitz, Jesse Gross,
	Pravin Shelar, Andy Zhou, Ben Hutchings, Stephen Hemminger,
	Jeff Kirsher, Vladislav Yasevich, Cong Wang, Eric Dumazet,
	Scott Feldman, Florian Fainelli, Roop

On 09/22/14 at 06:36pm, John Fastabend wrote:
> n-tuple has some deficiencies,
> 
> 	- its not possible to get the capabilities to learn what
> 	  fields are supported by the device, what actions, etc.
> 
> 	- its ioctl based so we have to poll the device
> 
> 	- only supports a single table, where we have devices with
> 	  multiple tables
> 
> 	- sort of the same as above but it doesn't allow creating new
> 	  tables or destroying old tables.

OK, I understand where Tom was going. Given we add feature detection
capabilities this could be used to identify the guest for fixed length
encap. I still assume HW won't be able to match on the inner header
for any variable length encap with metadata packet unless it can
actually parse the encap. I hope I didn't bring encap format to this
thread at this very moment ;-)


* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
       [not found]                           ` <CA+mtBx9ZVQ5r5Hzy9-uEnk+iu+HKkOP4+VANC06Xf8VvTxktwQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-09-23  9:18                             ` Thomas Graf
  0 siblings, 0 replies; 67+ messages in thread
From: Thomas Graf @ 2014-09-23  9:18 UTC (permalink / raw)
  To: Tom Herbert
  Cc: ryazanov.s.a-Re5JQEeQqe8AvxtiuMwx3w, Jason Wang, John Fastabend,
	Neil Jerram, Eric Dumazet, Andy Gospodarek,
	dev-yBygre7rU0TnMu66kgdUjQ, Felix Fietkau, Florian Fainelli,
	ronye-VPRAkNaXOzVWk0Htik3J/w, Jeff Kirsher, Or Gerlitz,
	Ben Hutchings, Lennert Buytenhek, Alexander Duyck, Jiri Pirko,
	simon.horman-wFxRvT7yatFl57MIdRCFDg, Roopa Prabhu,
	Jamal Hadi Salim, aviadr-VPRAkNaXOzVWk0Htik3J/w, Nicolas Dichtel,
	Vladislav Yasevich, Neil Horman,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On 09/22/14 at 03:40pm, Tom Herbert wrote:
> On Mon, Sep 22, 2014 at 3:17 PM, Thomas Graf <tgraf@suug.ch> wrote:
> > What makes stateful offload interesting to me is that the final
> > destination of a packet is known at RX and can be redirected to a
> > queue or VF. This allows building packet batches on shared pages
> > while preserving the security model.

To put this in other words: It is equivalent to applying the snabbswitch
+ vhost-user principle to the kernel but with encap support. The SR-IOV
case would be a further optimization of that.


* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-23  2:16                       ` Tom Herbert
  2014-09-23  4:11                         ` Andy Gospodarek
@ 2014-09-23  9:52                         ` Thomas Graf
  1 sibling, 0 replies; 67+ messages in thread
From: Thomas Graf @ 2014-09-23  9:52 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Alexei Starovoitov, Jiri Pirko, John Fastabend, Jamal Hadi Salim,
	netdev, David S. Miller, Neil Horman, Andy Gospodarek,
	Daniel Borkmann, Or Gerlitz, Jesse Gross, Pravin Shelar,
	Andy Zhou, Ben Hutchings, Stephen Hemminger, Jeff Kirsher,
	Vladislav Yasevich, Cong Wang, Eric Dumazet, Scott Feldman,
	Florian Fainelli, Roopa Prabhu

On 09/22/14 at 07:16pm, Tom Herbert wrote:
> Turn on UDP RSS on the device and I bet you'll see those differences
> go away! Once we moved to UDP encapsulation, there's really little
> value in looking at inner headers for RSS or ECMP; UDP RSS should be
> sufficient. Sure, someone might want to parse the inner headers for
> some sort of advanced RX steering, but again this implies rx-filtering
> and not switch functionality.

Agreed. The reason we discuss this in the context of this thread is
because the required rx-filtering capabilities seem to be introduced
in the form of (adapted) switch chip integrations onto NICs. In that
sense, OVS is essentially doing advanced RX steering in software.

I agree that switch functionality (whatever that specifically implies)
is not strictly required for the host if you consider queue
redirection as part of RX steering. The exception here would be use
of SR-IOV which could be highly interesting for corner cases if
combined with smart elephant guest detection. A classic example would
be NFV deployed in a virtualized environment, i.e. a virtual firewall
or DPI application serving a bunch of guests.

> If this is something that doesn't require any model change to the
> stack and is just a clever backend for rx-filters or tc, then I'm fine
> with that!

I haven't seen any model change proposed. I'm most certainly not
advocating that. Anyone who can live with a model change might as well
just stick to SnabbSwitch or DPDK.


* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-23  4:11                         ` Andy Gospodarek
@ 2014-09-23 10:11                           ` Thomas Graf
  2014-09-23 15:32                           ` Or Gerlitz
  1 sibling, 0 replies; 67+ messages in thread
From: Thomas Graf @ 2014-09-23 10:11 UTC (permalink / raw)
  To: Andy Gospodarek
  Cc: Tom Herbert, Alexei Starovoitov, Jiri Pirko, John Fastabend,
	Jamal Hadi Salim, netdev, David S. Miller, Neil Horman,
	Andy Gospodarek, Daniel Borkmann, Or Gerlitz, Jesse Gross,
	Pravin Shelar, Andy Zhou, Ben Hutchings, Stephen Hemminger,
	Jeff Kirsher, Vladislav Yasevich, Cong Wang, Eric Dumazet,
	Scott Feldman, Florian Fainelli, Roop

On 09/23/14 at 12:11am, Andy Gospodarek wrote:
> There are clearly some that are most interested in how an eSwitch on an
> SR-IOV capable NIC can be controlled to provide traditional forwarding
> help as well as offload the various technologies they hope to terminate
> at/inside their endpoint (host/guest/container) -- Thomas's _simple_
> use-case demonstrates this. ;)  This is a logical extension/increase in
> functionality that is offered in many eSwitches but was previously
> hidden from the user with the first generation SR-IOV capable network
> devices on hosts/servers.

I think we can define this more broadly and state that providing RX
steering capabilities to identify a guest in the NIC allows packets to
be mapped directly into a memory region shared between host and
guest. Not a new concept at all but the existing dMAC and VLAN rx
filtering is just too limiting. We require a programmable API with
support for encap and encryption. SR-IOV is a hardware assisted form
of that which can expedite the guest to guest path on a host.


* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-23  1:36                               ` John Fastabend
  2014-09-23  7:19                                 ` Thomas Graf
@ 2014-09-23 11:09                                 ` Jamal Hadi Salim
  1 sibling, 0 replies; 67+ messages in thread
From: Jamal Hadi Salim @ 2014-09-23 11:09 UTC (permalink / raw)
  To: John Fastabend, Tom Herbert
  Cc: Thomas Graf, Alexei Starovoitov, Jiri Pirko, John Fastabend,
	netdev, David S. Miller, Neil Horman, Andy Gospodarek,
	Daniel Borkmann, Or Gerlitz, Jesse Gross, Pravin Shelar,
	Andy Zhou, Ben Hutchings, Stephen Hemminger, Jeff Kirsher,
	Vladislav Yasevich, Cong Wang, Eric Dumazet, Scott Feldman,
	Florian Fainelli, Roopa Prabhu, John

On 09/22/14 21:36, John Fastabend wrote:

> n-tuple has some deficiencies,
>
>      - its not possible to get the capabilities to learn what
>        fields are supported by the device, what actions, etc.
>
>      - its ioctl based so we have to poll the device
>
>      - only supports a single table, where we have devices with
>        multiple tables
>
>      - sort of the same as above but it doesn't allow creating new
>        tables or destroying old tables.
>
> I probably missed a few others

A few more I can think of which are generic:
The whole event subsystem allowing multi-user sync or monitoring
offered by netlink is missing because ethtool ioctls go
directly to the driver.
The ioctl interface is synchronous, whereas the async interface
offered by netlink makes for more effective user programmability.
The ioctl binary interface is a pain to extend (don't
let Stephen H hear you mention ioctls, for just this one reason).

> but those are the main ones that I
> would like to address. Granted, other than the ioctl point the rest could
> be solved by extending the existing interface. However, I would just
> as soon port it to ndo_ops and netlink than extend the existing ioctl,
> seeing as it needs a reasonable overhaul to support the above anyway.
>
> We could port the ethtool ops over to the new interface to
> simplify drivers.

Indeed.

cheers,
jamal

> .John
>


* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-23  4:11                         ` Andy Gospodarek
  2014-09-23 10:11                           ` Thomas Graf
@ 2014-09-23 15:32                           ` Or Gerlitz
  2014-09-24 13:32                             ` Thomas Graf
  1 sibling, 1 reply; 67+ messages in thread
From: Or Gerlitz @ 2014-09-23 15:32 UTC (permalink / raw)
  To: Andy Gospodarek, Tom Herbert
  Cc: Alexei Starovoitov, Thomas Graf, Jiri Pirko, John Fastabend,
	Jamal Hadi Salim, netdev, David S. Miller, Neil Horman,
	Andy Gospodarek, Daniel Borkmann, Jesse Gross, Pravin Shelar,
	Andy Zhou, Ben Hutchings, Stephen Hemminger, Jeff Kirsher,
	Vladislav Yasevich, Cong Wang, Eric Dumazet, Scott Feldman,
	Florian Fainelli, Roopa Prabhu, Joh


On 9/23/2014 7:11 AM, Andy Gospodarek wrote:
> On Mon, Sep 22, 2014 at 07:16:47PM -0700, Tom Herbert wrote:
> [...]
>> Alexei, I believe you previously said that SW should not dictate
>> HW models. I agree with this, but also believe the converse is true--
>> HW shouldn't dictate SW model. This is really why I'm raising the
>> question of what it means to integrate a switch into the host stack.
> Tom, when I read this I cannot help but remind myself of the
> intentions/hopes/dreams of those on this thread and how different their
> views can be on what it means to add additional 'offload support' to the
> kernel.
>
> There are clearly some that are most interested in how an eSwitch on an
> SR-IOV capable NIC can be controlled to provide traditional forwarding
> help as well as offload the various technologies they hope to terminate
> at/inside their endpoint (host/guest/container) -- Thomas's _simple_
> use-case demonstrates this. ;)  This is a logical extension/increase in
> functionality that is offered in many eSwitches but was previously
> hidden from the user with the first generation SR-IOV capable network
> devices on hosts/servers.

Indeed.

The idea is to leverage OVS to manage the eSwitch (embedded NIC switch)
as well (NOT to offload OVS).

We envision a seamless integration of a user environment based on OVS
with the SRIOV eSwitch, and the grounds for that were very well supported
in Jiri’s V1.

The eSwitch hardware does not need to have multiple tables and ‘enjoys’
the flat rule of OVS. The kernel datapath does not need to be aware of
the existence of HW nor its capabilities; it just also pushes the flow
to the switchdev which represents the eSwitch.

If the flow can be supported in HW it will be forwarded in HW, and if not
it will be forwarded by the kernel.
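
The "also push the flow to HW, fall back to the kernel" behaviour
described here can be sketched as follows. This is a toy model in
Python, assuming a made-up FakeESwitch whose flow_insert() reports
whether the device accepted the rule. None of these names are the
switchdev API from this patchset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    match: tuple    # e.g. (("in_port", 1), ("vlan", 10))
    actions: tuple  # e.g. (("output", 2),)


class FakeESwitch:
    """Hypothetical stand-in for the switchdev representing an eSwitch."""

    def __init__(self, supported_fields, capacity):
        self.supported_fields = set(supported_fields)
        self.capacity = capacity
        self.table = set()

    def flow_insert(self, flow):
        """Accept the flow only if every match field is supported and
        there is still room in the single flat table."""
        fields = {name for name, _ in flow.match}
        if not fields <= self.supported_fields:
            return False
        if len(self.table) >= self.capacity:
            return False
        self.table.add(flow)
        return True


class Datapath:
    """Software datapath that is unaware of HW capabilities."""

    def __init__(self, hw):
        self.hw = hw
        self.sw_flows = set()
        self.offloaded = set()

    def add_flow(self, flow):
        self.sw_flows.add(flow)          # software always keeps the flow
        if self.hw.flow_insert(flow):    # the device decides, not the datapath
            self.offloaded.add(flow)

    def forwarded_by(self, flow):
        return "hw" if flow in self.offloaded else "kernel"


if __name__ == "__main__":
    dp = Datapath(FakeESwitch({"in_port", "vlan"}, capacity=1))
    vlan_flow = Flow((("in_port", 1), ("vlan", 10)), (("output", 2),))
    tunnel_flow = Flow((("in_port", 1), ("tun_id", 42)), (("output", 3),))
    for flow in (vlan_flow, tunnel_flow):
        dp.add_flow(flow)
        print(flow.match, "->", dp.forwarded_by(flow))
```

The datapath never inspects supported_fields itself; it only sees the
accept/reject result, which mirrors the point that the kernel datapath
does not need to know the HW capabilities.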

> [....]
>
> And now we also have the patchset that spawned what I think has been
> more excellent discussion.  Jiri and Scott's patches bring up another,
> more generic model that, while not currently backed by hardware, provides
> an example/vision for what could be done if such hardware existed and
> how to consider interacting with that driver/hardware (that clearly has
> been met with some resistance, but the discussion has been great).
> Their ultimate goals appear to be similar to those that want full
> offload/forwarding support for a device, but via a different method
> than what some would consider standard.
>
> I am personally hopeful that most who are passionate about this will be
> able to get together next month at LPC (or send someone to represent
> them!) so that those interested can sit in the same room and try to
> better understand each other's desires and start to form some concrete
> direction towards a solution that seems to meet the needs of most while
> not being an architectural disaster.
>

Yep. LPC is the time and place to go over the multiple use-cases
(physical switch, eSwitch, eBPF, etc.) that could (should) be supported
by the basic framework.

Or.


* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-23 15:32                           ` Or Gerlitz
@ 2014-09-24 13:32                             ` Thomas Graf
  2014-09-26 20:03                               ` Or Gerlitz
  0 siblings, 1 reply; 67+ messages in thread
From: Thomas Graf @ 2014-09-24 13:32 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Andy Gospodarek, Tom Herbert, Alexei Starovoitov, Jiri Pirko,
	John Fastabend, Jamal Hadi Salim, netdev, David S. Miller,
	Neil Horman, Andy Gospodarek, Daniel Borkmann, Jesse Gross,
	Pravin Shelar, Andy Zhou, Ben Hutchings, Stephen Hemminger,
	Jeff Kirsher, Vladislav Yasevich, Cong Wang, Eric Dumazet,
	Scott Feldman, Florian Fainelli

On 09/23/14 at 06:32pm, Or Gerlitz wrote:
> Indeed.
> 
> The idea is to leverage OVS to manage eSwitch (embedded NIC switch) as well
> (NOT to offload OVS).
> 
> We envision a seamless integration of user environment which is based on OVS
> with SRIOV eSwitch and the grounds for that were very well supported in
> Jiri’s V1.

Please consider comparing your model with what is described here [0].
I'm trying to write down an architecture document that we can finalize
in Düsseldorf.

[0] http://goo.gl/qkzW5y

> The eSwitch hardware does not need to have multiple tables and ‘enjoys’ the
> flat rule of OVS. The kernel datapath does not need to be aware of the
> existence of HW nor its capabilities, it just pushes the flow also to the
> switchdev which represents the eSwitch.

I think you are saying that the kernel should not be required to make
the offload decision, which is fair. We definitely don't want to force
the decision to be outside though; there are several legit reasons to
support transparent offloads within the kernel as well outside of OVS.

> Yep. LPC is the time and place to go over the multiple use-cases (physical
> switch, eSwitch, eBPF, etc.) that could (should) be supported by the basic
> framework.

For reference:
http://www.linuxplumbersconf.org/2014/ocw/proposals/2463


* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-24 13:32                             ` Thomas Graf
@ 2014-09-26 20:03                               ` Or Gerlitz
  2014-09-26 21:02                                 ` Thomas Graf
  0 siblings, 1 reply; 67+ messages in thread
From: Or Gerlitz @ 2014-09-26 20:03 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Or Gerlitz, Andy Gospodarek, Tom Herbert, Alexei Starovoitov,
	Jiri Pirko, John Fastabend, Jamal Hadi Salim, netdev,
	David S. Miller, Neil Horman, Andy Gospodarek, Daniel Borkmann,
	Jesse Gross, Pravin Shelar, Andy Zhou, Ben Hutchings,
	Stephen Hemminger, Jeff Kirsher, Vladislav Yasevich, Cong Wang,
	Eric Dumazet, Scott Feldman

On Wed, Sep 24, 2014 at 4:32 PM, Thomas Graf <tgraf@suug.ch> wrote:
> On 09/23/14 at 06:32pm, Or Gerlitz wrote:
>> Indeed.
>>
>> The idea is to leverage OVS to manage eSwitch (embedded NIC switch) as well
>> (NOT to offload OVS).
>>
>> We envision a seamless integration of user environment which is based on OVS
>> with SRIOV eSwitch and the grounds for that were very well supported in
>> Jiri’s V1.
>
> Please consider comparing your model with what is described here [0].
> I'm trying to write down an architecture document that we can finalize
> in Düsseldorf.
> [0] http://goo.gl/qkzW5y

Yep, this can serve us for the architecture discussion @ LPC. Re the
SRIOV case, you referred to the case where guest VF traffic goes
through HW (say) VXLAN encap/decap -- just to make sure, we need also
to support the simpler case, where guest traffic just goes through
vlan tag/strip.


>> The eSwitch hardware does not need to have multiple tables and ‘enjoys’ the
>> flat rule of OVS. The kernel datapath does not need to be aware of the
>> existence of HW nor its capabilities, it just pushes the flow also to the
>> switchdev which represents the eSwitch.

> I think you are saying that the kernel should not be required to make
> the offload decision which is fair. We definitely don't want to force
> the decision to be outside though, there are several legit reasons to
> support transparent offloads within the kernel as well outside of OVS.
>
>> Yep. LPC is the time and place to go over the multiple use-cases (physical
>> switch, eSwitch, eBPF, etc.) that could (should) be supported by the basic
>> framework.
>
> For reference:
> http://www.linuxplumbersconf.org/2014/ocw/proposals/2463

The SRIOV case is only mentioned here in the "Compatibility with
existing FDB ioctls for SR-IOV" bullet, so I'm a bit nervous... we
need to have it clear in the agenda. Also, this BoF needs to be
double-len, two hours, can you act to get that done?

thanks,

Or.


* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-26 20:03                               ` Or Gerlitz
@ 2014-09-26 21:02                                 ` Thomas Graf
  0 siblings, 0 replies; 67+ messages in thread
From: Thomas Graf @ 2014-09-26 21:02 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Or Gerlitz, Andy Gospodarek, Tom Herbert, Alexei Starovoitov,
	Jiri Pirko, John Fastabend, Jamal Hadi Salim, netdev,
	David S. Miller, Neil Horman, Andy Gospodarek, Daniel Borkmann,
	Jesse Gross, Pravin Shelar, Andy Zhou, Ben Hutchings,
	Stephen Hemminger, Jeff Kirsher, Vladislav Yasevich, Cong Wang,
	Eric Dumazet, Scott Feldman

On 09/26/14 at 11:03pm, Or Gerlitz wrote:
> Yep, this can serve us for the architecture discussion @ LPC. Re the
> SRIOV case, you referred to the case where guest VF traffic goes
> through HW (say) VXLAN encap/decap -- just to make sure, we need also
> to support the simpler case, where guest traffic just goes through
> vlan tag/strip.

Agreed.

> The SRIOV case is only mentioned here in the "Compatibility with
> existing FDB ioctls for SR-IOV" bullet, so I'm a bit nervous... we
> need to have it clear in the agenda.

I think the offload API discussion should consider the SR-IOV case
but we might need to discuss additional details outside of that
BoF to ensure that the BoF can keep focus on the offload API itself.
That said, I suggest we define the specific agenda once we know that
the BoF has been accepted and 2 hours have been allocated ;-)

> Also, this BoF needs to be double-len, two hours, can you act to
> get that done?

This has already been requested.


* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
  2014-09-23  3:43 [patch net-next v2 8/9] switchdev: introduce Netlink API Alexei Starovoitov
@ 2014-09-23 20:57 ` Tom Herbert
  0 siblings, 0 replies; 67+ messages in thread
From: Tom Herbert @ 2014-09-23 20:57 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Thomas Graf, Jiri Pirko, John Fastabend, Jamal Hadi Salim,
	netdev, David S. Miller, Neil Horman, Andy Gospodarek,
	Daniel Borkmann, Or Gerlitz, Jesse Gross, Pravin Shelar,
	Andy Zhou, Ben Hutchings, Stephen Hemminger, Jeff Kirsher,
	Vladislav Yasevich, Cong Wang, Eric Dumazet, Scott Feldman,
	Florian Fainelli, Roopa Prabhu, John Linville

> SKB_GSO_UDP_TUNNEL_CSUM was the right way
> to start splitting overloaded and messy semantics of
> UDP_TUNNEL. I'm still not sure whether you've intended
> it for both rx and tx, since to support tunnel_csum on rx,
> parsing of encap is needed, whereas tx is so much simpler.
> Unless you're assuming checksum_complete model for rx...
>
>> If properly implemented, HW can implement a whole bunch of
>> UDP encap protocols without knowing how to parse them.
>
> on a tx side... yes, but I cannot see how you can do rx
> with inner csum verify without parsing encap.
> What do you have in mind ?
>
Implement checksum-complete. It does not require a device to parse the
encap, is usable with probably all encapsulation formats being
discussed, and easily supports multiple checksums in a packet. This
will even work with something like L2TP where a device can't do
stateless parsing (pseudo wire encapsulation).
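
To make the checksum-complete idea concrete, here is a minimal userspace
sketch in C. It assumes the device reports a ones' complement sum over
the whole packet and that the stack, which has already walked the outer
headers in software, knows the (even) byte offset where the inner
segment starts. All function names below are hypothetical and are not
the kernel's helpers; the pseudo-header sum is passed in by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ones' complement sum of a byte range, folded to 16 bits (RFC 1071 style). */
static uint16_t csum_fold_range(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
	if (len & 1)
		sum += (uint32_t)buf[len - 1] << 8;	/* pad odd tail with zero */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Ones' complement subtraction: a - b == a + ~b. */
static uint16_t csum_sub16(uint16_t a, uint16_t b)
{
	uint32_t sum = (uint32_t)a + (uint16_t)~b;

	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/*
 * Given the device-reported sum over pkt[0..len), return the sum over
 * pkt[off..len) by "pulling" the outer headers out of it. Only the
 * software stack needs to know where the encapsulation ends.
 */
static uint16_t csum_complete_pull(uint16_t complete, const uint8_t *pkt,
				   size_t off)
{
	assert(off % 2 == 0);	/* odd offsets would need a byte swap */
	return csum_sub16(complete, csum_fold_range(pkt, off));
}

/*
 * The inner checksum is valid when the inner segment (checksum field
 * included) plus the pseudo-header sums to 0xffff.
 */
static int inner_csum_ok(uint16_t complete, const uint8_t *pkt,
			 size_t inner_off, uint16_t pseudo)
{
	uint32_t sum = (uint32_t)csum_complete_pull(complete, pkt, inner_off)
		       + pseudo;

	sum = (sum & 0xffff) + (sum >> 16);
	return sum == 0xffff;
}
```

The same device-reported value can be pulled at several offsets, which
is how more than one checksum in a packet can be verified without the
device understanding any of the encapsulation layers.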

Of the five basic NIC offloads (RX-csum, TX-csum, TSO, LRO, and RSS),
LRO is the one that probably cannot be generalized so that NICs don't
need to parse specific encapsulation protocols. Fortunately, GRO
performance is now very comparable anyway so I tend to think LRO
support is not crucial (the same argument might be made for GSO/TSO I
suppose, but TSO we can mostly generalize). HW support for checksum
offloads and RSS are definitely still very relevant!

>> I don't see how
>> a switch on the NIC helps this...
>
> correct, just a switch on a nic isn't very useful.
>
> If immediate consumer of the packet is a VM,
> then doing switching in the nic after decap doesn't
> add much speed, since bridge+router+nat+policy in sw
> after decap and csum verify done by hw are fast enough.
> But switching in HW becomes useful when VF
> is a destination device, since it avoids hw->sw->hw
> roundtrip as Thomas was saying.
>
> Also there are x86 network gateways where tunneled
> traffic from virtual network is terminated and sent
> over internet or to other datacenter. Performance
> demands are high, so if tunnel+switch+nat+policy
> can be done in off-the-shelf HW it would be great.
>
>>> And this is just tx offload. On rx smart tunnel offload in HW parses
>>> encap and goes all the way to inner headers to verify checksums,
>>> it also steers based on inner headers.
>>> Try mellanox nics with and without vxlan offload to see
>>> the difference.
>>
>> Turn on UDP RSS on the device and I bet you'll see those differences
>> go away!
>
> Logically it should, since all inner flows should get
> hashed into different outer src_port, but somehow
> that didn't work. Need to re-investigate with your
> l4_hash stuff.
>
You may need to enable RSS for UDP. Like "ethtool -N eth0 rx-flow-hash
udp4 sdfn"
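
(In that ethtool string, s and d select the source and destination IP
addresses, and f and n select bytes 0-1 and 2-3 of the L4 header, i.e.
the UDP source and destination ports.) The reason this should spread
tunneled traffic can be sketched in C as below. This is only an
illustration under assumptions: the hash is a simple stand-in mixer,
not the Toeplitz hash real NICs typically use, and flow4, flow_hash(),
outer_src_port() and rss_queue() are hypothetical names, not kernel or
driver APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical IPv4 flow key: addresses plus L4 ports. */
struct flow4 {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
};

/* Stand-in 32-bit mixer; real devices usually use a Toeplitz hash. */
static uint32_t mix32(uint32_t h)
{
	h ^= h >> 16;
	h *= 0x7feb352dU;
	h ^= h >> 15;
	h *= 0x846ca68bU;
	h ^= h >> 16;
	return h;
}

static uint32_t flow_hash(const struct flow4 *f)
{
	uint32_t h = mix32(f->saddr);

	h = mix32(h ^ f->daddr);
	h = mix32(h ^ (((uint32_t)f->sport << 16) | f->dport));
	return h;
}

/* Encap side: scale the inner flow hash into the port range [min, max]. */
static uint16_t outer_src_port(uint32_t inner_hash, uint16_t min, uint16_t max)
{
	uint32_t range = (uint32_t)(max - min) + 1;

	assert(min <= max);
	return (uint16_t)(min + (((uint64_t)inner_hash * range) >> 32));
}

/*
 * Receive side: pick an RX queue from the outer headers only. With
 * use_ports == 0 (addresses only), every tunneled flow between the same
 * two endpoints lands on one queue. With use_ports != 0 the outer
 * source port, which carries the inner flow's entropy, takes part in
 * the hash, so inner flows can spread across queues.
 */
static unsigned int rss_queue(const struct flow4 *outer, int use_ports,
			      unsigned int nqueues)
{
	struct flow4 key = *outer;

	if (!use_ports)
		key.sport = key.dport = 0;
	return flow_hash(&key) % nqueues;
}
```

No parsing of the encapsulation is involved on receive: the device only
hashes the outer IP and UDP headers.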

>> Alexei, I believe you previously said that SW should not dictate
>> HW models. I agree with this, but also believe the converse is true--
>> HW shouldn't dictate SW model.
>
> completely agree!
>
>> This is really why I'm raising the
>> question of what it means to integrate a switch into the host stack.
>> If this is something that doesn't require any model change to the
>> stack and is just a clever backend for rx-filters or tc, then I'm fine
>> with that!
>
> agree as well. I'm not excited about switchdev
> abstraction from this given patch, since it looks overly
> simplified and not applicable to real silicon, but
> discussion about exposing programmable
> nics/switches to sw in a generic way is worth having :)

* Re: [patch net-next v2 8/9] switchdev: introduce Netlink API
@ 2014-09-23  3:43 Alexei Starovoitov
  2014-09-23 20:57 ` Tom Herbert
  0 siblings, 1 reply; 67+ messages in thread
From: Alexei Starovoitov @ 2014-09-23  3:43 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Thomas Graf, Jiri Pirko, John Fastabend, Jamal Hadi Salim,
	netdev, David S. Miller, Neil Horman, Andy Gospodarek,
	Daniel Borkmann, Or Gerlitz, Jesse Gross, Pravin Shelar,
	Andy Zhou, Ben Hutchings, Stephen Hemminger, Jeff Kirsher,
	Vladislav Yasevich, Cong Wang, Eric Dumazet, Scott Feldman,
	Florian Fainelli, Roopa Prabhu, John Linville

On Mon, Sep 22, 2014 at 7:16 PM, Tom Herbert <therbert@google.com> wrote:
> On Mon, Sep 22, 2014 at 6:54 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> On Mon, Sep 22, 2014 at 8:10 AM, Tom Herbert <therbert@google.com> wrote:
>>> On Mon, Sep 22, 2014 at 1:13 AM, Thomas Graf <tgraf@suug.ch> wrote:
>>>> On 09/20/14 at 03:50pm, Alexei Starovoitov wrote:
>>>>> I think HW should not be limited by SW abstractions whether
>>>>> these abstractions are called flows, n-tuples, bridge or else.
>>>>> Really looking forward to see "device reporting the headers as
>>>>> header fields (len, offset) and the associated parse graph"
>>>>> as the first step.
>>>>>
>>>>> Another topic that this discussion didn't cover yet is how this
>>>>> all connects to tunnels and what is 'tunnel offloading'.
>>
>>> encapsulation (stuffing a few bytes of header into a packet) is in
>>> itself not nearly an expensive enough operation to warrant offloading
>>> to the NIC. Personally, I wish if NIC vendors are going to focus on
>>
>> On the contrary, generic tunneling is the most important one to get right
>> when we're talking offloads.
>> Adding encap header is easy to do in hw, but it breaks all other
>> offloads if hw is not generic. Consider gso packet coming from vm.
>> Generic tunnel allows sw to add inner headers, outer headers and
>> setup offload offsets, so that HW does segmentation, checksumming
>> of inner packet, adjusts inner headers and adds final outer encap.
>
> As I pointed out on a previous thread, we already have a sufficiently
> generic interface to allow HW to do encapsulated TSO
> (SKB_GSO_UDP_TUNNEL and SKB_GSO_UDP_TUNNEL_CSUM with the inner
> headers).

SKB_GSO_UDP_TUNNEL_CSUM was the right way
to start splitting the overloaded and messy semantics of
UDP_TUNNEL. I'm still not sure whether you've intended
it for both rx and tx, since to support tunnel_csum on rx,
parsing of encap is needed, whereas tx is so much simpler.
Unless you're assuming checksum_complete model for rx...

> If properly implemented, HW can implement a whole bunch of
> UDP encap protocols without knowing how to parse them.

on a tx side... yes, but I cannot see how you can do rx
with inner csum verify without parsing encap.
What do you have in mind ?

> I don't see how
> a switch on the NIC helps this...

correct, just a switch on a nic isn't very useful.

If immediate consumer of the packet is a VM,
then doing switching in the nic after decap doesn't
add much speed, since bridge+router+nat+policy in sw
after decap and csum verify done by hw are fast enough.
But switching in HW becomes useful when VF
is a destination device, since it avoids hw->sw->hw
roundtrip as Thomas was saying.

Also there are x86 network gateways where tunneled
traffic from virtual network is terminated and sent
over internet or to other datacenter. Performance
demands are high, so if tunnel+switch+nat+policy
can be done in off-the-shelf HW it would be great.

>> And this is just tx offload. On rx smart tunnel offload in HW parses
>> encap and goes all the way to inner headers to verify checksums,
>> it also steers based on inner headers.
>> Try mellanox nics with and without vxlan offload to see
>> the difference.
>
> Turn on UDP RSS on the device and I bet you'll see those differences
> go away!

Logically it should, since all inner flows should get
hashed into different outer src_port, but somehow
that didn't work. Need to re-investigate with your
l4_hash stuff.

> Alexei, I believe you previously said that SW should not dictate
> HW models. I agree with this, but also believe the converse is true--
> HW shouldn't dictate SW model.

completely agree!

> This is really why I'm raising the
> question of what it means to integrate a switch into the host stack.
> If this is something that doesn't require any model change to the
> stack and is just a clever backend for rx-filters or tc, then I'm fine
> with that!

agree as well. I'm not excited about switchdev
abstraction from this given patch, since it looks overly
simplified and not applicable to real silicon, but
discussion about exposing programmable
nics/switches to sw in a generic way is worth having :)

end of thread, other threads:[~2014-09-26 21:02 UTC | newest]

Thread overview: 67+ messages
2014-09-19 13:49 [patch net-next v2 0/9] introduce rocker switch driver with hardware accelerated datapath api Jiri Pirko
2014-09-19 13:49 ` [patch net-next v2 1/9] net: rename netdev_phys_port_id to more generic name Jiri Pirko
     [not found]   ` <1411134590-4586-2-git-send-email-jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
2014-09-19 13:54     ` Jeff Kirsher
2014-09-19 13:49 ` [patch net-next v2 3/9] rtnl: expose physical switch id for particular device Jiri Pirko
2014-09-19 13:49 ` [patch net-next v2 4/9] net-sysfs: " Jiri Pirko
2014-09-19 13:49 ` [patch net-next v2 5/9] net: introduce dummy switch Jiri Pirko
     [not found]   ` <1411134590-4586-6-git-send-email-jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
2014-09-20  5:21     ` Florian Fainelli
2014-09-20  7:37       ` Jiri Pirko
2014-09-19 13:49 ` [patch net-next v2 6/9] switchdev: add basic support for flow matching and actions Jiri Pirko
2014-09-20  5:32   ` Florian Fainelli
2014-09-20  7:28     ` Jiri Pirko
2014-09-19 13:49 ` [patch net-next v2 7/9] switchdev: add swdev features Jiri Pirko
2014-09-19 13:49 ` [patch net-next v2 8/9] switchdev: introduce Netlink API Jiri Pirko
2014-09-19 15:25   ` Jamal Hadi Salim
2014-09-19 15:49     ` Jiri Pirko
2014-09-19 17:57       ` Jamal Hadi Salim
2014-09-19 22:12         ` John Fastabend
2014-09-19 22:18           ` Jamal Hadi Salim
2014-09-20  5:39             ` Florian Fainelli
2014-09-20  8:25               ` Jiri Pirko
2014-09-20  8:17             ` Jiri Pirko
2014-09-20 10:19               ` Jamal Hadi Salim
2014-09-20 11:01                 ` Thomas Graf
2014-09-20 11:32                   ` Jamal Hadi Salim
2014-09-20 11:51                     ` Thomas Graf
     [not found]                       ` <20140920115140.GA3777-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
2014-09-20 12:35                         ` Jamal Hadi Salim
2014-09-22  7:53                     ` Jiri Pirko
     [not found]                       ` <20140922075337.GA1828-6KJVSR23iU488b5SBfVpbw@public.gmane.org>
2014-09-22 11:48                         ` Jamal Hadi Salim
2014-09-20  5:36           ` Florian Fainelli
2014-09-20  8:14           ` Jiri Pirko
2014-09-20 10:53             ` Thomas Graf
2014-09-20 22:50               ` Alexei Starovoitov
2014-09-22  8:13                 ` Thomas Graf
2014-09-22 15:10                   ` Tom Herbert
2014-09-22 22:17                     ` Thomas Graf
     [not found]                       ` <20140922221727.GA4708-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>
2014-09-22 22:40                         ` Tom Herbert
2014-09-22 22:53                           ` Thomas Graf
2014-09-22 23:07                             ` Tom Herbert
2014-09-23  1:36                               ` John Fastabend
2014-09-23  7:19                                 ` Thomas Graf
2014-09-23 11:09                                 ` Jamal Hadi Salim
     [not found]                           ` <CA+mtBx9ZVQ5r5Hzy9-uEnk+iu+HKkOP4+VANC06Xf8VvTxktwQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-09-23  9:18                             ` Thomas Graf
2014-09-23  1:54                     ` Alexei Starovoitov
2014-09-23  2:16                       ` Tom Herbert
2014-09-23  4:11                         ` Andy Gospodarek
2014-09-23 10:11                           ` Thomas Graf
2014-09-23 15:32                           ` Or Gerlitz
2014-09-24 13:32                             ` Thomas Graf
2014-09-26 20:03                               ` Or Gerlitz
2014-09-26 21:02                                 ` Thomas Graf
2014-09-23  9:52                         ` Thomas Graf
2014-09-20  3:41       ` Roopa Prabhu
2014-09-20  8:09         ` Jiri Pirko
2014-09-20 12:39           ` Roopa Prabhu
2014-09-20  8:10         ` Scott Feldman
2014-09-20 10:31           ` Jamal Hadi Salim
     [not found]           ` <DDC24110-C3F5-470F-B9BE-1D1792415D1E-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR@public.gmane.org>
2014-09-20 12:51             ` Roopa Prabhu
2014-09-20 17:21               ` Scott Feldman
2014-09-20 17:38                 ` Jiri Pirko
2014-09-21  1:30                   ` Roopa Prabhu
2014-09-19 13:49 ` [patch net-next v2 9/9] rocker: introduce rocker switch driver Jiri Pirko
     [not found] ` <1411134590-4586-1-git-send-email-jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
2014-09-19 13:49   ` [patch net-next v2 2/9] net: introduce generic switch devices support Jiri Pirko
2014-09-19 14:15   ` [patch net-next v2 0/9] introduce rocker switch driver with hardware accelerated datapath api David Laight
     [not found]     ` <063D6719AE5E284EB5DD2968C1650D6D17495CC6-VkEWCZq2GCInGFn1LkZF6NBPR1lH4CV8@public.gmane.org>
2014-09-19 14:20       ` Jiri Pirko
2014-09-20  5:37         ` Florian Fainelli
2014-09-23  3:43 [patch net-next v2 8/9] switchdev: introduce Netlink API Alexei Starovoitov
2014-09-23 20:57 ` Tom Herbert