* [RFC net-next 0/9] TC filter HW offloads
@ 2016-02-01  8:34 Amir Vadai
  2016-02-01  8:34 ` [RFC net-next 1/9] net/flow_dissector: Make dissector_uses_key() and skb_flow_dissector_target() public Amir Vadai
                   ` (9 more replies)
  0 siblings, 10 replies; 23+ messages in thread
From: Amir Vadai @ 2016-02-01  8:34 UTC (permalink / raw)
  To: David S. Miller, netdev, John Fastabend
  Cc: Or Gerlitz, Hadar Har-Zion, Jiri Pirko, Jamal Hadi Salim, Amir Vadai

Hi,

So... just before sending this, I noticed John's series that deals
with tc and u32. One notable difference between the two approaches is
that here we "normalize" the upper layer way of describing matching and
actions into a generic structure (flow dissector, etc.), which should
allow offloading from different potential consumers (TC flower, a TC
u32 subset, netfilter, etc.). Another difference is that this series
uses the switchdev framework, which would allow the proposed HW
offloading mechanism to also be used by physical and SRIOV embedded
switches that make use of switchdev.

This patchset introduces an infrastructure to offload flow matching and
some basic actions to hardware, currently driven by the iproute2 tc tool.

In this patchset, the classification is described using the flower filter, and
the supported actions are drop (using gact) and mark (using skbedit).

Flow classification is described using a flow dissector that is built by
the tc filter. The filter also asks each action to serialize itself into
the new structure - switchdev_obj_port_flow_act.

The flow dissector and the serialized actions are passed via switchdev ops
to the HW driver, which parses them into hardware commands. We propose to
use the kernel flow dissector to describe flows/ACLs in the switchdev
framework, which in turn could also be used for HW offloading of other
kernel networking components.
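
As an illustration only (the real code is in patches 2/9, 3/9 and 6/9),
a consumer such as cls_flower is expected to do roughly the following;
filter, head, mask and indev are placeholders for the classifier's own
state:

	struct switchdev_obj_port_flow_act actions = {};
	int err;

	/* Serialize the tc actions attached to the filter (e.g. gact
	 * drop, skbedit mark) into the generic actions structure.
	 */
	err = tcf_exts_offload_init(&filter->exts, &actions);
	if (err)
		return err;

	/* Hand the dissector + mask + key and the serialized actions
	 * to the port driver through switchdev; the filter pointer is
	 * used as the cookie for later removal.
	 */
	err = switchdev_port_flow_add(indev, &head->dissector,
				      &mask.key, &filter->key,
				      &actions, (unsigned long)filter);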

An implementation of the above is provided using the mlx5 driver and
Mellanox ConnectX-4 HW.

Some issues that will be addressed before making the final submission:
1. 'offload' should be a generic filter attribute and not flower filter
   specific.
2. Serialization of actions will be changed to a list instead of one big
   structure describing all actions.

A few more matters to discuss:

1. Should HW offloading be done only under explicit admin directive?

2. switchdev is used today for physical switch HW, and an upcoming proposal
   extends it to SRIOV e-switch vport representors too. Here, we're using it
   with a NIC that can potentially serve as an uplink port for a v-switch
   (e.g. under a para-virtual scheme).

Sample usage of the feature:

export TC=../iproute2/tc/tc
export ETH=ens9

ifconfig $ETH 11.11.11.11/24 up

# add an ingress qdisc
$TC qdisc add dev $ETH ingress

# Drop ICMP (ip_proto 1) packets
$TC filter add dev $ETH protocol ip prio 20 parent ffff: \
                flower eth_type ip ip_proto 1 \
                indev $ETH offload \
                action drop

# Mark (with 0x1234) TCP (ip_proto 6) packets
$TC filter add dev $ETH protocol ip prio 30 parent ffff: \
                flower eth_type ip ip_proto 6 \
                indev $ETH offload \
                action skbedit mark 0x1234

# A NOP filter for packets that are marked (0x1234)
$TC filter add dev $ETH protocol ip prio 10 parent ffff: \
                handle 0x1234 fw action pass

# See that pings are blocked
# See that ssh is working (=TCP traffic)

# See NOP filter counters. If >0, HW marked the packets and the NOP filter caught them
$TC -s filter show dev $ETH parent ffff:

This patchset depends on a small fix [1] that is currently under review on the
mailing list. It was applied and tested on net-next commit 7a26019
("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")

[1] Depends on "net/mlx5_core: Set flow steering dest only for forward rules"
    - http://patchwork.ozlabs.org/patch/574055/   

Thanks,
Amir

Amir Vadai (9):
  net/flow_dissector: Make dissector_uses_key() and
    skb_flow_dissector_target() public
  net/switchdev: Introduce hardware offload support
  net/act: Offload support by tc actions
  net/act_skbedit: Introduce hardware offload support
  net/act_gact: Introduce hardware offload support for drop
  net/cls_flower: Introduce hardware offloading
  net/mlx5_core: Go to next flow table support
  net/mlx5e: Introduce MLX5_FLOW_NAMESPACE_OFFLOADS
  net/mlx5e: Flow steering support through switchdev

 drivers/net/ethernet/mellanox/mlx5/core/Kconfig    |   7 +
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   3 +
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |  10 +
 drivers/net/ethernet/mellanox/mlx5/core/en_fs.c    |  10 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   2 +
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    |   2 +
 .../net/ethernet/mellanox/mlx5/core/en_switchdev.c | 475 +++++++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/en_switchdev.h |  60 +++
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c  |  26 ++
 include/linux/mlx5/fs.h                            |   1 +
 include/net/act_api.h                              |   3 +
 include/net/flow_dissector.h                       |  13 +
 include/net/pkt_cls.h                              |   2 +
 include/net/switchdev.h                            |  46 ++
 include/uapi/linux/pkt_cls.h                       |   1 +
 net/core/flow_dissector.c                          |  13 -
 net/sched/act_gact.c                               |  17 +
 net/sched/act_skbedit.c                            |  18 +
 net/sched/cls_api.c                                |  27 ++
 net/sched/cls_flower.c                             |  54 ++-
 net/switchdev/switchdev.c                          |  33 ++
 21 files changed, 807 insertions(+), 16 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_switchdev.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_switchdev.h

-- 
2.7.0

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC net-next 1/9] net/flow_dissector: Make dissector_uses_key() and skb_flow_dissector_target() public
  2016-02-01  8:34 [RFC net-next 0/9] TC filter HW offloads Amir Vadai
@ 2016-02-01  8:34 ` Amir Vadai
  2016-02-01  8:34 ` [RFC net-next 2/9] net/switchdev: Introduce hardware offload support Amir Vadai
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 23+ messages in thread
From: Amir Vadai @ 2016-02-01  8:34 UTC (permalink / raw)
  To: David S. Miller, netdev, John Fastabend
  Cc: Or Gerlitz, Hadar Har-Zion, Jiri Pirko, Jamal Hadi Salim, Amir Vadai

Will be used in a following patch.

Signed-off-by: Amir Vadai <amir@vadai.me>
---
 include/net/flow_dissector.h | 13 +++++++++++++
 net/core/flow_dissector.c    | 13 -------------
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index 8c8548c..d3d60dc 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -184,4 +184,17 @@ static inline bool flow_keys_have_l4(struct flow_keys *keys)
 
 u32 flow_hash_from_keys(struct flow_keys *keys);
 
+static inline bool dissector_uses_key(const struct flow_dissector *flow_dissector,
+				      enum flow_dissector_key_id key_id)
+{
+	return flow_dissector->used_keys & (1 << key_id);
+}
+
+static inline void *skb_flow_dissector_target(struct flow_dissector *flow_dissector,
+					      enum flow_dissector_key_id key_id,
+					      void *target_container)
+{
+	return ((char *)target_container) + flow_dissector->offset[key_id];
+}
+
 #endif
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index d79699c..db0aa1c 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -19,25 +19,12 @@
 #include <net/flow_dissector.h>
 #include <scsi/fc/fc_fcoe.h>
 
-static bool dissector_uses_key(const struct flow_dissector *flow_dissector,
-			       enum flow_dissector_key_id key_id)
-{
-	return flow_dissector->used_keys & (1 << key_id);
-}
-
 static void dissector_set_key(struct flow_dissector *flow_dissector,
 			      enum flow_dissector_key_id key_id)
 {
 	flow_dissector->used_keys |= (1 << key_id);
 }
 
-static void *skb_flow_dissector_target(struct flow_dissector *flow_dissector,
-				       enum flow_dissector_key_id key_id,
-				       void *target_container)
-{
-	return ((char *) target_container) + flow_dissector->offset[key_id];
-}
-
 void skb_flow_dissector_init(struct flow_dissector *flow_dissector,
 			     const struct flow_dissector_key *key,
 			     unsigned int key_count)
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC net-next 2/9] net/switchdev: Introduce hardware offload support
  2016-02-01  8:34 [RFC net-next 0/9] TC filter HW offloads Amir Vadai
  2016-02-01  8:34 ` [RFC net-next 1/9] net/flow_dissector: Make dissector_uses_key() and skb_flow_dissector_target() public Amir Vadai
@ 2016-02-01  8:34 ` Amir Vadai
  2016-02-01  9:06   ` Jiri Pirko
  2016-02-01  9:26   ` John Fastabend
  2016-02-01  8:34 ` [RFC net-next 3/9] net/act: Offload support by tc actions Amir Vadai
                   ` (7 subsequent siblings)
  9 siblings, 2 replies; 23+ messages in thread
From: Amir Vadai @ 2016-02-01  8:34 UTC (permalink / raw)
  To: David S. Miller, netdev, John Fastabend
  Cc: Or Gerlitz, Hadar Har-Zion, Jiri Pirko, Jamal Hadi Salim, Amir Vadai

Extend the switchdev API with new operations: switchdev_port_flow_add()
and switchdev_port_flow_del().
These allow the user to add/delete a hardware offloaded flow classification
and its actions.
For every new flow object a cookie is supplied. This cookie will be
used later on to identify the flow when removed.

In order to make the API as flexible as possible, flow_dissector is
used to describe the flow classifier.

Every new flow object consists of a flow_dissector + key + mask to
describe the classifier and a switchdev_obj_port_flow_act to describe
the actions and their attributes.

The object is passed to the lower layer driver to be pushed into the
hardware.
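
For illustration, a driver is expected to pick the object up through the
standard switchdev object ops, along these lines (the foo_ names are
placeholders; the real user is the mlx5 code in patch 9/9):

	static int foo_port_obj_add(struct net_device *dev,
				    const struct switchdev_obj *obj,
				    struct switchdev_trans *trans)
	{
		switch (obj->id) {
		case SWITCHDEV_OBJ_ID_PORT_FLOW:
			/* Recover the flow object and program the HW */
			return foo_flow_add(dev, SWITCHDEV_OBJ_PORT_FLOW(obj),
					    trans);
		default:
			return -EOPNOTSUPP;
		}
	}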

Signed-off-by: Amir Vadai <amir@vadai.me>
---
 include/net/switchdev.h   | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 net/switchdev/switchdev.c | 33 +++++++++++++++++++++++++++++++++
 2 files changed, 79 insertions(+)

diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index d451122..c5a5681 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -15,6 +15,7 @@
 #include <linux/notifier.h>
 #include <linux/list.h>
 #include <net/ip_fib.h>
+#include <net/flow_dissector.h>
 
 #define SWITCHDEV_F_NO_RECURSE		BIT(0)
 #define SWITCHDEV_F_SKIP_EOPNOTSUPP	BIT(1)
@@ -69,6 +70,7 @@ enum switchdev_obj_id {
 	SWITCHDEV_OBJ_ID_IPV4_FIB,
 	SWITCHDEV_OBJ_ID_PORT_FDB,
 	SWITCHDEV_OBJ_ID_PORT_MDB,
+	SWITCHDEV_OBJ_ID_PORT_FLOW,
 };
 
 struct switchdev_obj {
@@ -124,6 +126,30 @@ struct switchdev_obj_port_mdb {
 #define SWITCHDEV_OBJ_PORT_MDB(obj) \
 	container_of(obj, struct switchdev_obj_port_mdb, obj)
 
+/* SWITCHDEV_OBJ_ID_PORT_FLOW */
+enum switchdev_obj_port_flow_action {
+	SWITCHDEV_OBJ_PORT_FLOW_ACT_DROP = 0,
+	SWITCHDEV_OBJ_PORT_FLOW_ACT_MARK = 1,
+};
+
+struct switchdev_obj_port_flow_act {
+	u32 actions; /* Bitmap of requested actions */
+	u32 mark; /* Value for mark action - if requested */
+};
+
+struct switchdev_obj_port_flow {
+	struct switchdev_obj obj;
+
+	unsigned long cookie;
+	struct flow_dissector *dissector; /* Dissector for mask and keys */
+	void *mask; /* Flow keys mask */
+	void *key;  /* Flow keys */
+	struct switchdev_obj_port_flow_act *actions;
+};
+
+#define SWITCHDEV_OBJ_PORT_FLOW(obj) \
+	container_of(obj, struct switchdev_obj_port_flow, obj)
+
 void switchdev_trans_item_enqueue(struct switchdev_trans *trans,
 				  void *data, void (*destructor)(void const *),
 				  struct switchdev_trans_item *tritem);
@@ -223,6 +249,12 @@ void switchdev_port_fwd_mark_set(struct net_device *dev,
 				 struct net_device *group_dev,
 				 bool joining);
 
+int switchdev_port_flow_add(struct net_device *dev,
+			    struct flow_dissector *dissector,
+			    void *mask, void *key,
+			    struct switchdev_obj_port_flow_act *actions,
+			    unsigned long cookie);
+int switchdev_port_flow_del(struct net_device *dev, unsigned long cookie);
 #else
 
 static inline void switchdev_deferred_process(void)
@@ -347,6 +379,20 @@ static inline void switchdev_port_fwd_mark_set(struct net_device *dev,
 {
 }
 
+static inline int switchdev_port_flow_add(struct net_device *dev,
+					  struct flow_dissector *dissector,
+					  void *mask, void *key,
+					  struct switchdev_obj_port_flow_act *actions,
+					  unsigned long cookie)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int switchdev_port_flow_del(struct net_device *dev,
+					  unsigned long cookie)
+{
+	return -EOPNOTSUPP;
+}
 #endif
 
 #endif /* _LINUX_SWITCHDEV_H_ */
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index ebc661d..67b4678 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -1383,3 +1383,36 @@ void switchdev_port_fwd_mark_set(struct net_device *dev,
 	dev->offload_fwd_mark = mark;
 }
 EXPORT_SYMBOL_GPL(switchdev_port_fwd_mark_set);
+
+/* Must not be deferred, since deferring does a shallow copy, which will not
+ * copy mask and key content
+ */
+int switchdev_port_flow_add(struct net_device *dev,
+			    struct flow_dissector *dissector,
+			    void *mask, void *key,
+			    struct switchdev_obj_port_flow_act *actions,
+			    unsigned long cookie)
+{
+	struct switchdev_obj_port_flow flow = {
+		.obj.id = SWITCHDEV_OBJ_ID_PORT_FLOW,
+		.cookie = cookie,
+		.dissector = dissector,
+		.mask = mask,
+		.key = key,
+		.actions = actions,
+	};
+
+	return switchdev_port_obj_add(dev, &flow.obj);
+}
+EXPORT_SYMBOL_GPL(switchdev_port_flow_add);
+
+int switchdev_port_flow_del(struct net_device *dev, unsigned long cookie)
+{
+	struct switchdev_obj_port_flow flow = {
+		.obj.id = SWITCHDEV_OBJ_ID_PORT_FLOW,
+		.cookie = cookie,
+	};
+
+	return switchdev_port_obj_del(dev, &flow.obj);
+}
+EXPORT_SYMBOL_GPL(switchdev_port_flow_del);
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC net-next 3/9] net/act: Offload support by tc actions
  2016-02-01  8:34 [RFC net-next 0/9] TC filter HW offloads Amir Vadai
  2016-02-01  8:34 ` [RFC net-next 1/9] net/flow_dissector: Make dissector_uses_key() and skb_flow_dissector_target() public Amir Vadai
  2016-02-01  8:34 ` [RFC net-next 2/9] net/switchdev: Introduce hardware offload support Amir Vadai
@ 2016-02-01  8:34 ` Amir Vadai
  2016-02-01  8:34 ` [RFC net-next 4/9] net/act_skbedit: Introduce hardware offload support Amir Vadai
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 23+ messages in thread
From: Amir Vadai @ 2016-02-01  8:34 UTC (permalink / raw)
  To: David S. Miller, netdev, John Fastabend
  Cc: Or Gerlitz, Hadar Har-Zion, Jiri Pirko, Jamal Hadi Salim, Amir Vadai

In order to support hardware offloading, an action should implement the
new offload_init() callback.
During filter initialization, offload_init() will be called to add
the action description to the actions object that will be used by the
filter to configure the hardware.
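
For illustration, an action would hook in roughly like this (act_foo is
a placeholder; the actual implementations for skbedit and gact follow in
patches 4/9 and 5/9):

	static int tcf_foo_offload_init(struct tc_action *a,
					struct switchdev_obj_port_flow_act *obj)
	{
		/* Translate this action instance into the generic
		 * description consumed by the filter / switchdev.
		 */
		obj->actions |= BIT(SWITCHDEV_OBJ_PORT_FLOW_ACT_DROP);
		return 0;
	}

	static struct tc_action_ops act_foo_ops = {
		.kind		= "foo",
		/* .type, .act, .dump, .init as usual, plus: */
		.offload_init	= tcf_foo_offload_init,
	};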

Signed-off-by: Amir Vadai <amir@vadai.me>
---
 include/net/act_api.h |  3 +++
 include/net/pkt_cls.h |  2 ++
 net/sched/cls_api.c   | 27 +++++++++++++++++++++++++++
 3 files changed, 32 insertions(+)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index 9d446f13..fcabe93 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -7,6 +7,7 @@
 
 #include <net/sch_generic.h>
 #include <net/pkt_sched.h>
+#include <net/switchdev.h>
 
 struct tcf_common {
 	struct hlist_node		tcfc_head;
@@ -108,6 +109,8 @@ struct tc_action_ops {
 			struct nlattr *est, struct tc_action *act, int ovr,
 			int bind);
 	int     (*walk)(struct sk_buff *, struct netlink_callback *, int, struct tc_action *);
+	int	(*offload_init)(struct tc_action *,
+				struct switchdev_obj_port_flow_act *);
 };
 
 int tcf_hash_search(struct tc_action *a, u32 index);
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index bc49967..7eb8ee9 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -130,6 +130,8 @@ tcf_exts_exec(struct sk_buff *skb, struct tcf_exts *exts,
 	return 0;
 }
 
+int tcf_exts_offload_init(struct tcf_exts *e,
+			  struct switchdev_obj_port_flow_act *actions);
 int tcf_exts_validate(struct net *net, struct tcf_proto *tp,
 		      struct nlattr **tb, struct nlattr *rate_tlv,
 		      struct tcf_exts *exts, bool ovr);
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index a75864d..d675c31 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -29,6 +29,7 @@
 #include <net/netlink.h>
 #include <net/pkt_sched.h>
 #include <net/pkt_cls.h>
+#include <net/switchdev.h>
 
 /* The list of all installed classifier types */
 static LIST_HEAD(tcf_proto_base);
@@ -551,6 +552,32 @@ int tcf_exts_validate(struct net *net, struct tcf_proto *tp, struct nlattr **tb,
 }
 EXPORT_SYMBOL(tcf_exts_validate);
 
+int tcf_exts_offload_init(struct tcf_exts *e,
+			  struct switchdev_obj_port_flow_act *actions)
+{
+#ifdef CONFIG_NET_CLS_ACT
+	struct tc_action *act;
+	int err = 0;
+
+	list_for_each_entry(act, &e->actions, list) {
+		if (!act->ops->offload_init) {
+			pr_err("Action %s doesn't have offload support\n",
+			       act->ops->kind);
+			err = -EINVAL;
+			break;
+		}
+		err = act->ops->offload_init(act, actions);
+		if (err)
+			break;
+	}
+
+	return err;
+#else
+	return -EOPNOTSUPP;
+#endif
+}
+EXPORT_SYMBOL(tcf_exts_offload_init);
+
 void tcf_exts_change(struct tcf_proto *tp, struct tcf_exts *dst,
 		     struct tcf_exts *src)
 {
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC net-next 4/9] net/act_skbedit: Introduce hardware offload support
  2016-02-01  8:34 [RFC net-next 0/9] TC filter HW offloads Amir Vadai
                   ` (2 preceding siblings ...)
  2016-02-01  8:34 ` [RFC net-next 3/9] net/act: Offload support by tc actions Amir Vadai
@ 2016-02-01  8:34 ` Amir Vadai
  2016-02-01  8:34 ` [RFC net-next 5/9] net/act_gact: Introduce hardware offload support for drop Amir Vadai
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 23+ messages in thread
From: Amir Vadai @ 2016-02-01  8:34 UTC (permalink / raw)
  To: David S. Miller, netdev, John Fastabend
  Cc: Or Gerlitz, Hadar Har-Zion, Jiri Pirko, Jamal Hadi Salim, Amir Vadai

Currently only the 'mark' operation is supported when hardware offload is
requested.

Signed-off-by: Amir Vadai <amir@vadai.me>
---
 net/sched/act_skbedit.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/net/sched/act_skbedit.c b/net/sched/act_skbedit.c
index 6751b5f..3113dfc 100644
--- a/net/sched/act_skbedit.c
+++ b/net/sched/act_skbedit.c
@@ -23,6 +23,7 @@
 #include <linux/rtnetlink.h>
 #include <net/netlink.h>
 #include <net/pkt_sched.h>
+#include <net/switchdev.h>
 
 #include <linux/tc_act/tc_skbedit.h>
 #include <net/tc_act/tc_skbedit.h>
@@ -173,6 +174,22 @@ nla_put_failure:
 	return -1;
 }
 
+static int tcf_skbedit_offload_init(struct tc_action *a,
+				    struct switchdev_obj_port_flow_act *obj)
+{
+	struct tcf_skbedit *d = a->priv;
+
+	if (d->flags == SKBEDIT_F_MARK) {
+		obj->actions |= BIT(SWITCHDEV_OBJ_PORT_FLOW_ACT_MARK);
+		obj->mark = d->mark;
+
+		return 0;
+	}
+
+	pr_err("Only 'mark' is supported for offloaded skbedit\n");
+	return -ENOTSUPP;
+}
+
 static struct tc_action_ops act_skbedit_ops = {
 	.kind		=	"skbedit",
 	.type		=	TCA_ACT_SKBEDIT,
@@ -180,6 +197,7 @@ static struct tc_action_ops act_skbedit_ops = {
 	.act		=	tcf_skbedit,
 	.dump		=	tcf_skbedit_dump,
 	.init		=	tcf_skbedit_init,
+	.offload_init	=	tcf_skbedit_offload_init,
 };
 
 MODULE_AUTHOR("Alexander Duyck, <alexander.h.duyck@intel.com>");
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC net-next 5/9] net/act_gact: Introduce hardware offload support for drop
  2016-02-01  8:34 [RFC net-next 0/9] TC filter HW offloads Amir Vadai
                   ` (3 preceding siblings ...)
  2016-02-01  8:34 ` [RFC net-next 4/9] net/act_skbedit: Introduce hardware offload support Amir Vadai
@ 2016-02-01  8:34 ` Amir Vadai
  2016-02-01  8:34 ` [RFC net-next 6/9] net/cls_flower: Introduce hardware offloading Amir Vadai
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 23+ messages in thread
From: Amir Vadai @ 2016-02-01  8:34 UTC (permalink / raw)
  To: David S. Miller, netdev, John Fastabend
  Cc: Or Gerlitz, Hadar Har-Zion, Jiri Pirko, Jamal Hadi Salim, Amir Vadai

Enable hardware offloaded packet dropping when the filter is marked with
the 'offload' attribute.

Signed-off-by: Amir Vadai <amir@vadai.me>
---
 net/sched/act_gact.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/net/sched/act_gact.c b/net/sched/act_gact.c
index 5c1b051..b639b18 100644
--- a/net/sched/act_gact.c
+++ b/net/sched/act_gact.c
@@ -20,6 +20,7 @@
 #include <linux/init.h>
 #include <net/netlink.h>
 #include <net/pkt_sched.h>
+#include <net/switchdev.h>
 #include <linux/tc_act/tc_gact.h>
 #include <net/tc_act/tc_gact.h>
 
@@ -183,6 +184,21 @@ nla_put_failure:
 	return -1;
 }
 
+static int tcf_gact_offload_init(struct tc_action *a,
+				 struct switchdev_obj_port_flow_act *obj)
+{
+	struct tcf_gact *gact = a->priv;
+
+	if (gact->tcf_action == TC_ACT_SHOT) {
+		obj->actions |= BIT(SWITCHDEV_OBJ_PORT_FLOW_ACT_DROP);
+
+		return 0;
+	}
+
+	pr_err("Only 'drop' is supported for offloaded gact\n");
+	return -ENOTSUPP;
+}
+
 static struct tc_action_ops act_gact_ops = {
 	.kind		=	"gact",
 	.type		=	TCA_ACT_GACT,
@@ -190,6 +206,7 @@ static struct tc_action_ops act_gact_ops = {
 	.act		=	tcf_gact,
 	.dump		=	tcf_gact_dump,
 	.init		=	tcf_gact_init,
+	.offload_init	=	tcf_gact_offload_init,
 };
 
 MODULE_AUTHOR("Jamal Hadi Salim(2002-4)");
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC net-next 6/9] net/cls_flower: Introduce hardware offloading
  2016-02-01  8:34 [RFC net-next 0/9] TC filter HW offloads Amir Vadai
                   ` (4 preceding siblings ...)
  2016-02-01  8:34 ` [RFC net-next 5/9] net/act_gact: Introduce hardware offload support for drop Amir Vadai
@ 2016-02-01  8:34 ` Amir Vadai
  2016-02-01  9:31   ` John Fastabend
  2016-02-01  8:34 ` [RFC net-next 7/9] net/mlx5_core: Go to next flow table support Amir Vadai
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 23+ messages in thread
From: Amir Vadai @ 2016-02-01  8:34 UTC (permalink / raw)
  To: David S. Miller, netdev, John Fastabend
  Cc: Or Gerlitz, Hadar Har-Zion, Jiri Pirko, Jamal Hadi Salim, Amir Vadai

During initialization, tcf_exts_offload_init() is called to initialize
the actions description. Later on, the classifier description is
prepared and sent to the switchdev layer using switchdev_port_flow_add().

When offloaded, fl_classify() is a NOP - the classification was already
done in hardware.

Signed-off-by: Amir Vadai <amir@vadai.me>
---
 include/uapi/linux/pkt_cls.h |  1 +
 net/sched/cls_flower.c       | 54 ++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 53 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 4398737..c18e82d 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -416,6 +416,7 @@ enum {
 	TCA_FLOWER_KEY_TCP_DST,		/* be16 */
 	TCA_FLOWER_KEY_UDP_SRC,		/* be16 */
 	TCA_FLOWER_KEY_UDP_DST,		/* be16 */
+	TCA_FLOWER_OFFLOAD,		/* flag */
 	__TCA_FLOWER_MAX,
 };
 
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 95b0212..e36d408 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -22,6 +22,7 @@
 #include <net/pkt_cls.h>
 #include <net/ip.h>
 #include <net/flow_dissector.h>
+#include <net/switchdev.h>
 
 struct fl_flow_key {
 	int	indev_ifindex;
@@ -56,6 +57,7 @@ struct cls_fl_head {
 	struct list_head filters;
 	struct rhashtable_params ht_params;
 	struct rcu_head rcu;
+	bool offload;
 };
 
 struct cls_fl_filter {
@@ -67,6 +69,7 @@ struct cls_fl_filter {
 	struct list_head list;
 	u32 handle;
 	struct rcu_head	rcu;
+	struct net_device *indev;
 };
 
 static unsigned short int fl_mask_range(const struct fl_flow_mask *mask)
@@ -123,6 +126,9 @@ static int fl_classify(struct sk_buff *skb, const struct tcf_proto *tp,
 	struct fl_flow_key skb_key;
 	struct fl_flow_key skb_mkey;
 
+	if (head->offload)
+		return -1;
+
 	fl_clear_masked_range(&skb_key, &head->mask);
 	skb_key.indev_ifindex = skb->skb_iif;
 	/* skb_flow_dissect() does not set n_proto in case an unknown protocol,
@@ -174,6 +180,9 @@ static bool fl_destroy(struct tcf_proto *tp, bool force)
 		return false;
 
 	list_for_each_entry_safe(f, next, &head->filters, list) {
+		if (head->offload)
+			switchdev_port_flow_del(f->indev, (unsigned long)f);
+
 		list_del_rcu(&f->list);
 		call_rcu(&f->rcu, fl_destroy_filter);
 	}
@@ -396,9 +405,11 @@ static int fl_check_assign_mask(struct cls_fl_head *head,
 }
 
 static int fl_set_parms(struct net *net, struct tcf_proto *tp,
+			struct cls_fl_head *head,
 			struct cls_fl_filter *f, struct fl_flow_mask *mask,
 			unsigned long base, struct nlattr **tb,
-			struct nlattr *est, bool ovr)
+			struct nlattr *est, bool ovr,
+			struct switchdev_obj_port_flow_act *actions)
 {
 	struct tcf_exts e;
 	int err;
@@ -413,6 +424,8 @@ static int fl_set_parms(struct net *net, struct tcf_proto *tp,
 		tcf_bind_filter(tp, &f->res, base);
 	}
 
+	head->offload = nla_get_flag(tb[TCA_FLOWER_OFFLOAD]);
+
 	err = fl_set_key(net, tb, &f->key, &mask->key);
 	if (err)
 		goto errout;
@@ -420,6 +433,24 @@ static int fl_set_parms(struct net *net, struct tcf_proto *tp,
 	fl_mask_update_range(mask);
 	fl_set_masked_key(&f->mkey, &f->key, mask);
 
+	if (head->offload) {
+		if (!f->key.indev_ifindex) {
+			pr_err("indev must be set when using offloaded filter\n");
+			err = -EINVAL;
+			goto errout;
+		}
+
+		f->indev = __dev_get_by_index(net, f->key.indev_ifindex);
+		if (!f->indev) {
+			err = -EINVAL;
+			goto errout;
+		}
+
+		err = tcf_exts_offload_init(&e, actions);
+		if (err)
+			goto errout;
+	}
+
 	tcf_exts_change(tp, &f->exts, &e);
 
 	return 0;
@@ -459,6 +490,7 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
 	struct cls_fl_filter *fnew;
 	struct nlattr *tb[TCA_FLOWER_MAX + 1];
 	struct fl_flow_mask mask = {};
+	struct switchdev_obj_port_flow_act actions = {};
 	int err;
 
 	if (!tca[TCA_OPTIONS])
@@ -486,7 +518,8 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
 	}
 	fnew->handle = handle;
 
-	err = fl_set_parms(net, tp, fnew, &mask, base, tb, tca[TCA_RATE], ovr);
+	err = fl_set_parms(net, tp, head, fnew, &mask, base, tb,
+			   tca[TCA_RATE], ovr, &actions);
 	if (err)
 		goto errout;
 
@@ -494,6 +527,17 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
 	if (err)
 		goto errout;
 
+	if (head->offload) {
+		err = switchdev_port_flow_add(fnew->indev,
+					      &head->dissector,
+					      &mask.key,
+					      &fnew->key,
+					      &actions,
+					      (unsigned long)fnew);
+		if (err)
+			goto errout;
+	}
+
 	err = rhashtable_insert_fast(&head->ht, &fnew->ht_node,
 				     head->ht_params);
 	if (err)
@@ -505,6 +549,12 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
 	*arg = (unsigned long) fnew;
 
 	if (fold) {
+		if (head->offload) {
+			err = switchdev_port_flow_del(fold->indev,
+						      (unsigned long)fold);
+			if (err)
+				goto errout;
+		}
 		list_replace_rcu(&fold->list, &fnew->list);
 		tcf_unbind_filter(tp, &fold->res);
 		call_rcu(&fold->rcu, fl_destroy_filter);
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC net-next 7/9] net/mlx5_core: Go to next flow table support
  2016-02-01  8:34 [RFC net-next 0/9] TC filter HW offloads Amir Vadai
                   ` (5 preceding siblings ...)
  2016-02-01  8:34 ` [RFC net-next 6/9] net/cls_flower: Introduce hardware offloading Amir Vadai
@ 2016-02-01  8:34 ` Amir Vadai
  2016-02-01  8:34 ` [RFC net-next 8/9] net/mlx5e: Introduce MLX5_FLOW_NAMESPACE_OFFLOADS Amir Vadai
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 23+ messages in thread
From: Amir Vadai @ 2016-02-01  8:34 UTC (permalink / raw)
  To: David S. Miller, netdev, John Fastabend
  Cc: Or Gerlitz, Hadar Har-Zion, Jiri Pirko, Jamal Hadi Salim, Amir Vadai

When the destination is NULL, continue processing the packet in the
following flow table.
This will be used by the offloads table to process traffic before any
other table (without knowing which table is next).
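
For example, a caller that wants the packet to continue to the next
chained flow table simply passes a NULL destination (this is how the
offloads table uses it in patch 9/9):

	/* No explicit destination: the core resolves the next chained
	 * flow table and forwards matching packets there.
	 */
	rule = mlx5_add_flow_rule(ft, match_criteria_enable,
				  match_c, match_v,
				  action, flow_tag, NULL);
	if (IS_ERR(rule))
		return PTR_ERR(rule);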

Signed-off-by: Amir Vadai <amir@vadai.me>
---
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index 6f68dba..fb3717a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -993,9 +993,27 @@ mlx5_add_flow_rule(struct mlx5_flow_table *ft,
 		   u32 flow_tag,
 		   struct mlx5_flow_destination *dest)
 {
+	struct mlx5_flow_destination *my_dest = NULL;
 	struct mlx5_flow_group *g;
 	struct mlx5_flow_rule *rule;
 
+	if (!dest) {
+		struct mlx5_flow_table *next_ft;
+		struct fs_prio *prio;
+
+		fs_get_obj(prio, ft->node.parent);
+		next_ft = find_next_chained_ft(prio);
+		if (!next_ft) {
+			pr_warn("There is no next flow table\n");
+			return ERR_PTR(-EINVAL);
+		}
+		my_dest = kzalloc(sizeof(*my_dest), GFP_KERNEL);
+		if (!my_dest)
+			return ERR_PTR(-ENOMEM);
+		my_dest->type = MLX5_FLOW_DESTINATION_TYPE_FLOW_TABLE;
+		my_dest->ft = next_ft;
+		dest = my_dest;
+	}
 	nested_lock_ref_node(&ft->node, FS_MUTEX_GRANDPARENT);
 	fs_for_each_fg(g, ft)
 		if (compare_match_criteria(g->mask.match_criteria_enable,
@@ -1012,6 +1030,7 @@ mlx5_add_flow_rule(struct mlx5_flow_table *ft,
 				   match_value, action, flow_tag, dest);
 unlock:
 	unlock_ref_node(&ft->node);
+	kfree(my_dest);
 	return rule;
 }
 EXPORT_SYMBOL(mlx5_add_flow_rule);
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC net-next 8/9] net/mlx5e: Introduce MLX5_FLOW_NAMESPACE_OFFLOADS
  2016-02-01  8:34 [RFC net-next 0/9] TC filter HW offloads Amir Vadai
                   ` (6 preceding siblings ...)
  2016-02-01  8:34 ` [RFC net-next 7/9] net/mlx5_core: Go to next flow table support Amir Vadai
@ 2016-02-01  8:34 ` Amir Vadai
  2016-02-01  8:34 ` [RFC net-next 9/9] net/mlx5e: Flow steering support through switchdev Amir Vadai
  2016-02-01  9:21 ` [RFC net-next 0/9] TC filter HW offloads John Fastabend
  9 siblings, 0 replies; 23+ messages in thread
From: Amir Vadai @ 2016-02-01  8:34 UTC (permalink / raw)
  To: David S. Miller, netdev, John Fastabend
  Cc: Or Gerlitz, Hadar Har-Zion, Jiri Pirko, Jamal Hadi Salim, Amir Vadai

A new namespace to be populated with flow steering rules that implement
offloaded rules (matching and/or actions) set by higher level entities
such as the TC subsystem.
This namespace is located after the bypass namespace and before the
kernel namespace, so it precedes the HW processing done for rules set
in the kernel NIC namespace.
This allows actions such as a HW drop, or HW setting of a flow tag that
later becomes skb->mark, to be applied before packets are matched
against the kernel namespace rules used by the EN NIC.
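
A driver consumes the new namespace the same way as the existing ones;
a sketch of what patch 9/9 does when creating the offloads flow table
(num_entries/num_groups are placeholders):

	ns = mlx5_get_flow_namespace(mdev, MLX5_FLOW_NAMESPACE_OFFLOADS);
	if (!ns)
		return -EINVAL;

	ft = mlx5_create_auto_grouped_flow_table(ns, 0, num_entries,
						 num_groups);
	if (IS_ERR(ft))
		return PTR_ERR(ft);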

Signed-off-by: Amir Vadai <amir@vadai.me>
---
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 7 +++++++
 include/linux/mlx5/fs.h                           | 1 +
 2 files changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index fb3717a..ffe1397 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -77,6 +77,10 @@
 #define KERNEL_NUM_PRIOS 1
 #define KENREL_MIN_LEVEL 2
 
+#define OFFLOADS_MAX_FT 1
+#define OFFLOADS_NUM_PRIOS 1
+#define OFFLOADS_MIN_LEVEL (BY_PASS_MIN_LEVEL + 1)
+
 struct node_caps {
 	size_t	arr_sz;
 	long	*caps;
@@ -100,6 +104,8 @@ static struct init_tree_node {
 					  FS_CAP(flow_table_properties_nic_receive.identified_miss_table_mode),
 					  FS_CAP(flow_table_properties_nic_receive.flow_table_modify)),
 			 ADD_NS(ADD_MULTIPLE_PRIO(MLX5_BY_PASS_NUM_PRIOS, BY_PASS_PRIO_MAX_FT))),
+		ADD_PRIO(0, OFFLOADS_MIN_LEVEL, 0, {},
+			 ADD_NS(ADD_MULTIPLE_PRIO(OFFLOADS_NUM_PRIOS, OFFLOADS_MAX_FT))),
 		ADD_PRIO(0, KENREL_MIN_LEVEL, 0, {},
 			 ADD_NS(ADD_MULTIPLE_PRIO(KERNEL_NUM_PRIOS, KERNEL_MAX_FT))),
 		ADD_PRIO(0, BY_PASS_MIN_LEVEL, 0,
@@ -1143,6 +1149,7 @@ struct mlx5_flow_namespace *mlx5_get_flow_namespace(struct mlx5_core_dev *dev,
 
 	switch (type) {
 	case MLX5_FLOW_NAMESPACE_BYPASS:
+	case MLX5_FLOW_NAMESPACE_OFFLOADS:
 	case MLX5_FLOW_NAMESPACE_KERNEL:
 	case MLX5_FLOW_NAMESPACE_LEFTOVERS:
 		prio = type;
diff --git a/include/linux/mlx5/fs.h b/include/linux/mlx5/fs.h
index 8230caa..40e79e2 100644
--- a/include/linux/mlx5/fs.h
+++ b/include/linux/mlx5/fs.h
@@ -50,6 +50,7 @@ static inline void build_leftovers_ft_param(int *priority,
 
 enum mlx5_flow_namespace_type {
 	MLX5_FLOW_NAMESPACE_BYPASS,
+	MLX5_FLOW_NAMESPACE_OFFLOADS,
 	MLX5_FLOW_NAMESPACE_KERNEL,
 	MLX5_FLOW_NAMESPACE_LEFTOVERS,
 	MLX5_FLOW_NAMESPACE_FDB,
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC net-next 9/9] net/mlx5e: Flow steering support through switchdev
  2016-02-01  8:34 [RFC net-next 0/9] TC filter HW offloads Amir Vadai
                   ` (7 preceding siblings ...)
  2016-02-01  8:34 ` [RFC net-next 8/9] net/mlx5e: Introduce MLX5_FLOW_NAMESPACE_OFFLOADS Amir Vadai
@ 2016-02-01  8:34 ` Amir Vadai
  2016-02-01 18:52   ` Saeed Mahameed
  2016-02-01  9:21 ` [RFC net-next 0/9] TC filter HW offloads John Fastabend
  9 siblings, 1 reply; 23+ messages in thread
From: Amir Vadai @ 2016-02-01  8:34 UTC (permalink / raw)
  To: David S. Miller, netdev, John Fastabend
  Cc: Or Gerlitz, Hadar Har-Zion, Jiri Pirko, Jamal Hadi Salim, Amir Vadai

Parse the switchdev flow object into device specific commands and program
the hardware to classify and mark/drop the flow accordingly.

A new Kconfig option is introduced: MLX5_EN_SWITCHDEV. It allows the
driver to be compiled even when switchdev support is not compiled in.

Signed-off-by: Amir Vadai <amir@vadai.me>
---
 drivers/net/ethernet/mellanox/mlx5/core/Kconfig    |   7 +
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   3 +
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |  10 +
 drivers/net/ethernet/mellanox/mlx5/core/en_fs.c    |  10 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   2 +
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    |   2 +
 .../net/ethernet/mellanox/mlx5/core/en_switchdev.c | 475 +++++++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/en_switchdev.h |  60 +++
 8 files changed, 568 insertions(+), 1 deletion(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_switchdev.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_switchdev.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
index c503ea0..61a9eed 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
@@ -19,3 +19,10 @@ config MLX5_CORE_EN
 	  Ethernet support in Mellanox Technologies ConnectX-4 NIC.
 	  Ethernet and Infiniband support in ConnectX-4 are currently mutually
 	  exclusive.
+
+config MLX5_EN_SWITCHDEV
+	bool "MLX5 EN switchdev support"
+	depends on MLX5_CORE_EN && NET_SWITCHDEV
+	default y
+	---help---
+	  Switchdev support in Mellanox Technologies ConnectX-4 NIC.
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 01c0256..b80143e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -3,6 +3,9 @@ obj-$(CONFIG_MLX5_CORE)		+= mlx5_core.o
 mlx5_core-y :=	main.o cmd.o debugfs.o fw.o eq.o uar.o pagealloc.o \
 		health.o mcg.o cq.o srq.o alloc.o qp.o port.o mr.o pd.o   \
 		mad.o transobj.o vport.o sriov.o fs_cmd.o fs_core.o
+
+mlx5_core-$(CONFIG_MLX5_EN_SWITCHDEV) += en_switchdev.o
+
 mlx5_core-$(CONFIG_MLX5_CORE_EN) += wq.o eswitch.o \
 		en_main.o en_fs.o en_ethtool.o en_tx.o en_rx.o \
 		en_txrx.o en_clock.o
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 9ea49a8..e61a67c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -39,6 +39,8 @@
 #include <linux/mlx5/qp.h>
 #include <linux/mlx5/cq.h>
 #include <linux/mlx5/vport.h>
+#include <linux/rhashtable.h>
+#include <net/switchdev.h>
 #include "wq.h"
 #include "transobj.h"
 #include "mlx5_core.h"
@@ -497,8 +499,16 @@ struct mlx5e_flow_table {
 	struct mlx5_flow_group		**g;
 };
 
+struct mlx5e_offloads_flow_table {
+	struct mlx5_flow_table		*t;
+
+	struct rhashtable_params        ht_params;
+	struct rhashtable               ht;
+};
+
 struct mlx5e_flow_tables {
 	struct mlx5_flow_namespace	*ns;
+	struct mlx5e_offloads_flow_table      offloads;
 	struct mlx5e_flow_table		vlan;
 	struct mlx5e_flow_table		main;
 };
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
index 80d81ab..0fbe45c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
@@ -36,6 +36,7 @@
 #include <linux/tcp.h>
 #include <linux/mlx5/fs.h>
 #include "en.h"
+#include "en_switchdev.h"
 
 #define MLX5_SET_CFG(p, f, v) MLX5_SET(create_flow_group_in, p, f, v)
 
@@ -1202,12 +1203,18 @@ int mlx5e_create_flow_tables(struct mlx5e_priv *priv)
 	if (err)
 		goto err_destroy_vlan_flow_table;
 
-	err = mlx5e_add_vlan_rule(priv, MLX5E_VLAN_RULE_TYPE_UNTAGGED, 0);
+	err = mlx5e_create_offloads_flow_table(priv);
 	if (err)
 		goto err_destroy_main_flow_table;
 
+	err = mlx5e_add_vlan_rule(priv, MLX5E_VLAN_RULE_TYPE_UNTAGGED, 0);
+	if (err)
+		goto err_destroy_offloads_flow_table;
+
 	return 0;
 
+err_destroy_offloads_flow_table:
+	mlx5e_destroy_offloads_flow_table(priv);
 err_destroy_main_flow_table:
 	mlx5e_destroy_main_flow_table(priv);
 err_destroy_vlan_flow_table:
@@ -1219,6 +1226,7 @@ err_destroy_vlan_flow_table:
 void mlx5e_destroy_flow_tables(struct mlx5e_priv *priv)
 {
 	mlx5e_del_vlan_rule(priv, MLX5E_VLAN_RULE_TYPE_UNTAGGED, 0);
+	mlx5e_destroy_offloads_flow_table(priv);
 	mlx5e_destroy_main_flow_table(priv);
 	mlx5e_destroy_vlan_flow_table(priv);
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 5c74a73..4bc9243 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -32,6 +32,7 @@
 
 #include <linux/mlx5/fs.h>
 #include "en.h"
+#include "en_switchdev.h"
 #include "eswitch.h"
 
 struct mlx5e_rq_param {
@@ -2178,6 +2179,7 @@ static void mlx5e_build_netdev(struct net_device *netdev)
 
 	netdev->priv_flags       |= IFF_UNICAST_FLT;
 
+	mlx5e_switchdev_init(netdev);
 	mlx5e_set_netdev_dev_addr(netdev);
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index dd959d9..678d4e0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -223,6 +223,8 @@ static inline void mlx5e_build_rx_skb(struct mlx5_cqe64 *cqe,
 	if (cqe_has_vlan(cqe))
 		__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q),
 				       be16_to_cpu(cqe->vlan_info));
+
+	skb->mark = be32_to_cpu(cqe->sop_drop_qpn) & 0x00ffffff;
 }
 
 int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_switchdev.c b/drivers/net/ethernet/mellanox/mlx5/core/en_switchdev.c
new file mode 100644
index 0000000..b88ead4
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_switchdev.c
@@ -0,0 +1,475 @@
+/*
+ * Copyright (c) 2015, Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <net/switchdev.h>
+#include <linux/mlx5/fs.h>
+#include <linux/mlx5/device.h>
+#include <linux/rhashtable.h>
+#include "en.h"
+#include "en_switchdev.h"
+#include "eswitch.h"
+
+struct mlx5e_switchdev_flow {
+	struct rhash_head	node;
+	unsigned long		cookie;
+	void			*rule;
+};
+
+static int prep_flow_attr(struct switchdev_obj_port_flow *f)
+{
+	struct switchdev_obj_port_flow_act *act = f->actions;
+
+	if (~(BIT(FLOW_DISSECTOR_KEY_CONTROL) |
+	      BIT(FLOW_DISSECTOR_KEY_BASIC) |
+	      BIT(FLOW_DISSECTOR_KEY_ETH_ADDRS) |
+	      BIT(FLOW_DISSECTOR_KEY_VLANID) |
+	      BIT(FLOW_DISSECTOR_KEY_IPV4_ADDRS) |
+	      BIT(FLOW_DISSECTOR_KEY_IPV6_ADDRS) |
+	      BIT(FLOW_DISSECTOR_KEY_PORTS) |
+	      BIT(FLOW_DISSECTOR_KEY_ETH_ADDRS)) & f->dissector->used_keys) {
+		pr_warn("Unsupported key used: 0x%x\n",
+			f->dissector->used_keys);
+		return -ENOTSUPP;
+	}
+
+	if (~(BIT(SWITCHDEV_OBJ_PORT_FLOW_ACT_DROP) |
+	      BIT(SWITCHDEV_OBJ_PORT_FLOW_ACT_MARK)) & act->actions) {
+		pr_warn("Unsupported action used: 0x%x\n", act->actions);
+		return -ENOTSUPP;
+	}
+
+	if (BIT(SWITCHDEV_OBJ_PORT_FLOW_ACT_MARK) & act->actions &&
+	    (act->mark & ~0xffff)) {
+		pr_warn("Bad flow mark - only 16 bit is supported: 0x%x\n",
+			act->mark);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int parse_flow_attr(u32 *match_c, u32 *match_v,
+			   u32 *action, u32 *flow_tag,
+			   struct switchdev_obj_port_flow *f)
+{
+	void *outer_headers_c = MLX5_ADDR_OF(fte_match_param, match_c,
+					     outer_headers);
+	void *outer_headers_v = MLX5_ADDR_OF(fte_match_param, match_v,
+					     outer_headers);
+	struct switchdev_obj_port_flow_act *act = f->actions;
+	u16 addr_type = 0;
+	u8 ip_proto = 0;
+
+	if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_CONTROL)) {
+		struct flow_dissector_key_control *key =
+			skb_flow_dissector_target(f->dissector,
+						  FLOW_DISSECTOR_KEY_BASIC,
+						  f->key);
+		addr_type = key->addr_type;
+	}
+
+	if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_BASIC)) {
+		struct flow_dissector_key_basic *key =
+			skb_flow_dissector_target(f->dissector,
+						  FLOW_DISSECTOR_KEY_BASIC,
+						  f->key);
+		struct flow_dissector_key_basic *mask =
+			skb_flow_dissector_target(f->dissector,
+						  FLOW_DISSECTOR_KEY_BASIC,
+						  f->mask);
+		ip_proto = key->ip_proto;
+
+		MLX5_SET(fte_match_set_lyr_2_4, outer_headers_c, ethertype,
+			 ntohs(mask->n_proto));
+		MLX5_SET(fte_match_set_lyr_2_4, outer_headers_v, ethertype,
+			 ntohs(key->n_proto));
+
+		MLX5_SET(fte_match_set_lyr_2_4, outer_headers_c, ip_protocol,
+			 mask->ip_proto);
+		MLX5_SET(fte_match_set_lyr_2_4, outer_headers_v, ip_protocol,
+			 key->ip_proto);
+	}
+
+	if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
+		struct flow_dissector_key_eth_addrs *key =
+			skb_flow_dissector_target(f->dissector,
+						  FLOW_DISSECTOR_KEY_ETH_ADDRS,
+						  f->key);
+		struct flow_dissector_key_eth_addrs *mask =
+			skb_flow_dissector_target(f->dissector,
+						  FLOW_DISSECTOR_KEY_ETH_ADDRS,
+						  f->mask);
+
+		ether_addr_copy(MLX5_ADDR_OF(fte_match_set_lyr_2_4,
+					     outer_headers_c, dmac_47_16),
+				mask->dst);
+		ether_addr_copy(MLX5_ADDR_OF(fte_match_set_lyr_2_4,
+					     outer_headers_v, dmac_47_16),
+				key->dst);
+
+		ether_addr_copy(MLX5_ADDR_OF(fte_match_set_lyr_2_4,
+					     outer_headers_c, smac_47_16),
+				mask->src);
+		ether_addr_copy(MLX5_ADDR_OF(fte_match_set_lyr_2_4,
+					     outer_headers_v, smac_47_16),
+				key->src);
+	}
+
+	if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_VLANID)) {
+		struct flow_dissector_key_tags *key =
+			skb_flow_dissector_target(f->dissector,
+						  FLOW_DISSECTOR_KEY_VLANID,
+						  f->key);
+		struct flow_dissector_key_tags *mask =
+			skb_flow_dissector_target(f->dissector,
+						  FLOW_DISSECTOR_KEY_VLANID,
+						  f->mask);
+		MLX5_SET(fte_match_set_lyr_2_4, outer_headers_c, vlan_tag, 1);
+		MLX5_SET(fte_match_set_lyr_2_4, outer_headers_v, vlan_tag, 1);
+
+		MLX5_SET(fte_match_set_lyr_2_4, outer_headers_c, first_vid,
+			 ntohs(mask->vlan_id));
+		MLX5_SET(fte_match_set_lyr_2_4, outer_headers_v, first_vid,
+			 ntohs(key->vlan_id));
+
+		MLX5_SET(fte_match_set_lyr_2_4, outer_headers_c, first_cfi,
+			 ntohs(mask->flow_label));
+		MLX5_SET(fte_match_set_lyr_2_4, outer_headers_v, first_cfi,
+			 ntohs(key->flow_label));
+
+		MLX5_SET(fte_match_set_lyr_2_4, outer_headers_c, first_prio,
+			 ntohs(mask->flow_label) >> 1);
+		MLX5_SET(fte_match_set_lyr_2_4, outer_headers_v, first_prio,
+			 ntohs(key->flow_label) >> 1);
+	}
+
+	if (addr_type == FLOW_DISSECTOR_KEY_IPV4_ADDRS) {
+		struct flow_dissector_key_ipv4_addrs *key =
+			skb_flow_dissector_target(f->dissector,
+						  FLOW_DISSECTOR_KEY_IPV4_ADDRS,
+						  f->key);
+		struct flow_dissector_key_ipv4_addrs *mask =
+			skb_flow_dissector_target(f->dissector,
+						  FLOW_DISSECTOR_KEY_IPV4_ADDRS,
+						  f->mask);
+
+		memcpy(MLX5_ADDR_OF(fte_match_set_lyr_2_4, outer_headers_c,
+				    src_ipv4_src_ipv6.ipv4_layout.ipv4),
+		       &mask->src, sizeof(mask->src));
+		memcpy(MLX5_ADDR_OF(fte_match_set_lyr_2_4, outer_headers_v,
+				    src_ipv4_src_ipv6.ipv4_layout.ipv4),
+		       &key->src, sizeof(key->src));
+		memcpy(MLX5_ADDR_OF(fte_match_set_lyr_2_4, outer_headers_c,
+				    dst_ipv4_dst_ipv6.ipv4_layout.ipv4),
+		       &mask->dst, sizeof(mask->dst));
+		memcpy(MLX5_ADDR_OF(fte_match_set_lyr_2_4, outer_headers_v,
+				    dst_ipv4_dst_ipv6.ipv4_layout.ipv4),
+		       &key->dst, sizeof(key->dst));
+	}
+
+	if (addr_type == FLOW_DISSECTOR_KEY_IPV6_ADDRS) {
+		struct flow_dissector_key_ipv6_addrs *key =
+			skb_flow_dissector_target(f->dissector,
+						  FLOW_DISSECTOR_KEY_IPV6_ADDRS,
+						  f->key);
+		struct flow_dissector_key_ipv6_addrs *mask =
+			skb_flow_dissector_target(f->dissector,
+						  FLOW_DISSECTOR_KEY_IPV6_ADDRS,
+						  f->mask);
+
+		memcpy(MLX5_ADDR_OF(fte_match_set_lyr_2_4, outer_headers_c,
+				    src_ipv4_src_ipv6.ipv6_layout.ipv6),
+		       &mask->src, sizeof(mask->src));
+		memcpy(MLX5_ADDR_OF(fte_match_set_lyr_2_4, outer_headers_v,
+				    src_ipv4_src_ipv6.ipv6_layout.ipv6),
+		       &key->src, sizeof(key->src));
+
+		memcpy(MLX5_ADDR_OF(fte_match_set_lyr_2_4, outer_headers_c,
+				    dst_ipv4_dst_ipv6.ipv6_layout.ipv6),
+		       &mask->dst, sizeof(mask->dst));
+		memcpy(MLX5_ADDR_OF(fte_match_set_lyr_2_4, outer_headers_v,
+				    dst_ipv4_dst_ipv6.ipv6_layout.ipv6),
+		       &key->dst, sizeof(key->dst));
+	}
+
+	if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_PORTS)) {
+		struct flow_dissector_key_ports *key =
+			skb_flow_dissector_target(f->dissector,
+						  FLOW_DISSECTOR_KEY_PORTS,
+						  f->key);
+		struct flow_dissector_key_ports *mask =
+			skb_flow_dissector_target(f->dissector,
+						  FLOW_DISSECTOR_KEY_PORTS,
+						  f->mask);
+		switch (ip_proto) {
+		case IPPROTO_TCP:
+			MLX5_SET(fte_match_set_lyr_2_4, outer_headers_c,
+				 tcp_sport, ntohs(mask->src));
+			MLX5_SET(fte_match_set_lyr_2_4, outer_headers_v,
+				 tcp_sport, ntohs(key->src));
+
+			MLX5_SET(fte_match_set_lyr_2_4, outer_headers_c,
+				 tcp_dport, ntohs(mask->dst));
+			MLX5_SET(fte_match_set_lyr_2_4, outer_headers_v,
+				 tcp_dport, ntohs(key->dst));
+			break;
+
+		case IPPROTO_UDP:
+			MLX5_SET(fte_match_set_lyr_2_4, outer_headers_c,
+				 udp_sport, ntohs(mask->src));
+			MLX5_SET(fte_match_set_lyr_2_4, outer_headers_v,
+				 udp_sport, ntohs(key->src));
+
+			MLX5_SET(fte_match_set_lyr_2_4, outer_headers_c,
+				 udp_dport, ntohs(mask->dst));
+			MLX5_SET(fte_match_set_lyr_2_4, outer_headers_v,
+				 udp_dport, ntohs(key->dst));
+			break;
+		default:
+			pr_err("Only UDP and TCP transport are supported\n");
+			return -EINVAL;
+		}
+	}
+
+	/* Actions: */
+	if (BIT(SWITCHDEV_OBJ_PORT_FLOW_ACT_MARK) & act->actions) {
+		*flow_tag = act->mark;
+		*action |= MLX5_FLOW_CONTEXT_ACTION_FWD_DEST;
+	}
+
+	if (BIT(SWITCHDEV_OBJ_PORT_FLOW_ACT_DROP) & act->actions)
+		*action |= MLX5_FLOW_CONTEXT_ACTION_DROP;
+
+	return 0;
+}
+
+#define MLX5E_TC_FLOW_TABLE_NUM_ENTRIES 10
+#define MLX5E_TC_FLOW_TABLE_NUM_GROUPS 10
+int mlx5e_create_offloads_flow_table(struct mlx5e_priv *priv)
+{
+	struct mlx5_flow_namespace *ns;
+
+	ns = mlx5_get_flow_namespace(priv->mdev,
+				     MLX5_FLOW_NAMESPACE_OFFLOADS);
+	if (!ns)
+		return -EINVAL;
+
+	priv->fts.offloads.t = mlx5_create_auto_grouped_flow_table(ns, 0,
+					    MLX5E_TC_FLOW_TABLE_NUM_ENTRIES,
+					    MLX5E_TC_FLOW_TABLE_NUM_GROUPS);
+	if (IS_ERR(priv->fts.offloads.t))
+		return PTR_ERR(priv->fts.offloads.t);
+
+	return 0;
+}
+
+void mlx5e_destroy_offloads_flow_table(struct mlx5e_priv *priv)
+{
+	mlx5_destroy_flow_table(priv->fts.offloads.t);
+	priv->fts.offloads.t = NULL;
+}
+
+static u8 generate_match_criteria_enable(u32 *match_c)
+{
+	u8 match_criteria_enable = 0;
+	void *outer_headers_c = MLX5_ADDR_OF(fte_match_param, match_c,
+					      outer_headers);
+	void *inner_headers_c = MLX5_ADDR_OF(fte_match_param, match_c,
+					      inner_headers);
+	void *misc_c = MLX5_ADDR_OF(fte_match_param, match_c,
+				     misc_parameters);
+	size_t header_size = MLX5_ST_SZ_BYTES(fte_match_set_lyr_2_4);
+	size_t misc_size = MLX5_ST_SZ_BYTES(fte_match_set_misc);
+
+	if (memchr_inv(outer_headers_c, 0, header_size))
+		match_criteria_enable |= MLX5_MATCH_OUTER_HEADERS;
+	if (memchr_inv(misc_c, 0, misc_size))
+		match_criteria_enable |= MLX5_MATCH_MISC_PARAMETERS;
+	if (memchr_inv(inner_headers_c, 0, header_size))
+		match_criteria_enable |= MLX5_MATCH_INNER_HEADERS;
+
+	return match_criteria_enable;
+}
+
+static int mlx5e_offloads_flow_add(struct net_device *netdev,
+				   struct switchdev_obj_port_flow *f)
+{
+	struct mlx5e_priv *priv = netdev_priv(netdev);
+	struct mlx5e_offloads_flow_table *offloads = &priv->fts.offloads;
+	struct mlx5_flow_table *ft = offloads->t;
+	u8 match_criteria_enable;
+	u32 *match_c;
+	u32 *match_v;
+	int err = 0;
+	u32 flow_tag = MLX5_FS_DEFAULT_FLOW_TAG;
+	u32 action = 0;
+	struct mlx5e_switchdev_flow *flow;
+
+	match_c = kzalloc(MLX5_ST_SZ_BYTES(fte_match_param), GFP_KERNEL);
+	match_v = kzalloc(MLX5_ST_SZ_BYTES(fte_match_param), GFP_KERNEL);
+	if (!match_c || !match_v) {
+		err = -ENOMEM;
+		goto free;
+	}
+
+	flow = kzalloc(sizeof(*flow), GFP_KERNEL);
+	if (!flow) {
+		err = -ENOMEM;
+		goto free;
+	}
+	flow->cookie = f->cookie;
+
+	err = parse_flow_attr(match_c, match_v, &action, &flow_tag, f);
+	if (err < 0)
+		goto free;
+
+	/* Outer header support only */
+	match_criteria_enable = generate_match_criteria_enable(match_c);
+
+	flow->rule = mlx5_add_flow_rule(ft, match_criteria_enable,
+					match_c, match_v,
+					action, flow_tag, NULL);
+	if (IS_ERR(flow->rule)) {
+		kfree(flow);
+		err = PTR_ERR(flow->rule);
+		goto free;
+	}
+
+	err = rhashtable_insert_fast(&offloads->ht, &flow->node,
+				     offloads->ht_params);
+	if (err) {
+		mlx5_del_flow_rule(flow->rule);
+		kfree(flow);
+	}
+
+free:
+	kfree(match_c);
+	kfree(match_v);
+	return err;
+}
+
+static int mlx5e_offloads_flow_del(struct net_device *netdev,
+				   struct switchdev_obj_port_flow *f)
+{
+	struct mlx5e_priv *priv = netdev_priv(netdev);
+	struct mlx5e_switchdev_flow *flow;
+	struct mlx5e_offloads_flow_table *offloads = &priv->fts.offloads;
+
+	flow = rhashtable_lookup_fast(&offloads->ht, &f->cookie,
+				      offloads->ht_params);
+	if (!flow) {
+		pr_err("Can't find requested flow\n");
+		return -EINVAL;
+	}
+
+	mlx5_del_flow_rule(flow->rule);
+
+	rhashtable_remove_fast(&offloads->ht, &flow->node, offloads->ht_params);
+	kfree(flow);
+
+	return 0;
+}
+
+static int mlx5e_port_obj_add(struct net_device *dev,
+			      const struct switchdev_obj *obj,
+			      struct switchdev_trans *trans)
+{
+	int err = 0;
+
+	if (trans->ph_prepare) {
+		switch (obj->id) {
+		case SWITCHDEV_OBJ_ID_PORT_FLOW:
+			err = prep_flow_attr(SWITCHDEV_OBJ_PORT_FLOW(obj));
+			break;
+		default:
+			err = -EOPNOTSUPP;
+			break;
+		}
+
+		return err;
+	}
+
+	switch (obj->id) {
+	case SWITCHDEV_OBJ_ID_PORT_FLOW:
+		err = mlx5e_offloads_flow_add(dev,
+					      SWITCHDEV_OBJ_PORT_FLOW(obj));
+		break;
+	default:
+		err = -EOPNOTSUPP;
+		break;
+	}
+
+	return err;
+}
+
+static int mlx5e_port_obj_del(struct net_device *dev,
+			      const struct switchdev_obj *obj)
+{
+	int err = 0;
+
+	switch (obj->id) {
+	case SWITCHDEV_OBJ_ID_PORT_FLOW:
+		err = mlx5e_offloads_flow_del(dev,
+					      SWITCHDEV_OBJ_PORT_FLOW(obj));
+		break;
+	default:
+		err = -EOPNOTSUPP;
+		break;
+	}
+
+	return err;
+}
+
+const struct switchdev_ops mlx5e_switchdev_ops = {
+	.switchdev_port_obj_add = mlx5e_port_obj_add,
+	.switchdev_port_obj_del = mlx5e_port_obj_del,
+};
+
+static const struct rhashtable_params mlx5e_switchdev_flow_ht_params = {
+	.head_offset = offsetof(struct mlx5e_switchdev_flow, node),
+	.key_offset = offsetof(struct mlx5e_switchdev_flow, cookie),
+	.key_len = sizeof(unsigned long),
+	.hashfn = jhash,
+	.automatic_shrinking = true,
+};
+
+void mlx5e_switchdev_init(struct net_device *netdev)
+{
+	struct mlx5e_priv *priv = netdev_priv(netdev);
+	struct mlx5e_offloads_flow_table *offloads = &priv->fts.offloads;
+
+	netdev->switchdev_ops = &mlx5e_switchdev_ops;
+
+	offloads->ht_params = mlx5e_switchdev_flow_ht_params;
+	rhashtable_init(&offloads->ht, &offloads->ht_params);
+}
+
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_switchdev.h b/drivers/net/ethernet/mellanox/mlx5/core/en_switchdev.h
new file mode 100644
index 0000000..8f4e3a3
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_switchdev.h
@@ -0,0 +1,60 @@
+/*
+ * Copyright (c) 2016, Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef __MLX5_EN_SWITCHDEV__H__
+#define __MLX5_EN_SWITCHDEV__H__
+
+#ifdef CONFIG_MLX5_EN_SWITCHDEV
+
+extern const struct switchdev_ops mlx5e_switchdev_ops;
+
+void mlx5e_destroy_offloads_flow_table(struct mlx5e_priv *priv);
+int mlx5e_create_offloads_flow_table(struct mlx5e_priv *priv);
+void mlx5e_switchdev_init(struct net_device *dev);
+
+#else
+static inline void mlx5e_destroy_offloads_flow_table(struct mlx5e_priv *priv)
+{
+}
+
+static inline int mlx5e_create_offloads_flow_table(struct mlx5e_priv *priv)
+{
+	return 0;
+}
+
+static inline void mlx5e_switchdev_init(struct net_device *dev)
+{
+}
+#endif
+
+#endif /* __MLX5_EN_SWITCHDEV__H__ */
+
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [RFC net-next 2/9] net/switchdev: Introduce hardware offload support
  2016-02-01  8:34 ` [RFC net-next 2/9] net/switchdev: Introduce hardware offload support Amir Vadai
@ 2016-02-01  9:06   ` Jiri Pirko
  2016-02-01  9:11     ` amirva
  2016-02-01  9:26   ` John Fastabend
  1 sibling, 1 reply; 23+ messages in thread
From: Jiri Pirko @ 2016-02-01  9:06 UTC (permalink / raw)
  To: Amir Vadai
  Cc: David S. Miller, netdev, John Fastabend, Or Gerlitz,
	Hadar Har-Zion, Jiri Pirko, Jamal Hadi Salim

Mon, Feb 01, 2016 at 09:34:38AM CET, amir@vadai.me wrote:
>Extend the switchdev API with new operations: switchdev_port_flow_add()
>and switchdev_port_flow_del().
>It allows the user to add/del a hardware offloaded flow classification
>and actions.
>For every new flow object a cookie is supplied. This cookie will be
>used later on to identify the flow when removed.
>
>In order to make the API as flexible as possible, flow_dissector is
>being used to describe the flow classifier.
>
>Every new flow object is consists of a flow_dissector+key+mask to
>describe the classifier and a switchdev_obj_port_flow_act to describe
>the actions and their attributes.
>
>object is passed to the lower layer driver to be pushed into the
>hardware.
>
>Signed-off-by: Amir Vadai <amir@vadai.me>
>---
> include/net/switchdev.h   | 46 ++++++++++++++++++++++++++++++++++++++++++++++
> net/switchdev/switchdev.c | 33 +++++++++++++++++++++++++++++++++
> 2 files changed, 79 insertions(+)
>
>diff --git a/include/net/switchdev.h b/include/net/switchdev.h
>index d451122..c5a5681 100644
>--- a/include/net/switchdev.h
>+++ b/include/net/switchdev.h
>@@ -15,6 +15,7 @@
> #include <linux/notifier.h>
> #include <linux/list.h>
> #include <net/ip_fib.h>
>+#include <net/flow_dissector.h>
> 
> #define SWITCHDEV_F_NO_RECURSE		BIT(0)
> #define SWITCHDEV_F_SKIP_EOPNOTSUPP	BIT(1)
>@@ -69,6 +70,7 @@ enum switchdev_obj_id {
> 	SWITCHDEV_OBJ_ID_IPV4_FIB,
> 	SWITCHDEV_OBJ_ID_PORT_FDB,
> 	SWITCHDEV_OBJ_ID_PORT_MDB,
>+	SWITCHDEV_OBJ_ID_PORT_FLOW,
> };
> 
> struct switchdev_obj {
>@@ -124,6 +126,30 @@ struct switchdev_obj_port_mdb {
> #define SWITCHDEV_OBJ_PORT_MDB(obj) \
> 	container_of(obj, struct switchdev_obj_port_mdb, obj)
> 
>+/* SWITCHDEV_OBJ_ID_PORT_FLOW */
>+enum switchdev_obj_port_flow_action {
>+	SWITCHDEV_OBJ_PORT_FLOW_ACT_DROP = 0,
>+	SWITCHDEV_OBJ_PORT_FLOW_ACT_MARK = 1,
>+};
>+
>+struct switchdev_obj_port_flow_act {
>+	u32 actions; /* Bitmap of requested actions */
>+	u32 mark; /* Value for mark action - if requested */

This approach is certainly not correct. We need a list of actions here
instead of a bitmap.
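
Something along these lines perhaps (just a rough sketch, the names
here are made up):

	struct switchdev_obj_port_flow_act_entry {
		struct list_head list;
		enum switchdev_obj_port_flow_action id;	/* DROP, MARK, ... */
		union {
			u32 mark;	/* valid only for the MARK action */
		};
	};

switchdev_obj_port_flow would then carry a list of these entries
instead of one bitmap plus a single mark value.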

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC net-next 2/9] net/switchdev: Introduce hardware offload support
  2016-02-01  9:06   ` Jiri Pirko
@ 2016-02-01  9:11     ` amirva
  0 siblings, 0 replies; 23+ messages in thread
From: amirva @ 2016-02-01  9:11 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Amir Vadai, David S. Miller, netdev, John Fastabend, Or Gerlitz,
	Hadar Har-Zion, Jiri Pirko, Jamal Hadi Salim

On Mon, Feb 01, 2016 at 10:06:27AM +0100, Jiri Pirko wrote:
> Mon, Feb 01, 2016 at 09:34:38AM CET, amir@vadai.me wrote:
> >Extend the switchdev API with new operations: switchdev_port_flow_add()
> >and switchdev_port_flow_del().
> >It allows the user to add/del a hardware offloaded flow classification
> >and actions.
> >For every new flow object a cookie is supplied. This cookie will be
> >used later on to identify the flow when removed.
> >
> >In order to make the API as flexible as possible, flow_dissector is
> >being used to describe the flow classifier.
> >
> >Every new flow object is consists of a flow_dissector+key+mask to
> >describe the classifier and a switchdev_obj_port_flow_act to describe
> >the actions and their attributes.
> >
> >object is passed to the lower layer driver to be pushed into the
> >hardware.
> >
> >Signed-off-by: Amir Vadai <amir@vadai.me>
> >---
> > include/net/switchdev.h   | 46 ++++++++++++++++++++++++++++++++++++++++++++++
> > net/switchdev/switchdev.c | 33 +++++++++++++++++++++++++++++++++
> > 2 files changed, 79 insertions(+)
> >
> >diff --git a/include/net/switchdev.h b/include/net/switchdev.h
> >index d451122..c5a5681 100644
> >--- a/include/net/switchdev.h
> >+++ b/include/net/switchdev.h
> >@@ -15,6 +15,7 @@
> > #include <linux/notifier.h>
> > #include <linux/list.h>
> > #include <net/ip_fib.h>
> >+#include <net/flow_dissector.h>
> > 
> > #define SWITCHDEV_F_NO_RECURSE		BIT(0)
> > #define SWITCHDEV_F_SKIP_EOPNOTSUPP	BIT(1)
> >@@ -69,6 +70,7 @@ enum switchdev_obj_id {
> > 	SWITCHDEV_OBJ_ID_IPV4_FIB,
> > 	SWITCHDEV_OBJ_ID_PORT_FDB,
> > 	SWITCHDEV_OBJ_ID_PORT_MDB,
> >+	SWITCHDEV_OBJ_ID_PORT_FLOW,
> > };
> > 
> > struct switchdev_obj {
> >@@ -124,6 +126,30 @@ struct switchdev_obj_port_mdb {
> > #define SWITCHDEV_OBJ_PORT_MDB(obj) \
> > 	container_of(obj, struct switchdev_obj_port_mdb, obj)
> > 
> >+/* SWITCHDEV_OBJ_ID_PORT_FLOW */
> >+enum switchdev_obj_port_flow_action {
> >+	SWITCHDEV_OBJ_PORT_FLOW_ACT_DROP = 0,
> >+	SWITCHDEV_OBJ_PORT_FLOW_ACT_MARK = 1,
> >+};
> >+
> >+struct switchdev_obj_port_flow_act {
> >+	u32 actions; /* Bitmap of requested actions */
> >+	u32 mark; /* Value for mark action - if requested */
> 
> This approach is certainly not correct. We need a list of actions here
> instead of bitmap.
This is what I meant in the cover letter by saying:

"2. Serialization of actions will be changed into a list instead of one
    big structure to describe all actions."

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC net-next 0/9] TC filter HW offloads
  2016-02-01  8:34 [RFC net-next 0/9] TC filter HW offloads Amir Vadai
                   ` (8 preceding siblings ...)
  2016-02-01  8:34 ` [RFC net-next 9/9] net/mlx5e: Flow steering support through switchdev Amir Vadai
@ 2016-02-01  9:21 ` John Fastabend
  2016-02-01 14:37   ` Amir Vadai
  9 siblings, 1 reply; 23+ messages in thread
From: John Fastabend @ 2016-02-01  9:21 UTC (permalink / raw)
  To: Amir Vadai, David S. Miller, netdev, John Fastabend
  Cc: Or Gerlitz, Hadar Har-Zion, Jiri Pirko, Jamal Hadi Salim

On 16-02-01 12:34 AM, Amir Vadai wrote:
> Hi,
> 
> So... just before sending that, I noted Jonh's series that
> deals with tc and u32. One notable difference between the 
> two approaches is that here we "normalize" the upper layer
> way of describing matching and actions into a generic structure
> (flow dissector, etc), which should allow to use offload different
> potential consumer tools (TC flower, TC u32 subset), netfilter, etc).

Except it's not really normalizing anything in this patchset,
right? For "real" normalizing I would expect the netdev to
advertise its parse graph and headers in a protocol-oblivious
way, along with the table setup, and this middle layer needs to
map the general software side onto the hardware side. I tried
this and came to the conclusion that I would just push rules
down to the hardware, at least for now, until I get enough
hardware implementations to see if there really is any advantage
in this sort of generic middle layer. My main concern is that it
is slow, and both table layout and hardware architecture try to
fight you when doing this. It can be done, I'm just not sure
it's worth it yet.

Also, just as an aside, flower can be emulated with u32, which
can be emulated with bpf; I don't think the structures here are
generic.

> Another difference is with this series uses the switchdev
> framework which would allow using the proposed HW offloading
> mechanisms for physical and SRIOV embedded switches too that
> make use of switchdev.

But the 'tc' infrastructure is useful even without SRIOV or any
switching at all. I don't think it needs to go into switchdev.
Even my vanilla 10G NIC can drop/mark packets coming onto the
physical functions.

> 
> This patchset introduces an infrastructure to offload matching of flows and
> some basic actions to hardware, currenrtly using iproute2 / tc tool.
> 
> In this patchset, the classification is described using the flower filter, and
> the supported actions are drop (using gact) and mark (using skbedit).
> 

Ditto, I just didn't show the mark patch set on my side. I would
also like to get pedit in shortly.

> Flow classifcation is described using a flow dissector that is built by 
> the tc filter. The filter also calls the actions to be serialized into the new
> structure - switchdev_obj_port_flow_act.
> 
> The flow dissector and the serialized actions are passed using switchdev ops to
> the HW driver, which parse it to hardware commands. We propose to use the
> kernel flow-dissector to describe flows/ACLs in the switchdev framework which
> by itself could be also used for HW offloading of other kernel networking
> components.

I'm not sure I like this, or at least I don't want to make this the
exclusive mechanism. I think bpf/u32 are more flexible. In general
I'm opposed to getting stuck talking about specific protocols; I want
this to be flexible so I don't need a new thing every time folks add
a new header/bit/field/etc. If you use the flow-dissector to describe
flows you're limiting the hardware. Also, I'm sure I'll want to match
on fields that the flow-dissector doesn't care about, and really never
should care about - think HTTP for example.

> 
> An implementation for the above is provided using mlx5 driver and Mellanox 
> ConnectX4 HW.
> 
> Some issues that will be addressed before making the final submission:
> 1. 'offload' should be a generic filter attribute and not flower filter
>    specific.

I'm not sure it's worth normalizing now. See how I created a code and
set of structures for each filter. Maybe some helper libraries would
be in order.

> 2. Serialization of actions will be changed into a list instead of one big
>    structure to describe all actions.
> 
> Few more matters to discuss 
> 
> 1. Should HW offloading be done only under explicit admin directive?

I took the approach of having one big bit I set per netdev to turn it
on and off. Then I have a flag, similar to your patch on cls_flower, to
turn it on/off per rule if I care to. I didn't send the per-rule patch
because I view it as an optimization.

But the case where it matters is mark on a NIC, where you don't really
need/want to match the same packet twice and mark it again. For a switch
it may not matter, because the host-bound traffic is the exception, not
the rule.
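
Roughly (just a sketch; NETIF_F_HW_TC and TCA_CLS_FLAGS_SKIP_HW are
illustrative names, nothing merged):

	/* the one big per-netdev bit, e.g. an ethtool feature flag */
	if (!(dev->features & NETIF_F_HW_TC))
		return -EOPNOTSUPP;	/* keep the rule in software only */

	/* the per-rule opt-out carried in the filter's flags attribute */
	if (flags & TCA_CLS_FLAGS_SKIP_HW)
		return 0;		/* don't push this rule to hardware */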

> 
> 2. switchdev is used today for physical switch HW and on an upcoming proposal
> for SRIOV e-switch vport representors too. Here, we're doing that with a NIC, 
> that can potentially serve as an uplink port for v-switch (e.g under Para-Virtual 
> scheme).

Sure, but remember that while switchdev may be relevant for SRIOV,
loading 'tc'-like rules into a NIC doesn't mean you need/want/care
about/support SRIOV. So I don't think we should use switchdev, or at
least I don't think it should be required. A bunch of helper functions
for switches may be useful in switchdev.

.John

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC net-next 2/9] net/switchdev: Introduce hardware offload support
  2016-02-01  8:34 ` [RFC net-next 2/9] net/switchdev: Introduce hardware offload support Amir Vadai
  2016-02-01  9:06   ` Jiri Pirko
@ 2016-02-01  9:26   ` John Fastabend
  1 sibling, 0 replies; 23+ messages in thread
From: John Fastabend @ 2016-02-01  9:26 UTC (permalink / raw)
  To: Amir Vadai, David S. Miller, netdev, John Fastabend
  Cc: Or Gerlitz, Hadar Har-Zion, Jiri Pirko, Jamal Hadi Salim, Scott Feldman

On 16-02-01 12:34 AM, Amir Vadai wrote:
> Extend the switchdev API with new operations: switchdev_port_flow_add()
> and switchdev_port_flow_del().
> It allows the user to add/del a hardware offloaded flow classification
> and actions.
> For every new flow object a cookie is supplied. This cookie will be
> used later on to identify the flow when removed.
> 
> In order to make the API as flexible as possible, flow_dissector is
> being used to describe the flow classifier.
> 
> Every new flow object is consists of a flow_dissector+key+mask to
> describe the classifier and a switchdev_obj_port_flow_act to describe
> the actions and their attributes.
> 
> object is passed to the lower layer driver to be pushed into the
> hardware.
> 
> Signed-off-by: Amir Vadai <amir@vadai.me>
> ---

+Scott

[...]

> +struct switchdev_obj_port_flow {
> +	struct switchdev_obj obj;
> +
> +	unsigned long cookie;
> +	struct flow_dissector *dissector; /* Dissector for mask and keys */
> +	void *mask; /* Flow keys mask */
> +	void *key;  /* Flow keys */

The heavy use of void * here and below seems questionable to me. If
you're going to consume flow keys/masks, define them.

> +	struct switchdev_obj_port_flow_act *actions;
> +};

[...]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC net-next 6/9] net/cls_flower: Introduce hardware offloading
  2016-02-01  8:34 ` [RFC net-next 6/9] net/cls_flower: Introduce hardware offloading Amir Vadai
@ 2016-02-01  9:31   ` John Fastabend
  2016-02-01  9:47     ` John Fastabend
  2016-02-01 10:43     ` Amir Vadai
  0 siblings, 2 replies; 23+ messages in thread
From: John Fastabend @ 2016-02-01  9:31 UTC (permalink / raw)
  To: Amir Vadai, David S. Miller, netdev, John Fastabend
  Cc: Or Gerlitz, Hadar Har-Zion, Jiri Pirko, Jamal Hadi Salim

On 16-02-01 12:34 AM, Amir Vadai wrote:
> During initialization, tcf_exts_offload_init() is called to initialize
> the list of actions description. later on, the classifier description
> is prepared and sent to the switchdev using switchdev_port_flow_add().
> 
> When offloaded, fl_classify() is a NOP - already done in hardware.
> 
> Signed-off-by: Amir Vadai <amir@vadai.me>
> ---

You need to account for where the classifier is being loaded by
passing the handle, as I did in my patch set. Otherwise you may be
offloading on egress/ingress, or even on some qdisc multiple layers
down in the hierarchy.
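
E.g. (sketch) with the handle passed down, the driver can bail out
unless it is attached where it expects, say the ingress qdisc:

	if (TC_H_MAJ(handle) != TC_H_MAJ(TC_H_INGRESS))
		return -EINVAL;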

.John

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC net-next 6/9] net/cls_flower: Introduce hardware offloading
  2016-02-01  9:31   ` John Fastabend
@ 2016-02-01  9:47     ` John Fastabend
  2016-02-01 10:43     ` Amir Vadai
  1 sibling, 0 replies; 23+ messages in thread
From: John Fastabend @ 2016-02-01  9:47 UTC (permalink / raw)
  To: Amir Vadai, David S. Miller, netdev, John Fastabend
  Cc: Or Gerlitz, Hadar Har-Zion, Jiri Pirko, Jamal Hadi Salim

On 16-02-01 01:31 AM, John Fastabend wrote:
> On 16-02-01 12:34 AM, Amir Vadai wrote:
>> During initialization, tcf_exts_offload_init() is called to initialize
>> the list of actions description. later on, the classifier description
>> is prepared and sent to the switchdev using switchdev_port_flow_add().
>>
>> When offloaded, fl_classify() is a NOP - already done in hardware.
>>
>> Signed-off-by: Amir Vadai <amir@vadai.me>
>> ---
> 
> You need to account for where the classifier is being loaded
> by passing the handle as I did in my patch set. Otherwise you may
> be offloading on egress/ingress or even some qdisc multiple layers
> down in the hierarchy.
> 
> .John
> 

Hi Amir,

I've read through the patches. Take a look at my set and see if you
can add this as another TC_SETUP_* command, namely TC_SETUP_FLOWER. The
switchdev bits could be handled the same way as fdb_add and other ndo
ops are handled today in rocker. I don't think your set of patches
would have to change much to merge them with my set. I'll take a stab
at it tomorrow and send out a v2. I think this would work, and then a
NIC can implement just the tc_setup ndo while your switchdev patches
remain unchanged.
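
Roughly what I have in mind (just a sketch - the exact struct/field
names may end up different):

	static int mlx5e_setup_tc(struct net_device *dev, u32 handle,
				  __be16 proto, struct tc_to_netdev *tc)
	{
		switch (tc->type) {
		case TC_SETUP_FLOWER:
			/* wraps the dissector/action parsing from this series */
			return mlx5e_flower_replace(dev, handle, tc->cls_flower);
		default:
			return -EINVAL;
		}
	}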

Thanks,
John

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC net-next 6/9] net/cls_flower: Introduce hardware offloading
  2016-02-01  9:31   ` John Fastabend
  2016-02-01  9:47     ` John Fastabend
@ 2016-02-01 10:43     ` Amir Vadai
  2016-02-01 21:25       ` John Fastabend
  1 sibling, 1 reply; 23+ messages in thread
From: Amir Vadai @ 2016-02-01 10:43 UTC (permalink / raw)
  To: John Fastabend
  Cc: David S. Miller, netdev, John Fastabend, Or Gerlitz,
	Hadar Har-Zion, Jiri Pirko, Jamal Hadi Salim

On Mon, Feb 01, 2016 at 01:31:17AM -0800, John Fastabend wrote:
> On 16-02-01 12:34 AM, Amir Vadai wrote:
> > During initialization, tcf_exts_offload_init() is called to initialize
> > the list of actions description. later on, the classifier description
> > is prepared and sent to the switchdev using switchdev_port_flow_add().
> > 
> > When offloaded, fl_classify() is a NOP - already done in hardware.
> > 
> > Signed-off-by: Amir Vadai <amir@vadai.me>
> > ---
> 
> You need to account for where the classifier is being loaded
> by passing the handle as I did in my patch set. Otherwise you may
> be offloading on egress/ingress or even some qdisc multiple layers
> down in the hierarchy.
Right. Will fix it.

> 
> .John
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC net-next 0/9] TC filter HW offloads
  2016-02-01  9:21 ` [RFC net-next 0/9] TC filter HW offloads John Fastabend
@ 2016-02-01 14:37   ` Amir Vadai
  2016-02-01 19:59     ` Tom Herbert
  2016-02-01 20:14     ` John Fastabend
  0 siblings, 2 replies; 23+ messages in thread
From: Amir Vadai @ 2016-02-01 14:37 UTC (permalink / raw)
  To: John Fastabend
  Cc: David S. Miller, netdev, John Fastabend, Or Gerlitz,
	Hadar Har-Zion, Jiri Pirko, Jamal Hadi Salim

On Mon, Feb 01, 2016 at 01:21:36AM -0800, John Fastabend wrote:
> On 16-02-01 12:34 AM, Amir Vadai wrote:
> > Hi,
> > 
> > So... just before sending that, I noted Jonh's series that
> > deals with tc and u32. One notable difference between the 
> > two approaches is that here we "normalize" the upper layer
> > way of describing matching and actions into a generic structure
> > (flow dissector, etc), which should allow to use offload different
> > potential consumer tools (TC flower, TC u32 subset), netfilter, etc).
> 
> Except its not really normalizing anything in this patchset
> right? For a "real" normalizing I would expect the netdev
> needs to advertise its parse graph and headers in a protocol
> oblivious way, along with the table setup and this middle
> layer needs to map the general software side onto the hardware
> side. I tried this and I came to the conclusion I would just
> push rules down at the hardware at least for now until I get
> enough hardware implementations to see if there really is any
> advantage in this sort of generic middle layer. My main concern
> is its slow and table layout, hardware architecture both try
> to fight you when doing this. It can be done I'm just not sure
> its worth it yet.
What I was trying to do is to find an extensible API to describe the
rules. And yes, like in your design, the device doesn't advertise its
capabilities, only whether it is capable of doing any offloading. The
consumer pushes the rules and the device returns success/fail.

Using the u32 filter is nice since it is a very universal classifier
(and you did implement parsing it in a very elegant way), but I'm not
sure I like having filter-specific code in device drivers. So, if
another consumer, for example the flower filter or netfilter, would
want to use this API, would it need to speak the u32 language, or have
its own implementation in the device driver?

> 
> Also just as an aside flower can be emulated with u32 which can
> be emulated with bpf, I don't think the structures here are
> generic.
This is why I used the flow dissector - because it is a very abstract
way to pass the classifications.
If it is not flexible enough, maybe we should split the current flow
dissector code into (1) a generic API to describe structures in an
abstract way (the offsets, bitmap, and structs), and (2) the code that
is used to dissect skb's. This way we could express stuff using (1)
that is not related to (2).
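
The descriptive half of (1) is more or less there already in
flow_dissector.h, something like:

	struct flow_dissector_key {
		enum flow_dissector_key_id key_id;
		size_t offset;	/* where struct flow_dissector_key_* sits
				 * in the consumer's key/mask blob
				 */
	};

so a consumer would only need used_keys plus the key/mask blobs,
without pulling in any of the skb parsing from (2).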

> 
> > Another difference is with this series uses the switchdev
> > framework which would allow using the proposed HW offloading
> > mechanisms for physical and SRIOV embedded switches too that
> > make use of switchdev.
> 
> But 'tc' infrastructure is useful even without SRIOV or any
> switching at all. I don't think it needs to go into switchdev.
> Even my vanilla 10G nic can drop/mark pkts coming onto the
> physical functions.
OK, we could work it out - as you suggested, in a similar way to what
fdb_add is doing.

> 
> > 
> > This patchset introduces an infrastructure to offload matching of flows and
> > some basic actions to hardware, currenrtly using iproute2 / tc tool.
> > 
> > In this patchset, the classification is described using the flower filter, and
> > the supported actions are drop (using gact) and mark (using skbedit).
> > 
> 
> ditto I just didn't show the mark patch set on my side. I also would
> like to get pedit shortly.
> 
> > Flow classifcation is described using a flow dissector that is built by 
> > the tc filter. The filter also calls the actions to be serialized into the new
> > structure - switchdev_obj_port_flow_act.
> > 
> > The flow dissector and the serialized actions are passed using switchdev ops to
> > the HW driver, which parse it to hardware commands. We propose to use the
> > kernel flow-dissector to describe flows/ACLs in the switchdev framework which
> > by itself could be also used for HW offloading of other kernel networking
> > components.
> 
> I'm not sure I like this or at least I don't want to make this the
> exclusive mechanism. I think bpf/u32 are more flexible. In general
> I'm opposed to getting stuck talking about specific protocols I want
> this to be flexible so I don't need a new thing everytime folks add
> a new header/bit/field/etc. If you use flow-dissector to describe
> flows your limiting the hardware. Also I'm sure I'll want to match on
> fields that flow-dissector doesn't care about and really never should
> care about think HTTP for example.
I agree that we need a flexible way to express the classifiers. I'm not
sure that I see it as a problem to have the API extended over the years,
as long as it is designed to be extensible.
What you actually suggest is to use u32 as such an API, or to make the
lower layer drivers support multiple APIs.
I will try to see how the code looks if the flower filter uses the u32
API.

> 
> > 
> > An implementation for the above is provided using mlx5 driver and Mellanox 
> > ConnectX4 HW.
> > 
> > Some issues that will be addressed before making the final submission:
> > 1. 'offload' should be a generic filter attribute and not flower filter
> >    specific.
> 
> I'm not sure its worth normalizing now. See how I created a code and
> set of structures for each filter. Maybe some helper libraries would
> be in order.
> 
> > 2. Serialization of actions will be changed into a list instead of one big
> >    structure to describe all actions.
> > 
> > Few more matters to discuss 
> > 
> > 1. Should HW offloading be done only under explicit admin directive?
> 
> I took the approach of having one big bit I set per netdev to turn it
> on and off. Then I have a flag similar to your patch on cls_flower to
> turn it on/off per rule if I care to. I didn't send the per rule patch
> because I view it as an optimization.
> 
> But the case where it matters is mark on a NIC where you don't really
> need/want to match the same packet twice and mark it again. For a switch
> it may not matter because the host bound traffic is the exception not
> the rule.
Yeah, if the CPU gets the packet, there is no need to process the
filter again in software.

> 
> > 
> > 2. switchdev is used today for physical switch HW and on an upcoming proposal
> > for SRIOV e-switch vport representors too. Here, we're doing that with a NIC, 
> > that can potentially serve as an uplink port for v-switch (e.g under Para-Virtual 
> > scheme).
> 
> Sure but remember where switchdev may be relevant for SRIOV loading
> 'tc' like rules into a NIC doesn't mean you need/want/care/support
> SRIOV. So I don't think we should use switchdev or at least I don't
> think it should be required. A bunch of helper functions for switches
> may be useful in switchdev.
ack

> 
> .John
> 
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC net-next 9/9] net/mlx5e: Flow steering support through switchdev
  2016-02-01  8:34 ` [RFC net-next 9/9] net/mlx5e: Flow steering support through switchdev Amir Vadai
@ 2016-02-01 18:52   ` Saeed Mahameed
  2016-02-01 21:45     ` Or Gerlitz
  0 siblings, 1 reply; 23+ messages in thread
From: Saeed Mahameed @ 2016-02-01 18:52 UTC (permalink / raw)
  To: Amir Vadai
  Cc: David S. Miller, netdev, John Fastabend, Or Gerlitz,
	Hadar Har-Zion, Jiri Pirko, Jamal Hadi Salim

On Mon, Feb 1, 2016 at 10:34 AM, Amir Vadai <amir@vadai.me> wrote:
> Parse switchdev flow object into device specific commands and program
> the hardware to classify and mark/drop the flow accordingly.
>
> A new Kconfig is introduced: MLX5_EN_SWITCHDEV. This config enables to
> compile the driver when switchdev is not compiled.
>
> Signed-off-by: Amir Vadai <amir@vadai.me>

Amir,

It is nice to see you contributing to the mlx5e driver from outside
the Mellanox borders :).

I have some small comments for now; I will review your code more
thoroughly later, as I am still not fully familiar with the net
switchdev mechanism.

So I hope that next time you will CC me on mlx5 ethernet patches.

> ---
>  drivers/net/ethernet/mellanox/mlx5/core/Kconfig    |   7 +
>  drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   3 +
>  drivers/net/ethernet/mellanox/mlx5/core/en.h       |  10 +
>  drivers/net/ethernet/mellanox/mlx5/core/en_fs.c    |  10 +-
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   2 +
>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    |   2 +
>  .../net/ethernet/mellanox/mlx5/core/en_switchdev.c | 475 +++++++++++++++++++++
>  .../net/ethernet/mellanox/mlx5/core/en_switchdev.h |  60 +++
>  8 files changed, 568 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_switchdev.c
>  create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_switchdev.h
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
> index c503ea0..61a9eed 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
> @@ -19,3 +19,10 @@ config MLX5_CORE_EN
>           Ethernet support in Mellanox Technologies ConnectX-4 NIC.
>           Ethernet and Infiniband support in ConnectX-4 are currently mutually
>           exclusive.
> +
> +config MLX5_EN_SWITCHDEV
> +       bool "MLX5 EN switchdev support"
> +       depends on MLX5_CORE_EN && NET_SWITCHDEV
> +       default y
> +       ---help---
> +         Switchdev support in Mellanox Technologies ConnectX-4 NIC.
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> index 01c0256..b80143e 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
> @@ -3,6 +3,9 @@ obj-$(CONFIG_MLX5_CORE)         += mlx5_core.o
>  mlx5_core-y := main.o cmd.o debugfs.o fw.o eq.o uar.o pagealloc.o \
>                 health.o mcg.o cq.o srq.o alloc.o qp.o port.o mr.o pd.o   \
>                 mad.o transobj.o vport.o sriov.o fs_cmd.o fs_core.o
> +
> +mlx5_core-$(CONFIG_MLX5_EN_SWITCHDEV) += en_switchdev.o
> +
>  mlx5_core-$(CONFIG_MLX5_CORE_EN) += wq.o eswitch.o \
>                 en_main.o en_fs.o en_ethtool.o en_tx.o en_rx.o \
>                 en_txrx.o en_clock.o
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> index 9ea49a8..e61a67c 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> @@ -39,6 +39,8 @@
>  #include <linux/mlx5/qp.h>
>  #include <linux/mlx5/cq.h>
>  #include <linux/mlx5/vport.h>
> +#include <linux/rhashtable.h>
> +#include <net/switchdev.h>
>  #include "wq.h"
>  #include "transobj.h"
>  #include "mlx5_core.h"
> @@ -497,8 +499,16 @@ struct mlx5e_flow_table {
>         struct mlx5_flow_group          **g;
>  };
>
> +struct mlx5e_offloads_flow_table {
> +       struct mlx5_flow_table          *t;
> +
> +       struct rhashtable_params        ht_params;
> +       struct rhashtable               ht;
> +};
> +

"offloads" is a very general name, you can move this internal
structure to en_switchdev.h and rename it to mlx5e_eswitchdev to serve
as a handle for
accessing mlx5e_switchdev via mlx5e_switchdev API you are suggesting.
Please see my comment on "en_swtichdev.h".

>  struct mlx5e_flow_tables {
>         struct mlx5_flow_namespace      *ns;
> +       struct mlx5e_offloads_flow_table      offloads;
This table is created from a very different namespace, which means it
has nothing in common with its current neighbors; please remove it
from here and consider the above comment.

>         struct mlx5e_flow_table         vlan;
>         struct mlx5e_flow_table         main;
>  };
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
> index 80d81ab..0fbe45c 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
> @@ -36,6 +36,7 @@
>  #include <linux/tcp.h>
>  #include <linux/mlx5/fs.h>
>  #include "en.h"
> +#include "en_switchdev.h"
>
>  #define MLX5_SET_CFG(p, f, v) MLX5_SET(create_flow_group_in, p, f, v)
>
> @@ -1202,12 +1203,18 @@ int mlx5e_create_flow_tables(struct mlx5e_priv *priv)
>         if (err)
>                 goto err_destroy_vlan_flow_table;
>
> -       err = mlx5e_add_vlan_rule(priv, MLX5E_VLAN_RULE_TYPE_UNTAGGED, 0);
> +       err = mlx5e_create_offloads_flow_table(priv);
>         if (err)
>                 goto err_destroy_main_flow_table;
>

mlx5e_create_offloads_flow_table is a very general name, and one can't
tell it is meant for switchdev flow tables.
Also, this is not the place for such a function, since there is no
relation between the mlx5e internal flow tables and the switchdev flow
tables.

I suggest the following for better self-containment and better
decoupling between mlx5e and the mlx5e_switchdev API you are creating.
The mlx5e netdevice shouldn't be aware of the internal data structures
or design of en_switchdev; the netdev should only activate/deactivate
switchdev upon open/close.
So you can rename mlx5e_create_offloads_flow_table to
mlx5e_switchdev_activate and call it in the open ndo just after or
before mlx5e_create_flow_table; it shouldn't matter which.

Also, in case switchdev activation fails, I suggest not failing the
driver load; printing a corresponding error message should be
sufficient.
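
i.e. something like (sketch, using the naming suggested above):

	err = mlx5e_switchdev_activate(priv);
	if (err)
		netdev_warn(priv->netdev,
			    "switchdev offloads disabled, err %d\n", err);
	/* keep going - the netdev works fine without the offloads */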

> +       err = mlx5e_add_vlan_rule(priv, MLX5E_VLAN_RULE_TYPE_UNTAGGED, 0);
> +       if (err)
> +               goto err_destroy_offloads_flow_table;
> +
>         return 0;
>
> +err_destroy_offloads_flow_table:
> +       mlx5e_destroy_offloads_flow_table(priv);
>  err_destroy_main_flow_table:
>         mlx5e_destroy_main_flow_table(priv);
>  err_destroy_vlan_flow_table:
> @@ -1219,6 +1226,7 @@ err_destroy_vlan_flow_table:
>  void mlx5e_destroy_flow_tables(struct mlx5e_priv *priv)
>  {
>         mlx5e_del_vlan_rule(priv, MLX5E_VLAN_RULE_TYPE_UNTAGGED, 0);
> +       mlx5e_destroy_offloads_flow_table(priv);
Same here.

>         mlx5e_destroy_main_flow_table(priv);
>         mlx5e_destroy_vlan_flow_table(priv);
>  }
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index 5c74a73..4bc9243 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -32,6 +32,7 @@
>
>  #include <linux/mlx5/fs.h>
>  #include "en.h"
> +#include "en_switchdev.h"
>  #include "eswitch.h"
>
>  struct mlx5e_rq_param {
> @@ -2178,6 +2179,7 @@ static void mlx5e_build_netdev(struct net_device *netdev)
>
>         netdev->priv_flags       |= IFF_UNICAST_FLT;
>
> +       mlx5e_switchdev_init(netdev);

If I am not mistaken, this is for OVS offloads?
In case it is, please consider using the vport_manager capability, or
any other device capability meant for this, to initialize and activate
mlx5e_switchdev.

After all, such an offload might not be supported on some devices,
e.g. a VF.
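
i.e. (sketch - assuming the existing vport_group_manager cap is the
right gate here):

	if (MLX5_CAP_GEN(priv->mdev, vport_group_manager))
		mlx5e_switchdev_init(netdev);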

>         mlx5e_set_netdev_dev_addr(netdev);
>  }
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index dd959d9..678d4e0 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -223,6 +223,8 @@ static inline void mlx5e_build_rx_skb(struct mlx5_cqe64 *cqe,
>         if (cqe_has_vlan(cqe))
>                 __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q),
>                                        be16_to_cpu(cqe->vlan_info));
> +
> +       skb->mark = be32_to_cpu(cqe->sop_drop_qpn) & 0x00ffffff;
>  }
>
>  int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_switchdev.c b/drivers/net/ethernet/mellanox/mlx5/core/en_switchdev.c
> new file mode 100644
> index 0000000..b88ead4
> --- /dev/null
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_switchdev.c
> @@ -0,0 +1,475 @@
> +/*
> + * Copyright (c) 2015, Mellanox Technologies. All rights reserved.
> + *
> + * This software is available to you under a choice of one of two
> + * licenses.  You may choose to be licensed under the terms of the GNU
> + * General Public License (GPL) Version 2, available from the file
> + * COPYING in the main directory of this source tree, or the
> + * OpenIB.org BSD license below:
> + *
> + *     Redistribution and use in source and binary forms, with or
> + *     without modification, are permitted provided that the following
> + *     conditions are met:
> + *
> + *      - Redistributions of source code must retain the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer.
> + *
> + *      - Redistributions in binary form must reproduce the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer in the documentation and/or other materials
> + *        provided with the distribution.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + */
> +
> +#include <net/switchdev.h>
> +#include <linux/mlx5/fs.h>
> +#include <linux/mlx5/device.h>
> +#include <linux/rhashtable.h>
> +#include "en.h"
> +#include "en_switchdev.h"
> +#include "eswitch.h"
> +
> +struct mlx5e_switchdev_flow {
> +       struct rhash_head       node;
> +       unsigned long           cookie;
> +       void                    *rule;
> +};
> +
> +static int prep_flow_attr(struct switchdev_obj_port_flow *f)
> +{
> +       struct switchdev_obj_port_flow_act *act = f->actions;
> +
> +       if (~(BIT(FLOW_DISSECTOR_KEY_CONTROL) |
> +             BIT(FLOW_DISSECTOR_KEY_BASIC) |
> +             BIT(FLOW_DISSECTOR_KEY_ETH_ADDRS) |
> +             BIT(FLOW_DISSECTOR_KEY_VLANID) |
> +             BIT(FLOW_DISSECTOR_KEY_IPV4_ADDRS) |
> +             BIT(FLOW_DISSECTOR_KEY_IPV6_ADDRS) |
> +             BIT(FLOW_DISSECTOR_KEY_PORTS) |
> +             BIT(FLOW_DISSECTOR_KEY_ETH_ADDRS)) & f->dissector->used_keys) {
> +               pr_warn("Unsupported key used: 0x%x\n",
> +                       f->dissector->used_keys);
> +               return -ENOTSUPP;
> +       }
> +
> +       if (~(BIT(SWITCHDEV_OBJ_PORT_FLOW_ACT_DROP) |
> +             BIT(SWITCHDEV_OBJ_PORT_FLOW_ACT_MARK)) & act->actions) {
> +               pr_warn("Unsupported action used: 0x%x\n", act->actions);
> +               return -ENOTSUPP;
> +       }
> +
> +       if (BIT(SWITCHDEV_OBJ_PORT_FLOW_ACT_MARK) & act->actions &&
> +           (act->mark & ~0xffff)) {
> +               pr_warn("Bad flow mark - only 16 bit is supported: 0x%x\n",
> +                       act->mark);
> +               return -EINVAL;
> +       }
> +
> +       return 0;
> +}
> +
> +static int parse_flow_attr(u32 *match_c, u32 *match_v,
> +                          u32 *action, u32 *flow_tag,
> +                          struct switchdev_obj_port_flow *f)
> +{
> +       void *outer_headers_c = MLX5_ADDR_OF(fte_match_param, match_c,
> +                                            outer_headers);
> +       void *outer_headers_v = MLX5_ADDR_OF(fte_match_param, match_v,
> +                                            outer_headers);
> +       struct switchdev_obj_port_flow_act *act = f->actions;
> +       u16 addr_type = 0;
> +       u8 ip_proto = 0;
> +
> +       if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_CONTROL)) {
> +               struct flow_dissector_key_control *key =
> +                       skb_flow_dissector_target(f->dissector,
> +                                                 FLOW_DISSECTOR_KEY_BASIC,
> +                                                 f->key);
> +               addr_type = key->addr_type;
> +       }
> +
> +       if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_BASIC)) {
> +               struct flow_dissector_key_basic *key =
> +                       skb_flow_dissector_target(f->dissector,
> +                                                 FLOW_DISSECTOR_KEY_BASIC,
> +                                                 f->key);
> +               struct flow_dissector_key_basic *mask =
> +                       skb_flow_dissector_target(f->dissector,
> +                                                 FLOW_DISSECTOR_KEY_BASIC,
> +                                                 f->mask);
> +               ip_proto = key->ip_proto;
> +
> +               MLX5_SET(fte_match_set_lyr_2_4, outer_headers_c, ethertype,
> +                        ntohs(mask->n_proto));
> +               MLX5_SET(fte_match_set_lyr_2_4, outer_headers_v, ethertype,
> +                        ntohs(key->n_proto));
> +
> +               MLX5_SET(fte_match_set_lyr_2_4, outer_headers_c, ip_protocol,
> +                        mask->ip_proto);
> +               MLX5_SET(fte_match_set_lyr_2_4, outer_headers_v, ip_protocol,
> +                        key->ip_proto);
> +       }
> +
> +       if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
> +               struct flow_dissector_key_eth_addrs *key =
> +                       skb_flow_dissector_target(f->dissector,
> +                                                 FLOW_DISSECTOR_KEY_ETH_ADDRS,
> +                                                 f->key);
> +               struct flow_dissector_key_eth_addrs *mask =
> +                       skb_flow_dissector_target(f->dissector,
> +                                                 FLOW_DISSECTOR_KEY_ETH_ADDRS,
> +                                                 f->mask);
> +
> +               ether_addr_copy(MLX5_ADDR_OF(fte_match_set_lyr_2_4,
> +                                            outer_headers_c, dmac_47_16),
> +                               mask->dst);
> +               ether_addr_copy(MLX5_ADDR_OF(fte_match_set_lyr_2_4,
> +                                            outer_headers_v, dmac_47_16),
> +                               key->dst);
> +
> +               ether_addr_copy(MLX5_ADDR_OF(fte_match_set_lyr_2_4,
> +                                            outer_headers_c, smac_47_16),
> +                               mask->src);
> +               ether_addr_copy(MLX5_ADDR_OF(fte_match_set_lyr_2_4,
> +                                            outer_headers_v, smac_47_16),
> +                               key->src);
> +       }
> +
> +       if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_VLANID)) {
> +               struct flow_dissector_key_tags *key =
> +                       skb_flow_dissector_target(f->dissector,
> +                                                 FLOW_DISSECTOR_KEY_VLANID,
> +                                                 f->key);
> +               struct flow_dissector_key_tags *mask =
> +                       skb_flow_dissector_target(f->dissector,
> +                                                 FLOW_DISSECTOR_KEY_VLANID,
> +                                                 f->mask);
> +               MLX5_SET(fte_match_set_lyr_2_4, outer_headers_c, vlan_tag, 1);
> +               MLX5_SET(fte_match_set_lyr_2_4, outer_headers_v, vlan_tag, 1);
> +
> +               MLX5_SET(fte_match_set_lyr_2_4, outer_headers_c, first_vid,
> +                        ntohs(mask->vlan_id));
> +               MLX5_SET(fte_match_set_lyr_2_4, outer_headers_v, first_vid,
> +                        ntohs(key->vlan_id));
> +
> +               MLX5_SET(fte_match_set_lyr_2_4, outer_headers_c, first_cfi,
> +                        ntohs(mask->flow_label));
> +               MLX5_SET(fte_match_set_lyr_2_4, outer_headers_v, first_cfi,
> +                        ntohs(key->flow_label));
> +
> +               MLX5_SET(fte_match_set_lyr_2_4, outer_headers_c, first_prio,
> +                        ntohs(mask->flow_label) >> 1);
> +               MLX5_SET(fte_match_set_lyr_2_4, outer_headers_v, first_prio,
> +                        ntohs(key->flow_label) >> 1);
> +       }
> +
> +       if (addr_type == FLOW_DISSECTOR_KEY_IPV4_ADDRS) {
> +               struct flow_dissector_key_ipv4_addrs *key =
> +                       skb_flow_dissector_target(f->dissector,
> +                                                 FLOW_DISSECTOR_KEY_IPV4_ADDRS,
> +                                                 f->key);
> +               struct flow_dissector_key_ipv4_addrs *mask =
> +                       skb_flow_dissector_target(f->dissector,
> +                                                 FLOW_DISSECTOR_KEY_IPV4_ADDRS,
> +                                                 f->mask);
> +
> +               memcpy(MLX5_ADDR_OF(fte_match_set_lyr_2_4, outer_headers_c,
> +                                   src_ipv4_src_ipv6.ipv4_layout.ipv4),
> +                      &mask->src, sizeof(mask->src));
> +               memcpy(MLX5_ADDR_OF(fte_match_set_lyr_2_4, outer_headers_v,
> +                                   src_ipv4_src_ipv6.ipv4_layout.ipv4),
> +                      &key->src, sizeof(key->src));
> +               memcpy(MLX5_ADDR_OF(fte_match_set_lyr_2_4, outer_headers_c,
> +                                   dst_ipv4_dst_ipv6.ipv4_layout.ipv4),
> +                      &mask->dst, sizeof(mask->dst));
> +               memcpy(MLX5_ADDR_OF(fte_match_set_lyr_2_4, outer_headers_v,
> +                                   dst_ipv4_dst_ipv6.ipv4_layout.ipv4),
> +                      &key->dst, sizeof(key->dst));
> +       }
> +
> +       if (addr_type == FLOW_DISSECTOR_KEY_IPV6_ADDRS) {
> +               struct flow_dissector_key_ipv6_addrs *key =
> +                       skb_flow_dissector_target(f->dissector,
> +                                                 FLOW_DISSECTOR_KEY_IPV6_ADDRS,
> +                                                 f->key);
> +               struct flow_dissector_key_ipv6_addrs *mask =
> +                       skb_flow_dissector_target(f->dissector,
> +                                                 FLOW_DISSECTOR_KEY_IPV6_ADDRS,
> +                                                 f->mask);
> +
> +               memcpy(MLX5_ADDR_OF(fte_match_set_lyr_2_4, outer_headers_c,
> +                                   src_ipv4_src_ipv6.ipv6_layout.ipv6),
> +                      &mask->src, sizeof(mask->src));
> +               memcpy(MLX5_ADDR_OF(fte_match_set_lyr_2_4, outer_headers_v,
> +                                   src_ipv4_src_ipv6.ipv6_layout.ipv6),
> +                      &key->src, sizeof(key->src));
> +
> +               memcpy(MLX5_ADDR_OF(fte_match_set_lyr_2_4, outer_headers_c,
> +                                   dst_ipv4_dst_ipv6.ipv6_layout.ipv6),
> +                      &mask->dst, sizeof(mask->dst));
> +               memcpy(MLX5_ADDR_OF(fte_match_set_lyr_2_4, outer_headers_v,
> +                                   dst_ipv4_dst_ipv6.ipv6_layout.ipv6),
> +                      &key->dst, sizeof(key->dst));
> +       }
> +
> +       if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_PORTS)) {
> +               struct flow_dissector_key_ports *key =
> +                       skb_flow_dissector_target(f->dissector,
> +                                                 FLOW_DISSECTOR_KEY_PORTS,
> +                                                 f->key);
> +               struct flow_dissector_key_ports *mask =
> +                       skb_flow_dissector_target(f->dissector,
> +                                                 FLOW_DISSECTOR_KEY_PORTS,
> +                                                 f->mask);
> +               switch (ip_proto) {
> +               case IPPROTO_TCP:
> +                       MLX5_SET(fte_match_set_lyr_2_4, outer_headers_c,
> +                                tcp_sport, ntohs(mask->src));
> +                       MLX5_SET(fte_match_set_lyr_2_4, outer_headers_v,
> +                                tcp_sport, ntohs(key->src));
> +
> +                       MLX5_SET(fte_match_set_lyr_2_4, outer_headers_c,
> +                                tcp_dport, ntohs(mask->dst));
> +                       MLX5_SET(fte_match_set_lyr_2_4, outer_headers_v,
> +                                tcp_dport, ntohs(key->dst));
> +                       break;
> +
> +               case IPPROTO_UDP:
> +                       MLX5_SET(fte_match_set_lyr_2_4, outer_headers_c,
> +                                udp_sport, ntohs(mask->src));
> +                       MLX5_SET(fte_match_set_lyr_2_4, outer_headers_v,
> +                                udp_sport, ntohs(key->src));
> +
> +                       MLX5_SET(fte_match_set_lyr_2_4, outer_headers_c,
> +                                udp_dport, ntohs(mask->dst));
> +                       MLX5_SET(fte_match_set_lyr_2_4, outer_headers_v,
> +                                udp_dport, ntohs(key->dst));
> +                       break;
> +               default:
> +                       pr_err("Only UDP and TCP transport are supported\n");
> +                       return -EINVAL;
> +               }
> +       }
> +
> +       /* Actions: */
> +       if (BIT(SWITCHDEV_OBJ_PORT_FLOW_ACT_MARK) & act->actions) {
> +               *flow_tag = act->mark;
> +               *action |= MLX5_FLOW_CONTEXT_ACTION_FWD_DEST;
> +       }
> +
> +       if (BIT(SWITCHDEV_OBJ_PORT_FLOW_ACT_DROP) & act->actions)
> +               *action |= MLX5_FLOW_CONTEXT_ACTION_DROP;
> +
> +       return 0;
> +}
> +
> +#define MLX5E_TC_FLOW_TABLE_NUM_ENTRIES 10
> +#define MLX5E_TC_FLOW_TABLE_NUM_GROUPS 10
> +int mlx5e_create_offloads_flow_table(struct mlx5e_priv *priv)
> +{
> +       struct mlx5_flow_namespace *ns;
> +
> +       ns = mlx5_get_flow_namespace(priv->mdev,
> +                                    MLX5_FLOW_NAMESPACE_OFFLOADS);
> +       if (!ns)
> +               return -EINVAL;
> +
> +       priv->fts.offloads.t = mlx5_create_auto_grouped_flow_table(ns, 0,
> +                                           MLX5E_TC_FLOW_TABLE_NUM_ENTRIES,
> +                                           MLX5E_TC_FLOW_TABLE_NUM_GROUPS);
> +       if (IS_ERR(priv->fts.offloads.t))
> +               return PTR_ERR(priv->fts.offloads.t);
> +
> +       return 0;
> +}
> +
> +void mlx5e_destroy_offloads_flow_table(struct mlx5e_priv *priv)
> +{
> +       mlx5_destroy_flow_table(priv->fts.offloads.t);
> +       priv->fts.offloads.t = NULL;
> +}
> +
> +static u8 generate_match_criteria_enable(u32 *match_c)
> +{
> +       u8 match_criteria_enable = 0;
> +       void *outer_headers_c = MLX5_ADDR_OF(fte_match_param, match_c,
> +                                             outer_headers);
> +       void *inner_headers_c = MLX5_ADDR_OF(fte_match_param, match_c,
> +                                             inner_headers);
> +       void *misc_c = MLX5_ADDR_OF(fte_match_param, match_c,
> +                                    misc_parameters);
> +       size_t header_size = MLX5_ST_SZ_BYTES(fte_match_set_lyr_2_4);
> +       size_t misc_size = MLX5_ST_SZ_BYTES(fte_match_set_misc);
> +
> +       if (memchr_inv(outer_headers_c, 0, header_size))
> +               match_criteria_enable |= MLX5_MATCH_OUTER_HEADERS;
> +       if (memchr_inv(misc_c, 0, misc_size))
> +               match_criteria_enable |= MLX5_MATCH_MISC_PARAMETERS;
> +       if (memchr_inv(inner_headers_c, 0, header_size))
> +               match_criteria_enable |= MLX5_MATCH_INNER_HEADERS;
> +
> +       return match_criteria_enable;
> +}
> +
> +static int mlx5e_offloads_flow_add(struct net_device *netdev,
> +                                  struct switchdev_obj_port_flow *f)
> +{
> +       struct mlx5e_priv *priv = netdev_priv(netdev);
> +       struct mlx5e_offloads_flow_table *offloads = &priv->fts.offloads;
> +       struct mlx5_flow_table *ft = offloads->t;
> +       u8 match_criteria_enable;
> +       u32 *match_c;
> +       u32 *match_v;
> +       int err = 0;
> +       u32 flow_tag = MLX5_FS_DEFAULT_FLOW_TAG;
> +       u32 action = 0;
> +       struct mlx5e_switchdev_flow *flow;
> +
> +       match_c = kzalloc(MLX5_ST_SZ_BYTES(fte_match_param), GFP_KERNEL);
> +       match_v = kzalloc(MLX5_ST_SZ_BYTES(fte_match_param), GFP_KERNEL);
> +       if (!match_c || !match_v) {
> +               err = -ENOMEM;
> +               goto free;
> +       }
> +
> +       flow = kzalloc(sizeof(*flow), GFP_KERNEL);
> +       if (!flow) {
> +               err = -ENOMEM;
> +               goto free;
> +       }
> +       flow->cookie = f->cookie;
> +
> +       err = parse_flow_attr(match_c, match_v, &action, &flow_tag, f);
> +       if (err < 0)
> +               goto free;
> +
> +       /* Outer header support only */
> +       match_criteria_enable = generate_match_criteria_enable(match_c);
> +
> +       flow->rule = mlx5_add_flow_rule(ft, match_criteria_enable,
> +                                       match_c, match_v,
> +                                       action, flow_tag, NULL);
> +       if (IS_ERR(flow->rule)) {
> +               kfree(flow);
> +               err = PTR_ERR(flow->rule);
> +               goto free;
> +       }
> +
> +       err = rhashtable_insert_fast(&offloads->ht, &flow->node,
> +                                    offloads->ht_params);
> +       if (err) {
> +               mlx5_del_flow_rule(flow->rule);
> +               kfree(flow);
> +       }
> +
> +free:
> +       kfree(match_c);
> +       kfree(match_v);
> +       return err;
> +}
> +
> +static int mlx5e_offloads_flow_del(struct net_device *netdev,
> +                                  struct switchdev_obj_port_flow *f)
> +{
> +       struct mlx5e_priv *priv = netdev_priv(netdev);
> +       struct mlx5e_switchdev_flow *flow;
> +       struct mlx5e_offloads_flow_table *offloads = &priv->fts.offloads;
> +
> +       flow = rhashtable_lookup_fast(&offloads->ht, &f->cookie,
> +                                     offloads->ht_params);
> +       if (!flow) {
> +               pr_err("Can't find requested flow");
> +               return -EINVAL;
> +       }
> +
> +       mlx5_del_flow_rule(flow->rule);
> +
> +       rhashtable_remove_fast(&offloads->ht, &flow->node, offloads->ht_params);
> +       kfree(flow);
> +
> +       return 0;
> +}
> +
> +static int mlx5e_port_obj_add(struct net_device *dev,
> +                             const struct switchdev_obj *obj,
> +                             struct switchdev_trans *trans)
> +{
> +       int err = 0;
> +
> +       if (trans->ph_prepare) {
> +               switch (obj->id) {
> +               case SWITCHDEV_OBJ_ID_PORT_FLOW:
> +                       err = prep_flow_attr(SWITCHDEV_OBJ_PORT_FLOW(obj));
> +                       break;
> +               default:
> +                       err = -EOPNOTSUPP;
> +                       break;
> +               }
> +
> +               return err;
> +       }
> +
> +       switch (obj->id) {
> +       case SWITCHDEV_OBJ_ID_PORT_FLOW:
> +               err = mlx5e_offloads_flow_add(dev,
> +                                             SWITCHDEV_OBJ_PORT_FLOW(obj));
> +               break;
> +       default:
> +               err = -EOPNOTSUPP;
> +               break;
> +       }
> +
> +       return err;
> +}
> +
> +static int mlx5e_port_obj_del(struct net_device *dev,
> +                             const struct switchdev_obj *obj)
> +{
> +       int err = 0;
> +
> +       switch (obj->id) {
> +       case SWITCHDEV_OBJ_ID_PORT_FLOW:
> +               err = mlx5e_offloads_flow_del(dev,
> +                                             SWITCHDEV_OBJ_PORT_FLOW(obj));
> +               break;
> +       default:
> +               err = -EOPNOTSUPP;
> +               break;
> +       }
> +
> +       return err;
> +}
> +
> +const struct switchdev_ops mlx5e_switchdev_ops = {
> +       .switchdev_port_obj_add = mlx5e_port_obj_add,
> +       .switchdev_port_obj_del = mlx5e_port_obj_del,
> +};
> +
> +static const struct rhashtable_params mlx5e_switchdev_flow_ht_params = {
> +       .head_offset = offsetof(struct mlx5e_switchdev_flow, node),
> +       .key_offset = offsetof(struct mlx5e_switchdev_flow, cookie),
> +       .key_len = sizeof(unsigned long),
> +       .hashfn = jhash,
> +       .automatic_shrinking = true,
> +};
> +
> +void mlx5e_switchdev_init(struct net_device *netdev)
> +{
> +       struct mlx5e_priv *priv = netdev_priv(netdev);
> +       struct mlx5e_offloads_flow_table *offloads = &priv->fts.offloads;
> +
> +       netdev->switchdev_ops = &mlx5e_switchdev_ops;
> +
> +       offloads->ht_params = mlx5e_switchdev_flow_ht_params;
> +       rhashtable_init(&offloads->ht, &offloads->ht_params);
> +}
> +
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_switchdev.h b/drivers/net/ethernet/mellanox/mlx5/core/en_switchdev.h
> new file mode 100644
> index 0000000..8f4e3a3
> --- /dev/null
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_switchdev.h
> @@ -0,0 +1,60 @@
> +/*
> + * Copyright (c) 2016, Mellanox Technologies. All rights reserved.
> + *
> + * This software is available to you under a choice of one of two
> + * licenses.  You may choose to be licensed under the terms of the GNU
> + * General Public License (GPL) Version 2, available from the file
> + * COPYING in the main directory of this source tree, or the
> + * OpenIB.org BSD license below:
> + *
> + *     Redistribution and use in source and binary forms, with or
> + *     without modification, are permitted provided that the following
> + *     conditions are met:
> + *
> + *      - Redistributions of source code must retain the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer.
> + *
> + *      - Redistributions in binary form must reproduce the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer in the documentation and/or other materials
> + *        provided with the distribution.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + */
> +
> +#ifndef __MLX5_EN_SWITCHDEV__H__
> +#define __MLX5_EN_SWITCHDEV__H__
> +
> +#ifdef CONFIG_MLX5_EN_SWITCHDEV
> +
> +extern const struct switchdev_ops mlx5e_switchdev_ops;
> +
> +void mlx5e_destroy_offloads_flow_table(struct mlx5e_priv *priv);
> +int mlx5e_create_offloads_flow_table(struct mlx5e_priv *priv);
> +void mlx5e_switchdev_init(struct net_device *dev);

consider the following API:

/* mlx5e switchdev handle */
struct mlx5e_switchdev {
   ...
};

struct mlx5e_switchdev *mlx5e_switchdev_init(struct net_device *dev);
int mlx5e_switchdev_activate(struct mlx5e_switchdev *switchdev);
void mlx5e_switchdev_deactivate(struct mlx5e_switchdev *switchdev);
void mlx5e_switchdev_cleanup(struct mlx5e_switchdev *switchdev);
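
For illustration, a rough sketch of how a driver might consume such a
handle-based API from its netdev create/destroy paths. The attach/detach
helpers, the priv->switchdev field and the error convention are all
assumptions, not part of this patchset:

/* sketch only -- helper names and priv->switchdev are hypothetical */
static int mlx5e_attach_switchdev(struct mlx5e_priv *priv)
{
        struct mlx5e_switchdev *sw;

        sw = mlx5e_switchdev_init(priv->netdev);
        if (IS_ERR(sw))
                return PTR_ERR(sw);

        priv->switchdev = sw;
        return mlx5e_switchdev_activate(sw);
}

static void mlx5e_detach_switchdev(struct mlx5e_priv *priv)
{
        mlx5e_switchdev_deactivate(priv->switchdev);
        mlx5e_switchdev_cleanup(priv->switchdev);
        priv->switchdev = NULL;
}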

> +
> +#else
> +static inline void mlx5e_destroy_offloads_flow_table(struct mlx5e_priv *priv)
> +{
> +}
> +
> +static inline int mlx5e_create_offloads_flow_table(struct mlx5e_priv *priv)
> +{
> +       return 0;
> +}
> +
> +static inline void mlx5e_switchdev_init(struct net_device *dev)
> +{
> +}
> +#endif
> +
> +#endif /* __MLX5_EN_SWITCHDEV__H__ */
> +
> --
> 2.7.0
>


* Re: [RFC net-next 0/9] TC filter HW offloads
  2016-02-01 14:37   ` Amir Vadai
@ 2016-02-01 19:59     ` Tom Herbert
  2016-02-01 20:14     ` John Fastabend
  1 sibling, 0 replies; 23+ messages in thread
From: Tom Herbert @ 2016-02-01 19:59 UTC (permalink / raw)
  To: Amir Vadai
  Cc: John Fastabend, David S. Miller, Linux Kernel Network Developers,
	John Fastabend, Or Gerlitz, Hadar Har-Zion, Jiri Pirko,
	Jamal Hadi Salim

On Mon, Feb 1, 2016 at 6:37 AM, Amir Vadai <amir@vadai.me> wrote:
> On Mon, Feb 01, 2016 at 01:21:36AM -0800, John Fastabend wrote:
>> On 16-02-01 12:34 AM, Amir Vadai wrote:
>> > Hi,
>> >
>> > So... just before sending that, I noted Jonh's series that
>> > deals with tc and u32. One notable difference between the
>> > two approaches is that here we "normalize" the upper layer
>> > way of describing matching and actions into a generic structure
>> > (flow dissector, etc), which should allow to use offload different
>> > potential consumer tools (TC flower, TC u32 subset), netfilter, etc).
>>
>> Except its not really normalizing anything in this patchset
>> right? For a "real" normalizing I would expect the netdev
>> needs to advertise its parse graph and headers in a protocol
>> oblivious way, along with the table setup and this middle
>> layer needs to map the general software side onto the hardware
>> side. I tried this and I came to the conclusion I would just
>> push rules down at the hardware at least for now until I get
>> enough hardware implementations to see if there really is any
>> advantage in this sort of generic middle layer. My main concern
>> is its slow and table layout, hardware architecture both try
>> to fight you when doing this. It can be done I'm just not sure
>> its worth it yet.
> What I was trying to do, is to find an extensible api to describe the
> rules. And yes, like in your design, the device doesn't advertise its
> capabilities, only if it is capable to do any offloading. The consumer
> pushes the rules and the device return success/fail.
>
> Using u32 filter is nice since it is a very universal classifier (and
> you did implement parsing it in a very elegant way), but I'm not sure I
> like having in device drivers a specific code for different filters. So,
> if another consumer, for example the flower filter or netfilter, would
> want to use this api, it will need to speak the u32 language, or have
> its own implementation in the device driver?
>
>>
>> Also just as an aside flower can be emulated with u32 which can
>> be emulated with bpf, I don't think the structures here are
>> generic.
> This is why I used flow dissector - because it is a very abstract way to
> pass the classifications.
> If it is not flexible enough, maybe splitting the current flow
> dissector code, into (1) a generic api to describe structures in an
> abstract way (the offsets, bitmap, and structs), and (2) the code that
> is used to dissect skb's. This way we could express stuff using (1) that
> is not related to (2).
>
The flow dissector structure is not meant to be a generic interface and
is limited in that regard. For instance, it can only give one set of IP
addresses for a flow, even when encapsulation is used and there are
multiple sets of addresses. This is why we are interested in something
like BPF for programming HW: it can describe arbitrary packet formats,
as opposed to structure-based interfaces that restrict expressibility.
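
To make the limitation concrete, the dissector carries a single
addresses key per dissected flow; roughly, from
include/net/flow_dissector.h of this era (exact layout may differ):

/* one set of addresses per flow -- the outer and inner addresses of a
 * tunneled packet cannot both be expressed here
 */
struct flow_dissector_key_ipv4_addrs {
        __be32 src;
        __be32 dst;
};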

Also, the code dealing with the flow_dissector is pretty verbose and
isn't driver-specific, I would think. It might be better to get some of
that into a common library.

Tom

>>
>> > Another difference is with this series uses the switchdev
>> > framework which would allow using the proposed HW offloading
>> > mechanisms for physical and SRIOV embedded switches too that
>> > make use of switchdev.
>>
>> But 'tc' infrastructure is useful even without SRIOV or any
>> switching at all. I don't think it needs to go into switchdev.
>> Even my vanilla 10G nic can drop/mark pkts coming onto the
>> physical functions.
> ok, we could work it out - as you suggested in a similar way fdb_add is
> doing.
>
>>
>> >
>> > This patchset introduces an infrastructure to offload matching of flows and
>> > some basic actions to hardware, currenrtly using iproute2 / tc tool.
>> >
>> > In this patchset, the classification is described using the flower filter, and
>> > the supported actions are drop (using gact) and mark (using skbedit).
>> >
>>
>> ditto I just didn't show the mark patch set on my side. I also would
>> like to get pedit shortly.
>>
>> > Flow classifcation is described using a flow dissector that is built by
>> > the tc filter. The filter also calls the actions to be serialized into the new
>> > structure - switchdev_obj_port_flow_act.
>> >
>> > The flow dissector and the serialized actions are passed using switchdev ops to
>> > the HW driver, which parse it to hardware commands. We propose to use the
>> > kernel flow-dissector to describe flows/ACLs in the switchdev framework which
>> > by itself could be also used for HW offloading of other kernel networking
>> > components.
>>
>> I'm not sure I like this or at least I don't want to make this the
>> exclusive mechanism. I think bpf/u32 are more flexible. In general
>> I'm opposed to getting stuck talking about specific protocols I want
>> this to be flexible so I don't need a new thing everytime folks add
>> a new header/bit/field/etc. If you use flow-dissector to describe
>> flows your limiting the hardware. Also I'm sure I'll want to match on
>> fields that flow-dissector doesn't care about and really never should
>> care about think HTTP for example.
> I agree that we need a flexible way to express the classifiers. I'm not
> sure that I see it as a problem to have the api extended over the years,
> as long as it is designed to be extensible.
> What you actually suggest is to use u32 as such an api, or make lower
> layer driver support multiple api's.
> I will try to see how does the code looks if using the u32 api from the
> flower filter.
>
>>
>> >
>> > An implementation for the above is provided using mlx5 driver and Mellanox
>> > ConnectX4 HW.
>> >
>> > Some issues that will be addressed before making the final submission:
>> > 1. 'offload' should be a generic filter attribute and not flower filter
>> >    specific.
>>
>> I'm not sure its worth normalizing now. See how I created a code and
>> set of structures for each filter. Maybe some helper libraries would
>> be in order.
>>
>> > 2. Serialization of actions will be changed into a list instead of one big
>> >    structure to describe all actions.
>> >
>> > Few more matters to discuss
>> >
>> > 1. Should HW offloading be done only under explicit admin directive?
>>
>> I took the approach of having one big bit I set per netdev to turn it
>> on and off. Then I have a flag similar to your patch on cls_flower to
>> turn it on/off per rule if I care to. I didn't send the per rule patch
>> because I view it as an optimization.
>>
>> But the case where it matters is mark on a NIC where you don't really
>> need/want to match the same packet twice and mark it again. For a switch
>> it may not matter because the host bound traffic is the exception not
>> the rule.
> Yeh, if the cpu gets the packet, there is no need to process the
> filter again in software.
>
>>
>> >
>> > 2. switchdev is used today for physical switch HW and on an upcoming proposal
>> > for SRIOV e-switch vport representors too. Here, we're doing that with a NIC,
>> > that can potentially serve as an uplink port for v-switch (e.g under Para-Virtual
>> > scheme).
>>
>> Sure but remember where switchdev may be relevant for SRIOV loading
>> 'tc' like rules into a NIC doesn't mean you need/want/care/support
>> SRIOV. So I don't think we should use switchdev or at least I don't
>> think it should be required. A bunch of helper functions for switches
>> may be useful in switchdev.
> ack
>
>>
>> .John
>>
>>


* Re: [RFC net-next 0/9] TC filter HW offloads
  2016-02-01 14:37   ` Amir Vadai
  2016-02-01 19:59     ` Tom Herbert
@ 2016-02-01 20:14     ` John Fastabend
  1 sibling, 0 replies; 23+ messages in thread
From: John Fastabend @ 2016-02-01 20:14 UTC (permalink / raw)
  To: Amir Vadai
  Cc: David S. Miller, netdev, John Fastabend, Or Gerlitz,
	Hadar Har-Zion, Jiri Pirko, Jamal Hadi Salim

On 16-02-01 06:37 AM, Amir Vadai wrote:
> On Mon, Feb 01, 2016 at 01:21:36AM -0800, John Fastabend wrote:
>> On 16-02-01 12:34 AM, Amir Vadai wrote:
>>> Hi,
>>>
>>> So... just before sending that, I noted Jonh's series that
>>> deals with tc and u32. One notable difference between the 
>>> two approaches is that here we "normalize" the upper layer
>>> way of describing matching and actions into a generic structure
>>> (flow dissector, etc), which should allow to use offload different
>>> potential consumer tools (TC flower, TC u32 subset), netfilter, etc).
>>
>> Except its not really normalizing anything in this patchset
>> right? For a "real" normalizing I would expect the netdev
>> needs to advertise its parse graph and headers in a protocol
>> oblivious way, along with the table setup and this middle
>> layer needs to map the general software side onto the hardware
>> side. I tried this and I came to the conclusion I would just
>> push rules down at the hardware at least for now until I get
>> enough hardware implementations to see if there really is any
>> advantage in this sort of generic middle layer. My main concern
>> is its slow and table layout, hardware architecture both try
>> to fight you when doing this. It can be done I'm just not sure
>> its worth it yet.
> What I was trying to do, is to find an extensible api to describe the
> rules. And yes, like in your design, the device doesn't advertise its
> capabilities, only if it is capable to do any offloading. The consumer
> pushes the rules and the device return success/fail.
> 
> Using u32 filter is nice since it is a very universal classifier (and
> you did implement parsing it in a very elegant way), but I'm not sure I
> like having in device drivers a specific code for different filters. So,
> if another consumer, for example the flower filter or netfilter, would
> want to use this api, it will need to speak the u32 language, or have
> its own implementation in the device driver?
> 

At the moment it seems easier to write an implementation for each
case. Sure, I can write flower filters in the u32 language, but I can
just as easily build a flower jump table.

To do the general abstract solution right, I think you need to read
the parse graph from the hardware, identify the nodes where the keys
match up, and pass those down using a general flow API, something with
a signature like:

  flow_add(unsigned node_ids[], u8 keys[], u8 values[], u8 masks[]);
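
Purely as an illustration of that signature (every identifier below is
an invented placeholder for whatever the hardware's advertised parse
graph exposes), a match on IPv4 dst address plus TCP dst port might be
passed down as:

  /* illustrative only: node/key ids are hypothetical */
  unsigned node_ids[] = { HW_PARSE_NODE_IPV4, HW_PARSE_NODE_TCP };
  u8 keys[]   = { HW_KEY_IPV4_DST, HW_KEY_TCP_DPORT };
  u8 values[] = { 11, 11, 11, 11, 0x00, 0x50 };   /* 11.11.11.11, port 80 */
  u8 masks[]  = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };
  int err = flow_add(node_ids, keys, values, masks);

How element counts and per-key value lengths would be conveyed is left
open by the sketch above.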

I was sort of taking the lazy approach and planning to wait and see
what falls out and how other driver writers implement this and then
consolidate if needed later. At least I get the concrete cases covered
while the abstraction is cleaned up.

FWIW I think the implementation in ixgbe_model.h can be made to work
with flower fairly easily with some decorating of the flower enums onto
the u32 language.

>>
>> Also just as an aside flower can be emulated with u32 which can
>> be emulated with bpf, I don't think the structures here are
>> generic.
> This is why I used flow dissector - because it is a very abstract way to
> pass the classifications.
> If it is not flexible enough, maybe splitting the current flow
> dissector code, into (1) a generic api to describe structures in an
> abstract way (the offsets, bitmap, and structs), and (2) the code that
> is used to dissect skb's. This way we could express stuff using (1) that
> is not related to (2).

Hmm, for the long term I think this is great, but I wouldn't want to
tie it too closely to the implementation in my series today. I think
they can evolve independently for the time being.

> 
>>
>>> Another difference is with this series uses the switchdev
>>> framework which would allow using the proposed HW offloading
>>> mechanisms for physical and SRIOV embedded switches too that
>>> make use of switchdev.
>>
>> But 'tc' infrastructure is useful even without SRIOV or any
>> switching at all. I don't think it needs to go into switchdev.
>> Even my vanilla 10G nic can drop/mark pkts coming onto the
>> physical functions.
> ok, we could work it out - as you suggested in a similar way fdb_add is
> doing.

I think I'll work this out now and send it out.

[...]

>>> The flow dissector and the serialized actions are passed using switchdev ops to
>>> the HW driver, which parse it to hardware commands. We propose to use the
>>> kernel flow-dissector to describe flows/ACLs in the switchdev framework which
>>> by itself could be also used for HW offloading of other kernel networking
>>> components.
>>
>> I'm not sure I like this or at least I don't want to make this the
>> exclusive mechanism. I think bpf/u32 are more flexible. In general
>> I'm opposed to getting stuck talking about specific protocols I want
>> this to be flexible so I don't need a new thing everytime folks add
>> a new header/bit/field/etc. If you use flow-dissector to describe
>> flows your limiting the hardware. Also I'm sure I'll want to match on
>> fields that flow-dissector doesn't care about and really never should
>> care about think HTTP for example.
> I agree that we need a flexible way to express the classifiers. I'm not
> sure that I see it as a problem to have the api extended over the years,
> as long as it is designed to be extensible.
> What you actually suggest is to use u32 as such an api, or make lower
> layer driver support multiple api's.
> I will try to see how does the code looks if using the u32 api from the
> flower filter.

Same comment as above: I'm leaning towards multiple APIs at the moment
and consolidating as we go forward. The nice piece is that none of this
is UAPI-visible, so we can rework it as needed.


Thanks,
John


* Re: [RFC net-next 6/9] net/cls_flower: Introduce hardware offloading
  2016-02-01 10:43     ` Amir Vadai
@ 2016-02-01 21:25       ` John Fastabend
  0 siblings, 0 replies; 23+ messages in thread
From: John Fastabend @ 2016-02-01 21:25 UTC (permalink / raw)
  To: Amir Vadai
  Cc: David S. Miller, netdev, John Fastabend, Or Gerlitz,
	Hadar Har-Zion, Jiri Pirko, Jamal Hadi Salim

On 16-02-01 02:43 AM, Amir Vadai wrote:
> On Mon, Feb 01, 2016 at 01:31:17AM -0800, John Fastabend wrote:
>> On 16-02-01 12:34 AM, Amir Vadai wrote:
>>> During initialization, tcf_exts_offload_init() is called to initialize
>>> the list of actions description. later on, the classifier description
>>> is prepared and sent to the switchdev using switchdev_port_flow_add().
>>>
>>> When offloaded, fl_classify() is a NOP - already done in hardware.
>>>
>>> Signed-off-by: Amir Vadai <amir@vadai.me>
>>> ---
>>
>> You need to account for where the classifier is being loaded
>> by passing the handle as I did in my patch set. Otherwise you may
>> be offloading on egress/ingress or even some qdisc multiple layers
>> down in the hierarchy.
> Right. Will fix it.

Also, it seems you missed fl_delete(); this will be called from
commands like 'tc filter delete ...'.
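
For reference, a rough sketch of where such a removal hook could sit,
assuming cls_flower's fl_delete() of this era and the RFC's
switchdev_port_flow_del(); the f->offloaded flag and the use of the
filter pointer as the cookie are placeholders for however the filter
ends up recording that it was pushed to HW:

static int fl_delete(struct tcf_proto *tp, unsigned long arg)
{
        struct cls_fl_head *head = rtnl_dereference(tp->root);
        struct cls_fl_filter *f = (struct cls_fl_filter *) arg;

        /* hypothetical: tear down the HW rule before freeing the filter */
        if (f->offloaded)
                switchdev_port_flow_del(tp->q->dev_queue->dev,
                                        (unsigned long) f);

        rhashtable_remove_fast(&head->ht, &f->ht_node, head->ht_params);
        list_del_rcu(&f->list);
        tcf_unbind_filter(tp, &f->res);
        call_rcu(&f->rcu, fl_destroy_filter);
        return 0;
}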

> 
>>
>> .John
>>


* Re: [RFC net-next 9/9] net/mlx5e: Flow steering support through switchdev
  2016-02-01 18:52   ` Saeed Mahameed
@ 2016-02-01 21:45     ` Or Gerlitz
  0 siblings, 0 replies; 23+ messages in thread
From: Or Gerlitz @ 2016-02-01 21:45 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Amir Vadai, David S. Miller, Linux Netdev List, John Fastabend,
	Or Gerlitz, Hadar Har-Zion, Jiri Pirko, Jamal Hadi Salim

On Mon, Feb 1, 2016 at 8:52 PM, Saeed Mahameed
<saeedm@dev.mellanox.co.il> wrote:

[...]

> in case it is, please consider using the vport_manager capability or
> any other device capability meant for this,
> to initialize and activate mlx5e_switchdev.
>
> After all such offload might not be supported in some devices. e.g VF.

This offloading should work on VFs too; we're offloading 12-tuple
classification into the HW, mapped to flow-tags and drops.


end of thread

Thread overview: 23+ messages
2016-02-01  8:34 [RFC net-next 0/9] TC filter HW offloads Amir Vadai
2016-02-01  8:34 ` [RFC net-next 1/9] net/flow_dissector: Make dissector_uses_key() and skb_flow_dissector_target() public Amir Vadai
2016-02-01  8:34 ` [RFC net-next 2/9] net/switchdev: Introduce hardware offload support Amir Vadai
2016-02-01  9:06   ` Jiri Pirko
2016-02-01  9:11     ` amirva
2016-02-01  9:26   ` John Fastabend
2016-02-01  8:34 ` [RFC net-next 3/9] net/act: Offload support by tc actions Amir Vadai
2016-02-01  8:34 ` [RFC net-next 4/9] net/act_skbedit: Introduce hardware offload support Amir Vadai
2016-02-01  8:34 ` [RFC net-next 5/9] net/act_gact: Introduce hardware offload support for drop Amir Vadai
2016-02-01  8:34 ` [RFC net-next 6/9] net/cls_flower: Introduce hardware offloading Amir Vadai
2016-02-01  9:31   ` John Fastabend
2016-02-01  9:47     ` John Fastabend
2016-02-01 10:43     ` Amir Vadai
2016-02-01 21:25       ` John Fastabend
2016-02-01  8:34 ` [RFC net-next 7/9] net/mlx5_core: Go to next flow table support Amir Vadai
2016-02-01  8:34 ` [RFC net-next 8/9] net/mlx5e: Introduce MLX5_FLOW_NAMESPACE_OFFLOADS Amir Vadai
2016-02-01  8:34 ` [RFC net-next 9/9] net/mlx5e: Flow steering support through switchdev Amir Vadai
2016-02-01 18:52   ` Saeed Mahameed
2016-02-01 21:45     ` Or Gerlitz
2016-02-01  9:21 ` [RFC net-next 0/9] TC filter HW offloads John Fastabend
2016-02-01 14:37   ` Amir Vadai
2016-02-01 19:59     ` Tom Herbert
2016-02-01 20:14     ` John Fastabend
