* [PATCH net-next 00/18] mlx5 RoCE/RDMA packet sniffer
@ 2016-06-17 14:43 Saeed Mahameed
From: Saeed Mahameed @ 2016-06-17 14:43 UTC
  To: David S. Miller
  Cc: netdev, Doug Ledford, Or Gerlitz, Maor Gottlieb, Huy Nguyen,
	Tal Alon, Saeed Mahameed

Hi Dave,

This patch set introduces the mlx5 RoCE/RDMA packet sniffer.  It allows
the mlx5e netdevice to receive, for sniffing and diagnostic purposes,
RoCE/RDMA or RAW ETH traffic that is normally not passed to the kernel
stack.  Such traffic must still not go through full network stack
processing; it should reach only the non-protocol-specific handlers
(ptype_all), e.g. tcpdump.

To achieve this, when the RoCE/RDMA sniffer is enabled, every RoCE/RDMA
steering rule that forwards to a user space QP is duplicated, marked
with the "OFFLOAD" flow tag, and forwarded to the mlx5e netdevice
receive path.  The mlx5e receive path detects sniffer packets by
looking at the flow tag in the receive completion; when it matches the
"OFFLOAD" tag, skb->pkt_type is set to PACKET_OFFLOAD_KERNEL so that
the packet goes only to the non-protocol-specific handlers (ptype_all).
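
The receive-path test is conceptually simple; a minimal sketch follows
(mlx5e_get_cqe_flow_tag() and MLX5_FS_OFFLOAD_FLOW_TAG are illustrative
placeholder names, not necessarily the identifiers used by the patches):

    /* Sketch: mark a sniffer skb based on the completion's flow tag */
    static inline void mlx5e_mark_sniffer_skb(struct mlx5_cqe64 *cqe,
                                              struct sk_buff *skb)
    {
            /* placeholder for reading the flow tag out of the CQE */
            if (mlx5e_get_cqe_flow_tag(cqe) == MLX5_FS_OFFLOAD_FLOW_TAG)
                    skb->pkt_type = PACKET_OFFLOAD_KERNEL;
    }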

To duplicate specific steering rules, a new notification mechanism is
added.  It allows a consumer to request add/del rule notifications on
specific steering namespaces.  On registration, the consumer is
notified of all existing rules; afterwards, notifications are generated
asynchronously as rules are added or deleted.
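
Registration looks roughly like this (mlx5_register_rule_notifier() and
the event data are introduced in patch 05; the sniffer_* names here are
illustrative):

    static int sniffer_rule_event(struct notifier_block *nb,
                                  unsigned long event, void *ctx);

    static struct notifier_block sniffer_nb = {
            .notifier_call = sniffer_rule_event,
    };

    static int sniffer_listen(struct mlx5_flow_namespace *ns)
    {
            /* fires MLX5_RULE_EVENT_ADD for every existing rule, then
             * delivers add/del events as rules change
             */
            return mlx5_register_rule_notifier(ns, &sniffer_nb);
    }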

To host the duplicated rules, a new steering namespace is introduced
(SNIFFER_NAMESPACE).  It holds all the duplicated steering rules that
forward traffic to the mlx5e netdevice receive path.

RoCE sniffer module:
The RoCE sniffer module registers for add/del rule notifications on the
(RoCE/RDMA) user space traffic steering namespaces (a sketch of the add
path follows this list):
    - On rule add, it generates an identical rule and injects it into
      the SNIFFER_NAMESPACE flow table with flow tag = "OFFLOAD" and
      destination = "mlx5e netdevice".
    - On rule delete, it removes the corresponding duplicated sniffer
      rule.
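
A sketch of the duplication step; sniffer_rule_match() stands in for the
"get flow rule match" API added later in the series, and the OFFLOAD tag
and netdevice destination are placeholders (struct mlx5_flow_attr and
MLX5_RULE_ATTR come from patch 01):

    static struct mlx5_flow_rule *
    sniffer_dup_rule(struct mlx5_flow_table *sniffer_ft,
                     struct mlx5_flow_rule *rule,
                     struct mlx5_flow_destination *netdev_dest)
    {
            struct mlx5_flow_match *match = sniffer_rule_match(rule);
            struct mlx5_flow_attr attr;

            /* same match as the original rule, but tagged OFFLOAD and
             * pointed at the mlx5e netdevice receive path
             */
            MLX5_RULE_ATTR(attr, match->match_criteria_enable,
                           match->match_criteria, match->match_value,
                           MLX5_FLOW_CONTEXT_ACTION_FWD_DEST,
                           MLX5_FS_OFFLOAD_FLOW_TAG, netdev_dest);
            return mlx5_add_flow_rule(sniffer_ft, &attr);
    }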

Thanks,
Saeed.

Huy Nguyen (3):
  net/mlx5e: Set sniffer skbs packet type to offload kernel
  net/mlx5e: Sniffer support for kernel offload (RoCE) traffic
  net/mlx5e: Add netdev hw feature flag offload-sniffer

Maor Gottlieb (15):
  net/mlx5: Refactor mlx5_add_flow_rule
  net/mlx5: Introduce mlx5_flow_steering structure
  net/mlx5: Properly remove all steering objects
  net/mlx5: Add hold/put rules refcount API
  net/mlx5: Add support to add/del flow rule notifiers
  net/mlx5: Introduce table of function pointer steering commands
  net/mlx5: Introduce nop steering commands
  if_ether.h: Add RoCE Ethertype
  IB/mlx5: Create RoCE root namespace
  net/mlx5: Introduce get flow rule match API
  net/mlx5: Add sniffer namespaces
  IB/mlx5: Add kernel offload flow-tag
  net: Add offload kernel net stack packet type
  net/mlx5: Introduce sniffer steering hardware capabilities
  net/mlx5e: Lock device state in set features

 drivers/infiniband/hw/mlx4/qp.c                    |   6 +-
 drivers/infiniband/hw/mlx5/main.c                  | 143 +++-
 drivers/infiniband/hw/mlx5/mlx5_ib.h               |  15 +-
 drivers/infiniband/hw/ocrdma/ocrdma_ah.c           |   4 +-
 drivers/infiniband/hw/ocrdma/ocrdma_hw.c           |   2 +-
 drivers/infiniband/hw/ocrdma/ocrdma_sli.h          |   4 -
 drivers/infiniband/hw/usnic/usnic_common_pkt_hdr.h |   1 -
 drivers/infiniband/hw/usnic/usnic_fwd.h            |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   3 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |  10 +
 drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c  |  26 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_fs.c    |  28 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  44 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    |   4 +
 .../net/ethernet/mellanox/mlx5/core/en_sniffer.c   | 574 ++++++++++++++++
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c    |   8 +-
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c  |  51 +-
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c   | 161 ++++-
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h   |  71 +-
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c  | 746 ++++++++++++++-------
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.h  |  31 +
 include/linux/mlx5/device.h                        |  17 +
 include/linux/mlx5/driver.h                        |   6 +-
 include/linux/mlx5/fs.h                            |  62 +-
 include/linux/netdev_features.h                    |   2 +
 include/linux/skbuff.h                             |   6 +-
 include/uapi/linux/if_ether.h                      |   1 +
 include/uapi/linux/if_packet.h                     |   1 +
 net/core/dev.c                                     |   4 +
 net/core/ethtool.c                                 |   1 +
 30 files changed, 1626 insertions(+), 408 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_sniffer.c

-- 
2.8.0


* [PATCH net-next 01/18] net/mlx5: Refactor mlx5_add_flow_rule
From: Saeed Mahameed @ 2016-06-17 14:43 UTC
  To: David S. Miller
  Cc: netdev, Doug Ledford, Or Gerlitz, Maor Gottlieb, Huy Nguyen,
	Tal Alon, Saeed Mahameed

From: Maor Gottlieb <maorg@mellanox.com>

Reduce the set of arguments passed to mlx5_add_flow_rule() to one by
introducing struct mlx5_flow_attr.
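
For reference, a converted call site (taken from the mlx5_ib hunk below)
now fills a single attribute struct:

    MLX5_RULE_ATTR(flow_rule_attr, match_criteria_enable, match_c,
                   match_v, action, MLX5_FS_DEFAULT_FLOW_TAG, dst);
    handler->rule = mlx5_add_flow_rule(ft, &flow_rule_attr);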

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/infiniband/hw/mlx5/main.c                 | 10 ++---
 drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c | 26 ++++++------
 drivers/net/ethernet/mellanox/mlx5/core/en_fs.c   | 28 +++++++------
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c   |  8 ++--
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c | 51 +++++++++--------------
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 46 +++++++++-----------
 include/linux/mlx5/fs.h                           | 29 ++++++++++---
 7 files changed, 98 insertions(+), 100 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index b48ad85..573952b 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -1528,6 +1528,7 @@ static struct mlx5_ib_flow_handler *create_flow_rule(struct mlx5_ib_dev *dev,
 {
 	struct mlx5_flow_table	*ft = ft_prio->flow_table;
 	struct mlx5_ib_flow_handler *handler;
+	struct mlx5_flow_attr flow_rule_attr;
 	void *ib_flow = flow_attr + 1;
 	u8 match_criteria_enable = 0;
 	unsigned int spec_index;
@@ -1561,11 +1562,10 @@ static struct mlx5_ib_flow_handler *create_flow_rule(struct mlx5_ib_dev *dev,
 	match_criteria_enable = (!outer_header_zero(match_c)) << 0;
 	action = dst ? MLX5_FLOW_CONTEXT_ACTION_FWD_DEST :
 		MLX5_FLOW_CONTEXT_ACTION_FWD_NEXT_PRIO;
-	handler->rule = mlx5_add_flow_rule(ft, match_criteria_enable,
-					   match_c, match_v,
-					   action,
-					   MLX5_FS_DEFAULT_FLOW_TAG,
-					   dst);
+
+	MLX5_RULE_ATTR(flow_rule_attr, match_criteria_enable, match_c,
+		       match_v, action, MLX5_FS_DEFAULT_FLOW_TAG, dst);
+	handler->rule = mlx5_add_flow_rule(ft, &flow_rule_attr);
 
 	if (IS_ERR(handler->rule)) {
 		err = PTR_ERR(handler->rule);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c b/drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c
index 3515e78..f126043 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c
@@ -174,15 +174,15 @@ static int arfs_add_default_rule(struct mlx5e_priv *priv,
 				 enum arfs_type type)
 {
 	struct arfs_table *arfs_t = &priv->fs.arfs.arfs_tables[type];
+	struct mlx5_flow_attr flow_attr;
 	struct mlx5_flow_destination dest;
-	u8 match_criteria_enable = 0;
 	u32 *tirn = priv->indir_tirn;
 	u32 *match_criteria;
 	u32 *match_value;
 	int err = 0;
 
 	match_value	= mlx5_vzalloc(MLX5_ST_SZ_BYTES(fte_match_param));
-	match_criteria	= mlx5_vzalloc(MLX5_ST_SZ_BYTES(fte_match_param));
+	match_criteria = mlx5_vzalloc(MLX5_ST_SZ_BYTES(fte_match_param));
 	if (!match_value || !match_criteria) {
 		netdev_err(priv->netdev, "%s: alloc failed\n", __func__);
 		err = -ENOMEM;
@@ -208,11 +208,10 @@ static int arfs_add_default_rule(struct mlx5e_priv *priv,
 		goto out;
 	}
 
-	arfs_t->default_rule = mlx5_add_flow_rule(arfs_t->ft.t, match_criteria_enable,
-						  match_criteria, match_value,
-						  MLX5_FLOW_CONTEXT_ACTION_FWD_DEST,
-						  MLX5_FS_DEFAULT_FLOW_TAG,
-						  &dest);
+	MLX5_RULE_ATTR(flow_attr, 0, match_criteria,
+		       match_value, MLX5_FLOW_CONTEXT_ACTION_FWD_DEST,
+		       MLX5_FS_DEFAULT_FLOW_TAG, &dest);
+	arfs_t->default_rule = mlx5_add_flow_rule(arfs_t->ft.t, &flow_attr);
 	if (IS_ERR(arfs_t->default_rule)) {
 		err = PTR_ERR(arfs_t->default_rule);
 		arfs_t->default_rule = NULL;
@@ -474,21 +473,20 @@ static struct mlx5_flow_rule *arfs_add_rule(struct mlx5e_priv *priv,
 	struct arfs_tuple *tuple = &arfs_rule->tuple;
 	struct mlx5_flow_rule *rule = NULL;
 	struct mlx5_flow_destination dest;
+	struct mlx5_flow_attr flow_attr;
 	struct arfs_table *arfs_table;
-	u8 match_criteria_enable = 0;
 	struct mlx5_flow_table *ft;
 	u32 *match_criteria;
 	u32 *match_value;
 	int err = 0;
 
 	match_value	= mlx5_vzalloc(MLX5_ST_SZ_BYTES(fte_match_param));
-	match_criteria	= mlx5_vzalloc(MLX5_ST_SZ_BYTES(fte_match_param));
+	match_criteria = mlx5_vzalloc(MLX5_ST_SZ_BYTES(fte_match_param));
 	if (!match_value || !match_criteria) {
 		netdev_err(priv->netdev, "%s: alloc failed\n", __func__);
 		err = -ENOMEM;
 		goto out;
 	}
-	match_criteria_enable = MLX5_MATCH_OUTER_HEADERS;
 	MLX5_SET_TO_ONES(fte_match_param, match_criteria,
 			 outer_headers.ethertype);
 	MLX5_SET(fte_match_param, match_value, outer_headers.ethertype,
@@ -552,10 +550,10 @@ static struct mlx5_flow_rule *arfs_add_rule(struct mlx5e_priv *priv,
 	}
 	dest.type = MLX5_FLOW_DESTINATION_TYPE_TIR;
 	dest.tir_num = priv->direct_tir[arfs_rule->rxq].tirn;
-	rule = mlx5_add_flow_rule(ft, match_criteria_enable, match_criteria,
-				  match_value, MLX5_FLOW_CONTEXT_ACTION_FWD_DEST,
-				  MLX5_FS_DEFAULT_FLOW_TAG,
-				  &dest);
+	MLX5_RULE_ATTR(flow_attr, MLX5_MATCH_OUTER_HEADERS, match_criteria,
+		       match_value, MLX5_FLOW_CONTEXT_ACTION_FWD_DEST,
+		       MLX5_FS_DEFAULT_FLOW_TAG, &dest);
+	rule = mlx5_add_flow_rule(ft, &flow_attr);
 	if (IS_ERR(rule)) {
 		err = PTR_ERR(rule);
 		netdev_err(priv->netdev, "%s: add rule(filter id=%d, rq idx=%d) failed, err=%d\n",
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
index b327400..95e359f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
@@ -160,6 +160,7 @@ static int __mlx5e_add_vlan_rule(struct mlx5e_priv *priv,
 {
 	struct mlx5_flow_table *ft = priv->fs.vlan.ft.t;
 	struct mlx5_flow_destination dest;
+	struct mlx5_flow_attr flow_attr;
 	u8 match_criteria_enable = 0;
 	struct mlx5_flow_rule **rule_p;
 	int err = 0;
@@ -186,10 +187,10 @@ static int __mlx5e_add_vlan_rule(struct mlx5e_priv *priv,
 		break;
 	}
 
-	*rule_p = mlx5_add_flow_rule(ft, match_criteria_enable, mc, mv,
-				     MLX5_FLOW_CONTEXT_ACTION_FWD_DEST,
-				     MLX5_FS_DEFAULT_FLOW_TAG,
-				     &dest);
+	MLX5_RULE_ATTR(flow_attr, match_criteria_enable, mc, mv,
+		       MLX5_FLOW_CONTEXT_ACTION_FWD_DEST,
+		       MLX5_FS_DEFAULT_FLOW_TAG, &dest);
+	*rule_p = mlx5_add_flow_rule(ft, &flow_attr);
 
 	if (IS_ERR(*rule_p)) {
 		err = PTR_ERR(*rule_p);
@@ -597,6 +598,7 @@ static struct mlx5_flow_rule *mlx5e_generate_ttc_rule(struct mlx5e_priv *priv,
 						      u16 etype,
 						      u8 proto)
 {
+	struct mlx5_flow_attr flow_attr;
 	struct mlx5_flow_rule *rule;
 	u8 match_criteria_enable = 0;
 	u32 *match_criteria;
@@ -622,11 +624,10 @@ static struct mlx5_flow_rule *mlx5e_generate_ttc_rule(struct mlx5e_priv *priv,
 		MLX5_SET(fte_match_param, match_value, outer_headers.ethertype, etype);
 	}
 
-	rule = mlx5_add_flow_rule(ft, match_criteria_enable,
-				  match_criteria, match_value,
-				  MLX5_FLOW_CONTEXT_ACTION_FWD_DEST,
-				  MLX5_FS_DEFAULT_FLOW_TAG,
-				  dest);
+	MLX5_RULE_ATTR(flow_attr, match_criteria_enable, match_criteria,
+		       match_value, MLX5_FLOW_CONTEXT_ACTION_FWD_DEST,
+		       MLX5_FS_DEFAULT_FLOW_TAG, dest);
+	rule = mlx5_add_flow_rule(ft, &flow_attr);
 	if (IS_ERR(rule)) {
 		err = PTR_ERR(rule);
 		netdev_err(priv->netdev, "%s: add rule failed\n", __func__);
@@ -792,6 +793,7 @@ static int mlx5e_add_l2_flow_rule(struct mlx5e_priv *priv,
 {
 	struct mlx5_flow_table *ft = priv->fs.l2.ft.t;
 	struct mlx5_flow_destination dest;
+	struct mlx5_flow_attr flow_attr;
 	u8 match_criteria_enable = 0;
 	u32 *match_criteria;
 	u32 *match_value;
@@ -832,10 +834,10 @@ static int mlx5e_add_l2_flow_rule(struct mlx5e_priv *priv,
 		break;
 	}
 
-	ai->rule = mlx5_add_flow_rule(ft, match_criteria_enable, match_criteria,
-				      match_value,
-				      MLX5_FLOW_CONTEXT_ACTION_FWD_DEST,
-				      MLX5_FS_DEFAULT_FLOW_TAG, &dest);
+	MLX5_RULE_ATTR(flow_attr, match_criteria_enable, match_criteria,
+		       match_value, MLX5_FLOW_CONTEXT_ACTION_FWD_DEST,
+		       MLX5_FS_DEFAULT_FLOW_TAG, &dest);
+	ai->rule = mlx5_add_flow_rule(ft, &flow_attr);
 	if (IS_ERR(ai->rule)) {
 		netdev_err(priv->netdev, "%s: add l2 rule(mac:%pM) failed\n",
 			   __func__, mv_dmac);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 704c3d3..0b634c6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -55,6 +55,7 @@ static struct mlx5_flow_rule *mlx5e_tc_add_flow(struct mlx5e_priv *priv,
 {
 	struct mlx5_core_dev *dev = priv->mdev;
 	struct mlx5_flow_destination dest = { 0 };
+	struct mlx5_flow_attr flow_attr;
 	struct mlx5_fc *counter = NULL;
 	struct mlx5_flow_rule *rule;
 	bool table_created = false;
@@ -88,10 +89,9 @@ static struct mlx5_flow_rule *mlx5e_tc_add_flow(struct mlx5e_priv *priv,
 		table_created = true;
 	}
 
-	rule = mlx5_add_flow_rule(priv->fs.tc.t, MLX5_MATCH_OUTER_HEADERS,
-				  match_c, match_v,
-				  action, flow_tag,
-				  &dest);
+	MLX5_RULE_ATTR(flow_attr, MLX5_MATCH_OUTER_HEADERS, match_c,
+		       match_v, action, flow_tag, &dest);
+	rule = mlx5_add_flow_rule(priv->fs.tc.t, &flow_attr);
 
 	if (IS_ERR(rule))
 		goto err_add_rule;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index aebbd6c..b8b17b5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -337,6 +337,7 @@ __esw_fdb_set_vport_rule(struct mlx5_eswitch *esw, u32 vport, bool rx_rule,
 			    MLX5_MATCH_OUTER_HEADERS);
 	struct mlx5_flow_rule *flow_rule = NULL;
 	struct mlx5_flow_destination dest;
+	struct mlx5_flow_attr flow_attr;
 	void *mv_misc = NULL;
 	void *mc_misc = NULL;
 	u8 *dmac_v = NULL;
@@ -376,13 +377,10 @@ __esw_fdb_set_vport_rule(struct mlx5_eswitch *esw, u32 vport, bool rx_rule,
 	esw_debug(esw->dev,
 		  "\tFDB add rule dmac_v(%pM) dmac_c(%pM) -> vport(%d)\n",
 		  dmac_v, dmac_c, vport);
-	flow_rule =
-		mlx5_add_flow_rule(esw->fdb_table.fdb,
-				   match_header,
-				   match_c,
-				   match_v,
-				   MLX5_FLOW_CONTEXT_ACTION_FWD_DEST,
-				   0, &dest);
+	MLX5_RULE_ATTR(flow_attr, match_header, match_c, match_v,
+		       MLX5_FLOW_CONTEXT_ACTION_FWD_DEST,
+		       0, &dest);
+	flow_rule = mlx5_add_flow_rule(esw->fdb_table.fdb, &flow_attr);
 	if (IS_ERR(flow_rule)) {
 		pr_warn(
 			"FDB: Failed to add flow rule: dmac_v(%pM) dmac_c(%pM) -> vport(%d), err(%ld)\n",
@@ -1300,6 +1298,7 @@ static void esw_vport_disable_ingress_acl(struct mlx5_eswitch *esw,
 static int esw_vport_ingress_config(struct mlx5_eswitch *esw,
 				    struct mlx5_vport *vport)
 {
+	struct mlx5_flow_attr flow_attr;
 	u8 smac[ETH_ALEN];
 	u32 *match_v;
 	u32 *match_c;
@@ -1357,13 +1356,11 @@ static int esw_vport_ingress_config(struct mlx5_eswitch *esw,
 		ether_addr_copy(smac_v, smac);
 	}
 
+	MLX5_RULE_ATTR(flow_attr, MLX5_MATCH_OUTER_HEADERS, match_c, match_v,
+		       MLX5_FLOW_CONTEXT_ACTION_FWD_DEST,
+		       0, NULL);
 	vport->ingress.allow_rule =
-		mlx5_add_flow_rule(vport->ingress.acl,
-				   MLX5_MATCH_OUTER_HEADERS,
-				   match_c,
-				   match_v,
-				   MLX5_FLOW_CONTEXT_ACTION_ALLOW,
-				   0, NULL);
+		mlx5_add_flow_rule(vport->ingress.acl, &flow_attr);
 	if (IS_ERR(vport->ingress.allow_rule)) {
 		err = PTR_ERR(vport->ingress.allow_rule);
 		pr_warn("vport[%d] configure ingress allow rule, err(%d)\n",
@@ -1374,13 +1371,10 @@ static int esw_vport_ingress_config(struct mlx5_eswitch *esw,
 
 	memset(match_c, 0, MLX5_ST_SZ_BYTES(fte_match_param));
 	memset(match_v, 0, MLX5_ST_SZ_BYTES(fte_match_param));
+	flow_attr.flow_match.match_criteria_enable = 0;
+	flow_attr.action = MLX5_FLOW_CONTEXT_ACTION_DROP;
 	vport->ingress.drop_rule =
-		mlx5_add_flow_rule(vport->ingress.acl,
-				   0,
-				   match_c,
-				   match_v,
-				   MLX5_FLOW_CONTEXT_ACTION_DROP,
-				   0, NULL);
+		mlx5_add_flow_rule(vport->ingress.acl, &flow_attr);
 	if (IS_ERR(vport->ingress.drop_rule)) {
 		err = PTR_ERR(vport->ingress.drop_rule);
 		pr_warn("vport[%d] configure ingress drop rule, err(%d)\n",
@@ -1401,6 +1395,7 @@ out:
 static int esw_vport_egress_config(struct mlx5_eswitch *esw,
 				   struct mlx5_vport *vport)
 {
+	struct mlx5_flow_attr flow_attr;
 	u32 *match_v;
 	u32 *match_c;
 	int err = 0;
@@ -1433,13 +1428,11 @@ static int esw_vport_egress_config(struct mlx5_eswitch *esw,
 	MLX5_SET_TO_ONES(fte_match_param, match_c, outer_headers.first_vid);
 	MLX5_SET(fte_match_param, match_v, outer_headers.first_vid, vport->vlan);
 
+	MLX5_RULE_ATTR(flow_attr, MLX5_MATCH_OUTER_HEADERS, match_c, match_v,
+		       MLX5_FLOW_CONTEXT_ACTION_ALLOW,
+		       0, NULL);
 	vport->egress.allowed_vlan =
-		mlx5_add_flow_rule(vport->egress.acl,
-				   MLX5_MATCH_OUTER_HEADERS,
-				   match_c,
-				   match_v,
-				   MLX5_FLOW_CONTEXT_ACTION_ALLOW,
-				   0, NULL);
+		mlx5_add_flow_rule(vport->egress.acl, &flow_attr);
 	if (IS_ERR(vport->egress.allowed_vlan)) {
 		err = PTR_ERR(vport->egress.allowed_vlan);
 		pr_warn("vport[%d] configure egress allowed vlan rule failed, err(%d)\n",
@@ -1451,13 +1444,9 @@ static int esw_vport_egress_config(struct mlx5_eswitch *esw,
 	/* Drop others rule (star rule) */
 	memset(match_c, 0, MLX5_ST_SZ_BYTES(fte_match_param));
 	memset(match_v, 0, MLX5_ST_SZ_BYTES(fte_match_param));
+	flow_attr.flow_match.match_criteria_enable = 0;
 	vport->egress.drop_rule =
-		mlx5_add_flow_rule(vport->egress.acl,
-				   0,
-				   match_c,
-				   match_v,
-				   MLX5_FLOW_CONTEXT_ACTION_DROP,
-				   0, NULL);
+		mlx5_add_flow_rule(vport->egress.acl, &flow_attr);
 	if (IS_ERR(vport->egress.drop_rule)) {
 		err = PTR_ERR(vport->egress.drop_rule);
 		pr_warn("vport[%d] configure egress drop rule failed, err(%d)\n",
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index e912a3d..9f613aa 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -1152,39 +1152,37 @@ static bool dest_is_valid(struct mlx5_flow_destination *dest,
 
 static struct mlx5_flow_rule *
 _mlx5_add_flow_rule(struct mlx5_flow_table *ft,
-		    u8 match_criteria_enable,
-		    u32 *match_criteria,
-		    u32 *match_value,
-		    u32 action,
-		    u32 flow_tag,
-		    struct mlx5_flow_destination *dest)
+		    struct mlx5_flow_attr *attr)
 {
+	struct mlx5_flow_match *match = &attr->flow_match;
 	struct mlx5_flow_group *g;
 	struct mlx5_flow_rule *rule;
 
-	if (!dest_is_valid(dest, action, ft))
+	if (!dest_is_valid(attr->dest, attr->action, ft))
 		return ERR_PTR(-EINVAL);
 
 	nested_lock_ref_node(&ft->node, FS_MUTEX_GRANDPARENT);
 	fs_for_each_fg(g, ft)
 		if (compare_match_criteria(g->mask.match_criteria_enable,
-					   match_criteria_enable,
+					   match->match_criteria_enable,
 					   g->mask.match_criteria,
-					   match_criteria)) {
-			rule = add_rule_fg(g, match_value,
-					   action, flow_tag, dest);
+					   match->match_criteria)) {
+			rule = add_rule_fg(g, match->match_value,
+					   attr->action, attr->flow_tag,
+					   attr->dest);
 			if (!IS_ERR(rule) || PTR_ERR(rule) != -ENOSPC)
 				goto unlock;
 		}
 
-	g = create_autogroup(ft, match_criteria_enable, match_criteria);
+	g = create_autogroup(ft, match->match_criteria_enable,
+			     match->match_criteria);
 	if (IS_ERR(g)) {
 		rule = (void *)g;
 		goto unlock;
 	}
 
-	rule = add_rule_fg(g, match_value,
-			   action, flow_tag, dest);
+	rule = add_rule_fg(g, match->match_value,
+			   attr->action, attr->flow_tag, attr->dest);
 	if (IS_ERR(rule)) {
 		/* Remove assumes refcount > 0 and autogroup creates a group
 		 * with a refcount = 0.
@@ -1207,41 +1205,35 @@ static bool fwd_next_prio_supported(struct mlx5_flow_table *ft)
 
 struct mlx5_flow_rule *
 mlx5_add_flow_rule(struct mlx5_flow_table *ft,
-		   u8 match_criteria_enable,
-		   u32 *match_criteria,
-		   u32 *match_value,
-		   u32 action,
-		   u32 flow_tag,
-		   struct mlx5_flow_destination *dest)
+		   struct mlx5_flow_attr *attr)
 {
 	struct mlx5_flow_root_namespace *root = find_root(&ft->node);
 	struct mlx5_flow_destination gen_dest;
 	struct mlx5_flow_table *next_ft = NULL;
 	struct mlx5_flow_rule *rule = NULL;
-	u32 sw_action = action;
+	u32 sw_action = attr->action;
 	struct fs_prio *prio;
 
 	fs_get_obj(prio, ft->node.parent);
-	if (action == MLX5_FLOW_CONTEXT_ACTION_FWD_NEXT_PRIO) {
+	if (attr->action == MLX5_FLOW_CONTEXT_ACTION_FWD_NEXT_PRIO) {
 		if (!fwd_next_prio_supported(ft))
 			return ERR_PTR(-EOPNOTSUPP);
-		if (dest)
+		if (attr->dest)
 			return ERR_PTR(-EINVAL);
 		mutex_lock(&root->chain_lock);
 		next_ft = find_next_chained_ft(prio);
 		if (next_ft) {
 			gen_dest.type = MLX5_FLOW_DESTINATION_TYPE_FLOW_TABLE;
 			gen_dest.ft = next_ft;
-			dest = &gen_dest;
-			action = MLX5_FLOW_CONTEXT_ACTION_FWD_DEST;
+			attr->dest = &gen_dest;
+			attr->action = MLX5_FLOW_CONTEXT_ACTION_FWD_DEST;
 		} else {
 			mutex_unlock(&root->chain_lock);
 			return ERR_PTR(-EOPNOTSUPP);
 		}
 	}
 
-	rule =	_mlx5_add_flow_rule(ft, match_criteria_enable, match_criteria,
-				    match_value, action, flow_tag, dest);
+	rule =	_mlx5_add_flow_rule(ft, attr);
 
 	if (sw_action == MLX5_FLOW_CONTEXT_ACTION_FWD_NEXT_PRIO) {
 		if (!IS_ERR_OR_NULL(rule) &&
diff --git a/include/linux/mlx5/fs.h b/include/linux/mlx5/fs.h
index 4b7a107..b300d43 100644
--- a/include/linux/mlx5/fs.h
+++ b/include/linux/mlx5/fs.h
@@ -67,6 +67,28 @@ struct mlx5_flow_group;
 struct mlx5_flow_rule;
 struct mlx5_flow_namespace;
 
+#define MLX5_RULE_ATTR(attr, mc_e, mc, mv, action_v, flow_tag_v, dest_v)  {\
+	attr.flow_match.match_criteria_enable = mc_e;		\
+	attr.flow_match.match_criteria = mc;			\
+	attr.flow_match.match_value = mv;			\
+	attr.action = action_v;					\
+	attr.flow_tag = flow_tag_v;				\
+	attr.dest = dest_v;					\
+}
+
+struct mlx5_flow_match {
+	   u8 match_criteria_enable;
+	   u32 *match_criteria;
+	   u32 *match_value;
+};
+
+struct mlx5_flow_attr {
+	struct mlx5_flow_match flow_match;
+	u32 action;
+	u32 flow_tag;
+	struct mlx5_flow_destination *dest;
+};
+
 struct mlx5_flow_destination {
 	enum mlx5_flow_destination_type	type;
 	union {
@@ -115,12 +137,7 @@ void mlx5_destroy_flow_group(struct mlx5_flow_group *fg);
  */
 struct mlx5_flow_rule *
 mlx5_add_flow_rule(struct mlx5_flow_table *ft,
-		   u8 match_criteria_enable,
-		   u32 *match_criteria,
-		   u32 *match_value,
-		   u32 action,
-		   u32 flow_tag,
-		   struct mlx5_flow_destination *dest);
+		   struct mlx5_flow_attr *attr);
 void mlx5_del_flow_rule(struct mlx5_flow_rule *fr);
 
 int mlx5_modify_rule_destination(struct mlx5_flow_rule *rule,
-- 
2.8.0


* [PATCH net-next 02/18] net/mlx5: Introduce mlx5_flow_steering structure
From: Saeed Mahameed @ 2016-06-17 14:43 UTC
  To: David S. Miller
  Cc: netdev, Doug Ledford, Or Gerlitz, Maor Gottlieb, Huy Nguyen,
	Tal Alon, Saeed Mahameed

From: Maor Gottlieb <maorg@mellanox.com>

Instead of keeping all the steering private namespaces and steering
module fields flat in mlx5_priv, wrap them in a new mlx5_flow_steering
structure, for better modularity and API exposure.

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 134 ++++++++++++----------
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.h |   8 ++
 include/linux/mlx5/driver.h                       |   6 +-
 3 files changed, 84 insertions(+), 64 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index 9f613aa..dcd3082 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -1351,12 +1351,13 @@ void mlx5_destroy_flow_group(struct mlx5_flow_group *fg)
 struct mlx5_flow_namespace *mlx5_get_flow_namespace(struct mlx5_core_dev *dev,
 						    enum mlx5_flow_namespace_type type)
 {
-	struct mlx5_flow_root_namespace *root_ns = dev->priv.root_ns;
+	struct mlx5_flow_steering *steering = dev->priv.steering;
+	struct mlx5_flow_root_namespace *root_ns;
 	int prio;
 	struct fs_prio *fs_prio;
 	struct mlx5_flow_namespace *ns;
 
-	if (!root_ns)
+	if (!steering)
 		return NULL;
 
 	switch (type) {
@@ -1367,24 +1368,28 @@ struct mlx5_flow_namespace *mlx5_get_flow_namespace(struct mlx5_core_dev *dev,
 		prio = type;
 		break;
 	case MLX5_FLOW_NAMESPACE_FDB:
-		if (dev->priv.fdb_root_ns)
-			return &dev->priv.fdb_root_ns->ns;
+		if (steering->fdb_root_ns)
+			return &steering->fdb_root_ns->ns;
 		else
 			return NULL;
 	case MLX5_FLOW_NAMESPACE_ESW_EGRESS:
-		if (dev->priv.esw_egress_root_ns)
-			return &dev->priv.esw_egress_root_ns->ns;
+		if (steering->esw_egress_root_ns)
+			return &steering->esw_egress_root_ns->ns;
 		else
 			return NULL;
 	case MLX5_FLOW_NAMESPACE_ESW_INGRESS:
-		if (dev->priv.esw_ingress_root_ns)
-			return &dev->priv.esw_ingress_root_ns->ns;
+		if (steering->esw_ingress_root_ns)
+			return &steering->esw_ingress_root_ns->ns;
 		else
 			return NULL;
 	default:
 		return NULL;
 	}
 
+	root_ns = steering->root_ns;
+	if (!root_ns)
+		return NULL;
+
 	fs_prio = find_prio(&root_ns->ns, prio);
 	if (!fs_prio)
 		return NULL;
@@ -1470,13 +1475,13 @@ static bool has_required_caps(struct mlx5_core_dev *dev, struct node_caps *caps)
 	return true;
 }
 
-static int init_root_tree_recursive(struct mlx5_core_dev *dev,
+static int init_root_tree_recursive(struct mlx5_flow_steering *steering,
 				    struct init_tree_node *init_node,
 				    struct fs_node *fs_parent_node,
 				    struct init_tree_node *init_parent_node,
 				    int prio)
 {
-	int max_ft_level = MLX5_CAP_FLOWTABLE(dev,
+	int max_ft_level = MLX5_CAP_FLOWTABLE(steering->dev,
 					      flow_table_properties_nic_receive.
 					      max_ft_level);
 	struct mlx5_flow_namespace *fs_ns;
@@ -1487,7 +1492,7 @@ static int init_root_tree_recursive(struct mlx5_core_dev *dev,
 
 	if (init_node->type == FS_TYPE_PRIO) {
 		if ((init_node->min_ft_level > max_ft_level) ||
-		    !has_required_caps(dev, &init_node->caps))
+		    !has_required_caps(steering->dev, &init_node->caps))
 			return 0;
 
 		fs_get_obj(fs_ns, fs_parent_node);
@@ -1508,7 +1513,7 @@ static int init_root_tree_recursive(struct mlx5_core_dev *dev,
 	}
 	prio = 0;
 	for (i = 0; i < init_node->ar_size; i++) {
-		err = init_root_tree_recursive(dev, &init_node->children[i],
+		err = init_root_tree_recursive(steering, &init_node->children[i],
 					       base, init_node, prio);
 		if (err)
 			return err;
@@ -1521,7 +1526,7 @@ static int init_root_tree_recursive(struct mlx5_core_dev *dev,
 	return 0;
 }
 
-static int init_root_tree(struct mlx5_core_dev *dev,
+static int init_root_tree(struct mlx5_flow_steering *steering,
 			  struct init_tree_node *init_node,
 			  struct fs_node *fs_parent_node)
 {
@@ -1531,7 +1536,7 @@ static int init_root_tree(struct mlx5_core_dev *dev,
 
 	fs_get_obj(fs_ns, fs_parent_node);
 	for (i = 0; i < init_node->ar_size; i++) {
-		err = init_root_tree_recursive(dev, &init_node->children[i],
+		err = init_root_tree_recursive(steering, &init_node->children[i],
 					       &fs_ns->node,
 					       init_node, i);
 		if (err)
@@ -1540,7 +1545,7 @@ static int init_root_tree(struct mlx5_core_dev *dev,
 	return 0;
 }
 
-static struct mlx5_flow_root_namespace *create_root_ns(struct mlx5_core_dev *dev,
+static struct mlx5_flow_root_namespace *create_root_ns(struct mlx5_flow_steering *steering,
 						       enum fs_flow_table_type
 						       table_type)
 {
@@ -1552,7 +1557,7 @@ static struct mlx5_flow_root_namespace *create_root_ns(struct mlx5_core_dev *dev
 	if (!root_ns)
 		return NULL;
 
-	root_ns->dev = dev;
+	root_ns->dev = steering->dev;
 	root_ns->table_type = table_type;
 
 	ns = &root_ns->ns;
@@ -1607,46 +1612,45 @@ static void set_prio_attrs(struct mlx5_flow_root_namespace *root_ns)
 #define ANCHOR_PRIO 0
 #define ANCHOR_SIZE 1
 #define ANCHOR_LEVEL 0
-static int create_anchor_flow_table(struct mlx5_core_dev
-							*dev)
+static int create_anchor_flow_table(struct mlx5_flow_steering *steering)
 {
 	struct mlx5_flow_namespace *ns = NULL;
 	struct mlx5_flow_table *ft;
 
-	ns = mlx5_get_flow_namespace(dev, MLX5_FLOW_NAMESPACE_ANCHOR);
+	ns = mlx5_get_flow_namespace(steering->dev, MLX5_FLOW_NAMESPACE_ANCHOR);
 	if (!ns)
 		return -EINVAL;
 	ft = mlx5_create_flow_table(ns, ANCHOR_PRIO, ANCHOR_SIZE, ANCHOR_LEVEL);
 	if (IS_ERR(ft)) {
-		mlx5_core_err(dev, "Failed to create last anchor flow table");
+		mlx5_core_err(steering->dev, "Failed to create last anchor flow table");
 		return PTR_ERR(ft);
 	}
 	return 0;
 }
 
-static int init_root_ns(struct mlx5_core_dev *dev)
+static int init_root_ns(struct mlx5_flow_steering *steering)
 {
 
-	dev->priv.root_ns = create_root_ns(dev, FS_FT_NIC_RX);
-	if (IS_ERR_OR_NULL(dev->priv.root_ns))
+	steering->root_ns = create_root_ns(steering, FS_FT_NIC_RX);
+	if (IS_ERR_OR_NULL(steering->root_ns))
 		goto cleanup;
 
-	if (init_root_tree(dev, &root_fs, &dev->priv.root_ns->ns.node))
+	if (init_root_tree(steering, &root_fs, &steering->root_ns->ns.node))
 		goto cleanup;
 
-	set_prio_attrs(dev->priv.root_ns);
+	set_prio_attrs(steering->root_ns);
 
-	if (create_anchor_flow_table(dev))
+	if (create_anchor_flow_table(steering))
 		goto cleanup;
 
 	return 0;
 
 cleanup:
-	mlx5_cleanup_fs(dev);
+	mlx5_cleanup_fs(steering->dev);
 	return -ENOMEM;
 }
 
-static void cleanup_single_prio_root_ns(struct mlx5_core_dev *dev,
+static void cleanup_single_prio_root_ns(struct mlx5_flow_steering *steering,
 					struct mlx5_flow_root_namespace *root_ns)
 {
 	struct fs_node *prio;
@@ -1659,11 +1663,11 @@ static void cleanup_single_prio_root_ns(struct mlx5_core_dev *dev,
 					struct fs_node,
 				 list);
 		if (tree_remove_node(prio))
-			mlx5_core_warn(dev,
+			mlx5_core_warn(steering->dev,
 				       "Flow steering priority wasn't destroyed, refcount > 1\n");
 	}
 	if (tree_remove_node(&root_ns->ns.node))
-		mlx5_core_warn(dev,
+		mlx5_core_warn(steering->dev,
 			       "Flow steering namespace wasn't destroyed, refcount > 1\n");
 	root_ns = NULL;
 }
@@ -1677,12 +1681,12 @@ static void destroy_flow_tables(struct fs_prio *prio)
 		mlx5_destroy_flow_table(iter);
 }
 
-static void cleanup_root_ns(struct mlx5_core_dev *dev)
+static void cleanup_root_ns(struct mlx5_flow_steering *steering)
 {
-	struct mlx5_flow_root_namespace *root_ns = dev->priv.root_ns;
+	struct mlx5_flow_root_namespace *root_ns = steering->root_ns;
 	struct fs_prio *iter_prio;
 
-	if (!MLX5_CAP_GEN(dev, nic_flow_table))
+	if (!MLX5_CAP_GEN(steering->dev, nic_flow_table))
 		return;
 
 	if (!root_ns)
@@ -1707,7 +1711,7 @@ static void cleanup_root_ns(struct mlx5_core_dev *dev)
 				fs_get_obj(obj_iter_prio2, iter_prio2);
 				destroy_flow_tables(obj_iter_prio2);
 				if (tree_remove_node(iter_prio2)) {
-					mlx5_core_warn(dev,
+					mlx5_core_warn(steering->dev,
 						       "Priority %d wasn't destroyed, refcount > 1\n",
 						       obj_iter_prio2->prio);
 					return;
@@ -1724,7 +1728,7 @@ static void cleanup_root_ns(struct mlx5_core_dev *dev)
 						 struct fs_node,
 						 list);
 			if (tree_remove_node(iter_ns)) {
-				mlx5_core_warn(dev,
+				mlx5_core_warn(steering->dev,
 					       "Namespace wasn't destroyed, refcount > 1\n");
 				return;
 			}
@@ -1741,7 +1745,7 @@ static void cleanup_root_ns(struct mlx5_core_dev *dev)
 
 		fs_get_obj(obj_prio_node, prio_node);
 		if (tree_remove_node(prio_node)) {
-			mlx5_core_warn(dev,
+			mlx5_core_warn(steering->dev,
 				       "Priority %d wasn't destroyed, refcount > 1\n",
 				       obj_prio_node->prio);
 			return;
@@ -1749,70 +1753,75 @@ static void cleanup_root_ns(struct mlx5_core_dev *dev)
 	}
 
 	if (tree_remove_node(&root_ns->ns.node)) {
-		mlx5_core_warn(dev,
+		mlx5_core_warn(steering->dev,
 			       "root namespace wasn't destroyed, refcount > 1\n");
 		return;
 	}
 
-	dev->priv.root_ns = NULL;
+	steering->root_ns = NULL;
 }
 
 void mlx5_cleanup_fs(struct mlx5_core_dev *dev)
 {
+	struct mlx5_flow_steering *steering = dev->priv.steering;
+
 	if (MLX5_CAP_GEN(dev, port_type) != MLX5_CAP_PORT_TYPE_ETH)
 		return;
 
-	cleanup_root_ns(dev);
-	cleanup_single_prio_root_ns(dev, dev->priv.fdb_root_ns);
-	cleanup_single_prio_root_ns(dev, dev->priv.esw_egress_root_ns);
-	cleanup_single_prio_root_ns(dev, dev->priv.esw_ingress_root_ns);
+	cleanup_root_ns(steering);
+	cleanup_single_prio_root_ns(steering, steering->esw_egress_root_ns);
+	cleanup_single_prio_root_ns(steering, steering->esw_ingress_root_ns);
+	cleanup_single_prio_root_ns(steering, steering->fdb_root_ns);
 	mlx5_cleanup_fc_stats(dev);
+	kfree(steering);
 }
 
-static int init_fdb_root_ns(struct mlx5_core_dev *dev)
+static int init_fdb_root_ns(struct mlx5_flow_steering *steering)
 {
 	struct fs_prio *prio;
 
-	dev->priv.fdb_root_ns = create_root_ns(dev, FS_FT_FDB);
-	if (!dev->priv.fdb_root_ns)
+	steering->fdb_root_ns = create_root_ns(steering, FS_FT_FDB);
+	if (!steering->fdb_root_ns)
 		return -ENOMEM;
 
 	/* Create single prio */
-	prio = fs_create_prio(&dev->priv.fdb_root_ns->ns, 0, 1);
+	prio = fs_create_prio(&steering->fdb_root_ns->ns, 0, 1);
 	if (IS_ERR(prio)) {
-		cleanup_single_prio_root_ns(dev, dev->priv.fdb_root_ns);
+		cleanup_single_prio_root_ns(steering, steering->fdb_root_ns);
 		return PTR_ERR(prio);
 	} else {
 		return 0;
 	}
 }
 
-static int init_egress_acl_root_ns(struct mlx5_core_dev *dev)
+static int init_ingress_acl_root_ns(struct mlx5_flow_steering *steering)
 {
 	struct fs_prio *prio;
 
-	dev->priv.esw_egress_root_ns = create_root_ns(dev, FS_FT_ESW_EGRESS_ACL);
-	if (!dev->priv.esw_egress_root_ns)
+	steering->esw_egress_root_ns = create_root_ns(steering, FS_FT_ESW_EGRESS_ACL);
+	if (!steering->esw_egress_root_ns)
 		return -ENOMEM;
 
 	/* create 1 prio*/
-	prio = fs_create_prio(&dev->priv.esw_egress_root_ns->ns, 0, MLX5_TOTAL_VPORTS(dev));
+	prio = fs_create_prio(&steering->esw_egress_root_ns->ns, 0,
+			      MLX5_TOTAL_VPORTS(steering->dev));
 	if (IS_ERR(prio))
 		return PTR_ERR(prio);
 	else
 		return 0;
 }
 
-static int init_ingress_acl_root_ns(struct mlx5_core_dev *dev)
+static int init_egress_acl_root_ns(struct mlx5_flow_steering *steering)
 {
 	struct fs_prio *prio;
 
-	dev->priv.esw_ingress_root_ns = create_root_ns(dev, FS_FT_ESW_INGRESS_ACL);
-	if (!dev->priv.esw_ingress_root_ns)
+	steering->esw_ingress_root_ns = create_root_ns(steering, FS_FT_ESW_INGRESS_ACL);
+	if (!steering->esw_ingress_root_ns)
 		return -ENOMEM;
 
 	/* create 1 prio*/
-	prio = fs_create_prio(&dev->priv.esw_ingress_root_ns->ns, 0, MLX5_TOTAL_VPORTS(dev));
+	prio = fs_create_prio(&steering->esw_ingress_root_ns->ns, 0,
+			      MLX5_TOTAL_VPORTS(steering->dev));
 	if (IS_ERR(prio))
 		return PTR_ERR(prio);
 	else
@@ -1821,6 +1830,7 @@ static int init_ingress_acl_root_ns(struct mlx5_core_dev *dev)
 
 int mlx5_init_fs(struct mlx5_core_dev *dev)
 {
+	struct mlx5_flow_steering *steering;
 	int err = 0;
 
 	if (MLX5_CAP_GEN(dev, port_type) != MLX5_CAP_PORT_TYPE_ETH)
@@ -1830,26 +1840,32 @@ int mlx5_init_fs(struct mlx5_core_dev *dev)
 	if (err)
 		return err;
 
+	steering = kzalloc(sizeof(*steering), GFP_KERNEL);
+	if (!steering)
+		return -ENOMEM;
+	steering->dev = dev;
+	dev->priv.steering = steering;
+
 	if (MLX5_CAP_GEN(dev, nic_flow_table) &&
 	    MLX5_CAP_FLOWTABLE_NIC_RX(dev, ft_support)) {
-		err = init_root_ns(dev);
+		err = init_root_ns(steering);
 		if (err)
 			goto err;
 	}
 
 	if (MLX5_CAP_GEN(dev, eswitch_flow_table)) {
 		if (MLX5_CAP_ESW_FLOWTABLE_FDB(dev, ft_support)) {
-			err = init_fdb_root_ns(dev);
+			err = init_fdb_root_ns(steering);
 			if (err)
 				goto err;
 		}
 		if (MLX5_CAP_ESW_EGRESS_ACL(dev, ft_support)) {
-			err = init_egress_acl_root_ns(dev);
+			err = init_egress_acl_root_ns(steering);
 			if (err)
 				goto err;
 		}
 		if (MLX5_CAP_ESW_INGRESS_ACL(dev, ft_support)) {
-			err = init_ingress_acl_root_ns(dev);
+			err = init_ingress_acl_root_ns(steering);
 			if (err)
 				goto err;
 		}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
index aa41a73..d7ba91a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
@@ -55,6 +55,14 @@ enum fs_fte_status {
 	FS_FTE_STATUS_EXISTING = 1UL << 0,
 };
 
+struct mlx5_flow_steering {
+	struct mlx5_core_dev *dev;
+	struct mlx5_flow_root_namespace *root_ns;
+	struct mlx5_flow_root_namespace *fdb_root_ns;
+	struct mlx5_flow_root_namespace *esw_egress_root_ns;
+	struct mlx5_flow_root_namespace *esw_ingress_root_ns;
+};
+
 struct fs_node {
 	struct list_head	list;
 	struct list_head	children;
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 80776d0..1bd7cde 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -535,14 +535,10 @@ struct mlx5_priv {
 	struct list_head        ctx_list;
 	spinlock_t              ctx_lock;
 
+	struct mlx5_flow_steering *steering;
 	struct mlx5_eswitch     *eswitch;
 	struct mlx5_core_sriov	sriov;
 	unsigned long		pci_dev_data;
-	struct mlx5_flow_root_namespace *root_ns;
-	struct mlx5_flow_root_namespace *fdb_root_ns;
-	struct mlx5_flow_root_namespace *esw_egress_root_ns;
-	struct mlx5_flow_root_namespace *esw_ingress_root_ns;
-
 	struct mlx5_fc_stats		fc_stats;
 };
 
-- 
2.8.0


* [PATCH net-next 03/18] net/mlx5: Properly remove all steering objects
From: Saeed Mahameed @ 2016-06-17 14:43 UTC
  To: David S. Miller
  Cc: netdev, Doug Ledford, Or Gerlitz, Maor Gottlieb, Huy Nguyen,
	Tal Alon, Saeed Mahameed

From: Maor Gottlieb <maorg@mellanox.com>

Instead of explicitly cleaning up the well-known parts of the steering
tree, use the generic tree structure to traverse it for cleanup.
No functional changes.

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 120 +++-------------------
 1 file changed, 15 insertions(+), 105 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index dcd3082..ea90b66 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -1650,115 +1650,24 @@ cleanup:
 	return -ENOMEM;
 }
 
-static void cleanup_single_prio_root_ns(struct mlx5_flow_steering *steering,
-					struct mlx5_flow_root_namespace *root_ns)
+static void clean_tree(struct fs_node *node)
 {
-	struct fs_node *prio;
-
-	if (!root_ns)
-		return;
+	if (node) {
+		struct fs_node *iter;
+		struct fs_node *temp;
 
-	if (!list_empty(&root_ns->ns.node.children)) {
-		prio = list_first_entry(&root_ns->ns.node.children,
-					struct fs_node,
-				 list);
-		if (tree_remove_node(prio))
-			mlx5_core_warn(steering->dev,
-				       "Flow steering priority wasn't destroyed, refcount > 1\n");
+		list_for_each_entry_safe(iter, temp, &node->children, list)
+			clean_tree(iter);
+		tree_remove_node(node);
 	}
-	if (tree_remove_node(&root_ns->ns.node))
-		mlx5_core_warn(steering->dev,
-			       "Flow steering namespace wasn't destroyed, refcount > 1\n");
-	root_ns = NULL;
-}
-
-static void destroy_flow_tables(struct fs_prio *prio)
-{
-	struct mlx5_flow_table *iter;
-	struct mlx5_flow_table *tmp;
-
-	fs_for_each_ft_safe(iter, tmp, prio)
-		mlx5_destroy_flow_table(iter);
 }
 
-static void cleanup_root_ns(struct mlx5_flow_steering *steering)
+static void cleanup_root_ns(struct mlx5_flow_root_namespace *root_ns)
 {
-	struct mlx5_flow_root_namespace *root_ns = steering->root_ns;
-	struct fs_prio *iter_prio;
-
-	if (!MLX5_CAP_GEN(steering->dev, nic_flow_table))
-		return;
-
 	if (!root_ns)
 		return;
 
-	/* stage 1 */
-	fs_for_each_prio(iter_prio, &root_ns->ns) {
-		struct fs_node *node;
-		struct mlx5_flow_namespace *iter_ns;
-
-		fs_for_each_ns_or_ft(node, iter_prio) {
-			if (node->type == FS_TYPE_FLOW_TABLE)
-				continue;
-			fs_get_obj(iter_ns, node);
-			while (!list_empty(&iter_ns->node.children)) {
-				struct fs_prio *obj_iter_prio2;
-				struct fs_node *iter_prio2 =
-					list_first_entry(&iter_ns->node.children,
-							 struct fs_node,
-							 list);
-
-				fs_get_obj(obj_iter_prio2, iter_prio2);
-				destroy_flow_tables(obj_iter_prio2);
-				if (tree_remove_node(iter_prio2)) {
-					mlx5_core_warn(steering->dev,
-						       "Priority %d wasn't destroyed, refcount > 1\n",
-						       obj_iter_prio2->prio);
-					return;
-				}
-			}
-		}
-	}
-
-	/* stage 2 */
-	fs_for_each_prio(iter_prio, &root_ns->ns) {
-		while (!list_empty(&iter_prio->node.children)) {
-			struct fs_node *iter_ns =
-				list_first_entry(&iter_prio->node.children,
-						 struct fs_node,
-						 list);
-			if (tree_remove_node(iter_ns)) {
-				mlx5_core_warn(steering->dev,
-					       "Namespace wasn't destroyed, refcount > 1\n");
-				return;
-			}
-		}
-	}
-
-	/* stage 3 */
-	while (!list_empty(&root_ns->ns.node.children)) {
-		struct fs_prio *obj_prio_node;
-		struct fs_node *prio_node =
-			list_first_entry(&root_ns->ns.node.children,
-					 struct fs_node,
-					 list);
-
-		fs_get_obj(obj_prio_node, prio_node);
-		if (tree_remove_node(prio_node)) {
-			mlx5_core_warn(steering->dev,
-				       "Priority %d wasn't destroyed, refcount > 1\n",
-				       obj_prio_node->prio);
-			return;
-		}
-	}
-
-	if (tree_remove_node(&root_ns->ns.node)) {
-		mlx5_core_warn(steering->dev,
-			       "root namespace wasn't destroyed, refcount > 1\n");
-		return;
-	}
-
-	steering->root_ns = NULL;
+	clean_tree(&root_ns->ns.node);
 }
 
 void mlx5_cleanup_fs(struct mlx5_core_dev *dev)
@@ -1768,10 +1677,10 @@ void mlx5_cleanup_fs(struct mlx5_core_dev *dev)
 	if (MLX5_CAP_GEN(dev, port_type) != MLX5_CAP_PORT_TYPE_ETH)
 		return;
 
-	cleanup_root_ns(steering);
-	cleanup_single_prio_root_ns(steering, steering->esw_egress_root_ns);
-	cleanup_single_prio_root_ns(steering, steering->esw_ingress_root_ns);
-	cleanup_single_prio_root_ns(steering, steering->fdb_root_ns);
+	cleanup_root_ns(steering->root_ns);
+	cleanup_root_ns(steering->esw_egress_root_ns);
+	cleanup_root_ns(steering->esw_ingress_root_ns);
+	cleanup_root_ns(steering->fdb_root_ns);
 	mlx5_cleanup_fc_stats(dev);
 	kfree(steering);
 }
@@ -1787,7 +1696,8 @@ static int init_fdb_root_ns(struct mlx5_flow_steering *steering)
 	/* Create single prio */
 	prio = fs_create_prio(&steering->fdb_root_ns->ns, 0, 1);
 	if (IS_ERR(prio)) {
-		cleanup_single_prio_root_ns(steering, steering->fdb_root_ns);
+		cleanup_root_ns(steering->fdb_root_ns);
+		steering->fdb_root_ns = NULL;
 		return PTR_ERR(prio);
 	} else {
 		return 0;
-- 
2.8.0


* [PATCH net-next 04/18] net/mlx5: Add hold/put rules refcount API
From: Saeed Mahameed @ 2016-06-17 14:43 UTC
  To: David S. Miller
  Cc: netdev, Doug Ledford, Or Gerlitz, Maor Gottlieb, Huy Nguyen,
	Tal Alon, Saeed Mahameed

From: Maor Gottlieb <maorg@mellanox.com>

Steering consumers (e.g. the sniffer) need to hold a refcount on flow
rules they did not create, until they have finished their work on the
rule.

For that, expose an API to hold/put a refcount on a rule, and add a
completion mechanism so that the rule is not freed until the reference
holder releases it.
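
Typical usage by such a consumer, sketched (the helper name is
illustrative; mlx5_get_flow_rule()/mlx5_put_flow_rule() are the API
added here):

    static void sniffer_handle_rule(struct mlx5_flow_rule *rule)
    {
            mlx5_get_flow_rule(rule);       /* pin the rule */

            /* ... inspect or duplicate the rule ... */

            /* until this put, deletion of the rule waits (up to a
             * timeout) for the reference to drop
             */
            mlx5_put_flow_rule(rule);
    }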

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 105 +++++++++++++++-------
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.h |   2 +
 include/linux/mlx5/fs.h                           |   3 +
 3 files changed, 76 insertions(+), 34 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index ea90b66..06f94bf 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -141,6 +141,7 @@ static void tree_init_node(struct fs_node *node,
 	INIT_LIST_HEAD(&node->list);
 	INIT_LIST_HEAD(&node->children);
 	mutex_init(&node->lock);
+	init_completion(&node->complete);
 	node->remove_func = remove_func;
 }
 
@@ -190,6 +191,7 @@ static void unlock_ref_node(struct fs_node *node)
 static void tree_put_node(struct fs_node *node)
 {
 	struct fs_node *parent_node = node->parent;
+	bool node_deleted = false;
 
 	lock_ref_node(parent_node);
 	if (atomic_dec_and_test(&node->refcount)) {
@@ -197,21 +199,68 @@ static void tree_put_node(struct fs_node *node)
 			list_del_init(&node->list);
 		if (node->remove_func)
 			node->remove_func(node);
-		kfree(node);
-		node = NULL;
+		complete(&node->complete);
+		node_deleted = true;
 	}
 	unlock_ref_node(parent_node);
-	if (!node && parent_node)
+	if (node_deleted && parent_node)
 		tree_put_node(parent_node);
 }
 
-static int tree_remove_node(struct fs_node *node)
+static struct mlx5_flow_root_namespace *find_root(struct fs_node *node)
 {
-	if (atomic_read(&node->refcount) > 1) {
-		atomic_dec(&node->refcount);
-		return -EEXIST;
+	struct fs_node *root;
+	struct mlx5_flow_namespace *ns;
+
+	root = node->root;
+
+	if (WARN_ON(root->type != FS_TYPE_NAMESPACE)) {
+		pr_warn("mlx5: flow steering node is not in tree or garbaged\n");
+		return NULL;
 	}
+
+	ns = container_of(root, struct mlx5_flow_namespace, node);
+	return container_of(ns, struct mlx5_flow_root_namespace, ns);
+}
+
+static inline struct mlx5_core_dev *get_dev(struct fs_node *node)
+{
+	struct mlx5_flow_root_namespace *root = find_root(node);
+
+	if (root)
+		return root->dev;
+	return NULL;
+}
+
+#define MLX5_FS_TIMEOUT_MSEC 1000
+static int tree_remove_node(struct fs_node *node)
+{
+	unsigned long timeout = msecs_to_jiffies(MLX5_FS_TIMEOUT_MSEC);
+	struct mlx5_core_dev *dev = get_dev(node);
+
 	tree_put_node(node);
+	if (!wait_for_completion_timeout(&node->complete, timeout)) {
+		mlx5_core_warn(dev, "Timeout waiting for removing steering object\n");
+		return -ETIMEDOUT;
+	}
+	kfree(node);
+	node = NULL;
+
+	return 0;
+}
+
+static int tree_force_remove_node(struct fs_node *node)
+{
+	struct fs_node *parent_node = node->parent;
+
+	lock_ref_node(parent_node);
+	list_del_init(&node->list);
+	if (node->remove_func)
+		node->remove_func(node);
+	kfree(node);
+	node = NULL;
+	unlock_ref_node(parent_node);
+
 	return 0;
 }
 
@@ -295,31 +344,6 @@ static bool compare_match_criteria(u8 match_criteria_enable1,
 		!memcmp(mask1, mask2, MLX5_ST_SZ_BYTES(fte_match_param));
 }
 
-static struct mlx5_flow_root_namespace *find_root(struct fs_node *node)
-{
-	struct fs_node *root;
-	struct mlx5_flow_namespace *ns;
-
-	root = node->root;
-
-	if (WARN_ON(root->type != FS_TYPE_NAMESPACE)) {
-		pr_warn("mlx5: flow steering node is not in tree or garbaged\n");
-		return NULL;
-	}
-
-	ns = container_of(root, struct mlx5_flow_namespace, node);
-	return container_of(ns, struct mlx5_flow_root_namespace, ns);
-}
-
-static inline struct mlx5_core_dev *get_dev(struct fs_node *node)
-{
-	struct mlx5_flow_root_namespace *root = find_root(node);
-
-	if (root)
-		return root->dev;
-	return NULL;
-}
-
 static void del_flow_table(struct fs_node *node)
 {
 	struct mlx5_flow_table *ft;
@@ -870,6 +894,7 @@ static struct mlx5_flow_rule *alloc_rule(struct mlx5_flow_destination *dest)
 		return NULL;
 
 	INIT_LIST_HEAD(&rule->next_ft);
+	atomic_set(&rule->refcount, 1);
 	rule->node.type = FS_TYPE_FLOW_DEST;
 	if (dest)
 		memcpy(&rule->dest_attr, dest, sizeof(*dest));
@@ -1063,7 +1088,7 @@ static struct mlx5_flow_rule *add_rule_fg(struct mlx5_flow_group *fg,
 		    action == fte->action && flow_tag == fte->flow_tag) {
 			rule = find_flow_rule(fte, dest);
 			if (rule) {
-				atomic_inc(&rule->node.refcount);
+				atomic_inc(&rule->refcount);
 				unlock_ref_node(&fte->node);
 				unlock_ref_node(&fg->node);
 				return rule;
@@ -1251,6 +1276,8 @@ EXPORT_SYMBOL(mlx5_add_flow_rule);
 
 void mlx5_del_flow_rule(struct mlx5_flow_rule *rule)
 {
+	if (!atomic_dec_and_test(&rule->refcount))
+		return;
 	tree_remove_node(&rule->node);
 }
 EXPORT_SYMBOL(mlx5_del_flow_rule);
@@ -1658,7 +1685,7 @@ static void clean_tree(struct fs_node *node)
 
 		list_for_each_entry_safe(iter, temp, &node->children, list)
 			clean_tree(iter);
-		tree_remove_node(node);
+		tree_force_remove_node(node);
 	}
 }
 
@@ -1786,3 +1813,13 @@ err:
 	mlx5_cleanup_fs(dev);
 	return err;
 }
+
+void mlx5_get_flow_rule(struct mlx5_flow_rule *rule)
+{
+	tree_get_node(&rule->node);
+}
+
+void mlx5_put_flow_rule(struct mlx5_flow_rule *rule)
+{
+	tree_put_node(&rule->node);
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
index d7ba91a..29dd9e0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
@@ -71,6 +71,7 @@ struct fs_node {
 	struct fs_node		*root;
 	/* lock the node for writing and traversing */
 	struct mutex		lock;
+	struct completion	complete;
 	atomic_t		refcount;
 	void			(*remove_func)(struct fs_node *);
 };
@@ -83,6 +84,7 @@ struct mlx5_flow_rule {
 	 */
 	struct list_head			next_ft;
 	u32					sw_action;
+	atomic_t				refcount;
 };
 
 /* Type of children is mlx5_flow_group */
diff --git a/include/linux/mlx5/fs.h b/include/linux/mlx5/fs.h
index b300d43..37e13a1 100644
--- a/include/linux/mlx5/fs.h
+++ b/include/linux/mlx5/fs.h
@@ -149,4 +149,7 @@ void mlx5_fc_destroy(struct mlx5_core_dev *dev, struct mlx5_fc *counter);
 void mlx5_fc_query_cached(struct mlx5_fc *counter,
 			  u64 *bytes, u64 *packets, u64 *lastuse);
 
+void mlx5_get_flow_rule(struct mlx5_flow_rule *rule);
+void mlx5_put_flow_rule(struct mlx5_flow_rule *rule);
+
 #endif
-- 
2.8.0


* [PATCH net-next 05/18] net/mlx5: Add support to add/del flow rule notifiers
From: Saeed Mahameed @ 2016-06-17 14:43 UTC
  To: David S. Miller
  Cc: netdev, Doug Ledford, Or Gerlitz, Maor Gottlieb, Huy Nguyen,
	Tal Alon, Saeed Mahameed

From: Maor Gottlieb <maorg@mellanox.com>

Use the kernel notifier block API to notify consumers when a rule is
added to or deleted from a namespace.  When a new listener registers,
an add-rule notification is fired for every existing rule.
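
A listener callback, sketched (the function name is illustrative); note
that it is also invoked synchronously during registration, once per
pre-existing rule, so it must not assume the event describes a new rule:

    static int rule_event_cb(struct notifier_block *nb,
                             unsigned long event, void *ctx)
    {
            struct mlx5_event_data *data = ctx;

            switch (event) {
            case MLX5_RULE_EVENT_ADD:
                    /* data->rule was added to data->ft, or already
                     * existed when the listener registered
                     */
                    break;
            case MLX5_RULE_EVENT_DEL:
                    break;
            }
            /* a nonzero return aborts the replay of existing rules
             * for the entry currently being walked
             */
            return 0;
    }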

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 215 +++++++++++++++++++++-
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.h |  15 ++
 include/linux/mlx5/fs.h                           |  20 ++
 3 files changed, 247 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index 06f94bf..6ef7b99 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -164,10 +164,10 @@ static void tree_get_node(struct fs_node *node)
 }
 
 static void nested_lock_ref_node(struct fs_node *node,
-				 enum fs_i_mutex_lock_class class)
+				 int nesting)
 {
 	if (node) {
-		mutex_lock_nested(&node->lock, class);
+		mutex_lock_nested(&node->lock, nesting);
 		atomic_inc(&node->refcount);
 	}
 }
@@ -363,6 +363,8 @@ static void del_flow_table(struct fs_node *node)
 
 static void del_rule(struct fs_node *node)
 {
+	struct rule_client_data *priv_data;
+	struct rule_client_data *tmp;
 	struct mlx5_flow_rule *rule;
 	struct mlx5_flow_table *ft;
 	struct mlx5_flow_group *fg;
@@ -380,6 +382,12 @@ static void del_rule(struct fs_node *node)
 	}
 
 	fs_get_obj(rule, node);
+
+	list_for_each_entry_safe(priv_data, tmp, &rule->clients_data, list) {
+		list_del(&priv_data->list);
+		kfree(priv_data);
+	}
+
 	fs_get_obj(fte, rule->node.parent);
 	fs_get_obj(fg, fte->node.parent);
 	memcpy(match_value, fte->val, sizeof(fte->val));
@@ -896,6 +904,8 @@ static struct mlx5_flow_rule *alloc_rule(struct mlx5_flow_destination *dest)
 	INIT_LIST_HEAD(&rule->next_ft);
 	atomic_set(&rule->refcount, 1);
 	rule->node.type = FS_TYPE_FLOW_DEST;
+	INIT_LIST_HEAD(&rule->clients_data);
+	mutex_init(&rule->clients_lock);
 	if (dest)
 		memcpy(&rule->dest_attr, dest, sizeof(*dest));
 
@@ -1070,6 +1080,52 @@ static struct mlx5_flow_rule *find_flow_rule(struct fs_fte *fte,
 	return NULL;
 }
 
+static struct mlx5_flow_namespace *get_ns(struct fs_node *node)
+{
+	struct mlx5_flow_namespace *ns = NULL;
+
+	while (node  && (node->type != FS_TYPE_NAMESPACE))
+		node = node->parent;
+
+	if (node)
+		fs_get_obj(ns, node);
+
+	return ns;
+}
+
+static void get_event_data(struct mlx5_flow_rule *rule, struct mlx5_event_data
+			   *data)
+{
+	struct mlx5_flow_group *fg;
+	struct mlx5_flow_table *ft;
+	struct fs_fte *fte;
+
+	data->rule = rule;
+
+	fs_get_obj(fte, rule->node.parent);
+	WARN_ON(!fte);
+	fs_get_obj(fg, fte->node.parent);
+	WARN_ON(!fg);
+	fs_get_obj(ft, fg->node.parent);
+	WARN_ON(!ft);
+	data->ft = ft;
+}
+
+static void notify_add_rule(struct mlx5_flow_rule *rule)
+{
+	struct mlx5_event_data evt_data;
+	struct mlx5_flow_namespace *ns;
+	struct fs_fte *fte;
+
+	fs_get_obj(fte, rule->node.parent);
+	ns = get_ns(&fte->node);
+	if (!ns)
+		return;
+
+	get_event_data(rule, &evt_data);
+	raw_notifier_call_chain(&ns->listeners, MLX5_RULE_EVENT_ADD, &evt_data);
+}
+
 static struct mlx5_flow_rule *add_rule_fg(struct mlx5_flow_group *fg,
 					  u32 *match_value,
 					  u8 action,
@@ -1126,6 +1182,7 @@ static struct mlx5_flow_rule *add_rule_fg(struct mlx5_flow_group *fg,
 	list_add(&fte->node.list, prev);
 add_rule:
 	tree_add_node(&rule->node, &fte->node);
+	notify_add_rule(rule);
 unlock_fg:
 	unlock_ref_node(&fg->node);
 	return rule;
@@ -1238,6 +1295,7 @@ mlx5_add_flow_rule(struct mlx5_flow_table *ft,
 	struct mlx5_flow_rule *rule = NULL;
 	u32 sw_action = attr->action;
 	struct fs_prio *prio;
+	struct mlx5_flow_namespace *ns;
 
 	fs_get_obj(prio, ft->node.parent);
 	if (attr->action == MLX5_FLOW_CONTEXT_ACTION_FWD_NEXT_PRIO) {
@@ -1258,8 +1316,10 @@ mlx5_add_flow_rule(struct mlx5_flow_table *ft,
 		}
 	}
 
+	ns = get_ns(&ft->node);
+	if (ns)
+		down_read(&ns->ns_rw_sem);
 	rule =	_mlx5_add_flow_rule(ft, attr);
-
 	if (sw_action == MLX5_FLOW_CONTEXT_ACTION_FWD_NEXT_PRIO) {
 		if (!IS_ERR_OR_NULL(rule) &&
 		    (list_empty(&rule->next_ft))) {
@@ -1270,15 +1330,41 @@ mlx5_add_flow_rule(struct mlx5_flow_table *ft,
 		}
 		mutex_unlock(&root->chain_lock);
 	}
+	if (ns)
+		up_read(&ns->ns_rw_sem);
+
 	return rule;
 }
 EXPORT_SYMBOL(mlx5_add_flow_rule);
 
+static void notify_del_rule(struct mlx5_flow_rule *rule)
+{
+	struct mlx5_flow_namespace *ns;
+	struct mlx5_event_data evt_data;
+	struct fs_fte *fte;
+
+	fs_get_obj(fte, rule->node.parent);
+	ns = get_ns(&fte->node);
+	if (!ns)
+		return;
+
+	get_event_data(rule, &evt_data);
+	raw_notifier_call_chain(&ns->listeners, MLX5_RULE_EVENT_DEL, &evt_data);
+}
+
 void mlx5_del_flow_rule(struct mlx5_flow_rule *rule)
 {
+	struct mlx5_flow_namespace *ns;
+
 	if (!atomic_dec_and_test(&rule->refcount))
 		return;
+	ns = get_ns(&rule->node);
+	if (ns)
+		down_read(&ns->ns_rw_sem);
+	notify_del_rule(rule);
 	tree_remove_node(&rule->node);
+	if (ns)
+		up_read(&ns->ns_rw_sem);
 }
 EXPORT_SYMBOL(mlx5_del_flow_rule);
 
@@ -1453,6 +1539,8 @@ static struct mlx5_flow_namespace *fs_init_namespace(struct mlx5_flow_namespace
 {
 	ns->node.type = FS_TYPE_NAMESPACE;
 
+	init_rwsem(&ns->ns_rw_sem);
+
 	return ns;
 }
 
@@ -1823,3 +1911,124 @@ void mlx5_put_flow_rule(struct mlx5_flow_rule *rule)
 {
 	tree_put_node(&rule->node);
 }
+
+static void notify_existing_rules_recursive(struct fs_node *root,
+					    struct notifier_block *nb,
+					    int nesting)
+{
+	struct mlx5_event_data data;
+	struct fs_node *iter;
+
+	nested_lock_ref_node(root, nesting++);
+	if (root->type == FS_TYPE_FLOW_ENTRY) {
+		struct mlx5_flow_rule *rule;
+		int err = 0;
+
+		/* Iterate on destinations */
+		list_for_each_entry(iter, &root->children, list) {
+			fs_get_obj(rule, iter);
+			get_event_data(rule, &data);
+			err = nb->notifier_call(nb, MLX5_RULE_EVENT_ADD, &data);
+			if (err)
+				break;
+		}
+	} else {
+		list_for_each_entry(iter, &root->children, list)
+			notify_existing_rules_recursive(iter, nb, nesting);
+	}
+	unlock_ref_node(root);
+}
+
+static void mlx5_flow_notify_existing_rules(struct mlx5_flow_namespace *ns,
+					    struct notifier_block *nb)
+{
+	notify_existing_rules_recursive(&ns->node, nb, 0);
+}
+
+int mlx5_register_rule_notifier(struct mlx5_flow_namespace *ns,
+				struct notifier_block *nb)
+{
+	int err;
+
+	down_write(&ns->ns_rw_sem);
+	mlx5_flow_notify_existing_rules(ns, nb);
+	err = raw_notifier_chain_register(&ns->listeners, nb);
+	up_write(&ns->ns_rw_sem);
+
+	return err;
+}
+
+int mlx5_unregister_rule_notifier(struct mlx5_flow_namespace *ns,
+				  struct notifier_block *nb)
+{
+	int err;
+
+	down_write(&ns->ns_rw_sem);
+	err = raw_notifier_chain_unregister(&ns->listeners, nb);
+	up_write(&ns->ns_rw_sem);
+
+	return err;
+}
+
+void *mlx5_get_rule_private_data(struct mlx5_flow_rule *rule,
+				 struct notifier_block *nb)
+{
+	struct rule_client_data *priv_data;
+	void *data = NULL;
+
+	mutex_lock(&rule->clients_lock);
+	list_for_each_entry(priv_data, &rule->clients_data, list) {
+		if (priv_data->nb == nb) {
+			data = priv_data->client_data;
+			break;
+		}
+	}
+	mutex_unlock(&rule->clients_lock);
+
+	return data;
+}
+
+void mlx5_release_rule_private_data(struct mlx5_flow_rule *rule,
+				    struct notifier_block *nb)
+{
+	struct rule_client_data *priv_data;
+	struct rule_client_data *tmp;
+
+	mutex_lock(&rule->clients_lock);
+	list_for_each_entry_safe(priv_data, tmp, &rule->clients_data, list) {
+		if (priv_data->nb == nb) {
+			list_del(&priv_data->list);
+			break;
+		}
+	}
+	mutex_unlock(&rule->clients_lock);
+}
+
+int mlx5_set_rule_private_data(struct mlx5_flow_rule *rule,
+			       struct notifier_block *nb,
+			       void  *client_data)
+{
+	struct rule_client_data *priv_data;
+
+	mutex_lock(&rule->clients_lock);
+	list_for_each_entry(priv_data, &rule->clients_data, list) {
+		if (priv_data->nb == nb) {
+			priv_data->client_data = client_data;
+			goto unlock;
+		}
+	}
+	priv_data = kzalloc(sizeof(*priv_data), GFP_KERNEL);
+	if (!priv_data) {
+		mutex_unlock(&rule->clients_lock);
+		return -ENOMEM;
+	}
+
+	priv_data->client_data = client_data;
+	priv_data->nb = nb;
+	list_add(&priv_data->list, &rule->clients_data);
+
+unlock:
+	mutex_unlock(&rule->clients_lock);
+
+	return 0;
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
index 29dd9e0..dc08742 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
@@ -85,6 +85,9 @@ struct mlx5_flow_rule {
 	struct list_head			next_ft;
 	u32					sw_action;
 	atomic_t				refcount;
+	struct list_head			clients_data;
+	/* Protect clients data list */
+	struct mutex				clients_lock;
 };
 
 /* Type of children is mlx5_flow_group */
@@ -153,6 +156,12 @@ struct fs_prio {
 struct mlx5_flow_namespace {
 	/* parent == NULL => root ns */
 	struct	fs_node			node;
+	/* Listeners list for rule add/del operations */
+	struct raw_notifier_head	listeners;
+	/* We take write lock when we iterate on the
+	 * namespace's rules.
+	 */
+	struct  rw_semaphore		ns_rw_sem;
 };
 
 struct mlx5_flow_group_mask {
@@ -182,6 +191,12 @@ struct mlx5_flow_root_namespace {
 int mlx5_init_fc_stats(struct mlx5_core_dev *dev);
 void mlx5_cleanup_fc_stats(struct mlx5_core_dev *dev);
 
+struct rule_client_data {
+	struct notifier_block *nb;
+	struct list_head list;
+	void   *client_data;
+};
+
 int mlx5_init_fs(struct mlx5_core_dev *dev);
 void mlx5_cleanup_fs(struct mlx5_core_dev *dev);
 
diff --git a/include/linux/mlx5/fs.h b/include/linux/mlx5/fs.h
index 37e13a1..5ac0e8f 100644
--- a/include/linux/mlx5/fs.h
+++ b/include/linux/mlx5/fs.h
@@ -152,4 +152,24 @@ void mlx5_fc_query_cached(struct mlx5_fc *counter,
 void mlx5_get_flow_rule(struct mlx5_flow_rule *rule);
 void mlx5_put_flow_rule(struct mlx5_flow_rule *rule);
 
+enum {
+	MLX5_RULE_EVENT_ADD,
+	MLX5_RULE_EVENT_DEL,
+};
+
+int mlx5_set_rule_private_data(struct mlx5_flow_rule *rule,
+			       struct notifier_block *nb, void *client_data);
+void *mlx5_get_rule_private_data(struct mlx5_flow_rule *rule,
+				 struct notifier_block *nb);
+void mlx5_release_rule_private_data(struct mlx5_flow_rule *rule,
+				    struct notifier_block *nb);
+
+int mlx5_register_rule_notifier(struct mlx5_flow_namespace *ns,
+				struct notifier_block *nb);
+int mlx5_unregister_rule_notifier(struct mlx5_flow_namespace *ns,
+				  struct notifier_block *nb);
+struct mlx5_event_data {
+	struct mlx5_flow_table *ft;
+	struct mlx5_flow_rule *rule;
+};
 #endif
-- 
2.8.0

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH net-next 06/18] net/mlx5: Introduce table of function pointer steering commands
  2016-06-17 14:43 [PATCH net-next 00/18] mlx5 RoCE/RDMA packet sniffer Saeed Mahameed
                   ` (4 preceding siblings ...)
  2016-06-17 14:43 ` [PATCH net-next 05/18] net/mlx5: Add support to add/del flow rule notifiers Saeed Mahameed
@ 2016-06-17 14:43 ` Saeed Mahameed
  2016-06-17 14:43 ` [PATCH net-next 07/18] net/mlx5: Introduce nop " Saeed Mahameed
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Saeed Mahameed @ 2016-06-17 14:43 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Doug Ledford, Or Gerlitz, Maor Gottlieb, Huy Nguyen,
	Tal Alon, Saeed Mahameed

From: Maor Gottlieb <maorg@mellanox.com>

The hardware steers RoCE packets automatically, so these rules aren't
added through the flow steering infrastructure. However, we want the
flow steering software tree to describe the various flows every
packet can take.

In order to do that, we add a table of function pointer steering
commands. This will later be used by the virtual RoCE namespace.
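
The resulting dispatch pattern, sketched with names introduced by
this patch (firmware-backed namespaces keep the existing behaviour
via mlx5_get_phys_fs_cmds()):

    struct mlx5_flow_root_namespace *root = find_root(&ft->node);

    /* indirect call instead of a hardcoded mlx5_cmd_* helper */
    err = root->cmds->destroy_flow_table(dev, ft);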

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c  | 79 ++++++++++++++---------
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h  | 70 ++++++++++----------
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 73 ++++++++++++---------
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.h |  1 +
 4 files changed, 125 insertions(+), 98 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
index a5bb6b6..c3eecff 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
@@ -38,8 +38,8 @@
 #include "fs_cmd.h"
 #include "mlx5_core.h"
 
-int mlx5_cmd_update_root_ft(struct mlx5_core_dev *dev,
-			    struct mlx5_flow_table *ft)
+static int mlx5_cmd_update_root_ft(struct mlx5_core_dev *dev,
+				   struct mlx5_flow_table *ft)
 {
 	u32 in[MLX5_ST_SZ_DW(set_flow_table_root_in)];
 	u32 out[MLX5_ST_SZ_DW(set_flow_table_root_out)];
@@ -60,11 +60,11 @@ int mlx5_cmd_update_root_ft(struct mlx5_core_dev *dev,
 					  sizeof(out));
 }
 
-int mlx5_cmd_create_flow_table(struct mlx5_core_dev *dev,
-			       u16 vport,
-			       enum fs_flow_table_type type, unsigned int level,
-			       unsigned int log_size, struct mlx5_flow_table
-			       *next_ft, unsigned int *table_id)
+static int mlx5_cmd_create_flow_table(struct mlx5_core_dev *dev,
+				      u16 vport,
+				      enum fs_flow_table_type type, unsigned int level,
+				      unsigned int log_size, struct mlx5_flow_table
+				      *next_ft, unsigned int *table_id)
 {
 	u32 out[MLX5_ST_SZ_DW(create_flow_table_out)];
 	u32 in[MLX5_ST_SZ_DW(create_flow_table_in)];
@@ -97,8 +97,8 @@ int mlx5_cmd_create_flow_table(struct mlx5_core_dev *dev,
 	return err;
 }
 
-int mlx5_cmd_destroy_flow_table(struct mlx5_core_dev *dev,
-				struct mlx5_flow_table *ft)
+static int mlx5_cmd_destroy_flow_table(struct mlx5_core_dev *dev,
+				       struct mlx5_flow_table *ft)
 {
 	u32 in[MLX5_ST_SZ_DW(destroy_flow_table_in)];
 	u32 out[MLX5_ST_SZ_DW(destroy_flow_table_out)];
@@ -119,9 +119,9 @@ int mlx5_cmd_destroy_flow_table(struct mlx5_core_dev *dev,
 					  sizeof(out));
 }
 
-int mlx5_cmd_modify_flow_table(struct mlx5_core_dev *dev,
-			       struct mlx5_flow_table *ft,
-			       struct mlx5_flow_table *next_ft)
+static int mlx5_cmd_modify_flow_table(struct mlx5_core_dev *dev,
+				      struct mlx5_flow_table *ft,
+				      struct mlx5_flow_table *next_ft)
 {
 	u32 in[MLX5_ST_SZ_DW(modify_flow_table_in)];
 	u32 out[MLX5_ST_SZ_DW(modify_flow_table_out)];
@@ -150,10 +150,10 @@ int mlx5_cmd_modify_flow_table(struct mlx5_core_dev *dev,
 					  sizeof(out));
 }
 
-int mlx5_cmd_create_flow_group(struct mlx5_core_dev *dev,
-			       struct mlx5_flow_table *ft,
-			       u32 *in,
-			       unsigned int *group_id)
+static int mlx5_cmd_create_flow_group(struct mlx5_core_dev *dev,
+				      struct mlx5_flow_table *ft,
+				      u32 *in,
+				      unsigned int *group_id)
 {
 	int inlen = MLX5_ST_SZ_BYTES(create_flow_group_in);
 	u32 out[MLX5_ST_SZ_DW(create_flow_group_out)];
@@ -180,9 +180,9 @@ int mlx5_cmd_create_flow_group(struct mlx5_core_dev *dev,
 	return err;
 }
 
-int mlx5_cmd_destroy_flow_group(struct mlx5_core_dev *dev,
-				struct mlx5_flow_table *ft,
-				unsigned int group_id)
+static int mlx5_cmd_destroy_flow_group(struct mlx5_core_dev *dev,
+				       struct mlx5_flow_table *ft,
+				       unsigned int group_id)
 {
 	u32 out[MLX5_ST_SZ_DW(destroy_flow_group_out)];
 	u32 in[MLX5_ST_SZ_DW(destroy_flow_group_in)];
@@ -298,19 +298,19 @@ static int mlx5_cmd_set_fte(struct mlx5_core_dev *dev,
 	return err;
 }
 
-int mlx5_cmd_create_fte(struct mlx5_core_dev *dev,
-			struct mlx5_flow_table *ft,
-			unsigned group_id,
-			struct fs_fte *fte)
+static int mlx5_cmd_create_fte(struct mlx5_core_dev *dev,
+			       struct mlx5_flow_table *ft,
+			       unsigned int group_id,
+			       struct fs_fte *fte)
 {
 	return	mlx5_cmd_set_fte(dev, 0, 0, ft, group_id, fte);
 }
 
-int mlx5_cmd_update_fte(struct mlx5_core_dev *dev,
-			struct mlx5_flow_table *ft,
-			unsigned group_id,
-			int modify_mask,
-			struct fs_fte *fte)
+static int mlx5_cmd_update_fte(struct mlx5_core_dev *dev,
+			       struct mlx5_flow_table *ft,
+			       unsigned int group_id,
+			       int modify_mask,
+			       struct fs_fte *fte)
 {
 	int opmod;
 	int atomic_mod_cap = MLX5_CAP_FLOWTABLE(dev,
@@ -323,9 +323,9 @@ int mlx5_cmd_update_fte(struct mlx5_core_dev *dev,
 	return	mlx5_cmd_set_fte(dev, opmod, modify_mask, ft, group_id, fte);
 }
 
-int mlx5_cmd_delete_fte(struct mlx5_core_dev *dev,
-			struct mlx5_flow_table *ft,
-			unsigned int index)
+static int mlx5_cmd_delete_fte(struct mlx5_core_dev *dev,
+			       struct mlx5_flow_table *ft,
+			       unsigned int index)
 {
 	u32 out[MLX5_ST_SZ_DW(delete_fte_out)];
 	u32 in[MLX5_ST_SZ_DW(delete_fte_in)];
@@ -413,3 +413,20 @@ int mlx5_cmd_fc_query(struct mlx5_core_dev *dev, u16 id,
 
 	return 0;
 }
+
+static const struct steering_cmds steering_cmds = {
+	.update_root_ft		= mlx5_cmd_update_root_ft,
+	.create_flow_table	= mlx5_cmd_create_flow_table,
+	.destroy_flow_table	= mlx5_cmd_destroy_flow_table,
+	.modify_flow_table	= mlx5_cmd_modify_flow_table,
+	.create_flow_group	= mlx5_cmd_create_flow_group,
+	.destroy_flow_group	= mlx5_cmd_destroy_flow_group,
+	.create_fte		= mlx5_cmd_create_fte,
+	.update_fte		= mlx5_cmd_update_fte,
+	.delete_fte		= mlx5_cmd_delete_fte,
+};
+
+const struct steering_cmds *mlx5_get_phys_fs_cmds(void)
+{
+	return &steering_cmds;
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h
index fc4f7b8..b3f16ca 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h
@@ -33,44 +33,40 @@
 #ifndef _MLX5_FS_CMD_
 #define _MLX5_FS_CMD_
 
-int mlx5_cmd_create_flow_table(struct mlx5_core_dev *dev,
-			       u16 vport,
-			       enum fs_flow_table_type type, unsigned int level,
-			       unsigned int log_size, struct mlx5_flow_table
-			       *next_ft, unsigned int *table_id);
+struct steering_cmds {
+	int (*update_root_ft)(struct mlx5_core_dev *dev,
+			      struct mlx5_flow_table *ft);
+	int (*create_flow_table)(struct mlx5_core_dev *dev, u16 vport,
+				 enum fs_flow_table_type type, unsigned int level,
+				 unsigned int log_size, struct mlx5_flow_table
+				 *next_ft, unsigned int *table_id);
+	int (*destroy_flow_table)(struct mlx5_core_dev *dev,
+				  struct mlx5_flow_table *ft);
+	int (*modify_flow_table)(struct mlx5_core_dev *dev,
+				 struct mlx5_flow_table *ft,
+				 struct mlx5_flow_table *next_ft);
+	int (*create_flow_group)(struct mlx5_core_dev *dev,
+				 struct mlx5_flow_table *ft,
+				 u32 *in,
+				 unsigned int *group_id);
+	int (*destroy_flow_group)(struct mlx5_core_dev *dev,
+				  struct mlx5_flow_table *ft,
+				  unsigned int group_id);
+	int (*create_fte)(struct mlx5_core_dev *dev,
+			  struct mlx5_flow_table *ft,
+			  unsigned int group_id,
+			  struct fs_fte *fte);
+	int (*update_fte)(struct mlx5_core_dev *dev,
+			  struct mlx5_flow_table *ft,
+			  unsigned int group_id,
+			  int modify_mask,
+			  struct fs_fte *fte);
+	int (*delete_fte)(struct mlx5_core_dev *dev,
+			  struct mlx5_flow_table *ft,
+			  unsigned int index);
+};
 
-int mlx5_cmd_destroy_flow_table(struct mlx5_core_dev *dev,
-				struct mlx5_flow_table *ft);
-
-int mlx5_cmd_modify_flow_table(struct mlx5_core_dev *dev,
-			       struct mlx5_flow_table *ft,
-			       struct mlx5_flow_table *next_ft);
-
-int mlx5_cmd_create_flow_group(struct mlx5_core_dev *dev,
-			       struct mlx5_flow_table *ft,
-			       u32 *in, unsigned int *group_id);
-
-int mlx5_cmd_destroy_flow_group(struct mlx5_core_dev *dev,
-				struct mlx5_flow_table *ft,
-				unsigned int group_id);
-
-int mlx5_cmd_create_fte(struct mlx5_core_dev *dev,
-			struct mlx5_flow_table *ft,
-			unsigned group_id,
-			struct fs_fte *fte);
-
-int mlx5_cmd_update_fte(struct mlx5_core_dev *dev,
-			struct mlx5_flow_table *ft,
-			unsigned group_id,
-			int modify_mask,
-			struct fs_fte *fte);
-
-int mlx5_cmd_delete_fte(struct mlx5_core_dev *dev,
-			struct mlx5_flow_table *ft,
-			unsigned int index);
-
-int mlx5_cmd_update_root_ft(struct mlx5_core_dev *dev,
-			    struct mlx5_flow_table *ft);
+const struct steering_cmds *mlx5_get_phys_fs_cmds(void);
 
 int mlx5_cmd_fc_alloc(struct mlx5_core_dev *dev, u16 *id);
 int mlx5_cmd_fc_free(struct mlx5_core_dev *dev, u16 id);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index 6ef7b99..e762a9c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -346,6 +346,7 @@ static bool compare_match_criteria(u8 match_criteria_enable1,
 
 static void del_flow_table(struct fs_node *node)
 {
+	struct mlx5_flow_root_namespace *root = find_root(node);
 	struct mlx5_flow_table *ft;
 	struct mlx5_core_dev *dev;
 	struct fs_prio *prio;
@@ -354,7 +355,7 @@ static void del_flow_table(struct fs_node *node)
 	fs_get_obj(ft, node);
 	dev = get_dev(&ft->node);
 
-	err = mlx5_cmd_destroy_flow_table(dev, ft);
+	err = root->cmds->destroy_flow_table(dev, ft);
 	if (err)
 		pr_warn("flow steering can't destroy ft\n");
 	fs_get_obj(prio, ft->node.parent);
@@ -363,6 +364,7 @@ static void del_flow_table(struct fs_node *node)
 
 static void del_rule(struct fs_node *node)
 {
+	struct mlx5_flow_root_namespace *root = find_root(node);
 	struct rule_client_data *priv_data;
 	struct rule_client_data *tmp;
 	struct mlx5_flow_rule *rule;
@@ -401,10 +403,8 @@ static void del_rule(struct fs_node *node)
 	if ((fte->action & MLX5_FLOW_CONTEXT_ACTION_FWD_DEST) &&
 	    --fte->dests_size) {
 		modify_mask = BIT(MLX5_SET_FTE_MODIFY_ENABLE_MASK_DESTINATION_LIST),
-		err = mlx5_cmd_update_fte(dev, ft,
-					  fg->id,
-					  modify_mask,
-					  fte);
+		err = root->cmds->update_fte(dev, ft, fg->id,
+					     modify_mask, fte);
 		if (err)
 			pr_warn("%s can't del rule fg id=%d fte_index=%d\n",
 				__func__, fg->id, fte->index);
@@ -414,6 +414,7 @@ static void del_rule(struct fs_node *node)
 
 static void del_fte(struct fs_node *node)
 {
+	struct mlx5_flow_root_namespace *root = find_root(node);
 	struct mlx5_flow_table *ft;
 	struct mlx5_flow_group *fg;
 	struct mlx5_core_dev *dev;
@@ -425,8 +426,8 @@ static void del_fte(struct fs_node *node)
 	fs_get_obj(ft, fg->node.parent);
 
 	dev = get_dev(&ft->node);
-	err = mlx5_cmd_delete_fte(dev, ft,
-				  fte->index);
+	err = root->cmds->delete_fte(dev, ft,
+				     fte->index);
 	if (err)
 		pr_warn("flow steering can't delete fte in index %d of flow group id %d\n",
 			fte->index, fg->id);
@@ -437,6 +438,7 @@ static void del_fte(struct fs_node *node)
 
 static void del_flow_group(struct fs_node *node)
 {
+	struct mlx5_flow_root_namespace *root = find_root(node);
 	struct mlx5_flow_group *fg;
 	struct mlx5_flow_table *ft;
 	struct mlx5_core_dev *dev;
@@ -445,7 +447,7 @@ static void del_flow_group(struct fs_node *node)
 	fs_get_obj(ft, fg->node.parent);
 	dev = get_dev(&ft->node);
 
-	if (mlx5_cmd_destroy_flow_group(dev, ft, fg->id))
+	if (root->cmds->destroy_flow_group(dev, ft, fg->id))
 		pr_warn("flow steering can't destroy fg %d of ft %d\n",
 			fg->id, ft->id);
 }
@@ -584,15 +586,16 @@ static int connect_fts_in_prio(struct mlx5_core_dev *dev,
 			       struct fs_prio *prio,
 			       struct mlx5_flow_table *ft)
 {
+	struct mlx5_flow_root_namespace *root = find_root(&prio->node);
 	struct mlx5_flow_table *iter;
 	int i = 0;
 	int err;
 
 	fs_for_each_ft(iter, prio) {
 		i++;
-		err = mlx5_cmd_modify_flow_table(dev,
-						 iter,
-						 ft);
+		err = root->cmds->modify_flow_table(dev,
+						    iter,
+						    ft);
 		if (err) {
 			mlx5_core_warn(dev, "Failed to modify flow table %d\n",
 				       iter->id);
@@ -635,7 +638,7 @@ static int update_root_ft_create(struct mlx5_flow_table *ft, struct fs_prio
 	if (ft->level >= min_level)
 		return 0;
 
-	err = mlx5_cmd_update_root_ft(root->dev, ft);
+	err = root->cmds->update_root_ft(root->dev, ft);
 	if (err)
 		mlx5_core_warn(root->dev, "Update root flow table of id=%u failed\n",
 			       ft->id);
@@ -648,6 +651,8 @@ static int update_root_ft_create(struct mlx5_flow_table *ft, struct fs_prio
 int mlx5_modify_rule_destination(struct mlx5_flow_rule *rule,
 				 struct mlx5_flow_destination *dest)
 {
+	struct mlx5_flow_root_namespace *root =
+		find_root(&rule->node);
 	struct mlx5_flow_table *ft;
 	struct mlx5_flow_group *fg;
 	struct fs_fte *fte;
@@ -662,10 +667,10 @@ int mlx5_modify_rule_destination(struct mlx5_flow_rule *rule,
 	fs_get_obj(ft, fg->node.parent);
 
 	memcpy(&rule->dest_attr, dest, sizeof(*dest));
-	err = mlx5_cmd_update_fte(get_dev(&ft->node),
-				  ft, fg->id,
-				  modify_mask,
-				  fte);
+	err = root->cmds->update_fte(get_dev(&ft->node),
+				     ft, fg->id,
+				     modify_mask,
+				     fte);
 	unlock_ref_node(&fte->node);
 
 	return err;
@@ -783,8 +788,8 @@ static struct mlx5_flow_table *__mlx5_create_flow_table(struct mlx5_flow_namespa
 	tree_init_node(&ft->node, 1, del_flow_table);
 	log_table_sz = ilog2(ft->max_fte);
 	next_ft = find_next_chained_ft(fs_prio);
-	err = mlx5_cmd_create_flow_table(root->dev, ft->vport, ft->type, ft->level,
-					 log_table_sz, next_ft, &ft->id);
+	err = root->cmds->create_flow_table(root->dev, ft->vport, ft->type, ft->level,
+					    log_table_sz, next_ft, &ft->id);
 	if (err)
 		goto free_ft;
 
@@ -799,7 +804,7 @@ static struct mlx5_flow_table *__mlx5_create_flow_table(struct mlx5_flow_namespa
 	mutex_unlock(&root->chain_lock);
 	return ft;
 destroy_ft:
-	mlx5_cmd_destroy_flow_table(root->dev, ft);
+	root->cmds->destroy_flow_table(root->dev, ft);
 free_ft:
 	kfree(ft);
 unlock_root:
@@ -850,6 +855,7 @@ static struct mlx5_flow_group *create_flow_group_common(struct mlx5_flow_table *
 							*prev_fg,
 							bool is_auto_fg)
 {
+	struct mlx5_flow_root_namespace *root = find_root(&ft->node);
 	struct mlx5_flow_group *fg;
 	struct mlx5_core_dev *dev = get_dev(&ft->node);
 	int err;
@@ -861,7 +867,7 @@ static struct mlx5_flow_group *create_flow_group_common(struct mlx5_flow_table *
 	if (IS_ERR(fg))
 		return fg;
 
-	err = mlx5_cmd_create_flow_group(dev, ft, fg_in, &fg->id);
+	err = root->cmds->create_flow_group(dev, ft, fg_in, &fg->id);
 	if (err) {
 		kfree(fg);
 		return ERR_PTR(err);
@@ -917,6 +923,7 @@ static struct mlx5_flow_rule *add_rule_fte(struct fs_fte *fte,
 					   struct mlx5_flow_group *fg,
 					   struct mlx5_flow_destination *dest)
 {
+	struct mlx5_flow_root_namespace *root = find_root(&fg->node);
 	struct mlx5_flow_table *ft;
 	struct mlx5_flow_rule *rule;
 	int modify_mask = 0;
@@ -944,11 +951,11 @@ static struct mlx5_flow_rule *add_rule_fte(struct fs_fte *fte,
 	}
 
 	if (fte->dests_size == 1 || !dest)
-		err = mlx5_cmd_create_fte(get_dev(&ft->node),
-					  ft, fg->id, fte);
+		err = root->cmds->create_fte(get_dev(&ft->node),
+					     ft, fg->id, fte);
 	else
-		err = mlx5_cmd_update_fte(get_dev(&ft->node),
-					  ft, fg->id, modify_mask, fte);
+		err = root->cmds->update_fte(get_dev(&ft->node),
+					     ft, fg->id, modify_mask, fte);
 	if (err)
 		goto free_rule;
 
@@ -1390,7 +1397,7 @@ static int update_root_ft_destroy(struct mlx5_flow_table *ft)
 
 	new_root_ft = find_next_ft(ft);
 	if (new_root_ft) {
-		int err = mlx5_cmd_update_root_ft(root->dev, new_root_ft);
+		int err = root->cmds->update_root_ft(root->dev, new_root_ft);
 
 		if (err) {
 			mlx5_core_warn(root->dev, "Update root flow table of id=%u failed\n",
@@ -1662,7 +1669,8 @@ static int init_root_tree(struct mlx5_flow_steering *steering,
 
 static struct mlx5_flow_root_namespace *create_root_ns(struct mlx5_flow_steering *steering,
 						       enum fs_flow_table_type
-						       table_type)
+						       table_type,
+						       const struct steering_cmds *cmds)
 {
 	struct mlx5_flow_root_namespace *root_ns;
 	struct mlx5_flow_namespace *ns;
@@ -1674,6 +1682,7 @@ static struct mlx5_flow_root_namespace *create_root_ns(struct mlx5_flow_steering
 
 	root_ns->dev = steering->dev;
 	root_ns->table_type = table_type;
+	root_ns->cmds = cmds;
 
 	ns = &root_ns->ns;
 	fs_init_namespace(ns);
@@ -1746,7 +1755,8 @@ static int create_anchor_flow_table(struct mlx5_flow_steering *steering)
 static int init_root_ns(struct mlx5_flow_steering *steering)
 {
 
-	steering->root_ns = create_root_ns(steering, FS_FT_NIC_RX);
+	steering->root_ns = create_root_ns(steering, FS_FT_NIC_RX,
+					   mlx5_get_phys_fs_cmds());
 	if (IS_ERR_OR_NULL(steering->root_ns))
 		goto cleanup;
 
@@ -1804,7 +1814,8 @@ static int init_fdb_root_ns(struct mlx5_flow_steering *steering)
 {
 	struct fs_prio *prio;
 
-	steering->fdb_root_ns = create_root_ns(steering, FS_FT_FDB);
+	steering->fdb_root_ns = create_root_ns(steering, FS_FT_FDB,
+					       mlx5_get_phys_fs_cmds());
 	if (!steering->fdb_root_ns)
 		return -ENOMEM;
 
@@ -1823,7 +1834,8 @@ static int init_ingress_acl_root_ns(struct mlx5_flow_steering *steering)
 {
 	struct fs_prio *prio;
 
-	steering->esw_egress_root_ns = create_root_ns(steering, FS_FT_ESW_EGRESS_ACL);
+	steering->esw_egress_root_ns = create_root_ns(steering, FS_FT_ESW_EGRESS_ACL,
+						      mlx5_get_phys_fs_cmds());
 	if (!steering->esw_egress_root_ns)
 		return -ENOMEM;
 
@@ -1840,7 +1852,8 @@ static int init_egress_acl_root_ns(struct mlx5_flow_steering *steering)
 {
 	struct fs_prio *prio;
 
-	steering->esw_ingress_root_ns = create_root_ns(steering, FS_FT_ESW_INGRESS_ACL);
+	steering->esw_ingress_root_ns = create_root_ns(steering, FS_FT_ESW_INGRESS_ACL,
+						       mlx5_get_phys_fs_cmds());
 	if (!steering->esw_ingress_root_ns)
 		return -ENOMEM;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
index dc08742..1d963fd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
@@ -186,6 +186,7 @@ struct mlx5_flow_root_namespace {
 	struct mlx5_flow_table		*root_ft;
 	/* Should be held when chaining flow tables */
 	struct mutex			chain_lock;
+	const struct steering_cmds	*cmds;
 };
 
 int mlx5_init_fc_stats(struct mlx5_core_dev *dev);
-- 
2.8.0

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH net-next 07/18] net/mlx5: Introduce nop steering commands
  2016-06-17 14:43 [PATCH net-next 00/18] mlx5 RoCE/RDMA packet sniffer Saeed Mahameed
                   ` (5 preceding siblings ...)
  2016-06-17 14:43 ` [PATCH net-next 06/18] net/mlx5: Introduce table of function pointer steering commands Saeed Mahameed
@ 2016-06-17 14:43 ` Saeed Mahameed
  2016-06-17 14:43 ` [PATCH net-next 08/18] if_ether.h: Add RoCE Ethertype Saeed Mahameed
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Saeed Mahameed @ 2016-06-17 14:43 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Doug Ledford, Or Gerlitz, Maor Gottlieb, Huy Nguyen,
	Tal Alon, Saeed Mahameed

From: Maor Gottlieb <maorg@mellanox.com>

All the nodes in the virtual namespaces exist only in software.
E.g. when we add a flow rule to a virtual namespace with
mlx5_add_flow_rule, the SET_FLOW_TABLE_ENTRY command isn't issued,
so the rule is not added to the hardware flow tables.

This virtual namespace exists merely to describe the RoCE steering
namespace.
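
A sketch of the effect, assuming the virtual RoCE namespace added
later in this series:

    ns = mlx5_get_flow_namespace(dev, MLX5_FLOW_NAMESPACE_ROCE);
    ft = mlx5_create_auto_grouped_flow_table(ns, 0, 4, 1, 0);
    /* succeeds, but the virtual command table's create_flow_table()
     * just returns 0 - no CREATE_FLOW_TABLE command reaches firmware
     */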

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c | 82 ++++++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h |  1 +
 2 files changed, 83 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
index c3eecff..d6ea59d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
@@ -410,7 +410,72 @@ int mlx5_cmd_fc_query(struct mlx5_core_dev *dev, u16 id,
 	stats = MLX5_ADDR_OF(query_flow_counter_out, out, flow_statistics);
 	*packets = MLX5_GET64(traffic_counter, stats, packets);
 	*bytes = MLX5_GET64(traffic_counter, stats, octets);
+	return 0;
+}
+
+static int mlx5_cmd_virt_update_root_ft(struct mlx5_core_dev *dev,
+					struct mlx5_flow_table *ft)
+{
+	return 0;
+}
+
+static int mlx5_cmd_virt_create_flow_table(struct mlx5_core_dev *dev, u16 vport,
+					   enum fs_flow_table_type type, unsigned int level,
+					   unsigned int log_size, struct mlx5_flow_table
+					   *next_ft, unsigned int *table_id)
+{
+	return 0;
+}
+
+static int mlx5_cmd_virt_destroy_flow_table(struct mlx5_core_dev *dev,
+					    struct mlx5_flow_table *ft)
+{
+	return 0;
+}
+
+static int mlx5_cmd_virt_modify_flow_table(struct mlx5_core_dev *dev,
+					   struct mlx5_flow_table *ft,
+					   struct mlx5_flow_table *next_ft)
+{
+	return -EOPNOTSUPP;
+}
+
+static int mlx5_cmd_virt_create_flow_group(struct mlx5_core_dev *dev,
+					   struct mlx5_flow_table *ft,
+					   u32 *in,
+					   unsigned int *group_id)
+{
+	return 0;
+}
+
+static int mlx5_cmd_virt_destroy_flow_group(struct mlx5_core_dev *dev,
+					    struct mlx5_flow_table *ft,
+					    unsigned int group_id)
+{
+	return 0;
+}
 
+static int mlx5_cmd_virt_create_fte(struct mlx5_core_dev *dev,
+				    struct mlx5_flow_table *ft,
+				    unsigned int group_id,
+				    struct fs_fte *fte)
+{
+	return 0;
+}
+
+static int mlx5_cmd_virt_update_fte(struct mlx5_core_dev *dev,
+				    struct mlx5_flow_table *ft,
+				    unsigned int group_id,
+				    int modify_mask,
+				    struct fs_fte *fte)
+{
+	return -EOPNOTSUPP;
+}
+
+static int mlx5_cmd_virt_delete_fte(struct mlx5_core_dev *dev,
+				    struct mlx5_flow_table *ft,
+				    unsigned int index)
+{
 	return 0;
 }
 
@@ -426,7 +491,24 @@ static const struct steering_cmds steering_cmds = {
 	.delete_fte		= mlx5_cmd_delete_fte,
 };
 
+static const struct steering_cmds steering_virt_cmds = {
+	.update_root_ft		= mlx5_cmd_virt_update_root_ft,
+	.create_flow_table	= mlx5_cmd_virt_create_flow_table,
+	.destroy_flow_table	= mlx5_cmd_virt_destroy_flow_table,
+	.modify_flow_table	= mlx5_cmd_virt_modify_flow_table,
+	.create_flow_group	= mlx5_cmd_virt_create_flow_group,
+	.destroy_flow_group	= mlx5_cmd_virt_destroy_flow_group,
+	.create_fte		= mlx5_cmd_virt_create_fte,
+	.update_fte		= mlx5_cmd_virt_update_fte,
+	.delete_fte		= mlx5_cmd_virt_delete_fte,
+};
+
 const struct steering_cmds *mlx5_get_phys_fs_cmds(void)
 {
 	return &steering_cmds;
 }
+
+const struct steering_cmds *mlx5_get_virt_fs_cmds(void)
+{
+	return &steering_virt_cmds;
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h
index b3f16ca..6896c5f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.h
@@ -67,6 +67,7 @@ struct steering_cmds {
 };
 
 const struct steering_cmds *mlx5_get_phys_fs_cmds(void);
+const struct steering_cmds *mlx5_get_virt_fs_cmds(void);
 
 int mlx5_cmd_fc_alloc(struct mlx5_core_dev *dev, u16 *id);
 int mlx5_cmd_fc_free(struct mlx5_core_dev *dev, u16 id);
-- 
2.8.0

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH net-next 08/18] if_ether.h: Add RoCE Ethertype
  2016-06-17 14:43 [PATCH net-next 00/18] mlx5 RoCE/RDMA packet sniffer Saeed Mahameed
                   ` (6 preceding siblings ...)
  2016-06-17 14:43 ` [PATCH net-next 07/18] net/mlx5: Introduce nop " Saeed Mahameed
@ 2016-06-17 14:43 ` Saeed Mahameed
  2016-06-17 14:43 ` [PATCH net-next 09/18] IB/mlx5: Create RoCE root namespace Saeed Mahameed
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Saeed Mahameed @ 2016-06-17 14:43 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Doug Ledford, Or Gerlitz, Maor Gottlieb, Huy Nguyen,
	Tal Alon, Saeed Mahameed

From: Maor Gottlieb <maorg@mellanox.com>

Add the Ethertype for RoCE - RDMA over Converged Ethernet.
Refactor the vendor implementations to use this define.

RoCE was standardized by the IBTA in InfiniBand Architecture Annex A16.
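
Usage follows the standard ethertype pattern, e.g. this sketch (the
handler is hypothetical):

    struct ethhdr *eth = eth_hdr(skb);

    if (eth->h_proto == htons(ETH_P_ROCE))
            handle_rocev1_frame(skb);       /* hypothetical */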

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/infiniband/hw/mlx4/qp.c                    | 6 +-----
 drivers/infiniband/hw/ocrdma/ocrdma_ah.c           | 4 ++--
 drivers/infiniband/hw/ocrdma/ocrdma_hw.c           | 2 +-
 drivers/infiniband/hw/ocrdma/ocrdma_sli.h          | 4 ----
 drivers/infiniband/hw/usnic/usnic_common_pkt_hdr.h | 1 -
 drivers/infiniband/hw/usnic/usnic_fwd.h            | 2 +-
 include/uapi/linux/if_ether.h                      | 1 +
 7 files changed, 6 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 81b0e1f..fd03cef 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -76,10 +76,6 @@ enum {
 	MLX4_IB_LSO_HEADER_SPARE	= 128,
 };
 
-enum {
-	MLX4_IB_IBOE_ETHERTYPE		= 0x8915
-};
-
 struct mlx4_ib_sqp {
 	struct mlx4_ib_qp	qp;
 	int			pkey_index;
@@ -2563,7 +2559,7 @@ static int build_mlx_header(struct mlx4_ib_sqp *sqp, struct ib_ud_wr *wr,
 		u16 ether_type;
 		u16 pcp = (be32_to_cpu(ah->av.ib.sl_tclass_flowlabel) >> 29) << 13;
 
-		ether_type = (!is_udp) ? MLX4_IB_IBOE_ETHERTYPE :
+		ether_type = (!is_udp) ? ETH_P_ROCE :
 			(ip_version == 4 ? ETH_P_IP : ETH_P_IPV6);
 
 		mlx->sched_prio = cpu_to_be16(pcp);
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_ah.c b/drivers/infiniband/hw/ocrdma/ocrdma_ah.c
index 797362a..3d5d841 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_ah.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_ah.c
@@ -59,7 +59,7 @@ static u16 ocrdma_hdr_type_to_proto_num(int devid, u8 hdr_type)
 {
 	switch (hdr_type) {
 	case OCRDMA_L3_TYPE_IB_GRH:
-		return (u16)0x8915;
+		return (u16)ETH_P_ROCE;
 	case OCRDMA_L3_TYPE_IPV4:
 		return (u16)0x0800;
 	case OCRDMA_L3_TYPE_IPV6:
@@ -94,7 +94,7 @@ static inline int set_av_attr(struct ocrdma_dev *dev, struct ocrdma_ah *ah,
 	proto_num = ocrdma_hdr_type_to_proto_num(dev->id, ah->hdr_type);
 	if (!proto_num)
 		return -EINVAL;
-	nxthdr = (proto_num == 0x8915) ? 0x1b : 0x11;
+	nxthdr = (proto_num == ETH_P_ROCE) ? 0x1b : 0x11;
 	/* VLAN */
 	if (!vlan_tag || (vlan_tag > 0xFFF))
 		vlan_tag = dev->pvid;
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_hw.c b/drivers/infiniband/hw/ocrdma/ocrdma_hw.c
index 16740dc..d45d1b4 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_hw.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_hw.c
@@ -2985,7 +2985,7 @@ static int ocrdma_parse_dcbxcfg_rsp(struct ocrdma_dev *dev, int ptype,
 				OCRDMA_APP_PARAM_APP_PROTO_MASK;
 
 		if (
-			valid && proto == OCRDMA_APP_PROTO_ROCE &&
+			valid && proto == ETH_P_ROCE &&
 			proto_sel == OCRDMA_PROTO_SELECT_L2) {
 			for (slindx = 0; slindx <
 				OCRDMA_MAX_SERVICE_LEVEL_INDEX; slindx++) {
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_sli.h b/drivers/infiniband/hw/ocrdma/ocrdma_sli.h
index 0efc966..1751a96 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_sli.h
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_sli.h
@@ -2175,10 +2175,6 @@ enum OCRDMA_DCBX_PARAM_TYPE {
 	OCRDMA_PARAMETER_TYPE_PEER	= 0x02
 };
 
-enum OCRDMA_DCBX_APP_PROTO {
-	OCRDMA_APP_PROTO_ROCE	= 0x8915
-};
-
 enum OCRDMA_DCBX_PROTO {
 	OCRDMA_PROTO_SELECT_L2	= 0x00,
 	OCRDMA_PROTO_SELECT_L4	= 0x01
diff --git a/drivers/infiniband/hw/usnic/usnic_common_pkt_hdr.h b/drivers/infiniband/hw/usnic/usnic_common_pkt_hdr.h
index 596e0ed..bf7d197 100644
--- a/drivers/infiniband/hw/usnic/usnic_common_pkt_hdr.h
+++ b/drivers/infiniband/hw/usnic/usnic_common_pkt_hdr.h
@@ -34,7 +34,6 @@
 #ifndef USNIC_CMN_PKT_HDR_H
 #define USNIC_CMN_PKT_HDR_H
 
-#define USNIC_ROCE_ETHERTYPE		(0x8915)
 #define USNIC_ROCE_GRH_VER              (8)
 #define USNIC_PROTO_VER                 (1)
 #define USNIC_ROCE_GRH_VER_SHIFT        (4)
diff --git a/drivers/infiniband/hw/usnic/usnic_fwd.h b/drivers/infiniband/hw/usnic/usnic_fwd.h
index 3a8add9..0f7baae 100644
--- a/drivers/infiniband/hw/usnic/usnic_fwd.h
+++ b/drivers/infiniband/hw/usnic/usnic_fwd.h
@@ -97,7 +97,7 @@ static inline void usnic_fwd_init_usnic_filter(struct filter *filter,
 						uint32_t usnic_id)
 {
 	filter->type = FILTER_USNIC_ID;
-	filter->u.usnic.ethtype = USNIC_ROCE_ETHERTYPE;
+	filter->u.usnic.ethtype = ETH_P_ROCE;
 	filter->u.usnic.flags = FILTER_FIELD_USNIC_ETHTYPE |
 				FILTER_FIELD_USNIC_ID |
 				FILTER_FIELD_USNIC_PROTO;
diff --git a/include/uapi/linux/if_ether.h b/include/uapi/linux/if_ether.h
index cec849a..8e69375 100644
--- a/include/uapi/linux/if_ether.h
+++ b/include/uapi/linux/if_ether.h
@@ -91,6 +91,7 @@
 #define ETH_P_FCOE	0x8906		/* Fibre Channel over Ethernet  */
 #define ETH_P_TDLS	0x890D          /* TDLS */
 #define ETH_P_FIP	0x8914		/* FCoE Initialization Protocol */
+#define ETH_P_ROCE	0x8915		/* RDMA over Converged Ethernet */
 #define ETH_P_80221	0x8917		/* IEEE 802.21 Media Independent Handover Protocol */
 #define ETH_P_HSR	0x892F		/* IEC 62439-3 HSRv1	*/
 #define ETH_P_LOOPBACK	0x9000		/* Ethernet loopback packet, per IEEE 802.3 */
-- 
2.8.0

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH net-next 09/18] IB/mlx5: Create RoCE root namespace
  2016-06-17 14:43 [PATCH net-next 00/18] mlx5 RoCE/RDMA packet sniffer Saeed Mahameed
                   ` (7 preceding siblings ...)
  2016-06-17 14:43 ` [PATCH net-next 08/18] if_ether.h: Add RoCE Ethertype Saeed Mahameed
@ 2016-06-17 14:43 ` Saeed Mahameed
  2016-06-17 14:43 ` [PATCH net-next 10/18] net/mlx5: Introduce get flow rule match API Saeed Mahameed
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Saeed Mahameed @ 2016-06-17 14:43 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Doug Ledford, Or Gerlitz, Maor Gottlieb, Huy Nguyen,
	Tal Alon, Saeed Mahameed

From: Maor Gottlieb <maorg@mellanox.com>

Create the virtual RoCE namespace. Its flow table will be populated
with flow rules according to the RoCE state. The sniffer will
traverse those rules in order to add them to the sniffer flow table.

mlx5_ib should call mlx5_init_roce_steering when RoCE is enabled and
mlx5_cleanup_roce_steering when it is disabled.
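
The pairing, sketched as wired up in this patch (steering init
failure is tolerated, since RoCE can work without flow steering):

    /* in mlx5_enable_roce(), after the netdev notifier is set up */
    mlx5_init_roce_steering(dev);       /* best effort */

    /* in mlx5_disable_roce(), before disabling RoCE on the vport */
    mlx5_cleanup_roce_steering(dev);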

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/infiniband/hw/mlx5/main.c                 | 125 ++++++++++++++++++++++
 drivers/infiniband/hw/mlx5/mlx5_ib.h              |  15 ++-
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c |  32 ++++++
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.h |   1 +
 include/linux/mlx5/fs.h                           |   1 +
 5 files changed, 171 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 573952b..60330c9 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -2275,6 +2275,116 @@ static int mlx5_port_immutable(struct ib_device *ibdev, u8 port_num,
 	return 0;
 }
 
+static void del_roce_rules(struct mlx5_flow_roce_ns *ns)
+
+{
+	if (ns->rocev1_rule) {
+		mlx5_del_flow_rule(ns->rocev1_rule);
+		ns->rocev1_rule = NULL;
+	}
+
+	if (ns->rocev2_ipv4_rule) {
+		mlx5_del_flow_rule(ns->rocev2_ipv4_rule);
+		ns->rocev2_ipv4_rule = NULL;
+	}
+
+	if (ns->rocev2_ipv6_rule) {
+		mlx5_del_flow_rule(ns->rocev2_ipv6_rule);
+		ns->rocev2_ipv6_rule = NULL;
+	}
+}
+
+static int add_roce_rules(struct mlx5_flow_roce_ns *ns)
+{
+	struct mlx5_flow_attr flow_attr;
+	u8 match_criteria_enable;
+	int inlen = MLX5_ST_SZ_BYTES(fte_match_param);
+	u32 *mc;
+	u32 *mv;
+	int err = 0;
+
+	mv = mlx5_vzalloc(inlen);
+	mc = mlx5_vzalloc(inlen);
+	if (!mv || !mc) {
+		err = -ENOMEM;
+		goto add_roce_rules_out;
+	}
+
+	match_criteria_enable = MLX5_MATCH_OUTER_HEADERS;
+
+	MLX5_SET_TO_ONES(fte_match_param, mc, outer_headers.ethertype);
+	MLX5_SET(fte_match_param, mv, outer_headers.ethertype, ETH_P_ROCE);
+
+	MLX5_RULE_ATTR(flow_attr, match_criteria_enable, mc, mv,
+		       MLX5_FLOW_CONTEXT_ACTION_ALLOW,
+		       MLX5_FS_DEFAULT_FLOW_TAG, NULL);
+	ns->rocev1_rule = mlx5_add_flow_rule(ns->ft, &flow_attr);
+	if (IS_ERR(ns->rocev1_rule)) {
+		err = PTR_ERR(ns->rocev1_rule);
+		ns->rocev1_rule = NULL;
+		goto add_roce_rules_out;
+	}
+
+	MLX5_SET(fte_match_param, mv, outer_headers.ethertype, ETH_P_IP);
+	MLX5_SET_TO_ONES(fte_match_param, mc, outer_headers.ip_protocol);
+	MLX5_SET(fte_match_param, mv, outer_headers.ip_protocol, IPPROTO_UDP);
+	MLX5_SET_TO_ONES(fte_match_param, mc, outer_headers.udp_dport);
+	MLX5_SET(fte_match_param, mv, outer_headers.udp_dport,
+		 ROCE_V2_UDP_DPORT);
+	ns->rocev2_ipv4_rule = mlx5_add_flow_rule(ns->ft, &flow_attr);
+	if (IS_ERR(ns->rocev2_ipv4_rule)) {
+		err = PTR_ERR(ns->rocev2_ipv4_rule);
+		ns->rocev2_ipv4_rule = NULL;
+		goto add_roce_rules_out;
+	}
+
+	MLX5_SET(fte_match_param, mv, outer_headers.ethertype, ETH_P_IPV6);
+	ns->rocev2_ipv6_rule = mlx5_add_flow_rule(ns->ft, &flow_attr);
+	if (IS_ERR(ns->rocev2_ipv6_rule)) {
+		err = PTR_ERR(ns->rocev2_ipv6_rule);
+		ns->rocev2_ipv6_rule = NULL;
+		goto add_roce_rules_out;
+	}
+
+add_roce_rules_out:
+	kvfree(mc);
+	kvfree(mv);
+	if (err)
+		del_roce_rules(ns);
+	return err;
+}
+
+#define ROCE_TABLE_SIZE 3
+static int mlx5_init_roce_steering(struct mlx5_ib_dev *dev)
+{
+	struct mlx5_flow_roce_ns *roce_ns = &dev->roce.roce_ns;
+	int err;
+
+	roce_ns->ns = mlx5_get_flow_namespace(dev->mdev,
+					      MLX5_FLOW_NAMESPACE_ROCE);
+	if (!roce_ns->ns)
+		return -EINVAL;
+
+	roce_ns->ft = mlx5_create_auto_grouped_flow_table(roce_ns->ns, 0,
+							  ROCE_TABLE_SIZE, 1, 0);
+	if (IS_ERR(roce_ns->ft)) {
+		err = PTR_ERR(roce_ns->ft);
+		pr_warn("Failed to create roce flow table\n");
+		roce_ns->ft = NULL;
+		return err;
+	}
+
+	err = add_roce_rules(roce_ns);
+	if (err)
+		goto destroy_flow_table;
+
+	return 0;
+
+destroy_flow_table:
+	mlx5_destroy_flow_table(roce_ns->ft);
+	return err;
+}
+
 static int mlx5_enable_roce(struct mlx5_ib_dev *dev)
 {
 	int err;
@@ -2288,6 +2398,9 @@ static int mlx5_enable_roce(struct mlx5_ib_dev *dev)
 	if (err)
 		goto err_unregister_netdevice_notifier;
 
+	/* RoCE can be supported without flow steering*/
+	mlx5_init_roce_steering(dev);
+
 	return 0;
 
 err_unregister_netdevice_notifier:
@@ -2295,8 +2408,20 @@ err_unregister_netdevice_notifier:
 	return err;
 }
 
+static void mlx5_cleanup_roce_steering(struct mlx5_ib_dev *dev)
+{
+	struct mlx5_flow_roce_ns *roce_ns = &dev->roce.roce_ns;
+
+	if (!roce_ns->ns || !roce_ns->ft)
+		return;
+
+	del_roce_rules(roce_ns);
+	mlx5_destroy_flow_table(roce_ns->ft);
+}
+
 static void mlx5_disable_roce(struct mlx5_ib_dev *dev)
 {
+	mlx5_cleanup_roce_steering(dev);
 	mlx5_nic_vport_disable_roce(dev->mdev);
 	unregister_netdevice_notifier(&dev->roce.nb);
 }
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index c4a9825..32f65fe 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -148,6 +148,14 @@ struct mlx5_ib_flow_handler {
 	struct mlx5_flow_rule	*rule;
 };
 
+struct mlx5_flow_roce_ns {
+	struct mlx5_flow_namespace	*ns;
+	struct mlx5_flow_table		*ft;
+	struct mlx5_flow_rule		*rocev1_rule;
+	struct mlx5_flow_rule		*rocev2_ipv4_rule;
+	struct mlx5_flow_rule		*rocev2_ipv6_rule;
+};
+
 struct mlx5_ib_flow_db {
 	struct mlx5_ib_flow_prio	prios[MLX5_IB_NUM_FLOW_FT];
 	/* Protect flow steering bypass flow tables
@@ -550,9 +558,10 @@ struct mlx5_roce {
 	/* Protect mlx5_ib_get_netdev from invoking dev_hold() with a NULL
 	 * netdev pointer
 	 */
-	rwlock_t		netdev_lock;
-	struct net_device	*netdev;
-	struct notifier_block	nb;
+	rwlock_t			netdev_lock;
+	struct net_device		*netdev;
+	struct notifier_block		nb;
+	struct mlx5_flow_roce_ns	roce_ns;
 };
 
 struct mlx5_ib_dev {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index e762a9c..d60d578 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -1502,6 +1502,11 @@ struct mlx5_flow_namespace *mlx5_get_flow_namespace(struct mlx5_core_dev *dev,
 			return &steering->esw_ingress_root_ns->ns;
 		else
 			return NULL;
+	case MLX5_FLOW_NAMESPACE_ROCE:
+		if (steering->roce_root_ns)
+			return &steering->roce_root_ns->ns;
+		else
+			return NULL;
 	default:
 		return NULL;
 	}
@@ -1806,10 +1811,31 @@ void mlx5_cleanup_fs(struct mlx5_core_dev *dev)
 	cleanup_root_ns(steering->esw_egress_root_ns);
 	cleanup_root_ns(steering->esw_ingress_root_ns);
 	cleanup_root_ns(steering->fdb_root_ns);
+	cleanup_root_ns(steering->roce_root_ns);
 	mlx5_cleanup_fc_stats(dev);
 	kfree(steering);
 }
 
+static int init_roce_root_ns(struct mlx5_flow_steering *steering)
+{
+	struct fs_prio *prio;
+
+	steering->roce_root_ns = create_root_ns(steering, FS_FT_NIC_RX,
+						mlx5_get_virt_fs_cmds());
+	if (!steering->roce_root_ns)
+		return -ENOMEM;
+
+	/* Create single prio */
+	prio = fs_create_prio(&steering->roce_root_ns->ns, 0, 1);
+	if (IS_ERR(prio)) {
+		cleanup_root_ns(steering->roce_root_ns);
+		steering->roce_root_ns = NULL;
+		return PTR_ERR(prio);
+	}
+
+	return 0;
+}
+
 static int init_fdb_root_ns(struct mlx5_flow_steering *steering)
 {
 	struct fs_prio *prio;
@@ -1909,6 +1935,12 @@ int mlx5_init_fs(struct mlx5_core_dev *dev)
 		}
 	}
 
+	if (MLX5_CAP_GEN(dev, roce)) {
+		err = init_roce_root_ns(steering);
+		if (err)
+			goto err;
+	}
+
 	return 0;
 err:
 	mlx5_cleanup_fs(dev);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
index 1d963fd..f758b1e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
@@ -61,6 +61,7 @@ struct mlx5_flow_steering {
 	struct mlx5_flow_root_namespace *fdb_root_ns;
 	struct mlx5_flow_root_namespace *esw_egress_root_ns;
 	struct mlx5_flow_root_namespace *esw_ingress_root_ns;
+	struct mlx5_flow_root_namespace *roce_root_ns;
 };
 
 struct fs_node {
diff --git a/include/linux/mlx5/fs.h b/include/linux/mlx5/fs.h
index 5ac0e8f..ae82e00 100644
--- a/include/linux/mlx5/fs.h
+++ b/include/linux/mlx5/fs.h
@@ -60,6 +60,7 @@ enum mlx5_flow_namespace_type {
 	MLX5_FLOW_NAMESPACE_FDB,
 	MLX5_FLOW_NAMESPACE_ESW_EGRESS,
 	MLX5_FLOW_NAMESPACE_ESW_INGRESS,
+	MLX5_FLOW_NAMESPACE_ROCE,
 };
 
 struct mlx5_flow_table;
-- 
2.8.0

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH net-next 10/18] net/mlx5: Introduce get flow rule match API
  2016-06-17 14:43 [PATCH net-next 00/18] mlx5 RoCE/RDMA packet sniffer Saeed Mahameed
                   ` (8 preceding siblings ...)
  2016-06-17 14:43 ` [PATCH net-next 09/18] IB/mlx5: Create RoCE root namespace Saeed Mahameed
@ 2016-06-17 14:43 ` Saeed Mahameed
  2016-06-17 14:43 ` [PATCH net-next 11/18] net/mlx5: Add sniffer namespaces Saeed Mahameed
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Saeed Mahameed @ 2016-06-17 14:43 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Doug Ledford, Or Gerlitz, Maor Gottlieb, Huy Nguyen,
	Tal Alon, Saeed Mahameed

From: Maor Gottlieb <maorg@mellanox.com>

Introduce an API to get a rule's mlx5_flow_match, which contains:
1. match_criteria_enable
2. match_criteria
3. match_value
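
A usage sketch, assuming struct mlx5_flow_match carries the three
fields above:

    struct mlx5_flow_match match;
    u16 ethertype;

    mlx5_get_rule_flow_match(&match, rule);
    if (match.match_criteria_enable & MLX5_MATCH_OUTER_HEADERS)
            ethertype = MLX5_GET(fte_match_param, match.match_value,
                                 outer_headers.ethertype);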

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 19 +++++++++++++++++++
 include/linux/mlx5/fs.h                           |  3 +++
 2 files changed, 22 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index d60d578..b7ddcd2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -2077,3 +2077,22 @@ unlock:
 
 	return 0;
 }
+
+void mlx5_get_rule_flow_match(struct mlx5_flow_match *flow_match,
+			      struct mlx5_flow_rule *rule)
+{
+	struct mlx5_flow_group *fg;
+	struct fs_node *pnode;
+	struct fs_fte *fte;
+
+	pnode = rule->node.parent;
+	WARN_ON(!pnode);
+	fs_get_obj(fte, pnode);
+	pnode = pnode->parent;
+	WARN_ON(!pnode);
+	fs_get_obj(fg, pnode);
+
+	flow_match->match_value = fte->val;
+	flow_match->match_criteria = fg->mask.match_criteria;
+	flow_match->match_criteria_enable = fg->mask.match_criteria_enable;
+}
diff --git a/include/linux/mlx5/fs.h b/include/linux/mlx5/fs.h
index ae82e00..db1f06e 100644
--- a/include/linux/mlx5/fs.h
+++ b/include/linux/mlx5/fs.h
@@ -173,4 +173,7 @@ struct mlx5_event_data {
 	struct mlx5_flow_table *ft;
 	struct mlx5_flow_rule *rule;
 };
+
+void mlx5_get_rule_flow_match(struct mlx5_flow_match *flow_match,
+			      struct mlx5_flow_rule *rule);
 #endif
-- 
2.8.0

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH net-next 11/18] net/mlx5: Add sniffer namespaces
  2016-06-17 14:43 [PATCH net-next 00/18] mlx5 RoCE/RDMA packet sniffer Saeed Mahameed
                   ` (9 preceding siblings ...)
  2016-06-17 14:43 ` [PATCH net-next 10/18] net/mlx5: Introduce get flow rule match API Saeed Mahameed
@ 2016-06-17 14:43 ` Saeed Mahameed
  2016-06-17 14:43 ` [PATCH net-next 12/18] IB/mlx5: Add kernel offload flow-tag Saeed Mahameed
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Saeed Mahameed @ 2016-06-17 14:43 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Doug Ledford, Or Gerlitz, Maor Gottlieb, Huy Nguyen,
	Tal Alon, Saeed Mahameed

From: Maor Gottlieb <maorg@mellanox.com>

Sniffer RX namespace: used for catching RoCE and bypass traffic.
Sniffer TX namespace: used for catching all TX traffic.
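
Consumers fetch these like any other steering namespace, e.g.
(sketch):

    struct mlx5_flow_namespace *sniff_rx;

    sniff_rx = mlx5_get_flow_namespace(mdev,
                                       MLX5_FLOW_NAMESPACE_SNIFFER_RX);
    if (!sniff_rx)
            return -EOPNOTSUPP;     /* device lacks flow steering */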

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 56 +++++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.h |  4 ++
 include/linux/mlx5/fs.h                           |  2 +
 3 files changed, 62 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index b7ddcd2..e52da43 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -1502,6 +1502,16 @@ struct mlx5_flow_namespace *mlx5_get_flow_namespace(struct mlx5_core_dev *dev,
 			return &steering->esw_ingress_root_ns->ns;
 		else
 			return NULL;
+	case MLX5_FLOW_NAMESPACE_SNIFFER_RX:
+		if (steering->sniffer_rx_root_ns)
+			return &steering->sniffer_rx_root_ns->ns;
+		else
+			return NULL;
+	case MLX5_FLOW_NAMESPACE_SNIFFER_TX:
+		if (steering->sniffer_tx_root_ns)
+			return &steering->sniffer_tx_root_ns->ns;
+		else
+			return NULL;
 	case MLX5_FLOW_NAMESPACE_ROCE:
 		if (steering->roce_root_ns)
 			return &steering->roce_root_ns->ns;
@@ -1812,6 +1822,8 @@ void mlx5_cleanup_fs(struct mlx5_core_dev *dev)
 	cleanup_root_ns(steering->esw_ingress_root_ns);
 	cleanup_root_ns(steering->fdb_root_ns);
 	cleanup_root_ns(steering->roce_root_ns);
+	cleanup_root_ns(steering->sniffer_rx_root_ns);
+	cleanup_root_ns(steering->sniffer_tx_root_ns);
 	mlx5_cleanup_fc_stats(dev);
 	kfree(steering);
 }
@@ -1836,6 +1848,44 @@ static int init_roce_root_ns(struct mlx5_flow_steering *steering)
 	return 0;
 }
 
+static int init_sniffer_rx_root_ns(struct mlx5_flow_steering *steering)
+{
+	struct fs_prio *prio;
+
+	steering->sniffer_rx_root_ns = create_root_ns(steering, FS_FT_SNIFFER_RX,
+						      mlx5_get_phys_fs_cmds());
+	if (!steering->sniffer_rx_root_ns)
+		return -ENOMEM;
+
+	/* Create single prio */
+	prio = fs_create_prio(&steering->sniffer_rx_root_ns->ns, 0, 1);
+	if (IS_ERR(prio)) {
+		cleanup_root_ns(steering->sniffer_rx_root_ns);
+		steering->sniffer_rx_root_ns = NULL;
+		return PTR_ERR(prio);
+	}
+	return 0;
+}
+
+static int init_sniffer_tx_root_ns(struct mlx5_flow_steering *steering)
+{
+	struct fs_prio *prio;
+
+	steering->sniffer_tx_root_ns = create_root_ns(steering, FS_FT_SNIFFER_TX,
+						      mlx5_get_phys_fs_cmds());
+	if (!steering->sniffer_tx_root_ns)
+		return -ENOMEM;
+
+	/* Create single prio */
+	prio = fs_create_prio(&steering->sniffer_tx_root_ns->ns, 0, 1);
+	if (IS_ERR(prio)) {
+		cleanup_root_ns(steering->sniffer_tx_root_ns);
+		steering->sniffer_tx_root_ns = NULL;
+		return PTR_ERR(prio);
+	}
+	return 0;
+}
+
 static int init_fdb_root_ns(struct mlx5_flow_steering *steering)
 {
 	struct fs_prio *prio;
@@ -1915,6 +1965,12 @@ int mlx5_init_fs(struct mlx5_core_dev *dev)
 		err = init_root_ns(steering);
 		if (err)
 			goto err;
+		err = init_sniffer_tx_root_ns(steering);
+		if (err)
+			goto err;
+		err = init_sniffer_rx_root_ns(steering);
+		if (err)
+			goto err;
 	}
 
 	if (MLX5_CAP_GEN(dev, eswitch_flow_table)) {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
index f758b1e..a869302 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
@@ -49,6 +49,8 @@ enum fs_flow_table_type {
 	FS_FT_ESW_EGRESS_ACL  = 0x2,
 	FS_FT_ESW_INGRESS_ACL = 0x3,
 	FS_FT_FDB             = 0X4,
+	FS_FT_SNIFFER_RX	= 0X5,
+	FS_FT_SNIFFER_TX	= 0X6,
 };
 
 enum fs_fte_status {
@@ -62,6 +64,8 @@ struct mlx5_flow_steering {
 	struct mlx5_flow_root_namespace *esw_egress_root_ns;
 	struct mlx5_flow_root_namespace *esw_ingress_root_ns;
 	struct mlx5_flow_root_namespace *roce_root_ns;
+	struct mlx5_flow_root_namespace	*sniffer_tx_root_ns;
+	struct mlx5_flow_root_namespace	*sniffer_rx_root_ns;
 };
 
 struct fs_node {
diff --git a/include/linux/mlx5/fs.h b/include/linux/mlx5/fs.h
index db1f06e..f3715eb 100644
--- a/include/linux/mlx5/fs.h
+++ b/include/linux/mlx5/fs.h
@@ -60,6 +60,8 @@ enum mlx5_flow_namespace_type {
 	MLX5_FLOW_NAMESPACE_FDB,
 	MLX5_FLOW_NAMESPACE_ESW_EGRESS,
 	MLX5_FLOW_NAMESPACE_ESW_INGRESS,
+	MLX5_FLOW_NAMESPACE_SNIFFER_RX,
+	MLX5_FLOW_NAMESPACE_SNIFFER_TX,
 	MLX5_FLOW_NAMESPACE_ROCE,
 };
 
-- 
2.8.0

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH net-next 12/18] IB/mlx5: Add kernel offload flow-tag
  2016-06-17 14:43 [PATCH net-next 00/18] mlx5 RoCE/RDMA packet sniffer Saeed Mahameed
                   ` (10 preceding siblings ...)
  2016-06-17 14:43 ` [PATCH net-next 11/18] net/mlx5: Add sniffer namespaces Saeed Mahameed
@ 2016-06-17 14:43 ` Saeed Mahameed
  2016-06-17 16:00   ` Alexei Starovoitov
  2016-06-17 14:43 ` [PATCH net-next 13/18] net: Add offload kernel net stack packet type Saeed Mahameed
                   ` (6 subsequent siblings)
  18 siblings, 1 reply; 33+ messages in thread
From: Saeed Mahameed @ 2016-06-17 14:43 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Doug Ledford, Or Gerlitz, Maor Gottlieb, Huy Nguyen,
	Tal Alon, Saeed Mahameed

From: Maor Gottlieb <maorg@mellanox.com>

Add a kernel offload flow tag for packets that will bypass the
kernel stack, e.g. RoCE/RDMA/RAW ETH (DPDK) traffic.

User leftover FTEs are shared with the sniffer; therefore leftover
rules should be added with the bypass flow tag.
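
A sketch of how a receive path could use the tag to classify
completions (the CQE accessor and the helper below are hypothetical):

    if (cqe_flow_tag(cqe) == MLX5_FS_OFFLOAD_FLOW_TAG) {
            /* packet matched a kernel-bypass (offload) rule */
            mark_skb_as_offload(skb);       /* hypothetical */
    }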

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/infiniband/hw/mlx5/main.c | 10 ++++++++--
 include/linux/mlx5/fs.h           |  1 +
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 60330c9..5af7c5f 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -1534,6 +1534,7 @@ static struct mlx5_ib_flow_handler *create_flow_rule(struct mlx5_ib_dev *dev,
 	unsigned int spec_index;
 	u32 *match_c;
 	u32 *match_v;
+	u32 flow_tag;
 	u32 action;
 	int err = 0;
 
@@ -1562,9 +1563,12 @@ static struct mlx5_ib_flow_handler *create_flow_rule(struct mlx5_ib_dev *dev,
 	match_criteria_enable = (!outer_header_zero(match_c)) << 0;
 	action = dst ? MLX5_FLOW_CONTEXT_ACTION_FWD_DEST :
 		MLX5_FLOW_CONTEXT_ACTION_FWD_NEXT_PRIO;
+	flow_tag = (flow_attr->type == IB_FLOW_ATTR_ALL_DEFAULT ||
+		    flow_attr->type == IB_FLOW_ATTR_MC_DEFAULT) ?
+		MLX5_FS_OFFLOAD_FLOW_TAG : MLX5_FS_DEFAULT_FLOW_TAG;
 
 	MLX5_RULE_ATTR(flow_rule_attr, match_criteria_enable, match_c,
-		       match_v, action, MLX5_FS_DEFAULT_FLOW_TAG, dst);
+		       match_v, action, flow_tag, dst);
 	handler->rule = mlx5_add_flow_rule(ft, &flow_rule_attr);
 
 	if (IS_ERR(handler->rule)) {
@@ -1619,12 +1623,13 @@ static struct mlx5_ib_flow_handler *create_leftovers_rule(struct mlx5_ib_dev *de
 	struct mlx5_ib_flow_handler *handler_ucast = NULL;
 	struct mlx5_ib_flow_handler *handler = NULL;
 
-	static struct {
+	struct {
 		struct ib_flow_attr	flow_attr;
 		struct ib_flow_spec_eth eth_flow;
 	} leftovers_specs[] = {
 		[LEFTOVERS_MC] = {
 			.flow_attr = {
+				.type = flow_attr->type,
 				.num_of_specs = 1,
 				.size = sizeof(leftovers_specs[0])
 			},
@@ -1637,6 +1642,7 @@ static struct mlx5_ib_flow_handler *create_leftovers_rule(struct mlx5_ib_dev *de
 		},
 		[LEFTOVERS_UC] = {
 			.flow_attr = {
+				.type = flow_attr->type,
 				.num_of_specs = 1,
 				.size = sizeof(leftovers_specs[0])
 			},
diff --git a/include/linux/mlx5/fs.h b/include/linux/mlx5/fs.h
index f3715eb..123b901 100644
--- a/include/linux/mlx5/fs.h
+++ b/include/linux/mlx5/fs.h
@@ -37,6 +37,7 @@
 #include <linux/mlx5/mlx5_ifc.h>
 
 #define MLX5_FS_DEFAULT_FLOW_TAG 0x0
+#define MLX5_FS_OFFLOAD_FLOW_TAG 0x800000
 
 enum {
 	MLX5_FLOW_CONTEXT_ACTION_FWD_NEXT_PRIO	= 1 << 16,
-- 
2.8.0

* [PATCH net-next 13/18] net: Add offload kernel net stack packet type
  2016-06-17 14:43 [PATCH net-next 00/18] mlx5 RoCE/RDMA packet sniffer Saeed Mahameed
                   ` (11 preceding siblings ...)
  2016-06-17 14:43 ` [PATCH net-next 12/18] IB/mlx5: Add kernel offload flow-tag Saeed Mahameed
@ 2016-06-17 14:43 ` Saeed Mahameed
  2016-06-17 15:12   ` Eric Dumazet
  2016-06-17 15:15   ` Daniel Borkmann
  2016-06-17 14:43 ` [PATCH net-next 14/18] net/mlx5e: Set sniffer skbs packet type to offload kernel Saeed Mahameed
                   ` (5 subsequent siblings)
  18 siblings, 2 replies; 33+ messages in thread
From: Saeed Mahameed @ 2016-06-17 14:43 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Doug Ledford, Or Gerlitz, Maor Gottlieb, Huy Nguyen,
	Tal Alon, Patrick McHardy, Eric Dumazet, Saeed Mahameed

From: Maor Gottlieb <maorg@mellanox.com>

Add a new packet type that skips kernel-specific protocol handlers.

This is needed so device drivers can pass packets up to user space
(af_packet/tcpdump, etc.) without them having to go through the
whole kernel data path.
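
As a rough illustration (not part of this patch), a packet-socket
consumer could classify such frames by the sll_pkttype value reported
with each received packet; this sketch assumes the new constant is
visible through the updated uapi header, and handle_offload_frame() is
a hypothetical helper:

	#include <sys/socket.h>
	#include <linux/if_packet.h>

	/* fd: an AF_PACKET, SOCK_RAW socket already bound to the device */
	char buf[2048];
	struct sockaddr_ll sll;
	socklen_t sll_len = sizeof(sll);
	ssize_t n = recvfrom(fd, buf, sizeof(buf), 0,
			     (struct sockaddr *)&sll, &sll_len);

	if (n >= 0 && sll.sll_pkttype == PACKET_OFFLOAD_KERNEL)
		handle_offload_frame(buf, n);	/* hypothetical handler */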

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
CC: David S. Miller <davem@davemloft.net>
CC: Patrick McHardy <kaber@trash.net>
CC: Eric Dumazet <edumazet@google.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 include/linux/skbuff.h         | 6 +++---
 include/uapi/linux/if_packet.h | 1 +
 net/core/dev.c                 | 4 ++++
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index dc0fca7..359724e 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -694,14 +694,14 @@ struct sk_buff {
 
 /* if you move pkt_type around you also must adapt those constants */
 #ifdef __BIG_ENDIAN_BITFIELD
-#define PKT_TYPE_MAX	(7 << 5)
+#define PKT_TYPE_MAX	(8 << 5)
 #else
-#define PKT_TYPE_MAX	7
+#define PKT_TYPE_MAX	8
 #endif
 #define PKT_TYPE_OFFSET()	offsetof(struct sk_buff, __pkt_type_offset)
 
 	__u8			__pkt_type_offset[0];
-	__u8			pkt_type:3;
+	__u8			pkt_type:4;
 	__u8			pfmemalloc:1;
 	__u8			ignore_df:1;
 	__u8			nfctinfo:3;
diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
index 9e7edfd..93a9f13 100644
--- a/include/uapi/linux/if_packet.h
+++ b/include/uapi/linux/if_packet.h
@@ -29,6 +29,7 @@ struct sockaddr_ll {
 #define PACKET_LOOPBACK		5		/* MC/BRD frame looped back */
 #define PACKET_USER		6		/* To user space	*/
 #define PACKET_KERNEL		7		/* To kernel space	*/
+#define PACKET_OFFLOAD_KERNEL	8		/* Offload NET stack	*/
 /* Unused, PACKET_FASTROUTE and PACKET_LOOPBACK are invisible to user space */
 #define PACKET_FASTROUTE	6		/* Fastrouted frame	*/
 
diff --git a/net/core/dev.c b/net/core/dev.c
index d40593b..f300f1a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4113,6 +4113,9 @@ another_round:
 		pt_prev = ptype;
 	}
 
+	if (unlikely(skb->pkt_type == PACKET_OFFLOAD_KERNEL))
+		goto done;
+
 skip_taps:
 #ifdef CONFIG_NET_INGRESS
 	if (static_key_false(&ingress_needed)) {
@@ -4190,6 +4193,7 @@ ncls:
 				       &skb->dev->ptype_specific);
 	}
 
+done:
 	if (pt_prev) {
 		if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
 			goto drop;
-- 
2.8.0

* [PATCH net-next 14/18] net/mlx5e: Set sniffer skbs packet type to offload kernel
  2016-06-17 14:43 [PATCH net-next 00/18] mlx5 RoCE/RDMA packet sniffer Saeed Mahameed
                   ` (12 preceding siblings ...)
  2016-06-17 14:43 ` [PATCH net-next 13/18] net: Add offload kernel net stack packet type Saeed Mahameed
@ 2016-06-17 14:43 ` Saeed Mahameed
  2016-06-17 14:43 ` [PATCH net-next 15/18] net/mlx5: Introduce sniffer steering hardware capabilities Saeed Mahameed
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Saeed Mahameed @ 2016-06-17 14:43 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Doug Ledford, Or Gerlitz, Maor Gottlieb, Huy Nguyen,
	Tal Alon, Saeed Mahameed

From: Huy Nguyen <huyn@mellanox.com>

Set skb->pkt_type to PACKET_OFFLOAD_KERNEL for sniffer packets in
mlx5e_build_rx_skb so that they skip the kernel net stack processing.
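
One detail worth noting: the flow tag lives in the low 24 bits of the
big-endian sop_drop_qpn CQE field, so the test is done entirely in
big-endian form (mask and constant both via cpu_to_be32), avoiding a
per-packet byte swap. A sketch of the equivalent open-coded check:

	/* sketch only: the same test mlx5e_build_rx_skb performs */
	__be32 ft = cqe->sop_drop_qpn & cpu_to_be32(0xFFFFFF);

	if (ft == cpu_to_be32(MLX5_FS_OFFLOAD_FLOW_TAG))
		skb->pkt_type = PACKET_OFFLOAD_KERNEL;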

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h    | 1 +
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 4 ++++
 include/linux/mlx5/device.h                     | 5 +++++
 3 files changed, 10 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index e8a6c33..05ee644 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -44,6 +44,7 @@
 #include <linux/mlx5/vport.h>
 #include <linux/mlx5/transobj.h>
 #include <linux/rhashtable.h>
+#include <linux/mlx5/fs.h>
 #include "wq.h"
 #include "mlx5_core.h"
 #include "en_stats.h"
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index bd94770..b1aa9f2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -33,6 +33,7 @@
 #include <linux/ip.h>
 #include <linux/ipv6.h>
 #include <linux/tcp.h>
+#include <linux/if_packet.h>
 #include <net/busy_poll.h>
 #include "en.h"
 #include "en_tc.h"
@@ -741,6 +742,9 @@ static inline void mlx5e_build_rx_skb(struct mlx5_cqe64 *cqe,
 
 	mlx5e_handle_csum(netdev, cqe, rq, skb, !!lro_num_seg);
 	skb->protocol = eth_type_trans(skb, netdev);
+	if (unlikely((mlx5_get_cqe_ft(cqe) ==
+		      cpu_to_be32(MLX5_FS_OFFLOAD_FLOW_TAG))))
+		skb->pkt_type = PACKET_OFFLOAD_KERNEL;
 }
 
 static inline void mlx5e_complete_rx_cqe(struct mlx5e_rq *rq,
diff --git a/include/linux/mlx5/device.h b/include/linux/mlx5/device.h
index 73a4847..a90f85b 100644
--- a/include/linux/mlx5/device.h
+++ b/include/linux/mlx5/device.h
@@ -754,6 +754,11 @@ static inline u64 get_cqe_ts(struct mlx5_cqe64 *cqe)
 	return (u64)lo | ((u64)hi << 32);
 }
 
+static inline __be32 mlx5_get_cqe_ft(struct mlx5_cqe64 *cqe)
+{
+	return cqe->sop_drop_qpn & cpu_to_be32(0xFFFFFF);
+}
+
 struct mpwrq_cqe_bc {
 	__be16	filler_consumed_strides;
 	__be16	byte_cnt;
-- 
2.8.0

* [PATCH net-next 15/18] net/mlx5: Introduce sniffer steering hardware capabilities
  2016-06-17 14:43 [PATCH net-next 00/18] mlx5 RoCE/RDMA packet sniffer Saeed Mahameed
                   ` (13 preceding siblings ...)
  2016-06-17 14:43 ` [PATCH net-next 14/18] net/mlx5e: Set sniffer skbs packet type to offload kernel Saeed Mahameed
@ 2016-06-17 14:43 ` Saeed Mahameed
  2016-06-17 14:43 ` [PATCH net-next 16/18] net/mlx5e: Sniffer support for kernel offload (RoCE) traffic Saeed Mahameed
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Saeed Mahameed @ 2016-06-17 14:43 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Doug Ledford, Or Gerlitz, Maor Gottlieb, Huy Nguyen,
	Tal Alon, Saeed Mahameed

From: Maor Gottlieb <maorg@mellanox.com>

Define the hardware capabilities needed for the sniffer
RX and TX flow tables.
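
A consumer would typically gate sniffer table creation on these
capabilities; a minimal sketch, assuming the ft_support field follows
the existing flow table property layout:

	/* bail out early when the device lacks sniffer steering */
	if (!MLX5_CAP_FLOWTABLE_SNIFFER_RX(mdev, ft_support) ||
	    !MLX5_CAP_FLOWTABLE_SNIFFER_TX(mdev, ft_support))
		return -EOPNOTSUPP;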

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 include/linux/mlx5/device.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/include/linux/mlx5/device.h b/include/linux/mlx5/device.h
index a90f85b..c286cf1 100644
--- a/include/linux/mlx5/device.h
+++ b/include/linux/mlx5/device.h
@@ -1378,6 +1378,18 @@ enum mlx5_cap_type {
 #define MLX5_CAP_FLOWTABLE_NIC_RX_MAX(mdev, cap) \
 	MLX5_CAP_FLOWTABLE_MAX(mdev, flow_table_properties_nic_receive.cap)
 
+#define MLX5_CAP_FLOWTABLE_SNIFFER_RX(mdev, cap) \
+	MLX5_CAP_FLOWTABLE(mdev, flow_table_properties_nic_receive_sniffer.cap)
+
+#define MLX5_CAP_FLOWTABLE_SNIFFER_RX_MAX(mdev, cap) \
+	MLX5_CAP_FLOWTABLE_MAX(mdev, flow_table_properties_nic_receive_sniffer.cap)
+
+#define MLX5_CAP_FLOWTABLE_SNIFFER_TX(mdev, cap) \
+	MLX5_CAP_FLOWTABLE(mdev, flow_table_properties_nic_transmit_sniffer.cap)
+
+#define MLX5_CAP_FLOWTABLE_SNIFFER_TX_MAX(mdev, cap) \
+	MLX5_CAP_FLOWTABLE_MAX(mdev, flow_table_properties_nic_transmit_sniffer.cap)
+
 #define MLX5_CAP_ESW_FLOWTABLE(mdev, cap) \
 	MLX5_GET(flow_table_eswitch_cap, \
 		 mdev->hca_caps_cur[MLX5_CAP_ESWITCH_FLOW_TABLE], cap)
-- 
2.8.0

* [PATCH net-next 16/18] net/mlx5e: Sniffer support for kernel offload (RoCE) traffic
  2016-06-17 14:43 [PATCH net-next 00/18] mlx5 RoCE/RDMA packet sniffer Saeed Mahameed
                   ` (14 preceding siblings ...)
  2016-06-17 14:43 ` [PATCH net-next 15/18] net/mlx5: Introduce sniffer steering hardware capabilities Saeed Mahameed
@ 2016-06-17 14:43 ` Saeed Mahameed
  2016-06-17 14:43 ` [PATCH net-next 17/18] net/mlx5e: Lock device state in set features Saeed Mahameed
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Saeed Mahameed @ 2016-06-17 14:43 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Doug Ledford, Or Gerlitz, Maor Gottlieb, Huy Nguyen,
	Tal Alon, Saeed Mahameed

From: Huy Nguyen <huyn@mellanox.com>

Create the sniffer RX and TX flow tables, flow groups, flow rules
and TIRs.

TIRs:
  Create three TIRs: one for RX traffic, one for TX traffic and
  one for sniffer rules in the leftovers flow table.

Flow rules:
  Register a callback notifier with the steering interface to
  dynamically add/remove RoCE/kernel offload rules.
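
Putting the two together: when a rule is added in the bypass, RoCE or
leftovers namespace, the registered notifier queues a work item that
duplicates the rule, tagged with the offload flow tag, into the sniffer
RX table (or, for leftovers, into the originating table) with a sniffer
TIR as destination; on delete, the duplicated rule is torn down again.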

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   3 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |   9 +
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   4 +-
 .../net/ethernet/mellanox/mlx5/core/en_sniffer.c   | 574 +++++++++++++++++++++
 include/linux/mlx5/fs.h                            |   3 +
 5 files changed, 590 insertions(+), 3 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_sniffer.c

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 9ea7b58..111d0f5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -6,6 +6,7 @@ mlx5_core-y :=	main.o cmd.o debugfs.o fw.o eq.o uar.o pagealloc.o \
 
 mlx5_core-$(CONFIG_MLX5_CORE_EN) += wq.o eswitch.o \
 		en_main.o en_fs.o en_ethtool.o en_tx.o en_rx.o \
-		en_txrx.o en_clock.o vxlan.o en_tc.o en_arfs.o
+		en_txrx.o en_clock.o vxlan.o en_tc.o en_arfs.o \
+		en_sniffer.o
 
 mlx5_core-$(CONFIG_MLX5_CORE_EN_DCB) +=  en_dcbnl.o
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 05ee644..9a73ac2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -492,6 +492,8 @@ enum {
 	MLX5E_ARFS_FT_LEVEL
 };
 
+struct mlx5e_sniffer;
+
 struct mlx5e_flow_steering {
 	struct mlx5_flow_namespace      *ns;
 	struct mlx5e_tc_table           tc;
@@ -499,6 +501,7 @@ struct mlx5e_flow_steering {
 	struct mlx5e_l2_table           l2;
 	struct mlx5e_ttc_table          ttc;
 	struct mlx5e_arfs_tables        arfs;
+	struct mlx5e_sniffer            *sniffer;
 };
 
 struct mlx5e_direct_tir {
@@ -580,6 +583,9 @@ enum mlx5e_link_mode {
 
 #define MLX5E_PROT_MASK(link_mode) (1 << link_mode)
 
+int mlx5e_sniffer_start(struct mlx5e_priv *priv);
+int mlx5e_sniffer_stop(struct mlx5e_priv *priv);
+
 void mlx5e_send_nop(struct mlx5e_sq *sq, bool notify_hw);
 u16 mlx5e_select_queue(struct net_device *dev, struct sk_buff *skb,
 		       void *accel_priv, select_queue_fallback_t fallback);
@@ -646,6 +652,9 @@ int mlx5e_close_locked(struct net_device *netdev);
 void mlx5e_build_default_indir_rqt(struct mlx5_core_dev *mdev,
 				   u32 *indirection_rqt, int len,
 				   int num_channels);
+void mlx5e_build_direct_tir_ctx(struct mlx5e_priv *priv, u32 *tirc,
+				u32 rqtn);
+
 int mlx5e_get_max_linkspeed(struct mlx5_core_dev *mdev, u32 *speed);
 
 static inline void mlx5e_tx_notify_hw(struct mlx5e_sq *sq,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index f5c8d5d..982f852 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2017,8 +2017,8 @@ static void mlx5e_build_indir_tir_ctx(struct mlx5e_priv *priv, u32 *tirc,
 	}
 }
 
-static void mlx5e_build_direct_tir_ctx(struct mlx5e_priv *priv, u32 *tirc,
-				       u32 rqtn)
+void mlx5e_build_direct_tir_ctx(struct mlx5e_priv *priv, u32 *tirc,
+				u32 rqtn)
 {
 	MLX5_SET(tirc, tirc, transport_domain, priv->tdn);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_sniffer.c b/drivers/net/ethernet/mellanox/mlx5/core/en_sniffer.c
new file mode 100644
index 0000000..ff462c8
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_sniffer.c
@@ -0,0 +1,574 @@
+/*
+ * Copyright (c) 2016, Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/mlx5/fs.h>
+#include "en.h"
+
+enum sniffer_types {
+	SNIFFER_TX,
+	SNIFFER_RX,
+	SNIFFER_LEFTOVERS,
+	SNIFFER_NUM_TYPES,
+};
+
+struct mlx5_sniffer_rule_info {
+	struct mlx5_flow_rule   *rule;
+	struct mlx5_flow_table  *ft;
+	enum sniffer_types      type;
+};
+
+struct sniffer_work {
+	struct work_struct             work;
+	struct mlx5_sniffer_rule_info  rule_info;
+	struct mlx5e_sniffer           *sniffer;
+	struct notifier_block          *nb;
+};
+
+struct sniffer_evt_ctx {
+	struct mlx5e_sniffer    *sniffer;
+	struct notifier_block   nb;
+};
+
+struct sniffer_rule {
+	struct mlx5_flow_rule   *rule;
+	struct list_head        list;
+};
+
+struct mlx5e_sniffer {
+	struct mlx5e_priv	*priv;
+	struct workqueue_struct *sniffer_wq;
+	struct mlx5_flow_table  *rx_ft;
+	struct mlx5_flow_table  *tx_ft;
+	struct sniffer_evt_ctx  bypass_ctx;
+	struct sniffer_evt_ctx  roce_ctx;
+	struct sniffer_evt_ctx  leftovers_ctx;
+	struct list_head        rules;
+	struct list_head        leftover_rules;
+	u32                     tirn[SNIFFER_NUM_TYPES];
+};
+
+static bool sniffer_rule_in_leftovers(struct mlx5e_sniffer *sniffer,
+				      struct mlx5_flow_rule *rule)
+{
+	struct sniffer_rule *sniffer_flow;
+
+	list_for_each_entry(sniffer_flow, &sniffer->leftover_rules, list) {
+		if (sniffer_flow->rule == rule)
+			return true;
+	}
+	return false;
+}
+
+static int mlx5e_sniffer_create_tx_rule(struct mlx5e_sniffer *sniffer)
+{
+	struct mlx5e_priv *priv = sniffer->priv;
+	int match_len = MLX5_ST_SZ_BYTES(fte_match_param);
+	struct sniffer_rule *sniffer_flow;
+	struct mlx5_flow_destination dest;
+	struct mlx5_flow_attr flow_attr;
+	u32 *match_criteria_value;
+	int err = 0;
+
+	/* Create no filter rule */
+	match_criteria_value = mlx5_vzalloc(match_len);
+	if (!match_criteria_value)
+		return -ENOMEM;
+
+	sniffer_flow = kzalloc(sizeof(*sniffer_flow), GFP_KERNEL);
+	if (!sniffer_flow) {
+		err = -ENOMEM;
+		netdev_err(priv->netdev, "failed to alloc sniifer_flow");
+		goto out;
+	}
+	dest.tir_num = sniffer->tirn[SNIFFER_TX];
+	dest.type = MLX5_FLOW_DESTINATION_TYPE_TIR;
+	MLX5_RULE_ATTR(flow_attr, 0, match_criteria_value, match_criteria_value,
+		       MLX5_FLOW_CONTEXT_ACTION_FWD_DEST,
+		       MLX5_FS_OFFLOAD_FLOW_TAG, &dest);
+	sniffer_flow->rule =
+		mlx5_add_flow_rule(sniffer->tx_ft, &flow_attr);
+	if (IS_ERR(sniffer_flow->rule)) {
+		err = PTR_ERR(sniffer_flow->rule);
+		kfree(sniffer_flow);
+		goto out;
+	}
+	list_add(&sniffer_flow->list, &sniffer->rules);
+out:
+	kvfree(match_criteria_value);
+	return err;
+}
+
+static void sniffer_del_rule_handler(struct work_struct *_work)
+{
+	struct mlx5_sniffer_rule_info *rule_info;
+	struct sniffer_rule *sniffer_rule;
+	struct sniffer_work *work;
+
+	work = container_of(_work, struct sniffer_work, work);
+	rule_info = &work->rule_info;
+	sniffer_rule = (struct sniffer_rule *)
+		mlx5_get_rule_private_data(rule_info->rule, work->nb);
+
+	if (!sniffer_rule)
+		goto out;
+
+	mlx5_del_flow_rule(sniffer_rule->rule);
+	list_del(&sniffer_rule->list);
+	kfree(sniffer_rule);
+
+out:
+	mlx5_release_rule_private_data(rule_info->rule, work->nb);
+	mlx5_put_flow_rule(work->rule_info.rule);
+	kfree(work);
+}
+
+static int sniffer_add_flow_rule(struct mlx5e_sniffer *sniffer,
+				 struct sniffer_rule *sniffer_flow,
+				 struct mlx5_sniffer_rule_info *rule_info)
+{
+	struct mlx5_flow_destination  dest;
+	struct mlx5_flow_match flow_match;
+	struct mlx5_flow_attr flow_attr;
+	struct mlx5_flow_table *ft;
+	int err = 0;
+
+	mlx5_get_rule_flow_match(&flow_match, rule_info->rule);
+	/* duplicated rules target the matching sniffer TIR */
+	dest.tir_num = sniffer->tirn[rule_info->type];
+	dest.type = MLX5_FLOW_DESTINATION_TYPE_TIR;
+	MLX5_RULE_ATTR(flow_attr, flow_match.match_criteria_enable,
+		       flow_match.match_criteria,
+		       flow_match.match_value,
+		       MLX5_FLOW_CONTEXT_ACTION_FWD_DEST,
+		       MLX5_FS_OFFLOAD_FLOW_TAG, &dest);
+
+	ft = (rule_info->type == SNIFFER_LEFTOVERS) ? rule_info->ft :
+		sniffer->rx_ft;
+	sniffer_flow->rule =
+		mlx5_add_flow_rule(ft, &flow_attr);
+	if (IS_ERR(sniffer_flow->rule)) {
+		err = PTR_ERR(sniffer_flow->rule);
+		sniffer_flow->rule = NULL;
+	}
+
+	return err;
+}
+
+static void sniffer_add_rule_handler(struct work_struct *work)
+{
+	struct mlx5_sniffer_rule_info *rule_info;
+	struct sniffer_rule *sniffer_flow;
+	struct sniffer_work *sniffer_work;
+	struct mlx5e_sniffer *sniffer;
+	struct notifier_block *nb;
+	struct mlx5e_priv *priv;
+	int err;
+
+	sniffer_work = container_of(work, struct sniffer_work, work);
+	rule_info = &sniffer_work->rule_info;
+	sniffer = sniffer_work->sniffer;
+	nb = sniffer_work->nb;
+	priv = sniffer->priv;
+
+	if (sniffer_rule_in_leftovers(sniffer,
+				      rule_info->rule))
+		goto out;
+
+	sniffer_flow = kzalloc(sizeof(*sniffer_flow), GFP_KERNEL);
+	if (!sniffer_flow)
+		goto out;
+
+	err = sniffer_add_flow_rule(sniffer, sniffer_flow, rule_info);
+	if (err) {
+		netdev_err(priv->netdev, "%s: Failed to add sniffer rule, err=%d\n",
+			   __func__, err);
+		kfree(sniffer_flow);
+		goto out;
+	}
+
+	err = mlx5_set_rule_private_data(rule_info->rule, nb, sniffer_flow);
+	if (err) {
+		netdev_err(priv->netdev, "%s: mlx5_set_rule_private_data failed\n",
+			   __func__);
+		mlx5_del_flow_rule(sniffer_flow->rule);
+		kfree(sniffer_flow);
+		goto out;
+	}
+	if (rule_info->type == SNIFFER_LEFTOVERS)
+		list_add(&sniffer_flow->list, &sniffer->leftover_rules);
+	else
+		list_add(&sniffer_flow->list, &sniffer->rules);
+
+out:
+	mlx5_put_flow_rule(rule_info->rule);
+	kfree(sniffer_work);
+}
+
+static int sniffer_flow_rule_event_fn(struct notifier_block *nb,
+				      unsigned long event, void *data)
+{
+	struct mlx5_event_data *event_data;
+	struct sniffer_evt_ctx *event_ctx;
+	struct mlx5e_sniffer *sniffer;
+	struct sniffer_work *work;
+	enum sniffer_types type;
+
+	event_ctx = container_of(nb, struct sniffer_evt_ctx, nb);
+	sniffer = event_ctx->sniffer;
+
+	event_data = (struct mlx5_event_data *)data;
+	type = (event_ctx == &sniffer->leftovers_ctx) ? SNIFFER_LEFTOVERS :
+		SNIFFER_RX;
+
+	if ((type == SNIFFER_LEFTOVERS) && (event == MLX5_RULE_EVENT_DEL) &&
+	    sniffer_rule_in_leftovers(sniffer, event_data->rule)) {
+		return 0;
+	}
+
+	work = kzalloc(sizeof(*work), GFP_KERNEL);
+	if (!work)
+		return -ENOMEM;
+
+	work->rule_info.rule = event_data->rule;
+	work->rule_info.ft = event_data->ft;
+	work->rule_info.type = type;
+	work->sniffer = sniffer;
+	work->nb = nb;
+
+	mlx5_get_flow_rule(event_data->rule);
+
+	if (event == MLX5_RULE_EVENT_ADD)
+		INIT_WORK(&work->work, sniffer_add_rule_handler);
+	else
+		INIT_WORK(&work->work, sniffer_del_rule_handler);
+
+	queue_work(sniffer->sniffer_wq, &work->work);
+
+	return 0;
+}
+
+static struct sniffer_evt_ctx *sniffer_get_event_ctx(struct mlx5e_sniffer *sniffer,
+						     enum mlx5_flow_namespace_type type)
+{
+	switch (type) {
+	case MLX5_FLOW_NAMESPACE_BYPASS:
+		return &sniffer->bypass_ctx;
+	case MLX5_FLOW_NAMESPACE_ROCE:
+		return &sniffer->roce_ctx;
+	case MLX5_FLOW_NAMESPACE_LEFTOVERS:
+		return &sniffer->leftovers_ctx;
+	default:
+		return NULL;
+	}
+}
+
+static void sniffer_destroy_tirs(struct mlx5e_sniffer *sniffer)
+{
+	struct mlx5e_priv *priv = sniffer->priv;
+	int i;
+
+	for (i = 0; i < SNIFFER_NUM_TYPES; i++)
+		mlx5_core_destroy_tir(priv->mdev, sniffer->tirn[i]);
+}
+
+static void sniffer_cleanup_resources(struct mlx5e_sniffer *sniffer)
+{
+	struct sniffer_rule *sniffer_flow;
+	struct sniffer_rule *tmp;
+
+	if (sniffer->sniffer_wq)
+		destroy_workqueue(sniffer->sniffer_wq);
+
+	list_for_each_entry_safe(sniffer_flow, tmp, &sniffer->rules, list) {
+		mlx5_del_flow_rule(sniffer_flow->rule);
+		list_del(&sniffer_flow->list);
+		kfree(sniffer_flow);
+	}
+
+	list_for_each_entry_safe(sniffer_flow, tmp, &sniffer->leftover_rules, list) {
+		mlx5_del_flow_rule(sniffer_flow->rule);
+		list_del(&sniffer_flow->list);
+		kfree(sniffer_flow);
+	}
+
+	if (sniffer->rx_ft)
+		mlx5_destroy_flow_table(sniffer->rx_ft);
+
+	if (sniffer->tx_ft)
+		mlx5_destroy_flow_table(sniffer->tx_ft);
+
+	sniffer_destroy_tirs(sniffer);
+}
+
+static void sniffer_unregister_ns_rules_handlers(struct mlx5e_sniffer *sniffer,
+						 enum mlx5_flow_namespace_type ns_type)
+{
+	struct mlx5e_priv *priv = sniffer->priv;
+	struct sniffer_evt_ctx *evt_ctx;
+	struct mlx5_flow_namespace *ns;
+
+	ns = mlx5_get_flow_namespace(priv->mdev, ns_type);
+	if (!ns)
+		return;
+
+	evt_ctx = sniffer_get_event_ctx(sniffer, ns_type);
+	mlx5_unregister_rule_notifier(ns, &evt_ctx->nb);
+}
+
+static void sniffer_unregister_rules_handlers(struct mlx5e_sniffer *sniffer)
+{
+	sniffer_unregister_ns_rules_handlers(sniffer,
+					     MLX5_FLOW_NAMESPACE_BYPASS);
+	sniffer_unregister_ns_rules_handlers(sniffer,
+					     MLX5_FLOW_NAMESPACE_ROCE);
+	sniffer_unregister_ns_rules_handlers(sniffer,
+					     MLX5_FLOW_NAMESPACE_LEFTOVERS);
+}
+
+int mlx5e_sniffer_stop(struct mlx5e_priv *priv)
+{
+	struct mlx5e_sniffer *sniffer = priv->fs.sniffer;
+
+	if (!sniffer)
+		return 0;
+
+	sniffer_unregister_rules_handlers(sniffer);
+	sniffer_cleanup_resources(sniffer);
+	kfree(sniffer);
+	priv->fs.sniffer = NULL;
+
+	return 0;
+}
+
+static int sniffer_register_ns_rules_handlers(struct mlx5e_sniffer *sniffer,
+					      enum mlx5_flow_namespace_type ns_type)
+{
+	struct mlx5e_priv *priv = sniffer->priv;
+	struct sniffer_evt_ctx *evt_ctx;
+	struct mlx5_flow_namespace *ns;
+	int err;
+
+	ns = mlx5_get_flow_namespace(priv->mdev, ns_type);
+	if (!ns)
+		return -ENOENT;
+
+	evt_ctx = sniffer_get_event_ctx(sniffer, ns_type);
+	if (!evt_ctx)
+		return -ENOENT;
+
+	evt_ctx->nb.notifier_call = sniffer_flow_rule_event_fn;
+	evt_ctx->sniffer  = sniffer;
+	err = mlx5_register_rule_notifier(ns, &evt_ctx->nb);
+	if (err) {
+		netdev_err(priv->netdev,
+			   "%s: mlx5_register_rule_notifier failed\n", __func__);
+		return err;
+	}
+
+	return 0;
+}
+
+static int sniffer_register_rules_handlers(struct mlx5e_sniffer *sniffer)
+{
+	struct mlx5e_priv *priv = sniffer->priv;
+	int err;
+
+	err = sniffer_register_ns_rules_handlers(sniffer,
+						 MLX5_FLOW_NAMESPACE_BYPASS);
+	if (err)
+		netdev_err(priv->netdev,
+			   "%s: Failed to register for bypass namesapce\n",
+			   __func__);
+
+	err = sniffer_register_ns_rules_handlers(sniffer,
+						 MLX5_FLOW_NAMESPACE_ROCE);
+	if (err)
+		netdev_err(priv->netdev,
+			   "%s: Failed to register for roce namesapce\n",
+			   __func__);
+
+	err = sniffer_register_ns_rules_handlers(sniffer,
+						 MLX5_FLOW_NAMESPACE_LEFTOVERS);
+	if (err)
+		netdev_err(priv->netdev,
+			   "%s: Failed to register for leftovers namesapce\n",
+			   __func__);
+
+	return err;
+}
+
+static int sniffer_create_tirs(struct mlx5e_sniffer *sniffer)
+{
+	struct mlx5e_priv *priv = sniffer->priv;
+	void *tirc;
+	int inlen;
+	u32 *tirn;
+	u32 rqtn;
+	int err;
+	u32 *in;
+	int tt;
+
+	inlen = MLX5_ST_SZ_BYTES(create_tir_in);
+	in = mlx5_vzalloc(inlen);
+	if (!in)
+		return -ENOMEM;
+
+	for (tt = 0; tt < SNIFFER_NUM_TYPES; tt++) {
+		tirn = &sniffer->tirn[tt];
+		tirc = MLX5_ADDR_OF(create_tir_in, in, ctx);
+		rqtn = priv->direct_tir[tt % priv->params.num_channels].rqtn;
+		mlx5e_build_direct_tir_ctx(priv, tirc, rqtn);
+		err = mlx5_core_create_tir(priv->mdev, in, inlen, tirn);
+		if (err)
+			goto err_destroy_ch_tirs;
+		memset(in, 0, inlen);
+	}
+
+	kvfree(in);
+
+	return 0;
+
+err_destroy_ch_tirs:
+	for (tt--; tt >= 0; tt--)
+		mlx5_core_destroy_tir(priv->mdev, sniffer->tirn[tt]);
+	kvfree(in);
+
+	return err;
+}
+
+#define SNIFFER_RX_MAX_FTES min_t(u32, (MLX5_BY_PASS_NUM_REGULAR_PRIOS *\
+					FS_MAX_ENTRIES), BIT(20))
+#define SNIFFER_RX_MAX_NUM_GROUPS (MLX5_BY_PASS_NUM_REGULAR_PRIOS *\
+				   FS_MAX_TYPES)
+
+#define SNIFFER_TX_MAX_FTES 1
+#define SNIFFER_TX_MAX_NUM_GROUPS 1
+
+static int sniffer_init_resources(struct mlx5e_sniffer *sniffer)
+{
+	struct mlx5e_priv *priv = sniffer->priv;
+	struct mlx5_core_dev *mdev = priv->mdev;
+	struct mlx5_flow_namespace *p_sniffer_rx_ns;
+	struct mlx5_flow_namespace *p_sniffer_tx_ns;
+	int table_size;
+	int err;
+
+	INIT_LIST_HEAD(&sniffer->rules);
+	INIT_LIST_HEAD(&sniffer->leftover_rules);
+
+	p_sniffer_rx_ns =
+		mlx5_get_flow_namespace(mdev, MLX5_FLOW_NAMESPACE_SNIFFER_RX);
+	if (!p_sniffer_rx_ns)
+		return -ENOENT;
+
+	p_sniffer_tx_ns =
+		mlx5_get_flow_namespace(mdev, MLX5_FLOW_NAMESPACE_SNIFFER_TX);
+	if (!p_sniffer_tx_ns)
+		return -ENOENT;
+
+	err = sniffer_create_tirs(sniffer);
+	if (err) {
+		netdev_err(priv->netdev, "%s: Create tirs failed, err=%d\n",
+			   __func__, err);
+		return err;
+	}
+
+	sniffer->sniffer_wq = create_singlethread_workqueue("mlx5e_sniffer");
+	if (!sniffer->sniffer_wq) {
+		err = -ENOMEM;
+		goto error;
+	}
+
+	/* Create "medium" size flow table */
+	table_size = min_t(u32,
+			   BIT(MLX5_CAP_FLOWTABLE_SNIFFER_RX(mdev,
+							     log_max_ft_size)),
+			   SNIFFER_RX_MAX_FTES);
+	sniffer->rx_ft =
+		mlx5_create_auto_grouped_flow_table(p_sniffer_rx_ns, 0,
+						    table_size,
+						    SNIFFER_RX_MAX_NUM_GROUPS,
+						    0);
+	if (IS_ERR(sniffer->rx_ft)) {
+		err = PTR_ERR(sniffer->rx_ft);
+		sniffer->rx_ft = NULL;
+		goto error;
+	}
+
+	sniffer->tx_ft =
+		mlx5_create_auto_grouped_flow_table(p_sniffer_tx_ns, 0,
+						    SNIFFER_TX_MAX_FTES,
+						    SNIFFER_TX_MAX_NUM_GROUPS,
+						    0);
+	if (IS_ERR(sniffer->tx_ft)) {
+		err = PTR_ERR(sniffer->tx_ft);
+		sniffer->tx_ft = NULL;
+		goto error;
+	}
+
+	err = mlx5e_sniffer_create_tx_rule(sniffer);
+	if (err)
+		goto error;
+
+	return 0;
+error:
+	sniffer_cleanup_resources(sniffer);
+	return err;
+}
+
+int mlx5e_sniffer_start(struct mlx5e_priv *priv)
+{
+	struct mlx5e_sniffer *sniffer;
+	int err;
+
+	sniffer = kzalloc(sizeof(*sniffer), GFP_KERNEL);
+	if (!sniffer)
+		return -ENOMEM;
+
+	sniffer->priv = priv;
+	err = sniffer_init_resources(sniffer);
+	if (err) {
+		netdev_err(priv->netdev, "%s: Failed to init sniffer resources\n",
+			   __func__);
+		goto err_out;
+	}
+
+	err = sniffer_register_rules_handlers(sniffer);
+	if (err) {
+		netdev_err(priv->netdev, "%s: Failed to register rules handlers\n",
+			   __func__);
+		goto err_cleanup_resources;
+	}
+	priv->fs.sniffer = sniffer;
+	return 0;
+
+err_cleanup_resources:
+	sniffer_cleanup_resources(sniffer);
+err_out:
+	kfree(sniffer);
+	return err;
+}
diff --git a/include/linux/mlx5/fs.h b/include/linux/mlx5/fs.h
index 123b901..5463be6 100644
--- a/include/linux/mlx5/fs.h
+++ b/include/linux/mlx5/fs.h
@@ -43,6 +43,9 @@ enum {
 	MLX5_FLOW_CONTEXT_ACTION_FWD_NEXT_PRIO	= 1 << 16,
 };
 
+#define FS_MAX_TYPES             10
+#define FS_MAX_ENTRIES           32000U
+
 #define LEFTOVERS_RULE_NUM	 2
 static inline void build_leftovers_ft_param(int *priority,
 					    int *n_ent,
-- 
2.8.0

* [PATCH net-next 17/18] net/mlx5e: Lock device state in set features
  2016-06-17 14:43 [PATCH net-next 00/18] mlx5 RoCE/RDMA packet sniffer Saeed Mahameed
                   ` (15 preceding siblings ...)
  2016-06-17 14:43 ` [PATCH net-next 16/18] net/mlx5e: Sniffer support for kernel offload (RoCE) traffic Saeed Mahameed
@ 2016-06-17 14:43 ` Saeed Mahameed
  2016-06-17 14:43 ` [PATCH net-next 18/18] net/mlx5e: Add netdev hw feature flag offload-sniffer Saeed Mahameed
  2016-06-21 13:10 ` [PATCH net-next 00/18] mlx5 RoCE/RDMA packet sniffer Saeed Mahameed
  18 siblings, 0 replies; 33+ messages in thread
From: Saeed Mahameed @ 2016-06-17 14:43 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Doug Ledford, Or Gerlitz, Maor Gottlieb, Huy Nguyen,
	Tal Alon, Saeed Mahameed

From: Maor Gottlieb <maorg@mellanox.com>

Lock the device state in mlx5e_set_features, rather than having each
set-feature handler take the lock itself.

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 982f852..94d6f60 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2247,7 +2247,6 @@ static int set_feature_lro(struct net_device *netdev, bool enable)
 	bool was_opened = test_bit(MLX5E_STATE_OPENED, &priv->state);
 	int err;
 
-	mutex_lock(&priv->state_lock);
 
 	if (was_opened && (priv->params.rq_wq_type == MLX5_WQ_TYPE_LINKED_LIST))
 		mlx5e_close_locked(priv->netdev);
@@ -2262,8 +2261,6 @@ static int set_feature_lro(struct net_device *netdev, bool enable)
 	if (was_opened && (priv->params.rq_wq_type == MLX5_WQ_TYPE_LINKED_LIST))
 		mlx5e_open_locked(priv->netdev);
 
-	mutex_unlock(&priv->state_lock);
-
 	return err;
 }
 
@@ -2305,15 +2302,11 @@ static int set_feature_rx_vlan(struct net_device *netdev, bool enable)
 	struct mlx5e_priv *priv = netdev_priv(netdev);
 	int err;
 
-	mutex_lock(&priv->state_lock);
-
 	priv->params.vlan_strip_disable = !enable;
 	err = mlx5e_modify_rqs_vsd(priv, !enable);
 	if (err)
 		priv->params.vlan_strip_disable = enable;
 
-	mutex_unlock(&priv->state_lock);
-
 	return err;
 }
 
@@ -2358,8 +2351,11 @@ static int mlx5e_handle_feature(struct net_device *netdev,
 static int mlx5e_set_features(struct net_device *netdev,
 			      netdev_features_t features)
 {
+	struct mlx5e_priv *priv = netdev_priv(netdev);
 	int err;
 
+	mutex_lock(&priv->state_lock);
+
 	err  = mlx5e_handle_feature(netdev, features, NETIF_F_LRO,
 				    set_feature_lro);
 	err |= mlx5e_handle_feature(netdev, features,
@@ -2376,6 +2372,8 @@ static int mlx5e_set_features(struct net_device *netdev,
 				    set_feature_arfs);
 #endif
 
+	mutex_unlock(&priv->state_lock);
+
 	return err ? -EINVAL : 0;
 }
 
-- 
2.8.0

* [PATCH net-next 18/18] net/mlx5e: Add netdev hw feature flag offload-sniffer
  2016-06-17 14:43 [PATCH net-next 00/18] mlx5 RoCE/RDMA packet sniffer Saeed Mahameed
                   ` (16 preceding siblings ...)
  2016-06-17 14:43 ` [PATCH net-next 17/18] net/mlx5e: Lock device state in set features Saeed Mahameed
@ 2016-06-17 14:43 ` Saeed Mahameed
  2016-06-21 13:10 ` [PATCH net-next 00/18] mlx5 RoCE/RDMA packet sniffer Saeed Mahameed
  18 siblings, 0 replies; 33+ messages in thread
From: Saeed Mahameed @ 2016-06-17 14:43 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Doug Ledford, Or Gerlitz, Maor Gottlieb, Huy Nguyen,
	Tal Alon, Saeed Mahameed

From: Huy Nguyen <huyn@mellanox.com>

Add the netdev hw feature flag offload-sniffer to turn the kernel net
stack offload sniffer on/off.  The kernel offload sniffer is a device
driver feature: when it is turned on, the device driver starts
receiving packets that were not supposed to go through the kernel
stack.

Optionally, those packets can be marked with PACKET_OFFLOAD_KERNEL in
skb->pkt_type so they will skip the unneeded kernel net stack
processing.
Please see ("net: Add offload kernel net stack packet type").

Example: ethtool -K eth1 offload-sniffer on
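
To verify the toggle took effect, ethtool -k (lowercase) lists the
feature strings registered by this series; exact output formatting may
vary between ethtool versions:

	# ethtool -k eth1 | grep offload-sniffer
	offload-sniffer: on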

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 28 +++++++++++++++++++++++
 include/linux/netdev_features.h                   |  2 ++
 net/core/ethtool.c                                |  1 +
 3 files changed, 31 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 94d6f60..61201f4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -1702,6 +1702,9 @@ int mlx5e_open_locked(struct net_device *netdev)
 	priv->netdev->rx_cpu_rmap = priv->mdev->rmap;
 #endif
 
+	if (netdev->features & NETIF_F_OFFLOAD_SNIFFER)
+		mlx5e_sniffer_start(priv);
+
 	queue_delayed_work(priv->wq, &priv->update_stats_work, 0);
 
 	return 0;
@@ -1737,6 +1740,9 @@ int mlx5e_close_locked(struct net_device *netdev)
 
 	clear_bit(MLX5E_STATE_OPENED, &priv->state);
 
+	if (netdev->features & NETIF_F_OFFLOAD_SNIFFER)
+		mlx5e_sniffer_stop(priv);
+
 	mlx5e_timestamp_cleanup(priv);
 	netif_carrier_off(priv->netdev);
 	mlx5e_redirect_rqts(priv);
@@ -2325,6 +2331,22 @@ static int set_feature_arfs(struct net_device *netdev, bool enable)
 }
 #endif
 
+static int set_feature_offload_sniffer(struct net_device *netdev, bool enable)
+{
+	struct mlx5e_priv *priv = netdev_priv(netdev);
+	int err;
+
+	if (!test_bit(MLX5E_STATE_OPENED, &priv->state))
+		return 0;
+
+	if (enable)
+		err = mlx5e_sniffer_start(priv);
+	else
+		err = mlx5e_sniffer_stop(priv);
+
+	return err;
+}
+
 static int mlx5e_handle_feature(struct net_device *netdev,
 				netdev_features_t wanted_features,
 				netdev_features_t feature,
@@ -2371,6 +2393,8 @@ static int mlx5e_set_features(struct net_device *netdev,
 	err |= mlx5e_handle_feature(netdev, features, NETIF_F_NTUPLE,
 				    set_feature_arfs);
 #endif
+	err |= mlx5e_handle_feature(netdev, features, NETIF_F_OFFLOAD_SNIFFER,
+				    set_feature_offload_sniffer);
 
 	mutex_unlock(&priv->state_lock);
 
@@ -2911,6 +2935,10 @@ static void mlx5e_build_netdev(struct net_device *netdev)
 		netdev->hw_features |= NETIF_F_RXALL;
 
 	netdev->features          = netdev->hw_features;
+
+	/* Put it here because default is off */
+	netdev->hw_features      |= NETIF_F_OFFLOAD_SNIFFER;
+
 	if (!priv->params.lro_en)
 		netdev->features  &= ~NETIF_F_LRO;
 
diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index 9c6c8ef..c00590c2 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -74,6 +74,7 @@ enum {
 	NETIF_F_BUSY_POLL_BIT,		/* Busy poll */
 
 	NETIF_F_HW_TC_BIT,		/* Offload TC infrastructure */
+	NETIF_F_OFFLOAD_SNIFFER_BIT,	/* Kernel Offload sniffer */
 
 	/*
 	 * Add your fresh new feature above and remember to update
@@ -136,6 +137,7 @@ enum {
 #define NETIF_F_HW_L2FW_DOFFLOAD	__NETIF_F(HW_L2FW_DOFFLOAD)
 #define NETIF_F_BUSY_POLL	__NETIF_F(BUSY_POLL)
 #define NETIF_F_HW_TC		__NETIF_F(HW_TC)
+#define NETIF_F_OFFLOAD_SNIFFER __NETIF_F(OFFLOAD_SNIFFER)
 
 #define for_each_netdev_feature(mask_addr, bit)	\
 	for_each_set_bit(bit, (unsigned long *)mask_addr, NETDEV_FEATURE_COUNT)
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 9774898..daaf9a5 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -104,6 +104,7 @@ static const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN]
 	[NETIF_F_HW_L2FW_DOFFLOAD_BIT] = "l2-fwd-offload",
 	[NETIF_F_BUSY_POLL_BIT] =        "busy-poll",
 	[NETIF_F_HW_TC_BIT] =		 "hw-tc-offload",
+	[NETIF_F_OFFLOAD_SNIFFER_BIT] =  "offload-sniffer"
 };
 
 static const char
-- 
2.8.0

* Re: [PATCH net-next 13/18] net: Add offload kernel net stack packet type
  2016-06-17 14:43 ` [PATCH net-next 13/18] net: Add offload kernel net stack packet type Saeed Mahameed
@ 2016-06-17 15:12   ` Eric Dumazet
  2016-06-17 15:15   ` Daniel Borkmann
  1 sibling, 0 replies; 33+ messages in thread
From: Eric Dumazet @ 2016-06-17 15:12 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David S. Miller, netdev, Doug Ledford, Or Gerlitz, Maor Gottlieb,
	Huy Nguyen, Tal Alon, Patrick McHardy

On Fri, Jun 17, 2016 at 7:43 AM, Saeed Mahameed <saeedm@mellanox.com> wrote:
> From: Maor Gottlieb <maorg@mellanox.com>
>
> Add new packet type to skip kernel specific protocol handlers.
>
> This is needed so device drivers can pass packets up to user space
> (af_packet/tcpdump, etc..) without the need for them to go through
> the whole kernel data path.
>
> Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
> CC: David S. Miller <davem@davemloft.net>
> CC: Patrick McHardy <kaber@trash.net>
> CC: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

This breaks BPF quite badly.

git grep SKF_AD_PKTTYPE

* Re: [PATCH net-next 13/18] net: Add offload kernel net stack packet type
  2016-06-17 14:43 ` [PATCH net-next 13/18] net: Add offload kernel net stack packet type Saeed Mahameed
  2016-06-17 15:12   ` Eric Dumazet
@ 2016-06-17 15:15   ` Daniel Borkmann
  2016-06-17 22:54     ` Saeed Mahameed
  1 sibling, 1 reply; 33+ messages in thread
From: Daniel Borkmann @ 2016-06-17 15:15 UTC (permalink / raw)
  To: Saeed Mahameed, David S. Miller
  Cc: netdev, Doug Ledford, Or Gerlitz, Maor Gottlieb, Huy Nguyen,
	Tal Alon, Patrick McHardy, Eric Dumazet, ast

On 06/17/2016 04:43 PM, Saeed Mahameed wrote:
> From: Maor Gottlieb <maorg@mellanox.com>
>
> Add new packet type to skip kernel specific protocol handlers.
>
> This is needed so device drivers can pass packets up to user space
> (af_packet/tcpdump, etc..) without the need for them to go through
> the whole kernel data path.
>
> Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
> CC: David S. Miller <davem@davemloft.net>
> CC: Patrick McHardy <kaber@trash.net>
> CC: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> ---
>   include/linux/skbuff.h         | 6 +++---
>   include/uapi/linux/if_packet.h | 1 +
>   net/core/dev.c                 | 4 ++++
>   3 files changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index dc0fca7..359724e 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -694,14 +694,14 @@ struct sk_buff {
>
>   /* if you move pkt_type around you also must adapt those constants */
>   #ifdef __BIG_ENDIAN_BITFIELD
> -#define PKT_TYPE_MAX	(7 << 5)
> +#define PKT_TYPE_MAX	(8 << 5)
>   #else
> -#define PKT_TYPE_MAX	7
> +#define PKT_TYPE_MAX	8
>   #endif

Aehm ... did you actually test this with BPF ?!

PKT_TYPE_MAX is a mask (naming could be better no doubt), see also function
convert_skb_access():

[...]
	case SKF_AD_PKTTYPE:
		*insn++ = BPF_LDX_MEM(BPF_B, dst_reg, src_reg, PKT_TYPE_OFFSET());
		*insn++ = BPF_ALU32_IMM(BPF_AND, dst_reg, PKT_TYPE_MAX);
#ifdef __BIG_ENDIAN_BITFIELD
		*insn++ = BPF_ALU32_IMM(BPF_RSH, dst_reg, 5);
#endif
		break;
[...]
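
To spell the problem out: PKT_TYPE_MAX is applied as an AND mask, so
the current three-bit field uses mask 7 (0b111). Widening the field to
four bits would require mask 0xf (0b1111); the patch's value of 8
(0b1000) instead clears the three low bits, e.g. PACKET_BROADCAST (1)
& 8 == 0, so every pre-existing packet type would read back as
PACKET_HOST.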

Also, dunno if it's worth burning a skb bit for one driver.

>   #define PKT_TYPE_OFFSET()	offsetof(struct sk_buff, __pkt_type_offset)
>
>   	__u8			__pkt_type_offset[0];
> -	__u8			pkt_type:3;
> +	__u8			pkt_type:4;
>   	__u8			pfmemalloc:1;
>   	__u8			ignore_df:1;
>   	__u8			nfctinfo:3;
> diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
> index 9e7edfd..93a9f13 100644
> --- a/include/uapi/linux/if_packet.h
> +++ b/include/uapi/linux/if_packet.h
> @@ -29,6 +29,7 @@ struct sockaddr_ll {
>   #define PACKET_LOOPBACK		5		/* MC/BRD frame looped back */
>   #define PACKET_USER		6		/* To user space	*/
>   #define PACKET_KERNEL		7		/* To kernel space	*/
> +#define PACKET_OFFLOAD_KERNEL	8		/* Offload NET stack	*/
>   /* Unused, PACKET_FASTROUTE and PACKET_LOOPBACK are invisible to user space */
>   #define PACKET_FASTROUTE	6		/* Fastrouted frame	*/
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index d40593b..f300f1a 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4113,6 +4113,9 @@ another_round:
>   		pt_prev = ptype;
>   	}
>
> +	if (unlikely(skb->pkt_type == PACKET_OFFLOAD_KERNEL))
> +		goto done;
> +
>   skip_taps:
>   #ifdef CONFIG_NET_INGRESS
>   	if (static_key_false(&ingress_needed)) {
> @@ -4190,6 +4193,7 @@ ncls:
>   				       &skb->dev->ptype_specific);
>   	}
>
> +done:
>   	if (pt_prev) {
>   		if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
>   			goto drop;
>

* Re: [PATCH net-next 12/18] IB/mlx5: Add kernel offload flow-tag
  2016-06-17 14:43 ` [PATCH net-next 12/18] IB/mlx5: Add kernel offload flow-tag Saeed Mahameed
@ 2016-06-17 16:00   ` Alexei Starovoitov
  2016-06-17 16:50     ` John Fastabend
  2016-06-17 22:31     ` Saeed Mahameed
  0 siblings, 2 replies; 33+ messages in thread
From: Alexei Starovoitov @ 2016-06-17 16:00 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David S. Miller, netdev, Doug Ledford, Or Gerlitz, Maor Gottlieb,
	Huy Nguyen, Tal Alon, Eric Dumazet, Daniel Borkmann

On Fri, Jun 17, 2016 at 05:43:53PM +0300, Saeed Mahameed wrote:
> From: Maor Gottlieb <maorg@mellanox.com>
> 
> Add kernel offload flow tag for packets that will bypass the kernel
> stack, e.g (RoCE/RDMA/RAW ETH (DPDK), etc ..).

so the whole series is an elaborate way to enable dpdk? how nice.
NACK.

* Re: [PATCH net-next 12/18] IB/mlx5: Add kernel offload flow-tag
  2016-06-17 16:00   ` Alexei Starovoitov
@ 2016-06-17 16:50     ` John Fastabend
  2016-06-17 22:31     ` Saeed Mahameed
  1 sibling, 0 replies; 33+ messages in thread
From: John Fastabend @ 2016-06-17 16:50 UTC (permalink / raw)
  To: Alexei Starovoitov, Saeed Mahameed
  Cc: David S. Miller, netdev, Doug Ledford, Or Gerlitz, Maor Gottlieb,
	Huy Nguyen, Tal Alon, Eric Dumazet, Daniel Borkmann

On 16-06-17 09:00 AM, Alexei Starovoitov wrote:
> On Fri, Jun 17, 2016 at 05:43:53PM +0300, Saeed Mahameed wrote:
>> From: Maor Gottlieb <maorg@mellanox.com>
>>
>> Add kernel offload flow tag for packets that will bypass the kernel
>> stack, e.g (RoCE/RDMA/RAW ETH (DPDK), etc ..).
> 
> so the whole series is an elaborate way to enable dpdk? how nice.
> NACK.
> 

Well there is certainly room for augmenting the af_packet interface
with hardware support.

Some things on my list (it's a bit behind a few other things though)
are to align queues to af_packet sockets, put those queues in busy poll
and, if possible, look at zero-copy RX. The problem with zero-copy RX
is that it bypasses the stack, but we should be able to detect hooks
being added on ingress and disable it dynamically. Maybe I could look
at this in a few months, but think about it for me, as I'm a bit busy
lately. It also requires the driver to translate descriptor formats,
but I'm not convinced that is costly to do.

For DPDK, why not just use SR-IOV like everyone else and bind a VF to
your favorite user space implementation (DPDK, NETMAP, PFRING, foobar0),
even Windows if you like? DPDK even supports this as far as I know.
Then you don't need to bother kernel folks at all, and you don't have
any overhead except whatever your usermode code does.

Here's a really old patch I wrote that I would like to revisit at some
point,

---

>> This adds ndo ops for upper layer objects to request direct DMA from
>> the network interface into memory "slots". The slots must be DMA'able
>> memory given by a page/offset/size vector in a packet_ring_buffer
>> structure.
>>
>> The PF_PACKET socket interface can use these ndo_ops to do zerocopy
>> RX from the network device into memory mapped userspace memory. For
>> this to work ixgbe encodes the correct descriptor blocks and headers
>> so that existing PF_PACKET applications work without any modification.
>> This only supports the V2 header formats. And works by mapping a ring
>> of the network device to these slots.
>>
>> V3 header formats added bulk polling via socket calls and timers
>> used in the polling interface to return every n milliseconds. Currently,
>> I don't see any way to support this in hardware because we can't
>> know if the hardware is in the middle of a DMA operation or not
>> on a slot. So when a timer fires I don't know how to advance the
>> descriptor ring leaving empty descriptors similar to how the software
>> ring works. The easiest (best?) route is to simply not support this.
>>
>> The ndo operations and new socket option PACKET_RX_DIRECT work by
>> selecting a queue_index to run the direct dma operations over. Once
>> setsockopt returns successfully the indicated queue is mapped
>> directly to the requesting application and can not be used for
>> other purposes. Also any kernel layers such as BPF will be bypassed
>> and need to be implemented in the hardware via some other mechanism
>> such as flow classifier or future offload interfaces.
>>
>> Users steer traffic to the selected queue using flow director or
>> VMDQ. This needs to be implemented through the ioctl flow classifier
>> interface (ethtool) or macvlan+hardware offload via netlink; the
>> command line tool ip also supports macvlan+hardware_offload.
>>
>> The new socket option added to PF_PACKET is called PACKET_RX_DIRECT.
>> It takes a single unsigned int value specifying the queue index,
>>
>>      setsockopt(sock, SOL_PACKET, PACKET_RX_DIRECT,
>>             &queue_index, sizeof(queue_index));
>>
>> To test this I modified the tool psock_tpacket in the selftests
>> kernel directory here:
>>
>>      ./tools/testing/selftests/net/psock_tpacket.c
>>
>> Running this tool opens a socket and listens for packets over
>> the PACKET_RX_DIRECT enabled socket.
>>
>> One more caveat is the ring size of ixgbe and the size used by
>> the software socket need to be the same. There is currently an
>> ethtool option to configure this and a warning is pushed to dmesg when
>> it is not the case. To set it, use an ioctl directly or call
>>
>>      ethtool -G ethx rx <size>
>>
>> where <size> is the number of configured slots. The demo program
>> psock_tpacket needs size=2400.
>>
>> Known Limitations (TBD):
>>
>>      (1) Users are required to match the number of rx ring
>>              slots with ethtool to the number requested by the
>>              setsockopt PF_PACKET layout. In the future we could
>>              possibly do this automatically. More importantly this
>>          setting is currently global for all rings and needs
>>          to be per queue.
>>
>>      (2) Users need to configure Flow director or VMDQ (macvlan)
>>              to steer traffic to the correct queues. I don't believe
>>          this needs to be changed it seems to be a good mechanism
>>          for driving ddma.
>>
>>      (3) Not supporting timestamps or priv space yet
>>
>>      (4) Not supporting BPF yet...
>>
>>      (5) Only RX supported so far. TX already supports direct DMA
>>          interface but uses skbs which is really not needed. In
>>          the TX_RING case.
>>
>> To be done:
>>
>>      (1) More testing and performance analysis
>>      (2) Busy polling sockets
>>      (3) Write generic code so driver implementation is small
>>
>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
>> ---
>>   drivers/net/ethernet/intel/ixgbe/ixgbe.h      |    5
>>   drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  383 +++++++++++++++++++++++++
>>   include/linux/netdevice.h                     |    8 +
>>   include/net/af_packet.h                       |   62 ++++
>>   include/uapi/linux/if_packet.h                |    1
>>   net/packet/af_packet.c                        |   41 +++
>>   net/packet/internal.h                         |   59 ----
>>   tools/testing/selftests/net/psock_tpacket.c   |   49 +++
>>   8 files changed, 537 insertions(+), 71 deletions(-)
>>   create mode 100644 include/net/af_packet.h
>>
>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
>> index ac9f214..5000731 100644
>> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
>> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
>> @@ -186,6 +186,8 @@ struct ixgbe_rx_buffer {
>>       dma_addr_t dma;
>>       struct page *page;
>>       unsigned int page_offset;
>> +    unsigned int hdr;
>> +    bool last_frame;
>>   };
>>
>>   struct ixgbe_queue_stats {
>> @@ -279,6 +281,9 @@ struct ixgbe_ring {
>>           };
>>       };
>>
>> +    bool ddma;            /* direct DMA for mapping user pages */
>> +    size_t ddma_size;
>> +    struct sock *sk;
>>       u8 dcb_tc;
>>       struct ixgbe_queue_stats stats;
>>       struct u64_stats_sync syncp;
>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>> index e22278a..b4997b4 100644
>> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>> @@ -1478,6 +1478,79 @@ static bool ixgbe_alloc_mapped_page(struct ixgbe_ring *rx_ring,
>>   }
>>
>>   /**
>> + * ixgbe_ddma_claim_buffers - Manage user visible buffer pages
>> + * @rx_ring: ring to place buffers on
>> + * @cleaned_count: number of buffers to replace
>> + *
>> + * First check that userspace is done with the page then reclaim it with a
>> + * dma_map_page() and give it to the hardware. In comparison to the normal
>> + * alloc_rx_buffers path there is never any need to allocate pages because
>> + * this is done at setup time by the mmapped rx_ring calculations.
>> + *
>> + * Because we are effectively giving the hardware ring to user space a
>> + * misbehaving or simply slow program may stall the receiving of packets
>> + * by causing descriptor exhaustion. But the ring is "owned" by the application
>> + * so the side effects should be limited. Administrators do need to be
>> + * concerned about giving applications flow control enabled queues though.
>> + **/
>> +static u16 ixgbe_ddma_claim_buffers(struct ixgbe_ring *rx_ring,
>> +                    u16 cleaned_count)
>> +{
>> +    union ixgbe_adv_rx_desc *rx_desc;
>> +    struct ixgbe_rx_buffer *bi;
>> +    u16 i = rx_ring->next_to_use;
>> +
>> +    /* nothing to do */
>> +    if (!cleaned_count)
>> +        return cleaned_count;
>> +
>> +    rx_desc = IXGBE_RX_DESC(rx_ring, i);
>> +    bi = &rx_ring->rx_buffer_info[i];
>> +    i -= rx_ring->count;
>> +
>> +    do {
>> +        struct tpacket2_hdr *hdr = page_address(bi->page) + bi->hdr;
>> +
>> +        /* If user space has not released packet yet stop */
>> +        if (unlikely(TP_STATUS_USER & hdr->tp_status))
>> +            break;
>> +
>> +        /* Reclaim shared memory for next DMA */
>> +        dma_sync_single_range_for_device(rx_ring->dev, bi->dma,
>> +                         bi->page_offset,
>> +                         rx_ring->ddma_size,
>> +                         DMA_FROM_DEVICE);
>> +
>> +        /*
>> +         * Refresh the desc even if buffer_addrs didn't change
>> +         * because each write-back erases this info.
>> +         */
>> +        rx_desc->read.pkt_addr = cpu_to_le64(bi->dma + bi->page_offset);
>> +
>> +        rx_desc++;
>> +        bi++;
>> +        i++;
>> +        if (unlikely(!i)) {
>> +            rx_desc = IXGBE_RX_DESC(rx_ring, 0);
>> +            bi = rx_ring->rx_buffer_info;
>> +            i -= rx_ring->count;
>> +        }
>> +
>> +        /* clear the hdr_addr for the next_to_use descriptor */
>> +        rx_desc->read.hdr_addr = 0;
>> +
>> +        cleaned_count--;
>> +    } while (cleaned_count);
>> +
>> +    i += rx_ring->count;
>> +
>> +    if (rx_ring->next_to_use != i)
>> +        ixgbe_release_rx_desc(rx_ring, i);
>> +
>> +    return cleaned_count;
>> +}
>> +
>> +/**
>>    * ixgbe_alloc_rx_buffers - Replace used receive buffers
>>    * @rx_ring: ring to place buffers on
>>    * @cleaned_count: number of buffers to replace
>> @@ -1492,6 +1565,12 @@ void ixgbe_alloc_rx_buffers(struct ixgbe_ring *rx_ring, u16 cleaned_count)
>>       if (!cleaned_count)
>>           return;
>>
>> +    /* Debug helper remember to remove before official patch */
>> +    if (rx_ring->ddma) {
>> +        WARN_ON(rx_ring->ddma);
>> +        return;
>> +    }
>> +
>>       rx_desc = IXGBE_RX_DESC(rx_ring, i);
>>       bi = &rx_ring->rx_buffer_info[i];
>>       i -= rx_ring->count;
>> @@ -1908,6 +1987,12 @@ static void ixgbe_reuse_rx_page(struct ixgbe_ring *rx_ring,
>>       struct ixgbe_rx_buffer *new_buff;
>>       u16 nta = rx_ring->next_to_alloc;
>>
>> +    /* Debug statement to catch logic errors remove later */
>> +    if (rx_ring->ddma) {
>> +        WARN_ON(1);
>> +        return;
>> +    }
>> +
>>       new_buff = &rx_ring->rx_buffer_info[nta];
>>
>>       /* update, and store next to alloc */
>> @@ -2005,6 +2090,128 @@ static bool ixgbe_add_rx_frag(struct ixgbe_ring *rx_ring,
>>       return true;
>>   }
>>
>> +/* ixgbe_do_ddma - direct dma routine to populate PACKET_RX_RING mmap region
>> + *
>> + * The packet socket interface builds a shared memory region using mmap after
>> + * it is specified by the PACKET_RX_RING socket option. This will create a
>> + * circular ring of memory slots. Typical software usage case copies the skb
>> + * into these pages via tpacket_rcv() routine.
>> + *
>> + * Here we do direct DMA from the hardware (82599 in this case) into the
>> + * mmap regions and populate the uhdr (think user space descriptor). This
>> + * requires the hardware to support Scatter Gather and HighDMA which should
>> + * be standard on most 10/40 Gbps devices.
>> + *
>> + * The buffer mapping should have already been done so that rx_buffer pages
>> + * are handed to the driver from the mmap setup done at the socket layer.
>> + *
>> + * This routine may be moved into a generic method later.
>> + *
>> + * See ./include/uapi/linux/if_packet.h for details on the packet layout;
>> + * here we can only use the tpacket2_hdr type. v3 of the header introduced
>> + * bulk polling modes which do not work correctly with the hardware DMA
>> + * engine. The primary issue is that we cannot stop a DMA transaction from
>> + * occurring after it has been configured; the result is that the software
>> + * timer advances the ring ahead of the hardware and the ring state is lost.
>> + * Maybe there is a clever way to resolve this, but I haven't thought it up yet.
>> + *
>> + * TBD: integrate timesync with tp_sec, tp_nsec
>> + */
>> +static int ixgbe_do_ddma(struct ixgbe_ring *rx_ring,
>> +             union ixgbe_adv_rx_desc *rx_desc)
>> +{
>> +    struct ixgbe_adapter *adapter = netdev_priv(rx_ring->netdev);
>> +    struct ixgbe_rx_buffer *rx_buffer;
>> +    struct tpacket2_hdr *h2;        /* userspace descriptor */
>> +    struct sockaddr_ll *sll;
>> +    struct ethhdr *eth;
>> +#if 0 /* PTP implementation */
>> +    struct ixgbe_hw *hw = &adapter->hw;
>> +    unsigned long flags;
>> +    u64 regval = 0;
>> +    u32 tsyncrxctl;
>> +#endif
>> +    u64 ns = 0;
>> +    struct page *page;
>> +    int len = 0;
>> +    s32 rem;
>> +
>> +    rx_buffer = &rx_ring->rx_buffer_info[rx_ring->next_to_clean];
>> +
>> +    if (!rx_buffer->dma)
>> +        return -EBUSY;
>> +
>> +    page = rx_buffer->page;
>> +    prefetchw(page);
>> +
>> +    /* This indicates some obscure error case that needs to be handled
>> +     * gracefully
>> +     */
>> +    WARN_ON(ixgbe_test_staterr(rx_desc,
>> +                   IXGBE_RXDADV_ERR_FRAME_ERR_MASK) &&
>> +                   !(rx_ring->netdev->features & NETIF_F_RXALL));
>> +
>> +    dma_sync_single_range_for_cpu(rx_ring->dev,
>> +                      rx_buffer->dma,
>> +                      rx_buffer->page_offset,
>> +                      rx_ring->ddma_size,
>> +                      DMA_FROM_DEVICE);
>> +
>> +#if 0
>> +    /* use PTP for timestamps */
>> +    tsyncrxctl = IXGBE_READ_REG(hw, IXGBE_TSYNCRXCTL);
>> +    if (!(tsyncrxctl & IXGBE_TSYNCRXCTL_VALID)) {
>> +        e_warn(drv, "Direct DMA timestamp error aborting.");
>> +        return 0;
>> +    }
>> +
>> +    regval |= (u64)IXGBE_READ_REG(hw, IXGBE_RXSTMPL);
>> +    regval |= (u64)IXGBE_READ_REG(hw, IXGBE_RXSTMPH) << 32;
>> +
>> +    spin_lock_irqsave(&adapter->tmreg_lock, flags);
>> +    ns = 0; //timecounter_cyc2ns(&adapter->tc, regval); /* TBD */
>> +    spin_unlock_irqrestore(&adapter->tmreg_lock, flags);
>> +#endif
>> +
>> +    /* Update the last_rx_timestamp timer in order to enable the watchdog
>> +     * check for the error case of a latched timestamp on a dropped packet.
>> +     */
>> +    adapter->last_rx_timestamp = jiffies;
>> +
>> +    h2 = page_address(rx_buffer->page) + rx_buffer->hdr;
>> +    eth = page_address(rx_buffer->page) + rx_buffer->page_offset;
>> +
>> +    /* This would indicate a bug in ixgbe; left in for testing purposes */
>> +    WARN_ON(TP_STATUS_USER & h2->tp_status);
>> +
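>> +    /* fill in the user-visible tpacket2 descriptor for this frame */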
>> +    h2->tp_len = len = le16_to_cpu(rx_desc->wb.upper.length);
>> +    h2->tp_snaplen = len;
>> +    h2->tp_mac = ALIGN(TPACKET_ALIGN(TPACKET2_HDRLEN), L1_CACHE_BYTES);
>> +    h2->tp_net = ALIGN(TPACKET_ALIGN(TPACKET2_HDRLEN), L1_CACHE_BYTES) + ETH_HLEN;
>> +    h2->tp_sec = div_s64_rem(ns, NSEC_PER_SEC, &rem);
>> +    h2->tp_nsec = rem;
>> +
>> +    sll = (void *)h2 + TPACKET_ALIGN(sizeof(struct tpacket2_hdr));
>> +    sll->sll_halen = ETH_HLEN;
>> +    memcpy(sll->sll_addr, eth->h_source, ETH_ALEN);
>> +    sll->sll_family = AF_PACKET;
>> +    sll->sll_hatype = rx_ring->netdev->type;
>> +    sll->sll_protocol = eth->h_proto;
>> +    sll->sll_pkttype = PACKET_HOST;
>> +    sll->sll_ifindex = rx_ring->netdev->ifindex;
>> +
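>> +    /* make sure the descriptor fields above are visible before the
>> +     * status flip below hands the slot over to user space
>> +     */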
>> +    smp_mb();
>> +    h2->tp_status = TP_STATUS_USER;
>> +    rx_ring->sk->sk_data_ready(rx_ring->sk);
>> +
>> +    /* TBD: handle non-EOP frames? I think this is an invalid case,
>> +     * assuming the ring slots are set up correctly.
>> +     */
>> +    if (ixgbe_is_non_eop(rx_ring, rx_desc, NULL))
>> +        e_warn(drv, "Direct DMA received non-eop descriptor!");
>> +
>> +    return len;
>> +}
>>   static struct sk_buff *ixgbe_fetch_rx_buffer(struct ixgbe_ring *rx_ring,
>>                            union ixgbe_adv_rx_desc *rx_desc)
>>   {
>> @@ -2012,6 +2219,12 @@ static struct sk_buff *ixgbe_fetch_rx_buffer(struct ixgbe_ring *rx_ring,
>>       struct sk_buff *skb;
>>       struct page *page;
>>
>> +    /* Debug stmt to catch logic errors */
>> +    if (rx_ring->ddma) {
>> +        WARN_ON(1);
>> +        return NULL;
>> +    }
>> +
>>       rx_buffer = &rx_ring->rx_buffer_info[rx_ring->next_to_clean];
>>       page = rx_buffer->page;
>>       prefetchw(page);
>> @@ -2119,8 +2332,12 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
>>
>>           /* return some buffers to hardware, one at a time is too slow */
>>           if (cleaned_count >= IXGBE_RX_BUFFER_WRITE) {
>> -            ixgbe_alloc_rx_buffers(rx_ring, cleaned_count);
>> -            cleaned_count = 0;
>> +            if (rx_ring->ddma) {
>> +                cleaned_count = ixgbe_ddma_claim_buffers(rx_ring, cleaned_count);
>> +            } else {
>> +                ixgbe_alloc_rx_buffers(rx_ring, cleaned_count);
>> +                cleaned_count = 0;
>> +            }
>>           }
>>
>>           rx_desc = IXGBE_RX_DESC(rx_ring, rx_ring->next_to_clean);
>> @@ -2135,6 +2352,21 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
>>            */
>>           rmb();
>>
>> +        /* If we use direct DMA to shmem then we do not need SKBs
>> +         * because user space descriptors are populated directly.
>> +         */
>> +        if (rx_ring->ddma) {
>> +            int len = ixgbe_do_ddma(rx_ring, rx_desc);
>> +
>> +            if (len) {
>> +                total_rx_packets++;
>> +                total_rx_bytes += len;
>> +                cleaned_count++;
>> +                continue;
>> +            }
>> +            break;
>> +        }
>> +
>>           /* retrieve a buffer from the ring */
>>           skb = ixgbe_fetch_rx_buffer(rx_ring, rx_desc);
>>
>> @@ -2197,8 +2429,12 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
>>       q_vector->rx.total_packets += total_rx_packets;
>>       q_vector->rx.total_bytes += total_rx_bytes;
>>
>> -    if (cleaned_count)
>> -        ixgbe_alloc_rx_buffers(rx_ring, cleaned_count);
>> +    if (cleaned_count) {
>> +        if (rx_ring->ddma) /* TBD resolve stalls */
>> +            ixgbe_ddma_claim_buffers(rx_ring, cleaned_count);
>> +        else
>> +            ixgbe_alloc_rx_buffers(rx_ring, cleaned_count);
>> +    }
>>
>>       return total_rx_packets;
>>   }
>> @@ -3522,7 +3758,10 @@ void ixgbe_configure_rx_ring(struct ixgbe_adapter *adapter,
>>       IXGBE_WRITE_REG(hw, IXGBE_RXDCTL(reg_idx), rxdctl);
>>
>>       ixgbe_rx_desc_queue_enable(adapter, ring);
>> -    ixgbe_alloc_rx_buffers(ring, ixgbe_desc_unused(ring));
>> +    if (!ring->ddma)
>> +        ixgbe_alloc_rx_buffers(ring, ixgbe_desc_unused(ring));
>> +    else
>> +        ixgbe_ddma_claim_buffers(ring, ixgbe_desc_unused(ring));
>>   }
>>
>>   static void ixgbe_setup_psrtype(struct ixgbe_adapter *adapter)
>> @@ -4435,6 +4674,9 @@ static void ixgbe_clean_rx_ring(struct ixgbe_ring *rx_ring)
>>       if (!rx_ring->rx_buffer_info)
>>           return;
>>
>> +    if (rx_ring->ddma)
>> +        return;
>> +
>>       /* Free all the Rx ring sk_buffs */
>>       for (i = 0; i < rx_ring->count; i++) {
>>           struct ixgbe_rx_buffer *rx_buffer;
>> @@ -5398,6 +5640,12 @@ int ixgbe_setup_rx_resources(struct ixgbe_ring *rx_ring)
>>       if (rx_ring->q_vector)
>>           numa_node = rx_ring->q_vector->numa_node;
>>
>> +    /* Debug stmt to catch logic errors */
>> +    if (rx_ring->ddma) {
>> +        WARN_ON(1);
>> +        return 0;
>> +    }
>> +
>>       rx_ring->rx_buffer_info = vzalloc_node(size, numa_node);
>>       if (!rx_ring->rx_buffer_info)
>>           rx_ring->rx_buffer_info = vzalloc(size);
>> @@ -5514,6 +5762,12 @@ static void ixgbe_free_all_tx_resources(struct ixgbe_adapter *adapter)
>>    **/
>>   void ixgbe_free_rx_resources(struct ixgbe_ring *rx_ring)
>>   {
>> +    /* Debug stmt to catch logic errors */
>> +    if (rx_ring->ddma) {
>> +        WARN_ON(1);
>> +        return;
>> +    }
>> +
>>       ixgbe_clean_rx_ring(rx_ring);
>>
>>       vfree(rx_ring->rx_buffer_info);
>> @@ -7916,6 +8170,123 @@ static void ixgbe_fwd_del(struct net_device *pdev, void *priv)
>>       kfree(fwd_adapter);
>>   }
>>
>> +static int ixgbe_ddma_map(struct net_device *dev, unsigned int ring,
>> +              struct sock *sk, struct packet_ring_buffer *prb)
>> +{
>> +    struct ixgbe_adapter *adapter = netdev_priv(dev);
>> +    struct ixgbe_ring *rx_ring = adapter->rx_ring[ring];
>> +    unsigned int frames_per_blk = prb->frames_per_block;
>> +    unsigned int blk_nr = prb->pg_vec_len;
>> +    struct ixgbe_rx_buffer *bi;
>> +    int i, j, err = 0;
>> +
>> +    /* Verify we are given a valid ring */
>> +    if (ring >= adapter->num_rx_queues)
>> +        return -EINVAL;
>> +
>> +    /* Verify we have enough descriptors to support user space ring */
>> +    if (!frames_per_blk || ((blk_nr * frames_per_blk) != rx_ring->count)) {
>> +        e_warn(drv, "ddma map requires exactly %i ring slots\n",
>> +               rx_ring->count);
>> +        return -EBUSY;
>> +    }
>> +
>> +    /* Bring the queue down */
>> +    ixgbe_disable_rx_queue(adapter, rx_ring);
>> +    usleep_range(10000, 20000);
>> +    ixgbe_irq_disable_queues(adapter, ((u64)1 << ring));
>> +    ixgbe_clean_rx_ring(rx_ring);
>> +
>> +    rx_ring->ddma_size = prb->frame_size;
>> +
>> +    /* In Direct DMA mode the descriptor block and tpacket headers are
>> +     * held in fixed locations so we can pre-populate most of the fields.
>> +     * Similarly the pages, offsets, and sizes for DMA are pre-calculated
>> +     * to align with the user space ring and do not need to change.
>> +     *
>> +     * The layout is fixed to match the PF_PACKET layer in /net/packet/,
>> +     * which also invokes this routine via ndo_ddma_map().
>> +     */
>> +    for (i = 0; i < blk_nr; i++) {
>> +        char *kaddr = prb->pg_vec[i].buffer;
>> +        //struct tpacket_block_desc *desc;
>> +        unsigned int blk_size;
>> +        struct page *page;
>> +        size_t offset = 0;
>> +        dma_addr_t dma;
>> +
>> +        /* For DMA usage we can't use vmalloc */
>> +        if (is_vmalloc_addr(kaddr)) {
>> +            e_warn(drv, "vmalloc pages not supported in ddma\n");
>> +            err = -EINVAL;
>> +            goto unwind_map;
>> +        }
>> +
>> +        /* map page for use */
>> +        page = virt_to_page(kaddr);
>> +        blk_size = PAGE_SIZE << prb->pg_vec_order;
>> +        dma = dma_map_page(rx_ring->dev,
>> +                   page, 0, blk_size, DMA_FROM_DEVICE);
>> +        if (dma_mapping_error(rx_ring->dev, dma)) {
>> +            e_warn(drv, "ddma dma mapping error DMA_FROM_DEVICE\n");
>> +            rx_ring->rx_stats.alloc_rx_page_failed++;
>> +            err = -EBUSY;
>> +            goto unwind_map;
>> +        }
>> +
>> +        /* We may be able to place multiple frames per block; in that case
>> +         * advance the offset so pkt_addr is set correctly in each descriptor.
>> +         */
>> +        for (j = 0; j < frames_per_blk; j++) {
>> +            int hdrlen = ALIGN(TPACKET_ALIGN(TPACKET2_HDRLEN), L1_CACHE_BYTES);
>> +            int b = ((i * frames_per_blk) + j);
>> +
>> +            bi = &rx_ring->rx_buffer_info[b];
>> +            bi->hdr = offset;
>> +            bi->page = page;
>> +            bi->dma = dma;
>> +
>> +            /* ignore priv for now */
>> +            bi->page_offset = offset + hdrlen;
>> +
>> +            offset += rx_ring->ddma_size;
>> +        }
>> +    }
>> +
>> +    rx_ring->ddma = true;
>> +    rx_ring->sk = sk;
>> +unwind_map:
>> +    ixgbe_configure_rx_ring(adapter, rx_ring);
>> +    return err;
>> +}
>> +
>> +static void ixgbe_ddma_unmap(struct net_device *dev, unsigned int index)
>> +{
>> +    struct ixgbe_adapter *adapter = netdev_priv(dev);
>> +    struct ixgbe_ring *rx_ring = adapter->rx_ring[index];
>> +    int i;
>> +
>> +    rx_ring->ddma = false;
>> +
>> +    /* Unmap the direct DMA buffers; no sk_buffs are allocated in ddma mode */
>> +    for (i = 0; i < rx_ring->count; i++) {
>> +        struct ixgbe_rx_buffer *rx_buffer;
>> +
>> +        rx_buffer = &rx_ring->rx_buffer_info[i];
>> +        if (rx_buffer->dma)
>> +            dma_unmap_page(rx_ring->dev, rx_buffer->dma,
>> +                       rx_buffer->page_offset,
>> +                       DMA_FROM_DEVICE);
>> +        rx_buffer->dma = 0;
>> +        rx_buffer->page_offset = 0;
>> +        rx_buffer->page = NULL;
>> +    }
>> +
>> +    rtnl_lock();
>> +    ixgbe_setup_tc(dev, netdev_get_num_tc(dev));
>> +    rtnl_unlock();
>> +}
>> +
>>   static const struct net_device_ops ixgbe_netdev_ops = {
>>       .ndo_open        = ixgbe_open,
>>       .ndo_stop        = ixgbe_close,
>> @@ -7960,6 +8331,8 @@ static const struct net_device_ops ixgbe_netdev_ops = {
>>       .ndo_bridge_getlink    = ixgbe_ndo_bridge_getlink,
>>       .ndo_dfwd_add_station    = ixgbe_fwd_add,
>>       .ndo_dfwd_del_station    = ixgbe_fwd_del,
>> +    .ndo_ddma_map        = ixgbe_ddma_map,
>> +    .ndo_ddma_unmap        = ixgbe_ddma_unmap,
>>   };
>>
>>   /**
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index 774e539..fde9815 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -51,6 +51,8 @@
>>   #include <linux/neighbour.h>
>>   #include <uapi/linux/netdevice.h>
>>
>> +#include <net/af_packet.h>
>> +
>>   struct netpoll_info;
>>   struct device;
>>   struct phy_device;
>> @@ -1144,6 +1146,12 @@ struct net_device_ops {
>>                               struct net_device *dev,
>>                               void *priv);
>>       int            (*ndo_get_lock_subclass)(struct net_device *dev);
>> +    int            (*ndo_ddma_map) (struct net_device *dev,
>> +                         unsigned int rindex,
>> +                         struct sock *sk,
>> +                         struct packet_ring_buffer *rb);
>> +    void            (*ndo_ddma_unmap) (struct net_device *dev,
>> +                           unsigned int rindex);
>>   };
>>
>>   /**
>> diff --git a/include/net/af_packet.h b/include/net/af_packet.h
>> new file mode 100644
>> index 0000000..f1622da
>> --- /dev/null
>> +++ b/include/net/af_packet.h
>> @@ -0,0 +1,62 @@
>> +#ifndef __NET_AF_PACKET_H
>> +#define __NET_AF_PACKET_H
>> +
>> +#include <linux/timer.h>
>> +
>> +struct pgv {
>> +    char *buffer;
>> +};
>> +
>> +/* kbdq - kernel block descriptor queue */
>> +struct tpacket_kbdq_core {
>> +    struct pgv    *pkbdq;
>> +    unsigned int    feature_req_word;
>> +    unsigned int    hdrlen;
>> +    unsigned char    reset_pending_on_curr_blk;
>> +    unsigned char   delete_blk_timer;
>> +    unsigned short    kactive_blk_num;
>> +    unsigned short    blk_sizeof_priv;
>> +
>> +    /* last_kactive_blk_num:
>> +     * trick to see if user-space has caught up
>> +     * in order to avoid refreshing timer when every single pkt arrives.
>> +     */
>> +    unsigned short    last_kactive_blk_num;
>> +
>> +    char        *pkblk_start;
>> +    char        *pkblk_end;
>> +    int        kblk_size;
>> +    unsigned int    knum_blocks;
>> +    uint64_t    knxt_seq_num;
>> +    char        *prev;
>> +    char        *nxt_offset;
>> +    struct sk_buff    *skb;
>> +
>> +    atomic_t    blk_fill_in_prog;
>> +
>> +    /* Default is set to 8ms */
>> +#define DEFAULT_PRB_RETIRE_TOV    (8)
>> +
>> +    unsigned short  retire_blk_tov;
>> +    unsigned short  version;
>> +    unsigned long    tov_in_jiffies;
>> +
>> +    /* timer to retire an outstanding block */
>> +    struct timer_list retire_blk_timer;
>> +};
>> +
>> +struct packet_ring_buffer {
>> +    struct pgv        *pg_vec;
>> +
>> +    unsigned int        head;
>> +    unsigned int        frames_per_block;
>> +    unsigned int        frame_size;
>> +    unsigned int        frame_max;
>> +
>> +    unsigned int        pg_vec_order;
>> +    unsigned int        pg_vec_pages;
>> +    unsigned int        pg_vec_len;
>> +
>> +    unsigned int __percpu    *pending_refcnt;
>> +
>> +    bool            ddma;
>> +
>> +    struct tpacket_kbdq_core    prb_bdqc;
>> +};
>> +
>> +#endif /* __NET_AF_PACKET_H */
>> diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
>> index bac27fa..7db6aba 100644
>> --- a/include/uapi/linux/if_packet.h
>> +++ b/include/uapi/linux/if_packet.h
>> @@ -54,6 +54,7 @@ struct sockaddr_ll {
>>   #define PACKET_FANOUT            18
>>   #define PACKET_TX_HAS_OFF        19
>>   #define PACKET_QDISC_BYPASS        20
>> +#define PACKET_RX_DIRECT        21
>>
>>   #define PACKET_FANOUT_HASH        0
>>   #define PACKET_FANOUT_LB        1
>> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
>> index b85c67c..6fe0a3b 100644
>> --- a/net/packet/af_packet.c
>> +++ b/net/packet/af_packet.c
>> @@ -1970,6 +1970,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
>>       }
>>       spin_unlock(&sk->sk_receive_queue.lock);
>>
>> +    /* skb_copy_bits(skb, offset, to, len) */
>>       skb_copy_bits(skb, 0, h.raw + macoff, snaplen);
>>
>>       if (!(ts_status = tpacket_get_timestamp(skb, &ts, po->tp_tstamp)))
>> @@ -3256,6 +3257,37 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
>>           return packet_set_ring(sk, &req_u, 0,
>>               optname == PACKET_TX_RING);
>>       }
>> +    case PACKET_RX_DIRECT:
>> +    {
>> +        struct packet_ring_buffer *rb = &po->rx_ring;
>> +        struct net_device *dev;
>> +        unsigned int index;
>> +        int err;
>> +
>> +        if (optlen != sizeof(index))
>> +            return -EINVAL;
>> +        if (copy_from_user(&index, optval, sizeof(index)))
>> +            return -EFAULT;
>> +
>> +        /* This call only works after a bind call, which does a dev_hold,
>> +         * so we do not need to increment the dev refcount here.
>> +         */
>> +        dev = __dev_get_by_index(sock_net(sk), po->ifindex);
>> +        if (!dev)
>> +            return -EINVAL;
>> +
>> +        if (!dev->netdev_ops->ndo_ddma_map)
>> +            return -EOPNOTSUPP;
>> +
>> +        if (!atomic_read(&po->mapped))
>> +            return -EINVAL;
>> +
>> +        err = dev->netdev_ops->ndo_ddma_map(dev, index, sk, rb);
>> +        if (!err)
>> +            rb->ddma = true;
>> +
>> +        return err;
>> +    }
>>       case PACKET_COPY_THRESH:
>>       {
>>           int val;
>> @@ -3861,6 +3893,15 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
>>           if (atomic_read(&po->mapped))
>>               pr_err("packet_mmap: vma is busy: %d\n",
>>                      atomic_read(&po->mapped));
>> +
>> +        if (rb->ddma) {
>> +            struct net_device *dev =
>> +                __dev_get_by_index(sock_net(sk), po->ifindex);
>> +
>> +            if (dev && dev->netdev_ops->ndo_ddma_unmap)
>> +                dev->netdev_ops->ndo_ddma_unmap(dev, 0);
>> +            rb->ddma = false;
>> +        }
>>       }
>>       mutex_unlock(&po->pg_vec_lock);
>>
>> diff --git a/net/packet/internal.h b/net/packet/internal.h
>> index eb9580a..6257dab 100644
>> --- a/net/packet/internal.h
>> +++ b/net/packet/internal.h
>> @@ -10,65 +10,6 @@ struct packet_mclist {
>>       unsigned char        addr[MAX_ADDR_LEN];
>>   };
>>
>> -/* kbdq - kernel block descriptor queue */
>> -struct tpacket_kbdq_core {
>> -    struct pgv    *pkbdq;
>> -    unsigned int    feature_req_word;
>> -    unsigned int    hdrlen;
>> -    unsigned char    reset_pending_on_curr_blk;
>> -    unsigned char   delete_blk_timer;
>> -    unsigned short    kactive_blk_num;
>> -    unsigned short    blk_sizeof_priv;
>> -
>> -    /* last_kactive_blk_num:
>> -     * trick to see if user-space has caught up
>> -     * in order to avoid refreshing timer when every single pkt arrives.
>> -     */
>> -    unsigned short    last_kactive_blk_num;
>> -
>> -    char        *pkblk_start;
>> -    char        *pkblk_end;
>> -    int        kblk_size;
>> -    unsigned int    knum_blocks;
>> -    uint64_t    knxt_seq_num;
>> -    char        *prev;
>> -    char        *nxt_offset;
>> -    struct sk_buff    *skb;
>> -
>> -    atomic_t    blk_fill_in_prog;
>> -
>> -    /* Default is set to 8ms */
>> -#define DEFAULT_PRB_RETIRE_TOV    (8)
>> -
>> -    unsigned short  retire_blk_tov;
>> -    unsigned short  version;
>> -    unsigned long    tov_in_jiffies;
>> -
>> -    /* timer to retire an outstanding block */
>> -    struct timer_list retire_blk_timer;
>> -};
>> -
>> -struct pgv {
>> -    char *buffer;
>> -};
>> -
>> -struct packet_ring_buffer {
>> -    struct pgv        *pg_vec;
>> -
>> -    unsigned int        head;
>> -    unsigned int        frames_per_block;
>> -    unsigned int        frame_size;
>> -    unsigned int        frame_max;
>> -
>> -    unsigned int        pg_vec_order;
>> -    unsigned int        pg_vec_pages;
>> -    unsigned int        pg_vec_len;
>> -
>> -    unsigned int __percpu    *pending_refcnt;
>> -
>> -    struct tpacket_kbdq_core    prb_bdqc;
>> -};
>> -
>>   extern struct mutex fanout_mutex;
>>   #define PACKET_FANOUT_MAX    256
>>
>> diff --git a/tools/testing/selftests/net/psock_tpacket.c b/tools/testing/selftests/net/psock_tpacket.c
>> index 24adf70..279d56a 100644
>> --- a/tools/testing/selftests/net/psock_tpacket.c
>> +++ b/tools/testing/selftests/net/psock_tpacket.c
>> @@ -133,6 +133,19 @@ static void status_bar_update(void)
>>       }
>>   }
>>
>> +static void print_payload(void *pay, size_t len)
>> +{
>> +    unsigned char *payload = pay;
>> +    size_t i;
>> +
>> +    printf("payload (%zu bytes): ", len);
>> +    for (i = 0; i < len; i++) {
>> +        printf("0x%02x ", payload[i]);
>> +        if (((i + 1) % 32) == 0)
>> +            printf("\n");
>> +    }
>> +}
>> +
>>   static void test_payload(void *pay, size_t len)
>>   {
>>       struct ethhdr *eth = pay;
>> @@ -238,15 +251,15 @@ static void walk_v1_v2_rx(int sock, struct ring *ring)
>>
>>       bug_on(ring->type != PACKET_RX_RING);
>>
>> -    pair_udp_open(udp_sock, PORT_BASE);
>> -    pair_udp_setfilter(sock);
>> +    //pair_udp_open(udp_sock, PORT_BASE);
>> +    //pair_udp_setfilter(sock);
>>
>>       memset(&pfd, 0, sizeof(pfd));
>>       pfd.fd = sock;
>>       pfd.events = POLLIN | POLLERR;
>>       pfd.revents = 0;
>>
>> -    pair_udp_send(udp_sock, NUM_PACKETS);
>> +    //pair_udp_send(udp_sock, NUM_PACKETS);
>>
>>       while (total_packets < NUM_PACKETS * 2) {
>>           while (__v1_v2_rx_kernel_ready(ring->rd[frame_num].iov_base,
>> @@ -263,6 +276,8 @@ static void walk_v1_v2_rx(int sock, struct ring *ring)
>>               case TPACKET_V2:
>>                   test_payload((uint8_t *) ppd.raw + ppd.v2->tp_h.tp_mac,
>>                            ppd.v2->tp_h.tp_snaplen);
>> +                print_payload((uint8_t *) ppd.raw + ppd.v2->tp_h.tp_mac,
>> +                         ppd.v2->tp_h.tp_snaplen);
>>                   total_bytes += ppd.v2->tp_h.tp_snaplen;
>>                   break;
>>               }
>> @@ -273,12 +288,14 @@ static void walk_v1_v2_rx(int sock, struct ring *ring)
>>               __v1_v2_rx_user_ready(ppd.raw, ring->version);
>>
>>               frame_num = (frame_num + 1) % ring->rd_num;
>> +            printf("%i ", frame_num);
>>           }
>>
>> -        poll(&pfd, 1, 1);
>> +        printf("\npolling %i: ", frame_num);
>> +        poll(&pfd, 1, 1000);
>>       }
>>
>> -    pair_udp_close(udp_sock);
>> +    //pair_udp_close(udp_sock);
>>
>>       if (total_packets != 2 * NUM_PACKETS) {
>>           fprintf(stderr, "walk_v%d_rx: received %u out of %u pkts\n",
>> @@ -372,7 +389,7 @@ static void walk_v1_v2_tx(int sock, struct ring *ring)
>>
>>       pair_udp_setfilter(rcv_sock);
>>
>> -    ll.sll_ifindex = if_nametoindex("lo");
>> +    ll.sll_ifindex = if_nametoindex("p3p2");
>>       ret = bind(rcv_sock, (struct sockaddr *) &ll, sizeof(ll));
>>       if (ret == -1) {
>>           perror("bind");
>> @@ -687,7 +704,7 @@ static void bind_ring(int sock, struct ring *ring)
>>
>>       ring->ll.sll_family = PF_PACKET;
>>       ring->ll.sll_protocol = htons(ETH_P_ALL);
>> -    ring->ll.sll_ifindex = if_nametoindex("lo");
>> +    ring->ll.sll_ifindex = if_nametoindex("p3p2");
>>       ring->ll.sll_hatype = 0;
>>       ring->ll.sll_pkttype = 0;
>>       ring->ll.sll_halen = 0;
>> @@ -755,6 +772,19 @@ static const char *type_str[] = {
>>       [PACKET_TX_RING] = "PACKET_TX_RING",
>>   };
>>
>> +void direct_dma_ring(int sock)
>> +{
>> +    int ret;
>> +    int index = 0;
>> +
>> +    ret = setsockopt(sock, SOL_PACKET, PACKET_RX_DIRECT, &index, sizeof(index));
>> +    if (ret < 0)
>> +        printf("Failed direct dma socket with %i\n", ret);
>> +    else
>> +        printf("Configured a direct dma socket!\n");
>> +}
>> +
>>   static int test_tpacket(int version, int type)
>>   {
>>       int sock;
>> @@ -777,6 +807,7 @@ static int test_tpacket(int version, int type)
>>       setup_ring(sock, &ring, version, type);
>>       mmap_ring(sock, &ring);
>>       bind_ring(sock, &ring);
>> +    direct_dma_ring(sock);
>>       walk_ring(sock, &ring);
>>       unmap_ring(sock, &ring);
>>       close(sock);
>> @@ -789,13 +820,17 @@ int main(void)
>>   {
>>       int ret = 0;
>>
>> +#if 0
>>       ret |= test_tpacket(TPACKET_V1, PACKET_RX_RING);
>>       ret |= test_tpacket(TPACKET_V1, PACKET_TX_RING);
>> +#endif
>>
>>       ret |= test_tpacket(TPACKET_V2, PACKET_RX_RING);
>> +#if 0
>>       ret |= test_tpacket(TPACKET_V2, PACKET_TX_RING);
>>
>>       ret |= test_tpacket(TPACKET_V3, PACKET_RX_RING);
>> +#endif
>>
>>       if (ret)
>>           return 1;
>>


* Re: [PATCH net-next 12/18] IB/mlx5: Add kernel offload flow-tag
  2016-06-17 16:00   ` Alexei Starovoitov
  2016-06-17 16:50     ` John Fastabend
@ 2016-06-17 22:31     ` Saeed Mahameed
  2016-06-17 23:34       ` Eric Dumazet
  2016-06-21  2:18       ` Alexei Starovoitov
  1 sibling, 2 replies; 33+ messages in thread
From: Saeed Mahameed @ 2016-06-17 22:31 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Saeed Mahameed, David S. Miller, Linux Netdev List, Doug Ledford,
	Or Gerlitz, Maor Gottlieb, Huy Nguyen, Tal Alon, Eric Dumazet,
	Daniel Borkmann

On Fri, Jun 17, 2016 at 7:00 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Fri, Jun 17, 2016 at 05:43:53PM +0300, Saeed Mahameed wrote:
>> From: Maor Gottlieb <maorg@mellanox.com>
>>
>> Add kernel offload flow tag for packets that will bypass the kernel
>> stack, e.g (RoCE/RDMA/RAW ETH (DPDK), etc ..).
>
> so the whole series is an elaborate way to enable dpdk? how nice.

NO, God forbid! The whole series has nothing to do with dpdk!
Please read the cover letter.

Quoting my own words from the cover letter:
"This patch set introduces mlx5 RoCE/RDMA packet sniffer, it allows
mlx5e netdevice to receive RoCE/RDMA or RAW ETH traffic which isn't
supposed to be passed to the kernel stack, for sniffing and diagnostics
purposes."

We simply want to be able to selectively see standard RoCE/RDMA ETH
traffic in tcpdump, for diagnostic purposes.
So, in order not to overwhelm the kernel TCP/IP stack with this
traffic, this patch in particular configures the ConnectX4 hardware to
tag those packets; in downstream patches the mlx5 netdevice will mark
the SKBs of those packets to skip the TCP/IP stack and go only to
tcpdump.

DPDK is not enabled/disabled or even slightly affected in this series.
It was just given as an example in this patch commit message for
traffic that can be sniffed in standard tools such as tcpdump.

Today there are some bad usages and abuses of skb->protocol, where some
device drivers set skb->protocol = 0xffff to skip the kernel TCP/IP
processing for the same diagnostic purposes.
In this series we are just trying to do the right thing.
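
For readers following along, a minimal sketch of the two pieces described
above; this is an illustration, not code from the patches, and
mlx5e_cqe_flow_tag() / MLX5_OFFLOAD_FLOW_TAG are hypothetical names
(the series calls the new packet type PACKET_OFFLOAD_KERNEL):

    /* driver RX path: mark flow-tagged completions so the stack
     * delivers them to the taps (ptype_all) only */
    if (mlx5e_cqe_flow_tag(cqe) == MLX5_OFFLOAD_FLOW_TAG)
            skb->pkt_type = PACKET_OFFLOAD_KERNEL;

    /* core RX path: in __netif_receive_skb_core(), after the ptype_all
     * taps have run, skip the protocol-specific handlers */
    if (skb->pkt_type == PACKET_OFFLOAD_KERNEL)
            goto drop;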


* Re: [PATCH net-next 13/18] net: Add offload kernel net stack packet type
  2016-06-17 15:15   ` Daniel Borkmann
@ 2016-06-17 22:54     ` Saeed Mahameed
  0 siblings, 0 replies; 33+ messages in thread
From: Saeed Mahameed @ 2016-06-17 22:54 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Saeed Mahameed, David S. Miller, Linux Netdev List, Doug Ledford,
	Or Gerlitz, Maor Gottlieb, Huy Nguyen, Tal Alon, Patrick McHardy,
	Eric Dumazet, ast

On Fri, Jun 17, 2016 at 6:15 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> On 06/17/2016 04:43 PM, Saeed Mahameed wrote:
>>
>> From: Maor Gottlieb <maorg@mellanox.com>
>>
>> Add new packet type to skip kernel specific protocol handlers.
>>
>> This is needed so device drivers can pass packets up to user space
>> (af_packet/tcpdump, etc..) without the need for them to go through
>> the whole kernel data path.
>>
>> Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
>> CC: David S. Miller <davem@davemloft.net>
>> CC: Patrick McHardy <kaber@trash.net>
>> CC: Eric Dumazet <edumazet@google.com>
>> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
>> ---
>>   include/linux/skbuff.h         | 6 +++---
>>   include/uapi/linux/if_packet.h | 1 +
>>   net/core/dev.c                 | 4 ++++
>>   3 files changed, 8 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>> index dc0fca7..359724e 100644
>> --- a/include/linux/skbuff.h
>> +++ b/include/linux/skbuff.h
>> @@ -694,14 +694,14 @@ struct sk_buff {
>>
>>   /* if you move pkt_type around you also must adapt those constants */
>>   #ifdef __BIG_ENDIAN_BITFIELD
>> -#define PKT_TYPE_MAX   (7 << 5)
>> +#define PKT_TYPE_MAX   (8 << 5)
>>   #else
>> -#define PKT_TYPE_MAX   7
>> +#define PKT_TYPE_MAX   8
>>   #endif
>
>
> Aehm ... did you actually test this with BPF ?!
>
> PKT_TYPE_MAX is a mask (naming could be better no doubt), see also function
> convert_skb_access():
>
> [...]
>         case SKF_AD_PKTTYPE:
>                 *insn++ = BPF_LDX_MEM(BPF_B, dst_reg, src_reg,
> PKT_TYPE_OFFSET());
>                 *insn++ = BPF_ALU32_IMM(BPF_AND, dst_reg, PKT_TYPE_MAX);
> #ifdef __BIG_ENDIAN_BITFIELD
>                 *insn++ = BPF_ALU32_IMM(BPF_RSH, dst_reg, 5);
> #endif
>                 break;
> [...]
>
> Also, dunno if it's worth burning a skb bit for one driver.
>

Oops! It didn't occur to me that it was used as a mask!
Does this mean we are out of PKT_TYPE flags?
Maybe we can use skb->mark instead? Any ideas?

Also, I am not sure it is a one-driver thing; I know about other
examples where some device drivers set skb->protocol to 0xffff to
achieve the same, which I didn't like.
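
To spell out the mask point with a standalone sketch (plain userspace C,
not kernel code): pkt_type is a 3-bit bitfield and PKT_TYPE_MAX doubles
as the AND-mask used to extract it, so it must stay an all-ones value
such as 7 (0b111); 8 (0b1000) is not a mask and would discard every
existing pkt_type value:

    #include <assert.h>
    #include <stdio.h>

    int main(void)
    {
            /* a byte holding pkt_type = 5 in its low three bits,
             * plus one unrelated flag bit */
            unsigned char b = 0x05 | 0x40;

            assert((b & 7) == 5);  /* the current mask extracts the field */
            printf("%d\n", b & 8); /* the proposed value prints 0 */
            return 0;
    }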


* Re: [PATCH net-next 12/18] IB/mlx5: Add kernel offload flow-tag
  2016-06-17 22:31     ` Saeed Mahameed
@ 2016-06-17 23:34       ` Eric Dumazet
  2016-06-19 14:27         ` Saeed Mahameed
  2016-06-21  2:18       ` Alexei Starovoitov
  1 sibling, 1 reply; 33+ messages in thread
From: Eric Dumazet @ 2016-06-17 23:34 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Alexei Starovoitov, Saeed Mahameed, David S. Miller,
	Linux Netdev List, Doug Ledford, Or Gerlitz, Maor Gottlieb,
	Huy Nguyen, Tal Alon, Daniel Borkmann

On Fri, Jun 17, 2016 at 3:31 PM, Saeed Mahameed
<saeedm@dev.mellanox.co.il> wrote:
> On Fri, Jun 17, 2016 at 7:00 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> On Fri, Jun 17, 2016 at 05:43:53PM +0300, Saeed Mahameed wrote:
>>> From: Maor Gottlieb <maorg@mellanox.com>
>>>
>>> Add kernel offload flow tag for packets that will bypass the kernel
>>> stack, e.g (RoCE/RDMA/RAW ETH (DPDK), etc ..).
>>
>> so the whole series is an elaborate way to enable dpdk? how nice.
>
> NO, God forbid! The whole series has nothing to do with dpdk!
> Please read the cover letter.
>
> Quoting my own words from the cover letter:
> "This patch set introduces mlx5 RoCE/RDMA packet sniffer, it allows
> mlx5e netdevice to receive RoCE/RDMA or RAW ETH traffic which isn't
> supposed to be passed to the kernel stack, for sniffing and diagnostics
> purposes."
>
> We simply want to be able to selectively see standard RoCE/RDMA ETH
> traffic in tcpdump, for diagnostic purposes.
> So, in order not to overwhelm the kernel TCP/IP stack with this
> traffic, this patch in particular configures the ConnectX4 hardware to
> tag those packets; in downstream patches the mlx5 netdevice will mark
> the SKBs of those packets to skip the TCP/IP stack and go only to
> tcpdump.
>
> DPDK is not enabled/disabled or even slightly affected in this series.
> It was just given as an example in this patch commit message for
> traffic that can be sniffed in standard tools such as tcpdump.
>
> Today there are some bad usages and abuses of skb->protocol, where some
> device drivers set skb->protocol = 0xffff to skip the kernel TCP/IP
> processing for the same diagnostic purposes.
> In this series we are just trying to do the right thing.

Well, your patch adds an extra test in the kernel fast path, just to
ease the life of people using kernel bypass, but who want to use
tcpdump because they cannot figure out how to do this in user space
properly.

Please find a way that does _not_ add a single instruction to the
kernel fast path.

Thanks.


* Re: [PATCH net-next 12/18] IB/mlx5: Add kernel offload flow-tag
  2016-06-17 23:34       ` Eric Dumazet
@ 2016-06-19 14:27         ` Saeed Mahameed
  0 siblings, 0 replies; 33+ messages in thread
From: Saeed Mahameed @ 2016-06-19 14:27 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexei Starovoitov, Saeed Mahameed, David S. Miller,
	Linux Netdev List, Doug Ledford, Or Gerlitz, Maor Gottlieb,
	Huy Nguyen, Tal Alon, Daniel Borkmann

On Sat, Jun 18, 2016 at 2:34 AM, Eric Dumazet <edumazet@google.com> wrote:
> On Fri, Jun 17, 2016 at 3:31 PM, Saeed Mahameed
> <saeedm@dev.mellanox.co.il> wrote:
>>
>> Today there are some bad usages and abuses of skb->protocol, where some
>> device drivers set skb->protocol = 0xffff to skip the kernel TCP/IP
>> processing for the same diagnostic purposes.
>> In this series we are just trying to do the right thing.
>
> Well, your patch adds an extra test in the kernel fast path, just to
> ease the life of people using kernel bypass, but who want to use
> tcpdump because they cannot figure out how to do this in user space
> properly.
>
> Please find a way that does _not_ add a single instruction to the
> kernel fast path.
>

Well, we can set skb->protocol = 0xffff.
What do you think?
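
(For context, the pattern being referred to is roughly the driver-side
hack below; this is a sketch, not code from any specific driver. A bogus
ethertype means no ptype_base protocol handler matches, so only the
ptype_all taps such as tcpdump ever see the frame.)

    skb->protocol = 0xffff;  /* bogus ethertype: no protocol handler matches */
    netif_receive_skb(skb);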


* Re: [PATCH net-next 12/18] IB/mlx5: Add kernel offload flow-tag
  2016-06-17 22:31     ` Saeed Mahameed
  2016-06-17 23:34       ` Eric Dumazet
@ 2016-06-21  2:18       ` Alexei Starovoitov
  2016-06-21 13:04         ` Saeed Mahameed
  1 sibling, 1 reply; 33+ messages in thread
From: Alexei Starovoitov @ 2016-06-21  2:18 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Saeed Mahameed, David S. Miller, Linux Netdev List, Doug Ledford,
	Or Gerlitz, Maor Gottlieb, Huy Nguyen, Tal Alon, Eric Dumazet,
	Daniel Borkmann

On Sat, Jun 18, 2016 at 01:31:26AM +0300, Saeed Mahameed wrote:
> 
> We simply want to be able to selectively see standard RoCE/RDMA ETH
> traffic in tcpdump, for diagnostic purposes.
> So, in order not to overwhelm the kernel TCP/IP stack with this
> traffic, this patch in particular configures the ConnectX4 hardware to
> tag those packets; in downstream patches the mlx5 netdevice will mark
> the SKBs of those packets to skip the TCP/IP stack and go only to
> tcpdump.

Such a 'feature' doesn't make sense to me.
'Not overwhelming' the kernel, but 'overwhelming' userspace tcpdump?
The kernel can drop packets way faster than userspace, so such a bypass
scheme has no practical usage other than building a first step towards
a complete dpdk-like bypass.


* Re: [PATCH net-next 12/18] IB/mlx5: Add kernel offload flow-tag
  2016-06-21  2:18       ` Alexei Starovoitov
@ 2016-06-21 13:04         ` Saeed Mahameed
  2016-06-21 15:18           ` Eric Dumazet
  0 siblings, 1 reply; 33+ messages in thread
From: Saeed Mahameed @ 2016-06-21 13:04 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Saeed Mahameed, David S. Miller, Linux Netdev List, Doug Ledford,
	Or Gerlitz, Maor Gottlieb, Huy Nguyen, Tal Alon, Eric Dumazet,
	Daniel Borkmann

On Tue, Jun 21, 2016 at 5:18 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Sat, Jun 18, 2016 at 01:31:26AM +0300, Saeed Mahameed wrote:
>>
>> We simply want to be able to selectively see standard RoCE/RDMA ETH
>> traffic in tcpdump, for diagnostic purposes.
>> So, in order not to overwhelm the kernel TCP/IP stack with this
>> traffic, this patch in particular configures the ConnectX4 hardware to
>> tag those packets; in downstream patches the mlx5 netdevice will mark
>> the SKBs of those packets to skip the TCP/IP stack and go only to
>> tcpdump.
>
> Such a 'feature' doesn't make sense to me.
> 'Not overwhelming' the kernel, but 'overwhelming' userspace tcpdump?
> The kernel can drop packets way faster than userspace, so such a bypass
> scheme has no practical usage other than building a first step towards
> a complete dpdk-like bypass.
>

Alexei, I don't understand your concern.
We already have a full, working dpdk bypass solution in userspace;
nothing extra is required from the kernel.

We just want to see this traffic, and any other RDMA traffic, in
tcpdump or other standard sniffing tools.

Anyway, we brainstormed this internally today and we don't like the
"skb->protocol = 0xffff" solution;
we will suggest a plugin for libpcap in user space to extend libpcap's
ability to sniff RDMA/raw eth traffic.
This way, userspace RDMA traffic will also be sniffed via a userspace
RDMA channel.

I will ask Dave to drop this series.


* Re: [PATCH net-next 00/18] mlx5 RoCE/RDMA packet sniffer
  2016-06-17 14:43 [PATCH net-next 00/18] mlx5 RoCE/RDMA packet sniffer Saeed Mahameed
                   ` (17 preceding siblings ...)
  2016-06-17 14:43 ` [PATCH net-next 18/18] net/mlx5e: Add netdev hw feature flag offload-sniffer Saeed Mahameed
@ 2016-06-21 13:10 ` Saeed Mahameed
  2016-06-22 18:52   ` David Miller
  18 siblings, 1 reply; 33+ messages in thread
From: Saeed Mahameed @ 2016-06-21 13:10 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David S. Miller, Linux Netdev List, Doug Ledford, Or Gerlitz,
	Maor Gottlieb, Huy Nguyen, Tal Alon

On Fri, Jun 17, 2016 at 5:43 PM, Saeed Mahameed <saeedm@mellanox.com> wrote:
> Hi Dave,
>
> This patch set introduces mlx5 RoCE/RDMA packet sniffer, it allows
> mlx5e netdevice to receive RoCE/RDMA or RAW ETH traffic which isn't
> supposed to be passed to the kernel stack, for sniffing and diagnostics
> purposes.  This traffic is still not supposed to go through the whole
> network stack processing and should only go to the non-protocol specific
> handlers (ptype_all). e.g: tcpdump, etc ..

Dave,

I would like to drop this series; we did some evaluation and there is
no straightforward way to tell the kernel not to process traffic that
should only go to tcpdump (ptype_all).

We are evaluating a semi-pure userspace solution that sniffs user space
traffic directly from userspace into libpcap (a verbs/rdma
plugin/extension for libpcap).

Sorry for any inconvenience,

Thanks,
Saeed.


* Re: [PATCH net-next 12/18] IB/mlx5: Add kernel offload flow-tag
  2016-06-21 13:04         ` Saeed Mahameed
@ 2016-06-21 15:18           ` Eric Dumazet
  2016-06-21 15:41             ` Or Gerlitz
  0 siblings, 1 reply; 33+ messages in thread
From: Eric Dumazet @ 2016-06-21 15:18 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Alexei Starovoitov, Saeed Mahameed, David S. Miller,
	Linux Netdev List, Doug Ledford, Or Gerlitz, Maor Gottlieb,
	Huy Nguyen, Tal Alon, Daniel Borkmann

On Tue, Jun 21, 2016 at 6:04 AM, Saeed Mahameed
<saeedm@dev.mellanox.co.il> wrote:

>
> Alexei, I don't understand your concern.
> We already have a full, working dpdk bypass solution in userspace;
> nothing extra is required from the kernel.
>
> We just want to see this traffic, and any other RDMA traffic, in
> tcpdump or other standard sniffing tools.
>
> Anyway, we brainstormed this internally today and we don't like the
> "skb->protocol = 0xffff" solution;

One solution would be to set up a special netdev used only for sniffers
(no IP address on it).

-> Only changes would happen in the driver, to set skb->dev to this
'debug' device.
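
A rough sketch of that suggestion; all mlx5e_sniffer_* names here are
hypothetical, nothing below is from the series:

    /* probe time: register an address-less netdev that exists only
     * so packet taps have something to bind to */
    static struct net_device *mlx5e_sniffer_dev;

    static int mlx5e_sniffer_create(void)
    {
            mlx5e_sniffer_dev = alloc_netdev(0, "rocesniff%d",
                                             NET_NAME_UNKNOWN, ether_setup);
            if (!mlx5e_sniffer_dev)
                    return -ENOMEM;
            return register_netdev(mlx5e_sniffer_dev);
    }

    /* RX path: retarget only the flow-tagged skbs at the debug device;
     * the fast path for normal traffic is left untouched */
    skb->dev = mlx5e_sniffer_dev;
    netif_receive_skb(skb);

Since the sniffer netdev never gets an IP address, the frames are seen
by the taps and then dropped by the upper layers.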


* Re: [PATCH net-next 12/18] IB/mlx5: Add kernel offload flow-tag
  2016-06-21 15:18           ` Eric Dumazet
@ 2016-06-21 15:41             ` Or Gerlitz
  0 siblings, 0 replies; 33+ messages in thread
From: Or Gerlitz @ 2016-06-21 15:41 UTC (permalink / raw)
  To: Eric Dumazet, Saeed Mahameed
  Cc: Alexei Starovoitov, Saeed Mahameed, David S. Miller,
	Linux Netdev List, Doug Ledford, Maor Gottlieb, Huy Nguyen,
	Tal Alon, Daniel Borkmann

On 6/21/2016 6:18 PM, Eric Dumazet wrote:
> One solution would be to set up a special netdev used only for sniffers
> (no IP address on it).
>
> -> Only changes would happen in the driver, to set skb->dev to this
> 'debug' device.

Eric,

Yep, that was an option too, but then we realized that libpcap has the
means to add plug-ins for non-netdevices (e.g. USB and CAN devices, and
now we are thinking of adding one for uverbs). That means we can avoid
adding tons of pretty complex code to the kernel driver and happily have
simpler code in user space, so... why not? We will try that first.

Or.


* Re: [PATCH net-next 00/18] mlx5 RoCE/RDMA packet sniffer
  2016-06-21 13:10 ` [PATCH net-next 00/18] mlx5 RoCE/RDMA packet sniffer Saeed Mahameed
@ 2016-06-22 18:52   ` David Miller
  0 siblings, 0 replies; 33+ messages in thread
From: David Miller @ 2016-06-22 18:52 UTC (permalink / raw)
  To: saeedm; +Cc: saeedm, netdev, dledford, ogerlitz, maorg, huyn, talal

From: Saeed Mahameed <saeedm@dev.mellanox.co.il>
Date: Tue, 21 Jun 2016 16:10:26 +0300

> I would like to drop this series,

Don't worry, I dropped it several days before this.



Thread overview: 33+ messages
2016-06-17 14:43 [PATCH net-next 00/18] mlx5 RoCE/RDMA packet sniffer Saeed Mahameed
2016-06-17 14:43 ` [PATCH net-next 01/18] net/mlx5: Refactor mlx5_add_flow_rule Saeed Mahameed
2016-06-17 14:43 ` [PATCH net-next 02/18] net/mlx5: Introduce mlx5_flow_steering structure Saeed Mahameed
2016-06-17 14:43 ` [PATCH net-next 03/18] net/mlx5: Properly remove all steering objects Saeed Mahameed
2016-06-17 14:43 ` [PATCH net-next 04/18] net/mlx5: Add hold/put rules refcount API Saeed Mahameed
2016-06-17 14:43 ` [PATCH net-next 05/18] net/mlx5: Add support to add/del flow rule notifiers Saeed Mahameed
2016-06-17 14:43 ` [PATCH net-next 06/18] net/mlx5: Introduce table of function pointer steering commands Saeed Mahameed
2016-06-17 14:43 ` [PATCH net-next 07/18] net/mlx5: Introduce nop " Saeed Mahameed
2016-06-17 14:43 ` [PATCH net-next 08/18] if_ether.h: Add RoCE Ethertype Saeed Mahameed
2016-06-17 14:43 ` [PATCH net-next 09/18] IB/mlx5: Create RoCE root namespace Saeed Mahameed
2016-06-17 14:43 ` [PATCH net-next 10/18] net/mlx5: Introduce get flow rule match API Saeed Mahameed
2016-06-17 14:43 ` [PATCH net-next 11/18] net/mlx5: Add sniffer namespaces Saeed Mahameed
2016-06-17 14:43 ` [PATCH net-next 12/18] IB/mlx5: Add kernel offload flow-tag Saeed Mahameed
2016-06-17 16:00   ` Alexei Starovoitov
2016-06-17 16:50     ` John Fastabend
2016-06-17 22:31     ` Saeed Mahameed
2016-06-17 23:34       ` Eric Dumazet
2016-06-19 14:27         ` Saeed Mahameed
2016-06-21  2:18       ` Alexei Starovoitov
2016-06-21 13:04         ` Saeed Mahameed
2016-06-21 15:18           ` Eric Dumazet
2016-06-21 15:41             ` Or Gerlitz
2016-06-17 14:43 ` [PATCH net-next 13/18] net: Add offload kernel net stack packet type Saeed Mahameed
2016-06-17 15:12   ` Eric Dumazet
2016-06-17 15:15   ` Daniel Borkmann
2016-06-17 22:54     ` Saeed Mahameed
2016-06-17 14:43 ` [PATCH net-next 14/18] net/mlx5e: Set sniffer skbs packet type to offload kernel Saeed Mahameed
2016-06-17 14:43 ` [PATCH net-next 15/18] net/mlx5: Introduce sniffer steering hardware capabilities Saeed Mahameed
2016-06-17 14:43 ` [PATCH net-next 16/18] net/mlx5e: Sniffer support for kernel offload (RoCE) traffic Saeed Mahameed
2016-06-17 14:43 ` [PATCH net-next 17/18] net/mlx5e: Lock device state in set features Saeed Mahameed
2016-06-17 14:43 ` [PATCH net-next 18/18] net/mlx5e: Add netdev hw feature flag offload-sniffer Saeed Mahameed
2016-06-21 13:10 ` [PATCH net-next 00/18] mlx5 RoCE/RDMA packet sniffer Saeed Mahameed
2016-06-22 18:52   ` David Miller
