netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net-next 0/8] Mellanox ethernet driver update Oct-30-2014
@ 2014-10-30 16:06 Or Gerlitz
  2014-10-30 16:06 ` [PATCH net-next 1/8] net/mlx4_core: Prevent VF from changing port configuration Or Gerlitz
                   ` (7 more replies)
  0 siblings, 8 replies; 32+ messages in thread
From: Or Gerlitz @ 2014-10-30 16:06 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Matan Barak, Amir Vadai, Saeed Mahameed, Shani Michaeli,
	Or Gerlitz

Hi Dave,

The 1st patch from Saeed fixes a bug in the last net-next batch where
a VF could get access to set port configuration, the next patch from Amir
fixes a race in the port VPI logic. Next are two performance patches from Ido.

The last four patches from Shani, Matan and myself add support for CHECKSUM_COMPLETE 
reporting on non TCP/UDP packets such as GRE and ICMP. I'd like to deeply thank 
Jerry Chu for his innovation and support in that effort.

Or.

Amir Vadai (1):
  net/mlx4_core: Protect port type setting by mutex

Ido Shamay (2):
  net/mlx4_en: Remove RX buffers alignment to IP_ALIGN
  net/mlx4_en: Add __GFP_COLD gfp flags in alloc_pages

Matan Barak (1):
  net/mlx4_core: Add retrieval of CONFIG_DEV parameters

Or Gerlitz (1):
  net/mlx4_en: Remove redundant code from RX/GRO path

Saeed Mahameed (1):
  net/mlx4_core: Prevent VF from changing port configuration

Shani Michaeli (1):
  net: Add calaulation of non folded IPV6 pseudo header checksum
  net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE

 drivers/net/ethernet/mellanox/mlx4/cmd.c           |    6 +-
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c    |    2 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c     |    5 +
 drivers/net/ethernet/mellanox/mlx4/en_port.c       |    2 +
 drivers/net/ethernet/mellanox/mlx4/en_rx.c         |  186 ++++++++++++--------
 drivers/net/ethernet/mellanox/mlx4/fw.c            |  118 ++++++++++++-
 drivers/net/ethernet/mellanox/mlx4/main.c          |   18 ++-
 drivers/net/ethernet/mellanox/mlx4/mlx4.h          |   10 +
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h       |    6 +-
 .../net/ethernet/mellanox/mlx4/resource_tracker.c  |   17 ++
 include/linux/mlx4/cmd.h                           |   29 +++
 include/linux/mlx4/device.h                        |    4 +-
 include/net/ip6_checksum.h                         |   21 +++
 13 files changed, 339 insertions(+), 85 deletions(-)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH net-next 1/8] net/mlx4_core: Prevent VF from changing port configuration
  2014-10-30 16:06 [PATCH net-next 0/8] Mellanox ethernet driver update Oct-30-2014 Or Gerlitz
@ 2014-10-30 16:06 ` Or Gerlitz
  2014-10-30 16:06 ` [PATCH net-next 2/8] net/mlx4_core: Protect port type setting by mutex Or Gerlitz
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 32+ messages in thread
From: Or Gerlitz @ 2014-10-30 16:06 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Matan Barak, Amir Vadai, Saeed Mahameed, Shani Michaeli

From: Saeed Mahameed <saeedm@mellanox.com>

Added wrapper to the ACCESS_REG command for handling guest HW
registers access, preventing write operations, but do allow reads.

This will prevent SRIOV guests to change port PTYS configuration,
such as speed/advertised link modes.

Fixes: adbc7ac5c15e ('net/mlx4_core: Introduce ACCESS_REG CMD [...]')
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/cmd.c  |    2 +-
 drivers/net/ethernet/mellanox/mlx4/fw.c   |   30 ++++++++++++++++++++++++++++-
 drivers/net/ethernet/mellanox/mlx4/mlx4.h |    5 ++++
 3 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c b/drivers/net/ethernet/mellanox/mlx4/cmd.c
index 916459e..1312ccf 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
@@ -1345,7 +1345,7 @@ static struct mlx4_cmd_info cmd_info[] = {
 		.out_is_imm = false,
 		.encode_slave_id = false,
 		.verify = NULL,
-		.wrapper = NULL,
+		.wrapper = mlx4_ACCESS_REG_wrapper,
 	},
 	/* Native multicast commands are not available for guests */
 	{
diff --git a/drivers/net/ethernet/mellanox/mlx4/fw.c b/drivers/net/ethernet/mellanox/mlx4/fw.c
index 72289ef..e7639e3 100644
--- a/drivers/net/ethernet/mellanox/mlx4/fw.c
+++ b/drivers/net/ethernet/mellanox/mlx4/fw.c
@@ -2220,7 +2220,7 @@ static int mlx4_ACCESS_REG(struct mlx4_dev *dev, u16 reg_id,
 	memcpy(inbuf->reg_data, reg_data, reg_len);
 	err = mlx4_cmd_box(dev, inbox->dma, outbox->dma, 0, 0,
 			   MLX4_CMD_ACCESS_REG, MLX4_CMD_TIME_CLASS_C,
-			   MLX4_CMD_NATIVE);
+			   MLX4_CMD_WRAPPED);
 	if (err)
 		goto out;
 
@@ -2263,3 +2263,31 @@ int mlx4_ACCESS_PTYS_REG(struct mlx4_dev *dev,
 			       method, sizeof(*ptys_reg), ptys_reg);
 }
 EXPORT_SYMBOL_GPL(mlx4_ACCESS_PTYS_REG);
+
+int mlx4_ACCESS_REG_wrapper(struct mlx4_dev *dev, int slave,
+			    struct mlx4_vhcr *vhcr,
+			    struct mlx4_cmd_mailbox *inbox,
+			    struct mlx4_cmd_mailbox *outbox,
+			    struct mlx4_cmd_info *cmd)
+{
+	struct mlx4_access_reg *inbuf = inbox->buf;
+	u8 method = inbuf->method & MLX4_ACCESS_REG_METHOD_MASK;
+	u16 reg_id = be16_to_cpu(inbuf->reg_id);
+
+	if (slave != mlx4_master_func_num(dev) &&
+	    method == MLX4_ACCESS_REG_WRITE)
+		return -EPERM;
+
+	if (reg_id == MLX4_REG_ID_PTYS) {
+		struct mlx4_ptys_reg *ptys_reg =
+			(struct mlx4_ptys_reg *)inbuf->reg_data;
+
+		ptys_reg->local_port =
+			mlx4_slave_convert_port(dev, slave,
+						ptys_reg->local_port);
+	}
+
+	return mlx4_cmd_box(dev, inbox->dma, outbox->dma, vhcr->in_modifier,
+			    0, MLX4_CMD_ACCESS_REG, MLX4_CMD_TIME_CLASS_C,
+			    MLX4_CMD_NATIVE);
+}
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4.h b/drivers/net/ethernet/mellanox/mlx4/mlx4.h
index de10dbb..254ec7b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4.h
@@ -1273,6 +1273,11 @@ int mlx4_QP_FLOW_STEERING_DETACH_wrapper(struct mlx4_dev *dev, int slave,
 					 struct mlx4_cmd_mailbox *inbox,
 					 struct mlx4_cmd_mailbox *outbox,
 					 struct mlx4_cmd_info *cmd);
+int mlx4_ACCESS_REG_wrapper(struct mlx4_dev *dev, int slave,
+			    struct mlx4_vhcr *vhcr,
+			    struct mlx4_cmd_mailbox *inbox,
+			    struct mlx4_cmd_mailbox *outbox,
+			    struct mlx4_cmd_info *cmd);
 
 int mlx4_get_mgm_entry_size(struct mlx4_dev *dev);
 int mlx4_get_qp_per_mgm(struct mlx4_dev *dev);
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH net-next 2/8] net/mlx4_core: Protect port type setting by mutex
  2014-10-30 16:06 [PATCH net-next 0/8] Mellanox ethernet driver update Oct-30-2014 Or Gerlitz
  2014-10-30 16:06 ` [PATCH net-next 1/8] net/mlx4_core: Prevent VF from changing port configuration Or Gerlitz
@ 2014-10-30 16:06 ` Or Gerlitz
  2014-10-30 16:06 ` [PATCH net-next 3/8] net/mlx4_en: Remove RX buffers alignment to IP_ALIGN Or Gerlitz
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 32+ messages in thread
From: Or Gerlitz @ 2014-10-30 16:06 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Matan Barak, Amir Vadai, Saeed Mahameed, Shani Michaeli

From: Amir Vadai <amirv@mellanox.com>

We need to protect set_port_type() for concurrency, as the sysfs code could
call it from mutliple contexts in parallel.

The port_mutex is not enough because we need to protect from concurrent
modification of 'info' and stopping of the port sensing work.

Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/main.c |    9 ++++++++-
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
index 90de6e1..9f82196 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -901,9 +901,12 @@ static ssize_t set_port_type(struct device *dev,
 	struct mlx4_priv *priv = mlx4_priv(mdev);
 	enum mlx4_port_type types[MLX4_MAX_PORTS];
 	enum mlx4_port_type new_types[MLX4_MAX_PORTS];
+	static DEFINE_MUTEX(set_port_type_mutex);
 	int i;
 	int err = 0;
 
+	mutex_lock(&set_port_type_mutex);
+
 	if (!strcmp(buf, "ib\n"))
 		info->tmp_type = MLX4_PORT_TYPE_IB;
 	else if (!strcmp(buf, "eth\n"))
@@ -912,7 +915,8 @@ static ssize_t set_port_type(struct device *dev,
 		info->tmp_type = MLX4_PORT_TYPE_AUTO;
 	else {
 		mlx4_err(mdev, "%s is not supported port type\n", buf);
-		return -EINVAL;
+		err = -EINVAL;
+		goto err_out;
 	}
 
 	mlx4_stop_sense(mdev);
@@ -958,6 +962,9 @@ static ssize_t set_port_type(struct device *dev,
 out:
 	mlx4_start_sense(mdev);
 	mutex_unlock(&priv->port_mutex);
+err_out:
+	mutex_unlock(&set_port_type_mutex);
+
 	return err ? err : count;
 }
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH net-next 3/8] net/mlx4_en: Remove RX buffers alignment to IP_ALIGN
  2014-10-30 16:06 [PATCH net-next 0/8] Mellanox ethernet driver update Oct-30-2014 Or Gerlitz
  2014-10-30 16:06 ` [PATCH net-next 1/8] net/mlx4_core: Prevent VF from changing port configuration Or Gerlitz
  2014-10-30 16:06 ` [PATCH net-next 2/8] net/mlx4_core: Protect port type setting by mutex Or Gerlitz
@ 2014-10-30 16:06 ` Or Gerlitz
  2014-10-30 16:06 ` [PATCH net-next 4/8] net/mlx4_en: Add __GFP_COLD gfp flags in alloc_pages Or Gerlitz
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 32+ messages in thread
From: Or Gerlitz @ 2014-10-30 16:06 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Matan Barak, Amir Vadai, Saeed Mahameed, Shani Michaeli,
	Ido Shamay

From: Ido Shamay <idos@mellanox.com>

When IP_ALIGN has a non zero value, hardware will write to a non aligned
address. The only reader from this address is when copying the header
from the first frag into the linear buffer (further access to the IP
address will be from the linear buffer, in which the headers are
aligned). Since the penalty of non align access by the hardware is
greater than the software memcpy, changing the frag_align to always be 0.

Signed-off-by: Ido Shamay <idos@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c   |   16 ++++------------
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |    1 -
 2 files changed, 4 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index c8e75da..4cb716f 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -74,7 +74,7 @@ static int mlx4_alloc_pages(struct mlx4_en_priv *priv,
 	page_alloc->page_size = PAGE_SIZE << order;
 	page_alloc->page = page;
 	page_alloc->dma = dma;
-	page_alloc->page_offset = frag_info->frag_align;
+	page_alloc->page_offset = 0;
 	/* Not doing get_page() for each frag is a big win
 	 * on asymetric workloads. Note we can not use atomic_set().
 	 */
@@ -945,15 +945,8 @@ void mlx4_en_calc_rx_buf(struct net_device *dev)
 			(eff_mtu > buf_size + frag_sizes[i]) ?
 				frag_sizes[i] : eff_mtu - buf_size;
 		priv->frag_info[i].frag_prefix_size = buf_size;
-		if (!i)	{
-			priv->frag_info[i].frag_align = NET_IP_ALIGN;
-			priv->frag_info[i].frag_stride =
-				ALIGN(frag_sizes[i] + NET_IP_ALIGN, SMP_CACHE_BYTES);
-		} else {
-			priv->frag_info[i].frag_align = 0;
-			priv->frag_info[i].frag_stride =
-				ALIGN(frag_sizes[i], SMP_CACHE_BYTES);
-		}
+		priv->frag_info[i].frag_stride = ALIGN(frag_sizes[i],
+						       SMP_CACHE_BYTES);
 		buf_size += priv->frag_info[i].frag_size;
 		i++;
 	}
@@ -966,11 +959,10 @@ void mlx4_en_calc_rx_buf(struct net_device *dev)
 	       eff_mtu, priv->num_frags);
 	for (i = 0; i < priv->num_frags; i++) {
 		en_err(priv,
-		       "  frag:%d - size:%d prefix:%d align:%d stride:%d\n",
+		       "  frag:%d - size:%d prefix:%d stride:%d\n",
 		       i,
 		       priv->frag_info[i].frag_size,
 		       priv->frag_info[i].frag_prefix_size,
-		       priv->frag_info[i].frag_align,
 		       priv->frag_info[i].frag_stride);
 	}
 }
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 6beb4d3..ef83d12 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -481,7 +481,6 @@ struct mlx4_en_frag_info {
 	u16 frag_size;
 	u16 frag_prefix_size;
 	u16 frag_stride;
-	u16 frag_align;
 };
 
 #ifdef CONFIG_MLX4_EN_DCB
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH net-next 4/8] net/mlx4_en: Add __GFP_COLD gfp flags in alloc_pages
  2014-10-30 16:06 [PATCH net-next 0/8] Mellanox ethernet driver update Oct-30-2014 Or Gerlitz
                   ` (2 preceding siblings ...)
  2014-10-30 16:06 ` [PATCH net-next 3/8] net/mlx4_en: Remove RX buffers alignment to IP_ALIGN Or Gerlitz
@ 2014-10-30 16:06 ` Or Gerlitz
  2014-10-30 17:15   ` Eric Dumazet
  2014-10-30 16:06 ` [PATCH net-next 5/8] net/mlx4_en: Remove redundant code from RX/GRO path Or Gerlitz
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 32+ messages in thread
From: Or Gerlitz @ 2014-10-30 16:06 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Matan Barak, Amir Vadai, Saeed Mahameed, Shani Michaeli,
	Ido Shamay

From: Ido Shamay <idos@mellanox.com>

Needed in order to get cache cold pages (L3 flushed) for HW scatter.

Otherwise memory may flush those entries when the packet comes from
PCI, causing back pressure resulting in BW decrease.

Signed-off-by: Ido Shamay <idos@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 4cb716f..9d616a8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -54,7 +54,7 @@ static int mlx4_alloc_pages(struct mlx4_en_priv *priv,
 	dma_addr_t dma;
 
 	for (order = MLX4_EN_ALLOC_PREFER_ORDER; ;) {
-		gfp_t gfp = _gfp;
+		gfp_t gfp = _gfp | __GFP_COLD;
 
 		if (order)
 			gfp |= __GFP_COMP | __GFP_NOWARN;
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH net-next 5/8] net/mlx4_en: Remove redundant code from RX/GRO path
  2014-10-30 16:06 [PATCH net-next 0/8] Mellanox ethernet driver update Oct-30-2014 Or Gerlitz
                   ` (3 preceding siblings ...)
  2014-10-30 16:06 ` [PATCH net-next 4/8] net/mlx4_en: Add __GFP_COLD gfp flags in alloc_pages Or Gerlitz
@ 2014-10-30 16:06 ` Or Gerlitz
  2014-10-30 19:00   ` Eric Dumazet
  2014-10-30 16:06 ` [PATCH net-next 6/8] net/mlx4_core: Add retrieval of CONFIG_DEV parameters Or Gerlitz
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 32+ messages in thread
From: Or Gerlitz @ 2014-10-30 16:06 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Matan Barak, Amir Vadai, Saeed Mahameed, Shani Michaeli,
	Or Gerlitz

Remove the code which goes through napi_gro_frags() on the RX path,
use only napi_gro_receive().

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c |   54 ----------------------------
 1 files changed, 0 insertions(+), 54 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 9d616a8..2a29a1a 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -746,60 +746,6 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 			if ((cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) &&
 			    (cqe->checksum == cpu_to_be16(0xffff))) {
 				ring->csum_ok++;
-				/* This packet is eligible for GRO if it is:
-				 * - DIX Ethernet (type interpretation)
-				 * - TCP/IP (v4)
-				 * - without IP options
-				 * - not an IP fragment
-				 * - no LLS polling in progress
-				 */
-				if (!mlx4_en_cq_busy_polling(cq) &&
-				    (dev->features & NETIF_F_GRO)) {
-					struct sk_buff *gro_skb = napi_get_frags(&cq->napi);
-					if (!gro_skb)
-						goto next;
-
-					nr = mlx4_en_complete_rx_desc(priv,
-						rx_desc, frags, gro_skb,
-						length);
-					if (!nr)
-						goto next;
-
-					skb_shinfo(gro_skb)->nr_frags = nr;
-					gro_skb->len = length;
-					gro_skb->data_len = length;
-					gro_skb->ip_summed = CHECKSUM_UNNECESSARY;
-
-					if (l2_tunnel)
-						gro_skb->csum_level = 1;
-					if ((cqe->vlan_my_qpn &
-					    cpu_to_be32(MLX4_CQE_VLAN_PRESENT_MASK)) &&
-					    (dev->features & NETIF_F_HW_VLAN_CTAG_RX)) {
-						u16 vid = be16_to_cpu(cqe->sl_vid);
-
-						__vlan_hwaccel_put_tag(gro_skb, htons(ETH_P_8021Q), vid);
-					}
-
-					if (dev->features & NETIF_F_RXHASH)
-						skb_set_hash(gro_skb,
-							     be32_to_cpu(cqe->immed_rss_invalid),
-							     PKT_HASH_TYPE_L3);
-
-					skb_record_rx_queue(gro_skb, cq->ring);
-					skb_mark_napi_id(gro_skb, &cq->napi);
-
-					if (ring->hwtstamp_rx_filter == HWTSTAMP_FILTER_ALL) {
-						timestamp = mlx4_en_get_cqe_ts(cqe);
-						mlx4_en_fill_hwtstamps(mdev,
-								       skb_hwtstamps(gro_skb),
-								       timestamp);
-					}
-
-					napi_gro_frags(&cq->napi);
-					goto next;
-				}
-
-				/* GRO not possible, complete processing here */
 				ip_summed = CHECKSUM_UNNECESSARY;
 			} else {
 				ip_summed = CHECKSUM_NONE;
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH net-next 6/8] net/mlx4_core: Add retrieval of CONFIG_DEV parameters
  2014-10-30 16:06 [PATCH net-next 0/8] Mellanox ethernet driver update Oct-30-2014 Or Gerlitz
                   ` (4 preceding siblings ...)
  2014-10-30 16:06 ` [PATCH net-next 5/8] net/mlx4_en: Remove redundant code from RX/GRO path Or Gerlitz
@ 2014-10-30 16:06 ` Or Gerlitz
  2014-10-30 16:06 ` [PATCH net-next 7/8] net: Add calaulation of non folded IPV6 pseudo header checksum Or Gerlitz
  2014-10-30 16:06 ` [PATCH net-next 8/8] net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE Or Gerlitz
  7 siblings, 0 replies; 32+ messages in thread
From: Or Gerlitz @ 2014-10-30 16:06 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Matan Barak, Amir Vadai, Saeed Mahameed, Shani Michaeli,
	Or Gerlitz

From: Matan Barak <matanb@mellanox.com>

Add code to issue CONFIG_DEV "get" firmware command.

This command is used in order to obtain certain parameters used for
supporting various RX checksumming options and vxlan UDP port.

The GET operation is allowed for VFs too.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Shani Michaeli <shanim@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/cmd.c           |    4 +-
 drivers/net/ethernet/mellanox/mlx4/fw.c            |   88 +++++++++++++++++++-
 drivers/net/ethernet/mellanox/mlx4/mlx4.h          |    5 +
 .../net/ethernet/mellanox/mlx4/resource_tracker.c  |   17 ++++
 include/linux/mlx4/cmd.h                           |   29 +++++++
 include/linux/mlx4/device.h                        |    3 +-
 6 files changed, 139 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c b/drivers/net/ethernet/mellanox/mlx4/cmd.c
index 1312ccf..3c05e58 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
@@ -990,11 +990,11 @@ static struct mlx4_cmd_info cmd_info[] = {
 	{
 		.opcode = MLX4_CMD_CONFIG_DEV,
 		.has_inbox = false,
-		.has_outbox = false,
+		.has_outbox = true,
 		.out_is_imm = false,
 		.encode_slave_id = false,
 		.verify = NULL,
-		.wrapper = mlx4_CMD_EPERM_wrapper
+		.wrapper = mlx4_CONFIG_DEV_wrapper
 	},
 	{
 		.opcode = MLX4_CMD_ALLOC_RES,
diff --git a/drivers/net/ethernet/mellanox/mlx4/fw.c b/drivers/net/ethernet/mellanox/mlx4/fw.c
index e7639e3..d6dba77 100644
--- a/drivers/net/ethernet/mellanox/mlx4/fw.c
+++ b/drivers/net/ethernet/mellanox/mlx4/fw.c
@@ -141,7 +141,8 @@ static void dump_dev_cap_flags2(struct mlx4_dev *dev, u64 flags)
 		[12] = "Large cache line (>64B) CQE stride support",
 		[13] = "Large cache line (>64B) EQE stride support",
 		[14] = "Ethernet protocol control support",
-		[15] = "Ethernet Backplane autoneg support"
+		[15] = "Ethernet Backplane autoneg support",
+		[16] = "CONFIG DEV support"
 	};
 	int i;
 
@@ -574,6 +575,7 @@ int mlx4_QUERY_DEV_CAP(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap)
 #define QUERY_DEV_CAP_MTT_ENTRY_SZ_OFFSET	0x90
 #define QUERY_DEV_CAP_D_MPT_ENTRY_SZ_OFFSET	0x92
 #define QUERY_DEV_CAP_BMME_FLAGS_OFFSET		0x94
+#define QUERY_DEV_CAP_CONFIG_DEV_OFFSET		0x94
 #define QUERY_DEV_CAP_RSVD_LKEY_OFFSET		0x98
 #define QUERY_DEV_CAP_MAX_ICM_SZ_OFFSET		0xa0
 #define QUERY_DEV_CAP_ETH_BACKPL_OFFSET		0x9c
@@ -749,6 +751,9 @@ int mlx4_QUERY_DEV_CAP(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap)
 		dev_cap->flags2 |= MLX4_DEV_CAP_FLAG2_EQE_STRIDE;
 	MLX4_GET(dev_cap->bmme_flags, outbox,
 		 QUERY_DEV_CAP_BMME_FLAGS_OFFSET);
+	MLX4_GET(field, outbox, QUERY_DEV_CAP_CONFIG_DEV_OFFSET);
+	if (field & 0x20)
+		dev_cap->flags2 |= MLX4_DEV_CAP_FLAG2_CONFIG_DEV;
 	MLX4_GET(dev_cap->reserved_lkey, outbox,
 		 QUERY_DEV_CAP_RSVD_LKEY_OFFSET);
 	MLX4_GET(field32, outbox, QUERY_DEV_CAP_ETH_BACKPL_OFFSET);
@@ -1849,14 +1854,18 @@ int mlx4_CLOSE_HCA(struct mlx4_dev *dev, int panic)
 
 struct mlx4_config_dev {
 	__be32	update_flags;
-	__be32	rsdv1[3];
+	__be32	rsvd1[3];
 	__be16	vxlan_udp_dport;
 	__be16	rsvd2;
+	__be32	rsvd3[27];
+	__be16	rsvd4;
+	u8	rsvd5;
+	u8	rx_checksum_val;
 };
 
 #define MLX4_VXLAN_UDP_DPORT (1 << 0)
 
-static int mlx4_CONFIG_DEV(struct mlx4_dev *dev, struct mlx4_config_dev *config_dev)
+static int mlx4_CONFIG_DEV_set(struct mlx4_dev *dev, struct mlx4_config_dev *config_dev)
 {
 	int err;
 	struct mlx4_cmd_mailbox *mailbox;
@@ -1874,6 +1883,77 @@ static int mlx4_CONFIG_DEV(struct mlx4_dev *dev, struct mlx4_config_dev *config_
 	return err;
 }
 
+static int mlx4_CONFIG_DEV_get(struct mlx4_dev *dev, struct mlx4_config_dev *config_dev)
+{
+	int err;
+	struct mlx4_cmd_mailbox *mailbox;
+
+	mailbox = mlx4_alloc_cmd_mailbox(dev);
+	if (IS_ERR(mailbox))
+		return PTR_ERR(mailbox);
+
+	err = mlx4_cmd_box(dev, 0, mailbox->dma, 0, 1, MLX4_CMD_CONFIG_DEV,
+			   MLX4_CMD_TIME_CLASS_A, MLX4_CMD_NATIVE);
+
+	if (!err)
+		memcpy(config_dev, mailbox->buf, sizeof(*config_dev));
+
+	mlx4_free_cmd_mailbox(dev, mailbox);
+	return err;
+}
+
+/* Conversion between the HW values and the actual functionality.
+ * The value represented by the array index,
+ * and the functionality determined by the flags.
+ */
+static const u8 config_dev_csum_flags[] = {
+	[0] =	0,
+	[1] =	MLX4_RX_CSUM_MODE_VAL_NON_TCP_UDP,
+	[2] =	MLX4_RX_CSUM_MODE_VAL_NON_TCP_UDP	|
+		MLX4_RX_CSUM_MODE_L4,
+	[3] =	MLX4_RX_CSUM_MODE_L4			|
+		MLX4_RX_CSUM_MODE_IP_OK_IP_NON_TCP_UDP	|
+		MLX4_RX_CSUM_MODE_MULTI_VLAN
+};
+
+int mlx4_config_dev_retrieval(struct mlx4_dev *dev,
+			      struct mlx4_config_dev_params *params)
+{
+	struct mlx4_config_dev config_dev;
+	int err;
+	u8 csum_mask;
+
+#define CONFIG_DEV_RX_CSUM_MODE_MASK			0x7
+#define CONFIG_DEV_RX_CSUM_MODE_PORT1_BIT_OFFSET	0
+#define CONFIG_DEV_RX_CSUM_MODE_PORT2_BIT_OFFSET	4
+
+	if (!(dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_CONFIG_DEV))
+		return -ENOTSUPP;
+
+	err = mlx4_CONFIG_DEV_get(dev, &config_dev);
+	if (err)
+		return err;
+
+	csum_mask = (config_dev.rx_checksum_val >> CONFIG_DEV_RX_CSUM_MODE_PORT1_BIT_OFFSET) &
+			CONFIG_DEV_RX_CSUM_MODE_MASK;
+
+	if (csum_mask >= sizeof(config_dev_csum_flags)/sizeof(config_dev_csum_flags[0]))
+		return -EINVAL;
+	params->rx_csum_flags_port_1 = config_dev_csum_flags[csum_mask];
+
+	csum_mask = (config_dev.rx_checksum_val >> CONFIG_DEV_RX_CSUM_MODE_PORT2_BIT_OFFSET) &
+			CONFIG_DEV_RX_CSUM_MODE_MASK;
+
+	if (csum_mask >= sizeof(config_dev_csum_flags)/sizeof(config_dev_csum_flags[0]))
+		return -EINVAL;
+	params->rx_csum_flags_port_2 = config_dev_csum_flags[csum_mask];
+
+	params->vxlan_udp_dport = be16_to_cpu(config_dev.vxlan_udp_dport);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(mlx4_config_dev_retrieval);
+
 int mlx4_config_vxlan_port(struct mlx4_dev *dev, __be16 udp_port)
 {
 	struct mlx4_config_dev config_dev;
@@ -1882,7 +1962,7 @@ int mlx4_config_vxlan_port(struct mlx4_dev *dev, __be16 udp_port)
 	config_dev.update_flags    = cpu_to_be32(MLX4_VXLAN_UDP_DPORT);
 	config_dev.vxlan_udp_dport = udp_port;
 
-	return mlx4_CONFIG_DEV(dev, &config_dev);
+	return mlx4_CONFIG_DEV_set(dev, &config_dev);
 }
 EXPORT_SYMBOL_GPL(mlx4_config_vxlan_port);
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4.h b/drivers/net/ethernet/mellanox/mlx4/mlx4.h
index 254ec7b..f8fc7bd 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4.h
@@ -947,6 +947,11 @@ int mlx4_SW2HW_EQ_wrapper(struct mlx4_dev *dev, int slave,
 			  struct mlx4_cmd_mailbox *inbox,
 			  struct mlx4_cmd_mailbox *outbox,
 			  struct mlx4_cmd_info *cmd);
+int mlx4_CONFIG_DEV_wrapper(struct mlx4_dev *dev, int slave,
+			    struct mlx4_vhcr *vhcr,
+			    struct mlx4_cmd_mailbox *inbox,
+			    struct mlx4_cmd_mailbox *outbox,
+			    struct mlx4_cmd_info *cmd);
 int mlx4_DMA_wrapper(struct mlx4_dev *dev, int slave,
 		     struct mlx4_vhcr *vhcr,
 		     struct mlx4_cmd_mailbox *inbox,
diff --git a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
index 5d2498d..d718ca0 100644
--- a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
+++ b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
@@ -2872,6 +2872,23 @@ out_add:
 	return err;
 }
 
+int mlx4_CONFIG_DEV_wrapper(struct mlx4_dev *dev, int slave,
+			    struct mlx4_vhcr *vhcr,
+			    struct mlx4_cmd_mailbox *inbox,
+			    struct mlx4_cmd_mailbox *outbox,
+			    struct mlx4_cmd_info *cmd)
+{
+	int err;
+	u8 get = vhcr->op_modifier;
+
+	if (get != 1)
+		return -EPERM;
+
+	err = mlx4_DMA_wrapper(dev, slave, vhcr, inbox, outbox, cmd);
+
+	return err;
+}
+
 static int get_containing_mtt(struct mlx4_dev *dev, int slave, int start,
 			      int len, struct res_mtt **res)
 {
diff --git a/include/linux/mlx4/cmd.h b/include/linux/mlx4/cmd.h
index ff5f5de..64d2594 100644
--- a/include/linux/mlx4/cmd.h
+++ b/include/linux/mlx4/cmd.h
@@ -199,6 +199,33 @@ enum {
 	MLX4_CMD_NATIVE
 };
 
+/*
+ * MLX4_RX_CSUM_MODE_VAL_NON_TCP_UDP -
+ * Receive checksum value is reported in CQE also for non TCP/UDP packets.
+ *
+ * MLX4_RX_CSUM_MODE_L4 -
+ * L4_CSUM bit in CQE, which indicates whether or not L4 checksum
+ * was validated correctly, is supported.
+ *
+ * MLX4_RX_CSUM_MODE_IP_OK_IP_NON_TCP_UDP -
+ * IP_OK CQE's field is supported also for non TCP/UDP IP packets.
+ *
+ * MLX4_RX_CSUM_MODE_MULTI_VLAN -
+ * Receive Checksum offload is supported for packets with more than 2 vlan headers.
+ */
+enum mlx4_rx_csum_mode {
+	MLX4_RX_CSUM_MODE_VAL_NON_TCP_UDP		= 1UL << 0,
+	MLX4_RX_CSUM_MODE_L4				= 1UL << 1,
+	MLX4_RX_CSUM_MODE_IP_OK_IP_NON_TCP_UDP		= 1UL << 2,
+	MLX4_RX_CSUM_MODE_MULTI_VLAN			= 1UL << 3
+};
+
+struct mlx4_config_dev_params {
+	u16	vxlan_udp_dport;
+	u8	rx_csum_flags_port_1;
+	u8	rx_csum_flags_port_2;
+};
+
 struct mlx4_dev;
 
 struct mlx4_cmd_mailbox {
@@ -250,6 +277,8 @@ int mlx4_set_vf_vlan(struct mlx4_dev *dev, int port, int vf, u16 vlan, u8 qos);
 int mlx4_set_vf_spoofchk(struct mlx4_dev *dev, int port, int vf, bool setting);
 int mlx4_get_vf_config(struct mlx4_dev *dev, int port, int vf, struct ifla_vf_info *ivf);
 int mlx4_set_vf_link_state(struct mlx4_dev *dev, int port, int vf, int link_state);
+int mlx4_config_dev_retrieval(struct mlx4_dev *dev,
+			      struct mlx4_config_dev_params *params);
 /*
  * mlx4_get_slave_default_vlan -
  * return true if VST ( default vlan)
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index e4c136e..5cc5eac 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -188,7 +188,8 @@ enum {
 	MLX4_DEV_CAP_FLAG2_CQE_STRIDE		= 1LL <<  12,
 	MLX4_DEV_CAP_FLAG2_EQE_STRIDE		= 1LL <<  13,
 	MLX4_DEV_CAP_FLAG2_ETH_PROT_CTRL        = 1LL <<  14,
-	MLX4_DEV_CAP_FLAG2_ETH_BACKPL_AN_REP	= 1LL <<  15
+	MLX4_DEV_CAP_FLAG2_ETH_BACKPL_AN_REP	= 1LL <<  15,
+	MLX4_DEV_CAP_FLAG2_CONFIG_DEV		= 1LL <<  16
 };
 
 enum {
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH net-next 7/8] net: Add calaulation of non folded IPV6 pseudo header checksum
  2014-10-30 16:06 [PATCH net-next 0/8] Mellanox ethernet driver update Oct-30-2014 Or Gerlitz
                   ` (5 preceding siblings ...)
  2014-10-30 16:06 ` [PATCH net-next 6/8] net/mlx4_core: Add retrieval of CONFIG_DEV parameters Or Gerlitz
@ 2014-10-30 16:06 ` Or Gerlitz
  2014-10-30 16:25   ` David Laight
  2014-10-30 16:06 ` [PATCH net-next 8/8] net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE Or Gerlitz
  7 siblings, 1 reply; 32+ messages in thread
From: Or Gerlitz @ 2014-10-30 16:06 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Matan Barak, Amir Vadai, Saeed Mahameed, Shani Michaeli

From: Shani Michaeli <shanim@mellanox.com>

Compute IPV6 pseudo header checksum without folding it to a 16 bit
return value.

Signed-off-by: Shani Michaeli <shanim@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
---
 include/net/ip6_checksum.h |   21 +++++++++++++++++++++
 1 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/include/net/ip6_checksum.h b/include/net/ip6_checksum.h
index 1a49b73..c45d690 100644
--- a/include/net/ip6_checksum.h
+++ b/include/net/ip6_checksum.h
@@ -41,6 +41,27 @@ __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
 			__wsum csum);
 #endif
 
+static inline __wsum csum_ipv6_magic_nofold(const struct in6_addr *saddr,
+					    const struct in6_addr *daddr,
+					    __u32 len, unsigned short proto,
+					    __wsum sum)
+{
+	__wsum res = sum;
+
+	res = csum_add(res, (__force __wsum)saddr->in6_u.u6_addr32[0]);
+	res = csum_add(res, (__force __wsum)saddr->in6_u.u6_addr32[1]);
+	res = csum_add(res, (__force __wsum)saddr->in6_u.u6_addr32[2]);
+	res = csum_add(res, (__force __wsum)saddr->in6_u.u6_addr32[3]);
+	res = csum_add(res, (__force __wsum)daddr->in6_u.u6_addr32[0]);
+	res = csum_add(res, (__force __wsum)daddr->in6_u.u6_addr32[1]);
+	res = csum_add(res, (__force __wsum)daddr->in6_u.u6_addr32[2]);
+	res = csum_add(res, (__force __wsum)daddr->in6_u.u6_addr32[3]);
+	res = csum_add(res, (__force __wsum)htonl(len));
+	res = csum_add(res, (__force __wsum)htonl(proto));
+
+	return res;
+}
+
 static inline __wsum ip6_compute_pseudo(struct sk_buff *skb, int proto)
 {
 	return ~csum_unfold(csum_ipv6_magic(&ipv6_hdr(skb)->saddr,
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH net-next 8/8] net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE
  2014-10-30 16:06 [PATCH net-next 0/8] Mellanox ethernet driver update Oct-30-2014 Or Gerlitz
                   ` (6 preceding siblings ...)
  2014-10-30 16:06 ` [PATCH net-next 7/8] net: Add calaulation of non folded IPV6 pseudo header checksum Or Gerlitz
@ 2014-10-30 16:06 ` Or Gerlitz
  2014-10-30 18:22   ` Eric Dumazet
                     ` (3 more replies)
  7 siblings, 4 replies; 32+ messages in thread
From: Or Gerlitz @ 2014-10-30 16:06 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Matan Barak, Amir Vadai, Saeed Mahameed, Shani Michaeli,
	Jerry Chu, Or Gerlitz

From: Shani Michaeli <shanim@mellanox.com>

When processing received traffic, pass CHECKSUM_COMPLETE status to the
stack, with calculated checksum for non TCP/UDP packets (such
as GRE or ICMP).

Although the stack expects checksum which doesn't include the pseudo
header, the HW adds it. To address that, we are subtracting the pseudo
header checksum from the checksum value provided by the HW.

In the IPv6 case, we also compute/add the IP header checksum which
is not added by the HW for such packets.

Cc: Jerry Chu <hkchu@google.com>
Signed-off-by: Shani Michaeli <shanim@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c |    2 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c  |    5 +
 drivers/net/ethernet/mellanox/mlx4/en_port.c    |    2 +
 drivers/net/ethernet/mellanox/mlx4/en_rx.c      |  116 +++++++++++++++++++++-
 drivers/net/ethernet/mellanox/mlx4/main.c       |    9 ++
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h    |    5 +-
 include/linux/mlx4/device.h                     |    1 +
 7 files changed, 132 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index 8ea4d5b..6c64323 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -115,7 +115,7 @@ static const char main_strings[][ETH_GSTRING_LEN] = {
 	"tso_packets",
 	"xmit_more",
 	"queue_stopped", "wake_queue", "tx_timeout", "rx_alloc_failed",
-	"rx_csum_good", "rx_csum_none", "tx_chksum_offload",
+	"rx_csum_good", "rx_csum_none", "rx_csum_complete", "tx_chksum_offload",
 
 	/* packet statistics */
 	"broadcast", "rx_prio_0", "rx_prio_1", "rx_prio_2", "rx_prio_3",
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 0efbae9..d1eb25d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -1893,6 +1893,7 @@ static void mlx4_en_clear_stats(struct net_device *dev)
 		priv->rx_ring[i]->packets = 0;
 		priv->rx_ring[i]->csum_ok = 0;
 		priv->rx_ring[i]->csum_none = 0;
+		priv->rx_ring[i]->csum_complete = 0;
 	}
 }
 
@@ -2503,6 +2504,10 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 	/* Query for default mac and max mtu */
 	priv->max_mtu = mdev->dev->caps.eth_mtu_cap[priv->port];
 
+	if (mdev->dev->caps.rx_checksum_flags_port[priv->port] &
+	    MLX4_RX_CSUM_MODE_VAL_NON_TCP_UDP)
+		priv->flags |= MLX4_EN_FLAG_RX_CSUM_NON_TCP_UDP;
+
 	/* Set default MAC */
 	dev->addr_len = ETH_ALEN;
 	mlx4_en_u64_to_mac(dev->dev_addr, mdev->dev->caps.def_mac[priv->port]);
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_port.c b/drivers/net/ethernet/mellanox/mlx4/en_port.c
index 134b12e..6cb8007 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_port.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_port.c
@@ -155,11 +155,13 @@ int mlx4_en_DUMP_ETH_STATS(struct mlx4_en_dev *mdev, u8 port, u8 reset)
 	stats->rx_bytes = 0;
 	priv->port_stats.rx_chksum_good = 0;
 	priv->port_stats.rx_chksum_none = 0;
+	priv->port_stats.rx_chksum_complete = 0;
 	for (i = 0; i < priv->rx_ring_num; i++) {
 		stats->rx_packets += priv->rx_ring[i]->packets;
 		stats->rx_bytes += priv->rx_ring[i]->bytes;
 		priv->port_stats.rx_chksum_good += priv->rx_ring[i]->csum_ok;
 		priv->port_stats.rx_chksum_none += priv->rx_ring[i]->csum_none;
+		priv->port_stats.rx_chksum_complete += priv->rx_ring[i]->csum_complete;
 	}
 	stats->tx_packets = 0;
 	stats->tx_bytes = 0;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 2a29a1a..f8a0449 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -42,6 +42,10 @@
 #include <linux/vmalloc.h>
 #include <linux/irq.h>
 
+#if IS_ENABLED(CONFIG_IPV6)
+#include <net/ip6_checksum.h>
+#endif
+
 #include "mlx4_en.h"
 
 static int mlx4_alloc_pages(struct mlx4_en_priv *priv,
@@ -642,6 +646,86 @@ static void mlx4_en_refill_rx_buffers(struct mlx4_en_priv *priv,
 	}
 }
 
+/* When hardware doesn't strip the vlan, we need to calculate the checksum
+ * over it and add it to the hardware's checksum calculation
+ */
+static inline __wsum get_fixed_vlan_csum(__wsum hw_checksum,
+					 struct vlan_hdr *vlanh)
+{
+	return csum_add(hw_checksum, *(__wsum *)vlanh);
+}
+
+/* Although the stack expects checksum which doesn't include the pseudo
+ * header, the HW adds it. To address that, we are subtracting the pseudo
+ * header checksum from the checksum value provided by the HW.
+ */
+static void get_fixed_ipv4_csum(__wsum hw_checksum, struct sk_buff *skb,
+				struct iphdr *iph)
+{
+	__u16 length_for_csum = 0;
+	__wsum csum_pseudo_header = 0;
+
+	length_for_csum = (be16_to_cpu(iph->tot_len) - (iph->ihl << 2));
+	csum_pseudo_header = csum_tcpudp_nofold(iph->saddr, iph->daddr,
+						length_for_csum, iph->protocol, 0);
+	skb->csum = csum_sub(hw_checksum, csum_pseudo_header);
+}
+
+#if IS_ENABLED(CONFIG_IPV6)
+/* In IPv6 packets, besides subtracting the pseudo header checksum,
+ * we also compute/add the IP header checksum which
+ * is not added by the HW.
+ */
+static int get_fixed_ipv6_csum(__wsum hw_checksum, struct sk_buff *skb,
+			       struct ipv6hdr *ipv6h)
+{
+	__wsum csum_pseudo_header = 0;
+
+	if (ipv6h->nexthdr == IPPROTO_FRAGMENT || ipv6h->nexthdr == IPPROTO_HOPOPTS)
+		return -1;
+	hw_checksum = csum_add(hw_checksum, (__force __wsum)(ipv6h->nexthdr << 8));
+
+	csum_pseudo_header = csum_ipv6_magic_nofold(&ipv6h->saddr,
+						    &ipv6h->daddr,
+						    ntohs(ipv6h->payload_len),
+						    ipv6h->nexthdr,
+						    0);
+	skb->csum = csum_sub(hw_checksum, csum_pseudo_header);
+	skb->csum = csum_add(skb->csum, csum_partial(ipv6h, sizeof(struct ipv6hdr), 0));
+	return 0;
+}
+#endif
+
+static int check_csum(struct mlx4_cqe *cqe, struct sk_buff *skb, int hwtstamp_rx_filter)
+{
+	__wsum hw_checksum = 0;
+
+	void *hdr = (u8 *)skb->data + sizeof(struct ethhdr);
+
+	hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
+
+	if (((struct ethhdr *)skb->data)->h_proto == htons(ETH_P_8021Q) &&
+	    hwtstamp_rx_filter != HWTSTAMP_FILTER_NONE) {
+		/* next protocol non IPv4 or IPv6 */
+		if (((struct vlan_hdr *)hdr)->h_vlan_encapsulated_proto
+		    != htons(ETH_P_IP) ||
+		    ((struct vlan_hdr *)hdr)->h_vlan_encapsulated_proto
+		    != htons(ETH_P_IPV6))
+			return -1;
+		hw_checksum = get_fixed_vlan_csum(hw_checksum, hdr);
+		hdr += sizeof(struct vlan_hdr);
+	}
+
+	if (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPV4))
+		get_fixed_ipv4_csum(hw_checksum, skb, hdr);
+#if IS_ENABLED(CONFIG_IPV6)
+	else if (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPV6))
+		if (get_fixed_ipv6_csum(hw_checksum, skb, hdr))
+			return -1;
+#endif
+	return 0;
+}
+
 int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int budget)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
@@ -743,13 +827,26 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 			(cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL));
 
 		if (likely(dev->features & NETIF_F_RXCSUM)) {
-			if ((cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) &&
-			    (cqe->checksum == cpu_to_be16(0xffff))) {
-				ring->csum_ok++;
-				ip_summed = CHECKSUM_UNNECESSARY;
+			if (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP |
+						      MLX4_CQE_STATUS_UDP)) {
+				if ((cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) &&
+				    cqe->checksum == cpu_to_be16(0xffff)) {
+					ip_summed = CHECKSUM_UNNECESSARY;
+					ring->csum_ok++;
+				} else {
+					ip_summed = CHECKSUM_NONE;
+					ring->csum_none++;
+				}
 			} else {
-				ip_summed = CHECKSUM_NONE;
-				ring->csum_none++;
+				if (priv->flags & MLX4_EN_FLAG_RX_CSUM_NON_TCP_UDP &&
+				    (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPV4 |
+							       MLX4_CQE_STATUS_IPV6))) {
+					ip_summed = CHECKSUM_COMPLETE;
+					ring->csum_complete++;
+				} else {
+					ip_summed = CHECKSUM_NONE;
+					ring->csum_none++;
+				}
 			}
 		} else {
 			ip_summed = CHECKSUM_NONE;
@@ -767,6 +864,13 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 			goto next;
 		}
 
+		if (ip_summed == CHECKSUM_COMPLETE) {
+			if (check_csum(cqe, skb, ring->hwtstamp_rx_filter)) {
+				ip_summed = CHECKSUM_NONE;
+				ring->csum_none++;
+			}
+		}
+
 		skb->ip_summed = ip_summed;
 		skb->protocol = eth_type_trans(skb, dev);
 		skb_record_rx_queue(skb, cq->ring);
diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
index 9f82196..2f6ba42 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -1629,6 +1629,7 @@ static int mlx4_init_hca(struct mlx4_dev *dev)
 	struct mlx4_init_hca_param init_hca;
 	u64 icm_size;
 	int err;
+	struct mlx4_config_dev_params params;
 
 	if (!mlx4_is_slave(dev)) {
 		err = mlx4_QUERY_FW(dev);
@@ -1762,6 +1763,14 @@ static int mlx4_init_hca(struct mlx4_dev *dev)
 		goto unmap_bf;
 	}
 
+	/* Query CONFIG_DEV parameters */
+	err = mlx4_config_dev_retrieval(dev, &params);
+	if (err && err != -ENOTSUPP) {
+		mlx4_err(dev, "Failed to query CONFIG_DEV parameters\n");
+	} else if (!err) {
+		dev->caps.rx_checksum_flags_port[1] = params.rx_csum_flags_port_1;
+		dev->caps.rx_checksum_flags_port[2] = params.rx_csum_flags_port_2;
+	}
 	priv->eq_table.inta_pin = adapter.inta_pin;
 	memcpy(dev->board_id, adapter.board_id, sizeof dev->board_id);
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index ef83d12..de45674 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -326,6 +326,7 @@ struct mlx4_en_rx_ring {
 #endif
 	unsigned long csum_ok;
 	unsigned long csum_none;
+	unsigned long csum_complete;
 	int hwtstamp_rx_filter;
 	cpumask_var_t affinity_mask;
 };
@@ -449,6 +450,7 @@ struct mlx4_en_port_stats {
 	unsigned long rx_alloc_failed;
 	unsigned long rx_chksum_good;
 	unsigned long rx_chksum_none;
+	unsigned long rx_chksum_complete;
 	unsigned long tx_chksum_offload;
 #define NUM_PORT_STATS		9
 };
@@ -507,7 +509,8 @@ enum {
 	MLX4_EN_FLAG_ENABLE_HW_LOOPBACK	= (1 << 2),
 	/* whether we need to drop packets that hardware loopback-ed */
 	MLX4_EN_FLAG_RX_FILTER_NEEDED	= (1 << 3),
-	MLX4_EN_FLAG_FORCE_PROMISC	= (1 << 4)
+	MLX4_EN_FLAG_FORCE_PROMISC	= (1 << 4),
+	MLX4_EN_FLAG_RX_CSUM_NON_TCP_UDP	= (1 << 5),
 };
 
 #define MLX4_EN_MAC_HASH_SIZE (1 << BITS_PER_BYTE)
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index 5cc5eac..3d9bff0 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -497,6 +497,7 @@ struct mlx4_caps {
 	u16			hca_core_clock;
 	u64			phys_port_id[MLX4_MAX_PORTS + 1];
 	int			tunnel_offload_mode;
+	u8			rx_checksum_flags_port[MLX4_MAX_PORTS + 1];
 };
 
 struct mlx4_buf_list {
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* RE: [PATCH net-next 7/8] net: Add calaulation of non folded IPV6 pseudo header checksum
  2014-10-30 16:06 ` [PATCH net-next 7/8] net: Add calaulation of non folded IPV6 pseudo header checksum Or Gerlitz
@ 2014-10-30 16:25   ` David Laight
  2014-10-30 16:32     ` Or Gerlitz
  0 siblings, 1 reply; 32+ messages in thread
From: David Laight @ 2014-10-30 16:25 UTC (permalink / raw)
  To: 'Or Gerlitz', David S. Miller
  Cc: netdev, Matan Barak, Amir Vadai, Saeed Mahameed, Shani Michaeli

From: Or Gerlitz
> From: Shani Michaeli <shanim@mellanox.com>
> 
> Compute IPV6 pseudo header checksum without folding it to a 16 bit
> return value.
> 
> Signed-off-by: Shani Michaeli <shanim@mellanox.com>
> Signed-off-by: Matan Barak <matanb@mellanox.com>
> ---
>  include/net/ip6_checksum.h |   21 +++++++++++++++++++++
>  1 files changed, 21 insertions(+), 0 deletions(-)
> 
> diff --git a/include/net/ip6_checksum.h b/include/net/ip6_checksum.h
> index 1a49b73..c45d690 100644
> --- a/include/net/ip6_checksum.h
> +++ b/include/net/ip6_checksum.h
> @@ -41,6 +41,27 @@ __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
>  			__wsum csum);
>  #endif
> 
> +static inline __wsum csum_ipv6_magic_nofold(const struct in6_addr *saddr,
> +					    const struct in6_addr *daddr,
> +					    __u32 len, unsigned short proto,
> +					    __wsum sum)
> +{
> +	__wsum res = sum;
> +
> +	res = csum_add(res, (__force __wsum)saddr->in6_u.u6_addr32[0]);
> +	res = csum_add(res, (__force __wsum)saddr->in6_u.u6_addr32[1]);
> +	res = csum_add(res, (__force __wsum)saddr->in6_u.u6_addr32[2]);
> +	res = csum_add(res, (__force __wsum)saddr->in6_u.u6_addr32[3]);
> +	res = csum_add(res, (__force __wsum)daddr->in6_u.u6_addr32[0]);
> +	res = csum_add(res, (__force __wsum)daddr->in6_u.u6_addr32[1]);
> +	res = csum_add(res, (__force __wsum)daddr->in6_u.u6_addr32[2]);
> +	res = csum_add(res, (__force __wsum)daddr->in6_u.u6_addr32[3]);

That probably generates a very long dependency chain.

> +	res = csum_add(res, (__force __wsum)htonl(len));
> +	res = csum_add(res, (__force __wsum)htonl(proto));

htonl() doesn't look right for a 16bit value.
It might not matter (because the final checksum is 16bits).

	David

> +
> +	return res;
> +}
> +
>  static inline __wsum ip6_compute_pseudo(struct sk_buff *skb, int proto)
>  {
>  	return ~csum_unfold(csum_ipv6_magic(&ipv6_hdr(skb)->saddr,
> --
> 1.7.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 7/8] net: Add calaulation of non folded IPV6 pseudo header checksum
  2014-10-30 16:25   ` David Laight
@ 2014-10-30 16:32     ` Or Gerlitz
  2014-10-30 16:39       ` David Laight
  0 siblings, 1 reply; 32+ messages in thread
From: Or Gerlitz @ 2014-10-30 16:32 UTC (permalink / raw)
  To: David Laight
  Cc: David S. Miller, netdev, Matan Barak, Amir Vadai, Saeed Mahameed,
	Shani Michaeli

On 10/30/2014 6:25 PM, David Laight wrote:
>> >+static inline __wsum csum_ipv6_magic_nofold(const struct in6_addr *saddr,
>> >+					    const struct in6_addr *daddr,
>> >+					    __u32 len, unsigned short proto,
>> >+					    __wsum sum)
>> >+{
>> >+	__wsum res = sum;
>> >+
>> >+	res = csum_add(res, (__force __wsum)saddr->in6_u.u6_addr32[0]);
>> >+	res = csum_add(res, (__force __wsum)saddr->in6_u.u6_addr32[1]);
>> >+	res = csum_add(res, (__force __wsum)saddr->in6_u.u6_addr32[2]);
>> >+	res = csum_add(res, (__force __wsum)saddr->in6_u.u6_addr32[3]);
>> >+	res = csum_add(res, (__force __wsum)daddr->in6_u.u6_addr32[0]);
>> >+	res = csum_add(res, (__force __wsum)daddr->in6_u.u6_addr32[1]);
>> >+	res = csum_add(res, (__force __wsum)daddr->in6_u.u6_addr32[2]);
>> >+	res = csum_add(res, (__force __wsum)daddr->in6_u.u6_addr32[3]);
> That probably generates a very long dependency chain.
>

Could you clarify this comment a bit?

Or.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [PATCH net-next 7/8] net: Add calaulation of non folded IPV6 pseudo header checksum
  2014-10-30 16:32     ` Or Gerlitz
@ 2014-10-30 16:39       ` David Laight
  2014-10-30 16:54         ` Or Gerlitz
  0 siblings, 1 reply; 32+ messages in thread
From: David Laight @ 2014-10-30 16:39 UTC (permalink / raw)
  To: 'Or Gerlitz'
  Cc: David S. Miller, netdev, Matan Barak, Amir Vadai, Saeed Mahameed,
	Shani Michaeli

From: Or Gerlitz [mailto:ogerlitz@mellanox.com]
> On 10/30/2014 6:25 PM, David Laight wrote:
> >> >+static inline __wsum csum_ipv6_magic_nofold(const struct in6_addr *saddr,
> >> >+					    const struct in6_addr *daddr,
> >> >+					    __u32 len, unsigned short proto,
> >> >+					    __wsum sum)
> >> >+{
> >> >+	__wsum res = sum;
> >> >+
> >> >+	res = csum_add(res, (__force __wsum)saddr->in6_u.u6_addr32[0]);
> >> >+	res = csum_add(res, (__force __wsum)saddr->in6_u.u6_addr32[1]);
> >> >+	res = csum_add(res, (__force __wsum)saddr->in6_u.u6_addr32[2]);
> >> >+	res = csum_add(res, (__force __wsum)saddr->in6_u.u6_addr32[3]);
> >> >+	res = csum_add(res, (__force __wsum)daddr->in6_u.u6_addr32[0]);
> >> >+	res = csum_add(res, (__force __wsum)daddr->in6_u.u6_addr32[1]);
> >> >+	res = csum_add(res, (__force __wsum)daddr->in6_u.u6_addr32[2]);
> >> >+	res = csum_add(res, (__force __wsum)daddr->in6_u.u6_addr32[3]);
>
> > That probably generates a very long dependency chain.
> >
> 
> Could you clarify this comment a bit?

csum_add() probably generates a 32bit 'add with carry' instruction.
So the above generates 8 instructions that have to be executed in series
(dependencies on the register and the carry flag).

On a 64bit cpu there are other options, eg adding 32bit values into
several 64bit registers, then adding those together and finally
collapsing the value to 32 then 16 bits.

Maybe __wsum does end up being 64bit (not looked), but gcc won't
generate a 'tree' of additions, it will still generate a dependency chain.

Hopefully the software 'checksum a buffer' function is written to
avoid these problems.

	David

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 7/8] net: Add calaulation of non folded IPV6 pseudo header checksum
  2014-10-30 16:39       ` David Laight
@ 2014-10-30 16:54         ` Or Gerlitz
  2014-10-30 16:55           ` Or Gerlitz
  2014-10-30 17:09           ` David Laight
  0 siblings, 2 replies; 32+ messages in thread
From: Or Gerlitz @ 2014-10-30 16:54 UTC (permalink / raw)
  To: David Laight
  Cc: David S. Miller, netdev, Matan Barak, Amir Vadai, Saeed Mahameed,
	Shani Michaeli

On 10/30/2014 6:39 PM, David Laight wrote:
> From: Or Gerlitz [mailto:ogerlitz@mellanox.com]
>> On 10/30/2014 6:25 PM, David Laight wrote:
>>>>> +static inline __wsum csum_ipv6_magic_nofold(const struct in6_addr *saddr,
>>>>> +					    const struct in6_addr *daddr,
>>>>> +					    __u32 len, unsigned short proto,
>>>>> +					    __wsum sum)
>>>>> +{
>>>>> +	__wsum res = sum;
>>>>> +
>>>>> +	res = csum_add(res, (__force __wsum)saddr->in6_u.u6_addr32[0]);
>>>>> +	res = csum_add(res, (__force __wsum)saddr->in6_u.u6_addr32[1]);
>>>>> +	res = csum_add(res, (__force __wsum)saddr->in6_u.u6_addr32[2]);
>>>>> +	res = csum_add(res, (__force __wsum)saddr->in6_u.u6_addr32[3]);
>>>>> +	res = csum_add(res, (__force __wsum)daddr->in6_u.u6_addr32[0]);
>>>>> +	res = csum_add(res, (__force __wsum)daddr->in6_u.u6_addr32[1]);
>>>>> +	res = csum_add(res, (__force __wsum)daddr->in6_u.u6_addr32[2]);
>>>>> +	res = csum_add(res, (__force __wsum)daddr->in6_u.u6_addr32[3]);
>>> That probably generates a very long dependency chain.
>>>
>> Could you clarify this comment a bit?
> csum_add() probably generates a 32bit 'add with carry' instruction.
> So the above generates 8 instructions that have to be executed in series
> (dependencies on the register and the carry flag).
>
> On a 64bit cpu there are other options, eg adding 32bit values into
> several 64bit registers, then adding those together and finally
> collapsing the value to 32 then 16 bits.
>
> Maybe __wsum does end up being 64bit (not looked), but gcc won't
> generate a 'tree' of additions, it will still generate a dependency chain.
>
> Hopefully the software 'checksum a buffer' function is written to
> avoid these problems.

OK, talking to Matan, hecame up with this super-quick (compile tested 
only) alternative, is
this what you were advocating for?


Or.


diff --git a/include/net/ip6_checksum.h b/include/net/ip6_checksum.h
index c45d690..cadffdc 100644
--- a/include/net/ip6_checksum.h
+++ b/include/net/ip6_checksum.h
@@ -46,6 +46,19 @@ static inline __wsum csum_ipv6_magic_nofold(const 
struct in6_addr *saddr,
                                             __u32 len, unsigned short 
proto,
                                             __wsum sum)
  {
+       sum = csum_partial(saddr, sizeof(saddr->in6_u.u6_addr32), sum);
+       sum = csum_partial(saddr, sizeof(daddr->in6_u.u6_addr32), sum);
+       sum = csum_add(sum, (__force __wsum)htonl(len));
+       sum = csum_add(sum, (__force __wsum)htons(proto));
+
+       return sum;
+}

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 7/8] net: Add calaulation of non folded IPV6 pseudo header checksum
  2014-10-30 16:54         ` Or Gerlitz
@ 2014-10-30 16:55           ` Or Gerlitz
  2014-10-30 17:09           ` David Laight
  1 sibling, 0 replies; 32+ messages in thread
From: Or Gerlitz @ 2014-10-30 16:55 UTC (permalink / raw)
  To: David Laight
  Cc: David S. Miller, netdev, Matan Barak, Amir Vadai, Saeed Mahameed,
	Shani Michaeli


On 10/30/2014 6:54 PM, Or Gerlitz wrote:
> sum = csum_partial(saddr, sizeof(daddr->in6_u.u6_addr32), sum); 
s/saddr/daddr/ sorry for the spam

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [PATCH net-next 7/8] net: Add calaulation of non folded IPV6 pseudo header checksum
  2014-10-30 16:54         ` Or Gerlitz
  2014-10-30 16:55           ` Or Gerlitz
@ 2014-10-30 17:09           ` David Laight
  1 sibling, 0 replies; 32+ messages in thread
From: David Laight @ 2014-10-30 17:09 UTC (permalink / raw)
  To: 'Or Gerlitz'
  Cc: David S. Miller, netdev, Matan Barak, Amir Vadai, Saeed Mahameed,
	Shani Michaeli

> OK, talking to Matan, hecame up with this super-quick (compile tested
> only) alternative, is
> this what you were advocating for?
>   {
> +       sum = csum_partial(saddr, sizeof(saddr->in6_u.u6_addr32), sum);
> +       sum = csum_partial(saddr, sizeof(daddr->in6_u.u6_addr32), sum);

I'm pretty sure your 'saddr' and 'daddr' are adjacent.
Whether you can prove that is another matter.

> +       sum = csum_add(sum, (__force __wsum)htonl(len));
> +       sum = csum_add(sum, (__force __wsum)htons(proto));
> +
> +       return sum;
> +}

On 64 bit systems you probably want to end up with something akin to:
	__u64 a, b, c, d;

	a = saddr[0] + saddr[1];
	b = saddr[2] + saddr[3];
	c = daddr[0] + daddr[1];
	d = daddr[2] + daddr[3];
	a += b;
	c += d;
	a += c;
and then collapse down the 64bit value.
However if you write the above in proper C, gcc will probably
convert it to long dependence chain.

An architecture specific csum_partial() probably manages to DTRT.

	David

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 4/8] net/mlx4_en: Add __GFP_COLD gfp flags in alloc_pages
  2014-10-30 16:06 ` [PATCH net-next 4/8] net/mlx4_en: Add __GFP_COLD gfp flags in alloc_pages Or Gerlitz
@ 2014-10-30 17:15   ` Eric Dumazet
  2014-10-30 22:49     ` Or Gerlitz
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2014-10-30 17:15 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: David S. Miller, netdev, Matan Barak, Amir Vadai, Saeed Mahameed,
	Shani Michaeli, Ido Shamay

On Thu, 2014-10-30 at 18:06 +0200, Or Gerlitz wrote:
> From: Ido Shamay <idos@mellanox.com>
> 
> Needed in order to get cache cold pages (L3 flushed) for HW scatter.
> 
> Otherwise memory may flush those entries when the packet comes from
> PCI, causing back pressure resulting in BW decrease.
> 
> Signed-off-by: Ido Shamay <idos@mellanox.com>
> Signed-off-by: Amir Vadai <amirv@mellanox.com>
> ---
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 4cb716f..9d616a8 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -54,7 +54,7 @@ static int mlx4_alloc_pages(struct mlx4_en_priv *priv,
>  	dma_addr_t dma;
>  
>  	for (order = MLX4_EN_ALLOC_PREFER_ORDER; ;) {
> -		gfp_t gfp = _gfp;
> +		gfp_t gfp = _gfp | __GFP_COLD;

This should be set by callers, to avoid extra code

GFP_ATOMIC | __GFP_COLD

or

GFP_KERNEL | __GFP_COLD

>  
>  		if (order)
>  			gfp |= __GFP_COMP | __GFP_NOWARN;

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 8/8] net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE
  2014-10-30 16:06 ` [PATCH net-next 8/8] net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE Or Gerlitz
@ 2014-10-30 18:22   ` Eric Dumazet
  2014-10-30 21:20   ` Jerry Chu
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 32+ messages in thread
From: Eric Dumazet @ 2014-10-30 18:22 UTC (permalink / raw)
  To: Or Gerlitz, Tom Herbert
  Cc: David S. Miller, netdev, Matan Barak, Amir Vadai, Saeed Mahameed,
	Shani Michaeli, Jerry Chu

On Thu, 2014-10-30 at 18:06 +0200, Or Gerlitz wrote:
> From: Shani Michaeli <shanim@mellanox.com>
> 
> When processing received traffic, pass CHECKSUM_COMPLETE status to the
> stack, with calculated checksum for non TCP/UDP packets (such
> as GRE or ICMP).
> 
> Although the stack expects checksum which doesn't include the pseudo
> header, the HW adds it. To address that, we are subtracting the pseudo
> header checksum from the checksum value provided by the HW.
> 
> In the IPv6 case, we also compute/add the IP header checksum which
> is not added by the HW for such packets.
> 
> Cc: Jerry Chu <hkchu@google.com>
> Signed-off-by: Shani Michaeli <shanim@mellanox.com>
> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
> ---


Awesome !

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 5/8] net/mlx4_en: Remove redundant code from RX/GRO path
  2014-10-30 16:06 ` [PATCH net-next 5/8] net/mlx4_en: Remove redundant code from RX/GRO path Or Gerlitz
@ 2014-10-30 19:00   ` Eric Dumazet
  2014-10-30 23:25     ` Or Gerlitz
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2014-10-30 19:00 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: David S. Miller, netdev, Matan Barak, Amir Vadai, Saeed Mahameed,
	Shani Michaeli

On Thu, 2014-10-30 at 18:06 +0200, Or Gerlitz wrote:
> Remove the code which goes through napi_gro_frags() on the RX path,
> use only napi_gro_receive().
> 
> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
> ---

Hmpff... napi_gro_frags() should be faster.

Have you benchmarked this ?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 8/8] net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE
  2014-10-30 16:06 ` [PATCH net-next 8/8] net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE Or Gerlitz
  2014-10-30 18:22   ` Eric Dumazet
@ 2014-10-30 21:20   ` Jerry Chu
  2014-10-30 23:28     ` Or Gerlitz
  2014-10-30 23:59   ` Tom Herbert
  2014-10-31  0:38   ` Yann Ylavic
  3 siblings, 1 reply; 32+ messages in thread
From: Jerry Chu @ 2014-10-30 21:20 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: David S. Miller, netdev, Matan Barak, Amir Vadai, Saeed Mahameed,
	Shani Michaeli

On Thu, Oct 30, 2014 at 9:06 AM, Or Gerlitz <ogerlitz@mellanox.com> wrote:
>
> From: Shani Michaeli <shanim@mellanox.com>
>
> When processing received traffic, pass CHECKSUM_COMPLETE status to the
> stack, with calculated checksum for non TCP/UDP packets (such
> as GRE or ICMP).
>
> Although the stack expects checksum which doesn't include the pseudo
> header, the HW adds it. To address that, we are subtracting the pseudo
> header checksum from the checksum value provided by the HW.
>
> In the IPv6 case, we also compute/add the IP header checksum which
> is not added by the HW for such packets.
>
> Cc: Jerry Chu <hkchu@google.com>
> Signed-off-by: Shani Michaeli <shanim@mellanox.com>
> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
> ---
>  drivers/net/ethernet/mellanox/mlx4/en_ethtool.c |    2 +-
>  drivers/net/ethernet/mellanox/mlx4/en_netdev.c  |    5 +
>  drivers/net/ethernet/mellanox/mlx4/en_port.c    |    2 +
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c      |  116 +++++++++++++++++++++-
>  drivers/net/ethernet/mellanox/mlx4/main.c       |    9 ++
>  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h    |    5 +-
>  include/linux/mlx4/device.h                     |    1 +
>  7 files changed, 132 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
> index 8ea4d5b..6c64323 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
> @@ -115,7 +115,7 @@ static const char main_strings[][ETH_GSTRING_LEN] = {
>         "tso_packets",
>         "xmit_more",
>         "queue_stopped", "wake_queue", "tx_timeout", "rx_alloc_failed",
> -       "rx_csum_good", "rx_csum_none", "tx_chksum_offload",
> +       "rx_csum_good", "rx_csum_none", "rx_csum_complete", "tx_chksum_offload",
>
>         /* packet statistics */
>         "broadcast", "rx_prio_0", "rx_prio_1", "rx_prio_2", "rx_prio_3",
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> index 0efbae9..d1eb25d 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> @@ -1893,6 +1893,7 @@ static void mlx4_en_clear_stats(struct net_device *dev)
>                 priv->rx_ring[i]->packets = 0;
>                 priv->rx_ring[i]->csum_ok = 0;
>                 priv->rx_ring[i]->csum_none = 0;
> +               priv->rx_ring[i]->csum_complete = 0;
>         }
>  }
>
> @@ -2503,6 +2504,10 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
>         /* Query for default mac and max mtu */
>         priv->max_mtu = mdev->dev->caps.eth_mtu_cap[priv->port];
>
> +       if (mdev->dev->caps.rx_checksum_flags_port[priv->port] &
> +           MLX4_RX_CSUM_MODE_VAL_NON_TCP_UDP)
> +               priv->flags |= MLX4_EN_FLAG_RX_CSUM_NON_TCP_UDP;
> +
>         /* Set default MAC */
>         dev->addr_len = ETH_ALEN;
>         mlx4_en_u64_to_mac(dev->dev_addr, mdev->dev->caps.def_mac[priv->port]);
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_port.c b/drivers/net/ethernet/mellanox/mlx4/en_port.c
> index 134b12e..6cb8007 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_port.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_port.c
> @@ -155,11 +155,13 @@ int mlx4_en_DUMP_ETH_STATS(struct mlx4_en_dev *mdev, u8 port, u8 reset)
>         stats->rx_bytes = 0;
>         priv->port_stats.rx_chksum_good = 0;
>         priv->port_stats.rx_chksum_none = 0;
> +       priv->port_stats.rx_chksum_complete = 0;
>         for (i = 0; i < priv->rx_ring_num; i++) {
>                 stats->rx_packets += priv->rx_ring[i]->packets;
>                 stats->rx_bytes += priv->rx_ring[i]->bytes;
>                 priv->port_stats.rx_chksum_good += priv->rx_ring[i]->csum_ok;
>                 priv->port_stats.rx_chksum_none += priv->rx_ring[i]->csum_none;
> +               priv->port_stats.rx_chksum_complete += priv->rx_ring[i]->csum_complete;
>         }
>         stats->tx_packets = 0;
>         stats->tx_bytes = 0;
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 2a29a1a..f8a0449 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -42,6 +42,10 @@
>  #include <linux/vmalloc.h>
>  #include <linux/irq.h>
>
> +#if IS_ENABLED(CONFIG_IPV6)
> +#include <net/ip6_checksum.h>
> +#endif
> +
>  #include "mlx4_en.h"
>
>  static int mlx4_alloc_pages(struct mlx4_en_priv *priv,
> @@ -642,6 +646,86 @@ static void mlx4_en_refill_rx_buffers(struct mlx4_en_priv *priv,
>         }
>  }
>
> +/* When hardware doesn't strip the vlan, we need to calculate the checksum
> + * over it and add it to the hardware's checksum calculation
> + */
> +static inline __wsum get_fixed_vlan_csum(__wsum hw_checksum,
> +                                        struct vlan_hdr *vlanh)
> +{
> +       return csum_add(hw_checksum, *(__wsum *)vlanh);
> +}
> +
> +/* Although the stack expects checksum which doesn't include the pseudo
> + * header, the HW adds it. To address that, we are subtracting the pseudo
> + * header checksum from the checksum value provided by the HW.
> + */
> +static void get_fixed_ipv4_csum(__wsum hw_checksum, struct sk_buff *skb,
> +                               struct iphdr *iph)
> +{
> +       __u16 length_for_csum = 0;
> +       __wsum csum_pseudo_header = 0;
> +
> +       length_for_csum = (be16_to_cpu(iph->tot_len) - (iph->ihl << 2));
> +       csum_pseudo_header = csum_tcpudp_nofold(iph->saddr, iph->daddr,
> +                                               length_for_csum, iph->protocol, 0);
> +       skb->csum = csum_sub(hw_checksum, csum_pseudo_header);
> +}
> +
> +#if IS_ENABLED(CONFIG_IPV6)
> +/* In IPv6 packets, besides subtracting the pseudo header checksum,
> + * we also compute/add the IP header checksum which
> + * is not added by the HW.
> + */
> +static int get_fixed_ipv6_csum(__wsum hw_checksum, struct sk_buff *skb,
> +                              struct ipv6hdr *ipv6h)
> +{
> +       __wsum csum_pseudo_header = 0;
> +
> +       if (ipv6h->nexthdr == IPPROTO_FRAGMENT || ipv6h->nexthdr == IPPROTO_HOPOPTS)
> +               return -1;
> +       hw_checksum = csum_add(hw_checksum, (__force __wsum)(ipv6h->nexthdr << 8));
> +
> +       csum_pseudo_header = csum_ipv6_magic_nofold(&ipv6h->saddr,
> +                                                   &ipv6h->daddr,
> +                                                   ntohs(ipv6h->payload_len),
> +                                                   ipv6h->nexthdr,
> +                                                   0);
> +       skb->csum = csum_sub(hw_checksum, csum_pseudo_header);
> +       skb->csum = csum_add(skb->csum, csum_partial(ipv6h, sizeof(struct ipv6hdr), 0));
> +       return 0;
> +}
> +#endif
> +
> +static int check_csum(struct mlx4_cqe *cqe, struct sk_buff *skb, int hwtstamp_rx_filter)
> +{
> +       __wsum hw_checksum = 0;
> +
> +       void *hdr = (u8 *)skb->data + sizeof(struct ethhdr);
> +
> +       hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
> +
> +       if (((struct ethhdr *)skb->data)->h_proto == htons(ETH_P_8021Q) &&
> +           hwtstamp_rx_filter != HWTSTAMP_FILTER_NONE) {
> +               /* next protocol non IPv4 or IPv6 */
> +               if (((struct vlan_hdr *)hdr)->h_vlan_encapsulated_proto
> +                   != htons(ETH_P_IP) ||
> +                   ((struct vlan_hdr *)hdr)->h_vlan_encapsulated_proto
> +                   != htons(ETH_P_IPV6))
> +                       return -1;
> +               hw_checksum = get_fixed_vlan_csum(hw_checksum, hdr);
> +               hdr += sizeof(struct vlan_hdr);
> +       }
> +
> +       if (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPV4))
> +               get_fixed_ipv4_csum(hw_checksum, skb, hdr);
> +#if IS_ENABLED(CONFIG_IPV6)
> +       else if (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPV6))
> +               if (get_fixed_ipv6_csum(hw_checksum, skb, hdr))
> +                       return -1;
> +#endif
> +       return 0;
> +}
> +
>  int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int budget)
>  {
>         struct mlx4_en_priv *priv = netdev_priv(dev);
> @@ -743,13 +827,26 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>                         (cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL));
>
>                 if (likely(dev->features & NETIF_F_RXCSUM)) {
> -                       if ((cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) &&
> -                           (cqe->checksum == cpu_to_be16(0xffff))) {
> -                               ring->csum_ok++;
> -                               ip_summed = CHECKSUM_UNNECESSARY;
> +                       if (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP |
> +                                                     MLX4_CQE_STATUS_UDP)) {
> +                               if ((cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) &&
> +                                   cqe->checksum == cpu_to_be16(0xffff)) {
> +                                       ip_summed = CHECKSUM_UNNECESSARY;
> +                                       ring->csum_ok++;
> +                               } else {
> +                                       ip_summed = CHECKSUM_NONE;
> +                                       ring->csum_none++;
> +                               }
>                         } else {
> -                               ip_summed = CHECKSUM_NONE;
> -                               ring->csum_none++;
> +                               if (priv->flags & MLX4_EN_FLAG_RX_CSUM_NON_TCP_UDP &&
> +                                   (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPV4 |
> +                                                              MLX4_CQE_STATUS_IPV6))) {
> +                                       ip_summed = CHECKSUM_COMPLETE;
> +                                       ring->csum_complete++;
> +                               } else {
> +                                       ip_summed = CHECKSUM_NONE;
> +                                       ring->csum_none++;
> +                               }
>                         }
>                 } else {
>                         ip_summed = CHECKSUM_NONE;
> @@ -767,6 +864,13 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>                         goto next;
>                 }
>
> +               if (ip_summed == CHECKSUM_COMPLETE) {
> +                       if (check_csum(cqe, skb, ring->hwtstamp_rx_filter)) {
> +                               ip_summed = CHECKSUM_NONE;
> +                               ring->csum_none++;
> +                       }
> +               }
> +
>                 skb->ip_summed = ip_summed;
>                 skb->protocol = eth_type_trans(skb, dev);
>                 skb_record_rx_queue(skb, cq->ring);
> diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
> index 9f82196..2f6ba42 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/main.c
> @@ -1629,6 +1629,7 @@ static int mlx4_init_hca(struct mlx4_dev *dev)
>         struct mlx4_init_hca_param init_hca;
>         u64 icm_size;
>         int err;
> +       struct mlx4_config_dev_params params;
>
>         if (!mlx4_is_slave(dev)) {
>                 err = mlx4_QUERY_FW(dev);
> @@ -1762,6 +1763,14 @@ static int mlx4_init_hca(struct mlx4_dev *dev)
>                 goto unmap_bf;
>         }
>
> +       /* Query CONFIG_DEV parameters */
> +       err = mlx4_config_dev_retrieval(dev, &params);
> +       if (err && err != -ENOTSUPP) {
> +               mlx4_err(dev, "Failed to query CONFIG_DEV parameters\n");
> +       } else if (!err) {
> +               dev->caps.rx_checksum_flags_port[1] = params.rx_csum_flags_port_1;
> +               dev->caps.rx_checksum_flags_port[2] = params.rx_csum_flags_port_2;
> +       }
>         priv->eq_table.inta_pin = adapter.inta_pin;
>         memcpy(dev->board_id, adapter.board_id, sizeof dev->board_id);
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> index ef83d12..de45674 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> @@ -326,6 +326,7 @@ struct mlx4_en_rx_ring {
>  #endif
>         unsigned long csum_ok;
>         unsigned long csum_none;
> +       unsigned long csum_complete;
>         int hwtstamp_rx_filter;
>         cpumask_var_t affinity_mask;
>  };
> @@ -449,6 +450,7 @@ struct mlx4_en_port_stats {
>         unsigned long rx_alloc_failed;
>         unsigned long rx_chksum_good;
>         unsigned long rx_chksum_none;
> +       unsigned long rx_chksum_complete;
>         unsigned long tx_chksum_offload;
>  #define NUM_PORT_STATS         9
>  };
> @@ -507,7 +509,8 @@ enum {
>         MLX4_EN_FLAG_ENABLE_HW_LOOPBACK = (1 << 2),
>         /* whether we need to drop packets that hardware loopback-ed */
>         MLX4_EN_FLAG_RX_FILTER_NEEDED   = (1 << 3),
> -       MLX4_EN_FLAG_FORCE_PROMISC      = (1 << 4)
> +       MLX4_EN_FLAG_FORCE_PROMISC      = (1 << 4),
> +       MLX4_EN_FLAG_RX_CSUM_NON_TCP_UDP        = (1 << 5),
>  };
>
>  #define MLX4_EN_MAC_HASH_SIZE (1 << BITS_PER_BYTE)
> diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
> index 5cc5eac..3d9bff0 100644
> --- a/include/linux/mlx4/device.h
> +++ b/include/linux/mlx4/device.h
> @@ -497,6 +497,7 @@ struct mlx4_caps {
>         u16                     hca_core_clock;
>         u64                     phys_port_id[MLX4_MAX_PORTS + 1];
>         int                     tunnel_offload_mode;
> +       u8                      rx_checksum_flags_port[MLX4_MAX_PORTS + 1];
>  };
>
>  struct mlx4_buf_list {
> --
> 1.7.1
>

Acked-by: H.K. Jerry Chu <hkchu@google.com>

BTW, will the patch work for all versions of the chip?

Jerry

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 4/8] net/mlx4_en: Add __GFP_COLD gfp flags in alloc_pages
  2014-10-30 17:15   ` Eric Dumazet
@ 2014-10-30 22:49     ` Or Gerlitz
  0 siblings, 0 replies; 32+ messages in thread
From: Or Gerlitz @ 2014-10-30 22:49 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Or Gerlitz, David S. Miller, Linux Netdev List, Matan Barak,
	Amir Vadai, Saeed Mahameed, Shani Michaeli, Ido Shamay

On Thu, Oct 30, 2014 at 7:15 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> @@ -54,7 +54,7 @@ static int mlx4_alloc_pages(struct mlx4_en_priv *priv,
>>       dma_addr_t dma;
>>
>>       for (order = MLX4_EN_ALLOC_PREFER_ORDER; ;) {
>> -             gfp_t gfp = _gfp;
>> +             gfp_t gfp = _gfp | __GFP_COLD;

> This should be set by callers, to avoid extra code
> GFP_ATOMIC | __GFP_COLD
> or
> GFP_KERNEL | __GFP_COLD

I guess we can do that, sure.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 5/8] net/mlx4_en: Remove redundant code from RX/GRO path
  2014-10-30 19:00   ` Eric Dumazet
@ 2014-10-30 23:25     ` Or Gerlitz
  2014-10-31  3:19       ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Or Gerlitz @ 2014-10-30 23:25 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Or Gerlitz, David S. Miller, Linux Netdev List, Matan Barak,
	Amir Vadai, Saeed Mahameed, Shani Michaeli, Ido Shamay

On Thu, Oct 30, 2014 at 9:00 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2014-10-30 at 18:06 +0200, Or Gerlitz wrote:
>> Remove the code which goes through napi_gro_frags() on the RX path,
>> use only napi_gro_receive().

> Hmpff... napi_gro_frags() should be faster.
> Have you benchmarked this ?


yep we did, napi_gro_frags() was somehow better for single stream. Do
you think we need to do it the other way around, e.g converge to use
napi_gro_frags()?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 8/8] net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE
  2014-10-30 21:20   ` Jerry Chu
@ 2014-10-30 23:28     ` Or Gerlitz
  2014-10-30 23:53       ` Jerry Chu
  0 siblings, 1 reply; 32+ messages in thread
From: Or Gerlitz @ 2014-10-30 23:28 UTC (permalink / raw)
  To: Jerry Chu
  Cc: Or Gerlitz, David S. Miller, netdev, Matan Barak, Amir Vadai,
	Saeed Mahameed, Shani Michaeli

On Thu, Oct 30, 2014 at 11:20 PM, Jerry Chu <hkchu@google.com> wrote:
> Acked-by: H.K. Jerry Chu <hkchu@google.com>
>
> BTW, will the patch work for all versions of the chip?

If you'll look carefully, you'll see we go that path only when
priv->flags & MLX4_EN_FLAG_RX_CSUM_NON_TCP_UDP is true. This currently
holds only for ConnectX3 and not ConnectX3-pro. Down the road, the
feature will also be available for the -pro too.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 8/8] net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE
  2014-10-30 23:28     ` Or Gerlitz
@ 2014-10-30 23:53       ` Jerry Chu
  0 siblings, 0 replies; 32+ messages in thread
From: Jerry Chu @ 2014-10-30 23:53 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Or Gerlitz, David S. Miller, netdev, Matan Barak, Amir Vadai,
	Saeed Mahameed, Shani Michaeli

On Thu, Oct 30, 2014 at 4:28 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
>
> On Thu, Oct 30, 2014 at 11:20 PM, Jerry Chu <hkchu@google.com> wrote:
> > Acked-by: H.K. Jerry Chu <hkchu@google.com>
> >
> > BTW, will the patch work for all versions of the chip?
>
> If you'll look carefully, you'll see we go that path only when
> priv->flags & MLX4_EN_FLAG_RX_CSUM_NON_TCP_UDP is true. This currently
> holds only for ConnectX3 and not ConnectX3-pro. Down the road, the
> feature will also be available for the -pro too.

It didn't dawn on me that flag will be tied to CX3 but this makes sense.

Thanks,

Jerry

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 8/8] net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE
  2014-10-30 16:06 ` [PATCH net-next 8/8] net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE Or Gerlitz
  2014-10-30 18:22   ` Eric Dumazet
  2014-10-30 21:20   ` Jerry Chu
@ 2014-10-30 23:59   ` Tom Herbert
  2014-10-31 13:57     ` Or Gerlitz
  2014-10-31  0:38   ` Yann Ylavic
  3 siblings, 1 reply; 32+ messages in thread
From: Tom Herbert @ 2014-10-30 23:59 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: David S. Miller, Linux Netdev List, Matan Barak, Amir Vadai,
	Saeed Mahameed, Shani Michaeli, Jerry Chu

On Thu, Oct 30, 2014 at 9:06 AM, Or Gerlitz <ogerlitz@mellanox.com> wrote:
> From: Shani Michaeli <shanim@mellanox.com>
>
> When processing received traffic, pass CHECKSUM_COMPLETE status to the
> stack, with calculated checksum for non TCP/UDP packets (such
> as GRE or ICMP).
>
Hi Or,

This is very exciting work! One question though, what would mlx4
return in the case of a zero UDP checksum? (I assume this patch won't
affect this case but would like to make sure).

Thanks,
Tom


> Although the stack expects checksum which doesn't include the pseudo
> header, the HW adds it. To address that, we are subtracting the pseudo
> header checksum from the checksum value provided by the HW.
>
> In the IPv6 case, we also compute/add the IP header checksum which
> is not added by the HW for such packets.
>
> Cc: Jerry Chu <hkchu@google.com>
> Signed-off-by: Shani Michaeli <shanim@mellanox.com>
> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
> ---
>  drivers/net/ethernet/mellanox/mlx4/en_ethtool.c |    2 +-
>  drivers/net/ethernet/mellanox/mlx4/en_netdev.c  |    5 +
>  drivers/net/ethernet/mellanox/mlx4/en_port.c    |    2 +
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c      |  116 +++++++++++++++++++++-
>  drivers/net/ethernet/mellanox/mlx4/main.c       |    9 ++
>  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h    |    5 +-
>  include/linux/mlx4/device.h                     |    1 +
>  7 files changed, 132 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
> index 8ea4d5b..6c64323 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
> @@ -115,7 +115,7 @@ static const char main_strings[][ETH_GSTRING_LEN] = {
>         "tso_packets",
>         "xmit_more",
>         "queue_stopped", "wake_queue", "tx_timeout", "rx_alloc_failed",
> -       "rx_csum_good", "rx_csum_none", "tx_chksum_offload",
> +       "rx_csum_good", "rx_csum_none", "rx_csum_complete", "tx_chksum_offload",
>
>         /* packet statistics */
>         "broadcast", "rx_prio_0", "rx_prio_1", "rx_prio_2", "rx_prio_3",
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> index 0efbae9..d1eb25d 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> @@ -1893,6 +1893,7 @@ static void mlx4_en_clear_stats(struct net_device *dev)
>                 priv->rx_ring[i]->packets = 0;
>                 priv->rx_ring[i]->csum_ok = 0;
>                 priv->rx_ring[i]->csum_none = 0;
> +               priv->rx_ring[i]->csum_complete = 0;
>         }
>  }
>
> @@ -2503,6 +2504,10 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
>         /* Query for default mac and max mtu */
>         priv->max_mtu = mdev->dev->caps.eth_mtu_cap[priv->port];
>
> +       if (mdev->dev->caps.rx_checksum_flags_port[priv->port] &
> +           MLX4_RX_CSUM_MODE_VAL_NON_TCP_UDP)
> +               priv->flags |= MLX4_EN_FLAG_RX_CSUM_NON_TCP_UDP;
> +
>         /* Set default MAC */
>         dev->addr_len = ETH_ALEN;
>         mlx4_en_u64_to_mac(dev->dev_addr, mdev->dev->caps.def_mac[priv->port]);
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_port.c b/drivers/net/ethernet/mellanox/mlx4/en_port.c
> index 134b12e..6cb8007 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_port.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_port.c
> @@ -155,11 +155,13 @@ int mlx4_en_DUMP_ETH_STATS(struct mlx4_en_dev *mdev, u8 port, u8 reset)
>         stats->rx_bytes = 0;
>         priv->port_stats.rx_chksum_good = 0;
>         priv->port_stats.rx_chksum_none = 0;
> +       priv->port_stats.rx_chksum_complete = 0;
>         for (i = 0; i < priv->rx_ring_num; i++) {
>                 stats->rx_packets += priv->rx_ring[i]->packets;
>                 stats->rx_bytes += priv->rx_ring[i]->bytes;
>                 priv->port_stats.rx_chksum_good += priv->rx_ring[i]->csum_ok;
>                 priv->port_stats.rx_chksum_none += priv->rx_ring[i]->csum_none;
> +               priv->port_stats.rx_chksum_complete += priv->rx_ring[i]->csum_complete;
>         }
>         stats->tx_packets = 0;
>         stats->tx_bytes = 0;
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 2a29a1a..f8a0449 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -42,6 +42,10 @@
>  #include <linux/vmalloc.h>
>  #include <linux/irq.h>
>
> +#if IS_ENABLED(CONFIG_IPV6)
> +#include <net/ip6_checksum.h>
> +#endif
> +
>  #include "mlx4_en.h"
>
>  static int mlx4_alloc_pages(struct mlx4_en_priv *priv,
> @@ -642,6 +646,86 @@ static void mlx4_en_refill_rx_buffers(struct mlx4_en_priv *priv,
>         }
>  }
>
> +/* When hardware doesn't strip the vlan, we need to calculate the checksum
> + * over it and add it to the hardware's checksum calculation
> + */
> +static inline __wsum get_fixed_vlan_csum(__wsum hw_checksum,
> +                                        struct vlan_hdr *vlanh)
> +{
> +       return csum_add(hw_checksum, *(__wsum *)vlanh);
> +}
> +
> +/* Although the stack expects checksum which doesn't include the pseudo
> + * header, the HW adds it. To address that, we are subtracting the pseudo
> + * header checksum from the checksum value provided by the HW.
> + */
> +static void get_fixed_ipv4_csum(__wsum hw_checksum, struct sk_buff *skb,
> +                               struct iphdr *iph)
> +{
> +       __u16 length_for_csum = 0;
> +       __wsum csum_pseudo_header = 0;
> +
> +       length_for_csum = (be16_to_cpu(iph->tot_len) - (iph->ihl << 2));
> +       csum_pseudo_header = csum_tcpudp_nofold(iph->saddr, iph->daddr,
> +                                               length_for_csum, iph->protocol, 0);
> +       skb->csum = csum_sub(hw_checksum, csum_pseudo_header);
> +}
> +
> +#if IS_ENABLED(CONFIG_IPV6)
> +/* In IPv6 packets, besides subtracting the pseudo header checksum,
> + * we also compute/add the IP header checksum which
> + * is not added by the HW.
> + */
> +static int get_fixed_ipv6_csum(__wsum hw_checksum, struct sk_buff *skb,
> +                              struct ipv6hdr *ipv6h)
> +{
> +       __wsum csum_pseudo_header = 0;
> +
> +       if (ipv6h->nexthdr == IPPROTO_FRAGMENT || ipv6h->nexthdr == IPPROTO_HOPOPTS)
> +               return -1;
> +       hw_checksum = csum_add(hw_checksum, (__force __wsum)(ipv6h->nexthdr << 8));
> +
> +       csum_pseudo_header = csum_ipv6_magic_nofold(&ipv6h->saddr,
> +                                                   &ipv6h->daddr,
> +                                                   ntohs(ipv6h->payload_len),
> +                                                   ipv6h->nexthdr,
> +                                                   0);
> +       skb->csum = csum_sub(hw_checksum, csum_pseudo_header);
> +       skb->csum = csum_add(skb->csum, csum_partial(ipv6h, sizeof(struct ipv6hdr), 0));
> +       return 0;
> +}
> +#endif
> +
> +static int check_csum(struct mlx4_cqe *cqe, struct sk_buff *skb, int hwtstamp_rx_filter)
> +{
> +       __wsum hw_checksum = 0;
> +
> +       void *hdr = (u8 *)skb->data + sizeof(struct ethhdr);
> +
> +       hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
> +
> +       if (((struct ethhdr *)skb->data)->h_proto == htons(ETH_P_8021Q) &&
> +           hwtstamp_rx_filter != HWTSTAMP_FILTER_NONE) {
> +               /* next protocol non IPv4 or IPv6 */
> +               if (((struct vlan_hdr *)hdr)->h_vlan_encapsulated_proto
> +                   != htons(ETH_P_IP) ||
> +                   ((struct vlan_hdr *)hdr)->h_vlan_encapsulated_proto
> +                   != htons(ETH_P_IPV6))
> +                       return -1;
> +               hw_checksum = get_fixed_vlan_csum(hw_checksum, hdr);
> +               hdr += sizeof(struct vlan_hdr);
> +       }
> +
> +       if (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPV4))
> +               get_fixed_ipv4_csum(hw_checksum, skb, hdr);
> +#if IS_ENABLED(CONFIG_IPV6)
> +       else if (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPV6))
> +               if (get_fixed_ipv6_csum(hw_checksum, skb, hdr))
> +                       return -1;
> +#endif
> +       return 0;
> +}
> +
>  int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int budget)
>  {
>         struct mlx4_en_priv *priv = netdev_priv(dev);
> @@ -743,13 +827,26 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>                         (cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL));
>
>                 if (likely(dev->features & NETIF_F_RXCSUM)) {
> -                       if ((cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) &&
> -                           (cqe->checksum == cpu_to_be16(0xffff))) {
> -                               ring->csum_ok++;
> -                               ip_summed = CHECKSUM_UNNECESSARY;
> +                       if (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP |
> +                                                     MLX4_CQE_STATUS_UDP)) {
> +                               if ((cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) &&
> +                                   cqe->checksum == cpu_to_be16(0xffff)) {
> +                                       ip_summed = CHECKSUM_UNNECESSARY;
> +                                       ring->csum_ok++;
> +                               } else {
> +                                       ip_summed = CHECKSUM_NONE;
> +                                       ring->csum_none++;
> +                               }
>                         } else {
> -                               ip_summed = CHECKSUM_NONE;
> -                               ring->csum_none++;
> +                               if (priv->flags & MLX4_EN_FLAG_RX_CSUM_NON_TCP_UDP &&
> +                                   (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPV4 |
> +                                                              MLX4_CQE_STATUS_IPV6))) {
> +                                       ip_summed = CHECKSUM_COMPLETE;
> +                                       ring->csum_complete++;
> +                               } else {
> +                                       ip_summed = CHECKSUM_NONE;
> +                                       ring->csum_none++;
> +                               }
>                         }
>                 } else {
>                         ip_summed = CHECKSUM_NONE;
> @@ -767,6 +864,13 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>                         goto next;
>                 }
>
> +               if (ip_summed == CHECKSUM_COMPLETE) {
> +                       if (check_csum(cqe, skb, ring->hwtstamp_rx_filter)) {
> +                               ip_summed = CHECKSUM_NONE;
> +                               ring->csum_none++;
> +                       }
> +               }
> +
>                 skb->ip_summed = ip_summed;
>                 skb->protocol = eth_type_trans(skb, dev);
>                 skb_record_rx_queue(skb, cq->ring);
> diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
> index 9f82196..2f6ba42 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/main.c
> @@ -1629,6 +1629,7 @@ static int mlx4_init_hca(struct mlx4_dev *dev)
>         struct mlx4_init_hca_param init_hca;
>         u64 icm_size;
>         int err;
> +       struct mlx4_config_dev_params params;
>
>         if (!mlx4_is_slave(dev)) {
>                 err = mlx4_QUERY_FW(dev);
> @@ -1762,6 +1763,14 @@ static int mlx4_init_hca(struct mlx4_dev *dev)
>                 goto unmap_bf;
>         }
>
> +       /* Query CONFIG_DEV parameters */
> +       err = mlx4_config_dev_retrieval(dev, &params);
> +       if (err && err != -ENOTSUPP) {
> +               mlx4_err(dev, "Failed to query CONFIG_DEV parameters\n");
> +       } else if (!err) {
> +               dev->caps.rx_checksum_flags_port[1] = params.rx_csum_flags_port_1;
> +               dev->caps.rx_checksum_flags_port[2] = params.rx_csum_flags_port_2;
> +       }
>         priv->eq_table.inta_pin = adapter.inta_pin;
>         memcpy(dev->board_id, adapter.board_id, sizeof dev->board_id);
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> index ef83d12..de45674 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> @@ -326,6 +326,7 @@ struct mlx4_en_rx_ring {
>  #endif
>         unsigned long csum_ok;
>         unsigned long csum_none;
> +       unsigned long csum_complete;
>         int hwtstamp_rx_filter;
>         cpumask_var_t affinity_mask;
>  };
> @@ -449,6 +450,7 @@ struct mlx4_en_port_stats {
>         unsigned long rx_alloc_failed;
>         unsigned long rx_chksum_good;
>         unsigned long rx_chksum_none;
> +       unsigned long rx_chksum_complete;
>         unsigned long tx_chksum_offload;
>  #define NUM_PORT_STATS         9
>  };
> @@ -507,7 +509,8 @@ enum {
>         MLX4_EN_FLAG_ENABLE_HW_LOOPBACK = (1 << 2),
>         /* whether we need to drop packets that hardware loopback-ed */
>         MLX4_EN_FLAG_RX_FILTER_NEEDED   = (1 << 3),
> -       MLX4_EN_FLAG_FORCE_PROMISC      = (1 << 4)
> +       MLX4_EN_FLAG_FORCE_PROMISC      = (1 << 4),
> +       MLX4_EN_FLAG_RX_CSUM_NON_TCP_UDP        = (1 << 5),
>  };
>
>  #define MLX4_EN_MAC_HASH_SIZE (1 << BITS_PER_BYTE)
> diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
> index 5cc5eac..3d9bff0 100644
> --- a/include/linux/mlx4/device.h
> +++ b/include/linux/mlx4/device.h
> @@ -497,6 +497,7 @@ struct mlx4_caps {
>         u16                     hca_core_clock;
>         u64                     phys_port_id[MLX4_MAX_PORTS + 1];
>         int                     tunnel_offload_mode;
> +       u8                      rx_checksum_flags_port[MLX4_MAX_PORTS + 1];
>  };
>
>  struct mlx4_buf_list {
> --
> 1.7.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 8/8] net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE
  2014-10-30 16:06 ` [PATCH net-next 8/8] net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE Or Gerlitz
                     ` (2 preceding siblings ...)
  2014-10-30 23:59   ` Tom Herbert
@ 2014-10-31  0:38   ` Yann Ylavic
  2014-10-31 13:52     ` Or Gerlitz
  3 siblings, 1 reply; 32+ messages in thread
From: Yann Ylavic @ 2014-10-31  0:38 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: David S. Miller, netdev, Matan Barak, Amir Vadai, Saeed Mahameed,
	Shani Michaeli, Jerry Chu

Hi,

On Thu, Oct 30, 2014 at 5:06 PM, Or Gerlitz <ogerlitz@mellanox.com> wrote:
[...]
> +static int check_csum(struct mlx4_cqe *cqe, struct sk_buff *skb, int hwtstamp_rx_filter)
> +{
> +       __wsum hw_checksum = 0;
> +
> +       void *hdr = (u8 *)skb->data + sizeof(struct ethhdr);
> +
> +       hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
> +
> +       if (((struct ethhdr *)skb->data)->h_proto == htons(ETH_P_8021Q) &&
> +           hwtstamp_rx_filter != HWTSTAMP_FILTER_NONE) {
> +               /* next protocol non IPv4 or IPv6 */
> +               if (((struct vlan_hdr *)hdr)->h_vlan_encapsulated_proto
> +                   != htons(ETH_P_IP) ||

Shouldn't this be a AND (&&)?

> +                   ((struct vlan_hdr *)hdr)->h_vlan_encapsulated_proto
> +                   != htons(ETH_P_IPV6))
> +                       return -1;

Regards,
Yann.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 5/8] net/mlx4_en: Remove redundant code from RX/GRO path
  2014-10-30 23:25     ` Or Gerlitz
@ 2014-10-31  3:19       ` Eric Dumazet
  2014-10-31 14:00         ` Or Gerlitz
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2014-10-31  3:19 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Or Gerlitz, David S. Miller, Linux Netdev List, Matan Barak,
	Amir Vadai, Saeed Mahameed, Shani Michaeli, Ido Shamay

On Fri, 2014-10-31 at 01:25 +0200, Or Gerlitz wrote:
> On Thu, Oct 30, 2014 at 9:00 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Thu, 2014-10-30 at 18:06 +0200, Or Gerlitz wrote:
> >> Remove the code which goes through napi_gro_frags() on the RX path,
> >> use only napi_gro_receive().
> 
> > Hmpff... napi_gro_frags() should be faster.
> > Have you benchmarked this ?
> 
> 
> yep we did, napi_gro_frags() was somehow better for single stream. Do
> you think we need to do it the other way around, e.g converge to use
> napi_gro_frags()?

napi_gro_frags() is faster because the napi->skb is reused fast (not
going through kfree_skb()/alloc_skb() for every fragment)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 8/8] net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE
  2014-10-31  0:38   ` Yann Ylavic
@ 2014-10-31 13:52     ` Or Gerlitz
  0 siblings, 0 replies; 32+ messages in thread
From: Or Gerlitz @ 2014-10-31 13:52 UTC (permalink / raw)
  To: Yann Ylavic
  Cc: Or Gerlitz, David S. Miller, Linux Netdev List, Matan Barak,
	Amir Vadai, Saeed Mahameed, Shani Michaeli, Jerry Chu

On Fri, Oct 31, 2014 at 2:38 AM, Yann Ylavic <ylavic.dev@gmail.com> wrote:
> Hi,
>
> On Thu, Oct 30, 2014 at 5:06 PM, Or Gerlitz <ogerlitz@mellanox.com> wrote:
> [...]
>> +static int check_csum(struct mlx4_cqe *cqe, struct sk_buff *skb, int hwtstamp_rx_filter)
>> +{
>> +       __wsum hw_checksum = 0;
>> +
>> +       void *hdr = (u8 *)skb->data + sizeof(struct ethhdr);
>> +
>> +       hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
>> +
>> +       if (((struct ethhdr *)skb->data)->h_proto == htons(ETH_P_8021Q) &&
>> +           hwtstamp_rx_filter != HWTSTAMP_FILTER_NONE) {
>> +               /* next protocol non IPv4 or IPv6 */
>> +               if (((struct vlan_hdr *)hdr)->h_vlan_encapsulated_proto
>> +                   != htons(ETH_P_IP) ||
>
> Shouldn't this be a AND (&&)?

Oh, yes of course, good catch (this protects against the case of QinQ
which isn't supported, so somehow passed the testing... Shani, please
fix it up.

Or.

>
>> +                   ((struct vlan_hdr *)hdr)->h_vlan_encapsulated_proto
>> +                   != htons(ETH_P_IPV6))
>> +                       return -1;

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 8/8] net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE
  2014-10-30 23:59   ` Tom Herbert
@ 2014-10-31 13:57     ` Or Gerlitz
  0 siblings, 0 replies; 32+ messages in thread
From: Or Gerlitz @ 2014-10-31 13:57 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Or Gerlitz, David S. Miller, Linux Netdev List, Matan Barak,
	Amir Vadai, Saeed Mahameed, Shani Michaeli, Jerry Chu

On Fri, Oct 31, 2014 at 1:59 AM, Tom Herbert <therbert@google.com> wrote:
> On Thu, Oct 30, 2014 at 9:06 AM, Or Gerlitz <ogerlitz@mellanox.com> wrote:
>> From: Shani Michaeli <shanim@mellanox.com>
>>
>> When processing received traffic, pass CHECKSUM_COMPLETE status to the
>> stack, with calculated checksum for non TCP/UDP packets (such
>> as GRE or ICMP).

> This is very exciting work!

thanks, good to know...

> One question though, what would mlx4
> return in the case of a zero UDP checksum? (I assume this patch won't
> affect this case but would like to make sure).

This patch doesn't change any functionality w.r.t UDP or TCP packets,
only for IP protocols which are different from those two.

When VXLAN packets arrive with zero UDP checksum and the HW verified
the internal checksum we return CHECKSUM_UNNECESSARY, for other cases
(e.g UDP but not VXLAN) I'd like to have 2nd look.

Or.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 5/8] net/mlx4_en: Remove redundant code from RX/GRO path
  2014-10-31  3:19       ` Eric Dumazet
@ 2014-10-31 14:00         ` Or Gerlitz
  2014-10-31 15:46           ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Or Gerlitz @ 2014-10-31 14:00 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Or Gerlitz, David S. Miller, Linux Netdev List, Matan Barak,
	Amir Vadai, Saeed Mahameed, Shani Michaeli, Ido Shamay

On Fri, Oct 31, 2014 at 5:19 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2014-10-31 at 01:25 +0200, Or Gerlitz wrote:
>> On Thu, Oct 30, 2014 at 9:00 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > On Thu, 2014-10-30 at 18:06 +0200, Or Gerlitz wrote:
>> >> Remove the code which goes through napi_gro_frags() on the RX path,
>> >> use only napi_gro_receive().
>>
>> > Hmpff... napi_gro_frags() should be faster.
>> > Have you benchmarked this ?
>>
>>
>> yep we did, napi_gro_frags() was somehow better for single stream. Do
>> you think we need to do it the other way around, e.g converge to use
>> napi_gro_frags()?

> napi_gro_frags() is faster because the napi->skb is reused fast (not
> going through kfree_skb()/alloc_skb() for every fragment)

I see. Is this a strong vote to convert the code to use napi_gro_frags
on it's usual track?

Or.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 5/8] net/mlx4_en: Remove redundant code from RX/GRO path
  2014-10-31 14:00         ` Or Gerlitz
@ 2014-10-31 15:46           ` Eric Dumazet
  2014-11-02 14:09             ` Or Gerlitz
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2014-10-31 15:46 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Or Gerlitz, David S. Miller, Linux Netdev List, Matan Barak,
	Amir Vadai, Saeed Mahameed, Shani Michaeli, Ido Shamay

On Fri, 2014-10-31 at 16:00 +0200, Or Gerlitz wrote:
> On Fri, Oct 31, 2014 at 5:19 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Fri, 2014-10-31 at 01:25 +0200, Or Gerlitz wrote:
> >> On Thu, Oct 30, 2014 at 9:00 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >> > On Thu, 2014-10-30 at 18:06 +0200, Or Gerlitz wrote:
> >> >> Remove the code which goes through napi_gro_frags() on the RX path,
> >> >> use only napi_gro_receive().
> >>
> >> > Hmpff... napi_gro_frags() should be faster.
> >> > Have you benchmarked this ?
> >>
> >>
> >> yep we did, napi_gro_frags() was somehow better for single stream. Do
> >> you think we need to do it the other way around, e.g converge to use
> >> napi_gro_frags()?
> 
> > napi_gro_frags() is faster because the napi->skb is reused fast (not
> > going through kfree_skb()/alloc_skb() for every fragment)
> 
> I see. Is this a strong vote to convert the code to use napi_gro_frags
> on it's usual track?

I don't know yet. In some cases, actually slowing down the rx path can
help by building bigger GRO packets. But instead of inserting delays,
we can simply force napi to be run another time, with a nanosec based
timer.

I've tested this kind of heuristic :

       /* If some packets are waiting in GRO engine and timeout is not expired,
        * reschedule a NAPI poll. We allow servicing other softirqs
        * before repoll, we do not rearm CQ.
        */
       if (rx_nsecs && napi->gro_list && !need_resched()) {
               u64 now = local_clock();
               unsigned long flags;

               /* If we got packets in this round, restart timeout */
               if (done)
                       cq->tstart = now;
               else if (now - cq->tstart >= (u64)rx_nsecs)
                       goto complete;

               /* Since we might need one skb very soon, build it now */
               napi_get_frags(napi);

               local_irq_save(flags);
               list_del(&napi->poll_list);
               __napi_schedule_irqoff(napi);
               local_irq_restore(flags);

        } else {
complete:
                napi_complete(napi);
                mlx4_en_arm_cq(priv, cq);
        }
	return done;

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 5/8] net/mlx4_en: Remove redundant code from RX/GRO path
  2014-10-31 15:46           ` Eric Dumazet
@ 2014-11-02 14:09             ` Or Gerlitz
  2014-11-02 14:28               ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Or Gerlitz @ 2014-11-02 14:09 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Linux Netdev List, Matan Barak, Amir Vadai,
	Saeed Mahameed, Shani Michaeli, Ido Shamay

On 10/31/2014 5:46 PM, Eric Dumazet wrote:
> On Fri, 2014-10-31 at 16:00 +0200, Or Gerlitz wrote:
>> On Fri, Oct 31, 2014 at 5:19 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> On Fri, 2014-10-31 at 01:25 +0200, Or Gerlitz wrote:
>>>> On Thu, Oct 30, 2014 at 9:00 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>>> On Thu, 2014-10-30 at 18:06 +0200, Or Gerlitz wrote:
>>>>>> Remove the code which goes through napi_gro_frags() on the RX path,
>>>>>> use only napi_gro_receive().
>>>>> Hmpff... napi_gro_frags() should be faster. Have you benchmarked this ?
>>>>
>>>> yep we did, napi_gro_frags() was somehow better for single stream. Do
>>>> you think we need to do it the other way around, e.g converge to use napi_gro_frags()?
>>> napi_gro_frags() is faster because the napi->skb is reused fast (not
>>> going through kfree_skb()/alloc_skb() for every fragment)
>> I see. Is this a strong vote to convert the code to use napi_gro_frags
>> on it's usual track?
> I don't know yet. In some cases, actually slowing down the rx path can
> help by building bigger GRO packets. But instead of inserting delays,
> we can simply force napi to be run another time, with a nanosec based
> timer.
>
> I've tested this kind of heuristic :
>
>         /* If some packets are waiting in GRO engine and timeout is not expired,
>          * reschedule a NAPI poll. We allow servicing other softirqs
>          * before repoll, we do not rearm CQ.
>          */
>         if (rx_nsecs && napi->gro_list && !need_resched()) {
>                 u64 now = local_clock();
>                 unsigned long flags;
>
>                 /* If we got packets in this round, restart timeout */
>                 if (done)
>                         cq->tstart = now;
>                 else if (now - cq->tstart >= (u64)rx_nsecs)
>                         goto complete;
>
>                 /* Since we might need one skb very soon, build it now */
>                 napi_get_frags(napi);
>
>                 local_irq_save(flags);
>                 list_del(&napi->poll_list);
>                 __napi_schedule_irqoff(napi);
>                 local_irq_restore(flags);
>
>          } else {
> complete:
>                  napi_complete(napi);
>                  mlx4_en_arm_cq(priv, cq);
>          }
> 	return done;

Hi Eric,

For the time being, I'll drop from this series thischange and the 
following ones which depend on it. So can pick in the earlier patches of 
the series, and investigate in parallel thevarious optionsw.r.t GRO here.

Or.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 5/8] net/mlx4_en: Remove redundant code from RX/GRO path
  2014-11-02 14:09             ` Or Gerlitz
@ 2014-11-02 14:28               ` Eric Dumazet
  0 siblings, 0 replies; 32+ messages in thread
From: Eric Dumazet @ 2014-11-02 14:28 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: David S. Miller, Linux Netdev List, Matan Barak, Amir Vadai,
	Saeed Mahameed, Shani Michaeli, Ido Shamay

On Sun, 2014-11-02 at 16:09 +0200, Or Gerlitz wrote:

> Hi Eric,
> 
> For the time being, I'll drop from this series thischange and the 
> following ones which depend on it. So can pick in the earlier patches of 
> the series, and investigate in parallel thevarious optionsw.r.t GRO here.

Thanks Or !

I posted first patch to allow implementing this better GRO strategy
(https://patchwork.ozlabs.org/patch/405892/ )

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2014-11-02 14:28 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-10-30 16:06 [PATCH net-next 0/8] Mellanox ethernet driver update Oct-30-2014 Or Gerlitz
2014-10-30 16:06 ` [PATCH net-next 1/8] net/mlx4_core: Prevent VF from changing port configuration Or Gerlitz
2014-10-30 16:06 ` [PATCH net-next 2/8] net/mlx4_core: Protect port type setting by mutex Or Gerlitz
2014-10-30 16:06 ` [PATCH net-next 3/8] net/mlx4_en: Remove RX buffers alignment to IP_ALIGN Or Gerlitz
2014-10-30 16:06 ` [PATCH net-next 4/8] net/mlx4_en: Add __GFP_COLD gfp flags in alloc_pages Or Gerlitz
2014-10-30 17:15   ` Eric Dumazet
2014-10-30 22:49     ` Or Gerlitz
2014-10-30 16:06 ` [PATCH net-next 5/8] net/mlx4_en: Remove redundant code from RX/GRO path Or Gerlitz
2014-10-30 19:00   ` Eric Dumazet
2014-10-30 23:25     ` Or Gerlitz
2014-10-31  3:19       ` Eric Dumazet
2014-10-31 14:00         ` Or Gerlitz
2014-10-31 15:46           ` Eric Dumazet
2014-11-02 14:09             ` Or Gerlitz
2014-11-02 14:28               ` Eric Dumazet
2014-10-30 16:06 ` [PATCH net-next 6/8] net/mlx4_core: Add retrieval of CONFIG_DEV parameters Or Gerlitz
2014-10-30 16:06 ` [PATCH net-next 7/8] net: Add calaulation of non folded IPV6 pseudo header checksum Or Gerlitz
2014-10-30 16:25   ` David Laight
2014-10-30 16:32     ` Or Gerlitz
2014-10-30 16:39       ` David Laight
2014-10-30 16:54         ` Or Gerlitz
2014-10-30 16:55           ` Or Gerlitz
2014-10-30 17:09           ` David Laight
2014-10-30 16:06 ` [PATCH net-next 8/8] net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE Or Gerlitz
2014-10-30 18:22   ` Eric Dumazet
2014-10-30 21:20   ` Jerry Chu
2014-10-30 23:28     ` Or Gerlitz
2014-10-30 23:53       ` Jerry Chu
2014-10-30 23:59   ` Tom Herbert
2014-10-31 13:57     ` Or Gerlitz
2014-10-31  0:38   ` Yann Ylavic
2014-10-31 13:52     ` Or Gerlitz

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).