* [pull request][net-next 00/16] Mellanox, mlx5e updates 2018-07-18
@ 2018-07-19  1:00 Saeed Mahameed
  2018-07-19  1:00 ` [net-next 01/16] net/mlx5: FW tracer, implement tracer logic Saeed Mahameed
                   ` (16 more replies)
  0 siblings, 17 replies; 38+ messages in thread
From: Saeed Mahameed @ 2018-07-19  1:00 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Saeed Mahameed

Hi Dave,

This series includes updates for mlx5e net device driver, with a couple
of major features and some misc updates.

Please notice the mlx5-next merge patch at the beginning:
"Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux"

For more information please see tag log below.

Please pull and let me know if there's any problem.

Thanks,
Saeed.

--- 

The following changes since commit 681d5d071c8bd5533a14244c0d55d1c0e30aa989:

  Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux (2018-07-18 15:53:31 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5e-updates-2018-07-18

for you to fetch changes up to a0ba57c09676689eb35f13d48990c9674c9baad4:

  net/mlx5e: Use PARTIAL_GSO for UDP segmentation (2018-07-18 17:26:28 -0700)

----------------------------------------------------------------
mlx5e-updates-2018-07-18

This series includes updates for the mlx5e net device driver.

1) From Feras Daoud, add support for firmware log tracing,
first by introducing the firmware API needed for the task, and then
for each PF doing the following:
    1- Allocate memory for the tracer strings database and read it from the FW to the SW.
    2- Allocate and dma map tracer buffers.

    Traces written into the buffer are parsed as a group of one or
    more traces, referred to as a trace message. The trace message
    represents a C-like printf string.
Once a new trace is available, the FW generates an event indicating that
new traces are available, and the driver parses them and dumps them
using tracepoint event tracing.

Enable mlx5 fw tracing by:
echo 1 > /sys/kernel/debug/tracing/events/mlx5/mlx5_fw/enable

Read traces by:
cat /sys/kernel/debug/tracing/trace

2) From Eran Ben Elisha, support PCIe buffer congestion handling
via devlink, using the new devlink device parameters API. The
following parameters are added:
 - Congestion action
            A HW mechanism in the PCIe buffer which monitors the amount of
            consumed PCIe buffer per host. This mechanism supports the
            following actions in case of threshold overflow:
            - Disabled - NOP (default)
            - Drop
            - Mark - mark the CE bit in the CQE of the received packet
 - Congestion mode
            - Aggressive - aggressive static trigger threshold (default)
            - Dynamic - dynamically change the trigger threshold
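
The action semantics above can be sketched as a toy model (illustrative
only, not the mlx5 implementation; the function and constant names here
are made up for the example):

```python
# Toy model of the PCIe buffer congestion actions described above:
# when the monitored per-host PCIe buffer consumption crosses the
# trigger threshold, the device either does nothing, drops the packet,
# or marks the CE bit in the packet's CQE.

DISABLED, DROP, MARK = "disabled", "drop", "mark"  # congestion actions

def handle_packet(consumed_bytes, threshold, action):
    """Return the fate of a received packet for a given congestion action."""
    if consumed_bytes <= threshold or action == DISABLED:
        return "deliver"            # no congestion, or action disabled (NOP)
    if action == DROP:
        return "drop"               # packet dropped at the PCIe buffer
    if action == MARK:
        return "deliver+ce"         # delivered, CE bit set in the CQE
    raise ValueError(action)

assert handle_packet(900, 1000, MARK) == "deliver"
assert handle_packet(1100, 1000, DROP) == "drop"
assert handle_packet(1100, 1000, MARK) == "deliver+ce"
```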

3) From Natali Shechtman, set ECN for received packets using the CQE
indication. Using Eran's congestion settings, a user can enable ECN
marking; in that case the driver must update the ECN CE bits in the IP
header when requested by the firmware (i.e. when congestion is sensed).
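
The CE update itself follows the standard ECN codepoint rules; a
simplified model of the kernel's INET_ECN_set_ce logic (shown here as
an illustrative sketch, not the driver's code path):

```python
# ECN lives in the two low bits of the IPv4 TOS byte (or the IPv6
# traffic class). Only ECN-capable packets (ECT(0)=0b10 or ECT(1)=0b01)
# may be re-marked to CE (0b11) when congestion is sensed.

INET_ECN_NOT_ECT, INET_ECN_ECT_1, INET_ECN_ECT_0, INET_ECN_CE = 0, 1, 2, 3

def set_ce(tos):
    """Return the TOS byte with ECN set to CE, or None if the packet
    is not ECN-capable and therefore must not be marked."""
    if tos & 0x3 == INET_ECN_NOT_ECT:
        return None                    # not ECN capable
    return tos | INET_ECN_CE           # set both ECN bits -> CE

assert set_ce(0x02) == 0x03   # ECT(0) -> CE
assert set_ce(0x00) is None   # Not-ECT is left alone
```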

4) From Roi Dayan, remove a redundant WARN when we cannot find a neigh entry.

5) From Jianbo Liu, TC double vlan support
- Support offloading tc double vlan headers match
- Support offloading double vlan push/pop tc actions

6) From Boris Pismenny, revisit UDP GSO: remove the splitting of UDP_GSO_L4
packets in the driver, and expose UDP_GSO_L4 as a PARTIAL_GSO feature.

----------------------------------------------------------------
Boris Pismenny (1):
      net/mlx5e: Use PARTIAL_GSO for UDP segmentation

Eran Ben Elisha (3):
      net/mlx5: Move all devlink related functions calls to devlink.c
      net/mlx5: Add MPEGC register configuration functionality
      net/mlx5: Support PCIe buffer congestion handling via Devlink

Feras Daoud (5):
      net/mlx5: FW tracer, implement tracer logic
      net/mlx5: FW tracer, create trace buffer and copy strings database
      net/mlx5: FW tracer, events handling
      net/mlx5: FW tracer, parse traces and kernel tracing support
      net/mlx5: FW tracer, Enable tracing

Jianbo Liu (3):
      net/mlx5e: Support offloading tc double vlan headers match
      net/mlx5e: Refactor tc vlan push/pop actions offloading
      net/mlx5e: Support offloading double vlan push/pop tc actions

Natali Shechtman (1):
      net/mlx5e: Set ECN for received packets using CQE indication

Roi Dayan (1):
      net/mlx5e: Remove redundant WARN when we cannot find neigh entry

Saeed Mahameed (2):
      net/mlx5: FW tracer, register log buffer memory key
      net/mlx5: FW tracer, Add debug prints

 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   6 +-
 drivers/net/ethernet/mellanox/mlx5/core/devlink.c  | 267 ++++++
 drivers/net/ethernet/mellanox/mlx5/core/devlink.h  |  41 +
 .../ethernet/mellanox/mlx5/core/diag/fw_tracer.c   | 947 +++++++++++++++++++++
 .../ethernet/mellanox/mlx5/core/diag/fw_tracer.h   | 175 ++++
 .../mellanox/mlx5/core/diag/fw_tracer_tracepoint.h |  78 ++
 .../mellanox/mlx5/core/en_accel/en_accel.h         |  27 +-
 .../ethernet/mellanox/mlx5/core/en_accel/rxtx.c    | 109 ---
 .../ethernet/mellanox/mlx5/core/en_accel/rxtx.h    |  14 -
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   9 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    |  35 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.c |   3 +
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.h |   2 +
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c    | 134 ++-
 drivers/net/ethernet/mellanox/mlx5/core/eq.c       |  11 +
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.h  |  21 +-
 .../ethernet/mellanox/mlx5/core/eswitch_offloads.c |  23 +-
 drivers/net/ethernet/mellanox/mlx5/core/main.c     |  23 +-
 include/linux/mlx5/device.h                        |   7 +
 include/linux/mlx5/driver.h                        |   3 +
 20 files changed, 1745 insertions(+), 190 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/devlink.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/devlink.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer_tracepoint.h
 delete mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/rxtx.c
 delete mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/rxtx.h

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [net-next 01/16] net/mlx5: FW tracer, implement tracer logic
  2018-07-19  1:00 [pull request][net-next 00/16] Mellanox, mlx5e updates 2018-07-18 Saeed Mahameed
@ 2018-07-19  1:00 ` Saeed Mahameed
  2018-07-19  1:00 ` [net-next 02/16] net/mlx5: FW tracer, create trace buffer and copy strings database Saeed Mahameed
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Saeed Mahameed @ 2018-07-19  1:00 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Feras Daoud, Saeed Mahameed

From: Feras Daoud <ferasda@mellanox.com>

Implement FW tracer logic and registers access, initialization and
cleanup flows.

Initializing the tracer will be part of the load_one flow, as multiple
PFs will try to acquire ownership but only one will succeed and will
be the tracer owner.
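
The arbitration can be modeled as a toy sketch (illustrative only; the
real handshake goes through the MTRC_CAP access register, with the
device arbitrating, not host-side locking):

```python
# Several PFs race to set trace_owner; the device grants ownership to
# exactly one, and each PF learns the outcome by reading trace_owner
# back, analogous to mlx5_fw_tracer_ownership_acquire.
import threading

class Device:
    def __init__(self):
        self._lock = threading.Lock()
        self.owner = None

    def try_acquire(self, pf):
        with self._lock:             # the device arbitrates atomically
            if self.owner is None:
                self.owner = pf
            return self.owner == pf  # PF reads back who owns the tracer

dev = Device()
results = {}
threads = [threading.Thread(target=lambda p=p: results.__setitem__(p, dev.try_acquire(p)))
           for p in range(4)]
for t in threads: t.start()
for t in threads: t.join()
assert sum(results.values()) == 1    # exactly one PF became tracer owner
```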

Signed-off-by: Feras Daoud <ferasda@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 .../mellanox/mlx5/core/diag/fw_tracer.c       | 196 ++++++++++++++++++
 .../mellanox/mlx5/core/diag/fw_tracer.h       |  66 ++++++
 include/linux/mlx5/driver.h                   |   3 +
 3 files changed, 265 insertions(+)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
new file mode 100644
index 000000000000..3ecbf06b4d71
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
@@ -0,0 +1,196 @@
+/*
+ * Copyright (c) 2018, Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include "fw_tracer.h"
+
+static int mlx5_query_mtrc_caps(struct mlx5_fw_tracer *tracer)
+{
+	u32 *string_db_base_address_out = tracer->str_db.base_address_out;
+	u32 *string_db_size_out = tracer->str_db.size_out;
+	struct mlx5_core_dev *dev = tracer->dev;
+	u32 out[MLX5_ST_SZ_DW(mtrc_cap)] = {0};
+	u32 in[MLX5_ST_SZ_DW(mtrc_cap)] = {0};
+	void *mtrc_cap_sp;
+	int err, i;
+
+	err = mlx5_core_access_reg(dev, in, sizeof(in), out, sizeof(out),
+				   MLX5_REG_MTRC_CAP, 0, 0);
+	if (err) {
+		mlx5_core_warn(dev, "FWTracer: Error reading tracer caps %d\n",
+			       err);
+		return err;
+	}
+
+	if (!MLX5_GET(mtrc_cap, out, trace_to_memory)) {
+		mlx5_core_dbg(dev, "FWTracer: Device does not support logging traces to memory\n");
+		return -ENOTSUPP;
+	}
+
+	tracer->trc_ver = MLX5_GET(mtrc_cap, out, trc_ver);
+	tracer->str_db.first_string_trace =
+			MLX5_GET(mtrc_cap, out, first_string_trace);
+	tracer->str_db.num_string_trace =
+			MLX5_GET(mtrc_cap, out, num_string_trace);
+	tracer->str_db.num_string_db = MLX5_GET(mtrc_cap, out, num_string_db);
+	tracer->owner = !!MLX5_GET(mtrc_cap, out, trace_owner);
+
+	for (i = 0; i < tracer->str_db.num_string_db; i++) {
+		mtrc_cap_sp = MLX5_ADDR_OF(mtrc_cap, out, string_db_param[i]);
+		string_db_base_address_out[i] = MLX5_GET(mtrc_string_db_param,
+							 mtrc_cap_sp,
+							 string_db_base_address);
+		string_db_size_out[i] = MLX5_GET(mtrc_string_db_param,
+						 mtrc_cap_sp, string_db_size);
+	}
+
+	return err;
+}
+
+static int mlx5_set_mtrc_caps_trace_owner(struct mlx5_fw_tracer *tracer,
+					  u32 *out, u32 out_size,
+					  u8 trace_owner)
+{
+	struct mlx5_core_dev *dev = tracer->dev;
+	u32 in[MLX5_ST_SZ_DW(mtrc_cap)] = {0};
+
+	MLX5_SET(mtrc_cap, in, trace_owner, trace_owner);
+
+	return mlx5_core_access_reg(dev, in, sizeof(in), out, out_size,
+				    MLX5_REG_MTRC_CAP, 0, 1);
+}
+
+static int mlx5_fw_tracer_ownership_acquire(struct mlx5_fw_tracer *tracer)
+{
+	struct mlx5_core_dev *dev = tracer->dev;
+	u32 out[MLX5_ST_SZ_DW(mtrc_cap)] = {0};
+	int err;
+
+	err = mlx5_set_mtrc_caps_trace_owner(tracer, out, sizeof(out),
+					     MLX5_FW_TRACER_ACQUIRE_OWNERSHIP);
+	if (err) {
+		mlx5_core_warn(dev, "FWTracer: Acquire tracer ownership failed %d\n",
+			       err);
+		return err;
+	}
+
+	tracer->owner = !!MLX5_GET(mtrc_cap, out, trace_owner);
+
+	if (!tracer->owner)
+		return -EBUSY;
+
+	return 0;
+}
+
+static void mlx5_fw_tracer_ownership_release(struct mlx5_fw_tracer *tracer)
+{
+	u32 out[MLX5_ST_SZ_DW(mtrc_cap)] = {0};
+
+	mlx5_set_mtrc_caps_trace_owner(tracer, out, sizeof(out),
+				       MLX5_FW_TRACER_RELEASE_OWNERSHIP);
+	tracer->owner = false;
+}
+
+static void mlx5_fw_tracer_ownership_change(struct work_struct *work)
+{
+	struct mlx5_fw_tracer *tracer = container_of(work, struct mlx5_fw_tracer,
+						     ownership_change_work);
+	struct mlx5_core_dev *dev = tracer->dev;
+	int err;
+
+	if (tracer->owner) {
+		mlx5_fw_tracer_ownership_release(tracer);
+		return;
+	}
+
+	err = mlx5_fw_tracer_ownership_acquire(tracer);
+	if (err) {
+		mlx5_core_dbg(dev, "FWTracer: Ownership was not granted %d\n", err);
+		return;
+	}
+}
+
+struct mlx5_fw_tracer *mlx5_fw_tracer_create(struct mlx5_core_dev *dev)
+{
+	struct mlx5_fw_tracer *tracer = NULL;
+	int err;
+
+	if (!MLX5_CAP_MCAM_REG(dev, tracer_registers)) {
+		mlx5_core_dbg(dev, "FWTracer: Tracer capability not present\n");
+		return NULL;
+	}
+
+	tracer = kzalloc(sizeof(*tracer), GFP_KERNEL);
+	if (!tracer)
+		return ERR_PTR(-ENOMEM);
+
+	tracer->work_queue = create_singlethread_workqueue("mlx5_fw_tracer");
+	if (!tracer->work_queue) {
+		err = -ENOMEM;
+		goto free_tracer;
+	}
+
+	tracer->dev = dev;
+
+	INIT_WORK(&tracer->ownership_change_work, mlx5_fw_tracer_ownership_change);
+
+	err = mlx5_query_mtrc_caps(tracer);
+	if (err) {
+		mlx5_core_dbg(dev, "FWTracer: Failed to query capabilities %d\n", err);
+		goto destroy_workqueue;
+	}
+
+	mlx5_fw_tracer_ownership_change(&tracer->ownership_change_work);
+
+	return tracer;
+
+destroy_workqueue:
+	tracer->dev = NULL;
+	destroy_workqueue(tracer->work_queue);
+free_tracer:
+	kfree(tracer);
+	return ERR_PTR(err);
+}
+
+void mlx5_fw_tracer_destroy(struct mlx5_fw_tracer *tracer)
+{
+	if (!tracer)
+		return;
+
+	cancel_work_sync(&tracer->ownership_change_work);
+
+	if (tracer->owner)
+		mlx5_fw_tracer_ownership_release(tracer);
+
+	flush_workqueue(tracer->work_queue);
+	destroy_workqueue(tracer->work_queue);
+	kfree(tracer);
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.h b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.h
new file mode 100644
index 000000000000..721c41a5e827
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.h
@@ -0,0 +1,66 @@
+/*
+ * Copyright (c) 2018, Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef __LIB_TRACER_H__
+#define __LIB_TRACER_H__
+
+#include <linux/mlx5/driver.h>
+#include "mlx5_core.h"
+
+#define STRINGS_DB_SECTIONS_NUM 8
+
+struct mlx5_fw_tracer {
+	struct mlx5_core_dev *dev;
+	bool owner;
+	u8   trc_ver;
+	struct workqueue_struct *work_queue;
+	struct work_struct ownership_change_work;
+
+	/* Strings DB */
+	struct {
+		u8 first_string_trace;
+		u8 num_string_trace;
+		u32 num_string_db;
+		u32 base_address_out[STRINGS_DB_SECTIONS_NUM];
+		u32 size_out[STRINGS_DB_SECTIONS_NUM];
+	} str_db;
+};
+
+enum mlx5_fw_tracer_ownership_state {
+	MLX5_FW_TRACER_RELEASE_OWNERSHIP,
+	MLX5_FW_TRACER_ACQUIRE_OWNERSHIP,
+};
+
+struct mlx5_fw_tracer *mlx5_fw_tracer_create(struct mlx5_core_dev *dev);
+void mlx5_fw_tracer_destroy(struct mlx5_fw_tracer *tracer);
+
+#endif
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 957199c20a0f..86cb0ebf92fa 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -816,6 +816,8 @@ struct mlx5_clock {
 	struct mlx5_pps            pps_info;
 };
 
+struct mlx5_fw_tracer;
+
 struct mlx5_core_dev {
 	struct pci_dev	       *pdev;
 	/* sync pci state */
@@ -860,6 +862,7 @@ struct mlx5_core_dev {
 	struct mlx5_clock        clock;
 	struct mlx5_ib_clock_info  *clock_info;
 	struct page             *clock_info_page;
+	struct mlx5_fw_tracer   *tracer;
 };
 
 struct mlx5_db {
-- 
2.17.0


* [net-next 02/16] net/mlx5: FW tracer, create trace buffer and copy strings database
  2018-07-19  1:00 [pull request][net-next 00/16] Mellanox, mlx5e updates 2018-07-18 Saeed Mahameed
  2018-07-19  1:00 ` [net-next 01/16] net/mlx5: FW tracer, implement tracer logic Saeed Mahameed
@ 2018-07-19  1:00 ` Saeed Mahameed
  2018-07-19  1:00 ` [net-next 03/16] net/mlx5: FW tracer, register log buffer memory key Saeed Mahameed
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Saeed Mahameed @ 2018-07-19  1:00 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Feras Daoud, Saeed Mahameed

From: Feras Daoud <ferasda@mellanox.com>

For each PF do the following:
1- Allocate memory for the tracer strings database and read the
strings from the FW to the SW. These strings will be used later for
parsing traces.
2- Allocate and dma map tracer buffers.

Traces that will be written into the buffer will be parsed as a group
of one or more traces, referred to as trace message. The trace message
represents a C-like printf string.
First trace of a message holds the pointer to the correct string in
the strings database. The following traces hold the variables of the
message.
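
A minimal sketch of that message layout (the strings-database offset
and the printf format below are made up for illustration; the real FW
layout differs):

```python
# First event of a message carries an offset into the strings database
# (a C-like printf format string); the following events carry the
# variables to substitute into it.

strings_db = {0x40: "cq 0x%x: moderation set to %d usec"}  # hypothetical entry

def parse_message(events):
    """events[0] is the string-db offset, events[1:] are argument values."""
    fmt = strings_db[events[0]]
    return fmt % tuple(events[1:])

msg = parse_message([0x40, 0x1a, 8])
assert msg == "cq 0x1a: moderation set to 8 usec"
```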

Signed-off-by: Feras Daoud <ferasda@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 .../mellanox/mlx5/core/diag/fw_tracer.c       | 209 +++++++++++++++++-
 .../mellanox/mlx5/core/diag/fw_tracer.h       |  18 ++
 2 files changed, 224 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
index 3ecbf06b4d71..35107b8f76df 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
@@ -119,6 +119,163 @@ static void mlx5_fw_tracer_ownership_release(struct mlx5_fw_tracer *tracer)
 	tracer->owner = false;
 }
 
+static int mlx5_fw_tracer_create_log_buf(struct mlx5_fw_tracer *tracer)
+{
+	struct mlx5_core_dev *dev = tracer->dev;
+	struct device *ddev = &dev->pdev->dev;
+	dma_addr_t dma;
+	void *buff;
+	gfp_t gfp;
+	int err;
+
+	tracer->buff.size = TRACE_BUFFER_SIZE_BYTE;
+
+	gfp = GFP_KERNEL | __GFP_ZERO;
+	buff = (void *)__get_free_pages(gfp,
+					get_order(tracer->buff.size));
+	if (!buff) {
+		err = -ENOMEM;
+		mlx5_core_warn(dev, "FWTracer: Failed to allocate pages, %d\n", err);
+		return err;
+	}
+	tracer->buff.log_buf = buff;
+
+	dma = dma_map_single(ddev, buff, tracer->buff.size, DMA_FROM_DEVICE);
+	if (dma_mapping_error(ddev, dma)) {
+		mlx5_core_warn(dev, "FWTracer: Unable to map DMA: %d\n",
+			       dma_mapping_error(ddev, dma));
+		err = -ENOMEM;
+		goto free_pages;
+	}
+	tracer->buff.dma = dma;
+
+	return 0;
+
+free_pages:
+	free_pages((unsigned long)tracer->buff.log_buf, get_order(tracer->buff.size));
+
+	return err;
+}
+
+static void mlx5_fw_tracer_destroy_log_buf(struct mlx5_fw_tracer *tracer)
+{
+	struct mlx5_core_dev *dev = tracer->dev;
+	struct device *ddev = &dev->pdev->dev;
+
+	if (!tracer->buff.log_buf)
+		return;
+
+	dma_unmap_single(ddev, tracer->buff.dma, tracer->buff.size, DMA_FROM_DEVICE);
+	free_pages((unsigned long)tracer->buff.log_buf, get_order(tracer->buff.size));
+}
+
+static void mlx5_fw_tracer_free_strings_db(struct mlx5_fw_tracer *tracer)
+{
+	u32 num_string_db = tracer->str_db.num_string_db;
+	int i;
+
+	for (i = 0; i < num_string_db; i++) {
+		kfree(tracer->str_db.buffer[i]);
+		tracer->str_db.buffer[i] = NULL;
+	}
+}
+
+static int mlx5_fw_tracer_allocate_strings_db(struct mlx5_fw_tracer *tracer)
+{
+	u32 *string_db_size_out = tracer->str_db.size_out;
+	u32 num_string_db = tracer->str_db.num_string_db;
+	int i;
+
+	for (i = 0; i < num_string_db; i++) {
+		tracer->str_db.buffer[i] = kzalloc(string_db_size_out[i], GFP_KERNEL);
+		if (!tracer->str_db.buffer[i])
+			goto free_strings_db;
+	}
+
+	return 0;
+
+free_strings_db:
+	mlx5_fw_tracer_free_strings_db(tracer);
+	return -ENOMEM;
+}
+
+static void mlx5_tracer_read_strings_db(struct work_struct *work)
+{
+	struct mlx5_fw_tracer *tracer = container_of(work, struct mlx5_fw_tracer,
+						     read_fw_strings_work);
+	u32 num_of_reads, num_string_db = tracer->str_db.num_string_db;
+	struct mlx5_core_dev *dev = tracer->dev;
+	u32 in[MLX5_ST_SZ_DW(mtrc_cap)] = {0};
+	u32 leftovers, offset;
+	int err = 0, i, j;
+	u32 *out, outlen;
+	void *out_value;
+
+	outlen = MLX5_ST_SZ_BYTES(mtrc_stdb) + STRINGS_DB_READ_SIZE_BYTES;
+	out = kzalloc(outlen, GFP_KERNEL);
+	if (!out) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	for (i = 0; i < num_string_db; i++) {
+		offset = 0;
+		MLX5_SET(mtrc_stdb, in, string_db_index, i);
+		num_of_reads = tracer->str_db.size_out[i] /
+				STRINGS_DB_READ_SIZE_BYTES;
+		leftovers = (tracer->str_db.size_out[i] %
+				STRINGS_DB_READ_SIZE_BYTES) /
+					STRINGS_DB_LEFTOVER_SIZE_BYTES;
+
+		MLX5_SET(mtrc_stdb, in, read_size, STRINGS_DB_READ_SIZE_BYTES);
+		for (j = 0; j < num_of_reads; j++) {
+			MLX5_SET(mtrc_stdb, in, start_offset, offset);
+
+			err = mlx5_core_access_reg(dev, in, sizeof(in), out,
+						   outlen, MLX5_REG_MTRC_STDB,
+						   0, 1);
+			if (err) {
+				mlx5_core_dbg(dev, "FWTracer: Failed to read strings DB %d\n",
+					      err);
+				goto out_free;
+			}
+
+			out_value = MLX5_ADDR_OF(mtrc_stdb, out, string_db_data);
+			memcpy(tracer->str_db.buffer[i] + offset, out_value,
+			       STRINGS_DB_READ_SIZE_BYTES);
+			offset += STRINGS_DB_READ_SIZE_BYTES;
+		}
+
+		/* Strings database is aligned to 64, need to read leftovers*/
+		MLX5_SET(mtrc_stdb, in, read_size,
+			 STRINGS_DB_LEFTOVER_SIZE_BYTES);
+		for (j = 0; j < leftovers; j++) {
+			MLX5_SET(mtrc_stdb, in, start_offset, offset);
+
+			err = mlx5_core_access_reg(dev, in, sizeof(in), out,
+						   outlen, MLX5_REG_MTRC_STDB,
+						   0, 1);
+			if (err) {
+				mlx5_core_dbg(dev, "FWTracer: Failed to read strings DB %d\n",
+					      err);
+				goto out_free;
+			}
+
+			out_value = MLX5_ADDR_OF(mtrc_stdb, out, string_db_data);
+			memcpy(tracer->str_db.buffer[i] + offset, out_value,
+			       STRINGS_DB_LEFTOVER_SIZE_BYTES);
+			offset += STRINGS_DB_LEFTOVER_SIZE_BYTES;
+		}
+	}
+
+	tracer->str_db.loaded = true;
+
+out_free:
+	kfree(out);
+out:
+	return;
+}
+
 static void mlx5_fw_tracer_ownership_change(struct work_struct *work)
 {
 	struct mlx5_fw_tracer *tracer = container_of(work, struct mlx5_fw_tracer,
@@ -161,6 +318,7 @@ struct mlx5_fw_tracer *mlx5_fw_tracer_create(struct mlx5_core_dev *dev)
 	tracer->dev = dev;
 
 	INIT_WORK(&tracer->ownership_change_work, mlx5_fw_tracer_ownership_change);
+	INIT_WORK(&tracer->read_fw_strings_work, mlx5_tracer_read_strings_db);
 
 	err = mlx5_query_mtrc_caps(tracer);
 	if (err) {
@@ -168,10 +326,22 @@ struct mlx5_fw_tracer *mlx5_fw_tracer_create(struct mlx5_core_dev *dev)
 		goto destroy_workqueue;
 	}
 
-	mlx5_fw_tracer_ownership_change(&tracer->ownership_change_work);
+	err = mlx5_fw_tracer_create_log_buf(tracer);
+	if (err) {
+		mlx5_core_warn(dev, "FWTracer: Create log buffer failed %d\n", err);
+		goto destroy_workqueue;
+	}
+
+	err = mlx5_fw_tracer_allocate_strings_db(tracer);
+	if (err) {
+		mlx5_core_warn(dev, "FWTracer: Allocate strings database failed %d\n", err);
+		goto free_log_buf;
+	}
 
 	return tracer;
 
+free_log_buf:
+	mlx5_fw_tracer_destroy_log_buf(tracer);
 destroy_workqueue:
 	tracer->dev = NULL;
 	destroy_workqueue(tracer->work_queue);
@@ -180,17 +350,50 @@ struct mlx5_fw_tracer *mlx5_fw_tracer_create(struct mlx5_core_dev *dev)
 	return ERR_PTR(err);
 }
 
-void mlx5_fw_tracer_destroy(struct mlx5_fw_tracer *tracer)
+int mlx5_fw_tracer_init(struct mlx5_fw_tracer *tracer)
 {
-	if (!tracer)
+	struct mlx5_core_dev *dev;
+	int err;
+
+	if (IS_ERR_OR_NULL(tracer))
+		return 0;
+
+	dev = tracer->dev;
+
+	if (!tracer->str_db.loaded)
+		queue_work(tracer->work_queue, &tracer->read_fw_strings_work);
+
+	err = mlx5_fw_tracer_ownership_acquire(tracer);
+	if (err) {
+		mlx5_core_dbg(dev, "FWTracer: Ownership was not granted %d\n", err);
+		return 0; /* return 0 since ownership can be acquired on a later FW event */
+	}
+
+	return 0;
+}
+
+void mlx5_fw_tracer_cleanup(struct mlx5_fw_tracer *tracer)
+{
+	if (IS_ERR_OR_NULL(tracer))
 		return;
 
 	cancel_work_sync(&tracer->ownership_change_work);
 
 	if (tracer->owner)
 		mlx5_fw_tracer_ownership_release(tracer);
+}
 
+void mlx5_fw_tracer_destroy(struct mlx5_fw_tracer *tracer)
+{
+	if (IS_ERR_OR_NULL(tracer))
+		return;
+
+	cancel_work_sync(&tracer->read_fw_strings_work);
+	mlx5_fw_tracer_free_strings_db(tracer);
+	mlx5_fw_tracer_destroy_log_buf(tracer);
 	flush_workqueue(tracer->work_queue);
 	destroy_workqueue(tracer->work_queue);
 	kfree(tracer);
 }
+
+EXPORT_TRACEPOINT_SYMBOL(mlx5_fw);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.h b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.h
index 721c41a5e827..66cb7e7ada28 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.h
@@ -37,6 +37,11 @@
 #include "mlx5_core.h"
 
 #define STRINGS_DB_SECTIONS_NUM 8
+#define STRINGS_DB_READ_SIZE_BYTES 256
+#define STRINGS_DB_LEFTOVER_SIZE_BYTES 64
+#define TRACER_BUFFER_PAGE_NUM 64
+#define TRACER_BUFFER_CHUNK 4096
+#define TRACE_BUFFER_SIZE_BYTE (TRACER_BUFFER_PAGE_NUM * TRACER_BUFFER_CHUNK)
 
 struct mlx5_fw_tracer {
 	struct mlx5_core_dev *dev;
@@ -44,6 +49,7 @@ struct mlx5_fw_tracer {
 	u8   trc_ver;
 	struct workqueue_struct *work_queue;
 	struct work_struct ownership_change_work;
+	struct work_struct read_fw_strings_work;
 
 	/* Strings DB */
 	struct {
@@ -52,7 +58,19 @@ struct mlx5_fw_tracer {
 		u32 num_string_db;
 		u32 base_address_out[STRINGS_DB_SECTIONS_NUM];
 		u32 size_out[STRINGS_DB_SECTIONS_NUM];
+		void *buffer[STRINGS_DB_SECTIONS_NUM];
+		bool loaded;
 	} str_db;
+
+	/* Log Buffer */
+	struct {
+		u32 pdn;
+		void *log_buf;
+		dma_addr_t dma;
+		u32 size;
+		struct mlx5_core_mkey mkey;
+
+	} buff;
 };
 
 enum mlx5_fw_tracer_ownership_state {
-- 
2.17.0


* [net-next 03/16] net/mlx5: FW tracer, register log buffer memory key
  2018-07-19  1:00 [pull request][net-next 00/16] Mellanox, mlx5e updates 2018-07-18 Saeed Mahameed
  2018-07-19  1:00 ` [net-next 01/16] net/mlx5: FW tracer, implement tracer logic Saeed Mahameed
  2018-07-19  1:00 ` [net-next 02/16] net/mlx5: FW tracer, create trace buffer and copy strings database Saeed Mahameed
@ 2018-07-19  1:00 ` Saeed Mahameed
  2018-07-19  1:00 ` [net-next 04/16] net/mlx5: FW tracer, events handling Saeed Mahameed
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Saeed Mahameed @ 2018-07-19  1:00 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Saeed Mahameed

Create a memory key and protection domain for the tracer log buffer.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 .../mellanox/mlx5/core/diag/fw_tracer.c       | 64 ++++++++++++++++++-
 1 file changed, 61 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
index 35107b8f76df..d6cc27b0ff34 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
@@ -169,6 +169,48 @@ static void mlx5_fw_tracer_destroy_log_buf(struct mlx5_fw_tracer *tracer)
 	free_pages((unsigned long)tracer->buff.log_buf, get_order(tracer->buff.size));
 }
 
+static int mlx5_fw_tracer_create_mkey(struct mlx5_fw_tracer *tracer)
+{
+	struct mlx5_core_dev *dev = tracer->dev;
+	int err, inlen, i;
+	__be64 *mtt;
+	void *mkc;
+	u32 *in;
+
+	inlen = MLX5_ST_SZ_BYTES(create_mkey_in) +
+			sizeof(*mtt) * round_up(TRACER_BUFFER_PAGE_NUM, 2);
+
+	in = kvzalloc(inlen, GFP_KERNEL);
+	if (!in)
+		return -ENOMEM;
+
+	MLX5_SET(create_mkey_in, in, translations_octword_actual_size,
+		 DIV_ROUND_UP(TRACER_BUFFER_PAGE_NUM, 2));
+	mtt = (u64 *)MLX5_ADDR_OF(create_mkey_in, in, klm_pas_mtt);
+	for (i = 0 ; i < TRACER_BUFFER_PAGE_NUM ; i++)
+		mtt[i] = cpu_to_be64(tracer->buff.dma + i * PAGE_SIZE);
+
+	mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
+	MLX5_SET(mkc, mkc, access_mode_1_0, MLX5_MKC_ACCESS_MODE_MTT);
+	MLX5_SET(mkc, mkc, lr, 1);
+	MLX5_SET(mkc, mkc, lw, 1);
+	MLX5_SET(mkc, mkc, pd, tracer->buff.pdn);
+	MLX5_SET(mkc, mkc, bsf_octword_size, 0);
+	MLX5_SET(mkc, mkc, qpn, 0xffffff);
+	MLX5_SET(mkc, mkc, log_page_size, PAGE_SHIFT);
+	MLX5_SET(mkc, mkc, translations_octword_size,
+		 DIV_ROUND_UP(TRACER_BUFFER_PAGE_NUM, 2));
+	MLX5_SET64(mkc, mkc, start_addr, tracer->buff.dma);
+	MLX5_SET64(mkc, mkc, len, tracer->buff.size);
+	err = mlx5_core_create_mkey(dev, &tracer->buff.mkey, in, inlen);
+	if (err)
+		mlx5_core_warn(dev, "FWTracer: Failed to create mkey, %d\n", err);
+
+	kvfree(in);
+
+	return err;
+}
+
 static void mlx5_fw_tracer_free_strings_db(struct mlx5_fw_tracer *tracer)
 {
 	u32 num_string_db = tracer->str_db.num_string_db;
@@ -363,13 +405,26 @@ int mlx5_fw_tracer_init(struct mlx5_fw_tracer *tracer)
 	if (!tracer->str_db.loaded)
 		queue_work(tracer->work_queue, &tracer->read_fw_strings_work);
 
-	err = mlx5_fw_tracer_ownership_acquire(tracer);
+	err = mlx5_core_alloc_pd(dev, &tracer->buff.pdn);
 	if (err) {
-		mlx5_core_dbg(dev, "FWTracer: Ownership was not granted %d\n", err);
-		return 0; /* return 0 since ownership can be acquired on a later FW event */
+		mlx5_core_warn(dev, "FWTracer: Failed to allocate PD %d\n", err);
+		return err;
 	}
 
+	err = mlx5_fw_tracer_create_mkey(tracer);
+	if (err) {
+		mlx5_core_warn(dev, "FWTracer: Failed to create mkey %d\n", err);
+		goto err_dealloc_pd;
+	}
+
+	err = mlx5_fw_tracer_ownership_acquire(tracer);
+	if (err) /* Don't fail since ownership can be acquired on a later FW event */
+		mlx5_core_dbg(dev, "FWTracer: Ownership was not granted %d\n", err);
+
 	return 0;
+err_dealloc_pd:
+	mlx5_core_dealloc_pd(dev, tracer->buff.pdn);
+	return err;
 }
 
 void mlx5_fw_tracer_cleanup(struct mlx5_fw_tracer *tracer)
@@ -381,6 +436,9 @@ void mlx5_fw_tracer_cleanup(struct mlx5_fw_tracer *tracer)
 
 	if (tracer->owner)
 		mlx5_fw_tracer_ownership_release(tracer);
+
+	mlx5_core_destroy_mkey(tracer->dev, &tracer->buff.mkey);
+	mlx5_core_dealloc_pd(tracer->dev, tracer->buff.pdn);
 }
 
 void mlx5_fw_tracer_destroy(struct mlx5_fw_tracer *tracer)
-- 
2.17.0


* [net-next 04/16] net/mlx5: FW tracer, events handling
  2018-07-19  1:00 [pull request][net-next 00/16] Mellanox, mlx5e updates 2018-07-18 Saeed Mahameed
                   ` (2 preceding siblings ...)
  2018-07-19  1:00 ` [net-next 03/16] net/mlx5: FW tracer, register log buffer memory key Saeed Mahameed
@ 2018-07-19  1:00 ` Saeed Mahameed
  2018-07-19  1:00 ` [net-next 05/16] net/mlx5: FW tracer, parse traces and kernel tracing support Saeed Mahameed
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Saeed Mahameed @ 2018-07-19  1:00 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Feras Daoud, Saeed Mahameed

From: Feras Daoud <ferasda@mellanox.com>

The tracer has one event, event 0x26, with two subtypes:
- Subtype 0: Ownership change
- Subtype 1: Traces available

An ownership change occurs in the following cases:
1- The owner releases its ownership; an event is then sent to inform
other tools that they may reattempt to acquire ownership.
2- Ownership was taken by a higher priority tool; in this case the
current owner must recognize that it lost ownership and go through
the teardown flow.

The second subtype indicates that there are traces in the trace buffer.
In this case, the driver polls the tracer buffer for new traces, parses
them and prepares the messages for printing.

The HW starts tracing from the first address in the tracer buffer.
The driver receives an event notifying it that a new trace block exists.
The HW posts a timestamp event in the last 8B of every 256B block.
Comparing that timestamp to the last handled timestamp indicates
whether this is a new trace block. Once the new timestamp is detected,
the entire block is considered valid.

Block validation and parsing should be done after copying the current
block to a different location, to avoid the block being overwritten
during processing.
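The block-detection scheme described above can be sketched as a small
standalone C program. This is a hypothetical miniature, not driver code:
the block count, struct and function names are illustrative, and only the
timestamp-comparison and index-wraparound logic mirror the patch.

```c
#include <stdint.h>

/* Each block ends with a timestamp; a block is considered new (and
 * valid) when its trailing timestamp is greater than the last
 * timestamp the reader handled.  The consumer index wraps with a
 * power-of-two mask, as in the patch. */
#define BLOCK_COUNT 4 /* must be a power of two for the index mask */

struct trace_ring {
	uint64_t end_ts[BLOCK_COUNT]; /* timestamp in the last 8B of each block */
	uint32_t consumer;            /* next block to read */
	uint64_t last_ts;             /* last timestamp handled */
};

/* Consume blocks until one repeats an already-handled timestamp;
 * return the number of fresh blocks processed. */
static int consume_new_blocks(struct trace_ring *r)
{
	int consumed = 0;

	while (r->end_ts[r->consumer] > r->last_ts) {
		/* a real driver copies the block out and parses it here */
		r->last_ts = r->end_ts[r->consumer];
		r->consumer = (r->consumer + 1) & (BLOCK_COUNT - 1);
		consumed++;
	}
	return consumed;
}
```

For example, with trailing timestamps {10, 20, 30, 0} the reader consumes
three blocks and stops at the fourth, whose stale timestamp shows the HW
has not written it yet.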

Signed-off-by: Feras Daoud <ferasda@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 .../mellanox/mlx5/core/diag/fw_tracer.c       | 268 +++++++++++++++++-
 .../mellanox/mlx5/core/diag/fw_tracer.h       |  71 ++++-
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  |  11 +
 include/linux/mlx5/device.h                   |   7 +
 4 files changed, 347 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
index d6cc27b0ff34..bd887d1d3396 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
@@ -318,25 +318,244 @@ static void mlx5_tracer_read_strings_db(struct work_struct *work)
 	return;
 }
 
-static void mlx5_fw_tracer_ownership_change(struct work_struct *work)
+static void mlx5_fw_tracer_arm(struct mlx5_core_dev *dev)
 {
-	struct mlx5_fw_tracer *tracer = container_of(work, struct mlx5_fw_tracer,
-						     ownership_change_work);
-	struct mlx5_core_dev *dev = tracer->dev;
+	u32 out[MLX5_ST_SZ_DW(mtrc_ctrl)] = {0};
+	u32 in[MLX5_ST_SZ_DW(mtrc_ctrl)] = {0};
 	int err;
 
-	if (tracer->owner) {
-		mlx5_fw_tracer_ownership_release(tracer);
+	MLX5_SET(mtrc_ctrl, in, arm_event, 1);
+
+	err = mlx5_core_access_reg(dev, in, sizeof(in), out, sizeof(out),
+				   MLX5_REG_MTRC_CTRL, 0, 1);
+	if (err)
+		mlx5_core_warn(dev, "FWTracer: Failed to arm tracer event %d\n", err);
+}
+
+static void poll_trace(struct mlx5_fw_tracer *tracer,
+		       struct tracer_event *tracer_event, u64 *trace)
+{
+	u32 timestamp_low, timestamp_mid, timestamp_high, urts;
+
+	tracer_event->event_id = MLX5_GET(tracer_event, trace, event_id);
+	tracer_event->lost_event = MLX5_GET(tracer_event, trace, lost);
+
+	switch (tracer_event->event_id) {
+	case TRACER_EVENT_TYPE_TIMESTAMP:
+		tracer_event->type = TRACER_EVENT_TYPE_TIMESTAMP;
+		urts = MLX5_GET(tracer_timestamp_event, trace, urts);
+		if (tracer->trc_ver == 0)
+			tracer_event->timestamp_event.unreliable = !!(urts >> 2);
+		else
+			tracer_event->timestamp_event.unreliable = !!(urts & 1);
+
+		timestamp_low = MLX5_GET(tracer_timestamp_event,
+					 trace, timestamp7_0);
+		timestamp_mid = MLX5_GET(tracer_timestamp_event,
+					 trace, timestamp39_8);
+		timestamp_high = MLX5_GET(tracer_timestamp_event,
+					  trace, timestamp52_40);
+
+		tracer_event->timestamp_event.timestamp =
+				((u64)timestamp_high << 40) |
+				((u64)timestamp_mid << 8) |
+				(u64)timestamp_low;
+		break;
+	default:
+		if (tracer_event->event_id >= tracer->str_db.first_string_trace ||
+		    tracer_event->event_id <= tracer->str_db.first_string_trace +
+					      tracer->str_db.num_string_trace) {
+			tracer_event->type = TRACER_EVENT_TYPE_STRING;
+			tracer_event->string_event.timestamp =
+				MLX5_GET(tracer_string_event, trace, timestamp);
+			tracer_event->string_event.string_param =
+				MLX5_GET(tracer_string_event, trace, string_param);
+			tracer_event->string_event.tmsn =
+				MLX5_GET(tracer_string_event, trace, tmsn);
+			tracer_event->string_event.tdsn =
+				MLX5_GET(tracer_string_event, trace, tdsn);
+		} else {
+			tracer_event->type = TRACER_EVENT_TYPE_UNRECOGNIZED;
+		}
+		break;
+	}
+}
+
+static u64 get_block_timestamp(struct mlx5_fw_tracer *tracer, u64 *ts_event)
+{
+	struct tracer_event tracer_event;
+	u8 event_id;
+
+	event_id = MLX5_GET(tracer_event, ts_event, event_id);
+
+	if (event_id == TRACER_EVENT_TYPE_TIMESTAMP)
+		poll_trace(tracer, &tracer_event, ts_event);
+	else
+		tracer_event.timestamp_event.timestamp = 0;
+
+	return tracer_event.timestamp_event.timestamp;
+}
+
+static void mlx5_fw_tracer_handle_traces(struct work_struct *work)
+{
+	struct mlx5_fw_tracer *tracer =
+			container_of(work, struct mlx5_fw_tracer, handle_traces_work);
+	u64 block_timestamp, last_block_timestamp, tmp_trace_block[TRACES_PER_BLOCK];
+	u32 block_count, start_offset, prev_start_offset, prev_consumer_index;
+	u32 trace_event_size = MLX5_ST_SZ_BYTES(tracer_event);
+	struct tracer_event tracer_event;
+	struct mlx5_core_dev *dev;
+	int i;
+
+	if (!tracer->owner)
 		return;
+
+	dev = tracer->dev;
+	block_count = tracer->buff.size / TRACER_BLOCK_SIZE_BYTE;
+	start_offset = tracer->buff.consumer_index * TRACER_BLOCK_SIZE_BYTE;
+
+	/* Copy the block to a local buffer to avoid HW overwrite while processed */
+	memcpy(tmp_trace_block, tracer->buff.log_buf + start_offset,
+	       TRACER_BLOCK_SIZE_BYTE);
+
+	block_timestamp =
+		get_block_timestamp(tracer, &tmp_trace_block[TRACES_PER_BLOCK - 1]);
+
+	while (block_timestamp > tracer->last_timestamp) {
+		/* Check for block overwrite if this is not the first block */
+		if (tracer->last_timestamp) {
+			u64 *ts_event;
+			/* To avoid the block being overwritten by the HW on
+			 * buffer wraparound, the timestamp of the previous
+			 * block should be compared to the last timestamp
+			 * handled by the driver.
+			 */
+			prev_consumer_index =
+				(tracer->buff.consumer_index - 1) & (block_count - 1);
+			prev_start_offset = prev_consumer_index * TRACER_BLOCK_SIZE_BYTE;
+
+			ts_event = tracer->buff.log_buf + prev_start_offset +
+				   (TRACES_PER_BLOCK - 1) * trace_event_size;
+			last_block_timestamp = get_block_timestamp(tracer, ts_event);
+			/* If the previous timestamp differs from the last
+			 * stored timestamp, there is a good chance that the
+			 * current block was overwritten and therefore should
+			 * not be parsed.
+			 */
+			if (tracer->last_timestamp != last_block_timestamp) {
+				mlx5_core_warn(dev, "FWTracer: Events were lost\n");
+				tracer->last_timestamp = block_timestamp;
+				tracer->buff.consumer_index =
+					(tracer->buff.consumer_index + 1) & (block_count - 1);
+				break;
+			}
+		}
+
+		/* Parse events */
+		for (i = 0; i < TRACES_PER_BLOCK ; i++)
+			poll_trace(tracer, &tracer_event, &tmp_trace_block[i]);
+
+		tracer->buff.consumer_index =
+			(tracer->buff.consumer_index + 1) & (block_count - 1);
+
+		tracer->last_timestamp = block_timestamp;
+		start_offset = tracer->buff.consumer_index * TRACER_BLOCK_SIZE_BYTE;
+		memcpy(tmp_trace_block, tracer->buff.log_buf + start_offset,
+		       TRACER_BLOCK_SIZE_BYTE);
+		block_timestamp = get_block_timestamp(tracer,
+						      &tmp_trace_block[TRACES_PER_BLOCK - 1]);
 	}
 
+	mlx5_fw_tracer_arm(dev);
+}
+
+static int mlx5_fw_tracer_set_mtrc_conf(struct mlx5_fw_tracer *tracer)
+{
+	struct mlx5_core_dev *dev = tracer->dev;
+	u32 out[MLX5_ST_SZ_DW(mtrc_conf)] = {0};
+	u32 in[MLX5_ST_SZ_DW(mtrc_conf)] = {0};
+	int err;
+
+	MLX5_SET(mtrc_conf, in, trace_mode, TRACE_TO_MEMORY);
+	MLX5_SET(mtrc_conf, in, log_trace_buffer_size,
+		 ilog2(TRACER_BUFFER_PAGE_NUM));
+	MLX5_SET(mtrc_conf, in, trace_mkey, tracer->buff.mkey.key);
+
+	err = mlx5_core_access_reg(dev, in, sizeof(in), out, sizeof(out),
+				   MLX5_REG_MTRC_CONF, 0, 1);
+	if (err)
+		mlx5_core_warn(dev, "FWTracer: Failed to set tracer configurations %d\n", err);
+
+	return err;
+}
+
+static int mlx5_fw_tracer_set_mtrc_ctrl(struct mlx5_fw_tracer *tracer, u8 status, u8 arm)
+{
+	struct mlx5_core_dev *dev = tracer->dev;
+	u32 out[MLX5_ST_SZ_DW(mtrc_ctrl)] = {0};
+	u32 in[MLX5_ST_SZ_DW(mtrc_ctrl)] = {0};
+	int err;
+
+	MLX5_SET(mtrc_ctrl, in, modify_field_select, TRACE_STATUS);
+	MLX5_SET(mtrc_ctrl, in, trace_status, status);
+	MLX5_SET(mtrc_ctrl, in, arm_event, arm);
+
+	err = mlx5_core_access_reg(dev, in, sizeof(in), out, sizeof(out),
+				   MLX5_REG_MTRC_CTRL, 0, 1);
+
+	if (!err && status)
+		tracer->last_timestamp = 0;
+
+	return err;
+}
+
+static int mlx5_fw_tracer_start(struct mlx5_fw_tracer *tracer)
+{
+	struct mlx5_core_dev *dev = tracer->dev;
+	int err;
+
 	err = mlx5_fw_tracer_ownership_acquire(tracer);
 	if (err) {
 		mlx5_core_dbg(dev, "FWTracer: Ownership was not granted %d\n", err);
+		/* Don't fail since ownership can be acquired on a later FW event */
+		return 0;
+	}
+
+	err = mlx5_fw_tracer_set_mtrc_conf(tracer);
+	if (err) {
+		mlx5_core_warn(dev, "FWTracer: Failed to set tracer configuration %d\n", err);
+		goto release_ownership;
+	}
+
+	/* enable tracer & trace events */
+	err = mlx5_fw_tracer_set_mtrc_ctrl(tracer, 1, 1);
+	if (err) {
+		mlx5_core_warn(dev, "FWTracer: Failed to enable tracer %d\n", err);
+		goto release_ownership;
+	}
+
+	return 0;
+
+release_ownership:
+	mlx5_fw_tracer_ownership_release(tracer);
+	return err;
+}
+
+static void mlx5_fw_tracer_ownership_change(struct work_struct *work)
+{
+	struct mlx5_fw_tracer *tracer =
+		container_of(work, struct mlx5_fw_tracer, ownership_change_work);
+
+	if (tracer->owner) {
+		tracer->owner = false;
+		tracer->buff.consumer_index = 0;
 		return;
 	}
+
+	mlx5_fw_tracer_start(tracer);
 }
 
+/* Create software resources (buffers, etc.) */
 struct mlx5_fw_tracer *mlx5_fw_tracer_create(struct mlx5_core_dev *dev)
 {
 	struct mlx5_fw_tracer *tracer = NULL;
@@ -361,6 +580,8 @@ struct mlx5_fw_tracer *mlx5_fw_tracer_create(struct mlx5_core_dev *dev)
 
 	INIT_WORK(&tracer->ownership_change_work, mlx5_fw_tracer_ownership_change);
 	INIT_WORK(&tracer->read_fw_strings_work, mlx5_tracer_read_strings_db);
+	INIT_WORK(&tracer->handle_traces_work, mlx5_fw_tracer_handle_traces);
+
 
 	err = mlx5_query_mtrc_caps(tracer);
 	if (err) {
@@ -392,6 +613,9 @@ struct mlx5_fw_tracer *mlx5_fw_tracer_create(struct mlx5_core_dev *dev)
 	return ERR_PTR(err);
 }
 
+/* Create HW resources and start the tracer.
+ * Must be called before the async EQ is created.
+ */
 int mlx5_fw_tracer_init(struct mlx5_fw_tracer *tracer)
 {
 	struct mlx5_core_dev *dev;
@@ -417,22 +641,25 @@ int mlx5_fw_tracer_init(struct mlx5_fw_tracer *tracer)
 		goto err_dealloc_pd;
 	}
 
-	err = mlx5_fw_tracer_ownership_acquire(tracer);
-	if (err) /* Don't fail since ownership can be acquired on a later FW event */
-		mlx5_core_dbg(dev, "FWTracer: Ownership was not granted %d\n", err);
+	mlx5_fw_tracer_start(tracer);
 
 	return 0;
+
 err_dealloc_pd:
 	mlx5_core_dealloc_pd(dev, tracer->buff.pdn);
 	return err;
 }
 
+/* Stop the tracer and clean up HW resources.
+ * Must be called after the async EQ is destroyed.
+ */
 void mlx5_fw_tracer_cleanup(struct mlx5_fw_tracer *tracer)
 {
 	if (IS_ERR_OR_NULL(tracer))
 		return;
 
 	cancel_work_sync(&tracer->ownership_change_work);
+	cancel_work_sync(&tracer->handle_traces_work);
 
 	if (tracer->owner)
 		mlx5_fw_tracer_ownership_release(tracer);
@@ -441,6 +668,7 @@ void mlx5_fw_tracer_cleanup(struct mlx5_fw_tracer *tracer)
 	mlx5_core_dealloc_pd(tracer->dev, tracer->buff.pdn);
 }
 
+/* Free software resources (buffers, etc.) */
 void mlx5_fw_tracer_destroy(struct mlx5_fw_tracer *tracer)
 {
 	if (IS_ERR_OR_NULL(tracer))
@@ -454,4 +682,26 @@ void mlx5_fw_tracer_destroy(struct mlx5_fw_tracer *tracer)
 	kfree(tracer);
 }
 
+void mlx5_fw_tracer_event(struct mlx5_core_dev *dev, struct mlx5_eqe *eqe)
+{
+	struct mlx5_fw_tracer *tracer = dev->tracer;
+
+	if (!tracer)
+		return;
+
+	switch (eqe->sub_type) {
+	case MLX5_TRACER_SUBTYPE_OWNERSHIP_CHANGE:
+		if (test_bit(MLX5_INTERFACE_STATE_UP, &dev->intf_state))
+			queue_work(tracer->work_queue, &tracer->ownership_change_work);
+		break;
+	case MLX5_TRACER_SUBTYPE_TRACES_AVAILABLE:
+		if (likely(tracer->str_db.loaded))
+			queue_work(tracer->work_queue, &tracer->handle_traces_work);
+		break;
+	default:
+		mlx5_core_dbg(dev, "FWTracer: Event with unrecognized subtype: sub_type %d\n",
+			      eqe->sub_type);
+	}
+}
+
 EXPORT_TRACEPOINT_SYMBOL(mlx5_fw);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.h b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.h
index 66cb7e7ada28..3915e91486b2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.h
@@ -43,6 +43,9 @@
 #define TRACER_BUFFER_CHUNK 4096
 #define TRACE_BUFFER_SIZE_BYTE (TRACER_BUFFER_PAGE_NUM * TRACER_BUFFER_CHUNK)
 
+#define TRACER_BLOCK_SIZE_BYTE 256
+#define TRACES_PER_BLOCK 32
+
 struct mlx5_fw_tracer {
 	struct mlx5_core_dev *dev;
 	bool owner;
@@ -69,8 +72,11 @@ struct mlx5_fw_tracer {
 		dma_addr_t dma;
 		u32 size;
 		struct mlx5_core_mkey mkey;
-
+		u32 consumer_index;
 	} buff;
+
+	u64 last_timestamp;
+	struct work_struct handle_traces_work;
 };
 
 enum mlx5_fw_tracer_ownership_state {
@@ -78,7 +84,70 @@ enum mlx5_fw_tracer_ownership_state {
 	MLX5_FW_TRACER_ACQUIRE_OWNERSHIP,
 };
 
+enum tracer_ctrl_fields_select {
+	TRACE_STATUS = 1 << 0,
+};
+
+enum tracer_event_type {
+	TRACER_EVENT_TYPE_STRING,
+	TRACER_EVENT_TYPE_TIMESTAMP = 0xFF,
+	TRACER_EVENT_TYPE_UNRECOGNIZED,
+};
+
+enum tracing_mode {
+	TRACE_TO_MEMORY = 1 << 0,
+};
+
+struct tracer_timestamp_event {
+	u64        timestamp;
+	u8         unreliable;
+};
+
+struct tracer_string_event {
+	u32        timestamp;
+	u32        tmsn;
+	u32        tdsn;
+	u32        string_param;
+};
+
+struct tracer_event {
+	bool      lost_event;
+	u32       type;
+	u8        event_id;
+	union {
+		struct tracer_string_event string_event;
+		struct tracer_timestamp_event timestamp_event;
+	};
+};
+
+struct mlx5_ifc_tracer_event_bits {
+	u8         lost[0x1];
+	u8         timestamp[0x7];
+	u8         event_id[0x8];
+	u8         event_data[0x30];
+};
+
+struct mlx5_ifc_tracer_string_event_bits {
+	u8         lost[0x1];
+	u8         timestamp[0x7];
+	u8         event_id[0x8];
+	u8         tmsn[0xd];
+	u8         tdsn[0x3];
+	u8         string_param[0x20];
+};
+
+struct mlx5_ifc_tracer_timestamp_event_bits {
+	u8         timestamp7_0[0x8];
+	u8         event_id[0x8];
+	u8         urts[0x3];
+	u8         timestamp52_40[0xd];
+	u8         timestamp39_8[0x20];
+};
+
 struct mlx5_fw_tracer *mlx5_fw_tracer_create(struct mlx5_core_dev *dev);
+int mlx5_fw_tracer_init(struct mlx5_fw_tracer *tracer);
+void mlx5_fw_tracer_cleanup(struct mlx5_fw_tracer *tracer);
 void mlx5_fw_tracer_destroy(struct mlx5_fw_tracer *tracer);
+void mlx5_fw_tracer_event(struct mlx5_core_dev *dev, struct mlx5_eqe *eqe);
 
 #endif
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 406c23862f5f..7669b4380779 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -40,6 +40,7 @@
 #include "mlx5_core.h"
 #include "fpga/core.h"
 #include "eswitch.h"
+#include "diag/fw_tracer.h"
 
 enum {
 	MLX5_EQE_SIZE		= sizeof(struct mlx5_eqe),
@@ -168,6 +169,8 @@ static const char *eqe_type_str(u8 type)
 		return "MLX5_EVENT_TYPE_FPGA_QP_ERROR";
 	case MLX5_EVENT_TYPE_GENERAL_EVENT:
 		return "MLX5_EVENT_TYPE_GENERAL_EVENT";
+	case MLX5_EVENT_TYPE_DEVICE_TRACER:
+		return "MLX5_EVENT_TYPE_DEVICE_TRACER";
 	default:
 		return "Unrecognized event";
 	}
@@ -576,6 +579,11 @@ static irqreturn_t mlx5_eq_int(int irq, void *eq_ptr)
 		case MLX5_EVENT_TYPE_GENERAL_EVENT:
 			general_event_handler(dev, eqe);
 			break;
+
+		case MLX5_EVENT_TYPE_DEVICE_TRACER:
+			mlx5_fw_tracer_event(dev, eqe);
+			break;
+
 		default:
 			mlx5_core_warn(dev, "Unhandled event 0x%x on EQ 0x%x\n",
 				       eqe->type, eq->eqn);
@@ -853,6 +861,9 @@ int mlx5_start_eqs(struct mlx5_core_dev *dev)
 	if (MLX5_CAP_GEN(dev, temp_warn_event))
 		async_event_mask |= (1ull << MLX5_EVENT_TYPE_TEMP_WARN_EVENT);
 
+	if (MLX5_CAP_MCAM_REG(dev, tracer_registers))
+		async_event_mask |= (1ull << MLX5_EVENT_TYPE_DEVICE_TRACER);
+
 	err = mlx5_create_map_eq(dev, &table->cmd_eq, MLX5_EQ_VEC_CMD,
 				 MLX5_NUM_CMD_EQE, 1ull << MLX5_EVENT_TYPE_CMD,
 				 "mlx5_cmd_eq", MLX5_EQ_TYPE_ASYNC);
diff --git a/include/linux/mlx5/device.h b/include/linux/mlx5/device.h
index 0566c6a94805..d489494b0a84 100644
--- a/include/linux/mlx5/device.h
+++ b/include/linux/mlx5/device.h
@@ -332,6 +332,13 @@ enum mlx5_event {
 
 	MLX5_EVENT_TYPE_FPGA_ERROR         = 0x20,
 	MLX5_EVENT_TYPE_FPGA_QP_ERROR      = 0x21,
+
+	MLX5_EVENT_TYPE_DEVICE_TRACER      = 0x26,
+};
+
+enum {
+	MLX5_TRACER_SUBTYPE_OWNERSHIP_CHANGE = 0x0,
+	MLX5_TRACER_SUBTYPE_TRACES_AVAILABLE = 0x1,
 };
 
 enum {
-- 
2.17.0


* [net-next 05/16] net/mlx5: FW tracer, parse traces and kernel tracing support
  2018-07-19  1:00 [pull request][net-next 00/16] Mellanox, mlx5e updates 2018-07-18 Saeed Mahameed
                   ` (3 preceding siblings ...)
  2018-07-19  1:00 ` [net-next 04/16] net/mlx5: FW tracer, events handling Saeed Mahameed
@ 2018-07-19  1:00 ` Saeed Mahameed
  2018-07-19  1:00 ` [net-next 06/16] net/mlx5: FW tracer, Enable tracing Saeed Mahameed
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Saeed Mahameed @ 2018-07-19  1:00 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Feras Daoud, Erez Shitrit, Saeed Mahameed

From: Feras Daoud <ferasda@mellanox.com>

For each message, the driver should do the following:
1- Find the message string in the strings database.
2- Count the number of parameters the message takes.
3- Wait for the parameter events and accumulate them.
4- Calculate the event timestamp using the local event timestamp
and the first timestamp event following it.
5- Print the message to the trace log.
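Step 2 can be illustrated with a small standalone sketch. The helper name
is hypothetical; only the approach mirrors the patch: each 64-bit "%llx"
placeholder arrives as two 32-bit parameters, so it is first rewritten in
place to "%x%x" and then every '%' is counted.

```c
#include <string.h>

/* Count how many parameter events a format string expects.  "%llx" is
 * rewritten to "%x%x" first, because a 64-bit value is delivered as
 * two 32-bit parameters; both strings are 4 chars, so the in-place
 * rewrite is safe. */
static int count_params(char *fmt)
{
	char *p = fmt;
	int n = 0;

	while ((p = strstr(p, "%llx")))
		memcpy(p, "%x%x", 4); /* same length, rewritten in place */

	for (p = fmt; (p = strchr(p, '%')); p++)
		n++;
	return n;
}
```

For example, "port %d sent %llx bytes" is rewritten to
"port %d sent %x%x bytes", so three parameter events are expected.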

Enable the tracing by:
echo 1 > /sys/kernel/debug/tracing/events/mlx5/mlx5_fw/enable

Read traces by:
cat /sys/kernel/debug/tracing/trace
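Step 4's timestamp reconstruction, combining the 7 low timestamp bits a
message carries with the next full timestamp event, can be sketched as
follows. This is a simplified illustration of the intent using the
patch's MASK_52_7/MASK_6_0 masks; the patch's exact borrow expression
differs slightly, and the function name is illustrative.

```c
#include <stdint.h>

#define MASK_52_7 0x1FFFFFFFFFFF80ULL /* bits 52..7 of the timestamp */
#define MASK_6_0  0x7FULL             /* bits 6..0 of the timestamp */

/* A message carries only the low 7 bits of its timestamp; the upper
 * bits come from the first full timestamp event that follows it.  If
 * the message's low bits are not below the event's low bits, the
 * message belongs to the previous 128-tick window, so one window is
 * borrowed from the upper bits. */
static uint64_t merge_timestamp(uint64_t event_ts, uint32_t msg_low7)
{
	uint64_t upper = event_ts & MASK_52_7;
	uint64_t low = msg_low7 & MASK_6_0;

	if (low < (event_ts & MASK_6_0))
		return upper | low;
	return (upper - (1ULL << 7)) | low; /* previous window */
}
```

For example, with a timestamp event of 0x207 (low bits 7), a message low
value of 5 merges into the same window as 0x205, while a message low
value of 9 falls into the previous window, 0x189.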

Signed-off-by: Feras Daoud <ferasda@mellanox.com>
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 .../mellanox/mlx5/core/diag/fw_tracer.c       | 235 +++++++++++++++++-
 .../mellanox/mlx5/core/diag/fw_tracer.h       |  22 ++
 .../mlx5/core/diag/fw_tracer_tracepoint.h     |  78 ++++++
 3 files changed, 333 insertions(+), 2 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer_tracepoint.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
index bd887d1d3396..309842de272c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
@@ -29,8 +29,9 @@
  * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  * SOFTWARE.
  */
-
+#define CREATE_TRACE_POINTS
 #include "fw_tracer.h"
+#include "fw_tracer_tracepoint.h"
 
 static int mlx5_query_mtrc_caps(struct mlx5_fw_tracer *tracer)
 {
@@ -332,6 +333,109 @@ static void mlx5_fw_tracer_arm(struct mlx5_core_dev *dev)
 		mlx5_core_warn(dev, "FWTracer: Failed to arm tracer event %d\n", err);
 }
 
+static const char *VAL_PARM		= "%llx";
+static const char *REPLACE_64_VAL_PARM	= "%x%x";
+static const char *PARAM_CHAR		= "%";
+
+static int mlx5_tracer_message_hash(u32 message_id)
+{
+	return jhash_1word(message_id, 0) & (MESSAGE_HASH_SIZE - 1);
+}
+
+static struct tracer_string_format *mlx5_tracer_message_insert(struct mlx5_fw_tracer *tracer,
+							       struct tracer_event *tracer_event)
+{
+	struct hlist_head *head =
+		&tracer->hash[mlx5_tracer_message_hash(tracer_event->string_event.tmsn)];
+	struct tracer_string_format *cur_string;
+
+	cur_string = kzalloc(sizeof(*cur_string), GFP_KERNEL);
+	if (!cur_string)
+		return NULL;
+
+	hlist_add_head(&cur_string->hlist, head);
+
+	return cur_string;
+}
+
+static struct tracer_string_format *mlx5_tracer_get_string(struct mlx5_fw_tracer *tracer,
+							   struct tracer_event *tracer_event)
+{
+	struct tracer_string_format *cur_string;
+	u32 str_ptr, offset;
+	int i;
+
+	str_ptr = tracer_event->string_event.string_param;
+
+	for (i = 0; i < tracer->str_db.num_string_db; i++) {
+		if (str_ptr > tracer->str_db.base_address_out[i] &&
+		    str_ptr < tracer->str_db.base_address_out[i] +
+		    tracer->str_db.size_out[i]) {
+			offset = str_ptr - tracer->str_db.base_address_out[i];
+			/* add it to the hash */
+			cur_string = mlx5_tracer_message_insert(tracer, tracer_event);
+			if (!cur_string)
+				return NULL;
+			cur_string->string = (char *)(tracer->str_db.buffer[i] +
+							offset);
+			return cur_string;
+		}
+	}
+
+	return NULL;
+}
+
+static void mlx5_tracer_clean_message(struct tracer_string_format *str_frmt)
+{
+	hlist_del(&str_frmt->hlist);
+	kfree(str_frmt);
+}
+
+static int mlx5_tracer_get_num_of_params(char *str)
+{
+	char *substr, *pstr = str;
+	int num_of_params = 0;
+
+	/* replace %llx with %x%x */
+	substr = strstr(pstr, VAL_PARM);
+	while (substr) {
+		memcpy(substr, REPLACE_64_VAL_PARM, 4);
+		pstr = substr;
+		substr = strstr(pstr, VAL_PARM);
+	}
+
+	/* count all the % characters */
+	substr = strstr(str, PARAM_CHAR);
+	while (substr) {
+		num_of_params += 1;
+		str = substr + 1;
+		substr = strstr(str, PARAM_CHAR);
+	}
+
+	return num_of_params;
+}
+
+static struct tracer_string_format *mlx5_tracer_message_find(struct hlist_head *head,
+							     u8 event_id, u32 tmsn)
+{
+	struct tracer_string_format *message;
+
+	hlist_for_each_entry(message, head, hlist)
+		if (message->event_id == event_id && message->tmsn == tmsn)
+			return message;
+
+	return NULL;
+}
+
+static struct tracer_string_format *mlx5_tracer_message_get(struct mlx5_fw_tracer *tracer,
+							    struct tracer_event *tracer_event)
+{
+	struct hlist_head *head =
+		&tracer->hash[mlx5_tracer_message_hash(tracer_event->string_event.tmsn)];
+
+	return mlx5_tracer_message_find(head, tracer_event->event_id, tracer_event->string_event.tmsn);
+}
+
 static void poll_trace(struct mlx5_fw_tracer *tracer,
 		       struct tracer_event *tracer_event, u64 *trace)
 {
@@ -396,6 +500,128 @@ static u64 get_block_timestamp(struct mlx5_fw_tracer *tracer, u64 *ts_event)
 	return tracer_event.timestamp_event.timestamp;
 }
 
+static void mlx5_fw_tracer_clean_print_hash(struct mlx5_fw_tracer *tracer)
+{
+	struct tracer_string_format *str_frmt;
+	struct hlist_node *n;
+	int i;
+
+	for (i = 0; i < MESSAGE_HASH_SIZE; i++) {
+		hlist_for_each_entry_safe(str_frmt, n, &tracer->hash[i], hlist)
+			mlx5_tracer_clean_message(str_frmt);
+	}
+}
+
+static void mlx5_fw_tracer_clean_ready_list(struct mlx5_fw_tracer *tracer)
+{
+	struct tracer_string_format *str_frmt, *tmp_str;
+
+	list_for_each_entry_safe(str_frmt, tmp_str, &tracer->ready_strings_list,
+				 list)
+		list_del(&str_frmt->list);
+}
+
+static void mlx5_tracer_print_trace(struct tracer_string_format *str_frmt,
+				    struct mlx5_core_dev *dev,
+				    u64 trace_timestamp)
+{
+	char	tmp[512];
+
+	snprintf(tmp, sizeof(tmp), str_frmt->string,
+		 str_frmt->params[0],
+		 str_frmt->params[1],
+		 str_frmt->params[2],
+		 str_frmt->params[3],
+		 str_frmt->params[4],
+		 str_frmt->params[5],
+		 str_frmt->params[6]);
+
+	trace_mlx5_fw(dev->tracer, trace_timestamp, str_frmt->lost,
+		      str_frmt->event_id, tmp);
+
+	/* remove it from hash */
+	mlx5_tracer_clean_message(str_frmt);
+}
+
+static int mlx5_tracer_handle_string_trace(struct mlx5_fw_tracer *tracer,
+					   struct tracer_event *tracer_event)
+{
+	struct tracer_string_format *cur_string;
+
+	if (tracer_event->string_event.tdsn == 0) {
+		cur_string = mlx5_tracer_get_string(tracer, tracer_event);
+		if (!cur_string)
+			return -1;
+
+		cur_string->num_of_params = mlx5_tracer_get_num_of_params(cur_string->string);
+		cur_string->last_param_num = 0;
+		cur_string->event_id = tracer_event->event_id;
+		cur_string->tmsn = tracer_event->string_event.tmsn;
+		cur_string->timestamp = tracer_event->string_event.timestamp;
+		cur_string->lost = tracer_event->lost_event;
+		if (cur_string->num_of_params == 0) /* trace with no params */
+			list_add_tail(&cur_string->list, &tracer->ready_strings_list);
+	} else {
+		cur_string = mlx5_tracer_message_get(tracer, tracer_event);
+		if (!cur_string) {
+			pr_debug("%s Got string event for unknown string tmsn: %d\n",
+				 __func__, tracer_event->string_event.tmsn);
+			return -1;
+		}
+		cur_string->last_param_num += 1;
+		if (cur_string->last_param_num > TRACER_MAX_PARAMS) {
+			pr_debug("%s Number of params exceeds the max (%d)\n",
+				 __func__, TRACER_MAX_PARAMS);
+			list_add_tail(&cur_string->list, &tracer->ready_strings_list);
+			return 0;
+		}
+		/* keep the new parameter */
+		cur_string->params[cur_string->last_param_num - 1] =
+			tracer_event->string_event.string_param;
+		if (cur_string->last_param_num == cur_string->num_of_params)
+			list_add_tail(&cur_string->list, &tracer->ready_strings_list);
+	}
+
+	return 0;
+}
+
+static void mlx5_tracer_handle_timestamp_trace(struct mlx5_fw_tracer *tracer,
+					       struct tracer_event *tracer_event)
+{
+	struct tracer_timestamp_event timestamp_event =
+						tracer_event->timestamp_event;
+	struct tracer_string_format *str_frmt, *tmp_str;
+	struct mlx5_core_dev *dev = tracer->dev;
+	u64 trace_timestamp;
+
+	list_for_each_entry_safe(str_frmt, tmp_str, &tracer->ready_strings_list, list) {
+		list_del(&str_frmt->list);
+		if (str_frmt->timestamp < (timestamp_event.timestamp & MASK_6_0))
+			trace_timestamp = (timestamp_event.timestamp & MASK_52_7) |
+					  (str_frmt->timestamp & MASK_6_0);
+		else
+			trace_timestamp = ((timestamp_event.timestamp & MASK_52_7) - 1) |
+					  (str_frmt->timestamp & MASK_6_0);
+
+		mlx5_tracer_print_trace(str_frmt, dev, trace_timestamp);
+	}
+}
+
+static int mlx5_tracer_handle_trace(struct mlx5_fw_tracer *tracer,
+				    struct tracer_event *tracer_event)
+{
+	if (tracer_event->type == TRACER_EVENT_TYPE_STRING) {
+		mlx5_tracer_handle_string_trace(tracer, tracer_event);
+	} else if (tracer_event->type == TRACER_EVENT_TYPE_TIMESTAMP) {
+		if (!tracer_event->timestamp_event.unreliable)
+			mlx5_tracer_handle_timestamp_trace(tracer, tracer_event);
+	} else {
+		pr_debug("%s Got unrecognized type %d for parsing, exiting..\n",
+			 __func__, tracer_event->type);
+	}
+	return 0;
+}
+
 static void mlx5_fw_tracer_handle_traces(struct work_struct *work)
 {
 	struct mlx5_fw_tracer *tracer =
@@ -452,8 +678,10 @@ static void mlx5_fw_tracer_handle_traces(struct work_struct *work)
 		}
 
 		/* Parse events */
-		for (i = 0; i < TRACES_PER_BLOCK ; i++)
+		for (i = 0; i < TRACES_PER_BLOCK ; i++) {
 			poll_trace(tracer, &tracer_event, &tmp_trace_block[i]);
+			mlx5_tracer_handle_trace(tracer, &tracer_event);
+		}
 
 		tracer->buff.consumer_index =
 			(tracer->buff.consumer_index + 1) & (block_count - 1);
@@ -578,6 +806,7 @@ struct mlx5_fw_tracer *mlx5_fw_tracer_create(struct mlx5_core_dev *dev)
 
 	tracer->dev = dev;
 
+	INIT_LIST_HEAD(&tracer->ready_strings_list);
 	INIT_WORK(&tracer->ownership_change_work, mlx5_fw_tracer_ownership_change);
 	INIT_WORK(&tracer->read_fw_strings_work, mlx5_tracer_read_strings_db);
 	INIT_WORK(&tracer->handle_traces_work, mlx5_fw_tracer_handle_traces);
@@ -675,6 +904,8 @@ void mlx5_fw_tracer_destroy(struct mlx5_fw_tracer *tracer)
 		return;
 
 	cancel_work_sync(&tracer->read_fw_strings_work);
+	mlx5_fw_tracer_clean_ready_list(tracer);
+	mlx5_fw_tracer_clean_print_hash(tracer);
 	mlx5_fw_tracer_free_strings_db(tracer);
 	mlx5_fw_tracer_destroy_log_buf(tracer);
 	flush_workqueue(tracer->work_queue);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.h b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.h
index 3915e91486b2..8d310e7d6743 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.h
@@ -46,6 +46,13 @@
 #define TRACER_BLOCK_SIZE_BYTE 256
 #define TRACES_PER_BLOCK 32
 
+#define TRACER_MAX_PARAMS 7
+#define MESSAGE_HASH_BITS 6
+#define MESSAGE_HASH_SIZE BIT(MESSAGE_HASH_BITS)
+
+#define MASK_52_7 (0x1FFFFFFFFFFF80)
+#define MASK_6_0  (0x7F)
+
 struct mlx5_fw_tracer {
 	struct mlx5_core_dev *dev;
 	bool owner;
@@ -77,6 +84,21 @@ struct mlx5_fw_tracer {
 
 	u64 last_timestamp;
 	struct work_struct handle_traces_work;
+	struct hlist_head hash[MESSAGE_HASH_SIZE];
+	struct list_head ready_strings_list;
+};
+
+struct tracer_string_format {
+	char *string;
+	int params[TRACER_MAX_PARAMS];
+	int num_of_params;
+	int last_param_num;
+	u8 event_id;
+	u32 tmsn;
+	struct hlist_node hlist;
+	struct list_head list;
+	u32 timestamp;
+	bool lost;
 };
 
 enum mlx5_fw_tracer_ownership_state {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer_tracepoint.h b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer_tracepoint.h
new file mode 100644
index 000000000000..83f90e9aff45
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer_tracepoint.h
@@ -0,0 +1,78 @@
+/*
+ * Copyright (c) 2018, Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#if !defined(__LIB_TRACER_TRACEPOINT_H__) || defined(TRACE_HEADER_MULTI_READ)
+#define __LIB_TRACER_TRACEPOINT_H__
+
+#include <linux/tracepoint.h>
+#include "fw_tracer.h"
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM mlx5
+
+/* Tracepoint for FWTracer messages: */
+TRACE_EVENT(mlx5_fw,
+	TP_PROTO(const struct mlx5_fw_tracer *tracer, u64 trace_timestamp,
+		 bool lost, u8 event_id, const char *msg),
+
+	TP_ARGS(tracer, trace_timestamp, lost, event_id, msg),
+
+	TP_STRUCT__entry(
+		__string(dev_name, dev_name(&tracer->dev->pdev->dev))
+		__field(u64, trace_timestamp)
+		__field(bool, lost)
+		__field(u8, event_id)
+		__string(msg, msg)
+	),
+
+	TP_fast_assign(
+		__assign_str(dev_name, dev_name(&tracer->dev->pdev->dev));
+		__entry->trace_timestamp = trace_timestamp;
+		__entry->lost = lost;
+		__entry->event_id = event_id;
+		__assign_str(msg, msg);
+	),
+
+	TP_printk("%s [0x%llx] %d [0x%x] %s",
+		  __get_str(dev_name),
+		  __entry->trace_timestamp,
+		  __entry->lost, __entry->event_id,
+		  __get_str(msg))
+);
+
+#endif
+
+#undef TRACE_INCLUDE_PATH
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_PATH ./diag
+#define TRACE_INCLUDE_FILE fw_tracer_tracepoint
+#include <trace/define_trace.h>
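
A tracepoint defined this way under TRACE_SYSTEM mlx5 can typically be consumed through tracefs like any other kernel tracepoint. A usage sketch (paths assume tracefs is mounted at the usual debugfs location; not part of this patch):

```shell
# Enable the mlx5_fw tracepoint and stream FW traces as they arrive
echo 1 > /sys/kernel/debug/tracing/events/mlx5/mlx5_fw/enable
cat /sys/kernel/debug/tracing/trace_pipe
```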
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [net-next 06/16] net/mlx5: FW tracer, Enable tracing
  2018-07-19  1:00 [pull request][net-next 00/16] Mellanox, mlx5e updates 2018-07-18 Saeed Mahameed
                   ` (4 preceding siblings ...)
  2018-07-19  1:00 ` [net-next 05/16] net/mlx5: FW tracer, parse traces and kernel tracing support Saeed Mahameed
@ 2018-07-19  1:00 ` Saeed Mahameed
  2018-07-19  1:00 ` [net-next 07/16] net/mlx5: FW tracer, Add debug prints Saeed Mahameed
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Saeed Mahameed @ 2018-07-19  1:00 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Feras Daoud, Saeed Mahameed

From: Feras Daoud <ferasda@mellanox.com>

Add the tracer file to the Makefile and add the init
function to the mlx5_load_one flow.

Signed-off-by: Feras Daoud <ferasda@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 .../net/ethernet/mellanox/mlx5/core/Makefile   |  2 +-
 .../mellanox/mlx5/core/diag/fw_tracer.h        |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/main.c | 18 ++++++++++++++++--
 3 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index d923f2f58608..55d5a5c2e9d8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -6,7 +6,7 @@ mlx5_core-y :=	main.o cmd.o debugfs.o fw.o eq.o uar.o pagealloc.o \
 		health.o mcg.o cq.o srq.o alloc.o qp.o port.o mr.o pd.o \
 		mad.o transobj.o vport.o sriov.o fs_cmd.o fs_core.o \
 		fs_counters.o rl.o lag.o dev.o wq.o lib/gid.o lib/clock.o \
-		diag/fs_tracepoint.o
+		diag/fs_tracepoint.o diag/fw_tracer.o
 
 mlx5_core-$(CONFIG_MLX5_ACCEL) += accel/ipsec.o accel/tls.o
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.h b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.h
index 8d310e7d6743..0347f2dd5cee 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.h
@@ -170,6 +170,6 @@ struct mlx5_fw_tracer *mlx5_fw_tracer_create(struct mlx5_core_dev *dev);
 int mlx5_fw_tracer_init(struct mlx5_fw_tracer *tracer);
 void mlx5_fw_tracer_cleanup(struct mlx5_fw_tracer *tracer);
 void mlx5_fw_tracer_destroy(struct mlx5_fw_tracer *tracer);
-void mlx5_fw_tracer_event(struct mlx5_core_dev *dev, struct mlx5_eqe *eqe) { return; }
+void mlx5_fw_tracer_event(struct mlx5_core_dev *dev, struct mlx5_eqe *eqe);
 
 #endif
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index f9b950e1bd85..6ddbb70e95de 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -62,6 +62,7 @@
 #include "accel/ipsec.h"
 #include "accel/tls.h"
 #include "lib/clock.h"
+#include "diag/fw_tracer.h"
 
 MODULE_AUTHOR("Eli Cohen <eli@mellanox.com>");
 MODULE_DESCRIPTION("Mellanox 5th generation network adapters (ConnectX series) core driver");
@@ -990,6 +991,8 @@ static int mlx5_init_once(struct mlx5_core_dev *dev, struct mlx5_priv *priv)
 		goto err_sriov_cleanup;
 	}
 
+	dev->tracer = mlx5_fw_tracer_create(dev);
+
 	return 0;
 
 err_sriov_cleanup:
@@ -1015,6 +1018,7 @@ static int mlx5_init_once(struct mlx5_core_dev *dev, struct mlx5_priv *priv)
 
 static void mlx5_cleanup_once(struct mlx5_core_dev *dev)
 {
+	mlx5_fw_tracer_destroy(dev->tracer);
 	mlx5_fpga_cleanup(dev);
 	mlx5_sriov_cleanup(dev);
 	mlx5_eswitch_cleanup(dev->priv.eswitch);
@@ -1167,10 +1171,16 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv,
 		goto err_put_uars;
 	}
 
+	err = mlx5_fw_tracer_init(dev->tracer);
+	if (err) {
+		dev_err(&pdev->dev, "Failed to init FW tracer\n");
+		goto err_fw_tracer;
+	}
+
 	err = alloc_comp_eqs(dev);
 	if (err) {
 		dev_err(&pdev->dev, "Failed to alloc completion EQs\n");
-		goto err_stop_eqs;
+		goto err_comp_eqs;
 	}
 
 	err = mlx5_irq_set_affinity_hints(dev);
@@ -1252,7 +1262,10 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv,
 err_affinity_hints:
 	free_comp_eqs(dev);
 
-err_stop_eqs:
+err_comp_eqs:
+	mlx5_fw_tracer_cleanup(dev->tracer);
+
+err_fw_tracer:
 	mlx5_stop_eqs(dev);
 
 err_put_uars:
@@ -1320,6 +1333,7 @@ static int mlx5_unload_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv,
 	mlx5_fpga_device_stop(dev);
 	mlx5_irq_clear_affinity_hints(dev);
 	free_comp_eqs(dev);
+	mlx5_fw_tracer_cleanup(dev->tracer);
 	mlx5_stop_eqs(dev);
 	mlx5_put_uars_page(dev, priv->uar);
 	mlx5_free_irq_vectors(dev);
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [net-next 07/16] net/mlx5: FW tracer, Add debug prints
  2018-07-19  1:00 [pull request][net-next 00/16] Mellanox, mlx5e updates 2018-07-18 Saeed Mahameed
                   ` (5 preceding siblings ...)
  2018-07-19  1:00 ` [net-next 06/16] net/mlx5: FW tracer, Enable tracing Saeed Mahameed
@ 2018-07-19  1:00 ` Saeed Mahameed
  2018-07-19  1:00 ` [net-next 08/16] net/mlx5: Move all devlink related functions calls to devlink.c Saeed Mahameed
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Saeed Mahameed @ 2018-07-19  1:00 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Saeed Mahameed

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 .../ethernet/mellanox/mlx5/core/diag/fw_tracer.c    | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
index 309842de272c..d4ec93bde4de 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
@@ -629,14 +629,14 @@ static void mlx5_fw_tracer_handle_traces(struct work_struct *work)
 	u64 block_timestamp, last_block_timestamp, tmp_trace_block[TRACES_PER_BLOCK];
 	u32 block_count, start_offset, prev_start_offset, prev_consumer_index;
 	u32 trace_event_size = MLX5_ST_SZ_BYTES(tracer_event);
+	struct mlx5_core_dev *dev = tracer->dev;
 	struct tracer_event tracer_event;
-	struct mlx5_core_dev *dev;
 	int i;
 
+	mlx5_core_dbg(dev, "FWTracer: Handle Trace event, owner=(%d)\n", tracer->owner);
 	if (!tracer->owner)
 		return;
 
-	dev = tracer->dev;
 	block_count = tracer->buff.size / TRACER_BLOCK_SIZE_BYTE;
 	start_offset = tracer->buff.consumer_index * TRACER_BLOCK_SIZE_BYTE;
 
@@ -762,6 +762,7 @@ static int mlx5_fw_tracer_start(struct mlx5_fw_tracer *tracer)
 		goto release_ownership;
 	}
 
+	mlx5_core_dbg(dev, "FWTracer: Ownership granted and active\n");
 	return 0;
 
 release_ownership:
@@ -774,6 +775,7 @@ static void mlx5_fw_tracer_ownership_change(struct work_struct *work)
 	struct mlx5_fw_tracer *tracer =
 		container_of(work, struct mlx5_fw_tracer, ownership_change_work);
 
+	mlx5_core_dbg(tracer->dev, "FWTracer: ownership changed, current=(%d)\n", tracer->owner);
 	if (tracer->owner) {
 		tracer->owner = false;
 		tracer->buff.consumer_index = 0;
@@ -830,6 +832,8 @@ struct mlx5_fw_tracer *mlx5_fw_tracer_create(struct mlx5_core_dev *dev)
 		goto free_log_buf;
 	}
 
+	mlx5_core_dbg(dev, "FWTracer: Tracer created\n");
+
 	return tracer;
 
 free_log_buf:
@@ -887,6 +891,9 @@ void mlx5_fw_tracer_cleanup(struct mlx5_fw_tracer *tracer)
 	if (IS_ERR_OR_NULL(tracer))
 		return;
 
+	mlx5_core_dbg(tracer->dev, "FWTracer: Cleanup, is owner ? (%d)\n",
+		      tracer->owner);
+
 	cancel_work_sync(&tracer->ownership_change_work);
 	cancel_work_sync(&tracer->handle_traces_work);
 
@@ -903,6 +910,8 @@ void mlx5_fw_tracer_destroy(struct mlx5_fw_tracer *tracer)
 	if (IS_ERR_OR_NULL(tracer))
 		return;
 
+	mlx5_core_dbg(tracer->dev, "FWTracer: Destroy\n");
+
 	cancel_work_sync(&tracer->read_fw_strings_work);
 	mlx5_fw_tracer_clean_ready_list(tracer);
 	mlx5_fw_tracer_clean_print_hash(tracer);
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [net-next 08/16] net/mlx5: Move all devlink related functions calls to devlink.c
  2018-07-19  1:00 [pull request][net-next 00/16] Mellanox, mlx5e updates 2018-07-18 Saeed Mahameed
                   ` (6 preceding siblings ...)
  2018-07-19  1:00 ` [net-next 07/16] net/mlx5: FW tracer, Add debug prints Saeed Mahameed
@ 2018-07-19  1:00 ` Saeed Mahameed
  2018-07-19  1:01 ` [net-next 09/16] net/mlx5: Add MPEGC register configuration functionality Saeed Mahameed
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Saeed Mahameed @ 2018-07-19  1:00 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Eran Ben Elisha, Saeed Mahameed

From: Eran Ben Elisha <eranbe@mellanox.com>

Centralize all devlink-related callbacks in one file.
A downstream patch will add more functionality; this
patch prepares the driver infrastructure for it.

For now, move the devlink un/register function calls into this file.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 .../net/ethernet/mellanox/mlx5/core/Makefile  |  2 +-
 .../net/ethernet/mellanox/mlx5/core/devlink.c | 43 +++++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/devlink.h | 40 +++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/main.c    |  5 ++-
 4 files changed, 87 insertions(+), 3 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/devlink.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/devlink.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 55d5a5c2e9d8..83abd9130ffb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -6,7 +6,7 @@ mlx5_core-y :=	main.o cmd.o debugfs.o fw.o eq.o uar.o pagealloc.o \
 		health.o mcg.o cq.o srq.o alloc.o qp.o port.o mr.o pd.o \
 		mad.o transobj.o vport.o sriov.o fs_cmd.o fs_core.o \
 		fs_counters.o rl.o lag.o dev.o wq.o lib/gid.o lib/clock.o \
-		diag/fs_tracepoint.o diag/fw_tracer.o
+		diag/fs_tracepoint.o diag/fw_tracer.o devlink.o
 
 mlx5_core-$(CONFIG_MLX5_ACCEL) += accel/ipsec.o accel/tls.o
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
new file mode 100644
index 000000000000..8851b3ec0ae2
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
@@ -0,0 +1,43 @@
+/*
+ * Copyright (c) 2018, Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include "devlink.h"
+
+int mlx5_devlink_register(struct devlink *devlink, struct device *dev)
+{
+	return devlink_register(devlink, dev);
+}
+
+void mlx5_devlink_unregister(struct devlink *devlink)
+{
+	devlink_unregister(devlink);
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/devlink.h b/drivers/net/ethernet/mellanox/mlx5/core/devlink.h
new file mode 100644
index 000000000000..eeb4fabba6ec
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/devlink.h
@@ -0,0 +1,40 @@
+/*
+ * Copyright (c) 2018, Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifndef __MLX5_DEVLINK_H__
+#define __MLX5_DEVLINK_H__
+
+#include <net/devlink.h>
+
+int mlx5_devlink_register(struct devlink *devlink, struct device *dev);
+void mlx5_devlink_unregister(struct devlink *devlink);
+
+#endif /* __MLX5_DEVLINK_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 6ddbb70e95de..7f581f3189ea 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -56,6 +56,7 @@
 #include "fs_core.h"
 #include "lib/mpfs.h"
 #include "eswitch.h"
+#include "devlink.h"
 #include "lib/mlx5.h"
 #include "fpga/core.h"
 #include "fpga/ipsec.h"
@@ -1440,7 +1441,7 @@ static int init_one(struct pci_dev *pdev,
 
 	request_module_nowait(MLX5_IB_MOD);
 
-	err = devlink_register(devlink, &pdev->dev);
+	err = mlx5_devlink_register(devlink, &pdev->dev);
 	if (err)
 		goto clean_load;
 
@@ -1470,7 +1471,7 @@ static void remove_one(struct pci_dev *pdev)
 	struct devlink *devlink = priv_to_devlink(dev);
 	struct mlx5_priv *priv = &dev->priv;
 
-	devlink_unregister(devlink);
+	mlx5_devlink_unregister(devlink);
 	mlx5_unregister_device(dev);
 
 	if (mlx5_unload_one(dev, priv, true)) {
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [net-next 09/16] net/mlx5: Add MPEGC register configuration functionality
  2018-07-19  1:00 [pull request][net-next 00/16] Mellanox, mlx5e updates 2018-07-18 Saeed Mahameed
                   ` (7 preceding siblings ...)
  2018-07-19  1:00 ` [net-next 08/16] net/mlx5: Move all devlink related functions calls to devlink.c Saeed Mahameed
@ 2018-07-19  1:01 ` Saeed Mahameed
  2018-07-19  1:01 ` [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink Saeed Mahameed
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Saeed Mahameed @ 2018-07-19  1:01 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Eran Ben Elisha, Saeed Mahameed

From: Eran Ben Elisha <eranbe@mellanox.com>

MPEGC register is used to configure and access the PCIe general
configuration.

Expose set/get for TX lossy overflow and TX overflow sense, which use
the MPEGC register. These will be used in a downstream patch via
devlink params.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 .../net/ethernet/mellanox/mlx5/core/devlink.c | 121 ++++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/devlink.h |   1 +
 2 files changed, 122 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
index 8851b3ec0ae2..9800c98b01d3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
@@ -32,6 +32,127 @@
 
 #include "devlink.h"
 
+enum {
+	MLX5_DEVLINK_MPEGC_FIELD_SELECT_TX_OVERFLOW_DROP_EN = BIT(0),
+	MLX5_DEVLINK_MPEGC_FIELD_SELECT_TX_OVERFLOW_SENSE = BIT(3),
+	MLX5_DEVLINK_MPEGC_FIELD_SELECT_MARK_TX_ACTION_CQE = BIT(4),
+	MLX5_DEVLINK_MPEGC_FIELD_SELECT_MARK_TX_ACTION_CNP = BIT(5),
+};
+
+enum {
+	MLX5_DEVLINK_CONGESTION_ACTION_DISABLED,
+	MLX5_DEVLINK_CONGESTION_ACTION_DROP,
+	MLX5_DEVLINK_CONGESTION_ACTION_MARK,
+	__MLX5_DEVLINK_CONGESTION_ACTION_MAX,
+	MLX5_DEVLINK_CONGESTION_ACTION_MAX = __MLX5_DEVLINK_CONGESTION_ACTION_MAX - 1,
+};
+
+enum {
+	MLX5_DEVLINK_CONGESTION_MODE_AGGRESSIVE,
+	MLX5_DEVLINK_CONGESTION_MODE_DYNAMIC_ADJUSTMENT,
+	__MLX5_DEVLINK_CONGESTION_MODE_MAX,
+	MLX5_DEVLINK_CONGESTION_MODE_MAX = __MLX5_DEVLINK_CONGESTION_MODE_MAX - 1,
+};
+
+static int mlx5_devlink_set_mpegc(struct mlx5_core_dev *mdev, u32 *in, int size_in)
+{
+	u32 out[MLX5_ST_SZ_DW(mpegc_reg)] = {0};
+
+	if (!MLX5_CAP_MCAM_REG(mdev, mpegc))
+		return -EOPNOTSUPP;
+
+	return mlx5_core_access_reg(mdev, in, size_in, out,
+				    sizeof(out), MLX5_REG_MPEGC, 0, 1);
+}
+
+static int mlx5_devlink_set_tx_lossy_overflow(struct mlx5_core_dev *mdev, u8 tx_lossy_overflow)
+{
+	u32 in[MLX5_ST_SZ_DW(mpegc_reg)] = {0};
+	u8 field_select = 0;
+
+	if (tx_lossy_overflow == MLX5_DEVLINK_CONGESTION_ACTION_MARK) {
+		if (MLX5_CAP_MCAM_FEATURE(mdev, mark_tx_action_cqe))
+			field_select |=
+				MLX5_DEVLINK_MPEGC_FIELD_SELECT_MARK_TX_ACTION_CQE;
+
+		if (MLX5_CAP_MCAM_FEATURE(mdev, mark_tx_action_cnp))
+			field_select |=
+				MLX5_DEVLINK_MPEGC_FIELD_SELECT_MARK_TX_ACTION_CNP;
+
+		if (!field_select)
+			return -EOPNOTSUPP;
+	}
+
+	MLX5_SET(mpegc_reg, in, field_select,
+		 field_select |
+		 MLX5_DEVLINK_MPEGC_FIELD_SELECT_TX_OVERFLOW_DROP_EN);
+	MLX5_SET(mpegc_reg, in, tx_lossy_overflow_oper, tx_lossy_overflow);
+	MLX5_SET(mpegc_reg, in, mark_cqe, 0x1);
+	MLX5_SET(mpegc_reg, in, mark_cnp, 0x1);
+
+	return mlx5_devlink_set_mpegc(mdev, in, sizeof(in));
+}
+
+static int mlx5_devlink_set_tx_overflow_sense(struct mlx5_core_dev *mdev,
+					      u8 tx_overflow_sense)
+{
+	u32 in[MLX5_ST_SZ_DW(mpegc_reg)] = {0};
+
+	if (!MLX5_CAP_MCAM_FEATURE(mdev, dynamic_tx_overflow))
+		return -EOPNOTSUPP;
+
+	MLX5_SET(mpegc_reg, in, field_select,
+		 MLX5_DEVLINK_MPEGC_FIELD_SELECT_TX_OVERFLOW_SENSE);
+	MLX5_SET(mpegc_reg, in, tx_overflow_sense, tx_overflow_sense);
+
+	return mlx5_devlink_set_mpegc(mdev, in, sizeof(in));
+}
+
+static int mlx5_devlink_query_mpegc(struct mlx5_core_dev *mdev, u32 *out,
+				    int size_out)
+{
+	u32 in[MLX5_ST_SZ_DW(mpegc_reg)] = {0};
+
+	if (!MLX5_CAP_MCAM_REG(mdev, mpegc))
+		return -EOPNOTSUPP;
+
+	return mlx5_core_access_reg(mdev, in, sizeof(in), out,
+				    size_out, MLX5_REG_MPEGC, 0, 0);
+}
+
+static int mlx5_devlink_query_tx_lossy_overflow(struct mlx5_core_dev *mdev,
+						u8 *tx_lossy_overflow)
+{
+	u32 out[MLX5_ST_SZ_DW(mpegc_reg)] = {0};
+	int err;
+
+	err = mlx5_devlink_query_mpegc(mdev, out, sizeof(out));
+	if (err)
+		return err;
+
+	*tx_lossy_overflow = MLX5_GET(mpegc_reg, out, tx_lossy_overflow_oper);
+
+	return 0;
+}
+
+static int mlx5_devlink_query_tx_overflow_sense(struct mlx5_core_dev *mdev,
+						u8 *tx_overflow_sense)
+{
+	u32 out[MLX5_ST_SZ_DW(mpegc_reg)] = {0};
+	int err;
+
+	if (!MLX5_CAP_MCAM_FEATURE(mdev, dynamic_tx_overflow))
+		return -EOPNOTSUPP;
+
+	err = mlx5_devlink_query_mpegc(mdev, out, sizeof(out));
+	if (err)
+		return err;
+
+	*tx_overflow_sense = MLX5_GET(mpegc_reg, out, tx_overflow_sense);
+
+	return 0;
+}
+
 int mlx5_devlink_register(struct devlink *devlink, struct device *dev)
 {
 	return devlink_register(devlink, dev);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/devlink.h b/drivers/net/ethernet/mellanox/mlx5/core/devlink.h
index eeb4fabba6ec..3a21a225cd27 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/devlink.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/devlink.h
@@ -33,6 +33,7 @@
 #define __MLX5_DEVLINK_H__
 
 #include <net/devlink.h>
+#include <linux/mlx5/driver.h>
 
 int mlx5_devlink_register(struct devlink *devlink, struct device *dev);
 void mlx5_devlink_unregister(struct devlink *devlink);
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink
  2018-07-19  1:00 [pull request][net-next 00/16] Mellanox, mlx5e updates 2018-07-18 Saeed Mahameed
                   ` (8 preceding siblings ...)
  2018-07-19  1:01 ` [net-next 09/16] net/mlx5: Add MPEGC register configuration functionality Saeed Mahameed
@ 2018-07-19  1:01 ` Saeed Mahameed
  2018-07-19  1:49   ` Jakub Kicinski
  2018-07-19  8:24   ` Jiri Pirko
  2018-07-19  1:01 ` [net-next 11/16] net/mlx5e: Set ECN for received packets using CQE indication Saeed Mahameed
                   ` (6 subsequent siblings)
  16 siblings, 2 replies; 38+ messages in thread
From: Saeed Mahameed @ 2018-07-19  1:01 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Eran Ben Elisha, Saeed Mahameed

From: Eran Ben Elisha <eranbe@mellanox.com>

Add support for two driver parameters via devlink params interface:
- Congestion action
	A HW mechanism in the PCIe buffer monitors the amount of
	consumed PCIe buffer per host. This mechanism supports the
	following actions in case of threshold overflow:
	- Disabled - NOP (Default)
	- Drop
	- Mark - Mark CE bit in the CQE of received packet
- Congestion mode
	- Aggressive - Aggressive static trigger threshold (Default)
	- Dynamic - Dynamically change the trigger threshold
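
Assuming an iproute2 build with devlink driver-specific runtime-param
support, the two knobs could then be exercised from userspace roughly as
follows (the PCI address is illustrative; 2 = mark and 1 = dynamic per
the enums introduced in the previous patch):

```shell
# Mark CE on PCIe buffer congestion (congestion_action 2 = mark)
devlink dev param set pci/0000:05:00.0 name congestion_action \
        value 2 cmode runtime
# Let the trigger threshold adjust dynamically (congestion_mode 1)
devlink dev param set pci/0000:05:00.0 name congestion_mode \
        value 1 cmode runtime
# Read back the operational value
devlink dev param show pci/0000:05:00.0 name congestion_action
```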

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 .../net/ethernet/mellanox/mlx5/core/devlink.c | 105 +++++++++++++++++-
 1 file changed, 104 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
index 9800c98b01d3..1f04decef043 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
@@ -153,12 +153,115 @@ static int mlx5_devlink_query_tx_overflow_sense(struct mlx5_core_dev *mdev,
 	return 0;
 }
 
+static int mlx5_devlink_set_congestion_action(struct devlink *devlink, u32 id,
+					      struct devlink_param_gset_ctx *ctx)
+{
+	struct mlx5_core_dev *dev = devlink_priv(devlink);
+	u8 max = MLX5_DEVLINK_CONGESTION_ACTION_MAX;
+	u8 sense;
+	int err;
+
+	if (!MLX5_CAP_MCAM_FEATURE(dev, mark_tx_action_cqe) &&
+	    !MLX5_CAP_MCAM_FEATURE(dev, mark_tx_action_cnp))
+		max = MLX5_DEVLINK_CONGESTION_ACTION_MARK - 1;
+
+	if (ctx->val.vu8 > max)
+		return -ERANGE;
+
+	err = mlx5_devlink_query_tx_overflow_sense(dev, &sense);
+	if (err)
+		return err;
+
+	if (ctx->val.vu8 == MLX5_DEVLINK_CONGESTION_ACTION_DISABLED &&
+	    sense != MLX5_DEVLINK_CONGESTION_MODE_AGGRESSIVE)
+		return -EINVAL;
+
+	return mlx5_devlink_set_tx_lossy_overflow(dev, ctx->val.vu8);
+}
+
+static int mlx5_devlink_get_congestion_action(struct devlink *devlink, u32 id,
+					      struct devlink_param_gset_ctx *ctx)
+{
+	struct mlx5_core_dev *dev = devlink_priv(devlink);
+
+	return mlx5_devlink_query_tx_lossy_overflow(dev, &ctx->val.vu8);
+}
+
+static int mlx5_devlink_set_congestion_mode(struct devlink *devlink, u32 id,
+					    struct devlink_param_gset_ctx *ctx)
+{
+	struct mlx5_core_dev *dev = devlink_priv(devlink);
+	u8 tx_lossy_overflow;
+	int err;
+
+	if (ctx->val.vu8 > MLX5_DEVLINK_CONGESTION_MODE_MAX)
+		return -ERANGE;
+
+	err = mlx5_devlink_query_tx_lossy_overflow(dev, &tx_lossy_overflow);
+	if (err)
+		return err;
+
+	if (ctx->val.vu8 != MLX5_DEVLINK_CONGESTION_MODE_AGGRESSIVE &&
+	    tx_lossy_overflow == MLX5_DEVLINK_CONGESTION_ACTION_DISABLED)
+		return -EINVAL;
+
+	return mlx5_devlink_set_tx_overflow_sense(dev, ctx->val.vu8);
+}
+
+static int mlx5_devlink_get_congestion_mode(struct devlink *devlink, u32 id,
+					    struct devlink_param_gset_ctx *ctx)
+{
+	struct mlx5_core_dev *dev = devlink_priv(devlink);
+
+	return mlx5_devlink_query_tx_overflow_sense(dev, &ctx->val.vu8);
+}
+
+enum mlx5_devlink_param_id {
+	MLX5_DEVLINK_PARAM_ID_BASE = DEVLINK_PARAM_GENERIC_ID_MAX,
+	MLX5_DEVLINK_PARAM_ID_CONGESTION_ACTION,
+	MLX5_DEVLINK_PARAM_ID_CONGESTION_MODE,
+};
+
+static const struct devlink_param mlx5_devlink_params[] = {
+	DEVLINK_PARAM_DRIVER(MLX5_DEVLINK_PARAM_ID_CONGESTION_ACTION,
+			     "congestion_action",
+			     DEVLINK_PARAM_TYPE_U8,
+			     BIT(DEVLINK_PARAM_CMODE_RUNTIME),
+			     mlx5_devlink_get_congestion_action,
+			     mlx5_devlink_set_congestion_action, NULL),
+	DEVLINK_PARAM_DRIVER(MLX5_DEVLINK_PARAM_ID_CONGESTION_MODE,
+			     "congestion_mode",
+			     DEVLINK_PARAM_TYPE_U8,
+			     BIT(DEVLINK_PARAM_CMODE_RUNTIME),
+			     mlx5_devlink_get_congestion_mode,
+			     mlx5_devlink_set_congestion_mode, NULL),
+};
+
 int mlx5_devlink_register(struct devlink *devlink, struct device *dev)
 {
-	return devlink_register(devlink, dev);
+	int err;
+
+	err = devlink_register(devlink, dev);
+	if (err)
+		return err;
+
+	err = devlink_params_register(devlink, mlx5_devlink_params,
+				      ARRAY_SIZE(mlx5_devlink_params));
+	if (err) {
+		dev_err(dev, "devlink_params_register failed, err = %d\n", err);
+		goto unregister;
+	}
+
+	return 0;
+
+unregister:
+	devlink_unregister(devlink);
+	return err;
 }
 
 void mlx5_devlink_unregister(struct devlink *devlink)
 {
+	devlink_params_unregister(devlink, mlx5_devlink_params,
+				  ARRAY_SIZE(mlx5_devlink_params));
 	devlink_unregister(devlink);
 }
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [net-next 11/16] net/mlx5e: Set ECN for received packets using CQE indication
  2018-07-19  1:00 [pull request][net-next 00/16] Mellanox, mlx5e updates 2018-07-18 Saeed Mahameed
                   ` (9 preceding siblings ...)
  2018-07-19  1:01 ` [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink Saeed Mahameed
@ 2018-07-19  1:01 ` Saeed Mahameed
  2018-07-19  1:01 ` [net-next 12/16] net/mlx5e: Remove redundant WARN when we cannot find neigh entry Saeed Mahameed
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Saeed Mahameed @ 2018-07-19  1:01 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Natali Shechtman, Saeed Mahameed

From: Natali Shechtman <natali@mellanox.com>

In a multi-host (MH) NIC scheme, a single HW port serves multiple
hosts, or multiple sockets on the same host.
The HW uses a mechanism in the PCIe buffer which monitors
the amount of consumed PCIe buffer per host.
In a certain configuration, under congestion,
the HW emulates a switch and performs ECN marking on packets,
using an ECN indication on the completion descriptor (CQE).

The driver needs to set the ECN bits on the packet's SKB
so that the network stack can react to them; this commit does that.
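
The SKB-marking step follows the RFC 3168 CE-marking rules implemented
by the kernel's IP_ECN_set_ce()/IP6_ECN_set_ce() helpers. A minimal
model of that decision on the IPv4 TOS byte (illustrative only; it
ignores the header-checksum update the kernel also performs) could look
like this:

```python
# RFC 3168 ECN codepoints: the two least-significant bits of the TOS byte
INET_ECN_NOT_ECT = 0x0
INET_ECN_ECT_1 = 0x1
INET_ECN_ECT_0 = 0x2
INET_ECN_CE = 0x3
INET_ECN_MASK = 0x3

def ecn_set_ce(tos):
    """Model of IP_ECN_set_ce(): mark CE only on ECN-capable packets.

    Returns (new_tos, marked); `marked` mirrors the rc value the driver
    accumulates into the rx_ecn_mark counter.
    """
    if tos & INET_ECN_MASK == INET_ECN_NOT_ECT:
        return tos, False            # Not-ECT: leave the packet untouched
    return tos | INET_ECN_CE, True   # ECT(0)/ECT(1)/CE -> CE

assert ecn_set_ce(0x00) == (0x00, False)  # Not-ECT is never marked
assert ecn_set_ce(0x02) == (0x03, True)   # ECT(0) -> CE
assert ecn_set_ce(0x03) == (0x03, True)   # already-CE stays CE, counted
```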

Signed-off-by: Natali Shechtman <natali@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 35 ++++++++++++++++---
 .../ethernet/mellanox/mlx5/core/en_stats.c    |  3 ++
 .../ethernet/mellanox/mlx5/core/en_stats.h    |  2 ++
 3 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 1d5295ee863c..e684869484e2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -38,6 +38,7 @@
 #include <net/busy_poll.h>
 #include <net/ip6_checksum.h>
 #include <net/page_pool.h>
+#include <net/inet_ecn.h>
 #include "en.h"
 #include "en_tc.h"
 #include "eswitch.h"
@@ -688,12 +689,29 @@ static inline void mlx5e_skb_set_hash(struct mlx5_cqe64 *cqe,
 	skb_set_hash(skb, be32_to_cpu(cqe->rss_hash_result), ht);
 }
 
-static inline bool is_last_ethertype_ip(struct sk_buff *skb, int *network_depth)
+static inline bool is_last_ethertype_ip(struct sk_buff *skb, int *network_depth,
+					__be16 *proto)
 {
-	__be16 ethertype = ((struct ethhdr *)skb->data)->h_proto;
+	*proto = ((struct ethhdr *)skb->data)->h_proto;
+	*proto = __vlan_get_protocol(skb, *proto, network_depth);
+	return (*proto == htons(ETH_P_IP) || *proto == htons(ETH_P_IPV6));
+}
+
+static inline void mlx5e_enable_ecn(struct mlx5e_rq *rq, struct sk_buff *skb)
+{
+	int network_depth = 0;
+	__be16 proto;
+	void *ip;
+	int rc;
 
-	ethertype = __vlan_get_protocol(skb, ethertype, network_depth);
-	return (ethertype == htons(ETH_P_IP) || ethertype == htons(ETH_P_IPV6));
+	if (unlikely(!is_last_ethertype_ip(skb, &network_depth, &proto)))
+		return;
+
+	ip = skb->data + network_depth;
+	rc = ((proto == htons(ETH_P_IP)) ? IP_ECN_set_ce((struct iphdr *)ip) :
+					 IP6_ECN_set_ce(skb, (struct ipv6hdr *)ip));
+
+	rq->stats->ecn_mark += !!rc;
 }
 
 static __be32 mlx5e_get_fcs(struct sk_buff *skb)
@@ -743,6 +761,7 @@ static inline void mlx5e_handle_csum(struct net_device *netdev,
 {
 	struct mlx5e_rq_stats *stats = rq->stats;
 	int network_depth = 0;
+	__be16 proto;
 
 	if (unlikely(!(netdev->features & NETIF_F_RXCSUM)))
 		goto csum_none;
@@ -753,7 +772,7 @@ static inline void mlx5e_handle_csum(struct net_device *netdev,
 		return;
 	}
 
-	if (likely(is_last_ethertype_ip(skb, &network_depth))) {
+	if (likely(is_last_ethertype_ip(skb, &network_depth, &proto))) {
 		skb->ip_summed = CHECKSUM_COMPLETE;
 		skb->csum = csum_unfold((__force __sum16)cqe->check_sum);
 		if (network_depth > ETH_HLEN)
@@ -788,6 +807,8 @@ static inline void mlx5e_handle_csum(struct net_device *netdev,
 	stats->csum_none++;
 }
 
+#define MLX5E_CE_BIT_MASK 0x80
+
 static inline void mlx5e_build_rx_skb(struct mlx5_cqe64 *cqe,
 				      u32 cqe_bcnt,
 				      struct mlx5e_rq *rq,
@@ -832,6 +853,10 @@ static inline void mlx5e_build_rx_skb(struct mlx5_cqe64 *cqe,
 	skb->mark = be32_to_cpu(cqe->sop_drop_qpn) & MLX5E_TC_FLOW_ID_MASK;
 
 	mlx5e_handle_csum(netdev, cqe, rq, skb, !!lro_num_seg);
+	/* checking CE bit in cqe - MSB in ml_path field */
+	if (unlikely(cqe->ml_path & MLX5E_CE_BIT_MASK))
+		mlx5e_enable_ecn(rq, skb);
+
 	skb->protocol = eth_type_trans(skb, netdev);
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
index c0507fada0be..4ed7d1fbbd1b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
@@ -53,6 +53,7 @@ static const struct counter_desc sw_stats_desc[] = {
 
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_lro_packets) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_lro_bytes) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_ecn_mark) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_removed_vlan_packets) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_csum_unnecessary) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_csum_none) },
@@ -136,6 +137,7 @@ void mlx5e_grp_sw_update_stats(struct mlx5e_priv *priv)
 		s->rx_bytes	+= rq_stats->bytes;
 		s->rx_lro_packets += rq_stats->lro_packets;
 		s->rx_lro_bytes	+= rq_stats->lro_bytes;
+		s->rx_ecn_mark	+= rq_stats->ecn_mark;
 		s->rx_removed_vlan_packets += rq_stats->removed_vlan_packets;
 		s->rx_csum_none	+= rq_stats->csum_none;
 		s->rx_csum_complete += rq_stats->csum_complete;
@@ -1131,6 +1133,7 @@ static const struct counter_desc rq_stats_desc[] = {
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, xdp_tx_full) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, lro_packets) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, lro_bytes) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, ecn_mark) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, removed_vlan_packets) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, wqe_err) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, mpwqe_filler_cqes) },
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
index fc3f66003edd..43e84cdb37b3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
@@ -64,6 +64,7 @@ struct mlx5e_sw_stats {
 	u64 tx_nop;
 	u64 rx_lro_packets;
 	u64 rx_lro_bytes;
+	u64 rx_ecn_mark;
 	u64 rx_removed_vlan_packets;
 	u64 rx_csum_unnecessary;
 	u64 rx_csum_none;
@@ -176,6 +177,7 @@ struct mlx5e_rq_stats {
 	u64 csum_none;
 	u64 lro_packets;
 	u64 lro_bytes;
+	u64 ecn_mark;
 	u64 removed_vlan_packets;
 	u64 xdp_drop;
 	u64 xdp_tx;
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [net-next 12/16] net/mlx5e: Remove redundant WARN when we cannot find neigh entry
  2018-07-19  1:00 [pull request][net-next 00/16] Mellanox, mlx5e updates 2018-07-18 Saeed Mahameed
                   ` (10 preceding siblings ...)
  2018-07-19  1:01 ` [net-next 11/16] net/mlx5e: Set ECN for received packets using CQE indication Saeed Mahameed
@ 2018-07-19  1:01 ` Saeed Mahameed
  2018-07-19  1:01 ` [net-next 13/16] net/mlx5e: Support offloading tc double vlan headers match Saeed Mahameed
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Saeed Mahameed @ 2018-07-19  1:01 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Roi Dayan, Saeed Mahameed

From: Roi Dayan <roid@mellanox.com>

It is possible for a neigh entry not to exist if it was already cleaned up.
When we bring down an interface, the neigh entry gets deleted, but it could
be that our listener for the neigh event that clears the encap valid bit
has not run yet, while the neigh update-last-used work has already started.
In this scenario the encap entry has its valid bit set but the neigh entry
no longer exists.

Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 0edf4751a8ba..335a08bc381d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -1032,10 +1032,8 @@ void mlx5e_tc_update_neigh_used_value(struct mlx5e_neigh_hash_entry *nhe)
 		 * dst ip pair
 		 */
 		n = neigh_lookup(tbl, &m_neigh->dst_ip, m_neigh->dev);
-		if (!n) {
-			WARN(1, "The neighbour already freed\n");
+		if (!n)
 			return;
-		}
 
 		neigh_event_send(n, NULL);
 		neigh_release(n);
-- 
2.17.0


* [net-next 13/16] net/mlx5e: Support offloading tc double vlan headers match
  2018-07-19  1:00 [pull request][net-next 00/16] Mellanox, mlx5e updates 2018-07-18 Saeed Mahameed
                   ` (11 preceding siblings ...)
  2018-07-19  1:01 ` [net-next 12/16] net/mlx5e: Remove redundant WARN when we cannot find neigh entry Saeed Mahameed
@ 2018-07-19  1:01 ` Saeed Mahameed
  2018-07-19  1:01 ` [net-next 14/16] net/mlx5e: Refactor tc vlan push/pop actions offloading Saeed Mahameed
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Saeed Mahameed @ 2018-07-19  1:01 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Jianbo Liu, Saeed Mahameed

From: Jianbo Liu <jianbol@mellanox.com>

We can match on both the outer and inner vlan tags; add support for
offloading such matches.

Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   | 55 ++++++++++++++++++-
 1 file changed, 52 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 335a08bc381d..dcb8c4993811 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -1235,6 +1235,10 @@ static int __parse_cls_flower(struct mlx5e_priv *priv,
 				       outer_headers);
 	void *headers_v = MLX5_ADDR_OF(fte_match_param, spec->match_value,
 				       outer_headers);
+	void *misc_c = MLX5_ADDR_OF(fte_match_param, spec->match_criteria,
+				    misc_parameters);
+	void *misc_v = MLX5_ADDR_OF(fte_match_param, spec->match_value,
+				    misc_parameters);
 	u16 addr_type = 0;
 	u8 ip_proto = 0;
 
@@ -1245,6 +1249,7 @@ static int __parse_cls_flower(struct mlx5e_priv *priv,
 	      BIT(FLOW_DISSECTOR_KEY_BASIC) |
 	      BIT(FLOW_DISSECTOR_KEY_ETH_ADDRS) |
 	      BIT(FLOW_DISSECTOR_KEY_VLAN) |
+	      BIT(FLOW_DISSECTOR_KEY_CVLAN) |
 	      BIT(FLOW_DISSECTOR_KEY_IPV4_ADDRS) |
 	      BIT(FLOW_DISSECTOR_KEY_IPV6_ADDRS) |
 	      BIT(FLOW_DISSECTOR_KEY_PORTS) |
@@ -1325,9 +1330,18 @@ static int __parse_cls_flower(struct mlx5e_priv *priv,
 			skb_flow_dissector_target(f->dissector,
 						  FLOW_DISSECTOR_KEY_VLAN,
 						  f->mask);
-		if (mask->vlan_id || mask->vlan_priority) {
-			MLX5_SET(fte_match_set_lyr_2_4, headers_c, cvlan_tag, 1);
-			MLX5_SET(fte_match_set_lyr_2_4, headers_v, cvlan_tag, 1);
+		if (mask->vlan_id || mask->vlan_priority || mask->vlan_tpid) {
+			if (key->vlan_tpid == htons(ETH_P_8021AD)) {
+				MLX5_SET(fte_match_set_lyr_2_4, headers_c,
+					 svlan_tag, 1);
+				MLX5_SET(fte_match_set_lyr_2_4, headers_v,
+					 svlan_tag, 1);
+			} else {
+				MLX5_SET(fte_match_set_lyr_2_4, headers_c,
+					 cvlan_tag, 1);
+				MLX5_SET(fte_match_set_lyr_2_4, headers_v,
+					 cvlan_tag, 1);
+			}
 
 			MLX5_SET(fte_match_set_lyr_2_4, headers_c, first_vid, mask->vlan_id);
 			MLX5_SET(fte_match_set_lyr_2_4, headers_v, first_vid, key->vlan_id);
@@ -1339,6 +1353,41 @@ static int __parse_cls_flower(struct mlx5e_priv *priv,
 		}
 	}
 
+	if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_CVLAN)) {
+		struct flow_dissector_key_vlan *key =
+			skb_flow_dissector_target(f->dissector,
+						  FLOW_DISSECTOR_KEY_CVLAN,
+						  f->key);
+		struct flow_dissector_key_vlan *mask =
+			skb_flow_dissector_target(f->dissector,
+						  FLOW_DISSECTOR_KEY_CVLAN,
+						  f->mask);
+		if (mask->vlan_id || mask->vlan_priority || mask->vlan_tpid) {
+			if (key->vlan_tpid == htons(ETH_P_8021AD)) {
+				MLX5_SET(fte_match_set_misc, misc_c,
+					 outer_second_svlan_tag, 1);
+				MLX5_SET(fte_match_set_misc, misc_v,
+					 outer_second_svlan_tag, 1);
+			} else {
+				MLX5_SET(fte_match_set_misc, misc_c,
+					 outer_second_cvlan_tag, 1);
+				MLX5_SET(fte_match_set_misc, misc_v,
+					 outer_second_cvlan_tag, 1);
+			}
+
+			MLX5_SET(fte_match_set_misc, misc_c, outer_second_vid,
+				 mask->vlan_id);
+			MLX5_SET(fte_match_set_misc, misc_v, outer_second_vid,
+				 key->vlan_id);
+			MLX5_SET(fte_match_set_misc, misc_c, outer_second_prio,
+				 mask->vlan_priority);
+			MLX5_SET(fte_match_set_misc, misc_v, outer_second_prio,
+				 key->vlan_priority);
+
+			*match_level = MLX5_MATCH_L2;
+		}
+	}
+
 	if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_BASIC)) {
 		struct flow_dissector_key_basic *key =
 			skb_flow_dissector_target(f->dissector,
-- 
2.17.0


* [net-next 14/16] net/mlx5e: Refactor tc vlan push/pop actions offloading
  2018-07-19  1:00 [pull request][net-next 00/16] Mellanox, mlx5e updates 2018-07-18 Saeed Mahameed
                   ` (12 preceding siblings ...)
  2018-07-19  1:01 ` [net-next 13/16] net/mlx5e: Support offloading tc double vlan headers match Saeed Mahameed
@ 2018-07-19  1:01 ` Saeed Mahameed
  2018-07-19  1:01 ` [net-next 15/16] net/mlx5e: Support offloading double vlan push/pop tc actions Saeed Mahameed
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Saeed Mahameed @ 2018-07-19  1:01 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Jianbo Liu, Saeed Mahameed

From: Jianbo Liu <jianbol@mellanox.com>

Extract the actions offloading code into a new function, and extend the
data structures to hold double vlan actions.

Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   | 51 ++++++++++++-------
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |  6 +--
 .../mellanox/mlx5/core/eswitch_offloads.c     | 12 ++---
 3 files changed, 41 insertions(+), 28 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index dcb8c4993811..35b3e135ae1d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -2578,6 +2578,32 @@ static int mlx5e_attach_encap(struct mlx5e_priv *priv,
 	return err;
 }
 
+static int parse_tc_vlan_action(struct mlx5e_priv *priv,
+				const struct tc_action *a,
+				struct mlx5_esw_flow_attr *attr,
+				u32 *action)
+{
+	if (tcf_vlan_action(a) == TCA_VLAN_ACT_POP) {
+		*action |= MLX5_FLOW_CONTEXT_ACTION_VLAN_POP;
+	} else if (tcf_vlan_action(a) == TCA_VLAN_ACT_PUSH) {
+		*action |= MLX5_FLOW_CONTEXT_ACTION_VLAN_PUSH;
+		attr->vlan_vid[0] = tcf_vlan_push_vid(a);
+		if (mlx5_eswitch_vlan_actions_supported(priv->mdev)) {
+			attr->vlan_prio[0] = tcf_vlan_push_prio(a);
+			attr->vlan_proto[0] = tcf_vlan_push_proto(a);
+			if (!attr->vlan_proto[0])
+				attr->vlan_proto[0] = htons(ETH_P_8021Q);
+		} else if (tcf_vlan_push_proto(a) != htons(ETH_P_8021Q) ||
+			   tcf_vlan_push_prio(a)) {
+			return -EOPNOTSUPP;
+		}
+	} else { /* action is TCA_VLAN_ACT_MODIFY */
+		return -EOPNOTSUPP;
+	}
+
+	return 0;
+}
+
 static int parse_tc_fdb_actions(struct mlx5e_priv *priv, struct tcf_exts *exts,
 				struct mlx5e_tc_flow_parse_attr *parse_attr,
 				struct mlx5e_tc_flow *flow)
@@ -2589,6 +2615,7 @@ static int parse_tc_fdb_actions(struct mlx5e_priv *priv, struct tcf_exts *exts,
 	LIST_HEAD(actions);
 	bool encap = false;
 	u32 action = 0;
+	int err;
 
 	if (!tcf_exts_has_actions(exts))
 		return -EINVAL;
@@ -2605,8 +2632,6 @@ static int parse_tc_fdb_actions(struct mlx5e_priv *priv, struct tcf_exts *exts,
 		}
 
 		if (is_tcf_pedit(a)) {
-			int err;
-
 			err = parse_tc_pedit_action(priv, a, MLX5_FLOW_NAMESPACE_FDB,
 						    parse_attr);
 			if (err)
@@ -2673,23 +2698,11 @@ static int parse_tc_fdb_actions(struct mlx5e_priv *priv, struct tcf_exts *exts,
 		}
 
 		if (is_tcf_vlan(a)) {
-			if (tcf_vlan_action(a) == TCA_VLAN_ACT_POP) {
-				action |= MLX5_FLOW_CONTEXT_ACTION_VLAN_POP;
-			} else if (tcf_vlan_action(a) == TCA_VLAN_ACT_PUSH) {
-				action |= MLX5_FLOW_CONTEXT_ACTION_VLAN_PUSH;
-				attr->vlan_vid = tcf_vlan_push_vid(a);
-				if (mlx5_eswitch_vlan_actions_supported(priv->mdev)) {
-					attr->vlan_prio = tcf_vlan_push_prio(a);
-					attr->vlan_proto = tcf_vlan_push_proto(a);
-					if (!attr->vlan_proto)
-						attr->vlan_proto = htons(ETH_P_8021Q);
-				} else if (tcf_vlan_push_proto(a) != htons(ETH_P_8021Q) ||
-					   tcf_vlan_push_prio(a)) {
-					return -EOPNOTSUPP;
-				}
-			} else { /* action is TCA_VLAN_ACT_MODIFY */
-				return -EOPNOTSUPP;
-			}
+			err = parse_tc_vlan_action(priv, a, attr, &action);
+
+			if (err)
+				return err;
+
 			attr->mirror_count = attr->out_count;
 			continue;
 		}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index b174da2884c5..befa0011efee 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -256,9 +256,9 @@ struct mlx5_esw_flow_attr {
 	int out_count;
 
 	int	action;
-	__be16	vlan_proto;
-	u16	vlan_vid;
-	u8	vlan_prio;
+	__be16	vlan_proto[1];
+	u16	vlan_vid[1];
+	u8	vlan_prio[1];
 	bool	vlan_handled;
 	u32	encap_id;
 	u32	mod_hdr_id;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index f32e69170b30..552954d7184e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -70,9 +70,9 @@ mlx5_eswitch_add_offloaded_rule(struct mlx5_eswitch *esw,
 		flow_act.action &= ~(MLX5_FLOW_CONTEXT_ACTION_VLAN_PUSH |
 				     MLX5_FLOW_CONTEXT_ACTION_VLAN_POP);
 	else if (flow_act.action & MLX5_FLOW_CONTEXT_ACTION_VLAN_PUSH) {
-		flow_act.vlan[0].ethtype = ntohs(attr->vlan_proto);
-		flow_act.vlan[0].vid = attr->vlan_vid;
-		flow_act.vlan[0].prio = attr->vlan_prio;
+		flow_act.vlan[0].ethtype = ntohs(attr->vlan_proto[0]);
+		flow_act.vlan[0].vid = attr->vlan_vid[0];
+		flow_act.vlan[0].prio = attr->vlan_prio[0];
 	}
 
 	if (flow_act.action & MLX5_FLOW_CONTEXT_ACTION_FWD_DEST) {
@@ -266,7 +266,7 @@ static int esw_add_vlan_action_check(struct mlx5_esw_flow_attr *attr,
 	/* protects against (1) setting rules with different vlans to push and
 	 * (2) setting rules w.o vlans (attr->vlan = 0) && w. vlans to push (!= 0)
 	 */
-	if (push && in_rep->vlan_refcount && (in_rep->vlan != attr->vlan_vid))
+	if (push && in_rep->vlan_refcount && (in_rep->vlan != attr->vlan_vid[0]))
 		goto out_notsupp;
 
 	return 0;
@@ -324,11 +324,11 @@ int mlx5_eswitch_add_vlan_action(struct mlx5_eswitch *esw,
 		if (vport->vlan_refcount)
 			goto skip_set_push;
 
-		err = __mlx5_eswitch_set_vport_vlan(esw, vport->vport, attr->vlan_vid, 0,
+		err = __mlx5_eswitch_set_vport_vlan(esw, vport->vport, attr->vlan_vid[0], 0,
 						    SET_VLAN_INSERT | SET_VLAN_STRIP);
 		if (err)
 			goto out;
-		vport->vlan = attr->vlan_vid;
+		vport->vlan = attr->vlan_vid[0];
 skip_set_push:
 		vport->vlan_refcount++;
 	}
-- 
2.17.0


* [net-next 15/16] net/mlx5e: Support offloading double vlan push/pop tc actions
  2018-07-19  1:00 [pull request][net-next 00/16] Mellanox, mlx5e updates 2018-07-18 Saeed Mahameed
                   ` (13 preceding siblings ...)
  2018-07-19  1:01 ` [net-next 14/16] net/mlx5e: Refactor tc vlan push/pop actions offloading Saeed Mahameed
@ 2018-07-19  1:01 ` Saeed Mahameed
  2018-07-19  1:01 ` [net-next 16/16] net/mlx5e: Use PARTIAL_GSO for UDP segmentation Saeed Mahameed
  2018-07-23 21:35 ` [pull request][net-next 00/16] Mellanox, mlx5e updates 2018-07-18 Saeed Mahameed
  16 siblings, 0 replies; 38+ messages in thread
From: Saeed Mahameed @ 2018-07-19  1:01 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Jianbo Liu, Saeed Mahameed

From: Jianbo Liu <jianbol@mellanox.com>

Since we can configure two vlan push/pop actions in one flow table entry,
add support for offloading such double vlan actions in a rule to the HW.

Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   | 46 ++++++++++++++-----
 .../net/ethernet/mellanox/mlx5/core/eswitch.h | 21 ++++++---
 .../mellanox/mlx5/core/eswitch_offloads.c     | 11 +++--
 3 files changed, 58 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 35b3e135ae1d..e9888d6c1f7c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -2583,24 +2583,48 @@ static int parse_tc_vlan_action(struct mlx5e_priv *priv,
 				struct mlx5_esw_flow_attr *attr,
 				u32 *action)
 {
+	u8 vlan_idx = attr->total_vlan;
+
+	if (vlan_idx >= MLX5_FS_VLAN_DEPTH)
+		return -EOPNOTSUPP;
+
 	if (tcf_vlan_action(a) == TCA_VLAN_ACT_POP) {
-		*action |= MLX5_FLOW_CONTEXT_ACTION_VLAN_POP;
+		if (vlan_idx) {
+			if (!mlx5_eswitch_vlan_actions_supported(priv->mdev,
+								 MLX5_FS_VLAN_DEPTH))
+				return -EOPNOTSUPP;
+
+			*action |= MLX5_FLOW_CONTEXT_ACTION_VLAN_POP_2;
+		} else {
+			*action |= MLX5_FLOW_CONTEXT_ACTION_VLAN_POP;
+		}
 	} else if (tcf_vlan_action(a) == TCA_VLAN_ACT_PUSH) {
-		*action |= MLX5_FLOW_CONTEXT_ACTION_VLAN_PUSH;
-		attr->vlan_vid[0] = tcf_vlan_push_vid(a);
-		if (mlx5_eswitch_vlan_actions_supported(priv->mdev)) {
-			attr->vlan_prio[0] = tcf_vlan_push_prio(a);
-			attr->vlan_proto[0] = tcf_vlan_push_proto(a);
-			if (!attr->vlan_proto[0])
-				attr->vlan_proto[0] = htons(ETH_P_8021Q);
-		} else if (tcf_vlan_push_proto(a) != htons(ETH_P_8021Q) ||
-			   tcf_vlan_push_prio(a)) {
-			return -EOPNOTSUPP;
+		attr->vlan_vid[vlan_idx] = tcf_vlan_push_vid(a);
+		attr->vlan_prio[vlan_idx] = tcf_vlan_push_prio(a);
+		attr->vlan_proto[vlan_idx] = tcf_vlan_push_proto(a);
+		if (!attr->vlan_proto[vlan_idx])
+			attr->vlan_proto[vlan_idx] = htons(ETH_P_8021Q);
+
+		if (vlan_idx) {
+			if (!mlx5_eswitch_vlan_actions_supported(priv->mdev,
+								 MLX5_FS_VLAN_DEPTH))
+				return -EOPNOTSUPP;
+
+			*action |= MLX5_FLOW_CONTEXT_ACTION_VLAN_PUSH_2;
+		} else {
+			if (!mlx5_eswitch_vlan_actions_supported(priv->mdev, 1) &&
+			    (tcf_vlan_push_proto(a) != htons(ETH_P_8021Q) ||
+			     tcf_vlan_push_prio(a)))
+				return -EOPNOTSUPP;
+
+			*action |= MLX5_FLOW_CONTEXT_ACTION_VLAN_PUSH;
 		}
 	} else { /* action is TCA_VLAN_ACT_MODIFY */
 		return -EOPNOTSUPP;
 	}
 
+	attr->total_vlan = vlan_idx + 1;
+
 	return 0;
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index befa0011efee..c17bfcab517c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -38,6 +38,7 @@
 #include <net/devlink.h>
 #include <linux/mlx5/device.h>
 #include <linux/mlx5/eswitch.h>
+#include <linux/mlx5/fs.h>
 #include "lib/mpfs.h"
 
 #ifdef CONFIG_MLX5_ESWITCH
@@ -256,9 +257,10 @@ struct mlx5_esw_flow_attr {
 	int out_count;
 
 	int	action;
-	__be16	vlan_proto[1];
-	u16	vlan_vid[1];
-	u8	vlan_prio[1];
+	__be16	vlan_proto[MLX5_FS_VLAN_DEPTH];
+	u16	vlan_vid[MLX5_FS_VLAN_DEPTH];
+	u8	vlan_prio[MLX5_FS_VLAN_DEPTH];
+	u8	total_vlan;
 	bool	vlan_handled;
 	u32	encap_id;
 	u32	mod_hdr_id;
@@ -282,10 +284,17 @@ int mlx5_eswitch_del_vlan_action(struct mlx5_eswitch *esw,
 int __mlx5_eswitch_set_vport_vlan(struct mlx5_eswitch *esw,
 				  int vport, u16 vlan, u8 qos, u8 set_flags);
 
-static inline bool mlx5_eswitch_vlan_actions_supported(struct mlx5_core_dev *dev)
+static inline bool mlx5_eswitch_vlan_actions_supported(struct mlx5_core_dev *dev,
+						       u8 vlan_depth)
 {
-	return MLX5_CAP_ESW_FLOWTABLE_FDB(dev, pop_vlan) &&
-	       MLX5_CAP_ESW_FLOWTABLE_FDB(dev, push_vlan);
+	bool ret = MLX5_CAP_ESW_FLOWTABLE_FDB(dev, pop_vlan) &&
+		   MLX5_CAP_ESW_FLOWTABLE_FDB(dev, push_vlan);
+
+	if (vlan_depth == 1)
+		return ret;
+
+	return  ret && MLX5_CAP_ESW_FLOWTABLE_FDB(dev, pop_vlan_2) &&
+		MLX5_CAP_ESW_FLOWTABLE_FDB(dev, push_vlan_2);
 }
 
 #define MLX5_DEBUG_ESWITCH_MASK BIT(3)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 552954d7184e..f72b5c9dcfe9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -66,13 +66,18 @@ mlx5_eswitch_add_offloaded_rule(struct mlx5_eswitch *esw,
 
 	flow_act.action = attr->action;
 	/* if per flow vlan pop/push is emulated, don't set that into the firmware */
-	if (!mlx5_eswitch_vlan_actions_supported(esw->dev))
+	if (!mlx5_eswitch_vlan_actions_supported(esw->dev, 1))
 		flow_act.action &= ~(MLX5_FLOW_CONTEXT_ACTION_VLAN_PUSH |
 				     MLX5_FLOW_CONTEXT_ACTION_VLAN_POP);
 	else if (flow_act.action & MLX5_FLOW_CONTEXT_ACTION_VLAN_PUSH) {
 		flow_act.vlan[0].ethtype = ntohs(attr->vlan_proto[0]);
 		flow_act.vlan[0].vid = attr->vlan_vid[0];
 		flow_act.vlan[0].prio = attr->vlan_prio[0];
+		if (flow_act.action & MLX5_FLOW_CONTEXT_ACTION_VLAN_PUSH_2) {
+			flow_act.vlan[1].ethtype = ntohs(attr->vlan_proto[1]);
+			flow_act.vlan[1].vid = attr->vlan_vid[1];
+			flow_act.vlan[1].prio = attr->vlan_prio[1];
+		}
 	}
 
 	if (flow_act.action & MLX5_FLOW_CONTEXT_ACTION_FWD_DEST) {
@@ -284,7 +289,7 @@ int mlx5_eswitch_add_vlan_action(struct mlx5_eswitch *esw,
 	int err = 0;
 
 	/* nop if we're on the vlan push/pop non emulation mode */
-	if (mlx5_eswitch_vlan_actions_supported(esw->dev))
+	if (mlx5_eswitch_vlan_actions_supported(esw->dev, 1))
 		return 0;
 
 	push = !!(attr->action & MLX5_FLOW_CONTEXT_ACTION_VLAN_PUSH);
@@ -347,7 +352,7 @@ int mlx5_eswitch_del_vlan_action(struct mlx5_eswitch *esw,
 	int err = 0;
 
 	/* nop if we're on the vlan push/pop non emulation mode */
-	if (mlx5_eswitch_vlan_actions_supported(esw->dev))
+	if (mlx5_eswitch_vlan_actions_supported(esw->dev, 1))
 		return 0;
 
 	if (!attr->vlan_handled)
-- 
2.17.0


* [net-next 16/16] net/mlx5e: Use PARTIAL_GSO for UDP segmentation
  2018-07-19  1:00 [pull request][net-next 00/16] Mellanox, mlx5e updates 2018-07-18 Saeed Mahameed
                   ` (14 preceding siblings ...)
  2018-07-19  1:01 ` [net-next 15/16] net/mlx5e: Support offloading double vlan push/pop tc actions Saeed Mahameed
@ 2018-07-19  1:01 ` Saeed Mahameed
  2018-07-23 21:35 ` [pull request][net-next 00/16] Mellanox, mlx5e updates 2018-07-18 Saeed Mahameed
  16 siblings, 0 replies; 38+ messages in thread
From: Saeed Mahameed @ 2018-07-19  1:01 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Boris Pismenny, Saeed Mahameed

From: Boris Pismenny <borisp@mellanox.com>

This patch removes the splitting of UDP_GSO_L4 packets in the driver,
and exposes UDP_GSO_L4 as a PARTIAL_GSO feature. Thus, the network stack
is now responsible for splitting the packet into two.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 .../net/ethernet/mellanox/mlx5/core/Makefile  |   4 +-
 .../mellanox/mlx5/core/en_accel/en_accel.h    |  27 +++--
 .../mellanox/mlx5/core/en_accel/rxtx.c        | 109 ------------------
 .../mellanox/mlx5/core/en_accel/rxtx.h        |  14 ---
 .../net/ethernet/mellanox/mlx5/core/en_main.c |   9 +-
 5 files changed, 23 insertions(+), 140 deletions(-)
 delete mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/rxtx.c
 delete mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/rxtx.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 83abd9130ffb..b18c1604789d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -14,8 +14,8 @@ mlx5_core-$(CONFIG_MLX5_FPGA) += fpga/cmd.o fpga/core.o fpga/conn.o fpga/sdk.o \
 		fpga/ipsec.o fpga/tls.o
 
 mlx5_core-$(CONFIG_MLX5_CORE_EN) += en_main.o en_common.o en_fs.o en_ethtool.o \
-		en_tx.o en_rx.o en_dim.o en_txrx.o en_accel/rxtx.o en_stats.o  \
-		vxlan.o en_arfs.o en_fs_ethtool.o en_selftest.o en/port.o
+		en_tx.o en_rx.o en_dim.o en_txrx.o en_stats.o vxlan.o          \
+		en_arfs.o en_fs_ethtool.o en_selftest.o en/port.o
 
 mlx5_core-$(CONFIG_MLX5_MPFS) += lib/mpfs.o
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
index 39a5d13ba459..1dd225380a66 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
@@ -38,14 +38,22 @@
 #include <linux/netdevice.h>
 #include "en_accel/ipsec_rxtx.h"
 #include "en_accel/tls_rxtx.h"
-#include "en_accel/rxtx.h"
 #include "en.h"
 
-static inline struct sk_buff *mlx5e_accel_handle_tx(struct sk_buff *skb,
-						    struct mlx5e_txqsq *sq,
-						    struct net_device *dev,
-						    struct mlx5e_tx_wqe **wqe,
-						    u16 *pi)
+static inline void
+mlx5e_udp_gso_handle_tx_skb(struct sk_buff *skb)
+{
+	int payload_len = skb_shinfo(skb)->gso_size + sizeof(struct udphdr);
+
+	udp_hdr(skb)->len = htons(payload_len);
+}
+
+static inline struct sk_buff *
+mlx5e_accel_handle_tx(struct sk_buff *skb,
+		      struct mlx5e_txqsq *sq,
+		      struct net_device *dev,
+		      struct mlx5e_tx_wqe **wqe,
+		      u16 *pi)
 {
 #ifdef CONFIG_MLX5_EN_TLS
 	if (test_bit(MLX5E_SQ_STATE_TLS, &sq->state)) {
@@ -63,11 +71,8 @@ static inline struct sk_buff *mlx5e_accel_handle_tx(struct sk_buff *skb,
 	}
 #endif
 
-	if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4) {
-		skb = mlx5e_udp_gso_handle_tx_skb(dev, sq, skb, wqe, pi);
-		if (unlikely(!skb))
-			return NULL;
-	}
+	if (skb_is_gso(skb) && skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4)
+		mlx5e_udp_gso_handle_tx_skb(skb);
 
 	return skb;
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/rxtx.c
deleted file mode 100644
index 7b7ec3998e84..000000000000
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/rxtx.c
+++ /dev/null
@@ -1,109 +0,0 @@
-#include "en_accel/rxtx.h"
-
-static void mlx5e_udp_gso_prepare_last_skb(struct sk_buff *skb,
-					   struct sk_buff *nskb,
-					   int remaining)
-{
-	int bytes_needed = remaining, remaining_headlen, remaining_page_offset;
-	int headlen = skb_transport_offset(skb) + sizeof(struct udphdr);
-	int payload_len = remaining + sizeof(struct udphdr);
-	int k = 0, i, j;
-
-	skb_copy_bits(skb, 0, nskb->data, headlen);
-	nskb->dev = skb->dev;
-	skb_reset_mac_header(nskb);
-	skb_set_network_header(nskb, skb_network_offset(skb));
-	skb_set_transport_header(nskb, skb_transport_offset(skb));
-	skb_set_tail_pointer(nskb, headlen);
-
-	/* How many frags do we need? */
-	for (i = skb_shinfo(skb)->nr_frags - 1; i >= 0; i--) {
-		bytes_needed -= skb_frag_size(&skb_shinfo(skb)->frags[i]);
-		k++;
-		if (bytes_needed <= 0)
-			break;
-	}
-
-	/* Fill the first frag and split it if necessary */
-	j = skb_shinfo(skb)->nr_frags - k;
-	remaining_page_offset = -bytes_needed;
-	skb_fill_page_desc(nskb, 0,
-			   skb_shinfo(skb)->frags[j].page.p,
-			   skb_shinfo(skb)->frags[j].page_offset + remaining_page_offset,
-			   skb_shinfo(skb)->frags[j].size - remaining_page_offset);
-
-	skb_frag_ref(skb, j);
-
-	/* Fill the rest of the frags */
-	for (i = 1; i < k; i++) {
-		j = skb_shinfo(skb)->nr_frags - k + i;
-
-		skb_fill_page_desc(nskb, i,
-				   skb_shinfo(skb)->frags[j].page.p,
-				   skb_shinfo(skb)->frags[j].page_offset,
-				   skb_shinfo(skb)->frags[j].size);
-		skb_frag_ref(skb, j);
-	}
-	skb_shinfo(nskb)->nr_frags = k;
-
-	remaining_headlen = remaining - skb->data_len;
-
-	/* headlen contains remaining data? */
-	if (remaining_headlen > 0)
-		skb_copy_bits(skb, skb->len - remaining, nskb->data + headlen,
-			      remaining_headlen);
-	nskb->len = remaining + headlen;
-	nskb->data_len =  payload_len - sizeof(struct udphdr) +
-		max_t(int, 0, remaining_headlen);
-	nskb->protocol = skb->protocol;
-	if (nskb->protocol == htons(ETH_P_IP)) {
-		ip_hdr(nskb)->id = htons(ntohs(ip_hdr(nskb)->id) +
-					 skb_shinfo(skb)->gso_segs);
-		ip_hdr(nskb)->tot_len =
-			htons(payload_len + sizeof(struct iphdr));
-	} else {
-		ipv6_hdr(nskb)->payload_len = htons(payload_len);
-	}
-	udp_hdr(nskb)->len = htons(payload_len);
-	skb_shinfo(nskb)->gso_size = 0;
-	nskb->ip_summed = skb->ip_summed;
-	nskb->csum_start = skb->csum_start;
-	nskb->csum_offset = skb->csum_offset;
-	nskb->queue_mapping = skb->queue_mapping;
-}
-
-/* might send skbs and update wqe and pi */
-struct sk_buff *mlx5e_udp_gso_handle_tx_skb(struct net_device *netdev,
-					    struct mlx5e_txqsq *sq,
-					    struct sk_buff *skb,
-					    struct mlx5e_tx_wqe **wqe,
-					    u16 *pi)
-{
-	int payload_len = skb_shinfo(skb)->gso_size + sizeof(struct udphdr);
-	int headlen = skb_transport_offset(skb) + sizeof(struct udphdr);
-	int remaining = (skb->len - headlen) % skb_shinfo(skb)->gso_size;
-	struct sk_buff *nskb;
-
-	if (skb->protocol == htons(ETH_P_IP))
-		ip_hdr(skb)->tot_len = htons(payload_len + sizeof(struct iphdr));
-	else
-		ipv6_hdr(skb)->payload_len = htons(payload_len);
-	udp_hdr(skb)->len = htons(payload_len);
-	if (!remaining)
-		return skb;
-
-	sq->stats->udp_seg_rem++;
-	nskb = alloc_skb(max_t(int, headlen, headlen + remaining - skb->data_len), GFP_ATOMIC);
-	if (unlikely(!nskb)) {
-		sq->stats->dropped++;
-		return NULL;
-	}
-
-	mlx5e_udp_gso_prepare_last_skb(skb, nskb, remaining);
-
-	skb_shinfo(skb)->gso_segs--;
-	pskb_trim(skb, skb->len - remaining);
-	mlx5e_sq_xmit(sq, skb, *wqe, *pi);
-	mlx5e_sq_fetch_wqe(sq, wqe, pi);
-	return nskb;
-}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/rxtx.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/rxtx.h
deleted file mode 100644
index ed42699a78b3..000000000000
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/rxtx.h
+++ /dev/null
@@ -1,14 +0,0 @@
-
-#ifndef __MLX5E_EN_ACCEL_RX_TX_H__
-#define __MLX5E_EN_ACCEL_RX_TX_H__
-
-#include <linux/skbuff.h>
-#include "en.h"
-
-struct sk_buff *mlx5e_udp_gso_handle_tx_skb(struct net_device *netdev,
-					    struct mlx5e_txqsq *sq,
-					    struct sk_buff *skb,
-					    struct mlx5e_tx_wqe **wqe,
-					    u16 *pi);
-
-#endif
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 712b9766485f..dccde18f6170 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -4538,7 +4538,6 @@ static void mlx5e_build_nic_netdev(struct net_device *netdev)
 	netdev->hw_features      |= NETIF_F_HW_VLAN_STAG_TX;
 
 	if (mlx5e_vxlan_allowed(mdev) || MLX5_CAP_ETH(mdev, tunnel_stateless_gre)) {
-		netdev->hw_features     |= NETIF_F_GSO_PARTIAL;
 		netdev->hw_enc_features |= NETIF_F_IP_CSUM;
 		netdev->hw_enc_features |= NETIF_F_IPV6_CSUM;
 		netdev->hw_enc_features |= NETIF_F_TSO;
@@ -4563,6 +4562,11 @@ static void mlx5e_build_nic_netdev(struct net_device *netdev)
 						NETIF_F_GSO_GRE_CSUM;
 	}
 
+	netdev->hw_features	                 |= NETIF_F_GSO_PARTIAL;
+	netdev->gso_partial_features             |= NETIF_F_GSO_UDP_L4;
+	netdev->hw_features                      |= NETIF_F_GSO_UDP_L4;
+	netdev->features                         |= NETIF_F_GSO_UDP_L4;
+
 	mlx5_query_port_fcs(mdev, &fcs_supported, &fcs_enabled);
 
 	if (fcs_supported)
@@ -4595,9 +4599,6 @@ static void mlx5e_build_nic_netdev(struct net_device *netdev)
 	netdev->features         |= NETIF_F_HIGHDMA;
 	netdev->features         |= NETIF_F_HW_VLAN_STAG_FILTER;
 
-	netdev->features         |= NETIF_F_GSO_UDP_L4;
-	netdev->hw_features      |= NETIF_F_GSO_UDP_L4;
-
 	netdev->priv_flags       |= IFF_UNICAST_FLT;
 
 	mlx5e_set_netdev_dev_addr(netdev);
-- 
2.17.0


* Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink
  2018-07-19  1:01 ` [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink Saeed Mahameed
@ 2018-07-19  1:49   ` Jakub Kicinski
  2018-07-24 10:31     ` Eran Ben Elisha
  2018-07-19  8:24   ` Jiri Pirko
  1 sibling, 1 reply; 38+ messages in thread
From: Jakub Kicinski @ 2018-07-19  1:49 UTC (permalink / raw)
  To: Saeed Mahameed, Jiri Pirko; +Cc: David S. Miller, netdev, Eran Ben Elisha

On Wed, 18 Jul 2018 18:01:01 -0700, Saeed Mahameed wrote:
> +static const struct devlink_param mlx5_devlink_params[] = {
> +	DEVLINK_PARAM_DRIVER(MLX5_DEVLINK_PARAM_ID_CONGESTION_ACTION,
> +			     "congestion_action",
> +			     DEVLINK_PARAM_TYPE_U8,
> +			     BIT(DEVLINK_PARAM_CMODE_RUNTIME),
> +			     mlx5_devlink_get_congestion_action,
> +			     mlx5_devlink_set_congestion_action, NULL),
> +	DEVLINK_PARAM_DRIVER(MLX5_DEVLINK_PARAM_ID_CONGESTION_MODE,
> +			     "congestion_mode",
> +			     DEVLINK_PARAM_TYPE_U8,
> +			     BIT(DEVLINK_PARAM_CMODE_RUNTIME),
> +			     mlx5_devlink_get_congestion_mode,
> +			     mlx5_devlink_set_congestion_mode, NULL),
> +};

The devlink params haven't been upstream even for a full cycle and
already you guys are starting to use them to configure standard
features like queuing.  

I know your HW is not capable of doing full RED offload, it's a
snowflake.  You tell us you're doing custom DCB configuration hacks on
one side (previous argument we had) and custom devlink parameter
configuration hacks on PCIe.

Perhaps the idea that we're trying to use the existing Linux APIs for
HW configuration only applies to forwarding behaviour.


* Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink
  2018-07-19  1:01 ` [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink Saeed Mahameed
  2018-07-19  1:49   ` Jakub Kicinski
@ 2018-07-19  8:24   ` Jiri Pirko
  2018-07-19  8:49     ` Eran Ben Elisha
  1 sibling, 1 reply; 38+ messages in thread
From: Jiri Pirko @ 2018-07-19  8:24 UTC (permalink / raw)
  To: Saeed Mahameed; +Cc: David S. Miller, netdev, Eran Ben Elisha

Thu, Jul 19, 2018 at 03:01:01AM CEST, saeedm@mellanox.com wrote:
>From: Eran Ben Elisha <eranbe@mellanox.com>
>
>Add support for two driver parameters via devlink params interface:
>- Congestion action
>	HW mechanism in the PCIe buffer which monitors the amount of
>	consumed PCIe buffer per host.  This mechanism supports the
>	following actions in case of threshold overflow:
>	- Disabled - NOP (Default)
>	- Drop
>	- Mark - Mark CE bit in the CQE of received packet
>- Congestion mode
>	- Aggressive - Aggressive static trigger threshold (Default)
>	- Dynamic - Dynamically change the trigger threshold
>
>Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
>Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
>Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
>---
> .../net/ethernet/mellanox/mlx5/core/devlink.c | 105 +++++++++++++++++-
> 1 file changed, 104 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
>index 9800c98b01d3..1f04decef043 100644
>--- a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
>+++ b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
>@@ -153,12 +153,115 @@ static int mlx5_devlink_query_tx_overflow_sense(struct mlx5_core_dev *mdev,
> 	return 0;
> }
> 
>+static int mlx5_devlink_set_congestion_action(struct devlink *devlink, u32 id,
>+					      struct devlink_param_gset_ctx *ctx)
>+{
>+	struct mlx5_core_dev *dev = devlink_priv(devlink);
>+	u8 max = MLX5_DEVLINK_CONGESTION_ACTION_MAX;
>+	u8 sense;
>+	int err;
>+
>+	if (!MLX5_CAP_MCAM_FEATURE(dev, mark_tx_action_cqe) &&
>+	    !MLX5_CAP_MCAM_FEATURE(dev, mark_tx_action_cnp))
>+		max = MLX5_DEVLINK_CONGESTION_ACTION_MARK - 1;
>+
>+	if (ctx->val.vu8 > max)

This should not be num. It should be a string. Same for "mode".


>+		return -ERANGE;
>+
>+	err = mlx5_devlink_query_tx_overflow_sense(dev, &sense);
>+	if (err)
>+		return err;
>+
>+	if (ctx->val.vu8 == MLX5_DEVLINK_CONGESTION_ACTION_DISABLED &&
>+	    sense != MLX5_DEVLINK_CONGESTION_MODE_AGGRESSIVE)
>+		return -EINVAL;
>+
>+	return mlx5_devlink_set_tx_lossy_overflow(dev, ctx->val.vu8);
>+}
>+
>+static int mlx5_devlink_get_congestion_action(struct devlink *devlink, u32 id,
>+					      struct devlink_param_gset_ctx *ctx)
>+{
>+	struct mlx5_core_dev *dev = devlink_priv(devlink);
>+
>+	return mlx5_devlink_query_tx_lossy_overflow(dev, &ctx->val.vu8);
>+}
>+
>+static int mlx5_devlink_set_congestion_mode(struct devlink *devlink, u32 id,
>+					    struct devlink_param_gset_ctx *ctx)
>+{
>+	struct mlx5_core_dev *dev = devlink_priv(devlink);
>+	u8 tx_lossy_overflow;
>+	int err;
>+
>+	if (ctx->val.vu8 > MLX5_DEVLINK_CONGESTION_MODE_MAX)
>+		return -ERANGE;
>+
>+	err = mlx5_devlink_query_tx_lossy_overflow(dev, &tx_lossy_overflow);
>+	if (err)
>+		return err;
>+
>+	if (ctx->val.vu8 != MLX5_DEVLINK_CONGESTION_MODE_AGGRESSIVE &&
>+	    tx_lossy_overflow == MLX5_DEVLINK_CONGESTION_ACTION_DISABLED)
>+		return -EINVAL;
>+
>+	return mlx5_devlink_set_tx_overflow_sense(dev, ctx->val.vu8);
>+}
>+
>+static int mlx5_devlink_get_congestion_mode(struct devlink *devlink, u32 id,
>+					    struct devlink_param_gset_ctx *ctx)
>+{
>+	struct mlx5_core_dev *dev = devlink_priv(devlink);
>+
>+	return mlx5_devlink_query_tx_overflow_sense(dev, &ctx->val.vu8);
>+}
>+
>+enum mlx5_devlink_param_id {
>+	MLX5_DEVLINK_PARAM_ID_BASE = DEVLINK_PARAM_GENERIC_ID_MAX,
>+	MLX5_DEVLINK_PARAM_ID_CONGESTION_ACTION,
>+	MLX5_DEVLINK_PARAM_ID_CONGESTION_MODE,
>+};
>+
>+static const struct devlink_param mlx5_devlink_params[] = {
>+	DEVLINK_PARAM_DRIVER(MLX5_DEVLINK_PARAM_ID_CONGESTION_ACTION,
>+			     "congestion_action",
>+			     DEVLINK_PARAM_TYPE_U8,
>+			     BIT(DEVLINK_PARAM_CMODE_RUNTIME),
>+			     mlx5_devlink_get_congestion_action,
>+			     mlx5_devlink_set_congestion_action, NULL),
>+	DEVLINK_PARAM_DRIVER(MLX5_DEVLINK_PARAM_ID_CONGESTION_MODE,
>+			     "congestion_mode",
>+			     DEVLINK_PARAM_TYPE_U8,
>+			     BIT(DEVLINK_PARAM_CMODE_RUNTIME),
>+			     mlx5_devlink_get_congestion_mode,
>+			     mlx5_devlink_set_congestion_mode, NULL),
>+};
>+
> int mlx5_devlink_register(struct devlink *devlink, struct device *dev)
> {
>-	return devlink_register(devlink, dev);
>+	int err;
>+
>+	err = devlink_register(devlink, dev);
>+	if (err)
>+		return err;
>+
>+	err = devlink_params_register(devlink, mlx5_devlink_params,
>+				      ARRAY_SIZE(mlx5_devlink_params));
>+	if (err) {
>+		dev_err(dev, "devlink_params_register failed, err = %d\n", err);
>+		goto unregister;
>+	}
>+
>+	return 0;
>+
>+unregister:
>+	devlink_unregister(devlink);
>+	return err;
> }
> 
> void mlx5_devlink_unregister(struct devlink *devlink)
> {
>+	devlink_params_unregister(devlink, mlx5_devlink_params,
>+				  ARRAY_SIZE(mlx5_devlink_params));
> 	devlink_unregister(devlink);
> }
>-- 
>2.17.0
>


* Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink
  2018-07-19  8:24   ` Jiri Pirko
@ 2018-07-19  8:49     ` Eran Ben Elisha
  0 siblings, 0 replies; 38+ messages in thread
From: Eran Ben Elisha @ 2018-07-19  8:49 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Saeed Mahameed, David S. Miller, Linux Netdev List, Eran Ben Elisha

>
> This should not be num. It should be a string. Same for "mode".

will fix for v2, thanks.
>
>


* Re: [pull request][net-next 00/16] Mellanox, mlx5e updates 2018-07-18
  2018-07-19  1:00 [pull request][net-next 00/16] Mellanox, mlx5e updates 2018-07-18 Saeed Mahameed
                   ` (15 preceding siblings ...)
  2018-07-19  1:01 ` [net-next 16/16] net/mlx5e: Use PARTIAL_GSO for UDP segmentation Saeed Mahameed
@ 2018-07-23 21:35 ` Saeed Mahameed
  16 siblings, 0 replies; 38+ messages in thread
From: Saeed Mahameed @ 2018-07-23 21:35 UTC (permalink / raw)
  To: davem; +Cc: netdev

On Wed, 2018-07-18 at 18:00 -0700, Saeed Mahameed wrote:
> Hi dave,
> 
> This series includes updates for mlx5e net device driver, with a
> couple
> of major features and some misc updates.
> 
> Please notice the mlx5-next merge patch at the beginning:
> "Merge branch 'mlx5-next' of
> git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux"
> 
> For more information please see tag log below.
> 
> Please pull and let me know if there's any problem.
> 

I will re-post v2 without the "Support PCIe buffer congestion handling
via Devlink" patches until Eran sorts out the review comments.

Thanks,
Saeed.


> Thanks,
> Saeed.
> 
> --- 
> 
> The following changes since commit
> 681d5d071c8bd5533a14244c0d55d1c0e30aa989:
> 
>   Merge branch 'mlx5-next' of
> git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux (2018-
> 07-18 15:53:31 -0700)
> 
> are available in the Git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git
> tags/mlx5e-updates-2018-07-18
> 
> for you to fetch changes up to
> a0ba57c09676689eb35f13d48990c9674c9baad4:
> 
>   net/mlx5e: Use PARTIAL_GSO for UDP segmentation (2018-07-18
> 17:26:28 -0700)
> 
> ----------------------------------------------------------------
> mlx5e-updates-2018-07-18
> 
> This series includes update for mlx5e net device driver.
> 
> 1) From Feras Daoud, added support for firmware log tracing, first
> by introducing the firmware API needed for the task, and then, for
> each PF, doing the following:
>     1- Allocate memory for the tracer strings database and read it
> from the FW to the SW.
>     2- Allocate and dma map tracer buffers.
> 
>     Traces that will be written into the buffer will be parsed as a
> group
>     of one or more traces, referred to as trace message. The trace
> message
>     represents a C-like printf string.
> Once a new trace is available, FW will generate an event indicating
> that new trace(s) are available, and the driver will parse them and
> dump them using tracepoint event tracing.
> 
> Enable mlx5 fw tracing by:
> echo 1 > /sys/kernel/debug/tracing/events/mlx5/mlx5_fw/enable
> 
> Read traces by:
> cat /sys/kernel/debug/tracing/trace
> 
> 2) From Eran Ben Elisha, Support PCIe buffer congestion handling
> via Devlink, using the new devlink device parameters API, added the
> new
> parameters:
>  - Congestion action
>             HW mechanism in the PCIe buffer which monitors the amount
> of
>             consumed PCIe buffer per host.  This mechanism supports
> the
>             following actions in case of threshold overflow:
>             - Disabled - NOP (Default)
>             - Drop
>             - Mark - Mark CE bit in the CQE of received packet
>     - Congestion mode
>             - Aggressive - Aggressive static trigger threshold
> (Default)
>             - Dynamic - Dynamically change the trigger threshold
> 
> 3) From Natali, set ECN for received packets using CQE indication.
> Using Eran's congestion settings a user can enable ECN marking, in
> which case the driver must update the ECN CE IP fields when requested
> by firmware (i.e. when congestion is sensed).
> 
> 4) From Roi Dayan, Remove redundant WARN when we cannot find neigh
> entry
> 
> 5) From Jianbo Liu, TC double vlan support
> - Support offloading tc double vlan headers match
> - Support offloading double vlan push/pop tc actions
> 
> 6) From Boris, revisit UDP GSO: remove the splitting of UDP_GSO_L4
> packets in the driver, and expose UDP_GSO_L4 as a PARTIAL_GSO feature.
> 
> ----------------------------------------------------------------
> Boris Pismenny (1):
>       net/mlx5e: Use PARTIAL_GSO for UDP segmentation
> 
> Eran Ben Elisha (3):
>       net/mlx5: Move all devlink related functions calls to devlink.c
>       net/mlx5: Add MPEGC register configuration functionality
>       net/mlx5: Support PCIe buffer congestion handling via Devlink
> 
> Feras Daoud (5):
>       net/mlx5: FW tracer, implement tracer logic
>       net/mlx5: FW tracer, create trace buffer and copy strings
> database
>       net/mlx5: FW tracer, events handling
>       net/mlx5: FW tracer, parse traces and kernel tracing support
>       net/mlx5: FW tracer, Enable tracing
> 
> Jianbo Liu (3):
>       net/mlx5e: Support offloading tc double vlan headers match
>       net/mlx5e: Refactor tc vlan push/pop actions offloading
>       net/mlx5e: Support offloading double vlan push/pop tc actions
> 
> Natali Shechtman (1):
>       net/mlx5e: Set ECN for received packets using CQE indication
> 
> Roi Dayan (1):
>       net/mlx5e: Remove redundant WARN when we cannot find neigh
> entry
> 
> Saeed Mahameed (2):
>       net/mlx5: FW tracer, register log buffer memory key
>       net/mlx5: FW tracer, Add debug prints
> 
>  drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   6 +-
>  drivers/net/ethernet/mellanox/mlx5/core/devlink.c  | 267 ++++++
>  drivers/net/ethernet/mellanox/mlx5/core/devlink.h  |  41 +
>  .../ethernet/mellanox/mlx5/core/diag/fw_tracer.c   | 947
> +++++++++++++++++++++
>  .../ethernet/mellanox/mlx5/core/diag/fw_tracer.h   | 175 ++++
>  .../mellanox/mlx5/core/diag/fw_tracer_tracepoint.h |  78 ++
>  .../mellanox/mlx5/core/en_accel/en_accel.h         |  27 +-
>  .../ethernet/mellanox/mlx5/core/en_accel/rxtx.c    | 109 ---
>  .../ethernet/mellanox/mlx5/core/en_accel/rxtx.h    |  14 -
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   9 +-
>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    |  35 +-
>  drivers/net/ethernet/mellanox/mlx5/core/en_stats.c |   3 +
>  drivers/net/ethernet/mellanox/mlx5/core/en_stats.h |   2 +
>  drivers/net/ethernet/mellanox/mlx5/core/en_tc.c    | 134 ++-
>  drivers/net/ethernet/mellanox/mlx5/core/eq.c       |  11 +
>  drivers/net/ethernet/mellanox/mlx5/core/eswitch.h  |  21 +-
>  .../ethernet/mellanox/mlx5/core/eswitch_offloads.c |  23 +-
>  drivers/net/ethernet/mellanox/mlx5/core/main.c     |  23 +-
>  include/linux/mlx5/device.h                        |   7 +
>  include/linux/mlx5/driver.h                        |   3 +
>  20 files changed, 1745 insertions(+), 190 deletions(-)
>  create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/devlink.c
>  create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/devlink.h
>  create mode 100644
> drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
>  create mode 100644
> drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.h
>  create mode 100644
> drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer_tracepoint.h
>  delete mode 100644
> drivers/net/ethernet/mellanox/mlx5/core/en_accel/rxtx.c
>  delete mode 100644
> drivers/net/ethernet/mellanox/mlx5/core/en_accel/rxtx.h


* Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink
  2018-07-19  1:49   ` Jakub Kicinski
@ 2018-07-24 10:31     ` Eran Ben Elisha
  2018-07-24 19:51       ` Jakub Kicinski
  0 siblings, 1 reply; 38+ messages in thread
From: Eran Ben Elisha @ 2018-07-24 10:31 UTC (permalink / raw)
  To: Jakub Kicinski, Saeed Mahameed, Jiri Pirko; +Cc: David S. Miller, netdev



On 7/19/2018 4:49 AM, Jakub Kicinski wrote:
> On Wed, 18 Jul 2018 18:01:01 -0700, Saeed Mahameed wrote:
>> +static const struct devlink_param mlx5_devlink_params[] = {
>> +	DEVLINK_PARAM_DRIVER(MLX5_DEVLINK_PARAM_ID_CONGESTION_ACTION,
>> +			     "congestion_action",
>> +			     DEVLINK_PARAM_TYPE_U8,
>> +			     BIT(DEVLINK_PARAM_CMODE_RUNTIME),
>> +			     mlx5_devlink_get_congestion_action,
>> +			     mlx5_devlink_set_congestion_action, NULL),
>> +	DEVLINK_PARAM_DRIVER(MLX5_DEVLINK_PARAM_ID_CONGESTION_MODE,
>> +			     "congestion_mode",
>> +			     DEVLINK_PARAM_TYPE_U8,
>> +			     BIT(DEVLINK_PARAM_CMODE_RUNTIME),
>> +			     mlx5_devlink_get_congestion_mode,
>> +			     mlx5_devlink_set_congestion_mode, NULL),
>> +};
> 
> The devlink params haven't been upstream even for a full cycle and
> already you guys are starting to use them to configure standard
> features like queuing.

We developed the devlink params in order to support non-standard 
configuration only, and for non-standard configuration there are 
generic and vendor-specific options.
The queuing model is a standard. However, here we are configuring the 
outbound PCIe buffers on the receive path from the NIC port toward the 
host(s) in a Single / Multi-Host environment.
(You can see the driver processing based on this param as part of the RX 
patch for the marked option here https://patchwork.ozlabs.org/patch/945998/)

> 
> I know your HW is not capable of doing full RED offload, it's a
> snowflake. 

The algorithm which is applied here for the drop option is not the core 
of this feature.

> You tell us you're doing custom DCB configuration hacks on
> one side (previous argument we had) and custom devlink parameter
> configuration hacks on PCIe.
> 
> Perhaps the idea that we're trying to use the existing Linux APIs for
> HW configuration only applies to forwarding behaviour.

Hopefully I explained above well why it is not related.


* Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink
  2018-07-24 10:31     ` Eran Ben Elisha
@ 2018-07-24 19:51       ` Jakub Kicinski
  2018-07-25 12:31         ` Eran Ben Elisha
  0 siblings, 1 reply; 38+ messages in thread
From: Jakub Kicinski @ 2018-07-24 19:51 UTC (permalink / raw)
  To: Eran Ben Elisha; +Cc: Saeed Mahameed, Jiri Pirko, David S. Miller, netdev

On Tue, 24 Jul 2018 13:31:28 +0300, Eran Ben Elisha wrote:
> On 7/19/2018 4:49 AM, Jakub Kicinski wrote:
> > On Wed, 18 Jul 2018 18:01:01 -0700, Saeed Mahameed wrote:  
> >> +static const struct devlink_param mlx5_devlink_params[] = {
> >> +	DEVLINK_PARAM_DRIVER(MLX5_DEVLINK_PARAM_ID_CONGESTION_ACTION,
> >> +			     "congestion_action",
> >> +			     DEVLINK_PARAM_TYPE_U8,
> >> +			     BIT(DEVLINK_PARAM_CMODE_RUNTIME),
> >> +			     mlx5_devlink_get_congestion_action,
> >> +			     mlx5_devlink_set_congestion_action, NULL),
> >> +	DEVLINK_PARAM_DRIVER(MLX5_DEVLINK_PARAM_ID_CONGESTION_MODE,
> >> +			     "congestion_mode",
> >> +			     DEVLINK_PARAM_TYPE_U8,
> >> +			     BIT(DEVLINK_PARAM_CMODE_RUNTIME),
> >> +			     mlx5_devlink_get_congestion_mode,
> >> +			     mlx5_devlink_set_congestion_mode, NULL),
> >> +};  
> > 
> > The devlink params haven't been upstream even for a full cycle and
> > already you guys are starting to use them to configure standard
> > features like queuing.  
> 
> We developed the devlink params in order to support non-standard 
> configuration only. And for non-standard, there are generic and vendor 
> specific options.

I thought it was developed for performing non-standard and possibly
vendor specific configuration.  Look at DEVLINK_PARAM_GENERIC_* for
examples of well justified generic options for which we have no
other API.  The vendor mlx4 options look fairly vendor specific if you
ask me, too.

Configuring queuing has an API.  The question is whether it is acceptable
to enter the risky territory of controlling offloads via devlink
parameters, or whether we would rather make vendors take the time and
effort to model things to (a subset of) existing APIs.  The HW never
fits the APIs perfectly.

> The queuing model is a standard. However here we are configuring the 
> outbound PCIe buffers on the receive path from NIC port toward the 
> host(s) in Single / MultiHost environment.

That's why we have PF representors.

> (You can see the driver processing based on this param as part of the RX 
> patch for the marked option here https://patchwork.ozlabs.org/patch/945998/)
>
> > I know your HW is not capable of doing full RED offload, it's a
> > snowflake.   
> 
> The algorithm which is applied here for the drop option is not the core 
> of this feature.
> 
> > You tell us you're doing custom DCB configuration hacks on
> > one side (previous argument we had) and custom devlink parameter
> > configuration hacks on PCIe.
> > 
> > Perhaps the idea that we're trying to use the existing Linux APIs for
> > HW configuration only applies to forwarding behaviour.  
> 
> Hopefully I explained above well why it is not related.

Sure ;)


* Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink
  2018-07-24 19:51       ` Jakub Kicinski
@ 2018-07-25 12:31         ` Eran Ben Elisha
  2018-07-25 15:23           ` Alexander Duyck
  0 siblings, 1 reply; 38+ messages in thread
From: Eran Ben Elisha @ 2018-07-25 12:31 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Saeed Mahameed, Jiri Pirko, David S. Miller, netdev



On 7/24/2018 10:51 PM, Jakub Kicinski wrote:
>>>
>>> The devlink params haven't been upstream even for a full cycle and
>>> already you guys are starting to use them to configure standard
>>> features like queuing.
>>
>> We developed the devlink params in order to support non-standard
>> configuration only. And for non-standard, there are generic and vendor
>> specific options.
> 
> I thought it was developed for performing non-standard and possibly
> vendor specific configuration.  Look at DEVLINK_PARAM_GENERIC_* for
> examples of well justified generic options for which we have no
> other API.  The vendor mlx4 options look fairly vendor specific if you
> ask me, too.
> 
> Configuring queuing has an API.  The question is it acceptable to enter
> into the risky territory of controlling offloads via devlink parameters
> or would we rather make vendors take the time and effort to model
> things to (a subset) of existing APIs.  The HW never fits the APIs
> perfectly.

I understand what you meant here; I would like to highlight that this 
mechanism was not meant to handle SRIOV, representors, etc.
The vendor-specific configuration suggested here is to handle a 
congestion state in a Multi Host environment (which includes a PF and 
multiple VFs per host), where one host is not aware of the other hosts 
and each is running on its own pci/driver. It is a device working-mode 
configuration.

This couldn't fit into any existing API, thus creating this
vendor-specific, unique API is needed.

> 
>> The queuing model is a standard. However here we are configuring the
>> outbound PCIe buffers on the receive path from NIC port toward the
>> host(s) in Single / MultiHost environment.
> 
> That's why we have PF representors.
> 
>> (You can see the driver processing based on this param as part of the RX
>> patch for the marked option here https://patchwork.ozlabs.org/patch/945998/)
>>
>>> I know your HW is not capable of doing full RED offload, it's a
>>> snowflake.
>>
>> The algorithm which is applied here for the drop option is not the core
>> of this feature.
>>
>>> You tell us you're doing custom DCB configuration hacks on
>>> one side (previous argument we had) and custom devlink parameter
>>> configuration hacks on PCIe.
>>>
>>> Perhaps the idea that we're trying to use the existing Linux APIs for
>>> HW configuration only applies to forwarding behaviour.
>>
>> Hopefully I explained above well why it is not related.
> 
> Sure ;)
> 


* Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink
  2018-07-25 12:31         ` Eran Ben Elisha
@ 2018-07-25 15:23           ` Alexander Duyck
  2018-07-26  0:43             ` Jakub Kicinski
  0 siblings, 1 reply; 38+ messages in thread
From: Alexander Duyck @ 2018-07-25 15:23 UTC (permalink / raw)
  To: Eran Ben Elisha
  Cc: Jakub Kicinski, Saeed Mahameed, Jiri Pirko, David S. Miller, netdev

On Wed, Jul 25, 2018 at 5:31 AM, Eran Ben Elisha <eranbe@mellanox.com> wrote:
>
>
> On 7/24/2018 10:51 PM, Jakub Kicinski wrote:
>>>>
>>>>
>>>> The devlink params haven't been upstream even for a full cycle and
>>>> already you guys are starting to use them to configure standard
>>>> features like queuing.
>>>
>>>
>>> We developed the devlink params in order to support non-standard
>>> configuration only. And for non-standard, there are generic and vendor
>>> specific options.
>>
>>
>> I thought it was developed for performing non-standard and possibly
>> vendor specific configuration.  Look at DEVLINK_PARAM_GENERIC_* for
>> examples of well justified generic options for which we have no
>> other API.  The vendor mlx4 options look fairly vendor specific if you
>> ask me, too.
>>
>> Configuring queuing has an API.  The question is it acceptable to enter
>> into the risky territory of controlling offloads via devlink parameters
>> or would we rather make vendors take the time and effort to model
>> things to (a subset) of existing APIs.  The HW never fits the APIs
>> perfectly.
>
>
> I understand what you meant here, I would like to highlight that this
> mechanism was not meant to handle SRIOV, Representors, etc.
> The vendor specific configuration suggested here is to handle a congestion
> state in Multi Host environment (which includes PF and multiple VFs per
> host), where one host is not aware to the other hosts, and each is running
> on its own pci/driver. It is a device working mode configuration.
>
> This  couldn't fit into any existing API, thus creating this vendor specific
> unique API is needed.

If we are just going to start creating devlink interfaces for every
one-off option a device wants to add, why did we even bother with
trying to prevent drivers from using sysfs? This just feels like we
are back to the same arguments we had back in the day with it.

I feel like the bigger question here is if devlink is how we are going
to deal with all PCIe related features going forward, or should we
start looking at creating a new interface/tool for PCI/PCIe related
features? My concern is that we have already had features such as DMA
Coalescing that didn't really fit into anything and now we are
starting to see other things related to DMA and PCIe bus credits. I'm
wondering if we shouldn't start looking at a tool/interface to
configure all the PCIe related features such as interrupts, error
reporting, DMA configuration, power management, etc. Maybe we could
even look at sharing it across subsystems and include things like
storage, graphics, and other subsystems in the conversation.

- Alex


* Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink
  2018-07-25 15:23           ` Alexander Duyck
@ 2018-07-26  0:43             ` Jakub Kicinski
  2018-07-26  7:14               ` Jiri Pirko
  0 siblings, 1 reply; 38+ messages in thread
From: Jakub Kicinski @ 2018-07-26  0:43 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Eran Ben Elisha, Saeed Mahameed, Jiri Pirko, David S. Miller, netdev

On Wed, 25 Jul 2018 08:23:26 -0700, Alexander Duyck wrote:
> On Wed, Jul 25, 2018 at 5:31 AM, Eran Ben Elisha wrote:
> > On 7/24/2018 10:51 PM, Jakub Kicinski wrote:  
> >>>> The devlink params haven't been upstream even for a full cycle and
> >>>> already you guys are starting to use them to configure standard
> >>>> features like queuing.  
> >>>
> >>> We developed the devlink params in order to support non-standard
> >>> configuration only. And for non-standard, there are generic and vendor
> >>> specific options.  
> >>
> >> I thought it was developed for performing non-standard and possibly
> >> vendor specific configuration.  Look at DEVLINK_PARAM_GENERIC_* for
> >> examples of well justified generic options for which we have no
> >> other API.  The vendor mlx4 options look fairly vendor specific if you
> >> ask me, too.
> >>
> >> Configuring queuing has an API.  The question is it acceptable to enter
> >> into the risky territory of controlling offloads via devlink parameters
> >> or would we rather make vendors take the time and effort to model
> >> things to (a subset) of existing APIs.  The HW never fits the APIs
> >> perfectly.  
> >
> > I understand what you meant here, I would like to highlight that this
> > mechanism was not meant to handle SRIOV, Representors, etc.
> > The vendor specific configuration suggested here is to handle a congestion
> > state in Multi Host environment (which includes PF and multiple VFs per
> > host), where one host is not aware to the other hosts, and each is running
> > on its own pci/driver. It is a device working mode configuration.
> >
> > This  couldn't fit into any existing API, thus creating this vendor specific
> > unique API is needed.  
> 
> If we are just going to start creating devlink interfaces in for every
> one-off option a device wants to add why did we even bother with
> trying to prevent drivers from using sysfs? This just feels like we
> are back to the same arguments we had back in the day with it.
> 
> I feel like the bigger question here is if devlink is how we are going
> to deal with all PCIe related features going forward, or should we
> start looking at creating a new interface/tool for PCI/PCIe related
> features? My concern is that we have already had features such as DMA
> Coalescing that didn't really fit into anything and now we are
> starting to see other things related to DMA and PCIe bus credits. I'm
> wondering if we shouldn't start looking at a tool/interface to
> configure all the PCIe related features such as interrupts, error
> reporting, DMA configuration, power management, etc. Maybe we could
> even look at sharing it across subsystems and include things like
> storage, graphics, and other subsystems in the conversation.

Agreed, for actual PCIe configuration (i.e. not ECN marking) we do need
to build up an API.  Sharing it across subsystems would be very cool!


* Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink
  2018-07-26  0:43             ` Jakub Kicinski
@ 2018-07-26  7:14               ` Jiri Pirko
  2018-07-26 14:00                 ` Alexander Duyck
  0 siblings, 1 reply; 38+ messages in thread
From: Jiri Pirko @ 2018-07-26  7:14 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexander Duyck, Eran Ben Elisha, Saeed Mahameed,
	David S. Miller, netdev

Thu, Jul 26, 2018 at 02:43:59AM CEST, jakub.kicinski@netronome.com wrote:
>On Wed, 25 Jul 2018 08:23:26 -0700, Alexander Duyck wrote:
>> On Wed, Jul 25, 2018 at 5:31 AM, Eran Ben Elisha wrote:
>> > On 7/24/2018 10:51 PM, Jakub Kicinski wrote:  
>> >>>> The devlink params haven't been upstream even for a full cycle and
>> >>>> already you guys are starting to use them to configure standard
>> >>>> features like queuing.  
>> >>>
>> >>> We developed the devlink params in order to support non-standard
>> >>> configuration only. And for non-standard, there are generic and vendor
>> >>> specific options.  
>> >>
>> >> I thought it was developed for performing non-standard and possibly
>> >> vendor specific configuration.  Look at DEVLINK_PARAM_GENERIC_* for
>> >> examples of well justified generic options for which we have no
>> >> other API.  The vendor mlx4 options look fairly vendor specific if you
>> >> ask me, too.
>> >>
>> >> Configuring queuing has an API.  The question is it acceptable to enter
>> >> into the risky territory of controlling offloads via devlink parameters
>> >> or would we rather make vendors take the time and effort to model
>> >> things to (a subset) of existing APIs.  The HW never fits the APIs
>> >> perfectly.  
>> >
>> > I understand what you meant here, I would like to highlight that this
>> > mechanism was not meant to handle SRIOV, Representors, etc.
>> > The vendor specific configuration suggested here is to handle a congestion
>> > state in Multi Host environment (which includes PF and multiple VFs per
>> > host), where one host is not aware to the other hosts, and each is running
>> > on its own pci/driver. It is a device working mode configuration.
>> >
>> > This  couldn't fit into any existing API, thus creating this vendor specific
>> > unique API is needed.  
>> 
>> If we are just going to start creating devlink interfaces in for every
>> one-off option a device wants to add why did we even bother with
>> trying to prevent drivers from using sysfs? This just feels like we
>> are back to the same arguments we had back in the day with it.
>> 
>> I feel like the bigger question here is if devlink is how we are going
>> to deal with all PCIe related features going forward, or should we
>> start looking at creating a new interface/tool for PCI/PCIe related
>> features? My concern is that we have already had features such as DMA
>> Coalescing that didn't really fit into anything and now we are
>> starting to see other things related to DMA and PCIe bus credits. I'm
>> wondering if we shouldn't start looking at a tool/interface to
>> configure all the PCIe related features such as interrupts, error
>> reporting, DMA configuration, power management, etc. Maybe we could
>> even look at sharing it across subsystems and include things like
>> storage, graphics, and other subsystems in the conversation.
>
>Agreed, for actual PCIe configuration (i.e. not ECN marking) we do need
>to build up an API.  Sharing it across subsystems would be very cool!

I wonder how come there isn't such an API in place already. Or is there?
If there is not, do you have any idea what it should look like? Should it
be an extension of the existing PCI uapi or something completely new?
It would probably be good to loop some PCI people in...

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink
  2018-07-26  7:14               ` Jiri Pirko
@ 2018-07-26 14:00                 ` Alexander Duyck
  2018-07-28 16:06                   ` Bjorn Helgaas
  0 siblings, 1 reply; 38+ messages in thread
From: Alexander Duyck @ 2018-07-26 14:00 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Jakub Kicinski, Eran Ben Elisha, Saeed Mahameed, David S. Miller,
	netdev, linux-pci

On Thu, Jul 26, 2018 at 12:14 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Thu, Jul 26, 2018 at 02:43:59AM CEST, jakub.kicinski@netronome.com wrote:
>>On Wed, 25 Jul 2018 08:23:26 -0700, Alexander Duyck wrote:
>>> On Wed, Jul 25, 2018 at 5:31 AM, Eran Ben Elisha wrote:
>>> > On 7/24/2018 10:51 PM, Jakub Kicinski wrote:
>>> >>>> The devlink params haven't been upstream even for a full cycle and
>>> >>>> already you guys are starting to use them to configure standard
>>> >>>> features like queuing.
>>> >>>
>>> >>> We developed the devlink params in order to support non-standard
>>> >>> configuration only. And for non-standard, there are generic and vendor
>>> >>> specific options.
>>> >>
>>> >> I thought it was developed for performing non-standard and possibly
>>> >> vendor specific configuration.  Look at DEVLINK_PARAM_GENERIC_* for
>>> >> examples of well justified generic options for which we have no
>>> >> other API.  The vendor mlx4 options look fairly vendor specific if you
>>> >> ask me, too.
>>> >>
>>> >> Configuring queuing has an API.  The question is it acceptable to enter
>>> >> into the risky territory of controlling offloads via devlink parameters
>>> >> or would we rather make vendors take the time and effort to model
>>> >> things to (a subset) of existing APIs.  The HW never fits the APIs
>>> >> perfectly.
>>> >
>>> > I understand what you meant here, I would like to highlight that this
>>> > mechanism was not meant to handle SRIOV, Representors, etc.
>>> > The vendor specific configuration suggested here is to handle a congestion
>>> > state in Multi Host environment (which includes PF and multiple VFs per
>>> > host), where one host is not aware to the other hosts, and each is running
>>> > on its own pci/driver. It is a device working mode configuration.
>>> >
>>> > This  couldn't fit into any existing API, thus creating this vendor specific
>>> > unique API is needed.
>>>
>>> If we are just going to start creating devlink interfaces in for every
>>> one-off option a device wants to add why did we even bother with
>>> trying to prevent drivers from using sysfs? This just feels like we
>>> are back to the same arguments we had back in the day with it.
>>>
>>> I feel like the bigger question here is if devlink is how we are going
>>> to deal with all PCIe related features going forward, or should we
>>> start looking at creating a new interface/tool for PCI/PCIe related
>>> features? My concern is that we have already had features such as DMA
>>> Coalescing that didn't really fit into anything and now we are
>>> starting to see other things related to DMA and PCIe bus credits. I'm
>>> wondering if we shouldn't start looking at a tool/interface to
>>> configure all the PCIe related features such as interrupts, error
>>> reporting, DMA configuration, power management, etc. Maybe we could
>>> even look at sharing it across subsystems and include things like
>>> storage, graphics, and other subsystems in the conversation.
>>
>>Agreed, for actual PCIe configuration (i.e. not ECN marking) we do need
>>to build up an API.  Sharing it across subsystems would be very cool!
>
> I wonder howcome there isn't such API in place already. Or is it?
> If it is not, do you have any idea how should it look like? Should it be
> an extension of the existing PCI uapi or something completely new?
> It would be probably good to loop some PCI people in...

The closest thing I can think of, in answer to your question as to why
we haven't seen anything like that already, would be setpci. Basically,
with that tool you can go through PCI configuration space and update any
piece you want. The problem is that it can have effects on the driver,
and I don't recall there ever being any sort of notification mechanism
added to make a driver aware of configuration updates.
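
To make concrete what that kind of raw access touches, here is a toy
sketch (offsets and bit layouts per the PCIe Device Control register;
the sample register value is invented) of the MPS/MRRS fields that
setpci can rewrite while the PCI core assumes it owns them:

```python
# Toy decoder for the PCIe Device Control register (offset 0x08 in the
# PCI Express Capability structure), illustrating fields that a tool
# like setpci can rewrite behind the PCI core's back.
# Illustrative sketch only, not kernel code.

def decode_devctl(devctl: int) -> dict:
    """Decode Max_Payload_Size and Max_Read_Request_Size from a
    16-bit Device Control register value."""
    mps_field = (devctl >> 5) & 0x7    # bits 7:5, payload = 128 << field
    mrrs_field = (devctl >> 12) & 0x7  # bits 14:12, read req = 128 << field
    return {
        "max_payload_bytes": 128 << mps_field,
        "max_read_request_bytes": 128 << mrrs_field,
    }

# Hypothetical value, e.g. as read with "setpci -s 01:00.0 CAP_EXP+8.w"
print(decode_devctl(0x2020))
```

If a user flips those bits directly, the driver and the PCI core keep
operating on stale assumptions, which is the notification gap described
above.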

As far as the interface I don't know if we would want to use something
like netlink or look at something completely new.

I've gone ahead and added the linux-pci mailing list to the thread.

- Alex


* Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink
  2018-07-26 14:00                 ` Alexander Duyck
@ 2018-07-28 16:06                   ` Bjorn Helgaas
  2018-07-29  9:23                     ` Moshe Shemesh
  0 siblings, 1 reply; 38+ messages in thread
From: Bjorn Helgaas @ 2018-07-28 16:06 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jiri Pirko, Jakub Kicinski, Eran Ben Elisha, Saeed Mahameed,
	David S. Miller, netdev, linux-pci

On Thu, Jul 26, 2018 at 07:00:20AM -0700, Alexander Duyck wrote:
> On Thu, Jul 26, 2018 at 12:14 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> > Thu, Jul 26, 2018 at 02:43:59AM CEST, jakub.kicinski@netronome.com wrote:
> >>On Wed, 25 Jul 2018 08:23:26 -0700, Alexander Duyck wrote:
> >>> On Wed, Jul 25, 2018 at 5:31 AM, Eran Ben Elisha wrote:
> >>> > On 7/24/2018 10:51 PM, Jakub Kicinski wrote:
> >>> >>>> The devlink params haven't been upstream even for a full cycle and
> >>> >>>> already you guys are starting to use them to configure standard
> >>> >>>> features like queuing.
> >>> >>>
> >>> >>> We developed the devlink params in order to support non-standard
> >>> >>> configuration only. And for non-standard, there are generic and vendor
> >>> >>> specific options.
> >>> >>
> >>> >> I thought it was developed for performing non-standard and possibly
> >>> >> vendor specific configuration.  Look at DEVLINK_PARAM_GENERIC_* for
> >>> >> examples of well justified generic options for which we have no
> >>> >> other API.  The vendor mlx4 options look fairly vendor specific if you
> >>> >> ask me, too.
> >>> >>
> >>> >> Configuring queuing has an API.  The question is it acceptable to enter
> >>> >> into the risky territory of controlling offloads via devlink parameters
> >>> >> or would we rather make vendors take the time and effort to model
> >>> >> things to (a subset) of existing APIs.  The HW never fits the APIs
> >>> >> perfectly.
> >>> >
> >>> > I understand what you meant here, I would like to highlight that this
> >>> > mechanism was not meant to handle SRIOV, Representors, etc.
> >>> > The vendor specific configuration suggested here is to handle a congestion
> >>> > state in Multi Host environment (which includes PF and multiple VFs per
> >>> > host), where one host is not aware to the other hosts, and each is running
> >>> > on its own pci/driver. It is a device working mode configuration.
> >>> >
> >>> > This  couldn't fit into any existing API, thus creating this vendor specific
> >>> > unique API is needed.
> >>>
> >>> If we are just going to start creating devlink interfaces in for every
> >>> one-off option a device wants to add why did we even bother with
> >>> trying to prevent drivers from using sysfs? This just feels like we
> >>> are back to the same arguments we had back in the day with it.
> >>>
> >>> I feel like the bigger question here is if devlink is how we are going
> >>> to deal with all PCIe related features going forward, or should we
> >>> start looking at creating a new interface/tool for PCI/PCIe related
> >>> features? My concern is that we have already had features such as DMA
> >>> Coalescing that didn't really fit into anything and now we are
> >>> starting to see other things related to DMA and PCIe bus credits. I'm
> >>> wondering if we shouldn't start looking at a tool/interface to
> >>> configure all the PCIe related features such as interrupts, error
> >>> reporting, DMA configuration, power management, etc. Maybe we could
> >>> even look at sharing it across subsystems and include things like
> >>> storage, graphics, and other subsystems in the conversation.
> >>
> >>Agreed, for actual PCIe configuration (i.e. not ECN marking) we do need
> >>to build up an API.  Sharing it across subsystems would be very cool!

I read the thread (starting at [1], for anybody else coming in late)
and I see this has something to do with "configuring outbound PCIe
buffers", but I haven't seen the connection to PCIe protocol or
features, i.e., I can't connect this to anything in the PCIe spec.

Can somebody help me understand how the PCI core is relevant?  If
there's some connection with a feature defined by PCIe, or if it
affects the PCIe transaction protocol somehow, I'm definitely
interested in this.  But if this only affects the data transferred
over PCIe, i.e., the data payloads of PCIe TLP packets, then I'm not
sure why the PCI core should care.

> > I wonder howcome there isn't such API in place already. Or is it?
> > If it is not, do you have any idea how should it look like? Should it be
> > an extension of the existing PCI uapi or something completely new?
> > It would be probably good to loop some PCI people in...
> 
> The closest thing I can think of in terms of answering your questions
> as to why we haven't seen anything like that would be setpci.
> Basically with that tool you can go through the PCI configuration
> space and update any piece you want. The problem is it can have
> effects on the driver and I don't recall there ever being any sort of
> notification mechanism added to make a driver aware of configuration
> updates.

setpci is a development and debugging tool, not something we should
use as the standard way of configuring things.  Use of setpci should
probably taint the kernel because the PCI core configures features
like MPS, ASPM, AER, etc., based on the assumption that nobody else is
changing things in PCI config space.

> As far as the interface I don't know if we would want to use something
> like netlink or look at something completely new.
> 
> I've gone ahead and added the linux-pci mailing list to the thread.
> 
> - Alex

[1] https://lkml.kernel.org/r/20180719010107.22363-11-saeedm@mellanox.com


* Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink
  2018-07-28 16:06                   ` Bjorn Helgaas
@ 2018-07-29  9:23                     ` Moshe Shemesh
  2018-07-29 22:00                       ` Alexander Duyck
  0 siblings, 1 reply; 38+ messages in thread
From: Moshe Shemesh @ 2018-07-29  9:23 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Alexander Duyck, Jiri Pirko, Jakub Kicinski, Eran Ben Elisha,
	Saeed Mahameed, David S. Miller, netdev, linux-pci

On Sat, Jul 28, 2018 at 7:06 PM, Bjorn Helgaas <helgaas@kernel.org> wrote:

> On Thu, Jul 26, 2018 at 07:00:20AM -0700, Alexander Duyck wrote:
> > On Thu, Jul 26, 2018 at 12:14 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> > > Thu, Jul 26, 2018 at 02:43:59AM CEST, jakub.kicinski@netronome.com
> wrote:
> > >>On Wed, 25 Jul 2018 08:23:26 -0700, Alexander Duyck wrote:
> > >>> On Wed, Jul 25, 2018 at 5:31 AM, Eran Ben Elisha wrote:
> > >>> > On 7/24/2018 10:51 PM, Jakub Kicinski wrote:
> > >>> >>>> The devlink params haven't been upstream even for a full cycle
> and
> > >>> >>>> already you guys are starting to use them to configure standard
> > >>> >>>> features like queuing.
> > >>> >>>
> > >>> >>> We developed the devlink params in order to support non-standard
> > >>> >>> configuration only. And for non-standard, there are generic and
> vendor
> > >>> >>> specific options.
> > >>> >>
> > >>> >> I thought it was developed for performing non-standard and
> possibly
> > >>> >> vendor specific configuration.  Look at DEVLINK_PARAM_GENERIC_*
> for
> > >>> >> examples of well justified generic options for which we have no
> > >>> >> other API.  The vendor mlx4 options look fairly vendor specific
> if you
> > >>> >> ask me, too.
> > >>> >>
> > >>> >> Configuring queuing has an API.  The question is it acceptable to
> enter
> > >>> >> into the risky territory of controlling offloads via devlink
> parameters
> > >>> >> or would we rather make vendors take the time and effort to model
> > >>> >> things to (a subset) of existing APIs.  The HW never fits the APIs
> > >>> >> perfectly.
> > >>> >
> > >>> > I understand what you meant here, I would like to highlight that
> this
> > >>> > mechanism was not meant to handle SRIOV, Representors, etc.
> > >>> > The vendor specific configuration suggested here is to handle a
> congestion
> > >>> > state in Multi Host environment (which includes PF and multiple
> VFs per
> > >>> > host), where one host is not aware to the other hosts, and each is
> running
> > >>> > on its own pci/driver. It is a device working mode configuration.
> > >>> >
> > >>> > This  couldn't fit into any existing API, thus creating this
> vendor specific
> > >>> > unique API is needed.
> > >>>
> > >>> If we are just going to start creating devlink interfaces in for
> every
> > >>> one-off option a device wants to add why did we even bother with
> > >>> trying to prevent drivers from using sysfs? This just feels like we
> > >>> are back to the same arguments we had back in the day with it.
> > >>>
> > >>> I feel like the bigger question here is if devlink is how we are
> going
> > >>> to deal with all PCIe related features going forward, or should we
> > >>> start looking at creating a new interface/tool for PCI/PCIe related
> > >>> features? My concern is that we have already had features such as DMA
> > >>> Coalescing that didn't really fit into anything and now we are
> > >>> starting to see other things related to DMA and PCIe bus credits. I'm
> > >>> wondering if we shouldn't start looking at a tool/interface to
> > >>> configure all the PCIe related features such as interrupts, error
> > >>> reporting, DMA configuration, power management, etc. Maybe we could
> > >>> even look at sharing it across subsystems and include things like
> > >>> storage, graphics, and other subsystems in the conversation.
> > >>
> > >>Agreed, for actual PCIe configuration (i.e. not ECN marking) we do need
> > >>to build up an API.  Sharing it across subsystems would be very cool!
>
> I read the thread (starting at [1], for anybody else coming in late)
> and I see this has something to do with "configuring outbound PCIe
> buffers", but I haven't seen the connection to PCIe protocol or
> features, i.e., I can't connect this to anything in the PCIe spec.
>
> Can somebody help me understand how the PCI core is relevant?  If
> there's some connection with a feature defined by PCIe, or if it
> affects the PCIe transaction protocol somehow, I'm definitely
> interested in this.  But if this only affects the data transferred
> over PCIe, i.e., the data payloads of PCIe TLP packets, then I'm not
> sure why the PCI core should care.
>
>

As you wrote, this is not a PCIe feature, nor does it affect the PCIe
transaction protocol.

Actually, due to a hardware limitation in the current device, we have
enabled a workaround in hardware.

This mode is proprietary and not relevant to other PCIe devices, and is
therefore set using a driver-specific parameter in devlink.

> > I wonder howcome there isn't such API in place already. Or is it?
> > > If it is not, do you have any idea how should it look like? Should it
> be
> > > an extension of the existing PCI uapi or something completely new?
> > > It would be probably good to loop some PCI people in...
> >
> > The closest thing I can think of in terms of answering your questions
> > as to why we haven't seen anything like that would be setpci.
> > Basically with that tool you can go through the PCI configuration
> > space and update any piece you want. The problem is it can have
> > effects on the driver and I don't recall there ever being any sort of
> > notification mechanism added to make a driver aware of configuration
> > updates.
>
> setpci is a development and debugging tool, not something we should
> use as the standard way of configuring things.  Use of setpci should
> probably taint the kernel because the PCI core configures features
> like MPS, ASPM, AER, etc., based on the assumption that nobody else is
> changing things in PCI config space.
>
> > As far as the interface I don't know if we would want to use something
> > like netlink or look at something completely new.
> >
> > I've gone ahead and added the linux-pci mailing list to the thread.
> >
> > - Alex
>
> [1] https://lkml.kernel.org/r/20180719010107.22363-11-saeedm@mellanox.com
>


* Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink
  2018-07-29  9:23                     ` Moshe Shemesh
@ 2018-07-29 22:00                       ` Alexander Duyck
  2018-07-30 14:07                         ` Bjorn Helgaas
  0 siblings, 1 reply; 38+ messages in thread
From: Alexander Duyck @ 2018-07-29 22:00 UTC (permalink / raw)
  To: Moshe Shemesh
  Cc: Bjorn Helgaas, Jiri Pirko, Jakub Kicinski, Eran Ben Elisha,
	Saeed Mahameed, David S. Miller, netdev, linux-pci

On Sun, Jul 29, 2018 at 2:23 AM, Moshe Shemesh <moshes20.il@gmail.com> wrote:
>
>
> On Sat, Jul 28, 2018 at 7:06 PM, Bjorn Helgaas <helgaas@kernel.org> wrote:
>>
>> On Thu, Jul 26, 2018 at 07:00:20AM -0700, Alexander Duyck wrote:
>> > On Thu, Jul 26, 2018 at 12:14 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> > > Thu, Jul 26, 2018 at 02:43:59AM CEST, jakub.kicinski@netronome.com
>> > > wrote:
>> > >>On Wed, 25 Jul 2018 08:23:26 -0700, Alexander Duyck wrote:
>> > >>> On Wed, Jul 25, 2018 at 5:31 AM, Eran Ben Elisha wrote:
>> > >>> > On 7/24/2018 10:51 PM, Jakub Kicinski wrote:
>> > >>> >>>> The devlink params haven't been upstream even for a full cycle
>> > >>> >>>> and
>> > >>> >>>> already you guys are starting to use them to configure standard
>> > >>> >>>> features like queuing.
>> > >>> >>>
>> > >>> >>> We developed the devlink params in order to support non-standard
>> > >>> >>> configuration only. And for non-standard, there are generic and
>> > >>> >>> vendor
>> > >>> >>> specific options.
>> > >>> >>
>> > >>> >> I thought it was developed for performing non-standard and
>> > >>> >> possibly
>> > >>> >> vendor specific configuration.  Look at DEVLINK_PARAM_GENERIC_*
>> > >>> >> for
>> > >>> >> examples of well justified generic options for which we have no
>> > >>> >> other API.  The vendor mlx4 options look fairly vendor specific
>> > >>> >> if you
>> > >>> >> ask me, too.
>> > >>> >>
>> > >>> >> Configuring queuing has an API.  The question is it acceptable to
>> > >>> >> enter
>> > >>> >> into the risky territory of controlling offloads via devlink
>> > >>> >> parameters
>> > >>> >> or would we rather make vendors take the time and effort to model
>> > >>> >> things to (a subset) of existing APIs.  The HW never fits the
>> > >>> >> APIs
>> > >>> >> perfectly.
>> > >>> >
>> > >>> > I understand what you meant here, I would like to highlight that
>> > >>> > this
>> > >>> > mechanism was not meant to handle SRIOV, Representors, etc.
>> > >>> > The vendor specific configuration suggested here is to handle a
>> > >>> > congestion
>> > >>> > state in Multi Host environment (which includes PF and multiple
>> > >>> > VFs per
>> > >>> > host), where one host is not aware to the other hosts, and each is
>> > >>> > running
>> > >>> > on its own pci/driver. It is a device working mode configuration.
>> > >>> >
>> > >>> > This  couldn't fit into any existing API, thus creating this
>> > >>> > vendor specific
>> > >>> > unique API is needed.
>> > >>>
>> > >>> If we are just going to start creating devlink interfaces in for
>> > >>> every
>> > >>> one-off option a device wants to add why did we even bother with
>> > >>> trying to prevent drivers from using sysfs? This just feels like we
>> > >>> are back to the same arguments we had back in the day with it.
>> > >>>
>> > >>> I feel like the bigger question here is if devlink is how we are
>> > >>> going
>> > >>> to deal with all PCIe related features going forward, or should we
>> > >>> start looking at creating a new interface/tool for PCI/PCIe related
>> > >>> features? My concern is that we have already had features such as
>> > >>> DMA
>> > >>> Coalescing that didn't really fit into anything and now we are
>> > >>> starting to see other things related to DMA and PCIe bus credits.
>> > >>> I'm
>> > >>> wondering if we shouldn't start looking at a tool/interface to
>> > >>> configure all the PCIe related features such as interrupts, error
>> > >>> reporting, DMA configuration, power management, etc. Maybe we could
>> > >>> even look at sharing it across subsystems and include things like
>> > >>> storage, graphics, and other subsystems in the conversation.
>> > >>
>> > >>Agreed, for actual PCIe configuration (i.e. not ECN marking) we do
>> > >> need
>> > >>to build up an API.  Sharing it across subsystems would be very cool!
>>
>> I read the thread (starting at [1], for anybody else coming in late)
>> and I see this has something to do with "configuring outbound PCIe
>> buffers", but I haven't seen the connection to PCIe protocol or
>> features, i.e., I can't connect this to anything in the PCIe spec.
>>
>> Can somebody help me understand how the PCI core is relevant?  If
>> there's some connection with a feature defined by PCIe, or if it
>> affects the PCIe transaction protocol somehow, I'm definitely
>> interested in this.  But if this only affects the data transferred
>> over PCIe, i.e., the data payloads of PCIe TLP packets, then I'm not
>> sure why the PCI core should care.
>>
>
>
> As you wrote, this is not a PCIe feature  or affects the PCIe transaction
> protocol.
>
> Actually, due to hardware limitation in current device, we have enabled a
> workaround in hardware.
>
> This mode is proprietary and not relevant to other PCIe devices, thus is set
> using driver-specific parameter in devlink

Essentially what this feature is doing is communicating the need for
PCIe back-pressure to the network fabric. So, as the buffers on the
device start to fill because the device isn't able to get PCIe credits
back fast enough, it will start to send congestion notifications to the
network stack itself, if I understand this correctly.
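
As a rough mental model only (all names and thresholds below are
invented for illustration; this is not how the mlx5 hardware actually
implements it), the behavior could be sketched like this:

```python
# Toy model of the back-pressure behavior described above: a device
# buffer fills while PCIe flow-control credits are scarce, and once
# occupancy crosses a threshold the device starts emitting congestion
# notifications (e.g. ECN marks) toward the fabric.
# All names and thresholds are invented; NOT the actual mlx5 mechanism.

class ToyDeviceBuffer:
    def __init__(self, capacity: int, ecn_threshold: float):
        self.capacity = capacity          # buffer size in bytes
        self.ecn_threshold = ecn_threshold  # fraction of capacity
        self.occupancy = 0
        self.marked = 0

    def rx_packet(self, size: int, pcie_credits_available: bool) -> bool:
        """Returns True if the packet was ECN-marked due to congestion."""
        if pcie_credits_available:
            # Packet drains straight to host memory; buffer stays level.
            return False
        self.occupancy = min(self.capacity, self.occupancy + size)
        if self.occupancy / self.capacity >= self.ecn_threshold:
            self.marked += 1
            return True
        return False

buf = ToyDeviceBuffer(capacity=1000, ecn_threshold=0.8)
marks = [buf.rx_packet(100, pcie_credits_available=False) for _ in range(10)]
print(marks.count(True))  # packets marked once the buffer ran hot
```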

For now there are no major conflicts, but when we start getting into
stuff like PCIe DMA coalescing, and on a more general basis just PCIe
active state power management that is going to start making things
more complicated going forward.

I assume the devices on which we are talking about supporting this new
feature either don't deal with ASPM or assume a quick turnaround to get
out of the lower power states? Otherwise that would definitely cause
some back-pressure buildups that would hurt performance.

- Alex


* Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink
  2018-07-29 22:00                       ` Alexander Duyck
@ 2018-07-30 14:07                         ` Bjorn Helgaas
  2018-07-30 15:02                           ` Alexander Duyck
  0 siblings, 1 reply; 38+ messages in thread
From: Bjorn Helgaas @ 2018-07-30 14:07 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Moshe Shemesh, Jiri Pirko, Jakub Kicinski, Eran Ben Elisha,
	Saeed Mahameed, David S. Miller, netdev, linux-pci

On Sun, Jul 29, 2018 at 03:00:28PM -0700, Alexander Duyck wrote:
> On Sun, Jul 29, 2018 at 2:23 AM, Moshe Shemesh <moshes20.il@gmail.com> wrote:
> > On Sat, Jul 28, 2018 at 7:06 PM, Bjorn Helgaas <helgaas@kernel.org> wrote:
> >> On Thu, Jul 26, 2018 at 07:00:20AM -0700, Alexander Duyck wrote:
> >> > On Thu, Jul 26, 2018 at 12:14 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> >> > > Thu, Jul 26, 2018 at 02:43:59AM CEST, jakub.kicinski@netronome.com
> >> > > wrote:
> >> > >>On Wed, 25 Jul 2018 08:23:26 -0700, Alexander Duyck wrote:
> >> > >>> On Wed, Jul 25, 2018 at 5:31 AM, Eran Ben Elisha wrote:
> >> > >>> > On 7/24/2018 10:51 PM, Jakub Kicinski wrote:
> >> > >>> >>>> The devlink params haven't been upstream even for a full cycle
> >> > >>> >>>> and
> >> > >>> >>>> already you guys are starting to use them to configure standard
> >> > >>> >>>> features like queuing.
> >> > >>> >>>
> >> > >>> >>> We developed the devlink params in order to support non-standard
> >> > >>> >>> configuration only. And for non-standard, there are generic and
> >> > >>> >>> vendor
> >> > >>> >>> specific options.
> >> > >>> >>
> >> > >>> >> I thought it was developed for performing non-standard and
> >> > >>> >> possibly
> >> > >>> >> vendor specific configuration.  Look at DEVLINK_PARAM_GENERIC_*
> >> > >>> >> for
> >> > >>> >> examples of well justified generic options for which we have no
> >> > >>> >> other API.  The vendor mlx4 options look fairly vendor specific
> >> > >>> >> if you
> >> > >>> >> ask me, too.
> >> > >>> >>
> >> > >>> >> Configuring queuing has an API.  The question is it acceptable to
> >> > >>> >> enter
> >> > >>> >> into the risky territory of controlling offloads via devlink
> >> > >>> >> parameters
> >> > >>> >> or would we rather make vendors take the time and effort to model
> >> > >>> >> things to (a subset) of existing APIs.  The HW never fits the
> >> > >>> >> APIs
> >> > >>> >> perfectly.
> >> > >>> >
> >> > >>> > I understand what you meant here, I would like to highlight that
> >> > >>> > this
> >> > >>> > mechanism was not meant to handle SRIOV, Representors, etc.
> >> > >>> > The vendor specific configuration suggested here is to handle a
> >> > >>> > congestion
> >> > >>> > state in Multi Host environment (which includes PF and multiple
> >> > >>> > VFs per
> >> > >>> > host), where one host is not aware to the other hosts, and each is
> >> > >>> > running
> >> > >>> > on its own pci/driver. It is a device working mode configuration.
> >> > >>> >
> >> > >>> > This  couldn't fit into any existing API, thus creating this
> >> > >>> > vendor specific
> >> > >>> > unique API is needed.
> >> > >>>
> >> > >>> If we are just going to start creating devlink interfaces in for
> >> > >>> every
> >> > >>> one-off option a device wants to add why did we even bother with
> >> > >>> trying to prevent drivers from using sysfs? This just feels like we
> >> > >>> are back to the same arguments we had back in the day with it.
> >> > >>>
> >> > >>> I feel like the bigger question here is if devlink is how we are
> >> > >>> going
> >> > >>> to deal with all PCIe related features going forward, or should we
> >> > >>> start looking at creating a new interface/tool for PCI/PCIe related
> >> > >>> features? My concern is that we have already had features such as
> >> > >>> DMA
> >> > >>> Coalescing that didn't really fit into anything and now we are
> >> > >>> starting to see other things related to DMA and PCIe bus credits.
> >> > >>> I'm
> >> > >>> wondering if we shouldn't start looking at a tool/interface to
> >> > >>> configure all the PCIe related features such as interrupts, error
> >> > >>> reporting, DMA configuration, power management, etc. Maybe we could
> >> > >>> even look at sharing it across subsystems and include things like
> >> > >>> storage, graphics, and other subsystems in the conversation.
> >> > >>
> >> > >>Agreed, for actual PCIe configuration (i.e. not ECN marking) we do
> >> > >> need
> >> > >>to build up an API.  Sharing it across subsystems would be very cool!
> >>
> >> I read the thread (starting at [1], for anybody else coming in late)
> >> and I see this has something to do with "configuring outbound PCIe
> >> buffers", but I haven't seen the connection to PCIe protocol or
> >> features, i.e., I can't connect this to anything in the PCIe spec.
> >>
> >> Can somebody help me understand how the PCI core is relevant?  If
> >> there's some connection with a feature defined by PCIe, or if it
> >> affects the PCIe transaction protocol somehow, I'm definitely
> >> interested in this.  But if this only affects the data transferred
> >> over PCIe, i.e., the data payloads of PCIe TLP packets, then I'm not
> >> sure why the PCI core should care.
> >>
> >
> >
> > As you wrote, this is not a PCIe feature  or affects the PCIe transaction
> > protocol.
> >
> > Actually, due to hardware limitation in current device, we have enabled a
> > workaround in hardware.
> >
> > This mode is proprietary and not relevant to other PCIe devices, thus is set
> > using driver-specific parameter in devlink
> 
> Essentially what this feature is doing is communicating the need for
> PCIe back-pressure to the network fabric. So as the buffers on the
> device start to fill because the device isn't able to get back PCIe
> credits fast enough it will then start to send congestion
> notifications to the network stack itself if I understand this
> correctly.

This sounds like a hook that allows the device to tell its driver
about PCIe flow control credits, and the driver can pass that on to
the network stack.  IIUC, that would be a device-specific feature
outside the scope of the PCI core.

> For now there are no major conflicts, but when we start getting into
> stuff like PCIe DMA coalescing, and on a more general basis just PCIe
> active state power management that is going to start making things
> more complicated going forward.

We do support ASPM already in the PCI core, and we do have the
pci_disable_link_state() interface, which is currently the only way
drivers can influence it.  There are several drivers that do their own
ASPM configuration, but this is not safe because it's not coordinated
with what the PCI core does.  If/when drivers need more control, we
should enhance the PCI core interfaces.

I don't know what PCIe DMA coalescing means, so I can't comment on
that.

> I assume the devices we are talking about supporting this new feature
> on either don't deal with ASPM or assume a quick turnaround to get out
> of the lower power states? Otherwise that would definitely cause some
> back-pressure buildups that would hurt performance.

Devices can communicate the ASPM exit latency they can tolerate via
the Device Capabilities register (PCIe r4.0, sec 7.5.3.3).  Linux
should be configuring ASPM to respect those device requirements.
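
For illustration, a small sketch decoding those acceptable-latency
fields (bits 8:6 and 11:9 of the Device Capabilities register, with the
encodings I believe the PCIe spec defines; this is what lspci renders as
e.g. "L0s <64ns, L1 <1us"). Purely illustrative, not kernel code:

```python
# Sketch decoding the Endpoint L0s/L1 Acceptable Latency fields of the
# PCIe Device Capabilities register. Encodings assumed per the PCIe
# spec's table for these fields; illustrative only, not kernel code.

_L0S = ["64ns", "128ns", "256ns", "512ns", "1us", "2us", "4us", "no limit"]
_L1 = ["1us", "2us", "4us", "8us", "16us", "32us", "64us", "no limit"]

def acceptable_aspm_latency(devcap: int) -> dict:
    """Return the maximum L0s/L1 exit latency the endpoint tolerates."""
    return {
        "l0s_acceptable": _L0S[(devcap >> 6) & 0x7],  # bits 8:6
        "l1_acceptable": _L1[(devcap >> 9) & 0x7],    # bits 11:9
    }

# Hypothetical DevCap value: L0s field = 0 (<=64ns), L1 field = 6 (<=64us)
print(acceptable_aspm_latency(0x0C00))
```

The OS-side ASPM policy is then supposed to stay within those bounds,
which is the coordination the PCI core provides and per-driver hacks
bypass.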

Bjorn

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink
  2018-07-30 14:07                         ` Bjorn Helgaas
@ 2018-07-30 15:02                           ` Alexander Duyck
  2018-07-30 22:00                             ` Jakub Kicinski
  2018-07-31  2:33                             ` Bjorn Helgaas
  0 siblings, 2 replies; 38+ messages in thread
From: Alexander Duyck @ 2018-07-30 15:02 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Moshe Shemesh, Jiri Pirko, Jakub Kicinski, Eran Ben Elisha,
	Saeed Mahameed, David S. Miller, netdev, linux-pci

On Mon, Jul 30, 2018 at 7:07 AM, Bjorn Helgaas <helgaas@kernel.org> wrote:
> On Sun, Jul 29, 2018 at 03:00:28PM -0700, Alexander Duyck wrote:
>> On Sun, Jul 29, 2018 at 2:23 AM, Moshe Shemesh <moshes20.il@gmail.com> wrote:
>> > On Sat, Jul 28, 2018 at 7:06 PM, Bjorn Helgaas <helgaas@kernel.org> wrote:
>> >> On Thu, Jul 26, 2018 at 07:00:20AM -0700, Alexander Duyck wrote:
>> >> > On Thu, Jul 26, 2018 at 12:14 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> >> > > Thu, Jul 26, 2018 at 02:43:59AM CEST, jakub.kicinski@netronome.com
>> >> > > wrote:
>> >> > >>On Wed, 25 Jul 2018 08:23:26 -0700, Alexander Duyck wrote:
>> >> > >>> On Wed, Jul 25, 2018 at 5:31 AM, Eran Ben Elisha wrote:
>> >> > >>> > On 7/24/2018 10:51 PM, Jakub Kicinski wrote:
>> >> > >>> >>>> The devlink params haven't been upstream even for a full cycle
>> >> > >>> >>>> and
>> >> > >>> >>>> already you guys are starting to use them to configure standard
>> >> > >>> >>>> features like queuing.
>> >> > >>> >>>
>> >> > >>> >>> We developed the devlink params in order to support non-standard
>> >> > >>> >>> configuration only. And for non-standard, there are generic and
>> >> > >>> >>> vendor
>> >> > >>> >>> specific options.
>> >> > >>> >>
>> >> > >>> >> I thought it was developed for performing non-standard and
>> >> > >>> >> possibly
>> >> > >>> >> vendor specific configuration.  Look at DEVLINK_PARAM_GENERIC_*
>> >> > >>> >> for
>> >> > >>> >> examples of well justified generic options for which we have no
>> >> > >>> >> other API.  The vendor mlx4 options look fairly vendor specific
>> >> > >>> >> if you
>> >> > >>> >> ask me, too.
>> >> > >>> >>
>> >> > >>> >> Configuring queuing has an API.  The question is it acceptable to
>> >> > >>> >> enter
>> >> > >>> >> into the risky territory of controlling offloads via devlink
>> >> > >>> >> parameters
>> >> > >>> >> or would we rather make vendors take the time and effort to model
>> >> > >>> >> things to (a subset) of existing APIs.  The HW never fits the
>> >> > >>> >> APIs
>> >> > >>> >> perfectly.
>> >> > >>> >
>> >> > >>> > I understand what you meant here, I would like to highlight that
>> >> > >>> > this
>> >> > >>> > mechanism was not meant to handle SRIOV, Representors, etc.
>> >> > >>> > The vendor specific configuration suggested here is to handle a
>> >> > >>> > congestion
>> >> > >>> > state in Multi Host environment (which includes PF and multiple
>> >> > >>> > VFs per
>> >> > >>> > host), where one host is not aware to the other hosts, and each is
>> >> > >>> > running
>> >> > >>> > on its own pci/driver. It is a device working mode configuration.
>> >> > >>> >
>> >> > >>> > This  couldn't fit into any existing API, thus creating this
>> >> > >>> > vendor specific
>> >> > >>> > unique API is needed.
>> >> > >>>
>> >> > >>> If we are just going to start creating devlink interfaces in for
>> >> > >>> every
>> >> > >>> one-off option a device wants to add why did we even bother with
>> >> > >>> trying to prevent drivers from using sysfs? This just feels like we
>> >> > >>> are back to the same arguments we had back in the day with it.
>> >> > >>>
>> >> > >>> I feel like the bigger question here is if devlink is how we are
>> >> > >>> going
>> >> > >>> to deal with all PCIe related features going forward, or should we
>> >> > >>> start looking at creating a new interface/tool for PCI/PCIe related
>> >> > >>> features? My concern is that we have already had features such as
>> >> > >>> DMA
>> >> > >>> Coalescing that didn't really fit into anything and now we are
>> >> > >>> starting to see other things related to DMA and PCIe bus credits.
>> >> > >>> I'm
>> >> > >>> wondering if we shouldn't start looking at a tool/interface to
>> >> > >>> configure all the PCIe related features such as interrupts, error
>> >> > >>> reporting, DMA configuration, power management, etc. Maybe we could
>> >> > >>> even look at sharing it across subsystems and include things like
>> >> > >>> storage, graphics, and other subsystems in the conversation.
>> >> > >>
>> >> > >>Agreed, for actual PCIe configuration (i.e. not ECN marking) we do
>> >> > >> need
>> >> > >>to build up an API.  Sharing it across subsystems would be very cool!
>> >>
>> >> I read the thread (starting at [1], for anybody else coming in late)
>> >> and I see this has something to do with "configuring outbound PCIe
>> >> buffers", but I haven't seen the connection to PCIe protocol or
>> >> features, i.e., I can't connect this to anything in the PCIe spec.
>> >>
>> >> Can somebody help me understand how the PCI core is relevant?  If
>> >> there's some connection with a feature defined by PCIe, or if it
>> >> affects the PCIe transaction protocol somehow, I'm definitely
>> >> interested in this.  But if this only affects the data transferred
>> >> over PCIe, i.e., the data payloads of PCIe TLP packets, then I'm not
>> >> sure why the PCI core should care.
>> >>
>> >
>> >
>> > As you wrote, this is not a PCIe feature  or affects the PCIe transaction
>> > protocol.
>> >
>> > Actually, due to hardware limitation in current device, we have enabled a
>> > workaround in hardware.
>> >
>> > This mode is proprietary and not relevant to other PCIe devices, thus is set
>> > using driver-specific parameter in devlink
>>
>> Essentially what this feature is doing is communicating the need for
>> PCIe back-pressure to the network fabric. So as the buffers on the
>> device start to fill because the device isn't able to get back PCIe
>> credits fast enough it will then start to send congestion
>> notifications to the network stack itself if I understand this
>> correctly.
>
> This sounds like a hook that allows the device to tell its driver
> about PCIe flow control credits, and the driver can pass that on to
> the network stack.  IIUC, that would be a device-specific feature
> outside the scope of the PCI core.
>
>> For now there are no major conflicts, but when we start getting into
>> stuff like PCIe DMA coalescing, and on a more general basis just PCIe
>> active state power management that is going to start making things
>> more complicated going forward.
>
> We do support ASPM already in the PCI core, and we do have the
> pci_disable_link_state() interface, which is currently the only way
> drivers can influence it.  There are several drivers that do their own
> ASPM configuration, but this is not safe because it's not coordinated
> with what the PCI core does.  If/when drivers need more control, we
> should enhance the PCI core interfaces.

This is kind of what I was getting at. It would be useful to have an
interface of some sort so that drivers get notified when a user is
making changes to configuration space and I don't know if anything
like that exists now.

> I don't know what PCIe DMA coalescing means, so I can't comment on
> that.

There are devices, specifically network devices, that will hold off on
switching from L0s or L1 back to L0 by deferring DMA operations.
Basically the idea is to hold off bringing the link up for as long as
possible in order to maximize the power savings from the ASPM state.
This is something that has come up in the past, and I don't know if
there has been any interface determined for how to handle this sort of
configuration. Most of it occurs through MMIO.
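The trade-off can be modeled roughly as a flush decision: stay in the low-power state until either enough work has accumulated to amortize the link exit latency, or a latency budget expires. All names and thresholds here are illustrative, not taken from any real driver:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of DMA coalescing: queue descriptors while the link sits in
 * L0s/L1, and wake the link once per batch rather than per descriptor.
 */
struct dma_coalesce {
	uint32_t pending_bytes;	/* queued but not yet DMA'd */
	uint32_t byte_thresh;	/* flush when this much work is queued */
	uint64_t deadline_ns;	/* ...or when the oldest work gets this stale */
	uint64_t oldest_ns;	/* timestamp of the oldest queued descriptor */
};

static bool should_wake_link(const struct dma_coalesce *c, uint64_t now_ns)
{
	if (!c->pending_bytes)
		return false;	/* nothing to do, stay in low power */
	if (c->pending_bytes >= c->byte_thresh)
		return true;	/* batch big enough to amortize exit latency */
	return now_ns - c->oldest_ns >= c->deadline_ns;
}
```

The ASPM interaction is visible in the last check: the larger the exit latency Linux configures, the larger `deadline_ns` has to be for coalescing to pay off, which is exactly the coordination problem being raised here.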

>> I assume the devices we are talking about supporting this new feature
>> on either don't deal with ASPM or assume a quick turnaround to get out
>> of the lower power states? Otherwise that would definitely cause some
>> back-pressure buildups that would hurt performance.
>
> Devices can communicate the ASPM exit latency they can tolerate via
> the Device Capabilities register (PCIe r4.0, sec 7.5.3.3).  Linux
> should be configuring ASPM to respect those device requirements.
>
> Bjorn

Right. But my point was that something like ASPM will add extra
complexity to a feature such as what has been described here. My
concern is that I don't want us implementing stuff on a per-driver
basis that is not all that unique to the device. I don't really see the
feature that was described above as being something that will stay
specific to this one device for very long, especially if it provides
added value. Basically all it is doing is exposing PCIe congestion
management to upper levels in the network stack. I don't even
necessarily see it as being networking specific, as I would imagine
there might be other types of devices that could make use of knowing
how many transactions and such they could process at the same time.

- Alex


* Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink
  2018-07-30 15:02                           ` Alexander Duyck
@ 2018-07-30 22:00                             ` Jakub Kicinski
  2018-07-31  2:33                             ` Bjorn Helgaas
  1 sibling, 0 replies; 38+ messages in thread
From: Jakub Kicinski @ 2018-07-30 22:00 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Bjorn Helgaas, Moshe Shemesh, Jiri Pirko, Eran Ben Elisha,
	Saeed Mahameed, David S. Miller, netdev, linux-pci

On Mon, 30 Jul 2018 08:02:48 -0700, Alexander Duyck wrote:
> On Mon, Jul 30, 2018 at 7:07 AM, Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Sun, Jul 29, 2018 at 03:00:28PM -0700, Alexander Duyck wrote:  
> >> On Sun, Jul 29, 2018 at 2:23 AM, Moshe Shemesh <moshes20.il@gmail.com> wrote:  
> >> > On Sat, Jul 28, 2018 at 7:06 PM, Bjorn Helgaas <helgaas@kernel.org> wrote:  
> >> >> On Thu, Jul 26, 2018 at 07:00:20AM -0700, Alexander Duyck wrote:  
> >> >> > On Thu, Jul 26, 2018 at 12:14 AM, Jiri Pirko <jiri@resnulli.us> wrote:  
> >> >> > > Thu, Jul 26, 2018 at 02:43:59AM CEST, jakub.kicinski@netronome.com wrote:  
> >> >> > >>On Wed, 25 Jul 2018 08:23:26 -0700, Alexander Duyck wrote:  
> >> >> > >>> On Wed, Jul 25, 2018 at 5:31 AM, Eran Ben Elisha wrote:  
> >> >> > >>> > On 7/24/2018 10:51 PM, Jakub Kicinski wrote:  
> >> >> > >>> >>>> The devlink params haven't been upstream even for a full cycle
> >> >> > >>> >>>> and
> >> >> > >>> >>>> already you guys are starting to use them to configure standard
> >> >> > >>> >>>> features like queuing.  
> >> >> > >>> >>>
> >> >> > >>> >>> We developed the devlink params in order to support non-standard
> >> >> > >>> >>> configuration only. And for non-standard, there are generic and
> >> >> > >>> >>> vendor
> >> >> > >>> >>> specific options.  
> >> >> > >>> >>
> >> >> > >>> >> I thought it was developed for performing non-standard and
> >> >> > >>> >> possibly
> >> >> > >>> >> vendor specific configuration.  Look at DEVLINK_PARAM_GENERIC_*
> >> >> > >>> >> for
> >> >> > >>> >> examples of well justified generic options for which we have no
> >> >> > >>> >> other API.  The vendor mlx4 options look fairly vendor specific
> >> >> > >>> >> if you
> >> >> > >>> >> ask me, too.
> >> >> > >>> >>
> >> >> > >>> >> Configuring queuing has an API.  The question is it acceptable to
> >> >> > >>> >> enter
> >> >> > >>> >> into the risky territory of controlling offloads via devlink
> >> >> > >>> >> parameters
> >> >> > >>> >> or would we rather make vendors take the time and effort to model
> >> >> > >>> >> things to (a subset) of existing APIs.  The HW never fits the
> >> >> > >>> >> APIs
> >> >> > >>> >> perfectly.  
> >> >> > >>> >
> >> >> > >>> > I understand what you meant here, I would like to highlight that
> >> >> > >>> > this
> >> >> > >>> > mechanism was not meant to handle SRIOV, Representors, etc.
> >> >> > >>> > The vendor specific configuration suggested here is to handle a
> >> >> > >>> > congestion
> >> >> > >>> > state in Multi Host environment (which includes PF and multiple
> >> >> > >>> > VFs per
> >> >> > >>> > host), where one host is not aware to the other hosts, and each is
> >> >> > >>> > running
> >> >> > >>> > on its own pci/driver. It is a device working mode configuration.
> >> >> > >>> >
> >> >> > >>> > This  couldn't fit into any existing API, thus creating this
> >> >> > >>> > vendor specific
> >> >> > >>> > unique API is needed.  
> >> >> > >>>
> >> >> > >>> If we are just going to start creating devlink interfaces in for
> >> >> > >>> every
> >> >> > >>> one-off option a device wants to add why did we even bother with
> >> >> > >>> trying to prevent drivers from using sysfs? This just feels like we
> >> >> > >>> are back to the same arguments we had back in the day with it.
> >> >> > >>>
> >> >> > >>> I feel like the bigger question here is if devlink is how we are
> >> >> > >>> going
> >> >> > >>> to deal with all PCIe related features going forward, or should we
> >> >> > >>> start looking at creating a new interface/tool for PCI/PCIe related
> >> >> > >>> features? My concern is that we have already had features such as
> >> >> > >>> DMA
> >> >> > >>> Coalescing that didn't really fit into anything and now we are
> >> >> > >>> starting to see other things related to DMA and PCIe bus credits.
> >> >> > >>> I'm
> >> >> > >>> wondering if we shouldn't start looking at a tool/interface to
> >> >> > >>> configure all the PCIe related features such as interrupts, error
> >> >> > >>> reporting, DMA configuration, power management, etc. Maybe we could
> >> >> > >>> even look at sharing it across subsystems and include things like
> >> >> > >>> storage, graphics, and other subsystems in the conversation.  
> >> >> > >>
> >> >> > >>Agreed, for actual PCIe configuration (i.e. not ECN marking) we do
> >> >> > >> need
> >> >> > >>to build up an API.  Sharing it across subsystems would be very cool!  
> >> >>
> >> >> I read the thread (starting at [1], for anybody else coming in late)
> >> >> and I see this has something to do with "configuring outbound PCIe
> >> >> buffers", but I haven't seen the connection to PCIe protocol or
> >> >> features, i.e., I can't connect this to anything in the PCIe spec.
> >> >>
> >> >> Can somebody help me understand how the PCI core is relevant?  If
> >> >> there's some connection with a feature defined by PCIe, or if it
> >> >> affects the PCIe transaction protocol somehow, I'm definitely
> >> >> interested in this.  But if this only affects the data transferred
> >> >> over PCIe, i.e., the data payloads of PCIe TLP packets, then I'm not
> >> >> sure why the PCI core should care.
> >> >>  
> >> >
> >> >
> >> > As you wrote, this is not a PCIe feature  or affects the PCIe transaction
> >> > protocol.
> >> >
> >> > Actually, due to hardware limitation in current device, we have enabled a
> >> > workaround in hardware.
> >> >
> >> > This mode is proprietary and not relevant to other PCIe devices, thus is set
> >> > using driver-specific parameter in devlink  
> >>
> >> Essentially what this feature is doing is communicating the need for
> >> PCIe back-pressure to the network fabric. So as the buffers on the
> >> device start to fill because the device isn't able to get back PCIe
> >> credits fast enough it will then start to send congestion
> >> notifications to the network stack itself if I understand this
> >> correctly.  
> >
> > This sounds like a hook that allows the device to tell its driver
> > about PCIe flow control credits, and the driver can pass that on to
> > the network stack.  IIUC, that would be a device-specific feature
> > outside the scope of the PCI core.

Hm, I might be wrong but AFAIU the patch which sparked the discussion
does not go all the way down to PCIe flow control.  The PCIe layer
works at the max possible rate (single VC etc.), but there is a
mismatch between network and PCIe speed.  E.g. with a 2x40GE NIC on an
x8 PCIe v3 link (63 Gbps) there can be more traffic flowing in than the
PCIe bus will be able to transfer to the host.  From a networking ASIC
perspective it's a fairly typical problem of dealing with mismatched
port speeds, incast, etc.  GPUs or storage devices will not have this
problem; it will only happen with non-flow-controlled network
technologies, i.e. netdevs.  It so happens the device is on a PCIe bus,
but the same problem could happen on SPI or any other bus.
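As a sanity check of the 63 Gbps figure above: PCIe gen3 runs 8 GT/s per lane with 128b/130b encoding, so an x8 link carries about 63 Gbps of raw payload, against 80 Gbps potentially arriving on a 2x40GE NIC. (Real host throughput is lower still once TLP headers and flow-control overhead are counted.)

```c
#include <assert.h>

/* Effective bit rate of a PCIe link after line encoding overhead. */
static double pcie_link_gbps(int lanes, double gts_per_lane)
{
	return lanes * gts_per_lane * 128.0 / 130.0;	/* 128b/130b */
}
```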

Having said that a PCIe configuration API seems to continue to come up.
Examples Alex gives below seem very valid (AFAIU them).

> >> For now there are no major conflicts, but when we start getting into
> >> stuff like PCIe DMA coalescing, and on a more general basis just PCIe
> >> active state power management that is going to start making things
> >> more complicated going forward.  
> >
> > We do support ASPM already in the PCI core, and we do have the
> > pci_disable_link_state() interface, which is currently the only way
> > drivers can influence it.  There are several drivers that do their own
> > ASPM configuration, but this is not safe because it's not coordinated
> > with what the PCI core does.  If/when drivers need more control, we
> > should enhance the PCI core interfaces.  
> 
> This is kind of what I was getting at. It would be useful to have an
> interface of some sort so that drivers get notified when a user is
> making changes to configuration space and I don't know if anything
> like that exists now.
> 
> > I don't know what PCIe DMA coalescing means, so I can't comment on
> > that.  
> 
> There are devices, specifically network devices, that will hold off on
> switching between either L0s or L1 and L0 by deferring DMA operations.
> Basically the idea is supposed to be to hold off bringing the link up
> for as long as possible in order to maximize power savings for the
> ASPM state. This is something that has come up in the past, and I
> don't know if there has been any interface determined for how to
> handle this sort of configuration. Most of it occurs through MMIO.
> 
> >> I assume the devices we are talking about supporting this new feature
> >> on either don't deal with ASPM or assume a quick turnaround to get out
> >> of the lower power states? Otherwise that would definitely cause some
> >> back-pressure buildups that would hurt performance.  
> >
> > Devices can communicate the ASPM exit latency they can tolerate via
> > the Device Capabilities register (PCIe r4.0, sec 7.5.3.3).  Linux
> > should be configuring ASPM to respect those device requirements.
> >
> > Bjorn  
> 
> Right. But my  point was something like ASPM will add extra complexity
> to a feature such as what has been described here. My concern is that
> I don't want us implementing stuff on a per-driver basis that is not
> all that unique to the device. I don't really see the feature that was
> described above as being something that will stay specific to this one
> device for very long, especially if it provides added value. Basically
> all it is doing is allowing exposing PCIe congestion management to
> upper levels in the network stack. I don't even necessarily see it as
> being networking specific as I would imagine there might be other
> types of devices that could make use of knowing how many transactions
> and such they could process at the same time.
> 
> - Alex


* Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink
  2018-07-30 15:02                           ` Alexander Duyck
  2018-07-30 22:00                             ` Jakub Kicinski
@ 2018-07-31  2:33                             ` Bjorn Helgaas
  2018-07-31  3:19                               ` Alexander Duyck
  1 sibling, 1 reply; 38+ messages in thread
From: Bjorn Helgaas @ 2018-07-31  2:33 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Moshe Shemesh, Jiri Pirko, Jakub Kicinski, Eran Ben Elisha,
	Saeed Mahameed, David S. Miller, netdev, linux-pci

On Mon, Jul 30, 2018 at 08:02:48AM -0700, Alexander Duyck wrote:
> On Mon, Jul 30, 2018 at 7:07 AM, Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Sun, Jul 29, 2018 at 03:00:28PM -0700, Alexander Duyck wrote:
> >> On Sun, Jul 29, 2018 at 2:23 AM, Moshe Shemesh <moshes20.il@gmail.com> wrote:
> >> > On Sat, Jul 28, 2018 at 7:06 PM, Bjorn Helgaas <helgaas@kernel.org> wrote:
> >> >> On Thu, Jul 26, 2018 at 07:00:20AM -0700, Alexander Duyck wrote:
> >> >> > On Thu, Jul 26, 2018 at 12:14 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> > > Thu, Jul 26, 2018 at 02:43:59AM CEST, jakub.kicinski@netronome.com
> >> >> > > wrote:
> >> >> > >>On Wed, 25 Jul 2018 08:23:26 -0700, Alexander Duyck wrote:
> >> >> > >>> On Wed, Jul 25, 2018 at 5:31 AM, Eran Ben Elisha wrote:
> >> >> > >>> > On 7/24/2018 10:51 PM, Jakub Kicinski wrote:
> >> >> > >>> >>>> The devlink params haven't been upstream even for a full cycle
> >> >> > >>> >>>> and
> >> >> > >>> >>>> already you guys are starting to use them to configure standard
> >> >> > >>> >>>> features like queuing.
> >> >> > >>> >>>
> >> >> > >>> >>> We developed the devlink params in order to support non-standard
> >> >> > >>> >>> configuration only. And for non-standard, there are generic and
> >> >> > >>> >>> vendor
> >> >> > >>> >>> specific options.
> >> >> > >>> >>
> >> >> > >>> >> I thought it was developed for performing non-standard and
> >> >> > >>> >> possibly
> >> >> > >>> >> vendor specific configuration.  Look at DEVLINK_PARAM_GENERIC_*
> >> >> > >>> >> for
> >> >> > >>> >> examples of well justified generic options for which we have no
> >> >> > >>> >> other API.  The vendor mlx4 options look fairly vendor specific
> >> >> > >>> >> if you
> >> >> > >>> >> ask me, too.
> >> >> > >>> >>
> >> >> > >>> >> Configuring queuing has an API.  The question is it acceptable to
> >> >> > >>> >> enter
> >> >> > >>> >> into the risky territory of controlling offloads via devlink
> >> >> > >>> >> parameters
> >> >> > >>> >> or would we rather make vendors take the time and effort to model
> >> >> > >>> >> things to (a subset) of existing APIs.  The HW never fits the
> >> >> > >>> >> APIs
> >> >> > >>> >> perfectly.
> >> >> > >>> >
> >> >> > >>> > I understand what you meant here, I would like to highlight that
> >> >> > >>> > this
> >> >> > >>> > mechanism was not meant to handle SRIOV, Representors, etc.
> >> >> > >>> > The vendor specific configuration suggested here is to handle a
> >> >> > >>> > congestion
> >> >> > >>> > state in Multi Host environment (which includes PF and multiple
> >> >> > >>> > VFs per
> >> >> > >>> > host), where one host is not aware to the other hosts, and each is
> >> >> > >>> > running
> >> >> > >>> > on its own pci/driver. It is a device working mode configuration.
> >> >> > >>> >
> >> >> > >>> > This  couldn't fit into any existing API, thus creating this
> >> >> > >>> > vendor specific
> >> >> > >>> > unique API is needed.
> >> >> > >>>
> >> >> > >>> If we are just going to start creating devlink interfaces in for
> >> >> > >>> every
> >> >> > >>> one-off option a device wants to add why did we even bother with
> >> >> > >>> trying to prevent drivers from using sysfs? This just feels like we
> >> >> > >>> are back to the same arguments we had back in the day with it.
> >> >> > >>>
> >> >> > >>> I feel like the bigger question here is if devlink is how we are
> >> >> > >>> going
> >> >> > >>> to deal with all PCIe related features going forward, or should we
> >> >> > >>> start looking at creating a new interface/tool for PCI/PCIe related
> >> >> > >>> features? My concern is that we have already had features such as
> >> >> > >>> DMA
> >> >> > >>> Coalescing that didn't really fit into anything and now we are
> >> >> > >>> starting to see other things related to DMA and PCIe bus credits.
> >> >> > >>> I'm
> >> >> > >>> wondering if we shouldn't start looking at a tool/interface to
> >> >> > >>> configure all the PCIe related features such as interrupts, error
> >> >> > >>> reporting, DMA configuration, power management, etc. Maybe we could
> >> >> > >>> even look at sharing it across subsystems and include things like
> >> >> > >>> storage, graphics, and other subsystems in the conversation.
> >> >> > >>
> >> >> > >>Agreed, for actual PCIe configuration (i.e. not ECN marking) we do
> >> >> > >> need
> >> >> > >>to build up an API.  Sharing it across subsystems would be very cool!
> >> >>
> >> >> I read the thread (starting at [1], for anybody else coming in late)
> >> >> and I see this has something to do with "configuring outbound PCIe
> >> >> buffers", but I haven't seen the connection to PCIe protocol or
> >> >> features, i.e., I can't connect this to anything in the PCIe spec.
> >> >>
> >> >> Can somebody help me understand how the PCI core is relevant?  If
> >> >> there's some connection with a feature defined by PCIe, or if it
> >> >> affects the PCIe transaction protocol somehow, I'm definitely
> >> >> interested in this.  But if this only affects the data transferred
> >> >> over PCIe, i.e., the data payloads of PCIe TLP packets, then I'm not
> >> >> sure why the PCI core should care.
> >> >
> >> > As you wrote, this is not a PCIe feature  or affects the PCIe transaction
> >> > protocol.
> >> >
> >> > Actually, due to hardware limitation in current device, we have enabled a
> >> > workaround in hardware.
> >> >
> >> > This mode is proprietary and not relevant to other PCIe devices, thus is set
> >> > using driver-specific parameter in devlink
> >>
> >> Essentially what this feature is doing is communicating the need for
> >> PCIe back-pressure to the network fabric. So as the buffers on the
> >> device start to fill because the device isn't able to get back PCIe
> >> credits fast enough it will then start to send congestion
> >> notifications to the network stack itself if I understand this
> >> correctly.
> >
> > This sounds like a hook that allows the device to tell its driver
> > about PCIe flow control credits, and the driver can pass that on to
> > the network stack.  IIUC, that would be a device-specific feature
> > outside the scope of the PCI core.
> >
> >> For now there are no major conflicts, but when we start getting into
> >> stuff like PCIe DMA coalescing, and on a more general basis just PCIe
> >> active state power management that is going to start making things
> >> more complicated going forward.
> >
> > We do support ASPM already in the PCI core, and we do have the
> > pci_disable_link_state() interface, which is currently the only way
> > drivers can influence it.  There are several drivers that do their own
> > ASPM configuration, but this is not safe because it's not coordinated
> > with what the PCI core does.  If/when drivers need more control, we
> > should enhance the PCI core interfaces.
> 
> This is kind of what I was getting at. It would be useful to have an
> interface of some sort so that drivers get notified when a user is
> making changes to configuration space and I don't know if anything
> like that exists now.

You mean something like this?

  - driver registers a callback with PCI core
  - user runs setpci, which writes PCI config space via sysfs
  - kernel hook in pci_write_config() notices write and calls driver
    callback
  - driver callback receives config address and data written
  - driver parses PCI capability lists to identify register

Nothing like that exists today, and this is not what I had in mind by
"enhance the PCI core interfaces".  I'm not sure what the utility of
this is (but I'm not a networking guy by any means).

I think it's a bad idea for drivers to directly write config space.
It would be much better to provide a PCI core interface so we can
implement things once and coordinate things that need to be
coordinated.
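For concreteness, the notification flow sketched above could be modeled like this. Everything here is hypothetical; no such notifier interface exists in the PCI core today, and the names are invented:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace model of the sketch: a driver registers a callback, and a
 * hook in the config-space write path calls it with the offset and
 * value written.  One device and one notifier, for illustration.
 */
typedef void (*cfg_write_cb)(void *priv, int offset, uint32_t val);

static struct {
	cfg_write_cb cb;
	void *priv;
} notifier;

static void cfg_notifier_register(cfg_write_cb cb, void *priv)
{
	notifier.cb = cb;
	notifier.priv = priv;
}

/* Stand-in for pci_write_config(): store the value, then notify. */
static void cfg_write(uint32_t *cfg_space, int offset, uint32_t val)
{
	cfg_space[offset / 4] = val;
	if (notifier.cb)
		notifier.cb(notifier.priv, offset, val);
}

/* Example driver callback: remember the last write seen. */
static int last_offset;
static uint32_t last_val;
static void record_cb(void *priv, int offset, uint32_t val)
{
	(void)priv;
	last_offset = offset;
	last_val = val;
}
```

The hard part, as noted, is not the plumbing but the last step of the sketch: the driver callback would still have to parse capability lists to work out which register the user touched.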

> > I don't know what PCIe DMA coalescing means, so I can't comment on
> > that.
> 
> There are devices, specifically network devices, that will hold off on
> switching between either L0s or L1 and L0 by deferring DMA operations.
> Basically the idea is supposed to be to hold off bringing the link up
> for as long as possible in order to maximize power savings for the
> ASPM state. This is something that has come up in the past, and I
> don't know if there has been any interface determined for how to
> handle this sort of configuration. Most of it occurs through MMIO.

The device can certainly delay L0s or L1 exit if it wants to.  If
there are knobs to control this, e.g., how long it can defer a DMA, it
makes sense that they would be device-specific and in MMIO space.  The
PCI core can't be involved in that because in general it knows nothing
about the contents of MMIO BARs.  Presumably those knobs would work
within the framework of ASPM as defined by the PCIe spec, e.g., if we
disable ASPM, the knobs would do nothing because there is no L0s or L1
at all.

That's not to say that device designers couldn't get together and
define a common model for such knobs that puts them at well-known
offsets in well-known BARs.  All I'm saying is that this sounds like
it's currently outside the realm of the PCI specs and the PCI core.

> >> I assume the devices we are talking about supporting this new feature
> >> on either don't deal with ASPM or assume a quick turnaround to get out
> >> of the lower power states? Otherwise that would definitely cause some
> >> back-pressure buildups that would hurt performance.
> >
> > Devices can communicate the ASPM exit latency they can tolerate via
> > the Device Capabilities register (PCIe r4.0, sec 7.5.3.3).  Linux
> > should be configuring ASPM to respect those device requirements.
> 
> Right. But my point was something like ASPM will add extra complexity
> to a feature such as what has been described here. My concern is that
> I don't want us implementing stuff on a per-driver basis that is not
> all that unique to the device. I don't really see the feature that was
> described above as being something that will stay specific to this one
> device for very long, especially if it provides added value. Basically
> all it is doing is allowing exposing PCIe congestion management to
> upper levels in the network stack. I don't even necessarily see it as
> being networking specific as I would imagine there might be other
> types of devices that could make use of knowing how many transactions
> and such they could process at the same time.

It sounds to me like you need a library that can be used by all the
drivers that need this functionality.  Unless it's something
documented in the PCIe specs so we can rely on a standard way of
discovering and configuring things, I don't see how the PCI core can
really be involved.

Bjorn


* Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink
  2018-07-31  2:33                             ` Bjorn Helgaas
@ 2018-07-31  3:19                               ` Alexander Duyck
  2018-07-31 11:06                                 ` Bjorn Helgaas
  0 siblings, 1 reply; 38+ messages in thread
From: Alexander Duyck @ 2018-07-31  3:19 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Moshe Shemesh, Jiri Pirko, Jakub Kicinski, Eran Ben Elisha,
	Saeed Mahameed, David S. Miller, netdev, linux-pci

On Mon, Jul 30, 2018 at 7:33 PM, Bjorn Helgaas <helgaas@kernel.org> wrote:
> On Mon, Jul 30, 2018 at 08:02:48AM -0700, Alexander Duyck wrote:
>> On Mon, Jul 30, 2018 at 7:07 AM, Bjorn Helgaas <helgaas@kernel.org> wrote:
>> > On Sun, Jul 29, 2018 at 03:00:28PM -0700, Alexander Duyck wrote:
>> >> On Sun, Jul 29, 2018 at 2:23 AM, Moshe Shemesh <moshes20.il@gmail.com> wrote:
>> >> > On Sat, Jul 28, 2018 at 7:06 PM, Bjorn Helgaas <helgaas@kernel.org> wrote:
>> >> >> On Thu, Jul 26, 2018 at 07:00:20AM -0700, Alexander Duyck wrote:
>> >> >> > On Thu, Jul 26, 2018 at 12:14 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> > > Thu, Jul 26, 2018 at 02:43:59AM CEST, jakub.kicinski@netronome.com
>> >> >> > > wrote:
>> >> >> > >>On Wed, 25 Jul 2018 08:23:26 -0700, Alexander Duyck wrote:
>> >> >> > >>> On Wed, Jul 25, 2018 at 5:31 AM, Eran Ben Elisha wrote:
>> >> >> > >>> > On 7/24/2018 10:51 PM, Jakub Kicinski wrote:
>> >> >> > >>> >>>> The devlink params haven't been upstream even for a full cycle
>> >> >> > >>> >>>> and
>> >> >> > >>> >>>> already you guys are starting to use them to configure standard
>> >> >> > >>> >>>> features like queuing.
>> >> >> > >>> >>>
>> >> >> > >>> >>> We developed the devlink params in order to support non-standard
>> >> >> > >>> >>> configuration only. And for non-standard, there are generic and
>> >> >> > >>> >>> vendor
>> >> >> > >>> >>> specific options.
>> >> >> > >>> >>
>> >> >> > >>> >> I thought it was developed for performing non-standard and
>> >> >> > >>> >> possibly
>> >> >> > >>> >> vendor specific configuration.  Look at DEVLINK_PARAM_GENERIC_*
>> >> >> > >>> >> for
>> >> >> > >>> >> examples of well justified generic options for which we have no
>> >> >> > >>> >> other API.  The vendor mlx4 options look fairly vendor specific
>> >> >> > >>> >> if you
>> >> >> > >>> >> ask me, too.
>> >> >> > >>> >>
>> >> >> > >>> >> Configuring queuing has an API.  The question is whether it is
>> >> >> > >>> >> acceptable to enter
>> >> >> > >>> >> into the risky territory of controlling offloads via devlink
>> >> >> > >>> >> parameters
>> >> >> > >>> >> or would we rather make vendors take the time and effort to model
>> >> >> > >>> >> things to (a subset) of existing APIs.  The HW never fits the
>> >> >> > >>> >> APIs
>> >> >> > >>> >> perfectly.
>> >> >> > >>> >
>> >> >> > >>> > I understand what you meant here, I would like to highlight that
>> >> >> > >>> > this
>> >> >> > >>> > mechanism was not meant to handle SRIOV, Representors, etc.
>> >> >> > >>> > The vendor specific configuration suggested here is to handle a
>> >> >> > >>> > congestion
>> >> >> > >>> > state in Multi Host environment (which includes PF and multiple
>> >> >> > >>> > VFs per
>> >> >> > >>> > host), where one host is not aware of the other hosts, and each is
>> >> >> > >>> > running
>> >> >> > >>> > on its own pci/driver. It is a device working mode configuration.
>> >> >> > >>> >
>> >> >> > >>> > This couldn't fit into any existing API, thus creating this
>> >> >> > >>> > vendor specific
>> >> >> > >>> > unique API is needed.
>> >> >> > >>>
>> >> >> > >>> If we are just going to start creating devlink interfaces for
>> >> >> > >>> every
>> >> >> > >>> one-off option a device wants to add why did we even bother with
>> >> >> > >>> trying to prevent drivers from using sysfs? This just feels like we
>> >> >> > >>> are back to the same arguments we had back in the day with it.
>> >> >> > >>>
>> >> >> > >>> I feel like the bigger question here is if devlink is how we are
>> >> >> > >>> going
>> >> >> > >>> to deal with all PCIe related features going forward, or should we
>> >> >> > >>> start looking at creating a new interface/tool for PCI/PCIe related
>> >> >> > >>> features? My concern is that we have already had features such as
>> >> >> > >>> DMA
>> >> >> > >>> Coalescing that didn't really fit into anything and now we are
>> >> >> > >>> starting to see other things related to DMA and PCIe bus credits.
>> >> >> > >>> I'm
>> >> >> > >>> wondering if we shouldn't start looking at a tool/interface to
>> >> >> > >>> configure all the PCIe related features such as interrupts, error
>> >> >> > >>> reporting, DMA configuration, power management, etc. Maybe we could
>> >> >> > >>> even look at sharing it across subsystems and include things like
>> >> >> > >>> storage, graphics, and other subsystems in the conversation.
>> >> >> > >>
>> >> >> > >>Agreed, for actual PCIe configuration (i.e. not ECN marking) we do
>> >> >> > >> need
>> >> >> > >>to build up an API.  Sharing it across subsystems would be very cool!
>> >> >>
>> >> >> I read the thread (starting at [1], for anybody else coming in late)
>> >> >> and I see this has something to do with "configuring outbound PCIe
>> >> >> buffers", but I haven't seen the connection to PCIe protocol or
>> >> >> features, i.e., I can't connect this to anything in the PCIe spec.
>> >> >>
>> >> >> Can somebody help me understand how the PCI core is relevant?  If
>> >> >> there's some connection with a feature defined by PCIe, or if it
>> >> >> affects the PCIe transaction protocol somehow, I'm definitely
>> >> >> interested in this.  But if this only affects the data transferred
>> >> >> over PCIe, i.e., the data payloads of PCIe TLP packets, then I'm not
>> >> >> sure why the PCI core should care.
>> >> >
>> >> > As you wrote, this is not a PCIe feature, nor does it affect the PCIe
>> >> > transaction protocol.
>> >> >
>> >> > Actually, due to a hardware limitation in the current device, we have
>> >> > enabled a workaround in hardware.
>> >> >
>> >> > This mode is proprietary and not relevant to other PCIe devices, thus it
>> >> > is set using a driver-specific parameter in devlink.
>> >>
>> >> Essentially what this feature is doing is communicating the need for
>> >> PCIe back-pressure to the network fabric. So as the buffers on the
>> >> device start to fill because the device isn't able to get back PCIe
>> >> credits fast enough it will then start to send congestion
>> >> notifications to the network stack itself if I understand this
>> >> correctly.
>> >
>> > This sounds like a hook that allows the device to tell its driver
>> > about PCIe flow control credits, and the driver can pass that on to
>> > the network stack.  IIUC, that would be a device-specific feature
>> > outside the scope of the PCI core.
>> >
>> >> For now there are no major conflicts, but when we start getting into
>> >> stuff like PCIe DMA coalescing, and on a more general basis just PCIe
>> >> active state power management that is going to start making things
>> >> more complicated going forward.
>> >
>> > We do support ASPM already in the PCI core, and we do have the
>> > pci_disable_link_state() interface, which is currently the only way
>> > drivers can influence it.  There are several drivers that do their own
>> > ASPM configuration, but this is not safe because it's not coordinated
>> > with what the PCI core does.  If/when drivers need more control, we
>> > should enhance the PCI core interfaces.
>>
>> This is kind of what I was getting at. It would be useful to have an
>> interface of some sort so that drivers get notified when a user is
>> making changes to configuration space and I don't know if anything
>> like that exists now.
>
> You mean something like this?
>
>   - driver registers a callback with PCI core
>   - user runs setpci, which writes PCI config space via sysfs
>   - kernel hook in pci_write_config() notices write and calls driver
>     callback
>   - driver callback receives config address and data written
>   - driver parses PCI capability lists to identify register
>
> Nothing like that exists today, and this is not what I had in mind by
> "enhance the PCI core interfaces".  I'm not sure what the utility of
> this is (but I'm not a networking guy by any means).

Well in general I have been wondering if setpci is really the cleanest
way to do any of this. I have found it can be a fast way to really
mess things up. For example using setpci to trigger an FLR is a fast
way to cripple an interface. I recall using that approach when testing
the fm10k driver to deal with surprise resets.

The problem is setpci is REALLY powerful. It lets you change a ton
that can impact the driver, and many drivers just read the config
space at the probe function and assume it is static (which in most
cases it is). That is why I was thinking that, as a small step, it might be
useful to have notifications delivered to the driver, if it has
registered on the interface, so it could prepare for or clean up after a
change to the PCI configuration space.

> I think it's a bad idea for drivers to directly write config space.
> It would be much better to provide a PCI core interface so we can
> implement things once and coordinate things that need to be
> coordinated.

I agree with that.

>> > I don't know what PCIe DMA coalescing means, so I can't comment on
>> > that.
>>
>> There are devices, specifically network devices, that will hold off on
>> switching between either L0s or L1 and L0 by deferring DMA operations.
>> Basically the idea is supposed to be to hold off bringing the link up
>> for as long as possible in order to maximize power savings for the
>> ASPM state. This is something that has come up in the past, and I
>> don't know if there has been any interface determined for how to
>> handle this sort of configuration. Most of it occurs through MMIO.
>
> The device can certainly delay L0s or L1 exit if it wants to.  If
> there are knobs to control this, e.g., how long it can defer a DMA, it
> makes sense that they would be device-specific and in MMIO space.  The
> PCI core can't be involved in that because in general it knows nothing
> about the contents of MMIO BARs.  Presumably those knobs would work
> within the framework of ASPM as defined by the PCIe spec, e.g., if we
> disable ASPM, the knobs would do nothing because there is no L0s or L1
> at all.
>
> That's not to say that device designers couldn't get together and
> define a common model for such knobs that puts them at well-known
> offsets in well-known BARs.  All I'm saying is that this sounds like
> it's currently outside the realm of the PCI specs and the PCI core.

Some of this was already handled in LTR and OBFF. I think the DMA
coalescing was meant to work either with those, or optionally on its
own. As far as being ASPM dependent or not I don't think it really
cared all that much other than knowing if ASPM was enabled and how
long it takes for a given device to come out of either of those
states. The general idea with DMA coalescing was to try and save power
on the CPU and then possibly the PCIe link if the value was set high
enough and ASPM was enabled.

Anyway we have kind of gotten off into a tangent as I was just citing
something that might end up having an interaction with a feature such
as notifying the stack that the Rx buffer on a given device has become
congested.

>> >> I assume the devices we are talking about supporting this new feature
>> >> on either don't deal with ASPM or assume a quick turnaround to get out
>> >> of the lower power states? Otherwise that would definitely cause some
>> >> back-pressure buildups that would hurt performance.
>> >
>> > Devices can communicate the ASPM exit latency they can tolerate via
>> > the Device Capabilities register (PCIe r4.0, sec 7.5.3.3).  Linux
>> > should be configuring ASPM to respect those device requirements.
>>
>> Right. But my point was something like ASPM will add extra complexity
>> to a feature such as what has been described here. My concern is that
>> I don't want us implementing stuff on a per-driver basis that is not
>> all that unique to the device. I don't really see the feature that was
>> described above as being something that will stay specific to this one
>> device for very long, especially if it provides added value. Basically
>> all it is doing is exposing PCIe congestion management to
>> upper levels in the network stack. I don't even necessarily see it as
>> being networking specific as I would imagine there might be other
>> types of devices that could make use of knowing how many transactions
>> and such they could process at the same time.
>
> It sounds to me like you need a library that can be used by all the
> drivers that need this functionality.  Unless it's something
> documented in the PCIe specs so we can rely on a standard way of
> discovering and configuring things, I don't see how the PCI core can
> really be involved.
>
> Bjorn

I am kind of thinking that a common library would be a preferred way
to go, or at least a common interface to share between the drivers. It
wasn't my intention to imply that the PCI core needed to get involved.
I was including the linux-pci list as more of a way of checking to get
the broader audience's input since we were just discussing this on
netdev.

My main concern is that those of us on netdev have historically been
treating features such as SR-IOV as a networking thing, when really it
has been a combination of a PCI feature and some network switching. I
might have ratholed this a bit, as I kind of see this topic as
something similar.

- Alex


* Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink
  2018-07-31  3:19                               ` Alexander Duyck
@ 2018-07-31 11:06                                 ` Bjorn Helgaas
  2018-08-01 18:28                                   ` Moshe Shemesh
  0 siblings, 1 reply; 38+ messages in thread
From: Bjorn Helgaas @ 2018-07-31 11:06 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Moshe Shemesh, Jiri Pirko, Jakub Kicinski, Eran Ben Elisha,
	Saeed Mahameed, David S. Miller, netdev, linux-pci

On Mon, Jul 30, 2018 at 08:19:50PM -0700, Alexander Duyck wrote:
> On Mon, Jul 30, 2018 at 7:33 PM, Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Mon, Jul 30, 2018 at 08:02:48AM -0700, Alexander Duyck wrote:
> >> On Mon, Jul 30, 2018 at 7:07 AM, Bjorn Helgaas <helgaas@kernel.org> wrote:
> >> > On Sun, Jul 29, 2018 at 03:00:28PM -0700, Alexander Duyck wrote:
> >> >> On Sun, Jul 29, 2018 at 2:23 AM, Moshe Shemesh <moshes20.il@gmail.com> wrote:
> >> >> > On Sat, Jul 28, 2018 at 7:06 PM, Bjorn Helgaas <helgaas@kernel.org> wrote:
> >> >> >> On Thu, Jul 26, 2018 at 07:00:20AM -0700, Alexander Duyck wrote:
> >> >> >> > On Thu, Jul 26, 2018 at 12:14 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >> > > Thu, Jul 26, 2018 at 02:43:59AM CEST, jakub.kicinski@netronome.com
> >> >> >> > > wrote:
> >> >> >> > >>On Wed, 25 Jul 2018 08:23:26 -0700, Alexander Duyck wrote:
> >> >> >> > >>> On Wed, Jul 25, 2018 at 5:31 AM, Eran Ben Elisha wrote:
> >> >> >> > >>> > On 7/24/2018 10:51 PM, Jakub Kicinski wrote:
> >> >> >> > >>> >>>> The devlink params haven't been upstream even for a full cycle
> >> >> >> > >>> >>>> and
> >> >> >> > >>> >>>> already you guys are starting to use them to configure standard
> >> >> >> > >>> >>>> features like queuing.
> >> >> >> > >>> >>>
> >> >> >> > >>> >>> We developed the devlink params in order to support non-standard
> >> >> >> > >>> >>> configuration only. And for non-standard, there are generic and
> >> >> >> > >>> >>> vendor
> >> >> >> > >>> >>> specific options.
> >> >> >> > >>> >>
> >> >> >> > >>> >> I thought it was developed for performing non-standard and
> >> >> >> > >>> >> possibly
> >> >> >> > >>> >> vendor specific configuration.  Look at DEVLINK_PARAM_GENERIC_*
> >> >> >> > >>> >> for
> >> >> >> > >>> >> examples of well justified generic options for which we have no
> >> >> >> > >>> >> other API.  The vendor mlx4 options look fairly vendor specific
> >> >> >> > >>> >> if you
> >> >> >> > >>> >> ask me, too.
> >> >> >> > >>> >>
> >> >> >> > >>> >> Configuring queuing has an API.  The question is whether it is
> >> >> >> > >>> >> acceptable to enter
> >> >> >> > >>> >> into the risky territory of controlling offloads via devlink
> >> >> >> > >>> >> parameters
> >> >> >> > >>> >> or would we rather make vendors take the time and effort to model
> >> >> >> > >>> >> things to (a subset) of existing APIs.  The HW never fits the
> >> >> >> > >>> >> APIs
> >> >> >> > >>> >> perfectly.
> >> >> >> > >>> >
> >> >> >> > >>> > I understand what you meant here, I would like to highlight that
> >> >> >> > >>> > this
> >> >> >> > >>> > mechanism was not meant to handle SRIOV, Representors, etc.
> >> >> >> > >>> > The vendor specific configuration suggested here is to handle a
> >> >> >> > >>> > congestion
> >> >> >> > >>> > state in Multi Host environment (which includes PF and multiple
> >> >> >> > >>> > VFs per
> >> >> >> > >>> > host), where one host is not aware of the other hosts, and each is
> >> >> >> > >>> > running
> >> >> >> > >>> > on its own pci/driver. It is a device working mode configuration.
> >> >> >> > >>> >
> >> >> >> > >>> > This couldn't fit into any existing API, thus creating this
> >> >> >> > >>> > vendor specific
> >> >> >> > >>> > unique API is needed.
> >> >> >> > >>>
> >> >> >> > >>> If we are just going to start creating devlink interfaces for
> >> >> >> > >>> every
> >> >> >> > >>> one-off option a device wants to add why did we even bother with
> >> >> >> > >>> trying to prevent drivers from using sysfs? This just feels like we
> >> >> >> > >>> are back to the same arguments we had back in the day with it.
> >> >> >> > >>>
> >> >> >> > >>> I feel like the bigger question here is if devlink is how we are
> >> >> >> > >>> going
> >> >> >> > >>> to deal with all PCIe related features going forward, or should we
> >> >> >> > >>> start looking at creating a new interface/tool for PCI/PCIe related
> >> >> >> > >>> features? My concern is that we have already had features such as
> >> >> >> > >>> DMA
> >> >> >> > >>> Coalescing that didn't really fit into anything and now we are
> >> >> >> > >>> starting to see other things related to DMA and PCIe bus credits.
> >> >> >> > >>> I'm
> >> >> >> > >>> wondering if we shouldn't start looking at a tool/interface to
> >> >> >> > >>> configure all the PCIe related features such as interrupts, error
> >> >> >> > >>> reporting, DMA configuration, power management, etc. Maybe we could
> >> >> >> > >>> even look at sharing it across subsystems and include things like
> >> >> >> > >>> storage, graphics, and other subsystems in the conversation.
> >> >> >> > >>
> >> >> >> > >>Agreed, for actual PCIe configuration (i.e. not ECN marking) we do
> >> >> >> > >> need
> >> >> >> > >>to build up an API.  Sharing it across subsystems would be very cool!
> >> >> >>
> >> >> >> I read the thread (starting at [1], for anybody else coming in late)
> >> >> >> and I see this has something to do with "configuring outbound PCIe
> >> >> >> buffers", but I haven't seen the connection to PCIe protocol or
> >> >> >> features, i.e., I can't connect this to anything in the PCIe spec.
> >> >> >>
> >> >> >> Can somebody help me understand how the PCI core is relevant?  If
> >> >> >> there's some connection with a feature defined by PCIe, or if it
> >> >> >> affects the PCIe transaction protocol somehow, I'm definitely
> >> >> >> interested in this.  But if this only affects the data transferred
> >> >> >> over PCIe, i.e., the data payloads of PCIe TLP packets, then I'm not
> >> >> >> sure why the PCI core should care.
> >> >> >
> >> >> > As you wrote, this is not a PCIe feature, nor does it affect the PCIe
> >> >> > transaction protocol.
> >> >> >
> >> >> > Actually, due to a hardware limitation in the current device, we have
> >> >> > enabled a workaround in hardware.
> >> >> >
> >> >> > This mode is proprietary and not relevant to other PCIe devices, thus it
> >> >> > is set using a driver-specific parameter in devlink.
> >> >>
> >> >> Essentially what this feature is doing is communicating the need for
> >> >> PCIe back-pressure to the network fabric. So as the buffers on the
> >> >> device start to fill because the device isn't able to get back PCIe
> >> >> credits fast enough it will then start to send congestion
> >> >> notifications to the network stack itself if I understand this
> >> >> correctly.
> >> >
> >> > This sounds like a hook that allows the device to tell its driver
> >> > about PCIe flow control credits, and the driver can pass that on to
> >> > the network stack.  IIUC, that would be a device-specific feature
> >> > outside the scope of the PCI core.
> >> >
> >> >> For now there are no major conflicts, but when we start getting into
> >> >> stuff like PCIe DMA coalescing, and on a more general basis just PCIe
> >> >> active state power management that is going to start making things
> >> >> more complicated going forward.
> >> >
> >> > We do support ASPM already in the PCI core, and we do have the
> >> > pci_disable_link_state() interface, which is currently the only way
> >> > drivers can influence it.  There are several drivers that do their own
> >> > ASPM configuration, but this is not safe because it's not coordinated
> >> > with what the PCI core does.  If/when drivers need more control, we
> >> > should enhance the PCI core interfaces.
> >>
> >> This is kind of what I was getting at. It would be useful to have an
> >> interface of some sort so that drivers get notified when a user is
> >> making changes to configuration space and I don't know if anything
> >> like that exists now.
> >
> > You mean something like this?
> >
> >   - driver registers a callback with PCI core
> >   - user runs setpci, which writes PCI config space via sysfs
> >   - kernel hook in pci_write_config() notices write and calls driver
> >     callback
> >   - driver callback receives config address and data written
> >   - driver parses PCI capability lists to identify register
> >
> > Nothing like that exists today, and this is not what I had in mind by
> > "enhance the PCI core interfaces".  I'm not sure what the utility of
> > this is (but I'm not a networking guy by any means).
> 
> Well in general I have been wondering if setpci is really the cleanest
> way to do any of this. I have found it can be a fast way to really
> mess things up. For example using setpci to trigger an FLR is a fast
> way to cripple an interface. I recall using that approach when testing
> the fm10k driver to deal with surprise resets.
> 
> The problem is setpci is REALLY powerful. It lets you change a ton
> that can impact the driver, and many drivers just read the config
> space at the probe function and assume it is static (which in most
> cases it is). That is why I was thinking that, as a small step, it might be
> useful to have notifications delivered to the driver, if it has
> registered on the interface, so it could prepare for or clean up after a
> change to the PCI configuration space.

Yep, setpci is very powerful.  I doubt it's worth having drivers try
to react, just because setpci can do completely arbitrary things, many
of which are not recoverable even in principle.  But I do think we
should make it taint the kernel so we have a clue when things go
wrong.

> > I think it's a bad idea for drivers to directly write config space.
> > It would be much better to provide a PCI core interface so we can
> > implement things once and coordinate things that need to be
> > coordinated.
> 
> I agree with that.
> 
> >> > I don't know what PCIe DMA coalescing means, so I can't comment on
> >> > that.
> >>
> >> There are devices, specifically network devices, that will hold off on
> >> switching between either L0s or L1 and L0 by deferring DMA operations.
> >> Basically the idea is supposed to be to hold off bringing the link up
> >> for as long as possible in order to maximize power savings for the
> >> ASPM state. This is something that has come up in the past, and I
> >> don't know if there has been any interface determined for how to
> >> handle this sort of configuration. Most of it occurs through MMIO.
> >
> > The device can certainly delay L0s or L1 exit if it wants to.  If
> > there are knobs to control this, e.g., how long it can defer a DMA, it
> > makes sense that they would be device-specific and in MMIO space.  The
> > PCI core can't be involved in that because in general it knows nothing
> > about the contents of MMIO BARs.  Presumably those knobs would work
> > within the framework of ASPM as defined by the PCIe spec, e.g., if we
> > disable ASPM, the knobs would do nothing because there is no L0s or L1
> > at all.
> >
> > That's not to say that device designers couldn't get together and
> > define a common model for such knobs that puts them at well-known
> > offsets in well-known BARs.  All I'm saying is that this sounds like
> > it's currently outside the realm of the PCI specs and the PCI core.
> 
> Some of this was already handled in LTR and OBFF. I think the DMA
> coalescing was meant to work either with those, or optionally on its
> own. As far as being ASPM dependent or not I don't think it really
> cared all that much other than knowing if ASPM was enabled and how
> long it takes for a given device to come out of either of those
> states. The general idea with DMA coalescing was to try and save power
> on the CPU and then possibly the PCIe link if the value was set high
> enough and ASPM was enabled.
> 
> Anyway we have kind of gotten off into a tangent as I was just citing
> something that might end up having an interaction with a feature such
> as notifying the stack that the Rx buffer on a given device has become
> congested.
> 
> >> >> I assume the devices we are talking about supporting this new feature
> >> >> on either don't deal with ASPM or assume a quick turnaround to get out
> >> >> of the lower power states? Otherwise that would definitely cause some
> >> >> back-pressure buildups that would hurt performance.
> >> >
> >> > Devices can communicate the ASPM exit latency they can tolerate via
> >> > the Device Capabilities register (PCIe r4.0, sec 7.5.3.3).  Linux
> >> > should be configuring ASPM to respect those device requirements.
> >>
> >> Right. But my point was something like ASPM will add extra complexity
> >> to a feature such as what has been described here. My concern is that
> >> I don't want us implementing stuff on a per-driver basis that is not
> >> all that unique to the device. I don't really see the feature that was
> >> described above as being something that will stay specific to this one
> >> device for very long, especially if it provides added value. Basically
> >> all it is doing is exposing PCIe congestion management to
> >> upper levels in the network stack. I don't even necessarily see it as
> >> being networking specific as I would imagine there might be other
> >> types of devices that could make use of knowing how many transactions
> >> and such they could process at the same time.
> >
> > It sounds to me like you need a library that can be used by all the
> > drivers that need this functionality.  Unless it's something
> > documented in the PCIe specs so we can rely on a standard way of
> > discovering and configuring things, I don't see how the PCI core can
> > really be involved.
> >
> > Bjorn
> 
> I am kind of thinking that a common library would be a preferred way
> to go, or at least a common interface to share between the drivers. It
> wasn't my intention to imply that the PCI core needed to get involved.
> I was including the linux-pci list as more of a way of checking to get
> the broader audience's input since we were just discussing this on
> netdev.
> 
> My main concern is that those of us on netdev have historically been
> treating features such as SR-IOV as a networking thing, when really it
> has been a combination of a PCI feature and some network switching. I
> might have ratholed this a bit, as I kind of see this topic as
> something similar.

OK, thanks.  Sounds like we have similar perspectives and I appreciate
being looped in!

Bjorn


* Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink
  2018-07-31 11:06                                 ` Bjorn Helgaas
@ 2018-08-01 18:28                                   ` Moshe Shemesh
  0 siblings, 0 replies; 38+ messages in thread
From: Moshe Shemesh @ 2018-08-01 18:28 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Alexander Duyck, Jiri Pirko, Jakub Kicinski, Eran Ben Elisha,
	Saeed Mahameed, David S. Miller, netdev, linux-pci


On Tue, Jul 31, 2018 at 2:06 PM, Bjorn Helgaas <helgaas@kernel.org> wrote:

> On Mon, Jul 30, 2018 at 08:19:50PM -0700, Alexander Duyck wrote:
> > On Mon, Jul 30, 2018 at 7:33 PM, Bjorn Helgaas <helgaas@kernel.org>
> wrote:
> > > On Mon, Jul 30, 2018 at 08:02:48AM -0700, Alexander Duyck wrote:
> > >> On Mon, Jul 30, 2018 at 7:07 AM, Bjorn Helgaas <helgaas@kernel.org>
> wrote:
> > >> > On Sun, Jul 29, 2018 at 03:00:28PM -0700, Alexander Duyck wrote:
> > >> >> On Sun, Jul 29, 2018 at 2:23 AM, Moshe Shemesh <
> moshes20.il@gmail.com> wrote:
> > >> >> > On Sat, Jul 28, 2018 at 7:06 PM, Bjorn Helgaas <
> helgaas@kernel.org> wrote:
> > >> >> >> On Thu, Jul 26, 2018 at 07:00:20AM -0700, Alexander Duyck wrote:
> > >> >> >> > On Thu, Jul 26, 2018 at 12:14 AM, Jiri Pirko <
> jiri@resnulli.us> wrote:
> > >> >> >> > > Thu, Jul 26, 2018 at 02:43:59AM CEST,
> jakub.kicinski@netronome.com
> > >> >> >> > > wrote:
> > >> >> >> > >>On Wed, 25 Jul 2018 08:23:26 -0700, Alexander Duyck wrote:
> > >> >> >> > >>> On Wed, Jul 25, 2018 at 5:31 AM, Eran Ben Elisha wrote:
> > >> >> >> > >>> > On 7/24/2018 10:51 PM, Jakub Kicinski wrote:
> > >> >> >> > >>> >>>> The devlink params haven't been upstream even for a
> full cycle
> > >> >> >> > >>> >>>> and
> > >> >> >> > >>> >>>> already you guys are starting to use them to
> configure standard
> > >> >> >> > >>> >>>> features like queuing.
> > >> >> >> > >>> >>>
> > >> >> >> > >>> >>> We developed the devlink params in order to support
> non-standard
> > >> >> >> > >>> >>> configuration only. And for non-standard, there are
> generic and
> > >> >> >> > >>> >>> vendor
> > >> >> >> > >>> >>> specific options.
> > >> >> >> > >>> >>
> > >> >> >> > >>> >> I thought it was developed for performing non-standard
> and
> > >> >> >> > >>> >> possibly
> > >> >> >> > >>> >> vendor specific configuration.  Look at
> DEVLINK_PARAM_GENERIC_*
> > >> >> >> > >>> >> for
> > >> >> >> > >>> >> examples of well justified generic options for which
> we have no
> > >> >> >> > >>> >> other API.  The vendor mlx4 options look fairly vendor
> specific
> > >> >> >> > >>> >> if you
> > >> >> >> > >>> >> ask me, too.
> > >> >> >> > >>> >>
> > >> >> >> > >>> >> Configuring queuing has an API.  The question is whether it is
> acceptable to
> > >> >> >> > >>> >> enter
> > >> >> >> > >>> >> into the risky territory of controlling offloads via
> devlink
> > >> >> >> > >>> >> parameters
> > >> >> >> > >>> >> or would we rather make vendors take the time and
> effort to model
> > >> >> >> > >>> >> things to (a subset) of existing APIs.  The HW never
> fits the
> > >> >> >> > >>> >> APIs
> > >> >> >> > >>> >> perfectly.
> > >> >> >> > >>> >
> > >> >> >> > >>> > I understand what you meant here, I would like to
> highlight that
> > >> >> >> > >>> > this
> > >> >> >> > >>> > mechanism was not meant to handle SRIOV, Representors,
> etc.
> > >> >> >> > >>> > The vendor specific configuration suggested here is to
> handle a
> > >> >> >> > >>> > congestion
> > >> >> >> > >>> > state in Multi Host environment (which includes PF and
> multiple
> > >> >> >> > >>> > VFs per
> > >> >> >> > >>> > host), where one host is not aware of the other hosts,
> and each is
> > >> >> >> > >>> > running
> > >> >> >> > >>> > on its own pci/driver. It is a device working mode
> configuration.
> > >> >> >> > >>> >
> > >> >> >> > >>> > This couldn't fit into any existing API, thus creating
> this
> > >> >> >> > >>> > vendor specific
> > >> >> >> > >>> > unique API is needed.
> > >> >> >> > >>>
> > >> >> >> > >>> If we are just going to start creating devlink interfaces
> for
> > >> >> >> > >>> every
> > >> >> >> > >>> one-off option a device wants to add why did we even
> bother with
> > >> >> >> > >>> trying to prevent drivers from using sysfs? This just
> feels like we
> > >> >> >> > >>> are back to the same arguments we had back in the day
> with it.
> > >> >> >> > >>>
> > >> >> >> > >>> I feel like the bigger question here is if devlink is
> > >> >> >> > >>> how we are going to deal with all PCIe related features
> > >> >> >> > >>> going forward, or should we start looking at creating a
> > >> >> >> > >>> new interface/tool for PCI/PCIe related features? My
> > >> >> >> > >>> concern is that we have already had features such as DMA
> > >> >> >> > >>> Coalescing that didn't really fit into anything, and now
> > >> >> >> > >>> we are starting to see other things related to DMA and
> > >> >> >> > >>> PCIe bus credits. I'm wondering if we shouldn't start
> > >> >> >> > >>> looking at a tool/interface to configure all the PCIe
> > >> >> >> > >>> related features such as interrupts, error reporting,
> > >> >> >> > >>> DMA configuration, power management, etc. Maybe we could
> > >> >> >> > >>> even look at sharing it across subsystems and include
> > >> >> >> > >>> things like storage, graphics, and other subsystems in
> > >> >> >> > >>> the conversation.
> > >> >> >> > >>
> > >> >> >> > >> Agreed, for actual PCIe configuration (i.e. not ECN
> > >> >> >> > >> marking) we do need to build up an API.  Sharing it
> > >> >> >> > >> across subsystems would be very cool!
> > >> >> >>
> > >> >> >> I read the thread (starting at [1], for anybody else coming
> > >> >> >> in late) and I see this has something to do with "configuring
> > >> >> >> outbound PCIe buffers", but I haven't seen the connection to
> > >> >> >> PCIe protocol or features, i.e., I can't connect this to
> > >> >> >> anything in the PCIe spec.
> > >> >> >>
> > >> >> >> Can somebody help me understand how the PCI core is relevant?
> > >> >> >> If there's some connection with a feature defined by PCIe, or
> > >> >> >> if it affects the PCIe transaction protocol somehow, I'm
> > >> >> >> definitely interested in this.  But if this only affects the
> > >> >> >> data transferred over PCIe, i.e., the data payloads of PCIe
> > >> >> >> TLP packets, then I'm not sure why the PCI core should care.
> > >> >> >
> > >> >> > As you wrote, this is not a PCIe feature, nor does it affect
> > >> >> > the PCIe transaction protocol.
> > >> >> >
> > >> >> > Actually, due to a hardware limitation in the current device,
> > >> >> > we have enabled a workaround in hardware.
> > >> >> >
> > >> >> > This mode is proprietary and not relevant to other PCIe
> > >> >> > devices, thus it is set using a driver-specific parameter in
> > >> >> > devlink.
> > >> >>
> > >> >> Essentially what this feature is doing is communicating the
> > >> >> need for PCIe back-pressure to the network fabric. So as the
> > >> >> buffers on the device start to fill because the device isn't
> > >> >> able to get back PCIe credits fast enough, it will then start
> > >> >> to send congestion notifications to the network stack itself,
> > >> >> if I understand this correctly.
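To make the behaviour being described concrete, here is a toy user-space model of it: a device that buffers packets while waiting for PCIe flow-control credits, and starts signalling congestion once occupancy crosses a threshold. All names and thresholds here are invented for illustration; this is not the mlx5 implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: packets are DMAed while PCIe credits are available;
 * otherwise they wait in the on-chip buffer.  Above a threshold the
 * device signals congestion (e.g. marks the packet) instead of just
 * queuing silently. */
struct dev_buf {
	int occupancy;   /* packets currently buffered on-chip */
	int credits;     /* PCIe credits available for DMA writes */
	int mark_thresh; /* start marking above this occupancy */
};

/* Returns true if this packet should carry a congestion mark. */
static bool rx_packet(struct dev_buf *b)
{
	if (b->credits > 0)
		b->credits--;     /* DMA the packet to the host */
	else
		b->occupancy++;   /* no credits: packet waits on-chip */
	return b->occupancy > b->mark_thresh;
}

static void credits_returned(struct dev_buf *b, int n)
{
	/* Drain buffered packets first, keep the rest as spare credits. */
	int drained = n < b->occupancy ? n : b->occupancy;

	b->occupancy -= drained;
	b->credits += n - drained;
}
```

The point of the model is only the ordering: congestion feedback is generated by the device itself, before the host stack ever sees the packets.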
> > >> >
> > >> > This sounds like a hook that allows the device to tell its
> > >> > driver about PCIe flow control credits, and the driver can pass
> > >> > that on to the network stack.  IIUC, that would be a
> > >> > device-specific feature outside the scope of the PCI core.
> > >> >
> > >> >> For now there are no major conflicts, but when we start getting
> > >> >> into stuff like PCIe DMA coalescing, and on a more general basis
> > >> >> just PCIe active state power management, that is going to start
> > >> >> making things more complicated going forward.
> > >> >
> > >> > We do support ASPM already in the PCI core, and we do have the
> > >> > pci_disable_link_state() interface, which is currently the only
> > >> > way drivers can influence it.  There are several drivers that do
> > >> > their own ASPM configuration, but this is not safe because it's
> > >> > not coordinated with what the PCI core does.  If/when drivers
> > >> > need more control, we should enhance the PCI core interfaces.
> > >>
> > >> This is kind of what I was getting at. It would be useful to have an
> > >> interface of some sort so that drivers get notified when a user is
> > >> making changes to configuration space and I don't know if anything
> > >> like that exists now.
> > >
> > > You mean something like this?
> > >
> > >   - driver registers a callback with PCI core
> > >   - user runs setpci, which writes PCI config space via sysfs
> > >   - kernel hook in pci_write_config() notices write and calls driver
> > >     callback
> > >   - driver callback receives config address and data written
> > >   - driver parses PCI capability lists to identify register
> > >
> > > Nothing like that exists today, and this is not what I had in mind by
> > > "enhance the PCI core interfaces".  I'm not sure what the utility of
> > > this is (but I'm not a networking guy by any means).
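The register-a-callback flow sketched above can be illustrated with a minimal user-space mock-up. The `pci_cfg_notify_register()` / `pci_cfg_write()` names are hypothetical stand-ins for the kernel hook being discussed; no such API exists today.

```c
#include <stdint.h>

typedef void (*cfg_write_cb)(uint16_t offset, uint32_t value);

static cfg_write_cb registered_cb;

/* Step 1: driver registers a callback with the (mock) PCI core. */
static int pci_cfg_notify_register(cfg_write_cb cb)
{
	if (registered_cb)
		return -1;  /* one listener only, in this sketch */
	registered_cb = cb;
	return 0;
}

/* Step 2+3: stand-in for the hook in pci_write_config(); it performs
 * the write (as setpci would via sysfs) and notifies the driver. */
static uint32_t cfg_space[1024];

static void pci_cfg_write(uint16_t offset, uint32_t value)
{
	cfg_space[offset] = value;
	if (registered_cb)
		registered_cb(offset, value);
}

/* Step 4: example driver callback; a real driver would parse its
 * capability lists here to decide how to react or clean up. */
static uint16_t last_changed = 0xffff;

static void drv_callback(uint16_t offset, uint32_t value)
{
	(void)value;
	last_changed = offset;
}
```

Even in this toy form it shows the difficulty Bjorn raises: the callback only learns "offset X changed", and all interpretation is left to the driver.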
> >
> > Well in general I have been wondering if setpci is really the cleanest
> > way to do any of this. I have found it can be a fast way to really
> > mess things up. For example using setpci to trigger an FLR is a fast
> > way to cripple an interface. I recall using that approach when testing
> > the fm10k driver to deal with surprise resets.
> >
> > The problem is setpci is REALLY powerful. It lets you change a ton
> > that can impact the driver, and many drivers just read the config
> > space at the probe function and assume it is static (which in most
> > cases it is). That is why I was thinking as a small step it might be
> > useful to have the notifications delivered to the driver if it is
> > registered on the interface so it could prepare or clean-up after a
> > change to the PCI configuration space.
>
> Yep, setpci is very powerful.  I doubt it's worth having drivers try
> to react, just because setpci can do completely arbitrary things, many
> of which are not recoverable even in principle.  But I do think we
> should make it taint the kernel so we have a clue when things go
> wrong.
>
> > > I think it's a bad idea for drivers to directly write config space.
> > > It would be much better to provide a PCI core interface so we can
> > > implement things once and coordinate things that need to be
> > > coordinated.
> >
> > I agree with that.
> >
> > >> > I don't know what PCIe DMA coalescing means, so I can't comment on
> > >> > that.
> > >>
> > >> There are devices, specifically network devices, that will hold off on
> > >> switching between either L0s or L1 and L0 by deferring DMA operations.
> > >> Basically the idea is supposed to be to hold off bringing the link up
> > >> for as long as possible in order to maximize power savings for the
> > >> ASPM state. This is something that has come up in the past, and I
> > >> don't know if there has been any interface determined for how to
> > >> handle this sort of configuration. Most of it occurs through MMIO.
> > >
> > > The device can certainly delay L0s or L1 exit if it wants to.  If
> > > there are knobs to control this, e.g., how long it can defer a DMA, it
> > > makes sense that they would be device-specific and in MMIO space.  The
> > > PCI core can't be involved in that because in general it knows nothing
> > > about the contents of MMIO BARs.  Presumably those knobs would work
> > > within the framework of ASPM as defined by the PCIe spec, e.g., if we
> > > disable ASPM, the knobs would do nothing because there is no L0s or L1
> > > at all.
> > >
> > > That's not to say that device designers couldn't get together and
> > > define a common model for such knobs that puts them at well-known
> > > offsets in well-known BARs.  All I'm saying is that this sounds like
> > > it's currently outside the realm of the PCI specs and the PCI core.
> >
> > Some of this was already handled in LTR and OBFF. I think the DMA
> > coalescing was meant to work either with those, or optionally on its
> > own. As far as being ASPM dependent or not I don't think it really
> > cared all that much other than knowing if ASPM was enabled and how
> > long it takes for a given device to come out of either of those
> > states. The general idea with DMA coalescing was to try and save power
> > on the CPU and then possibly the PCIe link if the value was set high
> > enough and ASPM was enabled.
> >
> > Anyway we have kind of gotten off into a tangent as I was just citing
> > something that might end up having an interaction with a feature such
> > as notifying the stack that the Rx buffer on a given device has become
> > congested.
> >
> > >> >> I assume the devices we are talking about supporting this new
> > >> >> feature on either don't deal with ASPM or assume a quick
> > >> >> turnaround to get out of the lower power states? Otherwise that
> > >> >> would definitely cause some back-pressure buildups that would
> > >> >> hurt performance.
> > >> >
> > >> > Devices can communicate the ASPM exit latency they can tolerate
> > >> > via the Device Capabilities register (PCIe r4.0, sec 7.5.3.3).
> > >> > Linux should be configuring ASPM to respect those device
> > >> > requirements.
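For reference, decoding those tolerances is straightforward; the sketch below assumes the usual layout of the Endpoint L0s/L1 Acceptable Latency fields (DevCap bits 8:6 and 11:9) and their power-of-two encodings, so double-check against the spec revision you target.

```c
#include <stdint.h>

#define DEVCAP_L0S_LAT(devcap)	(((devcap) >> 6) & 0x7)
#define DEVCAP_L1_LAT(devcap)	(((devcap) >> 9) & 0x7)

/* Endpoint L0s Acceptable Latency: 64 ns << field, 111b = no limit.
 * Returns nanoseconds; UINT64_MAX means "no limit". */
static uint64_t l0s_acceptable_ns(uint32_t devcap)
{
	unsigned int f = DEVCAP_L0S_LAT(devcap);

	return f == 7 ? UINT64_MAX : 64ull << f;
}

/* Endpoint L1 Acceptable Latency: 1 us << field, 111b = no limit. */
static uint64_t l1_acceptable_ns(uint32_t devcap)
{
	unsigned int f = DEVCAP_L1_LAT(devcap);

	return f == 7 ? UINT64_MAX : 1000ull << f;
}
```

ASPM configuration then amounts to comparing these values against the computed exit latency of the path to the root, which is what the kernel's aspm code does.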
> > >>
> > >> Right. But my point was something like ASPM will add extra complexity
> > >> to a feature such as what has been described here. My concern is that
> > >> I don't want us implementing stuff on a per-driver basis that is not
> > >> all that unique to the device. I don't really see the feature that was
> > >> described above as being something that will stay specific to this one
> > >> device for very long, especially if it provides added value. Basically
> > >> all it is doing is allowing exposing PCIe congestion management to
> > >> upper levels in the network stack. I don't even necessarily see it as
> > >> being networking specific as I would imagine there might be other
> > >> types of devices that could make use of knowing how many transactions
> > >> and such they could process at the same time.
> > >
> > > It sounds to me like you need a library that can be used by all the
> > > drivers that need this functionality.  Unless it's something
> > > documented in the PCIe specs so we can rely on a standard way of
> > > discovering and configuring things, I don't see how the PCI core can
> > > really be involved.
> > >
> > > Bjorn
> >
> > I am kind of thinking that a common library would be a preferred way
> > to go, or at least a common interface to share between the drivers. It
> > wasn't my intention to imply that the PCI core needed to get involved.
> > I was including the linux-pci list as more of a way of checking to get
> > the broader audience's input since we were just discussing this on
> > netdev.
> >
> > My main concern is that those of us on netdev have historically been
> > dealing with features such as SR-IOV as a networking thing, when
> > really it has been a combination of a PCI feature and some network
> > switching. I might have ratholed this a bit, as I kind of see this
> > topic as something similar.
>
> OK, thanks.  Sounds like we have similar perspectives and I appreciate
> being looped in!
>
> Bjorn
>

Thanks for your comments. As already understood, this is not a PCI core
feature; it is a workaround for a hardware limitation specific to this NIC.
It is not common across drivers, not even within Mellanox drivers: the
previous NIC generation's driver, mlx4, does not need it, and the next NIC
won't need it either, as it will be handled by hardware automatically.
We are sending version 2 with a clarification on this in the cover letter.



end of thread, other threads:[~2018-08-01 20:15 UTC | newest]

Thread overview: 38+ messages
2018-07-19  1:00 [pull request][net-next 00/16] Mellanox, mlx5e updates 2018-07-18 Saeed Mahameed
2018-07-19  1:00 ` [net-next 01/16] net/mlx5: FW tracer, implement tracer logic Saeed Mahameed
2018-07-19  1:00 ` [net-next 02/16] net/mlx5: FW tracer, create trace buffer and copy strings database Saeed Mahameed
2018-07-19  1:00 ` [net-next 03/16] net/mlx5: FW tracer, register log buffer memory key Saeed Mahameed
2018-07-19  1:00 ` [net-next 04/16] net/mlx5: FW tracer, events handling Saeed Mahameed
2018-07-19  1:00 ` [net-next 05/16] net/mlx5: FW tracer, parse traces and kernel tracing support Saeed Mahameed
2018-07-19  1:00 ` [net-next 06/16] net/mlx5: FW tracer, Enable tracing Saeed Mahameed
2018-07-19  1:00 ` [net-next 07/16] net/mlx5: FW tracer, Add debug prints Saeed Mahameed
2018-07-19  1:00 ` [net-next 08/16] net/mlx5: Move all devlink related functions calls to devlink.c Saeed Mahameed
2018-07-19  1:01 ` [net-next 09/16] net/mlx5: Add MPEGC register configuration functionality Saeed Mahameed
2018-07-19  1:01 ` [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink Saeed Mahameed
2018-07-19  1:49   ` Jakub Kicinski
2018-07-24 10:31     ` Eran Ben Elisha
2018-07-24 19:51       ` Jakub Kicinski
2018-07-25 12:31         ` Eran Ben Elisha
2018-07-25 15:23           ` Alexander Duyck
2018-07-26  0:43             ` Jakub Kicinski
2018-07-26  7:14               ` Jiri Pirko
2018-07-26 14:00                 ` Alexander Duyck
2018-07-28 16:06                   ` Bjorn Helgaas
2018-07-29  9:23                     ` Moshe Shemesh
2018-07-29 22:00                       ` Alexander Duyck
2018-07-30 14:07                         ` Bjorn Helgaas
2018-07-30 15:02                           ` Alexander Duyck
2018-07-30 22:00                             ` Jakub Kicinski
2018-07-31  2:33                             ` Bjorn Helgaas
2018-07-31  3:19                               ` Alexander Duyck
2018-07-31 11:06                                 ` Bjorn Helgaas
2018-08-01 18:28                                   ` Moshe Shemesh
2018-07-19  8:24   ` Jiri Pirko
2018-07-19  8:49     ` Eran Ben Elisha
2018-07-19  1:01 ` [net-next 11/16] net/mlx5e: Set ECN for received packets using CQE indication Saeed Mahameed
2018-07-19  1:01 ` [net-next 12/16] net/mlx5e: Remove redundant WARN when we cannot find neigh entry Saeed Mahameed
2018-07-19  1:01 ` [net-next 13/16] net/mlx5e: Support offloading tc double vlan headers match Saeed Mahameed
2018-07-19  1:01 ` [net-next 14/16] net/mlx5e: Refactor tc vlan push/pop actions offloading Saeed Mahameed
2018-07-19  1:01 ` [net-next 15/16] net/mlx5e: Support offloading double vlan push/pop tc actions Saeed Mahameed
2018-07-19  1:01 ` [net-next 16/16] net/mlx5e: Use PARTIAL_GSO for UDP segmentation Saeed Mahameed
2018-07-23 21:35 ` [pull request][net-next 00/16] Mellanox, mlx5e updates 2018-07-18 Saeed Mahameed
