All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net-next 0/2] devlink: Add port function attribute to enable/disable roce
@ 2021-02-01 17:51 Yishai Hadas
  2021-02-01 17:51 ` [PATCH net-next 1/2] devlink: Expose port function commands to control roce Yishai Hadas
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Yishai Hadas @ 2021-02-01 17:51 UTC (permalink / raw)
  To: netdev, davem, kuba; +Cc: parav, saeedm, Yishai Hadas

Currently mlx5 PCI VF and SF are enabled by default for RoCE
functionality.

Currently a user does not have the ability to disable RoCE for a PCI
VF/SF device before such device is enumerated by the driver.

User is also incapable to do such setting from smartnic scenario for a
VF from the smartnic.

Current 'enable_roce' device knob is limited to do setting only at
driverinit time. By this time device is already created and firmware has
already allocated necessary system memory for supporting RoCE.

When a RoCE is disabled for the PCI VF/SF device, it saves 1 Mbyte of
system memory per function. Such saving is helpful when running on low
memory embedded platform with many VFs or SFs.

Therefore, it is desired to empower user to disable RoCE functionality
before a PCI SF/VF device is enumerated.

This is achieved by extending existing 'port function' object to control
capabilities of a function. This enables users to control capability of
the device before enumeration.

Examples when user prefers to disable RoCE for a VF when using switchdev
mode:

$ devlink port show pci/0000:06:00.0/1
pci/0000:06:00.0/1: type eth netdev pf0vf0 flavour pcivf controller 0
pfnum 0 vfnum 0 external false splittable false
  function:
    hw_addr 00:00:00:00:00:00 roce on

$ devlink port function set pci/0000:06:00.0/1 roce off
  
$ devlink port show pci/0000:06:00.0/1
pci/0000:06:00.0/1: type eth netdev pf0vf0 flavour pcivf controller 0
pfnum 0 vfnum 0 external false splittable false
  function:
    hw_addr 00:00:00:00:00:00 roce off

FAQs:
-----
1. What does roce on/off do?
Ans: It disables RoCE capability of the function before its enumerated,
so when driver reads the capability from the device firmware, it is
disabled.
At this point RDMA stack will not be able to create UD, QP1, RC, XRC
type of QPs. When RoCE is disabled, the GID table of all ports of the
device is disabled in the device and software stack.

2. How is the roce 'port function' option different from existing
devlink param?
Ans: RoCE attribute at the port function level disables the RoCE
capability at the specific function level; while enable_roce only does
at the software level.

3. Why is this option for disabling only RoCE and not the whole RDMA
device?
Ans: Because user still wants to use the RDMA device for non RoCE
commands in more memory efficient way.

Patch summary:
Patch-1 adds devlink attribute to control roce
Patch-2 implements mlx5 callbacks for roce control

Yishai Hadas (2):
  devlink: Expose port function commands to control roce
  net/mlx5: E-Switch, Implement devlink port function cmds to control
    roce

 .../device_drivers/ethernet/mellanox/mlx5.rst |  32 ++++
 .../networking/devlink/devlink-port.rst       |   5 +-
 .../net/ethernet/mellanox/mlx5/core/devlink.c |   3 +
 .../net/ethernet/mellanox/mlx5/core/eswitch.c | 138 ++++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |  10 +-
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |   2 +
 .../net/ethernet/mellanox/mlx5/core/vport.c   |  35 +++++
 include/net/devlink.h                         |  22 +++
 include/uapi/linux/devlink.h                  |   1 +
 net/core/devlink.c                            |  63 ++++++++
 10 files changed, 307 insertions(+), 4 deletions(-)

-- 
2.18.1


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH net-next 1/2] devlink: Expose port function commands to control roce
  2021-02-01 17:51 [PATCH net-next 0/2] devlink: Add port function attribute to enable/disable roce Yishai Hadas
@ 2021-02-01 17:51 ` Yishai Hadas
  2021-02-01 17:51 ` [PATCH net-next 2/2] net/mlx5: E-Switch, Implement devlink port function cmds " Yishai Hadas
  2021-02-03  2:14 ` [PATCH net-next 0/2] devlink: Add port function attribute to enable/disable roce Jakub Kicinski
  2 siblings, 0 replies; 9+ messages in thread
From: Yishai Hadas @ 2021-02-01 17:51 UTC (permalink / raw)
  To: netdev, davem, kuba; +Cc: parav, saeedm, Yishai Hadas, Jiri Pirko

Expose port function commands to turn on / off roce, this is used to
control the port roce device capabilities.

When roce is disabled for a function of the port, function cannot create
any roce specific resources (e.g GID table).
It also saves system memory utilization. For example disabling roce on a
VF/SF saves 1 Mbytes of system memory per function.

Example of a PCI VF port which supports function configuration:
Set roce of the VF's port function.

$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
    function:
        hw_addr 00:00:00:00:00:00 roce on

$ devlink port function set pci/0000:06:00.0/2 roce off

$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
    function:
        hw_addr 00:11:22:33:44:55 roce off

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../networking/devlink/devlink-port.rst       |  5 +-
 include/net/devlink.h                         | 22 +++++++
 include/uapi/linux/devlink.h                  |  1 +
 net/core/devlink.c                            | 63 +++++++++++++++++++
 4 files changed, 90 insertions(+), 1 deletion(-)

diff --git a/Documentation/networking/devlink/devlink-port.rst b/Documentation/networking/devlink/devlink-port.rst
index e99b41599465..541e19f9d256 100644
--- a/Documentation/networking/devlink/devlink-port.rst
+++ b/Documentation/networking/devlink/devlink-port.rst
@@ -110,7 +110,7 @@ devlink ports for both the controllers.
 Function configuration
 ======================
 
-A user can configure the function attribute before enumerating the PCI
+A user can configure one or more function attributes before enumerating the PCI
 function. Usually it means, user should configure function attribute
 before a bus specific device for the function is created. However, when
 SRIOV is enabled, virtual function devices are created on the PCI bus.
@@ -122,6 +122,9 @@ A user may set the hardware address of the function using
 'devlink port function set hw_addr' command. For Ethernet port function
 this means a MAC address.
 
+A user may set also the roce capability of the function using
+'devlink port function set roce' command.
+
 Subfunction
 ============
 
diff --git a/include/net/devlink.h b/include/net/devlink.h
index 47b4b063401b..055280212b58 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -1451,6 +1451,28 @@ struct devlink_ops {
 				 struct devlink_port *port,
 				 enum devlink_port_fn_state state,
 				 struct netlink_ext_ack *extack);
+	/**
+	 * @port_fn_roce_get: Port function's roce get function.
+	 *
+	 * Should be used by device drivers to report the roce state of
+	 * a function managed by the devlink port. Driver should return
+	 * -EOPNOTSUPP if it doesn't support port function handling for
+	 * a particular port.
+	 */
+	int (*port_fn_roce_get)(struct devlink *devlink,
+				struct devlink_port *port, bool *on,
+				struct netlink_ext_ack *extack);
+	/**
+	 * @port_fn_roce_set: Port function's roce set function.
+	 *
+	 * Should be used by device drivers to enable/disable the roce state of
+	 * a function managed by the devlink port. Driver should return
+	 * -EOPNOTSUPP if it doesn't support port function handling for
+	 * a particular port.
+	 */
+	int (*port_fn_roce_set)(struct devlink *devlink,
+				struct devlink_port *port, bool on,
+				struct netlink_ext_ack *extack);
 };
 
 static inline void *devlink_priv(struct devlink *devlink)
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index f6008b2fa60f..77990b563d80 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -585,6 +585,7 @@ enum devlink_port_function_attr {
 	DEVLINK_PORT_FUNCTION_ATTR_HW_ADDR,	/* binary */
 	DEVLINK_PORT_FN_ATTR_STATE,	/* u8 */
 	DEVLINK_PORT_FN_ATTR_OPSTATE,	/* u8 */
+	DEVLINK_PORT_FN_ATTR_ROCE,	/* u8 */
 
 	__DEVLINK_PORT_FUNCTION_ATTR_MAX,
 	DEVLINK_PORT_FUNCTION_ATTR_MAX = __DEVLINK_PORT_FUNCTION_ATTR_MAX - 1
diff --git a/net/core/devlink.c b/net/core/devlink.c
index 737b61c2976e..d04318e79dc2 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -90,6 +90,7 @@ static const struct nla_policy devlink_function_nl_policy[DEVLINK_PORT_FUNCTION_
 	[DEVLINK_PORT_FN_ATTR_STATE] =
 		NLA_POLICY_RANGE(NLA_U8, DEVLINK_PORT_FN_STATE_INACTIVE,
 				 DEVLINK_PORT_FN_STATE_ACTIVE),
+	[DEVLINK_PORT_FN_ATTR_ROCE] = NLA_POLICY_RANGE(NLA_U8, 0, 1),
 };
 
 static LIST_HEAD(devlink_list);
@@ -724,6 +725,34 @@ static int devlink_nl_port_attrs_put(struct sk_buff *msg,
 	return 0;
 }
 
+static int devlink_port_function_roce_fill(struct devlink *devlink,
+					   const struct devlink_ops *ops,
+					   struct devlink_port *port,
+					   struct sk_buff *msg,
+					   struct netlink_ext_ack *extack,
+					   bool *msg_updated)
+{
+	bool on;
+	int err;
+
+	if (!ops->port_fn_roce_get)
+		return 0;
+
+	err = ops->port_fn_roce_get(devlink, port, &on, extack);
+	if (err) {
+		if (err == -EOPNOTSUPP)
+			return 0;
+		return err;
+	}
+
+	err = nla_put_u8(msg, DEVLINK_PORT_FN_ATTR_ROCE, on);
+	if (err)
+		return err;
+
+	*msg_updated = true;
+	return 0;
+}
+
 static int
 devlink_port_fn_hw_addr_fill(struct devlink *devlink, const struct devlink_ops *ops,
 			     struct devlink_port *port, struct sk_buff *msg,
@@ -820,6 +849,12 @@ devlink_nl_port_function_attrs_put(struct sk_buff *msg, struct devlink_port *por
 					   extack, &msg_updated);
 	if (err)
 		goto out;
+
+	err = devlink_port_function_roce_fill(devlink, ops, port, msg, extack,
+					      &msg_updated);
+	if (err)
+		goto out;
+
 	err = devlink_port_fn_state_fill(devlink, ops, port, msg, extack,
 					 &msg_updated);
 out:
@@ -1054,6 +1089,26 @@ static int devlink_port_type_set(struct devlink *devlink,
 	return -EOPNOTSUPP;
 }
 
+static int
+devlink_port_fn_roce_set(struct devlink *devlink, struct devlink_port *port,
+			 const struct nlattr *attr,
+			 struct netlink_ext_ack *extack)
+{
+	const struct devlink_ops *ops;
+	bool on;
+
+	on = nla_get_u8(attr);
+
+	ops = devlink->ops;
+	if (!ops->port_fn_roce_set) {
+		NL_SET_ERR_MSG_MOD(extack,
+				   "Port doesn't support roce function attribute");
+		return -EOPNOTSUPP;
+	}
+
+	return ops->port_fn_roce_set(devlink, port, on, extack);
+}
+
 static int
 devlink_port_function_hw_addr_set(struct devlink *devlink, struct devlink_port *port,
 				  const struct nlattr *attr, struct netlink_ext_ack *extack)
@@ -1126,6 +1181,14 @@ devlink_port_function_set(struct devlink *devlink, struct devlink_port *port,
 		if (err)
 			return err;
 	}
+
+	attr = tb[DEVLINK_PORT_FN_ATTR_ROCE];
+	if (attr) {
+		err = devlink_port_fn_roce_set(devlink, port, attr, extack);
+		if (err)
+			return err;
+	}
+
 	/* Keep this as the last function attribute set, so that when
 	 * multiple port function attributes are set along with state,
 	 * Those can be applied first before activating the state.
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH net-next 2/2] net/mlx5: E-Switch, Implement devlink port function cmds to control roce
  2021-02-01 17:51 [PATCH net-next 0/2] devlink: Add port function attribute to enable/disable roce Yishai Hadas
  2021-02-01 17:51 ` [PATCH net-next 1/2] devlink: Expose port function commands to control roce Yishai Hadas
@ 2021-02-01 17:51 ` Yishai Hadas
  2021-02-03  2:14 ` [PATCH net-next 0/2] devlink: Add port function attribute to enable/disable roce Jakub Kicinski
  2 siblings, 0 replies; 9+ messages in thread
From: Yishai Hadas @ 2021-02-01 17:51 UTC (permalink / raw)
  To: netdev, davem, kuba; +Cc: parav, saeedm, Yishai Hadas

Implement devlink port function commands to enable / disable roce.
This is used to control the roce device capabilities.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../device_drivers/ethernet/mellanox/mlx5.rst |  32 ++++
 .../net/ethernet/mellanox/mlx5/core/devlink.c |   3 +
 .../net/ethernet/mellanox/mlx5/core/eswitch.c | 138 ++++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |  10 +-
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |   2 +
 .../net/ethernet/mellanox/mlx5/core/vport.c   |  35 +++++
 6 files changed, 217 insertions(+), 3 deletions(-)

diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst
index a1b32fcd0d76..3245b6a6dd39 100644
--- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst
@@ -322,6 +322,38 @@ device created for the PCI VF/SF.
       function:
         hw_addr 00:00:00:00:88:88
 
+RoCE capability setup
+---------------------
+Not all mlx5 PCI VFs/SFs require RoCE capability.
+
+When RoCE capability is disabled, it saves 1 Mbytes worth of system memory per
+PCI VF/SF.
+
+mlx5 driver provides mechanism to setup RoCE capability.
+
+When user disables RoCE capability for a VF/SF, user application cannot send or
+receive any RoCE packets through this VF/SF and RoCE GID table for this PCI
+will be empty.
+
+When RoCE capability is disabled in the device using port function attribute,
+VF/SF driver cannot override it.
+
+- Get RoCE capability of the VF device::
+
+    $ devlink port show pci/0000:06:00.0/2
+    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+        function:
+            hw_addr 00:00:00:00:00:00 roce on
+
+- Set RoCE capability of the VF device::
+
+    $ devlink port function set pci/0000:06:00.0/2 roce off
+
+    $ devlink port show pci/0000:06:00.0/2
+    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+        function:
+            hw_addr 00:00:00:00:00:00 roce off
+
 SF state setup
 --------------
 To use the SF, the user must active the SF using the SF function state
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
index aa76a6e0dae8..b8b6e8a1a436 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
@@ -276,6 +276,9 @@ static const struct devlink_ops mlx5_devlink_ops = {
 	.eswitch_encap_mode_get = mlx5_devlink_eswitch_encap_mode_get,
 	.port_function_hw_addr_get = mlx5_devlink_port_function_hw_addr_get,
 	.port_function_hw_addr_set = mlx5_devlink_port_function_hw_addr_set,
+	.port_fn_roce_get = mlx5_devlink_port_function_roce_get,
+	.port_fn_roce_set = mlx5_devlink_port_function_roce_set,
+
 #endif
 #ifdef CONFIG_MLX5_SF_MANAGER
 	.port_new = mlx5_devlink_sf_port_new,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index 820305b1664e..349a6fd21ff2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -1220,6 +1220,32 @@ static void esw_vport_cleanup_acl(struct mlx5_eswitch *esw,
 		esw_vport_destroy_offloads_acl_tables(esw, vport);
 }
 
+static int mlx5_esw_vport_roce_cap_get(struct mlx5_eswitch *esw, struct mlx5_vport *vport)
+{
+	int query_out_sz = MLX5_ST_SZ_BYTES(query_hca_cap_out);
+	void *query_ctx;
+	void *hca_caps;
+	int err;
+
+	if (!MLX5_CAP_GEN(esw->dev, vhca_resource_manager))
+		return 0;
+
+	query_ctx = kzalloc(query_out_sz, GFP_KERNEL);
+	if (!query_ctx)
+		return -ENOMEM;
+
+	err = mlx5_vport_get_other_func_cap(esw->dev, vport->vport, query_ctx);
+	if (err)
+		goto out_free;
+
+	hca_caps = MLX5_ADDR_OF(query_hca_cap_out, query_ctx, capability);
+	vport->info.roce_enabled = MLX5_GET(cmd_hca_cap, hca_caps, roce);
+
+out_free:
+	kfree(query_ctx);
+	return err;
+}
+
 static int esw_vport_setup(struct mlx5_eswitch *esw, struct mlx5_vport *vport)
 {
 	u16 vport_num = vport->vport;
@@ -1236,6 +1262,10 @@ static int esw_vport_setup(struct mlx5_eswitch *esw, struct mlx5_vport *vport)
 	if (mlx5_esw_is_manager_vport(esw, vport_num))
 		return 0;
 
+	err = mlx5_esw_vport_roce_cap_get(esw, vport);
+	if (err)
+		goto err_roce;
+
 	mlx5_modify_vport_admin_state(esw->dev,
 				      MLX5_VPORT_STATE_OP_MOD_ESW_VPORT,
 				      vport_num, 1,
@@ -1255,6 +1285,11 @@ static int esw_vport_setup(struct mlx5_eswitch *esw, struct mlx5_vport *vport)
 			       vport->info.qos, flags);
 
 	return 0;
+
+err_roce:
+	esw_vport_disable_qos(esw, vport);
+	esw_vport_cleanup_acl(esw, vport);
+	return err;
 }
 
 /* Don't cleanup vport->info, it's needed to restore vport configuration */
@@ -1995,6 +2030,109 @@ int mlx5_devlink_port_function_hw_addr_set(struct devlink *devlink,
 	return err;
 }
 
+int mlx5_devlink_port_function_roce_get(struct devlink *devlink, struct devlink_port *port,
+					bool *is_enabled, struct netlink_ext_ack *extack)
+{
+	struct mlx5_eswitch *esw;
+	struct mlx5_vport *vport;
+	int err = -EOPNOTSUPP;
+	u16 vport_num;
+
+	esw = mlx5_devlink_eswitch_get(devlink);
+	if (IS_ERR(esw))
+		return PTR_ERR(esw);
+
+	if (!MLX5_CAP_GEN(esw->dev, vhca_resource_manager))
+		return -EOPNOTSUPP;
+
+	vport_num = mlx5_esw_devlink_port_index_to_vport_num(port->index);
+	if (!is_port_function_supported(esw, vport_num))
+		return -EOPNOTSUPP;
+
+	vport = mlx5_eswitch_get_vport(esw, vport_num);
+	if (IS_ERR(vport)) {
+		NL_SET_ERR_MSG_MOD(extack, "Invalid port");
+		return PTR_ERR(vport);
+	}
+
+	mutex_lock(&esw->state_lock);
+	if (vport->enabled) {
+		*is_enabled = vport->info.roce_enabled;
+		err = 0;
+	}
+	mutex_unlock(&esw->state_lock);
+	return err;
+}
+
+int mlx5_devlink_port_function_roce_set(struct devlink *devlink, struct devlink_port *port,
+					bool enable, struct netlink_ext_ack *extack)
+{
+	int query_out_sz = MLX5_ST_SZ_BYTES(query_hca_cap_out);
+	struct mlx5_eswitch *esw;
+	struct mlx5_vport *vport;
+	int err = -EOPNOTSUPP;
+	void *query_ctx;
+	void *hca_caps;
+	u16 vport_num;
+
+	esw = mlx5_devlink_eswitch_get(devlink);
+	if (IS_ERR(esw))
+		return PTR_ERR(esw);
+
+	if (!MLX5_CAP_GEN(esw->dev, vhca_resource_manager))
+		return -EOPNOTSUPP;
+
+	vport_num = mlx5_esw_devlink_port_index_to_vport_num(port->index);
+	if (!is_port_function_supported(esw, vport_num))
+		return -EOPNOTSUPP;
+
+	vport = mlx5_eswitch_get_vport(esw, vport_num);
+	if (IS_ERR(vport)) {
+		NL_SET_ERR_MSG_MOD(extack, "Invalid port");
+		return PTR_ERR(vport);
+	}
+
+	mutex_lock(&esw->state_lock);
+	if (!vport->enabled) {
+		NL_SET_ERR_MSG_MOD(extack, "Eswitch vport is disabled");
+		goto out;
+	}
+
+	if (vport->info.roce_enabled == enable) {
+		err = 0;
+		goto out;
+	}
+
+	query_ctx = kzalloc(query_out_sz, GFP_KERNEL);
+	if (!query_ctx) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	err = mlx5_vport_get_other_func_cap(esw->dev, vport_num, query_ctx);
+	if (err) {
+		NL_SET_ERR_MSG_MOD(extack, "Failed getting HCA caps");
+		goto out_free;
+	}
+
+	hca_caps = MLX5_ADDR_OF(query_hca_cap_out, query_ctx, capability);
+	MLX5_SET(cmd_hca_cap, hca_caps, roce, enable);
+
+	err = mlx5_vport_set_other_func_cap(esw->dev, hca_caps, vport_num);
+	if (err) {
+		NL_SET_ERR_MSG_MOD(extack, "Failed setting HCA roce cap");
+		goto out_free;
+	}
+
+	vport->info.roce_enabled = enable;
+
+out_free:
+	kfree(query_ctx);
+out:
+	mutex_unlock(&esw->state_lock);
+	return err;
+}
+
 int mlx5_eswitch_set_vport_state(struct mlx5_eswitch *esw,
 				 u16 vport, int link_state)
 {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index 479d2ac2cd85..0e7ad84e7b45 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -122,8 +122,9 @@ struct mlx5_vport_info {
 	int                     link_state;
 	u32                     min_rate;
 	u32                     max_rate;
-	bool                    spoofchk;
-	bool                    trusted;
+	u8                      spoofchk: 1;
+	u8                      trusted: 1;
+	u8                      roce_enabled: 1;
 };
 
 /* Vport context events */
@@ -436,7 +437,10 @@ int mlx5_devlink_port_function_hw_addr_set(struct devlink *devlink,
 					   struct devlink_port *port,
 					   const u8 *hw_addr, int hw_addr_len,
 					   struct netlink_ext_ack *extack);
-
+int mlx5_devlink_port_function_roce_get(struct devlink *devlink, struct devlink_port *port,
+					bool *is_enabled, struct netlink_ext_ack *extack);
+int mlx5_devlink_port_function_roce_set(struct devlink *devlink, struct devlink_port *port,
+					bool enable, struct netlink_ext_ack *extack);
 void *mlx5_eswitch_get_uplink_priv(struct mlx5_eswitch *esw, u8 rep_type);
 
 int mlx5_eswitch_add_vlan_action(struct mlx5_eswitch *esw,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index 3754ef98554f..de45a4b54934 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -259,6 +259,8 @@ enum {
 
 u8 mlx5_get_nic_state(struct mlx5_core_dev *dev);
 void mlx5_set_nic_state(struct mlx5_core_dev *dev, u8 state);
+int mlx5_vport_get_other_func_cap(struct mlx5_core_dev *dev, u16 function_id, void *out);
+int mlx5_vport_set_other_func_cap(struct mlx5_core_dev *dev, const void *hca_cap, u16 function_id);
 
 static inline bool mlx5_core_is_sf(const struct mlx5_core_dev *dev)
 {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/vport.c b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
index ba78e0660523..a822a25b18af 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/vport.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
@@ -1164,3 +1164,38 @@ u16 mlx5_eswitch_get_total_vports(const struct mlx5_core_dev *dev)
 	return MLX5_SPECIAL_VPORTS(dev) + mlx5_core_max_vfs(dev) + mlx5_sf_max_functions(dev);
 }
 EXPORT_SYMBOL_GPL(mlx5_eswitch_get_total_vports);
+
+int mlx5_vport_get_other_func_cap(struct mlx5_core_dev *dev, u16 function_id, void *out)
+{
+	u16 opmod = (MLX5_CAP_GENERAL << 1) | (HCA_CAP_OPMOD_GET_MAX & 0x01);
+	u8 in[MLX5_ST_SZ_BYTES(query_hca_cap_in)] = {};
+
+	MLX5_SET(query_hca_cap_in, in, opcode, MLX5_CMD_OP_QUERY_HCA_CAP);
+	MLX5_SET(query_hca_cap_in, in, op_mod, opmod);
+	MLX5_SET(query_hca_cap_in, in, function_id, function_id);
+	MLX5_SET(query_hca_cap_in, in, other_function, true);
+	return mlx5_cmd_exec_inout(dev, query_hca_cap, in, out);
+}
+
+int mlx5_vport_set_other_func_cap(struct mlx5_core_dev *dev, const void *hca_cap, u16 function_id)
+{
+	int set_sz = MLX5_ST_SZ_BYTES(set_hca_cap_in);
+	void *set_hca_cap;
+	void *set_ctx;
+	int ret;
+
+	set_ctx = kzalloc(set_sz, GFP_KERNEL);
+	if (!set_ctx)
+		return -ENOMEM;
+
+	MLX5_SET(set_hca_cap_in, set_ctx, opcode, MLX5_CMD_OP_SET_HCA_CAP);
+	MLX5_SET(set_hca_cap_in, set_ctx, op_mod, MLX5_SET_HCA_CAP_OP_MOD_GENERAL_DEVICE << 1);
+	set_hca_cap = MLX5_ADDR_OF(set_hca_cap_in, set_ctx, capability);
+	memcpy(set_hca_cap, hca_cap, MLX5_ST_SZ_BYTES(cmd_hca_cap));
+	MLX5_SET(set_hca_cap_in, set_ctx, function_id, function_id);
+	MLX5_SET(set_hca_cap_in, set_ctx, other_function, true);
+	ret = mlx5_cmd_exec_in(dev, set_hca_cap, set_ctx);
+
+	kfree(set_ctx);
+	return ret;
+}
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH net-next 0/2] devlink: Add port function attribute to enable/disable roce
  2021-02-01 17:51 [PATCH net-next 0/2] devlink: Add port function attribute to enable/disable roce Yishai Hadas
  2021-02-01 17:51 ` [PATCH net-next 1/2] devlink: Expose port function commands to control roce Yishai Hadas
  2021-02-01 17:51 ` [PATCH net-next 2/2] net/mlx5: E-Switch, Implement devlink port function cmds " Yishai Hadas
@ 2021-02-03  2:14 ` Jakub Kicinski
  2021-02-03  4:13   ` Saeed Mahameed
  2 siblings, 1 reply; 9+ messages in thread
From: Jakub Kicinski @ 2021-02-03  2:14 UTC (permalink / raw)
  To: Yishai Hadas; +Cc: netdev, davem, parav, saeedm

On Mon, 1 Feb 2021 19:51:50 +0200 Yishai Hadas wrote:
> Currently mlx5 PCI VF and SF are enabled by default for RoCE
> functionality.
> 
> Currently a user does not have the ability to disable RoCE for a PCI
> VF/SF device before such device is enumerated by the driver.
> 
> User is also incapable to do such setting from smartnic scenario for a
> VF from the smartnic.
> 
> Current 'enable_roce' device knob is limited to do setting only at
> driverinit time. By this time device is already created and firmware has
> already allocated necessary system memory for supporting RoCE.
> 
> When a RoCE is disabled for the PCI VF/SF device, it saves 1 Mbyte of
> system memory per function. Such saving is helpful when running on low
> memory embedded platform with many VFs or SFs.
> 
> Therefore, it is desired to empower user to disable RoCE functionality
> before a PCI SF/VF device is enumerated.

You say that the user on the VF/SF side wants to save memory, yet
the control knob is on the eswitch instance side, correct?

> This is achieved by extending existing 'port function' object to control
> capabilities of a function. This enables users to control capability of
> the device before enumeration.
> 
> Examples when user prefers to disable RoCE for a VF when using switchdev
> mode:
> 
> $ devlink port show pci/0000:06:00.0/1
> pci/0000:06:00.0/1: type eth netdev pf0vf0 flavour pcivf controller 0
> pfnum 0 vfnum 0 external false splittable false
>   function:
>     hw_addr 00:00:00:00:00:00 roce on
> 
> $ devlink port function set pci/0000:06:00.0/1 roce off
>   
> $ devlink port show pci/0000:06:00.0/1
> pci/0000:06:00.0/1: type eth netdev pf0vf0 flavour pcivf controller 0
> pfnum 0 vfnum 0 external false splittable false
>   function:
>     hw_addr 00:00:00:00:00:00 roce off
> 
> FAQs:
> -----
> 1. What does roce on/off do?
> Ans: It disables RoCE capability of the function before its enumerated,
> so when driver reads the capability from the device firmware, it is
> disabled.
> At this point RDMA stack will not be able to create UD, QP1, RC, XRC
> type of QPs. When RoCE is disabled, the GID table of all ports of the
> device is disabled in the device and software stack.
> 
> 2. How is the roce 'port function' option different from existing
> devlink param?
> Ans: RoCE attribute at the port function level disables the RoCE
> capability at the specific function level; while enable_roce only does
> at the software level.
> 
> 3. Why is this option for disabling only RoCE and not the whole RDMA
> device?
> Ans: Because user still wants to use the RDMA device for non RoCE
> commands in more memory efficient way.

What are those "non-RoCE commands" that user may want to use "in a more
efficient way"?

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net-next 0/2] devlink: Add port function attribute to enable/disable roce
  2021-02-03  2:14 ` [PATCH net-next 0/2] devlink: Add port function attribute to enable/disable roce Jakub Kicinski
@ 2021-02-03  4:13   ` Saeed Mahameed
  2021-02-03 18:51     ` Jakub Kicinski
  0 siblings, 1 reply; 9+ messages in thread
From: Saeed Mahameed @ 2021-02-03  4:13 UTC (permalink / raw)
  To: Jakub Kicinski, Yishai Hadas; +Cc: netdev, davem, parav

On Tue, 2021-02-02 at 18:14 -0800, Jakub Kicinski wrote:
> On Mon, 1 Feb 2021 19:51:50 +0200 Yishai Hadas wrote:
> > Currently mlx5 PCI VF and SF are enabled by default for RoCE
> > functionality.
> > 
> > Currently a user does not have the ability to disable RoCE for a
> > PCI
> > VF/SF device before such device is enumerated by the driver.
> > 
> > User is also incapable to do such setting from smartnic scenario
> > for a
> > VF from the smartnic.
> > 
> > Current 'enable_roce' device knob is limited to do setting only at
> > driverinit time. By this time device is already created and
> > firmware has
> > already allocated necessary system memory for supporting RoCE.
> > 
> > When a RoCE is disabled for the PCI VF/SF device, it saves 1 Mbyte
> > of
> > system memory per function. Such saving is helpful when running on
> > low
> > memory embedded platform with many VFs or SFs.
> > 
> > Therefore, it is desired to empower user to disable RoCE
> > functionality
> > before a PCI SF/VF device is enumerated.
> 
> You say that the user on the VF/SF side wants to save memory, yet
> the control knob is on the eswitch instance side, correct?
> 

yes, user in this case is the admin, who controls the provisioned
network function SF/VFs.. by turning off this knob it allows to create
more of that resource in case the user/admin is limited by memory.

> > This is achieved by extending existing 'port function' object to
> > control
> > capabilities of a function. This enables users to control
> > capability of
> > the device before enumeration.
> > 
> > Examples when user prefers to disable RoCE for a VF when using
> > switchdev
> > mode:
> > 
> > $ devlink port show pci/0000:06:00.0/1
> > pci/0000:06:00.0/1: type eth netdev pf0vf0 flavour pcivf controller
> > 0
> > pfnum 0 vfnum 0 external false splittable false
> >   function:
> >     hw_addr 00:00:00:00:00:00 roce on
> > 
> > $ devlink port function set pci/0000:06:00.0/1 roce off
> >   
> > $ devlink port show pci/0000:06:00.0/1
> > pci/0000:06:00.0/1: type eth netdev pf0vf0 flavour pcivf controller
> > 0
> > pfnum 0 vfnum 0 external false splittable false
> >   function:
> >     hw_addr 00:00:00:00:00:00 roce off
> > 
> > FAQs:
> > -----
> > 1. What does roce on/off do?
> > Ans: It disables RoCE capability of the function before its
> > enumerated,
> > so when driver reads the capability from the device firmware, it is
> > disabled.
> > At this point RDMA stack will not be able to create UD, QP1, RC,
> > XRC
> > type of QPs. When RoCE is disabled, the GID table of all ports of
> > the
> > device is disabled in the device and software stack.
> > 
> > 2. How is the roce 'port function' option different from existing
> > devlink param?
> > Ans: RoCE attribute at the port function level disables the RoCE
> > capability at the specific function level; while enable_roce only
> > does
> > at the software level.
> > 
> > 3. Why is this option for disabling only RoCE and not the whole
> > RDMA
> > device?
> > Ans: Because user still wants to use the RDMA device for non RoCE
> > commands in more memory efficient way.
> 
> What are those "non-RoCE commands" that user may want to use "in a
> more
> efficient way"?

RAW eth QP, i think you already know this one, it is a very thin layer
that doesn't require the whole rdma stack.



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net-next 0/2] devlink: Add port function attribute to enable/disable roce
  2021-02-03  4:13   ` Saeed Mahameed
@ 2021-02-03 18:51     ` Jakub Kicinski
  2021-02-03 19:22       ` Saeed Mahameed
  0 siblings, 1 reply; 9+ messages in thread
From: Jakub Kicinski @ 2021-02-03 18:51 UTC (permalink / raw)
  To: Saeed Mahameed; +Cc: Yishai Hadas, netdev, davem, parav

On Tue, 02 Feb 2021 20:13:48 -0800 Saeed Mahameed wrote:
> On Tue, 2021-02-02 at 18:14 -0800, Jakub Kicinski wrote:
> > On Mon, 1 Feb 2021 19:51:50 +0200 Yishai Hadas wrote:  
> > > Currently mlx5 PCI VF and SF are enabled by default for RoCE
> > > functionality.
> > > 
> > > Currently a user does not have the ability to disable RoCE for a
> > > PCI
> > > VF/SF device before such device is enumerated by the driver.
> > > 
> > > User is also incapable to do such setting from smartnic scenario
> > > for a
> > > VF from the smartnic.
> > > 
> > > Current 'enable_roce' device knob is limited to do setting only at
> > > driverinit time. By this time device is already created and
> > > firmware has
> > > already allocated necessary system memory for supporting RoCE.
> > > 
> > > When a RoCE is disabled for the PCI VF/SF device, it saves 1 Mbyte
> > > of
> > > system memory per function. Such saving is helpful when running on
> > > low
> > > memory embedded platform with many VFs or SFs.
> > > 
> > > Therefore, it is desired to empower user to disable RoCE
> > > functionality
> > > before a PCI SF/VF device is enumerated.  
> > 
> > You say that the user on the VF/SF side wants to save memory, yet
> > the control knob is on the eswitch instance side, correct?
> >   
> 
> yes, user in this case is the admin, who controls the provisioned
> network function SF/VFs.. by turning off this knob it allows to create
> more of that resource in case the user/admin is limited by memory.

Ah, so in case of the SmartNIC this extra memory is allocated on the
control system, not where the function resides?

My next question is regarding the behavior on the target system - what
does "that user" see? Can we expect they will understand that the
limitation was imposed by the admin and not due to some initialization
failure or SW incompatibility?

> > > This is achieved by extending existing 'port function' object to
> > > control
> > > capabilities of a function. This enables users to control
> > > capability of
> > > the device before enumeration.
> > > 
> > > Examples when user prefers to disable RoCE for a VF when using
> > > switchdev
> > > mode:
> > > 
> > > $ devlink port show pci/0000:06:00.0/1
> > > pci/0000:06:00.0/1: type eth netdev pf0vf0 flavour pcivf controller
> > > 0
> > > pfnum 0 vfnum 0 external false splittable false
> > >   function:
> > >     hw_addr 00:00:00:00:00:00 roce on
> > > 
> > > $ devlink port function set pci/0000:06:00.0/1 roce off
> > >   
> > > $ devlink port show pci/0000:06:00.0/1
> > > pci/0000:06:00.0/1: type eth netdev pf0vf0 flavour pcivf controller
> > > 0
> > > pfnum 0 vfnum 0 external false splittable false
> > >   function:
> > >     hw_addr 00:00:00:00:00:00 roce off
> > > 
> > > FAQs:
> > > -----
> > > 1. What does roce on/off do?
> > > Ans: It disables RoCE capability of the function before its
> > > enumerated,
> > > so when driver reads the capability from the device firmware, it is
> > > disabled.
> > > At this point RDMA stack will not be able to create UD, QP1, RC,
> > > XRC
> > > type of QPs. When RoCE is disabled, the GID table of all ports of
> > > the
> > > device is disabled in the device and software stack.
> > > 
> > > 2. How is the roce 'port function' option different from existing
> > > devlink param?
> > > Ans: RoCE attribute at the port function level disables the RoCE
> > > capability at the specific function level; while enable_roce only
> > > does
> > > at the software level.
> > > 
> > > 3. Why is this option for disabling only RoCE and not the whole
> > > RDMA
> > > device?
> > > Ans: Because user still wants to use the RDMA device for non RoCE
> > > commands in more memory efficient way.  
> > 
> > What are those "non-RoCE commands" that user may want to use "in a
> > more
> > efficient way"?  
> 
> RAW eth QP, i think you already know this one, it is a very thin layer
> that doesn't require the whole rdma stack.

Sorry for asking a leading question. You know how we'll feel about
that one, do we need to talk this out or can we save ourselves the
battle? :S

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net-next 0/2] devlink: Add port function attribute to enable/disable roce
  2021-02-03 18:51     ` Jakub Kicinski
@ 2021-02-03 19:22       ` Saeed Mahameed
  2021-02-03 21:26         ` Jakub Kicinski
  0 siblings, 1 reply; 9+ messages in thread
From: Saeed Mahameed @ 2021-02-03 19:22 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Yishai Hadas, netdev, davem, parav

On Wed, 2021-02-03 at 10:51 -0800, Jakub Kicinski wrote:
> On Tue, 02 Feb 2021 20:13:48 -0800 Saeed Mahameed wrote:
> > On Tue, 2021-02-02 at 18:14 -0800, Jakub Kicinski wrote:
> > > On Mon, 1 Feb 2021 19:51:50 +0200 Yishai Hadas wrote:  
> > > > Currently mlx5 PCI VF and SF are enabled by default for RoCE
> > > > functionality.
> > > > 
> > > > Currently a user does not have the ability to disable RoCE for
> > > > a
> > > > PCI
> > > > VF/SF device before such device is enumerated by the driver.
> > > > 
> > > > User is also incapable to do such setting from smartnic
> > > > scenario
> > > > for a
> > > > VF from the smartnic.
> > > > 
> > > > Current 'enable_roce' device knob is limited to do setting only
> > > > at
> > > > driverinit time. By this time device is already created and
> > > > firmware has
> > > > already allocated necessary system memory for supporting RoCE.
> > > > 
> > > > When a RoCE is disabled for the PCI VF/SF device, it saves 1
> > > > Mbyte
> > > > of
> > > > system memory per function. Such saving is helpful when running
> > > > on
> > > > low
> > > > memory embedded platform with many VFs or SFs.
> > > > 
> > > > Therefore, it is desired to empower user to disable RoCE
> > > > functionality
> > > > before a PCI SF/VF device is enumerated.  
> > > 
> > > You say that the user on the VF/SF side wants to save memory, yet
> > > the control knob is on the eswitch instance side, correct?
> > >   
> > 
> > yes, user in this case is the admin, who controls the provisioned
> > network function SF/VFs.. by turning off this knob it allows to
> > create
> > more of that resource in case the user/admin is limited by memory.
> 
> Ah, so in case of the SmartNIC this extra memory is allocated on the
> control system, not where the function resides?
> 

most of the memeory are actually allocated from where the function
resides, some are on the management system but it is not as critical.
SFs for now can only be probed on the management system, so the main
issue will be on the SmartNIC side for now.

> My next question is regarding the behavior on the target system -
> what
> does "that user" see? Can we expect they will understand that the
> limitation was imposed by the admin and not due to some
> initialization
> failure or SW incompatibility?
> 

the whole thing works with only real HW capabilities, there is no
synthetic SW capabilities. 

when mlx5 instance driver loads, it doesn't assume anything about
underlying HW, and it queries for the advertised FW capability
according to the HW spec before it enables a feature.

so this patch adds the ability for admin to enforce a specific HW cap
"off" for a VF/SF hca slice.


> > > > This is achieved by extending existing 'port function' object
> > > > to
> > > > control
> > > > capabilities of a function. This enables users to control
> > > > capability of
> > > > the device before enumeration.
> > > > 
> > > > Examples when user prefers to disable RoCE for a VF when using
> > > > switchdev
> > > > mode:
> > > > 
> > > > $ devlink port show pci/0000:06:00.0/1
> > > > pci/0000:06:00.0/1: type eth netdev pf0vf0 flavour pcivf
> > > > controller
> > > > 0
> > > > pfnum 0 vfnum 0 external false splittable false
> > > >   function:
> > > >     hw_addr 00:00:00:00:00:00 roce on
> > > > 
> > > > $ devlink port function set pci/0000:06:00.0/1 roce off
> > > >   
> > > > $ devlink port show pci/0000:06:00.0/1
> > > > pci/0000:06:00.0/1: type eth netdev pf0vf0 flavour pcivf
> > > > controller
> > > > 0
> > > > pfnum 0 vfnum 0 external false splittable false
> > > >   function:
> > > >     hw_addr 00:00:00:00:00:00 roce off
> > > > 
> > > > FAQs:
> > > > -----
> > > > 1. What does roce on/off do?
> > > > Ans: It disables RoCE capability of the function before its
> > > > enumerated,
> > > > so when driver reads the capability from the device firmware,
> > > > it is
> > > > disabled.
> > > > At this point RDMA stack will not be able to create UD, QP1,
> > > > RC,
> > > > XRC
> > > > type of QPs. When RoCE is disabled, the GID table of all ports
> > > > of
> > > > the
> > > > device is disabled in the device and software stack.
> > > > 
> > > > 2. How is the roce 'port function' option different from
> > > > existing
> > > > devlink param?
> > > > Ans: RoCE attribute at the port function level disables the
> > > > RoCE
> > > > capability at the specific function level; while enable_roce
> > > > only
> > > > does
> > > > at the software level.
> > > > 
> > > > 3. Why is this option for disabling only RoCE and not the whole
> > > > RDMA
> > > > device?
> > > > Ans: Because user still wants to use the RDMA device for non
> > > > RoCE
> > > > commands in more memory efficient way.  
> > > 
> > > What are those "non-RoCE commands" that user may want to use "in
> > > a
> > > more
> > > efficient way"?  
> > 
> > RAW eth QP, i think you already know this one, it is a very thin
> > layer
> > that doesn't require the whole rdma stack.
> 
> Sorry for asking a leading question. You know how we'll feel about
> that one, do we need to talk this out or can we save ourselves the
> battle? :S

I know, I know :/

So, there is no rdma bit/cap in HW.. to disable non-RoCE commands we
will have to disable etherent capability. 

The user interface here has no synthetic semantics, all knobs will
eventually be mapped to real HW/FW capabilities to get disabled.

the whole feature is about allowing admin to ship network functions
with different capabilities that are actually enforced by FW/HW.. 
so the user of the VF will see, RDMA/ETH only cards or both.







^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net-next 0/2] devlink: Add port function attribute to enable/disable roce
  2021-02-03 19:22       ` Saeed Mahameed
@ 2021-02-03 21:26         ` Jakub Kicinski
  2021-02-09  9:39           ` Saeed Mahameed
  0 siblings, 1 reply; 9+ messages in thread
From: Jakub Kicinski @ 2021-02-03 21:26 UTC (permalink / raw)
  To: Saeed Mahameed; +Cc: Yishai Hadas, netdev, davem, parav

On Wed, 03 Feb 2021 11:22:44 -0800 Saeed Mahameed wrote:
> On Wed, 2021-02-03 at 10:51 -0800, Jakub Kicinski wrote:
> > On Tue, 02 Feb 2021 20:13:48 -0800 Saeed Mahameed wrote:  
> > > yes, user in this case is the admin, who controls the provisioned
> > > network function SF/VFs.. by turning off this knob it allows to
> > > create
> > > more of that resource in case the user/admin is limited by memory.  
> > 
> > Ah, so in case of the SmartNIC this extra memory is allocated on the
> > control system, not where the function resides?
> 
> most of the memeory are actually allocated from where the function
> resides, some are on the management system but it is not as critical.
> SFs for now can only be probed on the management system, so the main
> issue will be on the SmartNIC side for now.

Why not leave the decision whether to allocate that memory or not to
the SF itself? If user never binds the RDMA driver to the SF they
clearly don't care for RDMA. No extra knobs needed.

> > My next question is regarding the behavior on the target system -
> > what
> > does "that user" see? Can we expect they will understand that the
> > limitation was imposed by the admin and not due to some
> > initialization
> > failure or SW incompatibility?
>
> the whole thing works with only real HW capabilities, there is no
> synthetic SW capabilities. 
> 
> when mlx5 instance driver loads, it doesn't assume anything about
> underlying HW, and it queries for the advertised FW capability
> according to the HW spec before it enables a feature.
> 
> so this patch adds the ability for admin to enforce a specific HW cap
> "off" for a VF/SF hca slice.
> 
> > > RAW eth QP, i think you already know this one, it is a very thin
> > > layer
> > > that doesn't require the whole rdma stack.  
> > 
> > Sorry for asking a leading question. You know how we'll feel about
> > that one, do we need to talk this out or can we save ourselves the
> > battle? :S  
> 
> I know, I know :/
> 
> So, there is no rdma bit/cap in HW.. to disable non-RoCE commands we
> will have to disable etherent capability. 

It's your driver, you can make it do what you need to. Why does 
the RDMA driver bind successfully to a non-RoCE Ethernet device 
in the first place?

> The user interface here has no synthetic semantics, all knobs will
> eventually be mapped to real HW/FW capabilities to get disabled.
> 
> the whole feature is about allowing admin to ship network functions
> with different capabilities that are actually enforced by FW/HW.. 
> so the user of the VF will see, RDMA/ETH only cards or both.

RDMA-only, ETH-only, RDMA+ETH makes sense to me. Having an ETH-only
device also exposed though rdma subsystem does not.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net-next 0/2] devlink: Add port function attribute to enable/disable roce
  2021-02-03 21:26         ` Jakub Kicinski
@ 2021-02-09  9:39           ` Saeed Mahameed
  0 siblings, 0 replies; 9+ messages in thread
From: Saeed Mahameed @ 2021-02-09  9:39 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Yishai Hadas, netdev, davem, parav

On Wed, 2021-02-03 at 13:26 -0800, Jakub Kicinski wrote:
> On Wed, 03 Feb 2021 11:22:44 -0800 Saeed Mahameed wrote:
> > On Wed, 2021-02-03 at 10:51 -0800, Jakub Kicinski wrote:
> > > On Tue, 02 Feb 2021 20:13:48 -0800 Saeed Mahameed wrote:  
> > > > yes, user in this case is the admin, who controls the
> > > > provisioned
> > > > network function SF/VFs.. by turning off this knob it allows to
> > > > create
> > > > more of that resource in case the user/admin is limited by
> > > > memory.  
> > > 
> > > Ah, so in case of the SmartNIC this extra memory is allocated on
> > > the
> > > control system, not where the function resides?
> > 
> > most of the memeory are actually allocated from where the function
> > resides, some are on the management system but it is not as
> > critical.
> > SFs for now can only be probed on the management system, so the
> > main
> > issue will be on the SmartNIC side for now.
> 
> Why not leave the decision whether to allocate that memory or not to
> the SF itself? If user never binds the RDMA driver to the SF they
> clearly don't care for RDMA. No extra knobs needed.
> 

Sorry about the late response.

But FW is already setup for RDMA weather SW wants it or not for this
specific function.
a system admin may want to deploy a leaner function to the client. we
can disable RoCE in FW before we even instantiate this function.

> > > My next question is regarding the behavior on the target system -
> > > what
> > > does "that user" see? Can we expect they will understand that the
> > > limitation was imposed by the admin and not due to some
> > > initialization
> > > failure or SW incompatibility?
> > 
> > the whole thing works with only real HW capabilities, there is no
> > synthetic SW capabilities. 
> > 
> > when mlx5 instance driver loads, it doesn't assume anything about
> > underlying HW, and it queries for the advertised FW capability
> > according to the HW spec before it enables a feature.
> > 
> > so this patch adds the ability for admin to enforce a specific HW
> > cap
> > "off" for a VF/SF hca slice.
> > 
> > > > RAW eth QP, i think you already know this one, it is a very
> > > > thin
> > > > layer
> > > > that doesn't require the whole rdma stack.  
> > > 
> > > Sorry for asking a leading question. You know how we'll feel
> > > about
> > > that one, do we need to talk this out or can we save ourselves
> > > the
> > > battle? :S  
> > 
> > I know, I know :/
> > 
> > So, there is no rdma bit/cap in HW.. to disable non-RoCE commands
> > we
> > will have to disable etherent capability. 
> 
> It's your driver, you can make it do what you need to. Why does 
> the RDMA driver bind successfully to a non-RoCE Ethernet device 
> in the first place?
> 

because RDMA people are greedy :), and they can create QPs that are not
RoCE..

this patch is only basic enable/disable RoCE feature on a specific
device slice, how rdma driver initializes itself is not related to this
patch at all, and it is just how rdma works, with or without this
patch.

> > The user interface here has no synthetic semantics, all knobs will
> > eventually be mapped to real HW/FW capabilities to get disabled.
> > 
> > the whole feature is about allowing admin to ship network functions
> > with different capabilities that are actually enforced by FW/HW.. 
> > so the user of the VF will see, RDMA/ETH only cards or both.
> 
> RDMA-only, ETH-only, RDMA+ETH makes sense to me. Having an ETH-only
> device also exposed though rdma subsystem does not.

But this has nothing to do with this patch. this is the rdma driver
implementation.




^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2021-02-09  9:42 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-01 17:51 [PATCH net-next 0/2] devlink: Add port function attribute to enable/disable roce Yishai Hadas
2021-02-01 17:51 ` [PATCH net-next 1/2] devlink: Expose port function commands to control roce Yishai Hadas
2021-02-01 17:51 ` [PATCH net-next 2/2] net/mlx5: E-Switch, Implement devlink port function cmds " Yishai Hadas
2021-02-03  2:14 ` [PATCH net-next 0/2] devlink: Add port function attribute to enable/disable roce Jakub Kicinski
2021-02-03  4:13   ` Saeed Mahameed
2021-02-03 18:51     ` Jakub Kicinski
2021-02-03 19:22       ` Saeed Mahameed
2021-02-03 21:26         ` Jakub Kicinski
2021-02-09  9:39           ` Saeed Mahameed

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.