All of lore.kernel.org
 help / color / mirror / Atom feed
* [net-next v5 00/15] Add mlx5 subfunction support
@ 2020-12-15  9:03 Saeed Mahameed
  2020-12-15  9:03 ` [net-next v5 01/15] net/mlx5: Fix compilation warning for 32-bit platform Saeed Mahameed
                   ` (14 more replies)
  0 siblings, 15 replies; 46+ messages in thread
From: Saeed Mahameed @ 2020-12-15  9:03 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Jason Gunthorpe
  Cc: Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Saeed Mahameed

From: Saeed Mahameed <saeedm@nvidia.com>

Hi Dave, Jakub, Jason,

This series form Parav was the theme of this mlx5 release cycle,
we've been waiting anxiously for the auxbus infrastructure to make it into
the kernel, and now as the auxbus is in and all the stars are aligned, I
can finally submit this V2 of the devlink and mlx5 subfunction support.

Subfunctions came to solve the scaling issue of virtualization
and switchdev environments, where SRIOV failed to deliver and users ran
out of VFs very quickly as SRIOV demands huge amount of physical resources
in both of the servers and the NIC.

Subfunction provide the same functionality as SRIOV but in a very
lightweight manner, please see the thorough and detailed
documentation from Parav below, in the commit messages and the
Networking documentation patches at the end of this series.

Sending V4 as a continuation to V1 that was sent Last month [0],
[0] https://lore.kernel.org/linux-rdma/20201112192424.2742-1-parav@nvidia.com/

---
Changelog:
v4->v5:
 - Fix some typos in the documentation
 
v3->v4:
 - Fix 32bit compilation issue

v2->v3:
 - added header file sf/priv.h to cmd.c to avoid missing prototype warning
 - made mlx5_sf_table_disable as static function as its used only in one file

v1->v2:
 - added documentation for subfunction and its mlx5 implementation
 - add MLX5_SF config option documentation
 - rebased
 - dropped devlink global lock improvement patch as mlx5 doesn't support
   reload while SFs are allocated
 - dropped devlink reload lock patch as mlx5 doesn't support reload
   when SFs are allocated
 - using updated vhca event from device to add remove auxiliary device
 - split sf devlink port allocation and sf hardware context allocation

Parav Pandit Says:
=================

This patchset introduces support for mlx5 subfunction (SF).

A subfunction is a lightweight function that has a parent PCI function on
which it is deployed. mlx5 subfunction has its own function capabilities
and its own resources. This means a subfunction has its own dedicated
queues(txq, rxq, cq, eq). These queues are neither shared nor stolen from
the parent PCI function.

When subfunction is RDMA capable, it has its own QP1, GID table and rdma
resources neither shared nor stolen from the parent PCI function.

A subfunction has dedicated window in PCI BAR space that is not shared
with the other subfunctions or parent PCI function. This ensures that all
class devices of the subfunction accesses only assigned PCI BAR space.

A Subfunction supports eswitch representation through which it supports tc
offloads. User must configure eswitch to send/receive packets from/to
subfunction port.

Subfunctions share PCI level resources such as PCI MSI-X IRQs with
their other subfunctions and/or with its parent PCI function.

Patch summary:
--------------
Patch 1 to 4 prepares devlink
patch 5 to 7 mlx5 adds SF device support
Patch 8 to 11 mlx5 adds SF devlink port support
Patch 12 and 14 adds documentation

Patch-1 prepares code to handle multiple port function attributes
Patch-2 introduces devlink pcisf port flavour similar to pcipf and pcivf
Patch-3 adds port add and delete driver callbacks
Patch-4 adds port function state get and set callbacks
Patch-5 mlx5 vhca event notifier support to distribute subfunction
        state change notification
Patch-6 adds SF auxiliary device
Patch-7 adds SF auxiliary driver
Patch-8 prepares eswitch to handler SF vport
Patch-9 adds eswitch helpers to add/remove SF vport
Patch-10 implements devlink port add/del callbacks
Patch-11 implements devlink port function get/set callbacks
Patch-12 to 14 adds documentation
Patch-12 added mlx5 port function documentation
Patch-13 adds subfunction documentation
Patch-14 adds mlx5 subfunction documentation

Subfunction support is discussed in detail in RFC [1] and [2].
RFC [1] and extension [2] describes requirements, design and proposed
plumbing using devlink, auxiliary bus and sysfs for systemd/udev
support. Functionality of this patchset is best explained using real
examples further below.

overview:
--------
A subfunction can be created and deleted by a user using devlink port
add/delete interface.

A subfunction can be configured using devlink port function attribute
before its activated.

When a subfunction is activated, it results in an auxiliary device on
the host PCI device where it is deployed. A driver binds to the
auxiliary device that further creates supported class devices.

example subfunction usage sequence:
-----------------------------------
Change device to switchdev mode:
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev

Add a devlink port of subfunction flavour:
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88

Configure mac address of the port function:
$ devlink port function set ens2f0npf0sf88 hw_addr 00:00:00:00:88:88

Now activate the function:
$ devlink port function set ens2f0npf0sf88 state active

Now use the auxiliary device and class devices:
$ devlink dev show
pci/0000:06:00.0
auxiliary/mlx5_core.sf.4

$ ip link show
127: ens2f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 24:8a:07:b3:d1:12 brd ff:ff:ff:ff:ff:ff
    altname enp6s0f0np0
129: p0sf88: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:00:00:00:88:88 brd ff:ff:ff:ff:ff:ff

$ rdma dev show
43: rdmap6s0f0: node_type ca fw 16.29.0550 node_guid 248a:0703:00b3:d112 sys_image_guid 248a:0703:00b3:d112
44: mlx5_0: node_type ca fw 16.29.0550 node_guid 0000:00ff:fe00:8888 sys_image_guid 248a:0703:00b3:d112

After use inactivate the function:
$ devlink port function set ens2f0npf0sf88 state inactive

Now delete the subfunction port:
$ devlink port del ens2f0npf0sf88

[1] https://lore.kernel.org/netdev/20200519092258.GF4655@nanopsycho/
[2] https://marc.info/?l=linux-netdev&m=158555928517777&w=2

=================

Parav Pandit (14):
  net/mlx5: Fix compilation warning for 32-bit platform
  devlink: Prepare code to fill multiple port function attributes
  devlink: Introduce PCI SF port flavour and port attribute
  devlink: Support add and delete devlink port
  devlink: Support get and set state of port function
  net/mlx5: Introduce vhca state event notifier
  net/mlx5: SF, Add auxiliary device support
  net/mlx5: SF, Add auxiliary device driver
  net/mlx5: E-switch, Add eswitch helpers for SF vport
  net/mlx5: SF, Add port add delete functionality
  net/mlx5: SF, Port function state change support
  devlink: Add devlink port documentation
  devlink: Extend devlink port documentation for subfunctions
  net/mlx5: Add devlink subfunction port documentation

Vu Pham (1):
  net/mlx5: E-switch, Prepare eswitch to handle SF vport

 Documentation/driver-api/auxiliary_bus.rst    |   2 +
 .../device_drivers/ethernet/mellanox/mlx5.rst | 209 +++++++
 .../networking/devlink/devlink-port.rst       | 199 +++++++
 Documentation/networking/devlink/index.rst    |   1 +
 .../net/ethernet/mellanox/mlx5/core/Kconfig   |  19 +
 .../net/ethernet/mellanox/mlx5/core/Makefile  |   9 +
 drivers/net/ethernet/mellanox/mlx5/core/cmd.c |   8 +
 .../net/ethernet/mellanox/mlx5/core/devlink.c |  19 +
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  |   5 +-
 .../mellanox/mlx5/core/esw/acl/egress_ofld.c  |   2 +-
 .../mellanox/mlx5/core/esw/devlink_port.c     |  41 ++
 .../net/ethernet/mellanox/mlx5/core/eswitch.c |  48 +-
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |  78 +++
 .../mellanox/mlx5/core/eswitch_offloads.c     |  47 +-
 .../net/ethernet/mellanox/mlx5/core/events.c  |   7 +
 .../net/ethernet/mellanox/mlx5/core/main.c    |  60 +-
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  12 +
 .../net/ethernet/mellanox/mlx5/core/pci_irq.c |  20 +
 .../net/ethernet/mellanox/mlx5/core/sf/cmd.c  |  49 ++
 .../ethernet/mellanox/mlx5/core/sf/dev/dev.c  | 271 +++++++++
 .../ethernet/mellanox/mlx5/core/sf/dev/dev.h  |  55 ++
 .../mellanox/mlx5/core/sf/dev/driver.c        | 101 ++++
 .../ethernet/mellanox/mlx5/core/sf/devlink.c  | 552 ++++++++++++++++++
 .../ethernet/mellanox/mlx5/core/sf/hw_table.c | 233 ++++++++
 .../mlx5/core/sf/mlx5_ifc_vhca_event.h        |  82 +++
 .../net/ethernet/mellanox/mlx5/core/sf/priv.h |  21 +
 .../net/ethernet/mellanox/mlx5/core/sf/sf.h   |  92 +++
 .../mellanox/mlx5/core/sf/vhca_event.c        | 189 ++++++
 .../mellanox/mlx5/core/sf/vhca_event.h        |  57 ++
 .../net/ethernet/mellanox/mlx5/core/vport.c   |   3 +-
 include/linux/mlx5/driver.h                   |  16 +-
 include/linux/mlx5/mlx5_ifc.h                 |   6 +-
 include/net/devlink.h                         |  79 +++
 include/uapi/linux/devlink.h                  |  26 +
 net/core/devlink.c                            | 266 ++++++++-
 35 files changed, 2834 insertions(+), 50 deletions(-)
 create mode 100644 Documentation/networking/devlink/devlink-port.rst
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/cmd.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/dev/driver.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/hw_table.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/mlx5_ifc_vhca_event.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/priv.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/vhca_event.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/vhca_event.h

-- 
2.26.2


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [net-next v5 01/15] net/mlx5: Fix compilation warning for 32-bit platform
  2020-12-15  9:03 [net-next v5 00/15] Add mlx5 subfunction support Saeed Mahameed
@ 2020-12-15  9:03 ` Saeed Mahameed
  2020-12-15  9:03 ` [net-next v5 02/15] devlink: Prepare code to fill multiple port function attributes Saeed Mahameed
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 46+ messages in thread
From: Saeed Mahameed @ 2020-12-15  9:03 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Jason Gunthorpe
  Cc: Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Parav Pandit, Stephen Rothwell, Saeed Mahameed

From: Parav Pandit <parav@nvidia.com>

MLX5_GENERAL_OBJECT_TYPES types bitfield is 64-bit field.

Defining an enum for such bit fields on 32-bit platform results in below
warning.

./include/vdso/bits.h:7:26: warning: left shift count >= width of type [-Wshift-count-overflow]
                         ^
./include/linux/mlx5/mlx5_ifc.h:10716:46: note: in expansion of macro ‘BIT’
 MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_SAMPLER = BIT(0x20),
                                             ^~~
Use 32-bit friendly left shift.

Fixes: 2a2970891647 ("net/mlx5: Add sample offload hardware bits and structures")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeed@kernel.org>
---
 include/linux/mlx5/mlx5_ifc.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 0d6e287d614f..b9f15935dfe5 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -10711,9 +10711,9 @@ struct mlx5_ifc_affiliated_event_header_bits {
 };
 
 enum {
-	MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_ENCRYPTION_KEY = BIT(0xc),
-	MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_IPSEC = BIT(0x13),
-	MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_SAMPLER = BIT(0x20),
+	MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_ENCRYPTION_KEY = 1ULL << 0xc,
+	MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_IPSEC = 1ULL << 0x13,
+	MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_SAMPLER = 1ULL << 0x20,
 };
 
 enum {
-- 
2.26.2


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [net-next v5 02/15] devlink: Prepare code to fill multiple port function attributes
  2020-12-15  9:03 [net-next v5 00/15] Add mlx5 subfunction support Saeed Mahameed
  2020-12-15  9:03 ` [net-next v5 01/15] net/mlx5: Fix compilation warning for 32-bit platform Saeed Mahameed
@ 2020-12-15  9:03 ` Saeed Mahameed
  2020-12-15  9:03 ` [net-next v5 03/15] devlink: Introduce PCI SF port flavour and port attribute Saeed Mahameed
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 46+ messages in thread
From: Saeed Mahameed @ 2020-12-15  9:03 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Jason Gunthorpe
  Cc: Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Parav Pandit, Jiri Pirko, Vu Pham, Saeed Mahameed

From: Parav Pandit <parav@nvidia.com>

Prepare code to fill zero or more port function optional attributes.
Subsequent patch makes use of this to fill more port function
attributes.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 net/core/devlink.c | 63 +++++++++++++++++++++++-----------------------
 1 file changed, 32 insertions(+), 31 deletions(-)

diff --git a/net/core/devlink.c b/net/core/devlink.c
index ee828e4b1007..13e0de80c4f9 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -712,6 +712,31 @@ static int devlink_nl_port_attrs_put(struct sk_buff *msg,
 	return 0;
 }
 
+static int
+devlink_port_function_hw_addr_fill(struct devlink *devlink, const struct devlink_ops *ops,
+				   struct devlink_port *port, struct sk_buff *msg,
+				   struct netlink_ext_ack *extack, bool *msg_updated)
+{
+	u8 hw_addr[MAX_ADDR_LEN];
+	int hw_addr_len;
+	int err;
+
+	if (!ops->port_function_hw_addr_get)
+		return 0;
+
+	err = ops->port_function_hw_addr_get(devlink, port, hw_addr, &hw_addr_len, extack);
+	if (err) {
+		if (err == -EOPNOTSUPP)
+			return 0;
+		return err;
+	}
+	err = nla_put(msg, DEVLINK_PORT_FUNCTION_ATTR_HW_ADDR, hw_addr_len, hw_addr);
+	if (err)
+		return err;
+	*msg_updated = true;
+	return 0;
+}
+
 static int
 devlink_nl_port_function_attrs_put(struct sk_buff *msg, struct devlink_port *port,
 				   struct netlink_ext_ack *extack)
@@ -719,36 +744,16 @@ devlink_nl_port_function_attrs_put(struct sk_buff *msg, struct devlink_port *por
 	struct devlink *devlink = port->devlink;
 	const struct devlink_ops *ops;
 	struct nlattr *function_attr;
-	bool empty_nest = true;
-	int err = 0;
+	bool msg_updated = false;
+	int err;
 
 	function_attr = nla_nest_start_noflag(msg, DEVLINK_ATTR_PORT_FUNCTION);
 	if (!function_attr)
 		return -EMSGSIZE;
 
 	ops = devlink->ops;
-	if (ops->port_function_hw_addr_get) {
-		int hw_addr_len;
-		u8 hw_addr[MAX_ADDR_LEN];
-
-		err = ops->port_function_hw_addr_get(devlink, port, hw_addr, &hw_addr_len, extack);
-		if (err == -EOPNOTSUPP) {
-			/* Port function attributes are optional for a port. If port doesn't
-			 * support function attribute, returning -EOPNOTSUPP is not an error.
-			 */
-			err = 0;
-			goto out;
-		} else if (err) {
-			goto out;
-		}
-		err = nla_put(msg, DEVLINK_PORT_FUNCTION_ATTR_HW_ADDR, hw_addr_len, hw_addr);
-		if (err)
-			goto out;
-		empty_nest = false;
-	}
-
-out:
-	if (err || empty_nest)
+	err = devlink_port_function_hw_addr_fill(devlink, ops, port, msg, extack, &msg_updated);
+	if (err || !msg_updated)
 		nla_nest_cancel(msg, function_attr);
 	else
 		nla_nest_end(msg, function_attr);
@@ -986,7 +991,6 @@ devlink_port_function_hw_addr_set(struct devlink *devlink, struct devlink_port *
 	const struct devlink_ops *ops;
 	const u8 *hw_addr;
 	int hw_addr_len;
-	int err;
 
 	hw_addr = nla_data(attr);
 	hw_addr_len = nla_len(attr);
@@ -1011,12 +1015,7 @@ devlink_port_function_hw_addr_set(struct devlink *devlink, struct devlink_port *
 		return -EOPNOTSUPP;
 	}
 
-	err = ops->port_function_hw_addr_set(devlink, port, hw_addr, hw_addr_len, extack);
-	if (err)
-		return err;
-
-	devlink_port_notify(port, DEVLINK_CMD_PORT_NEW);
-	return 0;
+	return ops->port_function_hw_addr_set(devlink, port, hw_addr, hw_addr_len, extack);
 }
 
 static int
@@ -1037,6 +1036,8 @@ devlink_port_function_set(struct devlink *devlink, struct devlink_port *port,
 	if (attr)
 		err = devlink_port_function_hw_addr_set(devlink, port, attr, extack);
 
+	if (!err)
+		devlink_port_notify(port, DEVLINK_CMD_PORT_NEW);
 	return err;
 }
 
-- 
2.26.2


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [net-next v5 03/15] devlink: Introduce PCI SF port flavour and port attribute
  2020-12-15  9:03 [net-next v5 00/15] Add mlx5 subfunction support Saeed Mahameed
  2020-12-15  9:03 ` [net-next v5 01/15] net/mlx5: Fix compilation warning for 32-bit platform Saeed Mahameed
  2020-12-15  9:03 ` [net-next v5 02/15] devlink: Prepare code to fill multiple port function attributes Saeed Mahameed
@ 2020-12-15  9:03 ` Saeed Mahameed
  2020-12-15 23:27   ` Jakub Kicinski
  2020-12-15  9:03 ` [net-next v5 04/15] devlink: Support add and delete devlink port Saeed Mahameed
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 46+ messages in thread
From: Saeed Mahameed @ 2020-12-15  9:03 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Jason Gunthorpe
  Cc: Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Parav Pandit, Jiri Pirko, Vu Pham, Saeed Mahameed

From: Parav Pandit <parav@nvidia.com>

A PCI sub-function (SF) represents a portion of the device similar
to PCI VF.

In an eswitch, PCI SF may have port which is normally represented
using a representor netdevice.
To have better visibility of eswitch port, its association with SF,
and its representor netdevice, introduce a PCI SF port flavour.

When devlink port flavour is PCI SF, fill up PCI SF attributes of the
port.

Extend port name creation using PCI PF and SF number scheme on best
effort basis, so that vendor drivers can skip defining their own
scheme.
This is done as cApfNSfM, where A, N and M are controller, PCI PF and
PCI SF number respectively.
This is similar to existing naming for PCI PF and PCI VF ports.

An example view of a PCI SF port:

$ devlink port show pci/0000:06:00.0/32768
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
  function:
    hw_addr 00:00:00:00:88:88 state active opstate attached

$ devlink port show pci/0000:06:00.0/32768 -jp
{
    "port": {
        "pci/0000:06:00.0/32768": {
            "type": "eth",
            "netdev": "ens2f0npf0sf88",
            "flavour": "pcisf",
            "controller": 0,
            "pfnum": 0,
            "sfnum": 88,
            "external": false,
            "splittable": false,
            "function": {
                "hw_addr": "00:00:00:00:88:88",
                "state": "active",
                "opstate": "attached"
            }
        }
    }
}

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 include/net/devlink.h        | 17 +++++++++++++
 include/uapi/linux/devlink.h |  5 ++++
 net/core/devlink.c           | 46 ++++++++++++++++++++++++++++++++++++
 3 files changed, 68 insertions(+)

diff --git a/include/net/devlink.h b/include/net/devlink.h
index f466819cc477..5bd43f0a79a8 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -93,6 +93,20 @@ struct devlink_port_pci_vf_attrs {
 	u8 external:1;
 };
 
+/**
+ * struct devlink_port_pci_sf_attrs - devlink port's PCI SF attributes
+ * @controller: Associated controller number
+ * @pf: Associated PCI PF number for this port.
+ * @sf: Associated PCI SF for of the PCI PF for this port.
+ * @external: when set, indicates if a port is for an external controller
+ */
+struct devlink_port_pci_sf_attrs {
+	u32 controller;
+	u16 pf;
+	u32 sf;
+	u8 external:1;
+};
+
 /**
  * struct devlink_port_attrs - devlink port object
  * @flavour: flavour of the port
@@ -114,6 +128,7 @@ struct devlink_port_attrs {
 		struct devlink_port_phys_attrs phys;
 		struct devlink_port_pci_pf_attrs pci_pf;
 		struct devlink_port_pci_vf_attrs pci_vf;
+		struct devlink_port_pci_sf_attrs pci_sf;
 	};
 };
 
@@ -1404,6 +1419,8 @@ void devlink_port_attrs_pci_pf_set(struct devlink_port *devlink_port, u32 contro
 				   u16 pf, bool external);
 void devlink_port_attrs_pci_vf_set(struct devlink_port *devlink_port, u32 controller,
 				   u16 pf, u16 vf, bool external);
+void devlink_port_attrs_pci_sf_set(struct devlink_port *devlink_port, u32 controller,
+				   u16 pf, u32 sf, bool external);
 int devlink_sb_register(struct devlink *devlink, unsigned int sb_index,
 			u32 size, u16 ingress_pools_count,
 			u16 egress_pools_count, u16 ingress_tc_count,
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index 5203f54a2be1..6fe00f10eb3f 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -200,6 +200,10 @@ enum devlink_port_flavour {
 	DEVLINK_PORT_FLAVOUR_UNUSED, /* Port which exists in the switch, but
 				      * is not used in any way.
 				      */
+	DEVLINK_PORT_FLAVOUR_PCI_SF, /* Represents eswitch port
+				      * for the PCI SF. It is an internal
+				      * port that faces the PCI SF.
+				      */
 };
 
 enum devlink_param_cmode {
@@ -529,6 +533,7 @@ enum devlink_attr {
 	DEVLINK_ATTR_RELOAD_ACTION_INFO,        /* nested */
 	DEVLINK_ATTR_RELOAD_ACTION_STATS,       /* nested */
 
+	DEVLINK_ATTR_PORT_PCI_SF_NUMBER,	/* u32 */
 	/* add new attributes above here, update the policy in devlink.c */
 
 	__DEVLINK_ATTR_MAX,
diff --git a/net/core/devlink.c b/net/core/devlink.c
index 13e0de80c4f9..08eac247f200 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -690,6 +690,15 @@ static int devlink_nl_port_attrs_put(struct sk_buff *msg,
 		if (nla_put_u8(msg, DEVLINK_ATTR_PORT_EXTERNAL, attrs->pci_vf.external))
 			return -EMSGSIZE;
 		break;
+	case DEVLINK_PORT_FLAVOUR_PCI_SF:
+		if (nla_put_u32(msg, DEVLINK_ATTR_PORT_CONTROLLER_NUMBER,
+				attrs->pci_sf.controller) ||
+		    nla_put_u16(msg, DEVLINK_ATTR_PORT_PCI_PF_NUMBER, attrs->pci_sf.pf) ||
+		    nla_put_u32(msg, DEVLINK_ATTR_PORT_PCI_SF_NUMBER, attrs->pci_sf.sf))
+			return -EMSGSIZE;
+		if (nla_put_u8(msg, DEVLINK_ATTR_PORT_EXTERNAL, attrs->pci_sf.external))
+			return -EMSGSIZE;
+		break;
 	case DEVLINK_PORT_FLAVOUR_PHYSICAL:
 	case DEVLINK_PORT_FLAVOUR_CPU:
 	case DEVLINK_PORT_FLAVOUR_DSA:
@@ -8373,6 +8382,33 @@ void devlink_port_attrs_pci_vf_set(struct devlink_port *devlink_port, u32 contro
 }
 EXPORT_SYMBOL_GPL(devlink_port_attrs_pci_vf_set);
 
+/**
+ *	devlink_port_attrs_pci_sf_set - Set PCI SF port attributes
+ *
+ *	@devlink_port: devlink port
+ *	@controller: associated controller number for the devlink port instance
+ *	@pf: associated PF for the devlink port instance
+ *	@sf: associated SF of a PF for the devlink port instance
+ *	@external: indicates if the port is for an external controller
+ */
+void devlink_port_attrs_pci_sf_set(struct devlink_port *devlink_port, u32 controller,
+				   u16 pf, u32 sf, bool external)
+{
+	struct devlink_port_attrs *attrs = &devlink_port->attrs;
+	int ret;
+
+	if (WARN_ON(devlink_port->registered))
+		return;
+	ret = __devlink_port_attrs_set(devlink_port, DEVLINK_PORT_FLAVOUR_PCI_SF);
+	if (ret)
+		return;
+	attrs->pci_sf.controller = controller;
+	attrs->pci_sf.pf = pf;
+	attrs->pci_sf.sf = sf;
+	attrs->pci_sf.external = external;
+}
+EXPORT_SYMBOL_GPL(devlink_port_attrs_pci_sf_set);
+
 static int __devlink_port_phys_port_name_get(struct devlink_port *devlink_port,
 					     char *name, size_t len)
 {
@@ -8421,6 +8457,16 @@ static int __devlink_port_phys_port_name_get(struct devlink_port *devlink_port,
 		n = snprintf(name, len, "pf%uvf%u",
 			     attrs->pci_vf.pf, attrs->pci_vf.vf);
 		break;
+	case DEVLINK_PORT_FLAVOUR_PCI_SF:
+		if (attrs->pci_sf.external) {
+			n = snprintf(name, len, "c%u", attrs->pci_sf.controller);
+			if (n >= len)
+				return -EINVAL;
+			len -= n;
+			name += n;
+		}
+		n = snprintf(name, len, "pf%usf%u", attrs->pci_sf.pf, attrs->pci_sf.sf);
+		break;
 	}
 
 	if (n >= len)
-- 
2.26.2


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [net-next v5 04/15] devlink: Support add and delete devlink port
  2020-12-15  9:03 [net-next v5 00/15] Add mlx5 subfunction support Saeed Mahameed
                   ` (2 preceding siblings ...)
  2020-12-15  9:03 ` [net-next v5 03/15] devlink: Introduce PCI SF port flavour and port attribute Saeed Mahameed
@ 2020-12-15  9:03 ` Saeed Mahameed
  2020-12-16  0:29   ` Jakub Kicinski
  2020-12-15  9:03 ` [net-next v5 05/15] devlink: Support get and set state of port function Saeed Mahameed
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 46+ messages in thread
From: Saeed Mahameed @ 2020-12-15  9:03 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Jason Gunthorpe
  Cc: Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Parav Pandit, Jiri Pirko, Vu Pham, Saeed Mahameed

From: Parav Pandit <parav@nvidia.com>

Extended devlink interface for the user to add and delete port.
Extend devlink to connect user requests to driver to add/delete
such port in the device.

When driver routines are invoked, devlink instance lock is not held.
This enables driver to perform several devlink objects registration,
unregistration such as (port, health reporter, resource etc)
by using existing devlink APIs.
This also helps to uniformly use the code for port unregistration
during driver unload and during port deletion initiated by user.

Examples of add, show and delete commands:
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev

$ devlink port show
pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 splittable false

$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88

$ devlink port show pci/0000:06:00.0/32768
pci/0000:06:00.0/32768: type eth netdev eth0 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
  function:
    hw_addr 00:00:00:00:88:88 state inactive opstate detached

$ udevadm test-builtin net_id /sys/class/net/eth0
Load module index
Parsed configuration file /usr/lib/systemd/network/99-default.link
Created link configuration context.
Using default interface naming scheme 'v245'.
ID_NET_NAMING_SCHEME=v245
ID_NET_NAME_PATH=enp6s0f0npf0sf88
ID_NET_NAME_SLOT=ens2f0npf0sf88
Unload module index
Unloaded link configuration context.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 include/net/devlink.h | 39 ++++++++++++++++++++++++
 net/core/devlink.c    | 71 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 110 insertions(+)

diff --git a/include/net/devlink.h b/include/net/devlink.h
index 5bd43f0a79a8..f8cff3e402da 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -153,6 +153,17 @@ struct devlink_port {
 	struct mutex reporters_lock; /* Protects reporter_list */
 };
 
+struct devlink_port_new_attrs {
+	enum devlink_port_flavour flavour;
+	unsigned int port_index;
+	u32 controller;
+	u32 sfnum;
+	u16 pfnum;
+	u8 port_index_valid:1,
+	   controller_valid:1,
+	   sfnum_valid:1;
+};
+
 struct devlink_sb_pool_info {
 	enum devlink_sb_pool_type pool_type;
 	u32 size;
@@ -1363,6 +1374,34 @@ struct devlink_ops {
 	int (*port_function_hw_addr_set)(struct devlink *devlink, struct devlink_port *port,
 					 const u8 *hw_addr, int hw_addr_len,
 					 struct netlink_ext_ack *extack);
+	/**
+	 * @port_new: Port add function.
+	 *
+	 * Should be used by device driver to let caller add new port of a
+	 * specified flavour with optional attributes.
+	 * Driver should return -EOPNOTSUPP if it doesn't support port addition
+	 * of a specified flavour or specified attributes. Driver should set
+	 * extack error message in case of fail to add the port. Devlink core
+	 * does not hold a devlink instance lock when this callback is invoked.
+	 * Driver must ensures synchronization when adding or deleting a port.
+	 * Driver must register a port with devlink core.
+	 */
+	int (*port_new)(struct devlink *devlink,
+			const struct devlink_port_new_attrs *attrs,
+			struct netlink_ext_ack *extack);
+	/**
+	 * @port_del: Port delete function.
+	 *
+	 * Should be used by device driver to let caller delete port which was
+	 * previously created using port_new() callback.
+	 * Driver should return -EOPNOTSUPP if it doesn't support port deletion.
+	 * Driver should set extack error message in case of fail to delete the
+	 * port. Devlink core does not hold a devlink instance lock when this
+	 * callback is invoked. Driver must ensures synchronization when adding
+	 * or deleting a port. Driver must register a port with devlink core.
+	 */
+	int (*port_del)(struct devlink *devlink, unsigned int port_index,
+			struct netlink_ext_ack *extack);
 };
 
 static inline void *devlink_priv(struct devlink *devlink)
diff --git a/net/core/devlink.c b/net/core/devlink.c
index 08eac247f200..11043707f63f 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -1146,6 +1146,61 @@ static int devlink_nl_cmd_port_unsplit_doit(struct sk_buff *skb,
 	return devlink_port_unsplit(devlink, port_index, info->extack);
 }
 
+static int devlink_nl_cmd_port_new_doit(struct sk_buff *skb,
+					struct genl_info *info)
+{
+	struct netlink_ext_ack *extack = info->extack;
+	struct devlink_port_new_attrs new_attrs = {};
+	struct devlink *devlink = info->user_ptr[0];
+
+	if (!info->attrs[DEVLINK_ATTR_PORT_FLAVOUR] ||
+	    !info->attrs[DEVLINK_ATTR_PORT_PCI_PF_NUMBER]) {
+		NL_SET_ERR_MSG_MOD(extack, "Port flavour or PCI PF are not specified");
+		return -EINVAL;
+	}
+	new_attrs.flavour = nla_get_u16(info->attrs[DEVLINK_ATTR_PORT_FLAVOUR]);
+	new_attrs.pfnum =
+		nla_get_u16(info->attrs[DEVLINK_ATTR_PORT_PCI_PF_NUMBER]);
+
+	if (info->attrs[DEVLINK_ATTR_PORT_INDEX]) {
+		new_attrs.port_index =
+			nla_get_u32(info->attrs[DEVLINK_ATTR_PORT_INDEX]);
+		new_attrs.port_index_valid = true;
+	}
+	if (info->attrs[DEVLINK_ATTR_PORT_CONTROLLER_NUMBER]) {
+		new_attrs.controller =
+			nla_get_u16(info->attrs[DEVLINK_ATTR_PORT_CONTROLLER_NUMBER]);
+		new_attrs.controller_valid = true;
+	}
+	if (info->attrs[DEVLINK_ATTR_PORT_PCI_SF_NUMBER]) {
+		new_attrs.sfnum = nla_get_u32(info->attrs[DEVLINK_ATTR_PORT_PCI_SF_NUMBER]);
+		new_attrs.sfnum_valid = true;
+	}
+
+	if (!devlink->ops->port_new)
+		return -EOPNOTSUPP;
+
+	return devlink->ops->port_new(devlink, &new_attrs, extack);
+}
+
+static int devlink_nl_cmd_port_del_doit(struct sk_buff *skb,
+					struct genl_info *info)
+{
+	struct netlink_ext_ack *extack = info->extack;
+	struct devlink *devlink = info->user_ptr[0];
+	unsigned int port_index;
+
+	if (!info->attrs[DEVLINK_ATTR_PORT_INDEX]) {
+		NL_SET_ERR_MSG_MOD(extack, "Port index is not specified");
+		return -EINVAL;
+	}
+	port_index = nla_get_u32(info->attrs[DEVLINK_ATTR_PORT_INDEX]);
+
+	if (!devlink->ops->port_del)
+		return -EOPNOTSUPP;
+	return devlink->ops->port_del(devlink, port_index, extack);
+}
+
 static int devlink_nl_sb_fill(struct sk_buff *msg, struct devlink *devlink,
 			      struct devlink_sb *devlink_sb,
 			      enum devlink_command cmd, u32 portid,
@@ -7604,6 +7659,10 @@ static const struct nla_policy devlink_nl_policy[DEVLINK_ATTR_MAX + 1] = {
 	[DEVLINK_ATTR_RELOAD_ACTION] = NLA_POLICY_RANGE(NLA_U8, DEVLINK_RELOAD_ACTION_DRIVER_REINIT,
 							DEVLINK_RELOAD_ACTION_MAX),
 	[DEVLINK_ATTR_RELOAD_LIMITS] = NLA_POLICY_BITFIELD32(DEVLINK_RELOAD_LIMITS_VALID_MASK),
+	[DEVLINK_ATTR_PORT_FLAVOUR] = { .type = NLA_U16 },
+	[DEVLINK_ATTR_PORT_PCI_PF_NUMBER] = { .type = NLA_U16 },
+	[DEVLINK_ATTR_PORT_PCI_SF_NUMBER] = { .type = NLA_U32 },
+	[DEVLINK_ATTR_PORT_CONTROLLER_NUMBER] = { .type = NLA_U32 },
 };
 
 static const struct genl_small_ops devlink_nl_ops[] = {
@@ -7643,6 +7702,18 @@ static const struct genl_small_ops devlink_nl_ops[] = {
 		.flags = GENL_ADMIN_PERM,
 		.internal_flags = DEVLINK_NL_FLAG_NO_LOCK,
 	},
+	{
+		.cmd = DEVLINK_CMD_PORT_NEW,
+		.doit = devlink_nl_cmd_port_new_doit,
+		.flags = GENL_ADMIN_PERM,
+		.internal_flags = DEVLINK_NL_FLAG_NO_LOCK,
+	},
+	{
+		.cmd = DEVLINK_CMD_PORT_DEL,
+		.doit = devlink_nl_cmd_port_del_doit,
+		.flags = GENL_ADMIN_PERM,
+		.internal_flags = DEVLINK_NL_FLAG_NO_LOCK,
+	},
 	{
 		.cmd = DEVLINK_CMD_SB_GET,
 		.validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
-- 
2.26.2


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [net-next v5 05/15] devlink: Support get and set state of port function
  2020-12-15  9:03 [net-next v5 00/15] Add mlx5 subfunction support Saeed Mahameed
                   ` (3 preceding siblings ...)
  2020-12-15  9:03 ` [net-next v5 04/15] devlink: Support add and delete devlink port Saeed Mahameed
@ 2020-12-15  9:03 ` Saeed Mahameed
  2020-12-16  0:37   ` Jakub Kicinski
  2020-12-15  9:03 ` [net-next v5 06/15] net/mlx5: Introduce vhca state event notifier Saeed Mahameed
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 46+ messages in thread
From: Saeed Mahameed @ 2020-12-15  9:03 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Jason Gunthorpe
  Cc: Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Parav Pandit, Jiri Pirko, Vu Pham, Saeed Mahameed

From: Parav Pandit <parav@nvidia.com>

devlink port function can be in active or inactive state.
Allow users to get and set port function's state.

When the port function it activated, its operational state may change
after a while when the device is created and driver binds to it.
Similarly on deactivation flow.

To clearly describe the state of the port function and its device's
operational state in the host system, define state and opstate
attributes.

Example of a PCI SF port which supports a port function:
Create a device with ID=10 and one physical port.

$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev

$ devlink port show
pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 splittable false

$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88

$ devlink port show pci/0000:06:00.0/32768
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
  function:
    hw_addr 00:00:00:00:88:88 state inactive opstate detached

$ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88 state active

$ devlink port show pci/0000:06:00.0/32768 -jp
{
    "port": {
        "pci/0000:06:00.0/32768": {
            "type": "eth",
            "netdev": "ens2f0npf0sf88",
            "flavour": "pcisf",
            "controller": 0,
            "pfnum": 0,
            "sfnum": 88,
            "external": false,
            "splittable": false,
            "function": {
                "hw_addr": "00:00:00:00:88:88",
                "state": "active",
                "opstate": "attached"
            }
        }
    }
}

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 include/net/devlink.h        | 23 +++++++++
 include/uapi/linux/devlink.h | 21 +++++++++
 net/core/devlink.c           | 90 +++++++++++++++++++++++++++++++++++-
 3 files changed, 133 insertions(+), 1 deletion(-)

diff --git a/include/net/devlink.h b/include/net/devlink.h
index f8cff3e402da..18a7e66b7982 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -1374,6 +1374,29 @@ struct devlink_ops {
 	int (*port_function_hw_addr_set)(struct devlink *devlink, struct devlink_port *port,
 					 const u8 *hw_addr, int hw_addr_len,
 					 struct netlink_ext_ack *extack);
+	/**
+	 * @port_function_state_get: Port function's state get function.
+	 *
+	 * Should be used by device drivers to report the state of a function
+	 * managed by the devlink port. Driver should return -EOPNOTSUPP if it
+	 * doesn't support port function handling for a particular port.
+	 */
+	int (*port_function_state_get)(struct devlink *devlink,
+				       struct devlink_port *port,
+				       enum devlink_port_function_state *state,
+				       enum devlink_port_function_opstate *opstate,
+				       struct netlink_ext_ack *extack);
+	/**
+	 * @port_function_state_set: Port function's state set function.
+	 *
+	 * Should be used by device drivers to set the state of a function
+	 * managed by the devlink port. Driver should return -EOPNOTSUPP if it
+	 * doesn't support port function handling for a particular port.
+	 */
+	int (*port_function_state_set)(struct devlink *devlink,
+				       struct devlink_port *port,
+				       enum devlink_port_function_state state,
+				       struct netlink_ext_ack *extack);
 	/**
 	 * @port_new: Port add function.
 	 *
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index 6fe00f10eb3f..beeb30bb6b20 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -583,9 +583,30 @@ enum devlink_resource_unit {
 enum devlink_port_function_attr {
 	DEVLINK_PORT_FUNCTION_ATTR_UNSPEC,
 	DEVLINK_PORT_FUNCTION_ATTR_HW_ADDR,	/* binary */
+	DEVLINK_PORT_FUNCTION_ATTR_STATE,	/* u8 */
+	DEVLINK_PORT_FUNCTION_ATTR_OPSTATE,	/* u8 */
 
 	__DEVLINK_PORT_FUNCTION_ATTR_MAX,
 	DEVLINK_PORT_FUNCTION_ATTR_MAX = __DEVLINK_PORT_FUNCTION_ATTR_MAX - 1
 };
 
+enum devlink_port_function_state {
+	DEVLINK_PORT_FUNCTION_STATE_INACTIVE,
+	DEVLINK_PORT_FUNCTION_STATE_ACTIVE,
+};
+
+/**
+ * enum devlink_port_function_opstate - indicates operational state of port function
+ * @DEVLINK_PORT_FUNCTION_OPSTATE_ATTACHED: Driver is attached to the function of port, for
+ *					    gracefufl tear down of the function, after
+ *					    inactivation of the port function, user should wait
+ *					    for operational state to turn DETACHED.
+ * @DEVLINK_PORT_FUNCTION_OPSTATE_DETACHED: Driver is detached from the function of port; it is
+ *					    safe to delete the port.
+ */
+enum devlink_port_function_opstate {
+	DEVLINK_PORT_FUNCTION_OPSTATE_DETACHED,
+	DEVLINK_PORT_FUNCTION_OPSTATE_ATTACHED,
+};
+
 #endif /* _UAPI_LINUX_DEVLINK_H_ */
diff --git a/net/core/devlink.c b/net/core/devlink.c
index 11043707f63f..b8acb8842aa1 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -87,6 +87,9 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(devlink_trap_report);
 
 static const struct nla_policy devlink_function_nl_policy[DEVLINK_PORT_FUNCTION_ATTR_MAX + 1] = {
 	[DEVLINK_PORT_FUNCTION_ATTR_HW_ADDR] = { .type = NLA_BINARY },
+	[DEVLINK_PORT_FUNCTION_ATTR_STATE] =
+		NLA_POLICY_RANGE(NLA_U8, DEVLINK_PORT_FUNCTION_STATE_INACTIVE,
+				 DEVLINK_PORT_FUNCTION_STATE_ACTIVE),
 };
 
 static LIST_HEAD(devlink_list);
@@ -746,6 +749,57 @@ devlink_port_function_hw_addr_fill(struct devlink *devlink, const struct devlink
 	return 0;
 }
 
+static bool
+devlink_port_function_state_valid(enum devlink_port_function_state state)
+{
+	return state == DEVLINK_PORT_FUNCTION_STATE_INACTIVE ||
+	       state == DEVLINK_PORT_FUNCTION_STATE_ACTIVE;
+}
+
+static bool
+devlink_port_function_opstate_valid(enum devlink_port_function_opstate state)
+{
+	return state == DEVLINK_PORT_FUNCTION_OPSTATE_DETACHED ||
+	       state == DEVLINK_PORT_FUNCTION_OPSTATE_ATTACHED;
+}
+
+static int
+devlink_port_function_state_fill(struct devlink *devlink,
+				 const struct devlink_ops *ops,
+				 struct devlink_port *port, struct sk_buff *msg,
+				 struct netlink_ext_ack *extack,
+				 bool *msg_updated)
+{
+	enum devlink_port_function_opstate opstate;
+	enum devlink_port_function_state state;
+	int err;
+
+	if (!ops->port_function_state_get)
+		return 0;
+
+	err = ops->port_function_state_get(devlink, port, &state, &opstate, extack);
+	if (err) {
+		if (err == -EOPNOTSUPP)
+			return 0;
+		return err;
+	}
+	if (!devlink_port_function_state_valid(state)) {
+		WARN_ON_ONCE(1);
+		NL_SET_ERR_MSG_MOD(extack, "Invalid state value read from driver");
+		return -EINVAL;
+	}
+	if (!devlink_port_function_opstate_valid(opstate)) {
+		WARN_ON_ONCE(1);
+		NL_SET_ERR_MSG_MOD(extack, "Invalid operational state value read from driver");
+		return -EINVAL;
+	}
+	if (nla_put_u8(msg, DEVLINK_PORT_FUNCTION_ATTR_STATE, state) ||
+	    nla_put_u8(msg, DEVLINK_PORT_FUNCTION_ATTR_OPSTATE, opstate))
+		return -EMSGSIZE;
+	*msg_updated = true;
+	return 0;
+}
+
 static int
 devlink_nl_port_function_attrs_put(struct sk_buff *msg, struct devlink_port *port,
 				   struct netlink_ext_ack *extack)
@@ -762,6 +816,13 @@ devlink_nl_port_function_attrs_put(struct sk_buff *msg, struct devlink_port *por
 
 	ops = devlink->ops;
 	err = devlink_port_function_hw_addr_fill(devlink, ops, port, msg, extack, &msg_updated);
+	if (err)
+		goto out;
+	err = devlink_port_function_state_fill(devlink, ops, port, msg, extack,
+					       &msg_updated);
+	if (err)
+		goto out;
+out:
 	if (err || !msg_updated)
 		nla_nest_cancel(msg, function_attr);
 	else
@@ -1027,6 +1088,22 @@ devlink_port_function_hw_addr_set(struct devlink *devlink, struct devlink_port *
 	return ops->port_function_hw_addr_set(devlink, port, hw_addr, hw_addr_len, extack);
 }
 
+static int
+devlink_port_function_state_set(struct devlink *devlink, struct devlink_port *port,
+				const struct nlattr *attr, struct netlink_ext_ack *extack)
+{
+	enum devlink_port_function_state state;
+	const struct devlink_ops *ops;
+
+	state = nla_get_u8(attr);
+	ops = devlink->ops;
+	if (!ops->port_function_state_set) {
+		NL_SET_ERR_MSG_MOD(extack, "Port function does not support state setting");
+		return -EOPNOTSUPP;
+	}
+	return ops->port_function_state_set(devlink, port, state, extack);
+}
+
 static int
 devlink_port_function_set(struct devlink *devlink, struct devlink_port *port,
 			  const struct nlattr *attr, struct netlink_ext_ack *extack)
@@ -1042,8 +1119,19 @@ devlink_port_function_set(struct devlink *devlink, struct devlink_port *port,
 	}
 
 	attr = tb[DEVLINK_PORT_FUNCTION_ATTR_HW_ADDR];
-	if (attr)
+	if (attr) {
 		err = devlink_port_function_hw_addr_set(devlink, port, attr, extack);
+		if (err)
+			return err;
+	}
+	/* Keep this as the last function attribute set, so that when
+	 * multiple port function attributes are set along with state,
+	 * Those can be applied first before activating the state.
+	 */
+	attr = tb[DEVLINK_PORT_FUNCTION_ATTR_STATE];
+	if (attr)
+		err = devlink_port_function_state_set(devlink, port, attr,
+						      extack);
 
 	if (!err)
 		devlink_port_notify(port, DEVLINK_CMD_PORT_NEW);
-- 
2.26.2


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [net-next v5 06/15] net/mlx5: Introduce vhca state event notifier
  2020-12-15  9:03 [net-next v5 00/15] Add mlx5 subfunction support Saeed Mahameed
                   ` (4 preceding siblings ...)
  2020-12-15  9:03 ` [net-next v5 05/15] devlink: Support get and set state of port function Saeed Mahameed
@ 2020-12-15  9:03 ` Saeed Mahameed
  2020-12-15  9:03 ` [net-next v5 07/15] net/mlx5: SF, Add auxiliary device support Saeed Mahameed
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 46+ messages in thread
From: Saeed Mahameed @ 2020-12-15  9:03 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Jason Gunthorpe
  Cc: Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Parav Pandit, Vu Pham, Saeed Mahameed

From: Parav Pandit <parav@nvidia.com>

vhca state events indicates change in the state of the vhca that may
occur due to a SF allocation, deallocation or enabling/disabling the
SF HCA.

Introduce vhca state event handler which will be used by SF devlink
port manager and SF hardware id allocator in subsequent patches
to act on the event.

This enables single entity to subscribe, query and rearm the event
for a function.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/Kconfig   |   9 +
 .../net/ethernet/mellanox/mlx5/core/Makefile  |   4 +
 drivers/net/ethernet/mellanox/mlx5/core/cmd.c |   4 +
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  |   3 +
 .../net/ethernet/mellanox/mlx5/core/events.c  |   7 +
 .../net/ethernet/mellanox/mlx5/core/main.c    |  16 ++
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |   2 +
 .../mlx5/core/sf/mlx5_ifc_vhca_event.h        |  82 ++++++++
 .../net/ethernet/mellanox/mlx5/core/sf/sf.h   |  45 +++++
 .../mellanox/mlx5/core/sf/vhca_event.c        | 189 ++++++++++++++++++
 .../mellanox/mlx5/core/sf/vhca_event.h        |  57 ++++++
 include/linux/mlx5/driver.h                   |   4 +
 12 files changed, 422 insertions(+)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/mlx5_ifc_vhca_event.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/vhca_event.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/vhca_event.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
index 6e4d7bb7fea2..d6c48582e7a8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
@@ -203,3 +203,12 @@ config MLX5_SW_STEERING
 	default y
 	help
 	Build support for software-managed steering in the NIC.
+
+config MLX5_SF
+	bool "Mellanox Technologies subfunction device support using auxiliary device"
+	depends on MLX5_CORE && MLX5_CORE_EN
+	default n
+	help
+	Build support for subfuction device in the NIC. A Mellanox subfunction
+	device can support RDMA, netdevice and vdpa device.
+	It is similar to a SRIOV VF but it doesn't require SRIOV support.
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 77961643d5a9..292c02c4828c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -85,3 +85,7 @@ mlx5_core-$(CONFIG_MLX5_SW_STEERING) += steering/dr_domain.o steering/dr_table.o
 					steering/dr_ste.o steering/dr_send.o \
 					steering/dr_cmd.o steering/dr_fw.o \
 					steering/dr_action.o steering/fs_dr.o
+#
+# SF device
+#
+mlx5_core-$(CONFIG_MLX5_SF) += sf/vhca_event.o
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
index 50c7b9ee80c3..47dcc3ac2cf0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
@@ -464,6 +464,8 @@ static int mlx5_internal_err_ret_value(struct mlx5_core_dev *dev, u16 op,
 	case MLX5_CMD_OP_ALLOC_MEMIC:
 	case MLX5_CMD_OP_MODIFY_XRQ:
 	case MLX5_CMD_OP_RELEASE_XRQ_ERROR:
+	case MLX5_CMD_OP_QUERY_VHCA_STATE:
+	case MLX5_CMD_OP_MODIFY_VHCA_STATE:
 		*status = MLX5_DRIVER_STATUS_ABORTED;
 		*synd = MLX5_DRIVER_SYND;
 		return -EIO;
@@ -657,6 +659,8 @@ const char *mlx5_command_str(int command)
 	MLX5_COMMAND_STR_CASE(DESTROY_UMEM);
 	MLX5_COMMAND_STR_CASE(RELEASE_XRQ_ERROR);
 	MLX5_COMMAND_STR_CASE(MODIFY_XRQ);
+	MLX5_COMMAND_STR_CASE(QUERY_VHCA_STATE);
+	MLX5_COMMAND_STR_CASE(MODIFY_VHCA_STATE);
 	default: return "unknown command opcode";
 	}
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index fc0afa03d407..421febebc658 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -595,6 +595,9 @@ static void gather_async_events_mask(struct mlx5_core_dev *dev, u64 mask[4])
 		async_event_mask |=
 			(1ull << MLX5_EVENT_TYPE_ESW_FUNCTIONS_CHANGED);
 
+	if (MLX5_CAP_GEN_MAX(dev, vhca_state))
+		async_event_mask |= (1ull << MLX5_EVENT_TYPE_VHCA_STATE_CHANGE);
+
 	mask[0] = async_event_mask;
 
 	if (MLX5_CAP_GEN(dev, event_cap))
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/events.c b/drivers/net/ethernet/mellanox/mlx5/core/events.c
index 3ce17c3d7a00..5523d218e5fb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/events.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/events.c
@@ -110,6 +110,8 @@ static const char *eqe_type_str(u8 type)
 		return "MLX5_EVENT_TYPE_CMD";
 	case MLX5_EVENT_TYPE_ESW_FUNCTIONS_CHANGED:
 		return "MLX5_EVENT_TYPE_ESW_FUNCTIONS_CHANGED";
+	case MLX5_EVENT_TYPE_VHCA_STATE_CHANGE:
+		return "MLX5_EVENT_TYPE_VHCA_STATE_CHANGE";
 	case MLX5_EVENT_TYPE_PAGE_REQUEST:
 		return "MLX5_EVENT_TYPE_PAGE_REQUEST";
 	case MLX5_EVENT_TYPE_PAGE_FAULT:
@@ -403,3 +405,8 @@ int mlx5_notifier_call_chain(struct mlx5_events *events, unsigned int event, voi
 {
 	return atomic_notifier_call_chain(&events->nh, event, data);
 }
+
+void mlx5_events_work_enqueue(struct mlx5_core_dev *dev, struct work_struct *work)
+{
+	queue_work(dev->priv.events->wq, work);
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index c08315b51fd3..6e67ad11c713 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -73,6 +73,7 @@
 #include "ecpf.h"
 #include "lib/hv_vhca.h"
 #include "diag/rsc_dump.h"
+#include "sf/vhca_event.h"
 
 MODULE_AUTHOR("Eli Cohen <eli@mellanox.com>");
 MODULE_DESCRIPTION("Mellanox 5th generation network adapters (ConnectX series) core driver");
@@ -567,6 +568,8 @@ static int handle_hca_cap(struct mlx5_core_dev *dev, void *set_ctx)
 	if (MLX5_CAP_GEN_MAX(dev, mkey_by_name))
 		MLX5_SET(cmd_hca_cap, set_hca_cap, mkey_by_name, 1);
 
+	mlx5_vhca_state_cap_handle(dev, set_hca_cap);
+
 	return set_caps(dev, set_ctx, MLX5_SET_HCA_CAP_OP_MOD_GENERAL_DEVICE);
 }
 
@@ -884,6 +887,12 @@ static int mlx5_init_once(struct mlx5_core_dev *dev)
 		goto err_eswitch_cleanup;
 	}
 
+	err = mlx5_vhca_event_init(dev);
+	if (err) {
+		mlx5_core_err(dev, "Failed to init vhca event notifier %d\n", err);
+		goto err_fpga_cleanup;
+	}
+
 	dev->dm = mlx5_dm_create(dev);
 	if (IS_ERR(dev->dm))
 		mlx5_core_warn(dev, "Failed to init device memory%d\n", err);
@@ -894,6 +903,8 @@ static int mlx5_init_once(struct mlx5_core_dev *dev)
 
 	return 0;
 
+err_fpga_cleanup:
+	mlx5_fpga_cleanup(dev);
 err_eswitch_cleanup:
 	mlx5_eswitch_cleanup(dev->priv.eswitch);
 err_sriov_cleanup:
@@ -925,6 +936,7 @@ static void mlx5_cleanup_once(struct mlx5_core_dev *dev)
 	mlx5_hv_vhca_destroy(dev->hv_vhca);
 	mlx5_fw_tracer_destroy(dev->tracer);
 	mlx5_dm_cleanup(dev);
+	mlx5_vhca_event_cleanup(dev);
 	mlx5_fpga_cleanup(dev);
 	mlx5_eswitch_cleanup(dev->priv.eswitch);
 	mlx5_sriov_cleanup(dev);
@@ -1129,6 +1141,8 @@ static int mlx5_load(struct mlx5_core_dev *dev)
 		goto err_sriov;
 	}
 
+	mlx5_vhca_event_start(dev);
+
 	err = mlx5_ec_init(dev);
 	if (err) {
 		mlx5_core_err(dev, "Failed to init embedded CPU\n");
@@ -1146,6 +1160,7 @@ static int mlx5_load(struct mlx5_core_dev *dev)
 err_sriov:
 	mlx5_ec_cleanup(dev);
 err_ec:
+	mlx5_vhca_event_stop(dev);
 	mlx5_cleanup_fs(dev);
 err_fs:
 	mlx5_accel_tls_cleanup(dev);
@@ -1173,6 +1188,7 @@ static void mlx5_unload(struct mlx5_core_dev *dev)
 {
 	mlx5_sriov_detach(dev);
 	mlx5_ec_cleanup(dev);
+	mlx5_vhca_event_stop(dev);
 	mlx5_cleanup_fs(dev);
 	mlx5_accel_ipsec_cleanup(dev);
 	mlx5_accel_tls_cleanup(dev);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index 0a0302ce7144..a33b7496d748 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -259,4 +259,6 @@ void mlx5_set_nic_state(struct mlx5_core_dev *dev, u8 state);
 
 void mlx5_unload_one(struct mlx5_core_dev *dev, bool cleanup);
 int mlx5_load_one(struct mlx5_core_dev *dev, bool boot);
+
+void mlx5_events_work_enqueue(struct mlx5_core_dev *dev, struct work_struct *work);
 #endif /* __MLX5_CORE_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/mlx5_ifc_vhca_event.h b/drivers/net/ethernet/mellanox/mlx5/core/sf/mlx5_ifc_vhca_event.h
new file mode 100644
index 000000000000..1daf5a122ba3
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/mlx5_ifc_vhca_event.h
@@ -0,0 +1,82 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/* Copyright (c) 2020 Mellanox Technologies Ltd */
+
+#ifndef __MLX5_IFC_VHCA_EVENT_H__
+#define __MLX5_IFC_VHCA_EVENT_H__
+
+enum mlx5_ifc_vhca_state {
+	MLX5_VHCA_STATE_INVALID = 0x0,
+	MLX5_VHCA_STATE_ALLOCATED = 0x1,
+	MLX5_VHCA_STATE_ACTIVE = 0x2,
+	MLX5_VHCA_STATE_IN_USE = 0x3,
+	MLX5_VHCA_STATE_TEARDOWN_REQUEST = 0x4,
+};
+
+struct mlx5_ifc_vhca_state_context_bits {
+	u8         arm_change_event[0x1];
+	u8         reserved_at_1[0xb];
+	u8         vhca_state[0x4];
+	u8         reserved_at_10[0x10];
+
+	u8         sw_function_id[0x20];
+
+	u8         reserved_at_40[0x80];
+};
+
+struct mlx5_ifc_query_vhca_state_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x40];
+
+	struct mlx5_ifc_vhca_state_context_bits vhca_state_context;
+};
+
+struct mlx5_ifc_query_vhca_state_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         embedded_cpu_function[0x1];
+	u8         reserved_at_41[0xf];
+	u8         function_id[0x10];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_vhca_state_field_select_bits {
+	u8         reserved_at_0[0x1e];
+	u8         sw_function_id[0x1];
+	u8         arm_change_event[0x1];
+};
+
+struct mlx5_ifc_modify_vhca_state_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x40];
+};
+
+struct mlx5_ifc_modify_vhca_state_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         embedded_cpu_function[0x1];
+	u8         reserved_at_41[0xf];
+	u8         function_id[0x10];
+
+	struct mlx5_ifc_vhca_state_field_select_bits vhca_state_field_select;
+
+	struct mlx5_ifc_vhca_state_context_bits vhca_state_context;
+};
+
+#endif
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h b/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h
new file mode 100644
index 000000000000..623191679b49
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h
@@ -0,0 +1,45 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/* Copyright (c) 2020 Mellanox Technologies Ltd */
+
+#ifndef __MLX5_SF_H__
+#define __MLX5_SF_H__
+
+#include <linux/mlx5/driver.h>
+
+static inline u16 mlx5_sf_start_function_id(const struct mlx5_core_dev *dev)
+{
+	return MLX5_CAP_GEN(dev, sf_base_id);
+}
+
+#ifdef CONFIG_MLX5_SF
+
+static inline bool mlx5_sf_supported(const struct mlx5_core_dev *dev)
+{
+	return MLX5_CAP_GEN(dev, sf);
+}
+
+static inline u16 mlx5_sf_max_functions(const struct mlx5_core_dev *dev)
+{
+	if (!mlx5_sf_supported(dev))
+		return 0;
+	if (MLX5_CAP_GEN(dev, max_num_sf))
+		return MLX5_CAP_GEN(dev, max_num_sf);
+	else
+		return 1 << MLX5_CAP_GEN(dev, log_max_sf);
+}
+
+#else
+
+static inline bool mlx5_sf_supported(const struct mlx5_core_dev *dev)
+{
+	return false;
+}
+
+static inline u16 mlx5_sf_max_functions(const struct mlx5_core_dev *dev)
+{
+	return 0;
+}
+
+#endif
+
+#endif
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/vhca_event.c b/drivers/net/ethernet/mellanox/mlx5/core/sf/vhca_event.c
new file mode 100644
index 000000000000..af2f2dd9db25
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/vhca_event.c
@@ -0,0 +1,189 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/* Copyright (c) 2020 Mellanox Technologies Ltd */
+
+#include <linux/mlx5/driver.h>
+#include "mlx5_ifc_vhca_event.h"
+#include "mlx5_core.h"
+#include "vhca_event.h"
+#include "ecpf.h"
+
+struct mlx5_vhca_state_notifier {
+	struct mlx5_core_dev *dev;
+	struct mlx5_nb nb;
+	struct blocking_notifier_head n_head;
+};
+
+struct mlx5_vhca_event_work {
+	struct work_struct work;
+	struct mlx5_vhca_state_notifier *notifier;
+	struct mlx5_vhca_state_event event;
+};
+
+int mlx5_cmd_query_vhca_state(struct mlx5_core_dev *dev, u16 function_id,
+			      bool ecpu, u32 *out, u32 outlen)
+{
+	u32 in[MLX5_ST_SZ_DW(query_vhca_state_in)] = {};
+
+	MLX5_SET(query_vhca_state_in, in, opcode, MLX5_CMD_OP_QUERY_VHCA_STATE);
+	MLX5_SET(query_vhca_state_in, in, function_id, function_id);
+	MLX5_SET(query_vhca_state_in, in, embedded_cpu_function, ecpu);
+
+	return mlx5_cmd_exec(dev, in, sizeof(in), out, outlen);
+}
+
+static int mlx5_cmd_modify_vhca_state(struct mlx5_core_dev *dev, u16 function_id,
+				      bool ecpu, u32 *in, u32 inlen)
+{
+	u32 out[MLX5_ST_SZ_DW(modify_vhca_state_out)] = {};
+
+	MLX5_SET(modify_vhca_state_in, in, opcode, MLX5_CMD_OP_MODIFY_VHCA_STATE);
+	MLX5_SET(modify_vhca_state_in, in, function_id, function_id);
+	MLX5_SET(modify_vhca_state_in, in, embedded_cpu_function, ecpu);
+
+	return mlx5_cmd_exec(dev, in, inlen, out, sizeof(out));
+}
+
+int mlx5_modify_vhca_sw_id(struct mlx5_core_dev *dev, u16 function_id, bool ecpu, u32 sw_fn_id)
+{
+	u32 out[MLX5_ST_SZ_DW(modify_vhca_state_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(modify_vhca_state_in)] = {};
+
+	MLX5_SET(modify_vhca_state_in, in, opcode, MLX5_CMD_OP_MODIFY_VHCA_STATE);
+	MLX5_SET(modify_vhca_state_in, in, function_id, function_id);
+	MLX5_SET(modify_vhca_state_in, in, embedded_cpu_function, ecpu);
+	MLX5_SET(modify_vhca_state_in, in, vhca_state_field_select.sw_function_id, 1);
+	MLX5_SET(modify_vhca_state_in, in, vhca_state_context.sw_function_id, sw_fn_id);
+
+	return mlx5_cmd_exec_inout(dev, modify_vhca_state, in, out);
+}
+
+int mlx5_vhca_event_arm(struct mlx5_core_dev *dev, u16 function_id, bool ecpu)
+{
+	u32 in[MLX5_ST_SZ_DW(modify_vhca_state_in)] = {};
+
+	MLX5_SET(modify_vhca_state_in, in, vhca_state_context.arm_change_event, 1);
+	MLX5_SET(modify_vhca_state_in, in, vhca_state_field_select.arm_change_event, 1);
+
+	return mlx5_cmd_modify_vhca_state(dev, function_id, ecpu, in, sizeof(in));
+}
+
+static void
+mlx5_vhca_event_notify(struct mlx5_core_dev *dev, struct mlx5_vhca_state_event *event)
+{
+	u32 out[MLX5_ST_SZ_DW(query_vhca_state_out)] = {};
+	int err;
+
+	err = mlx5_cmd_query_vhca_state(dev, event->function_id, event->ecpu, out, sizeof(out));
+	if (err)
+		return;
+
+	event->sw_function_id = MLX5_GET(query_vhca_state_out, out,
+					 vhca_state_context.sw_function_id);
+	event->new_vhca_state = MLX5_GET(query_vhca_state_out, out,
+					 vhca_state_context.vhca_state);
+
+	mlx5_vhca_event_arm(dev, event->function_id, event->ecpu);
+
+	blocking_notifier_call_chain(&dev->priv.vhca_state_notifier->n_head, 0, event);
+}
+
+static void mlx5_vhca_state_work_handler(struct work_struct *_work)
+{
+	struct mlx5_vhca_event_work *work = container_of(_work, struct mlx5_vhca_event_work, work);
+	struct mlx5_vhca_state_notifier *notifier = work->notifier;
+	struct mlx5_core_dev *dev = notifier->dev;
+
+	mlx5_vhca_event_notify(dev, &work->event);
+}
+
+static int
+mlx5_vhca_state_change_notifier(struct notifier_block *nb, unsigned long type, void *data)
+{
+	struct mlx5_vhca_state_notifier *notifier =
+				mlx5_nb_cof(nb, struct mlx5_vhca_state_notifier, nb);
+	struct mlx5_vhca_event_work *work;
+	struct mlx5_eqe *eqe = data;
+
+	work = kzalloc(sizeof(*work), GFP_ATOMIC);
+	if (!work)
+		return NOTIFY_DONE;
+	INIT_WORK(&work->work, &mlx5_vhca_state_work_handler);
+	work->notifier = notifier;
+	work->event.function_id = be16_to_cpu(eqe->data.vhca_state.function_id);
+	work->event.ecpu = be16_to_cpu(eqe->data.vhca_state.ec_function);
+	mlx5_events_work_enqueue(notifier->dev, &work->work);
+	return NOTIFY_OK;
+}
+
+void mlx5_vhca_state_cap_handle(struct mlx5_core_dev *dev, void *set_hca_cap)
+{
+	if (!mlx5_vhca_event_supported(dev))
+		return;
+
+	MLX5_SET(cmd_hca_cap, set_hca_cap, vhca_state, 1);
+	MLX5_SET(cmd_hca_cap, set_hca_cap, event_on_vhca_state_allocated, 1);
+	MLX5_SET(cmd_hca_cap, set_hca_cap, event_on_vhca_state_active, 1);
+	MLX5_SET(cmd_hca_cap, set_hca_cap, event_on_vhca_state_in_use, 1);
+	MLX5_SET(cmd_hca_cap, set_hca_cap, event_on_vhca_state_teardown_request, 1);
+}
+
+int mlx5_vhca_event_init(struct mlx5_core_dev *dev)
+{
+	struct mlx5_vhca_state_notifier *notifier;
+
+	if (!mlx5_vhca_event_supported(dev))
+		return 0;
+
+	notifier = kzalloc(sizeof(*notifier), GFP_KERNEL);
+	if (!notifier)
+		return -ENOMEM;
+
+	dev->priv.vhca_state_notifier = notifier;
+	notifier->dev = dev;
+	BLOCKING_INIT_NOTIFIER_HEAD(&notifier->n_head);
+	MLX5_NB_INIT(&notifier->nb, mlx5_vhca_state_change_notifier, VHCA_STATE_CHANGE);
+	return 0;
+}
+
+void mlx5_vhca_event_cleanup(struct mlx5_core_dev *dev)
+{
+	if (!mlx5_vhca_event_supported(dev))
+		return;
+
+	kfree(dev->priv.vhca_state_notifier);
+	dev->priv.vhca_state_notifier = NULL;
+}
+
+void mlx5_vhca_event_start(struct mlx5_core_dev *dev)
+{
+	struct mlx5_vhca_state_notifier *notifier;
+
+	if (!dev->priv.vhca_state_notifier)
+		return;
+
+	notifier = dev->priv.vhca_state_notifier;
+	mlx5_eq_notifier_register(dev, &notifier->nb);
+}
+
+void mlx5_vhca_event_stop(struct mlx5_core_dev *dev)
+{
+	struct mlx5_vhca_state_notifier *notifier;
+
+	if (!dev->priv.vhca_state_notifier)
+		return;
+
+	notifier = dev->priv.vhca_state_notifier;
+	mlx5_eq_notifier_unregister(dev, &notifier->nb);
+}
+
+int mlx5_vhca_event_notifier_register(struct mlx5_core_dev *dev, struct notifier_block *nb)
+{
+	if (!dev->priv.vhca_state_notifier)
+		return -EOPNOTSUPP;
+	return blocking_notifier_chain_register(&dev->priv.vhca_state_notifier->n_head, nb);
+}
+
+void mlx5_vhca_event_notifier_unregister(struct mlx5_core_dev *dev, struct notifier_block *nb)
+{
+	blocking_notifier_chain_unregister(&dev->priv.vhca_state_notifier->n_head, nb);
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/vhca_event.h b/drivers/net/ethernet/mellanox/mlx5/core/sf/vhca_event.h
new file mode 100644
index 000000000000..1fe1ec6f4d4b
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/vhca_event.h
@@ -0,0 +1,57 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/* Copyright (c) 2020 Mellanox Technologies Ltd */
+
+#ifndef __MLX5_VHCA_EVENT_H__
+#define __MLX5_VHCA_EVENT_H__
+
+#ifdef CONFIG_MLX5_SF
+
+struct mlx5_vhca_state_event {
+	u16 function_id;
+	u16 sw_function_id;
+	u8 new_vhca_state;
+	bool ecpu;
+};
+
+static inline bool mlx5_vhca_event_supported(const struct mlx5_core_dev *dev)
+{
+	return MLX5_CAP_GEN_MAX(dev, vhca_state);
+}
+
+void mlx5_vhca_state_cap_handle(struct mlx5_core_dev *dev, void *set_hca_cap);
+int mlx5_vhca_event_init(struct mlx5_core_dev *dev);
+void mlx5_vhca_event_cleanup(struct mlx5_core_dev *dev);
+void mlx5_vhca_event_start(struct mlx5_core_dev *dev);
+void mlx5_vhca_event_stop(struct mlx5_core_dev *dev);
+int mlx5_vhca_event_notifier_register(struct mlx5_core_dev *dev, struct notifier_block *nb);
+void mlx5_vhca_event_notifier_unregister(struct mlx5_core_dev *dev, struct notifier_block *nb);
+int mlx5_modify_vhca_sw_id(struct mlx5_core_dev *dev, u16 function_id, bool ecpu, u32 sw_fn_id);
+int mlx5_vhca_event_arm(struct mlx5_core_dev *dev, u16 function_id, bool ecpu);
+int mlx5_cmd_query_vhca_state(struct mlx5_core_dev *dev, u16 function_id,
+			      bool ecpu, u32 *out, u32 outlen);
+#else
+
+static inline void mlx5_vhca_state_cap_handle(struct mlx5_core_dev *dev, void *set_hca_cap)
+{
+}
+
+static inline int mlx5_vhca_event_init(struct mlx5_core_dev *dev)
+{
+	return 0;
+}
+
+static inline void mlx5_vhca_event_cleanup(struct mlx5_core_dev *dev)
+{
+}
+
+static inline void mlx5_vhca_event_start(struct mlx5_core_dev *dev)
+{
+}
+
+static inline void mlx5_vhca_event_stop(struct mlx5_core_dev *dev)
+{
+}
+
+#endif
+
+#endif
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index f93bfe7473aa..ffba0786051e 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -507,6 +507,7 @@ struct mlx5_devcom;
 struct mlx5_fw_reset;
 struct mlx5_eq_table;
 struct mlx5_irq_table;
+struct mlx5_vhca_state_notifier;
 
 struct mlx5_rate_limit {
 	u32			rate;
@@ -603,6 +604,9 @@ struct mlx5_priv {
 
 	struct mlx5_bfreg_data		bfregs;
 	struct mlx5_uars_page	       *uar;
+#ifdef CONFIG_MLX5_SF
+	struct mlx5_vhca_state_notifier *vhca_state_notifier;
+#endif
 };
 
 enum mlx5_device_state {
-- 
2.26.2


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [net-next v5 07/15] net/mlx5: SF, Add auxiliary device support
  2020-12-15  9:03 [net-next v5 00/15] Add mlx5 subfunction support Saeed Mahameed
                   ` (5 preceding siblings ...)
  2020-12-15  9:03 ` [net-next v5 06/15] net/mlx5: Introduce vhca state event notifier Saeed Mahameed
@ 2020-12-15  9:03 ` Saeed Mahameed
  2020-12-16  0:43   ` Jakub Kicinski
  2020-12-15  9:03 ` [net-next v5 08/15] net/mlx5: SF, Add auxiliary device driver Saeed Mahameed
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 46+ messages in thread
From: Saeed Mahameed @ 2020-12-15  9:03 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Jason Gunthorpe
  Cc: Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Parav Pandit, Vu Pham, Saeed Mahameed

From: Parav Pandit <parav@nvidia.com>

Introduce API to add and delete an auxiliary device for an SF.
Each SF has its own dedicated window in the PCI BAR 2.

SF device is similar to PCI PF and VF that supports multiple class of
devices such as net, rdma and vdpa.

SF device will be added or removed in subsequent patch during SF
devlink port function state change command.

A subfunction device exposes user supplied subfunction number which will
be further used by systemd/udev to have deterministic name for its
netdevice and rdma device.

An mlx5 subfunction auxiliary device example:

$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev

$ devlink port show
pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 splittable false

$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88

$ devlink port show ens2f0npf0sf88
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
  function:
    hw_addr 00:00:00:00:88:88 state inactive opstate detached

$ devlink port function set ens2f0npf0sf88 hw_addr 00:00:00:00:88:88 state active

On activation,

$ ls -l /sys/bus/auxiliary/devices/
mlx5_core.sf.4 -> ../../../devices/pci0000:00/0000:00:03.0/0000:06:00.0/mlx5_core.sf.4

$ cat /sys/bus/auxiliary/devices/mlx5_core.sf.4/sfnum
88

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../device_drivers/ethernet/mellanox/mlx5.rst |   5 +
 .../net/ethernet/mellanox/mlx5/core/Makefile  |   2 +-
 .../net/ethernet/mellanox/mlx5/core/main.c    |   4 +
 .../ethernet/mellanox/mlx5/core/sf/dev/dev.c  | 261 ++++++++++++++++++
 .../ethernet/mellanox/mlx5/core/sf/dev/dev.h  |  35 +++
 include/linux/mlx5/driver.h                   |   2 +
 6 files changed, 308 insertions(+), 1 deletion(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.h

diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst
index e9b65035cd47..a5eb22793bb9 100644
--- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst
@@ -97,6 +97,11 @@ Enabling the driver and kconfig options
 
 |   Provides low-level InfiniBand/RDMA and `RoCE <https://community.mellanox.com/s/article/recommended-network-configuration-examples-for-roce-deployment>`_ support.
 
+**CONFIG_MLX5_SF=(y/n)**
+
+|   Build support for subfunction.
+|   Subfunctons are more light weight than PCI SRIOV VFs. Choosing this option
+|   will enable support for creating subfunction devices.
 
 **External options** ( Choose if the corresponding mlx5 feature is required )
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 292c02c4828c..2aefbca404c3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -88,4 +88,4 @@ mlx5_core-$(CONFIG_MLX5_SW_STEERING) += steering/dr_domain.o steering/dr_table.o
 #
 # SF device
 #
-mlx5_core-$(CONFIG_MLX5_SF) += sf/vhca_event.o
+mlx5_core-$(CONFIG_MLX5_SF) += sf/vhca_event.o sf/dev/dev.o
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 6e67ad11c713..292c30e71d7f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -74,6 +74,7 @@
 #include "lib/hv_vhca.h"
 #include "diag/rsc_dump.h"
 #include "sf/vhca_event.h"
+#include "sf/dev/dev.h"
 
 MODULE_AUTHOR("Eli Cohen <eli@mellanox.com>");
 MODULE_DESCRIPTION("Mellanox 5th generation network adapters (ConnectX series) core driver");
@@ -1155,6 +1156,8 @@ static int mlx5_load(struct mlx5_core_dev *dev)
 		goto err_sriov;
 	}
 
+	mlx5_sf_dev_table_create(dev);
+
 	return 0;
 
 err_sriov:
@@ -1186,6 +1189,7 @@ static int mlx5_load(struct mlx5_core_dev *dev)
 
 static void mlx5_unload(struct mlx5_core_dev *dev)
 {
+	mlx5_sf_dev_table_destroy(dev);
 	mlx5_sriov_detach(dev);
 	mlx5_ec_cleanup(dev);
 	mlx5_vhca_event_stop(dev);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.c b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.c
new file mode 100644
index 000000000000..6562bf63afaa
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.c
@@ -0,0 +1,261 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/* Copyright (c) 2020 Mellanox Technologies Ltd */
+
+#include <linux/mlx5/driver.h>
+#include <linux/mlx5/device.h>
+#include "mlx5_core.h"
+#include "dev.h"
+#include "sf/vhca_event.h"
+#include "sf/sf.h"
+#include "sf/mlx5_ifc_vhca_event.h"
+#include "ecpf.h"
+
+struct mlx5_sf_dev_table {
+	struct xarray devices;
+	unsigned int max_sfs;
+	phys_addr_t base_address;
+	u64 sf_bar_length;
+	struct notifier_block nb;
+	struct mlx5_core_dev *dev;
+};
+
+static bool mlx5_sf_dev_supported(const struct mlx5_core_dev *dev)
+{
+	return MLX5_CAP_GEN(dev, sf) && mlx5_vhca_event_supported(dev);
+}
+
+static ssize_t sfnum_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	struct auxiliary_device *adev = container_of(dev, struct auxiliary_device, dev);
+	struct mlx5_sf_dev *sf_dev = container_of(adev, struct mlx5_sf_dev, adev);
+
+	return scnprintf(buf, PAGE_SIZE, "%u\n", sf_dev->sfnum);
+}
+static DEVICE_ATTR_RO(sfnum);
+
+static struct attribute *sf_device_attrs[] = {
+	&dev_attr_sfnum.attr,
+	NULL,
+};
+
+static const struct attribute_group sf_attr_group = {
+	.attrs = sf_device_attrs,
+};
+
+static const struct attribute_group *sf_attr_groups[2] = {
+	&sf_attr_group,
+	NULL
+};
+
+static void mlx5_sf_dev_release(struct device *device)
+{
+	struct auxiliary_device *adev = container_of(device, struct auxiliary_device, dev);
+	struct mlx5_sf_dev *sf_dev = container_of(adev, struct mlx5_sf_dev, adev);
+
+	mlx5_adev_idx_free(adev->id);
+	kfree(sf_dev);
+}
+
+static void mlx5_sf_dev_remove(struct mlx5_sf_dev *sf_dev)
+{
+	auxiliary_device_delete(&sf_dev->adev);
+	auxiliary_device_uninit(&sf_dev->adev);
+}
+
+static void mlx5_sf_dev_add(struct mlx5_core_dev *dev, u16 sf_index, u32 sfnum)
+{
+	struct mlx5_sf_dev_table *table = dev->priv.sf_dev_table;
+	struct mlx5_sf_dev *sf_dev;
+	struct pci_dev *pdev;
+	int err;
+	int id;
+
+	id = mlx5_adev_idx_alloc();
+	if (id < 0) {
+		err = id;
+		goto add_err;
+	}
+
+	sf_dev = kzalloc(sizeof(*sf_dev), GFP_KERNEL);
+	if (!sf_dev) {
+		mlx5_adev_idx_free(id);
+		err = -ENOMEM;
+		goto add_err;
+	}
+	pdev = dev->pdev;
+	sf_dev->adev.id = id;
+	sf_dev->adev.name = MLX5_SF_DEV_ID_NAME;
+	sf_dev->adev.dev.release = mlx5_sf_dev_release;
+	sf_dev->adev.dev.parent = &pdev->dev;
+	sf_dev->adev.dev.groups = sf_attr_groups;
+	sf_dev->sfnum = sfnum;
+	sf_dev->parent_mdev = dev;
+
+	if (!table->max_sfs) {
+		mlx5_adev_idx_free(id);
+		kfree(sf_dev);
+		err = -EOPNOTSUPP;
+		goto add_err;
+	}
+	sf_dev->bar_base_addr = table->base_address + (sf_index * table->sf_bar_length);
+
+	err = auxiliary_device_init(&sf_dev->adev);
+	if (err) {
+		mlx5_adev_idx_free(id);
+		kfree(sf_dev);
+		goto add_err;
+	}
+
+	err = auxiliary_device_add(&sf_dev->adev);
+	if (err) {
+		put_device(&sf_dev->adev.dev);
+		goto add_err;
+	}
+
+	err = xa_insert(&table->devices, sf_index, sf_dev, GFP_KERNEL);
+	if (err)
+		goto xa_err;
+	return;
+
+xa_err:
+	mlx5_sf_dev_remove(sf_dev);
+add_err:
+	mlx5_core_err(dev, "SF DEV: fail device add for index=%d sfnum=%d err=%d\n",
+		      sf_index, sfnum, err);
+}
+
+static void mlx5_sf_dev_del(struct mlx5_core_dev *dev, struct mlx5_sf_dev *sf_dev, u16 sf_index)
+{
+	struct mlx5_sf_dev_table *table = dev->priv.sf_dev_table;
+
+	xa_erase(&table->devices, sf_index);
+	mlx5_sf_dev_remove(sf_dev);
+}
+
+static int
+mlx5_sf_dev_state_change_handler(struct notifier_block *nb, unsigned long event_code, void *data)
+{
+	struct mlx5_sf_dev_table *table = container_of(nb, struct mlx5_sf_dev_table, nb);
+	const struct mlx5_vhca_state_event *event = data;
+	struct mlx5_sf_dev *sf_dev;
+	u16 sf_index;
+
+	sf_index = event->function_id - MLX5_CAP_GEN(table->dev, sf_base_id);
+	sf_dev = xa_load(&table->devices, sf_index);
+	switch (event->new_vhca_state) {
+	case MLX5_VHCA_STATE_TEARDOWN_REQUEST:
+		if (sf_dev)
+			mlx5_sf_dev_del(table->dev, sf_dev, sf_index);
+		else
+			mlx5_core_err(table->dev,
+				      "SF DEV: teardown state for invalid dev index=%d fn_id=0x%x\n",
+				      sf_index, event->sw_function_id);
+		break;
+	case MLX5_VHCA_STATE_ACTIVE:
+		if (!sf_dev)
+			mlx5_sf_dev_add(table->dev, sf_index, event->sw_function_id);
+		break;
+	default:
+		break;
+	}
+	return 0;
+}
+
+static int mlx5_sf_dev_vhca_arm_all(struct mlx5_sf_dev_table *table)
+{
+	struct mlx5_core_dev *dev = table->dev;
+	u16 max_functions;
+	u16 function_id;
+	int err = 0;
+	bool ecpu;
+	int i;
+
+	max_functions = mlx5_sf_max_functions(dev);
+	function_id = MLX5_CAP_GEN(dev, sf_base_id);
+	ecpu = mlx5_read_embedded_cpu(dev);
+	/* Arm the vhca context as the vhca event notifier */
+	for (i = 0; i < max_functions; i++) {
+		err = mlx5_vhca_event_arm(dev, function_id, ecpu);
+		if (err)
+			return err;
+
+		function_id++;
+	}
+	return 0;
+}
+
+void mlx5_sf_dev_table_create(struct mlx5_core_dev *dev)
+{
+	struct mlx5_sf_dev_table *table;
+	unsigned int max_sfs;
+	int err;
+
+	if (!mlx5_sf_dev_supported(dev) || !mlx5_vhca_event_supported(dev))
+		return;
+
+	table = kzalloc(sizeof(*table), GFP_KERNEL);
+	if (!table) {
+		err = -ENOMEM;
+		goto table_err;
+	}
+
+	table->nb.notifier_call = mlx5_sf_dev_state_change_handler;
+	table->dev = dev;
+	if (MLX5_CAP_GEN(dev, max_num_sf))
+		max_sfs = MLX5_CAP_GEN(dev, max_num_sf);
+	else
+		max_sfs = 1 << MLX5_CAP_GEN(dev, log_max_sf);
+	table->sf_bar_length = 1 << (MLX5_CAP_GEN(dev, log_min_sf_size) + 12);
+	table->base_address = pci_resource_start(dev->pdev, 2);
+	table->max_sfs = max_sfs;
+	xa_init(&table->devices);
+	dev->priv.sf_dev_table = table;
+
+	err = mlx5_vhca_event_notifier_register(dev, &table->nb);
+	if (err)
+		goto vhca_err;
+	err = mlx5_sf_dev_vhca_arm_all(table);
+	if (err)
+		goto arm_err;
+	mlx5_core_dbg(dev, "SF DEV: max sf devices=%d\n", max_sfs);
+	return;
+
+arm_err:
+	mlx5_vhca_event_notifier_unregister(dev, &table->nb);
+vhca_err:
+	table->max_sfs = 0;
+	kfree(table);
+	dev->priv.sf_dev_table = NULL;
+table_err:
+	mlx5_core_err(dev, "SF DEV table create err = %d\n", err);
+}
+
+static void mlx5_sf_dev_destroy_all(struct mlx5_sf_dev_table *table)
+{
+	struct mlx5_sf_dev *sf_dev;
+	unsigned long index;
+
+	xa_for_each(&table->devices, index, sf_dev) {
+		xa_erase(&table->devices, index);
+		mlx5_sf_dev_remove(sf_dev);
+	}
+}
+
+void mlx5_sf_dev_table_destroy(struct mlx5_core_dev *dev)
+{
+	struct mlx5_sf_dev_table *table = dev->priv.sf_dev_table;
+
+	if (!table)
+		return;
+
+	mlx5_vhca_event_notifier_unregister(dev, &table->nb);
+
+	/* Now that event handler is not running, it is safe to destroy
+	 * the sf device without race.
+	 */
+	mlx5_sf_dev_destroy_all(table);
+
+	WARN_ON(!xa_empty(&table->devices));
+	kfree(table);
+	dev->priv.sf_dev_table = NULL;
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.h b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.h
new file mode 100644
index 000000000000..a6fb7289ba2c
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/* Copyright (c) 2020 Mellanox Technologies Ltd */
+
+#ifndef __MLX5_SF_DEV_H__
+#define __MLX5_SF_DEV_H__
+
+#ifdef CONFIG_MLX5_SF
+
+#include <linux/auxiliary_bus.h>
+
+#define MLX5_SF_DEV_ID_NAME "sf"
+
+struct mlx5_sf_dev {
+	struct auxiliary_device adev;
+	struct mlx5_core_dev *parent_mdev;
+	phys_addr_t bar_base_addr;
+	u32 sfnum;
+};
+
+void mlx5_sf_dev_table_create(struct mlx5_core_dev *dev);
+void mlx5_sf_dev_table_destroy(struct mlx5_core_dev *dev);
+
+#else
+
+static inline void mlx5_sf_dev_table_create(struct mlx5_core_dev *dev)
+{
+}
+
+static inline void mlx5_sf_dev_table_destroy(struct mlx5_core_dev *dev)
+{
+}
+
+#endif
+
+#endif
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index ffba0786051e..08e5fbe97df0 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -508,6 +508,7 @@ struct mlx5_fw_reset;
 struct mlx5_eq_table;
 struct mlx5_irq_table;
 struct mlx5_vhca_state_notifier;
+struct mlx5_sf_dev_table;
 
 struct mlx5_rate_limit {
 	u32			rate;
@@ -606,6 +607,7 @@ struct mlx5_priv {
 	struct mlx5_uars_page	       *uar;
 #ifdef CONFIG_MLX5_SF
 	struct mlx5_vhca_state_notifier *vhca_state_notifier;
+	struct mlx5_sf_dev_table *sf_dev_table;
 #endif
 };
 
-- 
2.26.2


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [net-next v5 08/15] net/mlx5: SF, Add auxiliary device driver
  2020-12-15  9:03 [net-next v5 00/15] Add mlx5 subfunction support Saeed Mahameed
                   ` (6 preceding siblings ...)
  2020-12-15  9:03 ` [net-next v5 07/15] net/mlx5: SF, Add auxiliary device support Saeed Mahameed
@ 2020-12-15  9:03 ` Saeed Mahameed
  2020-12-15  9:03 ` [net-next v5 09/15] net/mlx5: E-switch, Prepare eswitch to handle SF vport Saeed Mahameed
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 46+ messages in thread
From: Saeed Mahameed @ 2020-12-15  9:03 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Jason Gunthorpe
  Cc: Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Parav Pandit, Vu Pham, Saeed Mahameed

From: Parav Pandit <parav@nvidia.com>

Add auxiliary device driver for mlx5 subfunction auxiliary device.

A mlx5 subfunction is similar to PCI PF and VF. For a subfunction
an auxiliary device is created.

As a result, when mlx5 SF auxiliary device binds to the driver,
its netdev and rdma device are created, they appear as

$ ls -l /sys/bus/auxiliary/devices/
mlx5_core.sf.4 -> ../../../devices/pci0000:00/0000:00:03.0/0000:06:00.0/mlx5_core.sf.4

$ ls -l /sys/class/net/eth1/device
/sys/class/net/eth1/device -> ../../../mlx5_core.sf.4

$ cat /sys/bus/auxiliary/devices/mlx5_core.sf.4/sfnum
88

$ devlink dev show
pci/0000:06:00.0
auxiliary/mlx5_core.sf.4

$ devlink port show auxiliary/mlx5_core.sf.4/1
auxiliary/mlx5_core.sf.4/1: type eth netdev p0sf88 flavour virtual port 0 splittable false

$ rdma link show mlx5_0/1
link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev p0sf88

$ rdma dev show
8: rocep6s0f1: node_type ca fw 16.29.0550 node_guid 248a:0703:00b3:d113 sys_image_guid 248a:0703:00b3:d112
13: mlx5_0: node_type ca fw 16.29.0550 node_guid 0000:00ff:fe00:8888 sys_image_guid 248a:0703:00b3:d112

In future, devlink device instance name will adapt to have sfnum
annotation using either an alias or as devlink instance name described
in RFC [1].

[1] https://lore.kernel.org/netdev/20200519092258.GF4655@nanopsycho/

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/Makefile  |   2 +-
 .../net/ethernet/mellanox/mlx5/core/devlink.c |  12 +++
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  |   2 +-
 .../net/ethernet/mellanox/mlx5/core/main.c    |  12 ++-
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  10 ++
 .../net/ethernet/mellanox/mlx5/core/pci_irq.c |  20 ++++
 .../ethernet/mellanox/mlx5/core/sf/dev/dev.c  |  10 ++
 .../ethernet/mellanox/mlx5/core/sf/dev/dev.h  |  20 ++++
 .../mellanox/mlx5/core/sf/dev/driver.c        | 101 ++++++++++++++++++
 include/linux/mlx5/driver.h                   |   4 +-
 10 files changed, 187 insertions(+), 6 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/dev/driver.c

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 2aefbca404c3..efa95d6dd112 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -88,4 +88,4 @@ mlx5_core-$(CONFIG_MLX5_SW_STEERING) += steering/dr_domain.o steering/dr_table.o
 #
 # SF device
 #
-mlx5_core-$(CONFIG_MLX5_SF) += sf/vhca_event.o sf/dev/dev.o
+mlx5_core-$(CONFIG_MLX5_SF) += sf/vhca_event.o sf/dev/dev.o sf/dev/driver.o
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
index 3261d0dc1104..9afe918c5827 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
@@ -7,6 +7,7 @@
 #include "fw_reset.h"
 #include "fs_core.h"
 #include "eswitch.h"
+#include "sf/dev/dev.h"
 
 static int mlx5_devlink_flash_update(struct devlink *devlink,
 				     struct devlink_flash_update_params *params,
@@ -127,6 +128,17 @@ static int mlx5_devlink_reload_down(struct devlink *devlink, bool netns_change,
 				    struct netlink_ext_ack *extack)
 {
 	struct mlx5_core_dev *dev = devlink_priv(devlink);
+	bool sf_dev_allocated;
+
+	sf_dev_allocated = mlx5_sf_dev_allocated(dev);
+	if (sf_dev_allocated) {
+		/* Reload results in deleting SF device which further results in
+		 * unregistering devlink instance while holding devlink_mutext.
+		 * Hence, do not support reload.
+		 */
+		NL_SET_ERR_MSG_MOD(extack, "reload is unsupported when SFs are allocated\n");
+		return -EOPNOTSUPP;
+	}
 
 	switch (action) {
 	case DEVLINK_RELOAD_ACTION_DRIVER_REINIT:
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 421febebc658..174dfbc996c6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -467,7 +467,7 @@ int mlx5_eq_table_init(struct mlx5_core_dev *dev)
 	for (i = 0; i < MLX5_EVENT_TYPE_MAX; i++)
 		ATOMIC_INIT_NOTIFIER_HEAD(&eq_table->nh[i]);
 
-	eq_table->irq_table = dev->priv.irq_table;
+	eq_table->irq_table = mlx5_irq_table_get(dev);
 	return 0;
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 292c30e71d7f..932a280a56a5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -84,7 +84,6 @@ unsigned int mlx5_core_debug_mask;
 module_param_named(debug_mask, mlx5_core_debug_mask, uint, 0644);
 MODULE_PARM_DESC(debug_mask, "debug mask: 1 = dump cmd data, 2 = dump cmd exec time, 3 = both. Default=0");
 
-#define MLX5_DEFAULT_PROF	2
 static unsigned int prof_sel = MLX5_DEFAULT_PROF;
 module_param_named(prof_sel, prof_sel, uint, 0444);
 MODULE_PARM_DESC(prof_sel, "profile selector. Valid range 0 - 2");
@@ -1303,7 +1302,7 @@ void mlx5_unload_one(struct mlx5_core_dev *dev, bool cleanup)
 	mutex_unlock(&dev->intf_state_mutex);
 }
 
-static int mlx5_mdev_init(struct mlx5_core_dev *dev, int profile_idx)
+int mlx5_mdev_init(struct mlx5_core_dev *dev, int profile_idx)
 {
 	struct mlx5_priv *priv = &dev->priv;
 	int err;
@@ -1353,7 +1352,7 @@ static int mlx5_mdev_init(struct mlx5_core_dev *dev, int profile_idx)
 	return err;
 }
 
-static void mlx5_mdev_uninit(struct mlx5_core_dev *dev)
+void mlx5_mdev_uninit(struct mlx5_core_dev *dev)
 {
 	struct mlx5_priv *priv = &dev->priv;
 
@@ -1693,6 +1692,10 @@ static int __init init(void)
 	if (err)
 		goto err_debug;
 
+	err = mlx5_sf_driver_register();
+	if (err)
+		goto err_sf;
+
 #ifdef CONFIG_MLX5_CORE_EN
 	err = mlx5e_init();
 	if (err) {
@@ -1703,6 +1706,8 @@ static int __init init(void)
 
 	return 0;
 
+err_sf:
+	pci_unregister_driver(&mlx5_core_driver);
 err_debug:
 	mlx5_unregister_debugfs();
 	return err;
@@ -1713,6 +1718,7 @@ static void __exit cleanup(void)
 #ifdef CONFIG_MLX5_CORE_EN
 	mlx5e_cleanup();
 #endif
+	mlx5_sf_driver_unregister();
 	pci_unregister_driver(&mlx5_core_driver);
 	mlx5_unregister_debugfs();
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index a33b7496d748..3754ef98554f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -117,6 +117,8 @@ enum mlx5_semaphore_space_address {
 	MLX5_SEMAPHORE_SW_RESET         = 0x20,
 };
 
+#define MLX5_DEFAULT_PROF       2
+
 int mlx5_query_hca_caps(struct mlx5_core_dev *dev);
 int mlx5_query_board_id(struct mlx5_core_dev *dev);
 int mlx5_cmd_init(struct mlx5_core_dev *dev);
@@ -176,6 +178,7 @@ struct cpumask *
 mlx5_irq_get_affinity_mask(struct mlx5_irq_table *irq_table, int vecidx);
 struct cpu_rmap *mlx5_irq_get_rmap(struct mlx5_irq_table *table);
 int mlx5_irq_get_num_comp(struct mlx5_irq_table *table);
+struct mlx5_irq_table *mlx5_irq_table_get(struct mlx5_core_dev *dev);
 
 int mlx5_events_init(struct mlx5_core_dev *dev);
 void mlx5_events_cleanup(struct mlx5_core_dev *dev);
@@ -257,6 +260,13 @@ enum {
 u8 mlx5_get_nic_state(struct mlx5_core_dev *dev);
 void mlx5_set_nic_state(struct mlx5_core_dev *dev, u8 state);
 
+static inline bool mlx5_core_is_sf(const struct mlx5_core_dev *dev)
+{
+	return dev->coredev_type == MLX5_COREDEV_SF;
+}
+
+int mlx5_mdev_init(struct mlx5_core_dev *dev, int profile_idx);
+void mlx5_mdev_uninit(struct mlx5_core_dev *dev);
 void mlx5_unload_one(struct mlx5_core_dev *dev, bool cleanup);
 int mlx5_load_one(struct mlx5_core_dev *dev, bool boot);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
index 6fd974920394..a61e09aff152 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
@@ -30,6 +30,9 @@ int mlx5_irq_table_init(struct mlx5_core_dev *dev)
 {
 	struct mlx5_irq_table *irq_table;
 
+	if (mlx5_core_is_sf(dev))
+		return 0;
+
 	irq_table = kvzalloc(sizeof(*irq_table), GFP_KERNEL);
 	if (!irq_table)
 		return -ENOMEM;
@@ -40,6 +43,9 @@ int mlx5_irq_table_init(struct mlx5_core_dev *dev)
 
 void mlx5_irq_table_cleanup(struct mlx5_core_dev *dev)
 {
+	if (mlx5_core_is_sf(dev))
+		return;
+
 	kvfree(dev->priv.irq_table);
 }
 
@@ -268,6 +274,9 @@ int mlx5_irq_table_create(struct mlx5_core_dev *dev)
 	int nvec;
 	int err;
 
+	if (mlx5_core_is_sf(dev))
+		return 0;
+
 	nvec = MLX5_CAP_GEN(dev, num_ports) * num_online_cpus() +
 	       MLX5_IRQ_VEC_COMP_BASE;
 	nvec = min_t(int, nvec, num_eqs);
@@ -319,6 +328,9 @@ void mlx5_irq_table_destroy(struct mlx5_core_dev *dev)
 	struct mlx5_irq_table *table = dev->priv.irq_table;
 	int i;
 
+	if (mlx5_core_is_sf(dev))
+		return;
+
 	/* free_irq requires that affinity and rmap will be cleared
 	 * before calling it. This is why there is asymmetry with set_rmap
 	 * which should be called after alloc_irq but before request_irq.
@@ -332,3 +344,11 @@ void mlx5_irq_table_destroy(struct mlx5_core_dev *dev)
 	kfree(table->irq);
 }
 
+struct mlx5_irq_table *mlx5_irq_table_get(struct mlx5_core_dev *dev)
+{
+#ifdef CONFIG_MLX5_SF
+	if (mlx5_core_is_sf(dev))
+		return dev->priv.parent_mdev->priv.irq_table;
+#endif
+	return dev->priv.irq_table;
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.c b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.c
index 6562bf63afaa..2675b85d202d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.c
@@ -24,6 +24,16 @@ static bool mlx5_sf_dev_supported(const struct mlx5_core_dev *dev)
 	return MLX5_CAP_GEN(dev, sf) && mlx5_vhca_event_supported(dev);
 }
 
+bool mlx5_sf_dev_allocated(const struct mlx5_core_dev *dev)
+{
+	struct mlx5_sf_dev_table *table = dev->priv.sf_dev_table;
+
+	if (!mlx5_sf_dev_supported(dev))
+		return false;
+
+	return xa_empty(&table->devices);
+}
+
 static ssize_t sfnum_show(struct device *dev, struct device_attribute *attr, char *buf)
 {
 	struct auxiliary_device *adev = container_of(dev, struct auxiliary_device, dev);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.h b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.h
index a6fb7289ba2c..4de02902aef1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.h
@@ -13,6 +13,7 @@
 struct mlx5_sf_dev {
 	struct auxiliary_device adev;
 	struct mlx5_core_dev *parent_mdev;
+	struct mlx5_core_dev *mdev;
 	phys_addr_t bar_base_addr;
 	u32 sfnum;
 };
@@ -20,6 +21,11 @@ struct mlx5_sf_dev {
 void mlx5_sf_dev_table_create(struct mlx5_core_dev *dev);
 void mlx5_sf_dev_table_destroy(struct mlx5_core_dev *dev);
 
+int mlx5_sf_driver_register(void);
+void mlx5_sf_driver_unregister(void);
+
+bool mlx5_sf_dev_allocated(const struct mlx5_core_dev *dev);
+
 #else
 
 static inline void mlx5_sf_dev_table_create(struct mlx5_core_dev *dev)
@@ -30,6 +36,20 @@ static inline void mlx5_sf_dev_table_destroy(struct mlx5_core_dev *dev)
 {
 }
 
+static inline int mlx5_sf_driver_register(void)
+{
+	return 0;
+}
+
+static inline void mlx5_sf_driver_unregister(void)
+{
+}
+
+static inline bool mlx5_sf_dev_allocated(const struct mlx5_core_dev *dev)
+{
+	return 0;
+}
+
 #endif
 
 #endif
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/driver.c b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/driver.c
new file mode 100644
index 000000000000..9a1ad331ce0a
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/driver.c
@@ -0,0 +1,101 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/* Copyright (c) 2020 Mellanox Technologies Ltd */
+
+#include <linux/mlx5/driver.h>
+#include <linux/mlx5/device.h>
+#include "mlx5_core.h"
+#include "dev.h"
+#include "devlink.h"
+
+static int mlx5_sf_dev_probe(struct auxiliary_device *adev, const struct auxiliary_device_id *id)
+{
+	struct mlx5_sf_dev *sf_dev = container_of(adev, struct mlx5_sf_dev, adev);
+	struct mlx5_core_dev *mdev;
+	struct devlink *devlink;
+	int err;
+
+	devlink = mlx5_devlink_alloc();
+	if (!devlink)
+		return -ENOMEM;
+
+	mdev = devlink_priv(devlink);
+	mdev->device = &adev->dev;
+	mdev->pdev = sf_dev->parent_mdev->pdev;
+	mdev->bar_addr = sf_dev->bar_base_addr;
+	mdev->iseg_base = sf_dev->bar_base_addr;
+	mdev->coredev_type = MLX5_COREDEV_SF;
+	mdev->priv.parent_mdev = sf_dev->parent_mdev;
+	mdev->priv.adev_idx = adev->id;
+	sf_dev->mdev = mdev;
+
+	err = mlx5_mdev_init(mdev, MLX5_DEFAULT_PROF);
+	if (err) {
+		mlx5_core_warn(mdev, "mlx5_mdev_init on err=%d\n", err);
+		goto mdev_err;
+	}
+
+	mdev->iseg = ioremap(mdev->iseg_base, sizeof(*mdev->iseg));
+	if (!mdev->iseg) {
+		mlx5_core_warn(mdev, "remap error\n");
+		goto remap_err;
+	}
+
+	err = mlx5_load_one(mdev, true);
+	if (err) {
+		mlx5_core_warn(mdev, "mlx5_load_one err=%d\n", err);
+		goto load_one_err;
+	}
+	return 0;
+
+load_one_err:
+	iounmap(mdev->iseg);
+remap_err:
+	mlx5_mdev_uninit(mdev);
+mdev_err:
+	mlx5_devlink_free(devlink);
+	return err;
+}
+
+static void mlx5_sf_dev_remove(struct auxiliary_device *adev)
+{
+	struct mlx5_sf_dev *sf_dev = container_of(adev, struct mlx5_sf_dev, adev);
+	struct devlink *devlink;
+
+	devlink = priv_to_devlink(sf_dev->mdev);
+	mlx5_unload_one(sf_dev->mdev, true);
+	iounmap(sf_dev->mdev->iseg);
+	mlx5_mdev_uninit(sf_dev->mdev);
+	mlx5_devlink_free(devlink);
+}
+
+static void mlx5_sf_dev_shutdown(struct auxiliary_device *adev)
+{
+	struct mlx5_sf_dev *sf_dev = container_of(adev, struct mlx5_sf_dev, adev);
+
+	mlx5_unload_one(sf_dev->mdev, false);
+}
+
+static const struct auxiliary_device_id mlx5_sf_dev_id_table[] = {
+	{ .name = KBUILD_MODNAME "." MLX5_SF_DEV_ID_NAME, },
+	{ },
+};
+
+MODULE_DEVICE_TABLE(auxiliary, mlx5_sf_dev_id_table);
+
+static struct auxiliary_driver mlx5_sf_driver = {
+	.name = KBUILD_MODNAME,
+	.probe = mlx5_sf_dev_probe,
+	.remove = mlx5_sf_dev_remove,
+	.shutdown = mlx5_sf_dev_shutdown,
+	.id_table = mlx5_sf_dev_id_table,
+};
+
+int mlx5_sf_driver_register(void)
+{
+	return auxiliary_driver_register(&mlx5_sf_driver);
+}
+
+void mlx5_sf_driver_unregister(void)
+{
+	auxiliary_driver_unregister(&mlx5_sf_driver);
+}
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 08e5fbe97df0..48e3638b1185 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -193,7 +193,8 @@ enum port_state_policy {
 
 enum mlx5_coredev_type {
 	MLX5_COREDEV_PF,
-	MLX5_COREDEV_VF
+	MLX5_COREDEV_VF,
+	MLX5_COREDEV_SF,
 };
 
 struct mlx5_field_desc {
@@ -608,6 +609,7 @@ struct mlx5_priv {
 #ifdef CONFIG_MLX5_SF
 	struct mlx5_vhca_state_notifier *vhca_state_notifier;
 	struct mlx5_sf_dev_table *sf_dev_table;
+	struct mlx5_core_dev *parent_mdev;
 #endif
 };
 
-- 
2.26.2


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [net-next v5 09/15] net/mlx5: E-switch, Prepare eswitch to handle SF vport
  2020-12-15  9:03 [net-next v5 00/15] Add mlx5 subfunction support Saeed Mahameed
                   ` (7 preceding siblings ...)
  2020-12-15  9:03 ` [net-next v5 08/15] net/mlx5: SF, Add auxiliary device driver Saeed Mahameed
@ 2020-12-15  9:03 ` Saeed Mahameed
  2020-12-16  0:47   ` Jakub Kicinski
  2020-12-15  9:03 ` [net-next v5 10/15] net/mlx5: E-switch, Add eswitch helpers for " Saeed Mahameed
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 46+ messages in thread
From: Saeed Mahameed @ 2020-12-15  9:03 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Jason Gunthorpe
  Cc: Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Vu Pham, Parav Pandit, Roi Dayan, Saeed Mahameed

From: Vu Pham <vuhuong@nvidia.com>

Prepare eswitch to handle SF vport during
(a) querying eswitch functions
(b) egress ACL creation
(c) account for SF vports in total vports calculation

Assign a dedicated placeholder for SFs vports and their representors.
They are placed after VFs vports and before ECPF vports as below:
[PF,VF0,...,VFn,SF0,...SFm,ECPF,UPLINK].

Change functions to map SF's vport numbers to indices when
accessing the vports or representors arrays, and vice versa.

Signed-off-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/Kconfig   | 10 ++++
 .../mellanox/mlx5/core/esw/acl/egress_ofld.c  |  2 +-
 .../net/ethernet/mellanox/mlx5/core/eswitch.c | 11 +++-
 .../net/ethernet/mellanox/mlx5/core/eswitch.h | 50 +++++++++++++++++++
 .../mellanox/mlx5/core/eswitch_offloads.c     | 11 ++++
 .../net/ethernet/mellanox/mlx5/core/vport.c   |  3 +-
 6 files changed, 83 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
index d6c48582e7a8..ad45d20f9d44 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
@@ -212,3 +212,13 @@ config MLX5_SF
 	Build support for subfuction device in the NIC. A Mellanox subfunction
 	device can support RDMA, netdevice and vdpa device.
 	It is similar to a SRIOV VF but it doesn't require SRIOV support.
+
+config MLX5_SF_MANAGER
+	bool
+	depends on MLX5_SF && MLX5_ESWITCH
+	default y
+	help
+	Build support for subfuction port in the NIC. A Mellanox subfunction
+	port is managed through devlink.  A subfunction supports RDMA, netdevice
+	and vdpa device. It is similar to a SRIOV VF but it doesn't require
+	SRIOV support.
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/acl/egress_ofld.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/acl/egress_ofld.c
index 4c74e2690d57..26b37a0f8762 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/esw/acl/egress_ofld.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/acl/egress_ofld.c
@@ -150,7 +150,7 @@ static void esw_acl_egress_ofld_groups_destroy(struct mlx5_vport *vport)
 
 static bool esw_acl_egress_needed(const struct mlx5_eswitch *esw, u16 vport_num)
 {
-	return mlx5_eswitch_is_vf_vport(esw, vport_num);
+	return mlx5_eswitch_is_vf_vport(esw, vport_num) || mlx5_esw_is_sf_vport(esw, vport_num);
 }
 
 int esw_acl_egress_ofld_setup(struct mlx5_eswitch *esw, struct mlx5_vport *vport)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index da901e364656..d75247a8ce55 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -1366,9 +1366,15 @@ const u32 *mlx5_esw_query_functions(struct mlx5_core_dev *dev)
 {
 	int outlen = MLX5_ST_SZ_BYTES(query_esw_functions_out);
 	u32 in[MLX5_ST_SZ_DW(query_esw_functions_in)] = {};
+	u16 max_sf_vports;
 	u32 *out;
 	int err;
 
+	max_sf_vports = mlx5_sf_max_functions(dev);
+	/* Device interface is array of 64-bits */
+	if (max_sf_vports)
+		outlen += DIV_ROUND_UP(max_sf_vports, BITS_PER_TYPE(__be64)) * sizeof(__be64);
+
 	out = kvzalloc(outlen, GFP_KERNEL);
 	if (!out)
 		return ERR_PTR(-ENOMEM);
@@ -1376,7 +1382,7 @@ const u32 *mlx5_esw_query_functions(struct mlx5_core_dev *dev)
 	MLX5_SET(query_esw_functions_in, in, opcode,
 		 MLX5_CMD_OP_QUERY_ESW_FUNCTIONS);
 
-	err = mlx5_cmd_exec_inout(dev, query_esw_functions, in, out);
+	err = mlx5_cmd_exec(dev, in, sizeof(in), out, outlen);
 	if (!err)
 		return out;
 
@@ -1899,7 +1905,8 @@ static bool
 is_port_function_supported(const struct mlx5_eswitch *esw, u16 vport_num)
 {
 	return vport_num == MLX5_VPORT_PF ||
-	       mlx5_eswitch_is_vf_vport(esw, vport_num);
+	       mlx5_eswitch_is_vf_vport(esw, vport_num) ||
+	       mlx5_esw_is_sf_vport(esw, vport_num);
 }
 
 int mlx5_devlink_port_function_hw_addr_get(struct devlink *devlink,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index cf87de94418f..4e3ed878ff03 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -43,6 +43,7 @@
 #include <linux/mlx5/fs.h>
 #include "lib/mpfs.h"
 #include "lib/fs_chains.h"
+#include "sf/sf.h"
 #include "en/tc_ct.h"
 
 #ifdef CONFIG_MLX5_ESWITCH
@@ -499,6 +500,40 @@ static inline u16 mlx5_eswitch_first_host_vport_num(struct mlx5_core_dev *dev)
 		MLX5_VPORT_PF : MLX5_VPORT_FIRST_VF;
 }
 
+static inline int mlx5_esw_sf_start_idx(const struct mlx5_eswitch *esw)
+{
+	/* PF and VF vports indices start from 0 to max_vfs */
+	return MLX5_VPORT_PF_PLACEHOLDER + mlx5_core_max_vfs(esw->dev);
+}
+
+static inline int mlx5_esw_sf_end_idx(const struct mlx5_eswitch *esw)
+{
+	return mlx5_esw_sf_start_idx(esw) + mlx5_sf_max_functions(esw->dev);
+}
+
+static inline int
+mlx5_esw_sf_vport_num_to_index(const struct mlx5_eswitch *esw, u16 vport_num)
+{
+	return vport_num - mlx5_sf_start_function_id(esw->dev) +
+	       MLX5_VPORT_PF_PLACEHOLDER + mlx5_core_max_vfs(esw->dev);
+}
+
+static inline u16
+mlx5_esw_sf_vport_index_to_num(const struct mlx5_eswitch *esw, int idx)
+{
+	return mlx5_sf_start_function_id(esw->dev) + idx -
+	       (MLX5_VPORT_PF_PLACEHOLDER + mlx5_core_max_vfs(esw->dev));
+}
+
+static inline bool
+mlx5_esw_is_sf_vport(const struct mlx5_eswitch *esw, u16 vport_num)
+{
+	return mlx5_sf_supported(esw->dev) &&
+	       vport_num >= mlx5_sf_start_function_id(esw->dev) &&
+	       (vport_num < (mlx5_sf_start_function_id(esw->dev) +
+			     mlx5_sf_max_functions(esw->dev)));
+}
+
 static inline bool mlx5_eswitch_is_funcs_handler(const struct mlx5_core_dev *dev)
 {
 	return mlx5_core_is_ecpf_esw_manager(dev);
@@ -527,6 +562,10 @@ static inline int mlx5_eswitch_vport_num_to_index(struct mlx5_eswitch *esw,
 	if (vport_num == MLX5_VPORT_UPLINK)
 		return mlx5_eswitch_uplink_idx(esw);
 
+	if (mlx5_esw_is_sf_vport(esw, vport_num))
+		return mlx5_esw_sf_vport_num_to_index(esw, vport_num);
+
+	/* PF and VF vports start from 0 to max_vfs */
 	return vport_num;
 }
 
@@ -540,6 +579,12 @@ static inline u16 mlx5_eswitch_index_to_vport_num(struct mlx5_eswitch *esw,
 	if (index == mlx5_eswitch_uplink_idx(esw))
 		return MLX5_VPORT_UPLINK;
 
+	/* SF vports indices are after VFs and before ECPF */
+	if (mlx5_sf_supported(esw->dev) &&
+	    index > mlx5_core_max_vfs(esw->dev))
+		return mlx5_esw_sf_vport_index_to_num(esw, index);
+
+	/* PF and VF vports start from 0 to max_vfs */
 	return index;
 }
 
@@ -625,6 +670,11 @@ void mlx5e_tc_clean_fdb_peer_flows(struct mlx5_eswitch *esw);
 	for ((vport) = (nvfs);						\
 	     (vport) >= (esw)->first_host_vport; (vport)--)
 
+#define mlx5_esw_for_each_sf_rep(esw, i, rep)		\
+	for ((i) = mlx5_esw_sf_start_idx(esw);		\
+	     (rep) = &(esw)->offloads.vport_reps[(i)],	\
+	     (i) < mlx5_esw_sf_end_idx(esw); (i++))
+
 struct mlx5_eswitch *mlx5_devlink_eswitch_get(struct devlink *devlink);
 struct mlx5_vport *__must_check
 mlx5_eswitch_get_vport(struct mlx5_eswitch *esw, u16 vport_num);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 2f6a0ae20650..2d241f7351b5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -1800,11 +1800,22 @@ static void __esw_offloads_unload_rep(struct mlx5_eswitch *esw,
 		esw->offloads.rep_ops[rep_type]->unload(rep);
 }
 
+static void __unload_reps_sf_vport(struct mlx5_eswitch *esw, u8 rep_type)
+{
+	struct mlx5_eswitch_rep *rep;
+	int i;
+
+	mlx5_esw_for_each_sf_rep(esw, i, rep)
+		__esw_offloads_unload_rep(esw, rep, rep_type);
+}
+
 static void __unload_reps_all_vport(struct mlx5_eswitch *esw, u8 rep_type)
 {
 	struct mlx5_eswitch_rep *rep;
 	int i;
 
+	__unload_reps_sf_vport(esw, rep_type);
+
 	mlx5_esw_for_each_vf_rep_reverse(esw, i, rep, esw->esw_funcs.num_vfs)
 		__esw_offloads_unload_rep(esw, rep, rep_type);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/vport.c b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
index bdafc85fd874..ba78e0660523 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/vport.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
@@ -36,6 +36,7 @@
 #include <linux/mlx5/vport.h>
 #include <linux/mlx5/eswitch.h>
 #include "mlx5_core.h"
+#include "sf/sf.h"
 
 /* Mutex to hold while enabling or disabling RoCE */
 static DEFINE_MUTEX(mlx5_roce_en_lock);
@@ -1160,6 +1161,6 @@ EXPORT_SYMBOL_GPL(mlx5_query_nic_system_image_guid);
  */
 u16 mlx5_eswitch_get_total_vports(const struct mlx5_core_dev *dev)
 {
-	return MLX5_SPECIAL_VPORTS(dev) + mlx5_core_max_vfs(dev);
+	return MLX5_SPECIAL_VPORTS(dev) + mlx5_core_max_vfs(dev) + mlx5_sf_max_functions(dev);
 }
 EXPORT_SYMBOL_GPL(mlx5_eswitch_get_total_vports);
-- 
2.26.2


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [net-next v5 10/15] net/mlx5: E-switch, Add eswitch helpers for SF vport
  2020-12-15  9:03 [net-next v5 00/15] Add mlx5 subfunction support Saeed Mahameed
                   ` (8 preceding siblings ...)
  2020-12-15  9:03 ` [net-next v5 09/15] net/mlx5: E-switch, Prepare eswitch to handle SF vport Saeed Mahameed
@ 2020-12-15  9:03 ` Saeed Mahameed
  2020-12-15  9:03 ` [net-next v5 11/15] net/mlx5: SF, Add port add delete functionality Saeed Mahameed
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 46+ messages in thread
From: Saeed Mahameed @ 2020-12-15  9:03 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Jason Gunthorpe
  Cc: Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Parav Pandit, Vu Pham, Roi Dayan, Saeed Mahameed

From: Parav Pandit <parav@nvidia.com>

Add helpers to enable/disable eswitch port, register its devlink port and
load its representor.

Signed-off-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../mellanox/mlx5/core/esw/devlink_port.c     | 41 +++++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/eswitch.c | 12 +++---
 .../net/ethernet/mellanox/mlx5/core/eswitch.h | 16 ++++++++
 .../mellanox/mlx5/core/eswitch_offloads.c     | 36 +++++++++++++++-
 4 files changed, 97 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c
index ffff11baa3d0..4b7e9f783789 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c
@@ -122,3 +122,44 @@ struct devlink_port *mlx5_esw_offloads_devlink_port(struct mlx5_eswitch *esw, u1
 	vport = mlx5_eswitch_get_vport(esw, vport_num);
 	return vport->dl_port;
 }
+
+int mlx5_esw_devlink_sf_port_register(struct mlx5_eswitch *esw, struct devlink_port *dl_port,
+				      u16 vport_num, u32 sfnum)
+{
+	struct mlx5_core_dev *dev = esw->dev;
+	struct netdev_phys_item_id ppid = {};
+	unsigned int dl_port_index;
+	struct mlx5_vport *vport;
+	struct devlink *devlink;
+	u16 pfnum;
+	int err;
+
+	vport = mlx5_eswitch_get_vport(esw, vport_num);
+	if (IS_ERR(vport))
+		return PTR_ERR(vport);
+
+	pfnum = PCI_FUNC(dev->pdev->devfn);
+	mlx5_esw_get_port_parent_id(dev, &ppid);
+	memcpy(dl_port->attrs.switch_id.id, &ppid.id[0], ppid.id_len);
+	dl_port->attrs.switch_id.id_len = ppid.id_len;
+	devlink_port_attrs_pci_sf_set(dl_port, 0, pfnum, sfnum, false);
+	devlink = priv_to_devlink(dev);
+	dl_port_index = mlx5_esw_vport_to_devlink_port_index(dev, vport_num);
+	err = devlink_port_register(devlink, dl_port, dl_port_index);
+	if (err)
+		return err;
+
+	vport->dl_port = dl_port;
+	return 0;
+}
+
+void mlx5_esw_devlink_sf_port_unregister(struct mlx5_eswitch *esw, u16 vport_num)
+{
+	struct mlx5_vport *vport;
+
+	vport = mlx5_eswitch_get_vport(esw, vport_num);
+	if (IS_ERR(vport))
+		return;
+	devlink_port_unregister(vport->dl_port);
+	vport->dl_port = NULL;
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index d75247a8ce55..d06e7a5f15de 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -1273,8 +1273,8 @@ static void esw_vport_cleanup(struct mlx5_eswitch *esw, struct mlx5_vport *vport
 	esw_vport_cleanup_acl(esw, vport);
 }
 
-static int esw_enable_vport(struct mlx5_eswitch *esw, u16 vport_num,
-			    enum mlx5_eswitch_vport_event enabled_events)
+int mlx5_esw_vport_enable(struct mlx5_eswitch *esw, u16 vport_num,
+			  enum mlx5_eswitch_vport_event enabled_events)
 {
 	struct mlx5_vport *vport;
 	int ret;
@@ -1310,7 +1310,7 @@ static int esw_enable_vport(struct mlx5_eswitch *esw, u16 vport_num,
 	return ret;
 }
 
-static void esw_disable_vport(struct mlx5_eswitch *esw, u16 vport_num)
+void mlx5_esw_vport_disable(struct mlx5_eswitch *esw, u16 vport_num)
 {
 	struct mlx5_vport *vport;
 
@@ -1432,7 +1432,7 @@ int mlx5_eswitch_load_vport(struct mlx5_eswitch *esw, u16 vport_num,
 {
 	int err;
 
-	err = esw_enable_vport(esw, vport_num, enabled_events);
+	err = mlx5_esw_vport_enable(esw, vport_num, enabled_events);
 	if (err)
 		return err;
 
@@ -1443,14 +1443,14 @@ int mlx5_eswitch_load_vport(struct mlx5_eswitch *esw, u16 vport_num,
 	return err;
 
 err_rep:
-	esw_disable_vport(esw, vport_num);
+	mlx5_esw_vport_disable(esw, vport_num);
 	return err;
 }
 
 void mlx5_eswitch_unload_vport(struct mlx5_eswitch *esw, u16 vport_num)
 {
 	esw_offloads_unload_rep(esw, vport_num);
-	esw_disable_vport(esw, vport_num);
+	mlx5_esw_vport_disable(esw, vport_num);
 }
 
 void mlx5_eswitch_unload_vf_vports(struct mlx5_eswitch *esw, u16 num_vfs)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index 4e3ed878ff03..54514b04808d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -688,6 +688,10 @@ mlx5_eswitch_enable_pf_vf_vports(struct mlx5_eswitch *esw,
 				 enum mlx5_eswitch_vport_event enabled_events);
 void mlx5_eswitch_disable_pf_vf_vports(struct mlx5_eswitch *esw);
 
+int mlx5_esw_vport_enable(struct mlx5_eswitch *esw, u16 vport_num,
+			  enum mlx5_eswitch_vport_event enabled_events);
+void mlx5_esw_vport_disable(struct mlx5_eswitch *esw, u16 vport_num);
+
 int
 esw_vport_create_offloads_acl_tables(struct mlx5_eswitch *esw,
 				     struct mlx5_vport *vport);
@@ -706,6 +710,9 @@ esw_get_max_restore_tag(struct mlx5_eswitch *esw);
 int esw_offloads_load_rep(struct mlx5_eswitch *esw, u16 vport_num);
 void esw_offloads_unload_rep(struct mlx5_eswitch *esw, u16 vport_num);
 
+int mlx5_esw_offloads_rep_load(struct mlx5_eswitch *esw, u16 vport_num);
+void mlx5_esw_offloads_rep_unload(struct mlx5_eswitch *esw, u16 vport_num);
+
 int mlx5_eswitch_load_vport(struct mlx5_eswitch *esw, u16 vport_num,
 			    enum mlx5_eswitch_vport_event enabled_events);
 void mlx5_eswitch_unload_vport(struct mlx5_eswitch *esw, u16 vport_num);
@@ -717,6 +724,15 @@ void mlx5_eswitch_unload_vf_vports(struct mlx5_eswitch *esw, u16 num_vfs);
 int mlx5_esw_offloads_devlink_port_register(struct mlx5_eswitch *esw, u16 vport_num);
 void mlx5_esw_offloads_devlink_port_unregister(struct mlx5_eswitch *esw, u16 vport_num);
 struct devlink_port *mlx5_esw_offloads_devlink_port(struct mlx5_eswitch *esw, u16 vport_num);
+
+int mlx5_esw_devlink_sf_port_register(struct mlx5_eswitch *esw, struct devlink_port *dl_port,
+				      u16 vport_num, u32 sfnum);
+void mlx5_esw_devlink_sf_port_unregister(struct mlx5_eswitch *esw, u16 vport_num);
+
+int mlx5_esw_offloads_sf_vport_enable(struct mlx5_eswitch *esw, struct devlink_port *dl_port,
+				      u16 vport_num, u32 sfnum);
+void mlx5_esw_offloads_sf_vport_disable(struct mlx5_eswitch *esw, u16 vport_num);
+
 #else  /* CONFIG_MLX5_ESWITCH */
 /* eswitch API stubs */
 static inline int  mlx5_eswitch_init(struct mlx5_core_dev *dev) { return 0; }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 2d241f7351b5..7f09f2bbf7c1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -1833,7 +1833,7 @@ static void __unload_reps_all_vport(struct mlx5_eswitch *esw, u8 rep_type)
 	__esw_offloads_unload_rep(esw, rep, rep_type);
 }
 
-static int mlx5_esw_offloads_rep_load(struct mlx5_eswitch *esw, u16 vport_num)
+int mlx5_esw_offloads_rep_load(struct mlx5_eswitch *esw, u16 vport_num)
 {
 	struct mlx5_eswitch_rep *rep;
 	int rep_type;
@@ -1857,7 +1857,7 @@ static int mlx5_esw_offloads_rep_load(struct mlx5_eswitch *esw, u16 vport_num)
 	return err;
 }
 
-static void mlx5_esw_offloads_rep_unload(struct mlx5_eswitch *esw, u16 vport_num)
+void mlx5_esw_offloads_rep_unload(struct mlx5_eswitch *esw, u16 vport_num)
 {
 	struct mlx5_eswitch_rep *rep;
 	int rep_type;
@@ -2835,3 +2835,35 @@ u32 mlx5_eswitch_get_vport_metadata_for_match(struct mlx5_eswitch *esw,
 	return vport->metadata << (32 - ESW_SOURCE_PORT_METADATA_BITS);
 }
 EXPORT_SYMBOL(mlx5_eswitch_get_vport_metadata_for_match);
+
+int mlx5_esw_offloads_sf_vport_enable(struct mlx5_eswitch *esw, struct devlink_port *dl_port,
+				      u16 vport_num, u32 sfnum)
+{
+	int err;
+
+	err = mlx5_esw_vport_enable(esw, vport_num, MLX5_VPORT_UC_ADDR_CHANGE);
+	if (err)
+		return err;
+
+	err = mlx5_esw_devlink_sf_port_register(esw, dl_port, vport_num, sfnum);
+	if (err)
+		goto devlink_err;
+
+	err = mlx5_esw_offloads_rep_load(esw, vport_num);
+	if (err)
+		goto rep_err;
+	return 0;
+
+rep_err:
+	mlx5_esw_devlink_sf_port_unregister(esw, vport_num);
+devlink_err:
+	mlx5_esw_vport_disable(esw, vport_num);
+	return err;
+}
+
+void mlx5_esw_offloads_sf_vport_disable(struct mlx5_eswitch *esw, u16 vport_num)
+{
+	mlx5_esw_offloads_rep_unload(esw, vport_num);
+	mlx5_esw_devlink_sf_port_unregister(esw, vport_num);
+	mlx5_esw_vport_disable(esw, vport_num);
+}
-- 
2.26.2


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [net-next v5 11/15] net/mlx5: SF, Add port add delete functionality
  2020-12-15  9:03 [net-next v5 00/15] Add mlx5 subfunction support Saeed Mahameed
                   ` (9 preceding siblings ...)
  2020-12-15  9:03 ` [net-next v5 10/15] net/mlx5: E-switch, Add eswitch helpers for " Saeed Mahameed
@ 2020-12-15  9:03 ` Saeed Mahameed
  2020-12-16  0:51   ` Jakub Kicinski
  2020-12-15  9:03 ` [net-next v5 12/15] net/mlx5: SF, Port function state change support Saeed Mahameed
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 46+ messages in thread
From: Saeed Mahameed @ 2020-12-15  9:03 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Jason Gunthorpe
  Cc: Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Parav Pandit, Vu Pham, Saeed Mahameed

From: Parav Pandit <parav@nvidia.com>

To handle SF port management outside of the eswitch as independent
software layer, introduce eswitch notifier APIs so that upper layer who
wish to support sf port management in switchdev mode can perform its
task whenever eswitch mode is set to switchdev or before eswitch is
disabled.

Initialize sf port table on such eswitch event.

Add SF port add and delete functionality in switchdev mode.
Destroy all SF ports when eswitch is disabled.
Expose SF port add and delete to user via devlink commands.

$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev

$ devlink port show
pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 splittable false

$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88

$ devlink port show ens2f0npf0sf88
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
  function:
    hw_addr 00:00:00:00:88:88 state inactive opstate detached

$ devlink port show ens2f0npf0sf88 -jp
{
    "port": {
        "pci/0000:06:00.0/32768": {
            "type": "eth",
            "netdev": "ens2f0npf0sf88",
            "flavour": "pcisf",
            "controller": 0,
            "pfnum": 0,
            "sfnum": 88,
            "external": false,
            "splittable": false,
            "function": {
                "hw_addr": "00:00:00:00:88:88",
                "state": "inactive",
                "opstate": "detached"
            }
        }
    }
}

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/Makefile  |   5 +
 drivers/net/ethernet/mellanox/mlx5/core/cmd.c |   4 +
 .../net/ethernet/mellanox/mlx5/core/devlink.c |   5 +
 .../net/ethernet/mellanox/mlx5/core/eswitch.c |  25 ++
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |  12 +
 .../net/ethernet/mellanox/mlx5/core/main.c    |  18 +
 .../net/ethernet/mellanox/mlx5/core/sf/cmd.c  |  27 ++
 .../ethernet/mellanox/mlx5/core/sf/devlink.c  | 312 ++++++++++++++++++
 .../ethernet/mellanox/mlx5/core/sf/hw_table.c | 125 +++++++
 .../net/ethernet/mellanox/mlx5/core/sf/priv.h |  17 +
 .../net/ethernet/mellanox/mlx5/core/sf/sf.h   |  28 ++
 include/linux/mlx5/driver.h                   |   6 +
 12 files changed, 584 insertions(+)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/cmd.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/hw_table.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/priv.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index efa95d6dd112..957d5d9cfb36 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -89,3 +89,8 @@ mlx5_core-$(CONFIG_MLX5_SW_STEERING) += steering/dr_domain.o steering/dr_table.o
 # SF device
 #
 mlx5_core-$(CONFIG_MLX5_SF) += sf/vhca_event.o sf/dev/dev.o sf/dev/driver.o
+
+#
+# SF manager
+#
+mlx5_core-$(CONFIG_MLX5_SF_MANAGER) += sf/cmd.o sf/hw_table.o sf/devlink.o
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
index 47dcc3ac2cf0..e8cecd50558d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
@@ -333,6 +333,7 @@ static int mlx5_internal_err_ret_value(struct mlx5_core_dev *dev, u16 op,
 	case MLX5_CMD_OP_DEALLOC_MEMIC:
 	case MLX5_CMD_OP_PAGE_FAULT_RESUME:
 	case MLX5_CMD_OP_QUERY_ESW_FUNCTIONS:
+	case MLX5_CMD_OP_DEALLOC_SF:
 		return MLX5_CMD_STAT_OK;
 
 	case MLX5_CMD_OP_QUERY_HCA_CAP:
@@ -466,6 +467,7 @@ static int mlx5_internal_err_ret_value(struct mlx5_core_dev *dev, u16 op,
 	case MLX5_CMD_OP_RELEASE_XRQ_ERROR:
 	case MLX5_CMD_OP_QUERY_VHCA_STATE:
 	case MLX5_CMD_OP_MODIFY_VHCA_STATE:
+	case MLX5_CMD_OP_ALLOC_SF:
 		*status = MLX5_DRIVER_STATUS_ABORTED;
 		*synd = MLX5_DRIVER_SYND;
 		return -EIO;
@@ -661,6 +663,8 @@ const char *mlx5_command_str(int command)
 	MLX5_COMMAND_STR_CASE(MODIFY_XRQ);
 	MLX5_COMMAND_STR_CASE(QUERY_VHCA_STATE);
 	MLX5_COMMAND_STR_CASE(MODIFY_VHCA_STATE);
+	MLX5_COMMAND_STR_CASE(ALLOC_SF);
+	MLX5_COMMAND_STR_CASE(DEALLOC_SF);
 	default: return "unknown command opcode";
 	}
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
index 9afe918c5827..d4c0cdf5edd9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
@@ -8,6 +8,7 @@
 #include "fs_core.h"
 #include "eswitch.h"
 #include "sf/dev/dev.h"
+#include "sf/sf.h"
 
 static int mlx5_devlink_flash_update(struct devlink *devlink,
 				     struct devlink_flash_update_params *params,
@@ -190,6 +191,10 @@ static const struct devlink_ops mlx5_devlink_ops = {
 	.eswitch_encap_mode_get = mlx5_devlink_eswitch_encap_mode_get,
 	.port_function_hw_addr_get = mlx5_devlink_port_function_hw_addr_get,
 	.port_function_hw_addr_set = mlx5_devlink_port_function_hw_addr_set,
+#endif
+#ifdef CONFIG_MLX5_SF_MANAGER
+	.port_new = mlx5_devlink_sf_port_new,
+	.port_del = mlx5_devlink_sf_port_del,
 #endif
 	.flash_update = mlx5_devlink_flash_update,
 	.info_get = mlx5_devlink_info_get,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index d06e7a5f15de..86e972c82af7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -1600,6 +1600,15 @@ mlx5_eswitch_update_num_of_vfs(struct mlx5_eswitch *esw, int num_vfs)
 	kvfree(out);
 }
 
+static void mlx5_esw_mode_change_notify(struct mlx5_eswitch *esw, u16 mode)
+{
+	struct mlx5_esw_event_info info = {};
+
+	info.new_mode = mode;
+
+	blocking_notifier_call_chain(&esw->n_head, 0, &info);
+}
+
 /**
  * mlx5_eswitch_enable_locked - Enable eswitch
  * @esw:	Pointer to eswitch
@@ -1660,6 +1669,8 @@ int mlx5_eswitch_enable_locked(struct mlx5_eswitch *esw, int mode, int num_vfs)
 		 mode == MLX5_ESWITCH_LEGACY ? "LEGACY" : "OFFLOADS",
 		 esw->esw_funcs.num_vfs, esw->enabled_vports);
 
+	mlx5_esw_mode_change_notify(esw, mode);
+
 	return 0;
 
 abort:
@@ -1716,6 +1727,11 @@ void mlx5_eswitch_disable_locked(struct mlx5_eswitch *esw, bool clear_vf)
 		 esw->mode == MLX5_ESWITCH_LEGACY ? "LEGACY" : "OFFLOADS",
 		 esw->esw_funcs.num_vfs, esw->enabled_vports);
 
+	/* Notify eswitch users that it is exiting from current mode.
+	 * So that it can do necessary cleanup before the eswitch is disabled.
+	 */
+	mlx5_esw_mode_change_notify(esw, MLX5_ESWITCH_NONE);
+
 	mlx5_eswitch_event_handlers_unregister(esw);
 
 	if (esw->mode == MLX5_ESWITCH_LEGACY)
@@ -1816,6 +1832,7 @@ int mlx5_eswitch_init(struct mlx5_core_dev *dev)
 	esw->offloads.inline_mode = MLX5_INLINE_MODE_NONE;
 
 	dev->priv.eswitch = esw;
+	BLOCKING_INIT_NOTIFIER_HEAD(&esw->n_head);
 	return 0;
 abort:
 	if (esw->work_queue)
@@ -2507,4 +2524,12 @@ bool mlx5_esw_multipath_prereq(struct mlx5_core_dev *dev0,
 		dev1->priv.eswitch->mode == MLX5_ESWITCH_OFFLOADS);
 }
 
+int mlx5_esw_event_notifier_register(struct mlx5_eswitch *esw, struct notifier_block *nb)
+{
+	return blocking_notifier_chain_register(&esw->n_head, nb);
+}
 
+void mlx5_esw_event_notifier_unregister(struct mlx5_eswitch *esw, struct notifier_block *nb)
+{
+	blocking_notifier_chain_unregister(&esw->n_head, nb);
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index 54514b04808d..479d2ac2cd85 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -278,6 +278,7 @@ struct mlx5_eswitch {
 	struct {
 		u32             large_group_num;
 	}  params;
+	struct blocking_notifier_head n_head;
 };
 
 void esw_offloads_disable(struct mlx5_eswitch *esw);
@@ -733,6 +734,17 @@ int mlx5_esw_offloads_sf_vport_enable(struct mlx5_eswitch *esw, struct devlink_p
 				      u16 vport_num, u32 sfnum);
 void mlx5_esw_offloads_sf_vport_disable(struct mlx5_eswitch *esw, u16 vport_num);
 
+/**
+ * mlx5_esw_event_info - Indicates eswitch mode changed/changing.
+ *
+ * @new_mode: New mode of eswitch.
+ */
+struct mlx5_esw_event_info {
+	u16 new_mode;
+};
+
+int mlx5_esw_event_notifier_register(struct mlx5_eswitch *esw, struct notifier_block *n);
+void mlx5_esw_event_notifier_unregister(struct mlx5_eswitch *esw, struct notifier_block *n);
 #else  /* CONFIG_MLX5_ESWITCH */
 /* eswitch API stubs */
 static inline int  mlx5_eswitch_init(struct mlx5_core_dev *dev) { return 0; }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 932a280a56a5..435323088ce0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -893,6 +893,18 @@ static int mlx5_init_once(struct mlx5_core_dev *dev)
 		goto err_fpga_cleanup;
 	}
 
+	err = mlx5_sf_hw_table_init(dev);
+	if (err) {
+		mlx5_core_err(dev, "Failed to init SF HW table %d\n", err);
+		goto err_sf_hw_table_cleanup;
+	}
+
+	err = mlx5_sf_table_init(dev);
+	if (err) {
+		mlx5_core_err(dev, "Failed to init SF table %d\n", err);
+		goto err_sf_table_cleanup;
+	}
+
 	dev->dm = mlx5_dm_create(dev);
 	if (IS_ERR(dev->dm))
 		mlx5_core_warn(dev, "Failed to init device memory%d\n", err);
@@ -903,6 +915,10 @@ static int mlx5_init_once(struct mlx5_core_dev *dev)
 
 	return 0;
 
+err_sf_table_cleanup:
+	mlx5_sf_hw_table_cleanup(dev);
+err_sf_hw_table_cleanup:
+	mlx5_vhca_event_cleanup(dev);
 err_fpga_cleanup:
 	mlx5_fpga_cleanup(dev);
 err_eswitch_cleanup:
@@ -936,6 +952,8 @@ static void mlx5_cleanup_once(struct mlx5_core_dev *dev)
 	mlx5_hv_vhca_destroy(dev->hv_vhca);
 	mlx5_fw_tracer_destroy(dev->tracer);
 	mlx5_dm_cleanup(dev);
+	mlx5_sf_table_cleanup(dev);
+	mlx5_sf_hw_table_cleanup(dev);
 	mlx5_vhca_event_cleanup(dev);
 	mlx5_fpga_cleanup(dev);
 	mlx5_eswitch_cleanup(dev->priv.eswitch);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/sf/cmd.c
new file mode 100644
index 000000000000..0bc3075f34fa
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/cmd.c
@@ -0,0 +1,27 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/* Copyright (c) 2020 Mellanox Technologies Ltd */
+
+#include <linux/mlx5/driver.h>
+#include "priv.h"
+
+int mlx5_cmd_alloc_sf(struct mlx5_core_dev *dev, u16 function_id)
+{
+	u32 out[MLX5_ST_SZ_DW(alloc_sf_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(alloc_sf_in)] = {};
+
+	MLX5_SET(alloc_sf_in, in, opcode, MLX5_CMD_OP_ALLOC_SF);
+	MLX5_SET(alloc_sf_in, in, function_id, function_id);
+
+	return mlx5_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+}
+
+int mlx5_cmd_dealloc_sf(struct mlx5_core_dev *dev, u16 function_id)
+{
+	u32 out[MLX5_ST_SZ_DW(dealloc_sf_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(dealloc_sf_in)] = {};
+
+	MLX5_SET(dealloc_sf_in, in, opcode, MLX5_CMD_OP_DEALLOC_SF);
+	MLX5_SET(dealloc_sf_in, in, function_id, function_id);
+
+	return mlx5_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c b/drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c
new file mode 100644
index 000000000000..09365f36a513
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c
@@ -0,0 +1,312 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/* Copyright (c) 2020 Mellanox Technologies Ltd */
+
+#include <linux/mlx5/driver.h>
+#include "eswitch.h"
+#include "priv.h"
+
+struct mlx5_sf {
+	struct devlink_port dl_port;
+	unsigned int port_index;
+	u16 id;
+};
+
+struct mlx5_sf_table {
+	struct mlx5_core_dev *dev; /* To refer from notifier context. */
+	struct xarray port_indices; /* port index based lookup. */
+	refcount_t refcount;
+	struct completion disable_complete;
+	struct notifier_block esw_nb;
+};
+
+static struct mlx5_sf *
+mlx5_sf_lookup_by_index(struct mlx5_sf_table *table, unsigned int port_index)
+{
+	return xa_load(&table->port_indices, port_index);
+}
+
+static int mlx5_sf_id_insert(struct mlx5_sf_table *table, struct mlx5_sf *sf)
+{
+	return xa_insert(&table->port_indices, sf->port_index, sf, GFP_KERNEL);
+}
+
+static void mlx5_sf_id_erase(struct mlx5_sf_table *table, struct mlx5_sf *sf)
+{
+	xa_erase(&table->port_indices, sf->port_index);
+}
+
+static struct mlx5_sf *
+mlx5_sf_alloc(struct mlx5_sf_table *table, u32 sfnum, struct netlink_ext_ack *extack)
+{
+	unsigned int dl_port_index;
+	struct mlx5_sf *sf;
+	u16 hw_fn_id;
+	int id_err;
+	int err;
+
+	id_err = mlx5_sf_hw_table_sf_alloc(table->dev, sfnum);
+	if (id_err < 0) {
+		err = id_err;
+		goto id_err;
+	}
+
+	sf = kzalloc(sizeof(*sf), GFP_KERNEL);
+	if (!sf) {
+		err = -ENOMEM;
+		goto alloc_err;
+	}
+	sf->id = id_err;
+	hw_fn_id = mlx5_sf_sw_to_hw_id(table->dev, sf->id);
+	dl_port_index = mlx5_esw_vport_to_devlink_port_index(table->dev, hw_fn_id);
+	sf->port_index = dl_port_index;
+
+	err = mlx5_sf_id_insert(table, sf);
+	if (err)
+		goto insert_err;
+
+	return sf;
+
+insert_err:
+	kfree(sf);
+alloc_err:
+	mlx5_sf_hw_table_sf_free(table->dev, id_err);
+id_err:
+	if (err == -EEXIST)
+		NL_SET_ERR_MSG_MOD(extack, "SF already exist. Choose different sfnum");
+	return ERR_PTR(err);
+}
+
+static void mlx5_sf_free(struct mlx5_sf_table *table, struct mlx5_sf *sf)
+{
+	mlx5_sf_id_erase(table, sf);
+	mlx5_sf_hw_table_sf_free(table->dev, sf->id);
+	kfree(sf);
+}
+
+static struct mlx5_sf_table *mlx5_sf_table_try_get(struct mlx5_core_dev *dev)
+{
+	struct mlx5_sf_table *table = dev->priv.sf_table;
+
+	if (!table)
+		return NULL;
+
+	return refcount_inc_not_zero(&table->refcount) ? table : NULL;
+}
+
+static void mlx5_sf_table_put(struct mlx5_sf_table *table)
+{
+	if (refcount_dec_and_test(&table->refcount))
+		complete(&table->disable_complete);
+}
+
+static int mlx5_sf_add(struct mlx5_core_dev *dev, struct mlx5_sf_table *table,
+		       const struct devlink_port_new_attrs *new_attr,
+		       struct netlink_ext_ack *extack)
+{
+	struct mlx5_eswitch *esw = dev->priv.eswitch;
+	struct mlx5_sf *sf;
+	u16 hw_fn_id;
+	int err;
+
+	sf = mlx5_sf_alloc(table, new_attr->sfnum, extack);
+	if (IS_ERR(sf))
+		return PTR_ERR(sf);
+
+	hw_fn_id = mlx5_sf_sw_to_hw_id(dev, sf->id);
+	err = mlx5_esw_offloads_sf_vport_enable(esw, &sf->dl_port, hw_fn_id, new_attr->sfnum);
+	if (err)
+		goto esw_err;
+	return 0;
+
+esw_err:
+	mlx5_sf_free(table, sf);
+	return err;
+}
+
+static void mlx5_sf_del(struct mlx5_core_dev *dev, struct mlx5_sf_table *table, struct mlx5_sf *sf)
+{
+	struct mlx5_eswitch *esw = dev->priv.eswitch;
+	u16 hw_fn_id;
+
+	hw_fn_id = mlx5_sf_sw_to_hw_id(dev, sf->id);
+	mlx5_esw_offloads_sf_vport_disable(esw, hw_fn_id);
+	mlx5_sf_free(table, sf);
+}
+
+static int
+mlx5_sf_new_check_attr(struct mlx5_core_dev *dev, const struct devlink_port_new_attrs *new_attr,
+		       struct netlink_ext_ack *extack)
+{
+	if (new_attr->flavour != DEVLINK_PORT_FLAVOUR_PCI_SF) {
+		NL_SET_ERR_MSG_MOD(extack, "Driver supports only SF port addition");
+		return -EOPNOTSUPP;
+	}
+	if (new_attr->port_index_valid) {
+		NL_SET_ERR_MSG_MOD(extack,
+				   "Driver does not support user defined port index assignment");
+		return -EOPNOTSUPP;
+	}
+	if (!new_attr->sfnum_valid) {
+		NL_SET_ERR_MSG_MOD(extack,
+				   "User must provide unique sfnum. Driver does not support auto assignment");
+		return -EOPNOTSUPP;
+	}
+	if (new_attr->controller_valid && new_attr->controller) {
+		NL_SET_ERR_MSG_MOD(extack, "External controller is unsupported");
+		return -EOPNOTSUPP;
+	}
+	if (new_attr->pfnum != PCI_FUNC(dev->pdev->devfn)) {
+		NL_SET_ERR_MSG_MOD(extack, "Invalid pfnum supplied");
+		return -EOPNOTSUPP;
+	}
+	return 0;
+}
+
+int mlx5_devlink_sf_port_new(struct devlink *devlink, const struct devlink_port_new_attrs *new_attr,
+			     struct netlink_ext_ack *extack)
+{
+	struct mlx5_core_dev *dev = devlink_priv(devlink);
+	struct mlx5_sf_table *table;
+	int err;
+
+	err = mlx5_sf_new_check_attr(dev, new_attr, extack);
+	if (err)
+		return err;
+
+	table = mlx5_sf_table_try_get(dev);
+	if (!table) {
+		NL_SET_ERR_MSG_MOD(extack,
+				   "Port add is only supported in eswitch switchdev mode or SF ports are disabled.");
+		return -EOPNOTSUPP;
+	}
+	err = mlx5_sf_add(dev, table, new_attr, extack);
+	mlx5_sf_table_put(table);
+	return err;
+}
+
+int mlx5_devlink_sf_port_del(struct devlink *devlink, unsigned int port_index,
+			     struct netlink_ext_ack *extack)
+{
+	struct mlx5_core_dev *dev = devlink_priv(devlink);
+	struct mlx5_sf_table *table;
+	struct mlx5_sf *sf;
+	int err = 0;
+
+	table = mlx5_sf_table_try_get(dev);
+	if (!table) {
+		NL_SET_ERR_MSG_MOD(extack,
+				   "Port del is only supported in eswitch switchdev mode or SF ports are disabled.");
+		return -EOPNOTSUPP;
+	}
+	sf = mlx5_sf_lookup_by_index(table, port_index);
+	if (!sf) {
+		err = -ENODEV;
+		goto sf_err;
+	}
+
+	mlx5_sf_del(dev, table, sf);
+sf_err:
+	mlx5_sf_table_put(table);
+	return err;
+}
+
+static void mlx5_sf_destroy_all(struct mlx5_sf_table *table)
+{
+	struct mlx5_core_dev *dev = table->dev;
+	unsigned long index;
+	struct mlx5_sf *sf;
+
+	xa_for_each(&table->port_indices, index, sf)
+		mlx5_sf_del(dev, table, sf);
+}
+
+static void mlx5_sf_table_enable(struct mlx5_sf_table *table)
+{
+	if (!mlx5_sf_max_functions(table->dev))
+		return;
+
+	init_completion(&table->disable_complete);
+	refcount_set(&table->refcount, 1);
+}
+
+static void mlx5_sf_table_disable(struct mlx5_sf_table *table)
+{
+	if (!mlx5_sf_max_functions(table->dev))
+		return;
+
+	if (!refcount_read(&table->refcount))
+		return;
+
+	/* Balances with refcount_set; drop the reference so that new user cmd cannot start. */
+	mlx5_sf_table_put(table);
+	wait_for_completion(&table->disable_complete);
+
+	/* At this point, no new user commands can start.
+	 * It is safe to destroy all user created SFs.
+	 */
+	mlx5_sf_destroy_all(table);
+}
+
+static int mlx5_sf_esw_event(struct notifier_block *nb, unsigned long event, void *data)
+{
+	struct mlx5_sf_table *table = container_of(nb, struct mlx5_sf_table, esw_nb);
+	const struct mlx5_esw_event_info *mode = data;
+
+	switch (mode->new_mode) {
+	case MLX5_ESWITCH_OFFLOADS:
+		mlx5_sf_table_enable(table);
+		break;
+	case MLX5_ESWITCH_NONE:
+		mlx5_sf_table_disable(table);
+		break;
+	default:
+		break;
+	};
+
+	return 0;
+}
+
+static bool mlx5_sf_table_supported(const struct mlx5_core_dev *dev)
+{
+	return dev->priv.eswitch && MLX5_ESWITCH_MANAGER(dev) && mlx5_sf_supported(dev);
+}
+
+int mlx5_sf_table_init(struct mlx5_core_dev *dev)
+{
+	struct mlx5_sf_table *table;
+	int err;
+
+	if (!mlx5_sf_table_supported(dev))
+		return 0;
+
+	table = kzalloc(sizeof(*table), GFP_KERNEL);
+	if (!table)
+		return -ENOMEM;
+
+	table->dev = dev;
+	xa_init(&table->port_indices);
+	dev->priv.sf_table = table;
+	table->esw_nb.notifier_call = mlx5_sf_esw_event;
+	err = mlx5_esw_event_notifier_register(dev->priv.eswitch, &table->esw_nb);
+	if (err)
+		goto reg_err;
+	return 0;
+
+reg_err:
+	kfree(table);
+	dev->priv.sf_table = NULL;
+	return err;
+}
+
+void mlx5_sf_table_cleanup(struct mlx5_core_dev *dev)
+{
+	struct mlx5_sf_table *table = dev->priv.sf_table;
+
+	if (!table)
+		return;
+
+	mlx5_esw_event_notifier_unregister(dev->priv.eswitch, &table->esw_nb);
+	WARN_ON(refcount_read(&table->refcount));
+	WARN_ON(!xa_empty(&table->port_indices));
+	kfree(table);
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/hw_table.c b/drivers/net/ethernet/mellanox/mlx5/core/sf/hw_table.c
new file mode 100644
index 000000000000..c7757f399e8a
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/hw_table.c
@@ -0,0 +1,125 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/* Copyright (c) 2020 Mellanox Technologies Ltd */
+#include <linux/mlx5/driver.h>
+#include "vhca_event.h"
+#include "priv.h"
+#include "sf.h"
+#include "ecpf.h"
+
+struct mlx5_sf_hw {
+	u32 usr_sfnum;
+	u8 allocated: 1;
+};
+
+struct mlx5_sf_hw_table {
+	struct mlx5_core_dev *dev;
+	struct mlx5_sf_hw *sfs;
+	int max_local_functions;
+	u8 ecpu: 1;
+};
+
+u16 mlx5_sf_sw_to_hw_id(const struct mlx5_core_dev *dev, u16 sw_id)
+{
+	return sw_id + mlx5_sf_start_function_id(dev);
+}
+
+int mlx5_sf_hw_table_sf_alloc(struct mlx5_core_dev *dev, u32 usr_sfnum)
+{
+	struct mlx5_sf_hw_table *table = dev->priv.sf_hw_table;
+	int sw_id = -ENOSPC;
+	u16 hw_fn_id;
+	int err;
+	int i;
+
+	if (!table->max_local_functions)
+		return -EOPNOTSUPP;
+
+	/* Check if sf with same sfnum already exists or not. */
+	for (i = 0; i < table->max_local_functions; i++) {
+		if (table->sfs[i].allocated && table->sfs[i].usr_sfnum == usr_sfnum)
+			return -EEXIST;
+	}
+
+	/* Find the free entry and allocate the entry from the array */
+	for (i = 0; i < table->max_local_functions; i++) {
+		if (!table->sfs[i].allocated) {
+			table->sfs[i].usr_sfnum = usr_sfnum;
+			table->sfs[i].allocated = true;
+			sw_id = i;
+			break;
+		}
+	}
+	if (sw_id == -ENOSPC) {
+		err = -ENOSPC;
+		goto err;
+	}
+
+	hw_fn_id = mlx5_sf_sw_to_hw_id(table->dev, sw_id);
+	err = mlx5_cmd_alloc_sf(table->dev, hw_fn_id);
+	if (err)
+		goto err;
+
+	err = mlx5_modify_vhca_sw_id(dev, hw_fn_id, table->ecpu, usr_sfnum);
+	if (err)
+		goto vhca_err;
+
+	return sw_id;
+
+vhca_err:
+	mlx5_cmd_dealloc_sf(table->dev, hw_fn_id);
+err:
+	table->sfs[i].allocated = false;
+	return err;
+}
+
+void mlx5_sf_hw_table_sf_free(struct mlx5_core_dev *dev, u16 id)
+{
+	struct mlx5_sf_hw_table *table = dev->priv.sf_hw_table;
+	u16 hw_fn_id;
+
+	hw_fn_id = mlx5_sf_sw_to_hw_id(table->dev, id);
+	mlx5_cmd_dealloc_sf(table->dev, hw_fn_id);
+	table->sfs[id].allocated = false;
+}
+
+int mlx5_sf_hw_table_init(struct mlx5_core_dev *dev)
+{
+	struct mlx5_sf_hw_table *table;
+	struct mlx5_sf_hw *sfs;
+	int max_functions;
+
+	if (!mlx5_sf_supported(dev))
+		return 0;
+
+	max_functions = mlx5_sf_max_functions(dev);
+	table = kzalloc(sizeof(*table), GFP_KERNEL);
+	if (!table)
+		return -ENOMEM;
+
+	sfs = kcalloc(max_functions, sizeof(*sfs), GFP_KERNEL);
+	if (!sfs)
+		goto table_err;
+
+	table->dev = dev;
+	table->sfs = sfs;
+	table->max_local_functions = max_functions;
+	table->ecpu = mlx5_read_embedded_cpu(dev);
+	dev->priv.sf_hw_table = table;
+	mlx5_core_dbg(dev, "SF HW table: max sfs = %d\n", max_functions);
+	return 0;
+
+table_err:
+	kfree(table);
+	return -ENOMEM;
+}
+
+void mlx5_sf_hw_table_cleanup(struct mlx5_core_dev *dev)
+{
+	struct mlx5_sf_hw_table *table = dev->priv.sf_hw_table;
+
+	if (!table)
+		return;
+
+	kfree(table->sfs);
+	kfree(table);
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/priv.h b/drivers/net/ethernet/mellanox/mlx5/core/sf/priv.h
new file mode 100644
index 000000000000..7f3622375a9c
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/priv.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/* Copyright (c) 2020 Mellanox Technologies Ltd */
+
+#ifndef __MLX5_SF_PRIV_H__
+#define __MLX5_SF_PRIV_H__
+
+#include <linux/mlx5/driver.h>
+
+int mlx5_cmd_alloc_sf(struct mlx5_core_dev *dev, u16 function_id);
+int mlx5_cmd_dealloc_sf(struct mlx5_core_dev *dev, u16 function_id);
+
+u16 mlx5_sf_sw_to_hw_id(const struct mlx5_core_dev *dev, u16 sw_id);
+
+int mlx5_sf_hw_table_sf_alloc(struct mlx5_core_dev *dev, u32 usr_sfnum);
+void mlx5_sf_hw_table_sf_free(struct mlx5_core_dev *dev, u16 id);
+
+#endif
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h b/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h
index 623191679b49..dd23b6c2d887 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h
@@ -28,6 +28,16 @@ static inline u16 mlx5_sf_max_functions(const struct mlx5_core_dev *dev)
 		return 1 << MLX5_CAP_GEN(dev, log_max_sf);
 }
 
+int mlx5_sf_hw_table_init(struct mlx5_core_dev *dev);
+void mlx5_sf_hw_table_cleanup(struct mlx5_core_dev *dev);
+
+int mlx5_sf_table_init(struct mlx5_core_dev *dev);
+void mlx5_sf_table_cleanup(struct mlx5_core_dev *dev);
+
+int mlx5_devlink_sf_port_new(struct devlink *devlink, const struct devlink_port_new_attrs *add_attr,
+			     struct netlink_ext_ack *extack);
+int mlx5_devlink_sf_port_del(struct devlink *devlink, unsigned int port_index,
+			     struct netlink_ext_ack *extack);
 #else
 
 static inline bool mlx5_sf_supported(const struct mlx5_core_dev *dev)
@@ -40,6 +50,24 @@ static inline u16 mlx5_sf_max_functions(const struct mlx5_core_dev *dev)
 	return 0;
 }
 
+static inline int mlx5_sf_hw_table_init(struct mlx5_core_dev *dev)
+{
+	return 0;
+}
+
+static inline void mlx5_sf_hw_table_cleanup(struct mlx5_core_dev *dev)
+{
+}
+
+static inline int mlx5_sf_table_init(struct mlx5_core_dev *dev)
+{
+	return 0;
+}
+
+static inline void mlx5_sf_table_cleanup(struct mlx5_core_dev *dev)
+{
+}
+
 #endif
 
 #endif
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 48e3638b1185..7e357c7f0d5e 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -510,6 +510,8 @@ struct mlx5_eq_table;
 struct mlx5_irq_table;
 struct mlx5_vhca_state_notifier;
 struct mlx5_sf_dev_table;
+struct mlx5_sf_hw_table;
+struct mlx5_sf_table;
 
 struct mlx5_rate_limit {
 	u32			rate;
@@ -611,6 +613,10 @@ struct mlx5_priv {
 	struct mlx5_sf_dev_table *sf_dev_table;
 	struct mlx5_core_dev *parent_mdev;
 #endif
+#ifdef CONFIG_MLX5_SF_MANAGER
+	struct mlx5_sf_hw_table *sf_hw_table;
+	struct mlx5_sf_table *sf_table;
+#endif
 };
 
 enum mlx5_device_state {
-- 
2.26.2


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [net-next v5 12/15] net/mlx5: SF, Port function state change support
  2020-12-15  9:03 [net-next v5 00/15] Add mlx5 subfunction support Saeed Mahameed
                   ` (10 preceding siblings ...)
  2020-12-15  9:03 ` [net-next v5 11/15] net/mlx5: SF, Add port add delete functionality Saeed Mahameed
@ 2020-12-15  9:03 ` Saeed Mahameed
  2020-12-15  9:03 ` [net-next v5 13/15] devlink: Add devlink port documentation Saeed Mahameed
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 46+ messages in thread
From: Saeed Mahameed @ 2020-12-15  9:03 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Jason Gunthorpe
  Cc: Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Parav Pandit, Vu Pham, Saeed Mahameed

From: Parav Pandit <parav@nvidia.com>

Support changing the state of the SF port's function through devlink.
When activating the SF port's function, enable the hca in the device
followed by adding its auxiliary device.
When deactivating the SF port's function, delete its auxiliary device
followed by disabling the vHCA.

Port function attributes get/set callbacks are invoked with devlink
instance lock held. Such callbacks need to synchronize with sf port
table getting disabled either via sriov sysfs callback. Such callbacks
synchronize with table disable context holding table refcount.

$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev

$ devlink port show
pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 splittable false

$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88

$ devlink port show ens2f0npf0sf88
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
  function:
    hw_addr 00:00:00:00:88:88 state inactive opstate detached

$ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88 state active

$ devlink port show ens2f0npf0sf88 -jp
{
    "port": {
        "pci/0000:06:00.0/32768": {
            "type": "eth",
            "netdev": "ens2f0npf0sf88",
            "flavour": "pcisf",
            "controller": 0,
            "pfnum": 0,
            "sfnum": 88,
            "external": false,
            "splittable": false,
            "function": {
                "hw_addr": "00:00:00:00:88:88",
                "state": "active",
                "opstate": "attached"
            }
        }
    }
}

On port function activation, an auxiliary device is created in below
example.

$ devlink dev show
devlink dev show auxiliary/mlx5_core.sf.4

$ devlink port show auxiliary/mlx5_core.sf.4/1
auxiliary/mlx5_core.sf.4/1: type eth netdev p0sf88 flavour virtual port 0 splittable false

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/devlink.c |   2 +
 .../net/ethernet/mellanox/mlx5/core/main.c    |  10 +
 .../net/ethernet/mellanox/mlx5/core/sf/cmd.c  |  22 ++
 .../ethernet/mellanox/mlx5/core/sf/devlink.c  | 284 ++++++++++++++++--
 .../ethernet/mellanox/mlx5/core/sf/hw_table.c | 116 ++++++-
 .../net/ethernet/mellanox/mlx5/core/sf/priv.h |   4 +
 .../net/ethernet/mellanox/mlx5/core/sf/sf.h   |  19 ++
 7 files changed, 431 insertions(+), 26 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
index d4c0cdf5edd9..75d950d95fcf 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
@@ -195,6 +195,8 @@ static const struct devlink_ops mlx5_devlink_ops = {
 #ifdef CONFIG_MLX5_SF_MANAGER
 	.port_new = mlx5_devlink_sf_port_new,
 	.port_del = mlx5_devlink_sf_port_del,
+	.port_function_state_get = mlx5_devlink_sf_port_fn_state_get,
+	.port_function_state_set = mlx5_devlink_sf_port_fn_state_set,
 #endif
 	.flash_update = mlx5_devlink_flash_update,
 	.info_get = mlx5_devlink_info_get,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 435323088ce0..f6b885fdd5c8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -75,6 +75,7 @@
 #include "diag/rsc_dump.h"
 #include "sf/vhca_event.h"
 #include "sf/dev/dev.h"
+#include "sf/sf.h"
 
 MODULE_AUTHOR("Eli Cohen <eli@mellanox.com>");
 MODULE_DESCRIPTION("Mellanox 5th generation network adapters (ConnectX series) core driver");
@@ -1161,6 +1162,12 @@ static int mlx5_load(struct mlx5_core_dev *dev)
 
 	mlx5_vhca_event_start(dev);
 
+	err = mlx5_sf_hw_table_create(dev);
+	if (err) {
+		mlx5_core_err(dev, "sf table create failed %d\n", err);
+		goto err_vhca;
+	}
+
 	err = mlx5_ec_init(dev);
 	if (err) {
 		mlx5_core_err(dev, "Failed to init embedded CPU\n");
@@ -1180,6 +1187,8 @@ static int mlx5_load(struct mlx5_core_dev *dev)
 err_sriov:
 	mlx5_ec_cleanup(dev);
 err_ec:
+	mlx5_sf_hw_table_destroy(dev);
+err_vhca:
 	mlx5_vhca_event_stop(dev);
 	mlx5_cleanup_fs(dev);
 err_fs:
@@ -1209,6 +1218,7 @@ static void mlx5_unload(struct mlx5_core_dev *dev)
 	mlx5_sf_dev_table_destroy(dev);
 	mlx5_sriov_detach(dev);
 	mlx5_ec_cleanup(dev);
+	mlx5_sf_hw_table_destroy(dev);
 	mlx5_vhca_event_stop(dev);
 	mlx5_cleanup_fs(dev);
 	mlx5_accel_ipsec_cleanup(dev);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/sf/cmd.c
index 0bc3075f34fa..a8d75c2f0275 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sf/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/cmd.c
@@ -25,3 +25,25 @@ int mlx5_cmd_dealloc_sf(struct mlx5_core_dev *dev, u16 function_id)
 
 	return mlx5_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
 }
+
+int mlx5_cmd_sf_enable_hca(struct mlx5_core_dev *dev, u16 func_id)
+{
+	u32 out[MLX5_ST_SZ_DW(enable_hca_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(enable_hca_in)] = {};
+
+	MLX5_SET(enable_hca_in, in, opcode, MLX5_CMD_OP_ENABLE_HCA);
+	MLX5_SET(enable_hca_in, in, function_id, func_id);
+	MLX5_SET(enable_hca_in, in, embedded_cpu_function, 0);
+	return mlx5_cmd_exec(dev, &in, sizeof(in), &out, sizeof(out));
+}
+
+int mlx5_cmd_sf_disable_hca(struct mlx5_core_dev *dev, u16 func_id)
+{
+	u32 out[MLX5_ST_SZ_DW(disable_hca_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(disable_hca_in)] = {};
+
+	MLX5_SET(disable_hca_in, in, opcode, MLX5_CMD_OP_DISABLE_HCA);
+	MLX5_SET(disable_hca_in, in, function_id, func_id);
+	MLX5_SET(enable_hca_in, in, embedded_cpu_function, 0);
+	return mlx5_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c b/drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c
index 09365f36a513..eb5c536ff1d5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c
@@ -4,11 +4,17 @@
 #include <linux/mlx5/driver.h>
 #include "eswitch.h"
 #include "priv.h"
+#include "sf/dev/dev.h"
+#include "mlx5_ifc_vhca_event.h"
+#include "vhca_event.h"
+#include "ecpf.h"
 
 struct mlx5_sf {
 	struct devlink_port dl_port;
 	unsigned int port_index;
 	u16 id;
+	u16 hw_fn_id;
+	u16 hw_state;
 };
 
 struct mlx5_sf_table {
@@ -16,7 +22,10 @@ struct mlx5_sf_table {
 	struct xarray port_indices; /* port index based lookup. */
 	refcount_t refcount;
 	struct completion disable_complete;
+	struct mutex sf_state_lock; /* Serializes sf state among user cmds & vhca event handler. */
 	struct notifier_block esw_nb;
+	struct notifier_block vhca_nb;
+	u8 ecpu: 1;
 };
 
 static struct mlx5_sf *
@@ -25,6 +34,19 @@ mlx5_sf_lookup_by_index(struct mlx5_sf_table *table, unsigned int port_index)
 	return xa_load(&table->port_indices, port_index);
 }
 
+static struct mlx5_sf *
+mlx5_sf_lookup_by_function_id(struct mlx5_sf_table *table, unsigned int fn_id)
+{
+	unsigned long index;
+	struct mlx5_sf *sf;
+
+	xa_for_each(&table->port_indices, index, sf) {
+		if (sf->hw_fn_id == fn_id)
+			return sf;
+	}
+	return NULL;
+}
+
 static int mlx5_sf_id_insert(struct mlx5_sf_table *table, struct mlx5_sf *sf)
 {
 	return xa_insert(&table->port_indices, sf->port_index, sf, GFP_KERNEL);
@@ -59,6 +81,8 @@ mlx5_sf_alloc(struct mlx5_sf_table *table, u32 sfnum, struct netlink_ext_ack *ex
 	hw_fn_id = mlx5_sf_sw_to_hw_id(table->dev, sf->id);
 	dl_port_index = mlx5_esw_vport_to_devlink_port_index(table->dev, hw_fn_id);
 	sf->port_index = dl_port_index;
+	sf->hw_fn_id = hw_fn_id;
+	sf->hw_state = MLX5_VHCA_STATE_ALLOCATED;
 
 	err = mlx5_sf_id_insert(table, sf);
 	if (err)
@@ -99,6 +123,146 @@ static void mlx5_sf_table_put(struct mlx5_sf_table *table)
 		complete(&table->disable_complete);
 }
 
+static enum devlink_port_function_state mlx5_sf_to_devlink_state(u8 hw_state)
+{
+	switch (hw_state) {
+	case MLX5_VHCA_STATE_ACTIVE:
+	case MLX5_VHCA_STATE_IN_USE:
+	case MLX5_VHCA_STATE_TEARDOWN_REQUEST:
+		return DEVLINK_PORT_FUNCTION_STATE_ACTIVE;
+	case MLX5_VHCA_STATE_INVALID:
+	case MLX5_VHCA_STATE_ALLOCATED:
+	default:
+		return DEVLINK_PORT_FUNCTION_STATE_INACTIVE;
+	}
+}
+
+static enum devlink_port_function_opstate mlx5_sf_to_devlink_opstate(u8 hw_state)
+{
+	switch (hw_state) {
+	case MLX5_VHCA_STATE_IN_USE:
+	case MLX5_VHCA_STATE_TEARDOWN_REQUEST:
+		return DEVLINK_PORT_FUNCTION_OPSTATE_ATTACHED;
+	case MLX5_VHCA_STATE_INVALID:
+	case MLX5_VHCA_STATE_ALLOCATED:
+	case MLX5_VHCA_STATE_ACTIVE:
+	default:
+		return DEVLINK_PORT_FUNCTION_OPSTATE_DETACHED;
+	}
+}
+
+static bool mlx5_sf_is_active(const struct mlx5_sf *sf)
+{
+	return sf->hw_state == MLX5_VHCA_STATE_ACTIVE || sf->hw_state == MLX5_VHCA_STATE_IN_USE;
+}
+
+int mlx5_devlink_sf_port_fn_state_get(struct devlink *devlink, struct devlink_port *dl_port,
+				      enum devlink_port_function_state *state,
+				      enum devlink_port_function_opstate *opstate,
+				      struct netlink_ext_ack *extack)
+{
+	struct mlx5_core_dev *dev = devlink_priv(devlink);
+	struct mlx5_sf_table *table;
+	struct mlx5_sf *sf;
+	int err = 0;
+
+	table = mlx5_sf_table_try_get(dev);
+	if (!table)
+		return -EOPNOTSUPP;
+
+	sf = mlx5_sf_lookup_by_index(table, dl_port->index);
+	if (!sf) {
+		err = -EOPNOTSUPP;
+		goto sf_err;
+	}
+	mutex_lock(&table->sf_state_lock);
+	*state = mlx5_sf_to_devlink_state(sf->hw_state);
+	*opstate = mlx5_sf_to_devlink_opstate(sf->hw_state);
+	mutex_unlock(&table->sf_state_lock);
+sf_err:
+	mlx5_sf_table_put(table);
+	return err;
+}
+
+static int mlx5_sf_activate(struct mlx5_core_dev *dev, struct mlx5_sf *sf)
+{
+	int err;
+
+	if (mlx5_sf_is_active(sf))
+		return 0;
+	if (sf->hw_state != MLX5_VHCA_STATE_ALLOCATED)
+		return -EINVAL;
+
+	err = mlx5_cmd_sf_enable_hca(dev, sf->hw_fn_id);
+	if (err)
+		return err;
+
+	sf->hw_state = MLX5_VHCA_STATE_ACTIVE;
+	return 0;
+}
+
+static int mlx5_sf_deactivate(struct mlx5_core_dev *dev, struct mlx5_sf *sf)
+{
+	int err;
+
+	if (!mlx5_sf_is_active(sf))
+		return 0;
+
+	err = mlx5_cmd_sf_disable_hca(dev, sf->hw_fn_id);
+	if (err)
+		return err;
+
+	sf->hw_state = MLX5_VHCA_STATE_TEARDOWN_REQUEST;
+	return 0;
+}
+
+static int mlx5_sf_state_set(struct mlx5_core_dev *dev, struct mlx5_sf_table *table,
+			     struct mlx5_sf *sf,
+			     enum devlink_port_function_state state)
+{
+	int err = 0;
+
+	mutex_lock(&table->sf_state_lock);
+	if (state == mlx5_sf_to_devlink_state(sf->hw_state))
+		goto out;
+	if (state == DEVLINK_PORT_FUNCTION_STATE_ACTIVE)
+		err = mlx5_sf_activate(dev, sf);
+	else if (state == DEVLINK_PORT_FUNCTION_STATE_INACTIVE)
+		err = mlx5_sf_deactivate(dev, sf);
+	else
+		err = -EINVAL;
+out:
+	mutex_unlock(&table->sf_state_lock);
+	return err;
+}
+
+int mlx5_devlink_sf_port_fn_state_set(struct devlink *devlink, struct devlink_port *dl_port,
+				      enum devlink_port_function_state state,
+				      struct netlink_ext_ack *extack)
+{
+	struct mlx5_core_dev *dev = devlink_priv(devlink);
+	struct mlx5_sf_table *table;
+	struct mlx5_sf *sf;
+	int err;
+
+	table = mlx5_sf_table_try_get(dev);
+	if (!table) {
+		NL_SET_ERR_MSG_MOD(extack,
+				   "Port state set is only supported in eswitch switchdev mode or SF ports are disabled.");
+		return -EOPNOTSUPP;
+	}
+	sf = mlx5_sf_lookup_by_index(table, dl_port->index);
+	if (!sf) {
+		err = -ENODEV;
+		goto out;
+	}
+
+	err = mlx5_sf_state_set(dev, table, sf, state);
+out:
+	mlx5_sf_table_put(table);
+	return err;
+}
+
 static int mlx5_sf_add(struct mlx5_core_dev *dev, struct mlx5_sf_table *table,
 		       const struct devlink_port_new_attrs *new_attr,
 		       struct netlink_ext_ack *extack)
@@ -123,16 +287,6 @@ static int mlx5_sf_add(struct mlx5_core_dev *dev, struct mlx5_sf_table *table,
 	return err;
 }
 
-static void mlx5_sf_del(struct mlx5_core_dev *dev, struct mlx5_sf_table *table, struct mlx5_sf *sf)
-{
-	struct mlx5_eswitch *esw = dev->priv.eswitch;
-	u16 hw_fn_id;
-
-	hw_fn_id = mlx5_sf_sw_to_hw_id(dev, sf->id);
-	mlx5_esw_offloads_sf_vport_disable(esw, hw_fn_id);
-	mlx5_sf_free(table, sf);
-}
-
 static int
 mlx5_sf_new_check_attr(struct mlx5_core_dev *dev, const struct devlink_port_new_attrs *new_attr,
 		       struct netlink_ext_ack *extack)
@@ -184,10 +338,30 @@ int mlx5_devlink_sf_port_new(struct devlink *devlink, const struct devlink_port_
 	return err;
 }
 
+static void mlx5_sf_dealloc(struct mlx5_sf_table *table, struct mlx5_sf *sf)
+{
+	if (sf->hw_state == MLX5_VHCA_STATE_ALLOCATED) {
+		mlx5_sf_free(table, sf);
+	} else if (mlx5_sf_is_active(sf)) {
+		/* Even if its active, it is treated as in_use because by the time,
+		 * it is disabled here, it may getting used. So it is safe to
+		 * always look for the event to ensure that it is recycled only after
+		 * firmware gives confirmation that it is detached by the driver.
+		 */
+		mlx5_cmd_sf_disable_hca(table->dev, sf->hw_fn_id);
+		mlx5_sf_hw_table_sf_deferred_free(table->dev, sf->id);
+		kfree(sf);
+	} else {
+		mlx5_sf_hw_table_sf_deferred_free(table->dev, sf->id);
+		kfree(sf);
+	}
+}
+
 int mlx5_devlink_sf_port_del(struct devlink *devlink, unsigned int port_index,
 			     struct netlink_ext_ack *extack)
 {
 	struct mlx5_core_dev *dev = devlink_priv(devlink);
+	struct mlx5_eswitch *esw = dev->priv.eswitch;
 	struct mlx5_sf_table *table;
 	struct mlx5_sf *sf;
 	int err = 0;
@@ -204,20 +378,58 @@ int mlx5_devlink_sf_port_del(struct devlink *devlink, unsigned int port_index,
 		goto sf_err;
 	}
 
-	mlx5_sf_del(dev, table, sf);
+	mlx5_esw_offloads_sf_vport_disable(esw, sf->hw_fn_id);
+	mlx5_sf_id_erase(table, sf);
+
+	mutex_lock(&table->sf_state_lock);
+	mlx5_sf_dealloc(table, sf);
+	mutex_unlock(&table->sf_state_lock);
 sf_err:
 	mlx5_sf_table_put(table);
 	return err;
 }
 
-static void mlx5_sf_destroy_all(struct mlx5_sf_table *table)
+static bool mlx5_sf_state_update_check(const struct mlx5_sf *sf, u8 new_state)
 {
-	struct mlx5_core_dev *dev = table->dev;
-	unsigned long index;
+	if (sf->hw_state == MLX5_VHCA_STATE_ACTIVE && new_state == MLX5_VHCA_STATE_IN_USE)
+		return true;
+
+	if (sf->hw_state == MLX5_VHCA_STATE_IN_USE && new_state == MLX5_VHCA_STATE_ACTIVE)
+		return true;
+
+	if (sf->hw_state == MLX5_VHCA_STATE_TEARDOWN_REQUEST &&
+	    new_state == MLX5_VHCA_STATE_ALLOCATED)
+		return true;
+
+	return false;
+}
+
+static int mlx5_sf_vhca_event(struct notifier_block *nb, unsigned long opcode, void *data)
+{
+	struct mlx5_sf_table *table = container_of(nb, struct mlx5_sf_table, vhca_nb);
+	const struct mlx5_vhca_state_event *event = data;
+	bool update = false;
 	struct mlx5_sf *sf;
 
-	xa_for_each(&table->port_indices, index, sf)
-		mlx5_sf_del(dev, table, sf);
+	table = mlx5_sf_table_try_get(table->dev);
+	if (!table)
+		return 0;
+
+	mutex_lock(&table->sf_state_lock);
+	sf = mlx5_sf_lookup_by_function_id(table, event->function_id);
+	if (!sf)
+		goto sf_err;
+
+	/* When driver is attached or detached to a function, an event
+	 * notifies such state change.
+	 */
+	update = mlx5_sf_state_update_check(sf, event->new_vhca_state);
+	if (update)
+		sf->hw_state = event->new_vhca_state;
+sf_err:
+	mutex_unlock(&table->sf_state_lock);
+	mlx5_sf_table_put(table);
+	return 0;
 }
 
 static void mlx5_sf_table_enable(struct mlx5_sf_table *table)
@@ -229,6 +441,22 @@ static void mlx5_sf_table_enable(struct mlx5_sf_table *table)
 	refcount_set(&table->refcount, 1);
 }
 
+static void mlx5_sf_deactivate_all(struct mlx5_sf_table *table)
+{
+	struct mlx5_eswitch *esw = table->dev->priv.eswitch;
+	unsigned long index;
+	struct mlx5_sf *sf;
+
+	/* At this point, no new user commands can start and no vhca event can
+	 * arrive. It is safe to destroy all user created SFs.
+	 */
+	xa_for_each(&table->port_indices, index, sf) {
+		mlx5_esw_offloads_sf_vport_disable(esw, sf->hw_fn_id);
+		mlx5_sf_id_erase(table, sf);
+		mlx5_sf_dealloc(table, sf);
+	}
+}
+
 static void mlx5_sf_table_disable(struct mlx5_sf_table *table)
 {
 	if (!mlx5_sf_max_functions(table->dev))
@@ -237,14 +465,13 @@ static void mlx5_sf_table_disable(struct mlx5_sf_table *table)
 	if (!refcount_read(&table->refcount))
 		return;
 
-	/* Balances with refcount_set; drop the reference so that new user cmd cannot start. */
+	/* Balances with refcount_set; drop the reference so that new user cmd cannot start
+	 * and new vhca event handler cannnot run.
+	 */
 	mlx5_sf_table_put(table);
 	wait_for_completion(&table->disable_complete);
 
-	/* At this point, no new user commands can start.
-	 * It is safe to destroy all user created SFs.
-	 */
-	mlx5_sf_destroy_all(table);
+	mlx5_sf_deactivate_all(table);
 }
 
 static int mlx5_sf_esw_event(struct notifier_block *nb, unsigned long event, void *data)
@@ -276,23 +503,34 @@ int mlx5_sf_table_init(struct mlx5_core_dev *dev)
 	struct mlx5_sf_table *table;
 	int err;
 
-	if (!mlx5_sf_table_supported(dev))
+	if (!mlx5_sf_table_supported(dev) || !mlx5_vhca_event_supported(dev))
 		return 0;
 
 	table = kzalloc(sizeof(*table), GFP_KERNEL);
 	if (!table)
 		return -ENOMEM;
 
+	mutex_init(&table->sf_state_lock);
 	table->dev = dev;
 	xa_init(&table->port_indices);
 	dev->priv.sf_table = table;
+	refcount_set(&table->refcount, 0);
 	table->esw_nb.notifier_call = mlx5_sf_esw_event;
 	err = mlx5_esw_event_notifier_register(dev->priv.eswitch, &table->esw_nb);
 	if (err)
 		goto reg_err;
+
+	table->vhca_nb.notifier_call = mlx5_sf_vhca_event;
+	err = mlx5_vhca_event_notifier_register(table->dev, &table->vhca_nb);
+	if (err)
+		goto vhca_err;
+
 	return 0;
 
+vhca_err:
+	mlx5_esw_event_notifier_unregister(dev->priv.eswitch, &table->esw_nb);
 reg_err:
+	mutex_destroy(&table->sf_state_lock);
 	kfree(table);
 	dev->priv.sf_table = NULL;
 	return err;
@@ -305,8 +543,10 @@ void mlx5_sf_table_cleanup(struct mlx5_core_dev *dev)
 	if (!table)
 		return;
 
+	mlx5_vhca_event_notifier_unregister(table->dev, &table->vhca_nb);
 	mlx5_esw_event_notifier_unregister(dev->priv.eswitch, &table->esw_nb);
 	WARN_ON(refcount_read(&table->refcount));
+	mutex_destroy(&table->sf_state_lock);
 	WARN_ON(!xa_empty(&table->port_indices));
 	kfree(table);
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/hw_table.c b/drivers/net/ethernet/mellanox/mlx5/core/sf/hw_table.c
index c7757f399e8a..58b6be0b03d7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sf/hw_table.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/hw_table.c
@@ -4,11 +4,14 @@
 #include "vhca_event.h"
 #include "priv.h"
 #include "sf.h"
+#include "mlx5_ifc_vhca_event.h"
+#include "vhca_event.h"
 #include "ecpf.h"
 
 struct mlx5_sf_hw {
 	u32 usr_sfnum;
 	u8 allocated: 1;
+	u8 pending_delete: 1;
 };
 
 struct mlx5_sf_hw_table {
@@ -16,6 +19,8 @@ struct mlx5_sf_hw_table {
 	struct mlx5_sf_hw *sfs;
 	int max_local_functions;
 	u8 ecpu: 1;
+	struct mutex table_lock; /* Serializes sf deletion and vhca state change handler. */
+	struct notifier_block vhca_nb;
 };
 
 u16 mlx5_sf_sw_to_hw_id(const struct mlx5_core_dev *dev, u16 sw_id)
@@ -23,6 +28,11 @@ u16 mlx5_sf_sw_to_hw_id(const struct mlx5_core_dev *dev, u16 sw_id)
 	return sw_id + mlx5_sf_start_function_id(dev);
 }
 
+static u16 mlx5_sf_hw_to_sw_id(const struct mlx5_core_dev *dev, u16 hw_id)
+{
+	return hw_id - mlx5_sf_start_function_id(dev);
+}
+
 int mlx5_sf_hw_table_sf_alloc(struct mlx5_core_dev *dev, u32 usr_sfnum)
 {
 	struct mlx5_sf_hw_table *table = dev->priv.sf_hw_table;
@@ -34,10 +44,13 @@ int mlx5_sf_hw_table_sf_alloc(struct mlx5_core_dev *dev, u32 usr_sfnum)
 	if (!table->max_local_functions)
 		return -EOPNOTSUPP;
 
+	mutex_lock(&table->table_lock);
 	/* Check if sf with same sfnum already exists or not. */
 	for (i = 0; i < table->max_local_functions; i++) {
-		if (table->sfs[i].allocated && table->sfs[i].usr_sfnum == usr_sfnum)
-			return -EEXIST;
+		if (table->sfs[i].allocated && table->sfs[i].usr_sfnum == usr_sfnum) {
+			err = -EEXIST;
+			goto exist_err;
+		}
 	}
 
 	/* Find the free entry and allocate the entry from the array */
@@ -63,16 +76,19 @@ int mlx5_sf_hw_table_sf_alloc(struct mlx5_core_dev *dev, u32 usr_sfnum)
 	if (err)
 		goto vhca_err;
 
+	mutex_unlock(&table->table_lock);
 	return sw_id;
 
 vhca_err:
 	mlx5_cmd_dealloc_sf(table->dev, hw_fn_id);
 err:
 	table->sfs[i].allocated = false;
+exist_err:
+	mutex_unlock(&table->table_lock);
 	return err;
 }
 
-void mlx5_sf_hw_table_sf_free(struct mlx5_core_dev *dev, u16 id)
+static void _mlx5_sf_hw_id_free(struct mlx5_core_dev *dev, u16 id)
 {
 	struct mlx5_sf_hw_table *table = dev->priv.sf_hw_table;
 	u16 hw_fn_id;
@@ -80,6 +96,50 @@ void mlx5_sf_hw_table_sf_free(struct mlx5_core_dev *dev, u16 id)
 	hw_fn_id = mlx5_sf_sw_to_hw_id(table->dev, id);
 	mlx5_cmd_dealloc_sf(table->dev, hw_fn_id);
 	table->sfs[id].allocated = false;
+	table->sfs[id].pending_delete = false;
+}
+
+void mlx5_sf_hw_table_sf_free(struct mlx5_core_dev *dev, u16 id)
+{
+	struct mlx5_sf_hw_table *table = dev->priv.sf_hw_table;
+
+	mutex_lock(&table->table_lock);
+	_mlx5_sf_hw_id_free(dev, id);
+	mutex_unlock(&table->table_lock);
+}
+
+void mlx5_sf_hw_table_sf_deferred_free(struct mlx5_core_dev *dev, u16 id)
+{
+	struct mlx5_sf_hw_table *table = dev->priv.sf_hw_table;
+	u32 out[MLX5_ST_SZ_DW(query_vhca_state_out)] = {};
+	u16 hw_fn_id;
+	u8 state;
+	int err;
+
+	hw_fn_id = mlx5_sf_sw_to_hw_id(dev, id);
+	mutex_lock(&table->table_lock);
+	err = mlx5_cmd_query_vhca_state(dev, hw_fn_id, table->ecpu, out, sizeof(out));
+	if (err)
+		goto err;
+	state = MLX5_GET(query_vhca_state_out, out, vhca_state_context.vhca_state);
+	if (state == MLX5_VHCA_STATE_ALLOCATED) {
+		mlx5_cmd_dealloc_sf(table->dev, hw_fn_id);
+		table->sfs[id].allocated = false;
+	} else {
+		table->sfs[id].pending_delete = true;
+	}
+err:
+	mutex_unlock(&table->table_lock);
+}
+
+static void mlx5_sf_hw_dealloc_all(struct mlx5_sf_hw_table *table)
+{
+	int i;
+
+	for (i = 0; i < table->max_local_functions; i++) {
+		if (table->sfs[i].allocated)
+			_mlx5_sf_hw_id_free(table->dev, i);
+	}
 }
 
 int mlx5_sf_hw_table_init(struct mlx5_core_dev *dev)
@@ -88,7 +148,7 @@ int mlx5_sf_hw_table_init(struct mlx5_core_dev *dev)
 	struct mlx5_sf_hw *sfs;
 	int max_functions;
 
-	if (!mlx5_sf_supported(dev))
+	if (!mlx5_sf_supported(dev) || !mlx5_vhca_event_supported(dev))
 		return 0;
 
 	max_functions = mlx5_sf_max_functions(dev);
@@ -100,6 +160,7 @@ int mlx5_sf_hw_table_init(struct mlx5_core_dev *dev)
 	if (!sfs)
 		goto table_err;
 
+	mutex_init(&table->table_lock);
 	table->dev = dev;
 	table->sfs = sfs;
 	table->max_local_functions = max_functions;
@@ -120,6 +181,53 @@ void mlx5_sf_hw_table_cleanup(struct mlx5_core_dev *dev)
 	if (!table)
 		return;
 
+	mutex_destroy(&table->table_lock);
 	kfree(table->sfs);
 	kfree(table);
 }
+
+static int mlx5_sf_hw_vhca_event(struct notifier_block *nb, unsigned long opcode, void *data)
+{
+	struct mlx5_sf_hw_table *table = container_of(nb, struct mlx5_sf_hw_table, vhca_nb);
+	const struct mlx5_vhca_state_event *event = data;
+	struct mlx5_sf_hw *sf_hw;
+	u16 sw_id;
+
+	if (event->new_vhca_state != MLX5_VHCA_STATE_ALLOCATED)
+		return 0;
+
+	sw_id = mlx5_sf_hw_to_sw_id(table->dev, event->function_id);
+	sf_hw = &table->sfs[sw_id];
+
+	mutex_lock(&table->table_lock);
+	/* SF driver notified through firmware that SF is finally detached.
+	 * Hence recycle the sf hardware id for reuse.
+	 */
+	if (sf_hw->allocated && sf_hw->pending_delete)
+		_mlx5_sf_hw_id_free(table->dev, sw_id);
+	mutex_unlock(&table->table_lock);
+	return 0;
+}
+
+int mlx5_sf_hw_table_create(struct mlx5_core_dev *dev)
+{
+	struct mlx5_sf_hw_table *table = dev->priv.sf_hw_table;
+
+	if (!table)
+		return 0;
+
+	table->vhca_nb.notifier_call = mlx5_sf_hw_vhca_event;
+	return mlx5_vhca_event_notifier_register(table->dev, &table->vhca_nb);
+}
+
+void mlx5_sf_hw_table_destroy(struct mlx5_core_dev *dev)
+{
+	struct mlx5_sf_hw_table *table = dev->priv.sf_hw_table;
+
+	if (!table)
+		return;
+
+	mlx5_vhca_event_notifier_unregister(table->dev, &table->vhca_nb);
+	/* Dealloc SFs whose firmware event has been missed. */
+	mlx5_sf_hw_dealloc_all(table);
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/priv.h b/drivers/net/ethernet/mellanox/mlx5/core/sf/priv.h
index 7f3622375a9c..cb02a51d0986 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sf/priv.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/priv.h
@@ -9,9 +9,13 @@
 int mlx5_cmd_alloc_sf(struct mlx5_core_dev *dev, u16 function_id);
 int mlx5_cmd_dealloc_sf(struct mlx5_core_dev *dev, u16 function_id);
 
+int mlx5_cmd_sf_enable_hca(struct mlx5_core_dev *dev, u16 func_id);
+int mlx5_cmd_sf_disable_hca(struct mlx5_core_dev *dev, u16 func_id);
+
 u16 mlx5_sf_sw_to_hw_id(const struct mlx5_core_dev *dev, u16 sw_id);
 
 int mlx5_sf_hw_table_sf_alloc(struct mlx5_core_dev *dev, u32 usr_sfnum);
 void mlx5_sf_hw_table_sf_free(struct mlx5_core_dev *dev, u16 id);
+void mlx5_sf_hw_table_sf_deferred_free(struct mlx5_core_dev *dev, u16 id);
 
 #endif
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h b/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h
index dd23b6c2d887..296fd070617e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h
@@ -31,6 +31,9 @@ static inline u16 mlx5_sf_max_functions(const struct mlx5_core_dev *dev)
 int mlx5_sf_hw_table_init(struct mlx5_core_dev *dev);
 void mlx5_sf_hw_table_cleanup(struct mlx5_core_dev *dev);
 
+int mlx5_sf_hw_table_create(struct mlx5_core_dev *dev);
+void mlx5_sf_hw_table_destroy(struct mlx5_core_dev *dev);
+
 int mlx5_sf_table_init(struct mlx5_core_dev *dev);
 void mlx5_sf_table_cleanup(struct mlx5_core_dev *dev);
 
@@ -38,6 +41,13 @@ int mlx5_devlink_sf_port_new(struct devlink *devlink, const struct devlink_port_
 			     struct netlink_ext_ack *extack);
 int mlx5_devlink_sf_port_del(struct devlink *devlink, unsigned int port_index,
 			     struct netlink_ext_ack *extack);
+int mlx5_devlink_sf_port_fn_state_get(struct devlink *devlink, struct devlink_port *dl_port,
+				      enum devlink_port_function_state *state,
+				      enum devlink_port_function_opstate *opstate,
+				      struct netlink_ext_ack *extack);
+int mlx5_devlink_sf_port_fn_state_set(struct devlink *devlink, struct devlink_port *dl_port,
+				      enum devlink_port_function_state state,
+				      struct netlink_ext_ack *extack);
 #else
 
 static inline bool mlx5_sf_supported(const struct mlx5_core_dev *dev)
@@ -59,6 +69,15 @@ static inline void mlx5_sf_hw_table_cleanup(struct mlx5_core_dev *dev)
 {
 }
 
+static inline int mlx5_sf_hw_table_create(struct mlx5_core_dev *dev)
+{
+	return 0;
+}
+
+static inline void mlx5_sf_hw_table_destroy(struct mlx5_core_dev *dev)
+{
+}
+
 static inline int mlx5_sf_table_init(struct mlx5_core_dev *dev)
 {
 	return 0;
-- 
2.26.2


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [net-next v5 13/15] devlink: Add devlink port documentation
  2020-12-15  9:03 [net-next v5 00/15] Add mlx5 subfunction support Saeed Mahameed
                   ` (11 preceding siblings ...)
  2020-12-15  9:03 ` [net-next v5 12/15] net/mlx5: SF, Port function state change support Saeed Mahameed
@ 2020-12-15  9:03 ` Saeed Mahameed
  2020-12-16  0:57   ` Jakub Kicinski
  2020-12-15  9:03 ` [net-next v5 14/15] devlink: Extend devlink port documentation for subfunctions Saeed Mahameed
  2020-12-15  9:03 ` [net-next v5 15/15] net/mlx5: Add devlink subfunction port documentation Saeed Mahameed
  14 siblings, 1 reply; 46+ messages in thread
From: Saeed Mahameed @ 2020-12-15  9:03 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Jason Gunthorpe
  Cc: Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Parav Pandit, Jiri Pirko, Saeed Mahameed

From: Parav Pandit <parav@nvidia.com>

Added documentation for devlink port and port function related commands.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../networking/devlink/devlink-port.rst       | 118 ++++++++++++++++++
 Documentation/networking/devlink/index.rst    |   1 +
 2 files changed, 119 insertions(+)
 create mode 100644 Documentation/networking/devlink/devlink-port.rst

diff --git a/Documentation/networking/devlink/devlink-port.rst b/Documentation/networking/devlink/devlink-port.rst
new file mode 100644
index 000000000000..4c910dbb01ca
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-port.rst
@@ -0,0 +1,118 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _devlink_port:
+
+============
+Devlink Port
+============
+
+``devlink-port`` is a port that exists on the device. It has a logically
+separate ingress/egress point of the device. A devlink port can be any one
+of many flavours. A devlink port flavour along with port attributes
+describe what a port represents.
+
+A device driver that intends to publish a devlink port sets the
+devlink port attributes and registers the devlink port.
+
+Devlink port flavours are described below.
+
+.. list-table:: List of devlink port flavours
+   :widths: 33 90
+
+   * - Flavour
+     - Description
+   * - ``DEVLINK_PORT_FLAVOUR_PHYSICAL``
+     - Any kind of physical port. This can be an eswitch physical port or any
+       other physical port on the device.
+   * - ``DEVLINK_PORT_FLAVOUR_DSA``
+     - This indicates a DSA interconnect port.
+   * - ``DEVLINK_PORT_FLAVOUR_CPU``
+     - This indicates a CPU port applicable only to DSA.
+   * - ``DEVLINK_PORT_FLAVOUR_PCI_PF``
+     - This indicates an eswitch port representing a port of PCI
+       physical function (PF).
+   * - ``DEVLINK_PORT_FLAVOUR_PCI_VF``
+     - This indicates an eswitch port representing a port of PCI
+       virtual function (VF).
+   * - ``DEVLINK_PORT_FLAVOUR_VIRTUAL``
+     - This indicates a virtual port for the PCI virtual function.
+
+Devlink port can have a different type based on the link layer described below.
+
+.. list-table:: List of devlink port types
+   :widths: 23 90
+
+   * - Type
+     - Description
+   * - ``DEVLINK_PORT_TYPE_ETH``
+     - Driver should set this port type when a link layer of the port is
+       Ethernet.
+   * - ``DEVLINK_PORT_TYPE_IB``
+     - Driver should set this port type when a link layer of the port is
+       InfiniBand.
+   * - ``DEVLINK_PORT_TYPE_AUTO``
+     - This type is indicated by the user when driver should detect the port
+       type automatically.
+
+PCI controllers
+---------------
+In most cases a PCI device has only one controller. A controller consists of
+potentially multiple physical and virtual functions. Such PCI function consists
+of one or more ports. This port of the function is represented by the devlink
+eswitch port.
+
+A PCI Device connected to multiple CPUs or multiple PCI root complexes or
+SmartNIC, however, may have multiple controllers. For a device with multiple
+controllers, each controller is distinguished by a unique controller number.
+An eswitch on the PCI device support ports of multiple controllers.
+
+An example view of a system with two controllers::
+
+                 ---------------------------------------------------------
+                 |                                                       |
+                 |           --------- ---------         ------- ------- |
+    -----------  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
+    | server  |  | -------   ----/---- ---/----- ------- ---/--- ---/--- |
+    | pci rc  |=== | pf0 |______/________/       | pf1 |___/_______/     |
+    | connect |  | -------                       -------                 |
+    -----------  |     | controller_num=1 (no eswitch)                   |
+                 ------|--------------------------------------------------
+                 (internal wire)
+                       |
+                 ---------------------------------------------------------
+                 | devlink eswitch ports and reps                        |
+                 | ----------------------------------------------------- |
+                 | |ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 |ctrl-0 | |
+                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
+                 | ----------------------------------------------------- |
+                 | |ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 |ctrl-1 | |
+                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
+                 | ----------------------------------------------------- |
+                 |                                                       |
+                 |                                                       |
+    -----------  |           --------- ---------         ------- ------- |
+    | smartNIC|  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
+    | pci rc  |==| -------   ----/---- ---/----- ------- ---/--- ---/--- |
+    | connect |  | | pf0 |______/________/       | pf1 |___/_______/     |
+    -----------  | -------                       -------                 |
+                 |                                                       |
+                 |  local controller_num=0 (eswitch)                     |
+                 ---------------------------------------------------------
+
+In above example, external controller (identified by controller number = 1)
+doesn't have eswitch. Local controller (identified by controller number = 0)
+has the eswitch. Devlink instance on local controller has eswitch devlink
+ports representing ports for both the controllers.
+
+Port function configuration
+===========================
+
+A user can configure the port function attribute before enumerating the
+PCI function. Usually it means, user should configure port function attribute
+before a bus specific device for the function is created. However, when
+SRIOV is enabled, virtual function devices are created on the PCI bus.
+Hence, function attribute should be configured before binding virtual
+function device to the driver.
+
+User may set the hardware address of the function represented by the devlink
+port function. For Ethernet port function this means a MAC address.
diff --git a/Documentation/networking/devlink/index.rst b/Documentation/networking/devlink/index.rst
index d82874760ae2..aab79667f97b 100644
--- a/Documentation/networking/devlink/index.rst
+++ b/Documentation/networking/devlink/index.rst
@@ -18,6 +18,7 @@ general.
    devlink-info
    devlink-flash
    devlink-params
+   devlink-port
    devlink-region
    devlink-resource
    devlink-reload
-- 
2.26.2


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [net-next v5 14/15] devlink: Extend devlink port documentation for subfunctions
  2020-12-15  9:03 [net-next v5 00/15] Add mlx5 subfunction support Saeed Mahameed
                   ` (12 preceding siblings ...)
  2020-12-15  9:03 ` [net-next v5 13/15] devlink: Add devlink port documentation Saeed Mahameed
@ 2020-12-15  9:03 ` Saeed Mahameed
  2020-12-16  1:00   ` Jakub Kicinski
  2020-12-15  9:03 ` [net-next v5 15/15] net/mlx5: Add devlink subfunction port documentation Saeed Mahameed
  14 siblings, 1 reply; 46+ messages in thread
From: Saeed Mahameed @ 2020-12-15  9:03 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Jason Gunthorpe
  Cc: Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Parav Pandit, Saeed Mahameed

From: Parav Pandit <parav@nvidia.com>

Add devlink port documentation for subfunction management.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 Documentation/driver-api/auxiliary_bus.rst    |  2 +
 .../networking/devlink/devlink-port.rst       | 89 ++++++++++++++++++-
 2 files changed, 87 insertions(+), 4 deletions(-)

diff --git a/Documentation/driver-api/auxiliary_bus.rst b/Documentation/driver-api/auxiliary_bus.rst
index 2312506b0674..fff96c7ba7a8 100644
--- a/Documentation/driver-api/auxiliary_bus.rst
+++ b/Documentation/driver-api/auxiliary_bus.rst
@@ -1,5 +1,7 @@
 .. SPDX-License-Identifier: GPL-2.0-only
 
+.. _auxiliary_bus:
+
 =============
 Auxiliary Bus
 =============
diff --git a/Documentation/networking/devlink/devlink-port.rst b/Documentation/networking/devlink/devlink-port.rst
index 4c910dbb01ca..c6924e7a341e 100644
--- a/Documentation/networking/devlink/devlink-port.rst
+++ b/Documentation/networking/devlink/devlink-port.rst
@@ -34,6 +34,9 @@ Devlink port flavours are described below.
    * - ``DEVLINK_PORT_FLAVOUR_PCI_VF``
      - This indicates an eswitch port representing a port of PCI
        virtual function (VF).
+   * - ``DEVLINK_PORT_FLAVOUR_PCI_SF``
+     - This indicates an eswitch port representing a port of PCI
+       subfunction (SF).
    * - ``DEVLINK_PORT_FLAVOUR_VIRTUAL``
      - This indicates a virtual port for the PCI virtual function.
 
@@ -57,9 +60,9 @@ Devlink port can have a different type based on the link layer described below.
 PCI controllers
 ---------------
 In most cases a PCI device has only one controller. A controller consists of
-potentially multiple physical and virtual functions. Such PCI function consists
-of one or more ports. This port of the function is represented by the devlink
-eswitch port.
+potentially multiple physical functions, virtual functions and subfunctions.
+Such PCI function consists of one or more ports. This port of the function
+is represented by the devlink eswitch port.
 
 A PCI Device connected to multiple CPUs or multiple PCI root complexes or
 SmartNIC, however, may have multiple controllers. For a device with multiple
@@ -112,7 +115,85 @@ PCI function. Usually it means, user should configure port function attribute
 before a bus specific device for the function is created. However, when
 SRIOV is enabled, virtual function devices are created on the PCI bus.
 Hence, function attribute should be configured before binding virtual
-function device to the driver.
+function device to the driver. For subfunctions, this means user should
+configure port function attribute before activating the port function.
 
 User may set the hardware address of the function represented by the devlink
 port function. For Ethernet port function this means a MAC address.
+
+Subfunctions
+============
+
+Subfunctions are lightweight functions that has parent PCI function on which
+it is deployed. Subfunctions are created and deployed in unit of 1. Unlike
+SRIOV VFs, they don't require their own PCI virtual function. They communicate
+with the hardware through the parent PCI function. Subfunctions can possibly
+scale better.
+
+To use a subfunction, 3 steps setup sequence is followed.
+(1) create - create a subfunction;
+(2) configure - configure subfunction attributes;
+(3) deploy - deploy the subfunction;
+
+Subfunction management is done using devlink port user interface.
+User performs setup on the subfunction management device.
+
+(1) Create
+----------
+A subfunction is created using a devlink port interface. User adds the
+subfunction by adding a devlink port of subfunction flavour. The devlink
+kernel code calls down to subfunction management driver (devlink op) and asks
+it to create a subfunction devlink port. Driver then instantiates the
+subfunction port and any associated objects such as health reporters and
+representor netdevice.
+
+(2) Configure
+-------------
+Subfunction devlink port is created but it is not active yet. That means the
+entities are created on devlink side, the e-switch port representor is created,
+but the subfunction device itself it not created. User might use e-switch port
+representor to do settings, putting it into bridge, adding TC rules, etc. User
+might as well configure the hardware address (such as MAC address) of the
+subfunction while subfunction is inactive.
+
+(3) Deploy
+----------
+Once subfunction is configured, user must activate it to use it. Upon
+activation, subfunction management driver asks the subfunction management
+device to instantiate the actual subfunction device on particular PCI function.
+A subfunction device is created on the :ref:`Documentation/driver-api/auxiliary_bus.rst <auxiliary_bus>`. At this point matching
+subfunction driver binds to the subfunction's auxiliary device.
+
+Terms and Definitions
+=====================
+
+.. list-table:: Terms and Definitions
+   :widths: 22 90
+
+   * - Term
+     - Definitions
+   * - ``PCI device``
+     - A physical PCI device having one or more PCI bus consists of one or
+       more PCI controllers.
+   * - ``PCI controller``
+     -  A controller consists of potentially multiple physical functions,
+        virtual functions and subfunctions.
+   * - ``Port function``
+     -  An object to manage the function of a port.
+   * - ``Subfunction``
+     -  A lightweight function that has parent PCI function on which it is
+        deployed.
+   * - ``Subfunction device``
+     -  A bus device of the subfunction, usually on a auxiliary bus.
+   * - ``Subfunction driver``
+     -  A device driver for the subfunction auxiliary device.
+   * - ``Subfunction management device``
+     -  A PCI physical function that supports subfunction management.
+   * - ``Subfunction management driver``
+     -  A device driver for PCI physical function that supports
+        subfunction management using devlink port interface.
+   * - ``Subfunction host driver``
+     -  A device driver for PCI physical function that host subfunction
+        devices. In most cases it is same as subfunction management driver. When
+        subfunction is used on external controller, subfunction management and
+        host drivers are different.
-- 
2.26.2


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [net-next v5 15/15] net/mlx5: Add devlink subfunction port documentation
  2020-12-15  9:03 [net-next v5 00/15] Add mlx5 subfunction support Saeed Mahameed
                   ` (13 preceding siblings ...)
  2020-12-15  9:03 ` [net-next v5 14/15] devlink: Extend devlink port documentation for subfunctions Saeed Mahameed
@ 2020-12-15  9:03 ` Saeed Mahameed
  14 siblings, 0 replies; 46+ messages in thread
From: Saeed Mahameed @ 2020-12-15  9:03 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Jason Gunthorpe
  Cc: Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Parav Pandit, Saeed Mahameed

From: Parav Pandit <parav@nvidia.com>

Add documentation for subfunction management using devlink
port.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../device_drivers/ethernet/mellanox/mlx5.rst | 204 ++++++++++++++++++
 1 file changed, 204 insertions(+)

diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst
index a5eb22793bb9..07e38c044355 100644
--- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst
@@ -12,6 +12,8 @@ Contents
 - `Enabling the driver and kconfig options`_
 - `Devlink info`_
 - `Devlink parameters`_
+- `mlx5 subfunction`_
+- `mlx5 port function`_
 - `Devlink health reporters`_
 - `mlx5 tracepoints`_
 
@@ -181,6 +183,208 @@ User command examples:
       values:
          cmode driverinit value true
 
+mlx5 subfunction
+================
+mlx5 supports subfunctions management using devlink port (see :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`) interface.
+
+A Subfunction has its own function capabilities and its own resources. This
+means a subfunction has its own dedicated queues(txq, rxq, cq, eq). These queues
+are neither shared nor stolen from the parent PCI function.
+
+When subfunction is RDMA capable, it has its own QP1, GID table and rdma
+resources neither shared nor stolen from the parent PCI function.
+
+A subfunction has dedicated window in PCI BAR space that is not shared
+with the other subfunctions or parent PCI function. This ensures that all
+class devices of the subfunction accesses only assigned PCI BAR space.
+
+A Subfunction supports eswitch representation through which it supports tc
+offloads. User must configure eswitch to send/receive packets from/to
+subfunction port.
+
+Subfunctions share PCI level resources such as PCI MSI-X IRQs with
+the other subfunctions and/or with its parent PCI function.
+
+Example mlx5 software, system and device view::
+
+       _______
+      | admin |
+      | user  |----------
+      |_______|         |
+          |             |
+      ____|____       __|______            _________________
+     |         |     |         |          |                 |
+     | devlink |     | tc tool |          |    user         |
+     | tool    |     |_________|          | applications    |
+     |_________|         |                |_________________|
+           |             |                   |          |
+           |             |                   |          |         Userspace
+ +---------|-------------|-------------------|----------|--------------------+
+           |             |           +----------+   +----------+   Kernel
+           |             |           |  netdev  |   | rdma dev |
+           |             |           +----------+   +----------+
+   (devlink port add/del |              ^               ^
+    port function set)   |              |               |
+           |             |              +---------------|
+      _____|___          |              |        _______|_______
+     |         |         |              |       | mlx5 class    |
+     | devlink |   +------------+       |       |   drivers     |
+     | kernel  |   | rep netdev |       |       |(mlx5_core,ib) |
+     |_________|   +------------+       |       |_______________|
+           |             |              |               ^
+   (devlink ops)         |              |          (probe/remove)
+  _________|________     |              |           ____|________
+ | subfunction      |    |     +---------------+   | subfunction |
+ | management driver|-----     | subfunction   |---|  driver     |
+ | (mlx5_core)      |          | auxiliary dev |   | (mlx5_core) |
+ |__________________|          +---------------+   |_____________|
+           |                                            ^
+  (sf add/del, vhca events)                             |
+           |                                      (device add/del)
+      _____|____                                    ____|________
+     |          |                                  | subfunction |
+     |  PCI NIC |---- activate/deactive events---->| host driver |
+     |__________|                                  | (mlx5_core) |
+                                                   |_____________|
+
+Subfunction is created using devlink port interface.
+
+- Change device to switchdev mode::
+
+    $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
+
+- Add a devlink port of subfunction flavour::
+
+    $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
+
+- Show a devlink port of the subfunction::
+
+    $ devlink port show pci/0000:06:00.0/32768
+    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
+      function:
+        hw_addr 00:00:00:00:00:00
+
+- Delete a devlink port of subfunction after use::
+
+    $ devlink port del pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
+
+mlx5 port function
+==================
+mlx5 driver provides mechanism to setup PCI VF/SF port function
+attributes in unified way for smartnic and non-smartnic NICs.
+
+This is supported only when eswitch mode is set to switchdev. Port function
+configuration of the PCI VF/SF is supported through devlink eswitch port.
+
+Port function attributes should be set before PCI VF/SF is enumerated by the
+driver.
+
+MAC address setup
+-----------------
+mlx5 driver provides mechanism to setup the MAC address of the PCI VF/SF.
+
+Configured MAC address of the PCI VF/SF will be used by netdevice and rdma
+device created for the PCI VF/SF.
+
+- Get MAC address of the VF identified by its unique devlink port index::
+
+    $ devlink port show pci/0000:06:00.0/2
+    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+      function:
+        hw_addr 00:00:00:00:00:00
+
+- Set MAC address of the VF identified by its unique devlink port index::
+
+    $ devlink port function set pci/0000:06:00.0/2 hw_addr 00:11:22:33:44:55
+
+    $ devlink port show pci/0000:06:00.0/2
+    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+      function:
+        hw_addr 00:11:22:33:44:55
+
+- Get MAC address of the SF identified by its unique devlink port index::
+
+    $ devlink port show pci/0000:06:00.0/32768
+    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
+      function:
+        hw_addr 00:00:00:00:00:00
+
+- Set MAC address of the VF identified by its unique devlink port index::
+
+    $ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88
+
+    $ devlink port show pci/0000:06:00.0/32768
+    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcivf pfnum 0 sfnum 88
+      function:
+        hw_addr 00:00:00:00:88:88
+
+SF state setup
+--------------
+To use the SF, user must active the SF using SF port function state attribute.
+
+- Get state of the SF identified by its unique devlink port index::
+
+   $ devlink port show ens2f0npf0sf88
+   pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
+     function:
+       hw_addr 00:00:00:00:88:88 state inactive opstate detached
+
+- Activate the function and verify its state is active::
+
+   $ devlink port function set ens2f0npf0sf88 state active
+
+   $ devlink port show ens2f0npf0sf88
+   pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
+     function:
+       hw_addr 00:00:00:00:88:88 state active opstate detached
+
+Upon function activation, PF driver instance gets the event from the device that
+particular SF was activated. It's the cue to put the device on bus, probe it and
+instantiate devlink instance and class specific auxiliary devices for it.
+
+- Show the auxiliary device and port of the subfunction::
+
+    $ devlink dev show
+    devlink dev show auxiliary/mlx5_core.sf.4
+
+    $ devlink port show auxiliary/mlx5_core.sf.4/1
+    auxiliary/mlx5_core.sf.4/1: type eth netdev p0sf88 flavour virtual port 0 splittable false
+
+    $ rdma link show mlx5_0/1
+    link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev p0sf88
+
+    $ rdma dev show
+    8: rocep6s0f1: node_type ca fw 16.29.0550 node_guid 248a:0703:00b3:d113 sys_image_guid 248a:0703:00b3:d112
+    13: mlx5_0: node_type ca fw 16.29.0550 node_guid 0000:00ff:fe00:8888 sys_image_guid 248a:0703:00b3:d112
+
+- Subfunction auxiliary device and class device hierarchy::
+
+                 mlx5_core.sf.4
+          (subfunction auxiliary device)
+                       /\
+                      /  \
+                     /    \
+                    /      \
+                   /        \
+      mlx5_core.eth.4     mlx5_core.rdma.4
+     (sf eth aux dev)     (sf rdma aux dev)
+         |                      |
+         |                      |
+      p0sf88                  mlx5_0
+     (sf netdev)          (sf rdma device)
+
+Additionally SF port also gets the event when the driver attaches to the
+auxiliary device of the subfunction. This results in changing the operational
+state of the function. This provides visibility to user to decide when it is
+safe to delete the SF port for graceful termination of the subfunction.
+
+- Show the SF port operational state::
+
+    $ devlink port show ens2f0npf0sf88
+    pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
+      function:
+        hw_addr 00:00:00:00:88:88 state active opstate attached
+
 Devlink health reporters
 ========================
 
-- 
2.26.2


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next v5 03/15] devlink: Introduce PCI SF port flavour and port attribute
  2020-12-15  9:03 ` [net-next v5 03/15] devlink: Introduce PCI SF port flavour and port attribute Saeed Mahameed
@ 2020-12-15 23:27   ` Jakub Kicinski
  2020-12-16  3:42     ` Parav Pandit
  0 siblings, 1 reply; 46+ messages in thread
From: Jakub Kicinski @ 2020-12-15 23:27 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David S. Miller, Jason Gunthorpe, Leon Romanovsky, netdev,
	linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	david.m.ertman, dan.j.williams, kiran.patil, gregkh,
	Parav Pandit, Jiri Pirko, Vu Pham, Saeed Mahameed

On Tue, 15 Dec 2020 01:03:46 -0800 Saeed Mahameed wrote:
> + *	devlink_port_attrs_pci_sf_set - Set PCI SF port attributes
> + *
> + *	@devlink_port: devlink port
> + *	@controller: associated controller number for the devlink port instance
> + *	@pf: associated PF for the devlink port instance
> + *	@sf: associated SF of a PF for the devlink port instance
> + *	@external: indicates if the port is for an external controller
> + */
> +void devlink_port_attrs_pci_sf_set(struct devlink_port *devlink_port, u32 controller,
> +				   u16 pf, u32 sf, bool external)
> +{
> +	struct devlink_port_attrs *attrs = &devlink_port->attrs;
> +	int ret;
> +
> +	if (WARN_ON(devlink_port->registered))
> +		return;
> +	ret = __devlink_port_attrs_set(devlink_port, DEVLINK_PORT_FLAVOUR_PCI_SF);
> +	if (ret)
> +		return;
> +	attrs->pci_sf.controller = controller;
> +	attrs->pci_sf.pf = pf;
> +	attrs->pci_sf.sf = sf;
> +	attrs->pci_sf.external = external;
> +}
> +EXPORT_SYMBOL_GPL(devlink_port_attrs_pci_sf_set);

So subfunctions don't have a VF id but they may have a controller?

Can you tell us more about the use cases and deployment models you're
intending to support? Let's not add attributes and info which will go
unused.

How are SFs supposed to be used with SmartNICs? Are you assuming single
domain of control? It seems that the way the industry is moving the
major use case for SmartNICs is bare metal.

I always assumed nested eswitches when thinking about SmartNICs, what
are you intending to do?

What are your plans for enabling this feature in user space project?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next v5 04/15] devlink: Support add and delete devlink port
  2020-12-15  9:03 ` [net-next v5 04/15] devlink: Support add and delete devlink port Saeed Mahameed
@ 2020-12-16  0:29   ` Jakub Kicinski
  2020-12-16  5:06     ` Parav Pandit
  0 siblings, 1 reply; 46+ messages in thread
From: Jakub Kicinski @ 2020-12-16  0:29 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David S. Miller, Jason Gunthorpe, Leon Romanovsky, netdev,
	linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	david.m.ertman, dan.j.williams, kiran.patil, gregkh,
	Parav Pandit, Jiri Pirko, Vu Pham, Saeed Mahameed

On Tue, 15 Dec 2020 01:03:47 -0800 Saeed Mahameed wrote:
> From: Parav Pandit <parav@nvidia.com>
> 
> Extended devlink interface for the user to add and delete port.
> Extend devlink to connect user requests to driver to add/delete
> such port in the device.
> 
> When driver routines are invoked, devlink instance lock is not held.
> This enables driver to perform several devlink objects registration,
> unregistration such as (port, health reporter, resource etc)
> by using existing devlink APIs.
> This also helps to uniformly use the code for port unregistration
> during driver unload and during port deletion initiated by user.
> 
> Examples of add, show and delete commands:
> $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
> 
> $ devlink port show
> pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 splittable false
> 
> $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
> 
> $ devlink port show pci/0000:06:00.0/32768
> pci/0000:06:00.0/32768: type eth netdev eth0 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
>   function:
>     hw_addr 00:00:00:00:88:88 state inactive opstate detached
> 
> $ udevadm test-builtin net_id /sys/class/net/eth0
> Load module index
> Parsed configuration file /usr/lib/systemd/network/99-default.link
> Created link configuration context.
> Using default interface naming scheme 'v245'.
> ID_NET_NAMING_SCHEME=v245
> ID_NET_NAME_PATH=enp6s0f0npf0sf88
> ID_NET_NAME_SLOT=ens2f0npf0sf88
> Unload module index
> Unloaded link configuration context.
> 
> Signed-off-by: Parav Pandit <parav@nvidia.com>
> Reviewed-by: Jiri Pirko <jiri@nvidia.com>
> Reviewed-by: Vu Pham <vuhuong@nvidia.com>
> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

> diff --git a/include/net/devlink.h b/include/net/devlink.h
> index 5bd43f0a79a8..f8cff3e402da 100644
> --- a/include/net/devlink.h
> +++ b/include/net/devlink.h
> @@ -153,6 +153,17 @@ struct devlink_port {
>  	struct mutex reporters_lock; /* Protects reporter_list */
>  };
>  
> +struct devlink_port_new_attrs {
> +	enum devlink_port_flavour flavour;
> +	unsigned int port_index;
> +	u32 controller;
> +	u32 sfnum;
> +	u16 pfnum;

Oh. So you had the structure which actually gets stored in memory for
the lifetime of the device in patch 3 mispacked (u32 / u16 / u32 / u8).
But this one with arguments is packed. Please be consistent.

> +	u8 port_index_valid:1,
> +	   controller_valid:1,
> +	   sfnum_valid:1;
> +};
> +
>  struct devlink_sb_pool_info {
>  	enum devlink_sb_pool_type pool_type;
>  	u32 size;
> @@ -1363,6 +1374,34 @@ struct devlink_ops {
>  	int (*port_function_hw_addr_set)(struct devlink *devlink, struct devlink_port *port,
>  					 const u8 *hw_addr, int hw_addr_len,
>  					 struct netlink_ext_ack *extack);
> +	/**
> +	 * @port_new: Port add function.
> +	 *
> +	 * Should be used by device driver to let caller add new port of a
> +	 * specified flavour with optional attributes.

Add a new port of a specified flavor with optional attributes.

> +	 * Driver should return -EOPNOTSUPP if it doesn't support port addition

s/should/must/

> +	 * of a specified flavour or specified attributes. Driver should set
> +	 * extack error message in case of fail to add the port. Devlink core

s/fail to add the port/failure/

> +	 * does not hold a devlink instance lock when this callback is invoked.

Called without holding the devlink instance lock.

> +	 * Driver must ensures synchronization when adding or deleting a port.

s/ensures/ensure/ but really that's pretty obvious from the previous
sentence.

> +	 * Driver must register a port with devlink core.

s/must/is expected to/

Please make sure your comments and documentation are proof read by
someone.

> +static int devlink_nl_cmd_port_new_doit(struct sk_buff *skb,
> +					struct genl_info *info)
> +{
> +	struct netlink_ext_ack *extack = info->extack;
> +	struct devlink_port_new_attrs new_attrs = {};
> +	struct devlink *devlink = info->user_ptr[0];
> +
> +	if (!info->attrs[DEVLINK_ATTR_PORT_FLAVOUR] ||
> +	    !info->attrs[DEVLINK_ATTR_PORT_PCI_PF_NUMBER]) {
> +		NL_SET_ERR_MSG_MOD(extack, "Port flavour or PCI PF are not specified");
> +		return -EINVAL;
> +	}
> +	new_attrs.flavour = nla_get_u16(info->attrs[DEVLINK_ATTR_PORT_FLAVOUR]);
> +	new_attrs.pfnum =
> +		nla_get_u16(info->attrs[DEVLINK_ATTR_PORT_PCI_PF_NUMBER]);
> +
> +	if (info->attrs[DEVLINK_ATTR_PORT_INDEX]) {
> +		new_attrs.port_index =
> +			nla_get_u32(info->attrs[DEVLINK_ATTR_PORT_INDEX]);
> +		new_attrs.port_index_valid = true;
> +	}

This is the desired port index of the new port?
Or the index of the parent port?
Let's make it abundantly clear since its a pass-thru argument for the
driver to interpret.

> +	if (info->attrs[DEVLINK_ATTR_PORT_CONTROLLER_NUMBER]) {
> +		new_attrs.controller =
> +			nla_get_u16(info->attrs[DEVLINK_ATTR_PORT_CONTROLLER_NUMBER]);
> +		new_attrs.controller_valid = true;
> +	}
> +	if (info->attrs[DEVLINK_ATTR_PORT_PCI_SF_NUMBER]) {
> +		new_attrs.sfnum = nla_get_u32(info->attrs[DEVLINK_ATTR_PORT_PCI_SF_NUMBER]);
> +		new_attrs.sfnum_valid = true;
> +	}
> +
> +	if (!devlink->ops->port_new)
> +		return -EOPNOTSUPP;

Why is this check not at the beginning of the function?
Also should there be an extack on it?

> +	return devlink->ops->port_new(devlink, &new_attrs, extack);

This should return the identifier of the created port back to user
space.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next v5 05/15] devlink: Support get and set state of port function
  2020-12-15  9:03 ` [net-next v5 05/15] devlink: Support get and set state of port function Saeed Mahameed
@ 2020-12-16  0:37   ` Jakub Kicinski
  2020-12-16  5:15     ` Parav Pandit
  0 siblings, 1 reply; 46+ messages in thread
From: Jakub Kicinski @ 2020-12-16  0:37 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David S. Miller, Jason Gunthorpe, Leon Romanovsky, netdev,
	linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	david.m.ertman, dan.j.williams, kiran.patil, gregkh,
	Parav Pandit, Jiri Pirko, Vu Pham, Saeed Mahameed

On Tue, 15 Dec 2020 01:03:48 -0800 Saeed Mahameed wrote:
> From: Parav Pandit <parav@nvidia.com>
> 
> devlink port function can be in active or inactive state.
> Allow users to get and set port function's state.
> 
> When the port function it activated, its operational state may change
> after a while when the device is created and driver binds to it.
> Similarly on deactivation flow.

So what's the flow device should implement?

User requests deactivated, the device sends a notification to 
the driver bound to the device. What if the driver ignores it?

> To clearly describe the state of the port function and its device's
> operational state in the host system, define state and opstate
> attributes.
> 
> Example of a PCI SF port which supports a port function:
> Create a device with ID=10 and one physical port.
> 
> $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
> 
> $ devlink port show
> pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 splittable false
> 
> $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
> 
> $ devlink port show pci/0000:06:00.0/32768
> pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
>   function:
>     hw_addr 00:00:00:00:88:88 state inactive opstate detached
> 
> $ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88 state active

Is request to deactivate done by settings state to inactive?

> $ devlink port show pci/0000:06:00.0/32768 -jp
> {
>     "port": {
>         "pci/0000:06:00.0/32768": {
>             "type": "eth",
>             "netdev": "ens2f0npf0sf88",
>             "flavour": "pcisf",
>             "controller": 0,
>             "pfnum": 0,
>             "sfnum": 88,
>             "external": false,
>             "splittable": false,
>             "function": {
>                 "hw_addr": "00:00:00:00:88:88",
>                 "state": "active",
>                 "opstate": "attached"
>             }
>         }
>     }
> }
> 
> Signed-off-by: Parav Pandit <parav@nvidia.com>
> Reviewed-by: Jiri Pirko <jiri@nvidia.com>
> Reviewed-by: Vu Pham <vuhuong@nvidia.com>
> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

> + * enum devlink_port_function_opstate - indicates operational state of port function
> + * @DEVLINK_PORT_FUNCTION_OPSTATE_ATTACHED: Driver is attached to the function of port,

This name definitely needs to be shortened.

> + *					    gracefufl tear down of the function, after

gracefufl

> + *					    inactivation of the port function, user should wait
> + *					    for operational state to turn DETACHED.

Why do you indent the comment by 40 characters and then go over 80
chars?

> + * @DEVLINK_PORT_FUNCTION_OPSTATE_DETACHED: Driver is detached from the function of port; it is
> + *					    safe to delete the port.
> + */
> +enum devlink_port_function_opstate {
> +	DEVLINK_PORT_FUNCTION_OPSTATE_DETACHED,

The port function must be some Mellanox speak - for the second time - 
I have no idea what it means. Please use meaningful names.

> +	DEVLINK_PORT_FUNCTION_OPSTATE_ATTACHED,
> +};
> +
>  #endif /* _UAPI_LINUX_DEVLINK_H_ */
> diff --git a/net/core/devlink.c b/net/core/devlink.c
> index 11043707f63f..b8acb8842aa1 100644
> --- a/net/core/devlink.c
> +++ b/net/core/devlink.c
> @@ -87,6 +87,9 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(devlink_trap_report);
>  
>  static const struct nla_policy devlink_function_nl_policy[DEVLINK_PORT_FUNCTION_ATTR_MAX + 1] = {
>  	[DEVLINK_PORT_FUNCTION_ATTR_HW_ADDR] = { .type = NLA_BINARY },
> +	[DEVLINK_PORT_FUNCTION_ATTR_STATE] =
> +		NLA_POLICY_RANGE(NLA_U8, DEVLINK_PORT_FUNCTION_STATE_INACTIVE,
> +				 DEVLINK_PORT_FUNCTION_STATE_ACTIVE),
>  };
>  
>  static LIST_HEAD(devlink_list);
> @@ -746,6 +749,57 @@ devlink_port_function_hw_addr_fill(struct devlink *devlink, const struct devlink
>  	return 0;
>  }
>  
> +static bool
> +devlink_port_function_state_valid(enum devlink_port_function_state state)
> +{
> +	return state == DEVLINK_PORT_FUNCTION_STATE_INACTIVE ||
> +	       state == DEVLINK_PORT_FUNCTION_STATE_ACTIVE;
> +}
> +
> +static bool
> +devlink_port_function_opstate_valid(enum devlink_port_function_opstate state)
> +{
> +	return state == DEVLINK_PORT_FUNCTION_OPSTATE_DETACHED ||
> +	       state == DEVLINK_PORT_FUNCTION_OPSTATE_ATTACHED;
> +}
> +
> +static int
> +devlink_port_function_state_fill(struct devlink *devlink,
> +				 const struct devlink_ops *ops,
> +				 struct devlink_port *port, struct sk_buff *msg,
> +				 struct netlink_ext_ack *extack,
> +				 bool *msg_updated)
> +{
> +	enum devlink_port_function_opstate opstate;
> +	enum devlink_port_function_state state;
> +	int err;
> +
> +	if (!ops->port_function_state_get)
> +		return 0;
> +
> +	err = ops->port_function_state_get(devlink, port, &state, &opstate, extack);
> +	if (err) {
> +		if (err == -EOPNOTSUPP)
> +			return 0;
> +		return err;
> +	}
> +	if (!devlink_port_function_state_valid(state)) {
> +		WARN_ON_ONCE(1);
> +		NL_SET_ERR_MSG_MOD(extack, "Invalid state value read from driver");
> +		return -EINVAL;
> +	}
> +	if (!devlink_port_function_opstate_valid(opstate)) {
> +		WARN_ON_ONCE(1);
> +		NL_SET_ERR_MSG_MOD(extack, "Invalid operational state value read from driver");
> +		return -EINVAL;
> +	}
> +	if (nla_put_u8(msg, DEVLINK_PORT_FUNCTION_ATTR_STATE, state) ||
> +	    nla_put_u8(msg, DEVLINK_PORT_FUNCTION_ATTR_OPSTATE, opstate))
> +		return -EMSGSIZE;
> +	*msg_updated = true;
> +	return 0;
> +}
> +
>  static int
>  devlink_nl_port_function_attrs_put(struct sk_buff *msg, struct devlink_port *port,
>  				   struct netlink_ext_ack *extack)
> @@ -762,6 +816,13 @@ devlink_nl_port_function_attrs_put(struct sk_buff *msg, struct devlink_port *por
>  
>  	ops = devlink->ops;
>  	err = devlink_port_function_hw_addr_fill(devlink, ops, port, msg, extack, &msg_updated);

Wrap your code, please.

> +	if (err)
> +		goto out;
> +	err = devlink_port_function_state_fill(devlink, ops, port, msg, extack,
> +					       &msg_updated);
> +	if (err)
> +		goto out;
> +out:
>  	if (err || !msg_updated)
>  		nla_nest_cancel(msg, function_attr);
>  	else

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next v5 07/15] net/mlx5: SF, Add auxiliary device support
  2020-12-15  9:03 ` [net-next v5 07/15] net/mlx5: SF, Add auxiliary device support Saeed Mahameed
@ 2020-12-16  0:43   ` Jakub Kicinski
  2020-12-16  5:19     ` Parav Pandit
  0 siblings, 1 reply; 46+ messages in thread
From: Jakub Kicinski @ 2020-12-16  0:43 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David S. Miller, Jason Gunthorpe, Leon Romanovsky, netdev,
	linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	david.m.ertman, dan.j.williams, kiran.patil, gregkh,
	Parav Pandit, Vu Pham, Saeed Mahameed

On Tue, 15 Dec 2020 01:03:50 -0800 Saeed Mahameed wrote:
> +static ssize_t sfnum_show(struct device *dev, struct device_attribute *attr, char *buf)
> +{
> +	struct auxiliary_device *adev = container_of(dev, struct auxiliary_device, dev);
> +	struct mlx5_sf_dev *sf_dev = container_of(adev, struct mlx5_sf_dev, adev);
> +
> +	return scnprintf(buf, PAGE_SIZE, "%u\n", sf_dev->sfnum);
> +}
> +static DEVICE_ATTR_RO(sfnum);
> +
> +static struct attribute *sf_device_attrs[] = {
> +	&dev_attr_sfnum.attr,
> +	NULL,
> +};
> +
> +static const struct attribute_group sf_attr_group = {
> +	.attrs = sf_device_attrs,
> +};
> +
> +static const struct attribute_group *sf_attr_groups[2] = {
> +	&sf_attr_group,
> +	NULL
> +};

Why the sysfs attribute? Devlink should be able to report device name
so there's no need for a tie in from the other end.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next v5 09/15] net/mlx5: E-switch, Prepare eswitch to handle SF vport
  2020-12-15  9:03 ` [net-next v5 09/15] net/mlx5: E-switch, Prepare eswitch to handle SF vport Saeed Mahameed
@ 2020-12-16  0:47   ` Jakub Kicinski
  2020-12-16  5:28     ` Parav Pandit
  0 siblings, 1 reply; 46+ messages in thread
From: Jakub Kicinski @ 2020-12-16  0:47 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David S. Miller, Jason Gunthorpe, Leon Romanovsky, netdev,
	linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	david.m.ertman, dan.j.williams, kiran.patil, gregkh, Vu Pham,
	Parav Pandit, Roi Dayan, Saeed Mahameed

On Tue, 15 Dec 2020 01:03:52 -0800 Saeed Mahameed wrote:
> From: Vu Pham <vuhuong@nvidia.com>
> 
> Prepare eswitch to handle SF vport during
> (a) querying eswitch functions
> (b) egress ACL creation
> (c) account for SF vports in total vports calculation
> 
> Assign a dedicated placeholder for SFs vports and their representors.
> They are placed after VFs vports and before ECPF vports as below:
> [PF,VF0,...,VFn,SF0,...SFm,ECPF,UPLINK].
> 
> Change functions to map SF's vport numbers to indices when
> accessing the vports or representors arrays, and vice versa.
> 
> Signed-off-by: Vu Pham <vuhuong@nvidia.com>
> Signed-off-by: Parav Pandit <parav@nvidia.com>
> Reviewed-by: Roi Dayan <roid@nvidia.com>
> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
> index d6c48582e7a8..ad45d20f9d44 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
> @@ -212,3 +212,13 @@ config MLX5_SF
>  	Build support for subfuction device in the NIC. A Mellanox subfunction
>  	device can support RDMA, netdevice and vdpa device.
>  	It is similar to a SRIOV VF but it doesn't require SRIOV support.
> +
> +config MLX5_SF_MANAGER
> +	bool
> +	depends on MLX5_SF && MLX5_ESWITCH
> +	default y
> +	help
> +	Build support for subfuction port in the NIC. A Mellanox subfunction
> +	port is managed through devlink.  A subfunction supports RDMA, netdevice
> +	and vdpa device. It is similar to a SRIOV VF but it doesn't require
> +	SRIOV support.

Why is this a separate knob?

And it's not used anywhere AFAICS.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next v5 11/15] net/mlx5: SF, Add port add delete functionality
  2020-12-15  9:03 ` [net-next v5 11/15] net/mlx5: SF, Add port add delete functionality Saeed Mahameed
@ 2020-12-16  0:51   ` Jakub Kicinski
  2020-12-16  5:31     ` Parav Pandit
  0 siblings, 1 reply; 46+ messages in thread
From: Jakub Kicinski @ 2020-12-16  0:51 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David S. Miller, Jason Gunthorpe, Leon Romanovsky, netdev,
	linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	david.m.ertman, dan.j.williams, kiran.patil, gregkh,
	Parav Pandit, Vu Pham, Saeed Mahameed

On Tue, 15 Dec 2020 01:03:54 -0800 Saeed Mahameed wrote:
> To handle SF port management outside of the eswitch as independent
> software layer, introduce eswitch notifier APIs so that upper layer who
> wish to support sf port management in switchdev mode can perform its

Could you unpack this? What's the "upper layer" software in this
context?

> task whenever eswitch mode is set to switchdev or before eswitch is
> disabled.

How does SF work if eswich is disabled?

> Initialize sf port table on such eswitch event.
> 
> Add SF port add and delete functionality in switchdev mode.
> Destroy all SF ports when eswitch is disabled.
> Expose SF port add and delete to user via devlink commands.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next v5 13/15] devlink: Add devlink port documentation
  2020-12-15  9:03 ` [net-next v5 13/15] devlink: Add devlink port documentation Saeed Mahameed
@ 2020-12-16  0:57   ` Jakub Kicinski
  2020-12-16  5:40     ` Parav Pandit
  0 siblings, 1 reply; 46+ messages in thread
From: Jakub Kicinski @ 2020-12-16  0:57 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David S. Miller, Jason Gunthorpe, Leon Romanovsky, netdev,
	linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	david.m.ertman, dan.j.williams, kiran.patil, gregkh,
	Parav Pandit, Jiri Pirko, Saeed Mahameed

On Tue, 15 Dec 2020 01:03:56 -0800 Saeed Mahameed wrote:
> +PCI controllers
> +---------------
> +In most cases a PCI device has only one controller. A controller consists of
> +potentially multiple physical and virtual functions. Such PCI function consists
> +of one or more ports.

s/Such//

you say consists in two consecutive sentences.

> This port of the function is represented by the devlink eswitch port.

"This port of the function"? Why not just "Each port"?

> +A PCI Device connected to multiple CPUs or multiple PCI root complexes or

Why is device capitalized all of the sudden?

> +SmartNIC, however, may have multiple controllers. For a device with multiple

a SmartNIC or SmartNICs

> +controllers, each controller is distinguished by a unique controller number.
> +An eswitch on the PCI device support ports of multiple controllers.

eswitch is on a PCI device?

> +An example view of a system with two controllers::
> +
> +                 ---------------------------------------------------------
> +                 |                                                       |
> +                 |           --------- ---------         ------- ------- |
> +    -----------  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
> +    | server  |  | -------   ----/---- ---/----- ------- ---/--- ---/--- |
> +    | pci rc  |=== | pf0 |______/________/       | pf1 |___/_______/     |
> +    | connect |  | -------                       -------                 |
> +    -----------  |     | controller_num=1 (no eswitch)                   |
> +                 ------|--------------------------------------------------
> +                 (internal wire)
> +                       |
> +                 ---------------------------------------------------------
> +                 | devlink eswitch ports and reps                        |
> +                 | ----------------------------------------------------- |
> +                 | |ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 |ctrl-0 | |
> +                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
> +                 | ----------------------------------------------------- |
> +                 | |ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 |ctrl-1 | |
> +                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
> +                 | ----------------------------------------------------- |
> +                 |                                                       |
> +                 |                                                       |
> +    -----------  |           --------- ---------         ------- ------- |
> +    | smartNIC|  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
> +    | pci rc  |==| -------   ----/---- ---/----- ------- ---/--- ---/--- |
> +    | connect |  | | pf0 |______/________/       | pf1 |___/_______/     |
> +    -----------  | -------                       -------                 |
> +                 |                                                       |
> +                 |  local controller_num=0 (eswitch)                     |
> +                 ---------------------------------------------------------
> +
> +In above example, external controller (identified by controller number = 1)
> +doesn't have eswitch. Local controller (identified by controller number = 0)
> +has the eswitch. Devlink instance on local controller has eswitch devlink
> +ports representing ports for both the controllers.
> +
> +Port function configuration
> +===========================
> +
> +A user can configure the port function attribute before enumerating the

s/A user/User/

/port function attribute/$something_meaningful/

> +PCI function. Usually it means, user should configure port function attribute

attributes, plural

> +before a bus specific device for the function is created. However, when
> +SRIOV is enabled, virtual function devices are created on the PCI bus.
> +Hence, function attribute should be configured before binding virtual
> +function device to the driver.
> +
> +User may set the hardware address of the function represented by the devlink
> +port function. For Ethernet port function this means a MAC address.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next v5 14/15] devlink: Extend devlink port documentation for subfunctions
  2020-12-15  9:03 ` [net-next v5 14/15] devlink: Extend devlink port documentation for subfunctions Saeed Mahameed
@ 2020-12-16  1:00   ` Jakub Kicinski
  2020-12-16  3:55     ` Parav Pandit
  0 siblings, 1 reply; 46+ messages in thread
From: Jakub Kicinski @ 2020-12-16  1:00 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David S. Miller, Jason Gunthorpe, Leon Romanovsky, netdev,
	linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	david.m.ertman, dan.j.williams, kiran.patil, gregkh,
	Parav Pandit, Saeed Mahameed

On Tue, 15 Dec 2020 01:03:57 -0800 Saeed Mahameed wrote:
> +Subfunctions are lightweight functions that has parent PCI function on which
> +it is deployed. Subfunctions are created and deployed in unit of 1. Unlike
> +SRIOV VFs, they don't require their own PCI virtual function. They communicate
> +with the hardware through the parent PCI function. Subfunctions can possibly
> +scale better.
> +
> +To use a subfunction, 3 steps setup sequence is followed.
> +(1) create - create a subfunction;
> +(2) configure - configure subfunction attributes;
> +(3) deploy - deploy the subfunction;
> +
> +Subfunction management is done using devlink port user interface.
> +User performs setup on the subfunction management device.
> +
> +(1) Create
> +----------
> +A subfunction is created using a devlink port interface. User adds the
> +subfunction by adding a devlink port of subfunction flavour. The devlink
> +kernel code calls down to subfunction management driver (devlink op) and asks
> +it to create a subfunction devlink port. Driver then instantiates the
> +subfunction port and any associated objects such as health reporters and
> +representor netdevice.
> +
> +(2) Configure
> +-------------
> +Subfunction devlink port is created but it is not active yet. That means the
> +entities are created on devlink side, the e-switch port representor is created,
> +but the subfunction device itself it not created. User might use e-switch port
> +representor to do settings, putting it into bridge, adding TC rules, etc. User
> +might as well configure the hardware address (such as MAC address) of the
> +subfunction while subfunction is inactive.
> +
> +(3) Deploy
> +----------
> +Once subfunction is configured, user must activate it to use it. Upon
> +activation, subfunction management driver asks the subfunction management
> +device to instantiate the actual subfunction device on particular PCI function.
> +A subfunction device is created on the :ref:`Documentation/driver-api/auxiliary_bus.rst <auxiliary_bus>`. At this point matching
> +subfunction driver binds to the subfunction's auxiliary device.
> +
> +Terms and Definitions
> +=====================
> +
> +.. list-table:: Terms and Definitions
> +   :widths: 22 90
> +
> +   * - Term
> +     - Definitions
> +   * - ``PCI device``
> +     - A physical PCI device having one or more PCI bus consists of one or
> +       more PCI controllers.
> +   * - ``PCI controller``
> +     -  A controller consists of potentially multiple physical functions,
> +        virtual functions and subfunctions.
> +   * - ``Port function``
> +     -  An object to manage the function of a port.
> +   * - ``Subfunction``
> +     -  A lightweight function that has parent PCI function on which it is
> +        deployed.
> +   * - ``Subfunction device``
> +     -  A bus device of the subfunction, usually on a auxiliary bus.
> +   * - ``Subfunction driver``
> +     -  A device driver for the subfunction auxiliary device.
> +   * - ``Subfunction management device``
> +     -  A PCI physical function that supports subfunction management.
> +   * - ``Subfunction management driver``
> +     -  A device driver for PCI physical function that supports
> +        subfunction management using devlink port interface.
> +   * - ``Subfunction host driver``
> +     -  A device driver for PCI physical function that host subfunction
> +        devices. In most cases it is same as subfunction management driver. When
> +        subfunction is used on external controller, subfunction management and
> +        host drivers are different.

Would be great if someone from Mellanox could proof read this before we
spend cycles on correcting spelling in public review.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [net-next v5 03/15] devlink: Introduce PCI SF port flavour and port attribute
  2020-12-15 23:27   ` Jakub Kicinski
@ 2020-12-16  3:42     ` Parav Pandit
  2020-12-16 23:59       ` Jakub Kicinski
  0 siblings, 1 reply; 46+ messages in thread
From: Parav Pandit @ 2020-12-16  3:42 UTC (permalink / raw)
  To: Jakub Kicinski, Saeed Mahameed
  Cc: David S. Miller, Jason Gunthorpe, Leon Romanovsky, netdev,
	linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	david.m.ertman, dan.j.williams, kiran.patil, gregkh, Jiri Pirko,
	Vu Pham, Saeed Mahameed



> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Wednesday, December 16, 2020 4:58 AM
> 
> On Tue, 15 Dec 2020 01:03:46 -0800 Saeed Mahameed wrote:
> > + *	devlink_port_attrs_pci_sf_set - Set PCI SF port attributes
> > + *
> > + *	@devlink_port: devlink port
> > + *	@controller: associated controller number for the devlink port
> instance
> > + *	@pf: associated PF for the devlink port instance
> > + *	@sf: associated SF of a PF for the devlink port instance
> > + *	@external: indicates if the port is for an external controller
> > + */
> > +void devlink_port_attrs_pci_sf_set(struct devlink_port *devlink_port, u32
> controller,
> > +				   u16 pf, u32 sf, bool external) {
> > +	struct devlink_port_attrs *attrs = &devlink_port->attrs;
> > +	int ret;
> > +
> > +	if (WARN_ON(devlink_port->registered))
> > +		return;
> > +	ret = __devlink_port_attrs_set(devlink_port,
> DEVLINK_PORT_FLAVOUR_PCI_SF);
> > +	if (ret)
> > +		return;
> > +	attrs->pci_sf.controller = controller;
> > +	attrs->pci_sf.pf = pf;
> > +	attrs->pci_sf.sf = sf;
> > +	attrs->pci_sf.external = external;
> > +}
> > +EXPORT_SYMBOL_GPL(devlink_port_attrs_pci_sf_set);
> 
> So subfunctions don't have a VF id but they may have a controller?
>
Right. SF can be on external controller.
 
> Can you tell us more about the use cases and deployment models you're
> intending to support? Let's not add attributes and info which will go unused.
> 
External will be used the same way how it is used for PF and VF.

> How are SFs supposed to be used with SmartNICs? Are you assuming single
> domain of control?
No. it is not assumed. SF can be deployed from smartnic to external host.
A user has to pass appropriate controller number, pf number attributes during creation time.

> It seems that the way the industry is moving the major
> use case for SmartNICs is bare metal.
> 
> I always assumed nested eswitches when thinking about SmartNICs, what
> are you intending to do?
>
Mlx5 doesn't support nested eswitch. SF can be deployed on the external controller PCI function.
But this interface neither limited nor enforcing nested or flat eswitch.
 
> What are your plans for enabling this feature in user space project?
Do you mean K8s plugin or iproute2? Can you please tell us what user space project?
If iproute2, will send the iproute2 patchset like other patchset pointing to kernel uapi headers..


^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [net-next v5 14/15] devlink: Extend devlink port documentation for subfunctions
  2020-12-16  1:00   ` Jakub Kicinski
@ 2020-12-16  3:55     ` Parav Pandit
  0 siblings, 0 replies; 46+ messages in thread
From: Parav Pandit @ 2020-12-16  3:55 UTC (permalink / raw)
  To: Jakub Kicinski, Saeed Mahameed
  Cc: David S. Miller, Jason Gunthorpe, Leon Romanovsky, netdev,
	linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	david.m.ertman, dan.j.williams, kiran.patil, gregkh,
	Saeed Mahameed



> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Wednesday, December 16, 2020 6:31 AM
> 
> On Tue, 15 Dec 2020 01:03:57 -0800 Saeed Mahameed wrote:
> > +Subfunctions are lightweight functions that has parent PCI function
> > +on which it is deployed. Subfunctions are created and deployed in
> > +unit of 1. Unlike SRIOV VFs, they don't require their own PCI virtual
> > +function. They communicate with the hardware through the parent PCI
> > +function. Subfunctions can possibly scale better.
> > +
> > +To use a subfunction, 3 steps setup sequence is followed.
> > +(1) create - create a subfunction;
> > +(2) configure - configure subfunction attributes;
> > +(3) deploy - deploy the subfunction;
> > +
> > +Subfunction management is done using devlink port user interface.
> > +User performs setup on the subfunction management device.
> > +
> > +(1) Create
> > +----------
> > +A subfunction is created using a devlink port interface. User adds
> > +the subfunction by adding a devlink port of subfunction flavour. The
> > +devlink kernel code calls down to subfunction management driver
> > +(devlink op) and asks it to create a subfunction devlink port. Driver
> > +then instantiates the subfunction port and any associated objects
> > +such as health reporters and representor netdevice.
> > +
> > +(2) Configure
> > +-------------
> > +Subfunction devlink port is created but it is not active yet. That
> > +means the entities are created on devlink side, the e-switch port
> > +representor is created, but the subfunction device itself it not
> > +created. User might use e-switch port representor to do settings,
> > +putting it into bridge, adding TC rules, etc. User might as well
> > +configure the hardware address (such as MAC address) of the subfunction
> while subfunction is inactive.
> > +
> > +(3) Deploy
> > +----------
> > +Once subfunction is configured, user must activate it to use it. Upon
> > +activation, subfunction management driver asks the subfunction
> > +management device to instantiate the actual subfunction device on
> particular PCI function.
> > +A subfunction device is created on the
> > +:ref:`Documentation/driver-api/auxiliary_bus.rst <auxiliary_bus>`. At this
> point matching subfunction driver binds to the subfunction's auxiliary device.
> > +
> > +Terms and Definitions
> > +=====================
> > +
> > +.. list-table:: Terms and Definitions
> > +   :widths: 22 90
> > +
> > +   * - Term
> > +     - Definitions
> > +   * - ``PCI device``
> > +     - A physical PCI device having one or more PCI bus consists of one or
> > +       more PCI controllers.
> > +   * - ``PCI controller``
> > +     -  A controller consists of potentially multiple physical functions,
> > +        virtual functions and subfunctions.
> > +   * - ``Port function``
> > +     -  An object to manage the function of a port.
> > +   * - ``Subfunction``
> > +     -  A lightweight function that has parent PCI function on which it is
> > +        deployed.
> > +   * - ``Subfunction device``
> > +     -  A bus device of the subfunction, usually on a auxiliary bus.
> > +   * - ``Subfunction driver``
> > +     -  A device driver for the subfunction auxiliary device.
> > +   * - ``Subfunction management device``
> > +     -  A PCI physical function that supports subfunction management.
> > +   * - ``Subfunction management driver``
> > +     -  A device driver for PCI physical function that supports
> > +        subfunction management using devlink port interface.
> > +   * - ``Subfunction host driver``
> > +     -  A device driver for PCI physical function that host subfunction
> > +        devices. In most cases it is same as subfunction management driver.
> When
> > +        subfunction is used on external controller, subfunction management
> and
> > +        host drivers are different.
> 
> Would be great if someone from Mellanox could proof read this before we
> spend cycles on correcting spelling in public review.
Will get it done.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [net-next v5 04/15] devlink: Support add and delete devlink port
  2020-12-16  0:29   ` Jakub Kicinski
@ 2020-12-16  5:06     ` Parav Pandit
  0 siblings, 0 replies; 46+ messages in thread
From: Parav Pandit @ 2020-12-16  5:06 UTC (permalink / raw)
  To: Jakub Kicinski, Saeed Mahameed
  Cc: David S. Miller, Jason Gunthorpe, Leon Romanovsky, netdev,
	linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	david.m.ertman, dan.j.williams, kiran.patil, gregkh, Jiri Pirko,
	Vu Pham, Saeed Mahameed


> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Wednesday, December 16, 2020 5:59 AM
> 
> > +struct devlink_port_new_attrs {
> > +	enum devlink_port_flavour flavour;
> > +	unsigned int port_index;
> > +	u32 controller;
> > +	u32 sfnum;
> > +	u16 pfnum;
> 
> Oh. So you had the structure which actually gets stored in memory for the
> lifetime of the device in patch 3 mispacked (u32 / u16 / u32 / u8).
> But this one with arguments is packed. Please be consistent.
>
Ok. I will change the packing in patch 3.
 
> > +	u8 port_index_valid:1,
> > +	   controller_valid:1,
> > +	   sfnum_valid:1;
> > +};
> > +
> >  struct devlink_sb_pool_info {
> >  	enum devlink_sb_pool_type pool_type;
> >  	u32 size;
> > @@ -1363,6 +1374,34 @@ struct devlink_ops {
> >  	int (*port_function_hw_addr_set)(struct devlink *devlink, struct
> devlink_port *port,
> >  					 const u8 *hw_addr, int
> hw_addr_len,
> >  					 struct netlink_ext_ack *extack);
> > +	/**
> > +	 * @port_new: Port add function.
> > +	 *
> > +	 * Should be used by device driver to let caller add new port of a
> > +	 * specified flavour with optional attributes.
> 
> Add a new port of a specified flavor with optional attributes.
> 
> > +	 * Driver should return -EOPNOTSUPP if it doesn't support port
> > +addition
> 
> s/should/must/
>
Ack.
 
> > +	 * of a specified flavour or specified attributes. Driver should set
> > +	 * extack error message in case of fail to add the port. Devlink
> > +core
> 
> s/fail to add the port/failure/
> 
Ack.

> > +	 * does not hold a devlink instance lock when this callback is invoked.
> 
> Called without holding the devlink instance lock.
>
Ack.
 
> > +	 * Driver must ensures synchronization when adding or deleting a
> port.
> 
> s/ensures/ensure/ but really that's pretty obvious from the previous
> sentence.
> 
It may be, but this extra clarity helps, so I am going to keep this explicit description.

> > +	 * Driver must register a port with devlink core.
> 
> s/must/is expected to/
>
Ack.
 
> Please make sure your comments and documentation are proof read by
> someone.
> 
Ack.

> > +static int devlink_nl_cmd_port_new_doit(struct sk_buff *skb,
> > +					struct genl_info *info)
> > +{
> > +	struct netlink_ext_ack *extack = info->extack;
> > +	struct devlink_port_new_attrs new_attrs = {};
> > +	struct devlink *devlink = info->user_ptr[0];
> > +
> > +	if (!info->attrs[DEVLINK_ATTR_PORT_FLAVOUR] ||
> > +	    !info->attrs[DEVLINK_ATTR_PORT_PCI_PF_NUMBER]) {
> > +		NL_SET_ERR_MSG_MOD(extack, "Port flavour or PCI PF are
> not specified");
> > +		return -EINVAL;
> > +	}
> > +	new_attrs.flavour = nla_get_u16(info-
> >attrs[DEVLINK_ATTR_PORT_FLAVOUR]);
> > +	new_attrs.pfnum =
> > +		nla_get_u16(info-
> >attrs[DEVLINK_ATTR_PORT_PCI_PF_NUMBER]);
> > +
> > +	if (info->attrs[DEVLINK_ATTR_PORT_INDEX]) {
> > +		new_attrs.port_index =
> > +			nla_get_u32(info-
> >attrs[DEVLINK_ATTR_PORT_INDEX]);
> > +		new_attrs.port_index_valid = true;
> > +	}
> 
> This is the desired port index of the new port?
Yes.
> Let's make it abundantly clear since its a pass-thru argument for the driver to
> interpret.
>
Ok. Will add comment here.
 
> > +	if (info->attrs[DEVLINK_ATTR_PORT_CONTROLLER_NUMBER]) {
> > +		new_attrs.controller =
> > +			nla_get_u16(info-
> >attrs[DEVLINK_ATTR_PORT_CONTROLLER_NUMBER]);
> > +		new_attrs.controller_valid = true;
> > +	}
> > +	if (info->attrs[DEVLINK_ATTR_PORT_PCI_SF_NUMBER]) {
> > +		new_attrs.sfnum = nla_get_u32(info-
> >attrs[DEVLINK_ATTR_PORT_PCI_SF_NUMBER]);
> > +		new_attrs.sfnum_valid = true;
> > +	}
> > +
> > +	if (!devlink->ops->port_new)
> > +		return -EOPNOTSUPP;
> 
> Why is this check not at the beginning of the function?
Will move it up.

> Also should there be an extack on it?
> 
Will check, and add if required.
> > +	return devlink->ops->port_new(devlink, &new_attrs, extack);
> 
> This should return the identifier of the created port back to user space.
Ok. Will add.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [net-next v5 05/15] devlink: Support get and set state of port function
  2020-12-16  0:37   ` Jakub Kicinski
@ 2020-12-16  5:15     ` Parav Pandit
  2020-12-16 16:15       ` David Ahern
  2020-12-17  0:08       ` Jakub Kicinski
  0 siblings, 2 replies; 46+ messages in thread
From: Parav Pandit @ 2020-12-16  5:15 UTC (permalink / raw)
  To: Jakub Kicinski, Saeed Mahameed
  Cc: David S. Miller, Jason Gunthorpe, Leon Romanovsky, netdev,
	linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	david.m.ertman, dan.j.williams, kiran.patil, gregkh, Jiri Pirko,
	Vu Pham, Saeed Mahameed



> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Wednesday, December 16, 2020 6:08 AM
> 
> On Tue, 15 Dec 2020 01:03:48 -0800 Saeed Mahameed wrote:
> > From: Parav Pandit <parav@nvidia.com>
> >
> > devlink port function can be in active or inactive state.
> > Allow users to get and set port function's state.
> >
> > When the port function it activated, its operational state may change
> > after a while when the device is created and driver binds to it.
> > Similarly on deactivation flow.
> 
> So what's the flow device should implement?
> 
> User requests deactivated, the device sends a notification to the driver
> bound to the device. What if the driver ignores it?
>
If driver ignores it, those devices are marked unusable for new allocation.
Device becomes usable only after it has act on the event.
 
> > $ devlink port function set pci/0000:06:00.0/32768 hw_addr
> > 00:00:00:00:88:88 state active
> 
> Is request to deactivate done by settings state to inactive?
>
Yes.
 
> > + * enum devlink_port_function_opstate - indicates operational state
> > + of port function
> > + * @DEVLINK_PORT_FUNCTION_OPSTATE_ATTACHED: Driver is attached
> to the
> > + function of port,
> 
> This name definitely needs to be shortened.
>
DEVLINK_PORT_FUNCTION_OPS_ATTACHED
Or
DEVLINK_PF_OPS_ATTACHED 

PF - port function
 
> > + *					    gracefufl tear down of the function,
> after
> 
> gracefufl
> 
> > + *					    inactivation of the port function,
> user should wait
> > + *					    for operational state to turn
> DETACHED.
> 
> Why do you indent the comment by 40 characters and then go over 80
> chars?
> 
Will fix it.

> > + * @DEVLINK_PORT_FUNCTION_OPSTATE_DETACHED: Driver is detached
> from the function of port; it is
> > + *					    safe to delete the port.
> > + */
> > +enum devlink_port_function_opstate {
> > +	DEVLINK_PORT_FUNCTION_OPSTATE_DETACHED,
> 
> The port function must be some Mellanox speak - for the second time - I
> have no idea what it means. Please use meaningful names.
>
It is not a Mellanox term.
Port function object is the one that represents function behind this port.
It is not a new term. Port function already exists in devlink whose operational state attribute is defined here.
 
> > devlink_nl_port_function_attrs_put(struct sk_buff *msg, struct
> > devlink_port *por
> >
> >  	ops = devlink->ops;
> >  	err = devlink_port_function_hw_addr_fill(devlink, ops, port, msg,
> > extack, &msg_updated);
> 
> Wrap your code, please.
>
Sure, will do.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [net-next v5 07/15] net/mlx5: SF, Add auxiliary device support
  2020-12-16  0:43   ` Jakub Kicinski
@ 2020-12-16  5:19     ` Parav Pandit
  2020-12-17  0:11       ` Jakub Kicinski
  0 siblings, 1 reply; 46+ messages in thread
From: Parav Pandit @ 2020-12-16  5:19 UTC (permalink / raw)
  To: Jakub Kicinski, Saeed Mahameed
  Cc: David S. Miller, Jason Gunthorpe, Leon Romanovsky, netdev,
	linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	david.m.ertman, dan.j.williams, kiran.patil, gregkh, Vu Pham,
	Saeed Mahameed



> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Wednesday, December 16, 2020 6:14 AM
> 
> On Tue, 15 Dec 2020 01:03:50 -0800 Saeed Mahameed wrote:
> > +static ssize_t sfnum_show(struct device *dev, struct device_attribute
> > +*attr, char *buf) {
> > +	struct auxiliary_device *adev = container_of(dev, struct
> auxiliary_device, dev);
> > +	struct mlx5_sf_dev *sf_dev = container_of(adev, struct mlx5_sf_dev,
> > +adev);
> > +
> > +	return scnprintf(buf, PAGE_SIZE, "%u\n", sf_dev->sfnum); } static
> > +DEVICE_ATTR_RO(sfnum);
> > +
> > +static struct attribute *sf_device_attrs[] = {
> > +	&dev_attr_sfnum.attr,
> > +	NULL,
> > +};
> > +
> > +static const struct attribute_group sf_attr_group = {
> > +	.attrs = sf_device_attrs,
> > +};
> > +
> > +static const struct attribute_group *sf_attr_groups[2] = {
> > +	&sf_attr_group,
> > +	NULL
> > +};
> 
> Why the sysfs attribute? Devlink should be able to report device name so
> there's no need for a tie in from the other end.
There isn't a need to enforce a devlink instance creation either, those mlx5 driver does it.
systemd/udev looks after the sysfs attributes, so its parent device, similar to how phys_port_name etc looked for representor side.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [net-next v5 09/15] net/mlx5: E-switch, Prepare eswitch to handle SF vport
  2020-12-16  0:47   ` Jakub Kicinski
@ 2020-12-16  5:28     ` Parav Pandit
  0 siblings, 0 replies; 46+ messages in thread
From: Parav Pandit @ 2020-12-16  5:28 UTC (permalink / raw)
  To: Jakub Kicinski, Saeed Mahameed
  Cc: David S. Miller, Jason Gunthorpe, Leon Romanovsky, netdev,
	linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	david.m.ertman, dan.j.williams, kiran.patil, gregkh, Vu Pham,
	Roi Dayan, Saeed Mahameed



> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Wednesday, December 16, 2020 6:18 AM
> 
> On Tue, 15 Dec 2020 01:03:52 -0800 Saeed Mahameed wrote:
> > From: Vu Pham <vuhuong@nvidia.com>
> >
> > Prepare eswitch to handle SF vport during
> > (a) querying eswitch functions
> > (b) egress ACL creation
> > (c) account for SF vports in total vports calculation
> >
> > Assign a dedicated placeholder for SFs vports and their representors.
> > They are placed after VFs vports and before ECPF vports as below:
> > [PF,VF0,...,VFn,SF0,...SFm,ECPF,UPLINK].
> >
> > Change functions to map SF's vport numbers to indices when accessing
> > the vports or representors arrays, and vice versa.
> >
> > Signed-off-by: Vu Pham <vuhuong@nvidia.com>
> > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > Reviewed-by: Roi Dayan <roid@nvidia.com>
> > Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
> 
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
> > b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
> > index d6c48582e7a8..ad45d20f9d44 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
> > @@ -212,3 +212,13 @@ config MLX5_SF
> >  	Build support for subfuction device in the NIC. A Mellanox
> subfunction
> >  	device can support RDMA, netdevice and vdpa device.
> >  	It is similar to a SRIOV VF but it doesn't require SRIOV support.
> > +
> > +config MLX5_SF_MANAGER
> > +	bool
> > +	depends on MLX5_SF && MLX5_ESWITCH
> > +	default y
> > +	help
> > +	Build support for subfuction port in the NIC. A Mellanox subfunction
> > +	port is managed through devlink.  A subfunction supports RDMA,
> netdevice
> > +	and vdpa device. It is similar to a SRIOV VF but it doesn't require
> > +	SRIOV support.
> 
> Why is this a separate knob?
> 
> And it's not used anywhere AFAICS.
SF device and SF manager are two different sides. SF manager is only supported when eswitch is enabled.
It is used in subsequent of sf/devlink.c to disable port add/del callbacks.
I should possibly move this hunk to devlink patch 11.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [net-next v5 11/15] net/mlx5: SF, Add port add delete functionality
  2020-12-16  0:51   ` Jakub Kicinski
@ 2020-12-16  5:31     ` Parav Pandit
  0 siblings, 0 replies; 46+ messages in thread
From: Parav Pandit @ 2020-12-16  5:31 UTC (permalink / raw)
  To: Jakub Kicinski, Saeed Mahameed
  Cc: David S. Miller, Jason Gunthorpe, Leon Romanovsky, netdev,
	linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	david.m.ertman, dan.j.williams, kiran.patil, gregkh, Vu Pham,
	Saeed Mahameed



> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Wednesday, December 16, 2020 6:21 AM
> 
> On Tue, 15 Dec 2020 01:03:54 -0800 Saeed Mahameed wrote:
> > To handle SF port management outside of the eswitch as independent
> > software layer, introduce eswitch notifier APIs so that upper layer
> > who wish to support sf port management in switchdev mode can perform
> > its
> 
> Could you unpack this? What's the "upper layer" software in this context?
>
Upper layer in this context = sf management layer within the mlx5 driver which implements devlink port add/del callbacks and state handling.
 
> > task whenever eswitch mode is set to switchdev or before eswitch is
> > disabled.
> 
> How does SF work if eswich is disabled?
> 
It doesn't.
when eswitch is disabled, all SF ports gets destroyed through the eswitch event notifier.

> > Initialize sf port table on such eswitch event.
> >
> > Add SF port add and delete functionality in switchdev mode.
> > Destroy all SF ports when eswitch is disabled.
> > Expose SF port add and delete to user via devlink commands.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [net-next v5 13/15] devlink: Add devlink port documentation
  2020-12-16  0:57   ` Jakub Kicinski
@ 2020-12-16  5:40     ` Parav Pandit
  0 siblings, 0 replies; 46+ messages in thread
From: Parav Pandit @ 2020-12-16  5:40 UTC (permalink / raw)
  To: Jakub Kicinski, Saeed Mahameed
  Cc: David S. Miller, Jason Gunthorpe, Leon Romanovsky, netdev,
	linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	david.m.ertman, dan.j.williams, kiran.patil, gregkh, Jiri Pirko,
	Saeed Mahameed


> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Wednesday, December 16, 2020 6:28 AM
> 
> On Tue, 15 Dec 2020 01:03:56 -0800 Saeed Mahameed wrote:
> > +PCI controllers
> > +---------------
> > +In most cases a PCI device has only one controller. A controller
> > +consists of potentially multiple physical and virtual functions. Such
> > +PCI function consists of one or more ports.
> 
> s/Such//
>
Ack.
 
> you say consists in two consecutive sentences.
> 
> > This port of the function is represented by the devlink eswitch port.
> 
First sentence describe controller. Second sentence describe function.
So what is wrong in that?

> "This port of the function"? Why not just "Each port"?
> 
That's fine too. Will simplify.

> > +A PCI Device connected to multiple CPUs or multiple PCI root
> > +complexes or
> 
> Why is device capitalized all of the sudden?
>
Will fix.
 
> > +SmartNIC, however, may have multiple controllers. For a device with
> > +multiple
> 
> a SmartNIC or SmartNICs
> 
> > +controllers, each controller is distinguished by a unique controller
> number.
> > +An eswitch on the PCI device support ports of multiple controllers.
> 
> eswitch is on a PCI device?
>
Will change.
 
> > +An example view of a system with two controllers::
> > +
> > +                 ---------------------------------------------------------
> > +                 |                                                       |
> > +                 |           --------- ---------         ------- ------- |
> > +    -----------  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
> > +    | server  |  | -------   ----/---- ---/----- ------- ---/--- ---/--- |
> > +    | pci rc  |=== | pf0 |______/________/       | pf1 |___/_______/     |
> > +    | connect |  | -------                       -------                 |
> > +    -----------  |     | controller_num=1 (no eswitch)                   |
> > +                 ------|--------------------------------------------------
> > +                 (internal wire)
> > +                       |
> > +                 ---------------------------------------------------------
> > +                 | devlink eswitch ports and reps                        |
> > +                 | ----------------------------------------------------- |
> > +                 | |ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 |ctrl-0 | |
> > +                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
> > +                 | ----------------------------------------------------- |
> > +                 | |ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 |ctrl-1 | |
> > +                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
> > +                 | ----------------------------------------------------- |
> > +                 |                                                       |
> > +                 |                                                       |
> > +    -----------  |           --------- ---------         ------- ------- |
> > +    | smartNIC|  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
> > +    | pci rc  |==| -------   ----/---- ---/----- ------- ---/--- ---/--- |
> > +    | connect |  | | pf0 |______/________/       | pf1 |___/_______/     |
> > +    -----------  | -------                       -------                 |
> > +                 |                                                       |
> > +                 |  local controller_num=0 (eswitch)                     |
> > +
> > + ---------------------------------------------------------
> > +
> > +In above example, external controller (identified by controller
> > +number = 1) doesn't have eswitch. Local controller (identified by
> > +controller number = 0) has the eswitch. Devlink instance on local
> > +controller has eswitch devlink ports representing ports for both the
> controllers.
> > +
> > +Port function configuration
> > +===========================
> > +
> > +A user can configure the port function attribute before enumerating
> > +the
> 
> s/A user/User/
> 
> /port function attribute/$something_meaningful/
> 
May be just say function attribute?

> > +PCI function. Usually it means, user should configure port function
> > +attribute
> 
> attributes, plural
> 
Yes, but at present there is only one i.e. mac address, so didn't use plural.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next v5 05/15] devlink: Support get and set state of port function
  2020-12-16  5:15     ` Parav Pandit
@ 2020-12-16 16:15       ` David Ahern
  2020-12-17  0:08       ` Jakub Kicinski
  1 sibling, 0 replies; 46+ messages in thread
From: David Ahern @ 2020-12-16 16:15 UTC (permalink / raw)
  To: Parav Pandit, Jakub Kicinski, Saeed Mahameed
  Cc: David S. Miller, Jason Gunthorpe, Leon Romanovsky, netdev,
	linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	david.m.ertman, dan.j.williams, kiran.patil, gregkh, Jiri Pirko,
	Vu Pham, Saeed Mahameed

On 12/15/20 10:15 PM, Parav Pandit wrote:
>>> + * enum devlink_port_function_opstate - indicates operational state
>>> + of port function
>>> + * @DEVLINK_PORT_FUNCTION_OPSTATE_ATTACHED: Driver is attached
>> to the
>>> + function of port,
>> This name definitely needs to be shortened.
>>
> DEVLINK_PORT_FUNCTION_OPS_ATTACHED
> Or
> DEVLINK_PF_OPS_ATTACHED 
> 
> PF - port function
>  

The devlink attribute names need to start using established short names
to find that balance between readability and ridiculously long names.

In this case PF for networking has an established link to SRIOV
'physical function'.

FUNCTION can be written as FCN.
ATTACHED can be shortened to ATTCH.

So in this case DEVLINK_PORT_FUNCTION_OPSTATE_ATTACHED (38 chars) drops
to DEVLINK_PORT_FCN_OPSTATE_ATTCH (30 chars). That is a step in the
right direction.




^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next v5 03/15] devlink: Introduce PCI SF port flavour and port attribute
  2020-12-16  3:42     ` Parav Pandit
@ 2020-12-16 23:59       ` Jakub Kicinski
  2020-12-17  4:44         ` Saeed Mahameed
  0 siblings, 1 reply; 46+ messages in thread
From: Jakub Kicinski @ 2020-12-16 23:59 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Saeed Mahameed, David S. Miller, Jason Gunthorpe,
	Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Jiri Pirko, Vu Pham, Saeed Mahameed

On Wed, 16 Dec 2020 03:42:51 +0000 Parav Pandit wrote:
> > From: Jakub Kicinski <kuba@kernel.org>
> > So subfunctions don't have a VF id but they may have a controller?
> >  
> Right. SF can be on external controller.
>  
> > Can you tell us more about the use cases and deployment models you're
> > intending to support? Let's not add attributes and info which will go unused.
> >   
> External will be used the same way how it is used for PF and VF.
> 
> > How are SFs supposed to be used with SmartNICs? Are you assuming single
> > domain of control?  
> No. it is not assumed. SF can be deployed from smartnic to external host.
> A user has to pass appropriate controller number, pf number attributes during creation time.

My problem with this series is that I've gotten some real life
application exposure over the last year, and still I have no idea 
who is going to find this feature useful and why.

That's the point of my questions in the previous email - what
are the use cases, how are they going to operate.

It's hard to review an API without knowing the use of it. iproute2
is low level plumbing.

Here the patch is adding the ability to apparently create a SF on 
a remote controller. If you haven't thought that use case through
just don't allow it until you know how it will work.

> > It seems that the way the industry is moving the major
> > use case for SmartNICs is bare metal.
> > 
> > I always assumed nested eswitches when thinking about SmartNICs, what
> > are you intending to do?
> >  
> Mlx5 doesn't support nested eswitch. SF can be deployed on the external controller PCI function.
> But this interface neither limited nor enforcing nested or flat eswitch.
>  
> > What are your plans for enabling this feature in user space project?  
> Do you mean K8s plugin or iproute2? Can you please tell us what user space project?

That's my question. For SR-IOV it'd be all the virt stacks out there.
But this can't do virt. So what can it do?

> If iproute2, will send the iproute2 patchset like other patchset pointing to kernel uapi headers..

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next v5 05/15] devlink: Support get and set state of port function
  2020-12-16  5:15     ` Parav Pandit
  2020-12-16 16:15       ` David Ahern
@ 2020-12-17  0:08       ` Jakub Kicinski
  2020-12-17  5:46         ` Parav Pandit
  1 sibling, 1 reply; 46+ messages in thread
From: Jakub Kicinski @ 2020-12-17  0:08 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Saeed Mahameed, David S. Miller, Jason Gunthorpe,
	Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Jiri Pirko, Vu Pham, Saeed Mahameed

On Wed, 16 Dec 2020 05:15:04 +0000 Parav Pandit wrote:
> > From: Jakub Kicinski <kuba@kernel.org>
> > Sent: Wednesday, December 16, 2020 6:08 AM
> > 
> > On Tue, 15 Dec 2020 01:03:48 -0800 Saeed Mahameed wrote:  
> > > From: Parav Pandit <parav@nvidia.com>
> > >
> > > devlink port function can be in active or inactive state.
> > > Allow users to get and set port function's state.
> > >
> > > When the port function it activated, its operational state may change
> > > after a while when the device is created and driver binds to it.
> > > Similarly on deactivation flow.  
> > 
> > So what's the flow device should implement?
> > 
> > User requests deactivated, the device sends a notification to the driver
> > bound to the device. What if the driver ignores it?
> >  
> If driver ignores it, those devices are marked unusable for new allocation.
> Device becomes usable only after it has act on the event.

But the device remains fully operational?

So if I'm an admin who wants to unplug a misbehaving "entity"[1]
the deactivate is not gonna help me, it's just a graceful hint?
Is there no need for a forceful shutdown?

[1] refer to earlier email, IDK what entity is supposed to use this

> > > + * @DEVLINK_PORT_FUNCTION_OPSTATE_DETACHED: Driver is detached  
> > from the function of port; it is  
> > > + *					    safe to delete the port.
> > > + */
> > > +enum devlink_port_function_opstate {
> > > +	DEVLINK_PORT_FUNCTION_OPSTATE_DETACHED,  
> > 
> > The port function must be some Mellanox speak - for the second time - I
> > have no idea what it means. Please use meaningful names.
> >  
> It is not a Mellanox term.
> Port function object is the one that represents function behind this port.
> It is not a new term. Port function already exists in devlink whose operational state attribute is defined here.

I must have missed that in review. PCI functions can host multiple
ports. So "port function" does not compute for me. Can we drop the
"function"?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next v5 07/15] net/mlx5: SF, Add auxiliary device support
  2020-12-16  5:19     ` Parav Pandit
@ 2020-12-17  0:11       ` Jakub Kicinski
  2020-12-17  5:23         ` Parav Pandit
  0 siblings, 1 reply; 46+ messages in thread
From: Jakub Kicinski @ 2020-12-17  0:11 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Saeed Mahameed, David S. Miller, Jason Gunthorpe,
	Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Vu Pham, Saeed Mahameed

On Wed, 16 Dec 2020 05:19:15 +0000 Parav Pandit wrote:
> > From: Jakub Kicinski <kuba@kernel.org>
> > Sent: Wednesday, December 16, 2020 6:14 AM
> > 
> > On Tue, 15 Dec 2020 01:03:50 -0800 Saeed Mahameed wrote:  
> > > +static ssize_t sfnum_show(struct device *dev, struct device_attribute
> > > +*attr, char *buf) {
> > > +	struct auxiliary_device *adev = container_of(dev, struct  
> > auxiliary_device, dev);  
> > > +	struct mlx5_sf_dev *sf_dev = container_of(adev, struct mlx5_sf_dev,
> > > +adev);
> > > +
> > > +	return scnprintf(buf, PAGE_SIZE, "%u\n", sf_dev->sfnum); } static
> > > +DEVICE_ATTR_RO(sfnum);
> > > +
> > > +static struct attribute *sf_device_attrs[] = {
> > > +	&dev_attr_sfnum.attr,
> > > +	NULL,
> > > +};
> > > +
> > > +static const struct attribute_group sf_attr_group = {
> > > +	.attrs = sf_device_attrs,
> > > +};
> > > +
> > > +static const struct attribute_group *sf_attr_groups[2] = {
> > > +	&sf_attr_group,
> > > +	NULL
> > > +};  
> > 
> > Why the sysfs attribute? Devlink should be able to report device name so
> > there's no need for a tie in from the other end.  
> There isn't a need to enforce a devlink instance creation either,

You mean there isn't a need for the SF to be spawned by devlink?

> those mlx5 driver does it.

Really, no idea what you're trying to say. Read your emails before
you send them.

> systemd/udev looks after the sysfs attributes, so its parent device, similar to how phys_port_name etc looked for representor side.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next v5 03/15] devlink: Introduce PCI SF port flavour and port attribute
  2020-12-16 23:59       ` Jakub Kicinski
@ 2020-12-17  4:44         ` Saeed Mahameed
  2020-12-18 19:48           ` Jakub Kicinski
  0 siblings, 1 reply; 46+ messages in thread
From: Saeed Mahameed @ 2020-12-17  4:44 UTC (permalink / raw)
  To: Jakub Kicinski, Parav Pandit
  Cc: David S. Miller, Jason Gunthorpe, Leon Romanovsky, netdev,
	linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	david.m.ertman, dan.j.williams, kiran.patil, gregkh, Jiri Pirko,
	Vu Pham

On Wed, 2020-12-16 at 15:59 -0800, Jakub Kicinski wrote:
> On Wed, 16 Dec 2020 03:42:51 +0000 Parav Pandit wrote:
> > > From: Jakub Kicinski <kuba@kernel.org>
> > > So subfunctions don't have a VF id but they may have a
> > > controller?
> > >  
> > Right. SF can be on external controller.
> >  
> > > Can you tell us more about the use cases and deployment models
> > > you're
> > > intending to support? Let's not add attributes and info which
> > > will go unused.
> > >   
> > External will be used the same way how it is used for PF and VF.
> > 
> > > How are SFs supposed to be used with SmartNICs? Are you assuming
> > > single
> > > domain of control?  
> > No. it is not assumed. SF can be deployed from smartnic to external
> > host.
> > A user has to pass appropriate controller number, pf number
> > attributes during creation time.
> 
> My problem with this series is that I've gotten some real life
> application exposure over the last year, and still I have no idea 
> who is going to find this feature useful and why.
> 
> That's the point of my questions in the previous email - what
> are the use cases, how are they going to operate.
> 

The main focus of this feature is scale-ability we want to run
thousands of Containers/VMs, this is useful for both smartnic and
baremetal hypervisor worlds, where security and control is exclusive to
the eswitch manager may it be the smarnic embedded CPU or the x86
Hypervisor.

deployment models is identical to SRIOV, the only difference is the
instantiation model of SF, which is the main discussion point of this
series (i hope), which to my taste is very modest and minimal.
after SF is instantiated from that point nothing is new, the SF is
exposing standard linux interfaces netdev/rdma identical to what VF
does, most likely you will assign them a namespace and pass them
through to a container or assign them (not direct assignment) to a VM
via the virt stack, or create a vdpa instance and pass it to a virtio
interface.

There are endless usecases for the netdev stack, for customers who want
high scale virtualized/containerized environments, with thousands of
network functions that can deliver high speed and full offload
accelerators, Native XDP, Crypto, encap/decap, and HW filtering and
processing pipeline capabilities.

I have a long list of customers with various and different applications
and i am not even talking about the rdma and vdpa customers ! those
customers just can't wait to leave sriov behind and scale up !

this feature has a lot of value to the netdev users only because of the
minimal foot print to the netdev stack (to be honest there is no change
in netdev, only a thin API layer in devlink) and the immediate and
effortless benefits to deploy multiple (accelerated) netdevs at scale.


> It's hard to review an API without knowing the use of it. iproute2
> is low level plumbing.
> 

I don't know how to put this, let me try:
A) SRIOV model
echo 128 > /sys/class/net/eth0/device/sriov_numvfs
ubind vf

ip set vf attribute x
configure representor .. 
deploy vf/netdev/rdma interface into the container

B) SF model 
you do (every thing under the devlink umbrella/switchdev):
for i in {1..1024} ; do
devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum $i
devlink port sf $i set attribute x

# from here on, it is identical to a VF
configure representor
deply sf/netdev/rdma interfaces into a container 

B is more scale-able and has more visibility and controllability  to
the user, after you create the SFs deployment and usecases are
identical to SRIOV VF usecases.

See the improvement ? :)

> Here the patch is adding the ability to apparently create a SF on 
> a remote controller. If you haven't thought that use case through
> just don't allow it until you know how it will work.
> 

We have thought the use case through it is not any different from the 
local controller use case. the code is uniform, we need to work hard to
block a remote controller :) .. 

> > > It seems that the way the industry is moving the major
> > > use case for SmartNICs is bare metal.
> > > 
> > > I always assumed nested eswitches when thinking about SmartNICs,
> > > what
> > > are you intending to do?
> > >  
> > Mlx5 doesn't support nested eswitch. SF can be deployed on the
> > external controller PCI function.
> > But this interface neither limited nor enforcing nested or flat
> > eswitch.
> >  
> > > What are your plans for enabling this feature in user space
> > > project?  
> > Do you mean K8s plugin or iproute2? Can you please tell us what
> > user space project?
> 
> That's my question. For SR-IOV it'd be all the virt stacks out there.
> But this can't do virt. So what can it do?
> 

you are thinking VF direct assignment. but don't forget
virt handles netdev assignment to a vm perfectly fine and SF has a
netdev.

And don't get me started on the weird virt handling of SRIOV VF, the
whole thing is a big mess :) it shouldn't be a de facto standard that
we need to follow.. 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [net-next v5 07/15] net/mlx5: SF, Add auxiliary device support
  2020-12-17  0:11       ` Jakub Kicinski
@ 2020-12-17  5:23         ` Parav Pandit
  2020-12-18 19:58           ` Jakub Kicinski
  0 siblings, 1 reply; 46+ messages in thread
From: Parav Pandit @ 2020-12-17  5:23 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Saeed Mahameed, David S. Miller, Jason Gunthorpe,
	Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Vu Pham, Saeed Mahameed



> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Thursday, December 17, 2020 5:42 AM
> 
> On Wed, 16 Dec 2020 05:19:15 +0000 Parav Pandit wrote:
> > > From: Jakub Kicinski <kuba@kernel.org>
> > > Sent: Wednesday, December 16, 2020 6:14 AM
> > >
> > > On Tue, 15 Dec 2020 01:03:50 -0800 Saeed Mahameed wrote:
> > > > +static ssize_t sfnum_show(struct device *dev, struct
> > > > +device_attribute *attr, char *buf) {
> > > > +	struct auxiliary_device *adev = container_of(dev, struct
> > > auxiliary_device, dev);
> > > > +	struct mlx5_sf_dev *sf_dev = container_of(adev, struct
> > > > +mlx5_sf_dev, adev);
> > > > +
> > > > +	return scnprintf(buf, PAGE_SIZE, "%u\n", sf_dev->sfnum); }
> > > > +static DEVICE_ATTR_RO(sfnum);
> > > > +
> > > > +static struct attribute *sf_device_attrs[] = {
> > > > +	&dev_attr_sfnum.attr,
> > > > +	NULL,
> > > > +};
> > > > +
> > > > +static const struct attribute_group sf_attr_group = {
> > > > +	.attrs = sf_device_attrs,
> > > > +};
> > > > +
> > > > +static const struct attribute_group *sf_attr_groups[2] = {
> > > > +	&sf_attr_group,
> > > > +	NULL
> > > > +};
> > >
> > > Why the sysfs attribute? Devlink should be able to report device
> > > name so there's no need for a tie in from the other end.
> > There isn't a need to enforce a devlink instance creation either,
> 
> You mean there isn't a need for the SF to be spawned by devlink?
>
No. sorry for the confusion.
Let me list down the sequence and plumbing.
1. Devlink instance having eswitch spawns the SF port (port add, flavour = pcisf [..]).
2. This SF is either for local or external controller. Just like today's VF.
3. When SF port is activated (port function set state), SF auxiliary device is spawned on the hosting PF.
4. This SF auxiliary device when attached to mlx5_core driver it registers devlink instance (auxiliary/mlx5_core.sf.4).
5. When netdev of SF dev is created, it register devlink port of virtual flavour with link to its netdev.
/sys/class/net/<sf_netdev>/device points to the auxiliary device.
/sys/class/infiniband/<sf_rdma_dev>/device points to the auxiliary device.

6. SF auxiliary device has the sysfs file read by systemd/udev to rename netdev and rdma devices of SF.

Steps 4,5,6 are equivalent to an existing VF.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [net-next v5 05/15] devlink: Support get and set state of port function
  2020-12-17  0:08       ` Jakub Kicinski
@ 2020-12-17  5:46         ` Parav Pandit
  2020-12-18 19:51           ` Jakub Kicinski
  0 siblings, 1 reply; 46+ messages in thread
From: Parav Pandit @ 2020-12-17  5:46 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Saeed Mahameed, David S. Miller, Jason Gunthorpe,
	Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Jiri Pirko, Vu Pham, Saeed Mahameed



> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Thursday, December 17, 2020 5:39 AM
> 
> On Wed, 16 Dec 2020 05:15:04 +0000 Parav Pandit wrote:
> > > From: Jakub Kicinski <kuba@kernel.org>
> > > Sent: Wednesday, December 16, 2020 6:08 AM
> > >
> > > On Tue, 15 Dec 2020 01:03:48 -0800 Saeed Mahameed wrote:
> > > > From: Parav Pandit <parav@nvidia.com>
> > > >
> > > > devlink port function can be in active or inactive state.
> > > > Allow users to get and set port function's state.
> > > >
> > > > When the port function it activated, its operational state may
> > > > change after a while when the device is created and driver binds to it.
> > > > Similarly on deactivation flow.
> > >
> > > So what's the flow device should implement?
> > >
> > > User requests deactivated, the device sends a notification to the
> > > driver bound to the device. What if the driver ignores it?
> > >
> > If driver ignores it, those devices are marked unusable for new allocation.
> > Device becomes usable only after it has act on the event.
> 
> But the device remains fully operational?
> 
> So if I'm an admin who wants to unplug a misbehaving "entity"[1] the
> deactivate is not gonna help me, it's just a graceful hint?
Right.
> Is there no need for a forceful shutdown?
In this patchset, no. I didn't add the knob for it. It is already at 15 patches.
But yes, forceful shutdown extension can be done by the admin in future patchset as,

$ devlink port del pci/0000:06:00.0/<port_index> force true
                                                                                         ^^^^^^^^
Above will be the extension in control of the admin.

> 
> [1] refer to earlier email, IDK what entity is supposed to use this
> 
While I was replying, Saeed already answered it.
> > Port function object is the one that represents function behind this port.
> > It is not a new term. Port function already exists in devlink whose
> operational state attribute is defined here.
> 
> I must have missed that in review. PCI functions can host multiple ports. 
This is exactly why I had "multiple networking ports" above to differentiate it from devlink port.
And you asked me to drop 'networking' because devlink is all networking ports, that creates this confusion.

Anyways, I will rewrite the commit message as 'function', instead of 'port function' as below.

New commit message snippet _start:
A function can be in active or inactive state. Allow users to get and set function's state.

When the function it activated, its operational state may change after a while when the device is created and driver binds to it.
Similarly on deactivation flow.

To clearly describe the state of the function and its device's operational state in the host system, define state and opstate attributes.
_end.

> So
> "port function" does not compute for me. Can we drop the "function"?
No. it is better to keep it. Because it clearly distinguishes the host facing function whose attribute (mac) and state are controlled.
But I shorten the names, enums etc in code from port_function to port_fn. So it should be readable now.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next v5 03/15] devlink: Introduce PCI SF port flavour and port attribute
  2020-12-17  4:44         ` Saeed Mahameed
@ 2020-12-18 19:48           ` Jakub Kicinski
  2020-12-19  4:43             ` Parav Pandit
  0 siblings, 1 reply; 46+ messages in thread
From: Jakub Kicinski @ 2020-12-18 19:48 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Parav Pandit, David S. Miller, Jason Gunthorpe, Leon Romanovsky,
	netdev, linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	david.m.ertman, dan.j.williams, kiran.patil, gregkh, Jiri Pirko,
	Vu Pham

On Wed, 16 Dec 2020 20:44:21 -0800 Saeed Mahameed wrote:
> On Wed, 2020-12-16 at 15:59 -0800, Jakub Kicinski wrote:
> > On Wed, 16 Dec 2020 03:42:51 +0000 Parav Pandit wrote:  
> > > > From: Jakub Kicinski <kuba@kernel.org>
> > > > So subfunctions don't have a VF id but they may have a
> > > > controller?
> > > >    
> > > Right. SF can be on external controller.
> > >    
> > > > Can you tell us more about the use cases and deployment models
> > > > you're
> > > > intending to support? Let's not add attributes and info which
> > > > will go unused.
> > > >     
> > > External will be used the same way how it is used for PF and VF.
> > >   
> > > > How are SFs supposed to be used with SmartNICs? Are you assuming
> > > > single
> > > > domain of control?    
> > > No. it is not assumed. SF can be deployed from smartnic to external
> > > host.
> > > A user has to pass appropriate controller number, pf number
> > > attributes during creation time.  
> > 
> > My problem with this series is that I've gotten some real life
> > application exposure over the last year, and still I have no idea 
> > who is going to find this feature useful and why.
> > 
> > That's the point of my questions in the previous email - what
> > are the use cases, how are they going to operate.
> >   
> 
> The main focus of this feature is scale-ability we want to run
> thousands of Containers/VMs, this is useful for both smartnic and
> baremetal hypervisor worlds, where security and control is exclusive to
> the eswitch manager may it be the smarnic embedded CPU or the x86
> Hypervisor.
> 
> deployment models is identical to SRIOV, the only difference is the
> instantiation model of SF, which is the main discussion point of this
> series (i hope), which to my taste is very modest and minimal.
> after SF is instantiated from that point nothing is new, the SF is
> exposing standard linux interfaces netdev/rdma identical to what VF
> does, most likely you will assign them a namespace and pass them
> through to a container or assign them (not direct assignment) to a VM
> via the virt stack, or create a vdpa instance and pass it to a virtio
> interface.
> 
> There are endless usecases for the netdev stack, for customers who want

"endless" :)

> high scale virtualized/containerized environments, with thousands of
> network functions that can deliver high speed and full offload
> accelerators, Native XDP, Crypto, encap/decap, and HW filtering and
> processing pipeline capabilities.
> 
> I have a long list of customers with various and different applications
> and i am not even talking about the rdma and vdpa customers ! those
> customers just can't wait to leave sriov behind and scale up !
> 
> this feature has a lot of value to the netdev users only because of the
> minimal foot print to the netdev stack (to be honest there is no change
> in netdev, only a thin API layer in devlink) and the immediate and
> effortless benefits to deploy multiple (accelerated) netdevs at scale.

The acceleration can hopefully be plumbed through the software devices.

I think your HW is capable of doing large queue sets so I'm curious
how this actually performs. We're probably talking 1000+ queues here -
the CPU will have hard time serving so many queues. In my experiments
basically the more queues the more cache trashing, the more interrupts,
etc. and the lower the performance.

> > It's hard to review an API without knowing the use of it. iproute2
> > is low level plumbing.
> 
> I don't know how to put this, let me try:
> A) SRIOV model
> echo 128 > /sys/class/net/eth0/device/sriov_numvfs
> ubind vf
> 
> ip set vf attribute x
> configure representor .. 
> deploy vf/netdev/rdma interface into the container

No, no, my point is that for SR-IOV it's OpenStack, libvirt etc. which
do this. I understand the manual steps. Often problems pop up when real
systems try to string the HW objects together, allocated them, learn
their capabilities, etc.

> B) SF model 
> you do (every thing under the devlink umbrella/switchdev):
> for i in {1..1024} ; do
> devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum $i
> devlink port sf $i set attribute x
> 
> # from here on, it is identical to a VF
> configure representor
> deply sf/netdev/rdma interfaces into a container 
> 
> B is more scale-able and has more visibility and controllability  to
> the user, after you create the SFs deployment and usecases are
> identical to SRIOV VF usecases.
> 
> See the improvement ? :)
> 
> > Here the patch is adding the ability to apparently create a SF on 
> > a remote controller. If you haven't thought that use case through
> > just don't allow it until you know how it will work.
> 
> We have thought the use case through it is not any different from the 
> local controller use case. the code is uniform, we need to work hard to
> block a remote controller :) .. 

So the SF is always created from the eswitch controller side?
How does the host side look?

I really think that for ease of merging this we should leave 
the remote controller out at the beginning - only allow local
creation.

> > > > It seems that the way the industry is moving the major
> > > > use case for SmartNICs is bare metal.
> > > > 
> > > > I always assumed nested eswitches when thinking about SmartNICs,
> > > > what
> > > > are you intending to do?
> > > >    
> > > Mlx5 doesn't support nested eswitch. SF can be deployed on the
> > > external controller PCI function.
> > > But this interface neither limited nor enforcing nested or flat
> > > eswitch.
> > >    
> > > > What are your plans for enabling this feature in user space
> > > > project?    
> > > Do you mean K8s plugin or iproute2? Can you please tell us what
> > > user space project?  
> > 
> > That's my question. For SR-IOV it'd be all the virt stacks out there.
> > But this can't do virt. So what can it do?
> 
> you are thinking VF direct assignment. but don't forget
> virt handles netdev assignment to a vm perfectly fine and SF has a
> netdev.
> 
> And don't get me started on the weird virt handling of SRIOV VF, the
> whole thing is a big mess :) it shouldn't be a de facto standard that
> we need to follow.. 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next v5 05/15] devlink: Support get and set state of port function
  2020-12-17  5:46         ` Parav Pandit
@ 2020-12-18 19:51           ` Jakub Kicinski
  2020-12-19  5:06             ` Parav Pandit
  0 siblings, 1 reply; 46+ messages in thread
From: Jakub Kicinski @ 2020-12-18 19:51 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Saeed Mahameed, David S. Miller, Jason Gunthorpe,
	Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Jiri Pirko, Vu Pham, Saeed Mahameed

On Thu, 17 Dec 2020 05:46:45 +0000 Parav Pandit wrote:
> > From: Jakub Kicinski <kuba@kernel.org>
> > Sent: Thursday, December 17, 2020 5:39 AM
> > 
> > On Wed, 16 Dec 2020 05:15:04 +0000 Parav Pandit wrote:  
> > > > From: Jakub Kicinski <kuba@kernel.org>
> > > > Sent: Wednesday, December 16, 2020 6:08 AM
> > > >
> > > > On Tue, 15 Dec 2020 01:03:48 -0800 Saeed Mahameed wrote:  
> > > > > From: Parav Pandit <parav@nvidia.com>
> > > > >
> > > > > devlink port function can be in active or inactive state.
> > > > > Allow users to get and set port function's state.
> > > > >
> > > > > When the port function it activated, its operational state may
> > > > > change after a while when the device is created and driver binds to it.
> > > > > Similarly on deactivation flow.  
> > > >
> > > > So what's the flow device should implement?
> > > >
> > > > User requests deactivated, the device sends a notification to the
> > > > driver bound to the device. What if the driver ignores it?
> > > >  
> > > If driver ignores it, those devices are marked unusable for new allocation.
> > > Device becomes usable only after it has act on the event.  
> > 
> > But the device remains fully operational?
> > 
> > So if I'm an admin who wants to unplug a misbehaving "entity"[1] the
> > deactivate is not gonna help me, it's just a graceful hint?  
> Right.
> > Is there no need for a forceful shutdown?  
> In this patchset, no. I didn't add the knob for it. It is already at 15 patches.
> But yes, forceful shutdown extension can be done by the admin in future patchset as,
> 
> $ devlink port del pci/0000:06:00.0/<port_index> force true
>                                                                                          ^^^^^^^^
> Above will be the extension in control of the admin.

Can we come up with operational states that would encompass that?

The "force true" does not look too clean.

And let's document meaning of the states. We don't want the next vendor
to just "assume" the states match their own interpretation.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next v5 07/15] net/mlx5: SF, Add auxiliary device support
  2020-12-17  5:23         ` Parav Pandit
@ 2020-12-18 19:58           ` Jakub Kicinski
  2020-12-19  4:53             ` Parav Pandit
  0 siblings, 1 reply; 46+ messages in thread
From: Jakub Kicinski @ 2020-12-18 19:58 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Saeed Mahameed, David S. Miller, Jason Gunthorpe,
	Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Vu Pham, Saeed Mahameed

On Thu, 17 Dec 2020 05:23:10 +0000 Parav Pandit wrote:
> > From: Jakub Kicinski <kuba@kernel.org>
> > Sent: Thursday, December 17, 2020 5:42 AM
> > 
> > On Wed, 16 Dec 2020 05:19:15 +0000 Parav Pandit wrote:  
> > > > From: Jakub Kicinski <kuba@kernel.org>
> > > > Sent: Wednesday, December 16, 2020 6:14 AM
> > > >
> > > > On Tue, 15 Dec 2020 01:03:50 -0800 Saeed Mahameed wrote:  
> > > > > +static ssize_t sfnum_show(struct device *dev, struct
> > > > > +device_attribute *attr, char *buf) {
> > > > > +	struct auxiliary_device *adev = container_of(dev, struct  
> > > > auxiliary_device, dev);  
> > > > > +	struct mlx5_sf_dev *sf_dev = container_of(adev, struct
> > > > > +mlx5_sf_dev, adev);
> > > > > +
> > > > > +	return scnprintf(buf, PAGE_SIZE, "%u\n", sf_dev->sfnum); }
> > > > > +static DEVICE_ATTR_RO(sfnum);
> > > > > +
> > > > > +static struct attribute *sf_device_attrs[] = {
> > > > > +	&dev_attr_sfnum.attr,
> > > > > +	NULL,
> > > > > +};
> > > > > +
> > > > > +static const struct attribute_group sf_attr_group = {
> > > > > +	.attrs = sf_device_attrs,
> > > > > +};
> > > > > +
> > > > > +static const struct attribute_group *sf_attr_groups[2] = {
> > > > > +	&sf_attr_group,
> > > > > +	NULL
> > > > > +};  
> > > >
> > > > Why the sysfs attribute? Devlink should be able to report device
> > > > name so there's no need for a tie in from the other end.  
> > > There isn't a need to enforce a devlink instance creation either,  
> > 
> > You mean there isn't a need for the SF to be spawned by devlink?
> >  
> No. sorry for the confusion.
> Let me list down the sequence and plumbing.
> 1. Devlink instance having eswitch spawns the SF port (port add, flavour = pcisf [..]).
> 2. This SF is either for local or external controller. Just like today's VF.
> 3. When SF port is activated (port function set state), SF auxiliary device is spawned on the hosting PF.
> 4. This SF auxiliary device when attached to mlx5_core driver it registers devlink instance (auxiliary/mlx5_core.sf.4).
> 5. When netdev of SF dev is created, it register devlink port of virtual flavour with link to its netdev.
> /sys/class/net/<sf_netdev>/device points to the auxiliary device.
> /sys/class/infiniband/<sf_rdma_dev>/device points to the auxiliary device.
> 
> 6. SF auxiliary device has the sysfs file read by systemd/udev to rename netdev and rdma devices of SF.

Why can't the SF ID match aux dev ID? You only register one aux dev per
SF, right? Or one for RDMA, one for netdev, etc?

> Steps 4,5,6 are equivalent to an existing VF.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [net-next v5 03/15] devlink: Introduce PCI SF port flavour and port attribute
  2020-12-18 19:48           ` Jakub Kicinski
@ 2020-12-19  4:43             ` Parav Pandit
  0 siblings, 0 replies; 46+ messages in thread
From: Parav Pandit @ 2020-12-19  4:43 UTC (permalink / raw)
  To: Jakub Kicinski, Saeed Mahameed
  Cc: David S. Miller, Jason Gunthorpe, Leon Romanovsky, netdev,
	linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	david.m.ertman, dan.j.williams, kiran.patil, gregkh, Jiri Pirko,
	Vu Pham



> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Saturday, December 19, 2020 1:18 AM
> So the SF is always created from the eswitch controller side?
> How does the host side look?
>
Host side creates the auxiliary device for the SF.

$ ls -l /sys/bus/auxiliary/devices/
mlx5_core.sf.4 -> ../../../devices/pci0000:00/0000:00:03.0/0000:06:00.0/mlx5_core.sf.4
 
> I really think that for ease of merging this we should leave the remote
> controller out at the beginning - only allow local creation.
> 
Alright. I will drop remote controller attribute of SF in this patchset.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [net-next v5 07/15] net/mlx5: SF, Add auxiliary device support
  2020-12-18 19:58           ` Jakub Kicinski
@ 2020-12-19  4:53             ` Parav Pandit
  2020-12-19 17:43               ` Jakub Kicinski
  0 siblings, 1 reply; 46+ messages in thread
From: Parav Pandit @ 2020-12-19  4:53 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Saeed Mahameed, David S. Miller, Jason Gunthorpe,
	Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Vu Pham, Saeed Mahameed



> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Saturday, December 19, 2020 1:29 AM
> 
> On Thu, 17 Dec 2020 05:23:10 +0000 Parav Pandit wrote:
> > > From: Jakub Kicinski <kuba@kernel.org>
> > > Sent: Thursday, December 17, 2020 5:42 AM
> > >
> > > On Wed, 16 Dec 2020 05:19:15 +0000 Parav Pandit wrote:
> > > > > From: Jakub Kicinski <kuba@kernel.org>
> > > > > Sent: Wednesday, December 16, 2020 6:14 AM
> > > > >
> > > > > On Tue, 15 Dec 2020 01:03:50 -0800 Saeed Mahameed wrote:
> > > > > > +static ssize_t sfnum_show(struct device *dev, struct
> > > > > > +device_attribute *attr, char *buf) {
> > > > > > +	struct auxiliary_device *adev = container_of(dev, struct
> > > > > auxiliary_device, dev);
> > > > > > +	struct mlx5_sf_dev *sf_dev = container_of(adev, struct
> > > > > > +mlx5_sf_dev, adev);
> > > > > > +
> > > > > > +	return scnprintf(buf, PAGE_SIZE, "%u\n", sf_dev->sfnum); }
> > > > > > +static DEVICE_ATTR_RO(sfnum);
> > > > > > +
> > > > > > +static struct attribute *sf_device_attrs[] = {
> > > > > > +	&dev_attr_sfnum.attr,
> > > > > > +	NULL,
> > > > > > +};
> > > > > > +
> > > > > > +static const struct attribute_group sf_attr_group = {
> > > > > > +	.attrs = sf_device_attrs,
> > > > > > +};
> > > > > > +
> > > > > > +static const struct attribute_group *sf_attr_groups[2] = {
> > > > > > +	&sf_attr_group,
> > > > > > +	NULL
> > > > > > +};
> > > > >
> > > > > Why the sysfs attribute? Devlink should be able to report device
> > > > > name so there's no need for a tie in from the other end.
> > > > There isn't a need to enforce a devlink instance creation either,
> > >
> > > You mean there isn't a need for the SF to be spawned by devlink?
> > >
> > No. sorry for the confusion.
> > Let me list down the sequence and plumbing.
> > 1. Devlink instance having eswitch spawns the SF port (port add, flavour =
> pcisf [..]).
> > 2. This SF is either for local or external controller. Just like today's VF.
> > 3. When SF port is activated (port function set state), SF auxiliary device is
> spawned on the hosting PF.
> > 4. This SF auxiliary device when attached to mlx5_core driver it registers
> devlink instance (auxiliary/mlx5_core.sf.4).
> > 5. When netdev of SF dev is created, it register devlink port of virtual
> flavour with link to its netdev.
> > /sys/class/net/<sf_netdev>/device points to the auxiliary device.
> > /sys/class/infiniband/<sf_rdma_dev>/device points to the auxiliary device.
> >
> > 6. SF auxiliary device has the sysfs file read by systemd/udev to rename
> netdev and rdma devices of SF.
> 
> Why can't the SF ID match aux dev ID? 
Auxiliary bus holds the SFs of multiple PFs.
SF ID can be same for SFs from multiple PFs. Encoding PCI address in SF auxiliary device name doesn't do good.
So SF ID attribute of device is more appropriate.

> You only register one aux dev per SF right? 
Right.
> Or one for RDMA, one for netdev, etc?
> 
These protocol/class specific auxiliary devices are on top of SF's auxiliary devices which are only for matching service purpose.
I have covered this detail with actual example in diagram in documentation in patch15 under " Subfunction auxiliary device and class device hierarchy::"

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [net-next v5 05/15] devlink: Support get and set state of port function
  2020-12-18 19:51           ` Jakub Kicinski
@ 2020-12-19  5:06             ` Parav Pandit
  0 siblings, 0 replies; 46+ messages in thread
From: Parav Pandit @ 2020-12-19  5:06 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Saeed Mahameed, David S. Miller, Jason Gunthorpe,
	Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Jiri Pirko, Vu Pham, Saeed Mahameed


> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Saturday, December 19, 2020 1:21 AM
> 
> On Thu, 17 Dec 2020 05:46:45 +0000 Parav Pandit wrote:
> > > From: Jakub Kicinski <kuba@kernel.org>
> > > Sent: Thursday, December 17, 2020 5:39 AM
> > >
> > > On Wed, 16 Dec 2020 05:15:04 +0000 Parav Pandit wrote:
> > > > > From: Jakub Kicinski <kuba@kernel.org>
> > > > > Sent: Wednesday, December 16, 2020 6:08 AM
> > > > >
> > > > > On Tue, 15 Dec 2020 01:03:48 -0800 Saeed Mahameed wrote:
> > > > > > From: Parav Pandit <parav@nvidia.com>
> > > > > >
> > > > > > devlink port function can be in active or inactive state.
> > > > > > Allow users to get and set port function's state.
> > > > > >
> > > > > > When the port function it activated, its operational state may
> > > > > > change after a while when the device is created and driver binds to
> it.
> > > > > > Similarly on deactivation flow.
> > > > >
> > > > > So what's the flow device should implement?
> > > > >
> > > > > User requests deactivated, the device sends a notification to
> > > > > the driver bound to the device. What if the driver ignores it?
> > > > >
> > > > If driver ignores it, those devices are marked unusable for new
> allocation.
> > > > Device becomes usable only after it has act on the event.
> > >
> > > But the device remains fully operational?
> > >
> > > So if I'm an admin who wants to unplug a misbehaving "entity"[1] the
> > > deactivate is not gonna help me, it's just a graceful hint?
> > Right.
> > > Is there no need for a forceful shutdown?
> > In this patchset, no. I didn't add the knob for it. It is already at 15 patches.
> > But yes, forceful shutdown extension can be done by the admin in
> > future patchset as,
> >
> > $ devlink port del pci/0000:06:00.0/<port_index> force true
> >
> > ^^^^^^^^ Above will be the extension in control of the admin.
> 
> Can we come up with operational states that would encompass that?
> 
Operational state is read only. Adding more states will likely make user job harder, unless its absolute necessary.
Currently state and operational state definitions cover all the scenario needed.
Only exception is user doesn't have the ability of force delete.
$ devlink port shutdown pci/0000:03:00.0/port_index
Above command will attempt graceful port deletion.

$ devlink port del pci/0000:03:00.0/port_index
Above command will do force deletion.

> The "force true" does not look too clean.
>
I think notion of force to user is more intuitive than above two commands, as its exist for other parts of the system for example 'reboot'.
So no need for flag as true/false. Just force if user wish to do force removal.

> And let's document meaning of the states. We don't want the next vendor to
> just "assume" the states match their own interpretation.
Oh yes, I did document it in the UAPI header file which will be the first place for vendors to look on how to implement get/set.
But I will add this to patch_14 in the devlink 'subfunctions' section documentation.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next v5 07/15] net/mlx5: SF, Add auxiliary device support
  2020-12-19  4:53             ` Parav Pandit
@ 2020-12-19 17:43               ` Jakub Kicinski
  0 siblings, 0 replies; 46+ messages in thread
From: Jakub Kicinski @ 2020-12-19 17:43 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Saeed Mahameed, David S. Miller, Jason Gunthorpe,
	Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Vu Pham, Saeed Mahameed

On Sat, 19 Dec 2020 04:53:45 +0000 Parav Pandit wrote:
> > Why can't the SF ID match aux dev ID?   
> Auxiliary bus holds the SFs of multiple PFs.

I see it now. Very unfortunate :(

> SF ID can be same for SFs from multiple PFs. Encoding PCI address in SF auxiliary device name doesn't do good.
> So SF ID attribute of device is more appropriate.


^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2020-12-19 17:44 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-15  9:03 [net-next v5 00/15] Add mlx5 subfunction support Saeed Mahameed
2020-12-15  9:03 ` [net-next v5 01/15] net/mlx5: Fix compilation warning for 32-bit platform Saeed Mahameed
2020-12-15  9:03 ` [net-next v5 02/15] devlink: Prepare code to fill multiple port function attributes Saeed Mahameed
2020-12-15  9:03 ` [net-next v5 03/15] devlink: Introduce PCI SF port flavour and port attribute Saeed Mahameed
2020-12-15 23:27   ` Jakub Kicinski
2020-12-16  3:42     ` Parav Pandit
2020-12-16 23:59       ` Jakub Kicinski
2020-12-17  4:44         ` Saeed Mahameed
2020-12-18 19:48           ` Jakub Kicinski
2020-12-19  4:43             ` Parav Pandit
2020-12-15  9:03 ` [net-next v5 04/15] devlink: Support add and delete devlink port Saeed Mahameed
2020-12-16  0:29   ` Jakub Kicinski
2020-12-16  5:06     ` Parav Pandit
2020-12-15  9:03 ` [net-next v5 05/15] devlink: Support get and set state of port function Saeed Mahameed
2020-12-16  0:37   ` Jakub Kicinski
2020-12-16  5:15     ` Parav Pandit
2020-12-16 16:15       ` David Ahern
2020-12-17  0:08       ` Jakub Kicinski
2020-12-17  5:46         ` Parav Pandit
2020-12-18 19:51           ` Jakub Kicinski
2020-12-19  5:06             ` Parav Pandit
2020-12-15  9:03 ` [net-next v5 06/15] net/mlx5: Introduce vhca state event notifier Saeed Mahameed
2020-12-15  9:03 ` [net-next v5 07/15] net/mlx5: SF, Add auxiliary device support Saeed Mahameed
2020-12-16  0:43   ` Jakub Kicinski
2020-12-16  5:19     ` Parav Pandit
2020-12-17  0:11       ` Jakub Kicinski
2020-12-17  5:23         ` Parav Pandit
2020-12-18 19:58           ` Jakub Kicinski
2020-12-19  4:53             ` Parav Pandit
2020-12-19 17:43               ` Jakub Kicinski
2020-12-15  9:03 ` [net-next v5 08/15] net/mlx5: SF, Add auxiliary device driver Saeed Mahameed
2020-12-15  9:03 ` [net-next v5 09/15] net/mlx5: E-switch, Prepare eswitch to handle SF vport Saeed Mahameed
2020-12-16  0:47   ` Jakub Kicinski
2020-12-16  5:28     ` Parav Pandit
2020-12-15  9:03 ` [net-next v5 10/15] net/mlx5: E-switch, Add eswitch helpers for " Saeed Mahameed
2020-12-15  9:03 ` [net-next v5 11/15] net/mlx5: SF, Add port add delete functionality Saeed Mahameed
2020-12-16  0:51   ` Jakub Kicinski
2020-12-16  5:31     ` Parav Pandit
2020-12-15  9:03 ` [net-next v5 12/15] net/mlx5: SF, Port function state change support Saeed Mahameed
2020-12-15  9:03 ` [net-next v5 13/15] devlink: Add devlink port documentation Saeed Mahameed
2020-12-16  0:57   ` Jakub Kicinski
2020-12-16  5:40     ` Parav Pandit
2020-12-15  9:03 ` [net-next v5 14/15] devlink: Extend devlink port documentation for subfunctions Saeed Mahameed
2020-12-16  1:00   ` Jakub Kicinski
2020-12-16  3:55     ` Parav Pandit
2020-12-15  9:03 ` [net-next v5 15/15] net/mlx5: Add devlink subfunction port documentation Saeed Mahameed

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.