* [net-next v4 00/15] Add mlx5 subfunction support
@ 2020-12-14 21:43 Saeed Mahameed
  2020-12-14 21:43 ` [net-next v4 01/15] net/mlx5: Fix compilation warning for 32-bit platform Saeed Mahameed
                   ` (15 more replies)
From: Saeed Mahameed @ 2020-12-14 21:43 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Jason Gunthorpe
  Cc: Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Saeed Mahameed

Hi Dave, Jakub, Jason,

This series from Parav was the theme of this mlx5 release cycle.
We've been waiting anxiously for the auxbus infrastructure to make it into
the kernel, and now that auxbus is in and all the stars are aligned, I
can finally submit this V4 of the devlink and mlx5 subfunction support.

Subfunctions address the scaling limitations of virtualization and
switchdev environments, where SR-IOV failed to deliver and users ran
out of VFs very quickly, as SR-IOV demands a huge amount of physical
resources in both the servers and the NIC.

Subfunctions provide the same functionality as SR-IOV but in a very
lightweight manner; please see the thorough and detailed
documentation from Parav below, in the commit messages and in the
networking documentation patches at the end of this series.

Sending V4 as a continuation of V1, which was sent last month [0]:
[0] https://lore.kernel.org/linux-rdma/20201112192424.2742-1-parav@nvidia.com/

---
Changelog:
v3->v4:
 - Fix 32-bit compilation issue

v2->v3:
 - added header file sf/priv.h to cmd.c to avoid a missing prototype warning
 - made mlx5_sf_table_disable a static function, as it is used only in one file

v1->v2:
 - added documentation for subfunction and its mlx5 implementation
 - add MLX5_SF config option documentation
 - rebased
 - dropped devlink global lock improvement patch as mlx5 doesn't support
   reload while SFs are allocated
 - dropped devlink reload lock patch as mlx5 doesn't support reload
   when SFs are allocated
 - use the updated vhca event from the device to add/remove the auxiliary device
 - split sf devlink port allocation and sf hardware context allocation

Parav Pandit Says:
=================

This patchset introduces support for mlx5 subfunction (SF).

A subfunction is a lightweight function that has a parent PCI function on
which it is deployed. An mlx5 subfunction has its own function capabilities
and its own resources. This means a subfunction has its own dedicated
queues (txq, rxq, cq, eq). These queues are neither shared with nor stolen
from the parent PCI function.

When a subfunction is RDMA capable, it has its own QP1, GID table and RDMA
resources, which are neither shared with nor stolen from the parent PCI
function.

A subfunction has a dedicated window in PCI BAR space that is not shared
with other subfunctions or the parent PCI function. This ensures that all
class devices of the subfunction access only their assigned PCI BAR space.

A subfunction supports eswitch representation, through which it supports tc
offloads. The user must configure the eswitch to send/receive packets
from/to the subfunction port.

Subfunctions share PCI-level resources such as PCI MSI-X IRQs with
other subfunctions and/or with the parent PCI function.

Patch summary:
--------------
Patches 1 to 4 prepare devlink,
patches 5 to 7 add mlx5 SF device support,
patches 8 to 11 add mlx5 SF devlink port support,
patches 12 to 14 add documentation.

Patch-1 prepares code to handle multiple port function attributes
Patch-2 introduces devlink pcisf port flavour similar to pcipf and pcivf
Patch-3 adds port add and delete driver callbacks
Patch-4 adds port function state get and set callbacks
Patch-5 adds mlx5 vhca event notifier support to distribute subfunction
        state change notifications
Patch-6 adds SF auxiliary device
Patch-7 adds SF auxiliary driver
Patch-8 prepares the eswitch to handle the SF vport
Patch-9 adds eswitch helpers to add/remove SF vport
Patch-10 implements devlink port add/del callbacks
Patch-11 implements devlink port function get/set callbacks
Patches 12 to 14 add documentation:
Patch-12 adds mlx5 port function documentation
Patch-13 adds subfunction documentation
Patch-14 adds mlx5 subfunction documentation

Subfunction support is discussed in detail in RFC [1] and [2].
RFC [1] and its extension [2] describe the requirements, design and
proposed plumbing using devlink, auxiliary bus and sysfs for
systemd/udev support. The functionality of this patchset is best
explained using the real examples further below.

overview:
--------
A subfunction can be created and deleted by a user using the devlink port
add/delete interface.

A subfunction can be configured using the devlink port function attributes
before it is activated.

When a subfunction is activated, it results in an auxiliary device on
the host PCI device where it is deployed. A driver binds to the
auxiliary device and further creates the supported class devices.
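
As a rough illustration of that binding (not the actual mlx5 SF driver;
the my_sf_* names are placeholders), an auxiliary driver matching the
"mlx5_core.sf" device name seen in the examples below could look like:

#include <linux/auxiliary_bus.h>
#include <linux/module.h>

static int my_sf_probe(struct auxiliary_device *adev,
		       const struct auxiliary_device_id *id)
{
	/* Create the supported class devices (netdev, RDMA, ...) here. */
	return 0;
}

static void my_sf_remove(struct auxiliary_device *adev)
{
	/* Tear down the class devices created in probe. */
}

static const struct auxiliary_device_id my_sf_id_table[] = {
	{ .name = "mlx5_core.sf" },	/* matches auxiliary/mlx5_core.sf.<N> */
	{}
};
MODULE_DEVICE_TABLE(auxiliary, my_sf_id_table);

static struct auxiliary_driver my_sf_driver = {
	.name = "my_sf",
	.probe = my_sf_probe,
	.remove = my_sf_remove,
	.id_table = my_sf_id_table,
};
module_auxiliary_driver(my_sf_driver);
MODULE_LICENSE("GPL");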

example subfunction usage sequence:
-----------------------------------
Change device to switchdev mode:
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev

Add a devlink port of the subfunction flavour:
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88

Configure the MAC address of the port function:
$ devlink port function set ens2f0npf0sf88 hw_addr 00:00:00:00:88:88

Now activate the function:
$ devlink port function set ens2f0npf0sf88 state active

Now use the auxiliary device and class devices:
$ devlink dev show
pci/0000:06:00.0
auxiliary/mlx5_core.sf.4

$ ip link show
127: ens2f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 24:8a:07:b3:d1:12 brd ff:ff:ff:ff:ff:ff
    altname enp6s0f0np0
129: p0sf88: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:00:00:00:88:88 brd ff:ff:ff:ff:ff:ff

$ rdma dev show
43: rdmap6s0f0: node_type ca fw 16.29.0550 node_guid 248a:0703:00b3:d112 sys_image_guid 248a:0703:00b3:d112
44: mlx5_0: node_type ca fw 16.29.0550 node_guid 0000:00ff:fe00:8888 sys_image_guid 248a:0703:00b3:d112

After use, inactivate the function:
$ devlink port function set ens2f0npf0sf88 state inactive

Now delete the subfunction port:
$ devlink port del ens2f0npf0sf88

[1] https://lore.kernel.org/netdev/20200519092258.GF4655@nanopsycho/
[2] https://marc.info/?l=linux-netdev&m=158555928517777&w=2

=================

Parav Pandit (14):
  net/mlx5: Fix compilation warning for 32-bit platform
  devlink: Prepare code to fill multiple port function attributes
  devlink: Introduce PCI SF port flavour and port attribute
  devlink: Support add and delete devlink port
  devlink: Support get and set state of port function
  net/mlx5: Introduce vhca state event notifier
  net/mlx5: SF, Add auxiliary device support
  net/mlx5: SF, Add auxiliary device driver
  net/mlx5: E-switch, Add eswitch helpers for SF vport
  net/mlx5: SF, Add port add delete functionality
  net/mlx5: SF, Port function state change support
  devlink: Add devlink port documentation
  devlink: Extend devlink port documentation for subfunctions
  net/mlx5: Add devlink subfunction port documentation

Vu Pham (1):
  net/mlx5: E-switch, Prepare eswitch to handle SF vport

 Documentation/driver-api/auxiliary_bus.rst    |   2 +
 .../device_drivers/ethernet/mellanox/mlx5.rst | 209 +++++++
 .../networking/devlink/devlink-port.rst       | 199 +++++++
 Documentation/networking/devlink/index.rst    |   1 +
 .../net/ethernet/mellanox/mlx5/core/Kconfig   |  19 +
 .../net/ethernet/mellanox/mlx5/core/Makefile  |   9 +
 drivers/net/ethernet/mellanox/mlx5/core/cmd.c |   8 +
 .../net/ethernet/mellanox/mlx5/core/devlink.c |  19 +
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  |   5 +-
 .../mellanox/mlx5/core/esw/acl/egress_ofld.c  |   2 +-
 .../mellanox/mlx5/core/esw/devlink_port.c     |  41 ++
 .../net/ethernet/mellanox/mlx5/core/eswitch.c |  48 +-
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |  78 +++
 .../mellanox/mlx5/core/eswitch_offloads.c     |  47 +-
 .../net/ethernet/mellanox/mlx5/core/events.c  |   7 +
 .../net/ethernet/mellanox/mlx5/core/main.c    |  60 +-
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  12 +
 .../net/ethernet/mellanox/mlx5/core/pci_irq.c |  20 +
 .../net/ethernet/mellanox/mlx5/core/sf/cmd.c  |  49 ++
 .../ethernet/mellanox/mlx5/core/sf/dev/dev.c  | 271 +++++++++
 .../ethernet/mellanox/mlx5/core/sf/dev/dev.h  |  55 ++
 .../mellanox/mlx5/core/sf/dev/driver.c        | 101 ++++
 .../ethernet/mellanox/mlx5/core/sf/devlink.c  | 552 ++++++++++++++++++
 .../ethernet/mellanox/mlx5/core/sf/hw_table.c | 233 ++++++++
 .../mlx5/core/sf/mlx5_ifc_vhca_event.h        |  82 +++
 .../net/ethernet/mellanox/mlx5/core/sf/priv.h |  21 +
 .../net/ethernet/mellanox/mlx5/core/sf/sf.h   |  92 +++
 .../mellanox/mlx5/core/sf/vhca_event.c        | 189 ++++++
 .../mellanox/mlx5/core/sf/vhca_event.h        |  57 ++
 .../net/ethernet/mellanox/mlx5/core/vport.c   |   3 +-
 include/linux/mlx5/driver.h                   |  16 +-
 include/linux/mlx5/mlx5_ifc.h                 |   6 +-
 include/net/devlink.h                         |  79 +++
 include/uapi/linux/devlink.h                  |  26 +
 net/core/devlink.c                            | 266 ++++++++-
 35 files changed, 2834 insertions(+), 50 deletions(-)
 create mode 100644 Documentation/networking/devlink/devlink-port.rst
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/cmd.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/dev/driver.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/hw_table.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/mlx5_ifc_vhca_event.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/priv.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/vhca_event.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/vhca_event.h

-- 
2.26.2



* [net-next v4 01/15] net/mlx5: Fix compilation warning for 32-bit platform
  2020-12-14 21:43 [net-next v4 00/15] Add mlx5 subfunction support Saeed Mahameed
@ 2020-12-14 21:43 ` Saeed Mahameed
  2020-12-14 22:31   ` Alexander Duyck
  2020-12-14 21:43 ` [net-next v4 02/15] devlink: Prepare code to fill multiple port function attributes Saeed Mahameed
                   ` (14 subsequent siblings)
From: Saeed Mahameed @ 2020-12-14 21:43 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Jason Gunthorpe
  Cc: Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Parav Pandit, Stephen Rothwell, Saeed Mahameed

From: Parav Pandit <parav@nvidia.com>

MLX5_GENERAL_OBJECT_TYPES is a 64-bit bitfield.

Defining an enum for such bit fields on a 32-bit platform results in the
warning below.

./include/vdso/bits.h:7:26: warning: left shift count >= width of type [-Wshift-count-overflow]
                         ^
./include/linux/mlx5/mlx5_ifc.h:10716:46: note: in expansion of macro ‘BIT’
 MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_SAMPLER = BIT(0x20),
                                             ^~~
Use a 32-bit friendly left shift.
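
For illustration, a minimal userspace reproduction of the problem,
assuming a target where unsigned long (and therefore BIT()) is 32 bits
wide:

#include <stdio.h>

int main(void)
{
	/* (1UL << 0x20) shifts by the full width of a 32-bit unsigned
	 * long, which is undefined behaviour and triggers
	 * -Wshift-count-overflow.  1ULL is at least 64 bits on every
	 * platform, so the shift below is always well defined.
	 */
	unsigned long long sampler = 1ULL << 0x20;

	printf("0x%llx\n", sampler);	/* prints 0x100000000 */
	return 0;
}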

Fixes: 2a2970891647 ("net/mlx5: Add sample offload hardware bits and structures")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeed@kernel.org>
---
 include/linux/mlx5/mlx5_ifc.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 0d6e287d614f..b9f15935dfe5 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -10711,9 +10711,9 @@ struct mlx5_ifc_affiliated_event_header_bits {
 };
 
 enum {
-	MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_ENCRYPTION_KEY = BIT(0xc),
-	MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_IPSEC = BIT(0x13),
-	MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_SAMPLER = BIT(0x20),
+	MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_ENCRYPTION_KEY = 1ULL << 0xc,
+	MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_IPSEC = 1ULL << 0x13,
+	MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_SAMPLER = 1ULL << 0x20,
 };
 
 enum {
-- 
2.26.2



* [net-next v4 02/15] devlink: Prepare code to fill multiple port function attributes
  2020-12-14 21:43 [net-next v4 00/15] Add mlx5 subfunction support Saeed Mahameed
  2020-12-14 21:43 ` [net-next v4 01/15] net/mlx5: Fix compilation warning for 32-bit platform Saeed Mahameed
@ 2020-12-14 21:43 ` Saeed Mahameed
  2020-12-14 21:43 ` [net-next v4 03/15] devlink: Introduce PCI SF port flavour and port attribute Saeed Mahameed
                   ` (13 subsequent siblings)
From: Saeed Mahameed @ 2020-12-14 21:43 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Jason Gunthorpe
  Cc: Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Parav Pandit, Jiri Pirko, Vu Pham, Saeed Mahameed

From: Parav Pandit <parav@nvidia.com>

Prepare the code to fill zero or more optional port function attributes.
A subsequent patch makes use of this to fill more port function
attributes.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 net/core/devlink.c | 63 +++++++++++++++++++++++-----------------------
 1 file changed, 32 insertions(+), 31 deletions(-)

diff --git a/net/core/devlink.c b/net/core/devlink.c
index ee828e4b1007..13e0de80c4f9 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -712,6 +712,31 @@ static int devlink_nl_port_attrs_put(struct sk_buff *msg,
 	return 0;
 }
 
+static int
+devlink_port_function_hw_addr_fill(struct devlink *devlink, const struct devlink_ops *ops,
+				   struct devlink_port *port, struct sk_buff *msg,
+				   struct netlink_ext_ack *extack, bool *msg_updated)
+{
+	u8 hw_addr[MAX_ADDR_LEN];
+	int hw_addr_len;
+	int err;
+
+	if (!ops->port_function_hw_addr_get)
+		return 0;
+
+	err = ops->port_function_hw_addr_get(devlink, port, hw_addr, &hw_addr_len, extack);
+	if (err) {
+		if (err == -EOPNOTSUPP)
+			return 0;
+		return err;
+	}
+	err = nla_put(msg, DEVLINK_PORT_FUNCTION_ATTR_HW_ADDR, hw_addr_len, hw_addr);
+	if (err)
+		return err;
+	*msg_updated = true;
+	return 0;
+}
+
 static int
 devlink_nl_port_function_attrs_put(struct sk_buff *msg, struct devlink_port *port,
 				   struct netlink_ext_ack *extack)
@@ -719,36 +744,16 @@ devlink_nl_port_function_attrs_put(struct sk_buff *msg, struct devlink_port *por
 	struct devlink *devlink = port->devlink;
 	const struct devlink_ops *ops;
 	struct nlattr *function_attr;
-	bool empty_nest = true;
-	int err = 0;
+	bool msg_updated = false;
+	int err;
 
 	function_attr = nla_nest_start_noflag(msg, DEVLINK_ATTR_PORT_FUNCTION);
 	if (!function_attr)
 		return -EMSGSIZE;
 
 	ops = devlink->ops;
-	if (ops->port_function_hw_addr_get) {
-		int hw_addr_len;
-		u8 hw_addr[MAX_ADDR_LEN];
-
-		err = ops->port_function_hw_addr_get(devlink, port, hw_addr, &hw_addr_len, extack);
-		if (err == -EOPNOTSUPP) {
-			/* Port function attributes are optional for a port. If port doesn't
-			 * support function attribute, returning -EOPNOTSUPP is not an error.
-			 */
-			err = 0;
-			goto out;
-		} else if (err) {
-			goto out;
-		}
-		err = nla_put(msg, DEVLINK_PORT_FUNCTION_ATTR_HW_ADDR, hw_addr_len, hw_addr);
-		if (err)
-			goto out;
-		empty_nest = false;
-	}
-
-out:
-	if (err || empty_nest)
+	err = devlink_port_function_hw_addr_fill(devlink, ops, port, msg, extack, &msg_updated);
+	if (err || !msg_updated)
 		nla_nest_cancel(msg, function_attr);
 	else
 		nla_nest_end(msg, function_attr);
@@ -986,7 +991,6 @@ devlink_port_function_hw_addr_set(struct devlink *devlink, struct devlink_port *
 	const struct devlink_ops *ops;
 	const u8 *hw_addr;
 	int hw_addr_len;
-	int err;
 
 	hw_addr = nla_data(attr);
 	hw_addr_len = nla_len(attr);
@@ -1011,12 +1015,7 @@ devlink_port_function_hw_addr_set(struct devlink *devlink, struct devlink_port *
 		return -EOPNOTSUPP;
 	}
 
-	err = ops->port_function_hw_addr_set(devlink, port, hw_addr, hw_addr_len, extack);
-	if (err)
-		return err;
-
-	devlink_port_notify(port, DEVLINK_CMD_PORT_NEW);
-	return 0;
+	return ops->port_function_hw_addr_set(devlink, port, hw_addr, hw_addr_len, extack);
 }
 
 static int
@@ -1037,6 +1036,8 @@ devlink_port_function_set(struct devlink *devlink, struct devlink_port *port,
 	if (attr)
 		err = devlink_port_function_hw_addr_set(devlink, port, attr, extack);
 
+	if (!err)
+		devlink_port_notify(port, DEVLINK_CMD_PORT_NEW);
 	return err;
 }
 
-- 
2.26.2



* [net-next v4 03/15] devlink: Introduce PCI SF port flavour and port attribute
  2020-12-14 21:43 [net-next v4 00/15] Add mlx5 subfunction support Saeed Mahameed
  2020-12-14 21:43 ` [net-next v4 01/15] net/mlx5: Fix compilation warning for 32-bit platform Saeed Mahameed
  2020-12-14 21:43 ` [net-next v4 02/15] devlink: Prepare code to fill multiple port function attributes Saeed Mahameed
@ 2020-12-14 21:43 ` Saeed Mahameed
  2020-12-14 21:43 ` [net-next v4 04/15] devlink: Support add and delete devlink port Saeed Mahameed
                   ` (12 subsequent siblings)
From: Saeed Mahameed @ 2020-12-14 21:43 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Jason Gunthorpe
  Cc: Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Parav Pandit, Jiri Pirko, Vu Pham, Saeed Mahameed

From: Parav Pandit <parav@nvidia.com>

A PCI sub-function (SF) represents a portion of the device, similar
to a PCI VF.

In an eswitch, a PCI SF may have a port which is normally represented
using a representor netdevice.
To give better visibility of the eswitch port, its association with the
SF, and its representor netdevice, introduce a PCI SF port flavour.

When the devlink port flavour is PCI SF, fill in the PCI SF attributes
of the port.

Extend port name creation using a PCI PF and SF number scheme on a
best-effort basis, so that vendor drivers can skip defining their own
scheme.
The name is formed as cApfNsfM, where A, N and M are the controller,
PCI PF and PCI SF numbers respectively; the cA prefix is used only for
ports of an external controller, so PF 0 and SF 88 on the local
controller yield pf0sf88.
This is similar to the existing naming for PCI PF and PCI VF ports.

An example view of a PCI SF port:

$ devlink port show pci/0000:06:00.0/32768
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
  function:
    hw_addr 00:00:00:00:88:88 state active opstate attached

$ devlink port show pci/0000:06:00.0/32768 -jp
{
    "port": {
        "pci/0000:06:00.0/32768": {
            "type": "eth",
            "netdev": "ens2f0npf0sf88",
            "flavour": "pcisf",
            "controller": 0,
            "pfnum": 0,
            "sfnum": 88,
            "external": false,
            "splittable": false,
            "function": {
                "hw_addr": "00:00:00:00:88:88",
                "state": "active",
                "opstate": "attached"
            }
        }
    }
}
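
As a hedged driver-side sketch (my_register_sf_port() and its argument
choices are illustrative, not mlx5 code), a vendor driver fills these
attributes with the new helper before registering the port:

#include <net/devlink.h>

static int my_register_sf_port(struct devlink *devlink,
			       struct devlink_port *dl_port,
			       unsigned int port_index,
			       u16 pfnum, u32 sfnum)
{
	/* Local (controller 0, non-external) eswitch port of an SF. */
	devlink_port_attrs_pci_sf_set(dl_port, 0, pfnum, sfnum, false);
	return devlink_port_register(devlink, dl_port, port_index);
}

With pfnum 0 and sfnum 88 this produces the "pf0sf88" phys port name
from which the ens2f0npf0sf88 netdev name shown above is derived.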

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 include/net/devlink.h        | 17 +++++++++++++
 include/uapi/linux/devlink.h |  5 ++++
 net/core/devlink.c           | 46 ++++++++++++++++++++++++++++++++++++
 3 files changed, 68 insertions(+)

diff --git a/include/net/devlink.h b/include/net/devlink.h
index f466819cc477..5bd43f0a79a8 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -93,6 +93,20 @@ struct devlink_port_pci_vf_attrs {
 	u8 external:1;
 };
 
+/**
+ * struct devlink_port_pci_sf_attrs - devlink port's PCI SF attributes
+ * @controller: Associated controller number
+ * @pf: Associated PCI PF number for this port.
+ * @sf: Associated PCI SF of the PCI PF for this port.
+ * @external: when set, indicates if a port is for an external controller
+ */
+struct devlink_port_pci_sf_attrs {
+	u32 controller;
+	u16 pf;
+	u32 sf;
+	u8 external:1;
+};
+
 /**
  * struct devlink_port_attrs - devlink port object
  * @flavour: flavour of the port
@@ -114,6 +128,7 @@ struct devlink_port_attrs {
 		struct devlink_port_phys_attrs phys;
 		struct devlink_port_pci_pf_attrs pci_pf;
 		struct devlink_port_pci_vf_attrs pci_vf;
+		struct devlink_port_pci_sf_attrs pci_sf;
 	};
 };
 
@@ -1404,6 +1419,8 @@ void devlink_port_attrs_pci_pf_set(struct devlink_port *devlink_port, u32 contro
 				   u16 pf, bool external);
 void devlink_port_attrs_pci_vf_set(struct devlink_port *devlink_port, u32 controller,
 				   u16 pf, u16 vf, bool external);
+void devlink_port_attrs_pci_sf_set(struct devlink_port *devlink_port, u32 controller,
+				   u16 pf, u32 sf, bool external);
 int devlink_sb_register(struct devlink *devlink, unsigned int sb_index,
 			u32 size, u16 ingress_pools_count,
 			u16 egress_pools_count, u16 ingress_tc_count,
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index 5203f54a2be1..6fe00f10eb3f 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -200,6 +200,10 @@ enum devlink_port_flavour {
 	DEVLINK_PORT_FLAVOUR_UNUSED, /* Port which exists in the switch, but
 				      * is not used in any way.
 				      */
+	DEVLINK_PORT_FLAVOUR_PCI_SF, /* Represents eswitch port
+				      * for the PCI SF. It is an internal
+				      * port that faces the PCI SF.
+				      */
 };
 
 enum devlink_param_cmode {
@@ -529,6 +533,7 @@ enum devlink_attr {
 	DEVLINK_ATTR_RELOAD_ACTION_INFO,        /* nested */
 	DEVLINK_ATTR_RELOAD_ACTION_STATS,       /* nested */
 
+	DEVLINK_ATTR_PORT_PCI_SF_NUMBER,	/* u32 */
 	/* add new attributes above here, update the policy in devlink.c */
 
 	__DEVLINK_ATTR_MAX,
diff --git a/net/core/devlink.c b/net/core/devlink.c
index 13e0de80c4f9..08eac247f200 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -690,6 +690,15 @@ static int devlink_nl_port_attrs_put(struct sk_buff *msg,
 		if (nla_put_u8(msg, DEVLINK_ATTR_PORT_EXTERNAL, attrs->pci_vf.external))
 			return -EMSGSIZE;
 		break;
+	case DEVLINK_PORT_FLAVOUR_PCI_SF:
+		if (nla_put_u32(msg, DEVLINK_ATTR_PORT_CONTROLLER_NUMBER,
+				attrs->pci_sf.controller) ||
+		    nla_put_u16(msg, DEVLINK_ATTR_PORT_PCI_PF_NUMBER, attrs->pci_sf.pf) ||
+		    nla_put_u32(msg, DEVLINK_ATTR_PORT_PCI_SF_NUMBER, attrs->pci_sf.sf))
+			return -EMSGSIZE;
+		if (nla_put_u8(msg, DEVLINK_ATTR_PORT_EXTERNAL, attrs->pci_sf.external))
+			return -EMSGSIZE;
+		break;
 	case DEVLINK_PORT_FLAVOUR_PHYSICAL:
 	case DEVLINK_PORT_FLAVOUR_CPU:
 	case DEVLINK_PORT_FLAVOUR_DSA:
@@ -8373,6 +8382,33 @@ void devlink_port_attrs_pci_vf_set(struct devlink_port *devlink_port, u32 contro
 }
 EXPORT_SYMBOL_GPL(devlink_port_attrs_pci_vf_set);
 
+/**
+ *	devlink_port_attrs_pci_sf_set - Set PCI SF port attributes
+ *
+ *	@devlink_port: devlink port
+ *	@controller: associated controller number for the devlink port instance
+ *	@pf: associated PF for the devlink port instance
+ *	@sf: associated SF of a PF for the devlink port instance
+ *	@external: indicates if the port is for an external controller
+ */
+void devlink_port_attrs_pci_sf_set(struct devlink_port *devlink_port, u32 controller,
+				   u16 pf, u32 sf, bool external)
+{
+	struct devlink_port_attrs *attrs = &devlink_port->attrs;
+	int ret;
+
+	if (WARN_ON(devlink_port->registered))
+		return;
+	ret = __devlink_port_attrs_set(devlink_port, DEVLINK_PORT_FLAVOUR_PCI_SF);
+	if (ret)
+		return;
+	attrs->pci_sf.controller = controller;
+	attrs->pci_sf.pf = pf;
+	attrs->pci_sf.sf = sf;
+	attrs->pci_sf.external = external;
+}
+EXPORT_SYMBOL_GPL(devlink_port_attrs_pci_sf_set);
+
 static int __devlink_port_phys_port_name_get(struct devlink_port *devlink_port,
 					     char *name, size_t len)
 {
@@ -8421,6 +8457,16 @@ static int __devlink_port_phys_port_name_get(struct devlink_port *devlink_port,
 		n = snprintf(name, len, "pf%uvf%u",
 			     attrs->pci_vf.pf, attrs->pci_vf.vf);
 		break;
+	case DEVLINK_PORT_FLAVOUR_PCI_SF:
+		if (attrs->pci_sf.external) {
+			n = snprintf(name, len, "c%u", attrs->pci_sf.controller);
+			if (n >= len)
+				return -EINVAL;
+			len -= n;
+			name += n;
+		}
+		n = snprintf(name, len, "pf%usf%u", attrs->pci_sf.pf, attrs->pci_sf.sf);
+		break;
 	}
 
 	if (n >= len)
-- 
2.26.2



* [net-next v4 04/15] devlink: Support add and delete devlink port
  2020-12-14 21:43 [net-next v4 00/15] Add mlx5 subfunction support Saeed Mahameed
                   ` (2 preceding siblings ...)
  2020-12-14 21:43 ` [net-next v4 03/15] devlink: Introduce PCI SF port flavour and port attribute Saeed Mahameed
@ 2020-12-14 21:43 ` Saeed Mahameed
  2020-12-14 21:43 ` [net-next v4 05/15] devlink: Support get and set state of port function Saeed Mahameed
                   ` (11 subsequent siblings)
From: Saeed Mahameed @ 2020-12-14 21:43 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Jason Gunthorpe
  Cc: Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Parav Pandit, Jiri Pirko, Vu Pham, Saeed Mahameed

From: Parav Pandit <parav@nvidia.com>

Extend the devlink interface for the user to add and delete ports.
Extend devlink to connect user requests to the driver to add/delete
such ports in the device.

When the driver routines are invoked, the devlink instance lock is not
held. This enables the driver to perform registration/unregistration of
several devlink objects (port, health reporter, resource, etc.) by
using existing devlink APIs.
This also helps to uniformly use the same code for port unregistration
during driver unload and during port deletion initiated by the user.
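
A hedged sketch of how a driver might wire up these callbacks (the
my_sf_port_* helpers are placeholders; the driver serializes add/delete
itself because the devlink instance lock is not held):

#include <net/devlink.h>

static int my_port_new(struct devlink *devlink,
		       const struct devlink_port_new_attrs *attrs,
		       struct netlink_ext_ack *extack)
{
	if (attrs->flavour != DEVLINK_PORT_FLAVOUR_PCI_SF ||
	    !attrs->sfnum_valid) {
		NL_SET_ERR_MSG_MOD(extack, "Only PCI SF ports with sfnum are supported");
		return -EOPNOTSUPP;
	}
	/* Creates the SF and registers its devlink port with the core. */
	return my_sf_port_create(devlink, attrs->pfnum, attrs->sfnum, extack);
}

static int my_port_del(struct devlink *devlink, unsigned int port_index,
		       struct netlink_ext_ack *extack)
{
	/* Unregisters the devlink port and frees the SF. */
	return my_sf_port_destroy(devlink, port_index, extack);
}

static const struct devlink_ops my_devlink_ops = {
	.port_new = my_port_new,
	.port_del = my_port_del,
};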

Examples of add, show and delete commands:
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev

$ devlink port show
pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 splittable false

$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88

$ devlink port show pci/0000:06:00.0/32768
pci/0000:06:00.0/32768: type eth netdev eth0 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
  function:
    hw_addr 00:00:00:00:88:88 state inactive opstate detached

$ udevadm test-builtin net_id /sys/class/net/eth0
Load module index
Parsed configuration file /usr/lib/systemd/network/99-default.link
Created link configuration context.
Using default interface naming scheme 'v245'.
ID_NET_NAMING_SCHEME=v245
ID_NET_NAME_PATH=enp6s0f0npf0sf88
ID_NET_NAME_SLOT=ens2f0npf0sf88
Unload module index
Unloaded link configuration context.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 include/net/devlink.h | 39 ++++++++++++++++++++++++
 net/core/devlink.c    | 71 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 110 insertions(+)

diff --git a/include/net/devlink.h b/include/net/devlink.h
index 5bd43f0a79a8..f8cff3e402da 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -153,6 +153,17 @@ struct devlink_port {
 	struct mutex reporters_lock; /* Protects reporter_list */
 };
 
+struct devlink_port_new_attrs {
+	enum devlink_port_flavour flavour;
+	unsigned int port_index;
+	u32 controller;
+	u32 sfnum;
+	u16 pfnum;
+	u8 port_index_valid:1,
+	   controller_valid:1,
+	   sfnum_valid:1;
+};
+
 struct devlink_sb_pool_info {
 	enum devlink_sb_pool_type pool_type;
 	u32 size;
@@ -1363,6 +1374,34 @@ struct devlink_ops {
 	int (*port_function_hw_addr_set)(struct devlink *devlink, struct devlink_port *port,
 					 const u8 *hw_addr, int hw_addr_len,
 					 struct netlink_ext_ack *extack);
+	/**
+	 * @port_new: Port add function.
+	 *
+	 * Should be used by device driver to let caller add new port of a
+	 * specified flavour with optional attributes.
+	 * Driver should return -EOPNOTSUPP if it doesn't support port addition
+	 * of a specified flavour or specified attributes. Driver should set
+	 * extack error message in case of fail to add the port. Devlink core
+	 * does not hold a devlink instance lock when this callback is invoked.
+	 * Driver must ensure synchronization when adding or deleting a port.
+	 * Driver must register a port with devlink core.
+	 */
+	int (*port_new)(struct devlink *devlink,
+			const struct devlink_port_new_attrs *attrs,
+			struct netlink_ext_ack *extack);
+	/**
+	 * @port_del: Port delete function.
+	 *
+	 * Should be used by device driver to let caller delete port which was
+	 * previously created using port_new() callback.
+	 * Driver should return -EOPNOTSUPP if it doesn't support port deletion.
+	 * Driver should set extack error message in case of fail to delete the
+	 * port. Devlink core does not hold a devlink instance lock when this
+	 * callback is invoked. Driver must ensure synchronization when adding
+	 * or deleting a port. Driver must register a port with devlink core.
+	 */
+	int (*port_del)(struct devlink *devlink, unsigned int port_index,
+			struct netlink_ext_ack *extack);
 };
 
 static inline void *devlink_priv(struct devlink *devlink)
diff --git a/net/core/devlink.c b/net/core/devlink.c
index 08eac247f200..11043707f63f 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -1146,6 +1146,61 @@ static int devlink_nl_cmd_port_unsplit_doit(struct sk_buff *skb,
 	return devlink_port_unsplit(devlink, port_index, info->extack);
 }
 
+static int devlink_nl_cmd_port_new_doit(struct sk_buff *skb,
+					struct genl_info *info)
+{
+	struct netlink_ext_ack *extack = info->extack;
+	struct devlink_port_new_attrs new_attrs = {};
+	struct devlink *devlink = info->user_ptr[0];
+
+	if (!info->attrs[DEVLINK_ATTR_PORT_FLAVOUR] ||
+	    !info->attrs[DEVLINK_ATTR_PORT_PCI_PF_NUMBER]) {
+		NL_SET_ERR_MSG_MOD(extack, "Port flavour or PCI PF are not specified");
+		return -EINVAL;
+	}
+	new_attrs.flavour = nla_get_u16(info->attrs[DEVLINK_ATTR_PORT_FLAVOUR]);
+	new_attrs.pfnum =
+		nla_get_u16(info->attrs[DEVLINK_ATTR_PORT_PCI_PF_NUMBER]);
+
+	if (info->attrs[DEVLINK_ATTR_PORT_INDEX]) {
+		new_attrs.port_index =
+			nla_get_u32(info->attrs[DEVLINK_ATTR_PORT_INDEX]);
+		new_attrs.port_index_valid = true;
+	}
+	if (info->attrs[DEVLINK_ATTR_PORT_CONTROLLER_NUMBER]) {
+		new_attrs.controller =
+			nla_get_u16(info->attrs[DEVLINK_ATTR_PORT_CONTROLLER_NUMBER]);
+		new_attrs.controller_valid = true;
+	}
+	if (info->attrs[DEVLINK_ATTR_PORT_PCI_SF_NUMBER]) {
+		new_attrs.sfnum = nla_get_u32(info->attrs[DEVLINK_ATTR_PORT_PCI_SF_NUMBER]);
+		new_attrs.sfnum_valid = true;
+	}
+
+	if (!devlink->ops->port_new)
+		return -EOPNOTSUPP;
+
+	return devlink->ops->port_new(devlink, &new_attrs, extack);
+}
+
+static int devlink_nl_cmd_port_del_doit(struct sk_buff *skb,
+					struct genl_info *info)
+{
+	struct netlink_ext_ack *extack = info->extack;
+	struct devlink *devlink = info->user_ptr[0];
+	unsigned int port_index;
+
+	if (!info->attrs[DEVLINK_ATTR_PORT_INDEX]) {
+		NL_SET_ERR_MSG_MOD(extack, "Port index is not specified");
+		return -EINVAL;
+	}
+	port_index = nla_get_u32(info->attrs[DEVLINK_ATTR_PORT_INDEX]);
+
+	if (!devlink->ops->port_del)
+		return -EOPNOTSUPP;
+	return devlink->ops->port_del(devlink, port_index, extack);
+}
+
 static int devlink_nl_sb_fill(struct sk_buff *msg, struct devlink *devlink,
 			      struct devlink_sb *devlink_sb,
 			      enum devlink_command cmd, u32 portid,
@@ -7604,6 +7659,10 @@ static const struct nla_policy devlink_nl_policy[DEVLINK_ATTR_MAX + 1] = {
 	[DEVLINK_ATTR_RELOAD_ACTION] = NLA_POLICY_RANGE(NLA_U8, DEVLINK_RELOAD_ACTION_DRIVER_REINIT,
 							DEVLINK_RELOAD_ACTION_MAX),
 	[DEVLINK_ATTR_RELOAD_LIMITS] = NLA_POLICY_BITFIELD32(DEVLINK_RELOAD_LIMITS_VALID_MASK),
+	[DEVLINK_ATTR_PORT_FLAVOUR] = { .type = NLA_U16 },
+	[DEVLINK_ATTR_PORT_PCI_PF_NUMBER] = { .type = NLA_U16 },
+	[DEVLINK_ATTR_PORT_PCI_SF_NUMBER] = { .type = NLA_U32 },
+	[DEVLINK_ATTR_PORT_CONTROLLER_NUMBER] = { .type = NLA_U32 },
 };
 
 static const struct genl_small_ops devlink_nl_ops[] = {
@@ -7643,6 +7702,18 @@ static const struct genl_small_ops devlink_nl_ops[] = {
 		.flags = GENL_ADMIN_PERM,
 		.internal_flags = DEVLINK_NL_FLAG_NO_LOCK,
 	},
+	{
+		.cmd = DEVLINK_CMD_PORT_NEW,
+		.doit = devlink_nl_cmd_port_new_doit,
+		.flags = GENL_ADMIN_PERM,
+		.internal_flags = DEVLINK_NL_FLAG_NO_LOCK,
+	},
+	{
+		.cmd = DEVLINK_CMD_PORT_DEL,
+		.doit = devlink_nl_cmd_port_del_doit,
+		.flags = GENL_ADMIN_PERM,
+		.internal_flags = DEVLINK_NL_FLAG_NO_LOCK,
+	},
 	{
 		.cmd = DEVLINK_CMD_SB_GET,
 		.validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
-- 
2.26.2



* [net-next v4 05/15] devlink: Support get and set state of port function
  2020-12-14 21:43 [net-next v4 00/15] Add mlx5 subfunction support Saeed Mahameed
                   ` (3 preceding siblings ...)
  2020-12-14 21:43 ` [net-next v4 04/15] devlink: Support add and delete devlink port Saeed Mahameed
@ 2020-12-14 21:43 ` Saeed Mahameed
  2020-12-14 21:43 ` [net-next v4 06/15] net/mlx5: Introduce vhca state event notifier Saeed Mahameed
                   ` (10 subsequent siblings)
From: Saeed Mahameed @ 2020-12-14 21:43 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Jason Gunthorpe
  Cc: Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Parav Pandit, Jiri Pirko, Vu Pham, Saeed Mahameed

From: Parav Pandit <parav@nvidia.com>

A devlink port function can be in the active or inactive state.
Allow users to get and set the port function's state.

When the port function is activated, its operational state may change
after a while, once the device is created and a driver binds to it.
Similarly for the deactivation flow.

To clearly describe the state of the port function and its device's
operational state in the host system, define state and opstate
attributes.
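
A hedged driver-side sketch of the two callbacks (the my_sf_* queries
and actions stand in for the driver's own bookkeeping):

#include <net/devlink.h>

static int my_port_fn_state_get(struct devlink *devlink,
				struct devlink_port *port,
				enum devlink_port_function_state *state,
				enum devlink_port_function_opstate *opstate,
				struct netlink_ext_ack *extack)
{
	*state = my_sf_is_active(port) ?
		 DEVLINK_PORT_FUNCTION_STATE_ACTIVE :
		 DEVLINK_PORT_FUNCTION_STATE_INACTIVE;
	*opstate = my_sf_driver_attached(port) ?
		   DEVLINK_PORT_FUNCTION_OPSTATE_ATTACHED :
		   DEVLINK_PORT_FUNCTION_OPSTATE_DETACHED;
	return 0;
}

static int my_port_fn_state_set(struct devlink *devlink,
				struct devlink_port *port,
				enum devlink_port_function_state state,
				struct netlink_ext_ack *extack)
{
	return state == DEVLINK_PORT_FUNCTION_STATE_ACTIVE ?
	       my_sf_activate(port, extack) :
	       my_sf_deactivate(port, extack);
}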

Example of a PCI SF port which supports a port function:
Create a device with ID=10 and one physical port.

$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev

$ devlink port show
pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 splittable false

$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88

$ devlink port show pci/0000:06:00.0/32768
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
  function:
    hw_addr 00:00:00:00:88:88 state inactive opstate detached

$ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88 state active

$ devlink port show pci/0000:06:00.0/32768 -jp
{
    "port": {
        "pci/0000:06:00.0/32768": {
            "type": "eth",
            "netdev": "ens2f0npf0sf88",
            "flavour": "pcisf",
            "controller": 0,
            "pfnum": 0,
            "sfnum": 88,
            "external": false,
            "splittable": false,
            "function": {
                "hw_addr": "00:00:00:00:88:88",
                "state": "active",
                "opstate": "attached"
            }
        }
    }
}

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 include/net/devlink.h        | 23 +++++++++
 include/uapi/linux/devlink.h | 21 +++++++++
 net/core/devlink.c           | 90 +++++++++++++++++++++++++++++++++++-
 3 files changed, 133 insertions(+), 1 deletion(-)

diff --git a/include/net/devlink.h b/include/net/devlink.h
index f8cff3e402da..18a7e66b7982 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -1374,6 +1374,29 @@ struct devlink_ops {
 	int (*port_function_hw_addr_set)(struct devlink *devlink, struct devlink_port *port,
 					 const u8 *hw_addr, int hw_addr_len,
 					 struct netlink_ext_ack *extack);
+	/**
+	 * @port_function_state_get: Port function's state get function.
+	 *
+	 * Should be used by device drivers to report the state of a function
+	 * managed by the devlink port. Driver should return -EOPNOTSUPP if it
+	 * doesn't support port function handling for a particular port.
+	 */
+	int (*port_function_state_get)(struct devlink *devlink,
+				       struct devlink_port *port,
+				       enum devlink_port_function_state *state,
+				       enum devlink_port_function_opstate *opstate,
+				       struct netlink_ext_ack *extack);
+	/**
+	 * @port_function_state_set: Port function's state set function.
+	 *
+	 * Should be used by device drivers to set the state of a function
+	 * managed by the devlink port. Driver should return -EOPNOTSUPP if it
+	 * doesn't support port function handling for a particular port.
+	 */
+	int (*port_function_state_set)(struct devlink *devlink,
+				       struct devlink_port *port,
+				       enum devlink_port_function_state state,
+				       struct netlink_ext_ack *extack);
 	/**
 	 * @port_new: Port add function.
 	 *
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index 6fe00f10eb3f..beeb30bb6b20 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -583,9 +583,30 @@ enum devlink_resource_unit {
 enum devlink_port_function_attr {
 	DEVLINK_PORT_FUNCTION_ATTR_UNSPEC,
 	DEVLINK_PORT_FUNCTION_ATTR_HW_ADDR,	/* binary */
+	DEVLINK_PORT_FUNCTION_ATTR_STATE,	/* u8 */
+	DEVLINK_PORT_FUNCTION_ATTR_OPSTATE,	/* u8 */
 
 	__DEVLINK_PORT_FUNCTION_ATTR_MAX,
 	DEVLINK_PORT_FUNCTION_ATTR_MAX = __DEVLINK_PORT_FUNCTION_ATTR_MAX - 1
 };
 
+enum devlink_port_function_state {
+	DEVLINK_PORT_FUNCTION_STATE_INACTIVE,
+	DEVLINK_PORT_FUNCTION_STATE_ACTIVE,
+};
+
+/**
+ * enum devlink_port_function_opstate - indicates operational state of port function
+ * @DEVLINK_PORT_FUNCTION_OPSTATE_ATTACHED: Driver is attached to the function of port, for
+ *					    graceful tear down of the function, after
+ *					    inactivation of the port function, user should wait
+ *					    for operational state to turn DETACHED.
+ * @DEVLINK_PORT_FUNCTION_OPSTATE_DETACHED: Driver is detached from the function of port; it is
+ *					    safe to delete the port.
+ */
+enum devlink_port_function_opstate {
+	DEVLINK_PORT_FUNCTION_OPSTATE_DETACHED,
+	DEVLINK_PORT_FUNCTION_OPSTATE_ATTACHED,
+};
+
 #endif /* _UAPI_LINUX_DEVLINK_H_ */
diff --git a/net/core/devlink.c b/net/core/devlink.c
index 11043707f63f..b8acb8842aa1 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -87,6 +87,9 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(devlink_trap_report);
 
 static const struct nla_policy devlink_function_nl_policy[DEVLINK_PORT_FUNCTION_ATTR_MAX + 1] = {
 	[DEVLINK_PORT_FUNCTION_ATTR_HW_ADDR] = { .type = NLA_BINARY },
+	[DEVLINK_PORT_FUNCTION_ATTR_STATE] =
+		NLA_POLICY_RANGE(NLA_U8, DEVLINK_PORT_FUNCTION_STATE_INACTIVE,
+				 DEVLINK_PORT_FUNCTION_STATE_ACTIVE),
 };
 
 static LIST_HEAD(devlink_list);
@@ -746,6 +749,57 @@ devlink_port_function_hw_addr_fill(struct devlink *devlink, const struct devlink
 	return 0;
 }
 
+static bool
+devlink_port_function_state_valid(enum devlink_port_function_state state)
+{
+	return state == DEVLINK_PORT_FUNCTION_STATE_INACTIVE ||
+	       state == DEVLINK_PORT_FUNCTION_STATE_ACTIVE;
+}
+
+static bool
+devlink_port_function_opstate_valid(enum devlink_port_function_opstate state)
+{
+	return state == DEVLINK_PORT_FUNCTION_OPSTATE_DETACHED ||
+	       state == DEVLINK_PORT_FUNCTION_OPSTATE_ATTACHED;
+}
+
+static int
+devlink_port_function_state_fill(struct devlink *devlink,
+				 const struct devlink_ops *ops,
+				 struct devlink_port *port, struct sk_buff *msg,
+				 struct netlink_ext_ack *extack,
+				 bool *msg_updated)
+{
+	enum devlink_port_function_opstate opstate;
+	enum devlink_port_function_state state;
+	int err;
+
+	if (!ops->port_function_state_get)
+		return 0;
+
+	err = ops->port_function_state_get(devlink, port, &state, &opstate, extack);
+	if (err) {
+		if (err == -EOPNOTSUPP)
+			return 0;
+		return err;
+	}
+	if (!devlink_port_function_state_valid(state)) {
+		WARN_ON_ONCE(1);
+		NL_SET_ERR_MSG_MOD(extack, "Invalid state value read from driver");
+		return -EINVAL;
+	}
+	if (!devlink_port_function_opstate_valid(opstate)) {
+		WARN_ON_ONCE(1);
+		NL_SET_ERR_MSG_MOD(extack, "Invalid operational state value read from driver");
+		return -EINVAL;
+	}
+	if (nla_put_u8(msg, DEVLINK_PORT_FUNCTION_ATTR_STATE, state) ||
+	    nla_put_u8(msg, DEVLINK_PORT_FUNCTION_ATTR_OPSTATE, opstate))
+		return -EMSGSIZE;
+	*msg_updated = true;
+	return 0;
+}
+
 static int
 devlink_nl_port_function_attrs_put(struct sk_buff *msg, struct devlink_port *port,
 				   struct netlink_ext_ack *extack)
@@ -762,6 +816,13 @@ devlink_nl_port_function_attrs_put(struct sk_buff *msg, struct devlink_port *por
 
 	ops = devlink->ops;
 	err = devlink_port_function_hw_addr_fill(devlink, ops, port, msg, extack, &msg_updated);
+	if (err)
+		goto out;
+	err = devlink_port_function_state_fill(devlink, ops, port, msg, extack,
+					       &msg_updated);
+	if (err)
+		goto out;
+out:
 	if (err || !msg_updated)
 		nla_nest_cancel(msg, function_attr);
 	else
@@ -1027,6 +1088,22 @@ devlink_port_function_hw_addr_set(struct devlink *devlink, struct devlink_port *
 	return ops->port_function_hw_addr_set(devlink, port, hw_addr, hw_addr_len, extack);
 }
 
+static int
+devlink_port_function_state_set(struct devlink *devlink, struct devlink_port *port,
+				const struct nlattr *attr, struct netlink_ext_ack *extack)
+{
+	enum devlink_port_function_state state;
+	const struct devlink_ops *ops;
+
+	state = nla_get_u8(attr);
+	ops = devlink->ops;
+	if (!ops->port_function_state_set) {
+		NL_SET_ERR_MSG_MOD(extack, "Port function does not support state setting");
+		return -EOPNOTSUPP;
+	}
+	return ops->port_function_state_set(devlink, port, state, extack);
+}
+
 static int
 devlink_port_function_set(struct devlink *devlink, struct devlink_port *port,
 			  const struct nlattr *attr, struct netlink_ext_ack *extack)
@@ -1042,8 +1119,19 @@ devlink_port_function_set(struct devlink *devlink, struct devlink_port *port,
 	}
 
 	attr = tb[DEVLINK_PORT_FUNCTION_ATTR_HW_ADDR];
-	if (attr)
+	if (attr) {
 		err = devlink_port_function_hw_addr_set(devlink, port, attr, extack);
+		if (err)
+			return err;
+	}
+	/* Keep this as the last function attribute set, so that when
+	 * multiple port function attributes are set along with state,
+	 * Those can be applied first before activating the state.
+	 */
+	attr = tb[DEVLINK_PORT_FUNCTION_ATTR_STATE];
+	if (attr)
+		err = devlink_port_function_state_set(devlink, port, attr,
+						      extack);
 
 	if (!err)
 		devlink_port_notify(port, DEVLINK_CMD_PORT_NEW);
-- 
2.26.2



* [net-next v4 06/15] net/mlx5: Introduce vhca state event notifier
  2020-12-14 21:43 [net-next v4 00/15] Add mlx5 subfunction support Saeed Mahameed
                   ` (4 preceding siblings ...)
  2020-12-14 21:43 ` [net-next v4 05/15] devlink: Support get and set state of port function Saeed Mahameed
@ 2020-12-14 21:43 ` Saeed Mahameed
  2020-12-14 21:43 ` [net-next v4 07/15] net/mlx5: SF, Add auxiliary device support Saeed Mahameed
                   ` (9 subsequent siblings)
From: Saeed Mahameed @ 2020-12-14 21:43 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Jason Gunthorpe
  Cc: Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Parav Pandit, Vu Pham, Saeed Mahameed

From: Parav Pandit <parav@nvidia.com>

vhca state events indicate a change in the state of the vhca that may
occur due to SF allocation, deallocation, or enabling/disabling of the
SF HCA.

Introduce a vhca state event handler which will be used by the SF
devlink port manager and the SF hardware id allocator in subsequent
patches to act on the event.

This enables a single entity to subscribe to, query and rearm the event
for a function.
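
As a rough sketch of a subscriber (the event fields follow their use in
this patch; mlx5_vhca_event_notifier_register() is an assumed
registration entry point in sf/vhca_event.h, which is not shown in full
here):

#include <linux/notifier.h>
#include <linux/printk.h>
#include "vhca_event.h"

static int my_vhca_state_cb(struct notifier_block *nb,
			    unsigned long opcode, void *data)
{
	struct mlx5_vhca_state_event *event = data;

	/* React to the state change, e.g. add or remove an SF auxiliary
	 * device based on event->new_vhca_state.
	 */
	pr_debug("function_id %u sw_function_id %u new_vhca_state %d\n",
		 event->function_id, event->sw_function_id,
		 event->new_vhca_state);
	return NOTIFY_OK;
}

/* Assumed registration call, performed once per subscriber:
 * mlx5_vhca_event_notifier_register(dev, &my_nb);
 */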

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/Kconfig   |   9 +
 .../net/ethernet/mellanox/mlx5/core/Makefile  |   4 +
 drivers/net/ethernet/mellanox/mlx5/core/cmd.c |   4 +
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  |   3 +
 .../net/ethernet/mellanox/mlx5/core/events.c  |   7 +
 .../net/ethernet/mellanox/mlx5/core/main.c    |  16 ++
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |   2 +
 .../mlx5/core/sf/mlx5_ifc_vhca_event.h        |  82 ++++++++
 .../net/ethernet/mellanox/mlx5/core/sf/sf.h   |  45 +++++
 .../mellanox/mlx5/core/sf/vhca_event.c        | 189 ++++++++++++++++++
 .../mellanox/mlx5/core/sf/vhca_event.h        |  57 ++++++
 include/linux/mlx5/driver.h                   |   4 +
 12 files changed, 422 insertions(+)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/mlx5_ifc_vhca_event.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/vhca_event.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/vhca_event.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
index 6e4d7bb7fea2..d6c48582e7a8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
@@ -203,3 +203,12 @@ config MLX5_SW_STEERING
 	default y
 	help
 	Build support for software-managed steering in the NIC.
+
+config MLX5_SF
+	bool "Mellanox Technologies subfunction device support using auxiliary device"
+	depends on MLX5_CORE && MLX5_CORE_EN
+	default n
+	help
+	Build support for subfunction device in the NIC. A Mellanox subfunction
+	device can support RDMA, netdevice and vdpa device.
+	It is similar to a SRIOV VF but it doesn't require SRIOV support.
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 77961643d5a9..292c02c4828c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -85,3 +85,7 @@ mlx5_core-$(CONFIG_MLX5_SW_STEERING) += steering/dr_domain.o steering/dr_table.o
 					steering/dr_ste.o steering/dr_send.o \
 					steering/dr_cmd.o steering/dr_fw.o \
 					steering/dr_action.o steering/fs_dr.o
+#
+# SF device
+#
+mlx5_core-$(CONFIG_MLX5_SF) += sf/vhca_event.o
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
index 50c7b9ee80c3..47dcc3ac2cf0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
@@ -464,6 +464,8 @@ static int mlx5_internal_err_ret_value(struct mlx5_core_dev *dev, u16 op,
 	case MLX5_CMD_OP_ALLOC_MEMIC:
 	case MLX5_CMD_OP_MODIFY_XRQ:
 	case MLX5_CMD_OP_RELEASE_XRQ_ERROR:
+	case MLX5_CMD_OP_QUERY_VHCA_STATE:
+	case MLX5_CMD_OP_MODIFY_VHCA_STATE:
 		*status = MLX5_DRIVER_STATUS_ABORTED;
 		*synd = MLX5_DRIVER_SYND;
 		return -EIO;
@@ -657,6 +659,8 @@ const char *mlx5_command_str(int command)
 	MLX5_COMMAND_STR_CASE(DESTROY_UMEM);
 	MLX5_COMMAND_STR_CASE(RELEASE_XRQ_ERROR);
 	MLX5_COMMAND_STR_CASE(MODIFY_XRQ);
+	MLX5_COMMAND_STR_CASE(QUERY_VHCA_STATE);
+	MLX5_COMMAND_STR_CASE(MODIFY_VHCA_STATE);
 	default: return "unknown command opcode";
 	}
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index fc0afa03d407..421febebc658 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -595,6 +595,9 @@ static void gather_async_events_mask(struct mlx5_core_dev *dev, u64 mask[4])
 		async_event_mask |=
 			(1ull << MLX5_EVENT_TYPE_ESW_FUNCTIONS_CHANGED);
 
+	if (MLX5_CAP_GEN_MAX(dev, vhca_state))
+		async_event_mask |= (1ull << MLX5_EVENT_TYPE_VHCA_STATE_CHANGE);
+
 	mask[0] = async_event_mask;
 
 	if (MLX5_CAP_GEN(dev, event_cap))
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/events.c b/drivers/net/ethernet/mellanox/mlx5/core/events.c
index 3ce17c3d7a00..5523d218e5fb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/events.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/events.c
@@ -110,6 +110,8 @@ static const char *eqe_type_str(u8 type)
 		return "MLX5_EVENT_TYPE_CMD";
 	case MLX5_EVENT_TYPE_ESW_FUNCTIONS_CHANGED:
 		return "MLX5_EVENT_TYPE_ESW_FUNCTIONS_CHANGED";
+	case MLX5_EVENT_TYPE_VHCA_STATE_CHANGE:
+		return "MLX5_EVENT_TYPE_VHCA_STATE_CHANGE";
 	case MLX5_EVENT_TYPE_PAGE_REQUEST:
 		return "MLX5_EVENT_TYPE_PAGE_REQUEST";
 	case MLX5_EVENT_TYPE_PAGE_FAULT:
@@ -403,3 +405,8 @@ int mlx5_notifier_call_chain(struct mlx5_events *events, unsigned int event, voi
 {
 	return atomic_notifier_call_chain(&events->nh, event, data);
 }
+
+void mlx5_events_work_enqueue(struct mlx5_core_dev *dev, struct work_struct *work)
+{
+	queue_work(dev->priv.events->wq, work);
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index c08315b51fd3..6e67ad11c713 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -73,6 +73,7 @@
 #include "ecpf.h"
 #include "lib/hv_vhca.h"
 #include "diag/rsc_dump.h"
+#include "sf/vhca_event.h"
 
 MODULE_AUTHOR("Eli Cohen <eli@mellanox.com>");
 MODULE_DESCRIPTION("Mellanox 5th generation network adapters (ConnectX series) core driver");
@@ -567,6 +568,8 @@ static int handle_hca_cap(struct mlx5_core_dev *dev, void *set_ctx)
 	if (MLX5_CAP_GEN_MAX(dev, mkey_by_name))
 		MLX5_SET(cmd_hca_cap, set_hca_cap, mkey_by_name, 1);
 
+	mlx5_vhca_state_cap_handle(dev, set_hca_cap);
+
 	return set_caps(dev, set_ctx, MLX5_SET_HCA_CAP_OP_MOD_GENERAL_DEVICE);
 }
 
@@ -884,6 +887,12 @@ static int mlx5_init_once(struct mlx5_core_dev *dev)
 		goto err_eswitch_cleanup;
 	}
 
+	err = mlx5_vhca_event_init(dev);
+	if (err) {
+		mlx5_core_err(dev, "Failed to init vhca event notifier %d\n", err);
+		goto err_fpga_cleanup;
+	}
+
 	dev->dm = mlx5_dm_create(dev);
 	if (IS_ERR(dev->dm))
 		mlx5_core_warn(dev, "Failed to init device memory%d\n", err);
@@ -894,6 +903,8 @@ static int mlx5_init_once(struct mlx5_core_dev *dev)
 
 	return 0;
 
+err_fpga_cleanup:
+	mlx5_fpga_cleanup(dev);
 err_eswitch_cleanup:
 	mlx5_eswitch_cleanup(dev->priv.eswitch);
 err_sriov_cleanup:
@@ -925,6 +936,7 @@ static void mlx5_cleanup_once(struct mlx5_core_dev *dev)
 	mlx5_hv_vhca_destroy(dev->hv_vhca);
 	mlx5_fw_tracer_destroy(dev->tracer);
 	mlx5_dm_cleanup(dev);
+	mlx5_vhca_event_cleanup(dev);
 	mlx5_fpga_cleanup(dev);
 	mlx5_eswitch_cleanup(dev->priv.eswitch);
 	mlx5_sriov_cleanup(dev);
@@ -1129,6 +1141,8 @@ static int mlx5_load(struct mlx5_core_dev *dev)
 		goto err_sriov;
 	}
 
+	mlx5_vhca_event_start(dev);
+
 	err = mlx5_ec_init(dev);
 	if (err) {
 		mlx5_core_err(dev, "Failed to init embedded CPU\n");
@@ -1146,6 +1160,7 @@ static int mlx5_load(struct mlx5_core_dev *dev)
 err_sriov:
 	mlx5_ec_cleanup(dev);
 err_ec:
+	mlx5_vhca_event_stop(dev);
 	mlx5_cleanup_fs(dev);
 err_fs:
 	mlx5_accel_tls_cleanup(dev);
@@ -1173,6 +1188,7 @@ static void mlx5_unload(struct mlx5_core_dev *dev)
 {
 	mlx5_sriov_detach(dev);
 	mlx5_ec_cleanup(dev);
+	mlx5_vhca_event_stop(dev);
 	mlx5_cleanup_fs(dev);
 	mlx5_accel_ipsec_cleanup(dev);
 	mlx5_accel_tls_cleanup(dev);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index 0a0302ce7144..a33b7496d748 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -259,4 +259,6 @@ void mlx5_set_nic_state(struct mlx5_core_dev *dev, u8 state);
 
 void mlx5_unload_one(struct mlx5_core_dev *dev, bool cleanup);
 int mlx5_load_one(struct mlx5_core_dev *dev, bool boot);
+
+void mlx5_events_work_enqueue(struct mlx5_core_dev *dev, struct work_struct *work);
 #endif /* __MLX5_CORE_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/mlx5_ifc_vhca_event.h b/drivers/net/ethernet/mellanox/mlx5/core/sf/mlx5_ifc_vhca_event.h
new file mode 100644
index 000000000000..1daf5a122ba3
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/mlx5_ifc_vhca_event.h
@@ -0,0 +1,82 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/* Copyright (c) 2020 Mellanox Technologies Ltd */
+
+#ifndef __MLX5_IFC_VHCA_EVENT_H__
+#define __MLX5_IFC_VHCA_EVENT_H__
+
+enum mlx5_ifc_vhca_state {
+	MLX5_VHCA_STATE_INVALID = 0x0,
+	MLX5_VHCA_STATE_ALLOCATED = 0x1,
+	MLX5_VHCA_STATE_ACTIVE = 0x2,
+	MLX5_VHCA_STATE_IN_USE = 0x3,
+	MLX5_VHCA_STATE_TEARDOWN_REQUEST = 0x4,
+};
+
+struct mlx5_ifc_vhca_state_context_bits {
+	u8         arm_change_event[0x1];
+	u8         reserved_at_1[0xb];
+	u8         vhca_state[0x4];
+	u8         reserved_at_10[0x10];
+
+	u8         sw_function_id[0x20];
+
+	u8         reserved_at_40[0x80];
+};
+
+struct mlx5_ifc_query_vhca_state_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x40];
+
+	struct mlx5_ifc_vhca_state_context_bits vhca_state_context;
+};
+
+struct mlx5_ifc_query_vhca_state_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         embedded_cpu_function[0x1];
+	u8         reserved_at_41[0xf];
+	u8         function_id[0x10];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_vhca_state_field_select_bits {
+	u8         reserved_at_0[0x1e];
+	u8         sw_function_id[0x1];
+	u8         arm_change_event[0x1];
+};
+
+struct mlx5_ifc_modify_vhca_state_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x40];
+};
+
+struct mlx5_ifc_modify_vhca_state_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         embedded_cpu_function[0x1];
+	u8         reserved_at_41[0xf];
+	u8         function_id[0x10];
+
+	struct mlx5_ifc_vhca_state_field_select_bits vhca_state_field_select;
+
+	struct mlx5_ifc_vhca_state_context_bits vhca_state_context;
+};
+
+#endif
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h b/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h
new file mode 100644
index 000000000000..623191679b49
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h
@@ -0,0 +1,45 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/* Copyright (c) 2020 Mellanox Technologies Ltd */
+
+#ifndef __MLX5_SF_H__
+#define __MLX5_SF_H__
+
+#include <linux/mlx5/driver.h>
+
+static inline u16 mlx5_sf_start_function_id(const struct mlx5_core_dev *dev)
+{
+	return MLX5_CAP_GEN(dev, sf_base_id);
+}
+
+#ifdef CONFIG_MLX5_SF
+
+static inline bool mlx5_sf_supported(const struct mlx5_core_dev *dev)
+{
+	return MLX5_CAP_GEN(dev, sf);
+}
+
+static inline u16 mlx5_sf_max_functions(const struct mlx5_core_dev *dev)
+{
+	if (!mlx5_sf_supported(dev))
+		return 0;
+	if (MLX5_CAP_GEN(dev, max_num_sf))
+		return MLX5_CAP_GEN(dev, max_num_sf);
+	else
+		return 1 << MLX5_CAP_GEN(dev, log_max_sf);
+}
+
+#else
+
+static inline bool mlx5_sf_supported(const struct mlx5_core_dev *dev)
+{
+	return false;
+}
+
+static inline u16 mlx5_sf_max_functions(const struct mlx5_core_dev *dev)
+{
+	return 0;
+}
+
+#endif
+
+#endif
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/vhca_event.c b/drivers/net/ethernet/mellanox/mlx5/core/sf/vhca_event.c
new file mode 100644
index 000000000000..af2f2dd9db25
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/vhca_event.c
@@ -0,0 +1,189 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/* Copyright (c) 2020 Mellanox Technologies Ltd */
+
+#include <linux/mlx5/driver.h>
+#include "mlx5_ifc_vhca_event.h"
+#include "mlx5_core.h"
+#include "vhca_event.h"
+#include "ecpf.h"
+
+struct mlx5_vhca_state_notifier {
+	struct mlx5_core_dev *dev;
+	struct mlx5_nb nb;
+	struct blocking_notifier_head n_head;
+};
+
+struct mlx5_vhca_event_work {
+	struct work_struct work;
+	struct mlx5_vhca_state_notifier *notifier;
+	struct mlx5_vhca_state_event event;
+};
+
+int mlx5_cmd_query_vhca_state(struct mlx5_core_dev *dev, u16 function_id,
+			      bool ecpu, u32 *out, u32 outlen)
+{
+	u32 in[MLX5_ST_SZ_DW(query_vhca_state_in)] = {};
+
+	MLX5_SET(query_vhca_state_in, in, opcode, MLX5_CMD_OP_QUERY_VHCA_STATE);
+	MLX5_SET(query_vhca_state_in, in, function_id, function_id);
+	MLX5_SET(query_vhca_state_in, in, embedded_cpu_function, ecpu);
+
+	return mlx5_cmd_exec(dev, in, sizeof(in), out, outlen);
+}
+
+static int mlx5_cmd_modify_vhca_state(struct mlx5_core_dev *dev, u16 function_id,
+				      bool ecpu, u32 *in, u32 inlen)
+{
+	u32 out[MLX5_ST_SZ_DW(modify_vhca_state_out)] = {};
+
+	MLX5_SET(modify_vhca_state_in, in, opcode, MLX5_CMD_OP_MODIFY_VHCA_STATE);
+	MLX5_SET(modify_vhca_state_in, in, function_id, function_id);
+	MLX5_SET(modify_vhca_state_in, in, embedded_cpu_function, ecpu);
+
+	return mlx5_cmd_exec(dev, in, inlen, out, sizeof(out));
+}
+
+int mlx5_modify_vhca_sw_id(struct mlx5_core_dev *dev, u16 function_id, bool ecpu, u32 sw_fn_id)
+{
+	u32 out[MLX5_ST_SZ_DW(modify_vhca_state_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(modify_vhca_state_in)] = {};
+
+	MLX5_SET(modify_vhca_state_in, in, opcode, MLX5_CMD_OP_MODIFY_VHCA_STATE);
+	MLX5_SET(modify_vhca_state_in, in, function_id, function_id);
+	MLX5_SET(modify_vhca_state_in, in, embedded_cpu_function, ecpu);
+	MLX5_SET(modify_vhca_state_in, in, vhca_state_field_select.sw_function_id, 1);
+	MLX5_SET(modify_vhca_state_in, in, vhca_state_context.sw_function_id, sw_fn_id);
+
+	return mlx5_cmd_exec_inout(dev, modify_vhca_state, in, out);
+}
+
+int mlx5_vhca_event_arm(struct mlx5_core_dev *dev, u16 function_id, bool ecpu)
+{
+	u32 in[MLX5_ST_SZ_DW(modify_vhca_state_in)] = {};
+
+	MLX5_SET(modify_vhca_state_in, in, vhca_state_context.arm_change_event, 1);
+	MLX5_SET(modify_vhca_state_in, in, vhca_state_field_select.arm_change_event, 1);
+
+	return mlx5_cmd_modify_vhca_state(dev, function_id, ecpu, in, sizeof(in));
+}
+
+static void
+mlx5_vhca_event_notify(struct mlx5_core_dev *dev, struct mlx5_vhca_state_event *event)
+{
+	u32 out[MLX5_ST_SZ_DW(query_vhca_state_out)] = {};
+	int err;
+
+	err = mlx5_cmd_query_vhca_state(dev, event->function_id, event->ecpu, out, sizeof(out));
+	if (err)
+		return;
+
+	event->sw_function_id = MLX5_GET(query_vhca_state_out, out,
+					 vhca_state_context.sw_function_id);
+	event->new_vhca_state = MLX5_GET(query_vhca_state_out, out,
+					 vhca_state_context.vhca_state);
+
+	mlx5_vhca_event_arm(dev, event->function_id, event->ecpu);
+
+	blocking_notifier_call_chain(&dev->priv.vhca_state_notifier->n_head, 0, event);
+}
+
+static void mlx5_vhca_state_work_handler(struct work_struct *_work)
+{
+	struct mlx5_vhca_event_work *work = container_of(_work, struct mlx5_vhca_event_work, work);
+	struct mlx5_vhca_state_notifier *notifier = work->notifier;
+	struct mlx5_core_dev *dev = notifier->dev;
+
+	mlx5_vhca_event_notify(dev, &work->event);
+}
+
+static int
+mlx5_vhca_state_change_notifier(struct notifier_block *nb, unsigned long type, void *data)
+{
+	struct mlx5_vhca_state_notifier *notifier =
+				mlx5_nb_cof(nb, struct mlx5_vhca_state_notifier, nb);
+	struct mlx5_vhca_event_work *work;
+	struct mlx5_eqe *eqe = data;
+
+	work = kzalloc(sizeof(*work), GFP_ATOMIC);
+	if (!work)
+		return NOTIFY_DONE;
+	INIT_WORK(&work->work, &mlx5_vhca_state_work_handler);
+	work->notifier = notifier;
+	work->event.function_id = be16_to_cpu(eqe->data.vhca_state.function_id);
+	work->event.ecpu = be16_to_cpu(eqe->data.vhca_state.ec_function);
+	mlx5_events_work_enqueue(notifier->dev, &work->work);
+	return NOTIFY_OK;
+}
+
+void mlx5_vhca_state_cap_handle(struct mlx5_core_dev *dev, void *set_hca_cap)
+{
+	if (!mlx5_vhca_event_supported(dev))
+		return;
+
+	MLX5_SET(cmd_hca_cap, set_hca_cap, vhca_state, 1);
+	MLX5_SET(cmd_hca_cap, set_hca_cap, event_on_vhca_state_allocated, 1);
+	MLX5_SET(cmd_hca_cap, set_hca_cap, event_on_vhca_state_active, 1);
+	MLX5_SET(cmd_hca_cap, set_hca_cap, event_on_vhca_state_in_use, 1);
+	MLX5_SET(cmd_hca_cap, set_hca_cap, event_on_vhca_state_teardown_request, 1);
+}
+
+int mlx5_vhca_event_init(struct mlx5_core_dev *dev)
+{
+	struct mlx5_vhca_state_notifier *notifier;
+
+	if (!mlx5_vhca_event_supported(dev))
+		return 0;
+
+	notifier = kzalloc(sizeof(*notifier), GFP_KERNEL);
+	if (!notifier)
+		return -ENOMEM;
+
+	dev->priv.vhca_state_notifier = notifier;
+	notifier->dev = dev;
+	BLOCKING_INIT_NOTIFIER_HEAD(&notifier->n_head);
+	MLX5_NB_INIT(&notifier->nb, mlx5_vhca_state_change_notifier, VHCA_STATE_CHANGE);
+	return 0;
+}
+
+void mlx5_vhca_event_cleanup(struct mlx5_core_dev *dev)
+{
+	if (!mlx5_vhca_event_supported(dev))
+		return;
+
+	kfree(dev->priv.vhca_state_notifier);
+	dev->priv.vhca_state_notifier = NULL;
+}
+
+void mlx5_vhca_event_start(struct mlx5_core_dev *dev)
+{
+	struct mlx5_vhca_state_notifier *notifier;
+
+	if (!dev->priv.vhca_state_notifier)
+		return;
+
+	notifier = dev->priv.vhca_state_notifier;
+	mlx5_eq_notifier_register(dev, &notifier->nb);
+}
+
+void mlx5_vhca_event_stop(struct mlx5_core_dev *dev)
+{
+	struct mlx5_vhca_state_notifier *notifier;
+
+	if (!dev->priv.vhca_state_notifier)
+		return;
+
+	notifier = dev->priv.vhca_state_notifier;
+	mlx5_eq_notifier_unregister(dev, &notifier->nb);
+}
+
+int mlx5_vhca_event_notifier_register(struct mlx5_core_dev *dev, struct notifier_block *nb)
+{
+	if (!dev->priv.vhca_state_notifier)
+		return -EOPNOTSUPP;
+	return blocking_notifier_chain_register(&dev->priv.vhca_state_notifier->n_head, nb);
+}
+
+void mlx5_vhca_event_notifier_unregister(struct mlx5_core_dev *dev, struct notifier_block *nb)
+{
+	blocking_notifier_chain_unregister(&dev->priv.vhca_state_notifier->n_head, nb);
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/vhca_event.h b/drivers/net/ethernet/mellanox/mlx5/core/sf/vhca_event.h
new file mode 100644
index 000000000000..1fe1ec6f4d4b
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/vhca_event.h
@@ -0,0 +1,57 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/* Copyright (c) 2020 Mellanox Technologies Ltd */
+
+#ifndef __MLX5_VHCA_EVENT_H__
+#define __MLX5_VHCA_EVENT_H__
+
+#ifdef CONFIG_MLX5_SF
+
+struct mlx5_vhca_state_event {
+	u16 function_id;
+	u16 sw_function_id;
+	u8 new_vhca_state;
+	bool ecpu;
+};
+
+static inline bool mlx5_vhca_event_supported(const struct mlx5_core_dev *dev)
+{
+	return MLX5_CAP_GEN_MAX(dev, vhca_state);
+}
+
+void mlx5_vhca_state_cap_handle(struct mlx5_core_dev *dev, void *set_hca_cap);
+int mlx5_vhca_event_init(struct mlx5_core_dev *dev);
+void mlx5_vhca_event_cleanup(struct mlx5_core_dev *dev);
+void mlx5_vhca_event_start(struct mlx5_core_dev *dev);
+void mlx5_vhca_event_stop(struct mlx5_core_dev *dev);
+int mlx5_vhca_event_notifier_register(struct mlx5_core_dev *dev, struct notifier_block *nb);
+void mlx5_vhca_event_notifier_unregister(struct mlx5_core_dev *dev, struct notifier_block *nb);
+int mlx5_modify_vhca_sw_id(struct mlx5_core_dev *dev, u16 function_id, bool ecpu, u32 sw_fn_id);
+int mlx5_vhca_event_arm(struct mlx5_core_dev *dev, u16 function_id, bool ecpu);
+int mlx5_cmd_query_vhca_state(struct mlx5_core_dev *dev, u16 function_id,
+			      bool ecpu, u32 *out, u32 outlen);
+#else
+
+static inline void mlx5_vhca_state_cap_handle(struct mlx5_core_dev *dev, void *set_hca_cap)
+{
+}
+
+static inline int mlx5_vhca_event_init(struct mlx5_core_dev *dev)
+{
+	return 0;
+}
+
+static inline void mlx5_vhca_event_cleanup(struct mlx5_core_dev *dev)
+{
+}
+
+static inline void mlx5_vhca_event_start(struct mlx5_core_dev *dev)
+{
+}
+
+static inline void mlx5_vhca_event_stop(struct mlx5_core_dev *dev)
+{
+}
+
+#endif
+
+#endif
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index f93bfe7473aa..ffba0786051e 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -507,6 +507,7 @@ struct mlx5_devcom;
 struct mlx5_fw_reset;
 struct mlx5_eq_table;
 struct mlx5_irq_table;
+struct mlx5_vhca_state_notifier;
 
 struct mlx5_rate_limit {
 	u32			rate;
@@ -603,6 +604,9 @@ struct mlx5_priv {
 
 	struct mlx5_bfreg_data		bfregs;
 	struct mlx5_uars_page	       *uar;
+#ifdef CONFIG_MLX5_SF
+	struct mlx5_vhca_state_notifier *vhca_state_notifier;
+#endif
 };
 
 enum mlx5_device_state {
-- 
2.26.2
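
For reference, a consumer of the vhca state notifier added above appears
in the next patch (the SF device table). A minimal sketch of such a
consumer, using only the API declared in sf/vhca_event.h; the handler and
notifier block names here are hypothetical:

static int my_vhca_event_handler(struct notifier_block *nb,
				 unsigned long opcode, void *data)
{
	const struct mlx5_vhca_state_event *event = data;

	/* The event carries function_id, sw_function_id and the
	 * new_vhca_state queried from the device.
	 */
	switch (event->new_vhca_state) {
	case MLX5_VHCA_STATE_ACTIVE:
		/* instantiate resources for this function */
		break;
	case MLX5_VHCA_STATE_TEARDOWN_REQUEST:
		/* tear down resources for this function */
		break;
	default:
		break;
	}
	return 0;
}

static struct notifier_block my_nb = {
	.notifier_call = my_vhca_event_handler,
};

/* err = mlx5_vhca_event_notifier_register(dev, &my_nb); */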


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [net-next v4 07/15] net/mlx5: SF, Add auxiliary device support
  2020-12-14 21:43 [net-next v4 00/15] Add mlx5 subfunction support Saeed Mahameed
                   ` (5 preceding siblings ...)
  2020-12-14 21:43 ` [net-next v4 06/15] net/mlx5: Introduce vhca state event notifier Saeed Mahameed
@ 2020-12-14 21:43 ` Saeed Mahameed
  2020-12-14 21:43 ` [net-next v4 08/15] net/mlx5: SF, Add auxiliary device driver Saeed Mahameed
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 65+ messages in thread
From: Saeed Mahameed @ 2020-12-14 21:43 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Jason Gunthorpe
  Cc: Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Parav Pandit, Vu Pham, Saeed Mahameed

From: Parav Pandit <parav@nvidia.com>

Introduce API to add and delete an auxiliary device for an SF.
Each SF has its own dedicated window in the PCI BAR 2.
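
For illustration, each window is a fixed-size slice of BAR 2 derived from
device capabilities. A minimal sketch of the arithmetic, mirroring the
logic in sf/dev/dev.c below (sf_index is the zero-based SF slot):

/* sketch only; see mlx5_sf_dev_table_create()/mlx5_sf_dev_add() */
u64 sf_bar_length = 1ULL << (MLX5_CAP_GEN(dev, log_min_sf_size) + 12);
phys_addr_t bar2_base = pci_resource_start(dev->pdev, 2);
phys_addr_t bar_base_addr = bar2_base + sf_index * sf_bar_length;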

An SF device is similar to a PCI PF or VF in that it supports multiple
classes of devices such as net, rdma and vdpa.

The SF device is added or removed in a subsequent patch, as part of the
SF devlink port function state change command.

A subfunction device exposes the user-supplied subfunction number, which
systemd/udev can use to derive a deterministic name for its netdevice and
rdma device.

An mlx5 subfunction auxiliary device example:

$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev

$ devlink port show
pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 splittable false

$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88

$ devlink port show ens2f0npf0sf88
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
  function:
    hw_addr 00:00:00:00:88:88 state inactive opstate detached

$ devlink port function set ens2f0npf0sf88 hw_addr 00:00:00:00:88:88 state active

On activation,

$ ls -l /sys/bus/auxiliary/devices/
mlx5_core.sf.4 -> ../../../devices/pci0000:00/0000:00:03.0/0000:06:00.0/mlx5_core.sf.4

$ cat /sys/bus/auxiliary/devices/mlx5_core.sf.4/sfnum
88

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../device_drivers/ethernet/mellanox/mlx5.rst |   5 +
 .../net/ethernet/mellanox/mlx5/core/Makefile  |   2 +-
 .../net/ethernet/mellanox/mlx5/core/main.c    |   4 +
 .../ethernet/mellanox/mlx5/core/sf/dev/dev.c  | 261 ++++++++++++++++++
 .../ethernet/mellanox/mlx5/core/sf/dev/dev.h  |  35 +++
 include/linux/mlx5/driver.h                   |   2 +
 6 files changed, 308 insertions(+), 1 deletion(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.h

diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst
index e9b65035cd47..a5eb22793bb9 100644
--- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst
@@ -97,6 +97,11 @@ Enabling the driver and kconfig options
 
 |   Provides low-level InfiniBand/RDMA and `RoCE <https://community.mellanox.com/s/article/recommended-network-configuration-examples-for-roce-deployment>`_ support.
 
+**CONFIG_MLX5_SF=(y/n)**
+
+|   Build support for subfunction.
+|   Subfunctions are more lightweight than PCI SRIOV VFs. Choosing this option
+|   will enable support for creating subfunction devices.
 
 **External options** ( Choose if the corresponding mlx5 feature is required )
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 292c02c4828c..2aefbca404c3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -88,4 +88,4 @@ mlx5_core-$(CONFIG_MLX5_SW_STEERING) += steering/dr_domain.o steering/dr_table.o
 #
 # SF device
 #
-mlx5_core-$(CONFIG_MLX5_SF) += sf/vhca_event.o
+mlx5_core-$(CONFIG_MLX5_SF) += sf/vhca_event.o sf/dev/dev.o
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 6e67ad11c713..292c30e71d7f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -74,6 +74,7 @@
 #include "lib/hv_vhca.h"
 #include "diag/rsc_dump.h"
 #include "sf/vhca_event.h"
+#include "sf/dev/dev.h"
 
 MODULE_AUTHOR("Eli Cohen <eli@mellanox.com>");
 MODULE_DESCRIPTION("Mellanox 5th generation network adapters (ConnectX series) core driver");
@@ -1155,6 +1156,8 @@ static int mlx5_load(struct mlx5_core_dev *dev)
 		goto err_sriov;
 	}
 
+	mlx5_sf_dev_table_create(dev);
+
 	return 0;
 
 err_sriov:
@@ -1186,6 +1189,7 @@ static int mlx5_load(struct mlx5_core_dev *dev)
 
 static void mlx5_unload(struct mlx5_core_dev *dev)
 {
+	mlx5_sf_dev_table_destroy(dev);
 	mlx5_sriov_detach(dev);
 	mlx5_ec_cleanup(dev);
 	mlx5_vhca_event_stop(dev);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.c b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.c
new file mode 100644
index 000000000000..6562bf63afaa
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.c
@@ -0,0 +1,261 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/* Copyright (c) 2020 Mellanox Technologies Ltd */
+
+#include <linux/mlx5/driver.h>
+#include <linux/mlx5/device.h>
+#include "mlx5_core.h"
+#include "dev.h"
+#include "sf/vhca_event.h"
+#include "sf/sf.h"
+#include "sf/mlx5_ifc_vhca_event.h"
+#include "ecpf.h"
+
+struct mlx5_sf_dev_table {
+	struct xarray devices;
+	unsigned int max_sfs;
+	phys_addr_t base_address;
+	u64 sf_bar_length;
+	struct notifier_block nb;
+	struct mlx5_core_dev *dev;
+};
+
+static bool mlx5_sf_dev_supported(const struct mlx5_core_dev *dev)
+{
+	return MLX5_CAP_GEN(dev, sf) && mlx5_vhca_event_supported(dev);
+}
+
+static ssize_t sfnum_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	struct auxiliary_device *adev = container_of(dev, struct auxiliary_device, dev);
+	struct mlx5_sf_dev *sf_dev = container_of(adev, struct mlx5_sf_dev, adev);
+
+	return scnprintf(buf, PAGE_SIZE, "%u\n", sf_dev->sfnum);
+}
+static DEVICE_ATTR_RO(sfnum);
+
+static struct attribute *sf_device_attrs[] = {
+	&dev_attr_sfnum.attr,
+	NULL,
+};
+
+static const struct attribute_group sf_attr_group = {
+	.attrs = sf_device_attrs,
+};
+
+static const struct attribute_group *sf_attr_groups[2] = {
+	&sf_attr_group,
+	NULL
+};
+
+static void mlx5_sf_dev_release(struct device *device)
+{
+	struct auxiliary_device *adev = container_of(device, struct auxiliary_device, dev);
+	struct mlx5_sf_dev *sf_dev = container_of(adev, struct mlx5_sf_dev, adev);
+
+	mlx5_adev_idx_free(adev->id);
+	kfree(sf_dev);
+}
+
+static void mlx5_sf_dev_remove(struct mlx5_sf_dev *sf_dev)
+{
+	auxiliary_device_delete(&sf_dev->adev);
+	auxiliary_device_uninit(&sf_dev->adev);
+}
+
+static void mlx5_sf_dev_add(struct mlx5_core_dev *dev, u16 sf_index, u32 sfnum)
+{
+	struct mlx5_sf_dev_table *table = dev->priv.sf_dev_table;
+	struct mlx5_sf_dev *sf_dev;
+	struct pci_dev *pdev;
+	int err;
+	int id;
+
+	id = mlx5_adev_idx_alloc();
+	if (id < 0) {
+		err = id;
+		goto add_err;
+	}
+
+	sf_dev = kzalloc(sizeof(*sf_dev), GFP_KERNEL);
+	if (!sf_dev) {
+		mlx5_adev_idx_free(id);
+		err = -ENOMEM;
+		goto add_err;
+	}
+	pdev = dev->pdev;
+	sf_dev->adev.id = id;
+	sf_dev->adev.name = MLX5_SF_DEV_ID_NAME;
+	sf_dev->adev.dev.release = mlx5_sf_dev_release;
+	sf_dev->adev.dev.parent = &pdev->dev;
+	sf_dev->adev.dev.groups = sf_attr_groups;
+	sf_dev->sfnum = sfnum;
+	sf_dev->parent_mdev = dev;
+
+	if (!table->max_sfs) {
+		mlx5_adev_idx_free(id);
+		kfree(sf_dev);
+		err = -EOPNOTSUPP;
+		goto add_err;
+	}
+	sf_dev->bar_base_addr = table->base_address + (sf_index * table->sf_bar_length);
+
+	err = auxiliary_device_init(&sf_dev->adev);
+	if (err) {
+		mlx5_adev_idx_free(id);
+		kfree(sf_dev);
+		goto add_err;
+	}
+
+	err = auxiliary_device_add(&sf_dev->adev);
+	if (err) {
+		put_device(&sf_dev->adev.dev);
+		goto add_err;
+	}
+
+	err = xa_insert(&table->devices, sf_index, sf_dev, GFP_KERNEL);
+	if (err)
+		goto xa_err;
+	return;
+
+xa_err:
+	mlx5_sf_dev_remove(sf_dev);
+add_err:
+	mlx5_core_err(dev, "SF DEV: fail device add for index=%d sfnum=%d err=%d\n",
+		      sf_index, sfnum, err);
+}
+
+static void mlx5_sf_dev_del(struct mlx5_core_dev *dev, struct mlx5_sf_dev *sf_dev, u16 sf_index)
+{
+	struct mlx5_sf_dev_table *table = dev->priv.sf_dev_table;
+
+	xa_erase(&table->devices, sf_index);
+	mlx5_sf_dev_remove(sf_dev);
+}
+
+static int
+mlx5_sf_dev_state_change_handler(struct notifier_block *nb, unsigned long event_code, void *data)
+{
+	struct mlx5_sf_dev_table *table = container_of(nb, struct mlx5_sf_dev_table, nb);
+	const struct mlx5_vhca_state_event *event = data;
+	struct mlx5_sf_dev *sf_dev;
+	u16 sf_index;
+
+	sf_index = event->function_id - MLX5_CAP_GEN(table->dev, sf_base_id);
+	sf_dev = xa_load(&table->devices, sf_index);
+	switch (event->new_vhca_state) {
+	case MLX5_VHCA_STATE_TEARDOWN_REQUEST:
+		if (sf_dev)
+			mlx5_sf_dev_del(table->dev, sf_dev, sf_index);
+		else
+			mlx5_core_err(table->dev,
+				      "SF DEV: teardown state for invalid dev index=%d fn_id=0x%x\n",
+				      sf_index, event->sw_function_id);
+		break;
+	case MLX5_VHCA_STATE_ACTIVE:
+		if (!sf_dev)
+			mlx5_sf_dev_add(table->dev, sf_index, event->sw_function_id);
+		break;
+	default:
+		break;
+	}
+	return 0;
+}
+
+static int mlx5_sf_dev_vhca_arm_all(struct mlx5_sf_dev_table *table)
+{
+	struct mlx5_core_dev *dev = table->dev;
+	u16 max_functions;
+	u16 function_id;
+	int err = 0;
+	bool ecpu;
+	int i;
+
+	max_functions = mlx5_sf_max_functions(dev);
+	function_id = MLX5_CAP_GEN(dev, sf_base_id);
+	ecpu = mlx5_read_embedded_cpu(dev);
+	/* Arm the vhca context as the vhca event notifier */
+	for (i = 0; i < max_functions; i++) {
+		err = mlx5_vhca_event_arm(dev, function_id, ecpu);
+		if (err)
+			return err;
+
+		function_id++;
+	}
+	return 0;
+}
+
+void mlx5_sf_dev_table_create(struct mlx5_core_dev *dev)
+{
+	struct mlx5_sf_dev_table *table;
+	unsigned int max_sfs;
+	int err;
+
+	if (!mlx5_sf_dev_supported(dev) || !mlx5_vhca_event_supported(dev))
+		return;
+
+	table = kzalloc(sizeof(*table), GFP_KERNEL);
+	if (!table) {
+		err = -ENOMEM;
+		goto table_err;
+	}
+
+	table->nb.notifier_call = mlx5_sf_dev_state_change_handler;
+	table->dev = dev;
+	if (MLX5_CAP_GEN(dev, max_num_sf))
+		max_sfs = MLX5_CAP_GEN(dev, max_num_sf);
+	else
+		max_sfs = 1 << MLX5_CAP_GEN(dev, log_max_sf);
+	table->sf_bar_length = 1 << (MLX5_CAP_GEN(dev, log_min_sf_size) + 12);
+	table->base_address = pci_resource_start(dev->pdev, 2);
+	table->max_sfs = max_sfs;
+	xa_init(&table->devices);
+	dev->priv.sf_dev_table = table;
+
+	err = mlx5_vhca_event_notifier_register(dev, &table->nb);
+	if (err)
+		goto vhca_err;
+	err = mlx5_sf_dev_vhca_arm_all(table);
+	if (err)
+		goto arm_err;
+	mlx5_core_dbg(dev, "SF DEV: max sf devices=%d\n", max_sfs);
+	return;
+
+arm_err:
+	mlx5_vhca_event_notifier_unregister(dev, &table->nb);
+vhca_err:
+	table->max_sfs = 0;
+	kfree(table);
+	dev->priv.sf_dev_table = NULL;
+table_err:
+	mlx5_core_err(dev, "SF DEV table create err = %d\n", err);
+}
+
+static void mlx5_sf_dev_destroy_all(struct mlx5_sf_dev_table *table)
+{
+	struct mlx5_sf_dev *sf_dev;
+	unsigned long index;
+
+	xa_for_each(&table->devices, index, sf_dev) {
+		xa_erase(&table->devices, index);
+		mlx5_sf_dev_remove(sf_dev);
+	}
+}
+
+void mlx5_sf_dev_table_destroy(struct mlx5_core_dev *dev)
+{
+	struct mlx5_sf_dev_table *table = dev->priv.sf_dev_table;
+
+	if (!table)
+		return;
+
+	mlx5_vhca_event_notifier_unregister(dev, &table->nb);
+
+	/* Now that event handler is not running, it is safe to destroy
+	 * the sf device without race.
+	 */
+	mlx5_sf_dev_destroy_all(table);
+
+	WARN_ON(!xa_empty(&table->devices));
+	kfree(table);
+	dev->priv.sf_dev_table = NULL;
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.h b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.h
new file mode 100644
index 000000000000..a6fb7289ba2c
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/* Copyright (c) 2020 Mellanox Technologies Ltd */
+
+#ifndef __MLX5_SF_DEV_H__
+#define __MLX5_SF_DEV_H__
+
+#ifdef CONFIG_MLX5_SF
+
+#include <linux/auxiliary_bus.h>
+
+#define MLX5_SF_DEV_ID_NAME "sf"
+
+struct mlx5_sf_dev {
+	struct auxiliary_device adev;
+	struct mlx5_core_dev *parent_mdev;
+	phys_addr_t bar_base_addr;
+	u32 sfnum;
+};
+
+void mlx5_sf_dev_table_create(struct mlx5_core_dev *dev);
+void mlx5_sf_dev_table_destroy(struct mlx5_core_dev *dev);
+
+#else
+
+static inline void mlx5_sf_dev_table_create(struct mlx5_core_dev *dev)
+{
+}
+
+static inline void mlx5_sf_dev_table_destroy(struct mlx5_core_dev *dev)
+{
+}
+
+#endif
+
+#endif
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index ffba0786051e..08e5fbe97df0 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -508,6 +508,7 @@ struct mlx5_fw_reset;
 struct mlx5_eq_table;
 struct mlx5_irq_table;
 struct mlx5_vhca_state_notifier;
+struct mlx5_sf_dev_table;
 
 struct mlx5_rate_limit {
 	u32			rate;
@@ -606,6 +607,7 @@ struct mlx5_priv {
 	struct mlx5_uars_page	       *uar;
 #ifdef CONFIG_MLX5_SF
 	struct mlx5_vhca_state_notifier *vhca_state_notifier;
+	struct mlx5_sf_dev_table *sf_dev_table;
 #endif
 };
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [net-next v4 08/15] net/mlx5: SF, Add auxiliary device driver
  2020-12-14 21:43 [net-next v4 00/15] Add mlx5 subfunction support Saeed Mahameed
                   ` (6 preceding siblings ...)
  2020-12-14 21:43 ` [net-next v4 07/15] net/mlx5: SF, Add auxiliary device support Saeed Mahameed
@ 2020-12-14 21:43 ` Saeed Mahameed
  2020-12-14 21:43 ` [net-next v4 09/15] net/mlx5: E-switch, Prepare eswitch to handle SF vport Saeed Mahameed
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 65+ messages in thread
From: Saeed Mahameed @ 2020-12-14 21:43 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Jason Gunthorpe
  Cc: Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Parav Pandit, Vu Pham, Saeed Mahameed

From: Parav Pandit <parav@nvidia.com>

Add auxiliary device driver for mlx5 subfunction auxiliary device.

An mlx5 subfunction is similar to a PCI PF or VF. For each subfunction,
an auxiliary device is created.

As a result, when an mlx5 SF auxiliary device binds to the driver,
its netdev and rdma device are created; they appear as

$ ls -l /sys/bus/auxiliary/devices/
mlx5_core.sf.4 -> ../../../devices/pci0000:00/0000:00:03.0/0000:06:00.0/mlx5_core.sf.4

$ ls -l /sys/class/net/eth1/device
/sys/class/net/eth1/device -> ../../../mlx5_core.sf.4

$ cat /sys/bus/auxiliary/devices/mlx5_core.sf.4/sfnum
88

$ devlink dev show
pci/0000:06:00.0
auxiliary/mlx5_core.sf.4

$ devlink port show auxiliary/mlx5_core.sf.4/1
auxiliary/mlx5_core.sf.4/1: type eth netdev p0sf88 flavour virtual port 0 splittable false

$ rdma link show mlx5_0/1
link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev p0sf88

$ rdma dev show
8: rocep6s0f1: node_type ca fw 16.29.0550 node_guid 248a:0703:00b3:d113 sys_image_guid 248a:0703:00b3:d112
13: mlx5_0: node_type ca fw 16.29.0550 node_guid 0000:00ff:fe00:8888 sys_image_guid 248a:0703:00b3:d112

In the future, the devlink device instance name will be adapted to carry
an sfnum annotation, either via an alias or in the devlink instance name
itself, as described in RFC [1].

[1] https://lore.kernel.org/netdev/20200519092258.GF4655@nanopsycho/
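
Auxiliary bus matching is by name: the parent device registers each SF as
"mlx5_core.sf.<id>" and a driver matches on the "mlx5_core.sf" string. A
minimal sketch of a hypothetical driver binding to these devices (the
my_* names are illustrative and not part of this series):

#include <linux/auxiliary_bus.h>

static int my_sf_probe(struct auxiliary_device *adev,
		       const struct auxiliary_device_id *id)
{
	/* claim the SF auxiliary device */
	return 0;
}

static void my_sf_remove(struct auxiliary_device *adev)
{
}

static const struct auxiliary_device_id my_sf_id_table[] = {
	{ .name = "mlx5_core.sf" }, /* KBUILD_MODNAME "." MLX5_SF_DEV_ID_NAME */
	{},
};

static struct auxiliary_driver my_sf_driver = {
	.name = "my_sf",
	.probe = my_sf_probe,
	.remove = my_sf_remove,
	.id_table = my_sf_id_table,
};

/* registered with auxiliary_driver_register(&my_sf_driver) */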

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/Makefile  |   2 +-
 .../net/ethernet/mellanox/mlx5/core/devlink.c |  12 +++
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  |   2 +-
 .../net/ethernet/mellanox/mlx5/core/main.c    |  12 ++-
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  10 ++
 .../net/ethernet/mellanox/mlx5/core/pci_irq.c |  20 ++++
 .../ethernet/mellanox/mlx5/core/sf/dev/dev.c  |  10 ++
 .../ethernet/mellanox/mlx5/core/sf/dev/dev.h  |  20 ++++
 .../mellanox/mlx5/core/sf/dev/driver.c        | 101 ++++++++++++++++++
 include/linux/mlx5/driver.h                   |   4 +-
 10 files changed, 187 insertions(+), 6 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/dev/driver.c

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 2aefbca404c3..efa95d6dd112 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -88,4 +88,4 @@ mlx5_core-$(CONFIG_MLX5_SW_STEERING) += steering/dr_domain.o steering/dr_table.o
 #
 # SF device
 #
-mlx5_core-$(CONFIG_MLX5_SF) += sf/vhca_event.o sf/dev/dev.o
+mlx5_core-$(CONFIG_MLX5_SF) += sf/vhca_event.o sf/dev/dev.o sf/dev/driver.o
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
index 3261d0dc1104..9afe918c5827 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
@@ -7,6 +7,7 @@
 #include "fw_reset.h"
 #include "fs_core.h"
 #include "eswitch.h"
+#include "sf/dev/dev.h"
 
 static int mlx5_devlink_flash_update(struct devlink *devlink,
 				     struct devlink_flash_update_params *params,
@@ -127,6 +128,17 @@ static int mlx5_devlink_reload_down(struct devlink *devlink, bool netns_change,
 				    struct netlink_ext_ack *extack)
 {
 	struct mlx5_core_dev *dev = devlink_priv(devlink);
+	bool sf_dev_allocated;
+
+	sf_dev_allocated = mlx5_sf_dev_allocated(dev);
+	if (sf_dev_allocated) {
+		/* Reload results in deleting the SF device, which further
+		 * results in unregistering its devlink instance while
+		 * holding devlink_mutex. Hence, reload is not supported.
+		 */
+		NL_SET_ERR_MSG_MOD(extack, "reload is unsupported when SFs are allocated");
+		return -EOPNOTSUPP;
+	}
 
 	switch (action) {
 	case DEVLINK_RELOAD_ACTION_DRIVER_REINIT:
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 421febebc658..174dfbc996c6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -467,7 +467,7 @@ int mlx5_eq_table_init(struct mlx5_core_dev *dev)
 	for (i = 0; i < MLX5_EVENT_TYPE_MAX; i++)
 		ATOMIC_INIT_NOTIFIER_HEAD(&eq_table->nh[i]);
 
-	eq_table->irq_table = dev->priv.irq_table;
+	eq_table->irq_table = mlx5_irq_table_get(dev);
 	return 0;
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 292c30e71d7f..932a280a56a5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -84,7 +84,6 @@ unsigned int mlx5_core_debug_mask;
 module_param_named(debug_mask, mlx5_core_debug_mask, uint, 0644);
 MODULE_PARM_DESC(debug_mask, "debug mask: 1 = dump cmd data, 2 = dump cmd exec time, 3 = both. Default=0");
 
-#define MLX5_DEFAULT_PROF	2
 static unsigned int prof_sel = MLX5_DEFAULT_PROF;
 module_param_named(prof_sel, prof_sel, uint, 0444);
 MODULE_PARM_DESC(prof_sel, "profile selector. Valid range 0 - 2");
@@ -1303,7 +1302,7 @@ void mlx5_unload_one(struct mlx5_core_dev *dev, bool cleanup)
 	mutex_unlock(&dev->intf_state_mutex);
 }
 
-static int mlx5_mdev_init(struct mlx5_core_dev *dev, int profile_idx)
+int mlx5_mdev_init(struct mlx5_core_dev *dev, int profile_idx)
 {
 	struct mlx5_priv *priv = &dev->priv;
 	int err;
@@ -1353,7 +1352,7 @@ static int mlx5_mdev_init(struct mlx5_core_dev *dev, int profile_idx)
 	return err;
 }
 
-static void mlx5_mdev_uninit(struct mlx5_core_dev *dev)
+void mlx5_mdev_uninit(struct mlx5_core_dev *dev)
 {
 	struct mlx5_priv *priv = &dev->priv;
 
@@ -1693,6 +1692,10 @@ static int __init init(void)
 	if (err)
 		goto err_debug;
 
+	err = mlx5_sf_driver_register();
+	if (err)
+		goto err_sf;
+
 #ifdef CONFIG_MLX5_CORE_EN
 	err = mlx5e_init();
 	if (err) {
@@ -1703,6 +1706,8 @@ static int __init init(void)
 
 	return 0;
 
+err_sf:
+	pci_unregister_driver(&mlx5_core_driver);
 err_debug:
 	mlx5_unregister_debugfs();
 	return err;
@@ -1713,6 +1718,7 @@ static void __exit cleanup(void)
 #ifdef CONFIG_MLX5_CORE_EN
 	mlx5e_cleanup();
 #endif
+	mlx5_sf_driver_unregister();
 	pci_unregister_driver(&mlx5_core_driver);
 	mlx5_unregister_debugfs();
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index a33b7496d748..3754ef98554f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -117,6 +117,8 @@ enum mlx5_semaphore_space_address {
 	MLX5_SEMAPHORE_SW_RESET         = 0x20,
 };
 
+#define MLX5_DEFAULT_PROF       2
+
 int mlx5_query_hca_caps(struct mlx5_core_dev *dev);
 int mlx5_query_board_id(struct mlx5_core_dev *dev);
 int mlx5_cmd_init(struct mlx5_core_dev *dev);
@@ -176,6 +178,7 @@ struct cpumask *
 mlx5_irq_get_affinity_mask(struct mlx5_irq_table *irq_table, int vecidx);
 struct cpu_rmap *mlx5_irq_get_rmap(struct mlx5_irq_table *table);
 int mlx5_irq_get_num_comp(struct mlx5_irq_table *table);
+struct mlx5_irq_table *mlx5_irq_table_get(struct mlx5_core_dev *dev);
 
 int mlx5_events_init(struct mlx5_core_dev *dev);
 void mlx5_events_cleanup(struct mlx5_core_dev *dev);
@@ -257,6 +260,13 @@ enum {
 u8 mlx5_get_nic_state(struct mlx5_core_dev *dev);
 void mlx5_set_nic_state(struct mlx5_core_dev *dev, u8 state);
 
+static inline bool mlx5_core_is_sf(const struct mlx5_core_dev *dev)
+{
+	return dev->coredev_type == MLX5_COREDEV_SF;
+}
+
+int mlx5_mdev_init(struct mlx5_core_dev *dev, int profile_idx);
+void mlx5_mdev_uninit(struct mlx5_core_dev *dev);
 void mlx5_unload_one(struct mlx5_core_dev *dev, bool cleanup);
 int mlx5_load_one(struct mlx5_core_dev *dev, bool boot);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
index 6fd974920394..a61e09aff152 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
@@ -30,6 +30,9 @@ int mlx5_irq_table_init(struct mlx5_core_dev *dev)
 {
 	struct mlx5_irq_table *irq_table;
 
+	if (mlx5_core_is_sf(dev))
+		return 0;
+
 	irq_table = kvzalloc(sizeof(*irq_table), GFP_KERNEL);
 	if (!irq_table)
 		return -ENOMEM;
@@ -40,6 +43,9 @@ int mlx5_irq_table_init(struct mlx5_core_dev *dev)
 
 void mlx5_irq_table_cleanup(struct mlx5_core_dev *dev)
 {
+	if (mlx5_core_is_sf(dev))
+		return;
+
 	kvfree(dev->priv.irq_table);
 }
 
@@ -268,6 +274,9 @@ int mlx5_irq_table_create(struct mlx5_core_dev *dev)
 	int nvec;
 	int err;
 
+	if (mlx5_core_is_sf(dev))
+		return 0;
+
 	nvec = MLX5_CAP_GEN(dev, num_ports) * num_online_cpus() +
 	       MLX5_IRQ_VEC_COMP_BASE;
 	nvec = min_t(int, nvec, num_eqs);
@@ -319,6 +328,9 @@ void mlx5_irq_table_destroy(struct mlx5_core_dev *dev)
 	struct mlx5_irq_table *table = dev->priv.irq_table;
 	int i;
 
+	if (mlx5_core_is_sf(dev))
+		return;
+
 	/* free_irq requires that affinity and rmap will be cleared
 	 * before calling it. This is why there is asymmetry with set_rmap
 	 * which should be called after alloc_irq but before request_irq.
@@ -332,3 +344,11 @@ void mlx5_irq_table_destroy(struct mlx5_core_dev *dev)
 	kfree(table->irq);
 }
 
+struct mlx5_irq_table *mlx5_irq_table_get(struct mlx5_core_dev *dev)
+{
+#ifdef CONFIG_MLX5_SF
+	if (mlx5_core_is_sf(dev))
+		return dev->priv.parent_mdev->priv.irq_table;
+#endif
+	return dev->priv.irq_table;
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.c b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.c
index 6562bf63afaa..2675b85d202d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.c
@@ -24,6 +24,16 @@ static bool mlx5_sf_dev_supported(const struct mlx5_core_dev *dev)
 	return MLX5_CAP_GEN(dev, sf) && mlx5_vhca_event_supported(dev);
 }
 
+bool mlx5_sf_dev_allocated(const struct mlx5_core_dev *dev)
+{
+	struct mlx5_sf_dev_table *table = dev->priv.sf_dev_table;
+
+	if (!mlx5_sf_dev_supported(dev))
+		return false;
+
+	return !xa_empty(&table->devices);
+}
+
 static ssize_t sfnum_show(struct device *dev, struct device_attribute *attr, char *buf)
 {
 	struct auxiliary_device *adev = container_of(dev, struct auxiliary_device, dev);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.h b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.h
index a6fb7289ba2c..4de02902aef1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.h
@@ -13,6 +13,7 @@
 struct mlx5_sf_dev {
 	struct auxiliary_device adev;
 	struct mlx5_core_dev *parent_mdev;
+	struct mlx5_core_dev *mdev;
 	phys_addr_t bar_base_addr;
 	u32 sfnum;
 };
@@ -20,6 +21,11 @@ struct mlx5_sf_dev {
 void mlx5_sf_dev_table_create(struct mlx5_core_dev *dev);
 void mlx5_sf_dev_table_destroy(struct mlx5_core_dev *dev);
 
+int mlx5_sf_driver_register(void);
+void mlx5_sf_driver_unregister(void);
+
+bool mlx5_sf_dev_allocated(const struct mlx5_core_dev *dev);
+
 #else
 
 static inline void mlx5_sf_dev_table_create(struct mlx5_core_dev *dev)
@@ -30,6 +36,20 @@ static inline void mlx5_sf_dev_table_destroy(struct mlx5_core_dev *dev)
 {
 }
 
+static inline int mlx5_sf_driver_register(void)
+{
+	return 0;
+}
+
+static inline void mlx5_sf_driver_unregister(void)
+{
+}
+
+static inline bool mlx5_sf_dev_allocated(const struct mlx5_core_dev *dev)
+{
+	return false;
+}
+
 #endif
 
 #endif
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/driver.c b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/driver.c
new file mode 100644
index 000000000000..9a1ad331ce0a
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/driver.c
@@ -0,0 +1,101 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/* Copyright (c) 2020 Mellanox Technologies Ltd */
+
+#include <linux/mlx5/driver.h>
+#include <linux/mlx5/device.h>
+#include "mlx5_core.h"
+#include "dev.h"
+#include "devlink.h"
+
+static int mlx5_sf_dev_probe(struct auxiliary_device *adev, const struct auxiliary_device_id *id)
+{
+	struct mlx5_sf_dev *sf_dev = container_of(adev, struct mlx5_sf_dev, adev);
+	struct mlx5_core_dev *mdev;
+	struct devlink *devlink;
+	int err;
+
+	devlink = mlx5_devlink_alloc();
+	if (!devlink)
+		return -ENOMEM;
+
+	mdev = devlink_priv(devlink);
+	mdev->device = &adev->dev;
+	mdev->pdev = sf_dev->parent_mdev->pdev;
+	mdev->bar_addr = sf_dev->bar_base_addr;
+	mdev->iseg_base = sf_dev->bar_base_addr;
+	mdev->coredev_type = MLX5_COREDEV_SF;
+	mdev->priv.parent_mdev = sf_dev->parent_mdev;
+	mdev->priv.adev_idx = adev->id;
+	sf_dev->mdev = mdev;
+
+	err = mlx5_mdev_init(mdev, MLX5_DEFAULT_PROF);
+	if (err) {
+		mlx5_core_warn(mdev, "mlx5_mdev_init on err=%d\n", err);
+		goto mdev_err;
+	}
+
+	mdev->iseg = ioremap(mdev->iseg_base, sizeof(*mdev->iseg));
+	if (!mdev->iseg) {
+		mlx5_core_warn(mdev, "remap error\n");
+		goto remap_err;
+	}
+
+	err = mlx5_load_one(mdev, true);
+	if (err) {
+		mlx5_core_warn(mdev, "mlx5_load_one err=%d\n", err);
+		goto load_one_err;
+	}
+	return 0;
+
+load_one_err:
+	iounmap(mdev->iseg);
+remap_err:
+	mlx5_mdev_uninit(mdev);
+mdev_err:
+	mlx5_devlink_free(devlink);
+	return err;
+}
+
+static void mlx5_sf_dev_remove(struct auxiliary_device *adev)
+{
+	struct mlx5_sf_dev *sf_dev = container_of(adev, struct mlx5_sf_dev, adev);
+	struct devlink *devlink;
+
+	devlink = priv_to_devlink(sf_dev->mdev);
+	mlx5_unload_one(sf_dev->mdev, true);
+	iounmap(sf_dev->mdev->iseg);
+	mlx5_mdev_uninit(sf_dev->mdev);
+	mlx5_devlink_free(devlink);
+}
+
+static void mlx5_sf_dev_shutdown(struct auxiliary_device *adev)
+{
+	struct mlx5_sf_dev *sf_dev = container_of(adev, struct mlx5_sf_dev, adev);
+
+	mlx5_unload_one(sf_dev->mdev, false);
+}
+
+static const struct auxiliary_device_id mlx5_sf_dev_id_table[] = {
+	{ .name = KBUILD_MODNAME "." MLX5_SF_DEV_ID_NAME, },
+	{ },
+};
+
+MODULE_DEVICE_TABLE(auxiliary, mlx5_sf_dev_id_table);
+
+static struct auxiliary_driver mlx5_sf_driver = {
+	.name = KBUILD_MODNAME,
+	.probe = mlx5_sf_dev_probe,
+	.remove = mlx5_sf_dev_remove,
+	.shutdown = mlx5_sf_dev_shutdown,
+	.id_table = mlx5_sf_dev_id_table,
+};
+
+int mlx5_sf_driver_register(void)
+{
+	return auxiliary_driver_register(&mlx5_sf_driver);
+}
+
+void mlx5_sf_driver_unregister(void)
+{
+	auxiliary_driver_unregister(&mlx5_sf_driver);
+}
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 08e5fbe97df0..48e3638b1185 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -193,7 +193,8 @@ enum port_state_policy {
 
 enum mlx5_coredev_type {
 	MLX5_COREDEV_PF,
-	MLX5_COREDEV_VF
+	MLX5_COREDEV_VF,
+	MLX5_COREDEV_SF,
 };
 
 struct mlx5_field_desc {
@@ -608,6 +609,7 @@ struct mlx5_priv {
 #ifdef CONFIG_MLX5_SF
 	struct mlx5_vhca_state_notifier *vhca_state_notifier;
 	struct mlx5_sf_dev_table *sf_dev_table;
+	struct mlx5_core_dev *parent_mdev;
 #endif
 };
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [net-next v4 09/15] net/mlx5: E-switch, Prepare eswitch to handle SF vport
  2020-12-14 21:43 [net-next v4 00/15] Add mlx5 subfunction support Saeed Mahameed
                   ` (7 preceding siblings ...)
  2020-12-14 21:43 ` [net-next v4 08/15] net/mlx5: SF, Add auxiliary device driver Saeed Mahameed
@ 2020-12-14 21:43 ` Saeed Mahameed
  2020-12-14 21:43 ` [net-next v4 10/15] net/mlx5: E-switch, Add eswitch helpers for " Saeed Mahameed
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 65+ messages in thread
From: Saeed Mahameed @ 2020-12-14 21:43 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Jason Gunthorpe
  Cc: Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Vu Pham, Parav Pandit, Roi Dayan, Saeed Mahameed

From: Vu Pham <vuhuong@nvidia.com>

Prepare eswitch to handle SF vport during
(a) querying eswitch functions
(b) egress ACL creation
(c) accounting for SF vports in the total vports calculation

Assign a dedicated placeholder for SF vports and their representors.
They are placed after the VF vports and before the ECPF vport, as below:
[PF,VF0,...,VFn,SF0,...SFm,ECPF,UPLINK].

Change functions to map SF vport numbers to indices when
accessing the vports or representors arrays, and vice versa.
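
A worked example with hypothetical sizes (max_vfs = 2, sf_base_id = 100,
max_num_sf = 2) makes the mapping concrete:

index:  0    1    2    3        4        5     6
vport:  PF   VF0  VF1  SF(100)  SF(101)  ECPF  UPLINK

SF vport_num -> index:
	index = vport_num - sf_start_function_id + 1 /* PF */ + max_vfs
	e.g. vport 101 -> 101 - 100 + 1 + 2 = 4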

Signed-off-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/Kconfig   | 10 ++++
 .../mellanox/mlx5/core/esw/acl/egress_ofld.c  |  2 +-
 .../net/ethernet/mellanox/mlx5/core/eswitch.c | 11 +++-
 .../net/ethernet/mellanox/mlx5/core/eswitch.h | 50 +++++++++++++++++++
 .../mellanox/mlx5/core/eswitch_offloads.c     | 11 ++++
 .../net/ethernet/mellanox/mlx5/core/vport.c   |  3 +-
 6 files changed, 83 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
index d6c48582e7a8..ad45d20f9d44 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
@@ -212,3 +212,13 @@ config MLX5_SF
 	Build support for subfuction device in the NIC. A Mellanox subfunction
 	device can support RDMA, netdevice and vdpa device.
 	It is similar to a SRIOV VF but it doesn't require SRIOV support.
+
+config MLX5_SF_MANAGER
+	bool
+	depends on MLX5_SF && MLX5_ESWITCH
+	default y
+	help
+	Build support for subfunction port in the NIC. A Mellanox subfunction
+	port is managed through devlink.  A subfunction supports RDMA, netdevice
+	and vdpa device. It is similar to a SRIOV VF but it doesn't require
+	SRIOV support.
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/acl/egress_ofld.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/acl/egress_ofld.c
index 4c74e2690d57..26b37a0f8762 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/esw/acl/egress_ofld.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/acl/egress_ofld.c
@@ -150,7 +150,7 @@ static void esw_acl_egress_ofld_groups_destroy(struct mlx5_vport *vport)
 
 static bool esw_acl_egress_needed(const struct mlx5_eswitch *esw, u16 vport_num)
 {
-	return mlx5_eswitch_is_vf_vport(esw, vport_num);
+	return mlx5_eswitch_is_vf_vport(esw, vport_num) || mlx5_esw_is_sf_vport(esw, vport_num);
 }
 
 int esw_acl_egress_ofld_setup(struct mlx5_eswitch *esw, struct mlx5_vport *vport)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index da901e364656..d75247a8ce55 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -1366,9 +1366,15 @@ const u32 *mlx5_esw_query_functions(struct mlx5_core_dev *dev)
 {
 	int outlen = MLX5_ST_SZ_BYTES(query_esw_functions_out);
 	u32 in[MLX5_ST_SZ_DW(query_esw_functions_in)] = {};
+	u16 max_sf_vports;
 	u32 *out;
 	int err;
 
+	max_sf_vports = mlx5_sf_max_functions(dev);
+	/* The device interface is an array of 64-bit words */
+	if (max_sf_vports)
+		outlen += DIV_ROUND_UP(max_sf_vports, BITS_PER_TYPE(__be64)) * sizeof(__be64);
+
 	out = kvzalloc(outlen, GFP_KERNEL);
 	if (!out)
 		return ERR_PTR(-ENOMEM);
@@ -1376,7 +1382,7 @@ const u32 *mlx5_esw_query_functions(struct mlx5_core_dev *dev)
 	MLX5_SET(query_esw_functions_in, in, opcode,
 		 MLX5_CMD_OP_QUERY_ESW_FUNCTIONS);
 
-	err = mlx5_cmd_exec_inout(dev, query_esw_functions, in, out);
+	err = mlx5_cmd_exec(dev, in, sizeof(in), out, outlen);
 	if (!err)
 		return out;
 
@@ -1899,7 +1905,8 @@ static bool
 is_port_function_supported(const struct mlx5_eswitch *esw, u16 vport_num)
 {
 	return vport_num == MLX5_VPORT_PF ||
-	       mlx5_eswitch_is_vf_vport(esw, vport_num);
+	       mlx5_eswitch_is_vf_vport(esw, vport_num) ||
+	       mlx5_esw_is_sf_vport(esw, vport_num);
 }
 
 int mlx5_devlink_port_function_hw_addr_get(struct devlink *devlink,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index cf87de94418f..4e3ed878ff03 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -43,6 +43,7 @@
 #include <linux/mlx5/fs.h>
 #include "lib/mpfs.h"
 #include "lib/fs_chains.h"
+#include "sf/sf.h"
 #include "en/tc_ct.h"
 
 #ifdef CONFIG_MLX5_ESWITCH
@@ -499,6 +500,40 @@ static inline u16 mlx5_eswitch_first_host_vport_num(struct mlx5_core_dev *dev)
 		MLX5_VPORT_PF : MLX5_VPORT_FIRST_VF;
 }
 
+static inline int mlx5_esw_sf_start_idx(const struct mlx5_eswitch *esw)
+{
+	/* PF and VF vports indices start from 0 to max_vfs */
+	return MLX5_VPORT_PF_PLACEHOLDER + mlx5_core_max_vfs(esw->dev);
+}
+
+static inline int mlx5_esw_sf_end_idx(const struct mlx5_eswitch *esw)
+{
+	return mlx5_esw_sf_start_idx(esw) + mlx5_sf_max_functions(esw->dev);
+}
+
+static inline int
+mlx5_esw_sf_vport_num_to_index(const struct mlx5_eswitch *esw, u16 vport_num)
+{
+	return vport_num - mlx5_sf_start_function_id(esw->dev) +
+	       MLX5_VPORT_PF_PLACEHOLDER + mlx5_core_max_vfs(esw->dev);
+}
+
+static inline u16
+mlx5_esw_sf_vport_index_to_num(const struct mlx5_eswitch *esw, int idx)
+{
+	return mlx5_sf_start_function_id(esw->dev) + idx -
+	       (MLX5_VPORT_PF_PLACEHOLDER + mlx5_core_max_vfs(esw->dev));
+}
+
+static inline bool
+mlx5_esw_is_sf_vport(const struct mlx5_eswitch *esw, u16 vport_num)
+{
+	return mlx5_sf_supported(esw->dev) &&
+	       vport_num >= mlx5_sf_start_function_id(esw->dev) &&
+	       (vport_num < (mlx5_sf_start_function_id(esw->dev) +
+			     mlx5_sf_max_functions(esw->dev)));
+}
+
 static inline bool mlx5_eswitch_is_funcs_handler(const struct mlx5_core_dev *dev)
 {
 	return mlx5_core_is_ecpf_esw_manager(dev);
@@ -527,6 +562,10 @@ static inline int mlx5_eswitch_vport_num_to_index(struct mlx5_eswitch *esw,
 	if (vport_num == MLX5_VPORT_UPLINK)
 		return mlx5_eswitch_uplink_idx(esw);
 
+	if (mlx5_esw_is_sf_vport(esw, vport_num))
+		return mlx5_esw_sf_vport_num_to_index(esw, vport_num);
+
+	/* PF and VF vports start from 0 to max_vfs */
 	return vport_num;
 }
 
@@ -540,6 +579,12 @@ static inline u16 mlx5_eswitch_index_to_vport_num(struct mlx5_eswitch *esw,
 	if (index == mlx5_eswitch_uplink_idx(esw))
 		return MLX5_VPORT_UPLINK;
 
+	/* SF vports indices are after VFs and before ECPF */
+	if (mlx5_sf_supported(esw->dev) &&
+	    index > mlx5_core_max_vfs(esw->dev))
+		return mlx5_esw_sf_vport_index_to_num(esw, index);
+
+	/* PF and VF vports start from 0 to max_vfs */
 	return index;
 }
 
@@ -625,6 +670,11 @@ void mlx5e_tc_clean_fdb_peer_flows(struct mlx5_eswitch *esw);
 	for ((vport) = (nvfs);						\
 	     (vport) >= (esw)->first_host_vport; (vport)--)
 
+#define mlx5_esw_for_each_sf_rep(esw, i, rep)		\
+	for ((i) = mlx5_esw_sf_start_idx(esw);		\
+	     (rep) = &(esw)->offloads.vport_reps[(i)],	\
+	     (i) < mlx5_esw_sf_end_idx(esw); (i++))
+
 struct mlx5_eswitch *mlx5_devlink_eswitch_get(struct devlink *devlink);
 struct mlx5_vport *__must_check
 mlx5_eswitch_get_vport(struct mlx5_eswitch *esw, u16 vport_num);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 2f6a0ae20650..2d241f7351b5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -1800,11 +1800,22 @@ static void __esw_offloads_unload_rep(struct mlx5_eswitch *esw,
 		esw->offloads.rep_ops[rep_type]->unload(rep);
 }
 
+static void __unload_reps_sf_vport(struct mlx5_eswitch *esw, u8 rep_type)
+{
+	struct mlx5_eswitch_rep *rep;
+	int i;
+
+	mlx5_esw_for_each_sf_rep(esw, i, rep)
+		__esw_offloads_unload_rep(esw, rep, rep_type);
+}
+
 static void __unload_reps_all_vport(struct mlx5_eswitch *esw, u8 rep_type)
 {
 	struct mlx5_eswitch_rep *rep;
 	int i;
 
+	__unload_reps_sf_vport(esw, rep_type);
+
 	mlx5_esw_for_each_vf_rep_reverse(esw, i, rep, esw->esw_funcs.num_vfs)
 		__esw_offloads_unload_rep(esw, rep, rep_type);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/vport.c b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
index bdafc85fd874..ba78e0660523 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/vport.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
@@ -36,6 +36,7 @@
 #include <linux/mlx5/vport.h>
 #include <linux/mlx5/eswitch.h>
 #include "mlx5_core.h"
+#include "sf/sf.h"
 
 /* Mutex to hold while enabling or disabling RoCE */
 static DEFINE_MUTEX(mlx5_roce_en_lock);
@@ -1160,6 +1161,6 @@ EXPORT_SYMBOL_GPL(mlx5_query_nic_system_image_guid);
  */
 u16 mlx5_eswitch_get_total_vports(const struct mlx5_core_dev *dev)
 {
-	return MLX5_SPECIAL_VPORTS(dev) + mlx5_core_max_vfs(dev);
+	return MLX5_SPECIAL_VPORTS(dev) + mlx5_core_max_vfs(dev) + mlx5_sf_max_functions(dev);
 }
 EXPORT_SYMBOL_GPL(mlx5_eswitch_get_total_vports);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [net-next v4 10/15] net/mlx5: E-switch, Add eswitch helpers for SF vport
  2020-12-14 21:43 [net-next v4 00/15] Add mlx5 subfunction support Saeed Mahameed
                   ` (8 preceding siblings ...)
  2020-12-14 21:43 ` [net-next v4 09/15] net/mlx5: E-switch, Prepare eswitch to handle SF vport Saeed Mahameed
@ 2020-12-14 21:43 ` Saeed Mahameed
  2020-12-14 21:43 ` [net-next v4 11/15] net/mlx5: SF, Add port add delete functionality Saeed Mahameed
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 65+ messages in thread
From: Saeed Mahameed @ 2020-12-14 21:43 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Jason Gunthorpe
  Cc: Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Parav Pandit, Vu Pham, Roi Dayan, Saeed Mahameed

From: Parav Pandit <parav@nvidia.com>

Add helpers to enable/disable eswitch port, register its devlink port and
load its representor.
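
The helpers compose in a fixed order and unwind in reverse on error or
teardown; error handling is elided in this sketch of the sequence
implemented below by mlx5_esw_offloads_sf_vport_enable()/disable():

/* enable: vport first, then devlink port, then representor */
mlx5_esw_vport_enable(esw, vport_num, MLX5_VPORT_UC_ADDR_CHANGE);
mlx5_esw_devlink_sf_port_register(esw, dl_port, vport_num, sfnum);
mlx5_esw_offloads_rep_load(esw, vport_num);

/* disable: mirror image */
mlx5_esw_offloads_rep_unload(esw, vport_num);
mlx5_esw_devlink_sf_port_unregister(esw, vport_num);
mlx5_esw_vport_disable(esw, vport_num);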

Signed-off-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../mellanox/mlx5/core/esw/devlink_port.c     | 41 +++++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/eswitch.c | 12 +++---
 .../net/ethernet/mellanox/mlx5/core/eswitch.h | 16 ++++++++
 .../mellanox/mlx5/core/eswitch_offloads.c     | 36 +++++++++++++++-
 4 files changed, 97 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c
index ffff11baa3d0..4b7e9f783789 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c
@@ -122,3 +122,44 @@ struct devlink_port *mlx5_esw_offloads_devlink_port(struct mlx5_eswitch *esw, u1
 	vport = mlx5_eswitch_get_vport(esw, vport_num);
 	return vport->dl_port;
 }
+
+int mlx5_esw_devlink_sf_port_register(struct mlx5_eswitch *esw, struct devlink_port *dl_port,
+				      u16 vport_num, u32 sfnum)
+{
+	struct mlx5_core_dev *dev = esw->dev;
+	struct netdev_phys_item_id ppid = {};
+	unsigned int dl_port_index;
+	struct mlx5_vport *vport;
+	struct devlink *devlink;
+	u16 pfnum;
+	int err;
+
+	vport = mlx5_eswitch_get_vport(esw, vport_num);
+	if (IS_ERR(vport))
+		return PTR_ERR(vport);
+
+	pfnum = PCI_FUNC(dev->pdev->devfn);
+	mlx5_esw_get_port_parent_id(dev, &ppid);
+	memcpy(dl_port->attrs.switch_id.id, &ppid.id[0], ppid.id_len);
+	dl_port->attrs.switch_id.id_len = ppid.id_len;
+	devlink_port_attrs_pci_sf_set(dl_port, 0, pfnum, sfnum, false);
+	devlink = priv_to_devlink(dev);
+	dl_port_index = mlx5_esw_vport_to_devlink_port_index(dev, vport_num);
+	err = devlink_port_register(devlink, dl_port, dl_port_index);
+	if (err)
+		return err;
+
+	vport->dl_port = dl_port;
+	return 0;
+}
+
+void mlx5_esw_devlink_sf_port_unregister(struct mlx5_eswitch *esw, u16 vport_num)
+{
+	struct mlx5_vport *vport;
+
+	vport = mlx5_eswitch_get_vport(esw, vport_num);
+	if (IS_ERR(vport))
+		return;
+	devlink_port_unregister(vport->dl_port);
+	vport->dl_port = NULL;
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index d75247a8ce55..d06e7a5f15de 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -1273,8 +1273,8 @@ static void esw_vport_cleanup(struct mlx5_eswitch *esw, struct mlx5_vport *vport
 	esw_vport_cleanup_acl(esw, vport);
 }
 
-static int esw_enable_vport(struct mlx5_eswitch *esw, u16 vport_num,
-			    enum mlx5_eswitch_vport_event enabled_events)
+int mlx5_esw_vport_enable(struct mlx5_eswitch *esw, u16 vport_num,
+			  enum mlx5_eswitch_vport_event enabled_events)
 {
 	struct mlx5_vport *vport;
 	int ret;
@@ -1310,7 +1310,7 @@ static int esw_enable_vport(struct mlx5_eswitch *esw, u16 vport_num,
 	return ret;
 }
 
-static void esw_disable_vport(struct mlx5_eswitch *esw, u16 vport_num)
+void mlx5_esw_vport_disable(struct mlx5_eswitch *esw, u16 vport_num)
 {
 	struct mlx5_vport *vport;
 
@@ -1432,7 +1432,7 @@ int mlx5_eswitch_load_vport(struct mlx5_eswitch *esw, u16 vport_num,
 {
 	int err;
 
-	err = esw_enable_vport(esw, vport_num, enabled_events);
+	err = mlx5_esw_vport_enable(esw, vport_num, enabled_events);
 	if (err)
 		return err;
 
@@ -1443,14 +1443,14 @@ int mlx5_eswitch_load_vport(struct mlx5_eswitch *esw, u16 vport_num,
 	return err;
 
 err_rep:
-	esw_disable_vport(esw, vport_num);
+	mlx5_esw_vport_disable(esw, vport_num);
 	return err;
 }
 
 void mlx5_eswitch_unload_vport(struct mlx5_eswitch *esw, u16 vport_num)
 {
 	esw_offloads_unload_rep(esw, vport_num);
-	esw_disable_vport(esw, vport_num);
+	mlx5_esw_vport_disable(esw, vport_num);
 }
 
 void mlx5_eswitch_unload_vf_vports(struct mlx5_eswitch *esw, u16 num_vfs)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index 4e3ed878ff03..54514b04808d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -688,6 +688,10 @@ mlx5_eswitch_enable_pf_vf_vports(struct mlx5_eswitch *esw,
 				 enum mlx5_eswitch_vport_event enabled_events);
 void mlx5_eswitch_disable_pf_vf_vports(struct mlx5_eswitch *esw);
 
+int mlx5_esw_vport_enable(struct mlx5_eswitch *esw, u16 vport_num,
+			  enum mlx5_eswitch_vport_event enabled_events);
+void mlx5_esw_vport_disable(struct mlx5_eswitch *esw, u16 vport_num);
+
 int
 esw_vport_create_offloads_acl_tables(struct mlx5_eswitch *esw,
 				     struct mlx5_vport *vport);
@@ -706,6 +710,9 @@ esw_get_max_restore_tag(struct mlx5_eswitch *esw);
 int esw_offloads_load_rep(struct mlx5_eswitch *esw, u16 vport_num);
 void esw_offloads_unload_rep(struct mlx5_eswitch *esw, u16 vport_num);
 
+int mlx5_esw_offloads_rep_load(struct mlx5_eswitch *esw, u16 vport_num);
+void mlx5_esw_offloads_rep_unload(struct mlx5_eswitch *esw, u16 vport_num);
+
 int mlx5_eswitch_load_vport(struct mlx5_eswitch *esw, u16 vport_num,
 			    enum mlx5_eswitch_vport_event enabled_events);
 void mlx5_eswitch_unload_vport(struct mlx5_eswitch *esw, u16 vport_num);
@@ -717,6 +724,15 @@ void mlx5_eswitch_unload_vf_vports(struct mlx5_eswitch *esw, u16 num_vfs);
 int mlx5_esw_offloads_devlink_port_register(struct mlx5_eswitch *esw, u16 vport_num);
 void mlx5_esw_offloads_devlink_port_unregister(struct mlx5_eswitch *esw, u16 vport_num);
 struct devlink_port *mlx5_esw_offloads_devlink_port(struct mlx5_eswitch *esw, u16 vport_num);
+
+int mlx5_esw_devlink_sf_port_register(struct mlx5_eswitch *esw, struct devlink_port *dl_port,
+				      u16 vport_num, u32 sfnum);
+void mlx5_esw_devlink_sf_port_unregister(struct mlx5_eswitch *esw, u16 vport_num);
+
+int mlx5_esw_offloads_sf_vport_enable(struct mlx5_eswitch *esw, struct devlink_port *dl_port,
+				      u16 vport_num, u32 sfnum);
+void mlx5_esw_offloads_sf_vport_disable(struct mlx5_eswitch *esw, u16 vport_num);
+
 #else  /* CONFIG_MLX5_ESWITCH */
 /* eswitch API stubs */
 static inline int  mlx5_eswitch_init(struct mlx5_core_dev *dev) { return 0; }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 2d241f7351b5..7f09f2bbf7c1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -1833,7 +1833,7 @@ static void __unload_reps_all_vport(struct mlx5_eswitch *esw, u8 rep_type)
 	__esw_offloads_unload_rep(esw, rep, rep_type);
 }
 
-static int mlx5_esw_offloads_rep_load(struct mlx5_eswitch *esw, u16 vport_num)
+int mlx5_esw_offloads_rep_load(struct mlx5_eswitch *esw, u16 vport_num)
 {
 	struct mlx5_eswitch_rep *rep;
 	int rep_type;
@@ -1857,7 +1857,7 @@ static int mlx5_esw_offloads_rep_load(struct mlx5_eswitch *esw, u16 vport_num)
 	return err;
 }
 
-static void mlx5_esw_offloads_rep_unload(struct mlx5_eswitch *esw, u16 vport_num)
+void mlx5_esw_offloads_rep_unload(struct mlx5_eswitch *esw, u16 vport_num)
 {
 	struct mlx5_eswitch_rep *rep;
 	int rep_type;
@@ -2835,3 +2835,35 @@ u32 mlx5_eswitch_get_vport_metadata_for_match(struct mlx5_eswitch *esw,
 	return vport->metadata << (32 - ESW_SOURCE_PORT_METADATA_BITS);
 }
 EXPORT_SYMBOL(mlx5_eswitch_get_vport_metadata_for_match);
+
+int mlx5_esw_offloads_sf_vport_enable(struct mlx5_eswitch *esw, struct devlink_port *dl_port,
+				      u16 vport_num, u32 sfnum)
+{
+	int err;
+
+	err = mlx5_esw_vport_enable(esw, vport_num, MLX5_VPORT_UC_ADDR_CHANGE);
+	if (err)
+		return err;
+
+	err = mlx5_esw_devlink_sf_port_register(esw, dl_port, vport_num, sfnum);
+	if (err)
+		goto devlink_err;
+
+	err = mlx5_esw_offloads_rep_load(esw, vport_num);
+	if (err)
+		goto rep_err;
+	return 0;
+
+rep_err:
+	mlx5_esw_devlink_sf_port_unregister(esw, vport_num);
+devlink_err:
+	mlx5_esw_vport_disable(esw, vport_num);
+	return err;
+}
+
+void mlx5_esw_offloads_sf_vport_disable(struct mlx5_eswitch *esw, u16 vport_num)
+{
+	mlx5_esw_offloads_rep_unload(esw, vport_num);
+	mlx5_esw_devlink_sf_port_unregister(esw, vport_num);
+	mlx5_esw_vport_disable(esw, vport_num);
+}
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [net-next v4 11/15] net/mlx5: SF, Add port add delete functionality
  2020-12-14 21:43 [net-next v4 00/15] Add mlx5 subfunction support Saeed Mahameed
                   ` (9 preceding siblings ...)
  2020-12-14 21:43 ` [net-next v4 10/15] net/mlx5: E-switch, Add eswitch helpers for " Saeed Mahameed
@ 2020-12-14 21:43 ` Saeed Mahameed
  2020-12-14 21:43 ` [net-next v4 12/15] net/mlx5: SF, Port function state change support Saeed Mahameed
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 65+ messages in thread
From: Saeed Mahameed @ 2020-12-14 21:43 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Jason Gunthorpe
  Cc: Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Parav Pandit, Vu Pham, Saeed Mahameed

From: Parav Pandit <parav@nvidia.com>

To handle SF port management outside of the eswitch as an independent
software layer, introduce eswitch notifier APIs so that an upper layer
that wishes to support SF port management in switchdev mode can perform
its task whenever the eswitch mode is set to switchdev, or do cleanup
just before the eswitch is disabled.

Initialize the SF port table on such an eswitch event.

Add SF port add and delete functionality in switchdev mode.
Destroy all SF ports when the eswitch is disabled.
Expose SF port add and delete to users via devlink commands.
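
Below is a minimal sketch of how an upper layer might consume the new
notifier API. mlx5_esw_event_notifier_register(), struct
mlx5_esw_event_info and the MLX5_ESWITCH_* modes are introduced by this
series; the callback and init function themselves are illustrative, not
the exact driver code:

static int example_esw_event(struct notifier_block *nb,
			     unsigned long event, void *data)
{
	const struct mlx5_esw_event_info *info = data;

	switch (info->new_mode) {
	case MLX5_ESWITCH_OFFLOADS:
		/* eswitch entered switchdev mode; enable SF port management */
		break;
	case MLX5_ESWITCH_NONE:
		/* eswitch is being disabled; destroy all SF ports first */
		break;
	default:
		break;
	}
	return 0;
}

static struct notifier_block example_nb = {
	.notifier_call = example_esw_event,
};

/* Registered once, e.g. when the SF port table is initialized. */
static int example_init(struct mlx5_eswitch *esw)
{
	return mlx5_esw_event_notifier_register(esw, &example_nb);
}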

$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev

$ devlink port show
pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 splittable false

$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88

$ devlink port show ens2f0npf0sf88
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
  function:
    hw_addr 00:00:00:00:88:88 state inactive opstate detached

$ devlink port show ens2f0npf0sf88 -jp
{
    "port": {
        "pci/0000:06:00.0/32768": {
            "type": "eth",
            "netdev": "ens2f0npf0sf88",
            "flavour": "pcisf",
            "controller": 0,
            "pfnum": 0,
            "sfnum": 88,
            "external": false,
            "splittable": false,
            "function": {
                "hw_addr": "00:00:00:00:88:88",
                "state": "inactive",
                "opstate": "detached"
            }
        }
    }
}

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/Makefile  |   5 +
 drivers/net/ethernet/mellanox/mlx5/core/cmd.c |   4 +
 .../net/ethernet/mellanox/mlx5/core/devlink.c |   5 +
 .../net/ethernet/mellanox/mlx5/core/eswitch.c |  25 ++
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |  12 +
 .../net/ethernet/mellanox/mlx5/core/main.c    |  18 +
 .../net/ethernet/mellanox/mlx5/core/sf/cmd.c  |  27 ++
 .../ethernet/mellanox/mlx5/core/sf/devlink.c  | 312 ++++++++++++++++++
 .../ethernet/mellanox/mlx5/core/sf/hw_table.c | 125 +++++++
 .../net/ethernet/mellanox/mlx5/core/sf/priv.h |  17 +
 .../net/ethernet/mellanox/mlx5/core/sf/sf.h   |  28 ++
 include/linux/mlx5/driver.h                   |   6 +
 12 files changed, 584 insertions(+)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/cmd.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/hw_table.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sf/priv.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index efa95d6dd112..957d5d9cfb36 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -89,3 +89,8 @@ mlx5_core-$(CONFIG_MLX5_SW_STEERING) += steering/dr_domain.o steering/dr_table.o
 # SF device
 #
 mlx5_core-$(CONFIG_MLX5_SF) += sf/vhca_event.o sf/dev/dev.o sf/dev/driver.o
+
+#
+# SF manager
+#
+mlx5_core-$(CONFIG_MLX5_SF_MANAGER) += sf/cmd.o sf/hw_table.o sf/devlink.o
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
index 47dcc3ac2cf0..e8cecd50558d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
@@ -333,6 +333,7 @@ static int mlx5_internal_err_ret_value(struct mlx5_core_dev *dev, u16 op,
 	case MLX5_CMD_OP_DEALLOC_MEMIC:
 	case MLX5_CMD_OP_PAGE_FAULT_RESUME:
 	case MLX5_CMD_OP_QUERY_ESW_FUNCTIONS:
+	case MLX5_CMD_OP_DEALLOC_SF:
 		return MLX5_CMD_STAT_OK;
 
 	case MLX5_CMD_OP_QUERY_HCA_CAP:
@@ -466,6 +467,7 @@ static int mlx5_internal_err_ret_value(struct mlx5_core_dev *dev, u16 op,
 	case MLX5_CMD_OP_RELEASE_XRQ_ERROR:
 	case MLX5_CMD_OP_QUERY_VHCA_STATE:
 	case MLX5_CMD_OP_MODIFY_VHCA_STATE:
+	case MLX5_CMD_OP_ALLOC_SF:
 		*status = MLX5_DRIVER_STATUS_ABORTED;
 		*synd = MLX5_DRIVER_SYND;
 		return -EIO;
@@ -661,6 +663,8 @@ const char *mlx5_command_str(int command)
 	MLX5_COMMAND_STR_CASE(MODIFY_XRQ);
 	MLX5_COMMAND_STR_CASE(QUERY_VHCA_STATE);
 	MLX5_COMMAND_STR_CASE(MODIFY_VHCA_STATE);
+	MLX5_COMMAND_STR_CASE(ALLOC_SF);
+	MLX5_COMMAND_STR_CASE(DEALLOC_SF);
 	default: return "unknown command opcode";
 	}
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
index 9afe918c5827..d4c0cdf5edd9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
@@ -8,6 +8,7 @@
 #include "fs_core.h"
 #include "eswitch.h"
 #include "sf/dev/dev.h"
+#include "sf/sf.h"
 
 static int mlx5_devlink_flash_update(struct devlink *devlink,
 				     struct devlink_flash_update_params *params,
@@ -190,6 +191,10 @@ static const struct devlink_ops mlx5_devlink_ops = {
 	.eswitch_encap_mode_get = mlx5_devlink_eswitch_encap_mode_get,
 	.port_function_hw_addr_get = mlx5_devlink_port_function_hw_addr_get,
 	.port_function_hw_addr_set = mlx5_devlink_port_function_hw_addr_set,
+#endif
+#ifdef CONFIG_MLX5_SF_MANAGER
+	.port_new = mlx5_devlink_sf_port_new,
+	.port_del = mlx5_devlink_sf_port_del,
 #endif
 	.flash_update = mlx5_devlink_flash_update,
 	.info_get = mlx5_devlink_info_get,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index d06e7a5f15de..86e972c82af7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -1600,6 +1600,15 @@ mlx5_eswitch_update_num_of_vfs(struct mlx5_eswitch *esw, int num_vfs)
 	kvfree(out);
 }
 
+static void mlx5_esw_mode_change_notify(struct mlx5_eswitch *esw, u16 mode)
+{
+	struct mlx5_esw_event_info info = {};
+
+	info.new_mode = mode;
+
+	blocking_notifier_call_chain(&esw->n_head, 0, &info);
+}
+
 /**
  * mlx5_eswitch_enable_locked - Enable eswitch
  * @esw:	Pointer to eswitch
@@ -1660,6 +1669,8 @@ int mlx5_eswitch_enable_locked(struct mlx5_eswitch *esw, int mode, int num_vfs)
 		 mode == MLX5_ESWITCH_LEGACY ? "LEGACY" : "OFFLOADS",
 		 esw->esw_funcs.num_vfs, esw->enabled_vports);
 
+	mlx5_esw_mode_change_notify(esw, mode);
+
 	return 0;
 
 abort:
@@ -1716,6 +1727,11 @@ void mlx5_eswitch_disable_locked(struct mlx5_eswitch *esw, bool clear_vf)
 		 esw->mode == MLX5_ESWITCH_LEGACY ? "LEGACY" : "OFFLOADS",
 		 esw->esw_funcs.num_vfs, esw->enabled_vports);
 
+	/* Notify eswitch users that it is exiting from the current mode,
+	 * so that they can do any necessary cleanup before it is disabled.
+	 */
+	mlx5_esw_mode_change_notify(esw, MLX5_ESWITCH_NONE);
+
 	mlx5_eswitch_event_handlers_unregister(esw);
 
 	if (esw->mode == MLX5_ESWITCH_LEGACY)
@@ -1816,6 +1832,7 @@ int mlx5_eswitch_init(struct mlx5_core_dev *dev)
 	esw->offloads.inline_mode = MLX5_INLINE_MODE_NONE;
 
 	dev->priv.eswitch = esw;
+	BLOCKING_INIT_NOTIFIER_HEAD(&esw->n_head);
 	return 0;
 abort:
 	if (esw->work_queue)
@@ -2507,4 +2524,12 @@ bool mlx5_esw_multipath_prereq(struct mlx5_core_dev *dev0,
 		dev1->priv.eswitch->mode == MLX5_ESWITCH_OFFLOADS);
 }
 
+int mlx5_esw_event_notifier_register(struct mlx5_eswitch *esw, struct notifier_block *nb)
+{
+	return blocking_notifier_chain_register(&esw->n_head, nb);
+}
 
+void mlx5_esw_event_notifier_unregister(struct mlx5_eswitch *esw, struct notifier_block *nb)
+{
+	blocking_notifier_chain_unregister(&esw->n_head, nb);
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index 54514b04808d..479d2ac2cd85 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -278,6 +278,7 @@ struct mlx5_eswitch {
 	struct {
 		u32             large_group_num;
 	}  params;
+	struct blocking_notifier_head n_head;
 };
 
 void esw_offloads_disable(struct mlx5_eswitch *esw);
@@ -733,6 +734,17 @@ int mlx5_esw_offloads_sf_vport_enable(struct mlx5_eswitch *esw, struct devlink_p
 				      u16 vport_num, u32 sfnum);
 void mlx5_esw_offloads_sf_vport_disable(struct mlx5_eswitch *esw, u16 vport_num);
 
+/**
+ * mlx5_esw_event_info - Indicates eswitch mode changed/changing.
+ *
+ * @new_mode: New mode of eswitch.
+ */
+struct mlx5_esw_event_info {
+	u16 new_mode;
+};
+
+int mlx5_esw_event_notifier_register(struct mlx5_eswitch *esw, struct notifier_block *n);
+void mlx5_esw_event_notifier_unregister(struct mlx5_eswitch *esw, struct notifier_block *n);
 #else  /* CONFIG_MLX5_ESWITCH */
 /* eswitch API stubs */
 static inline int  mlx5_eswitch_init(struct mlx5_core_dev *dev) { return 0; }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 932a280a56a5..435323088ce0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -893,6 +893,18 @@ static int mlx5_init_once(struct mlx5_core_dev *dev)
 		goto err_fpga_cleanup;
 	}
 
+	err = mlx5_sf_hw_table_init(dev);
+	if (err) {
+		mlx5_core_err(dev, "Failed to init SF HW table %d\n", err);
+		goto err_sf_hw_table_cleanup;
+	}
+
+	err = mlx5_sf_table_init(dev);
+	if (err) {
+		mlx5_core_err(dev, "Failed to init SF table %d\n", err);
+		goto err_sf_table_cleanup;
+	}
+
 	dev->dm = mlx5_dm_create(dev);
 	if (IS_ERR(dev->dm))
 		mlx5_core_warn(dev, "Failed to init device memory%d\n", err);
@@ -903,6 +915,10 @@ static int mlx5_init_once(struct mlx5_core_dev *dev)
 
 	return 0;
 
+err_sf_table_cleanup:
+	mlx5_sf_hw_table_cleanup(dev);
+err_sf_hw_table_cleanup:
+	mlx5_vhca_event_cleanup(dev);
 err_fpga_cleanup:
 	mlx5_fpga_cleanup(dev);
 err_eswitch_cleanup:
@@ -936,6 +952,8 @@ static void mlx5_cleanup_once(struct mlx5_core_dev *dev)
 	mlx5_hv_vhca_destroy(dev->hv_vhca);
 	mlx5_fw_tracer_destroy(dev->tracer);
 	mlx5_dm_cleanup(dev);
+	mlx5_sf_table_cleanup(dev);
+	mlx5_sf_hw_table_cleanup(dev);
 	mlx5_vhca_event_cleanup(dev);
 	mlx5_fpga_cleanup(dev);
 	mlx5_eswitch_cleanup(dev->priv.eswitch);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/sf/cmd.c
new file mode 100644
index 000000000000..0bc3075f34fa
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/cmd.c
@@ -0,0 +1,27 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/* Copyright (c) 2020 Mellanox Technologies Ltd */
+
+#include <linux/mlx5/driver.h>
+#include "priv.h"
+
+int mlx5_cmd_alloc_sf(struct mlx5_core_dev *dev, u16 function_id)
+{
+	u32 out[MLX5_ST_SZ_DW(alloc_sf_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(alloc_sf_in)] = {};
+
+	MLX5_SET(alloc_sf_in, in, opcode, MLX5_CMD_OP_ALLOC_SF);
+	MLX5_SET(alloc_sf_in, in, function_id, function_id);
+
+	return mlx5_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+}
+
+int mlx5_cmd_dealloc_sf(struct mlx5_core_dev *dev, u16 function_id)
+{
+	u32 out[MLX5_ST_SZ_DW(dealloc_sf_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(dealloc_sf_in)] = {};
+
+	MLX5_SET(dealloc_sf_in, in, opcode, MLX5_CMD_OP_DEALLOC_SF);
+	MLX5_SET(dealloc_sf_in, in, function_id, function_id);
+
+	return mlx5_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c b/drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c
new file mode 100644
index 000000000000..09365f36a513
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c
@@ -0,0 +1,312 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/* Copyright (c) 2020 Mellanox Technologies Ltd */
+
+#include <linux/mlx5/driver.h>
+#include "eswitch.h"
+#include "priv.h"
+
+struct mlx5_sf {
+	struct devlink_port dl_port;
+	unsigned int port_index;
+	u16 id;
+};
+
+struct mlx5_sf_table {
+	struct mlx5_core_dev *dev; /* To refer from notifier context. */
+	struct xarray port_indices; /* port index based lookup. */
+	refcount_t refcount;
+	struct completion disable_complete;
+	struct notifier_block esw_nb;
+};
+
+static struct mlx5_sf *
+mlx5_sf_lookup_by_index(struct mlx5_sf_table *table, unsigned int port_index)
+{
+	return xa_load(&table->port_indices, port_index);
+}
+
+static int mlx5_sf_id_insert(struct mlx5_sf_table *table, struct mlx5_sf *sf)
+{
+	return xa_insert(&table->port_indices, sf->port_index, sf, GFP_KERNEL);
+}
+
+static void mlx5_sf_id_erase(struct mlx5_sf_table *table, struct mlx5_sf *sf)
+{
+	xa_erase(&table->port_indices, sf->port_index);
+}
+
+static struct mlx5_sf *
+mlx5_sf_alloc(struct mlx5_sf_table *table, u32 sfnum, struct netlink_ext_ack *extack)
+{
+	unsigned int dl_port_index;
+	struct mlx5_sf *sf;
+	u16 hw_fn_id;
+	int id_err;
+	int err;
+
+	id_err = mlx5_sf_hw_table_sf_alloc(table->dev, sfnum);
+	if (id_err < 0) {
+		err = id_err;
+		goto id_err;
+	}
+
+	sf = kzalloc(sizeof(*sf), GFP_KERNEL);
+	if (!sf) {
+		err = -ENOMEM;
+		goto alloc_err;
+	}
+	sf->id = id_err;
+	hw_fn_id = mlx5_sf_sw_to_hw_id(table->dev, sf->id);
+	dl_port_index = mlx5_esw_vport_to_devlink_port_index(table->dev, hw_fn_id);
+	sf->port_index = dl_port_index;
+
+	err = mlx5_sf_id_insert(table, sf);
+	if (err)
+		goto insert_err;
+
+	return sf;
+
+insert_err:
+	kfree(sf);
+alloc_err:
+	mlx5_sf_hw_table_sf_free(table->dev, id_err);
+id_err:
+	if (err == -EEXIST)
+		NL_SET_ERR_MSG_MOD(extack, "SF already exist. Choose different sfnum");
+	return ERR_PTR(err);
+}
+
+static void mlx5_sf_free(struct mlx5_sf_table *table, struct mlx5_sf *sf)
+{
+	mlx5_sf_id_erase(table, sf);
+	mlx5_sf_hw_table_sf_free(table->dev, sf->id);
+	kfree(sf);
+}
+
+static struct mlx5_sf_table *mlx5_sf_table_try_get(struct mlx5_core_dev *dev)
+{
+	struct mlx5_sf_table *table = dev->priv.sf_table;
+
+	if (!table)
+		return NULL;
+
+	return refcount_inc_not_zero(&table->refcount) ? table : NULL;
+}
+
+static void mlx5_sf_table_put(struct mlx5_sf_table *table)
+{
+	if (refcount_dec_and_test(&table->refcount))
+		complete(&table->disable_complete);
+}
+
+static int mlx5_sf_add(struct mlx5_core_dev *dev, struct mlx5_sf_table *table,
+		       const struct devlink_port_new_attrs *new_attr,
+		       struct netlink_ext_ack *extack)
+{
+	struct mlx5_eswitch *esw = dev->priv.eswitch;
+	struct mlx5_sf *sf;
+	u16 hw_fn_id;
+	int err;
+
+	sf = mlx5_sf_alloc(table, new_attr->sfnum, extack);
+	if (IS_ERR(sf))
+		return PTR_ERR(sf);
+
+	hw_fn_id = mlx5_sf_sw_to_hw_id(dev, sf->id);
+	err = mlx5_esw_offloads_sf_vport_enable(esw, &sf->dl_port, hw_fn_id, new_attr->sfnum);
+	if (err)
+		goto esw_err;
+	return 0;
+
+esw_err:
+	mlx5_sf_free(table, sf);
+	return err;
+}
+
+static void mlx5_sf_del(struct mlx5_core_dev *dev, struct mlx5_sf_table *table, struct mlx5_sf *sf)
+{
+	struct mlx5_eswitch *esw = dev->priv.eswitch;
+	u16 hw_fn_id;
+
+	hw_fn_id = mlx5_sf_sw_to_hw_id(dev, sf->id);
+	mlx5_esw_offloads_sf_vport_disable(esw, hw_fn_id);
+	mlx5_sf_free(table, sf);
+}
+
+static int
+mlx5_sf_new_check_attr(struct mlx5_core_dev *dev, const struct devlink_port_new_attrs *new_attr,
+		       struct netlink_ext_ack *extack)
+{
+	if (new_attr->flavour != DEVLINK_PORT_FLAVOUR_PCI_SF) {
+		NL_SET_ERR_MSG_MOD(extack, "Driver supports only SF port addition");
+		return -EOPNOTSUPP;
+	}
+	if (new_attr->port_index_valid) {
+		NL_SET_ERR_MSG_MOD(extack,
+				   "Driver does not support user defined port index assignment");
+		return -EOPNOTSUPP;
+	}
+	if (!new_attr->sfnum_valid) {
+		NL_SET_ERR_MSG_MOD(extack,
+				   "User must provide unique sfnum. Driver does not support auto assignment");
+		return -EOPNOTSUPP;
+	}
+	if (new_attr->controller_valid && new_attr->controller) {
+		NL_SET_ERR_MSG_MOD(extack, "External controller is unsupported");
+		return -EOPNOTSUPP;
+	}
+	if (new_attr->pfnum != PCI_FUNC(dev->pdev->devfn)) {
+		NL_SET_ERR_MSG_MOD(extack, "Invalid pfnum supplied");
+		return -EOPNOTSUPP;
+	}
+	return 0;
+}
+
+int mlx5_devlink_sf_port_new(struct devlink *devlink, const struct devlink_port_new_attrs *new_attr,
+			     struct netlink_ext_ack *extack)
+{
+	struct mlx5_core_dev *dev = devlink_priv(devlink);
+	struct mlx5_sf_table *table;
+	int err;
+
+	err = mlx5_sf_new_check_attr(dev, new_attr, extack);
+	if (err)
+		return err;
+
+	table = mlx5_sf_table_try_get(dev);
+	if (!table) {
+		NL_SET_ERR_MSG_MOD(extack,
+				   "Port add is supported only in eswitch switchdev mode, or SF ports are disabled.");
+		return -EOPNOTSUPP;
+	}
+	err = mlx5_sf_add(dev, table, new_attr, extack);
+	mlx5_sf_table_put(table);
+	return err;
+}
+
+int mlx5_devlink_sf_port_del(struct devlink *devlink, unsigned int port_index,
+			     struct netlink_ext_ack *extack)
+{
+	struct mlx5_core_dev *dev = devlink_priv(devlink);
+	struct mlx5_sf_table *table;
+	struct mlx5_sf *sf;
+	int err = 0;
+
+	table = mlx5_sf_table_try_get(dev);
+	if (!table) {
+		NL_SET_ERR_MSG_MOD(extack,
+				   "Port del is supported only in eswitch switchdev mode, or SF ports are disabled.");
+		return -EOPNOTSUPP;
+	}
+	sf = mlx5_sf_lookup_by_index(table, port_index);
+	if (!sf) {
+		err = -ENODEV;
+		goto sf_err;
+	}
+
+	mlx5_sf_del(dev, table, sf);
+sf_err:
+	mlx5_sf_table_put(table);
+	return err;
+}
+
+static void mlx5_sf_destroy_all(struct mlx5_sf_table *table)
+{
+	struct mlx5_core_dev *dev = table->dev;
+	unsigned long index;
+	struct mlx5_sf *sf;
+
+	xa_for_each(&table->port_indices, index, sf)
+		mlx5_sf_del(dev, table, sf);
+}
+
+static void mlx5_sf_table_enable(struct mlx5_sf_table *table)
+{
+	if (!mlx5_sf_max_functions(table->dev))
+		return;
+
+	init_completion(&table->disable_complete);
+	refcount_set(&table->refcount, 1);
+}
+
+static void mlx5_sf_table_disable(struct mlx5_sf_table *table)
+{
+	if (!mlx5_sf_max_functions(table->dev))
+		return;
+
+	if (!refcount_read(&table->refcount))
+		return;
+
+	/* Balances with refcount_set; drop the reference so that new user cmd cannot start. */
+	mlx5_sf_table_put(table);
+	wait_for_completion(&table->disable_complete);
+
+	/* At this point, no new user commands can start.
+	 * It is safe to destroy all user created SFs.
+	 */
+	mlx5_sf_destroy_all(table);
+}
+
+static int mlx5_sf_esw_event(struct notifier_block *nb, unsigned long event, void *data)
+{
+	struct mlx5_sf_table *table = container_of(nb, struct mlx5_sf_table, esw_nb);
+	const struct mlx5_esw_event_info *mode = data;
+
+	switch (mode->new_mode) {
+	case MLX5_ESWITCH_OFFLOADS:
+		mlx5_sf_table_enable(table);
+		break;
+	case MLX5_ESWITCH_NONE:
+		mlx5_sf_table_disable(table);
+		break;
+	default:
+		break;
+	}
+
+	return 0;
+}
+
+static bool mlx5_sf_table_supported(const struct mlx5_core_dev *dev)
+{
+	return dev->priv.eswitch && MLX5_ESWITCH_MANAGER(dev) && mlx5_sf_supported(dev);
+}
+
+int mlx5_sf_table_init(struct mlx5_core_dev *dev)
+{
+	struct mlx5_sf_table *table;
+	int err;
+
+	if (!mlx5_sf_table_supported(dev))
+		return 0;
+
+	table = kzalloc(sizeof(*table), GFP_KERNEL);
+	if (!table)
+		return -ENOMEM;
+
+	table->dev = dev;
+	xa_init(&table->port_indices);
+	dev->priv.sf_table = table;
+	table->esw_nb.notifier_call = mlx5_sf_esw_event;
+	err = mlx5_esw_event_notifier_register(dev->priv.eswitch, &table->esw_nb);
+	if (err)
+		goto reg_err;
+	return 0;
+
+reg_err:
+	kfree(table);
+	dev->priv.sf_table = NULL;
+	return err;
+}
+
+void mlx5_sf_table_cleanup(struct mlx5_core_dev *dev)
+{
+	struct mlx5_sf_table *table = dev->priv.sf_table;
+
+	if (!table)
+		return;
+
+	mlx5_esw_event_notifier_unregister(dev->priv.eswitch, &table->esw_nb);
+	WARN_ON(refcount_read(&table->refcount));
+	WARN_ON(!xa_empty(&table->port_indices));
+	kfree(table);
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/hw_table.c b/drivers/net/ethernet/mellanox/mlx5/core/sf/hw_table.c
new file mode 100644
index 000000000000..c7757f399e8a
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/hw_table.c
@@ -0,0 +1,125 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/* Copyright (c) 2020 Mellanox Technologies Ltd */
+#include <linux/mlx5/driver.h>
+#include "vhca_event.h"
+#include "priv.h"
+#include "sf.h"
+#include "ecpf.h"
+
+struct mlx5_sf_hw {
+	u32 usr_sfnum;
+	u8 allocated: 1;
+};
+
+struct mlx5_sf_hw_table {
+	struct mlx5_core_dev *dev;
+	struct mlx5_sf_hw *sfs;
+	int max_local_functions;
+	u8 ecpu: 1;
+};
+
+u16 mlx5_sf_sw_to_hw_id(const struct mlx5_core_dev *dev, u16 sw_id)
+{
+	return sw_id + mlx5_sf_start_function_id(dev);
+}
+
+int mlx5_sf_hw_table_sf_alloc(struct mlx5_core_dev *dev, u32 usr_sfnum)
+{
+	struct mlx5_sf_hw_table *table = dev->priv.sf_hw_table;
+	int sw_id = -ENOSPC;
+	u16 hw_fn_id;
+	int err;
+	int i;
+
+	if (!table->max_local_functions)
+		return -EOPNOTSUPP;
+
+	/* Check if sf with same sfnum already exists or not. */
+	for (i = 0; i < table->max_local_functions; i++) {
+		if (table->sfs[i].allocated && table->sfs[i].usr_sfnum == usr_sfnum)
+			return -EEXIST;
+	}
+
+	/* Find the free entry and allocate the entry from the array */
+	for (i = 0; i < table->max_local_functions; i++) {
+		if (!table->sfs[i].allocated) {
+			table->sfs[i].usr_sfnum = usr_sfnum;
+			table->sfs[i].allocated = true;
+			sw_id = i;
+			break;
+		}
+	}
+	if (sw_id == -ENOSPC)
+		return -ENOSPC;
+
+	hw_fn_id = mlx5_sf_sw_to_hw_id(table->dev, sw_id);
+	err = mlx5_cmd_alloc_sf(table->dev, hw_fn_id);
+	if (err)
+		goto err;
+
+	err = mlx5_modify_vhca_sw_id(dev, hw_fn_id, table->ecpu, usr_sfnum);
+	if (err)
+		goto vhca_err;
+
+	return sw_id;
+
+vhca_err:
+	mlx5_cmd_dealloc_sf(table->dev, hw_fn_id);
+err:
+	table->sfs[i].allocated = false;
+	return err;
+}
+
+void mlx5_sf_hw_table_sf_free(struct mlx5_core_dev *dev, u16 id)
+{
+	struct mlx5_sf_hw_table *table = dev->priv.sf_hw_table;
+	u16 hw_fn_id;
+
+	hw_fn_id = mlx5_sf_sw_to_hw_id(table->dev, id);
+	mlx5_cmd_dealloc_sf(table->dev, hw_fn_id);
+	table->sfs[id].allocated = false;
+}
+
+int mlx5_sf_hw_table_init(struct mlx5_core_dev *dev)
+{
+	struct mlx5_sf_hw_table *table;
+	struct mlx5_sf_hw *sfs;
+	int max_functions;
+
+	if (!mlx5_sf_supported(dev))
+		return 0;
+
+	max_functions = mlx5_sf_max_functions(dev);
+	table = kzalloc(sizeof(*table), GFP_KERNEL);
+	if (!table)
+		return -ENOMEM;
+
+	sfs = kcalloc(max_functions, sizeof(*sfs), GFP_KERNEL);
+	if (!sfs)
+		goto table_err;
+
+	table->dev = dev;
+	table->sfs = sfs;
+	table->max_local_functions = max_functions;
+	table->ecpu = mlx5_read_embedded_cpu(dev);
+	dev->priv.sf_hw_table = table;
+	mlx5_core_dbg(dev, "SF HW table: max sfs = %d\n", max_functions);
+	return 0;
+
+table_err:
+	kfree(table);
+	return -ENOMEM;
+}
+
+void mlx5_sf_hw_table_cleanup(struct mlx5_core_dev *dev)
+{
+	struct mlx5_sf_hw_table *table = dev->priv.sf_hw_table;
+
+	if (!table)
+		return;
+
+	kfree(table->sfs);
+	kfree(table);
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/priv.h b/drivers/net/ethernet/mellanox/mlx5/core/sf/priv.h
new file mode 100644
index 000000000000..7f3622375a9c
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/priv.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/* Copyright (c) 2020 Mellanox Technologies Ltd */
+
+#ifndef __MLX5_SF_PRIV_H__
+#define __MLX5_SF_PRIV_H__
+
+#include <linux/mlx5/driver.h>
+
+int mlx5_cmd_alloc_sf(struct mlx5_core_dev *dev, u16 function_id);
+int mlx5_cmd_dealloc_sf(struct mlx5_core_dev *dev, u16 function_id);
+
+u16 mlx5_sf_sw_to_hw_id(const struct mlx5_core_dev *dev, u16 sw_id);
+
+int mlx5_sf_hw_table_sf_alloc(struct mlx5_core_dev *dev, u32 usr_sfnum);
+void mlx5_sf_hw_table_sf_free(struct mlx5_core_dev *dev, u16 id);
+
+#endif
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h b/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h
index 623191679b49..dd23b6c2d887 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h
@@ -28,6 +28,16 @@ static inline u16 mlx5_sf_max_functions(const struct mlx5_core_dev *dev)
 		return 1 << MLX5_CAP_GEN(dev, log_max_sf);
 }
 
+int mlx5_sf_hw_table_init(struct mlx5_core_dev *dev);
+void mlx5_sf_hw_table_cleanup(struct mlx5_core_dev *dev);
+
+int mlx5_sf_table_init(struct mlx5_core_dev *dev);
+void mlx5_sf_table_cleanup(struct mlx5_core_dev *dev);
+
+int mlx5_devlink_sf_port_new(struct devlink *devlink, const struct devlink_port_new_attrs *add_attr,
+			     struct netlink_ext_ack *extack);
+int mlx5_devlink_sf_port_del(struct devlink *devlink, unsigned int port_index,
+			     struct netlink_ext_ack *extack);
 #else
 
 static inline bool mlx5_sf_supported(const struct mlx5_core_dev *dev)
@@ -40,6 +50,24 @@ static inline u16 mlx5_sf_max_functions(const struct mlx5_core_dev *dev)
 	return 0;
 }
 
+static inline int mlx5_sf_hw_table_init(struct mlx5_core_dev *dev)
+{
+	return 0;
+}
+
+static inline void mlx5_sf_hw_table_cleanup(struct mlx5_core_dev *dev)
+{
+}
+
+static inline int mlx5_sf_table_init(struct mlx5_core_dev *dev)
+{
+	return 0;
+}
+
+static inline void mlx5_sf_table_cleanup(struct mlx5_core_dev *dev)
+{
+}
+
 #endif
 
 #endif
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 48e3638b1185..7e357c7f0d5e 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -510,6 +510,8 @@ struct mlx5_eq_table;
 struct mlx5_irq_table;
 struct mlx5_vhca_state_notifier;
 struct mlx5_sf_dev_table;
+struct mlx5_sf_hw_table;
+struct mlx5_sf_table;
 
 struct mlx5_rate_limit {
 	u32			rate;
@@ -611,6 +613,10 @@ struct mlx5_priv {
 	struct mlx5_sf_dev_table *sf_dev_table;
 	struct mlx5_core_dev *parent_mdev;
 #endif
+#ifdef CONFIG_MLX5_SF_MANAGER
+	struct mlx5_sf_hw_table *sf_hw_table;
+	struct mlx5_sf_table *sf_table;
+#endif
 };
 
 enum mlx5_device_state {
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [net-next v4 12/15] net/mlx5: SF, Port function state change support
  2020-12-14 21:43 [net-next v4 00/15] Add mlx5 subfunction support Saeed Mahameed
                   ` (10 preceding siblings ...)
  2020-12-14 21:43 ` [net-next v4 11/15] net/mlx5: SF, Add port add delete functionality Saeed Mahameed
@ 2020-12-14 21:43 ` Saeed Mahameed
  2020-12-14 21:43 ` [net-next v4 13/15] devlink: Add devlink port documentation Saeed Mahameed
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 65+ messages in thread
From: Saeed Mahameed @ 2020-12-14 21:43 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Jason Gunthorpe
  Cc: Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Parav Pandit, Vu Pham, Saeed Mahameed

From: Parav Pandit <parav@nvidia.com>

Support changing the state of the SF port's function through devlink.
When activating the SF port's function, enable the HCA in the device,
followed by adding its auxiliary device.
When deactivating the SF port's function, delete its auxiliary device,
followed by disabling the vHCA.

Port function attribute get/set callbacks are invoked with the devlink
instance lock held. Such callbacks need to synchronize with the SF port
table being disabled, for example via the sriov sysfs callback. They do
so by holding the table refcount while the operation runs; the table
disable path waits until all such references are dropped.
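
The refcount scheme can be sketched as follows. mlx5_sf_table_try_get()
and mlx5_sf_table_put() are part of this series; the callback skeleton
around them is illustrative:

static int example_port_fn_op(struct mlx5_core_dev *dev)
{
	struct mlx5_sf_table *table;

	/* Fails once the disable path has dropped the initial reference. */
	table = mlx5_sf_table_try_get(dev);
	if (!table)
		return -EOPNOTSUPP;

	/* ... look up the SF and query or change its state while
	 * holding table->sf_state_lock ...
	 */

	/* The last reference dropped completes disable_complete,
	 * letting the table disable path proceed.
	 */
	mlx5_sf_table_put(table);
	return 0;
}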

$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev

$ devlink port show
pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 splittable false

$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88

$ devlink port show ens2f0npf0sf88
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
  function:
    hw_addr 00:00:00:00:88:88 state inactive opstate detached

$ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88 state active

$ devlink port show ens2f0npf0sf88 -jp
{
    "port": {
        "pci/0000:06:00.0/32768": {
            "type": "eth",
            "netdev": "ens2f0npf0sf88",
            "flavour": "pcisf",
            "controller": 0,
            "pfnum": 0,
            "sfnum": 88,
            "external": false,
            "splittable": false,
            "function": {
                "hw_addr": "00:00:00:00:88:88",
                "state": "active",
                "opstate": "attached"
            }
        }
    }
}

On port function activation, an auxiliary device is created in below
example.

$ devlink dev show
devlink dev show auxiliary/mlx5_core.sf.4

$ devlink port show auxiliary/mlx5_core.sf.4/1
auxiliary/mlx5_core.sf.4/1: type eth netdev p0sf88 flavour virtual port 0 splittable false

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/devlink.c |   2 +
 .../net/ethernet/mellanox/mlx5/core/main.c    |  10 +
 .../net/ethernet/mellanox/mlx5/core/sf/cmd.c  |  22 ++
 .../ethernet/mellanox/mlx5/core/sf/devlink.c  | 284 ++++++++++++++++--
 .../ethernet/mellanox/mlx5/core/sf/hw_table.c | 116 ++++++-
 .../net/ethernet/mellanox/mlx5/core/sf/priv.h |   4 +
 .../net/ethernet/mellanox/mlx5/core/sf/sf.h   |  19 ++
 7 files changed, 431 insertions(+), 26 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
index d4c0cdf5edd9..75d950d95fcf 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
@@ -195,6 +195,8 @@ static const struct devlink_ops mlx5_devlink_ops = {
 #ifdef CONFIG_MLX5_SF_MANAGER
 	.port_new = mlx5_devlink_sf_port_new,
 	.port_del = mlx5_devlink_sf_port_del,
+	.port_function_state_get = mlx5_devlink_sf_port_fn_state_get,
+	.port_function_state_set = mlx5_devlink_sf_port_fn_state_set,
 #endif
 	.flash_update = mlx5_devlink_flash_update,
 	.info_get = mlx5_devlink_info_get,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 435323088ce0..f6b885fdd5c8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -75,6 +75,7 @@
 #include "diag/rsc_dump.h"
 #include "sf/vhca_event.h"
 #include "sf/dev/dev.h"
+#include "sf/sf.h"
 
 MODULE_AUTHOR("Eli Cohen <eli@mellanox.com>");
 MODULE_DESCRIPTION("Mellanox 5th generation network adapters (ConnectX series) core driver");
@@ -1161,6 +1162,12 @@ static int mlx5_load(struct mlx5_core_dev *dev)
 
 	mlx5_vhca_event_start(dev);
 
+	err = mlx5_sf_hw_table_create(dev);
+	if (err) {
+		mlx5_core_err(dev, "sf table create failed %d\n", err);
+		goto err_vhca;
+	}
+
 	err = mlx5_ec_init(dev);
 	if (err) {
 		mlx5_core_err(dev, "Failed to init embedded CPU\n");
@@ -1180,6 +1187,8 @@ static int mlx5_load(struct mlx5_core_dev *dev)
 err_sriov:
 	mlx5_ec_cleanup(dev);
 err_ec:
+	mlx5_sf_hw_table_destroy(dev);
+err_vhca:
 	mlx5_vhca_event_stop(dev);
 	mlx5_cleanup_fs(dev);
 err_fs:
@@ -1209,6 +1218,7 @@ static void mlx5_unload(struct mlx5_core_dev *dev)
 	mlx5_sf_dev_table_destroy(dev);
 	mlx5_sriov_detach(dev);
 	mlx5_ec_cleanup(dev);
+	mlx5_sf_hw_table_destroy(dev);
 	mlx5_vhca_event_stop(dev);
 	mlx5_cleanup_fs(dev);
 	mlx5_accel_ipsec_cleanup(dev);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/sf/cmd.c
index 0bc3075f34fa..a8d75c2f0275 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sf/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/cmd.c
@@ -25,3 +25,25 @@ int mlx5_cmd_dealloc_sf(struct mlx5_core_dev *dev, u16 function_id)
 
 	return mlx5_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
 }
+
+int mlx5_cmd_sf_enable_hca(struct mlx5_core_dev *dev, u16 func_id)
+{
+	u32 out[MLX5_ST_SZ_DW(enable_hca_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(enable_hca_in)] = {};
+
+	MLX5_SET(enable_hca_in, in, opcode, MLX5_CMD_OP_ENABLE_HCA);
+	MLX5_SET(enable_hca_in, in, function_id, func_id);
+	MLX5_SET(enable_hca_in, in, embedded_cpu_function, 0);
+	return mlx5_cmd_exec(dev, &in, sizeof(in), &out, sizeof(out));
+}
+
+int mlx5_cmd_sf_disable_hca(struct mlx5_core_dev *dev, u16 func_id)
+{
+	u32 out[MLX5_ST_SZ_DW(disable_hca_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(disable_hca_in)] = {};
+
+	MLX5_SET(disable_hca_in, in, opcode, MLX5_CMD_OP_DISABLE_HCA);
+	MLX5_SET(disable_hca_in, in, function_id, func_id);
+	MLX5_SET(disable_hca_in, in, embedded_cpu_function, 0);
+	return mlx5_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c b/drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c
index 09365f36a513..eb5c536ff1d5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c
@@ -4,11 +4,17 @@
 #include <linux/mlx5/driver.h>
 #include "eswitch.h"
 #include "priv.h"
+#include "sf/dev/dev.h"
+#include "mlx5_ifc_vhca_event.h"
+#include "vhca_event.h"
+#include "ecpf.h"
 
 struct mlx5_sf {
 	struct devlink_port dl_port;
 	unsigned int port_index;
 	u16 id;
+	u16 hw_fn_id;
+	u16 hw_state;
 };
 
 struct mlx5_sf_table {
@@ -16,7 +22,10 @@ struct mlx5_sf_table {
 	struct xarray port_indices; /* port index based lookup. */
 	refcount_t refcount;
 	struct completion disable_complete;
+	struct mutex sf_state_lock; /* Serializes sf state among user cmds & vhca event handler. */
 	struct notifier_block esw_nb;
+	struct notifier_block vhca_nb;
+	u8 ecpu: 1;
 };
 
 static struct mlx5_sf *
@@ -25,6 +34,19 @@ mlx5_sf_lookup_by_index(struct mlx5_sf_table *table, unsigned int port_index)
 	return xa_load(&table->port_indices, port_index);
 }
 
+static struct mlx5_sf *
+mlx5_sf_lookup_by_function_id(struct mlx5_sf_table *table, unsigned int fn_id)
+{
+	unsigned long index;
+	struct mlx5_sf *sf;
+
+	xa_for_each(&table->port_indices, index, sf) {
+		if (sf->hw_fn_id == fn_id)
+			return sf;
+	}
+	return NULL;
+}
+
 static int mlx5_sf_id_insert(struct mlx5_sf_table *table, struct mlx5_sf *sf)
 {
 	return xa_insert(&table->port_indices, sf->port_index, sf, GFP_KERNEL);
@@ -59,6 +81,8 @@ mlx5_sf_alloc(struct mlx5_sf_table *table, u32 sfnum, struct netlink_ext_ack *ex
 	hw_fn_id = mlx5_sf_sw_to_hw_id(table->dev, sf->id);
 	dl_port_index = mlx5_esw_vport_to_devlink_port_index(table->dev, hw_fn_id);
 	sf->port_index = dl_port_index;
+	sf->hw_fn_id = hw_fn_id;
+	sf->hw_state = MLX5_VHCA_STATE_ALLOCATED;
 
 	err = mlx5_sf_id_insert(table, sf);
 	if (err)
@@ -99,6 +123,146 @@ static void mlx5_sf_table_put(struct mlx5_sf_table *table)
 		complete(&table->disable_complete);
 }
 
+static enum devlink_port_function_state mlx5_sf_to_devlink_state(u8 hw_state)
+{
+	switch (hw_state) {
+	case MLX5_VHCA_STATE_ACTIVE:
+	case MLX5_VHCA_STATE_IN_USE:
+	case MLX5_VHCA_STATE_TEARDOWN_REQUEST:
+		return DEVLINK_PORT_FUNCTION_STATE_ACTIVE;
+	case MLX5_VHCA_STATE_INVALID:
+	case MLX5_VHCA_STATE_ALLOCATED:
+	default:
+		return DEVLINK_PORT_FUNCTION_STATE_INACTIVE;
+	}
+}
+
+static enum devlink_port_function_opstate mlx5_sf_to_devlink_opstate(u8 hw_state)
+{
+	switch (hw_state) {
+	case MLX5_VHCA_STATE_IN_USE:
+	case MLX5_VHCA_STATE_TEARDOWN_REQUEST:
+		return DEVLINK_PORT_FUNCTION_OPSTATE_ATTACHED;
+	case MLX5_VHCA_STATE_INVALID:
+	case MLX5_VHCA_STATE_ALLOCATED:
+	case MLX5_VHCA_STATE_ACTIVE:
+	default:
+		return DEVLINK_PORT_FUNCTION_OPSTATE_DETACHED;
+	}
+}
+
+static bool mlx5_sf_is_active(const struct mlx5_sf *sf)
+{
+	return sf->hw_state == MLX5_VHCA_STATE_ACTIVE || sf->hw_state == MLX5_VHCA_STATE_IN_USE;
+}
+
+int mlx5_devlink_sf_port_fn_state_get(struct devlink *devlink, struct devlink_port *dl_port,
+				      enum devlink_port_function_state *state,
+				      enum devlink_port_function_opstate *opstate,
+				      struct netlink_ext_ack *extack)
+{
+	struct mlx5_core_dev *dev = devlink_priv(devlink);
+	struct mlx5_sf_table *table;
+	struct mlx5_sf *sf;
+	int err = 0;
+
+	table = mlx5_sf_table_try_get(dev);
+	if (!table)
+		return -EOPNOTSUPP;
+
+	sf = mlx5_sf_lookup_by_index(table, dl_port->index);
+	if (!sf) {
+		err = -EOPNOTSUPP;
+		goto sf_err;
+	}
+	mutex_lock(&table->sf_state_lock);
+	*state = mlx5_sf_to_devlink_state(sf->hw_state);
+	*opstate = mlx5_sf_to_devlink_opstate(sf->hw_state);
+	mutex_unlock(&table->sf_state_lock);
+sf_err:
+	mlx5_sf_table_put(table);
+	return err;
+}
+
+static int mlx5_sf_activate(struct mlx5_core_dev *dev, struct mlx5_sf *sf)
+{
+	int err;
+
+	if (mlx5_sf_is_active(sf))
+		return 0;
+	if (sf->hw_state != MLX5_VHCA_STATE_ALLOCATED)
+		return -EINVAL;
+
+	err = mlx5_cmd_sf_enable_hca(dev, sf->hw_fn_id);
+	if (err)
+		return err;
+
+	sf->hw_state = MLX5_VHCA_STATE_ACTIVE;
+	return 0;
+}
+
+static int mlx5_sf_deactivate(struct mlx5_core_dev *dev, struct mlx5_sf *sf)
+{
+	int err;
+
+	if (!mlx5_sf_is_active(sf))
+		return 0;
+
+	err = mlx5_cmd_sf_disable_hca(dev, sf->hw_fn_id);
+	if (err)
+		return err;
+
+	sf->hw_state = MLX5_VHCA_STATE_TEARDOWN_REQUEST;
+	return 0;
+}
+
+static int mlx5_sf_state_set(struct mlx5_core_dev *dev, struct mlx5_sf_table *table,
+			     struct mlx5_sf *sf,
+			     enum devlink_port_function_state state)
+{
+	int err = 0;
+
+	mutex_lock(&table->sf_state_lock);
+	if (state == mlx5_sf_to_devlink_state(sf->hw_state))
+		goto out;
+	if (state == DEVLINK_PORT_FUNCTION_STATE_ACTIVE)
+		err = mlx5_sf_activate(dev, sf);
+	else if (state == DEVLINK_PORT_FUNCTION_STATE_INACTIVE)
+		err = mlx5_sf_deactivate(dev, sf);
+	else
+		err = -EINVAL;
+out:
+	mutex_unlock(&table->sf_state_lock);
+	return err;
+}
+
+int mlx5_devlink_sf_port_fn_state_set(struct devlink *devlink, struct devlink_port *dl_port,
+				      enum devlink_port_function_state state,
+				      struct netlink_ext_ack *extack)
+{
+	struct mlx5_core_dev *dev = devlink_priv(devlink);
+	struct mlx5_sf_table *table;
+	struct mlx5_sf *sf;
+	int err;
+
+	table = mlx5_sf_table_try_get(dev);
+	if (!table) {
+		NL_SET_ERR_MSG_MOD(extack,
+				   "Port state set is supported only in eswitch switchdev mode, or SF ports are disabled.");
+		return -EOPNOTSUPP;
+	}
+	sf = mlx5_sf_lookup_by_index(table, dl_port->index);
+	if (!sf) {
+		err = -ENODEV;
+		goto out;
+	}
+
+	err = mlx5_sf_state_set(dev, table, sf, state);
+out:
+	mlx5_sf_table_put(table);
+	return err;
+}
+
 static int mlx5_sf_add(struct mlx5_core_dev *dev, struct mlx5_sf_table *table,
 		       const struct devlink_port_new_attrs *new_attr,
 		       struct netlink_ext_ack *extack)
@@ -123,16 +287,6 @@ static int mlx5_sf_add(struct mlx5_core_dev *dev, struct mlx5_sf_table *table,
 	return err;
 }
 
-static void mlx5_sf_del(struct mlx5_core_dev *dev, struct mlx5_sf_table *table, struct mlx5_sf *sf)
-{
-	struct mlx5_eswitch *esw = dev->priv.eswitch;
-	u16 hw_fn_id;
-
-	hw_fn_id = mlx5_sf_sw_to_hw_id(dev, sf->id);
-	mlx5_esw_offloads_sf_vport_disable(esw, hw_fn_id);
-	mlx5_sf_free(table, sf);
-}
-
 static int
 mlx5_sf_new_check_attr(struct mlx5_core_dev *dev, const struct devlink_port_new_attrs *new_attr,
 		       struct netlink_ext_ack *extack)
@@ -184,10 +338,30 @@ int mlx5_devlink_sf_port_new(struct devlink *devlink, const struct devlink_port_
 	return err;
 }
 
+static void mlx5_sf_dealloc(struct mlx5_sf_table *table, struct mlx5_sf *sf)
+{
+	if (sf->hw_state == MLX5_VHCA_STATE_ALLOCATED) {
+		mlx5_sf_free(table, sf);
+	} else if (mlx5_sf_is_active(sf)) {
+		/* Even if it is active, it is treated as in_use because by the
+		 * time it is disabled here, it may be getting used. So it is
+		 * safe to always wait for the event to ensure that the SF is
+		 * recycled only after firmware confirms that the driver has
+		 * detached from it.
+		 */
+		mlx5_cmd_sf_disable_hca(table->dev, sf->hw_fn_id);
+		mlx5_sf_hw_table_sf_deferred_free(table->dev, sf->id);
+		kfree(sf);
+	} else {
+		mlx5_sf_hw_table_sf_deferred_free(table->dev, sf->id);
+		kfree(sf);
+	}
+}
+
 int mlx5_devlink_sf_port_del(struct devlink *devlink, unsigned int port_index,
 			     struct netlink_ext_ack *extack)
 {
 	struct mlx5_core_dev *dev = devlink_priv(devlink);
+	struct mlx5_eswitch *esw = dev->priv.eswitch;
 	struct mlx5_sf_table *table;
 	struct mlx5_sf *sf;
 	int err = 0;
@@ -204,20 +378,58 @@ int mlx5_devlink_sf_port_del(struct devlink *devlink, unsigned int port_index,
 		goto sf_err;
 	}
 
-	mlx5_sf_del(dev, table, sf);
+	mlx5_esw_offloads_sf_vport_disable(esw, sf->hw_fn_id);
+	mlx5_sf_id_erase(table, sf);
+
+	mutex_lock(&table->sf_state_lock);
+	mlx5_sf_dealloc(table, sf);
+	mutex_unlock(&table->sf_state_lock);
 sf_err:
 	mlx5_sf_table_put(table);
 	return err;
 }
 
-static void mlx5_sf_destroy_all(struct mlx5_sf_table *table)
+static bool mlx5_sf_state_update_check(const struct mlx5_sf *sf, u8 new_state)
 {
-	struct mlx5_core_dev *dev = table->dev;
-	unsigned long index;
+	if (sf->hw_state == MLX5_VHCA_STATE_ACTIVE && new_state == MLX5_VHCA_STATE_IN_USE)
+		return true;
+
+	if (sf->hw_state == MLX5_VHCA_STATE_IN_USE && new_state == MLX5_VHCA_STATE_ACTIVE)
+		return true;
+
+	if (sf->hw_state == MLX5_VHCA_STATE_TEARDOWN_REQUEST &&
+	    new_state == MLX5_VHCA_STATE_ALLOCATED)
+		return true;
+
+	return false;
+}
+
+static int mlx5_sf_vhca_event(struct notifier_block *nb, unsigned long opcode, void *data)
+{
+	struct mlx5_sf_table *table = container_of(nb, struct mlx5_sf_table, vhca_nb);
+	const struct mlx5_vhca_state_event *event = data;
+	bool update = false;
 	struct mlx5_sf *sf;
 
-	xa_for_each(&table->port_indices, index, sf)
-		mlx5_sf_del(dev, table, sf);
+	table = mlx5_sf_table_try_get(table->dev);
+	if (!table)
+		return 0;
+
+	mutex_lock(&table->sf_state_lock);
+	sf = mlx5_sf_lookup_by_function_id(table, event->function_id);
+	if (!sf)
+		goto sf_err;
+
+	/* When a driver is attached to or detached from a function, an
+	 * event reports such a state change.
+	 */
+	update = mlx5_sf_state_update_check(sf, event->new_vhca_state);
+	if (update)
+		sf->hw_state = event->new_vhca_state;
+sf_err:
+	mutex_unlock(&table->sf_state_lock);
+	mlx5_sf_table_put(table);
+	return 0;
 }
 
 static void mlx5_sf_table_enable(struct mlx5_sf_table *table)
@@ -229,6 +441,22 @@ static void mlx5_sf_table_enable(struct mlx5_sf_table *table)
 	refcount_set(&table->refcount, 1);
 }
 
+static void mlx5_sf_deactivate_all(struct mlx5_sf_table *table)
+{
+	struct mlx5_eswitch *esw = table->dev->priv.eswitch;
+	unsigned long index;
+	struct mlx5_sf *sf;
+
+	/* At this point, no new user commands can start and no vhca event can
+	 * arrive. It is safe to destroy all user created SFs.
+	 */
+	xa_for_each(&table->port_indices, index, sf) {
+		mlx5_esw_offloads_sf_vport_disable(esw, sf->hw_fn_id);
+		mlx5_sf_id_erase(table, sf);
+		mlx5_sf_dealloc(table, sf);
+	}
+}
+
 static void mlx5_sf_table_disable(struct mlx5_sf_table *table)
 {
 	if (!mlx5_sf_max_functions(table->dev))
@@ -237,14 +465,13 @@ static void mlx5_sf_table_disable(struct mlx5_sf_table *table)
 	if (!refcount_read(&table->refcount))
 		return;
 
-	/* Balances with refcount_set; drop the reference so that new user cmd cannot start. */
+	/* Balances with refcount_set; drop the reference so that a new user cmd
+	 * cannot start and a new vhca event handler cannot run.
+	 */
 	mlx5_sf_table_put(table);
 	wait_for_completion(&table->disable_complete);
 
-	/* At this point, no new user commands can start.
-	 * It is safe to destroy all user created SFs.
-	 */
-	mlx5_sf_destroy_all(table);
+	mlx5_sf_deactivate_all(table);
 }
 
 static int mlx5_sf_esw_event(struct notifier_block *nb, unsigned long event, void *data)
@@ -276,23 +503,34 @@ int mlx5_sf_table_init(struct mlx5_core_dev *dev)
 	struct mlx5_sf_table *table;
 	int err;
 
-	if (!mlx5_sf_table_supported(dev))
+	if (!mlx5_sf_table_supported(dev) || !mlx5_vhca_event_supported(dev))
 		return 0;
 
 	table = kzalloc(sizeof(*table), GFP_KERNEL);
 	if (!table)
 		return -ENOMEM;
 
+	mutex_init(&table->sf_state_lock);
 	table->dev = dev;
 	xa_init(&table->port_indices);
 	dev->priv.sf_table = table;
+	refcount_set(&table->refcount, 0);
 	table->esw_nb.notifier_call = mlx5_sf_esw_event;
 	err = mlx5_esw_event_notifier_register(dev->priv.eswitch, &table->esw_nb);
 	if (err)
 		goto reg_err;
+
+	table->vhca_nb.notifier_call = mlx5_sf_vhca_event;
+	err = mlx5_vhca_event_notifier_register(table->dev, &table->vhca_nb);
+	if (err)
+		goto vhca_err;
+
 	return 0;
 
+vhca_err:
+	mlx5_esw_event_notifier_unregister(dev->priv.eswitch, &table->esw_nb);
 reg_err:
+	mutex_destroy(&table->sf_state_lock);
 	kfree(table);
 	dev->priv.sf_table = NULL;
 	return err;
@@ -305,8 +543,10 @@ void mlx5_sf_table_cleanup(struct mlx5_core_dev *dev)
 	if (!table)
 		return;
 
+	mlx5_vhca_event_notifier_unregister(table->dev, &table->vhca_nb);
 	mlx5_esw_event_notifier_unregister(dev->priv.eswitch, &table->esw_nb);
 	WARN_ON(refcount_read(&table->refcount));
+	mutex_destroy(&table->sf_state_lock);
 	WARN_ON(!xa_empty(&table->port_indices));
 	kfree(table);
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/hw_table.c b/drivers/net/ethernet/mellanox/mlx5/core/sf/hw_table.c
index c7757f399e8a..58b6be0b03d7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sf/hw_table.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/hw_table.c
@@ -4,11 +4,14 @@
 #include "vhca_event.h"
 #include "priv.h"
 #include "sf.h"
+#include "mlx5_ifc_vhca_event.h"
+#include "vhca_event.h"
 #include "ecpf.h"
 
 struct mlx5_sf_hw {
 	u32 usr_sfnum;
 	u8 allocated: 1;
+	u8 pending_delete: 1;
 };
 
 struct mlx5_sf_hw_table {
@@ -16,6 +19,8 @@ struct mlx5_sf_hw_table {
 	struct mlx5_sf_hw *sfs;
 	int max_local_functions;
 	u8 ecpu: 1;
+	struct mutex table_lock; /* Serializes sf deletion and vhca state change handler. */
+	struct notifier_block vhca_nb;
 };
 
 u16 mlx5_sf_sw_to_hw_id(const struct mlx5_core_dev *dev, u16 sw_id)
@@ -23,6 +28,11 @@ u16 mlx5_sf_sw_to_hw_id(const struct mlx5_core_dev *dev, u16 sw_id)
 	return sw_id + mlx5_sf_start_function_id(dev);
 }
 
+static u16 mlx5_sf_hw_to_sw_id(const struct mlx5_core_dev *dev, u16 hw_id)
+{
+	return hw_id - mlx5_sf_start_function_id(dev);
+}
+
 int mlx5_sf_hw_table_sf_alloc(struct mlx5_core_dev *dev, u32 usr_sfnum)
 {
 	struct mlx5_sf_hw_table *table = dev->priv.sf_hw_table;
@@ -34,10 +44,13 @@ int mlx5_sf_hw_table_sf_alloc(struct mlx5_core_dev *dev, u32 usr_sfnum)
 	if (!table->max_local_functions)
 		return -EOPNOTSUPP;
 
+	mutex_lock(&table->table_lock);
 	/* Check if sf with same sfnum already exists or not. */
 	for (i = 0; i < table->max_local_functions; i++) {
-		if (table->sfs[i].allocated && table->sfs[i].usr_sfnum == usr_sfnum)
-			return -EEXIST;
+		if (table->sfs[i].allocated && table->sfs[i].usr_sfnum == usr_sfnum) {
+			err = -EEXIST;
+			goto exist_err;
+		}
 	}
 
 	/* Find the free entry and allocate the entry from the array */
@@ -63,16 +76,19 @@ int mlx5_sf_hw_table_sf_alloc(struct mlx5_core_dev *dev, u32 usr_sfnum)
 	if (err)
 		goto vhca_err;
 
+	mutex_unlock(&table->table_lock);
 	return sw_id;
 
 vhca_err:
 	mlx5_cmd_dealloc_sf(table->dev, hw_fn_id);
 err:
 	table->sfs[i].allocated = false;
+exist_err:
+	mutex_unlock(&table->table_lock);
 	return err;
 }
 
-void mlx5_sf_hw_table_sf_free(struct mlx5_core_dev *dev, u16 id)
+static void _mlx5_sf_hw_id_free(struct mlx5_core_dev *dev, u16 id)
 {
 	struct mlx5_sf_hw_table *table = dev->priv.sf_hw_table;
 	u16 hw_fn_id;
@@ -80,6 +96,50 @@ void mlx5_sf_hw_table_sf_free(struct mlx5_core_dev *dev, u16 id)
 	hw_fn_id = mlx5_sf_sw_to_hw_id(table->dev, id);
 	mlx5_cmd_dealloc_sf(table->dev, hw_fn_id);
 	table->sfs[id].allocated = false;
+	table->sfs[id].pending_delete = false;
+}
+
+void mlx5_sf_hw_table_sf_free(struct mlx5_core_dev *dev, u16 id)
+{
+	struct mlx5_sf_hw_table *table = dev->priv.sf_hw_table;
+
+	mutex_lock(&table->table_lock);
+	_mlx5_sf_hw_id_free(dev, id);
+	mutex_unlock(&table->table_lock);
+}
+
+void mlx5_sf_hw_table_sf_deferred_free(struct mlx5_core_dev *dev, u16 id)
+{
+	struct mlx5_sf_hw_table *table = dev->priv.sf_hw_table;
+	u32 out[MLX5_ST_SZ_DW(query_vhca_state_out)] = {};
+	u16 hw_fn_id;
+	u8 state;
+	int err;
+
+	hw_fn_id = mlx5_sf_sw_to_hw_id(dev, id);
+	mutex_lock(&table->table_lock);
+	err = mlx5_cmd_query_vhca_state(dev, hw_fn_id, table->ecpu, out, sizeof(out));
+	if (err)
+		goto err;
+	state = MLX5_GET(query_vhca_state_out, out, vhca_state_context.vhca_state);
+	if (state == MLX5_VHCA_STATE_ALLOCATED) {
+		mlx5_cmd_dealloc_sf(table->dev, hw_fn_id);
+		table->sfs[id].allocated = false;
+	} else {
+		table->sfs[id].pending_delete = true;
+	}
+err:
+	mutex_unlock(&table->table_lock);
+}
+
+static void mlx5_sf_hw_dealloc_all(struct mlx5_sf_hw_table *table)
+{
+	int i;
+
+	for (i = 0; i < table->max_local_functions; i++) {
+		if (table->sfs[i].allocated)
+			_mlx5_sf_hw_id_free(table->dev, i);
+	}
 }
 
 int mlx5_sf_hw_table_init(struct mlx5_core_dev *dev)
@@ -88,7 +148,7 @@ int mlx5_sf_hw_table_init(struct mlx5_core_dev *dev)
 	struct mlx5_sf_hw *sfs;
 	int max_functions;
 
-	if (!mlx5_sf_supported(dev))
+	if (!mlx5_sf_supported(dev) || !mlx5_vhca_event_supported(dev))
 		return 0;
 
 	max_functions = mlx5_sf_max_functions(dev);
@@ -100,6 +160,7 @@ int mlx5_sf_hw_table_init(struct mlx5_core_dev *dev)
 	if (!sfs)
 		goto table_err;
 
+	mutex_init(&table->table_lock);
 	table->dev = dev;
 	table->sfs = sfs;
 	table->max_local_functions = max_functions;
@@ -120,6 +181,53 @@ void mlx5_sf_hw_table_cleanup(struct mlx5_core_dev *dev)
 	if (!table)
 		return;
 
+	mutex_destroy(&table->table_lock);
 	kfree(table->sfs);
 	kfree(table);
 }
+
+static int mlx5_sf_hw_vhca_event(struct notifier_block *nb, unsigned long opcode, void *data)
+{
+	struct mlx5_sf_hw_table *table = container_of(nb, struct mlx5_sf_hw_table, vhca_nb);
+	const struct mlx5_vhca_state_event *event = data;
+	struct mlx5_sf_hw *sf_hw;
+	u16 sw_id;
+
+	if (event->new_vhca_state != MLX5_VHCA_STATE_ALLOCATED)
+		return 0;
+
+	sw_id = mlx5_sf_hw_to_sw_id(table->dev, event->function_id);
+	sf_hw = &table->sfs[sw_id];
+
+	mutex_lock(&table->table_lock);
+	/* The firmware event indicates that the SF driver has finally
+	 * detached from the SF. Hence recycle the SF hardware id for reuse.
+	 */
+	if (sf_hw->allocated && sf_hw->pending_delete)
+		_mlx5_sf_hw_id_free(table->dev, sw_id);
+	mutex_unlock(&table->table_lock);
+	return 0;
+}
+
+int mlx5_sf_hw_table_create(struct mlx5_core_dev *dev)
+{
+	struct mlx5_sf_hw_table *table = dev->priv.sf_hw_table;
+
+	if (!table)
+		return 0;
+
+	table->vhca_nb.notifier_call = mlx5_sf_hw_vhca_event;
+	return mlx5_vhca_event_notifier_register(table->dev, &table->vhca_nb);
+}
+
+void mlx5_sf_hw_table_destroy(struct mlx5_core_dev *dev)
+{
+	struct mlx5_sf_hw_table *table = dev->priv.sf_hw_table;
+
+	if (!table)
+		return;
+
+	mlx5_vhca_event_notifier_unregister(table->dev, &table->vhca_nb);
+	/* Dealloc SFs whose firmware event has been missed. */
+	mlx5_sf_hw_dealloc_all(table);
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/priv.h b/drivers/net/ethernet/mellanox/mlx5/core/sf/priv.h
index 7f3622375a9c..cb02a51d0986 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sf/priv.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/priv.h
@@ -9,9 +9,13 @@
 int mlx5_cmd_alloc_sf(struct mlx5_core_dev *dev, u16 function_id);
 int mlx5_cmd_dealloc_sf(struct mlx5_core_dev *dev, u16 function_id);
 
+int mlx5_cmd_sf_enable_hca(struct mlx5_core_dev *dev, u16 func_id);
+int mlx5_cmd_sf_disable_hca(struct mlx5_core_dev *dev, u16 func_id);
+
 u16 mlx5_sf_sw_to_hw_id(const struct mlx5_core_dev *dev, u16 sw_id);
 
 int mlx5_sf_hw_table_sf_alloc(struct mlx5_core_dev *dev, u32 usr_sfnum);
 void mlx5_sf_hw_table_sf_free(struct mlx5_core_dev *dev, u16 id);
+void mlx5_sf_hw_table_sf_deferred_free(struct mlx5_core_dev *dev, u16 id);
 
 #endif
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h b/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h
index dd23b6c2d887..296fd070617e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/sf.h
@@ -31,6 +31,9 @@ static inline u16 mlx5_sf_max_functions(const struct mlx5_core_dev *dev)
 int mlx5_sf_hw_table_init(struct mlx5_core_dev *dev);
 void mlx5_sf_hw_table_cleanup(struct mlx5_core_dev *dev);
 
+int mlx5_sf_hw_table_create(struct mlx5_core_dev *dev);
+void mlx5_sf_hw_table_destroy(struct mlx5_core_dev *dev);
+
 int mlx5_sf_table_init(struct mlx5_core_dev *dev);
 void mlx5_sf_table_cleanup(struct mlx5_core_dev *dev);
 
@@ -38,6 +41,13 @@ int mlx5_devlink_sf_port_new(struct devlink *devlink, const struct devlink_port_
 			     struct netlink_ext_ack *extack);
 int mlx5_devlink_sf_port_del(struct devlink *devlink, unsigned int port_index,
 			     struct netlink_ext_ack *extack);
+int mlx5_devlink_sf_port_fn_state_get(struct devlink *devlink, struct devlink_port *dl_port,
+				      enum devlink_port_function_state *state,
+				      enum devlink_port_function_opstate *opstate,
+				      struct netlink_ext_ack *extack);
+int mlx5_devlink_sf_port_fn_state_set(struct devlink *devlink, struct devlink_port *dl_port,
+				      enum devlink_port_function_state state,
+				      struct netlink_ext_ack *extack);
 #else
 
 static inline bool mlx5_sf_supported(const struct mlx5_core_dev *dev)
@@ -59,6 +69,15 @@ static inline void mlx5_sf_hw_table_cleanup(struct mlx5_core_dev *dev)
 {
 }
 
+static inline int mlx5_sf_hw_table_create(struct mlx5_core_dev *dev)
+{
+	return 0;
+}
+
+static inline void mlx5_sf_hw_table_destroy(struct mlx5_core_dev *dev)
+{
+}
+
 static inline int mlx5_sf_table_init(struct mlx5_core_dev *dev)
 {
 	return 0;
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [net-next v4 13/15] devlink: Add devlink port documentation
  2020-12-14 21:43 [net-next v4 00/15] Add mlx5 subfunction support Saeed Mahameed
                   ` (11 preceding siblings ...)
  2020-12-14 21:43 ` [net-next v4 12/15] net/mlx5: SF, Port function state change support Saeed Mahameed
@ 2020-12-14 21:43 ` Saeed Mahameed
  2020-12-14 21:43 ` [net-next v4 14/15] devlink: Extend devlink port documentation for subfunctions Saeed Mahameed
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 65+ messages in thread
From: Saeed Mahameed @ 2020-12-14 21:43 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Jason Gunthorpe
  Cc: Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Parav Pandit, Jiri Pirko, Saeed Mahameed

From: Parav Pandit <parav@nvidia.com>

Added documentation for devlink port and port function related commands.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../networking/devlink/devlink-port.rst       | 118 ++++++++++++++++++
 Documentation/networking/devlink/index.rst    |   1 +
 2 files changed, 119 insertions(+)
 create mode 100644 Documentation/networking/devlink/devlink-port.rst

diff --git a/Documentation/networking/devlink/devlink-port.rst b/Documentation/networking/devlink/devlink-port.rst
new file mode 100644
index 000000000000..4c910dbb01ca
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-port.rst
@@ -0,0 +1,118 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _devlink_port:
+
+============
+Devlink Port
+============
+
+``devlink-port`` is a port that exists on the device. It is a logically
+separate ingress/egress point of the device. A devlink port can be any one
+of many flavours. A devlink port flavour, along with port attributes,
+describes what a port represents.
+
+A device driver that intends to publish a devlink port sets the
+devlink port attributes and registers the devlink port; a minimal
+sketch follows the flavour list below.
+
+Devlink port flavours are described below.
+
+.. list-table:: List of devlink port flavours
+   :widths: 33 90
+
+   * - Flavour
+     - Description
+   * - ``DEVLINK_PORT_FLAVOUR_PHYSICAL``
+     - Any kind of physical port. This can be an eswitch physical port or any
+       other physical port on the device.
+   * - ``DEVLINK_PORT_FLAVOUR_DSA``
+     - This indicates a DSA interconnect port.
+   * - ``DEVLINK_PORT_FLAVOUR_CPU``
+     - This indicates a CPU port applicable only to DSA.
+   * - ``DEVLINK_PORT_FLAVOUR_PCI_PF``
+     - This indicates an eswitch port representing a port of PCI
+       physical function (PF).
+   * - ``DEVLINK_PORT_FLAVOUR_PCI_VF``
+     - This indicates an eswitch port representing a port of PCI
+       virtual function (VF).
+   * - ``DEVLINK_PORT_FLAVOUR_VIRTUAL``
+     - This indicates a virtual port for the PCI virtual function.
+
+A devlink port can have a different type based on the link layer, as described below.
+
+.. list-table:: List of devlink port types
+   :widths: 23 90
+
+   * - Type
+     - Description
+   * - ``DEVLINK_PORT_TYPE_ETH``
+     - Driver should set this port type when a link layer of the port is
+       Ethernet.
+   * - ``DEVLINK_PORT_TYPE_IB``
+     - Driver should set this port type when a link layer of the port is
+       InfiniBand.
+   * - ``DEVLINK_PORT_TYPE_AUTO``
+     - This type is indicated by the user when driver should detect the port
+       type automatically.
+
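+For example, a driver whose port carries an Ethernet link sets the type once
+the backing netdevice is registered (sketch, continuing the hypothetical
+``foo`` driver from above)::
+
+    /* after register_netdev() succeeds for the netdev backing this port */
+    devlink_port_type_eth_set(&fdev->dl_port, fdev->netdev);
+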
+PCI controllers
+---------------
+In most cases a PCI device has only one controller. A controller consists of
+potentially multiple physical and virtual functions. Each such PCI function
+consists of one or more ports, and each port of a function is represented by
+a devlink eswitch port.
+
+A PCI device connected to multiple CPUs, multiple PCI root complexes or a
+SmartNIC, however, may have multiple controllers. For a device with multiple
+controllers, each controller is distinguished by a unique controller number.
+An eswitch on the PCI device supports ports of multiple controllers.
+
+An example view of a system with two controllers::
+
+                 ---------------------------------------------------------
+                 |                                                       |
+                 |           --------- ---------         ------- ------- |
+    -----------  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
+    | server  |  | -------   ----/---- ---/----- ------- ---/--- ---/--- |
+    | pci rc  |=== | pf0 |______/________/       | pf1 |___/_______/     |
+    | connect |  | -------                       -------                 |
+    -----------  |     | controller_num=1 (no eswitch)                   |
+                 ------|--------------------------------------------------
+                 (internal wire)
+                       |
+                 ---------------------------------------------------------
+                 | devlink eswitch ports and reps                        |
+                 | ----------------------------------------------------- |
+                 | |ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 |ctrl-0 | |
+                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
+                 | ----------------------------------------------------- |
+                 | |ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 |ctrl-1 | |
+                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
+                 | ----------------------------------------------------- |
+                 |                                                       |
+                 |                                                       |
+    -----------  |           --------- ---------         ------- ------- |
+    | smartNIC|  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
+    | pci rc  |==| -------   ----/---- ---/----- ------- ---/--- ---/--- |
+    | connect |  | | pf0 |______/________/       | pf1 |___/_______/     |
+    -----------  | -------                       -------                 |
+                 |                                                       |
+                 |  local controller_num=0 (eswitch)                     |
+                 ---------------------------------------------------------
+
+In the above example, the external controller (identified by controller
+number = 1) doesn't have an eswitch. The local controller (identified by
+controller number = 0) has the eswitch. The devlink instance on the local
+controller has eswitch devlink ports representing ports of both controllers.
+
+Port function configuration
+===========================
+
+A user can configure the port function attributes before enumerating the
+PCI function. Usually this means the user should configure the port function
+attributes before a bus specific device for the function is created. However,
+when SRIOV is enabled, virtual function devices are created on the PCI bus.
+Hence, the function attributes should be configured before binding the
+virtual function device to its driver.
+
+The user may set the hardware address of the function represented by the
+devlink port function. For an Ethernet port function this means a MAC address.
diff --git a/Documentation/networking/devlink/index.rst b/Documentation/networking/devlink/index.rst
index d82874760ae2..aab79667f97b 100644
--- a/Documentation/networking/devlink/index.rst
+++ b/Documentation/networking/devlink/index.rst
@@ -18,6 +18,7 @@ general.
    devlink-info
    devlink-flash
    devlink-params
+   devlink-port
    devlink-region
    devlink-resource
    devlink-reload
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [net-next v4 14/15] devlink: Extend devlink port documentation for subfunctions
  2020-12-14 21:43 [net-next v4 00/15] Add mlx5 subfunction support Saeed Mahameed
                   ` (12 preceding siblings ...)
  2020-12-14 21:43 ` [net-next v4 13/15] devlink: Add devlink port documentation Saeed Mahameed
@ 2020-12-14 21:43 ` Saeed Mahameed
  2020-12-14 21:43 ` [net-next v4 15/15] net/mlx5: Add devlink subfunction port documentation Saeed Mahameed
  2020-12-15  1:53 ` [net-next v4 00/15] Add mlx5 subfunction support Alexander Duyck
  15 siblings, 0 replies; 65+ messages in thread
From: Saeed Mahameed @ 2020-12-14 21:43 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Jason Gunthorpe
  Cc: Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Parav Pandit, Saeed Mahameed

From: Parav Pandit <parav@nvidia.com>

Add devlink port documentation for subfunction management.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 Documentation/driver-api/auxiliary_bus.rst    |  2 +
 .../networking/devlink/devlink-port.rst       | 89 ++++++++++++++++++-
 2 files changed, 87 insertions(+), 4 deletions(-)

diff --git a/Documentation/driver-api/auxiliary_bus.rst b/Documentation/driver-api/auxiliary_bus.rst
index 2312506b0674..fff96c7ba7a8 100644
--- a/Documentation/driver-api/auxiliary_bus.rst
+++ b/Documentation/driver-api/auxiliary_bus.rst
@@ -1,5 +1,7 @@
 .. SPDX-License-Identifier: GPL-2.0-only
 
+.. _auxiliary_bus:
+
 =============
 Auxiliary Bus
 =============
diff --git a/Documentation/networking/devlink/devlink-port.rst b/Documentation/networking/devlink/devlink-port.rst
index 4c910dbb01ca..c6924e7a341e 100644
--- a/Documentation/networking/devlink/devlink-port.rst
+++ b/Documentation/networking/devlink/devlink-port.rst
@@ -34,6 +34,9 @@ Devlink port flavours are described below.
    * - ``DEVLINK_PORT_FLAVOUR_PCI_VF``
      - This indicates an eswitch port representing a port of PCI
        virtual function (VF).
+   * - ``DEVLINK_PORT_FLAVOUR_PCI_SF``
+     - This indicates an eswitch port representing a port of PCI
+       subfunction (SF).
    * - ``DEVLINK_PORT_FLAVOUR_VIRTUAL``
      - This indicates a virtual port for the PCI virtual function.
 
@@ -57,9 +60,9 @@ A devlink port can have a different type based on the link layer, as described below.
 PCI controllers
 ---------------
 In most cases a PCI device has only one controller. A controller consists of
-potentially multiple physical and virtual functions. Each such PCI function
-consists of one or more ports, and each port of a function is represented by
-a devlink eswitch port.
+potentially multiple physical functions, virtual functions and subfunctions.
+Each such PCI function consists of one or more ports, and each port of a
+function is represented by a devlink eswitch port.
 
 A PCI device connected to multiple CPUs, multiple PCI root complexes or a
 SmartNIC, however, may have multiple controllers. For a device with multiple
@@ -112,7 +115,85 @@ PCI function. Usually this means the user should configure the port function
 attributes before a bus specific device for the function is created. However,
 when SRIOV is enabled, virtual function devices are created on the PCI bus.
 Hence, the function attributes should be configured before binding the
-virtual function device to its driver.
+virtual function device to its driver. For subfunctions, this means the user
+should configure the port function attributes before activating the function.
 
 The user may set the hardware address of the function represented by the
 devlink port function. For an Ethernet port function this means a MAC address.
+
+Subfunctions
+============
+
+A subfunction is a lightweight function that has a parent PCI function on
+which it is deployed. Subfunctions are created and deployed in units of 1.
+Unlike SRIOV VFs, they don't require their own PCI virtual function; they
+communicate with the hardware through the parent PCI function. As a result,
+subfunctions can scale better.
+
+To use a subfunction, a three step setup sequence is followed:
+(1) create - create a subfunction;
+(2) configure - configure subfunction attributes;
+(3) deploy - deploy the subfunction.
+
+Subfunction management is done using the devlink port user interface.
+The user performs setup on the subfunction management device.
+
+(1) Create
+----------
+A subfunction is created using the devlink port interface. The user adds a
+subfunction by adding a devlink port of subfunction flavour. The devlink
+kernel code calls down to the subfunction management driver (devlink op) and
+asks it to create a subfunction devlink port. The driver then instantiates
+the subfunction port and any associated objects such as health reporters and
+a representor netdevice.
+
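+A rough sketch of the driver side wiring, reusing the op and function names
+from this series (illustrative; not the verbatim mlx5 code)::
+
+    static const struct devlink_ops mlx5_devlink_ops = {
+            /* subfunction port lifecycle */
+            .port_new          = mlx5_devlink_sf_port_new,
+            .port_del          = mlx5_devlink_sf_port_del,
+            /* subfunction port function state control */
+            .port_fn_state_get = mlx5_devlink_sf_port_fn_state_get,
+            .port_fn_state_set = mlx5_devlink_sf_port_fn_state_set,
+    };
+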
+(2) Configure
+-------------
+The subfunction devlink port is created but is not active yet. That means the
+entities are created on the devlink side and the e-switch port representor is
+created, but the subfunction device itself is not created. The user might use
+the e-switch port representor to apply settings, put it into a bridge, add TC
+rules, etc. The user might as well configure the hardware address (such as the
+MAC address) of the subfunction while the subfunction is inactive.
+
+(3) Deploy
+----------
+Once the subfunction is configured, the user must activate it to use it. Upon
+activation, the subfunction management driver asks the subfunction management
+device to instantiate the actual subfunction device on a particular PCI
+function. A subfunction device is created on the
+:ref:`Documentation/driver-api/auxiliary_bus.rst <auxiliary_bus>`. At this
+point a matching subfunction driver binds to the subfunction's auxiliary
+device.
+
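+A skeletal view of such a subfunction driver, with hypothetical ``foo_sf_*``
+names (see :ref:`Documentation/driver-api/auxiliary_bus.rst <auxiliary_bus>`
+for the full API)::
+
+    static int foo_sf_probe(struct auxiliary_device *adev,
+                            const struct auxiliary_device_id *id)
+    {
+            /* map the SF's dedicated BAR window, create class devices */
+            return 0;
+    }
+
+    static void foo_sf_remove(struct auxiliary_device *adev)
+    {
+            /* tear down the class devices of this subfunction */
+    }
+
+    static const struct auxiliary_device_id foo_sf_id_table[] = {
+            { .name = "mlx5_core.sf" },   /* "<modname>.<devname>" */
+            {},
+    };
+    MODULE_DEVICE_TABLE(auxiliary, foo_sf_id_table);
+
+    static struct auxiliary_driver foo_sf_driver = {
+            .name     = "sf",
+            .probe    = foo_sf_probe,
+            .remove   = foo_sf_remove,
+            .id_table = foo_sf_id_table,
+    };
+    module_auxiliary_driver(foo_sf_driver);
+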
+Terms and Definitions
+=====================
+
+.. list-table:: Terms and Definitions
+   :widths: 22 90
+
+   * - Term
+     - Definitions
+   * - ``PCI device``
+     - A physical PCI device having one or more PCI buses and consisting of
+       one or more PCI controllers.
+   * - ``PCI controller``
+     -  A controller consists of potentially multiple physical functions,
+        virtual functions and subfunctions.
+   * - ``Port function``
+     -  An object to manage the function of a port.
+   * - ``Subfunction``
+     -  A lightweight function that has a parent PCI function on which it is
+        deployed.
+   * - ``Subfunction device``
+     -  A bus device of the subfunction, usually on an auxiliary bus.
+   * - ``Subfunction driver``
+     -  A device driver for the subfunction auxiliary device.
+   * - ``Subfunction management device``
+     -  A PCI physical function that supports subfunction management.
+   * - ``Subfunction management driver``
+     -  A device driver for a PCI physical function that supports
+        subfunction management using the devlink port interface.
+   * - ``Subfunction host driver``
+     -  A device driver for a PCI physical function that hosts subfunction
+        devices. In most cases it is the same as the subfunction management
+        driver. When the subfunction is used on an external controller, the
+        subfunction management and host drivers are different.
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [net-next v4 15/15] net/mlx5: Add devlink subfunction port documentation
  2020-12-14 21:43 [net-next v4 00/15] Add mlx5 subfunction support Saeed Mahameed
                   ` (13 preceding siblings ...)
  2020-12-14 21:43 ` [net-next v4 14/15] devlink: Extend devlink port documentation for subfunctions Saeed Mahameed
@ 2020-12-14 21:43 ` Saeed Mahameed
  2020-12-15  1:53 ` [net-next v4 00/15] Add mlx5 subfunction support Alexander Duyck
  15 siblings, 0 replies; 65+ messages in thread
From: Saeed Mahameed @ 2020-12-14 21:43 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Jason Gunthorpe
  Cc: Leon Romanovsky, netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, david.m.ertman, dan.j.williams, kiran.patil,
	gregkh, Parav Pandit, Saeed Mahameed

From: Parav Pandit <parav@nvidia.com>

Add documentation for subfunction management using devlink
port.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../device_drivers/ethernet/mellanox/mlx5.rst | 204 ++++++++++++++++++
 1 file changed, 204 insertions(+)

diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst
index a5eb22793bb9..07e38c044355 100644
--- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst
@@ -12,6 +12,8 @@ Contents
 - `Enabling the driver and kconfig options`_
 - `Devlink info`_
 - `Devlink parameters`_
+- `mlx5 subfunction`_
+- `mlx5 port function`_
 - `Devlink health reporters`_
 - `mlx5 tracepoints`_
 
@@ -181,6 +183,208 @@ User command examples:
       values:
          cmode driverinit value true
 
+mlx5 subfunction
+================
+mlx5 supports subfunction management using the devlink port interface (see :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`).
+
+A subfunction has its own function capabilities and its own resources. This
+means a subfunction has its own dedicated queues (txq, rxq, cq, eq). These
+queues are neither shared with nor stolen from the parent PCI function.
+
+When a subfunction is RDMA capable, it has its own QP1, GID table and RDMA
+resources, neither shared with nor stolen from the parent PCI function.
+
+A subfunction has a dedicated window in PCI BAR space that is not shared
+with the other subfunctions or the parent PCI function. This ensures that all
+class devices of the subfunction access only their assigned PCI BAR space.
+
+A subfunction supports eswitch representation through which it supports tc
+offloads. The user must configure the eswitch to send/receive packets
+from/to the subfunction port.
+
+Subfunctions share PCI level resources such as PCI MSI-X IRQs with the
+other subfunctions and/or with their parent PCI function.
+
+Example mlx5 software, system and device view::
+
+       _______
+      | admin |
+      | user  |----------
+      |_______|         |
+          |             |
+      ____|____       __|______            _________________
+     |         |     |         |          |                 |
+     | devlink |     | tc tool |          |    user         |
+     | tool    |     |_________|          | applications    |
+     |_________|         |                |_________________|
+           |             |                   |          |
+           |             |                   |          |         Userspace
+ +---------|-------------|-------------------|----------|--------------------+
+           |             |           +----------+   +----------+   Kernel
+           |             |           |  netdev  |   | rdma dev |
+           |             |           +----------+   +----------+
+   (devlink port add/del |              ^               ^
+    port function set)   |              |               |
+           |             |              +---------------|
+      _____|___          |              |        _______|_______
+     |         |         |              |       | mlx5 class    |
+     | devlink |   +------------+       |       |   drivers     |
+     | kernel  |   | rep netdev |       |       |(mlx5_core,ib) |
+     |_________|   +------------+       |       |_______________|
+           |             |              |               ^
+   (devlink ops)         |              |          (probe/remove)
+  _________|________     |              |           ____|________
+ | subfunction      |    |     +---------------+   | subfunction |
+ | management driver|-----     | subfunction   |---|  driver     |
+ | (mlx5_core)      |          | auxiliary dev |   | (mlx5_core) |
+ |__________________|          +---------------+   |_____________|
+           |                                            ^
+  (sf add/del, vhca events)                             |
+           |                                      (device add/del)
+      _____|____                                    ____|________
+     |          |                                  | subfunction |
+     |  PCI NIC |---- activate/deactivate events-->| subfunction |
+     |__________|                                  | (mlx5_core) |
+                                                   |_____________|
+
+A subfunction is created using the devlink port interface.
+
+- Change device to switchdev mode::
+
+    $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
+
+- Add a devlink port of subfunction flavour::
+
+    $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
+
+- Show a devlink port of the subfunction::
+
+    $ devlink port show pci/0000:06:00.0/32768
+    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
+      function:
+        hw_addr 00:00:00:00:00:00
+
+- Delete a devlink port of subfunction after use::
+
+    $ devlink port del pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
+
+mlx5 port function
+==================
+The mlx5 driver provides a mechanism to set up PCI VF/SF port function
+attributes in a unified way for SmartNIC and non-SmartNIC NICs.
+
+This is supported only when the eswitch mode is set to switchdev. Port
+function configuration of the PCI VF/SF is supported through the devlink
+eswitch port.
+
+Port function attributes should be set before the PCI VF/SF is enumerated
+by the driver.
+
+MAC address setup
+-----------------
+The mlx5 driver provides a mechanism to set up the MAC address of the
+PCI VF/SF.
+
+The configured MAC address of the PCI VF/SF will be used by the netdevice
+and RDMA device created for the PCI VF/SF.
+
+- Get MAC address of the VF identified by its unique devlink port index::
+
+    $ devlink port show pci/0000:06:00.0/2
+    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+      function:
+        hw_addr 00:00:00:00:00:00
+
+- Set MAC address of the VF identified by its unique devlink port index::
+
+    $ devlink port function set pci/0000:06:00.0/2 hw_addr 00:11:22:33:44:55
+
+    $ devlink port show pci/0000:06:00.0/2
+    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+      function:
+        hw_addr 00:11:22:33:44:55
+
+- Get MAC address of the SF identified by its unique devlink port index::
+
+    $ devlink port show pci/0000:06:00.0/32768
+    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
+      function:
+        hw_addr 00:00:00:00:00:00
+
+- Set MAC address of the SF identified by its unique devlink port index::
+
+    $ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88
+
+    $ devlink port show pci/0000:06:00.0/32768
+    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
+      function:
+        hw_addr 00:00:00:00:88:88
+
+SF state setup
+--------------
+To use the SF, the user must activate it using the SF port function state
+attribute.
+
+- Get state of the SF identified by its unique devlink port index::
+
+   $ devlink port show ens2f0npf0sf88
+   pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
+     function:
+       hw_addr 00:00:00:00:88:88 state inactive opstate detached
+
+- Activate the function and verify its state is active::
+
+   $ devlink port function set ens2f0npf0sf88 state active
+
+   $ devlink port show ens2f0npf0sf88
+   pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
+     function:
+       hw_addr 00:00:00:00:88:88 state active opstate detached
+
+Upon function activation, the PF driver instance gets an event from the
+device that the particular SF was activated. This is the cue to put the
+device on the bus, probe it and instantiate the devlink instance and class
+specific auxiliary devices for it.
+
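+A heavily simplified sketch of that handler, with hypothetical ``foo_sf_*``
+names (the real mlx5 code differs)::
+
+    static void foo_sf_dev_release(struct device *dev)
+    {
+            /* free the per-SF bookkeeping attached to this device */
+    }
+
+    /* called when the device reports that a particular SF went active */
+    static int foo_sf_dev_add(struct foo_dev *fdev, u32 sf_index)
+    {
+            struct auxiliary_device *adev = &fdev->sf_devs[sf_index]->adev;
+            int err;
+
+            adev->id = sf_index;
+            adev->name = "sf";
+            adev->dev.parent = fdev->device;
+            adev->dev.release = foo_sf_dev_release;
+
+            err = auxiliary_device_init(adev);
+            if (err)
+                    return err;
+
+            /* the device becomes visible; a matching driver may probe it */
+            err = auxiliary_device_add(adev);
+            if (err)
+                    auxiliary_device_uninit(adev);
+            return err;
+    }
+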
+- Show the auxiliary device and port of the subfunction::
+
+    $ devlink dev show
+    pci/0000:06:00.0
+    auxiliary/mlx5_core.sf.4
+
+    $ devlink port show auxiliary/mlx5_core.sf.4/1
+    auxiliary/mlx5_core.sf.4/1: type eth netdev p0sf88 flavour virtual port 0 splittable false
+
+    $ rdma link show mlx5_0/1
+    link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev p0sf88
+
+    $ rdma dev show
+    8: rocep6s0f1: node_type ca fw 16.29.0550 node_guid 248a:0703:00b3:d113 sys_image_guid 248a:0703:00b3:d112
+    13: mlx5_0: node_type ca fw 16.29.0550 node_guid 0000:00ff:fe00:8888 sys_image_guid 248a:0703:00b3:d112
+
+- Subfunction auxiliary device and class device hierarchy::
+
+                 mlx5_core.sf.4
+          (subfunction auxiliary device)
+                       /\
+                      /  \
+                     /    \
+                    /      \
+                   /        \
+      mlx5_core.eth.4     mlx5_core.rdma.4
+     (sf eth aux dev)     (sf rdma aux dev)
+         |                      |
+         |                      |
+      p0sf88                  mlx5_0
+     (sf netdev)          (sf rdma device)
+
+Additionally, the SF port also gets an event when the driver attaches to the
+auxiliary device of the subfunction. This results in changing the operational
+state of the function. This provides visibility to the user to decide when it
+is safe to delete the SF port for graceful termination of the subfunction.
+
+- Show the SF port operational state::
+
+    $ devlink port show ens2f0npf0sf88
+    pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
+      function:
+        hw_addr 00:00:00:00:88:88 state active opstate attached
+
 Devlink health reporters
 ========================
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [net-next v4 01/15] net/mlx5: Fix compilation warning for 32-bit platform
  2020-12-14 21:43 ` [net-next v4 01/15] net/mlx5: Fix compilation warning for 32-bit platform Saeed Mahameed
@ 2020-12-14 22:31   ` Alexander Duyck
  2020-12-14 22:45     ` Saeed Mahameed
  2020-12-15  4:59     ` Leon Romanovsky
  0 siblings, 2 replies; 65+ messages in thread
From: Alexander Duyck @ 2020-12-14 22:31 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David S. Miller, Jakub Kicinski, Jason Gunthorpe,
	Leon Romanovsky, Netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, Ertman, David M, Dan Williams, Kiran Patil,
	Greg KH, Parav Pandit, Stephen Rothwell

On Mon, Dec 14, 2020 at 1:49 PM Saeed Mahameed <saeed@kernel.org> wrote:
>
> From: Parav Pandit <parav@nvidia.com>
>
> MLX5_GENERAL_OBJECT_TYPES types bitfield is 64-bit field.
>
> Defining an enum for such bit fields on 32-bit platform results in below
> warning.
>
> ./include/vdso/bits.h:7:26: warning: left shift count >= width of type [-Wshift-count-overflow]
>                          ^
> ./include/linux/mlx5/mlx5_ifc.h:10716:46: note: in expansion of macro ‘BIT’
>  MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_SAMPLER = BIT(0x20),
>                                              ^~~
> Use 32-bit friendly left shift.
>
> Fixes: 2a2970891647 ("net/mlx5: Add sample offload hardware bits and structures")
> Signed-off-by: Parav Pandit <parav@nvidia.com>
> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> Signed-off-by: Saeed Mahameed <saeed@kernel.org>
> ---
>  include/linux/mlx5/mlx5_ifc.h | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
> index 0d6e287d614f..b9f15935dfe5 100644
> --- a/include/linux/mlx5/mlx5_ifc.h
> +++ b/include/linux/mlx5/mlx5_ifc.h
> @@ -10711,9 +10711,9 @@ struct mlx5_ifc_affiliated_event_header_bits {
>  };
>
>  enum {
> -       MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_ENCRYPTION_KEY = BIT(0xc),
> -       MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_IPSEC = BIT(0x13),
> -       MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_SAMPLER = BIT(0x20),
> +       MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_ENCRYPTION_KEY = 1ULL << 0xc,
> +       MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_IPSEC = 1ULL << 0x13,
> +       MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_SAMPLER = 1ULL << 0x20,
>  };

Why not just use BIT_ULL?

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [net-next v4 01/15] net/mlx5: Fix compilation warning for 32-bit platform
  2020-12-14 22:31   ` Alexander Duyck
@ 2020-12-14 22:45     ` Saeed Mahameed
  2020-12-15  4:59     ` Leon Romanovsky
  1 sibling, 0 replies; 65+ messages in thread
From: Saeed Mahameed @ 2020-12-14 22:45 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David S. Miller, Jakub Kicinski, Jason Gunthorpe,
	Leon Romanovsky, Netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, Ertman, David M, Dan Williams, Kiran Patil,
	Greg KH, Parav Pandit, Stephen Rothwell

On Mon, 2020-12-14 at 14:31 -0800, Alexander Duyck wrote:
> On Mon, Dec 14, 2020 at 1:49 PM Saeed Mahameed <saeed@kernel.org>
> wrote:
> > From: Parav Pandit <parav@nvidia.com>
> > 
> > MLX5_GENERAL_OBJECT_TYPES types bitfield is 64-bit field.
> > 
> > Defining an enum for such bit fields on 32-bit platform results in
> > below
> > warning.
> > 
> > ./include/vdso/bits.h:7:26: warning: left shift count >= width of
> > type [-Wshift-count-overflow]
> >                          ^
> > ./include/linux/mlx5/mlx5_ifc.h:10716:46: note: in expansion of
> > macro ‘BIT’
> >  MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_SAMPLER = BIT(0x20),
> >                                              ^~~
> > Use 32-bit friendly left shift.
> > 
> > Fixes: 2a2970891647 ("net/mlx5: Add sample offload hardware bits
> > and structures")
> > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > Signed-off-by: Saeed Mahameed <saeed@kernel.org>
> > ---
> >  include/linux/mlx5/mlx5_ifc.h | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/linux/mlx5/mlx5_ifc.h
> > b/include/linux/mlx5/mlx5_ifc.h
> > index 0d6e287d614f..b9f15935dfe5 100644
> > --- a/include/linux/mlx5/mlx5_ifc.h
> > +++ b/include/linux/mlx5/mlx5_ifc.h
> > @@ -10711,9 +10711,9 @@ struct
> > mlx5_ifc_affiliated_event_header_bits {
> >  };
> > 
> >  enum {
> > -       MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_ENCRYPTION_KEY =
> > BIT(0xc),
> > -       MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_IPSEC = BIT(0x13),
> > -       MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_SAMPLER = BIT(0x20),
> > +       MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_ENCRYPTION_KEY = 1ULL <<
> > 0xc,
> > +       MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_IPSEC = 1ULL << 0x13,
> > +       MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_SAMPLER = 1ULL << 0x20,
> >  };
> 
> Why not just use BIT_ULL?

I was following the file convention where we use 1ULL/1UL in all of the
places, I will consider changing the whole file to use BIT macros in
another patch.



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-14 21:43 [net-next v4 00/15] Add mlx5 subfunction support Saeed Mahameed
                   ` (14 preceding siblings ...)
  2020-12-14 21:43 ` [net-next v4 15/15] net/mlx5: Add devlink subfunction port documentation Saeed Mahameed
@ 2020-12-15  1:53 ` Alexander Duyck
  2020-12-15  2:44   ` David Ahern
                     ` (2 more replies)
  15 siblings, 3 replies; 65+ messages in thread
From: Alexander Duyck @ 2020-12-15  1:53 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David S. Miller, Jakub Kicinski, Jason Gunthorpe,
	Leon Romanovsky, Netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, Ertman, David M, Dan Williams, Kiran Patil,
	Greg KH

On Mon, Dec 14, 2020 at 1:49 PM Saeed Mahameed <saeed@kernel.org> wrote:
>
> Hi Dave, Jakub, Jason,
>
> This series form Parav was the theme of this mlx5 release cycle,
> we've been waiting anxiously for the auxbus infrastructure to make it into
> the kernel, and now as the auxbus is in and all the stars are aligned, I
> can finally submit this V2 of the devlink and mlx5 subfunction support.
>
> Subfunctions came to solve the scaling issue of virtualization
> and switchdev environments, where SRIOV failed to deliver and users ran
> out of VFs very quickly as SRIOV demands huge amount of physical resources
> in both of the servers and the NIC.
>
> Subfunction provide the same functionality as SRIOV but in a very
> lightweight manner, please see the thorough and detailed
> documentation from Parav below, in the commit messages and the
> Networking documentation patches at the end of this series.
>

Just to clarify a few things for myself. You mention virtualization
and SR-IOV in your patch description but you cannot support direct
assignment with this correct? The idea here is simply logical
partitioning of an existing network interface, correct? So this isn't
so much a solution for virtualization, but may work better for
containers. I view this as an important distinction to make as the
first thing that came to mind when I read this was mediated devices
which is similar, but focused only on the virtualization case:
https://www.kernel.org/doc/html/v5.9/driver-api/vfio-mediated-device.html

> Parav Pandit Says:
> =================
>
> This patchset introduces support for mlx5 subfunction (SF).
>
> A subfunction is a lightweight function that has a parent PCI function on
> which it is deployed. mlx5 subfunction has its own function capabilities
> and its own resources. This means a subfunction has its own dedicated
> queues(txq, rxq, cq, eq). These queues are neither shared nor stealed from
> the parent PCI function.

Rather than calling this a subfunction, would it make more sense to
call it something such as a queue set? It seems like this is exposing
some of the same functionality we did in the Intel drivers such as
ixgbe and i40e via the macvlan offload interface. However the
ixgbe/i40e hardware was somewhat limited in that we were only able to
expose Ethernet interfaces via this sort of VMQ/VMDQ feature, and even
with that we have seen some limitations to the interface. It sounds
like you are able to break out RDMA capable devices this way as well.
So in terms of ways to go I would argue this is likely better. However
one downside is that we are going to end up seeing each subfunction
being different from driver to driver and vendor to vendor which I
would argue was also one of the problems with SR-IOV as you end up
with a bit of vendor lock-in as a result of this feature since each
vendor will be providing a different interface.

> When subfunction is RDMA capable, it has its own QP1, GID table and rdma
> resources neither shared nor stealed from the parent PCI function.
>
> A subfunction has dedicated window in PCI BAR space that is not shared
> with ther other subfunctions or parent PCI function. This ensures that all
> class devices of the subfunction accesses only assigned PCI BAR space.
>
> A Subfunction supports eswitch representation through which it supports tc
> offloads. User must configure eswitch to send/receive packets from/to
> subfunction port.
>
> Subfunctions share PCI level resources such as PCI MSI-X IRQs with
> their other subfunctions and/or with its parent PCI function.

This piece to the architecture for this has me somewhat concerned. If
all your resources are shared and you are allowing devices to be
created incrementally you either have to pre-partition the entire
function which usually results in limited resources for your base
setup, or free resources from existing interfaces and redistribute
them as things change. I would be curious which approach you are
taking here? So for example if you hit a certain threshold will you
need to reset the port and rebalance the IRQs between the various
functions?

> Patch summary:
> --------------
> Patch 1 to 4 prepares devlink
> patch 5 to 7 mlx5 adds SF device support
> Patch 8 to 11 mlx5 adds SF devlink port support
> Patch 12 and 14 adds documentation
>
> Patch-1 prepares code to handle multiple port function attributes
> Patch-2 introduces devlink pcisf port flavour similar to pcipf and pcivf
> Patch-3 adds port add and delete driver callbacks
> Patch-4 adds port function state get and set callbacks
> Patch-5 mlx5 vhca event notifier support to distribute subfunction
>         state change notification
> Patch-6 adds SF auxiliary device
> Patch-7 adds SF auxiliary driver
> Patch-8 prepares eswitch to handler SF vport
> Patch-9 adds eswitch helpers to add/remove SF vport
> Patch-10 implements devlink port add/del callbacks
> Patch-11 implements devlink port function get/set callbacks
> Patch-12 to 14 adds documentation
> Patch-12 added mlx5 port function documentation
> Patch-13 adds subfunction documentation
> Patch-14 adds mlx5 subfunction documentation
>
> Subfunction support is discussed in detail in RFC [1] and [2].
> RFC [1] and extension [2] describes requirements, design and proposed
> plumbing using devlink, auxiliary bus and sysfs for systemd/udev
> support. Functionality of this patchset is best explained using real
> examples further below.
>
> overview:
> --------
> A subfunction can be created and deleted by a user using devlink port
> add/delete interface.
>
> A subfunction can be configured using devlink port function attribute
> before its activated.
>
> When a subfunction is activated, it results in an auxiliary device on
> the host PCI device where it is deployed. A driver binds to the
> auxiliary device that further creates supported class devices.
>
> example subfunction usage sequence:
> -----------------------------------
> Change device to switchdev mode:
> $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
>
> Add a devlink port of subfunction flaovur:
> $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88

Typo in your description. Also I don't know if you want to stick with
"flavour" or just shorten it to the U.S. spelling which is "flavor".

> Configure mac address of the port function:
> $ devlink port function set ens2f0npf0sf88 hw_addr 00:00:00:00:88:88
>
> Now activate the function:
> $ devlink port function set ens2f0npf0sf88 state active
>
> Now use the auxiliary device and class devices:
> $ devlink dev show
> pci/0000:06:00.0
> auxiliary/mlx5_core.sf.4
>
> $ ip link show
> 127: ens2f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
>     link/ether 24:8a:07:b3:d1:12 brd ff:ff:ff:ff:ff:ff
>     altname enp6s0f0np0
> 129: p0sf88: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
>     link/ether 00:00:00:00:88:88 brd ff:ff:ff:ff:ff:ff>

I assume that p0sf88 is supposed to be the newly created subfunction.
However I thought the naming was supposed to be the same as what you
are referring to in the devlink, or did I miss something?

> $ rdma dev show
> 43: rdmap6s0f0: node_type ca fw 16.29.0550 node_guid 248a:0703:00b3:d112 sys_image_guid 248a:0703:00b3:d112
> 44: mlx5_0: node_type ca fw 16.29.0550 node_guid 0000:00ff:fe00:8888 sys_image_guid 248a:0703:00b3:d112
>
> After use inactivate the function:
> $ devlink port function set ens2f0npf0sf88 state inactive
>
> Now delete the subfunction port:
> $ devlink port del ens2f0npf0sf88

This seems wrong to me as it breaks the symmetry with the port add
command and assumes you have ownership of the interface in the host. I
would much prefer to see the same arguments that were passed to the
add command being used to do the teardown as that would allow for the
parent function to create the object, assign it to a container
namespace, and not need to pull it back in order to destroy it.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-15  1:53 ` [net-next v4 00/15] Add mlx5 subfunction support Alexander Duyck
@ 2020-12-15  2:44   ` David Ahern
  2020-12-15 16:16     ` Alexander Duyck
  2020-12-15  5:48   ` Parav Pandit
  2020-12-15  6:15   ` Saeed Mahameed
  2 siblings, 1 reply; 65+ messages in thread
From: David Ahern @ 2020-12-15  2:44 UTC (permalink / raw)
  To: Alexander Duyck, Saeed Mahameed
  Cc: David S. Miller, Jakub Kicinski, Jason Gunthorpe,
	Leon Romanovsky, Netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, Ertman, David M, Dan Williams, Kiran Patil,
	Greg KH

On 12/14/20 6:53 PM, Alexander Duyck wrote:
>> example subfunction usage sequence:
>> -----------------------------------
>> Change device to switchdev mode:
>> $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
>>
>> Add a devlink port of subfunction flaovur:
>> $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
> 
> Typo in your description. Also I don't know if you want to stick with
> "flavour" or just shorten it to the U.S. spelling which is "flavor".

The term exists in devlink today (since 2018). When support was added to
iproute2 I decided there was no reason to require the US spelling over
the British spelling, so I accepted the patch.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [net-next v4 01/15] net/mlx5: Fix compilation warning for 32-bit platform
  2020-12-14 22:31   ` Alexander Duyck
  2020-12-14 22:45     ` Saeed Mahameed
@ 2020-12-15  4:59     ` Leon Romanovsky
  1 sibling, 0 replies; 65+ messages in thread
From: Leon Romanovsky @ 2020-12-15  4:59 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Saeed Mahameed, David S. Miller, Jakub Kicinski, Jason Gunthorpe,
	Netdev, linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	Ertman, David M, Dan Williams, Kiran Patil, Greg KH,
	Parav Pandit, Stephen Rothwell

On Mon, Dec 14, 2020 at 02:31:25PM -0800, Alexander Duyck wrote:
> On Mon, Dec 14, 2020 at 1:49 PM Saeed Mahameed <saeed@kernel.org> wrote:
> >
> > From: Parav Pandit <parav@nvidia.com>
> >
> > MLX5_GENERAL_OBJECT_TYPES types bitfield is 64-bit field.
> >
> > Defining an enum for such bit fields on 32-bit platform results in below
> > warning.
> >
> > ./include/vdso/bits.h:7:26: warning: left shift count >= width of type [-Wshift-count-overflow]
> >                          ^
> > ./include/linux/mlx5/mlx5_ifc.h:10716:46: note: in expansion of macro ‘BIT’
> >  MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_SAMPLER = BIT(0x20),
> >                                              ^~~
> > Use 32-bit friendly left shift.
> >
> > Fixes: 2a2970891647 ("net/mlx5: Add sample offload hardware bits and structures")
> > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > Signed-off-by: Saeed Mahameed <saeed@kernel.org>
> > ---
> >  include/linux/mlx5/mlx5_ifc.h | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
> > index 0d6e287d614f..b9f15935dfe5 100644
> > --- a/include/linux/mlx5/mlx5_ifc.h
> > +++ b/include/linux/mlx5/mlx5_ifc.h
> > @@ -10711,9 +10711,9 @@ struct mlx5_ifc_affiliated_event_header_bits {
> >  };
> >
> >  enum {
> > -       MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_ENCRYPTION_KEY = BIT(0xc),
> > -       MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_IPSEC = BIT(0x13),
> > -       MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_SAMPLER = BIT(0x20),
> > +       MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_ENCRYPTION_KEY = 1ULL << 0xc,
> > +       MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_IPSEC = 1ULL << 0x13,
> > +       MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_SAMPLER = 1ULL << 0x20,
> >  };
>
> Why not just use BIT_ULL?

mlx5_ifc.h doesn't include bits.h on purpose, and there are "*.c" files
that include that ifc header file but don't include bits.h either.

It can cause build failures in random builds.

The mlx5_ifc.h is our main hardware definition file that we are using in
other projects outside of the kernel (rdma-core) too, so it is preferable
to keep it as plain-C without any extra dependencies.
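
As a standalone plain-C illustration of the underlying warning (userspace
demo, not kernel code; BIT() copied in the same shape as
include/vdso/bits.h):

  /* build with: gcc -m32 -Wall demo.c */
  #include <stdio.h>

  #define BIT(nr) (1UL << (nr))   /* same shape as include/vdso/bits.h */

  int main(void)
  {
          printf("%lu\n", (unsigned long)BIT(0x13));  /* fine everywhere */
          /* BIT(0x20) would be 1UL << 32; on a 32-bit target unsigned long
           * is 32 bits wide, so it triggers -Wshift-count-overflow. */
          printf("%llu\n", 1ULL << 0x20);             /* always well-defined */
          return 0;
  }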

Thanks

^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-15  1:53 ` [net-next v4 00/15] Add mlx5 subfunction support Alexander Duyck
  2020-12-15  2:44   ` David Ahern
@ 2020-12-15  5:48   ` Parav Pandit
  2020-12-15 18:47     ` Alexander Duyck
  2020-12-15 20:59     ` David Ahern
  2020-12-15  6:15   ` Saeed Mahameed
  2 siblings, 2 replies; 65+ messages in thread
From: Parav Pandit @ 2020-12-15  5:48 UTC (permalink / raw)
  To: Alexander Duyck, Saeed Mahameed
  Cc: David S. Miller, Jakub Kicinski, Jason Gunthorpe,
	Leon Romanovsky, Netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, Ertman, David M, Dan Williams, Kiran Patil,
	Greg KH


> From: Alexander Duyck <alexander.duyck@gmail.com>
> Sent: Tuesday, December 15, 2020 7:24 AM
> 
> On Mon, Dec 14, 2020 at 1:49 PM Saeed Mahameed <saeed@kernel.org>
> wrote:
> >
> > Hi Dave, Jakub, Jason,
> >
> 
> Just to clarify a few things for myself. You mention virtualization and SR-IOV
> in your patch description but you cannot support direct assignment with this
> correct? 
Correct. it cannot be directly assigned.

> The idea here is simply logical partitioning of an existing network
> interface, correct? 
No. The idea is to spawn multiple functions from a single PCI device.
These functions are not born in the PCI device or in the OS until they are created by the user.
Jason and Saeed explained this in great detail a few weeks back in the v0 version of the patchset at [1], [2] and [3].
I better not repeat all of it here again. Please go through it.
If you want to read the precursor to it, the RFC from Jiri at [4] also explains this in great detail.

> So this isn't so much a solution for virtualization, but may
> work better for containers. I view this as an important distinction to make as
> the first thing that came to mind when I read this was mediated devices
> which is similar, but focused only on the virtualization case:
> https://www.kernel.org/doc/html/v5.9/driver-api/vfio-mediated-
> device.html
>
Managing a subfunction using a mediated device was already ruled out last year at [5], as it is an abuse of the mdev bus for this purpose and has severe limitations in managing the subfunction device.
We are not going back to it anymore.
It would duplicate a lot of the plumbing which already exists in devlink, netlink, the auxiliary bus and more.
 
> Rather than calling this a subfunction, would it make more sense to call it
> something such as a queue set? 
No, a queue is just one way to send and receive data/packets.
Jason and Saeed explained and discussed this piece with you and others during v0 a few weeks back at [1], [2], [3].
Please take a look.

> So in terms of ways to go I would argue this is likely better. However one
> downside is that we are going to end up seeing each subfunction being
> different from driver to driver and vendor to vendor which I would argue
> was also one of the problems with SR-IOV as you end up with a bit of vendor
> lock-in as a result of this feature since each vendor will be providing a
> different interface.
>
Several vendors have provided a unified interface for managing VFs, i.e.
(a) enable/disable was via vendor neutral sysfs
(b) the sriov capability is exposed via the standard pci capability and sysfs
(c) sriov vf config (mac, vlan, rss, tx rate, spoof check, trust) uses vendor agnostic netlink
This is even though each driver's internal implementation largely differs in how trust, spoof, mac, vlan, rate etc. are enforced.

So the subfunction feature/attribute/functionality will be implemented differently internally in each driver matching the vendor's device, for the reasonably abstract concept of a 'subfunction'.

> > A Subfunction supports eswitch representation through which it
> > supports tc offloads. User must configure eswitch to send/receive
> > packets from/to subfunction port.
> >
> > Subfunctions share PCI level resources such as PCI MSI-X IRQs with
> > their other subfunctions and/or with its parent PCI function.
> 
> This piece to the architecture for this has me somewhat concerned. If all your
> resources are shared and 
Not all resources are shared.

> you are allowing devices to be created
> incrementally you either have to pre-partition the entire function which
> usually results in limited resources for your base setup, or free resources
> from existing interfaces and redistribute them as things change. I would be
> curious which approach you are taking here? So for example if you hit a
> certain threshold will you need to reset the port and rebalance the IRQs
> between the various functions?
No. It works a bit differently for the mlx5 device.
When the base function is started, it starts as if it doesn't have any subfunctions.
When a subfunction is instantiated, it spawns new resources in the device (hw, fw, memory) depending on how much the function wants.

For example, PCI PF uses BAR 0, while subfunctions uses BAR 2.
For IRQs, a subfunction instance shares the IRQs with its parent/hosting PCI PF.
In the future, yes, dedicated IRQs per SF are likely desired.
Sridhar also talked about limiting the number of queues of a subfunction.
I believe there will be resources/attributes of the function to be controlled.
devlink already provides a rich interface to achieve that using devlink resources [8].

[..]

> > $ ip link show
> > 127: ens2f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state
> DOWN mode DEFAULT group default qlen 1000
> >     link/ether 24:8a:07:b3:d1:12 brd ff:ff:ff:ff:ff:ff
> >     altname enp6s0f0np0
> > 129: p0sf88: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
> mode DEFAULT group default qlen 1000
> >     link/ether 00:00:00:00:88:88 brd ff:ff:ff:ff:ff:ff>
> 
> I assume that p0sf88 is supposed to be the newly created subfunction.
> However I thought the naming was supposed to be the same as what you are
> referring to in the devlink, or did I miss something?
>
I believe you are confusing the representor netdevice of the subfunction with the devices of the subfunction (netdev, rdma, vdpa etc.).
I suggest referring to the diagram in patch 15 in [7] to see the stack, modules and objects.
Hope the description below clarifies a bit.
There are two netdevices.
(a) representor netdevice, attached to the devlink port of the eswitch
(b) netdevice of the SF used by the end application (in your example, this is assigned to container).
 
Both netdevices obviously follow different naming schemes.
The representor netdevice follows the naming scheme well defined in the kernel + systemd/udev v245 and higher.
It is based on the phys_port_name sysfs attribute.
This is the same as for the existing PF and SF representors, which have existed for a year+ now, and is further used by subfunctions.

For the subfunction netdevice (p0sf88), systemd/udev will be extended. I put an example based on my few lines of udev rule that read
the phys_port_name and the user supplied sfnum, so that the user knows exactly which interface to assign to a container.

> > After use inactivate the function:
> > $ devlink port function set ens2f0npf0sf88 state inactive
> >
> > Now delete the subfunction port:
> > $ devlink port del ens2f0npf0sf88
> 
> This seems wrong to me as it breaks the symmetry with the port add
> command and
The example with the representor device is only to make life easier for the user.
The devlink port del command works based on the devlink port index, just like the existing devlink port commands (get, set, split, unsplit).
I explained this in a thread with Sridhar at [6].
In short, the devlink port del <bus/device_name>/<port_index> command is just fine.
The port index is a unique handle within the devlink instance that the user refers to in order to delete, get and set port and port function attributes post its creation.
I chose the representor netdev example because it is more intuitive to relate to, but the port index is equally fine and supported.

> assumes you have ownership of the interface in the host. I
> would much prefer to to see the same arguments that were passed to the
> add command being used to do the teardown as that would allow for the
> parent function to create the object, assign it to a container namespace, and
> not need to pull it back in order to destroy it.
The parent function will not have the same netdevice name as that of the representor netdevice, because both devices exist in a single system for a large part of the use cases.
So the port delete command works on the port index.
The host doesn't need to pull it back to destroy it; it is destroyed via the port del command.

[1] https://lore.kernel.org/netdev/20201112192424.2742-1-parav@nvidia.com/
[2] https://lore.kernel.org/netdev/421951d99a33d28b91f2b2997409d0c97fa5a98a.camel@kernel.org/
[3] https://lore.kernel.org/netdev/20201120161659.GE917484@nvidia.com/
[4] https://lore.kernel.org/netdev/20200501091449.GA25211@nanopsycho.orion/
[5] https://lore.kernel.org/netdev/20191107160448.20962-1-parav@mellanox.com/
[6] https://lore.kernel.org/netdev/BY5PR12MB43227784BB34D929CA64E315DCCA0@BY5PR12MB4322.namprd12.prod.outlook.com/
[7] https://lore.kernel.org/netdev/20201214214352.198172-16-saeed@kernel.org/T/#u
[8] https://man7.org/linux/man-pages/man8/devlink-resource.8.html


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-15  1:53 ` [net-next v4 00/15] Add mlx5 subfunction support Alexander Duyck
  2020-12-15  2:44   ` David Ahern
  2020-12-15  5:48   ` Parav Pandit
@ 2020-12-15  6:15   ` Saeed Mahameed
  2020-12-15 19:12     ` Alexander Duyck
  2 siblings, 1 reply; 65+ messages in thread
From: Saeed Mahameed @ 2020-12-15  6:15 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David S. Miller, Jakub Kicinski, Jason Gunthorpe,
	Leon Romanovsky, Netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, Ertman, David M, Dan Williams, Kiran Patil,
	Greg KH

On Mon, 2020-12-14 at 17:53 -0800, Alexander Duyck wrote:
> On Mon, Dec 14, 2020 at 1:49 PM Saeed Mahameed <saeed@kernel.org>
> wrote:
> > Hi Dave, Jakub, Jason,
> > 
> > This series form Parav was the theme of this mlx5 release cycle,
> > we've been waiting anxiously for the auxbus infrastructure to make
> > it into
> > the kernel, and now as the auxbus is in and all the stars are
> > aligned, I
> > can finally submit this V2 of the devlink and mlx5 subfunction
> > support.
> > 
> > Subfunctions came to solve the scaling issue of virtualization
> > and switchdev environments, where SRIOV failed to deliver and users
> > ran
> > out of VFs very quickly as SRIOV demands huge amount of physical
> > resources
> > in both of the servers and the NIC.
> > 
> > Subfunction provide the same functionality as SRIOV but in a very
> > lightweight manner, please see the thorough and detailed
> > documentation from Parav below, in the commit messages and the
> > Networking documentation patches at the end of this series.
> > 
> 
> Just to clarify a few things for myself. You mention virtualization
> and SR-IOV in your patch description but you cannot support direct
> assignment with this correct? The idea here is simply logical
> partitioning of an existing network interface, correct? So this isn't
> so much a solution for virtualization, but may work better for
> containers. I view this as an important distinction to make as the

At the current state yes, but the SF solution can be extended to
support direct assignment, which is why I think the SF solution can do
better and eventually replace SRIOV.
Also, many customers are currently using SRIOV with containers to get
the performance and isolation features since there were no other
options.

> first thing that came to mind when I read this was mediated devices
> which is similar, but focused only on the virtualization case:
> https://www.kernel.org/doc/html/v5.9/driver-api/vfio-mediated-device.html
> 
> > Parav Pandit Says:
> > =================
> > 
> > This patchset introduces support for mlx5 subfunction (SF).
> > 
> > A subfunction is a lightweight function that has a parent PCI
> > function on
> > which it is deployed. mlx5 subfunction has its own function
> > capabilities
> > and its own resources. This means a subfunction has its own
> > dedicated
> > queues(txq, rxq, cq, eq). These queues are neither shared nor
> > stealed from
> > the parent PCI function.
> 
> Rather than calling this a subfunction, would it make more sense to
> call it something such as a queue set? It seems like this is exposing
> some of the same functionality we did in the Intel drivers such as
> ixgbe and i40e via the macvlan offload interface. However the
> ixgbe/i40e hardware was somewhat limited in that we were only able to
> expose Ethernet interfaces via this sort of VMQ/VMDQ feature, and
> even
> with that we have seen some limitations to the interface. It sounds
> like you are able to break out RDMA capable devices this way as well.
> So in terms of ways to go I would argue this is likely better. 

We've discussed this thoroughly on V0; the SF solution is closer to a
VF than a VMDQ, this is not just a set of queues.

https://lore.kernel.org/linux-rdma/421951d99a33d28b91f2b2997409d0c97fa5a98a.camel@kernel.org/

> However
> one downside is that we are going to end up seeing each subfunction
> being different from driver to driver and vendor to vendor which I
> would argue was also one of the problems with SR-IOV as you end up
> with a bit of vendor lock-in as a result of this feature since each
> vendor will be providing a different interface.
> 

I disagree. SFs are tightly coupled with the switchdev model and devlink
function ports; they are backed by a well defined model. I can say the
same about sriov with switchdev mode: this sort of vendor lock-in issue
is eliminated when you migrate to switchdev mode.

> > When subfunction is RDMA capable, it has its own QP1, GID table and
> > rdma
> > resources neither shared nor stealed from the parent PCI function.
> > 
> > A subfunction has dedicated window in PCI BAR space that is not
> > shared
> > with ther other subfunctions or parent PCI function. This ensures
> > that all
> > class devices of the subfunction accesses only assigned PCI BAR
> > space.
> > 
> > A Subfunction supports eswitch representation through which it
> > supports tc
> > offloads. User must configure eswitch to send/receive packets
> > from/to
> > subfunction port.
> > 
> > Subfunctions share PCI level resources such as PCI MSI-X IRQs with
> > their other subfunctions and/or with its parent PCI function.
> 
> This piece to the architecture for this has me somewhat concerned. If
> all your resources are shared and you are allowing devices to be

Not all, only PCI MSI-X, for now...

> created incrementally you either have to pre-partition the entire
> function which usually results in limited resources for your base
> setup, or free resources from existing interfaces and redistribute
> them as things change. I would be curious which approach you are
> taking here? So for example if you hit a certain threshold will you
> need to reset the port and rebalance the IRQs between the various
> functions?
> 

Currently SFs will use whatever IRQs the PF has pre-allocated for
itself, so there is no IRQ limit issue at the moment. We are
considering a dynamic IRQ pool with dynamic balancing, or even better,
using the IMS approach, which perfectly fits the SF architecture.
https://patchwork.kernel.org/project/linux-pci/cover/1568338328-22458-1-git-send-email-megha.dey@linux.intel.com/

For internal resources, they are fully isolated (not shared) and
internally managed by FW exactly like a VF's internal resources.




^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-15  2:44   ` David Ahern
@ 2020-12-15 16:16     ` Alexander Duyck
  2020-12-15 16:59       ` Parav Pandit
  0 siblings, 1 reply; 65+ messages in thread
From: Alexander Duyck @ 2020-12-15 16:16 UTC (permalink / raw)
  To: David Ahern
  Cc: Saeed Mahameed, David S. Miller, Jakub Kicinski, Jason Gunthorpe,
	Leon Romanovsky, Netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, Ertman, David M, Dan Williams, Kiran Patil,
	Greg KH

On Mon, Dec 14, 2020 at 6:44 PM David Ahern <dsahern@gmail.com> wrote:
>
> On 12/14/20 6:53 PM, Alexander Duyck wrote:
> >> example subfunction usage sequence:
> >> -----------------------------------
> >> Change device to switchdev mode:
> >> $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
> >>
> >> Add a devlink port of subfunction flaovur:
> >> $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
> >
> > Typo in your description. Also I don't know if you want to stick with
> > "flavour" or just shorten it to the U.S. spelling which is "flavor".
>
> The term exists in devlink today (since 2018). When support was added to
> iproute2 I decided there was no reason to require the US spelling over
> the British spelling, so I accepted the patch.

Okay. The only reason why I noticed is because "flaovur" is definitely
a wrong spelling. If it is already in the interface then no need to
change it.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-15 16:16     ` Alexander Duyck
@ 2020-12-15 16:59       ` Parav Pandit
  0 siblings, 0 replies; 65+ messages in thread
From: Parav Pandit @ 2020-12-15 16:59 UTC (permalink / raw)
  To: Alexander Duyck, David Ahern
  Cc: Saeed Mahameed, David S. Miller, Jakub Kicinski, Jason Gunthorpe,
	Leon Romanovsky, Netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, Ertman, David M, Dan Williams, Kiran Patil,
	Greg KH



> From: Alexander Duyck <alexander.duyck@gmail.com>
> Sent: Tuesday, December 15, 2020 9:47 PM
> 
> On Mon, Dec 14, 2020 at 6:44 PM David Ahern <dsahern@gmail.com> wrote:
> >
> > On 12/14/20 6:53 PM, Alexander Duyck wrote:
> > >> example subfunction usage sequence:
> > >> -----------------------------------
> > >> Change device to switchdev mode:
> > >> $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
> > >>
> > >> Add a devlink port of subfunction flaovur:
> > >> $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
> > >
> > > Typo in your description. Also I don't know if you want to stick
> > > with "flavour" or just shorten it to the U.S. spelling which is "flavor".
> >
> > The term exists in devlink today (since 2018). When support was added
> > to
> > iproute2 I decided there was no reason to require the US spelling over
> > the British spelling, so I accepted the patch.
> 
> Okay. The only reason why I noticed is because "flaovur" is definitely a wrong
> spelling. If it is already in the interface then no need to change it.
I am used to writing "flavor" and I realized that I should probably say "flavour" because in devlink it is spelled that way.
So I added the 'u' and the typo placed it at the wrong location. :-)
Thanks for catching it.
Saeed sent the v5 fixing this along with a few more English corrections.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-15  5:48   ` Parav Pandit
@ 2020-12-15 18:47     ` Alexander Duyck
  2020-12-15 20:05       ` Saeed Mahameed
                         ` (2 more replies)
  2020-12-15 20:59     ` David Ahern
  1 sibling, 3 replies; 65+ messages in thread
From: Alexander Duyck @ 2020-12-15 18:47 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Saeed Mahameed, David S. Miller, Jakub Kicinski, Jason Gunthorpe,
	Leon Romanovsky, Netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, Ertman, David M, Dan Williams, Kiran Patil,
	Greg KH

On Mon, Dec 14, 2020 at 9:48 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Alexander Duyck <alexander.duyck@gmail.com>
> > Sent: Tuesday, December 15, 2020 7:24 AM
> >
> > On Mon, Dec 14, 2020 at 1:49 PM Saeed Mahameed <saeed@kernel.org>
> > wrote:
> > >
> > > Hi Dave, Jakub, Jason,
> > >
> >
> > Just to clarify a few things for myself. You mention virtualization and SR-IOV
> > in your patch description but you cannot support direct assignment with this
> > correct?
> Correct. it cannot be directly assigned.
>
> > The idea here is simply logical partitioning of an existing network
> > interface, correct?
> No. Idea is to spawn multiple functions from a single PCI device.
> These functions are not born in PCI device and in OS until they are created by user.

That is the definition of logical partitioning. You are essentially
taking one physical PCIe function and splitting up the resources over
multiple logical devices. With something like an MFD driver you would
partition the device as soon as the driver loads, but with this you
are peeling off resources and defining the devices on demand.

> Jason and Saeed explained this in great detail few weeks back in v0 version of the patchset at [1], [2] and [3].
> I better not repeat all of it here again. Please go through it.
> If you may want to read precursor to it, RFC from Jiri at [4] is also explains this in great detail.

I think I have a pretty good idea of how the feature works. My concern
is more the use of marketing speak versus actual functionality. The
way this is being set up, it sounds like it is useful for
virtualization, and it is not, at least in its current state. It may
be at some point in the future, but I worry that it is really going to
muddy the waters as we end up with yet another way to partition
devices.

> > So this isn't so much a solution for virtualization, but may
> > work better for containers. I view this as an important distinction to make as
> > the first thing that came to mind when I read this was mediated devices
> > which is similar, but focused only on the virtualization case:
> > https://www.kernel.org/doc/html/v5.9/driver-api/vfio-mediated-
> > device.html
> >
> Managing subfunction using mediated device is already ruled out last year at [5] as it is the abuse of the mdev bus for this purpose + has severe limitations of managing the subfunction device.

I agree with you on that. My thought was more the fact that the two
can be easily confused. If we are going to do this, we need to define
that, for networking devices, using the mdev interface would be
deprecated and we would need to go through devlink. However, before we
do that we need to make sure we have this completely standardized.

> We are not going back to it anymore.
> It will be duplicating lot of the plumbing which exists in devlink, netlink, auxiliary bus and more.

That is kind of my point. It is already in the kernel. What you are
adding is the stuff that is duplicating it. I'm assuming that in order
to be able to virtualize your interfaces in the future you are going
to have to make use of the same vfio plumbing that the mediated
devices do.

> > Rather than calling this a subfunction, would it make more sense to call it
> > something such as a queue set?
> No, queue is just one way to send and receive data/packets.
> Jason and Saeed explained and discussed  this piece to you and others during v0 few weeks back at [1], [2], [3].
> Please take a look.

Yeah, I recall that. However I feel like it is being oversold. It
isn't "SR-IOV done right"; it seems more like "VMDq done better". The
fact that interrupts are shared between the subfunctions is telling.
That is exactly how things work for Intel parts when they do VMDq as
well. The queues are split up into pools and a block of queues belongs
to a specific VM. From what I can tell the only difference is that
there is isolation of the pool into specific pages in the BAR, which
is essentially a requirement for mediated devices so that they can be
direct assigned.

> > So in terms of ways to go I would argue this is likely better. However one
> > downside is that we are going to end up seeing each subfunction being
> > different from driver to driver and vendor to vendor which I would argue
> > was also one of the problems with SR-IOV as you end up with a bit of vendor
> > lock-in as a result of this feature since each vendor will be providing a
> > different interface.
> >
> Each and several vendors provided unified interface for managing VFs. i.e.
> (a) enable/disable was via vendor neutral sysfs
> (b) sriov capability exposed via standard pci capability and sysfs
> (c) sriov vf config (mac, vlan, rss, tx rate, spoof check trust) are using vendor agnostic netlink
> Even though the driver's internal implementation largely differs on how trust, spoof, mac, vlan rate etc are enforced.
>
> So subfunction feature/attribute/functionality will be implemented differently internally in the driver matching vendor's device, for reasonably abstract concept of 'subfunction'.

I think you are missing the point. The biggest issue with SR-IOV
adoption was the fact that the drivers all created different VF
interfaces and those interfaces didn't support migration. Those were
the two biggest drawbacks for SR-IOV. I don't really see this approach
resolving either of those, and so that is one of the reasons why I say
this is closer to "VMDq done better" rather than "SR-IOV done right".
Assuming at some point one of the flavours is a virtio-net style
interface, you could eventually get to the point of something similar
to what seems to have been the goal of mdev, which was meant to
address these two points.

> > > A Subfunction supports eswitch representation through which it
> > > supports tc offloads. User must configure eswitch to send/receive
> > > packets from/to subfunction port.
> > >
> > > Subfunctions share PCI level resources such as PCI MSI-X IRQs with
> > > their other subfunctions and/or with its parent PCI function.
> >
> > This piece to the architecture for this has me somewhat concerned. If all your
> > resources are shared and
> All resources are not shared.

Just to clarify, when I say "shared" I mean that they are all coming
from the same function. So if for example I were to direct-assign the
PF then all the resources for the subfunctions would go with it.

> > you are allowing devices to be created
> > incrementally you either have to pre-partition the entire function which
> > usually results in limited resources for your base setup, or free resources
> > from existing interfaces and redistribute them as things change. I would be
> > curious which approach you are taking here? So for example if you hit a
> > certain threshold will you need to reset the port and rebalance the IRQs
> > between the various functions?
> No. Its works bit differently for mlx5 device.
> When base function is started, it started as if it doesn't have any subfunctions.
> When subfunction is instantiated, it spawns new resources in device (hw, fw, memory) depending on how much a function wants.
>
> For example, PCI PF uses BAR 0, while subfunctions uses BAR 2.

In the grand scheme BAR doesn't really matter much. The assumption
here is that resources are page aligned so that you can map the pages
into a guest eventually.

> For IRQs, subfunction instance shares the IRQ with its parent/hosting PCI PF.
> In future, yes, a dedicated IRQs per SF is likely desired.
> Sridhar also talked about limiting number of queues to a subfunction.
> I believe there will be resources/attributes of the function to be controlled.
> devlink already provides rich interface to achieve that using devlink resources [8].
>
> [..]

So it sounds like the device firmware is pre-partitioning the
resources and splitting them up between the PCI PF and your
subfunctions. Although in your case it sounds like there are
significantly more resources than you might find in an ixgbe interface
for instance. :)

The point is that we should probably define some sort of standard
and/or expectations on what should happen when you spawn a new
interface. Would it be acceptable for the PF and existing subfunctions
to have to reset if you need to rebalance the IRQ distribution, or
should they not be disrupted when you spawn a new interface?

One of the things I prefer about the mediated device setup is the fact
that it has essentially pre-partitioned things beforehand and you know
how many of each type of interface you can spawn. Is there any
information like that provided by this interface?
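
Something along the lines of devlink's existing resource interface is
what I have in mind, e.g. (the resource path here is purely
hypothetical):

$ devlink resource show pci/0000:06:00.0
$ devlink resource set pci/0000:06:00.0 path /max_sfs size 256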

> > > $ ip link show
> > > 127: ens2f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state
> > DOWN mode DEFAULT group default qlen 1000
> > >     link/ether 24:8a:07:b3:d1:12 brd ff:ff:ff:ff:ff:ff
> > >     altname enp6s0f0np0
> > > 129: p0sf88: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
> > mode DEFAULT group default qlen 1000
> > >     link/ether 00:00:00:00:88:88 brd ff:ff:ff:ff:ff:ff>
> >
> > I assume that p0sf88 is supposed to be the newly created subfunction.
> > However I thought the naming was supposed to be the same as what you are
> > referring to in the devlink, or did I miss something?
> >
> I believe you are confused with the representor netdevice of subfuction with devices of subfunction. (netdev, rdma, vdpa etc).
> I suggest that please refer to the diagram in patch_15 in [7] to see the stack, modules, objects.
> Hope below description clarifies a bit.
> There are two netdevices.
> (a) representor netdevice, attached to the devlink port of the eswitch
> (b) netdevice of the SF used by the end application (in your example, this is assigned to container).

Sorry, that wasn't clear from your example. So in this case you
started in a namespace and the new device you created via devlink was
spawned in the root namespace?

> Both netdevice follow obviously a different naming scheme.
> Representor netdevice follows naming scheme well defined in kernel + systemd/udev v245 and higher.
> It is based on phys_port_name sysfs attribute.
> This is same for existing PF and SF representors exist for year+ now. Further used by subfunction.
>
> For subfunction netdevice (p0s88), system/udev will be extended. I put example based on my few lines of udev rule that reads
> phys_port_name and user supplied sfnum, so that user exactly knows which interface to assign to container.

Admittedly I have been out of the loop for the last couple years since
I had switched over to memory management work for a while. It would be
useful to include something that shows your created network interface
in the example in addition to your switchdev port.

> > > After use inactivate the function:
> > > $ devlink port function set ens2f0npf0sf88 state inactive
> > >
> > > Now delete the subfunction port:
> > > $ devlink port del ens2f0npf0sf88
> >
> > This seems wrong to me as it breaks the symmetry with the port add
> > command and
> Example of the representor device is only to make life easier for the user.
> Devlink port del command works based on the devlink port index, just like existing devlink port commands (get,set,split,unsplit).
> I explained this in a thread with Sridhar at [6].
> In short devlink port del <bus/device_name/port_index command is just fine.
> Port index is unique handle for the devlink instance that user refers to delete, get, set port and port function attributes post its creation.
> I choose the representor netdev example because it is more intuitive to related to, but port index is equally fine and supported.

Okay then, that addresses my concern. I just wanted to make sure we
weren't in some situation where you had to have the interface in order
to remove it.
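
In other words, an index-only teardown would look something like this
(the port index value is illustrative):

$ devlink port show
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf pfnum 0 sfnum 88
$ devlink port del pci/0000:06:00.0/32768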

> > assumes you have ownership of the interface in the host. I
> > would much prefer to to see the same arguments that were passed to the
> > add command being used to do the teardown as that would allow for the
> > parent function to create the object, assign it to a container namespace, and
> > not need to pull it back in order to destroy it.
> Parent function will not have same netdevice name as that of representor netdevice, because both devices exist in single system for large part of the use cases.
> So port delete command works on the port index.
> Host doesn't need to pull it back to destroy it. It is destroyed via port del command.
>
> [1] https://lore.kernel.org/netdev/20201112192424.2742-1-parav@nvidia.com/
> [2] https://lore.kernel.org/netdev/421951d99a33d28b91f2b2997409d0c97fa5a98a.camel@kernel.org/
> [3] https://lore.kernel.org/netdev/20201120161659.GE917484@nvidia.com/
> [4] https://lore.kernel.org/netdev/20200501091449.GA25211@nanopsycho.orion/
> [5] https://lore.kernel.org/netdev/20191107160448.20962-1-parav@mellanox.com/
> [6] https://lore.kernel.org/netdev/BY5PR12MB43227784BB34D929CA64E315DCCA0@BY5PR12MB4322.namprd12.prod.outlook.com/
> [7] https://lore.kernel.org/netdev/20201214214352.198172-16-saeed@kernel.org/T/#u
> [8] https://man7.org/linux/man-pages/man8/devlink-resource.8.html
>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-15  6:15   ` Saeed Mahameed
@ 2020-12-15 19:12     ` Alexander Duyck
  2020-12-15 20:35       ` Saeed Mahameed
  0 siblings, 1 reply; 65+ messages in thread
From: Alexander Duyck @ 2020-12-15 19:12 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David S. Miller, Jakub Kicinski, Jason Gunthorpe,
	Leon Romanovsky, Netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, Ertman, David M, Dan Williams, Kiran Patil,
	Greg KH

On Mon, Dec 14, 2020 at 10:15 PM Saeed Mahameed <saeed@kernel.org> wrote:
>
> On Mon, 2020-12-14 at 17:53 -0800, Alexander Duyck wrote:
> > On Mon, Dec 14, 2020 at 1:49 PM Saeed Mahameed <saeed@kernel.org>
> > wrote:
> > > Hi Dave, Jakub, Jason,
> > >
> > > This series form Parav was the theme of this mlx5 release cycle,
> > > we've been waiting anxiously for the auxbus infrastructure to make
> > > it into
> > > the kernel, and now as the auxbus is in and all the stars are
> > > aligned, I
> > > can finally submit this V2 of the devlink and mlx5 subfunction
> > > support.
> > >
> > > Subfunctions came to solve the scaling issue of virtualization
> > > and switchdev environments, where SRIOV failed to deliver and users
> > > ran
> > > out of VFs very quickly as SRIOV demands huge amount of physical
> > > resources
> > > in both of the servers and the NIC.
> > >
> > > Subfunction provide the same functionality as SRIOV but in a very
> > > lightweight manner, please see the thorough and detailed
> > > documentation from Parav below, in the commit messages and the
> > > Networking documentation patches at the end of this series.
> > >
> >
> > Just to clarify a few things for myself. You mention virtualization
> > and SR-IOV in your patch description but you cannot support direct
> > assignment with this correct? The idea here is simply logical
> > partitioning of an existing network interface, correct? So this isn't
> > so much a solution for virtualization, but may work better for
> > containers. I view this as an important distinction to make as the
>
At the current state, yes, but the SF solution can be extended to
support direct assignment, so this is why I think the SF solution can
do better and eventually replace SRIOV.

My only real concern is that this and mediated devices are essentially
the same thing. When you start making this work for direct assignment
the only real difference becomes the switchdev and devlink interfaces.
Basically this is a netdev-specific mdev versus the PCIe-specific
mdev.

Also, many customers are currently using SRIOV with containers to get
the performance and isolation features, since there were no other
options.

There were, but you hadn't implemented them. The fact is the approach
Intel had taken for that was offloaded macvlan.
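
To illustrate what I mean by generic, the whole thing was driven
through standard netdev tooling, something like (device and namespace
names illustrative):

$ ethtool -K eth0 l2-fwd-offload on
$ ip link add link eth0 name mv0 type macvlan mode bridge
$ ip link set mv0 netns container1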

I think the big thing we really should do if we are going to go this
route is to look at standardizing what the flavours are that get
created by the parent netdevice. Otherwise we are just creating the
same mess we had with SRIOV all over again and muddying the waters of
mediated devices.

> > first thing that came to mind when I read this was mediated devices
> > which is similar, but focused only on the virtualization case:
> > https://www.kernel.org/doc/html/v5.9/driver-api/vfio-mediated-device.html
> >
> > > Parav Pandit Says:
> > > =================
> > >
> > > This patchset introduces support for mlx5 subfunction (SF).
> > >
> > > A subfunction is a lightweight function that has a parent PCI
> > > function on
> > > which it is deployed. mlx5 subfunction has its own function
> > > capabilities
> > > and its own resources. This means a subfunction has its own
> > > dedicated
> > > queues(txq, rxq, cq, eq). These queues are neither shared nor
> > > stealed from
> > > the parent PCI function.
> >
> > Rather than calling this a subfunction, would it make more sense to
> > call it something such as a queue set? It seems like this is exposing
> > some of the same functionality we did in the Intel drivers such as
> > ixgbe and i40e via the macvlan offload interface. However the
> > ixgbe/i40e hardware was somewhat limited in that we were only able to
> > expose Ethernet interfaces via this sort of VMQ/VMDQ feature, and
> > even
> > with that we have seen some limitations to the interface. It sounds
> > like you are able to break out RDMA capable devices this way as well.
> > So in terms of ways to go I would argue this is likely better.
>
> We've discussed this thoroughly on V0, the SF solutions is closer to a
> VF than a VMDQ, this is not just a set of queues.
>
> https://lore.kernel.org/linux-rdma/421951d99a33d28b91f2b2997409d0c97fa5a98a.camel@kernel.org/

VMDq is more than just a set of queues. The fact is it is a pool of
resources that get created to handle the requests for a specific VM.
The extra bits that are added here are essentially stuff that was
required to support mediated devices.

> > However
> > one downside is that we are going to end up seeing each subfunction
> > being different from driver to driver and vendor to vendor which I
> > would argue was also one of the problems with SR-IOV as you end up
> > with a bit of vendor lock-in as a result of this feature since each
> > vendor will be providing a different interface.
> >
>
> I disagree, SFs are tightly coupled with switchdev model and devlink
> functions port, they are backed with the a well defined model, i can
> say the same about sriov with switchdev mode, this sort of vendor lock-
> in issues is eliminated when you migrate to switchdev mode.

What you are talking about is the backend. I am talking about what is
exposed to the user. The user is going to see a Mellanox device having
to be placed into their container in order to support this. One of the
advantages of the Intel approach was that the macvlan interface was
generic so you could have an offloaded interface or not and the user
wouldn't necessarily know. The offload could be disabled and the user
would be none the wiser as it is moved from one interface to another.
I see that as a big thing that is missing in this solution.

> > > When subfunction is RDMA capable, it has its own QP1, GID table and
> > > rdma
> > > resources neither shared nor stealed from the parent PCI function.
> > >
> > > A subfunction has dedicated window in PCI BAR space that is not
> > > shared
> > > with ther other subfunctions or parent PCI function. This ensures
> > > that all
> > > class devices of the subfunction accesses only assigned PCI BAR
> > > space.
> > >
> > > A Subfunction supports eswitch representation through which it
> > > supports tc
> > > offloads. User must configure eswitch to send/receive packets
> > > from/to
> > > subfunction port.
> > >
> > > Subfunctions share PCI level resources such as PCI MSI-X IRQs with
> > > their other subfunctions and/or with its parent PCI function.
> >
> > This piece to the architecture for this has me somewhat concerned. If
> > all your resources are shared and you are allowing devices to be
>
> not all, only PCI MSIX, for now..

They aren't shared after you partition them, but they are coming from
the same device. Basically you are subdividing the BAR2 in order to
generate the subfunctions. BAR2 is a shared resource from my point of
view.

> > created incrementally you either have to pre-partition the entire
> > function which usually results in limited resources for your base
> > setup, or free resources from existing interfaces and redistribute
> > them as things change. I would be curious which approach you are
> > taking here? So for example if you hit a certain threshold will you
> > need to reset the port and rebalance the IRQs between the various
> > functions?
> >
>
> Currently SFs will use whatever IRQs the PF has pre-allocated for
> itself, so there is no IRQ limit issue at the moment, we are
> considering a dynamic IRQ pool with dynamic balancing, or even better
> us the IMS approach, which perfectly fits the SF architecture.
> https://patchwork.kernel.org/project/linux-pci/cover/1568338328-22458-1-git-send-email-megha.dey@linux.intel.com/

When you say you are using the PF's interrupts, are you just using
that as a pool of resources, or is the interrupt processing interrupts
for both the PF and SFs? Without IMS you are limited to 2048
interrupts. Moving over to that would make sense, since SF is similar
to mdev in the way it partitions up the device and resources.

> for internal resources the are fully isolated (not shared) and
> they are internally managed by FW exactly like a VF internal resources.

I assume by isolated you mean they are contained within page aligned
blocks like what was required for mdev?

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-15 18:47     ` Alexander Duyck
@ 2020-12-15 20:05       ` Saeed Mahameed
  2020-12-15 21:03       ` Jason Gunthorpe
  2020-12-16  1:12       ` Edwin Peer
  2 siblings, 0 replies; 65+ messages in thread
From: Saeed Mahameed @ 2020-12-15 20:05 UTC (permalink / raw)
  To: Alexander Duyck, Parav Pandit
  Cc: David S. Miller, Jakub Kicinski, Jason Gunthorpe,
	Leon Romanovsky, Netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, Ertman, David M, Dan Williams, Kiran Patil,
	Greg KH

On Tue, 2020-12-15 at 10:47 -0800, Alexander Duyck wrote:
> On Mon, Dec 14, 2020 at 9:48 PM Parav Pandit <parav@nvidia.com>
> wrote:
> > 
> > > From: Alexander Duyck <alexander.duyck@gmail.com>
> > > Sent: Tuesday, December 15, 2020 7:24 AM
> > > 
> > > On Mon, Dec 14, 2020 at 1:49 PM Saeed Mahameed <saeed@kernel.org>
> > > wrote:
> > > > Hi Dave, Jakub, Jason,
> > > > 
> > > 
> > > Just to clarify a few things for myself. You mention
> > > virtualization and SR-IOV
> > > in your patch description but you cannot support direct
> > > assignment with this
> > > correct?
> > Correct. it cannot be directly assigned.
> > 
> > > The idea here is simply logical partitioning of an existing
> > > network
> > > interface, correct?
> > No. Idea is to spawn multiple functions from a single PCI device.
> > These functions are not born in PCI device and in OS until they are
> > created by user.
> 
> That is the definition of logical partitioning. You are essentially
> taking one physical PCIe function and splitting up the resources over
> multiple logical devices. With something like an MFD driver you would
> partition the device as soon as the driver loads, but with this you
> are peeling our resources and defining the devices on demand.
> 

Sure, same for SRIOV and same for VMDq: they are all logical
partitioning and they all share the same resources of the system.
Our point here is that the SF mechanisms are more similar to SRIOV
than to VMDq; other than sharing the MSI-X vectors and slicing up the
BARs, everything else works exactly like SRIOV.

> > Jason and Saeed explained this in great detail few weeks back in v0
> > version of the patchset at [1], [2] and [3].
> > I better not repeat all of it here again. Please go through it.
> > If you may want to read precursor to it, RFC from Jiri at [4] is
> > also explains this in great detail.
> 
> I think I have a pretty good idea of how the feature works. My
> concern
> is more the use of marketing speak versus actual functionality. The
> way this is being setup it sounds like it is useful for
> virtualization
> and it is not, at least in its current state. It may be at some point
> in the future but I worry that it is really going to muddy the waters
> as we end up with yet another way to partition devices.
> 

OK, maybe we have different views on the feature and use cases. This
is useful for virtualization for various reasons, take vdpa for
instance, but anyway, we can improve the documentation to address your
concerns, and at some point in the future when we add the direct
assignment we can add the necessary documentation.


> > > So this isn't so much a solution for virtualization, but may
> > > work better for containers. I view this as an important
> > > distinction to make as
> > > the first thing that came to mind when I read this was mediated
> > > devices
> > > which is similar, but focused only on the virtualization case:
> > > https://www.kernel.org/doc/html/v5.9/driver-api/vfio-mediated-
> > > device.html
> > > 
> > Managing subfunction using medicated device is already ruled out
> > last year at [5] as it is the abuse of the mdev bus for this
> > purpose + has severe limitations of managing the subfunction
> > device.
> 
> I agree with you on that. My thought was more the fact that the two
> can be easily confused. If we are going to do this we need to define
> that for networking devices perhaps that using the mdev interface
> would be deprecated and we would need to go through devlink. However
> before we do that we need to make sure we have this completely
> standardized.
> 

Then let's keep this discussion for later, when we add the direct
assignment support :)

> > We are not going back to it anymore.
> > It will be duplicating lot of the plumbing which exists in devlink,
> > netlink, auxiliary bus and more.
> 
> That is kind of my point. It is already in the kernel. What you are
> adding is the stuff that is duplicating it. I'm assuming that in
> order
> to be able to virtualize your interfaces in the future you are going
> to have to make use of the same vfio plumbing that the mediated
> devices do.
> 
> > > Rather than calling this a subfunction, would it make more sense
> > > to call it
> > > something such as a queue set?
> > No, queue is just one way to send and receive data/packets.
> > Jason and Saeed explained and discussed  this piece to you and
> > others during v0 few weeks back at [1], [2], [3].
> > Please take a look.
> 
> Yeah, I recall that. However I feel like it is being oversold. It
> isn't "SR-IOV done right" it seems more like "VMDq done better". The

OK, then we will improve the documentation to not oversell this as
"SRIOV done right".

> fact that interrupts are shared between the subfunctions is telling.
> That is exactly how things work for Intel parts when they do VMDq as
> well. The queues are split up into pools and a block of queues
> belongs
> to a specific queue. From what I can can tell the only difference is
> that there is isolation of the pool into specific pages in the BAR.
> Which is essentially a requirement for mediated devices so that they
> can be direct assigned.
> 

I disagree, this is very dissimilar to VMDq.
An SF is a VF without a unique PCI function and BAR; the BAR is split
up to give the "SF" access to its own partition in the HW, and then it
will access the HW exactly like a VF would do. There are no "pools"
involved here, nor any PF resource/queue management; from our HW
architecture perspective there is no difference between SF and VF..
really.

> > > So in terms of ways to go I would argue this is likely better.
> > > However one
> > > downside is that we are going to end up seeing each subfunction
> > > being
> > > different from driver to driver and vendor to vendor which I
> > > would argue
> > > was also one of the problems with SR-IOV as you end up with a bit
> > > of vendor
> > > lock-in as a result of this feature since each vendor will be
> > > providing a
> > > different interface.
> > > 
> > Each and several vendors provided unified interface for managing
> > VFs. i.e.
> > (a) enable/disable was via vendor neutral sysfs
> > (b) sriov capability exposed via standard pci capability and sysfs
> > (c) sriov vf config (mac, vlan, rss, tx rate, spoof check trust)
> > are using vendor agnostic netlink
> > Even though the driver's internal implementation largely differs on
> > how trust, spoof, mac, vlan rate etc are enforced.
> > 
> > So subfunction feature/attribute/functionality will be implemented
> > differently internally in the driver matching vendor's device, for
> > reasonably abstract concept of 'subfunction'.
> 
> I think you are missing the point. The biggest issue with SR-IOV
> adoption was the fact that the drivers all created different VF
> interfaces and those interfaces didn't support migration. That was
> the
> two biggest drawbacks for SR-IOV. I don't really see this approach
> resolving either of those and so that is one of the reasons why I say
> this is closer to "VMDq done better"  rather than "SR-IOV done
> right".
> Assuming at some point one of the flavours is a virtio-net style
> interface you could eventually get to the point of something similar
> to what seems to have been the goal of mdev which was meant to
> address
> these two points.
> 

Will improving the documentation address these concerns?
But to be clear, it is much easier to solve SRIOV live migration when
SFs are involved.

> > > > A Subfunction supports eswitch representation through which it
> > > > supports tc offloads. User must configure eswitch to
> > > > send/receive
> > > > packets from/to subfunction port.
> > > > 
> > > > Subfunctions share PCI level resources such as PCI MSI-X IRQs
> > > > with
> > > > their other subfunctions and/or with its parent PCI function.
> > > 
> > > This piece to the architecture for this has me somewhat
> > > concerned. If all your
> > > resources are shared and
> > All resources are not shared.
> 
> Just to clarify, when I say "shared" I mean that they are all coming
> from the same function. So if for example I were to direct-assign the
> PF then all the resources for the subfunctions would go with it.
> 
> > > you are allowing devices to be created
> > > incrementally you either have to pre-partition the entire
> > > function which
> > > usually results in limited resources for your base setup, or free
> > > resources
> > > from existing interfaces and redistribute them as things change.
> > > I would be
> > > curious which approach you are taking here? So for example if you
> > > hit a
> > > certain threshold will you need to reset the port and rebalance
> > > the IRQs
> > > between the various functions?
> > No. Its works bit differently for mlx5 device.
> > When base function is started, it started as if it doesn't have any
> > subfunctions.
> > When subfunction is instantiated, it spawns new resources in device
> > (hw, fw, memory) depending on how much a function wants.
> > 
> > For example, PCI PF uses BAR 0, while subfunctions uses BAR 2.
> 
> In the grand scheme BAR doesn't really matter much. The assumption
> here is that resources are page aligned so that you can map the pages
> into a guest eventually.
> 
> > For IRQs, subfunction instance shares the IRQ with its
> > parent/hosting PCI PF.
> > In future, yes, a dedicated IRQs per SF is likely desired.
> > Sridhar also talked about limiting number of queues to a
> > subfunction.
> > I believe there will be resources/attributes of the function to be
> > controlled.
> > devlink already provides rich interface to achieve that using
> > devlink resources [8].
> > 
> > [..]
> 
> So it sounds like the device firmware is pre-partitioining the
> resources and splitting them up between the PCI PF and your
> subfunctions. Although in your case it sounds like there are
> significantly more resources than you might find in an ixgbe
> interface
> for instance. :)
> 
> The point is that we should probably define some sort of standard
> and/or expectations on what should happen when you spawn a new
> interface. Would it be acceptable for the PF and existing
> subfunctions
> to have to reset if you need to rebalance the IRQ distribution, or
> should they not be disrupted when you spawn a new interface?
> 
> One of the things I prefer about the mediated device setup is the
> fact
> that it has essentially pre-partitioned things beforehand and you
> know
> how many of each type of interface you can spawn. Is there any
> information like that provided by this interface?
> 
> > > > $ ip link show
> > > > 127: ens2f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state
> > > DOWN mode DEFAULT group default qlen 1000
> > > >     link/ether 24:8a:07:b3:d1:12 brd ff:ff:ff:ff:ff:ff
> > > >     altname enp6s0f0np0
> > > > 129: p0sf88: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state
> > > > DOWN
> > > mode DEFAULT group default qlen 1000
> > > >     link/ether 00:00:00:00:88:88 brd ff:ff:ff:ff:ff:ff>
> > > 
> > > I assume that p0sf88 is supposed to be the newly created
> > > subfunction.
> > > However I thought the naming was supposed to be the same as what
> > > you are
> > > referring to in the devlink, or did I miss something?
> > > 
> > I believe you are confused with the representor netdevice of
> > subfuction with devices of subfunction. (netdev, rdma, vdpa etc).
> > I suggest that please refer to the diagram in patch_15 in [7] to
> > see the stack, modules, objects.
> > Hope below description clarifies a bit.
> > There are two netdevices.
> > (a) representor netdevice, attached to the devlink port of the
> > eswitch
> > (b) netdevice of the SF used by the end application (in your
> > example, this is assigned to container).
> 
> Sorry, that wasn't clear from your example. So in this case you
> started in a namespace and the new device you created via devlink was
> spawned in the root namespace?
> 
> > Both netdevice follow obviously a different naming scheme.
> > Representor netdevice follows naming scheme well defined in kernel
> > + systemd/udev v245 and higher.
> > It is based on phys_port_name sysfs attribute.
> > This is same for existing PF and SF representors exist for year+
> > now. Further used by subfunction.
> > 
> > For subfunction netdevice (p0s88), system/udev will be extended. I
> > put example based on my few lines of udev rule that reads
> > phys_port_name and user supplied sfnum, so that user exactly knows
> > which interface to assign to container.
> 
> Admittedly I have been out of the loop for the last couple years
> since
> I had switched over to memory management work for a while. It would
> be
> useful to include something that shows your created network interface
> in the example in addition to your switchdev port.
> 
> > > > After use inactivate the function:
> > > > $ devlink port function set ens2f0npf0sf88 state inactive
> > > > 
> > > > Now delete the subfunction port:
> > > > $ devlink port del ens2f0npf0sf88
> > > 
> > > This seems wrong to me as it breaks the symmetry with the port
> > > add
> > > command and
> > Example of the representor device is only to make life easier for
> > the user.
> > Devlink port del command works based on the devlink port index,
> > just like existing devlink port commands (get,set,split,unsplit).
> > I explained this in a thread with Sridhar at [6].
> > In short devlink port del <bus/device_name/port_index command is
> > just fine.
> > Port index is unique handle for the devlink instance that user
> > refers to delete, get, set port and port function attributes post
> > its creation.
> > I choose the representor netdev example because it is more
> > intuitive to related to, but port index is equally fine and
> > supported.
> 
> Okay then, that addresses my concern. I just wanted to make sure we
> weren't in some situation where you had to have the interface in
> order
> to remove it.
> 
> > > assumes you have ownership of the interface in the host. I
> > > would much prefer to to see the same arguments that were passed
> > > to the
> > > add command being used to do the teardown as that would allow for
> > > the
> > > parent function to create the object, assign it to a container
> > > namespace, and
> > > not need to pull it back in order to destroy it.
> > Parent function will not have same netdevice name as that of
> > representor netdevice, because both devices exist in single system
> > for large part of the use cases.
> > So port delete command works on the port index.
> > Host doesn't need to pull it back to destroy it. It is destroyed
> > via port del command.
> > 
> > [1] 
> > https://lore.kernel.org/netdev/20201112192424.2742-1-parav@nvidia.com/
> > [2] 
> > https://lore.kernel.org/netdev/421951d99a33d28b91f2b2997409d0c97fa5a98a.camel@kernel.org/
> > [3] 
> > https://lore.kernel.org/netdev/20201120161659.GE917484@nvidia.com/
> > [4] 
> > https://lore.kernel.org/netdev/20200501091449.GA25211@nanopsycho.orion/
> > [5] 
> > https://lore.kernel.org/netdev/20191107160448.20962-1-parav@mellanox.com/
> > [6] 
> > https://lore.kernel.org/netdev/BY5PR12MB43227784BB34D929CA64E315DCCA0@BY5PR12MB4322.namprd12.prod.outlook.com/
> > [7] 
> > https://lore.kernel.org/netdev/20201214214352.198172-16-saeed@kernel.org/T/#u
> > [8] https://man7.org/linux/man-pages/man8/devlink-resource.8.html
> > 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-15 19:12     ` Alexander Duyck
@ 2020-12-15 20:35       ` Saeed Mahameed
  2020-12-15 21:28         ` Jakub Kicinski
  2020-12-15 21:41         ` Alexander Duyck
  0 siblings, 2 replies; 65+ messages in thread
From: Saeed Mahameed @ 2020-12-15 20:35 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David S. Miller, Jakub Kicinski, Jason Gunthorpe,
	Leon Romanovsky, Netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, Ertman, David M, Dan Williams, Kiran Patil,
	Greg KH

On Tue, 2020-12-15 at 11:12 -0800, Alexander Duyck wrote:
> On Mon, Dec 14, 2020 at 10:15 PM Saeed Mahameed <saeed@kernel.org>
> wrote:
> > On Mon, 2020-12-14 at 17:53 -0800, Alexander Duyck wrote:
> > > On Mon, Dec 14, 2020 at 1:49 PM Saeed Mahameed <saeed@kernel.org>
> > > wrote:
> > > > Hi Dave, Jakub, Jason,
> > > > 
> > > > This series form Parav was the theme of this mlx5 release
> > > > cycle,
> > > > we've been waiting anxiously for the auxbus infrastructure to
> > > > make
> > > > it into
> > > > the kernel, and now as the auxbus is in and all the stars are
> > > > aligned, I
> > > > can finally submit this V2 of the devlink and mlx5 subfunction
> > > > support.
> > > > 
> > > > Subfunctions came to solve the scaling issue of virtualization
> > > > and switchdev environments, where SRIOV failed to deliver and
> > > > users
> > > > ran
> > > > out of VFs very quickly as SRIOV demands huge amount of
> > > > physical
> > > > resources
> > > > in both of the servers and the NIC.
> > > > 
> > > > Subfunction provide the same functionality as SRIOV but in a
> > > > very
> > > > lightweight manner, please see the thorough and detailed
> > > > documentation from Parav below, in the commit messages and the
> > > > Networking documentation patches at the end of this series.
> > > > 
> > > 
> > > Just to clarify a few things for myself. You mention
> > > virtualization
> > > and SR-IOV in your patch description but you cannot support
> > > direct
> > > assignment with this correct? The idea here is simply logical
> > > partitioning of an existing network interface, correct? So this
> > > isn't
> > > so much a solution for virtualization, but may work better for
> > > containers. I view this as an important distinction to make as
> > > the
> > 
> > at the current state yes, but the SF solution can be extended to
> > support direct assignment, so this is why i think SF solution can
> > do
> > better and eventually replace SRIOV.
> 
> My only real concern is that this and mediated devices are
> essentially
> the same thing. When you start making this work for direct-assignment
> the only real difference becomes the switchdev and devlink
> interfaces.

Not just devlink and switchdev; auxbus was also introduced to
standardize some of the interfaces.

> Basically this is netdev specific mdev versus the PCIe specific mdev.
> 

SF is not a netdev-specific mdev .. :/

> > also many customers are currently using SRIOV with containers to
> > get
> > the performance and isolation features since there was no other
> > options.
> 
> There were, but you hadn't implemented them. The fact is the approach
> Intel had taken for that was offloaded macvlan.
> 

Offloaded macvlan is just a macvlan with checksum/TSO and GRO.
macvlan can't provide RDMA, TC offloads, ethtool steering, PTP, vdpa
...
Our SF provides the same set of features a VF can provide.


> I think the big thing we really should do if we are going to go this
> route is to look at standardizing what the flavours are that get
> created by the parent netdevice. Otherwise we are just creating the
> same mess we had with SRIOV all over again and muddying the waters of
> mediated devices.
> 

Yes, in the near future we will be working on auxbus interfaces for
auto-probing and user flavor selection; this is a must-have feature
for us.
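
The SF already surfaces through the auxiliary bus today; a sketch of
what that looks like (names follow the mlx5 auxiliary device scheme,
exact indexes illustrative):

$ ls /sys/bus/auxiliary/devices/
mlx5_core.eth.0  mlx5_core.rdma.0  mlx5_core.sf.1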

> > > first thing that came to mind when I read this was mediated
> > > devices
> > > which is similar, but focused only on the virtualization case:
> > > https://www.kernel.org/doc/html/v5.9/driver-api/vfio-mediated-device.html
> > > 
> > > > Parav Pandit Says:
> > > > =================
> > > > 
> > > > This patchset introduces support for mlx5 subfunction (SF).
> > > > 
> > > > A subfunction is a lightweight function that has a parent PCI
> > > > function on
> > > > which it is deployed. mlx5 subfunction has its own function
> > > > capabilities
> > > > and its own resources. This means a subfunction has its own
> > > > dedicated
> > > > queues(txq, rxq, cq, eq). These queues are neither shared nor
> > > > stealed from
> > > > the parent PCI function.
> > > 
> > > Rather than calling this a subfunction, would it make more sense
> > > to
> > > call it something such as a queue set? It seems like this is
> > > exposing
> > > some of the same functionality we did in the Intel drivers such
> > > as
> > > ixgbe and i40e via the macvlan offload interface. However the
> > > ixgbe/i40e hardware was somewhat limited in that we were only
> > > able to
> > > expose Ethernet interfaces via this sort of VMQ/VMDQ feature, and
> > > even
> > > with that we have seen some limitations to the interface. It
> > > sounds
> > > like you are able to break out RDMA capable devices this way as
> > > well.
> > > So in terms of ways to go I would argue this is likely better.
> > 
> > We've discussed this thoroughly on V0, the SF solutions is closer
> > to a
> > VF than a VMDQ, this is not just a set of queues.
> > 
> > https://lore.kernel.org/linux-rdma/421951d99a33d28b91f2b2997409d0c97fa5a98a.camel@kernel.org/
> 
> VMDq is more than just a set of queues. The fact is it is a pool of
> resources that get created to handle the requests for a specific VM.
> The extra bits that are added here are essentially stuff that was
> required to support mediated devices.
> 

VMDq pools are managed by the driver and only logically isolated in
the kernel. SFs have no shared pool for transport resources (queues);
SFs have their own isolated steering domains, processing engines, and
HW objects, exactly like a VF.

> > > However
> > > one downside is that we are going to end up seeing each
> > > subfunction
> > > being different from driver to driver and vendor to vendor which
> > > I
> > > would argue was also one of the problems with SR-IOV as you end
> > > up
> > > with a bit of vendor lock-in as a result of this feature since
> > > each
> > > vendor will be providing a different interface.
> > > 
> > 
> > I disagree, SFs are tightly coupled with switchdev model and
> > devlink
> > functions port, they are backed with the a well defined model, i
> > can
> > say the same about sriov with switchdev mode, this sort of vendor
> > lock-
> > in issues is eliminated when you migrate to switchdev mode.
> 
> What you are talking about is the backend. I am talking about what is
> exposed to the user. The user is going to see a Mellanox device
> having
> to be placed into their container in order to support this. One of
> the
> advantages of the Intel approach was that the macvlan interface was
> generic so you could have an offloaded interface or not and the user
> wouldn't necessarily know. The offload could be disabled and the user
> would be none the wiser as it is moved from one interface to another.
> I see that as a big thing that is missing in this solution.
> 

You are talking about the basic netdev users. Sure, there are users
who would want a more generic netdev, so yes. But most of my customers
are not like that; they want vdpa/rdma and heavy netdev offloads such
as encap/decap/crypto and driver XDP in their containers. The SF
approach will make more sense to them than SRIOV and VMDq.
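
And once the SF netdev exists, handing it to a container is trivial
(netdev name from the earlier example, netns name illustrative):

$ ip link set p0sf88 netns blue
$ ip netns exec blue ip link set p0sf88 up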

> > > > When subfunction is RDMA capable, it has its own QP1, GID table
> > > > and
> > > > rdma
> > > > resources neither shared nor stealed from the parent PCI
> > > > function.
> > > > 
> > > > A subfunction has dedicated window in PCI BAR space that is not
> > > > shared
> > > > with ther other subfunctions or parent PCI function. This
> > > > ensures
> > > > that all
> > > > class devices of the subfunction accesses only assigned PCI BAR
> > > > space.
> > > > 
> > > > A Subfunction supports eswitch representation through which it
> > > > supports tc
> > > > offloads. User must configure eswitch to send/receive packets
> > > > from/to
> > > > subfunction port.
> > > > 
> > > > Subfunctions share PCI level resources such as PCI MSI-X IRQs
> > > > with
> > > > their other subfunctions and/or with its parent PCI function.
> > > 
> > > This piece to the architecture for this has me somewhat
> > > concerned. If
> > > all your resources are shared and you are allowing devices to be
> > 
> > not all, only PCI MSIX, for now..
> 
> They aren't shared after you partition them but they are coming from
> the same device. Basically you are subdividing the BAR2 in order to
> generate the subfunctions. BAR2 is a shared resource in my point of
> view.
> 

Sure, but it doesn't host any actual resources, only the
communication channel with the HW partition. So other than the BAR and
the MSI-X, the actual HW resources, steering pipelines, offloads and
queues are totally isolated and separated.


> > > created incrementally you either have to pre-partition the entire
> > > function which usually results in limited resources for your base
> > > setup, or free resources from existing interfaces and
> > > redistribute
> > > them as things change. I would be curious which approach you are
> > > taking here? So for example if you hit a certain threshold will
> > > you
> > > need to reset the port and rebalance the IRQs between the various
> > > functions?
> > > 
> > 
> > Currently SFs will use whatever IRQs the PF has pre-allocated for
> > itself, so there is no IRQ limit issue at the moment, we are
> > considering a dynamic IRQ pool with dynamic balancing, or even
> > better
> > us the IMS approach, which perfectly fits the SF architecture.
> > https://patchwork.kernel.org/project/linux-pci/cover/1568338328-22458-1-git-send-email-megha.dey@linux.intel.com/
> 
> When you say you are using the PF's interrupts are you just using
> that
> as a pool of resources or having the interrupt process interrupts for
> both the PF and SFs? Without IMS you are limited to 2048 interrupts.
> Moving over to that would make sense since SF is similar to mdev in
> the way it partitions up the device and resources.
> 

Yes, moving to IMS is at the top of our priorities.

> > for internal resources the are fully isolated (not shared) and
> > they are internally managed by FW exactly like a VF internal
> > resources.
> 
> I assume by isolated you mean they are contained within page aligned
> blocks like what was required for mdev?

I mean they are isolated and abstracted in the FW; we don't really
expose any resource directly in the BAR. The BAR is only used for
communicating with the device, so VF and SF will work exactly the
same; the only difference is where they get their BAR and offsets
from, everything else is just similar.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-15  5:48   ` Parav Pandit
  2020-12-15 18:47     ` Alexander Duyck
@ 2020-12-15 20:59     ` David Ahern
  1 sibling, 0 replies; 65+ messages in thread
From: David Ahern @ 2020-12-15 20:59 UTC (permalink / raw)
  To: Parav Pandit, Alexander Duyck, Saeed Mahameed
  Cc: David S. Miller, Jakub Kicinski, Jason Gunthorpe,
	Leon Romanovsky, Netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, Ertman, David M, Dan Williams, Kiran Patil,
	Greg KH

On 12/14/20 10:48 PM, Parav Pandit wrote:
> 
>> From: Alexander Duyck <alexander.duyck@gmail.com>
>> Sent: Tuesday, December 15, 2020 7:24 AM
>>
>> On Mon, Dec 14, 2020 at 1:49 PM Saeed Mahameed <saeed@kernel.org>
>> wrote:
>>>
>>> Hi Dave, Jakub, Jason,
>>>
>>
>> Just to clarify a few things for myself. You mention virtualization and SR-IOV
>> in your patch description but you cannot support direct assignment with this
>> correct? 
> Correct. it cannot be directly assigned.
> 
>> The idea here is simply logical partitioning of an existing network
>> interface, correct? 
> No. Idea is to spawn multiple functions from a single PCI device.
> These functions are not born in PCI device and in OS until they are created by user.
> Jason and Saeed explained this in great detail few weeks back in v0 version of the patchset at [1], [2] and [3].
> I better not repeat all of it here again. Please go through it.
> If you may want to read precursor to it, RFC from Jiri at [4] is also explains this in great detail.
> 
>> So this isn't so much a solution for virtualization, but may
>> work better for containers. I view this as an important distinction to make as
>> the first thing that came to mind when I read this was mediated devices
>> which is similar, but focused only on the virtualization case:
>> https://www.kernel.org/doc/html/v5.9/driver-api/vfio-mediated-
>> device.html
>>
> Managing subfunction using mediated device is already ruled out last year at [5] as it is the abuse of the mdev bus for this purpose + has severe limitations of managing the subfunction device.
> We are not going back to it anymore.
> It will be duplicating lot of the plumbing which exists in devlink, netlink, auxiliary bus and more.
>  
>> Rather than calling this a subfunction, would it make more sense to call it
>> something such as a queue set? 
> No, queue is just one way to send and receive data/packets.
> Jason and Saeed explained and discussed  this piece to you and others during v0 few weeks back at [1], [2], [3].
> Please take a look.
> 
>> So in terms of ways to go I would argue this is likely better. However one
>> downside is that we are going to end up seeing each subfunction being
>> different from driver to driver and vendor to vendor which I would argue
>> was also one of the problems with SR-IOV as you end up with a bit of vendor
>> lock-in as a result of this feature since each vendor will be providing a
>> different interface.
>>
> Each and several vendors provided unified interface for managing VFs. i.e.
> (a) enable/disable was via vendor neutral sysfs
> (b) sriov capability exposed via standard pci capability and sysfs
> (c) sriov vf config (mac, vlan, rss, tx rate, spoof check trust) are using vendor agnostic netlink
> Even though the driver's internal implementation largely differs on how trust, spoof, mac, vlan rate etc are enforced.
> 
> So subfunction feature/attribute/functionality will be implemented differently internally in the driver matching vendor's device, for reasonably abstract concept of 'subfunction'.
> 
>>> A Subfunction supports eswitch representation through which it
>>> supports tc offloads. User must configure eswitch to send/receive
>>> packets from/to subfunction port.
>>>
>>> Subfunctions share PCI level resources such as PCI MSI-X IRQs with
>>> their other subfunctions and/or with its parent PCI function.
>>
>> This piece to the architecture for this has me somewhat concerned. If all your
>> resources are shared and 
> Not all resources are shared.
> 
>> you are allowing devices to be created
>> incrementally you either have to pre-partition the entire function which
>> usually results in limited resources for your base setup, or free resources
>> from existing interfaces and redistribute them as things change. I would be
>> curious which approach you are taking here? So for example if you hit a
>> certain threshold will you need to reset the port and rebalance the IRQs
>> between the various functions?
> No. It works a bit differently for the mlx5 device.
> When the base function is started, it starts as if it doesn't have any subfunctions.
> When a subfunction is instantiated, it spawns new resources in the device (hw, fw, memory) depending on how much that function wants.
> 
> For example, the PCI PF uses BAR 0, while subfunctions use BAR 2.
> For IRQs, a subfunction instance shares the IRQs with its parent/hosting PCI PF.
> In the future, yes, dedicated IRQs per SF are likely desired.
> Sridhar also talked about limiting the number of queues of a subfunction.
> I believe there will be resources/attributes of the function to be controlled.
> devlink already provides a rich interface to achieve that using devlink resources [8].
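> 
> For illustration (the device address and resource path here are made
> up; see [8] for the exact syntax):
> 
> $ devlink resource show pci/0000:06:00.0
> $ devlink resource set pci/0000:06:00.0 path /some_resource size 32
> $ devlink dev reload pci/0000:06:00.0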
> 
> [..]
> 
>>> $ ip link show
>>> 127: ens2f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state
>> DOWN mode DEFAULT group default qlen 1000
>>>     link/ether 24:8a:07:b3:d1:12 brd ff:ff:ff:ff:ff:ff
>>>     altname enp6s0f0np0
>>> 129: p0sf88: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
>> mode DEFAULT group default qlen 1000
>>>     link/ether 00:00:00:00:88:88 brd ff:ff:ff:ff:ff:ff>
>>
>> I assume that p0sf88 is supposed to be the newly created subfunction.
>> However I thought the naming was supposed to be the same as what you are
>> referring to in the devlink, or did I miss something?
>>
> I believe you are confusing the representor netdevice of the subfunction with the devices of the subfunction (netdev, rdma, vdpa etc).
> Please refer to the diagram in patch 15 in [7] to see the stack, modules and objects.
> Hopefully the description below clarifies a bit.
> There are two netdevices.
> (a) the representor netdevice, attached to the devlink port of the eswitch
> (b) the netdevice of the SF used by the end application (in your example, this is assigned to the container).
> 
> The two netdevices obviously follow different naming schemes.
> The representor netdevice follows the naming scheme well defined in the kernel + systemd/udev v245 and higher.
> It is based on the phys_port_name sysfs attribute.
> This is the same scheme that existing PF and VF representors have used for a year+ now, and it is further used by subfunction representors.
> 
> For the subfunction netdevice (p0sf88), systemd/udev will be extended. I put together an example based on a few lines of udev rule that read
> phys_port_name and the user-supplied sfnum, so that the user knows exactly which interface to assign to the container.
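> 
> As a minimal sketch, using the representor from the example above (the
> output shown is the phys_port_name the naming scheme implies):
> 
> $ cat /sys/class/net/ens2f0npf0sf88/phys_port_name
> pf0sf88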
> 
>>> After use inactivate the function:
>>> $ devlink port function set ens2f0npf0sf88 state inactive
>>>
>>> Now delete the subfunction port:
>>> $ devlink port del ens2f0npf0sf88
>>
>> This seems wrong to me as it breaks the symmetry with the port add
>> command and
> The example with the representor device is only to make life easier for the user.
> The devlink port del command works based on the devlink port index, just like existing devlink port commands (get, set, split, unsplit).
> I explained this in a thread with Sridhar at [6].
> In short, a 'devlink port del <bus/device_name>/<port_index>' command is just fine.
> The port index is the unique handle within the devlink instance that the user refers to when deleting, getting or setting port and port function attributes after creation.
> I chose the representor netdev example because it is more intuitive to relate to, but the port index is equally fine and supported.
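> 
> For example (the port index 32768 is illustrative; 'devlink port show'
> prints the real one):
> 
> $ devlink port show
> $ devlink port del pci/0000:06:00.0/32768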
> 
>> assumes you have ownership of the interface in the host. I
>> would much prefer to see the same arguments that were passed to the
>> add command being used to do the teardown as that would allow for the
>> parent function to create the object, assign it to a container namespace, and
>> not need to pull it back in order to destroy it.
> The parent function will not have the same netdevice name as the representor netdevice, because both devices exist in a single system for a large part of the use cases.
> So the port delete command works on the port index.
> The host doesn't need to pull it back to destroy it. It is destroyed via the port del command.
> 
> [1] https://lore.kernel.org/netdev/20201112192424.2742-1-parav@nvidia.com/
> [2] https://lore.kernel.org/netdev/421951d99a33d28b91f2b2997409d0c97fa5a98a.camel@kernel.org/
> [3] https://lore.kernel.org/netdev/20201120161659.GE917484@nvidia.com/
> [4] https://lore.kernel.org/netdev/20200501091449.GA25211@nanopsycho.orion/
> [5] https://lore.kernel.org/netdev/20191107160448.20962-1-parav@mellanox.com/
> [6] https://lore.kernel.org/netdev/BY5PR12MB43227784BB34D929CA64E315DCCA0@BY5PR12MB4322.namprd12.prod.outlook.com/
> [7] https://lore.kernel.org/netdev/20201214214352.198172-16-saeed@kernel.org/T/#u
> [8] https://man7.org/linux/man-pages/man8/devlink-resource.8.html
> 

Seems to be a repeated line of questions. You might want to add these
FAQs, responses and references to the subfunction document once this set
gets merged.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-15 18:47     ` Alexander Duyck
  2020-12-15 20:05       ` Saeed Mahameed
@ 2020-12-15 21:03       ` Jason Gunthorpe
  2020-12-16  1:12       ` Edwin Peer
  2 siblings, 0 replies; 65+ messages in thread
From: Jason Gunthorpe @ 2020-12-15 21:03 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Parav Pandit, Saeed Mahameed, David S. Miller, Jakub Kicinski,
	Leon Romanovsky, Netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, Ertman, David M, Dan Williams, Kiran Patil,
	Greg KH

On Tue, Dec 15, 2020 at 10:47:36AM -0800, Alexander Duyck wrote:

> > Jason and Saeed explained this in great detail a few weeks back in the v0 version of the patchset at [1], [2] and [3].
> > I'd better not repeat all of it here again. Please go through it.
> > If you want to read the precursor to it, the RFC from Jiri at [4] also explains this in great detail.
> 
> I think I have a pretty good idea of how the feature works. My concern
> is more the use of marketing speak versus actual functionality. The
> way this is being set up it sounds like it is useful for virtualization
> and it is not, at least in its current state. It may be at some point
> in the future but I worry that it is really going to muddy the waters
> as we end up with yet another way to partition devices.

If we do a virtualization version then it will take a SF and, instead
of loading mlx5_core on the SF aux device, we will load some
vfio_mdev_mlx5 driver which will convert the SF aux device into a
/dev/vfio/*

This is essentially the same as how you'd take a PCI VF and replace
mlx5_core with vfio-pci to get /dev/vfio/*. It has to be a special
mdev driver because it sits on the SF aux device, not on the VF PCI
device.
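
For reference, that VF flow is the usual sysfs rebind (the BDF below is
illustrative):

$ echo 0000:06:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind
$ echo vfio-pci > /sys/bus/pci/devices/0000:06:00.2/driver_override
$ echo 0000:06:00.2 > /sys/bus/pci/drivers_probe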

The vfio_mdev_mlx5 driver will create what Intel calls an SIOV ADI
from the SF, in other words the SF is already a superset of what a
SIOV ADI should be.

This matches very nicely the driver model in Linux, and I don't think
it becomes more muddied as we go along. If anything it is becoming
more clear and sane as things progress.

> I agree with you on that. My thought was more the fact that the two
> can be easily confused. If we are going to do this we need to define
> that for networking devices perhaps that using the mdev interface
> would be deprecated and we would need to go through devlink. However
> before we do that we need to make sure we have this completely
> standardized.

mdev is for creating /dev/vfio/* interfaces in userspace. Using it for
anything else is a bad abuse of the driver model.

We had this debate endlessly already.

AFAIK, there is nothing to deprecate, there are no mdev_drivers in
drivers/net, and none should ever be added. The only mdev_driver that
should ever exists is in vfio_mdev.c

If someone is using a mdev_driver in drivers/net out of tree then they
will need to convert to an aux driver for in-tree.

> Yeah, I recall that. However I feel like it is being oversold. It
> isn't "SR-IOV done right" it seems more like "VMDq done better". The
> fact that interrupts are shared between the subfunctions is telling.

The interrupt sharing is a consequence of having an ADI-like model
without relying on IMS. When IMS works then shared interrupts won't be
very necessary. Otherwise there is no choice but to share the MSI
table of the function.

> That is exactly how things work for Intel parts when they do VMDq as
> well. The queues are split up into pools and a block of queues belongs
> to a specific pool. From what I can tell the only difference is
> that there is isolation of the pool into specific pages in the BAR.
> Which is essentially a requirement for mediated devices so that they
> can be direct assigned.

No, I said this to Jakub: mlx5 SFs have very little to do with
queues. There is no 'queue' HW element that needs partitioning.

The SF is a hardware security boundary that wraps every operation a
mlx5 device can do. This is why it is an ADI. It is not a crappy ADI
that relies on hypervisor emulation, it is the real thing, just like a
SRIOV VF. You stick it in the VM and the guest can directly talk to
the HW. The HW provides the security.

I can't put focus on this enough: A mlx5 SF can run a *full RDMA
stack*. This means the driver can create all the RDMA HW objects and
resources under the SF. This is *not* just steering some ethernet
traffic to a few different ethernet queues like VMDq is.

The Intel analog to a SF is a *full virtual function* on one of the
Intel iWarp capable NICs, not VMDq.

> Assuming at some point one of the flavours is a virtio-net style
> interface you could eventually get to the point of something similar
> to what seems to have been the goal of mdev which was meant to address
> these two points.

mlx5 already supports VDPA virtio-net on PF/VF and with this series SF
too.

i.e. you can take a SF, bind the mlx5_vdpa driver, and get a fully HW
accelerated "ADI" that does virtio-net. This can be assigned to a
guest and shows up as a PCI virtio-net netdev. With VT-d, guest packet
tx/rx on this netdev never uses the hypervisor CPU.
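
As a sketch of that flow with the iproute2 vdpa management tool (the
device name and the mgmtdev handle below are illustrative, assuming a
SF aux device named mlx5_core.sf.1):

$ vdpa mgmtdev show
$ vdpa dev add name vdpa0 mgmtdev auxiliary/mlx5_core.sf.1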

> The point is that we should probably define some sort of standard
> and/or expectations on what should happen when you spawn a new
> interface. Would it be acceptable for the PF and existing subfunctions
> to have to reset if you need to rebalance the IRQ distribution, or
> should they not be disrupted when you spawn a new interface?

It is best to think of the SF as an ADI, so if you change something in
the PF and that causes the driver attached to the ADI in a VM to
reset, is that OK? I'd say no.

Jason

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-15 20:35       ` Saeed Mahameed
@ 2020-12-15 21:28         ` Jakub Kicinski
  2020-12-16  6:50           ` Leon Romanovsky
  2020-12-15 21:41         ` Alexander Duyck
  1 sibling, 1 reply; 65+ messages in thread
From: Jakub Kicinski @ 2020-12-15 21:28 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Alexander Duyck, David S. Miller, Jason Gunthorpe,
	Leon Romanovsky, Netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, Ertman, David M, Dan Williams, Kiran Patil,
	Greg KH

On Tue, 15 Dec 2020 12:35:20 -0800 Saeed Mahameed wrote:
> > I think the big thing we really should do if we are going to go this
> > route is to look at standardizing what the flavours are that get
> > created by the parent netdevice. Otherwise we are just creating the
> > same mess we had with SRIOV all over again and muddying the waters of
> > mediated devices.
> 
> yes in the near future we will be working on auxbus interfaces for
> auto-probing and user flavor selection, this is a must have feature for
> us.

Can you elaborate? I thought config would be via devlink.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-15 20:35       ` Saeed Mahameed
  2020-12-15 21:28         ` Jakub Kicinski
@ 2020-12-15 21:41         ` Alexander Duyck
  2020-12-16  0:19           ` Jason Gunthorpe
  1 sibling, 1 reply; 65+ messages in thread
From: Alexander Duyck @ 2020-12-15 21:41 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David S. Miller, Jakub Kicinski, Jason Gunthorpe,
	Leon Romanovsky, Netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, Ertman, David M, Dan Williams, Kiran Patil,
	Greg KH

On Tue, Dec 15, 2020 at 12:35 PM Saeed Mahameed <saeed@kernel.org> wrote:
>
> On Tue, 2020-12-15 at 11:12 -0800, Alexander Duyck wrote:
> > On Mon, Dec 14, 2020 at 10:15 PM Saeed Mahameed <saeed@kernel.org>
> > wrote:
> > > On Mon, 2020-12-14 at 17:53 -0800, Alexander Duyck wrote:
> > > > On Mon, Dec 14, 2020 at 1:49 PM Saeed Mahameed <saeed@kernel.org>
> > > > wrote:
> > > > > Hi Dave, Jakub, Jason,
> > > > >
> > > > > This series form Parav was the theme of this mlx5 release
> > > > > cycle,
> > > > > we've been waiting anxiously for the auxbus infrastructure to
> > > > > make
> > > > > it into
> > > > > the kernel, and now as the auxbus is in and all the stars are
> > > > > aligned, I
> > > > > can finally submit this V2 of the devlink and mlx5 subfunction
> > > > > support.
> > > > >
> > > > > Subfunctions came to solve the scaling issue of virtualization
> > > > > and switchdev environments, where SRIOV failed to deliver and
> > > > > users
> > > > > ran
> > > > > out of VFs very quickly as SRIOV demands huge amount of
> > > > > physical
> > > > > resources
> > > > > in both of the servers and the NIC.
> > > > >
> > > > > Subfunction provide the same functionality as SRIOV but in a
> > > > > very
> > > > > lightweight manner, please see the thorough and detailed
> > > > > documentation from Parav below, in the commit messages and the
> > > > > Networking documentation patches at the end of this series.
> > > > >
> > > >
> > > > Just to clarify a few things for myself. You mention
> > > > virtualization
> > > > and SR-IOV in your patch description but you cannot support
> > > > direct
> > > > assignment with this correct? The idea here is simply logical
> > > > partitioning of an existing network interface, correct? So this
> > > > isn't
> > > > so much a solution for virtualization, but may work better for
> > > > containers. I view this as an important distinction to make as
> > > > the
> > >
> > > at the current state yes, but the SF solution can be extended to
> > > support direct assignment, so this is why i think SF solution can
> > > do
> > > better and eventually replace SRIOV.
> >
> > My only real concern is that this and mediated devices are
> > essentially
> > the same thing. When you start making this work for direct-assignment
> > the only real difference becomes the switchdev and devlink
> > interfaces.
>
> not just devlink and switchdev, auxbus was also introduced to
> standardize some of the interfaces.

The auxbus is just there to make up for the fact that there isn't
another bus type for this though. I would imagine otherwise this would
be on some sort of platform bus.

> > Basically this is netdev specific mdev versus the PCIe specific mdev.
> >
>
> SF is not netdev specific mdev .. :/

I agree it is not. However there are just a few extensions to it. What
I would really like to see is a solid standardization of what this is.
Otherwise the comparison is going to be made. Especially since a year
ago Mellanox was pushing this as an mdev type interface. There is more
here than just mdev; however my concern is that we may also be losing
some of the advantages of mdev.

It would be much easier for me to go along with this if we had more
than one vendor pushing it. My concern is that this is becoming
something that may end up being vendor specific.

> > > also many customers are currently using SRIOV with containers to
> > > get
> > > the performance and isolation features since there was no other
> > > options.
> >
> > There were, but you hadn't implemented them. The fact is the approach
> > Intel had taken for that was offloaded macvlan.
> >
>
> offloaded macvlan is just a macvlan with checksum/tso and gro.
>
> macvlan can't provide RDMA, TC offloads, ethtool steering, PTP, vdpa

Agreed. I have already acknowledged that macvlan couldn't meet the
needs for all use cases. However at the same time it provides a
consistent interface regardless of vendors.

If we decide to go with the vendor specific drivers for subfunctions
that is fine, however I see that going down the same path as SR-IOV
and ultimately ending in obscurity since I don't see many being
willing to adopt it.

> ...
> our SF provides the same set of features a VF can provide

That is all well and good. However if we agree that SR-IOV wasn't done
right saying that you are spinning up something that works just like
SR-IOV isn't all that appealing, is it?

> > I think the big thing we really should do if we are going to go this
> > route is to look at standardizing what the flavours are that get
> > created by the parent netdevice. Otherwise we are just creating the
> > same mess we had with SRIOV all over again and muddying the waters of
> > mediated devices.
> >
>
> yes in the near future we will be working on auxbus interfaces for
> auto-probing and user flavor selection, this is a must have feature for
> us.

I would take this one step further. If we are going to have flavours
maybe we should have interfaces pre-defined that are vendor agnostic
that can represent the possible flavors: basically an Ethernet
interface for that case, an RDMA interface for that case, and so on.
It limits what functionality can be exposed, however it frees up the
containers and/or guests to work on whatever NIC you want as long as
it supports that interface.

> > > > first thing that came to mind when I read this was mediated
> > > > devices
> > > > which is similar, but focused only on the virtualization case:
> > > > https://www.kernel.org/doc/html/v5.9/driver-api/vfio-mediated-device.html
> > > >
> > > > > Parav Pandit Says:
> > > > > =================
> > > > >
> > > > > This patchset introduces support for mlx5 subfunction (SF).
> > > > >
> > > > > A subfunction is a lightweight function that has a parent PCI
> > > > > function on
> > > > > which it is deployed. mlx5 subfunction has its own function
> > > > > capabilities
> > > > > and its own resources. This means a subfunction has its own
> > > > > dedicated
> > > > > queues(txq, rxq, cq, eq). These queues are neither shared nor
> > > > > stealed from
> > > > > the parent PCI function.
> > > >
> > > > Rather than calling this a subfunction, would it make more sense
> > > > to
> > > > call it something such as a queue set? It seems like this is
> > > > exposing
> > > > some of the same functionality we did in the Intel drivers such
> > > > as
> > > > ixgbe and i40e via the macvlan offload interface. However the
> > > > ixgbe/i40e hardware was somewhat limited in that we were only
> > > > able to
> > > > expose Ethernet interfaces via this sort of VMQ/VMDQ feature, and
> > > > even
> > > > with that we have seen some limitations to the interface. It
> > > > sounds
> > > > like you are able to break out RDMA capable devices this way as
> > > > well.
> > > > So in terms of ways to go I would argue this is likely better.
> > >
> > > We've discussed this thoroughly on V0, the SF solution is closer
> > > to a
> > > VF than a VMDQ, this is not just a set of queues.
> > >
> > > https://lore.kernel.org/linux-rdma/421951d99a33d28b91f2b2997409d0c97fa5a98a.camel@kernel.org/
> >
> > VMDq is more than just a set of queues. The fact is it is a pool of
> > resources that get created to handle the requests for a specific VM.
> > The extra bits that are added here are essentially stuff that was
> > required to support mediated devices.
> >
>
> VMDq pools are managed by the driver and only logically isolated in the
> kernel. SFs have no shared pool for transport resources (queues); SFs
> have their own isolated steering domains, processing engines, and HW
> objects, exactly like a VF.

You are describing your specific implementation. That may not apply to
others. What you are defining as the differences between VMDq and SR-IOV
is not the same for other vendors.

You are essentially arguing implementation semantics; whether it is
configured by the driver or the firmware doesn't really make any
difference. Being fully isolated versus only logically isolated only
really matters in terms of direct assignment. In the grand scheme of
things the only real difference between SR-IOV and VMDq is the
spawning of the PCIe device with its own BAR to access the resources.
Isolating the queues to their own 4K bounded subset of a BAR is pretty
straightforward and I assume that and the firmware is what is giving
you most of your isolation in this case.

> > > > However
> > > > one downside is that we are going to end up seeing each
> > > > subfunction
> > > > being different from driver to driver and vendor to vendor which
> > > > I
> > > > would argue was also one of the problems with SR-IOV as you end
> > > > up
> > > > with a bit of vendor lock-in as a result of this feature since
> > > > each
> > > > vendor will be providing a different interface.
> > > >
> > >
> > > I disagree, SFs are tightly coupled with switchdev model and
> > > devlink
> > > functions port, they are backed with the a well defined model, i
> > > can
> > > say the same about sriov with switchdev mode, this sort of vendor
> > > lock-
> > > in issues is eliminated when you migrate to switchdev mode.
> >
> > What you are talking about is the backend. I am talking about what is
> > exposed to the user. The user is going to see a Mellanox device
> > having
> > to be placed into their container in order to support this. One of
> > the
> > advantages of the Intel approach was that the macvlan interface was
> > generic so you could have an offloaded interface or not and the user
> > wouldn't necessarily know. The offload could be disabled and the user
> > would be none the wiser as it is moved from one interface to another.
> > I see that as a big thing that is missing in this solution.
> >
>
> You are talking about the basic netdev users. Sure, there are users who
> would want a more generic netdev, so yes. But most of my customers are
> not like that; they want vdpa/rdma and heavy netdev offloads such as
> encap/decap/crypto and driver xdp in their containers. The SF approach
> will make more sense to them than sriov and VMDq.

I am talking about my perspective. From what I have seen, one-off
features that are only available from specific vendors are a pain to
deal with and difficult to enable when you have to support multiple
vendors within your ecosystem. What you end up going for is usually
the lowest common denominator because you ideally want to be able to
configure all your devices the same and have one recipe for setup.

I'm not saying you cannot enable those features. However at the same
time I am saying it would be nice to have a vendor neutral way of
dealing with those if we are going to support SF, ideally with some
sort of software fallback that may not perform as well but will at
least get us the same functionality.

I'm trying to remember which netdev conference it was. I referred to
this as a veth switchdev offload when something like this was first
brought up. The more I think about it now, the more sense it would
make to have something like that as a flavor. The way I view it, we
have a few different use cases floating around which will have
different needs. My thought is having a standardized interface that
could address those needs would be a good way to go for this as it
would force everyone to come together and define a standardized
feature set that all of the vendors would want to expose.

> > > > > When subfunction is RDMA capable, it has its own QP1, GID table
> > > > > and
> > > > > rdma
> > > > > resources neither shared nor stealed from the parent PCI
> > > > > function.
> > > > >
> > > > > A subfunction has dedicated window in PCI BAR space that is not
> > > > > shared
> > > > > with ther other subfunctions or parent PCI function. This
> > > > > ensures
> > > > > that all
> > > > > class devices of the subfunction accesses only assigned PCI BAR
> > > > > space.
> > > > >
> > > > > A Subfunction supports eswitch representation through which it
> > > > > supports tc
> > > > > offloads. User must configure eswitch to send/receive packets
> > > > > from/to
> > > > > subfunction port.
> > > > >
> > > > > Subfunctions share PCI level resources such as PCI MSI-X IRQs
> > > > > with
> > > > > their other subfunctions and/or with its parent PCI function.
> > > >
> > > > This piece to the architecture for this has me somewhat
> > > > concerned. If
> > > > all your resources are shared and you are allowing devices to be
> > >
> > > not all, only PCI MSIX, for now..
> >
> > They aren't shared after you partition them but they are coming from
> > the same device. Basically you are subdividing the BAR2 in order to
> > generate the subfunctions. BAR2 is a shared resource in my point of
> > view.
> >
>
> Sure, but it doesn't host any actual resources, only the communication
> channel with the HW partition, so other than the BAR and the msix the
> actual HW resources, steering pipelines, offloads and queues are totally
> isolated and separated.

I understand what you are trying to say, however this is semantics
specific to the implementation. Ultimately you are having to share the
function.

> > > > created incrementally you either have to pre-partition the entire
> > > > function which usually results in limited resources for your base
> > > > setup, or free resources from existing interfaces and
> > > > redistribute
> > > > them as things change. I would be curious which approach you are
> > > > taking here? So for example if you hit a certain threshold will
> > > > you
> > > > need to reset the port and rebalance the IRQs between the various
> > > > functions?
> > > >
> > >
> > > Currently SFs will use whatever IRQs the PF has pre-allocated for
> > > itself, so there is no IRQ limit issue at the moment, we are
> > > considering a dynamic IRQ pool with dynamic balancing, or even
> > > better
> > > use the IMS approach, which perfectly fits the SF architecture.
> > > https://patchwork.kernel.org/project/linux-pci/cover/1568338328-22458-1-git-send-email-megha.dey@linux.intel.com/
> >
> > When you say you are using the PF's interrupts are you just using
> > that
> > as a pool of resources or having the interrupt process interrupts for
> > both the PF and SFs? Without IMS you are limited to 2048 interrupts.
> > Moving over to that would make sense since SF is similar to mdev in
> > the way it partitions up the device and resources.
> >
>
> Yes moving to IMS is on the top of our priorities.
>
> > > for internal resources the are fully isolated (not shared) and
> > > they are internally managed by FW exactly like a VF internal
> > > resources.
> >
> > I assume by isolated you mean they are contained within page aligned
> > blocks like what was required for mdev?
>
> I mean they are isolated and abstracted in the FW; we don't really
> expose any resource directly in the BAR. The BAR is only used for
> communicating with the device, so VF and SF will work exactly the same;
> the only difference is where they get their BAR and offsets from,
> everything else is just similar.

I think where you and I differ is our use of the term "resource". I
would consider the address space a "resource", while you argue that
the resources are hidden behind the BAR.

I agree with you that the firmware should be managing most of the
resources in the device. So it isn't surprising that it would be
splitting them up and then doling out pieces as needed to put
together a subfunction.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-15 21:41         ` Alexander Duyck
@ 2020-12-16  0:19           ` Jason Gunthorpe
  2020-12-16  2:19             ` Alexander Duyck
  0 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2020-12-16  0:19 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Saeed Mahameed, David S. Miller, Jakub Kicinski, Leon Romanovsky,
	Netdev, linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	Ertman, David M, Dan Williams, Kiran Patil, Greg KH

On Tue, Dec 15, 2020 at 01:41:04PM -0800, Alexander Duyck wrote:

> > not just devlink and switchdev, auxbus was also introduced to
> > standardize some of the interfaces.
> 
> The auxbus is just there to make up for the fact that there isn't
> another bus type for this though. I would imagine otherwise this would
> be on some sort of platform bus.

Please let's not start this again. This was gone over with Greg for
literally a year and a half and he explicitly NAK'd platform bus for
this purpose.

Aux bus exists to connect different kernel subsystems that touch the
same HW block together. Here we have the mlx5_core subsystem, vdpa,
rdma, and netdev all being linked together using auxbus.

It is kind of like what MFD does, but again, using MFD for this was
also NAK'd by Greg.

At the very worst we might someday find enough common stuff between
ADIs that we get an ADI bus, but I'm not
optimistic. So far it looks like there is no commonality.

Aux bus has at least 4 users already in various stages of submission,
and many other target areas that should be replaced by it.
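
Concretely, the linkage is visible as plain devices on the auxiliary
bus; with this series the listing would look roughly like this (the
names follow the mlx5 scheme but are illustrative):

$ ls /sys/bus/auxiliary/devices/
mlx5_core.eth.0  mlx5_core.rdma.0  mlx5_core.sf.1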

> I would really like to see is a solid standardization of what this is.
> Otherwise the comparison is going to be made. Especially since a year
> ago Mellanox was pushing this as an mdev type interface. 

mdev was NAK'd too.

mdev is only for creating /dev/vfio/*.

> That is all well and good. However if we agree that SR-IOV wasn't done
> right saying that you are spinning up something that works just like
> SR-IOV isn't all that appealing, is it?

Fitting into some universal least-common-denominator was never a goal
for SR-IOV, so I wouldn't agree it was done wrong. 

> I am talking about my perspective. From what I have seen, one-off
> features that are only available from specific vendors are a pain to
> deal with and difficult to enable when you have to support multiple
> vendors within your ecosystem. What you end up going for is usually
> the lowest common denominator because you ideally want to be able to
> configure all your devices the same and have one recipe for setup.

So encourage other vendors to support the switchdev model for managing
VFs and ADIs!

> I'm not saying you cannot enable those features. However at the same
> time I am saying it would be nice to have a vendor neutral way of
> dealing with those if we are going to support SF, ideally with some
> sort of software fallback that may not perform as well but will at
> least get us the same functionality.

Is it really true there is no way to create a software device on a
switchdev today? I looked for a while and couldn't find
anything. openvswitch can do this, so it does seem like a gap, but
this has nothing to do with this series.

A software switchdev path should still end up with the representor and
user facing netdev, and the behavior of the two netdevs should be
identical to the VF switchdev flow we already have today.
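
A rough approximation of that software path today (interface and bridge
names made up) is a veth pair whose peer is attached to an openvswitch
bridge playing the eswitch role:

$ ip link add sf0 type veth peer name sf0_rep
$ ovs-vsctl add-port br0 sf0_rep
$ ip link set sf0 netns container1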

SF doesn't change any of this, it just shines a light that, yes,
people actually have been using VFs with netdevs in containers and
switchdev, as part of their operations.

FWIW, I view this as a positive because it shows the switchdev model
is working very well and seeing adoption beyond the original idea of
controlling VMs with SRIOV.

> I'm trying to remember which netdev conference it was. I referred to
> this as a veth switchdev offload when something like this was first
> brought up. 

Sure, though I think the way you'd create such a thing might be
different. These APIs are really about creating an ADI that might be
assigned to a VM and never have a netdev.

It would be nonsense to create a veth-switchdev thing without a
netdev, and there have been various past attempts already NAK'd to
transform a netdev into an ADI.

Anyhow, if such a thing exists someday it could make sense to
automatically substitute the HW version using a SF, if available.

> could address those needs would be a good way to go for this as it
> would force everyone to come together and define a standardized
> feature set that all of the vendors would want to expose.

I would say switchdev is already the standard feature set.
 
Jason

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-15 18:47     ` Alexander Duyck
  2020-12-15 20:05       ` Saeed Mahameed
  2020-12-15 21:03       ` Jason Gunthorpe
@ 2020-12-16  1:12       ` Edwin Peer
  2020-12-16  2:39         ` Jason Gunthorpe
  2020-12-16  3:12         ` Alexander Duyck
  2 siblings, 2 replies; 65+ messages in thread
From: Edwin Peer @ 2020-12-16  1:12 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Parav Pandit, Saeed Mahameed, David S. Miller, Jakub Kicinski,
	Jason Gunthorpe, Leon Romanovsky, Netdev, linux-rdma,
	David Ahern, Jacob Keller, Sridhar Samudrala, Ertman, David M,
	Dan Williams, Kiran Patil, Greg KH

On Tue, Dec 15, 2020 at 10:49 AM Alexander Duyck
<alexander.duyck@gmail.com> wrote:

> It isn't "SR-IOV done right" it seems more like "VMDq done better".

I don't think I agree with that assertion. The fact that VMDq can talk
to a common driver still makes VMDq preferable in some respects. Thus,
subfunctions do appear to be more of a better SR-IOV than a better
VMDq, but I'm similarly not sold on whether a better SR-IOV is
sufficient benefit to warrant the additional complexity this
introduces. If I understand correctly, subfunctions buy two things:

1) More than 256 SFs are possible: Maybe it's about time PCI-SIG
addresses this limit for VFs? If that were the only problem with VFs,
then fixing it once there would be cleaner. The devlink interface for
configuring a SF is certainly more sexy than legacy SR-IOV, but it
shouldn't be fundamentally impossible to zhuzh up VFs either. One can
also imagine possibilities around remapping multiple PFs (and their
VFs) in a clever way to get around the limited number of PCI resources
exposed to the host.

2) More flexible division of resources: It's not clear that device
> firmware can't perform smarter allocation than N/<num VFs>, but
subfunctions also appear to allow sharing of certain resources by the
PF driver, if desirable. To the extent that resources are shared, how
are workloads isolated from each other?

I'm not sure I like the idea of having to support another resource
allocation model in our driver just to support this, at least not
without a clearer understanding of what is being gained.

Like you, I would also prefer a more common infrastructure for
exposing something based on VirtIO/VMDq as the container/VM facing
netdevs. Is the lowest common denominator that a VMDq based interface
would constrain things to really unsuitable for container use cases?
Is the added complexity and confusion around VF vs SF vs VMDq really
warranted? I also don't see how this tackles container/VF portability,
migration of workloads, kernel network stack bypass, or any of the
other legacy limitations regarding SR-IOV VFs when we have vendor
specific aux bus drivers talking directly to vendor specific backend
hardware resources. In this regard, don't subfunctions, by definition,
have most of the same limitations as SR-IOV VFs?

Regards,
Edwin Peer

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-16  0:19           ` Jason Gunthorpe
@ 2020-12-16  2:19             ` Alexander Duyck
  2020-12-16  3:03               ` Jason Gunthorpe
  0 siblings, 1 reply; 65+ messages in thread
From: Alexander Duyck @ 2020-12-16  2:19 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Saeed Mahameed, David S. Miller, Jakub Kicinski, Leon Romanovsky,
	Netdev, linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	Ertman, David M, Dan Williams, Kiran Patil, Greg KH

On Tue, Dec 15, 2020 at 4:20 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Dec 15, 2020 at 01:41:04PM -0800, Alexander Duyck wrote:
>
> > > not just devlink and switchdev, auxbus was also introduced to
> > > standardize some of the interfaces.
> >
> > The auxbus is just there to make up for the fact that there isn't
> > another bus type for this though. I would imagine otherwise this would
> > be on some sort of platform bus.
>
> Please let's not start this again. This was gone over with Greg for
> literally a year and a half and he explicitly NAK'd platform bus for
> this purpose.
>
> Aux bus exists to connect different kernel subsystems that touch the
> same HW block together. Here we have the mlx5_core subsystem, vdpa,
> rdma, and netdev all being linked together using auxbus.
>
> It is kind of like what MFD does, but again, using MFD for this was
> also NAK'd by Greg.

Sorry, I wasn't saying we needed to use MFD or platform bus. I'm aware
of the discussions; I was just saying there are some parallels, not
that we needed to go through and use them.

> At the very worst we might someday find enough common stuff between
> ADIs that we get an ADI bus, but I'm not
> optimistic. So far it looks like there is no commonality.
>
> Aux bus has at least 4 users already in various stages of submission,
> and many other target areas that should be replaced by it.

Aux bus is fine and I am happy with it. I was just saying that auxbus
isn't something that should really be used to say that this is
significantly different from mdev as they both rely on a bus topology.

> > I would really like to see is a solid standardization of what this is.
> > Otherwise the comparison is going to be made. Especially since a year
> > ago Mellanox was pushing this as an mdev type interface.
>
> mdev was NAK'd too.
>
> mdev is only for creating /dev/vfio/*.

Agreed. However my worry is that as we start looking to make this
support virtualization it will still end up swinging more toward mdev.
I would much prefer to make certain that any mdev is a flavor and
doesn't end up coloring the entire interface.

> > That is all well and good. However if we agree that SR-IOV wasn't done
> > right saying that you are spinning up something that works just like
> > SR-IOV isn't all that appealing, is it?
>
> Fitting into some universal least-common-denominator was never a goal
> for SR-IOV, so I wouldn't agree it was done wrong.

It isn't so much about right or wrong but the use cases. My experience
has been that SR-IOV ends up being used for very niche use cases where
you are direct assigning it into either DPDK or some NFV VM and you
are essentially building the application around the NIC. It is all
well and good, but for general virtualization it never really caught
on.

My thought is that if we are going to do this we need to do this in a
way that improves the usability, otherwise we are looking at more
niche use cases.

> > I am talking about my perspective. From what I have seen, one-off
> > features that are only available from specific vendors are a pain to
> > deal with and difficult to enable when you have to support multiple
> > vendors within your ecosystem. What you end up going for is usually
> > the lowest common denominator because you ideally want to be able to
> > configure all your devices the same and have one recipe for setup.
>
> So encourage other vendors to support the switchdev model for managing
> VFs and ADIs!

Ugh, don't get me started on switchdev. The biggest issue as I see it
with switchdev is that you have to have a true switch in order to
really be able to use it. As such dumbed down hardware like the ixgbe
for instance cannot use it since it defaults to outputting anything
that doesn't have an existing rule to the external port. If we could
tweak the design to allow for more dumbed down hardware it would
probably be much easier to get wider adoption.

Honestly, the switchdev interface isn't what I was talking about. I
was talking about the SF interface, not the switchdev side of it. In
my mind you can place your complexity into the switchdev side of the
interface, but keep the SF interface simple. Then you can back it with
whatever you want, but without having to have a vendor specific
version of the interface being plugged into the guest or container.

One of the reasons why virtio-net is being pushed as a common
interface for vendors is for this reason. It is an interface that can
be emulated by software or hardware and it allows the guest to run on
any arbitrary hardware.

> > I'm not saying you cannot enable those features. However at the same
> > time I am saying it would be nice to have a vendor neutral way of
> > dealing with those if we are going to support SF, ideally with some
> > sort of software fallback that may not perform as well but will at
> > least get us the same functionality.
>
> Is it really true there is no way to create a software device on a
> switchdev today? I looked for a while and couldn't find
> anything. openvswitch can do this, so it does seem like a gap, but
> this has nothing to do with this series.

It has plenty to do with this series. This topic has been under
discussion since something like 2017 when Mellanox first brought it up
at Netdev 2.1. At the time I told them they should implement this as a
veth offload. Then it becomes obvious what the fallback becomes as you
can place packets into one end of a veth and it comes out the other,
just like a switchdev representor and the SF in this case. It would
make much more sense to do it this way rather than setting up yet
another vendor proprietary interface pair.

> A software switchdev path should still end up with the representor and
> user facing netdev, and the behavior of the two netdevs should be
> identical to the VF switchdev flow we already have today.
>
> SF doesn't change any of this, it just shines a light that, yes,
> people actually have been using VFs with netdevs in containers and
> switchdev, as part of their operations.
>
> FWIW, I view this as a positive because it shows the switchdev model
> is working very well and seeing adoption beyond the original idea of
> controlling VMs with SRIOV.

PF/VF isolation is a given. So the existing switchdev behavior is fine
in that regard. I wouldn't expect that to change. The fact is we
actually support something similar in the macvlan approach we put in
the Intel drivers since the macvlan itself provided an option for
isolation or to hairpin the traffic if you wanted to allow the VMDq
instances to be bypassed. That was another thing I view as a huge
feature as broadcast/multicast traffic can really get ugly when the
devices are all separate pipelines and being able to switch that off
and just do software replication and hairpinning can be extremely
useful.

> > I'm trying to remember which netdev conference it was. I referred to
> > this as a veth switchdev offload when something like this was first
> > brought up.
>
> Sure, though I think the way you'd create such a thing might be
> different. These APIs are really about creating an ADI that might be
> assigned to a VM and never have a netdev.
>
> It would be nonsense to create a veth-switchdev thing without a
> netdev, and there have been various past attempts already NAK'd to
> transform a netdev into an ADI.
>
> Anyhow, if such a thing exists someday it could make sense to
> automatically substitute the HW version using a SF, if available.

The main problem as I see it is the fact that the SF interface is
bound too tightly to the hardware. The direct pass-thru with no
hairpin is always an option but if we are going to have an interface
where both ends are in the same system there are many cases where
always pushing all the packets off to the hardware doesn't necessarily
make sense.

> > could address those needs would be a good way to go for this as it
> > would force everyone to come together and define a standardized
> > feature set that all of the vendors would want to expose.
>
> I would say switchdev is already the standard feature set.

Yes, it is a standard feature set for the control plane. However for
the data-path it is somewhat limited as I feel it only describes what
goes through the switch. Not the interfaces that are exposed as the
endpoints. It is the problem of that last bit and how it is handled
that can make things ugly. For example the multicast/broadcast
replication problem that just occurred to me while writing this up.
The fact is for east-west traffic there has always been a problem with
the switchdev model as it limits everything to PCIe/DMA so there are
cases where software switches can outperform the hardware ones.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-16  1:12       ` Edwin Peer
@ 2020-12-16  2:39         ` Jason Gunthorpe
  2020-12-16  3:12         ` Alexander Duyck
  1 sibling, 0 replies; 65+ messages in thread
From: Jason Gunthorpe @ 2020-12-16  2:39 UTC (permalink / raw)
  To: Edwin Peer
  Cc: Alexander Duyck, Parav Pandit, Saeed Mahameed, David S. Miller,
	Jakub Kicinski, Leon Romanovsky, Netdev, linux-rdma, David Ahern,
	Jacob Keller, Sridhar Samudrala, Ertman, David M, Dan Williams,
	Kiran Patil, Greg KH

On Tue, Dec 15, 2020 at 05:12:33PM -0800, Edwin Peer wrote:

> 1) More than 256 SFs are possible: Maybe it's about time PCI-SIG
> addresses this limit for VFs? 

They can't, the Bus/Device/Function is limited by protocol and
changing that would upend the entire PCI world.

Instead PCI-SIG said PASID is the way forward.

> If that were the only problem with VFs, then fixing it once there
> would be cleaner. 

Maybe, but half the problem with VFs is how HW expensive they are. The
mlx5 SF version is not such a good example, but Intel has shown in
other recent patches, like for their idxd, that the HW side of an ADI
can be very simple and hypervisor emulation can build a simple HW
capability into a full ADI for assignment to a guest.

A lot of the trappings that PCI-SIG requires to be implemented in HW
for a VF, like PCI config space, MSI tables, BAR space, etc. is all
just dead weight when scaling up to 1000's of VFs.

The ADI scheme is not that bad, the very simplest HW is just a queue
that can have all DMA contained by a PASID and can trigger an
addr/data interrupt message. Much less HW costly than a SRIOV VF.

Regardless, Intel kicked this path off years ago when they published
their SIOV cookbook and everyone started integrating PASID support
into their IOMMUs and working on ADIs. The mlx5 SFs are kind of early
because the HW is flexible enough to avoid the parts of SIOV that are
not ready or widely deployed yet, like IMS and PASID.

> Like you, I would also prefer a more common infrastructure for
> exposing something based on VirtIO/VMDq as the container/VM facing
> netdevs. 

A major point is to get switchdev.

> I also don't see how this tackles container/VF portability,
> migration of workloads, kernel network stack bypass, or any of the
> other legacy limitations regarding SR-IOV VFs

It isn't meant to. SF/ADI are just a way to have more VFs than PCI-SIG
can support.

Jason

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-16  2:19             ` Alexander Duyck
@ 2020-12-16  3:03               ` Jason Gunthorpe
  2020-12-16  4:13                 ` Alexander Duyck
  0 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2020-12-16  3:03 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Saeed Mahameed, David S. Miller, Jakub Kicinski, Leon Romanovsky,
	Netdev, linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	Ertman, David M, Dan Williams, Kiran Patil, Greg KH

On Tue, Dec 15, 2020 at 06:19:18PM -0800, Alexander Duyck wrote:

> > > I would really like to see is a solid standardization of what this is.
> > > Otherwise the comparison is going to be made. Especially since a year
> > > ago Mellanox was pushing this as an mdev type interface.
> >
> > mdev was NAK'd too.
> >
> > mdev is only for creating /dev/vfio/*.
> 
> Agreed. However my worry is that as we start looking to make this
> support virtualization it will still end up swinging more toward
> mdev.

Of course. mdev is also the only way to create a /dev/vfio/* :)

So all paths that want to use vfio must end up creating a mdev.

Here we would choose to create the mdev on top of the SF aux device.
There isn't really anything mlx5 specific about that decision. 

The SF models the vendor specific ADI in the driver model.
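
For reference, mdev creation itself stays the documented sysfs flow; a
hypothetical SF-backed parent would look something like this (the
parent path and type name are made up):

$ UUID=$(uuidgen)
$ echo $UUID > /sys/bus/auxiliary/devices/mlx5_core.sf.1/mdev_supported_types/mlx5-vfio-net/create
$ ls /dev/vfio/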

> It isn't so much about right or wrong but the use cases. My experience
> has been that SR-IOV ends up being used for very niche use cases where
> you are direct assigning it into either DPDK or some NFV VM and you
> are essentially building the application around the NIC. It is all
> well and good, but for general virtualization it never really caught
> on.

Sure
 
> > So encourage other vendors to support the switchdev model for managing
> > VFs and ADIs!
> 
> Ugh, don't get me started on switchdev. The biggest issue as I see it
> with switchdev is that you have to have a true switch in order to
> really be able to use it. 

That cuts both ways; suggesting that HW with a true switch model
itself as VMDq is equally problematic.

> As such dumbed down hardware like the ixgbe for instance cannot use
> it since it defaults to outputting anything that doesn't have an
> existing rule to the external port. If we could tweak the design to
> allow for more dumbed down hardware it would probably be much easier
> to get wider adoption.

I'd agree with this

> interface, but keep the SF interface simple. Then you can back it with
> whatever you want, but without having to have a vendor specific
> version of the interface being plugged into the guest or container.

The entire point *is* to create the vendor version because that serves
the niche cases where SRIOV assignment is already being used.

Having a general solution that can't do vendor SRIOV is useful for
other applications, but it doesn't eliminate the need for the SRIOV case.

> One of the reasons why virtio-net is being pushed as a common
> interface for vendors is for this reason. It is an interface that can
> be emulated by software or hardware and it allows the guest to run on
> any arbitrary hardware.

Yes, and there is mlx5_vdpa to support this use case, and it binds to
the SF. Of course all of that is vendor specific too; the driver to
convert HW specific register programming into a virtio-net ADI has to
live *somewhere*.

> It has plenty to do with this series. This topic has been under
> discussion since something like 2017 when Mellanox first brought it up
> at Netdev 2.1. At the time I told them they should implement this as a
> veth offload. 

veth doesn't give an ADI, it is useless for these niche cases.

veth offload might be interesting for some container cases, but it feels
like writing an enormous amount of code to accomplish nothing new...

> Then it becomes obvious what the fallback becomes as you can place
> packets into one end of a veth and it comes out the other, just like
> a switchdev representor and the SF in this case. It would make much
> more sense to do it this way rather than setting up yet another
> vendor proprietary interface pair.

I agree it makes sense to have an all SW veth-like option, but I
wouldn't try to make that the entry point for all the HW
acceleration or to serve the niche SRIOV use cases, or to represent an
ADI.

It just can't do that and it would make a huge mess if you tried to
force it. Didn't Intel already try this once with trying to use the
macvlan netdev and its queue offload to build an ADI?

> > Anyhow, if such a thing exists someday it could make sense to
> > automatically substitute the HW version using a SF, if available.
> 
> The main problem as I see it is the fact that the SF interface is
> bound too tightly to the hardware. 

That is the goal here. This is not about creating just a netdev; this is
about the whole kit: rdma, netdev, vdpa virtio-net, virtio-mdev.

The SF has to support all of that completely. Focusing only on the
one use case of netdevs in containers misses the bigger picture. 

Yes, lots of this stuff is niche, but niche stuff needs to be
supported too.

> Yes, it is a standard feature set for the control plane. However for
> the data-path it is somewhat limited as I feel it only describes what
> goes through the switch.

Sure, I think that is its main point.

> Not the interfaces that are exposed as the endpoints. 

It came from modeling physical HW, so the endpoints are 'physical'
things like actual HW switch ports, SRIOV VFs, ADIs, etc.

> It is the problem of that last bit and how it is handled that can
> make things ugly. For example the multicast/broadcast replication
> problem that just occurred to me while writing this up.  The fact is
> for east-west traffic there has always been a problem with the
> switchdev model as it limits everything to PCIe/DMA so there are
> cases where software switches can outperform the hardware ones.

Yes, but, mixing CPU and DMA in the same packet delivery scheme is
very complicated :)

Jason

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-16  1:12       ` Edwin Peer
  2020-12-16  2:39         ` Jason Gunthorpe
@ 2020-12-16  3:12         ` Alexander Duyck
  1 sibling, 0 replies; 65+ messages in thread
From: Alexander Duyck @ 2020-12-16  3:12 UTC (permalink / raw)
  To: Edwin Peer
  Cc: Parav Pandit, Saeed Mahameed, David S. Miller, Jakub Kicinski,
	Jason Gunthorpe, Leon Romanovsky, Netdev, linux-rdma,
	David Ahern, Jacob Keller, Sridhar Samudrala, Ertman, David M,
	Dan Williams, Kiran Patil, Greg KH

On Tue, Dec 15, 2020 at 5:13 PM Edwin Peer <edwin.peer@broadcom.com> wrote:
>
> On Tue, Dec 15, 2020 at 10:49 AM Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>
> > It isn't "SR-IOV done right" it seems more like "VMDq done better".
>
> I don't think I agree with that assertion. The fact that VMDq can talk
> to a common driver still makes VMDq preferable in some respects. Thus,
> subfunctions do appear to be more of a better SR-IOV than a better
> VMDq, but I'm similarly not sold on whether a better SR-IOV is
> sufficient benefit to warrant the additional complexity this
> introduces. If I understand correctly, subfunctions buy two things:
>
> 1) More than 256 SFs are possible: Maybe it's about time PCI-SIG
> addresses this limit for VFs? If that were the only problem with VFs,
> then fixing it once there would be cleaner. The devlink interface for
> configuring a SF is certainly more sexy than legacy SR-IOV, but it
> shouldn't be fundamentally impossible to zhuzh up VFs either. One can
> also imagine possibilities around remapping multiple PFs (and their
> VFs) in a clever way to get around the limited number of PCI resources
> exposed to the host.

The fact is SR-IOV just wasn't designed to scale well. I think we are
probably going to see most vendors move away from it.

The fact is what we are talking about now is the future of all this
and how to implement Scalable I/O Virtualization
(https://software.intel.com/content/www/us/en/develop/download/intel-scalable-io-virtualization-technical-specification.html).
The document is a good primer to many of the features we are
discussing as it describes how to compose a device.

The problem, as it was with SR-IOV, is that the S-IOV specification is
very PCIe centric and doesn't do a good job explaining how to deal
with the network as it relates to all this. Then to complicate things,
S-IOV expected this to be used with direct assigned devices for
guests/applications, and instead we are talking about using the
devices in the host, which makes things a bit messier.

> 2) More flexible division of resources: It's not clear that device
> firmware can't perform smarter allocation than N/<num VFs>, but
> subfunctions also appear to allow sharing of certain resources by the
> PF driver, if desirable. To the extent that resources are shared, how
> are workloads isolated from each other?
>
> I'm not sure I like the idea of having to support another resource
> allocation model in our driver just to support this, at least not
> without a clearer understanding of what is being gained.

I view this as the future alternative to SR-IOV. It is just a matter
of how we define it. Eventually we would probably drop the SR-IOV
implementation and move over to S-IOV instead. As such, if this is
done right, I don't see this as a thing where we need to support both.
Really we should be able to drop support for one if we have the other.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-16  3:03               ` Jason Gunthorpe
@ 2020-12-16  4:13                 ` Alexander Duyck
  2020-12-16  4:45                   ` Parav Pandit
  2020-12-16 13:33                   ` Jason Gunthorpe
  0 siblings, 2 replies; 65+ messages in thread
From: Alexander Duyck @ 2020-12-16  4:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Saeed Mahameed, David S. Miller, Jakub Kicinski, Leon Romanovsky,
	Netdev, linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	Ertman, David M, Dan Williams, Kiran Patil, Greg KH

On Tue, Dec 15, 2020 at 7:04 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Dec 15, 2020 at 06:19:18PM -0800, Alexander Duyck wrote:
>
> > > > I would really like to see is a solid standardization of what this is.
> > > > Otherwise the comparison is going to be made. Especially since a year
> > > > ago Mellanox was pushing this as an mdev type interface.
> > >
> > > mdev was NAK'd too.
> > >
> > > mdev is only for creating /dev/vfio/*.
> >
> > Agreed. However my worry is that as we start looking to make this
> > support virtualization it will still end up swinging more toward
> > mdev.
>
> Of course. mdev is also the only way to create a /dev/vfio/* :)
>
> So all paths that want to use vfio must end up creating a mdev.
>
> Here we would choose to create the mdev on top of the SF aux device.
> There isn't really anything mlx5 specific about that decision.
>
> The SF models the vendor specific ADI in the driver model.
>
> It isn't so much about right or wrong but the use cases. My experience
> > has been that SR-IOV ends up being used for very niche use cases where
> > you are direct assigning it into either DPDK or some NFV VM and you
> > are essentially building the application around the NIC. It is all
> > well and good, but for general virtualization it never really caught
> > on.
>
> Sure
>
> > > So encourage other vendors to support the switchdev model for managing
> > > VFs and ADIs!
> >
> > Ugh, don't get me started on switchdev. The biggest issue as I see it
> > with switchdev is that you have to have a true switch in order to
> > really be able to use it.
>
> That cuts both ways, suggesting HW with a true switch model itself
> with VMDq is equally problematic.

Yes and no. For example the macvlan offload I had set up could be
configured both ways and it made use of VMDq. I'm not necessarily
arguing that we need to do VMDq here, however at the same time saying
that this is only meant to replace SR-IOV becomes problematic since we
already have SR-IOV so why replace it with something that has many of
the same limitations?

> > As such dumbed down hardware like the ixgbe for instance cannot use
> > it since it defaults to outputting anything that doesn't have an
> > existing rule to the external port. If we could tweak the design to
> > allow for more dumbed down hardware it would probably be much easier
> > to get wider adoption.
>
> I'd agree with this
>
> > interface, but keep the SF interface simple. Then you can back it with
> > whatever you want, but without having to have a vendor specific
> > version of the interface being plugged into the guest or container.
>
> The entire point *is* to create the vendor version because that serves
> the niche cases where SRIOV assignment is already being used.
>
> Having a general solution that can't do vendor SRIOV is useful for
> other applications, but doesn't eliminate the need for the SRIOV case.

So part of the problem here is we already have SR-IOV. So we don't
need to repeat the mistakes. Rather, we need to have a solution to the
existing problems and then we can look at eliminating it.

That said I understand your argument, however I view the elimination
of SR-IOV to be something we do after we get this interface right and
can justify doing so. I don't have a problem necessarily with vendor
specific instances, unless we are only able to get vendor specific
instances. Thus I would prefer that we have a solution in place before
we allow the switch over.

> > One of the reasons why virtio-net is being pushed as a common
> > interface for vendors is for this reason. It is an interface that can
> > be emulated by software or hardware and it allows the guest to run on
> > any arbitrary hardware.
>
> Yes, and there is mlx5_vdpa to support this usecase, and it binds to
> the SF. Of course all of that is vendor specific too, the driver to
> convert HW specific register programming into a virtio-net ADI has to
> live *somewhere*

Right, but this is more the model I am in favor of. The backend is
hidden from the guest and lives somewhere on the host.

Also it might be useful to call out the flavours and planned flavours
in the cover page. Admittedly the description is somewhat lacking in
that regard.

> > It has plenty to do with this series. This topic has been under
> > discussion since something like 2017 when Mellanox first brought it up
> > at Netdev 2.1. At the time I told them they should implement this as a
> > veth offload.
>
> veth doesn't give an ADI, it is useless for these niche cases.
>
> veth offload might be interesting for some container case, but feels
> like writing an enormous amount of code to accomplish nothing new...

My concern is if we are going to start partitioning up a PF on the
host we might as well make the best use of it. I would argue that it
would make more sense to have some standardized mechanism in place for
the PF to communicate and interact with the SFs. I would argue that is
one of the reasons why this keeps being compared to either VMDq or VMQ
as it is something that SR-IOV has yet to fully replace and has many
features that would be useful in an interface that is a subpartition
of an existing interface.

> > Then it becomes obvious what the fallback becomes as you can place
> > packets into one end of a veth and it comes out the other, just like
> > a switchdev representor and the SF in this case. It would make much
> > more sense to do it this way rather than setting up yet another
> > vendor proprietary interface pair.
>
> I agree it makes sense to have an all SW veth-like option, but I
> wouldn't try to make that as the entry point for all the HW
> acceleration or to serve the niche SRIOV use cases, or to represent an
> ADI.
>
> It just can't do that and it would make a huge mess if you tried to
> force it. Didn't Intel already try this once with trying to use the
> macvlan netdev and its queue offload to build an ADI?

The Intel drivers still have the macvlan as the assignable ADI and
make use of VMDq to enable it. Actually I would consider it an example
of the kind of thing I am talking about. It is capable of doing
software switching between interfaces, broadcast/multicast replication
in software, and makes use of the hardware interfaces to allow for
receiving directly from the driver into the macvlan interface.

The limitation as I see it is that the macvlan interface doesn't allow
for much in the way of custom offloads and the Intel hardware doesn't
support switchdev. As such it is good for a basic interface, but
doesn't really do well in terms of supporting advanced vendor-specific
features.

> > > Anyhow, if such a thing exists someday it could make sense to
> > > automatically substitute the HW version using a SF, if available.
> >
> > The main problem as I see it is the fact that the SF interface is
> > bound too tightly to the hardware.
>
> That is the goal here. This is not about creating just a netdev, this is
> about the whole kit: rdma, netdev, vdpa virtio-net, virtio-mdev.

One issue is right now we are only seeing the rdma and netdev. It is
kind of backwards as it is using the ADIs on the host when this was
really meant to be used for things like mdev.

> The SF has to support all of that completely. Focusing only on the
> one use case of netdevs in containers misses the bigger picture.
>
> Yes, lots of this stuff is niche, but niche stuff needs to be
> supported too.

I have no problem with niche stuff, however we need to address the
basics before we move on to the niche stuff.

> > Yes, it is a standard feature set for the control plane. However for
> > the data-path it is somewhat limited as I feel it only describes what
> > goes through the switch.
>
> Sure, I think that is its main point.
>
> > Not the interfaces that are exposed as the endpoints.
>
> It came from modeling physical HW so the endports are 'physical'
> things like actual HW switch ports, or SRIOV VFs, ADI, etc.

The problem is the "physical things" such as the SRIOV VFs and ADI
aren't really defined in the specification and are left up to the
implementer's interpretation. These specs have always been fuzzy since
they are essentially PCI specifications and don't explain anything
about how the network on such a device should be configured or
expected to work. The switchdev API puts some restrictions in place
but there still ends up being parts without any definition.

> > It is the problem of that last bit and how it is handled that can
> > make things ugly. For example the multicast/broadcast replication
> > problem that just occurred to me while writing this up.  The fact is
> > for east-west traffic there has always been a problem with the
> > switchdev model as it limits everything to PCIe/DMA so there are
> > cases where software switches can outperform the hardware ones.
>
> Yes, but mixing CPU and DMA in the same packet delivery scheme is
> very complicated :)

I'm not necessarily saying we need to mix the two. However there are
cases such as multicast/broadcast where it would make much more sense
to avoid the duplication of packets and instead simply send one copy
and have it replicated by the software.

What would probably make sense for now would be to look at splitting
the netdev into two pieces: the frontend, which would provide the
netdev and be a common driver for subfunction netdevs in this case,
and a backend, which would be a common point for all the subfunctions
that are being used directly on the host. This is essentially what we
have with the macvlan model. The idea is that if we wanted to do
software switching or duplication of traffic we could, but if not then
we wouldn't.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-16  4:13                 ` Alexander Duyck
@ 2020-12-16  4:45                   ` Parav Pandit
  2020-12-16 13:33                   ` Jason Gunthorpe
  1 sibling, 0 replies; 65+ messages in thread
From: Parav Pandit @ 2020-12-16  4:45 UTC (permalink / raw)
  To: Alexander Duyck, Jason Gunthorpe
  Cc: Saeed Mahameed, David S. Miller, Jakub Kicinski, Leon Romanovsky,
	Netdev, linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	Ertman, David M, Dan Williams, Kiran Patil, Greg KH



> From: Alexander Duyck <alexander.duyck@gmail.com>
> Sent: Wednesday, December 16, 2020 9:43 AM
> 
> >
> > That is the goal here. This is not about creating just a netdev, this is
> > about the whole kit: rdma, netdev, vdpa virtio-net, virtio-mdev.
> 
> One issue is right now we are only seeing the rdma and netdev. It is kind of
> backwards as it is using the ADIs on the host when this was really meant to
> be used for things like mdev.
>
mdev is just yet another _use_ of a subfunction.
There are users of subfunctions who want to use them without mapping the subfunction as an mdev to a guest VM,
i.e. as netdev, rdma, vdpa.
In the future, direct assignment can be done via an mdev on top of a subfunction, just like the rest of the above devices.

Creating a subfunction for a non-vfio purpose via a vfio mdev is just not right.
If I understand you correctly, I hope you are not suggesting that.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-15 21:28         ` Jakub Kicinski
@ 2020-12-16  6:50           ` Leon Romanovsky
  2020-12-16 17:59             ` Saeed Mahameed
  0 siblings, 1 reply; 65+ messages in thread
From: Leon Romanovsky @ 2020-12-16  6:50 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Saeed Mahameed, Alexander Duyck, David S. Miller,
	Jason Gunthorpe, Netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, Ertman, David M, Dan Williams, Kiran Patil,
	Greg KH

On Tue, Dec 15, 2020 at 01:28:05PM -0800, Jakub Kicinski wrote:
> On Tue, 15 Dec 2020 12:35:20 -0800 Saeed Mahameed wrote:
> > > I think the big thing we really should do if we are going to go this
> > > route is to look at standardizing what the flavours are that get
> > > created by the parent netdevice. Otherwise we are just creating the
> > > same mess we had with SRIOV all over again and muddying the waters of
> > > mediated devices.
> >
> > yes in the near future we will be working on auxbus interfaces for
> > auto-probing and user flavor selection, this is a must have feature for
> > us.
>
> Can you elaborate? I thought config would be via devlink.

Yes, everything continues to be done through devlink.

One of the immediate features is an ability to disable/enable creation
of specific SF types.

For example, if the user doesn't want RDMA, the SF RDMA won't be created.
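
For context, the SF lifecycle itself in this series is driven entirely
by devlink; roughly (a sketch based on the series' documentation, with
illustrative PCI address, port index and sfnum):

  $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
  $ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88
  $ devlink port function set pci/0000:06:00.0/32768 state active

A per-type enable/disable knob would slot into that same interface.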

Thanks

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-16  4:13                 ` Alexander Duyck
  2020-12-16  4:45                   ` Parav Pandit
@ 2020-12-16 13:33                   ` Jason Gunthorpe
  2020-12-16 16:31                     ` Alexander Duyck
  1 sibling, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2020-12-16 13:33 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Saeed Mahameed, David S. Miller, Jakub Kicinski, Leon Romanovsky,
	Netdev, linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	Ertman, David M, Dan Williams, Kiran Patil, Greg KH

On Tue, Dec 15, 2020 at 08:13:21PM -0800, Alexander Duyck wrote:

> > > Ugh, don't get me started on switchdev. The biggest issue as I see it
> > > with switchdev is that you have to have a true switch in order to
> > > really be able to use it.
> >
> > That cuts both ways, suggesting HW with a true switch model itself
> > with VMDq is equally problematic.
> 
> Yes and no. For example the macvlan offload I had set up could be
> configured both ways and it made use of VMDq. I'm not necessarily
> arguing that we need to do VMDq here, however at the same time saying
> that this is only meant to replace SR-IOV becomes problematic since we
> already have SR-IOV so why replace it with something that has many of
> the same limitations?

Why? Because SR-IOV is the *only* option for many use cases. Still. I
said this already, something more generic does not magically eliminate
SR-IOV.

The SIOV ADI model is a small refinement to the existing VF scheme, it
is completely parallel to making more generic things.

It is not "repeating mistakes" it is accepting the limitations of
SR-IOV because benefits exist and applications need those benefits.
 
> That said I understand your argument, however I view the elimination
> of SR-IOV to be something we do after we get this interface right and
> can justify doing so. 

Elimination of SR-IOV isn't even a goal here!

> Also it might be useful to call out the flavours and planned flavours
> in the cover page. Admittedly the description is somewhat lacking in
> that regard.

This is more of a general switchdev remark though. In the switchdev
model you have the switch and a switch port. Each port has a
switchdev representor on the switch side and a "user port" of some
kind.

It can be a physical thing:
 - SFP
 - QSFP
 - WiFi Antennae

It could be a semi-physical thing outside the view of the kernel:
 - SmartNIC VF/SF attached to another CPU

It can be a semi-physical thing in view of this kernel:
 - SRIOV VF (struct pci device)
 - SF (struct aux device)

It could be a SW construct in this kernel:
 - netdev (struct net device)

*all* of these different port types are needed. Probably more down the
road!

Notice I don't have VDPA, VF/SF netdev, or virtio-mdev as a "user
port" type here. Instead creating the user port pci or aux device
allows the user to use the Linux driver model to control what happens
to the pci/aux device next.
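
As a sketch of how the different flavours surface (names and indexes
here are illustrative, the flavour column is the point):

  $ devlink port show
  pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical
  pci/0000:06:00.0/1: type eth netdev ens2f0pf0vf0 flavour pcivf pfnum 0 vfnum 0
  pci/0000:06:00.0/32768: type eth netdev ens2f0pf0sf88 flavour pcisf pfnum 0 sfnum 88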

> I would argue that is one of the reasons why this keeps being
> compared to either VMDq or VMQ as it is something that SR-IOV has
> yet to fully replace and has many features that would be useful in
> an interface that is a subpartition of an existing interface.

In what sense do switchdev and a VF not fully replace macvlan VMDq?

> The Intel drivers still have the macvlan as the assignable ADI and
> make use of VMDq to enable it.

Is this in-tree or only in the proprietary driver? AFAIK there is no
in-tree way to extract the DMA queue from the macvlan netdev into
userspace..

Remember all this VF/SF/VDPA stuff results in a HW dataplane, not a SW
one. It doesn't really make sense to compare a SW dataplane to a HW
one. HW dataplanes come with limitations and require special driver
code.

> The limitation as I see it is that the macvlan interface doesn't allow
> for much in the way of custom offloads and the Intel hardware doesn't
> support switchdev. As such it is good for a basic interface, but
> doesn't really do well in terms of supporting advanced vendor-specific
> features.

I don't know what it is that prevents Intel from modeling their
selector HW in switchdev, but I think it is on them to work with the
switchdev folks to figure something out.

I'm a bit surprised HW that can do macvlan can't be modeled with
switchdev? What is missing?

> > > That is the goal here. This is not about creating just a netdev, this is
> > about the whole kit: rdma, netdev, vdpa virtio-net, virtio-mdev.
> 
> One issue is right now we are only seeing the rdma and netdev. It is
> kind of backwards as it is using the ADIs on the host when this was
> really meant to be used for things like mdev.

This is the second 15-patch series on this path already. It is not
possible to pack every single thing into this series. This is the
micro step of introducing the SF idea and using SF==VF to show how the
driver stack works. The minimal changes to the existing drivers
imply this can support an ADI as well.

Further, this does already show an ADI! vdpa_mlx5 will run on the
VF/SF and eventually cause qemu to build a virtio-net ADI that
directly passes HW DMA rings into the guest.

Isn't this exactly the kind of generic SRIOV replacement option you
have been asking for? Doesn't this completely supersede stuff built on
macvlan?
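
(As a concrete sketch of that flow, assuming the vdpa management
tooling that followed this series, with an illustrative SF device
name:

  $ vdpa mgmtdev show
  auxiliary/mlx5_core.sf.4
  $ vdpa dev add name vdpa0 mgmtdev auxiliary/mlx5_core.sf.4

after which the resulting vhost-vdpa device is what qemu consumes.)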

> expected to work. The swtichdev API puts some restrictions in place
> but there still ends up being parts without any definition.

I'm curious what you see as needing definition here? 

In the SRIOV model, the HW register programming API is device
specific.

The switchdev model is: no matter what HW register programming is done
on the VF/SF, all the packets tx/rx'd will flow through the switchdev.

The purpose of switchdev/SRIOV/SIOV has never been to define a single
"one register set to rule them all".

That is the area that VDPA virtio-net and others are covering.

Jason

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-16 13:33                   ` Jason Gunthorpe
@ 2020-12-16 16:31                     ` Alexander Duyck
  2020-12-16 17:51                       ` Jason Gunthorpe
  0 siblings, 1 reply; 65+ messages in thread
From: Alexander Duyck @ 2020-12-16 16:31 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Saeed Mahameed, David S. Miller, Jakub Kicinski, Leon Romanovsky,
	Netdev, linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	Ertman, David M, Dan Williams, Kiran Patil, Greg KH

On Wed, Dec 16, 2020 at 5:33 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Dec 15, 2020 at 08:13:21PM -0800, Alexander Duyck wrote:
>
> > > > Ugh, don't get me started on switchdev. The biggest issue as I see it
> > > > with switchdev is that you have to have a true switch in order to
> > > > really be able to use it.
> > >
> > > That cuts both ways, suggesting HW with a true switch model itself
> > > with VMDq is equally problematic.
> >
> > Yes and no. For example the macvlan offload I had set up could be
> > configured both ways and it made use of VMDq. I'm not necessarily
> > arguing that we need to do VMDq here, however at the same time saying
> > that this is only meant to replace SR-IOV becomes problematic since we
> > already have SR-IOV so why replace it with something that has many of
> > the same limitations?
>
> Why? Because SR-IOV is the *only* option for many use cases. Still. I
> said this already, something more generic does not magically eliminate
> SR-IOV.
>
> The SIOV ADI model is a small refinement to the existing VF scheme, it
> is completely parallel to making more generic things.
>
> It is not "repeating mistakes" it is accepting the limitations of
> SR-IOV because benefits exist and applications need those benefits.

If we have two interfaces, both with pretty much the same limitations,
then many would view it as "repeating mistakes". The fact is we
already have SR-IOV. Why introduce yet another interface that has the
same functionality?

You say this will scale better but I am not even sure about that. The
fact is SR-IOV could scale to 256 VFs, but for networking I kind of
doubt the limitation would have been the bus number and would more
likely be issues with packet replication and PCIe throughput,
especially when you start dealing with east-west traffic within the
same system.

> > That said I understand your argument, however I view the elimination
> > of SR-IOV to be something we do after we get this interface right and
> > can justify doing so.
>
> Elimination of SR-IOV isn't even a goal here!

Sorry you used the word "replace", and my assumption here was that the
goal is to get something in place that can take the place of SR-IOV so
that you wouldn't be maintaining the two systems at the same time.
That is my concern as I don't want us having SR-IOV, and then several
flavors of SIOV. We need to decide on one thing that will be the way
forward.

> > Also it might be useful to call out the flavours and planned flavours
> > in the cover page. Admittedly the description is somewhat lacking in
> > that regard.
>
> This is more of a general switchdev remark though. In the switchdev
> model you have the switch and a switch port. Each port has a
> switchdev representor on the switch side and a "user port" of some
> kind.
>
> It can be a physical thing:
>  - SFP
>  - QSFP
>  - WiFi Antennae
>
> It could be a semi-physical thing outside the view of the kernel:
>  - SmartNIC VF/SF attached to another CPU
>
> It can be a semi-physical thing in view of this kernel:
>  - SRIOV VF (struct pci device)
>  - SF (struct aux device)
>
> It could be a SW construct in this kernel:
>  - netdev (struct net device)
>
> *all* of these different port types are needed. Probably more down the
> road!
>
> Notice I don't have VDPA, VF/SF netdev, or virtio-mdev as a "user
> port" type here. Instead creating the user port pci or aux device
> allows the user to use the Linux driver model to control what happens
> to the pci/aux device next.

I get that. That is why I said switchdev isn't a standard for the
endpoint. One of the biggest issues with SR-IOV that I have seen is
the fact that the last piece isn't really defined. We never did a good
job of defining how the ADI should look to the guest and as a result
it kind of stalled in adoption.

> > I would argue that is one of the reasons why this keeps being
> > compared to either VMDq or VMQ as it is something that SR-IOV has
> > yet to fully replace and has many features that would be useful in
> > an interface that is a subpartition of an existing interface.
>
> In what sense do switchdev and a VF not fully replace macvlan VMDq?

One of the biggest is east-west traffic. You quickly run up against
the PCIe bandwidth bottleneck and then the performance tanks. I have
seen a number of cases where peer-to-peer on the same host swamps the
network interface.

> > The Intel drivers still have the macvlan as the assignable ADI and
> > make use of VMDq to enable it.
>
> Is this in-tree or only in the proprietary driver? AFAIK there is no
> in-tree way to extract the DMA queue from the macvlan netdev into
> userspace..
>
> Remember all this VF/SF/VDPA stuff results in a HW dataplane, not a SW
> one. It doesn't really make sense to compare a SW dataplane to a HW
> one. HW dataplanes come with limitations and require special driver
> code.

I get that. At the same time we can mask some of those limitations by
allowing for the backend to be somewhat abstract so you have the
possibility of augmenting the hardware dataplane with a software one
if needed.

> > The limitation as I see it is that the macvlan interface doesn't allow
> > for much in the way of custom offloads and the Intel hardware doesn't
> > support switchdev. As such it is good for a basic interface, but
> > doesn't really do well in terms of supporting advanced vendor-specific
> > features.
>
> I don't know what it is that prevents Intel from modeling their
> selector HW in switchdev, but I think it is on them to work with the
> switchdev folks to figure something out.

They tried for the ixgbe and i40e. The problem is the hardware
couldn't conform to what was asked for if I recall. It has been a few
years since I worked in the Ethernet group at Intel so I don't recall
the exact details.

> I'm a bit surprised HW that can do macvlan can't be modeled with
> switchdev? What is missing?

If I recall it was the fact that the hardware defaults to transmitting
everything that doesn't match an existing rule to the external port
unless it comes from the external port.

> > > That is the goal here. This is not about creating just a netdev, this is
> > > about the whole kit: rdma, netdev, vdpa virtio-net, virtio-mdev.
> >
> > One issue is right now we are only seeing the rdma and netdev. It is
> > kind of backwards as it is using the ADIs on the host when this was
> > really meant to be used for things like mdev.
>
> This is the second 15-patch series on this path already. It is not
> possible to pack every single thing into this series. This is the
> micro step of introducing the SF idea and using SF==VF to show how the
> driver stack works. The minimal changes to the existing drivers
> imply this can support an ADI as well.
>
> Further, this does already show an ADI! vdpa_mlx5 will run on the
> VF/SF and eventually cause qemu to build a virtio-net ADI that
> directly passes HW DMA rings into the guest.
>
> Isn't this exactly the kind of generic SRIOV replacement option you
> have been asking for? Doesn't this completely supersede stuff built on
> macvlan?

Something like the vdpa model is more like what I had in mind. Only
vdpa only works for the userspace networking case.

Basically the idea is to have an assignable device interface that
isn't directly tied to the hardware. Instead it is making use of a
slice of it, referencing the PF as the parent and leaving the PF as
the owner of the slice. Then at some point in the future we could make
changes to allow for software to step in and do some switching if
needed. The key bit is the abstraction of the assignable interface so
that it is vendor agnostic and could be switched over to pure software
backing if needed.

> > expected to work. The switchdev API puts some restrictions in place
> > but there still ends up being parts without any definition.
>
> I'm curious what you see as needing definition here?
>
> In the SRIOV model, the HW register programming API is device
> specific.
>
> The switchdev model is: no matter what HW register programming is done
> on the VF/SF, all the packets tx/rx'd will flow through the switchdev.
>
> The purpose of switchdev/SRIOV/SIOV has never been to define a single
> "one register set to rule them all".
>
> That is the area that VDPA virtio-net and others are covering.

That is fine and that covers it for direct assigned devices. However
that doesn't cover the container case. My thought is if we are going
to partition a PF into multiple netdevices we should have some generic
interface that can be provided to represent the netdevs so that if
they are pushed into containers you don't have to rip them out if for
some reason you need to change the network configuration. For the
Intel NICs we did that with macvlan in the VMDq case. I see no reason
why you couldn't do something like that here with the subfunction
case.
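
(The macvlan pattern referenced above is essentially, with
illustrative names:

  $ ip link add link ens2f0np0 name macvl0 type macvlan
  $ ip link set macvl0 netns blue

where the offloaded macvlan rides a dedicated VMDq queue pair
underneath, while the parent stays in the host.)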

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-16 16:31                     ` Alexander Duyck
@ 2020-12-16 17:51                       ` Jason Gunthorpe
  2020-12-16 19:27                         ` Alexander Duyck
  0 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2020-12-16 17:51 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Saeed Mahameed, David S. Miller, Jakub Kicinski, Leon Romanovsky,
	Netdev, linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	Ertman, David M, Dan Williams, Kiran Patil, Greg KH

On Wed, Dec 16, 2020 at 08:31:44AM -0800, Alexander Duyck wrote:

> You say this will scale better but I am not even sure about that. The
> fact is SR-IOV could scale to 256 VFs, but for networking I kind of
> doubt the limitation would have been the bus number and would more
> likely be issues with packet replication and PCIe throughput,
> especially when you start dealing with east-west traffic within the
> same system.

We have been seeing deployments already hitting the 256 limit. This is
not a "theoretical use" patch set. There are already VM and container
farms with SW networking that support much more than 256 VM/containers
per server.

The optimization here is to reduce the hypervisor workload and free up
CPU cycles for the VMs/containers to consume. This means less handling
of packets in the CPU, especially for VM cases w/ SRIOV or VDPA.

Even the extra DMA on the NIC is not really a big deal. These are 400G
NICs with big fast PCI. If you top them out you are already doing an
aggregate of 400G of network traffic. That is a big number for a
single server, it is OK.

Someone might feel differently if they did this on a 10/40G NIC, in
which case this is not the solution for their application.

> Sorry you used the word "replace", and my assumption here was that the
> goal is to get something in place that can take the place of SR-IOV so
> that you wouldn't be maintaining the two systems at the same time.
> That is my concern as I don't want us having SR-IOV, and then several
> flavors of SIOV. We need to decide on one thing that will be the way
> forward.

SRIOV has to continue until the PASID and IMS platform features are
widely available and mature. It will probably be 10 years before we
see most people able to use SIOV for everything they want.

I think we will see lots of SIOV variants, I know Intel is already
pushing SIOV parts outside netdev.

> I get that. That is why I said switchdev isn't a standard for the
> endpoint. One of the biggest issues with SR-IOV that I have seen is
> the fact that the last piece isn't really defined. We never did a good
> job of defining how the ADI should look to the guest and as a result
> it kind of stalled in adoption.

The ADI is supposed to present the HW programming API that is
desired. It is always up to the implementation.

SIOV was never a project to standardize HW programming models like
virtio-net, NVMe, etc.

> > I'm a bit surprised HW that can do macvlan can't be modeled with
> > switchdev? What is missing?
> 
> If I recall it was the fact that the hardware defaults to transmitting
> everything that doesn't match an existing rule to the external port
> unless it comes from the external port.

That seems small enough it should be resolvable, IMHO. eg some new
switch rule that matches that specific HW behavior?
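
Something like a lowest-priority catch-all on the representor side,
e.g. (a hypothetical tc sketch with made-up device names, not a claim
that the HW can offload it as-is):

  # anything that matches no other rule goes out the wire
  $ tc filter add dev pf0vf0_rep ingress prio 65535 flower \
        action mirred egress redirect dev pf0np0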

> Something like the vdpa model is more like what I had in mind. Only
> vdpa only works for the userspace networking case.

That's because making a driver that converts the native HW to VDPA and
then running a generic netdev on the resulting virtio-net is a pretty
wild thing to do. I can't really think of an actual use case.

> Basically the idea is to have an assignable device interface that
> isn't directly tied to the hardware. 

The switchdev model is to create a switch port. As I explained in
Linux we see "pci device" and "aux device" as being some "user port"
options to access to that switch.

If you want a "generic device" that is fine, but what exactly is that
programming interface in Linux? Sketch out an API, where does the idea
go?  What does the driver that implement it look like? What consumes
it?

Should this be a great idea, then an mlx5 version of this will still
be to create an SF aux device, bind mlx5_core, then bind the "generic
device" on top of that. This is simply a reflection of how the mlx5
HW/SW layering works. Squashing all of this into a single layer is
work with a bad ROI.
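
That layering is already visible today, e.g. (a sketch; the SF id is
illustrative):

  $ ls /sys/bus/auxiliary/devices
  mlx5_core.sf.4
  $ devlink dev show
  pci/0000:06:00.0
  auxiliary/mlx5_core.sf.4

i.e. the SF aux device is itself a devlink instance that a further
"generic device" driver could stack on.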

> they are pushed into containers you don't have to rip them out if for
> some reason you need to change the network configuration. 

Why would you need to rip them out?

Jason

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-16  6:50           ` Leon Romanovsky
@ 2020-12-16 17:59             ` Saeed Mahameed
  0 siblings, 0 replies; 65+ messages in thread
From: Saeed Mahameed @ 2020-12-16 17:59 UTC (permalink / raw)
  To: Leon Romanovsky, Jakub Kicinski
  Cc: Alexander Duyck, David S. Miller, Jason Gunthorpe, Netdev,
	linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala, Ertman,
	David M, Dan Williams, Kiran Patil, Greg KH

On Wed, 2020-12-16 at 08:50 +0200, Leon Romanovsky wrote:
> On Tue, Dec 15, 2020 at 01:28:05PM -0800, Jakub Kicinski wrote:
> > On Tue, 15 Dec 2020 12:35:20 -0800 Saeed Mahameed wrote:
> > > > I think the big thing we really should do if we are going to go
> > > > this
> > > > route is to look at standardizing what the flavours are that
> > > > get
> > > > created by the parent netdevice. Otherwise we are just creating
> > > > the
> > > > same mess we had with SRIOV all over again and muddying the
> > > > waters of
> > > > mediated devices.
> > > 
> > > yes in the near future we will be working on auxbus interfaces
> > > for
> > > auto-probing and user flavor selection, this is a must have
> > > feature for
> > > us.
> > 
> > Can you elaborate? I thought config would be via devlink.
> 
> Yes, everything continues to be done through devlink.
> 
> One of the immediate features is an ability to disable/enable
> creation
> of specific SF types.
> 
> For example, if user doesn't want RDMA, the SF RDMA won't be created.
> 

Devlink is an option too; we still don't have our minds set on a
specific API. We are considering both as valuable solutions, since
devlink makes sense as a go-to interface for everything SF. On the
other hand, auto-probing and device instantiation are done at the
auxbus level, so it also makes sense to have some sort of "device
type" user selection API in the auxbus. Anyway, this discussion is
for a future patch.



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-16 17:51                       ` Jason Gunthorpe
@ 2020-12-16 19:27                         ` Alexander Duyck
  2020-12-16 20:35                           ` Jason Gunthorpe
  0 siblings, 1 reply; 65+ messages in thread
From: Alexander Duyck @ 2020-12-16 19:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Saeed Mahameed, David S. Miller, Jakub Kicinski, Leon Romanovsky,
	Netdev, linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	Ertman, David M, Dan Williams, Kiran Patil, Greg KH

On Wed, Dec 16, 2020 at 9:51 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Wed, Dec 16, 2020 at 08:31:44AM -0800, Alexander Duyck wrote:
>
> > You say this will scale better but I am not even sure about that. The
> > fact is SR-IOV could scale to 256 VFs, but for networking I kind of
> > doubt the limitation would have been the bus number and would more
> > likely be issues with packet replication and PCIe throughput,
> > especially when you start dealing with east-west traffic within the
> > same system.
>
> We have been seeing deployments already hitting the 256 limit. This is
> not a "theoretical use" patch set. There are already VM and container
> farms with SW networking that support much more than 256 VM/containers
> per server.

That has been the case for a long time. However it had been my
experience that SR-IOV never scaled well to meet those needs and so it
hadn't been used in such deployments.

> The optimization here is to reduce the hypervisor workload and free up
> CPU cycles for the VMs/containers to consume. This means less handling
> of packets in the CPU, especially for VM cases w/ SRIOV or VDPA.
>
> Even the extra DMA on the NIC is not really a big deal. These are 400G
> NICs with big fast PCI. If you top them out you are already doing an
> aggregate of 400G of network traffic. That is a big number for a
> single sever, it is OK.

Yes, but at a certain point you start bumping up against memory
throughput limitations as well. Doubling up the memory footprint by
having the device have to write to new pages instead of being able to
do something like pinning and zero-copy would be expensive.

For something like an NFV use case it might make sense, but in the
general high server count case it seems like it is a setup that would
be detrimental.

> Someone might feel differently if they did this on a 10/40G NIC, in
> which case this is not the solution for their application.

My past experience was with 10/40G NICs with tens of VFs. When we start
talking about hundreds I would imagine the overhead becomes orders of
magnitude worse as the problem becomes more of an n^2 issue since you
will have n times more systems sending to n times more systems
receiving. As such, things like broadcast traffic would end up
consuming a fair bit of bandwidth.

> > Sorry you used the word "replace", and my assumption here was that the
> > goal is to get something in place that can take the place of SR-IOV so
> > that you wouldn't be maintaining the two systems at the same time.
> > That is my concern as I don't want us having SR-IOV, and then several
> > flavors of SIOV. We need to decide on one thing that will be the way
> > forward.
>
> SRIOV has to continue until the PASID and IMS platform features are
> widely available and mature. It will probably be 10 years before we
> see most people able to use SIOV for everything they want.
>
> I think we will see lots of SIOV variants, I know Intel is already
> pushing SIOV parts outside netdev.

The key bit here is outside of netdev. Like I said, SIOV and SR-IOV
tend to be PCIe specific specifications. What we are defining here is
how the network interfaces presented by such devices will work.

> > I get that. That is why I said switchdev isn't a standard for the
> > endpoint. One of the biggest issues with SR-IOV that I have seen is
> > the fact that the last piece isn't really defined. We never did a good
> > job of defining how the ADI should look to the guest and as a result
> > it kind of stalled in adoption.
>
> The ADI is supposed to present the HW programming API that is
> desired. It is always up to the implementation.
>
> SIOV was never a project to standardize HW programming models like
> virtio-net, NVMe, etc.

I agree. Just like SR-IOV never spelled out that network devices
should be using switchdev. That is something we decided on as a
community. What I am explaining here is that we should be thinking
about the implications of how the network interface is exposed in the
host in the case of subfunctions that are associated with a given
switchdev device.

> > > I'm a bit surprised HW that can do macvlan can't be modeled with
> > > switchdev? What is missing?
> >
> > If I recall it was the fact that the hardware defaults to transmitting
> > everything that doesn't match an existing rule to the external port
> > unless it comes from the external port.
>
> That seems small enough it should be resolvable, IMHO. eg some new
> switch rule that matches that specific HW behavior?

I would have to go digging to find the conversation. It was about 3 or
4 years ago. I seem to recall mentioning the idea of having some
static rules but it was a no-go at the time. If we wanted to spin off
this conversation and pull in some Intel folks I would be up for us
revisiting it. However I'm not with Intel anymore so it would mostly
be something I would be working on as a hobby project instead of
anything serious.

> > Something like the vdpa model is more like what I had in mind. Only
> > vdpa only works for the userspace networking case.
>
> That's because making a driver that converts the native HW to VDPA and
> then running a generic netdev on the resulting virtio-net is a pretty
> wild thing to do. I can't really think of an actual use case.

I'm not talking about us drastically changing existing models. I would
still expect the mlx5 driver to be running on top of the aux device.
However it may be that the aux device is associated with something
like the switchdev port as a parent and the output from the traffic is
then going to the subfunction netdev.

> > Basically the idea is to have an assignable device interface that
> > isn't directly tied to the hardware.
>
> The switchdev model is to create a switch port. As I explained in
> Linux we see "pci device" and "aux device" as being some "user port"
> options to access to that switch.
>
> If you want a "generic device" that is fine, but what exactly is that
> programming interface in Linux? Sketch out an API, where does the idea
> go?  What does the driver that implement it look like? What consumes
> it?
>
> Should this be a great idea, then a mlx5 version of this will still be
> to create an SF aux device, bind mlx5_core, then bind "generic device"
> on top of that. This is simply a reflection of how the mlx5 HW/SW
> layering works. Squashing all of this into a single layer is work with
> no bad ROI.

In my mind the mlx5_core is still binding to the SF aux device so I am
fine with that. The way I view this working is that it would work sort
of like the macvlan approach that was taken with the Intel parts.
Although instead of binding to the PF it might make more sense to have
it bound to the switchdev port associated with the subfunction and
looking like a veth pair from the host perspective in terms of
behavior. The basic idea is most of the control is still in the
switchdev port, but it becomes more visible that the switchdev port
and the subfunction netdev are linked with the switchdev port as the
parent.

> > they are pushed into containers you don't have to rip them out if for
> > some reason you need to change the network configuration.
>
> Why would you need to rip them out?

Because they are specifically tied to the mlx5 device. So if for
example I need to hotplug out the mlx5 and replace it, it would be
useful to allow the interface in the containers to stay in place and
fail over to some other software backing interface in the switchdev
namespace. One of the issues with VFs is that we have always had to
push some sort of bond on top, or switch over to a virtio-net
interface in order to support fail-over. If we can resolve that in the
host case that would go a long way toward solving one of the main
issues of SR-IOV.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-16 19:27                         ` Alexander Duyck
@ 2020-12-16 20:35                           ` Jason Gunthorpe
  2020-12-16 22:53                             ` Alexander Duyck
  0 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2020-12-16 20:35 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Saeed Mahameed, David S. Miller, Jakub Kicinski, Leon Romanovsky,
	Netdev, linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	Ertman, David M, Dan Williams, Kiran Patil, Greg KH

On Wed, Dec 16, 2020 at 11:27:32AM -0800, Alexander Duyck wrote:

> That has been the case for a long time. However it had been my
> experience that SR-IOV never scaled well to meet those needs and so it
> hadn't been used in such deployments.

Seems to be going quite well here, perhaps the applications are
different.

> > The optimization here is to reduce the hypervisor workload and free up
> > CPU cycles for the VMs/containers to consume. This means less handling
> > of packets in the CPU, especially for VM cases w/ SRIOV or VDPA.
> >
> > Even the extra DMA on the NIC is not really a big deal. These are 400G
> > NICs with big fast PCI. If you top them out you are already doing an
> > aggregate of 400G of network traffic. That is a big number for a
> > single server, it is OK.
> 
> Yes, but at a certain point you start bumping up against memory
> throughput limitations as well. Doubling up the memory footprint by
> having the device have to write to new pages instead of being able to
> do something like pinning and zero-copy would be expensive.

You can't zero-copy when using VMs.

And when using containers every skb still has to go through all the
switching and encapsulation logic, which is not free in SW.

At a certain point the gains of avoiding the DMA copy are lost by the
costs of all the extra CPU work. The factor being optimized here is
CPU capacity.

> > Someone might feel differently if they did this on a 10/40G NIC, in
> > which case this is not the solution for their application.
> 
> My past experience was with 10/40G NICs with tens of VFs. When we start
> talking about hundreds I would imagine the overhead becomes orders of
> magnitude worse as the problem becomes more of an n^2 issue since you
> will have n times more systems sending to n times more systems

The traffic demand is application dependent. If an application has an
n^2 traffic pattern then it needs a network to sustain that cross
sectional bandwidth regardless of how the VMs are packed.

It just becomes a design factor of the network and now the network
includes that switching component on the PCIe NIC as part of the
capacity for cross sectional BW.

There is some balance where a VM can only generate so much traffic
based on the CPU it has available, and you can design the entire
infrastructure to balance the CPU with the NIC with the switches and
come to some packing factor of VMs. 

As CPU constrains VM performance, removing CPU overheads from the
system will improve packing density. A HW network data path in the VMs
is one such case that can turn to a net win if the CPU bottleneck is
bigger than the network bottleneck.

It is really oversimplifying to just say PCIe DMA copies are bad.

> receiving. As such, things like broadcast traffic would end up
> consuming a fair bit of bandwidth.

I think you have a lot bigger network problems if your broadcast
traffic is so high that you start to worry about DMA copy performance
in a 400G NIC.

> The key bit here is outside of netdev. Like I said, SIOV and SR-IOV
> tend to be PCIe specific specifications. What we are defining here is
> how the network interfaces presented by such devices will work.

I think we've achieved this..

> > That seems small enough it should be resolvable, IMHO. eg some new
> > switch rule that matches that specific HW behavior?
> 
> I would have to go digging to find the conversation. It was about 3 or
> 4 years ago. I seem to recall mentioning the idea of having some
> static rules but it was a no-go at the time. If we wanted to spin off
> this conversation and pull in some Intel folks I would be up for us
> revisiting it. However I'm not with Intel anymore so it would mostly
> be something I would be working on as a hobby project instead of
> anything serious.

Personally I welcome getting more drivers to implement the switchdev
model; I think it is only good for the netdev community as a whole
to understand and standardize on this.

> > > Something like the vdpa model is more like what I had in mind. Only
> > > vdpa only works for the userspace networking case.
> >
> > That's because making a driver that converts the native HW to VDPA and
> > then running a generic netdev on the resulting virtio-net is a pretty
> > wild thing to do. I can't really think of an actual use case.
> 
> I'm not talking about us drastically changing existing models. I would
> still expect the mlx5 driver to be running on top of the aux device.
> However it may be that the aux device is associated with something
> like the switchdev port as a parent 
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

That is exactly how this works. The switchdev representor and the aux
device are paired and form the analog of the veth tunnel. IIRC this
relationship with the aux device is shown in the devlink output for
the switchdev ports.

I still can't understand what you think should be changed here.

We can't get rid of the aux device, it is integral to the software
layering and essential to support any device assignment flow.

We can't add a mandatory netdev, because that is just pure waste for
any device assignment flow.

> The basic idea is most of the control is still in the switchdev
> port, but it becomes more visible that the switchdev port and the
> subfunction netdev are linked with the switchdev port as the parent.

If a user netdev is created, say for a container, then it should have
as a parent the aux device.

If you inspect the port configuration on the switchdev side it will
show the aux device associated with the port.

Are you just asking that the userspace tools give a little help and
show that switchdev port XX is visible as netdev YYY by cross-matching
the auxbus?
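
FWIW the cross matching is already there in the port output, something
like (illustrative names):

  $ devlink port show ens2f0npf0sf88
  pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf
    controller 0 pfnum 0 sfnum 88 splittable false
    function:
      hw_addr 00:00:00:00:88:88 state active opstate attached

while the SF side sits on the auxiliary bus as mlx5_core.sf.<N>.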

> > > they are pushed into containers you don't have to rip them out if for
> > > some reason you need to change the network configuration.
> >
> > Why would you need to rip them out?
> 
> Because they are specifically tied to the mlx5 device. So if for
> example I need to hotplug out the mlx5 and replace it, 

Uhh, I think we are very very far away from being able to hot unplug a
switchdev driver, keep the switchdev running, and drop in a different
driver.

That isn't even on the radar, AFAIK.

> namespace. One of the issues with VFs is that we have always had to
> push some sort of bond on top, or switch over to a virtio-net
> interface in order to support fail-over. If we can resolve that in the
> host case that would go a long way toward solving one of the main
> issues of SR-IOV.

This is all solved already, virtio-net is the answer.

qemu can swap back ends under the virtio-net ADI it created on the
fly. This means it can go from processing a virtio-net queue in mlx5
HW, to full SW, to some other HW on another machine. All hitlessly and
transparently to the guest VM.

Direct HW processing of a queue inside a VM without any downsides for
VM migration. Check.
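
(Concretely this is the virtio-net failover machinery; a sketch of the
qemu side, with illustrative ids:

  -device virtio-net-pci,id=net0,netdev=hostnet0,failover=on
  -device vfio-pci,host=06:00.2,failover_pair_id=net0

qemu hot-unplugs the VF before migration and the guest transparently
falls back to the virtio path.)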

I have the feeling this stuff you are asking for is already done..

Jason

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-16 20:35                           ` Jason Gunthorpe
@ 2020-12-16 22:53                             ` Alexander Duyck
  2020-12-17  0:38                               ` Jason Gunthorpe
  2020-12-18  1:30                               ` David Ahern
  0 siblings, 2 replies; 65+ messages in thread
From: Alexander Duyck @ 2020-12-16 22:53 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Saeed Mahameed, David S. Miller, Jakub Kicinski, Leon Romanovsky,
	Netdev, linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	Ertman, David M, Dan Williams, Kiran Patil, Greg KH

On Wed, Dec 16, 2020 at 12:35 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Wed, Dec 16, 2020 at 11:27:32AM -0800, Alexander Duyck wrote:
>
> > That has been the case for a long time. However it had been my
> > experience that SR-IOV never scaled well to meet those needs and so it
> > hadn't been used in such deployments.
>
> Seems to be going quite well here, perhaps the applications are
> different.
>
> > > The optimization here is to reduce the hypervisor workload and free up
> > > CPU cycles for the VMs/containers to consume. This means less handling
> > > of packets in the CPU, especially for VM cases w/ SRIOV or VDPA.
> > >
> > > Even the extra DMA on the NIC is not really a big deal. These are 400G
> > > NICs with big fast PCI. If you top them out you are already doing an
> > > aggregate of 400G of network traffic. That is a big number for a
> > > single server, it is OK.
> >
> > Yes, but at a certain point you start bumping up against memory
> > throughput limitations as well. Doubling up the memory footprint by
> > having the device have to write to new pages instead of being able to
> > do something like pinning and zero-copy would be expensive.
>
> You can't zero-copy when using VMs.
>
> And when using containers every skb still has to go through all the
> switching and encapsulation logic, which is not free in SW.
>
> At a certain point the gains of avoiding the DMA copy are lost by the
> costs of all the extra CPU work. The factor being optimized here is
> CPU capacity.
>
> > > Someone might feel differently if they did this on a 10/40G NIC, in
> > > which case this is not the solution for their application.
> >
> > My past experience was with 10/40G NICs with tens of VFs. When we start
> > talking about hundreds I would imagine the overhead becomes orders of
> > magnitude worse as the problem becomes more of an n^2 issue since you
> > will have n times more systems sending to n times more systems
>
> The traffic demand is application dependent. If an application has an
> n^2 traffic pattern then it needs a network to sustain that cross
> sectional bandwidth regardless of how the VMs are packed.
>
> It just becomes a design factor of the network and now the network
> includes that switching component on the PCIe NIC as part of the
> capacity for cross sectional BW.
>
> There is some balance where a VM can only generate so much traffic
> based on the CPU it has available, and you can design the entire
> infrastructure to balance the CPU with the NIC with the switches and
> come to some packing factor of VMs.
>
> As CPU constrains VM performance, removing CPU overheads from the
> system will improve packing density. A HW network data path in the VMs
> is one such case that can turn to a net win if the CPU bottleneck is
> bigger than the network bottleneck.
>
> It is really oversimplifying to just say PCIe DMA copies are bad.

I'm not saying the copies are bad. However they can be limiting. As
you said it all depends on the use case. If you have nothing but
functions that are performing bump-in-the-wire type operations odds
are the PCIe bandwidth won't be the problem. It all depends on the use
case and that is why I would prefer the interface to be more flexible
rather than just repeating what has been done with SR-IOV.

The problem in my case was based on a past experience where east-west
traffic became a problem and it was easily shown that bypassing the
NIC for traffic was significantly faster.

> > receiving. As such things like broadcast traffic would end up
> > consuming a fair bit of traffic.
>
> I think you have a lot bigger network problems if your broadcast
> traffic is so high that you start to worry about DMA copy performance
> in a 400G NIC.

Usually the problems were more multicast than broadcast, but yeah,
this typically isn't an issue. However, at 256 VFs you would still be
talking about a replication rate such that one VF sending at 1-2Gbps
could saturate the entire 400G device.

> > The key bit here is outside of netdev. Like I said, SIOV and SR-IOV
> > tend to be PCIe specific specifications. What we are defining here is
> > how the network interfaces presented by such devices will work.
>
> I think we've achieved this..

Somewhat. We have it explained for the control plane. What we are
defining now is how it will appear in the guest/container.

> > > That seems small enough it should be resolvable, IMHO. eg some new
> > > switch rule that matches that specific HW behavior?
> >
> > I would have to go digging to find the conversation. It was about 3 or
> > 4 years ago. I seem to recall mentioning the idea of having some
> > static rules but it was a no-go at the time. If we wanted to spin off
> > this conversation and pull in some Intel folks I would be up for us
> > revisiting it. However I'm not with Intel anymore so it would mostly
> > be something I would be working on as a hobby project instead of
> > anything serious.
>
> Personally I welcome getting more drivers to implement the switchdev
> model, I think it is only good for the netdev community as a whole
> to understand and standardize on this.

Agreed.

> > > > Something like the vdpa model is more like what I had in mind. Only
> > > > vdpa only works for the userspace networking case.
> > >
> > > That's because making a driver that converts the native HW to VDPA and
> > > then running a generic netdev on the resulting virtio-net is a pretty
> > > wild thing to do. I can't really think of an actual use case.
> >
> > I'm not talking about us drastically changing existing models. I would
> > still expect the mlx5 driver to be running on top of the aux device.
> > However it may be that the aux device is associated with something
> > like the switchdev port as a parent
>  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> That is exactly how this works. The switchdev representor and the aux
> device are paired and form the analog of the veth tunnel. IIRC this
> relationship with the aux device is shown in the devlink output for
> the switchdev ports.
>
> I still can't understand what you think should be changed here.
>
> We can't get rid of the aux device, it is integral to the software
> layering and essential to support any device assignment flow.
>
> We can't add a mandatory netdev, because that is just pure waste for
> any device assignment flow.

I'm not saying to get rid of the netdev. I'm saying the netdev created
should be a generic interface that could be reused by other vendors.
The whole idea behind using something like macvlan is to hide the
underlying device. It becomes the namespace assignable interface. What
I would like to see is us get rid of that step and instead just have a
generic interface spawned in the first place and push the driver
specific bits back to the switchdev port.

The problem right now is that the switchdev netdev and the subfunction
netdev are treated as peers. In my mind it should work somewhere in
between the macvlan and veth. Basically you have the switchdev port
handling both ends of the traffic and handling the aux device
directly, and the subfunction device is floating on top of it
sometimes acting like a macvlan when traffic is going to/from the aux
device, and acting like a veth pair in some cases perhaps such as
broadcast/multicast, allowing the switchdev to take care of that
locally.

> > The basic idea is most of the control is still in the switchdev
> > port, but it becomes more visible that the switchdev port and the
> > subfunction netdev are linked with the switchdev port as the parent.
>
> If a user netdev is created, say for a container, then it should have
> as a parent the aux device.
>
> If you inspect the port configuration on the switchdev side it will
> show the aux device associated with the port.
>
> Are you just asking that the userspace tools give a little help and
> show that switchdev port XX is visible as netdev YYY by cross-matching
> the auxbus?

It isn't about the association, it is about who is handling the
traffic. Going back to the macvlan model what we did is we had a group
of rings on the device that would automatically forward unicast
packets to the macvlan interface and would be reserved for
transmitting packets from the macvlan interface. We took care of
multicast and broadcast replication in software.

In my mind it should be possible to do something similar for the
switchdev case. Basically the packets would be routed from the
subfunction netdev, to the switchdev netdev, and could then be
transmitted through the aux device. In order to make this work in the
case of ixgbe I had come up with the concept of a subordinate device,
"sb_dev", which was a pointer used in the Tx case to identify Tx rings
and qdiscs that had been given to the macvlan interface.
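
Roughly, that plumbing maps onto the subordinate channel helpers that
exist in the stack today. A minimal sketch (the example_* names are
made up; netdev_set_sb_channel() and netdev_bind_sb_channel_queue()
are the real helpers):

/* Sketch: dedicate a block of the lower device's Tx queues to an
 * upper (macvlan-like) device via the subordinate device hooks.
 */
#include <linux/netdevice.h>

static int example_bind_upper(struct net_device *lower,
			      struct net_device *upper)
{
	/* Mark the upper device as subordinate channel 1 of the lower. */
	int err = netdev_set_sb_channel(upper, 1);

	if (err)
		return err;

	/* Bind 4 Tx queues starting at queue 8 of the lower device to
	 * traffic class 0 of the upper device; queue selection then
	 * steers the upper device's Tx onto that block.
	 */
	return netdev_bind_sb_channel_queue(lower, upper, 0, 4, 8);
}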

> > > > they are pushed into containers you don't have to rip them out if for
> > > > some reason you need to change the network configuration.
> > >
> > > Why would you need to rip them out?
> >
> > Because they are specifically tied to the mlx5 device. So if for
> > example I need to hotplug out the mlx5 and replace it,
>
> Uhh, I think we are very very far away from being able to hot unplug a
> switchdev driver, keep the switchdev running, and drop in a different
> driver.
>
> That isn't even on the radar, AFAIK.

That might be a bad example, I was thinking of the issues we have had
with VFs and direct assignment to Qemu based guests in the past.
Essentially what I am getting at is that the setup in the container
should be vendor agnostic. The interface exposed shouldn't be specific
to any one vendor. So if I want to fire up a container on Mellanox,
Broadcom, or some other vendor it shouldn't matter or be visible to
the user. They should just see a vendor agnostic subfunction
netdevice.

Something like that is doable already using something like a macvlan
on top of a subfunction interface, but I feel like that is an
unnecessary step and creates unnecessary netdevices.

> > namespace. One of the issues with VFs is that we have always had to
> > push some sort of bond on top, or switch over to a virtio-net
> > interface in order to support fail-over. If we can resolve that in the
> > host case that would go a long way toward solving one of the main
> > issues of SR-IOV.
>
> This is all solved already, virtio-net is the answer.
>
> qemu can swap back ends under the virtio-net ADI it created on the
> fly. This means it can go from processing a virtio-net queue in mlx5
> HW, to full SW, to some other HW on another machine. All hitlessly and
> transparently to the guest VM.
>
> Direct HW processing of a queue inside a VM without any downsides for
> VM migration. Check.
>
> I have the feeling this stuff you are asking for is already done..

The case you are describing has essentially solved it for Qemu
virtualization and direct assignment. It still doesn't necessarily
solve it for the container case though.


* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-16 22:53                             ` Alexander Duyck
@ 2020-12-17  0:38                               ` Jason Gunthorpe
  2020-12-17 18:48                                 ` Alexander Duyck
  2020-12-18  1:30                               ` David Ahern
  1 sibling, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2020-12-17  0:38 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Saeed Mahameed, David S. Miller, Jakub Kicinski, Leon Romanovsky,
	Netdev, linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	Ertman, David M, Dan Williams, Kiran Patil, Greg KH

On Wed, Dec 16, 2020 at 02:53:07PM -0800, Alexander Duyck wrote:
 
> It isn't about the association, it is about who is handling the
> traffic. Going back to the macvlan model what we did is we had a group
> of rings on the device that would automatically forward unicast
> packets to the macvlan interface and would be reserved for
> transmitting packets from the macvlan interface. We took care of
> multicast and broadcast replication in software.

Okay, maybe I'm starting to see where you are coming from.

First, I think some clarity here, as I see it the devlink
infrastructure is all about creating the auxdevice for a switchdev
port.

What goes into that auxdevice is *completely* up to the driver. mlx5
is doing a SF which == VF, but that is not a requirement of the design
at all.

If an Intel driver wants to put a queue block into the aux device and
that is != VF, it is just fine.
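
To make that concrete, spawning such an auxdevice is just the normal
auxiliary bus flow. A sketch (the "queue_block" name and example_*
functions are hypothetical; the auxiliary_device_*() calls are the
real API):

#include <linux/auxiliary_bus.h>

static void example_adev_release(struct device *dev)
{
	/* Nothing dynamic to free in this sketch. */
}

static int example_spawn_auxdev(struct device *parent,
				struct auxiliary_device *adev)
{
	int err;

	adev->name = "queue_block"; /* matched as "<modname>.queue_block" */
	adev->id = 0;
	adev->dev.parent = parent;
	adev->dev.release = example_adev_release;

	err = auxiliary_device_init(adev);
	if (err)
		return err;

	err = auxiliary_device_add(adev);
	if (err)
		auxiliary_device_uninit(adev);
	return err;
}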

The Intel netdev that binds to the auxdevice can transform the queue
block and specific switchdev config into a netdev identical to
accelerated macvlan. Nothing about this breaks the switchdev model.

Essentially think of it as generalizing the acceleration plugin for a
netdev. Instead of making something specific to limited macvlan, the
driver gets to provide exactly the structure that matches its HW to
provide the netdev as the user side of the switchdev port. I see no
limitation here so long as the switchdev model for controlling traffic
is followed.

Let me segue into a short story from RDMA.. We've had a netdev called
IPoIB for a long time. It is actually kind of similar to this general
thing you are talking about, in that there is a programming layer
under the IPOIB netdev called RDMA verbs that generalizes the actual
HW. Over the years this became more complicated because every new
netdev offload needed mirroring into the RDMA verbs general
API. TSO, GSO, checksum offload, endlessly onwards. It became quite
dumb in the end. We gave up and said the HW driver should directly
implement netdev. Implementing a middle API layer makes zero sense
when netdev is already perfectly suited to implement on top of
HW. Removing SW layers caused performance to go up something like
2x.

The hard earned lesson I take from that is don't put software layers
between a struct net_device and the actual HW. The closest coupling is
really the best thing. Provide library code in the kernel to help
drivers implement common patterns when making their netdevs, do not
provide wrapper netdevs around drivers.

IMHO the approach of macvlan acceleration made some sense in 2013, but
today I would say it is mashing unrelated layers together and
polluting what should be a pure SW implementation with HW hooks.

I see from the mailing list comments this was done because creating a
device specific netdev via 'ip link add' was rightly rejected. However
here we *can* create a device specific vmdq *auxdevice*.  This is OK
because the netdev is controlling and containing the aux device via
switchdev.

So, Intel can get the "VMDQ link type" that was originally desired more
or less directly, so long as the associated switchdev port controls
the MAC filter process, not "ip link add".

And if you want to make the vmdq auxdevice into an ADI by user DMA to
queues, then sure, that model is completely sane too (vs hacking up
macvlan to expose user queues) - so long as the kernel controls the
selection of traffic into those queues and follows the switchdev
model. I would recommend creating a simple RDMA raw ethernet queue
driver over the aux device for something like this :)
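
(For reference, such a raw ethernet queue is just a verbs QP of type
IBV_QPT_RAW_PACKET. A minimal userspace sketch with error handling
elided; the sizes are arbitrary and CAP_NET_RAW is required:)

#include <infiniband/verbs.h>

static struct ibv_qp *open_raw_eth_qp(struct ibv_context *ctx)
{
	struct ibv_pd *pd = ibv_alloc_pd(ctx);
	struct ibv_cq *cq = ibv_create_cq(ctx, 256, NULL, NULL, 0);
	struct ibv_qp_init_attr attr = {
		.send_cq = cq,
		.recv_cq = cq,
		.cap = {
			.max_send_wr  = 256,
			.max_recv_wr  = 256,
			.max_send_sge = 1,
			.max_recv_sge = 1,
		},
		.qp_type = IBV_QPT_RAW_PACKET, /* raw ethernet frames */
	};

	return ibv_create_qp(pd, &attr);
}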

> That might be a bad example, I was thinking of the issues we have had
> with VFs and direct assignment to Qemu based guests in the past.

As described, this is solved by VDPA.

> Essentially what I am getting at is that the setup in the container
> should be vendor agnostic. The interface exposed shouldn't be specific
> to any one vendor. So if I want to fire up a container on Mellanox,
> Broadcom, or some other vendor it shouldn't matter or be visible to
> the user. They should just see a vendor agnostic subfunction
> netdevice.

Agree. The agnostic container user interface here is 'struct
net_device'.

> > I have the feeling this stuff you are asking for is already done..
> 
> The case you are describing has essentially solved it for Qemu
> virtualization and direct assignment. It still doesn't necessarily
> solve it for the container case though.

The container case doesn't need solving.

Any scheme I've heard for container live migration, like CRIU,
essentially hot plugs the entire kernel in/out of a user process. We
rely on the kernel providing low leakage of the implementation details
of the struct net_device as part of its uAPI contract. When CRIU
swaps the kernel the new kernel can have any implementation of the
container netdev it wants.

I've never heard of a use case to hot swap the implementation *under* a
netdev from a container. macvlan can't do this today. If you have a
use case here, it really has nothing to do with this series.

Jason


* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-17  0:38                               ` Jason Gunthorpe
@ 2020-12-17 18:48                                 ` Alexander Duyck
  2020-12-17 19:40                                   ` Jason Gunthorpe
  0 siblings, 1 reply; 65+ messages in thread
From: Alexander Duyck @ 2020-12-17 18:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Saeed Mahameed, David S. Miller, Jakub Kicinski, Leon Romanovsky,
	Netdev, linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	Ertman, David M, Dan Williams, Kiran Patil, Greg KH

On Wed, Dec 16, 2020 at 4:38 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Wed, Dec 16, 2020 at 02:53:07PM -0800, Alexander Duyck wrote:
>
> > It isn't about the association, it is about who is handling the
> > traffic. Going back to the macvlan model what we did is we had a group
> > of rings on the device that would automatically forward unicast
> > packets to the macvlan interface and would be reserved for
> > transmitting packets from the macvlan interface. We took care of
> > multicast and broadcast replication in software.
>
> Okay, maybe I'm starting to see where you are coming from.
>
> First, I think some clarity here, as I see it the devlink
> infrastructure is all about creating the auxdevice for a switchdev
> port.
>
> What goes into that auxdevice is *completely* up to the driver. mlx5
> is doing a SF which == VF, but that is not a requirement of the design
> at all.
>
> If an Intel driver wants to put a queue block into the aux device and
> that is != VF, it is just fine.
>
> The Intel netdev that binds to the auxdevice can transform the queue
> block and specific switchdev config into a netdev identical to
> accelerated macvlan. Nothing about this breaks the switchdev model.

Just to clarify I am not with Intel, nor do I plan to work on any
Intel drivers related to this.

My concern has more to do with how this is being plumbed and the fact
that the basic architecture is somewhat limiting.

> Essentially think of it as generalizing the acceleration plugin for a
> netdev. Instead of making something specific to limited macvlan, the
> driver gets to provide exactly the structure that matches its HW to
> provide the netdev as the user side of the switchdev port. I see no
> limitation here so long as the switchdev model for controlling traffic
> is followed.

I see plenty. The problem is it just sets up more vendor lock-in and
features that have to be thrown away when you have to settle for
least-common denominator in order to maintain functionality across
vendors.

> Let me segue into a short story from RDMA.. We've had a netdev called
> IPoIB for a long time. It is actually kind of similar to this general
> thing you are talking about, in that there is a programming layer
> under the IPOIB netdev called RDMA verbs that generalizes the actual
> HW. Over the years this became more complicated because every new
> netdev offload needed mirroring into the RDMA verbs general
> API. TSO, GSO, checksum offload, endlessly onwards. It became quite
> dumb in the end. We gave up and said the HW driver should directly
> implement netdev. Implementing a middle API layer makes zero sense
> when netdev is already perfectly suited to implement on top of
> HW. Removing SW layers caused performance to go up something like
> 2x.
>
> The hard earned lesson I take from that is don't put software layers
> between a struct net_device and the actual HW. The closest coupling is
> really the best thing. Provide library code in the kernel to help
> drivers implement common patterns when making their netdevs, do not
> provide wrapper netdevs around drivers.
>
> IMHO the approach of macvlan acceleration made some sense in 2013, but
> today I would say it is mashing unrelated layers together and
> polluting what should be a pure SW implementation with HW hooks.

I disagree here. In my mind a design where two interfaces, which both
exist in the kernel, have to go to hardware in order to communicate is
very limiting. The main thing I am wanting to see is the option of
being able to pass traffic directly between the switchdev and the SF
without the need to touch the hardware.

An easy example of such traffic that would likely benefit from this is
multicast/broadcast traffic. Instead of having to process each and
every broadcast packet in hardware you could very easily process it at
the switchdev and then directly hand it off from the switchdev to the
SF in this case instead of having to send it to hardware for each
switchdev instance.

> I see from the mailing list comments this was done because creating a
> device specific netdev via 'ip link add' was rightly rejected. However
> here we *can* create a device specific vmdq *auxdevice*.  This is OK
> because the netdev is controlling and containing the aux device via
> switchdev.
>
> So, Intel can get the "VMDQ link type" that was originally desired more
> or less directly, so long as the associated switchdev port controls
> the MAC filter process, not "ip link add".
>
> And if you want to make the vmdq auxdevice into an ADI by user DMA to
> queues, then sure, that model is completely sane too (vs hacking up
> macvlan to expose user queues) - so long as the kernel controls the
> selection of traffic into those queues and follows the switchdev
> model. I would recommend creating a simple RDMA raw ethernet queue
> driver over the aux device for something like this :)

You lost me here, I'm not seeing how RDMA and macvlan are connected.

> > That might be a bad example, I was thinking of the issues we have had
> > with VFs and direct assignment to Qemu based guests in the past.
>
> As described, this is solved by VDPA.
>
> > Essentially what I am getting at is that the setup in the container
> > should be vendor agnostic. The interface exposed shouldn't be specific
> > to any one vendor. So if I want to fire up a container on Mellanox,
> > Broadcom, or some other vendor it shouldn't matter or be visible to
> > the user. They should just see a vendor agnostic subfunction
> > netdevice.
>
> Agree. The agnostic container user interface here is 'struct
> net_device'.

I disagree here. The fact is a mellanox netdev, versus a broadcom
netdev, versus an intel netdev all have a very different look and feel,
as the netdev is essentially just the base device you are building
around.

In addition it still doesn't address my concern as called out above
which is the east-west traffic problem.

> > > I have the feeling this stuff you are asking for is already done..
> >
> > The case you are describing has essentially solved it for Qemu
> > virtualization and direct assignment. It still doesn't necessarily
> > solve it for the container case though.
>
> The container case doesn't need solving.

I disagree, and that is at the heart of where you and I have different
views. I view there being two advantages to having the container case
solved:
1. A standardized set of features that can be provided regardless of vendor
2. Allowing for the case where east-west traffic can avoid having to
touch hardware

> Any scheme I've heard for container live migration, like CRIU,
> essentially hot plugs the entire kernel in/out of a user process. We
> rely on the kernel providing low leakage of the implementation details
> of the struct net_device as part of its uAPI contract. When CRIU
> swaps the kernel the new kernel can have any implementation of the
> container netdev it wants.

I'm not thinking about migration. I am thinking more about the user
experience. In my mind if I set up a container I shouldn't need to
know which vendor provided the network interface when I set it up. The
problem is most NICs have so many one-off proprietary tweaks needed
that it gets annoying. That is why in my mind it would make much more
sense to have a simple vendor agnostic interface. That is why I would
prefer to avoid the VF model.

> I've never heard of a use case to hot swap the implementation *under* a
> netdev from a container. macvlan can't do this today. If you have a
> use case here, it really has nothing to do with this series.

Again, the hot-swap isn't necessarily what I am talking about. I am
talking about setting up a config for a set of containers in a
datacenter. What I don't want to do is have to have one set of configs
for an mlx5 SF, another for a broadcom SF, and yet another set for any
other vendors out there. I would much rather have all of that dealt
with within the namespace that is handling the switchdev setup.

In addition, the east-west traffic is the other bit I would like to
see addressed. I am okay excusing this in the case of direct
assignment since the resources for the SF will not be available to the
host. However if the SF will be operating in the same kernel as the
PF/switchdev it would make much more sense to enable an east/west
channel which would allow for hardware bypass under certain
circumstances without having to ever leave the kernel.


* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-17 18:48                                 ` Alexander Duyck
@ 2020-12-17 19:40                                   ` Jason Gunthorpe
  2020-12-17 21:05                                     ` Alexander Duyck
  0 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2020-12-17 19:40 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Saeed Mahameed, David S. Miller, Jakub Kicinski, Leon Romanovsky,
	Netdev, linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	Ertman, David M, Dan Williams, Kiran Patil, Greg KH

On Thu, Dec 17, 2020 at 10:48:48AM -0800, Alexander Duyck wrote:

> Just to clarify I am not with Intel, nor do I plan to work on any
> Intel drivers related to this.

Sure
 
> I disagree here. In my mind a design where two interfaces, which both
> exist in the kernel, have to go to hardware in order to communicate is
> very limiting. The main thing I am wanting to see is the option of
> being able to pass traffic directly between the switchdev and the SF
> without the need to touch the hardware.

I view the SW bypass path you are talking about similarly to
GSO/etc. It should be accessed by the HW driver as an optional service
provided by the core netdev, not implemented as some wrapper netdev
around a HW implementation.

If you feel strongly it is needed then there is nothing standing in
the way to implement it in the switchdev auxdevice model.

It is simple enough, the HW driver's tx path would somehow detect
east/west and queue it differently, and the rx path would somehow be
able to mux in skbs from a SW queue. Not seeing any blockers here.

> > model. I would recommend creating a simple RDMA raw ethernet queue
> > driver over the aux device for something like this :)
> 
> You lost me here, I'm not seeing how RDMA and macvlan are connected.

RDMA is the standard uAPI to get a userspace HW DMA queue for ethernet
packets.

> > > Essentially what I am getting at is that the setup in the container
> > > should be vendor agnostic. The interface exposed shouldn't be specific
> > > to any one vendor. So if I want to fire up a container on Mellanox,
> > > Broadcom, or some other vendor it shouldn't matter or be visible to
> > > the user. They should just see a vendor agnostic subfunction
> > > netdevice.
> >
> > Agree. The agnostic container user interface here is 'struct
> > net_device'.
> 
> I disagree here. The fact is a mellanox netdev, versus a broadcom
> netdev, versus an intel netdev all have a very different look and feel,
> as the netdev is essentially just the base device you are building
> around.

Then fix the lack of standardization of netdev implementations!

Adding more abstraction layers isn't going to fix that fundamental
problem.

Frankly it seems a bit absurd to complain that the very basic element
of the common kernel uAPI - struct net_device - is so horribly
fragmented and vendor polluted that we can't rely on it as a stable
interface for containers.

Even if that is true, I don't believe for a second that adding a
different HW abstraction layer is going to somehow undo the mistakes
of the last 20 years.

> Again, the hot-swap isn't necessarily what I am talking about. I am
> talking about setting up a config for a set of containers in a
> datacenter. What I don't want to do is have to have one set of configs
> for an mlx5 SF, another for a broadcom SF, and yet another set for any
> other vendors out there. I would much rather have all of that dealt
> with within the namespace that is handling the switchdev setup.

If there are real problems here then I very much encourage you to start
an effort to push all the vendors to implement a consistent user
experience for the HW netdevs.

I don't know what your issues are, but it sounds like it would be a
very interesting conference presentation.

But it has nothing to do with this series.

Jason


* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-17 19:40                                   ` Jason Gunthorpe
@ 2020-12-17 21:05                                     ` Alexander Duyck
  2020-12-18  0:08                                       ` Jason Gunthorpe
  0 siblings, 1 reply; 65+ messages in thread
From: Alexander Duyck @ 2020-12-17 21:05 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Saeed Mahameed, David S. Miller, Jakub Kicinski, Leon Romanovsky,
	Netdev, linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	Ertman, David M, Dan Williams, Kiran Patil, Greg KH

On Thu, Dec 17, 2020 at 11:40 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Dec 17, 2020 at 10:48:48AM -0800, Alexander Duyck wrote:
>
> > Just to clarify I am not with Intel, nor do I plan to work on any
> > Intel drivers related to this.
>
> Sure
>
> > I disagree here. In my mind a design where two interfaces, which both
> > exist in the kernel, have to go to hardware in order to communicate is
> > very limiting. The main thing I am wanting to see is the option of
> > being able to pass traffic directly between the switchdev and the SF
> > without the need to touch the hardware.
>
> I view the SW bypass path you are talking about similarly to
> GSO/etc. It should be accessed by the HW driver as an optional service
> provided by the core netdev, not implemented as some wrapper netdev
> around a HW implementation.

I view it as being something that would be a part of the switchdev API
itself. Basically the switchdev and endpoint would need to be able to
control something like this because if XDP were enabled on one end or
the other you would need to be able to switch it off so that all of
the packets followed the same flow and could be scanned by the XDP
program.

> If you feel strongly it is needed then there is nothing standing in
> the way to implement it in the switchdev auxdevice model.
>
> It is simple enough, the HW driver's tx path would somehow detect
> east/west and queue it differently, and the rx path would somehow be
> able to mux in skbs from a SW queue. Not seeing any blockers here.

In my mind the simple proof of concept for this would be to check for
the multicast bit being set in the destination MAC address for packets
coming from the subfunction. If it is then shunt to this bypass route,
and if not then you transmit to the hardware queues. In the case of
packets coming from the switchdev port it would probably depend. The
part I am not sure about is if any packets need to be actually
transmitted to the hardware in the standard case for packets going
from the switchdev port to the subfunction. If there is no XDP or
anything like that present in the subfunction it probably wouldn't
matter and you could just shunt it straight across and bypass the
hardware. However, if XDP is present you would need to get the packets
into the ring, which would force the bypass to be turned off.
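
As a rough illustration of that check, in a hypothetical subfunction
driver's xmit path (sf_priv, repr_dev and sf_hw_xmit() are made-up
names; the ethernet helpers are the real ones):

#include <linux/etherdevice.h>
#include <linux/netdevice.h>

struct sf_priv {
	struct net_device *repr_dev; /* paired switchdev representor */
};

/* Normal DMA transmit path, assumed to be provided by the driver. */
static netdev_tx_t sf_hw_xmit(struct sk_buff *skb,
			      struct net_device *dev);

static netdev_tx_t sf_start_xmit(struct sk_buff *skb,
				 struct net_device *dev)
{
	struct sf_priv *priv = netdev_priv(dev);

	/* The single bit test: multicast/broadcast destination MAC. */
	if (is_multicast_ether_addr(eth_hdr(skb)->h_dest)) {
		/* East-west bypass: inject the skb into the paired
		 * representor's Rx path without touching the hardware.
		 */
		dev_forward_skb(priv->repr_dev, skb);
		return NETDEV_TX_OK;
	}

	return sf_hw_xmit(skb, dev);
}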

> > > model. I would recommend creating a simple RDMA raw ethernet queue
> > > driver over the aux device for something like this :)
> >
> > You lost me here, I'm not seeing how RDMA and macvlan are connected.
>
> RDMA is the standard uAPI to get a userspace HW DMA queue for ethernet
> packets.

Ah, I think you are talking about device assignment. In my mind I was
just talking about the interface assigned to the container which as
you have stated is basically just a netdev.

> > > > Essentially what I am getting at is that the setup in the container
> > > > should be vendor agnostic. The interface exposed shouldn't be specific
> > > > to any one vendor. So if I want to fire up a container on Mellanox,
> > > > Broadcom, or some other vendor it shouldn't matter or be visible to
> > > > the user. They should just see a vendor agnostic subfunction
> > > > netdevice.
> > >
> > > Agree. The agnostic container user interface here is 'struct
> > > net_device'.
> >
> > I disagree here. The fact is a mellanox netdev, versus a broadcom
> > netdev, versus an intel netdev all have a very different look and feel,
> > as the netdev is essentially just the base device you are building
> > around.
>
> Then fix the lack of standardization of netdev implementations!

We're trying to work on that, but trying to fix it after the fact is
like herding cats.

> Adding more abstraction layers isn't going to fix that fundamental
> problem.
>
> Frankly it seems a bit absurd to complain that the very basic element
> of the common kernel uAPI - struct net_device - is so horribly
> fragmented and vendor polluted that we can't rely on it as a stable
> interface for containers.

The problem isn't necessarily the net_device, it is more the
net_device_ops and the fact that there are so many different ways to
get things done. Arguably the flexibility of the net_device is great
for allowing vendors to expose their features. However, at the same
time it allows for features to be left out, so what you end up with is
a wide variety of things that are net_devices.

> Even if that is true, I don't believe for a second that adding a
> different HW abstraction layer is going to somehow undo the mistakes
> of the last 20 years.

It depends on how it is done. The general idea is to address the
biggest limitation that has occurred, which is the fact that in many
cases we don't have software offloads to take care of things when the
hardware offloads provided by a certain piece of hardware are not
present. It would basically allow us to reset the feature set. If
something cannot be offloaded in software in a reasonable way, it is
not allowed to be present in the interface provided to a container.
That way instead of having to do all the custom configuration in the
container recipe it can be centralized to one container handling all
of the switching and hardware configuration.
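
In kernel terms that reset could be as simple as masking, before the
netdev is handed to the container, any feature bits that lack a
software fallback. A sketch (which bits qualify is policy;
NETIF_F_HW_TC is just one example of an offload-only feature):

#include <linux/netdevice.h>

/* Caller holds RTNL. */
static void sanitize_container_features(struct net_device *dev)
{
	dev->hw_features &= ~NETIF_F_HW_TC;
	dev->features &= ~NETIF_F_HW_TC;

	/* Renegotiate the feature set with the reduced mask. */
	netdev_update_features(dev);
}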

> > Again, the hot-swap isn't necessarily what I am talking about. I am
> > talking about setting up a config for a set of containers in a
> > datacenter. What I don't want to do is have to have one set of configs
> > for an mlx5 SF, another for a broadcom SF, and yet another set for any
> > other vendors out there. I would much rather have all of that dealt
> > with within the namespace that is handling the switchdev setup.
>
> If there are real problems here then I very much encourage you to start
> an effort to push all the vendors to implement a consistent user
> experience for the HW netdevs.

To some extent that has been going on for some time. It is one of the
reasons why there are supposed to be software fallbacks for any
datapath features that get added to the hardware, such as GSO backing
TSO. However there always ends up being the occasional thing that
slips through, and that is where the frustration comes in.

> I don't know what your issues are, but it sounds like it would be a
> very interesting conference presentation.
>
> But it has nothing to do with this series.
>
> Jason

There I disagree. Now I can agree that most of the series is about
presenting the aux device and that part I am fine with. However when
the aux device is a netdev and that netdev is being loaded into the
same kernel as the switchdev port is where the red flags start flying,
especially when we start talking about how it is the same as a VF.

In my mind we are talking about how the switchdev will behave, and it
makes sense to work out whether an east-west bypass is worthwhile
and how it could be implemented, rather than saying we won't bother
for now and potentially locking in the subfunction to virtual function
equality. In my mind we need more than just the increased count to
justify going to subfunctions, and I think being able to solve the
east-west problem at least in terms of containers would be such a
thing.


* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-17 21:05                                     ` Alexander Duyck
@ 2020-12-18  0:08                                       ` Jason Gunthorpe
  0 siblings, 0 replies; 65+ messages in thread
From: Jason Gunthorpe @ 2020-12-18  0:08 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Saeed Mahameed, David S. Miller, Jakub Kicinski, Leon Romanovsky,
	Netdev, linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	Ertman, David M, Dan Williams, Kiran Patil, Greg KH

On Thu, Dec 17, 2020 at 01:05:03PM -0800, Alexander Duyck wrote:

> > I view the SW bypass path you are talking about similarly to
> > GSO/etc. It should be accessed by the HW driver as an optional service
> > provided by the core netdev, not implemented as some wrapper netdev
> > around a HW implementation.
> 
> I view it as being something that would be a part of the switchdev API
> itself. Basically the switchdev and endpoint would need to be able to
> control something like this because if XDP were enabled on one end or
> the other you would need to be able to switch it off so that all of
> the packets followed the same flow and could be scanned by the XDP
> program.

To me that still all comes down to being something like an optional
offload that the HW driver can trigger if the conditions are met.

> > It is simple enough, the HW driver's tx path would somehow detect
> > east/west and queue it differently, and the rx path would somehow be
> > able to mux in skbs from a SW queue. Not seeing any blockers here.
> 
> In my mind the simple proof of concept for this would be to check for
> the multicast bit being set in the destination MAC address for packets
> coming from the subfunction. If it is then shunt to this bypass route,
> and if not then you transmit to the hardware queues. 

Sure, though I suspect a multicast optimization like this is incredibly
niche; still, it would be an interesting path to explore.

But again, there is nothing fundamental about the model here that
precludes this optional optimization.

> > Even if that is true, I don't believe for a second that adding a
> > different HW abstraction layer is going to somehow undo the mistakes
> > of the last 20 years.
> 
> It depends on how it is done. The general idea is to address the
> biggest limitation that has occurred, which is the fact that in many
> cases we don't have software offloads to take care of things when the
> hardware offloads provided by a certain piece of hardware are not
> present. 

This is really disappointing to hear. Admittedly I don't follow all
the twists and turns on the mailing list, but I thought having a SW
version of everything was one of the fundamental tenets of netdev
that truly distinguished it from something like RDMA.

> It would basically allow us to reset the feature set. If something
> cannot be offloaded in software in a reasonable way, it is not
> allowed to be present in the interface provided to a container.
> That way instead of having to do all the custom configuration in the
> container recipe it can be centralized to one container handling all
> of the switching and hardware configuration.

Well, you could start by blocking stuff without a SW fallback..

> There I disagree. Now I can agree that most of the series is about
> presenting the aux device and that part I am fine with. However when
> the aux device is a netdev and that netdev is being loaded into the
> same kernel as the switchdev port is where the red flags start flying,
> especially when we start talking about how it is the same as a VF.

Well, it happens for the same reason a VF can create a netdev;
stopping it would actually be more patches. As I said before, people
are already doing this model with VFs.

I can agree with some of your points, but this is not the series to
argue them. What you want is to start some new thread on optimizing
switchdev for the container user case.

> In my mind we are talking about how the switchdev will behave and it
> makes sense to see about defining if a east-west bypass makes sense
> and how it could be implemented, rather than saying we won't bother
> for now and potentially locking in the subfunction to virtual function
> equality.

At least for mlx5 SF == VF, that is a consequence of the HW. Any SW
bypass would need to be specially built into the mlx5 netdev running on
a VF/SF attached to a switchdev port.

I don't see anything about this part of the model that precludes ever
doing that, and I also don't see this optimization as being valuable
enough to block things "just to be sure".

> In my mind we need more than just the increased count to justify
> going to subfunctions, and I think being able to solve the east-west
> problem at least in terms of containers would be such a thing.

Increased count is pretty important for users with SRIOV.

Jason


* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-16 22:53                             ` Alexander Duyck
  2020-12-17  0:38                               ` Jason Gunthorpe
@ 2020-12-18  1:30                               ` David Ahern
  2020-12-18  3:11                                 ` Alexander Duyck
  1 sibling, 1 reply; 65+ messages in thread
From: David Ahern @ 2020-12-18  1:30 UTC (permalink / raw)
  To: Alexander Duyck, Jason Gunthorpe
  Cc: Saeed Mahameed, David S. Miller, Jakub Kicinski, Leon Romanovsky,
	Netdev, linux-rdma, David Ahern, Jacob Keller, Sridhar Samudrala,
	Ertman, David M, Dan Williams, Kiran Patil, Greg KH

On 12/16/20 3:53 PM, Alexander Duyck wrote:
> The problem in my case was based on a past experience where east-west
> traffic became a problem and it was easily shown that bypassing the
> NIC for traffic was significantly faster.

If a deployment expects a lot of east-west traffic *within a host* why
is it using hardware based isolation like a VF? That is a side effect of
a design choice that is remedied by other options.


* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-18  1:30                               ` David Ahern
@ 2020-12-18  3:11                                 ` Alexander Duyck
  2020-12-18  3:55                                   ` David Ahern
  2020-12-18  5:20                                   ` Parav Pandit
  0 siblings, 2 replies; 65+ messages in thread
From: Alexander Duyck @ 2020-12-18  3:11 UTC (permalink / raw)
  To: David Ahern
  Cc: Jason Gunthorpe, Saeed Mahameed, David S. Miller, Jakub Kicinski,
	Leon Romanovsky, Netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, Ertman, David M, Dan Williams, Kiran Patil,
	Greg KH

On Thu, Dec 17, 2020 at 5:30 PM David Ahern <dsahern@gmail.com> wrote:
>
> On 12/16/20 3:53 PM, Alexander Duyck wrote:
> > The problem in my case was based on a past experience where east-west
> > traffic became a problem and it was easily shown that bypassing the
> > NIC for traffic was significantly faster.
>
> If a deployment expects a lot of east-west traffic *within a host* why
> is it using hardware based isolation like a VF? That is a side effect of
> a design choice that is remedied by other options.

I am mostly talking about this from past experience as I had seen a
few instances when I was at Intel when it became an issue. Sales and
marketing people aren't exactly happy when you tell them "don't sell
that" in response to them trying to sell a feature into an area where
it doesn't belong. Generally they want a solution. The macvlan offload
addressed these issues as the replication and local switching can be
handled in software.

The problem is PCIe DMA wasn't designed to function as a network
switch fabric and when we start talking about a 400Gb NIC trying to
handle over 256 subfunctions it will quickly reduce the
receive/transmit throughput to gigabit or less speeds when
encountering hardware multicast/broadcast replication. With 256
subfunctions a simple 60B ARP could consume more than 19KB of PCIe
bandwidth due to the packet having to be duplicated so many times. In
my mind it should be simpler to simply clone a single skb 256 times,
forward that to the switchdev ports, and have them perform a bypass
(if available) to deliver it to the subfunctions. That's why I was
thinking it might be a good time to look at addressing it.
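
(Back-of-the-envelope for that 19KB, with the per-copy PCIe overhead
being an assumption on my part:

  256 copies * (60B frame + ~16B TLP/descriptor overhead)
      = 256 * ~76B ~= 19KB crossing the bus for a single ARP.)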


* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-18  3:11                                 ` Alexander Duyck
@ 2020-12-18  3:55                                   ` David Ahern
  2020-12-18 15:54                                     ` Alexander Duyck
  2020-12-18  5:20                                   ` Parav Pandit
  1 sibling, 1 reply; 65+ messages in thread
From: David Ahern @ 2020-12-18  3:55 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jason Gunthorpe, Saeed Mahameed, David S. Miller, Jakub Kicinski,
	Leon Romanovsky, Netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, Ertman, David M, Dan Williams, Kiran Patil,
	Greg KH

On 12/17/20 8:11 PM, Alexander Duyck wrote:
> On Thu, Dec 17, 2020 at 5:30 PM David Ahern <dsahern@gmail.com> wrote:
>>
>> On 12/16/20 3:53 PM, Alexander Duyck wrote:
>>> The problem in my case was based on a past experience where east-west
>>> traffic became a problem and it was easily shown that bypassing the
>>> NIC for traffic was significantly faster.
>>
>> If a deployment expects a lot of east-west traffic *within a host* why
>> is it using hardware based isolation like a VF? That is a side effect of
>> a design choice that is remedied by other options.
> 
> I am mostly talking about this from past experience as I had seen a
> few instances when I was at Intel when it became an issue. Sales and
> marketing people aren't exactly happy when you tell them "don't sell
> that" in response to them trying to sell a feature into an area where

that's a problem engineers can never solve...

> it doesn't belong. Generally they want a solution. The macvlan offload
> addressed these issues as the replication and local switching can be
> handled in software.

well, I guess almost never. :-)

> 
> The problem is PCIe DMA wasn't designed to function as a network
> switch fabric and when we start talking about a 400Gb NIC trying to
> handle over 256 subfunctions it will quickly reduce the
> receive/transmit throughput to gigabit or less speeds when
> encountering hardware multicast/broadcast replication. With 256
> subfunctions a simple 60B ARP could consume more than 19KB of PCIe
> bandwidth due to the packet having to be duplicated so many times. In
> my mind it should be simpler to simply clone a single skb 256 times,
> forward that to the switchdev ports, and have them perform a bypass
> (if available) to deliver it to the subfunctions. That's why I was
> thinking it might be a good time to look at addressing it.
> 

east-west traffic within a host is more than likely the same tenant in
which case a proper VPC is a better solution than the s/w stack trying
to detect and guess that a bypass is needed. Guesses cost cycles in the
fast path which is a net loss - and even more so as speeds increase.



* RE: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-18  3:11                                 ` Alexander Duyck
  2020-12-18  3:55                                   ` David Ahern
@ 2020-12-18  5:20                                   ` Parav Pandit
  2020-12-18  5:36                                     ` Parav Pandit
  2020-12-18 16:01                                     ` Alexander Duyck
  1 sibling, 2 replies; 65+ messages in thread
From: Parav Pandit @ 2020-12-18  5:20 UTC (permalink / raw)
  To: Alexander Duyck, David Ahern
  Cc: Jason Gunthorpe, Saeed Mahameed, David S. Miller, Jakub Kicinski,
	Leon Romanovsky, Netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, Ertman, David M, Dan Williams, Kiran Patil,
	Greg KH


> From: Alexander Duyck <alexander.duyck@gmail.com>
> Sent: Friday, December 18, 2020 8:41 AM
> 
> On Thu, Dec 17, 2020 at 5:30 PM David Ahern <dsahern@gmail.com> wrote:
> >
> > On 12/16/20 3:53 PM, Alexander Duyck wrote:
> The problem is PCIe DMA wasn't designed to function as a network switch
> fabric and when we start talking about a 400Gb NIC trying to handle over 256
> subfunctions it will quickly reduce the receive/transmit throughput to gigabit
> or less speeds when encountering hardware multicast/broadcast replication.
> With 256 subfunctions a simple 60B ARP could consume more than 19KB of
> PCIe bandwidth due to the packet having to be duplicated so many times. In
> my mind it should be simpler to simply clone a single skb 256 times, forward
> that to the switchdev ports, and have them perform a bypass (if available) to
> deliver it to the subfunctions. That's why I was thinking it might be a good
> time to look at addressing it.
Linux tc framework is rich enough to address this and already used by openvswitch for years now.
Today arp broadcasts are not offloaded. They go through software patch and replicated in the L2 domain.
It is a solved problem for many years now.


* RE: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-18  5:20                                   ` Parav Pandit
@ 2020-12-18  5:36                                     ` Parav Pandit
  2020-12-18 16:01                                     ` Alexander Duyck
  1 sibling, 0 replies; 65+ messages in thread
From: Parav Pandit @ 2020-12-18  5:36 UTC (permalink / raw)
  To: Parav Pandit, Alexander Duyck, David Ahern
  Cc: Jason Gunthorpe, Saeed Mahameed, David S. Miller, Jakub Kicinski,
	Leon Romanovsky, Netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, Ertman, David M, Dan Williams, Kiran Patil,
	Greg KH



> From: Parav Pandit <parav@nvidia.com>
> Sent: Friday, December 18, 2020 10:51 AM
> 
> > From: Alexander Duyck <alexander.duyck@gmail.com>
> > Sent: Friday, December 18, 2020 8:41 AM
> >
> > On Thu, Dec 17, 2020 at 5:30 PM David Ahern <dsahern@gmail.com> wrote:
> > >
> > > On 12/16/20 3:53 PM, Alexander Duyck wrote:
> > The problem is PCIe DMA wasn't designed to function as a network
> > switch fabric and when we start talking about a 400Gb NIC trying to
> > handle over 256 subfunctions it will quickly reduce the
> > receive/transmit throughput to gigabit or less speeds when encountering
> hardware multicast/broadcast replication.
> > With 256 subfunctions a simple 60B ARP could consume more than 19KB of
> > PCIe bandwidth due to the packet having to be duplicated so many
> > times. In my mind it should be simpler to simply clone a single skb
> > 256 times, forward that to the switchdev ports, and have them perform
> > a bypass (if available) to deliver it to the subfunctions. That's why
> > I was thinking it might be a good time to look at addressing it.
> Linux tc framework is rich enough to address this and already used by openvswitch for
> years now.
> Today arp broadcasts are not offloaded. They go through software patch and
s/patch/path

> replicated in the L2 domain.
> It is a solved problem for many years now.


* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-18  3:55                                   ` David Ahern
@ 2020-12-18 15:54                                     ` Alexander Duyck
  0 siblings, 0 replies; 65+ messages in thread
From: Alexander Duyck @ 2020-12-18 15:54 UTC (permalink / raw)
  To: David Ahern
  Cc: Jason Gunthorpe, Saeed Mahameed, David S. Miller, Jakub Kicinski,
	Leon Romanovsky, Netdev, linux-rdma, David Ahern, Jacob Keller,
	Sridhar Samudrala, Ertman, David M, Dan Williams, Kiran Patil,
	Greg KH

On Thu, Dec 17, 2020 at 7:55 PM David Ahern <dsahern@gmail.com> wrote:
>
> On 12/17/20 8:11 PM, Alexander Duyck wrote:
> > On Thu, Dec 17, 2020 at 5:30 PM David Ahern <dsahern@gmail.com> wrote:
> >>
> >> On 12/16/20 3:53 PM, Alexander Duyck wrote:
> >>> The problem in my case was based on a past experience where east-west
> >>> traffic became a problem and it was easily shown that bypassing the
> >>> NIC for traffic was significantly faster.
> >>
> >> If a deployment expects a lot of east-west traffic *within a host* why
> >> is it using hardware based isolation like a VF? That is a side effect of
> >> a design choice that is remedied by other options.
> >
> > I am mostly talking about this from past experience as I had seen a
> > few instances when I was at Intel when it became an issue. Sales and
> > marketing people aren't exactly happy when you tell them "don't sell
> > that" in response to them trying to sell a feature into an area where
>
> that's a problem engineers can never solve...
>
> > it doesn't belong. Generally they want a solution. The macvlan offload
> > addressed these issues as the replication and local switching can be
> > handled in software.
>
> well, I guess almost never. :-)
>
> >
> > The problem is PCIe DMA wasn't designed to function as a network
> > switch fabric and when we start talking about a 400Gb NIC trying to
> > handle over 256 subfunctions it will quickly reduce the
> > receive/transmit throughput to gigabit or less speeds when
> > encountering hardware multicast/broadcast replication. With 256
> > subfunctions a simple 60B ARP could consume more than 19KB of PCIe
> > bandwidth due to the packet having to be duplicated so many times. In
> > my mind it should be simpler to simply clone a single skb 256 times,
> > forward that to the switchdev ports, and have them perform a bypass
> > (if available) to deliver it to the subfunctions. That's why I was
> > thinking it might be a good time to look at addressing it.
> >
>
> east-west traffic within a host is more than likely the same tenant in
> which case a proper VPC is a better solution than the s/w stack trying
> to detect and guess that a bypass is needed. Guesses cost cycles in the
> fast path which is a net loss - and even more so as speeds increase.

Yes, but then the hardware limitations end up deciding the layout of
the network. I lean towards more flexibility and more configuration
options being a good thing, rather than us needing to dictate how a
network has to be constructed based on the limitations of the hardware
and software.

For broadcast/multicast it isn't so much a guess. It would be a single
bit test. My understanding is the switchdev setup is already making
special cases for things like broadcast/multicast due to the extra
overhead incurred. I mentioned ARP because in many cases it has to be
offloaded specifically due to these sorts of issues.


* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-18  5:20                                   ` Parav Pandit
  2020-12-18  5:36                                     ` Parav Pandit
@ 2020-12-18 16:01                                     ` Alexander Duyck
  2020-12-18 18:01                                       ` Parav Pandit
  1 sibling, 1 reply; 65+ messages in thread
From: Alexander Duyck @ 2020-12-18 16:01 UTC (permalink / raw)
  To: Parav Pandit
  Cc: David Ahern, Jason Gunthorpe, Saeed Mahameed, David S. Miller,
	Jakub Kicinski, Leon Romanovsky, Netdev, linux-rdma, David Ahern,
	Jacob Keller, Sridhar Samudrala, Ertman, David M, Dan Williams,
	Kiran Patil, Greg KH

On Thu, Dec 17, 2020 at 9:20 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Alexander Duyck <alexander.duyck@gmail.com>
> > Sent: Friday, December 18, 2020 8:41 AM
> >
> > On Thu, Dec 17, 2020 at 5:30 PM David Ahern <dsahern@gmail.com> wrote:
> > >
> > > On 12/16/20 3:53 PM, Alexander Duyck wrote:
> > The problem is PCIe DMA wasn't designed to function as a network switch
> > fabric and when we start talking about a 400Gb NIC trying to handle over 256
> > subfunctions it will quickly reduce the receive/transmit throughput to gigabit
> > or less speeds when encountering hardware multicast/broadcast replication.
> > With 256 subfunctions a simple 60B ARP could consume more than 19KB of
> > PCIe bandwidth due to the packet having to be duplicated so many times. In
> > my mind it should be simpler to simply clone a single skb 256 times, forward
> > that to the switchdev ports, and have them perform a bypass (if available) to
> > deliver it to the subfunctions. That's why I was thinking it might be a good
> > time to look at addressing it.
> Linux tc framework is rich enough to address this and already used by openvswitch for years now.
> Today arp broadcasts are not offloaded. They go through software path and replicated in the L2 domain.
> It is a solved problem for many years now.

When you say they are replicated in the L2 domain I assume you are
talking about the software switch connected to the switchdev ports. My
question is what are you doing with them after you have replicated
them? I'm assuming they are being sent to the other switchdev ports
which will require a DMA to transmit them, and another to receive them
on the VF/SF, or are you saying something else is going on here?

My argument is that this cuts into both the transmit and receive DMA
bandwidth of the NIC, and could easily be avoided in the case where SF
exists in the same kernel as the switchdev port by identifying the
multicast bit being set and simply bypassing the device.


* RE: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-18 16:01                                     ` Alexander Duyck
@ 2020-12-18 18:01                                       ` Parav Pandit
  2020-12-18 19:22                                         ` Alexander Duyck
  0 siblings, 1 reply; 65+ messages in thread
From: Parav Pandit @ 2020-12-18 18:01 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Ahern, Jason Gunthorpe, Saeed Mahameed, David S. Miller,
	Jakub Kicinski, Leon Romanovsky, Netdev, linux-rdma, David Ahern,
	Jacob Keller, Sridhar Samudrala, Ertman, David M, Dan Williams,
	Kiran Patil, Greg KH


> From: Alexander Duyck <alexander.duyck@gmail.com>
> Sent: Friday, December 18, 2020 9:31 PM
> 
> On Thu, Dec 17, 2020 at 9:20 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Alexander Duyck <alexander.duyck@gmail.com>
> > > Sent: Friday, December 18, 2020 8:41 AM
> > >
> > > On Thu, Dec 17, 2020 at 5:30 PM David Ahern <dsahern@gmail.com>
> > > wrote:
> > > >
> > > > On 12/16/20 3:53 PM, Alexander Duyck wrote:
> > > The problem is PCIe DMA wasn't designed to function as a network
> > > switch fabric, and when we start talking about a 400Gb NIC trying to
> > > handle over 256 subfunctions it will quickly reduce the
> > > receive/transmit throughput to gigabit speeds or less when encountering
> > > hardware multicast/broadcast replication.
> > > With 256 subfunctions a simple 60B ARP could consume more than 19KB
> > > of PCIe bandwidth due to the packet having to be duplicated so many
> > > times. In my mind it should be simpler to clone a single skb
> > > 256 times, forward that to the switchdev ports, and have them
> > > perform a bypass (if available) to deliver it to the subfunctions.
> > > That's why I was thinking it might be a good time to look at addressing it.
> > The Linux tc framework is rich enough to address this and has already
> > been used by openvswitch for years now.
> > Today ARP broadcasts are not offloaded. They go through the software
> > path and are replicated in the L2 domain.
> > It has been a solved problem for many years now.
> 
> When you say they are replicated in the L2 domain, I assume you are
> talking about the software switch connected to the switchdev ports.
Yes.

> My question is what are you doing with them after you have replicated
> them? I'm assuming they are being sent to the other switchdev ports,
> which will require a DMA to transmit them and another to receive them
> on the VF/SF, or are you saying something else is going on here?
>
Yes, that is correct.

> My argument is that this cuts into both the transmit and receive DMA
> bandwidth of the NIC, and could easily be avoided in the case where the SF
> exists in the same kernel as the switchdev port by identifying that the
> multicast bit is set and simply bypassing the device.
It probably can be avoided, but it's probably not worth it for occasional
ARP packets on a neighbor cache miss. If I am not mistaken, even some
recent HW can forward such ARP packets to multiple switchdev ports with
commit 7ee3f6d2486e, without following the above-described DMA path.
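
For concreteness, the software path being discussed amounts to one
clone and one transmit per switchdev port, much as the bridge flood
path does; the following is a minimal sketch, with struct rep_port and
the rep_list walk as hypothetical stand-ins for a driver's own
representor bookkeeping and locking:

/* A sketch of the software replication described above: one clone and
 * one transmit per switchdev port, in the spirit of the bridge flood
 * path.  struct rep_port and rep_list are hypothetical, not taken
 * from any driver.
 */
#include <linux/list.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

struct rep_port {
        struct net_device *dev;
        struct list_head list;
};

static void flood_to_reps(struct sk_buff *skb, struct list_head *rep_list)
{
        struct rep_port *port;

        list_for_each_entry(port, rep_list, list) {
                struct sk_buff *clone = skb_clone(skb, GFP_ATOMIC);

                if (!clone)
                        continue;
                clone->dev = port->dev;
                /* each transmit below costs a PCIe DMA, and the SF/VF
                 * behind the representor pays a second DMA to receive
                 * the copy
                 */
                dev_queue_xmit(clone);
        }
        consume_skb(skb);
}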

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-18 18:01                                       ` Parav Pandit
@ 2020-12-18 19:22                                         ` Alexander Duyck
  2020-12-18 20:18                                           ` Jason Gunthorpe
  0 siblings, 1 reply; 65+ messages in thread
From: Alexander Duyck @ 2020-12-18 19:22 UTC (permalink / raw)
  To: Parav Pandit
  Cc: David Ahern, Jason Gunthorpe, Saeed Mahameed, David S. Miller,
	Jakub Kicinski, Leon Romanovsky, Netdev, linux-rdma, David Ahern,
	Jacob Keller, Sridhar Samudrala, Ertman, David M, Dan Williams,
	Kiran Patil, Greg KH

On Fri, Dec 18, 2020 at 10:01 AM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Alexander Duyck <alexander.duyck@gmail.com>
> > Sent: Friday, December 18, 2020 9:31 PM
> >
> > On Thu, Dec 17, 2020 at 9:20 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Alexander Duyck <alexander.duyck@gmail.com>
> > > > Sent: Friday, December 18, 2020 8:41 AM
> > > >
> > > > On Thu, Dec 17, 2020 at 5:30 PM David Ahern <dsahern@gmail.com>
> > > > wrote:
> > > > >
> > > > > On 12/16/20 3:53 PM, Alexander Duyck wrote:
> > > > The problem is PCIe DMA wasn't designed to function as a network
> > > > switch fabric, and when we start talking about a 400Gb NIC trying to
> > > > handle over 256 subfunctions it will quickly reduce the
> > > > receive/transmit throughput to gigabit speeds or less when encountering
> > > > hardware multicast/broadcast replication.
> > > > With 256 subfunctions a simple 60B ARP could consume more than 19KB
> > > > of PCIe bandwidth due to the packet having to be duplicated so many
> > > > times. In my mind it should be simpler to clone a single skb
> > > > 256 times, forward that to the switchdev ports, and have them
> > > > perform a bypass (if available) to deliver it to the subfunctions.
> > > > That's why I was thinking it might be a good time to look at addressing it.
> > > The Linux tc framework is rich enough to address this and has already
> > > been used by openvswitch for years now.
> > > Today ARP broadcasts are not offloaded. They go through the software
> > > path and are replicated in the L2 domain.
> > > It has been a solved problem for many years now.
> >
> > When you say they are replicated in the L2 domain, I assume you are
> > talking about the software switch connected to the switchdev ports.
> Yes.
>
> > My question is what are you doing with them after you have replicated
> > them? I'm assuming they are being sent to the other switchdev ports,
> > which will require a DMA to transmit them and another to receive them
> > on the VF/SF, or are you saying something else is going on here?
> >
> Yes, that is correct.
>
> > My argument is that this cuts into both the transmit and receive DMA
> > bandwidth of the NIC, and could easily be avoided in the case where the SF
> > exists in the same kernel as the switchdev port by identifying that the
> > multicast bit is set and simply bypassing the device.
> It probably can be avoided, but it's probably not worth it for occasional
> ARP packets on a neighbor cache miss. If I am not mistaken, even some
> recent HW can forward such ARP packets to multiple switchdev ports with
> commit 7ee3f6d2486e, without following the above-described DMA path.

Even with that, it sounds like it will have to DMA the packet to
multiple Rx destinations even if it is only performing the Tx DMA
once. The Intel NICs did all this replication in hardware as well, so
that is what I was thinking of when I was talking about the
replication behavior seen with SR-IOV.

Basically what I am getting at is that this could be used as an
architectural feature for switchdev to avoid creating increased DMA
overhead for broadcast, multicast, and unknown-unicast traffic. I'm not
saying this is anything mandatory, and I would be perfectly okay with
something like this being optional and defaulted to off. In my mind
the setup only has the interfaces handling traffic to single-point
destinations, so that at most you are looking at a 2x bump in PCIe
bandwidth for those cases where the packet ends up needing to go out
the physical port. It would essentially be a software offload to avoid
saturating the PCIe bus.

This setup would only work if both interfaces are present in the same
kernel, though, which is why I chose to bring it up now: historically,
SR-IOV hasn't been associated with containers due to the limited
number of interfaces that could be created.
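
A minimal sketch of what that bypass could look like, assuming a
hypothetical sf_peer_netdev() lookup from representor to in-kernel SF
(is_multicast_ether_addr() and dev_forward_skb() are existing kernel
helpers; the latter is what veth uses for the same kind of in-kernel
delivery):

/* A sketch of the bypass idea above: when the destination has the
 * multicast/broadcast bit set and the SF lives in the same kernel,
 * hand the packet straight to the SF's netdev instead of DMAing it
 * through the device.  sf_peer_netdev() is hypothetical.
 */
#include <linux/etherdevice.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

struct net_device *sf_peer_netdev(struct net_device *rep); /* hypothetical */

static bool rep_try_bypass(struct sk_buff *skb, struct net_device *rep)
{
        struct net_device *peer;

        /* the broadcast address also has the group bit set */
        if (!is_multicast_ether_addr(eth_hdr(skb)->h_dest))
                return false;   /* unicast keeps the normal DMA path */

        peer = sf_peer_netdev(rep);
        if (!peer)
                return false;   /* SF not in this kernel, fall back */

        /* delivered in software: no transmit DMA, no receive DMA */
        return dev_forward_skb(peer, skb) == NET_RX_SUCCESS;
}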

Also, as far as the patch-count complaints I have seen in a few threads
go, I would be fine with splitting things up so that the devlink and aux
device creation get handled in one set, and then we work out the
details of mlx5 attaching to the devices and spawning the SF
netdevs in another, since that seems to be where the debate is.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-18 19:22                                         ` Alexander Duyck
@ 2020-12-18 20:18                                           ` Jason Gunthorpe
  2020-12-19  0:03                                             ` Alexander Duyck
  0 siblings, 1 reply; 65+ messages in thread
From: Jason Gunthorpe @ 2020-12-18 20:18 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Parav Pandit, David Ahern, Saeed Mahameed, David S. Miller,
	Jakub Kicinski, Leon Romanovsky, Netdev, linux-rdma, David Ahern,
	Jacob Keller, Sridhar Samudrala, Ertman, David M, Dan Williams,
	Kiran Patil, Greg KH

On Fri, Dec 18, 2020 at 11:22:12AM -0800, Alexander Duyck wrote:

> Also, as far as the patch-count complaints I have seen in a few threads
> go, I would be fine with splitting things up so that the devlink and aux
> device creation get handled in one set, and then we work out the
> details of mlx5 attaching to the devices and spawning the SF
> netdevs in another, since that seems to be where the debate is.

It doesn't work like that. The aux device creates an mlx5_core, and
every mlx5_core can run mlx5_en.

This really isn't the series to raise this feature request. Adding an
optional shortcut path to a VF/SF is something that can be done later
if up-to-date benchmarks show it has value. There is no blocker in
this model to doing that.

Jason

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [net-next v4 00/15] Add mlx5 subfunction support
  2020-12-18 20:18                                           ` Jason Gunthorpe
@ 2020-12-19  0:03                                             ` Alexander Duyck
  0 siblings, 0 replies; 65+ messages in thread
From: Alexander Duyck @ 2020-12-19  0:03 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Parav Pandit, David Ahern, Saeed Mahameed, David S. Miller,
	Jakub Kicinski, Leon Romanovsky, Netdev, linux-rdma, David Ahern,
	Jacob Keller, Sridhar Samudrala, Ertman, David M, Dan Williams,
	Kiran Patil, Greg KH

On Fri, Dec 18, 2020 at 12:18 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Fri, Dec 18, 2020 at 11:22:12AM -0800, Alexander Duyck wrote:
>
> > Also, as far as the patch-count complaints I have seen in a few threads
> > go, I would be fine with splitting things up so that the devlink and aux
> > device creation get handled in one set, and then we work out the
> > details of mlx5 attaching to the devices and spawning the SF
> > netdevs in another, since that seems to be where the debate is.
>
> It doesn't work like that. The aux device creates an mlx5_core, and
> every mlx5_core can run mlx5_en.

The aux device is still just a device, isn't it? Until you register a
driver that latches on to "MLX5_SF_DEV_ID_NAME", the device is there
and should function like an unbound/idle VF.
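
Roughly, "latching on" would look like the sketch below using the
auxiliary bus API; the "mlx5_core.sf" match name is an assumption based
on the naming in this series, not a copy of the mlx5 code:

/* Sketch of a driver binding to the SF auxiliary device.  Until such
 * a driver registers and probes, the aux device simply sits on the
 * bus unbound.
 */
#include <linux/auxiliary_bus.h>
#include <linux/module.h>

static int sf_probe(struct auxiliary_device *adev,
                    const struct auxiliary_device_id *id)
{
        /* set up the SF's core instance here */
        return 0;
}

static void sf_remove(struct auxiliary_device *adev)
{
        /* tear down what probe set up */
}

static const struct auxiliary_device_id sf_id_table[] = {
        { .name = "mlx5_core.sf" },     /* assumed match string */
        {}
};
MODULE_DEVICE_TABLE(auxiliary, sf_id_table);

static struct auxiliary_driver sf_driver = {
        .probe = sf_probe,
        .remove = sf_remove,
        .id_table = sf_id_table,
};
module_auxiliary_driver(sf_driver);

MODULE_LICENSE("GPL");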

> This really isn't the series to raise this feature request. Adding an
> optional shortcut path to a VF/SF is something that can be done later
> if up-to-date benchmarks show it has value. There is no blocker in
> this model to doing that.

That is the point of contention that we probably won't solve. I feel
like if we are going to put out something that is an alternative to
the macvlan and SR-IOV approaches, it should address the issues
resolved in both. Right now this just throws up yet another
virtualization solution that ends up reintroducing many of the
existing problems we already had with SR-IOV, and possibly
exacerbating them by allowing for an even greater number of
subfunctions.

Anyway I have made my opinion known, and I am not in the position to
block the patches since I am not the maintainer.

^ permalink raw reply	[flat|nested] 65+ messages in thread

end of thread, other threads:[~2020-12-19  0:04 UTC | newest]

Thread overview: 65+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-14 21:43 [net-next v4 00/15] Add mlx5 subfunction support Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 01/15] net/mlx5: Fix compilation warning for 32-bit platform Saeed Mahameed
2020-12-14 22:31   ` Alexander Duyck
2020-12-14 22:45     ` Saeed Mahameed
2020-12-15  4:59     ` Leon Romanovsky
2020-12-14 21:43 ` [net-next v4 02/15] devlink: Prepare code to fill multiple port function attributes Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 03/15] devlink: Introduce PCI SF port flavour and port attribute Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 04/15] devlink: Support add and delete devlink port Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 05/15] devlink: Support get and set state of port function Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 06/15] net/mlx5: Introduce vhca state event notifier Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 07/15] net/mlx5: SF, Add auxiliary device support Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 08/15] net/mlx5: SF, Add auxiliary device driver Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 09/15] net/mlx5: E-switch, Prepare eswitch to handle SF vport Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 10/15] net/mlx5: E-switch, Add eswitch helpers for " Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 11/15] net/mlx5: SF, Add port add delete functionality Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 12/15] net/mlx5: SF, Port function state change support Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 13/15] devlink: Add devlink port documentation Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 14/15] devlink: Extend devlink port documentation for subfunctions Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 15/15] net/mlx5: Add devlink subfunction port documentation Saeed Mahameed
2020-12-15  1:53 ` [net-next v4 00/15] Add mlx5 subfunction support Alexander Duyck
2020-12-15  2:44   ` David Ahern
2020-12-15 16:16     ` Alexander Duyck
2020-12-15 16:59       ` Parav Pandit
2020-12-15  5:48   ` Parav Pandit
2020-12-15 18:47     ` Alexander Duyck
2020-12-15 20:05       ` Saeed Mahameed
2020-12-15 21:03       ` Jason Gunthorpe
2020-12-16  1:12       ` Edwin Peer
2020-12-16  2:39         ` Jason Gunthorpe
2020-12-16  3:12         ` Alexander Duyck
2020-12-15 20:59     ` David Ahern
2020-12-15  6:15   ` Saeed Mahameed
2020-12-15 19:12     ` Alexander Duyck
2020-12-15 20:35       ` Saeed Mahameed
2020-12-15 21:28         ` Jakub Kicinski
2020-12-16  6:50           ` Leon Romanovsky
2020-12-16 17:59             ` Saeed Mahameed
2020-12-15 21:41         ` Alexander Duyck
2020-12-16  0:19           ` Jason Gunthorpe
2020-12-16  2:19             ` Alexander Duyck
2020-12-16  3:03               ` Jason Gunthorpe
2020-12-16  4:13                 ` Alexander Duyck
2020-12-16  4:45                   ` Parav Pandit
2020-12-16 13:33                   ` Jason Gunthorpe
2020-12-16 16:31                     ` Alexander Duyck
2020-12-16 17:51                       ` Jason Gunthorpe
2020-12-16 19:27                         ` Alexander Duyck
2020-12-16 20:35                           ` Jason Gunthorpe
2020-12-16 22:53                             ` Alexander Duyck
2020-12-17  0:38                               ` Jason Gunthorpe
2020-12-17 18:48                                 ` Alexander Duyck
2020-12-17 19:40                                   ` Jason Gunthorpe
2020-12-17 21:05                                     ` Alexander Duyck
2020-12-18  0:08                                       ` Jason Gunthorpe
2020-12-18  1:30                               ` David Ahern
2020-12-18  3:11                                 ` Alexander Duyck
2020-12-18  3:55                                   ` David Ahern
2020-12-18 15:54                                     ` Alexander Duyck
2020-12-18  5:20                                   ` Parav Pandit
2020-12-18  5:36                                     ` Parav Pandit
2020-12-18 16:01                                     ` Alexander Duyck
2020-12-18 18:01                                       ` Parav Pandit
2020-12-18 19:22                                         ` Alexander Duyck
2020-12-18 20:18                                           ` Jason Gunthorpe
2020-12-19  0:03                                             ` Alexander Duyck
