netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net-next v2 00/10] devlink subdev
@ 2019-11-08 16:18 Yuval Avnery
  2019-11-08 16:18 ` [PATCH net-next v2 01/10] devlink: Introduce subdev Yuval Avnery
                   ` (10 more replies)
  0 siblings, 11 replies; 23+ messages in thread
From: Yuval Avnery @ 2019-11-08 16:18 UTC (permalink / raw)
  To: netdev
  Cc: jiri, saeedm, leon, davem, jakub.kicinski, shuah, danielj, parav,
	andrew.gospodarek, michael.chan, Yuval Avnery

This patchset introduces devlink subdev.

Currently, legacy tools do not provide a comprehensive solution that can
be used in both SmartNic and non-SmartNic mode.
Subdev represents a device that exists on the ASIC but is not necessarily
visible to the kernel.

Using devlink ports is not suitable because:

1. Those devices aren't necessarily network devices (such as NVMe devices)
   and doesn’t have E-switch representation. Therefore, there is need for
   more generic representation of PCI VF.
2. Some attributes are not necessarily pure port attributes
   (number of MSIX vectors)
3. It creates a confusing devlink topology, with multiple port flavours
   and indices.

Subdev will be created along with flavour and attributes.
Some network subdevs may be linked with a devlink port.

This is also aimed to replace "ip link vf" commands as they are strongly
linked to the PCI topology and allow access only to enabled VFs.
Even though current patchset and example is limited to MAC address
of the VF, this interface will allow to manage PF, VF, mdev in
SmartNic and non SmartNic modes, in unified way for networking and
non-networking devices via devlink instance.

Use case example:
An example system view of a networking ASIC (aka SmartNIC), can be seen in
below diagram, where devlink eswitch instance and PCI PF and/or VFs are
situated on two different CPU subsystems:


      +------------------------------+
      |                              |
      |             HOST             |
      |                              |
      |   +----+-----+-----+-----+   |
      |   | PF | VF0 | VF1 | VF2 |   |
      +---+----+-----------+-----+---+
                 PCI1|
          +---+------------+
              |
     +----------------------------------------+
     |        |         SmartNic              |
     |   +----+-------------------------+     |
     |   |                              |     |
     |   |               NIC            |     |
     |   |                              |     |
     |   +---------------------+--------+     |
     |                         |  PCI2        |
     |         +-----+---------+--+           |
     |               |                        |
     |      +-----+--+--+--------------+      |
     |      |     | PF  |              |      |
     |      |     +-----+              |      |
     |      |      Embedded CPU        |      |
     |      |                          |      |
     |      +--------------------------+      |
     |                                        |
     +----------------------------------------+

The below diagram shows an example devlink subdev topology where some
subdevs are connected to devlink ports::



            (PF0)    (VF0)    (VF1)           (NVME VF2)
         +--------------------------+         +--------+
         | devlink| devlink| devlink|         | devlink|
         | subdev | subdev | subdev |         | subdev |
         |    0   |    1   |    2   |         |    3   |
         +--------------------------+         +--------+
              |        |        |
              |        |        |
              |        |        |
     +----------------------------------+
     |   | devlink| devlink| devlink|   |
     |   |  port  |  port  |  port  |   |
     |   |    0   |    1   |    2   |   |
     |   +--------------------------+   |
     |                                  |
     |                                  |
     |           E-switch               |
     |                                  |
     |                                  |
     |          +--------+              |
     |          | uplink |              |
     |          | devlink|              |
     |          |  port  |              |
     +----------------------------------+
 
Devlink command example:

A privileged user wants to configure a VF's hw_addr, before the VF is
enabled.

$ devlink subdev set pci/0000:03:00.0/1 hw_addr 10:22:33:44:55:66

$ devlink subdev show pci/0000:03:00.0/1
pci/0000:03:00.0/1: flavour pcivf pf 0 vf 0 port_index 1 hw_addr 10:22:33:44:55:66

$ devlink subdev show pci/0000:03:00.0/1 -jp
{
    "subdev": {
        "pci/0000:03:00.0/1": {
            "flavour": "pcivf",
            "pf": 0,
            "vf": 0,
            "port_index": 1,
            "hw_addr": "10:22:33:44:55:66"
        }
    }
}

Patches 1-6 adds devlink support for subdev.
Patches 7-8 adds netdevsim implementation and test.
Patch 10 adds mlx5 subdev creation and hw_addr get/set.

---

v1->v2:
 - vdev -> subdev
 - Update cover letter and add more examples.

Yuval Avnery (10):
  devlink: Introduce subdev
  devlink: Add PCI attributes support for subdev
  devlink: Add port with subdev register support
  devlink: Support subdev HW address get
  devlink: Support subdev HW address set
  Documentation: Add devlink-subdev documentation.
  netdevsim: Add max_vfs to bus_dev
  netdevsim: Add devlink subdev creation
  netdevsim: Add devlink subdev sefltest for netdevsim
  net/mlx5e: Add support for devlink subdev and subdev hw_addr set/show

 Documentation/networking/devlink-subdev.rst   |  96 +++++
 .../net/ethernet/mellanox/mlx5/core/devlink.c |  92 ++++
 .../net/ethernet/mellanox/mlx5/core/devlink.h |   5 +
 .../net/ethernet/mellanox/mlx5/core/en_main.c |   6 +-
 .../net/ethernet/mellanox/mlx5/core/en_rep.c  |   9 +-
 .../net/ethernet/mellanox/mlx5/core/eswitch.c |  19 +-
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |   7 +-
 .../mellanox/mlx5/core/eswitch_offloads.c     |   8 +
 drivers/net/netdevsim/Makefile                |   2 +-
 drivers/net/netdevsim/bus.c                   |  39 +-
 drivers/net/netdevsim/dev.c                   |   9 +-
 drivers/net/netdevsim/netdevsim.h             |  11 +
 drivers/net/netdevsim/subdev.c                |  99 +++++
 include/net/devlink.h                         |  46 ++
 include/uapi/linux/devlink.h                  |  16 +
 net/core/devlink.c                            | 403 +++++++++++++++++-
 .../drivers/net/netdevsim/devlink.sh          |  55 ++-
 17 files changed, 894 insertions(+), 28 deletions(-)
 create mode 100644 Documentation/networking/devlink-subdev.rst
 create mode 100644 drivers/net/netdevsim/subdev.c

-- 
2.17.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH net-next v2 01/10] devlink: Introduce subdev
  2019-11-08 16:18 [PATCH net-next v2 00/10] devlink subdev Yuval Avnery
@ 2019-11-08 16:18 ` Yuval Avnery
  2019-11-08 16:18 ` [PATCH net-next v2 02/10] devlink: Add PCI attributes support for subdev Yuval Avnery
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Yuval Avnery @ 2019-11-08 16:18 UTC (permalink / raw)
  To: netdev
  Cc: jiri, saeedm, leon, davem, jakub.kicinski, shuah, danielj, parav,
	andrew.gospodarek, michael.chan, Yuval Avnery

Subdev represents a device that exists on the ASIC which is not necessarily
visible to the kernel.
Just like devlink port, the driver should set index and attributes to
the subdev.

Example:
A VF represented by subdev pci/0000:03:00.0/1, before it is enabled.
The PF subdev is represented by pci/0000:03:00.0/0.

$ devlink dev show
pci/0000:03:00.0

$ devlink subdev show
pci/0000:03:00.0/0
pci/0000:03:00.0/1

Signed-off-by: Yuval Avnery <yuvalav@mellanox.com>
Suggested-by: Parav Pandit <parav@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/devlink.h        |  18 +++
 include/uapi/linux/devlink.h |   7 +
 net/core/devlink.c           | 261 +++++++++++++++++++++++++++++++++++
 3 files changed, 286 insertions(+)

diff --git a/include/net/devlink.h b/include/net/devlink.h
index 6bf3b9e0595a..9d6b50b906ee 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -23,6 +23,7 @@ struct devlink_ops;
 struct devlink {
 	struct list_head list;
 	struct list_head port_list;
+	struct list_head subdev_list;
 	struct list_head sb_list;
 	struct list_head dpipe_table_list;
 	struct list_head resource_list;
@@ -73,6 +74,8 @@ struct devlink_port_attrs {
 	};
 };
 
+struct devlink_subdev;
+
 struct devlink_port {
 	struct list_head list;
 	struct list_head param_list;
@@ -89,6 +92,9 @@ struct devlink_port {
 	struct delayed_work type_warn_dw;
 };
 
+struct devlink_subdev_attrs {
+};
+
 struct devlink_sb_pool_info {
 	enum devlink_sb_pool_type pool_type;
 	u32 size;
@@ -743,6 +749,9 @@ struct devlink_ops {
 			       const struct devlink_trap_group *group);
 };
 
+struct devlink_subdev_ops {
+};
+
 static inline void *devlink_priv(struct devlink *devlink)
 {
 	BUG_ON(!devlink);
@@ -802,6 +811,15 @@ void devlink_port_attrs_pci_vf_set(struct devlink_port *devlink_port,
 				   const unsigned char *switch_id,
 				   unsigned char switch_id_len,
 				   u16 pf, u16 vf);
+struct devlink_subdev *
+devlink_subdev_create(struct devlink *devlink,
+		      unsigned int subdev_index,
+		      const struct devlink_subdev_ops *ops,
+		      const struct devlink_subdev_attrs *attrs,
+		      void *priv);
+void devlink_subdev_destroy(struct devlink_subdev *devlink_subdev);
+struct devlink *devlink_subdev_devlink(struct devlink_subdev *devlink_subdev);
+void *devlink_subdev_priv(struct devlink_subdev *devlink_subdev);
 int devlink_sb_register(struct devlink *devlink, unsigned int sb_index,
 			u32 size, u16 ingress_pools_count,
 			u16 egress_pools_count, u16 ingress_tc_count,
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index b558ea88b766..df894091f26a 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -117,6 +117,11 @@ enum devlink_command {
 	DEVLINK_CMD_TRAP_GROUP_NEW,
 	DEVLINK_CMD_TRAP_GROUP_DEL,
 
+	DEVLINK_CMD_SUBDEV_GET,		/* can dump */
+	DEVLINK_CMD_SUBDEV_SET,
+	DEVLINK_CMD_SUBDEV_NEW,
+	DEVLINK_CMD_SUBDEV_DEL,
+
 	/* add new commands above here */
 	__DEVLINK_CMD_MAX,
 	DEVLINK_CMD_MAX = __DEVLINK_CMD_MAX - 1
@@ -425,6 +430,8 @@ enum devlink_attr {
 	DEVLINK_ATTR_NETNS_PID,			/* u32 */
 	DEVLINK_ATTR_NETNS_ID,			/* u32 */
 
+	DEVLINK_ATTR_SUBDEV_INDEX,		/* u32 */
+
 	/* add new attributes above here, update the policy in devlink.c */
 
 	__DEVLINK_ATTR_MAX,
diff --git a/net/core/devlink.c b/net/core/devlink.c
index 97e9a2246929..5ab2fc2f2d82 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -31,6 +31,15 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/devlink.h>
 
+struct devlink_subdev {
+	struct list_head list;
+	struct devlink *devlink;
+	unsigned int index;
+	const struct devlink_subdev_ops *ops;
+	struct devlink_subdev_attrs attrs;
+	void *priv;
+};
+
 static struct devlink_dpipe_field devlink_dpipe_fields_ethernet[] = {
 	{
 		.name = "destination mac",
@@ -183,6 +192,46 @@ static struct devlink_port *devlink_port_get_from_info(struct devlink *devlink,
 	return devlink_port_get_from_attrs(devlink, info->attrs);
 }
 
+static struct devlink_subdev *
+devlink_subdev_get_by_index(struct devlink *devlink, unsigned int subdev_index)
+{
+	struct devlink_subdev *devlink_subdev;
+
+	list_for_each_entry(devlink_subdev, &devlink->subdev_list, list) {
+		if (devlink_subdev->index == subdev_index)
+			return devlink_subdev;
+	}
+	return NULL;
+}
+
+static bool devlink_subdev_index_exists(struct devlink *devlink,
+					unsigned int subdev_index)
+{
+	return devlink_subdev_get_by_index(devlink, subdev_index);
+}
+
+static struct devlink_subdev *
+devlink_subdev_get_from_attrs(struct devlink *devlink, struct nlattr **attrs)
+{
+	struct devlink_subdev *devlink_subdev;
+	u32 subdev_index;
+
+	if (!attrs[DEVLINK_ATTR_SUBDEV_INDEX])
+		return ERR_PTR(-EINVAL);
+
+	subdev_index = nla_get_u32(attrs[DEVLINK_ATTR_SUBDEV_INDEX]);
+	devlink_subdev = devlink_subdev_get_by_index(devlink, subdev_index);
+	if (!devlink_subdev)
+		return ERR_PTR(-ENODEV);
+	return devlink_subdev;
+}
+
+static struct devlink_subdev *
+devlink_subdev_get_from_info(struct devlink *devlink, struct genl_info *info)
+{
+	return devlink_subdev_get_from_attrs(devlink, info->attrs);
+}
+
 struct devlink_sb {
 	struct list_head list;
 	unsigned int index;
@@ -386,6 +435,7 @@ devlink_region_snapshot_get_by_id(struct devlink_region *region, u32 id)
 #define DEVLINK_NL_FLAG_NEED_DEVLINK	BIT(0)
 #define DEVLINK_NL_FLAG_NEED_PORT	BIT(1)
 #define DEVLINK_NL_FLAG_NEED_SB		BIT(2)
+#define DEVLINK_NL_FLAG_NEED_SUBDEV       BIT(3)
 
 /* The per devlink instance lock is taken by default in the pre-doit
  * operation, yet several commands do not require this. The global
@@ -418,6 +468,15 @@ static int devlink_nl_pre_doit(const struct genl_ops *ops,
 			goto unlock;
 		}
 		info->user_ptr[0] = devlink_port;
+	} else if (ops->internal_flags & DEVLINK_NL_FLAG_NEED_SUBDEV) {
+		struct devlink_subdev *devlink_subdev;
+
+		devlink_subdev = devlink_subdev_get_from_info(devlink, info);
+		if (IS_ERR(devlink_subdev)) {
+			err = PTR_ERR(devlink_subdev);
+			goto unlock;
+		}
+		info->user_ptr[0] = devlink_subdev;
 	}
 	if (ops->internal_flags & DEVLINK_NL_FLAG_NEED_SB) {
 		struct devlink_sb *devlink_sb;
@@ -644,6 +703,54 @@ static void devlink_port_notify(struct devlink_port *devlink_port,
 				msg, 0, DEVLINK_MCGRP_CONFIG, GFP_KERNEL);
 }
 
+static int devlink_nl_subdev_fill(struct sk_buff *msg, struct devlink *devlink,
+				  struct devlink_subdev *devlink_subdev,
+				  enum devlink_command cmd, u32 subdevid,
+				  u32 seq, int flags)
+{
+	void *hdr;
+
+	hdr = genlmsg_put(msg, subdevid, seq, &devlink_nl_family, flags, cmd);
+	if (!hdr)
+		return -EMSGSIZE;
+
+	if (devlink_nl_put_handle(msg, devlink))
+		goto nla_put_failure;
+	if (nla_put_u32(msg, DEVLINK_ATTR_SUBDEV_INDEX, devlink_subdev->index))
+		goto nla_put_failure;
+
+	genlmsg_end(msg, hdr);
+	return 0;
+
+nla_put_failure:
+	genlmsg_cancel(msg, hdr);
+	return -EMSGSIZE;
+}
+
+static void devlink_subdev_notify(struct devlink_subdev *devlink_subdev,
+				  enum devlink_command cmd)
+{
+	struct devlink *devlink = devlink_subdev->devlink;
+	struct sk_buff *msg;
+	int err;
+
+	WARN_ON(cmd != DEVLINK_CMD_SUBDEV_NEW && cmd != DEVLINK_CMD_SUBDEV_DEL);
+
+	msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!msg)
+		return;
+
+	err = devlink_nl_subdev_fill(msg, devlink,
+				     devlink_subdev, cmd, 0, 0, 0);
+	if (err) {
+		nlmsg_free(msg);
+		return;
+	}
+
+	genlmsg_multicast_netns(&devlink_nl_family, devlink_net(devlink),
+				msg, 0, DEVLINK_MCGRP_CONFIG, GFP_KERNEL);
+}
+
 static int devlink_nl_cmd_get_doit(struct sk_buff *skb, struct genl_info *info)
 {
 	struct devlink *devlink = info->user_ptr[0];
@@ -844,6 +951,78 @@ static int devlink_nl_cmd_port_unsplit_doit(struct sk_buff *skb,
 	return devlink_port_unsplit(devlink, port_index, info->extack);
 }
 
+static int devlink_nl_cmd_subdev_get_doit(struct sk_buff *skb,
+					  struct genl_info *info)
+{
+	struct devlink_subdev *devlink_subdev = info->user_ptr[0];
+	struct devlink *devlink = devlink_subdev->devlink;
+	struct sk_buff *msg;
+	int err;
+
+	msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!msg)
+		return -ENOMEM;
+
+	err = devlink_nl_subdev_fill(msg, devlink, devlink_subdev,
+				     DEVLINK_CMD_SUBDEV_NEW,
+				     info->snd_portid, info->snd_seq, 0);
+	if (err) {
+		nlmsg_free(msg);
+		return err;
+	}
+
+	return genlmsg_reply(msg, info);
+}
+
+static int devlink_nl_cmd_subdev_get_dumpit(struct sk_buff *msg,
+					    struct netlink_callback *cb)
+{
+	struct devlink_subdev *devlink_subdev;
+	struct list_head *subdev_list;
+	struct devlink *devlink;
+	int start = cb->args[0];
+	int idx = 0;
+	int err;
+
+	mutex_lock(&devlink_mutex);
+	list_for_each_entry(devlink, &devlink_list, list) {
+		subdev_list = &devlink->subdev_list;
+
+		if (!net_eq(devlink_net(devlink), sock_net(msg->sk)))
+			continue;
+		mutex_lock(&devlink->lock);
+		list_for_each_entry(devlink_subdev, subdev_list, list) {
+			if (idx < start) {
+				idx++;
+				continue;
+			}
+			err = devlink_nl_subdev_fill(msg, devlink,
+						     devlink_subdev,
+						     DEVLINK_CMD_NEW,
+						     NETLINK_CB(cb->skb).portid,
+						     cb->nlh->nlmsg_seq,
+						     NLM_F_MULTI);
+			if (err) {
+				mutex_unlock(&devlink->lock);
+				goto out;
+			}
+			idx++;
+		}
+		mutex_unlock(&devlink->lock);
+	}
+out:
+	mutex_unlock(&devlink_mutex);
+
+	cb->args[0] = idx;
+	return msg->len;
+}
+
+static int devlink_nl_cmd_subdev_set_doit(struct sk_buff *skb,
+					  struct genl_info *info)
+{
+	return 0;
+}
+
 static int devlink_nl_sb_fill(struct sk_buff *msg, struct devlink *devlink,
 			      struct devlink_sb *devlink_sb,
 			      enum devlink_command cmd, u32 portid,
@@ -5902,6 +6081,7 @@ static const struct nla_policy devlink_nl_policy[DEVLINK_ATTR_MAX + 1] = {
 	[DEVLINK_ATTR_NETNS_PID] = { .type = NLA_U32 },
 	[DEVLINK_ATTR_NETNS_FD] = { .type = NLA_U32 },
 	[DEVLINK_ATTR_NETNS_ID] = { .type = NLA_U32 },
+	[DEVLINK_ATTR_SUBDEV_INDEX] = { .type = NLA_U32 },
 };
 
 static const struct genl_ops devlink_nl_ops[] = {
@@ -5928,6 +6108,19 @@ static const struct genl_ops devlink_nl_ops[] = {
 		.flags = GENL_ADMIN_PERM,
 		.internal_flags = DEVLINK_NL_FLAG_NEED_PORT,
 	},
+	{
+		.cmd = DEVLINK_CMD_SUBDEV_GET,
+		.doit = devlink_nl_cmd_subdev_get_doit,
+		.dumpit = devlink_nl_cmd_subdev_get_dumpit,
+		.internal_flags = DEVLINK_NL_FLAG_NEED_SUBDEV,
+		/* can be retrieved by unprivileged users */
+	},
+	{
+		.cmd = DEVLINK_CMD_SUBDEV_SET,
+		.doit = devlink_nl_cmd_subdev_set_doit,
+		.flags = GENL_ADMIN_PERM,
+		.internal_flags = DEVLINK_NL_FLAG_NEED_SUBDEV,
+	},
 	{
 		.cmd = DEVLINK_CMD_PORT_SPLIT,
 		.validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
@@ -6268,6 +6461,7 @@ struct devlink *devlink_alloc(const struct devlink_ops *ops, size_t priv_size)
 	devlink->ops = ops;
 	__devlink_net_set(devlink, &init_net);
 	INIT_LIST_HEAD(&devlink->port_list);
+	INIT_LIST_HEAD(&devlink->subdev_list);
 	INIT_LIST_HEAD(&devlink->sb_list);
 	INIT_LIST_HEAD_RCU(&devlink->dpipe_table_list);
 	INIT_LIST_HEAD(&devlink->resource_list);
@@ -6332,6 +6526,7 @@ void devlink_free(struct devlink *devlink)
 	WARN_ON(!list_empty(&devlink->dpipe_table_list));
 	WARN_ON(!list_empty(&devlink->sb_list));
 	WARN_ON(!list_empty(&devlink->port_list));
+	WARN_ON(!list_empty(&devlink->subdev_list));
 
 	kfree(devlink);
 }
@@ -6657,6 +6852,72 @@ static int __devlink_port_phys_port_name_get(struct devlink_port *devlink_port,
 	return 0;
 }
 
+void *devlink_subdev_priv(struct devlink_subdev *devlink_subdev)
+{
+	return devlink_subdev->priv;
+}
+EXPORT_SYMBOL_GPL(devlink_subdev_priv);
+
+/**
+ *	devlink_subdev_create - create devlink subdev
+ *
+ *	@devlink: devlink
+ *	@subdev_index: driver-specific numerical identifier of the subdev
+ *	@ops: subdev specific ops
+ *	@attrs: subdev specific attributes
+ *	@priv: driver private data
+ *
+ *	Create devlink subdev with provided subdev index. User can use
+ *	any indexing, even hw-related one.
+ */
+struct devlink_subdev *
+devlink_subdev_create(struct devlink *devlink,
+		      unsigned int subdev_index,
+		      const struct devlink_subdev_ops *ops,
+		      const struct devlink_subdev_attrs *attrs,
+		      void *priv)
+{
+	struct devlink_subdev *devlink_subdev;
+
+	devlink_subdev = kzalloc(sizeof(*devlink_subdev), GFP_KERNEL);
+	if (!devlink_subdev)
+		return ERR_PTR(-ENOMEM);
+
+	mutex_lock(&devlink->lock);
+	if (devlink_subdev_index_exists(devlink, subdev_index)) {
+		mutex_unlock(&devlink->lock);
+		kfree(devlink_subdev);
+		return ERR_PTR(-EEXIST);
+	}
+	devlink_subdev->devlink = devlink;
+	devlink_subdev->priv = priv;
+	devlink_subdev->index = subdev_index;
+	devlink_subdev->ops = ops;
+	devlink_subdev->attrs = *attrs;
+	list_add_tail(&devlink_subdev->list, &devlink->subdev_list);
+	mutex_unlock(&devlink->lock);
+	devlink_subdev_notify(devlink_subdev, DEVLINK_CMD_SUBDEV_NEW);
+	return devlink_subdev;
+}
+EXPORT_SYMBOL_GPL(devlink_subdev_create);
+
+/**
+ *	devlink_subdev_destroy - destroy devlink subdev
+ *
+ *	@devlink_subdev: devlink subdev
+ */
+void devlink_subdev_destroy(struct devlink_subdev *devlink_subdev)
+{
+	struct devlink *devlink = devlink_subdev->devlink;
+
+	devlink_subdev_notify(devlink_subdev, DEVLINK_CMD_SUBDEV_DEL);
+	mutex_lock(&devlink->lock);
+	list_del(&devlink_subdev->list);
+	mutex_unlock(&devlink->lock);
+	kfree(devlink_subdev);
+}
+EXPORT_SYMBOL_GPL(devlink_subdev_destroy);
+
 int devlink_sb_register(struct devlink *devlink, unsigned int sb_index,
 			u32 size, u16 ingress_pools_count,
 			u16 egress_pools_count, u16 ingress_tc_count,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH net-next v2 02/10] devlink: Add PCI attributes support for subdev
  2019-11-08 16:18 [PATCH net-next v2 00/10] devlink subdev Yuval Avnery
  2019-11-08 16:18 ` [PATCH net-next v2 01/10] devlink: Introduce subdev Yuval Avnery
@ 2019-11-08 16:18 ` Yuval Avnery
  2019-11-08 16:18 ` [PATCH net-next v2 03/10] devlink: Add port with subdev register support Yuval Avnery
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Yuval Avnery @ 2019-11-08 16:18 UTC (permalink / raw)
  To: netdev
  Cc: jiri, saeedm, leon, davem, jakub.kicinski, shuah, danielj, parav,
	andrew.gospodarek, michael.chan, Yuval Avnery

When subdev represents a PCI device it should reflect the PCI device
type (PF/VF) and it's index.

Example:

$ devlink subdev show pci/0000:03:00.0/1 -jp
{
    "subdev": {
        "pci/0000:03:00.0/1": {
            "flavour": "pcivf",
            "pf": 0,
            "vf": 0
        }
    }
}

$ devlink subdev show pci/0000:03:00.0/1
pci/0000:03:00.0/1: flavour pcivf pf 0 vf 0

Signed-off-by: Yuval Avnery <yuvalav@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/devlink.h        | 18 +++++++++++++
 include/uapi/linux/devlink.h |  8 ++++++
 net/core/devlink.c           | 52 ++++++++++++++++++++++++++++++++++++
 3 files changed, 78 insertions(+)

diff --git a/include/net/devlink.h b/include/net/devlink.h
index 9d6b50b906ee..1e12a9be5c23 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -92,7 +92,21 @@ struct devlink_port {
 	struct delayed_work type_warn_dw;
 };
 
+struct devlink_subdev_pci_pf_attrs {
+	u16 pf;	/* Associated PCI PF for this subdev. */
+};
+
+struct devlink_subdev_pci_vf_attrs {
+	u16 pf;	/* Associated PCI PF for this subdev. */
+	u16 vf;	/* Associated PCI VF for of the PCI PF for this subdev. */
+};
+
 struct devlink_subdev_attrs {
+	enum devlink_subdev_flavour flavour;
+	union {
+		struct devlink_subdev_pci_pf_attrs pci_pf;
+		struct devlink_subdev_pci_vf_attrs pci_vf;
+	};
 };
 
 struct devlink_sb_pool_info {
@@ -820,6 +834,10 @@ devlink_subdev_create(struct devlink *devlink,
 void devlink_subdev_destroy(struct devlink_subdev *devlink_subdev);
 struct devlink *devlink_subdev_devlink(struct devlink_subdev *devlink_subdev);
 void *devlink_subdev_priv(struct devlink_subdev *devlink_subdev);
+void devlink_subdev_attrs_pci_pf_init(struct devlink_subdev_attrs *attrs,
+				      u16 pf);
+void devlink_subdev_attrs_pci_vf_init(struct devlink_subdev_attrs *attrs,
+				      u16 pf, u16 vf);
 int devlink_sb_register(struct devlink *devlink, unsigned int sb_index,
 			u32 size, u16 ingress_pools_count,
 			u16 egress_pools_count, u16 ingress_tc_count,
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index df894091f26a..da79ffad9c5a 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -194,6 +194,11 @@ enum devlink_port_flavour {
 				      */
 };
 
+enum devlink_subdev_flavour {
+	DEVLINK_SUBDEV_FLAVOUR_PCI_PF,
+	DEVLINK_SUBDEV_FLAVOUR_PCI_VF,
+};
+
 enum devlink_param_cmode {
 	DEVLINK_PARAM_CMODE_RUNTIME,
 	DEVLINK_PARAM_CMODE_DRIVERINIT,
@@ -431,6 +436,9 @@ enum devlink_attr {
 	DEVLINK_ATTR_NETNS_ID,			/* u32 */
 
 	DEVLINK_ATTR_SUBDEV_INDEX,		/* u32 */
+	DEVLINK_ATTR_SUBDEV_FLAVOUR,		/* u16 */
+	DEVLINK_ATTR_SUBDEV_PF_INDEX,		/* u32 */
+	DEVLINK_ATTR_SUBDEV_VF_INDEX,		/* u32 */
 
 	/* add new attributes above here, update the policy in devlink.c */
 
diff --git a/net/core/devlink.c b/net/core/devlink.c
index 5ab2fc2f2d82..76f5fba7d242 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -708,6 +708,7 @@ static int devlink_nl_subdev_fill(struct sk_buff *msg, struct devlink *devlink,
 				  enum devlink_command cmd, u32 subdevid,
 				  u32 seq, int flags)
 {
+	struct devlink_subdev_attrs *attrs = &devlink_subdev->attrs;
 	void *hdr;
 
 	hdr = genlmsg_put(msg, subdevid, seq, &devlink_nl_family, flags, cmd);
@@ -719,6 +720,24 @@ static int devlink_nl_subdev_fill(struct sk_buff *msg, struct devlink *devlink,
 	if (nla_put_u32(msg, DEVLINK_ATTR_SUBDEV_INDEX, devlink_subdev->index))
 		goto nla_put_failure;
 
+	if (nla_put_u16(msg, DEVLINK_ATTR_SUBDEV_FLAVOUR, attrs->flavour))
+		goto nla_put_failure;
+	switch (attrs->flavour) {
+	case DEVLINK_SUBDEV_FLAVOUR_PCI_PF:
+		if (nla_put_u32(msg, DEVLINK_ATTR_SUBDEV_PF_INDEX,
+				attrs->pci_pf.pf))
+			goto nla_put_failure;
+		break;
+	case DEVLINK_SUBDEV_FLAVOUR_PCI_VF:
+		if (nla_put_u32(msg, DEVLINK_ATTR_SUBDEV_PF_INDEX,
+				attrs->pci_vf.pf))
+			goto nla_put_failure;
+		if (nla_put_u32(msg, DEVLINK_ATTR_SUBDEV_VF_INDEX,
+				attrs->pci_vf.vf))
+			goto nla_put_failure;
+		break;
+	}
+
 	genlmsg_end(msg, hdr);
 	return 0;
 
@@ -6082,6 +6101,9 @@ static const struct nla_policy devlink_nl_policy[DEVLINK_ATTR_MAX + 1] = {
 	[DEVLINK_ATTR_NETNS_FD] = { .type = NLA_U32 },
 	[DEVLINK_ATTR_NETNS_ID] = { .type = NLA_U32 },
 	[DEVLINK_ATTR_SUBDEV_INDEX] = { .type = NLA_U32 },
+	[DEVLINK_ATTR_SUBDEV_FLAVOUR] = { .type = NLA_U16 },
+	[DEVLINK_ATTR_SUBDEV_PF_INDEX] = { .type = NLA_U32 },
+	[DEVLINK_ATTR_SUBDEV_VF_INDEX] = { .type = NLA_U32 },
 };
 
 static const struct genl_ops devlink_nl_ops[] = {
@@ -6918,6 +6940,36 @@ void devlink_subdev_destroy(struct devlink_subdev *devlink_subdev)
 }
 EXPORT_SYMBOL_GPL(devlink_subdev_destroy);
 
+/**
+ *	devlink_subdev_attrs_pci_pf_int - Init PCI PF subdev attributes
+ *
+ *	@devlink_subdev_attr: devlink subdev attributes
+ *	@pf: associated PF index for the devlink subdev instance
+ */
+void devlink_subdev_attrs_pci_pf_init(struct devlink_subdev_attrs *attrs,
+				      u16 pf)
+{
+	attrs->flavour = DEVLINK_SUBDEV_FLAVOUR_PCI_PF;
+	attrs->pci_pf.pf = pf;
+}
+EXPORT_SYMBOL_GPL(devlink_subdev_attrs_pci_pf_init);
+
+/**
+ *	devlink_subdev_attrs_pci_vf_init - Init PCI VF subdev attributes
+ *
+ *	@devlink_subdev: devlink subdev
+ *	@pf: associated PF index for the devlink subdev instance
+ *	@vf: associated VF index for the devlink subdev instance
+ */
+void devlink_subdev_attrs_pci_vf_init(struct devlink_subdev_attrs *attrs,
+				      u16 pf, u16 vf)
+{
+	attrs->flavour = DEVLINK_SUBDEV_FLAVOUR_PCI_VF;
+	attrs->pci_vf.pf = pf;
+	attrs->pci_vf.vf = vf;
+}
+EXPORT_SYMBOL_GPL(devlink_subdev_attrs_pci_vf_init);
+
 int devlink_sb_register(struct devlink *devlink, unsigned int sb_index,
 			u32 size, u16 ingress_pools_count,
 			u16 egress_pools_count, u16 ingress_tc_count,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH net-next v2 03/10] devlink: Add port with subdev register support
  2019-11-08 16:18 [PATCH net-next v2 00/10] devlink subdev Yuval Avnery
  2019-11-08 16:18 ` [PATCH net-next v2 01/10] devlink: Introduce subdev Yuval Avnery
  2019-11-08 16:18 ` [PATCH net-next v2 02/10] devlink: Add PCI attributes support for subdev Yuval Avnery
@ 2019-11-08 16:18 ` Yuval Avnery
  2019-11-08 16:18 ` [PATCH net-next v2 04/10] devlink: Support subdev HW address get Yuval Avnery
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Yuval Avnery @ 2019-11-08 16:18 UTC (permalink / raw)
  To: netdev
  Cc: jiri, saeedm, leon, davem, jakub.kicinski, shuah, danielj, parav,
	andrew.gospodarek, michael.chan, Yuval Avnery

A subdev may represent a network device, so it may be linked to
a devlink port.
Added devlink_port_register_with_subdev to allow subdev-port linkage.

Example:

$ devlink subdev show pci/0000:03:00.0/1 -jp
{
    "subdev": {
        "pci/0000:03:00.0/1": {
            "flavour": "pcivf",
            "pf": 0,
            "vf": 0,
            "port_index": 1
        }
    }
}
$ devlink subdev show pci/0000:03:00.0/1
pci/0000:03:00.0/1: flavour pcivf pf 0 vf 0 port_index 1

$ devlink port show pci/0000:03:00.0/1
pci/0000:03:00.0/1: type eth netdev ens2f0_0 flavour pcivf pfnum 0 vfnum 0 subdev_index 1

Signed-off-by: Yuval Avnery <yuvalav@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/devlink.h |  5 ++++
 net/core/devlink.c    | 53 +++++++++++++++++++++++++++++++++++++++----
 2 files changed, 54 insertions(+), 4 deletions(-)

diff --git a/include/net/devlink.h b/include/net/devlink.h
index 1e12a9be5c23..0cedd6d34ef8 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -90,6 +90,7 @@ struct devlink_port {
 	void *type_dev;
 	struct devlink_port_attrs attrs;
 	struct delayed_work type_warn_dw;
+	struct devlink_subdev *devlink_subdev; /* linked subdev */
 };
 
 struct devlink_subdev_pci_pf_attrs {
@@ -806,6 +807,10 @@ void devlink_free(struct devlink *devlink);
 int devlink_port_register(struct devlink *devlink,
 			  struct devlink_port *devlink_port,
 			  unsigned int port_index);
+int devlink_port_register_with_subdev(struct devlink *devlink,
+				    struct devlink_port *devlink_port,
+				    unsigned int port_index,
+				    struct devlink_subdev *devlink_subdev);
 void devlink_port_unregister(struct devlink_port *devlink_port);
 void devlink_port_type_eth_set(struct devlink_port *devlink_port,
 			       struct net_device *netdev);
diff --git a/net/core/devlink.c b/net/core/devlink.c
index 76f5fba7d242..0c97c51dea0d 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -37,6 +37,7 @@ struct devlink_subdev {
 	unsigned int index;
 	const struct devlink_subdev_ops *ops;
 	struct devlink_subdev_attrs attrs;
+	struct devlink_port *devlink_port; /* linked port */
 	void *priv;
 };
 
@@ -664,6 +665,10 @@ static int devlink_nl_port_fill(struct sk_buff *msg, struct devlink *devlink,
 			goto nla_put_failure_type_locked;
 	}
 	spin_unlock_bh(&devlink_port->type_lock);
+	if (devlink_port->devlink_subdev)
+		if (nla_put_u32(msg, DEVLINK_ATTR_SUBDEV_INDEX,
+				devlink_port->devlink_subdev->index))
+			goto nla_put_failure;
 	if (devlink_nl_port_attrs_put(msg, devlink_port))
 		goto nla_put_failure;
 
@@ -738,6 +743,11 @@ static int devlink_nl_subdev_fill(struct sk_buff *msg, struct devlink *devlink,
 		break;
 	}
 
+	if (devlink_subdev->devlink_port)
+		if (nla_put_u32(msg, DEVLINK_ATTR_PORT_INDEX,
+				devlink_subdev->devlink_port->index))
+			goto nla_put_failure;
+
 	genlmsg_end(msg, hdr);
 	return 0;
 
@@ -6587,11 +6597,12 @@ static void devlink_port_type_warn_cancel(struct devlink_port *devlink_port)
 }
 
 /**
- *	devlink_port_register - Register devlink port
+ *	devlink_port_register_with_subdev - Register devlink port
  *
  *	@devlink: devlink
  *	@devlink_port: devlink port
  *	@port_index: driver-specific numerical identifier of the port
+ *	@devlink_subdev: subdev to link with the port
  *
  *	Register devlink port with provided port index. User can use
  *	any indexing, even hw-related one. devlink_port structure
@@ -6599,9 +6610,10 @@ static void devlink_port_type_warn_cancel(struct devlink_port *devlink_port)
  *	Note that the caller should take care of zeroing the devlink_port
  *	structure.
  */
-int devlink_port_register(struct devlink *devlink,
-			  struct devlink_port *devlink_port,
-			  unsigned int port_index)
+int devlink_port_register_with_subdev(struct devlink *devlink,
+				      struct devlink_port *devlink_port,
+				      unsigned int port_index,
+				      struct devlink_subdev *devlink_subdev)
 {
 	mutex_lock(&devlink->lock);
 	if (devlink_port_index_exists(devlink, port_index)) {
@@ -6611,15 +6623,42 @@ int devlink_port_register(struct devlink *devlink,
 	devlink_port->devlink = devlink;
 	devlink_port->index = port_index;
 	devlink_port->registered = true;
+	devlink_port->devlink_subdev = devlink_subdev;
 	spin_lock_init(&devlink_port->type_lock);
 	list_add_tail(&devlink_port->list, &devlink->port_list);
 	INIT_LIST_HEAD(&devlink_port->param_list);
 	mutex_unlock(&devlink->lock);
 	INIT_DELAYED_WORK(&devlink_port->type_warn_dw, &devlink_port_type_warn);
 	devlink_port_type_warn_schedule(devlink_port);
+	if (devlink_subdev) {
+		devlink_subdev->devlink_port = devlink_port;
+		devlink_subdev_notify(devlink_subdev, DEVLINK_CMD_SUBDEV_NEW);
+	}
 	devlink_port_notify(devlink_port, DEVLINK_CMD_PORT_NEW);
 	return 0;
 }
+EXPORT_SYMBOL_GPL(devlink_port_register_with_subdev);
+
+/**
+ *	devlink_port_register - Register devlink port
+ *
+ *	@devlink: devlink
+ *	@devlink_port: devlink port
+ *	@port_index: driver-specific numerical identifier of the port
+ *
+ *	Register devlink port with provided port index. User can use
+ *	any indexing, even hw-related one. devlink_port structure
+ *	is convenient to be embedded inside user driver private structure.
+ *	Note that the caller should take care of zeroing the devlink_port
+ *	structure.
+ */
+int devlink_port_register(struct devlink *devlink,
+			  struct devlink_port *devlink_port,
+			  unsigned int port_index)
+{
+	return devlink_port_register_with_subdev(devlink, devlink_port,
+					       port_index, NULL);
+}
 EXPORT_SYMBOL_GPL(devlink_port_register);
 
 /**
@@ -6629,9 +6668,14 @@ EXPORT_SYMBOL_GPL(devlink_port_register);
  */
 void devlink_port_unregister(struct devlink_port *devlink_port)
 {
+	struct devlink_subdev *devlink_subdev = devlink_port->devlink_subdev;
 	struct devlink *devlink = devlink_port->devlink;
 
 	devlink_port_type_warn_cancel(devlink_port);
+	if (devlink_subdev) {
+		devlink_subdev->devlink_port = NULL;
+		devlink_subdev_notify(devlink_subdev, DEVLINK_CMD_SUBDEV_NEW);
+	}
 	devlink_port_notify(devlink_port, DEVLINK_CMD_PORT_DEL);
 	mutex_lock(&devlink->lock);
 	list_del(&devlink_port->list);
@@ -6932,6 +6976,7 @@ void devlink_subdev_destroy(struct devlink_subdev *devlink_subdev)
 {
 	struct devlink *devlink = devlink_subdev->devlink;
 
+	WARN_ON(devlink_subdev->devlink_port);
 	devlink_subdev_notify(devlink_subdev, DEVLINK_CMD_SUBDEV_DEL);
 	mutex_lock(&devlink->lock);
 	list_del(&devlink_subdev->list);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH net-next v2 04/10] devlink: Support subdev HW address get
  2019-11-08 16:18 [PATCH net-next v2 00/10] devlink subdev Yuval Avnery
                   ` (2 preceding siblings ...)
  2019-11-08 16:18 ` [PATCH net-next v2 03/10] devlink: Add port with subdev register support Yuval Avnery
@ 2019-11-08 16:18 ` Yuval Avnery
  2019-11-08 18:00   ` Parav Pandit
  2019-11-08 16:18 ` [PATCH net-next v2 05/10] devlink: Support subdev HW address set Yuval Avnery
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 23+ messages in thread
From: Yuval Avnery @ 2019-11-08 16:18 UTC (permalink / raw)
  To: netdev
  Cc: jiri, saeedm, leon, davem, jakub.kicinski, shuah, danielj, parav,
	andrew.gospodarek, michael.chan, Yuval Avnery

Allow privileged user to get the HW address of a subdev.

Example:

$ devlink subdev show pci/0000:03:00.0/1
pci/0000:03:00.0/1: flavour pcivf pf 0 vf 0 port_index 1 hw_addr 00:23:35:af:35:34

$ devlink subdev show pci/0000:03:00.0/1 -pj
{
    "subdev": {
        "pci/0000:03:00.0/1": {
            "flavour": "pcivf",
            "pf": 0,
            "vf": 0,
            "port_index": 1,
            "hw_addr": "00:23:35:af:35:34"
        }
    }
}

Signed-off-by: Yuval Avnery <yuvalav@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/devlink.h        |  3 +++
 include/uapi/linux/devlink.h |  1 +
 net/core/devlink.c           | 34 +++++++++++++++++++++++++++-------
 3 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/include/net/devlink.h b/include/net/devlink.h
index 0cedd6d34ef8..5917260e5748 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -765,6 +765,9 @@ struct devlink_ops {
 };
 
 struct devlink_subdev_ops {
+	int (*hw_addr_get)(struct devlink_subdev *subdev,
+			   u8 *hw_addr, struct netlink_ext_ack *extack);
+	unsigned int hw_addr_len;
 };
 
 static inline void *devlink_priv(struct devlink *devlink)
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index da79ffad9c5a..c7a7ad4c4a20 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -439,6 +439,7 @@ enum devlink_attr {
 	DEVLINK_ATTR_SUBDEV_FLAVOUR,		/* u16 */
 	DEVLINK_ATTR_SUBDEV_PF_INDEX,		/* u32 */
 	DEVLINK_ATTR_SUBDEV_VF_INDEX,		/* u32 */
+	DEVLINK_ATTR_SUBDEV_HW_ADDR,		/* binary */
 
 	/* add new attributes above here, update the policy in devlink.c */
 
diff --git a/net/core/devlink.c b/net/core/devlink.c
index 0c97c51dea0d..7d6e3da8a64c 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -711,10 +711,13 @@ static void devlink_port_notify(struct devlink_port *devlink_port,
 static int devlink_nl_subdev_fill(struct sk_buff *msg, struct devlink *devlink,
 				  struct devlink_subdev *devlink_subdev,
 				  enum devlink_command cmd, u32 subdevid,
-				  u32 seq, int flags)
+				  u32 seq, int flags,
+				  struct netlink_ext_ack *extack)
 {
 	struct devlink_subdev_attrs *attrs = &devlink_subdev->attrs;
+	const struct devlink_subdev_ops *ops = devlink_subdev->ops;
 	void *hdr;
+	int err;
 
 	hdr = genlmsg_put(msg, subdevid, seq, &devlink_nl_family, flags, cmd);
 	if (!hdr)
@@ -748,6 +751,19 @@ static int devlink_nl_subdev_fill(struct sk_buff *msg, struct devlink *devlink,
 				devlink_subdev->devlink_port->index))
 			goto nla_put_failure;
 
+	if (ops && ops->hw_addr_get) {
+		u8 hw_addr[MAX_ADDR_LEN];
+
+		err = ops->hw_addr_get(devlink_subdev, hw_addr, extack);
+		if (err) {
+			genlmsg_cancel(msg, hdr);
+			return err;
+		}
+		if (nla_put(msg, DEVLINK_ATTR_SUBDEV_HW_ADDR,
+			    ops->hw_addr_len, hw_addr))
+			goto nla_put_failure;
+	}
+
 	genlmsg_end(msg, hdr);
 	return 0;
 
@@ -769,8 +785,8 @@ static void devlink_subdev_notify(struct devlink_subdev *devlink_subdev,
 	if (!msg)
 		return;
 
-	err = devlink_nl_subdev_fill(msg, devlink,
-				     devlink_subdev, cmd, 0, 0, 0);
+	err = devlink_nl_subdev_fill(msg, devlink, devlink_subdev, cmd,
+				     0, 0, 0, NULL);
 	if (err) {
 		nlmsg_free(msg);
 		return;
@@ -993,8 +1009,8 @@ static int devlink_nl_cmd_subdev_get_doit(struct sk_buff *skb,
 		return -ENOMEM;
 
 	err = devlink_nl_subdev_fill(msg, devlink, devlink_subdev,
-				     DEVLINK_CMD_SUBDEV_NEW,
-				     info->snd_portid, info->snd_seq, 0);
+				     DEVLINK_CMD_SUBDEV_NEW, info->snd_portid,
+				     info->snd_seq, 0, info->extack);
 	if (err) {
 		nlmsg_free(msg);
 		return err;
@@ -1011,7 +1027,7 @@ static int devlink_nl_cmd_subdev_get_dumpit(struct sk_buff *msg,
 	struct devlink *devlink;
 	int start = cb->args[0];
 	int idx = 0;
-	int err;
+	int err = 0;
 
 	mutex_lock(&devlink_mutex);
 	list_for_each_entry(devlink, &devlink_list, list) {
@@ -1030,7 +1046,7 @@ static int devlink_nl_cmd_subdev_get_dumpit(struct sk_buff *msg,
 						     DEVLINK_CMD_NEW,
 						     NETLINK_CB(cb->skb).portid,
 						     cb->nlh->nlmsg_seq,
-						     NLM_F_MULTI);
+						     NLM_F_MULTI, NULL);
 			if (err) {
 				mutex_unlock(&devlink->lock);
 				goto out;
@@ -1041,6 +1057,8 @@ static int devlink_nl_cmd_subdev_get_dumpit(struct sk_buff *msg,
 	}
 out:
 	mutex_unlock(&devlink_mutex);
+	if (err != -EMSGSIZE)
+		return err;
 
 	cb->args[0] = idx;
 	return msg->len;
@@ -6114,6 +6132,8 @@ static const struct nla_policy devlink_nl_policy[DEVLINK_ATTR_MAX + 1] = {
 	[DEVLINK_ATTR_SUBDEV_FLAVOUR] = { .type = NLA_U16 },
 	[DEVLINK_ATTR_SUBDEV_PF_INDEX] = { .type = NLA_U32 },
 	[DEVLINK_ATTR_SUBDEV_VF_INDEX] = { .type = NLA_U32 },
+	[DEVLINK_ATTR_SUBDEV_HW_ADDR] = { .type = NLA_BINARY,
+					.len = MAX_ADDR_LEN },
 };
 
 static const struct genl_ops devlink_nl_ops[] = {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH net-next v2 05/10] devlink: Support subdev HW address set
  2019-11-08 16:18 [PATCH net-next v2 00/10] devlink subdev Yuval Avnery
                   ` (3 preceding siblings ...)
  2019-11-08 16:18 ` [PATCH net-next v2 04/10] devlink: Support subdev HW address get Yuval Avnery
@ 2019-11-08 16:18 ` Yuval Avnery
  2019-11-08 16:18 ` [PATCH net-next v2 06/10] Documentation: Add devlink-subdev documentation Yuval Avnery
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Yuval Avnery @ 2019-11-08 16:18 UTC (permalink / raw)
  To: netdev
  Cc: jiri, saeedm, leon, davem, jakub.kicinski, shuah, danielj, parav,
	andrew.gospodarek, michael.chan, Yuval Avnery

Allow privileged user to set the HW address of a subdev.

Example:

$ devlink subdev set pci/0000:03:00.0/1 hw_addr 00:23:35:af:35:34

$ devlink subdev show pci/0000:03:00.0/1
pci/0000:03:00.0/1: flavour pcivf pf 0 vf 0 port_index 1 hw_addr 00:23:35:af:35:34

$ devlink subdev show pci/0000:03:00.0/1 -pj
{
    "subdev": {
        "pci/0000:03:00.0/1": {
            "flavour": "pcivf",
            "pf": 0,
            "vf": 0,
            "port_index": 1,
            "hw_addr": "00:23:35:af:35:34"
        }
    }
}

Signed-off-by: Yuval Avnery <yuvalav@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/devlink.h |  2 ++
 net/core/devlink.c    | 17 +++++++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/include/net/devlink.h b/include/net/devlink.h
index 5917260e5748..b2cd3505bba4 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -765,6 +765,8 @@ struct devlink_ops {
 };
 
 struct devlink_subdev_ops {
+	int (*hw_addr_set)(struct devlink_subdev *subdev,
+			   u8 *hw_addr, struct netlink_ext_ack *extack);
 	int (*hw_addr_get)(struct devlink_subdev *subdev,
 			   u8 *hw_addr, struct netlink_ext_ack *extack);
 	unsigned int hw_addr_len;
diff --git a/net/core/devlink.c b/net/core/devlink.c
index 7d6e3da8a64c..10867f82b7d3 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -1067,6 +1067,23 @@ static int devlink_nl_cmd_subdev_get_dumpit(struct sk_buff *msg,
 static int devlink_nl_cmd_subdev_set_doit(struct sk_buff *skb,
 					  struct genl_info *info)
 {
+	struct nlattr *nla_hw_addr = info->attrs[DEVLINK_ATTR_SUBDEV_HW_ADDR];
+	struct devlink_subdev *devlink_subdev = info->user_ptr[0];
+	const struct devlink_subdev_ops *ops;
+	int err;
+
+	ops = devlink_subdev->ops;
+	if (nla_hw_addr) {
+		u8 *hw_addr;
+
+		if (!ops || !ops->hw_addr_set)
+			return -EOPNOTSUPP;
+
+		hw_addr = nla_data(nla_hw_addr);
+		err = ops->hw_addr_set(devlink_subdev, hw_addr, info->extack);
+		if (err)
+			return err;
+	}
 	return 0;
 }
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH net-next v2 06/10] Documentation: Add devlink-subdev documentation.
  2019-11-08 16:18 [PATCH net-next v2 00/10] devlink subdev Yuval Avnery
                   ` (4 preceding siblings ...)
  2019-11-08 16:18 ` [PATCH net-next v2 05/10] devlink: Support subdev HW address set Yuval Avnery
@ 2019-11-08 16:18 ` Yuval Avnery
  2019-11-08 16:18 ` [PATCH net-next v2 07/10] netdevsim: Add max_vfs to bus_dev Yuval Avnery
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Yuval Avnery @ 2019-11-08 16:18 UTC (permalink / raw)
  To: netdev
  Cc: jiri, saeedm, leon, davem, jakub.kicinski, shuah, danielj, parav,
	andrew.gospodarek, michael.chan, Yuval Avnery

Add devlink-subdev documentation.

Signed-off-by: Yuval Avnery <yuvalav@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 Documentation/networking/devlink-subdev.rst | 125 ++++++++++++++++++++
 1 file changed, 125 insertions(+)
 create mode 100644 Documentation/networking/devlink-subdev.rst

diff --git a/Documentation/networking/devlink-subdev.rst b/Documentation/networking/devlink-subdev.rst
new file mode 100644
index 000000000000..724464111aac
--- /dev/null
+++ b/Documentation/networking/devlink-subdev.rst
@@ -0,0 +1,125 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+Devlink Subdev
+==============
+
+Introduction
+============
+The network ASIC may expose privileged and non-privileged (subdev) devices.
+A privileged user can control the parameters of the subdev through the
+privileged device.
+
+The network ASIC exposes PCI PF and SR-IOV VF device(s) to the host system
+where it is connected as PCIe end point. An administrator needs to manage
+attributes and resource of such PCI PF/VF such as MAC address, MSI-X vectors
+etc. in the ASIC.
+
+An administrator doesn't have access to the host system where the PCIe ASIC
+device is connected. In some use cases host system may not even be running Linux
+kernel where the PCIe PF and VF devices are enumerated.
+Currently, the networking ASIC exposes devlink interfaces for eswitch
+mamangement. However, there is no user interface to manage attributes and
+resources on the ASIC device.
+
+Additionally, the PCIe PF and VF devices may be of a different class
+than networking; such as NVMe, GPU, crypto or something else.
+For non-networking devices, there may not be any eswitch (eswitch ports).
+
+It is desirable to control common PCIe attributes and resources through
+common kernel infrastructure. Some of the attributes or resources may not be a
+port parameter, i.e. number of IRQ (MSI-X) vectors per PF/VF.
+
+An example system view of a networking ASIC (aka SmartNIC), can be seen in the
+below diagram, where devlink eswitch instance and PCI PF and/or VFs are
+situated on two different CPU subsystems::
+
+
+	  +------------------------------+
+	  |                              |
+	  |             HOST             |
+	  |                              |
+	  |   +----+-----+-----+-----+   |
+	  |   | PF | VF0 | VF1 | VF2 |   |
+	  +---+----+-----------+-----+---+
+        	     PCI1|
+	      +---+------------+
+        	  |
+	 +----------------------------------------+
+	 |        |         SmartNic              |
+	 |   +----+-------------------------+     |
+	 |   |                              |     |
+	 |   |               NIC            |     |
+	 |   |                              |     |
+	 |   +---------------------+--------+     |
+	 |                         |  PCI2        |
+	 |         +-----+---------+--+           |
+	 |               |                        |
+	 |      +-----+--+--+--------------+      |
+	 |      |     | PF  |              |      |
+	 |      |     +-----+              |      |
+	 |      |      Embedded CPU        |      |
+	 |      |                          |      |
+	 |      +--------------------------+      |
+	 |                                        |
+	 +----------------------------------------+
+
+The below diagram shows an example of devlink subdev topology where some
+subdevs are connected to devlink ports::
+
+
+
+            (PF0)    (VF0)    (VF1)           (NVME VF2)
+         +--------------------------+         +--------+
+         | devlink| devlink| devlink|         | devlink|
+         | subdev | subdev | subdev |         | subdev |
+         |    0   |    1   |    2   |         |    3   |
+         +--------------------------+         +--------+
+              |        |        |
+              |        |        |
+              |        |        |
+     +----------------------------------+
+     |   | devlink| devlink| devlink|   |
+     |   |  port  |  port  |  port  |   |
+     |   |    0   |    1   |    2   |   |
+     |   +--------------------------+   |
+     |                                  |
+     |                                  |
+     |           E-switch               |
+     |                                  |
+     |                                  |
+     |          +--------+              |
+     |          | uplink |              |
+     |          | devlink|              |
+     |          |  port  |              |
+     +----------------------------------+
+
+
+.. _Subdev-Flavours:
+
+Subdev flavours
+===============
+
+The ``devlink-subdev`` supports the following device flavours:
+
+  * ``pcipf``: exposes pf_index attribute.
+  * ``pcivf``: exposes vf_index and pf_index.
+
+.. _Subdev-Actions:
+
+Network attributes:
+==================
+
+The ``devlink-subdev`` may represent a network device and expose the following
+attributes:
+
+  * ``hw_addr``: The HW addr of the network device.
+  * ``port_index``: The ``devlink-port`` index associated with the device.
+
+Testing
+=======
+
+See ``tools/testing/selftests/drivers/net/netdevsim/devlink.sh`` for a
+test covering the core infrastructure. Test cases should be added for any new
+functionality.
+
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH net-next v2 07/10] netdevsim: Add max_vfs to bus_dev
  2019-11-08 16:18 [PATCH net-next v2 00/10] devlink subdev Yuval Avnery
                   ` (5 preceding siblings ...)
  2019-11-08 16:18 ` [PATCH net-next v2 06/10] Documentation: Add devlink-subdev documentation Yuval Avnery
@ 2019-11-08 16:18 ` Yuval Avnery
  2019-11-08 16:18 ` [PATCH net-next v2 08/10] netdevsim: Add devlink subdev creation Yuval Avnery
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Yuval Avnery @ 2019-11-08 16:18 UTC (permalink / raw)
  To: netdev
  Cc: jiri, saeedm, leon, davem, jakub.kicinski, shuah, danielj, parav,
	andrew.gospodarek, michael.chan, Yuval Avnery

Currently there is no limit to the number of VFs netdevsim can enable.
Subdevs that represent VFs can exist and be configured before user enabled
them.
max_vfs defines the number of subdevs the driver should create.

Signed-off-by: Yuval Avnery <yuvalav@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/netdevsim/bus.c       | 39 +++++++++++++++++++++++--------
 drivers/net/netdevsim/netdevsim.h |  1 +
 2 files changed, 30 insertions(+), 10 deletions(-)

diff --git a/drivers/net/netdevsim/bus.c b/drivers/net/netdevsim/bus.c
index 6aeed0c600f8..f1a0171080cb 100644
--- a/drivers/net/netdevsim/bus.c
+++ b/drivers/net/netdevsim/bus.c
@@ -26,9 +26,9 @@ static struct nsim_bus_dev *to_nsim_bus_dev(struct device *dev)
 static int nsim_bus_dev_vfs_enable(struct nsim_bus_dev *nsim_bus_dev,
 				   unsigned int num_vfs)
 {
-	nsim_bus_dev->vfconfigs = kcalloc(num_vfs,
-					  sizeof(struct nsim_vf_config),
-					  GFP_KERNEL);
+	if (nsim_bus_dev->max_vfs < num_vfs)
+		return -ENOMEM;
+
 	if (!nsim_bus_dev->vfconfigs)
 		return -ENOMEM;
 	nsim_bus_dev->num_vfs = num_vfs;
@@ -38,8 +38,6 @@ static int nsim_bus_dev_vfs_enable(struct nsim_bus_dev *nsim_bus_dev,
 
 static void nsim_bus_dev_vfs_disable(struct nsim_bus_dev *nsim_bus_dev)
 {
-	kfree(nsim_bus_dev->vfconfigs);
-	nsim_bus_dev->vfconfigs = NULL;
 	nsim_bus_dev->num_vfs = 0;
 }
 
@@ -154,22 +152,29 @@ static struct device_type nsim_bus_dev_type = {
 };
 
 static struct nsim_bus_dev *
-nsim_bus_dev_new(unsigned int id, unsigned int port_count);
+nsim_bus_dev_new(unsigned int id, unsigned int port_count,
+		 unsigned int max_vfs);
+
+#define NSIM_BUS_DEV_MAX_VFS 4
 
 static ssize_t
 new_device_store(struct bus_type *bus, const char *buf, size_t count)
 {
 	struct nsim_bus_dev *nsim_bus_dev;
 	unsigned int port_count;
+	unsigned int max_vfs;
 	unsigned int id;
 	int err;
 
-	err = sscanf(buf, "%u %u", &id, &port_count);
+	err = sscanf(buf, "%u %u %u", &id, &port_count, &max_vfs);
 	switch (err) {
 	case 1:
 		port_count = 1;
 		/* fall through */
 	case 2:
+		max_vfs = NSIM_BUS_DEV_MAX_VFS;
+		/* fall through */
+	case 3:
 		if (id > INT_MAX) {
 			pr_err("Value of \"id\" is too big.\n");
 			return -EINVAL;
@@ -179,7 +184,7 @@ new_device_store(struct bus_type *bus, const char *buf, size_t count)
 		pr_err("Format for adding new device is \"id port_count\" (uint uint).\n");
 		return -EINVAL;
 	}
-	nsim_bus_dev = nsim_bus_dev_new(id, port_count);
+	nsim_bus_dev = nsim_bus_dev_new(id, port_count, max_vfs);
 	if (IS_ERR(nsim_bus_dev))
 		return PTR_ERR(nsim_bus_dev);
 
@@ -267,7 +272,8 @@ static struct bus_type nsim_bus = {
 };
 
 static struct nsim_bus_dev *
-nsim_bus_dev_new(unsigned int id, unsigned int port_count)
+nsim_bus_dev_new(unsigned int id, unsigned int port_count,
+		 unsigned int max_vfs)
 {
 	struct nsim_bus_dev *nsim_bus_dev;
 	int err;
@@ -284,12 +290,24 @@ nsim_bus_dev_new(unsigned int id, unsigned int port_count)
 	nsim_bus_dev->dev.type = &nsim_bus_dev_type;
 	nsim_bus_dev->port_count = port_count;
 	nsim_bus_dev->initial_net = current->nsproxy->net_ns;
+	nsim_bus_dev->max_vfs = max_vfs;
+
+	nsim_bus_dev->vfconfigs = kcalloc(nsim_bus_dev->max_vfs,
+					  sizeof(struct nsim_vf_config),
+					  GFP_KERNEL);
+	if (!nsim_bus_dev->vfconfigs) {
+		err = -ENOMEM;
+		goto err_nsim_bus_dev_id_free;
+	}
 
 	err = device_register(&nsim_bus_dev->dev);
 	if (err)
-		goto err_nsim_bus_dev_id_free;
+		goto err_nsim_vfconfigs_free;
+
 	return nsim_bus_dev;
 
+err_nsim_vfconfigs_free:
+	kfree(nsim_bus_dev->vfconfigs);
 err_nsim_bus_dev_id_free:
 	ida_free(&nsim_bus_dev_ids, nsim_bus_dev->dev.id);
 err_nsim_bus_dev_free:
@@ -301,6 +319,7 @@ static void nsim_bus_dev_del(struct nsim_bus_dev *nsim_bus_dev)
 {
 	device_unregister(&nsim_bus_dev->dev);
 	ida_free(&nsim_bus_dev_ids, nsim_bus_dev->dev.id);
+	kfree(nsim_bus_dev->vfconfigs);
 	kfree(nsim_bus_dev);
 }
 
diff --git a/drivers/net/netdevsim/netdevsim.h b/drivers/net/netdevsim/netdevsim.h
index 94df795ef4d3..e2049856add8 100644
--- a/drivers/net/netdevsim/netdevsim.h
+++ b/drivers/net/netdevsim/netdevsim.h
@@ -238,6 +238,7 @@ struct nsim_bus_dev {
 	struct net *initial_net; /* Purpose of this is to carry net pointer
 				  * during the probe time only.
 				  */
+	unsigned int max_vfs;
 	unsigned int num_vfs;
 	struct nsim_vf_config *vfconfigs;
 };
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH net-next v2 08/10] netdevsim: Add devlink subdev creation
  2019-11-08 16:18 [PATCH net-next v2 00/10] devlink subdev Yuval Avnery
                   ` (6 preceding siblings ...)
  2019-11-08 16:18 ` [PATCH net-next v2 07/10] netdevsim: Add max_vfs to bus_dev Yuval Avnery
@ 2019-11-08 16:18 ` Yuval Avnery
  2019-11-08 16:18 ` [PATCH net-next v2 09/10] netdevsim: Add devlink subdev sefltest for netdevsim Yuval Avnery
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Yuval Avnery @ 2019-11-08 16:18 UTC (permalink / raw)
  To: netdev
  Cc: jiri, saeedm, leon, davem, jakub.kicinski, shuah, danielj, parav,
	andrew.gospodarek, michael.chan, Yuval Avnery

Add devlink subdev creation to represent VFs, and implement hw_addr
get/set.

Signed-off-by: Yuval Avnery <yuvalav@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/netdevsim/Makefile    |   2 +-
 drivers/net/netdevsim/dev.c       |   9 ++-
 drivers/net/netdevsim/netdevsim.h |  10 +++
 drivers/net/netdevsim/subdev.c    | 100 ++++++++++++++++++++++++++++++
 4 files changed, 119 insertions(+), 2 deletions(-)
 create mode 100644 drivers/net/netdevsim/subdev.c

diff --git a/drivers/net/netdevsim/Makefile b/drivers/net/netdevsim/Makefile
index f4d8f62f28c2..dbbcfb57951e 100644
--- a/drivers/net/netdevsim/Makefile
+++ b/drivers/net/netdevsim/Makefile
@@ -3,7 +3,7 @@
 obj-$(CONFIG_NETDEVSIM) += netdevsim.o
 
 netdevsim-objs := \
-	netdev.o dev.o fib.o bus.o health.o
+	netdev.o dev.o fib.o bus.o health.o subdev.o
 
 ifeq ($(CONFIG_BPF_SYSCALL),y)
 netdevsim-objs += \
diff --git a/drivers/net/netdevsim/dev.c b/drivers/net/netdevsim/dev.c
index 468e157a7cb1..85fc0d40dda1 100644
--- a/drivers/net/netdevsim/dev.c
+++ b/drivers/net/netdevsim/dev.c
@@ -785,10 +785,14 @@ static struct nsim_dev *nsim_dev_create(struct nsim_bus_dev *nsim_bus_dev)
 	if (err)
 		goto err_fib_destroy;
 
+	err = nsim_dev_subdevs_create(nsim_dev, devlink);
+	if (err)
+		goto err_dl_unregister;
+
 	err = devlink_params_register(devlink, nsim_devlink_params,
 				      ARRAY_SIZE(nsim_devlink_params));
 	if (err)
-		goto err_dl_unregister;
+		goto err_dl_subdevs_destroy;
 	nsim_devlink_set_params_init_values(nsim_dev, devlink);
 
 	err = nsim_dev_dummy_region_init(nsim_dev, devlink);
@@ -831,6 +835,8 @@ static struct nsim_dev *nsim_dev_create(struct nsim_bus_dev *nsim_bus_dev)
 err_params_unregister:
 	devlink_params_unregister(devlink, nsim_devlink_params,
 				  ARRAY_SIZE(nsim_devlink_params));
+err_dl_subdevs_destroy:
+	nsim_dev_subdevs_destroy(nsim_dev);
 err_dl_unregister:
 	devlink_unregister(devlink);
 err_fib_destroy:
@@ -866,6 +872,7 @@ static void nsim_dev_destroy(struct nsim_dev *nsim_dev)
 	nsim_dev_debugfs_exit(nsim_dev);
 	devlink_params_unregister(devlink, nsim_devlink_params,
 				  ARRAY_SIZE(nsim_devlink_params));
+	nsim_dev_subdevs_destroy(nsim_dev);
 	devlink_unregister(devlink);
 	devlink_resources_unregister(devlink, NULL);
 	devlink_free(devlink);
diff --git a/drivers/net/netdevsim/netdevsim.h b/drivers/net/netdevsim/netdevsim.h
index e2049856add8..9f8b294be826 100644
--- a/drivers/net/netdevsim/netdevsim.h
+++ b/drivers/net/netdevsim/netdevsim.h
@@ -154,6 +154,12 @@ struct nsim_dev_port {
 	struct netdevsim *ns;
 };
 
+struct nsim_subdev {
+	struct devlink_subdev *devlink_subdev;
+	unsigned int subdev_index;
+	struct nsim_bus_dev *nsim_bus_dev;
+};
+
 struct nsim_dev {
 	struct nsim_bus_dev *nsim_bus_dev;
 	struct nsim_fib_data *fib_data;
@@ -177,6 +183,7 @@ struct nsim_dev {
 	bool fail_reload;
 	struct devlink_region *dummy_region;
 	struct nsim_dev_health health;
+	struct nsim_subdev *subdevs;
 };
 
 static inline struct net *nsim_dev_net(struct nsim_dev *nsim_dev)
@@ -245,3 +252,6 @@ struct nsim_bus_dev {
 
 int nsim_bus_init(void);
 void nsim_bus_exit(void);
+
+int nsim_dev_subdevs_create(struct nsim_dev *nsim_dev, struct devlink *devlink);
+void nsim_dev_subdevs_destroy(struct nsim_dev *nsim_dev);
diff --git a/drivers/net/netdevsim/subdev.c b/drivers/net/netdevsim/subdev.c
new file mode 100644
index 000000000000..56899fe8e785
--- /dev/null
+++ b/drivers/net/netdevsim/subdev.c
@@ -0,0 +1,100 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2019 Mellanox Technologies */
+
+#include <linux/device.h>
+#include <linux/etherdevice.h>
+#include <linux/inet.h>
+#include <linux/kernel.h>
+#include <linux/mutex.h>
+#include <linux/rtnetlink.h>
+#include <net/devlink.h>
+#include <uapi/linux/devlink.h>
+
+#include "netdevsim.h"
+
+static int
+nsim_hw_addr_set(struct devlink_subdev *devlink_subdev, u8 *hw_addr,
+		 struct netlink_ext_ack *extack)
+{
+	struct nsim_bus_dev *nsim_bus_dev;
+	struct nsim_subdev *nsim_subdev;
+	int subdev_index;
+
+	nsim_subdev = devlink_subdev_priv(devlink_subdev);
+	nsim_bus_dev = nsim_subdev->nsim_bus_dev;
+	subdev_index = nsim_subdev->subdev_index;
+
+	ether_addr_copy(nsim_bus_dev->vfconfigs[subdev_index].vf_mac,
+			hw_addr);
+	return 0;
+}
+
+static int
+nsim_hw_addr_get(struct devlink_subdev *devlink_subdev, u8 *hw_addr,
+		 struct netlink_ext_ack *extack)
+{
+	struct nsim_bus_dev *nsim_bus_dev;
+	struct nsim_subdev *nsim_subdev;
+	int subdev_index;
+
+	nsim_subdev = devlink_subdev_priv(devlink_subdev);
+	nsim_bus_dev = nsim_subdev->nsim_bus_dev;
+	subdev_index = nsim_subdev->subdev_index;
+
+	ether_addr_copy(hw_addr,
+			nsim_bus_dev->vfconfigs[subdev_index].vf_mac);
+	return 0;
+}
+
+static struct devlink_subdev_ops subdev_ops = {
+	.hw_addr_set = nsim_hw_addr_set,
+	.hw_addr_get = nsim_hw_addr_get,
+	.hw_addr_len = ETH_ALEN,
+};
+
+int nsim_dev_subdevs_create(struct nsim_dev *nsim_dev, struct devlink *devlink)
+{
+	int max_vfs = nsim_dev->nsim_bus_dev->max_vfs;
+	struct devlink_subdev_attrs attrs;
+	int err;
+	int vf;
+
+	nsim_dev->subdevs = kcalloc(max_vfs, sizeof(*nsim_dev->subdevs),
+				    GFP_KERNEL);
+	if (!nsim_dev->subdevs)
+		return -ENOMEM;
+
+	for (vf = 0; vf < max_vfs; vf++) {
+		struct nsim_subdev *nsim_subdev = &nsim_dev->subdevs[vf];
+		struct devlink_subdev *devlink_subdev;
+
+		nsim_subdev->nsim_bus_dev = nsim_dev->nsim_bus_dev;
+		devlink_subdev_attrs_pci_vf_init(&attrs, 0, vf);
+		devlink_subdev = devlink_subdev_create(devlink, vf,
+						       &subdev_ops,
+						       &attrs,
+						       nsim_subdev);
+		if (IS_ERR(devlink_subdev)) {
+			err = PTR_ERR(devlink_subdev);
+			goto err_subdevs_destroy;
+		}
+		nsim_subdev->devlink_subdev = devlink_subdev;
+		nsim_subdev->subdev_index = vf;
+	}
+
+	return 0;
+
+err_subdevs_destroy:
+	for (vf--; vf >= 0; vf--)
+		devlink_subdev_destroy(nsim_dev->subdevs[vf].devlink_subdev);
+	return err;
+}
+
+void nsim_dev_subdevs_destroy(struct nsim_dev *nsim_dev)
+{
+	int vf;
+
+	for (vf = 0; vf < nsim_dev->nsim_bus_dev->max_vfs; vf++)
+		devlink_subdev_destroy(nsim_dev->subdevs[vf].devlink_subdev);
+}
+
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH net-next v2 09/10] netdevsim: Add devlink subdev sefltest for netdevsim
  2019-11-08 16:18 [PATCH net-next v2 00/10] devlink subdev Yuval Avnery
                   ` (7 preceding siblings ...)
  2019-11-08 16:18 ` [PATCH net-next v2 08/10] netdevsim: Add devlink subdev creation Yuval Avnery
@ 2019-11-08 16:18 ` Yuval Avnery
  2019-11-08 16:18 ` [PATCH net-next v2 10/10] net/mlx5e: Add support for devlink subdev and subdev hw_addr set/show Yuval Avnery
  2019-11-11 18:00 ` [PATCH net-next v2 00/10] devlink subdev Jakub Kicinski
  10 siblings, 0 replies; 23+ messages in thread
From: Yuval Avnery @ 2019-11-08 16:18 UTC (permalink / raw)
  To: netdev
  Cc: jiri, saeedm, leon, davem, jakub.kicinski, shuah, danielj, parav,
	andrew.gospodarek, michael.chan, Yuval Avnery

Test will assign hw_addr to all registered subdevs and query it.

Signed-off-by: Yuval Avnery <yuvalav@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 .../drivers/net/netdevsim/devlink.sh          | 55 ++++++++++++++++++-
 1 file changed, 53 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/drivers/net/netdevsim/devlink.sh b/tools/testing/selftests/drivers/net/netdevsim/devlink.sh
index ee89cd2f5bee..15e0911df871 100755
--- a/tools/testing/selftests/drivers/net/netdevsim/devlink.sh
+++ b/tools/testing/selftests/drivers/net/netdevsim/devlink.sh
@@ -5,12 +5,13 @@ lib_dir=$(dirname $0)/../../../net/forwarding
 
 ALL_TESTS="fw_flash_test params_test regions_test reload_test \
 	   netns_reload_test resource_test dev_info_test \
-	   empty_reporter_test dummy_reporter_test"
+	   empty_reporter_test dummy_reporter_test subdev_test"
 NUM_NETIFS=0
 source $lib_dir/lib.sh
 
 BUS_ADDR=10
 PORT_COUNT=4
+VF_COUNT=4
 DEV_NAME=netdevsim$BUS_ADDR
 SYSFS_NET_DIR=/sys/bus/netdevsim/devices/$DEV_NAME/net/
 DEBUGFS_DIR=/sys/kernel/debug/netdevsim/$DEV_NAME/
@@ -428,10 +429,60 @@ dummy_reporter_test()
 	log_test "dummy reporter test"
 }
 
+subdev_attr_get()
+{
+	local handle=$1
+	local name=$2
+
+	cmd_jq "devlink subdev show $handle -j" '.[][].'$name
+}
+
+subdev_objects_get()
+{
+	local handle=$1
+
+	cmd_jq "devlink subdev show -j" \
+	       '.[] | keys[] | select(contains("'$handle'"))'
+}
+
+subdev_attr_set()
+{
+	local handle=$1
+	local name=$2
+	local value=$3
+
+	devlink subdev set $handle $name $value
+}
+
+subdev_test()
+{
+	RET=0
+
+	local subdevs=`subdev_objects_get $DL_HANDLE`
+	local num_subdevs=`echo $subdevs | wc -w`
+	[ $num_subdevs == $VF_COUNT ]
+	check_err $? "Expected $VF_COUNT subdevs but got $num_subdevs"
+
+	i=1
+	for subdev in $subdevs
+	do
+		local hw_addr=`printf "10:22:33:44:55:%02x" $i`
+
+		subdev_attr_set "$subdev" hw_addr $hw_addr
+		check_err $? "Failed to set hw_addr value"
+		value=$(subdev_attr_get $subdev hw_addr)
+		check_err $? "Failed to get hw_addr attr value"
+		[ "$value" == "$hw_addr" ]
+		check_err $? "Unexpected hw_addr attr value $value != $hw_addr"
+		i=$(($i+1))
+	done
+	log_test "subdev test"
+}
+
 setup_prepare()
 {
 	modprobe netdevsim
-	echo "$BUS_ADDR $PORT_COUNT" > /sys/bus/netdevsim/new_device
+	echo "$BUS_ADDR $PORT_COUNT $VF_COUNT" > /sys/bus/netdevsim/new_device
 	while [ ! -d $SYSFS_NET_DIR ] ; do :; done
 }
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH net-next v2 10/10] net/mlx5e: Add support for devlink subdev and subdev hw_addr set/show
  2019-11-08 16:18 [PATCH net-next v2 00/10] devlink subdev Yuval Avnery
                   ` (8 preceding siblings ...)
  2019-11-08 16:18 ` [PATCH net-next v2 09/10] netdevsim: Add devlink subdev sefltest for netdevsim Yuval Avnery
@ 2019-11-08 16:18 ` Yuval Avnery
  2019-11-11 18:00 ` [PATCH net-next v2 00/10] devlink subdev Jakub Kicinski
  10 siblings, 0 replies; 23+ messages in thread
From: Yuval Avnery @ 2019-11-08 16:18 UTC (permalink / raw)
  To: netdev
  Cc: jiri, saeedm, leon, davem, jakub.kicinski, shuah, danielj, parav,
	andrew.gospodarek, michael.chan, Yuval Avnery

Add subdev creation and implement get/set ops for subdev hw_addr.
Add subdev linkage to devlink port registration.

Signed-off-by: Yuval Avnery <yuvalav@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 .../net/ethernet/mellanox/mlx5/core/devlink.c | 92 +++++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/devlink.h |  5 +
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  6 +-
 .../net/ethernet/mellanox/mlx5/core/en_rep.c  |  9 +-
 .../net/ethernet/mellanox/mlx5/core/eswitch.c | 19 +++-
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |  7 +-
 .../mellanox/mlx5/core/eswitch_offloads.c     |  8 ++
 7 files changed, 136 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
index 381925c90d94..c1615a5feed7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
@@ -226,3 +226,95 @@ void mlx5_devlink_unregister(struct devlink *devlink)
 				  ARRAY_SIZE(mlx5_devlink_params));
 	devlink_unregister(devlink);
 }
+
+static int
+mlx5_devlink_mac_set(struct devlink_subdev *devlink_subdev, u8 *mac,
+		     struct netlink_ext_ack *extack)
+{
+	struct mlx5_vport *vport = devlink_subdev_priv(devlink_subdev);
+
+	return mlx5_eswitch_set_vport_mac(vport->dev->priv.eswitch,
+					 vport->vport, mac, extack);
+}
+
+static int
+mlx5_devlink_mac_get(struct devlink_subdev *devlink_subdev, u8 *mac,
+		     struct netlink_ext_ack *extack)
+{
+	struct mlx5_vport *vport = devlink_subdev_priv(devlink_subdev);
+	struct ifla_vf_info ivi;
+	int err;
+
+	vport = devlink_subdev_priv(devlink_subdev);
+
+	err = mlx5_eswitch_get_vport_config(vport->dev->priv.eswitch,
+					    vport->vport, &ivi, extack);
+	if (!err)
+		ether_addr_copy(mac, ivi.mac);
+
+	return err;
+}
+
+static struct devlink_subdev_ops subdev_ops = {
+	.hw_addr_set = mlx5_devlink_mac_set,
+	.hw_addr_get = mlx5_devlink_mac_get,
+	.hw_addr_len = ETH_ALEN,
+};
+
+int mlx5_devlink_subdevs_create(struct mlx5_eswitch *esw)
+{
+	struct devlink *devlink = priv_to_devlink(esw->dev);
+	struct mlx5_vport *vport;
+	int err;
+	int i;
+
+	mlx5_esw_for_all_vports(esw, i, vport) {
+		struct devlink_subdev *devlink_subdev;
+		struct devlink_subdev_attrs attrs;
+		int subdev_idx;
+
+		if (IS_ERR(vport)) {
+			err = PTR_ERR(vport);
+			goto err_dl_destroy;
+		}
+
+		if (vport->vport == MLX5_VPORT_UPLINK)
+			continue;
+		if (vport->vport == mlx5_eswitch_manager_vport(esw->dev))
+			continue;
+
+		subdev_idx = vport->vport;
+		if (mlx5_eswitch_is_vf_vport(esw, subdev_idx))
+			devlink_subdev_attrs_pci_vf_init(&attrs, 0,
+							 subdev_idx - 1);
+		else
+			devlink_subdev_attrs_pci_pf_init(&attrs, subdev_idx);
+
+		devlink_subdev = devlink_subdev_create(devlink, subdev_idx,
+						       &subdev_ops, &attrs,
+						       vport);
+		if (IS_ERR(devlink_subdev)) {
+			err = PTR_ERR(devlink_subdev);
+			goto err_dl_destroy;
+		}
+		vport->devlink_subdev = devlink_subdev;
+	}
+
+	return 0;
+
+err_dl_destroy:
+	mlx5_devlink_subdevs_destroy(esw);
+	return err;
+}
+
+void mlx5_devlink_subdevs_destroy(struct mlx5_eswitch *esw)
+{
+	struct mlx5_vport *vport;
+	int i;
+
+	mlx5_esw_for_all_vports(esw, i, vport) {
+		if (vport->devlink_subdev)
+			devlink_subdev_destroy(vport->devlink_subdev);
+	}
+}
+
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/devlink.h b/drivers/net/ethernet/mellanox/mlx5/core/devlink.h
index d0ba03774ddf..d0be2ad5e34b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/devlink.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/devlink.h
@@ -6,9 +6,14 @@
 
 #include <net/devlink.h>
 
+struct mlx5_eswitch;
+
 struct devlink *mlx5_devlink_alloc(void);
 void mlx5_devlink_free(struct devlink *devlink);
 int mlx5_devlink_register(struct devlink *devlink, struct device *dev);
 void mlx5_devlink_unregister(struct devlink *devlink);
 
+int mlx5_devlink_subdevs_create(struct mlx5_eswitch *esw);
+void mlx5_devlink_subdevs_destroy(struct mlx5_eswitch *esw);
+
 #endif /* __MLX5_DEVLINK_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 772bfdbdeb9c..429672f69b6f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -4054,7 +4054,8 @@ int mlx5e_set_vf_mac(struct net_device *dev, int vf, u8 *mac)
 	struct mlx5e_priv *priv = netdev_priv(dev);
 	struct mlx5_core_dev *mdev = priv->mdev;
 
-	return mlx5_eswitch_set_vport_mac(mdev->priv.eswitch, vf + 1, mac);
+	return mlx5_eswitch_set_vport_mac(mdev->priv.eswitch, vf + 1,
+					  mac, NULL);
 }
 
 static int mlx5e_set_vf_vlan(struct net_device *dev, int vf, u16 vlan, u8 qos,
@@ -4135,7 +4136,8 @@ int mlx5e_get_vf_config(struct net_device *dev,
 	struct mlx5_core_dev *mdev = priv->mdev;
 	int err;
 
-	err = mlx5_eswitch_get_vport_config(mdev->priv.eswitch, vf + 1, ivi);
+	err = mlx5_eswitch_get_vport_config(mdev->priv.eswitch, vf + 1,
+					    ivi, NULL);
 	if (err)
 		return err;
 	ivi->linkstate = mlx5_vport_link2ifla(ivi->linkstate);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index cd9bb7c7b341..eefb38dd3c79 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -1812,8 +1812,10 @@ static int register_devlink_port(struct mlx5_core_dev *dev,
 {
 	struct devlink *devlink = priv_to_devlink(dev);
 	struct mlx5_eswitch_rep *rep = rpriv->rep;
+	struct devlink_subdev *devlink_subdev = NULL;
 	struct netdev_phys_item_id ppid = {};
 	unsigned int dl_port_index = 0;
+	struct mlx5_vport *vport;
 
 	if (!is_devlink_port_supported(dev, rpriv))
 		return 0;
@@ -1831,6 +1833,8 @@ static int register_devlink_port(struct mlx5_core_dev *dev,
 					      &ppid.id[0], ppid.id_len,
 					      dev->pdev->devfn);
 		dl_port_index = rep->vport;
+		vport = mlx5_eswitch_get_vport(dev->priv.eswitch, rep->vport);
+		devlink_subdev = vport->devlink_subdev;
 	} else if (mlx5_eswitch_is_vf_vport(dev->priv.eswitch,
 					    rpriv->rep->vport)) {
 		devlink_port_attrs_pci_vf_set(&rpriv->dl_port,
@@ -1838,9 +1842,12 @@ static int register_devlink_port(struct mlx5_core_dev *dev,
 					      dev->pdev->devfn,
 					      rep->vport - 1);
 		dl_port_index = vport_to_devlink_port_index(dev, rep->vport);
+		vport = mlx5_eswitch_get_vport(dev->priv.eswitch, rep->vport);
+		devlink_subdev = vport->devlink_subdev;
 	}
 
-	return devlink_port_register(devlink, &rpriv->dl_port, dl_port_index);
+	return devlink_port_register_with_subdev(devlink, &rpriv->dl_port,
+					       dl_port_index, devlink_subdev);
 }
 
 static void unregister_devlink_port(struct mlx5_core_dev *dev,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index 7baade9e62b7..19c4020dc5df 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -2175,7 +2175,8 @@ void mlx5_eswitch_cleanup(struct mlx5_eswitch *esw)
 
 /* Vport Administration */
 int mlx5_eswitch_set_vport_mac(struct mlx5_eswitch *esw,
-			       u16 vport, u8 mac[ETH_ALEN])
+			       u16 vport, u8 mac[ETH_ALEN],
+			       struct netlink_ext_ack *extack)
 {
 	struct mlx5_vport *evport = mlx5_eswitch_get_vport(esw, vport);
 	u64 node_guid;
@@ -2188,25 +2189,30 @@ int mlx5_eswitch_set_vport_mac(struct mlx5_eswitch *esw,
 
 	mutex_lock(&esw->state_lock);
 
-	if (evport->info.spoofchk && !is_valid_ether_addr(mac))
+	if (evport->info.spoofchk && !is_valid_ether_addr(mac)) {
 		mlx5_core_warn(esw->dev,
 			       "Set invalid MAC while spoofchk is on, vport(%d)\n",
 			       vport);
+		NL_SET_ERR_MSG_MOD(extack, "Set invalid MAC while spoofchk is on");
+	}
 
 	err = mlx5_modify_nic_vport_mac_address(esw->dev, vport, mac);
 	if (err) {
 		mlx5_core_warn(esw->dev,
 			       "Failed to mlx5_modify_nic_vport_mac vport(%d) err=(%d)\n",
 			       vport, err);
+		NL_SET_ERR_MSG_MOD(extack, "Failed to modify NIC vport");
 		goto unlock;
 	}
 
 	node_guid_gen_from_mac(&node_guid, mac);
 	err = mlx5_modify_nic_vport_node_guid(esw->dev, vport, node_guid);
-	if (err)
+	if (err) {
 		mlx5_core_warn(esw->dev,
 			       "Failed to set vport %d node guid, err = %d. RDMA_CM will not function properly for this VF.\n",
 			       vport, err);
+		NL_SET_ERR_MSG_MOD(extack, "Failed to set node guid, RDMA_CM will not function properly for this vport");
+	}
 
 	ether_addr_copy(evport->info.mac, mac);
 	evport->info.node_guid = node_guid;
@@ -2249,12 +2255,15 @@ int mlx5_eswitch_set_vport_state(struct mlx5_eswitch *esw,
 }
 
 int mlx5_eswitch_get_vport_config(struct mlx5_eswitch *esw,
-				  u16 vport, struct ifla_vf_info *ivi)
+				  u16 vport, struct ifla_vf_info *ivi,
+				  struct netlink_ext_ack *extack)
 {
 	struct mlx5_vport *evport = mlx5_eswitch_get_vport(esw, vport);
 
-	if (IS_ERR(evport))
+	if (IS_ERR(evport)) {
+		NL_SET_ERR_MSG_MOD(extack, "Vport does not exist");
 		return PTR_ERR(evport);
+	}
 
 	memset(ivi, 0, sizeof(*ivi));
 	ivi->vf = vport - 1;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index 920d8f529fb9..56dbcc480053 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -138,6 +138,7 @@ struct mlx5_vport {
 
 	bool                    enabled;
 	enum mlx5_eswitch_vport_event enabled_events;
+	struct devlink_subdev     *devlink_subdev;
 };
 
 enum offloads_fdb_flags {
@@ -277,7 +278,8 @@ void mlx5_eswitch_cleanup(struct mlx5_eswitch *esw);
 int mlx5_eswitch_enable(struct mlx5_eswitch *esw, int mode);
 void mlx5_eswitch_disable(struct mlx5_eswitch *esw, bool clear_vf);
 int mlx5_eswitch_set_vport_mac(struct mlx5_eswitch *esw,
-			       u16 vport, u8 mac[ETH_ALEN]);
+			       u16 vport, u8 mac[ETH_ALEN],
+			       struct netlink_ext_ack *extack);
 int mlx5_eswitch_set_vport_state(struct mlx5_eswitch *esw,
 				 u16 vport, int link_state);
 int mlx5_eswitch_set_vport_vlan(struct mlx5_eswitch *esw,
@@ -291,7 +293,8 @@ int mlx5_eswitch_set_vport_rate(struct mlx5_eswitch *esw, u16 vport,
 int mlx5_eswitch_set_vepa(struct mlx5_eswitch *esw, u8 setting);
 int mlx5_eswitch_get_vepa(struct mlx5_eswitch *esw, u8 *setting);
 int mlx5_eswitch_get_vport_config(struct mlx5_eswitch *esw,
-				  u16 vport, struct ifla_vf_info *ivi);
+				  u16 vport, struct ifla_vf_info *ivi,
+				  struct netlink_ext_ack *extack);
 int mlx5_eswitch_get_vport_stats(struct mlx5_eswitch *esw,
 				 u16 vport,
 				 struct ifla_vf_stats *vf_stats);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 5e28a35dfb0a..282ead8eba10 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -42,6 +42,7 @@
 #include "fs_core.h"
 #include "lib/devcom.h"
 #include "lib/eq.h"
+#include "devlink.h"
 
 /* There are two match-all miss flows, one for unicast dst mac and
  * one for multicast.
@@ -2193,6 +2194,10 @@ int esw_offloads_enable(struct mlx5_eswitch *esw)
 	if (err)
 		goto err_vports;
 
+	err = mlx5_devlink_subdevs_create(esw);
+	if (err)
+		goto err_subdevs;
+
 	err = esw_offloads_load_all_reps(esw);
 	if (err)
 		goto err_reps;
@@ -2203,6 +2208,8 @@ int esw_offloads_enable(struct mlx5_eswitch *esw)
 	return 0;
 
 err_reps:
+	mlx5_devlink_subdevs_destroy(esw);
+err_subdevs:
 	mlx5_eswitch_disable_pf_vf_vports(esw);
 err_vports:
 	esw_set_passing_vport_metadata(esw, false);
@@ -2236,6 +2243,7 @@ void esw_offloads_disable(struct mlx5_eswitch *esw)
 {
 	esw_offloads_devcom_cleanup(esw);
 	esw_offloads_unload_all_reps(esw);
+	mlx5_devlink_subdevs_destroy(esw);
 	mlx5_eswitch_disable_pf_vf_vports(esw);
 	esw_set_passing_vport_metadata(esw, false);
 	esw_offloads_steering_cleanup(esw);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* RE: [PATCH net-next v2 04/10] devlink: Support subdev HW address get
  2019-11-08 16:18 ` [PATCH net-next v2 04/10] devlink: Support subdev HW address get Yuval Avnery
@ 2019-11-08 18:00   ` Parav Pandit
  2019-11-09 17:45     ` Yuval Avnery
  0 siblings, 1 reply; 23+ messages in thread
From: Parav Pandit @ 2019-11-08 18:00 UTC (permalink / raw)
  To: Yuval Avnery, netdev
  Cc: Jiri Pirko, Saeed Mahameed, leon, davem, jakub.kicinski, shuah,
	Daniel Jurgens, andrew.gospodarek, michael.chan, Yuval Avnery



> -----Original Message-----
> From: netdev-owner@vger.kernel.org <netdev-owner@vger.kernel.org> On
> Behalf Of Yuval Avnery
> Sent: Friday, November 8, 2019 10:19 AM
> To: netdev@vger.kernel.org
> Cc: Jiri Pirko <jiri@mellanox.com>; Saeed Mahameed
> <saeedm@mellanox.com>; leon@kernel.org; davem@davemloft.net;
> jakub.kicinski@netronome.com; shuah@kernel.org; Daniel Jurgens
> <danielj@mellanox.com>; Parav Pandit <parav@mellanox.com>;
> andrew.gospodarek@broadcom.com; michael.chan@broadcom.com; Yuval
> Avnery <yuvalav@mellanox.com>
> Subject: [PATCH net-next v2 04/10] devlink: Support subdev HW address get
> 
> Allow privileged user to get the HW address of a subdev.
> 
> Example:
> 
> $ devlink subdev show pci/0000:03:00.0/1
> pci/0000:03:00.0/1: flavour pcivf pf 0 vf 0 port_index 1 hw_addr
> 00:23:35:af:35:34
> 
> $ devlink subdev show pci/0000:03:00.0/1 -pj {
>     "subdev": {
>         "pci/0000:03:00.0/1": {
>             "flavour": "pcivf",
>             "pf": 0,
>             "vf": 0,
>             "port_index": 1,
>             "hw_addr": "00:23:35:af:35:34"
I prefer this to be 'address' to match to 'ip link set address LLADDR'.
That will make it consistent with rest of the iproute2/ip tool.
So that users don't have to remember one mor keyword for the 'address'.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [PATCH net-next v2 04/10] devlink: Support subdev HW address get
  2019-11-08 18:00   ` Parav Pandit
@ 2019-11-09 17:45     ` Yuval Avnery
  2019-11-09 18:13       ` Parav Pandit
  0 siblings, 1 reply; 23+ messages in thread
From: Yuval Avnery @ 2019-11-09 17:45 UTC (permalink / raw)
  To: Parav Pandit, netdev
  Cc: Jiri Pirko, Saeed Mahameed, leon, davem, jakub.kicinski, shuah,
	Daniel Jurgens, andrew.gospodarek, michael.chan



> -----Original Message-----
> From: Parav Pandit
> Sent: Friday, November 8, 2019 10:00 AM
> To: Yuval Avnery <yuvalav@mellanox.com>; netdev@vger.kernel.org
> Cc: Jiri Pirko <jiri@mellanox.com>; Saeed Mahameed
> <saeedm@mellanox.com>; leon@kernel.org; davem@davemloft.net;
> jakub.kicinski@netronome.com; shuah@kernel.org; Daniel Jurgens
> <danielj@mellanox.com>; andrew.gospodarek@broadcom.com;
> michael.chan@broadcom.com; Yuval Avnery <yuvalav@mellanox.com>
> Subject: RE: [PATCH net-next v2 04/10] devlink: Support subdev HW address
> get
> 
> 
> 
> > -----Original Message-----
> > From: netdev-owner@vger.kernel.org <netdev-owner@vger.kernel.org>
> On
> > Behalf Of Yuval Avnery
> > Sent: Friday, November 8, 2019 10:19 AM
> > To: netdev@vger.kernel.org
> > Cc: Jiri Pirko <jiri@mellanox.com>; Saeed Mahameed
> > <saeedm@mellanox.com>; leon@kernel.org; davem@davemloft.net;
> > jakub.kicinski@netronome.com; shuah@kernel.org; Daniel Jurgens
> > <danielj@mellanox.com>; Parav Pandit <parav@mellanox.com>;
> > andrew.gospodarek@broadcom.com; michael.chan@broadcom.com;
> Yuval
> > Avnery <yuvalav@mellanox.com>
> > Subject: [PATCH net-next v2 04/10] devlink: Support subdev HW address
> > get
> >
> > Allow privileged user to get the HW address of a subdev.
> >
> > Example:
> >
> > $ devlink subdev show pci/0000:03:00.0/1
> > pci/0000:03:00.0/1: flavour pcivf pf 0 vf 0 port_index 1 hw_addr
> > 00:23:35:af:35:34
> >
> > $ devlink subdev show pci/0000:03:00.0/1 -pj {
> >     "subdev": {
> >         "pci/0000:03:00.0/1": {
> >             "flavour": "pcivf",
> >             "pf": 0,
> >             "vf": 0,
> >             "port_index": 1,
> >             "hw_addr": "00:23:35:af:35:34"
> I prefer this to be 'address' to match to 'ip link set address LLADDR'.
> That will make it consistent with rest of the iproute2/ip tool.
> So that users don't have to remember one mor keyword for the 'address'.
I think hw_addr is more accurate, and consistency doesn't exist here anyway.
We already have "ip link set vf set mac"
Address is not specific enough and can also mean IP address, while hw_addr
covers both ETH and IB


^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [PATCH net-next v2 04/10] devlink: Support subdev HW address get
  2019-11-09 17:45     ` Yuval Avnery
@ 2019-11-09 18:13       ` Parav Pandit
  2019-11-10  8:12         ` Jiri Pirko
  0 siblings, 1 reply; 23+ messages in thread
From: Parav Pandit @ 2019-11-09 18:13 UTC (permalink / raw)
  To: Yuval Avnery, netdev
  Cc: Jiri Pirko, Saeed Mahameed, leon, davem, jakub.kicinski, shuah,
	Daniel Jurgens, andrew.gospodarek, michael.chan



> From: Yuval Avnery <yuvalav@mellanox.com>
> > > From: netdev-owner@vger.kernel.org <netdev-owner@vger.kernel.org>
> > > Avnery <yuvalav@mellanox.com>
> > > Subject: [PATCH net-next v2 04/10] devlink: Support subdev HW
> > > address get
> > >
> > > Allow privileged user to get the HW address of a subdev.
> > >
> > > Example:
> > >
> > > $ devlink subdev show pci/0000:03:00.0/1
> > > pci/0000:03:00.0/1: flavour pcivf pf 0 vf 0 port_index 1 hw_addr
> > > 00:23:35:af:35:34
> > >
> > > $ devlink subdev show pci/0000:03:00.0/1 -pj {
> > >     "subdev": {
> > >         "pci/0000:03:00.0/1": {
> > >             "flavour": "pcivf",
> > >             "pf": 0,
> > >             "vf": 0,
> > >             "port_index": 1,
> > >             "hw_addr": "00:23:35:af:35:34"
> > I prefer this to be 'address' to match to 'ip link set address LLADDR'.
> > That will make it consistent with rest of the iproute2/ip tool.
> > So that users don't have to remember one mor keyword for the 'address'.
> I think hw_addr is more accurate, and consistency doesn't exist here anyway.
> We already have "ip link set vf set mac"
> Address is not specific enough and can also mean IP address, while hw_addr
> covers both ETH and IB
Ok.

Jiri,
Do you want to wait to conclude the SF discussion, as we are discussing subdev there in confused state currently?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next v2 04/10] devlink: Support subdev HW address get
  2019-11-09 18:13       ` Parav Pandit
@ 2019-11-10  8:12         ` Jiri Pirko
  0 siblings, 0 replies; 23+ messages in thread
From: Jiri Pirko @ 2019-11-10  8:12 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Yuval Avnery, netdev, Jiri Pirko, Saeed Mahameed, leon, davem,
	jakub.kicinski, shuah, Daniel Jurgens, andrew.gospodarek,
	michael.chan

Sat, Nov 09, 2019 at 07:13:30PM CET, parav@mellanox.com wrote:
>
>
>> From: Yuval Avnery <yuvalav@mellanox.com>
>> > > From: netdev-owner@vger.kernel.org <netdev-owner@vger.kernel.org>
>> > > Avnery <yuvalav@mellanox.com>
>> > > Subject: [PATCH net-next v2 04/10] devlink: Support subdev HW
>> > > address get
>> > >
>> > > Allow privileged user to get the HW address of a subdev.
>> > >
>> > > Example:
>> > >
>> > > $ devlink subdev show pci/0000:03:00.0/1
>> > > pci/0000:03:00.0/1: flavour pcivf pf 0 vf 0 port_index 1 hw_addr
>> > > 00:23:35:af:35:34
>> > >
>> > > $ devlink subdev show pci/0000:03:00.0/1 -pj {
>> > >     "subdev": {
>> > >         "pci/0000:03:00.0/1": {
>> > >             "flavour": "pcivf",
>> > >             "pf": 0,
>> > >             "vf": 0,
>> > >             "port_index": 1,
>> > >             "hw_addr": "00:23:35:af:35:34"
>> > I prefer this to be 'address' to match to 'ip link set address LLADDR'.
>> > That will make it consistent with rest of the iproute2/ip tool.
>> > So that users don't have to remember one mor keyword for the 'address'.
>> I think hw_addr is more accurate, and consistency doesn't exist here anyway.
>> We already have "ip link set vf set mac"
>> Address is not specific enough and can also mean IP address, while hw_addr
>> covers both ETH and IB
>Ok.
>
>Jiri,
>Do you want to wait to conclude the SF discussion, as we are discussing subdev there in confused state currently?

I belive that subdev is parallel to the discussion about SFs.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next v2 00/10] devlink subdev
  2019-11-08 16:18 [PATCH net-next v2 00/10] devlink subdev Yuval Avnery
                   ` (9 preceding siblings ...)
  2019-11-08 16:18 ` [PATCH net-next v2 10/10] net/mlx5e: Add support for devlink subdev and subdev hw_addr set/show Yuval Avnery
@ 2019-11-11 18:00 ` Jakub Kicinski
  2019-11-11 18:46   ` Yuval Avnery
  10 siblings, 1 reply; 23+ messages in thread
From: Jakub Kicinski @ 2019-11-11 18:00 UTC (permalink / raw)
  To: Yuval Avnery, jiri
  Cc: netdev, saeedm, leon, davem, shuah, danielj, parav,
	andrew.gospodarek, michael.chan

On Fri,  8 Nov 2019 18:18:36 +0200, Yuval Avnery wrote:
> This patchset introduces devlink subdev.
> 
> Currently, legacy tools do not provide a comprehensive solution that can
> be used in both SmartNic and non-SmartNic mode.
> Subdev represents a device that exists on the ASIC but is not necessarily
> visible to the kernel.
> 
> Using devlink ports is not suitable because:
> 
> 1. Those devices aren't necessarily network devices (such as NVMe devices)
>    and doesn’t have E-switch representation. Therefore, there is need for
>    more generic representation of PCI VF.
> 2. Some attributes are not necessarily pure port attributes
>    (number of MSIX vectors)

PCIe attrs will require persistence. Where is that in this design?

Also MSIX vectors are configuration of the devlink port (ASIC side), the
only reason you're putting them in subdev is because some of the subdevs
don't have a port, muddying up the meaning of things.

> 3. It creates a confusing devlink topology, with multiple port flavours
>    and indices.
> 
> Subdev will be created along with flavour and attributes.
> Some network subdevs may be linked with a devlink port.
> 
> This is also aimed to replace "ip link vf" commands as they are strongly
> linked to the PCI topology and allow access only to enabled VFs.
> Even though current patchset and example is limited to MAC address
> of the VF, this interface will allow to manage PF, VF, mdev in
> SmartNic and non SmartNic modes, in unified way for networking and
> non-networking devices via devlink instance.
> 
> Use case example:
> An example system view of a networking ASIC (aka SmartNIC), can be seen in
> below diagram, where devlink eswitch instance and PCI PF and/or VFs are
> situated on two different CPU subsystems:
> 
> 
>       +------------------------------+
>       |                              |
>       |             HOST             |
>       |                              |
>       |   +----+-----+-----+-----+   |
>       |   | PF | VF0 | VF1 | VF2 |   |
>       +---+----+-----------+-----+---+
>                  PCI1|
>           +---+------------+
>               |
>      +----------------------------------------+
>      |        |         SmartNic              |
>      |   +----+-------------------------+     |
>      |   |                              |     |
>      |   |               NIC            |     |
>      |   |                              |     |
>      |   +---------------------+--------+     |
>      |                         |  PCI2        |
>      |         +-----+---------+--+           |
>      |               |                        |
>      |      +-----+--+--+--------------+      |
>      |      |     | PF  |              |      |
>      |      |     +-----+              |      |
>      |      |      Embedded CPU        |      |
>      |      |                          |      |
>      |      +--------------------------+      |
>      |                                        |
>      +----------------------------------------+
> 
> The below diagram shows an example devlink subdev topology where some
> subdevs are connected to devlink ports::
> 
> 
> 
>             (PF0)    (VF0)    (VF1)           (NVME VF2)
>          +--------------------------+         +--------+
>          | devlink| devlink| devlink|         | devlink|
>          | subdev | subdev | subdev |         | subdev |
>          |    0   |    1   |    2   |         |    3   |
>          +--------------------------+         +--------+
>               |        |        |

What is this NVME VF2 connected to? It's gotta get traffic from
somewhere? Frames come in from the uplink, then what?

>               |        |        |
>               |        |        |
>      +----------------------------------+
>      |   | devlink| devlink| devlink|   |
>      |   |  port  |  port  |  port  |   |
>      |   |    0   |    1   |    2   |   |
>      |   +--------------------------+   |
>      |                                  |
>      |                                  |
>      |           E-switch               |
>      |                                  |
>      |                                  |
>      |          +--------+              |
>      |          | uplink |              |
>      |          | devlink|              |
>      |          |  port  |              |
>      +----------------------------------+
>  
> Devlink command example:
> 
> A privileged user wants to configure a VF's hw_addr, before the VF is
> enabled.
> 
> $ devlink subdev set pci/0000:03:00.0/1 hw_addr 10:22:33:44:55:66
> 
> $ devlink subdev show pci/0000:03:00.0/1
> pci/0000:03:00.0/1: flavour pcivf pf 0 vf 0 port_index 1 hw_addr 10:22:33:44:55:66
> 
> $ devlink subdev show pci/0000:03:00.0/1 -jp
> {
>     "subdev": {
>         "pci/0000:03:00.0/1": {
>             "flavour": "pcivf",

If the flavour is pcivf what differentiates this (Ethernet) VF from 
a NVME one?

>             "pf": 0,
>             "vf": 0,
>             "port_index": 1,
>             "hw_addr": "10:22:33:44:55:66"

Since you're messing with the "hw_addr", you should at least provision
the uAPI for adding multiple addresses. Intel guys were asking for this
long time ago.

Have you considered implementing some compat code so drivers don't
have to implement the legacy ndos if they support subdevs?

>         }
>     }
> }

Okay, so you added two diagrams. I guess I was naive in thinking that
"you thought this all through in detail and have more documentation and
design docs internally".

I don't like how unconstrained this is, the only implemented use case is
weak. But since you're not seeing this you probably never will, so
seems like a waste of time to fight it.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [PATCH net-next v2 00/10] devlink subdev
  2019-11-11 18:00 ` [PATCH net-next v2 00/10] devlink subdev Jakub Kicinski
@ 2019-11-11 18:46   ` Yuval Avnery
  2019-11-12 14:55     ` Andy Gospodarek
  0 siblings, 1 reply; 23+ messages in thread
From: Yuval Avnery @ 2019-11-11 18:46 UTC (permalink / raw)
  To: Jakub Kicinski, Jiri Pirko
  Cc: netdev, Saeed Mahameed, leon, davem, shuah, Daniel Jurgens,
	Parav Pandit, andrew.gospodarek, michael.chan



> -----Original Message-----
> From: Jakub Kicinski <jakub.kicinski@netronome.com>
> Sent: Monday, November 11, 2019 10:00 AM
> To: Yuval Avnery <yuvalav@mellanox.com>; Jiri Pirko <jiri@mellanox.com>
> Cc: netdev@vger.kernel.org; Saeed Mahameed <saeedm@mellanox.com>;
> leon@kernel.org; davem@davemloft.net; shuah@kernel.org; Daniel Jurgens
> <danielj@mellanox.com>; Parav Pandit <parav@mellanox.com>;
> andrew.gospodarek@broadcom.com; michael.chan@broadcom.com
> Subject: Re: [PATCH net-next v2 00/10] devlink subdev
> 
> On Fri,  8 Nov 2019 18:18:36 +0200, Yuval Avnery wrote:
> > This patchset introduces devlink subdev.
> >
> > Currently, legacy tools do not provide a comprehensive solution that
> > can be used in both SmartNic and non-SmartNic mode.
> > Subdev represents a device that exists on the ASIC but is not
> > necessarily visible to the kernel.
> >
> > Using devlink ports is not suitable because:
> >
> > 1. Those devices aren't necessarily network devices (such as NVMe devices)
> >    and doesn’t have E-switch representation. Therefore, there is need for
> >    more generic representation of PCI VF.
> > 2. Some attributes are not necessarily pure port attributes
> >    (number of MSIX vectors)
> 
> PCIe attrs will require persistence. Where is that in this design?
> 
> Also MSIX vectors are configuration of the devlink port (ASIC side), the only
> reason you're putting them in subdev is because some of the subdevs don't
> have a port, muddying up the meaning of things.

The way I see this, they are both on the ASIC side.
But port has nothing to do with MSIX so subdev makes more sense.

> 
> > 3. It creates a confusing devlink topology, with multiple port flavours
> >    and indices.
> >
> > Subdev will be created along with flavour and attributes.
> > Some network subdevs may be linked with a devlink port.
> >
> > This is also aimed to replace "ip link vf" commands as they are
> > strongly linked to the PCI topology and allow access only to enabled VFs.
> > Even though current patchset and example is limited to MAC address of
> > the VF, this interface will allow to manage PF, VF, mdev in SmartNic
> > and non SmartNic modes, in unified way for networking and
> > non-networking devices via devlink instance.
> >
> > Use case example:
> > An example system view of a networking ASIC (aka SmartNIC), can be
> > seen in below diagram, where devlink eswitch instance and PCI PF
> > and/or VFs are situated on two different CPU subsystems:
> >
> >
> >       +------------------------------+
> >       |                              |
> >       |             HOST             |
> >       |                              |
> >       |   +----+-----+-----+-----+   |
> >       |   | PF | VF0 | VF1 | VF2 |   |
> >       +---+----+-----------+-----+---+
> >                  PCI1|
> >           +---+------------+
> >               |
> >      +----------------------------------------+
> >      |        |         SmartNic              |
> >      |   +----+-------------------------+     |
> >      |   |                              |     |
> >      |   |               NIC            |     |
> >      |   |                              |     |
> >      |   +---------------------+--------+     |
> >      |                         |  PCI2        |
> >      |         +-----+---------+--+           |
> >      |               |                        |
> >      |      +-----+--+--+--------------+      |
> >      |      |     | PF  |              |      |
> >      |      |     +-----+              |      |
> >      |      |      Embedded CPU        |      |
> >      |      |                          |      |
> >      |      +--------------------------+      |
> >      |                                        |
> >      +----------------------------------------+
> >
> > The below diagram shows an example devlink subdev topology where
> some
> > subdevs are connected to devlink ports::
> >
> >
> >
> >             (PF0)    (VF0)    (VF1)           (NVME VF2)
> >          +--------------------------+         +--------+
> >          | devlink| devlink| devlink|         | devlink|
> >          | subdev | subdev | subdev |         | subdev |
> >          |    0   |    1   |    2   |         |    3   |
> >          +--------------------------+         +--------+
> >               |        |        |
> 
> What is this NVME VF2 connected to? It's gotta get traffic from somewhere?
> Frames come in from the uplink, then what?

The whole point here is that this is not a network device
So it has nothing to do with the network.
It is simply a PCI device exposed by the NIC to the host.
devlink subdev lets us configure attributes for this device.

> 
> >               |        |        |
> >               |        |        |
> >      +----------------------------------+
> >      |   | devlink| devlink| devlink|   |
> >      |   |  port  |  port  |  port  |   |
> >      |   |    0   |    1   |    2   |   |
> >      |   +--------------------------+   |
> >      |                                  |
> >      |                                  |
> >      |           E-switch               |
> >      |                                  |
> >      |                                  |
> >      |          +--------+              |
> >      |          | uplink |              |
> >      |          | devlink|              |
> >      |          |  port  |              |
> >      +----------------------------------+
> >
> > Devlink command example:
> >
> > A privileged user wants to configure a VF's hw_addr, before the VF is
> > enabled.
> >
> > $ devlink subdev set pci/0000:03:00.0/1 hw_addr 10:22:33:44:55:66
> >
> > $ devlink subdev show pci/0000:03:00.0/1
> > pci/0000:03:00.0/1: flavour pcivf pf 0 vf 0 port_index 1 hw_addr
> > 10:22:33:44:55:66
> >
> > $ devlink subdev show pci/0000:03:00.0/1 -jp {
> >     "subdev": {
> >         "pci/0000:03:00.0/1": {
> >             "flavour": "pcivf",
> 
> If the flavour is pcivf what differentiates this (Ethernet) VF from a NVME
> one?
> 

This describes the bus not the device purpose.

> >             "pf": 0,
> >             "vf": 0,
> >             "port_index": 1,
> >             "hw_addr": "10:22:33:44:55:66"
> 
> Since you're messing with the "hw_addr", you should at least provision the
> uAPI for adding multiple addresses. Intel guys were asking for this long time
> ago.

This is the "permanent" address. The one the subdev boots with.

> 
> Have you considered implementing some compat code so drivers don't have
> to implement the legacy ndos if they support subdevs?

Will consider

> 
> >         }
> >     }
> > }
> 
> Okay, so you added two diagrams. I guess I was naive in thinking that "you
> thought this all through in detail and have more documentation and design
> docs internally".
> 
> I don't like how unconstrained this is, the only implemented use case is
> weak. But since you're not seeing this you probably never will, so seems like
> a waste of time to fight it.

What am I missing? How would you constrain this?
I don’t see how this is weak. There is currently no way to set/show those devices from SmartNic.
We introduced a way that will cover both SmartNic and non-SmartNic mode.
We are also planning to add rate groups which will control the min/max rate of each subdev,
and support grouping them into rate groups.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next v2 00/10] devlink subdev
  2019-11-11 18:46   ` Yuval Avnery
@ 2019-11-12 14:55     ` Andy Gospodarek
  2019-11-12 18:35       ` Yuval Avnery
  2019-11-13  6:37       ` Parav Pandit
  0 siblings, 2 replies; 23+ messages in thread
From: Andy Gospodarek @ 2019-11-12 14:55 UTC (permalink / raw)
  To: Yuval Avnery
  Cc: Jakub Kicinski, Jiri Pirko, netdev, Saeed Mahameed, leon, davem,
	shuah, Daniel Jurgens, Parav Pandit, andrew.gospodarek,
	michael.chan

On Mon, Nov 11, 2019 at 06:46:52PM +0000, Yuval Avnery wrote:
> 
> 
> > -----Original Message-----
> > From: Jakub Kicinski <jakub.kicinski@netronome.com>
> > Sent: Monday, November 11, 2019 10:00 AM
> > To: Yuval Avnery <yuvalav@mellanox.com>; Jiri Pirko <jiri@mellanox.com>
> > Cc: netdev@vger.kernel.org; Saeed Mahameed <saeedm@mellanox.com>;
> > leon@kernel.org; davem@davemloft.net; shuah@kernel.org; Daniel Jurgens
> > <danielj@mellanox.com>; Parav Pandit <parav@mellanox.com>;
> > andrew.gospodarek@broadcom.com; michael.chan@broadcom.com
> > Subject: Re: [PATCH net-next v2 00/10] devlink subdev
> > 
> > On Fri,  8 Nov 2019 18:18:36 +0200, Yuval Avnery wrote:
> > > This patchset introduces devlink subdev.
> > >
> > > Currently, legacy tools do not provide a comprehensive solution that
> > > can be used in both SmartNic and non-SmartNic mode.
> > > Subdev represents a device that exists on the ASIC but is not
> > > necessarily visible to the kernel.
> > >
> > > Using devlink ports is not suitable because:
> > >
> > > 1. Those devices aren't necessarily network devices (such as NVMe devices)
> > >    and doesn’t have E-switch representation. Therefore, there is need for
> > >    more generic representation of PCI VF.
> > > 2. Some attributes are not necessarily pure port attributes
> > >    (number of MSIX vectors)
> > 
> > PCIe attrs will require persistence. Where is that in this design?
> > 
> > Also MSIX vectors are configuration of the devlink port (ASIC side), the only
> > reason you're putting them in subdev is because some of the subdevs don't
> > have a port, muddying up the meaning of things.
> 
> The way I see this, they are both on the ASIC side.
> But port has nothing to do with MSIX so subdev makes more sense.
> 
> > 
> > > 3. It creates a confusing devlink topology, with multiple port flavours
> > >    and indices.
> > >
> > > Subdev will be created along with flavour and attributes.
> > > Some network subdevs may be linked with a devlink port.
> > >
> > > This is also aimed to replace "ip link vf" commands as they are
> > > strongly linked to the PCI topology and allow access only to enabled VFs.
> > > Even though current patchset and example is limited to MAC address of
> > > the VF, this interface will allow to manage PF, VF, mdev in SmartNic
> > > and non SmartNic modes, in unified way for networking and
> > > non-networking devices via devlink instance.
> > >
> > > Use case example:
> > > An example system view of a networking ASIC (aka SmartNIC), can be
> > > seen in below diagram, where devlink eswitch instance and PCI PF
> > > and/or VFs are situated on two different CPU subsystems:
> > >
> > >
> > >       +------------------------------+
> > >       |                              |
> > >       |             HOST             |
> > >       |                              |
> > >       |   +----+-----+-----+-----+   |
> > >       |   | PF | VF0 | VF1 | VF2 |   |
> > >       +---+----+-----------+-----+---+
> > >                  PCI1|
> > >           +---+------------+
> > >               |
> > >      +----------------------------------------+
> > >      |        |         SmartNic              |
> > >      |   +----+-------------------------+     |
> > >      |   |                              |     |
> > >      |   |               NIC            |     |
> > >      |   |                              |     |
> > >      |   +---------------------+--------+     |
> > >      |                         |  PCI2        |
> > >      |         +-----+---------+--+           |
> > >      |               |                        |
> > >      |      +-----+--+--+--------------+      |
> > >      |      |     | PF  |              |      |
> > >      |      |     +-----+              |      |
> > >      |      |      Embedded CPU        |      |
> > >      |      |                          |      |
> > >      |      +--------------------------+      |
> > >      |                                        |
> > >      +----------------------------------------+
> > >
> > > The below diagram shows an example devlink subdev topology where
> > some
> > > subdevs are connected to devlink ports::
> > >
> > >
> > >
> > >             (PF0)    (VF0)    (VF1)           (NVME VF2)
> > >          +--------------------------+         +--------+
> > >          | devlink| devlink| devlink|         | devlink|
> > >          | subdev | subdev | subdev |         | subdev |
> > >          |    0   |    1   |    2   |         |    3   |
> > >          +--------------------------+         +--------+
> > >               |        |        |
> > 
> > What is this NVME VF2 connected to? It's gotta get traffic from somewhere?
> > Frames come in from the uplink, then what?
> 
> The whole point here is that this is not a network device
> So it has nothing to do with the network.
> It is simply a PCI device exposed by the NIC to the host.
> devlink subdev lets us configure attributes for this device.
> 
> > 
> > >               |        |        |
> > >               |        |        |
> > >      +----------------------------------+
> > >      |   | devlink| devlink| devlink|   |
> > >      |   |  port  |  port  |  port  |   |
> > >      |   |    0   |    1   |    2   |   |
> > >      |   +--------------------------+   |
> > >      |                                  |
> > >      |                                  |
> > >      |           E-switch               |
> > >      |                                  |
> > >      |                                  |
> > >      |          +--------+              |
> > >      |          | uplink |              |
> > >      |          | devlink|              |
> > >      |          |  port  |              |
> > >      +----------------------------------+
> > >
> > > Devlink command example:
> > >
> > > A privileged user wants to configure a VF's hw_addr, before the VF is
> > > enabled.
> > >
> > > $ devlink subdev set pci/0000:03:00.0/1 hw_addr 10:22:33:44:55:66
> > >
> > > $ devlink subdev show pci/0000:03:00.0/1
> > > pci/0000:03:00.0/1: flavour pcivf pf 0 vf 0 port_index 1 hw_addr
> > > 10:22:33:44:55:66
> > >
> > > $ devlink subdev show pci/0000:03:00.0/1 -jp {
> > >     "subdev": {
> > >         "pci/0000:03:00.0/1": {
> > >             "flavour": "pcivf",
> > 
> > If the flavour is pcivf what differentiates this (Ethernet) VF from a NVME
> > one?
> > 
> 
> This describes the bus not the device purpose.
> 
> > >             "pf": 0,
> > >             "vf": 0,
> > >             "port_index": 1,
> > >             "hw_addr": "10:22:33:44:55:66"
> > 
> > Since you're messing with the "hw_addr", you should at least provision the
> > uAPI for adding multiple addresses. Intel guys were asking for this long time
> > ago.
> 
> This is the "permanent" address. The one the subdev boots with.
> 
> > 
> > Have you considered implementing some compat code so drivers don't have
> > to implement the legacy ndos if they support subdevs?
> 
> Will consider
> 
> > 
> > >         }
> > >     }
> > > }
> > 
> > Okay, so you added two diagrams. I guess I was naive in thinking that "you
> > thought this all through in detail and have more documentation and design
> > docs internally".
> > 
> > I don't like how unconstrained this is, the only implemented use case is
> > weak. But since you're not seeing this you probably never will, so seems like
> > a waste of time to fight it.
> 
> What am I missing? How would you constrain this?

It feels a bit 'unconstrained' to me as well.  As Jakub said you added
some documentation, but the direction of this long-term is not clear.
What seems to happen too often is that we skip creating better infra for
existing devices and create it only for the newest shiniest object.
These changes are what garners the most attention from those who grant
permission to push things upstream (*cough* managers *cough*), but it's
not the right choice in this case.  I'm not sure if that is part of what
bothers Jakub, but it is one thing that bothers me and why this feels
incomplete.

The thing that has been bouncing around in my head about this (and until
I was in front of a good text-based MUA I didn't dare respond) is that
we seem to have an overlap in functionality between what you are
proposing and existing virtual device configuration, but you are
completely ignoring improving upon the existing creation methods.
Whether dealing with a SmartNIC (which I used to describe a NIC with
general purpose processors that could be running Linux and I think you
do, too) or a regular NIC it seems like we should use devlink to create
and configure a variety of devices these could be:

1.  Number of PFs (there are cases where more than 1 PF per uplink port
    is required for command/control of the SmartNIC or where a single
    PCI b/d/f may have many uplink ports that need to be addressed
    separately)
2.  Device-specific SR-IOV VFs visible to server
3.  mdev devices that _may_ have representers on the embedded cores of
    the NIC
4.  Hardware VirtIO-net devices supported
5.  Other non-network devices (storage given as the first example) 
6.  ...

We cannot get rid of the methods for creating hardware VFs via sysfs,
but now that we are seeing lots of different flavors of devices that
might be created we should start by making sure at a minimum we can
create existing hardware devices (VFs) with this devlink interface and
move from there.  Is there a plan to use this interface to create
subdevs that are VFs or just new subdev flavors?  I could start to get
behind an interface like this if the patches showed that devlink would
be the one place where device creation and control could be done for all
types of subdevs, but that isn't clear that it is your direction based
on the patches.

So just to make sure this is clear, what I'm proposing that devlink is
used to create and configure all types of devices that it may be
possible to create, configure, or address from a PF and the starting
place should be standard, hardware VFs.  If that can be done right and
we can address some of the issues with the current implementation (more
than one hw addr is a _great_ example) then adding new flavours for the
server devices and and adding new flavors for SmartNIC-based devices
should be easy and feel natural.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [PATCH net-next v2 00/10] devlink subdev
  2019-11-12 14:55     ` Andy Gospodarek
@ 2019-11-12 18:35       ` Yuval Avnery
  2019-11-12 22:19         ` Jakub Kicinski
  2019-11-13  6:37       ` Parav Pandit
  1 sibling, 1 reply; 23+ messages in thread
From: Yuval Avnery @ 2019-11-12 18:35 UTC (permalink / raw)
  To: Andy Gospodarek
  Cc: Jakub Kicinski, Jiri Pirko, netdev, Saeed Mahameed, leon, davem,
	shuah, Daniel Jurgens, Parav Pandit, michael.chan



> -----Original Message-----
> From: Andy Gospodarek <andrew.gospodarek@broadcom.com>
> Sent: Tuesday, November 12, 2019 6:56 AM
> To: Yuval Avnery <yuvalav@mellanox.com>
> Cc: Jakub Kicinski <jakub.kicinski@netronome.com>; Jiri Pirko
> <jiri@mellanox.com>; netdev@vger.kernel.org; Saeed Mahameed
> <saeedm@mellanox.com>; leon@kernel.org; davem@davemloft.net;
> shuah@kernel.org; Daniel Jurgens <danielj@mellanox.com>; Parav Pandit
> <parav@mellanox.com>; andrew.gospodarek@broadcom.com;
> michael.chan@broadcom.com
> Subject: Re: [PATCH net-next v2 00/10] devlink subdev
> 
> On Mon, Nov 11, 2019 at 06:46:52PM +0000, Yuval Avnery wrote:
> >
> >
> > > -----Original Message-----
> > > From: Jakub Kicinski <jakub.kicinski@netronome.com>
> > > Sent: Monday, November 11, 2019 10:00 AM
> > > To: Yuval Avnery <yuvalav@mellanox.com>; Jiri Pirko
> > > <jiri@mellanox.com>
> > > Cc: netdev@vger.kernel.org; Saeed Mahameed
> <saeedm@mellanox.com>;
> > > leon@kernel.org; davem@davemloft.net; shuah@kernel.org; Daniel
> > > Jurgens <danielj@mellanox.com>; Parav Pandit <parav@mellanox.com>;
> > > andrew.gospodarek@broadcom.com; michael.chan@broadcom.com
> > > Subject: Re: [PATCH net-next v2 00/10] devlink subdev
> > >
> > > On Fri,  8 Nov 2019 18:18:36 +0200, Yuval Avnery wrote:
> > > > This patchset introduces devlink subdev.
> > >
> > > Okay, so you added two diagrams. I guess I was naive in thinking
> > > that "you thought this all through in detail and have more
> > > documentation and design docs internally".
> > >
> > > I don't like how unconstrained this is, the only implemented use
> > > case is weak. But since you're not seeing this you probably never
> > > will, so seems like a waste of time to fight it.
> >
> > What am I missing? How would you constrain this?
> 
> It feels a bit 'unconstrained' to me as well.  As Jakub said you added some
> documentation, but the direction of this long-term is not clear.
> What seems to happen too often is that we skip creating better infra for
> existing devices and create it only for the newest shiniest object.
> These changes are what garners the most attention from those who grant
> permission to push things upstream (*cough* managers *cough*), but it's
> not the right choice in this case.  I'm not sure if that is part of what bothers
> Jakub, but it is one thing that bothers me and why this feels incomplete.
> 
> The thing that has been bouncing around in my head about this (and until I
> was in front of a good text-based MUA I didn't dare respond) is that we seem
> to have an overlap in functionality between what you are proposing and
> existing virtual device configuration, but you are completely ignoring
> improving upon the existing creation methods.
> Whether dealing with a SmartNIC (which I used to describe a NIC with
> general purpose processors that could be running Linux and I think you do,
> too) or a regular NIC it seems like we should use devlink to create and
> configure a variety of devices these could be:
> 
> 1.  Number of PFs (there are cases where more than 1 PF per uplink port
>     is required for command/control of the SmartNIC or where a single
>     PCI b/d/f may have many uplink ports that need to be addressed
>     separately)

Yes, uplink ports can be addressed using "devlink port".

Multiple pfs are already supported.

$ devlink subdev show
pci/0000:03:00.0/1: flavour pcipf pf 0
pci/0000:03:00.0/1: flavour pcipf pf 1
pci/0000:03:00.0/1: flavour pcivf pf 1 vf 0

> 2.  Device-specific SR-IOV VFs visible to server 3.  mdev devices that _may_
> have representers on the embedded cores of
>     the NIC

Yes, Is also planned for current interface. (will need to add mdev flavor in the future)

> 4.  Hardware VirtIO-net devices supported 5.  Other non-network devices
> (storage given as the first example) 6.  ...

All is up to driver - what to expose and what flavor to grant the subdev.

> 
> We cannot get rid of the methods for creating hardware VFs via sysfs, but
> now that we are seeing lots of different flavors of devices that might be
> created we should start by making sure at a minimum we can create existing
> hardware devices (VFs) with this devlink interface and move from there.  Is
> there a plan to use this interface to create subdevs that are VFs or just new
> subdev flavors?  I could start to get behind an interface like this if the patches
> showed that devlink would be the one place where device creation and
> control could be done for all types of subdevs, but that isn't clear that it is
> your direction based on the patches.

I don’t see how the SmartNic can force the host PF to create new VFs.
Isn’t that the host PF responsibility? Doesn't make sense to me.
Do we want to have an interface that works only for NICs and not from the SmartNic embedded system?
 
What we do here is expose all the PFS and potential VFs/mdevs that the host can enable.
Unlike ip link which exposes only enabled VFs, here (in mlx5 implementation) we expose all the VFs that the host PF _can_ enable.
This allows, for example, to set the MAC of a VF before it is enabled.

> 
> So just to make sure this is clear, what I'm proposing that devlink is used to
> create and configure all types of devices that it may be possible to create,

More flavors can be defiantly added in the future. About creation, see my comment above.
Actually flavor is the only required attribute for registration. The rest is optional.
 
> configure, or address from a PF and the starting place should be standard,
> hardware VFs.  If that can be done right and we can address some of the
> issues with the current implementation (more than one hw addr is a _great_
> example) then adding new flavours for the server devices and and adding
> new flavors for SmartNIC-based devices should be easy and feel natural.

So what you are saying that allowing multiple addresses per subdev will make a strong case for this patch set?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next v2 00/10] devlink subdev
  2019-11-12 18:35       ` Yuval Avnery
@ 2019-11-12 22:19         ` Jakub Kicinski
  2019-11-12 23:32           ` Yuval Avnery
  0 siblings, 1 reply; 23+ messages in thread
From: Jakub Kicinski @ 2019-11-12 22:19 UTC (permalink / raw)
  To: Yuval Avnery
  Cc: Andy Gospodarek, Jiri Pirko, netdev, Saeed Mahameed, leon, davem,
	shuah, Daniel Jurgens, Parav Pandit, michael.chan

On Tue, 12 Nov 2019 18:35:26 +0000, Yuval Avnery wrote:
> > It feels a bit 'unconstrained' to me as well.  As Jakub said you added some
> > documentation, but the direction of this long-term is not clear.
> > What seems to happen too often is that we skip creating better infra for
> > existing devices and create it only for the newest shiniest object.
> > These changes are what garners the most attention from those who grant
> > permission to push things upstream (*cough* managers *cough*), but it's
> > not the right choice in this case.  I'm not sure if that is part of what bothers
> > Jakub, but it is one thing that bothers me and why this feels incomplete.
> > 
> > The thing that has been bouncing around in my head about this (and until I
> > was in front of a good text-based MUA I didn't dare respond) is that we seem
> > to have an overlap in functionality between what you are proposing and
> > existing virtual device configuration, but you are completely ignoring
> > improving upon the existing creation methods.
> > Whether dealing with a SmartNIC (which I used to describe a NIC with
> > general purpose processors that could be running Linux and I think you do,
> > too) or a regular NIC it seems like we should use devlink to create and
> > configure a variety of devices these could be:
> > 
> > 1.  Number of PFs (there are cases where more than 1 PF per uplink port
> >     is required for command/control of the SmartNIC or where a single
> >     PCI b/d/f may have many uplink ports that need to be addressed
> >     separately)  
> 
> Yes, uplink ports can be addressed using "devlink port".
> 
> Multiple pfs are already supported.
> 
> $ devlink subdev show
> pci/0000:03:00.0/1: flavour pcipf pf 0
> pci/0000:03:00.0/1: flavour pcipf pf 1
> pci/0000:03:00.0/1: flavour pcivf pf 1 vf 0

I don't think that covers what Andy was taking about.

> > 2.  Device-specific SR-IOV VFs visible to server 3.  mdev devices that _may_
> > have representers on the embedded cores of
> >     the NIC  
> 
> Yes, Is also planned for current interface. (will need to add mdev flavor in the future)
> 
> > 4.  Hardware VirtIO-net devices supported 5.  Other non-network devices
> > (storage given as the first example) 6.  ...  
> 
> All is up to driver - what to expose and what flavor to grant the subdev.
> 
> > We cannot get rid of the methods for creating hardware VFs via sysfs, but
> > now that we are seeing lots of different flavors of devices that might be
> > created we should start by making sure at a minimum we can create existing
> > hardware devices (VFs) with this devlink interface and move from there.  Is
> > there a plan to use this interface to create subdevs that are VFs or just new
> > subdev flavors?  I could start to get behind an interface like this if the patches
> > showed that devlink would be the one place where device creation and
> > control could be done for all types of subdevs, but that isn't clear that it is
> > your direction based on the patches.  
> 
> I don’t see how the SmartNic can force the host PF to create new VFs.
> Isn’t that the host PF responsibility? Doesn't make sense to me.
> Do we want to have an interface that works only for NICs and not from
> the SmartNic embedded system?
>
> What we do here is expose all the PFS and potential VFs/mdevs that the host can enable.
> Unlike ip link which exposes only enabled VFs, here (in mlx5
> implementation) we expose all the VFs that the host PF _can_ enable.
> This allows, for example, to set the MAC of a VF before it is enabled.

So subdev will never be used to create interfaces? Just expose all
_possible_? Ugh. Wouldn't that be something to document?

How things are configured matters. This patch set throws some basics
together to get you the hw_addr from the SmartNIC side, and then you
start talking big plans like NVMe and PCIe control which you don't seem
to have thought through.

How does the PF in your diagram magically have different types of VFs?
PCIe SR-IOV cap has VF device vendor:product, you won't have NVMe VFs
hanging off the main PF like that in a normal world :/

Jiri pitched the subdev object as the "other side of the wire" but you
are back to arguing that you think it's actually the same as the port,
just more general. How does the hw_addr belongs to the ASIC side then?

And for PCIe you'd use this new interface for MSI-X or other resource
allocation. Say VFs. Who controls that in a SmartNIC scenario? Is the
host not allowed to move them around? This starts to touch multi-host
use case which Linux networking never begun to address.

You're creating a new object type with no clear semantics just to throw
in one feature - namely the hw_addr. If this is the quality of design
we can expect - just add the hw addr to devlink ports and let's move on.

> > So just to make sure this is clear, what I'm proposing that devlink
> > is used to create and configure all types of devices that it may be
> > possible to create,  
> 
> More flavors can be defiantly added in the future. About creation,
> see my comment above. Actually flavor is the only required attribute
> for registration. The rest is optional. 
> > configure, or address from a PF and the starting place should be
> > standard, hardware VFs.  If that can be done right and we can
> > address some of the issues with the current implementation (more
> > than one hw addr is a _great_ example) then adding new flavours for
> > the server devices and and adding new flavors for SmartNIC-based
> > devices should be easy and feel natural.  
> 
> So what you are saying that allowing multiple addresses per subdev
> will make a strong case for this patch set?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [PATCH net-next v2 00/10] devlink subdev
  2019-11-12 22:19         ` Jakub Kicinski
@ 2019-11-12 23:32           ` Yuval Avnery
  0 siblings, 0 replies; 23+ messages in thread
From: Yuval Avnery @ 2019-11-12 23:32 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Andy Gospodarek, Jiri Pirko, netdev, Saeed Mahameed, leon, davem,
	shuah, Daniel Jurgens, Parav Pandit, michael.chan



> -----Original Message-----
> From: Jakub Kicinski <jakub.kicinski@netronome.com>
> Sent: Tuesday, November 12, 2019 2:20 PM
> To: Yuval Avnery <yuvalav@mellanox.com>
> Cc: Andy Gospodarek <andrew.gospodarek@broadcom.com>; Jiri Pirko
> <jiri@mellanox.com>; netdev@vger.kernel.org; Saeed Mahameed
> <saeedm@mellanox.com>; leon@kernel.org; davem@davemloft.net;
> shuah@kernel.org; Daniel Jurgens <danielj@mellanox.com>; Parav Pandit
> <parav@mellanox.com>; michael.chan@broadcom.com
> Subject: Re: [PATCH net-next v2 00/10] devlink subdev
> 
> On Tue, 12 Nov 2019 18:35:26 +0000, Yuval Avnery wrote:
> > > It feels a bit 'unconstrained' to me as well.  As Jakub said you
> > > added some documentation, but the direction of this long-term is not
> clear.
> > > What seems to happen too often is that we skip creating better infra
> > > for existing devices and create it only for the newest shiniest object.
> > > These changes are what garners the most attention from those who
> > > grant permission to push things upstream (*cough* managers *cough*),
> > > but it's not the right choice in this case.  I'm not sure if that is
> > > part of what bothers Jakub, but it is one thing that bothers me and why
> this feels incomplete.
> > >
> > > The thing that has been bouncing around in my head about this (and
> > > until I was in front of a good text-based MUA I didn't dare respond)
> > > is that we seem to have an overlap in functionality between what you
> > > are proposing and existing virtual device configuration, but you are
> > > completely ignoring improving upon the existing creation methods.
> > > Whether dealing with a SmartNIC (which I used to describe a NIC with
> > > general purpose processors that could be running Linux and I think
> > > you do,
> > > too) or a regular NIC it seems like we should use devlink to create
> > > and configure a variety of devices these could be:
> > >
> > > 1.  Number of PFs (there are cases where more than 1 PF per uplink port
> > >     is required for command/control of the SmartNIC or where a single
> > >     PCI b/d/f may have many uplink ports that need to be addressed
> > >     separately)
> >
> > Yes, uplink ports can be addressed using "devlink port".
> >
> > Multiple pfs are already supported.
> >
> > $ devlink subdev show
> > pci/0000:03:00.0/1: flavour pcipf pf 0
> > pci/0000:03:00.0/1: flavour pcipf pf 1
> > pci/0000:03:00.0/1: flavour pcivf pf 1 vf 0
> 
> I don't think that covers what Andy was taking about.
> 
> > > 2.  Device-specific SR-IOV VFs visible to server 3.  mdev devices
> > > that _may_ have representers on the embedded cores of
> > >     the NIC
> >
> > Yes, Is also planned for current interface. (will need to add mdev
> > flavor in the future)
> >
> > > 4.  Hardware VirtIO-net devices supported 5.  Other non-network
> > > devices (storage given as the first example) 6.  ...
> >
> > All is up to driver - what to expose and what flavor to grant the subdev.
> >
> > > We cannot get rid of the methods for creating hardware VFs via
> > > sysfs, but now that we are seeing lots of different flavors of
> > > devices that might be created we should start by making sure at a
> > > minimum we can create existing hardware devices (VFs) with this
> > > devlink interface and move from there.  Is there a plan to use this
> > > interface to create subdevs that are VFs or just new subdev flavors?
> > > I could start to get behind an interface like this if the patches
> > > showed that devlink would be the one place where device creation and
> > > control could be done for all types of subdevs, but that isn't clear that it is
> your direction based on the patches.
> >
> > I don’t see how the SmartNic can force the host PF to create new VFs.
> > Isn’t that the host PF responsibility? Doesn't make sense to me.
> > Do we want to have an interface that works only for NICs and not from
> > the SmartNic embedded system?
> >
> > What we do here is expose all the PFS and potential VFs/mdevs that the
> host can enable.
> > Unlike ip link which exposes only enabled VFs, here (in mlx5
> > implementation) we expose all the VFs that the host PF _can_ enable.
> > This allows, for example, to set the MAC of a VF before it is enabled.
> 
> So subdev will never be used to create interfaces? Just expose all _possible_?
> Ugh. Wouldn't that be something to document?
> 

I think it should be the driver's implementation decision what subdevs to expose and when,
but perhaps we need more consistency here.

> How things are configured matters. This patch set throws some basics
> together to get you the hw_addr from the SmartNIC side, and then you start
> talking big plans like NVMe and PCIe control which you don't seem to have
> thought through.
> 
> How does the PF in your diagram magically have different types of VFs?
> PCIe SR-IOV cap has VF device vendor:product, you won't have NVMe VFs
> hanging off the main PF like that in a normal world :/
 
I don’t care what the VF does, it is a VF that the ASIC exposes.
NVME was just an example. The driver can choose which VFs to expose and with what
Relevant attributes.

> Jiri pitched the subdev object as the "other side of the wire" but you are back
> to arguing that you think it's actually the same as the port, just more general.
> How does the hw_addr belongs to the ASIC side then?

No, the port is a port on the e-switch, not the NIC port. 
The subdev holds the NIC's attributes (among other attributes)	
Yes, the peer sub-object would cover this case but we wanted to give a more
general solution that covers all the PCIe functions that the ASIC exposes to the host
(also those that are not NICs).

> And for PCIe you'd use this new interface for MSI-X or other resource
> allocation. Say VFs. Who controls that in a SmartNIC scenario? Is the host not
> allowed to move them around? This starts to touch multi-host use case
> which Linux networking never begun to address.

I don’t think we should discuss MSI-X specifics, it is just an example of a resource.
Probably max_tx_rate is a much better example.

> You're creating a new object type with no clear semantics just to throw in
> one feature - namely the hw_addr. If this is the quality of design we can
> expect - just add the hw addr to devlink ports and let's move on.
> 

But it is not the hw_addr of the switch port.
It is logically not true.
Actually the fact that the ports have flavors which tell to which device
they are connected to, is a bit awkward.

> > > So just to make sure this is clear, what I'm proposing that devlink
> > > is used to create and configure all types of devices that it may be
> > > possible to create,
> >
> > More flavors can be defiantly added in the future. About creation, see
> > my comment above. Actually flavor is the only required attribute for
> > registration. The rest is optional.
> > > configure, or address from a PF and the starting place should be
> > > standard, hardware VFs.  If that can be done right and we can
> > > address some of the issues with the current implementation (more
> > > than one hw addr is a _great_ example) then adding new flavours for
> > > the server devices and and adding new flavors for SmartNIC-based
> > > devices should be easy and feel natural.
> >
> > So what you are saying that allowing multiple addresses per subdev
> > will make a strong case for this patch set?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [PATCH net-next v2 00/10] devlink subdev
  2019-11-12 14:55     ` Andy Gospodarek
  2019-11-12 18:35       ` Yuval Avnery
@ 2019-11-13  6:37       ` Parav Pandit
  2019-11-14 16:47         ` Yuval Avnery
  1 sibling, 1 reply; 23+ messages in thread
From: Parav Pandit @ 2019-11-13  6:37 UTC (permalink / raw)
  To: Andy Gospodarek, Yuval Avnery
  Cc: Jakub Kicinski, Jiri Pirko, netdev, Saeed Mahameed, leon, davem,
	shuah, Daniel Jurgens, michael.chan

Hi Andy,

> From: Andy Gospodarek <andrew.gospodarek@broadcom.com>
> Sent: Tuesday, November 12, 2019 8:56 AM
> On Mon, Nov 11, 2019 at 06:46:52PM +0000, Yuval Avnery wrote:
> >
> > >
[..]

> 
> It feels a bit 'unconstrained' to me as well.  As Jakub said you added some
> documentation, but the direction of this long-term is not clear.

I am interested to know if the long-term direction of the devlink was clear enough in cover letter [1], that devlink device will change net namespace, it will have health reporters, dump regions, eswitch VF and PF ports, representor binding, "enable_roce" extensions.
Because that vision is not clear in the cover letter [1] at least to me.

Resource division was very clearly described in (4) in cover letter [1] which you were fan of from 2016 in [2].
MSIX or in general interrupts resource splitting among VFs is no different than any other resource division.

devlink eswitch port for anchoring interrupt attribute is certainly abuse of an 'esw port' object.
Defining a subdev object, even though today it holds just 2 attributes (hw_address, and msix interrupt), it still sane object hierarchy.
Setting host side mac address is not really job of eswitch ports.
It may be on the same ASIC, but it should be put at right object.

> What seems to happen too often is that we skip creating better infra for
> existing devices and create it only for the newest shiniest object.
This infrastructure is for existing devices (non smartnic devices too).
May be worth clarifying explicitly in cover letter.

> I'm not sure if that is part of what bothers Jakub,
> but it is one thing that bothers me and why this feels incomplete.
Let's work towards making it complete as much as for given functionality.

> 
> The thing that has been bouncing around in my head about this (and until I was
> in front of a good text-based MUA I didn't dare respond) is that we seem to
> have an overlap in functionality between what you are proposing and existing
> virtual device configuration, but you are completely ignoring improving upon
> the existing creation methods.
> Whether dealing with a SmartNIC (which I used to describe a NIC with general
> purpose processors that could be running Linux and I think you do, too) or a
> regular NIC it seems like we should use devlink to create and configure a
> variety of devices these could be:
> 
> 1.  Number of PFs (there are cases where more than 1 PF per uplink port
>     is required for command/control of the SmartNIC or where a single
>     PCI b/d/f may have many uplink ports that need to be addressed
>     separately)
How is this related to subdev object? Do you mean user should be able to create PF subdevice object for host?
Currently decision to create subdev objects is done by the vendor driver.
But it is not prevented in future to extend it, if it make sense.
You need to describe the usecase and usual review process..

> 2.  Device-specific SR-IOV VFs visible to server 
> 3.  mdev devices that _may_
> have representers on the embedded cores of
>     the NIC
There is no mdev NIC devices on embedded cores NIC driver exist in kernel today.
I posted the series [3] to cover this part and we worked through kvm and netdev mailing list to get right, deterministic phys_port_name for it, right devlink port flavour etc necessary plumbing. This was done 2 months before posting [3].
So mdev's subdev management is covered and cover letter also talks about it to manage it in uniform way as PF/VF/mdev.

> 4.  Hardware VirtIO-net devices supported 
How this is blocked by subdev object introduction?
The way I envision from the recent discussion is,
There will be mdev-virtio or virtio bus where hardware virto devices will be exposed.
And therefore, their subdevices can be also controlled this way.
But my understanding of unpublished mellanox virtio-net device and yours could be different. :-)
So if you have link/ppt/rfc/patches I should read, please let me know, how you see hw virtio-net devices.

> 5.  Other non-network devices
> (storage given as the first example) 6.  ...
> 
I think its really beyond the scope of life cycling everything using this subdev object.
It is not a ultimate solution to all use cases. :-)

> We cannot get rid of the methods for creating hardware VFs via sysfs, but now
> that we are seeing lots of different flavors of devices that might be created we
> should start by making sure at a minimum we can create existing hardware
> devices (VFs) with this devlink interface and move from there.
Well, I have been thinking about it for last 6-7 months as there is locking issues present between num_vfs sysfs (device lock!), ip link (no lock) and devlink (devlink lock).
All unsyncronized with each other which leads to call traces; using some lock in multiple drivers will be a red bandage.
It is not the scope of this series.

> Is there a plan
> to use this interface to create subdevs that are VFs or just new subdev flavors?
Creating individual VFs is supported by PCI spec when I checked last time.
Do you know if that is supported now?
Last year autoprobe file was added to pci core to dynamically bind the VFs and bug was fixed recently by Alex Williamson.

Let's say if that is added in future, even than subdev object creation on event of dynamic VF creation holds good. Isn't it?
Whenever mdev series updated version gets accepted, it will dynamically create the subdev object anyway of mdev or some other flavour.

> I could start to get behind an interface like this if the patches showed that
> devlink would be the one place where device creation and control could be
> done for all types of subdevs, but that isn't clear that it is your direction based
> on the patches.
> 
> So just to make sure this is clear, what I'm proposing that devlink is used to
> create and configure all types of devices that it may be possible to create,
> configure, or address from a PF and the starting place should be standard,
> hardware VFs.  
I do not agree with 'creating all devices' part that it should be the starting point given that other method already exists for VFs, mdev (though its buggy for VFs).
Extending devlink to create mdev is nice, but not a must. Same goes for other flavours such as virtio/sf.
I also proposed that to create them via devlink in [4], but its not must.
This series for mdev [3] already fits in the existing devlink, eswitch, devlink port, devlink subdev object model.

> If that can be done right and we can address some of the issues
> with the current implementation (more than one hw addr is a _great_
> example) then adding new flavours for the server devices and and adding new
> flavors for SmartNIC-based devices should be easy and feel natural.
I also agree that setting more than one hw address should be supported, Jakub highlighted too of this need for Intel.
Command should be similar to 'ip addr add', and not 'devlink set hw_addr'.
Mlx5 doesn't currently supported so setting multiple hw_address should return right error code ENOSPEC or ENOSUPP by mlx5 driver.

Yuval, Can you please address this comment from Jakub and Andy to support multiple hw_address?

I just want to say that subdev is not so grand object.
As use case, features, ASIC evolves right attributes should be placed in right object.
(just like how devlink evolved from [1] to now.)
Aren't we reviewing the patches? Why should it become a unconstrained object to put everything in it?
This series doesn't ask to put everything in it.

As Yuval pointed, there are two/three near term plan after this series.
1. Grouping VFs subdevs so that host side min/max rates can be applied on the group (group of VFs)
2. Extending subdev flavour for mdev/virtio/sf depending on how series [3] take shape
3. msix configuration for VFs

[1] https://lwn.net/Articles/677967/
[2] https://www.spinics.net/lists/netdev/msg365532.html
[3] https://lore.kernel.org/linux-rdma/20191107160448.20962-1-parav@mellanox.com/
[4] https://lore.kernel.org/linux-rdma/AM0PR05MB4866444210721BC4EE775D27D17B0@AM0PR05MB4866.eurprd05.prod.outlook.com/



^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [PATCH net-next v2 00/10] devlink subdev
  2019-11-13  6:37       ` Parav Pandit
@ 2019-11-14 16:47         ` Yuval Avnery
  0 siblings, 0 replies; 23+ messages in thread
From: Yuval Avnery @ 2019-11-14 16:47 UTC (permalink / raw)
  To: Parav Pandit, Andy Gospodarek
  Cc: Jakub Kicinski, Jiri Pirko, netdev, Saeed Mahameed, leon, davem,
	shuah, Daniel Jurgens, michael.chan

Andy, 
Jakub,

Thank you for your comments

I will send another version with the following enhancements:

- Support multiple hw_addr.
- Describe more future use cases.
- Describe mdevs creation with devlink subdev. (this can be used for VFs, once PCIe spec supports)


> -----Original Message-----
> From: Parav Pandit
> Sent: Tuesday, November 12, 2019 10:37 PM
> To: Andy Gospodarek <andrew.gospodarek@broadcom.com>; Yuval Avnery
> <yuvalav@mellanox.com>
> Cc: Jakub Kicinski <jakub.kicinski@netronome.com>; Jiri Pirko
> <jiri@mellanox.com>; netdev@vger.kernel.org; Saeed Mahameed
> <saeedm@mellanox.com>; leon@kernel.org; davem@davemloft.net;
> shuah@kernel.org; Daniel Jurgens <danielj@mellanox.com>;
> michael.chan@broadcom.com
> Subject: RE: [PATCH net-next v2 00/10] devlink subdev
> 
> Hi Andy,
> 
> > From: Andy Gospodarek <andrew.gospodarek@broadcom.com>
> > Sent: Tuesday, November 12, 2019 8:56 AM On Mon, Nov 11, 2019 at
> > 06:46:52PM +0000, Yuval Avnery wrote:
> > >
> > > >
> [..]
> 
> >
> > It feels a bit 'unconstrained' to me as well.  As Jakub said you added
> > some documentation, but the direction of this long-term is not clear.
> 
> I am interested to know if the long-term direction of the devlink was clear
> enough in cover letter [1], that devlink device will change net namespace, it
> will have health reporters, dump regions, eswitch VF and PF ports,
> representor binding, "enable_roce" extensions.
> Because that vision is not clear in the cover letter [1] at least to me.
> 
> Resource division was very clearly described in (4) in cover letter [1] which
> you were fan of from 2016 in [2].
> MSIX or in general interrupts resource splitting among VFs is no different
> than any other resource division.
> 
> devlink eswitch port for anchoring interrupt attribute is certainly abuse of an
> 'esw port' object.
> Defining a subdev object, even though today it holds just 2 attributes
> (hw_address, and msix interrupt), it still sane object hierarchy.
> Setting host side mac address is not really job of eswitch ports.
> It may be on the same ASIC, but it should be put at right object.
> 
> > What seems to happen too often is that we skip creating better infra
> > for existing devices and create it only for the newest shiniest object.
> This infrastructure is for existing devices (non smartnic devices too).
> May be worth clarifying explicitly in cover letter.
> 
> > I'm not sure if that is part of what bothers Jakub, but it is one
> > thing that bothers me and why this feels incomplete.
> Let's work towards making it complete as much as for given functionality.
> 
> >
> > The thing that has been bouncing around in my head about this (and
> > until I was in front of a good text-based MUA I didn't dare respond)
> > is that we seem to have an overlap in functionality between what you
> > are proposing and existing virtual device configuration, but you are
> > completely ignoring improving upon the existing creation methods.
> > Whether dealing with a SmartNIC (which I used to describe a NIC with
> > general purpose processors that could be running Linux and I think you
> > do, too) or a regular NIC it seems like we should use devlink to
> > create and configure a variety of devices these could be:
> >
> > 1.  Number of PFs (there are cases where more than 1 PF per uplink port
> >     is required for command/control of the SmartNIC or where a single
> >     PCI b/d/f may have many uplink ports that need to be addressed
> >     separately)
> How is this related to subdev object? Do you mean user should be able to
> create PF subdevice object for host?
> Currently decision to create subdev objects is done by the vendor driver.
> But it is not prevented in future to extend it, if it make sense.
> You need to describe the usecase and usual review process..
> 
> > 2.  Device-specific SR-IOV VFs visible to server 3.  mdev devices that
> > _may_ have representers on the embedded cores of
> >     the NIC
> There is no mdev NIC devices on embedded cores NIC driver exist in kernel
> today.
> I posted the series [3] to cover this part and we worked through kvm and
> netdev mailing list to get right, deterministic phys_port_name for it, right
> devlink port flavour etc necessary plumbing. This was done 2 months before
> posting [3].
> So mdev's subdev management is covered and cover letter also talks about it
> to manage it in uniform way as PF/VF/mdev.
> 
> > 4.  Hardware VirtIO-net devices supported
> How this is blocked by subdev object introduction?
> The way I envision from the recent discussion is, There will be mdev-virtio or
> virtio bus where hardware virto devices will be exposed.
> And therefore, their subdevices can be also controlled this way.
> But my understanding of unpublished mellanox virtio-net device and yours
> could be different. :-) So if you have link/ppt/rfc/patches I should read,
> please let me know, how you see hw virtio-net devices.
> 
> > 5.  Other non-network devices
> > (storage given as the first example) 6.  ...
> >
> I think its really beyond the scope of life cycling everything using this subdev
> object.
> It is not a ultimate solution to all use cases. :-)
> 
> > We cannot get rid of the methods for creating hardware VFs via sysfs,
> > but now that we are seeing lots of different flavors of devices that
> > might be created we should start by making sure at a minimum we can
> > create existing hardware devices (VFs) with this devlink interface and move
> from there.
> Well, I have been thinking about it for last 6-7 months as there is locking
> issues present between num_vfs sysfs (device lock!), ip link (no lock) and
> devlink (devlink lock).
> All unsyncronized with each other which leads to call traces; using some lock
> in multiple drivers will be a red bandage.
> It is not the scope of this series.
> 
> > Is there a plan
> > to use this interface to create subdevs that are VFs or just new subdev
> flavors?
> Creating individual VFs is supported by PCI spec when I checked last time.
> Do you know if that is supported now?
> Last year autoprobe file was added to pci core to dynamically bind the VFs
> and bug was fixed recently by Alex Williamson.
> 
> Let's say if that is added in future, even than subdev object creation on event
> of dynamic VF creation holds good. Isn't it?
> Whenever mdev series updated version gets accepted, it will dynamically
> create the subdev object anyway of mdev or some other flavour.
> 
> > I could start to get behind an interface like this if the patches
> > showed that devlink would be the one place where device creation and
> > control could be done for all types of subdevs, but that isn't clear
> > that it is your direction based on the patches.
> >
> > So just to make sure this is clear, what I'm proposing that devlink is
> > used to create and configure all types of devices that it may be
> > possible to create, configure, or address from a PF and the starting
> > place should be standard, hardware VFs.
> I do not agree with 'creating all devices' part that it should be the starting
> point given that other method already exists for VFs, mdev (though its buggy
> for VFs).
> Extending devlink to create mdev is nice, but not a must. Same goes for other
> flavours such as virtio/sf.
> I also proposed that to create them via devlink in [4], but its not must.
> This series for mdev [3] already fits in the existing devlink, eswitch, devlink
> port, devlink subdev object model.
> 
> > If that can be done right and we can address some of the issues with
> > the current implementation (more than one hw addr is a _great_
> > example) then adding new flavours for the server devices and and
> > adding new flavors for SmartNIC-based devices should be easy and feel
> natural.
> I also agree that setting more than one hw address should be supported,
> Jakub highlighted too of this need for Intel.
> Command should be similar to 'ip addr add', and not 'devlink set hw_addr'.
> Mlx5 doesn't currently supported so setting multiple hw_address should
> return right error code ENOSPEC or ENOSUPP by mlx5 driver.
> 
> Yuval, Can you please address this comment from Jakub and Andy to support
> multiple hw_address?
> 
> I just want to say that subdev is not so grand object.
> As use case, features, ASIC evolves right attributes should be placed in right
> object.
> (just like how devlink evolved from [1] to now.) Aren't we reviewing the
> patches? Why should it become a unconstrained object to put everything in
> it?
> This series doesn't ask to put everything in it.
> 
> As Yuval pointed, there are two/three near term plan after this series.
> 1. Grouping VFs subdevs so that host side min/max rates can be applied on
> the group (group of VFs) 2. Extending subdev flavour for mdev/virtio/sf
> depending on how series [3] take shape 3. msix configuration for VFs
> 
> [1] https://lwn.net/Articles/677967/
> [2] https://www.spinics.net/lists/netdev/msg365532.html
> [3] https://lore.kernel.org/linux-rdma/20191107160448.20962-1-
> parav@mellanox.com/
> [4] https://lore.kernel.org/linux-
> rdma/AM0PR05MB4866444210721BC4EE775D27D17B0@AM0PR05MB4866.e
> urprd05.prod.outlook.com/
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2019-11-14 16:47 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-08 16:18 [PATCH net-next v2 00/10] devlink subdev Yuval Avnery
2019-11-08 16:18 ` [PATCH net-next v2 01/10] devlink: Introduce subdev Yuval Avnery
2019-11-08 16:18 ` [PATCH net-next v2 02/10] devlink: Add PCI attributes support for subdev Yuval Avnery
2019-11-08 16:18 ` [PATCH net-next v2 03/10] devlink: Add port with subdev register support Yuval Avnery
2019-11-08 16:18 ` [PATCH net-next v2 04/10] devlink: Support subdev HW address get Yuval Avnery
2019-11-08 18:00   ` Parav Pandit
2019-11-09 17:45     ` Yuval Avnery
2019-11-09 18:13       ` Parav Pandit
2019-11-10  8:12         ` Jiri Pirko
2019-11-08 16:18 ` [PATCH net-next v2 05/10] devlink: Support subdev HW address set Yuval Avnery
2019-11-08 16:18 ` [PATCH net-next v2 06/10] Documentation: Add devlink-subdev documentation Yuval Avnery
2019-11-08 16:18 ` [PATCH net-next v2 07/10] netdevsim: Add max_vfs to bus_dev Yuval Avnery
2019-11-08 16:18 ` [PATCH net-next v2 08/10] netdevsim: Add devlink subdev creation Yuval Avnery
2019-11-08 16:18 ` [PATCH net-next v2 09/10] netdevsim: Add devlink subdev sefltest for netdevsim Yuval Avnery
2019-11-08 16:18 ` [PATCH net-next v2 10/10] net/mlx5e: Add support for devlink subdev and subdev hw_addr set/show Yuval Avnery
2019-11-11 18:00 ` [PATCH net-next v2 00/10] devlink subdev Jakub Kicinski
2019-11-11 18:46   ` Yuval Avnery
2019-11-12 14:55     ` Andy Gospodarek
2019-11-12 18:35       ` Yuval Avnery
2019-11-12 22:19         ` Jakub Kicinski
2019-11-12 23:32           ` Yuval Avnery
2019-11-13  6:37       ` Parav Pandit
2019-11-14 16:47         ` Yuval Avnery

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).