netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net-next 0/9] devlink vdev
@ 2019-10-22 17:43 Yuval Avnery
  2019-10-22 17:43 ` [PATCH net-next 1/9] devlink: Introduce vdev Yuval Avnery
                   ` (10 more replies)
  0 siblings, 11 replies; 24+ messages in thread
From: Yuval Avnery @ 2019-10-22 17:43 UTC (permalink / raw)
  To: netdev; +Cc: jiri, saeedm, leon, davem, jakub.kicinski, shuah, Yuval Avnery

This patchset introduces devlink vdev.

Currently, legacy tools do not provide a comprehensive solution that can
be used in both SmartNic and non-SmartNic mode.
Vdev represents a device that exists on the ASIC but is not necessarily
visible to the kernel.

Using devlink ports is not suitable because:

1. Those devices aren't necessarily network devices (such as NVMe devices)
   and doesn’t have E-switch representation. Therefore, there is need for
   more generic representation of PCI VF.
2. Some attributes are not necessarily pure port attributes
   (number of MSIX vectors)
3. It creates a confusing devlink topology, with multiple port flavours
   and indices.

Vdev will be created along with flavour and attributes.
Some network vdevs may be linked with a devlink port.

This is also aimed to replace "ip link vf" commands as they are strongly
linked to the PCI topology and allow access only to enabled VFs.
Even though current patchset and example is limited to MAC address
of the VF, this interface will allow to manage PF, VF, mdev in
SmartNic and non SmartNic modes, in unified way for networking and
non-networking devices via devlink instance.

Example:

A privileged user wants to configure a VF's hw_addr, before the VF is
enabled.

$ devlink vdev set pci/0000:03:00.0/1 hw_addr 10:22:33:44:55:66

$ devlink vdev show pci/0000:03:00.0/1
pci/0000:03:00.0/1: flavour pcivf pf 0 vf 0 port_index 1 hw_addr 10:22:33:44:55:66

$ devlink vdev show pci/0000:03:00.0/1 -jp
{
    "vdev": {
        "pci/0000:03:00.0/1": {
            "flavour": "pcivf",
            "pf": 0,
            "vf": 0,
            "port_index": 1,
            "hw_addr": "10:22:33:44:55:66"
        }
    }
}

Patches 1-5 adds devlink support for vdev.
Patches 6-7 adds netdevsim implementation and test.
Patch 9 adds mlx5 vdev creation and hw_addr get/set.

Yuval Avnery (9):
  devlink: Introduce vdev
  devlink: Add PCI attributes support for vdev
  devlink: Add port with vdev register support
  devlink: Support vdev HW address get
  devlink: Support vdev HW address set
  netdevsim: Add max_vfs to bus_dev
  netdevsim: Add devlink vdev creation
  netdevsim: Add devlink vdev sefltest for netdevsim
  net/mlx5e: Add support for devlink vdev and vdev hw_addr set/show

 .../net/ethernet/mellanox/mlx5/core/devlink.c |  86 ++++
 .../net/ethernet/mellanox/mlx5/core/devlink.h |   5 +
 .../net/ethernet/mellanox/mlx5/core/en_main.c |   6 +-
 .../net/ethernet/mellanox/mlx5/core/en_rep.c  |   9 +-
 .../net/ethernet/mellanox/mlx5/core/eswitch.c |  19 +-
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |   7 +-
 .../mellanox/mlx5/core/eswitch_offloads.c     |   8 +
 drivers/net/netdevsim/Makefile                |   2 +-
 drivers/net/netdevsim/bus.c                   |  39 +-
 drivers/net/netdevsim/dev.c                   |   9 +-
 drivers/net/netdevsim/netdevsim.h             |  11 +
 drivers/net/netdevsim/vdev.c                  |  96 +++++
 include/net/devlink.h                         |  44 ++
 include/uapi/linux/devlink.h                  |  16 +
 net/core/devlink.c                            | 393 +++++++++++++++++-
 .../drivers/net/netdevsim/devlink.sh          |  55 ++-
 16 files changed, 777 insertions(+), 28 deletions(-)
 create mode 100644 drivers/net/netdevsim/vdev.c

-- 
2.17.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH net-next 1/9] devlink: Introduce vdev
  2019-10-22 17:43 [PATCH net-next 0/9] devlink vdev Yuval Avnery
@ 2019-10-22 17:43 ` Yuval Avnery
  2019-10-22 17:43 ` [PATCH net-next 2/9] devlink: Add PCI attributes support for vdev Yuval Avnery
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Yuval Avnery @ 2019-10-22 17:43 UTC (permalink / raw)
  To: netdev; +Cc: jiri, saeedm, leon, davem, jakub.kicinski, shuah, Yuval Avnery

Vdev is a device that exists on the ASIC which is not necessarily visible
to the kernel.
Just like devlink port, the driver should set index and attributes to
the vdev.

Example:
A VF represented by vdev pci/0000:03:00.0/1, before it is enabled.
The PF vdev is represented by pci/0000:03:00.0/0.

$ devlink dev show
pci/0000:03:00.0

$ devlink vdev show
pci/0000:03:00.0/0
pci/0000:03:00.0/1

Signed-off-by: Yuval Avnery <yuvalav@mellanox.com>
Suggested-by: Parav Pandit <parav@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/devlink.h        |  17 +++
 include/uapi/linux/devlink.h |   7 +
 net/core/devlink.c           | 255 +++++++++++++++++++++++++++++++++++
 3 files changed, 279 insertions(+)

diff --git a/include/net/devlink.h b/include/net/devlink.h
index 6bf3b9e0595a..58fe8c339368 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -23,6 +23,7 @@ struct devlink_ops;
 struct devlink {
 	struct list_head list;
 	struct list_head port_list;
+	struct list_head vdev_list;
 	struct list_head sb_list;
 	struct list_head dpipe_table_list;
 	struct list_head resource_list;
@@ -73,6 +74,8 @@ struct devlink_port_attrs {
 	};
 };
 
+struct devlink_vdev;
+
 struct devlink_port {
 	struct list_head list;
 	struct list_head param_list;
@@ -89,6 +92,9 @@ struct devlink_port {
 	struct delayed_work type_warn_dw;
 };
 
+struct devlink_vdev_attrs {
+};
+
 struct devlink_sb_pool_info {
 	enum devlink_sb_pool_type pool_type;
 	u32 size;
@@ -743,6 +749,9 @@ struct devlink_ops {
 			       const struct devlink_trap_group *group);
 };
 
+struct devlink_vdev_ops {
+};
+
 static inline void *devlink_priv(struct devlink *devlink)
 {
 	BUG_ON(!devlink);
@@ -802,6 +811,14 @@ void devlink_port_attrs_pci_vf_set(struct devlink_port *devlink_port,
 				   const unsigned char *switch_id,
 				   unsigned char switch_id_len,
 				   u16 pf, u16 vf);
+struct devlink_vdev *devlink_vdev_create(struct devlink *devlink,
+					 unsigned int vdev_index,
+					 const struct devlink_vdev_ops *ops,
+					 const struct devlink_vdev_attrs *attrs,
+					 void *priv);
+void devlink_vdev_destroy(struct devlink_vdev *devlink_vdev);
+struct devlink *devlink_vdev_devlink(struct devlink_vdev *devlink_vdev);
+void *devlink_vdev_priv(struct devlink_vdev *devlink_vdev);
 int devlink_sb_register(struct devlink *devlink, unsigned int sb_index,
 			u32 size, u16 ingress_pools_count,
 			u16 egress_pools_count, u16 ingress_tc_count,
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index b558ea88b766..70e2816331c5 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -117,6 +117,11 @@ enum devlink_command {
 	DEVLINK_CMD_TRAP_GROUP_NEW,
 	DEVLINK_CMD_TRAP_GROUP_DEL,
 
+	DEVLINK_CMD_VDEV_GET,		/* can dump */
+	DEVLINK_CMD_VDEV_SET,
+	DEVLINK_CMD_VDEV_NEW,
+	DEVLINK_CMD_VDEV_DEL,
+
 	/* add new commands above here */
 	__DEVLINK_CMD_MAX,
 	DEVLINK_CMD_MAX = __DEVLINK_CMD_MAX - 1
@@ -425,6 +430,8 @@ enum devlink_attr {
 	DEVLINK_ATTR_NETNS_PID,			/* u32 */
 	DEVLINK_ATTR_NETNS_ID,			/* u32 */
 
+	DEVLINK_ATTR_VDEV_INDEX,		/* u32 */
+
 	/* add new attributes above here, update the policy in devlink.c */
 
 	__DEVLINK_ATTR_MAX,
diff --git a/net/core/devlink.c b/net/core/devlink.c
index 97e9a2246929..3d6099ec139e 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -31,6 +31,15 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/devlink.h>
 
+struct devlink_vdev {
+	struct list_head list;
+	struct devlink *devlink;
+	unsigned int index;
+	const struct devlink_vdev_ops *ops;
+	struct devlink_vdev_attrs attrs;
+	void *priv;
+};
+
 static struct devlink_dpipe_field devlink_dpipe_fields_ethernet[] = {
 	{
 		.name = "destination mac",
@@ -183,6 +192,46 @@ static struct devlink_port *devlink_port_get_from_info(struct devlink *devlink,
 	return devlink_port_get_from_attrs(devlink, info->attrs);
 }
 
+static struct devlink_vdev *devlink_vdev_get_by_index(struct devlink *devlink,
+						      unsigned int vdev_index)
+{
+	struct devlink_vdev *devlink_vdev;
+
+	list_for_each_entry(devlink_vdev, &devlink->vdev_list, list) {
+		if (devlink_vdev->index == vdev_index)
+			return devlink_vdev;
+	}
+	return NULL;
+}
+
+static bool devlink_vdev_index_exists(struct devlink *devlink,
+				      unsigned int vdev_index)
+{
+	return devlink_vdev_get_by_index(devlink, vdev_index);
+}
+
+static struct devlink_vdev *devlink_vdev_get_from_attrs(struct devlink *devlink,
+							struct nlattr **attrs)
+{
+	struct devlink_vdev *devlink_vdev;
+	u32 vdev_index;
+
+	if (!attrs[DEVLINK_ATTR_VDEV_INDEX])
+		return ERR_PTR(-EINVAL);
+
+	vdev_index = nla_get_u32(attrs[DEVLINK_ATTR_VDEV_INDEX]);
+	devlink_vdev = devlink_vdev_get_by_index(devlink, vdev_index);
+	if (!devlink_vdev)
+		return ERR_PTR(-ENODEV);
+	return devlink_vdev;
+}
+
+static struct devlink_vdev *devlink_vdev_get_from_info(struct devlink *devlink,
+						       struct genl_info *info)
+{
+	return devlink_vdev_get_from_attrs(devlink, info->attrs);
+}
+
 struct devlink_sb {
 	struct list_head list;
 	unsigned int index;
@@ -386,6 +435,7 @@ devlink_region_snapshot_get_by_id(struct devlink_region *region, u32 id)
 #define DEVLINK_NL_FLAG_NEED_DEVLINK	BIT(0)
 #define DEVLINK_NL_FLAG_NEED_PORT	BIT(1)
 #define DEVLINK_NL_FLAG_NEED_SB		BIT(2)
+#define DEVLINK_NL_FLAG_NEED_VDEV       BIT(3)
 
 /* The per devlink instance lock is taken by default in the pre-doit
  * operation, yet several commands do not require this. The global
@@ -418,6 +468,15 @@ static int devlink_nl_pre_doit(const struct genl_ops *ops,
 			goto unlock;
 		}
 		info->user_ptr[0] = devlink_port;
+	} else if (ops->internal_flags & DEVLINK_NL_FLAG_NEED_VDEV) {
+		struct devlink_vdev *devlink_vdev;
+
+		devlink_vdev = devlink_vdev_get_from_info(devlink, info);
+		if (IS_ERR(devlink_vdev)) {
+			err = PTR_ERR(devlink_vdev);
+			goto unlock;
+		}
+		info->user_ptr[0] = devlink_vdev;
 	}
 	if (ops->internal_flags & DEVLINK_NL_FLAG_NEED_SB) {
 		struct devlink_sb *devlink_sb;
@@ -644,6 +703,53 @@ static void devlink_port_notify(struct devlink_port *devlink_port,
 				msg, 0, DEVLINK_MCGRP_CONFIG, GFP_KERNEL);
 }
 
+static int devlink_nl_vdev_fill(struct sk_buff *msg, struct devlink *devlink,
+				struct devlink_vdev *devlink_vdev,
+				enum devlink_command cmd, u32 vdevid,
+				u32 seq, int flags)
+{
+	void *hdr;
+
+	hdr = genlmsg_put(msg, vdevid, seq, &devlink_nl_family, flags, cmd);
+	if (!hdr)
+		return -EMSGSIZE;
+
+	if (devlink_nl_put_handle(msg, devlink))
+		goto nla_put_failure;
+	if (nla_put_u32(msg, DEVLINK_ATTR_VDEV_INDEX, devlink_vdev->index))
+		goto nla_put_failure;
+
+	genlmsg_end(msg, hdr);
+	return 0;
+
+nla_put_failure:
+	genlmsg_cancel(msg, hdr);
+	return -EMSGSIZE;
+}
+
+static void devlink_vdev_notify(struct devlink_vdev *devlink_vdev,
+				enum devlink_command cmd)
+{
+	struct devlink *devlink = devlink_vdev->devlink;
+	struct sk_buff *msg;
+	int err;
+
+	WARN_ON(cmd != DEVLINK_CMD_VDEV_NEW && cmd != DEVLINK_CMD_VDEV_DEL);
+
+	msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!msg)
+		return;
+
+	err = devlink_nl_vdev_fill(msg, devlink, devlink_vdev, cmd, 0, 0, 0);
+	if (err) {
+		nlmsg_free(msg);
+		return;
+	}
+
+	genlmsg_multicast_netns(&devlink_nl_family, devlink_net(devlink),
+				msg, 0, DEVLINK_MCGRP_CONFIG, GFP_KERNEL);
+}
+
 static int devlink_nl_cmd_get_doit(struct sk_buff *skb, struct genl_info *info)
 {
 	struct devlink *devlink = info->user_ptr[0];
@@ -844,6 +950,74 @@ static int devlink_nl_cmd_port_unsplit_doit(struct sk_buff *skb,
 	return devlink_port_unsplit(devlink, port_index, info->extack);
 }
 
+static int devlink_nl_cmd_vdev_get_doit(struct sk_buff *skb,
+					struct genl_info *info)
+{
+	struct devlink_vdev *devlink_vdev = info->user_ptr[0];
+	struct devlink *devlink = devlink_vdev->devlink;
+	struct sk_buff *msg;
+	int err;
+
+	msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!msg)
+		return -ENOMEM;
+
+	err = devlink_nl_vdev_fill(msg, devlink, devlink_vdev,
+				   DEVLINK_CMD_VDEV_NEW,
+				   info->snd_portid, info->snd_seq, 0);
+	if (err) {
+		nlmsg_free(msg);
+		return err;
+	}
+
+	return genlmsg_reply(msg, info);
+}
+
+static int devlink_nl_cmd_vdev_get_dumpit(struct sk_buff *msg,
+					  struct netlink_callback *cb)
+{
+	struct devlink_vdev *devlink_vdev;
+	struct devlink *devlink;
+	int start = cb->args[0];
+	int idx = 0;
+	int err;
+
+	mutex_lock(&devlink_mutex);
+	list_for_each_entry(devlink, &devlink_list, list) {
+		if (!net_eq(devlink_net(devlink), sock_net(msg->sk)))
+			continue;
+		mutex_lock(&devlink->lock);
+		list_for_each_entry(devlink_vdev, &devlink->vdev_list, list) {
+			if (idx < start) {
+				idx++;
+				continue;
+			}
+			err = devlink_nl_vdev_fill(msg, devlink, devlink_vdev,
+						   DEVLINK_CMD_NEW,
+						   NETLINK_CB(cb->skb).portid,
+						   cb->nlh->nlmsg_seq,
+						   NLM_F_MULTI);
+			if (err) {
+				mutex_unlock(&devlink->lock);
+				goto out;
+			}
+			idx++;
+		}
+		mutex_unlock(&devlink->lock);
+	}
+out:
+	mutex_unlock(&devlink_mutex);
+
+	cb->args[0] = idx;
+	return msg->len;
+}
+
+static int devlink_nl_cmd_vdev_set_doit(struct sk_buff *skb,
+					struct genl_info *info)
+{
+	return 0;
+}
+
 static int devlink_nl_sb_fill(struct sk_buff *msg, struct devlink *devlink,
 			      struct devlink_sb *devlink_sb,
 			      enum devlink_command cmd, u32 portid,
@@ -5902,6 +6076,7 @@ static const struct nla_policy devlink_nl_policy[DEVLINK_ATTR_MAX + 1] = {
 	[DEVLINK_ATTR_NETNS_PID] = { .type = NLA_U32 },
 	[DEVLINK_ATTR_NETNS_FD] = { .type = NLA_U32 },
 	[DEVLINK_ATTR_NETNS_ID] = { .type = NLA_U32 },
+	[DEVLINK_ATTR_VDEV_INDEX] = { .type = NLA_U32 },
 };
 
 static const struct genl_ops devlink_nl_ops[] = {
@@ -5928,6 +6103,19 @@ static const struct genl_ops devlink_nl_ops[] = {
 		.flags = GENL_ADMIN_PERM,
 		.internal_flags = DEVLINK_NL_FLAG_NEED_PORT,
 	},
+	{
+		.cmd = DEVLINK_CMD_VDEV_GET,
+		.doit = devlink_nl_cmd_vdev_get_doit,
+		.dumpit = devlink_nl_cmd_vdev_get_dumpit,
+		.internal_flags = DEVLINK_NL_FLAG_NEED_VDEV,
+		/* can be retrieved by unprivileged users */
+	},
+	{
+		.cmd = DEVLINK_CMD_VDEV_SET,
+		.doit = devlink_nl_cmd_vdev_set_doit,
+		.flags = GENL_ADMIN_PERM,
+		.internal_flags = DEVLINK_NL_FLAG_NEED_VDEV,
+	},
 	{
 		.cmd = DEVLINK_CMD_PORT_SPLIT,
 		.validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
@@ -6268,6 +6456,7 @@ struct devlink *devlink_alloc(const struct devlink_ops *ops, size_t priv_size)
 	devlink->ops = ops;
 	__devlink_net_set(devlink, &init_net);
 	INIT_LIST_HEAD(&devlink->port_list);
+	INIT_LIST_HEAD(&devlink->vdev_list);
 	INIT_LIST_HEAD(&devlink->sb_list);
 	INIT_LIST_HEAD_RCU(&devlink->dpipe_table_list);
 	INIT_LIST_HEAD(&devlink->resource_list);
@@ -6332,6 +6521,7 @@ void devlink_free(struct devlink *devlink)
 	WARN_ON(!list_empty(&devlink->dpipe_table_list));
 	WARN_ON(!list_empty(&devlink->sb_list));
 	WARN_ON(!list_empty(&devlink->port_list));
+	WARN_ON(!list_empty(&devlink->vdev_list));
 
 	kfree(devlink);
 }
@@ -6657,6 +6847,71 @@ static int __devlink_port_phys_port_name_get(struct devlink_port *devlink_port,
 	return 0;
 }
 
+void *devlink_vdev_priv(struct devlink_vdev *devlink_vdev)
+{
+	return devlink_vdev->priv;
+}
+EXPORT_SYMBOL_GPL(devlink_vdev_priv);
+
+/**
+ *	devlink_vdev_create - create devlink vdev
+ *
+ *	@devlink: devlink
+ *	@vdev_index: driver-specific numerical identifier of the vdev
+ *	@ops: vdev specific ops
+ *	@attrs: vdev specific attributes
+ *	@priv: driver private data
+ *
+ *	Create devlink vdev with provided vdev index. User can use
+ *	any indexing, even hw-related one.
+ */
+struct devlink_vdev *devlink_vdev_create(struct devlink *devlink,
+					 unsigned int vdev_index,
+					 const struct devlink_vdev_ops *ops,
+					 const struct devlink_vdev_attrs *attrs,
+					 void *priv)
+{
+	struct devlink_vdev *devlink_vdev;
+
+	devlink_vdev = kzalloc(sizeof(*devlink_vdev), GFP_KERNEL);
+	if (!devlink_vdev)
+		return ERR_PTR(-ENOMEM);
+
+	mutex_lock(&devlink->lock);
+	if (devlink_vdev_index_exists(devlink, vdev_index)) {
+		mutex_unlock(&devlink->lock);
+		kfree(devlink_vdev);
+		return ERR_PTR(-EEXIST);
+	}
+	devlink_vdev->devlink = devlink;
+	devlink_vdev->priv = priv;
+	devlink_vdev->index = vdev_index;
+	devlink_vdev->ops = ops;
+	devlink_vdev->attrs = *attrs;
+	list_add_tail(&devlink_vdev->list, &devlink->vdev_list);
+	mutex_unlock(&devlink->lock);
+	devlink_vdev_notify(devlink_vdev, DEVLINK_CMD_VDEV_NEW);
+	return devlink_vdev;
+}
+EXPORT_SYMBOL_GPL(devlink_vdev_create);
+
+/**
+ *	devlink_vdev_destroy - destroy devlink vdev
+ *
+ *	@devlink_vdev: devlink vdev
+ */
+void devlink_vdev_destroy(struct devlink_vdev *devlink_vdev)
+{
+	struct devlink *devlink = devlink_vdev->devlink;
+
+	devlink_vdev_notify(devlink_vdev, DEVLINK_CMD_VDEV_DEL);
+	mutex_lock(&devlink->lock);
+	list_del(&devlink_vdev->list);
+	mutex_unlock(&devlink->lock);
+	kfree(devlink_vdev);
+}
+EXPORT_SYMBOL_GPL(devlink_vdev_destroy);
+
 int devlink_sb_register(struct devlink *devlink, unsigned int sb_index,
 			u32 size, u16 ingress_pools_count,
 			u16 egress_pools_count, u16 ingress_tc_count,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH net-next 2/9] devlink: Add PCI attributes support for vdev
  2019-10-22 17:43 [PATCH net-next 0/9] devlink vdev Yuval Avnery
  2019-10-22 17:43 ` [PATCH net-next 1/9] devlink: Introduce vdev Yuval Avnery
@ 2019-10-22 17:43 ` Yuval Avnery
  2019-10-22 17:43 ` [PATCH net-next 3/9] devlink: Add port with vdev register support Yuval Avnery
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Yuval Avnery @ 2019-10-22 17:43 UTC (permalink / raw)
  To: netdev; +Cc: jiri, saeedm, leon, davem, jakub.kicinski, shuah, Yuval Avnery

When vdev represents a PCI device it should reflect the PCI device
type (PF/VF) and it's index.

Example:

$ devlink vdev show pci/0000:03:00.0/1 -jp
{
    "vdev": {
        "pci/0000:03:00.0/1": {
            "flavour": "pcivf",
            "pf": 0,
            "vf": 0
        }
    }
}

$ devlink vdev show pci/0000:03:00.0/1
pci/0000:03:00.0/1: flavour pcivf pf 0 vf 0

Signed-off-by: Yuval Avnery <yuvalav@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/devlink.h        | 17 ++++++++++++
 include/uapi/linux/devlink.h |  8 ++++++
 net/core/devlink.c           | 51 ++++++++++++++++++++++++++++++++++++
 3 files changed, 76 insertions(+)

diff --git a/include/net/devlink.h b/include/net/devlink.h
index 58fe8c339368..ab7e316ea758 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -92,7 +92,21 @@ struct devlink_port {
 	struct delayed_work type_warn_dw;
 };
 
+struct devlink_vdev_pci_pf_attrs {
+	u16 pf;	/* Associated PCI PF for this vdev. */
+};
+
+struct devlink_vdev_pci_vf_attrs {
+	u16 pf;	/* Associated PCI PF for this vdev. */
+	u16 vf;	/* Associated PCI VF for of the PCI PF for this vdev. */
+};
+
 struct devlink_vdev_attrs {
+	enum devlink_vdev_flavour flavour;
+	union {
+		struct devlink_vdev_pci_pf_attrs pci_pf;
+		struct devlink_vdev_pci_vf_attrs pci_vf;
+	};
 };
 
 struct devlink_sb_pool_info {
@@ -819,6 +833,9 @@ struct devlink_vdev *devlink_vdev_create(struct devlink *devlink,
 void devlink_vdev_destroy(struct devlink_vdev *devlink_vdev);
 struct devlink *devlink_vdev_devlink(struct devlink_vdev *devlink_vdev);
 void *devlink_vdev_priv(struct devlink_vdev *devlink_vdev);
+void devlink_vdev_attrs_pci_pf_init(struct devlink_vdev_attrs *attrs, u16 pf);
+void devlink_vdev_attrs_pci_vf_init(struct devlink_vdev_attrs *attrs, u16 pf,
+				    u16 vf);
 int devlink_sb_register(struct devlink *devlink, unsigned int sb_index,
 			u32 size, u16 ingress_pools_count,
 			u16 egress_pools_count, u16 ingress_tc_count,
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index 70e2816331c5..161bad54d528 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -194,6 +194,11 @@ enum devlink_port_flavour {
 				      */
 };
 
+enum devlink_vdev_flavour {
+	DEVLINK_VDEV_FLAVOUR_PCI_PF,
+	DEVLINK_VDEV_FLAVOUR_PCI_VF,
+};
+
 enum devlink_param_cmode {
 	DEVLINK_PARAM_CMODE_RUNTIME,
 	DEVLINK_PARAM_CMODE_DRIVERINIT,
@@ -431,6 +436,9 @@ enum devlink_attr {
 	DEVLINK_ATTR_NETNS_ID,			/* u32 */
 
 	DEVLINK_ATTR_VDEV_INDEX,		/* u32 */
+	DEVLINK_ATTR_VDEV_FLAVOUR,		/* u16 */
+	DEVLINK_ATTR_VDEV_PF_INDEX,		/* u32 */
+	DEVLINK_ATTR_VDEV_VF_INDEX,		/* u32 */
 
 	/* add new attributes above here, update the policy in devlink.c */
 
diff --git a/net/core/devlink.c b/net/core/devlink.c
index 3d6099ec139e..0a201a373da9 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -708,6 +708,7 @@ static int devlink_nl_vdev_fill(struct sk_buff *msg, struct devlink *devlink,
 				enum devlink_command cmd, u32 vdevid,
 				u32 seq, int flags)
 {
+	struct devlink_vdev_attrs *attrs = &devlink_vdev->attrs;
 	void *hdr;
 
 	hdr = genlmsg_put(msg, vdevid, seq, &devlink_nl_family, flags, cmd);
@@ -719,6 +720,24 @@ static int devlink_nl_vdev_fill(struct sk_buff *msg, struct devlink *devlink,
 	if (nla_put_u32(msg, DEVLINK_ATTR_VDEV_INDEX, devlink_vdev->index))
 		goto nla_put_failure;
 
+	if (nla_put_u16(msg, DEVLINK_ATTR_VDEV_FLAVOUR, attrs->flavour))
+		goto nla_put_failure;
+	switch (attrs->flavour) {
+	case DEVLINK_VDEV_FLAVOUR_PCI_PF:
+		if (nla_put_u32(msg, DEVLINK_ATTR_VDEV_PF_INDEX,
+				attrs->pci_pf.pf))
+			goto nla_put_failure;
+		break;
+	case DEVLINK_VDEV_FLAVOUR_PCI_VF:
+		if (nla_put_u32(msg, DEVLINK_ATTR_VDEV_PF_INDEX,
+				attrs->pci_vf.pf))
+			goto nla_put_failure;
+		if (nla_put_u32(msg, DEVLINK_ATTR_VDEV_VF_INDEX,
+				attrs->pci_vf.vf))
+			goto nla_put_failure;
+		break;
+	}
+
 	genlmsg_end(msg, hdr);
 	return 0;
 
@@ -6077,6 +6096,9 @@ static const struct nla_policy devlink_nl_policy[DEVLINK_ATTR_MAX + 1] = {
 	[DEVLINK_ATTR_NETNS_FD] = { .type = NLA_U32 },
 	[DEVLINK_ATTR_NETNS_ID] = { .type = NLA_U32 },
 	[DEVLINK_ATTR_VDEV_INDEX] = { .type = NLA_U32 },
+	[DEVLINK_ATTR_VDEV_FLAVOUR] = { .type = NLA_U16 },
+	[DEVLINK_ATTR_VDEV_PF_INDEX] = { .type = NLA_U32 },
+	[DEVLINK_ATTR_VDEV_VF_INDEX] = { .type = NLA_U32 },
 };
 
 static const struct genl_ops devlink_nl_ops[] = {
@@ -6912,6 +6934,35 @@ void devlink_vdev_destroy(struct devlink_vdev *devlink_vdev)
 }
 EXPORT_SYMBOL_GPL(devlink_vdev_destroy);
 
+/**
+ *	devlink_vdev_attrs_pci_pf_int - Init PCI PF vdev attributes
+ *
+ *	@devlink_vdev_attr: devlink vdev attributes
+ *	@pf: associated PF index for the devlink vdev instance
+ */
+void devlink_vdev_attrs_pci_pf_init(struct devlink_vdev_attrs *attrs, u16 pf)
+{
+	attrs->flavour = DEVLINK_VDEV_FLAVOUR_PCI_PF;
+	attrs->pci_pf.pf = pf;
+}
+EXPORT_SYMBOL_GPL(devlink_vdev_attrs_pci_pf_init);
+
+/**
+ *	devlink_vdev_attrs_pci_vf_init - Init PCI VF vdev attributes
+ *
+ *	@devlink_vdev: devlink vdev
+ *	@pf: associated PF index for the devlink vdev instance
+ *	@vf: associated VF index for the devlink vdev instance
+ */
+void devlink_vdev_attrs_pci_vf_init(struct devlink_vdev_attrs *attrs, u16 pf,
+				    u16 vf)
+{
+	attrs->flavour = DEVLINK_VDEV_FLAVOUR_PCI_VF;
+	attrs->pci_vf.pf = pf;
+	attrs->pci_vf.vf = vf;
+}
+EXPORT_SYMBOL_GPL(devlink_vdev_attrs_pci_vf_init);
+
 int devlink_sb_register(struct devlink *devlink, unsigned int sb_index,
 			u32 size, u16 ingress_pools_count,
 			u16 egress_pools_count, u16 ingress_tc_count,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH net-next 3/9] devlink: Add port with vdev register support
  2019-10-22 17:43 [PATCH net-next 0/9] devlink vdev Yuval Avnery
  2019-10-22 17:43 ` [PATCH net-next 1/9] devlink: Introduce vdev Yuval Avnery
  2019-10-22 17:43 ` [PATCH net-next 2/9] devlink: Add PCI attributes support for vdev Yuval Avnery
@ 2019-10-22 17:43 ` Yuval Avnery
  2019-10-22 17:43 ` [PATCH net-next 4/9] devlink: Support vdev HW address get Yuval Avnery
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Yuval Avnery @ 2019-10-22 17:43 UTC (permalink / raw)
  To: netdev; +Cc: jiri, saeedm, leon, davem, jakub.kicinski, shuah, Yuval Avnery

A vdev may represent a network device, so it may be linked to
a devlink port.
Added devlink_port_register_with_vdev to allow vdev-port linkage.

Example:

$ devlink vdev show pci/0000:03:00.0/1 -jp
{
    "vdev": {
        "pci/0000:03:00.0/1": {
            "flavour": "pcivf",
            "pf": 0,
            "vf": 0,
            "port_index": 1
        }
    }
}
$ devlink vdev show pci/0000:03:00.0/1
pci/0000:03:00.0/1: flavour pcivf pf 0 vf 0 port_index 1

$ devlink port show pci/0000:03:00.0/1
pci/0000:03:00.0/1: type eth netdev ens2f0_0 flavour pcivf pfnum 0 vfnum 0 vdev_index 1

Signed-off-by: Yuval Avnery <yuvalav@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/devlink.h |  5 ++++
 net/core/devlink.c    | 53 +++++++++++++++++++++++++++++++++++++++----
 2 files changed, 54 insertions(+), 4 deletions(-)

diff --git a/include/net/devlink.h b/include/net/devlink.h
index ab7e316ea758..138d33275963 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -90,6 +90,7 @@ struct devlink_port {
 	void *type_dev;
 	struct devlink_port_attrs attrs;
 	struct delayed_work type_warn_dw;
+	struct devlink_vdev *devlink_vdev; /* linked vdev */
 };
 
 struct devlink_vdev_pci_pf_attrs {
@@ -806,6 +807,10 @@ void devlink_free(struct devlink *devlink);
 int devlink_port_register(struct devlink *devlink,
 			  struct devlink_port *devlink_port,
 			  unsigned int port_index);
+int devlink_port_register_with_vdev(struct devlink *devlink,
+				    struct devlink_port *devlink_port,
+				    unsigned int port_index,
+				    struct devlink_vdev *devlink_vdev);
 void devlink_port_unregister(struct devlink_port *devlink_port);
 void devlink_port_type_eth_set(struct devlink_port *devlink_port,
 			       struct net_device *netdev);
diff --git a/net/core/devlink.c b/net/core/devlink.c
index 0a201a373da9..2fffbd37e710 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -37,6 +37,7 @@ struct devlink_vdev {
 	unsigned int index;
 	const struct devlink_vdev_ops *ops;
 	struct devlink_vdev_attrs attrs;
+	struct devlink_port *devlink_port; /* linked port */
 	void *priv;
 };
 
@@ -664,6 +665,10 @@ static int devlink_nl_port_fill(struct sk_buff *msg, struct devlink *devlink,
 			goto nla_put_failure_type_locked;
 	}
 	spin_unlock_bh(&devlink_port->type_lock);
+	if (devlink_port->devlink_vdev)
+		if (nla_put_u32(msg, DEVLINK_ATTR_VDEV_INDEX,
+				devlink_port->devlink_vdev->index))
+			goto nla_put_failure;
 	if (devlink_nl_port_attrs_put(msg, devlink_port))
 		goto nla_put_failure;
 
@@ -738,6 +743,11 @@ static int devlink_nl_vdev_fill(struct sk_buff *msg, struct devlink *devlink,
 		break;
 	}
 
+	if (devlink_vdev->devlink_port)
+		if (nla_put_u32(msg, DEVLINK_ATTR_PORT_INDEX,
+				devlink_vdev->devlink_port->index))
+			goto nla_put_failure;
+
 	genlmsg_end(msg, hdr);
 	return 0;
 
@@ -6582,11 +6592,12 @@ static void devlink_port_type_warn_cancel(struct devlink_port *devlink_port)
 }
 
 /**
- *	devlink_port_register - Register devlink port
+ *	devlink_port_register_with_vdev - Register devlink port
  *
  *	@devlink: devlink
  *	@devlink_port: devlink port
  *	@port_index: driver-specific numerical identifier of the port
+ *	@devlink_vdev: vdev to link with the port
  *
  *	Register devlink port with provided port index. User can use
  *	any indexing, even hw-related one. devlink_port structure
@@ -6594,9 +6605,10 @@ static void devlink_port_type_warn_cancel(struct devlink_port *devlink_port)
  *	Note that the caller should take care of zeroing the devlink_port
  *	structure.
  */
-int devlink_port_register(struct devlink *devlink,
-			  struct devlink_port *devlink_port,
-			  unsigned int port_index)
+int devlink_port_register_with_vdev(struct devlink *devlink,
+				    struct devlink_port *devlink_port,
+				    unsigned int port_index,
+				    struct devlink_vdev *devlink_vdev)
 {
 	mutex_lock(&devlink->lock);
 	if (devlink_port_index_exists(devlink, port_index)) {
@@ -6606,15 +6618,42 @@ int devlink_port_register(struct devlink *devlink,
 	devlink_port->devlink = devlink;
 	devlink_port->index = port_index;
 	devlink_port->registered = true;
+	devlink_port->devlink_vdev = devlink_vdev;
 	spin_lock_init(&devlink_port->type_lock);
 	list_add_tail(&devlink_port->list, &devlink->port_list);
 	INIT_LIST_HEAD(&devlink_port->param_list);
 	mutex_unlock(&devlink->lock);
 	INIT_DELAYED_WORK(&devlink_port->type_warn_dw, &devlink_port_type_warn);
 	devlink_port_type_warn_schedule(devlink_port);
+	if (devlink_vdev) {
+		devlink_vdev->devlink_port = devlink_port;
+		devlink_vdev_notify(devlink_vdev, DEVLINK_CMD_VDEV_NEW);
+	}
 	devlink_port_notify(devlink_port, DEVLINK_CMD_PORT_NEW);
 	return 0;
 }
+EXPORT_SYMBOL_GPL(devlink_port_register_with_vdev);
+
+/**
+ *	devlink_port_register - Register devlink port
+ *
+ *	@devlink: devlink
+ *	@devlink_port: devlink port
+ *	@port_index: driver-specific numerical identifier of the port
+ *
+ *	Register devlink port with provided port index. User can use
+ *	any indexing, even hw-related one. devlink_port structure
+ *	is convenient to be embedded inside user driver private structure.
+ *	Note that the caller should take care of zeroing the devlink_port
+ *	structure.
+ */
+int devlink_port_register(struct devlink *devlink,
+			  struct devlink_port *devlink_port,
+			  unsigned int port_index)
+{
+	return devlink_port_register_with_vdev(devlink, devlink_port,
+					       port_index, NULL);
+}
 EXPORT_SYMBOL_GPL(devlink_port_register);
 
 /**
@@ -6624,9 +6663,14 @@ EXPORT_SYMBOL_GPL(devlink_port_register);
  */
 void devlink_port_unregister(struct devlink_port *devlink_port)
 {
+	struct devlink_vdev *devlink_vdev = devlink_port->devlink_vdev;
 	struct devlink *devlink = devlink_port->devlink;
 
 	devlink_port_type_warn_cancel(devlink_port);
+	if (devlink_vdev) {
+		devlink_vdev->devlink_port = NULL;
+		devlink_vdev_notify(devlink_vdev, DEVLINK_CMD_VDEV_NEW);
+	}
 	devlink_port_notify(devlink_port, DEVLINK_CMD_PORT_DEL);
 	mutex_lock(&devlink->lock);
 	list_del(&devlink_port->list);
@@ -6926,6 +6970,7 @@ void devlink_vdev_destroy(struct devlink_vdev *devlink_vdev)
 {
 	struct devlink *devlink = devlink_vdev->devlink;
 
+	WARN_ON(devlink_vdev->devlink_port);
 	devlink_vdev_notify(devlink_vdev, DEVLINK_CMD_VDEV_DEL);
 	mutex_lock(&devlink->lock);
 	list_del(&devlink_vdev->list);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH net-next 4/9] devlink: Support vdev HW address get
  2019-10-22 17:43 [PATCH net-next 0/9] devlink vdev Yuval Avnery
                   ` (2 preceding siblings ...)
  2019-10-22 17:43 ` [PATCH net-next 3/9] devlink: Add port with vdev register support Yuval Avnery
@ 2019-10-22 17:43 ` Yuval Avnery
  2019-10-22 17:43 ` [PATCH net-next 5/9] devlink: Support vdev HW address set Yuval Avnery
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Yuval Avnery @ 2019-10-22 17:43 UTC (permalink / raw)
  To: netdev; +Cc: jiri, saeedm, leon, davem, jakub.kicinski, shuah, Yuval Avnery

Allow privileged user to get the HW address of a vdev.

Example:

$ devlink vdev show pci/0000:03:00.0/1
pci/0000:03:00.0/1: flavour pcivf pf 0 vf 0 port_index 1 hw_addr 00:23:35:af:35:34

$ devlink vdev show pci/0000:03:00.0/1 -pj
{
    "vdev": {
        "pci/0000:03:00.0/1": {
            "flavour": "pcivf",
            "pf": 0,
            "vf": 0,
            "port_index": 1,
            "hw_addr": "00:23:35:af:35:34"
        }
    }
}

Signed-off-by: Yuval Avnery <yuvalav@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/devlink.h        |  3 +++
 include/uapi/linux/devlink.h |  1 +
 net/core/devlink.c           | 33 +++++++++++++++++++++++++++------
 3 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/include/net/devlink.h b/include/net/devlink.h
index 138d33275963..12550cc92e9d 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -765,6 +765,9 @@ struct devlink_ops {
 };
 
 struct devlink_vdev_ops {
+	int (*hw_addr_get)(struct devlink_vdev *vdev,
+			   u8 *hw_addr, struct netlink_ext_ack *extack);
+	unsigned int hw_addr_len;
 };
 
 static inline void *devlink_priv(struct devlink *devlink)
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index 161bad54d528..2f2c2d60796f 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -439,6 +439,7 @@ enum devlink_attr {
 	DEVLINK_ATTR_VDEV_FLAVOUR,		/* u16 */
 	DEVLINK_ATTR_VDEV_PF_INDEX,		/* u32 */
 	DEVLINK_ATTR_VDEV_VF_INDEX,		/* u32 */
+	DEVLINK_ATTR_VDEV_HW_ADDR,		/* binary */
 
 	/* add new attributes above here, update the policy in devlink.c */
 
diff --git a/net/core/devlink.c b/net/core/devlink.c
index 2fffbd37e710..94599409f12c 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -711,10 +711,13 @@ static void devlink_port_notify(struct devlink_port *devlink_port,
 static int devlink_nl_vdev_fill(struct sk_buff *msg, struct devlink *devlink,
 				struct devlink_vdev *devlink_vdev,
 				enum devlink_command cmd, u32 vdevid,
-				u32 seq, int flags)
+				u32 seq, int flags,
+				struct netlink_ext_ack *extack)
 {
 	struct devlink_vdev_attrs *attrs = &devlink_vdev->attrs;
+	const struct devlink_vdev_ops *ops = devlink_vdev->ops;
 	void *hdr;
+	int err;
 
 	hdr = genlmsg_put(msg, vdevid, seq, &devlink_nl_family, flags, cmd);
 	if (!hdr)
@@ -748,6 +751,19 @@ static int devlink_nl_vdev_fill(struct sk_buff *msg, struct devlink *devlink,
 				devlink_vdev->devlink_port->index))
 			goto nla_put_failure;
 
+	if (ops && ops->hw_addr_get) {
+		u8 hw_addr[MAX_ADDR_LEN];
+
+		err = ops->hw_addr_get(devlink_vdev, hw_addr, extack);
+		if (err) {
+			genlmsg_cancel(msg, hdr);
+			return err;
+		}
+		if (nla_put(msg, DEVLINK_ATTR_VDEV_HW_ADDR,
+			    ops->hw_addr_len, hw_addr))
+			goto nla_put_failure;
+	}
+
 	genlmsg_end(msg, hdr);
 	return 0;
 
@@ -769,7 +785,8 @@ static void devlink_vdev_notify(struct devlink_vdev *devlink_vdev,
 	if (!msg)
 		return;
 
-	err = devlink_nl_vdev_fill(msg, devlink, devlink_vdev, cmd, 0, 0, 0);
+	err = devlink_nl_vdev_fill(msg, devlink, devlink_vdev, cmd,
+				   0, 0, 0, NULL);
 	if (err) {
 		nlmsg_free(msg);
 		return;
@@ -992,8 +1009,8 @@ static int devlink_nl_cmd_vdev_get_doit(struct sk_buff *skb,
 		return -ENOMEM;
 
 	err = devlink_nl_vdev_fill(msg, devlink, devlink_vdev,
-				   DEVLINK_CMD_VDEV_NEW,
-				   info->snd_portid, info->snd_seq, 0);
+				   DEVLINK_CMD_VDEV_NEW, info->snd_portid,
+				   info->snd_seq, 0, info->extack);
 	if (err) {
 		nlmsg_free(msg);
 		return err;
@@ -1009,7 +1026,7 @@ static int devlink_nl_cmd_vdev_get_dumpit(struct sk_buff *msg,
 	struct devlink *devlink;
 	int start = cb->args[0];
 	int idx = 0;
-	int err;
+	int err = 0;
 
 	mutex_lock(&devlink_mutex);
 	list_for_each_entry(devlink, &devlink_list, list) {
@@ -1025,7 +1042,7 @@ static int devlink_nl_cmd_vdev_get_dumpit(struct sk_buff *msg,
 						   DEVLINK_CMD_NEW,
 						   NETLINK_CB(cb->skb).portid,
 						   cb->nlh->nlmsg_seq,
-						   NLM_F_MULTI);
+						   NLM_F_MULTI, NULL);
 			if (err) {
 				mutex_unlock(&devlink->lock);
 				goto out;
@@ -1036,6 +1053,8 @@ static int devlink_nl_cmd_vdev_get_dumpit(struct sk_buff *msg,
 	}
 out:
 	mutex_unlock(&devlink_mutex);
+	if (err != -EMSGSIZE)
+		return err;
 
 	cb->args[0] = idx;
 	return msg->len;
@@ -6109,6 +6128,8 @@ static const struct nla_policy devlink_nl_policy[DEVLINK_ATTR_MAX + 1] = {
 	[DEVLINK_ATTR_VDEV_FLAVOUR] = { .type = NLA_U16 },
 	[DEVLINK_ATTR_VDEV_PF_INDEX] = { .type = NLA_U32 },
 	[DEVLINK_ATTR_VDEV_VF_INDEX] = { .type = NLA_U32 },
+	[DEVLINK_ATTR_VDEV_HW_ADDR] = { .type = NLA_BINARY,
+					.len = MAX_ADDR_LEN },
 };
 
 static const struct genl_ops devlink_nl_ops[] = {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH net-next 5/9] devlink: Support vdev HW address set
  2019-10-22 17:43 [PATCH net-next 0/9] devlink vdev Yuval Avnery
                   ` (3 preceding siblings ...)
  2019-10-22 17:43 ` [PATCH net-next 4/9] devlink: Support vdev HW address get Yuval Avnery
@ 2019-10-22 17:43 ` Yuval Avnery
  2019-10-22 17:43 ` [PATCH net-next 6/9] netdevsim: Add max_vfs to bus_dev Yuval Avnery
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Yuval Avnery @ 2019-10-22 17:43 UTC (permalink / raw)
  To: netdev; +Cc: jiri, saeedm, leon, davem, jakub.kicinski, shuah, Yuval Avnery

Allow privileged user to set the HW address of a vdev.

Example:

$ devlink vdev set pci/0000:03:00.0/1 hw_addr 00:23:35:af:35:34

$ devlink vdev show pci/0000:03:00.0/1
pci/0000:03:00.0/1: flavour pcivf pf 0 vf 0 port_index 1 hw_addr 00:23:35:af:35:34

$ devlink vdev show pci/0000:03:00.0/1 -pj
{
    "vdev": {
        "pci/0000:03:00.0/1": {
            "flavour": "pcivf",
            "pf": 0,
            "vf": 0,
            "port_index": 1,
            "hw_addr": "00:23:35:af:35:34"
        }
    }
}

Signed-off-by: Yuval Avnery <yuvalav@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/devlink.h |  2 ++
 net/core/devlink.c    | 17 +++++++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/include/net/devlink.h b/include/net/devlink.h
index 12550cc92e9d..5ae329813ec8 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -765,6 +765,8 @@ struct devlink_ops {
 };
 
 struct devlink_vdev_ops {
+	int (*hw_addr_set)(struct devlink_vdev *vdev,
+			   u8 *hw_addr, struct netlink_ext_ack *extack);
 	int (*hw_addr_get)(struct devlink_vdev *vdev,
 			   u8 *hw_addr, struct netlink_ext_ack *extack);
 	unsigned int hw_addr_len;
diff --git a/net/core/devlink.c b/net/core/devlink.c
index 94599409f12c..15cc674b05ce 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -1063,6 +1063,23 @@ static int devlink_nl_cmd_vdev_get_dumpit(struct sk_buff *msg,
 static int devlink_nl_cmd_vdev_set_doit(struct sk_buff *skb,
 					struct genl_info *info)
 {
+	struct nlattr *nla_hw_addr = info->attrs[DEVLINK_ATTR_VDEV_HW_ADDR];
+	struct devlink_vdev *devlink_vdev = info->user_ptr[0];
+	const struct devlink_vdev_ops *ops;
+	int err;
+
+	ops = devlink_vdev->ops;
+	if (nla_hw_addr) {
+		u8 *hw_addr;
+
+		if (!ops || !ops->hw_addr_set)
+			return -EOPNOTSUPP;
+
+		hw_addr = nla_data(nla_hw_addr);
+		err = ops->hw_addr_set(devlink_vdev, hw_addr, info->extack);
+		if (err)
+			return err;
+	}
 	return 0;
 }
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH net-next 6/9] netdevsim: Add max_vfs to bus_dev
  2019-10-22 17:43 [PATCH net-next 0/9] devlink vdev Yuval Avnery
                   ` (4 preceding siblings ...)
  2019-10-22 17:43 ` [PATCH net-next 5/9] devlink: Support vdev HW address set Yuval Avnery
@ 2019-10-22 17:43 ` Yuval Avnery
  2019-10-22 17:43 ` [PATCH net-next 7/9] netdevsim: Add devlink vdev creation Yuval Avnery
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Yuval Avnery @ 2019-10-22 17:43 UTC (permalink / raw)
  To: netdev; +Cc: jiri, saeedm, leon, davem, jakub.kicinski, shuah, Yuval Avnery

Currently there is no limit to the number of VFs netdevsim can enable.
Vdevs that represent VFs can exist and be configured before user enabled
them.
max_vfs defines the number of vdevs the driver should create.

Signed-off-by: Yuval Avnery <yuvalav@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/netdevsim/bus.c       | 39 +++++++++++++++++++++++--------
 drivers/net/netdevsim/netdevsim.h |  1 +
 2 files changed, 30 insertions(+), 10 deletions(-)

diff --git a/drivers/net/netdevsim/bus.c b/drivers/net/netdevsim/bus.c
index 6aeed0c600f8..f1a0171080cb 100644
--- a/drivers/net/netdevsim/bus.c
+++ b/drivers/net/netdevsim/bus.c
@@ -26,9 +26,9 @@ static struct nsim_bus_dev *to_nsim_bus_dev(struct device *dev)
 static int nsim_bus_dev_vfs_enable(struct nsim_bus_dev *nsim_bus_dev,
 				   unsigned int num_vfs)
 {
-	nsim_bus_dev->vfconfigs = kcalloc(num_vfs,
-					  sizeof(struct nsim_vf_config),
-					  GFP_KERNEL);
+	if (nsim_bus_dev->max_vfs < num_vfs)
+		return -ENOMEM;
+
 	if (!nsim_bus_dev->vfconfigs)
 		return -ENOMEM;
 	nsim_bus_dev->num_vfs = num_vfs;
@@ -38,8 +38,6 @@ static int nsim_bus_dev_vfs_enable(struct nsim_bus_dev *nsim_bus_dev,
 
 static void nsim_bus_dev_vfs_disable(struct nsim_bus_dev *nsim_bus_dev)
 {
-	kfree(nsim_bus_dev->vfconfigs);
-	nsim_bus_dev->vfconfigs = NULL;
 	nsim_bus_dev->num_vfs = 0;
 }
 
@@ -154,22 +152,29 @@ static struct device_type nsim_bus_dev_type = {
 };
 
 static struct nsim_bus_dev *
-nsim_bus_dev_new(unsigned int id, unsigned int port_count);
+nsim_bus_dev_new(unsigned int id, unsigned int port_count,
+		 unsigned int max_vfs);
+
+#define NSIM_BUS_DEV_MAX_VFS 4
 
 static ssize_t
 new_device_store(struct bus_type *bus, const char *buf, size_t count)
 {
 	struct nsim_bus_dev *nsim_bus_dev;
 	unsigned int port_count;
+	unsigned int max_vfs;
 	unsigned int id;
 	int err;
 
-	err = sscanf(buf, "%u %u", &id, &port_count);
+	err = sscanf(buf, "%u %u %u", &id, &port_count, &max_vfs);
 	switch (err) {
 	case 1:
 		port_count = 1;
 		/* fall through */
 	case 2:
+		max_vfs = NSIM_BUS_DEV_MAX_VFS;
+		/* fall through */
+	case 3:
 		if (id > INT_MAX) {
 			pr_err("Value of \"id\" is too big.\n");
 			return -EINVAL;
@@ -179,7 +184,7 @@ new_device_store(struct bus_type *bus, const char *buf, size_t count)
 		pr_err("Format for adding new device is \"id port_count\" (uint uint).\n");
 		return -EINVAL;
 	}
-	nsim_bus_dev = nsim_bus_dev_new(id, port_count);
+	nsim_bus_dev = nsim_bus_dev_new(id, port_count, max_vfs);
 	if (IS_ERR(nsim_bus_dev))
 		return PTR_ERR(nsim_bus_dev);
 
@@ -267,7 +272,8 @@ static struct bus_type nsim_bus = {
 };
 
 static struct nsim_bus_dev *
-nsim_bus_dev_new(unsigned int id, unsigned int port_count)
+nsim_bus_dev_new(unsigned int id, unsigned int port_count,
+		 unsigned int max_vfs)
 {
 	struct nsim_bus_dev *nsim_bus_dev;
 	int err;
@@ -284,12 +290,24 @@ nsim_bus_dev_new(unsigned int id, unsigned int port_count)
 	nsim_bus_dev->dev.type = &nsim_bus_dev_type;
 	nsim_bus_dev->port_count = port_count;
 	nsim_bus_dev->initial_net = current->nsproxy->net_ns;
+	nsim_bus_dev->max_vfs = max_vfs;
+
+	nsim_bus_dev->vfconfigs = kcalloc(nsim_bus_dev->max_vfs,
+					  sizeof(struct nsim_vf_config),
+					  GFP_KERNEL);
+	if (!nsim_bus_dev->vfconfigs) {
+		err = -ENOMEM;
+		goto err_nsim_bus_dev_id_free;
+	}
 
 	err = device_register(&nsim_bus_dev->dev);
 	if (err)
-		goto err_nsim_bus_dev_id_free;
+		goto err_nsim_vfconfigs_free;
+
 	return nsim_bus_dev;
 
+err_nsim_vfconfigs_free:
+	kfree(nsim_bus_dev->vfconfigs);
 err_nsim_bus_dev_id_free:
 	ida_free(&nsim_bus_dev_ids, nsim_bus_dev->dev.id);
 err_nsim_bus_dev_free:
@@ -301,6 +319,7 @@ static void nsim_bus_dev_del(struct nsim_bus_dev *nsim_bus_dev)
 {
 	device_unregister(&nsim_bus_dev->dev);
 	ida_free(&nsim_bus_dev_ids, nsim_bus_dev->dev.id);
+	kfree(nsim_bus_dev->vfconfigs);
 	kfree(nsim_bus_dev);
 }
 
diff --git a/drivers/net/netdevsim/netdevsim.h b/drivers/net/netdevsim/netdevsim.h
index 94df795ef4d3..e2049856add8 100644
--- a/drivers/net/netdevsim/netdevsim.h
+++ b/drivers/net/netdevsim/netdevsim.h
@@ -238,6 +238,7 @@ struct nsim_bus_dev {
 	struct net *initial_net; /* Purpose of this is to carry net pointer
 				  * during the probe time only.
 				  */
+	unsigned int max_vfs;
 	unsigned int num_vfs;
 	struct nsim_vf_config *vfconfigs;
 };
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH net-next 7/9] netdevsim: Add devlink vdev creation
  2019-10-22 17:43 [PATCH net-next 0/9] devlink vdev Yuval Avnery
                   ` (5 preceding siblings ...)
  2019-10-22 17:43 ` [PATCH net-next 6/9] netdevsim: Add max_vfs to bus_dev Yuval Avnery
@ 2019-10-22 17:43 ` Yuval Avnery
  2019-10-22 17:43 ` [PATCH net-next 8/9] netdevsim: Add devlink vdev sefltest for netdevsim Yuval Avnery
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Yuval Avnery @ 2019-10-22 17:43 UTC (permalink / raw)
  To: netdev; +Cc: jiri, saeedm, leon, davem, jakub.kicinski, shuah, Yuval Avnery

Add devlink vdev creation to represent VFs, and implement hw_addr
get/set.

Signed-off-by: Yuval Avnery <yuvalav@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/netdevsim/Makefile    |  2 +-
 drivers/net/netdevsim/dev.c       |  9 ++-
 drivers/net/netdevsim/netdevsim.h | 10 ++++
 drivers/net/netdevsim/vdev.c      | 96 +++++++++++++++++++++++++++++++
 4 files changed, 115 insertions(+), 2 deletions(-)
 create mode 100644 drivers/net/netdevsim/vdev.c

diff --git a/drivers/net/netdevsim/Makefile b/drivers/net/netdevsim/Makefile
index f4d8f62f28c2..62256906128f 100644
--- a/drivers/net/netdevsim/Makefile
+++ b/drivers/net/netdevsim/Makefile
@@ -3,7 +3,7 @@
 obj-$(CONFIG_NETDEVSIM) += netdevsim.o
 
 netdevsim-objs := \
-	netdev.o dev.o fib.o bus.o health.o
+	netdev.o dev.o fib.o bus.o health.o vdev.o
 
 ifeq ($(CONFIG_BPF_SYSCALL),y)
 netdevsim-objs += \
diff --git a/drivers/net/netdevsim/dev.c b/drivers/net/netdevsim/dev.c
index 468e157a7cb1..3fbf4e1ca0d1 100644
--- a/drivers/net/netdevsim/dev.c
+++ b/drivers/net/netdevsim/dev.c
@@ -785,10 +785,14 @@ static struct nsim_dev *nsim_dev_create(struct nsim_bus_dev *nsim_bus_dev)
 	if (err)
 		goto err_fib_destroy;
 
+	err = nsim_dev_vdevs_create(nsim_dev, devlink);
+	if (err)
+		goto err_dl_unregister;
+
 	err = devlink_params_register(devlink, nsim_devlink_params,
 				      ARRAY_SIZE(nsim_devlink_params));
 	if (err)
-		goto err_dl_unregister;
+		goto err_dl_vdevs_destroy;
 	nsim_devlink_set_params_init_values(nsim_dev, devlink);
 
 	err = nsim_dev_dummy_region_init(nsim_dev, devlink);
@@ -831,6 +835,8 @@ static struct nsim_dev *nsim_dev_create(struct nsim_bus_dev *nsim_bus_dev)
 err_params_unregister:
 	devlink_params_unregister(devlink, nsim_devlink_params,
 				  ARRAY_SIZE(nsim_devlink_params));
+err_dl_vdevs_destroy:
+	nsim_dev_vdevs_destroy(nsim_dev);
 err_dl_unregister:
 	devlink_unregister(devlink);
 err_fib_destroy:
@@ -866,6 +872,7 @@ static void nsim_dev_destroy(struct nsim_dev *nsim_dev)
 	nsim_dev_debugfs_exit(nsim_dev);
 	devlink_params_unregister(devlink, nsim_devlink_params,
 				  ARRAY_SIZE(nsim_devlink_params));
+	nsim_dev_vdevs_destroy(nsim_dev);
 	devlink_unregister(devlink);
 	devlink_resources_unregister(devlink, NULL);
 	devlink_free(devlink);
diff --git a/drivers/net/netdevsim/netdevsim.h b/drivers/net/netdevsim/netdevsim.h
index e2049856add8..071bedb999bf 100644
--- a/drivers/net/netdevsim/netdevsim.h
+++ b/drivers/net/netdevsim/netdevsim.h
@@ -154,6 +154,12 @@ struct nsim_dev_port {
 	struct netdevsim *ns;
 };
 
+struct nsim_vdev {
+	struct devlink_vdev *devlink_vdev;
+	unsigned int vdev_index;
+	struct nsim_bus_dev *nsim_bus_dev;
+};
+
 struct nsim_dev {
 	struct nsim_bus_dev *nsim_bus_dev;
 	struct nsim_fib_data *fib_data;
@@ -177,6 +183,7 @@ struct nsim_dev {
 	bool fail_reload;
 	struct devlink_region *dummy_region;
 	struct nsim_dev_health health;
+	struct nsim_vdev *vdevs;
 };
 
 static inline struct net *nsim_dev_net(struct nsim_dev *nsim_dev)
@@ -245,3 +252,6 @@ struct nsim_bus_dev {
 
 int nsim_bus_init(void);
 void nsim_bus_exit(void);
+
+int nsim_dev_vdevs_create(struct nsim_dev *nsim_dev, struct devlink *devlink);
+void nsim_dev_vdevs_destroy(struct nsim_dev *nsim_dev);
diff --git a/drivers/net/netdevsim/vdev.c b/drivers/net/netdevsim/vdev.c
new file mode 100644
index 000000000000..c22e01982ba6
--- /dev/null
+++ b/drivers/net/netdevsim/vdev.c
@@ -0,0 +1,96 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2019 Mellanox Technologies */
+
+#include <linux/device.h>
+#include <linux/etherdevice.h>
+#include <linux/inet.h>
+#include <linux/kernel.h>
+#include <linux/mutex.h>
+#include <linux/rtnetlink.h>
+#include <net/devlink.h>
+#include <uapi/linux/devlink.h>
+
+#include "netdevsim.h"
+
+static int
+nsim_hw_addr_set(struct devlink_vdev *devlink_vdev, u8 *hw_addr,
+		 struct netlink_ext_ack *extack)
+{
+	struct nsim_bus_dev *nsim_bus_dev;
+	struct nsim_vdev *nsim_vdev;
+
+	nsim_vdev = devlink_vdev_priv(devlink_vdev);
+	nsim_bus_dev = nsim_vdev->nsim_bus_dev;
+
+	ether_addr_copy(nsim_bus_dev->vfconfigs[nsim_vdev->vdev_index].vf_mac,
+			hw_addr);
+	return 0;
+}
+
+static int
+nsim_hw_addr_get(struct devlink_vdev *devlink_vdev, u8 *hw_addr,
+		 struct netlink_ext_ack *extack)
+{
+	struct nsim_bus_dev *nsim_bus_dev;
+	struct nsim_vdev *nsim_vdev;
+
+	nsim_vdev = devlink_vdev_priv(devlink_vdev);
+	nsim_bus_dev = nsim_vdev->nsim_bus_dev;
+
+	ether_addr_copy(hw_addr,
+			nsim_bus_dev->vfconfigs[nsim_vdev->vdev_index].vf_mac);
+	return 0;
+}
+
+static struct devlink_vdev_ops vdev_ops = {
+	.hw_addr_set = nsim_hw_addr_set,
+	.hw_addr_get = nsim_hw_addr_get,
+	.hw_addr_len = ETH_ALEN,
+};
+
+int nsim_dev_vdevs_create(struct nsim_dev *nsim_dev, struct devlink *devlink)
+{
+	int max_vfs = nsim_dev->nsim_bus_dev->max_vfs;
+	struct devlink_vdev_attrs attrs;
+	int err;
+	int vf;
+
+	nsim_dev->vdevs = kcalloc(max_vfs, sizeof(*nsim_dev->vdevs),
+				  GFP_KERNEL);
+	if (!nsim_dev->vdevs)
+		return -ENOMEM;
+
+	for (vf = 0; vf < max_vfs; vf++) {
+		struct nsim_vdev *nsim_vdev = &nsim_dev->vdevs[vf];
+		struct devlink_vdev *devlink_vdev;
+
+		nsim_vdev->nsim_bus_dev = nsim_dev->nsim_bus_dev;
+		devlink_vdev_attrs_pci_vf_init(&attrs, 0, vf);
+		devlink_vdev = devlink_vdev_create(devlink, vf,
+						   &vdev_ops,
+						   &attrs,
+						   nsim_vdev);
+		if (IS_ERR(devlink_vdev)) {
+			err = PTR_ERR(devlink_vdev);
+			goto err_vdevs_destroy;
+		}
+		nsim_vdev->devlink_vdev = devlink_vdev;
+		nsim_vdev->vdev_index = vf;
+	}
+
+	return 0;
+
+err_vdevs_destroy:
+	for (vf--; vf >= 0; vf--)
+		devlink_vdev_destroy(nsim_dev->vdevs[vf].devlink_vdev);
+	return err;
+}
+
+void nsim_dev_vdevs_destroy(struct nsim_dev *nsim_dev)
+{
+	int vf;
+
+	for (vf = 0; vf < nsim_dev->nsim_bus_dev->max_vfs; vf++)
+		devlink_vdev_destroy(nsim_dev->vdevs[vf].devlink_vdev);
+}
+
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH net-next 8/9] netdevsim: Add devlink vdev sefltest for netdevsim
  2019-10-22 17:43 [PATCH net-next 0/9] devlink vdev Yuval Avnery
                   ` (6 preceding siblings ...)
  2019-10-22 17:43 ` [PATCH net-next 7/9] netdevsim: Add devlink vdev creation Yuval Avnery
@ 2019-10-22 17:43 ` Yuval Avnery
  2019-10-22 17:43 ` [PATCH net-next 9/9] net/mlx5e: Add support for devlink vdev and vdev hw_addr set/show Yuval Avnery
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Yuval Avnery @ 2019-10-22 17:43 UTC (permalink / raw)
  To: netdev; +Cc: jiri, saeedm, leon, davem, jakub.kicinski, shuah, Yuval Avnery

Test will assign hw_addr to all registered vdevs and query it.

Signed-off-by: Yuval Avnery <yuvalav@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 .../drivers/net/netdevsim/devlink.sh          | 55 ++++++++++++++++++-
 1 file changed, 53 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/drivers/net/netdevsim/devlink.sh b/tools/testing/selftests/drivers/net/netdevsim/devlink.sh
index ee89cd2f5bee..88ddf65b7897 100755
--- a/tools/testing/selftests/drivers/net/netdevsim/devlink.sh
+++ b/tools/testing/selftests/drivers/net/netdevsim/devlink.sh
@@ -5,12 +5,13 @@ lib_dir=$(dirname $0)/../../../net/forwarding
 
 ALL_TESTS="fw_flash_test params_test regions_test reload_test \
 	   netns_reload_test resource_test dev_info_test \
-	   empty_reporter_test dummy_reporter_test"
+	   empty_reporter_test dummy_reporter_test vdev_test"
 NUM_NETIFS=0
 source $lib_dir/lib.sh
 
 BUS_ADDR=10
 PORT_COUNT=4
+VF_COUNT=4
 DEV_NAME=netdevsim$BUS_ADDR
 SYSFS_NET_DIR=/sys/bus/netdevsim/devices/$DEV_NAME/net/
 DEBUGFS_DIR=/sys/kernel/debug/netdevsim/$DEV_NAME/
@@ -428,10 +429,60 @@ dummy_reporter_test()
 	log_test "dummy reporter test"
 }
 
+vdev_attr_get()
+{
+	local handle=$1
+	local name=$2
+
+	cmd_jq "devlink vdev show $handle -j" '.[][].'$name
+}
+
+vdev_objects_get()
+{
+	local handle=$1
+
+	cmd_jq "devlink vdev show -j" \
+	       '.[] | keys[] | select(contains("'$handle'"))'
+}
+
+vdev_attr_set()
+{
+	local handle=$1
+	local name=$2
+	local value=$3
+
+	devlink vdev set $handle $name $value
+}
+
+vdev_test()
+{
+	RET=0
+
+	local vdevs=`vdev_objects_get $DL_HANDLE`
+	local num_vdevs=`echo $vdevs | wc -w`
+	[ $num_vdevs == $VF_COUNT ]
+	check_err $? "Expected $VF_COUNT vdevs but got $num_vdevs"
+
+	i=1
+	for vdev in $vdevs
+	do
+		local hw_addr=`printf "10:22:33:44:55:%02x" $i`
+
+		vdev_attr_set "$vdev" hw_addr $hw_addr
+		check_err $? "Failed to set hw_addr value"
+		value=$(vdev_attr_get $vdev hw_addr)
+		check_err $? "Failed to get hw_addr attr value"
+		[ "$value" == "$hw_addr" ]
+		check_err $? "Unexpected hw_addr attr value $value != $hw_addr"
+		i=$(($i+1))
+	done
+	log_test "vdev test"
+}
+
 setup_prepare()
 {
 	modprobe netdevsim
-	echo "$BUS_ADDR $PORT_COUNT" > /sys/bus/netdevsim/new_device
+	echo "$BUS_ADDR $PORT_COUNT $VF_COUNT" > /sys/bus/netdevsim/new_device
 	while [ ! -d $SYSFS_NET_DIR ] ; do :; done
 }
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH net-next 9/9] net/mlx5e: Add support for devlink vdev and vdev hw_addr set/show
  2019-10-22 17:43 [PATCH net-next 0/9] devlink vdev Yuval Avnery
                   ` (7 preceding siblings ...)
  2019-10-22 17:43 ` [PATCH net-next 8/9] netdevsim: Add devlink vdev sefltest for netdevsim Yuval Avnery
@ 2019-10-22 17:43 ` Yuval Avnery
  2019-10-23 15:48 ` [PATCH net-next 0/9] devlink vdev Or Gerlitz
  2019-10-23 19:00 ` Jakub Kicinski
  10 siblings, 0 replies; 24+ messages in thread
From: Yuval Avnery @ 2019-10-22 17:43 UTC (permalink / raw)
  To: netdev; +Cc: jiri, saeedm, leon, davem, jakub.kicinski, shuah, Yuval Avnery

Add vdev creation and implement get/set ops for vdev hw_addr.
Add vdev linkage to devlink port registration.

Signed-off-by: Yuval Avnery <yuvalav@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 .../net/ethernet/mellanox/mlx5/core/devlink.c | 91 +++++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/devlink.h |  5 +
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  6 +-
 .../net/ethernet/mellanox/mlx5/core/en_rep.c  |  9 +-
 .../net/ethernet/mellanox/mlx5/core/eswitch.c | 19 +++-
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |  7 +-
 .../mellanox/mlx5/core/eswitch_offloads.c     |  8 ++
 7 files changed, 135 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
index 381925c90d94..45166f7016c0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
@@ -226,3 +226,94 @@ void mlx5_devlink_unregister(struct devlink *devlink)
 				  ARRAY_SIZE(mlx5_devlink_params));
 	devlink_unregister(devlink);
 }
+
+static int
+mlx5_devlink_mac_set(struct devlink_vdev *devlink_vdev, u8 *mac,
+		     struct netlink_ext_ack *extack)
+{
+	struct mlx5_vport *vport = devlink_vdev_priv(devlink_vdev);
+
+	return mlx5_eswitch_set_vport_mac(vport->dev->priv.eswitch,
+					 vport->vport, mac, extack);
+}
+
+static int
+mlx5_devlink_mac_get(struct devlink_vdev *devlink_vdev, u8 *mac,
+		     struct netlink_ext_ack *extack)
+{
+	struct mlx5_vport *vport = devlink_vdev_priv(devlink_vdev);
+	struct ifla_vf_info ivi;
+	int err;
+
+	vport = devlink_vdev_priv(devlink_vdev);
+
+	err = mlx5_eswitch_get_vport_config(vport->dev->priv.eswitch,
+					    vport->vport, &ivi, extack);
+	if (!err)
+		ether_addr_copy(mac, ivi.mac);
+
+	return err;
+}
+
+static struct devlink_vdev_ops vdev_ops = {
+	.hw_addr_set = mlx5_devlink_mac_set,
+	.hw_addr_get = mlx5_devlink_mac_get,
+	.hw_addr_len = ETH_ALEN,
+};
+
+int mlx5_devlink_vdevs_create(struct mlx5_eswitch *esw)
+{
+	struct devlink *devlink = priv_to_devlink(esw->dev);
+	struct mlx5_vport *vport;
+	int err;
+	int i;
+
+	mlx5_esw_for_all_vports(esw, i, vport) {
+		struct devlink_vdev *devlink_vdev;
+		struct devlink_vdev_attrs attrs;
+		int vdev_idx;
+
+		if (IS_ERR(vport)) {
+			err = PTR_ERR(vport);
+			goto err_dl_destroy;
+		}
+
+		if (vport->vport == MLX5_VPORT_UPLINK)
+			continue;
+		if (vport->vport == mlx5_eswitch_manager_vport(esw->dev))
+			continue;
+
+		vdev_idx = vport->vport;
+		if (mlx5_eswitch_is_vf_vport(esw, vdev_idx))
+			devlink_vdev_attrs_pci_vf_init(&attrs, 0, vdev_idx - 1);
+		else
+			devlink_vdev_attrs_pci_pf_init(&attrs, vdev_idx);
+
+		devlink_vdev = devlink_vdev_create(devlink, vdev_idx,
+						   &vdev_ops, &attrs,
+						   vport);
+		if (IS_ERR(devlink_vdev)) {
+			err = PTR_ERR(devlink_vdev);
+			goto err_dl_destroy;
+		}
+		vport->devlink_vdev = devlink_vdev;
+	}
+
+	return 0;
+
+err_dl_destroy:
+	mlx5_devlink_vdevs_destroy(esw);
+	return err;
+}
+
+void mlx5_devlink_vdevs_destroy(struct mlx5_eswitch *esw)
+{
+	struct mlx5_vport *vport;
+	int i;
+
+	mlx5_esw_for_all_vports(esw, i, vport) {
+		if (vport->devlink_vdev)
+			devlink_vdev_destroy(vport->devlink_vdev);
+	}
+}
+
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/devlink.h b/drivers/net/ethernet/mellanox/mlx5/core/devlink.h
index d0ba03774ddf..652190480dbf 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/devlink.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/devlink.h
@@ -6,9 +6,14 @@
 
 #include <net/devlink.h>
 
+struct mlx5_eswitch;
+
 struct devlink *mlx5_devlink_alloc(void);
 void mlx5_devlink_free(struct devlink *devlink);
 int mlx5_devlink_register(struct devlink *devlink, struct device *dev);
 void mlx5_devlink_unregister(struct devlink *devlink);
 
+int mlx5_devlink_vdevs_create(struct mlx5_eswitch *esw);
+void mlx5_devlink_vdevs_destroy(struct mlx5_eswitch *esw);
+
 #endif /* __MLX5_DEVLINK_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 772bfdbdeb9c..429672f69b6f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -4054,7 +4054,8 @@ int mlx5e_set_vf_mac(struct net_device *dev, int vf, u8 *mac)
 	struct mlx5e_priv *priv = netdev_priv(dev);
 	struct mlx5_core_dev *mdev = priv->mdev;
 
-	return mlx5_eswitch_set_vport_mac(mdev->priv.eswitch, vf + 1, mac);
+	return mlx5_eswitch_set_vport_mac(mdev->priv.eswitch, vf + 1,
+					  mac, NULL);
 }
 
 static int mlx5e_set_vf_vlan(struct net_device *dev, int vf, u16 vlan, u8 qos,
@@ -4135,7 +4136,8 @@ int mlx5e_get_vf_config(struct net_device *dev,
 	struct mlx5_core_dev *mdev = priv->mdev;
 	int err;
 
-	err = mlx5_eswitch_get_vport_config(mdev->priv.eswitch, vf + 1, ivi);
+	err = mlx5_eswitch_get_vport_config(mdev->priv.eswitch, vf + 1,
+					    ivi, NULL);
 	if (err)
 		return err;
 	ivi->linkstate = mlx5_vport_link2ifla(ivi->linkstate);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index cd9bb7c7b341..165b28349cc3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -1812,8 +1812,10 @@ static int register_devlink_port(struct mlx5_core_dev *dev,
 {
 	struct devlink *devlink = priv_to_devlink(dev);
 	struct mlx5_eswitch_rep *rep = rpriv->rep;
+	struct devlink_vdev *devlink_vdev = NULL;
 	struct netdev_phys_item_id ppid = {};
 	unsigned int dl_port_index = 0;
+	struct mlx5_vport *vport;
 
 	if (!is_devlink_port_supported(dev, rpriv))
 		return 0;
@@ -1831,6 +1833,8 @@ static int register_devlink_port(struct mlx5_core_dev *dev,
 					      &ppid.id[0], ppid.id_len,
 					      dev->pdev->devfn);
 		dl_port_index = rep->vport;
+		vport = mlx5_eswitch_get_vport(dev->priv.eswitch, rep->vport);
+		devlink_vdev = vport->devlink_vdev;
 	} else if (mlx5_eswitch_is_vf_vport(dev->priv.eswitch,
 					    rpriv->rep->vport)) {
 		devlink_port_attrs_pci_vf_set(&rpriv->dl_port,
@@ -1838,9 +1842,12 @@ static int register_devlink_port(struct mlx5_core_dev *dev,
 					      dev->pdev->devfn,
 					      rep->vport - 1);
 		dl_port_index = vport_to_devlink_port_index(dev, rep->vport);
+		vport = mlx5_eswitch_get_vport(dev->priv.eswitch, rep->vport);
+		devlink_vdev = vport->devlink_vdev;
 	}
 
-	return devlink_port_register(devlink, &rpriv->dl_port, dl_port_index);
+	return devlink_port_register_with_vdev(devlink, &rpriv->dl_port,
+					       dl_port_index, devlink_vdev);
 }
 
 static void unregister_devlink_port(struct mlx5_core_dev *dev,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index 7baade9e62b7..19c4020dc5df 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -2175,7 +2175,8 @@ void mlx5_eswitch_cleanup(struct mlx5_eswitch *esw)
 
 /* Vport Administration */
 int mlx5_eswitch_set_vport_mac(struct mlx5_eswitch *esw,
-			       u16 vport, u8 mac[ETH_ALEN])
+			       u16 vport, u8 mac[ETH_ALEN],
+			       struct netlink_ext_ack *extack)
 {
 	struct mlx5_vport *evport = mlx5_eswitch_get_vport(esw, vport);
 	u64 node_guid;
@@ -2188,25 +2189,30 @@ int mlx5_eswitch_set_vport_mac(struct mlx5_eswitch *esw,
 
 	mutex_lock(&esw->state_lock);
 
-	if (evport->info.spoofchk && !is_valid_ether_addr(mac))
+	if (evport->info.spoofchk && !is_valid_ether_addr(mac)) {
 		mlx5_core_warn(esw->dev,
 			       "Set invalid MAC while spoofchk is on, vport(%d)\n",
 			       vport);
+		NL_SET_ERR_MSG_MOD(extack, "Set invalid MAC while spoofchk is on");
+	}
 
 	err = mlx5_modify_nic_vport_mac_address(esw->dev, vport, mac);
 	if (err) {
 		mlx5_core_warn(esw->dev,
 			       "Failed to mlx5_modify_nic_vport_mac vport(%d) err=(%d)\n",
 			       vport, err);
+		NL_SET_ERR_MSG_MOD(extack, "Failed to modify NIC vport");
 		goto unlock;
 	}
 
 	node_guid_gen_from_mac(&node_guid, mac);
 	err = mlx5_modify_nic_vport_node_guid(esw->dev, vport, node_guid);
-	if (err)
+	if (err) {
 		mlx5_core_warn(esw->dev,
 			       "Failed to set vport %d node guid, err = %d. RDMA_CM will not function properly for this VF.\n",
 			       vport, err);
+		NL_SET_ERR_MSG_MOD(extack, "Failed to set node guid, RDMA_CM will not function properly for this vport");
+	}
 
 	ether_addr_copy(evport->info.mac, mac);
 	evport->info.node_guid = node_guid;
@@ -2249,12 +2255,15 @@ int mlx5_eswitch_set_vport_state(struct mlx5_eswitch *esw,
 }
 
 int mlx5_eswitch_get_vport_config(struct mlx5_eswitch *esw,
-				  u16 vport, struct ifla_vf_info *ivi)
+				  u16 vport, struct ifla_vf_info *ivi,
+				  struct netlink_ext_ack *extack)
 {
 	struct mlx5_vport *evport = mlx5_eswitch_get_vport(esw, vport);
 
-	if (IS_ERR(evport))
+	if (IS_ERR(evport)) {
+		NL_SET_ERR_MSG_MOD(extack, "Vport does not exist");
 		return PTR_ERR(evport);
+	}
 
 	memset(ivi, 0, sizeof(*ivi));
 	ivi->vf = vport - 1;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index 920d8f529fb9..b43fc9194f39 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -138,6 +138,7 @@ struct mlx5_vport {
 
 	bool                    enabled;
 	enum mlx5_eswitch_vport_event enabled_events;
+	struct devlink_vdev     *devlink_vdev;
 };
 
 enum offloads_fdb_flags {
@@ -277,7 +278,8 @@ void mlx5_eswitch_cleanup(struct mlx5_eswitch *esw);
 int mlx5_eswitch_enable(struct mlx5_eswitch *esw, int mode);
 void mlx5_eswitch_disable(struct mlx5_eswitch *esw, bool clear_vf);
 int mlx5_eswitch_set_vport_mac(struct mlx5_eswitch *esw,
-			       u16 vport, u8 mac[ETH_ALEN]);
+			       u16 vport, u8 mac[ETH_ALEN],
+			       struct netlink_ext_ack *extack);
 int mlx5_eswitch_set_vport_state(struct mlx5_eswitch *esw,
 				 u16 vport, int link_state);
 int mlx5_eswitch_set_vport_vlan(struct mlx5_eswitch *esw,
@@ -291,7 +293,8 @@ int mlx5_eswitch_set_vport_rate(struct mlx5_eswitch *esw, u16 vport,
 int mlx5_eswitch_set_vepa(struct mlx5_eswitch *esw, u8 setting);
 int mlx5_eswitch_get_vepa(struct mlx5_eswitch *esw, u8 *setting);
 int mlx5_eswitch_get_vport_config(struct mlx5_eswitch *esw,
-				  u16 vport, struct ifla_vf_info *ivi);
+				  u16 vport, struct ifla_vf_info *ivi,
+				  struct netlink_ext_ack *extack);
 int mlx5_eswitch_get_vport_stats(struct mlx5_eswitch *esw,
 				 u16 vport,
 				 struct ifla_vf_stats *vf_stats);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index bd9fd59d8233..9f1024048f8b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -42,6 +42,7 @@
 #include "fs_core.h"
 #include "lib/devcom.h"
 #include "lib/eq.h"
+#include "devlink.h"
 
 /* There are two match-all miss flows, one for unicast dst mac and
  * one for multicast.
@@ -2193,6 +2194,10 @@ int esw_offloads_enable(struct mlx5_eswitch *esw)
 	if (err)
 		goto err_vports;
 
+	err = mlx5_devlink_vdevs_create(esw);
+	if (err)
+		goto err_vdevs;
+
 	err = esw_offloads_load_all_reps(esw);
 	if (err)
 		goto err_reps;
@@ -2203,6 +2208,8 @@ int esw_offloads_enable(struct mlx5_eswitch *esw)
 	return 0;
 
 err_reps:
+	mlx5_devlink_vdevs_destroy(esw);
+err_vdevs:
 	mlx5_eswitch_disable_pf_vf_vports(esw);
 err_vports:
 	esw_set_passing_vport_metadata(esw, false);
@@ -2236,6 +2243,7 @@ void esw_offloads_disable(struct mlx5_eswitch *esw)
 {
 	esw_offloads_devcom_cleanup(esw);
 	esw_offloads_unload_all_reps(esw);
+	mlx5_devlink_vdevs_destroy(esw);
 	mlx5_eswitch_disable_pf_vf_vports(esw);
 	esw_set_passing_vport_metadata(esw, false);
 	esw_offloads_steering_cleanup(esw);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 0/9] devlink vdev
  2019-10-22 17:43 [PATCH net-next 0/9] devlink vdev Yuval Avnery
                   ` (8 preceding siblings ...)
  2019-10-22 17:43 ` [PATCH net-next 9/9] net/mlx5e: Add support for devlink vdev and vdev hw_addr set/show Yuval Avnery
@ 2019-10-23 15:48 ` Or Gerlitz
  2019-10-23 19:00 ` Jakub Kicinski
  10 siblings, 0 replies; 24+ messages in thread
From: Or Gerlitz @ 2019-10-23 15:48 UTC (permalink / raw)
  To: Yuval Avnery
  Cc: Linux Netdev List, Jiri Pirko, Saeed Mahameed, Leon Romanovsky,
	David Miller, Jakub Kicinski, shuah

On Tue, Oct 22, 2019 at 8:46 PM Yuval Avnery <yuvalav@mellanox.com> wrote:
> This patchset introduces devlink vdev.

> Currently, legacy tools do not provide a comprehensive solution that can
> be used in both SmartNic and non-SmartNic mode.
> Vdev represents a device that exists on the ASIC but is not necessarily
> visible to the kernel.

I assume you refer to a function (e.g Eth VF, NVME) or mediated device (e.g the
sub-functions work by mellanox we presented in netdev 0x13) residing on the
host but not visible to the smart nic, smart... would be good to
clarify that here.

FWIW whatever SW you have on the smart NIC, it will still not be able
to actually provision / PT / map this device to a guest/container before/after
this patch set, and a host SW agent is still needed, agree?

worth explaining what delta/progress this series gets us in this context
(of inability to provision from the smartnic ).

Or.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 0/9] devlink vdev
  2019-10-22 17:43 [PATCH net-next 0/9] devlink vdev Yuval Avnery
                   ` (9 preceding siblings ...)
  2019-10-23 15:48 ` [PATCH net-next 0/9] devlink vdev Or Gerlitz
@ 2019-10-23 19:00 ` Jakub Kicinski
  2019-10-23 19:25   ` Jiri Pirko
  10 siblings, 1 reply; 24+ messages in thread
From: Jakub Kicinski @ 2019-10-23 19:00 UTC (permalink / raw)
  To: Yuval Avnery; +Cc: netdev, jiri, saeedm, leon, davem, shuah

On Tue, 22 Oct 2019 20:43:01 +0300, Yuval Avnery wrote:
> This patchset introduces devlink vdev.
> 
> Currently, legacy tools do not provide a comprehensive solution that can
> be used in both SmartNic and non-SmartNic mode.
> Vdev represents a device that exists on the ASIC but is not necessarily
> visible to the kernel.
> 
> Using devlink ports is not suitable because:
> 
> 1. Those devices aren't necessarily network devices (such as NVMe devices)
>    and doesn’t have E-switch representation. Therefore, there is need for
>    more generic representation of PCI VF.
> 2. Some attributes are not necessarily pure port attributes
>    (number of MSIX vectors)
> 3. It creates a confusing devlink topology, with multiple port flavours
>    and indices.
> 
> Vdev will be created along with flavour and attributes.
> Some network vdevs may be linked with a devlink port.
> 
> This is also aimed to replace "ip link vf" commands as they are strongly
> linked to the PCI topology and allow access only to enabled VFs.
> Even though current patchset and example is limited to MAC address
> of the VF, this interface will allow to manage PF, VF, mdev in
> SmartNic and non SmartNic modes, in unified way for networking and
> non-networking devices via devlink instance.
> 
> Example:
> 
> A privileged user wants to configure a VF's hw_addr, before the VF is
> enabled.
> 
> $ devlink vdev set pci/0000:03:00.0/1 hw_addr 10:22:33:44:55:66
> 
> $ devlink vdev show pci/0000:03:00.0/1
> pci/0000:03:00.0/1: flavour pcivf pf 0 vf 0 port_index 1 hw_addr 10:22:33:44:55:66
> 
> $ devlink vdev show pci/0000:03:00.0/1 -jp
> {
>     "vdev": {
>         "pci/0000:03:00.0/1": {
>             "flavour": "pcivf",
>             "pf": 0,
>             "vf": 0,
>             "port_index": 1,
>             "hw_addr": "10:22:33:44:55:66"
>         }
>     }
> }

I don't trust this is a good design. 

We need some proper ontology and decisions what goes where. We have
half of port attributes duplicated here, and hw_addr which honestly
makes more sense in a port (since port is more of a networking
construct, why would ep storage have a hw_addr?). Then you say you're
going to dump more PCI stuff in here :(

"vdev" sounds entirely meaningless, and has a high chance of becoming 
a dumping ground for attributes.

I'm kind of sour about the debug interfaces that were added to devlink.
Seems like the health API superseded the region stuff to a certain
extent and the two don't interact with each other.

I'm slightly worried there is too much "learning by doing" going on
with these new devlink uABIs.

I'm sure you guys thought this all through in detail and have more
documentation and design docs internally. Could you provide more of
this information here? How things fit together? Bigger picture?

The 20 lines of cover letters and no Documentation/ is definitely not
going to cut it this time.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 0/9] devlink vdev
  2019-10-23 19:00 ` Jakub Kicinski
@ 2019-10-23 19:25   ` Jiri Pirko
  2019-10-23 22:14     ` Jakub Kicinski
  0 siblings, 1 reply; 24+ messages in thread
From: Jiri Pirko @ 2019-10-23 19:25 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Yuval Avnery, netdev, jiri, saeedm, leon, davem, shuah

Wed, Oct 23, 2019 at 09:00:46PM CEST, jakub.kicinski@netronome.com wrote:
>On Tue, 22 Oct 2019 20:43:01 +0300, Yuval Avnery wrote:
>> This patchset introduces devlink vdev.
>> 
>> Currently, legacy tools do not provide a comprehensive solution that can
>> be used in both SmartNic and non-SmartNic mode.
>> Vdev represents a device that exists on the ASIC but is not necessarily
>> visible to the kernel.
>> 
>> Using devlink ports is not suitable because:
>> 
>> 1. Those devices aren't necessarily network devices (such as NVMe devices)
>>    and doesn’t have E-switch representation. Therefore, there is need for
>>    more generic representation of PCI VF.
>> 2. Some attributes are not necessarily pure port attributes
>>    (number of MSIX vectors)
>> 3. It creates a confusing devlink topology, with multiple port flavours
>>    and indices.
>> 
>> Vdev will be created along with flavour and attributes.
>> Some network vdevs may be linked with a devlink port.
>> 
>> This is also aimed to replace "ip link vf" commands as they are strongly
>> linked to the PCI topology and allow access only to enabled VFs.
>> Even though current patchset and example is limited to MAC address
>> of the VF, this interface will allow to manage PF, VF, mdev in
>> SmartNic and non SmartNic modes, in unified way for networking and
>> non-networking devices via devlink instance.
>> 
>> Example:
>> 
>> A privileged user wants to configure a VF's hw_addr, before the VF is
>> enabled.
>> 
>> $ devlink vdev set pci/0000:03:00.0/1 hw_addr 10:22:33:44:55:66
>> 
>> $ devlink vdev show pci/0000:03:00.0/1
>> pci/0000:03:00.0/1: flavour pcivf pf 0 vf 0 port_index 1 hw_addr 10:22:33:44:55:66
>> 
>> $ devlink vdev show pci/0000:03:00.0/1 -jp
>> {
>>     "vdev": {
>>         "pci/0000:03:00.0/1": {
>>             "flavour": "pcivf",
>>             "pf": 0,
>>             "vf": 0,
>>             "port_index": 1,
>>             "hw_addr": "10:22:33:44:55:66"
>>         }
>>     }
>> }
>
>I don't trust this is a good design. 
>
>We need some proper ontology and decisions what goes where. We have
>half of port attributes duplicated here, and hw_addr which honestly
>makes more sense in a port (since port is more of a networking
>construct, why would ep storage have a hw_addr?). Then you say you're
>going to dump more PCI stuff in here :(

Well basically what this "vdev" is is the "port peer" we discussed
couple of months ago. It provides possibility for the user on bare metal
to cofigure things for the VF - for example.

Regarding hw_addr vs. port - it is not correct to make that a devlink
port attribute. It is not port's hw_addr, but the port's peer hw_addr.


>
>"vdev" sounds entirely meaningless, and has a high chance of becoming 
>a dumping ground for attributes.

Sure, it is a madeup name. If you have a better name, please share.
Basically it is something that represents VF/mdev - the other side of
devlink port. But in some cases, like NVMe, there is no associated
devlink port - that is why "devlink port peer" would not work here.


>
>I'm kind of sour about the debug interfaces that were added to devlink.
>Seems like the health API superseded the region stuff to a certain
>extent and the two don't interact with each other.

Yeah, I admit that the regions were probably step into wrong direction.
But only one driver (mlx4) implements and will be hopefully soon
coverted into health.


>
>I'm slightly worried there is too much "learning by doing" going on
>with these new devlink uABIs.

I'm worried about this too.


>
>I'm sure you guys thought this all through in detail and have more
>documentation and design docs internally. Could you provide more of
>this information here? How things fit together? Bigger picture?
>
>The 20 lines of cover letters and no Documentation/ is definitely not
>going to cut it this time.

You are right, Documentation/ file is definitelly a good idea here.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 0/9] devlink vdev
  2019-10-23 19:25   ` Jiri Pirko
@ 2019-10-23 22:14     ` Jakub Kicinski
  2019-10-24  0:11       ` Yuval Avnery
  0 siblings, 1 reply; 24+ messages in thread
From: Jakub Kicinski @ 2019-10-23 22:14 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: Yuval Avnery, netdev, jiri, saeedm, leon, davem, shuah

On Wed, 23 Oct 2019 21:25:12 +0200, Jiri Pirko wrote:
> Wed, Oct 23, 2019 at 09:00:46PM CEST, jakub.kicinski@netronome.com wrote:
> >On Tue, 22 Oct 2019 20:43:01 +0300, Yuval Avnery wrote:  
> >> This patchset introduces devlink vdev.
> >> 
> >> Currently, legacy tools do not provide a comprehensive solution that can
> >> be used in both SmartNic and non-SmartNic mode.
> >> Vdev represents a device that exists on the ASIC but is not necessarily
> >> visible to the kernel.
> >> 
> >> Using devlink ports is not suitable because:
> >> 
> >> 1. Those devices aren't necessarily network devices (such as NVMe devices)
> >>    and doesn’t have E-switch representation. Therefore, there is need for
> >>    more generic representation of PCI VF.
> >> 2. Some attributes are not necessarily pure port attributes
> >>    (number of MSIX vectors)
> >> 3. It creates a confusing devlink topology, with multiple port flavours
> >>    and indices.
> >> 
> >> Vdev will be created along with flavour and attributes.
> >> Some network vdevs may be linked with a devlink port.
> >> 
> >> This is also aimed to replace "ip link vf" commands as they are strongly
> >> linked to the PCI topology and allow access only to enabled VFs.
> >> Even though current patchset and example is limited to MAC address
> >> of the VF, this interface will allow to manage PF, VF, mdev in
> >> SmartNic and non SmartNic modes, in unified way for networking and
> >> non-networking devices via devlink instance.
> >> 
> >> Example:
> >> 
> >> A privileged user wants to configure a VF's hw_addr, before the VF is
> >> enabled.
> >> 
> >> $ devlink vdev set pci/0000:03:00.0/1 hw_addr 10:22:33:44:55:66
> >> 
> >> $ devlink vdev show pci/0000:03:00.0/1
> >> pci/0000:03:00.0/1: flavour pcivf pf 0 vf 0 port_index 1 hw_addr 10:22:33:44:55:66
> >> 
> >> $ devlink vdev show pci/0000:03:00.0/1 -jp
> >> {
> >>     "vdev": {
> >>         "pci/0000:03:00.0/1": {
> >>             "flavour": "pcivf",
> >>             "pf": 0,
> >>             "vf": 0,
> >>             "port_index": 1,
> >>             "hw_addr": "10:22:33:44:55:66"
> >>         }
> >>     }
> >> }  
> >
> >I don't trust this is a good design. 
> >
> >We need some proper ontology and decisions what goes where. We have
> >half of port attributes duplicated here, and hw_addr which honestly
> >makes more sense in a port (since port is more of a networking
> >construct, why would ep storage have a hw_addr?). Then you say you're
> >going to dump more PCI stuff in here :(  
> 
> Well basically what this "vdev" is is the "port peer" we discussed
> couple of months ago. It provides possibility for the user on bare metal
> to cofigure things for the VF - for example.
> 
> Regarding hw_addr vs. port - it is not correct to make that a devlink
> port attribute. It is not port's hw_addr, but the port's peer hw_addr.

Yeah, I remember us arguing with others that "the other side of the
wire" should not be a port.

> >"vdev" sounds entirely meaningless, and has a high chance of becoming 
> >a dumping ground for attributes.  
> 
> Sure, it is a madeup name. If you have a better name, please share.

IDK. I think I started the "peer" stuff, so it made sense to me.
Now it sounds like you'd like to kill a lot of problems with this 
one stone. For PCIe "vdev" is def wrong because some of the config 
will be for PF (which is not virtual). Also for PCIe the config has 
to be done with permanence in mind from day 1, PCI often requires
HW reset to reconfig.

> Basically it is something that represents VF/mdev - the other side of
> devlink port. But in some cases, like NVMe, there is no associated
> devlink port - that is why "devlink port peer" would not work here.

What are the NVMe parameters we'd configure here? Queues etc. or some
IDs? Presumably there will be a NVMe-specific way to configure things?
Something has to point the NVMe VF to a backend, right? 

(I haven't looked much into NVMe myself in case that's not obvious ;))

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 0/9] devlink vdev
  2019-10-23 22:14     ` Jakub Kicinski
@ 2019-10-24  0:11       ` Yuval Avnery
  2019-10-24  2:51         ` Jakub Kicinski
  0 siblings, 1 reply; 24+ messages in thread
From: Yuval Avnery @ 2019-10-24  0:11 UTC (permalink / raw)
  To: Jakub Kicinski, Jiri Pirko
  Cc: netdev, Jiri Pirko, Saeed Mahameed, leon, davem, shuah, Daniel Jurgens

On 2019-10-23 3:14 p.m., Jakub Kicinski wrote:
> On Wed, 23 Oct 2019 21:25:12 +0200, Jiri Pirko wrote:
>> Wed, Oct 23, 2019 at 09:00:46PM CEST, jakub.kicinski@netronome.com wrote:
>>> On Tue, 22 Oct 2019 20:43:01 +0300, Yuval Avnery wrote:
>>>> This patchset introduces devlink vdev.
>>>>
>>>> Currently, legacy tools do not provide a comprehensive solution that can
>>>> be used in both SmartNic and non-SmartNic mode.
>>>> Vdev represents a device that exists on the ASIC but is not necessarily
>>>> visible to the kernel.
>>>>
>>>> Using devlink ports is not suitable because:
>>>>
>>>> 1. Those devices aren't necessarily network devices (such as NVMe devices)
>>>>     and doesn’t have E-switch representation. Therefore, there is need for
>>>>     more generic representation of PCI VF.
>>>> 2. Some attributes are not necessarily pure port attributes
>>>>     (number of MSIX vectors)
>>>> 3. It creates a confusing devlink topology, with multiple port flavours
>>>>     and indices.
>>>>
>>>> Vdev will be created along with flavour and attributes.
>>>> Some network vdevs may be linked with a devlink port.
>>>>
>>>> This is also aimed to replace "ip link vf" commands as they are strongly
>>>> linked to the PCI topology and allow access only to enabled VFs.
>>>> Even though current patchset and example is limited to MAC address
>>>> of the VF, this interface will allow to manage PF, VF, mdev in
>>>> SmartNic and non SmartNic modes, in unified way for networking and
>>>> non-networking devices via devlink instance.
>>>>
>>>> Example:
>>>>
>>>> A privileged user wants to configure a VF's hw_addr, before the VF is
>>>> enabled.
>>>>
>>>> $ devlink vdev set pci/0000:03:00.0/1 hw_addr 10:22:33:44:55:66
>>>>
>>>> $ devlink vdev show pci/0000:03:00.0/1
>>>> pci/0000:03:00.0/1: flavour pcivf pf 0 vf 0 port_index 1 hw_addr 10:22:33:44:55:66
>>>>
>>>> $ devlink vdev show pci/0000:03:00.0/1 -jp
>>>> {
>>>>      "vdev": {
>>>>          "pci/0000:03:00.0/1": {
>>>>              "flavour": "pcivf",
>>>>              "pf": 0,
>>>>              "vf": 0,
>>>>              "port_index": 1,
>>>>              "hw_addr": "10:22:33:44:55:66"
>>>>          }
>>>>      }
>>>> }
>>> I don't trust this is a good design.
>>>
>>> We need some proper ontology and decisions what goes where. We have
>>> half of port attributes duplicated here, and hw_addr which honestly
>>> makes more sense in a port (since port is more of a networking
>>> construct, why would ep storage have a hw_addr?). Then you say you're
>>> going to dump more PCI stuff in here :(
>> Well basically what this "vdev" is is the "port peer" we discussed
>> couple of months ago. It provides possibility for the user on bare metal
>> to cofigure things for the VF - for example.
>>
>> Regarding hw_addr vs. port - it is not correct to make that a devlink
>> port attribute. It is not port's hw_addr, but the port's peer hw_addr.
> Yeah, I remember us arguing with others that "the other side of the
> wire" should not be a port.
>
>>> "vdev" sounds entirely meaningless, and has a high chance of becoming
>>> a dumping ground for attributes.
>> Sure, it is a madeup name. If you have a better name, please share.
> IDK. I think I started the "peer" stuff, so it made sense to me.
> Now it sounds like you'd like to kill a lot of problems with this
> one stone. For PCIe "vdev" is def wrong because some of the config
> will be for PF (which is not virtual). Also for PCIe the config has
> to be done with permanence in mind from day 1, PCI often requires
> HW reset to reconfig.
>
The PF is "virtual" from the SmartNic embedded CPU point of view.

Maybe gdev is better? (generic)

>> Basically it is something that represents VF/mdev - the other side of
>> devlink port. But in some cases, like NVMe, there is no associated
>> devlink port - that is why "devlink port peer" would not work here.
> What are the NVMe parameters we'd configure here? Queues etc. or some
> IDs? Presumably there will be a NVMe-specific way to configure things?
> Something has to point the NVMe VF to a backend, right?
>
> (I haven't looked much into NVMe myself in case that's not obvious ;))



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 0/9] devlink vdev
  2019-10-24  0:11       ` Yuval Avnery
@ 2019-10-24  2:51         ` Jakub Kicinski
  2019-10-25 14:58           ` Andy Gospodarek
  0 siblings, 1 reply; 24+ messages in thread
From: Jakub Kicinski @ 2019-10-24  2:51 UTC (permalink / raw)
  To: Yuval Avnery
  Cc: Jiri Pirko, netdev, Jiri Pirko, Saeed Mahameed, leon, davem,
	shuah, Daniel Jurgens, andrew.gospodarek, Michael Chan

On Thu, 24 Oct 2019 00:11:48 +0000, Yuval Avnery wrote:
> >>> We need some proper ontology and decisions what goes where. We have
> >>> half of port attributes duplicated here, and hw_addr which honestly
> >>> makes more sense in a port (since port is more of a networking
> >>> construct, why would ep storage have a hw_addr?). Then you say you're
> >>> going to dump more PCI stuff in here :(  
> >> Well basically what this "vdev" is is the "port peer" we discussed
> >> couple of months ago. It provides possibility for the user on bare metal
> >> to cofigure things for the VF - for example.
> >>
> >> Regarding hw_addr vs. port - it is not correct to make that a devlink
> >> port attribute. It is not port's hw_addr, but the port's peer hw_addr.  
> > Yeah, I remember us arguing with others that "the other side of the
> > wire" should not be a port.
> >  
> >>> "vdev" sounds entirely meaningless, and has a high chance of becoming
> >>> a dumping ground for attributes.  
> >> Sure, it is a madeup name. If you have a better name, please share.  
> > IDK. I think I started the "peer" stuff, so it made sense to me.
> > Now it sounds like you'd like to kill a lot of problems with this
> > one stone. For PCIe "vdev" is def wrong because some of the config
> > will be for PF (which is not virtual). Also for PCIe the config has
> > to be done with permanence in mind from day 1, PCI often requires
> > HW reset to reconfig.
>
> The PF is "virtual" from the SmartNic embedded CPU point of view.

We also want to configure PCIe on local host thru this in non-SmartNIC
case, having the virtual in the name would be confusing there.

> Maybe gdev is better? (generic)

Let's focus on the scope and semantics of the object we are modelling
first. Can we talk goals, requirements, user scenarios etc.?

IMHO the hw_addr use case is kind of weak, clouds usually do tunnelling
so nobody cares which MAC customer has assigned in the overlay.

CCing Andy and Michael from Broadcom for their perspective and
requirements.

> >> Basically it is something that represents VF/mdev - the other side of
> >> devlink port. But in some cases, like NVMe, there is no associated
> >> devlink port - that is why "devlink port peer" would not work here.  
> > What are the NVMe parameters we'd configure here? Queues etc. or some
> > IDs? Presumably there will be a NVMe-specific way to configure things?
> > Something has to point the NVMe VF to a backend, right?
> >
> > (I haven't looked much into NVMe myself in case that's not obvious ;))  

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 0/9] devlink vdev
  2019-10-24  2:51         ` Jakub Kicinski
@ 2019-10-25 14:58           ` Andy Gospodarek
  2019-10-28 19:04             ` Yuval Avnery
                               ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Andy Gospodarek @ 2019-10-25 14:58 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Yuval Avnery, Jiri Pirko, netdev, Jiri Pirko, Saeed Mahameed,
	leon, davem, shuah, Daniel Jurgens, andrew.gospodarek,
	Michael Chan

On Wed, Oct 23, 2019 at 07:51:41PM -0700, Jakub Kicinski wrote:
> On Thu, 24 Oct 2019 00:11:48 +0000, Yuval Avnery wrote:
> > >>> We need some proper ontology and decisions what goes where. We have
> > >>> half of port attributes duplicated here, and hw_addr which honestly
> > >>> makes more sense in a port (since port is more of a networking
> > >>> construct, why would ep storage have a hw_addr?). Then you say you're
> > >>> going to dump more PCI stuff in here :(  
> > >> Well basically what this "vdev" is is the "port peer" we discussed
> > >> couple of months ago. It provides possibility for the user on bare metal
> > >> to cofigure things for the VF - for example.
> > >>
> > >> Regarding hw_addr vs. port - it is not correct to make that a devlink
> > >> port attribute. It is not port's hw_addr, but the port's peer hw_addr.  
> > > Yeah, I remember us arguing with others that "the other side of the
> > > wire" should not be a port.
> > >  
> > >>> "vdev" sounds entirely meaningless, and has a high chance of becoming
> > >>> a dumping ground for attributes.  
> > >> Sure, it is a madeup name. If you have a better name, please share.  
> > > IDK. I think I started the "peer" stuff, so it made sense to me.
> > > Now it sounds like you'd like to kill a lot of problems with this
> > > one stone. For PCIe "vdev" is def wrong because some of the config
> > > will be for PF (which is not virtual). Also for PCIe the config has
> > > to be done with permanence in mind from day 1, PCI often requires
> > > HW reset to reconfig.
> >
> > The PF is "virtual" from the SmartNic embedded CPU point of view.
> 
> We also want to configure PCIe on local host thru this in non-SmartNIC
> case, having the virtual in the name would be confusing there.
> 
> > Maybe gdev is better? (generic)
> 
> Let's focus on the scope and semantics of the object we are modelling
> first. Can we talk goals, requirements, user scenarios etc.?
> 
> IMHO the hw_addr use case is kind of weak, clouds usually do tunnelling
> so nobody cares which MAC customer has assigned in the overlay.
> 
> CCing Andy and Michael from Broadcom for their perspective and
> requirements.

Thanks, Jakub, I'm happy to chime in based on our deployment experience.
We definitely understand the desire to be able to configure properties
of devices on the SmartNIC (the kind with general purpose cores not the
kind with only flow offload) from the server side.

In addition to addressing NVMe devices, I'd also like to be be able to
create virtual or real serial ports as well as there is an interest in
*sometimes* being able to gain direct access to the SmartNIC console not
just a shell via ssh.  So my point is that there are multiple use-cases.

Arm are also _extremely_ interested in developing a method to enable
some form of SmartNIC discovery method and while lots of ideas have been
thrown around, discovery via devlink is a reasonable option.  So while
doing all this will be much more work than simply handling this case
where we set the peer or local MAC for a vdev, I think it will be worth
it to make this more usable for all^W more types of devices.  I also
agree that not everything on the other side of the wire should be a
port. 

So if we agree that addressing this device as a PCIe device then it
feels like we would be better served to query device capabilities and
depending on what capabilities exist we would be able to configure
properties for those.  In an ideal world, I could query a device using
devlink ('devlink info'?) and it would show me different devices that
are available for configuration on the SmartNIC and would also give me a
way to address them.  So while I like the idea of being able to address
and set parameters as shown in patch 05 of this series, I would like to
see a bit more flexibility to define what type of device is available
and how it might be configured.

So if we took the devlink info command as an example (whether its the
proper place for this or not), it could look _like_ this:

$ devlink dev info pci/0000:03:00.0
pci/0000:03:00.0:
  driver foo
  serial_number 8675309
  versions:
[...]
  capabilities:
      storage 0
      console 1
      mdev 1024
      [something else] [limit]

(Additionally rather than putting this as part of 'info' the device
capabilities and limits could be part of the 'resource' section and
frankly may make more sense if this is part of that.)

and then those capabilities would be something that could be set using the
'vdev' or whatever-it-is-named interface:

# devlink vdev show pci/0000:03:00.0
pci/0000:03:00.0/console/0: speed 115200 device /dev/ttySNIC0
pci/0000:03:00.0/mdev/0: hw_addr 02:00:00:00:00:00
[...]
pci/0000:03:00.0/mdev/1023: hw_addr 02:00:00:00:03:ff

# devlink vdev set pci/0000:03:00.0/mdev/0 hw_addr 00:22:33:44:55:00

Since these Arm/RISC-V based SmartNICs are going to be used in a variety
of different ways and will have a variety of different personalities
(not just different SKUs that vendors will offer but different ways in
which these will be deployed), I think it's critical that we consider
more than just the mdev/representer case from the start.

> > >> Basically it is something that represents VF/mdev - the other side of
> > >> devlink port. But in some cases, like NVMe, there is no associated
> > >> devlink port - that is why "devlink port peer" would not work here.  
> > > What are the NVMe parameters we'd configure here? Queues etc. or some
> > > IDs? Presumably there will be a NVMe-specific way to configure things?
> > > Something has to point the NVMe VF to a backend, right?
> > >
> > > (I haven't looked much into NVMe myself in case that's not obvious ;))  

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 0/9] devlink vdev
  2019-10-25 14:58           ` Andy Gospodarek
@ 2019-10-28 19:04             ` Yuval Avnery
  2019-10-28 20:02             ` Parav Pandit
  2019-10-29 17:08             ` Jakub Kicinski
  2 siblings, 0 replies; 24+ messages in thread
From: Yuval Avnery @ 2019-10-28 19:04 UTC (permalink / raw)
  To: Andy Gospodarek, Jakub Kicinski
  Cc: Jiri Pirko, netdev, Jiri Pirko, Saeed Mahameed, leon, davem,
	shuah, Daniel Jurgens, Michael Chan

On 2019-10-25 7:58 a.m., Andy Gospodarek wrote:
> On Wed, Oct 23, 2019 at 07:51:41PM -0700, Jakub Kicinski wrote:
>> On Thu, 24 Oct 2019 00:11:48 +0000, Yuval Avnery wrote:
>>>>>> We need some proper ontology and decisions what goes where. We have
>>>>>> half of port attributes duplicated here, and hw_addr which honestly
>>>>>> makes more sense in a port (since port is more of a networking
>>>>>> construct, why would ep storage have a hw_addr?). Then you say you're
>>>>>> going to dump more PCI stuff in here :(
>>>>> Well basically what this "vdev" is is the "port peer" we discussed
>>>>> couple of months ago. It provides possibility for the user on bare metal
>>>>> to cofigure things for the VF - for example.
>>>>>
>>>>> Regarding hw_addr vs. port - it is not correct to make that a devlink
>>>>> port attribute. It is not port's hw_addr, but the port's peer hw_addr.
>>>> Yeah, I remember us arguing with others that "the other side of the
>>>> wire" should not be a port.
>>>>   
>>>>>> "vdev" sounds entirely meaningless, and has a high chance of becoming
>>>>>> a dumping ground for attributes.
>>>>> Sure, it is a madeup name. If you have a better name, please share.
>>>> IDK. I think I started the "peer" stuff, so it made sense to me.
>>>> Now it sounds like you'd like to kill a lot of problems with this
>>>> one stone. For PCIe "vdev" is def wrong because some of the config
>>>> will be for PF (which is not virtual). Also for PCIe the config has
>>>> to be done with permanence in mind from day 1, PCI often requires
>>>> HW reset to reconfig.
>>> The PF is "virtual" from the SmartNic embedded CPU point of view.
>> We also want to configure PCIe on local host thru this in non-SmartNIC
>> case, having the virtual in the name would be confusing there.
>>
>>> Maybe gdev is better? (generic)
>> Let's focus on the scope and semantics of the object we are modelling
>> first. Can we talk goals, requirements, user scenarios etc.?
>>
>> IMHO the hw_addr use case is kind of weak, clouds usually do tunnelling
>> so nobody cares which MAC customer has assigned in the overlay.
>>
>> CCing Andy and Michael from Broadcom for their perspective and
>> requirements.
> Thanks, Jakub, I'm happy to chime in based on our deployment experience.
> We definitely understand the desire to be able to configure properties
> of devices on the SmartNIC (the kind with general purpose cores not the
> kind with only flow offload) from the server side.
>
> In addition to addressing NVMe devices, I'd also like to be be able to
> create virtual or real serial ports as well as there is an interest in
> *sometimes* being able to gain direct access to the SmartNIC console not
> just a shell via ssh.  So my point is that there are multiple use-cases.
>
> Arm are also _extremely_ interested in developing a method to enable
> some form of SmartNIC discovery method and while lots of ideas have been
> thrown around, discovery via devlink is a reasonable option.  So while
> doing all this will be much more work than simply handling this case
> where we set the peer or local MAC for a vdev, I think it will be worth
> it to make this more usable for all^W more types of devices.  I also
> agree that not everything on the other side of the wire should be a
> port.
>
> So if we agree that addressing this device as a PCIe device then it
> feels like we would be better served to query device capabilities and
> depending on what capabilities exist we would be able to configure
> properties for those.  In an ideal world, I could query a device using
> devlink ('devlink info'?) and it would show me different devices that
> are available for configuration on the SmartNIC and would also give me a
> way to address them.  So while I like the idea of being able to address
> and set parameters as shown in patch 05 of this series, I would like to
> see a bit more flexibility to define what type of device is available
> and how it might be configured.
>
> So if we took the devlink info command as an example (whether its the
> proper place for this or not), it could look _like_ this:
>
> $ devlink dev info pci/0000:03:00.0
> pci/0000:03:00.0:
>    driver foo
>    serial_number 8675309
>    versions:
> [...]
>    capabilities:
>        storage 0
>        console 1
>        mdev 1024
>        [something else] [limit]
>
> (Additionally rather than putting this as part of 'info' the device
> capabilities and limits could be part of the 'resource' section and
> frankly may make more sense if this is part of that.)
>
> and then those capabilities would be something that could be set using the
> 'vdev' or whatever-it-is-named interface:
>
> # devlink vdev show pci/0000:03:00.0
> pci/0000:03:00.0/console/0: speed 115200 device /dev/ttySNIC0
> pci/0000:03:00.0/mdev/0: hw_addr 02:00:00:00:00:00
> [...]
> pci/0000:03:00.0/mdev/1023: hw_addr 02:00:00:00:03:ff
>
> # devlink vdev set pci/0000:03:00.0/mdev/0 hw_addr 00:22:33:44:55:00

I believe the flexibility you are looking for is already there.

The driver is free to put any attribute to classify the vdev.

I think the simple vdev index and attributes leaves more flexibility:

# devlink vdev show pci/0000:03:00.0
pci/0000:03:00.0/0 console 0 speed 115200 device /dev/ttySNIC0
pci/0000:03:00.0/1 mdev 0 hw_addr 02:00:00:00:00:00
[...]
pci/0000:03:00.0/1025 hw_addr 02:00:00:00:03:ff

So besides that point I think we are on the same page here?

> Since these Arm/RISC-V based SmartNICs are going to be used in a variety
> of different ways and will have a variety of different personalities
> (not just different SKUs that vendors will offer but different ways in
> which these will be deployed), I think it's critical that we consider
> more than just the mdev/representer case from the start.
>
>>>>> Basically it is something that represents VF/mdev - the other side of
>>>>> devlink port. But in some cases, like NVMe, there is no associated
>>>>> devlink port - that is why "devlink port peer" would not work here.
>>>> What are the NVMe parameters we'd configure here? Queues etc. or some
>>>> IDs? Presumably there will be a NVMe-specific way to configure things?
>>>> Something has to point the NVMe VF to a backend, right?
>>>>
>>>> (I haven't looked much into NVMe myself in case that's not obvious ;))



^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH net-next 0/9] devlink vdev
  2019-10-25 14:58           ` Andy Gospodarek
  2019-10-28 19:04             ` Yuval Avnery
@ 2019-10-28 20:02             ` Parav Pandit
  2019-10-29 17:08             ` Jakub Kicinski
  2 siblings, 0 replies; 24+ messages in thread
From: Parav Pandit @ 2019-10-28 20:02 UTC (permalink / raw)
  To: Andy Gospodarek, Jakub Kicinski
  Cc: Yuval Avnery, Jiri Pirko, netdev, Jiri Pirko, Saeed Mahameed,
	leon, davem, shuah, Daniel Jurgens, Michael Chan

Hi Andy, Jakub,

> -----Original Message-----
> From: netdev-owner@vger.kernel.org <netdev-owner@vger.kernel.org> On
> Behalf Of Andy Gospodarek
> Sent: Friday, October 25, 2019 9:58 AM
> To: Jakub Kicinski <jakub.kicinski@netronome.com>
> Cc: Yuval Avnery <yuvalav@mellanox.com>; Jiri Pirko <jiri@resnulli.us>;
> netdev@vger.kernel.org; Jiri Pirko <jiri@mellanox.com>; Saeed Mahameed
> <saeedm@mellanox.com>; leon@kernel.org; davem@davemloft.net;
> shuah@kernel.org; Daniel Jurgens <danielj@mellanox.com>;
> andrew.gospodarek@broadcom.com; Michael Chan
> <michael.chan@broadcom.com>
> Subject: Re: [PATCH net-next 0/9] devlink vdev
> 
> On Wed, Oct 23, 2019 at 07:51:41PM -0700, Jakub Kicinski wrote:
> > On Thu, 24 Oct 2019 00:11:48 +0000, Yuval Avnery wrote:
> > > >>> We need some proper ontology and decisions what goes where. We
> > > >>> have half of port attributes duplicated here, and hw_addr which
> > > >>> honestly makes more sense in a port (since port is more of a
> > > >>> networking construct, why would ep storage have a hw_addr?).
> > > >>> Then you say you're going to dump more PCI stuff in here :(
> > > >> Well basically what this "vdev" is is the "port peer" we
> > > >> discussed couple of months ago. It provides possibility for the
> > > >> user on bare metal to cofigure things for the VF - for example.
> > > >>
> > > >> Regarding hw_addr vs. port - it is not correct to make that a
> > > >> devlink port attribute. It is not port's hw_addr, but the port's peer
> hw_addr.
> > > > Yeah, I remember us arguing with others that "the other side of
> > > > the wire" should not be a port.
> > > >
> > > >>> "vdev" sounds entirely meaningless, and has a high chance of
> > > >>> becoming a dumping ground for attributes.
> > > >> Sure, it is a madeup name. If you have a better name, please share.
> > > > IDK. I think I started the "peer" stuff, so it made sense to me.
> > > > Now it sounds like you'd like to kill a lot of problems with this
> > > > one stone. For PCIe "vdev" is def wrong because some of the config
> > > > will be for PF (which is not virtual). Also for PCIe the config
> > > > has to be done with permanence in mind from day 1, PCI often
> > > > requires HW reset to reconfig.
> > >
> > > The PF is "virtual" from the SmartNic embedded CPU point of view.
> >
> > We also want to configure PCIe on local host thru this in non-SmartNIC
> > case, having the virtual in the name would be confusing there.
> >
> > > Maybe gdev is better? (generic)
> >
How about naming it 'subdev' -> as sub device?
Since these are the sub devices of one or different class which are getting managed.

> > Let's focus on the scope and semantics of the object we are modelling
> > first. Can we talk goals, requirements, user scenarios etc.?
> >
> > IMHO the hw_addr use case is kind of weak, clouds usually do
> > tunnelling so nobody cares which MAC customer has assigned in the overlay.
> >
> > CCing Andy and Michael from Broadcom for their perspective and
> > requirements.
> 
> Thanks, Jakub, I'm happy to chime in based on our deployment experience.
> We definitely understand the desire to be able to configure properties of
> devices on the SmartNIC (the kind with general purpose cores not the kind with
> only flow offload) from the server side.
> 
> In addition to addressing NVMe devices, I'd also like to be be able to create
> virtual or real serial ports as well as there is an interest in
> *sometimes* being able to gain direct access to the SmartNIC console not just
> a shell via ssh.  So my point is that there are multiple use-cases.
> 
Yes. we also see that use case/desire of accessing it sometimes.
We believe that current direction of vdev is good starting point with your example below.
Actually want to call it 'subdev' for rest of the discussion below and in updated v1 series.

s/vdev/subdev

> Arm are also _extremely_ interested in developing a method to enable some
> form of SmartNIC discovery method and while lots of ideas have been thrown
> around, discovery via devlink is a reasonable option.  
Great. 
> So while doing all this will
> be much more work than simply handling this case where we set the peer or
> local MAC for a vdev, I think it will be worth it to make this more usable for
> all^W more types of devices.  I also agree that not everything on the other side
> of the wire should be a port.
> 
Right. Having a generic object named 'subdev' of different flavours (similar to port flavours), is very useful and extendible
For subdev flavours as,
(a) PCI PF,
(b) PCI VF,
(b) mdev
(c) serial/console device

> So if we agree that addressing this device as a PCIe device then it feels like we
> would be better served to query device capabilities and depending on what
> capabilities exist we would be able to configure properties for those.  In an
> ideal world, I could query a device using devlink ('devlink info'?) and it would
> show me different devices that are available for configuration on the SmartNIC
> and would also give me a way to address them.  So while I like the idea of being
> able to address and set parameters as shown in patch 05 of this series, I would
> like to see a bit more flexibility to define what type of device is available and
> how it might be configured.
> 
> So if we took the devlink info command as an example (whether its the proper
> place for this or not), it could look _like_ this:
> 
> $ devlink dev info pci/0000:03:00.0
$ devlink subdev info pci/0000:03:00.0
This will show all subdevices of different class and based on their class will show attributes?

I also liked your idea of resources (more than capabilities).
Using 'resource' will give visibility/information about what resources exist.
Immediate one that becomes useful is total_msix_vectors of the device and then how much of this vectors(resource) to provision to a VF.
So each subdev will show much resource is being assigned to it.

Devlink already has notion of 'resource' as described in [1]. However it is bit complex to parse and use, though mlxsw I think uses it.

Jiri,
What is your opinion on 'devlink resource'?

[1] http://man7.org/linux/man-pages/man8/devlink-resource.8.html

> pci/0000:03:00.0:
>   driver foo
>   serial_number 8675309
>   versions:
> [...]
>   capabilities:
>       storage 0
>       console 1
>       mdev 1024
>       [something else] [limit]
> 
> (Additionally rather than putting this as part of 'info' the device capabilities and
> limits could be part of the 'resource' section and frankly may make more sense
> if this is part of that.)
> 
> and then those capabilities would be something that could be set using the
> 'vdev' or whatever-it-is-named interface:
> 
> # devlink vdev show pci/0000:03:00.0
> pci/0000:03:00.0/console/0: speed 115200 device /dev/ttySNIC0
> pci/0000:03:00.0/mdev/0: hw_addr 02:00:00:00:00:00 [...]
> pci/0000:03:00.0/mdev/1023: hw_addr 02:00:00:00:03:ff
> 
> # devlink vdev set pci/0000:03:00.0/mdev/0 hw_addr 00:22:33:44:55:00
> 

> Since these Arm/RISC-V based SmartNICs are going to be used in a variety of
> different ways and will have a variety of different personalities (not just
> different SKUs that vendors will offer but different ways in which these will be
> deployed), I think it's critical that we consider more than just the
> mdev/representer case from the start.
> 
I completely agree with you.
Yuval patches are showing VF/mdev as starting example, but are not limited to it.

> > > >> Basically it is something that represents VF/mdev - the other
> > > >> side of devlink port. But in some cases, like NVMe, there is no
> > > >> associated devlink port - that is why "devlink port peer" would not work
> here.
> > > > What are the NVMe parameters we'd configure here? Queues etc. or
> > > > some IDs? Presumably there will be a NVMe-specific way to configure
> things?
> > > > Something has to point the NVMe VF to a backend, right?
> > > >
> > > > (I haven't looked much into NVMe myself in case that's not obvious
> > > > ;))
It doesn't matter a given PCI VF class is nvme/network/gpu or other.
Since devlink framework handles the generic PCI device, having well defined PCI VF object and handling generic PCI VF properties in common way is desired.

So at minimum, I see following changes to be made to this series.

1. Update the cover letter for below items.
(a) Provide crisply, limitation of $ ip link set vf mac, and how is it overcome here
(b) describe future updates to use resources for device resource configuration (one example is irq vectors)
(c) rename vdev to 'subdev' or any suggestion for better name?
(d) explicitly mention the current purpose and use case for future extension

2. update patches from vdev to subdev.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 0/9] devlink vdev
  2019-10-25 14:58           ` Andy Gospodarek
  2019-10-28 19:04             ` Yuval Avnery
  2019-10-28 20:02             ` Parav Pandit
@ 2019-10-29 17:08             ` Jakub Kicinski
  2019-10-29 17:56               ` Andy Gospodarek
                                 ` (2 more replies)
  2 siblings, 3 replies; 24+ messages in thread
From: Jakub Kicinski @ 2019-10-29 17:08 UTC (permalink / raw)
  To: Andy Gospodarek
  Cc: Yuval Avnery, Jiri Pirko, netdev, Jiri Pirko, Saeed Mahameed,
	leon, davem, shuah, Daniel Jurgens, Michael Chan

On Fri, 25 Oct 2019 10:58:08 -0400, Andy Gospodarek wrote:
> Thanks, Jakub, I'm happy to chime in based on our deployment experience.
> We definitely understand the desire to be able to configure properties
> of devices on the SmartNIC (the kind with general purpose cores not the
> kind with only flow offload) from the server side.

Thanks!

> In addition to addressing NVMe devices, I'd also like to be be able to
> create virtual or real serial ports as well as there is an interest in
> *sometimes* being able to gain direct access to the SmartNIC console not
> just a shell via ssh.  So my point is that there are multiple use-cases.

Shelling into a NIC is the ultimate API backdoor. IMO we should try to
avoid that as much as possible.

> Arm are also _extremely_ interested in developing a method to enable
> some form of SmartNIC discovery method and while lots of ideas have been
> thrown around, discovery via devlink is a reasonable option.  So while
> doing all this will be much more work than simply handling this case
> where we set the peer or local MAC for a vdev, I think it will be worth
> it to make this more usable for all^W more types of devices.  I also
> agree that not everything on the other side of the wire should be a
> port. 
> 
> So if we agree that addressing this device as a PCIe device then it
> feels like we would be better served to query device capabilities and
> depending on what capabilities exist we would be able to configure
> properties for those.  In an ideal world, I could query a device using
> devlink ('devlink info'?) and it would show me different devices that
> are available for configuration on the SmartNIC and would also give me a
> way to address them.  So while I like the idea of being able to address
> and set parameters as shown in patch 05 of this series, I would like to
> see a bit more flexibility to define what type of device is available
> and how it might be configured.

We shall see how this develops. For now sounds pretty high level.
If the NIC needs to expose many "devices" that are independently
controlled we should probably look at re-using the standard device
model and not reinvent the wheel. 
If we need to configure particular aspects and resource allocation, 
we can add dedicated APIs as needed.

What I definitely want to avoid is adding a catch-all API with unclear
semantics which will become the SmartNIC dumping ground.

> So if we took the devlink info command as an example (whether its the
> proper place for this or not), it could look _like_ this:
> 
> $ devlink dev info pci/0000:03:00.0
> pci/0000:03:00.0:
>   driver foo
>   serial_number 8675309
>   versions:
> [...]
>   capabilities:
>       storage 0
>       console 1
>       mdev 1024
>       [something else] [limit]
> 
> (Additionally rather than putting this as part of 'info' the device
> capabilities and limits could be part of the 'resource' section and
> frankly may make more sense if this is part of that.)
> 
> and then those capabilities would be something that could be set using the
> 'vdev' or whatever-it-is-named interface:
> 
> # devlink vdev show pci/0000:03:00.0
> pci/0000:03:00.0/console/0: speed 115200 device /dev/ttySNIC0

The speed in this console example makes no sense to me.

The patches as they stand are about the peer side/other side of the
port. So which side of the serial device is the speed set on? One can
just read the speed from /dev/ttySNIC0. And link that serial device to
the appropriate parent via sysfs. This is pure wheel reinvention.

> pci/0000:03:00.0/mdev/0: hw_addr 02:00:00:00:00:00
> [...]
> pci/0000:03:00.0/mdev/1023: hw_addr 02:00:00:00:03:ff
> 
> # devlink vdev set pci/0000:03:00.0/mdev/0 hw_addr 00:22:33:44:55:00
> 
> Since these Arm/RISC-V based SmartNICs are going to be used in a variety
> of different ways and will have a variety of different personalities
> (not just different SKUs that vendors will offer but different ways in
> which these will be deployed), I think it's critical that we consider
> more than just the mdev/representer case from the start.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 0/9] devlink vdev
  2019-10-29 17:08             ` Jakub Kicinski
@ 2019-10-29 17:56               ` Andy Gospodarek
  2019-10-29 18:06               ` Yuval Avnery
  2019-10-30  2:30               ` Parav Pandit
  2 siblings, 0 replies; 24+ messages in thread
From: Andy Gospodarek @ 2019-10-29 17:56 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Andy Gospodarek, Yuval Avnery, Jiri Pirko, netdev, Jiri Pirko,
	Saeed Mahameed, leon, davem, shuah, Daniel Jurgens, Michael Chan

On Tue, Oct 29, 2019 at 10:08:10AM -0700, Jakub Kicinski wrote:
> On Fri, 25 Oct 2019 10:58:08 -0400, Andy Gospodarek wrote:
> > Thanks, Jakub, I'm happy to chime in based on our deployment experience.
> > We definitely understand the desire to be able to configure properties
> > of devices on the SmartNIC (the kind with general purpose cores not the
> > kind with only flow offload) from the server side.
> 
> Thanks!
> 
> > In addition to addressing NVMe devices, I'd also like to be be able to
> > create virtual or real serial ports as well as there is an interest in
> > *sometimes* being able to gain direct access to the SmartNIC console not
> > just a shell via ssh.  So my point is that there are multiple use-cases.
> 
> Shelling into a NIC is the ultimate API backdoor. IMO we should try to
> avoid that as much as possible.

In the case of a network controlled device (where the host has no real
knowledge that the NIC contains general purpose cores), we definitely do
not want to shell into the NIC, but there are requests to have this
available from many users.

Even if there is a way to get console with a cable, there is an interest
in not having to wire-up a another cable for every device in the
network.

> > Arm are also _extremely_ interested in developing a method to enable
> > some form of SmartNIC discovery method and while lots of ideas have been
> > thrown around, discovery via devlink is a reasonable option.  So while
> > doing all this will be much more work than simply handling this case
> > where we set the peer or local MAC for a vdev, I think it will be worth
> > it to make this more usable for all^W more types of devices.  I also
> > agree that not everything on the other side of the wire should be a
> > port. 
> > 
> > So if we agree that addressing this device as a PCIe device then it
> > feels like we would be better served to query device capabilities and
> > depending on what capabilities exist we would be able to configure
> > properties for those.  In an ideal world, I could query a device using
> > devlink ('devlink info'?) and it would show me different devices that
> > are available for configuration on the SmartNIC and would also give me a
> > way to address them.  So while I like the idea of being able to address
> > and set parameters as shown in patch 05 of this series, I would like to
> > see a bit more flexibility to define what type of device is available
> > and how it might be configured.
> 
> We shall see how this develops. For now sounds pretty high level.
> If the NIC needs to expose many "devices" that are independently
> controlled we should probably look at re-using the standard device
> model and not reinvent the wheel. 
> If we need to configure particular aspects and resource allocation, 
> we can add dedicated APIs as needed.
> 
> What I definitely want to avoid is adding a catch-all API with unclear
> semantics which will become the SmartNIC dumping ground.

Despite the fact that I proposed this, that is also my concern.  That
said I think there is a definite need to be able to understand the
hardware capabilities present on a PCI device.

> > So if we took the devlink info command as an example (whether its the
> > proper place for this or not), it could look _like_ this:
> > 
> > $ devlink dev info pci/0000:03:00.0
> > pci/0000:03:00.0:
> >   driver foo
> >   serial_number 8675309
> >   versions:
> > [...]
> >   capabilities:
> >       storage 0
> >       console 1
> >       mdev 1024
> >       [something else] [limit]
> > 
> > (Additionally rather than putting this as part of 'info' the device
> > capabilities and limits could be part of the 'resource' section and
> > frankly may make more sense if this is part of that.)
> > 
> > and then those capabilities would be something that could be set using the
> > 'vdev' or whatever-it-is-named interface:
> > 
> > # devlink vdev show pci/0000:03:00.0
> > pci/0000:03:00.0/console/0: speed 115200 device /dev/ttySNIC0
> 
> The speed in this console example makes no sense to me.
> 
> The patches as they stand are about the peer side/other side of the
> port. So which side of the serial device is the speed set on? One can
> just read the speed from /dev/ttySNIC0. And link that serial device to
> the appropriate parent via sysfs.

That's true and setting the speed is probably not that valuable, so we
do not need that option.  Proposal to set/read speed rescinded.  :)

>  This is pure wheel reinvention.
> 
> > pci/0000:03:00.0/mdev/0: hw_addr 02:00:00:00:00:00
> > [...]
> > pci/0000:03:00.0/mdev/1023: hw_addr 02:00:00:00:03:ff
> > 
> > # devlink vdev set pci/0000:03:00.0/mdev/0 hw_addr 00:22:33:44:55:00
> > 
> > Since these Arm/RISC-V based SmartNICs are going to be used in a variety
> > of different ways and will have a variety of different personalities
> > (not just different SKUs that vendors will offer but different ways in
> > which these will be deployed), I think it's critical that we consider
> > more than just the mdev/representer case from the start.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 0/9] devlink vdev
  2019-10-29 17:08             ` Jakub Kicinski
  2019-10-29 17:56               ` Andy Gospodarek
@ 2019-10-29 18:06               ` Yuval Avnery
  2019-10-29 18:32                 ` Jakub Kicinski
  2019-10-30  2:30               ` Parav Pandit
  2 siblings, 1 reply; 24+ messages in thread
From: Yuval Avnery @ 2019-10-29 18:06 UTC (permalink / raw)
  To: Jakub Kicinski, Andy Gospodarek
  Cc: Jiri Pirko, netdev, Jiri Pirko, Saeed Mahameed, leon, davem,
	shuah, Daniel Jurgens, Michael Chan

On 2019-10-29 10:08 a.m., Jakub Kicinski wrote:
> On Fri, 25 Oct 2019 10:58:08 -0400, Andy Gospodarek wrote:
>> Thanks, Jakub, I'm happy to chime in based on our deployment experience.
>> We definitely understand the desire to be able to configure properties
>> of devices on the SmartNIC (the kind with general purpose cores not the
>> kind with only flow offload) from the server side.
> Thanks!
>
>> In addition to addressing NVMe devices, I'd also like to be be able to
>> create virtual or real serial ports as well as there is an interest in
>> *sometimes* being able to gain direct access to the SmartNIC console not
>> just a shell via ssh.  So my point is that there are multiple use-cases.
> Shelling into a NIC is the ultimate API backdoor. IMO we should try to
> avoid that as much as possible.
>
>> Arm are also _extremely_ interested in developing a method to enable
>> some form of SmartNIC discovery method and while lots of ideas have been
>> thrown around, discovery via devlink is a reasonable option.  So while
>> doing all this will be much more work than simply handling this case
>> where we set the peer or local MAC for a vdev, I think it will be worth
>> it to make this more usable for all^W more types of devices.  I also
>> agree that not everything on the other side of the wire should be a
>> port.
>>
>> So if we agree that addressing this device as a PCIe device then it
>> feels like we would be better served to query device capabilities and
>> depending on what capabilities exist we would be able to configure
>> properties for those.  In an ideal world, I could query a device using
>> devlink ('devlink info'?) and it would show me different devices that
>> are available for configuration on the SmartNIC and would also give me a
>> way to address them.  So while I like the idea of being able to address
>> and set parameters as shown in patch 05 of this series, I would like to
>> see a bit more flexibility to define what type of device is available
>> and how it might be configured.
> We shall see how this develops. For now sounds pretty high level.
> If the NIC needs to expose many "devices" that are independently
> controlled we should probably look at re-using the standard device
> model and not reinvent the wheel.
> If we need to configure particular aspects and resource allocation,
> we can add dedicated APIs as needed.
>
> What I definitely want to avoid is adding a catch-all API with unclear
> semantics which will become the SmartNIC dumping ground.
>
>> So if we took the devlink info command as an example (whether its the
>> proper place for this or not), it could look _like_ this:
>>
>> $ devlink dev info pci/0000:03:00.0
>> pci/0000:03:00.0:
>>    driver foo
>>    serial_number 8675309
>>    versions:
>> [...]
>>    capabilities:
>>        storage 0
>>        console 1
>>        mdev 1024
>>        [something else] [limit]
>>
>> (Additionally rather than putting this as part of 'info' the device
>> capabilities and limits could be part of the 'resource' section and
>> frankly may make more sense if this is part of that.)
>>
>> and then those capabilities would be something that could be set using the
>> 'vdev' or whatever-it-is-named interface:
>>
>> # devlink vdev show pci/0000:03:00.0
>> pci/0000:03:00.0/console/0: speed 115200 device /dev/ttySNIC0
> The speed in this console example makes no sense to me.
>
> The patches as they stand are about the peer side/other side of the
> port. So which side of the serial device is the speed set on? One can
> just read the speed from /dev/ttySNIC0. And link that serial device to
> the appropriate parent via sysfs. This is pure wheel reinvention.
>
The patches are not only about the other side,

They are about all the devices which are under the control of a 
privileged user.

In the case of SmartNic, those devices (vdevs) are on the host side, and 
the privileged user runs on the embedded CPU.

The vdev devices don't necessarily have relationship with a port. All 
attributes are optional except flavour.

>> pci/0000:03:00.0/mdev/0: hw_addr 02:00:00:00:00:00
>> [...]
>> pci/0000:03:00.0/mdev/1023: hw_addr 02:00:00:00:03:ff
>>
>> # devlink vdev set pci/0000:03:00.0/mdev/0 hw_addr 00:22:33:44:55:00
>>
>> Since these Arm/RISC-V based SmartNICs are going to be used in a variety
>> of different ways and will have a variety of different personalities
>> (not just different SKUs that vendors will offer but different ways in
>> which these will be deployed), I think it's critical that we consider
>> more than just the mdev/representer case from the start.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next 0/9] devlink vdev
  2019-10-29 18:06               ` Yuval Avnery
@ 2019-10-29 18:32                 ` Jakub Kicinski
  0 siblings, 0 replies; 24+ messages in thread
From: Jakub Kicinski @ 2019-10-29 18:32 UTC (permalink / raw)
  To: Yuval Avnery
  Cc: Andy Gospodarek, Jiri Pirko, netdev, Jiri Pirko, Saeed Mahameed,
	leon, davem, shuah, Daniel Jurgens, Michael Chan

On Tue, 29 Oct 2019 18:06:28 +0000, Yuval Avnery wrote:
> On 2019-10-29 10:08 a.m., Jakub Kicinski wrote:
> > On Fri, 25 Oct 2019 10:58:08 -0400, Andy Gospodarek wrote:  
> >> # devlink vdev show pci/0000:03:00.0
> >> pci/0000:03:00.0/console/0: speed 115200 device /dev/ttySNIC0  
> > The speed in this console example makes no sense to me.
> >
> > The patches as they stand are about the peer side/other side of the
> > port. So which side of the serial device is the speed set on? One can
> > just read the speed from /dev/ttySNIC0. And link that serial device to
> > the appropriate parent via sysfs. This is pure wheel reinvention.
> >  
> The patches are not only about the other side,
> 
> They are about all the devices which are under the control of a 
> privileged user.
> 
> In the case of SmartNic, those devices (vdevs) are on the host side, and 
> the privileged user runs on the embedded CPU.
> 
> The vdev devices don't necessarily have relationship with a port. All 
> attributes are optional except flavour.

Okay, true, you can list the non-port PCIe functions. I should have
chosen my words more carefully.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH net-next 0/9] devlink vdev
  2019-10-29 17:08             ` Jakub Kicinski
  2019-10-29 17:56               ` Andy Gospodarek
  2019-10-29 18:06               ` Yuval Avnery
@ 2019-10-30  2:30               ` Parav Pandit
  2 siblings, 0 replies; 24+ messages in thread
From: Parav Pandit @ 2019-10-30  2:30 UTC (permalink / raw)
  To: Jakub Kicinski, Andy Gospodarek
  Cc: Yuval Avnery, Jiri Pirko, netdev, Jiri Pirko, Saeed Mahameed,
	leon, davem, shuah, Daniel Jurgens, Michael Chan



> -----Original Message-----
> From: netdev-owner@vger.kernel.org <netdev-owner@vger.kernel.org> On
> Behalf Of Jakub Kicinski
> Sent: Tuesday, October 29, 2019 12:08 PM
> To: Andy Gospodarek <andrew.gospodarek@broadcom.com>
> Cc: Yuval Avnery <yuvalav@mellanox.com>; Jiri Pirko <jiri@resnulli.us>;
> netdev@vger.kernel.org; Jiri Pirko <jiri@mellanox.com>; Saeed Mahameed
> <saeedm@mellanox.com>; leon@kernel.org; davem@davemloft.net;
> shuah@kernel.org; Daniel Jurgens <danielj@mellanox.com>; Michael Chan
> <michael.chan@broadcom.com>
> Subject: Re: [PATCH net-next 0/9] devlink vdev
> 
> On Fri, 25 Oct 2019 10:58:08 -0400, Andy Gospodarek wrote:
> > Thanks, Jakub, I'm happy to chime in based on our deployment experience.
> > We definitely understand the desire to be able to configure properties
> > of devices on the SmartNIC (the kind with general purpose cores not
> > the kind with only flow offload) from the server side.
> 
> Thanks!
> 
> > In addition to addressing NVMe devices, I'd also like to be be able to
> > create virtual or real serial ports as well as there is an interest in
> > *sometimes* being able to gain direct access to the SmartNIC console
> > not just a shell via ssh.  So my point is that there are multiple use-cases.
> 
> Shelling into a NIC is the ultimate API backdoor. IMO we should try to avoid
> that as much as possible.
> 
> > Arm are also _extremely_ interested in developing a method to enable
> > some form of SmartNIC discovery method and while lots of ideas have
> > been thrown around, discovery via devlink is a reasonable option.  So
> > while doing all this will be much more work than simply handling this
> > case where we set the peer or local MAC for a vdev, I think it will be
> > worth it to make this more usable for all^W more types of devices.  I
> > also agree that not everything on the other side of the wire should be
> > a port.
> >
> > So if we agree that addressing this device as a PCIe device then it
> > feels like we would be better served to query device capabilities and
> > depending on what capabilities exist we would be able to configure
> > properties for those.  In an ideal world, I could query a device using
> > devlink ('devlink info'?) and it would show me different devices that
> > are available for configuration on the SmartNIC and would also give me
> > a way to address them.  So while I like the idea of being able to
> > address and set parameters as shown in patch 05 of this series, I
> > would like to see a bit more flexibility to define what type of device
> > is available and how it might be configured.
> 
> We shall see how this develops. For now sounds pretty high level.
> If the NIC needs to expose many "devices" that are independently controlled
> we should probably look at re-using the standard device model and not
> reinvent the wheel.
> If we need to configure particular aspects and resource allocation, we can
> add dedicated APIs as needed.
> 
> What I definitely want to avoid is adding a catch-all API with unclear
> semantics which will become the SmartNIC dumping ground.
> 
What part is unclear in API? Can you be specific?
Subdev is not a dumping ground and so the devlink port is not a dumping ground either.
Having a more well defined object such as subdev with covers more than just port attribute (instead of devlink port) doesn't make it a dumping ground.

As I explained in previous email, subdev is intended to have attributes of the device (PF/VF/mdev).
Additionally as Andy described, resources will be linked to such subdev.

Keep in mind that this is anyway useful without smartnic usecase too immediately.

> > So if we took the devlink info command as an example (whether its the
> > proper place for this or not), it could look _like_ this:
> >
> > $ devlink dev info pci/0000:03:00.0
> > pci/0000:03:00.0:
> >   driver foo
> >   serial_number 8675309
> >   versions:
> > [...]
> >   capabilities:
> >       storage 0
> >       console 1
> >       mdev 1024
> >       [something else] [limit]
> >
> > (Additionally rather than putting this as part of 'info' the device
> > capabilities and limits could be part of the 'resource' section and
> > frankly may make more sense if this is part of that.)
> >
> > and then those capabilities would be something that could be set using
> > the 'vdev' or whatever-it-is-named interface:
> >
> > # devlink vdev show pci/0000:03:00.0
> > pci/0000:03:00.0/console/0: speed 115200 device /dev/ttySNIC0
> 
> The speed in this console example makes no sense to me.
> 
> The patches as they stand are about the peer side/other side of the port. So
> which side of the serial device is the speed set on? One can just read the
> speed from /dev/ttySNIC0. And link that serial device to the appropriate
> parent via sysfs. This is pure wheel reinvention.
> 
> > pci/0000:03:00.0/mdev/0: hw_addr 02:00:00:00:00:00 [...]
> > pci/0000:03:00.0/mdev/1023: hw_addr 02:00:00:00:03:ff
> >
> > # devlink vdev set pci/0000:03:00.0/mdev/0 hw_addr 00:22:33:44:55:00
> >
> > Since these Arm/RISC-V based SmartNICs are going to be used in a
> > variety of different ways and will have a variety of different
> > personalities (not just different SKUs that vendors will offer but
> > different ways in which these will be deployed), I think it's critical
> > that we consider more than just the mdev/representer case from the start.

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2019-10-30  2:30 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-22 17:43 [PATCH net-next 0/9] devlink vdev Yuval Avnery
2019-10-22 17:43 ` [PATCH net-next 1/9] devlink: Introduce vdev Yuval Avnery
2019-10-22 17:43 ` [PATCH net-next 2/9] devlink: Add PCI attributes support for vdev Yuval Avnery
2019-10-22 17:43 ` [PATCH net-next 3/9] devlink: Add port with vdev register support Yuval Avnery
2019-10-22 17:43 ` [PATCH net-next 4/9] devlink: Support vdev HW address get Yuval Avnery
2019-10-22 17:43 ` [PATCH net-next 5/9] devlink: Support vdev HW address set Yuval Avnery
2019-10-22 17:43 ` [PATCH net-next 6/9] netdevsim: Add max_vfs to bus_dev Yuval Avnery
2019-10-22 17:43 ` [PATCH net-next 7/9] netdevsim: Add devlink vdev creation Yuval Avnery
2019-10-22 17:43 ` [PATCH net-next 8/9] netdevsim: Add devlink vdev sefltest for netdevsim Yuval Avnery
2019-10-22 17:43 ` [PATCH net-next 9/9] net/mlx5e: Add support for devlink vdev and vdev hw_addr set/show Yuval Avnery
2019-10-23 15:48 ` [PATCH net-next 0/9] devlink vdev Or Gerlitz
2019-10-23 19:00 ` Jakub Kicinski
2019-10-23 19:25   ` Jiri Pirko
2019-10-23 22:14     ` Jakub Kicinski
2019-10-24  0:11       ` Yuval Avnery
2019-10-24  2:51         ` Jakub Kicinski
2019-10-25 14:58           ` Andy Gospodarek
2019-10-28 19:04             ` Yuval Avnery
2019-10-28 20:02             ` Parav Pandit
2019-10-29 17:08             ` Jakub Kicinski
2019-10-29 17:56               ` Andy Gospodarek
2019-10-29 18:06               ` Yuval Avnery
2019-10-29 18:32                 ` Jakub Kicinski
2019-10-30  2:30               ` Parav Pandit

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).