* [PATCH net-next 0/4] devlink: Introduce cpu_affinity command
From: Shay Drory @ 2022-02-22 10:58 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski
  Cc: jiri, saeedm, parav, netdev, linux-kernel, Shay Drory

Currently a user can only configure the IRQ CPU affinity of a device
via the global /proc/irq/../smp_affinity interface. However, this
interface changes the affinity globally, across all subsystems
connected to the device.

Historically, this API has been useful for single-function devices
since, generally speaking, the queue structure created on top of the
IRQ vectors is predictable enough for this control to be usable.

However, with complex multi-subsystem devices, like mlx5, the
assignment of queues at every layer of the software stack is complex,
and there are multiple queues, each with a different usage, on top of
the same IRQ. Hence, simply fiddling with the base IRQ is no longer
effective.

As an example, mlx5 SFs can share MSI-X IRQs between themselves, which
means that the user currently has no control over which CPU set a
given SF uses. Hence, an application and its IRQ can run on different
CPUs, which leads to lower performance, as shown in the table below.

application=netperf,    SF-IRQ     channel affinity   latency (usec)
                                                      (lower is better)
cpu=0 (numa=0)           cpu={0}   cpu={0}            14.417
cpu=8 (numa=0)           cpu={0}   cpu={0}            15.114 (+5%)
cpu=1 (numa=1)           cpu={0}   cpu={0}            17.784 (+30%)

This series is a start at resolving this problem by inverting the
control of the affinities. Instead of having the user go around behind
the driver and adjust the IRQs the driver already created, we want the
user to tell the software layer which CPUs to use and have the
software layer manage this. The suggested command then trickles down
to the PCI driver, which creates/shares MSI-X IRQs and resources to
achieve it. In the mlx5 SF example the involved software components
would be devlink, rdma, vdpa and netdev.

This series introduces a devlink control that assigns a CPU set to the
cross-subsystem mlx5_core PCI function device. This can be used on a
PF, VF or SF and restricts all the software layers above it to the
given CPU set.

For each specified CPU, the SF either uses an existing IRQ already
affiliated with that CPU or a new IRQ available from the device. For
example, if the user gives affinity 3 (binary 11, i.e. CPUs 0 and 1),
the SF creates the driver's internally required completion EQs,
attached to the IRQs of those specific CPUs.
If the SF is already fully probed, a devlink reload is required for
cpu_affinity to take effect.

The following devlink command sets the affinity of an mlx5 PF/VF/SF:
$ devlink dev param set auxiliary/mlx5_core.sf.4 name cpu_affinity value \
          [cpu_bitmask] cmode driverinit

Applications that want to restrict an SF or VF HW to a CPU set, for
instance container workloads, can use this API to easily achieve
that, as in the example below.
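
For example, to restrict an SF to CPUs 0 and 1 (the handle below is
illustrative), set the bitmask and reload:

$ devlink dev param set auxiliary/mlx5_core.sf.4 name cpu_affinity \
          value 3 cmode driverinit
$ devlink dev reload auxiliary/mlx5_core.sf.4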

Shay Drory (4):
  net netlink: Introduce NLA_BITFIELD type
  devlink: Add support for NLA_BITFIELD for devlink param
  devlink: Add new cpu_affinity generic device param
  net/mlx5: Support cpu_affinity devlink dev param

 .../networking/devlink/devlink-params.rst     |   5 +
 Documentation/networking/devlink/mlx5.rst     |   3 +
 .../net/ethernet/mellanox/mlx5/core/devlink.c | 123 +++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/devlink.h |   2 +
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  |  39 +++++
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |   2 +
 .../ethernet/mellanox/mlx5/core/mlx5_irq.h    |   5 +-
 .../net/ethernet/mellanox/mlx5/core/pci_irq.c |  85 +++++++++-
 include/net/devlink.h                         |  22 +++
 include/net/netlink.h                         |  30 ++++
 include/uapi/linux/netlink.h                  |  10 ++
 lib/nlattr.c                                  | 145 +++++++++++++++++-
 net/core/devlink.c                            | 143 +++++++++++++++--
 net/netlink/policy.c                          |   4 +
 14 files changed, 594 insertions(+), 24 deletions(-)

-- 
2.21.3


* [PATCH net-next 1/4] net netlink: Introduce NLA_BITFIELD type
From: Shay Drory @ 2022-02-22 10:58 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski
  Cc: jiri, saeedm, parav, netdev, linux-kernel, Shay Drory, Moshe Shemesh

Introduce a generic bitfield attribute whose content is sent to the
kernel by the user. With this netlink attribute type the user can
either set or unset bits of a bitmap in the kernel.

This attribute is an extension (a dynamic array) of NLA_BITFIELD32,
and has similar checks and policies.
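
As a rough usage sketch (the FOO_* names below are hypothetical and
not part of this patch), a netlink family could expose a 128-bit
bitmap as follows:

	/* Hypothetical example: attribute FOO_ATTR_MASK carries a
	 * 128-bit bitmap in which all bits are valid.
	 */
	static const u32 foo_valid_mask[4] = { ~0U, ~0U, ~0U, ~0U };

	static const struct nla_policy foo_policy[FOO_ATTR_MAX + 1] = {
		[FOO_ATTR_MASK] = NLA_POLICY_BITFIELD(foo_valid_mask, 128),
	};

	/* Dump a kernel bitmap to user space as an NLA_BITFIELD. */
	static int foo_put_mask(struct sk_buff *skb, unsigned long *bitmap)
	{
		struct nla_bitfield *bf = nla_bitfield_alloc(128);
		int err;

		if (!bf)
			return -ENOMEM;
		nla_bitfield_from_bitmap(bf, bitmap, 128);
		err = nla_put_bitfield(skb, FOO_ATTR_MASK, bf);
		nla_bitfield_free(bf);
		return err;
	}

On the receive path, a validated attribute is converted back into a
kernel bitmap with nla_bitfield_to_bitmap().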

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
---
 include/net/netlink.h        |  30 ++++++++
 include/uapi/linux/netlink.h |  10 +++
 lib/nlattr.c                 | 145 ++++++++++++++++++++++++++++++++++-
 net/netlink/policy.c         |   4 +
 4 files changed, 185 insertions(+), 4 deletions(-)

diff --git a/include/net/netlink.h b/include/net/netlink.h
index 7a2a9d3144ba..52a0bcccae36 100644
--- a/include/net/netlink.h
+++ b/include/net/netlink.h
@@ -180,6 +180,7 @@ enum {
 	NLA_S32,
 	NLA_S64,
 	NLA_BITFIELD32,
+	NLA_BITFIELD,
 	NLA_REJECT,
 	__NLA_TYPE_MAX,
 };
@@ -235,12 +236,16 @@ enum nla_policy_validation {
  *                         given type fits, using it verifies minimum length
  *                         just like "All other"
  *    NLA_BITFIELD32       Unused
+ *    NLA_BITFIELD         Maximum length of attribute payload
  *    NLA_REJECT           Unused
  *    All other            Minimum length of attribute payload
  *
  * Meaning of validation union:
  *    NLA_BITFIELD32       This is a 32-bit bitmap/bitselector attribute and
  *                         `bitfield32_valid' is the u32 value of valid flags
+ *    NLA_BITFIELD         This is a dynamic array of 32-bit bitmap/bitselector
+ *                         attributes and `arr_bitfield32_valid' is an array of
+ *                         u32 values of valid flags.
  *    NLA_REJECT           This attribute is always rejected and `reject_message'
  *                         may point to a string to report as the error instead
  *                         of the generic one in extended ACK.
@@ -318,6 +323,7 @@ struct nla_policy {
 	u16		len;
 	union {
 		const u32 bitfield32_valid;
+		const u32 *arr_bitfield32_valid;
 		const u32 mask;
 		const char *reject_message;
 		const struct nla_policy *nested_policy;
@@ -363,6 +369,8 @@ struct nla_policy {
 	_NLA_POLICY_NESTED_ARRAY(ARRAY_SIZE(policy) - 1, policy)
 #define NLA_POLICY_BITFIELD32(valid) \
 	{ .type = NLA_BITFIELD32, .bitfield32_valid = valid }
+#define NLA_POLICY_BITFIELD(valid, size) \
+	{ .type = NLA_BITFIELD, .arr_bitfield32_valid = valid, .len = size }
 
 #define __NLA_IS_UINT_TYPE(tp)						\
 	(tp == NLA_U8 || tp == NLA_U16 || tp == NLA_U32 || tp == NLA_U64)
@@ -1545,6 +1553,19 @@ static inline int nla_put_bitfield32(struct sk_buff *skb, int attrtype,
 	return nla_put(skb, attrtype, sizeof(tmp), &tmp);
 }
 
+/**
+ * nla_put_bitfield - Add a bitfield netlink attribute to a socket buffer
+ * @skb: socket buffer to add attribute to
+ * @attrtype: attribute type
+ * @bitfield: bitfield
+ */
+static inline int nla_put_bitfield(struct sk_buff *skb, int attrtype,
+				   const struct nla_bitfield *bitfield)
+{
+	return nla_put(skb, attrtype, bitfield->size * sizeof(struct nla_bitfield32)
+		       + sizeof(*bitfield), bitfield);
+}
+
 /**
  * nla_get_u32 - return payload of u32 attribute
  * @nla: u32 netlink attribute
@@ -1738,6 +1759,15 @@ static inline struct nla_bitfield32 nla_get_bitfield32(const struct nlattr *nla)
 	return tmp;
 }
 
+struct nla_bitfield *nla_bitfield_alloc(__u64 nbits);
+void nla_bitfield_free(struct nla_bitfield *bitfield);
+void nla_bitfield_to_bitmap(unsigned long *bitmap,
+			    struct nla_bitfield *bitfield);
+void nla_bitfield_from_bitmap(struct nla_bitfield *bitfield,
+			      unsigned long *bitmap, __u64 bitmap_nbits);
+bool nla_bitfield_len_is_valid(struct nla_bitfield *bitfield, size_t user_len);
+bool nla_bitfield_nbits_valid(struct nla_bitfield *bitfield, size_t nbits);
+
 /**
  * nla_memdup - duplicate attribute memory (kmemdup)
  * @src: netlink attribute to duplicate from
diff --git a/include/uapi/linux/netlink.h b/include/uapi/linux/netlink.h
index 4c0cde075c27..a11bb91e3386 100644
--- a/include/uapi/linux/netlink.h
+++ b/include/uapi/linux/netlink.h
@@ -252,6 +252,14 @@ struct nla_bitfield32 {
 	__u32 selector;
 };
 
+/* Generic bitmap attribute content sent to the kernel.
+ * The size is the number of elements in the array.
+ */
+struct nla_bitfield {
+	__u64 size;
+	struct nla_bitfield32 data[0];
+};
+
 /*
  * policy descriptions - it's specific to each family how this is used
  * Normally, it should be retrieved via a dump inside another attribute
@@ -283,6 +291,7 @@ struct nla_bitfield32 {
  *	entry has attributes again, the policy for those inner ones
  *	and the corresponding maxtype may be specified.
  * @NL_ATTR_TYPE_BITFIELD32: &struct nla_bitfield32 attribute
+ * @NL_ATTR_TYPE_BITFIELD: &struct nla_bitfield attribute
  */
 enum netlink_attribute_type {
 	NL_ATTR_TYPE_INVALID,
@@ -307,6 +316,7 @@ enum netlink_attribute_type {
 	NL_ATTR_TYPE_NESTED_ARRAY,
 
 	NL_ATTR_TYPE_BITFIELD32,
+	NL_ATTR_TYPE_BITFIELD,
 };
 
 /**
diff --git a/lib/nlattr.c b/lib/nlattr.c
index 86029ad5ead4..6d20bf38850b 100644
--- a/lib/nlattr.c
+++ b/lib/nlattr.c
@@ -58,11 +58,9 @@ static int __nla_validate_parse(const struct nlattr *head, int len, int maxtype,
 				struct netlink_ext_ack *extack,
 				struct nlattr **tb, unsigned int depth);
 
-static int validate_nla_bitfield32(const struct nlattr *nla,
-				   const u32 valid_flags_mask)
+static int validate_bitfield32(const struct nla_bitfield32 *bf,
+			       const u32 valid_flags_mask)
 {
-	const struct nla_bitfield32 *bf = nla_data(nla);
-
 	if (!valid_flags_mask)
 		return -EINVAL;
 
@@ -81,6 +79,33 @@ static int validate_nla_bitfield32(const struct nlattr *nla,
 	return 0;
 }
 
+static int validate_nla_bitfield32(const struct nlattr *nla,
+				   const u32 valid_flags_mask)
+{
+	const struct nla_bitfield32 *bf = nla_data(nla);
+
+	return validate_bitfield32(bf, valid_flags_mask);
+}
+
+static int validate_nla_bitfield(const struct nlattr *nla,
+				 const u32 *valid_flags_masks,
+				 const u16 nbits)
+{
+	struct nla_bitfield *bf = nla_data(nla);
+	int err;
+	int i;
+
+	if (!nla_bitfield_len_is_valid(bf, nla_len(nla)) ||
+	    !nla_bitfield_nbits_valid(bf, nbits))
+		return -EINVAL;
+	for (i = 0; i < bf->size; i++) {
+		err = validate_bitfield32(&bf->data[i], valid_flags_masks[i]);
+		if (err)
+			return err;
+	}
+	return 0;
+}
+
 static int nla_validate_array(const struct nlattr *head, int len, int maxtype,
 			      const struct nla_policy *policy,
 			      struct netlink_ext_ack *extack,
@@ -422,6 +447,12 @@ static int validate_nla(const struct nlattr *nla, int maxtype,
 			goto out_err;
 		break;
 
+	case NLA_BITFIELD:
+		err = validate_nla_bitfield(nla, pt->arr_bitfield32_valid, pt->len);
+		if (err)
+			goto out_err;
+		break;
+
 	case NLA_NUL_STRING:
 		if (pt->len)
 			minlen = min_t(int, attrlen, pt->len + 1);
@@ -839,6 +870,112 @@ int nla_strcmp(const struct nlattr *nla, const char *str)
 }
 EXPORT_SYMBOL(nla_strcmp);
 
+/**
+ * nla_bitfield_alloc - Alloc struct nla_bitfield
+ * @nbits: number of bits to accommodate
+ */
+struct nla_bitfield *nla_bitfield_alloc(__u64 nbits)
+{
+	struct nla_bitfield *bitfield;
+	size_t bitfield_size;
+	size_t bitfield_len;
+
+	bitfield_len = DIV_ROUND_UP(nbits, BITS_PER_TYPE(u32));
+	bitfield_size = bitfield_len * sizeof(struct nla_bitfield32) +
+		sizeof(*bitfield);
+	bitfield = kzalloc(bitfield_size, GFP_KERNEL);
+	if (bitfield)
+		bitfield->size = bitfield_len;
+	return bitfield;
+}
+EXPORT_SYMBOL(nla_bitfield_alloc);
+
+/**
+ * nla_bitfield_free - Free struct nla_bitfield
+ * @bitfield: the bitfield to free
+ */
+void nla_bitfield_free(struct nla_bitfield *bitfield)
+{
+	kfree(bitfield);
+}
+EXPORT_SYMBOL(nla_bitfield_free);
+
+/**
+ * nla_bitfield_to_bitmap - Convert bitfield to bitmap
+ * @bitmap: bitmap to copy to (dst)
+ * @bitfield: bitfield to be copied (src)
+ */
+void nla_bitfield_to_bitmap(unsigned long *bitmap,
+			    struct nla_bitfield *bitfield)
+{
+	int i, j;
+	u32 tmp;
+
+	for (i = 0; i < bitfield->size; i++) {
+		tmp = bitfield->data[i].value & bitfield->data[i].selector;
+		for (j = 0; j < BITS_PER_TYPE(u32); j++)
+			if (tmp & (1 << j))
+				set_bit(j + i * BITS_PER_TYPE(u32), bitmap);
+	}
+}
+EXPORT_SYMBOL(nla_bitfield_to_bitmap);
+
+/**
+ * nla_bitfield_from_bitmap - Convert bitmap to bitfield
+ * @bitfield: bitfield to copy to (dst)
+ * @bitmap: bitmap to be copied (src)
+ * @bitmap_nbits: len of bitmap
+ */
+void nla_bitfield_from_bitmap(struct nla_bitfield *bitfield,
+			      unsigned long *bitmap, __u64 bitmap_nbits)
+{
+	long size;
+	int i, j;
+
+	size = DIV_ROUND_UP(bitmap_nbits, BITS_PER_TYPE(u32));
+	for (i = 0; i < size; i++) {
+		for (j = 0; j < min_t(__u64, bitmap_nbits, BITS_PER_TYPE(u32)); j++)
+			if (test_bit(j + i * BITS_PER_TYPE(u32), bitmap))
+				bitfield->data[i].value |= 1 << j;
+		bitfield->data[i].selector = bitmap_nbits >= BITS_PER_TYPE(u32) ?
+			UINT_MAX : (1 << bitmap_nbits) - 1;
+		bitmap_nbits -= BITS_PER_TYPE(u32);
+	}
+}
+EXPORT_SYMBOL(nla_bitfield_from_bitmap);
+
+/**
+ * nla_bitfield_len_is_valid - validate the len of the bitfield
+ * @bitfield: bitfield to validate
+ * @user_len: len of the nla.
+ */
+bool nla_bitfield_len_is_valid(struct nla_bitfield *bitfield, size_t user_len)
+{
+	return !(user_len % sizeof(bitfield->data[0]) ||
+		 sizeof(bitfield->data[0]) * bitfield->size +
+		 sizeof(*bitfield) != user_len);
+}
+EXPORT_SYMBOL(nla_bitfield_len_is_valid);
+
+/**
+ * nla_bitfield_nbits_valid - validate the len of the bitfield vs a given nbits
+ * @bitfield: bitfield to validate
+ * @nbits: number of bits the user wants to use.
+ */
+bool nla_bitfield_nbits_valid(struct nla_bitfield *bitfield, size_t nbits)
+{
+	u32 *last_value = &bitfield->data[bitfield->size - 1].value;
+	u32 last_bit;
+
+	if (BITS_PER_TYPE(u32) * (bitfield->size - 1) > nbits)
+		return false;
+
+	nbits -= BITS_PER_TYPE(u32) * (bitfield->size - 1);
+	last_bit = find_last_bit((unsigned long *)last_value, BITS_PER_TYPE(u32));
+	return last_bit == BITS_PER_TYPE(u32) ? true : last_bit <= nbits - 1;
+}
+EXPORT_SYMBOL(nla_bitfield_nbits_valid);
+
 #ifdef CONFIG_NET
 /**
  * __nla_reserve - reserve room for attribute on the skb
diff --git a/net/netlink/policy.c b/net/netlink/policy.c
index 8d7c900e27f4..c9fffb3b8045 100644
--- a/net/netlink/policy.c
+++ b/net/netlink/policy.c
@@ -227,6 +227,7 @@ int netlink_policy_dump_attr_size_estimate(const struct nla_policy *pt)
 	case NLA_STRING:
 	case NLA_NUL_STRING:
 	case NLA_BINARY:
+	case NLA_BITFIELD:
 		/* maximum is common, u32 min-length/max-length */
 		return common + 2 * nla_attr_size(sizeof(u32));
 	case NLA_FLAG:
@@ -338,11 +339,14 @@ __netlink_policy_dump_write_attr(struct netlink_policy_dump_state *state,
 		break;
 	case NLA_STRING:
 	case NLA_NUL_STRING:
+	case NLA_BITFIELD:
 	case NLA_BINARY:
 		if (pt->type == NLA_STRING)
 			type = NL_ATTR_TYPE_STRING;
 		else if (pt->type == NLA_NUL_STRING)
 			type = NL_ATTR_TYPE_NUL_STRING;
+		else if (pt->type == NLA_BITFIELD)
+			type = NL_ATTR_TYPE_BITFIELD;
 		else
 			type = NL_ATTR_TYPE_BINARY;
 
-- 
2.21.3


* [PATCH net-next 2/4] devlink: Add support for NLA_BITFIELD for devlink param
From: Shay Drory @ 2022-02-22 10:58 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski
  Cc: jiri, saeedm, parav, netdev, linux-kernel, Shay Drory, Moshe Shemesh

Add devlink support for params of type NLA_BITFIELD.
Kernel users need to provide a bitmap to devlink.
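
As a rough sketch (the foo_* driver, param id and callback below are
hypothetical, not part of this patch), a kernel user declares a
bitfield param by setting .nbits and exchanges the value through
ctx->val.vbitmap:

	/* Hypothetical driver-specific param of the new bitfield type. */
	static int foo_mask_get(struct devlink *devlink, u32 id,
				struct devlink_param_gset_ctx *ctx)
	{
		/* For runtime gets, devlink pre-allocates ctx->val.vbitmap
		 * with room for param->nbits bits.
		 */
		bitmap_zero(ctx->val.vbitmap, 64);
		set_bit(0, ctx->val.vbitmap);
		return 0;
	}

	static const struct devlink_param foo_mask_param = {
		.id = FOO_DEVLINK_PARAM_ID_MASK,
		.name = "foo_mask",
		.type = DEVLINK_PARAM_TYPE_BITFIELD,
		.nbits = 64,
		.supported_cmodes = BIT(DEVLINK_PARAM_CMODE_RUNTIME),
		.get = foo_mask_get,
	};

For driverinit cmodes, devlink keeps the bitmap in
param_item->driverinit_value.vbitmap, allocated at registration time
based on .nbits.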

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
---
 include/net/devlink.h |  18 ++++++
 net/core/devlink.c    | 138 ++++++++++++++++++++++++++++++++++++------
 2 files changed, 139 insertions(+), 17 deletions(-)

diff --git a/include/net/devlink.h b/include/net/devlink.h
index 8d5349d2fb68..f411482f716d 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -372,6 +372,7 @@ enum devlink_param_type {
 	DEVLINK_PARAM_TYPE_U32,
 	DEVLINK_PARAM_TYPE_STRING,
 	DEVLINK_PARAM_TYPE_BOOL,
+	DEVLINK_PARAM_TYPE_BITFIELD,
 };
 
 union devlink_param_value {
@@ -380,6 +381,7 @@ union devlink_param_value {
 	u32 vu32;
 	char vstr[__DEVLINK_PARAM_MAX_STRING_VALUE];
 	bool vbool;
+	unsigned long *vbitmap;
 };
 
 struct devlink_param_gset_ctx {
@@ -412,6 +414,8 @@ struct devlink_flash_notify {
  * @generic: indicates if the parameter is generic or driver specific
  * @type: parameter type
  * @supported_cmodes: bitmap of supported configuration modes
+ * @nbits: number of bits this param needs to use, if this param is
+ *         of dynamic length.
  * @get: get parameter value, used for runtime and permanent
  *       configuration modes
  * @set: set parameter value, used for runtime and permanent
@@ -427,6 +431,7 @@ struct devlink_param {
 	bool generic;
 	enum devlink_param_type type;
 	unsigned long supported_cmodes;
+	u64 nbits;
 	int (*get)(struct devlink *devlink, u32 id,
 		   struct devlink_param_gset_ctx *ctx);
 	int (*set)(struct devlink *devlink, u32 id,
@@ -542,6 +547,19 @@ enum devlink_param_generic_id {
 	.validate = _validate,						\
 }
 
+#define DEVLINK_PARAM_DYNAMIC_GENERIC(_id, _cmodes, _get, _set, _validate, _nbits)\
+{									\
+	.id = DEVLINK_PARAM_GENERIC_ID_##_id,				\
+	.name = DEVLINK_PARAM_GENERIC_##_id##_NAME,			\
+	.type = DEVLINK_PARAM_GENERIC_##_id##_TYPE,			\
+	.generic = true,						\
+	.supported_cmodes = _cmodes,					\
+	.nbits = _nbits,						\
+	.get = _get,							\
+	.set = _set,							\
+	.validate = _validate,						\
+}
+
 /* Part number, identifier of board design */
 #define DEVLINK_INFO_VERSION_GENERIC_BOARD_ID	"board.id"
 /* Revision of board design */
diff --git a/net/core/devlink.c b/net/core/devlink.c
index fcd9f6d85cf1..3d7e27abc487 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -4568,6 +4568,8 @@ devlink_param_type_to_nla_type(enum devlink_param_type param_type)
 		return NLA_STRING;
 	case DEVLINK_PARAM_TYPE_BOOL:
 		return NLA_FLAG;
+	case DEVLINK_PARAM_TYPE_BITFIELD:
+		return NLA_BITFIELD;
 	default:
 		return -EINVAL;
 	}
@@ -4575,11 +4577,13 @@ devlink_param_type_to_nla_type(enum devlink_param_type param_type)
 
 static int
 devlink_nl_param_value_fill_one(struct sk_buff *msg,
-				enum devlink_param_type type,
+				const struct devlink_param *param,
 				enum devlink_param_cmode cmode,
 				union devlink_param_value val)
 {
 	struct nlattr *param_value_attr;
+	struct nla_bitfield *bitfield;
+	int err;
 
 	param_value_attr = nla_nest_start_noflag(msg,
 						 DEVLINK_ATTR_PARAM_VALUE);
@@ -4589,7 +4593,7 @@ devlink_nl_param_value_fill_one(struct sk_buff *msg,
 	if (nla_put_u8(msg, DEVLINK_ATTR_PARAM_VALUE_CMODE, cmode))
 		goto value_nest_cancel;
 
-	switch (type) {
+	switch (param->type) {
 	case DEVLINK_PARAM_TYPE_U8:
 		if (nla_put_u8(msg, DEVLINK_ATTR_PARAM_VALUE_DATA, val.vu8))
 			goto value_nest_cancel;
@@ -4612,6 +4616,17 @@ devlink_nl_param_value_fill_one(struct sk_buff *msg,
 		    nla_put_flag(msg, DEVLINK_ATTR_PARAM_VALUE_DATA))
 			goto value_nest_cancel;
 		break;
+	case DEVLINK_PARAM_TYPE_BITFIELD:
+		bitfield = nla_bitfield_alloc(param->nbits);
+		if (!bitfield)
+			return -ENOMEM;
+		nla_bitfield_from_bitmap(bitfield, val.vbitmap, param->nbits);
+		err = nla_put_bitfield(msg, DEVLINK_ATTR_PARAM_VALUE_DATA,
+				       bitfield);
+		nla_bitfield_free(bitfield);
+		if (err)
+			goto value_nest_cancel;
+		break;
 	}
 
 	nla_nest_end(msg, param_value_attr);
@@ -4623,6 +4638,24 @@ devlink_nl_param_value_fill_one(struct sk_buff *msg,
 	return -EMSGSIZE;
 }
 
+static int devlink_param_value_get(const struct devlink_param *param,
+				   union devlink_param_value *value)
+{
+	if (param->type == DEVLINK_PARAM_TYPE_BITFIELD) {
+		value->vbitmap = bitmap_zalloc(param->nbits, GFP_KERNEL);
+		if (!value->vbitmap)
+			return -ENOMEM;
+	}
+	return 0;
+}
+
+static void devlink_param_value_put(const struct devlink_param *param,
+				    union devlink_param_value *value)
+{
+	if (param->type == DEVLINK_PARAM_TYPE_BITFIELD)
+		bitmap_free(value->vbitmap);
+}
+
 static int devlink_nl_param_fill(struct sk_buff *msg, struct devlink *devlink,
 				 unsigned int port_index,
 				 struct devlink_param_item *param_item,
@@ -4645,14 +4678,22 @@ static int devlink_nl_param_fill(struct sk_buff *msg, struct devlink *devlink,
 		if (!devlink_param_cmode_is_supported(param, i))
 			continue;
 		if (i == DEVLINK_PARAM_CMODE_DRIVERINIT) {
-			if (!param_item->driverinit_value_valid)
-				return -EOPNOTSUPP;
+			if (!param_item->driverinit_value_valid) {
+				err = -EOPNOTSUPP;
+				goto param_value_put;
+			}
 			param_value[i] = param_item->driverinit_value;
 		} else {
 			ctx.cmode = i;
-			err = devlink_param_get(devlink, param, &ctx);
+			err = devlink_param_value_get(param, &param_value[i]);
 			if (err)
-				return err;
+				goto param_value_put;
+			ctx.val = param_value[i];
+			err = devlink_param_get(devlink, param, &ctx);
+			if (err) {
+				devlink_param_value_put(param, &param_value[i]);
+				goto param_value_put;
+			}
 			param_value[i] = ctx.val;
 		}
 		param_value_set[i] = true;
@@ -4660,7 +4701,7 @@ static int devlink_nl_param_fill(struct sk_buff *msg, struct devlink *devlink,
 
 	hdr = genlmsg_put(msg, portid, seq, &devlink_nl_family, flags, cmd);
 	if (!hdr)
-		return -EMSGSIZE;
+		goto genlmsg_put_err;
 
 	if (devlink_nl_put_handle(msg, devlink))
 		goto genlmsg_cancel;
@@ -4693,10 +4734,13 @@ static int devlink_nl_param_fill(struct sk_buff *msg, struct devlink *devlink,
 	for (i = 0; i <= DEVLINK_PARAM_CMODE_MAX; i++) {
 		if (!param_value_set[i])
 			continue;
-		err = devlink_nl_param_value_fill_one(msg, param->type,
+		err = devlink_nl_param_value_fill_one(msg, param,
 						      i, param_value[i]);
 		if (err)
 			goto values_list_nest_cancel;
+		if (i != DEVLINK_PARAM_CMODE_DRIVERINIT)
+			devlink_param_value_put(param, &param_value[i]);
+		param_value_set[i] = false;
 	}
 
 	nla_nest_end(msg, param_values_list);
@@ -4710,7 +4754,13 @@ static int devlink_nl_param_fill(struct sk_buff *msg, struct devlink *devlink,
 	nla_nest_cancel(msg, param_attr);
 genlmsg_cancel:
 	genlmsg_cancel(msg, hdr);
-	return -EMSGSIZE;
+genlmsg_put_err:
+	err = -EMSGSIZE;
+param_value_put:
+	for (i = 0; i <= DEVLINK_PARAM_CMODE_MAX; i++)
+		if (i != DEVLINK_PARAM_CMODE_DRIVERINIT && param_value_set[i])
+			devlink_param_value_put(param, &param_value[i]);
+	return err;
 }
 
 static void devlink_param_notify(struct devlink *devlink,
@@ -4815,6 +4865,9 @@ devlink_param_type_get_from_info(struct genl_info *info,
 	case NLA_FLAG:
 		*param_type = DEVLINK_PARAM_TYPE_BOOL;
 		break;
+	case NLA_BITFIELD:
+		*param_type = DEVLINK_PARAM_TYPE_BITFIELD;
+		break;
 	default:
 		return -EINVAL;
 	}
@@ -4827,6 +4880,7 @@ devlink_param_value_get_from_info(const struct devlink_param *param,
 				  struct genl_info *info,
 				  union devlink_param_value *value)
 {
+	struct nla_bitfield *bitfield;
 	struct nlattr *param_data;
 	int len;
 
@@ -4863,6 +4917,18 @@ devlink_param_value_get_from_info(const struct devlink_param *param,
 			return -EINVAL;
 		value->vbool = nla_get_flag(param_data);
 		break;
+	case DEVLINK_PARAM_TYPE_BITFIELD:
+		bitfield = nla_data(param_data);
+
+		if (!nla_bitfield_len_is_valid(bitfield, nla_len(param_data)) ||
+		    !nla_bitfield_nbits_valid(bitfield, param->nbits))
+			return -EINVAL;
+		value->vbitmap = bitmap_zalloc(param->nbits, GFP_KERNEL);
+		if (!value->vbitmap)
+			return -ENOMEM;
+
+		nla_bitfield_to_bitmap(value->vbitmap, bitfield);
+		break;
 	}
 	return 0;
 }
@@ -4936,33 +5002,48 @@ static int __devlink_nl_cmd_param_set_doit(struct devlink *devlink,
 	if (param->validate) {
 		err = param->validate(devlink, param->id, value, info->extack);
 		if (err)
-			return err;
+			goto out;
 	}
 
-	if (!info->attrs[DEVLINK_ATTR_PARAM_VALUE_CMODE])
-		return -EINVAL;
+	if (!info->attrs[DEVLINK_ATTR_PARAM_VALUE_CMODE]) {
+		err = -EINVAL;
+		goto out;
+	}
 	cmode = nla_get_u8(info->attrs[DEVLINK_ATTR_PARAM_VALUE_CMODE]);
-	if (!devlink_param_cmode_is_supported(param, cmode))
-		return -EOPNOTSUPP;
+	if (!devlink_param_cmode_is_supported(param, cmode)) {
+		err = -EOPNOTSUPP;
+		goto out;
+	}
 
 	if (cmode == DEVLINK_PARAM_CMODE_DRIVERINIT) {
 		if (param->type == DEVLINK_PARAM_TYPE_STRING)
 			strcpy(param_item->driverinit_value.vstr, value.vstr);
+		else if (param->type == DEVLINK_PARAM_TYPE_BITFIELD)
+			bitmap_copy(param_item->driverinit_value.vbitmap,
+				    value.vbitmap,
+				    param_item->param->nbits);
 		else
 			param_item->driverinit_value = value;
 		param_item->driverinit_value_valid = true;
 	} else {
-		if (!param->set)
-			return -EOPNOTSUPP;
+		if (!param->set) {
+			err = -EOPNOTSUPP;
+			goto out;
+		}
 		ctx.val = value;
 		ctx.cmode = cmode;
 		err = devlink_param_set(devlink, param, &ctx);
 		if (err)
-			return err;
+			goto out;
 	}
 
+	devlink_param_value_put(param, &value);
 	devlink_param_notify(devlink, port_index, param_item, cmd);
 	return 0;
+
+out:
+	devlink_param_value_put(param, &value);
+	return err;
 }
 
 static int devlink_nl_cmd_param_set_doit(struct sk_buff *skb,
@@ -10098,6 +10179,8 @@ static int devlink_param_verify(const struct devlink_param *param)
 {
 	if (!param || !param->name || !param->supported_cmodes)
 		return -EINVAL;
+	if (param->type == DEVLINK_PARAM_TYPE_BITFIELD && !param->nbits)
+		return -EINVAL;
 	if (param->generic)
 		return devlink_param_generic_verify(param);
 	else
@@ -10188,6 +10271,16 @@ int devlink_param_register(struct devlink *devlink,
 		return -ENOMEM;
 
 	param_item->param = param;
+	if (param_item->param->type == DEVLINK_PARAM_TYPE_BITFIELD &&
+	    devlink_param_cmode_is_supported(param_item->param,
+					     DEVLINK_PARAM_CMODE_DRIVERINIT)) {
+		param_item->driverinit_value.vbitmap =
+			bitmap_zalloc(param_item->param->nbits, GFP_KERNEL);
+		if (!param_item->driverinit_value.vbitmap) {
+			kfree(param_item);
+			return -ENOMEM;
+		}
+	}
 
 	list_add_tail(&param_item->list, &devlink->param_list);
 	return 0;
@@ -10210,6 +10303,10 @@ void devlink_param_unregister(struct devlink *devlink,
 		devlink_param_find_by_name(&devlink->param_list, param->name);
 	WARN_ON(!param_item);
 	list_del(&param_item->list);
+	if (param_item->param->type == DEVLINK_PARAM_TYPE_BITFIELD &&
+	    devlink_param_cmode_is_supported(param_item->param,
+					     DEVLINK_PARAM_CMODE_DRIVERINIT))
+		bitmap_free(param_item->driverinit_value.vbitmap);
 	kfree(param_item);
 }
 EXPORT_SYMBOL_GPL(devlink_param_unregister);
@@ -10244,6 +10341,10 @@ int devlink_param_driverinit_value_get(struct devlink *devlink, u32 param_id,
 
 	if (param_item->param->type == DEVLINK_PARAM_TYPE_STRING)
 		strcpy(init_val->vstr, param_item->driverinit_value.vstr);
+	else if (param_item->param->type == DEVLINK_PARAM_TYPE_BITFIELD)
+		bitmap_copy(init_val->vbitmap,
+			    param_item->driverinit_value.vbitmap,
+			    param_item->param->nbits);
 	else
 		*init_val = param_item->driverinit_value;
 
@@ -10280,6 +10381,9 @@ int devlink_param_driverinit_value_set(struct devlink *devlink, u32 param_id,
 
 	if (param_item->param->type == DEVLINK_PARAM_TYPE_STRING)
 		strcpy(param_item->driverinit_value.vstr, init_val.vstr);
+	else if (param_item->param->type == DEVLINK_PARAM_TYPE_BITFIELD)
+		bitmap_copy(param_item->driverinit_value.vbitmap,
+			    init_val.vbitmap, param_item->param->nbits);
 	else
 		param_item->driverinit_value = init_val;
 	param_item->driverinit_value_valid = true;
-- 
2.21.3


* [PATCH net-next 3/4] devlink: Add new cpu_affinity generic device param
From: Shay Drory @ 2022-02-22 10:58 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski
  Cc: jiri, saeedm, parav, netdev, linux-kernel, Shay Drory

Add a new generic device parameter to configure the CPU affinity of a
device.

A user who wishes to customize the affinity of a device can do so as
in the example below.

$ devlink dev param set auxiliary/mlx5_core.sf.4 name cpu_affinity \
              value [cpu_bitmask] cmode driverinit
$ devlink dev reload pci/0000:06:00.0

At this point the devlink instance will use the customized affinity.
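
On the driver side, a minimal registration sketch (the foo_*
callbacks are illustrative; patch 4 contains the real mlx5
implementation) looks like:

	static const struct devlink_param cpu_affinity_param =
		DEVLINK_PARAM_DYNAMIC_GENERIC(CPU_AFFINITY,
					      BIT(DEVLINK_PARAM_CMODE_RUNTIME) |
					      BIT(DEVLINK_PARAM_CMODE_DRIVERINIT),
					      foo_cpu_affinity_get,
					      foo_cpu_affinity_set,
					      foo_cpu_affinity_validate,
					      NR_CPUS);

The param is then registered with devlink_param_register() and the
driverinit value is read back with devlink_param_driverinit_value_get()
during driver load.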

Signed-off-by: Shay Drory <shayd@nvidia.com>
---
 Documentation/networking/devlink/devlink-params.rst | 5 +++++
 include/net/devlink.h                               | 4 ++++
 net/core/devlink.c                                  | 5 +++++
 3 files changed, 14 insertions(+)

diff --git a/Documentation/networking/devlink/devlink-params.rst b/Documentation/networking/devlink/devlink-params.rst
index 4e01dc32bc08..2f9f5baf4373 100644
--- a/Documentation/networking/devlink/devlink-params.rst
+++ b/Documentation/networking/devlink/devlink-params.rst
@@ -137,3 +137,8 @@ own name.
    * - ``event_eq_size``
      - u32
      - Control the size of asynchronous control events EQ.
+   * - ``cpu_affinity``
+     - Bitfield
+     - Control the CPU affinity of the device. The user can also change the
+       CPU affinity via the procfs interface (/proc/irq/\*/smp_affinity), which
+       will overwrite the devlink setting.
diff --git a/include/net/devlink.h b/include/net/devlink.h
index f411482f716d..595a4d54a2bd 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -466,6 +466,7 @@ enum devlink_param_generic_id {
 	DEVLINK_PARAM_GENERIC_ID_ENABLE_IWARP,
 	DEVLINK_PARAM_GENERIC_ID_IO_EQ_SIZE,
 	DEVLINK_PARAM_GENERIC_ID_EVENT_EQ_SIZE,
+	DEVLINK_PARAM_GENERIC_ID_CPU_AFFINITY,
 
 	/* add new param generic ids above here*/
 	__DEVLINK_PARAM_GENERIC_ID_MAX,
@@ -524,6 +525,9 @@ enum devlink_param_generic_id {
 #define DEVLINK_PARAM_GENERIC_EVENT_EQ_SIZE_NAME "event_eq_size"
 #define DEVLINK_PARAM_GENERIC_EVENT_EQ_SIZE_TYPE DEVLINK_PARAM_TYPE_U32
 
+#define DEVLINK_PARAM_GENERIC_CPU_AFFINITY_NAME "cpu_affinity"
+#define DEVLINK_PARAM_GENERIC_CPU_AFFINITY_TYPE DEVLINK_PARAM_TYPE_BITFIELD
+
 #define DEVLINK_PARAM_GENERIC(_id, _cmodes, _get, _set, _validate)	\
 {									\
 	.id = DEVLINK_PARAM_GENERIC_ID_##_id,				\
diff --git a/net/core/devlink.c b/net/core/devlink.c
index 3d7e27abc487..d2dfd9a88eb1 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -4477,6 +4477,11 @@ static const struct devlink_param devlink_param_generic[] = {
 		.name = DEVLINK_PARAM_GENERIC_EVENT_EQ_SIZE_NAME,
 		.type = DEVLINK_PARAM_GENERIC_EVENT_EQ_SIZE_TYPE,
 	},
+	{
+		.id = DEVLINK_PARAM_GENERIC_ID_CPU_AFFINITY,
+		.name = DEVLINK_PARAM_GENERIC_CPU_AFFINITY_NAME,
+		.type = DEVLINK_PARAM_GENERIC_CPU_AFFINITY_TYPE,
+	},
 };
 
 static int devlink_param_generic_verify(const struct devlink_param *param)
-- 
2.21.3


* [PATCH net-next 4/4] net/mlx5: Support cpu_affinity devlink dev param
From: Shay Drory @ 2022-02-22 10:58 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski
  Cc: jiri, saeedm, parav, netdev, linux-kernel, Shay Drory, Moshe Shemesh

Enable users to control the CPU affinity of a PCI function.
The default value is the affinity assigned by the kernel for the
PCI function.
The runtime value shows the current affinity; the driverinit value
is used to set a new affinity on the next driver reload.
Setting an empty affinity means the kernel's default policy is used.

Example:
- Show the current affinity.
    $ devlink dev param show auxiliary/mlx5_core.sf.4 name cpu_affinity
       name cpu_affinity type driver-specific
        values:
          cmode runtime value ff
          cmode driverinit value 0

- Set affinity to 3 (cpu 0 and cpu 1).
    $ devlink dev param set auxiliary/mlx5_core.sf.4 name cpu_affinity \
      value 3 cmode driverinit

Then run the devlink reload command to apply the new value.
$ devlink dev reload auxiliary/mlx5_core.sf.4

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
---
 Documentation/networking/devlink/mlx5.rst     |   3 +
 .../net/ethernet/mellanox/mlx5/core/devlink.c | 123 ++++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/devlink.h |   2 +
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  |  39 ++++++
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |   2 +
 .../ethernet/mellanox/mlx5/core/mlx5_irq.h    |   5 +-
 .../net/ethernet/mellanox/mlx5/core/pci_irq.c |  85 +++++++++++-
 7 files changed, 256 insertions(+), 3 deletions(-)

diff --git a/Documentation/networking/devlink/mlx5.rst b/Documentation/networking/devlink/mlx5.rst
index 29ad304e6fba..a213e93e495b 100644
--- a/Documentation/networking/devlink/mlx5.rst
+++ b/Documentation/networking/devlink/mlx5.rst
@@ -27,6 +27,9 @@ Parameters
    * - ``max_macs``
      - driverinit
      - The range is between 1 and 2^31. Only power of 2 values are supported.
+   * - ``cpu_affinity``
+     - driverinit | runtime
+     - An empty affinity (0) means the kernel assigns the affinity.
 
 The ``mlx5`` driver also implements the following driver-specific
 parameters.
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
index d1093bb2d436..9e33e8f7fed0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
@@ -10,6 +10,7 @@
 #include "esw/qos.h"
 #include "sf/dev/dev.h"
 #include "sf/sf.h"
+#include "mlx5_irq.h"
 
 static int mlx5_devlink_flash_update(struct devlink *devlink,
 				     struct devlink_flash_update_params *params,
@@ -833,6 +834,121 @@ mlx5_devlink_max_uc_list_param_unregister(struct devlink *devlink)
 	devlink_param_unregister(devlink, &max_uc_list_param);
 }
 
+static int mlx5_devlink_cpu_affinity_validate(struct devlink *devlink, u32 id,
+					      union devlink_param_value val,
+					      struct netlink_ext_ack *extack)
+{
+	struct mlx5_core_dev *dev = devlink_priv(devlink);
+	cpumask_var_t tmp;
+	int max_eqs;
+	int ret = 0;
+	int last;
+
+	/* Check whether the mask is valid cpu mask */
+	last = find_last_bit(val.vbitmap, MLX5_CPU_AFFINITY_MAX_LEN);
+	if (last == MLX5_CPU_AFFINITY_MAX_LEN)
+		/* Affinity is empty, will use default policy*/
+		return 0;
+	if (last >= num_present_cpus()) {
+		NL_SET_ERR_MSG_MOD(extack, "Some CPUs aren't present");
+		return -ERANGE;
+	}
+
+	if (!zalloc_cpumask_var(&tmp, GFP_KERNEL))
+		return -ENOMEM;
+
+	bitmap_copy(cpumask_bits(tmp), val.vbitmap, nr_cpu_ids);
+	if (!cpumask_subset(tmp, cpu_online_mask)) {
+		NL_SET_ERR_MSG_MOD(extack, "Some CPUs aren't online");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* Check whether the PF/VF/SFs have enough IRQs. SF will
+	 * perform IRQ->CPU check during load time.
+	 */
+	if (mlx5_core_is_sf(dev))
+		max_eqs = min_t(int, MLX5_COMP_EQS_PER_SF,
+				mlx5_irq_table_get_sfs_vec(mlx5_irq_table_get(dev)));
+	else
+		max_eqs = mlx5_irq_table_get_num_comp(mlx5_irq_table_get(dev));
+	if (cpumask_weight(tmp) > max_eqs) {
+		NL_SET_ERR_MSG_MOD(extack, "PCI Function doesn't have enough IRQs");
+		ret = -EINVAL;
+	}
+
+out:
+	free_cpumask_var(tmp);
+	return ret;
+}
+
+static int mlx5_devlink_cpu_affinity_set(struct devlink *devlink, u32 id,
+					 struct devlink_param_gset_ctx *ctx)
+{
+	/* Runtime set of cpu_affinity is not supported */
+	return -EOPNOTSUPP;
+}
+
+static int mlx5_devlink_cpu_affinity_get(struct devlink *devlink, u32 id,
+					 struct devlink_param_gset_ctx *ctx)
+{
+	struct mlx5_core_dev *dev = devlink_priv(devlink);
+	cpumask_var_t dev_mask;
+
+	if (!zalloc_cpumask_var(&dev_mask, GFP_KERNEL))
+		return -ENOMEM;
+	mlx5_core_affinity_get(dev, dev_mask);
+	bitmap_copy(ctx->val.vbitmap, cpumask_bits(dev_mask), nr_cpu_ids);
+	free_cpumask_var(dev_mask);
+	return 0;
+}
+
+static const struct devlink_param cpu_affinity_param =
+	DEVLINK_PARAM_DYNAMIC_GENERIC(CPU_AFFINITY, BIT(DEVLINK_PARAM_CMODE_RUNTIME) |
+				      BIT(DEVLINK_PARAM_CMODE_DRIVERINIT),
+				      mlx5_devlink_cpu_affinity_get,
+				      mlx5_devlink_cpu_affinity_set,
+				      mlx5_devlink_cpu_affinity_validate,
+				      MLX5_CPU_AFFINITY_MAX_LEN);
+
+static int mlx5_devlink_cpu_affinity_param_register(struct devlink *devlink)
+{
+	struct mlx5_core_dev *dev = devlink_priv(devlink);
+	union devlink_param_value value;
+	cpumask_var_t dev_mask;
+	int ret = 0;
+
+	if (mlx5_core_is_sf(dev) &&
+	    !mlx5_irq_table_have_dedicated_sfs_irqs(mlx5_irq_table_get(dev)))
+		return 0;
+
+	if (!zalloc_cpumask_var(&dev_mask, GFP_KERNEL))
+		return -ENOMEM;
+
+	ret = devlink_param_register(devlink, &cpu_affinity_param);
+	if (ret)
+		goto out;
+
+	value.vbitmap = cpumask_bits(dev_mask);
+	devlink_param_driverinit_value_set(devlink,
+					   DEVLINK_PARAM_GENERIC_ID_CPU_AFFINITY,
+					   value);
+out:
+	free_cpumask_var(dev_mask);
+	return ret;
+}
+
+static void mlx5_devlink_cpu_affinity_param_unregister(struct devlink *devlink)
+{
+	struct mlx5_core_dev *dev = devlink_priv(devlink);
+
+	if (mlx5_core_is_sf(dev) &&
+	    !mlx5_irq_table_have_dedicated_sfs_irqs(mlx5_irq_table_get(dev)))
+		return;
+
+	devlink_param_unregister(devlink, &cpu_affinity_param);
+}
+
 #define MLX5_TRAP_DROP(_id, _group_id)					\
 	DEVLINK_TRAP_GENERIC(DROP, DROP, _id,				\
 			     DEVLINK_TRAP_GROUP_GENERIC_ID_##_group_id, \
@@ -896,6 +1012,10 @@ int mlx5_devlink_register(struct devlink *devlink)
 	if (err)
 		goto max_uc_list_err;
 
+	err = mlx5_devlink_cpu_affinity_param_register(devlink);
+	if (err)
+		goto cpu_affinity_err;
+
 	err = mlx5_devlink_traps_register(devlink);
 	if (err)
 		goto traps_reg_err;
@@ -906,6 +1026,8 @@ int mlx5_devlink_register(struct devlink *devlink)
 	return 0;
 
 traps_reg_err:
+	mlx5_devlink_cpu_affinity_param_unregister(devlink);
+cpu_affinity_err:
 	mlx5_devlink_max_uc_list_param_unregister(devlink);
 max_uc_list_err:
 	mlx5_devlink_auxdev_params_unregister(devlink);
@@ -918,6 +1040,7 @@ int mlx5_devlink_register(struct devlink *devlink)
 void mlx5_devlink_unregister(struct devlink *devlink)
 {
 	mlx5_devlink_traps_unregister(devlink);
+	mlx5_devlink_cpu_affinity_param_unregister(devlink);
 	mlx5_devlink_max_uc_list_param_unregister(devlink);
 	mlx5_devlink_auxdev_params_unregister(devlink);
 	devlink_params_unregister(devlink, mlx5_devlink_params,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/devlink.h b/drivers/net/ethernet/mellanox/mlx5/core/devlink.h
index 30bf4882779b..891d4df419fe 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/devlink.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/devlink.h
@@ -6,6 +6,8 @@
 
 #include <net/devlink.h>
 
+#define MLX5_CPU_AFFINITY_MAX_LEN (NR_CPUS)
+
 enum mlx5_devlink_param_id {
 	MLX5_DEVLINK_PARAM_ID_BASE = DEVLINK_PARAM_GENERIC_ID_MAX,
 	MLX5_DEVLINK_PARAM_ID_FLOW_STEERING_MODE,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 48a45aa54a3c..9572c9f85f70 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -794,6 +794,30 @@ void mlx5_eq_update_ci(struct mlx5_eq *eq, u32 cc, bool arm)
 }
 EXPORT_SYMBOL(mlx5_eq_update_ci);
 
+static int comp_irqs_request_by_cpu_affinity(struct mlx5_core_dev *dev)
+{
+	struct mlx5_eq_table *table = dev->priv.eq_table;
+	struct devlink *devlink = priv_to_devlink(dev);
+	union devlink_param_value val;
+	cpumask_var_t user_mask;
+	int ret;
+
+	if (!zalloc_cpumask_var(&user_mask, GFP_KERNEL))
+		return -ENOMEM;
+
+	val.vbitmap = cpumask_bits(user_mask);
+	ret = devlink_param_driverinit_value_get(devlink,
+						 DEVLINK_PARAM_GENERIC_ID_CPU_AFFINITY,
+						 &val);
+	if (ret)
+		goto out;
+
+	ret = mlx5_irqs_request_mask(dev, table->comp_irqs, user_mask);
+out:
+	free_cpumask_var(user_mask);
+	return ret;
+}
+
 static void comp_irqs_release(struct mlx5_core_dev *dev)
 {
 	struct mlx5_eq_table *table = dev->priv.eq_table;
@@ -817,6 +841,11 @@ static int comp_irqs_request(struct mlx5_core_dev *dev)
 	table->comp_irqs = kcalloc(ncomp_eqs, sizeof(*table->comp_irqs), GFP_KERNEL);
 	if (!table->comp_irqs)
 		return -ENOMEM;
+
+	ret = comp_irqs_request_by_cpu_affinity(dev);
+	if (ret > 0)
+		return ret;
+	mlx5_core_dbg(dev, "failed to get param cpu_affinity. use default policy\n");
 	if (mlx5_core_is_sf(dev)) {
 		ret = mlx5_irq_affinity_irqs_request_auto(dev, ncomp_eqs, table->comp_irqs);
 		if (ret < 0)
@@ -987,6 +1016,16 @@ mlx5_comp_irq_get_affinity_mask(struct mlx5_core_dev *dev, int vector)
 }
 EXPORT_SYMBOL(mlx5_comp_irq_get_affinity_mask);
 
+void mlx5_core_affinity_get(struct mlx5_core_dev *dev, struct cpumask *dev_mask)
+{
+	struct mlx5_eq_table *table = dev->priv.eq_table;
+	struct mlx5_eq_comp *eq, *n;
+
+	list_for_each_entry_safe(eq, n, &table->comp_eqs_list, list)
+		cpumask_or(dev_mask, dev_mask,
+			   mlx5_irq_get_affinity_mask(eq->core.irq));
+}
+
 #ifdef CONFIG_RFS_ACCEL
 struct cpu_rmap *mlx5_eq_table_get_rmap(struct mlx5_core_dev *dev)
 {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index 6f8baa0f2a73..95d133aa3fcd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -307,4 +307,6 @@ bool mlx5_rdma_supported(struct mlx5_core_dev *dev);
 bool mlx5_vnet_supported(struct mlx5_core_dev *dev);
 bool mlx5_same_hw_devs(struct mlx5_core_dev *dev, struct mlx5_core_dev *peer_dev);
 
+void mlx5_core_affinity_get(struct mlx5_core_dev *dev, struct cpumask *dev_mask);
+
 #endif /* __MLX5_CORE_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_irq.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_irq.h
index 23cb63fa4588..a31dc3d900af 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_irq.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_irq.h
@@ -16,6 +16,7 @@ int mlx5_irq_table_create(struct mlx5_core_dev *dev);
 void mlx5_irq_table_destroy(struct mlx5_core_dev *dev);
 int mlx5_irq_table_get_num_comp(struct mlx5_irq_table *table);
 int mlx5_irq_table_get_sfs_vec(struct mlx5_irq_table *table);
+bool mlx5_irq_table_have_dedicated_sfs_irqs(struct mlx5_irq_table *table);
 struct mlx5_irq_table *mlx5_irq_table_get(struct mlx5_core_dev *dev);
 
 int mlx5_set_msix_vec_count(struct mlx5_core_dev *dev, int devfn,
@@ -25,10 +26,12 @@ int mlx5_get_default_msix_vec_count(struct mlx5_core_dev *dev, int num_vfs);
 struct mlx5_irq *mlx5_ctrl_irq_request(struct mlx5_core_dev *dev);
 void mlx5_ctrl_irq_release(struct mlx5_irq *ctrl_irq);
 struct mlx5_irq *mlx5_irq_request(struct mlx5_core_dev *dev, u16 vecidx,
-				  struct cpumask *affinity);
+				  const struct cpumask *affinity);
 int mlx5_irqs_request_vectors(struct mlx5_core_dev *dev, u16 *cpus, int nirqs,
 			      struct mlx5_irq **irqs);
 void mlx5_irqs_release_vectors(struct mlx5_irq **irqs, int nirqs);
+int mlx5_irqs_request_mask(struct mlx5_core_dev *dev, struct mlx5_irq **irqs,
+			   struct cpumask *irqs_req_mask);
 int mlx5_irq_attach_nb(struct mlx5_irq *irq, struct notifier_block *nb);
 int mlx5_irq_detach_nb(struct mlx5_irq *irq, struct notifier_block *nb);
 struct cpumask *mlx5_irq_get_affinity_mask(struct mlx5_irq *irq);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
index 41807ef55201..ed4e491ec9c0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
@@ -300,7 +300,7 @@ int mlx5_irq_get_index(struct mlx5_irq *irq)
 /* requesting an irq from a given pool according to given index */
 static struct mlx5_irq *
 irq_pool_request_vector(struct mlx5_irq_pool *pool, int vecidx,
-			struct cpumask *affinity)
+			const struct cpumask *affinity)
 {
 	struct mlx5_irq *irq;
 
@@ -420,7 +420,7 @@ struct mlx5_irq *mlx5_ctrl_irq_request(struct mlx5_core_dev *dev)
  * This function returns a pointer to IRQ, or ERR_PTR in case of error.
  */
 struct mlx5_irq *mlx5_irq_request(struct mlx5_core_dev *dev, u16 vecidx,
-				  struct cpumask *affinity)
+				  const struct cpumask *affinity)
 {
 	struct mlx5_irq_table *irq_table = mlx5_irq_table_get(dev);
 	struct mlx5_irq_pool *pool;
@@ -481,6 +481,82 @@ int mlx5_irqs_request_vectors(struct mlx5_core_dev *dev, u16 *cpus, int nirqs,
 	return i ? i : PTR_ERR(irq);
 }
 
+static int req_mask_local_spread(unsigned int i, int node,
+				 const struct cpumask *irqs_req_mask)
+{
+	int cpu;
+
+	if (node == NUMA_NO_NODE) {
+		for_each_cpu_and(cpu, cpu_online_mask, irqs_req_mask)
+			if (i-- == 0)
+				return cpu;
+	} else {
+		/* NUMA first. */
+		for_each_cpu_and(cpu, cpumask_of_node(node), irqs_req_mask)
+			if (cpu_online(cpu))
+				if (i-- == 0)
+					return cpu;
+
+		for_each_online_cpu(cpu) {
+			/* Skip NUMA nodes, done above. */
+			if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
+				continue;
+
+			if (i-- == 0)
+				return cpu;
+		}
+	}
+	WARN_ON(true);
+	return cpumask_first(cpu_online_mask);
+}
+
+/**
+ * mlx5_irqs_request_mask - request one or more IRQs for mlx5 device.
+ * @dev: mlx5 device that is requesting the IRQs.
+ * @irqs: an output array of IRQs pointers.
+ * @irqs_req_mask: cpumask requested for these IRQs.
+ *
+ * Each IRQ is bound to at most 1 CPU.
+ * This function returns the number of IRQs requested (which might be smaller
+ * than cpumask_weight(@irqs_req_mask)) if successful, or a negative error code
+ * in case of an error.
+ */
+int mlx5_irqs_request_mask(struct mlx5_core_dev *dev, struct mlx5_irq **irqs,
+			   struct cpumask *irqs_req_mask)
+{
+	struct mlx5_irq_pool *pool = mlx5_irq_pool_get(dev);
+	struct mlx5_irq *irq;
+	int nirqs;
+	int cpu;
+	int i;
+
+	/* Request an IRQ for each online CPU in the given mask */
+	cpumask_and(irqs_req_mask, irqs_req_mask, cpu_online_mask);
+	nirqs = cpumask_weight(irqs_req_mask);
+	for (i = 0; i < nirqs; i++) {
+		/* Iterate over the mask the caller provided in numa aware fashion.
+		 * Local CPUs are requested first, followed by non-local ones.
+		 */
+		cpu = req_mask_local_spread(i, dev->priv.numa_node, irqs_req_mask);
+
+		if (mlx5_irq_pool_is_sf_pool(pool))
+			irq = mlx5_irq_affinity_request(pool, cpumask_of(cpu));
+		else
+			irq = mlx5_irq_request(dev, i, cpumask_of(cpu));
+		if (IS_ERR(irq)) {
+			if (!i)
+				return PTR_ERR(irq);
+			return i;
+		}
+		irqs[i] = irq;
+		mlx5_core_dbg(pool->dev, "IRQ %u mapped to cpu %*pbl, %u EQs on this irq\n",
+			      pci_irq_vector(dev->pdev, mlx5_irq_get_index(irq)),
+			      cpumask_pr_args(mlx5_irq_get_affinity_mask(irq)),
+			      mlx5_irq_read_locked(irq) / MLX5_EQ_REFS_PER_IRQ);
+	}
+	return i;
+}
+
 static struct mlx5_irq_pool *
 irq_pool_alloc(struct mlx5_core_dev *dev, int start, int size, char *name,
 	       u32 min_threshold, u32 max_threshold)
@@ -670,6 +746,11 @@ void mlx5_irq_table_destroy(struct mlx5_core_dev *dev)
 	pci_free_irq_vectors(dev->pdev);
 }
 
+bool mlx5_irq_table_have_dedicated_sfs_irqs(struct mlx5_irq_table *table)
+{
+	return table->sf_comp_pool;
+}
+
 int mlx5_irq_table_get_sfs_vec(struct mlx5_irq_table *table)
 {
 	if (table->sf_comp_pool)
-- 
2.21.3

