* [PATCH RFC v2 net-next 1/2] devlink: Add shared descriptor eswitch attr
@ 2024-03-01  1:11 William Tu
  2024-03-01  1:11 ` [PATCH RFC v2 net-next 2/2] net/mlx5e: Add eswitch shared descriptor devlink William Tu
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: William Tu @ 2024-03-01  1:11 UTC (permalink / raw)
  To: netdev; +Cc: jiri, bodong, tariqt, yossiku, kuba, witu

Add two eswitch attrs: shrdesc_mode and shrdesc_count.

1. shrdesc_mode: enables a shared memory buffer pool for the
representors' rx buffers, and 2. shrdesc_count: controls the
number of buffers in this shared memory pool.

When using switchdev mode, the representor ports handle the slow path
traffic: traffic that can't be offloaded is redirected to the
representor port for processing. Memory consumption of the representor
ports' rx buffers can grow to several GB when scaling to 1k VF reps.
For example, in the mlx5 driver, each RQ, with a typical 1K
descriptors, consumes 3MB of DMA memory for packet buffers in WQEs,
and with four channels, 1024 reps consume 4 * 3MB * 1024 = 12GB of
memory. And since rep ports carry only slow path traffic, most of this
rx DMA memory sits idle.
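
Spelled out (the ~3KB of packet buffer per descriptor is implied by
3MB / 1K descriptors, not a spec'd value):

  1K descriptors * ~3KB each  =  3MB per RQ
  3MB * 4 channels            = 12MB per rep netdev
  12MB * 1024 reps            = 12GB total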

Add the shrdesc_mode configuration, allowing multiple representors
to share an rx memory buffer pool. When enabled, an individual
representor doesn't need to allocate its own dedicated rx buffer; it
simply points its rq to the shared memory pool, making better use of
the memory. The shrdesc_count represents the number of rx ring
entries (the same meaning as in ethtool -g) shared across the
representors. Users adjust it based on the number of reps, total
system memory, or performance expectations.

The two params are also useful for other vendors, such as the Intel
ice driver and Broadcom's driver, which also have representor ports
for slow path traffic.

An example use case:
$ devlink dev eswitch show pci/0000:08:00.0
  pci/0000:08:00.0: mode legacy inline-mode none encap-mode basic \
  shrdesc-mode none shrdesc-count 0
$ devlink dev eswitch set pci/0000:08:00.0 mode switchdev \
  shrdesc-mode basic shrdesc-count 1024
$ devlink dev eswitch show pci/0000:08:00.0
  pci/0000:08:00.0: mode switchdev inline-mode none encap-mode basic \
  shrdesc-mode basic shrdesc-count 1024

Note that the new configurations are set in legacy mode and take
effect in switchdev mode.

Signed-off-by: William Tu <witu@nvidia.com>
---
v2: feedback from Simon and Jiri
- adding devlink.yaml, generated using ynl-regen.sh

---
 Documentation/netlink/specs/devlink.yaml | 17 ++++++++++
 include/net/devlink.h                    |  8 +++++
 include/uapi/linux/devlink.h             |  7 ++++
 net/devlink/dev.c                        | 43 ++++++++++++++++++++++++
 net/devlink/netlink_gen.c                |  6 ++--
 5 files changed, 79 insertions(+), 2 deletions(-)

diff --git a/Documentation/netlink/specs/devlink.yaml b/Documentation/netlink/specs/devlink.yaml
index cf6eaa0da821..58f31d99b8b3 100644
--- a/Documentation/netlink/specs/devlink.yaml
+++ b/Documentation/netlink/specs/devlink.yaml
@@ -119,6 +119,14 @@ definitions:
         name: none
       -
         name: basic
+  -
+    type: enum
+    name: eswitch-shrdesc-mode
+    entries:
+      -
+        name: none
+      -
+        name: basic
   -
     type: enum
     name: dpipe-header-id
@@ -429,6 +437,13 @@ attribute-sets:
         name: eswitch-encap-mode
         type: u8
         enum: eswitch-encap-mode
+      -
+        name: eswitch-shrdesc-mode
+        type: u8
+        enum: eswitch-shrdesc-mode
+      -
+        name: eswitch-shrdesc-count
+        type: u32
       -
         name: resource-list
         type: nest
@@ -1555,6 +1570,8 @@ operations:
             - eswitch-mode
             - eswitch-inline-mode
             - eswitch-encap-mode
+            - eswitch-shrdesc-mode
+            - eswitch-shrdesc-count
 
     -
       name: eswitch-set
diff --git a/include/net/devlink.h b/include/net/devlink.h
index 9ac394bdfbe4..aca25183e72a 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -1327,6 +1327,14 @@ struct devlink_ops {
 	int (*eswitch_encap_mode_set)(struct devlink *devlink,
 				      enum devlink_eswitch_encap_mode encap_mode,
 				      struct netlink_ext_ack *extack);
+	int (*eswitch_shrdesc_mode_get)(struct devlink *devlink,
+				        enum devlink_eswitch_shrdesc_mode *p_shrdesc_mode);
+	int (*eswitch_shrdesc_mode_set)(struct devlink *devlink,
+				        enum devlink_eswitch_shrdesc_mode shrdesc_mode,
+				        struct netlink_ext_ack *extack);
+	int (*eswitch_shrdesc_count_get)(struct devlink *devlink, int *count);
+	int (*eswitch_shrdesc_count_set)(struct devlink *devlink, int count,
+				        struct netlink_ext_ack *extack);
 	int (*info_get)(struct devlink *devlink, struct devlink_info_req *req,
 			struct netlink_ext_ack *extack);
 	/**
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index 130cae0d3e20..31323c481614 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -195,6 +195,11 @@ enum devlink_eswitch_encap_mode {
 	DEVLINK_ESWITCH_ENCAP_MODE_BASIC,
 };
 
+enum devlink_eswitch_shrdesc_mode {
+	DEVLINK_ESWITCH_SHRDESC_MODE_NONE,
+	DEVLINK_ESWITCH_SHRDESC_MODE_BASIC,
+};
+
 enum devlink_port_flavour {
 	DEVLINK_PORT_FLAVOUR_PHYSICAL, /* Any kind of a port physically
 					* facing the user.
@@ -614,6 +619,8 @@ enum devlink_attr {
 
 	DEVLINK_ATTR_REGION_DIRECT,		/* flag */
 
+	DEVLINK_ATTR_ESWITCH_SHRDESC_MODE,	/* u8 */
+	DEVLINK_ATTR_ESWITCH_SHRDESC_COUNT,	/* u32 */
 	/* add new attributes above here, update the policy in devlink.c */
 
 	__DEVLINK_ATTR_MAX,
diff --git a/net/devlink/dev.c b/net/devlink/dev.c
index 19dbf540748a..90b279c3c1e6 100644
--- a/net/devlink/dev.c
+++ b/net/devlink/dev.c
@@ -631,8 +631,10 @@ static int devlink_nl_eswitch_fill(struct sk_buff *msg, struct devlink *devlink,
 				   enum devlink_command cmd, u32 portid,
 				   u32 seq, int flags)
 {
+	enum devlink_eswitch_shrdesc_mode shrdesc_mode;
 	const struct devlink_ops *ops = devlink->ops;
 	enum devlink_eswitch_encap_mode encap_mode;
+	u32 shrdesc_count;
 	u8 inline_mode;
 	void *hdr;
 	int err = 0;
@@ -674,6 +676,25 @@ static int devlink_nl_eswitch_fill(struct sk_buff *msg, struct devlink *devlink,
 			goto nla_put_failure;
 	}
 
+	if (ops->eswitch_shrdesc_mode_get) {
+		err = ops->eswitch_shrdesc_mode_get(devlink, &shrdesc_mode);
+		if (err)
+			goto nla_put_failure;
+		err = nla_put_u8(msg, DEVLINK_ATTR_ESWITCH_SHRDESC_MODE, shrdesc_mode);
+		if (err)
+			goto nla_put_failure;
+
+	}
+
+	if (ops->eswitch_shrdesc_count_get) {
+		err = ops->eswitch_shrdesc_count_get(devlink, &shrdesc_count);
+		if (err)
+			goto nla_put_failure;
+		err = nla_put_u32(msg, DEVLINK_ATTR_ESWITCH_SHRDESC_COUNT, shrdesc_count);
+		if (err)
+			goto nla_put_failure;
+	}
+
 	genlmsg_end(msg, hdr);
 	return 0;
 
@@ -705,9 +726,11 @@ int devlink_nl_eswitch_get_doit(struct sk_buff *skb, struct genl_info *info)
 
 int devlink_nl_eswitch_set_doit(struct sk_buff *skb, struct genl_info *info)
 {
+	enum devlink_eswitch_shrdesc_mode shrdesc_mode;
 	struct devlink *devlink = info->user_ptr[0];
 	const struct devlink_ops *ops = devlink->ops;
 	enum devlink_eswitch_encap_mode encap_mode;
+	u32 shrdesc_count;
 	u8 inline_mode;
 	int err = 0;
 	u16 mode;
@@ -744,6 +767,26 @@ int devlink_nl_eswitch_set_doit(struct sk_buff *skb, struct genl_info *info)
 			return err;
 	}
 
+	if (info->attrs[DEVLINK_ATTR_ESWITCH_SHRDESC_MODE]) {
+		if (!ops->eswitch_shrdesc_mode_set)
+			return -EOPNOTSUPP;
+		shrdesc_mode = nla_get_u8(info->attrs[DEVLINK_ATTR_ESWITCH_SHRDESC_MODE]);
+		err = ops->eswitch_shrdesc_mode_set(devlink, shrdesc_mode,
+						    info->extack);
+		if (err)
+			return err;
+	}
+
+	if (info->attrs[DEVLINK_ATTR_ESWITCH_SHRDESC_COUNT]) {
+		if (!ops->eswitch_shrdesc_count_set)
+			return -EOPNOTSUPP;
+		shrdesc_count = nla_get_u32(info->attrs[DEVLINK_ATTR_ESWITCH_SHRDESC_COUNT]);
+		err = ops->eswitch_shrdesc_count_set(devlink, shrdesc_count,
+						     info->extack);
+		if (err)
+			return err;
+	}
+
 	return 0;
 }
 
diff --git a/net/devlink/netlink_gen.c b/net/devlink/netlink_gen.c
index c81cf2dd154f..9346505a03d5 100644
--- a/net/devlink/netlink_gen.c
+++ b/net/devlink/netlink_gen.c
@@ -194,12 +194,14 @@ static const struct nla_policy devlink_eswitch_get_nl_policy[DEVLINK_ATTR_DEV_NA
 };
 
 /* DEVLINK_CMD_ESWITCH_SET - do */
-static const struct nla_policy devlink_eswitch_set_nl_policy[DEVLINK_ATTR_ESWITCH_ENCAP_MODE + 1] = {
+static const struct nla_policy devlink_eswitch_set_nl_policy[DEVLINK_ATTR_ESWITCH_SHRDESC_COUNT + 1] = {
 	[DEVLINK_ATTR_BUS_NAME] = { .type = NLA_NUL_STRING, },
 	[DEVLINK_ATTR_DEV_NAME] = { .type = NLA_NUL_STRING, },
 	[DEVLINK_ATTR_ESWITCH_MODE] = NLA_POLICY_MAX(NLA_U16, 1),
 	[DEVLINK_ATTR_ESWITCH_INLINE_MODE] = NLA_POLICY_MAX(NLA_U16, 3),
 	[DEVLINK_ATTR_ESWITCH_ENCAP_MODE] = NLA_POLICY_MAX(NLA_U8, 1),
+	[DEVLINK_ATTR_ESWITCH_SHRDESC_MODE] = NLA_POLICY_MAX(NLA_U8, 1),
+	[DEVLINK_ATTR_ESWITCH_SHRDESC_COUNT] = { .type = NLA_U32, },
 };
 
 /* DEVLINK_CMD_DPIPE_TABLE_GET - do */
@@ -787,7 +789,7 @@ const struct genl_split_ops devlink_nl_ops[74] = {
 		.doit		= devlink_nl_eswitch_set_doit,
 		.post_doit	= devlink_nl_post_doit,
 		.policy		= devlink_eswitch_set_nl_policy,
-		.maxattr	= DEVLINK_ATTR_ESWITCH_ENCAP_MODE,
+		.maxattr	= DEVLINK_ATTR_ESWITCH_SHRDESC_COUNT,
 		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
 	},
 	{
-- 
2.38.1



* [PATCH RFC v2 net-next 2/2] net/mlx5e: Add eswitch shared descriptor devlink
  2024-03-01  1:11 [PATCH RFC v2 net-next 1/2] devlink: Add shared descriptor eswitch attr William Tu
@ 2024-03-01  1:11 ` William Tu
  2024-03-01  1:46 ` [PATCH RFC v2 net-next 1/2] devlink: Add shared descriptor eswitch attr Samudrala, Sridhar
  2024-03-05  4:37 ` Jakub Kicinski
  2 siblings, 0 replies; 8+ messages in thread
From: William Tu @ 2024-03-01  1:11 UTC (permalink / raw)
  To: netdev; +Cc: jiri, bodong, tariqt, yossiku, kuba, witu

Add devlink support for the eswitch shared descriptor
implementation in the mlx5 driver.

Signed-off-by: William Tu <witu@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/devlink.c |  4 +
 .../net/ethernet/mellanox/mlx5/core/eswitch.h | 10 +++
 .../mellanox/mlx5/core/eswitch_offloads.c     | 80 +++++++++++++++++++
 3 files changed, 94 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
index 3e064234f6fe..24eb03763b60 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
@@ -312,6 +312,10 @@ static const struct devlink_ops mlx5_devlink_ops = {
 	.eswitch_inline_mode_get = mlx5_devlink_eswitch_inline_mode_get,
 	.eswitch_encap_mode_set = mlx5_devlink_eswitch_encap_mode_set,
 	.eswitch_encap_mode_get = mlx5_devlink_eswitch_encap_mode_get,
+	.eswitch_shrdesc_mode_set = mlx5_devlink_eswitch_shrdesc_mode_set,
+	.eswitch_shrdesc_mode_get = mlx5_devlink_eswitch_shrdesc_mode_get,
+	.eswitch_shrdesc_count_set = mlx5_devlink_eswitch_shrdesc_count_set,
+	.eswitch_shrdesc_count_get = mlx5_devlink_eswitch_shrdesc_count_get,
 	.rate_leaf_tx_share_set = mlx5_esw_devlink_rate_leaf_tx_share_set,
 	.rate_leaf_tx_max_set = mlx5_esw_devlink_rate_leaf_tx_max_set,
 	.rate_node_tx_share_set = mlx5_esw_devlink_rate_node_tx_share_set,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index 349e28a6dd8d..f678bcb98e1f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -378,6 +378,8 @@ struct mlx5_eswitch {
 	struct mlx5_esw_functions esw_funcs;
 	struct {
 		u32             large_group_num;
+		u32             shared_rx_ring_counts;
+		bool            enable_shared_rx_ring;
 	}  params;
 	struct blocking_notifier_head n_head;
 	struct xarray paired;
@@ -549,6 +551,14 @@ int mlx5_devlink_eswitch_encap_mode_set(struct devlink *devlink,
 					struct netlink_ext_ack *extack);
 int mlx5_devlink_eswitch_encap_mode_get(struct devlink *devlink,
 					enum devlink_eswitch_encap_mode *encap);
+int mlx5_devlink_eswitch_shrdesc_mode_set(struct devlink *devlink,
+					  enum devlink_eswitch_shrdesc_mode mode,
+					  struct netlink_ext_ack *extack);
+int mlx5_devlink_eswitch_shrdesc_mode_get(struct devlink *devlink,
+					  enum devlink_eswitch_shrdesc_mode *mode);
+int mlx5_devlink_eswitch_shrdesc_count_set(struct devlink *devlink, int count,
+					   struct netlink_ext_ack *extack);
+int mlx5_devlink_eswitch_shrdesc_count_get(struct devlink *devlink, int *count);
 int mlx5_devlink_port_fn_hw_addr_get(struct devlink_port *port,
 				     u8 *hw_addr, int *hw_addr_len,
 				     struct netlink_ext_ack *extack);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index b0455134c98e..5586f52e4239 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -4019,6 +4019,86 @@ int mlx5_devlink_eswitch_encap_mode_get(struct devlink *devlink,
 	return 0;
 }
 
+int mlx5_devlink_eswitch_shrdesc_mode_set(struct devlink *devlink,
+					  enum devlink_eswitch_shrdesc_mode shrdesc,
+					  struct netlink_ext_ack *extack)
+{
+	struct mlx5_eswitch *esw;
+	int err = 0;
+
+	esw = mlx5_devlink_eswitch_get(devlink);
+	if (IS_ERR(esw))
+		return PTR_ERR(esw);
+
+	down_write(&esw->mode_lock);
+	if (esw->mode != MLX5_ESWITCH_OFFLOADS) {
+		NL_SET_ERR_MSG_MOD(extack,
+				   "Can't enable shared descriptors in legacy mode");
+		err = -EOPNOTSUPP;
+		goto out;
+	}
+	esw->params.enable_shared_rx_ring = shrdesc ==
+					     DEVLINK_ESWITCH_SHRDESC_MODE_BASIC;
+
+out:
+	up_write(&esw->mode_lock);
+	return err;
+}
+
+int mlx5_devlink_eswitch_shrdesc_mode_get(struct devlink *devlink,
+					  enum devlink_eswitch_shrdesc_mode *shrdesc)
+{
+	struct mlx5_eswitch *esw;
+	bool enable;
+
+	esw = mlx5_devlink_eswitch_get(devlink);
+	if (IS_ERR(esw))
+		return PTR_ERR(esw);
+
+	enable = esw->params.enable_shared_rx_ring;
+	if (enable)
+		*shrdesc = DEVLINK_ESWITCH_SHRDESC_MODE_BASIC;
+	else
+		*shrdesc = DEVLINK_ESWITCH_SHRDESC_MODE_NONE;
+
+	return 0;
+}
+
+int mlx5_devlink_eswitch_shrdesc_count_set(struct devlink *devlink, int count,
+					   struct netlink_ext_ack *extack)
+{
+	struct mlx5_eswitch *esw;
+	int err = 0;
+
+	esw = mlx5_devlink_eswitch_get(devlink);
+	if (IS_ERR(esw))
+		return PTR_ERR(esw);
+
+	down_write(&esw->mode_lock);
+	if (esw->mode != MLX5_ESWITCH_OFFLOADS) {
+		NL_SET_ERR_MSG_MOD(extack,
+				   "Can't enable shared descriptors in legacy mode");
+		err = -EOPNOTSUPP;
+		goto out;
+	}
+	esw->params.shared_rx_ring_counts = count;
+out:
+	up_write(&esw->mode_lock);
+	return err;
+}
+
+int mlx5_devlink_eswitch_shrdesc_count_get(struct devlink *devlink, int *count)
+{
+	struct mlx5_eswitch *esw;
+
+	esw = mlx5_devlink_eswitch_get(devlink);
+	if (IS_ERR(esw))
+		return PTR_ERR(esw);
+
+	*count = esw->params.shared_rx_ring_counts;
+	return 0;
+}
+
 static bool
 mlx5_eswitch_vport_has_rep(const struct mlx5_eswitch *esw, u16 vport_num)
 {
-- 
2.38.1



* Re: [PATCH RFC v2 net-next 1/2] devlink: Add shared descriptor eswitch attr
  2024-03-01  1:11 [PATCH RFC v2 net-next 1/2] devlink: Add shared descriptor eswitch attr William Tu
  2024-03-01  1:11 ` [PATCH RFC v2 net-next 2/2] net/mlx5e: Add eswitch shared descriptor devlink William Tu
@ 2024-03-01  1:46 ` Samudrala, Sridhar
  2024-03-01 17:25   ` William Tu
  2024-03-05  4:37 ` Jakub Kicinski
  2 siblings, 1 reply; 8+ messages in thread
From: Samudrala, Sridhar @ 2024-03-01  1:46 UTC (permalink / raw)
  To: William Tu, netdev; +Cc: jiri, bodong, tariqt, yossiku, kuba



On 2/29/2024 7:11 PM, William Tu wrote:
> Add two eswitch attrs: shrdesc_mode and shrdesc_count.
> 
> 1. shrdesc_mode: enables a shared memory buffer pool for the
> representors' rx buffers, and 2. shrdesc_count: controls the
> number of buffers in this shared memory pool.
> 
> When using switchdev mode, the representor ports handle the slow path
> traffic: traffic that can't be offloaded is redirected to the
> representor port for processing. Memory consumption of the representor
> ports' rx buffers can grow to several GB when scaling to 1k VF reps.
> For example, in the mlx5 driver, each RQ, with a typical 1K
> descriptors, consumes 3MB of DMA memory for packet buffers in WQEs,
> and with four channels, 1024 reps consume 4 * 3MB * 1024 = 12GB of
> memory. And since rep ports carry only slow path traffic, most of this
> rx DMA memory sits idle.
> 
> Add the shrdesc_mode configuration, allowing multiple representors
> to share an rx memory buffer pool. When enabled, an individual
> representor doesn't need to allocate its own dedicated rx buffer; it
> simply points its rq to the shared memory pool, making better use of

I guess the rx buffers are allocated from a page_pool. Does it mean that 
a page pool is now shared across multiple rx queues belonging to 
multiple netdevs?  Do they all share the same napi?

> the memory. The shrdesc_count represents the number of rx ring
> entries (the same meaning as in ethtool -g) shared across the
> representors. Users adjust it based on the number of reps, total
> system memory, or performance expectations.
> 
> The two params are also useful for other vendors, such as the Intel
> ice driver and Broadcom's driver, which also have representor ports
> for slow path traffic.
> 
> An example use case:
> $ devlink dev eswitch show pci/0000:08:00.0
>    pci/0000:08:00.0: mode legacy inline-mode none encap-mode basic \
>    shrdesc-mode none shrdesc-count 0
> $ devlink dev eswitch set pci/0000:08:00.0 mode switchdev \
>    shrdesc-mode basic shrdesc-count 1024
> $ devlink dev eswitch show pci/0000:08:00.0
>    pci/0000:08:00.0: mode switchdev inline-mode none encap-mode basic \
>    shrdesc-mode basic shrdesc-count 1024
> 
> Note that the new configurations are set in legacy mode and take
> effect in switchdev mode.
> 
> Signed-off-by: William Tu <witu@nvidia.com>
> ---

<snip>


* Re: [PATCH RFC v2 net-next 1/2] devlink: Add shared descriptor eswitch attr
  2024-03-01  1:46 ` [PATCH RFC v2 net-next 1/2] devlink: Add shared descriptor eswitch attr Samudrala, Sridhar
@ 2024-03-01 17:25   ` William Tu
  0 siblings, 0 replies; 8+ messages in thread
From: William Tu @ 2024-03-01 17:25 UTC (permalink / raw)
  To: Samudrala, Sridhar, netdev; +Cc: jiri, bodong, tariqt, yossiku, kuba



On 2/29/24 5:46 PM, Samudrala, Sridhar wrote:
> On 2/29/2024 7:11 PM, William Tu wrote:
>> Add two eswitch attrs: shrdesc_mode and shrdesc_count.
>>
>> 1. shrdesc_mode: enables a shared memory buffer pool for the
>> representors' rx buffers, and 2. shrdesc_count: controls the
>> number of buffers in this shared memory pool.
>>
>> When using switchdev mode, the representor ports handle the slow path
>> traffic: traffic that can't be offloaded is redirected to the
>> representor port for processing. Memory consumption of the representor
>> ports' rx buffers can grow to several GB when scaling to 1k VF reps.
>> For example, in the mlx5 driver, each RQ, with a typical 1K
>> descriptors, consumes 3MB of DMA memory for packet buffers in WQEs,
>> and with four channels, 1024 reps consume 4 * 3MB * 1024 = 12GB of
>> memory. And since rep ports carry only slow path traffic, most of this
>> rx DMA memory sits idle.
>>
>> Add the shrdesc_mode configuration, allowing multiple representors
>> to share an rx memory buffer pool. When enabled, an individual
>> representor doesn't need to allocate its own dedicated rx buffer; it
>> simply points its rq to the shared memory pool, making better use of
>
> I guess the rx buffers are allocated from a page_pool. Does it mean that
> a page pool is now shared across multiple rx queues belonging to
> multiple netdevs?  Do they all share the same napi?

Yes. The basic sharing scheme is to have rx queue N of all representor
netdevs share one pool. And packets are processed by the uplink netdev,
so they share the same napi.
More detail here:
https://lore.kernel.org/netdev/20240125223617.7298-1-witu@nvidia.com/
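
A rough sketch of that scheme (the page_pool_params fields are the
upstream ones; mdev, uplink_ch, n, and shrdesc_count are illustrative
placeholders, not the actual mlx5 code):

	/* one pool per queue index N, created once by the uplink and
	 * then attached to every representor's rx queue N
	 */
	struct page_pool_params pp_params = {
		.order		= 0,			/* single pages */
		.pool_size	= shrdesc_count,	/* the devlink knob */
		.nid		= NUMA_NO_NODE,
		.dev		= mdev->device,		/* uplink's DMA device */
		.napi		= &uplink_ch[n].napi,	/* shared napi */
		.dma_dir	= DMA_FROM_DEVICE,
	};
	struct page_pool *shared_pool = page_pool_create(&pp_params);

	/* each rep's rq N then points at shared_pool instead of
	 * allocating a dedicated rx buffer
	 */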




* Re: [PATCH RFC v2 net-next 1/2] devlink: Add shared descriptor eswitch attr
  2024-03-01  1:11 [PATCH RFC v2 net-next 1/2] devlink: Add shared descriptor eswitch attr William Tu
  2024-03-01  1:11 ` [PATCH RFC v2 net-next 2/2] net/mlx5e: Add eswitch shared descriptor devlink William Tu
  2024-03-01  1:46 ` [PATCH RFC v2 net-next 1/2] devlink: Add shared descriptor eswitch attr Samudrala, Sridhar
@ 2024-03-05  4:37 ` Jakub Kicinski
  2024-03-06  0:27   ` William Tu
  2 siblings, 1 reply; 8+ messages in thread
From: Jakub Kicinski @ 2024-03-05  4:37 UTC (permalink / raw)
  To: William Tu; +Cc: netdev, jiri, bodong, tariqt, yossiku

On Fri, 1 Mar 2024 03:11:18 +0200 William Tu wrote:
> Add two eswitch attrs: shrdesc_mode and shrdesc_count.
> 
> 1. shrdesc_mode: enables a shared memory buffer pool for the
> representors' rx buffers, 

Let's narrow down the terminology. "Shared memory buffer"
and "shared memory pool" and "shrdesc" all refer to the same
thing. Let's stick to shared pool?

> and 2. shrdesc_count: controls the
> number of buffers in this shared memory pool.

_default_ number of buffers in shared pool used by representors?

If/when the API to configure shared pools becomes real it will
presumably take precedence over this default?

> When using switchdev mode, the representor ports handle the slow path
> traffic: traffic that can't be offloaded is redirected to the
> representor port for processing. Memory consumption of the representor
> ports' rx buffers can grow to several GB when scaling to 1k VF reps.
> For example, in the mlx5 driver, each RQ, with a typical 1K
> descriptors, consumes 3MB of DMA memory for packet buffers in WQEs,
> and with four channels, 1024 reps consume 4 * 3MB * 1024 = 12GB of
> memory. And since rep ports carry only slow path traffic, most of this
> rx DMA memory sits idle.
> 
> Add the shrdesc_mode configuration, allowing multiple representors
> to share an rx memory buffer pool. When enabled, an individual
> representor doesn't need to allocate its own dedicated rx buffer; it
> simply points its rq to the shared memory pool, making better use of
> the memory. The shrdesc_count represents the number of rx ring
> entries (the same meaning as in ethtool -g) shared across the
> representors. Users adjust it based on the number of reps, total
> system memory, or performance expectations.

Can we use bytes as the unit? Like the page pool. Descriptors don't
mean much to the user.

> The two params are also useful for other vendors, such as the Intel
> ice driver and Broadcom's driver, which also have representor ports
> for slow path traffic.
> 
> An example use case:
> $ devlink dev eswitch show pci/0000:08:00.0
>   pci/0000:08:00.0: mode legacy inline-mode none encap-mode basic \
>   shrdesc-mode none shrdesc-count 0
> $ devlink dev eswitch set pci/0000:08:00.0 mode switchdev \
>   shrdesc-mode basic shrdesc-count 1024
> $ devlink dev eswitch show pci/0000:08:00.0
>   pci/0000:08:00.0: mode switchdev inline-mode none encap-mode basic \
>   shrdesc-mode basic shrdesc-count 1024
> 
> Note that the new configurations are set in legacy mode and take
> effect in switchdev mode.

>  Documentation/netlink/specs/devlink.yaml | 17 ++++++++++
>  include/net/devlink.h                    |  8 +++++
>  include/uapi/linux/devlink.h             |  7 ++++
>  net/devlink/dev.c                        | 43 ++++++++++++++++++++++++
>  net/devlink/netlink_gen.c                |  6 ++--
>  5 files changed, 79 insertions(+), 2 deletions(-)

ENODOCS

> diff --git a/Documentation/netlink/specs/devlink.yaml b/Documentation/netlink/specs/devlink.yaml
> index cf6eaa0da821..58f31d99b8b3 100644
> --- a/Documentation/netlink/specs/devlink.yaml
> +++ b/Documentation/netlink/specs/devlink.yaml
> @@ -119,6 +119,14 @@ definitions:
>          name: none
>        -
>          name: basic
> +  -
> +    type: enum
> +    name: eswitch-shrdesc-mode
> +    entries:
> +      -
> +        name: none
> +      -
> +        name: basic

Do we need this knob?
Can we not assume that shared-pool-count == 0 means disabled?
We can always add the knob later if needed, right now it's
just on / off with some less direct names.
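
A sketch of what that would mean on the driver side, reusing the
fields from patch 2/2:

	/* no separate mode knob: count == 0 simply means disabled */
	esw->params.shared_rx_ring_counts = count;
	esw->params.enable_shared_rx_ring = count != 0;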

>    -
>      type: enum
>      name: dpipe-header-id
> @@ -429,6 +437,13 @@ attribute-sets:
>          name: eswitch-encap-mode
>          type: u8
>          enum: eswitch-encap-mode
> +      -
> +        name: eswitch-shrdesc-mode
> +        type: u8

u32, netlink rounds sizes up to 4B, anyway

> +        enum: eswitch-shrdesc-mode
> +      -
> +        name: eswitch-shrdesc-count
> +        type: u32
>        -
>          name: resource-list
>          type: nest


* Re: [PATCH RFC v2 net-next 1/2] devlink: Add shared descriptor eswitch attr
  2024-03-05  4:37 ` Jakub Kicinski
@ 2024-03-06  0:27   ` William Tu
  2024-03-06  2:30     ` Jakub Kicinski
  0 siblings, 1 reply; 8+ messages in thread
From: William Tu @ 2024-03-06  0:27 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: netdev, jiri, bodong, tariqt, yossiku



On 3/4/24 8:37 PM, Jakub Kicinski wrote:
> On Fri, 1 Mar 2024 03:11:18 +0200 William Tu wrote:
>> Add two eswitch attrs: shrdesc_mode and shrdesc_count.
>>
>> 1. shrdesc_mode: enables a shared memory buffer pool for the
>> representors' rx buffers,
> Let's narrow down the terminology. "Shared memory buffer"
> and "shared memory pool" and "shrdesc" all refer to the same
> thing. Let's stick to shared pool?
Ok, will use shared pool.
>> and 2. shrdesc_count: controls the
>> number of buffers in this shared memory pool.
> _default_ number of buffers in shared pool used by representors?
>
> If/when the API to configure shared pools becomes real it will
> presumably take precedence over this default?
yes, if that's the case.
>> When using switchdev mode, the representor ports handle the slow path
>> traffic: traffic that can't be offloaded is redirected to the
>> representor port for processing. Memory consumption of the representor
>> ports' rx buffers can grow to several GB when scaling to 1k VF reps.
>> For example, in the mlx5 driver, each RQ, with a typical 1K
>> descriptors, consumes 3MB of DMA memory for packet buffers in WQEs,
>> and with four channels, 1024 reps consume 4 * 3MB * 1024 = 12GB of
>> memory. And since rep ports carry only slow path traffic, most of this
>> rx DMA memory sits idle.
>>
>> Add the shrdesc_mode configuration, allowing multiple representors
>> to share an rx memory buffer pool. When enabled, an individual
>> representor doesn't need to allocate its own dedicated rx buffer; it
>> simply points its rq to the shared memory pool, making better use of
>> the memory. The shrdesc_count represents the number of rx ring
>> entries (the same meaning as in ethtool -g) shared across the
>> representors. Users adjust it based on the number of reps, total
>> system memory, or performance expectations.
> Can we use bytes as the unit? Like the page pool. Descriptors don't
> mean much to the user.
But how about the unit size? Do we assume unit size = 1 page?
The page pool has:
order: 2^order pages per allocation
pool_size: size of the ptr_ring

How about we assume that order is 0 and let the user set pool_size
(the number of page-size entries)?
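
(For reference, a sketch of that mapping onto page_pool_params;
shrdesc_count stands in for whatever the devlink knob ends up being
called:)

	struct page_pool_params pp = {
		.order		= 0,		/* unit size = 1 page */
		.pool_size	= shrdesc_count,	/* user-set entries */
	};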
>
>> The two params are also useful for other vendors, such as the Intel
>> ice driver and Broadcom's driver, which also have representor ports
>> for slow path traffic.
>>
>> An example use case:
>> $ devlink dev eswitch show pci/0000:08:00.0
>>    pci/0000:08:00.0: mode legacy inline-mode none encap-mode basic \
>>    shrdesc-mode none shrdesc-count 0
>> $ devlink dev eswitch set pci/0000:08:00.0 mode switchdev \
>>    shrdesc-mode basic shrdesc-count 1024
>> $ devlink dev eswitch show pci/0000:08:00.0
>>    pci/0000:08:00.0: mode switchdev inline-mode none encap-mode basic \
>>    shrdesc-mode basic shrdesc-count 1024
>>
>> Note that the new configurations are set in legacy mode and take
>> effect in switchdev mode.
>>   Documentation/netlink/specs/devlink.yaml | 17 ++++++++++
>>   include/net/devlink.h                    |  8 +++++
>>   include/uapi/linux/devlink.h             |  7 ++++
>>   net/devlink/dev.c                        | 43 ++++++++++++++++++++++++
>>   net/devlink/netlink_gen.c                |  6 ++--
>>   5 files changed, 79 insertions(+), 2 deletions(-)
> ENODOCS
Will add docs in the next version, thanks.
>> diff --git a/Documentation/netlink/specs/devlink.yaml b/Documentation/netlink/specs/devlink.yaml
>> index cf6eaa0da821..58f31d99b8b3 100644
>> --- a/Documentation/netlink/specs/devlink.yaml
>> +++ b/Documentation/netlink/specs/devlink.yaml
>> @@ -119,6 +119,14 @@ definitions:
>>           name: none
>>         -
>>           name: basic
>> +  -
>> +    type: enum
>> +    name: eswitch-shrdesc-mode
>> +    entries:
>> +      -
>> +        name: none
>> +      -
>> +        name: basic
> Do we need this knob?
> Can we not assume that shared-pool-count == 0 means disabled?
Do you mean assume or not assume?

I guess you mean assume, so use "shared-pool-count == 0" to indicate
disabled?
That will also work, so we only need to introduce one attribute.
> We can always add the knob later if needed, right now it's
> just on / off with some less direct names.
>
>>     -
>>       type: enum
>>       name: dpipe-header-id
>> @@ -429,6 +437,13 @@ attribute-sets:
>>           name: eswitch-encap-mode
>>           type: u8
>>           enum: eswitch-encap-mode
>> +      -
>> +        name: eswitch-shrdesc-mode
>> +        type: u8
> u32, netlink rounds sizes up to 4B, anyway
ok, thanks!

>
>> +        enum: eswitch-shrdesc-mode
>> +      -
>> +        name: eswitch-shrdesc-count
>> +        type: u32
>>         -
>>           name: resource-list
>>           type: nest



* Re: [PATCH RFC v2 net-next 1/2] devlink: Add shared descriptor eswitch attr
  2024-03-06  0:27   ` William Tu
@ 2024-03-06  2:30     ` Jakub Kicinski
  2024-03-06  5:18       ` William Tu
  0 siblings, 1 reply; 8+ messages in thread
From: Jakub Kicinski @ 2024-03-06  2:30 UTC (permalink / raw)
  To: William Tu; +Cc: netdev, jiri, bodong, tariqt, yossiku

On Tue, 5 Mar 2024 16:27:50 -0800 William Tu wrote:
> > Can we use bytes as the unit? Like the page pool. Descriptors don't
> > mean much to the user.  
> But how about the unit size? Do we assume unit size = 1 page?
> The page pool has:
> order: 2^order pages per allocation
> pool_size: size of the ptr_ring
> 
> How about we assume that order is 0 and let the user set pool_size
> (the number of page-size entries)?

Do you mean because the user doesn't know the granularity,
e.g. we can't allocate 12345 bytes most likely?

For shared buffer (the switch buffer configuration API)
we report cell size IIRC. In ethtool a lot of drivers just
round up to whatever they support. We could also treat 
the user-configured value as "upper bound" and effectively
round down but keep what the user configured exactly, when
they read back. I like the last one the most, if it makes sense.
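
A minimal sketch of that last option (apply_pool_size() and CELL_SIZE
are made-up placeholders, not an existing API):

	static u32 user_pool_bytes;	/* exact value the user wrote */

	static int shared_pool_size_set(u32 bytes)
	{
		user_pool_bytes = bytes;
		/* hw allocates in fixed-size cells; round down */
		return apply_pool_size(rounddown(bytes, CELL_SIZE));
	}

	static int shared_pool_size_get(u32 *bytes)
	{
		*bytes = user_pool_bytes;	/* read back exactly */
		return 0;
	}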

> > Do we need this knob?
> > Can we not assume that shared-pool-count == 0 means disabled?  
> Do you mean assume or not assume?

Sorry for the double negation (:

> I guess you mean assume, so use "shared-pool-count == 0" to indicate
> disabled?
> That will also work, so we only need to introduce one attribute.

SG!


* Re: [PATCH RFC v2 net-next 1/2] devlink: Add shared descriptor eswitch attr
  2024-03-06  2:30     ` Jakub Kicinski
@ 2024-03-06  5:18       ` William Tu
  0 siblings, 0 replies; 8+ messages in thread
From: William Tu @ 2024-03-06  5:18 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: netdev, jiri, bodong, tariqt, yossiku



On 3/5/24 6:30 PM, Jakub Kicinski wrote:
> On Tue, 5 Mar 2024 16:27:50 -0800 William Tu wrote:
>>> Can we use bytes as the unit? Like the page pool. Descriptors don't
>>> mean much to the user.
>> But how about the unit size? Do we assume unit size = 1 page?
>> The page pool has:
>> order: 2^order pages per allocation
>> pool_size: size of the ptr_ring
>>
>> How about we assume that order is 0 and let the user set pool_size
>> (the number of page-size entries)?
> Do you mean because the user doesn't know the granularity,
> e.g. we can't allocate 12345 bytes most likely?
>
> For shared buffer (the switch buffer configuration API)
> we report cell size IIRC. In ethtool a lot of drivers just
> round up to whatever they support. We could also treat
> the user-configured value as "upper bound" and effectively
> round down but keep what the user configured exactly, when
> they read back. I like the last one the most, if it makes sense.
Got it, I think that also works (round down and keep the user's config).
I will work on the next version, thanks!
William
>>> Do we need this knob?
>>> Can we not assume that shared-pool-count == 0 means disabled?
>> Do you mean assume or not assume?
> Sorry for the double negation (:
>
>> I guess you mean assume, so use "shared-pool-count == 0" to indicate
>> disabled?
>> That will also work, so we only need to introduce one attribute.
> SG!


