* [RFC PATCH 0/7] net: openvswitch: Reduce stack usage
@ 2023-09-27  0:13 Nicholas Piggin
  2023-09-27  0:13 ` [RFC PATCH 1/7] net: openvswitch: Move NSH buffer out of do_execute_actions Nicholas Piggin
                   ` (7 more replies)
  0 siblings, 8 replies; 28+ messages in thread
From: Nicholas Piggin @ 2023-09-27  0:13 UTC (permalink / raw)
  To: netdev; +Cc: Nicholas Piggin, dev, Pravin B Shelar

Hi,

We've got a report of a stack overflow on ppc64le with a 16kB kernel
stack. Openvswitch is just one of many things in the stack, but it
does cause recursion and contributes to some usage.

Here are a few patches for reducing stack overhead. I don't know the
code well so consider them just ideas. GFP_ATOMIC allocations
introduced in a couple of places might be controversial, but there
is still some savings to be had if you skip those.

Here is one place detected where the stack reaches >14kB before
overflowing a little later. I massaged the output so it just shows
the stack frame address on the left.

[c00000037d480b40] __kmalloc+0x8c/0x5e0
[c00000037d480bc0] virtqueue_add_outbuf+0x354/0xac0
[c00000037d480cc0] xmit_skb+0x1dc/0x350 [virtio_net]
[c00000037d480d50] start_xmit+0xd4/0x3b0 [virtio_net]
[c00000037d480e00] dev_hard_start_xmit+0x11c/0x280
[c00000037d480e80] sch_direct_xmit+0xec/0x330
[c00000037d480f20] __dev_xmit_skb+0x41c/0xa80
[c00000037d480f90] __dev_queue_xmit+0x414/0x950
[c00000037d481070] ovs_vport_send+0xb4/0x210 [openvswitch]
[c00000037d4810f0] do_output+0x7c/0x200 [openvswitch]
[c00000037d481140] do_execute_actions+0xe48/0xeb0 [openvswitch]
[c00000037d481300] ovs_execute_actions+0x78/0x1f0 [openvswitch]
[c00000037d481380] ovs_dp_process_packet+0xb4/0x2e0 [openvswitch]
[c00000037d481450] ovs_vport_receive+0x8c/0x130 [openvswitch]
[c00000037d481660] internal_dev_xmit+0x40/0xd0 [openvswitch]
[c00000037d481690] dev_hard_start_xmit+0x11c/0x280
[c00000037d481710] __dev_queue_xmit+0x634/0x950
[c00000037d4817f0] neigh_hh_output+0xd0/0x180
[c00000037d481840] ip_finish_output2+0x31c/0x5c0
[c00000037d4818e0] ip_local_out+0x64/0x90
[c00000037d481920] iptunnel_xmit+0x194/0x290
[c00000037d4819c0] udp_tunnel_xmit_skb+0x100/0x140 [udp_tunnel]
[c00000037d481a80] geneve_xmit_skb+0x34c/0x610 [geneve]
[c00000037d481bb0] geneve_xmit+0x94/0x1e8 [geneve]
[c00000037d481c30] dev_hard_start_xmit+0x11c/0x280
[c00000037d481cb0] __dev_queue_xmit+0x634/0x950
[c00000037d481d90] ovs_vport_send+0xb4/0x210 [openvswitch]
[c00000037d481e10] do_output+0x7c/0x200 [openvswitch]
[c00000037d481e60] do_execute_actions+0xe48/0xeb0 [openvswitch]
[c00000037d482020] ovs_execute_actions+0x78/0x1f0 [openvswitch]
[c00000037d4820a0] ovs_dp_process_packet+0xb4/0x2e0 [openvswitch]
[c00000037d482170] clone_execute+0x2c8/0x370 [openvswitch]
[c00000037d482210] do_execute_actions+0x4b8/0xeb0 [openvswitch]
[c00000037d4823d0] ovs_execute_actions+0x78/0x1f0 [openvswitch]
[c00000037d482450] ovs_dp_process_packet+0xb4/0x2e0 [openvswitch]
[c00000037d482520] ovs_vport_receive+0x8c/0x130 [openvswitch]
[c00000037d482730] internal_dev_xmit+0x40/0xd0 [openvswitch]
[c00000037d482760] dev_hard_start_xmit+0x11c/0x280
[c00000037d4827e0] __dev_queue_xmit+0x634/0x950
[c00000037d4828c0] neigh_hh_output+0xd0/0x180
[c00000037d482910] ip_finish_output2+0x31c/0x5c0
[c00000037d4829b0] __ip_queue_xmit+0x1b0/0x4f0
[c00000037d482a40] __tcp_transmit_skb+0x450/0x9a0
[c00000037d482b10] tcp_write_xmit+0x4e0/0xb40
[c00000037d482be0] __tcp_push_pending_frames+0x44/0x130
[c00000037d482c50] __tcp_sock_set_cork.part.0+0x8c/0xb0
[c00000037d482c80] tcp_sock_set_cork+0x78/0xa0
[c00000037d482cb0] xs_tcp_send_request+0x2d4/0x430 [sunrpc]
[c00000037d482e50] xprt_request_transmit.constprop.0+0xa8/0x3c0 [sunrpc]
[c00000037d482eb0] xprt_transmit+0x12c/0x260 [sunrpc]
[c00000037d482f20] call_transmit+0xd0/0x100 [sunrpc]
[c00000037d482f50] __rpc_execute+0xec/0x570 [sunrpc]
[c00000037d482fd0] rpc_execute+0x168/0x1d0 [sunrpc]
[c00000037d483010] rpc_run_task+0x1cc/0x2a0 [sunrpc]
[c00000037d483070] nfs4_call_sync_sequence+0x98/0x100 [nfsv4]
[c00000037d483120] _nfs4_server_capabilities+0xd4/0x3c0 [nfsv4]
[c00000037d483210] nfs4_server_capabilities+0x74/0xd0 [nfsv4]
[c00000037d483270] nfs4_proc_get_root+0x3c/0x150 [nfsv4]
[c00000037d4832f0] nfs_get_root+0xac/0x660 [nfs]
[c00000037d483420] nfs_get_tree_common+0x104/0x5f0 [nfs]
[c00000037d4834b0] nfs_get_tree+0x90/0xc0 [nfs]
[c00000037d4834e0] vfs_get_tree+0x48/0x160
[c00000037d483560] nfs_do_submount+0x170/0x210 [nfs]
[c00000037d483600] nfs4_submount+0x250/0x360 [nfsv4]
[c00000037d4836b0] nfs_d_automount+0x194/0x2d0 [nfs]
[c00000037d483710] __traverse_mounts+0x114/0x330
[c00000037d483770] step_into+0x364/0x4d0
[c00000037d4837f0] walk_component+0x8c/0x300
[c00000037d483870] path_lookupat+0xa8/0x260
[c00000037d4838c0] filename_lookup+0xc8/0x230
[c00000037d483a00] vfs_path_lookup+0x68/0xc0
[c00000037d483a60] mount_subtree+0xd0/0x1e0
[c00000037d483ad0] do_nfs4_mount+0x280/0x520 [nfsv4]
[c00000037d483ba0] nfs4_try_get_tree+0x60/0x140 [nfsv4]
[c00000037d483c20] nfs_get_tree+0x60/0xc0 [nfs]
[c00000037d483c50] vfs_get_tree+0x48/0x160
[c00000037d483cd0] do_new_mount+0x204/0x3c0
[c00000037d483d40] sys_mount+0x168/0x1c0
[c00000037d483db0] system_call_exception+0x164/0x310
[c00000037d483e10] system_call_vectored_common+0xe8/0x278

That's hard to decipher, so here are all the stack frames sorted by
size, with the number of appearances if > 1:

528 ovs_vport_receive (x2)
448 do_execute_actions (x3)
416 xs_tcp_send_request
320 filename_lookup
304 nfs_get_root
304 geneve_xmit_skb
256 virtqueue_add_outbuf
240 _nfs4_server_capabilities
224 __dev_queue_xmit (x4)
208 tcp_write_xmit
208 __tcp_transmit_skb
208 ovs_dp_process_packet (x3)
208 do_nfs4_mount
192 udp_tunnel_xmit_skb
176 start_xmit
176 nfs4_submount
176 nfs4_call_sync_sequence
160 sch_direct_xmit
160 nfs_do_submount
160 iptunnel_xmit
160 ip_finish_output2 (x2)
160 clone_execute
144 xmit_skb
144 nfs_get_tree_common
144 __ip_queue_xmit
128 walk_component
128 vfs_get_tree (x2)
128 step_into
128 __rpc_execute
128 ovs_vport_send (x2)
128 ovs_execute_actions (x3)
128 nfs4_try_get_tree
128 nfs4_proc_get_root
128 __kmalloc
128 geneve_xmit
128 dev_hard_start_xmit (x4)
112 xprt_transmit
112 __tcp_push_pending_frames
112 sys_mount
112 mount_subtree
112 do_new_mount
112 __dev_xmit_skb
96 xprt_request_transmit.constprop.0
96 vfs_path_lookup
96 __traverse_mounts
96 system_call_exception
96 rpc_run_task
96 nfs_d_automount
96 nfs4_server_capabilities
80 path_lookupat
80 neigh_hh_output (x2)
80 do_output (x2)
64 rpc_execute
64 ip_local_out
48 __tcp_sock_set_cork.part.0
48 tcp_sock_set_cork
48 nfs_get_tree (x2)
48 internal_dev_xmit (x2)
48 call_transmit
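
(Each size above is simply the difference between successive frame
addresses in the trace; e.g. __kmalloc's frame runs from
c00000037d480b40 to c00000037d480bc0, i.e. 0x80 = 128 bytes.)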

Thanks,
Nick

Nicholas Piggin (7):
  net: openvswitch: Move NSH buffer out of do_execute_actions
  net: openvswitch: Reduce execute_push_nsh stack overhead
  net: openvswitch: uninline action execution
  net: openvswitch: ovs_vport_receive reduce stack usage
  net: openvswitch: uninline ovs_fragment to control stack usage
  net: openvswitch: Reduce ovs_fragment stack usage
  net: openvswitch: Reduce stack usage in ovs_dp_process_packet

 net/openvswitch/actions.c  | 139 +++++++++++++++++++++++--------------
 net/openvswitch/datapath.c |  55 ++++++++-------
 net/openvswitch/drop.h     |   1 +
 net/openvswitch/vport.c    |  14 +++-
 4 files changed, 129 insertions(+), 80 deletions(-)

-- 
2.40.1



* [RFC PATCH 1/7] net: openvswitch: Move NSH buffer out of do_execute_actions
  2023-09-27  0:13 [RFC PATCH 0/7] net: openvswitch: Reduce stack usage Nicholas Piggin
@ 2023-09-27  0:13 ` Nicholas Piggin
  2023-09-27  8:26   ` [ovs-dev] " Ilya Maximets
  2023-09-27  0:13 ` [RFC PATCH 2/7] net: openvswitch: Reduce execute_push_nsh stack overhead Nicholas Piggin
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Nicholas Piggin @ 2023-09-27  0:13 UTC (permalink / raw)
  To: netdev; +Cc: Nicholas Piggin, dev, Pravin B Shelar

This takes do_execute_actions stack use from 544 bytes to 288
bytes. execute_push_nsh uses 336 bytes, but it is a leaf call not
involved in recursion.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 net/openvswitch/actions.c | 27 +++++++++++++++++----------
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index fd66014d8a76..8933caa92794 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -1286,6 +1286,21 @@ static int execute_dec_ttl(struct sk_buff *skb, struct sw_flow_key *key)
 	return 0;
 }
 
+static noinline_for_stack int execute_push_nsh(struct sk_buff *skb,
+					       struct sw_flow_key *key,
+					       const struct nlattr *attr)
+{
+	u8 buffer[NSH_HDR_MAX_LEN];
+	struct nshhdr *nh = (struct nshhdr *)buffer;
+	int err;
+
+	err = nsh_hdr_from_nlattr(attr, nh, NSH_HDR_MAX_LEN);
+	if (likely(!err))
+		err = push_nsh(skb, key, nh);
+
+	return err;
+}
+
 /* Execute a list of actions against 'skb'. */
 static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
 			      struct sw_flow_key *key,
@@ -1439,17 +1454,9 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
 			err = pop_eth(skb, key);
 			break;
 
-		case OVS_ACTION_ATTR_PUSH_NSH: {
-			u8 buffer[NSH_HDR_MAX_LEN];
-			struct nshhdr *nh = (struct nshhdr *)buffer;
-
-			err = nsh_hdr_from_nlattr(nla_data(a), nh,
-						  NSH_HDR_MAX_LEN);
-			if (unlikely(err))
-				break;
-			err = push_nsh(skb, key, nh);
+		case OVS_ACTION_ATTR_PUSH_NSH:
+			err = execute_push_nsh(skb, key, nla_data(a));
 			break;
-		}
 
 		case OVS_ACTION_ATTR_POP_NSH:
 			err = pop_nsh(skb, key);
-- 
2.40.1



* [RFC PATCH 2/7] net: openvswitch: Reduce execute_push_nsh stack overhead
  2023-09-27  0:13 [RFC PATCH 0/7] net: openvswitch: Reduce stack usage Nicholas Piggin
  2023-09-27  0:13 ` [RFC PATCH 1/7] net: openvswitch: Move NSH buffer out of do_execute_actions Nicholas Piggin
@ 2023-09-27  0:13 ` Nicholas Piggin
  2023-09-27  0:13 ` [RFC PATCH 3/7] net: openvswitch: uninline action execution Nicholas Piggin
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 28+ messages in thread
From: Nicholas Piggin @ 2023-09-27  0:13 UTC (permalink / raw)
  To: netdev; +Cc: Nicholas Piggin, dev, Pravin B Shelar

Use dynamic allocation to reduce execute_push_nsh stack consumption
from 336 bytes to 64 bytes, at the cost of introducing a GFP_ATOMIC
allocation.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 net/openvswitch/actions.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 8933caa92794..af177701a606 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -1290,13 +1290,17 @@ static noinline_for_stack int execute_push_nsh(struct sk_buff *skb,
 					       struct sw_flow_key *key,
 					       const struct nlattr *attr)
 {
-	u8 buffer[NSH_HDR_MAX_LEN];
-	struct nshhdr *nh = (struct nshhdr *)buffer;
+	struct nshhdr *nh;
 	int err;
 
+	nh = kmalloc(NSH_HDR_MAX_LEN, GFP_ATOMIC);
+	if (unlikely(!nh))
+		return -ENOMEM; /* XXX: should this skip action like clone? */
+
 	err = nsh_hdr_from_nlattr(attr, nh, NSH_HDR_MAX_LEN);
 	if (likely(!err))
 		err = push_nsh(skb, key, nh);
+	kfree(nh);
 
 	return err;
 }
-- 
2.40.1



* [RFC PATCH 3/7] net: openvswitch: uninline action execution
  2023-09-27  0:13 [RFC PATCH 0/7] net: openvswitch: Reduce stack usage Nicholas Piggin
  2023-09-27  0:13 ` [RFC PATCH 1/7] net: openvswitch: Move NSH buffer out of do_execute_actions Nicholas Piggin
  2023-09-27  0:13 ` [RFC PATCH 2/7] net: openvswitch: Reduce execute_push_nsh stack overhead Nicholas Piggin
@ 2023-09-27  0:13 ` Nicholas Piggin
  2023-09-27  0:13 ` [RFC PATCH 4/7] net: openvswitch: ovs_vport_receive reduce stack usage Nicholas Piggin
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 28+ messages in thread
From: Nicholas Piggin @ 2023-09-27  0:13 UTC (permalink / raw)
  To: netdev; +Cc: Nicholas Piggin, dev, Pravin B Shelar

A function tends to use as much stack as the maximum of any control
flow path. Compilers can "shrink wrap" to special-case stack allocation,
but that only works well for exclusive paths. The switch statement in
the loop in do_execute_actions uses as much stack as the maximum of its
cases, and so inlining large actions increases overall stack usage. This
is particularly bad because the actions that cause recursion are not the
largest stack users.

Uninline action execution functions, which reduces the stack usage of
do_execute_actions from 288 bytes to 112 bytes.
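
To illustrate (a hypothetical sketch, not code from this patch; all
names here are invented):

	static int do_actions(struct sk_buff *skb, int type)
	{
		switch (type) {
		case ACT_BIG: {
			u8 buf[256];	/* carved out of do_actions' frame */

			return act_use(skb, buf);
		}
		case ACT_RECURSE:
			/* the recursive path still pays for those 256 bytes */
			return do_actions(skb, act_next(skb));
		}
		return 0;
	}

Outlining the ACT_BIG case with noinline_for_stack moves buf into its
own frame, which is gone by the time the recursive case runs.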

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 net/openvswitch/actions.c | 69 +++++++++++++++++++++------------------
 1 file changed, 38 insertions(+), 31 deletions(-)

diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index af177701a606..b4d4150c5e69 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -913,8 +913,9 @@ static void ovs_fragment(struct net *net, struct vport *vport,
 	ovs_kfree_skb_reason(skb, reason);
 }
 
-static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port,
-		      struct sw_flow_key *key)
+static noinline_for_stack void do_output(struct datapath *dp,
+					 struct sk_buff *skb, int out_port,
+					 struct sw_flow_key *key)
 {
 	struct vport *vport = ovs_vport_rcu(dp, out_port);
 
@@ -944,10 +945,11 @@ static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port,
 	}
 }
 
-static int output_userspace(struct datapath *dp, struct sk_buff *skb,
-			    struct sw_flow_key *key, const struct nlattr *attr,
-			    const struct nlattr *actions, int actions_len,
-			    uint32_t cutlen)
+static noinline_for_stack
+int output_userspace(struct datapath *dp, struct sk_buff *skb,
+		     struct sw_flow_key *key, const struct nlattr *attr,
+		     const struct nlattr *actions, int actions_len,
+		     uint32_t cutlen)
 {
 	struct dp_upcall_info upcall;
 	const struct nlattr *a;
@@ -1022,9 +1024,9 @@ static int dec_ttl_exception_handler(struct datapath *dp, struct sk_buff *skb,
  * Otherwise, sample() should keep 'skb' intact regardless what
  * actions are executed within sample().
  */
-static int sample(struct datapath *dp, struct sk_buff *skb,
-		  struct sw_flow_key *key, const struct nlattr *attr,
-		  bool last)
+static noinline_for_stack int sample(struct datapath *dp, struct sk_buff *skb,
+				     struct sw_flow_key *key,
+				     const struct nlattr *attr, bool last)
 {
 	struct nlattr *actions;
 	struct nlattr *sample_arg;
@@ -1053,9 +1055,10 @@ static int sample(struct datapath *dp, struct sk_buff *skb,
  * Otherwise, clone() should keep 'skb' intact regardless what
  * actions are executed within clone().
  */
-static int clone(struct datapath *dp, struct sk_buff *skb,
-		 struct sw_flow_key *key, const struct nlattr *attr,
-		 bool last)
+static noinline_for_stack int clone(struct datapath *dp,
+				    struct sk_buff *skb,
+				    struct sw_flow_key *key,
+				    const struct nlattr *attr, bool last)
 {
 	struct nlattr *actions;
 	struct nlattr *clone_arg;
@@ -1071,8 +1074,9 @@ static int clone(struct datapath *dp, struct sk_buff *skb,
 			     !dont_clone_flow_key);
 }
 
-static void execute_hash(struct sk_buff *skb, struct sw_flow_key *key,
-			 const struct nlattr *attr)
+static noinline_for_stack void execute_hash(struct sk_buff *skb,
+					    struct sw_flow_key *key,
+					    const struct nlattr *attr)
 {
 	struct ovs_action_hash *hash_act = nla_data(attr);
 	u32 hash = 0;
@@ -1094,9 +1098,9 @@ static void execute_hash(struct sk_buff *skb, struct sw_flow_key *key,
 	key->ovs_flow_hash = hash;
 }
 
-static int execute_set_action(struct sk_buff *skb,
-			      struct sw_flow_key *flow_key,
-			      const struct nlattr *a)
+static noinline_for_stack int execute_set_action(struct sk_buff *skb,
+						 struct sw_flow_key *flow_key,
+						 const struct nlattr *a)
 {
 	/* Only tunnel set execution is supported without a mask. */
 	if (nla_type(a) == OVS_KEY_ATTR_TUNNEL_INFO) {
@@ -1114,9 +1118,9 @@ static int execute_set_action(struct sk_buff *skb,
 /* Mask is at the midpoint of the data. */
 #define get_mask(a, type) ((const type)nla_data(a) + 1)
 
-static int execute_masked_set_action(struct sk_buff *skb,
-				     struct sw_flow_key *flow_key,
-				     const struct nlattr *a)
+static noinline_for_stack
+int execute_masked_set_action(struct sk_buff *skb, struct sw_flow_key *flow_key,
+			      const struct nlattr *a)
 {
 	int err = 0;
 
@@ -1189,9 +1193,9 @@ static int execute_masked_set_action(struct sk_buff *skb,
 	return err;
 }
 
-static int execute_recirc(struct datapath *dp, struct sk_buff *skb,
-			  struct sw_flow_key *key,
-			  const struct nlattr *a, bool last)
+static noinline_for_stack
+int execute_recirc(struct datapath *dp, struct sk_buff *skb,
+		   struct sw_flow_key *key, const struct nlattr *a, bool last)
 {
 	u32 recirc_id;
 
@@ -1208,9 +1212,10 @@ static int execute_recirc(struct datapath *dp, struct sk_buff *skb,
 	return clone_execute(dp, skb, key, recirc_id, NULL, 0, last, true);
 }
 
-static int execute_check_pkt_len(struct datapath *dp, struct sk_buff *skb,
-				 struct sw_flow_key *key,
-				 const struct nlattr *attr, bool last)
+static noinline_for_stack
+int execute_check_pkt_len(struct datapath *dp, struct sk_buff *skb,
+			  struct sw_flow_key *key, const struct nlattr *attr,
+			  bool last)
 {
 	struct ovs_skb_cb *ovs_cb = OVS_CB(skb);
 	const struct nlattr *actions, *cpl_arg;
@@ -1247,7 +1252,8 @@ static int execute_check_pkt_len(struct datapath *dp, struct sk_buff *skb,
 			     nla_len(actions), last, clone_flow_key);
 }
 
-static int execute_dec_ttl(struct sk_buff *skb, struct sw_flow_key *key)
+static noinline_for_stack int execute_dec_ttl(struct sk_buff *skb,
+					      struct sw_flow_key *key)
 {
 	int err;
 
@@ -1526,10 +1532,11 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
  * The execution may be deferred in case the actions can not be executed
  * immediately.
  */
-static int clone_execute(struct datapath *dp, struct sk_buff *skb,
-			 struct sw_flow_key *key, u32 recirc_id,
-			 const struct nlattr *actions, int len,
-			 bool last, bool clone_flow_key)
+static noinline_for_stack
+int clone_execute(struct datapath *dp, struct sk_buff *skb,
+		  struct sw_flow_key *key, u32 recirc_id,
+		  const struct nlattr *actions, int len, bool last,
+		  bool clone_flow_key)
 {
 	struct deferred_action *da;
 	struct sw_flow_key *clone;
-- 
2.40.1



* [RFC PATCH 4/7] net: openvswitch: ovs_vport_receive reduce stack usage
  2023-09-27  0:13 [RFC PATCH 0/7] net: openvswitch: Reduce stack usage Nicholas Piggin
                   ` (2 preceding siblings ...)
  2023-09-27  0:13 ` [RFC PATCH 3/7] net: openvswitch: uninline action execution Nicholas Piggin
@ 2023-09-27  0:13 ` Nicholas Piggin
  2023-09-28 15:26   ` [ovs-dev] " Aaron Conole
  2023-09-27  0:13 ` [RFC PATCH 5/7] net: openvswitch: uninline ovs_fragment to control " Nicholas Piggin
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Nicholas Piggin @ 2023-09-27  0:13 UTC (permalink / raw)
  To: netdev; +Cc: Nicholas Piggin, dev, Pravin B Shelar

Dynamically allocating the sw_flow_key reduces stack usage of
ovs_vport_receive from 544 bytes to 64 bytes at the cost of
another GFP_ATOMIC allocation in the receive path.

XXX: is this a problem with memory reserves if ovs is in a
memory reclaim path, or since we have a skb allocated, is it
okay to use some GFP_ATOMIC reserves?

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 net/openvswitch/vport.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
index 972ae01a70f7..ddba3e00832b 100644
--- a/net/openvswitch/vport.c
+++ b/net/openvswitch/vport.c
@@ -494,9 +494,13 @@ u32 ovs_vport_find_upcall_portid(const struct vport *vport,
 int ovs_vport_receive(struct vport *vport, struct sk_buff *skb,
 		      const struct ip_tunnel_info *tun_info)
 {
-	struct sw_flow_key key;
+	struct sw_flow_key *key;
 	int error;
 
+	key = kmalloc(sizeof(*key), GFP_ATOMIC);
+	if (!key)
+		return -ENOMEM;
+
 	OVS_CB(skb)->input_vport = vport;
 	OVS_CB(skb)->mru = 0;
 	OVS_CB(skb)->cutlen = 0;
@@ -510,12 +514,16 @@ int ovs_vport_receive(struct vport *vport, struct sk_buff *skb,
 	}
 
 	/* Extract flow from 'skb' into 'key'. */
-	error = ovs_flow_key_extract(tun_info, skb, &key);
+	error = ovs_flow_key_extract(tun_info, skb, key);
 	if (unlikely(error)) {
 		kfree_skb(skb);
+		kfree(key);
 		return error;
 	}
-	ovs_dp_process_packet(skb, &key);
+	ovs_dp_process_packet(skb, key);
+
+	kfree(key);
+
 	return 0;
 }
 
-- 
2.40.1



* [RFC PATCH 5/7] net: openvswitch: uninline ovs_fragment to control stack usage
  2023-09-27  0:13 [RFC PATCH 0/7] net: openvswitch: Reduce stack usage Nicholas Piggin
                   ` (3 preceding siblings ...)
  2023-09-27  0:13 ` [RFC PATCH 4/7] net: openvswitch: ovs_vport_receive reduce stack usage Nicholas Piggin
@ 2023-09-27  0:13 ` Nicholas Piggin
  2023-09-27  0:13 ` [RFC PATCH 6/7] net: openvswitch: Reduce ovs_fragment " Nicholas Piggin
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 28+ messages in thread
From: Nicholas Piggin @ 2023-09-27  0:13 UTC (permalink / raw)
  To: netdev; +Cc: Nicholas Piggin, dev, Pravin B Shelar

ovs_fragment uses a lot of stack, 400 bytes. It is a leaf function
but its caller do_output is involved in openvswitch recursion.
GCC 13.2 for powerpc64le is not inlining it, but it only has a single
call site, so it is liable to be inlined.

Mark it noinline_for_stack, to ensure it doesn't bloat stack use in
the recursive path.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 net/openvswitch/actions.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index b4d4150c5e69..12ad998b70e2 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -849,9 +849,9 @@ static void prepare_frag(struct vport *vport, struct sk_buff *skb,
 	skb_pull(skb, hlen);
 }
 
-static void ovs_fragment(struct net *net, struct vport *vport,
-			 struct sk_buff *skb, u16 mru,
-			 struct sw_flow_key *key)
+static noinline_for_stack
+void ovs_fragment(struct net *net, struct vport *vport, struct sk_buff *skb,
+		  u16 mru, struct sw_flow_key *key)
 {
 	enum ovs_drop_reason reason;
 	u16 orig_network_offset = 0;
-- 
2.40.1



* [RFC PATCH 6/7] net: openvswitch: Reduce ovs_fragment stack usage
  2023-09-27  0:13 [RFC PATCH 0/7] net: openvswitch: Reduce stack usage Nicholas Piggin
                   ` (4 preceding siblings ...)
  2023-09-27  0:13 ` [RFC PATCH 5/7] net: openvswitch: uninline ovs_fragment to control " Nicholas Piggin
@ 2023-09-27  0:13 ` Nicholas Piggin
  2023-09-27  0:13 ` [RFC PATCH 7/7] net: openvswitch: Reduce stack usage in ovs_dp_process_packet Nicholas Piggin
  2023-09-27  8:36 ` [ovs-dev] [RFC PATCH 0/7] net: openvswitch: Reduce stack usage Ilya Maximets
  7 siblings, 0 replies; 28+ messages in thread
From: Nicholas Piggin @ 2023-09-27  0:13 UTC (permalink / raw)
  To: netdev; +Cc: Nicholas Piggin, dev, Pravin B Shelar

Allocate the dst route dynamically rather than on stack, reducing
ovs_fragment stack usage from 400 to 160 bytes, at a cost of a
GFP_ATOMIC allocation.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 net/openvswitch/actions.c | 33 ++++++++++++++++++++++++---------
 net/openvswitch/drop.h    |  1 +
 2 files changed, 25 insertions(+), 9 deletions(-)

diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 12ad998b70e2..a6e10f59838f 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -868,38 +868,53 @@ void ovs_fragment(struct net *net, struct vport *vport, struct sk_buff *skb,
 	}
 
 	if (key->eth.type == htons(ETH_P_IP)) {
-		struct rtable ovs_rt = { 0 };
+		struct rtable *ovs_rt;
 		unsigned long orig_dst;
 
+		ovs_rt = kzalloc(sizeof(*ovs_rt), GFP_ATOMIC);
+		if (!ovs_rt) {
+			OVS_NLERR(1, "No memory to fragment");
+			reason = OVS_DROP_NOMEM;
+			goto err;
+		}
+
 		prepare_frag(vport, skb, orig_network_offset,
 			     ovs_key_mac_proto(key));
-		dst_init(&ovs_rt.dst, &ovs_dst_ops, NULL, 1,
+		dst_init(&ovs_rt->dst, &ovs_dst_ops, NULL, 1,
 			 DST_OBSOLETE_NONE, DST_NOCOUNT);
-		ovs_rt.dst.dev = vport->dev;
+		ovs_rt->dst.dev = vport->dev;
 
 		orig_dst = skb->_skb_refdst;
-		skb_dst_set_noref(skb, &ovs_rt.dst);
+		skb_dst_set_noref(skb, &ovs_rt->dst);
 		IPCB(skb)->frag_max_size = mru;
 
 		ip_do_fragment(net, skb->sk, skb, ovs_vport_output);
 		refdst_drop(orig_dst);
+		kfree(ovs_rt);
 	} else if (key->eth.type == htons(ETH_P_IPV6)) {
 		unsigned long orig_dst;
-		struct rt6_info ovs_rt;
+		struct rt6_info *ovs_rt;
+
+		ovs_rt = kzalloc(sizeof(*ovs_rt), GFP_ATOMIC);
+		if (!ovs_rt) {
+			OVS_NLERR(1, "No memory to fragment");
+			reason = OVS_DROP_NOMEM;
+			goto err;
+		}
 
 		prepare_frag(vport, skb, orig_network_offset,
 			     ovs_key_mac_proto(key));
-		memset(&ovs_rt, 0, sizeof(ovs_rt));
-		dst_init(&ovs_rt.dst, &ovs_dst_ops, NULL, 1,
+		dst_init(&ovs_rt->dst, &ovs_dst_ops, NULL, 1,
 			 DST_OBSOLETE_NONE, DST_NOCOUNT);
-		ovs_rt.dst.dev = vport->dev;
+		ovs_rt->dst.dev = vport->dev;
 
 		orig_dst = skb->_skb_refdst;
-		skb_dst_set_noref(skb, &ovs_rt.dst);
+		skb_dst_set_noref(skb, &ovs_rt->dst);
 		IP6CB(skb)->frag_max_size = mru;
 
 		ipv6_stub->ipv6_fragment(net, skb->sk, skb, ovs_vport_output);
 		refdst_drop(orig_dst);
+		kfree(ovs_rt);
 	} else {
 		WARN_ONCE(1, "Failed fragment ->%s: eth=%04x, MRU=%d, MTU=%d.",
 			  ovs_vport_name(vport), ntohs(key->eth.type), mru,
diff --git a/net/openvswitch/drop.h b/net/openvswitch/drop.h
index cedf9b7b5796..0bf156867a69 100644
--- a/net/openvswitch/drop.h
+++ b/net/openvswitch/drop.h
@@ -20,6 +20,7 @@
 	R(OVS_DROP_FRAG_INVALID_PROTO)		\
 	R(OVS_DROP_CONNTRACK)			\
 	R(OVS_DROP_IP_TTL)			\
+	R(OVS_DROP_NOMEM)			\
 	/* deliberate comment for trailing \ */
 
 enum ovs_drop_reason {
-- 
2.40.1



* [RFC PATCH 7/7] net: openvswitch: Reduce stack usage in ovs_dp_process_packet
  2023-09-27  0:13 [RFC PATCH 0/7] net: openvswitch: Reduce stack usage Nicholas Piggin
                   ` (5 preceding siblings ...)
  2023-09-27  0:13 ` [RFC PATCH 6/7] net: openvswitch: Reduce ovs_fragment " Nicholas Piggin
@ 2023-09-27  0:13 ` Nicholas Piggin
  2023-09-27 15:22   ` kernel test robot
  2023-09-27  8:36 ` [ovs-dev] [RFC PATCH 0/7] net: openvswitch: Reduce stack usage Ilya Maximets
  7 siblings, 1 reply; 28+ messages in thread
From: Nicholas Piggin @ 2023-09-27  0:13 UTC (permalink / raw)
  To: netdev; +Cc: Nicholas Piggin, dev, Pravin B Shelar

The upcall in ovs_dp_process_packet uses a lot of stack and is not
involved in the recursive call. Move it into an out-of-line function,
reducing stack overhead from 144 to 96 bytes.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 net/openvswitch/datapath.c | 55 +++++++++++++++++++++-----------------
 1 file changed, 31 insertions(+), 24 deletions(-)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 11c69415c605..bdbbdd556c4a 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -242,6 +242,37 @@ void ovs_dp_detach_port(struct vport *p)
 	ovs_vport_del(p);
 }
 
+static noinline_for_stack
+void do_packet_upcall(struct sk_buff *skb, struct sw_flow_key *key,
+		      const struct vport *p, struct datapath *dp)
+{
+	struct dp_upcall_info upcall;
+	int error;
+
+	memset(&upcall, 0, sizeof(upcall));
+	upcall.cmd = OVS_PACKET_CMD_MISS;
+
+	if (dp->user_features & OVS_DP_F_DISPATCH_UPCALL_PER_CPU)
+		upcall.portid =
+		    ovs_dp_get_upcall_portid(dp, smp_processor_id());
+	else
+		upcall.portid = ovs_vport_find_upcall_portid(p, skb);
+
+	upcall.mru = OVS_CB(skb)->mru;
+	error = ovs_dp_upcall(dp, skb, key, &upcall, 0);
+	switch (error) {
+	case 0:
+	case -EAGAIN:
+	case -ERESTARTSYS:
+	case -EINTR:
+		consume_skb(skb);
+		break;
+	default:
+		kfree_skb(skb);
+		break;
+	}
+}
+
 /* Must be called with rcu_read_lock. */
 void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
 {
@@ -261,30 +292,6 @@ void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
 	flow = ovs_flow_tbl_lookup_stats(&dp->table, key, skb_get_hash(skb),
 					 &n_mask_hit, &n_cache_hit);
 	if (unlikely(!flow)) {
-		struct dp_upcall_info upcall;
-
-		memset(&upcall, 0, sizeof(upcall));
-		upcall.cmd = OVS_PACKET_CMD_MISS;
-
-		if (dp->user_features & OVS_DP_F_DISPATCH_UPCALL_PER_CPU)
-			upcall.portid =
-			    ovs_dp_get_upcall_portid(dp, smp_processor_id());
-		else
-			upcall.portid = ovs_vport_find_upcall_portid(p, skb);
-
-		upcall.mru = OVS_CB(skb)->mru;
-		error = ovs_dp_upcall(dp, skb, key, &upcall, 0);
-		switch (error) {
-		case 0:
-		case -EAGAIN:
-		case -ERESTARTSYS:
-		case -EINTR:
-			consume_skb(skb);
-			break;
-		default:
-			kfree_skb(skb);
-			break;
-		}
 		stats_counter = &stats->n_missed;
 		goto out;
 	}
-- 
2.40.1



* Re: [ovs-dev] [RFC PATCH 1/7] net: openvswitch: Move NSH buffer out of do_execute_actions
  2023-09-27  0:13 ` [RFC PATCH 1/7] net: openvswitch: Move NSH buffer out of do_execute_actions Nicholas Piggin
@ 2023-09-27  8:26   ` Ilya Maximets
  2023-09-27 10:03     ` Nicholas Piggin
  0 siblings, 1 reply; 28+ messages in thread
From: Ilya Maximets @ 2023-09-27  8:26 UTC (permalink / raw)
  To: Nicholas Piggin, netdev; +Cc: dev, i.maximets

On 9/27/23 02:13, Nicholas Piggin wrote:
> This takes do_execute_actions stack use from 544 bytes to 288
> bytes. execute_push_nsh uses 336 bytes, but it is a leaf call not
> involved in recursion.
> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
>  net/openvswitch/actions.c | 27 +++++++++++++++++----------
>  1 file changed, 17 insertions(+), 10 deletions(-)

Hi, Nicholas.  I made the same change about a week ago:
  https://lore.kernel.org/netdev/20230921194314.1976605-1-i.maximets@ovn.org/
So, you can drop this patch from your set.

Best regards, Ilya Maximets.


* Re: [ovs-dev] [RFC PATCH 0/7] net: openvswitch: Reduce stack usage
  2023-09-27  0:13 [RFC PATCH 0/7] net: openvswitch: Reduce stack usage Nicholas Piggin
                   ` (6 preceding siblings ...)
  2023-09-27  0:13 ` [RFC PATCH 7/7] net: openvswitch: Reduce stack usage in ovs_dp_process_packet Nicholas Piggin
@ 2023-09-27  8:36 ` Ilya Maximets
  2023-09-28  1:52   ` Nicholas Piggin
  2023-09-29  7:06   ` Nicholas Piggin
  7 siblings, 2 replies; 28+ messages in thread
From: Ilya Maximets @ 2023-09-27  8:36 UTC (permalink / raw)
  To: Nicholas Piggin, netdev; +Cc: dev, i.maximets

On 9/27/23 02:13, Nicholas Piggin wrote:
> Hi,
> 
> We've got a report of a stack overflow on ppc64le with a 16kB kernel
> stack. Openvswitch is just one of many things in the stack, but it
> does cause recursion and contributes to some usage.
> 
> Here are a few patches for reducing stack overhead. I don't know the
> code well so consider them just ideas. GFP_ATOMIC allocations
> introduced in a couple of places might be controversial, but there
> is still some savings to be had if you skip those.
> 
> Here is one place detected where the stack reaches >14kB before
> overflowing a little later. I massaged the output so it just shows
> the stack frame address on the left.

Hi, Nicholas.  Thanks for the patches!

Though it looks like OVS is not really playing a huge role in the
stack trace below.  How much of the stack does the patch set save
in total?  How much patches 2-7 contribute (I posted a patch similar
to the first one last week, so we may not count it)?

Also, most of the changes introduced here has a real chance to
noticeably impact performance.  Did you run any performance tests
with this to assess the impact?

One last thing is that at least some of the patches seem to change
non-inlined non-recursive functions.  Seems unnecessary.

Best regards, Ilya Maximets.


* Re: [ovs-dev] [RFC PATCH 1/7] net: openvswitch: Move NSH buffer out of do_execute_actions
  2023-09-27  8:26   ` [ovs-dev] " Ilya Maximets
@ 2023-09-27 10:03     ` Nicholas Piggin
  0 siblings, 0 replies; 28+ messages in thread
From: Nicholas Piggin @ 2023-09-27 10:03 UTC (permalink / raw)
  To: Ilya Maximets, netdev; +Cc: dev

On Wed Sep 27, 2023 at 6:26 PM AEST, Ilya Maximets wrote:
> On 9/27/23 02:13, Nicholas Piggin wrote:
> > This takes do_execute_actions stack use from 544 bytes to 288
> > bytes. execute_push_nsh uses 336 bytes, but it is a leaf call not
> > involved in recursion.
> > 
> > Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> > ---
> >  net/openvswitch/actions.c | 27 +++++++++++++++++----------
> >  1 file changed, 17 insertions(+), 10 deletions(-)
>
> Hi, Nicholas.  I made the same change about a week ago:
>   https://lore.kernel.org/netdev/20230921194314.1976605-1-i.maximets@ovn.org/
> So, you can drop this patch from your set.

Ah nice, I didn't see it. Looks good to me.

Thanks,
Nick


* Re: [RFC PATCH 7/7] net: openvswitch: Reduce stack usage in ovs_dp_process_packet
  2023-09-27  0:13 ` [RFC PATCH 7/7] net: openvswitch: Reduce stack usage in ovs_dp_process_packet Nicholas Piggin
@ 2023-09-27 15:22   ` kernel test robot
  0 siblings, 0 replies; 28+ messages in thread
From: kernel test robot @ 2023-09-27 15:22 UTC (permalink / raw)
  To: Nicholas Piggin; +Cc: oe-kbuild-all

Hi Nicholas,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build warnings:

[auto build test WARNING on net/main]
[also build test WARNING on linus/master v6.6-rc3]
[cannot apply to net-next/main horms-ipvs/master next-20230927]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Nicholas-Piggin/net-openvswitch-Move-NSH-buffer-out-of-do_execute_actions/20230927-082125
base:   net/main
patch link:    https://lore.kernel.org/r/20230927001308.749910-8-npiggin%40gmail.com
patch subject: [RFC PATCH 7/7] net: openvswitch: Reduce stack usage in ovs_dp_process_packet
config: parisc-allyesconfig (https://download.01.org/0day-ci/archive/20230927/202309272345.yNWcLYj9-lkp@intel.com/config)
compiler: hppa-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230927/202309272345.yNWcLYj9-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202309272345.yNWcLYj9-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> net/openvswitch/datapath.c:246:6: warning: 'do_packet_upcall' defined but not used [-Wunused-function]
     246 | void do_packet_upcall(struct sk_buff *skb, struct sw_flow_key *key,
         |      ^~~~~~~~~~~~~~~~


vim +/do_packet_upcall +246 net/openvswitch/datapath.c

   244	
   245	static noinline_for_stack
 > 246	void do_packet_upcall(struct sk_buff *skb, struct sw_flow_key *key,
   247			      const struct vport *p, struct datapath *dp)
   248	{
   249		struct dp_upcall_info upcall;
   250		int error;
   251	
   252		memset(&upcall, 0, sizeof(upcall));
   253		upcall.cmd = OVS_PACKET_CMD_MISS;
   254	
   255		if (dp->user_features & OVS_DP_F_DISPATCH_UPCALL_PER_CPU)
   256			upcall.portid =
   257			    ovs_dp_get_upcall_portid(dp, smp_processor_id());
   258		else
   259			upcall.portid = ovs_vport_find_upcall_portid(p, skb);
   260	
   261		upcall.mru = OVS_CB(skb)->mru;
   262		error = ovs_dp_upcall(dp, skb, key, &upcall, 0);
   263		switch (error) {
   264		case 0:
   265		case -EAGAIN:
   266		case -ERESTARTSYS:
   267		case -EINTR:
   268			consume_skb(skb);
   269			break;
   270		default:
   271			kfree_skb(skb);
   272			break;
   273		}
   274	}
   275	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [ovs-dev] [RFC PATCH 0/7] net: openvswitch: Reduce stack usage
  2023-09-27  8:36 ` [ovs-dev] [RFC PATCH 0/7] net: openvswitch: Reduce stack usage Ilya Maximets
@ 2023-09-28  1:52   ` Nicholas Piggin
  2023-10-02 11:54     ` Ilya Maximets
  2023-09-29  7:06   ` Nicholas Piggin
  1 sibling, 1 reply; 28+ messages in thread
From: Nicholas Piggin @ 2023-09-28  1:52 UTC (permalink / raw)
  To: Ilya Maximets, netdev; +Cc: dev

On Wed Sep 27, 2023 at 6:36 PM AEST, Ilya Maximets wrote:
> On 9/27/23 02:13, Nicholas Piggin wrote:
> > Hi,
> > 
> > We've got a report of a stack overflow on ppc64le with a 16kB kernel
> > stack. Openvswitch is just one of many things in the stack, but it
> > does cause recursion and contributes to some usage.
> > 
> > Here are a few patches for reducing stack overhead. I don't know the
> > code well so consider them just ideas. GFP_ATOMIC allocations
> > introduced in a couple of places might be controversial, but there
> > is still some savings to be had if you skip those.
> > 
> > Here is one place detected where the stack reaches >14kB before
> > overflowing a little later. I massaged the output so it just shows
> > the stack frame address on the left.
>
> Hi, Nicholas.  Thanks for the patches!

Hey, sorry your mail didn't come through for me (though it's on the
list)... Anyway thanks for the feedback.

And the important thing I forgot to mention: this was reproduced on a
RHEL9 kernel and that's where the traces are from. Upstream is quite
similar though so the code and call chains and stack use should be
pretty close.

It's a complicated configuration that we're having difficulty testing
with an upstream kernel. People are working to test things on the RHEL kernel
but I wanted to bring this upstream before we get too far down that
road.

Unfortunately that means I don't have performance numbers or exact
stack-use savings yet. But I will let you know if/when I get results.

> Though it looks like OVS is not really playing a huge role in the
> stack trace below.  How much of the stack does the patch set save
> in total?  How much patches 2-7 contribute (I posted a patch similar
> to the first one last week, so we may not count it)?

ovs functions themselves are maybe 30% of stack use, so significant.  I
did find they are the ones with some of the biggest structures in local
variables though, so low hanging fruit. This series should save about
2kB of stack, by eyeball. Should be enough to get us out of trouble for
this scenario, at least.

I don't suggest ovs is the only problem, I'm just trying to trim things
where possible. I have been trying to find other savings too, e.g.,
https://lore.kernel.org/linux-nfs/20230927001624.750031-1-npiggin@gmail.com/

Recursion is a difficulty. I think we recursed 3 times in ovs, and it
looks like there's either 1 or 2 more recursions possible before the
limit (depending on how the accounting works, not sure if it stops at
4 or 5), so we're a long way off. ppc64le doesn't use an unusually large
amount of stack; probably more than x86-64, but not by a big factor.
So it could be risky for any arch with a 16kB stack.

I wonder if we should have an arch function that can be called by
significant recursion points such as this, which signals free stack is
low and you should bail out ASAP. I don't think it's reasonable to
expect ovs to know about every arch's stack size and usage. You could
keep your hard limit for consistency, but if that goes wrong the
low free stack indication could save you.
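
Roughly something like this (a sketch only; stack_bytes_free() and
stack_is_low() are invented names, not existing APIs):

	/* Bytes of stack remaining below the current stack pointer.
	 * Assumes the stack grows down from task_stack_page(current).
	 */
	static inline unsigned long stack_bytes_free(void)
	{
		return current_stack_pointer -
		       (unsigned long)task_stack_page(current);
	}

	static inline bool stack_is_low(void)
	{
		return stack_bytes_free() < THREAD_SIZE / 4;
	}

Then a recursion point like ovs_dp_process_packet could drop or defer
the packet when stack_is_low(), whatever the recursion count says.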

>
> Also, most of the changes introduced here has a real chance to
> noticeably impact performance.  Did you run any performance tests
> with this to assess the impact?

Will see if we can do that, but I doubt this setup would be too
sensitive to small changes so it might be something upstream would have
to help evaluate. Function calls and even small kmalloc/free on the same
CPU shouldn't be too costly I hope, but I don't know how hot these paths
can get.

>
> One last thing is that at least some of the patches seem to change
> non-inlined non-recursive functions.  Seems unnecessary.

I was concentrating on functions in the recursive path, but there
were one or two big ones just off the side that still can be called
when you're deep into the stack. In general it's a good idea to be
as frugal as reasonably possible with kernel stack, so I wouldn't
say unnecessary, but yes, arguably less important. I defer
to your judgement about cost and benefit of all these changes
though.

Thanks,
Nick


* Re: [ovs-dev] [RFC PATCH 4/7] net: openvswitch: ovs_vport_receive reduce stack usage
  2023-09-27  0:13 ` [RFC PATCH 4/7] net: openvswitch: ovs_vport_receive reduce stack usage Nicholas Piggin
@ 2023-09-28 15:26   ` Aaron Conole
  2023-09-29  7:00     ` Nicholas Piggin
  2023-10-04  7:29     ` Nicholas Piggin
  0 siblings, 2 replies; 28+ messages in thread
From: Aaron Conole @ 2023-09-28 15:26 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: netdev, dev, Ilya Maximets, Eelco Chaudron, Flavio Leitner

Nicholas Piggin <npiggin@gmail.com> writes:

> Dynamically allocating the sw_flow_key reduces stack usage of
> ovs_vport_receive from 544 bytes to 64 bytes at the cost of
> another GFP_ATOMIC allocation in the receive path.
>
> XXX: is this a problem with memory reserves if ovs is in a
> memory reclaim path, or since we have a skb allocated, is it
> okay to use some GFP_ATOMIC reserves?
>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---

This represents a fairly large performance hit.  Just my own quick
testing on a system using two netns, iperf3, and simple forwarding rules
shows between 2.5% and 4% performance reduction on x86-64.  Note that it
is a simple case, and doesn't involve a more involved scenario like
multiple bridges, tunnels, and internal ports.  I suspect such cases
will see an even bigger hit.

I don't know the impact of the other changes, but just an FYI that the
performance impact of this change is extremely noticeable on the x86
platform.

----
ip netns add left
ip netns add right

ip link add eth0 type veth peer name l0
ip link set eth0 netns left
ip netns exec left ip addr add 172.31.110.1/24 dev eth0
ip netns exec left ip link set eth0 up
ip link set l0 up

ip link add eth0 type veth peer name r0
ip link set eth0 netns right
ip netns exec right ip addr add 172.31.110.2/24 dev eth0
ip netns exec right ip link set eth0 up
ip link set r0 up

python3 ovs-dpctl.py add-dp br0
python3 ovs-dpctl.py add-if br0 l0
python3 ovs-dpctl.py add-if br0 r0

python3 ovs-dpctl.py add-flow \
  br0 'in_port(1),eth(),eth_type(0x806),arp()' 2
  
python3 ovs-dpctl.py add-flow \
  br0 'in_port(2),eth(),eth_type(0x806),arp()' 1

python3 ovs-dpctl.py add-flow \
  br0 'in_port(1),eth(),eth_type(0x800),ipv4()' 2

python3 ovs-dpctl.py add-flow \
  br0 'in_port(2),eth(),eth_type(0x800),ipv4()' 1

----

ex results without this patch:
[root@wsfd-netdev60 ~]# ip netns exec right ./git/iperf/src/iperf3 -c 172.31.110.1
...
[  5]   0.00-10.00  sec  46.7 GBytes  40.2 Gbits/sec    0             sender
[  5]   0.00-10.00  sec  46.7 GBytes  40.2 Gbits/sec                  receiver


ex results with this patch:
[root@wsfd-netdev60 ~]# ip netns exec right ./git/iperf/src/iperf3 -c 172.31.110.1
...
[  5]   0.00-10.00  sec  44.9 GBytes  38.6 Gbits/sec    0             sender
[  5]   0.00-10.00  sec  44.9 GBytes  38.6 Gbits/sec                  receiver

I tested with UDP at various bandwidths in addition to this TCP test.


* Re: [ovs-dev] [RFC PATCH 4/7] net: openvswitch: ovs_vport_receive reduce stack usage
  2023-09-28 15:26   ` [ovs-dev] " Aaron Conole
@ 2023-09-29  7:00     ` Nicholas Piggin
  2023-09-29  8:38       ` Eelco Chaudron
  2023-10-04  7:29     ` Nicholas Piggin
  1 sibling, 1 reply; 28+ messages in thread
From: Nicholas Piggin @ 2023-09-29  7:00 UTC (permalink / raw)
  To: Aaron Conole; +Cc: netdev, dev, Ilya Maximets, Eelco Chaudron, Flavio Leitner

On Fri Sep 29, 2023 at 1:26 AM AEST, Aaron Conole wrote:
> Nicholas Piggin <npiggin@gmail.com> writes:
>
> > Dynamically allocating the sw_flow_key reduces stack usage of
> > ovs_vport_receive from 544 bytes to 64 bytes at the cost of
> > another GFP_ATOMIC allocation in the receive path.
> >
> > XXX: is this a problem with memory reserves if ovs is in a
> > memory reclaim path, or since we have a skb allocated, is it
> > okay to use some GFP_ATOMIC reserves?
> >
> > Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> > ---
>
> This represents a fairly large performance hit.  Just my own quick
> testing on a system using two netns, iperf3, and simple forwarding rules
> shows between 2.5% and 4% performance reduction on x86-64.  Note that it
> is a simple case, and doesn't involve a more involved scenario like
> multiple bridges, tunnels, and internal ports.  I suspect such cases
> will see even bigger hit.
>
> I don't know the impact of the other changes, but just an FYI that the
> performance impact of this change is extremely noticeable on x86
> platform.

Thanks for the numbers. This patch is probably the biggest perf cost,
but unfortunately it's also roughly the biggest saving. I might have an
idea to improve it.

Thanks,
Nick


* Re: [ovs-dev] [RFC PATCH 0/7] net: openvswitch: Reduce stack usage
  2023-09-27  8:36 ` [ovs-dev] [RFC PATCH 0/7] net: openvswitch: Reduce stack usage Ilya Maximets
  2023-09-28  1:52   ` Nicholas Piggin
@ 2023-09-29  7:06   ` Nicholas Piggin
  2023-10-02 11:56     ` Ilya Maximets
  1 sibling, 1 reply; 28+ messages in thread
From: Nicholas Piggin @ 2023-09-29  7:06 UTC (permalink / raw)
  To: Ilya Maximets, netdev; +Cc: dev

On Wed Sep 27, 2023 at 6:36 PM AEST, Ilya Maximets wrote:
> On 9/27/23 02:13, Nicholas Piggin wrote:
> > Hi,
> > 
> > We've got a report of a stack overflow on ppc64le with a 16kB kernel
> > stack. Openvswitch is just one of many things in the stack, but it
> > does cause recursion and contributes to some usage.
> > 
> > Here are a few patches for reducing stack overhead. I don't know the
> > code well so consider them just ideas. GFP_ATOMIC allocations
> > introduced in a couple of places might be controversial, but there
> > is still some savings to be had if you skip those.
> > 
> > Here is one place detected where the stack reaches >14kB before
> > overflowing a little later. I massaged the output so it just shows
> > the stack frame address on the left.
>
> Hi, Nicholas.  Thanks for the patches!
>
> Though it looks like OVS is not really playing a huge role in the
> stack trace below.  How much of the stack does the patch set save
> in total?  How much patches 2-7 contribute (I posted a patch similar
> to the first one last week, so we may not count it)?

Stack usage was tested for the same path (this is backported to the
RHEL9 kernel), and the saving was 2080 bytes. That's enough to get us
out of trouble, but a config that caused more recursions might still be
a problem.

>
> Also, most of the changes introduced here has a real chance to
> noticeably impact performance.  Did you run any performance tests
> with this to assess the impact?

Some numbers were posted by Aaron, as you will have seen: 2-4% for that
patch, but I suspect the rest should have a much smaller impact.

Patch 2 might matter if you were doing a lot of push_nsh operations, but
that's probably less important since it's out of the recursive path.

>
> One last thing is that at least some of the patches seem to change
> non-inlined non-recursive functions.  Seems unnecessary.
>
> Best regards, Ilya Maximets.
>

One thing I do notice in the trace:

> > 
> > [c00000037d480b40] __kmalloc+0x8c/0x5e0
> > [c00000037d480bc0] virtqueue_add_outbuf+0x354/0xac0
> > [c00000037d480cc0] xmit_skb+0x1dc/0x350 [virtio_net]
> > [c00000037d480d50] start_xmit+0xd4/0x3b0 [virtio_net]
> > [c00000037d480e00] dev_hard_start_xmit+0x11c/0x280
> > [c00000037d480e80] sch_direct_xmit+0xec/0x330
> > [c00000037d480f20] __dev_xmit_skb+0x41c/0xa80
> > [c00000037d480f90] __dev_queue_xmit+0x414/0x950
> > [c00000037d481070] ovs_vport_send+0xb4/0x210 [openvswitch]
> > [c00000037d4810f0] do_output+0x7c/0x200 [openvswitch]
> > [c00000037d481140] do_execute_actions+0xe48/0xeb0 [openvswitch]
> > [c00000037d481300] ovs_execute_actions+0x78/0x1f0 [openvswitch]
> > [c00000037d481380] ovs_dp_process_packet+0xb4/0x2e0 [openvswitch]
> > [c00000037d481450] ovs_vport_receive+0x8c/0x130 [openvswitch]
> > [c00000037d481660] internal_dev_xmit+0x40/0xd0 [openvswitch]
> > [c00000037d481690] dev_hard_start_xmit+0x11c/0x280
> > [c00000037d481710] __dev_queue_xmit+0x634/0x950
> > [c00000037d4817f0] neigh_hh_output+0xd0/0x180
> > [c00000037d481840] ip_finish_output2+0x31c/0x5c0
> > [c00000037d4818e0] ip_local_out+0x64/0x90
> > [c00000037d481920] iptunnel_xmit+0x194/0x290
> > [c00000037d4819c0] udp_tunnel_xmit_skb+0x100/0x140 [udp_tunnel]
> > [c00000037d481a80] geneve_xmit_skb+0x34c/0x610 [geneve]
> > [c00000037d481bb0] geneve_xmit+0x94/0x1e8 [geneve]
> > [c00000037d481c30] dev_hard_start_xmit+0x11c/0x280
> > [c00000037d481cb0] __dev_queue_xmit+0x634/0x950
> > [c00000037d481d90] ovs_vport_send+0xb4/0x210 [openvswitch]
> > [c00000037d481e10] do_output+0x7c/0x200 [openvswitch]
> > [c00000037d481e60] do_execute_actions+0xe48/0xeb0 [openvswitch]
> > [c00000037d482020] ovs_execute_actions+0x78/0x1f0 [openvswitch]
> > [c00000037d4820a0] ovs_dp_process_packet+0xb4/0x2e0 [openvswitch]
> > [c00000037d482170] clone_execute+0x2c8/0x370 [openvswitch]

                       ^^^^^

clone_execute is an action which can be deferred, AFAIKS, but it is
not deferred until several recursions deep.

If we always deferred when possible, then we might avoid such a big
stack (at least for this config). Is it very costly to defer? Would
it help here, or is it just going to process it right away and
cause basically the same call chain?
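
For reference, my rough reading of how the recursion limit and deferral
interact today, as a simplified sketch of ovs_execute_actions() from
net/openvswitch/actions.c (stats and cb setup elided, so treat it as an
approximation rather than the exact code):

static DEFINE_PER_CPU(int, exec_actions_level);

int ovs_execute_actions(struct datapath *dp, struct sk_buff *skb,
                        const struct sw_flow_actions *acts,
                        struct sw_flow_key *key)
{
        int err, level;

        level = __this_cpu_inc_return(exec_actions_level);
        if (unlikely(level > OVS_RECURSION_LIMIT)) {
                /* too deep: drop rather than risk overflowing the stack */
                kfree_skb(skb);
                err = -ENETDOWN;
                goto out;
        }

        err = do_execute_actions(dp, skb, key,
                                 acts->actions, acts->actions_len);

        /* deferred actions only run when the outermost level unwinds */
        if (level == 1)
                process_deferred_actions(dp);
out:
        __this_cpu_dec(exec_actions_level);
        return err;
}

If I'm reading this right, anything clone_execute defers only runs once
the chain unwinds back to the outermost level, which is part of why I
wonder whether deferring earlier would actually cap the depth.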

Thanks,
Nick


* Re: [ovs-dev] [RFC PATCH 4/7] net: openvswitch: ovs_vport_receive reduce stack usage
  2023-09-29  7:00     ` Nicholas Piggin
@ 2023-09-29  8:38       ` Eelco Chaudron
  2023-10-04  7:11         ` Nicholas Piggin
  2023-10-05  2:01         ` Nicholas Piggin
  0 siblings, 2 replies; 28+ messages in thread
From: Eelco Chaudron @ 2023-09-29  8:38 UTC (permalink / raw)
  To: Nicholas Piggin; +Cc: Aaron Conole, netdev, dev, Ilya Maximets, Flavio Leitner



On 29 Sep 2023, at 9:00, Nicholas Piggin wrote:

> On Fri Sep 29, 2023 at 1:26 AM AEST, Aaron Conole wrote:
>> Nicholas Piggin <npiggin@gmail.com> writes:
>>
>>> Dynamically allocating the sw_flow_key reduces stack usage of
>>> ovs_vport_receive from 544 bytes to 64 bytes at the cost of
>>> another GFP_ATOMIC allocation in the receive path.
>>>
>>> XXX: is this a problem with memory reserves if ovs is in a
>>> memory reclaim path, or since we have a skb allocated, is it
>>> okay to use some GFP_ATOMIC reserves?
>>>
>>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>>> ---
>>
>> This represents a fairly large performance hit.  Just my own quick
>> testing on a system using two netns, iperf3, and simple forwarding rules
>> shows between 2.5% and 4% performance reduction on x86-64.  Note that it
>> is a simple case, and doesn't involve a more involved scenario like
>> multiple bridges, tunnels, and internal ports.  I suspect such cases
>> will see even bigger hit.
>>
>> I don't know the impact of the other changes, but just an FYI that the
>> performance impact of this change is extremely noticeable on x86
>> platform.
>
> Thanks for the numbers. This patch is probably the biggest perf cost,
> but unfortunately it's also about the biggest saving. I might have an
> idea to improve it.

Also, were you able to figure out why we do not see this problem on x86 and arm64? Is the stack usage so much larger, or is there some other root cause? Is there a simple reproducer? That might help you profile the differences between the architectures.



* Re: [ovs-dev] [RFC PATCH 0/7] net: openvswitch: Reduce stack usage
  2023-09-28  1:52   ` Nicholas Piggin
@ 2023-10-02 11:54     ` Ilya Maximets
  2023-10-04  9:56       ` Nicholas Piggin
  0 siblings, 1 reply; 28+ messages in thread
From: Ilya Maximets @ 2023-10-02 11:54 UTC (permalink / raw)
  To: Nicholas Piggin, netdev
  Cc: i.maximets, dev, Aaron Conole, Eelco Chaudron, Simon Horman

On 9/28/23 03:52, Nicholas Piggin wrote:
> On Wed Sep 27, 2023 at 6:36 PM AEST, Ilya Maximets wrote:
>> On 9/27/23 02:13, Nicholas Piggin wrote:
>>> Hi,
>>>
>>> We've got a report of a stack overflow on ppc64le with a 16kB kernel
>>> stack. Openvswitch is just one of many things in the stack, but it
>>> does cause recursion and contributes to some usage.
>>>
>>> Here are a few patches for reducing stack overhead. I don't know the
>>> code well so consider them just ideas. GFP_ATOMIC allocations
>>> introduced in a couple of places might be controversial, but there
>>> is still some savings to be had if you skip those.
>>>
>>> Here is one place detected where the stack reaches >14kB before
>>> overflowing a little later. I massaged the output so it just shows
>>> the stack frame address on the left.
>>
>> Hi, Nicholas.  Thanks for the patches!
> 
> Hey, sorry your mail didn't come through for me (though it's on the
> list)... Anyway thanks for the feedback.
> 
> And the important thing I forgot to mention: this was reproduced on a
> RHEL9 kernel and that's where the traces are from. Upstream is quite
> similar though so the code and call chains and stack use should be
> pretty close.
> 
> It's a complicated configuration we're having difficulty with testing
> upstream kernel. People are working to test things on the RHEL kernel
> but I wanted to bring this upstream before we get too far down that
> road.
> 
> Unfortunately that means I don't have performance or exact stack
> use savings yet. But I will let you know if/when I get results.
> 
>> Though it looks like OVS is not really playing a huge role in the
>> stack trace below.  How much of the stack does the patch set save
>> in total?  How much patches 2-7 contribute (I posted a patch similar
>> to the first one last week, so we may not count it)?
> 
> ovs functions themselves are maybe 30% of stack use, so significant.  I
> did find they are the ones with some of the biggest structures in local
> variables though, so low hanging fruit. This series should save about
> 2kB of stack, by eyeball. Should be enough to get us out of trouble for
> this scenario, at least.

Unfortunately, the only low-hanging fruit in this set is patch #1;
the rest needs a serious performance evaluation.

> 
> I don't suggest ovs is the only problem, I'm just trying to trim things
> where possible. I have been trying to find other savings too, e.g.,
> https://lore.kernel.org/linux-nfs/20230927001624.750031-1-npiggin@gmail.com/
> 
> Recursion is a difficulty. I think we recursed 3 times in ovs, and it
> looks like there's either 1 or 2 more recursions possible before the
> limit (depending on how the accounting works, not sure if it stops at
> 4 or 5), so we're a long way off. ppc64le doesn't use an unusually large
> amount of stack, probably more than x86-64, but shouldn't be by a big
> factor. So it could be risky for any arch with 16kB stack.

The stack trace looks like a very standard trace for something like
an ovn-kubernetes setup, and I haven't seen such issues on x86 or
aarch64 systems.  What architectures besides ppc64le use a 16kB stack?

> 
> I wonder if we should have an arch function that can be called by
> significant recursion points such as this, which signals free stack is
> low and you should bail out ASAP. I don't think it's reasonable to
> expect ovs to know about all arch size and usage of stack. You could
> keep your hard limit for consistency, but if that goes wrong the
> low free stack indication could save you.

Every part of the code would need to react somehow to such a signal,
so I'm not sure what the implementation would look like.

> 
>>
>> Also, most of the changes introduced here has a real chance to
>> noticeably impact performance.  Did you run any performance tests
>> with this to assess the impact?
> 
> Will see if we can do that, but I doubt this setup would be too
> sensitive to small changes so it might be something upstream would have
> to help evaluate. Function calls and even small kmalloc/free on the same
> CPU shouldn't be too costly I hope, but I don't know how hot these paths
> can get.

This code path unfortunately is as hot as it gets, since it needs to be
able to process millions of packets per second.  Even small changes may
cause a significant performance impact.

> 
>>
>> One last thing is that at least some of the patches seem to change
>> non-inlined non-recursive functions.  Seems unnecessary.
> 
> I was concentrating on functions in the recursive path, but there
> were one or two big ones just off the side that still can be called
> when you're deep into stack. In general it's just a good idea to
> be frugal as reasonably possible with kernel stack always so I
> wouldn't say unnecessary, but yes arguably less important. I defer
> to your judgement about cost and benefit of all these changes
> though.
> 
> Thanks,
> Nick



* Re: [ovs-dev] [RFC PATCH 0/7] net: openvswitch: Reduce stack usage
  2023-09-29  7:06   ` Nicholas Piggin
@ 2023-10-02 11:56     ` Ilya Maximets
  2023-10-03 13:31       ` Aaron Conole
  0 siblings, 1 reply; 28+ messages in thread
From: Ilya Maximets @ 2023-10-02 11:56 UTC (permalink / raw)
  To: Nicholas Piggin, netdev
  Cc: i.maximets, dev, Aaron Conole, Eelco Chaudron, Simon Horman

On 9/29/23 09:06, Nicholas Piggin wrote:
> On Wed Sep 27, 2023 at 6:36 PM AEST, Ilya Maximets wrote:
>> On 9/27/23 02:13, Nicholas Piggin wrote:
>>> Hi,
>>>
>>> We've got a report of a stack overflow on ppc64le with a 16kB kernel
>>> stack. Openvswitch is just one of many things in the stack, but it
>>> does cause recursion and contributes to some usage.
>>>
>>> Here are a few patches for reducing stack overhead. I don't know the
>>> code well so consider them just ideas. GFP_ATOMIC allocations
>>> introduced in a couple of places might be controversial, but there
>>> is still some savings to be had if you skip those.
>>>
>>> Here is one place detected where the stack reaches >14kB before
>>> overflowing a little later. I massaged the output so it just shows
>>> the stack frame address on the left.
>>
>> Hi, Nicholas.  Thanks for the patches!
>>
>> Though it looks like OVS is not really playing a huge role in the
>> stack trace below.  How much of the stack does the patch set save
>> in total?  How much patches 2-7 contribute (I posted a patch similar
>> to the first one last week, so we may not count it)?
> 
> Stack usage was tested for the same path (this is backported to
> RHEL9 kernel), and saving was 2080 bytes for that. It's enough
> to get us out of trouble. But if it was a config that caused more
> recursions then it might still be a problem.

The 2K total value likely means that only patches 1 and 4 actually
contribute much to the savings.  And I agree that running at 85%+
stack utilization seems risky.  It could likely be overflowed by just
a few more recirculations in the OVS pipeline, or by traversing one
more network namespace on the way out.  And it's possible that some of
the traffic will take such a route in your system even if you haven't
seen it yet.

>> Also, most of the changes introduced here has a real chance to
>> noticeably impact performance.  Did you run any performance tests
>> with this to assess the impact?
> 
> Some numbers were posted by Aaron as you would see. 2-4% for that
> patch, but I suspect the rest should have much smaller impact.

They also seem to have a very small impact on the stack usage, so they
may not be worth touching at all, since a performance evaluation will
be necessary before they can be accepted.

> 
> Maybe patch 2 if you were doing a lot of push_nsh operations, but
> that might be less important since it's out of the recursive path.

It's also unlikely that you have an NSH pipeline configured in OVS.

> 
>>
>> One last thing is that at least some of the patches seem to change
>> non-inlined non-recursive functions.  Seems unnecessary.
>>
>> Best regards, Ilya Maximets.
>>
> 
> One thing I do notice in the trace:
> 
> 
> clone_execute is an action which can be deferred AFAIKS, but it is
> not deferred until several recursions deep.
> 
> If we deferred always when possible, then might avoid such a big
> stack (at least for this config). Is it very costly to defer? Would
> it help here, or is it just going to process it right away and
> cause basically the same call chain?

It may save at most two stack frames, because deferred actions are
executed just one function up, in ovs_execute_actions(), and it will
not save us from packets exiting the openvswitch module and re-entering
from a different port, which is the case in the provided trace.

Also, I'd vote against deferring, because then we'll start hitting the
limit of deferred actions much faster, causing packet drops, which is
already a problem for some OVN deployments.  And deferring involves
copying a lot of memory, which will hit performance once again.
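
For context, the deferred action queue is a small fixed-size per-cpu
FIFO, and each entry embeds a full copy of the flow key, which is where
both the hard limit and the copying cost come from.  Roughly (a sketch
from memory of net/openvswitch/actions.c; field details may differ):

#define DEFERRED_ACTION_FIFO_SIZE 10

struct deferred_action {
        struct sk_buff *skb;
        const struct nlattr *actions;
        int actions_len;
        /* every deferral copies the whole ~0.5kB flow key */
        struct sw_flow_key pkt_key;
};

struct action_fifo {
        int head;
        int tail;
        struct deferred_action fifo[DEFERRED_ACTION_FIFO_SIZE];
};

static struct action_fifo __percpu *action_fifos; /* alloc_percpu() */

Once the ten slots on a cpu are filled, further deferrals fail and
those packets are dropped.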

Best regards, Ilya Maximets.


* Re: [ovs-dev] [RFC PATCH 0/7] net: openvswitch: Reduce stack usage
  2023-10-02 11:56     ` Ilya Maximets
@ 2023-10-03 13:31       ` Aaron Conole
  0 siblings, 0 replies; 28+ messages in thread
From: Aaron Conole @ 2023-10-03 13:31 UTC (permalink / raw)
  To: Ilya Maximets; +Cc: Nicholas Piggin, netdev, dev, Eelco Chaudron, Simon Horman

Ilya Maximets <i.maximets@ovn.org> writes:

> On 9/29/23 09:06, Nicholas Piggin wrote:
>> On Wed Sep 27, 2023 at 6:36 PM AEST, Ilya Maximets wrote:
>>> On 9/27/23 02:13, Nicholas Piggin wrote:
>>>> Hi,
>>>>
>>>> We've got a report of a stack overflow on ppc64le with a 16kB kernel
>>>> stack. Openvswitch is just one of many things in the stack, but it
>>>> does cause recursion and contributes to some usage.
>>>>
>>>> Here are a few patches for reducing stack overhead. I don't know the
>>>> code well so consider them just ideas. GFP_ATOMIC allocations
>>>> introduced in a couple of places might be controversial, but there
>>>> is still some savings to be had if you skip those.
>>>>
>>>> Here is one place detected where the stack reaches >14kB before
>>>> overflowing a little later. I massaged the output so it just shows
>>>> the stack frame address on the left.
>>>
>>> Hi, Nicholas.  Thanks for the patches!
>>>
>>> Though it looks like OVS is not really playing a huge role in the
>>> stack trace below.  How much of the stack does the patch set save
>>> in total?  How much patches 2-7 contribute (I posted a patch similar
>>> to the first one last week, so we may not count it)?
>> 
>> Stack usage was tested for the same path (this is backported to
>> RHEL9 kernel), and saving was 2080 bytes for that. It's enough
>> to get us out of trouble. But if it was a config that caused more
>> recursions then it might still be a problem.
>
> The 2K total value likely means that only patches 1 and 4 actually
> contribute much into the savings.  And I agree that running at
> 85%+ stack utilization seems risky.  It can likely be overflowed
> by just a few more recirculations in OVS pipeline or traversing
> one more network namespace on a way out.  And it's possible that
> some of the traffic will take such a route in your system even if
> you didn't see it yet.
>
>>> Also, most of the changes introduced here has a real chance to
>>> noticeably impact performance.  Did you run any performance tests
>>> with this to assess the impact?
>> 
>> Some numbers were posted by Aaron as you would see. 2-4% for that
>> patch, but I suspect the rest should have much smaller impact.
>
> They also seem to have a very small impact on the stack usage,
> so may be not worth touching at all, since performance evaluation
> for them will be necessary before they can be accepted.

Actually, it's also important to keep in mind that ovs_vport_receive
only happens once in my performance test.  I expect it gets worse when
running in a more involved scenario (a br-ex, br-int, br-tun setup).

>> 
>> Maybe patch 2 if you were doing a lot of push_nsh operations, but
>> that might be less important since it's out of the recursive path.
>
> It's also unlikely that you have NHS pipeline configured in OVS.
>
>> 
>>>
>>> One last thing is that at least some of the patches seem to change
>>> non-inlined non-recursive functions.  Seems unnecessary.
>>>
>>> Best regards, Ilya Maximets.
>>>
>> 
>> One thing I do notice in the trace:
>> 
>> 
>> clone_execute is an action which can be deferred AFAIKS, but it is
>> not deferred until several recursions deep.
>> 
>> If we deferred always when possible, then might avoid such a big
>> stack (at least for this config). Is it very costly to defer? Would
>> it help here, or is it just going to process it right away and
>> cause basically the same call chain?
>
> It may save at most two stack frames maybe, because deferred actions
> will be called just one function above in ovs_execute_actions(), and
> it will not save us from packets exiting openvswitch module and
> re-entering from a different port, which is a case in the provided
> trace.

It used to always be deferred but, IIUC, we were hitting the deferred
actions limit quite a bit, so when the sample and clone actions were
unified the choice was made to recurse in order to avoid dropping
packets.

> Also, I'd vote against deferring, because then we'll start hitting
> the limit of deferred actions much faster causing packet drops, which
> is already a problem for some OVN deployments.  And deferring involves
> copying a lot of memory, which will hit performance once again.
>
> Best regards, Ilya Maximets.



* Re: [ovs-dev] [RFC PATCH 4/7] net: openvswitch: ovs_vport_receive reduce stack usage
  2023-09-29  8:38       ` Eelco Chaudron
@ 2023-10-04  7:11         ` Nicholas Piggin
  2023-10-04 15:15           ` Aaron Conole
  2023-10-05  2:01         ` Nicholas Piggin
  1 sibling, 1 reply; 28+ messages in thread
From: Nicholas Piggin @ 2023-10-04  7:11 UTC (permalink / raw)
  To: Eelco Chaudron; +Cc: Aaron Conole, netdev, dev, Ilya Maximets, Flavio Leitner

On Fri Sep 29, 2023 at 6:38 PM AEST, Eelco Chaudron wrote:
>
>
> On 29 Sep 2023, at 9:00, Nicholas Piggin wrote:
>
> > On Fri Sep 29, 2023 at 1:26 AM AEST, Aaron Conole wrote:
> >> Nicholas Piggin <npiggin@gmail.com> writes:
> >>
> >>> Dynamically allocating the sw_flow_key reduces stack usage of
> >>> ovs_vport_receive from 544 bytes to 64 bytes at the cost of
> >>> another GFP_ATOMIC allocation in the receive path.
> >>>
> >>> XXX: is this a problem with memory reserves if ovs is in a
> >>> memory reclaim path, or since we have a skb allocated, is it
> >>> okay to use some GFP_ATOMIC reserves?
> >>>
> >>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> >>> ---
> >>
> >> This represents a fairly large performance hit.  Just my own quick
> >> testing on a system using two netns, iperf3, and simple forwarding rules
> >> shows between 2.5% and 4% performance reduction on x86-64.  Note that it
> >> is a simple case, and doesn't involve a more involved scenario like
> >> multiple bridges, tunnels, and internal ports.  I suspect such cases
> >> will see even bigger hit.
> >>
> >> I don't know the impact of the other changes, but just an FYI that the
> >> performance impact of this change is extremely noticeable on x86
> >> platform.
> >
> > Thanks for the numbers. This patch is probably the biggest perf cost,
> > but unfortunately it's also about the biggest saving. I might have an
> > idea to improve it.
>
> Also, were you able to figure out why we do not see this problem on
> x86 and arm64? Is the stack usage so much larger, or is there some
> other root cause?

Haven't pinpointed it exactly. The ppc64le interrupt entry frame is
nearly 3x larger than on x86-64, about 200 bytes, so that's 400 if a
hard interrupt (not seen in the backtrace) is what overflowed it. Stack
alignment is 32 bytes, I think, vs 16 for x86-64. And different amounts
of register spilling, non-volatile register use, and inlining choices
by the compiler could nudge things one way or another. AFAIKS there are
little to no ppc64le-specific data structures on the stack in this call
chain that should cause much more bloat, though.

So other archs should not be far from overflowing 16kB, I think.

> Is there a simple replicator, as this might help you
> profile the differences between the architectures?

Unfortunately not, it's some Kubernetes contraption I don't know how
to reproduce myself.

Thanks,
Nick


* Re: [ovs-dev] [RFC PATCH 4/7] net: openvswitch: ovs_vport_receive reduce stack usage
  2023-09-28 15:26   ` [ovs-dev] " Aaron Conole
  2023-09-29  7:00     ` Nicholas Piggin
@ 2023-10-04  7:29     ` Nicholas Piggin
  2023-10-04 15:16       ` Aaron Conole
  1 sibling, 1 reply; 28+ messages in thread
From: Nicholas Piggin @ 2023-10-04  7:29 UTC (permalink / raw)
  To: Aaron Conole; +Cc: netdev, dev, Ilya Maximets, Eelco Chaudron, Flavio Leitner

On Fri Sep 29, 2023 at 1:26 AM AEST, Aaron Conole wrote:
> Nicholas Piggin <npiggin@gmail.com> writes:
>
> > Dynamically allocating the sw_flow_key reduces stack usage of
> > ovs_vport_receive from 544 bytes to 64 bytes at the cost of
> > another GFP_ATOMIC allocation in the receive path.
> >
> > XXX: is this a problem with memory reserves if ovs is in a
> > memory reclaim path, or since we have a skb allocated, is it
> > okay to use some GFP_ATOMIC reserves?
> >
> > Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> > ---
>
> This represents a fairly large performance hit.  Just my own quick
> testing on a system using two netns, iperf3, and simple forwarding rules
> shows between 2.5% and 4% performance reduction on x86-64.  Note that it
> is a simple case, and doesn't involve a more involved scenario like
> multiple bridges, tunnels, and internal ports.  I suspect such cases
> will see even bigger hit.
>
> I don't know the impact of the other changes, but just an FYI that the
> performance impact of this change is extremely noticeable on x86
> platform.
>
> ----
> ip netns add left
> ip netns add right
>
> ip link add eth0 type veth peer name l0
> ip link set eth0 netns left
> ip netns exec left ip addr add 172.31.110.1/24 dev eth0
> ip netns exec left ip link set eth0 up
> ip link set l0 up
>
> ip link add eth0 type veth peer name r0
> ip link set eth0 netns right
> ip netns exec right ip addr add 172.31.110.2/24 dev eth0
> ip netns exec right ip link set eth0 up
> ip link set r0 up
>
> python3 ovs-dpctl.py add-dp br0
> python3 ovs-dpctl.py add-if br0 l0
> python3 ovs-dpctl.py add-if br0 r0
>
> python3 ovs-dpctl.py add-flow \
>   br0 'in_port(1),eth(),eth_type(0x806),arp()' 2
>   
> python3 ovs-dpctl.py add-flow \
>   br0 'in_port(2),eth(),eth_type(0x806),arp()' 1
>
> python3 ovs-dpctl.py add-flow \
>   br0 'in_port(1),eth(),eth_type(0x800),ipv4()' 2
>
> python3 ovs-dpctl.py add-flow \
>   br0 'in_port(2),eth(),eth_type(0x800),ipv4()' 1
>
> ----
>
> ex results without this patch:
> [root@wsfd-netdev60 ~]# ip netns exec right ./git/iperf/src/iperf3 -c 172.31.110.1
> ...
> [  5]   0.00-10.00  sec  46.7 GBytes  40.2 Gbits/sec    0             sender
> [  5]   0.00-10.00  sec  46.7 GBytes  40.2 Gbits/sec                  receiver
>
>
> ex results with this patch:
> [root@wsfd-netdev60 ~]# ip netns exec right ./git/iperf/src/iperf3 -c 172.31.110.1
> ...
> [  5]   0.00-10.00  sec  44.9 GBytes  38.6 Gbits/sec    0             sender
> [  5]   0.00-10.00  sec  44.9 GBytes  38.6 Gbits/sec                  receiver
>
> I did testing with udp at various bandwidths and this tcp testing.

Thanks for the test case. It works perfectly in the end, but it took me
days to get there because of a random conspiracy of problems I hit :(
Sorry for the slow reply, but I was finally able to test another idea
for this. Performance seems to be within the noise with the full series,
but my system is only getting ~half the rate of yours, so you might see
more movement.

Instead of slab, it reuses the per-cpu actions key allocator here:

https://github.com/torvalds/linux/commit/878f01f04ca858e445ff4b4c64351a25bb8399e3
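
The rough shape of it, as a sketch only (names here are hypothetical,
not the exact code in that commit): a small per-cpu stash of flow keys
indexed by recursion level, instead of a kmalloc/kfree per packet.

struct ovs_pcpu_keys {
        int level;
        struct sw_flow_key keys[OVS_RECURSION_LIMIT];
};
static DEFINE_PER_CPU(struct ovs_pcpu_keys, pcpu_keys);

/* The receive path runs with bottom halves disabled, so the
 * per-cpu level counter needs no further synchronisation. */
static struct sw_flow_key *ovs_key_get(void)
{
        struct ovs_pcpu_keys *k = this_cpu_ptr(&pcpu_keys);

        if (unlikely(k->level >= OVS_RECURSION_LIMIT))
                return NULL;    /* caller drops the packet */
        return &k->keys[k->level++];
}

static void ovs_key_put(void)
{
        this_cpu_ptr(&pcpu_keys)->level--;
}

ovs_vport_receive() would then take a key from the stash on entry and
return it on the way out, so the per-packet cost is just a per-cpu
increment and decrement.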

Pushed the series to kvm branch of https://github.com/npiggin/linux

I can repost the series as a second RFC but will wait for thoughts on
this approach.

Thanks,
Nick


* Re: [ovs-dev] [RFC PATCH 0/7] net: openvswitch: Reduce stack usage
  2023-10-02 11:54     ` Ilya Maximets
@ 2023-10-04  9:56       ` Nicholas Piggin
  0 siblings, 0 replies; 28+ messages in thread
From: Nicholas Piggin @ 2023-10-04  9:56 UTC (permalink / raw)
  To: Ilya Maximets, netdev; +Cc: dev, Aaron Conole, Eelco Chaudron, Simon Horman

On Mon Oct 2, 2023 at 9:54 PM AEST, Ilya Maximets wrote:
> On 9/28/23 03:52, Nicholas Piggin wrote:
> > On Wed Sep 27, 2023 at 6:36 PM AEST, Ilya Maximets wrote:
> >> On 9/27/23 02:13, Nicholas Piggin wrote:
> >>> Hi,
> >>>
> >>> We've got a report of a stack overflow on ppc64le with a 16kB kernel
> >>> stack. Openvswitch is just one of many things in the stack, but it
> >>> does cause recursion and contributes to some usage.
> >>>
> >>> Here are a few patches for reducing stack overhead. I don't know the
> >>> code well so consider them just ideas. GFP_ATOMIC allocations
> >>> introduced in a couple of places might be controversial, but there
> >>> is still some savings to be had if you skip those.
> >>>
> >>> Here is one place detected where the stack reaches >14kB before
> >>> overflowing a little later. I massaged the output so it just shows
> >>> the stack frame address on the left.
> >>
> >> Hi, Nicholas.  Thanks for the patches!
> > 
> > Hey, sorry your mail didn't come through for me (though it's on the
> > list)... Anyway thanks for the feedback.
> > 
> > And the important thing I forgot to mention: this was reproduced on a
> > RHEL9 kernel and that's where the traces are from. Upstream is quite
> > similar though so the code and call chains and stack use should be
> > pretty close.
> > 
> > It's a complicated configuration we're having difficulty with testing
> > upstream kernel. People are working to test things on the RHEL kernel
> > but I wanted to bring this upstream before we get too far down that
> > road.
> > 
> > Unfortunately that means I don't have performance or exact stack
> > use savings yet. But I will let you know if/when I get results.
> > 
> >> Though it looks like OVS is not really playing a huge role in the
> >> stack trace below.  How much of the stack does the patch set save
> >> in total?  How much patches 2-7 contribute (I posted a patch similar
> >> to the first one last week, so we may not count it)?
> > 
> > ovs functions themselves are maybe 30% of stack use, so significant.  I
> > did find they are the ones with some of the biggest structures in local
> > variables though, so low hanging fruit. This series should save about
> > 2kB of stack, by eyeball. Should be enough to get us out of trouble for
> > this scenario, at least.
>
> Unfortunately, the only low handing fruit in this set is patch #1,
> the rest needs a serious performance evaluation.
>
> > 
> > I don't suggest ovs is the only problem, I'm just trying to trim things
> > where possible. I have been trying to find other savings too, e.g.,
> > https://lore.kernel.org/linux-nfs/20230927001624.750031-1-npiggin@gmail.com/
> > 
> > Recursion is a difficulty. I think we recursed 3 times in ovs, and it
> > looks like there's either 1 or 2 more recursions possible before the
> > limit (depending on how the accounting works, not sure if it stops at
> > 4 or 5), so we're a long way off. ppc64le doesn't use an unusually large
> > amount of stack, probably more than x86-64, but shouldn't be by a big
> > factor. So it could be risky for any arch with 16kB stack.
>
> The stack trace looks like a very standard trace for something like
> an ovn-kubernetes setup.  And I haven't seen such issues on x86 or
> aarch64 systems.  What architectures beside ppc64le use 16kB stack?

AFAIKS from browsing the default defines for 64-bit builds, all the
ones I looked at do (riscv, s390, loongarch, mips, sparc).

They will all differ in how much stack the compiler uses, in the sizes
of some types that could end up in local variables, and in the details
of kernel entry and how irq stacks are implemented. It would be
interesting to compare typical stack usage of different archs; I
haven't made a good study of it.

> > 
> > I wonder if we should have an arch function that can be called by
> > significant recursion points such as this, which signals free stack is
> > low and you should bail out ASAP. I don't think it's reasonable to
> > expect ovs to know about all arch size and usage of stack. You could
> > keep your hard limit for consistency, but if that goes wrong the
> > low free stack indication could save you.
>
> Every part of the code will need to react somehow to such a signal,
> so I'm not sure how the implementations would look like.

Not every part of the code; it could be a few strategic checks. The
recursion test that is already in ovs, for example.
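
Something like the sketch below is what I had in mind. It is entirely
hypothetical: no such helper exists today, and the names and the 2kB
threshold are made up; an arch could implement it more cheaply.

/* hypothetical arch helper: true when less than 'margin' bytes of
 * the current kernel stack remain free (the stack grows down) */
static inline bool stack_nearly_exhausted(unsigned long margin)
{
        unsigned long sp = current_stack_pointer;
        unsigned long base = (unsigned long)task_stack_page(current);

        return sp - base < margin;
}

Then a strategic check, e.g. next to the existing recursion limit test
in ovs_execute_actions(), could bail out early:

        if (unlikely(level > OVS_RECURSION_LIMIT ||
                     stack_nearly_exhausted(SZ_2K))) {
                kfree_skb(skb);
                err = -ENETDOWN;
                goto out;
        }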

Thanks,
Nick


* Re: [ovs-dev] [RFC PATCH 4/7] net: openvswitch: ovs_vport_receive reduce stack usage
  2023-10-04  7:11         ` Nicholas Piggin
@ 2023-10-04 15:15           ` Aaron Conole
  0 siblings, 0 replies; 28+ messages in thread
From: Aaron Conole @ 2023-10-04 15:15 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Eelco Chaudron, netdev, dev, Ilya Maximets, Flavio Leitner

"Nicholas Piggin" <npiggin@gmail.com> writes:

> On Fri Sep 29, 2023 at 6:38 PM AEST, Eelco Chaudron wrote:
>>
>>
>> On 29 Sep 2023, at 9:00, Nicholas Piggin wrote:
>>
>> > On Fri Sep 29, 2023 at 1:26 AM AEST, Aaron Conole wrote:
>> >> Nicholas Piggin <npiggin@gmail.com> writes:
>> >>
>> >>> Dynamically allocating the sw_flow_key reduces stack usage of
>> >>> ovs_vport_receive from 544 bytes to 64 bytes at the cost of
>> >>> another GFP_ATOMIC allocation in the receive path.
>> >>>
>> >>> XXX: is this a problem with memory reserves if ovs is in a
>> >>> memory reclaim path, or since we have a skb allocated, is it
>> >>> okay to use some GFP_ATOMIC reserves?
>> >>>
>> >>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>> >>> ---
>> >>
>> >> This represents a fairly large performance hit.  Just my own quick
>> >> testing on a system using two netns, iperf3, and simple forwarding rules
>> >> shows between 2.5% and 4% performance reduction on x86-64.  Note that it
>> >> is a simple case, and doesn't involve a more involved scenario like
>> >> multiple bridges, tunnels, and internal ports.  I suspect such cases
>> >> will see even bigger hit.
>> >>
>> >> I don't know the impact of the other changes, but just an FYI that the
>> >> performance impact of this change is extremely noticeable on x86
>> >> platform.
>> >
>> > Thanks for the numbers. This patch is probably the biggest perf cost,
>> > but unfortunately it's also about the biggest saving. I might have an
>> > idea to improve it.
>>
>> Also, were you able to figure out why we do not see this problem on
>> x86 and arm64? Is the stack usage so much larger, or is there some
>> other root cause?
>
> Haven't pinpointed it exactly. ppc64le interrupt entry frame is nearly
> 3x larger than x86-64, about 200 bytes. So there's 400 if a hard
> interrupt (not seen in the backtrace) is what overflowed it. Stack
> alignment I think is 32 bytes vs 16 for x86-64. And different amount of
> spilling and non-volatile register use and inlining choices by the
> compiler could nudge things one way or another. There is little to no
> ppc64le specific data structures on the stack in any of this call chain
> which should cause much more bloat though, AFAIKS.
>
> So other archs should not be far away from overflowing 16kB I think.
>
>> Is there a simple replicator, as this might help you
>> profile the differences between the architectures?
>
> Unfortunately not, it's some kubernetes contraption I don't know how
> to reproduce myself.

If we can get the flow dump and configuration, we can probably
reproduce it with ovs-dpctl.py (adding any missing features, etc).  I
guess it should be simple to get (ovs-vsctl show, ovs-appctl
dpctl/dump-flows), and then we can try to replicate it.

> Thanks,
> Nick



* Re: [ovs-dev] [RFC PATCH 4/7] net: openvswitch: ovs_vport_receive reduce stack usage
  2023-10-04  7:29     ` Nicholas Piggin
@ 2023-10-04 15:16       ` Aaron Conole
  0 siblings, 0 replies; 28+ messages in thread
From: Aaron Conole @ 2023-10-04 15:16 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: netdev, dev, Ilya Maximets, Eelco Chaudron, Flavio Leitner

"Nicholas Piggin" <npiggin@gmail.com> writes:

> On Fri Sep 29, 2023 at 1:26 AM AEST, Aaron Conole wrote:
>> Nicholas Piggin <npiggin@gmail.com> writes:
>>
>> > Dynamically allocating the sw_flow_key reduces stack usage of
>> > ovs_vport_receive from 544 bytes to 64 bytes at the cost of
>> > another GFP_ATOMIC allocation in the receive path.
>> >
>> > XXX: is this a problem with memory reserves if ovs is in a
>> > memory reclaim path, or since we have a skb allocated, is it
>> > okay to use some GFP_ATOMIC reserves?
>> >
>> > Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>> > ---
>>
>> This represents a fairly large performance hit.  Just my own quick
>> testing on a system using two netns, iperf3, and simple forwarding rules
>> shows between 2.5% and 4% performance reduction on x86-64.  Note that it
>> is a simple case, and doesn't involve a more involved scenario like
>> multiple bridges, tunnels, and internal ports.  I suspect such cases
>> will see even bigger hit.
>>
>> I don't know the impact of the other changes, but just an FYI that the
>> performance impact of this change is extremely noticeable on x86
>> platform.
>>
>> ----
>> ip netns add left
>> ip netns add right
>>
>> ip link add eth0 type veth peer name l0
>> ip link set eth0 netns left
>> ip netns exec left ip addr add 172.31.110.1/24 dev eth0
>> ip netns exec left ip link set eth0 up
>> ip link set l0 up
>>
>> ip link add eth0 type veth peer name r0
>> ip link set eth0 netns right
>> ip netns exec right ip addr add 172.31.110.2/24 dev eth0
>> ip netns exec right ip link set eth0 up
>> ip link set r0 up
>>
>> python3 ovs-dpctl.py add-dp br0
>> python3 ovs-dpctl.py add-if br0 l0
>> python3 ovs-dpctl.py add-if br0 r0
>>
>> python3 ovs-dpctl.py add-flow \
>>   br0 'in_port(1),eth(),eth_type(0x806),arp()' 2
>>   
>> python3 ovs-dpctl.py add-flow \
>>   br0 'in_port(2),eth(),eth_type(0x806),arp()' 1
>>
>> python3 ovs-dpctl.py add-flow \
>>   br0 'in_port(1),eth(),eth_type(0x800),ipv4()' 2
>>
>> python3 ovs-dpctl.py add-flow \
>>   br0 'in_port(2),eth(),eth_type(0x800),ipv4()' 1
>>
>> ----
>>
>> ex results without this patch:
>> [root@wsfd-netdev60 ~]# ip netns exec right ./git/iperf/src/iperf3 -c 172.31.110.1
>> ...
>> [  5]   0.00-10.00  sec  46.7 GBytes  40.2 Gbits/sec    0             sender
>> [  5]   0.00-10.00  sec  46.7 GBytes  40.2 Gbits/sec                  receiver
>>
>>
>> ex results with this patch:
>> [root@wsfd-netdev60 ~]# ip netns exec right ./git/iperf/src/iperf3 -c 172.31.110.1
>> ...
>> [  5]   0.00-10.00  sec  44.9 GBytes  38.6 Gbits/sec    0             sender
>> [  5]   0.00-10.00  sec  44.9 GBytes  38.6 Gbits/sec                  receiver
>>
>> I did testing with udp at various bandwidths and this tcp testing.
>
> Thanks for the test case. It works perfectly in the end, but it took me
> days to get there because of a random conspiracy of problems I hit :(
> Sorry for the slow reply, but I was now able to test another idea for
> this. Performance seems to be within the noise with the full series, but
> my system is only getting ~half the rate of yours so you might see more
> movement.
>
> Instead of slab it reuses the per-cpu actions key allocator here.
>
> https://github.com/torvalds/linux/commit/878f01f04ca858e445ff4b4c64351a25bb8399e3
>
> Pushed the series to kvm branch of https://github.com/npiggin/linux
>
> I can repost the series as a second RFC but will wait for thoughts on
> this approach.

Thanks - I'll take a look at it.

> Thanks,
> Nick



* Re: [ovs-dev] [RFC PATCH 4/7] net: openvswitch: ovs_vport_receive reduce stack usage
  2023-09-29  8:38       ` Eelco Chaudron
  2023-10-04  7:11         ` Nicholas Piggin
@ 2023-10-05  2:01         ` Nicholas Piggin
  2023-10-11 13:34           ` Aaron Conole
  1 sibling, 1 reply; 28+ messages in thread
From: Nicholas Piggin @ 2023-10-05  2:01 UTC (permalink / raw)
  To: Eelco Chaudron; +Cc: Aaron Conole, netdev, dev, Ilya Maximets, Flavio Leitner

On Fri Sep 29, 2023 at 6:38 PM AEST, Eelco Chaudron wrote:
>
>
> On 29 Sep 2023, at 9:00, Nicholas Piggin wrote:
>
> > On Fri Sep 29, 2023 at 1:26 AM AEST, Aaron Conole wrote:
> >> Nicholas Piggin <npiggin@gmail.com> writes:
> >>
> >>> Dynamically allocating the sw_flow_key reduces stack usage of
> >>> ovs_vport_receive from 544 bytes to 64 bytes at the cost of
> >>> another GFP_ATOMIC allocation in the receive path.
> >>>
> >>> XXX: is this a problem with memory reserves if ovs is in a
> >>> memory reclaim path, or since we have a skb allocated, is it
> >>> okay to use some GFP_ATOMIC reserves?
> >>>
> >>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> >>> ---
> >>
> >> This represents a fairly large performance hit.  Just my own quick
> >> testing on a system using two netns, iperf3, and simple forwarding rules
> >> shows between 2.5% and 4% performance reduction on x86-64.  Note that it
> >> is a simple case, and doesn't involve a more involved scenario like
> >> multiple bridges, tunnels, and internal ports.  I suspect such cases
> >> will see even bigger hit.
> >>
> >> I don't know the impact of the other changes, but just an FYI that the
> >> performance impact of this change is extremely noticeable on x86
> >> platform.
> >
> > Thanks for the numbers. This patch is probably the biggest perf cost,
> > but unfortunately it's also about the biggest saving. I might have an
> > idea to improve it.
>
> Also, were you able to figure out why we do not see this problem on x86 and arm64? Is the stack usage so much larger, or is there some other root cause? Is there a simple replicator, as this might help you profile the differences between the architectures?

I found some snippets of equivalent call chains (this is for 4.18 RHEL8
kernels, but it's just to give a general idea of stack overhead
differences in C code). Frame sizes are annotated on the right hand
side; the first trace is ppc64le, the second x86-64:

[c0000007ffdba980] do_execute_actions     496
[c0000007ffdbab70] ovs_execute_actions    128
[c0000007ffdbabf0] ovs_dp_process_packet  208
[c0000007ffdbacc0] clone_execute          176
[c0000007ffdbad70] do_execute_actions     496
[c0000007ffdbaf60] ovs_execute_actions    128
[c0000007ffdbafe0] ovs_dp_process_packet  208
[c0000007ffdbb0b0] ovs_vport_receive      528
[c0000007ffdbb2c0] internal_dev_xmit
                                 total = 2368
[ff49b6d4065a3628] do_execute_actions     416
[ff49b6d4065a37c8] ovs_execute_actions     48
[ff49b6d4065a37f8] ovs_dp_process_packet  112
[ff49b6d4065a3868] clone_execute           64
[ff49b6d4065a38a8] do_execute_actions     416
[ff49b6d4065a3a48] ovs_execute_actions     48
[ff49b6d4065a3a78] ovs_dp_process_packet  112
[ff49b6d4065a3ae8] ovs_vport_receive      496
[ff49b6d4065a3cd8] netdev_frame_hook
                                 total = 1712

That's more significant than I thought: nearly 40% more stack usage for
ppc, even with 3 frames having large local variables that can't be
avoided on either arch.

So x86_64 could be quite safe with its 16kB stack for the same
workload, which explains why the same overflow has not been seen there.

Thanks,
Nick


* Re: [ovs-dev] [RFC PATCH 4/7] net: openvswitch: ovs_vport_receive reduce stack usage
  2023-10-05  2:01         ` Nicholas Piggin
@ 2023-10-11 13:34           ` Aaron Conole
  2023-10-11 23:58             ` Nicholas Piggin
  0 siblings, 1 reply; 28+ messages in thread
From: Aaron Conole @ 2023-10-11 13:34 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Eelco Chaudron, netdev, dev, Ilya Maximets, Flavio Leitner

"Nicholas Piggin" <npiggin@gmail.com> writes:

> On Fri Sep 29, 2023 at 6:38 PM AEST, Eelco Chaudron wrote:
>>
>>
>> On 29 Sep 2023, at 9:00, Nicholas Piggin wrote:
>>
>> > On Fri Sep 29, 2023 at 1:26 AM AEST, Aaron Conole wrote:
>> >> Nicholas Piggin <npiggin@gmail.com> writes:
>> >>
>> >>> Dynamically allocating the sw_flow_key reduces stack usage of
>> >>> ovs_vport_receive from 544 bytes to 64 bytes at the cost of
>> >>> another GFP_ATOMIC allocation in the receive path.
>> >>>
>> >>> XXX: is this a problem with memory reserves if ovs is in a
>> >>> memory reclaim path, or since we have a skb allocated, is it
>> >>> okay to use some GFP_ATOMIC reserves?
>> >>>
>> >>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>> >>> ---
>> >>
>> >> This represents a fairly large performance hit.  Just my own quick
>> >> testing on a system using two netns, iperf3, and simple forwarding rules
>> >> shows between 2.5% and 4% performance reduction on x86-64.  Note that it
>> >> is a simple case, and doesn't involve a more involved scenario like
>> >> multiple bridges, tunnels, and internal ports.  I suspect such cases
>> >> will see even bigger hit.
>> >>
>> >> I don't know the impact of the other changes, but just an FYI that the
>> >> performance impact of this change is extremely noticeable on x86
>> >> platform.
>> >
>> > Thanks for the numbers. This patch is probably the biggest perf cost,
>> > but unfortunately it's also about the biggest saving. I might have an
>> > idea to improve it.
>>
>> Also, were you able to figure out why we do not see this problem on
>> x86 and arm64? Is the stack usage so much larger, or is there some
>> other root cause? Is there a simple replicator, as this might help
>> you profile the differences between the architectures?
>
> I found some snippets of equivalent call chain (this is for 4.18 RHEL8
> kernels, but it's just to give a general idea of stack overhead
> differences in C code). Frame size annotated on the right hand side:
>
> [c0000007ffdba980] do_execute_actions     496
> [c0000007ffdbab70] ovs_execute_actions    128
> [c0000007ffdbabf0] ovs_dp_process_packet  208
> [c0000007ffdbacc0] clone_execute          176
> [c0000007ffdbad70] do_execute_actions     496
> [c0000007ffdbaf60] ovs_execute_actions    128
> [c0000007ffdbafe0] ovs_dp_process_packet  208
> [c0000007ffdbb0b0] ovs_vport_receive      528
> [c0000007ffdbb2c0] internal_dev_xmit
>                                  total = 2368
> [ff49b6d4065a3628] do_execute_actions     416
> [ff49b6d4065a37c8] ovs_execute_actions     48
> [ff49b6d4065a37f8] ovs_dp_process_packet  112
> [ff49b6d4065a3868] clone_execute           64
> [ff49b6d4065a38a8] do_execute_actions     416
> [ff49b6d4065a3a48] ovs_execute_actions     48
> [ff49b6d4065a3a78] ovs_dp_process_packet  112
> [ff49b6d4065a3ae8] ovs_vport_receive      496
> [ff49b6d4065a3cd8] netdev_frame_hook
>                                  total = 1712
>
> That's more significant than I thought, nearly 40% more stack usage for
> ppc even with 3 frames having large local variables that can't be
> avoided for either arch.
>
> So, x86_64 could be quite safe with its 16kB stack for the same
> workload, explaining why same overflow has not been seen there.

This is interesting - is it possible that we could resolve this without
needing to change the kernel, or at least without changing how OVS
works?  Why are these so different?  Maybe there's some bloat in some
of the ppc data structures that could be addressed?  For example,
ovs_execute_actions shouldn't really be that different, but I wonder if
the way the per-cpu infra works, or how the deferred action processing
gets inlined, could be causing stack bloat.

> Thanks,
> Nick



* Re: [ovs-dev] [RFC PATCH 4/7] net: openvswitch: ovs_vport_receive reduce stack usage
  2023-10-11 13:34           ` Aaron Conole
@ 2023-10-11 23:58             ` Nicholas Piggin
  0 siblings, 0 replies; 28+ messages in thread
From: Nicholas Piggin @ 2023-10-11 23:58 UTC (permalink / raw)
  To: Aaron Conole; +Cc: Eelco Chaudron, netdev, dev, Ilya Maximets, Flavio Leitner

On Wed Oct 11, 2023 at 11:34 PM AEST, Aaron Conole wrote:
> "Nicholas Piggin" <npiggin@gmail.com> writes:
>
> > On Fri Sep 29, 2023 at 6:38 PM AEST, Eelco Chaudron wrote:
> >>
> >>
> >> On 29 Sep 2023, at 9:00, Nicholas Piggin wrote:
> >>
> >> > On Fri Sep 29, 2023 at 1:26 AM AEST, Aaron Conole wrote:
> >> >> Nicholas Piggin <npiggin@gmail.com> writes:
> >> >>
> >> >>> Dynamically allocating the sw_flow_key reduces stack usage of
> >> >>> ovs_vport_receive from 544 bytes to 64 bytes at the cost of
> >> >>> another GFP_ATOMIC allocation in the receive path.
> >> >>>
> >> >>> XXX: is this a problem with memory reserves if ovs is in a
> >> >>> memory reclaim path, or since we have a skb allocated, is it
> >> >>> okay to use some GFP_ATOMIC reserves?
> >> >>>
> >> >>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> >> >>> ---
> >> >>
> >> >> This represents a fairly large performance hit.  Just my own quick
> >> >> testing on a system using two netns, iperf3, and simple forwarding rules
> >> >> shows between 2.5% and 4% performance reduction on x86-64.  Note that it
> >> >> is a simple case, and doesn't involve a more involved scenario like
> >> >> multiple bridges, tunnels, and internal ports.  I suspect such cases
> >> >> will see even bigger hit.
> >> >>
> >> >> I don't know the impact of the other changes, but just an FYI that the
> >> >> performance impact of this change is extremely noticeable on x86
> >> >> platform.
> >> >
> >> > Thanks for the numbers. This patch is probably the biggest perf cost,
> >> > but unfortunately it's also about the biggest saving. I might have an
> >> > idea to improve it.
> >>
> >> Also, were you able to figure out why we do not see this problem on
> >> x86 and arm64? Is the stack usage so much larger, or is there some
> >> other root cause? Is there a simple replicator, as this might help
> >> you profile the differences between the architectures?
> >
> > I found some snippets of equivalent call chain (this is for 4.18 RHEL8
> > kernels, but it's just to give a general idea of stack overhead
> > differences in C code). Frame size annotated on the right hand side:
> >
> > [c0000007ffdba980] do_execute_actions     496
> > [c0000007ffdbab70] ovs_execute_actions    128
> > [c0000007ffdbabf0] ovs_dp_process_packet  208
> > [c0000007ffdbacc0] clone_execute          176
> > [c0000007ffdbad70] do_execute_actions     496
> > [c0000007ffdbaf60] ovs_execute_actions    128
> > [c0000007ffdbafe0] ovs_dp_process_packet  208
> > [c0000007ffdbb0b0] ovs_vport_receive      528
> > [c0000007ffdbb2c0] internal_dev_xmit
> >                                  total = 2368
> > [ff49b6d4065a3628] do_execute_actions     416
> > [ff49b6d4065a37c8] ovs_execute_actions     48
> > [ff49b6d4065a37f8] ovs_dp_process_packet  112
> > [ff49b6d4065a3868] clone_execute           64
> > [ff49b6d4065a38a8] do_execute_actions     416
> > [ff49b6d4065a3a48] ovs_execute_actions     48
> > [ff49b6d4065a3a78] ovs_dp_process_packet  112
> > [ff49b6d4065a3ae8] ovs_vport_receive      496
> > [ff49b6d4065a3cd8] netdev_frame_hook
> >                                  total = 1712
> >
> > That's more significant than I thought, nearly 40% more stack usage for
> > ppc even with 3 frames having large local variables that can't be
> > avoided for either arch.
> >
> > So, x86_64 could be quite safe with its 16kB stack for the same
> > workload, explaining why same overflow has not been seen there.
>
> This is interesting - is it possible that we could resolve this without
> needing to change the kernel - or at least without changing how OVS
> works?

Not really.

To be clear, I'm not saying ovs is the one and only problem; it could
also be resolved if the stack were larger or if other things did not
use so much, etc.

Maybe other things could be changed too, but ovs uses several kB of
stack that it doesn't need to, and since it is also causing recursion
it needs to be as tight as possible with its stack use.

> Why are these so different?  Maybe there's some bloat in some of
> the ppc data structures that can be addressed?  For example,
> ovs_execute_actions shouldn't really be that different, but I wonder if
> the way the per-cpu infra works, or the deferred action processing gets
> inlined would be causing stack bloat?

Most of the other stack usage is not due to the Linux powerpc arch
defining certain types and structures to be larger (most are the same
size as on other 64-bit archs); rather, it's due to C and GCC. I have
asked the powerpc GCC people about stack size and there is no easy
option to reduce it; even if it were possible to improve in a new
version of GCC, we would still need to deal with the old ones.

Powerpc has a larger minimum stack frame size (32 bytes) and larger
alignment (32 bytes vs 16, IIRC). It also has more non-volatile
registers and probably uses them more, which requires saving them to
the stack. So some of it is fundamental.

In some cases I can't really see why GCC on ppc uses so much. AFAIKS
ovs_execute_actions could be using 96 bytes, but it's possible I'm
missing an alignment requirement.

Thanks,
Nick



end of thread (newest message: 2023-10-11 23:58 UTC)

Thread overview: 28+ messages
2023-09-27  0:13 [RFC PATCH 0/7] net: openvswitch: Reduce stack usage Nicholas Piggin
2023-09-27  0:13 ` [RFC PATCH 1/7] net: openvswitch: Move NSH buffer out of do_execute_actions Nicholas Piggin
2023-09-27  8:26   ` [ovs-dev] " Ilya Maximets
2023-09-27 10:03     ` Nicholas Piggin
2023-09-27  0:13 ` [RFC PATCH 2/7] net: openvswitch: Reduce execute_push_nsh stack overhead Nicholas Piggin
2023-09-27  0:13 ` [RFC PATCH 3/7] net: openvswitch: uninline action execution Nicholas Piggin
2023-09-27  0:13 ` [RFC PATCH 4/7] net: openvswitch: ovs_vport_receive reduce stack usage Nicholas Piggin
2023-09-28 15:26   ` [ovs-dev] " Aaron Conole
2023-09-29  7:00     ` Nicholas Piggin
2023-09-29  8:38       ` Eelco Chaudron
2023-10-04  7:11         ` Nicholas Piggin
2023-10-04 15:15           ` Aaron Conole
2023-10-05  2:01         ` Nicholas Piggin
2023-10-11 13:34           ` Aaron Conole
2023-10-11 23:58             ` Nicholas Piggin
2023-10-04  7:29     ` Nicholas Piggin
2023-10-04 15:16       ` Aaron Conole
2023-09-27  0:13 ` [RFC PATCH 5/7] net: openvswitch: uninline ovs_fragment to control " Nicholas Piggin
2023-09-27  0:13 ` [RFC PATCH 6/7] net: openvswitch: Reduce ovs_fragment " Nicholas Piggin
2023-09-27  0:13 ` [RFC PATCH 7/7] net: openvswitch: Reduce stack usage in ovs_dp_process_packet Nicholas Piggin
2023-09-27 15:22   ` kernel test robot
2023-09-27  8:36 ` [ovs-dev] [RFC PATCH 0/7] net: openvswitch: Reduce stack usage Ilya Maximets
2023-09-28  1:52   ` Nicholas Piggin
2023-10-02 11:54     ` Ilya Maximets
2023-10-04  9:56       ` Nicholas Piggin
2023-09-29  7:06   ` Nicholas Piggin
2023-10-02 11:56     ` Ilya Maximets
2023-10-03 13:31       ` Aaron Conole
