* [RFC PATCH 0/3] gro: some minor optimizations
@ 2022-01-18 15:24 Paolo Abeni
  2022-01-18 15:24 ` [RFC PATCH 1/3] net: gro: avoid re-computing truesize twice on recycle Paolo Abeni
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Paolo Abeni @ 2022-01-18 15:24 UTC (permalink / raw)
  To: netdev; +Cc: Eric Dumazet

This series collects a few small optimizations for the GRO engine.
I measured a 10% performance improvement in micro-benchmarks around
dev_gro_receive(), but the deltas are within the noise range in
throughput tests - hence the RFC tag.

Any feedback is more than welcome.

Paolo Abeni (3):
  net: gro: avoid re-computing truesize twice on recycle
  net: gro: minor optimization for dev_gro_receive()
  net: gro: register gso and gro offload on separate lists

 include/linux/netdevice.h |   3 +-
 include/net/gro.h         |  13 +++--
 net/core/gro.c            | 107 +++++++++++++++++++++-----------------
 3 files changed, 70 insertions(+), 53 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [RFC PATCH 1/3] net: gro: avoid re-computing truesize twice on recycle
  2022-01-18 15:24 [RFC PATCH 0/3] gro: some minor optimizations Paolo Abeni
@ 2022-01-18 15:24 ` Paolo Abeni
  2022-01-18 15:24 ` [RFC PATCH 2/3] net: gro: minor optimization for dev_gro_receive() Paolo Abeni
  2022-01-18 15:24 ` [RFC PATCH 3/3] net: gro: register gso and gro offload on separate lists Paolo Abeni
  2 siblings, 0 replies; 10+ messages in thread
From: Paolo Abeni @ 2022-01-18 15:24 UTC (permalink / raw)
  To: netdev; +Cc: Eric Dumazet

After commit 5e10da5385d2 ("skbuff: allow 'slow_gro' for skb
carring sock reference") and commit af352460b465 ("net: fix GRO
skb truesize update"), the truesize of the freed skb is properly
updated by the GRO engine; we no longer need to reset it at recycle time.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/core/gro.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/core/gro.c b/net/core/gro.c
index a11b286d1495..d43d42215bdb 100644
--- a/net/core/gro.c
+++ b/net/core/gro.c
@@ -634,7 +634,6 @@ static void napi_reuse_skb(struct napi_struct *napi, struct sk_buff *skb)
 
 	skb->encapsulation = 0;
 	skb_shinfo(skb)->gso_type = 0;
-	skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
 	if (unlikely(skb->slow_gro)) {
 		skb_orphan(skb);
 		skb_ext_reset(skb);
-- 
2.34.1



* [RFC PATCH 2/3] net: gro: minor optimization for dev_gro_receive()
  2022-01-18 15:24 [RFC PATCH 0/3] gro: some minor optimizations Paolo Abeni
  2022-01-18 15:24 ` [RFC PATCH 1/3] net: gro: avoid re-computing truesize twice on recycle Paolo Abeni
@ 2022-01-18 15:24 ` Paolo Abeni
  2022-01-18 15:56   ` Alexander Lobakin
  2022-01-18 15:24 ` [RFC PATCH 3/3] net: gro: register gso and gro offload on separate lists Paolo Abeni
  2 siblings, 1 reply; 10+ messages in thread
From: Paolo Abeni @ 2022-01-18 15:24 UTC (permalink / raw)
  To: netdev; +Cc: Eric Dumazet

While inspecting some perf reports, I noticed that the compiler
emits suboptimal code for the napi CB initialization, fetching
and storing the memory backing the flags bitfield multiple times.
This is with gcc 10.3.1, but I observed the same with older compiler
versions.

We can help the compiler do a better job, e.g. by initially setting
the whole bitfield area to 0 through a u32 alias. The generated code
is a bit smaller, with the same number of conditionals.

Before:
objdump -t net/core/gro.o | grep " F .text"
0000000000000bb0 l     F .text	0000000000000357 dev_gro_receive

After:
0000000000000bb0 l     F .text	000000000000033c dev_gro_receive
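
For reference outside the kernel tree, the trick can be sketched in plain
C: a zero-length array member provides a u32 alias over the flag fields,
so one aligned store clears them all. The struct below is a hypothetical
miniature, not the real napi_gro_cb layout:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical miniature of napi_gro_cb: a handful of per-packet fields
 * that the fast path must reset for every skb. Zero-length arrays are a
 * GNU C extension, as used in the patch itself. */
struct mini_cb {
	uint16_t count;
	/* start of the region cleared on every GRO iteration */
	uint32_t zeroed_start[0];
	uint16_t gro_remcsum_start;
	uint8_t  same_flow:1;
	uint8_t  encap_mark:1;
	uint8_t  free:2;
	uint8_t  is_fou:1;
	uint8_t  pad; /* keeps the zeroed region exactly one u32 wide */
	uint32_t zeroed_end[0];
	uint32_t csum;
};

#define zeroed_len (offsetof(struct mini_cb, zeroed_end) - \
		    offsetof(struct mini_cb, zeroed_start))

/* One aligned 32-bit store instead of several read-modify-write bitfield
 * assignments. The kernel builds with -fno-strict-aliasing, which makes
 * this kind of aliasing safe there. */
static void clear_flags(struct mini_cb *cb)
{
	cb->zeroed_start[0] = 0;
}
```

A BUILD_BUG_ON()-style check that `zeroed_len` equals one u32, as the
patch adds in dev_gro_receive(), keeps the alias honest if fields are
later reordered.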

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/net/gro.h | 13 +++++++++----
 net/core/gro.c    | 16 +++++-----------
 2 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/include/net/gro.h b/include/net/gro.h
index 8f75802d50fd..a068b27d341f 100644
--- a/include/net/gro.h
+++ b/include/net/gro.h
@@ -29,14 +29,17 @@ struct napi_gro_cb {
 	/* Number of segments aggregated. */
 	u16	count;
 
-	/* Start offset for remote checksum offload */
-	u16	gro_remcsum_start;
+	/* Used in ipv6_gro_receive() and foo-over-udp */
+	u16	proto;
 
 	/* jiffies when first packet was created/queued */
 	unsigned long age;
 
-	/* Used in ipv6_gro_receive() and foo-over-udp */
-	u16	proto;
+	/* portion of the cb set to zero at every gro iteration */
+	u32	zeroed_start[0];
+
+	/* Start offset for remote checksum offload */
+	u16	gro_remcsum_start;
 
 	/* This is non-zero if the packet may be of the same flow. */
 	u8	same_flow:1;
@@ -70,6 +73,8 @@ struct napi_gro_cb {
 	/* GRO is done by frag_list pointer chaining. */
 	u8	is_flist:1;
 
+	u32	zeroed_end[0];
+
 	/* used to support CHECKSUM_COMPLETE for tunneling protocols */
 	__wsum	csum;
 
diff --git a/net/core/gro.c b/net/core/gro.c
index d43d42215bdb..b9ebe9298731 100644
--- a/net/core/gro.c
+++ b/net/core/gro.c
@@ -435,6 +435,9 @@ static void gro_flush_oldest(struct napi_struct *napi, struct list_head *head)
 	napi_gro_complete(napi, oldest);
 }
 
+#define zeroed_len	(offsetof(struct napi_gro_cb, zeroed_end) - \
+			 offsetof(struct napi_gro_cb, zeroed_start))
+
 static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
 {
 	u32 bucket = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1);
@@ -459,29 +462,20 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
 
 		skb_set_network_header(skb, skb_gro_offset(skb));
 		skb_reset_mac_len(skb);
-		NAPI_GRO_CB(skb)->same_flow = 0;
+		BUILD_BUG_ON(zeroed_len != sizeof(NAPI_GRO_CB(skb)->zeroed_start[0]));
+		NAPI_GRO_CB(skb)->zeroed_start[0] = 0;
 		NAPI_GRO_CB(skb)->flush = skb_is_gso(skb) || skb_has_frag_list(skb);
-		NAPI_GRO_CB(skb)->free = 0;
-		NAPI_GRO_CB(skb)->encap_mark = 0;
-		NAPI_GRO_CB(skb)->recursion_counter = 0;
-		NAPI_GRO_CB(skb)->is_fou = 0;
 		NAPI_GRO_CB(skb)->is_atomic = 1;
-		NAPI_GRO_CB(skb)->gro_remcsum_start = 0;
 
 		/* Setup for GRO checksum validation */
 		switch (skb->ip_summed) {
 		case CHECKSUM_COMPLETE:
 			NAPI_GRO_CB(skb)->csum = skb->csum;
 			NAPI_GRO_CB(skb)->csum_valid = 1;
-			NAPI_GRO_CB(skb)->csum_cnt = 0;
 			break;
 		case CHECKSUM_UNNECESSARY:
 			NAPI_GRO_CB(skb)->csum_cnt = skb->csum_level + 1;
-			NAPI_GRO_CB(skb)->csum_valid = 0;
 			break;
-		default:
-			NAPI_GRO_CB(skb)->csum_cnt = 0;
-			NAPI_GRO_CB(skb)->csum_valid = 0;
 		}
 
 		pp = INDIRECT_CALL_INET(ptype->callbacks.gro_receive,
-- 
2.34.1



* [RFC PATCH 3/3] net: gro: register gso and gro offload on separate lists
  2022-01-18 15:24 [RFC PATCH 0/3] gro: some minor optimizations Paolo Abeni
  2022-01-18 15:24 ` [RFC PATCH 1/3] net: gro: avoid re-computing truesize twice on recycle Paolo Abeni
  2022-01-18 15:24 ` [RFC PATCH 2/3] net: gro: minor optimization for dev_gro_receive() Paolo Abeni
@ 2022-01-18 15:24 ` Paolo Abeni
  2 siblings, 0 replies; 10+ messages in thread
From: Paolo Abeni @ 2022-01-18 15:24 UTC (permalink / raw)
  To: netdev; +Cc: Eric Dumazet

So that we know each element in gro_list has valid gro callbacks
(and the same for gso). This allows dropping a bunch of conditionals
in the fast path.

Before:
objdump -t net/core/gro.o | grep " F .text"
0000000000000bb0 l     F .text  000000000000033c dev_gro_receive

After:
0000000000000bb0 l     F .text  0000000000000325 dev_gro_receive
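
The split registration keeps each list ordered by priority, exactly as
the single-list walk did. A minimal userspace sketch of that ordered
insert (hypothetical types, not the kernel's list_head API):

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the priority-ordered insert that offload_list_insert()
 * performs: walk until the first element with a strictly higher priority
 * value and link the new node in before it, so equal priorities keep
 * registration order. */
struct offload {
	int priority;
	const char *name;
	struct offload *next;
};

static void offload_insert_sorted(struct offload **head, struct offload *po)
{
	struct offload **pos = head;

	/* advance past every element that should stay ahead of po */
	while (*pos && (*pos)->priority <= po->priority)
		pos = &(*pos)->next;
	po->next = *pos;
	*pos = po;
}
```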

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/linux/netdevice.h |  3 +-
 net/core/gro.c            | 90 +++++++++++++++++++++++----------------
 2 files changed, 56 insertions(+), 37 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3213c7227b59..406cb457d788 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2564,7 +2564,8 @@ struct packet_offload {
 	__be16			 type;	/* This is really htons(ether_type). */
 	u16			 priority;
 	struct offload_callbacks callbacks;
-	struct list_head	 list;
+	struct list_head	 gro_list;
+	struct list_head	 gso_list;
 };
 
 /* often modified stats are per-CPU, other are shared (netdev->stats) */
diff --git a/net/core/gro.c b/net/core/gro.c
index b9ebe9298731..5d7bc6813a7d 100644
--- a/net/core/gro.c
+++ b/net/core/gro.c
@@ -10,10 +10,21 @@
 #define GRO_MAX_HEAD (MAX_HEADER + 128)
 
 static DEFINE_SPINLOCK(offload_lock);
-static struct list_head offload_base __read_mostly = LIST_HEAD_INIT(offload_base);
+static struct list_head gro_offload_base __read_mostly = LIST_HEAD_INIT(gro_offload_base);
+static struct list_head gso_offload_base __read_mostly = LIST_HEAD_INIT(gso_offload_base);
 /* Maximum number of GRO_NORMAL skbs to batch up for list-RX */
 int gro_normal_batch __read_mostly = 8;
 
+#define offload_list_insert(head, poff, list)		\
+({							\
+	struct packet_offload *elem;			\
+	list_for_each_entry(elem, head, list) {		\
+		if ((poff)->priority < elem->priority)	\
+			break;				\
+	}						\
+	list_add_rcu(&(poff)->list, elem->list.prev);	\
+})
+
 /**
  *	dev_add_offload - register offload handlers
  *	@po: protocol offload declaration
@@ -28,18 +39,33 @@ int gro_normal_batch __read_mostly = 8;
  */
 void dev_add_offload(struct packet_offload *po)
 {
-	struct packet_offload *elem;
-
 	spin_lock(&offload_lock);
-	list_for_each_entry(elem, &offload_base, list) {
-		if (po->priority < elem->priority)
-			break;
-	}
-	list_add_rcu(&po->list, elem->list.prev);
+	if (po->callbacks.gro_receive && po->callbacks.gro_complete)
+		offload_list_insert(&gro_offload_base, po, gro_list);
+	else if (po->callbacks.gro_complete)
+		pr_warn("missing gro_receive callback\n");
+	else if (po->callbacks.gro_receive)
+		pr_warn("missing gro_complete callback\n");
+
+	if (po->callbacks.gso_segment)
+		offload_list_insert(&gso_offload_base, po, gso_list);
 	spin_unlock(&offload_lock);
 }
 EXPORT_SYMBOL(dev_add_offload);
 
+#define offload_list_remove(type, head, poff, list)	\
+({							\
+	struct packet_offload *elem;			\
+	list_for_each_entry(elem, head, list) {		\
+		if ((poff) == elem) {			\
+			list_del_rcu(&(poff)->list);	\
+			break;				\
+		}					\
+	}						\
+	if (elem != (poff))				\
+		pr_warn("dev_remove_offload: %p not found in %s list\n", (poff), type); \
+})
+
 /**
  *	__dev_remove_offload	 - remove offload handler
  *	@po: packet offload declaration
@@ -55,20 +81,12 @@ EXPORT_SYMBOL(dev_add_offload);
  */
 static void __dev_remove_offload(struct packet_offload *po)
 {
-	struct list_head *head = &offload_base;
-	struct packet_offload *po1;
-
 	spin_lock(&offload_lock);
+	if (po->callbacks.gro_receive)
+		offload_list_remove("gro", &gro_offload_base, po, gro_list);
 
-	list_for_each_entry(po1, head, list) {
-		if (po == po1) {
-			list_del_rcu(&po->list);
-			goto out;
-		}
-	}
-
-	pr_warn("dev_remove_offload: %p not found\n", po);
-out:
+	if (po->callbacks.gso_segment)
+		offload_list_remove("gso", &gso_offload_base, po, gso_list);
 	spin_unlock(&offload_lock);
 }
 
@@ -111,8 +129,8 @@ struct sk_buff *skb_mac_gso_segment(struct sk_buff *skb,
 	__skb_pull(skb, vlan_depth);
 
 	rcu_read_lock();
-	list_for_each_entry_rcu(ptype, &offload_base, list) {
-		if (ptype->type == type && ptype->callbacks.gso_segment) {
+	list_for_each_entry_rcu(ptype, &gso_offload_base, gso_list) {
+		if (ptype->type == type) {
 			segs = ptype->callbacks.gso_segment(skb, features);
 			break;
 		}
@@ -250,7 +268,7 @@ static void napi_gro_complete(struct napi_struct *napi, struct sk_buff *skb)
 {
 	struct packet_offload *ptype;
 	__be16 type = skb->protocol;
-	struct list_head *head = &offload_base;
+	struct list_head *head = &gro_offload_base;
 	int err = -ENOENT;
 
 	BUILD_BUG_ON(sizeof(struct napi_gro_cb) > sizeof(skb->cb));
@@ -261,8 +279,8 @@ static void napi_gro_complete(struct napi_struct *napi, struct sk_buff *skb)
 	}
 
 	rcu_read_lock();
-	list_for_each_entry_rcu(ptype, head, list) {
-		if (ptype->type != type || !ptype->callbacks.gro_complete)
+	list_for_each_entry_rcu(ptype, head, gro_list) {
+		if (ptype->type != type)
 			continue;
 
 		err = INDIRECT_CALL_INET(ptype->callbacks.gro_complete,
@@ -273,7 +291,7 @@ static void napi_gro_complete(struct napi_struct *napi, struct sk_buff *skb)
 	rcu_read_unlock();
 
 	if (err) {
-		WARN_ON(&ptype->list == head);
+		WARN_ON(&ptype->gro_list == head);
 		kfree_skb(skb);
 		return;
 	}
@@ -442,7 +460,7 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
 {
 	u32 bucket = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1);
 	struct gro_list *gro_list = &napi->gro_hash[bucket];
-	struct list_head *head = &offload_base;
+	struct list_head *head = &gro_offload_base;
 	struct packet_offload *ptype;
 	__be16 type = skb->protocol;
 	struct sk_buff *pp = NULL;
@@ -456,8 +474,8 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
 	gro_list_prepare(&gro_list->list, skb);
 
 	rcu_read_lock();
-	list_for_each_entry_rcu(ptype, head, list) {
-		if (ptype->type != type || !ptype->callbacks.gro_receive)
+	list_for_each_entry_rcu(ptype, head, gro_list) {
+		if (ptype->type != type)
 			continue;
 
 		skb_set_network_header(skb, skb_gro_offset(skb));
@@ -485,7 +503,7 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
 	}
 	rcu_read_unlock();
 
-	if (&ptype->list == head)
+	if (&ptype->gro_list == head)
 		goto normal;
 
 	if (PTR_ERR(pp) == -EINPROGRESS) {
@@ -541,11 +559,11 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
 
 struct packet_offload *gro_find_receive_by_type(__be16 type)
 {
-	struct list_head *offload_head = &offload_base;
+	struct list_head *offload_head = &gro_offload_base;
 	struct packet_offload *ptype;
 
-	list_for_each_entry_rcu(ptype, offload_head, list) {
-		if (ptype->type != type || !ptype->callbacks.gro_receive)
+	list_for_each_entry_rcu(ptype, offload_head, gro_list) {
+		if (ptype->type != type)
 			continue;
 		return ptype;
 	}
@@ -555,11 +573,11 @@ EXPORT_SYMBOL(gro_find_receive_by_type);
 
 struct packet_offload *gro_find_complete_by_type(__be16 type)
 {
-	struct list_head *offload_head = &offload_base;
+	struct list_head *offload_head = &gro_offload_base;
 	struct packet_offload *ptype;
 
-	list_for_each_entry_rcu(ptype, offload_head, list) {
-		if (ptype->type != type || !ptype->callbacks.gro_complete)
+	list_for_each_entry_rcu(ptype, offload_head, gro_list) {
+		if (ptype->type != type)
 			continue;
 		return ptype;
 	}
-- 
2.34.1



* Re: [RFC PATCH 2/3] net: gro: minor optimization for dev_gro_receive()
  2022-01-18 15:24 ` [RFC PATCH 2/3] net: gro: minor optimization for dev_gro_receive() Paolo Abeni
@ 2022-01-18 15:56   ` Alexander Lobakin
  2022-01-18 16:31     ` Paolo Abeni
  0 siblings, 1 reply; 10+ messages in thread
From: Alexander Lobakin @ 2022-01-18 15:56 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: Alexander Lobakin, netdev, Eric Dumazet, Kees Cook

From: Paolo Abeni <pabeni@redhat.com>
Date: Tue, 18 Jan 2022 16:24:19 +0100

> While inspecting some perf report, I noticed that the compiler
> emits suboptimal code for the napi CB initialization, fetching
> and storing multiple times the memory for flags bitfield.
> This is with gcc 10.3.1, but I observed the same with older compiler
> versions.
> 
> We can help the compiler to do a nicer work e.g. initially setting
> all the bitfield to 0 using an u16 alias. The generated code is quite
> smaller, with the same number of conditional
> 
> Before:
> objdump -t net/core/gro.o | grep " F .text"
> 0000000000000bb0 l     F .text	0000000000000357 dev_gro_receive
> 
> After:
> 0000000000000bb0 l     F .text	000000000000033c dev_gro_receive
> 
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
>  include/net/gro.h | 13 +++++++++----
>  net/core/gro.c    | 16 +++++-----------
>  2 files changed, 14 insertions(+), 15 deletions(-)
> 
> diff --git a/include/net/gro.h b/include/net/gro.h
> index 8f75802d50fd..a068b27d341f 100644
> --- a/include/net/gro.h
> +++ b/include/net/gro.h
> @@ -29,14 +29,17 @@ struct napi_gro_cb {
>  	/* Number of segments aggregated. */
>  	u16	count;
>  
> -	/* Start offset for remote checksum offload */
> -	u16	gro_remcsum_start;
> +	/* Used in ipv6_gro_receive() and foo-over-udp */
> +	u16	proto;
>  
>  	/* jiffies when first packet was created/queued */
>  	unsigned long age;
>  
> -	/* Used in ipv6_gro_receive() and foo-over-udp */
> -	u16	proto;
> +	/* portion of the cb set to zero at every gro iteration */
> +	u32	zeroed_start[0];
> +
> +	/* Start offset for remote checksum offload */
> +	u16	gro_remcsum_start;
>  
>  	/* This is non-zero if the packet may be of the same flow. */
>  	u8	same_flow:1;
> @@ -70,6 +73,8 @@ struct napi_gro_cb {
>  	/* GRO is done by frag_list pointer chaining. */
>  	u8	is_flist:1;
>  
> +	u32	zeroed_end[0];

This should be wrapped in struct_group() I believe, or compilers
will start complaining soon. See [0] for the details.
Adding Kees to the CCs.

> +
>  	/* used to support CHECKSUM_COMPLETE for tunneling protocols */
>  	__wsum	csum;
>  
> diff --git a/net/core/gro.c b/net/core/gro.c
> index d43d42215bdb..b9ebe9298731 100644
> --- a/net/core/gro.c
> +++ b/net/core/gro.c
> @@ -435,6 +435,9 @@ static void gro_flush_oldest(struct napi_struct *napi, struct list_head *head)
>  	napi_gro_complete(napi, oldest);
>  }
>  
> +#define zeroed_len	(offsetof(struct napi_gro_cb, zeroed_end) - \
> +			 offsetof(struct napi_gro_cb, zeroed_start))
> +
>  static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
>  {
>  	u32 bucket = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1);
> @@ -459,29 +462,20 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
>  
>  		skb_set_network_header(skb, skb_gro_offset(skb));
>  		skb_reset_mac_len(skb);
> -		NAPI_GRO_CB(skb)->same_flow = 0;
> +		BUILD_BUG_ON(zeroed_len != sizeof(NAPI_GRO_CB(skb)->zeroed_start[0]));
> +		NAPI_GRO_CB(skb)->zeroed_start[0] = 0;
>  		NAPI_GRO_CB(skb)->flush = skb_is_gso(skb) || skb_has_frag_list(skb);
> -		NAPI_GRO_CB(skb)->free = 0;
> -		NAPI_GRO_CB(skb)->encap_mark = 0;
> -		NAPI_GRO_CB(skb)->recursion_counter = 0;
> -		NAPI_GRO_CB(skb)->is_fou = 0;
>  		NAPI_GRO_CB(skb)->is_atomic = 1;
> -		NAPI_GRO_CB(skb)->gro_remcsum_start = 0;
>  
>  		/* Setup for GRO checksum validation */
>  		switch (skb->ip_summed) {
>  		case CHECKSUM_COMPLETE:
>  			NAPI_GRO_CB(skb)->csum = skb->csum;
>  			NAPI_GRO_CB(skb)->csum_valid = 1;
> -			NAPI_GRO_CB(skb)->csum_cnt = 0;
>  			break;
>  		case CHECKSUM_UNNECESSARY:
>  			NAPI_GRO_CB(skb)->csum_cnt = skb->csum_level + 1;
> -			NAPI_GRO_CB(skb)->csum_valid = 0;
>  			break;
> -		default:
> -			NAPI_GRO_CB(skb)->csum_cnt = 0;
> -			NAPI_GRO_CB(skb)->csum_valid = 0;
>  		}
>  
>  		pp = INDIRECT_CALL_INET(ptype->callbacks.gro_receive,
> -- 
> 2.34.1

[0] https://lore.kernel.org/linux-hardening/20220112220652.3952944-1-keescook@chromium.org

Thanks,
Al


* Re: [RFC PATCH 2/3] net: gro: minor optimization for dev_gro_receive()
  2022-01-18 15:56   ` Alexander Lobakin
@ 2022-01-18 16:31     ` Paolo Abeni
  2022-01-18 17:39       ` Alexander Lobakin
  0 siblings, 1 reply; 10+ messages in thread
From: Paolo Abeni @ 2022-01-18 16:31 UTC (permalink / raw)
  To: Alexander Lobakin; +Cc: netdev, Eric Dumazet, Kees Cook

On Tue, 2022-01-18 at 16:56 +0100, Alexander Lobakin wrote:
> From: Paolo Abeni <pabeni@redhat.com>
> Date: Tue, 18 Jan 2022 16:24:19 +0100
> 
> > While inspecting some perf report, I noticed that the compiler
> > emits suboptimal code for the napi CB initialization, fetching
> > and storing multiple times the memory for flags bitfield.
> > This is with gcc 10.3.1, but I observed the same with older compiler
> > versions.
> > 
> > We can help the compiler to do a nicer work e.g. initially setting
> > all the bitfield to 0 using an u16 alias. The generated code is quite
> > smaller, with the same number of conditional
> > 
> > Before:
> > objdump -t net/core/gro.o | grep " F .text"
> > 0000000000000bb0 l     F .text	0000000000000357 dev_gro_receive
> > 
> > After:
> > 0000000000000bb0 l     F .text	000000000000033c dev_gro_receive
> > 
> > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > ---
> >  include/net/gro.h | 13 +++++++++----
> >  net/core/gro.c    | 16 +++++-----------
> >  2 files changed, 14 insertions(+), 15 deletions(-)
> > 
> > diff --git a/include/net/gro.h b/include/net/gro.h
> > index 8f75802d50fd..a068b27d341f 100644
> > --- a/include/net/gro.h
> > +++ b/include/net/gro.h
> > @@ -29,14 +29,17 @@ struct napi_gro_cb {
> >  	/* Number of segments aggregated. */
> >  	u16	count;
> >  
> > -	/* Start offset for remote checksum offload */
> > -	u16	gro_remcsum_start;
> > +	/* Used in ipv6_gro_receive() and foo-over-udp */
> > +	u16	proto;
> >  
> >  	/* jiffies when first packet was created/queued */
> >  	unsigned long age;
> >  
> > -	/* Used in ipv6_gro_receive() and foo-over-udp */
> > -	u16	proto;
> > +	/* portion of the cb set to zero at every gro iteration */
> > +	u32	zeroed_start[0];
> > +
> > +	/* Start offset for remote checksum offload */
> > +	u16	gro_remcsum_start;
> >  
> >  	/* This is non-zero if the packet may be of the same flow. */
> >  	u8	same_flow:1;
> > @@ -70,6 +73,8 @@ struct napi_gro_cb {
> >  	/* GRO is done by frag_list pointer chaining. */
> >  	u8	is_flist:1;
> >  
> > +	u32	zeroed_end[0];
> 
> This should be wrapped in struct_group() I believe, or compilers
> will start complaining soon. See [0] for the details.
> Adding Kees to the CCs.

Thank you for the reference. That really slipped my mind.

This patch does not use memcpy() or similar, just a single direct
assignment. Would that still require struct_group()?

Thanks!

Paolo



* Re: [RFC PATCH 2/3] net: gro: minor optimization for dev_gro_receive()
  2022-01-18 16:31     ` Paolo Abeni
@ 2022-01-18 17:39       ` Alexander Lobakin
  2022-02-02 10:09         ` Paolo Abeni
  0 siblings, 1 reply; 10+ messages in thread
From: Alexander Lobakin @ 2022-01-18 17:39 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: Alexander Lobakin, netdev, Eric Dumazet, Kees Cook

From: Paolo Abeni <pabeni@redhat.com>
Date: Tue, 18 Jan 2022 17:31:00 +0100

> On Tue, 2022-01-18 at 16:56 +0100, Alexander Lobakin wrote:
> > From: Paolo Abeni <pabeni@redhat.com>
> > Date: Tue, 18 Jan 2022 16:24:19 +0100
> > 
> > > While inspecting some perf report, I noticed that the compiler
> > > emits suboptimal code for the napi CB initialization, fetching
> > > and storing multiple times the memory for flags bitfield.
> > > This is with gcc 10.3.1, but I observed the same with older compiler
> > > versions.
> > > 
> > > We can help the compiler to do a nicer work e.g. initially setting
> > > all the bitfield to 0 using an u16 alias. The generated code is quite
> > > smaller, with the same number of conditional
> > > 
> > > Before:
> > > objdump -t net/core/gro.o | grep " F .text"
> > > 0000000000000bb0 l     F .text	0000000000000357 dev_gro_receive
> > > 
> > > After:
> > > 0000000000000bb0 l     F .text	000000000000033c dev_gro_receive
> > > 
> > > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > > ---
> > >  include/net/gro.h | 13 +++++++++----
> > >  net/core/gro.c    | 16 +++++-----------
> > >  2 files changed, 14 insertions(+), 15 deletions(-)
> > > 
> > > diff --git a/include/net/gro.h b/include/net/gro.h
> > > index 8f75802d50fd..a068b27d341f 100644
> > > --- a/include/net/gro.h
> > > +++ b/include/net/gro.h
> > > @@ -29,14 +29,17 @@ struct napi_gro_cb {
> > >  	/* Number of segments aggregated. */
> > >  	u16	count;
> > >  
> > > -	/* Start offset for remote checksum offload */
> > > -	u16	gro_remcsum_start;
> > > +	/* Used in ipv6_gro_receive() and foo-over-udp */
> > > +	u16	proto;
> > >  
> > >  	/* jiffies when first packet was created/queued */
> > >  	unsigned long age;
> > >  
> > > -	/* Used in ipv6_gro_receive() and foo-over-udp */
> > > -	u16	proto;
> > > +	/* portion of the cb set to zero at every gro iteration */
> > > +	u32	zeroed_start[0];
> > > +
> > > +	/* Start offset for remote checksum offload */
> > > +	u16	gro_remcsum_start;
> > >  
> > >  	/* This is non-zero if the packet may be of the same flow. */
> > >  	u8	same_flow:1;
> > > @@ -70,6 +73,8 @@ struct napi_gro_cb {
> > >  	/* GRO is done by frag_list pointer chaining. */
> > >  	u8	is_flist:1;
> > >  
> > > +	u32	zeroed_end[0];
> > 
> > This should be wrapped in struct_group() I believe, or compilers
> > will start complaining soon. See [0] for the details.
> > Adding Kees to the CCs.
> 
> Thank you for the reference. That really slipped-off my mind.
> 
> This patch does not use memcpy() or similar, just a single direct
> assignement. Would that still require struct_group()?

Oof, sorry, I saw start/end and overlooked that it's only for
a single assignment.
Then it shouldn't cause warnings, but maybe use an anonymous
union instead?

	union {
		u32 zeroed;
		struct {
			u16 gro_remcsum_start;
			...
		};
	};
	__wsum csum;

You can still use a BUILD_BUG_ON() in this case, like
sizeof(zeroed) != offsetof(csum) - offsetof(zeroed).
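
As a rough sketch (field set illustrative, not the full napi_gro_cb),
the suggested anonymous-union layout with that compile-time check could
look like:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the suggested anonymous union: 'zeroed' aliases the fields
 * that must be cleared on every GRO iteration. */
struct cb_sketch {
	union {
		uint32_t zeroed;
		struct {
			uint16_t gro_remcsum_start;
			uint8_t  same_flow:1;
			uint8_t  encap_mark:1;
			uint8_t  is_flist:1;
			uint8_t  pad; /* keep the union exactly 4 bytes */
		};
	};
	uint32_t csum;
};

/* The kernel would use BUILD_BUG_ON(); a C11 static assert is the
 * userspace equivalent of the suggested check. */
_Static_assert(sizeof(((struct cb_sketch *)0)->zeroed) ==
	       offsetof(struct cb_sketch, csum) -
	       offsetof(struct cb_sketch, zeroed),
	       "'zeroed' must cover all fields up to 'csum'");

static void cb_clear(struct cb_sketch *cb)
{
	cb->zeroed = 0; /* union access: well-defined type punning in GNU C */
}
```

Unlike the zero-length-array variant, the union makes the aliasing
explicit, at the cost of nesting the fields one level deeper.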

> 
> Thanks!
> 
> Paolo

Thanks,
Al


* Re: [RFC PATCH 2/3] net: gro: minor optimization for dev_gro_receive()
  2022-01-18 17:39       ` Alexander Lobakin
@ 2022-02-02 10:09         ` Paolo Abeni
  2022-02-02 12:08           ` Alexander Lobakin
  0 siblings, 1 reply; 10+ messages in thread
From: Paolo Abeni @ 2022-02-02 10:09 UTC (permalink / raw)
  To: Alexander Lobakin; +Cc: netdev, Eric Dumazet, Kees Cook

Hello,

On Tue, 2022-01-18 at 18:39 +0100, Alexander Lobakin wrote:
> From: Paolo Abeni <pabeni@redhat.com>
> Date: Tue, 18 Jan 2022 17:31:00 +0100
> 
> > On Tue, 2022-01-18 at 16:56 +0100, Alexander Lobakin wrote:
> > > From: Paolo Abeni <pabeni@redhat.com>
> > > Date: Tue, 18 Jan 2022 16:24:19 +0100
> > > 
> > > > While inspecting some perf report, I noticed that the compiler
> > > > emits suboptimal code for the napi CB initialization, fetching
> > > > and storing multiple times the memory for flags bitfield.
> > > > This is with gcc 10.3.1, but I observed the same with older compiler
> > > > versions.
> > > > 
> > > > We can help the compiler to do a nicer work e.g. initially setting
> > > > all the bitfield to 0 using an u16 alias. The generated code is quite
> > > > smaller, with the same number of conditional
> > > > 
> > > > Before:
> > > > objdump -t net/core/gro.o | grep " F .text"
> > > > 0000000000000bb0 l     F .text	0000000000000357 dev_gro_receive
> > > > 
> > > > After:
> > > > 0000000000000bb0 l     F .text	000000000000033c dev_gro_receive
> > > > 
> > > > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > > > ---
> > > >  include/net/gro.h | 13 +++++++++----
> > > >  net/core/gro.c    | 16 +++++-----------
> > > >  2 files changed, 14 insertions(+), 15 deletions(-)
> > > > 
> > > > diff --git a/include/net/gro.h b/include/net/gro.h
> > > > index 8f75802d50fd..a068b27d341f 100644
> > > > --- a/include/net/gro.h
> > > > +++ b/include/net/gro.h
> > > > @@ -29,14 +29,17 @@ struct napi_gro_cb {
> > > >  	/* Number of segments aggregated. */
> > > >  	u16	count;
> > > >  
> > > > -	/* Start offset for remote checksum offload */
> > > > -	u16	gro_remcsum_start;
> > > > +	/* Used in ipv6_gro_receive() and foo-over-udp */
> > > > +	u16	proto;
> > > >  
> > > >  	/* jiffies when first packet was created/queued */
> > > >  	unsigned long age;
> > > >  
> > > > -	/* Used in ipv6_gro_receive() and foo-over-udp */
> > > > -	u16	proto;
> > > > +	/* portion of the cb set to zero at every gro iteration */
> > > > +	u32	zeroed_start[0];
> > > > +
> > > > +	/* Start offset for remote checksum offload */
> > > > +	u16	gro_remcsum_start;
> > > >  
> > > >  	/* This is non-zero if the packet may be of the same flow. */
> > > >  	u8	same_flow:1;
> > > > @@ -70,6 +73,8 @@ struct napi_gro_cb {
> > > >  	/* GRO is done by frag_list pointer chaining. */
> > > >  	u8	is_flist:1;
> > > >  
> > > > +	u32	zeroed_end[0];
> > > 
> > > This should be wrapped in struct_group() I believe, or compilers
> > > will start complaining soon. See [0] for the details.
> > > Adding Kees to the CCs.
> > 
> > Thank you for the reference. That really slipped-off my mind.
> > 
> > This patch does not use memcpy() or similar, just a single direct
> > assignement. Would that still require struct_group()?
> 
> Oof, sorry, I saw start/end and overlooked that it's only for
> a single assignment.
> Then it shouldn't cause warnings, but maybe use an anonymous
> union instead?
> 
> 	union {
> 		u32 zeroed;
> 		struct {
> 			u16 gro_remcsum_start;
> 			...
> 		};
> 	};
> 	__wsum csum;
> 
> Use can still use a BUILD_BUG_ON() in this case, like
> sizeof(zeroed) != offsetof(csum) - offsetof(zeroed).

Please forgive me for the very long delay. I'm looking again at this
stuff for a formal non-RFC submission.

I like the anonymous union less, because it would move around much more
code - making the patch less readable - and would be more fragile: e.g.
a comment like "please don't move 'csum' around" would be needed.

No strong opinion anyway, so if you really prefer that way I can adapt.
Please let me know.

Thanks!

Paolo



* Re: [RFC PATCH 2/3] net: gro: minor optimization for dev_gro_receive()
  2022-02-02 10:09         ` Paolo Abeni
@ 2022-02-02 12:08           ` Alexander Lobakin
  2022-02-07  2:53             ` Kees Cook
  0 siblings, 1 reply; 10+ messages in thread
From: Alexander Lobakin @ 2022-02-02 12:08 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Alexander Lobakin, netdev, Eric Dumazet, Kees Cook, Gustavo A. R. Silva

From: Paolo Abeni <pabeni@redhat.com>
Date: Wed, 02 Feb 2022 11:09:41 +0100

> Hello,

Hi!

> 
> On Tue, 2022-01-18 at 18:39 +0100, Alexander Lobakin wrote:
> > From: Paolo Abeni <pabeni@redhat.com>
> > Date: Tue, 18 Jan 2022 17:31:00 +0100
> > 
> > > On Tue, 2022-01-18 at 16:56 +0100, Alexander Lobakin wrote:
> > > > From: Paolo Abeni <pabeni@redhat.com>
> > > > Date: Tue, 18 Jan 2022 16:24:19 +0100
> > > > 
> > > > > While inspecting some perf report, I noticed that the compiler
> > > > > emits suboptimal code for the napi CB initialization, fetching
> > > > > and storing multiple times the memory for flags bitfield.
> > > > > This is with gcc 10.3.1, but I observed the same with older compiler
> > > > > versions.
> > > > > 
> > > > > We can help the compiler to do a nicer work e.g. initially setting
> > > > > all the bitfield to 0 using an u16 alias. The generated code is quite
> > > > > smaller, with the same number of conditional
> > > > > 
> > > > > Before:
> > > > > objdump -t net/core/gro.o | grep " F .text"
> > > > > 0000000000000bb0 l     F .text	0000000000000357 dev_gro_receive
> > > > > 
> > > > > After:
> > > > > 0000000000000bb0 l     F .text	000000000000033c dev_gro_receive
> > > > > 
> > > > > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > > > > ---
> > > > >  include/net/gro.h | 13 +++++++++----
> > > > >  net/core/gro.c    | 16 +++++-----------
> > > > >  2 files changed, 14 insertions(+), 15 deletions(-)
> > > > > 
> > > > > diff --git a/include/net/gro.h b/include/net/gro.h
> > > > > index 8f75802d50fd..a068b27d341f 100644
> > > > > --- a/include/net/gro.h
> > > > > +++ b/include/net/gro.h
> > > > > @@ -29,14 +29,17 @@ struct napi_gro_cb {
> > > > >  	/* Number of segments aggregated. */
> > > > >  	u16	count;
> > > > >  
> > > > > -	/* Start offset for remote checksum offload */
> > > > > -	u16	gro_remcsum_start;
> > > > > +	/* Used in ipv6_gro_receive() and foo-over-udp */
> > > > > +	u16	proto;
> > > > >  
> > > > >  	/* jiffies when first packet was created/queued */
> > > > >  	unsigned long age;
> > > > >  
> > > > > -	/* Used in ipv6_gro_receive() and foo-over-udp */
> > > > > -	u16	proto;
> > > > > +	/* portion of the cb set to zero at every gro iteration */
> > > > > +	u32	zeroed_start[0];
> > > > > +
> > > > > +	/* Start offset for remote checksum offload */
> > > > > +	u16	gro_remcsum_start;
> > > > >  
> > > > >  	/* This is non-zero if the packet may be of the same flow. */
> > > > >  	u8	same_flow:1;
> > > > > @@ -70,6 +73,8 @@ struct napi_gro_cb {
> > > > >  	/* GRO is done by frag_list pointer chaining. */
> > > > >  	u8	is_flist:1;
> > > > >  
> > > > > +	u32	zeroed_end[0];
> > > > 
> > > > This should be wrapped in struct_group() I believe, or compilers
> > > > will start complaining soon. See [0] for the details.
> > > > Adding Kees to the CCs.
> > > 
> > > Thank you for the reference. That really slipped my mind.
> > > 
> > > This patch does not use memcpy() or similar, just a single direct
> > > assignment. Would that still require struct_group()?
> > 
> > Oof, sorry, I saw start/end and overlooked that it's only for
> > a single assignment.
> > Then it shouldn't cause warnings, but maybe use an anonymous
> > union instead?
> > 
> > 	union {
> > 		u32 zeroed;
> > 		struct {
> > 			u16 gro_remcsum_start;
> > 			...
> > 		};
> > 	};
> > 	__wsum csum;
> > 
> > You can still use a BUILD_BUG_ON() in this case, like
> > sizeof(zeroed) != offsetof(csum) - offsetof(zeroed).
> 
> Please forgive me for the very long delay. I'm looking again at this
> stuff for formal non-rfc submission.

Sure, not a problem at all (:

> 
> I like the anonymous union less: it moves around much more code, making
> the patch less readable, and it is more fragile, e.g. some comment like
> "please don't move 'csum' around" would be needed.

We still need comments around zeroed_{start,end}[0] for now.
I used offsetof(csum) because offsetofend(is_flist), which I'd prefer
here, unfortunately expands to offsetof(is_flist) + sizeof(is_flist),
and sizeof() cannot be applied to a bitfield, so it fails to compile.

> 
> No strong opinion anyway, so if you really prefer that way I can adapt.
> Please let me know.

I don't really have a strong preference here; I just suspect that the
zero-length arrays will produce (or already produce) -Warray-bounds
warnings, and empty-struct constructs like

	struct { } zeroed_start;
	u16 gro_remcsum_start;
	...
	struct { } zeroed_end;

	memset(&NAPI_GRO_CB(skb)->zeroed_start, 0,
	       offsetofend(zeroed_end) - offsetof(zeroed_start));

will trigger Fortify compile-time errors from Kees' KSPP tree.

I think we could use

	__struct_group(/* no tag */, zeroed, /* no attrs */,
		u16 gro_remcsum_start;
		...
		u8 is_flist:1;
	);
	__wsum csum;

	BUILD_BUG_ON(sizeof_field(struct napi_gro_cb, zeroed) != sizeof(u32));
	BUILD_BUG_ON(!IS_ALIGNED(offsetof(struct napi_gro_cb, zeroed),
				 sizeof(u32))); /* Avoid slow unaligned acc */

	*(u32 *)&NAPI_GRO_CB(skb)->zeroed = 0;

This doesn't depend on `csum`, doesn't need `struct { }` or
`struct zero[0]` markers and still uses a direct assignment.

Also adding Gustavo, maybe he'd like to leave a comment here.

> 
> Thanks!
> 
> Paolo

Thanks,
Al

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 2/3] net: gro: minor optimization for dev_gro_receive()
  2022-02-02 12:08           ` Alexander Lobakin
@ 2022-02-07  2:53             ` Kees Cook
  0 siblings, 0 replies; 10+ messages in thread
From: Kees Cook @ 2022-02-07  2:53 UTC (permalink / raw)
  To: Alexander Lobakin; +Cc: Paolo Abeni, netdev, Eric Dumazet, Gustavo A. R. Silva

On Wed, Feb 02, 2022 at 01:08:27PM +0100, Alexander Lobakin wrote:
> From: Paolo Abeni <pabeni@redhat.com>
> Date: Wed, 02 Feb 2022 11:09:41 +0100
> 
> > Hello,
> 
> Hi!
> 
> > 
> > On Tue, 2022-01-18 at 18:39 +0100, Alexander Lobakin wrote:
> > > From: Paolo Abeni <pabeni@redhat.com>
> > > Date: Tue, 18 Jan 2022 17:31:00 +0100
> > > 
> > > > On Tue, 2022-01-18 at 16:56 +0100, Alexander Lobakin wrote:
> > > > > From: Paolo Abeni <pabeni@redhat.com>
> > > > > Date: Tue, 18 Jan 2022 16:24:19 +0100
> > > > > 
> > > > > > While inspecting some perf reports, I noticed that the compiler
> > > > > > emits suboptimal code for the napi CB initialization, fetching
> > > > > > and storing the flags bitfield memory multiple times. This is
> > > > > > with gcc 10.3.1, but I observed the same with older compiler
> > > > > > versions.
> > > > > > 
> > > > > > We can help the compiler do a nicer job, e.g. by initially
> > > > > > setting all the bitfields to 0 via a u32 alias. The generated
> > > > > > code is a bit smaller, with the same number of conditionals.
> > > > > > 
> > > > > > Before:
> > > > > > objdump -t net/core/gro.o | grep " F .text"
> > > > > > 0000000000000bb0 l     F .text	0000000000000357 dev_gro_receive
> > > > > > 
> > > > > > After:
> > > > > > 0000000000000bb0 l     F .text	000000000000033c dev_gro_receive
> > > > > > 
> > > > > > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > > > > > ---
> > > > > >  include/net/gro.h | 13 +++++++++----
> > > > > >  net/core/gro.c    | 16 +++++-----------
> > > > > >  2 files changed, 14 insertions(+), 15 deletions(-)
> > > > > > 
> > > > > > diff --git a/include/net/gro.h b/include/net/gro.h
> > > > > > index 8f75802d50fd..a068b27d341f 100644
> > > > > > --- a/include/net/gro.h
> > > > > > +++ b/include/net/gro.h
> > > > > > @@ -29,14 +29,17 @@ struct napi_gro_cb {
> > > > > >  	/* Number of segments aggregated. */
> > > > > >  	u16	count;
> > > > > >  
> > > > > > -	/* Start offset for remote checksum offload */
> > > > > > -	u16	gro_remcsum_start;
> > > > > > +	/* Used in ipv6_gro_receive() and foo-over-udp */
> > > > > > +	u16	proto;
> > > > > >  
> > > > > >  	/* jiffies when first packet was created/queued */
> > > > > >  	unsigned long age;
> > > > > >  
> > > > > > -	/* Used in ipv6_gro_receive() and foo-over-udp */
> > > > > > -	u16	proto;
> > > > > > +	/* portion of the cb set to zero at every gro iteration */
> > > > > > +	u32	zeroed_start[0];
> > > > > > +
> > > > > > +	/* Start offset for remote checksum offload */
> > > > > > +	u16	gro_remcsum_start;
> > > > > >  
> > > > > >  	/* This is non-zero if the packet may be of the same flow. */
> > > > > >  	u8	same_flow:1;
> > > > > > @@ -70,6 +73,8 @@ struct napi_gro_cb {
> > > > > >  	/* GRO is done by frag_list pointer chaining. */
> > > > > >  	u8	is_flist:1;
> > > > > >  
> > > > > > +	u32	zeroed_end[0];
> > > > > 
> > > > > This should be wrapped in struct_group() I believe, or compilers
> > > > > will start complaining soon. See [0] for the details.
> > > > > Adding Kees to the CCs.

Hi! Sorry I missed the original email sent my way. :)

> > > > 
> > > > Thank you for the reference. That really slipped my mind.
> > > > 
> > > > This patch does not use memcpy() or similar, just a single direct
> > > > assignment. Would that still require struct_group()?
> > > 
> > > Oof, sorry, I saw start/end and overlooked that it's only for
> > > a single assignment.
> > > Then it shouldn't cause warnings, but maybe use an anonymous
> > > union instead?
> > > 
> > > 	union {
> > > 		u32 zeroed;
> > > 		struct {
> > > 			u16 gro_remcsum_start;
> > > 			...
> > > 		};
> > > 	};
> > > 	__wsum csum;
> > > 
> > > You can still use a BUILD_BUG_ON() in this case, like
> > > sizeof(zeroed) != offsetof(csum) - offsetof(zeroed).
> > 
> > Please forgive me for the very long delay. I'm looking again at this
> > stuff for formal non-rfc submission.
> 
> Sure, not a problem at all (:
> 
> > 
> > I like the anonymous union less: it moves around much more code, making
> > the patch less readable, and it is more fragile, e.g. some comment like
> > "please don't move 'csum' around" would be needed.
> 
> We still need comments around zeroed_{start,end}[0] for now.
> I used offsetof(csum) because offsetofend(is_flist), which I'd prefer
> here, unfortunately expands to offsetof(is_flist) + sizeof(is_flist),
> and sizeof() cannot be applied to a bitfield, so it fails to compile.
> 
> > 
> > No strong opinion anyway, so if you really prefer that way I can adapt.
> > Please let me know.
> 
> I don't really have a strong preference here; I just suspect that the
> zero-length arrays will produce (or already produce) -Warray-bounds
> warnings, and empty-struct constructs like

Yes, -Warray-bounds would yell very loudly about a 0-element array
having its first element assigned. :)

> 
> 	struct { } zeroed_start;
> 	u16 gro_remcsum_start;
> 	...
> 	struct { } zeroed_end;
> 
> 	memset(&NAPI_GRO_CB(skb)->zeroed_start, 0,
> 	       offsetofend(zeroed_end) - offsetof(zeroed_start));
> 
> will trigger Fortify compile-time errors from Kees' KSPP tree.
> 
> I think we could use
> 
> 	__struct_group(/* no tag */, zeroed, /* no attrs */,

Since this isn't UAPI, you can just do:

	struct_group(zeroed,

> 		u16 gro_remcsum_start;
> 		...
> 		u8 is_flist:1;
> 	);
> 	__wsum csum;
> 
> 	BUILD_BUG_ON(sizeof_field(struct napi_gro_cb, zeroed) != sizeof(u32));
> 	BUILD_BUG_ON(!IS_ALIGNED(offsetof(struct napi_gro_cb, zeroed),
> 				 sizeof(u32))); /* Avoid slow unaligned acc */
> 
> 	*(u32 *)&NAPI_GRO_CB(skb)->zeroed = 0;
> 
> This doesn't depend on `csum`, doesn't need `struct { }` or
> `struct zero[0]` markers and still uses a direct assignment.

Given the kernel's -O2 use, a constant-sized u32 memset() and a direct
assignment will resolve to the same machine code:

https://godbolt.org/z/b9qKexx6q

void zero_var(struct object *p)
{
    p->zero_var = 0;
}

void zero_group(struct object *p)
{
    memset(&p->zero_group, 0, sizeof(p->zero_group));
}

zero_var:
  movl $0, (%rdi)
  ret
zero_group:
  movl $0, (%rdi)
  ret

And assigning them individually results in careful "and"ing to avoid the
padding:

zero_each:
  andl $-134217728, (%rdi)
  ret

I would have recommended struct_group() + memset() here, just because
you don't need the casts. But since you're doing BUILD_BUG_ON() size
verification, it's fine either way.

If you want to avoid the cast and avoid the memset(), you could use the
earlier suggestion of anonymous union and anonymous struct. It's
basically an open-coded struct_group, but you get to pick the type of
the overlapping variable. :)

-Kees

> Also adding Gustavo, maybe he'd like to leave a comment here.
> 
> > 
> > Thanks!
> > 
> > Paolo
> 
> Thanks,
> Al

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2022-02-07  5:49 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-18 15:24 [RFC PATCH 0/3] gro: some minor optimizations Paolo Abeni
2022-01-18 15:24 ` [RFC PATCH 1/3] net: gro: avoid re-computing truesize twice on recycle Paolo Abeni
2022-01-18 15:24 ` [RFC PATCH 2/3] net: gro: minor optimization for dev_gro_receive() Paolo Abeni
2022-01-18 15:56   ` Alexander Lobakin
2022-01-18 16:31     ` Paolo Abeni
2022-01-18 17:39       ` Alexander Lobakin
2022-02-02 10:09         ` Paolo Abeni
2022-02-02 12:08           ` Alexander Lobakin
2022-02-07  2:53             ` Kees Cook
2022-01-18 15:24 ` [RFC PATCH 3/3] net: gro: register gso and gro offload on separate lists Paolo Abeni
