* [PATCH v3 net-next] net: introduce a config option to tweak MAX_SKB_FRAGS
@ 2023-03-23 16:28 Eric Dumazet
  2023-03-24  1:32 ` Yunsheng Lin
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Eric Dumazet @ 2023-03-23 16:28 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Eric Dumazet, Eric Dumazet, Nikolay Aleksandrov

From: Eric Dumazet <edumazet@google.com>

Currently, MAX_SKB_FRAGS value is 17.

For standard tcp sendmsg() traffic, no big deal because tcp_sendmsg()
attempts order-3 allocations, stuffing 32768 bytes per frag.

But with zero copy, we use order-0 pages.

For BIG TCP to show its full potential, we add a config option
to be able to fit up to 45 segments per skb.
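
As rough arithmetic (editorial note, not part of the patch, assuming 4K pages):

  order-3 sendmsg : 17 frags * 32768 bytes ~= 544 KB of payload per skb
  order-0 zerocopy: 17 frags *  4096 bytes ~=  68 KB of payload per skb
  order-0 zerocopy: 45 frags *  4096 bytes ~= 180 KB of payload per skb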

This is also needed for BIG TCP rx zerocopy, as zerocopy currently
does not support skbs with frag list.

We have used MAX_SKB_FRAGS=45 value for years at Google before
we deployed 4K MTU, with no adverse effect, other than
a recent issue in mlx4, fixed in commit 26782aad00cc
("net/mlx4: MLX4_TX_BOUNCE_BUFFER_SIZE depends on MAX_SKB_FRAGS")

Back then, goal was to be able to receive full size (64KB) GRO
packets without the frag_list overhead.

Note that /proc/sys/net/core/max_skb_frags can also be used to limit
the number of fragments TCP can use in tx packets.

By default we keep the old/legacy value of 17 until we get
more coverage for the updated values.

Sizes of struct skb_shared_info on 64bit arches

MAX_SKB_FRAGS | sizeof(struct skb_shared_info):
==============================================
         17     320
         21     320+64  = 384
         25     320+128 = 448
         29     320+192 = 512
         33     320+256 = 576
         37     320+320 = 640
         41     320+384 = 704
         45     320+448 = 768
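
(Editorial note, not part of the patch: the 64-byte steps above follow
from sizeof(skb_frag_t) being 16 bytes on 64bit arches, i.e. roughly

  sizeof(struct skb_shared_info) ~= 320 + (MAX_SKB_FRAGS - 17) * 16

so each group of 4 extra frags costs one additional cacheline.)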

This inflation might cause problems for drivers assuming they could pack
both the incoming packet (for MTU=1500) and skb_shared_info in half a page,
using build_skb().
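
A minimal sketch of the budget such drivers rely on (illustrative only,
not code from this patch; the helper name rx_frag_fits_half_page is
made up):

#include <linux/skbuff.h>
#include <linux/if_ether.h>
#include <linux/if_vlan.h>

/* Can a 1500-MTU frame plus skb_shared_info still share half a 4K page,
 * as many build_skb() based rx paths assume?
 */
static bool rx_frag_fits_half_page(unsigned int mtu)
{
	unsigned int need;

	need = SKB_DATA_ALIGN(NET_SKB_PAD + ETH_HLEN + VLAN_HLEN + mtu) +
	       SKB_DATA_ALIGN(sizeof(struct skb_shared_info));

	return need <= PAGE_SIZE / 2;	/* 2048 bytes with 4K pages */
}

With MAX_SKB_FRAGS=17 the skb_shared_info part is 320 bytes and a
1500-byte frame fits; at 45 frags it grows to 768 bytes, leaving only
1280 bytes in half a page, so such drivers have to move to full-page
(or larger) buffers.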

v3: fix build error when CONFIG_NET=n
v2: fix two build errors assuming MAX_SKB_FRAGS was "unsigned long"

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
---
 drivers/scsi/cxgbi/libcxgbi.c |  4 ++--
 include/linux/skbuff.h        | 16 +++++-----------
 net/Kconfig                   | 12 ++++++++++++
 net/packet/af_packet.c        |  4 ++--
 4 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/drivers/scsi/cxgbi/libcxgbi.c b/drivers/scsi/cxgbi/libcxgbi.c
index af281e271f886041b397ea881e2ce7be00eff625..3e1de4c842cc6102e25a5972d6b11e05c3e4c060 100644
--- a/drivers/scsi/cxgbi/libcxgbi.c
+++ b/drivers/scsi/cxgbi/libcxgbi.c
@@ -2314,9 +2314,9 @@ static int cxgbi_sock_tx_queue_up(struct cxgbi_sock *csk, struct sk_buff *skb)
 		frags++;
 
 	if (frags >= SKB_WR_LIST_SIZE) {
-		pr_err("csk 0x%p, frags %u, %u,%u >%lu.\n",
+		pr_err("csk 0x%p, frags %u, %u,%u >%u.\n",
 		       csk, skb_shinfo(skb)->nr_frags, skb->len,
-		       skb->data_len, SKB_WR_LIST_SIZE);
+		       skb->data_len, (unsigned int)SKB_WR_LIST_SIZE);
 		return -EINVAL;
 	}
 
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index fe661011644b8f468ff5e92075a6624f0557584c..82511b2f61ea2bc5d587b58f0901e50e64729e4f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -345,18 +345,12 @@ struct sk_buff_head {
 
 struct sk_buff;
 
-/* To allow 64K frame to be packed as single skb without frag_list we
- * require 64K/PAGE_SIZE pages plus 1 additional page to allow for
- * buffers which do not start on a page boundary.
- *
- * Since GRO uses frags we allocate at least 16 regardless of page
- * size.
- */
-#if (65536/PAGE_SIZE + 1) < 16
-#define MAX_SKB_FRAGS 16UL
-#else
-#define MAX_SKB_FRAGS (65536/PAGE_SIZE + 1)
+#ifndef CONFIG_MAX_SKB_FRAGS
+# define CONFIG_MAX_SKB_FRAGS 17
 #endif
+
+#define MAX_SKB_FRAGS CONFIG_MAX_SKB_FRAGS
+
 extern int sysctl_max_skb_frags;
 
 /* Set skb_shinfo(skb)->gso_size to this in case you want skb_segment to
diff --git a/net/Kconfig b/net/Kconfig
index 48c33c2221999e575c83a409ab773b9cc3656eab..f806722bccf450c62e07bfdb245e5195ac4a156d 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -251,6 +251,18 @@ config PCPU_DEV_REFCNT
 	  network device refcount are using per cpu variables if this option is set.
 	  This can be forced to N to detect underflows (with a performance drop).
 
+config MAX_SKB_FRAGS
+	int "Maximum number of fragments per skb_shared_info"
+	range 17 45
+	default 17
+	help
+	  Having more fragments per skb_shared_info can help GRO efficiency.
+	  This helps BIG TCP workloads, but might expose bugs in some
+	  legacy drivers.
+	  This also increases memory overhead of small packets,
+	  and in drivers using build_skb().
+	  If unsure, say 17.
+
 config RPS
 	bool
 	depends on SMP && SYSFS
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 497193f73030c385a2d33b71dfbc299fbf9b763d..568f8d76e3c124f3b322a8d88dc3dcfbc45e7c0e 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2622,8 +2622,8 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
 		nr_frags = skb_shinfo(skb)->nr_frags;
 
 		if (unlikely(nr_frags >= MAX_SKB_FRAGS)) {
-			pr_err("Packet exceed the number of skb frags(%lu)\n",
-			       MAX_SKB_FRAGS);
+			pr_err("Packet exceed the number of skb frags(%u)\n",
+			       (unsigned int)MAX_SKB_FRAGS);
 			return -EFAULT;
 		}
 
-- 
2.40.0.rc1.284.g88254d51c5-goog



* Re: [PATCH v3 net-next] net: introduce a config option to tweak MAX_SKB_FRAGS
  2023-03-23 16:28 [PATCH v3 net-next] net: introduce a config option to tweak MAX_SKB_FRAGS Eric Dumazet
@ 2023-03-24  1:32 ` Yunsheng Lin
  2023-03-24  1:40   ` Eric Dumazet
  2023-03-27  7:18 ` Jason Xing
  2023-03-28  2:50 ` patchwork-bot+netdevbpf
  2 siblings, 1 reply; 6+ messages in thread
From: Yunsheng Lin @ 2023-03-24  1:32 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Eric Dumazet, Nikolay Aleksandrov

On 2023/3/24 0:28, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Currently, MAX_SKB_FRAGS value is 17.
> 
> For standard tcp sendmsg() traffic, no big deal because tcp_sendmsg()
> attempts order-3 allocations, stuffing 32768 bytes per frag.
> 
> But with zero copy, we use order-0 pages.
> 
> For BIG TCP to show its full potential, we add a config option
> to be able to fit up to 45 segments per skb.
> 
> This is also needed for BIG TCP rx zerocopy, as zerocopy currently
> does not support skbs with frag list.

Just out of curiosity, is it possible to add support for skbs with a
frag list for zerocopy if the driver also supports transmitting skbs
with a frag list, i.e. with the NETIF_F_FRAGLIST feature on?

> 
> We have used MAX_SKB_FRAGS=45 value for years at Google before
> we deployed 4K MTU, with no adverse effect, other than
> a recent issue in mlx4, fixed in commit 26782aad00cc
> ("net/mlx4: MLX4_TX_BOUNCE_BUFFER_SIZE depends on MAX_SKB_FRAGS")
> 
> Back then, goal was to be able to receive full size (64KB) GRO
> packets without the frag_list overhead.
> 
> Note that /proc/sys/net/core/max_skb_frags can also be used to limit
> the number of fragments TCP can use in tx packets.
> 
> By default we keep the old/legacy value of 17 until we get
> more coverage for the updated values.
> 
> Sizes of struct skb_shared_info on 64bit arches
> 
> MAX_SKB_FRAGS | sizeof(struct skb_shared_info):
> ==============================================
>          17     320
>          21     320+64  = 384
>          25     320+128 = 448
>          29     320+192 = 512
>          33     320+256 = 576
>          37     320+320 = 640
>          41     320+384 = 704
>          45     320+448 = 768
> 
> This inflation might cause problems for drivers assuming they could pack
> both the incoming packet (for MTU=1500) and skb_shared_info in half a page,
> using build_skb().
> 
> v3: fix build error when CONFIG_NET=n
> v2: fix two build errors assuming MAX_SKB_FRAGS was "unsigned long"
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
> ---
>  drivers/scsi/cxgbi/libcxgbi.c |  4 ++--
>  include/linux/skbuff.h        | 16 +++++-----------
>  net/Kconfig                   | 12 ++++++++++++
>  net/packet/af_packet.c        |  4 ++--
>  4 files changed, 21 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/scsi/cxgbi/libcxgbi.c b/drivers/scsi/cxgbi/libcxgbi.c
> index af281e271f886041b397ea881e2ce7be00eff625..3e1de4c842cc6102e25a5972d6b11e05c3e4c060 100644
> --- a/drivers/scsi/cxgbi/libcxgbi.c
> +++ b/drivers/scsi/cxgbi/libcxgbi.c
> @@ -2314,9 +2314,9 @@ static int cxgbi_sock_tx_queue_up(struct cxgbi_sock *csk, struct sk_buff *skb)
>  		frags++;
>  
>  	if (frags >= SKB_WR_LIST_SIZE) {
> -		pr_err("csk 0x%p, frags %u, %u,%u >%lu.\n",
> +		pr_err("csk 0x%p, frags %u, %u,%u >%u.\n",
>  		       csk, skb_shinfo(skb)->nr_frags, skb->len,
> -		       skb->data_len, SKB_WR_LIST_SIZE);
> +		       skb->data_len, (unsigned int)SKB_WR_LIST_SIZE);
>  		return -EINVAL;
>  	}
>  
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index fe661011644b8f468ff5e92075a6624f0557584c..82511b2f61ea2bc5d587b58f0901e50e64729e4f 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -345,18 +345,12 @@ struct sk_buff_head {
>  
>  struct sk_buff;
>  
> -/* To allow 64K frame to be packed as single skb without frag_list we
> - * require 64K/PAGE_SIZE pages plus 1 additional page to allow for
> - * buffers which do not start on a page boundary.
> - *
> - * Since GRO uses frags we allocate at least 16 regardless of page
> - * size.
> - */
> -#if (65536/PAGE_SIZE + 1) < 16
> -#define MAX_SKB_FRAGS 16UL
> -#else
> -#define MAX_SKB_FRAGS (65536/PAGE_SIZE + 1)
> +#ifndef CONFIG_MAX_SKB_FRAGS
> +# define CONFIG_MAX_SKB_FRAGS 17

There seems to be an extra space before 'define'.

Also, is there a reason not to keep the old logic below when
CONFIG_MAX_SKB_FRAGS is not defined?

#if (65536/PAGE_SIZE + 1) < 16
#define MAX_SKB_FRAGS 16UL
#else
#define MAX_SKB_FRAGS (65536/PAGE_SIZE + 1)
#endif

It seems that with the old logic:
1. for a kernel with 4K page size, MAX_SKB_FRAGS is 17.
2. for a kernel with 64K page size, MAX_SKB_FRAGS is 16.

>  #endif
> +
> +#define MAX_SKB_FRAGS CONFIG_MAX_SKB_FRAGS
> +
>  extern int sysctl_max_skb_frags;
>  
>  /* Set skb_shinfo(skb)->gso_size to this in case you want skb_segment to
> diff --git a/net/Kconfig b/net/Kconfig
> index 48c33c2221999e575c83a409ab773b9cc3656eab..f806722bccf450c62e07bfdb245e5195ac4a156d 100644
> --- a/net/Kconfig
> +++ b/net/Kconfig
> @@ -251,6 +251,18 @@ config PCPU_DEV_REFCNT
>  	  network device refcount are using per cpu variables if this option is set.
>  	  This can be forced to N to detect underflows (with a performance drop).
>  
> +config MAX_SKB_FRAGS
> +	int "Maximum number of fragments per skb_shared_info"
> +	range 17 45
> +	default 17
> +	help
> +	  Having more fragments per skb_shared_info can help GRO efficiency.
> +	  This helps BIG TCP workloads, but might expose bugs in some
> +	  legacy drivers.
> +	  This also increases memory overhead of small packets,
> +	  and in drivers using build_skb().
> +	  If unsure, say 17.
> +
>  config RPS
>  	bool
>  	depends on SMP && SYSFS
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index 497193f73030c385a2d33b71dfbc299fbf9b763d..568f8d76e3c124f3b322a8d88dc3dcfbc45e7c0e 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -2622,8 +2622,8 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
>  		nr_frags = skb_shinfo(skb)->nr_frags;
>  
>  		if (unlikely(nr_frags >= MAX_SKB_FRAGS)) {
> -			pr_err("Packet exceed the number of skb frags(%lu)\n",
> -			       MAX_SKB_FRAGS);
> +			pr_err("Packet exceed the number of skb frags(%u)\n",
> +			       (unsigned int)MAX_SKB_FRAGS);
>  			return -EFAULT;
>  		}
>  
> 


* Re: [PATCH v3 net-next] net: introduce a config option to tweak MAX_SKB_FRAGS
  2023-03-24  1:32 ` Yunsheng Lin
@ 2023-03-24  1:40   ` Eric Dumazet
  2023-03-24  1:53     ` Yunsheng Lin
  0 siblings, 1 reply; 6+ messages in thread
From: Eric Dumazet @ 2023-03-24  1:40 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni,
	netdev, Nikolay Aleksandrov

On Thu, Mar 23, 2023 at 6:32 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2023/3/24 0:28, Eric Dumazet wrote:
> > From: Eric Dumazet <edumazet@google.com>
> >
> > Currently, MAX_SKB_FRAGS value is 17.
> >
> > For standard tcp sendmsg() traffic, no big deal because tcp_sendmsg()
> > attempts order-3 allocations, stuffing 32768 bytes per frag.
> >
> > But with zero copy, we use order-0 pages.
> >
> > For BIG TCP to show its full potential, we add a config option
> > to be able to fit up to 45 segments per skb.
> >
> > This is also needed for BIG TCP rx zerocopy, as zerocopy currently
> > does not support skbs with frag list.
>
> Just out of curiosity, is it possible to add support for skbs with a
> frag list for zerocopy if the driver also supports transmitting skbs
> with a frag list, i.e. with the NETIF_F_FRAGLIST feature on?

We are talking about rx zerocopy; look at net/ipv4/tcp.c (this is not
tied to NETIF_F_FRAGLIST support, because packets land in a TCP
receive queue).

>
> >
> > We have used MAX_SKB_FRAGS=45 value for years at Google before
> > we deployed 4K MTU, with no adverse effect, other than
> > a recent issue in mlx4, fixed in commit 26782aad00cc
> > ("net/mlx4: MLX4_TX_BOUNCE_BUFFER_SIZE depends on MAX_SKB_FRAGS")
> >
> > Back then, goal was to be able to receive full size (64KB) GRO
> > packets without the frag_list overhead.
> >
> > Note that /proc/sys/net/core/max_skb_frags can also be used to limit
> > the number of fragments TCP can use in tx packets.
> >
> > By default we keep the old/legacy value of 17 until we get
> > more coverage for the updated values.
> >
> > Sizes of struct skb_shared_info on 64bit arches
> >
> > MAX_SKB_FRAGS | sizeof(struct skb_shared_info):
> > ==============================================
> >          17     320
> >          21     320+64  = 384
> >          25     320+128 = 448
> >          29     320+192 = 512
> >          33     320+256 = 576
> >          37     320+320 = 640
> >          41     320+384 = 704
> >          45     320+448 = 768
> >
> > This inflation might cause problems for drivers assuming they could pack
> > both the incoming packet (for MTU=1500) and skb_shared_info in half a page,
> > using build_skb().
> >
> > v3: fix build error when CONFIG_NET=n
> > v2: fix two build errors assuming MAX_SKB_FRAGS was "unsigned long"
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
> > ---
> >  drivers/scsi/cxgbi/libcxgbi.c |  4 ++--
> >  include/linux/skbuff.h        | 16 +++++-----------
> >  net/Kconfig                   | 12 ++++++++++++
> >  net/packet/af_packet.c        |  4 ++--
> >  4 files changed, 21 insertions(+), 15 deletions(-)
> >
> > diff --git a/drivers/scsi/cxgbi/libcxgbi.c b/drivers/scsi/cxgbi/libcxgbi.c
> > index af281e271f886041b397ea881e2ce7be00eff625..3e1de4c842cc6102e25a5972d6b11e05c3e4c060 100644
> > --- a/drivers/scsi/cxgbi/libcxgbi.c
> > +++ b/drivers/scsi/cxgbi/libcxgbi.c
> > @@ -2314,9 +2314,9 @@ static int cxgbi_sock_tx_queue_up(struct cxgbi_sock *csk, struct sk_buff *skb)
> >               frags++;
> >
> >       if (frags >= SKB_WR_LIST_SIZE) {
> > -             pr_err("csk 0x%p, frags %u, %u,%u >%lu.\n",
> > +             pr_err("csk 0x%p, frags %u, %u,%u >%u.\n",
> >                      csk, skb_shinfo(skb)->nr_frags, skb->len,
> > -                    skb->data_len, SKB_WR_LIST_SIZE);
> > +                    skb->data_len, (unsigned int)SKB_WR_LIST_SIZE);
> >               return -EINVAL;
> >       }
> >
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index fe661011644b8f468ff5e92075a6624f0557584c..82511b2f61ea2bc5d587b58f0901e50e64729e4f 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -345,18 +345,12 @@ struct sk_buff_head {
> >
> >  struct sk_buff;
> >
> > -/* To allow 64K frame to be packed as single skb without frag_list we
> > - * require 64K/PAGE_SIZE pages plus 1 additional page to allow for
> > - * buffers which do not start on a page boundary.
> > - *
> > - * Since GRO uses frags we allocate at least 16 regardless of page
> > - * size.
> > - */
> > -#if (65536/PAGE_SIZE + 1) < 16
> > -#define MAX_SKB_FRAGS 16UL
> > -#else
> > -#define MAX_SKB_FRAGS (65536/PAGE_SIZE + 1)
> > +#ifndef CONFIG_MAX_SKB_FRAGS
> > +# define CONFIG_MAX_SKB_FRAGS 17
>
> There seems to be an extra space before 'define'.

This is indentation. Pretty standard I would say.

#if xxxxx
# define ....
#else
# define ....
#endif


>
> Also, is there a reason not to keep the old logic below when
> CONFIG_MAX_SKB_FRAGS is not defined?
>
> #if (65536/PAGE_SIZE + 1) < 16
> #define MAX_SKB_FRAGS 16UL
> #else
> #define MAX_SKB_FRAGS (65536/PAGE_SIZE + 1)
> #endif
>
> It seems that with the old logic:
> 1. for a kernel with 4K page size, MAX_SKB_FRAGS is 17.
> 2. for a kernel with 64K page size, MAX_SKB_FRAGS is 16.

This is for CONFIG_NET=n configs.

I am pretty sure nobody would care about having 17 or 16 frags per skb
for such a silly config.

Let's not confuse readers.

>
> >  #endif
> > +
> > +#define MAX_SKB_FRAGS CONFIG_MAX_SKB_FRAGS
> > +
> >  extern int sysctl_max_skb_frags;
> >
> >  /* Set skb_shinfo(skb)->gso_size to this in case you want skb_segment to
> > diff --git a/net/Kconfig b/net/Kconfig
> > index 48c33c2221999e575c83a409ab773b9cc3656eab..f806722bccf450c62e07bfdb245e5195ac4a156d 100644
> > --- a/net/Kconfig
> > +++ b/net/Kconfig
> > @@ -251,6 +251,18 @@ config PCPU_DEV_REFCNT
> >         network device refcount are using per cpu variables if this option is set.
> >         This can be forced to N to detect underflows (with a performance drop).
> >
> > +config MAX_SKB_FRAGS
> > +     int "Maximum number of fragments per skb_shared_info"
> > +     range 17 45
> > +     default 17
> > +     help
> > +       Having more fragments per skb_shared_info can help GRO efficiency.
> > +       This helps BIG TCP workloads, but might expose bugs in some
> > +       legacy drivers.
> > +       This also increases memory overhead of small packets,
> > +       and in drivers using build_skb().
> > +       If unsure, say 17.
> > +
> >  config RPS
> >       bool
> >       depends on SMP && SYSFS
> > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> > index 497193f73030c385a2d33b71dfbc299fbf9b763d..568f8d76e3c124f3b322a8d88dc3dcfbc45e7c0e 100644
> > --- a/net/packet/af_packet.c
> > +++ b/net/packet/af_packet.c
> > @@ -2622,8 +2622,8 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
> >               nr_frags = skb_shinfo(skb)->nr_frags;
> >
> >               if (unlikely(nr_frags >= MAX_SKB_FRAGS)) {
> > -                     pr_err("Packet exceed the number of skb frags(%lu)\n",
> > -                            MAX_SKB_FRAGS);
> > +                     pr_err("Packet exceed the number of skb frags(%u)\n",
> > +                            (unsigned int)MAX_SKB_FRAGS);
> >                       return -EFAULT;
> >               }
> >
> >


* Re: [PATCH v3 net-next] net: introduce a config option to tweak MAX_SKB_FRAGS
  2023-03-24  1:40   ` Eric Dumazet
@ 2023-03-24  1:53     ` Yunsheng Lin
  0 siblings, 0 replies; 6+ messages in thread
From: Yunsheng Lin @ 2023-03-24  1:53 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni,
	netdev, Nikolay Aleksandrov

On 2023/3/24 9:40, Eric Dumazet wrote:
> On Thu, Mar 23, 2023 at 6:32 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> On 2023/3/24 0:28, Eric Dumazet wrote:
>>> From: Eric Dumazet <edumazet@google.com>
>>>
>>> Currently, MAX_SKB_FRAGS value is 17.
>>>
>>> For standard tcp sendmsg() traffic, no big deal because tcp_sendmsg()
>>> attempts order-3 allocations, stuffing 32768 bytes per frag.
>>>
>>> But with zero copy, we use order-0 pages.
>>>
>>> For BIG TCP to show its full potential, we add a config option
>>> to be able to fit up to 45 segments per skb.
>>>
>>> This is also needed for BIG TCP rx zerocopy, as zerocopy currently
>>> does not support skbs with frag list.
>>
>> Just out of curiosity, is it possible to add support for skbs with a
>> frag list for zerocopy if the driver also supports transmitting skbs
>> with a frag list, i.e. with the NETIF_F_FRAGLIST feature on?
> 
> We are talking about rx zerocopy; look at net/ipv4/tcp.c (this is not
> tied to NETIF_F_FRAGLIST support, because packets land in a TCP
> receive queue).

Sorry for the confusion.
So NETIF_F_FRAGLIST is only for tx?
What about the driver building an skb with a frag list on rx too, when
a packet needs more descs than MAX_SKB_FRAGS?

> 
>>
>>>
>>> We have used MAX_SKB_FRAGS=45 value for years at Google before
>>> we deployed 4K MTU, with no adverse effect, other than
>>> a recent issue in mlx4, fixed in commit 26782aad00cc
>>> ("net/mlx4: MLX4_TX_BOUNCE_BUFFER_SIZE depends on MAX_SKB_FRAGS")
>>>
>>> Back then, goal was to be able to receive full size (64KB) GRO
>>> packets without the frag_list overhead.
>>>
>>> Note that /proc/sys/net/core/max_skb_frags can also be used to limit
>>> the number of fragments TCP can use in tx packets.
>>>
>>> By default we keep the old/legacy value of 17 until we get
>>> more coverage for the updated values.
>>>
>>> Sizes of struct skb_shared_info on 64bit arches
>>>
>>> MAX_SKB_FRAGS | sizeof(struct skb_shared_info):
>>> ==============================================
>>>          17     320
>>>          21     320+64  = 384
>>>          25     320+128 = 448
>>>          29     320+192 = 512
>>>          33     320+256 = 576
>>>          37     320+320 = 640
>>>          41     320+384 = 704
>>>          45     320+448 = 768
>>>
>>> This inflation might cause problems for drivers assuming they could pack
>>> both the incoming packet (for MTU=1500) and skb_shared_info in half a page,
>>> using build_skb().
>>>
>>> v3: fix build error when CONFIG_NET=n
>>> v2: fix two build errors assuming MAX_SKB_FRAGS was "unsigned long"
>>>
>>> Signed-off-by: Eric Dumazet <edumazet@google.com>
>>> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
>>> ---
>>>  drivers/scsi/cxgbi/libcxgbi.c |  4 ++--
>>>  include/linux/skbuff.h        | 16 +++++-----------
>>>  net/Kconfig                   | 12 ++++++++++++
>>>  net/packet/af_packet.c        |  4 ++--
>>>  4 files changed, 21 insertions(+), 15 deletions(-)
>>>
>>> diff --git a/drivers/scsi/cxgbi/libcxgbi.c b/drivers/scsi/cxgbi/libcxgbi.c
>>> index af281e271f886041b397ea881e2ce7be00eff625..3e1de4c842cc6102e25a5972d6b11e05c3e4c060 100644
>>> --- a/drivers/scsi/cxgbi/libcxgbi.c
>>> +++ b/drivers/scsi/cxgbi/libcxgbi.c
>>> @@ -2314,9 +2314,9 @@ static int cxgbi_sock_tx_queue_up(struct cxgbi_sock *csk, struct sk_buff *skb)
>>>               frags++;
>>>
>>>       if (frags >= SKB_WR_LIST_SIZE) {
>>> -             pr_err("csk 0x%p, frags %u, %u,%u >%lu.\n",
>>> +             pr_err("csk 0x%p, frags %u, %u,%u >%u.\n",
>>>                      csk, skb_shinfo(skb)->nr_frags, skb->len,
>>> -                    skb->data_len, SKB_WR_LIST_SIZE);
>>> +                    skb->data_len, (unsigned int)SKB_WR_LIST_SIZE);
>>>               return -EINVAL;
>>>       }
>>>
>>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>>> index fe661011644b8f468ff5e92075a6624f0557584c..82511b2f61ea2bc5d587b58f0901e50e64729e4f 100644
>>> --- a/include/linux/skbuff.h
>>> +++ b/include/linux/skbuff.h
>>> @@ -345,18 +345,12 @@ struct sk_buff_head {
>>>
>>>  struct sk_buff;
>>>
>>> -/* To allow 64K frame to be packed as single skb without frag_list we
>>> - * require 64K/PAGE_SIZE pages plus 1 additional page to allow for
>>> - * buffers which do not start on a page boundary.
>>> - *
>>> - * Since GRO uses frags we allocate at least 16 regardless of page
>>> - * size.
>>> - */
>>> -#if (65536/PAGE_SIZE + 1) < 16
>>> -#define MAX_SKB_FRAGS 16UL
>>> -#else
>>> -#define MAX_SKB_FRAGS (65536/PAGE_SIZE + 1)
>>> +#ifndef CONFIG_MAX_SKB_FRAGS
>>> +# define CONFIG_MAX_SKB_FRAGS 17
>>
>> There seems to be an extra space before 'define'.
> 
> This is indentation. Pretty standard I would say.
> 
> #if xxxxx
> # define ....
> #else
> # define ....
> #endif
> 
> 
>>
>> Also, is there a reason not to keep the old logic below when
>> CONFIG_MAX_SKB_FRAGS is not defined?
>>
>> #if (65536/PAGE_SIZE + 1) < 16
>> #define MAX_SKB_FRAGS 16UL
>> #else
>> #define MAX_SKB_FRAGS (65536/PAGE_SIZE + 1)
>> #endif
>>
>> It seems that with the old logic:
>> 1. for a kernel with 4K page size, MAX_SKB_FRAGS is 17.
>> 2. for a kernel with 64K page size, MAX_SKB_FRAGS is 16.
> 
> This is for CONFIG_NET=n configs.
> 
> I am pretty sure nobody would care about having 17 or 16 frags per skb
> for such a silly config.
> 
> Let's not confuse readers.
> 
>>
>>>  #endif
>>> +
>>> +#define MAX_SKB_FRAGS CONFIG_MAX_SKB_FRAGS
>>> +
>>>  extern int sysctl_max_skb_frags;
>>>
>>>  /* Set skb_shinfo(skb)->gso_size to this in case you want skb_segment to
>>> diff --git a/net/Kconfig b/net/Kconfig
>>> index 48c33c2221999e575c83a409ab773b9cc3656eab..f806722bccf450c62e07bfdb245e5195ac4a156d 100644
>>> --- a/net/Kconfig
>>> +++ b/net/Kconfig
>>> @@ -251,6 +251,18 @@ config PCPU_DEV_REFCNT
>>>         network device refcount are using per cpu variables if this option is set.
>>>         This can be forced to N to detect underflows (with a performance drop).
>>>
>>> +config MAX_SKB_FRAGS
>>> +     int "Maximum number of fragments per skb_shared_info"
>>> +     range 17 45
>>> +     default 17
>>> +     help
>>> +       Having more fragments per skb_shared_info can help GRO efficiency.
>>> +       This helps BIG TCP workloads, but might expose bugs in some
>>> +       legacy drivers.
>>> +       This also increases memory overhead of small packets,
>>> +       and in drivers using build_skb().
>>> +       If unsure, say 17.
>>> +
>>>  config RPS
>>>       bool
>>>       depends on SMP && SYSFS
>>> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
>>> index 497193f73030c385a2d33b71dfbc299fbf9b763d..568f8d76e3c124f3b322a8d88dc3dcfbc45e7c0e 100644
>>> --- a/net/packet/af_packet.c
>>> +++ b/net/packet/af_packet.c
>>> @@ -2622,8 +2622,8 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
>>>               nr_frags = skb_shinfo(skb)->nr_frags;
>>>
>>>               if (unlikely(nr_frags >= MAX_SKB_FRAGS)) {
>>> -                     pr_err("Packet exceed the number of skb frags(%lu)\n",
>>> -                            MAX_SKB_FRAGS);
>>> +                     pr_err("Packet exceed the number of skb frags(%u)\n",
>>> +                            (unsigned int)MAX_SKB_FRAGS);
>>>                       return -EFAULT;
>>>               }
>>>
>>>
> .
> 


* Re: [PATCH v3 net-next] net: introduce a config option to tweak MAX_SKB_FRAGS
  2023-03-23 16:28 [PATCH v3 net-next] net: introduce a config option to tweak MAX_SKB_FRAGS Eric Dumazet
  2023-03-24  1:32 ` Yunsheng Lin
@ 2023-03-27  7:18 ` Jason Xing
  2023-03-28  2:50 ` patchwork-bot+netdevbpf
  2 siblings, 0 replies; 6+ messages in thread
From: Jason Xing @ 2023-03-27  7:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, netdev,
	Eric Dumazet, Nikolay Aleksandrov

On Fri, Mar 24, 2023 at 12:35 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> From: Eric Dumazet <edumazet@google.com>
>
> Currently, MAX_SKB_FRAGS value is 17.
>
> For standard tcp sendmsg() traffic, no big deal because tcp_sendmsg()
> attempts order-3 allocations, stuffing 32768 bytes per frag.
>
> But with zero copy, we use order-0 pages.
>
> For BIG TCP to show its full potential, we add a config option
> to be able to fit up to 45 segments per skb.
>
> This is also needed for BIG TCP rx zerocopy, as zerocopy currently
> does not support skbs with frag list.
>
> We have used MAX_SKB_FRAGS=45 value for years at Google before
> we deployed 4K MTU, with no adverse effect, other than
> a recent issue in mlx4, fixed in commit 26782aad00cc
> ("net/mlx4: MLX4_TX_BOUNCE_BUFFER_SIZE depends on MAX_SKB_FRAGS")
>
> Back then, goal was to be able to receive full size (64KB) GRO
> packets without the frag_list overhead.
>
> Note that /proc/sys/net/core/max_skb_frags can also be used to limit
> the number of fragments TCP can use in tx packets.
>
> By default we keep the old/legacy value of 17 until we get
> more coverage for the updated values.
>
> Sizes of struct skb_shared_info on 64bit arches
>
> MAX_SKB_FRAGS | sizeof(struct skb_shared_info):
> ==============================================
>          17     320
>          21     320+64  = 384
>          25     320+128 = 448
>          29     320+192 = 512
>          33     320+256 = 576
>          37     320+320 = 640
>          41     320+384 = 704
>          45     320+448 = 768
>
> This inflation might cause problems for drivers assuming they could pack
> both the incoming packet (for MTU=1500) and skb_shared_info in half a page,
> using build_skb().
>
> v3: fix build error when CONFIG_NET=n
> v2: fix two build errors assuming MAX_SKB_FRAGS was "unsigned long"
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>

The patch itself looks good. Please feel free to add:

Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>

Thanks:)

> ---
>  drivers/scsi/cxgbi/libcxgbi.c |  4 ++--
>  include/linux/skbuff.h        | 16 +++++-----------
>  net/Kconfig                   | 12 ++++++++++++
>  net/packet/af_packet.c        |  4 ++--
>  4 files changed, 21 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/scsi/cxgbi/libcxgbi.c b/drivers/scsi/cxgbi/libcxgbi.c
> index af281e271f886041b397ea881e2ce7be00eff625..3e1de4c842cc6102e25a5972d6b11e05c3e4c060 100644
> --- a/drivers/scsi/cxgbi/libcxgbi.c
> +++ b/drivers/scsi/cxgbi/libcxgbi.c
> @@ -2314,9 +2314,9 @@ static int cxgbi_sock_tx_queue_up(struct cxgbi_sock *csk, struct sk_buff *skb)
>                 frags++;
>
>         if (frags >= SKB_WR_LIST_SIZE) {
> -               pr_err("csk 0x%p, frags %u, %u,%u >%lu.\n",
> +               pr_err("csk 0x%p, frags %u, %u,%u >%u.\n",
>                        csk, skb_shinfo(skb)->nr_frags, skb->len,
> -                      skb->data_len, SKB_WR_LIST_SIZE);
> +                      skb->data_len, (unsigned int)SKB_WR_LIST_SIZE);
>                 return -EINVAL;
>         }
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index fe661011644b8f468ff5e92075a6624f0557584c..82511b2f61ea2bc5d587b58f0901e50e64729e4f 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -345,18 +345,12 @@ struct sk_buff_head {
>
>  struct sk_buff;
>
> -/* To allow 64K frame to be packed as single skb without frag_list we
> - * require 64K/PAGE_SIZE pages plus 1 additional page to allow for
> - * buffers which do not start on a page boundary.
> - *
> - * Since GRO uses frags we allocate at least 16 regardless of page
> - * size.
> - */
> -#if (65536/PAGE_SIZE + 1) < 16
> -#define MAX_SKB_FRAGS 16UL
> -#else
> -#define MAX_SKB_FRAGS (65536/PAGE_SIZE + 1)
> +#ifndef CONFIG_MAX_SKB_FRAGS
> +# define CONFIG_MAX_SKB_FRAGS 17
>  #endif
> +
> +#define MAX_SKB_FRAGS CONFIG_MAX_SKB_FRAGS
> +
>  extern int sysctl_max_skb_frags;
>
>  /* Set skb_shinfo(skb)->gso_size to this in case you want skb_segment to
> diff --git a/net/Kconfig b/net/Kconfig
> index 48c33c2221999e575c83a409ab773b9cc3656eab..f806722bccf450c62e07bfdb245e5195ac4a156d 100644
> --- a/net/Kconfig
> +++ b/net/Kconfig
> @@ -251,6 +251,18 @@ config PCPU_DEV_REFCNT
>           network device refcount are using per cpu variables if this option is set.
>           This can be forced to N to detect underflows (with a performance drop).
>
> +config MAX_SKB_FRAGS
> +       int "Maximum number of fragments per skb_shared_info"
> +       range 17 45
> +       default 17
> +       help
> +         Having more fragments per skb_shared_info can help GRO efficiency.
> +         This helps BIG TCP workloads, but might expose bugs in some
> +         legacy drivers.
> +         This also increases memory overhead of small packets,
> +         and in drivers using build_skb().
> +         If unsure, say 17.
> +
>  config RPS
>         bool
>         depends on SMP && SYSFS
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index 497193f73030c385a2d33b71dfbc299fbf9b763d..568f8d76e3c124f3b322a8d88dc3dcfbc45e7c0e 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -2622,8 +2622,8 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
>                 nr_frags = skb_shinfo(skb)->nr_frags;
>
>                 if (unlikely(nr_frags >= MAX_SKB_FRAGS)) {
> -                       pr_err("Packet exceed the number of skb frags(%lu)\n",
> -                              MAX_SKB_FRAGS);
> +                       pr_err("Packet exceed the number of skb frags(%u)\n",
> +                              (unsigned int)MAX_SKB_FRAGS);
>                         return -EFAULT;
>                 }
>
> --
> 2.40.0.rc1.284.g88254d51c5-goog
>


* Re: [PATCH v3 net-next] net: introduce a config option to tweak MAX_SKB_FRAGS
  2023-03-23 16:28 [PATCH v3 net-next] net: introduce a config option to tweak MAX_SKB_FRAGS Eric Dumazet
  2023-03-24  1:32 ` Yunsheng Lin
  2023-03-27  7:18 ` Jason Xing
@ 2023-03-28  2:50 ` patchwork-bot+netdevbpf
  2 siblings, 0 replies; 6+ messages in thread
From: patchwork-bot+netdevbpf @ 2023-03-28  2:50 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, kuba, pabeni, netdev, edumazet, razor

Hello:

This patch was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Thu, 23 Mar 2023 09:28:42 -0700 you wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Currently, MAX_SKB_FRAGS value is 17.
> 
> For standard tcp sendmsg() traffic, no big deal because tcp_sendmsg()
> attempts order-3 allocations, stuffing 32768 bytes per frag.
> 
> [...]

Here is the summary with links:
  - [v3,net-next] net: introduce a config option to tweak MAX_SKB_FRAGS
    https://git.kernel.org/netdev/net-next/c/3948b05950fd

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




end of thread, other threads:[~2023-03-28  2:50 UTC | newest]

Thread overview: 6+ messages
2023-03-23 16:28 [PATCH v3 net-next] net: introduce a config option to tweak MAX_SKB_FRAGS Eric Dumazet
2023-03-24  1:32 ` Yunsheng Lin
2023-03-24  1:40   ` Eric Dumazet
2023-03-24  1:53     ` Yunsheng Lin
2023-03-27  7:18 ` Jason Xing
2023-03-28  2:50 ` patchwork-bot+netdevbpf
