All of lore.kernel.org
 help / color / mirror / Atom feed
* Bug in skb_segment: fskb->len != len
@ 2013-10-28 11:55 Christoph Paasch
  2013-10-28 13:21 ` Eric Dumazet
  0 siblings, 1 reply; 163+ messages in thread
From: Christoph Paasch @ 2013-10-28 11:55 UTC (permalink / raw)
  To: Eric Dumazet, Herbert Xu; +Cc: netdev

Hello,

I have been seeing the below BUG in skb_segment with the latest net-next
head on my router.

I am forwarding Multipath TCP-traffic on this router. The MPTCP-sender is simply
doing an iperf-session. Strangely, I cannot reproduce the bug when sending
regular TCP-traffic across the router.
Note: The crash happens on a vanilla net-next kernel. It does not has any
MPTCP-code in it.

I bisected it down to 8a29111c7c (net: gro: allow to build full sized skb),
but I guess 8a29111c7c is just revealing a more fundamental bug in skb_segment.

Some info I found:
In skb_segment, when the bug happens, fskb->len is 4284 but the mss and len is 1428.
Shortly before the bug happens, skb_gro_receive is building a packet where
lp->len is equal to 4284 inside the frag_list.


Seems like skb_segment cannot handle those bigger skb's in the frag_list.


Cheers,
Christoph


Here the crash-dump:

[  399.832854] ------------[ cut here ]------------
[  399.888048] kernel BUG at /home/cpaasch/builder/net-next/net/core/skbuff.c:2796!
[  399.976504] invalid opcode: 0000 [#1] SMP 
[  400.025675] Modules linked in:
[  400.062270] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 3.12.0-rc6-mptcp #231
[  400.145531] Hardware name: HP ProLiant DL120 G6/ProLiant DL120 G6, BIOS O26    09/06/2010
[  400.243342] task: ffff88042d8a4680 ti: ffff88042d8ce000 task.ti: ffff88042d8ce000
[  400.332841] RIP: 0010:[<ffffffff81447d21>]  [<ffffffff81447d21>] skb_segment+0x1aa/0x5fa
[  400.429722] RSP: 0018:ffff88043fd03770  EFLAGS: 00010212
[  400.493231] RAX: 0000000000000594 RBX: ffff8800ba89ac00 RCX: 00000000000064be
[  400.578574] RDX: 0000000000000000 RSI: 0000000000000011 RDI: ffff8804273a7080
[  400.663918] RBP: ffff88043fd03820 R08: 0000000000000000 R09: ffff88042c4d4600
[  400.749259] R10: 0000000000010000 R11: ffff88042d801900 R12: ffff88042c7ca000
[  400.834596] R13: ffff88042c5d5400 R14: 0000000000001650 R15: 0000000000000056
[  400.919934] FS:  0000000000000000(0000) GS:ffff88043fd00000(0000) knlGS:0000000000000000
[  401.016711] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  401.085422] CR2: ffffffffff600400 CR3: 000000042c86b000 CR4: 00000000000007e0
[  401.170765] Stack:
[  401.194780]  ffff88042d94e900 ffff88042c4d46f0 0000000000000000 0000000000000042
[  401.283663]  0100000000000000 0000000000000001 0000001100000594 0000000000000056
[  401.372555]  0000000000000000 0000004200000098 ffffffffffffffaa 0000001100000001
[  401.461445] Call Trace:
[  401.490658]  <IRQ> 
[  401.513631]  [<ffffffff8149b077>] tcp_gso_segment+0x168/0x395
[  401.584644]  [<ffffffff814a5ba1>] inet_gso_segment+0x175/0x2a9
[  401.654396]  [<ffffffff8144fb40>] skb_mac_gso_segment+0x10a/0x16a
[  401.727264]  [<ffffffff81451062>] __skb_gso_segment+0xaf/0xb4
[  401.795977]  [<ffffffff814515ae>] dev_hard_start_xmit+0x215/0x40a
[  401.868846]  [<ffffffff814689ed>] sch_direct_xmit+0x6b/0x195
[  401.936519]  [<ffffffff81451988>] dev_queue_xmit+0x1e5/0x3ac
[  402.004193]  [<ffffffff814b6461>] ? iptable_filter_hook+0x41/0x4c
[  402.077061]  [<ffffffff8148039d>] ip_finish_output+0x2f6/0x351
[  402.146812]  [<ffffffff8147c6dc>] ? ip_frag_mem+0x34/0x34
[  402.211366]  [<ffffffff81480470>] ip_output+0x78/0x7f
[  402.271765]  [<ffffffff8147c71c>] ip_forward_finish+0x40/0x44
[  402.340475]  [<ffffffff8147c9c5>] ip_forward+0x2a5/0x300
[  402.403993]  [<ffffffff8147b104>] ip_rcv_finish+0x214/0x22c
[  402.470625]  [<ffffffff8147b3cd>] ip_rcv+0x2b1/0x2e9
[  402.529983]  [<ffffffff81446a19>] ? skb_gro_receive+0x562/0x582
[  402.600773]  [<ffffffff8144dcd8>] __netif_receive_skb_core+0x49a/0x4cd
[  402.678840]  [<ffffffff8144dd60>] __netif_receive_skb+0x55/0x5a
[  402.749631]  [<ffffffff81450190>] netif_receive_skb+0x71/0x78
[  402.818344]  [<ffffffff8149af07>] ? tcp4_gro_receive+0xf4/0xfc
[  402.888095]  [<ffffffff81450249>] napi_gro_complete+0xb2/0xba
[  402.956808]  [<ffffffff8145045f>] dev_gro_receive+0x20e/0x34d
[  403.025519]  [<ffffffff81450ae5>] napi_gro_receive+0x92/0xf1
[  403.093195]  [<ffffffff813acfe2>] netxen_process_rcv_ring+0x1b0/0x767
[  403.170222]  [<ffffffff810b3ae8>] ? kmem_cache_free+0xef/0xf3
[  403.238931]  [<ffffffff81450fb1>] ? dev_kfree_skb_any+0x2e/0x30
[  403.309723]  [<ffffffff813acc42>] ? netxen_process_cmd_ring+0x33/0x223
[  403.387790]  [<ffffffff813a8f70>] netxen_nic_poll+0x35/0x9a
[  403.454423]  [<ffffffff814506dc>] net_rx_action+0xa7/0x1d2
[  403.520017]  [<ffffffff8103605d>] __do_softirq+0xbd/0x17e
[  403.584572]  [<ffffffff815289bc>] call_softirq+0x1c/0x26
[  403.648085]  [<ffffffff81003bbb>] do_softirq+0x33/0x68
[  403.709523]  [<ffffffff81035efb>] irq_exit+0x40/0x4e
[  403.768880]  [<ffffffff81003423>] do_IRQ+0x98/0xaf
[  403.826158]  [<ffffffff8152716a>] common_interrupt+0x6a/0x6a
[  403.893829]  <EOI> 
[  403.916800]  [<ffffffff8100933d>] ? default_idle+0x6/0x8
[  403.982604]  [<ffffffff81009542>] arch_cpu_idle+0x13/0x18
[  404.047159]  [<ffffffff8105ea2b>] cpu_startup_entry+0xa4/0xf1
[  404.115873]  [<ffffffff8102320b>] start_secondary+0x1b2/0x1b7
[  404.184582] Code: bd 7f ff ff ff 00 74 04 44 8b 75 c0 45 85 f6 0f 85 e5 00 00 00 8b 75 84 39 75 ac 0f 8c d9 00 00 00 45 8b 75 68 44 3b 75 c0 74 04 <0f> 0b eb fe 4c 89 ef be 20 00 00 00 e8 08 f1 ff ff 48 85 c0 48 
[  404.417106] RIP  [<ffffffff81447d21>] skb_segment+0x1aa/0x5fa
[  404.485928]  RSP <ffff88043fd03770>
[  404.527614] ---[ end trace 32152a68c7bdc3ac ]---

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Bug in skb_segment: fskb->len != len
  2013-10-28 11:55 Bug in skb_segment: fskb->len != len Christoph Paasch
@ 2013-10-28 13:21 ` Eric Dumazet
  2013-10-28 13:28   ` Christoph Paasch
  2013-10-29  1:15   ` Eric Dumazet
  0 siblings, 2 replies; 163+ messages in thread
From: Eric Dumazet @ 2013-10-28 13:21 UTC (permalink / raw)
  To: Christoph Paasch; +Cc: Herbert Xu, netdev

On Mon, 2013-10-28 at 12:55 +0100, Christoph Paasch wrote:
> Hello,
> 
> I have been seeing the below BUG in skb_segment with the latest net-next
> head on my router.
> 
> I am forwarding Multipath TCP-traffic on this router. The MPTCP-sender is simply
> doing an iperf-session. Strangely, I cannot reproduce the bug when sending
> regular TCP-traffic across the router.
> Note: The crash happens on a vanilla net-next kernel. It does not has any
> MPTCP-code in it.
> 
> I bisected it down to 8a29111c7c (net: gro: allow to build full sized skb),
> but I guess 8a29111c7c is just revealing a more fundamental bug in skb_segment.
> 
> Some info I found:
> In skb_segment, when the bug happens, fskb->len is 4284 but the mss and len is 1428.

fskb seems to contain 3 segments -> 3*1428 = 4284, so it looks fine

But what do you mean by 'len is 1428' ?

> Shortly before the bug happens, skb_gro_receive is building a packet where
> lp->len is equal to 4284 inside the frag_list.
> 
> 
> Seems like skb_segment cannot handle those bigger skb's in the frag_list.
> 

Thanks for the report, I'll take a look.

As mentioned earlier, building very large skbs (with a frag_list) for a
router makes little sense, because we need to segment them before NIC
ndo_start_xmit()

But we also need to fix the skb_segment() bug anyway.

Thanks !

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Bug in skb_segment: fskb->len != len
  2013-10-28 13:21 ` Eric Dumazet
@ 2013-10-28 13:28   ` Christoph Paasch
  2013-10-29  1:15   ` Eric Dumazet
  1 sibling, 0 replies; 163+ messages in thread
From: Christoph Paasch @ 2013-10-28 13:28 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Herbert Xu, netdev

On 28/10/13 - 06:21:11, Eric Dumazet wrote:
> On Mon, 2013-10-28 at 12:55 +0100, Christoph Paasch wrote:
> > I have been seeing the below BUG in skb_segment with the latest net-next
> > head on my router.
> > 
> > I am forwarding Multipath TCP-traffic on this router. The MPTCP-sender is simply
> > doing an iperf-session. Strangely, I cannot reproduce the bug when sending
> > regular TCP-traffic across the router.
> > Note: The crash happens on a vanilla net-next kernel. It does not has any
> > MPTCP-code in it.
> > 
> > I bisected it down to 8a29111c7c (net: gro: allow to build full sized skb),
> > but I guess 8a29111c7c is just revealing a more fundamental bug in skb_segment.
> > 
> > Some info I found:
> > In skb_segment, when the bug happens, fskb->len is 4284 but the mss and len is 1428.
> 
> fskb seems to contain 3 segments -> 3*1428 = 4284, so it looks fine
> 
> But what do you mean by 'len is 1428' ?

I meant that the variable "len" equals 1428. And thus BUG_ON(fskb->len != len) triggers.

> > Shortly before the bug happens, skb_gro_receive is building a packet where
> > lp->len is equal to 4284 inside the frag_list.
> > 
> > 
> > Seems like skb_segment cannot handle those bigger skb's in the frag_list.
> > 
> 
> Thanks for the report, I'll take a look.
> 
> As mentioned earlier, building very large skbs (with a frag_list) for a
> router makes little sense, because we need to segment them before NIC
> ndo_start_xmit()
> 
> But we also need to fix the skb_segment() bug anyway.
> 
> Thanks !

Let me know if I should provide more info or test a patch.


Cheers,
Christoph

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Bug in skb_segment: fskb->len != len
  2013-10-28 13:21 ` Eric Dumazet
  2013-10-28 13:28   ` Christoph Paasch
@ 2013-10-29  1:15   ` Eric Dumazet
  2013-10-29  9:08     ` Christoph Paasch
  2013-10-29 14:41     ` Bug in skb_segment: fskb->len != len Herbert Xu
  1 sibling, 2 replies; 163+ messages in thread
From: Eric Dumazet @ 2013-10-29  1:15 UTC (permalink / raw)
  To: Christoph Paasch; +Cc: Herbert Xu, netdev

On Mon, 2013-10-28 at 06:21 -0700, Eric Dumazet wrote:

> But we also need to fix the skb_segment() bug anyway.

Hi Christoph

I cooked a minimal patch, could you please try it ?

I'll refactor skb_segment() to be smarter for the next release
(linux-3.14).

Thanks !

 net/core/skbuff.c |   15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0ab32faa520f..771946487a8d 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2761,7 +2761,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 	unsigned int len;
 	__be16 proto;
 	bool csum;
-	int sg = !!(features & NETIF_F_SG);
+	bool sg = !!(features & NETIF_F_SG);
 	int nfrags = skb_shinfo(skb)->nr_frags;
 	int err = -ENOMEM;
 	int i = 0;
@@ -2793,7 +2793,11 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 			hsize = len;
 
 		if (!hsize && i >= nfrags) {
-			BUG_ON(fskb->len != len);
+			if (fskb->len != len) {
+				hsize = len;
+				sg = false;
+				goto do_linear;
+			}
 
 			pos += len;
 			nskb = skb_clone(fskb, GFP_ATOMIC);
@@ -2812,6 +2816,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 			skb_release_head_state(nskb);
 			__skb_push(nskb, doffset);
 		} else {
+do_linear:
 			nskb = __alloc_skb(hsize + doffset + headroom,
 					   GFP_ATOMIC, skb_alloc_rx_flag(skb),
 					   NUMA_NO_NODE);
@@ -2838,9 +2843,6 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 						 nskb->data - tnl_hlen,
 						 doffset + tnl_hlen);
 
-		if (fskb != skb_shinfo(skb)->frag_list)
-			goto perform_csum_check;
-
 		if (!sg) {
 			nskb->ip_summed = CHECKSUM_NONE;
 			nskb->csum = skb_copy_and_csum_bits(skb, offset,
@@ -2849,6 +2851,9 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 			continue;
 		}
 
+		if (fskb != skb_shinfo(skb)->frag_list)
+			goto perform_csum_check;
+
 		frag = skb_shinfo(nskb)->frags;
 
 		skb_copy_from_linear_data_offset(skb, offset,

^ permalink raw reply related	[flat|nested] 163+ messages in thread

* Re: Bug in skb_segment: fskb->len != len
  2013-10-29  1:15   ` Eric Dumazet
@ 2013-10-29  9:08     ` Christoph Paasch
  2013-10-29 12:57       ` Eric Dumazet
  2013-10-29 13:06       ` [PATCH net-next] net: introduce gro_frag_list_enable sysctl Eric Dumazet
  2013-10-29 14:41     ` Bug in skb_segment: fskb->len != len Herbert Xu
  1 sibling, 2 replies; 163+ messages in thread
From: Christoph Paasch @ 2013-10-29  9:08 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Herbert Xu, netdev

On 28/10/13 - 18:15:08, Eric Dumazet wrote:
> On Mon, 2013-10-28 at 06:21 -0700, Eric Dumazet wrote:
> 
> > But we also need to fix the skb_segment() bug anyway.
> 
> Hi Christoph
> 
> I cooked a minimal patch, could you please try it ?
> 
> I'll refactor skb_segment() to be smarter for the next release
> (linux-3.14).
> 
> Thanks !
> 
>  net/core/skbuff.c |   15 ++++++++++-----
>  1 file changed, 10 insertions(+), 5 deletions(-)

Ok, my router does not crash anymore with my workload.

Thanks for fixing it!

Tested-by: Christoph Paasch <christoph.paasch@uclouvain.be>


Cheers,
Christoph

> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 0ab32faa520f..771946487a8d 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -2761,7 +2761,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
>  	unsigned int len;
>  	__be16 proto;
>  	bool csum;
> -	int sg = !!(features & NETIF_F_SG);
> +	bool sg = !!(features & NETIF_F_SG);
>  	int nfrags = skb_shinfo(skb)->nr_frags;
>  	int err = -ENOMEM;
>  	int i = 0;
> @@ -2793,7 +2793,11 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
>  			hsize = len;
>  
>  		if (!hsize && i >= nfrags) {
> -			BUG_ON(fskb->len != len);
> +			if (fskb->len != len) {
> +				hsize = len;
> +				sg = false;
> +				goto do_linear;
> +			}
>  
>  			pos += len;
>  			nskb = skb_clone(fskb, GFP_ATOMIC);
> @@ -2812,6 +2816,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
>  			skb_release_head_state(nskb);
>  			__skb_push(nskb, doffset);
>  		} else {
> +do_linear:
>  			nskb = __alloc_skb(hsize + doffset + headroom,
>  					   GFP_ATOMIC, skb_alloc_rx_flag(skb),
>  					   NUMA_NO_NODE);
> @@ -2838,9 +2843,6 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
>  						 nskb->data - tnl_hlen,
>  						 doffset + tnl_hlen);
>  
> -		if (fskb != skb_shinfo(skb)->frag_list)
> -			goto perform_csum_check;
> -
>  		if (!sg) {
>  			nskb->ip_summed = CHECKSUM_NONE;
>  			nskb->csum = skb_copy_and_csum_bits(skb, offset,
> @@ -2849,6 +2851,9 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
>  			continue;
>  		}
>  
> +		if (fskb != skb_shinfo(skb)->frag_list)
> +			goto perform_csum_check;
> +
>  		frag = skb_shinfo(nskb)->frags;
>  
>  		skb_copy_from_linear_data_offset(skb, offset,
> 
> 

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Bug in skb_segment: fskb->len != len
  2013-10-29  9:08     ` Christoph Paasch
@ 2013-10-29 12:57       ` Eric Dumazet
  2013-10-29 13:06       ` [PATCH net-next] net: introduce gro_frag_list_enable sysctl Eric Dumazet
  1 sibling, 0 replies; 163+ messages in thread
From: Eric Dumazet @ 2013-10-29 12:57 UTC (permalink / raw)
  To: Christoph Paasch; +Cc: Herbert Xu, netdev

On Tue, 2013-10-29 at 10:08 +0100, Christoph Paasch wrote:
> Ok, my router does not crash anymore with my workload.
> 
> Thanks for fixing it!
> 
> Tested-by: Christoph Paasch <christoph.paasch@uclouvain.be>

Ok fine ;)

Do you mind to test the official patch, as I added a new sysctl so that
GRO layer do not use frag_list on a router ?

You could check that /proc/sys/net/core/gro_frag_list_enable is
automatically cleared.

This way, we have optimal behavior for host or router.

Thanks !

^ permalink raw reply	[flat|nested] 163+ messages in thread

* [PATCH net-next] net: introduce gro_frag_list_enable sysctl
  2013-10-29  9:08     ` Christoph Paasch
  2013-10-29 12:57       ` Eric Dumazet
@ 2013-10-29 13:06       ` Eric Dumazet
  2013-10-29 13:48         ` Christoph Paasch
  2013-10-29 15:12         ` [PATCH v2 " Eric Dumazet
  1 sibling, 2 replies; 163+ messages in thread
From: Eric Dumazet @ 2013-10-29 13:06 UTC (permalink / raw)
  To: Christoph Paasch, David Miller
  Cc: Herbert Xu, netdev, Jerry Chu, Michael Dalton

From: Eric Dumazet <edumazet@google.com>

Christoph Paasch and Jerry Chu reported crashes in skb_segment() caused
by commit 8a29111c7ca6 ("net: gro: allow to build full sized skb")

(Jerry is working on adding native GRO support for tunnels)

skb_segment() only deals with a frag_list chain containing MSS sized
fragments.

This patch adds support any kind of frag, and adds a new sysctl,
as clearly the GRO layer should avoid building frag_list skbs
on a router, as the segmentation is adding cpu overhead.

Note that we could try to reuse page fragments instead of doing
copy to linear skbs, but this requires a fair amount of work,
and possible truesize nightmares, as we do not track individual
(per page fragment) truesizes.

/proc/sys/net/core/gro_frag_list_enable possible values are :

0 : GRO layer is not allowed to use frag_list to extend skb capacity
1 : GRO layer is allowed to use frag_list, but skb_segment()
    automatically sets the sysctl to 0.
2 : GRO is allowed to use frag_list, and skb_segment() wont
    clear the sysctl.

Default value is 1 : automatic discovery

Reported-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Reported-by: Jerry Chu <hkchu@google.com>
Cc: Michael Dalton <mwdalton@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 Documentation/sysctl/net.txt |   19 +++++++++++++++++++
 include/linux/netdevice.h    |    1 +
 net/core/skbuff.c            |   29 ++++++++++++++++++++---------
 net/core/sysctl_net_core.c   |   10 ++++++++++
 4 files changed, 50 insertions(+), 9 deletions(-)

diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt
index 9a0319a82470..8778568ae64e 100644
--- a/Documentation/sysctl/net.txt
+++ b/Documentation/sysctl/net.txt
@@ -87,6 +87,25 @@ sysctl.net.busy_read globally.
 Will increase power usage.
 Default: 0 (off)
 
+gro_frag_list_enable
+--------------------
+
+GRO layer can build full size GRO packets (~64K of payload) if it is allowed
+to extend skb using the frag_list pointer. However, this strategy is a win
+on hosts, where TCP flows are terminated. For a router, using frag_list
+skbs is not a win because we have to segment skbs before transmit,
+as most NIC drivers do not support frag_list.
+As soon as one frag_list skb has to be segmented, this sysctl is automatically
+changed from 1 to 0.
+If the value is set to 2, kernel wont change it.
+
+Choices : 0 (off),
+          1 (on, with automatic change to 0)
+          2 (on, permanent)
+
+Default: 1 (on, with automatic downgrade on a router)
+
+
 rmem_default
 ------------
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 27f62f746621..b82ff52f301e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2807,6 +2807,7 @@ extern int		netdev_max_backlog;
 extern int		netdev_tstamp_prequeue;
 extern int		weight_p;
 extern int		bpf_jit_enable;
+extern int		sysctl_gro_frag_list_enable;
 
 bool netdev_has_upper_dev(struct net_device *dev, struct net_device *upper_dev);
 bool netdev_has_any_upper_dev(struct net_device *dev);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0ab32faa520f..d5b74ed1e9cb 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2761,7 +2761,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 	unsigned int len;
 	__be16 proto;
 	bool csum;
-	int sg = !!(features & NETIF_F_SG);
+	bool sg = !!(features & NETIF_F_SG);
 	int nfrags = skb_shinfo(skb)->nr_frags;
 	int err = -ENOMEM;
 	int i = 0;
@@ -2793,7 +2793,13 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 			hsize = len;
 
 		if (!hsize && i >= nfrags) {
-			BUG_ON(fskb->len != len);
+			if (fskb->len != len) {
+				if (sysctl_gro_frag_list_enable == 1)
+					sysctl_gro_frag_list_enable = 0;
+				hsize = len;
+				sg = false;
+				goto do_linear;
+			}
 
 			pos += len;
 			nskb = skb_clone(fskb, GFP_ATOMIC);
@@ -2812,6 +2818,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 			skb_release_head_state(nskb);
 			__skb_push(nskb, doffset);
 		} else {
+do_linear:
 			nskb = __alloc_skb(hsize + doffset + headroom,
 					   GFP_ATOMIC, skb_alloc_rx_flag(skb),
 					   NUMA_NO_NODE);
@@ -2838,9 +2845,6 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 						 nskb->data - tnl_hlen,
 						 doffset + tnl_hlen);
 
-		if (fskb != skb_shinfo(skb)->frag_list)
-			goto perform_csum_check;
-
 		if (!sg) {
 			nskb->ip_summed = CHECKSUM_NONE;
 			nskb->csum = skb_copy_and_csum_bits(skb, offset,
@@ -2849,6 +2853,9 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 			continue;
 		}
 
+		if (fskb != skb_shinfo(skb)->frag_list)
+			goto perform_csum_check;
+
 		frag = skb_shinfo(nskb)->frags;
 
 		skb_copy_from_linear_data_offset(skb, offset,
@@ -2944,9 +2951,11 @@ int skb_gro_receive(struct sk_buff **head, struct sk_buff *skb)
 		int i = skbinfo->nr_frags;
 		int nr_frags = pinfo->nr_frags + i;
 
-		if (nr_frags > MAX_SKB_FRAGS)
+		if (unlikely(nr_frags > MAX_SKB_FRAGS)) {
+			if (!sysctl_gro_frag_list_enable)
+				return -E2BIG;
 			goto merge;
-
+		}
 		offset -= headlen;
 		pinfo->nr_frags = nr_frags;
 		skbinfo->nr_frags = 0;
@@ -2977,9 +2986,11 @@ int skb_gro_receive(struct sk_buff **head, struct sk_buff *skb)
 		unsigned int first_size = headlen - offset;
 		unsigned int first_offset;
 
-		if (nr_frags + 1 + skbinfo->nr_frags > MAX_SKB_FRAGS)
+		if (unlikely(nr_frags + 1 + skbinfo->nr_frags > MAX_SKB_FRAGS)) {
+			if (!sysctl_gro_frag_list_enable)
+				return -E2BIG;
 			goto merge;
-
+		}
 		first_offset = skb->data -
 			       (unsigned char *)page_address(page) +
 			       offset;
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index cca444190907..2d6aaf6d5838 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -24,6 +24,7 @@
 
 static int zero = 0;
 static int one = 1;
+static int two = 2;
 static int ushort_max = USHRT_MAX;
 
 #ifdef CONFIG_RPS
@@ -360,6 +361,15 @@ static struct ctl_table net_core_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec
 	},
+	{
+		.procname	= "gro_frag_list_enable",
+		.data		= &sysctl_gro_frag_list_enable,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &two,
+	},
 	{ }
 };
 

^ permalink raw reply related	[flat|nested] 163+ messages in thread

* Re: [PATCH net-next] net: introduce gro_frag_list_enable sysctl
  2013-10-29 13:06       ` [PATCH net-next] net: introduce gro_frag_list_enable sysctl Eric Dumazet
@ 2013-10-29 13:48         ` Christoph Paasch
  2013-10-29 15:12         ` [PATCH v2 " Eric Dumazet
  1 sibling, 0 replies; 163+ messages in thread
From: Christoph Paasch @ 2013-10-29 13:48 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Herbert Xu, netdev, Jerry Chu, Michael Dalton

On 29/10/13 - 06:06:02, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Christoph Paasch and Jerry Chu reported crashes in skb_segment() caused
> by commit 8a29111c7ca6 ("net: gro: allow to build full sized skb")
> 
> (Jerry is working on adding native GRO support for tunnels)
> 
> skb_segment() only deals with a frag_list chain containing MSS sized
> fragments.
> 
> This patch adds support any kind of frag, and adds a new sysctl,
> as clearly the GRO layer should avoid building frag_list skbs
> on a router, as the segmentation is adding cpu overhead.
> 
> Note that we could try to reuse page fragments instead of doing
> copy to linear skbs, but this requires a fair amount of work,
> and possible truesize nightmares, as we do not track individual
> (per page fragment) truesizes.
> 
> /proc/sys/net/core/gro_frag_list_enable possible values are :
> 
> 0 : GRO layer is not allowed to use frag_list to extend skb capacity
> 1 : GRO layer is allowed to use frag_list, but skb_segment()
>     automatically sets the sysctl to 0.
> 2 : GRO is allowed to use frag_list, and skb_segment() wont
>     clear the sysctl.
> 
> Default value is 1 : automatic discovery
> 
> Reported-by: Christoph Paasch <christoph.paasch@uclouvain.be>
> Reported-by: Jerry Chu <hkchu@google.com>
> Cc: Michael Dalton <mwdalton@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  Documentation/sysctl/net.txt |   19 +++++++++++++++++++
>  include/linux/netdevice.h    |    1 +
>  net/core/skbuff.c            |   29 ++++++++++++++++++++---------
>  net/core/sysctl_net_core.c   |   10 ++++++++++
>  4 files changed, 50 insertions(+), 9 deletions(-)
> 
> diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt
> index 9a0319a82470..8778568ae64e 100644
> --- a/Documentation/sysctl/net.txt
> +++ b/Documentation/sysctl/net.txt
> @@ -87,6 +87,25 @@ sysctl.net.busy_read globally.
>  Will increase power usage.
>  Default: 0 (off)
>  
> +gro_frag_list_enable
> +--------------------
> +
> +GRO layer can build full size GRO packets (~64K of payload) if it is allowed
> +to extend skb using the frag_list pointer. However, this strategy is a win
> +on hosts, where TCP flows are terminated. For a router, using frag_list
> +skbs is not a win because we have to segment skbs before transmit,
> +as most NIC drivers do not support frag_list.
> +As soon as one frag_list skb has to be segmented, this sysctl is automatically
> +changed from 1 to 0.
> +If the value is set to 2, kernel wont change it.
> +
> +Choices : 0 (off),
> +          1 (on, with automatic change to 0)
> +          2 (on, permanent)
> +
> +Default: 1 (on, with automatic downgrade on a router)
> +
> +
>  rmem_default
>  ------------
>  
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 27f62f746621..b82ff52f301e 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -2807,6 +2807,7 @@ extern int		netdev_max_backlog;
>  extern int		netdev_tstamp_prequeue;
>  extern int		weight_p;
>  extern int		bpf_jit_enable;
> +extern int		sysctl_gro_frag_list_enable;

We are missing the definition of sysctl_gro_frag_list_enable :)

net/built-in.o: In function `skb_gro_receive':
(.text+0x8f04): undefined reference to `sysctl_gro_frag_list_enable'
net/built-in.o: In function `skb_segment':
(.text+0xa54e): undefined reference to `sysctl_gro_frag_list_enable'
net/built-in.o: In function `skb_segment':
(.text+0xa557): undefined reference to `sysctl_gro_frag_list_enable'
net/built-in.o:(.data+0x1198): undefined reference to `sysctl_gro_frag_list_enable'


Cheers,
Christoph

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Bug in skb_segment: fskb->len != len
  2013-10-29  1:15   ` Eric Dumazet
  2013-10-29  9:08     ` Christoph Paasch
@ 2013-10-29 14:41     ` Herbert Xu
  2013-10-29 15:08       ` Eric Dumazet
  1 sibling, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-10-29 14:41 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Christoph Paasch, netdev

On Mon, Oct 28, 2013 at 06:15:08PM -0700, Eric Dumazet wrote:
> On Mon, 2013-10-28 at 06:21 -0700, Eric Dumazet wrote:
> 
> > But we also need to fix the skb_segment() bug anyway.
> 
> Hi Christoph
> 
> I cooked a minimal patch, could you please try it ?
> 
> I'll refactor skb_segment() to be smarter for the next release
> (linux-3.14).

I think this patch is just papering over a deeper issue.

We should either be building skbs in pages, or using frag_list.
In the latter case each frag_list must be exactly mss bytes,
except for the last one.

So if we're crashing here it means that we got mixed up on the
receive side, either because the driver was sending us bogus skbs
or we're simply buggy.

So we need to figure out why the receive-side (i.e., GRO) is building
these bogus packets, and not papering over them on the transmit-side.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Bug in skb_segment: fskb->len != len
  2013-10-29 14:41     ` Bug in skb_segment: fskb->len != len Herbert Xu
@ 2013-10-29 15:08       ` Eric Dumazet
  2013-10-30  1:50         ` Herbert Xu
  0 siblings, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-10-29 15:08 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Christoph Paasch, netdev

On Tue, 2013-10-29 at 22:41 +0800, Herbert Xu wrote:
> On Mon, Oct 28, 2013 at 06:15:08PM -0700, Eric Dumazet wrote:
> > On Mon, 2013-10-28 at 06:21 -0700, Eric Dumazet wrote:
> > 
> > > But we also need to fix the skb_segment() bug anyway.
> > 
> > Hi Christoph
> > 
> > I cooked a minimal patch, could you please try it ?
> > 
> > I'll refactor skb_segment() to be smarter for the next release
> > (linux-3.14).
> 
> I think this patch is just papering over a deeper issue.
> 
> We should either be building skbs in pages, or using frag_list.
> In the latter case each frag_list must be exactly mss bytes,
> except for the last one.
> 
> So if we're crashing here it means that we got mixed up on the
> receive side, either because the driver was sending us bogus skbs
> or we're simply buggy.
> 
> So we need to figure out why the receive-side (i.e., GRO) is building
> these bogus packets, and not papering over them on the transmit-side.

It looks like you missed a lot of recent changes.

GRO layer was updated to be able to stack two or three sk_buff,
fully populated with page frags.

Thats quite mandatory to support line rate for 40Gb links.

We now have to make skb_segment() aware of this, I missed this part.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* [PATCH v2 net-next] net: introduce gro_frag_list_enable sysctl
  2013-10-29 13:06       ` [PATCH net-next] net: introduce gro_frag_list_enable sysctl Eric Dumazet
  2013-10-29 13:48         ` Christoph Paasch
@ 2013-10-29 15:12         ` Eric Dumazet
  2013-10-29 23:44           ` David Miller
  2013-10-30  0:03           ` Jerry Chu
  1 sibling, 2 replies; 163+ messages in thread
From: Eric Dumazet @ 2013-10-29 15:12 UTC (permalink / raw)
  To: Christoph Paasch
  Cc: David Miller, Herbert Xu, netdev, Jerry Chu, Michael Dalton

From: Eric Dumazet <edumazet@google.com>

Christoph Paasch and Jerry Chu reported crashes in skb_segment() caused
by commit 8a29111c7ca6 ("net: gro: allow to build full sized skb")

(Jerry is working on adding native GRO support for tunnels)

skb_segment() only deals with a frag_list chain containing MSS sized
fragments.

This patch adds support any kind of frag, and adds a new sysctl,
as clearly the GRO layer should avoid building frag_list skbs
on a router, as the segmentation is adding cpu overhead.

Note that we could try to reuse page fragments instead of doing
copy to linear skbs, but this requires a fair amount of work,
and possible truesize nightmares, as we do not track individual
(per page fragment) truesizes.

/proc/sys/net/core/gro_frag_list_enable possible values are :

0 : GRO layer is not allowed to use frag_list to extend skb capacity
1 : GRO layer is allowed to use frag_list, but skb_segment()
    automatically sets the sysctl to 0.
2 : GRO is allowed to use frag_list, and skb_segment() wont
    clear the sysctl.

Default value is 1 : automatic discovery

Reported-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Reported-by: Jerry Chu <hkchu@google.com>
Cc: Michael Dalton <mwdalton@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
v2: added missing sysctl definition in skbuff.c

 Documentation/sysctl/net.txt |   19 +++++++++++++++++++
 include/linux/netdevice.h    |    1 +
 net/core/skbuff.c            |   31 ++++++++++++++++++++++---------
 net/core/sysctl_net_core.c   |   10 ++++++++++
 4 files changed, 52 insertions(+), 9 deletions(-)

diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt
index 9a0319a82470..8778568ae64e 100644
--- a/Documentation/sysctl/net.txt
+++ b/Documentation/sysctl/net.txt
@@ -87,6 +87,25 @@ sysctl.net.busy_read globally.
 Will increase power usage.
 Default: 0 (off)
 
+gro_frag_list_enable
+--------------------
+
+GRO layer can build full size GRO packets (~64K of payload) if it is allowed
+to extend skb using the frag_list pointer. However, this strategy is a win
+on hosts, where TCP flows are terminated. For a router, using frag_list
+skbs is not a win because we have to segment skbs before transmit,
+as most NIC drivers do not support frag_list.
+As soon as one frag_list skb has to be segmented, this sysctl is automatically
+changed from 1 to 0.
+If the value is set to 2, kernel wont change it.
+
+Choices : 0 (off),
+          1 (on, with automatic change to 0)
+          2 (on, permanent)
+
+Default: 1 (on, with automatic downgrade on a router)
+
+
 rmem_default
 ------------
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 27f62f746621..b82ff52f301e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2807,6 +2807,7 @@ extern int		netdev_max_backlog;
 extern int		netdev_tstamp_prequeue;
 extern int		weight_p;
 extern int		bpf_jit_enable;
+extern int		sysctl_gro_frag_list_enable;
 
 bool netdev_has_upper_dev(struct net_device *dev, struct net_device *upper_dev);
 bool netdev_has_any_upper_dev(struct net_device *dev);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0ab32faa520f..e089cd2782e5 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -74,6 +74,8 @@
 struct kmem_cache *skbuff_head_cache __read_mostly;
 static struct kmem_cache *skbuff_fclone_cache __read_mostly;
 
+int sysctl_gro_frag_list_enable __read_mostly = 1;
+
 static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
 				  struct pipe_buffer *buf)
 {
@@ -2761,7 +2763,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 	unsigned int len;
 	__be16 proto;
 	bool csum;
-	int sg = !!(features & NETIF_F_SG);
+	bool sg = !!(features & NETIF_F_SG);
 	int nfrags = skb_shinfo(skb)->nr_frags;
 	int err = -ENOMEM;
 	int i = 0;
@@ -2793,7 +2795,13 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 			hsize = len;
 
 		if (!hsize && i >= nfrags) {
-			BUG_ON(fskb->len != len);
+			if (fskb->len != len) {
+				if (sysctl_gro_frag_list_enable == 1)
+					sysctl_gro_frag_list_enable = 0;
+				hsize = len;
+				sg = false;
+				goto do_linear;
+			}
 
 			pos += len;
 			nskb = skb_clone(fskb, GFP_ATOMIC);
@@ -2812,6 +2820,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 			skb_release_head_state(nskb);
 			__skb_push(nskb, doffset);
 		} else {
+do_linear:
 			nskb = __alloc_skb(hsize + doffset + headroom,
 					   GFP_ATOMIC, skb_alloc_rx_flag(skb),
 					   NUMA_NO_NODE);
@@ -2838,9 +2847,6 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 						 nskb->data - tnl_hlen,
 						 doffset + tnl_hlen);
 
-		if (fskb != skb_shinfo(skb)->frag_list)
-			goto perform_csum_check;
-
 		if (!sg) {
 			nskb->ip_summed = CHECKSUM_NONE;
 			nskb->csum = skb_copy_and_csum_bits(skb, offset,
@@ -2849,6 +2855,9 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 			continue;
 		}
 
+		if (fskb != skb_shinfo(skb)->frag_list)
+			goto perform_csum_check;
+
 		frag = skb_shinfo(nskb)->frags;
 
 		skb_copy_from_linear_data_offset(skb, offset,
@@ -2944,9 +2953,11 @@ int skb_gro_receive(struct sk_buff **head, struct sk_buff *skb)
 		int i = skbinfo->nr_frags;
 		int nr_frags = pinfo->nr_frags + i;
 
-		if (nr_frags > MAX_SKB_FRAGS)
+		if (unlikely(nr_frags > MAX_SKB_FRAGS)) {
+			if (!sysctl_gro_frag_list_enable)
+				return -E2BIG;
 			goto merge;
-
+		}
 		offset -= headlen;
 		pinfo->nr_frags = nr_frags;
 		skbinfo->nr_frags = 0;
@@ -2977,9 +2988,11 @@ int skb_gro_receive(struct sk_buff **head, struct sk_buff *skb)
 		unsigned int first_size = headlen - offset;
 		unsigned int first_offset;
 
-		if (nr_frags + 1 + skbinfo->nr_frags > MAX_SKB_FRAGS)
+		if (unlikely(nr_frags + 1 + skbinfo->nr_frags > MAX_SKB_FRAGS)) {
+			if (!sysctl_gro_frag_list_enable)
+				return -E2BIG;
 			goto merge;
-
+		}
 		first_offset = skb->data -
 			       (unsigned char *)page_address(page) +
 			       offset;
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index cca444190907..2d6aaf6d5838 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -24,6 +24,7 @@
 
 static int zero = 0;
 static int one = 1;
+static int two = 2;
 static int ushort_max = USHRT_MAX;
 
 #ifdef CONFIG_RPS
@@ -360,6 +361,15 @@ static struct ctl_table net_core_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec
 	},
+	{
+		.procname	= "gro_frag_list_enable",
+		.data		= &sysctl_gro_frag_list_enable,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &two,
+	},
 	{ }
 };
 

^ permalink raw reply related	[flat|nested] 163+ messages in thread

* Re: [PATCH v2 net-next] net: introduce gro_frag_list_enable sysctl
  2013-10-29 15:12         ` [PATCH v2 " Eric Dumazet
@ 2013-10-29 23:44           ` David Miller
  2013-10-30  0:06             ` Ben Hutchings
  2013-10-30  0:53             ` [PATCH v2 net-next] net: introduce gro_frag_list_enable sysctl Eric Dumazet
  2013-10-30  0:03           ` Jerry Chu
  1 sibling, 2 replies; 163+ messages in thread
From: David Miller @ 2013-10-29 23:44 UTC (permalink / raw)
  To: eric.dumazet; +Cc: christoph.paasch, herbert, netdev, hkchu, mwdalton

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 29 Oct 2013 08:12:35 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> Christoph Paasch and Jerry Chu reported crashes in skb_segment() caused
> by commit 8a29111c7ca6 ("net: gro: allow to build full sized skb")
> 
> (Jerry is working on adding native GRO support for tunnels)
> 
> skb_segment() only deals with a frag_list chain containing MSS sized
> fragments.
> 
> This patch adds support any kind of frag, and adds a new sysctl,
> as clearly the GRO layer should avoid building frag_list skbs
> on a router, as the segmentation is adding cpu overhead.
> 
> Note that we could try to reuse page fragments instead of doing
> copy to linear skbs, but this requires a fair amount of work,
> and possible truesize nightmares, as we do not track individual
> (per page fragment) truesizes.
> 
> /proc/sys/net/core/gro_frag_list_enable possible values are :
> 
> 0 : GRO layer is not allowed to use frag_list to extend skb capacity
> 1 : GRO layer is allowed to use frag_list, but skb_segment()
>     automatically sets the sysctl to 0.
> 2 : GRO is allowed to use frag_list, and skb_segment() wont
>     clear the sysctl.
> 
> Default value is 1 : automatic discovery
> 
> Reported-by: Christoph Paasch <christoph.paasch@uclouvain.be>
> Reported-by: Jerry Chu <hkchu@google.com>
> Cc: Michael Dalton <mwdalton@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
> v2: added missing sysctl definition in skbuff.c

I do not like the idea of packet actions indirectly changing sysctl
values, even if you document it sufficiently as you have here.

Plus this puts the sysctl change logic in a fast path.

I would suggest instead making it change in response to changes to
ip_forward, as we do with per-device LRO settings.  This means that,
like ip_forward, you should also make this sysctl a global + devinet
per-device sysctl.

You might even emit a pr_info() when this logic triggers, and if you
are ambitious enough keep track of the previous GRO sysctl state so
you can restore it if ip_forward is set back to zero.

Thanks.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v2 net-next] net: introduce gro_frag_list_enable sysctl
  2013-10-29 15:12         ` [PATCH v2 " Eric Dumazet
  2013-10-29 23:44           ` David Miller
@ 2013-10-30  0:03           ` Jerry Chu
  1 sibling, 0 replies; 163+ messages in thread
From: Jerry Chu @ 2013-10-30  0:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Paasch, David Miller, Herbert Xu, netdev, Michael Dalton

On Tue, Oct 29, 2013 at 8:12 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> Christoph Paasch and Jerry Chu reported crashes in skb_segment() caused
> by commit 8a29111c7ca6 ("net: gro: allow to build full sized skb")
>
> (Jerry is working on adding native GRO support for tunnels)
>
> skb_segment() only deals with a frag_list chain containing MSS sized
> fragments.
>
> This patch adds support any kind of frag, and adds a new sysctl,
> as clearly the GRO layer should avoid building frag_list skbs
> on a router, as the segmentation is adding cpu overhead.
>
> Note that we could try to reuse page fragments instead of doing
> copy to linear skbs, but this requires a fair amount of work,
> and possible truesize nightmares, as we do not track individual
> (per page fragment) truesizes.
>
> /proc/sys/net/core/gro_frag_list_enable possible values are :
>
> 0 : GRO layer is not allowed to use frag_list to extend skb capacity
> 1 : GRO layer is allowed to use frag_list, but skb_segment()
>     automatically sets the sysctl to 0.
> 2 : GRO is allowed to use frag_list, and skb_segment() wont
>     clear the sysctl.
>
> Default value is 1 : automatic discovery
>
> Reported-by: Christoph Paasch <christoph.paasch@uclouvain.be>
> Reported-by: Jerry Chu <hkchu@google.com>
> Cc: Michael Dalton <mwdalton@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
> v2: added missing sysctl definition in skbuff.c
>
>  Documentation/sysctl/net.txt |   19 +++++++++++++++++++
>  include/linux/netdevice.h    |    1 +
>  net/core/skbuff.c            |   31 ++++++++++++++++++++++---------
>  net/core/sysctl_net_core.c   |   10 ++++++++++
>  4 files changed, 52 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt
> index 9a0319a82470..8778568ae64e 100644
> --- a/Documentation/sysctl/net.txt
> +++ b/Documentation/sysctl/net.txt
> @@ -87,6 +87,25 @@ sysctl.net.busy_read globally.
>  Will increase power usage.
>  Default: 0 (off)
>
> +gro_frag_list_enable
> +--------------------
> +
> +GRO layer can build full size GRO packets (~64K of payload) if it is allowed
> +to extend skb using the frag_list pointer. However, this strategy is a win
> +on hosts, where TCP flows are terminated. For a router, using frag_list
> +skbs is not a win because we have to segment skbs before transmit,
> +as most NIC drivers do not support frag_list.
> +As soon as one frag_list skb has to be segmented, this sysctl is automatically
> +changed from 1 to 0.
> +If the value is set to 2, kernel wont change it.
> +
> +Choices : 0 (off),
> +          1 (on, with automatic change to 0)
> +          2 (on, permanent)
> +
> +Default: 1 (on, with automatic downgrade on a router)
> +
> +
>  rmem_default
>  ------------
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 27f62f746621..b82ff52f301e 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -2807,6 +2807,7 @@ extern int                netdev_max_backlog;
>  extern int             netdev_tstamp_prequeue;
>  extern int             weight_p;
>  extern int             bpf_jit_enable;
> +extern int             sysctl_gro_frag_list_enable;
>
>  bool netdev_has_upper_dev(struct net_device *dev, struct net_device *upper_dev);
>  bool netdev_has_any_upper_dev(struct net_device *dev);
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 0ab32faa520f..e089cd2782e5 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -74,6 +74,8 @@
>  struct kmem_cache *skbuff_head_cache __read_mostly;
>  static struct kmem_cache *skbuff_fclone_cache __read_mostly;
>
> +int sysctl_gro_frag_list_enable __read_mostly = 1;
> +
>  static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
>                                   struct pipe_buffer *buf)
>  {
> @@ -2761,7 +2763,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
>         unsigned int len;
>         __be16 proto;
>         bool csum;
> -       int sg = !!(features & NETIF_F_SG);
> +       bool sg = !!(features & NETIF_F_SG);
>         int nfrags = skb_shinfo(skb)->nr_frags;
>         int err = -ENOMEM;
>         int i = 0;
> @@ -2793,7 +2795,13 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
>                         hsize = len;
>
>                 if (!hsize && i >= nfrags) {
> -                       BUG_ON(fskb->len != len);
> +                       if (fskb->len != len) {
> +                               if (sysctl_gro_frag_list_enable == 1)
> +                                       sysctl_gro_frag_list_enable = 0;
> +                               hsize = len;
> +                               sg = false;
> +                               goto do_linear;
> +                       }
>
>                         pos += len;
>                         nskb = skb_clone(fskb, GFP_ATOMIC);
> @@ -2812,6 +2820,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
>                         skb_release_head_state(nskb);
>                         __skb_push(nskb, doffset);
>                 } else {
> +do_linear:
>                         nskb = __alloc_skb(hsize + doffset + headroom,
>                                            GFP_ATOMIC, skb_alloc_rx_flag(skb),
>                                            NUMA_NO_NODE);
> @@ -2838,9 +2847,6 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
>                                                  nskb->data - tnl_hlen,
>                                                  doffset + tnl_hlen);
>
> -               if (fskb != skb_shinfo(skb)->frag_list)
> -                       goto perform_csum_check;
> -
>                 if (!sg) {
>                         nskb->ip_summed = CHECKSUM_NONE;
>                         nskb->csum = skb_copy_and_csum_bits(skb, offset,
> @@ -2849,6 +2855,9 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
>                         continue;
>                 }
>
> +               if (fskb != skb_shinfo(skb)->frag_list)
> +                       goto perform_csum_check;
> +
>                 frag = skb_shinfo(nskb)->frags;
>
>                 skb_copy_from_linear_data_offset(skb, offset,
> @@ -2944,9 +2953,11 @@ int skb_gro_receive(struct sk_buff **head, struct sk_buff *skb)
>                 int i = skbinfo->nr_frags;
>                 int nr_frags = pinfo->nr_frags + i;
>
> -               if (nr_frags > MAX_SKB_FRAGS)
> +               if (unlikely(nr_frags > MAX_SKB_FRAGS)) {
> +                       if (!sysctl_gro_frag_list_enable)
> +                               return -E2BIG;
>                         goto merge;
> -
> +               }
>                 offset -= headlen;
>                 pinfo->nr_frags = nr_frags;
>                 skbinfo->nr_frags = 0;
> @@ -2977,9 +2988,11 @@ int skb_gro_receive(struct sk_buff **head, struct sk_buff *skb)
>                 unsigned int first_size = headlen - offset;
>                 unsigned int first_offset;
>
> -               if (nr_frags + 1 + skbinfo->nr_frags > MAX_SKB_FRAGS)
> +               if (unlikely(nr_frags + 1 + skbinfo->nr_frags > MAX_SKB_FRAGS)) {
> +                       if (!sysctl_gro_frag_list_enable)
> +                               return -E2BIG;
>                         goto merge;
> -
> +               }
>                 first_offset = skb->data -
>                                (unsigned char *)page_address(page) +
>                                offset;
> diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
> index cca444190907..2d6aaf6d5838 100644
> --- a/net/core/sysctl_net_core.c
> +++ b/net/core/sysctl_net_core.c
> @@ -24,6 +24,7 @@
>
>  static int zero = 0;
>  static int one = 1;
> +static int two = 2;
>  static int ushort_max = USHRT_MAX;
>
>  #ifdef CONFIG_RPS
> @@ -360,6 +361,15 @@ static struct ctl_table net_core_table[] = {
>                 .mode           = 0644,
>                 .proc_handler   = proc_dointvec
>         },
> +       {
> +               .procname       = "gro_frag_list_enable",
> +               .data           = &sysctl_gro_frag_list_enable,
> +               .maxlen         = sizeof(int),
> +               .mode           = 0644,
> +               .proc_handler   = proc_dointvec_minmax,
> +               .extra1         = &zero,
> +               .extra2         = &two,
> +       },
>         { }
>  };
>
>
>

I've tested all three values of sysctl_gro_frag_list_enable with my
yet-to-submitted GRE-GRO patch running on a router. The BUG panic is
gone and the patch seems to work as documented. Also setting the value
to 0 or 1 does seem to save a bit of CPU time on the forwarding path
and also help throughput a little (a combined of 5-8%).

Tested by: Jerry Chu <hkchu@google.com>

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v2 net-next] net: introduce gro_frag_list_enable sysctl
  2013-10-29 23:44           ` David Miller
@ 2013-10-30  0:06             ` Ben Hutchings
  2013-11-02 14:01               ` [PATCH v3 net-next] net: introduce dev_set_forwarding() Eric Dumazet
  2013-10-30  0:53             ` [PATCH v2 net-next] net: introduce gro_frag_list_enable sysctl Eric Dumazet
  1 sibling, 1 reply; 163+ messages in thread
From: Ben Hutchings @ 2013-10-30  0:06 UTC (permalink / raw)
  To: David Miller
  Cc: eric.dumazet, christoph.paasch, herbert, netdev, hkchu, mwdalton

On Tue, 2013-10-29 at 19:44 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 29 Oct 2013 08:12:35 -0700
> 
> > From: Eric Dumazet <edumazet@google.com>
> > 
> > Christoph Paasch and Jerry Chu reported crashes in skb_segment() caused
> > by commit 8a29111c7ca6 ("net: gro: allow to build full sized skb")
> > 
> > (Jerry is working on adding native GRO support for tunnels)
> > 
> > skb_segment() only deals with a frag_list chain containing MSS sized
> > fragments.
> > 
> > This patch adds support any kind of frag, and adds a new sysctl,
> > as clearly the GRO layer should avoid building frag_list skbs
> > on a router, as the segmentation is adding cpu overhead.
> > 
> > Note that we could try to reuse page fragments instead of doing
> > copy to linear skbs, but this requires a fair amount of work,
> > and possible truesize nightmares, as we do not track individual
> > (per page fragment) truesizes.
> > 
> > /proc/sys/net/core/gro_frag_list_enable possible values are :
> > 
> > 0 : GRO layer is not allowed to use frag_list to extend skb capacity
> > 1 : GRO layer is allowed to use frag_list, but skb_segment()
> >     automatically sets the sysctl to 0.
> > 2 : GRO is allowed to use frag_list, and skb_segment() wont
> >     clear the sysctl.
> > 
> > Default value is 1 : automatic discovery
> > 
> > Reported-by: Christoph Paasch <christoph.paasch@uclouvain.be>
> > Reported-by: Jerry Chu <hkchu@google.com>
> > Cc: Michael Dalton <mwdalton@google.com>
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > ---
> > v2: added missing sysctl definition in skbuff.c
> 
> I do not like the idea of packet actions indirectly changing sysctl
> values, even if you document it sufficiently as you have here.
> 
> Plus this puts the sysctl change logic in a fast path.
> 
> I would suggest instead making it change in response to changes to
> ip_forward, as we do with per-device LRO settings.  This means that,
> like ip_forward, you should also make this sysctl a global + devinet
> per-device sysctl.
> 
> You might even emit a pr_info() when this logic triggers, and if you
> are ambitious enough keep track of the previous GRO sysctl state so
> you can restore it if ip_forward is set back to zero.

Speaking of which: insteading of disabling LRO once, we really ought to
keep count of the forwarders (IPv4 routing, IPv6 routing, bridge) and
use that to mask out LRO in netdev_fix_features().

I think that the forwarder count is also needed for this.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v2 net-next] net: introduce gro_frag_list_enable sysctl
  2013-10-29 23:44           ` David Miller
  2013-10-30  0:06             ` Ben Hutchings
@ 2013-10-30  0:53             ` Eric Dumazet
  2013-10-30  2:02               ` David Miller
  1 sibling, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-10-30  0:53 UTC (permalink / raw)
  To: David Miller; +Cc: christoph.paasch, herbert, netdev, hkchu, mwdalton

On Tue, 2013-10-29 at 19:44 -0400, David Miller wrote:

> I do not like the idea of packet actions indirectly changing sysctl
> values, even if you document it sufficiently as you have here.
> 

Fair enough.

> I would suggest instead making it change in response to changes to
> ip_forward, as we do with per-device LRO settings.  This means that,
> like ip_forward, you should also make this sysctl a global + devinet
> per-device sysctl.
> 


> You might even emit a pr_info() when this logic triggers, and if you
> are ambitious enough keep track of the previous GRO sysctl state so
> you can restore it if ip_forward is set back to zero.

Ok, but this might take some time.

So should we apply the first fix to avoid the BUG_ON() ?

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Bug in skb_segment: fskb->len != len
  2013-10-29 15:08       ` Eric Dumazet
@ 2013-10-30  1:50         ` Herbert Xu
  2013-10-30  4:03           ` Eric Dumazet
  0 siblings, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-10-30  1:50 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Christoph Paasch, netdev

On Tue, Oct 29, 2013 at 08:08:13AM -0700, Eric Dumazet wrote:
>
> GRO layer was updated to be able to stack two or three sk_buff,
> fully populated with page frags.
> 
> Thats quite mandatory to support line rate for 40Gb links.

Indeed I missed this.  Which commit introduced this change?

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v2 net-next] net: introduce gro_frag_list_enable sysctl
  2013-10-30  0:53             ` [PATCH v2 net-next] net: introduce gro_frag_list_enable sysctl Eric Dumazet
@ 2013-10-30  2:02               ` David Miller
  2013-10-30  2:05                 ` Herbert Xu
  2013-10-30  4:06                 ` Eric Dumazet
  0 siblings, 2 replies; 163+ messages in thread
From: David Miller @ 2013-10-30  2:02 UTC (permalink / raw)
  To: eric.dumazet; +Cc: christoph.paasch, herbert, netdev, hkchu, mwdalton

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 29 Oct 2013 17:53:48 -0700

> So should we apply the first fix to avoid the BUG_ON() ?

Please be more specific, are you talking about splitting up
this patch in some way?

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v2 net-next] net: introduce gro_frag_list_enable sysctl
  2013-10-30  2:02               ` David Miller
@ 2013-10-30  2:05                 ` Herbert Xu
  2013-10-30  2:13                   ` Jerry Chu
  2013-10-30 19:39                   ` Ben Hutchings
  2013-10-30  4:06                 ` Eric Dumazet
  1 sibling, 2 replies; 163+ messages in thread
From: Herbert Xu @ 2013-10-30  2:05 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, christoph.paasch, netdev, hkchu, mwdalton

On Tue, Oct 29, 2013 at 10:02:53PM -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 29 Oct 2013 17:53:48 -0700
> 
> > So should we apply the first fix to avoid the BUG_ON() ?
> 
> Please be more specific, are you talking about splitting up
> this patch in some way?

I think Eric is referring to the patch that removes the BUG_ON
in skb_segment and deals with the new mega-GRO packets.

I think that's fine for stable, but for the long term we should
fix it properly as these new meag-GRO packets still retain the
existing packet boundaries and are trivially segmentable.

If we are indeed able to do that, I doubt we would even need
the sysctl patch since the GRO performance should be vastly
superior to the non-GRO case, even for a router/bridge.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v2 net-next] net: introduce gro_frag_list_enable sysctl
  2013-10-30  2:05                 ` Herbert Xu
@ 2013-10-30  2:13                   ` Jerry Chu
  2013-10-30  2:19                     ` Herbert Xu
  2013-10-30  2:33                     ` David Miller
  2013-10-30 19:39                   ` Ben Hutchings
  1 sibling, 2 replies; 163+ messages in thread
From: Jerry Chu @ 2013-10-30  2:13 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, Eric Dumazet, Christoph Paasch, netdev, Michael Dalton

On Tue, Oct 29, 2013 at 7:05 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Tue, Oct 29, 2013 at 10:02:53PM -0400, David Miller wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Tue, 29 Oct 2013 17:53:48 -0700
>>
>> > So should we apply the first fix to avoid the BUG_ON() ?
>>
>> Please be more specific, are you talking about splitting up
>> this patch in some way?
>
> I think Eric is referring to the patch that removes the BUG_ON
> in skb_segment and deals with the new mega-GRO packets.
>
> I think that's fine for stable, but for the long term we should
> fix it properly as these new meag-GRO packets still retain the
> existing packet boundaries and are trivially segmentable.
>
> If we are indeed able to do that, I doubt we would even need
> the sysctl patch since the GRO performance should be vastly
> superior to the non-GRO case, even for a router/bridge.

Probably not the case for the simple forwarding case. See my
test result of some small (5-8%) CPU+throughput penalty from
GRO (over GRE tunnel) posted previously. But I can believe
the number may be very different if the forwarding path involves
more work (NAT, iptables filtering,...,etc) resulting in a higher per
pkt cost.

Best,

Jerry

>
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v2 net-next] net: introduce gro_frag_list_enable sysctl
  2013-10-30  2:13                   ` Jerry Chu
@ 2013-10-30  2:19                     ` Herbert Xu
  2013-10-30  2:34                       ` David Miller
  2013-10-30  2:33                     ` David Miller
  1 sibling, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-10-30  2:19 UTC (permalink / raw)
  To: Jerry Chu
  Cc: David Miller, Eric Dumazet, Christoph Paasch, netdev, Michael Dalton

On Tue, Oct 29, 2013 at 07:13:50PM -0700, Jerry Chu wrote:
> On Tue, Oct 29, 2013 at 7:05 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> > If we are indeed able to do that, I doubt we would even need
> > the sysctl patch since the GRO performance should be vastly
> > superior to the non-GRO case, even for a router/bridge.
> 
> Probably not the case for the simple forwarding case. See my
> test result of some small (5-8%) CPU+throughput penalty from
> GRO (over GRE tunnel) posted previously. But I can believe
> the number may be very different if the forwarding path involves
> more work (NAT, iptables filtering,...,etc) resulting in a higher per
> pkt cost.

Your numbers are with Eric's current patch that just linearises
the packet, what I'm saying is that you don't need to linearise
these packets since the packet boundaries are still there, just
hidden inside each frag_list.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v2 net-next] net: introduce gro_frag_list_enable sysctl
  2013-10-30  2:13                   ` Jerry Chu
  2013-10-30  2:19                     ` Herbert Xu
@ 2013-10-30  2:33                     ` David Miller
       [not found]                       ` <44571383414236@web13j.yandex.ru>
  1 sibling, 1 reply; 163+ messages in thread
From: David Miller @ 2013-10-30  2:33 UTC (permalink / raw)
  To: hkchu; +Cc: herbert, eric.dumazet, christoph.paasch, netdev, mwdalton

From: Jerry Chu <hkchu@google.com>
Date: Tue, 29 Oct 2013 19:13:50 -0700

> On Tue, Oct 29, 2013 at 7:05 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>> On Tue, Oct 29, 2013 at 10:02:53PM -0400, David Miller wrote:
>>> From: Eric Dumazet <eric.dumazet@gmail.com>
>>> Date: Tue, 29 Oct 2013 17:53:48 -0700
>>>
>>> > So should we apply the first fix to avoid the BUG_ON() ?
>>>
>>> Please be more specific, are you talking about splitting up
>>> this patch in some way?
>>
>> I think Eric is referring to the patch that removes the BUG_ON
>> in skb_segment and deals with the new mega-GRO packets.
>>
>> I think that's fine for stable, but for the long term we should
>> fix it properly as these new meag-GRO packets still retain the
>> existing packet boundaries and are trivially segmentable.
>>
>> If we are indeed able to do that, I doubt we would even need
>> the sysctl patch since the GRO performance should be vastly
>> superior to the non-GRO case, even for a router/bridge.
> 
> Probably not the case for the simple forwarding case. See my
> test result of some small (5-8%) CPU+throughput penalty from
> GRO (over GRE tunnel) posted previously. But I can believe
> the number may be very different if the forwarding path involves
> more work (NAT, iptables filtering,...,etc) resulting in a higher per
> pkt cost.

It's that way because it's not implemented properly.

GRO should always win, even on a router, because it decreases the
number of fundamental operations (routing lookups) that the stack
needs to perform.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v2 net-next] net: introduce gro_frag_list_enable sysctl
  2013-10-30  2:19                     ` Herbert Xu
@ 2013-10-30  2:34                       ` David Miller
  0 siblings, 0 replies; 163+ messages in thread
From: David Miller @ 2013-10-30  2:34 UTC (permalink / raw)
  To: herbert; +Cc: hkchu, eric.dumazet, christoph.paasch, netdev, mwdalton

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 30 Oct 2013 10:19:06 +0800

> On Tue, Oct 29, 2013 at 07:13:50PM -0700, Jerry Chu wrote:
>> On Tue, Oct 29, 2013 at 7:05 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>>
>> > If we are indeed able to do that, I doubt we would even need
>> > the sysctl patch since the GRO performance should be vastly
>> > superior to the non-GRO case, even for a router/bridge.
>> 
>> Probably not the case for the simple forwarding case. See my
>> test result of some small (5-8%) CPU+throughput penalty from
>> GRO (over GRE tunnel) posted previously. But I can believe
>> the number may be very different if the forwarding path involves
>> more work (NAT, iptables filtering,...,etc) resulting in a higher per
>> pkt cost.
> 
> Your numbers are with Eric's current patch that just linearises
> the packet, what I'm saying is that you don't need to linearise
> these packets since the packet boundaries are still there, just
> hidden inside each frag_list.

Agreed.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Bug in skb_segment: fskb->len != len
  2013-10-30  1:50         ` Herbert Xu
@ 2013-10-30  4:03           ` Eric Dumazet
  2013-10-30  4:06             ` Herbert Xu
  0 siblings, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-10-30  4:03 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Christoph Paasch, netdev

On Wed, 2013-10-30 at 09:50 +0800, Herbert Xu wrote:
> On Tue, Oct 29, 2013 at 08:08:13AM -0700, Eric Dumazet wrote:
> >
> > GRO layer was updated to be able to stack two or three sk_buff,
> > fully populated with page frags.
> > 
> > Thats quite mandatory to support line rate for 40Gb links.
> 
> Indeed I missed this.  Which commit introduced this change?

This was mentioned in the changelog :

<quote>

Christoph Paasch and Jerry Chu reported crashes in skb_segment() caused
by commit 8a29111c7ca6 ("net: gro: allow to build full sized skb")

</quote>

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Bug in skb_segment: fskb->len != len
  2013-10-30  4:03           ` Eric Dumazet
@ 2013-10-30  4:06             ` Herbert Xu
  2013-10-30  4:37               ` Eric Dumazet
  0 siblings, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-10-30  4:06 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Christoph Paasch, netdev

On Tue, Oct 29, 2013 at 09:03:05PM -0700, Eric Dumazet wrote:
> On Wed, 2013-10-30 at 09:50 +0800, Herbert Xu wrote:
> > 
> > Indeed I missed this.  Which commit introduced this change?
> 
> This was mentioned in the changelog :
> 
> <quote>
> 
> Christoph Paasch and Jerry Chu reported crashes in skb_segment() caused
> by commit 8a29111c7ca6 ("net: gro: allow to build full sized skb")
> 
> </quote>

Thanks.

In that case we should be able to segment these full-sized skbs
without linearising.  What we should do is iterate through
each frag_list entry and apply the usual GSO code to them.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v2 net-next] net: introduce gro_frag_list_enable sysctl
  2013-10-30  2:02               ` David Miller
  2013-10-30  2:05                 ` Herbert Xu
@ 2013-10-30  4:06                 ` Eric Dumazet
  2013-10-30  4:08                   ` Herbert Xu
  1 sibling, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-10-30  4:06 UTC (permalink / raw)
  To: David Miller; +Cc: christoph.paasch, herbert, netdev, hkchu, mwdalton

On Tue, 2013-10-29 at 22:02 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 29 Oct 2013 17:53:48 -0700
> 
> > So should we apply the first fix to avoid the BUG_ON() ?
> 
> Please be more specific, are you talking about splitting up
> this patch in some way?

I am referring to the first version I sent to Christoph :

http://www.spinics.net/lists/netdev/msg255452.html

Then I added the sysctl to avoid future packets to get a frag_list in
the first place.

Doing a smart skb_segment() is possible, but this function is already
complex.

I am not sure 64K GRO packets that must be segmented are going to be
faster than 22K packets without segmentation at all (TSO path on xmit)

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v2 net-next] net: introduce gro_frag_list_enable sysctl
  2013-10-30  4:06                 ` Eric Dumazet
@ 2013-10-30  4:08                   ` Herbert Xu
  2013-10-30  4:09                     ` Herbert Xu
  2013-10-30  4:16                     ` Eric Dumazet
  0 siblings, 2 replies; 163+ messages in thread
From: Herbert Xu @ 2013-10-30  4:08 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Tue, Oct 29, 2013 at 09:06:52PM -0700, Eric Dumazet wrote:
>
> I am not sure 64K GRO packets that must be segmented are going to be
> faster than 22K packets without segmentation at all (TSO path on xmit)

Indeed that is a tough call, but I think conceptually the 64K
case is much nicer than a sysctl that gets magically turned off.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v2 net-next] net: introduce gro_frag_list_enable sysctl
  2013-10-30  4:08                   ` Herbert Xu
@ 2013-10-30  4:09                     ` Herbert Xu
  2013-10-30  4:15                       ` Jerry Chu
  2013-10-30  4:16                     ` Eric Dumazet
  1 sibling, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-10-30  4:09 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Wed, Oct 30, 2013 at 12:08:18PM +0800, Herbert Xu wrote:
> On Tue, Oct 29, 2013 at 09:06:52PM -0700, Eric Dumazet wrote:
> >
> > I am not sure 64K GRO packets that must be segmented are going to be
> > faster than 22K packets without segmentation at all (TSO path on xmit)
> 
> Indeed that is a tough call, but I think conceptually the 64K
> case is much nicer than a sysctl that gets magically turned off.

Also at some point we'll want to do >64K GRO/GSO too so we'll
have to face this complexity one day.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v2 net-next] net: introduce gro_frag_list_enable sysctl
  2013-10-30  4:09                     ` Herbert Xu
@ 2013-10-30  4:15                       ` Jerry Chu
  0 siblings, 0 replies; 163+ messages in thread
From: Jerry Chu @ 2013-10-30  4:15 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Eric Dumazet, David Miller, Christoph Paasch, netdev, Michael Dalton

On Tue, Oct 29, 2013 at 9:09 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Wed, Oct 30, 2013 at 12:08:18PM +0800, Herbert Xu wrote:
>> On Tue, Oct 29, 2013 at 09:06:52PM -0700, Eric Dumazet wrote:
>> >
>> > I am not sure 64K GRO packets that must be segmented are going to be
>> > faster than 22K packets without segmentation at all (TSO path on xmit)
>>
>> Indeed that is a tough call, but I think conceptually the 64K
>> case is much nicer than a sysctl that gets magically turned off.
>
> Also at some point we'll want to do >64K GRO/GSO too so we'll
> have to face this complexity one day.

Not sure how this is possible with the IP datagram length limit. (Am I
missing something, like pkt chaining?)

Jerry

>
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v2 net-next] net: introduce gro_frag_list_enable sysctl
  2013-10-30  4:08                   ` Herbert Xu
  2013-10-30  4:09                     ` Herbert Xu
@ 2013-10-30  4:16                     ` Eric Dumazet
  2013-10-30  4:19                       ` Herbert Xu
  1 sibling, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-10-30  4:16 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Wed, 2013-10-30 at 12:08 +0800, Herbert Xu wrote:
> On Tue, Oct 29, 2013 at 09:06:52PM -0700, Eric Dumazet wrote:
> >
> > I am not sure 64K GRO packets that must be segmented are going to be
> > faster than 22K packets without segmentation at all (TSO path on xmit)
> 
> Indeed that is a tough call, but I think conceptually the 64K
> case is much nicer than a sysctl that gets magically turned off.

The thing is this only matters for hosts receiving at line rate on few
TCP flows.

A router should not build too big GRO packets, as it adds latencies.

Really, we had to make TSO packets being auto sized, lets not add the
syndrome again.

So I do not really understand David concern about emitting a warning.

If a machine is used as a router, building GRO packets of 17 MSS is
absolutely fine.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v2 net-next] net: introduce gro_frag_list_enable sysctl
  2013-10-30  4:16                     ` Eric Dumazet
@ 2013-10-30  4:19                       ` Herbert Xu
  2013-10-30  4:34                         ` Eric Dumazet
  0 siblings, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-10-30  4:19 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Tue, Oct 29, 2013 at 09:16:17PM -0700, Eric Dumazet wrote:
>
> The thing is this only matters for hosts receiving at line rate on few
> TCP flows.
> 
> A router should not build too big GRO packets, as it adds latencies.
> 
> Really, we had to make TSO packets being auto sized, lets not add the
> syndrome again.
> 
> So I do not really understand David concern about emitting a warning.
> 
> If a machine is used as a router, building GRO packets of 17 MSS is
> absolutely fine.

It's not just routers you know, we use the same code on bridges
as part of virtualisation.  So it absolutely does matter.

In fact this is why I wrote GRO in the first place, to make it
work for virtualisation.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v2 net-next] net: introduce gro_frag_list_enable sysctl
  2013-10-30  4:19                       ` Herbert Xu
@ 2013-10-30  4:34                         ` Eric Dumazet
  2013-10-30  4:42                           ` Herbert Xu
  0 siblings, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-10-30  4:34 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Wed, 2013-10-30 at 12:19 +0800, Herbert Xu wrote:
> On Tue, Oct 29, 2013 at 09:16:17PM -0700, Eric Dumazet wrote:
> >
> > The thing is this only matters for hosts receiving at line rate on few
> > TCP flows.
> > 
> > A router should not build too big GRO packets, as it adds latencies.
> > 
> > Really, we had to make TSO packets being auto sized, lets not add the
> > syndrome again.
> > 
> > So I do not really understand David concern about emitting a warning.
> > 
> > If a machine is used as a router, building GRO packets of 17 MSS is
> > absolutely fine.
> 
> It's not just routers you know, we use the same code on bridges
> as part of virtualisation.  So it absolutely does matter.
> 

What matters ?

GRO ?

Or making size of GRO packets not too big, or making them bigger ?

Before my patch, GRO packets were 17 MSS, and nobody complained packets
were too small, so what are you saying exactly ?

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Bug in skb_segment: fskb->len != len
  2013-10-30  4:06             ` Herbert Xu
@ 2013-10-30  4:37               ` Eric Dumazet
  2013-10-30  4:47                 ` Herbert Xu
  0 siblings, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-10-30  4:37 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Christoph Paasch, netdev

On Wed, 2013-10-30 at 12:06 +0800, Herbert Xu wrote:

> In that case we should be able to segment these full-sized skbs
> without linearising.  What we should do is iterate through
> each frag_list entry and apply the usual GSO code to them.

Well, if you really want to be smart, we could build GSO packets of ~16
MSS, as most NIC support TSO.

So a 64K packet containing 44 MSS could be split into 3 TSO packets.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v2 net-next] net: introduce gro_frag_list_enable sysctl
  2013-10-30  4:34                         ` Eric Dumazet
@ 2013-10-30  4:42                           ` Herbert Xu
  2013-10-30 17:39                             ` Jerry Chu
  0 siblings, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-10-30  4:42 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Tue, Oct 29, 2013 at 09:34:41PM -0700, Eric Dumazet wrote:
>
> What matters ?
> 
> GRO ?

What matters is that you should not treat the forwarding case
separately from the host case.

For virtualisation the host case looks exactly like the forwarding
case.

IOW, if having a 64KB packet matters for the host, then it matters
for forwarding as well.

> Before my patch, GRO packets were 17 MSS, and nobody complained packets
> were too small, so what are you saying exactly ?

I'm not criticsing your mega-GRO patch at all.  That one is great
and means that we'll get aggregated packets up to 64K.  What we need
to do is just to patch up the GSO code so that it can handle these
mega-packets properly.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Bug in skb_segment: fskb->len != len
  2013-10-30  4:37               ` Eric Dumazet
@ 2013-10-30  4:47                 ` Herbert Xu
  0 siblings, 0 replies; 163+ messages in thread
From: Herbert Xu @ 2013-10-30  4:47 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Christoph Paasch, netdev

On Tue, Oct 29, 2013 at 09:37:38PM -0700, Eric Dumazet wrote:
>
> Well, if you really want to be smart, we could build GSO packets of ~16
> MSS, as most NIC support TSO.
> 
> So a 64K packet containing 44 MSS could be split into 3 TSO packets.

Yes that would be nice for sure.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v2 net-next] net: introduce gro_frag_list_enable sysctl
  2013-10-30  4:42                           ` Herbert Xu
@ 2013-10-30 17:39                             ` Jerry Chu
  2013-10-30 18:09                               ` Vlad Yasevich
  2013-10-30 19:12                               ` David Miller
  0 siblings, 2 replies; 163+ messages in thread
From: Jerry Chu @ 2013-10-30 17:39 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Eric Dumazet, David Miller, Christoph Paasch, netdev, Michael Dalton

On Tue, Oct 29, 2013 at 9:42 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Tue, Oct 29, 2013 at 09:34:41PM -0700, Eric Dumazet wrote:
> >
> > What matters ?
> >
> > GRO ?
>
> What matters is that you should not treat the forwarding case
> separately from the host case.
>
> For virtualisation the host case looks exactly like the forwarding
> case.


Not sure I agree - there are two different "forwarding" cases - forwarding
to another physical NIC (to go out to the wire hence need to do GSO),
and (for virtualization) forwarding to a virtual NIC and consumed internally
(e.g., VM). For the latter we should strive to push GSO pkts all the way
to the VM stack w/o breaking them up. So for virtualization GRO is all
goodness but not sure about the regular forwarding path. (From the
perf perspective it boils down to if the cost of GSO/GRO will offset
the benefit of GRO. Sure if one manages to get the cost close to zero
than there is not reason to leave GRO always on. But it's still a big if for
now.)

Best,

Jerry

>
>
> IOW, if having a 64KB packet matters for the host, then it matters
> for forwarding as well.
>
> > Before my patch, GRO packets were 17 MSS, and nobody complained packets
> > were too small, so what are you saying exactly ?
>
> I'm not criticsing your mega-GRO patch at all.  That one is great
> and means that we'll get aggregated packets up to 64K.  What we need
> to do is just to patch up the GSO code so that it can handle these
> mega-packets properly.
>
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v2 net-next] net: introduce gro_frag_list_enable sysctl
  2013-10-30 17:39                             ` Jerry Chu
@ 2013-10-30 18:09                               ` Vlad Yasevich
  2013-10-30 19:12                               ` David Miller
  1 sibling, 0 replies; 163+ messages in thread
From: Vlad Yasevich @ 2013-10-30 18:09 UTC (permalink / raw)
  To: Jerry Chu, Herbert Xu
  Cc: Eric Dumazet, David Miller, Christoph Paasch, netdev, Michael Dalton

On 10/30/2013 01:39 PM, Jerry Chu wrote:
> On Tue, Oct 29, 2013 at 9:42 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>>
>> On Tue, Oct 29, 2013 at 09:34:41PM -0700, Eric Dumazet wrote:
>>>
>>> What matters ?
>>>
>>> GRO ?
>>
>> What matters is that you should not treat the forwarding case
>> separately from the host case.
>>
>> For virtualisation the host case looks exactly like the forwarding
>> case.
>
>
> Not sure I agree - there are two different "forwarding" cases - forwarding
> to another physical NIC (to go out to the wire hence need to do GSO),
> and (for virtualization) forwarding to a virtual NIC and consumed internally
> (e.g., VM).

I don't think you can really differentiate these 2 case.  VM are
very commonly used as routers/forwarders.  In some cases, to get
better throughput,  VFs are assigned to the VMs as the externally
facing ports.  So, you still end up forwarding to another physical
NIC.

-vlad

> For the latter we should strive to push GSO pkts all the way
> to the VM stack w/o breaking them up. So for virtualization GRO is all
> goodness but not sure about the regular forwarding path. (From the
> perf perspective it boils down to if the cost of GSO/GRO will offset
> the benefit of GRO. Sure if one manages to get the cost close to zero
> than there is not reason to leave GRO always on. But it's still a big if for
> now.)
>
> Best,
>
> Jerry
>
>>
>>
>> IOW, if having a 64KB packet matters for the host, then it matters
>> for forwarding as well.
>>
>>> Before my patch, GRO packets were 17 MSS, and nobody complained packets
>>> were too small, so what are you saying exactly ?
>>
>> I'm not criticsing your mega-GRO patch at all.  That one is great
>> and means that we'll get aggregated packets up to 64K.  What we need
>> to do is just to patch up the GSO code so that it can handle these
>> mega-packets properly.
>>
>> Cheers,
>> --
>> Email: Herbert Xu <herbert@gondor.apana.org.au>
>> Home Page: http://gondor.apana.org.au/~herbert/
>> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v2 net-next] net: introduce gro_frag_list_enable sysctl
  2013-10-30 17:39                             ` Jerry Chu
  2013-10-30 18:09                               ` Vlad Yasevich
@ 2013-10-30 19:12                               ` David Miller
  1 sibling, 0 replies; 163+ messages in thread
From: David Miller @ 2013-10-30 19:12 UTC (permalink / raw)
  To: hkchu; +Cc: herbert, eric.dumazet, christoph.paasch, netdev, mwdalton

From: Jerry Chu <hkchu@google.com>
Date: Wed, 30 Oct 2013 10:39:00 -0700

> Not sure I agree - there are two different "forwarding" cases - forwarding
> to another physical NIC (to go out to the wire hence need to do GSO),
> and (for virtualization) forwarding to a virtual NIC and consumed internally
> (e.g., VM). For the latter we should strive to push GSO pkts all the way
> to the VM stack w/o breaking them up. So for virtualization GRO is all
> goodness but not sure about the regular forwarding path. (From the
> perf perspective it boils down to if the cost of GSO/GRO will offset
> the benefit of GRO. Sure if one manages to get the cost close to zero
> than there is not reason to leave GRO always on. But it's still a big if for
> now.)

More precisely, for the regular forwarding path it only needs to be
cheaper than N * cost routing lookup, where N is the number of real
packets contained in the GRO SKB.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v2 net-next] net: introduce gro_frag_list_enable sysctl
  2013-10-30  2:05                 ` Herbert Xu
  2013-10-30  2:13                   ` Jerry Chu
@ 2013-10-30 19:39                   ` Ben Hutchings
  2013-10-30 19:53                     ` Eric Dumazet
  1 sibling, 1 reply; 163+ messages in thread
From: Ben Hutchings @ 2013-10-30 19:39 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, eric.dumazet, christoph.paasch, netdev, hkchu, mwdalton

On Wed, 2013-10-30 at 10:05 +0800, Herbert Xu wrote:
> On Tue, Oct 29, 2013 at 10:02:53PM -0400, David Miller wrote:
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Tue, 29 Oct 2013 17:53:48 -0700
> > 
> > > So should we apply the first fix to avoid the BUG_ON() ?
> > 
> > Please be more specific, are you talking about splitting up
> > this patch in some way?
> 
> I think Eric is referring to the patch that removes the BUG_ON
> in skb_segment and deals with the new mega-GRO packets.
> 
> I think that's fine for stable, but for the long term we should
> fix it properly as these new meag-GRO packets still retain the
> existing packet boundaries and are trivially segmentable.
> 
> If we are indeed able to do that, I doubt we would even need
> the sysctl patch since the GRO performance should be vastly
> superior to the non-GRO case, even for a router/bridge.

I think the change to enable mega-GRO packets should be reverted and
that change should go to stable so that the performance regression for
forwarding is also fixed.

Then it can be re-enabled, with the additional check that there are no
forwarders, in net-next.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v2 net-next] net: introduce gro_frag_list_enable sysctl
  2013-10-30 19:39                   ` Ben Hutchings
@ 2013-10-30 19:53                     ` Eric Dumazet
  2013-10-30 20:05                       ` Ben Hutchings
  0 siblings, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-10-30 19:53 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Herbert Xu, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Wed, 2013-10-30 at 19:39 +0000, Ben Hutchings wrote:

> I think the change to enable mega-GRO packets should be reverted and
> that change should go to stable so that the performance regression for
> forwarding is also fixed.

What are you talking about ?

The change is in net-next only.

It seems a lot of people talk, but few people try to help.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v2 net-next] net: introduce gro_frag_list_enable sysctl
  2013-10-30 19:53                     ` Eric Dumazet
@ 2013-10-30 20:05                       ` Ben Hutchings
  2013-10-30 20:12                         ` Eric Dumazet
  0 siblings, 1 reply; 163+ messages in thread
From: Ben Hutchings @ 2013-10-30 20:05 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Herbert Xu, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Wed, 2013-10-30 at 12:53 -0700, Eric Dumazet wrote:
> On Wed, 2013-10-30 at 19:39 +0000, Ben Hutchings wrote:
> 
> > I think the change to enable mega-GRO packets should be reverted and
> > that change should go to stable so that the performance regression for
> > forwarding is also fixed.
> 
> What are you talking about ?
> 
> The change is in net-next only.
> 
> It seems a lot of people talk, but few people try to help.

Sorry :-/

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v2 net-next] net: introduce gro_frag_list_enable sysctl
  2013-10-30 20:05                       ` Ben Hutchings
@ 2013-10-30 20:12                         ` Eric Dumazet
  0 siblings, 0 replies; 163+ messages in thread
From: Eric Dumazet @ 2013-10-30 20:12 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Herbert Xu, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Wed, 2013-10-30 at 20:05 +0000, Ben Hutchings wrote:
> On Wed, 2013-10-30 at 12:53 -0700, Eric Dumazet wrote:
> > On Wed, 2013-10-30 at 19:39 +0000, Ben Hutchings wrote:
> > 
> > > I think the change to enable mega-GRO packets should be reverted and
> > > that change should go to stable so that the performance regression for
> > > forwarding is also fixed.
> > 
> > What are you talking about ?
> > 
> > The change is in net-next only.
> > 
> > It seems a lot of people talk, but few people try to help.
> 
> Sorry :-/
> 

Sorry, my last sentence was harsh.

I meant that patch was sent October 8th, and bug discovered only 2 days
ago. I sent two patches to fix the bug, and will send other patches as
well, so asking for a revert is I think premature, give me some time
to make forward progress ;)

Thanks !

^ permalink raw reply	[flat|nested] 163+ messages in thread

* [PATCH v3 net-next] net: introduce dev_set_forwarding()
  2013-10-30  0:06             ` Ben Hutchings
@ 2013-11-02 14:01               ` Eric Dumazet
  2013-11-02 15:46                 ` Ben Hutchings
                                   ` (2 more replies)
  0 siblings, 3 replies; 163+ messages in thread
From: Eric Dumazet @ 2013-11-02 14:01 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: David Miller, christoph.paasch, herbert, netdev, hkchu, mwdalton

From: Eric Dumazet <edumazet@google.com>

Christoph Paasch and Jerry Chu reported crashes in skb_segment() caused
by commit 8a29111c7ca6 ("net: gro: allow to build full sized skb")

skb_segment() only deals with a frag_list chain containing MSS sized
fragments. Even if we fix this problem, its better if GRO layer
doesn't build skb with a frag_list in the first place, to let TSO
packets reaching output devices.
 
David Miller and Ben Hutchings suggested we keep track of number of
forwarding users to be able to :

- Disable LRO
- Make sure GRO layer do not use skb frag_list to extend skb capacity

Note that after this patch, LRO is automatically re-enabled if
forwarding is disabled on the device, or if a device is removed
from a bridge.

Tested:

lpq84:~# ethtool -k eth0 | grep "large-receive"
large-receive-offload: on
lpq84:~# echo 1 >/proc/sys/net/ipv4/conf/eth0/forwarding
lpq84:~# ethtool -k eth0 | grep "large-receive"
large-receive-offload: off
lpq84:~# echo 0 >/proc/sys/net/ipv4/conf/eth0/forwarding
lpq84:~# ethtool -k eth0 | grep "large-receive"
large-receive-offload: on

lpq84:~# cat /proc/sys/net/ipv4/ip_forward
0
lpq84:~# echo 1 >/proc/sys/net/ipv4/ip_forward
lpq84:~# ethtool -k eth0 | grep "large-receive"
large-receive-offload: off
lpq84:~# echo 0 >/proc/sys/net/ipv4/ip_forward
lpq84:~# ethtool -k eth0 | grep "large-receive"
large-receive-offload: on


Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Reported-by: Jerry Chu <hkchu@google.com>
Cc: Michael Dalton <mwdalton@google.com>
Fixes: 8a29111c7ca6 ("net: gro: allow to build full sized skb")
---
 include/linux/netdevice.h |    3 ++-
 net/bridge/br_if.c        |    4 +++-
 net/core/dev.c            |   30 +++++++++++++++++++-----------
 net/core/skbuff.c         |   11 ++++++++---
 net/ipv4/devinet.c        |   14 ++++++++------
 net/ipv6/addrconf.c       |    5 ++---
 net/ipv6/addrconf_core.c  |    2 ++
 7 files changed, 44 insertions(+), 25 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cb1d918ecdf1..6ddd0fa85ae2 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1295,6 +1295,7 @@ struct net_device {
 
 	struct netdev_queue __rcu *ingress_queue;
 	unsigned char		broadcast[MAX_ADDR_LEN];	/* hw bcast add	*/
+	unsigned int		forwarding_count;
 
 
 /*
@@ -1787,7 +1788,7 @@ struct net_device *__dev_get_by_name(struct net *net, const char *name);
 int dev_alloc_name(struct net_device *dev, const char *name);
 int dev_open(struct net_device *dev);
 int dev_close(struct net_device *dev);
-void dev_disable_lro(struct net_device *dev);
+void dev_set_forwarding(struct net_device *dev, int inc);
 int dev_loopback_xmit(struct sk_buff *newskb);
 int dev_queue_xmit(struct sk_buff *skb);
 int register_netdevice(struct net_device *dev);
diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index c41d5fbb91d0..959164374ce8 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -150,6 +150,8 @@ static void del_nbp(struct net_bridge_port *p)
 
 	netdev_rx_handler_unregister(dev);
 
+	dev_set_forwarding(dev, -1);
+
 	netdev_upper_dev_unlink(dev, br->dev);
 
 	br_multicast_del_port(p);
@@ -377,7 +379,7 @@ int br_add_if(struct net_bridge *br, struct net_device *dev)
 
 	dev->priv_flags |= IFF_BRIDGE_PORT;
 
-	dev_disable_lro(dev);
+	dev_set_forwarding(dev, 1);
 
 	list_add_rcu(&p->list, &br->port_list);
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 0054c8c75f50..d1276ea6baf7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1408,29 +1408,37 @@ EXPORT_SYMBOL(dev_close);
 
 
 /**
- *	dev_disable_lro - disable Large Receive Offload on a device
- *	@dev: device
- *
- *	Disable Large Receive Offload (LRO) on a net device.  Must be
- *	called under RTNL.  This is needed if received packets may be
- *	forwarded to another interface.
+ * dev_set_forwarding - Keep count of forwarding users for a device
+ * @dev: device
+ * @inc: +1 or -1
+ *
+ * Add or remove forwarding from a device.
+ * When the count of forwarding users is above 0 :
+ * 1) disable LRO (Large Receive Offload) on this device.
+ * 2) instruct GRO layer to not use frag_list to extend skb capacity.
+ * Must be called under RTNL.
+ * This is needed if received packets may be forwarded to another interface.
  */
-void dev_disable_lro(struct net_device *dev)
+void dev_set_forwarding(struct net_device *dev, int inc)
 {
 	/*
-	 * If we're trying to disable lro on a vlan device
+	 * If we're trying to enable forwarding from a vlan device
 	 * use the underlying physical device instead
 	 */
 	if (is_vlan_dev(dev))
 		dev = vlan_dev_real_dev(dev);
 
-	dev->wanted_features &= ~NETIF_F_LRO;
+	dev->forwarding_count += inc;
+	if (dev->forwarding_count)
+		dev->wanted_features &= ~NETIF_F_LRO;
+	else
+		dev->wanted_features |= NETIF_F_LRO;
 	netdev_update_features(dev);
 
-	if (unlikely(dev->features & NETIF_F_LRO))
+	if (unlikely(dev->forwarding_count && (dev->features & NETIF_F_LRO)))
 		netdev_WARN(dev, "failed to disable LRO!\n");
 }
-EXPORT_SYMBOL(dev_disable_lro);
+EXPORT_SYMBOL(dev_set_forwarding);
 
 static int call_netdevice_notifier(struct notifier_block *nb, unsigned long val,
 				   struct net_device *dev)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0ab32faa520f..7b1cff884d50 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2944,9 +2944,11 @@ int skb_gro_receive(struct sk_buff **head, struct sk_buff *skb)
 		int i = skbinfo->nr_frags;
 		int nr_frags = pinfo->nr_frags + i;
 
-		if (nr_frags > MAX_SKB_FRAGS)
+		if (nr_frags > MAX_SKB_FRAGS) {
+			if (skb->dev->forwarding_count)
+				return -E2BIG;
 			goto merge;
-
+		}
 		offset -= headlen;
 		pinfo->nr_frags = nr_frags;
 		skbinfo->nr_frags = 0;
@@ -2977,8 +2979,11 @@ int skb_gro_receive(struct sk_buff **head, struct sk_buff *skb)
 		unsigned int first_size = headlen - offset;
 		unsigned int first_offset;
 
-		if (nr_frags + 1 + skbinfo->nr_frags > MAX_SKB_FRAGS)
+		if (nr_frags + 1 + skbinfo->nr_frags > MAX_SKB_FRAGS) {
+			if (skb->dev->forwarding_count)
+				return -E2BIG;
 			goto merge;
+		}
 
 		first_offset = skb->data -
 			       (unsigned char *)page_address(page) +
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index a1b5bcbd04ae..cf01e6fb77c5 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -223,6 +223,8 @@ void in_dev_finish_destroy(struct in_device *idev)
 #ifdef NET_REFCNT_DEBUG
 	pr_debug("%s: %p=%s\n", __func__, idev, dev ? dev->name : "NIL");
 #endif
+	if (IPV4_DEVCONF(idev->cnf, FORWARDING))
+		dev_set_forwarding(dev, -1);
 	dev_put(dev);
 	if (!idev->dead)
 		pr_err("Freeing alive in_device %p\n", idev);
@@ -248,7 +250,7 @@ static struct in_device *inetdev_init(struct net_device *dev)
 	if (!in_dev->arp_parms)
 		goto out_kfree;
 	if (IPV4_DEVCONF(in_dev->cnf, FORWARDING))
-		dev_disable_lro(dev);
+		dev_set_forwarding(dev, 1);
 	/* Reference in_dev->dev */
 	dev_hold(dev);
 	/* Account for reference dev->ip_ptr (below) */
@@ -1932,8 +1934,8 @@ static void inet_forward_change(struct net *net)
 
 	for_each_netdev(net, dev) {
 		struct in_device *in_dev;
-		if (on)
-			dev_disable_lro(dev);
+
+		dev_set_forwarding(dev, on ? 1 : -1);
 		rcu_read_lock();
 		in_dev = __in_dev_get_rcu(dev);
 		if (in_dev) {
@@ -1997,7 +1999,7 @@ static int devinet_sysctl_forward(struct ctl_table *ctl, int write,
 	loff_t pos = *ppos;
 	int ret = proc_dointvec(ctl, write, buffer, lenp, ppos);
 
-	if (write && *valp != val) {
+	if (write && (!(*valp) != !val)) {
 		struct net *net = ctl->extra2;
 
 		if (valp != &IPV4_DEVCONF_DFLT(net, FORWARDING)) {
@@ -2013,8 +2015,8 @@ static int devinet_sysctl_forward(struct ctl_table *ctl, int write,
 				struct ipv4_devconf *cnf = ctl->extra1;
 				struct in_device *idev =
 					container_of(cnf, struct in_device, cnf);
-				if (*valp)
-					dev_disable_lro(idev->dev);
+
+				dev_set_forwarding(idev->dev, *valp ? 1 : -1);
 				inet_netconf_notify_devconf(net,
 							    NETCONFA_FORWARDING,
 							    idev->dev->ifindex,
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 542d09561ed6..5b8406c6c868 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -322,7 +322,7 @@ static struct inet6_dev *ipv6_add_dev(struct net_device *dev)
 		return NULL;
 	}
 	if (ndev->cnf.forwarding)
-		dev_disable_lro(dev);
+		dev_set_forwarding(dev, 1);
 	/* We refer to the device */
 	dev_hold(dev);
 
@@ -638,8 +638,7 @@ static void dev_forward_change(struct inet6_dev *idev)
 	if (!idev)
 		return;
 	dev = idev->dev;
-	if (idev->cnf.forwarding)
-		dev_disable_lro(dev);
+	dev_set_forwarding(dev, idev->cnf.forwarding ? 1 : -1);
 	if (dev->flags & IFF_MULTICAST) {
 		if (idev->cnf.forwarding) {
 			ipv6_dev_mc_inc(dev, &in6addr_linklocal_allrouters);
diff --git a/net/ipv6/addrconf_core.c b/net/ipv6/addrconf_core.c
index 4c11cbcf8308..b7620ad8366f 100644
--- a/net/ipv6/addrconf_core.c
+++ b/net/ipv6/addrconf_core.c
@@ -139,6 +139,8 @@ void in6_dev_finish_destroy(struct inet6_dev *idev)
 #ifdef NET_REFCNT_DEBUG
 	pr_debug("%s: %s\n", __func__, dev ? dev->name : "NIL");
 #endif
+	if (idev->cnf.forwarding)
+		dev_set_forwarding(dev, -1);
 	dev_put(dev);
 	if (!idev->dead) {
 		pr_warn("Freeing alive inet6 device %p\n", idev);

^ permalink raw reply related	[flat|nested] 163+ messages in thread

* Re: [PATCH v3 net-next] net: introduce dev_set_forwarding()
  2013-11-02 14:01               ` [PATCH v3 net-next] net: introduce dev_set_forwarding() Eric Dumazet
@ 2013-11-02 15:46                 ` Ben Hutchings
  2013-11-02 18:20                   ` Eric Dumazet
  2013-11-02 19:58                 ` [PATCH v4 " Eric Dumazet
  2013-11-03 12:28                 ` [PATCH v3 " Herbert Xu
  2 siblings, 1 reply; 163+ messages in thread
From: Ben Hutchings @ 2013-11-02 15:46 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, christoph.paasch, herbert, netdev, hkchu, mwdalton

On Sat, 2013-11-02 at 07:01 -0700, Eric Dumazet wrote:
[...]
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
[...]
> -void dev_disable_lro(struct net_device *dev)
> +void dev_set_forwarding(struct net_device *dev, int inc)
>  {
>  	/*
> -	 * If we're trying to disable lro on a vlan device
> +	 * If we're trying to enable forwarding from a vlan device
>  	 * use the underlying physical device instead
>  	 */
>  	if (is_vlan_dev(dev))
>  		dev = vlan_dev_real_dev(dev);
>  
> -	dev->wanted_features &= ~NETIF_F_LRO;
> +	dev->forwarding_count += inc;
> +	if (dev->forwarding_count)
> +		dev->wanted_features &= ~NETIF_F_LRO;
> +	else
> +		dev->wanted_features |= NETIF_F_LRO;
>  	netdev_update_features(dev);
[...]

We should not change dev->wanted_features any more.  It should only be
set as requested by userland.

Instead, netdev_update_features() should mask out NETIF_F_LRO based on
dev->forwarding_count.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v3 net-next] net: introduce dev_set_forwarding()
  2013-11-02 15:46                 ` Ben Hutchings
@ 2013-11-02 18:20                   ` Eric Dumazet
  0 siblings, 0 replies; 163+ messages in thread
From: Eric Dumazet @ 2013-11-02 18:20 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: David Miller, christoph.paasch, herbert, netdev, hkchu, mwdalton

On Sat, 2013-11-02 at 15:46 +0000, Ben Hutchings wrote:

> We should not change dev->wanted_features any more.  It should only be
> set as requested by userland.
> 
> Instead, netdev_update_features() should mask out NETIF_F_LRO based on
> dev->forwarding_count.

Oh right, I'll send a V4 to make the right thing in
netdev_fix_features()

Thanks !

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v2 net-next] net: introduce gro_frag_list_enable sysctl
       [not found]                       ` <44571383414236@web13j.yandex.ru>
@ 2013-11-02 18:28                         ` Eric Dumazet
  2013-11-03 23:19                         ` David Miller
  1 sibling, 0 replies; 163+ messages in thread
From: Eric Dumazet @ 2013-11-02 18:28 UTC (permalink / raw)
  To: Oleg A.Arkhangelsky
  Cc: David Miller, hkchu, herbert, christoph.paasch, netdev, mwdalton

On Sat, 2013-11-02 at 21:43 +0400, Oleg A.Arkhangelsky wrote:
> 
> 30.10.2013, 06:33, "David Miller" <davem@davemloft.net>:
> 
> > GRO should always win, even on a router, because it decreases the
> > number of fundamental operations (routing lookups) that the stack
> > needs to perform.
> 
> Yes, unless the case when you're using Linux as IP router which is
> forwarding 500K mixed IP (TCP and UDP) flows traffic @ 10-20 Gbit/s.
> Then GRO is unnecessarily overhead, cause there's no possibility to
> accumulate adequate GRO list in such scenario.
> 

This was assuming the aggregation factor was not 0.

Even with 500K flows, you can have typically bursts of two packets per
flow.

Thats one of the reason why I chose to default
/proc/sys/net/ipv4/tcp_min_tso_segs to 2

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v4 net-next] net: introduce dev_set_forwarding()
  2013-11-02 14:01               ` [PATCH v3 net-next] net: introduce dev_set_forwarding() Eric Dumazet
  2013-11-02 15:46                 ` Ben Hutchings
@ 2013-11-02 19:58                 ` Eric Dumazet
  2013-11-03 17:18                   ` Christoph Paasch
                                     ` (2 more replies)
  2013-11-03 12:28                 ` [PATCH v3 " Herbert Xu
  2 siblings, 3 replies; 163+ messages in thread
From: Eric Dumazet @ 2013-11-02 19:58 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: David Miller, christoph.paasch, herbert, netdev, hkchu, mwdalton

From: Eric Dumazet <edumazet@google.com>

Christoph Paasch and Jerry Chu reported crashes in skb_segment() caused
by commit 8a29111c7ca6 ("net: gro: allow to build full sized skb")

skb_segment() only deals with a frag_list chain containing MSS sized
fragments. Even if we fix this problem, its better if GRO layer
doesn't build skb with a frag_list in the first place, to let TSO
packets reaching output devices.
 
David Miller and Ben Hutchings suggested we keep track of number of
forwarding users to be able to :

- Disable LRO
- Make sure GRO layer do not use skb frag_list to extend skb capacity

Note that after this patch, LRO is automatically re-enabled if
forwarding is disabled on the device, or if a device is removed
from a bridge.

Tested:

lpq84:~# ethtool -k eth0 | grep "large-receive"
large-receive-offload: on
lpq84:~# echo 1 >/proc/sys/net/ipv4/conf/eth0/forwarding
lpq84:~# ethtool -k eth0 | grep "large-receive"
large-receive-offload: off [requested on]
lpq84:~# echo 0 >/proc/sys/net/ipv4/conf/eth0/forwarding
lpq84:~# ethtool -k eth0 | grep "large-receive"
large-receive-offload: on


lpq84:~# ethtool -K eth0 lro off
lpq84:~# ethtool -k eth0 | grep "large-receive"
large-receive-offload: off
lpq84:~# echo 1 >/proc/sys/net/ipv4/conf/eth0/forwarding
lpq84:~# ethtool -k eth0 | grep "large-receive"
large-receive-offload: off
lpq84:~# echo 0 >/proc/sys/net/ipv4/conf/eth0/forwarding
lpq84:~# ethtool -k eth0 | grep "large-receive"
large-receive-offload: off
lpq84:~# ethtool -K eth0 lro on 


lpq84:~# cat /proc/sys/net/ipv4/ip_forward
0
lpq84:~# echo 1 >/proc/sys/net/ipv4/ip_forward
lpq84:~# ethtool -k eth0 | grep "large-receive"
large-receive-offload: off [requested on]
lpq84:~# echo 0 >/proc/sys/net/ipv4/ip_forward
lpq84:~# ethtool -k eth0 | grep "large-receive"
large-receive-offload: on


Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Reported-by: Jerry Chu <hkchu@google.com>
Cc: Michael Dalton <mwdalton@google.com>
Fixes: 8a29111c7ca6 ("net: gro: allow to build full sized skb")
---
v4: drop LRO in netdev_fix_features(), ase Ben pointed out.

 include/linux/netdevice.h |    3 ++-
 net/bridge/br_if.c        |    4 +++-
 net/core/dev.c            |   31 ++++++++++++++++++++-----------
 net/core/skbuff.c         |   11 ++++++++---
 net/ipv4/devinet.c        |   14 ++++++++------
 net/ipv6/addrconf.c       |    5 ++---
 net/ipv6/addrconf_core.c  |    2 ++
 7 files changed, 45 insertions(+), 25 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cb1d918..6ddd0fa 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1295,6 +1295,7 @@ struct net_device {
 
 	struct netdev_queue __rcu *ingress_queue;
 	unsigned char		broadcast[MAX_ADDR_LEN];	/* hw bcast add	*/
+	unsigned int		forwarding_count;
 
 
 /*
@@ -1787,7 +1788,7 @@ struct net_device *__dev_get_by_name(struct net *net, const char *name);
 int dev_alloc_name(struct net_device *dev, const char *name);
 int dev_open(struct net_device *dev);
 int dev_close(struct net_device *dev);
-void dev_disable_lro(struct net_device *dev);
+void dev_set_forwarding(struct net_device *dev, int inc);
 int dev_loopback_xmit(struct sk_buff *newskb);
 int dev_queue_xmit(struct sk_buff *skb);
 int register_netdevice(struct net_device *dev);
diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index c41d5fb..9591643 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -150,6 +150,8 @@ static void del_nbp(struct net_bridge_port *p)
 
 	netdev_rx_handler_unregister(dev);
 
+	dev_set_forwarding(dev, -1);
+
 	netdev_upper_dev_unlink(dev, br->dev);
 
 	br_multicast_del_port(p);
@@ -377,7 +379,7 @@ int br_add_if(struct net_bridge *br, struct net_device *dev)
 
 	dev->priv_flags |= IFF_BRIDGE_PORT;
 
-	dev_disable_lro(dev);
+	dev_set_forwarding(dev, 1);
 
 	list_add_rcu(&p->list, &br->port_list);
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 0054c8c..f95bde6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1408,29 +1408,33 @@ EXPORT_SYMBOL(dev_close);
 
 
 /**
- *	dev_disable_lro - disable Large Receive Offload on a device
- *	@dev: device
- *
- *	Disable Large Receive Offload (LRO) on a net device.  Must be
- *	called under RTNL.  This is needed if received packets may be
- *	forwarded to another interface.
+ * dev_set_forwarding - Keep count of forwarding users for a device
+ * @dev: device
+ * @inc: +1 or -1
+ *
+ * Add or remove forwarding from a device.
+ * When the count of forwarding users is above 0 :
+ * 1) disable LRO (Large Receive Offload) on this device.
+ * 2) instruct GRO layer to not use frag_list to extend skb capacity.
+ * Must be called under RTNL.
+ * This is needed if received packets may be forwarded to another interface.
  */
-void dev_disable_lro(struct net_device *dev)
+void dev_set_forwarding(struct net_device *dev, int inc)
 {
 	/*
-	 * If we're trying to disable lro on a vlan device
+	 * If we're trying to enable forwarding from a vlan device
 	 * use the underlying physical device instead
 	 */
 	if (is_vlan_dev(dev))
 		dev = vlan_dev_real_dev(dev);
 
-	dev->wanted_features &= ~NETIF_F_LRO;
+	dev->forwarding_count += inc;
 	netdev_update_features(dev);
 
-	if (unlikely(dev->features & NETIF_F_LRO))
+	if (unlikely(dev->forwarding_count && (dev->features & NETIF_F_LRO)))
 		netdev_WARN(dev, "failed to disable LRO!\n");
 }
-EXPORT_SYMBOL(dev_disable_lro);
+EXPORT_SYMBOL(dev_set_forwarding);
 
 static int call_netdevice_notifier(struct notifier_block *nb, unsigned long val,
 				   struct net_device *dev)
@@ -5584,6 +5588,11 @@ static netdev_features_t netdev_fix_features(struct net_device *dev,
 		}
 	}
 
+	if ((features & NETIF_F_LRO) && dev->forwarding_count) {
+		netdev_dbg(dev, "Dropping LRO because of forwarding.\n");
+		features &= ~NETIF_F_LRO;
+	}
+
 	return features;
 }
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0ab32fa..7b1cff8 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2944,9 +2944,11 @@ int skb_gro_receive(struct sk_buff **head, struct sk_buff *skb)
 		int i = skbinfo->nr_frags;
 		int nr_frags = pinfo->nr_frags + i;
 
-		if (nr_frags > MAX_SKB_FRAGS)
+		if (nr_frags > MAX_SKB_FRAGS) {
+			if (skb->dev->forwarding_count)
+				return -E2BIG;
 			goto merge;
-
+		}
 		offset -= headlen;
 		pinfo->nr_frags = nr_frags;
 		skbinfo->nr_frags = 0;
@@ -2977,8 +2979,11 @@ int skb_gro_receive(struct sk_buff **head, struct sk_buff *skb)
 		unsigned int first_size = headlen - offset;
 		unsigned int first_offset;
 
-		if (nr_frags + 1 + skbinfo->nr_frags > MAX_SKB_FRAGS)
+		if (nr_frags + 1 + skbinfo->nr_frags > MAX_SKB_FRAGS) {
+			if (skb->dev->forwarding_count)
+				return -E2BIG;
 			goto merge;
+		}
 
 		first_offset = skb->data -
 			       (unsigned char *)page_address(page) +
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index a1b5bcb..cf01e6f 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -223,6 +223,8 @@ void in_dev_finish_destroy(struct in_device *idev)
 #ifdef NET_REFCNT_DEBUG
 	pr_debug("%s: %p=%s\n", __func__, idev, dev ? dev->name : "NIL");
 #endif
+	if (IPV4_DEVCONF(idev->cnf, FORWARDING))
+		dev_set_forwarding(dev, -1);
 	dev_put(dev);
 	if (!idev->dead)
 		pr_err("Freeing alive in_device %p\n", idev);
@@ -248,7 +250,7 @@ static struct in_device *inetdev_init(struct net_device *dev)
 	if (!in_dev->arp_parms)
 		goto out_kfree;
 	if (IPV4_DEVCONF(in_dev->cnf, FORWARDING))
-		dev_disable_lro(dev);
+		dev_set_forwarding(dev, 1);
 	/* Reference in_dev->dev */
 	dev_hold(dev);
 	/* Account for reference dev->ip_ptr (below) */
@@ -1932,8 +1934,8 @@ static void inet_forward_change(struct net *net)
 
 	for_each_netdev(net, dev) {
 		struct in_device *in_dev;
-		if (on)
-			dev_disable_lro(dev);
+
+		dev_set_forwarding(dev, on ? 1 : -1);
 		rcu_read_lock();
 		in_dev = __in_dev_get_rcu(dev);
 		if (in_dev) {
@@ -1997,7 +1999,7 @@ static int devinet_sysctl_forward(struct ctl_table *ctl, int write,
 	loff_t pos = *ppos;
 	int ret = proc_dointvec(ctl, write, buffer, lenp, ppos);
 
-	if (write && *valp != val) {
+	if (write && (!(*valp) != !val)) {
 		struct net *net = ctl->extra2;
 
 		if (valp != &IPV4_DEVCONF_DFLT(net, FORWARDING)) {
@@ -2013,8 +2015,8 @@ static int devinet_sysctl_forward(struct ctl_table *ctl, int write,
 				struct ipv4_devconf *cnf = ctl->extra1;
 				struct in_device *idev =
 					container_of(cnf, struct in_device, cnf);
-				if (*valp)
-					dev_disable_lro(idev->dev);
+
+				dev_set_forwarding(idev->dev, *valp ? 1 : -1);
 				inet_netconf_notify_devconf(net,
 							    NETCONFA_FORWARDING,
 							    idev->dev->ifindex,
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 542d095..232b2c4 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -322,7 +322,7 @@ static struct inet6_dev *ipv6_add_dev(struct net_device *dev)
 		return NULL;
 	}
 	if (ndev->cnf.forwarding)
-		dev_disable_lro(dev);
+		dev_set_forwarding(dev, 1);
 	/* We refer to the device */
 	dev_hold(dev);
 
@@ -638,8 +638,7 @@ static void dev_forward_change(struct inet6_dev *idev)
 	if (!idev)
 		return;
 	dev = idev->dev;
-	if (idev->cnf.forwarding)
-		dev_disable_lro(dev);
+	dev_set_forwarding(dev, idev->cnf.forwarding ? 1 : -1);
 	if (dev->flags & IFF_MULTICAST) {
 		if (idev->cnf.forwarding) {
 			ipv6_dev_mc_inc(dev, &in6addr_linklocal_allrouters);
diff --git a/net/ipv6/addrconf_core.c b/net/ipv6/addrconf_core.c
index 4c11cbc..b7620ad 100644
--- a/net/ipv6/addrconf_core.c
+++ b/net/ipv6/addrconf_core.c
@@ -139,6 +139,8 @@ void in6_dev_finish_destroy(struct inet6_dev *idev)
 #ifdef NET_REFCNT_DEBUG
 	pr_debug("%s: %s\n", __func__, dev ? dev->name : "NIL");
 #endif
+	if (idev->cnf.forwarding)
+		dev_set_forwarding(dev, -1);
 	dev_put(dev);
 	if (!idev->dead) {
 		pr_warn("Freeing alive inet6 device %p\n", idev);

^ permalink raw reply related	[flat|nested] 163+ messages in thread

* Re: [PATCH v3 net-next] net: introduce dev_set_forwarding()
  2013-11-02 14:01               ` [PATCH v3 net-next] net: introduce dev_set_forwarding() Eric Dumazet
  2013-11-02 15:46                 ` Ben Hutchings
  2013-11-02 19:58                 ` [PATCH v4 " Eric Dumazet
@ 2013-11-03 12:28                 ` Herbert Xu
  2013-11-03 16:28                   ` Eric Dumazet
  2 siblings, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-11-03 12:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Sat, Nov 02, 2013 at 07:01:37AM -0700, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Christoph Paasch and Jerry Chu reported crashes in skb_segment() caused
> by commit 8a29111c7ca6 ("net: gro: allow to build full sized skb")
> 
> skb_segment() only deals with a frag_list chain containing MSS sized
> fragments. Even if we fix this problem, its better if GRO layer
> doesn't build skb with a frag_list in the first place, to let TSO
> packets reaching output devices.
>  
> David Miller and Ben Hutchings suggested we keep track of number of
> forwarding users to be able to :
> 
> - Disable LRO
> - Make sure GRO layer do not use skb frag_list to extend skb capacity

Why are we still doing this instead of fixing skb_segment to
deal with skb frag_list properly?

LRO is legacy code and we should not be adding similar cruft
to GRO.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v3 net-next] net: introduce dev_set_forwarding()
  2013-11-03 12:28                 ` [PATCH v3 " Herbert Xu
@ 2013-11-03 16:28                   ` Eric Dumazet
  2013-11-03 16:31                     ` Herbert Xu
  2013-11-03 23:23                     ` [PATCH v3 net-next] net: introduce dev_set_forwarding() David Miller
  0 siblings, 2 replies; 163+ messages in thread
From: Eric Dumazet @ 2013-11-03 16:28 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Sun, 2013-11-03 at 20:28 +0800, Herbert Xu wrote:
> On Sat, Nov 02, 2013 at 07:01:37AM -0700, Eric Dumazet wrote:
> > From: Eric Dumazet <edumazet@google.com>
> > 
> > Christoph Paasch and Jerry Chu reported crashes in skb_segment() caused
> > by commit 8a29111c7ca6 ("net: gro: allow to build full sized skb")
> > 
> > skb_segment() only deals with a frag_list chain containing MSS sized
> > fragments. Even if we fix this problem, its better if GRO layer
> > doesn't build skb with a frag_list in the first place, to let TSO
> > packets reaching output devices.
> >  
> > David Miller and Ben Hutchings suggested we keep track of number of
> > forwarding users to be able to :
> > 
> > - Disable LRO
> > - Make sure GRO layer do not use skb frag_list to extend skb capacity
> 
> Why are we still doing this instead of fixing skb_segment to
> deal with skb frag_list properly?
> 
> LRO is legacy code and we should not be adding similar cruft
> to GRO.

1) Because we should not call skb_segment() at all on a router ?

2) If you aggregate too much on a router, you increase latencies,
   or if you prefer the RTT of TCP flows.

3) Because skb_segment() layer only builds MSS sized skb, so this
   remove TSO ability on output path

Splitting a 45 MSS packet into 3 TSO packets (16 + 16 + X MSS) is going
to be quite complex, given the gso_segment() stuff is meant to segment
in MSS packets. Adding complexity in this already complex stuff is
simply not worth it.

For local TCP, its different, because if you receive such high
throughput, ability to build full size GRO packet helps to reduce number
of ACK segments and number of SKB in receive queue (or OFO queue),
without impacting ACK clocking and TCP dynamics.

And even if a router does not do this aggregation, the final receiver
will do.

So in conclusion, GRO is like TSO : Its not because they are able to use
64KB skbs you always _have_ to fill skb to max capacity.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v3 net-next] net: introduce dev_set_forwarding()
  2013-11-03 16:28                   ` Eric Dumazet
@ 2013-11-03 16:31                     ` Herbert Xu
  2013-11-03 17:26                       ` Eric Dumazet
  2013-11-03 23:23                     ` [PATCH v3 net-next] net: introduce dev_set_forwarding() David Miller
  1 sibling, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-11-03 16:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Sun, Nov 03, 2013 at 08:28:24AM -0800, Eric Dumazet wrote:
>
> 1) Because we should not call skb_segment() at all on a router ?
> 
> 2) If you aggregate too much on a router, you increase latencies,
>    or if you prefer the RTT of TCP flows.
> 
> 3) Because skb_segment() layer only builds MSS sized skb, so this
>    remove TSO ability on output path
> 
> Splitting a 45 MSS packet into 3 TSO packets (16 + 16 + X MSS) is going
> to be quite complex, given the gso_segment() stuff is meant to segment
> in MSS packets. Adding complexity in this already complex stuff is
> simply not worth it.
> 
> For local TCP, its different, because if you receive such high
> throughput, ability to build full size GRO packet helps to reduce number
> of ACK segments and number of SKB in receive queue (or OFO queue),
> without impacting ACK clocking and TCP dynamics.
> 
> And even if a router does not do this aggregation, the final receiver
> will do.
> 
> So in conclusion, GRO is like TSO : Its not because they are able to use
> 64KB skbs you always _have_ to fill skb to max capacity.

I give up.  It's as if you've ignored everything I've said before.
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v4 net-next] net: introduce dev_set_forwarding()
  2013-11-02 19:58                 ` [PATCH v4 " Eric Dumazet
@ 2013-11-03 17:18                   ` Christoph Paasch
  2013-11-04 16:55                   ` Ben Hutchings
  2013-11-21 18:29                   ` David Miller
  2 siblings, 0 replies; 163+ messages in thread
From: Christoph Paasch @ 2013-11-03 17:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, herbert, netdev, hkchu, mwdalton

On 02/11/13 - 12:58:50, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Christoph Paasch and Jerry Chu reported crashes in skb_segment() caused
> by commit 8a29111c7ca6 ("net: gro: allow to build full sized skb")
> 
> skb_segment() only deals with a frag_list chain containing MSS sized
> fragments. Even if we fix this problem, its better if GRO layer
> doesn't build skb with a frag_list in the first place, to let TSO
> packets reaching output devices.
>  
> David Miller and Ben Hutchings suggested we keep track of number of
> forwarding users to be able to :
> 
> - Disable LRO
> - Make sure GRO layer do not use skb frag_list to extend skb capacity
> 
> Note that after this patch, LRO is automatically re-enabled if
> forwarding is disabled on the device, or if a device is removed
> from a bridge.
> 
> Tested:
> 
> lpq84:~# ethtool -k eth0 | grep "large-receive"
> large-receive-offload: on
> lpq84:~# echo 1 >/proc/sys/net/ipv4/conf/eth0/forwarding
> lpq84:~# ethtool -k eth0 | grep "large-receive"
> large-receive-offload: off [requested on]
> lpq84:~# echo 0 >/proc/sys/net/ipv4/conf/eth0/forwarding
> lpq84:~# ethtool -k eth0 | grep "large-receive"
> large-receive-offload: on
> 
> 
> lpq84:~# ethtool -K eth0 lro off
> lpq84:~# ethtool -k eth0 | grep "large-receive"
> large-receive-offload: off
> lpq84:~# echo 1 >/proc/sys/net/ipv4/conf/eth0/forwarding
> lpq84:~# ethtool -k eth0 | grep "large-receive"
> large-receive-offload: off
> lpq84:~# echo 0 >/proc/sys/net/ipv4/conf/eth0/forwarding
> lpq84:~# ethtool -k eth0 | grep "large-receive"
> large-receive-offload: off
> lpq84:~# ethtool -K eth0 lro on 
> 
> 
> lpq84:~# cat /proc/sys/net/ipv4/ip_forward
> 0
> lpq84:~# echo 1 >/proc/sys/net/ipv4/ip_forward
> lpq84:~# ethtool -k eth0 | grep "large-receive"
> large-receive-offload: off [requested on]
> lpq84:~# echo 0 >/proc/sys/net/ipv4/ip_forward
> lpq84:~# ethtool -k eth0 | grep "large-receive"
> large-receive-offload: on
> 
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Christoph Paasch <christoph.paasch@uclouvain.be>
> Reported-by: Jerry Chu <hkchu@google.com>
> Cc: Michael Dalton <mwdalton@google.com>
> Fixes: 8a29111c7ca6 ("net: gro: allow to build full sized skb")
> ---
> v4: drop LRO in netdev_fix_features(), ase Ben pointed out.
> 
>  include/linux/netdevice.h |    3 ++-
>  net/bridge/br_if.c        |    4 +++-
>  net/core/dev.c            |   31 ++++++++++++++++++++-----------
>  net/core/skbuff.c         |   11 ++++++++---
>  net/ipv4/devinet.c        |   14 ++++++++------
>  net/ipv6/addrconf.c       |    5 ++---
>  net/ipv6/addrconf_core.c  |    2 ++
>  7 files changed, 45 insertions(+), 25 deletions(-)

Good, this fixes the crash I experienced on my workload.

Tested-by: Christoph Paasch <christoph.paasch@uclouvain.be>

Thanks, Eric!

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v3 net-next] net: introduce dev_set_forwarding()
  2013-11-03 16:31                     ` Herbert Xu
@ 2013-11-03 17:26                       ` Eric Dumazet
  2013-11-04  4:11                         ` Herbert Xu
  0 siblings, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-11-03 17:26 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Mon, 2013-11-04 at 00:31 +0800, Herbert Xu wrote:

> I give up.  It's as if you've ignored everything I've said before.

Not really.

Have you took a look at the GSO path recently ?

The days it was handling only IP+TCP are gone.

If you think you can do better, please do so.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v2 net-next] net: introduce gro_frag_list_enable sysctl
       [not found]                       ` <44571383414236@web13j.yandex.ru>
  2013-11-02 18:28                         ` Eric Dumazet
@ 2013-11-03 23:19                         ` David Miller
  1 sibling, 0 replies; 163+ messages in thread
From: David Miller @ 2013-11-03 23:19 UTC (permalink / raw)
  To: sysoleg; +Cc: hkchu, herbert, eric.dumazet, christoph.paasch, netdev, mwdalton

From: Oleg A. Arkhangelsky <sysoleg@yandex.ru>
Date: Sat, 02 Nov 2013 21:43:56 +0400

> 30.10.2013, 06:33, "David Miller" <davem@davemloft.net>:
> 
>> GRO should always win, even on a router, because it decreases the
>> number of fundamental operations (routing lookups) that the stack
>> needs to perform.
> 
> Yes, unless the case when you're using Linux as IP router which is
> forwarding 500K mixed IP (TCP and UDP) flows traffic @ 10-20 Gbit/s.
> Then GRO is unnecessarily overhead, cause there's no possibility to
> accumulate adequate GRO list in such scenario.

If it's not accumulating, then there really isn't much cost because
the bulk of the code paths are short circuited when the flow IDs do
not match.

We touch these packet headers to forward anyways, and that is the bulk
of the cost.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v3 net-next] net: introduce dev_set_forwarding()
  2013-11-03 16:28                   ` Eric Dumazet
  2013-11-03 16:31                     ` Herbert Xu
@ 2013-11-03 23:23                     ` David Miller
  1 sibling, 0 replies; 163+ messages in thread
From: David Miller @ 2013-11-03 23:23 UTC (permalink / raw)
  To: eric.dumazet
  Cc: herbert, bhutchings, christoph.paasch, netdev, hkchu, mwdalton

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 03 Nov 2013 08:28:24 -0800

> 1) Because we should not call skb_segment() at all on a router ?

Eric, disagreement with this very thing is the basis for the entire
thread you have engaged with Herbert Xu about.  It's as if you've
ignored everything discussed there, please don't do this.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v3 net-next] net: introduce dev_set_forwarding()
  2013-11-03 17:26                       ` Eric Dumazet
@ 2013-11-04  4:11                         ` Herbert Xu
  2013-11-04  4:23                           ` Eric Dumazet
  2013-11-06  1:30                           ` gso: Attempt to handle mega-GRO packets Herbert Xu
  0 siblings, 2 replies; 163+ messages in thread
From: Herbert Xu @ 2013-11-04  4:11 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Sun, Nov 03, 2013 at 09:26:43AM -0800, Eric Dumazet wrote:
> 
> Not really.
> 
> Have you took a look at the GSO path recently ?
> 
> The days it was handling only IP+TCP are gone.
> 
> If you think you can do better, please do so.

OK maybe I overreacted.

With regards to your point 2), GRO does not introduce any latencies
because it simply relies on NAPI to do the aggregation.  IOW it is
no better or worse latency-wise compared to NAPI.  If you need to
tune it, just use the usual NAPI toggles.

With repsect to point 3), sure we can allow the generation of TSO
segments in skb_segment.

You may be right that this is all too hard, since I haven't actually
sat down and tried to do it yet.

But please give me chance to have a look first before we give up and
install a permanent user-space toggle.

Since this is a crashing bug, I'm OK with adding the linearising
patch to the current tree so that we don't end up shipping with a
known crash in the network stack.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v3 net-next] net: introduce dev_set_forwarding()
  2013-11-04  4:11                         ` Herbert Xu
@ 2013-11-04  4:23                           ` Eric Dumazet
  2013-11-04  4:29                             ` Herbert Xu
  2013-11-06  1:30                           ` gso: Attempt to handle mega-GRO packets Herbert Xu
  1 sibling, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-11-04  4:23 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Mon, 2013-11-04 at 12:11 +0800, Herbert Xu wrote:
> On Sun, Nov 03, 2013 at 09:26:43AM -0800, Eric Dumazet wrote:
> > 
> > Not really.
> > 
> > Have you took a look at the GSO path recently ?
> > 
> > The days it was handling only IP+TCP are gone.
> > 
> > If you think you can do better, please do so.
> 
> OK maybe I overreacted.
> 
> With regards to your point 2), GRO does not introduce any latencies
> because it simply relies on NAPI to do the aggregation.  IOW it is
> no better or worse latency-wise compared to NAPI.  If you need to
> tune it, just use the usual NAPI toggles.
> 

Well, GRO adds latencies for sure.

You seem to assume the transmit only can happen when NAPI is done, but
its not true. As soon as GRO fills one packet (reaches max capacity),
packet is delivered and forwarded, even if NAPI handler is not yet
complete for the flow.

Say you have 1 us per MSS, then filling 45 MSS per skb means we add a 45
us delay transit, instead of 16 us, or 1 us if no GRO is used on the
router.



> With repsect to point 3), sure we can allow the generation of TSO
> segments in skb_segment.
> 
> You may be right that this is all too hard, since I haven't actually
> sat down and tried to do it yet.
> 
> But please give me chance to have a look first before we give up and
> install a permanent user-space toggle.

I don't think I ever said it was permanent, I am sorry you understood
this.

I will be happy to change skb_segment() in the future, but I already
said I would not expect doing so for linux-3.13, given we were too late
in the linux-3.12-rc.

Thanks

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v3 net-next] net: introduce dev_set_forwarding()
  2013-11-04  4:23                           ` Eric Dumazet
@ 2013-11-04  4:29                             ` Herbert Xu
  2013-11-04  5:00                               ` Eric Dumazet
  0 siblings, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-11-04  4:29 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Sun, Nov 03, 2013 at 08:23:02PM -0800, Eric Dumazet wrote:
> 
> Well, GRO adds latencies for sure.
> 
> You seem to assume the transmit only can happen when NAPI is done, but
> its not true. As soon as GRO fills one packet (reaches max capacity),
> packet is delivered and forwarded, even if NAPI handler is not yet
> complete for the flow.

Have you actually measured this? The latency added by GRO is pure
processing overhead.  This is tiny when compared to the time NAPI takes
to wait.

> Say you have 1 us per MSS, then filling 45 MSS per skb means we add a 45
> us delay transit, instead of 16 us, or 1 us if no GRO is used on the
> router.

Are you talking about the latency added by the TX qdisc? That is
not GRO's fault.  Perhaps we can add more metadata to the GRO packet
so that the TX qdisc can deal with it appropriately?

> > But please give me chance to have a look first before we give up and
> > install a permanent user-space toggle.
> 
> I don't think I ever said it was permanent, I am sorry you understood
> this.
> 
> I will be happy to change skb_segment() in the future, but I already
> said I would not expect doing so for linux-3.13, given we were too late
> in the linux-3.12-rc.

For the time being my preference is for your linearisation patch, followed
by a revert of the GRO patch, and lastly the magic toggle that turns this
off for forwarding systems.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v3 net-next] net: introduce dev_set_forwarding()
  2013-11-04  4:29                             ` Herbert Xu
@ 2013-11-04  5:00                               ` Eric Dumazet
  2013-11-04  5:23                                 ` Herbert Xu
  0 siblings, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-11-04  5:00 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Mon, 2013-11-04 at 12:29 +0800, Herbert Xu wrote:
> On Sun, Nov 03, 2013 at 08:23:02PM -0800, Eric Dumazet wrote:
> > 
> > Well, GRO adds latencies for sure.
> > 
> > You seem to assume the transmit only can happen when NAPI is done, but
> > its not true. As soon as GRO fills one packet (reaches max capacity),
> > packet is delivered and forwarded, even if NAPI handler is not yet
> > complete for the flow.
> 
> Have you actually measured this? The latency added by GRO is pure
> processing overhead.  This is tiny when compared to the time NAPI takes
> to wait.


Please take a look at 

2e71a6f8084e net: gro: selective flush of packets

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v3 net-next] net: introduce dev_set_forwarding()
  2013-11-04  5:00                               ` Eric Dumazet
@ 2013-11-04  5:23                                 ` Herbert Xu
  2013-11-04  6:05                                   ` Eric Dumazet
  0 siblings, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-11-04  5:23 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Sun, Nov 03, 2013 at 09:00:54PM -0800, Eric Dumazet wrote:
> On Mon, 2013-11-04 at 12:29 +0800, Herbert Xu wrote:
>
> > Have you actually measured this? The latency added by GRO is pure
> > processing overhead.  This is tiny when compared to the time NAPI takes
> > to wait.
> 
> Please take a look at 
> 
> 2e71a6f8084e net: gro: selective flush of packets

This is a different problem altogether.  I was worried about the
latency in cases where we're idle and waiting for new data, while
you're worried about the latency in the CPU-bound case.

I think we can definitely improve our behaviour the CPU-bound case.
Right now if we encounter something we can't hold for GRO we
start processing it right away.  Instead we can place it in a
list for later processing together with the GRO packets.

This way GRO packets are not penalised by non-GRO packets.

You can then use the usual NAPI budget to minimise latency and
ensure scheduling fairness.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v3 net-next] net: introduce dev_set_forwarding()
  2013-11-04  5:23                                 ` Herbert Xu
@ 2013-11-04  6:05                                   ` Eric Dumazet
  2013-11-04  6:22                                     ` Herbert Xu
  0 siblings, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-11-04  6:05 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Mon, 2013-11-04 at 13:23 +0800, Herbert Xu wrote:
> On Sun, Nov 03, 2013 at 09:00:54PM -0800, Eric Dumazet wrote:
> > On Mon, 2013-11-04 at 12:29 +0800, Herbert Xu wrote:
> >
> > > Have you actually measured this? The latency added by GRO is pure
> > > processing overhead.  This is tiny when compared to the time NAPI takes
> > > to wait.
> > 
> > Please take a look at 
> > 
> > 2e71a6f8084e net: gro: selective flush of packets
> 
> This is a different problem altogether.  I was worried about the
> latency in cases where we're idle and waiting for new data, while
> you're worried about the latency in the CPU-bound case.

Idle case, you very rarely cant keep up building skbs with 16 MSS.

> 
> I think we can definitely improve our behaviour the CPU-bound case.
> Right now if we encounter something we can't hold for GRO we
> start processing it right away.  Instead we can place it in a
> list for later processing together with the GRO packets.
> 
> This way GRO packets are not penalised by non-GRO packets.
> 
> You can then use the usual NAPI budget to minimise latency and
> ensure scheduling fairness.

We had these latencies only dealing with TCP packets, all GRO
candidates.

Really, I think we have used GRO at large scale here at Google ;)

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v3 net-next] net: introduce dev_set_forwarding()
  2013-11-04  6:05                                   ` Eric Dumazet
@ 2013-11-04  6:22                                     ` Herbert Xu
  2013-11-04  6:26                                       ` Herbert Xu
  2013-11-04  6:46                                       ` Eric Dumazet
  0 siblings, 2 replies; 163+ messages in thread
From: Herbert Xu @ 2013-11-04  6:22 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Sun, Nov 03, 2013 at 10:05:55PM -0800, Eric Dumazet wrote:
>
> > I think we can definitely improve our behaviour the CPU-bound case.
> > Right now if we encounter something we can't hold for GRO we
> > start processing it right away.  Instead we can place it in a
> > list for later processing together with the GRO packets.
> > 
> > This way GRO packets are not penalised by non-GRO packets.
> > 
> > You can then use the usual NAPI budget to minimise latency and
> > ensure scheduling fairness.
> 
> We had these latencies only dealing with TCP packets, all GRO
> candidates.
> 
> Really, I think we have used GRO at large scale here at Google ;)

So what NICs were you using that had this issue?

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v3 net-next] net: introduce dev_set_forwarding()
  2013-11-04  6:22                                     ` Herbert Xu
@ 2013-11-04  6:26                                       ` Herbert Xu
  2013-11-04  7:10                                         ` Eric Dumazet
  2013-11-04  6:46                                       ` Eric Dumazet
  1 sibling, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-11-04  6:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Mon, Nov 04, 2013 at 02:22:02PM +0800, Herbert Xu wrote:
>
> > We had these latencies only dealing with TCP packets, all GRO
> > candidates.
> > 
> > Really, I think we have used GRO at large scale here at Google ;)
> 
> So what NICs were you using that had this issue?

Also if your scenario had all GRO candidates then not using
frag_list would seem to be a bad workaround for an underlying
latency problem.  Rather than arbitrarily limiting the aggregation
to 22K surely it would make more sense to limit it based on your
actual latency requirements?

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v3 net-next] net: introduce dev_set_forwarding()
  2013-11-04  6:22                                     ` Herbert Xu
  2013-11-04  6:26                                       ` Herbert Xu
@ 2013-11-04  6:46                                       ` Eric Dumazet
  2013-11-04  7:03                                         ` Herbert Xu
  1 sibling, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-11-04  6:46 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Mon, 2013-11-04 at 14:22 +0800, Herbert Xu wrote:

> So what NICs were you using that had this issue?

Its a generic issue, really.

It depends on how many flows are mixed per RX queues.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v3 net-next] net: introduce dev_set_forwarding()
  2013-11-04  6:46                                       ` Eric Dumazet
@ 2013-11-04  7:03                                         ` Herbert Xu
  0 siblings, 0 replies; 163+ messages in thread
From: Herbert Xu @ 2013-11-04  7:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Sun, Nov 03, 2013 at 10:46:21PM -0800, Eric Dumazet wrote:
> On Mon, 2013-11-04 at 14:22 +0800, Herbert Xu wrote:
> 
> > So what NICs were you using that had this issue?
> 
> Its a generic issue, really.

I'd still like answer because we also tested large numbers of
flows and have not seen this.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v3 net-next] net: introduce dev_set_forwarding()
  2013-11-04  6:26                                       ` Herbert Xu
@ 2013-11-04  7:10                                         ` Eric Dumazet
  2013-11-04  7:21                                           ` Herbert Xu
  0 siblings, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-11-04  7:10 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Mon, 2013-11-04 at 14:26 +0800, Herbert Xu wrote:

> Also if your scenario had all GRO candidates then not using
> frag_list would seem to be a bad workaround for an underlying
> latency problem.  Rather than arbitrarily limiting the aggregation
> to 22K surely it would make more sense to limit it based on your
> actual latency requirements?

I have never limited GRO aggregation to 22K.

You did this in commit 81705ad1b2f926d
("gro: Do not merge paged packets into frag_list")


I exactly implemented what you suggested in your patch :

"In future we can optimise this further by doing frag_list merging
but making sure that we continue to fill in the page array."

Using frag_list for skb meant to be delivered to local stack is mostly
fine. Its for forwarding that its not a win, since no driver actually
supports frag_list and we revert to skb_segment().

So my latest patch about _not_ building fat skbs with frag_list on a
router makes sense.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v3 net-next] net: introduce dev_set_forwarding()
  2013-11-04  7:10                                         ` Eric Dumazet
@ 2013-11-04  7:21                                           ` Herbert Xu
  2013-11-04 13:58                                             ` Eric Dumazet
  0 siblings, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-11-04  7:21 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Sun, Nov 03, 2013 at 11:10:41PM -0800, Eric Dumazet wrote:
>
> Using frag_list for skb meant to be delivered to local stack is mostly
> fine. Its for forwarding that its not a win, since no driver actually
> supports frag_list and we revert to skb_segment().

Sigh.

Most of the gain these days isn't coming from the hardware anymore,
especially for virtualisation where the network stack is at least
twice as long.  The gain is in paying the cost of the network once
instead of n times for an aggregation of n packets.  So with your
mega-GRO patch, the choice comes down to paying for one trip or
three trips.  The difference may not be trivial, even for a router.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v3 net-next] net: introduce dev_set_forwarding()
  2013-11-04  7:21                                           ` Herbert Xu
@ 2013-11-04 13:58                                             ` Eric Dumazet
  0 siblings, 0 replies; 163+ messages in thread
From: Eric Dumazet @ 2013-11-04 13:58 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Mon, 2013-11-04 at 15:21 +0800, Herbert Xu wrote:
> On Sun, Nov 03, 2013 at 11:10:41PM -0800, Eric Dumazet wrote:
> >
> > Using frag_list for skb meant to be delivered to local stack is mostly
> > fine. Its for forwarding that its not a win, since no driver actually
> > supports frag_list and we revert to skb_segment().
> 
> Sigh.
> 
> Most of the gain these days isn't coming from the hardware anymore,
> especially for virtualisation where the network stack is at least
> twice as long.  The gain is in paying the cost of the network once
> instead of n times for an aggregation of n packets.  So with your
> mega-GRO patch, the choice comes down to paying for one trip or
> three trips.  The difference may not be trivial, even for a router.

I do think we are on the same thoughts.

Please take a look at 2613af0ed18a11d5c566a81f9a6510b73180660a
("virtio_net: migrate mergeable rx buffers to page frag allocators")

Notice it also uses frag_list

So if you don't provide a skb_segment() patch, I will.

I had urgent matters last week.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v4 net-next] net: introduce dev_set_forwarding()
  2013-11-02 19:58                 ` [PATCH v4 " Eric Dumazet
  2013-11-03 17:18                   ` Christoph Paasch
@ 2013-11-04 16:55                   ` Ben Hutchings
  2013-11-07 21:17                     ` David Miller
  2013-11-21 18:29                   ` David Miller
  2 siblings, 1 reply; 163+ messages in thread
From: Ben Hutchings @ 2013-11-04 16:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, christoph.paasch, herbert, netdev, hkchu, mwdalton

On Sat, 2013-11-02 at 12:58 -0700, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Christoph Paasch and Jerry Chu reported crashes in skb_segment() caused
> by commit 8a29111c7ca6 ("net: gro: allow to build full sized skb")
> 
> skb_segment() only deals with a frag_list chain containing MSS sized
> fragments. Even if we fix this problem, its better if GRO layer
> doesn't build skb with a frag_list in the first place, to let TSO
> packets reaching output devices.
>  
> David Miller and Ben Hutchings suggested we keep track of number of
> forwarding users to be able to :
> 
> - Disable LRO
> - Make sure GRO layer do not use skb frag_list to extend skb capacity
> 
> Note that after this patch, LRO is automatically re-enabled if
> forwarding is disabled on the device, or if a device is removed
> from a bridge.
[...]

Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>


-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* gso: Attempt to handle mega-GRO packets
  2013-11-04  4:11                         ` Herbert Xu
  2013-11-04  4:23                           ` Eric Dumazet
@ 2013-11-06  1:30                           ` Herbert Xu
  2013-11-06  1:45                             ` Eric Dumazet
  2013-11-06 12:39                             ` Herbert Xu
  1 sibling, 2 replies; 163+ messages in thread
From: Herbert Xu @ 2013-11-06  1:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

Here is a totally untested patch that tries to trivially process
these new frags + frag_list skbs.  It should actually be trivial
to make this generate TSO packets by just adding a gso_ok check
and short-circuit.

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3735fad..ec8e8bc 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2816,7 +2816,24 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 			hsize = len;
 
 		if (!hsize && i >= nfrags) {
-			BUG_ON(fskb->len != len);
+			if (fskb->len != len) {
+				SKB_FRAG_ASSERT(fskb);
+
+				nskb = skb_segment(fskb, features);
+
+				err = PTR_ERR(nskb);
+				if (IS_ERR(nskb))
+					goto err;
+				err = -ENOMEM;
+
+				if (segs)
+					tail->next = nskb;
+				else
+					segs = nskb;
+				tail = nskb;
+				while (tail->next)
+					tail = tail->next;
+			}
 
 			pos += len;
 			nskb = skb_clone(fskb, GFP_ATOMIC);

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-06  1:30                           ` gso: Attempt to handle mega-GRO packets Herbert Xu
@ 2013-11-06  1:45                             ` Eric Dumazet
  2013-11-06  4:07                               ` Herbert Xu
  2013-11-06 12:39                             ` Herbert Xu
  1 sibling, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-11-06  1:45 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Wed, 2013-11-06 at 09:30 +0800, Herbert Xu wrote:
> Here is a totally untested patch that tries to trivially process
> these new frags + frag_list skbs.  It should actually be trivial
> to make this generate TSO packets by just adding a gso_ok check
> and short-circuit.
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 3735fad..ec8e8bc 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -2816,7 +2816,24 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
>  			hsize = len;
>  
>  		if (!hsize && i >= nfrags) {
> -			BUG_ON(fskb->len != len);
> +			if (fskb->len != len) {
> +				SKB_FRAG_ASSERT(fskb);
> +
> +				nskb = skb_segment(fskb, features);
> +
> +				err = PTR_ERR(nskb);
> +				if (IS_ERR(nskb))
> +					goto err;
> +				err = -ENOMEM;
> +
> +				if (segs)
> +					tail->next = nskb;
> +				else
> +					segs = nskb;
> +				tail = nskb;
> +				while (tail->next)
> +					tail = tail->next;
> +			}
>  
>  			pos += len;
>  			nskb = skb_clone(fskb, GFP_ATOMIC);
> 
> Thanks,

Hmm, I do not think fskb has the headers in the general case.

It might work in the GRO case only.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-06  1:45                             ` Eric Dumazet
@ 2013-11-06  4:07                               ` Herbert Xu
  2013-11-06  4:23                                 ` Eric Dumazet
  0 siblings, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-11-06  4:07 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Tue, Nov 05, 2013 at 05:45:47PM -0800, Eric Dumazet wrote:
>
> Hmm, I do not think fskb has the headers in the general case.

Only GRO produces such packets, see the changeset where I added
frag_list support to skb_setgment

	89319d3801d1d3ac29c7df1f067038986f267d29

> It might work in the GRO case only.

Any chance you guys can test this patch out?

Thanks!
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-06  4:07                               ` Herbert Xu
@ 2013-11-06  4:23                                 ` Eric Dumazet
  2013-11-06  4:28                                   ` Herbert Xu
  0 siblings, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-11-06  4:23 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Wed, 2013-11-06 at 12:07 +0800, Herbert Xu wrote:
> On Tue, Nov 05, 2013 at 05:45:47PM -0800, Eric Dumazet wrote:
> >
> > Hmm, I do not think fskb has the headers in the general case.
> 
> Only GRO produces such packets, see the changeset where I added
> frag_list support to skb_setgment

Nope, I already mentioned this :

Please take a look at 2613af0ed18a11d5c566a81f9a6510b73180660a
("virtio_net: migrate mergeable rx buffers to page frag allocators")

I do not see why skb_segment() would be tied to GRO -> GSO, with
the property of each page frag being exactly MSS sized.

We use it in TCP stack with page frags of any size.

skb_segment() needs to iterate properly on all pages frags,
the one found in the first skb, and the ones found on frag_list skbs.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-06  4:23                                 ` Eric Dumazet
@ 2013-11-06  4:28                                   ` Herbert Xu
  2013-11-06  5:20                                     ` Eric Dumazet
  0 siblings, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-11-06  4:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Tue, Nov 05, 2013 at 08:23:05PM -0800, Eric Dumazet wrote:
>
> Nope, I already mentioned this :
> 
> Please take a look at 2613af0ed18a11d5c566a81f9a6510b73180660a
> ("virtio_net: migrate mergeable rx buffers to page frag allocators")

This patch should simply be reverted.  You guys steam-rolled over
the virtio_net people's concerns that this seriously impacts virt
performance.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-06  4:28                                   ` Herbert Xu
@ 2013-11-06  5:20                                     ` Eric Dumazet
  2013-11-06  8:04                                       ` Herbert Xu
  0 siblings, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-11-06  5:20 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Wed, 2013-11-06 at 12:28 +0800, Herbert Xu wrote:
> On Tue, Nov 05, 2013 at 08:23:05PM -0800, Eric Dumazet wrote:
> >
> > Nope, I already mentioned this :
> > 
> > Please take a look at 2613af0ed18a11d5c566a81f9a6510b73180660a
> > ("virtio_net: migrate mergeable rx buffers to page frag allocators")
> 
> This patch should simply be reverted.  You guys steam-rolled over
> the virtio_net people's concerns that this seriously impacts virt
> performance.
> 

Have you followed netdev traffic lately ?

virtio_net performance is better than before, and we did no revert,
thank you.

I am really tired, I'll fix skb_segment() tomorrow.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-06  5:20                                     ` Eric Dumazet
@ 2013-11-06  8:04                                       ` Herbert Xu
  2013-11-06  8:16                                         ` Herbert Xu
  0 siblings, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-11-06  8:04 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Tue, Nov 05, 2013 at 09:20:22PM -0800, Eric Dumazet wrote:
> On Wed, 2013-11-06 at 12:28 +0800, Herbert Xu wrote:
> > On Tue, Nov 05, 2013 at 08:23:05PM -0800, Eric Dumazet wrote:
> > >
> > > Nope, I already mentioned this :
> > > 
> > > Please take a look at 2613af0ed18a11d5c566a81f9a6510b73180660a
> > > ("virtio_net: migrate mergeable rx buffers to page frag allocators")
> > 
> > This patch should simply be reverted.  You guys steam-rolled over
> > the virtio_net people's concerns that this seriously impacts virt
> > performance.
> > 
> 
> Have you followed netdev traffic lately ?
> 
> virtio_net performance is better than before, and we did no revert,
> thank you.

The last email I saw from Michael Tsirkin he still wasn't happy
that your patch does not degrade virt performance.  Where is the
reply to that?

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-06  8:04                                       ` Herbert Xu
@ 2013-11-06  8:16                                         ` Herbert Xu
  2013-11-06 13:12                                           ` Herbert Xu
  2013-11-06 15:05                                           ` Eric Dumazet
  0 siblings, 2 replies; 163+ messages in thread
From: Herbert Xu @ 2013-11-06  8:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu,
	mwdalton, mst

On Wed, Nov 06, 2013 at 04:04:25PM +0800, Herbert Xu wrote:
> On Tue, Nov 05, 2013 at 09:20:22PM -0800, Eric Dumazet wrote:
> > On Wed, 2013-11-06 at 12:28 +0800, Herbert Xu wrote:
> > > On Tue, Nov 05, 2013 at 08:23:05PM -0800, Eric Dumazet wrote:
> > > >
> > > > Nope, I already mentioned this :
> > > > 
> > > > Please take a look at 2613af0ed18a11d5c566a81f9a6510b73180660a
> > > > ("virtio_net: migrate mergeable rx buffers to page frag allocators")
> > > 
> > > This patch should simply be reverted.  You guys steam-rolled over
> > > the virtio_net people's concerns that this seriously impacts virt
> > > performance.
> > > 
> > 
> > Have you followed netdev traffic lately ?
> > 
> > virtio_net performance is better than before, and we did no revert,
> > thank you.
> 
> The last email I saw from Michael Tsirkin he still wasn't happy
> that your patch does not degrade virt performance.  Where is the
> reply to that?

I just looked at the aformentioned patch again and it is seriously
broken! How on earth did it get merged?

Instead of using perfectly sane 4K pages per frag to store guest to
guest traffic, we now end up using 1.5K frags, which that's why you
end up having to use the frag_list, WTF?

Dave, please revert the above commit as it is seriously broken.

Whatever performance problem it is trying to address can surely
be fixed without being so stupid as to break up perfectly sized
4K pages into 1.5K chunks.

Thanks!
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-06  1:30                           ` gso: Attempt to handle mega-GRO packets Herbert Xu
  2013-11-06  1:45                             ` Eric Dumazet
@ 2013-11-06 12:39                             ` Herbert Xu
  2013-11-06 13:30                               ` Herbert Xu
  1 sibling, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-11-06 12:39 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Wed, Nov 06, 2013 at 09:30:38AM +0800, Herbert Xu wrote:
> Here is a totally untested patch that tries to trivially process
> these new frags + frag_list skbs.  It should actually be trivial
> to make this generate TSO packets by just adding a gso_ok check
> and short-circuit.

That patch obviously didn't have a chance of working since I missed
a continue.

Here is a better version.

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3735fad..409bd9b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2816,7 +2816,31 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 			hsize = len;
 
 		if (!hsize && i >= nfrags) {
-			BUG_ON(fskb->len != len);
+			if (fskb->len != len) {
+				SKB_FRAG_ASSERT(fskb);
+
+				nskb = skb_segment(fskb, features);
+
+				err = PTR_ERR(nskb);
+				if (IS_ERR(nskb))
+					goto err;
+				err = -ENOMEM;
+
+				if (segs)
+					tail->next = nskb;
+				else
+					segs = nskb;
+
+				tail = nskb;
+				while (tail->next)
+					tail = tail->next;
+
+				BUG_ON(fskb->next && tail->len != len);
+
+				len = fskb->len;
+				fskb = fskb->next;
+				continue;
+			}
 
 			pos += len;
 			nskb = skb_clone(fskb, GFP_ATOMIC);

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-06  8:16                                         ` Herbert Xu
@ 2013-11-06 13:12                                           ` Herbert Xu
  2013-11-06 15:01                                             ` Eric Dumazet
  2013-11-06 15:05                                           ` Eric Dumazet
  1 sibling, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-11-06 13:12 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu,
	mwdalton, mst

On Wed, Nov 06, 2013 at 04:16:38PM +0800, Herbert Xu wrote:
>
> I just looked at the aformentioned patch again and it is seriously
> broken! How on earth did it get merged?
> 
> Instead of using perfectly sane 4K pages per frag to store guest to
> guest traffic, we now end up using 1.5K frags, which that's why you
> end up having to use the frag_list, WTF?
> 
> Dave, please revert the above commit as it is seriously broken.

I take that back.  While the original patch was seriously broken,
it has since been fixed by the coalescing patch that Jason Wang
wrote.

It's still pretty weird to be dividing page frags into 1500-byte
chunks and then merging back up to 4K but at least it should do the
right thing now.

With regards to the impact on my skb_segment patch, there should be
none because those packets ultimately are still originating from
either GRO or the TCP stack.  In which case the same assumptions
on frag_list still holds.

However, we do have to guard against evil hosts/guests injecting
bogus packets into our stack, so either we'll have to lose the
BUG_ON in skb_segment or we'll need some sort of a filter in
virtio_net.

Perhaps a rate-limited printk might be the go.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-06 12:39                             ` Herbert Xu
@ 2013-11-06 13:30                               ` Herbert Xu
  2013-11-06 14:39                                 ` Herbert Xu
  0 siblings, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-11-06 13:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Wed, Nov 06, 2013 at 08:39:00PM +0800, Herbert Xu wrote:
> 
> That patch obviously didn't have a chance of working since I missed
> a continue.
> 
> Here is a better version.

In order to handle malicious GSO packets that is now possible with
the use of frag_list in virtio_net, we need to remove the BUG_ONs.
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3735fad..f336e5c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2816,7 +2816,44 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 			hsize = len;
 
 		if (!hsize && i >= nfrags) {
-			BUG_ON(fskb->len != len);
+			if (fskb->len != len) {
+				if (skb_has_frag_list(fskb)) {
+					net_warn_ratelimited(
+						"skb_segment: "
+						"nested frag_list detected");
+					err = -EINVAL;
+					goto err;
+				}
+
+				nskb = skb_segment(fskb, features);
+
+				err = PTR_ERR(nskb);
+				if (IS_ERR(nskb))
+					goto err;
+				err = -ENOMEM;
+
+				if (segs)
+					tail->next = nskb;
+				else
+					segs = nskb;
+
+				tail = nskb;
+				while (tail->next)
+					tail = tail->next;
+
+				if (fskb->next && tail->len != len) {
+					net_warn_ratelimited(
+						"skb_segment: "
+						"illegal GSO fragment: %u %u",
+						tail->len, len);
+					err = -EINVAL;
+					goto err;
+				}
+
+				len = fskb->len;
+				fskb = fskb->next;
+				continue;
+			}
 
 			pos += len;
 			nskb = skb_clone(fskb, GFP_ATOMIC);
@@ -2905,7 +2942,14 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 		if (pos < offset + len) {
 			struct sk_buff *fskb2 = fskb;
 
-			BUG_ON(pos + fskb->len != offset + len);
+			if (pos + fskb->len != offset + len) {
+				net_warn_ratelimited(
+					"skb_segment: "
+					"illegal GSO trailer: %u %u",
+					pos + fskb->len, offset + len);
+				err = -EINVAL;
+				goto err;
+			}
 
 			pos += fskb->len;
 			fskb = fskb->next;

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-06 13:30                               ` Herbert Xu
@ 2013-11-06 14:39                                 ` Herbert Xu
  2013-11-06 15:06                                   ` Eric Dumazet
                                                     ` (2 more replies)
  0 siblings, 3 replies; 163+ messages in thread
From: Herbert Xu @ 2013-11-06 14:39 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Wed, Nov 06, 2013 at 09:30:45PM +0800, Herbert Xu wrote:
> 
> In order to handle malicious GSO packets that is now possible with
> the use of frag_list in virtio_net, we need to remove the BUG_ONs.

OK Eric was right and I am a dumb ass.  This has no chance in hell
of handling the new virtio_net frag_list since we won't have any
headers in the frag_list skbs.

In fact, we never relied on the frag_list having headers anyway so
it's not hard to fix this.

Still totally untested but at least this has a chance of handling
the new virtio_net.

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3735fad..3e8819c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2816,8 +2816,6 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 			hsize = len;
 
 		if (!hsize && i >= nfrags) {
-			BUG_ON(fskb->len != len);
-
 			pos += len;
 			nskb = skb_clone(fskb, GFP_ATOMIC);
 			fskb = fskb->next;
@@ -2846,12 +2844,6 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 			__skb_put(nskb, doffset);
 		}
 
-		if (segs)
-			tail->next = nskb;
-		else
-			segs = nskb;
-		tail = nskb;
-
 		__copy_skb_header(nskb, skb);
 		nskb->mac_len = skb->mac_len;
 
@@ -2861,15 +2853,62 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 						 nskb->data - tnl_hlen,
 						 doffset + tnl_hlen);
 
-		if (fskb != skb_shinfo(skb)->frag_list)
-			goto perform_csum_check;
+		if (fskb != skb_shinfo(skb)->frag_list) {
+			struct sk_buff *nsegs;
+
+			if (nskb->len == len + doffset)
+				goto perform_csum_check;
+
+			if (skb_has_frag_list(nskb)) {
+				net_warn_ratelimited(
+					"skb_segment: "
+					"nested frag_list detected");
+				kfree(nskb);
+				err = -EINVAL;
+				goto err;
+			}
+
+			__skb_pull(nskb, doffset);
+			skb_shinfo(nskb)->gso_size = mss;
+			nsegs = skb_segment(nskb, features);
+
+			err = PTR_ERR(nsegs);
+			if (IS_ERR(nsegs)) {
+				kfree(nskb);
+				goto err;
+			}
+			err = -ENOMEM;
+
+			if (segs)
+				tail->next = nsegs;
+			else
+				segs = nsegs;
+
+			tail = nsegs;
+			while (tail->next)
+				tail = tail->next;
+
+			if (fskb && tail->len != len) {
+				net_warn_ratelimited(
+					"skb_segment: "
+					"illegal GSO fragment: %u %u",
+					tail->len, len);
+				kfree(nskb);
+				err = -EINVAL;
+				goto err;
+			}
+
+			len = nskb->len;
+			kfree(nskb);
+			continue;
+		}
 
 		if (!sg) {
 			nskb->ip_summed = CHECKSUM_NONE;
 			nskb->csum = skb_copy_and_csum_bits(skb, offset,
 							    skb_put(nskb, len),
 							    len, 0);
-			continue;
+			goto add_to_segs;
 		}
 
 		frag = skb_shinfo(nskb)->frags;
@@ -2905,15 +2944,25 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 		if (pos < offset + len) {
 			struct sk_buff *fskb2 = fskb;
 
-			BUG_ON(pos + fskb->len != offset + len);
+			if (pos + fskb->len != offset + len) {
+				net_warn_ratelimited(
+					"skb_segment: "
+					"illegal GSO trailer: %u %u",
+					pos + fskb->len, offset + len);
+				kfree(nskb);
+				err = -EINVAL;
+				goto err;
+			}
 
 			pos += fskb->len;
 			fskb = fskb->next;
 
 			if (fskb2->next) {
 				fskb2 = skb_clone(fskb2, GFP_ATOMIC);
-				if (!fskb2)
+				if (!fskb2) {
+					kfree(nskb);
 					goto err;
+				}
 			} else
 				skb_get(fskb2);
 
@@ -2932,6 +2981,13 @@ perform_csum_check:
 						  nskb->len - doffset, 0);
 			nskb->ip_summed = CHECKSUM_NONE;
 		}
+
+add_to_segs:
+		if (segs)
+			tail->next = nskb;
+		else
+			segs = nskb;
+		tail = nskb;
 	} while ((offset += len) < skb->len);
 
 	return segs;

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-06 13:12                                           ` Herbert Xu
@ 2013-11-06 15:01                                             ` Eric Dumazet
  2013-11-07  0:36                                               ` Herbert Xu
  0 siblings, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-11-06 15:01 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu,
	mwdalton, mst

On Wed, 2013-11-06 at 21:12 +0800, Herbert Xu wrote:

> I take that back.  While the original patch was seriously broken,
> it has since been fixed by the coalescing patch that Jason Wang
> wrote.
> 

Patch was not 'broken', but a step into the right direction. I am very
sorry if you think otherwise.

> It's still pretty weird to be dividing page frags into 1500-byte
> chunks and then merging back up to 4K but at least it should do the
> right thing now.


Have you thought about arches having PAGE_SIZE=65536, and how bad it is
to use a full page per network frame ? It is lazy and x86 centered.

So after our patches, we now have an optimal situation, even on these
arches.

On x86, a full 64KB GSO packet will now fit in 2 (or 3) frags allocated
in 32KB pages (order-3) in fast path (ie if there is not high memory
pressure)

According to our tests, performance is better on x86, and virtio_net now
is usable on arches with PAGE_SIZE=65536

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-06  8:16                                         ` Herbert Xu
  2013-11-06 13:12                                           ` Herbert Xu
@ 2013-11-06 15:05                                           ` Eric Dumazet
  2013-11-07  0:39                                             ` Herbert Xu
  1 sibling, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-11-06 15:05 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu,
	mwdalton, mst

On Wed, 2013-11-06 at 16:16 +0800, Herbert Xu wrote:

> Instead of using perfectly sane 4K pages per frag to store guest to
> guest traffic, we now end up using 1.5K frags, which that's why you
> end up having to use the frag_list, WTF?

Sure, if your host has infinite amount of memory, we can remove all the
silly and expensive and bore some checks we added in the stack against
skb->truesize.

And yes, network stack will be faster.

But in real life, we have memory constraints.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-06 14:39                                 ` Herbert Xu
@ 2013-11-06 15:06                                   ` Eric Dumazet
  2013-11-06 17:25                                   ` Joe Perches
  2013-11-06 19:47                                   ` Eric Dumazet
  2 siblings, 0 replies; 163+ messages in thread
From: Eric Dumazet @ 2013-11-06 15:06 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Wed, 2013-11-06 at 22:39 +0800, Herbert Xu wrote:
> On Wed, Nov 06, 2013 at 09:30:45PM +0800, Herbert Xu wrote:
> > 
> > In order to handle malicious GSO packets that is now possible with
> > the use of frag_list in virtio_net, we need to remove the BUG_ONs.
> 
> OK Eric was right and I am a dumb ass.  This has no chance in hell
> of handling the new virtio_net frag_list since we won't have any
> headers in the frag_list skbs.
> 
> In fact, we never relied on the frag_list having headers anyway so
> it's not hard to fix this.
> 
> Still totally untested but at least this has a chance of handling
> the new virtio_net.

OK I'll take a look at this and make tests today, thanks !

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-06 14:39                                 ` Herbert Xu
  2013-11-06 15:06                                   ` Eric Dumazet
@ 2013-11-06 17:25                                   ` Joe Perches
  2013-11-06 19:47                                   ` Eric Dumazet
  2 siblings, 0 replies; 163+ messages in thread
From: Joe Perches @ 2013-11-06 17:25 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Eric Dumazet, Ben Hutchings, David Miller, christoph.paasch,
	netdev, hkchu, mwdalton

On Wed, 2013-11-06 at 22:39 +0800, Herbert Xu wrote:
> On Wed, Nov 06, 2013 at 09:30:45PM +0800, Herbert Xu wrote:
> > In order to handle malicious GSO packets that is now possible with
> > the use of frag_list in virtio_net, we need to remove the BUG_ONs.
[]
> Still totally untested but at least this has a chance of handling
> the new virtio_net.

trivial:  please add "\n" to each net_warn_ratelimited
format termination.

> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
[]
> @@ -2861,15 +2853,62 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
[]
> +			if (skb_has_frag_list(nskb)) {
> +				net_warn_ratelimited(
> +					"skb_segment: "
> +					"nested frag_list detected");

	"nested frag_list detected\n");

etc...

It might be nicer to coalesce the format fragments too.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-06 14:39                                 ` Herbert Xu
  2013-11-06 15:06                                   ` Eric Dumazet
  2013-11-06 17:25                                   ` Joe Perches
@ 2013-11-06 19:47                                   ` Eric Dumazet
  2013-11-07  0:15                                     ` Eric Dumazet
  2013-11-07  0:43                                     ` Herbert Xu
  2 siblings, 2 replies; 163+ messages in thread
From: Eric Dumazet @ 2013-11-06 19:47 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Wed, 2013-11-06 at 22:39 +0800, Herbert Xu wrote:

> In fact, we never relied on the frag_list having headers anyway so
> it's not hard to fix this.
> 
> Still totally untested but at least this has a chance of handling
> the new virtio_net.

First try, machine doesn't crash, but things are not working.

[  433.232553] skbuff: skb_segment: illegal GSO fragment: 1514 1448
[  433.340523] skbuff: skb_segment: illegal GSO fragment: 1514 1448skbuff: skb_segment: illegal GSO fragment: 1514 1448
[  433.340578] skbuff: skb_segment: illegal GSO fragment: 1514 1448skbuff: skb_segment: illegal GSO fragment: 1514 1448
[  433.340598] skbuff: skb_segment: illegal GSO fragment: 1514 1448skbuff: skb_segment: illegal GSO fragment: 1514 1448
[  433.340620] skbuff: skb_segment: illegal GSO fragment: 1514 1448skbuff: skb_segment: illegal GSO fragment: 1514 1448
[  433.340661] skbuff: skb_segment: illegal GSO fragment: 1514 1448<4>[  438.313019] net_ratelimit: 141 callbacks suppressed

To test this, I used a regular forwarding path between three hosts

A --->  B ----> C

I'll try a different way.

The frag_list would contain a bunch of frags, that we logically add to the bunch
of frags found in the first skb shared_info structure.

Thanks

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-06 19:47                                   ` Eric Dumazet
@ 2013-11-07  0:15                                     ` Eric Dumazet
  2013-11-07  0:47                                       ` Herbert Xu
  2013-11-07  1:13                                       ` Hannes Frederic Sowa
  2013-11-07  0:43                                     ` Herbert Xu
  1 sibling, 2 replies; 163+ messages in thread
From: Eric Dumazet @ 2013-11-07  0:15 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Wed, 2013-11-06 at 11:47 -0800, Eric Dumazet wrote:
> I'll try a different way.
> 
> The frag_list would contain a bunch of frags, that we logically add to the bunch
> of frags found in the first skb shared_info structure.

Here is the patch I came into (I tested it and it works very fine)

The theory is that in GRO stack, all skbs use the head_frag trick,
so even if one NIC pulled some payload into skb->head, we do not have to
copy anything. Outside of GRO stack, we are not supposed to provide data
in skb->head (I am speaking of the skb found on the frag_list extension,
not the skb_head)

I put a fallback code, just in case, with a WARN_ON_ONCE() so that we
can catch the offenders (if any) to fix them.

I renamed @skb to @skb_head to more clearly document this code.
Same for @i renamed to @cur_frag

 net/core/skbuff.c |  187 ++++++++++++++++++++++----------------------
 1 file changed, 94 insertions(+), 93 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3735fad..0aeefd1 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2771,80 +2771,60 @@ EXPORT_SYMBOL_GPL(skb_pull_rcsum);
  *	a pointer to the first in a list of new skbs for the segments.
  *	In case of error it returns ERR_PTR(err).
  */
-struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
+struct sk_buff *skb_segment(struct sk_buff *head_skb, netdev_features_t features)
 {
 	struct sk_buff *segs = NULL;
 	struct sk_buff *tail = NULL;
-	struct sk_buff *fskb = skb_shinfo(skb)->frag_list;
-	unsigned int mss = skb_shinfo(skb)->gso_size;
-	unsigned int doffset = skb->data - skb_mac_header(skb);
-	unsigned int offset = doffset;
-	unsigned int tnl_hlen = skb_tnl_header_len(skb);
+	struct sk_buff *cskb = head_skb;
+	unsigned int mss = skb_shinfo(head_skb)->gso_size;
+	unsigned int doffset = head_skb->data - skb_mac_header(head_skb);
+	unsigned int tot_len = doffset; /* should reach head_skb->len at the end */
+	unsigned int offset = doffset; /* offset in cskb->data */
+	unsigned int tnl_hlen = skb_tnl_header_len(head_skb);
 	unsigned int headroom;
 	unsigned int len;
 	__be16 proto;
 	bool csum;
 	int sg = !!(features & NETIF_F_SG);
-	int nfrags = skb_shinfo(skb)->nr_frags;
+	int cur_frag = 0, nfrags = skb_shinfo(cskb)->nr_frags;
+	unsigned int data_len, cur_frag_offset = 0;
 	int err = -ENOMEM;
-	int i = 0;
 	int pos;
 
-	proto = skb_network_protocol(skb);
+	proto = skb_network_protocol(head_skb);
 	if (unlikely(!proto))
 		return ERR_PTR(-EINVAL);
 
 	csum = !!can_checksum_protocol(features, proto);
-	__skb_push(skb, doffset);
-	headroom = skb_headroom(skb);
-	pos = skb_headlen(skb);
+	__skb_push(head_skb, doffset);
+	headroom = skb_headroom(head_skb);
+	pos = skb_headlen(head_skb);
 
-	do {
+	for (; tot_len < head_skb->len; tot_len += len) {
 		struct sk_buff *nskb;
 		skb_frag_t *frag;
 		int hsize;
 		int size;
 
-		len = skb->len - offset;
+		len = cskb->len - offset;
 		if (len > mss)
 			len = mss;
 
-		hsize = skb_headlen(skb) - offset;
+		hsize = skb_headlen(cskb) - offset;
 		if (hsize < 0)
 			hsize = 0;
 		if (hsize > len || !sg)
 			hsize = len;
 
-		if (!hsize && i >= nfrags) {
-			BUG_ON(fskb->len != len);
-
-			pos += len;
-			nskb = skb_clone(fskb, GFP_ATOMIC);
-			fskb = fskb->next;
+		nskb = __alloc_skb(hsize + doffset + headroom,
+				   GFP_ATOMIC, skb_alloc_rx_flag(head_skb),
+				   NUMA_NO_NODE);
 
-			if (unlikely(!nskb))
-				goto err;
-
-			hsize = skb_end_offset(nskb);
-			if (skb_cow_head(nskb, doffset + headroom)) {
-				kfree_skb(nskb);
-				goto err;
-			}
+		if (unlikely(!nskb))
+			goto err;
 
-			nskb->truesize += skb_end_offset(nskb) - hsize;
-			skb_release_head_state(nskb);
-			__skb_push(nskb, doffset);
-		} else {
-			nskb = __alloc_skb(hsize + doffset + headroom,
-					   GFP_ATOMIC, skb_alloc_rx_flag(skb),
-					   NUMA_NO_NODE);
-
-			if (unlikely(!nskb))
-				goto err;
-
-			skb_reserve(nskb, headroom);
-			__skb_put(nskb, doffset);
-		}
+		skb_reserve(nskb, headroom);
+		__skb_put(nskb, doffset);
 
 		if (segs)
 			tail->next = nskb;
@@ -2852,94 +2832,115 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 			segs = nskb;
 		tail = nskb;
 
-		__copy_skb_header(nskb, skb);
-		nskb->mac_len = skb->mac_len;
+		__copy_skb_header(nskb, head_skb);
+		nskb->mac_len = head_skb->mac_len;
 
 		skb_headers_offset_update(nskb, skb_headroom(nskb) - headroom);
 
-		skb_copy_from_linear_data_offset(skb, -tnl_hlen,
+		skb_copy_from_linear_data_offset(head_skb, -tnl_hlen,
 						 nskb->data - tnl_hlen,
 						 doffset + tnl_hlen);
 
-		if (fskb != skb_shinfo(skb)->frag_list)
-			goto perform_csum_check;
-
 		if (!sg) {
 			nskb->ip_summed = CHECKSUM_NONE;
-			nskb->csum = skb_copy_and_csum_bits(skb, offset,
+			nskb->csum = skb_copy_and_csum_bits(head_skb, tot_len,
 							    skb_put(nskb, len),
 							    len, 0);
+			offset += len;
 			continue;
 		}
 
 		frag = skb_shinfo(nskb)->frags;
 
-		skb_copy_from_linear_data_offset(skb, offset,
+		skb_copy_from_linear_data_offset(cskb, offset,
 						 skb_put(nskb, hsize), hsize);
+		offset += hsize;
 
-		skb_shinfo(nskb)->tx_flags = skb_shinfo(skb)->tx_flags & SKBTX_SHARED_FRAG;
+		data_len = 0;
+		nskb->data_len = len - hsize;
+		nskb->len += nskb->data_len;
+		nskb->truesize += nskb->data_len;
 
-		while (pos < offset + len && i < nfrags) {
-			*frag = skb_shinfo(skb)->frags[i];
-			__skb_frag_ref(frag);
-			size = skb_frag_size(frag);
+		skb_shinfo(nskb)->tx_flags = skb_shinfo(head_skb)->tx_flags & SKBTX_SHARED_FRAG;
+
+		while (data_len < nskb->data_len) {
+			int remain = nskb->data_len - data_len;
 
-			if (pos < offset) {
-				frag->page_offset += offset - pos;
-				skb_frag_size_sub(frag, offset - pos);
+			if (unlikely(cur_frag >= nfrags)) {
+				if (cskb == head_skb)
+					cskb = skb_shinfo(head_skb)->frag_list;
+				else
+					cskb = cskb->next;
+				if (!cskb) {
+					WARN_ON_ONCE(1);
+					goto err;
+				}
+				cur_frag = 0;
+				cur_frag_offset = 0;
+				nfrags = skb_shinfo(cskb)->nr_frags;
+				offset = 0;
+				if (skb_headlen(cskb)) {
+					char *data;
+					struct page *page;
+
+					remain = min_t(int, remain, skb_headlen(cskb));
+					if (likely(cskb->head_frag)) {
+						data = cskb->data;
+						page = virt_to_head_page(data);
+						get_page(page);
+					} else {
+						data = __netdev_alloc_frag(SKB_DATA_ALIGN(remain),
+									   GFP_ATOMIC);
+						/* Really this should not happen, fix the caller ! */
+						WARN_ON_ONCE(1);
+						if (!data)
+							goto err;
+						memcpy(data, cskb->data, remain);
+						page = virt_to_head_page(data);
+					}
+					frag->page.p = page;
+					frag->page_offset = data - (char *)page_address(page);
+					skb_frag_size_set(frag, remain);
+					frag++;
+					offset = remain;
+					continue;
+				}
 			}
+			*frag = skb_shinfo(cskb)->frags[cur_frag];
+			__skb_frag_ref(frag);
+
+			frag->page_offset += cur_frag_offset;
+			skb_frag_size_sub(frag, cur_frag_offset);
+			size = skb_frag_size(frag);
 
 			skb_shinfo(nskb)->nr_frags++;
 
-			if (pos + size <= offset + len) {
-				i++;
-				pos += size;
+			if (size <= remain) {
+				cur_frag++;
+				cur_frag_offset = 0;
+				data_len += size;
 			} else {
-				skb_frag_size_sub(frag, pos + size - (offset + len));
-				goto skip_fraglist;
+				skb_frag_size_set(frag, remain);
+				data_len += remain;
+				cur_frag_offset += remain;
 			}
 
 			frag++;
 		}
 
-		if (pos < offset + len) {
-			struct sk_buff *fskb2 = fskb;
-
-			BUG_ON(pos + fskb->len != offset + len);
-
-			pos += fskb->len;
-			fskb = fskb->next;
-
-			if (fskb2->next) {
-				fskb2 = skb_clone(fskb2, GFP_ATOMIC);
-				if (!fskb2)
-					goto err;
-			} else
-				skb_get(fskb2);
-
-			SKB_FRAG_ASSERT(nskb);
-			skb_shinfo(nskb)->frag_list = fskb2;
-		}
-
-skip_fraglist:
-		nskb->data_len = len - hsize;
-		nskb->len += nskb->data_len;
-		nskb->truesize += nskb->data_len;
-
-perform_csum_check:
 		if (!csum) {
 			nskb->csum = skb_checksum(nskb, doffset,
 						  nskb->len - doffset, 0);
 			nskb->ip_summed = CHECKSUM_NONE;
 		}
-	} while ((offset += len) < skb->len);
+	}
 
 	return segs;
 
 err:
-	while ((skb = segs)) {
-		segs = skb->next;
-		kfree_skb(skb);
+	while ((cskb = segs)) {
+		segs = cskb->next;
+		kfree_skb(cskb);
 	}
 	return ERR_PTR(err);
 }

^ permalink raw reply related	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-06 15:01                                             ` Eric Dumazet
@ 2013-11-07  0:36                                               ` Herbert Xu
  2013-11-07  1:03                                                 ` Eric Dumazet
  2013-11-07  2:52                                                 ` Jason Wang
  0 siblings, 2 replies; 163+ messages in thread
From: Herbert Xu @ 2013-11-07  0:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu,
	mwdalton, mst, Jason Wang

On Wed, Nov 06, 2013 at 07:01:10AM -0800, Eric Dumazet wrote:
> Have you thought about arches having PAGE_SIZE=65536, and how bad it is
> to use a full page per network frame ? It is lazy and x86 centered.

So instead if we were sending a full 64K packet on such an arch to
another guest, we'd now chop it up into 1.5K chunks and reassemble them.

> So after our patches, we now have an optimal situation, even on these
> arches.

Optimal only for physical incoming packets with no jumbo frames.

What's worse, I now realise that the coalesce thing isn't even
guaranteed to work.  It probably works in your benchmarks because
you're working with freshly allocated pages.

But once the system has been running for a while, I see nothing
in the virtio_net code that tries to prevent fragmentation.  Once
fragmentation sets in, you'll be back in the terrible situation
that we were in prior to the coalesce patch.

Jason/Michael (Tsirkin), am I missing something that would prevent
fragmentation of these buffers?

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-06 15:05                                           ` Eric Dumazet
@ 2013-11-07  0:39                                             ` Herbert Xu
  0 siblings, 0 replies; 163+ messages in thread
From: Herbert Xu @ 2013-11-07  0:39 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu,
	mwdalton, mst

On Wed, Nov 06, 2013 at 07:05:41AM -0800, Eric Dumazet wrote:
> 
> Sure, if your host has infinite amount of memory, we can remove all the
> silly and expensive and bore some checks we added in the stack against
> skb->truesize.
> 
> And yes, network stack will be faster.
> 
> But in real life, we have memory constraints.

With intra-page fragmentation, it's not clear that you'd be better
off with this patch in the long run anyway.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-06 19:47                                   ` Eric Dumazet
  2013-11-07  0:15                                     ` Eric Dumazet
@ 2013-11-07  0:43                                     ` Herbert Xu
  2013-11-07  6:22                                       ` Herbert Xu
  2013-11-07  7:11                                       ` gso: Attempt to handle mega-GRO packets Eric Dumazet
  1 sibling, 2 replies; 163+ messages in thread
From: Herbert Xu @ 2013-11-07  0:43 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Wed, Nov 06, 2013 at 11:47:21AM -0800, Eric Dumazet wrote:
> On Wed, 2013-11-06 at 22:39 +0800, Herbert Xu wrote:
> 
> > In fact, we never relied on the frag_list having headers anyway so
> > it's not hard to fix this.
> > 
> > Still totally untested but at least this has a chance of handling
> > the new virtio_net.
> 
> First try, machine doesn't crash, but things are not working.
> 
> [  433.232553] skbuff: skb_segment: illegal GSO fragment: 1514 1448
> [  433.340523] skbuff: skb_segment: illegal GSO fragment: 1514 1448skbuff: skb_segment: illegal GSO fragment: 1514 1448
> [  433.340578] skbuff: skb_segment: illegal GSO fragment: 1514 1448skbuff: skb_segment: illegal GSO fragment: 1514 1448
> [  433.340598] skbuff: skb_segment: illegal GSO fragment: 1514 1448skbuff: skb_segment: illegal GSO fragment: 1514 1448
> [  433.340620] skbuff: skb_segment: illegal GSO fragment: 1514 1448skbuff: skb_segment: illegal GSO fragment: 1514 1448
> [  433.340661] skbuff: skb_segment: illegal GSO fragment: 1514 1448<4>[  438.313019] net_ratelimit: 141 callbacks suppressed
> 
> To test this, I used a regular forwarding path between three hosts
> 
> A --->  B ----> C
> 
> I'll try a different way.
> 
> The frag_list would contain a bunch of frags, that we logically add to the bunch
> of frags found in the first skb shared_info structure.

Yeah I screwed up with the test, it needs an additional doffset
since the tail already has the headers pushed:

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3735fad..f0f85e0 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2816,8 +2816,6 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 			hsize = len;
 
 		if (!hsize && i >= nfrags) {
-			BUG_ON(fskb->len != len);
-
 			pos += len;
 			nskb = skb_clone(fskb, GFP_ATOMIC);
 			fskb = fskb->next;
@@ -2846,12 +2844,6 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 			__skb_put(nskb, doffset);
 		}
 
-		if (segs)
-			tail->next = nskb;
-		else
-			segs = nskb;
-		tail = nskb;
-
 		__copy_skb_header(nskb, skb);
 		nskb->mac_len = skb->mac_len;
 
@@ -2861,15 +2853,62 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 						 nskb->data - tnl_hlen,
 						 doffset + tnl_hlen);
 
-		if (fskb != skb_shinfo(skb)->frag_list)
-			goto perform_csum_check;
+		if (fskb != skb_shinfo(skb)->frag_list) {
+			struct sk_buff *nsegs;
+
+			if (nskb->len == len + doffset)
+				goto perform_csum_check;
+
+			if (skb_has_frag_list(nskb)) {
+				net_warn_ratelimited(
+					"skb_segment: "
+					"nested frag_list detected\n");
+				kfree(nskb);
+				err = -EINVAL;
+				goto err;
+			}
+
+			__skb_pull(nskb, doffset);
+			skb_shinfo(nskb)->gso_size = mss;
+			nsegs = skb_segment(nskb, features);
+
+			err = PTR_ERR(nsegs);
+			if (IS_ERR(nsegs)) {
+				kfree(nskb);
+				goto err;
+			}
+			err = -ENOMEM;
+
+			if (segs)
+				tail->next = nsegs;
+			else
+				segs = nsegs;
+
+			tail = nsegs;
+			while (tail->next)
+				tail = tail->next;
+
+			if (fskb && tail->len != len + doffset) {
+				net_warn_ratelimited(
+					"skb_segment: "
+					"illegal GSO fragment: %u %u\n",
+					tail->len, len + doffset);
+				kfree(nskb);
+				err = -EINVAL;
+				goto err;
+			}
+
+			len = nskb->len;
+			kfree(nskb);
+			continue;
+		}
 
 		if (!sg) {
 			nskb->ip_summed = CHECKSUM_NONE;
 			nskb->csum = skb_copy_and_csum_bits(skb, offset,
 							    skb_put(nskb, len),
 							    len, 0);
-			continue;
+			goto add_to_segs;
 		}
 
 		frag = skb_shinfo(nskb)->frags;
@@ -2905,15 +2944,25 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 		if (pos < offset + len) {
 			struct sk_buff *fskb2 = fskb;
 
-			BUG_ON(pos + fskb->len != offset + len);
+			if (pos + fskb->len != offset + len) {
+				net_warn_ratelimited(
+					"skb_segment: "
+					"illegal GSO trailer: %u %u\n",
+					pos + fskb->len, offset + len);
+				kfree(nskb);
+				err = -EINVAL;
+				goto err;
+			}
 
 			pos += fskb->len;
 			fskb = fskb->next;
 
 			if (fskb2->next) {
 				fskb2 = skb_clone(fskb2, GFP_ATOMIC);
-				if (!fskb2)
+				if (!fskb2) {
+					kfree(nskb);
 					goto err;
+				}
 			} else
 				skb_get(fskb2);
 
@@ -2932,6 +2981,13 @@ perform_csum_check:
 						  nskb->len - doffset, 0);
 			nskb->ip_summed = CHECKSUM_NONE;
 		}
+
+add_to_segs:
+		if (segs)
+			tail->next = nskb;
+		else
+			segs = nskb;
+		tail = nskb;
 	} while ((offset += len) < skb->len);
 
 	return segs;

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-07  0:15                                     ` Eric Dumazet
@ 2013-11-07  0:47                                       ` Herbert Xu
  2013-11-07  0:56                                         ` Eric Dumazet
  2013-11-07  1:13                                       ` Hannes Frederic Sowa
  1 sibling, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-11-07  0:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Wed, Nov 06, 2013 at 04:15:21PM -0800, Eric Dumazet wrote:
>
> Here is the patch I came into (I tested it and it works very fine)

Thanks.  However, unless I'm missing something aren't you now
copying every linear skb frag_list?  With the current code we
just do a clone and add the headers.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-07  0:47                                       ` Herbert Xu
@ 2013-11-07  0:56                                         ` Eric Dumazet
  2013-11-07  1:00                                           ` Herbert Xu
  0 siblings, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-11-07  0:56 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Thu, 2013-11-07 at 08:47 +0800, Herbert Xu wrote:
> On Wed, Nov 06, 2013 at 04:15:21PM -0800, Eric Dumazet wrote:
> >
> > Here is the patch I came into (I tested it and it works very fine)
> 
> Thanks.  However, unless I'm missing something aren't you now
> copying every linear skb frag_list?  With the current code we
> just do a clone and add the headers.

No copy at all.

As explained, all RX skb are now backed to a page frag.  (see
skb->head_frag)

For safety I probably need to check this is safe.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-07  0:56                                         ` Eric Dumazet
@ 2013-11-07  1:00                                           ` Herbert Xu
  2013-11-07  1:08                                             ` Eric Dumazet
  0 siblings, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-11-07  1:00 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Wed, Nov 06, 2013 at 04:56:03PM -0800, Eric Dumazet wrote:
> On Thu, 2013-11-07 at 08:47 +0800, Herbert Xu wrote:
> > On Wed, Nov 06, 2013 at 04:15:21PM -0800, Eric Dumazet wrote:
> > >
> > > Here is the patch I came into (I tested it and it works very fine)
> > 
> > Thanks.  However, unless I'm missing something aren't you now
> > copying every linear skb frag_list?  With the current code we
> > just do a clone and add the headers.
> 
> No copy at all.
> 
> As explained, all RX skb are now backed to a page frag.  (see
> skb->head_frag)

Even if we have called pskb_expand_head on it?

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-07  0:36                                               ` Herbert Xu
@ 2013-11-07  1:03                                                 ` Eric Dumazet
  2013-11-07  1:47                                                   ` Herbert Xu
  2013-11-07  2:52                                                 ` Jason Wang
  1 sibling, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-11-07  1:03 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu,
	mwdalton, mst, Jason Wang

On Thu, 2013-11-07 at 08:36 +0800, Herbert Xu wrote:
> On Wed, Nov 06, 2013 at 07:01:10AM -0800, Eric Dumazet wrote:
> > Have you thought about arches having PAGE_SIZE=65536, and how bad it is
> > to use a full page per network frame ? It is lazy and x86 centered.
> 
> So instead if we were sending a full 64K packet on such an arch to
> another guest, we'd now chop it up into 1.5K chunks and reassemble them.
> 

Yep, and speed is now better than before the patches.

I understand you do not believe it. But this is the truth.

And now your guest can receive a bunch of small UDP frames, without
having to drop them because sk->rcvbuf limit is hit.

> > So after our patches, we now have an optimal situation, even on these
> > arches.
> 
> Optimal only for physical incoming packets with no jumbo frames.

Have you actually tested this ?

> 
> What's worse, I now realise that the coalesce thing isn't even
> guaranteed to work.  It probably works in your benchmarks because
> you're working with freshly allocated pages.
> 

Oh well.

> But once the system has been running for a while, I see nothing
> in the virtio_net code that tries to prevent fragmentation.  Once
> fragmentation sets in, you'll be back in the terrible situation
> that we were in prior to the coalesce patch.
> 

There is no fragmentation, since we allocate 32Kb pages.

Michael Dalton worked on a patch to add EWMA for auto sizing and a
private page_frag per virtio queue, instead of using the per cpu one.

On x86 :

- All offloads enabled (average packet size should be >> MTU-size)

net-next trunk w/ virtio_net prior to 2613af0ed (PAGE_SIZE bufs): 14179.17Gb/s
net-next trunk (MTU-size bufs):  13390.69Gb/s
net-next trunk + auto-tune - 14358.41Gb/s

- guest_tso4/guest_csum disabled (forces MTU-sized packets on receiver)

net-next trunk w/ virtio_net prior to 2613af0ed: 4059.49Gb/s
net-next trunk (MTU 1500- packet takes two bufs due to sizing bug): 4174.30Gb/s
net-next trunk (MTU 1480- packet fits in one buf): 6672.16Gb/s
net-next trunk + auto-tune (MTU 1500- fixed, packet uses one buf) - 6791.28Gb/s

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-07  1:00                                           ` Herbert Xu
@ 2013-11-07  1:08                                             ` Eric Dumazet
  0 siblings, 0 replies; 163+ messages in thread
From: Eric Dumazet @ 2013-11-07  1:08 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Thu, 2013-11-07 at 09:00 +0800, Herbert Xu wrote:
> On Wed, Nov 06, 2013 at 04:56:03PM -0800, Eric Dumazet wrote:
> > On Thu, 2013-11-07 at 08:47 +0800, Herbert Xu wrote:
> > > On Wed, Nov 06, 2013 at 04:15:21PM -0800, Eric Dumazet wrote:
> > > >
> > > > Here is the patch I came into (I tested it and it works very fine)
> > > 
> > > Thanks.  However, unless I'm missing something aren't you now
> > > copying every linear skb frag_list?  With the current code we
> > > just do a clone and add the headers.
> > 
> > No copy at all.
> > 
> > As explained, all RX skb are now backed to a page frag.  (see
> > skb->head_frag)
> 
> Even if we have called pskb_expand_head on it?

Well, in this case we pulled only the headers in skb->head.

Only few buggy drivers do a pull of say 64 bytes in skb->head before
calling eth_type_trans() so could possibly have a few bytes of TCP
payload in skb->head

And this case is handled without copy.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-07  0:15                                     ` Eric Dumazet
  2013-11-07  0:47                                       ` Herbert Xu
@ 2013-11-07  1:13                                       ` Hannes Frederic Sowa
  2013-11-07  1:21                                         ` Eric Dumazet
  1 sibling, 1 reply; 163+ messages in thread
From: Hannes Frederic Sowa @ 2013-11-07  1:13 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Herbert Xu, Ben Hutchings, David Miller, christoph.paasch,
	netdev, hkchu, mwdalton

On Wed, Nov 06, 2013 at 04:15:21PM -0800, Eric Dumazet wrote:
> On Wed, 2013-11-06 at 11:47 -0800, Eric Dumazet wrote:
> > I'll try a different way.
> > 
> > The frag_list would contain a bunch of frags, that we logically add to the bunch
> > of frags found in the first skb shared_info structure.
> 
> Here is the patch I came into (I tested it and it works very fine)
> 
> The theory is that in GRO stack, all skbs use the head_frag trick,
> so even if one NIC pulled some payload into skb->head, we do not have to
> copy anything. Outside of GRO stack, we are not supposed to provide data
> in skb->head (I am speaking of the skb found on the frag_list extension,
> not the skb_head)
> 
> I put a fallback code, just in case, with a WARN_ON_ONCE() so that we
> can catch the offenders (if any) to fix them.
> 
> I renamed @skb to @skb_head to more clearly document this code.
> Same for @i renamed to @cur_frag

I wanted to understand this code more closely and tried it with a test case I
used for the UDP_CORK bugs and also for the tbf panic.

The packet is allocated as an UFO one and gets segmented by tbf.

# tc qdisc replace dev eth0 root tbf rate 200kbit latency 20ms burst 5kb
# ./udptest
(Just doing two writes of 200 bytes, then a write of 4096 bytes on a udp
socket. I can send you the source (or a stripped down version, because it got
realy noisy.))

[  370.372237] ------------[ cut here ]------------
[  370.374110] WARNING: CPU: 1 PID: 9359 at net/core/skbuff.c:2875 skb_segment+0x725/0x7a0()
[  370.382857] Modules linked in: sch_tbf(F) joydev nfsd auth_rpcgss nfs_acl lockd i2c_piix4 i2c_core virtio_balloon sunrpc serio_raw virtio_blk virtio_net virtio_pci virtio_ring ata_generic virtio pata_acpi
[  370.393546] CPU: 1 PID: 9359 Comm: udptest Tainted: GF            3.12.0+ #1
[  370.395541] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  370.397182]  0000000000000009 ffff88011804f8a8 ffffffff816423eb 0000000000000000
[  370.401220]  ffff88011804f8e0 ffffffff81067dcd ffff8800d2d52900 ffffea00045fac00
[  370.403912]  00000000000005c8 0000000000000000 ffff8800c45b7d10 ffff88011804f8f0
[  370.406681] Call Trace:
[  370.407514]  [<ffffffff816423eb>] dump_stack+0x45/0x56
[  370.409094]  [<ffffffff81067dcd>] warn_slowpath_common+0x7d/0xa0
[  370.421411]  [<ffffffff81067eaa>] warn_slowpath_null+0x1a/0x20
[  370.423776]  [<ffffffff8153f6f5>] skb_segment+0x725/0x7a0
[  370.425961]  [<ffffffff815b0322>] udp4_ufo_fragment+0xc2/0x120
[  370.430097]  [<ffffffff815b8add>] inet_gso_segment+0x11d/0x330
[  370.432763]  [<ffffffff812a0469>] ? selinux_ip_postroute+0x99/0x2c0
[  370.436632]  [<ffffffff8154c29c>] skb_mac_gso_segment+0x9c/0x180
[  370.439060]  [<ffffffff8154c3e0>] __skb_gso_segment+0x60/0xc0
[  370.446392]  [<ffffffffa00c9a5f>] tbf_enqueue+0x5f/0x1f0 [sch_tbf]
[  370.451242]  [<ffffffff8154cd2b>] dev_queue_xmit+0x24b/0x480
[  370.454330]  [<ffffffff81585289>] ip_finish_output+0x2c9/0x3b0
[  370.456631]  [<ffffffff815866b8>] ip_output+0x58/0x90
[  370.458439]  [<ffffffff81585e25>] ip_local_out+0x25/0x30
[  370.461258]  [<ffffffff815871c5>] ip_send_skb+0x15/0x50
[  370.463113]  [<ffffffff815abaaf>] udp_send_skb+0x20f/0x2a0
[  370.465031]  [<ffffffff815842b0>] ? ip_copy_metadata+0xc0/0xc0
[  370.467024]  [<ffffffff815ac9fa>] udp_sendmsg+0x2fa/0xa10
[  370.468884]  [<ffffffff8129a560>] ? sock_has_perm+0x70/0x90
[  370.470922]  [<ffffffff8153799c>] ? release_sock+0x10c/0x160
[  370.473651]  [<ffffffff815b8724>] inet_sendmsg+0x64/0xb0
[  370.475621]  [<ffffffff8129a693>] ? selinux_socket_sendmsg+0x23/0x30
[  370.478025]  [<ffffffff815335db>] sock_sendmsg+0x8b/0xc0
[  370.483445]  [<ffffffff812b193e>] ? security_netlbl_sid_to_secattr+0x6e/0xb0
[  370.487287]  [<ffffffff8162fae2>] ? netlbl_domhsh_hash+0x12/0x50
[  370.489441]  [<ffffffff8162fb3f>] ? netlbl_domhsh_search+0x1f/0x90
[  370.492770]  [<ffffffff81533781>] SYSC_sendto+0x121/0x1c0
[  370.495053]  [<ffffffff811a81ce>] ? alloc_file+0x1e/0xc0
[  370.497309]  [<ffffffff815305fc>] ? sock_alloc_file+0x9c/0x130
[  370.500255]  [<ffffffff8153429e>] SyS_sendto+0xe/0x10
[  370.503751]  [<ffffffff81651329>] system_call_fastpath+0x16/0x1b
[  370.506104] ---[ end trace 117f3806fa493b38 ]---

Maybe it does help.

Greetings,

  Hannes

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-07  1:13                                       ` Hannes Frederic Sowa
@ 2013-11-07  1:21                                         ` Eric Dumazet
  2013-11-07  1:34                                           ` Eric Dumazet
  0 siblings, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-11-07  1:21 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Herbert Xu, Ben Hutchings, David Miller, christoph.paasch,
	netdev, hkchu, mwdalton

On Thu, 2013-11-07 at 02:13 +0100, Hannes Frederic Sowa wrote:
> On Wed, Nov 06, 2013 at 04:15:21PM -0800, Eric Dumazet wrote:
> > On Wed, 2013-11-06 at 11:47 -0800, Eric Dumazet wrote:
> > > I'll try a different way.
> > > 
> > > The frag_list would contain a bunch of frags, that we logically add to the bunch
> > > of frags found in the first skb shared_info structure.
> > 
> > Here is the patch I came into (I tested it and it works very fine)
> > 
> > The theory is that in GRO stack, all skbs use the head_frag trick,
> > so even if one NIC pulled some payload into skb->head, we do not have to
> > copy anything. Outside of GRO stack, we are not supposed to provide data
> > in skb->head (I am speaking of the skb found on the frag_list extension,
> > not the skb_head)
> > 
> > I put a fallback code, just in case, with a WARN_ON_ONCE() so that we
> > can catch the offenders (if any) to fix them.
> > 
> > I renamed @skb to @skb_head to more clearly document this code.
> > Same for @i renamed to @cur_frag
> 
> I wanted to understand this code more closely and tried it with a test case I
> used for the UDP_CORK bugs and also for the tbf panic.
> 
> The packet is allocated as an UFO one and gets segmented by tbf.
> 
> # tc qdisc replace dev eth0 root tbf rate 200kbit latency 20ms burst 5kb
> # ./udptest
> (Just doing two writes of 200 bytes, then a write of 4096 bytes on a udp
> socket. I can send you the source (or a stripped down version, because it got
> realy noisy.))

Interesting :

                       if (cskb == head_skb)
                               cskb = skb_shinfo(head_skb)->frag_list;
                       else
                               cskb = cskb->next;
                       if (!cskb) {
                               WARN_ON_ONCE(1);
                               goto err;
                       }

So here either head_skb->frag_list is NULL, or the frag_list chain finishes too early.

More probably I have a bug in the code ;)

I'll take a look, thanks !

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-07  1:21                                         ` Eric Dumazet
@ 2013-11-07  1:34                                           ` Eric Dumazet
  2013-11-07  2:03                                             ` Hannes Frederic Sowa
  0 siblings, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-11-07  1:34 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Herbert Xu, Ben Hutchings, David Miller, christoph.paasch,
	netdev, hkchu, mwdalton

On Wed, 2013-11-06 at 17:21 -0800, Eric Dumazet wrote:
> On Thu, 2013-11-07 at 02:13 +0100, Hannes Frederic Sowa wrote:
> > On Wed, Nov 06, 2013 at 04:15:21PM -0800, Eric Dumazet wrote:
> > > On Wed, 2013-11-06 at 11:47 -0800, Eric Dumazet wrote:
> > > > I'll try a different way.
> > > > 
> > > > The frag_list would contain a bunch of frags, that we logically add to the bunch
> > > > of frags found in the first skb shared_info structure.
> > > 
> > > Here is the patch I came into (I tested it and it works very fine)
> > > 
> > > The theory is that in GRO stack, all skbs use the head_frag trick,
> > > so even if one NIC pulled some payload into skb->head, we do not have to
> > > copy anything. Outside of GRO stack, we are not supposed to provide data
> > > in skb->head (I am speaking of the skb found on the frag_list extension,
> > > not the skb_head)
> > > 
> > > I put a fallback code, just in case, with a WARN_ON_ONCE() so that we
> > > can catch the offenders (if any) to fix them.
> > > 
> > > I renamed @skb to @skb_head to more clearly document this code.
> > > Same for @i renamed to @cur_frag
> > 
> > I wanted to understand this code more closely and tried it with a test case I
> > used for the UDP_CORK bugs and also for the tbf panic.
> > 
> > The packet is allocated as an UFO one and gets segmented by tbf.
> > 
> > # tc qdisc replace dev eth0 root tbf rate 200kbit latency 20ms burst 5kb
> > # ./udptest
> > (Just doing two writes of 200 bytes, then a write of 4096 bytes on a udp
> > socket. I can send you the source (or a stripped down version, because it got
> > realy noisy.))
> 
> Interesting :
> 
>                        if (cskb == head_skb)
>                                cskb = skb_shinfo(head_skb)->frag_list;
>                        else
>                                cskb = cskb->next;
>                        if (!cskb) {
>                                WARN_ON_ONCE(1);
>                                goto err;
>                        }
> 
> So here either head_skb->frag_list is NULL, or the frag_list chain finishes too early.
> 
> More probably I have a bug in the code ;)

Oh yes, I missed a : data_len += remain;

at line 2906 :

	offset = remain;
+	data_len += remain;
	continue;

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-07  1:03                                                 ` Eric Dumazet
@ 2013-11-07  1:47                                                   ` Herbert Xu
  2013-11-07  2:02                                                     ` Eric Dumazet
  0 siblings, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-11-07  1:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu,
	mwdalton, mst, Jason Wang

On Wed, Nov 06, 2013 at 05:03:28PM -0800, Eric Dumazet wrote:
>
> > But once the system has been running for a while, I see nothing
> > in the virtio_net code that tries to prevent fragmentation.  Once
> > fragmentation sets in, you'll be back in the terrible situation
> > that we were in prior to the coalesce patch.
> 
> There is no fragmentation, since we allocate 32Kb pages.

Say the system is fragmented sufficiently that you'll end up with
0-order pages.  In that case you'll only ever be able to coalesce
two packets.

Real systems that run for more than a day do end up with seriously
fragmented memory.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-07  1:47                                                   ` Herbert Xu
@ 2013-11-07  2:02                                                     ` Eric Dumazet
  2013-11-07  2:08                                                       ` Eric Dumazet
                                                                         ` (2 more replies)
  0 siblings, 3 replies; 163+ messages in thread
From: Eric Dumazet @ 2013-11-07  2:02 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu,
	mwdalton, mst, Jason Wang

On Thu, 2013-11-07 at 09:47 +0800, Herbert Xu wrote:

> Say the system is fragmented sufficiently that you'll end up with
> 0-order pages.  In that case you'll only ever be able to coalesce
> two packets.

4K page will contain 2 frags and they will coalesce.

Performance will still be quite good.

We probably add a tweak, to not have any hole in this case.

> 
> Real systems that run for more than a day do end up with seriously
> fragmented memory.

Sure, but having shallow skbs in the first place help quite a bit.

There is no perfect solution, unless of course you change virtio_net to
provide different queues, with different frag sizes.

Sort of what NIU driver uses.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-07  1:34                                           ` Eric Dumazet
@ 2013-11-07  2:03                                             ` Hannes Frederic Sowa
  2013-11-07  3:05                                               ` Eric Dumazet
  0 siblings, 1 reply; 163+ messages in thread
From: Hannes Frederic Sowa @ 2013-11-07  2:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Herbert Xu, Ben Hutchings, David Miller, christoph.paasch,
	netdev, hkchu, mwdalton

On Wed, Nov 06, 2013 at 05:34:28PM -0800, Eric Dumazet wrote:
> On Wed, 2013-11-06 at 17:21 -0800, Eric Dumazet wrote:
> > On Thu, 2013-11-07 at 02:13 +0100, Hannes Frederic Sowa wrote:
> > > On Wed, Nov 06, 2013 at 04:15:21PM -0800, Eric Dumazet wrote:
> > > > On Wed, 2013-11-06 at 11:47 -0800, Eric Dumazet wrote:
> > > > > I'll try a different way.
> > > > > 
> > > > > The frag_list would contain a bunch of frags, that we logically add to the bunch
> > > > > of frags found in the first skb shared_info structure.
> > > > 
> > > > Here is the patch I came into (I tested it and it works very fine)
> > > > 
> > > > The theory is that in GRO stack, all skbs use the head_frag trick,
> > > > so even if one NIC pulled some payload into skb->head, we do not have to
> > > > copy anything. Outside of GRO stack, we are not supposed to provide data
> > > > in skb->head (I am speaking of the skb found on the frag_list extension,
> > > > not the skb_head)
> > > > 
> > > > I put a fallback code, just in case, with a WARN_ON_ONCE() so that we
> > > > can catch the offenders (if any) to fix them.
> > > > 
> > > > I renamed @skb to @skb_head to more clearly document this code.
> > > > Same for @i renamed to @cur_frag
> > > 
> > > I wanted to understand this code more closely and tried it with a test case I
> > > used for the UDP_CORK bugs and also for the tbf panic.
> > > 
> > > The packet is allocated as an UFO one and gets segmented by tbf.
> > > 
> > > # tc qdisc replace dev eth0 root tbf rate 200kbit latency 20ms burst 5kb
> > > # ./udptest
> > > (Just doing two writes of 200 bytes, then a write of 4096 bytes on a udp
> > > socket. I can send you the source (or a stripped down version, because it got
> > > realy noisy.))
> > 
> > Interesting :
> > 
> >                        if (cskb == head_skb)
> >                                cskb = skb_shinfo(head_skb)->frag_list;
> >                        else
> >                                cskb = cskb->next;
> >                        if (!cskb) {
> >                                WARN_ON_ONCE(1);
> >                                goto err;
> >                        }
> > 
> > So here either head_skb->frag_list is NULL, or the frag_list chain finishes too early.
> > 
> > More probably I have a bug in the code ;)
> 
> Oh yes, I missed a : data_len += remain;
> 
> at line 2906 :
> 
> 	offset = remain;
> +	data_len += remain;
> 	continue;

Hm, I still hit the WARN_ON_ONCE (same test as above) with this fixed up:

[   27.962782] ------------[ cut here ]------------
[   27.964728] WARNING: CPU: 1 PID: 485 at net/core/skbuff.c:2875 skb_segment+0x72d/0x7b0()
[   27.967179] Modules linked in: sch_tbf joydev i2c_piix4 virtio_balloon i2c_core serio_raw nfsd auth_rpcgss nfs_acl lockd sunrpc virtio_blk virtio_net virtio_pci virtio_ring virtio ata_generic pata_acpi
[   27.979403] CPU: 1 PID: 485 Comm: udptest Not tainted 3.12.0+ #5
[   27.982402] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   27.984125]  0000000000000009 ffff8800d581b8a8 ffffffff816423fb 0000000000000000
[   27.986873]  ffff8800d581b8e0 ffffffff81067dcd ffff8800372c2c00 ffffea0000de5800
[   27.990107]  00000000000005c8 0000000000000000 ffff8800372b4710 ffff8800d581b8f0
[   27.993473] Call Trace:
[   27.994506]  [<ffffffff816423fb>] dump_stack+0x45/0x56
[   27.996172]  [<ffffffff81067dcd>] warn_slowpath_common+0x7d/0xa0
[   27.998295]  [<ffffffff81067eaa>] warn_slowpath_null+0x1a/0x20
[   28.000127]  [<ffffffff8153f6fd>] skb_segment+0x72d/0x7b0
[   28.002090]  [<ffffffff815b0332>] udp4_ufo_fragment+0xc2/0x120
[   28.006193]  [<ffffffff815b8aed>] inet_gso_segment+0x11d/0x330
[   28.008292]  [<ffffffff812a0469>] ? selinux_ip_postroute+0x99/0x2c0
[   28.010469]  [<ffffffff8154c2ac>] skb_mac_gso_segment+0x9c/0x180
[   28.013065]  [<ffffffff8154c3f0>] __skb_gso_segment+0x60/0xc0
[   28.015326]  [<ffffffffa00e1a5f>] tbf_enqueue+0x5f/0x1f0 [sch_tbf]
[   28.017752]  [<ffffffff8154cd3b>] dev_queue_xmit+0x24b/0x480
[   28.019906]  [<ffffffff81585299>] ip_finish_output+0x2c9/0x3b0
[   28.022189]  [<ffffffff815866c8>] ip_output+0x58/0x90
[   28.024167]  [<ffffffff81585e35>] ip_local_out+0x25/0x30
[   28.026409]  [<ffffffff815871d5>] ip_send_skb+0x15/0x50
[   28.028517]  [<ffffffff815ababf>] udp_send_skb+0x20f/0x2a0
[   28.030700]  [<ffffffff815842c0>] ? ip_copy_metadata+0xc0/0xc0
[   28.033053]  [<ffffffff815aca0a>] udp_sendmsg+0x2fa/0xa10
[   28.035024]  [<ffffffff8129a560>] ? sock_has_perm+0x70/0x90
[   28.042218]  [<ffffffff8153799c>] ? release_sock+0x10c/0x160
[   28.044222]  [<ffffffff815b8734>] inet_sendmsg+0x64/0xb0
[   28.047297]  [<ffffffff8129a693>] ? selinux_socket_sendmsg+0x23/0x30
[   28.049649]  [<ffffffff815335db>] sock_sendmsg+0x8b/0xc0
[   28.052800]  [<ffffffff812b193e>] ? security_netlbl_sid_to_secattr+0x6e/0xb0
[   28.055594]  [<ffffffff811bdc45>] ? __d_alloc+0x25/0x180
[   28.057651]  [<ffffffff8162fb4f>] ? netlbl_domhsh_search+0x1f/0x90
[   28.060103]  [<ffffffff81533781>] SYSC_sendto+0x121/0x1c0
[   28.062332]  [<ffffffff811a81ce>] ? alloc_file+0x1e/0xc0
[   28.064561]  [<ffffffff815305fc>] ? sock_alloc_file+0x9c/0x130
[   28.066758]  [<ffffffff8153429e>] SyS_sendto+0xe/0x10
[   28.072865]  [<ffffffff81651329>] system_call_fastpath+0x16/0x1b
[   28.074981] ---[ end trace 351089f5102f0c6a ]---

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-07  2:02                                                     ` Eric Dumazet
@ 2013-11-07  2:08                                                       ` Eric Dumazet
  2013-11-07  2:15                                                       ` Herbert Xu
  2013-11-07  5:56                                                       ` Michael S. Tsirkin
  2 siblings, 0 replies; 163+ messages in thread
From: Eric Dumazet @ 2013-11-07  2:08 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu,
	mwdalton, mst, Jason Wang

On Wed, 2013-11-06 at 18:02 -0800, Eric Dumazet wrote:
> On Thu, 2013-11-07 at 09:47 +0800, Herbert Xu wrote:
> 
> > Say the system is fragmented sufficiently that you'll end up with
> > 0-order pages.  In that case you'll only ever be able to coalesce
> > two packets.
> 
> 4K page will contain 2 frags and they will coalesce.
> 
> Performance will still be quite good.
> 
> We probably add a tweak, to not have any hole in this case.

After skb_page_frag_refill() call, its trivial to check if the page is
order-0.

If yes, use the whole remaining space, instead of MAX_PACKET_LEN

-> If memory is fragmented, we switch back to old behavior.

Really, its that simple.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-07  2:02                                                     ` Eric Dumazet
  2013-11-07  2:08                                                       ` Eric Dumazet
@ 2013-11-07  2:15                                                       ` Herbert Xu
  2013-11-07  2:37                                                         ` Eric Dumazet
  2013-11-07  5:56                                                       ` Michael S. Tsirkin
  2 siblings, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-11-07  2:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu,
	mwdalton, mst, Jason Wang

On Wed, Nov 06, 2013 at 06:02:38PM -0800, Eric Dumazet wrote:
> 
> 4K page will contain 2 frags and they will coalesce.
> 
> Performance will still be quite good.
>
> We probably add a tweak, to not have any hole in this case.

Also have you considered the security aspect of this? If you have
two skbs sharing a page, and one gets transmitted to a third party
using zero-copy, the other unrelated skb's content may become visible
where it shouldn't.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-07  2:15                                                       ` Herbert Xu
@ 2013-11-07  2:37                                                         ` Eric Dumazet
  2013-11-07  2:41                                                           ` Herbert Xu
  0 siblings, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-11-07  2:37 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu,
	mwdalton, mst, Jason Wang

On Thu, 2013-11-07 at 10:15 +0800, Herbert Xu wrote:
> On Wed, Nov 06, 2013 at 06:02:38PM -0800, Eric Dumazet wrote:
> > 
> > 4K page will contain 2 frags and they will coalesce.
> > 
> > Performance will still be quite good.
> >
> > We probably add a tweak, to not have any hole in this case.
> 
> Also have you considered the security aspect of this? If you have
> two skbs sharing a page, and one gets transmitted to a third party
> using zero-copy, the other unrelated skb's content may become visible
> where it shouldn't.

If the hypervisor is doomed, there is nothing we can do.

virtio_net owns the pages, and relies on hypervisor doing the right
thing.

That you use part of the page, is really irrelevant.

It seems you are speaking of virtio_net sending frames, but its about
receiving frames here.

We receive frames, delivered by the trusted hypervisor.

OK, I will shut up now, since apparently I really upset you.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-07  2:37                                                         ` Eric Dumazet
@ 2013-11-07  2:41                                                           ` Herbert Xu
  0 siblings, 0 replies; 163+ messages in thread
From: Herbert Xu @ 2013-11-07  2:41 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu,
	mwdalton, mst, Jason Wang

On Wed, Nov 06, 2013 at 06:37:35PM -0800, Eric Dumazet wrote:
>
> If the hypervisor is doomed, there is nothing we can do.

It's not just the hypervisor you know.  There is also DMA etc.
The hypervisor doesn't know that there is stuff in the page that
should not be exposed.

Also one day we may want to do zero-copy receive in which case
this would be even more important.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-07  0:36                                               ` Herbert Xu
  2013-11-07  1:03                                                 ` Eric Dumazet
@ 2013-11-07  2:52                                                 ` Jason Wang
  1 sibling, 0 replies; 163+ messages in thread
From: Jason Wang @ 2013-11-07  2:52 UTC (permalink / raw)
  To: Herbert Xu, Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu,
	mwdalton, mst

On 11/07/2013 08:36 AM, Herbert Xu wrote:
> On Wed, Nov 06, 2013 at 07:01:10AM -0800, Eric Dumazet wrote:
>> Have you thought about arches having PAGE_SIZE=65536, and how bad it is
>> to use a full page per network frame ? It is lazy and x86 centered.
> So instead if we were sending a full 64K packet on such an arch to
> another guest, we'd now chop it up into 1.5K chunks and reassemble them.
>
>> So after our patches, we now have an optimal situation, even on these
>> arches.
> Optimal only for physical incoming packets with no jumbo frames.
>
> What's worse, I now realise that the coalesce thing isn't even
> guaranteed to work.  It probably works in your benchmarks because
> you're working with freshly allocated pages.
>
> But once the system has been running for a while, I see nothing
> in the virtio_net code that tries to prevent fragmentation.  Once
> fragmentation sets in, you'll be back in the terrible situation
> that we were in prior to the coalesce patch.
>
> Jason/Michael (Tsirkin), am I missing something that would prevent
> fragmentation of these buffers?
>
> Cheers,

No. Maybe we can use per-queue buffers instead.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-07  2:03                                             ` Hannes Frederic Sowa
@ 2013-11-07  3:05                                               ` Eric Dumazet
  2013-11-07  6:59                                                 ` Eric Dumazet
  0 siblings, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-11-07  3:05 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Herbert Xu, Ben Hutchings, David Miller, christoph.paasch,
	netdev, hkchu, mwdalton

On Thu, 2013-11-07 at 03:03 +0100, Hannes Frederic Sowa wrote:

> > Oh yes, I missed a : data_len += remain;
> > 
> > at line 2906 :
> > 
> > 	offset = remain;
> > +	data_len += remain;
> > 	continue;
> 
> Hm, I still hit the WARN_ON_ONCE (same test as above) with this fixed up:
> 

Thanks, I'll track this and send a v2

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-07  2:02                                                     ` Eric Dumazet
  2013-11-07  2:08                                                       ` Eric Dumazet
  2013-11-07  2:15                                                       ` Herbert Xu
@ 2013-11-07  5:56                                                       ` Michael S. Tsirkin
  2013-11-07  7:07                                                         ` Eric Dumazet
  2 siblings, 1 reply; 163+ messages in thread
From: Michael S. Tsirkin @ 2013-11-07  5:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Herbert Xu, Ben Hutchings, David Miller, christoph.paasch,
	netdev, hkchu, mwdalton, Jason Wang

On Wed, Nov 06, 2013 at 06:02:38PM -0800, Eric Dumazet wrote:
> On Thu, 2013-11-07 at 09:47 +0800, Herbert Xu wrote:
> 
> > Say the system is fragmented sufficiently that you'll end up with
> > 0-order pages.  In that case you'll only ever be able to coalesce
> > two packets.
> 
> 4K page will contain 2 frags and they will coalesce.
> 
> Performance will still be quite good.
> 
> We probably add a tweak, to not have any hole in this case.
> 
> > 
> > Real systems that run for more than a day do end up with seriously
> > fragmented memory.
> 
> Sure, but having shallow skbs in the first place help quite a bit.
> 
> There is no perfect solution, unless of course you change virtio_net to
> provide different queues, with different frag sizes.
> 
> Sort of what NIU driver uses.
> 

I considered doing this but won't this mean packets can get reordered?
In practice we need to maintain ordering of RX frames within
a given flow, correct?

-- 
MST

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-07  0:43                                     ` Herbert Xu
@ 2013-11-07  6:22                                       ` Herbert Xu
  2013-11-07  7:03                                         ` [1/3] gso: Add to segs at end of loop in skb_segment Herbert Xu
  2013-11-07  7:11                                       ` gso: Attempt to handle mega-GRO packets Eric Dumazet
  1 sibling, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-11-07  6:22 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Thu, Nov 07, 2013 at 08:43:39AM +0800, Herbert Xu wrote:
>
> Yeah I screwed up with the test, it needs an additional doffset
> since the tail already has the headers pushed:

OK the patch has passed my tests and I will clean it up for
submission.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-07  3:05                                               ` Eric Dumazet
@ 2013-11-07  6:59                                                 ` Eric Dumazet
  0 siblings, 0 replies; 163+ messages in thread
From: Eric Dumazet @ 2013-11-07  6:59 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Herbert Xu, Ben Hutchings, David Miller, christoph.paasch,
	netdev, hkchu, mwdalton

On Wed, 2013-11-06 at 19:05 -0800, Eric Dumazet wrote:
> On Thu, 2013-11-07 at 03:03 +0100, Hannes Frederic Sowa wrote:
> 
> > > Oh yes, I missed a : data_len += remain;
> > > 
> > > at line 2906 :
> > > 
> > > 	offset = remain;
> > > +	data_len += remain;
> > > 	continue;
> > 
> > Hm, I still hit the WARN_ON_ONCE (same test as above) with this fixed up:
> > 
> 
> Thanks, I'll track this and send a v2
> 

Here is a new version before I turn off my laptop for the night.

It survived my tests and makes the code cleaner.

Thanks !

 net/core/skbuff.c |  190 +++++++++++++++++++++-----------------------
 1 file changed, 93 insertions(+), 97 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3735fad..eccd434 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2771,80 +2771,57 @@ EXPORT_SYMBOL_GPL(skb_pull_rcsum);
  *	a pointer to the first in a list of new skbs for the segments.
  *	In case of error it returns ERR_PTR(err).
  */
-struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
+struct sk_buff *skb_segment(struct sk_buff *head_skb, netdev_features_t features)
 {
 	struct sk_buff *segs = NULL;
 	struct sk_buff *tail = NULL;
-	struct sk_buff *fskb = skb_shinfo(skb)->frag_list;
-	unsigned int mss = skb_shinfo(skb)->gso_size;
-	unsigned int doffset = skb->data - skb_mac_header(skb);
-	unsigned int offset = doffset;
-	unsigned int tnl_hlen = skb_tnl_header_len(skb);
+	struct sk_buff *cskb = head_skb;
+	unsigned int mss = skb_shinfo(head_skb)->gso_size;
+	unsigned int doffset = head_skb->data - skb_mac_header(head_skb);
+	unsigned int tot_len; /* should reach head_skb->len at the end */
+	unsigned int offset = doffset; /* offset in cskb->data */
+	unsigned int tnl_hlen = skb_tnl_header_len(head_skb);
 	unsigned int headroom;
 	unsigned int len;
 	__be16 proto;
 	bool csum;
 	int sg = !!(features & NETIF_F_SG);
-	int nfrags = skb_shinfo(skb)->nr_frags;
+	int cur_frag = 0, nfrags = skb_shinfo(cskb)->nr_frags;
+	unsigned int data_len, cur_frag_offset = 0;
 	int err = -ENOMEM;
-	int i = 0;
-	int pos;
 
-	proto = skb_network_protocol(skb);
+	proto = skb_network_protocol(head_skb);
 	if (unlikely(!proto))
 		return ERR_PTR(-EINVAL);
 
 	csum = !!can_checksum_protocol(features, proto);
-	__skb_push(skb, doffset);
-	headroom = skb_headroom(skb);
-	pos = skb_headlen(skb);
+	__skb_push(head_skb, doffset);
+	headroom = skb_headroom(head_skb);
 
-	do {
+	for (tot_len = doffset; tot_len < head_skb->len; tot_len += len) {
 		struct sk_buff *nskb;
 		skb_frag_t *frag;
-		int hsize;
-		int size;
+		int hsize, size, remain;
 
-		len = skb->len - offset;
+		len = head_skb->len - tot_len;
 		if (len > mss)
 			len = mss;
 
-		hsize = skb_headlen(skb) - offset;
+		hsize = skb_headlen(cskb) - offset;
 		if (hsize < 0)
 			hsize = 0;
 		if (hsize > len || !sg)
 			hsize = len;
 
-		if (!hsize && i >= nfrags) {
-			BUG_ON(fskb->len != len);
-
-			pos += len;
-			nskb = skb_clone(fskb, GFP_ATOMIC);
-			fskb = fskb->next;
+		nskb = __alloc_skb(hsize + doffset + headroom,
+				   GFP_ATOMIC, skb_alloc_rx_flag(head_skb),
+				   NUMA_NO_NODE);
 
-			if (unlikely(!nskb))
-				goto err;
-
-			hsize = skb_end_offset(nskb);
-			if (skb_cow_head(nskb, doffset + headroom)) {
-				kfree_skb(nskb);
-				goto err;
-			}
+		if (unlikely(!nskb))
+			goto err;
 
-			nskb->truesize += skb_end_offset(nskb) - hsize;
-			skb_release_head_state(nskb);
-			__skb_push(nskb, doffset);
-		} else {
-			nskb = __alloc_skb(hsize + doffset + headroom,
-					   GFP_ATOMIC, skb_alloc_rx_flag(skb),
-					   NUMA_NO_NODE);
-
-			if (unlikely(!nskb))
-				goto err;
-
-			skb_reserve(nskb, headroom);
-			__skb_put(nskb, doffset);
-		}
+		skb_reserve(nskb, headroom);
+		__skb_put(nskb, doffset);
 
 		if (segs)
 			tail->next = nskb;
@@ -2852,94 +2829,113 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 			segs = nskb;
 		tail = nskb;
 
-		__copy_skb_header(nskb, skb);
-		nskb->mac_len = skb->mac_len;
+		__copy_skb_header(nskb, head_skb);
+		nskb->mac_len = head_skb->mac_len;
 
 		skb_headers_offset_update(nskb, skb_headroom(nskb) - headroom);
 
-		skb_copy_from_linear_data_offset(skb, -tnl_hlen,
+		skb_copy_from_linear_data_offset(head_skb, -tnl_hlen,
 						 nskb->data - tnl_hlen,
 						 doffset + tnl_hlen);
 
-		if (fskb != skb_shinfo(skb)->frag_list)
-			goto perform_csum_check;
-
 		if (!sg) {
 			nskb->ip_summed = CHECKSUM_NONE;
-			nskb->csum = skb_copy_and_csum_bits(skb, offset,
+			nskb->csum = skb_copy_and_csum_bits(head_skb, tot_len,
 							    skb_put(nskb, len),
 							    len, 0);
+			offset += len;
 			continue;
 		}
 
 		frag = skb_shinfo(nskb)->frags;
 
-		skb_copy_from_linear_data_offset(skb, offset,
+		skb_copy_from_linear_data_offset(cskb, offset,
 						 skb_put(nskb, hsize), hsize);
+		offset += hsize;
 
-		skb_shinfo(nskb)->tx_flags = skb_shinfo(skb)->tx_flags & SKBTX_SHARED_FRAG;
-
-		while (pos < offset + len && i < nfrags) {
-			*frag = skb_shinfo(skb)->frags[i];
-			__skb_frag_ref(frag);
-			size = skb_frag_size(frag);
+		nskb->data_len = len - hsize;
+		nskb->len += nskb->data_len;
+		nskb->truesize += nskb->data_len;
 
-			if (pos < offset) {
-				frag->page_offset += offset - pos;
-				skb_frag_size_sub(frag, offset - pos);
+		skb_shinfo(nskb)->tx_flags = skb_shinfo(head_skb)->tx_flags & SKBTX_SHARED_FRAG;
+
+		for (data_len = 0; data_len < nskb->data_len; data_len += remain) {
+			remain = nskb->data_len - data_len;
+			if (unlikely(cur_frag >= nfrags)) {
+				if (cskb == head_skb)
+					cskb = skb_shinfo(head_skb)->frag_list;
+				else
+					cskb = cskb->next;
+				if (!cskb) {
+					WARN_ON_ONCE(1);
+					goto err;
+				}
+				cur_frag = 0;
+				cur_frag_offset = 0;
+				nfrags = skb_shinfo(cskb)->nr_frags;
+				offset = 0;
+				if (skb_headlen(cskb)) {
+					char *data;
+					struct page *page;
+
+					remain = min_t(int, remain, skb_headlen(cskb));
+
+					if (likely(cskb->head_frag)) {
+						data = cskb->data;
+						page = virt_to_head_page(data);
+						get_page(page);
+					} else {
+						data = __netdev_alloc_frag(SKB_DATA_ALIGN(remain),
+									   GFP_ATOMIC);
+						/* Really this should not happen, fix the caller ! */
+						WARN_ON_ONCE(1);
+						if (!data)
+							goto err;
+						memcpy(data, cskb->data, remain);
+						page = virt_to_head_page(data);
+					}
+					frag->page.p = page;
+					frag->page_offset = data - (char *)page_address(page);
+					skb_frag_size_set(frag, remain);
+					frag++;
+					offset = remain;
+					continue;
+				}
 			}
+			*frag = skb_shinfo(cskb)->frags[cur_frag];
+			__skb_frag_ref(frag);
 
-			skb_shinfo(nskb)->nr_frags++;
+			frag->page_offset += cur_frag_offset;
+			skb_frag_size_sub(frag, cur_frag_offset);
+			size = skb_frag_size(frag);
 
-			if (pos + size <= offset + len) {
-				i++;
-				pos += size;
+			if (size <= remain) {
+				cur_frag++;
+				cur_frag_offset = 0;
+				remain = size;
 			} else {
-				skb_frag_size_sub(frag, pos + size - (offset + len));
-				goto skip_fraglist;
+				skb_frag_size_set(frag, remain);
+				cur_frag_offset += remain;
 			}
 
 			frag++;
 		}
 
-		if (pos < offset + len) {
-			struct sk_buff *fskb2 = fskb;
-
-			BUG_ON(pos + fskb->len != offset + len);
-
-			pos += fskb->len;
-			fskb = fskb->next;
-
-			if (fskb2->next) {
-				fskb2 = skb_clone(fskb2, GFP_ATOMIC);
-				if (!fskb2)
-					goto err;
-			} else
-				skb_get(fskb2);
-
-			SKB_FRAG_ASSERT(nskb);
-			skb_shinfo(nskb)->frag_list = fskb2;
-		}
-
-skip_fraglist:
-		nskb->data_len = len - hsize;
-		nskb->len += nskb->data_len;
-		nskb->truesize += nskb->data_len;
+		skb_shinfo(nskb)->nr_frags = frag - skb_shinfo(nskb)->frags;
 
-perform_csum_check:
 		if (!csum) {
 			nskb->csum = skb_checksum(nskb, doffset,
 						  nskb->len - doffset, 0);
 			nskb->ip_summed = CHECKSUM_NONE;
 		}
-	} while ((offset += len) < skb->len);
+	}
 
 	return segs;
 
 err:
-	while ((skb = segs)) {
-		segs = skb->next;
-		kfree_skb(skb);
+	while ((cskb = segs)) {
+		segs = cskb->next;
+		kfree_skb(cskb);
 	}
 	return ERR_PTR(err);
 }

^ permalink raw reply related	[flat|nested] 163+ messages in thread

* [1/3] gso: Add to segs at end of loop in skb_segment
  2013-11-07  6:22                                       ` Herbert Xu
@ 2013-11-07  7:03                                         ` Herbert Xu
  2013-11-07  7:06                                           ` [2/3] gso: Handle new frag_list of frags GRO packets Herbert Xu
  0 siblings, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-11-07  7:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

This patch moves the addition to the segs list to the end of the
loop in skb_segment.  This is to allow the following patch to add
a list to segs without including nskb.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3735fad..88b7dc6 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2846,12 +2846,6 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 			__skb_put(nskb, doffset);
 		}
 
-		if (segs)
-			tail->next = nskb;
-		else
-			segs = nskb;
-		tail = nskb;
-
 		__copy_skb_header(nskb, skb);
 		nskb->mac_len = skb->mac_len;
 
@@ -2869,7 +2863,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 			nskb->csum = skb_copy_and_csum_bits(skb, offset,
 							    skb_put(nskb, len),
 							    len, 0);
-			continue;
+			goto add_to_segs;
 		}
 
 		frag = skb_shinfo(nskb)->frags;
@@ -2912,8 +2906,10 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 
 			if (fskb2->next) {
 				fskb2 = skb_clone(fskb2, GFP_ATOMIC);
-				if (!fskb2)
+				if (!fskb2) {
+					kfree(nskb);
 					goto err;
+				}
 			} else
 				skb_get(fskb2);
 
@@ -2932,6 +2928,13 @@ perform_csum_check:
 						  nskb->len - doffset, 0);
 			nskb->ip_summed = CHECKSUM_NONE;
 		}
+
+add_to_segs:
+		if (segs)
+			tail->next = nskb;
+		else
+			segs = nskb;
+		tail = nskb;
 	} while ((offset += len) < skb->len);
 
 	return segs;
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 163+ messages in thread

* [2/3] gso: Handle new frag_list of frags GRO packets
  2013-11-07  7:03                                         ` [1/3] gso: Add to segs at end of loop in skb_segment Herbert Xu
@ 2013-11-07  7:06                                           ` Herbert Xu
  2013-11-07  7:08                                             ` [3/3] gso: Handle malicious GRO packets without crashing Herbert Xu
                                                               ` (2 more replies)
  0 siblings, 3 replies; 163+ messages in thread
From: Herbert Xu @ 2013-11-07  7:06 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

Recently GRO started generating packets with frag_lists of frags.
This was not handled by GSO, thus leading to a crash.

Thankfully these packets are of a regular form and are easy to
handle.  This patch handles them by calling skb_segment for each 
frag_list entry.  The depth of recursion is limited to just one.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 88b7dc6..bcc3f1c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2816,8 +2816,6 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 			hsize = len;
 
 		if (!hsize && i >= nfrags) {
-			BUG_ON(fskb->len != len);
-
 			pos += len;
 			nskb = skb_clone(fskb, GFP_ATOMIC);
 			fskb = fskb->next;
@@ -2855,8 +2853,40 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 						 nskb->data - tnl_hlen,
 						 doffset + tnl_hlen);
 
-		if (fskb != skb_shinfo(skb)->frag_list)
-			goto perform_csum_check;
+		if (fskb != skb_shinfo(skb)->frag_list) {
+			struct sk_buff *nsegs;
+
+			if (nskb->len == len + doffset)
+				goto perform_csum_check;
+
+			SKB_FRAG_ASSERT(nskb);
+
+			__skb_pull(nskb, doffset);
+			skb_shinfo(nskb)->gso_size = mss;
+			nsegs = skb_segment(nskb, features);
+
+			err = PTR_ERR(nsegs);
+			if (IS_ERR(nsegs)) {
+				kfree(nskb);
+				goto err;
+			}
+			err = -ENOMEM;
+
+			if (segs)
+				tail->next = nsegs;
+			else
+				segs = nsegs;
+
+			tail = nsegs;
+			while (tail->next)
+				tail = tail->next;
+
+			BUG_ON(fskb && tail->len != len + doffset);
+
+			len = nskb->len;
+			kfree(nskb);
+			continue;
+		}
 
 		if (!sg) {
 			nskb->ip_summed = CHECKSUM_NONE;
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-07  5:56                                                       ` Michael S. Tsirkin
@ 2013-11-07  7:07                                                         ` Eric Dumazet
  0 siblings, 0 replies; 163+ messages in thread
From: Eric Dumazet @ 2013-11-07  7:07 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Herbert Xu, Ben Hutchings, David Miller, christoph.paasch,
	netdev, hkchu, mwdalton, Jason Wang

On Thu, 2013-11-07 at 07:56 +0200, Michael S. Tsirkin wrote:

> I considered doing this but won't this mean packets can get reordered?
> In practice we need to maintain ordering of RX frames within
> a given flow, correct?

Sorry, I was referring to two pools of frags, instead of a single one.

One pool of frags of (1500 + hdr) bytes
One pool of frags of 4096 bytes

But its still one logical queue.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* [3/3] gso: Handle malicious GRO packets without crashing
  2013-11-07  7:06                                           ` [2/3] gso: Handle new frag_list of frags GRO packets Herbert Xu
@ 2013-11-07  7:08                                             ` Herbert Xu
  2013-11-07 18:18                                               ` Ben Hutchings
  2013-11-07 19:13                                               ` Sergei Shtylyov
  2013-11-07 18:16                                             ` [2/3] gso: Handle new frag_list of frags GRO packets Ben Hutchings
  2013-11-11 18:52                                             ` Herbert Xu
  2 siblings, 2 replies; 163+ messages in thread
From: Herbert Xu @ 2013-11-07  7:08 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

As virtio_net can now generate GRO frag_list packets without
sufficient verification, we need to handle malicious GRO packets
thrown at us.

This patch converts to affected BUG_ONs in skb_segment to rate-
limited warnings.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index bcc3f1c..fb1106d 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2881,7 +2881,15 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 			while (tail->next)
 				tail = tail->next;
 
-			BUG_ON(fskb && tail->len != len + doffset);
+			if (fskb && tail->len != len + doffset) {
+				net_warn_ratelimited(
+					"skb_segment: "
+					"illegal GSO fragment: %u %u\n",
+					tail->len, len + doffset);
+				kfree(nskb);
+				err = -EINVAL;
+				goto err;
+			}
 
 			len = nskb->len;
 			kfree(nskb);
@@ -2929,7 +2937,15 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 		if (pos < offset + len) {
 			struct sk_buff *fskb2 = fskb;
 
-			BUG_ON(pos + fskb->len != offset + len);
+			if (pos + fskb->len != offset + len) {
+				net_warn_ratelimited(
+					"skb_segment: "
+					"illegal GSO trailer: %u %u\n",
+					pos + fskb->len, offset + len);
+				kfree(nskb);
+				err = -EINVAL;
+				goto err;
+			}
 
 			pos += fskb->len;
 			fskb = fskb->next;
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-07  0:43                                     ` Herbert Xu
  2013-11-07  6:22                                       ` Herbert Xu
@ 2013-11-07  7:11                                       ` Eric Dumazet
  2013-11-07  7:15                                         ` Herbert Xu
  1 sibling, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-11-07  7:11 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Thu, 2013-11-07 at 08:43 +0800, Herbert Xu wrote:

> -		if (fskb != skb_shinfo(skb)->frag_list)
> -			goto perform_csum_check;
> +		if (fskb != skb_shinfo(skb)->frag_list) {
> +			struct sk_buff *nsegs;
> +
> +			if (nskb->len == len + doffset)
> +				goto perform_csum_check;
> +
> +			if (skb_has_frag_list(nskb)) {
> +				net_warn_ratelimited(
> +					"skb_segment: "
> +					"nested frag_list detected\n");
> +				kfree(nskb);
> +				err = -EINVAL;
> +				goto err;
> +			}
> +
> +			__skb_pull(nskb, doffset);
> +			skb_shinfo(nskb)->gso_size = mss;
> +			nsegs = skb_segment(nskb, features);
> +

This still assumes each skb found in frag_list has a exact multiple of
@mss bytes, and that the initial skb also ends at a right boundary.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-07  7:11                                       ` gso: Attempt to handle mega-GRO packets Eric Dumazet
@ 2013-11-07  7:15                                         ` Herbert Xu
  2013-11-07  7:17                                           ` Herbert Xu
  2013-11-07  7:31                                           ` Eric Dumazet
  0 siblings, 2 replies; 163+ messages in thread
From: Herbert Xu @ 2013-11-07  7:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Wed, Nov 06, 2013 at 11:11:02PM -0800, Eric Dumazet wrote:
> On Thu, 2013-11-07 at 08:43 +0800, Herbert Xu wrote:
> 
> > -		if (fskb != skb_shinfo(skb)->frag_list)
> > -			goto perform_csum_check;
> > +		if (fskb != skb_shinfo(skb)->frag_list) {
> > +			struct sk_buff *nsegs;
> > +
> > +			if (nskb->len == len + doffset)
> > +				goto perform_csum_check;
> > +
> > +			if (skb_has_frag_list(nskb)) {
> > +				net_warn_ratelimited(
> > +					"skb_segment: "
> > +					"nested frag_list detected\n");
> > +				kfree(nskb);
> > +				err = -EINVAL;
> > +				goto err;
> > +			}
> > +
> > +			__skb_pull(nskb, doffset);
> > +			skb_shinfo(nskb)->gso_size = mss;
> > +			nsegs = skb_segment(nskb, features);
> > +
> 
> This still assumes each skb found in frag_list has a exact multiple of
> @mss bytes, and that the initial skb also ends at a right boundary.

So what in our stack violates this assumption? We've never handled
arbitrary frag_lists in GSO and I see no reason why we need to start
doing that now.

Also GRO was designed to only merge packets that satisfy these
assumptions so that through GSO the original packets can be
recovered without losing end-to-end connectivity.  This is really
important for routers/switches.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-07  7:15                                         ` Herbert Xu
@ 2013-11-07  7:17                                           ` Herbert Xu
  2013-11-07  7:31                                           ` Eric Dumazet
  1 sibling, 0 replies; 163+ messages in thread
From: Herbert Xu @ 2013-11-07  7:17 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Thu, Nov 07, 2013 at 03:15:31PM +0800, Herbert Xu wrote:
> On Wed, Nov 06, 2013 at 11:11:02PM -0800, Eric Dumazet wrote:
> > On Thu, 2013-11-07 at 08:43 +0800, Herbert Xu wrote:
> > 
> > > -		if (fskb != skb_shinfo(skb)->frag_list)
> > > -			goto perform_csum_check;
> > > +		if (fskb != skb_shinfo(skb)->frag_list) {
> > > +			struct sk_buff *nsegs;
> > > +
> > > +			if (nskb->len == len + doffset)
> > > +				goto perform_csum_check;
> > > +
> > > +			if (skb_has_frag_list(nskb)) {
> > > +				net_warn_ratelimited(
> > > +					"skb_segment: "
> > > +					"nested frag_list detected\n");
> > > +				kfree(nskb);
> > > +				err = -EINVAL;
> > > +				goto err;
> > > +			}
> > > +
> > > +			__skb_pull(nskb, doffset);
> > > +			skb_shinfo(nskb)->gso_size = mss;
> > > +			nsegs = skb_segment(nskb, features);
> > > +
> > 
> > This still assumes each skb found in frag_list has a exact multiple of
> > @mss bytes, and that the initial skb also ends at a right boundary.
> 
> So what in our stack violates this assumption? We've never handled
> arbitrary frag_lists in GSO and I see no reason why we need to start
> doing that now.
> 
> Also GRO was designed to only merge packets that satisfy these
> assumptions so that through GSO the original packets can be
> recovered without losing end-to-end connectivity.  This is really
> important for routers/switches.

Or perhaps you are saying that this doesn't handle page frags?
It should because each GRO frag_list entry looks just like any
boring old GSO packet filled with frags.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-07  7:15                                         ` Herbert Xu
  2013-11-07  7:17                                           ` Herbert Xu
@ 2013-11-07  7:31                                           ` Eric Dumazet
  2013-11-07  7:33                                             ` Herbert Xu
  1 sibling, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-11-07  7:31 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Thu, 2013-11-07 at 15:15 +0800, Herbert Xu wrote:

> So what in our stack violates this assumption? We've never handled
> arbitrary frag_lists in GSO and I see no reason why we need to start
> doing that now.

I do see this, skb_segment() is generic.

> 
> Also GRO was designed to only merge packets that satisfy these
> assumptions so that through GSO the original packets can be
> recovered without losing end-to-end connectivity.  This is really
> important for routers/switches.

I think we all agree on this, and we should keep this property.

The point is : skb_segment() is not tied to GRO anymore.

My patch handles virtio_net just fine, I see nothing really malicious in
virtio_net.

In particular, each skb found in the frag_list can be of any size,
and not an exact MSS multiple.

I see frag_list as a way to extend skb capacity, not as something
tied to GRO/GSO.

I worked last year so that we no longer had the frag_list being used in
GRO stack. frag_list was no longer needed, thanks to skb->head_frag

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Attempt to handle mega-GRO packets
  2013-11-07  7:31                                           ` Eric Dumazet
@ 2013-11-07  7:33                                             ` Herbert Xu
  0 siblings, 0 replies; 163+ messages in thread
From: Herbert Xu @ 2013-11-07  7:33 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Wed, Nov 06, 2013 at 11:31:30PM -0800, Eric Dumazet wrote:
> On Thu, 2013-11-07 at 15:15 +0800, Herbert Xu wrote:
> 
> > So what in our stack violates this assumption? We've never handled
> > arbitrary frag_lists in GSO and I see no reason why we need to start
> > doing that now.
> 
> I do see this, skb_segment() is generic.

IOW there is nothing in our stack that violates this, great!
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [2/3] gso: Handle new frag_list of frags GRO packets
  2013-11-07  7:06                                           ` [2/3] gso: Handle new frag_list of frags GRO packets Herbert Xu
  2013-11-07  7:08                                             ` [3/3] gso: Handle malicious GRO packets without crashing Herbert Xu
@ 2013-11-07 18:16                                             ` Ben Hutchings
  2013-11-11 18:54                                               ` Herbert Xu
  2013-11-11 18:52                                             ` Herbert Xu
  2 siblings, 1 reply; 163+ messages in thread
From: Ben Hutchings @ 2013-11-07 18:16 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Eric Dumazet, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Thu, 2013-11-07 at 15:06 +0800, Herbert Xu wrote:
> Recently GRO started generating packets with frag_lists of frags.
> This was not handled by GSO, thus leading to a crash.
> 
> Thankfully these packets are of a regular form and are easy to
> handle.  This patch handles them by calling skb_segment for each 
> frag_list entry.  The depth of recursion is limited to just one.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 88b7dc6..bcc3f1c 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
[...]
> @@ -2855,8 +2853,40 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
>  						 nskb->data - tnl_hlen,
>  						 doffset + tnl_hlen);
>  
> -		if (fskb != skb_shinfo(skb)->frag_list)
> -			goto perform_csum_check;
> +		if (fskb != skb_shinfo(skb)->frag_list) {
> +			struct sk_buff *nsegs;
> +
> +			if (nskb->len == len + doffset)
> +				goto perform_csum_check;
> +
> +			SKB_FRAG_ASSERT(nskb);
> +
> +			__skb_pull(nskb, doffset);
> +			skb_shinfo(nskb)->gso_size = mss;
> +			nsegs = skb_segment(nskb, features);
> +
> +			err = PTR_ERR(nsegs);
> +			if (IS_ERR(nsegs)) {
> +				kfree(nskb);

Should be kfree_skb().

> +				goto err;
> +			}
> +			err = -ENOMEM;

It would be clearer to put this in front of the 'err' label:

err_nomem:
	err = -ENOMEM;

and change the allocation failure paths to goto err_nomem.

> +			if (segs)
> +				tail->next = nsegs;
> +			else
> +				segs = nsegs;
> +
> +			tail = nsegs;
> +			while (tail->next)
> +				tail = tail->next;
> +
> +			BUG_ON(fskb && tail->len != len + doffset);

Deserves a comment, maybe?

> +			len = nskb->len;
> +			kfree(nskb);
> +			continue;
> +		}
>  
>  		if (!sg) {
>  			nskb->ip_summed = CHECKSUM_NONE;

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [3/3] gso: Handle malicious GRO packets without crashing
  2013-11-07  7:08                                             ` [3/3] gso: Handle malicious GRO packets without crashing Herbert Xu
@ 2013-11-07 18:18                                               ` Ben Hutchings
  2013-11-07 19:13                                               ` Sergei Shtylyov
  1 sibling, 0 replies; 163+ messages in thread
From: Ben Hutchings @ 2013-11-07 18:18 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Eric Dumazet, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Thu, 2013-11-07 at 15:08 +0800, Herbert Xu wrote:
> As virtio_net can now generate GRO frag_list packets without
> sufficient verification, we need to handle malicious GRO packets
> thrown at us.
> 
> This patch converts to affected BUG_ONs in skb_segment to rate-
> limited warnings.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index bcc3f1c..fb1106d 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -2881,7 +2881,15 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
>  			while (tail->next)
>  				tail = tail->next;
>  
> -			BUG_ON(fskb && tail->len != len + doffset);

Oh well, disregard my previous request for a comment.

> +			if (fskb && tail->len != len + doffset) {
> +				net_warn_ratelimited(
> +					"skb_segment: "
> +					"illegal GSO fragment: %u %u\n",
> +					tail->len, len + doffset);
> +				kfree(nskb);

kfree_skb()

> +				err = -EINVAL;
> +				goto err;
> +			}
>  
>  			len = nskb->len;
>  			kfree(nskb);
> @@ -2929,7 +2937,15 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
>  		if (pos < offset + len) {
>  			struct sk_buff *fskb2 = fskb;
>  
> -			BUG_ON(pos + fskb->len != offset + len);
> +			if (pos + fskb->len != offset + len) {
> +				net_warn_ratelimited(
> +					"skb_segment: "
> +					"illegal GSO trailer: %u %u\n",
> +					pos + fskb->len, offset + len);
> +				kfree(nskb);

kfree_skb()

> +				err = -EINVAL;
> +				goto err;
> +			}
>  
>  			pos += fskb->len;
>  			fskb = fskb->next;

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [3/3] gso: Handle malicious GRO packets without crashing
  2013-11-07  7:08                                             ` [3/3] gso: Handle malicious GRO packets without crashing Herbert Xu
  2013-11-07 18:18                                               ` Ben Hutchings
@ 2013-11-07 19:13                                               ` Sergei Shtylyov
  2013-11-11 18:55                                                 ` Herbert Xu
  1 sibling, 1 reply; 163+ messages in thread
From: Sergei Shtylyov @ 2013-11-07 19:13 UTC (permalink / raw)
  To: Herbert Xu, Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

Hello.

On 11/07/2013 10:08 AM, Herbert Xu wrote:

> As virtio_net can now generate GRO frag_list packets without
> sufficient verification, we need to handle malicious GRO packets
> thrown at us.

> This patch converts to affected BUG_ONs in skb_segment to rate-
> limited warnings.

> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index bcc3f1c..fb1106d 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -2881,7 +2881,15 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
>   			while (tail->next)
>   				tail = tail->next;
>
> -			BUG_ON(fskb && tail->len != len + doffset);
> +			if (fskb && tail->len != len + doffset) {
> +				net_warn_ratelimited(
> +					"skb_segment: "
> +					"illegal GSO fragment: %u %u\n",

    Don't break up the message -- chekpatch.pl should allow that...

> @@ -2929,7 +2937,15 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
>   		if (pos < offset + len) {
>   			struct sk_buff *fskb2 = fskb;
>
> -			BUG_ON(pos + fskb->len != offset + len);
> +			if (pos + fskb->len != offset + len) {
> +				net_warn_ratelimited(
> +					"skb_segment: "
> +					"illegal GSO trailer: %u %u\n",

    Same here.

WBR, Sergei

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v4 net-next] net: introduce dev_set_forwarding()
  2013-11-04 16:55                   ` Ben Hutchings
@ 2013-11-07 21:17                     ` David Miller
  2013-11-07 21:31                       ` Herbert Xu
  0 siblings, 1 reply; 163+ messages in thread
From: David Miller @ 2013-11-07 21:17 UTC (permalink / raw)
  To: bhutchings
  Cc: eric.dumazet, christoph.paasch, herbert, netdev, hkchu, mwdalton

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Mon, 4 Nov 2013 16:55:30 +0000

> On Sat, 2013-11-02 at 12:58 -0700, Eric Dumazet wrote:
>> From: Eric Dumazet <edumazet@google.com>
>> 
>> Christoph Paasch and Jerry Chu reported crashes in skb_segment() caused
>> by commit 8a29111c7ca6 ("net: gro: allow to build full sized skb")
>> 
>> skb_segment() only deals with a frag_list chain containing MSS sized
>> fragments. Even if we fix this problem, its better if GRO layer
>> doesn't build skb with a frag_list in the first place, to let TSO
>> packets reaching output devices.
>>  
>> David Miller and Ben Hutchings suggested we keep track of number of
>> forwarding users to be able to :
>> 
>> - Disable LRO
>> - Make sure GRO layer do not use skb frag_list to extend skb capacity
>> 
>> Note that after this patch, LRO is automatically re-enabled if
>> forwarding is disabled on the device, or if a device is removed
>> from a bridge.
> [...]
> 
> Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>

Applied, thanks everyone.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v4 net-next] net: introduce dev_set_forwarding()
  2013-11-07 21:17                     ` David Miller
@ 2013-11-07 21:31                       ` Herbert Xu
  2013-11-07 21:54                         ` Eric Dumazet
  2013-11-07 22:06                         ` David Miller
  0 siblings, 2 replies; 163+ messages in thread
From: Herbert Xu @ 2013-11-07 21:31 UTC (permalink / raw)
  To: David Miller
  Cc: bhutchings, eric.dumazet, christoph.paasch, netdev, hkchu, mwdalton

On Thu, Nov 07, 2013 at 04:17:55PM -0500, David Miller wrote:
> From: Ben Hutchings <bhutchings@solarflare.com>
> Date: Mon, 4 Nov 2013 16:55:30 +0000
> 
> > On Sat, 2013-11-02 at 12:58 -0700, Eric Dumazet wrote:
> >> From: Eric Dumazet <edumazet@google.com>
> >> 
> >> Christoph Paasch and Jerry Chu reported crashes in skb_segment() caused
> >> by commit 8a29111c7ca6 ("net: gro: allow to build full sized skb")
> >> 
> >> skb_segment() only deals with a frag_list chain containing MSS sized
> >> fragments. Even if we fix this problem, its better if GRO layer
> >> doesn't build skb with a frag_list in the first place, to let TSO
> >> packets reaching output devices.
> >>  
> >> David Miller and Ben Hutchings suggested we keep track of number of
> >> forwarding users to be able to :
> >> 
> >> - Disable LRO
> >> - Make sure GRO layer do not use skb frag_list to extend skb capacity
> >> 
> >> Note that after this patch, LRO is automatically re-enabled if
> >> forwarding is disabled on the device, or if a device is removed
> >> from a bridge.
> > [...]
> > 
> > Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
> 
> Applied, thanks everyone.

Sorry David, I just realised that this patch doesn't address
this problem fully.  While we can stop the generation of these
packets in our own stack, if they're coming from the virt host
or another guest, there is nothing we can do to stop them.

So given virtio_net is now generating such packets, our choices
are either to linearise them or deal with them properly in skb_segment.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v4 net-next] net: introduce dev_set_forwarding()
  2013-11-07 21:31                       ` Herbert Xu
@ 2013-11-07 21:54                         ` Eric Dumazet
  2013-11-08  3:59                           ` Herbert Xu
  2013-11-07 22:06                         ` David Miller
  1 sibling, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-11-07 21:54 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, bhutchings, christoph.paasch, netdev, hkchu, mwdalton

On Fri, 2013-11-08 at 05:31 +0800, Herbert Xu wrote:

> Sorry David, I just realised that this patch doesn't address
> this problem fully.  While we can stop the generation of these
> packets in our own stack, if they're coming from the virt host
> or another guest, there is nothing we can do to stop them.
> 
> So given virtio_net is now generating such packets, our choices
> are either to linearise them or deal with them properly in skb_segment.

Hi Herbert

I believe I did this on my patch.

Note that there is absolutely no requirement on how are the skb found in
frag_list (their length is not a multiple of MSS)

For the ease of discussion, once patched skb_segment() looks like :

/**
 *	skb_segment - Perform protocol segmentation on skb.
 *	@skb: buffer to segment
 *	@features: features for the output path (see dev->features)
 *
 *	This function performs segmentation on the given skb.  It returns
 *	a pointer to the first in a list of new skbs for the segments.
 *	In case of error it returns ERR_PTR(err).
 */
struct sk_buff *skb_segment(struct sk_buff *head_skb, netdev_features_t features)
{
	struct sk_buff *segs = NULL;
	struct sk_buff *tail = NULL;
	struct sk_buff *cskb = head_skb;
	unsigned int mss = skb_shinfo(head_skb)->gso_size;
	unsigned int doffset = head_skb->data - skb_mac_header(head_skb);
	unsigned int tot_len; /* should reach head_skb->len at the end */
	unsigned int offset = doffset; /* offset in cskb->data */
	unsigned int tnl_hlen = skb_tnl_header_len(head_skb);
	unsigned int headroom;
	unsigned int len;
	__be16 proto;
	bool csum;
	int sg = !!(features & NETIF_F_SG);
	int cur_frag = 0, nfrags = skb_shinfo(cskb)->nr_frags;
	unsigned int data_len, cur_frag_offset = 0;
	int err = -ENOMEM;

	proto = skb_network_protocol(head_skb);
	if (unlikely(!proto))
		return ERR_PTR(-EINVAL);

	csum = !!can_checksum_protocol(features, proto);
	__skb_push(head_skb, doffset);
	headroom = skb_headroom(head_skb);

	for (tot_len = doffset; tot_len < head_skb->len; tot_len += len) {
		struct sk_buff *nskb;
		skb_frag_t *frag;
		int hsize, size, remain;

		len = head_skb->len - tot_len;
		if (len > mss)
			len = mss;

		hsize = skb_headlen(cskb) - offset;
		if (hsize < 0)
			hsize = 0;
		if (hsize > len || !sg)
			hsize = len;

		nskb = __alloc_skb(hsize + doffset + headroom,
				   GFP_ATOMIC, skb_alloc_rx_flag(head_skb),
				   NUMA_NO_NODE);

		if (unlikely(!nskb))
			goto err;

		skb_reserve(nskb, headroom);
		__skb_put(nskb, doffset);

		if (segs)
			tail->next = nskb;
		else
			segs = nskb;
		tail = nskb;

		__copy_skb_header(nskb, head_skb);
		nskb->mac_len = head_skb->mac_len;

		skb_headers_offset_update(nskb, skb_headroom(nskb) - headroom);

		skb_copy_from_linear_data_offset(head_skb, -tnl_hlen,
						 nskb->data - tnl_hlen,
						 doffset + tnl_hlen);

		if (!sg) {
			nskb->ip_summed = CHECKSUM_NONE;
			nskb->csum = skb_copy_and_csum_bits(head_skb, tot_len,
							    skb_put(nskb, len),
							    len, 0);
			offset += len;
			continue;
		}

		frag = skb_shinfo(nskb)->frags;

		skb_copy_from_linear_data_offset(cskb, offset,
						 skb_put(nskb, hsize), hsize);
		offset += hsize;

		nskb->data_len = len - hsize;
		nskb->len += nskb->data_len;
		nskb->truesize += nskb->data_len;

		skb_shinfo(nskb)->tx_flags = skb_shinfo(head_skb)->tx_flags & SKBTX_SHARED_FRAG;

		for (data_len = 0; data_len < nskb->data_len; data_len += remain) {
			remain = nskb->data_len - data_len;
			if (unlikely(cur_frag >= nfrags)) {
				if (cskb == head_skb)
					cskb = skb_shinfo(head_skb)->frag_list;
				else
					cskb = cskb->next;
				if (!cskb) {
					WARN_ON_ONCE(1);
					goto err;
				}
				cur_frag = 0;
				cur_frag_offset = 0;
				nfrags = skb_shinfo(cskb)->nr_frags;
				offset = 0;
				if (skb_headlen(cskb)) {
					char *data;
					struct page *page;

					remain = min_t(int, remain, skb_headlen(cskb));
					pr_err_once("remain %d\n", remain);
					if (likely(cskb->head_frag)) {
						data = cskb->data;
						page = virt_to_head_page(data);
						get_page(page);
					} else {
						data = __netdev_alloc_frag(SKB_DATA_ALIGN(remain),
									   GFP_ATOMIC);
						/* Really this should not happen, fix the caller ! */
						WARN_ON_ONCE(1);
						if (!data)
							goto err;
						memcpy(data, cskb->data, remain);
						page = virt_to_head_page(data);
					}
					frag->page.p = page;
					frag->page_offset = data - (char *)page_address(page);
					skb_frag_size_set(frag, remain);
					frag++;
					offset = remain;
					continue;
				}
			}
			*frag = skb_shinfo(cskb)->frags[cur_frag];
			__skb_frag_ref(frag);

			frag->page_offset += cur_frag_offset;
			skb_frag_size_sub(frag, cur_frag_offset);
			size = skb_frag_size(frag);

			if (size <= remain) {
				cur_frag++;
				cur_frag_offset = 0;
				remain = size;
			} else {
				skb_frag_size_set(frag, remain);
				cur_frag_offset += remain;
			}

			frag++;
		}

		skb_shinfo(nskb)->nr_frags = frag - skb_shinfo(nskb)->frags;

		if (!csum) {
			nskb->csum = skb_checksum(nskb, doffset,
						  nskb->len - doffset, 0);
			nskb->ip_summed = CHECKSUM_NONE;
		}
	}

	return segs;

err:
	while ((cskb = segs)) {
		segs = cskb->next;
		kfree_skb(cskb);
	}
	return ERR_PTR(err);
}
EXPORT_SYMBOL_GPL(skb_segment);

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v4 net-next] net: introduce dev_set_forwarding()
  2013-11-07 21:31                       ` Herbert Xu
  2013-11-07 21:54                         ` Eric Dumazet
@ 2013-11-07 22:06                         ` David Miller
  2013-11-08  2:17                           ` Herbert Xu
  1 sibling, 1 reply; 163+ messages in thread
From: David Miller @ 2013-11-07 22:06 UTC (permalink / raw)
  To: herbert
  Cc: bhutchings, eric.dumazet, christoph.paasch, netdev, hkchu, mwdalton

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 8 Nov 2013 05:31:11 +0800

> Sorry David, I just realised that this patch doesn't address
> this problem fully.  While we can stop the generation of these
> packets in our own stack, if they're coming from the virt host
> or another guest, there is nothing we can do to stop them.
> 
> So given virtio_net is now generating such packets, our choices
> are either to linearise them or deal with them properly in skb_segment.

Ok.

Aside from the segmentation issues, Eric's patch was a nice cleanup
which also made it such that we'd be able to get features back when
the blocking condition gets removed.

Given all of this, would you like me to revert his change for now?

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v4 net-next] net: introduce dev_set_forwarding()
  2013-11-07 22:06                         ` David Miller
@ 2013-11-08  2:17                           ` Herbert Xu
  2013-11-08  2:42                             ` Eric Dumazet
  0 siblings, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-11-08  2:17 UTC (permalink / raw)
  To: David Miller
  Cc: bhutchings, eric.dumazet, christoph.paasch, netdev, hkchu, mwdalton

On Thu, Nov 07, 2013 at 05:06:51PM -0500, David Miller wrote:
> 
> Aside from the segmentation issues, Eric's patch was a nice cleanup
> which also made it such that we'd be able to get features back when
> the blocking condition gets removed.
> 
> Given all of this, would you like me to revert his change for now?

Oh no let's keep it as it is indeed a great cleanup and improvement
for LRO.  We can however remove the bits that relate to GRO.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v4 net-next] net: introduce dev_set_forwarding()
  2013-11-08  2:17                           ` Herbert Xu
@ 2013-11-08  2:42                             ` Eric Dumazet
  2013-11-08  2:51                               ` Eric Dumazet
  2013-11-08  3:22                               ` Herbert Xu
  0 siblings, 2 replies; 163+ messages in thread
From: Eric Dumazet @ 2013-11-08  2:42 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, bhutchings, christoph.paasch, netdev, hkchu, mwdalton

On Fri, 2013-11-08 at 10:17 +0800, Herbert Xu wrote:

> Oh no let's keep it as it is indeed a great cleanup and improvement
> for LRO.  We can however remove the bits that relate to GRO.

What about doing a benchmark first ?

I benchmarked the 'locally delivered' traffic, not the forwarding one.

That's why I am not comfortable of having large GRO packets that need
to be segmented.

You tell us hardware advantage of TSO is negligible, this is not
what I observe today on many platforms.

A normal TSO packets with 16 MSS setups a ~17 DMA descriptors,
while GSO requires 2 DMA descriptors per MSS, plus a lot of overhead
in sk_buff allocation/deallocation.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v4 net-next] net: introduce dev_set_forwarding()
  2013-11-08  2:42                             ` Eric Dumazet
@ 2013-11-08  2:51                               ` Eric Dumazet
  2013-11-08  3:23                                 ` Herbert Xu
  2013-11-08  3:22                               ` Herbert Xu
  1 sibling, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-11-08  2:51 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, bhutchings, christoph.paasch, netdev, hkchu, mwdalton

On Thu, 2013-11-07 at 18:42 -0800, Eric Dumazet wrote:

> A normal TSO packets with 16 MSS setups a ~17 DMA descriptors,
> while GSO requires 2 DMA descriptors per MSS, plus a lot of overhead
> in sk_buff allocation/deallocation.

Not mentioning fact that a 64KB packet is adding latencies, since high
prio packets have to wait the whole preceding 64KB packet has left the
host.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v4 net-next] net: introduce dev_set_forwarding()
  2013-11-08  2:42                             ` Eric Dumazet
  2013-11-08  2:51                               ` Eric Dumazet
@ 2013-11-08  3:22                               ` Herbert Xu
  2013-11-08  4:06                                 ` Eric Dumazet
  1 sibling, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-11-08  3:22 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, bhutchings, christoph.paasch, netdev, hkchu, mwdalton

On Thu, Nov 07, 2013 at 06:42:34PM -0800, Eric Dumazet wrote:
> On Fri, 2013-11-08 at 10:17 +0800, Herbert Xu wrote:
> 
> > Oh no let's keep it as it is indeed a great cleanup and improvement
> > for LRO.  We can however remove the bits that relate to GRO.
> 
> What about doing a benchmark first ?

If it really hurt then they can always turn it off using ethtool.
With your patch that automatically turns it off virt folks have
no way of even turning it on.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v4 net-next] net: introduce dev_set_forwarding()
  2013-11-08  2:51                               ` Eric Dumazet
@ 2013-11-08  3:23                                 ` Herbert Xu
  2013-11-08  4:21                                   ` Eric Dumazet
  0 siblings, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-11-08  3:23 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, bhutchings, christoph.paasch, netdev, hkchu, mwdalton

On Thu, Nov 07, 2013 at 06:51:53PM -0800, Eric Dumazet wrote:
> On Thu, 2013-11-07 at 18:42 -0800, Eric Dumazet wrote:
> 
> > A normal TSO packets with 16 MSS setups a ~17 DMA descriptors,
> > while GSO requires 2 DMA descriptors per MSS, plus a lot of overhead
> > in sk_buff allocation/deallocation.
> 
> Not mentioning fact that a 64KB packet is adding latencies, since high
> prio packets have to wait the whole preceding 64KB packet has left the
> host.

That would be a bug in the GRO code since a high prio packet
shouldn't have been merged in the first place and therefore
the usual priority mechanism should allow it to preempt the
64KB packet.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v4 net-next] net: introduce dev_set_forwarding()
  2013-11-07 21:54                         ` Eric Dumazet
@ 2013-11-08  3:59                           ` Herbert Xu
  2013-11-08  4:25                             ` Eric Dumazet
  0 siblings, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-11-08  3:59 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, bhutchings, christoph.paasch, netdev, hkchu, mwdalton

On Thu, Nov 07, 2013 at 01:54:48PM -0800, Eric Dumazet wrote:
>
> Hi Herbert
> 
> I believe I did this on my patch.

I actually quite like your patch for the pure iterative approach.

> Note that there is absolutely no requirement on how are the skb found in
> frag_list (their length is not a multiple of MSS)

OK I am thick and now I finally get what you're saying: virtio_net
as it stands does not produce the original GRO packet.  So my patch
is still broken with respect to virtio_net.  It won't crash anymore
but it'll just drop GRO/GSO packets which is probably worse.

> For the ease of discussion, once patched skb_segment() looks like :

However, I still have one reason for preferring my patch, it'll
be easier to prodce TSO packets with it.  Let me see if I can
fix up the arbitrary frag boundary issue without making it too
ugly.

Thanks!
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v4 net-next] net: introduce dev_set_forwarding()
  2013-11-08  3:22                               ` Herbert Xu
@ 2013-11-08  4:06                                 ` Eric Dumazet
  2013-11-08  4:10                                   ` Herbert Xu
  0 siblings, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-11-08  4:06 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, bhutchings, christoph.paasch, netdev, hkchu, mwdalton

On Fri, 2013-11-08 at 11:22 +0800, Herbert Xu wrote:

> If it really hurt then they can always turn it off using ethtool.
> With your patch that automatically turns it off virt folks have
> no way of even turning it on.

Let me remind you that before the GRO frag_list patch, virt folks had no
choice anyway. GRO packets were limited to 16 MSS.

Its not because we can build large packets that we must do so.

This was an error of our TSO implementation.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v4 net-next] net: introduce dev_set_forwarding()
  2013-11-08  4:06                                 ` Eric Dumazet
@ 2013-11-08  4:10                                   ` Herbert Xu
  2013-11-08  4:24                                     ` Eric Dumazet
  0 siblings, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-11-08  4:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, bhutchings, christoph.paasch, netdev, hkchu, mwdalton

On Thu, Nov 07, 2013 at 08:06:34PM -0800, Eric Dumazet wrote:
> On Fri, 2013-11-08 at 11:22 +0800, Herbert Xu wrote:
> 
> > If it really hurt then they can always turn it off using ethtool.
> > With your patch that automatically turns it off virt folks have
> > no way of even turning it on.
> 
> Let me remind you that before the GRO frag_list patch, virt folks had no
> choice anyway. GRO packets were limited to 16 MSS.
> 
> Its not because we can build large packets that we must do so.
> 
> This was an error of our TSO implementation.

Well you still don't seem to be getting this: If you need it for
the host then you will need it even more for virt because the
network stack there is much longer.

So having it only available to the host without also giving it
to virt makes *zero* sense.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v4 net-next] net: introduce dev_set_forwarding()
  2013-11-08  3:23                                 ` Herbert Xu
@ 2013-11-08  4:21                                   ` Eric Dumazet
  2013-11-08  4:24                                     ` Herbert Xu
  0 siblings, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-11-08  4:21 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, bhutchings, christoph.paasch, netdev, hkchu, mwdalton

On Fri, 2013-11-08 at 11:23 +0800, Herbert Xu wrote:
> On Thu, Nov 07, 2013 at 06:51:53PM -0800, Eric Dumazet wrote:
> > On Thu, 2013-11-07 at 18:42 -0800, Eric Dumazet wrote:
> > 
> > > A normal TSO packets with 16 MSS setups a ~17 DMA descriptors,
> > > while GSO requires 2 DMA descriptors per MSS, plus a lot of overhead
> > > in sk_buff allocation/deallocation.
> > 
> > Not mentioning fact that a 64KB packet is adding latencies, since high
> > prio packets have to wait the whole preceding 64KB packet has left the
> > host.
> 
> That would be a bug in the GRO code since a high prio packet
> shouldn't have been merged in the first place and therefore
> the usual priority mechanism should allow it to preempt the
> 64KB packet.

Some users install Qdisc (AQM) on their router, to decide of what is
high priority and what is not. Their iptables or qdisc filters can be
quite complex.

Its all TCP for example.

GRO stack cannot make this decision.

So lets say we receive on ingress a mix of high prio packets and low
prio TCP packets. If GRO stack is able to build super big GRO packet,
then this super big GRO packet is a head of line blocking.

At 1Gbps, a 16 MSS packet is holding the line for about 190 us.

At 45 MSS, you basically multiply by 3 this latency.

What we probably want is a way to tune this latency, not ignore the
problem by making big GRO packets.

The only current choice for the user is to enable or disable GRO per
ingress port.

Thats a trivial patch, but net-next is closed at this moment.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v4 net-next] net: introduce dev_set_forwarding()
  2013-11-08  4:21                                   ` Eric Dumazet
@ 2013-11-08  4:24                                     ` Herbert Xu
  2013-11-08  4:40                                       ` Eric Dumazet
  0 siblings, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-11-08  4:24 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, bhutchings, christoph.paasch, netdev, hkchu, mwdalton

On Thu, Nov 07, 2013 at 08:21:57PM -0800, Eric Dumazet wrote:
> On Fri, 2013-11-08 at 11:23 +0800, Herbert Xu wrote:
> > On Thu, Nov 07, 2013 at 06:51:53PM -0800, Eric Dumazet wrote:
> > > On Thu, 2013-11-07 at 18:42 -0800, Eric Dumazet wrote:
> > > 
> > > > A normal TSO packets with 16 MSS setups a ~17 DMA descriptors,
> > > > while GSO requires 2 DMA descriptors per MSS, plus a lot of overhead
> > > > in sk_buff allocation/deallocation.
> > > 
> > > Not mentioning fact that a 64KB packet is adding latencies, since high
> > > prio packets have to wait the whole preceding 64KB packet has left the
> > > host.
> > 
> > That would be a bug in the GRO code since a high prio packet
> > shouldn't have been merged in the first place and therefore
> > the usual priority mechanism should allow it to preempt the
> > 64KB packet.
> 
> Some users install Qdisc (AQM) on their router, to decide of what is
> high priority and what is not. Their iptables or qdisc filters can be
> quite complex.
> 
> Its all TCP for example.
> 
> GRO stack cannot make this decision.
> 
> So lets say we receive on ingress a mix of high prio packets and low
> prio TCP packets. If GRO stack is able to build super big GRO packet,
> then this super big GRO packet is a head of line blocking.
> 
> At 1Gbps, a 16 MSS packet is holding the line for about 190 us.
> 
> At 45 MSS, you basically multiply by 3 this latency.
> 
> What we probably want is a way to tune this latency, not ignore the
> problem by making big GRO packets.
> 
> The only current choice for the user is to enable or disable GRO per
> ingress port.
> 
> Thats a trivial patch, but net-next is closed at this moment.

I don't know why you're fixing this problem by disabling/degrading
GRO instead of fixing the qdisc to intelligently segment the GRO
packet.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v4 net-next] net: introduce dev_set_forwarding()
  2013-11-08  4:10                                   ` Herbert Xu
@ 2013-11-08  4:24                                     ` Eric Dumazet
  2013-11-08  4:28                                       ` Herbert Xu
  0 siblings, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-11-08  4:24 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, bhutchings, christoph.paasch, netdev, hkchu, mwdalton

On Fri, 2013-11-08 at 12:10 +0800, Herbert Xu wrote:

> Well you still don't seem to be getting this: If you need it for
> the host then you will need it even more for virt because the
> network stack there is much longer.
> 
> So having it only available to the host without also giving it
> to virt makes *zero* sense.

I am fine with this, but how can we know the packet is going to be
delivered to virt instead of forwarded to ethernet ?

Do we want the equivalent of 'IP early demux', done at GRO layer ? ;)

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v4 net-next] net: introduce dev_set_forwarding()
  2013-11-08  3:59                           ` Herbert Xu
@ 2013-11-08  4:25                             ` Eric Dumazet
  2013-11-10 14:05                               ` Herbert Xu
  0 siblings, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-11-08  4:25 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, bhutchings, christoph.paasch, netdev, hkchu, mwdalton

On Fri, 2013-11-08 at 11:59 +0800, Herbert Xu wrote:

> However, I still have one reason for preferring my patch, it'll
> be easier to prodce TSO packets with it.  Let me see if I can
> fix up the arbitrary frag boundary issue without making it too
> ugly.

Sure !

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v4 net-next] net: introduce dev_set_forwarding()
  2013-11-08  4:24                                     ` Eric Dumazet
@ 2013-11-08  4:28                                       ` Herbert Xu
  0 siblings, 0 replies; 163+ messages in thread
From: Herbert Xu @ 2013-11-08  4:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, bhutchings, christoph.paasch, netdev, hkchu, mwdalton

On Thu, Nov 07, 2013 at 08:24:43PM -0800, Eric Dumazet wrote:
> On Fri, 2013-11-08 at 12:10 +0800, Herbert Xu wrote:
> 
> > Well you still don't seem to be getting this: If you need it for
> > the host then you will need it even more for virt because the
> > network stack there is much longer.
> > 
> > So having it only available to the host without also giving it
> > to virt makes *zero* sense.
> 
> I am fine with this, but how can we know the packet is going to be
> delivered to virt instead of forwarded to ethernet ?

That's the crux of our disagreement :)

My preference is that if we can make it work at least as good
as before (if not better) in the forwarding case then we won't
need to care.

And so far I haven't seen any convincing argument why this cannot
be fixed in the forwarding case.

> Do we want the equivalent of 'IP early demux', done at GRO layer ? ;)

Won't help because the virt guest may end up forwarding the packet
anyway and you can't rely on it telling you that.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v4 net-next] net: introduce dev_set_forwarding()
  2013-11-08  4:24                                     ` Herbert Xu
@ 2013-11-08  4:40                                       ` Eric Dumazet
  2013-11-08  4:43                                         ` Herbert Xu
  0 siblings, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-11-08  4:40 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, bhutchings, christoph.paasch, netdev, hkchu, mwdalton

On Fri, 2013-11-08 at 12:24 +0800, Herbert Xu wrote:

> I don't know why you're fixing this problem by disabling/degrading
> GRO instead of fixing the qdisc to intelligently segment the GRO
> packet.

On what criteria you decide that the packet you are going to queue to
the NIC must be segmented ?

How do you predict the future ?

BTW, the Qdisc API doesnt really permit what you describe.

There are the queue() and dequeue(). Once packet is dequeued, you cannot
requeue a bunch of packets if you decide to segment the GRO packet.

This would require a lot of changes, while its much easier to give a
limit to GRO layer (replace the 65536 constant in skb_gro_receive() to
dev->max_gro_size).

Its rare the same host is used both as a latency sensitive router, and a
virt box...

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v4 net-next] net: introduce dev_set_forwarding()
  2013-11-08  4:40                                       ` Eric Dumazet
@ 2013-11-08  4:43                                         ` Herbert Xu
  2013-11-08  5:08                                           ` Eric Dumazet
  0 siblings, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-11-08  4:43 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, bhutchings, christoph.paasch, netdev, hkchu, mwdalton

On Thu, Nov 07, 2013 at 08:40:00PM -0800, Eric Dumazet wrote:
> On Fri, 2013-11-08 at 12:24 +0800, Herbert Xu wrote:
> 
> > I don't know why you're fixing this problem by disabling/degrading
> > GRO instead of fixing the qdisc to intelligently segment the GRO
> > packet.
> 
> On what criteria you decide that the packet you are going to queue to
> the NIC must be segmented ?

Easy.  You don't physically segment the packet until it's
transmitted.  But logically you treat each GRO/GSO packet as if it
had already been segmented.

> BTW, the Qdisc API doesnt really permit what you describe.
> 
> There are the queue() and dequeue(). Once packet is dequeued, you cannot
> requeue a bunch of packets if you decide to segment the GRO packet.

This logic should be inside the qdisc.  I'm not saying that we
need to implement this for every single qdisc out there, just the
ones that you care about.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v4 net-next] net: introduce dev_set_forwarding()
  2013-11-08  4:43                                         ` Herbert Xu
@ 2013-11-08  5:08                                           ` Eric Dumazet
  2013-11-08  5:21                                             ` Herbert Xu
  0 siblings, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-11-08  5:08 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, bhutchings, christoph.paasch, netdev, hkchu, mwdalton

On Fri, 2013-11-08 at 12:43 +0800, Herbert Xu wrote:

> Easy.  You don't physically segment the packet until it's
> transmitted.  But logically you treat each GRO/GSO packet as if it
> had already been segmented.

OK, please submit some patches, because I must be blind.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v4 net-next] net: introduce dev_set_forwarding()
  2013-11-08  5:08                                           ` Eric Dumazet
@ 2013-11-08  5:21                                             ` Herbert Xu
  2013-11-08  5:40                                               ` Eric Dumazet
  0 siblings, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-11-08  5:21 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, bhutchings, christoph.paasch, netdev, hkchu, mwdalton

On Thu, Nov 07, 2013 at 09:08:49PM -0800, Eric Dumazet wrote:
> On Fri, 2013-11-08 at 12:43 +0800, Herbert Xu wrote:
> 
> > Easy.  You don't physically segment the packet until it's
> > transmitted.  But logically you treat each GRO/GSO packet as if it
> > had already been segmented.
> 
> OK, please submit some patches, because I must be blind.

Honestly it doesn't even matter.  If it's really so bad for your
qdisc, you should just call skb_segment in your qdisc directly.

My point is that even if you did that GRO with your frag_list
patch should still be a win because the stack prior to the qdisc
gets run once instead of two or three times.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v4 net-next] net: introduce dev_set_forwarding()
  2013-11-08  5:21                                             ` Herbert Xu
@ 2013-11-08  5:40                                               ` Eric Dumazet
  2013-11-11 18:58                                                 ` Herbert Xu
  0 siblings, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-11-08  5:40 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, bhutchings, christoph.paasch, netdev, hkchu, mwdalton

On Fri, 2013-11-08 at 13:21 +0800, Herbert Xu wrote:


> 
> My point is that even if you did that GRO with your frag_list
> patch should still be a win because the stack prior to the qdisc
> gets run once instead of two or three times.
> 

OK, lets me repeat again.

64KB packet receive/aggregation time is more than 540 us on 1Gbps link.

The fact that you split or not the packet at transmit is quite
irrelevant, its already too late.

The problem is not the egress, its GRO if it can aggregate too big
packets.

Most of GRO/GSO benefits are already there with 16 MSS skbs,
we do not gain that much using 44/45 MSS skbs, but increase the
Store-and-Forward Delay by 200 %

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v4 net-next] net: introduce dev_set_forwarding()
  2013-11-08  4:25                             ` Eric Dumazet
@ 2013-11-10 14:05                               ` Herbert Xu
  2013-11-11 14:36                                 ` Herbert Xu
  0 siblings, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-11-10 14:05 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, bhutchings, christoph.paasch, netdev, hkchu, mwdalton

On Thu, Nov 07, 2013 at 08:25:30PM -0800, Eric Dumazet wrote:
> On Fri, 2013-11-08 at 11:59 +0800, Herbert Xu wrote:
> 
> > However, I still have one reason for preferring my patch, it'll
> > be easier to prodce TSO packets with it.  Let me see if I can
> > fix up the arbitrary frag boundary issue without making it too
> > ugly.
> 
> Sure !

I ended up giving up on the recursion idea and borrowed your
iterative approach.  I haven't yet had the chance to test it
yet but here is the WIP.

The main assumptions are that virtio_net frag_list is always non-
linear and GRO frag_list may only contain a linear head part that
is exactly MSS bytes long.

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3735fad..fab44ff 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2776,6 +2776,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 	struct sk_buff *segs = NULL;
 	struct sk_buff *tail = NULL;
 	struct sk_buff *fskb = skb_shinfo(skb)->frag_list;
+	skb_frag_t *skb_frag = skb_shinfo(skb)->frags;
 	unsigned int mss = skb_shinfo(skb)->gso_size;
 	unsigned int doffset = skb->data - skb_mac_header(skb);
 	unsigned int offset = doffset;
@@ -2815,16 +2816,23 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 		if (hsize > len || !sg)
 			hsize = len;
 
-		if (!hsize && i >= nfrags) {
-			BUG_ON(fskb->len != len);
+		if (!hsize && i >= nfrags && skb_headlen(fskb)) {
+			BUG_ON(skb_headlen(fskb) != len);
 
 			pos += len;
+			i = 0;
+			nfrags = skb_shinfo(fskb)->nr_frags;
+			skb_frag = skb_shinfo(fskb)->frags;
+
 			nskb = skb_clone(fskb, GFP_ATOMIC);
 			fskb = fskb->next;
 
 			if (unlikely(!nskb))
 				goto err;
 
+			if (unlikely(pskb_trim(nskb, len)))
+				goto err;
+
 			hsize = skb_end_offset(nskb);
 			if (skb_cow_head(nskb, doffset + headroom)) {
 				kfree_skb(nskb);
@@ -2861,7 +2869,8 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 						 nskb->data - tnl_hlen,
 						 doffset + tnl_hlen);
 
-		if (fskb != skb_shinfo(skb)->frag_list)
+		if (fskb != skb_shinfo(skb)->frag_list &&
+		    nskb->len == len + doffset)
 			goto perform_csum_check;
 
 		if (!sg) {
@@ -2879,8 +2888,20 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 
 		skb_shinfo(nskb)->tx_flags = skb_shinfo(skb)->tx_flags & SKBTX_SHARED_FRAG;
 
-		while (pos < offset + len && i < nfrags) {
-			*frag = skb_shinfo(skb)->frags[i];
+		while (pos < offset + len) {
+			if (i >= nfrags) {
+				BUG_ON(skb_headlen(fskb));
+
+				i = 0;
+				nfrags = skb_shinfo(fskb)->nr_frags;
+				skb_frag = skb_shinfo(fskb)->frags;
+
+				BUG_ON(!nfrags);
+
+				fskb = fskb->next;
+			}
+
+			*frag = *skb_frag;
 			__skb_frag_ref(frag);
 			size = skb_frag_size(frag);
 
@@ -2891,37 +2912,26 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 
 			skb_shinfo(nskb)->nr_frags++;
 
-			if (pos + size <= offset + len) {
-				i++;
-				pos += size;
-			} else {
-				skb_frag_size_sub(frag, pos + size - (offset + len));
-				goto skip_fraglist;
+			if (pos + size >= offset + len) {
+				skb_frag_size_sub(frag,
+						  pos + size - (offset + len));
+				break;
 			}
 
+			skb_frag++;
+			i++;
+			pos += size;
 			frag++;
-		}
-
-		if (pos < offset + len) {
-			struct sk_buff *fskb2 = fskb;
 
-			BUG_ON(pos + fskb->len != offset + len);
-
-			pos += fskb->len;
-			fskb = fskb->next;
-
-			if (fskb2->next) {
-				fskb2 = skb_clone(fskb2, GFP_ATOMIC);
-				if (!fskb2)
-					goto err;
-			} else
-				skb_get(fskb2);
-
-			SKB_FRAG_ASSERT(nskb);
-			skb_shinfo(nskb)->frag_list = fskb2;
+			if (unlikely(skb_shinfo(nskb)->nr_frags >=
+				     MAX_SKB_FRAGS)) {
+				net_warn_ratelimited(
+					"skb_segment: too many frags: %u %u\n",
+					pos, mss);
+				goto err;
+			}
 		}
 
-skip_fraglist:
 		nskb->data_len = len - hsize;
 		nskb->len += nskb->data_len;
 		nskb->truesize += nskb->data_len;

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 163+ messages in thread

* Re: [PATCH v4 net-next] net: introduce dev_set_forwarding()
  2013-11-10 14:05                               ` Herbert Xu
@ 2013-11-11 14:36                                 ` Herbert Xu
  0 siblings, 0 replies; 163+ messages in thread
From: Herbert Xu @ 2013-11-11 14:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, bhutchings, christoph.paasch, netdev, hkchu, mwdalton

On Sun, Nov 10, 2013 at 10:05:41PM +0800, Herbert Xu wrote:
> 
> The main assumptions are that virtio_net frag_list is always non-
> linear and GRO frag_list may only contain a linear head part that
> is exactly MSS bytes long.

OK that assumption didn't work too well.  GRO frag_list can indeed
have a linear part followed by non-linear bits.  Instead we can
assume that the linear part is always less than our target size.

The following patch passes my tests and I will clean it up for
submission.

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3735fad..3b3ad8b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2776,6 +2776,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 	struct sk_buff *segs = NULL;
 	struct sk_buff *tail = NULL;
 	struct sk_buff *fskb = skb_shinfo(skb)->frag_list;
+	skb_frag_t *skb_frag = skb_shinfo(skb)->frags;
 	unsigned int mss = skb_shinfo(skb)->gso_size;
 	unsigned int doffset = skb->data - skb_mac_header(skb);
 	unsigned int offset = doffset;
@@ -2815,16 +2816,29 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 		if (hsize > len || !sg)
 			hsize = len;
 
-		if (!hsize && i >= nfrags) {
-			BUG_ON(fskb->len != len);
+		if (!hsize && i >= nfrags && skb_headlen(fskb)) {
+			BUG_ON(skb_headlen(fskb) > len);
+
+			i = 0;
+			nfrags = skb_shinfo(fskb)->nr_frags;
+			skb_frag = skb_shinfo(fskb)->frags;
+
+			for (pos += skb_headlen(fskb); pos < offset + len;
+			     i++, pos += skb_frag_size(skb_frag++))
+				BUG_ON(pos > offset + len ||
+				       i >= nfrags);
 
-			pos += len;
 			nskb = skb_clone(fskb, GFP_ATOMIC);
 			fskb = fskb->next;
 
 			if (unlikely(!nskb))
 				goto err;
 
+			if (unlikely(pskb_trim(nskb, len))) {
+				kfree_skb(nskb);
+				goto err;
+			}
+
 			hsize = skb_end_offset(nskb);
 			if (skb_cow_head(nskb, doffset + headroom)) {
 				kfree_skb(nskb);
@@ -2861,7 +2875,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 						 nskb->data - tnl_hlen,
 						 doffset + tnl_hlen);
 
-		if (fskb != skb_shinfo(skb)->frag_list)
+		if (nskb->len == len + doffset)
 			goto perform_csum_check;
 
 		if (!sg) {
@@ -2879,8 +2893,28 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 
 		skb_shinfo(nskb)->tx_flags = skb_shinfo(skb)->tx_flags & SKBTX_SHARED_FRAG;
 
-		while (pos < offset + len && i < nfrags) {
-			*frag = skb_shinfo(skb)->frags[i];
+		while (pos < offset + len) {
+			if (i >= nfrags) {
+				BUG_ON(skb_headlen(fskb));
+
+				i = 0;
+				nfrags = skb_shinfo(fskb)->nr_frags;
+				skb_frag = skb_shinfo(fskb)->frags;
+
+				BUG_ON(!nfrags);
+
+				fskb = fskb->next;
+			}
+
+			if (unlikely(skb_shinfo(nskb)->nr_frags >=
+				     MAX_SKB_FRAGS)) {
+				net_warn_ratelimited(
+					"skb_segment: too many frags: %u %u\n",
+					pos, mss);
+				goto err;
+			}
+
+			*frag = *skb_frag;
 			__skb_frag_ref(frag);
 			size = skb_frag_size(frag);
 
@@ -2893,6 +2927,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 
 			if (pos + size <= offset + len) {
 				i++;
+				skb_frag++;
 				pos += size;
 			} else {
 				skb_frag_size_sub(frag, pos + size - (offset + len));
@@ -2902,25 +2937,6 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 			frag++;
 		}
 
-		if (pos < offset + len) {
-			struct sk_buff *fskb2 = fskb;
-
-			BUG_ON(pos + fskb->len != offset + len);
-
-			pos += fskb->len;
-			fskb = fskb->next;
-
-			if (fskb2->next) {
-				fskb2 = skb_clone(fskb2, GFP_ATOMIC);
-				if (!fskb2)
-					goto err;
-			} else
-				skb_get(fskb2);
-
-			SKB_FRAG_ASSERT(nskb);
-			skb_shinfo(nskb)->frag_list = fskb2;
-		}
-
 skip_fraglist:
 		nskb->data_len = len - hsize;
 		nskb->len += nskb->data_len;

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 163+ messages in thread

* gso: Handle new frag_list of frags GRO packets
  2013-11-07  7:06                                           ` [2/3] gso: Handle new frag_list of frags GRO packets Herbert Xu
  2013-11-07  7:08                                             ` [3/3] gso: Handle malicious GRO packets without crashing Herbert Xu
  2013-11-07 18:16                                             ` [2/3] gso: Handle new frag_list of frags GRO packets Ben Hutchings
@ 2013-11-11 18:52                                             ` Herbert Xu
  2013-11-12 10:12                                               ` David Laight
  2013-11-13  1:13                                               ` gso: " Eric Dumazet
  2 siblings, 2 replies; 163+ messages in thread
From: Herbert Xu @ 2013-11-11 18:52 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

Recently GRO started generating packets with frag_lists of frags.
This was not handled by GSO, thus leading to a crash.

Thankfully these packets are of a regular form and are easy to
handle.  This patch handles them in two ways.  For completely
non-linear frag_list entries, we simply continue to iterate over
the frag_list frags once we exhaust the normal frags.  For frag_list
entries with linear parts, we call pskb_trim on the first part
of the frag_list skb, and then process the rest of the frags in
the usual way.

This patch also kills a chunk of dead frag_list code that has
obviously never ever been run since it ends up generating a bogus
GSO-segmented packet with a frag_list entry.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3735fad..557e1a5 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2776,6 +2776,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 	struct sk_buff *segs = NULL;
 	struct sk_buff *tail = NULL;
 	struct sk_buff *fskb = skb_shinfo(skb)->frag_list;
+	skb_frag_t *skb_frag = skb_shinfo(skb)->frags;
 	unsigned int mss = skb_shinfo(skb)->gso_size;
 	unsigned int doffset = skb->data - skb_mac_header(skb);
 	unsigned int offset = doffset;
@@ -2815,16 +2816,38 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 		if (hsize > len || !sg)
 			hsize = len;
 
-		if (!hsize && i >= nfrags) {
-			BUG_ON(fskb->len != len);
+		if (!hsize && i >= nfrags && skb_headlen(fskb) &&
+		    (skb_headlen(fskb) == len || sg)) {
+			BUG_ON(skb_headlen(fskb) > len);
+
+			i = 0;
+			nfrags = skb_shinfo(fskb)->nr_frags;
+			skb_frag = skb_shinfo(fskb)->frags;
+			pos += skb_headlen(fskb);
+
+			while (pos < offset + len) {
+				BUG_ON(i >= nfrags);
+
+				size = skb_frag_size(skb_frag);
+				if (pos + size > offset + len)
+					break;
+
+				i++;
+				pos += size;
+				skb_frag++;
+			}
 
-			pos += len;
 			nskb = skb_clone(fskb, GFP_ATOMIC);
 			fskb = fskb->next;
 
 			if (unlikely(!nskb))
 				goto err;
 
+			if (unlikely(pskb_trim(nskb, len))) {
+				kfree_skb(nskb);
+				goto err;
+			}
+
 			hsize = skb_end_offset(nskb);
 			if (skb_cow_head(nskb, doffset + headroom)) {
 				kfree_skb(nskb);
@@ -2861,7 +2884,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 						 nskb->data - tnl_hlen,
 						 doffset + tnl_hlen);
 
-		if (fskb != skb_shinfo(skb)->frag_list)
+		if (nskb->len == len + doffset)
 			goto perform_csum_check;
 
 		if (!sg) {
@@ -2879,8 +2902,28 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 
 		skb_shinfo(nskb)->tx_flags = skb_shinfo(skb)->tx_flags & SKBTX_SHARED_FRAG;
 
-		while (pos < offset + len && i < nfrags) {
-			*frag = skb_shinfo(skb)->frags[i];
+		while (pos < offset + len) {
+			if (i >= nfrags) {
+				BUG_ON(skb_headlen(fskb));
+
+				i = 0;
+				nfrags = skb_shinfo(fskb)->nr_frags;
+				skb_frag = skb_shinfo(fskb)->frags;
+
+				BUG_ON(!nfrags);
+
+				fskb = fskb->next;
+			}
+
+			if (unlikely(skb_shinfo(nskb)->nr_frags >=
+				     MAX_SKB_FRAGS)) {
+				net_warn_ratelimited(
+					"skb_segment: too many frags: %u %u\n",
+					pos, mss);
+				goto err;
+			}
+
+			*frag = *skb_frag;
 			__skb_frag_ref(frag);
 			size = skb_frag_size(frag);
 
@@ -2893,6 +2936,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 
 			if (pos + size <= offset + len) {
 				i++;
+				skb_frag++;
 				pos += size;
 			} else {
 				skb_frag_size_sub(frag, pos + size - (offset + len));
@@ -2902,25 +2946,6 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 			frag++;
 		}
 
-		if (pos < offset + len) {
-			struct sk_buff *fskb2 = fskb;
-
-			BUG_ON(pos + fskb->len != offset + len);
-
-			pos += fskb->len;
-			fskb = fskb->next;
-
-			if (fskb2->next) {
-				fskb2 = skb_clone(fskb2, GFP_ATOMIC);
-				if (!fskb2)
-					goto err;
-			} else
-				skb_get(fskb2);
-
-			SKB_FRAG_ASSERT(nskb);
-			skb_shinfo(nskb)->frag_list = fskb2;
-		}
-
 skip_fraglist:
 		nskb->data_len = len - hsize;
 		nskb->len += nskb->data_len;

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 163+ messages in thread

* Re: [2/3] gso: Handle new frag_list of frags GRO packets
  2013-11-07 18:16                                             ` [2/3] gso: Handle new frag_list of frags GRO packets Ben Hutchings
@ 2013-11-11 18:54                                               ` Herbert Xu
  0 siblings, 0 replies; 163+ messages in thread
From: Herbert Xu @ 2013-11-11 18:54 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Eric Dumazet, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Thu, Nov 07, 2013 at 06:16:44PM +0000, Ben Hutchings wrote:
> On Thu, 2013-11-07 at 15:06 +0800, Herbert Xu wrote:
> > Recently GRO started generating packets with frag_lists of frags.
> > This was not handled by GSO, thus leading to a crash.
> > 
> > Thankfully these packets are of a regular form and are easy to
> > handle.  This patch handles them by calling skb_segment for each 
> > frag_list entry.  The depth of recursion is limited to just one.
> > 
> > Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> > 
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 88b7dc6..bcc3f1c 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> [...]
> > @@ -2855,8 +2853,40 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
> >  						 nskb->data - tnl_hlen,
> >  						 doffset + tnl_hlen);
> >  
> > -		if (fskb != skb_shinfo(skb)->frag_list)
> > -			goto perform_csum_check;
> > +		if (fskb != skb_shinfo(skb)->frag_list) {
> > +			struct sk_buff *nsegs;
> > +
> > +			if (nskb->len == len + doffset)
> > +				goto perform_csum_check;
> > +
> > +			SKB_FRAG_ASSERT(nskb);
> > +
> > +			__skb_pull(nskb, doffset);
> > +			skb_shinfo(nskb)->gso_size = mss;
> > +			nsegs = skb_segment(nskb, features);
> > +
> > +			err = PTR_ERR(nsegs);
> > +			if (IS_ERR(nsegs)) {
> > +				kfree(nskb);
> 
> Should be kfree_skb().

Thanks for catching this and I have incorporated this into the
newer (albeit completely different) version of the patch.
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [3/3] gso: Handle malicious GRO packets without crashing
  2013-11-07 19:13                                               ` Sergei Shtylyov
@ 2013-11-11 18:55                                                 ` Herbert Xu
  0 siblings, 0 replies; 163+ messages in thread
From: Herbert Xu @ 2013-11-11 18:55 UTC (permalink / raw)
  To: Sergei Shtylyov
  Cc: Eric Dumazet, Ben Hutchings, David Miller, christoph.paasch,
	netdev, hkchu, mwdalton

On Thu, Nov 07, 2013 at 10:13:29PM +0300, Sergei Shtylyov wrote:
> Hello.
> 
> On 11/07/2013 10:08 AM, Herbert Xu wrote:
> 
> >As virtio_net can now generate GRO frag_list packets without
> >sufficient verification, we need to handle malicious GRO packets
> >thrown at us.
> 
> >This patch converts to affected BUG_ONs in skb_segment to rate-
> >limited warnings.
> 
> >Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> 
> >diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> >index bcc3f1c..fb1106d 100644
> >--- a/net/core/skbuff.c
> >+++ b/net/core/skbuff.c
> >@@ -2881,7 +2881,15 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
> >  			while (tail->next)
> >  				tail = tail->next;
> >
> >-			BUG_ON(fskb && tail->len != len + doffset);
> >+			if (fskb && tail->len != len + doffset) {
> >+				net_warn_ratelimited(
> >+					"skb_segment: "
> >+					"illegal GSO fragment: %u %u\n",
> 
>    Don't break up the message -- chekpatch.pl should allow that...

Thanks for the comment.  In the latest version of this patch
this should no longer be an issue.
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v4 net-next] net: introduce dev_set_forwarding()
  2013-11-08  5:40                                               ` Eric Dumazet
@ 2013-11-11 18:58                                                 ` Herbert Xu
  0 siblings, 0 replies; 163+ messages in thread
From: Herbert Xu @ 2013-11-11 18:58 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, bhutchings, christoph.paasch, netdev, hkchu, mwdalton

On Thu, Nov 07, 2013 at 09:40:12PM -0800, Eric Dumazet wrote:
> On Fri, 2013-11-08 at 13:21 +0800, Herbert Xu wrote:
> 
> > My point is that even if you did that GRO with your frag_list
> > patch should still be a win because the stack prior to the qdisc
> > gets run once instead of two or three times.
> > 
> 
> OK, lets me repeat again.
> 
> 64KB packet receive/aggregation time is more than 540 us on 1Gbps link.

I presume you're still talking about the case where we're CPU-
bound on receive.  In that case I totally agree that you need to
impose a limit on the NAPI/GRO run so that we don't keep doing GRO
forever.

However, the limit should be based on time and not an arbitrary
number such as MAX_SKB_FRAGS.  IOW relying on not having a frag_list
to provide a bound to GRO is just wrong.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* RE: Handle new frag_list of frags GRO packets
  2013-11-11 18:52                                             ` Herbert Xu
@ 2013-11-12 10:12                                               ` David Laight
  2013-11-13  1:13                                               ` gso: " Eric Dumazet
  1 sibling, 0 replies; 163+ messages in thread
From: David Laight @ 2013-11-12 10:12 UTC (permalink / raw)
  To: Herbert Xu, Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

> Recently GRO started generating packets with frag_lists of frags.
> This was not handled by GSO, thus leading to a crash.

Is the build_dma_sg() code in net/usb/usbnet.c correct?
It creates a 'struct scatterlist' array for all the fragments
in and skb.

I'm not at clear of exactly how skb get put together.

	David
 

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Handle new frag_list of frags GRO packets
  2013-11-11 18:52                                             ` Herbert Xu
  2013-11-12 10:12                                               ` David Laight
@ 2013-11-13  1:13                                               ` Eric Dumazet
  2013-11-13  1:29                                                 ` Herbert Xu
  1 sibling, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-11-13  1:13 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Tue, 2013-11-12 at 02:52 +0800, Herbert Xu wrote:
> Recently GRO started generating packets with frag_lists of frags.
> This was not handled by GSO, thus leading to a crash.
> 
> Thankfully these packets are of a regular form and are easy to
> handle.  This patch handles them in two ways.  For completely
> non-linear frag_list entries, we simply continue to iterate over
> the frag_list frags once we exhaust the normal frags.  For frag_list
> entries with linear parts, we call pskb_trim on the first part
> of the frag_list skb, and then process the rest of the frags in
> the usual way.
> 
> This patch also kills a chunk of dead frag_list code that has
> obviously never ever been run since it ends up generating a bogus
> GSO-segmented packet with a frag_list entry.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 3735fad..557e1a5 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -2776,6 +2776,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
>  	struct sk_buff *segs = NULL;
>  	struct sk_buff *tail = NULL;
>  	struct sk_buff *fskb = skb_shinfo(skb)->frag_list;
> +	skb_frag_t *skb_frag = skb_shinfo(skb)->frags;
>  	unsigned int mss = skb_shinfo(skb)->gso_size;
>  	unsigned int doffset = skb->data - skb_mac_header(skb);
>  	unsigned int offset = doffset;
> @@ -2815,16 +2816,38 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
>  		if (hsize > len || !sg)
>  			hsize = len;
>  
> -		if (!hsize && i >= nfrags) {
> -			BUG_ON(fskb->len != len);
> +		if (!hsize && i >= nfrags && skb_headlen(fskb) &&
> +		    (skb_headlen(fskb) == len || sg)) {
> +			BUG_ON(skb_headlen(fskb) > len);
> +

Hmm, yet another BUG_ON() case...

> +			i = 0;
> +			nfrags = skb_shinfo(fskb)->nr_frags;
> +			skb_frag = skb_shinfo(fskb)->frags;
> +			pos += skb_headlen(fskb);
> +
> +			while (pos < offset + len) {
> +				BUG_ON(i >= nfrags);
> +
> +				size = skb_frag_size(skb_frag);
> +				if (pos + size > offset + len)
> +					break;
> +
> +				i++;
> +				pos += size;
> +				skb_frag++;
> +			}
>  
> -			pos += len;
>  			nskb = skb_clone(fskb, GFP_ATOMIC);
>  			fskb = fskb->next;
>  
>  			if (unlikely(!nskb))
>  				goto err;
>  
> +			if (unlikely(pskb_trim(nskb, len))) {
> +				kfree_skb(nskb);
> +				goto err;
> +			}
> +

Note this pskb_trim() will reallocate/copy nskb head completely, since
nskb is a clone. (And increment page counts of frags, then eventually
decrement them)

I tested this patch one 'router', and it seems fine, although we consume
~90% more cpu doing the skb_segment() than not doing it.


GRO not building frag_list skbs :

lpaa6:~# mpstat -P 8 1 10
05:09:47 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle    intr/s
05:09:48 PM    8    0.00    0.00    1.01    0.00    3.03   23.23    0.00    0.00   72.73  43462.63
05:09:49 PM    8    0.00    0.00    0.00    0.00    5.10   28.57    0.00    0.00   66.33  88079.59
05:09:50 PM    8    0.00    0.00    0.00    0.00    2.13   17.02    0.00    0.00   80.85  41297.87
05:09:51 PM    8    0.00    0.00    0.95    0.00    3.81   29.52    0.00    0.00   65.71  45741.90
05:09:52 PM    8    0.00    0.00    0.00    0.00    2.11   17.89    0.00    0.00   80.00  25413.68
05:09:53 PM    8    1.03    0.00    1.03    0.00    2.06   20.62    0.00    0.00   75.26  36131.96
05:09:54 PM    8    0.00    0.00    0.94    0.00    3.77   30.19    0.00    0.00   65.09  47100.00
05:09:55 PM    8    0.00    0.00    0.00    0.00    3.26   21.74    0.00    0.00   75.00  71805.43
05:09:56 PM    8    0.00    0.00    0.00    0.00    3.19   22.34    0.00    0.00   74.47  70672.34
05:09:57 PM    8    0.00    0.00    0.00    0.00    4.50   32.43    0.00    0.00   63.06  45919.82
Average:       8    0.10    0.00    0.40    0.00    3.33   24.62    0.00    0.00   71.54  51339.66


Current GRO (large skbs -> need to split them with skb_segment(), no TSO is used)

lpaa6:~# mpstat -P 8 1 10
05:10:05 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle    intr/s
05:10:06 PM    8    0.00    0.00    0.00    0.00    8.16   44.90    0.00    0.00   46.94  88982.65
05:10:07 PM    8    0.00    0.00    1.05    0.00    8.42   50.53    0.00    0.00   40.00  54796.84
05:10:08 PM    8    0.00    0.00    1.94    0.00    8.74   52.43    0.00    0.00   36.89  50163.11
05:10:09 PM    8    0.00    0.00    0.00    0.00    7.14   39.80    0.00    0.00   53.06  85137.76
05:10:10 PM    8    0.00    0.00    0.00    0.00    8.08   44.44    0.00    0.00   47.47  42262.63
05:10:11 PM    8    0.00    0.00    0.00    0.00    8.00   53.00    0.00    0.00   39.00  53444.00
05:10:12 PM    8    0.00    0.00    0.00    0.00    5.00   27.50    0.00    0.00   67.50  91098.75
05:10:13 PM    8    0.00    0.00    0.00    0.00    8.55   47.86    0.00    0.00   43.59  34316.24
05:10:14 PM    8    0.00    0.00    0.00    0.00    8.70   48.91    0.00    0.00   42.39  56921.74
05:10:15 PM    8    0.00    0.00    0.93    0.00    7.41   46.30    0.00    0.00   45.37  77129.63
Average:       8    0.00    0.00    0.40    0.00    7.88   45.96    0.00    0.00   45.76  62458.99

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Handle new frag_list of frags GRO packets
  2013-11-13  1:13                                               ` gso: " Eric Dumazet
@ 2013-11-13  1:29                                                 ` Herbert Xu
  2013-11-13  2:14                                                   ` Eric Dumazet
  2013-11-13  2:17                                                   ` Eric Dumazet
  0 siblings, 2 replies; 163+ messages in thread
From: Herbert Xu @ 2013-11-13  1:29 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Tue, Nov 12, 2013 at 05:13:43PM -0800, Eric Dumazet wrote:
>
> Note this pskb_trim() will reallocate/copy nskb head completely, since
> nskb is a clone. (And increment page counts of frags, then eventually
> decrement them)
> 
> I tested this patch one 'router', and it seems fine, although we consume
> ~90% more cpu doing the skb_segment() than not doing it.

I presume this is on a NIC that produces completely linear packets?
I wonder what happens if we don't convert head_frag to frags and just
use frag_lists like we used to?

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Handle new frag_list of frags GRO packets
  2013-11-13  1:29                                                 ` Herbert Xu
@ 2013-11-13  2:14                                                   ` Eric Dumazet
  2013-11-13  2:17                                                   ` Eric Dumazet
  1 sibling, 0 replies; 163+ messages in thread
From: Eric Dumazet @ 2013-11-13  2:14 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Wed, 2013-11-13 at 09:29 +0800, Herbert Xu wrote:
> On Tue, Nov 12, 2013 at 05:13:43PM -0800, Eric Dumazet wrote:
> >
> > Note this pskb_trim() will reallocate/copy nskb head completely, since
> > nskb is a clone. (And increment page counts of frags, then eventually
> > decrement them)
> > 
> > I tested this patch one 'router', and it seems fine, although we consume
> > ~90% more cpu doing the skb_segment() than not doing it.
> 
> I presume this is on a NIC that produces completely linear packets?

Yes, I used one host with a Mellanox (mlx4 driver), as the bnx2x 'GRO'
is partially done by the hardware...


> I wonder what happens if we don't convert head_frag to frags and just
> use frag_lists like we used to?

Yep, but this needs quite invasive changes in the GSO stack in general ?

skb_segment() callers would need to know how many MSS are stuffed per
skb, instead of assuming its 1 MSS.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Handle new frag_list of frags GRO packets
  2013-11-13  1:29                                                 ` Herbert Xu
  2013-11-13  2:14                                                   ` Eric Dumazet
@ 2013-11-13  2:17                                                   ` Eric Dumazet
  2013-11-13  2:22                                                     ` Herbert Xu
  1 sibling, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-11-13  2:17 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Wed, 2013-11-13 at 09:29 +0800, Herbert Xu wrote:

> I presume this is on a NIC that produces completely linear packets?

Sorry : with mlx4 driver, GRO builds nice skbs with one page frag per
MSS, so each skb found on frag_list is fully loaded with 16 MSS

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Handle new frag_list of frags GRO packets
  2013-11-13  2:17                                                   ` Eric Dumazet
@ 2013-11-13  2:22                                                     ` Herbert Xu
  2013-11-13  2:25                                                       ` Herbert Xu
  2013-11-13  2:31                                                       ` Eric Dumazet
  0 siblings, 2 replies; 163+ messages in thread
From: Herbert Xu @ 2013-11-13  2:22 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Tue, Nov 12, 2013 at 06:17:14PM -0800, Eric Dumazet wrote:
> On Wed, 2013-11-13 at 09:29 +0800, Herbert Xu wrote:
> 
> > I presume this is on a NIC that produces completely linear packets?
> 
> Sorry : with mlx4 driver, GRO builds nice skbs with one page frag per
> MSS, so each skb found on frag_list is fully loaded with 16 MSS

OK, so what are the numbers when GRO is off completely?

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Handle new frag_list of frags GRO packets
  2013-11-13  2:22                                                     ` Herbert Xu
@ 2013-11-13  2:25                                                       ` Herbert Xu
  2013-11-13  2:45                                                         ` Eric Dumazet
  2013-11-13  2:31                                                       ` Eric Dumazet
  1 sibling, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-11-13  2:25 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Wed, Nov 13, 2013 at 10:22:43AM +0800, Herbert Xu wrote:
> On Tue, Nov 12, 2013 at 06:17:14PM -0800, Eric Dumazet wrote:
> > On Wed, 2013-11-13 at 09:29 +0800, Herbert Xu wrote:
> > 
> > > I presume this is on a NIC that produces completely linear packets?
> > 
> > Sorry : with mlx4 driver, GRO builds nice skbs with one page frag per
> > MSS, so each skb found on frag_list is fully loaded with 16 MSS
> 
> OK, so what are the numbers when GRO is off completely?

Actually don't bother, it's not a fair comparison at all.  In the
first case we're doing TSO and with my patch we're only doing GSO.

So a better test for the time being would be to test with TSO
disabled in both cases.

In the mean time I'm cooking up a patch to generate TSO packets.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Handle new frag_list of frags GRO packets
  2013-11-13  2:22                                                     ` Herbert Xu
  2013-11-13  2:25                                                       ` Herbert Xu
@ 2013-11-13  2:31                                                       ` Eric Dumazet
  1 sibling, 0 replies; 163+ messages in thread
From: Eric Dumazet @ 2013-11-13  2:31 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Wed, 2013-11-13 at 10:22 +0800, Herbert Xu wrote:
> On Tue, Nov 12, 2013 at 06:17:14PM -0800, Eric Dumazet wrote:
> > On Wed, 2013-11-13 at 09:29 +0800, Herbert Xu wrote:
> > 
> > > I presume this is on a NIC that produces completely linear packets?
> > 
> > Sorry : with mlx4 driver, GRO builds nice skbs with one page frag per
> > MSS, so each skb found on frag_list is fully loaded with 16 MSS
> 
> OK, so what are the numbers when GRO is off completely?

Pretty bad, as we drop many packets.

My (single) tcp flow is no longer full speed. (link speed is 10Gb)

lpaa6:~# mpstat -P 3 1 10
06:28:11 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle    intr/s
06:28:12 PM    3    2.04    0.00    7.14    0.00    5.10   73.47    0.00    0.00   12.24  28594.90
06:28:13 PM    3    3.00    0.00    3.00    0.00    5.00   74.00    0.00    0.00   15.00  28492.00
06:28:14 PM    3    0.00    0.00    0.99    0.00    5.94   93.07    0.00    0.00    0.00  35948.51
06:28:15 PM    3    5.00    0.00    4.00    0.00    5.00   74.00    0.00    0.00   12.00  28699.00
06:28:16 PM    3    3.06    0.00    5.10    0.00    5.10   75.51    0.00    0.00   11.22  30037.76
06:28:17 PM    3    0.00    0.00    0.00    0.00    6.00   92.00    0.00    0.00    2.00  36338.00
06:28:18 PM    3    1.00    0.00    2.00    0.00    5.00   73.00    0.00    0.00   19.00  28230.00
06:28:19 PM    3    0.00    0.00    2.02    0.00    6.06   86.87    0.00    0.00    5.05  33987.88
06:28:20 PM    3    0.00    0.00    0.00    0.00    6.00   81.00    0.00    0.00   13.00  32086.00
06:28:21 PM    3    1.98    0.00    1.98    0.00    4.95   73.27    0.00    0.00   17.82  28383.17
Average:       3    1.60    0.00    2.61    0.00    5.42   79.64    0.00    0.00   10.73  31086.06

perf profile (GRO off)

     6.88%  [kernel]          [k] fib_table_lookup                      
     5.97%  [kernel]          [k] _raw_spin_lock                        
     5.40%  [kernel]          [k] ipt_do_table                          
     4.79%  [mlx4_en]         [k] mlx4_en_process_rx_cq                 
     4.29%  [kernel]          [k] htb_dequeue                           
     3.65%  [mlx4_en]         [k] mlx4_en_xmit                          
     2.48%  [kernel]          [k] kmem_cache_free                       
     2.27%  [kernel]          [k] __netif_receive_skb_core              
     2.23%  [kernel]          [k] local_bh_enable                       
     2.22%  [mlx4_en]         [k] mlx4_en_tx_irq                        
     2.17%  [kernel]          [k] ip_route_input_noref                  
     2.16%  [kernel]          [k] kmem_cache_alloc                      
     2.03%  [kernel]          [k] check_leaf.isra.7                     
     1.89%  [kernel]          [k] inet_getpeer                          
     1.62%  [kernel]          [k] nf_iterate                            
     1.44%  [kernel]          [k] fib_validate_source                   
     1.42%  [kernel]          [k] dev_kfree_skb_irq                     
     1.39%  [kernel]          [k] build_skb                             
     1.38%  [kernel]          [k] ip_rcv                                
     1.26%  [kernel]          [k] htb_lookup_leaf                       
     1.23%  [kernel]          [k] dev_queue_xmit                        
     1.17%  [kernel]          [k] _raw_spin_lock_bh                     
     1.14%  [mlx4_en]         [k] mlx4_en_complete_rx_desc              
     1.08%  [unknown]         [.] 0x00000000019cdd6d                    
     1.03%  [kernel]          [k] htb_deactivate_prios                  
     0.94%  [kernel]          [k] htb_activate_prios                    
     0.91%  [kernel]          [k] htb_enqueue                           
     0.89%  [kernel]          [k] ip_forward                            
     0.87%  [mlx4_en]         [k] mlx4_en_free_tx_desc.isra.25          
     0.84%  [mlx4_en]         [k] mlx4_en_alloc_frags                   
     0.83%  [kernel]          [k] __netdev_alloc_frag                   
     0.83%  [kernel]          [k] read_tsc                              
     0.83%  [kernel]          [k] netif_receive_skb                     
     0.81%  [kernel]          [k] nf_hook_slow                          
     0.80%  [kernel]          [k] dst_alloc                             

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Handle new frag_list of frags GRO packets
  2013-11-13  2:25                                                       ` Herbert Xu
@ 2013-11-13  2:45                                                         ` Eric Dumazet
  2013-11-13 14:26                                                           ` Herbert Xu
  0 siblings, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-11-13  2:45 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Wed, 2013-11-13 at 10:25 +0800, Herbert Xu wrote:

> So a better test for the time being would be to test with TSO
> disabled in both cases.
> 

GRO on, TSO off, little difference between two cases :

(Note some small things run in background, so these numbers are not
ultra precise)

1) Full size GRO packets (frag_list enabled)

lpaa6:~# mpstat -P 3 1 10
06:39:32 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle    intr/s
06:39:33 PM    3    3.23    0.00   43.55    0.00   12.90    0.00    0.00    0.00   40.32 123209.68
06:39:34 PM    3    3.39    0.00   28.81    0.00   16.95    0.00    0.00    0.00   50.85  84559.32
06:39:35 PM    3    1.79    0.00    3.57    0.00   10.71    0.00    0.00    0.00   83.93 108898.21
06:39:36 PM    3    3.39    0.00   30.51    0.00   16.95    0.00    0.00    0.00   49.15  84661.02
06:39:37 PM    3    0.00    0.00    7.46    0.00   10.45    0.00    0.00    0.00   82.09  72135.82
06:39:38 PM    3    1.56    0.00   26.56    0.00    9.38    0.00    0.00    0.00   62.50  73562.50
06:39:39 PM    3    0.00    0.00    8.47    0.00   15.25    0.00    0.00    0.00   76.27  75715.25
06:39:40 PM    3    8.33    0.00    8.33    0.00    2.08    0.00    0.00    0.00   81.25 129195.83
06:39:41 PM    3    1.56    0.00   23.44    0.00   23.44    0.00    0.00    0.00   51.56  73389.06
06:39:42 PM    3    0.00    0.00    1.96    0.00   17.65    0.00    0.00    0.00   80.39 102474.51
Average:       3    2.21    0.00   18.85    0.00   13.75    0.00    0.00    0.00   65.20  91433.11

2) No frag_list skbs

lpaa6:~# mpstat -P 3 1 10
06:39:56 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle    intr/s
06:39:57 PM    3    0.00    0.00    2.63    0.00   26.32    0.00    0.00    0.00   71.05 173105.26
06:39:58 PM    3    3.85    0.00   13.46    0.00   15.38    0.00    0.00    0.00   67.31 121346.15
06:39:59 PM    3    9.09    0.00   18.18    0.00   14.55    0.00    0.00    0.00   58.18  84216.36
06:40:00 PM    3    0.00    0.00   10.53    0.00   10.53    2.63    0.00    0.00   76.32 170863.16
06:40:01 PM    3    3.45    0.00   20.69    0.00   18.97    0.00    0.00    0.00   56.90  83941.38
06:40:02 PM    3    0.00    0.00    5.88    0.00   19.61    0.00    0.00    0.00   74.51 125625.49
06:40:03 PM    3    1.85    0.00   24.07    0.00   16.67    0.00    0.00    0.00   57.41  88242.59
06:40:04 PM    3    3.33    0.00   13.33    0.00   13.33    0.00    0.00    0.00   70.00  74181.67
06:40:05 PM    3    0.00    0.00   10.00    0.00   16.00    2.00    0.00    0.00   72.00 150906.00
06:40:06 PM    3    3.28    0.00   44.26    0.00   14.75    0.00    0.00    0.00   37.70  79821.31
Average:       3    2.71    0.00   17.41    0.00   16.44    0.39    0.00    0.00   63.06 110094.00


> In the mean time I'm cooking up a patch to generate TSO packets.

That would be very nice !

Thanks

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Handle new frag_list of frags GRO packets
  2013-11-13  2:45                                                         ` Eric Dumazet
@ 2013-11-13 14:26                                                           ` Herbert Xu
  2013-11-13 15:06                                                             ` Eric Dumazet
  0 siblings, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-11-13 14:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Tue, Nov 12, 2013 at 06:45:04PM -0800, Eric Dumazet wrote:
> On Wed, 2013-11-13 at 10:25 +0800, Herbert Xu wrote:
> 
> > So a better test for the time being would be to test with TSO
> > disabled in both cases.
> > 
> 
> GRO on, TSO off, little difference between two cases :
> 
> (Note some small things run in background, so these numbers are not
> ultra precise)

OK looks pretty sane.
 
> > In the mean time I'm cooking up a patch to generate TSO packets.
> 
> That would be very nice !

It's failing on my machine but I'm not certain whether it's a bug
in my patch or a bug in my NIC's TSO code.  So I'd appreciate it
if you can give this a spin on your mlx4.

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 557e1a5..1302515 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2786,6 +2786,8 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 	__be16 proto;
 	bool csum;
 	int sg = !!(features & NETIF_F_SG);
+	int gso_type = 0;
+	int gso_size = 0;
 	int nfrags = skb_shinfo(skb)->nr_frags;
 	int err = -ENOMEM;
 	int i = 0;
@@ -2795,6 +2797,11 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 	if (unlikely(!proto))
 		return ERR_PTR(-EINVAL);
 
+	if (net_gso_ok(gso_type, features)) {
+		gso_type = skb_shinfo(skb)->gso_type & ~SKB_GSO_DODGY;
+		gso_size = mss;
+	}
+
 	csum = !!can_checksum_protocol(features, proto);
 	__skb_push(skb, doffset);
 	headroom = skb_headroom(skb);
@@ -2807,7 +2814,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 		int size;
 
 		len = skb->len - offset;
-		if (len > mss)
+		if (!gso_size && len > mss)
 			len = mss;
 
 		hsize = skb_headlen(skb) - offset;
@@ -2819,6 +2826,26 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 		if (!hsize && i >= nfrags && skb_headlen(fskb) &&
 		    (skb_headlen(fskb) == len || sg)) {
 			BUG_ON(skb_headlen(fskb) > len);
+			SKB_FRAG_ASSERT(fskb);
+
+			nskb = skb_clone(fskb, GFP_ATOMIC);
+			if (unlikely(!nskb))
+				goto err;
+
+			if (gso_size) {
+				len = nskb->len;
+				pos += len;
+
+				skb_shinfo(nskb)->gso_segs = len / mss;
+
+				/*
+				 * Original GRO packet boundaries must
+				 * have been preserved.
+				 */
+				BUG_ON(fskb->next && len % mss);
+
+				goto skip_trim;
+			}
 
 			i = 0;
 			nfrags = skb_shinfo(fskb)->nr_frags;
@@ -2837,17 +2864,14 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 				skb_frag++;
 			}
 
-			nskb = skb_clone(fskb, GFP_ATOMIC);
-			fskb = fskb->next;
-
-			if (unlikely(!nskb))
-				goto err;
-
 			if (unlikely(pskb_trim(nskb, len))) {
 				kfree_skb(nskb);
 				goto err;
 			}
 
+skip_trim:
+			fskb = fskb->next;
+
 			hsize = skb_end_offset(nskb);
 			if (skb_cow_head(nskb, doffset + headroom)) {
 				kfree_skb(nskb);
@@ -2880,6 +2904,9 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 
 		skb_headers_offset_update(nskb, skb_headroom(nskb) - headroom);
 
+		skb_shinfo(nskb)->gso_size = gso_size;
+		skb_shinfo(nskb)->gso_type = gso_type;
+
 		skb_copy_from_linear_data_offset(skb, -tnl_hlen,
 						 nskb->data - tnl_hlen,
 						 doffset + tnl_hlen);
@@ -2902,6 +2929,41 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 
 		skb_shinfo(nskb)->tx_flags = skb_shinfo(skb)->tx_flags & SKBTX_SHARED_FRAG;
 
+		/*
+		 * virtio-net misery:
+		 *
+		 * Do a trial run for hardware GSO to get the proper length.
+		 */
+		if (pos < offset + len && gso_size) {
+			int j;
+
+			len = hsize - (offset - pos);
+
+			for (j = i; j < nfrags; j++)
+				len += skb_frag_size(skb_frag + j);
+
+			if (fskb && !skb_headlen(fskb)) {
+				j = min_t(int,
+					  skb_shinfo(fskb)->nr_frags,
+					  MAX_SKB_FRAGS - nfrags + i);
+
+				while (--j >= 0)
+					len += skb_frag_size(
+						skb_shinfo(fskb)->frags + j);
+			}
+
+			if (len < mss && offset + len < skb->len)
+				goto too_many_frags;
+
+			skb_shinfo(nskb)->gso_segs = len / mss;
+			if (len % mss) {
+				if (offset + len >= skb->len)
+					skb_shinfo(nskb)->gso_segs++;
+				else
+					len -= len % mss;
+			}
+		}
+
 		while (pos < offset + len) {
 			if (i >= nfrags) {
 				BUG_ON(skb_headlen(fskb));
@@ -2917,6 +2979,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 
 			if (unlikely(skb_shinfo(nskb)->nr_frags >=
 				     MAX_SKB_FRAGS)) {
+too_many_frags:
 				net_warn_ratelimited(
 					"skb_segment: too many frags: %u %u\n",
 					pos, mss);

Thanks!
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 163+ messages in thread

* Re: gso: Handle new frag_list of frags GRO packets
  2013-11-13 14:26                                                           ` Herbert Xu
@ 2013-11-13 15:06                                                             ` Eric Dumazet
  2013-11-14  8:11                                                               ` Herbert Xu
  0 siblings, 1 reply; 163+ messages in thread
From: Eric Dumazet @ 2013-11-13 15:06 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Wed, 2013-11-13 at 22:26 +0800, Herbert Xu wrote:
> On Tue, Nov 12, 2013 at 06:45:04PM -0800, Eric Dumazet wrote:
> > On Wed, 2013-11-13 at 10:25 +0800, Herbert Xu wrote:
> > 
> > > So a better test for the time being would be to test with TSO
> > > disabled in both cases.
> > > 
> > 
> > GRO on, TSO off, little difference between two cases :
> > 
> > (Note some small things run in background, so these numbers are not
> > ultra precise)
> 
> OK looks pretty sane.
>  
> > > In the mean time I'm cooking up a patch to generate TSO packets.
> > 
> > That would be very nice !
> 
> It's failing on my machine but I'm not certain whether it's a bug
> in my patch or a bug in my NIC's TSO code.  So I'd appreciate it
> if you can give this a spin on your mlx4.
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 557e1a5..1302515 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c


Well, I wont try this patch, as it can not possibly work :(

The problem is not only skb_segment() but all its callers / users, as I
mentioned yesterday.

Specifically, you'll have to change inet_gso_segment(),
gre_gso_segment(), tcp_gso_segment(), ipv6_gso_segment(),
udp6_ufo_fragment() ...

Yes, we have to propagate correct IP id identifiers, and correct tcp
sequence numbers, tcp checksums.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: gso: Handle new frag_list of frags GRO packets
  2013-11-13 15:06                                                             ` Eric Dumazet
@ 2013-11-14  8:11                                                               ` Herbert Xu
  2013-11-15  4:37                                                                 ` Eric Dumazet
  0 siblings, 1 reply; 163+ messages in thread
From: Herbert Xu @ 2013-11-14  8:11 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Wed, Nov 13, 2013 at 07:06:25AM -0800, Eric Dumazet wrote:
>
> Well, I wont try this patch, as it can not possibly work :(

You're right.  It sort of worked for me because I had the GSO
features test reversed meaning it never enabled my new code.

This new patch is still incomplete in that it only does TCPv4 but
it does actually seem to work.

Please let me know what the performance numbers look like.

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 557e1a5..e45a2ad 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2786,6 +2786,8 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 	__be16 proto;
 	bool csum;
 	int sg = !!(features & NETIF_F_SG);
+	int gso_type = 0;
+	int gso_size = 0;
 	int nfrags = skb_shinfo(skb)->nr_frags;
 	int err = -ENOMEM;
 	int i = 0;
@@ -2795,6 +2797,11 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 	if (unlikely(!proto))
 		return ERR_PTR(-EINVAL);
 
+	if (net_gso_ok(features, gso_type)) {
+		gso_type = skb_shinfo(skb)->gso_type & ~SKB_GSO_DODGY;
+		gso_size = mss;
+	}
+
 	csum = !!can_checksum_protocol(features, proto);
 	__skb_push(skb, doffset);
 	headroom = skb_headroom(skb);
@@ -2805,9 +2812,10 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 		skb_frag_t *frag;
 		int hsize;
 		int size;
+		int gso_segs = 1;
 
 		len = skb->len - offset;
-		if (len > mss)
+		if (!gso_size && len > mss)
 			len = mss;
 
 		hsize = skb_headlen(skb) - offset;
@@ -2819,6 +2827,22 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 		if (!hsize && i >= nfrags && skb_headlen(fskb) &&
 		    (skb_headlen(fskb) == len || sg)) {
 			BUG_ON(skb_headlen(fskb) > len);
+			SKB_FRAG_ASSERT(fskb);
+
+			if (gso_size) {
+				len = fskb->len;
+				pos += len;
+
+				gso_segs = len / mss;
+
+				/*
+				 * Original GRO packet boundaries must
+				 * have been preserved.
+				 */
+				BUG_ON(fskb->next && len % mss);
+
+				goto clone_fskb;
+			}
 
 			i = 0;
 			nfrags = skb_shinfo(fskb)->nr_frags;
@@ -2837,6 +2861,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 				skb_frag++;
 			}
 
+clone_fskb:
 			nskb = skb_clone(fskb, GFP_ATOMIC);
 			fskb = fskb->next;
 
@@ -2880,6 +2905,10 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 
 		skb_headers_offset_update(nskb, skb_headroom(nskb) - headroom);
 
+		skb_shinfo(nskb)->gso_size = gso_size;
+		skb_shinfo(nskb)->gso_type = gso_type;
+		skb_shinfo(nskb)->gso_segs = gso_segs;
+
 		skb_copy_from_linear_data_offset(skb, -tnl_hlen,
 						 nskb->data - tnl_hlen,
 						 doffset + tnl_hlen);
@@ -2902,6 +2931,39 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 
 		skb_shinfo(nskb)->tx_flags = skb_shinfo(skb)->tx_flags & SKBTX_SHARED_FRAG;
 
+		/* Do a trial run for hardware GSO to get the proper length. */
+		if (pos < offset + len && gso_size) {
+			int j;
+
+			len = hsize;
+			if (pos < offset)
+				len -= offset - pos;
+
+			for (j = i; j < nfrags; j++)
+				len += skb_frag_size(skb_frag + j);
+
+			if (fskb && !skb_headlen(fskb)) {
+				j = min_t(int,
+					  skb_shinfo(fskb)->nr_frags,
+					  MAX_SKB_FRAGS - nfrags + i);
+
+				while (--j >= 0)
+					len += skb_frag_size(
+						skb_shinfo(fskb)->frags + j);
+			}
+
+			if (len < mss && offset + len < skb->len)
+				goto too_many_frags;
+
+			skb_shinfo(nskb)->gso_segs = len / mss;
+			if (len % mss) {
+				if (offset + len >= skb->len)
+					skb_shinfo(nskb)->gso_segs++;
+				else
+					len -= len % mss;
+			}
+		}
+
 		while (pos < offset + len) {
 			if (i >= nfrags) {
 				BUG_ON(skb_headlen(fskb));
@@ -2917,6 +2979,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
 
 			if (unlikely(skb_shinfo(nskb)->nr_frags >=
 				     MAX_SKB_FRAGS)) {
+too_many_frags:
 				net_warn_ratelimited(
 					"skb_segment: too many frags: %u %u\n",
 					pos, mss);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 09d78d4..fba07ba 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1317,7 +1317,8 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
 				iph->frag_off |= htons(IP_MF);
 			offset += skb->len - nhoff - ihl;
 		} else {
-			iph->id = htons(id++);
+			id += skb_shinfo(skb)->gso_segs;
+			iph->id = htons(id);
 		}
 		iph->tot_len = htons(skb->len - nhoff);
 		ip_send_check(iph);
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index a2b68a1..62f9334 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -22,11 +22,9 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
 	struct tcphdr *th;
 	unsigned int thlen;
 	unsigned int seq;
-	__be32 delta;
 	unsigned int oldlen;
 	unsigned int mss;
 	struct sk_buff *gso_skb = skb;
-	__sum16 newcheck;
 	bool ooo_okay, copy_destructor;
 
 	if (!pskb_may_pull(skb, sizeof(*th)))
@@ -83,25 +81,24 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
 	/* Only first segment might have ooo_okay set */
 	segs->ooo_okay = ooo_okay;
 
-	delta = htonl(oldlen + (thlen + mss));
-
 	skb = segs;
 	th = tcp_hdr(skb);
 	seq = ntohl(th->seq);
 
-	newcheck = ~csum_fold((__force __wsum)((__force u32)th->check +
-					       (__force u32)delta));
-
 	do {
 		th->fin = th->psh = 0;
-		th->check = newcheck;
+
+		th->check = ~csum_fold((__force __wsum)(
+			(__force u32)th->check +
+			(__force u32)htonl(oldlen + skb->len -
+					   skb_transport_offset(skb))));
 
 		if (skb->ip_summed != CHECKSUM_PARTIAL)
 			th->check =
 			     csum_fold(csum_partial(skb_transport_header(skb),
 						    thlen, skb->csum));
 
-		seq += mss;
+		seq += skb->len - skb_transport_offset(skb) - thlen;
 		if (copy_destructor) {
 			skb->destructor = gso_skb->destructor;
 			skb->sk = gso_skb->sk;
@@ -127,11 +124,10 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
 			   &skb->sk->sk_wmem_alloc);
 	}
 
-	delta = htonl(oldlen + (skb_tail_pointer(skb) -
-				skb_transport_header(skb)) +
-		      skb->data_len);
-	th->check = ~csum_fold((__force __wsum)((__force u32)th->check +
-				(__force u32)delta));
+	th->check = ~csum_fold((__force __wsum)(
+		(__force u32)th->check +
+		(__force u32)htonl(oldlen + skb->len -
+				   skb_transport_offset(skb))));
 	if (skb->ip_summed != CHECKSUM_PARTIAL)
 		th->check = csum_fold(csum_partial(skb_transport_header(skb),
 						   thlen, skb->csum));

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 163+ messages in thread

* Re: gso: Handle new frag_list of frags GRO packets
  2013-11-14  8:11                                                               ` Herbert Xu
@ 2013-11-15  4:37                                                                 ` Eric Dumazet
  0 siblings, 0 replies; 163+ messages in thread
From: Eric Dumazet @ 2013-11-15  4:37 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ben Hutchings, David Miller, christoph.paasch, netdev, hkchu, mwdalton

On Thu, 2013-11-14 at 16:11 +0800, Herbert Xu wrote:
> On Wed, Nov 13, 2013 at 07:06:25AM -0800, Eric Dumazet wrote:
> >
> > Well, I wont try this patch, as it can not possibly work :(
> 
> You're right.  It sort of worked for me because I had the GSO
> features test reversed meaning it never enabled my new code.
> 
> This new patch is still incomplete in that it only does TCPv4 but
> it does actually seem to work.
> 
> Please let me know what the performance numbers look like.

Just an update :  I 'lost' the host to do this experiment,
and will regain it shortly.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v4 net-next] net: introduce dev_set_forwarding()
  2013-11-02 19:58                 ` [PATCH v4 " Eric Dumazet
  2013-11-03 17:18                   ` Christoph Paasch
  2013-11-04 16:55                   ` Ben Hutchings
@ 2013-11-21 18:29                   ` David Miller
  2013-11-21 18:38                     ` Eric Dumazet
  2 siblings, 1 reply; 163+ messages in thread
From: David Miller @ 2013-11-21 18:29 UTC (permalink / raw)
  To: eric.dumazet
  Cc: bhutchings, christoph.paasch, herbert, netdev, hkchu, mwdalton


Eric please resubmit this in the future once all of the other fixes
have been sorted out with Herbert.

Thanks.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: [PATCH v4 net-next] net: introduce dev_set_forwarding()
  2013-11-21 18:29                   ` David Miller
@ 2013-11-21 18:38                     ` Eric Dumazet
  0 siblings, 0 replies; 163+ messages in thread
From: Eric Dumazet @ 2013-11-21 18:38 UTC (permalink / raw)
  To: David Miller
  Cc: bhutchings, christoph.paasch, herbert, netdev, hkchu, mwdalton

On Thu, 2013-11-21 at 13:29 -0500, David Miller wrote:
> Eric please resubmit this in the future once all of the other fixes
> have been sorted out with Herbert.

Sure, I postponed this for net-next, and will remove the GRO part from
it, as Herbert choked on it ;)

^ permalink raw reply	[flat|nested] 163+ messages in thread

end of thread, other threads:[~2013-11-21 18:44 UTC | newest]

Thread overview: 163+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-10-28 11:55 Bug in skb_segment: fskb->len != len Christoph Paasch
2013-10-28 13:21 ` Eric Dumazet
2013-10-28 13:28   ` Christoph Paasch
2013-10-29  1:15   ` Eric Dumazet
2013-10-29  9:08     ` Christoph Paasch
2013-10-29 12:57       ` Eric Dumazet
2013-10-29 13:06       ` [PATCH net-next] net: introduce gro_frag_list_enable sysctl Eric Dumazet
2013-10-29 13:48         ` Christoph Paasch
2013-10-29 15:12         ` [PATCH v2 " Eric Dumazet
2013-10-29 23:44           ` David Miller
2013-10-30  0:06             ` Ben Hutchings
2013-11-02 14:01               ` [PATCH v3 net-next] net: introduce dev_set_forwarding() Eric Dumazet
2013-11-02 15:46                 ` Ben Hutchings
2013-11-02 18:20                   ` Eric Dumazet
2013-11-02 19:58                 ` [PATCH v4 " Eric Dumazet
2013-11-03 17:18                   ` Christoph Paasch
2013-11-04 16:55                   ` Ben Hutchings
2013-11-07 21:17                     ` David Miller
2013-11-07 21:31                       ` Herbert Xu
2013-11-07 21:54                         ` Eric Dumazet
2013-11-08  3:59                           ` Herbert Xu
2013-11-08  4:25                             ` Eric Dumazet
2013-11-10 14:05                               ` Herbert Xu
2013-11-11 14:36                                 ` Herbert Xu
2013-11-07 22:06                         ` David Miller
2013-11-08  2:17                           ` Herbert Xu
2013-11-08  2:42                             ` Eric Dumazet
2013-11-08  2:51                               ` Eric Dumazet
2013-11-08  3:23                                 ` Herbert Xu
2013-11-08  4:21                                   ` Eric Dumazet
2013-11-08  4:24                                     ` Herbert Xu
2013-11-08  4:40                                       ` Eric Dumazet
2013-11-08  4:43                                         ` Herbert Xu
2013-11-08  5:08                                           ` Eric Dumazet
2013-11-08  5:21                                             ` Herbert Xu
2013-11-08  5:40                                               ` Eric Dumazet
2013-11-11 18:58                                                 ` Herbert Xu
2013-11-08  3:22                               ` Herbert Xu
2013-11-08  4:06                                 ` Eric Dumazet
2013-11-08  4:10                                   ` Herbert Xu
2013-11-08  4:24                                     ` Eric Dumazet
2013-11-08  4:28                                       ` Herbert Xu
2013-11-21 18:29                   ` David Miller
2013-11-21 18:38                     ` Eric Dumazet
2013-11-03 12:28                 ` [PATCH v3 " Herbert Xu
2013-11-03 16:28                   ` Eric Dumazet
2013-11-03 16:31                     ` Herbert Xu
2013-11-03 17:26                       ` Eric Dumazet
2013-11-04  4:11                         ` Herbert Xu
2013-11-04  4:23                           ` Eric Dumazet
2013-11-04  4:29                             ` Herbert Xu
2013-11-04  5:00                               ` Eric Dumazet
2013-11-04  5:23                                 ` Herbert Xu
2013-11-04  6:05                                   ` Eric Dumazet
2013-11-04  6:22                                     ` Herbert Xu
2013-11-04  6:26                                       ` Herbert Xu
2013-11-04  7:10                                         ` Eric Dumazet
2013-11-04  7:21                                           ` Herbert Xu
2013-11-04 13:58                                             ` Eric Dumazet
2013-11-04  6:46                                       ` Eric Dumazet
2013-11-04  7:03                                         ` Herbert Xu
2013-11-06  1:30                           ` gso: Attempt to handle mega-GRO packets Herbert Xu
2013-11-06  1:45                             ` Eric Dumazet
2013-11-06  4:07                               ` Herbert Xu
2013-11-06  4:23                                 ` Eric Dumazet
2013-11-06  4:28                                   ` Herbert Xu
2013-11-06  5:20                                     ` Eric Dumazet
2013-11-06  8:04                                       ` Herbert Xu
2013-11-06  8:16                                         ` Herbert Xu
2013-11-06 13:12                                           ` Herbert Xu
2013-11-06 15:01                                             ` Eric Dumazet
2013-11-07  0:36                                               ` Herbert Xu
2013-11-07  1:03                                                 ` Eric Dumazet
2013-11-07  1:47                                                   ` Herbert Xu
2013-11-07  2:02                                                     ` Eric Dumazet
2013-11-07  2:08                                                       ` Eric Dumazet
2013-11-07  2:15                                                       ` Herbert Xu
2013-11-07  2:37                                                         ` Eric Dumazet
2013-11-07  2:41                                                           ` Herbert Xu
2013-11-07  5:56                                                       ` Michael S. Tsirkin
2013-11-07  7:07                                                         ` Eric Dumazet
2013-11-07  2:52                                                 ` Jason Wang
2013-11-06 15:05                                           ` Eric Dumazet
2013-11-07  0:39                                             ` Herbert Xu
2013-11-06 12:39                             ` Herbert Xu
2013-11-06 13:30                               ` Herbert Xu
2013-11-06 14:39                                 ` Herbert Xu
2013-11-06 15:06                                   ` Eric Dumazet
2013-11-06 17:25                                   ` Joe Perches
2013-11-06 19:47                                   ` Eric Dumazet
2013-11-07  0:15                                     ` Eric Dumazet
2013-11-07  0:47                                       ` Herbert Xu
2013-11-07  0:56                                         ` Eric Dumazet
2013-11-07  1:00                                           ` Herbert Xu
2013-11-07  1:08                                             ` Eric Dumazet
2013-11-07  1:13                                       ` Hannes Frederic Sowa
2013-11-07  1:21                                         ` Eric Dumazet
2013-11-07  1:34                                           ` Eric Dumazet
2013-11-07  2:03                                             ` Hannes Frederic Sowa
2013-11-07  3:05                                               ` Eric Dumazet
2013-11-07  6:59                                                 ` Eric Dumazet
2013-11-07  0:43                                     ` Herbert Xu
2013-11-07  6:22                                       ` Herbert Xu
2013-11-07  7:03                                         ` [1/3] gso: Add to segs at end of loop in skb_segment Herbert Xu
2013-11-07  7:06                                           ` [2/3] gso: Handle new frag_list of frags GRO packets Herbert Xu
2013-11-07  7:08                                             ` [3/3] gso: Handle malicious GRO packets without crashing Herbert Xu
2013-11-07 18:18                                               ` Ben Hutchings
2013-11-07 19:13                                               ` Sergei Shtylyov
2013-11-11 18:55                                                 ` Herbert Xu
2013-11-07 18:16                                             ` [2/3] gso: Handle new frag_list of frags GRO packets Ben Hutchings
2013-11-11 18:54                                               ` Herbert Xu
2013-11-11 18:52                                             ` Herbert Xu
2013-11-12 10:12                                               ` David Laight
2013-11-13  1:13                                               ` gso: " Eric Dumazet
2013-11-13  1:29                                                 ` Herbert Xu
2013-11-13  2:14                                                   ` Eric Dumazet
2013-11-13  2:17                                                   ` Eric Dumazet
2013-11-13  2:22                                                     ` Herbert Xu
2013-11-13  2:25                                                       ` Herbert Xu
2013-11-13  2:45                                                         ` Eric Dumazet
2013-11-13 14:26                                                           ` Herbert Xu
2013-11-13 15:06                                                             ` Eric Dumazet
2013-11-14  8:11                                                               ` Herbert Xu
2013-11-15  4:37                                                                 ` Eric Dumazet
2013-11-13  2:31                                                       ` Eric Dumazet
2013-11-07  7:11                                       ` gso: Attempt to handle mega-GRO packets Eric Dumazet
2013-11-07  7:15                                         ` Herbert Xu
2013-11-07  7:17                                           ` Herbert Xu
2013-11-07  7:31                                           ` Eric Dumazet
2013-11-07  7:33                                             ` Herbert Xu
2013-11-03 23:23                     ` [PATCH v3 net-next] net: introduce dev_set_forwarding() David Miller
2013-10-30  0:53             ` [PATCH v2 net-next] net: introduce gro_frag_list_enable sysctl Eric Dumazet
2013-10-30  2:02               ` David Miller
2013-10-30  2:05                 ` Herbert Xu
2013-10-30  2:13                   ` Jerry Chu
2013-10-30  2:19                     ` Herbert Xu
2013-10-30  2:34                       ` David Miller
2013-10-30  2:33                     ` David Miller
     [not found]                       ` <44571383414236@web13j.yandex.ru>
2013-11-02 18:28                         ` Eric Dumazet
2013-11-03 23:19                         ` David Miller
2013-10-30 19:39                   ` Ben Hutchings
2013-10-30 19:53                     ` Eric Dumazet
2013-10-30 20:05                       ` Ben Hutchings
2013-10-30 20:12                         ` Eric Dumazet
2013-10-30  4:06                 ` Eric Dumazet
2013-10-30  4:08                   ` Herbert Xu
2013-10-30  4:09                     ` Herbert Xu
2013-10-30  4:15                       ` Jerry Chu
2013-10-30  4:16                     ` Eric Dumazet
2013-10-30  4:19                       ` Herbert Xu
2013-10-30  4:34                         ` Eric Dumazet
2013-10-30  4:42                           ` Herbert Xu
2013-10-30 17:39                             ` Jerry Chu
2013-10-30 18:09                               ` Vlad Yasevich
2013-10-30 19:12                               ` David Miller
2013-10-30  0:03           ` Jerry Chu
2013-10-29 14:41     ` Bug in skb_segment: fskb->len != len Herbert Xu
2013-10-29 15:08       ` Eric Dumazet
2013-10-30  1:50         ` Herbert Xu
2013-10-30  4:03           ` Eric Dumazet
2013-10-30  4:06             ` Herbert Xu
2013-10-30  4:37               ` Eric Dumazet
2013-10-30  4:47                 ` Herbert Xu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.