netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* BUG_ON in skb_segment, after bpf_skb_change_proto was applied
@ 2019-08-26 14:07 Shmulik Ladkani
  2019-08-26 17:47 ` Eric Dumazet
  0 siblings, 1 reply; 13+ messages in thread
From: Shmulik Ladkani @ 2019-08-26 14:07 UTC (permalink / raw)
  To: Daniel Borkmann, Eric Dumazet, netdev, Alexander Duyck
  Cc: Alexei Starovoitov, Yonghong Song, Steffen Klassert, shmulik, eyal

Hi,

In our production systems, running v4.19.y longterm kernels, we hit a
BUG_ON in 'skb_segment()'. It occurs rarely and although tried, couldn't
synthetically reproduce.

In v4.19.41 it crashes at net/core/skbuff.c:3711

		while (pos < offset + len) {
			if (i >= nfrags) {
				i = 0;
				nfrags = skb_shinfo(list_skb)->nr_frags;
				frag = skb_shinfo(list_skb)->frags;
				frag_skb = list_skb;
				if (!skb_headlen(list_skb)) {
					BUG_ON(!nfrags);
				} else {
3711:					BUG_ON(!list_skb->head_frag);

With the accompanying dump:

 kernel BUG at net/core/skbuff.c:3711!
 invalid opcode: 0000 [#1] SMP PTI
 CPU: 2 PID: 0 Comm: swapper/2 Kdump: loaded Not tainted 4.19.41-041941-generic #201905080231
 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
 RIP: 0010:skb_segment+0xb65/0xda9
 Code: 89 44 24 60 48 89 4c 24 70 e8 87 b3 ff ff 48 8b 4c 24 70 44 8b 44 24 60 85 c0 44 8b 54 24 4c 0f 84 fc fb ff ff e9 16 fd ff ff <0f> 0b 29 c1 89 ce 09 ca e9 61 ff ff ff 0f 0b 41 8b bf 84 00 00 00
 RSP: 0018:ffff9e4d79b037c0 EFLAGS: 00010246
 RAX: ffff9e4d75012ec0 RBX: ffff9e4d74067500 RCX: 0000000000000000
 RDX: 0000000000480020 RSI: 0000000000000000 RDI: ffff9e4d74e3a200
 RBP: ffff9e4d79b03898 R08: 0000000000000564 R09: f69d84ecbfe8b972
 R10: 0000000000000571 R11: a6b66a32f69d84ec R12: 0000000000000564
 R13: ffff9e4c18d03ef0 R14: 0000000000000000 R15: ffff9e4d74e3a200
 FS:  0000000000000000(0000) GS:ffff9e4d79b00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00000000007f50d8 CR3: 000000009420a003 CR4: 00000000001606e0
 Call Trace:
  <IRQ>
  tcp_gso_segment+0xf9/0x4e0
  tcp6_gso_segment+0x5e/0x100
  ipv6_gso_segment+0x112/0x340
  skb_mac_gso_segment+0xb9/0x130
  __skb_gso_segment+0x84/0x190
  validate_xmit_skb+0x14a/0x2f0
  validate_xmit_skb_list+0x4b/0x70
  sch_direct_xmit+0x154/0x390
  __dev_queue_xmit+0x808/0x920
  dev_queue_xmit+0x10/0x20
  neigh_direct_output+0x11/0x20
  ip6_finish_output2+0x1b9/0x5b0
  ip6_finish_output+0x13a/0x1b0
  ip6_output+0x6c/0x110
  ? ip6_fragment+0xa40/0xa40
  ip6_forward+0x501/0x810
  ip6_rcv_finish+0x7a/0x90
  ipv6_rcv+0x69/0xe0
  ? nf_hook.part.24+0x10/0x10
  __netif_receive_skb_core+0x4fa/0xc80
  ? netif_receive_skb_core+0x20/0x20
  ? netif_receive_skb_internal+0x45/0xf0
  ? tcp4_gro_complete+0x86/0x90
  ? napi_gro_complete+0x53/0x90
  __netif_receive_skb_one_core+0x3b/0x80
  __netif_receive_skb+0x18/0x60
  process_backlog+0xb3/0x170
  net_rx_action+0x130/0x350
  __do_softirq+0xdc/0x2d4

To our best knowledge, the packet flow leading to this BUG_ON is:

  - ingress on eth0 (veth, gro:on), ipv4 udp encapsulated esp
  - re-ingresss on eth0, after xfrm, decapsulated ipv4 tcp
  - the skb was GROed (skb_is_gso:true)
  - ipv4 forwarding to dummy1, where eBPF nat4-to-6 program is attached
    at TC Egress (calls 'bpf_skb_change_proto()'), then redirect to ingress
    on same device.
    NOTE: 'bpf_skb_proto_4_to_6()' mangles 'shinfo->gso_size'
  - ingress on dummy1, ipv6 tcp
  - ipv6 forwarding
  - egress on tun2 (tun device) that calls:
    validate_xmit_skb -> ... -> skb_segment BUG_ON

A similar issue was reported and fixed by Yonghong Song in commit
13acc94eff12 ("net: permit skb_segment on head_frag frag_list skb").

However 13acc94eff12 added "BUG_ON(!list_skb->head_frag)" to line 3711,
and patchwork states:

    This patch addressed the issue by handling skb_headlen(list_skb) != 0
    case properly if list_skb->head_frag is true, which is expected in
    most cases. [1]

meaning, 13acc94eff12 does not support list_skb->head_frag=0 case.

Historically, it is claimed that skb_segment is rather intolerant to
gso_size changes, quote:

    Eric suggested to shrink gso_size instead to avoid segmentation+fragments.
    I think its nice idea, but skb_gso_segment makes certain assumptions about
    nr_frags and gso_size (it can't handle frag size > desired mss). [2]

Any suggestions how to debug and fix this?

Could it be that 'bpf_skb_change_proto()' isn't really allowed to
mangle 'gso_size', and we should somehow enforce a 'skb_segment()' call
PRIOR translation?

Appreciate any input and assistance,
Shmulik

[1] https://patchwork.ozlabs.org/patch/889166/
[2] https://patchwork.ozlabs.org/patch/314327/

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: BUG_ON in skb_segment, after bpf_skb_change_proto was applied
  2019-08-26 14:07 BUG_ON in skb_segment, after bpf_skb_change_proto was applied Shmulik Ladkani
@ 2019-08-26 17:47 ` Eric Dumazet
  2019-08-27 11:42   ` Shmulik Ladkani
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2019-08-26 17:47 UTC (permalink / raw)
  To: Shmulik Ladkani, Daniel Borkmann, Eric Dumazet, netdev, Alexander Duyck
  Cc: Alexei Starovoitov, Yonghong Song, Steffen Klassert, shmulik, eyal



On 8/26/19 4:07 PM, Shmulik Ladkani wrote:
> Hi,
> 
> In our production systems, running v4.19.y longterm kernels, we hit a
> BUG_ON in 'skb_segment()'. It occurs rarely and although tried, couldn't
> synthetically reproduce.
> 
> In v4.19.41 it crashes at net/core/skbuff.c:3711
> 
> 		while (pos < offset + len) {
> 			if (i >= nfrags) {
> 				i = 0;
> 				nfrags = skb_shinfo(list_skb)->nr_frags;
> 				frag = skb_shinfo(list_skb)->frags;
> 				frag_skb = list_skb;
> 				if (!skb_headlen(list_skb)) {
> 					BUG_ON(!nfrags);
> 				} else {
> 3711:					BUG_ON(!list_skb->head_frag);
> 
> With the accompanying dump:
> 
>  kernel BUG at net/core/skbuff.c:3711!
>  invalid opcode: 0000 [#1] SMP PTI
>  CPU: 2 PID: 0 Comm: swapper/2 Kdump: loaded Not tainted 4.19.41-041941-generic #201905080231
>  Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
>  RIP: 0010:skb_segment+0xb65/0xda9
>  Code: 89 44 24 60 48 89 4c 24 70 e8 87 b3 ff ff 48 8b 4c 24 70 44 8b 44 24 60 85 c0 44 8b 54 24 4c 0f 84 fc fb ff ff e9 16 fd ff ff <0f> 0b 29 c1 89 ce 09 ca e9 61 ff ff ff 0f 0b 41 8b bf 84 00 00 00
>  RSP: 0018:ffff9e4d79b037c0 EFLAGS: 00010246
>  RAX: ffff9e4d75012ec0 RBX: ffff9e4d74067500 RCX: 0000000000000000
>  RDX: 0000000000480020 RSI: 0000000000000000 RDI: ffff9e4d74e3a200
>  RBP: ffff9e4d79b03898 R08: 0000000000000564 R09: f69d84ecbfe8b972
>  R10: 0000000000000571 R11: a6b66a32f69d84ec R12: 0000000000000564
>  R13: ffff9e4c18d03ef0 R14: 0000000000000000 R15: ffff9e4d74e3a200
>  FS:  0000000000000000(0000) GS:ffff9e4d79b00000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 00000000007f50d8 CR3: 000000009420a003 CR4: 00000000001606e0
>  Call Trace:
>   <IRQ>
>   tcp_gso_segment+0xf9/0x4e0
>   tcp6_gso_segment+0x5e/0x100
>   ipv6_gso_segment+0x112/0x340
>   skb_mac_gso_segment+0xb9/0x130
>   __skb_gso_segment+0x84/0x190
>   validate_xmit_skb+0x14a/0x2f0
>   validate_xmit_skb_list+0x4b/0x70
>   sch_direct_xmit+0x154/0x390
>   __dev_queue_xmit+0x808/0x920
>   dev_queue_xmit+0x10/0x20
>   neigh_direct_output+0x11/0x20
>   ip6_finish_output2+0x1b9/0x5b0
>   ip6_finish_output+0x13a/0x1b0
>   ip6_output+0x6c/0x110
>   ? ip6_fragment+0xa40/0xa40
>   ip6_forward+0x501/0x810
>   ip6_rcv_finish+0x7a/0x90
>   ipv6_rcv+0x69/0xe0
>   ? nf_hook.part.24+0x10/0x10
>   __netif_receive_skb_core+0x4fa/0xc80
>   ? netif_receive_skb_core+0x20/0x20
>   ? netif_receive_skb_internal+0x45/0xf0
>   ? tcp4_gro_complete+0x86/0x90
>   ? napi_gro_complete+0x53/0x90
>   __netif_receive_skb_one_core+0x3b/0x80
>   __netif_receive_skb+0x18/0x60
>   process_backlog+0xb3/0x170
>   net_rx_action+0x130/0x350
>   __do_softirq+0xdc/0x2d4
> 
> To our best knowledge, the packet flow leading to this BUG_ON is:
> 
>   - ingress on eth0 (veth, gro:on), ipv4 udp encapsulated esp
>   - re-ingresss on eth0, after xfrm, decapsulated ipv4 tcp
>   - the skb was GROed (skb_is_gso:true)
>   - ipv4 forwarding to dummy1, where eBPF nat4-to-6 program is attached
>     at TC Egress (calls 'bpf_skb_change_proto()'), then redirect to ingress
>     on same device.
>     NOTE: 'bpf_skb_proto_4_to_6()' mangles 'shinfo->gso_size'

Doing this on an skb with a frag_list is doomed, in current gso_segment() state.

A rewrite  would be needed (I believe I did so at some point, but Herbert Xu fought hard against it)


>   - ingress on dummy1, ipv6 tcp
>   - ipv6 forwarding
>   - egress on tun2 (tun device) that calls:
>     validate_xmit_skb -> ... -> skb_segment BUG_ON
> 
> A similar issue was reported and fixed by Yonghong Song in commit
> 13acc94eff12 ("net: permit skb_segment on head_frag frag_list skb").
> 
> However 13acc94eff12 added "BUG_ON(!list_skb->head_frag)" to line 3711,
> and patchwork states:
> 
>     This patch addressed the issue by handling skb_headlen(list_skb) != 0
>     case properly if list_skb->head_frag is true, which is expected in
>     most cases. [1]
> 
> meaning, 13acc94eff12 does not support list_skb->head_frag=0 case.
> 
> Historically, it is claimed that skb_segment is rather intolerant to
> gso_size changes, quote:
> 
>     Eric suggested to shrink gso_size instead to avoid segmentation+fragments.
>     I think its nice idea, but skb_gso_segment makes certain assumptions about
>     nr_frags and gso_size (it can't handle frag size > desired mss). [2]
> 
> Any suggestions how to debug and fix this?
> 
> Could it be that 'bpf_skb_change_proto()' isn't really allowed to
> mangle 'gso_size', and we should somehow enforce a 'skb_segment()' call
> PRIOR translation?
> 
> Appreciate any input and assistance,
> Shmulik
> 
> [1] https://patchwork.ozlabs.org/patch/889166/
> [2] https://patchwork.ozlabs.org/patch/314327/
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: BUG_ON in skb_segment, after bpf_skb_change_proto was applied
  2019-08-26 17:47 ` Eric Dumazet
@ 2019-08-27 11:42   ` Shmulik Ladkani
  2019-08-27 12:10     ` Daniel Borkmann
  2019-08-27 15:09     ` Eric Dumazet
  0 siblings, 2 replies; 13+ messages in thread
From: Shmulik Ladkani @ 2019-08-27 11:42 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Daniel Borkmann, netdev, Alexander Duyck, Alexei Starovoitov,
	Yonghong Song, Steffen Klassert, shmulik, eyal

On Mon, 26 Aug 2019 19:47:40 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On 8/26/19 4:07 PM, Shmulik Ladkani wrote:
> >   - ipv4 forwarding to dummy1, where eBPF nat4-to-6 program is attached
> >     at TC Egress (calls 'bpf_skb_change_proto()'), then redirect to ingress
> >     on same device.
> >     NOTE: 'bpf_skb_proto_4_to_6()' mangles 'shinfo->gso_size'  
> 
> Doing this on an skb with a frag_list is doomed, in current gso_segment() state.
> 
> A rewrite  would be needed (I believe I did so at some point, but Herbert Xu fought hard against it)

Thanks Eric,

- If a rewrite is still considered out of the question, how can one use
  eBPF's bpf_skb_change_proto() safely without disabling GRO completely?
  - e.g. is there a way to force the GROed skbs to fit into a layout that is
    tolerated by skb_segment?
  - alternatively can eBPF layer somehow enforce segmentation of the
    original GROed skb before mangling the gso_size?

- Another thing that puzzles me is that we hit the BUG_ON rather rarely
  and cannot yet reproduce synthetically. If skb_segment's handling of
  skbs with a frag_list (that have gso_size mangled) is broken, I'd expect
  to hit this more often... Any ideas?

- Suppose going for a rewrite, care to elaborate what's exactly missing
  in skb_segment's logic?
  I must admit I do not fully understand all the different code flows in
  this function, it seems to support many different input skbs - any
  assistance is highly appreciated.

Shmulik

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: BUG_ON in skb_segment, after bpf_skb_change_proto was applied
  2019-08-27 11:42   ` Shmulik Ladkani
@ 2019-08-27 12:10     ` Daniel Borkmann
  2019-08-28  5:56       ` Shmulik Ladkani
  2019-08-29 12:22       ` Shmulik Ladkani
  2019-08-27 15:09     ` Eric Dumazet
  1 sibling, 2 replies; 13+ messages in thread
From: Daniel Borkmann @ 2019-08-27 12:10 UTC (permalink / raw)
  To: Shmulik Ladkani, Eric Dumazet
  Cc: netdev, Alexander Duyck, Alexei Starovoitov, Yonghong Song,
	Steffen Klassert, shmulik, eyal

On 8/27/19 1:42 PM, Shmulik Ladkani wrote:
[...]
> - Another thing that puzzles me is that we hit the BUG_ON rather rarely
>    and cannot yet reproduce synthetically. If skb_segment's handling of
>    skbs with a frag_list (that have gso_size mangled) is broken, I'd expect
>    to hit this more often... Any ideas?
> 
> - Suppose going for a rewrite, care to elaborate what's exactly missing
>    in skb_segment's logic?
>    I must admit I do not fully understand all the different code flows in
>    this function, it seems to support many different input skbs - any
>    assistance is highly appreciated.

Given first point above wrt hitting rarely, it would be good to first get a
better understanding for writing a reproducer. Back then Yonghong added one
to the BPF kernel test suite [0], so it would be desirable to extend it for
the case you're hitting. Given NAT64 use-case is needed and used by multiple
parties, we should try to (fully) fix it generically.

Thanks,
Daniel

   [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=76db8087c4c991dcd17f5ea8ac0eafd0696ab450

> Shmulik
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: BUG_ON in skb_segment, after bpf_skb_change_proto was applied
  2019-08-27 11:42   ` Shmulik Ladkani
  2019-08-27 12:10     ` Daniel Borkmann
@ 2019-08-27 15:09     ` Eric Dumazet
  1 sibling, 0 replies; 13+ messages in thread
From: Eric Dumazet @ 2019-08-27 15:09 UTC (permalink / raw)
  To: Shmulik Ladkani
  Cc: Daniel Borkmann, netdev, Alexander Duyck, Alexei Starovoitov,
	Yonghong Song, Steffen Klassert, shmulik, eyal



On 8/27/19 1:42 PM, Shmulik Ladkani wrote:
> On Mon, 26 Aug 2019 19:47:40 +0200
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
>> On 8/26/19 4:07 PM, Shmulik Ladkani wrote:
>>>   - ipv4 forwarding to dummy1, where eBPF nat4-to-6 program is attached
>>>     at TC Egress (calls 'bpf_skb_change_proto()'), then redirect to ingress
>>>     on same device.
>>>     NOTE: 'bpf_skb_proto_4_to_6()' mangles 'shinfo->gso_size'  
>>
>> Doing this on an skb with a frag_list is doomed, in current gso_segment() state.
>>
>> A rewrite  would be needed (I believe I did so at some point, but Herbert Xu fought hard against it)
> 
> Thanks Eric,
> 
> - If a rewrite is still considered out of the question, how can one use
>   eBPF's bpf_skb_change_proto() safely without disabling GRO completely?
>   - e.g. is there a way to force the GROed skbs to fit into a layout that is
>     tolerated by skb_segment?
>   - alternatively can eBPF layer somehow enforce segmentation of the
>     original GROed skb before mangling the gso_size?
> 
> - Another thing that puzzles me is that we hit the BUG_ON rather rarely
>   and cannot yet reproduce synthetically. If skb_segment's handling of
>   skbs with a frag_list (that have gso_size mangled) is broken, I'd expect
>   to hit this more often... Any ideas?

skb_segment of a gro packet (especially with frag_list) is only supported 
if the geometry of the individual segments is not changed,
meaning that gso_size must remain the same.

> 
> - Suppose going for a rewrite, care to elaborate what's exactly missing
>   in skb_segment's logic?
>   I must admit I do not fully understand all the different code flows in
>   this function, it seems to support many different input skbs - any
>   assistance is highly appreciated.

Well, this is the point really.
The complexity of this function is so high that very few of us dare to touch it.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: BUG_ON in skb_segment, after bpf_skb_change_proto was applied
  2019-08-27 12:10     ` Daniel Borkmann
@ 2019-08-28  5:56       ` Shmulik Ladkani
  2019-08-29 12:22       ` Shmulik Ladkani
  1 sibling, 0 replies; 13+ messages in thread
From: Shmulik Ladkani @ 2019-08-28  5:56 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Eric Dumazet, netdev, Alexander Duyck, Alexei Starovoitov,
	Yonghong Song, Steffen Klassert, shmulik, eyal

On Tue, 27 Aug 2019 14:10:35 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> Given first point above wrt hitting rarely, it would be good to first get a
> better understanding for writing a reproducer. Back then Yonghong added one
> to the BPF kernel test suite [0], so it would be desirable to extend it for
> the case you're hitting. Given NAT64 use-case is needed and used by multiple
> parties, we should try to (fully) fix it generically.

Thanks Daniel for the advice.

I'm working on a reproducer that resembles the input skb which triggers
this BUG_ON.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: BUG_ON in skb_segment, after bpf_skb_change_proto was applied
  2019-08-27 12:10     ` Daniel Borkmann
  2019-08-28  5:56       ` Shmulik Ladkani
@ 2019-08-29 12:22       ` Shmulik Ladkani
  2019-09-01 20:05         ` Willem de Bruijn
  1 sibling, 1 reply; 13+ messages in thread
From: Shmulik Ladkani @ 2019-08-29 12:22 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Eric Dumazet, netdev, Alexander Duyck, Alexei Starovoitov,
	Yonghong Song, Steffen Klassert, shmulik, eyal

On Tue, 27 Aug 2019 14:10:35 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> Given first point above wrt hitting rarely, it would be good to first get a
> better understanding for writing a reproducer. Back then Yonghong added one
> to the BPF kernel test suite [0], so it would be desirable to extend it for
> the case you're hitting. Given NAT64 use-case is needed and used by multiple
> parties, we should try to (fully) fix it generically.
> 

Thanks Daniel.

Managed to write a reproducer which mimics the skb we see on prodction,
that hits the exact same BUG_ON.

Submitted as a separate RFC PATCH to bpf-next.
Tested on v5.0.y (and fwd ported to net-next for submission).

Daniel, please use this reproducer.

Do note that the test assigns:

+	skb_shinfo(skb[0])->gso_size = 1288;

which is the *mangled* gso_size value, to mimic the works of
bpf_skb_proto_4_to_6().

When setting 'gso_size = 1288 + 20' (the *original* gso_size of the
GROed skb prior bpf_skb_proto_4_to_6), the test passes successfully and
we don't hit the mentioned BUG_ON.

Best,
Shmulik

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: BUG_ON in skb_segment, after bpf_skb_change_proto was applied
  2019-08-29 12:22       ` Shmulik Ladkani
@ 2019-09-01 20:05         ` Willem de Bruijn
  2019-09-02 13:44           ` Shmulik Ladkani
  2019-09-03 15:51           ` Shmulik Ladkani
  0 siblings, 2 replies; 13+ messages in thread
From: Willem de Bruijn @ 2019-09-01 20:05 UTC (permalink / raw)
  To: Shmulik Ladkani
  Cc: Daniel Borkmann, Eric Dumazet, netdev, Alexander Duyck,
	Alexei Starovoitov, Yonghong Song, Steffen Klassert,
	Shmulik Ladkani, eyal

On Thu, Aug 29, 2019 at 8:22 AM Shmulik Ladkani
<shmulik.ladkani@gmail.com> wrote:
>
> On Tue, 27 Aug 2019 14:10:35 +0200
> Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> > Given first point above wrt hitting rarely, it would be good to first get a
> > better understanding for writing a reproducer. Back then Yonghong added one
> > to the BPF kernel test suite [0], so it would be desirable to extend it for
> > the case you're hitting. Given NAT64 use-case is needed and used by multiple
> > parties, we should try to (fully) fix it generically.
> >
>
> Thanks Daniel.
>
> Managed to write a reproducer which mimics the skb we see on prodction,
> that hits the exact same BUG_ON.
>
> Submitted as a separate RFC PATCH to bpf-next.

Thanks for the reproducer.

One quick fix is to disable sg and thus revert to copying in this
case. Not ideal, but better than a kernel splat:

@@ -3714,6 +3714,9 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
        sg = !!(features & NETIF_F_SG);
        csum = !!can_checksum_protocol(features, proto);

+       if (list_skb && skb_headlen(list_skb) && !list_skb->head_frag)
+               sg = false;
+

It could perhaps be refined to avoid in the special case where
skb_headlen(list_skb) == len and nskb aligned to start of list_skb.
And needs looking into effect on GSO_BY_FRAGS.

I also looked into trying to convert a kmalloc'ed skb->head into a
headfrag. But even if possible, that conversion is non-trivial and
easy to have bugs of its own.

@@ -3849,8 +3885,8 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
                                if (!skb_headlen(list_skb)) {
                                        BUG_ON(!nfrags);
                                } else {
-                                       BUG_ON(!list_skb->head_frag);
-
+                                       BUG_ON(!list_skb->head_frag &&
+
!skb_to_headfrag(list_skb, GFP_ATOMIC));

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: BUG_ON in skb_segment, after bpf_skb_change_proto was applied
  2019-09-01 20:05         ` Willem de Bruijn
@ 2019-09-02 13:44           ` Shmulik Ladkani
  2019-09-03 15:51           ` Shmulik Ladkani
  1 sibling, 0 replies; 13+ messages in thread
From: Shmulik Ladkani @ 2019-09-02 13:44 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Daniel Borkmann, Eric Dumazet, netdev, Alexander Duyck,
	Alexei Starovoitov, Yonghong Song, Steffen Klassert,
	Shmulik Ladkani, eyal

On Sun, 1 Sep 2019 16:05:48 -0400
Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:

> One quick fix is to disable sg and thus revert to copying in this
> case. Not ideal, but better than a kernel splat:
> 
> @@ -3714,6 +3714,9 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>         sg = !!(features & NETIF_F_SG);
>         csum = !!can_checksum_protocol(features, proto);
> 
> +       if (list_skb && skb_headlen(list_skb) && !list_skb->head_frag)
> +               sg = false;
> +

Many thanks Willem for looking into this.

Indeed the above suggestion avoids the splat, at least with the
reproducer.

I'll look deeper into this, to make sure the generated skbs are indeed
legit and have correct content.

Shmulik

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: BUG_ON in skb_segment, after bpf_skb_change_proto was applied
  2019-09-01 20:05         ` Willem de Bruijn
  2019-09-02 13:44           ` Shmulik Ladkani
@ 2019-09-03 15:51           ` Shmulik Ladkani
  2019-09-03 16:23             ` Willem de Bruijn
  1 sibling, 1 reply; 13+ messages in thread
From: Shmulik Ladkani @ 2019-09-03 15:51 UTC (permalink / raw)
  To: Willem de Bruijn, Daniel Borkmann
  Cc: Eric Dumazet, netdev, Alexander Duyck, Alexei Starovoitov,
	Yonghong Song, Steffen Klassert, Shmulik Ladkani, eyal

On Sun, 1 Sep 2019 16:05:48 -0400
Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:

> One quick fix is to disable sg and thus revert to copying in this
> case. Not ideal, but better than a kernel splat:
> 
> @@ -3714,6 +3714,9 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>         sg = !!(features & NETIF_F_SG);
>         csum = !!can_checksum_protocol(features, proto);
> 
> +       if (list_skb && skb_headlen(list_skb) && !list_skb->head_frag)
> +               sg = false;
> +

Thanks Willem.

I followed this approach, and further refined it based on the conditions
that lead to this BUG_ON:

 - existance of frag_list
 - mangled gso_size (using SKB_GSO_DODGY as a hint)
 - some frag in the frag_list has a linear part that is NOT head_frag,
   or length not equal to the requested gso_size

BTW, doing so allowed me to refactor a loop that tests for similar
conditions in the !(features & NETIF_F_GSO_PARTIAL) case, where an
attempt to execute partial splitting at the frag_list pointer (see
07b26c9454a2 and 43170c4e0ba7).

I've tested this using the reproducer, with various linear skbs in
the frag_list and different gso_size mangling. All resulting 'segs'
looked correct. Did not test on a live system yet.

Comments are welcome.

specifically, I would like to know whether we can
 - better refine the condition where this "sg=false fallback" needs
   to be applied
 - consolidate my new 'list_skb && (type & SKB_GSO_DODGY)' case with
   the existing '!(features & NETIF_F_GSO_PARTIAL)' case

see below:


@@ -3470,6 +3470,27 @@ static inline skb_frag_t skb_head_frag_to_page_desc(struct sk_buff *frag_skb)
 	return head_frag;
 }
 
+static inline bool skb_is_nonlinear_equal_frags(struct sk_buff *skb,
+						unsigned int total_len,
+		                                unsigned int frag_len,
+						unsigned int *remain)
+{
+	struct sk_buff *iter;
+
+	skb_walk_frags(skb, iter) {
+		if (iter->len != frag_len && iter->next)
+			return false;
+		if (skb_headlen(iter) && !iter->head_frag)
+			return false;
+
+		total_len -= iter->len;
+	}
+
+	if (remain)
+		*remain = total_len;
+	return total_len == frag_len;
+}
+
 /**
  *	skb_segment - Perform protocol segmentation on skb.
  *	@head_skb: buffer to segment
@@ -3486,6 +3507,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 	struct sk_buff *tail = NULL;
 	struct sk_buff *list_skb = skb_shinfo(head_skb)->frag_list;
 	skb_frag_t *frag = skb_shinfo(head_skb)->frags;
+	unsigned int type = skb_shinfo(head_skb)->gso_type;
 	unsigned int mss = skb_shinfo(head_skb)->gso_size;
 	unsigned int doffset = head_skb->data - skb_mac_header(head_skb);
 	struct sk_buff *frag_skb = head_skb;
@@ -3510,13 +3532,29 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 	sg = !!(features & NETIF_F_SG);
 	csum = !!can_checksum_protocol(features, proto);
 
-	if (sg && csum && (mss != GSO_BY_FRAGS))  {
+	if (sg && (mss != GSO_BY_FRAGS))  {
+		if (list_skb && (type & SKB_GSO_DODGY)) {
+			/* gso_size is untrusted.
+			 * if head_skb has a frag_list that contains a frag
+			 * with a linear (non head_frag) part, or a frag whose
+			 * length doesn't fit requested mss, fallback to skb
+			 * copying by disabling sg.
+			 */
+			if (!skb_is_nonlinear_equal_frags(head_skb, len, mss,
+						          NULL)) {
+				sg = false;
+				goto normal;
+			}
+		}
+
+		if (!csum)
+			goto normal;
+
 		if (!(features & NETIF_F_GSO_PARTIAL)) {
-			struct sk_buff *iter;
 			unsigned int frag_len;
 
 			if (!list_skb ||
-			    !net_gso_ok(features, skb_shinfo(head_skb)->gso_type))
+			    !net_gso_ok(features, type))
 				goto normal;
 
 			/* If we get here then all the required
@@ -3528,17 +3566,10 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 			 * the last are of the same length.
 			 */
 			frag_len = list_skb->len;
-			skb_walk_frags(head_skb, iter) {
-				if (frag_len != iter->len && iter->next)
-					goto normal;
-				if (skb_headlen(iter) && !iter->head_frag)
-					goto normal;
-
-				len -= iter->len;
-			}
-
-			if (len != frag_len)
+			if (!skb_is_nonlinear_equal_frags(head_skb, len,
+						          frag_len, &len)) {
 				goto normal;
+			}
 		}
 
 		/* GSO partial only requires that we trim off any excess that

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: BUG_ON in skb_segment, after bpf_skb_change_proto was applied
  2019-09-03 15:51           ` Shmulik Ladkani
@ 2019-09-03 16:23             ` Willem de Bruijn
  2019-09-03 17:03               ` Shmulik Ladkani
  0 siblings, 1 reply; 13+ messages in thread
From: Willem de Bruijn @ 2019-09-03 16:23 UTC (permalink / raw)
  To: Shmulik Ladkani
  Cc: Willem de Bruijn, Daniel Borkmann, Eric Dumazet, netdev,
	Alexander Duyck, Alexei Starovoitov, Yonghong Song,
	Steffen Klassert, eyal

On Tue, Sep 3, 2019 at 11:52 AM Shmulik Ladkani
<shmulik@metanetworks.com> wrote:
>
> On Sun, 1 Sep 2019 16:05:48 -0400
> Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
>
> > One quick fix is to disable sg and thus revert to copying in this
> > case. Not ideal, but better than a kernel splat:
> >
> > @@ -3714,6 +3714,9 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
> >         sg = !!(features & NETIF_F_SG);
> >         csum = !!can_checksum_protocol(features, proto);
> >
> > +       if (list_skb && skb_headlen(list_skb) && !list_skb->head_frag)
> > +               sg = false;
> > +
>
> Thanks Willem.
>
> I followed this approach, and further refined it based on the conditions
> that lead to this BUG_ON:
>
>  - existance of frag_list
>  - mangled gso_size (using SKB_GSO_DODGY as a hint)
>  - some frag in the frag_list has a linear part that is NOT head_frag,
>    or length not equal to the requested gso_size
>
> BTW, doing so allowed me to refactor a loop that tests for similar
> conditions in the !(features & NETIF_F_GSO_PARTIAL) case, where an
> attempt to execute partial splitting at the frag_list pointer (see
> 07b26c9454a2 and 43170c4e0ba7).
>
> I've tested this using the reproducer, with various linear skbs in
> the frag_list and different gso_size mangling. All resulting 'segs'
> looked correct. Did not test on a live system yet.
>
> Comments are welcome.
>
> specifically, I would like to know whether we can
>  - better refine the condition where this "sg=false fallback" needs
>    to be applied
>  - consolidate my new 'list_skb && (type & SKB_GSO_DODGY)' case with
>    the existing '!(features & NETIF_F_GSO_PARTIAL)' case

This is a lot more code change. Especially for stable fixes that need
to be backported, a smaller patch is preferable.

My suggestion only tested the first frag_skb length. If a list can be
created where the first frag_skb is head_frag but a later one is not,
it will fail short. I kind of doubt that.

By default skb_gro_receive builds GSO skbs that can be segmented
along the original gso_size boundaries. We have so far only observed
this issue when messing with gso_size.

We can easily refine the test to fall back on to copying only if
skb_headlen(list_skb) != mss. Alternatively, only on SKB_GSO_DODGY is fine, too.

I suggest we stick with the two-liner.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: BUG_ON in skb_segment, after bpf_skb_change_proto was applied
  2019-09-03 16:23             ` Willem de Bruijn
@ 2019-09-03 17:03               ` Shmulik Ladkani
  2019-09-03 17:24                 ` Willem de Bruijn
  0 siblings, 1 reply; 13+ messages in thread
From: Shmulik Ladkani @ 2019-09-03 17:03 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Daniel Borkmann, Eric Dumazet, netdev, Alexander Duyck,
	Alexei Starovoitov, Yonghong Song, Steffen Klassert, eyal

On Tue, 3 Sep 2019 12:23:54 -0400
Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:

> This is a lot more code change. Especially for stable fixes that need
> to be backported, a smaller patch is preferable.

Indeed. Thanks for the feedback.

> My suggestion only tested the first frag_skb length. If a list can be
> created where the first frag_skb is head_frag but a later one is not,
> it will fail short. I kind of doubt that.
> 
> By default skb_gro_receive builds GSO skbs that can be segmented
> along the original gso_size boundaries. We have so far only observed
> this issue when messing with gso_size.

The rationale was based on inputs specified in 43170c4e0ba7, where a GRO
skb has a fraglist with different amounts of payloads.

> We can easily refine the test to fall back on to copying only if
> skb_headlen(list_skb) != mss.

I'm concerned this is too generic; innocent skbs may fall victim to our
skb copy fallback. Probably those mentioned in 43170c4e0ba7.

> Alternatively, only on SKB_GSO_DODGY is fine, too.
> 
> I suggest we stick with the two-liner.

OK.
So lets refine your original codition, testing only the first
frag_skb, but also ensuring SKB_GSO_DODGY *and* 'skb_headlen(list_skb) != mss'
(we know existing code DOES work OK for unchanged gso_size, even if frags
have linear, non head_frag, data).

This hits the known, reproducable case of the mentioned BUG_ON, and is
tightly scoped to that case.

If that's agreed, I'll submit a proper patch.

Best,
Shmulik

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: BUG_ON in skb_segment, after bpf_skb_change_proto was applied
  2019-09-03 17:03               ` Shmulik Ladkani
@ 2019-09-03 17:24                 ` Willem de Bruijn
  0 siblings, 0 replies; 13+ messages in thread
From: Willem de Bruijn @ 2019-09-03 17:24 UTC (permalink / raw)
  To: Shmulik Ladkani
  Cc: Willem de Bruijn, Daniel Borkmann, Eric Dumazet, netdev,
	Alexander Duyck, Alexei Starovoitov, Yonghong Song,
	Steffen Klassert, eyal

On Tue, Sep 3, 2019 at 1:03 PM Shmulik Ladkani <shmulik@metanetworks.com> wrote:
>
> On Tue, 3 Sep 2019 12:23:54 -0400
> Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
>
> > This is a lot more code change. Especially for stable fixes that need
> > to be backported, a smaller patch is preferable.
>
> Indeed. Thanks for the feedback.
>
> > My suggestion only tested the first frag_skb length. If a list can be
> > created where the first frag_skb is head_frag but a later one is not,
> > it will fail short. I kind of doubt that.
> >
> > By default skb_gro_receive builds GSO skbs that can be segmented
> > along the original gso_size boundaries. We have so far only observed
> > this issue when messing with gso_size.
>
> The rationale was based on inputs specified in 43170c4e0ba7, where a GRO
> skb has a fraglist with different amounts of payloads.
>
> > We can easily refine the test to fall back on to copying only if
> > skb_headlen(list_skb) != mss.
>
> I'm concerned this is too generic; innocent skbs may fall victim to our
> skb copy fallback. Probably those mentioned in 43170c4e0ba7.
>
> > Alternatively, only on SKB_GSO_DODGY is fine, too.
> >
> > I suggest we stick with the two-liner.
>
> OK.
> So lets refine your original codition, testing only the first
> frag_skb, but also ensuring SKB_GSO_DODGY *and* 'skb_headlen(list_skb) != mss'
> (we know existing code DOES work OK for unchanged gso_size, even if frags
> have linear, non head_frag, data).
>
> This hits the known, reproducable case of the mentioned BUG_ON, and is
> tightly scoped to that case.
>
> If that's agreed, I'll submit a proper patch.

Yep, that sounds good to me.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2019-09-03 17:25 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-26 14:07 BUG_ON in skb_segment, after bpf_skb_change_proto was applied Shmulik Ladkani
2019-08-26 17:47 ` Eric Dumazet
2019-08-27 11:42   ` Shmulik Ladkani
2019-08-27 12:10     ` Daniel Borkmann
2019-08-28  5:56       ` Shmulik Ladkani
2019-08-29 12:22       ` Shmulik Ladkani
2019-09-01 20:05         ` Willem de Bruijn
2019-09-02 13:44           ` Shmulik Ladkani
2019-09-03 15:51           ` Shmulik Ladkani
2019-09-03 16:23             ` Willem de Bruijn
2019-09-03 17:03               ` Shmulik Ladkani
2019-09-03 17:24                 ` Willem de Bruijn
2019-08-27 15:09     ` Eric Dumazet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).