* BUG_ON in skb_segment, after bpf_skb_change_proto was applied @ 2019-08-26 14:07 Shmulik Ladkani 2019-08-26 17:47 ` Eric Dumazet 0 siblings, 1 reply; 13+ messages in thread From: Shmulik Ladkani @ 2019-08-26 14:07 UTC (permalink / raw) To: Daniel Borkmann, Eric Dumazet, netdev, Alexander Duyck Cc: Alexei Starovoitov, Yonghong Song, Steffen Klassert, shmulik, eyal Hi, In our production systems, running v4.19.y longterm kernels, we hit a BUG_ON in 'skb_segment()'. It occurs rarely and although tried, couldn't synthetically reproduce. In v4.19.41 it crashes at net/core/skbuff.c:3711 while (pos < offset + len) { if (i >= nfrags) { i = 0; nfrags = skb_shinfo(list_skb)->nr_frags; frag = skb_shinfo(list_skb)->frags; frag_skb = list_skb; if (!skb_headlen(list_skb)) { BUG_ON(!nfrags); } else { 3711: BUG_ON(!list_skb->head_frag); With the accompanying dump: kernel BUG at net/core/skbuff.c:3711! invalid opcode: 0000 [#1] SMP PTI CPU: 2 PID: 0 Comm: swapper/2 Kdump: loaded Not tainted 4.19.41-041941-generic #201905080231 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016 RIP: 0010:skb_segment+0xb65/0xda9 Code: 89 44 24 60 48 89 4c 24 70 e8 87 b3 ff ff 48 8b 4c 24 70 44 8b 44 24 60 85 c0 44 8b 54 24 4c 0f 84 fc fb ff ff e9 16 fd ff ff <0f> 0b 29 c1 89 ce 09 ca e9 61 ff ff ff 0f 0b 41 8b bf 84 00 00 00 RSP: 0018:ffff9e4d79b037c0 EFLAGS: 00010246 RAX: ffff9e4d75012ec0 RBX: ffff9e4d74067500 RCX: 0000000000000000 RDX: 0000000000480020 RSI: 0000000000000000 RDI: ffff9e4d74e3a200 RBP: ffff9e4d79b03898 R08: 0000000000000564 R09: f69d84ecbfe8b972 R10: 0000000000000571 R11: a6b66a32f69d84ec R12: 0000000000000564 R13: ffff9e4c18d03ef0 R14: 0000000000000000 R15: ffff9e4d74e3a200 FS: 0000000000000000(0000) GS:ffff9e4d79b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000007f50d8 CR3: 000000009420a003 CR4: 00000000001606e0 Call Trace: <IRQ> tcp_gso_segment+0xf9/0x4e0 tcp6_gso_segment+0x5e/0x100 ipv6_gso_segment+0x112/0x340 skb_mac_gso_segment+0xb9/0x130 __skb_gso_segment+0x84/0x190 validate_xmit_skb+0x14a/0x2f0 validate_xmit_skb_list+0x4b/0x70 sch_direct_xmit+0x154/0x390 __dev_queue_xmit+0x808/0x920 dev_queue_xmit+0x10/0x20 neigh_direct_output+0x11/0x20 ip6_finish_output2+0x1b9/0x5b0 ip6_finish_output+0x13a/0x1b0 ip6_output+0x6c/0x110 ? ip6_fragment+0xa40/0xa40 ip6_forward+0x501/0x810 ip6_rcv_finish+0x7a/0x90 ipv6_rcv+0x69/0xe0 ? nf_hook.part.24+0x10/0x10 __netif_receive_skb_core+0x4fa/0xc80 ? netif_receive_skb_core+0x20/0x20 ? netif_receive_skb_internal+0x45/0xf0 ? tcp4_gro_complete+0x86/0x90 ? napi_gro_complete+0x53/0x90 __netif_receive_skb_one_core+0x3b/0x80 __netif_receive_skb+0x18/0x60 process_backlog+0xb3/0x170 net_rx_action+0x130/0x350 __do_softirq+0xdc/0x2d4 To our best knowledge, the packet flow leading to this BUG_ON is: - ingress on eth0 (veth, gro:on), ipv4 udp encapsulated esp - re-ingresss on eth0, after xfrm, decapsulated ipv4 tcp - the skb was GROed (skb_is_gso:true) - ipv4 forwarding to dummy1, where eBPF nat4-to-6 program is attached at TC Egress (calls 'bpf_skb_change_proto()'), then redirect to ingress on same device. NOTE: 'bpf_skb_proto_4_to_6()' mangles 'shinfo->gso_size' - ingress on dummy1, ipv6 tcp - ipv6 forwarding - egress on tun2 (tun device) that calls: validate_xmit_skb -> ... -> skb_segment BUG_ON A similar issue was reported and fixed by Yonghong Song in commit 13acc94eff12 ("net: permit skb_segment on head_frag frag_list skb"). However 13acc94eff12 added "BUG_ON(!list_skb->head_frag)" to line 3711, and patchwork states: This patch addressed the issue by handling skb_headlen(list_skb) != 0 case properly if list_skb->head_frag is true, which is expected in most cases. [1] meaning, 13acc94eff12 does not support list_skb->head_frag=0 case. Historically, it is claimed that skb_segment is rather intolerant to gso_size changes, quote: Eric suggested to shrink gso_size instead to avoid segmentation+fragments. I think its nice idea, but skb_gso_segment makes certain assumptions about nr_frags and gso_size (it can't handle frag size > desired mss). [2] Any suggestions how to debug and fix this? Could it be that 'bpf_skb_change_proto()' isn't really allowed to mangle 'gso_size', and we should somehow enforce a 'skb_segment()' call PRIOR translation? Appreciate any input and assistance, Shmulik [1] https://patchwork.ozlabs.org/patch/889166/ [2] https://patchwork.ozlabs.org/patch/314327/ ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: BUG_ON in skb_segment, after bpf_skb_change_proto was applied 2019-08-26 14:07 BUG_ON in skb_segment, after bpf_skb_change_proto was applied Shmulik Ladkani @ 2019-08-26 17:47 ` Eric Dumazet 2019-08-27 11:42 ` Shmulik Ladkani 0 siblings, 1 reply; 13+ messages in thread From: Eric Dumazet @ 2019-08-26 17:47 UTC (permalink / raw) To: Shmulik Ladkani, Daniel Borkmann, Eric Dumazet, netdev, Alexander Duyck Cc: Alexei Starovoitov, Yonghong Song, Steffen Klassert, shmulik, eyal On 8/26/19 4:07 PM, Shmulik Ladkani wrote: > Hi, > > In our production systems, running v4.19.y longterm kernels, we hit a > BUG_ON in 'skb_segment()'. It occurs rarely and although tried, couldn't > synthetically reproduce. > > In v4.19.41 it crashes at net/core/skbuff.c:3711 > > while (pos < offset + len) { > if (i >= nfrags) { > i = 0; > nfrags = skb_shinfo(list_skb)->nr_frags; > frag = skb_shinfo(list_skb)->frags; > frag_skb = list_skb; > if (!skb_headlen(list_skb)) { > BUG_ON(!nfrags); > } else { > 3711: BUG_ON(!list_skb->head_frag); > > With the accompanying dump: > > kernel BUG at net/core/skbuff.c:3711! > invalid opcode: 0000 [#1] SMP PTI > CPU: 2 PID: 0 Comm: swapper/2 Kdump: loaded Not tainted 4.19.41-041941-generic #201905080231 > Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016 > RIP: 0010:skb_segment+0xb65/0xda9 > Code: 89 44 24 60 48 89 4c 24 70 e8 87 b3 ff ff 48 8b 4c 24 70 44 8b 44 24 60 85 c0 44 8b 54 24 4c 0f 84 fc fb ff ff e9 16 fd ff ff <0f> 0b 29 c1 89 ce 09 ca e9 61 ff ff ff 0f 0b 41 8b bf 84 00 00 00 > RSP: 0018:ffff9e4d79b037c0 EFLAGS: 00010246 > RAX: ffff9e4d75012ec0 RBX: ffff9e4d74067500 RCX: 0000000000000000 > RDX: 0000000000480020 RSI: 0000000000000000 RDI: ffff9e4d74e3a200 > RBP: ffff9e4d79b03898 R08: 0000000000000564 R09: f69d84ecbfe8b972 > R10: 0000000000000571 R11: a6b66a32f69d84ec R12: 0000000000000564 > R13: ffff9e4c18d03ef0 R14: 0000000000000000 R15: ffff9e4d74e3a200 > FS: 0000000000000000(0000) GS:ffff9e4d79b00000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00000000007f50d8 CR3: 000000009420a003 CR4: 00000000001606e0 > Call Trace: > <IRQ> > tcp_gso_segment+0xf9/0x4e0 > tcp6_gso_segment+0x5e/0x100 > ipv6_gso_segment+0x112/0x340 > skb_mac_gso_segment+0xb9/0x130 > __skb_gso_segment+0x84/0x190 > validate_xmit_skb+0x14a/0x2f0 > validate_xmit_skb_list+0x4b/0x70 > sch_direct_xmit+0x154/0x390 > __dev_queue_xmit+0x808/0x920 > dev_queue_xmit+0x10/0x20 > neigh_direct_output+0x11/0x20 > ip6_finish_output2+0x1b9/0x5b0 > ip6_finish_output+0x13a/0x1b0 > ip6_output+0x6c/0x110 > ? ip6_fragment+0xa40/0xa40 > ip6_forward+0x501/0x810 > ip6_rcv_finish+0x7a/0x90 > ipv6_rcv+0x69/0xe0 > ? nf_hook.part.24+0x10/0x10 > __netif_receive_skb_core+0x4fa/0xc80 > ? netif_receive_skb_core+0x20/0x20 > ? netif_receive_skb_internal+0x45/0xf0 > ? tcp4_gro_complete+0x86/0x90 > ? napi_gro_complete+0x53/0x90 > __netif_receive_skb_one_core+0x3b/0x80 > __netif_receive_skb+0x18/0x60 > process_backlog+0xb3/0x170 > net_rx_action+0x130/0x350 > __do_softirq+0xdc/0x2d4 > > To our best knowledge, the packet flow leading to this BUG_ON is: > > - ingress on eth0 (veth, gro:on), ipv4 udp encapsulated esp > - re-ingresss on eth0, after xfrm, decapsulated ipv4 tcp > - the skb was GROed (skb_is_gso:true) > - ipv4 forwarding to dummy1, where eBPF nat4-to-6 program is attached > at TC Egress (calls 'bpf_skb_change_proto()'), then redirect to ingress > on same device. > NOTE: 'bpf_skb_proto_4_to_6()' mangles 'shinfo->gso_size' Doing this on an skb with a frag_list is doomed, in current gso_segment() state. A rewrite would be needed (I believe I did so at some point, but Herbert Xu fought hard against it) > - ingress on dummy1, ipv6 tcp > - ipv6 forwarding > - egress on tun2 (tun device) that calls: > validate_xmit_skb -> ... -> skb_segment BUG_ON > > A similar issue was reported and fixed by Yonghong Song in commit > 13acc94eff12 ("net: permit skb_segment on head_frag frag_list skb"). > > However 13acc94eff12 added "BUG_ON(!list_skb->head_frag)" to line 3711, > and patchwork states: > > This patch addressed the issue by handling skb_headlen(list_skb) != 0 > case properly if list_skb->head_frag is true, which is expected in > most cases. [1] > > meaning, 13acc94eff12 does not support list_skb->head_frag=0 case. > > Historically, it is claimed that skb_segment is rather intolerant to > gso_size changes, quote: > > Eric suggested to shrink gso_size instead to avoid segmentation+fragments. > I think its nice idea, but skb_gso_segment makes certain assumptions about > nr_frags and gso_size (it can't handle frag size > desired mss). [2] > > Any suggestions how to debug and fix this? > > Could it be that 'bpf_skb_change_proto()' isn't really allowed to > mangle 'gso_size', and we should somehow enforce a 'skb_segment()' call > PRIOR translation? > > Appreciate any input and assistance, > Shmulik > > [1] https://patchwork.ozlabs.org/patch/889166/ > [2] https://patchwork.ozlabs.org/patch/314327/ > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: BUG_ON in skb_segment, after bpf_skb_change_proto was applied 2019-08-26 17:47 ` Eric Dumazet @ 2019-08-27 11:42 ` Shmulik Ladkani 2019-08-27 12:10 ` Daniel Borkmann 2019-08-27 15:09 ` Eric Dumazet 0 siblings, 2 replies; 13+ messages in thread From: Shmulik Ladkani @ 2019-08-27 11:42 UTC (permalink / raw) To: Eric Dumazet Cc: Daniel Borkmann, netdev, Alexander Duyck, Alexei Starovoitov, Yonghong Song, Steffen Klassert, shmulik, eyal On Mon, 26 Aug 2019 19:47:40 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote: > On 8/26/19 4:07 PM, Shmulik Ladkani wrote: > > - ipv4 forwarding to dummy1, where eBPF nat4-to-6 program is attached > > at TC Egress (calls 'bpf_skb_change_proto()'), then redirect to ingress > > on same device. > > NOTE: 'bpf_skb_proto_4_to_6()' mangles 'shinfo->gso_size' > > Doing this on an skb with a frag_list is doomed, in current gso_segment() state. > > A rewrite would be needed (I believe I did so at some point, but Herbert Xu fought hard against it) Thanks Eric, - If a rewrite is still considered out of the question, how can one use eBPF's bpf_skb_change_proto() safely without disabling GRO completely? - e.g. is there a way to force the GROed skbs to fit into a layout that is tolerated by skb_segment? - alternatively can eBPF layer somehow enforce segmentation of the original GROed skb before mangling the gso_size? - Another thing that puzzles me is that we hit the BUG_ON rather rarely and cannot yet reproduce synthetically. If skb_segment's handling of skbs with a frag_list (that have gso_size mangled) is broken, I'd expect to hit this more often... Any ideas? - Suppose going for a rewrite, care to elaborate what's exactly missing in skb_segment's logic? I must admit I do not fully understand all the different code flows in this function, it seems to support many different input skbs - any assistance is highly appreciated. Shmulik ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: BUG_ON in skb_segment, after bpf_skb_change_proto was applied 2019-08-27 11:42 ` Shmulik Ladkani @ 2019-08-27 12:10 ` Daniel Borkmann 2019-08-28 5:56 ` Shmulik Ladkani 2019-08-29 12:22 ` Shmulik Ladkani 2019-08-27 15:09 ` Eric Dumazet 1 sibling, 2 replies; 13+ messages in thread From: Daniel Borkmann @ 2019-08-27 12:10 UTC (permalink / raw) To: Shmulik Ladkani, Eric Dumazet Cc: netdev, Alexander Duyck, Alexei Starovoitov, Yonghong Song, Steffen Klassert, shmulik, eyal On 8/27/19 1:42 PM, Shmulik Ladkani wrote: [...] > - Another thing that puzzles me is that we hit the BUG_ON rather rarely > and cannot yet reproduce synthetically. If skb_segment's handling of > skbs with a frag_list (that have gso_size mangled) is broken, I'd expect > to hit this more often... Any ideas? > > - Suppose going for a rewrite, care to elaborate what's exactly missing > in skb_segment's logic? > I must admit I do not fully understand all the different code flows in > this function, it seems to support many different input skbs - any > assistance is highly appreciated. Given first point above wrt hitting rarely, it would be good to first get a better understanding for writing a reproducer. Back then Yonghong added one to the BPF kernel test suite [0], so it would be desirable to extend it for the case you're hitting. Given NAT64 use-case is needed and used by multiple parties, we should try to (fully) fix it generically. Thanks, Daniel [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=76db8087c4c991dcd17f5ea8ac0eafd0696ab450 > Shmulik > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: BUG_ON in skb_segment, after bpf_skb_change_proto was applied 2019-08-27 12:10 ` Daniel Borkmann @ 2019-08-28 5:56 ` Shmulik Ladkani 2019-08-29 12:22 ` Shmulik Ladkani 1 sibling, 0 replies; 13+ messages in thread From: Shmulik Ladkani @ 2019-08-28 5:56 UTC (permalink / raw) To: Daniel Borkmann Cc: Eric Dumazet, netdev, Alexander Duyck, Alexei Starovoitov, Yonghong Song, Steffen Klassert, shmulik, eyal On Tue, 27 Aug 2019 14:10:35 +0200 Daniel Borkmann <daniel@iogearbox.net> wrote: > Given first point above wrt hitting rarely, it would be good to first get a > better understanding for writing a reproducer. Back then Yonghong added one > to the BPF kernel test suite [0], so it would be desirable to extend it for > the case you're hitting. Given NAT64 use-case is needed and used by multiple > parties, we should try to (fully) fix it generically. Thanks Daniel for the advice. I'm working on a reproducer that resembles the input skb which triggers this BUG_ON. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: BUG_ON in skb_segment, after bpf_skb_change_proto was applied 2019-08-27 12:10 ` Daniel Borkmann 2019-08-28 5:56 ` Shmulik Ladkani @ 2019-08-29 12:22 ` Shmulik Ladkani 2019-09-01 20:05 ` Willem de Bruijn 1 sibling, 1 reply; 13+ messages in thread From: Shmulik Ladkani @ 2019-08-29 12:22 UTC (permalink / raw) To: Daniel Borkmann Cc: Eric Dumazet, netdev, Alexander Duyck, Alexei Starovoitov, Yonghong Song, Steffen Klassert, shmulik, eyal On Tue, 27 Aug 2019 14:10:35 +0200 Daniel Borkmann <daniel@iogearbox.net> wrote: > Given first point above wrt hitting rarely, it would be good to first get a > better understanding for writing a reproducer. Back then Yonghong added one > to the BPF kernel test suite [0], so it would be desirable to extend it for > the case you're hitting. Given NAT64 use-case is needed and used by multiple > parties, we should try to (fully) fix it generically. > Thanks Daniel. Managed to write a reproducer which mimics the skb we see on prodction, that hits the exact same BUG_ON. Submitted as a separate RFC PATCH to bpf-next. Tested on v5.0.y (and fwd ported to net-next for submission). Daniel, please use this reproducer. Do note that the test assigns: + skb_shinfo(skb[0])->gso_size = 1288; which is the *mangled* gso_size value, to mimic the works of bpf_skb_proto_4_to_6(). When setting 'gso_size = 1288 + 20' (the *original* gso_size of the GROed skb prior bpf_skb_proto_4_to_6), the test passes successfully and we don't hit the mentioned BUG_ON. Best, Shmulik ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: BUG_ON in skb_segment, after bpf_skb_change_proto was applied 2019-08-29 12:22 ` Shmulik Ladkani @ 2019-09-01 20:05 ` Willem de Bruijn 2019-09-02 13:44 ` Shmulik Ladkani 2019-09-03 15:51 ` Shmulik Ladkani 0 siblings, 2 replies; 13+ messages in thread From: Willem de Bruijn @ 2019-09-01 20:05 UTC (permalink / raw) To: Shmulik Ladkani Cc: Daniel Borkmann, Eric Dumazet, netdev, Alexander Duyck, Alexei Starovoitov, Yonghong Song, Steffen Klassert, Shmulik Ladkani, eyal On Thu, Aug 29, 2019 at 8:22 AM Shmulik Ladkani <shmulik.ladkani@gmail.com> wrote: > > On Tue, 27 Aug 2019 14:10:35 +0200 > Daniel Borkmann <daniel@iogearbox.net> wrote: > > > Given first point above wrt hitting rarely, it would be good to first get a > > better understanding for writing a reproducer. Back then Yonghong added one > > to the BPF kernel test suite [0], so it would be desirable to extend it for > > the case you're hitting. Given NAT64 use-case is needed and used by multiple > > parties, we should try to (fully) fix it generically. > > > > Thanks Daniel. > > Managed to write a reproducer which mimics the skb we see on prodction, > that hits the exact same BUG_ON. > > Submitted as a separate RFC PATCH to bpf-next. Thanks for the reproducer. One quick fix is to disable sg and thus revert to copying in this case. Not ideal, but better than a kernel splat: @@ -3714,6 +3714,9 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb, sg = !!(features & NETIF_F_SG); csum = !!can_checksum_protocol(features, proto); + if (list_skb && skb_headlen(list_skb) && !list_skb->head_frag) + sg = false; + It could perhaps be refined to avoid in the special case where skb_headlen(list_skb) == len and nskb aligned to start of list_skb. And needs looking into effect on GSO_BY_FRAGS. I also looked into trying to convert a kmalloc'ed skb->head into a headfrag. But even if possible, that conversion is non-trivial and easy to have bugs of its own. @@ -3849,8 +3885,8 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb, if (!skb_headlen(list_skb)) { BUG_ON(!nfrags); } else { - BUG_ON(!list_skb->head_frag); - + BUG_ON(!list_skb->head_frag && + !skb_to_headfrag(list_skb, GFP_ATOMIC)); ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: BUG_ON in skb_segment, after bpf_skb_change_proto was applied 2019-09-01 20:05 ` Willem de Bruijn @ 2019-09-02 13:44 ` Shmulik Ladkani 2019-09-03 15:51 ` Shmulik Ladkani 1 sibling, 0 replies; 13+ messages in thread From: Shmulik Ladkani @ 2019-09-02 13:44 UTC (permalink / raw) To: Willem de Bruijn Cc: Daniel Borkmann, Eric Dumazet, netdev, Alexander Duyck, Alexei Starovoitov, Yonghong Song, Steffen Klassert, Shmulik Ladkani, eyal On Sun, 1 Sep 2019 16:05:48 -0400 Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote: > One quick fix is to disable sg and thus revert to copying in this > case. Not ideal, but better than a kernel splat: > > @@ -3714,6 +3714,9 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb, > sg = !!(features & NETIF_F_SG); > csum = !!can_checksum_protocol(features, proto); > > + if (list_skb && skb_headlen(list_skb) && !list_skb->head_frag) > + sg = false; > + Many thanks Willem for looking into this. Indeed the above suggestion avoids the splat, at least with the reproducer. I'll look deeper into this, to make sure the generated skbs are indeed legit and have correct content. Shmulik ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: BUG_ON in skb_segment, after bpf_skb_change_proto was applied 2019-09-01 20:05 ` Willem de Bruijn 2019-09-02 13:44 ` Shmulik Ladkani @ 2019-09-03 15:51 ` Shmulik Ladkani 2019-09-03 16:23 ` Willem de Bruijn 1 sibling, 1 reply; 13+ messages in thread From: Shmulik Ladkani @ 2019-09-03 15:51 UTC (permalink / raw) To: Willem de Bruijn, Daniel Borkmann Cc: Eric Dumazet, netdev, Alexander Duyck, Alexei Starovoitov, Yonghong Song, Steffen Klassert, Shmulik Ladkani, eyal On Sun, 1 Sep 2019 16:05:48 -0400 Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote: > One quick fix is to disable sg and thus revert to copying in this > case. Not ideal, but better than a kernel splat: > > @@ -3714,6 +3714,9 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb, > sg = !!(features & NETIF_F_SG); > csum = !!can_checksum_protocol(features, proto); > > + if (list_skb && skb_headlen(list_skb) && !list_skb->head_frag) > + sg = false; > + Thanks Willem. I followed this approach, and further refined it based on the conditions that lead to this BUG_ON: - existance of frag_list - mangled gso_size (using SKB_GSO_DODGY as a hint) - some frag in the frag_list has a linear part that is NOT head_frag, or length not equal to the requested gso_size BTW, doing so allowed me to refactor a loop that tests for similar conditions in the !(features & NETIF_F_GSO_PARTIAL) case, where an attempt to execute partial splitting at the frag_list pointer (see 07b26c9454a2 and 43170c4e0ba7). I've tested this using the reproducer, with various linear skbs in the frag_list and different gso_size mangling. All resulting 'segs' looked correct. Did not test on a live system yet. Comments are welcome. specifically, I would like to know whether we can - better refine the condition where this "sg=false fallback" needs to be applied - consolidate my new 'list_skb && (type & SKB_GSO_DODGY)' case with the existing '!(features & NETIF_F_GSO_PARTIAL)' case see below: @@ -3470,6 +3470,27 @@ static inline skb_frag_t skb_head_frag_to_page_desc(struct sk_buff *frag_skb) return head_frag; } +static inline bool skb_is_nonlinear_equal_frags(struct sk_buff *skb, + unsigned int total_len, + unsigned int frag_len, + unsigned int *remain) +{ + struct sk_buff *iter; + + skb_walk_frags(skb, iter) { + if (iter->len != frag_len && iter->next) + return false; + if (skb_headlen(iter) && !iter->head_frag) + return false; + + total_len -= iter->len; + } + + if (remain) + *remain = total_len; + return total_len == frag_len; +} + /** * skb_segment - Perform protocol segmentation on skb. * @head_skb: buffer to segment @@ -3486,6 +3507,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb, struct sk_buff *tail = NULL; struct sk_buff *list_skb = skb_shinfo(head_skb)->frag_list; skb_frag_t *frag = skb_shinfo(head_skb)->frags; + unsigned int type = skb_shinfo(head_skb)->gso_type; unsigned int mss = skb_shinfo(head_skb)->gso_size; unsigned int doffset = head_skb->data - skb_mac_header(head_skb); struct sk_buff *frag_skb = head_skb; @@ -3510,13 +3532,29 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb, sg = !!(features & NETIF_F_SG); csum = !!can_checksum_protocol(features, proto); - if (sg && csum && (mss != GSO_BY_FRAGS)) { + if (sg && (mss != GSO_BY_FRAGS)) { + if (list_skb && (type & SKB_GSO_DODGY)) { + /* gso_size is untrusted. + * if head_skb has a frag_list that contains a frag + * with a linear (non head_frag) part, or a frag whose + * length doesn't fit requested mss, fallback to skb + * copying by disabling sg. + */ + if (!skb_is_nonlinear_equal_frags(head_skb, len, mss, + NULL)) { + sg = false; + goto normal; + } + } + + if (!csum) + goto normal; + if (!(features & NETIF_F_GSO_PARTIAL)) { - struct sk_buff *iter; unsigned int frag_len; if (!list_skb || - !net_gso_ok(features, skb_shinfo(head_skb)->gso_type)) + !net_gso_ok(features, type)) goto normal; /* If we get here then all the required @@ -3528,17 +3566,10 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb, * the last are of the same length. */ frag_len = list_skb->len; - skb_walk_frags(head_skb, iter) { - if (frag_len != iter->len && iter->next) - goto normal; - if (skb_headlen(iter) && !iter->head_frag) - goto normal; - - len -= iter->len; - } - - if (len != frag_len) + if (!skb_is_nonlinear_equal_frags(head_skb, len, + frag_len, &len)) { goto normal; + } } /* GSO partial only requires that we trim off any excess that ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: BUG_ON in skb_segment, after bpf_skb_change_proto was applied 2019-09-03 15:51 ` Shmulik Ladkani @ 2019-09-03 16:23 ` Willem de Bruijn 2019-09-03 17:03 ` Shmulik Ladkani 0 siblings, 1 reply; 13+ messages in thread From: Willem de Bruijn @ 2019-09-03 16:23 UTC (permalink / raw) To: Shmulik Ladkani Cc: Willem de Bruijn, Daniel Borkmann, Eric Dumazet, netdev, Alexander Duyck, Alexei Starovoitov, Yonghong Song, Steffen Klassert, eyal On Tue, Sep 3, 2019 at 11:52 AM Shmulik Ladkani <shmulik@metanetworks.com> wrote: > > On Sun, 1 Sep 2019 16:05:48 -0400 > Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote: > > > One quick fix is to disable sg and thus revert to copying in this > > case. Not ideal, but better than a kernel splat: > > > > @@ -3714,6 +3714,9 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb, > > sg = !!(features & NETIF_F_SG); > > csum = !!can_checksum_protocol(features, proto); > > > > + if (list_skb && skb_headlen(list_skb) && !list_skb->head_frag) > > + sg = false; > > + > > Thanks Willem. > > I followed this approach, and further refined it based on the conditions > that lead to this BUG_ON: > > - existance of frag_list > - mangled gso_size (using SKB_GSO_DODGY as a hint) > - some frag in the frag_list has a linear part that is NOT head_frag, > or length not equal to the requested gso_size > > BTW, doing so allowed me to refactor a loop that tests for similar > conditions in the !(features & NETIF_F_GSO_PARTIAL) case, where an > attempt to execute partial splitting at the frag_list pointer (see > 07b26c9454a2 and 43170c4e0ba7). > > I've tested this using the reproducer, with various linear skbs in > the frag_list and different gso_size mangling. All resulting 'segs' > looked correct. Did not test on a live system yet. > > Comments are welcome. > > specifically, I would like to know whether we can > - better refine the condition where this "sg=false fallback" needs > to be applied > - consolidate my new 'list_skb && (type & SKB_GSO_DODGY)' case with > the existing '!(features & NETIF_F_GSO_PARTIAL)' case This is a lot more code change. Especially for stable fixes that need to be backported, a smaller patch is preferable. My suggestion only tested the first frag_skb length. If a list can be created where the first frag_skb is head_frag but a later one is not, it will fail short. I kind of doubt that. By default skb_gro_receive builds GSO skbs that can be segmented along the original gso_size boundaries. We have so far only observed this issue when messing with gso_size. We can easily refine the test to fall back on to copying only if skb_headlen(list_skb) != mss. Alternatively, only on SKB_GSO_DODGY is fine, too. I suggest we stick with the two-liner. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: BUG_ON in skb_segment, after bpf_skb_change_proto was applied 2019-09-03 16:23 ` Willem de Bruijn @ 2019-09-03 17:03 ` Shmulik Ladkani 2019-09-03 17:24 ` Willem de Bruijn 0 siblings, 1 reply; 13+ messages in thread From: Shmulik Ladkani @ 2019-09-03 17:03 UTC (permalink / raw) To: Willem de Bruijn Cc: Daniel Borkmann, Eric Dumazet, netdev, Alexander Duyck, Alexei Starovoitov, Yonghong Song, Steffen Klassert, eyal On Tue, 3 Sep 2019 12:23:54 -0400 Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote: > This is a lot more code change. Especially for stable fixes that need > to be backported, a smaller patch is preferable. Indeed. Thanks for the feedback. > My suggestion only tested the first frag_skb length. If a list can be > created where the first frag_skb is head_frag but a later one is not, > it will fail short. I kind of doubt that. > > By default skb_gro_receive builds GSO skbs that can be segmented > along the original gso_size boundaries. We have so far only observed > this issue when messing with gso_size. The rationale was based on inputs specified in 43170c4e0ba7, where a GRO skb has a fraglist with different amounts of payloads. > We can easily refine the test to fall back on to copying only if > skb_headlen(list_skb) != mss. I'm concerned this is too generic; innocent skbs may fall victim to our skb copy fallback. Probably those mentioned in 43170c4e0ba7. > Alternatively, only on SKB_GSO_DODGY is fine, too. > > I suggest we stick with the two-liner. OK. So lets refine your original codition, testing only the first frag_skb, but also ensuring SKB_GSO_DODGY *and* 'skb_headlen(list_skb) != mss' (we know existing code DOES work OK for unchanged gso_size, even if frags have linear, non head_frag, data). This hits the known, reproducable case of the mentioned BUG_ON, and is tightly scoped to that case. If that's agreed, I'll submit a proper patch. Best, Shmulik ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: BUG_ON in skb_segment, after bpf_skb_change_proto was applied 2019-09-03 17:03 ` Shmulik Ladkani @ 2019-09-03 17:24 ` Willem de Bruijn 0 siblings, 0 replies; 13+ messages in thread From: Willem de Bruijn @ 2019-09-03 17:24 UTC (permalink / raw) To: Shmulik Ladkani Cc: Willem de Bruijn, Daniel Borkmann, Eric Dumazet, netdev, Alexander Duyck, Alexei Starovoitov, Yonghong Song, Steffen Klassert, eyal On Tue, Sep 3, 2019 at 1:03 PM Shmulik Ladkani <shmulik@metanetworks.com> wrote: > > On Tue, 3 Sep 2019 12:23:54 -0400 > Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote: > > > This is a lot more code change. Especially for stable fixes that need > > to be backported, a smaller patch is preferable. > > Indeed. Thanks for the feedback. > > > My suggestion only tested the first frag_skb length. If a list can be > > created where the first frag_skb is head_frag but a later one is not, > > it will fail short. I kind of doubt that. > > > > By default skb_gro_receive builds GSO skbs that can be segmented > > along the original gso_size boundaries. We have so far only observed > > this issue when messing with gso_size. > > The rationale was based on inputs specified in 43170c4e0ba7, where a GRO > skb has a fraglist with different amounts of payloads. > > > We can easily refine the test to fall back on to copying only if > > skb_headlen(list_skb) != mss. > > I'm concerned this is too generic; innocent skbs may fall victim to our > skb copy fallback. Probably those mentioned in 43170c4e0ba7. > > > Alternatively, only on SKB_GSO_DODGY is fine, too. > > > > I suggest we stick with the two-liner. > > OK. > So lets refine your original codition, testing only the first > frag_skb, but also ensuring SKB_GSO_DODGY *and* 'skb_headlen(list_skb) != mss' > (we know existing code DOES work OK for unchanged gso_size, even if frags > have linear, non head_frag, data). > > This hits the known, reproducable case of the mentioned BUG_ON, and is > tightly scoped to that case. > > If that's agreed, I'll submit a proper patch. Yep, that sounds good to me. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: BUG_ON in skb_segment, after bpf_skb_change_proto was applied 2019-08-27 11:42 ` Shmulik Ladkani 2019-08-27 12:10 ` Daniel Borkmann @ 2019-08-27 15:09 ` Eric Dumazet 1 sibling, 0 replies; 13+ messages in thread From: Eric Dumazet @ 2019-08-27 15:09 UTC (permalink / raw) To: Shmulik Ladkani Cc: Daniel Borkmann, netdev, Alexander Duyck, Alexei Starovoitov, Yonghong Song, Steffen Klassert, shmulik, eyal On 8/27/19 1:42 PM, Shmulik Ladkani wrote: > On Mon, 26 Aug 2019 19:47:40 +0200 > Eric Dumazet <eric.dumazet@gmail.com> wrote: > >> On 8/26/19 4:07 PM, Shmulik Ladkani wrote: >>> - ipv4 forwarding to dummy1, where eBPF nat4-to-6 program is attached >>> at TC Egress (calls 'bpf_skb_change_proto()'), then redirect to ingress >>> on same device. >>> NOTE: 'bpf_skb_proto_4_to_6()' mangles 'shinfo->gso_size' >> >> Doing this on an skb with a frag_list is doomed, in current gso_segment() state. >> >> A rewrite would be needed (I believe I did so at some point, but Herbert Xu fought hard against it) > > Thanks Eric, > > - If a rewrite is still considered out of the question, how can one use > eBPF's bpf_skb_change_proto() safely without disabling GRO completely? > - e.g. is there a way to force the GROed skbs to fit into a layout that is > tolerated by skb_segment? > - alternatively can eBPF layer somehow enforce segmentation of the > original GROed skb before mangling the gso_size? > > - Another thing that puzzles me is that we hit the BUG_ON rather rarely > and cannot yet reproduce synthetically. If skb_segment's handling of > skbs with a frag_list (that have gso_size mangled) is broken, I'd expect > to hit this more often... Any ideas? skb_segment of a gro packet (especially with frag_list) is only supported if the geometry of the individual segments is not changed, meaning that gso_size must remain the same. > > - Suppose going for a rewrite, care to elaborate what's exactly missing > in skb_segment's logic? > I must admit I do not fully understand all the different code flows in > this function, it seems to support many different input skbs - any > assistance is highly appreciated. Well, this is the point really. The complexity of this function is so high that very few of us dare to touch it. ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2019-09-03 17:25 UTC | newest] Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-08-26 14:07 BUG_ON in skb_segment, after bpf_skb_change_proto was applied Shmulik Ladkani 2019-08-26 17:47 ` Eric Dumazet 2019-08-27 11:42 ` Shmulik Ladkani 2019-08-27 12:10 ` Daniel Borkmann 2019-08-28 5:56 ` Shmulik Ladkani 2019-08-29 12:22 ` Shmulik Ladkani 2019-09-01 20:05 ` Willem de Bruijn 2019-09-02 13:44 ` Shmulik Ladkani 2019-09-03 15:51 ` Shmulik Ladkani 2019-09-03 16:23 ` Willem de Bruijn 2019-09-03 17:03 ` Shmulik Ladkani 2019-09-03 17:24 ` Willem de Bruijn 2019-08-27 15:09 ` Eric Dumazet
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).