* BUG_ON triggered in skb_segment
@ 2018-03-13  5:45 Yonghong Song
  2018-03-13  6:04 ` Eric Dumazet
  2018-03-13  6:18 ` Yunsheng Lin
  0 siblings, 2 replies; 18+ messages in thread
From: Yonghong Song @ 2018-03-13  5:45 UTC (permalink / raw)
  To: Daniel Borkmann, Eric Dumazet, Alexei Starovoitov, netdev,
	Martin Lau, Yonghong Song

Hi,

One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
net-next function skb_segment, line 3667.

3472 struct sk_buff *skb_segment(struct sk_buff *head_skb,
3473                             netdev_features_t features)
3474 {
3475         struct sk_buff *segs = NULL;
3476         struct sk_buff *tail = NULL;
...
3665                 while (pos < offset + len) {
3666                         if (i >= nfrags) {
3667                                 BUG_ON(skb_headlen(list_skb));
3668
3669                                 i = 0;
3670                                 nfrags = skb_shinfo(list_skb)->nr_frags;
3671                                 frag = skb_shinfo(list_skb)->frags;
3672                                 frag_skb = list_skb;
...

call stack:
...
#0 [ffff883ffef034f8] machine_kexec at ffffffff81044c41
  #1 [ffff883ffef03558] __crash_kexec at ffffffff8110c525
  #2 [ffff883ffef03620] crash_kexec at ffffffff8110d5cc
  #3 [ffff883ffef03640] oops_end at ffffffff8101d7e7
  #4 [ffff883ffef03668] die at ffffffff8101deb2
  #5 [ffff883ffef03698] do_trap at ffffffff8101a700
  #6 [ffff883ffef036e8] do_error_trap at ffffffff8101abfe
  #7 [ffff883ffef037a0] do_invalid_op at ffffffff8101acd0
  #8 [ffff883ffef037b0] invalid_op at ffffffff81a00bab
     [exception RIP: skb_segment+3044]
     RIP: ffffffff817e4dd4  RSP: ffff883ffef03860  RFLAGS: 00010216
     RAX: 0000000000002bf6  RBX: ffff883feb7aaa00  RCX: 0000000000000011
     RDX: ffff883fb87910c0  RSI: 0000000000000011  RDI: ffff883feb7ab500
     RBP: ffff883ffef03928   R8: 0000000000002ce2   R9: 00000000000027da
     R10: 000001ea00000000  R11: 0000000000002d82  R12: ffff883f90a1ee80
     R13: ffff883fb8791120  R14: ffff883feb7abc00  R15: 0000000000002ce2
     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
  #9 [ffff883ffef03930] tcp_gso_segment at ffffffff818713e7
#10 [ffff883ffef03990] tcp4_gso_segment at ffffffff818717d8
#11 [ffff883ffef039b0] inet_gso_segment at ffffffff81882c9b
#12 [ffff883ffef03a10] skb_mac_gso_segment at ffffffff817f39b8
#13 [ffff883ffef03a38] __skb_gso_segment at ffffffff817f3ac9
#14 [ffff883ffef03a68] validate_xmit_skb at ffffffff817f3eed
#15 [ffff883ffef03aa8] validate_xmit_skb_list at ffffffff817f40a2
#16 [ffff883ffef03ad8] sch_direct_xmit at ffffffff81824efb
#17 [ffff883ffef03b20] __qdisc_run at ffffffff818251aa
#18 [ffff883ffef03b90] __dev_queue_xmit at ffffffff817f45ed
#19 [ffff883ffef03c08] dev_queue_xmit at ffffffff817f4b90
#20 [ffff883ffef03c18] __bpf_redirect at ffffffff81812b66
#21 [ffff883ffef03c40] skb_do_redirect at ffffffff81813209
#22 [ffff883ffef03c60] __netif_receive_skb_core at ffffffff817f310d
#23 [ffff883ffef03cc8] __netif_receive_skb at ffffffff817f32e8
#24 [ffff883ffef03ce8] netif_receive_skb_internal at ffffffff817f5538
#25 [ffff883ffef03d10] napi_gro_complete at ffffffff817f56c0
#26 [ffff883ffef03d28] dev_gro_receive at ffffffff817f5ea6
#27 [ffff883ffef03d78] napi_gro_receive at ffffffff817f6168
#28 [ffff883ffef03da0] mlx5e_handle_rx_cqe_mpwrq at ffffffff817381c2
#29 [ffff883ffef03e30] mlx5e_poll_rx_cq at ffffffff817386c2
#30 [ffff883ffef03e80] mlx5e_napi_poll at ffffffff8173926e
#31 [ffff883ffef03ed0] net_rx_action at ffffffff817f5a6e
#32 [ffff883ffef03f48] __softirqentry_text_start at ffffffff81c000c3
#33 [ffff883ffef03fa8] irq_exit at ffffffff8108f515
#34 [ffff883ffef03fb8] do_IRQ at ffffffff81a01b11
--- <IRQ stack> ---
bt: cannot transition from IRQ stack to current process stack:
         IRQ stack pointer: ffff883ffef034f8
     process stack pointer: ffffffff81a01ae9
        current stack base: ffffc9000c5c4000
...
Setup:
=====

The test involves three machines:
   M_ipv6 <-> M_nat <-> M_ipv4

M_nat does ipv4<->ipv6 address translation and then forwards the packet
to the proper destination. The control plane configures M_nat so that it
understands the virtual ipv4 address for machine M_ipv6 and the
virtual ipv6 address for machine M_ipv4.

M_nat runs a bpf program, which is attached to a clsact (ingress) qdisc.
The program uses bpf_skb_change_proto to do the protocol conversion.
bpf_skb_change_proto adjusts the skb header_len and len properly
based on the protocol change.
After the conversion, the program makes the appropriate changes to the
ethhdr and ipv4/6 headers, recalculates the checksum, and sends the packet
out through bpf_redirect.
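
For reference, a minimal sketch of such a program is below. It is
illustrative only, not our production code: OUT_IFINDEX is a hypothetical
egress ifindex, and the header rewriting/checksum work is elided.

/* Minimal sketch of the clsact/ingress program described above. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/pkt_cls.h>
#include "bpf_helpers.h"
#include "bpf_endian.h"

#define OUT_IFINDEX 2   /* hypothetical; the real program resolves this */

SEC("classifier")
int nat6to4_ingress(struct __sk_buff *skb)
{
        /* Convert the skb from ipv6 to ipv4; this is where the kernel
         * adjusts skb->len, header room and, for GSO skbs, gso_size. */
        if (bpf_skb_change_proto(skb, bpf_htons(ETH_P_IP), 0) < 0)
                return TC_ACT_SHOT;

        /* ... rewrite ethhdr and the ip header, fix checksums
         * (elided, as described above) ... */

        return bpf_redirect(OUT_IFINDEX, 0);
}

char _license[] SEC("license") = "GPL";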

Experiment:
===========

MTU: 1500B for all three machines.

tso/lro/gro are enabled on the M_nat box.

ping works both ways between M_ipv6 and M_ipv4.
Transferring a small file (4KB) between M_ipv6 and M_ipv4 works
(both ways).
Transferring a large file (e.g., 4MB) from M_ipv6 to M_ipv4 fails with
the above BUG_ON almost immediately.
I did not test a large-file transfer from M_ipv4 to M_ipv6.

The error path is likely (also based on the above call stack):
   nic -> lro/gro -> bpf_program -> gso (BUG_ON)

In one of the experiments, I explicitly printed skb->len and
skb->data_len. The values were:
   skb_segment: len 2856, data_len 2686
They should be equal to avoid the BUG_ON.

In another experiment, I got:
   skb_segment: len 1428, data_len 1258

In both cases, the difference is 170 bytes. I am not sure whether
this is just a coincidence.
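
For context, skb_headlen() is simply the size of the skb's linear area,
so the constant 170-byte difference above is exactly the skb_headlen()
that the BUG_ON checks. From include/linux/skbuff.h:

static inline unsigned int skb_headlen(const struct sk_buff *skb)
{
        return skb->len - skb->data_len;
}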

Workaround:
===========

A workaround that avoids the BUG_ON is to disable lro/gro. This way, the
kernel never receives big aggregated packets, so gso is never invoked.
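
For example (assuming the interface is eth0):

# ethtool -K eth0 gro off lro off
# ethtool -k eth0 | egrep 'generic-receive-offload|large-receive-offload'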

I am not familiar with the gso code. Has anybody hit this BUG_ON before?
Any suggestions on how to debug this?

Thanks!

Yonghong


* Re: BUG_ON triggered in skb_segment
  2018-03-13  5:45 BUG_ON triggered in skb_segment Yonghong Song
@ 2018-03-13  6:04 ` Eric Dumazet
  2018-03-13  6:08   ` Yonghong Song
  2018-03-13  6:18 ` Yunsheng Lin
  1 sibling, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2018-03-13  6:04 UTC (permalink / raw)
  To: Yonghong Song, Daniel Borkmann, Eric Dumazet, Alexei Starovoitov,
	netdev, Martin Lau



On 03/12/2018 10:45 PM, Yonghong Song wrote:
> Hi,
> 
> One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
> net-next function skb_segment, line 3667.
> 
> 3472 struct sk_buff *skb_segment(struct sk_buff *head_skb,
> 3473                             netdev_features_t features)
> 3474 {
> 3475         struct sk_buff *segs = NULL;
> 3476         struct sk_buff *tail = NULL;
> ...
> 3665                 while (pos < offset + len) {
> 3666                         if (i >= nfrags) {
> 3667                                 BUG_ON(skb_headlen(list_skb));
> 3668
> 3669                                 i = 0;
> 3670                                 nfrags = skb_shinfo(list_skb)->nr_frags;
> 3671                                 frag = skb_shinfo(list_skb)->frags;
> 3672                                 frag_skb = list_skb;
> ...
> 
> call stack:
> ...
> #0 [ffff883ffef034f8] machine_kexec at ffffffff81044c41
>   #1 [ffff883ffef03558] __crash_kexec at ffffffff8110c525
>   #2 [ffff883ffef03620] crash_kexec at ffffffff8110d5cc
>   #3 [ffff883ffef03640] oops_end at ffffffff8101d7e7
>   #4 [ffff883ffef03668] die at ffffffff8101deb2
>   #5 [ffff883ffef03698] do_trap at ffffffff8101a700
>   #6 [ffff883ffef036e8] do_error_trap at ffffffff8101abfe
>   #7 [ffff883ffef037a0] do_invalid_op at ffffffff8101acd0
>   #8 [ffff883ffef037b0] invalid_op at ffffffff81a00bab
>      [exception RIP: skb_segment+3044]
>      RIP: ffffffff817e4dd4  RSP: ffff883ffef03860  RFLAGS: 00010216
>      RAX: 0000000000002bf6  RBX: ffff883feb7aaa00  RCX: 0000000000000011
>      RDX: ffff883fb87910c0  RSI: 0000000000000011  RDI: ffff883feb7ab500
>      RBP: ffff883ffef03928   R8: 0000000000002ce2   R9: 00000000000027da
>      R10: 000001ea00000000  R11: 0000000000002d82  R12: ffff883f90a1ee80
>      R13: ffff883fb8791120  R14: ffff883feb7abc00  R15: 0000000000002ce2
>      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>   #9 [ffff883ffef03930] tcp_gso_segment at ffffffff818713e7
> #10 [ffff883ffef03990] tcp4_gso_segment at ffffffff818717d8
> #11 [ffff883ffef039b0] inet_gso_segment at ffffffff81882c9b
> #12 [ffff883ffef03a10] skb_mac_gso_segment at ffffffff817f39b8
> #13 [ffff883ffef03a38] __skb_gso_segment at ffffffff817f3ac9
> #14 [ffff883ffef03a68] validate_xmit_skb at ffffffff817f3eed
> #15 [ffff883ffef03aa8] validate_xmit_skb_list at ffffffff817f40a2
> #16 [ffff883ffef03ad8] sch_direct_xmit at ffffffff81824efb
> #17 [ffff883ffef03b20] __qdisc_run at ffffffff818251aa
> #18 [ffff883ffef03b90] __dev_queue_xmit at ffffffff817f45ed
> #19 [ffff883ffef03c08] dev_queue_xmit at ffffffff817f4b90
> #20 [ffff883ffef03c18] __bpf_redirect at ffffffff81812b66
> #21 [ffff883ffef03c40] skb_do_redirect at ffffffff81813209
> #22 [ffff883ffef03c60] __netif_receive_skb_core at ffffffff817f310d
> #23 [ffff883ffef03cc8] __netif_receive_skb at ffffffff817f32e8
> #24 [ffff883ffef03ce8] netif_receive_skb_internal at ffffffff817f5538
> #25 [ffff883ffef03d10] napi_gro_complete at ffffffff817f56c0
> #26 [ffff883ffef03d28] dev_gro_receive at ffffffff817f5ea6
> #27 [ffff883ffef03d78] napi_gro_receive at ffffffff817f6168
> #28 [ffff883ffef03da0] mlx5e_handle_rx_cqe_mpwrq at ffffffff817381c2
> #29 [ffff883ffef03e30] mlx5e_poll_rx_cq at ffffffff817386c2
> #30 [ffff883ffef03e80] mlx5e_napi_poll at ffffffff8173926e
> #31 [ffff883ffef03ed0] net_rx_action at ffffffff817f5a6e
> #32 [ffff883ffef03f48] __softirqentry_text_start at ffffffff81c000c3
> #33 [ffff883ffef03fa8] irq_exit at ffffffff8108f515
> #34 [ffff883ffef03fb8] do_IRQ at ffffffff81a01b11
> --- <IRQ stack> ---
> bt: cannot transition from IRQ stack to current process stack:
>          IRQ stack pointer: ffff883ffef034f8
>      process stack pointer: ffffffff81a01ae9
>         current stack base: ffffc9000c5c4000
> ...
> Setup:
> =====
> 
> The test involves three machines:
>    M_ipv6 <-> M_nat <-> M_ipv4
> 
> M_nat does ipv4<->ipv6 address translation and then forwards the packet
> to the proper destination. The control plane configures M_nat so that it
> understands the virtual ipv4 address for machine M_ipv6 and the
> virtual ipv6 address for machine M_ipv4.
> 
> M_nat runs a bpf program, which is attached to a clsact (ingress) qdisc.
> The program uses bpf_skb_change_proto to do the protocol conversion.
> bpf_skb_change_proto adjusts the skb header_len and len properly
> based on the protocol change.
> After the conversion, the program makes the appropriate changes to the
> ethhdr and ipv4/6 headers, recalculates the checksum, and sends the packet
> out through bpf_redirect.
> 
> Experiment:
> ===========
> 
> MTU: 1500B for all three machines.
> 
> tso/lro/gro are enabled on the M_nat box.
> 
> ping works both ways between M_ipv6 and M_ipv4.
> Transferring a small file (4KB) between M_ipv6 and M_ipv4 works
> (both ways).
> Transferring a large file (e.g., 4MB) from M_ipv6 to M_ipv4 fails with
> the above BUG_ON almost immediately.
> I did not test a large-file transfer from M_ipv4 to M_ipv6.
> 
> The error path is likely (also based on the above call stack):
>    nic -> lro/gro -> bpf_program -> gso (BUG_ON)
> 
> In one of the experiments, I explicitly printed skb->len and
> skb->data_len. The values were:
>    skb_segment: len 2856, data_len 2686
> They should be equal to avoid the BUG_ON.
> 
> In another experiment, I got:
>    skb_segment: len 1428, data_len 1258
> 
> In both cases, the difference is 170 bytes. I am not sure whether
> this is just a coincidence.
> 
> Workaround:
> ===========
> 
> A workaround that avoids the BUG_ON is to disable lro/gro. This way, the
> kernel never receives big aggregated packets, so gso is never invoked.
> 
> I am not familiar with the gso code. Has anybody hit this BUG_ON before?
> Any suggestions on how to debug this?
> 

skb_segment() works if the incoming GRO packet is not modified in its geometry.

In your case it seems you had to adjust gso_size (calling 
skb_decrease_gso_size() or skb_increase_gso_size()), and this breaks 
skb_segment() badly, because the geometry changes, unless you have specific 
MTU/MSS restrictions.

You will have to make skb_segment() more generic if you really want this.


* Re: BUG_ON triggered in skb_segment
  2018-03-13  6:04 ` Eric Dumazet
@ 2018-03-13  6:08   ` Yonghong Song
  2018-03-13  6:25     ` Eric Dumazet
  0 siblings, 1 reply; 18+ messages in thread
From: Yonghong Song @ 2018-03-13  6:08 UTC (permalink / raw)
  To: Eric Dumazet, Daniel Borkmann, Alexei Starovoitov, netdev, Martin Lau



On 3/12/18 11:04 PM, Eric Dumazet wrote:
> 
> 
> On 03/12/2018 10:45 PM, Yonghong Song wrote:
>> Hi,
>>
>> One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
>> net-next function skb_segment, line 3667.
>>
>> 3472 struct sk_buff *skb_segment(struct sk_buff *head_skb,
>> 3473                             netdev_features_t features)
>> 3474 {
>> 3475         struct sk_buff *segs = NULL;
>> 3476         struct sk_buff *tail = NULL;
>> ...
>> 3665                 while (pos < offset + len) {
>> 3666                         if (i >= nfrags) {
>> 3667                                 BUG_ON(skb_headlen(list_skb));
>> 3668
>> 3669                                 i = 0;
>> 3670                                 nfrags = skb_shinfo(list_skb)->nr_frags;
>> 3671                                 frag = skb_shinfo(list_skb)->frags;
>> 3672                                 frag_skb = list_skb;
>> ...
>>
>> call stack:
>> ...
>> #0 [ffff883ffef034f8] machine_kexec at ffffffff81044c41
>>   #1 [ffff883ffef03558] __crash_kexec at ffffffff8110c525
>>   #2 [ffff883ffef03620] crash_kexec at ffffffff8110d5cc
>>   #3 [ffff883ffef03640] oops_end at ffffffff8101d7e7
>>   #4 [ffff883ffef03668] die at ffffffff8101deb2
>>   #5 [ffff883ffef03698] do_trap at ffffffff8101a700
>>   #6 [ffff883ffef036e8] do_error_trap at ffffffff8101abfe
>>   #7 [ffff883ffef037a0] do_invalid_op at ffffffff8101acd0
>>   #8 [ffff883ffef037b0] invalid_op at ffffffff81a00bab
>>      [exception RIP: skb_segment+3044]
>>      RIP: ffffffff817e4dd4  RSP: ffff883ffef03860  RFLAGS: 00010216
>>      RAX: 0000000000002bf6  RBX: ffff883feb7aaa00  RCX: 0000000000000011
>>      RDX: ffff883fb87910c0  RSI: 0000000000000011  RDI: ffff883feb7ab500
>>      RBP: ffff883ffef03928   R8: 0000000000002ce2   R9: 00000000000027da
>>      R10: 000001ea00000000  R11: 0000000000002d82  R12: ffff883f90a1ee80
>>      R13: ffff883fb8791120  R14: ffff883feb7abc00  R15: 0000000000002ce2
>>      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>>   #9 [ffff883ffef03930] tcp_gso_segment at ffffffff818713e7
>> #10 [ffff883ffef03990] tcp4_gso_segment at ffffffff818717d8
>> #11 [ffff883ffef039b0] inet_gso_segment at ffffffff81882c9b
>> #12 [ffff883ffef03a10] skb_mac_gso_segment at ffffffff817f39b8
>> #13 [ffff883ffef03a38] __skb_gso_segment at ffffffff817f3ac9
>> #14 [ffff883ffef03a68] validate_xmit_skb at ffffffff817f3eed
>> #15 [ffff883ffef03aa8] validate_xmit_skb_list at ffffffff817f40a2
>> #16 [ffff883ffef03ad8] sch_direct_xmit at ffffffff81824efb
>> #17 [ffff883ffef03b20] __qdisc_run at ffffffff818251aa
>> #18 [ffff883ffef03b90] __dev_queue_xmit at ffffffff817f45ed
>> #19 [ffff883ffef03c08] dev_queue_xmit at ffffffff817f4b90
>> #20 [ffff883ffef03c18] __bpf_redirect at ffffffff81812b66
>> #21 [ffff883ffef03c40] skb_do_redirect at ffffffff81813209
>> #22 [ffff883ffef03c60] __netif_receive_skb_core at ffffffff817f310d
>> #23 [ffff883ffef03cc8] __netif_receive_skb at ffffffff817f32e8
>> #24 [ffff883ffef03ce8] netif_receive_skb_internal at ffffffff817f5538
>> #25 [ffff883ffef03d10] napi_gro_complete at ffffffff817f56c0
>> #26 [ffff883ffef03d28] dev_gro_receive at ffffffff817f5ea6
>> #27 [ffff883ffef03d78] napi_gro_receive at ffffffff817f6168
>> #28 [ffff883ffef03da0] mlx5e_handle_rx_cqe_mpwrq at ffffffff817381c2
>> #29 [ffff883ffef03e30] mlx5e_poll_rx_cq at ffffffff817386c2
>> #30 [ffff883ffef03e80] mlx5e_napi_poll at ffffffff8173926e
>> #31 [ffff883ffef03ed0] net_rx_action at ffffffff817f5a6e
>> #32 [ffff883ffef03f48] __softirqentry_text_start at ffffffff81c000c3
>> #33 [ffff883ffef03fa8] irq_exit at ffffffff8108f515
>> #34 [ffff883ffef03fb8] do_IRQ at ffffffff81a01b11
>> --- <IRQ stack> ---
>> bt: cannot transition from IRQ stack to current process stack:
>>          IRQ stack pointer: ffff883ffef034f8
>>      process stack pointer: ffffffff81a01ae9
>>         current stack base: ffffc9000c5c4000
>> ...
>> Setup:
>> =====
>>
>> The test involves three machines:
>>    M_ipv6 <-> M_nat <-> M_ipv4
>>
>> M_nat does ipv4<->ipv6 address translation and then forwards the packet
>> to the proper destination. The control plane configures M_nat so that it
>> understands the virtual ipv4 address for machine M_ipv6 and the
>> virtual ipv6 address for machine M_ipv4.
>>
>> M_nat runs a bpf program, which is attached to a clsact (ingress) qdisc.
>> The program uses bpf_skb_change_proto to do the protocol conversion.
>> bpf_skb_change_proto adjusts the skb header_len and len properly
>> based on the protocol change.
>> After the conversion, the program makes the appropriate changes to the
>> ethhdr and ipv4/6 headers, recalculates the checksum, and sends the packet
>> out through bpf_redirect.
>>
>> Experiment:
>> ===========
>>
>> MTU: 1500B for all three machines.
>>
>> tso/lro/gro are enabled on the M_nat box.
>>
>> ping works both ways between M_ipv6 and M_ipv4.
>> Transferring a small file (4KB) between M_ipv6 and M_ipv4 works
>> (both ways).
>> Transferring a large file (e.g., 4MB) from M_ipv6 to M_ipv4 fails
>> with the above BUG_ON almost immediately.
>> I did not test a large-file transfer from M_ipv4 to M_ipv6.
>>
>> The error path is likely (also based on the above call stack):
>>    nic -> lro/gro -> bpf_program -> gso (BUG_ON)
>>
>> In one of the experiments, I explicitly printed skb->len and
>> skb->data_len. The values were:
>>    skb_segment: len 2856, data_len 2686
>> They should be equal to avoid the BUG_ON.
>>
>> In another experiment, I got:
>>    skb_segment: len 1428, data_len 1258
>>
>> In both cases, the difference is 170 bytes. I am not sure whether
>> this is just a coincidence.
>>
>> Workaround:
>> ===========
>>
>> A workaround that avoids the BUG_ON is to disable lro/gro. This way, the
>> kernel never receives big aggregated packets, so gso is never invoked.
>>
>> I am not familiar with the gso code. Has anybody hit this BUG_ON before?
>> Any suggestions on how to debug this?
>>
> 
> skb_segment() works if the incoming GRO packet is not modified in its geometry.
> 
> In your case it seems you had to adjust gso_size (calling 
> skb_decrease_gso_size() or skb_increase_gso_size()), and this breaks 
> skb_segment() badly, because the geometry changes, unless you have specific 
> MTU/MSS restrictions.
> 
> You will have to make skb_segment() more generic if you really want this.

In net/core/filter.c, the function bpf_skb_change_proto, which is called
by the bpf program, does some GSO adjustment. Could you help check
whether it satisfies my use case above? Thanks!
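
For reference, the GSO handling on that path (bpf_skb_proto_6_to_4() in
net/core/filter.c) looks roughly like the abridged paraphrase below;
elisions are marked, and this is my reading of net-next around this time,
not a verbatim quote:

static int bpf_skb_proto_6_to_4(struct sk_buff *skb)
{
        const u32 len_diff = sizeof(struct ipv6hdr) - sizeof(struct iphdr);

        /* ... header shrinking and error handling elided ... */

        if (skb_is_gso(skb)) {
                /* The header becomes 20 bytes smaller, so MSS may grow. */
                skb_increase_gso_size(skb_shinfo(skb), len_diff);

                /* SKB_GSO_TCPV6 needs to be changed into SKB_GSO_TCPV4. */
                if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV6) {
                        skb_shinfo(skb)->gso_type &= ~SKB_GSO_TCPV6;
                        skb_shinfo(skb)->gso_type |=  SKB_GSO_TCPV4;
                }

                /* Header must be checked, and gso_segs recomputed. */
                skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
                skb_shinfo(skb)->gso_segs = 0;
        }

        /* ... protocol bookkeeping elided ... */
        return 0;
}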


* Re: BUG_ON triggered in skb_segment
  2018-03-13  5:45 BUG_ON triggered in skb_segment Yonghong Song
  2018-03-13  6:04 ` Eric Dumazet
@ 2018-03-13  6:18 ` Yunsheng Lin
  1 sibling, 0 replies; 18+ messages in thread
From: Yunsheng Lin @ 2018-03-13  6:18 UTC (permalink / raw)
  To: Yonghong Song, Daniel Borkmann, Eric Dumazet, Alexei Starovoitov,
	netdev, Martin Lau

Hi, Song

On 2018/3/13 13:45, Yonghong Song wrote:
> Hi,
> 
> One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
> net-next function skb_segment, line 3667.
> 
> 3472 struct sk_buff *skb_segment(struct sk_buff *head_skb,
> 3473                             netdev_features_t features)
> 3474 {
> 3475         struct sk_buff *segs = NULL;
> 3476         struct sk_buff *tail = NULL;
> ...
> 3665                 while (pos < offset + len) {
> 3666                         if (i >= nfrags) {
> 3667                                 BUG_ON(skb_headlen(list_skb));
> 3668
> 3669                                 i = 0;
> 3670                                 nfrags = skb_shinfo(list_skb)->nr_frags;
> 3671                                 frag = skb_shinfo(list_skb)->frags;
> 3672                                 frag_skb = list_skb;
> ...
> 
> call stack:
> ...
> #0 [ffff883ffef034f8] machine_kexec at ffffffff81044c41
>  #1 [ffff883ffef03558] __crash_kexec at ffffffff8110c525
>  #2 [ffff883ffef03620] crash_kexec at ffffffff8110d5cc
>  #3 [ffff883ffef03640] oops_end at ffffffff8101d7e7
>  #4 [ffff883ffef03668] die at ffffffff8101deb2
>  #5 [ffff883ffef03698] do_trap at ffffffff8101a700
>  #6 [ffff883ffef036e8] do_error_trap at ffffffff8101abfe
>  #7 [ffff883ffef037a0] do_invalid_op at ffffffff8101acd0
>  #8 [ffff883ffef037b0] invalid_op at ffffffff81a00bab
>     [exception RIP: skb_segment+3044]
>     RIP: ffffffff817e4dd4  RSP: ffff883ffef03860  RFLAGS: 00010216
>     RAX: 0000000000002bf6  RBX: ffff883feb7aaa00  RCX: 0000000000000011
>     RDX: ffff883fb87910c0  RSI: 0000000000000011  RDI: ffff883feb7ab500
>     RBP: ffff883ffef03928   R8: 0000000000002ce2   R9: 00000000000027da
>     R10: 000001ea00000000  R11: 0000000000002d82  R12: ffff883f90a1ee80
>     R13: ffff883fb8791120  R14: ffff883feb7abc00  R15: 0000000000002ce2
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>  #9 [ffff883ffef03930] tcp_gso_segment at ffffffff818713e7
> #10 [ffff883ffef03990] tcp4_gso_segment at ffffffff818717d8
> #11 [ffff883ffef039b0] inet_gso_segment at ffffffff81882c9b
> #12 [ffff883ffef03a10] skb_mac_gso_segment at ffffffff817f39b8
> #13 [ffff883ffef03a38] __skb_gso_segment at ffffffff817f3ac9
> #14 [ffff883ffef03a68] validate_xmit_skb at ffffffff817f3eed
> #15 [ffff883ffef03aa8] validate_xmit_skb_list at ffffffff817f40a2
> #16 [ffff883ffef03ad8] sch_direct_xmit at ffffffff81824efb
> #17 [ffff883ffef03b20] __qdisc_run at ffffffff818251aa
> #18 [ffff883ffef03b90] __dev_queue_xmit at ffffffff817f45ed
> #19 [ffff883ffef03c08] dev_queue_xmit at ffffffff817f4b90
> #20 [ffff883ffef03c18] __bpf_redirect at ffffffff81812b66
> #21 [ffff883ffef03c40] skb_do_redirect at ffffffff81813209
> #22 [ffff883ffef03c60] __netif_receive_skb_core at ffffffff817f310d
> #23 [ffff883ffef03cc8] __netif_receive_skb at ffffffff817f32e8
> #24 [ffff883ffef03ce8] netif_receive_skb_internal at ffffffff817f5538
> #25 [ffff883ffef03d10] napi_gro_complete at ffffffff817f56c0
> #26 [ffff883ffef03d28] dev_gro_receive at ffffffff817f5ea6
> #27 [ffff883ffef03d78] napi_gro_receive at ffffffff817f6168
> #28 [ffff883ffef03da0] mlx5e_handle_rx_cqe_mpwrq at ffffffff817381c2
> #29 [ffff883ffef03e30] mlx5e_poll_rx_cq at ffffffff817386c2
> #30 [ffff883ffef03e80] mlx5e_napi_poll at ffffffff8173926e
> #31 [ffff883ffef03ed0] net_rx_action at ffffffff817f5a6e
> #32 [ffff883ffef03f48] __softirqentry_text_start at ffffffff81c000c3
> #33 [ffff883ffef03fa8] irq_exit at ffffffff8108f515
> #34 [ffff883ffef03fb8] do_IRQ at ffffffff81a01b11
> --- <IRQ stack> ---
> bt: cannot transition from IRQ stack to current process stack:
>         IRQ stack pointer: ffff883ffef034f8
>     process stack pointer: ffffffff81a01ae9
>        current stack base: ffffc9000c5c4000
> ...
> Setup:
> =====
> 
> The test involves three machines:
>   M_ipv6 <-> M_nat <-> M_ipv4
> 
> M_nat does ipv4<->ipv6 address translation and then forwards the packet
> to the proper destination. The control plane configures M_nat so that it
> understands the virtual ipv4 address for machine M_ipv6 and the
> virtual ipv6 address for machine M_ipv4.
> 
> M_nat runs a bpf program, which is attached to a clsact (ingress) qdisc.
> The program uses bpf_skb_change_proto to do the protocol conversion.
> bpf_skb_change_proto adjusts the skb header_len and len properly
> based on the protocol change.
> After the conversion, the program makes the appropriate changes to the
> ethhdr and ipv4/6 headers, recalculates the checksum, and sends the packet
> out through bpf_redirect.
> 
> Experiment:
> ===========
> 
> MTU: 1500B for all three machines.
> 
> tso/lro/gro are enabled on the M_nat box.
> 
> ping works both ways between M_ipv6 and M_ipv4.
> Transferring a small file (4KB) between M_ipv6 and M_ipv4 works (both ways).
> Transferring a large file (e.g., 4MB) from M_ipv6 to M_ipv4 fails with the above BUG_ON almost immediately.
> I did not test a large-file transfer from M_ipv4 to M_ipv6.
> 
> The error path is likely (also based on the above call stack):
>   nic -> lro/gro -> bpf_program -> gso (BUG_ON)
> 
> In one of the experiments, I explicitly printed skb->len and skb->data_len. The values were:
>   skb_segment: len 2856, data_len 2686
> They should be equal to avoid the BUG_ON.
> 
> In another experiment, I got:
>   skb_segment: len 1428, data_len 1258
> 
> In both cases, the difference is 170 bytes. I am not sure whether
> this is just a coincidence.
> 
> Workaround:
> ===========
> 
> A workaround that avoids the BUG_ON is to disable lro/gro. This way, the
> kernel never receives big aggregated packets, so gso is never invoked.
> 
> I am not familiar with the gso code. Has anybody hit this BUG_ON before?
> Any suggestions on how to debug this?

When the bpf program does the ipv4<->ipv6 address translation, shinfo->gso_type
may need to be changed, for example, SKB_GSO_TCPV4 -> SKB_GSO_TCPV6.
I am not sure whether other GRO-related fields also need to be changed;
you may want to debug that.

You could call skb_mac_gso_segment on the packet after the address
translation to debug this problem, as in the sketch below.
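
A hypothetical (untested) form of that debug aid:

/* Throwaway debug sketch: try software segmentation right after the
 * bpf program has run, and log what happens. */
static void debug_try_segment(struct sk_buff *skb)
{
        struct sk_buff *segs;

        /* Passing no feature flags forces full software segmentation. */
        segs = skb_mac_gso_segment(skb, 0);
        if (IS_ERR_OR_NULL(segs)) {
                pr_info("segmentation failed: err %ld len %u data_len %u\n",
                        IS_ERR(segs) ? PTR_ERR(segs) : 0L,
                        skb->len, skb->data_len);
                return;
        }
        kfree_skb_list(segs);
}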

Hope this will help.

> 
> Thanks!
> 
> Yonghong
> 
> 


* Re: BUG_ON triggered in skb_segment
  2018-03-13  6:08   ` Yonghong Song
@ 2018-03-13  6:25     ` Eric Dumazet
  2018-03-13  8:44       ` Steffen Klassert
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2018-03-13  6:25 UTC (permalink / raw)
  To: Yonghong Song, Eric Dumazet, Daniel Borkmann, Alexei Starovoitov,
	netdev, Martin Lau



On 03/12/2018 11:08 PM, Yonghong Song wrote:
> 
> 
> On 3/12/18 11:04 PM, Eric Dumazet wrote:
>>
>>
>> On 03/12/2018 10:45 PM, Yonghong Song wrote:
>>> Hi,
>>>
>>> One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
>>> net-next function skb_segment, line 3667.
>>>
>>> 3472 struct sk_buff *skb_segment(struct sk_buff *head_skb,
>>> 3473                             netdev_features_t features)
>>> 3474 {
>>> 3475         struct sk_buff *segs = NULL;
>>> 3476         struct sk_buff *tail = NULL;
>>> ...
>>> 3665                 while (pos < offset + len) {
>>> 3666                         if (i >= nfrags) {
>>> 3667                                 BUG_ON(skb_headlen(list_skb));
>>> 3668
>>> 3669                                 i = 0;
>>> 3670                                 nfrags = skb_shinfo(list_skb)->nr_frags;
>>> 3671                                 frag = skb_shinfo(list_skb)->frags;
>>> 3672                                 frag_skb = list_skb;
>>> ...
>>>
>>> call stack:
>>> ...
>>> #0 [ffff883ffef034f8] machine_kexec at ffffffff81044c41
>>>   #1 [ffff883ffef03558] __crash_kexec at ffffffff8110c525
>>>   #2 [ffff883ffef03620] crash_kexec at ffffffff8110d5cc
>>>   #3 [ffff883ffef03640] oops_end at ffffffff8101d7e7
>>>   #4 [ffff883ffef03668] die at ffffffff8101deb2
>>>   #5 [ffff883ffef03698] do_trap at ffffffff8101a700
>>>   #6 [ffff883ffef036e8] do_error_trap at ffffffff8101abfe
>>>   #7 [ffff883ffef037a0] do_invalid_op at ffffffff8101acd0
>>>   #8 [ffff883ffef037b0] invalid_op at ffffffff81a00bab
>>>      [exception RIP: skb_segment+3044]
>>>      RIP: ffffffff817e4dd4  RSP: ffff883ffef03860  RFLAGS: 00010216
>>>      RAX: 0000000000002bf6  RBX: ffff883feb7aaa00  RCX: 0000000000000011
>>>      RDX: ffff883fb87910c0  RSI: 0000000000000011  RDI: ffff883feb7ab500
>>>      RBP: ffff883ffef03928   R8: 0000000000002ce2   R9: 00000000000027da
>>>      R10: 000001ea00000000  R11: 0000000000002d82  R12: ffff883f90a1ee80
>>>      R13: ffff883fb8791120  R14: ffff883feb7abc00  R15: 0000000000002ce2
>>>      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>>>   #9 [ffff883ffef03930] tcp_gso_segment at ffffffff818713e7
>>> #10 [ffff883ffef03990] tcp4_gso_segment at ffffffff818717d8
>>> #11 [ffff883ffef039b0] inet_gso_segment at ffffffff81882c9b
>>> #12 [ffff883ffef03a10] skb_mac_gso_segment at ffffffff817f39b8
>>> #13 [ffff883ffef03a38] __skb_gso_segment at ffffffff817f3ac9
>>> #14 [ffff883ffef03a68] validate_xmit_skb at ffffffff817f3eed
>>> #15 [ffff883ffef03aa8] validate_xmit_skb_list at ffffffff817f40a2
>>> #16 [ffff883ffef03ad8] sch_direct_xmit at ffffffff81824efb
>>> #17 [ffff883ffef03b20] __qdisc_run at ffffffff818251aa
>>> #18 [ffff883ffef03b90] __dev_queue_xmit at ffffffff817f45ed
>>> #19 [ffff883ffef03c08] dev_queue_xmit at ffffffff817f4b90
>>> #20 [ffff883ffef03c18] __bpf_redirect at ffffffff81812b66
>>> #21 [ffff883ffef03c40] skb_do_redirect at ffffffff81813209
>>> #22 [ffff883ffef03c60] __netif_receive_skb_core at ffffffff817f310d
>>> #23 [ffff883ffef03cc8] __netif_receive_skb at ffffffff817f32e8
>>> #24 [ffff883ffef03ce8] netif_receive_skb_internal at ffffffff817f5538
>>> #25 [ffff883ffef03d10] napi_gro_complete at ffffffff817f56c0
>>> #26 [ffff883ffef03d28] dev_gro_receive at ffffffff817f5ea6
>>> #27 [ffff883ffef03d78] napi_gro_receive at ffffffff817f6168
>>> #28 [ffff883ffef03da0] mlx5e_handle_rx_cqe_mpwrq at ffffffff817381c2
>>> #29 [ffff883ffef03e30] mlx5e_poll_rx_cq at ffffffff817386c2
>>> #30 [ffff883ffef03e80] mlx5e_napi_poll at ffffffff8173926e
>>> #31 [ffff883ffef03ed0] net_rx_action at ffffffff817f5a6e
>>> #32 [ffff883ffef03f48] __softirqentry_text_start at ffffffff81c000c3
>>> #33 [ffff883ffef03fa8] irq_exit at ffffffff8108f515
>>> #34 [ffff883ffef03fb8] do_IRQ at ffffffff81a01b11
>>> --- <IRQ stack> ---
>>> bt: cannot transition from IRQ stack to current process stack:
>>>          IRQ stack pointer: ffff883ffef034f8
>>>      process stack pointer: ffffffff81a01ae9
>>>         current stack base: ffffc9000c5c4000
>>> ...
>>> Setup:
>>> =====
>>>
>>> The test involves three machines:
>>>    M_ipv6 <-> M_nat <-> M_ipv4
>>>
>>> M_nat does ipv4<->ipv6 address translation and then forwards the
>>> packet to the proper destination. The control plane configures M_nat
>>> so that it understands the virtual ipv4 address for machine M_ipv6
>>> and the virtual ipv6 address for machine M_ipv4.
>>>
>>> M_nat runs a bpf program, which is attached to a clsact (ingress) qdisc.
>>> The program uses bpf_skb_change_proto to do the protocol conversion.
>>> bpf_skb_change_proto adjusts the skb header_len and len properly
>>> based on the protocol change.
>>> After the conversion, the program makes the appropriate changes to the
>>> ethhdr and ipv4/6 headers, recalculates the checksum, and sends the packet
>>> out through bpf_redirect.
>>>
>>> Experiment:
>>> ===========
>>>
>>> MTU: 1500B for all three machines.
>>>
>>> tso/lro/gro are enabled on the M_nat box.
>>>
>>> ping works both ways between M_ipv6 and M_ipv4.
>>> Transferring a small file (4KB) between M_ipv6 and M_ipv4 works
>>> (both ways).
>>> Transferring a large file (e.g., 4MB) from M_ipv6 to M_ipv4 fails
>>> with the above BUG_ON almost immediately.
>>> I did not test a large-file transfer from M_ipv4 to M_ipv6.
>>>
>>> The error path is likely (also based on the above call stack):
>>>    nic -> lro/gro -> bpf_program -> gso (BUG_ON)
>>>
>>> In one of the experiments, I explicitly printed skb->len and
>>> skb->data_len. The values were:
>>>    skb_segment: len 2856, data_len 2686
>>> They should be equal to avoid the BUG_ON.
>>>
>>> In another experiment, I got:
>>>    skb_segment: len 1428, data_len 1258
>>>
>>> In both cases, the difference is 170 bytes. I am not sure whether
>>> this is just a coincidence.
>>>
>>> Workaround:
>>> ===========
>>>
>>> A workaround that avoids the BUG_ON is to disable lro/gro. This way, the
>>> kernel never receives big aggregated packets, so gso is never invoked.
>>>
>>> I am not familiar with the gso code. Has anybody hit this BUG_ON before?
>>> Any suggestions on how to debug this?
>>>
>>
>> skb_segment() works if the incoming GRO packet is not modified in its 
>> geometry.
>>
>> In your case it seems you had to adjust gso_size (calling 
>> skb_decrease_gso_size() or skb_increase_gso_size()), and this breaks 
>> skb_segment() badly, because the geometry changes, unless you have specific 
>> MTU/MSS restrictions.
>>
>> You will have to make skb_segment() more generic if you really want this.
> 
> In net/core/filter.c, the function bpf_skb_change_proto, which is called
> by the bpf program, does some GSO adjustment. Could you help check
> whether it satisfies my use case above? Thanks!

As I said, this helper ends up modifying gso_size by +/- 20 
(sizeof(ipv6 header) - sizeof(ipv4 header)).

So it won't work if skb_segment() is called after this change.

It is not clear why the GRO packet is not sent as is (as a TSO packet), 
since mlx4/mlx5 NICs certainly support TSO.
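
A quick way to confirm that TSO is advertised on the egress device
(assuming it is eth0):

# ethtool -k eth0 | grep tcp-segmentation-offload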


* Re: BUG_ON triggered in skb_segment
  2018-03-13  6:25     ` Eric Dumazet
@ 2018-03-13  8:44       ` Steffen Klassert
  2018-03-13 22:37         ` Yonghong Song
  0 siblings, 1 reply; 18+ messages in thread
From: Steffen Klassert @ 2018-03-13  8:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Yonghong Song, Daniel Borkmann, Alexei Starovoitov, netdev, Martin Lau

On Mon, Mar 12, 2018 at 11:25:09PM -0700, Eric Dumazet wrote:
> 
> 
> On 03/12/2018 11:08 PM, Yonghong Song wrote:
> > 
> > 
> > On 3/12/18 11:04 PM, Eric Dumazet wrote:
> > > 
> > > 
> > > On 03/12/2018 10:45 PM, Yonghong Song wrote:
> > > > ...
> > > > Setup:
> > > > =====
> > > > 
> > > > The test involves three machines:
> > > >    M_ipv6 <-> M_nat <-> M_ipv4
> > > > 
> > > > M_nat does ipv4<->ipv6 address translation and then
> > > > forwards the packet to the proper destination. The control plane
> > > > configures M_nat so that it understands the virtual ipv4 address
> > > > for machine M_ipv6 and the virtual ipv6 address for machine M_ipv4.
> > > > 
> > > > M_nat runs a bpf program, which is attached to a clsact (ingress) qdisc.
> > > > The program uses bpf_skb_change_proto to do the protocol conversion.
> > > > bpf_skb_change_proto adjusts the skb header_len and len properly
> > > > based on the protocol change.
> > > > After the conversion, the program makes the appropriate changes to the
> > > > ethhdr and ipv4/6 headers, recalculates the checksum, and sends the
> > > > packet out through bpf_redirect.
> > > > 
> > > > Experiment:
> > > > ===========
> > > > 
> > > > MTU: 1500B for all three machines.
> > > > 
> > > > tso/lro/gro are enabled on the M_nat box.
> > > > 
> > > > ping works both ways between M_ipv6 and M_ipv4.
> > > > Transferring a small file (4KB) between M_ipv6 and
> > > > M_ipv4 works (both ways).
> > > > Transferring a large file (e.g., 4MB) from M_ipv6 to M_ipv4
> > > > fails with the above BUG_ON almost immediately.
> > > > I did not test a large-file transfer from M_ipv4 to M_ipv6.
> > > > 
> > > > The error path is likely (also based on the above call stack):
> > > >    nic -> lro/gro -> bpf_program -> gso (BUG_ON)

Just out of curiosity, are these packets created with LRO or GRO?
Usually LRO is disabled if forwarding is enabled on a machine,
because segmented LRO packets are likely corrupt.

These packets take an alternative redirect path, so I am not sure what
happens here.

> > > > 
> > > > In one of the experiments, I explicitly printed skb->len and
> > > > skb->data_len. The values were:
> > > >    skb_segment: len 2856, data_len 2686
> > > > They should be equal to avoid the BUG_ON.
> > > > 
> > > > In another experiment, I got:
> > > >    skb_segment: len 1428, data_len 1258
> > > > 
> > > > In both cases, the difference is 170 bytes. I am not sure whether
> > > > this is just a coincidence.
> > > > 
> > > > Workaround:
> > > > ===========
> > > > 
> > > > A workaround that avoids the BUG_ON is to disable lro/gro. This way,
> > > > the kernel never receives big aggregated packets, so gso is never invoked.
> > > > 
> > > > I am not familiar with the gso code. Has anybody hit this BUG_ON before?
> > > > Any suggestions on how to debug this?
> > > > 
> > > 
> > > skb_segment() works if the incoming GRO packet is not modified in its
> > > geometry.
> > > 
> > > In your case it seems you had to adjust gso_size (calling
> > > skb_decrease_gso_size() or skb_increase_gso_size()), and this breaks
> > > skb_segment() badly, because the geometry changes, unless you have
> > > specific MTU/MSS restrictions.
> > > 
> > > You will have to make skb_segment() more generic if you really want this.
> > 
> > In net/core/filter.c, the function bpf_skb_change_proto, which is called
> > by the bpf program, does some GSO adjustment. Could you help check
> > whether it satisfies my use case above? Thanks!
> 
> As I said, this helper ends up modifying gso_size by +/- 20 (sizeof(ipv6
> header) - sizeof(ipv4 header)).
> 
> So it won't work if skb_segment() is called after this change.

Even HW TSO uses gso_size to segment the packets. Wouldn't this
result in broken packets too, if gso_size is modified on the
forwarding path?

> 
> It is not clear why the GRO packet is not sent as is (as a TSO packet),
> since mlx4/mlx5 NICs certainly support TSO.

If the packets are generated with GRO, there could be data chained
at the frag_list pointer. Most NICs can't offload such skbs, so if
skb_segment() can't split at the frag_list pointer, it will just
segment the packets based on gso_size.
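
To make that geometry concrete: the children chained at shinfo->frag_list
each have their own linear area, and skb_segment()'s BUG_ON fires when it
reaches a frag_list skb whose headlen is non-zero. A hypothetical debug
sketch that dumps this geometry:

/* Hypothetical sketch: dump the geometry of a (possibly GRO'ed) skb. */
static void dump_skb_geometry(struct sk_buff *skb)
{
        struct sk_buff *iter;

        pr_info("skb len=%u data_len=%u headlen=%u nr_frags=%u gso_size=%u\n",
                skb->len, skb->data_len, skb_headlen(skb),
                skb_shinfo(skb)->nr_frags, skb_shinfo(skb)->gso_size);

        skb_walk_frags(skb, iter)
                pr_info("  frag_list skb: len=%u headlen=%u\n",
                        iter->len, skb_headlen(iter));
}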


* Re: BUG_ON triggered in skb_segment
  2018-03-13  8:44       ` Steffen Klassert
@ 2018-03-13 22:37         ` Yonghong Song
  2018-03-13 22:47           ` Eric Dumazet
  0 siblings, 1 reply; 18+ messages in thread
From: Yonghong Song @ 2018-03-13 22:37 UTC (permalink / raw)
  To: Steffen Klassert, Eric Dumazet
  Cc: Daniel Borkmann, Alexei Starovoitov, netdev, Martin Lau, saeedm, diptanu

Adding additional cc's:
   Saeed Mahameed as this is most likely mlx5 driver related.
   Diptanu Gon Choudhury who initially reported the issue.


On 3/13/18 1:44 AM, Steffen Klassert wrote:
> On Mon, Mar 12, 2018 at 11:25:09PM -0700, Eric Dumazet wrote:
>>
>>
>> On 03/12/2018 11:08 PM, Yonghong Song wrote:
>>>
>>>
>>> On 3/12/18 11:04 PM, Eric Dumazet wrote:
>>>>
>>>>
>>>> On 03/12/2018 10:45 PM, Yonghong Song wrote:
>>>>> ...
>>>>> Setup:
>>>>> =====
>>>>>
>>>>> The test involves three machines:
>>>>>     M_ipv6 <-> M_nat <-> M_ipv4
>>>>>
>>>>> M_nat does ipv4<->ipv6 address translation and then
>>>>> forwards the packet to the proper destination. The control plane
>>>>> configures M_nat so that it understands the virtual ipv4 address
>>>>> for machine M_ipv6 and the virtual ipv6 address for machine M_ipv4.
>>>>>
>>>>> M_nat runs a bpf program, which is attached to a clsact (ingress) qdisc.
>>>>> The program uses bpf_skb_change_proto to do the protocol conversion.
>>>>> bpf_skb_change_proto adjusts the skb header_len and len properly
>>>>> based on the protocol change.
>>>>> After the conversion, the program makes the appropriate changes to the
>>>>> ethhdr and ipv4/6 headers, recalculates the checksum, and sends the
>>>>> packet out through bpf_redirect.
>>>>>
>>>>> Experiment:
>>>>> ===========
>>>>>
>>>>> MTU: 1500B for all three machines.
>>>>>
>>>>> tso/lro/gro are enabled on the M_nat box.
>>>>>
>>>>> ping works both ways between M_ipv6 and M_ipv4.
>>>>> Transferring a small file (4KB) between M_ipv6 and
>>>>> M_ipv4 works (both ways).
>>>>> Transferring a large file (e.g., 4MB) from M_ipv6 to M_ipv4
>>>>> fails with the above BUG_ON almost immediately.
>>>>> I did not test a large-file transfer from M_ipv4 to M_ipv6.
>>>>>
>>>>> The error path is likely (also based on the above call stack):
>>>>>     nic -> lro/gro -> bpf_program -> gso (BUG_ON)
> 
> Just out of curiosity, are these packets created with LRO or GRO?
> Usually LRO is disabled if forwarding is enabled on a machine,
> because segmented LRO packets are likely corrupt.

In our experiments, LRO is disabled.
On mlx5, when GRO is on, the BUG_ON happens; when GRO is off,
it does not.

> 
> These packets take an alternative redirect path, so I am not sure what
> happens here.
> 
>>>>>
>>>>> In one of the experiments, I explicitly printed skb->len and
>>>>> skb->data_len. The values were:
>>>>>     skb_segment: len 2856, data_len 2686
>>>>> They should be equal to avoid the BUG_ON.
>>>>>
>>>>> In another experiment, I got:
>>>>>     skb_segment: len 1428, data_len 1258
>>>>>
>>>>> In both cases, the difference is 170 bytes. I am not sure whether
>>>>> this is just a coincidence.
>>>>>
>>>>> Workaround:
>>>>> ===========
>>>>>
>>>>> A workaround that avoids the BUG_ON is to disable lro/gro. This way,
>>>>> the kernel never receives big aggregated packets, so gso is never invoked.
>>>>>
>>>>> I am not familiar with the gso code. Has anybody hit this BUG_ON before?
>>>>> Any suggestions on how to debug this?
>>>>>
>>>>
>>>> skb_segment() works if the incoming GRO packet is not modified in its
>>>> geometry.
>>>>
>>>> In your case it seems you had to adjust gso_size (calling
>>>> skb_decrease_gso_size() or skb_increase_gso_size()), and this breaks
>>>> skb_segment() badly, because the geometry changes, unless you have
>>>> specific MTU/MSS restrictions.
>>>>
>>>> You will have to make skb_segment() more generic if you really want this.
>>>
>>> In net/core/filter.c, the function bpf_skb_change_proto, which is called
>>> by the bpf program, does some GSO adjustment. Could you help check
>>> whether it satisfies my use case above? Thanks!
>>
>> As I said, this helper ends up modifying gso_size by +/- 20 (sizeof(ipv6
>> header) - sizeof(ipv4 header)).
>>
>> So it won't work if skb_segment() is called after this change.
> 
> Even HW TSO uses gso_size to segment the packets. Wouldn't this
> result in broken packets too, if gso_size is modified on the
> forwarding path?
> 
>>
>> It is not clear why the GRO packet is not sent as is (as a TSO packet),
>> since mlx4/mlx5 NICs certainly support TSO.

This is a very good observation.
We did the same experiment on mlx4: the same kernel, the same userspace 
apps, the same bpf program. The only difference is mlx4 vs. mlx5.
mlx4 works fine with LRO=off and GRO=on, while
mlx5 fails with the same LRO/GRO configuration.

On the mlx4 box, we can see the TSO packet counters increasing while the
large-file scp is in progress.
# ethtool -S eth0 | grep tso
      tso_packets: 45495
# ethtool -S eth0 | grep tso
      tso_packets: 45865
# ethtool -S eth0 | grep tso
      tso_packets: 46337
# ethtool -S eth0 | grep tso
      tso_packets: 46724

I also used a bcc tool to count calls to skb_segment
and found it is called 0 times. Clearly, mlx4 is able to send
the packet as TSO, so the packet does not go through the
software GSO path.

# funccount.py -i 3 'skb_segment'
Tracing 1 functions for "skb_segment"... Hit Ctrl-C to end.

FUNC                                    COUNT

FUNC                                    COUNT
...

CCing Saeed Mahameed, who may help debug this and provide some insight
into what the problem is.

> 
> If the packets are generated with GRO, there could be data chained
> at the frag_list pointer. Most NICs can't offload such skbs, so if
> skb_segment() can't split at the frag_list pointer, it will just
> segment the packets based on gso_size.
> 


* Re: BUG_ON triggered in skb_segment
  2018-03-13 22:37         ` Yonghong Song
@ 2018-03-13 22:47           ` Eric Dumazet
  2018-03-13 23:09             ` Alexei Starovoitov
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2018-03-13 22:47 UTC (permalink / raw)
  To: Yonghong Song, Steffen Klassert, Eric Dumazet
  Cc: Daniel Borkmann, Alexei Starovoitov, netdev, Martin Lau, saeedm, diptanu



On 03/13/2018 03:37 PM, Yonghong Song wrote:
> Adding additional cc's:
>    Saeed Mahameed as this is most likely mlx5 driver related.
>    Diptanu Gon Choudhury who initially reported the issue.
> 
> 
> On 3/13/18 1:44 AM, Steffen Klassert wrote:
>> On Mon, Mar 12, 2018 at 11:25:09PM -0700, Eric Dumazet wrote:
>>>
>>>
>>> On 03/12/2018 11:08 PM, Yonghong Song wrote:
>>>>
>>>>
>>>> On 3/12/18 11:04 PM, Eric Dumazet wrote:
>>>>>
>>>>>
>>>>> On 03/12/2018 10:45 PM, Yonghong Song wrote:
>>>>>> ...
>>>>>> Setup:
>>>>>> =====
>>>>>>
>>>>>> The test involves three machines:
>>>>>>     M_ipv6 <-> M_nat <-> M_ipv4
>>>>>>
>>>>>> M_nat does ipv4<->ipv6 address translation and then
>>>>>> forwards the packet to the proper destination. The control plane
>>>>>> configures M_nat so that it understands the virtual ipv4 address
>>>>>> for machine M_ipv6 and the virtual ipv6 address for machine M_ipv4.
>>>>>>
>>>>>> M_nat runs a bpf program, which is attached to a clsact (ingress)
>>>>>> qdisc.
>>>>>> The program uses bpf_skb_change_proto to do the protocol conversion.
>>>>>> bpf_skb_change_proto adjusts the skb header_len and len properly
>>>>>> based on the protocol change.
>>>>>> After the conversion, the program makes the appropriate changes to the
>>>>>> ethhdr and ipv4/6 headers, recalculates the checksum, and sends the
>>>>>> packet out through bpf_redirect.
>>>>>>
>>>>>> Experiment:
>>>>>> ===========
>>>>>>
>>>>>> MTU: 1500B for all three machines.
>>>>>>
>>>>>> tso/lro/gro are enabled on the M_nat box.
>>>>>>
>>>>>> ping works both ways between M_ipv6 and M_ipv4.
>>>>>> Transferring a small file (4KB) between M_ipv6 and
>>>>>> M_ipv4 works (both ways).
>>>>>> Transferring a large file (e.g., 4MB) from M_ipv6 to M_ipv4
>>>>>> fails with the above BUG_ON almost immediately.
>>>>>> I did not test a large-file transfer from M_ipv4 to M_ipv6.
>>>>>>
>>>>>> The error path is likely (also based on the above call stack):
>>>>>>     nic -> lro/gro -> bpf_program -> gso (BUG_ON)
>>
>> Just out of curiosity, are these packets created with LRO or GRO?
>> Usually LRO is disabled if forwarding is enabled on a machine,
>> because segmented LRO packets are likely corrupt.
> 
> In our experiments, LRO is disabled.
> On mlx5, when GRO is on, the BUG_ON happens; when GRO is off,
> it does not.
> 
>>
>> These packets take an alternative redirect path, so I am not sure what
>> happens here.
>>
>>>>>>
>>>>>> In one of the experiments, I explicitly printed skb->len and
>>>>>> skb->data_len. The values were:
>>>>>>     skb_segment: len 2856, data_len 2686
>>>>>> They should be equal to avoid the BUG_ON.
>>>>>>
>>>>>> In another experiment, I got:
>>>>>>     skb_segment: len 1428, data_len 1258
>>>>>>
>>>>>> In both cases, the difference is 170 bytes. I am not sure whether
>>>>>> this is just a coincidence.
>>>>>>
>>>>>> Workaround:
>>>>>> ===========
>>>>>>
>>>>>> A workaround that avoids the BUG_ON is to disable lro/gro. This way,
>>>>>> the kernel never receives big aggregated packets, so gso is never
>>>>>> invoked.
>>>>>>
>>>>>> I am not familiar with the gso code. Has anybody hit this BUG_ON before?
>>>>>> Any suggestions on how to debug this?
>>>>>>
>>>>>
>>>>> skb_segment() works if the incoming GRO packet is not modified in its
>>>>> geometry.
>>>>>
>>>>> In your case it seems you had to adjust gso_size (calling
>>>>> skb_decrease_gso_size() or skb_increase_gso_size()), and this breaks
>>>>> skb_segment() badly, because the geometry changes, unless you have
>>>>> specific MTU/MSS restrictions.
>>>>>
>>>>> You will have to make skb_segment() more generic if you really
>>>>> want this.
>>>>
>>>> In net/core/filter.c, the function bpf_skb_change_proto, which is called
>>>> by the bpf program, does some GSO adjustment. Could you help check
>>>> whether it satisfies my use case above? Thanks!
>>>
>>> As I said, this helper ends up modifying gso_size by +/- 20 (sizeof(ipv6
>>> header) - sizeof(ipv4 header)).
>>>
>>> So it won't work if skb_segment() is called after this change.
>>
>> Even HW TSO uses gso_size to segment the packets. Wouldn't this
>> result in broken packets too, if gso_size is modified on the
>> forwarding path?
>>
>>>
>>> It is not clear why the GRO packet is not sent as is (as a TSO packet),
>>> since mlx4/mlx5 NICs certainly support TSO.
> 
> This is a very good observation.
> We did the same experiment on mlx4: the same kernel, the same userspace 
> apps, the same bpf program. The only difference is mlx4 vs. mlx5.
> mlx4 works fine with LRO=off and GRO=on, while
> mlx5 fails with the same LRO/GRO configuration.
> 
> On the mlx4 box, we can see the TSO packet counters increasing while the
> large-file scp is in progress.
> # ethtool -S eth0 | grep tso
>       tso_packets: 45495
> # ethtool -S eth0 | grep tso
>       tso_packets: 45865
> # ethtool -S eth0 | grep tso
>       tso_packets: 46337
> # ethtool -S eth0 | grep tso
>       tso_packets: 46724
> 
> I also used a bcc tool to count calls to skb_segment
> and found it is called 0 times. Clearly, mlx4 is able to send
> the packet as TSO, so the packet does not go through the
> software GSO path.
> 
> # funccount.py -i 3 'skb_segment'
> Tracing 1 functions for "skb_segment"... Hit Ctrl-C to end.
> 
> FUNC                                    COUNT
> 
> FUNC                                    COUNT
> ...
> 
> CCing Saeed Mahameed, who may help debug this and provide some insight
> into what the problem is.

There are many reasons why a GRO packet might need to be segmented by 
software (skb_segment()).

This is the step where you see the crashes, so the mlx4/mlx5 difference 
really does not matter.

gso_size should not be dynamically changed. This needs to be fixed in 
the eBPF helpers eventually.

Traditionally, NAT might involve changing the MSS option in TCP SYN and 
SYNACK packets ...

(This avoids dropping oversized packets when the total size of a 
segment grows by 20 bytes as an IPv4 header is converted to an IPv6 one.)

In the opposite direction, packet size is reduced by 20 bytes, so MSS 
can be left as is.
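
For illustration (not part of this setup), the classic non-bpf way to
apply such an MSS clamp on a forwarding path is:

# iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
        -j TCPMSS --clamp-mss-to-pmtu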


* Re: BUG_ON triggered in skb_segment
  2018-03-13 22:47           ` Eric Dumazet
@ 2018-03-13 23:09             ` Alexei Starovoitov
  2018-03-13 23:18               ` Daniel Borkmann
  2018-03-13 23:27               ` Eric Dumazet
  0 siblings, 2 replies; 18+ messages in thread
From: Alexei Starovoitov @ 2018-03-13 23:09 UTC (permalink / raw)
  To: Eric Dumazet, Yonghong Song, Steffen Klassert
  Cc: Daniel Borkmann, netdev, Martin Lau, saeedm, diptanu

On 3/13/18 3:47 PM, Eric Dumazet wrote:
>
>
> On 03/13/2018 03:37 PM, Yonghong Song wrote:
>> Adding additional cc's:
>>    Saeed Mahameed as this is most likely mlx5 driver related.
>>    Diptanu Gon Choudhury who initially reported the issue.
>>
>>
>> On 3/13/18 1:44 AM, Steffen Klassert wrote:
>>> On Mon, Mar 12, 2018 at 11:25:09PM -0700, Eric Dumazet wrote:
>>>>
>>>>
>>>> On 03/12/2018 11:08 PM, Yonghong Song wrote:
>>>>>
>>>>>
>>>>> On 3/12/18 11:04 PM, Eric Dumazet wrote:
>>>>>>
>>>>>>
>>>>>> On 03/12/2018 10:45 PM, Yonghong Song wrote:
>>>>>>> ...
>>>>>>> Setup:
>>>>>>> =====
>>>>>>>
>>>>>>> The test involves three machines:
>>>>>>>     M_ipv6 <-> M_nat <-> M_ipv4
>>>>>>>
>>>>>>> M_nat does ipv4<->ipv6 address translation and then
>>>>>>> forwards the packet to the proper destination. The control plane
>>>>>>> configures M_nat so that it understands the virtual ipv4 address
>>>>>>> for machine M_ipv6 and the virtual ipv6 address for machine M_ipv4.
>>>>>>>
>>>>>>> M_nat runs a bpf program, which is attached to a clsact (ingress)
>>>>>>> qdisc.
>>>>>>> The program uses bpf_skb_change_proto to do the protocol conversion.
>>>>>>> bpf_skb_change_proto adjusts the skb header_len and len properly
>>>>>>> based on the protocol change.
>>>>>>> After the conversion, the program makes the appropriate changes to the
>>>>>>> ethhdr and ipv4/6 headers, recalculates the checksum, and sends the
>>>>>>> packet out through bpf_redirect.
>>>>>>>
>>>>>>> Experiment:
>>>>>>> ===========
>>>>>>>
>>>>>>> MTU: 1500B for all three machines.
>>>>>>>
>>>>>>> tso/lro/gro are enabled on the M_nat box.
>>>>>>>
>>>>>>> ping works both ways between M_ipv6 and M_ipv4.
>>>>>>> Transferring a small file (4KB) between M_ipv6 and
>>>>>>> M_ipv4 works (both ways).
>>>>>>> Transferring a large file (e.g., 4MB) from M_ipv6 to M_ipv4
>>>>>>> fails with the above BUG_ON almost immediately.
>>>>>>> I did not test a large-file transfer from M_ipv4 to M_ipv6.
>>>>>>>
>>>>>>> The error path is likely (also based on the above call stack):
>>>>>>>     nic -> lro/gro -> bpf_program -> gso (BUG_ON)
>>>
>>> Just out of curiosity, are these packets created with LRO or GRO?
>>> Usually LRO is disabled if forwarding is enabled on a machine,
>>> because segmented LRO packets are likely corrupt.
>>
>> In our experiments, LRO is disabled.
>> On mlx5, when GRO is on, the BUG_ON happens; when GRO is off,
>> it does not.
>>
>>>
>>> These packets take an alternative redirect path, so I am not sure what
>>> happens here.
>>>
>>>>>>>
>>>>>>> In one of the experiments, I explicitly printed skb->len and
>>>>>>> skb->data_len. The values were:
>>>>>>>     skb_segment: len 2856, data_len 2686
>>>>>>> They should be equal to avoid the BUG_ON.
>>>>>>>
>>>>>>> In another experiment, I got:
>>>>>>>     skb_segment: len 1428, data_len 1258
>>>>>>>
>>>>>>> In both cases, the difference is 170 bytes. I am not sure whether
>>>>>>> this is just a coincidence.
>>>>>>>
>>>>>>> Workaround:
>>>>>>> ===========
>>>>>>>
>>>>>>> A workaround that avoids the BUG_ON is to disable lro/gro. This way,
>>>>>>> the kernel never receives big aggregated packets, so gso is never
>>>>>>> invoked.
>>>>>>>
>>>>>>> I am not familiar with the gso code. Has anybody hit this BUG_ON
>>>>>>> before?
>>>>>>> Any suggestions on how to debug this?
>>>>>>>
>>>>>>
>>>>>> skb_segment() works if the incoming GRO packet is not modified in its
>>>>>> geometry.
>>>>>>
>>>>>> In your case it seems you had to adjust gso_size (calling
>>>>>> skb_decrease_gso_size() or skb_increase_gso_size()), and this breaks
>>>>>> skb_segment() badly, because the geometry changes, unless you have
>>>>>> specific MTU/MSS restrictions.
>>>>>>
>>>>>> You will have to make skb_segment() more generic if you really
>>>>>> want this.
>>>>>
>>>>> In net/core/filter.c, the function bpf_skb_change_proto, which is called
>>>>> by the bpf program, does some GSO adjustment. Could you help check
>>>>> whether it satisfies my use case above? Thanks!
>>>>
>>>> As I said, this helper ends up modifying gso_size by +/- 20
>>>> (sizeof(ipv6 header) - sizeof(ipv4 header)).
>>>>
>>>> So it won't work if skb_segment() is called after this change.
>>>
>>> Even HW TSO uses gso_size to segment the packets. Wouldn't this
>>> result in broken packets too, if gso_size is modified on the
>>> forwarding path?
>>>
>>>>
>>>> It is not clear why the GRO packet is not sent as is (as a TSO packet),
>>>> since mlx4/mlx5 NICs certainly support TSO.
>>
>> This is a very good observation.
>> We did the same experiment on mlx4: the same kernel, the same
>> userspace apps, the same bpf program. The only difference is mlx4 vs.
>> mlx5.
>> mlx4 works fine with LRO=off and GRO=on, while
>> mlx5 fails with the same LRO/GRO configuration.
>>
>> On the mlx4 box, we can see the TSO packet counters increasing while the
>> large-file scp is in progress.
>> # ethtool -S eth0 | grep tso
>>       tso_packets: 45495
>> # ethtool -S eth0 | grep tso
>>       tso_packets: 45865
>> # ethtool -S eth0 | grep tso
>>       tso_packets: 46337
>> # ethtool -S eth0 | grep tso
>>       tso_packets: 46724
>>
>> I also used a bcc tool to count calls to skb_segment
>> and found it is called 0 times. Clearly, mlx4 is able to send
>> the packet as TSO, so the packet does not go through the
>> software GSO path.
>>
>> # funccount.py -i 3 'skb_segment'
>> Tracing 1 functions for "skb_segment"... Hit Ctrl-C to end.
>>
>> FUNC                                    COUNT
>>
>> FUNC                                    COUNT
>> ...
>>
>> CCing Saeed Mahameed, who may help debug this and provide some insight
>> into what the problem is.
>
> There are many reasons why a GRO packet might need to be segmented by
> software (skb_segment()).
>
> This is the step where you see the crashes, so the mlx4/mlx5 difference
> really does not matter.
>
> gso_size should not be dynamically changed. This needs to be fixed in
> the eBPF helpers eventually.

We have bpf_skb_proto_6_to_4(), which has been used by cilium for a long
time. It's not clear why it's not crashing there, but we cannot just
reject changing proto in bpf programs now.
We have to fix whatever needs to be fixed in skb_segment
(if the bug is there) or fix whatever is necessary on the mlx5 side.
In the bpf helper we mark the skb as SKB_GSO_DODGY, just like packets
coming through virtio, so if skb_segment() needs to do something
special with the skb, the SKB_GSO_DODGY flag is already there.
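
For context, a minimal sketch of the GSO bookkeeping that the 4-to-6
direction of the helper performs, paraphrased from the net/core/filter.c
of this era (header moves and error handling omitted; not a verbatim copy):

    /* Assumed headers: <linux/skbuff.h>, <linux/ip.h>, <linux/ipv6.h>. */
    static void sketch_gso_adjust_4_to_6(struct sk_buff *skb)
    {
            const u32 len_diff = sizeof(struct ipv6hdr) - sizeof(struct iphdr);
            struct skb_shared_info *shinfo = skb_shinfo(skb);

            if (!skb_is_gso(skb))
                    return;

            /* SKB_GSO_TCPV4 needs to be changed into SKB_GSO_TCPV6. */
            if (shinfo->gso_type & SKB_GSO_TCPV4) {
                    shinfo->gso_type &= ~SKB_GSO_TCPV4;
                    shinfo->gso_type |= SKB_GSO_TCPV6;
            }

            /* The IPv6 header is 20 bytes larger, so the MSS shrinks by 20.
             * This is the geometry change discussed in this thread.
             */
            skb_decrease_gso_size(shinfo, len_diff);

            /* Header must be checked, and gso_segs recomputed. */
            shinfo->gso_type |= SKB_GSO_DODGY;
            shinfo->gso_segs = 0;
    }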

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: BUG_ON triggered in skb_segment
  2018-03-13 23:09             ` Alexei Starovoitov
@ 2018-03-13 23:18               ` Daniel Borkmann
  2018-03-13 23:27               ` Eric Dumazet
  1 sibling, 0 replies; 18+ messages in thread
From: Daniel Borkmann @ 2018-03-13 23:18 UTC (permalink / raw)
  To: Alexei Starovoitov, Eric Dumazet, Yonghong Song, Steffen Klassert
  Cc: netdev, Martin Lau, saeedm, diptanu

On 03/14/2018 12:09 AM, Alexei Starovoitov wrote:
> On 3/13/18 3:47 PM, Eric Dumazet wrote:
>>
>>
>> On 03/13/2018 03:37 PM, Yonghong Song wrote:
>>> Adding additional cc's:
>>>    Saeed Mahameed as this is most likely mlx5 driver related.
>>>    Diptanu Gon Choudhury who initially reported the issue.
>>>
>>>
>>> On 3/13/18 1:44 AM, Steffen Klassert wrote:
>>>> On Mon, Mar 12, 2018 at 11:25:09PM -0700, Eric Dumazet wrote:
>>>>>
>>>>>
>>>>> On 03/12/2018 11:08 PM, Yonghong Song wrote:
>>>>>>
>>>>>>
>>>>>> On 3/12/18 11:04 PM, Eric Dumazet wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 03/12/2018 10:45 PM, Yonghong Song wrote:
>>>>>>>> ...
>>>>>>>> Setup:
>>>>>>>> =====
>>>>>>>>
>>>>>>>> The test will involve three machines:
>>>>>>>>     M_ipv6 <-> M_nat <-> M_ipv4
>>>>>>>>
>>>>>>>> The M_nat will do ipv4<->ipv6 address translation and then
>>>>>>>> forward the packet to the proper destination. The control plane
>>>>>>>> will configure M_nat properly so that it understands the virtual
>>>>>>>> ipv4 address for machine M_ipv6, and the virtual ipv6 address for
>>>>>>>> machine M_ipv4.
>>>>>>>>
>>>>>>>> M_nat runs a bpf program, which is attached to clsact (ingress)
>>>>>>>> qdisc.
>>>>>>>> The program uses bpf_skb_change_proto to do protocol conversion.
>>>>>>>> bpf_skb_change_proto will adjust skb header_len and len properly
>>>>>>>> based on protocol change.
>>>>>>>> After the conversion, the program will make the proper changes to
>>>>>>>> the ethhdr and ip4/6 header, recalculate the checksum, and send the
>>>>>>>> packet out through bpf_redirect.
>>>>>>>>
>>>>>>>> Experiment:
>>>>>>>> ===========
>>>>>>>>
>>>>>>>> MTU: 1500B for all three machines.
>>>>>>>>
>>>>>>>> The tso/lro/gro are enabled on the M_nat box.
>>>>>>>>
>>>>>>>> ping works both ways between M_ipv6 <-> M_ipv4.
>>>>>>>> It works for transferring a small file (4KB) between M_ipv6 and
>>>>>>>> M_ipv4 (both ways).
>>>>>>>> Transferring a large file (e.g., 4MB) from M_ipv6 to M_ipv4 fails
>>>>>>>> with the above BUG_ON, really fast.
>>>>>>>> Did not really test from M_ipv4 to M_ipv6 with a large file.
>>>>>>>>
>>>>>>>> The error path is likely to be (also from the above call stack):
>>>>>>>>     nic -> lro/gro -> bpf_program -> gso (BUG_ON)
>>>>
>>>> Just out of curiosity, are these packets created with LRO or GRO?
>>>> Usually LRO is disabled if forwarding is enabled on a machine,
>>>> because segmented LRO packets are likely corrupt.
>>>
>>> In our experiments, LRO is disabled.
>>> On mlx5, when GRO is on, the BUG_ON will happen, and
>>> when GRO is off, the BUG_ON will not happen.
>>>
>>>>
>>>> These packets take an alternative redirect path, so not sure what
>>>> happens here.
>>>>
>>>>>>>>
>>>>>>>> In one of the experiments, I explicitly printed the skb->len and
>>>>>>>> skb->data_len. The values are below:
>>>>>>>>     skb_segment: len 2856, data_len 2686
>>>>>>>> They should be equal to avoid BUG.
>>>>>>>>
>>>>>>>> In another experiment, I got:
>>>>>>>>     skb_segment: len 1428, data_len 1258
>>>>>>>>
>>>>>>>> In both cases, the difference is 170 bytes. Not sure whether
>>>>>>>> this is just a coincidence or not.
>>>>>>>>
>>>>>>>> Workaround:
>>>>>>>> ===========
>>>>>>>>
>>>>>>>> A workaround to avoid the BUG_ON is to disable lro/gro. This way,
>>>>>>>> the kernel will not receive big packets and hence gso is not really
>>>>>>>> called.
>>>>>>>>
>>>>>>>> I am not familiar with the gso code. Has anybody hit this BUG_ON
>>>>>>>> before?
>>>>>>>> Any suggestions on how to debug this?
>>>>>>>>
>>>>>>>
>>>>>>> skb_segment() works if incoming GRO packet is not modified in its
>>>>>>> geometry.
>>>>>>>
>>>>>>> In your case it seems you had to adjust gso_size (calling
>>>>>>> skb_decrease_gso_size() or skb_increase_gso_size()), and this breaks
>>>>>>> skb_segment() badly, because geometry changes, unless you had
>>>>>>> specific MTU/MSS restrictions.
>>>>>>>
>>>>>>> You will have to make skb_segment() more generic if you really
>>>>>>> want this.
>>>>>>
>>>>>> In net/core/filter.c, the function bpf_skb_change_proto, which is called
>>>>>> in the bpf program, does some GSO adjustment. Could you help check
>>>>>> whether it satisfies my above use case or not? Thanks!
>>>>>
>>>>> As I said, this helper ends up modifying gso_size by +/- 20
>>>>> (sizeof(ipv6 header) - sizeof(ipv4 header)).
>>>>>
>>>>> So it won't work if skb_segment() is called after this change.
>>>>
>>>> Even HW TSO uses gso_size to segment the packets. Wouldn't this
>>>> result in broken packets too, if gso_size is modified on a
>>>> forwarding path?
>>>>
>>>>>
>>>>> Not clear why the GRO packet is not sent as is (as a TSO packet) since
>>>>> mlx4/mlx5 NICs certainly support TSO.
>>>
>>> This is a very good observation.
>>> We did the same experiment on mlx4: the same kernel, the same
>>> userspace apps, the same bpf program. The only difference is mlx4 vs.
>>> mlx5.
>>> The mlx4 works fine with LRO=off and GRO=on, while
>>> mlx5 fails with the same LRO/GRO configuration.
>>>
>>> On the mlx4 box, we are able to see TSO packets increasing as the
>>> large file scp is in progress.
>>> # ethtool -S eth0 | grep tso
>>>       tso_packets: 45495
>>> # ethtool -S eth0 | grep tso
>>>       tso_packets: 45865
>>> # ethtool -S eth0 | grep tso
>>>       tso_packets: 46337
>>> # ethtool -S eth0 | grep tso
>>>       tso_packets: 46724
>>>
>>> And we used the bcc funccount tool to track the call count for
>>> skb_segment and found it is called 0 times. Clearly, mlx4 is able to
>>> take the packet as TSO, and hence the packet does not need to go
>>> through software segmentation in the stack.
>>>
>>> # funccount.py -i 3 'skb_segment'
>>> Tracing 1 functions for "skb_segment"... Hit Ctrl-C to end.
>>>
>>> FUNC                                    COUNT
>>>
>>> FUNC                                    COUNT
>>> ...
>>>
>>> CC Saeed Mahameed, who may help debug and provide some insight
>>> into what the problem is.
>>
>> There are many reasons why a GRO packet might need to be segmented by
>> software (skb_segment()).
>>
>> This is the step where you have the crashes, so really the mlx4/mlx5
>> difference does not matter.
>>
>> gso_size should not be dynamically changed. This needs to be fixed in
>> the eBPF helpers eventually.
> 
> we have bpf_skb_proto_6_to_4() that was used by cilium for long time.
> It's not clear why it's not crashing there, but we cannot just
> reject changing proto in bpf programs now.

Right, it has been used in Cilium since the helper was added. Back then I
recall testing this back to back over ixgbe with the various options mixed
on/off, with a BPF prog implementing nat64 on each side, running over various
netperf streams; later I integrated the logic into Cilium itself. Perhaps
something changed with more recent kernels as well. I'll look closer into
this issue tomorrow.

> We have to fix whatever needs to be fixed in skb_segment
> (if bug is there) or fix whatever necessary on mlx5 side.
> In bpf helper we mark it as SKB_GSO_DODGY just like packets coming
> through virtio would do, so if skb_segment() needs to do something
> special with skb the SKB_GSO_DODGY flag is already there.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: BUG_ON triggered in skb_segment
  2018-03-13 23:09             ` Alexei Starovoitov
  2018-03-13 23:18               ` Daniel Borkmann
@ 2018-03-13 23:27               ` Eric Dumazet
  2018-03-14  0:04                 ` Alexei Starovoitov
  1 sibling, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2018-03-13 23:27 UTC (permalink / raw)
  To: Alexei Starovoitov, Eric Dumazet, Yonghong Song, Steffen Klassert
  Cc: Daniel Borkmann, netdev, Martin Lau, saeedm, diptanu



On 03/13/2018 04:09 PM, Alexei Starovoitov wrote:

> we have bpf_skb_proto_6_to_4() that was used by cilium for long time.
> It's not clear why it's not crashing there, but we cannot just
> reject changing proto in bpf programs now.
> We have to fix whatever needs to be fixed in skb_segment
> (if bug is there) or fix whatever necessary on mlx5 side.
> In bpf helper we mark it as SKB_GSO_DODGY just like packets coming
> through virtio would do, so if skb_segment() needs to do something
> special with skb the SKB_GSO_DODGY flag is already there.

'Fixing' skb_segment(): I did that a long time ago, and Herbert Xu was
not happy with the fix and provided something else.

GSO_DODGY has nothing to do with the problem, really.

Changing gso_size breaks GRO since it ends up changing the number
of segments on the wire. TCP is not going to be happy, so you'll also
have to fix TCP eventually.
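
To make that concrete, a standalone arithmetic sketch with illustrative
numbers (1448 being a typical MSS for a 1500-byte MTU):

    #include <stdio.h>

    /* How shrinking gso_size by 20 (the 4->6 direction) changes the
     * number of segments that resegmentation puts back on the wire.
     */
    int main(void)
    {
            unsigned int gro_len    = 3 * 1448;  /* three MSS-sized segments merged by GRO */
            unsigned int mss_before = 1448;
            unsigned int mss_after  = 1448 - 20; /* gso_size after the conversion */

            /* ceiling division: segment count skb_segment() would emit */
            printf("before: %u segments\n", (gro_len + mss_before - 1) / mss_before); /* 3 */
            printf("after:  %u segments\n", (gro_len + mss_after - 1) / mss_after);   /* 4 */
            return 0;
    }

The original sender put 3 segments on the wire; after the conversion the
resegmentation emits 4, the last one carrying only 60 bytes, which is the
kind of mismatch the receiver's TCP does not expect.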

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: BUG_ON triggered in skb_segment
  2018-03-13 23:27               ` Eric Dumazet
@ 2018-03-14  0:04                 ` Alexei Starovoitov
  2018-03-14  0:26                   ` Eric Dumazet
  0 siblings, 1 reply; 18+ messages in thread
From: Alexei Starovoitov @ 2018-03-14  0:04 UTC (permalink / raw)
  To: Eric Dumazet, Yonghong Song, Steffen Klassert
  Cc: Daniel Borkmann, netdev, Martin Lau, saeedm, diptanu

On 3/13/18 4:27 PM, Eric Dumazet wrote:
>
>
> On 03/13/2018 04:09 PM, Alexei Starovoitov wrote:
>
>> we have bpf_skb_proto_6_to_4() that was used by cilium for long time.
>> It's not clear why it's not crashing there, but we cannot just
>> reject changing proto in bpf programs now.
>> We have to fix whatever needs to be fixed in skb_segment
>> (if bug is there) or fix whatever necessary on mlx5 side.
>> In bpf helper we mark it as SKB_GSO_DODGY just like packets coming
>> through virtio would do, so if skb_segment() needs to do something
>> special with skb the SKB_GSO_DODGY flag is already there.
>
> 'Fixing' skb_segment(), I did that a long time ago and Herbert Xu was
> not happy with the fix and provided something else.

Any link to your old patches and the discussion?

I think since mlx4 can do tso on them and the packets came out
correct on the wire, there is nothing fundamentally wrong with
changing gso_size. It is just tricky to teach skb_segment().

> GSO_DODGY has nothing to do with the problem really.
>
> Changing gso_size is breaking GRO since it ends up changing the number
> of segments on the wire. TCP is not going to be happy, so you'll also
> have to fix TCP eventually.
>
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: BUG_ON triggered in skb_segment
  2018-03-14  0:04                 ` Alexei Starovoitov
@ 2018-03-14  0:26                   ` Eric Dumazet
  2018-03-14  0:35                     ` Eric Dumazet
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2018-03-14  0:26 UTC (permalink / raw)
  To: Alexei Starovoitov, Yonghong Song, Steffen Klassert
  Cc: Daniel Borkmann, netdev, Martin Lau, saeedm, diptanu



On 03/13/2018 05:04 PM, Alexei Starovoitov wrote:
> On 3/13/18 4:27 PM, Eric Dumazet wrote:
>>
>>
>> On 03/13/2018 04:09 PM, Alexei Starovoitov wrote:
>>
>>> we have bpf_skb_proto_6_to_4() that was used by cilium for long time.
>>> It's not clear why it's not crashing there, but we cannot just
>>> reject changing proto in bpf programs now.
>>> We have to fix whatever needs to be fixed in skb_segment
>>> (if bug is there) or fix whatever necessary on mlx5 side.
>>> In bpf helper we mark it as SKB_GSO_DODGY just like packets coming
>>> through virtio would do, so if skb_segment() needs to do something
>>> special with skb the SKB_GSO_DODGY flag is already there.
>>
>> 'Fixing' skb_segment(), I did that a long time ago and Herbert Xu was
>> not happy with the fix and provided something else.
> 
> any link to your old patches and discussion?
> 
> I think since mlx4 can do tso on them and the packets came out
> correct on the wire, there is nothing fundamentally wrong with
> changing gso_size. Just tricky to teach skb_segment.
> 

The world is not mlx4 only. Some NICs will ask for skb_segment() fallback
segmentation for various reasons (like skb->len being above a given limit
such as 16KB).

Maybe https://www.spinics.net/lists/netdev/msg255549.html
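
As a sketch of one such reason, loosely modeled on the
netif_skb_features()/validate_xmit_skb() logic of this era (the exact
conditions vary by kernel version and driver; not a verbatim copy):

    /* Assumed headers: <linux/netdevice.h>, <linux/skbuff.h>. */
    static bool wants_sw_segmentation(struct sk_buff *skb,
                                      const struct net_device *dev,
                                      netdev_features_t features)
    {
            /* the driver advertised a cap on gso segments it accepts */
            if (skb_shinfo(skb)->gso_segs > dev->gso_max_segs)
                    features &= ~NETIF_F_GSO_MASK;

            /* a driver's ndo_features_check() hook may clear more bits,
             * e.g. for lengths or header layouts it cannot handle
             */

            /* still GSO but no GSO feature left: software fallback */
            return skb_is_gso(skb) && !(features & NETIF_F_GSO_MASK);
    }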

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: BUG_ON triggered in skb_segment
  2018-03-14  0:26                   ` Eric Dumazet
@ 2018-03-14  0:35                     ` Eric Dumazet
  2018-03-14  1:15                       ` Eric Dumazet
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2018-03-14  0:35 UTC (permalink / raw)
  To: Alexei Starovoitov, Yonghong Song, Steffen Klassert
  Cc: Daniel Borkmann, netdev, Martin Lau, saeedm, diptanu



On 03/13/2018 05:26 PM, Eric Dumazet wrote:
> 
> 
> On 03/13/2018 05:04 PM, Alexei Starovoitov wrote:
>> On 3/13/18 4:27 PM, Eric Dumazet wrote:
>>>
>>>
>>> On 03/13/2018 04:09 PM, Alexei Starovoitov wrote:
>>>
>>>> we have bpf_skb_proto_6_to_4() that was used by cilium for long time.
>>>> It's not clear why it's not crashing there, but we cannot just
>>>> reject changing proto in bpf programs now.
>>>> We have to fix whatever needs to be fixed in skb_segment
>>>> (if bug is there) or fix whatever necessary on mlx5 side.
>>>> In bpf helper we mark it as SKB_GSO_DODGY just like packets coming
>>>> through virtio would do, so if skb_segment() needs to do something
>>>> special with skb the SKB_GSO_DODGY flag is already there.
>>>
>>> 'Fixing' skb_segment(), I did that a long time ago and Herbert Xu was
>>> not happy with the fix and provided something else.
>>
>> any link to your old patches and discussion?
>>
>> I think since mlx4 can do tso on them and the packets came out
>> correct on the wire, there is nothing fundamentally wrong with
>> changing gso_size. Just tricky to teach skb_segment.
>>
> 
> The world is not mlx4 only. Some NIC will ask skb_segment() fallback 
> segmentation for various reasons (like skb->len above a given limit like 
> 16KB)
> 
> Maybe https://www.spinics.net/lists/netdev/msg255549.html


Herbert's patch:

commit 9d8506cc2d7ea1f911c72c100193a3677f6668c3
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date:   Thu Nov 21 11:10:04 2013 -0800

     gso: handle new frag_list of frags GRO packets

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: BUG_ON triggered in skb_segment
  2018-03-14  0:35                     ` Eric Dumazet
@ 2018-03-14  1:15                       ` Eric Dumazet
  2018-03-16 22:37                         ` Yonghong Song
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2018-03-14  1:15 UTC (permalink / raw)
  To: Alexei Starovoitov, Yonghong Song, Steffen Klassert
  Cc: Daniel Borkmann, netdev, Martin Lau, saeedm, diptanu



On 03/13/2018 05:35 PM, Eric Dumazet wrote:
> 
> 
> On 03/13/2018 05:26 PM, Eric Dumazet wrote:
>>
>>
>> On 03/13/2018 05:04 PM, Alexei Starovoitov wrote:
>>> On 3/13/18 4:27 PM, Eric Dumazet wrote:
>>>>
>>>>
>>>> On 03/13/2018 04:09 PM, Alexei Starovoitov wrote:
>>>>
>>>>> we have bpf_skb_proto_6_to_4() that was used by cilium for long time.
>>>>> It's not clear why it's not crashing there, but we cannot just
>>>>> reject changing proto in bpf programs now.
>>>>> We have to fix whatever needs to be fixed in skb_segment
>>>>> (if bug is there) or fix whatever necessary on mlx5 side.
>>>>> In bpf helper we mark it as SKB_GSO_DODGY just like packets coming
>>>>> through virtio would do, so if skb_segment() needs to do something
>>>>> special with skb the SKB_GSO_DODGY flag is already there.
>>>>
>>>> 'Fixing' skb_segment(), I did that a long time ago and Herbert Xu was
>>>> not happy with the fix and provided something else.
>>>
>>> any link to your old patches and discussion?
>>>
>>> I think since mlx4 can do tso on them and the packets came out
>>> correct on the wire, there is nothing fundamentally wrong with
>>> changing gso_size. Just tricky to teach skb_segment.
>>>
>>
>> The world is not mlx4 only. Some NIC will ask skb_segment() fallback 
>> segmentation for various reasons (like skb->len above a given limit 
>> like 16KB)
>>
>> Maybe https://www.spinics.net/lists/netdev/msg255549.html
> 
> 
> Herbert patch :
> 
> commit 9d8506cc2d7ea1f911c72c100193a3677f6668c3
> Author: Herbert Xu <herbert@gondor.apana.org.au>
> Date:   Thu Nov 21 11:10:04 2013 -0800
> 
>      gso: handle new frag_list of frags GRO packets
> 

I found my initial patch.

https://www.spinics.net/lists/netdev/msg255452.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: BUG_ON triggered in skb_segment
  2018-03-14  1:15                       ` Eric Dumazet
@ 2018-03-16 22:37                         ` Yonghong Song
  2018-03-16 23:03                           ` Eric Dumazet
  0 siblings, 1 reply; 18+ messages in thread
From: Yonghong Song @ 2018-03-16 22:37 UTC (permalink / raw)
  To: Eric Dumazet, Alexei Starovoitov, Steffen Klassert, Daniel Borkmann
  Cc: netdev, Martin Lau, saeedm, diptanu


Eric and Daniel,

I have tried to fix this issue but have not really been successful.
I tried two hacks:
   . if skb_headlen(list_skb) is not 0, we just pull
     skb_headlen(list_skb) from the skb to make skb_headlen(list_skb) = 0, or
   . if skb_headlen(list_skb) is not 0, we go to the beginning of
     the outer loop, which will allocate another nskb for this list_skb.

Both approaches removed the BUG and the packet is able to reach the
remote host. Upon receiving the packet, however, the remote host sends a
reset packet back, so the connection eventually gets closed. I did not
debug further on this.

Considering it is tricky to change skb_segment, I hacked the test_bpf
kernel module to reproduce the issue. The change reflects the gso packet
structure I got from mlx5. Maybe you could take a look and suggest a fix
or a direction on how to move forward.

Thanks!

============= PATCH  ===============

-bash-4.2$ git show
commit 41681ab51f85b4a0ea3416a0a62d6bde74f3af4b
Author: Yonghong Song <yhs@fb.com>
Date:   Fri Mar 16 15:10:02 2018 -0700

     [hack] hack test_bpf module to trigger BUG_ON in skb_segment.

     "modprobe test_bpf" will have the following errors:
     ...
     [   98.149165] ------------[ cut here ]------------
     [   98.159362] kernel BUG at net/core/skbuff.c:3667!
     [   98.169756] invalid opcode: 0000 [#1] SMP PTI
     [   98.179370] Modules linked in:
     [   98.179371]  test_bpf(+)
     ...

     The BUG happens in function skb_segment:
     ...
     3665                 while (pos < offset + len) {
     3666                         if (i >= nfrags) {
     3667                                 BUG_ON(skb_headlen(list_skb));
     3668
     3669                                 i = 0;
     3670                                 nfrags = skb_shinfo(list_skb)->nr_frags;
     3671                                 frag = skb_shinfo(list_skb)->frags;
     3672                                 frag_skb = list_skb;
     3673
     3674                                 BUG_ON(!nfrags);
     ...

     The skbs are constructed to mimic what mlx5 may generate.
     The packet size/header may not mimic real cases in production. But
     the processing flow is similar.

     Signed-off-by: Yonghong Song <yhs@fb.com>

diff --git a/lib/test_bpf.c b/lib/test_bpf.c
index 2efb213..d36a991 100644
--- a/lib/test_bpf.c
+++ b/lib/test_bpf.c
@@ -6574,6 +6574,67 @@ static bool exclude_test(int test_id)
         return test_id < test_range[0] || test_id > test_range[1];
  }

+static struct sk_buff *build_test_skb(void) {
+       u32 tailroom = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+       u32 headroom = NET_SKB_PAD + NET_IP_ALIGN + ETH_HLEN;
+       struct sk_buff *skb[2];
+       void *data[2], *page;
+       int i, data_size = 8;
+
+       page = (void *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
+       if (!page)
+               return NULL;
+
+       for (i = 0; i < 2; i++) {
+               data[i] = kzalloc(headroom + tailroom + data_size, GFP_KERNEL);
+               if (!data[i])
+                       return NULL;
+               skb[i] = build_skb(data[i], 0);
+               if (!skb[i]) {
+                       kfree(data[i]);
+                       return NULL;
+               }
+               skb_reserve(skb[i], headroom);
+               skb_put(skb[i], data_size);
+               skb[i]->protocol = htons(ETH_P_IP);
+               skb_reset_network_header(skb[i]);
+               skb_set_mac_header(skb[i], -ETH_HLEN);
+
+               skb_add_rx_frag(skb[i], skb_shinfo(skb[i])->nr_frags,
+                        page, 0, 64, 64);
+       }
+
+       /* setup shinfo */
+       skb_shinfo(skb[0])->gso_size = 1448;
+       skb_shinfo(skb[0])->gso_type = SKB_GSO_TCPV4;
+       skb_shinfo(skb[0])->gso_type |= SKB_GSO_DODGY;
+       skb_shinfo(skb[0])->gso_segs = 0;
+       skb_shinfo(skb[0])->frag_list = skb[1];
+
+       /* adjust skb[0]'s len */
+       skb[0]->len += skb[1]->len;
+       skb[0]->data_len += skb[1]->data_len;
+       skb[0]->truesize += skb[1]->truesize;
+
+       return skb[0];
+}
+
+static void test_skb_segment(void) {
+       netdev_features_t features;
+       struct sk_buff *skb;
+
+       features = NETIF_F_SG | NETIF_F_GSO_PARTIAL | NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM;
+       features |= NETIF_F_RXCSUM;
+       skb = build_test_skb();
+       if (!skb)
+               pr_info("Failed in test_skb_segment:build_test_skb!");
+
+       if (skb_segment(skb, features))
+               pr_info("Success in test_skb_segment!");
+       else
+               pr_info("Failed in test_skb_segment!");
+}
+
  static __init int test_bpf(void)
  {
         int i, err_cnt = 0, pass_cnt = 0;
@@ -6631,7 +6692,8 @@ static int __init test_bpf_init(void)
         if (ret < 0)
                 return ret;

-       ret = test_bpf();
+       // ret = test_bpf();
+       test_skb_segment();

         destroy_bpf_tests();
         return ret;


=============  END ================


On 3/13/18 6:15 PM, Eric Dumazet wrote:
> 
> 
> On 03/13/2018 05:35 PM, Eric Dumazet wrote:
>>
>>
>> On 03/13/2018 05:26 PM, Eric Dumazet wrote:
>>>
>>>
>>> On 03/13/2018 05:04 PM, Alexei Starovoitov wrote:
>>>> On 3/13/18 4:27 PM, Eric Dumazet wrote:
>>>>>
>>>>>
>>>>> On 03/13/2018 04:09 PM, Alexei Starovoitov wrote:
>>>>>
>>>>>> we have bpf_skb_proto_6_to_4() that was used by cilium for long time.
>>>>>> It's not clear why it's not crashing there, but we cannot just
>>>>>> reject changing proto in bpf programs now.
>>>>>> We have to fix whatever needs to be fixed in skb_segment
>>>>>> (if bug is there) or fix whatever necessary on mlx5 side.
>>>>>> In bpf helper we mark it as SKB_GSO_DODGY just like packets coming
>>>>>> through virtio would do, so if skb_segment() needs to do something
>>>>>> special with skb the SKB_GSO_DODGY flag is already there.
>>>>>
>>>>> 'Fixing' skb_segment(), I did that a long time ago and Herbert Xu was
>>>>> not happy with the fix and provided something else.
>>>>
>>>> any link to your old patches and discussion?
>>>>
>>>> I think since mlx4 can do tso on them and the packets came out
>>>> correct on the wire, there is nothing fundamentally wrong with
>>>> changing gso_size. Just tricky to teach skb_segment.
>>>>
>>>
>>> The world is not mlx4 only. Some NIC will ask skb_segment() fallback 
>>> segmentation for various reasons (like skb->len above a given limit 
>>> like 16KB)
>>>
>>> Maybe https://www.spinics.net/lists/netdev/msg255549.html
>>>
>>
>>
>> Herbert patch :
>>
>> commit 9d8506cc2d7ea1f911c72c100193a3677f6668c3
>> Author: Herbert Xu <herbert@gondor.apana.org.au>
>> Date:   Thu Nov 21 11:10:04 2013 -0800
>>
>>      gso: handle new frag_list of frags GRO packets
>>
> 
> I found my initial patch.
> 
> https://www.spinics.net/lists/netdev/msg255452.html
> 
> 
> 

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: BUG_ON triggered in skb_segment
  2018-03-16 22:37                         ` Yonghong Song
@ 2018-03-16 23:03                           ` Eric Dumazet
  2018-03-17  4:44                             ` Yonghong Song
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2018-03-16 23:03 UTC (permalink / raw)
  To: Yonghong Song, Alexei Starovoitov, Steffen Klassert, Daniel Borkmann
  Cc: netdev, Martin Lau, saeedm, diptanu



On 03/16/2018 03:37 PM, Yonghong Song wrote:
> 
> Eric and Daniel,
> 
> I have tried to fix this issue but not really successful.
> I tried two hacks:
>   . if skb_headlen(list_skb) is not 0, we just pull
>     skb_headlen(list_skb) from the skb to make skb_headlen(list_skb) = 0, or
>   . if skb_headlen(list_skb) is not 0, we go to the beginning of
>     the outer loop which will allocate another nskb for this list_skb.
> 
> Both approaches removed the BUG and the packet is able to reach the
> remote host. Upon receiving the packet, however, the remote host sends a
> reset packet back so connection eventually closed. I did not debug
> further on this.
> 
> Considering it is tricky to change skb_segment, I hacked test_bpf
> kernel module to reproduce the issue. The change reflects the gso packet
> structure I got from mlx5. Maybe you could take a look and suggest a fix or a direction of how to move forward.
> 
> Thanks!
> 
> ============= PATCH  ===============
> 
> -bash-4.2$ git show
> commit 41681ab51f85b4a0ea3416a0a62d6bde74f3af4b
> Author: Yonghong Song <yhs@fb.com>
> Date:   Fri Mar 16 15:10:02 2018 -0700
> 
>     [hack] hack test_bpf module to trigger BUG_ON in skb_segment.
> 
>     "modprobe test_bpf" will have the following errors:
>     ...
>     [   98.149165] ------------[ cut here ]------------
>     [   98.159362] kernel BUG at net/core/skbuff.c:3667!
>     [   98.169756] invalid opcode: 0000 [#1] SMP PTI
>     [   98.179370] Modules linked in:
>     [   98.179371]  test_bpf(+)
>     ...
> 
>     The BUG happens in function skb_segment:
>     ...
>     3665                 while (pos < offset + len) {
>     3666                         if (i >= nfrags) {
>     3667                                 BUG_ON(skb_headlen(list_skb));

Note that you do not need to pull data.

Chances are high that skb->head is a page fragment, so it can be considered as such ;)

Look at users of skb_head_is_locked()
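
A minimal sketch of that idea, assuming list_skb->head came from a page
fragment (an illustration of the hint, not the eventual upstream fix):

    /* Assumed headers: <linux/skbuff.h>, <linux/mm.h>.
     * If the head is not "locked" (i.e. skb->head_frag is set and the skb
     * is not cloned), the linear area can be described as an ordinary
     * (page, offset, size) frag instead of being pulled.
     */
    static bool headlen_as_frag(struct sk_buff *list_skb, skb_frag_t *frag)
    {
            if (skb_head_is_locked(list_skb))   /* !head_frag || cloned */
                    return false;

            __skb_frag_set_page(frag, virt_to_head_page(list_skb->head));
            frag->page_offset = list_skb->data -
                                (unsigned char *)page_address(skb_frag_page(frag));
            skb_frag_size_set(frag, skb_headlen(list_skb));
            return true;
    }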


>     3668
>     3669                                 i = 0;
>     3670                                 nfrags = skb_shinfo(list_skb)->nr_frags;
>     3671                                 frag = skb_shinfo(list_skb)->frags;
>     3672                                 frag_skb = list_skb;
>     3673
>     3674                                 BUG_ON(!nfrags);
>     ...
> 
>     The skbs are constructed to mimic what mlx5 may generate.
>     The packet size/header may not mimic real cases in production. But
>     the processing flow is similar.
> 
>     Signed-off-by: Yonghong Song <yhs@fb.com>
> 
> diff --git a/lib/test_bpf.c b/lib/test_bpf.c
> index 2efb213..d36a991 100644
> --- a/lib/test_bpf.c
> +++ b/lib/test_bpf.c
> @@ -6574,6 +6574,67 @@ static bool exclude_test(int test_id)
>         return test_id < test_range[0] || test_id > test_range[1];
>  }
> 
> +static struct sk_buff *build_test_skb(void) {
> +       u32 tailroom = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> +       u32 headroom = NET_SKB_PAD + NET_IP_ALIGN + ETH_HLEN;
> +       struct sk_buff *skb[2];
> +       void *data[2], *page;
> +       int i, data_size = 8;
> +
> +       page = (void *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
> +       if (!page)
> +               return NULL;
> +
> +       for (i = 0; i < 2; i++) {
> +               data[i] = kzalloc(headroom + tailroom + data_size, GFP_KERNEL);
> +               if (!data[i])
> +                       return NULL;
> +               skb[i] = build_skb(data[i], 0);
> +               if (!skb[i]) {
> +                       kfree(data[i]);
> +                       return NULL;
> +               }
> +               skb_reserve(skb[i], headroom);
> +               skb_put(skb[i], data_size);
> +               skb[i]->protocol = htons(ETH_P_IP);
> +               skb_reset_network_header(skb[i]);
> +               skb_set_mac_header(skb[i], -ETH_HLEN);
> +
> +               skb_add_rx_frag(skb[i], skb_shinfo(skb[i])->nr_frags,
> +                        page, 0, 64, 64);
> +       }
> +
> +       /* setup shinfo */
> +       skb_shinfo(skb[0])->gso_size = 1448;
> +       skb_shinfo(skb[0])->gso_type = SKB_GSO_TCPV4;
> +       skb_shinfo(skb[0])->gso_type |= SKB_GSO_DODGY;
> +       skb_shinfo(skb[0])->gso_segs = 0;
> +       skb_shinfo(skb[0])->frag_list = skb[1];
> +
> +       /* adjust skb[0]'s len */
> +       skb[0]->len += skb[1]->len;
> +       skb[0]->data_len += skb[1]->data_len;
> +       skb[0]->truesize += skb[1]->truesize;
> +
> +       return skb[0];
> +}
> +
> +static void test_skb_segment(void) {
> +       netdev_features_t features;
> +       struct sk_buff *skb;
> +
> +       features = NETIF_F_SG | NETIF_F_GSO_PARTIAL | NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM;
> +       features |= NETIF_F_RXCSUM;
> +       skb = build_test_skb();
> +       if (!skb)
> +               pr_info("Failed in test_skb_segment:build_test_skb!");
> +
> +       if (skb_segment(skb, features))
> +               pr_info("Success in test_skb_segment!");
> +       else
> +               pr_info("Failed in test_skb_segment!");
> +}
> +
>  static __init int test_bpf(void)
>  {
>         int i, err_cnt = 0, pass_cnt = 0;
> @@ -6631,7 +6692,8 @@ static int __init test_bpf_init(void)
>         if (ret < 0)
>                 return ret;
> 
> -       ret = test_bpf();
> +       // ret = test_bpf();
> +       test_skb_segment();
> 
>         destroy_bpf_tests();
>         return ret;
> 
> 
> =============  END ================
> 
> 
> On 3/13/18 6:15 PM, Eric Dumazet wrote:
>>
>>
>> On 03/13/2018 05:35 PM, Eric Dumazet wrote:
>>>
>>>
>>> On 03/13/2018 05:26 PM, Eric Dumazet wrote:
>>>>
>>>>
>>>> On 03/13/2018 05:04 PM, Alexei Starovoitov wrote:
>>>>> On 3/13/18 4:27 PM, Eric Dumazet wrote:
>>>>>>
>>>>>>
>>>>>> On 03/13/2018 04:09 PM, Alexei Starovoitov wrote:
>>>>>>
>>>>>>> we have bpf_skb_proto_6_to_4() that was used by cilium for long time.
>>>>>>> It's not clear why it's not crashing there, but we cannot just
>>>>>>> reject changing proto in bpf programs now.
>>>>>>> We have to fix whatever needs to be fixed in skb_segment
>>>>>>> (if bug is there) or fix whatever necessary on mlx5 side.
>>>>>>> In bpf helper we mark it as SKB_GSO_DODGY just like packets coming
>>>>>>> through virtio would do, so if skb_segment() needs to do something
>>>>>>> special with skb the SKB_GSO_DODGY flag is already there.
>>>>>>
>>>>>> 'Fixing' skb_segment(), I did that a long time ago and Herbert Xu was
>>>>>> not happy with the fix and provided something else.
>>>>>
>>>>> any link to your old patches and discussion?
>>>>>
>>>>> I think since mlx4 can do tso on them and the packets came out
>>>>> correct on the wire, there is nothing fundamentally wrong with
>>>>> changing gso_size. Just tricky to teach skb_segment.
>>>>>
>>>>
>>>> The world is not mlx4 only. Some NIC will ask skb_segment() fallback segmentation for various reasons (like skb->len above a given limit like 16KB)
>>>>
>>>> Maybe https://www.spinics.net/lists/netdev/msg255549.html
>>>
>>>
>>> Herbert patch :
>>>
>>> commit 9d8506cc2d7ea1f911c72c100193a3677f6668c3
>>> Author: Herbert Xu <herbert@gondor.apana.org.au>
>>> Date:   Thu Nov 21 11:10:04 2013 -0800
>>>
>>>      gso: handle new frag_list of frags GRO packets
>>>
>>
>> I found my initial patch.
>>
>> https://www.spinics.net/lists/netdev/msg255452.html
>>
>>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: BUG_ON triggered in skb_segment
  2018-03-16 23:03                           ` Eric Dumazet
@ 2018-03-17  4:44                             ` Yonghong Song
  0 siblings, 0 replies; 18+ messages in thread
From: Yonghong Song @ 2018-03-17  4:44 UTC (permalink / raw)
  To: Eric Dumazet, Alexei Starovoitov, Steffen Klassert, Daniel Borkmann
  Cc: netdev, Martin Lau, saeedm, diptanu



On 3/16/18 4:03 PM, Eric Dumazet wrote:
> 
> 
> On 03/16/2018 03:37 PM, Yonghong Song wrote:
>>
>> Eric and Daniel,
>>
>> I have tried to fix this issue but not really successful.
>> I tried two hacks:
>>    . if skb_headlen(list_skb) is not 0, we just pull
>>      skb_headlen(list_skb) from the skb to make skb_headlen(list_skb) = 0, or
>>    . if skb_headlen(list_skb) is not 0, we go to the beginning of
>>      the outer loop which will allocate another nskb for this list_skb.
>>
>> Both approaches removed the BUG and the packet is able to reach the
>> remote host. Upon receiving the packet, however, the remote host sends a
>> reset packet back so connection eventually closed. I did not debug
>> further on this.
>>
>> Considering it is tricky to change skb_segment, I hacked test_bpf
>> kernel module to reproduce the issue. The change reflects the gso packet
>> structure I got from mlx5. Maybe you could take a look and suggest a fix or a direction of how to move forward.
>>
>> Thanks!
>>
>> ============= PATCH  ===============
>>
>> -bash-4.2$ git show
>> commit 41681ab51f85b4a0ea3416a0a62d6bde74f3af4b
>> Author: Yonghong Song <yhs@fb.com>
>> Date:   Fri Mar 16 15:10:02 2018 -0700
>>
>>      [hack] hack test_bpf module to trigger BUG_ON in skb_segment.
>>
>>      "modprobe test_bpf" will have the following errors:
>>      ...
>>      [   98.149165] ------------[ cut here ]------------
>>      [   98.159362] kernel BUG at net/core/skbuff.c:3667!
>>      [   98.169756] invalid opcode: 0000 [#1] SMP PTI
>>      [   98.179370] Modules linked in:
>>      [   98.179371]  test_bpf(+)
>>      ...
>>
>>      The BUG happens in function skb_segment:
>>      ...
>>      3665                 while (pos < offset + len) {
>>      3666                         if (i >= nfrags) {
>>      3667                                 BUG_ON(skb_headlen(list_skb));
> 
> Note that you do not need to pull data.
> 
> Chances are high that skb->head is a page fragment, so can be considered as such ;)
> 
> Look at users of skb_head_is_locked()

Thanks Eric! skb->head_frag... This may well explain why my original 
hack won't work.

Yonghong

> 
> 
>>      3668
>>      3669                                 i = 0;
>>      3670                                 nfrags = skb_shinfo(list_skb)->nr_frags;
>>      3671                                 frag = skb_shinfo(list_skb)->frags;
>>      3672                                 frag_skb = list_skb;
>>      3673
>>      3674                                 BUG_ON(!nfrags);
>>      ...
>>
>>      The skbs are constructed to mimic what mlx5 may generate.
>>      The packet size/header may not mimic real cases in production. But
>>      the processing flow is similar.
>>
>>      Signed-off-by: Yonghong Song <yhs@fb.com>
>>
>> diff --git a/lib/test_bpf.c b/lib/test_bpf.c
>> index 2efb213..d36a991 100644
>> --- a/lib/test_bpf.c
>> +++ b/lib/test_bpf.c
>> @@ -6574,6 +6574,67 @@ static bool exclude_test(int test_id)
>>          return test_id < test_range[0] || test_id > test_range[1];
>>   }
>>
>> +static struct sk_buff *build_test_skb(void) {
>> +       u32 tailroom = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>> +       u32 headroom = NET_SKB_PAD + NET_IP_ALIGN + ETH_HLEN;
>> +       struct sk_buff *skb[2];
>> +       void *data[2], *page;
>> +       int i, data_size = 8;
>> +
>> +       page = (void *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
>> +       if (!page)
>> +               return NULL;
>> +
>> +       for (i = 0; i < 2; i++) {
>> +               data[i] = kzalloc(headroom + tailroom + data_size, GFP_KERNEL);
>> +               if (!data[i])
>> +                       return NULL;
>> +               skb[i] = build_skb(data[i], 0);
>> +               if (!skb[i]) {
>> +                       kfree(data[i]);
>> +                       return NULL;
>> +               }
>> +               skb_reserve(skb[i], headroom);
>> +               skb_put(skb[i], data_size);
>> +               skb[i]->protocol = htons(ETH_P_IP);
>> +               skb_reset_network_header(skb[i]);
>> +               skb_set_mac_header(skb[i], -ETH_HLEN);
>> +
>> +               skb_add_rx_frag(skb[i], skb_shinfo(skb[i])->nr_frags,
>> +                        page, 0, 64, 64);
>> +       }
>> +
>> +       /* setup shinfo */
>> +       skb_shinfo(skb[0])->gso_size = 1448;
>> +       skb_shinfo(skb[0])->gso_type = SKB_GSO_TCPV4;
>> +       skb_shinfo(skb[0])->gso_type |= SKB_GSO_DODGY;
>> +       skb_shinfo(skb[0])->gso_segs = 0;
>> +       skb_shinfo(skb[0])->frag_list = skb[1];
>> +
>> +       /* adjust skb[0]'s len */
>> +       skb[0]->len += skb[1]->len;
>> +       skb[0]->data_len += skb[1]->data_len;
>> +       skb[0]->truesize += skb[1]->truesize;
>> +
>> +       return skb[0];
>> +}
>> +
>> +static void test_skb_segment(void) {
>> +       netdev_features_t features;
>> +       struct sk_buff *skb;
>> +
>> +       features = NETIF_F_SG | NETIF_F_GSO_PARTIAL | NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM;
>> +       features |= NETIF_F_RXCSUM;
>> +       skb = build_test_skb();
>> +       if (!skb)
>> +               pr_info("Failed in test_skb_segment:build_test_skb!");
>> +
>> +       if (skb_segment(skb, features))
>> +               pr_info("Success in test_skb_segment!");
>> +       else
>> +               pr_info("Failed in test_skb_segment!");
>> +}
>> +
>>   static __init int test_bpf(void)
>>   {
>>          int i, err_cnt = 0, pass_cnt = 0;
>> @@ -6631,7 +6692,8 @@ static int __init test_bpf_init(void)
>>          if (ret < 0)
>>                  return ret;
>>
>> -       ret = test_bpf();
>> +       // ret = test_bpf();
>> +       test_skb_segment();
>>
>>          destroy_bpf_tests();
>>          return ret;
>>
>>
>> =============  END ================
>>
>>
>> On 3/13/18 6:15 PM, Eric Dumazet wrote:
>>>
>>>
>>> On 03/13/2018 05:35 PM, Eric Dumazet wrote:
>>>>
>>>>
>>>> On 03/13/2018 05:26 PM, Eric Dumazet wrote:
>>>>>
>>>>>
>>>>> On 03/13/2018 05:04 PM, Alexei Starovoitov wrote:
>>>>>> On 3/13/18 4:27 PM, Eric Dumazet wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 03/13/2018 04:09 PM, Alexei Starovoitov wrote:
>>>>>>>
>>>>>>>> we have bpf_skb_proto_6_to_4() that was used by cilium for long time.
>>>>>>>> It's not clear why it's not crashing there, but we cannot just
>>>>>>>> reject changing proto in bpf programs now.
>>>>>>>> We have to fix whatever needs to be fixed in skb_segment
>>>>>>>> (if bug is there) or fix whatever necessary on mlx5 side.
>>>>>>>> In bpf helper we mark it as SKB_GSO_DODGY just like packets coming
>>>>>>>> through virtio would do, so if skb_segment() needs to do something
>>>>>>>> special with skb the SKB_GSO_DODGY flag is already there.
>>>>>>>
>>>>>>> 'Fixing' skb_segment(), I did that a long time ago and Herbert Xu was
>>>>>>> not happy with the fix and provided something else.
>>>>>>
>>>>>> any link to your old patches and discussion?
>>>>>>
>>>>>> I think since mlx4 can do tso on them and the packets came out
>>>>>> correct on the wire, there is nothing fundamentally wrong with
>>>>>> changing gso_size. Just tricky to teach skb_segment.
>>>>>>
>>>>>
>>>>> The world is not mlx4 only. Some NIC will ask skb_segment() fallback segmentation for various reasons (like skb->len above a given limit like 16KB)
>>>>>
>>>>> Maybe https://www.spinics.net/lists/netdev/msg255549.html
>>>>
>>>>
>>>> Herbert patch :
>>>>
>>>> commit 9d8506cc2d7ea1f911c72c100193a3677f6668c3
>>>> Author: Herbert Xu <herbert@gondor.apana.org.au>
>>>> Date:   Thu Nov 21 11:10:04 2013 -0800
>>>>
>>>>       gso: handle new frag_list of frags GRO packets
>>>>
>>>
>>> I found my initial patch.
>>>
>>> https://www.spinics.net/lists/netdev/msg255452.html
>>>
>>>

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2018-03-17  4:45 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-13  5:45 BUG_ON triggered in skb_segment Yonghong Song
2018-03-13  6:04 ` Eric Dumazet
2018-03-13  6:08   ` Yonghong Song
2018-03-13  6:25     ` Eric Dumazet
2018-03-13  8:44       ` Steffen Klassert
2018-03-13 22:37         ` Yonghong Song
2018-03-13 22:47           ` Eric Dumazet
2018-03-13 23:09             ` Alexei Starovoitov
2018-03-13 23:18               ` Daniel Borkmann
2018-03-13 23:27               ` Eric Dumazet
2018-03-14  0:04                 ` Alexei Starovoitov
2018-03-14  0:26                   ` Eric Dumazet
2018-03-14  0:35                     ` Eric Dumazet
2018-03-14  1:15                       ` Eric Dumazet
2018-03-16 22:37                         ` Yonghong Song
2018-03-16 23:03                           ` Eric Dumazet
2018-03-17  4:44                             ` Yonghong Song
2018-03-13  6:18 ` Yunsheng Lin
