* Crashes in skb clone/allocation in 4.19.18
@ 2019-01-30 16:51 Ivan Babrou
  2019-01-30 17:00 ` Eric Dumazet
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Ivan Babrou @ 2019-01-30 16:51 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Eric Dumazet, Ignat Korchagin, Shawn Bohrer,
	Jakub Sitnicki

Hey,

We've upgraded some machines from 4.19.13 to 4.19.18 and some of them
crashed with the following:

[ 2313.192006] general protection fault: 0000 [#1] SMP PTI
[ 2313.205924] CPU: 32 PID: 65437 Comm: nginx-fl Tainted: G O      4.19.18-cloudflare-2019.1.8 #2019.1.8
[ 2313.224973] Hardware name: Quanta Computer Inc. QuantaPlex T41S-2U/S2S-MB, BIOS S2S_3B10.03 06/21/2018
[ 2313.243400] RIP: 0010:kmem_cache_alloc_node+0x178/0x1f0
[ 2313.257768] Code: 89 fa 4c 89 f6 e8 68 40 a1 00 4c 8b 55 00 58 4d 85 d2 75 d6 e9 6f ff ff ff 41 8b 59 20 48 8d 4a 01 4c 89 f8 49 8b 39 4c 01 fb <48> 33 1b 49 33 99 38 01 00 00 65 48 0f c7 0f 0f 94 c0 84 c0 0f 84
[ 2313.295550] RSP: 0000:ffff94457f903b48 EFLAGS: 00010202
[ 2313.310352] RAX: 08b82daf1f57da0e RBX: 08b82daf1f57da0e RCX: 00000000005ff72d
[ 2313.327189] RDX: 00000000005ff72c RSI: 0000000000480220 RDI: 0000000000026e40
[ 2313.344029] RBP: ffff94457f04d680 R08: ffff94457f926e40 R09: ffff94457f04d680
[ 2313.360912] R10: 000004ce652a0026 R11: 0000000000000000 R12: 0000000000480220
[ 2313.377857] R13: 00000000ffffffff R14: ffffffffb1ab3ab7 R15: 08b82daf1f57da0e
[ 2313.394820] FS:  00007fdea755c780(0000) GS:ffff94457f900000(0000) knlGS:0000000000000000
[ 2313.412887] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2313.428581] CR2: 000055acc3cf517b CR3: 000000201b1ea003 CR4: 00000000003606e0
[ 2313.445753] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2313.462843] perf: interrupt took too long (8028 > 7291), lowering kernel.perf_event_max_sample_rate to 24000
[ 2313.462867] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 2313.500216] Call Trace:
[ 2313.512833]  <IRQ>
[ 2313.524748]  __alloc_skb+0x57/0x1d0
[ 2313.537934]  __tcp_send_ack.part.48+0x2f/0x100
[ 2313.551845]  tcp_rcv_established+0x550/0x640
[ 2313.565394]  tcp_v4_do_rcv+0x12a/0x1e0
[ 2313.578322]  tcp_v4_rcv+0xadc/0xbd0
[ 2313.590993]  ip_local_deliver_finish+0x5d/0x1d0
[ 2313.604727]  ip_local_deliver+0x6b/0xe0
[ 2313.617782]  ? ip_sublist_rcv+0x200/0x200
[ 2313.630415] perf: interrupt took too long (10040 > 10035), lowering kernel.perf_event_max_sample_rate to 19000
[ 2313.630948]  ip_rcv+0x52/0xd0
[ 2313.662850]  ? ip_rcv_core.isra.22+0x2b0/0x2b0
[ 2313.662857]  __netif_receive_skb_one_core+0x52/0x70
[ 2313.690860]  netif_receive_skb_internal+0x34/0xe0
[ 2313.690883]  efx_rx_deliver+0x11a/0x180 [sfc]
[ 2313.717780]  ? __efx_rx_packet+0x1ef/0x730 [sfc]
[ 2313.717786]  ? __queue_work+0x103/0x3e0
[ 2313.743118]  ? efx_poll+0x35e/0x460 [sfc]
[ 2313.743125]  ? net_rx_action+0x138/0x360
[ 2313.767356]  ? __do_softirq+0xd8/0x2d2
[ 2313.767362]  ? irq_exit+0xb4/0xc0
[ 2313.790680]  ? do_IRQ+0x85/0xd0
[ 2313.790688]  ? common_interrupt+0xf/0xf
[ 2313.790694]  </IRQ>
[ 2313.823837] Modules linked in: tun xt_connlimit nf_conncount xt_bpf xt_hashlimit cls_flow cls_u32 sch_htb sch_fq md_mod dm_crypt algif_skcipher af_alg dm_mod dax ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw xt_nat iptable_nat nf_nat_ipv4 nf_nat xt_TPROXY nf_tproxy_ipv6 nf_tproxy_ipv4 xt_connmark iptable_mangle xt_owner xt_CT xt_socket nf_socket_ipv4 nf_socket_ipv6 iptable_raw ip6table_filter ip6_tables nfnetlink_log xt_NFLOG xt_tcpudp xt_comment xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_mark xt_multiport xt_set iptable_filter bpfilter ip_set_hash_netport ip_set_hash_net ip_set_hash_ip ip_set nfnetlink 8021q garp mrp stp llc sb_edac x86_pkg_temp_thermal kvm_intel kvm irqbypass crc32_pclmul crc32c_intel pcbc aesni_intel aes_x86_64 ipmi_ssif crypto_simd cryptd
[ 2313.952153]  sfc(O) glue_helper igb i2c_algo_bit ipmi_si mdio dca ipmi_devintf ipmi_msghandler efivarfs ip_tables x_tables
[ 2313.952238] ---[ end trace 477d8e3081c605f6 ]---

Some nodes also crashed in skb_clone, rather than __alloc_skb:

[ 3810.686137] general protection fault: 0000 [#1] SMP PTI
[ 3810.694579] CPU: 64 PID: 69338 Comm: nginx-fl Not tainted 4.19.18-cloudflare-2019.1.8 #2019.1.8
[ 3810.706589] Hardware name: Quanta Cloud Technology Inc. QuantaPlex T42S-2U(LBG-4) ^S5SZ090028/T42S-2U MB (Lewisburg-4), BIOS 3A11.Q10 06/29/2018
[ 3810.726475] RIP: 0010:kmem_cache_alloc+0x89/0x1c0
[ 3810.734701] Code: 82 72 49 83 78 10 00 4d 8b 30 0f 84 0e 01 00 00 4d 85 f6 0f 84 05 01 00 00 41 8b 5f 20 48 8d 4a 01 4c 89 f0 49 8b 3f 4c 01 f3 <48> 33 1b 49 33 9f 38 01 00 00 65 48 0f c7 0f 0f 94 c0 84 c0 74 b2
[ 3810.761088] RSP: 0000:ffff99723fe03730 EFLAGS: 00010282
[ 3810.770132] RAX: f0382d8aebf1ae68 RBX: f0382d8aebf1ae68 RCX: 0000000001cb61cf
[ 3810.781105] RDX: 0000000001cb61ce RSI: 0000000000480020 RDI: 0000000000027550
[ 3810.792012] RBP: ffff99723f19d500 R08: ffff99723fe27550 R09: 00000000000005dc
[ 3810.802820] R10: ffff9992227c0000 R11: 0000000000004000 R12: 0000000000480020
[ 3810.813589] R13: ffffffff8dcb5f7d R14: f0382d8aebf1ae68 R15: ffff99723f19d500
[ 3810.824382] FS:  00007f2a8863c780(0000) GS:ffff99723fe00000(0000) knlGS:0000000000000000
[ 3810.836189] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3810.845662] CR2: 000055820762eecd CR3: 00000019eb850003 CR4: 00000000007606e0
[ 3810.856567] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3810.867600] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3810.878554] PKRU: 55555554
[ 3810.884787] Call Trace:
[ 3810.890601]  <IRQ>
[ 3810.896116]  skb_clone+0x4d/0xb0
[ 3810.902712]  dev_queue_xmit_nit+0xd9/0x260
[ 3810.910181]  dev_hard_start_xmit+0x69/0x1f0
[ 3810.917784]  __dev_queue_xmit+0x6f7/0x8a0
[ 3810.925172]  ? eth_header+0x26/0xc0
[ 3810.932053]  ip_finish_output2+0x193/0x400
[ 3810.939670]  ? ip_finish_output+0x139/0x270
[ 3810.947241]  ip_output+0x6c/0xe0
[ 3810.953844]  ? ip_append_data.part.51+0xc0/0xc0
[ 3810.961802]  __tcp_transmit_skb+0x511/0xaa0
[ 3810.969420]  __tcp_retransmit_skb+0x19c/0x7c0
[ 3810.977209]  ? tcp_current_mss+0x57/0xa0
[ 3810.984493]  tcp_retransmit_skb+0x12/0x80
[ 3810.991894]  tcp_xmit_retransmit_queue.part.50+0x147/0x240
[ 3811.000754]  tcp_ack+0x9c4/0x11b0
[ 3811.007416]  tcp_rcv_established+0x190/0x640
[ 3811.015065]  ? tcp_v4_inbound_md5_hash+0x69/0x160
[ 3811.023106]  tcp_v4_do_rcv+0x12a/0x1e0
[ 3811.030190]  tcp_v4_rcv+0xadc/0xbd0
[ 3811.037009]  ip_local_deliver_finish+0x5d/0x1d0
[ 3811.044859]  ip_local_deliver+0x6b/0xe0
[ 3811.051999]  ? ip_sublist_rcv+0x200/0x200
[ 3811.059325]  ip_rcv+0x52/0xd0
[ 3811.065595]  ? ip_rcv_core.isra.22+0x2b0/0x2b0
[ 3811.073361]  __netif_receive_skb_one_core+0x52/0x70
[ 3811.081621]  netif_receive_skb_internal+0x34/0xe0
[ 3811.089652]  napi_gro_receive+0xba/0xe0
[ 3811.096969]  mlx5e_handle_rx_cqe+0x1eb/0x530 [mlx5_core]
[ 3811.105545]  ? skb_release_head_state+0x5c/0xb0
[ 3811.113447]  mlx5e_poll_rx_cq+0xc8/0x910 [mlx5_core]
[ 3811.121652]  mlx5e_napi_poll+0xb1/0xc60 [mlx5_core]
[ 3811.129574]  net_rx_action+0x138/0x360
[ 3811.136266]  __do_softirq+0xd8/0x2d2
[ 3811.142679]  irq_exit+0xb4/0xc0
[ 3811.148578]  do_IRQ+0x85/0xd0
[ 3811.154254]  common_interrupt+0xf/0xf
[ 3811.160585]  </IRQ>
[ 3811.165319] RIP: 0033:0x5581e1551ca0
[ 3811.171546] Code: e8 10 41 ff 24 ee 81 7c ca 04 ff ff fe ff 0f 83 87 1c 00 00 8b 03 0f b6 cc 0f b6 e8 83 c3 04 c1 e8 10 41 ff 24 ee 48 8b 2c c2 <48> 89 2c ca 8b 03 0f b6 cc 0f b6 e8 83 c3 04 c1 e8 10 41 ff 24 ee
[ 3811.195925] RSP: 002b:00007ffdd615ebc0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffde
[ 3811.206319] RAX: 0000000000000000 RBX: 00000000406c9058 RCX: 000000000000000b
[ 3811.216321] RDX: 000000004099cdc8 RSI: fffffffb40c07eb0 RDI: 000000004183d738
[ 3811.226277] RBP: fffffff444c8c5c0 R08: 000000004099cdc8 R09: 00000000425ce3d8
[ 3811.236340] R10: 0000000044c8c5c0 R11: 000000004139cbb0 R12: 0000000000000000
[ 3811.246349] R13: 00005581ead6a9e0 R14: 000000004166afe8 R15: 00000000406c90f8
[ 3811.256320] Modules linked in: tun xt_connlimit nf_conncount xt_bpf xt_hashlimit cls_flow cls_u32 sch_htb sch_fq md_mod dm_crypt algif_skcipher af_alg dm_mod dax ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables xt_nat iptable_nat nf_nat_ipv4 nf_nat xt_TPROXY nf_tproxy_ipv6 nf_tproxy_ipv4 xt_connmark iptable_mangle xt_owner xt_CT xt_socket nf_socket_ipv4 nf_socket_ipv6 iptable_raw nfnetlink_log xt_NFLOG xt_tcpudp xt_comment xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_mark xt_multiport xt_set iptable_filter bpfilter ip_set_hash_netport ip_set_hash_net ip_set_hash_ip ip_set nfnetlink 8021q garp mrp stp llc skx_edac x86_pkg_temp_thermal kvm_intel kvm irqbypass ipmi_ssif crc32_pclmul crc32c_intel pcbc aesni_intel aes_x86_64 crypto_simd mlx5_core
[ 3811.351698]  cryptd xhci_pci tpm_crb mlxfw glue_helper ioatdma devlink ipmi_si xhci_hcd dca ipmi_devintf ipmi_msghandler tpm_tis tpm_tis_core tpm efivarfs ip_tables x_tables
[ 3811.375161] ---[ end trace 1a7795bb39a63cf7 ]---

Is this a known issue? Could it be related to this commit:

* https://github.com/torvalds/linux/commit/598e57e029290be3e7f8f87ff908091a5a22ed2f

Thanks!


* Re: Crashes in skb clone/allocation in 4.19.18
  2019-01-30 16:51 Crashes in skb clone/allocation in 4.19.18 Ivan Babrou
@ 2019-01-30 17:00 ` Eric Dumazet
  2019-01-30 17:15 ` Cong Wang
  2019-01-30 17:33 ` Edward Cree
  2 siblings, 0 replies; 7+ messages in thread
From: Eric Dumazet @ 2019-01-30 17:00 UTC (permalink / raw)
  To: Ivan Babrou, netdev
  Cc: David S. Miller, Eric Dumazet, Ignat Korchagin, Shawn Bohrer,
	Jakub Sitnicki



On 01/30/2019 08:51 AM, Ivan Babrou wrote:
> Hey,
> 
> We've upgraded some machines from 4.19.13 to 4.19.18 and some of them
> crashed with the following:
> 
> [ 2313.192006] general protection fault: 0000 [#1] SMP PTI
> [ 2313.205924] CPU: 32 PID: 65437 Comm: nginx-fl Tainted: G
> O      4.19.18-cloudflare-2019.1.8 #2019.1.8
> [ 2313.224973] Hardware name: Quanta Computer Inc. QuantaPlex
> T41S-2U/S2S-MB, BIOS S2S_3B10.03 06/21/2018
> [ 2313.243400] RIP: 0010:kmem_cache_alloc_node+0x178/0x1f0
> [ 2313.257768] Code: 89 fa 4c 89 f6 e8 68 40 a1 00 4c 8b 55 00 58 4d
> 85 d2 75 d6 e9 6f ff ff ff 41 8b 59 20 48 8d 4a 01 4c 89 f8 49 8b 39
> 4c 01 fb <48> 33 1b 49 33 99 38 01 00 00 65 48 0f c7 0f 0f 94 c0 84 c0
> 0f 84
> [ 2313.295550] RSP: 0000:ffff94457f903b48 EFLAGS: 00010202
> [ 2313.310352] RAX: 08b82daf1f57da0e RBX: 08b82daf1f57da0e RCX: 00000000005ff72d
> [ 2313.327189] RDX: 00000000005ff72c RSI: 0000000000480220 RDI: 0000000000026e40
> [ 2313.344029] RBP: ffff94457f04d680 R08: ffff94457f926e40 R09: ffff94457f04d680
> [ 2313.360912] R10: 000004ce652a0026 R11: 0000000000000000 R12: 0000000000480220
> [ 2313.377857] R13: 00000000ffffffff R14: ffffffffb1ab3ab7 R15: 08b82daf1f57da0e
> [ 2313.394820] FS:  00007fdea755c780(0000) GS:ffff94457f900000(0000)
> knlGS:0000000000000000
> [ 2313.412887] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2313.428581] CR2: 000055acc3cf517b CR3: 000000201b1ea003 CR4: 00000000003606e0
> [ 2313.445753] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 2313.462843] perf: interrupt took too long (8028 > 7291), lowering
> kernel.perf_event_max_sample_rate to 24000
> [ 2313.462867] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 2313.500216] Call Trace:
> [ 2313.512833]  <IRQ>
> [ 2313.524748]  __alloc_skb+0x57/0x1d0
> [ 2313.537934]  __tcp_send_ack.part.48+0x2f/0x100
> [ 2313.551845]  tcp_rcv_established+0x550/0x640
> [ 2313.565394]  tcp_v4_do_rcv+0x12a/0x1e0
> [ 2313.578322]  tcp_v4_rcv+0xadc/0xbd0
> [ 2313.590993]  ip_local_deliver_finish+0x5d/0x1d0
> [ 2313.604727]  ip_local_deliver+0x6b/0xe0
> [ 2313.617782]  ? ip_sublist_rcv+0x200/0x200
> [ 2313.630415] perf: interrupt took too long (10040 > 10035), lowering
> kernel.perf_event_max_sample_rate to 19000
> [ 2313.630948]  ip_rcv+0x52/0xd0
> [ 2313.662850]  ? ip_rcv_core.isra.22+0x2b0/0x2b0
> [ 2313.662857]  __netif_receive_skb_one_core+0x52/0x70
> [ 2313.690860]  netif_receive_skb_internal+0x34/0xe0
> [ 2313.690883]  efx_rx_deliver+0x11a/0x180 [sfc]
> [ 2313.717780]  ? __efx_rx_packet+0x1ef/0x730 [sfc]
> [ 2313.717786]  ? __queue_work+0x103/0x3e0
> [ 2313.743118]  ? efx_poll+0x35e/0x460 [sfc]
> [ 2313.743125]  ? net_rx_action+0x138/0x360
> [ 2313.767356]  ? __do_softirq+0xd8/0x2d2
> [ 2313.767362]  ? irq_exit+0xb4/0xc0
> [ 2313.790680]  ? do_IRQ+0x85/0xd0
> [ 2313.790688]  ? common_interrupt+0xf/0xf
> [ 2313.790694]  </IRQ>
> [ 2313.823837] Modules linked in: tun xt_connlimit nf_conncount xt_bpf
> xt_hashlimit cls_flow cls_u32 sch_htb sch_fq md_mod dm_crypt
> algif_skcipher af_alg dm_mod dax ip6table_nat nf_nat_ipv6
> ip6table_mangle ip6table_security ip6table_raw xt_nat iptable_nat
> nf_nat_ipv4 nf_nat xt_TPROXY nf_tproxy_ipv6 nf_tproxy_ipv4 xt_connmark
> iptable_mangle xt_owner xt_CT xt_socket nf_socket_ipv4 nf_socket_ipv6
> iptable_raw ip6table_filter ip6_tables nfnetlink_log xt_NFLOG
> xt_tcpudp xt_comment xt_conntrack nf_conntrack nf_defrag_ipv6
> nf_defrag_ipv4 xt_mark xt_multiport xt_set iptable_filter bpfilter
> ip_set_hash_netport ip_set_hash_net ip_set_hash_ip ip_set nfnetlink
> 8021q garp mrp stp llc sb_edac x86_pkg_temp_thermal kvm_intel kvm
> irqbypass crc32_pclmul crc32c_intel pcbc aesni_intel aes_x86_64
> ipmi_ssif crypto_simd cryptd
> [ 2313.952153]  sfc(O) glue_helper igb i2c_algo_bit ipmi_si mdio dca
> ipmi_devintf ipmi_msghandler efivarfs ip_tables x_tables
> [ 2313.952238] ---[ end trace 477d8e3081c605f6 ]---
> 
> Some nodes also crashed in skb_clone, rather than __alloc_skb:
> 
> [ 3810.686137] general protection fault: 0000 [#1] SMP PTI
> [ 3810.694579] CPU: 64 PID: 69338 Comm: nginx-fl Not tainted
> 4.19.18-cloudflare-2019.1.8 #2019.1.8
> [ 3810.706589] Hardware name: Quanta Cloud Technology Inc. QuantaPlex
> T42S-2U(LBG-4) ^S5SZ090028/T42S-2U MB (Lewisburg-4), BIOS 3A11.Q10
> 06/29/2018
> [ 3810.726475] RIP: 0010:kmem_cache_alloc+0x89/0x1c0
> [ 3810.734701] Code: 82 72 49 83 78 10 00 4d 8b 30 0f 84 0e 01 00 00
> 4d 85 f6 0f 84 05 01 00 00 41 8b 5f 20 48 8d 4a 01 4c 89 f0 49 8b 3f
> 4c 01 f3 <48> 33 1b 49 33 9f 38 01 00 00 65 48 0f c7 0f 0f 94 c0 84 c0
> 74 b2
> [ 3810.761088] RSP: 0000:ffff99723fe03730 EFLAGS: 00010282
> [ 3810.770132] RAX: f0382d8aebf1ae68 RBX: f0382d8aebf1ae68 RCX: 0000000001cb61cf
> [ 3810.781105] RDX: 0000000001cb61ce RSI: 0000000000480020 RDI: 0000000000027550
> [ 3810.792012] RBP: ffff99723f19d500 R08: ffff99723fe27550 R09: 00000000000005dc
> [ 3810.802820] R10: ffff9992227c0000 R11: 0000000000004000 R12: 0000000000480020
> [ 3810.813589] R13: ffffffff8dcb5f7d R14: f0382d8aebf1ae68 R15: ffff99723f19d500
> [ 3810.824382] FS:  00007f2a8863c780(0000) GS:ffff99723fe00000(0000)
> knlGS:0000000000000000
> [ 3810.836189] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 3810.845662] CR2: 000055820762eecd CR3: 00000019eb850003 CR4: 00000000007606e0
> [ 3810.856567] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 3810.867600] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 3810.878554] PKRU: 55555554
> [ 3810.884787] Call Trace:
> [ 3810.890601]  <IRQ>
> [ 3810.896116]  skb_clone+0x4d/0xb0
> [ 3810.902712]  dev_queue_xmit_nit+0xd9/0x260
> [ 3810.910181]  dev_hard_start_xmit+0x69/0x1f0
> [ 3810.917784]  __dev_queue_xmit+0x6f7/0x8a0
> [ 3810.925172]  ? eth_header+0x26/0xc0
> [ 3810.932053]  ip_finish_output2+0x193/0x400
> [ 3810.939670]  ? ip_finish_output+0x139/0x270
> [ 3810.947241]  ip_output+0x6c/0xe0
> [ 3810.953844]  ? ip_append_data.part.51+0xc0/0xc0
> [ 3810.961802]  __tcp_transmit_skb+0x511/0xaa0
> [ 3810.969420]  __tcp_retransmit_skb+0x19c/0x7c0
> [ 3810.977209]  ? tcp_current_mss+0x57/0xa0
> [ 3810.984493]  tcp_retransmit_skb+0x12/0x80
> [ 3810.991894]  tcp_xmit_retransmit_queue.part.50+0x147/0x240
> [ 3811.000754]  tcp_ack+0x9c4/0x11b0
> [ 3811.007416]  tcp_rcv_established+0x190/0x640
> [ 3811.015065]  ? tcp_v4_inbound_md5_hash+0x69/0x160
> [ 3811.023106]  tcp_v4_do_rcv+0x12a/0x1e0
> [ 3811.030190]  tcp_v4_rcv+0xadc/0xbd0
> [ 3811.037009]  ip_local_deliver_finish+0x5d/0x1d0
> [ 3811.044859]  ip_local_deliver+0x6b/0xe0
> [ 3811.051999]  ? ip_sublist_rcv+0x200/0x200
> [ 3811.059325]  ip_rcv+0x52/0xd0
> [ 3811.065595]  ? ip_rcv_core.isra.22+0x2b0/0x2b0
> [ 3811.073361]  __netif_receive_skb_one_core+0x52/0x70
> [ 3811.081621]  netif_receive_skb_internal+0x34/0xe0
> [ 3811.089652]  napi_gro_receive+0xba/0xe0
> [ 3811.096969]  mlx5e_handle_rx_cqe+0x1eb/0x530 [mlx5_core]
> [ 3811.105545]  ? skb_release_head_state+0x5c/0xb0
> [ 3811.113447]  mlx5e_poll_rx_cq+0xc8/0x910 [mlx5_core]
> [ 3811.121652]  mlx5e_napi_poll+0xb1/0xc60 [mlx5_core]
> [ 3811.129574]  net_rx_action+0x138/0x360
> [ 3811.136266]  __do_softirq+0xd8/0x2d2
> [ 3811.142679]  irq_exit+0xb4/0xc0
> [ 3811.148578]  do_IRQ+0x85/0xd0
> [ 3811.154254]  common_interrupt+0xf/0xf
> [ 3811.160585]  </IRQ>
> [ 3811.165319] RIP: 0033:0x5581e1551ca0
> [ 3811.171546] Code: e8 10 41 ff 24 ee 81 7c ca 04 ff ff fe ff 0f 83
> 87 1c 00 00 8b 03 0f b6 cc 0f b6 e8 83 c3 04 c1 e8 10 41 ff 24 ee 48
> 8b 2c c2 <48> 89 2c ca 8b 03 0f b6 cc 0f b6 e8 83 c3 04 c1 e8 10 41 ff
> 24 ee
> [ 3811.195925] RSP: 002b:00007ffdd615ebc0 EFLAGS: 00000246 ORIG_RAX:
> ffffffffffffffde
> [ 3811.206319] RAX: 0000000000000000 RBX: 00000000406c9058 RCX: 000000000000000b
> [ 3811.216321] RDX: 000000004099cdc8 RSI: fffffffb40c07eb0 RDI: 000000004183d738
> [ 3811.226277] RBP: fffffff444c8c5c0 R08: 000000004099cdc8 R09: 00000000425ce3d8
> [ 3811.236340] R10: 0000000044c8c5c0 R11: 000000004139cbb0 R12: 0000000000000000
> [ 3811.246349] R13: 00005581ead6a9e0 R14: 000000004166afe8 R15: 00000000406c90f8
> [ 3811.256320] Modules linked in: tun xt_connlimit nf_conncount xt_bpf
> xt_hashlimit cls_flow cls_u32 sch_htb sch_fq md_mod dm_crypt
> algif_skcipher af_alg dm_mod dax ip6table_nat nf_nat_ipv6
> ip6table_mangle ip6table_security ip6table_raw ip6table_filter
> ip6_tables xt_nat iptable_nat nf_nat_ipv4 nf_nat xt_TPROXY
> nf_tproxy_ipv6 nf_tproxy_ipv4 xt_connmark iptable_mangle xt_owner
> xt_CT xt_socket nf_socket_ipv4 nf_socket_ipv6 iptable_raw
> nfnetlink_log xt_NFLOG xt_tcpudp xt_comment xt_conntrack nf_conntrack
> nf_defrag_ipv6 nf_defrag_ipv4 xt_mark xt_multiport xt_set
> iptable_filter bpfilter ip_set_hash_netport ip_set_hash_net
> ip_set_hash_ip ip_set nfnetlink 8021q garp mrp stp llc skx_edac
> x86_pkg_temp_thermal kvm_intel kvm irqbypass ipmi_ssif crc32_pclmul
> crc32c_intel pcbc aesni_intel aes_x86_64 crypto_simd mlx5_core
> [ 3811.351698]  cryptd xhci_pci tpm_crb mlxfw glue_helper ioatdma
> devlink ipmi_si xhci_hcd dca ipmi_devintf ipmi_msghandler tpm_tis
> tpm_tis_core tpm efivarfs ip_tables x_tables
> [ 3811.375161] ---[ end trace 1a7795bb39a63cf7 ]---
> 
> Is this a known issue? Could it be related to this commit:
> 
> * https://github.com/torvalds/linux/commit/598e57e029290be3e7f8f87ff908091a5a22ed2f
> 

I do not believe this commit could explain these crashes.

Given that there are about 580 commits between 4.19.13 and 4.19.18, a bisection might be the
easiest way to find the problem.
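
As a rough illustration of what that bisection would cost (a sketch, not part of the original reply): each bisection step halves the range of candidate commits, so the number of kernels to build and test grows only logarithmically. The helper name bisect_steps is made up for this example, and 580 is the estimate quoted above.

```python
def bisect_steps(num_commits: int) -> int:
    """Worst-case number of build-and-test rounds needed to isolate one
    bad commit among num_commits candidates by repeated halving.

    Equivalent to ceil(log2(num_commits)), computed with integers only.
    """
    if num_commits < 1:
        raise ValueError("need at least one candidate commit")
    return (num_commits - 1).bit_length()


if __name__ == "__main__":
    # About 580 commits between 4.19.13 and 4.19.18 (estimate from the thread).
    print(bisect_steps(580))
```

In a linux-stable checkout, such a range could be set up with `git bisect start v4.19.18 v4.19.13` (bad revision first, then good), assuming the crash can be reproduced reliably enough to mark each step good or bad.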

Thanks.



* Re: Crashes in skb clone/allocation in 4.19.18
  2019-01-30 16:51 Crashes in skb clone/allocation in 4.19.18 Ivan Babrou
  2019-01-30 17:00 ` Eric Dumazet
@ 2019-01-30 17:15 ` Cong Wang
  2019-01-30 17:28   ` Lance Richardson
  2019-01-30 17:33 ` Edward Cree
  2 siblings, 1 reply; 7+ messages in thread
From: Cong Wang @ 2019-01-30 17:15 UTC (permalink / raw)
  To: Ivan Babrou
  Cc: Linux Kernel Network Developers, David S. Miller, Eric Dumazet,
	Ignat Korchagin, Shawn Bohrer, Jakub Sitnicki

On Wed, Jan 30, 2019 at 8:54 AM Ivan Babrou <ivan@cloudflare.com> wrote:
>
> Hey,
>
> We've upgraded some machines from 4.19.13 to 4.19.18 and some of them
> crashed with the following:
>
> [ 2313.192006] general protection fault: 0000 [#1] SMP PTI
> [ 2313.205924] CPU: 32 PID: 65437 Comm: nginx-fl Tainted: G
> O      4.19.18-cloudflare-2019.1.8 #2019.1.8
> [ 2313.224973] Hardware name: Quanta Computer Inc. QuantaPlex
> T41S-2U/S2S-MB, BIOS S2S_3B10.03 06/21/2018
> [ 2313.243400] RIP: 0010:kmem_cache_alloc_node+0x178/0x1f0

This looks more like an mm bug than a networking one.
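
One detail in the register dumps can be checked mechanically (a sketch added for illustration, with assumptions spelled out): in both traces RAX and RBX hold values such as 08b82daf1f57da0e and f0382d8aebf1ae68. Assuming 48-bit virtual addresses (4-level paging), these are not canonical x86-64 addresses, and dereferencing a non-canonical address raises a general protection fault. That would be consistent with a corrupted pointer inside the slab allocator, although it does not say which code corrupted it. The function name is_canonical is hypothetical.

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """Return True if addr is a canonical x86-64 virtual address.

    With va_bits of virtual address space, bits 63 down to va_bits-1
    must be all zeros or all ones. va_bits=48 is an assumption that
    matches 4-level paging.
    """
    if not 0 <= addr < 1 << 64:
        raise ValueError("address must fit in 64 bits")
    top = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones


if __name__ == "__main__":
    # Register values copied from the two traces in this thread.
    for name, value in [
        ("RBX, first trace", 0x08B82DAF1F57DA0E),
        ("RBX, second trace", 0xF0382D8AEBF1AE68),
        ("RBP, first trace", 0xFFFF94457F04D680),
    ]:
        print(f"{name}: {value:#018x} canonical={is_canonical(value)}")
```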

Also, it is always helpful if you can map the RIP to source code,
using scripts/faddr2line or scripts/decode_stacktrace.sh.
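
A small sketch of how the symbol+offset could be pulled out of an oops log and handed to scripts/faddr2line (added for illustration; the function names are made up). It assumes the kernel-mode RIP lines look like the ones in this thread, with code segment 0010, and that a vmlinux with debug info matching the crashed build is available in the kernel source tree.

```python
import re

# Kernel-mode RIP lines look like:
#   RIP: 0010:kmem_cache_alloc_node+0x178/0x1f0
# User-mode ones (CS 0033) carry a raw address and are skipped.
RIP_RE = re.compile(r"RIP: 0010:([\w.]+\+0x[0-9a-f]+/0x[0-9a-f]+)")


def kernel_rip_symbols(oops_text: str) -> list[str]:
    """Return every func+offset/size string found on kernel RIP lines."""
    return RIP_RE.findall(oops_text)


def faddr2line_command(oops_text: str, vmlinux: str = "vmlinux") -> list[str]:
    """Build an argv list for scripts/faddr2line.

    The vmlinux path is an assumption; it must be the unstripped image
    of the exact kernel that crashed.
    """
    return ["scripts/faddr2line", vmlinux] + kernel_rip_symbols(oops_text)


if __name__ == "__main__":
    sample = (
        "[ 2313.243400] RIP: 0010:kmem_cache_alloc_node+0x178/0x1f0\n"
        "[ 3811.165319] RIP: 0033:0x5581e1551ca0\n"
    )
    print(" ".join(faddr2line_command(sample)))
```

The resulting list can be passed to subprocess.run() from the top of the kernel tree; scripts/decode_stacktrace.sh takes the whole log on standard input instead.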


Thanks.


> [ 2313.257768] Code: 89 fa 4c 89 f6 e8 68 40 a1 00 4c 8b 55 00 58 4d
> 85 d2 75 d6 e9 6f ff ff ff 41 8b 59 20 48 8d 4a 01 4c 89 f8 49 8b 39
> 4c 01 fb <48> 33 1b 49 33 99 38 01 00 00 65 48 0f c7 0f 0f 94 c0 84 c0
> 0f 84
> [ 2313.295550] RSP: 0000:ffff94457f903b48 EFLAGS: 00010202
> [ 2313.310352] RAX: 08b82daf1f57da0e RBX: 08b82daf1f57da0e RCX: 00000000005ff72d
> [ 2313.327189] RDX: 00000000005ff72c RSI: 0000000000480220 RDI: 0000000000026e40
> [ 2313.344029] RBP: ffff94457f04d680 R08: ffff94457f926e40 R09: ffff94457f04d680
> [ 2313.360912] R10: 000004ce652a0026 R11: 0000000000000000 R12: 0000000000480220
> [ 2313.377857] R13: 00000000ffffffff R14: ffffffffb1ab3ab7 R15: 08b82daf1f57da0e
> [ 2313.394820] FS:  00007fdea755c780(0000) GS:ffff94457f900000(0000)
> knlGS:0000000000000000
> [ 2313.412887] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2313.428581] CR2: 000055acc3cf517b CR3: 000000201b1ea003 CR4: 00000000003606e0
> [ 2313.445753] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 2313.462843] perf: interrupt took too long (8028 > 7291), lowering
> kernel.perf_event_max_sample_rate to 24000
> [ 2313.462867] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 2313.500216] Call Trace:
> [ 2313.512833]  <IRQ>
> [ 2313.524748]  __alloc_skb+0x57/0x1d0
> [ 2313.537934]  __tcp_send_ack.part.48+0x2f/0x100
> [ 2313.551845]  tcp_rcv_established+0x550/0x640
> [ 2313.565394]  tcp_v4_do_rcv+0x12a/0x1e0
> [ 2313.578322]  tcp_v4_rcv+0xadc/0xbd0
> [ 2313.590993]  ip_local_deliver_finish+0x5d/0x1d0
> [ 2313.604727]  ip_local_deliver+0x6b/0xe0
> [ 2313.617782]  ? ip_sublist_rcv+0x200/0x200
> [ 2313.630415] perf: interrupt took too long (10040 > 10035), lowering
> kernel.perf_event_max_sample_rate to 19000
> [ 2313.630948]  ip_rcv+0x52/0xd0
> [ 2313.662850]  ? ip_rcv_core.isra.22+0x2b0/0x2b0
> [ 2313.662857]  __netif_receive_skb_one_core+0x52/0x70
> [ 2313.690860]  netif_receive_skb_internal+0x34/0xe0
> [ 2313.690883]  efx_rx_deliver+0x11a/0x180 [sfc]
> [ 2313.717780]  ? __efx_rx_packet+0x1ef/0x730 [sfc]
> [ 2313.717786]  ? __queue_work+0x103/0x3e0
> [ 2313.743118]  ? efx_poll+0x35e/0x460 [sfc]
> [ 2313.743125]  ? net_rx_action+0x138/0x360
> [ 2313.767356]  ? __do_softirq+0xd8/0x2d2
> [ 2313.767362]  ? irq_exit+0xb4/0xc0
> [ 2313.790680]  ? do_IRQ+0x85/0xd0
> [ 2313.790688]  ? common_interrupt+0xf/0xf
> [ 2313.790694]  </IRQ>
> [ 2313.823837] Modules linked in: tun xt_connlimit nf_conncount xt_bpf
> xt_hashlimit cls_flow cls_u32 sch_htb sch_fq md_mod dm_crypt
> algif_skcipher af_alg dm_mod dax ip6table_nat nf_nat_ipv6
> ip6table_mangle ip6table_security ip6table_raw xt_nat iptable_nat
> nf_nat_ipv4 nf_nat xt_TPROXY nf_tproxy_ipv6 nf_tproxy_ipv4 xt_connmark
> iptable_mangle xt_owner xt_CT xt_socket nf_socket_ipv4 nf_socket_ipv6
> iptable_raw ip6table_filter ip6_tables nfnetlink_log xt_NFLOG
> xt_tcpudp xt_comment xt_conntrack nf_conntrack nf_defrag_ipv6
> nf_defrag_ipv4 xt_mark xt_multiport xt_set iptable_filter bpfilter
> ip_set_hash_netport ip_set_hash_net ip_set_hash_ip ip_set nfnetlink
> 8021q garp mrp stp llc sb_edac x86_pkg_temp_thermal kvm_intel kvm
> irqbypass crc32_pclmul crc32c_intel pcbc aesni_intel aes_x86_64
> ipmi_ssif crypto_simd cryptd
> [ 2313.952153]  sfc(O) glue_helper igb i2c_algo_bit ipmi_si mdio dca
> ipmi_devintf ipmi_msghandler efivarfs ip_tables x_tables
> [ 2313.952238] ---[ end trace 477d8e3081c605f6 ]---
>
> Some nodes also crashed in skb_clone, rather than __alloc_skb:
>
> [ 3810.686137] general protection fault: 0000 [#1] SMP PTI
> [ 3810.694579] CPU: 64 PID: 69338 Comm: nginx-fl Not tainted
> 4.19.18-cloudflare-2019.1.8 #2019.1.8
> [ 3810.706589] Hardware name: Quanta Cloud Technology Inc. QuantaPlex
> T42S-2U(LBG-4) ^S5SZ090028/T42S-2U MB (Lewisburg-4), BIOS 3A11.Q10
> 06/29/2018
> [ 3810.726475] RIP: 0010:kmem_cache_alloc+0x89/0x1c0
> [ 3810.734701] Code: 82 72 49 83 78 10 00 4d 8b 30 0f 84 0e 01 00 00
> 4d 85 f6 0f 84 05 01 00 00 41 8b 5f 20 48 8d 4a 01 4c 89 f0 49 8b 3f
> 4c 01 f3 <48> 33 1b 49 33 9f 38 01 00 00 65 48 0f c7 0f 0f 94 c0 84 c0
> 74 b2
> [ 3810.761088] RSP: 0000:ffff99723fe03730 EFLAGS: 00010282
> [ 3810.770132] RAX: f0382d8aebf1ae68 RBX: f0382d8aebf1ae68 RCX: 0000000001cb61cf
> [ 3810.781105] RDX: 0000000001cb61ce RSI: 0000000000480020 RDI: 0000000000027550
> [ 3810.792012] RBP: ffff99723f19d500 R08: ffff99723fe27550 R09: 00000000000005dc
> [ 3810.802820] R10: ffff9992227c0000 R11: 0000000000004000 R12: 0000000000480020
> [ 3810.813589] R13: ffffffff8dcb5f7d R14: f0382d8aebf1ae68 R15: ffff99723f19d500
> [ 3810.824382] FS:  00007f2a8863c780(0000) GS:ffff99723fe00000(0000)
> knlGS:0000000000000000
> [ 3810.836189] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 3810.845662] CR2: 000055820762eecd CR3: 00000019eb850003 CR4: 00000000007606e0
> [ 3810.856567] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 3810.867600] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 3810.878554] PKRU: 55555554
> [ 3810.884787] Call Trace:
> [ 3810.890601]  <IRQ>
> [ 3810.896116]  skb_clone+0x4d/0xb0
> [ 3810.902712]  dev_queue_xmit_nit+0xd9/0x260
> [ 3810.910181]  dev_hard_start_xmit+0x69/0x1f0
> [ 3810.917784]  __dev_queue_xmit+0x6f7/0x8a0
> [ 3810.925172]  ? eth_header+0x26/0xc0
> [ 3810.932053]  ip_finish_output2+0x193/0x400
> [ 3810.939670]  ? ip_finish_output+0x139/0x270
> [ 3810.947241]  ip_output+0x6c/0xe0
> [ 3810.953844]  ? ip_append_data.part.51+0xc0/0xc0
> [ 3810.961802]  __tcp_transmit_skb+0x511/0xaa0
> [ 3810.969420]  __tcp_retransmit_skb+0x19c/0x7c0
> [ 3810.977209]  ? tcp_current_mss+0x57/0xa0
> [ 3810.984493]  tcp_retransmit_skb+0x12/0x80
> [ 3810.991894]  tcp_xmit_retransmit_queue.part.50+0x147/0x240
> [ 3811.000754]  tcp_ack+0x9c4/0x11b0
> [ 3811.007416]  tcp_rcv_established+0x190/0x640
> [ 3811.015065]  ? tcp_v4_inbound_md5_hash+0x69/0x160
> [ 3811.023106]  tcp_v4_do_rcv+0x12a/0x1e0
> [ 3811.030190]  tcp_v4_rcv+0xadc/0xbd0
> [ 3811.037009]  ip_local_deliver_finish+0x5d/0x1d0
> [ 3811.044859]  ip_local_deliver+0x6b/0xe0
> [ 3811.051999]  ? ip_sublist_rcv+0x200/0x200
> [ 3811.059325]  ip_rcv+0x52/0xd0
> [ 3811.065595]  ? ip_rcv_core.isra.22+0x2b0/0x2b0
> [ 3811.073361]  __netif_receive_skb_one_core+0x52/0x70
> [ 3811.081621]  netif_receive_skb_internal+0x34/0xe0
> [ 3811.089652]  napi_gro_receive+0xba/0xe0
> [ 3811.096969]  mlx5e_handle_rx_cqe+0x1eb/0x530 [mlx5_core]
> [ 3811.105545]  ? skb_release_head_state+0x5c/0xb0
> [ 3811.113447]  mlx5e_poll_rx_cq+0xc8/0x910 [mlx5_core]
> [ 3811.121652]  mlx5e_napi_poll+0xb1/0xc60 [mlx5_core]
> [ 3811.129574]  net_rx_action+0x138/0x360
> [ 3811.136266]  __do_softirq+0xd8/0x2d2
> [ 3811.142679]  irq_exit+0xb4/0xc0
> [ 3811.148578]  do_IRQ+0x85/0xd0
> [ 3811.154254]  common_interrupt+0xf/0xf
> [ 3811.160585]  </IRQ>
> [ 3811.165319] RIP: 0033:0x5581e1551ca0
> [ 3811.171546] Code: e8 10 41 ff 24 ee 81 7c ca 04 ff ff fe ff 0f 83
> 87 1c 00 00 8b 03 0f b6 cc 0f b6 e8 83 c3 04 c1 e8 10 41 ff 24 ee 48
> 8b 2c c2 <48> 89 2c ca 8b 03 0f b6 cc 0f b6 e8 83 c3 04 c1 e8 10 41 ff
> 24 ee
> [ 3811.195925] RSP: 002b:00007ffdd615ebc0 EFLAGS: 00000246 ORIG_RAX:
> ffffffffffffffde
> [ 3811.206319] RAX: 0000000000000000 RBX: 00000000406c9058 RCX: 000000000000000b
> [ 3811.216321] RDX: 000000004099cdc8 RSI: fffffffb40c07eb0 RDI: 000000004183d738
> [ 3811.226277] RBP: fffffff444c8c5c0 R08: 000000004099cdc8 R09: 00000000425ce3d8
> [ 3811.236340] R10: 0000000044c8c5c0 R11: 000000004139cbb0 R12: 0000000000000000
> [ 3811.246349] R13: 00005581ead6a9e0 R14: 000000004166afe8 R15: 00000000406c90f8
> [ 3811.256320] Modules linked in: tun xt_connlimit nf_conncount xt_bpf
> xt_hashlimit cls_flow cls_u32 sch_htb sch_fq md_mod dm_crypt
> algif_skcipher af_alg dm_mod dax ip6table_nat nf_nat_ipv6
> ip6table_mangle ip6table_security ip6table_raw ip6table_filter
> ip6_tables xt_nat iptable_nat nf_nat_ipv4 nf_nat xt_TPROXY
> nf_tproxy_ipv6 nf_tproxy_ipv4 xt_connmark iptable_mangle xt_owner
> xt_CT xt_socket nf_socket_ipv4 nf_socket_ipv6 iptable_raw
> nfnetlink_log xt_NFLOG xt_tcpudp xt_comment xt_conntrack nf_conntrack
> nf_defrag_ipv6 nf_defrag_ipv4 xt_mark xt_multiport xt_set
> iptable_filter bpfilter ip_set_hash_netport ip_set_hash_net
> ip_set_hash_ip ip_set nfnetlink 8021q garp mrp stp llc skx_edac
> x86_pkg_temp_thermal kvm_intel kvm irqbypass ipmi_ssif crc32_pclmul
> crc32c_intel pcbc aesni_intel aes_x86_64 crypto_simd mlx5_core
> [ 3811.351698]  cryptd xhci_pci tpm_crb mlxfw glue_helper ioatdma
> devlink ipmi_si xhci_hcd dca ipmi_devintf ipmi_msghandler tpm_tis
> tpm_tis_core tpm efivarfs ip_tables x_tables
> [ 3811.375161] ---[ end trace 1a7795bb39a63cf7 ]---
>
> Is this a known issue? Could it be related to this commit:
>
> * https://github.com/torvalds/linux/commit/598e57e029290be3e7f8f87ff908091a5a22ed2f
>
> Thanks!


* Re: Crashes in skb clone/allocation in 4.19.18
  2019-01-30 17:15 ` Cong Wang
@ 2019-01-30 17:28   ` Lance Richardson
  2019-01-30 17:34     ` Ivan Babrou
  0 siblings, 1 reply; 7+ messages in thread
From: Lance Richardson @ 2019-01-30 17:28 UTC (permalink / raw)
  To: Cong Wang
  Cc: Ivan Babrou, Linux Kernel Network Developers, David S. Miller,
	Eric Dumazet, Ignat Korchagin, Shawn Bohrer, Jakub Sitnicki

On Wed, Jan 30, 2019 at 12:17 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Wed, Jan 30, 2019 at 8:54 AM Ivan Babrou <ivan@cloudflare.com> wrote:
> >
> > Hey,
> >
> > We've upgraded some machines from 4.19.13 to 4.19.18 and some of them
> > crashed with the following:
> >
> > [ 2313.192006] general protection fault: 0000 [#1] SMP PTI
> > [ 2313.205924] CPU: 32 PID: 65437 Comm: nginx-fl Tainted: G
> > O      4.19.18-cloudflare-2019.1.8 #2019.1.8

"Tainted: GO" appears to mean that an out-of tree kernel module was
loaded. If so, information about that module and whether the crash
occurs when it hasn't been loaded might be of interest.
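
For readers who want to check this on their own logs, here is a hedged sketch (added for illustration) that decodes the taint letters on the "CPU:" line and lists modules carrying an (O) marker in the "Modules linked in:" output. The flag table is deliberately a small subset of the kernel's taint flags, and the function names are made up.

```python
import re

# Subset of kernel taint letters; anything else is reported as unknown
# to this sketch rather than guessed.
TAINT_FLAGS = {
    "G": "no proprietary module loaded",
    "P": "proprietary module loaded",
    "O": "out-of-tree (externally built) module loaded",
    "E": "unsigned module loaded",
    "W": "kernel warning issued earlier",
    "D": "kernel died recently (oops or BUG)",
}


def decode_taint(cpu_line: str) -> list[tuple[str, str]]:
    """Decode the letters after 'Tainted: ' on an oops CPU line.

    Returns an empty list for 'Not tainted'. Letters are read until the
    first token that is not purely upper-case letters (the kernel version).
    """
    _, found, rest = cpu_line.partition("Tainted: ")
    if not found:
        return []
    letters: list[str] = []
    for token in rest.split():
        if not (token.isalpha() and token.isupper()):
            break
        letters.extend(token)
    return [(c, TAINT_FLAGS.get(c, "not covered by this sketch")) for c in letters]


def out_of_tree_modules(modules_line: str) -> list[str]:
    """Return module names whose per-module flags include 'O', e.g. sfc(O)."""
    return [name for name, flags in re.findall(r"(\w+)\(([A-Z]+)\)", modules_line)
            if "O" in flags]


if __name__ == "__main__":
    cpu = ("[ 2313.205924] CPU: 32 PID: 65437 Comm: nginx-fl Tainted: G O      "
           "4.19.18-cloudflare-2019.1.8 #2019.1.8")
    mods = "[ 2313.952153]  sfc(O) glue_helper igb i2c_algo_bit ipmi_si mdio dca"
    print(decode_taint(cpu))
    print(out_of_tree_modules(mods))
```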

   - Lance

> > [ 2313.224973] Hardware name: Quanta Computer Inc. QuantaPlex
> > T41S-2U/S2S-MB, BIOS S2S_3B10.03 06/21/2018
> > [ 2313.243400] RIP: 0010:kmem_cache_alloc_node+0x178/0x1f0
>
> This looks more like an mm bug than a networking one.
>
> Also, it is always helpful if you can map the RIP to source code,
> using scripts/faddr2line or scripts/decode_stacktrace.sh.
>
>
> Thanks.
>


* Re: Crashes in skb clone/allocation in 4.19.18
  2019-01-30 16:51 Crashes in skb clone/allocation in 4.19.18 Ivan Babrou
  2019-01-30 17:00 ` Eric Dumazet
  2019-01-30 17:15 ` Cong Wang
@ 2019-01-30 17:33 ` Edward Cree
  2019-01-30 17:37   ` Edward Cree
  2 siblings, 1 reply; 7+ messages in thread
From: Edward Cree @ 2019-01-30 17:33 UTC (permalink / raw)
  To: Ivan Babrou, netdev
  Cc: David S. Miller, Eric Dumazet, Ignat Korchagin, Shawn Bohrer,
	Jakub Sitnicki

On 30/01/19 16:51, Ivan Babrou wrote:
> Hey,
>
> We've upgraded some machines from 4.19.13 to 4.19.18 and some of them
> crashed with the following:
>
> [ 2313.192006] general protection fault: 0000 [#1] SMP PTI
> [ 2313.205924] CPU: 32 PID: 65437 Comm: nginx-fl Tainted: G
> O      4.19.18-cloudflare-2019.1.8 #2019.1.8
> [ 2313.224973] Hardware name: Quanta Computer Inc. QuantaPlex
> T41S-2U/S2S-MB, BIOS S2S_3B10.03 06/21/2018
> [ 2313.243400] RIP: 0010:kmem_cache_alloc_node+0x178/0x1f0
> [ 2313.257768] Code: 89 fa 4c 89 f6 e8 68 40 a1 00 4c 8b 55 00 58 4d
> 85 d2 75 d6 e9 6f ff ff ff 41 8b 59 20 48 8d 4a 01 4c 89 f8 49 8b 39
> 4c 01 fb <48> 33 1b 49 33 99 38 01 00 00 65 48 0f c7 0f 0f 94 c0 84 c0
> 0f 84
> [ 2313.295550] RSP: 0000:ffff94457f903b48 EFLAGS: 00010202
> [ 2313.310352] RAX: 08b82daf1f57da0e RBX: 08b82daf1f57da0e RCX: 00000000005ff72d
> [ 2313.327189] RDX: 00000000005ff72c RSI: 0000000000480220 RDI: 0000000000026e40
> [ 2313.344029] RBP: ffff94457f04d680 R08: ffff94457f926e40 R09: ffff94457f04d680
> [ 2313.360912] R10: 000004ce652a0026 R11: 0000000000000000 R12: 0000000000480220
> [ 2313.377857] R13: 00000000ffffffff R14: ffffffffb1ab3ab7 R15: 08b82daf1f57da0e
> [ 2313.394820] FS:  00007fdea755c780(0000) GS:ffff94457f900000(0000)
> knlGS:0000000000000000
> [ 2313.412887] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2313.428581] CR2: 000055acc3cf517b CR3: 000000201b1ea003 CR4: 00000000003606e0
> [ 2313.445753] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 2313.462843] perf: interrupt took too long (8028 > 7291), lowering
> kernel.perf_event_max_sample_rate to 24000
> [ 2313.462867] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 2313.500216] Call Trace:
> [ 2313.512833]  <IRQ>
> [ 2313.524748]  __alloc_skb+0x57/0x1d0
> [ 2313.537934]  __tcp_send_ack.part.48+0x2f/0x100
> [ 2313.551845]  tcp_rcv_established+0x550/0x640
> [ 2313.565394]  tcp_v4_do_rcv+0x12a/0x1e0
> [ 2313.578322]  tcp_v4_rcv+0xadc/0xbd0
> [ 2313.590993]  ip_local_deliver_finish+0x5d/0x1d0
> [ 2313.604727]  ip_local_deliver+0x6b/0xe0
> [ 2313.617782]  ? ip_sublist_rcv+0x200/0x200
> [ 2313.630415] perf: interrupt took too long (10040 > 10035), lowering
> kernel.perf_event_max_sample_rate to 19000
> [ 2313.630948]  ip_rcv+0x52/0xd0
> [ 2313.662850]  ? ip_rcv_core.isra.22+0x2b0/0x2b0
> [ 2313.662857]  __netif_receive_skb_one_core+0x52/0x70
> [ 2313.690860]  netif_receive_skb_internal+0x34/0xe0
> [ 2313.690883]  efx_rx_deliver+0x11a/0x180 [sfc]
> [ 2313.717780]  ? __efx_rx_packet+0x1ef/0x730 [sfc]
> [ 2313.717786]  ? __queue_work+0x103/0x3e0
> [ 2313.743118]  ? efx_poll+0x35e/0x460 [sfc]
> [ 2313.743125]  ? net_rx_action+0x138/0x360
> [ 2313.767356]  ? __do_softirq+0xd8/0x2d2
> [ 2313.767362]  ? irq_exit+0xb4/0xc0
> [ 2313.790680]  ? do_IRQ+0x85/0xd0
> [ 2313.790688]  ? common_interrupt+0xf/0xf
> [ 2313.790694]  </IRQ>
Something odd is going on.  As far as I can tell from this call trace
 (which has some weirdness in it; any chance you could reproduce with
 frame pointers or a lower build optimisation level?) you're in the
 normal sfc receive path (under efx_process_channel(), although that's
 one of the functions that hasn't made it into the stack trace), which
 means you should have a channel->rx_list, and thus efx_rx_deliver()
 should be putting the packet on that list rather than calling
 netif_receive_skb().

I don't know how, or if, that could be related to the crash you're
 getting, but it might be worth looking into.
(It can't be the whole story, as your other crash is on an mlx5e and
 AFAIK they don't use list-RX yet.  Though, confusingly, an entry for
 ip_sublist_rcv still makes it into both stack traces.)

Maybe it's secondary damage from a wild pointer or other mm problem
 letting memory get scribbled on.
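
One observation that fits this, offered as a reading of the dump rather
 than a diagnosis: in both oopses the bytes at the faulting instruction,
 48 33 1b, decode to xor (%rbx),%rbx, and RBX holds a value that is not a
 canonical x86-64 address.  Dereferencing a non-canonical address raises
 a general protection fault rather than a page fault.  The Python sketch
 below checks this, assuming 48-bit virtual addresses (4-level paging);
 is_canonical is a hypothetical helper.

```python
def is_canonical(addr, va_bits=48):
    """True if bits 63..(va_bits-1) of a 64-bit address are all equal.

    va_bits=48 assumes 4-level paging; this is an assumption about the
    machines in the report, not something the oops states.
    """
    top = addr >> (va_bits - 1)                # bits 63..47 for va_bits=48
    all_ones = (1 << (64 - va_bits + 1)) - 1   # 17 one-bits for va_bits=48
    return top == 0 or top == all_ones


# RBX at the faulting instruction in the two oopses, and RBP for comparison.
samples = {
    "RBX (sfc oops)":  0x08b82daf1f57da0e,
    "RBX (mlx5 oops)": 0xf0382d8aebf1ae68,
    "RBP (sfc oops)":  0xffff94457f04d680,
}
for name, value in samples.items():
    kind = "canonical" if is_canonical(value) else "non-canonical"
    print(f"{name}: {value:#018x} {kind}")
```

A garbage value where a freelist pointer was expected would be
 consistent with something having overwritten slab memory, but the check
 alone cannot say what did the overwriting.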

-Ed
> [ 2313.823837] Modules linked in: tun xt_connlimit nf_conncount xt_bpf
> xt_hashlimit cls_flow cls_u32 sch_htb sch_fq md_mod dm_crypt
> algif_skcipher af_alg dm_mod dax ip6table_nat nf_nat_ipv6
> ip6table_mangle ip6table_security ip6table_raw xt_nat iptable_nat
> nf_nat_ipv4 nf_nat xt_TPROXY nf_tproxy_ipv6 nf_tproxy_ipv4 xt_connmark
> iptable_mangle xt_owner xt_CT xt_socket nf_socket_ipv4 nf_socket_ipv6
> iptable_raw ip6table_filter ip6_tables nfnetlink_log xt_NFLOG
> xt_tcpudp xt_comment xt_conntrack nf_conntrack nf_defrag_ipv6
> nf_defrag_ipv4 xt_mark xt_multiport xt_set iptable_filter bpfilter
> ip_set_hash_netport ip_set_hash_net ip_set_hash_ip ip_set nfnetlink
> 8021q garp mrp stp llc sb_edac x86_pkg_temp_thermal kvm_intel kvm
> irqbypass crc32_pclmul crc32c_intel pcbc aesni_intel aes_x86_64
> ipmi_ssif crypto_simd cryptd
> [ 2313.952153]  sfc(O) glue_helper igb i2c_algo_bit ipmi_si mdio dca
> ipmi_devintf ipmi_msghandler efivarfs ip_tables x_tables
> [ 2313.952238] ---[ end trace 477d8e3081c605f6 ]---
>
> Some nodes also crashed in skb_clone, rather than __alloc_skb:
>
> [ 3810.686137] general protection fault: 0000 [#1] SMP PTI
> [ 3810.694579] CPU: 64 PID: 69338 Comm: nginx-fl Not tainted
> 4.19.18-cloudflare-2019.1.8 #2019.1.8
> [ 3810.706589] Hardware name: Quanta Cloud Technology Inc. QuantaPlex
> T42S-2U(LBG-4) ^S5SZ090028/T42S-2U MB (Lewisburg-4), BIOS 3A11.Q10
> 06/29/2018
> [ 3810.726475] RIP: 0010:kmem_cache_alloc+0x89/0x1c0
> [ 3810.734701] Code: 82 72 49 83 78 10 00 4d 8b 30 0f 84 0e 01 00 00
> 4d 85 f6 0f 84 05 01 00 00 41 8b 5f 20 48 8d 4a 01 4c 89 f0 49 8b 3f
> 4c 01 f3 <48> 33 1b 49 33 9f 38 01 00 00 65 48 0f c7 0f 0f 94 c0 84 c0
> 74 b2
> [ 3810.761088] RSP: 0000:ffff99723fe03730 EFLAGS: 00010282
> [ 3810.770132] RAX: f0382d8aebf1ae68 RBX: f0382d8aebf1ae68 RCX: 0000000001cb61cf
> [ 3810.781105] RDX: 0000000001cb61ce RSI: 0000000000480020 RDI: 0000000000027550
> [ 3810.792012] RBP: ffff99723f19d500 R08: ffff99723fe27550 R09: 00000000000005dc
> [ 3810.802820] R10: ffff9992227c0000 R11: 0000000000004000 R12: 0000000000480020
> [ 3810.813589] R13: ffffffff8dcb5f7d R14: f0382d8aebf1ae68 R15: ffff99723f19d500
> [ 3810.824382] FS:  00007f2a8863c780(0000) GS:ffff99723fe00000(0000)
> knlGS:0000000000000000
> [ 3810.836189] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 3810.845662] CR2: 000055820762eecd CR3: 00000019eb850003 CR4: 00000000007606e0
> [ 3810.856567] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 3810.867600] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 3810.878554] PKRU: 55555554
> [ 3810.884787] Call Trace:
> [ 3810.890601]  <IRQ>
> [ 3810.896116]  skb_clone+0x4d/0xb0
> [ 3810.902712]  dev_queue_xmit_nit+0xd9/0x260
> [ 3810.910181]  dev_hard_start_xmit+0x69/0x1f0
> [ 3810.917784]  __dev_queue_xmit+0x6f7/0x8a0
> [ 3810.925172]  ? eth_header+0x26/0xc0
> [ 3810.932053]  ip_finish_output2+0x193/0x400
> [ 3810.939670]  ? ip_finish_output+0x139/0x270
> [ 3810.947241]  ip_output+0x6c/0xe0
> [ 3810.953844]  ? ip_append_data.part.51+0xc0/0xc0
> [ 3810.961802]  __tcp_transmit_skb+0x511/0xaa0
> [ 3810.969420]  __tcp_retransmit_skb+0x19c/0x7c0
> [ 3810.977209]  ? tcp_current_mss+0x57/0xa0
> [ 3810.984493]  tcp_retransmit_skb+0x12/0x80
> [ 3810.991894]  tcp_xmit_retransmit_queue.part.50+0x147/0x240
> [ 3811.000754]  tcp_ack+0x9c4/0x11b0
> [ 3811.007416]  tcp_rcv_established+0x190/0x640
> [ 3811.015065]  ? tcp_v4_inbound_md5_hash+0x69/0x160
> [ 3811.023106]  tcp_v4_do_rcv+0x12a/0x1e0
> [ 3811.030190]  tcp_v4_rcv+0xadc/0xbd0
> [ 3811.037009]  ip_local_deliver_finish+0x5d/0x1d0
> [ 3811.044859]  ip_local_deliver+0x6b/0xe0
> [ 3811.051999]  ? ip_sublist_rcv+0x200/0x200
> [ 3811.059325]  ip_rcv+0x52/0xd0
> [ 3811.065595]  ? ip_rcv_core.isra.22+0x2b0/0x2b0
> [ 3811.073361]  __netif_receive_skb_one_core+0x52/0x70
> [ 3811.081621]  netif_receive_skb_internal+0x34/0xe0
> [ 3811.089652]  napi_gro_receive+0xba/0xe0
> [ 3811.096969]  mlx5e_handle_rx_cqe+0x1eb/0x530 [mlx5_core]
> [ 3811.105545]  ? skb_release_head_state+0x5c/0xb0
> [ 3811.113447]  mlx5e_poll_rx_cq+0xc8/0x910 [mlx5_core]
> [ 3811.121652]  mlx5e_napi_poll+0xb1/0xc60 [mlx5_core]
> [ 3811.129574]  net_rx_action+0x138/0x360
> [ 3811.136266]  __do_softirq+0xd8/0x2d2
> [ 3811.142679]  irq_exit+0xb4/0xc0
> [ 3811.148578]  do_IRQ+0x85/0xd0
> [ 3811.154254]  common_interrupt+0xf/0xf
> [ 3811.160585]  </IRQ>
> [ 3811.165319] RIP: 0033:0x5581e1551ca0
> [ 3811.171546] Code: e8 10 41 ff 24 ee 81 7c ca 04 ff ff fe ff 0f 83
> 87 1c 00 00 8b 03 0f b6 cc 0f b6 e8 83 c3 04 c1 e8 10 41 ff 24 ee 48
> 8b 2c c2 <48> 89 2c ca 8b 03 0f b6 cc 0f b6 e8 83 c3 04 c1 e8 10 41 ff
> 24 ee
> [ 3811.195925] RSP: 002b:00007ffdd615ebc0 EFLAGS: 00000246 ORIG_RAX:
> ffffffffffffffde
> [ 3811.206319] RAX: 0000000000000000 RBX: 00000000406c9058 RCX: 000000000000000b
> [ 3811.216321] RDX: 000000004099cdc8 RSI: fffffffb40c07eb0 RDI: 000000004183d738
> [ 3811.226277] RBP: fffffff444c8c5c0 R08: 000000004099cdc8 R09: 00000000425ce3d8
> [ 3811.236340] R10: 0000000044c8c5c0 R11: 000000004139cbb0 R12: 0000000000000000
> [ 3811.246349] R13: 00005581ead6a9e0 R14: 000000004166afe8 R15: 00000000406c90f8
> [ 3811.256320] Modules linked in: tun xt_connlimit nf_conncount xt_bpf
> xt_hashlimit cls_flow cls_u32 sch_htb sch_fq md_mod dm_crypt
> algif_skcipher af_alg dm_mod dax ip6table_nat nf_nat_ipv6
> ip6table_mangle ip6table_security ip6table_raw ip6table_filter
> ip6_tables xt_nat iptable_nat nf_nat_ipv4 nf_nat xt_TPROXY
> nf_tproxy_ipv6 nf_tproxy_ipv4 xt_connmark iptable_mangle xt_owner
> xt_CT xt_socket nf_socket_ipv4 nf_socket_ipv6 iptable_raw
> nfnetlink_log xt_NFLOG xt_tcpudp xt_comment xt_conntrack nf_conntrack
> nf_defrag_ipv6 nf_defrag_ipv4 xt_mark xt_multiport xt_set
> iptable_filter bpfilter ip_set_hash_netport ip_set_hash_net
> ip_set_hash_ip ip_set nfnetlink 8021q garp mrp stp llc skx_edac
> x86_pkg_temp_thermal kvm_intel kvm irqbypass ipmi_ssif crc32_pclmul
> crc32c_intel pcbc aesni_intel aes_x86_64 crypto_simd mlx5_core
> [ 3811.351698]  cryptd xhci_pci tpm_crb mlxfw glue_helper ioatdma
> devlink ipmi_si xhci_hcd dca ipmi_devintf ipmi_msghandler tpm_tis
> tpm_tis_core tpm efivarfs ip_tables x_tables
> [ 3811.375161] ---[ end trace 1a7795bb39a63cf7 ]---
>
> Is this known? Could it be related to this commit:
>
> * https://github.com/torvalds/linux/commit/598e57e029290be3e7f8f87ff908091a5a22ed2f
>
> Thanks!



* Re: Crashes in skb clone/allocation in 4.19.18
  2019-01-30 17:28   ` Lance Richardson
@ 2019-01-30 17:34     ` Ivan Babrou
  0 siblings, 0 replies; 7+ messages in thread
From: Ivan Babrou @ 2019-01-30 17:34 UTC (permalink / raw)
  To: Lance Richardson
  Cc: Cong Wang, Linux Kernel Network Developers, David S. Miller,
	Eric Dumazet, Ignat Korchagin, Shawn Bohrer, Jakub Sitnicki

On Wed, Jan 30, 2019 at 9:28 AM Lance Richardson <lance604@gmail.com> wrote:
>
> On Wed, Jan 30, 2019 at 12:17 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > On Wed, Jan 30, 2019 at 8:54 AM Ivan Babrou <ivan@cloudflare.com> wrote:
> > >
> > > Hey,
> > >
> > > We've upgraded some machines from 4.19.13 to 4.19.18 and some of them
> > > crashed with the following:
> > >
> > > [ 2313.192006] general protection fault: 0000 [#1] SMP PTI
> > > [ 2313.205924] CPU: 32 PID: 65437 Comm: nginx-fl Tainted: G
> > > O      4.19.18-cloudflare-2019.1.8 #2019.1.8
>
> "Tainted: GO" appears to mean that an out-of-tree kernel module was
> loaded. If so, information about that module and whether the crash
> occurs when it hasn't been loaded might be of interest.

That module is the Solarflare NIC driver. On machines with the in-tree
Mellanox driver we've only seen skb_clone crashes.

>    - Lance
>
> > > [ 2313.224973] Hardware name: Quanta Computer Inc. QuantaPlex
> > > T41S-2U/S2S-MB, BIOS S2S_3B10.03 06/21/2018
> > > [ 2313.243400] RIP: 0010:kmem_cache_alloc_node+0x178/0x1f0
> >
> > This looks more like an mm bug than a networking one.
> >
> > Also, it is always helpful if you can map the RIP to source code,
> > using scripts/faddr2line or scripts/decode_stacktrace.sh.
> >
> >
> > Thanks.
> >


* Re: Crashes in skb clone/allocation in 4.19.18
  2019-01-30 17:33 ` Edward Cree
@ 2019-01-30 17:37   ` Edward Cree
  0 siblings, 0 replies; 7+ messages in thread
From: Edward Cree @ 2019-01-30 17:37 UTC (permalink / raw)
  To: Ivan Babrou, netdev
  Cc: David S. Miller, Eric Dumazet, Ignat Korchagin, Shawn Bohrer,
	Jakub Sitnicki

On 30/01/19 17:33, Edward Cree wrote:
> On 30/01/19 16:51, Ivan Babrou wrote:
>> Hey,
>>
>> We've upgraded some machines from 4.19.13 to 4.19.18 and some of them
>> crashed with the following:
>>
>> [ 2313.192006] general protection fault: 0000 [#1] SMP PTI
>> [ 2313.205924] CPU: 32 PID: 65437 Comm: nginx-fl Tainted: G
>> O      4.19.18-cloudflare-2019.1.8 #2019.1.8
>> [ 2313.224973] Hardware name: Quanta Computer Inc. QuantaPlex
>> T41S-2U/S2S-MB, BIOS S2S_3B10.03 06/21/2018
>> [ 2313.243400] RIP: 0010:kmem_cache_alloc_node+0x178/0x1f0
>> [ 2313.257768] Code: 89 fa 4c 89 f6 e8 68 40 a1 00 4c 8b 55 00 58 4d
>> 85 d2 75 d6 e9 6f ff ff ff 41 8b 59 20 48 8d 4a 01 4c 89 f8 49 8b 39
>> 4c 01 fb <48> 33 1b 49 33 99 38 01 00 00 65 48 0f c7 0f 0f 94 c0 84 c0
>> 0f 84
>> [ 2313.295550] RSP: 0000:ffff94457f903b48 EFLAGS: 00010202
>> [ 2313.310352] RAX: 08b82daf1f57da0e RBX: 08b82daf1f57da0e RCX: 00000000005ff72d
>> [ 2313.327189] RDX: 00000000005ff72c RSI: 0000000000480220 RDI: 0000000000026e40
>> [ 2313.344029] RBP: ffff94457f04d680 R08: ffff94457f926e40 R09: ffff94457f04d680
>> [ 2313.360912] R10: 000004ce652a0026 R11: 0000000000000000 R12: 0000000000480220
>> [ 2313.377857] R13: 00000000ffffffff R14: ffffffffb1ab3ab7 R15: 08b82daf1f57da0e
>> [ 2313.394820] FS:  00007fdea755c780(0000) GS:ffff94457f900000(0000)
>> knlGS:0000000000000000
>> [ 2313.412887] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 2313.428581] CR2: 000055acc3cf517b CR3: 000000201b1ea003 CR4: 00000000003606e0
>> [ 2313.445753] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 2313.462843] perf: interrupt took too long (8028 > 7291), lowering
>> kernel.perf_event_max_sample_rate to 24000
>> [ 2313.462867] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [ 2313.500216] Call Trace:
>> [ 2313.512833]  <IRQ>
>> [ 2313.524748]  __alloc_skb+0x57/0x1d0
>> [ 2313.537934]  __tcp_send_ack.part.48+0x2f/0x100
>> [ 2313.551845]  tcp_rcv_established+0x550/0x640
>> [ 2313.565394]  tcp_v4_do_rcv+0x12a/0x1e0
>> [ 2313.578322]  tcp_v4_rcv+0xadc/0xbd0
>> [ 2313.590993]  ip_local_deliver_finish+0x5d/0x1d0
>> [ 2313.604727]  ip_local_deliver+0x6b/0xe0
>> [ 2313.617782]  ? ip_sublist_rcv+0x200/0x200
>> [ 2313.630415] perf: interrupt took too long (10040 > 10035), lowering
>> kernel.perf_event_max_sample_rate to 19000
>> [ 2313.630948]  ip_rcv+0x52/0xd0
>> [ 2313.662850]  ? ip_rcv_core.isra.22+0x2b0/0x2b0
>> [ 2313.662857]  __netif_receive_skb_one_core+0x52/0x70
>> [ 2313.690860]  netif_receive_skb_internal+0x34/0xe0
>> [ 2313.690883]  efx_rx_deliver+0x11a/0x180 [sfc]
>> [ 2313.717780]  ? __efx_rx_packet+0x1ef/0x730 [sfc]
>> [ 2313.717786]  ? __queue_work+0x103/0x3e0
>> [ 2313.743118]  ? efx_poll+0x35e/0x460 [sfc]
>> [ 2313.743125]  ? net_rx_action+0x138/0x360
>> [ 2313.767356]  ? __do_softirq+0xd8/0x2d2
>> [ 2313.767362]  ? irq_exit+0xb4/0xc0
>> [ 2313.790680]  ? do_IRQ+0x85/0xd0
>> [ 2313.790688]  ? common_interrupt+0xf/0xf
>> [ 2313.790694]  </IRQ>
> Something odd is going on.  As far as I can tell from this call trace
>  (which has some weirdness in it; any chance you could reproduce with
>  frame pointers or a lower build optimisation level?) you're in the
>  normal sfc receive path (under efx_process_channel(), although that's
>  one of the functions that hasn't made it into the stack trace), which
>  means you should have a channel->rx_list, and thus efx_rx_deliver()
>  should be putting the packet on that list rather than calling
>  netif_receive_skb().
>
> I don't know how, or if, that could be related to the crash you're
>  getting, but it might be worth looking into.
> (It can't be the whole story, as your other crash is on a mlx5e and
>  AFAIK they don't use list-RX yet.  Though, confusingly, an entry for
>  ip_sublist_rcv still makes it into both stack traces.)
>
> Maybe it's secondary damage from a wild pointer or other mm problem
>  letting memory get scribbled on.
>
> -Ed
Aaaand as Lance has just pointed out, you're running the out-of-tree
 sfc driver, which doesn't have list RX yet.  Disregard the above.

-Ed


end of thread, other threads:[~2019-01-30 17:37 UTC | newest]

Thread overview: 7+ messages
2019-01-30 16:51 Crashes in skb clone/allocation in 4.19.18 Ivan Babrou
2019-01-30 17:00 ` Eric Dumazet
2019-01-30 17:15 ` Cong Wang
2019-01-30 17:28   ` Lance Richardson
2019-01-30 17:34     ` Ivan Babrou
2019-01-30 17:33 ` Edward Cree
2019-01-30 17:37   ` Edward Cree
