* BUG: unable to handle kernel paging request in fib6_node_lookup_1
@ 2018-09-05 6:11 Song Liu
2018-09-05 17:09 ` Wei Wang
2018-09-05 17:32 ` David Ahern
0 siblings, 2 replies; 5+ messages in thread
From: Song Liu @ 2018-09-05 6:11 UTC (permalink / raw)
To: Networking; +Cc: weiwan, David Ahern, Eric Dumazet
We are debugging an issue with fib6_node_lookup_1().
We use a 4.16 based kernel, and we have back ported most upstream
patches in ip6_fib.{c.h}. The only major differences I can spot are
8b7f2731bd68d83940714ce92381d1a72596407c
c3506372277779fccbffee2475400fcd689d5738
I guess the issue is not related to these two fixes.
After staring at the call trace and disassembly code (attached below)
I guess this is a use-after-free issue in (or right after) the lookup
loop:
for (;;) {
struct fib6_node *next;
dir = addr_bit_set(args->addr, fn->fn_bit);
next = dir ? rcu_dereference(fn->right) :
rcu_dereference(fn->left);
if (next) {
fn = next;
continue;
}
break;
}
I guess this probably also happens to latest upstream. I haven't
tested this with upstream kernel (or net tree) yet, because we
can only trigger this about once a week on 100 servers.
Does this look familiar? Any comments and/or suggestions are highly
appreciated.
Thanks,
Song
Bug stack trace:
[354764.457916] BUG: unable to handle kernel
[354764.466125] paging request
[354764.471720] at 00000000f60fc318
[354764.478360] IP: fib6_node_lookup_1+0x29/0x130
[354764.487249] PGD 800000010f725067
[354764.494062] P4D 800000010f725067
[354764.500878] PUD 0
[354764.505087] Oops: 0000 [#1] SMP PTI
[354764.512245] Modules linked in:
[354764.518536] udp_diag
[354764.523266] act_gact
[354764.527997] cls_bpf
[354764.532557] tcp_diag
[354764.537291] inet_diag
[354764.542200] nfsv3
[354764.546409] nfs
[354764.550273] fscache
[354764.554834] ip6table_raw
[354764.560260] ip6table_filter
[354764.566208] xt_DSCP
[354764.570765] iptable_raw
[354764.576020] iptable_filter
[354764.581790] ip6table_mangle
[354764.587738] iptable_mangle
[354764.593505] sb_edac
[354764.598058] x86_pkg_temp_thermal
[354764.604872] intel_powerclamp
[354764.610992] coretemp
[354764.615723] kvm_intel
[354764.620628] kvm
[354764.624494] irqbypass
[354764.629399] iTCO_wdt
[354764.634132] iTCO_vendor_support
[354764.640772] i2c_i801
[354764.645507] lpc_ich
[354764.650064] efivars
[354764.654619] mfd_core
[354764.659353] ipmi_si
[354764.663911] ipmi_devintf
[354764.669341] ipmi_msghandler
[354764.675281] acpi_cpufreq
[354764.680711] button
[354764.685096] sch_fq_codel
[354764.690520] nfsd
[354764.694557] nfs_acl
[354764.699118] lockd
[354764.703330] auth_rpcgss
[354764.708588] oid_registry
[354764.714006] grace
[354764.718213] sunrpc
[354764.722590] fuse
[354764.726626] loop
[354764.730661] efivarfs
[354764.735395] autofs4
[354764.739957] CPU: 5 PID: 3460038 Comm: java Not tainted 4.16.0-14_fbk2_1455_g6bcb99c57db6 #14
[354764.756996] Hardware name: Wiwynn Leopard-Orv2/Leopard-DDR BW, BIOS LBM03 06/02/2016
[354764.773001] RIP: 0010:fib6_node_lookup_1+0x29/0x130
[354764.782929] RSP: 0018:ffffc9003f0bb730 EFLAGS: 00010206
[354764.793557] RAX: ffff883fc131a000 RBX: 00000000f60fc300 RCX: 00000000ffffffe4
[354764.807999] RDX: 0000000000000010 RSI: 0000000000000001 RDI: ffffc9003f0bb8f0
[354764.822436] RBP: ffffc9003f0bb750 R08: 0000000000000002 R09: 0000000000000004
[354764.836877] R10: ffffc9003f0bb7a8 R11: ffff883ff7795780 R12: ffffffff82305080
[354764.851317] R13: 0000000000000002 R14: 0000000000000000 R15: 0000000000000000
[354764.865765] FS: 00007f8defcfc700(0000) GS:ffff881fff940000(0000) knlGS:0000000000000000
[354764.882119] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[354764.893800] CR2: 00000000f60fc318 CR3: 0000000f68cae006 CR4: 00000000003606e0
[354764.908235] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[354764.922671] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[354764.937109] Call Trace:
[354764.942195] fib6_node_lookup+0x67/0x90
[354764.950042] ? fib6_table_lookup+0x43/0x2f0
[354764.958587] fib6_table_lookup+0x43/0x2f0
[354764.966794] ip6_pol_route+0x43/0x360
[354764.974294] ? ip6_pol_route_input+0x20/0x20
[354764.983016] fib6_rule_lookup+0x85/0x140
[354764.991050] ? ip6t_do_table+0x331/0x6b0
[354764.999089] ? ip6_route_output_flags+0xa3/0xc0
[354765.008342] ip6_route_me_harder+0xab/0x280
[354765.016889] ip6table_mangle_hook+0xd4/0x110 [ip6table_mangle]
[354765.028754] ? nf_hook_slow+0x43/0xc0
[354765.036269] nf_hook_slow+0x43/0xc0
[354765.043445] nf_hook+0x6e/0xc0
[354765.049731] ? ac6_proc_exit+0x20/0x20
[354765.057412] ip6_xmit+0x28a/0x500
[354765.064225] ? ac6_proc_exit+0x20/0x20
[354765.071902] ? inet6_csk_route_socket+0x10f/0x1c0
[354765.081495] ? update_group_capacity+0x23/0x1e0
[354765.090749] inet6_csk_xmit+0x82/0xd0
[354765.098277] tcp_transmit_skb+0x51d/0x9d0
[354765.106495] tcp_write_xmit+0x1bd/0xf40
[354765.114359] ? _copy_from_iter_full+0x9c/0x240
[354765.123444] tcp_sendmsg_locked+0x2c2/0xdd0
[354765.131991] tcp_sendmsg+0x27/0x40
[354765.138991] sock_sendmsg+0x36/0x40
[354765.146167] sock_write_iter+0x84/0xd0
Disassemble of the fib6_node_lookup_1:
Dump of assembler code for function fib6_node_lookup_1:
0xffffffff818b3c70 <+0>: callq 0xffffffff81a01610 <__fentry__>
0xffffffff818b3c75 <+5>: mov (%rsi),%eax
0xffffffff818b3c77 <+7>: test %eax,%eax
0xffffffff818b3c79 <+9>: je 0xffffffff818b3d94 <fib6_node_lookup_1+292>
0xffffffff818b3c7f <+15>: push %r12
0xffffffff818b3c81 <+17>: push %rbp
0xffffffff818b3c82 <+18>: mov %rsi,%rbp
0xffffffff818b3c85 <+21>: push %rbx
0xffffffff818b3c86 <+22>: mov %rdi,%rbx
0xffffffff818b3c89 <+25>: mov 0x8(%rsi),%rdi
0xffffffff818b3c8d <+29>: mov $0x1,%esi
0xffffffff818b3c92 <+34>: movzwl 0x28(%rbx),%ecx
0xffffffff818b3c96 <+38>: mov %esi,%edx
0xffffffff818b3c98 <+40>: mov %ecx,%eax
0xffffffff818b3c9a <+42>: xor $0xffffffe7,%ecx
0xffffffff818b3c9d <+45>: sar $0x5,%eax
0xffffffff818b3ca0 <+48>: shl %cl,%edx
0xffffffff818b3ca2 <+50>: cltq
0xffffffff818b3ca4 <+52>: test %edx,(%rdi,%rax,4)
0xffffffff818b3ca7 <+55>: je 0xffffffff818b3cb7 <fib6_node_lookup_1+71>
0xffffffff818b3ca9 <+57>: mov 0x10(%rbx),%rax
0xffffffff818b3cad <+61>: test %rax,%rax
0xffffffff818b3cb0 <+64>: je 0xffffffff818b3cc0 <fib6_node_lookup_1+80>
0xffffffff818b3cb2 <+66>: mov %rax,%rbx
0xffffffff818b3cb5 <+69>: jmp 0xffffffff818b3c92 <fib6_node_lookup_1+34>
0xffffffff818b3cb7 <+71>: mov 0x8(%rbx),%rax
0xffffffff818b3cbb <+75>: test %rax,%rax
0xffffffff818b3cbe <+78>: jne 0xffffffff818b3cb2 <fib6_node_lookup_1+66>
0xffffffff818b3cc0 <+80>: test %rbx,%rbx
0xffffffff818b3cc3 <+83>: je 0xffffffff818b3d17 <fib6_node_lookup_1+167>
0xffffffff818b3cc5 <+85>: mov $0xffffffffffffffff,%r12
0xffffffff818b3ccc <+92>: jmp 0xffffffff818b3d02 <fib6_node_lookup_1+146>
0xffffffff818b3cce <+94>: mov 0x20(%rbx),%rax
0xffffffff818b3cd2 <+98>: test %rax,%rax
0xffffffff818b3cd5 <+101>: je 0xffffffff818b3cf2 <fib6_node_lookup_1+130>
0xffffffff818b3cd7 <+103>: movslq 0x0(%rbp),%rdx
0xffffffff818b3cdb <+107>: mov 0x8(%rbp),%rsi
0xffffffff818b3cdf <+111>: add %rdx,%rax
0xffffffff818b3ce2 <+114>: mov 0x10(%rax),%edx
0xffffffff818b3ce5 <+117>: cmp $0x3f,%edx
0xffffffff818b3ce8 <+120>: jbe 0xffffffff818b3d1e <fib6_node_lookup_1+174>
0xffffffff818b3cea <+122>: mov (%rsi),%rcx
0xffffffff818b3ced <+125>: cmp %rcx,(%rax)
0xffffffff818b3cf0 <+128>: je 0xffffffff818b3d52 <fib6_node_lookup_1+226>
0xffffffff818b3cf2 <+130>: movzwl 0x2a(%rbx),%eax
0xffffffff818b3cf6 <+134>: test $0x2,%al
0xffffffff818b3cf8 <+136>: jne 0xffffffff818b3d17 <fib6_node_lookup_1+167>
0xffffffff818b3cfa <+138>: mov (%rbx),%rbx
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: BUG: unable to handle kernel paging request in fib6_node_lookup_1
2018-09-05 6:11 BUG: unable to handle kernel paging request in fib6_node_lookup_1 Song Liu
@ 2018-09-05 17:09 ` Wei Wang
2018-09-05 18:10 ` Song Liu
2018-09-05 17:32 ` David Ahern
1 sibling, 1 reply; 5+ messages in thread
From: Wei Wang @ 2018-09-05 17:09 UTC (permalink / raw)
To: songliubraving; +Cc: Linux Kernel Network Developers, David Ahern, Eric Dumazet
On Tue, Sep 4, 2018 at 11:11 PM Song Liu <songliubraving@fb.com> wrote:
>
> We are debugging an issue with fib6_node_lookup_1().
>
> We use a 4.16 based kernel, and we have back ported most upstream
> patches in ip6_fib.{c.h}. The only major differences I can spot are
>
> 8b7f2731bd68d83940714ce92381d1a72596407c
> c3506372277779fccbffee2475400fcd689d5738
>
> I guess the issue is not related to these two fixes.
>
> After staring at the call trace and disassembly code (attached below)
> I guess this is a use-after-free issue in (or right after) the lookup
> loop:
>
> for (;;) {
> struct fib6_node *next;
>
> dir = addr_bit_set(args->addr, fn->fn_bit);
>
> next = dir ? rcu_dereference(fn->right) :
> rcu_dereference(fn->left);
>
> if (next) {
> fn = next;
> continue;
> }
> break;
> }
>
> I guess this probably also happens to latest upstream. I haven't
> tested this with upstream kernel (or net tree) yet, because we
> can only trigger this about once a week on 100 servers.
>
> Does this look familiar? Any comments and/or suggestions are highly
> appreciated.
>
By glancing at the commit logs, I don't think any changes were made
regarding the core logic of fib6_node handling recently.
(There were a couple of fixes regarding fib6_info but I don't think it
is the cause here... But it is still good to check if you have commit
9b0a8da8c4c6, e873e4b9cc7e, e70a3aad44cc in your build.)
I also went through the call path and did not find anything obviously wrong...
I think it's the best for you to reproduce it and we can debug further.
One question is, do you have "CONFIG_IPV6_SUBTREE" enabled and specify
src IP in the routing table?
Thanks.
Wei
> Thanks,
> Song
>
>
> Bug stack trace:
>
> [354764.457916] BUG: unable to handle kernel
> [354764.466125] paging request
> [354764.471720] at 00000000f60fc318
> [354764.478360] IP: fib6_node_lookup_1+0x29/0x130
> [354764.487249] PGD 800000010f725067
> [354764.494062] P4D 800000010f725067
> [354764.500878] PUD 0
> [354764.505087] Oops: 0000 [#1] SMP PTI
> [354764.512245] Modules linked in:
> [354764.518536] udp_diag
> [354764.523266] act_gact
> [354764.527997] cls_bpf
> [354764.532557] tcp_diag
> [354764.537291] inet_diag
> [354764.542200] nfsv3
> [354764.546409] nfs
> [354764.550273] fscache
> [354764.554834] ip6table_raw
> [354764.560260] ip6table_filter
> [354764.566208] xt_DSCP
> [354764.570765] iptable_raw
> [354764.576020] iptable_filter
> [354764.581790] ip6table_mangle
> [354764.587738] iptable_mangle
> [354764.593505] sb_edac
> [354764.598058] x86_pkg_temp_thermal
> [354764.604872] intel_powerclamp
> [354764.610992] coretemp
> [354764.615723] kvm_intel
> [354764.620628] kvm
> [354764.624494] irqbypass
> [354764.629399] iTCO_wdt
> [354764.634132] iTCO_vendor_support
> [354764.640772] i2c_i801
> [354764.645507] lpc_ich
> [354764.650064] efivars
> [354764.654619] mfd_core
> [354764.659353] ipmi_si
> [354764.663911] ipmi_devintf
> [354764.669341] ipmi_msghandler
> [354764.675281] acpi_cpufreq
> [354764.680711] button
> [354764.685096] sch_fq_codel
> [354764.690520] nfsd
> [354764.694557] nfs_acl
> [354764.699118] lockd
> [354764.703330] auth_rpcgss
> [354764.708588] oid_registry
> [354764.714006] grace
> [354764.718213] sunrpc
> [354764.722590] fuse
> [354764.726626] loop
> [354764.730661] efivarfs
> [354764.735395] autofs4
> [354764.739957] CPU: 5 PID: 3460038 Comm: java Not tainted 4.16.0-14_fbk2_1455_g6bcb99c57db6 #14
> [354764.756996] Hardware name: Wiwynn Leopard-Orv2/Leopard-DDR BW, BIOS LBM03 06/02/2016
> [354764.773001] RIP: 0010:fib6_node_lookup_1+0x29/0x130
> [354764.782929] RSP: 0018:ffffc9003f0bb730 EFLAGS: 00010206
> [354764.793557] RAX: ffff883fc131a000 RBX: 00000000f60fc300 RCX: 00000000ffffffe4
> [354764.807999] RDX: 0000000000000010 RSI: 0000000000000001 RDI: ffffc9003f0bb8f0
> [354764.822436] RBP: ffffc9003f0bb750 R08: 0000000000000002 R09: 0000000000000004
> [354764.836877] R10: ffffc9003f0bb7a8 R11: ffff883ff7795780 R12: ffffffff82305080
> [354764.851317] R13: 0000000000000002 R14: 0000000000000000 R15: 0000000000000000
> [354764.865765] FS: 00007f8defcfc700(0000) GS:ffff881fff940000(0000) knlGS:0000000000000000
> [354764.882119] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [354764.893800] CR2: 00000000f60fc318 CR3: 0000000f68cae006 CR4: 00000000003606e0
> [354764.908235] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [354764.922671] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [354764.937109] Call Trace:
> [354764.942195] fib6_node_lookup+0x67/0x90
> [354764.950042] ? fib6_table_lookup+0x43/0x2f0
> [354764.958587] fib6_table_lookup+0x43/0x2f0
> [354764.966794] ip6_pol_route+0x43/0x360
> [354764.974294] ? ip6_pol_route_input+0x20/0x20
> [354764.983016] fib6_rule_lookup+0x85/0x140
> [354764.991050] ? ip6t_do_table+0x331/0x6b0
> [354764.999089] ? ip6_route_output_flags+0xa3/0xc0
> [354765.008342] ip6_route_me_harder+0xab/0x280
> [354765.016889] ip6table_mangle_hook+0xd4/0x110 [ip6table_mangle]
> [354765.028754] ? nf_hook_slow+0x43/0xc0
> [354765.036269] nf_hook_slow+0x43/0xc0
> [354765.043445] nf_hook+0x6e/0xc0
> [354765.049731] ? ac6_proc_exit+0x20/0x20
> [354765.057412] ip6_xmit+0x28a/0x500
> [354765.064225] ? ac6_proc_exit+0x20/0x20
> [354765.071902] ? inet6_csk_route_socket+0x10f/0x1c0
> [354765.081495] ? update_group_capacity+0x23/0x1e0
> [354765.090749] inet6_csk_xmit+0x82/0xd0
> [354765.098277] tcp_transmit_skb+0x51d/0x9d0
> [354765.106495] tcp_write_xmit+0x1bd/0xf40
> [354765.114359] ? _copy_from_iter_full+0x9c/0x240
> [354765.123444] tcp_sendmsg_locked+0x2c2/0xdd0
> [354765.131991] tcp_sendmsg+0x27/0x40
> [354765.138991] sock_sendmsg+0x36/0x40
> [354765.146167] sock_write_iter+0x84/0xd0
>
>
> Disassemble of the fib6_node_lookup_1:
> Dump of assembler code for function fib6_node_lookup_1:
> 0xffffffff818b3c70 <+0>: callq 0xffffffff81a01610 <__fentry__>
> 0xffffffff818b3c75 <+5>: mov (%rsi),%eax
> 0xffffffff818b3c77 <+7>: test %eax,%eax
> 0xffffffff818b3c79 <+9>: je 0xffffffff818b3d94 <fib6_node_lookup_1+292>
> 0xffffffff818b3c7f <+15>: push %r12
> 0xffffffff818b3c81 <+17>: push %rbp
> 0xffffffff818b3c82 <+18>: mov %rsi,%rbp
> 0xffffffff818b3c85 <+21>: push %rbx
> 0xffffffff818b3c86 <+22>: mov %rdi,%rbx
> 0xffffffff818b3c89 <+25>: mov 0x8(%rsi),%rdi
> 0xffffffff818b3c8d <+29>: mov $0x1,%esi
> 0xffffffff818b3c92 <+34>: movzwl 0x28(%rbx),%ecx
> 0xffffffff818b3c96 <+38>: mov %esi,%edx
> 0xffffffff818b3c98 <+40>: mov %ecx,%eax
> 0xffffffff818b3c9a <+42>: xor $0xffffffe7,%ecx
> 0xffffffff818b3c9d <+45>: sar $0x5,%eax
> 0xffffffff818b3ca0 <+48>: shl %cl,%edx
> 0xffffffff818b3ca2 <+50>: cltq
> 0xffffffff818b3ca4 <+52>: test %edx,(%rdi,%rax,4)
> 0xffffffff818b3ca7 <+55>: je 0xffffffff818b3cb7 <fib6_node_lookup_1+71>
> 0xffffffff818b3ca9 <+57>: mov 0x10(%rbx),%rax
> 0xffffffff818b3cad <+61>: test %rax,%rax
> 0xffffffff818b3cb0 <+64>: je 0xffffffff818b3cc0 <fib6_node_lookup_1+80>
> 0xffffffff818b3cb2 <+66>: mov %rax,%rbx
> 0xffffffff818b3cb5 <+69>: jmp 0xffffffff818b3c92 <fib6_node_lookup_1+34>
> 0xffffffff818b3cb7 <+71>: mov 0x8(%rbx),%rax
> 0xffffffff818b3cbb <+75>: test %rax,%rax
> 0xffffffff818b3cbe <+78>: jne 0xffffffff818b3cb2 <fib6_node_lookup_1+66>
> 0xffffffff818b3cc0 <+80>: test %rbx,%rbx
> 0xffffffff818b3cc3 <+83>: je 0xffffffff818b3d17 <fib6_node_lookup_1+167>
> 0xffffffff818b3cc5 <+85>: mov $0xffffffffffffffff,%r12
> 0xffffffff818b3ccc <+92>: jmp 0xffffffff818b3d02 <fib6_node_lookup_1+146>
> 0xffffffff818b3cce <+94>: mov 0x20(%rbx),%rax
> 0xffffffff818b3cd2 <+98>: test %rax,%rax
> 0xffffffff818b3cd5 <+101>: je 0xffffffff818b3cf2 <fib6_node_lookup_1+130>
> 0xffffffff818b3cd7 <+103>: movslq 0x0(%rbp),%rdx
> 0xffffffff818b3cdb <+107>: mov 0x8(%rbp),%rsi
> 0xffffffff818b3cdf <+111>: add %rdx,%rax
> 0xffffffff818b3ce2 <+114>: mov 0x10(%rax),%edx
> 0xffffffff818b3ce5 <+117>: cmp $0x3f,%edx
> 0xffffffff818b3ce8 <+120>: jbe 0xffffffff818b3d1e <fib6_node_lookup_1+174>
> 0xffffffff818b3cea <+122>: mov (%rsi),%rcx
> 0xffffffff818b3ced <+125>: cmp %rcx,(%rax)
> 0xffffffff818b3cf0 <+128>: je 0xffffffff818b3d52 <fib6_node_lookup_1+226>
> 0xffffffff818b3cf2 <+130>: movzwl 0x2a(%rbx),%eax
> 0xffffffff818b3cf6 <+134>: test $0x2,%al
> 0xffffffff818b3cf8 <+136>: jne 0xffffffff818b3d17 <fib6_node_lookup_1+167>
> 0xffffffff818b3cfa <+138>: mov (%rbx),%rbx
>
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: BUG: unable to handle kernel paging request in fib6_node_lookup_1
2018-09-05 6:11 BUG: unable to handle kernel paging request in fib6_node_lookup_1 Song Liu
2018-09-05 17:09 ` Wei Wang
@ 2018-09-05 17:32 ` David Ahern
2018-09-05 18:12 ` Song Liu
1 sibling, 1 reply; 5+ messages in thread
From: David Ahern @ 2018-09-05 17:32 UTC (permalink / raw)
To: Song Liu, Networking; +Cc: weiwan, Eric Dumazet
On 9/5/18 12:11 AM, Song Liu wrote:
> We are debugging an issue with fib6_node_lookup_1().
>
> We use a 4.16 based kernel, and we have back ported most upstream
> patches in ip6_fib.{c.h}. The only major differences I can spot are
>
Did you backport all patches in each set that included a change to those
files, or just the patches to ip6_fib.* and any dependencies?
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: BUG: unable to handle kernel paging request in fib6_node_lookup_1
2018-09-05 17:09 ` Wei Wang
@ 2018-09-05 18:10 ` Song Liu
0 siblings, 0 replies; 5+ messages in thread
From: Song Liu @ 2018-09-05 18:10 UTC (permalink / raw)
To: Wei Wang; +Cc: Linux Kernel Network Developers, David Ahern, Eric Dumazet
> On Sep 5, 2018, at 10:09 AM, Wei Wang <weiwan@google.com> wrote:
>
> On Tue, Sep 4, 2018 at 11:11 PM Song Liu <songliubraving@fb.com> wrote:
>>
>> We are debugging an issue with fib6_node_lookup_1().
>>
>> We use a 4.16 based kernel, and we have back ported most upstream
>> patches in ip6_fib.{c.h}. The only major differences I can spot are
>>
>> 8b7f2731bd68d83940714ce92381d1a72596407c
>> c3506372277779fccbffee2475400fcd689d5738
>>
>> I guess the issue is not related to these two fixes.
>>
>> After staring at the call trace and disassembly code (attached below)
>> I guess this is a use-after-free issue in (or right after) the lookup
>> loop:
>>
>> for (;;) {
>> struct fib6_node *next;
>>
>> dir = addr_bit_set(args->addr, fn->fn_bit);
>>
>> next = dir ? rcu_dereference(fn->right) :
>> rcu_dereference(fn->left);
>>
>> if (next) {
>> fn = next;
>> continue;
>> }
>> break;
>> }
>>
>> I guess this probably also happens to latest upstream. I haven't
>> tested this with upstream kernel (or net tree) yet, because we
>> can only trigger this about once a week on 100 servers.
>>
>> Does this look familiar? Any comments and/or suggestions are highly
>> appreciated.
>>
> By glancing at the commit logs, I don't think any changes were made
> regarding the core logic of fib6_node handling recently.
> (There were a couple of fixes regarding fib6_info but I don't think it
> is the cause here... But it is still good to check if you have commit
> 9b0a8da8c4c6, e873e4b9cc7e, e70a3aad44cc in your build.)
Looks like we don't have e70a3aad44cc. I think it fixes a memory leak
(instead of a use-after-free)? Let me add it and run some tests anyway.
Thanks a lot for this information.
>
> I also went through the call path and did not find anything obviously wrong...
> I think it's the best for you to reproduce it and we can debug further.
> One question is, do you have "CONFIG_IPV6_SUBTREE" enabled and specify
> src IP in the routing table?
We do have CONFIG_IPV6_SUBTREE enabled. But we usually do not specify
src IP in the routing table.
Let me try to reproduce it.
Thanks again,
Song
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: BUG: unable to handle kernel paging request in fib6_node_lookup_1
2018-09-05 17:32 ` David Ahern
@ 2018-09-05 18:12 ` Song Liu
0 siblings, 0 replies; 5+ messages in thread
From: Song Liu @ 2018-09-05 18:12 UTC (permalink / raw)
To: David Ahern; +Cc: Networking, weiwan, Eric Dumazet
> On Sep 5, 2018, at 10:32 AM, David Ahern <dsahern@gmail.com> wrote:
>
> On 9/5/18 12:11 AM, Song Liu wrote:
>> We are debugging an issue with fib6_node_lookup_1().
>>
>> We use a 4.16 based kernel, and we have back ported most upstream
>> patches in ip6_fib.{c.h}. The only major differences I can spot are
>>
>
> Did you backport all patches in each set that included a change to those
> files, or just the patches to ip6_fib.* and any dependencies?
I believe we always try back port all patches in each set. But we have back
ported hundreds of patches to our 4.16 tree, so it is also likely we missed
some useful patches.
Thanks,
Song
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2018-09-05 22:44 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-05 6:11 BUG: unable to handle kernel paging request in fib6_node_lookup_1 Song Liu
2018-09-05 17:09 ` Wei Wang
2018-09-05 18:10 ` Song Liu
2018-09-05 17:32 ` David Ahern
2018-09-05 18:12 ` Song Liu
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.