* BUG: unable to handle kernel paging request in fib6_node_lookup_1
From: Song Liu @ 2018-09-05  6:11 UTC
  To: Networking; +Cc: weiwan, David Ahern, Eric Dumazet

We are debugging an issue with fib6_node_lookup_1(). 

We use a 4.16-based kernel, and we have backported most upstream
patches in ip6_fib.{c,h}. The only major differences I can spot are:

8b7f2731bd68d83940714ce92381d1a72596407c
c3506372277779fccbffee2475400fcd689d5738

I guess the issue is not related to these two fixes. 

After staring at the call trace and disassembly code (attached below) 
I guess this is a use-after-free issue in (or right after) the lookup 
loop:

        for (;;) {
                struct fib6_node *next;

                dir = addr_bit_set(args->addr, fn->fn_bit);

                next = dir ? rcu_dereference(fn->right) :
                             rcu_dereference(fn->left);

                if (next) {
                        fn = next;
                        continue;
                }
                break;
        }
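
For readers who want to experiment with the loop outside the kernel, here is a minimal userspace sketch of the same descent. It is a simplified model and not the kernel code: struct model_node, model_addr_bit_set() and model_descend() are hypothetical names, RCU is left out entirely, and the bit test indexes bytes rather than 32-bit words. It only illustrates that every node reached through a child pointer is dereferenced on the following iteration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct fib6_node: only the fields the loop reads. */
struct model_node {
	struct model_node *left;
	struct model_node *right;
	uint16_t fn_bit;	/* index of the address bit tested at this node */
};

/*
 * Return 1 if bit 'bit' of a 128-bit address is set. The address is in
 * network byte order, so bit 0 is the most significant bit of addr[0].
 * Out-of-range indexes return 0; that guard is a choice of this model.
 */
static int model_addr_bit_set(const uint8_t addr[16], unsigned int bit)
{
	if (bit >= 128)
		return 0;
	return (addr[bit >> 3] >> (7 - (bit & 7))) & 1;
}

/*
 * Descend until there is no child in the direction selected by the address.
 * A bad child pointer is not noticed when it is loaded, only when the next
 * iteration reads fn->fn_bit through it.
 */
static struct model_node *model_descend(struct model_node *fn,
					const uint8_t addr[16])
{
	assert(fn != NULL);
	for (;;) {
		struct model_node *next;

		next = model_addr_bit_set(addr, fn->fn_bit) ? fn->right
							    : fn->left;
		if (!next)
			break;
		fn = next;
	}
	return fn;
}
```

With a root that tests bit 0 and two children, an address whose first byte is 0x80 ends at the right child and one whose first byte is 0x7f ends at the left child.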

I suspect this also happens on the latest upstream. I haven't
tested this with an upstream kernel (or the net tree) yet, because we
can only trigger it about once a week across 100 servers.

Does this look familiar? Any comments and/or suggestions are highly
appreciated. 

Thanks, 
Song


Bug stack trace:

[354764.457916] BUG: unable to handle kernel
[354764.466125] paging request
[354764.471720]  at 00000000f60fc318
[354764.478360] IP: fib6_node_lookup_1+0x29/0x130
[354764.487249] PGD 800000010f725067
[354764.494062] P4D 800000010f725067
[354764.500878] PUD 0
[354764.505087] Oops: 0000 [#1] SMP PTI
[354764.512245] Modules linked in:
[354764.518536]  udp_diag
[354764.523266]  act_gact
[354764.527997]  cls_bpf
[354764.532557]  tcp_diag
[354764.537291]  inet_diag
[354764.542200]  nfsv3
[354764.546409]  nfs
[354764.550273]  fscache
[354764.554834]  ip6table_raw
[354764.560260]  ip6table_filter
[354764.566208]  xt_DSCP
[354764.570765]  iptable_raw
[354764.576020]  iptable_filter
[354764.581790]  ip6table_mangle
[354764.587738]  iptable_mangle
[354764.593505]  sb_edac
[354764.598058]  x86_pkg_temp_thermal
[354764.604872]  intel_powerclamp
[354764.610992]  coretemp
[354764.615723]  kvm_intel
[354764.620628]  kvm
[354764.624494]  irqbypass
[354764.629399]  iTCO_wdt
[354764.634132]  iTCO_vendor_support
[354764.640772]  i2c_i801
[354764.645507]  lpc_ich
[354764.650064]  efivars
[354764.654619]  mfd_core
[354764.659353]  ipmi_si
[354764.663911]  ipmi_devintf
[354764.669341]  ipmi_msghandler
[354764.675281]  acpi_cpufreq
[354764.680711]  button
[354764.685096]  sch_fq_codel
[354764.690520]  nfsd
[354764.694557]  nfs_acl
[354764.699118]  lockd
[354764.703330]  auth_rpcgss
[354764.708588]  oid_registry
[354764.714006]  grace
[354764.718213]  sunrpc
[354764.722590]  fuse
[354764.726626]  loop
[354764.730661]  efivarfs
[354764.735395]  autofs4
[354764.739957] CPU: 5 PID: 3460038 Comm: java Not tainted 4.16.0-14_fbk2_1455_g6bcb99c57db6 #14
[354764.756996] Hardware name: Wiwynn Leopard-Orv2/Leopard-DDR BW, BIOS LBM03   06/02/2016
[354764.773001] RIP: 0010:fib6_node_lookup_1+0x29/0x130
[354764.782929] RSP: 0018:ffffc9003f0bb730 EFLAGS: 00010206
[354764.793557] RAX: ffff883fc131a000 RBX: 00000000f60fc300 RCX: 00000000ffffffe4
[354764.807999] RDX: 0000000000000010 RSI: 0000000000000001 RDI: ffffc9003f0bb8f0
[354764.822436] RBP: ffffc9003f0bb750 R08: 0000000000000002 R09: 0000000000000004
[354764.836877] R10: ffffc9003f0bb7a8 R11: ffff883ff7795780 R12: ffffffff82305080
[354764.851317] R13: 0000000000000002 R14: 0000000000000000 R15: 0000000000000000
[354764.865765] FS:  00007f8defcfc700(0000) GS:ffff881fff940000(0000) knlGS:0000000000000000
[354764.882119] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[354764.893800] CR2: 00000000f60fc318 CR3: 0000000f68cae006 CR4: 00000000003606e0
[354764.908235] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[354764.922671] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[354764.937109] Call Trace:
[354764.942195]  fib6_node_lookup+0x67/0x90
[354764.950042]  ? fib6_table_lookup+0x43/0x2f0
[354764.958587]  fib6_table_lookup+0x43/0x2f0
[354764.966794]  ip6_pol_route+0x43/0x360
[354764.974294]  ? ip6_pol_route_input+0x20/0x20
[354764.983016]  fib6_rule_lookup+0x85/0x140
[354764.991050]  ? ip6t_do_table+0x331/0x6b0
[354764.999089]  ? ip6_route_output_flags+0xa3/0xc0
[354765.008342]  ip6_route_me_harder+0xab/0x280
[354765.016889]  ip6table_mangle_hook+0xd4/0x110 [ip6table_mangle]
[354765.028754]  ? nf_hook_slow+0x43/0xc0
[354765.036269]  nf_hook_slow+0x43/0xc0
[354765.043445]  nf_hook+0x6e/0xc0
[354765.049731]  ? ac6_proc_exit+0x20/0x20
[354765.057412]  ip6_xmit+0x28a/0x500
[354765.064225]  ? ac6_proc_exit+0x20/0x20
[354765.071902]  ? inet6_csk_route_socket+0x10f/0x1c0
[354765.081495]  ? update_group_capacity+0x23/0x1e0
[354765.090749]  inet6_csk_xmit+0x82/0xd0
[354765.098277]  tcp_transmit_skb+0x51d/0x9d0
[354765.106495]  tcp_write_xmit+0x1bd/0xf40
[354765.114359]  ? _copy_from_iter_full+0x9c/0x240
[354765.123444]  tcp_sendmsg_locked+0x2c2/0xdd0
[354765.131991]  tcp_sendmsg+0x27/0x40
[354765.138991]  sock_sendmsg+0x36/0x40
[354765.146167]  sock_write_iter+0x84/0xd0
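
As a reading aid for the numbers in the oops above, the following sketch decodes the page-fault error code and computes the distance between the faulting address and a register value. The bit meanings are the architectural x86 ones; the macro names, struct pf_info, decode_pf() and fault_offset() are hypothetical helpers for this illustration. Treating RBX as the base register is an assumption, suggested only by its value sitting just below CR2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* x86 page-fault error code bits (architectural); macro names are local. */
#define PF_BIT_PROT  (1u << 0)	/* 0: page not present, 1: protection violation */
#define PF_BIT_WRITE (1u << 1)	/* 0: read access, 1: write access */
#define PF_BIT_USER  (1u << 2)	/* 0: kernel-mode access, 1: user-mode access */

struct pf_info {
	bool not_present;
	bool write;
	bool user;
};

/* Split the error code printed after "Oops:" into its three low bits. */
static struct pf_info decode_pf(uint32_t error_code)
{
	struct pf_info info = {
		.not_present = !(error_code & PF_BIT_PROT),
		.write = (error_code & PF_BIT_WRITE) != 0,
		.user = (error_code & PF_BIT_USER) != 0,
	};
	return info;
}

/* Byte offset of the faulting address from an assumed base register. */
static uint64_t fault_offset(uint64_t cr2, uint64_t base)
{
	assert(cr2 >= base);
	return cr2 - base;
}
```

For this report, decode_pf(0x0000) describes a kernel-mode read of a not-present page, and fault_offset(0x00000000f60fc318, 0x00000000f60fc300) is 0x18.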


Disassembly of fib6_node_lookup_1:
Dump of assembler code for function fib6_node_lookup_1:
   0xffffffff818b3c70 <+0>:     callq  0xffffffff81a01610 <__fentry__>
   0xffffffff818b3c75 <+5>:     mov    (%rsi),%eax
   0xffffffff818b3c77 <+7>:     test   %eax,%eax
   0xffffffff818b3c79 <+9>:     je     0xffffffff818b3d94 <fib6_node_lookup_1+292>
   0xffffffff818b3c7f <+15>:    push   %r12
   0xffffffff818b3c81 <+17>:    push   %rbp
   0xffffffff818b3c82 <+18>:    mov    %rsi,%rbp
   0xffffffff818b3c85 <+21>:    push   %rbx
   0xffffffff818b3c86 <+22>:    mov    %rdi,%rbx
   0xffffffff818b3c89 <+25>:    mov    0x8(%rsi),%rdi
   0xffffffff818b3c8d <+29>:    mov    $0x1,%esi
   0xffffffff818b3c92 <+34>:    movzwl 0x28(%rbx),%ecx
   0xffffffff818b3c96 <+38>:    mov    %esi,%edx
   0xffffffff818b3c98 <+40>:    mov    %ecx,%eax
   0xffffffff818b3c9a <+42>:    xor    $0xffffffe7,%ecx
   0xffffffff818b3c9d <+45>:    sar    $0x5,%eax
   0xffffffff818b3ca0 <+48>:    shl    %cl,%edx
   0xffffffff818b3ca2 <+50>:    cltq
   0xffffffff818b3ca4 <+52>:    test   %edx,(%rdi,%rax,4)
   0xffffffff818b3ca7 <+55>:    je     0xffffffff818b3cb7 <fib6_node_lookup_1+71>
   0xffffffff818b3ca9 <+57>:    mov    0x10(%rbx),%rax
   0xffffffff818b3cad <+61>:    test   %rax,%rax
   0xffffffff818b3cb0 <+64>:    je     0xffffffff818b3cc0 <fib6_node_lookup_1+80>
   0xffffffff818b3cb2 <+66>:    mov    %rax,%rbx
   0xffffffff818b3cb5 <+69>:    jmp    0xffffffff818b3c92 <fib6_node_lookup_1+34>
   0xffffffff818b3cb7 <+71>:    mov    0x8(%rbx),%rax
   0xffffffff818b3cbb <+75>:    test   %rax,%rax
   0xffffffff818b3cbe <+78>:    jne    0xffffffff818b3cb2 <fib6_node_lookup_1+66>
   0xffffffff818b3cc0 <+80>:    test   %rbx,%rbx
   0xffffffff818b3cc3 <+83>:    je     0xffffffff818b3d17 <fib6_node_lookup_1+167>
   0xffffffff818b3cc5 <+85>:    mov    $0xffffffffffffffff,%r12
   0xffffffff818b3ccc <+92>:    jmp    0xffffffff818b3d02 <fib6_node_lookup_1+146>
   0xffffffff818b3cce <+94>:    mov    0x20(%rbx),%rax
   0xffffffff818b3cd2 <+98>:    test   %rax,%rax
   0xffffffff818b3cd5 <+101>:   je     0xffffffff818b3cf2 <fib6_node_lookup_1+130>
   0xffffffff818b3cd7 <+103>:   movslq 0x0(%rbp),%rdx
   0xffffffff818b3cdb <+107>:   mov    0x8(%rbp),%rsi
   0xffffffff818b3cdf <+111>:   add    %rdx,%rax
   0xffffffff818b3ce2 <+114>:   mov    0x10(%rax),%edx
   0xffffffff818b3ce5 <+117>:   cmp    $0x3f,%edx
   0xffffffff818b3ce8 <+120>:   jbe    0xffffffff818b3d1e <fib6_node_lookup_1+174>
   0xffffffff818b3cea <+122>:   mov    (%rsi),%rcx
   0xffffffff818b3ced <+125>:   cmp    %rcx,(%rax)
   0xffffffff818b3cf0 <+128>:   je     0xffffffff818b3d52 <fib6_node_lookup_1+226>
   0xffffffff818b3cf2 <+130>:   movzwl 0x2a(%rbx),%eax
   0xffffffff818b3cf6 <+134>:   test   $0x2,%al
   0xffffffff818b3cf8 <+136>:   jne    0xffffffff818b3d17 <fib6_node_lookup_1+167>
   0xffffffff818b3cfa <+138>:   mov    (%rbx),%rbx

* Re: BUG: unable to handle kernel paging request in fib6_node_lookup_1
From: Wei Wang @ 2018-09-05 17:09 UTC
  To: songliubraving; +Cc: Linux Kernel Network Developers, David Ahern, Eric Dumazet

On Tue, Sep 4, 2018 at 11:11 PM Song Liu <songliubraving@fb.com> wrote:
>
> We are debugging an issue with fib6_node_lookup_1().
>
> We use a 4.16 based kernel, and we have back ported most upstream
> patches in ip6_fib.{c,h}. The only major differences I can spot are
>
> 8b7f2731bd68d83940714ce92381d1a72596407c
> c3506372277779fccbffee2475400fcd689d5738
>
> I guess the issue is not related to these two fixes.
>
> After staring at the call trace and disassembly code (attached below)
> I guess this is a use-after-free issue in (or right after) the lookup
> loop:
>
>         for (;;) {
>                 struct fib6_node *next;
>
>                 dir = addr_bit_set(args->addr, fn->fn_bit);
>
>                 next = dir ? rcu_dereference(fn->right) :
>                              rcu_dereference(fn->left);
>
>                 if (next) {
>                         fn = next;
>                         continue;
>                 }
>                 break;
>         }
>
> I guess this probably also happens to latest upstream. I haven't
> tested this with upstream kernel (or net tree) yet, because we
> can only trigger this about once a week on 100 servers.
>
> Does this look familiar? Any comments and/or suggestions are highly
> appreciated.
>
From a quick look at the commit logs, I don't think any recent changes
touched the core logic of fib6_node handling.
(There were a couple of fixes regarding fib6_info, but I don't think they
are the cause here. It is still worth checking whether you have commits
9b0a8da8c4c6, e873e4b9cc7e, and e70a3aad44cc in your build.)

I also went through the call path and did not find anything obviously wrong.
I think it's best for you to reproduce it so that we can debug further.
One question: do you have CONFIG_IPV6_SUBTREE enabled, and do you specify
source IPs in the routing table?

Thanks.
Wei

> Thanks,
> Song

* Re: BUG: unable to handle kernel paging request in fib6_node_lookup_1
From: David Ahern @ 2018-09-05 17:32 UTC
  To: Song Liu, Networking; +Cc: weiwan, Eric Dumazet

On 9/5/18 12:11 AM, Song Liu wrote:
> We are debugging an issue with fib6_node_lookup_1(). 
> 
> We use a 4.16 based kernel, and we have back ported most upstream
> patches in ip6_fib.{c,h}. The only major differences I can spot are
> 

Did you backport all patches in each set that included a change to those
files, or just the patches to ip6_fib.* and any dependencies?

* Re: BUG: unable to handle kernel paging request in fib6_node_lookup_1
From: Song Liu @ 2018-09-05 18:10 UTC
  To: Wei Wang; +Cc: Linux Kernel Network Developers, David Ahern, Eric Dumazet



> On Sep 5, 2018, at 10:09 AM, Wei Wang <weiwan@google.com> wrote:
> 
> On Tue, Sep 4, 2018 at 11:11 PM Song Liu <songliubraving@fb.com> wrote:
>> 
>> We are debugging an issue with fib6_node_lookup_1().
>> 
>> We use a 4.16 based kernel, and we have back ported most upstream
>> patches in ip6_fib.{c,h}. The only major differences I can spot are
>> 
>> 8b7f2731bd68d83940714ce92381d1a72596407c
>> c3506372277779fccbffee2475400fcd689d5738
>> 
>> I guess the issue is not related to these two fixes.
>> 
>> After staring at the call trace and disassembly code (attached below)
>> I guess this is a use-after-free issue in (or right after) the lookup
>> loop:
>> 
>>        for (;;) {
>>                struct fib6_node *next;
>> 
>>                dir = addr_bit_set(args->addr, fn->fn_bit);
>> 
>>                next = dir ? rcu_dereference(fn->right) :
>>                             rcu_dereference(fn->left);
>> 
>>                if (next) {
>>                        fn = next;
>>                        continue;
>>                }
>>                break;
>>        }
>> 
>> I guess this probably also happens to latest upstream. I haven't
>> tested this with upstream kernel (or net tree) yet, because we
>> can only trigger this about once a week on 100 servers.
>> 
>> Does this look familiar? Any comments and/or suggestions are highly
>> appreciated.
>> 
> By glancing at the commit logs, I don't think any changes were made
> regarding the core logic of fib6_node handling recently.
> (There were a couple of fixes regarding fib6_info but I don't think it
> is the cause here... But it is still good to check if you have commit
> 9b0a8da8c4c6, e873e4b9cc7e, e70a3aad44cc in your build.)

It looks like we don't have e70a3aad44cc. I think it fixes a memory leak
(rather than a use-after-free)? Let me add it and run some tests anyway.
Thanks a lot for this information. 

> 
> I also went through the call path and did not find anything obviously wrong...
> I think it's the best for you to reproduce it and we can debug further.
> One question is, do you have "CONFIG_IPV6_SUBTREE" enabled and specify
> src IP in the routing table?

We do have CONFIG_IPV6_SUBTREE enabled. But we usually do not specify
src IP in the routing table. 

Let me try to reproduce it. 

Thanks again,
Song

* Re: BUG: unable to handle kernel paging request in fib6_node_lookup_1
From: Song Liu @ 2018-09-05 18:12 UTC
  To: David Ahern; +Cc: Networking, weiwan, Eric Dumazet



> On Sep 5, 2018, at 10:32 AM, David Ahern <dsahern@gmail.com> wrote:
> 
> On 9/5/18 12:11 AM, Song Liu wrote:
>> We are debugging an issue with fib6_node_lookup_1(). 
>> 
>> We use a 4.16 based kernel, and we have back ported most upstream
>> patches in ip6_fib.{c,h}. The only major differences I can spot are
>> 
> 
> Did you backport all patches in each set that included a change to those
> files, or just the patches to ip6_fib.* and any dependencies?

I believe we always try to backport all patches in each set. But we have
backported hundreds of patches to our 4.16 tree, so it is also likely we
missed some useful patches.

Thanks,
Song
