[5.10 CAN BUG report] kernel dump about echo skb

* [5.10 CAN BUG report] kernel dump about echo skb
@ 2021-01-21  8:37 Joakim Zhang
  2021-01-22  7:53 ` Joakim Zhang
  2021-01-22  8:58 ` Marc Kleine-Budde
  0 siblings, 2 replies; 8+ messages in thread
From: Joakim Zhang @ 2021-01-21  8:37 UTC (permalink / raw)
  To: Marc Kleine-Budde, Oleksij Rempel, Oliver Hartkopp; +Cc: kernel, linux-can

Hi Marc, Oleksij, Oliver,

Our auto test scripts reports below kernel dump, after revert Oleksij Rempel patch "286228d382ba can: can_create_echo_skb(): fix echo skb generation: always use skb_clone()", this issue disappeared.
What strange is that, commit message intends to fix this issue. Report it here to see if anyone meets the same issue or can give some suggestions. I will continue to dig how to reproduce it manually, now
I don't know how to do it. Thanks.

1. Kernel dump log:
[  101.688327] ------------[ cut here ]------------
[  101.692968] refcount_t: addition on 0; use-after-free.
[  101.698210] WARNING: CPU: 0 PID: 0 at lib/refcount.c:25 refcount_warn_saturat
e+0x98/0x140
[  101.706386] Modules linked in:
[  101.709446] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.10.4-100715-gc60abf9c
2110-dirty #376
[  101.717881] Hardware name: NXP i.MX8MPlus EVK board (DT)
[  101.723193] pstate: 40000005 (nZcv daif -PAN -UAO -TCO BTYPE=--)
[  101.729200] pc : refcount_warn_saturate+0x98/0x140
[  101.733990] lr : refcount_warn_saturate+0x98/0x140
[  101.738778] sp : ffff800011bcbcd0
[  101.742090] x29: ffff800011bcbcd0 x28: ffff0000c16b7000
[  101.747405] x27: 0000000000000000 x26: ffff0000c1718000
[  101.752718] x25: 00000000048c0000 x24: ffff0000c1e54c00
[  101.758031] x23: 0000000000000000 x22: ffff0000c1718900
[  101.763345] x21: ffff0000c1718000 x20: ffff0000c1f00800
[  101.768661] x19: ffff0000c1985e00 x18: 0000000000000010
[  101.773974] x17: 0000000000000000 x16: 0000000000000000
[  101.779287] x15: 0000000000000229 x14: ffff8000119d2cb8
[  101.784600] x13: 00000000ffffffea x12: ffff800011a51b60
[  101.789914] x11: 0000000000000003 x10: ffff800011a39b78
[  101.795227] x9 : 0000000000017fe8 x8 : ffff800011a39b20
[  101.800543] x7 : c0000000ffffefff x6 : 0000000000000001
[  101.805858] x5 : 0000000000000000 x4 : ffff00017f36f900
[  101.811174] x3 : ffff00017f376830 x2 : ffff00017f36f900
[  101.816487] x1 : 5ead5e088841e800 x0 : 0000000000000000
[  101.821800] Call trace:
[  101.824248]  refcount_warn_saturate+0x98/0x140
[  101.828694]  can_put_echo_skb+0x154/0x190
[  101.832704]  flexcan_start_xmit+0x188/0x240
[  101.836888]  dev_hard_start_xmit+0xe0/0x1a8
[  101.841072]  sch_direct_xmit+0x11c/0x338
[  101.844995]  __qdisc_run+0x118/0x5d8
[  101.848570]  net_tx_action+0x110/0x1a8
[  101.852319]  efi_header_end+0x120/0x264
[  101.856156]  irq_exit+0xb8/0xd8
[  101.859298]  __handle_domain_irq+0x64/0xb8
[  101.863396]  gic_handle_irq+0x54/0x118
[  101.867145]  el1_irq+0xc0/0x180
[  101.870288]  cpuidle_enter_state+0x130/0x2e8
[  101.874560]  cpuidle_enter+0x34/0x48
[  101.878136]  call_cpuidle+0x18/0x38
[  101.881627]  do_idle+0x1f4/0x258
[  101.884854]  cpu_startup_entry+0x20/0x68
[  101.888778]  rest_init+0xd4/0xe0
[  101.892008]  arch_call_rest_init+0xc/0x14
[  101.896019]  start_kernel+0x470/0x4a4
[  101.899681] ---[ end trace 8de9661669c150a6 ]---
[  101.908725] ------------[ cut here ]------------
[  101.913379] refcount_t: underflow; use-after-free.
[  101.918360] WARNING: CPU: 0 PID: 9 at lib/refcount.c:28 refcount_warn_saturat
e+0xec/0x140
[  101.926544] Modules linked in:
[  101.929605] CPU: 0 PID: 9 Comm: ksoftirqd/0 Tainted: G        W         5.10.
4-100715-gc60abf9c2110-dirty #376
[  101.939603] Hardware name: NXP i.MX8MPlus EVK board (DT)
[  101.944914] pstate: 40000005 (nZcv daif -PAN -UAO -TCO BTYPE=--)
[  101.950921] pc : refcount_warn_saturate+0xec/0x140
[  101.955709] lr : refcount_warn_saturate+0xec/0x140
[  101.960497] sp : ffff800011c73b00
[  101.963809] x29: ffff800011c73b00 x28: 000000000000003d
[  101.969123] x27: ffff00017f37a880 x26: ffff800011c73d48
[  101.974435] x25: 000000000000012c x24: ffff800011c73d58
[  101.979748] x23: ffff0000c1718000 x22: 0000000000000000
[  101.985061] x21: ffff800011affcc0 x20: ffff0000c13eeb80
[  101.990374] x19: ffff0000c1f00800 x18: 0000000000000010
[  101.995686] x17: 0000000000000000 x16: 0000000000000000
[  102.000999] x15: 0000000000000258 x14: ffff0000c007f278
[  102.006311] x13: 00000000ffffffea x12: ffff800011a51b60
[  102.011624] x11: 0000000000000003 x10: ffff800011a39b78
[  102.016936] x9 : 0000000000017fe8 x8 : ffff800011a39b20
[  102.022249] x7 : c0000000ffffefff x6 : 0000000000000001
[  102.027562] x5 : 0000000000000000 x4 : ffff00017f36f900
[  102.032874] x3 : ffff00017f376830 x2 : ffff00017f36f900
[  102.038187] x1 : b6fa97510f7af300 x0 : 0000000000000000
[  102.043500] Call trace:
[  102.045947]  refcount_warn_saturate+0xec/0x140
[  102.050391]  sock_efree+0x6c/0x78
[  102.053707]  skb_release_head_state+0x40/0xb0
[  102.058065]  skb_release_all+0x14/0x30
[  102.061813]  kfree_skb+0x2c/0x88
[  102.065041]  can_receive+0xb4/0xf8
[  102.068442]  can_rcv+0x3c/0xc0
[  102.071496]  __netif_receive_skb_one_core+0x50/0x78
[  102.076373]  __netif_receive_skb+0x14/0x60
[  102.080468]  netif_receive_skb+0x34/0xc0
[  102.084392]  can_rx_offload_napi_poll+0x60/0xd8
[  102.088921]  net_rx_action+0x114/0x378
[  102.092670]  efi_header_end+0x120/0x264
[  102.096506]  run_ksoftirqd+0x44/0x58
[  102.100081]  smpboot_thread_fn+0x1b4/0x270
[  102.104177]  kthread+0x138/0x158
[  102.107405]  ret_from_fork+0x10/0x18
[  102.110979] ---[ end trace 8de9661669c150a7 ]---

2. Patch introduces issue:
commit 286228d382ba6320f04fa2e7c6fc8d4d92e428f4
Author: Oleksij Rempel <o.rempel@pengutronix.de>
Date:   Wed Dec 18 09:39:02 2019 +0100

    can: can_create_echo_skb(): fix echo skb generation: always use skb_clone()

    All user space generated SKBs are owned by a socket (unless injected into the
    key via AF_PACKET). If a socket is closed, all associated skbs will be cleaned
    up.

    This leads to a problem when a CAN driver calls can_put_echo_skb() on a
    unshared SKB. If the socket is closed prior to the TX complete handler,
    can_get_echo_skb() and the subsequent delivering of the echo SKB to all
    registered callbacks, a SKB with a refcount of 0 is delivered.

    To avoid the problem, in can_get_echo_skb() the original SKB is now always
    cloned, regardless of shared SKB or not. If the process exists it can now
    safely discard its SKBs, without disturbing the delivery of the echo SKB.

    The problem shows up in the j1939 stack, when it clones the incoming skb, which
    detects the already 0 refcount.

    We can easily reproduce this with following example:

    testj1939 -B -r can0: &
    cansend can0 1823ff40#0123

    WARNING: CPU: 0 PID: 293 at lib/refcount.c:25 refcount_warn_saturate+0x108/0x174
    refcount_t: addition on 0; use-after-free.
    Modules linked in: coda_vpu imx_vdoa videobuf2_vmalloc dw_hdmi_ahb_audio vcan
    CPU: 0 PID: 293 Comm: cansend Not tainted 5.5.0-rc6-00376-g9e20dcb7040d #1
    Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
    Backtrace:
    [<c010f570>] (dump_backtrace) from [<c010f90c>] (show_stack+0x20/0x24)
    [<c010f8ec>] (show_stack) from [<c0c3e1a4>] (dump_stack+0x8c/0xa0)
    [<c0c3e118>] (dump_stack) from [<c0127fec>] (__warn+0xe0/0x108)
    [<c0127f0c>] (__warn) from [<c01283c8>] (warn_slowpath_fmt+0xa8/0xcc)
    [<c0128324>] (warn_slowpath_fmt) from [<c0539c0c>] (refcount_warn_saturate+0x108/0x174)
    [<c0539b04>] (refcount_warn_saturate) from [<c0ad2cac>] (j1939_can_recv+0x20c/0x210)
    [<c0ad2aa0>] (j1939_can_recv) from [<c0ac9dc8>] (can_rcv_filter+0xb4/0x268)
    [<c0ac9d14>] (can_rcv_filter) from [<c0aca2cc>] (can_receive+0xb0/0xe4)
    [<c0aca21c>] (can_receive) from [<c0aca348>] (can_rcv+0x48/0x98)
    [<c0aca300>] (can_rcv) from [<c09b1fdc>] (__netif_receive_skb_one_core+0x64/0x88)
    [<c09b1f78>] (__netif_receive_skb_one_core) from [<c09b2070>] (__netif_receive_skb+0x38/0x94)
    [<c09b2038>] (__netif_receive_skb) from [<c09b2130>] (netif_receive_skb_internal+0x64/0xf8)
    [<c09b20cc>] (netif_receive_skb_internal) from [<c09b21f8>] (netif_receive_skb+0x34/0x19c)
    [<c09b21c4>] (netif_receive_skb) from [<c0791278>] (can_rx_offload_napi_poll+0x58/0xb4)

    Fixes: 0ae89beb283a ("can: add destructor for self generated skbs")
    Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
    Link: http://lore.kernel.org/r/20200124132656.22156-1-o.rempel@pengutronix.de
    Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
    Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

Best Regards,
Joakim Zhang

^ permalink raw reply	[flat|nested] 8+ messages in thread