linux-can.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [5.10 CAN BUG report] kernel dump about echo skb
@ 2021-01-21  8:37 Joakim Zhang
  2021-01-22  7:53 ` Joakim Zhang
  2021-01-22  8:58 ` Marc Kleine-Budde
  0 siblings, 2 replies; 8+ messages in thread
From: Joakim Zhang @ 2021-01-21  8:37 UTC (permalink / raw)
  To: Marc Kleine-Budde, Oleksij Rempel, Oliver Hartkopp; +Cc: kernel, linux-can


Hi Marc, Oleksij, Oliver,

Our auto test scripts reports below kernel dump, after revert Oleksij Rempel patch "286228d382ba can: can_create_echo_skb(): fix echo skb generation: always use skb_clone()", this issue disappeared.
What strange is that, commit message intends to fix this issue. Report it here to see if anyone meets the same issue or can give some suggestions. I will continue to dig how to reproduce it manually, now
I don't know how to do it. Thanks.

1. Kernel dump log:
[  101.688327] ------------[ cut here ]------------
[  101.692968] refcount_t: addition on 0; use-after-free.
[  101.698210] WARNING: CPU: 0 PID: 0 at lib/refcount.c:25 refcount_warn_saturat
e+0x98/0x140
[  101.706386] Modules linked in:
[  101.709446] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.10.4-100715-gc60abf9c
2110-dirty #376
[  101.717881] Hardware name: NXP i.MX8MPlus EVK board (DT)
[  101.723193] pstate: 40000005 (nZcv daif -PAN -UAO -TCO BTYPE=--)
[  101.729200] pc : refcount_warn_saturate+0x98/0x140
[  101.733990] lr : refcount_warn_saturate+0x98/0x140
[  101.738778] sp : ffff800011bcbcd0
[  101.742090] x29: ffff800011bcbcd0 x28: ffff0000c16b7000
[  101.747405] x27: 0000000000000000 x26: ffff0000c1718000
[  101.752718] x25: 00000000048c0000 x24: ffff0000c1e54c00
[  101.758031] x23: 0000000000000000 x22: ffff0000c1718900
[  101.763345] x21: ffff0000c1718000 x20: ffff0000c1f00800
[  101.768661] x19: ffff0000c1985e00 x18: 0000000000000010
[  101.773974] x17: 0000000000000000 x16: 0000000000000000
[  101.779287] x15: 0000000000000229 x14: ffff8000119d2cb8
[  101.784600] x13: 00000000ffffffea x12: ffff800011a51b60
[  101.789914] x11: 0000000000000003 x10: ffff800011a39b78
[  101.795227] x9 : 0000000000017fe8 x8 : ffff800011a39b20
[  101.800543] x7 : c0000000ffffefff x6 : 0000000000000001
[  101.805858] x5 : 0000000000000000 x4 : ffff00017f36f900
[  101.811174] x3 : ffff00017f376830 x2 : ffff00017f36f900
[  101.816487] x1 : 5ead5e088841e800 x0 : 0000000000000000
[  101.821800] Call trace:
[  101.824248]  refcount_warn_saturate+0x98/0x140
[  101.828694]  can_put_echo_skb+0x154/0x190
[  101.832704]  flexcan_start_xmit+0x188/0x240
[  101.836888]  dev_hard_start_xmit+0xe0/0x1a8
[  101.841072]  sch_direct_xmit+0x11c/0x338
[  101.844995]  __qdisc_run+0x118/0x5d8
[  101.848570]  net_tx_action+0x110/0x1a8
[  101.852319]  efi_header_end+0x120/0x264
[  101.856156]  irq_exit+0xb8/0xd8
[  101.859298]  __handle_domain_irq+0x64/0xb8
[  101.863396]  gic_handle_irq+0x54/0x118
[  101.867145]  el1_irq+0xc0/0x180
[  101.870288]  cpuidle_enter_state+0x130/0x2e8
[  101.874560]  cpuidle_enter+0x34/0x48
[  101.878136]  call_cpuidle+0x18/0x38
[  101.881627]  do_idle+0x1f4/0x258
[  101.884854]  cpu_startup_entry+0x20/0x68
[  101.888778]  rest_init+0xd4/0xe0
[  101.892008]  arch_call_rest_init+0xc/0x14
[  101.896019]  start_kernel+0x470/0x4a4
[  101.899681] ---[ end trace 8de9661669c150a6 ]---
[  101.908725] ------------[ cut here ]------------
[  101.913379] refcount_t: underflow; use-after-free.
[  101.918360] WARNING: CPU: 0 PID: 9 at lib/refcount.c:28 refcount_warn_saturat
e+0xec/0x140
[  101.926544] Modules linked in:
[  101.929605] CPU: 0 PID: 9 Comm: ksoftirqd/0 Tainted: G        W         5.10.
4-100715-gc60abf9c2110-dirty #376
[  101.939603] Hardware name: NXP i.MX8MPlus EVK board (DT)
[  101.944914] pstate: 40000005 (nZcv daif -PAN -UAO -TCO BTYPE=--)
[  101.950921] pc : refcount_warn_saturate+0xec/0x140
[  101.955709] lr : refcount_warn_saturate+0xec/0x140
[  101.960497] sp : ffff800011c73b00
[  101.963809] x29: ffff800011c73b00 x28: 000000000000003d
[  101.969123] x27: ffff00017f37a880 x26: ffff800011c73d48
[  101.974435] x25: 000000000000012c x24: ffff800011c73d58
[  101.979748] x23: ffff0000c1718000 x22: 0000000000000000
[  101.985061] x21: ffff800011affcc0 x20: ffff0000c13eeb80
[  101.990374] x19: ffff0000c1f00800 x18: 0000000000000010
[  101.995686] x17: 0000000000000000 x16: 0000000000000000
[  102.000999] x15: 0000000000000258 x14: ffff0000c007f278
[  102.006311] x13: 00000000ffffffea x12: ffff800011a51b60
[  102.011624] x11: 0000000000000003 x10: ffff800011a39b78
[  102.016936] x9 : 0000000000017fe8 x8 : ffff800011a39b20
[  102.022249] x7 : c0000000ffffefff x6 : 0000000000000001
[  102.027562] x5 : 0000000000000000 x4 : ffff00017f36f900
[  102.032874] x3 : ffff00017f376830 x2 : ffff00017f36f900
[  102.038187] x1 : b6fa97510f7af300 x0 : 0000000000000000
[  102.043500] Call trace:
[  102.045947]  refcount_warn_saturate+0xec/0x140
[  102.050391]  sock_efree+0x6c/0x78
[  102.053707]  skb_release_head_state+0x40/0xb0
[  102.058065]  skb_release_all+0x14/0x30
[  102.061813]  kfree_skb+0x2c/0x88
[  102.065041]  can_receive+0xb4/0xf8
[  102.068442]  can_rcv+0x3c/0xc0
[  102.071496]  __netif_receive_skb_one_core+0x50/0x78
[  102.076373]  __netif_receive_skb+0x14/0x60
[  102.080468]  netif_receive_skb+0x34/0xc0
[  102.084392]  can_rx_offload_napi_poll+0x60/0xd8
[  102.088921]  net_rx_action+0x114/0x378
[  102.092670]  efi_header_end+0x120/0x264
[  102.096506]  run_ksoftirqd+0x44/0x58
[  102.100081]  smpboot_thread_fn+0x1b4/0x270
[  102.104177]  kthread+0x138/0x158
[  102.107405]  ret_from_fork+0x10/0x18
[  102.110979] ---[ end trace 8de9661669c150a7 ]---

2. Patch introduces issue:
commit 286228d382ba6320f04fa2e7c6fc8d4d92e428f4
Author: Oleksij Rempel <o.rempel@pengutronix.de>
Date:   Wed Dec 18 09:39:02 2019 +0100

    can: can_create_echo_skb(): fix echo skb generation: always use skb_clone()

    All user space generated SKBs are owned by a socket (unless injected into the
    key via AF_PACKET). If a socket is closed, all associated skbs will be cleaned
    up.

    This leads to a problem when a CAN driver calls can_put_echo_skb() on a
    unshared SKB. If the socket is closed prior to the TX complete handler,
    can_get_echo_skb() and the subsequent delivering of the echo SKB to all
    registered callbacks, a SKB with a refcount of 0 is delivered.

    To avoid the problem, in can_get_echo_skb() the original SKB is now always
    cloned, regardless of shared SKB or not. If the process exists it can now
    safely discard its SKBs, without disturbing the delivery of the echo SKB.

    The problem shows up in the j1939 stack, when it clones the incoming skb, which
    detects the already 0 refcount.

    We can easily reproduce this with following example:

    testj1939 -B -r can0: &
    cansend can0 1823ff40#0123

    WARNING: CPU: 0 PID: 293 at lib/refcount.c:25 refcount_warn_saturate+0x108/0x174
    refcount_t: addition on 0; use-after-free.
    Modules linked in: coda_vpu imx_vdoa videobuf2_vmalloc dw_hdmi_ahb_audio vcan
    CPU: 0 PID: 293 Comm: cansend Not tainted 5.5.0-rc6-00376-g9e20dcb7040d #1
    Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
    Backtrace:
    [<c010f570>] (dump_backtrace) from [<c010f90c>] (show_stack+0x20/0x24)
    [<c010f8ec>] (show_stack) from [<c0c3e1a4>] (dump_stack+0x8c/0xa0)
    [<c0c3e118>] (dump_stack) from [<c0127fec>] (__warn+0xe0/0x108)
    [<c0127f0c>] (__warn) from [<c01283c8>] (warn_slowpath_fmt+0xa8/0xcc)
    [<c0128324>] (warn_slowpath_fmt) from [<c0539c0c>] (refcount_warn_saturate+0x108/0x174)
    [<c0539b04>] (refcount_warn_saturate) from [<c0ad2cac>] (j1939_can_recv+0x20c/0x210)
    [<c0ad2aa0>] (j1939_can_recv) from [<c0ac9dc8>] (can_rcv_filter+0xb4/0x268)
    [<c0ac9d14>] (can_rcv_filter) from [<c0aca2cc>] (can_receive+0xb0/0xe4)
    [<c0aca21c>] (can_receive) from [<c0aca348>] (can_rcv+0x48/0x98)
    [<c0aca300>] (can_rcv) from [<c09b1fdc>] (__netif_receive_skb_one_core+0x64/0x88)
    [<c09b1f78>] (__netif_receive_skb_one_core) from [<c09b2070>] (__netif_receive_skb+0x38/0x94)
    [<c09b2038>] (__netif_receive_skb) from [<c09b2130>] (netif_receive_skb_internal+0x64/0xf8)
    [<c09b20cc>] (netif_receive_skb_internal) from [<c09b21f8>] (netif_receive_skb+0x34/0x19c)
    [<c09b21c4>] (netif_receive_skb) from [<c0791278>] (can_rx_offload_napi_poll+0x58/0xb4)

    Fixes: 0ae89beb283a ("can: add destructor for self generated skbs")
    Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
    Link: http://lore.kernel.org/r/20200124132656.22156-1-o.rempel@pengutronix.de
    Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
    Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

Best Regards,
Joakim Zhang


^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: [5.10 CAN BUG report] kernel dump about echo skb
  2021-01-21  8:37 [5.10 CAN BUG report] kernel dump about echo skb Joakim Zhang
@ 2021-01-22  7:53 ` Joakim Zhang
  2021-01-22  9:08   ` Vincent MAILHOL
  2021-01-22  8:58 ` Marc Kleine-Budde
  1 sibling, 1 reply; 8+ messages in thread
From: Joakim Zhang @ 2021-01-22  7:53 UTC (permalink / raw)
  To: Joakim Zhang, Marc Kleine-Budde, Oleksij Rempel, Oliver Hartkopp
  Cc: kernel, linux-can


@Marc Kleine-Budde, @Oleksij Rempel, any comments? Thanks a lot.

Best Regards,
Joakim Zhang

> -----Original Message-----
> From: Joakim Zhang <qiangqing.zhang@nxp.com>
> Sent: 2021年1月21日 16:38
> To: Marc Kleine-Budde <mkl@pengutronix.de>; Oleksij Rempel
> <o.rempel@pengutronix.de>; Oliver Hartkopp <socketcan@hartkopp.net>
> Cc: kernel@pengutronix.de; linux-can@vger.kernel.org
> Subject: [5.10 CAN BUG report] kernel dump about echo skb
> 
> 
> Hi Marc, Oleksij, Oliver,
> 
> Our auto test scripts reports below kernel dump, after revert Oleksij Rempel
> patch "286228d382ba can: can_create_echo_skb(): fix echo skb generation:
> always use skb_clone()", this issue disappeared.
> What strange is that, commit message intends to fix this issue. Report it here
> to see if anyone meets the same issue or can give some suggestions. I will
> continue to dig how to reproduce it manually, now I don't know how to do it.
> Thanks.
> 
> 1. Kernel dump log:
> [  101.688327] ------------[ cut here ]------------ [  101.692968] refcount_t:
> addition on 0; use-after-free.
> [  101.698210] WARNING: CPU: 0 PID: 0 at lib/refcount.c:25
> refcount_warn_saturat
> e+0x98/0x140
> [  101.706386] Modules linked in:
> [  101.709446] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
> 5.10.4-100715-gc60abf9c 2110-dirty #376 [  101.717881] Hardware name:
> NXP i.MX8MPlus EVK board (DT) [  101.723193] pstate: 40000005 (nZcv daif
> -PAN -UAO -TCO BTYPE=--) [  101.729200] pc :
> refcount_warn_saturate+0x98/0x140 [  101.733990] lr :
> refcount_warn_saturate+0x98/0x140 [  101.738778] sp : ffff800011bcbcd0
> [  101.742090] x29: ffff800011bcbcd0 x28: ffff0000c16b7000 [  101.747405]
> x27: 0000000000000000 x26: ffff0000c1718000 [  101.752718] x25:
> 00000000048c0000 x24: ffff0000c1e54c00 [  101.758031] x23:
> 0000000000000000 x22: ffff0000c1718900 [  101.763345] x21:
> ffff0000c1718000 x20: ffff0000c1f00800 [  101.768661] x19: ffff0000c1985e00
> x18: 0000000000000010 [  101.773974] x17: 0000000000000000 x16:
> 0000000000000000 [  101.779287] x15: 0000000000000229 x14:
> ffff8000119d2cb8 [  101.784600] x13: 00000000ffffffea x12: ffff800011a51b60
> [  101.789914] x11: 0000000000000003 x10: ffff800011a39b78 [  101.795227]
> x9 : 0000000000017fe8 x8 : ffff800011a39b20 [  101.800543] x7 :
> c0000000ffffefff x6 : 0000000000000001 [  101.805858] x5 :
> 0000000000000000 x4 : ffff00017f36f900 [  101.811174] x3 : ffff00017f376830
> x2 : ffff00017f36f900 [  101.816487] x1 : 5ead5e088841e800 x0 :
> 0000000000000000 [  101.821800] Call trace:
> [  101.824248]  refcount_warn_saturate+0x98/0x140 [  101.828694]
> can_put_echo_skb+0x154/0x190 [  101.832704]
> flexcan_start_xmit+0x188/0x240 [  101.836888]
> dev_hard_start_xmit+0xe0/0x1a8 [  101.841072]
> sch_direct_xmit+0x11c/0x338 [  101.844995]  __qdisc_run+0x118/0x5d8
> [  101.848570]  net_tx_action+0x110/0x1a8 [  101.852319]
> efi_header_end+0x120/0x264 [  101.856156]  irq_exit+0xb8/0xd8
> [  101.859298]  __handle_domain_irq+0x64/0xb8 [  101.863396]
> gic_handle_irq+0x54/0x118 [  101.867145]  el1_irq+0xc0/0x180
> [  101.870288]  cpuidle_enter_state+0x130/0x2e8 [  101.874560]
> cpuidle_enter+0x34/0x48 [  101.878136]  call_cpuidle+0x18/0x38
> [  101.881627]  do_idle+0x1f4/0x258 [  101.884854]
> cpu_startup_entry+0x20/0x68 [  101.888778]  rest_init+0xd4/0xe0
> [  101.892008]  arch_call_rest_init+0xc/0x14 [  101.896019]
> start_kernel+0x470/0x4a4 [  101.899681] ---[ end trace
> 8de9661669c150a6 ]--- [  101.908725] ------------[ cut here ]------------
> [  101.913379] refcount_t: underflow; use-after-free.
> [  101.918360] WARNING: CPU: 0 PID: 9 at lib/refcount.c:28
> refcount_warn_saturat
> e+0xec/0x140
> [  101.926544] Modules linked in:
> [  101.929605] CPU: 0 PID: 9 Comm: ksoftirqd/0 Tainted: G        W
> 5.10.
> 4-100715-gc60abf9c2110-dirty #376
> [  101.939603] Hardware name: NXP i.MX8MPlus EVK board (DT)
> [  101.944914] pstate: 40000005 (nZcv daif -PAN -UAO -TCO BTYPE=--)
> [  101.950921] pc : refcount_warn_saturate+0xec/0x140 [  101.955709] lr :
> refcount_warn_saturate+0xec/0x140 [  101.960497] sp : ffff800011c73b00
> [  101.963809] x29: ffff800011c73b00 x28: 000000000000003d [  101.969123]
> x27: ffff00017f37a880 x26: ffff800011c73d48 [  101.974435] x25:
> 000000000000012c x24: ffff800011c73d58 [  101.979748] x23:
> ffff0000c1718000 x22: 0000000000000000 [  101.985061] x21:
> ffff800011affcc0 x20: ffff0000c13eeb80 [  101.990374] x19: ffff0000c1f00800
> x18: 0000000000000010 [  101.995686] x17: 0000000000000000 x16:
> 0000000000000000 [  102.000999] x15: 0000000000000258 x14:
> ffff0000c007f278 [  102.006311] x13: 00000000ffffffea x12: ffff800011a51b60
> [  102.011624] x11: 0000000000000003 x10: ffff800011a39b78 [  102.016936]
> x9 : 0000000000017fe8 x8 : ffff800011a39b20 [  102.022249] x7 :
> c0000000ffffefff x6 : 0000000000000001 [  102.027562] x5 :
> 0000000000000000 x4 : ffff00017f36f900 [  102.032874] x3 : ffff00017f376830
> x2 : ffff00017f36f900 [  102.038187] x1 : b6fa97510f7af300 x0 :
> 0000000000000000 [  102.043500] Call trace:
> [  102.045947]  refcount_warn_saturate+0xec/0x140 [  102.050391]
> sock_efree+0x6c/0x78 [  102.053707]  skb_release_head_state+0x40/0xb0
> [  102.058065]  skb_release_all+0x14/0x30 [  102.061813]
> kfree_skb+0x2c/0x88 [  102.065041]  can_receive+0xb4/0xf8 [  102.068442]
> can_rcv+0x3c/0xc0 [  102.071496]
> __netif_receive_skb_one_core+0x50/0x78
> [  102.076373]  __netif_receive_skb+0x14/0x60 [  102.080468]
> netif_receive_skb+0x34/0xc0 [  102.084392]
> can_rx_offload_napi_poll+0x60/0xd8
> [  102.088921]  net_rx_action+0x114/0x378 [  102.092670]
> efi_header_end+0x120/0x264 [  102.096506]  run_ksoftirqd+0x44/0x58
> [  102.100081]  smpboot_thread_fn+0x1b4/0x270 [  102.104177]
> kthread+0x138/0x158 [  102.107405]  ret_from_fork+0x10/0x18
> [  102.110979] ---[ end trace 8de9661669c150a7 ]---
> 
> 2. Patch introduces issue:
> commit 286228d382ba6320f04fa2e7c6fc8d4d92e428f4
> Author: Oleksij Rempel <o.rempel@pengutronix.de>
> Date:   Wed Dec 18 09:39:02 2019 +0100
> 
>     can: can_create_echo_skb(): fix echo skb generation: always use
> skb_clone()
> 
>     All user space generated SKBs are owned by a socket (unless injected into
> the
>     key via AF_PACKET). If a socket is closed, all associated skbs will be
> cleaned
>     up.
> 
>     This leads to a problem when a CAN driver calls can_put_echo_skb() on a
>     unshared SKB. If the socket is closed prior to the TX complete handler,
>     can_get_echo_skb() and the subsequent delivering of the echo SKB to all
>     registered callbacks, a SKB with a refcount of 0 is delivered.
> 
>     To avoid the problem, in can_get_echo_skb() the original SKB is now
> always
>     cloned, regardless of shared SKB or not. If the process exists it can now
>     safely discard its SKBs, without disturbing the delivery of the echo SKB.
> 
>     The problem shows up in the j1939 stack, when it clones the incoming skb,
> which
>     detects the already 0 refcount.
> 
>     We can easily reproduce this with following example:
> 
>     testj1939 -B -r can0: &
>     cansend can0 1823ff40#0123
> 
>     WARNING: CPU: 0 PID: 293 at lib/refcount.c:25
> refcount_warn_saturate+0x108/0x174
>     refcount_t: addition on 0; use-after-free.
>     Modules linked in: coda_vpu imx_vdoa videobuf2_vmalloc
> dw_hdmi_ahb_audio vcan
>     CPU: 0 PID: 293 Comm: cansend Not tainted
> 5.5.0-rc6-00376-g9e20dcb7040d #1
>     Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
>     Backtrace:
>     [<c010f570>] (dump_backtrace) from [<c010f90c>]
> (show_stack+0x20/0x24)
>     [<c010f8ec>] (show_stack) from [<c0c3e1a4>] (dump_stack+0x8c/0xa0)
>     [<c0c3e118>] (dump_stack) from [<c0127fec>] (__warn+0xe0/0x108)
>     [<c0127f0c>] (__warn) from [<c01283c8>]
> (warn_slowpath_fmt+0xa8/0xcc)
>     [<c0128324>] (warn_slowpath_fmt) from [<c0539c0c>]
> (refcount_warn_saturate+0x108/0x174)
>     [<c0539b04>] (refcount_warn_saturate) from [<c0ad2cac>]
> (j1939_can_recv+0x20c/0x210)
>     [<c0ad2aa0>] (j1939_can_recv) from [<c0ac9dc8>]
> (can_rcv_filter+0xb4/0x268)
>     [<c0ac9d14>] (can_rcv_filter) from [<c0aca2cc>] (can_receive+0xb0/0xe4)
>     [<c0aca21c>] (can_receive) from [<c0aca348>] (can_rcv+0x48/0x98)
>     [<c0aca300>] (can_rcv) from [<c09b1fdc>]
> (__netif_receive_skb_one_core+0x64/0x88)
>     [<c09b1f78>] (__netif_receive_skb_one_core) from [<c09b2070>]
> (__netif_receive_skb+0x38/0x94)
>     [<c09b2038>] (__netif_receive_skb) from [<c09b2130>]
> (netif_receive_skb_internal+0x64/0xf8)
>     [<c09b20cc>] (netif_receive_skb_internal) from [<c09b21f8>]
> (netif_receive_skb+0x34/0x19c)
>     [<c09b21c4>] (netif_receive_skb) from [<c0791278>]
> (can_rx_offload_napi_poll+0x58/0xb4)
> 
>     Fixes: 0ae89beb283a ("can: add destructor for self generated skbs")
>     Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
>     Link:
> https://eur01.safelinks.protection.outlook.com/?url=http%3A%2F%2Flore.kern
> el.org%2Fr%2F20200124132656.22156-1-o.rempel%40pengutronix.de&amp;d
> ata=04%7C01%7Cqiangqing.zhang%40nxp.com%7C2d128ac66b3f4e93db9a08
> d8bde80d18%7C686ea1d3bc2b4c6fa92cd99c5c301635%7C0%7C0%7C6374681
> 51639635283%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIj
> oiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=QQtX%2
> FCpYRmeynsy18qnLFUMyFgwUSgzbQeSqYGwD3UM%3D&amp;reserved=0
>     Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
>     Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
> 
> Best Regards,
> Joakim Zhang


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [5.10 CAN BUG report] kernel dump about echo skb
  2021-01-21  8:37 [5.10 CAN BUG report] kernel dump about echo skb Joakim Zhang
  2021-01-22  7:53 ` Joakim Zhang
@ 2021-01-22  8:58 ` Marc Kleine-Budde
  1 sibling, 0 replies; 8+ messages in thread
From: Marc Kleine-Budde @ 2021-01-22  8:58 UTC (permalink / raw)
  To: Joakim Zhang; +Cc: Oleksij Rempel, Oliver Hartkopp, kernel, linux-can

[-- Attachment #1: Type: text/plain, Size: 1484 bytes --]

On Thu, Jan 21, 2021 at 08:37:31AM +0000, Joakim Zhang wrote:
> Our auto test scripts reports below kernel dump, after revert Oleksij Rempel
> patch "286228d382ba can: can_create_echo_skb(): fix echo skb generation: always
> use skb_clone()", this issue disappeared.  What strange is that, commit message
> intends to fix this issue. Report it here to see if anyone meets the same issue
> or can give some suggestions. I will continue to dig how to reproduce it
> manually, now I don't know how to do it. Thanks.

The backtrace looks more or less like a normal TX-path to me. Although the send
is not directly in the call stack of a user space application, but it's
deferred and then sent by the net_tx_action().

I have the feeling, that this backtrace is triggered by the same problem that
Oleksij tried to fix. Some CAN frames were send on a socket, that socket was
closed and the skbs have reached 0 refcount, after this the frames were put
into the driver.

With Oleksij's patch the skb is always cloned, this means the refcount is
always checked. Reverting that patch covers the problem.

If you can reproduce, revert the patch and add a refcount check againt 0 only.

regards,
Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [5.10 CAN BUG report] kernel dump about echo skb
  2021-01-22  7:53 ` Joakim Zhang
@ 2021-01-22  9:08   ` Vincent MAILHOL
  2021-01-22  9:23     ` Marc Kleine-Budde
  0 siblings, 1 reply; 8+ messages in thread
From: Vincent MAILHOL @ 2021-01-22  9:08 UTC (permalink / raw)
  To: Joakim Zhang
  Cc: Marc Kleine-Budde, Oleksij Rempel, Oliver Hartkopp, kernel, linux-can

On Fri. 22 Jan 2021 at 16:53, Joakim Zhang <qiangqing.zhang@nxp.com> wrote:
> @Marc Kleine-Budde, @Oleksij Rempel, any comments? Thanks a lot.
>
> Best Regards,
> Joakim Zhang
>
> > -----Original Message-----
> > From: Joakim Zhang <qiangqing.zhang@nxp.com>
> > Sent: 2021年1月21日 16:38
> > To: Marc Kleine-Budde <mkl@pengutronix.de>; Oleksij Rempel
> > <o.rempel@pengutronix.de>; Oliver Hartkopp <socketcan@hartkopp.net>
> > Cc: kernel@pengutronix.de; linux-can@vger.kernel.org
> > Subject: [5.10 CAN BUG report] kernel dump about echo skb
> >
> >
> > Hi Marc, Oleksij, Oliver,
> >
> > Our auto test scripts reports below kernel dump, after revert Oleksij Rempel
> > patch "286228d382ba can: can_create_echo_skb(): fix echo skb generation:
> > always use skb_clone()", this issue disappeared.
> > What strange is that, commit message intends to fix this issue. Report it here
> > to see if anyone meets the same issue or can give some suggestions. I will
> > continue to dig how to reproduce it manually, now I don't know how to do it.
> > Thanks.

Even if you can not reproduce, could you share some hints of how it
first occurred?
Did you use the same command as Oleksij:
>     testj1939 -B -r can0: &
>     cansend can0 1823ff40#0123
or did you use something else?

> > 1. Kernel dump log:
> > [  101.688327] ------------[ cut here ]------------ [  101.692968] refcount_t:
> > addition on 0; use-after-free.

The skb already had its refcount at zero when reaching
can_put_echo_skb(). It is as if it was already freed/consumed!

If you remove Oleksij’s patch, can_put_echo_skb() will probably not
clone the skb and thus not check the refcount: this means that you
will not see the warning, however, it does not necessarily mean that
the bug did not occur.

So far, it seems to me to be another bug which was invisible until
now and which Oleksij’s patch just uncovered. But I do not yet fully
understand what the root cause could be.

> > [  101.698210] WARNING: CPU: 0 PID: 0 at lib/refcount.c:25
> > refcount_warn_saturat
> > e+0x98/0x140
> > [  101.706386] Modules linked in:
> > [  101.709446] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
> > 5.10.4-100715-gc60abf9c 2110-dirty #376 [  101.717881] Hardware name:
> > NXP i.MX8MPlus EVK board (DT) [  101.723193] pstate: 40000005 (nZcv daif
> > -PAN -UAO -TCO BTYPE=--) [  101.729200] pc :
> > refcount_warn_saturate+0x98/0x140 [  101.733990] lr :
> > refcount_warn_saturate+0x98/0x140 [  101.738778] sp : ffff800011bcbcd0
> > [  101.742090] x29: ffff800011bcbcd0 x28: ffff0000c16b7000 [  101.747405]
> > x27: 0000000000000000 x26: ffff0000c1718000 [  101.752718] x25:
> > 00000000048c0000 x24: ffff0000c1e54c00 [  101.758031] x23:
> > 0000000000000000 x22: ffff0000c1718900 [  101.763345] x21:
> > ffff0000c1718000 x20: ffff0000c1f00800 [  101.768661] x19: ffff0000c1985e00
> > x18: 0000000000000010 [  101.773974] x17: 0000000000000000 x16:
> > 0000000000000000 [  101.779287] x15: 0000000000000229 x14:
> > ffff8000119d2cb8 [  101.784600] x13: 00000000ffffffea x12: ffff800011a51b60
> > [  101.789914] x11: 0000000000000003 x10: ffff800011a39b78 [  101.795227]
> > x9 : 0000000000017fe8 x8 : ffff800011a39b20 [  101.800543] x7 :
> > c0000000ffffefff x6 : 0000000000000001 [  101.805858] x5 :
> > 0000000000000000 x4 : ffff00017f36f900 [  101.811174] x3 : ffff00017f376830
> > x2 : ffff00017f36f900 [  101.816487] x1 : 5ead5e088841e800 x0 :
> > 0000000000000000 [  101.821800] Call trace:
> > [  101.824248]  refcount_warn_saturate+0x98/0x140 [  101.828694]
> > can_put_echo_skb+0x154/0x190 [  101.832704]

Could you share the output of:
eu-addr2line -f -i -e drivers/net/can/dev.o can_put_echo_skb+0x154

> > flexcan_start_xmit+0x188/0x240 [  101.836888]
> > dev_hard_start_xmit+0xe0/0x1a8 [  101.841072]
> > sch_direct_xmit+0x11c/0x338 [  101.844995]  __qdisc_run+0x118/0x5d8
> > [  101.848570]  net_tx_action+0x110/0x1a8 [  101.852319]
> > efi_header_end+0x120/0x264 [  101.856156]  irq_exit+0xb8/0xd8
> > [  101.859298]  __handle_domain_irq+0x64/0xb8 [  101.863396]
> > gic_handle_irq+0x54/0x118 [  101.867145]  el1_irq+0xc0/0x180
> > [  101.870288]  cpuidle_enter_state+0x130/0x2e8 [  101.874560]
> > cpuidle_enter+0x34/0x48 [  101.878136]  call_cpuidle+0x18/0x38
> > [  101.881627]  do_idle+0x1f4/0x258 [  101.884854]
> > cpu_startup_entry+0x20/0x68 [  101.888778]  rest_init+0xd4/0xe0
> > [  101.892008]  arch_call_rest_init+0xc/0x14 [  101.896019]
> > start_kernel+0x470/0x4a4 [  101.899681] ---[ end trace
> > 8de9661669c150a6 ]--- [  101.908725] ------------[ cut here ]------------
> > [  101.913379] refcount_t: underflow; use-after-free.
> > [  101.918360] WARNING: CPU: 0 PID: 9 at lib/refcount.c:28
> > refcount_warn_saturat
> > e+0xec/0x140
> > [  101.926544] Modules linked in:
> > [  101.929605] CPU: 0 PID: 9 Comm: ksoftirqd/0 Tainted: G        W
> > 5.10.
> > 4-100715-gc60abf9c2110-dirty #376
> > [  101.939603] Hardware name: NXP i.MX8MPlus EVK board (DT)
> > [  101.944914] pstate: 40000005 (nZcv daif -PAN -UAO -TCO BTYPE=--)
> > [  101.950921] pc : refcount_warn_saturate+0xec/0x140 [  101.955709] lr :
> > refcount_warn_saturate+0xec/0x140 [  101.960497] sp : ffff800011c73b00
> > [  101.963809] x29: ffff800011c73b00 x28: 000000000000003d [  101.969123]
> > x27: ffff00017f37a880 x26: ffff800011c73d48 [  101.974435] x25:
> > 000000000000012c x24: ffff800011c73d58 [  101.979748] x23:
> > ffff0000c1718000 x22: 0000000000000000 [  101.985061] x21:
> > ffff800011affcc0 x20: ffff0000c13eeb80 [  101.990374] x19: ffff0000c1f00800
> > x18: 0000000000000010 [  101.995686] x17: 0000000000000000 x16:
> > 0000000000000000 [  102.000999] x15: 0000000000000258 x14:
> > ffff0000c007f278 [  102.006311] x13: 00000000ffffffea x12: ffff800011a51b60
> > [  102.011624] x11: 0000000000000003 x10: ffff800011a39b78 [  102.016936]
> > x9 : 0000000000017fe8 x8 : ffff800011a39b20 [  102.022249] x7 :
> > c0000000ffffefff x6 : 0000000000000001 [  102.027562] x5 :
> > 0000000000000000 x4 : ffff00017f36f900 [  102.032874] x3 : ffff00017f376830
> > x2 : ffff00017f36f900 [  102.038187] x1 : b6fa97510f7af300 x0 :
> > 0000000000000000 [  102.043500] Call trace:
> > [  102.045947]  refcount_warn_saturate+0xec/0x140 [  102.050391]
> > sock_efree+0x6c/0x78 [  102.053707]  skb_release_head_state+0x40/0xb0
> > [  102.058065]  skb_release_all+0x14/0x30 [  102.061813]
> > kfree_skb+0x2c/0x88 [  102.065041]  can_receive+0xb4/0xf8 [  102.068442]
> > can_rcv+0x3c/0xc0 [  102.071496]
> > __netif_receive_skb_one_core+0x50/0x78
> > [  102.076373]  __netif_receive_skb+0x14/0x60 [  102.080468]
> > netif_receive_skb+0x34/0xc0 [  102.084392]
> > can_rx_offload_napi_poll+0x60/0xd8
> > [  102.088921]  net_rx_action+0x114/0x378 [  102.092670]
> > efi_header_end+0x120/0x264 [  102.096506]  run_ksoftirqd+0x44/0x58
> > [  102.100081]  smpboot_thread_fn+0x1b4/0x270 [  102.104177]
> > kthread+0x138/0x158 [  102.107405]  ret_from_fork+0x10/0x18
> > [  102.110979] ---[ end trace 8de9661669c150a7 ]---
> >
> > 2. Patch introduces issue:
> > commit 286228d382ba6320f04fa2e7c6fc8d4d92e428f4
> > Author: Oleksij Rempel <o.rempel@pengutronix.de>
> > Date:   Wed Dec 18 09:39:02 2019 +0100
> >
> >     can: can_create_echo_skb(): fix echo skb generation: always use
> > skb_clone()
> >
> >     All user space generated SKBs are owned by a socket (unless injected into
> > the
> >     key via AF_PACKET). If a socket is closed, all associated skbs will be
> > cleaned
> >     up.
> >
> >     This leads to a problem when a CAN driver calls can_put_echo_skb() on a
> >     unshared SKB. If the socket is closed prior to the TX complete handler,
> >     can_get_echo_skb() and the subsequent delivering of the echo SKB to all
> >     registered callbacks, a SKB with a refcount of 0 is delivered.
> >
> >     To avoid the problem, in can_get_echo_skb() the original SKB is now
> > always
> >     cloned, regardless of shared SKB or not. If the process exists it can now
> >     safely discard its SKBs, without disturbing the delivery of the echo SKB.
> >
> >     The problem shows up in the j1939 stack, when it clones the incoming skb,
> > which
> >     detects the already 0 refcount.
> >
> >     We can easily reproduce this with following example:
> >
> >     testj1939 -B -r can0: &
> >     cansend can0 1823ff40#0123
> >
> >     WARNING: CPU: 0 PID: 293 at lib/refcount.c:25
> > refcount_warn_saturate+0x108/0x174
> >     refcount_t: addition on 0; use-after-free.
> >     Modules linked in: coda_vpu imx_vdoa videobuf2_vmalloc
> > dw_hdmi_ahb_audio vcan
> >     CPU: 0 PID: 293 Comm: cansend Not tainted
> > 5.5.0-rc6-00376-g9e20dcb7040d #1
> >     Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
> >     Backtrace:
> >     [<c010f570>] (dump_backtrace) from [<c010f90c>]
> > (show_stack+0x20/0x24)
> >     [<c010f8ec>] (show_stack) from [<c0c3e1a4>] (dump_stack+0x8c/0xa0)
> >     [<c0c3e118>] (dump_stack) from [<c0127fec>] (__warn+0xe0/0x108)
> >     [<c0127f0c>] (__warn) from [<c01283c8>]
> > (warn_slowpath_fmt+0xa8/0xcc)
> >     [<c0128324>] (warn_slowpath_fmt) from [<c0539c0c>]
> > (refcount_warn_saturate+0x108/0x174)
> >     [<c0539b04>] (refcount_warn_saturate) from [<c0ad2cac>]
> > (j1939_can_recv+0x20c/0x210)
> >     [<c0ad2aa0>] (j1939_can_recv) from [<c0ac9dc8>]
> > (can_rcv_filter+0xb4/0x268)
> >     [<c0ac9d14>] (can_rcv_filter) from [<c0aca2cc>] (can_receive+0xb0/0xe4)
> >     [<c0aca21c>] (can_receive) from [<c0aca348>] (can_rcv+0x48/0x98)
> >     [<c0aca300>] (can_rcv) from [<c09b1fdc>]
> > (__netif_receive_skb_one_core+0x64/0x88)
> >     [<c09b1f78>] (__netif_receive_skb_one_core) from [<c09b2070>]
> > (__netif_receive_skb+0x38/0x94)
> >     [<c09b2038>] (__netif_receive_skb) from [<c09b2130>]
> > (netif_receive_skb_internal+0x64/0xf8)
> >     [<c09b20cc>] (netif_receive_skb_internal) from [<c09b21f8>]
> > (netif_receive_skb+0x34/0x19c)
> >     [<c09b21c4>] (netif_receive_skb) from [<c0791278>]
> > (can_rx_offload_napi_poll+0x58/0xb4)

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [5.10 CAN BUG report] kernel dump about echo skb
  2021-01-22  9:08   ` Vincent MAILHOL
@ 2021-01-22  9:23     ` Marc Kleine-Budde
  2021-01-22  9:40       ` Oleksij Rempel
  0 siblings, 1 reply; 8+ messages in thread
From: Marc Kleine-Budde @ 2021-01-22  9:23 UTC (permalink / raw)
  To: Vincent MAILHOL
  Cc: Joakim Zhang, Oleksij Rempel, Oliver Hartkopp, kernel, linux-can

[-- Attachment #1: Type: text/plain, Size: 1396 bytes --]

On Fri, Jan 22, 2021 at 06:08:18PM +0900, Vincent MAILHOL wrote:
> > > 1. Kernel dump log:
> > > [  101.688327] ------------[ cut here ]------------ [  101.692968] refcount_t:
> > > addition on 0; use-after-free.
> 
> The skb already had its refcount at zero when reaching
> can_put_echo_skb(). It is as if it was already freed/consumed!

ACK

> If you remove Oleksij’s patch, can_put_echo_skb() will probably not
> clone the skb and thus not check the refcount: this means that you
> will not see the warning, however, it does not necessarily mean that
> the bug did not occur.

ACK

> So far, it seems to me to be another bug which was invisible until
> now and which Oleksij’s patch just uncovered. But I do not yet fully
> understand what the root cause could be.

Or it's the same bug, hitting earlier. Oleksij's backtrace was in the
TX-complete path and the problem was fixes by cloning the skb in before TX.
This means the refcount of the original skb was decremented between TX and
TX-complete. Here the refcount is decremented even before TX.

Does this make sense?

regareds,
Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [5.10 CAN BUG report] kernel dump about echo skb
  2021-01-22  9:23     ` Marc Kleine-Budde
@ 2021-01-22  9:40       ` Oleksij Rempel
  2021-01-22  9:51         ` Vincent MAILHOL
  0 siblings, 1 reply; 8+ messages in thread
From: Oleksij Rempel @ 2021-01-22  9:40 UTC (permalink / raw)
  To: Marc Kleine-Budde
  Cc: Vincent MAILHOL, linux-can, Joakim Zhang, kernel, Oliver Hartkopp

On Fri, Jan 22, 2021 at 10:23:24AM +0100, Marc Kleine-Budde wrote:
> On Fri, Jan 22, 2021 at 06:08:18PM +0900, Vincent MAILHOL wrote:
> > > > 1. Kernel dump log:
> > > > [  101.688327] ------------[ cut here ]------------ [  101.692968] refcount_t:
> > > > addition on 0; use-after-free.
> > 
> > The skb already had its refcount at zero when reaching
> > can_put_echo_skb(). It is as if it was already freed/consumed!
> 
> ACK
> 
> > If you remove Oleksij’s patch, can_put_echo_skb() will probably not
> > clone the skb and thus not check the refcount: this means that you
> > will not see the warning, however, it does not necessarily mean that
> > the bug did not occur.
> 
> ACK
> 
> > So far, it seems to me to be another bug which was invisible until
> > now and which Oleksij’s patch just uncovered. But I do not yet fully
> > understand what the root cause could be.
> 
> Or it's the same bug, hitting earlier. Oleksij's backtrace was in the
> TX-complete path and the problem was fixes by cloning the skb in before TX.
> This means the refcount of the original skb was decremented between TX and
> TX-complete. Here the refcount is decremented even before TX.
> 
> Does this make sense?

Is socket was closed just before TX?

Regards,
Oleksij
-- 
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [5.10 CAN BUG report] kernel dump about echo skb
  2021-01-22  9:40       ` Oleksij Rempel
@ 2021-01-22  9:51         ` Vincent MAILHOL
  2021-01-23  7:40           ` Joakim Zhang
  0 siblings, 1 reply; 8+ messages in thread
From: Vincent MAILHOL @ 2021-01-22  9:51 UTC (permalink / raw)
  To: Oleksij Rempel
  Cc: Marc Kleine-Budde, linux-can, Joakim Zhang, kernel, Oliver Hartkopp

On Fri 22 Jan 2021 at 18:40, Oleksij Rempel <o.rempel@pengutronix.de> wrote:
> On Fri, Jan 22, 2021 at 10:23:24AM +0100, Marc Kleine-Budde wrote:
> > On Fri, Jan 22, 2021 at 06:08:18PM +0900, Vincent MAILHOL wrote:
> > > > > 1. Kernel dump log:
> > > > > [  101.688327] ------------[ cut here ]------------ [  101.692968] refcount_t:
> > > > > addition on 0; use-after-free.
> > >
> > > The skb already had its refcount at zero when reaching
> > > can_put_echo_skb(). It is as if it was already freed/consumed!
> >
> > ACK
> >
> > > If you remove Oleksij’s patch, can_put_echo_skb() will probably not
> > > clone the skb and thus not check the refcount: this means that you
> > > will not see the warning, however, it does not necessarily mean that
> > > the bug did not occur.
> >
> > ACK
> >
> > > So far, it seems to me to be another bug which was invisible until
> > > now and which Oleksij’s patch just uncovered. But I do not yet fully
> > > understand what the root cause could be.
> >
> > Or it's the same bug, hitting earlier. Oleksij's backtrace was in the
> > TX-complete path and the problem was fixes by cloning the skb in before TX.
> > This means the refcount of the original skb was decremented between TX and
> > TX-complete. Here the refcount is decremented even before TX.
> >
> > Does this make sense?

Makes absolute sense.

That would mean that Olesij’s patch addressed the issue but too late
and that somewhere earlier in the TX path, the refcount is either:
  - decremented where it shouldn’t be
  - not incremented where it should be

Assuming that the skb reaches the xmit() funcion, what should be its
refcount?
2 if the socket is still opened and 1 if the socket is closed?
Something else?

> Is socket was closed just before TX?

Good question!
In addition, I would like to know which protocol is used (raw_can, j1939,
other?)


Yours sincerely,
Vincent

> Regards,
> Oleksij
> --
> Pengutronix e.K.                           |                             |
> Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
> 31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
> Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |

^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: [5.10 CAN BUG report] kernel dump about echo skb
  2021-01-22  9:51         ` Vincent MAILHOL
@ 2021-01-23  7:40           ` Joakim Zhang
  0 siblings, 0 replies; 8+ messages in thread
From: Joakim Zhang @ 2021-01-23  7:40 UTC (permalink / raw)
  To: Vincent MAILHOL, Oleksij Rempel, Marc Kleine-Budde
  Cc: linux-can, kernel, Oliver Hartkopp


> -----Original Message-----
> From: Vincent MAILHOL <mailhol.vincent@wanadoo.fr>
> Sent: 2021年1月22日 17:52
> To: Oleksij Rempel <o.rempel@pengutronix.de>
> Cc: Marc Kleine-Budde <mkl@pengutronix.de>; linux-can@vger.kernel.org;
> Joakim Zhang <qiangqing.zhang@nxp.com>; kernel@pengutronix.de; Oliver
> Hartkopp <socketcan@hartkopp.net>
> Subject: Re: [5.10 CAN BUG report] kernel dump about echo skb
> 
> On Fri 22 Jan 2021 at 18:40, Oleksij Rempel <o.rempel@pengutronix.de>
> wrote:
> > On Fri, Jan 22, 2021 at 10:23:24AM +0100, Marc Kleine-Budde wrote:
> > > On Fri, Jan 22, 2021 at 06:08:18PM +0900, Vincent MAILHOL wrote:
> > > > > > 1. Kernel dump log:
> > > > > > [  101.688327] ------------[ cut here ]------------ [  101.692968]
> refcount_t:
> > > > > > addition on 0; use-after-free.
> > > >
> > > > The skb already had its refcount at zero when reaching
> > > > can_put_echo_skb(). It is as if it was already freed/consumed!
> > >
> > > ACK
> > >
> > > > If you remove Oleksij’s patch, can_put_echo_skb() will probably
> > > > not clone the skb and thus not check the refcount: this means that
> > > > you will not see the warning, however, it does not necessarily
> > > > mean that the bug did not occur.
> > >
> > > ACK
> > >
> > > > So far, it seems to me to be another bug which was invisible until
> > > > now and which Oleksij’s patch just uncovered. But I do not yet
> > > > fully understand what the root cause could be.
> > >
> > > Or it's the same bug, hitting earlier. Oleksij's backtrace was in
> > > the TX-complete path and the problem was fixes by cloning the skb in before
> TX.
> > > This means the refcount of the original skb was decremented between
> > > TX and TX-complete. Here the refcount is decremented even before TX.
> > >
> > > Does this make sense?
> 
> Makes absolute sense.
> 
> That would mean that Olesij’s patch addressed the issue but too late and that
> somewhere earlier in the TX path, the refcount is either:
>   - decremented where it shouldn’t be
>   - not incremented where it should be
> 
> Assuming that the skb reaches the xmit() funcion, what should be its refcount?
> 2 if the socket is still opened and 1 if the socket is closed?
> Something else?
> 
> > Is socket was closed just before TX?
> 
> Good question!
> In addition, I would like to know which protocol is used (raw_can, j1939,
> other?)



Hi Marc, Oleksij, Vincent, 

Thanks a lot for you looking into this issue. I still can't reproduce it manually, only can be reproduced when run auto scripts.

Below is test scripts for this:

1) Server side:
# Function:     test_can_02
# Description   CAN bitrate support test
# auto-manual   Partially auto
# Run time      5 min
test_can_02()
{
	RC=2
	TCID="test_can_02"
    TST_COUNT=2
    tcid_now=$TCID
	tcid_next="test_can_03"

    echo "=============================="
    echo "=     start $TCID      ="
	server_init || exit $RC
	echo $TCID >> $Log_file
    #STAND_BIT_RATES="50000 20000 10000"
    #BITRATE_DATA=("00 05 00 00" "00 02 00 00" "00 01 00 00")
    STAND_BIT_RATES="1000000 500000 250000 125000 50000 20000 10000"
    BITRATE_DATA=("01 00 00 00" "00 50 00 00" "00 25 00 00" "00 12 50 00" "00 05 00 00" "00 02 00 00" "00 01 00 00")
	local i=0
    bitrate_index=0
    for i in $STAND_BIT_RATES
    do
        echo "Now test bitrate $i"
        ifconfig $CANID down
        ip link set $CANID  type can bitrate $i || exit $RC
        ifconfig $CANID up || exit $RC
        candump $CANID > $Log_file &
        bgpid=$!
        ltime=0
        while [ $ltime -lt 10 ]; do
            over_status=$(cat $Log_file | grep "${BITRATE_DATA[$bitrate_index]}" | cut -d ']' -f '2')
            if [ -n "$over_status" ]; then
                echo "support bitrate $i"
                bitrate_index=`expr $bitrate_index + 1`
                cansend $CANID 777#1122334455667788
                kill -9 $bgpid > /dev/null 2>&1
                break
            fi
            let ltime=ltime+1
            sleep 5
        done
        if [ $ltime -eq 10 ]; then
            tst_resm TFAIL "Don't support $i bitrate !"
            kill -9 $bgpid > /dev/null 2>&1
            cat $Log_file
            exit 1
        fi
    done
    echo $tcid_next >> $Log_file

    tst_resm TPASS "test PASS"

    RC=0
    return $RC
}

2) Client side:
# Function:     test_can_02
# Description   CAN bitrate support test
# auto-manual   Partially auto
# Run time      5 min
test_can_02()
{
	RC=2
	TCID="test_can_02"
    TST_COUNT=2

    echo "=============================="
    echo "=     start $TCID      ="
	client_init || exit $RC
    kill -9 $bgpid > /dev/null 2>&1
    #STAND_BIT_RATES="50000 20000 10000"
    #BITRATE_DATA=(00050000 00020000 00010000)
    STAND_BIT_RATES="1000000 500000 250000 125000 50000 20000 10000"
    BITRATE_DATA=(01000000 00500000 00250000 00125000 00050000 00020000 00010000)
	local i=0
    bitrate_index=0
    for i in $STAND_BIT_RATES
    do
        echo "Now test bitrate $i"
        ifconfig $CANID down
        ip link set $CANID type can bitrate $i || exit $RC
        ifconfig $CANID up
        candump $CANID,777:C00007FF > $Log_file &
        cpid=$!

        candata="123#${BITRATE_DATA[$bitrate_index]}"
        #add more sleep time to make sure server side is ready before sending
        cansend $CANID $candata || exit $RC
        cansend $CANID $candata || exit $RC
        cansend $CANID $candata || exit $RC
        bitrate_index=`expr $bitrate_index + 1`

	    local timeout=10
	    over_status=$(cat $Log_file | grep 777)
	    while [ -z "$over_status" ]
	    do
            if [ $timeout -le 0 ]; then
                rm -rf $Log_file
                kill -9 $cpid > /dev/null 2>&1
                tst_resm TFAIL "test FAIL"
                return $RC
            else
                echo "wait server side to finish the calculate"
                sleep 5
                timeout=$(($timeout - 1))
            fi
	        over_status=$(cat $Log_file | grep 777)
	    done
        echo "test bitrate $i finish"
        kill -9 $cpid > /dev/null 2>&1
        rm -rf $Log_file
    done

    tst_resm TPASS "test PASS"

    RC=0
    return $RC
}

I also suspect it is the same issue as Oleksij reported. During this test case, CAN device up/down much frequently, which could trigger this problem.
What could I do to provide more info?

Best Regards,
Joakim Zhang
> 
> Yours sincerely,
> Vincent
> 
> > Regards,
> > Oleksij
> > --
> > Pengutronix e.K.                           |
> |
> > Steuerwalder Str. 21                       |
> https://eur01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.pe
> ngutronix.de%2F&amp;data=04%7C01%7Cqiangqing.zhang%40nxp.com%7C6d
> 93aec25d224120a07908d8bebb5cf5%7C686ea1d3bc2b4c6fa92cd99c5c301635
> %7C0%7C0%7C637469059210927498%7CUnknown%7CTWFpbGZsb3d8eyJWIj
> oiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C30
> 00&amp;sdata=6JQbTYQYPswNnDQlYIa%2FpRlh9tfeZktrqFInOTCVYX0%3D&a
> mp;reserved=0  |
> > 31137 Hildesheim, Germany                  | Phone:
> +49-5121-206917-0    |
> > Amtsgericht Hildesheim, HRA 2686           | Fax:
> +49-5121-206917-5555 |

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2021-01-23  7:42 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-21  8:37 [5.10 CAN BUG report] kernel dump about echo skb Joakim Zhang
2021-01-22  7:53 ` Joakim Zhang
2021-01-22  9:08   ` Vincent MAILHOL
2021-01-22  9:23     ` Marc Kleine-Budde
2021-01-22  9:40       ` Oleksij Rempel
2021-01-22  9:51         ` Vincent MAILHOL
2021-01-23  7:40           ` Joakim Zhang
2021-01-22  8:58 ` Marc Kleine-Budde

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).