All of lore.kernel.org
 help / color / mirror / Atom feed
From: Joakim Zhang <qiangqing.zhang@nxp.com>
To: Vincent MAILHOL <mailhol.vincent@wanadoo.fr>,
	Oleksij Rempel <o.rempel@pengutronix.de>,
	Marc Kleine-Budde <mkl@pengutronix.de>
Cc: "linux-can@vger.kernel.org" <linux-can@vger.kernel.org>,
	"kernel@pengutronix.de" <kernel@pengutronix.de>,
	Oliver Hartkopp <socketcan@hartkopp.net>
Subject: RE: [5.10 CAN BUG report] kernel dump about echo skb
Date: Sat, 23 Jan 2021 07:40:47 +0000	[thread overview]
Message-ID: <VI1PR04MB680067A3BD86FFFC1E381FEBE6BF0@VI1PR04MB6800.eurprd04.prod.outlook.com> (raw)
In-Reply-To: <CAMZ6RqKd-P32QPyzeeJ7J-L1agHg7HmJtZrzO8WcXCRjDw5Rpw@mail.gmail.com>


> -----Original Message-----
> From: Vincent MAILHOL <mailhol.vincent@wanadoo.fr>
> Sent: 2021年1月22日 17:52
> To: Oleksij Rempel <o.rempel@pengutronix.de>
> Cc: Marc Kleine-Budde <mkl@pengutronix.de>; linux-can@vger.kernel.org;
> Joakim Zhang <qiangqing.zhang@nxp.com>; kernel@pengutronix.de; Oliver
> Hartkopp <socketcan@hartkopp.net>
> Subject: Re: [5.10 CAN BUG report] kernel dump about echo skb
> 
> On Fri 22 Jan 2021 at 18:40, Oleksij Rempel <o.rempel@pengutronix.de>
> wrote:
> > On Fri, Jan 22, 2021 at 10:23:24AM +0100, Marc Kleine-Budde wrote:
> > > On Fri, Jan 22, 2021 at 06:08:18PM +0900, Vincent MAILHOL wrote:
> > > > > > 1. Kernel dump log:
> > > > > > [  101.688327] ------------[ cut here ]------------ [  101.692968]
> refcount_t:
> > > > > > addition on 0; use-after-free.
> > > >
> > > > The skb already had its refcount at zero when reaching
> > > > can_put_echo_skb(). It is as if it was already freed/consumed!
> > >
> > > ACK
> > >
> > > > If you remove Oleksij’s patch, can_put_echo_skb() will probably
> > > > not clone the skb and thus not check the refcount: this means that
> > > > you will not see the warning, however, it does not necessarily
> > > > mean that the bug did not occur.
> > >
> > > ACK
> > >
> > > > So far, it seems to me to be another bug which was invisible until
> > > > now and which Oleksij’s patch just uncovered. But I do not yet
> > > > fully understand what the root cause could be.
> > >
> > > Or it's the same bug, hitting earlier. Oleksij's backtrace was in
> > > the TX-complete path and the problem was fixes by cloning the skb in before
> TX.
> > > This means the refcount of the original skb was decremented between
> > > TX and TX-complete. Here the refcount is decremented even before TX.
> > >
> > > Does this make sense?
> 
> Makes absolute sense.
> 
> That would mean that Olesij’s patch addressed the issue but too late and that
> somewhere earlier in the TX path, the refcount is either:
>   - decremented where it shouldn’t be
>   - not incremented where it should be
> 
> Assuming that the skb reaches the xmit() funcion, what should be its refcount?
> 2 if the socket is still opened and 1 if the socket is closed?
> Something else?
> 
> > Is socket was closed just before TX?
> 
> Good question!
> In addition, I would like to know which protocol is used (raw_can, j1939,
> other?)



Hi Marc, Oleksij, Vincent, 

Thanks a lot for you looking into this issue. I still can't reproduce it manually, only can be reproduced when run auto scripts.

Below is test scripts for this:

1) Server side:
# Function:     test_can_02
# Description   CAN bitrate support test
# auto-manual   Partially auto
# Run time      5 min
test_can_02()
{
	RC=2
	TCID="test_can_02"
    TST_COUNT=2
    tcid_now=$TCID
	tcid_next="test_can_03"

    echo "=============================="
    echo "=     start $TCID      ="
	server_init || exit $RC
	echo $TCID >> $Log_file
    #STAND_BIT_RATES="50000 20000 10000"
    #BITRATE_DATA=("00 05 00 00" "00 02 00 00" "00 01 00 00")
    STAND_BIT_RATES="1000000 500000 250000 125000 50000 20000 10000"
    BITRATE_DATA=("01 00 00 00" "00 50 00 00" "00 25 00 00" "00 12 50 00" "00 05 00 00" "00 02 00 00" "00 01 00 00")
	local i=0
    bitrate_index=0
    for i in $STAND_BIT_RATES
    do
        echo "Now test bitrate $i"
        ifconfig $CANID down
        ip link set $CANID  type can bitrate $i || exit $RC
        ifconfig $CANID up || exit $RC
        candump $CANID > $Log_file &
        bgpid=$!
        ltime=0
        while [ $ltime -lt 10 ]; do
            over_status=$(cat $Log_file | grep "${BITRATE_DATA[$bitrate_index]}" | cut -d ']' -f '2')
            if [ -n "$over_status" ]; then
                echo "support bitrate $i"
                bitrate_index=`expr $bitrate_index + 1`
                cansend $CANID 777#1122334455667788
                kill -9 $bgpid > /dev/null 2>&1
                break
            fi
            let ltime=ltime+1
            sleep 5
        done
        if [ $ltime -eq 10 ]; then
            tst_resm TFAIL "Don't support $i bitrate !"
            kill -9 $bgpid > /dev/null 2>&1
            cat $Log_file
            exit 1
        fi
    done
    echo $tcid_next >> $Log_file

    tst_resm TPASS "test PASS"

    RC=0
    return $RC
}

2) Client side:
# Function:     test_can_02
# Description   CAN bitrate support test
# auto-manual   Partially auto
# Run time      5 min
test_can_02()
{
	RC=2
	TCID="test_can_02"
    TST_COUNT=2

    echo "=============================="
    echo "=     start $TCID      ="
	client_init || exit $RC
    kill -9 $bgpid > /dev/null 2>&1
    #STAND_BIT_RATES="50000 20000 10000"
    #BITRATE_DATA=(00050000 00020000 00010000)
    STAND_BIT_RATES="1000000 500000 250000 125000 50000 20000 10000"
    BITRATE_DATA=(01000000 00500000 00250000 00125000 00050000 00020000 00010000)
	local i=0
    bitrate_index=0
    for i in $STAND_BIT_RATES
    do
        echo "Now test bitrate $i"
        ifconfig $CANID down
        ip link set $CANID type can bitrate $i || exit $RC
        ifconfig $CANID up
        candump $CANID,777:C00007FF > $Log_file &
        cpid=$!

        candata="123#${BITRATE_DATA[$bitrate_index]}"
        #add more sleep time to make sure server side is ready before sending
        cansend $CANID $candata || exit $RC
        cansend $CANID $candata || exit $RC
        cansend $CANID $candata || exit $RC
        bitrate_index=`expr $bitrate_index + 1`

	    local timeout=10
	    over_status=$(cat $Log_file | grep 777)
	    while [ -z "$over_status" ]
	    do
            if [ $timeout -le 0 ]; then
                rm -rf $Log_file
                kill -9 $cpid > /dev/null 2>&1
                tst_resm TFAIL "test FAIL"
                return $RC
            else
                echo "wait server side to finish the calculate"
                sleep 5
                timeout=$(($timeout - 1))
            fi
	        over_status=$(cat $Log_file | grep 777)
	    done
        echo "test bitrate $i finish"
        kill -9 $cpid > /dev/null 2>&1
        rm -rf $Log_file
    done

    tst_resm TPASS "test PASS"

    RC=0
    return $RC
}

I also suspect it is the same issue as Oleksij reported. During this test case, CAN device up/down much frequently, which could trigger this problem.
What could I do to provide more info?

Best Regards,
Joakim Zhang
> 
> Yours sincerely,
> Vincent
> 
> > Regards,
> > Oleksij
> > --
> > Pengutronix e.K.                           |
> |
> > Steuerwalder Str. 21                       |
> https://eur01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.pe
> ngutronix.de%2F&amp;data=04%7C01%7Cqiangqing.zhang%40nxp.com%7C6d
> 93aec25d224120a07908d8bebb5cf5%7C686ea1d3bc2b4c6fa92cd99c5c301635
> %7C0%7C0%7C637469059210927498%7CUnknown%7CTWFpbGZsb3d8eyJWIj
> oiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C30
> 00&amp;sdata=6JQbTYQYPswNnDQlYIa%2FpRlh9tfeZktrqFInOTCVYX0%3D&a
> mp;reserved=0  |
> > 31137 Hildesheim, Germany                  | Phone:
> +49-5121-206917-0    |
> > Amtsgericht Hildesheim, HRA 2686           | Fax:
> +49-5121-206917-5555 |

  reply	other threads:[~2021-01-23  7:42 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-01-21  8:37 [5.10 CAN BUG report] kernel dump about echo skb Joakim Zhang
2021-01-22  7:53 ` Joakim Zhang
2021-01-22  9:08   ` Vincent MAILHOL
2021-01-22  9:23     ` Marc Kleine-Budde
2021-01-22  9:40       ` Oleksij Rempel
2021-01-22  9:51         ` Vincent MAILHOL
2021-01-23  7:40           ` Joakim Zhang [this message]
2021-01-22  8:58 ` Marc Kleine-Budde

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=VI1PR04MB680067A3BD86FFFC1E381FEBE6BF0@VI1PR04MB6800.eurprd04.prod.outlook.com \
    --to=qiangqing.zhang@nxp.com \
    --cc=kernel@pengutronix.de \
    --cc=linux-can@vger.kernel.org \
    --cc=mailhol.vincent@wanadoo.fr \
    --cc=mkl@pengutronix.de \
    --cc=o.rempel@pengutronix.de \
    --cc=socketcan@hartkopp.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.