linux-can.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* mcp251xfd: Bad message receiption
@ 2022-12-22  7:44 Stefan Althöfer
  2022-12-22  9:07 ` Thomas.Kopp
  2022-12-22 14:51 ` mcp251xfd: Bad message receiption Marc Kleine-Budde
  0 siblings, 2 replies; 23+ messages in thread
From: Stefan Althöfer @ 2022-12-22  7:44 UTC (permalink / raw)
  To: linux-can

[-- Attachment #1: Type: text/plain, Size: 1664 bytes --]

Dear sirs,

I'm using the MCP2518FD CAN controller with Rasperry PI 4 and are suffering from wrong data reception.

My hardware is a Model 4 B with MCP2581 Click board(s) attached with a prototype (hand-made) interface PCB, as well
as baseboards for a CM4 with a single MCP2518 on it. The error occurs on both of them.

I have reduced my test case to a simple single thread self-receipt test:
    * TX two messages
    * Wait for RX and send out a new message on every receipt
    * TX for messages in total

Refer to the attached PDF for some error cases. Last send frames are at the top of the logs. You can see that wrong 
messages appear in the RX queue, which have been successfully transmitted in previous test loop. The data that is actually sent 
out is correct however (checked with an external logger for some cases).

The error appeared very often (every 1..10 minutes) when I was using kernel raspberrypi 5.15.32-v7l+ and reduced to
once in several hours when I switched to raspberrypi 6.0.12-v7l+. BTW I modified the loopback to external loopback
To check for correct transmission. No other changes to the driver as taken from the Raspbian repository.

I see infrequent mcp251xfd CRC read errors. I think those are due to the 2518 SPI errata. However they don't
occur at the time when the wrong messages are received (refer to the PDF).

Questions:
- Is this a known problem?
- What did change from 5.15.32 to 6.0.12 to reduce the error rate? Maybe this is just because of a change in execution timing?
- Any suggestion how I can step further in fixing this issue.

-----
Stefan Althöfer (stefan.althoefer@janztec.com)


[-- Attachment #2: can_mcp2518fd_fail.pdf --]
[-- Type: application/pdf, Size: 59680 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: mcp251xfd: Bad message receiption
  2022-12-22  7:44 mcp251xfd: Bad message receiption Stefan Althöfer
@ 2022-12-22  9:07 ` Thomas.Kopp
  2022-12-22 10:30   ` AW: " Stefan Althöfer
  2022-12-22 14:51 ` mcp251xfd: Bad message receiption Marc Kleine-Budde
  1 sibling, 1 reply; 23+ messages in thread
From: Thomas.Kopp @ 2022-12-22  9:07 UTC (permalink / raw)
  To: Stefan.Althoefer; +Cc: linux-can

Hi Stefan,

> I have reduced my test case to a simple single thread self-receipt test:
>     * TX two messages
>     * Wait for RX and send out a new message on every receipt
>     * TX for messages in total
> 
> Refer to the attached PDF for some error cases. Last send frames are at the
> top of the logs. You can see that wrong
> messages appear in the RX queue, which have been successfully transmitted
> in previous test loop. The data that is actually sent
> out is correct however (checked with an external logger for some cases).

Do I read the pdf correctly (based on the /var/log stuff) that you have two MCP2518FD connected to a Pi4B and both of them are running in internal/external loopback mode no interaction between them and the SPIs are separate?
What are your CAN interface settings? Would it be possible to share the script?


> I see infrequent mcp251xfd CRC read errors. I think those are due to the 2518
> SPI errata. However they don't
> occur at the time when the wrong messages are received (refer to the PDF).
Correct, this shouldn't be related to your problem.

 
> - Any suggestion how I can step further in fixing this issue.
One thing would be to dump the RAM i.e. the content of the fifos itself to see whether the device actually has the incorrect frames. Marc wrote a tool to dump registers and RAM via debugfs:
https://github.com/linux-can/can-utils/blob/master/mcp251xfd/mcp251xfd-dump.c

For this debugfs needs to be enabled and mounted (e.g. $mount -t debugfs none /sys/kernel/debug)

Now the registers can be dumped like this: cat /sys/kernel/debug/regmap/spi0.0-crc/registers

So I'd suggest to abort the script after the first error occurred and then dump registers/ram to find the RX fifo in question and check the content.

Best Regards,
Thomas

^ permalink raw reply	[flat|nested] 23+ messages in thread

* AW: mcp251xfd: Bad message receiption
  2022-12-22  9:07 ` Thomas.Kopp
@ 2022-12-22 10:30   ` Stefan Althöfer
  2022-12-22 13:02     ` Marc Kleine-Budde
  0 siblings, 1 reply; 23+ messages in thread
From: Stefan Althöfer @ 2022-12-22 10:30 UTC (permalink / raw)
  To: Thomas.Kopp; +Cc: linux-can

Hi Thomas,

> Do I read the pdf correctly (based on the /var/log stuff) that you have two MCP2518FD connected to a Pi4B and 
> both of them are running in internal/external loopback mode no interaction between them and the SPIs are separate?

Yes. For the loopback test the CANs are separate. Errors also occur when sending messages between
the controllers, but I think that is more difficult to analyze.

root@raspberrypi:~# ip -d -s a s can0
4: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UP group default qlen 1000
    link/can  promiscuity 0 minmtu 0 maxmtu 0
    can <LOOPBACK,BERR-REPORTING,FD> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0
          bitrate 1000000 sample-point 0.800
          tq 25 prop-seg 15 phase-seg1 16 phase-seg2 8 sjw 6
          mcp251xfd: tseg1 2..256 tseg2 1..128 sjw 1..128 brp 1..256 brp-inc 1
          dbitrate 4000000 dsample-point 0.800
          dtq 25 dprop-seg 3 dphase-seg1 4 dphase-seg2 2 dsjw 2
          mcp251xfd: dtseg1 1..32 dtseg2 1..16 dsjw 1..16 dbrp 1..256 dbrp-inc 1
          clock 40000000
          re-started bus-errors arbit-lost error-warn error-pass bus-off
          0          0          0          0          0          0         numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
    RX: bytes  packets  errors  dropped missed  mcast
    3097429872 2517466658 0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    3696197680 1258733264 0       0       0       0
root@raspberrypi:~# ip -d -s a s can1
5: can1: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UP group default qlen 1000
    link/can  promiscuity 0 minmtu 0 maxmtu 0
    can <LOOPBACK,BERR-REPORTING,FD> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0
          bitrate 1000000 sample-point 0.800
          tq 25 prop-seg 15 phase-seg1 16 phase-seg2 8 sjw 6
          mcp251xfd: tseg1 2..256 tseg2 1..128 sjw 1..128 brp 1..256 brp-inc 1
          dbitrate 4000000 dsample-point 0.800
          dtq 25 dprop-seg 3 dphase-seg1 4 dphase-seg2 2 dsjw 2
          mcp251xfd: dtseg1 1..32 dtseg2 1..16 dsjw 1..16 dbrp 1..256 dbrp-inc 1
          clock 40000000
          re-started bus-errors arbit-lost error-warn error-pass bus-off
          0          112256     0          1          3          1         numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
    RX: bytes  packets  errors  dropped missed  mcast
    2590649888 773910314 3       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    3442619696 386944310 112254  12      0       0

This is the SPI setup for the dual CAN PI:

dtparam=spi=on
dtoverlay=spi6-1cs
# mcp251xfd DTS for RPI4 default CAN on SPI6.0
dtoverlay=mcp251xfd-spi6-0,interrupt_pin=25,oscillator=40000000
# mcp251xfd DTS for RPI4 CAN extension on SPI0.0
dtoverlay=mcp251xfd,spi0-0,interrupt=16,oscillator=40000000

Or did you mean something else with "script"?

I'll try the register dump when I suffer the next error ;-)

Mfg
Stefan

-----Ursprüngliche Nachricht-----
Von: Thomas.Kopp@microchip.com <Thomas.Kopp@microchip.com> 
Gesendet: Donnerstag, 22. Dezember 2022 10:07
An: Stefan Althöfer <Stefan.Althoefer@janztec.com>
Cc: linux-can@vger.kernel.org
Betreff: RE: mcp251xfd: Bad message receiption

Hi Stefan,

> I have reduced my test case to a simple single thread self-receipt test:
>     * TX two messages
>     * Wait for RX and send out a new message on every receipt
>     * TX for messages in total
> 
> Refer to the attached PDF for some error cases. Last send frames are 
> at the top of the logs. You can see that wrong messages appear in the 
> RX queue, which have been successfully transmitted in previous test 
> loop. The data that is actually sent out is correct however (checked 
> with an external logger for some cases).

Do I read the pdf correctly (based on the /var/log stuff) that you have two MCP2518FD connected to a Pi4B and both of them are running in internal/external loopback mode no interaction between them and the SPIs are separate?
What are your CAN interface settings? Would it be possible to share the script?


> I see infrequent mcp251xfd CRC read errors. I think those are due to 
> the 2518 SPI errata. However they don't occur at the time when the 
> wrong messages are received (refer to the PDF).
Correct, this shouldn't be related to your problem.

 
> - Any suggestion how I can step further in fixing this issue.
One thing would be to dump the RAM i.e. the content of the fifos itself to see whether the device actually has the incorrect frames. Marc wrote a tool to dump registers and RAM via debugfs:
https://github.com/linux-can/can-utils/blob/master/mcp251xfd/mcp251xfd-dump.c

For this debugfs needs to be enabled and mounted (e.g. $mount -t debugfs none /sys/kernel/debug)

Now the registers can be dumped like this: cat /sys/kernel/debug/regmap/spi0.0-crc/registers

So I'd suggest to abort the script after the first error occurred and then dump registers/ram to find the RX fifo in question and check the content.

Best Regards,
Thomas

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: AW: mcp251xfd: Bad message receiption
  2022-12-22 10:30   ` AW: " Stefan Althöfer
@ 2022-12-22 13:02     ` Marc Kleine-Budde
  2022-12-23  8:18       ` AW: " Stefan Althöfer
  0 siblings, 1 reply; 23+ messages in thread
From: Marc Kleine-Budde @ 2022-12-22 13:02 UTC (permalink / raw)
  To: Stefan Althöfer; +Cc: Thomas.Kopp, linux-can

[-- Attachment #1: Type: text/plain, Size: 3692 bytes --]

On 22.12.2022 10:30:22, Stefan Althöfer wrote:
> Hi Thomas,
> 
> > Do I read the pdf correctly (based on the /var/log stuff) that you have two MCP2518FD connected to a Pi4B and 
> > both of them are running in internal/external loopback mode no interaction between them and the SPIs are separate?
> 
> Yes. For the loopback test the CANs are separate. Errors also occur when sending messages between
> the controllers, but I think that is more difficult to analyze.
> 
> root@raspberrypi:~# ip -d -s a s can0
> 4: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UP group default qlen 1000
>     link/can  promiscuity 0 minmtu 0 maxmtu 0
>     can <LOOPBACK,BERR-REPORTING,FD> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0
>           bitrate 1000000 sample-point 0.800
>           tq 25 prop-seg 15 phase-seg1 16 phase-seg2 8 sjw 6
>           mcp251xfd: tseg1 2..256 tseg2 1..128 sjw 1..128 brp 1..256 brp-inc 1
>           dbitrate 4000000 dsample-point 0.800
>           dtq 25 dprop-seg 3 dphase-seg1 4 dphase-seg2 2 dsjw 2
>           mcp251xfd: dtseg1 1..32 dtseg2 1..16 dsjw 1..16 dbrp 1..256 dbrp-inc 1
>           clock 40000000
>           re-started bus-errors arbit-lost error-warn error-pass bus-off
>           0          0          0          0          0          0         numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
>     RX: bytes  packets  errors  dropped missed  mcast
>     3097429872 2517466658 0       0       0       0
>     TX: bytes  packets  errors  dropped carrier collsns
>     3696197680 1258733264 0       0       0       0
> root@raspberrypi:~# ip -d -s a s can1
> 5: can1: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UP group default qlen 1000
>     link/can  promiscuity 0 minmtu 0 maxmtu 0
>     can <LOOPBACK,BERR-REPORTING,FD> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0
>           bitrate 1000000 sample-point 0.800
>           tq 25 prop-seg 15 phase-seg1 16 phase-seg2 8 sjw 6
>           mcp251xfd: tseg1 2..256 tseg2 1..128 sjw 1..128 brp 1..256 brp-inc 1
>           dbitrate 4000000 dsample-point 0.800
>           dtq 25 dprop-seg 3 dphase-seg1 4 dphase-seg2 2 dsjw 2
>           mcp251xfd: dtseg1 1..32 dtseg2 1..16 dsjw 1..16 dbrp 1..256 dbrp-inc 1
>           clock 40000000
>           re-started bus-errors arbit-lost error-warn error-pass bus-off
>           0          112256     0          1          3          1         numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
                       ^^^^^^                ^          ^          ^
>     RX: bytes  packets  errors  dropped missed  mcasta
>     2590649888 773910314 3       0       0       0
>     TX: bytes  packets  errors  dropped carrier collsns
>     3442619696 386944310 112254  12      0       0

Why do you have so many errors on can1? The device event went into bus-off.

> This is the SPI setup for the dual CAN PI:
> 
> dtparam=spi=on
> dtoverlay=spi6-1cs
> # mcp251xfd DTS for RPI4 default CAN on SPI6.0
> dtoverlay=mcp251xfd-spi6-0,interrupt_pin=25,oscillator=40000000
> # mcp251xfd DTS for RPI4 CAN extension on SPI0.0
> dtoverlay=mcp251xfd,spi0-0,interrupt=16,oscillator=40000000
> 
> Or did you mean something else with "script"?

Can you share your "scselftest" program?

> I'll try the register dump when I suffer the next error ;-)

Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: mcp251xfd: Bad message receiption
  2022-12-22  7:44 mcp251xfd: Bad message receiption Stefan Althöfer
  2022-12-22  9:07 ` Thomas.Kopp
@ 2022-12-22 14:51 ` Marc Kleine-Budde
  1 sibling, 0 replies; 23+ messages in thread
From: Marc Kleine-Budde @ 2022-12-22 14:51 UTC (permalink / raw)
  To: Stefan Althöfer; +Cc: linux-can

[-- Attachment #1: Type: text/plain, Size: 3138 bytes --]

On 22.12.2022 07:44:59, Stefan Althöfer wrote:
> I'm using the MCP2518FD CAN controller with Rasperry PI 4 and are
> suffering from wrong data reception.
> 
> My hardware is a Model 4 B with MCP2581 Click board(s) attached with a
> prototype (hand-made) interface PCB, as well as baseboards for a CM4
> with a single MCP2518 on it. The error occurs on both of them.

Do you get the same error if the test is only running on a single
mcp2518fd?

Do you get the same error if only a single mcp2518fd is attached to the
raspi?

> I have reduced my test case to a simple single thread self-receipt test:
>     * TX two messages
>     * Wait for RX and send out a new message on every receipt
>     * TX for messages in total

Can you share the config/scripts/... to set up the interfaces?
Can you share the test code?

> Refer to the attached PDF for some error cases.

The text in the PDF is rendered down to paths only. So not select- nor
searchable, please send plain text files for further logs.

> Last send frames are at the top of the logs. You can see that wrong
> messages appear in the RX queue, which have been successfully
> transmitted in previous test loop. The data that is actually sent out
> is correct however (checked with an external logger for some cases).

Looking at the first page, there are 2 identical lines marked in red.
The next page show 4 red lines.

So under certain load situations you see that old messages are received
twice. If I remember correctly I heard something similar from someone
else before. It occurred with a modified driver. For various reasons
that has never debugged. I think I've fixed the root cause of the
problem during the v5.18 development cycle.

> The error appeared very often (every 1..10 minutes) when I was using
> kernel raspberrypi 5.15.32-v7l+ and reduced to once in several hours
> when I switched to raspberrypi 6.0.12-v7l+. BTW I modified the
> loopback to external loopback To check for correct transmission. No
> other changes to the driver as taken from the Raspbian repository.

Can you share your kernel, including the modifications?

> I see infrequent mcp251xfd CRC read errors. I think those are due to
> the 2518 SPI errata. However they don't occur at the time when the
> wrong messages are received (refer to the PDF).
> 
> Questions:
> - Is this a known problem?

No - it should not happen on any mainline kernel.

> - What did change from 5.15.32 to 6.0.12 to reduce the error rate?
>   Maybe this is just because of a change in execution timing?

Quite a lot: Next to probably unrelated changes (e.g. PLL support), the
workarounds for CRC errors were updated, in v5.18 multiple RX FIFOs are
used and IRQ coalescing support has been added.

> - Any suggestion how I can step further in fixing this issue.

regards,
Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* AW: AW: mcp251xfd: Bad message receiption
  2022-12-22 13:02     ` Marc Kleine-Budde
@ 2022-12-23  8:18       ` Stefan Althöfer
  2022-12-23 19:25         ` Marc Kleine-Budde
  2022-12-24 15:57         ` Marc Kleine-Budde
  0 siblings, 2 replies; 23+ messages in thread
From: Stefan Althöfer @ 2022-12-23  8:18 UTC (permalink / raw)
  To: Marc Kleine-Budde; +Cc: Thomas.Kopp, linux-can

[-- Attachment #1: Type: text/plain, Size: 18302 bytes --]

> Why do you have so many errors on can1? The device event went into bus-off.

The device was not freshly started, and I just intended to share the setup. No errors when the problem appeared this night.

> Can you share your "scselftest" program?

Sure, attached as extracted from the build system for my test tools. Hope this is not the bad guy. Included my ancient copy of libsocketcan. 

> Do you get the same error if the test is only running on a single mcp2518fd?

I'm pretty sure I had cases where I started the test on both, one failed sooner and one failed later. However, currently the second test is
still running (after 9h).

> Do you get the same error if only a single mcp2518fd is attached to the raspi?

Only tested this with the CM4

> Can you share your kernel, including the modifications?

Yes if it turns out to be necessary, but I just cloned it from here: git clone --depth=1 --branch <branch> https://github.com/raspberrypi/linux
Following these instructions: https://www.raspberrypi.com/documentation/computers/linux_kernel.html#kernel-headers

Now this nights error log:

Loop 4429 2022-12-23T03:07:45 last+4s total:4h299m
+ ./sctestself -b -n 4 -l 999 -t 2 -v cmperr,logmsg -F refilldata,leastdots,allowintloopb,stoponerror -d 8 can1
CAN selftest can1 .......idx    2 tx:2A5#00 02 6A 8F 6C 81 C3 B0
         rx:2A5#00 00 85 9A EF 54 34 5D
idx    3 tx:2A5#00 03 4F 8F 85 59 8E DA
         rx:2A5#00 00 85 9A EF 54 34 5D
.c (700) fail
0: RX (004/003)  2A5#00 00 85 9A EF 54 34 5D
0: RX (004/002)  2A5#00 00 85 9A EF 54 34 5D
0: TX (003/002)  2A5#00 03 4F 8F 85 59 8E DA
0: RX (003/001)  2A5#00 01 7C 30 5C 7A 2D 1C
0: TX (002/001)  2A5#00 02 6A 8F 6C 81 C3 B0
0: RX (002/000)  2A5#00 00 AC AF 70 9B 32 03
0: TX (001/000)  2A5#00 01 7C 30 5C 7A 2D 1C
0: TX (000/000)  2A5#00 00 AC AF 70 9B 32 03
0: RX (004/003)  2A5#00 03 65 32 F8 E7 E6 F5
0: RX (004/002)  2A5#00 02 CC D5 09 FE 55 99
0: TX (003/002)  2A5#00 03 65 32 F8 E7 E6 F5
0: RX (003/001)  2A5#00 01 62 4F 17 34 44 CE
0: TX (002/001)  2A5#00 02 CC D5 09 FE 55 99
0: RX (002/000)  2A5#00 00 FE 58 FE FD C0 EF
0: TX (001/000)  2A5#00 01 62 4F 17 34 44 CE
0: TX (000/000)  2A5#00 00 FE 58 FE FD C0 EF
0: RX (004/003)  2A5#00 03 C9 7B EF 83 93 E0
0: RX (004/002)  2A5#00 02 DD 5E 82 B4 FD C4
0: TX (003/002)  2A5#00 03 C9 7B EF 83 93 E0
0: RX (003/001)  2A5#00 01 90 3B 32 80 90 67
0: TX (002/001)  2A5#00 02 DD 5E 82 B4 FD C4
0: RX (002/000)  2A5#00 00 8F 28 D2 F5 B6 98
0: TX (001/000)  2A5#00 01 90 3B 32 80 90 67
0: TX (000/000)  2A5#00 00 8F 28 D2 F5 B6 98
0: RX (004/003)  2A5#00 03 C5 1E FE 68 F1 65
0: RX (004/002)  2A5#00 02 7E 2B BA 18 F1 7B
0: TX (003/002)  2A5#00 03 C5 1E FE 68 F1 65
0: RX (003/001)  2A5#00 01 CE 1B D7 9E 42 15
0: TX (002/001)  2A5#00 02 7E 2B BA 18 F1 7B
0: RX (002/000)  2A5#00 00 85 9A EF 54 34 5D
0: TX (001/000)  2A5#00 01 CE 1B D7 9E 42 15
0: TX (000/000)  2A5#00 00 85 9A EF 54 34 5D
0: RX (004/003)  2A5#00 03 37 6D 66 8E C6 9B
0: RX (004/002)  2A5#00 02 2D 83 A3 D3 66 27
0: TX (003/002)  2A5#00 03 37 6D 66 8E C6 9B
0: RX (003/001)  2A5#00 01 E8 A5 9A C6 C1 AC
0: TX (002/001)  2A5#00 02 2D 83 A3 D3 66 27
0: RX (002/000)  2A5#00 00 E6 7A D0 27 3E DF
0: TX (001/000)  2A5#00 01 E8 A5 9A C6 C1 AC
0: TX (000/000)  2A5#00 00 E6 7A D0 27 3E DF
0: RX (004/003)  2A5#00 03 D4 54 CF 9A 6D 7A
0: RX (004/002)  2A5#00 02 06 3F 57 59 35 F6
0: TX (003/002)  2A5#00 03 D4 54 CF 9A 6D 7A
0: RX (003/001)  2A5#00 01 C1 F6 A6 E3 84 64
0: TX (002/001)  2A5#00 02 06 3F 57 59 35 F6
Stopped. Ctrl-C to exit.

And the register dump

000: 05ab0760
004: 001e0705
008: 00060101
00c: 00020700
010: dcd75c83
014: 00010000
018: 40400040
01c: bf1a0004
020: 00000000
024: 00000000
028: 00000000
02c: 00000000
030: 00000000
034: 00000000
038: 00000000
03c: 00000af4
040: 03000029
044: 00000000
048: 00000000
04c: 00000000
050: 00600080
054: 00000005
058: 00000030
05c: ef000029
060: 00000400
064: 00000160
068: e3000029
06c: 00000000
070: 000004f0
074: e3600090
078: 00000007
07c: 00000620
080: 00600000
084: 00000000
088: 00000740
08c: 00600000
090: 00000000
094: 00000750
098: 00600000
09c: 00000000
0a0: 00000760
0a4: 00600000
0a8: 00000000
0ac: 00000770
0b0: 00600000
0b4: 00000000
0b8: 00000780
0bc: 00600000
0c0: 00000000
0c4: 00000790
0c8: 00600000
0cc: 00000000
0d0: 000007a0
0d4: 00600000
0d8: 00000000
0dc: 000007b0
0e0: 00600000
0e4: 00000000
0e8: 000007c0
0ec: 00600000
0f0: 00000000
0f4: 000007d0
0f8: 00600000
0fc: 00000000
100: 000007e0
104: 00600000
108: 00000000
10c: 000007f0
110: 00600000
114: 00000000
118: 00000800
11c: 00600000
120: 00000000
124: 00000810
128: 00600000
12c: 00000000
130: 00000820
134: 00600000
138: 00000000
13c: 00000830
140: 00600000
144: 00000000
148: 00000840
14c: 00600000
150: 00000000
154: 00000850
158: 00600000
15c: 00000000
160: 00000860
164: 00600000
168: 00000000
16c: 00000870
170: 00600000
174: 00000000
178: 00000880
17c: 00600000
180: 00000000
184: 00000890
188: 00600000
18c: 00000000
190: 000008a0
194: 00600000
198: 00000000
19c: 000008b0
1a0: 00600000
1a4: 00000000
1a8: 000008c0
1ac: 00600000
1b0: 00000000
1b4: 000008d0
1b8: 00600000
1bc: 00000000
1c0: 000008e0
1c4: 00600000
1c8: 00000000
1cc: 000008f0
1d0: 00008281
1d4: 00000000
1d8: 00000000
1dc: 00000000
1e0: 00000000
1e4: 00000000
1e8: 00000000
1ec: 00000000
1f0: 00000000
1f4: 00000000
1f8: 00000000
1fc: 00000000
200: 00000000
204: 00000000
208: 00000000
20c: 00000000
210: 00000000
214: 00000000
218: 00000000
21c: 00000000
220: 00000000
224: 00000000
228: 00000000
22c: 00000000
230: 00000000
234: 00000000
238: 00000000
23c: 00000000
240: 00000000
244: 00000000
248: 00000000
24c: 00000000
250: 00000000
254: 00000000
258: 00000000
25c: 00000000
260: 00000000
264: 00000000
268: 00000000
26c: 00000000
270: 00000000
274: 00000000
278: 00000000
27c: 00000000
280: 00000000
284: 00000000
288: 00000000
28c: 00000000
290: 00000000
294: 00000000
298: 00000000
29c: 00000000
2a0: 00000000
2a4: 00000000
2a8: 00000000
2ac: 00000000
2b0: 00000000
2b4: 00000000
2b8: 00000000
2bc: 00000000
2c0: 00000000
2c4: 00000000
2c8: 00000000
2cc: 00000000
2d0: 00000000
2d4: 00000000
2d8: 00000000
2dc: 00000000
2e0: 00000000
2e4: 00000000
2e8: 00000000
2ec: 00000000
400: 000002a5
404: 0015e0c8
408: 018add13
40c: 000002a5
410: 0015e2c8
414: 018ae58d
418: 000002a5
41c: 0015e4c8
420: 018b2543
424: 000002a5
428: 0015e6c8
42c: 018b2efd
430: 000002a5
434: 000000c8
438: 018add13
43c: afac0000
440: 03329b70
444: 00000000
448: 00000000
44c: 00000000
450: 00000000
454: 00000000
458: 00000000
45c: 00000000
460: 00000000
464: 00000000
468: 00000000
46c: 00000000
470: 00000000
474: 00000000
478: 00000000
47c: 000002a5
480: 000000c8
484: 018ae58d
488: 307c0100
48c: 1c2d7a5c
490: 00000000
494: 00000000
498: 00000000
49c: 00000000
4a0: 00000000
4a4: 00000000
4a8: 00000000
4ac: 00000000
4b0: 00000000
4b4: 00000000
4b8: 00000000
4bc: 00000000
4c0: 00000000
4c4: 00000000
4c8: 000002a5
4cc: 000000c8
4d0: 018b2543
4d4: 8f6a0200
4d8: b0c3816c
4dc: 00000000
4e0: 00000000
4e4: 00000000
4e8: 00000000
4ec: 00000000
4f0: 00000000
4f4: 00000000
4f8: 00000000
4fc: 00000000
500: 00000000
504: 00000000
508: 00000000
50c: 00000000
510: 00000000
514: 000002a5
518: 000000c8
51c: 018b2efd
520: 8f4f0300
524: da8e5985
528: 00000000
52c: 00000000
530: 00000000
534: 00000000
538: 00000000
53c: 00000000
540: 00000000
544: 00000000
548: 00000000
54c: 00000000
550: 00000000
554: 00000000
558: 00000000
55c: 00000000
560: 000002a5
564: 000000c8
568: 01890d0d
56c: 9a850000
570: 5d3454ef
574: 00000000
578: 00000000
57c: 00000000
580: 00000000
584: 00000000
588: 00000000
58c: 00000000
590: 00000000
594: 00000000
598: 00000000
59c: 00000000
5a0: 00000000
5a4: 00000000
5a8: 00000000
5ac: 000002a5
5b0: 000000c8
5b4: 0189157d
5b8: 1bce0100
5bc: 15429ed7
5c0: 00000000
5c4: 00000000
5c8: 00000000
5cc: 00000000
5d0: 00000000
5d4: 00000000
5d8: 00000000
5dc: 00000000
5e0: 00000000
5e4: 00000000
5e8: 00000000
5ec: 00000000
5f0: 00000000
5f4: 00000000
5f8: 000002a5
5fc: 000000c8
600: 0189562d
604: 2b7e0200
608: 7bf118ba
60c: 00000000
610: 00000000
614: 00000000
618: 00000000
61c: 00000000
620: 00000000
624: 00000000
628: 00000000
62c: 00000000
630: 00000000
634: 00000000
638: 00000000
63c: 00000000
640: 00000000
644: 000002a5
648: 000000c8
64c: 01895f97
650: 1ec50300
654: 65f168fe
658: 00000000
65c: 00000000
660: 00000000
664: 00000000
668: 00000000
66c: 00000000
670: 00000000
674: 00000000
678: 00000000
67c: 00000000
680: 00000000
684: 00000000
688: 00000000
68c: 00000000
690: 000002a5
694: 000000c8
698: 0189a529
69c: 288f0000
6a0: 98b6f5d2
6a4: 00000000
6a8: 00000000
6ac: 00000000
6b0: 00000000
6b4: 00000000
6b8: 00000000
6bc: 00000000
6c0: 00000000
6c4: 00000000
6c8: 00000000
6cc: 00000000
6d0: 00000000
6d4: 00000000
6d8: 00000000
6dc: 000002a5
6e0: 000000c8
6e4: 0189ae61
6e8: 3b900100
6ec: 67908032
6f0: 00000000
6f4: 00000000
6f8: 00000000
6fc: 00000000
700: 00000000
704: 00000000
708: 00000000
70c: 00000000
710: 00000000
714: 00000000
718: 00000000
71c: 00000000
720: 00000000
724: 00000000
728: 000002a5
72c: 000000c8
730: 0189f1d7
734: 5edd0200
738: c4fdb482
73c: 00000000
740: 00000000
744: 00000000
748: 00000000
74c: 00000000
750: 00000000
754: 00000000
758: 00000000
75c: 00000000
760: 00000000
764: 00000000
768: 00000000
76c: 00000000
770: 00000000
774: 000002a5
778: 000000c8
77c: 0189fb73
780: 7bc90300
784: e09383ef
788: 00000000
78c: 00000000
790: 00000000
794: 00000000
798: 00000000
79c: 00000000
7a0: 00000000
7a4: 00000000
7a8: 00000000
7ac: 00000000
7b0: 00000000
7b4: 00000000
7b8: 00000000
7bc: 00000000
7c0: 000002a5
7c4: 000000c8
7c8: 018a40dd
7cc: 58fe0000
7d0: efc0fdfe
7d4: 00000000
7d8: 00000000
7dc: 00000000
7e0: 00000000
7e4: 00000000
7e8: 00000000
7ec: 00000000
7f0: 00000000
7f4: 00000000
7f8: 00000000
7fc: 00000000
800: 00000000
804: 00000000
808: 00000000
80c: 000002a5
810: 000000c8
814: 018a4bcd
818: 4f620100
81c: ce443417
820: 00000000
824: 00000000
828: 00000000
82c: 00000000
830: 00000000
834: 00000000
838: 00000000
83c: 00000000
840: 00000000
844: 00000000
848: 00000000
84c: 00000000
850: 00000000
854: 00000000
858: 000002a5
85c: 000000c8
860: 018a8bb5
864: d5cc0200
868: 9955fe09
86c: 00000000
870: 00000000
874: 00000000
878: 00000000
87c: 00000000
880: 00000000
884: 00000000
888: 00000000
88c: 00000000
890: 00000000
894: 00000000
898: 00000000
89c: 00000000
8a0: 00000000
8a4: 000002a5
8a8: 000000c8
8ac: 018a9547
8b0: 32650300
8b4: f5e6e7f8
8b8: 00000000
8bc: 00000000
8c0: 00000000
8c4: 00000000
8c8: 00000000
8cc: 00000000
8d0: 00000000
8d4: 00000000
8d8: 00000000
8dc: 00000000
8e0: 00000000
8e4: 00000000
8e8: 00000000
8ec: 00000000
8f0: 00000000
8f4: 00000000
8f8: 00000000
8fc: 00000000
900: 00000000
904: 00000000
908: 00000000
90c: 00000000
910: 00000000
914: 00000000
918: 00000000
91c: 00000000
920: 00000000
924: 00000000
928: 00000000
92c: 00000000
930: 00000000
934: 00000000
938: 00000000
93c: 00000000
940: 00000000
944: 00000000
948: 00000000
94c: 00000000
950: 00000000
954: 00000000
958: 00000000
95c: 00000000
960: 00000000
964: 00000000
968: 00000000
96c: 00000000
970: 00000000
974: 00000000
978: 00000000
97c: 00000000
980: 00000000
984: 00000000
988: 00000000
98c: 00000000
990: 00000000
994: 00000000
998: 00000000
99c: 00000000
9a0: 00000000
9a4: 00000000
9a8: 00000000
9ac: 00000000
9b0: 00000000
9b4: 00000000
9b8: 00000000
9bc: 00000000
9c0: 00000000
9c4: 00000000
9c8: 00000000
9cc: 00000000
9d0: 00000000
9d4: 00000000
9d8: 00000000
9dc: 00000000
9e0: 00000000
9e4: 00000000
9e8: 00000000
9ec: 00000000
9f0: 00000000
9f4: 00000000
9f8: 00000000
9fc: 00000000
a00: 00000000
a04: 00000000
a08: 00000000
a0c: 00000000
a10: 00000000
a14: 00000000
a18: 00000000
a1c: 00000000
a20: 000002a5
a24: 0015e0c8
a28: afac0000
a2c: 03329b70
a30: 00000000
a34: 00000000
a38: 00000000
a3c: 00000000
a40: 00000000
a44: 00000000
a48: 00000000
a4c: 00000000
a50: 00000000
a54: 00000000
a58: 00000000
a5c: 00000000
a60: 00000000
a64: 00000000
a68: 000002a5
a6c: 0015e2c8
a70: 307c0100
a74: 1c2d7a5c
a78: 00000000
a7c: 00000000
a80: 00000000
a84: 00000000
a88: 00000000
a8c: 00000000
a90: 00000000
a94: 00000000
a98: 00000000
a9c: 00000000
aa0: 00000000
aa4: 00000000
aa8: 00000000
aac: 00000000
ab0: 000002a5
ab4: 0015e4c8
ab8: 8f6a0200
abc: b0c3816c
ac0: 00000000
ac4: 00000000
ac8: 00000000
acc: 00000000
ad0: 00000000
ad4: 00000000
ad8: 00000000
adc: 00000000
ae0: 00000000
ae4: 00000000
ae8: 00000000
aec: 00000000
af0: 00000000
af4: 00000000
af8: 000002a5
afc: 0015e6c8
b00: 8f4f0300
b04: da8e5985
b08: 00000000
b0c: 00000000
b10: 00000000
b14: 00000000
b18: 00000000
b1c: 00000000
b20: 00000000
b24: 00000000
b28: 00000000
b2c: 00000000
b30: 00000000
b34: 00000000
b38: 00000000
b3c: 00000000
b40: 00000000
b44: 00000000
b48: 00000000
b4c: 00000000
b50: 00000000
b54: 00000000
b58: 00000000
b5c: 00000000
b60: 00000000
b64: 00000000
b68: 00000000
b6c: 00000000
b70: 00000000
b74: 00000000
b78: 00000000
b7c: 00000000
b80: 00000000
b84: 00000000
b88: 00000000
b8c: 00000000
b90: 00000000
b94: 00000000
b98: 00000000
b9c: 00000000
ba0: 00000000
ba4: 00000000
ba8: 00000000
bac: 00000000
bb0: 00000000
bb4: 00000000
bb8: 00000000
bbc: 00000000
bc0: 00000000
bc4: 00000000
bc8: 00000000
bcc: 00000000
bd0: 00000000
bd4: 00000000
bd8: 00000000
bdc: 00000000
be0: 00000000
be4: 00000000
be8: 00000000
bec: 00000000
bf0: 00000000
bf4: 00000000
bf8: 00000000
bfc: 00000000
e00: 00000468
e04: 03030003
e08: 030082b6
e0c: 00000007
e10: 00000000
e14: 00000000

mcp251xfd-dump did not work for me, but It looks as if the correct data is in the rx fifo.

Mfg
Stefan

-----Ursprüngliche Nachricht-----
Von: Marc Kleine-Budde <mkl@pengutronix.de> 
Gesendet: Donnerstag, 22. Dezember 2022 14:02
An: Stefan Althöfer <Stefan.Althoefer@janztec.com>
Cc: Thomas.Kopp@microchip.com; linux-can@vger.kernel.org
Betreff: Re: AW: mcp251xfd: Bad message receiption

On 22.12.2022 10:30:22, Stefan Althöfer wrote:
> Hi Thomas,
> 
> > Do I read the pdf correctly (based on the /var/log stuff) that you 
> > have two MCP2518FD connected to a Pi4B and both of them are running in internal/external loopback mode no interaction between them and the SPIs are separate?
> 
> Yes. For the loopback test the CANs are separate. Errors also occur 
> when sending messages between the controllers, but I think that is more difficult to analyze.
> 
> root@raspberrypi:~# ip -d -s a s can0
> 4: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UP group default qlen 1000
>     link/can  promiscuity 0 minmtu 0 maxmtu 0
>     can <LOOPBACK,BERR-REPORTING,FD> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0
>           bitrate 1000000 sample-point 0.800
>           tq 25 prop-seg 15 phase-seg1 16 phase-seg2 8 sjw 6
>           mcp251xfd: tseg1 2..256 tseg2 1..128 sjw 1..128 brp 1..256 brp-inc 1
>           dbitrate 4000000 dsample-point 0.800
>           dtq 25 dprop-seg 3 dphase-seg1 4 dphase-seg2 2 dsjw 2
>           mcp251xfd: dtseg1 1..32 dtseg2 1..16 dsjw 1..16 dbrp 1..256 dbrp-inc 1
>           clock 40000000
>           re-started bus-errors arbit-lost error-warn error-pass bus-off
>           0          0          0          0          0          0         numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
>     RX: bytes  packets  errors  dropped missed  mcast
>     3097429872 2517466658 0       0       0       0
>     TX: bytes  packets  errors  dropped carrier collsns
>     3696197680 1258733264 0       0       0       0
> root@raspberrypi:~# ip -d -s a s can1
> 5: can1: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UP group default qlen 1000
>     link/can  promiscuity 0 minmtu 0 maxmtu 0
>     can <LOOPBACK,BERR-REPORTING,FD> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0
>           bitrate 1000000 sample-point 0.800
>           tq 25 prop-seg 15 phase-seg1 16 phase-seg2 8 sjw 6
>           mcp251xfd: tseg1 2..256 tseg2 1..128 sjw 1..128 brp 1..256 brp-inc 1
>           dbitrate 4000000 dsample-point 0.800
>           dtq 25 dprop-seg 3 dphase-seg1 4 dphase-seg2 2 dsjw 2
>           mcp251xfd: dtseg1 1..32 dtseg2 1..16 dsjw 1..16 dbrp 1..256 dbrp-inc 1
>           clock 40000000
>           re-started bus-errors arbit-lost error-warn error-pass bus-off
>           0          112256     0          1          3          1         numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
                       ^^^^^^                ^          ^          ^
>     RX: bytes  packets  errors  dropped missed  mcasta
>     2590649888 773910314 3       0       0       0
>     TX: bytes  packets  errors  dropped carrier collsns
>     3442619696 386944310 112254  12      0       0

Why do you have so many errors on can1? The device event went into bus-off.

> This is the SPI setup for the dual CAN PI:
> 
> dtparam=spi=on
> dtoverlay=spi6-1cs
> # mcp251xfd DTS for RPI4 default CAN on SPI6.0
> dtoverlay=mcp251xfd-spi6-0,interrupt_pin=25,oscillator=40000000
> # mcp251xfd DTS for RPI4 CAN extension on SPI0.0
> dtoverlay=mcp251xfd,spi0-0,interrupt=16,oscillator=40000000
> 
> Or did you mean something else with "script"?

Can you share your "scselftest" program?

> I'll try the register dump when I suffer the next error ;-)

Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |

[-- Attachment #2: sctestself.tgz --]
[-- Type: application/x-compressed, Size: 25711 bytes --]

[-- Attachment #3: tc.pl --]
[-- Type: application/octet-stream, Size: 1085 bytes --]

#!/usr/bin/perl -w
use strict;

my $can=$ARGV[0] // die "arg0 must be candev";
my $port=$ARGV[1];
my $extlogger="192.168.5.167";
    
sub print_run_or_die {
    my $cmdref = $_[0];
    my $ret;
    
    print "+ ", join(" ", @$cmdref) , "\n";
    $ret = system @$cmdref;
    if ( $ret != 0 ){
	if( $ret < 0 ){
	    print STDERR "tc.pl: failed to execute $cmdref->[0]: $!\n";
	    system("ls","-l",$cmdref->[0]);
	    system("ls","-lH",$cmdref->[0]);
	    system("size",$cmdref->[0]);
	    exit 255;
	} elsif( $ret & 0xff ){
	    fprintf STDERR "tc.pl killed (%d)", ($ret & 0x7f);
	    exit 254;
	}
	exit ($ret >> 8);
    }
}

my @std_args = (
    "./sctestself", 
    "-b", 
    "-n", "4" ,
    "-l", "999",
    "-t", "2",
    "-v", "cmperr,logmsg",
    "-F", "refilldata,leastdots,allowintloopb,stoponerror",
    );

if( defined($port) ){
    push @std_args, "-X", "$extlogger:$port";
}

print_run_or_die([@std_args, "-d", "8", $can]);
print_run_or_die([@std_args, "-d", "16", $can]);
print_run_or_die([@std_args, "-d", "24", $can]);
print_run_or_die([@std_args, "-d", "32", $can]);


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: AW: AW: mcp251xfd: Bad message receiption
  2022-12-23  8:18       ` AW: " Stefan Althöfer
@ 2022-12-23 19:25         ` Marc Kleine-Budde
  2022-12-24 15:57         ` Marc Kleine-Budde
  1 sibling, 0 replies; 23+ messages in thread
From: Marc Kleine-Budde @ 2022-12-23 19:25 UTC (permalink / raw)
  To: Stefan Althöfer; +Cc: Thomas.Kopp, linux-can

[-- Attachment #1: Type: text/plain, Size: 4716 bytes --]

On 23.12.2022 08:18:17, Stefan Althöfer wrote:
> > Why do you have so many errors on can1? The device event went into bus-off.
> 
> The device was not freshly started, and I just intended to share the
> setup. No errors when the problem appeared this night.

Ok

> > Can you share your "scselftest" program?
> 
> Sure, attached as extracted from the build system for my test tools.
> Hope this is not the bad guy. Included my ancient copy of
> libsocketcan.

> > Do you get the same error if the test is only running on a single
> > mcp2518fd?
> 
> I'm pretty sure I had cases where I started the test on both, one
> failed sooner and one failed later. However, currently the second test
> is still running (after 9h).

I want to know if there's an influence between the 2 interfaces.

> > Do you get the same error if only a single mcp2518fd is attached to the raspi?
> 
> Only tested this with the CM4

Would be interesting if you only attache 1 mcp2518fd to the board and
then re-run the test.

> > Can you share your kernel, including the modifications?
> 
> Yes if it turns out to be necessary, but I just cloned it from here:
> git clone --depth=1 --branch <branch>
> https://github.com/raspberrypi/linux

You said you made some modifications to the kernel, also it would be
good to use the _extact_ version you using to reproduce the error.

> Following these instructions: https://www.raspberrypi.com/documentation/computers/linux_kernel.html#kernel-headers

> Now this nights error log:

Thanks for the text, much easier to handle. :)

> Loop 4429 2022-12-23T03:07:45 last+4s total:4h299m
> + ./sctestself -b -n 4 -l 999 -t 2 -v cmperr,logmsg -F refilldata,leastdots,allowintloopb,stoponerror -d 8 can1
> CAN selftest can1 .......idx    2 tx:2A5#00 02 6A 8F 6C 81 C3 B0
>          rx:2A5#00 00 85 9A EF 54 34 5D
> idx    3 tx:2A5#00 03 4F 8F 85 59 8E DA
>          rx:2A5#00 00 85 9A EF 54 34 5D
> .c (700) fail
> 0: RX (004/003)  2A5#00 00 85 9A EF 54 34 5D
> 0: RX (004/002)  2A5#00 00 85 9A EF 54 34 5D

2 bad RX here

> 0: TX (003/002)  2A5#00 03 4F 8F 85 59 8E DA
> 0: RX (003/001)  2A5#00 01 7C 30 5C 7A 2D 1C
> 0: TX (002/001)  2A5#00 02 6A 8F 6C 81 C3 B0
> 0: RX (002/000)  2A5#00 00 AC AF 70 9B 32 03
> 0: TX (001/000)  2A5#00 01 7C 30 5C 7A 2D 1C
> 0: TX (000/000)  2A5#00 00 AC AF 70 9B 32 03
> 0: RX (004/003)  2A5#00 03 65 32 F8 E7 E6 F5
> 0: RX (004/002)  2A5#00 02 CC D5 09 FE 55 99
> 0: TX (003/002)  2A5#00 03 65 32 F8 E7 E6 F5
> 0: RX (003/001)  2A5#00 01 62 4F 17 34 44 CE
> 0: TX (002/001)  2A5#00 02 CC D5 09 FE 55 99
> 0: RX (002/000)  2A5#00 00 FE 58 FE FD C0 EF
> 0: TX (001/000)  2A5#00 01 62 4F 17 34 44 CE
> 0: TX (000/000)  2A5#00 00 FE 58 FE FD C0 EF
> 0: RX (004/003)  2A5#00 03 C9 7B EF 83 93 E0
> 0: RX (004/002)  2A5#00 02 DD 5E 82 B4 FD C4
> 0: TX (003/002)  2A5#00 03 C9 7B EF 83 93 E0
> 0: RX (003/001)  2A5#00 01 90 3B 32 80 90 67
> 0: TX (002/001)  2A5#00 02 DD 5E 82 B4 FD C4
> 0: RX (002/000)  2A5#00 00 8F 28 D2 F5 B6 98
> 0: TX (001/000)  2A5#00 01 90 3B 32 80 90 67
> 0: TX (000/000)  2A5#00 00 8F 28 D2 F5 B6 98
> 0: RX (004/003)  2A5#00 03 C5 1E FE 68 F1 65
> 0: RX (004/002)  2A5#00 02 7E 2B BA 18 F1 7B
> 0: TX (003/002)  2A5#00 03 C5 1E FE 68 F1 65
> 0: RX (003/001)  2A5#00 01 CE 1B D7 9E 42 15
> 0: TX (002/001)  2A5#00 02 7E 2B BA 18 F1 7B

> 0: RX (002/000)  2A5#00 00 85 9A EF 54 34 5D

properly received

> 0: TX (001/000)  2A5#00 01 CE 1B D7 9E 42 15

> 0: TX (000/000)  2A5#00 00 85 9A EF 54 34 5D

send

> 0: RX (004/003)  2A5#00 03 37 6D 66 8E C6 9B
> 0: RX (004/002)  2A5#00 02 2D 83 A3 D3 66 27
> 0: TX (003/002)  2A5#00 03 37 6D 66 8E C6 9B
> 0: RX (003/001)  2A5#00 01 E8 A5 9A C6 C1 AC
> 0: TX (002/001)  2A5#00 02 2D 83 A3 D3 66 27
> 0: RX (002/000)  2A5#00 00 E6 7A D0 27 3E DF
> 0: TX (001/000)  2A5#00 01 E8 A5 9A C6 C1 AC
> 0: TX (000/000)  2A5#00 00 E6 7A D0 27 3E DF
> 0: RX (004/003)  2A5#00 03 D4 54 CF 9A 6D 7A
> 0: RX (004/002)  2A5#00 02 06 3F 57 59 35 F6
> 0: TX (003/002)  2A5#00 03 D4 54 CF 9A 6D 7A
> 0: RX (003/001)  2A5#00 01 C1 F6 A6 E3 84 64
> 0: TX (002/001)  2A5#00 02 06 3F 57 59 35 F6
> Stopped. Ctrl-C to exit.
> 
> And the register dump

[...]

> mcp251xfd-dump did not work for me, but It looks as if the correct data is in the rx fifo.

With v5.18 the RAM layout changed, the mcp251xfd-dump tool was not
adopted. yet.

Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: AW: AW: mcp251xfd: Bad message receiption
  2022-12-23  8:18       ` AW: " Stefan Althöfer
  2022-12-23 19:25         ` Marc Kleine-Budde
@ 2022-12-24 15:57         ` Marc Kleine-Budde
  2022-12-30 22:51           ` AW: " Stefan Althöfer
  1 sibling, 1 reply; 23+ messages in thread
From: Marc Kleine-Budde @ 2022-12-24 15:57 UTC (permalink / raw)
  To: Stefan Althöfer; +Cc: Thomas.Kopp, linux-can

I've looked at the regmap dump and added the offset in the RX and TX
FIFOs.

tl;dr:

Everything, but the last 2 messages, look consistent The last 2 RX
messages have not been received by the hardware. On the other hand the
hardware has received the correct messages, but they were not received
by the test application.

On 23.12.2022 08:18:17, Stefan Althöfer wrote:
> Loop 4429 2022-12-23T03:07:45 last+4s total:4h299m
> + ./sctestself -b -n 4 -l 999 -t 2 -v cmperr,logmsg -F refilldata,leastdots,allowintloopb,stoponerror -d 8 can1
> CAN selftest can1 .......idx    2 tx:2A5#00 02 6A 8F 6C 81 C3 B0
>          rx:2A5#00 00 85 9A EF 54 34 5D
> idx    3 tx:2A5#00 03 4F 8F 85 59 8E DA
>          rx:2A5#00 00 85 9A EF 54 34 5D
> .c (700) fail
> 0: RX (004/003)  2A5#00 00 85 9A EF 54 34 5D    ???
                   Not RX'ed by hardware
> 0: RX (004/002)  2A5#00 00 85 9A EF 54 34 5D    ???
                   Not RX'ed by hardware

> 0: TX (003/002)  2A5#00 03 4F 8F 85 59 8E DA    TX 0x03 -> RX 0x03 !!!
                   Not RX'ed by application

In the hardware the message is in the TX FIFO at offset 0x03 and in RX
0x03.

> 0: RX (003/001)  2A5#00 01 7C 30 5C 7A 2D 1C    RX 0x01
> 0: TX (002/001)  2A5#00 02 6A 8F 6C 81 C3 B0    TX 0x02 -> RX 0x02 !!!
                   Not RX'ed by application

In the hardware the message is in the TX FIFO at offset 0x02 and in RX
0x02.

> 0: RX (002/000)  2A5#00 00 AC AF 70 9B 32 03    RX 0x00
> 0: TX (001/000)  2A5#00 01 7C 30 5C 7A 2D 1C    TX 0x01 -> RX 0x01

In the hardware the message is in the TX FIFO at offset 0x01 and in RX
0x01.

> 0: TX (000/000)  2A5#00 00 AC AF 70 9B 32 03    TX 0x00 -> RX 0x00

In the hardware the message is in the TX FIFO at offset 0x00 and in RX
0x00.

> 0: RX (004/003)  2A5#00 03 65 32 F8 E7 E6 F5    RX 0x0f
> 0: RX (004/002)  2A5#00 02 CC D5 09 FE 55 99    RX 0x0e
> 0: TX (003/002)  2A5#00 03 65 32 F8 E7 E6 F5
> 0: RX (003/001)  2A5#00 01 62 4F 17 34 44 CE    RX 0x0d
> 0: TX (002/001)  2A5#00 02 CC D5 09 FE 55 99
> 0: RX (002/000)  2A5#00 00 FE 58 FE FD C0 EF    RX 0x0c
> 0: TX (001/000)  2A5#00 01 62 4F 17 34 44 CE
> 0: TX (000/000)  2A5#00 00 FE 58 FE FD C0 EF
> 0: RX (004/003)  2A5#00 03 C9 7B EF 83 93 E0    RX 0x0b
> 0: RX (004/002)  2A5#00 02 DD 5E 82 B4 FD C4    RX 0x0a
> 0: TX (003/002)  2A5#00 03 C9 7B EF 83 93 E0
> 0: RX (003/001)  2A5#00 01 90 3B 32 80 90 67    RX 0x09
> 0: TX (002/001)  2A5#00 02 DD 5E 82 B4 FD C4
> 0: RX (002/000)  2A5#00 00 8F 28 D2 F5 B6 98    RX 0x08
> 0: TX (001/000)  2A5#00 01 90 3B 32 80 90 67
> 0: TX (000/000)  2A5#00 00 8F 28 D2 F5 B6 98
> 0: RX (004/003)  2A5#00 03 C5 1E FE 68 F1 65    RX 0x07
> 0: RX (004/002)  2A5#00 02 7E 2B BA 18 F1 7B    RX 0x06
> 0: TX (003/002)  2A5#00 03 C5 1E FE 68 F1 65
> 0: RX (003/001)  2A5#00 01 CE 1B D7 9E 42 15    RX 0x05
> 0: TX (002/001)  2A5#00 02 7E 2B BA 18 F1 7B
> 0: RX (002/000)  2A5#00 00 85 9A EF 54 34 5D    RX 0x04
> 0: TX (001/000)  2A5#00 01 CE 1B D7 9E 42 15
> 0: TX (000/000)  2A5#00 00 85 9A EF 54 34 5D

Can you run this in parallel to the test. Abort with Ctrl+c after the
test fails and send me the log file. The last 256 lines should be
enough.

| candump any,0:0,#FFFFFFFF -l

The candump writes to current dir by default. To avoid side effects
change to a tmpfs, if you have enough RAM :).

Also send the regmap dump and of course the error message of the test
tool.

regards,
Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |

^ permalink raw reply	[flat|nested] 23+ messages in thread

* AW: AW: AW: mcp251xfd: Bad message receiption
  2022-12-24 15:57         ` Marc Kleine-Budde
@ 2022-12-30 22:51           ` Stefan Althöfer
  2023-01-01 10:43             ` Stefan Althöfer
  0 siblings, 1 reply; 23+ messages in thread
From: Stefan Althöfer @ 2022-12-30 22:51 UTC (permalink / raw)
  To: Marc Kleine-Budde; +Cc: Thomas.Kopp, linux-can

[-- Attachment #1: Type: text/plain, Size: 5427 bytes --]

> Would be interesting if you only attache 1 mcp2518fd to the board and then re-run the test.

The CM4 is the "raspberry pi 4 _compute_ module". So same "hardware" as the standard PI4 but with only one
interface attached (on a professional PCB). Error also happens there.

> You said you made some modifications to the kernel, also it would be good to use the _extact_ version you using to reproduce the error.

Does this answer this question?:
      root@raspberrypi:~/linux-6.0# git remote -v
      origin  https://github.com/raspberrypi/linux (fetch)
      origin  https://github.com/raspberrypi/linux (push)
      root@raspberrypi:~/linux-6.0# git status
      On branch rpi-6.0.y
      Your branch is up to date with 'origin/rpi-6.0.y'.
      
      Changes not staged for commit:
        (use "git add <file>..." to update what will be committed)
        (use "git restore <file>..." to discard changes in working directory)
              modified:   drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c
      
      no changes added to commit (use "git add" and/or "git commit -a")
      root@raspberrypi:~/linux-6.0# git diff
      diff --git a/drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c b/drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c
      index 68df6d464..5eab9dd86 100644
      --- a/drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c
      +++ b/drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c
      @@ -648,7 +648,7 @@ static u8 mcp251xfd_get_normal_mode(const struct mcp251xfd_priv *priv)
              u8 mode;
      
              if (priv->can.ctrlmode & CAN_CTRLMODE_LOOPBACK)
      -               mode = MCP251XFD_REG_CON_MODE_INT_LOOPBACK;
      +               mode = MCP251XFD_REG_CON_MODE_EXT_LOOPBACK;
              else if (priv->can.ctrlmode & CAN_CTRLMODE_LISTENONLY)
                      mode = MCP251XFD_REG_CON_MODE_LISTENONLY;
              else if (priv->can.ctrlmode & CAN_CTRLMODE_FD)

Regarding my test-program: I used to cross compile it. I moved it to the PI and compiled it natively with the Makefile.standalone
that I send to you. Surprisingly I found only 1 of my 4 instances failed after 72 hours. When I added -g -O2 flags, the natively
build programs failed 3 out of 4 within 26 hours (similar to the cross-build binaries: gcc-6.4 vs gcc-10.2).

> Can you run this in parallel to the test. Abort with Ctrl+c after the test fails and send me the log file. The last 256 lines should be enough.

Sure, but logging for 24h will be to much for my disk space. I found no utility to handle this, so I quick-hacked rbuflog
https://gist.github.com/stefanalt/dd4e68490d1a4e2a343b0beaa1b0d230. The result was so bizarre, that I finally
also added at normal log which I continuously shrinked by running the following command in parallel:

    candump -D can0,0:0,#FFFFFFFF | tee candump0_tee | rbuflog -n 200 -d 10000 candump0
    loopit -d 10 bash -c 'if [ $( du -k candump0_tee | cut -f1 ) -gt 7000 ] ; then cat /dev/null > candump0_tee ; fi'

After the fail I saved all logs/registers and sent a test message through the interface (cansend can0 555#55555555). After
that I re-saved everything and attached it as zip. 

The funny part is marked below. Some messages appear 4 times, where all other message appear only twice.
I have seen this happening several times.

  can0  2A5  [16]  00 02 67 A6 C8 CE 0D 37 F9 63 56 62 F6 2B E4 02
  can0  2A5  [16]  00 02 67 A6 C8 CE 0D 37 F9 63 56 62 F6 2B E4 02
  can0  2A5  [16]  00 03 C5 50 54 BE E1 16 7A 3F 70 B8 EE 5A 09 67
  can0  2A5  [16]  00 03 C5 50 54 BE E1 16 7A 3F 70 B8 EE 5A 09 67
  can0  2A5  [16]  00 00 C1 CA 4C 28 70 15 F6 7E 4C EF E1 A2 52 D8     **
  can0  2A5  [16]  00 00 C1 CA 4C 28 70 15 F6 7E 4C EF E1 A2 52 D8     **
  can0  2A5  [16]  00 00 C1 CA 4C 28 70 15 F6 7E 4C EF E1 A2 52 D8     **
  can0  2A5  [16]  00 00 C1 CA 4C 28 70 15 F6 7E 4C EF E1 A2 52 D8     **
  can0  2A5  [16]  00 01 CD 36 DA 92 86 2E 50 67 44 CB A6 B4 83 94   **
  can0  2A5  [16]  00 01 CD 36 DA 92 86 2E 50 67 44 CB A6 B4 83 94   **
  can0  2A5  [16]  00 01 CD 36 DA 92 86 2E 50 67 44 CB A6 B4 83 94   **
  can0  2A5  [16]  00 01 CD 36 DA 92 86 2E 50 67 44 CB A6 B4 83 94   **
  can0  2A5  [16]  00 02 0F 8D FC D0 57 48 F9 C8 5D EF 46 A9 DF 27
  can0  2A5  [16]  00 02 0F 8D FC D0 57 48 F9 C8 5D EF 46 A9 DF 27
  can0  2A5  [16]  00 03 4B 31 FF 18 67 D9 AA ED 07 FB 55 4B C6 FB
  can0  2A5  [16]  00 03 4B 31 FF 18 67 D9 AA ED 07 FB 55 4B C6 FB
  can0  555   [4]  55 55 55 55
  can0  555   [4]  55 55 55 55

Beside of that, the messages match with what my application has sent (rx-ed messages
removed from the following lines, see the zip for that)

0: TX72 (003/002)  2A5#00 03 4B 31 FF 18 67 D9 AA ED 07 FB 55 4B C6 FB
0: TX72 (002/001)  2A5#00 02 0F 8D FC D0 57 48 F9 C8 5D EF 46 A9 DF 27
0: TX72 (001/000)  2A5#00 01 CD 36 DA 92 86 2E 50 67 44 CB A6 B4 83 94
0: TX72 (000/000)  2A5#00 00 C1 CA 4C 28 70 15 F6 7E 4C EF E1 A2 52 D8

And one more observation: Once or twice I have seen my application fail
with not returning MTU size in the socketcan read:

  + ./sctestself -b -n 4 -l 999 -t 2 -v cmperr,logmsg -F refilldata,leastdots,allowintloopb,stoponerror -d 16 can1
  CAN selftest can1 .
  ERROR: recvfrom: ret=16, errno=0

Is there any obvious explanation for this? I tried to add more output for this case, but it has not happed
until then.

--Stefan

[-- Attachment #2: 20221230T1945_can0.tgz --]
[-- Type: application/x-compressed, Size: 6905 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* AW: AW: AW: mcp251xfd: Bad message receiption
  2022-12-30 22:51           ` AW: " Stefan Althöfer
@ 2023-01-01 10:43             ` Stefan Althöfer
  2023-01-04 15:55               ` Stefan Althöfer
  0 siblings, 1 reply; 23+ messages in thread
From: Stefan Althöfer @ 2023-01-01 10:43 UTC (permalink / raw)
  To: Marc Kleine-Budde; +Cc: Thomas.Kopp, linux-can

[-- Attachment #1: Type: text/plain, Size: 2341 bytes --]

Update: During a recent test-run I had both of the two CAN interfaces of my RPI4 Model B fail. First can0
and then some hours later can1.

Furthermore I attached another dump after fail. The last one was for the case where my application
receives 4 wrong messages (2 of them equal). This case is where my application receives
2 wrong messages only (both being equal). The candump has a different pattern:

CAN selftest can1 ......idx    2 tx:2A5#00 02 01 08 9A 5D 15 4B
         rx:2A5#00 00 D1 7D 9D 18 91 50
idx    3 tx:2A5#00 03 AC 4D 55 BF 2F 56
         rx:2A5#00 00 D1 7D 9D 18 91 50
c (584) fail
0: RX72 (004/003)  2A5#00 00 D1 7D 9D 18 91 50
0: RX72 (004/002)  2A5#00 00 D1 7D 9D 18 91 50
0: TX72 (003/002)  2A5#00 03 AC 4D 55 BF 2F 56
0: RX72 (003/001)  2A5#00 01 B0 ED 22 AE 94 9F
0: TX72 (002/001)  2A5#00 02 01 08 9A 5D 15 4B
0: RX72 (002/000)  2A5#00 00 57 3D 14 2B 3A BA
0: TX72 (001/000)  2A5#00 01 B0 ED 22 AE 94 9F
0: TX72 (000/000)  2A5#00 00 57 3D 14 2B 3A BA

  can1  2A5  [08]  00 03 1C 48 53 3D E1 9B
  can1  2A5  [08]  00 03 1C 48 53 3D E1 9B
  can1  2A5  [08]  00 00 57 3D 14 2B 3A BA
  can1  2A5  [08]  00 00 57 3D 14 2B 3A BA
  can1  2A5  [08]  00 01 B0 ED 22 AE 94 9F
  can1  2A5  [08]  00 01 B0 ED 22 AE 94 9F
  can1  2A5  [08]  00 02 01 08 9A 5D 15 4B     **
  can1  2A5  [08]  00 02 01 08 9A 5D 15 4B     **
  can1  2A5  [08]  00 02 01 08 9A 5D 15 4B     **
  can1  2A5  [08]  00 02 01 08 9A 5D 15 4B     **
  can1  2A5  [08]  00 03 AC 4D 55 BF 2F 56     **
  can1  2A5  [08]  00 03 AC 4D 55 BF 2F 56     **
  can1  2A5  [08]  00 03 AC 4D 55 BF 2F 56     **
  can1  2A5  [08]  00 03 AC 4D 55 BF 2F 56     **

I've seen other failures that followed those two pattern. Unfortunately, I removed my external logger, so
it's unsure what actually happens on the bus.

> And one more observation: Once or twice I have seen my application fail with not returning MTU size in the socketcan read:
>   ERROR: recvfrom: ret=16, errno=0
> Is there any obvious explanation for this? I tried to add more output for this case, but it has not happed until then.

Stupid me. Obviously this is a message in old CAN_MTU size. Is this still used for errors even if the MTU is set to CANFD_MTU?
This did not show up again, so I wasn't yet able to look into the data payload.

-- Stefan

[-- Attachment #2: 20221231T0158_can1.tgz --]
[-- Type: application/x-compressed, Size: 5992 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* AW: AW: AW: mcp251xfd: Bad message receiption
  2023-01-01 10:43             ` Stefan Althöfer
@ 2023-01-04 15:55               ` Stefan Althöfer
  2023-01-09  7:10                 ` AW: AW: AW: mcp251xfd: Bad message receiption (1/2) Stefan Althöfer
  2023-01-09  7:10                 ` AW: AW: AW: mcp251xfd: Bad message receiption (2/2) Stefan Althöfer
  0 siblings, 2 replies; 23+ messages in thread
From: Stefan Althöfer @ 2023-01-04 15:55 UTC (permalink / raw)
  To: Marc Kleine-Budde; +Cc: Thomas.Kopp, linux-can

[-- Attachment #1: Type: text/plain, Size: 3127 bytes --]

I could again capture a case where some (false) 16 byte MTU frames were received - attached.

Loop 24435 2023-01-04T14:25:57 last+4s total:28h36m
.....
+ ./sctestself -b -n 4 -l 999 -t 2 -v cmperr,logmsg -F refilldata,leastdots,allowintloopb,stoponerror -X 192.168.5.221:4001 -d 24 can0
CAN selftest can0 Wrong-RX16(rxcnt=0):000#
Wrong-RX16(rxcnt=1):000#
Wrong-RX16(rxcnt=2):000#
Wrong-RX16(rxcnt=3):000#
idx    0 tx:2A5#00 00 DB 83 C7 C6 D2 22 D0 F1 81 25 26 15 EF 1F 1C 6F 91 BC 00 DC EF 74
         rx:000#
idx    1 tx:2A5#00 01 B2 FA 3F 38 59 F9 13 11 4D EE 94 14 B4 67 36 84 58 B7 A9 7F CC 99
         rx:000#
idx    2 tx:2A5#00 02 9E E9 08 2F A5 08 0C 95 7C BE 8F BB F7 E8 B5 0A F9 02 F8 8E 16 AC
         rx:000#
idx    3 tx:2A5#00 03 F5 4D 30 4D 04 DA CC D1 73 6B BA 7B 9A 5F 84 A6 F4 00 65 83 BC 5C
         rx:000#
.c (0) fail
0: RX16 (004/003)  000#
0: RX16 (004/002)  000#
0: TX72 (003/002)  2A5#00 03 F5 4D 30 4D 04 DA CC D1 73 6B BA 7B 9A 5F 84 A6 F4 00 65 83 BC 5C
0: RX16 (003/001)  000#
0: TX72 (002/001)  2A5#00 02 9E E9 08 2F A5 08 0C 95 7C BE 8F BB F7 E8 B5 0A F9 02 F8 8E 16 AC
0: RX16 (002/000)  000#
0: TX72 (001/000)  2A5#00 01 B2 FA 3F 38 59 F9 13 11 4D EE 94 14 B4 67 36 84 58 B7 A9 7F CC 99
0: TX72 (000/000)  2A5#00 00 DB 83 C7 C6 D2 22 D0 F1 81 25 26 15 EF 1F 1C 6F 91 BC 00 DC EF 74
External logger data:
(54:40.690.506      +0.089) RX B -      2A5  [24]  00 03 F5 4D 30 4D 04 DA  CC D1 73 6B BA 7B 9A 5F  84 A6 F4 00 65 83 BC 5C  
(54:40.690.417      +3.307) RX B -      2A5  [24]  00 02 9E E9 08 2F A5 08  0C 95 7C BE 8F BB F7 E8  B5 0A F9 02 F8 8E 16 AC       **** suspicious delay
(54:40.687.110      +0.088) RX B -      2A5  [24]  00 01 B2 FA 3F 38 59 F9  13 11 4D EE 94 14 B4 67  36 84 58 B7 A9 7F CC 99  
(54:40.687.022     +26.681) RX B -      2A5  [24]  00 00 DB 83 C7 C6 D2 22  D0 F1 81 25 26 15 EF 1F  1C 6F 91 BC 00 DC EF 74    **** plausible delay
(54:40.660.341      +0.071) RX B -      2A5  [16]  00 03 95 89 EA 22 98 62  36 77 C2 F9 69 99 0D 32  
(54:40.660.270      +0.417) RX B -      2A5  [16]  00 02 CF 52 6C 29 E2 86  73 10 3E 93 D4 72 30 05  
(54:40.659.853      +0.071) RX B -      2A5  [16]  00 01 8D 0E 77 13 DF 60  C3 F1 D7 14 C9 8B CC 04  
(54:40.659.782      +0.460) RX B -      2A5  [16]  00 00 5A 90 19 49 2D B8  20 C4 34 9D 31 23 59 E5  
(54:40.659.322      +0.070) RX B -      2A5  [16]  00 03 86 D1 E9 AF 12 76  51 D7 B4 B8 3B 43 85 68  
(54:40.659.252      +0.420) RX B -      2A5  [16]  00 02 EF 9D 9F D6 97 7B  7D 6C EF BB B4 6A B5 92

My sctestself received four 000# frames and quits. There are more such frames in the candump. None of them
were received by the external logger and neither in my own send-log.
It seems this happens when the sctestself does it's first test loop - which I also have seen in a second
case. The 26.681 ms delay seems to be normal as the program is started by the outer test loop. 
The +3.307 ms delay however is not normally seen when sctestself does it's first loop, not even 
if I drop caches after each sctestself run.

Can sctestself do something wrong to cause it?

-- Stefan

[-- Attachment #2: 20230104T1425_can0_rx16.tgz --]
[-- Type: application/x-compressed, Size: 6235 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* AW: AW: AW: mcp251xfd: Bad message receiption (1/2)
  2023-01-04 15:55               ` Stefan Althöfer
@ 2023-01-09  7:10                 ` Stefan Althöfer
  2023-01-10 12:50                   ` Thomas.Kopp
  2023-01-09  7:10                 ` AW: AW: AW: mcp251xfd: Bad message receiption (2/2) Stefan Althöfer
  1 sibling, 1 reply; 23+ messages in thread
From: Stefan Althöfer @ 2023-01-09  7:10 UTC (permalink / raw)
  To: Marc Kleine-Budde; +Cc: Thomas.Kopp, linux-can

[-- Attachment #1: Type: text/plain, Size: 717 bytes --]

Hope this doesn't start to get annoying!

I captured two more cases for the received-wrong-and-duplicate case. Attached.
Also I couldn’t resist to create a PDF with some things highlighted ;-) Perhaps I should mention that for the PDF I:
   * reversed the last-on-top output of sctestself
   * removed the TX log lines of sctestself
   * removed every send line of the candump log

For the last case, I attached debugger after the fail an tried to read from the socket until read returned -1:
    (gdb) p malloc(72)
    (gdb) p read(sockfd,$1,72)
    (gdb) print fprint_canframe_mtu(stdout, $1, "\n", 4, 72) And surprise surprise, there were more CAN frames in store. Refer to the log/pdf.

-- Stefan

[-- Attachment #2: 20230106T1822.tgz --]
[-- Type: application/x-compressed, Size: 7765 bytes --]

[-- Attachment #3: 20230107T1352.tgz --]
[-- Type: application/x-compressed, Size: 10957 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* AW: AW: AW: mcp251xfd: Bad message receiption (2/2)
  2023-01-04 15:55               ` Stefan Althöfer
  2023-01-09  7:10                 ` AW: AW: AW: mcp251xfd: Bad message receiption (1/2) Stefan Althöfer
@ 2023-01-09  7:10                 ` Stefan Althöfer
  1 sibling, 0 replies; 23+ messages in thread
From: Stefan Althöfer @ 2023-01-09  7:10 UTC (permalink / raw)
  To: Marc Kleine-Budde; +Cc: Thomas.Kopp, linux-can

[-- Attachment #1: Type: text/plain, Size: 34 bytes --]

Remaining Attachment

--Stefan

[-- Attachment #2: can_mcp2518fd_fail.zip --]
[-- Type: application/x-zip-compressed, Size: 65740 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: AW: AW: mcp251xfd: Bad message receiption (1/2)
  2023-01-09  7:10                 ` AW: AW: AW: mcp251xfd: Bad message receiption (1/2) Stefan Althöfer
@ 2023-01-10 12:50                   ` Thomas.Kopp
  2023-01-10 14:53                     ` Marc Kleine-Budde
  0 siblings, 1 reply; 23+ messages in thread
From: Thomas.Kopp @ 2023-01-10 12:50 UTC (permalink / raw)
  To: Stefan.Althoefer, mkl; +Cc: linux-can

> Hope this doesn't start to get annoying!

Absolutely not, having the data and you debugging this is helpful.

 
> For the last case, I attached debugger after the fail an tried to read from the
> socket until read returned -1:
>     (gdb) p malloc(72)
>     (gdb) p read(sockfd,$1,72)
>     (gdb) print fprint_canframe_mtu(stdout, $1, "\n", 4, 72) And surprise
> surprise, there were more CAN frames in store. Refer to the log/pdf.


I was finally able to reproduce the issue and capture a full SPI/CAN log when it happens. At this point it looks like the tail pointer of the FIFO in the MCP2518FD gets corrupted and the driver does what it's supposed to do and reads out lots of frames... So yeah, I see all the frames being read via SPI multiple times which matches what you found with your gdb approach.
It's interesting to see that this wasn't noticed before - I have reproduced this issue also with cangen (when manually comparing TX/RX counters)

I'll try to find the rootcause and what can be done to mitigate this.

Best Regards,
Thomas

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: RE: AW: AW: mcp251xfd: Bad message receiption (1/2)
  2023-01-10 12:50                   ` Thomas.Kopp
@ 2023-01-10 14:53                     ` Marc Kleine-Budde
  2023-01-10 17:08                       ` Marc Kleine-Budde
  0 siblings, 1 reply; 23+ messages in thread
From: Marc Kleine-Budde @ 2023-01-10 14:53 UTC (permalink / raw)
  To: Thomas.Kopp; +Cc: Stefan.Althoefer, linux-can

[-- Attachment #1: Type: text/plain, Size: 1261 bytes --]

On 10.01.2023 12:50:18, Thomas.Kopp@microchip.com wrote:
> > For the last case, I attached debugger after the fail an tried to read from the
> > socket until read returned -1:
> >     (gdb) p malloc(72)
> >     (gdb) p read(sockfd,$1,72)
> >     (gdb) print fprint_canframe_mtu(stdout, $1, "\n", 4, 72) And surprise
> > surprise, there were more CAN frames in store. Refer to the log/pdf.
> 
> I was finally able to reproduce the issue and capture a full SPI/CAN
> log when it happens. At this point it looks like the tail pointer of
> the FIFO in the MCP2518FD gets corrupted and the driver does what it's
> supposed to do and reads out lots of frames...

You mean the head pointer?

> So yeah, I see all the
> frames being read via SPI multiple times which matches what you found
> with your gdb approach.

> It's interesting to see that this wasn't noticed before - I have
> reproduced this issue also with cangen (when manually comparing TX/RX
> counters)

Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: RE: AW: AW: mcp251xfd: Bad message receiption (1/2)
  2023-01-10 14:53                     ` Marc Kleine-Budde
@ 2023-01-10 17:08                       ` Marc Kleine-Budde
  2023-01-10 18:39                         ` Marc Kleine-Budde
  0 siblings, 1 reply; 23+ messages in thread
From: Marc Kleine-Budde @ 2023-01-10 17:08 UTC (permalink / raw)
  To: Thomas.Kopp; +Cc: Stefan.Althoefer, linux-can

[-- Attachment #1: Type: text/plain, Size: 6866 bytes --]

On 10.01.2023 15:53:22, Marc Kleine-Budde wrote:
> On 10.01.2023 12:50:18, Thomas.Kopp@microchip.com wrote:
> > > For the last case, I attached debugger after the fail an tried to read from the
> > > socket until read returned -1:
> > >     (gdb) p malloc(72)
> > >     (gdb) p read(sockfd,$1,72)
> > >     (gdb) print fprint_canframe_mtu(stdout, $1, "\n", 4, 72) And surprise
> > > surprise, there were more CAN frames in store. Refer to the log/pdf.
> > 
> > I was finally able to reproduce the issue and capture a full SPI/CAN
> > log when it happens. At this point it looks like the tail pointer of
> > the FIFO in the MCP2518FD gets corrupted and the driver does what it's
> > supposed to do and reads out lots of frames...
> 
> You mean the head pointer?

I've reproduced the issue here, too, on a single-core imx6,
net-next/master from end of last year.

I added some tracing to the driver and recoded the following events.

Let me explain how to read this. The "chip" variants are read from the
chip via SPI. The "ring" variants are the cached values in the driver.

                   chip-tail       chip-delta (head-tail)
                     |     chip-head |
                     |       |       |     ring-tail     ring-head     ring-delta
                     |       |       |       |             |             | (head-tail)
| do_rxif_analyze: ct=0x0b ch=0x0c cd=0x01 rt=0x00000c9b rh=0x00000c9c rd=0x01
|         TFERFFIF                Receive FIFO Full Interrupt Flag 
|         TFHRFHIF                Receive FIFO Half Full Interrupt Flag
|         TFNRFNIF   x            Receive FIFO Not Empty Interrupt Flag
          ^^^^^^^^
These are the decoded lowest 3 bits of the FIFOSTA register.

The different runs of the rxif handler (mcp251xfd_handle_rxif()) are
separated by a ("----") line.


Now the analysis:

| --------------------------------------------------------------------------------
| do_rxif_analyze: ct=0x09 ch=0x0a cd=0x01 rt=0x00000c99 rh=0x00000c9a rd=0x01
|         TFERFFIF                Receive FIFO Full Interrupt Flag 
|         TFHRFHIF                Receive FIFO Half Full Interrupt Flag
|         TFNRFNIF   x            Receive FIFO Not Empty Interrupt Flag
| --------------------------------------------------------------------------------
| do_rxif_analyze: ct=0x0a ch=0x0b cd=0x01 rt=0x00000c9a rh=0x00000c9b rd=0x01
|         TFERFFIF                Receive FIFO Full Interrupt Flag 
|         TFHRFHIF                Receive FIFO Half Full Interrupt Flag
|         TFNRFNIF   x            Receive FIFO Not Empty Interrupt Flag
| --------------------------------------------------------------------------------
| do_rxif_analyze: ct=0x0b ch=0x0c cd=0x01 rt=0x00000c9b rh=0x00000c9c rd=0x01
|         TFERFFIF                Receive FIFO Full Interrupt Flag 
|         TFHRFHIF                Receive FIFO Half Full Interrupt Flag
|         TFNRFNIF   x            Receive FIFO Not Empty Interrupt Flag

So far everything looks good. The chip delta ("cd"), i.e. the CAN frames
waiting to be read is 1. The ring delta is 1, too. Meaning 1 CAN frame
has been read.

| --------------------------------------------------------------------------------
| do_rxif_analyze: ct=0x0c ch=0x0f cd=0x03 rt=0x00000c9c rh=0x00000c9f rd=0x03
|         TFERFFIF                Receive FIFO Full Interrupt Flag 
|         TFHRFHIF                Receive FIFO Half Full Interrupt Flag
|         TFNRFNIF   x            Receive FIFO Not Empty Interrupt Flag

Now 3 CAN frame are waiting to be read. Thomas told me that he sees 3
CAN frames waiting too, before things go downhill.

| --------------------------------------------------------------------------------
| do_rxif_analyze: ct=0x0f ch=0x0e cd=0x0f rt=0x00000c9f rh=0x00000cae rd=0x0f
|         TFERFFIF                Receive FIFO Full Interrupt Flag 
|         TFHRFHIF   x            Receive FIFO Half Full Interrupt Flag
|         TFNRFNIF   x            Receive FIFO Not Empty Interrupt Flag

This is strange: chip head "jumped" to 0x0e, this means there are 0x0f
CAN frames waiting. However the FIFO flags (FIFO Half Full set, FIFO
Full not set) are consistent with 0x0f waiting CAN frames, the FIFO size
if 0x10.

Note: The chip tail "ct=0x0f" is consistent, as the "old" chip head was
      0x0f.

| --------------------------------------------------------------------------------
| do_rxif_analyze: ct=0x0e ch=0x0e cd=0x10 rt=0x00000cae rh=0x00000cbe rd=0x10
|         TFERFFIF   x            Receive FIFO Full Interrupt Flag 
|         TFHRFHIF   x            Receive FIFO Half Full Interrupt Flag
|         TFNRFNIF   x            Receive FIFO Not Empty Interrupt Flag

Chip tail is consistent (0x0e) as it equals the old chip head. chip head
is also 0x0e, this means either an empty or a full FIFO, the flags
indicate a full FIFO.

The following events seem to be OK:

| --------------------------------------------------------------------------------
| do_rxif_analyze: ct=0x0e ch=0x0f cd=0x01 rt=0x00000cbe rh=0x00000cbf rd=0x01
|         TFERFFIF                Receive FIFO Full Interrupt Flag 
|         TFHRFHIF                Receive FIFO Half Full Interrupt Flag
|         TFNRFNIF   x            Receive FIFO Not Empty Interrupt Flag
| --------------------------------------------------------------------------------
| do_rxif_analyze: ct=0x0f ch=0x00 cd=0x01 rt=0x00000cbf rh=0x00000cc0 rd=0x01
|         TFERFFIF                Receive FIFO Full Interrupt Flag 
|         TFHRFHIF                Receive FIFO Half Full Interrupt Flag
|         TFNRFNIF   x            Receive FIFO Not Empty Interrupt Flag
| --------------------------------------------------------------------------------

Final note on the number of send and received CAN frames.

The ring head (in the event above) shows 0x0cc0 received CAN frames.
The sequence number in the register dump shows 0x000c9f, which means a
total of 0x000ca0 CAN frames have been send.

| TX-0 Object: 0x03 (0xaf8)
|               id = 0x000002a5
|            flags = 0x00193e08
|    SEQ_MCP2517FD = 0x00001f             Sequence (MCP2517)
|    SEQ_MCP2518FD = 0x000c9f             Sequence (MCP2518)
|             data = 00 03 69 df  c5 ee 62 77

The BDIAG1 register gives a total of 0x0ca0 error free messages, looks
consistent, too Thomas is this RX'ed or TX'ed error free messages?

| BDIAG1: bdiag1(0x03c)=0x00000ca0
[...]
|         EFMSGCNT = 0x0ca0       Error Free Message Counter

regards,
Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: RE: AW: AW: mcp251xfd: Bad message receiption (1/2)
  2023-01-10 17:08                       ` Marc Kleine-Budde
@ 2023-01-10 18:39                         ` Marc Kleine-Budde
  2023-01-10 21:40                           ` Thomas.Kopp
  0 siblings, 1 reply; 23+ messages in thread
From: Marc Kleine-Budde @ 2023-01-10 18:39 UTC (permalink / raw)
  To: Thomas.Kopp; +Cc: Stefan.Althoefer, linux-can

[-- Attachment #1: Type: text/plain, Size: 4295 bytes --]

Another failure, it looks similar:

| do_rxif_analyze: ct=0x0a ch=0x0b cd=0x01 rt=0x000000fa rh=0x000000fb rd=0x01
|         TFERFFIF                Receive FIFO Full Interrupt Flag 
|         TFHRFHIF                Receive FIFO Half Full Interrupt Flag
|         TFNRFNIF   x            Receive FIFO Not Empty Interrupt Flag
| --------------------------------------------------------------------------------
| do_rxif_analyze: ct=0x0b ch=0x0c cd=0x01 rt=0x000000fb rh=0x000000fc rd=0x01
|         TFERFFIF                Receive FIFO Full Interrupt Flag 
|         TFHRFHIF                Receive FIFO Half Full Interrupt Flag
|         TFNRFNIF   x            Receive FIFO Not Empty Interrupt Flag
| --------------------------------------------------------------------------------
| do_rxif_analyze: ct=0x0c ch=0x0f cd=0x03 rt=0x000000fc rh=0x000000ff rd=0x03
|         TFERFFIF                Receive FIFO Full Interrupt Flag 
|         TFHRFHIF                Receive FIFO Half Full Interrupt Flag
|         TFNRFNIF   x            Receive FIFO Not Empty Interrupt Flag
| --------------------------------------------------------------------------------
| do_rxif_analyze: ct=0x0f ch=0x0e cd=0x0f rt=0x000000ff rh=0x0000010e rd=0x0f
|         TFERFFIF                Receive FIFO Full Interrupt Flag 
|         TFHRFHIF   x            Receive FIFO Half Full Interrupt Flag
|         TFNRFNIF   x            Receive FIFO Not Empty Interrupt Flag
| --------------------------------------------------------------------------------
| do_rxif_analyze: ct=0x0e ch=0x0e cd=0x10 rt=0x0000010e rh=0x0000011e rd=0x10
|         TFERFFIF   x            Receive FIFO Full Interrupt Flag 
|         TFHRFHIF   x            Receive FIFO Half Full Interrupt Flag
|         TFNRFNIF   x            Receive FIFO Not Empty Interrupt Flag
| --------------------------------------------------------------------------------
| do_rxif_analyze: ct=0x0e ch=0x0f cd=0x01 rt=0x0000011e rh=0x0000011f rd=0x01
|         TFERFFIF                Receive FIFO Full Interrupt Flag 
|         TFHRFHIF                Receive FIFO Half Full Interrupt Flag
|         TFNRFNIF   x            Receive FIFO Not Empty Interrupt Flag
| --------------------------------------------------------------------------------
| do_rxif_analyze: ct=0x0f ch=0x00 cd=0x01 rt=0x0000011f rh=0x00000120 rd=0x01
|         TFERFFIF                Receive FIFO Full Interrupt Flag 
|         TFHRFHIF                Receive FIFO Half Full Interrupt Flag
|         TFNRFNIF   x            Receive FIFO Not Empty Interrupt Flag

| BDIAG1: bdiag1(0x03c)=0x00000100
|            DLCMM                DLC Mismatch
|              ESI                ESI flag of a received CAN FD message was set
|          DCRCERR                Data CRC Error
|         DSTUFERR                Data Bit Stuffing Error
|         DFORMERR                Data Format Error
|         DBIT1ERR                Data BIT1 Error
|         DBIT0ERR                Data BIT0 Error
|          TXBOERR                Device went to bus-off (and auto-recovered)
|          NCRCERR                CRC Error
|         NSTUFERR                Bit Stuffing Error
|         NFORMERR                Format Error
|          NACKERR                Transmitted message was not acknowledged 
|         NBIT1ERR                Bit1 Error
|         NBIT0ERR                Bit0 Error
|         EFMSGCNT = 0x0100       Error Free Message Counter

| TX-0 Object: 0x03 (0xaf8)
|               id = 0x000002a5
|            flags = 0x0001fe08
|    SEQ_MCP2517FD = 0x00007f             Sequence (MCP2517)
|    SEQ_MCP2518FD = 0x0000ff             Sequence (MCP2518)
|             data = 00 03 72 e4  78 54 d6 ed

The correct message counter is 0x100, the wrong one 0x120. That's 2x
FIFO size. I'd like to know when the FIFO head is wrong for the first
time, one that results in a data transfer where "old" FIFO contents is
read. I haven't dumped any data yet.

Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: RE: AW: AW: mcp251xfd: Bad message receiption (1/2)
  2023-01-10 18:39                         ` Marc Kleine-Budde
@ 2023-01-10 21:40                           ` Thomas.Kopp
  2023-01-10 21:50                             ` Marc Kleine-Budde
  0 siblings, 1 reply; 23+ messages in thread
From: Thomas.Kopp @ 2023-01-10 21:40 UTC (permalink / raw)
  To: mkl; +Cc: Stefan.Althoefer, linux-can

About Head not tail: Whoops, yes correct.

> The correct message counter is 0x100, the wrong one 0x120. That's 2x
> FIFO size. I'd like to know when the FIFO head is wrong for the first
> time, one that results in a data transfer where "old" FIFO contents is
> read. I haven't dumped any data yet.

Not sure what you mean. Do you want the message counter at that point in time? 

Thomas 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: RE: RE: AW: AW: mcp251xfd: Bad message receiption (1/2)
  2023-01-10 21:40                           ` Thomas.Kopp
@ 2023-01-10 21:50                             ` Marc Kleine-Budde
  2023-01-10 22:37                               ` Marc Kleine-Budde
  0 siblings, 1 reply; 23+ messages in thread
From: Marc Kleine-Budde @ 2023-01-10 21:50 UTC (permalink / raw)
  To: Thomas.Kopp; +Cc: Stefan.Althoefer, linux-can

[-- Attachment #1: Type: text/plain, Size: 3388 bytes --]

On 10.01.2023 21:40:16, Thomas.Kopp@microchip.com wrote:
> > The correct message counter is 0x100, the wrong one 0x120. That's 2x
> > FIFO size. I'd like to know when the FIFO head is wrong for the first
> > time, one that results in a data transfer where "old" FIFO contents is
> > read. I haven't dumped any data yet.

I got a chip-delta == 4 error.

> Not sure what you mean. Do you want the message counter at that point
> in time?

I want to know the chip tail where the driver reads the first time old
data. For example in this trace:

| do_rxif_analyze: ct=0x07 ch=0x08 cd=0x01 rt=0x00000427 rh=0x00000428 rd=0x01
|         TFERFFIF                Receive FIFO Full Interrupt Flag 
|         TFHRFHIF                Receive FIFO Half Full Interrupt Flag
|         TFNRFNIF   x            Receive FIFO Not Empty Interrupt Flag
| --------------------------------------------------------------------------------
| do_rxif_analyze: ct=0x08 ch=0x09 cd=0x01 rt=0x00000428 rh=0x00000429 rd=0x01
|         TFERFFIF                Receive FIFO Full Interrupt Flag 
|         TFHRFHIF                Receive FIFO Half Full Interrupt Flag
|         TFNRFNIF   x            Receive FIFO Not Empty Interrupt Flag
| --------------------------------------------------------------------------------
| do_rxif_analyze: ct=0x09 ch=0x0a cd=0x01 rt=0x00000429 rh=0x0000042a rd=0x01
|         TFERFFIF                Receive FIFO Full Interrupt Flag 
|         TFHRFHIF                Receive FIFO Half Full Interrupt Flag
|         TFNRFNIF   x            Receive FIFO Not Empty Interrupt Flag

chip delta is 1, so probably OK.

| --------------------------------------------------------------------------------
| do_rxif_analyze: ct=0x0a ch=0x0e cd=0x04 rt=0x0000042a rh=0x0000042e rd=0x04
|         TFERFFIF                Receive FIFO Full Interrupt Flag 
|         TFHRFHIF                Receive FIFO Half Full Interrupt Flag
|         TFNRFNIF   x            Receive FIFO Not Empty Interrupt Flag

chip delta is 4. Are all RX objects (0xa, 0xb, 0xc and 0xd) valid data?

| --------------------------------------------------------------------------------
| do_rxif_analyze: ct=0x0e ch=0x0c cd=0x0e rt=0x0000042e rh=0x0000043c rd=0x0e
|         TFERFFIF                Receive FIFO Full Interrupt Flag 
|         TFHRFHIF   x            Receive FIFO Half Full Interrupt Flag
|         TFNRFNIF   x            Receive FIFO Not Empty Interrupt Flag

Are the read RX objects are all valid data?

| --------------------------------------------------------------------------------
| do_rxif_analyze: ct=0x0c ch=0x0c cd=0x10 rt=0x0000043c rh=0x0000044c rd=0x10
|         TFERFFIF   x            Receive FIFO Full Interrupt Flag 
|         TFHRFHIF   x            Receive FIFO Half Full Interrupt Flag
|         TFNRFNIF   x            Receive FIFO Not Empty Interrupt Flag

We read the whole FIFO, is everything valid data? Probably not. Which RX
FIFO object is the first "bad" one?

| --------------------------------------------------------------------------------

Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: RE: RE: AW: AW: mcp251xfd: Bad message receiption (1/2)
  2023-01-10 21:50                             ` Marc Kleine-Budde
@ 2023-01-10 22:37                               ` Marc Kleine-Budde
  2023-01-10 22:43                                 ` Marc Kleine-Budde
  0 siblings, 1 reply; 23+ messages in thread
From: Marc Kleine-Budde @ 2023-01-10 22:37 UTC (permalink / raw)
  To: Thomas.Kopp; +Cc: Stefan.Althoefer, linux-can

[-- Attachment #1: Type: text/plain, Size: 787 bytes --]

On 10.01.2023 22:50:33, Marc Kleine-Budde wrote:
> On 10.01.2023 21:40:16, Thomas.Kopp@microchip.com wrote:
> > > The correct message counter is 0x100, the wrong one 0x120. That's 2x
> > > FIFO size. I'd like to know when the FIFO head is wrong for the first
> > > time, one that results in a data transfer where "old" FIFO contents is
> > > read. I haven't dumped any data yet.
> 
> I got a chip-delta == 4 error.

I have a proof of concept workaround implemented and I'll let it run
over night.

Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: RE: RE: AW: AW: mcp251xfd: Bad message receiption (1/2)
  2023-01-10 22:37                               ` Marc Kleine-Budde
@ 2023-01-10 22:43                                 ` Marc Kleine-Budde
  2023-01-11  8:17                                   ` Marc Kleine-Budde
  0 siblings, 1 reply; 23+ messages in thread
From: Marc Kleine-Budde @ 2023-01-10 22:43 UTC (permalink / raw)
  To: Thomas.Kopp; +Cc: Stefan.Althoefer, linux-can

[-- Attachment #1: Type: text/plain, Size: 959 bytes --]

On 10.01.2023 23:37:44, Marc Kleine-Budde wrote:
> On 10.01.2023 22:50:33, Marc Kleine-Budde wrote:
> > On 10.01.2023 21:40:16, Thomas.Kopp@microchip.com wrote:
> > > > The correct message counter is 0x100, the wrong one 0x120. That's 2x
> > > > FIFO size. I'd like to know when the FIFO head is wrong for the first
> > > > time, one that results in a data transfer where "old" FIFO contents is
> > > > read. I haven't dumped any data yet.
> > 
> > I got a chip-delta == 4 error.
> 
> I have a proof of concept workaround implemented and I'll let it run
> over night.

\o/ The workaround triggered and Stefan's test program continued without
problems \o/

g'night,
Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: RE: RE: AW: AW: mcp251xfd: Bad message receiption (1/2)
  2023-01-10 22:43                                 ` Marc Kleine-Budde
@ 2023-01-11  8:17                                   ` Marc Kleine-Budde
  2023-01-11 11:54                                     ` Marc Kleine-Budde
  0 siblings, 1 reply; 23+ messages in thread
From: Marc Kleine-Budde @ 2023-01-11  8:17 UTC (permalink / raw)
  To: Thomas.Kopp; +Cc: Stefan.Althoefer, linux-can

[-- Attachment #1: Type: text/plain, Size: 1098 bytes --]

On 10.01.2023 23:43:02, Marc Kleine-Budde wrote:
> On 10.01.2023 23:37:44, Marc Kleine-Budde wrote:
> > On 10.01.2023 22:50:33, Marc Kleine-Budde wrote:
> > > On 10.01.2023 21:40:16, Thomas.Kopp@microchip.com wrote:
> > > > > The correct message counter is 0x100, the wrong one 0x120. That's 2x
> > > > > FIFO size. I'd like to know when the FIFO head is wrong for the first
> > > > > time, one that results in a data transfer where "old" FIFO contents is
> > > > > read. I haven't dumped any data yet.
> > > 
> > > I got a chip-delta == 4 error.
> > 
> > I have a proof of concept workaround implemented and I'll let it run
> > over night.
> 
> \o/ The workaround triggered and Stefan's test program continued without
> problems \o/

Still running. The workaround triggered more than 45x this night.

Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: RE: RE: AW: AW: mcp251xfd: Bad message receiption (1/2)
  2023-01-11  8:17                                   ` Marc Kleine-Budde
@ 2023-01-11 11:54                                     ` Marc Kleine-Budde
  0 siblings, 0 replies; 23+ messages in thread
From: Marc Kleine-Budde @ 2023-01-11 11:54 UTC (permalink / raw)
  To: Thomas.Kopp; +Cc: Stefan.Althoefer, linux-can

[-- Attachment #1: Type: text/plain, Size: 1950 bytes --]

On 11.01.2023 09:17:25, Marc Kleine-Budde wrote:
> On 10.01.2023 23:43:02, Marc Kleine-Budde wrote:
> > On 10.01.2023 23:37:44, Marc Kleine-Budde wrote:
> > > On 10.01.2023 22:50:33, Marc Kleine-Budde wrote:
> > > > On 10.01.2023 21:40:16, Thomas.Kopp@microchip.com wrote:
> > > > > > The correct message counter is 0x100, the wrong one 0x120. That's 2x
> > > > > > FIFO size. I'd like to know when the FIFO head is wrong for the first
> > > > > > time, one that results in a data transfer where "old" FIFO contents is
> > > > > > read. I haven't dumped any data yet.
> > > > 
> > > > I got a chip-delta == 4 error.
> > > 
> > > I have a proof of concept workaround implemented and I'll let it run
> > > over night.
> > 
> > \o/ The workaround triggered and Stefan's test program continued without
> > problems \o/
> 
> Still running. The workaround triggered more than 45x this night.

I think the same problem occurs with the TEF, too. It's easier to
detect, as the TEF has sequence numbers.

The driver already implements a workaround. It limits the TEF head to TX
head [1] and it refuses to handle TEF objects with wrong sequence
numbers [2] and UINCs only the correctly handled ones.

However it doesn't roll back the "wrong" internal TEF head. The problem
occurs (on Linux) only during high TX load situations, so eventually
there will the enough finished TX frames and the internal TEF head will
be correct again.

regards, 
Marc

[1] https://elixir.bootlin.com/linux/v6.1.4/source/drivers/net/can/spi/mcp251xfd/mcp251xfd-tef.c#L141
[2] https://elixir.bootlin.com/linux/v6.1.4/source/drivers/net/can/spi/mcp251xfd/mcp251xfd-tef.c#L106

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2023-01-11 11:59 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-22  7:44 mcp251xfd: Bad message receiption Stefan Althöfer
2022-12-22  9:07 ` Thomas.Kopp
2022-12-22 10:30   ` AW: " Stefan Althöfer
2022-12-22 13:02     ` Marc Kleine-Budde
2022-12-23  8:18       ` AW: " Stefan Althöfer
2022-12-23 19:25         ` Marc Kleine-Budde
2022-12-24 15:57         ` Marc Kleine-Budde
2022-12-30 22:51           ` AW: " Stefan Althöfer
2023-01-01 10:43             ` Stefan Althöfer
2023-01-04 15:55               ` Stefan Althöfer
2023-01-09  7:10                 ` AW: AW: AW: mcp251xfd: Bad message receiption (1/2) Stefan Althöfer
2023-01-10 12:50                   ` Thomas.Kopp
2023-01-10 14:53                     ` Marc Kleine-Budde
2023-01-10 17:08                       ` Marc Kleine-Budde
2023-01-10 18:39                         ` Marc Kleine-Budde
2023-01-10 21:40                           ` Thomas.Kopp
2023-01-10 21:50                             ` Marc Kleine-Budde
2023-01-10 22:37                               ` Marc Kleine-Budde
2023-01-10 22:43                                 ` Marc Kleine-Budde
2023-01-11  8:17                                   ` Marc Kleine-Budde
2023-01-11 11:54                                     ` Marc Kleine-Budde
2023-01-09  7:10                 ` AW: AW: AW: mcp251xfd: Bad message receiption (2/2) Stefan Althöfer
2022-12-22 14:51 ` mcp251xfd: Bad message receiption Marc Kleine-Budde

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).