* mcp251xfd: Bad message receiption @ 2022-12-22 7:44 Stefan Althöfer 2022-12-22 9:07 ` Thomas.Kopp 2022-12-22 14:51 ` mcp251xfd: Bad message receiption Marc Kleine-Budde 0 siblings, 2 replies; 23+ messages in thread From: Stefan Althöfer @ 2022-12-22 7:44 UTC (permalink / raw) To: linux-can [-- Attachment #1: Type: text/plain, Size: 1664 bytes --] Dear sirs, I'm using the MCP2518FD CAN controller with Rasperry PI 4 and are suffering from wrong data reception. My hardware is a Model 4 B with MCP2581 Click board(s) attached with a prototype (hand-made) interface PCB, as well as baseboards for a CM4 with a single MCP2518 on it. The error occurs on both of them. I have reduced my test case to a simple single thread self-receipt test: * TX two messages * Wait for RX and send out a new message on every receipt * TX for messages in total Refer to the attached PDF for some error cases. Last send frames are at the top of the logs. You can see that wrong messages appear in the RX queue, which have been successfully transmitted in previous test loop. The data that is actually sent out is correct however (checked with an external logger for some cases). The error appeared very often (every 1..10 minutes) when I was using kernel raspberrypi 5.15.32-v7l+ and reduced to once in several hours when I switched to raspberrypi 6.0.12-v7l+. BTW I modified the loopback to external loopback To check for correct transmission. No other changes to the driver as taken from the Raspbian repository. I see infrequent mcp251xfd CRC read errors. I think those are due to the 2518 SPI errata. However they don't occur at the time when the wrong messages are received (refer to the PDF). Questions: - Is this a known problem? - What did change from 5.15.32 to 6.0.12 to reduce the error rate? Maybe this is just because of a change in execution timing? - Any suggestion how I can step further in fixing this issue. ----- Stefan Althöfer (stefan.althoefer@janztec.com) [-- Attachment #2: can_mcp2518fd_fail.pdf --] [-- Type: application/pdf, Size: 59680 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* RE: mcp251xfd: Bad message receiption 2022-12-22 7:44 mcp251xfd: Bad message receiption Stefan Althöfer @ 2022-12-22 9:07 ` Thomas.Kopp 2022-12-22 10:30 ` AW: " Stefan Althöfer 2022-12-22 14:51 ` mcp251xfd: Bad message receiption Marc Kleine-Budde 1 sibling, 1 reply; 23+ messages in thread From: Thomas.Kopp @ 2022-12-22 9:07 UTC (permalink / raw) To: Stefan.Althoefer; +Cc: linux-can Hi Stefan, > I have reduced my test case to a simple single thread self-receipt test: > * TX two messages > * Wait for RX and send out a new message on every receipt > * TX for messages in total > > Refer to the attached PDF for some error cases. Last send frames are at the > top of the logs. You can see that wrong > messages appear in the RX queue, which have been successfully transmitted > in previous test loop. The data that is actually sent > out is correct however (checked with an external logger for some cases). Do I read the pdf correctly (based on the /var/log stuff) that you have two MCP2518FD connected to a Pi4B and both of them are running in internal/external loopback mode no interaction between them and the SPIs are separate? What are your CAN interface settings? Would it be possible to share the script? > I see infrequent mcp251xfd CRC read errors. I think those are due to the 2518 > SPI errata. However they don't > occur at the time when the wrong messages are received (refer to the PDF). Correct, this shouldn't be related to your problem. > - Any suggestion how I can step further in fixing this issue. One thing would be to dump the RAM i.e. the content of the fifos itself to see whether the device actually has the incorrect frames. Marc wrote a tool to dump registers and RAM via debugfs: https://github.com/linux-can/can-utils/blob/master/mcp251xfd/mcp251xfd-dump.c For this debugfs needs to be enabled and mounted (e.g. $mount -t debugfs none /sys/kernel/debug) Now the registers can be dumped like this: cat /sys/kernel/debug/regmap/spi0.0-crc/registers So I'd suggest to abort the script after the first error occurred and then dump registers/ram to find the RX fifo in question and check the content. Best Regards, Thomas ^ permalink raw reply [flat|nested] 23+ messages in thread
* AW: mcp251xfd: Bad message receiption 2022-12-22 9:07 ` Thomas.Kopp @ 2022-12-22 10:30 ` Stefan Althöfer 2022-12-22 13:02 ` Marc Kleine-Budde 0 siblings, 1 reply; 23+ messages in thread From: Stefan Althöfer @ 2022-12-22 10:30 UTC (permalink / raw) To: Thomas.Kopp; +Cc: linux-can Hi Thomas, > Do I read the pdf correctly (based on the /var/log stuff) that you have two MCP2518FD connected to a Pi4B and > both of them are running in internal/external loopback mode no interaction between them and the SPIs are separate? Yes. For the loopback test the CANs are separate. Errors also occur when sending messages between the controllers, but I think that is more difficult to analyze. root@raspberrypi:~# ip -d -s a s can0 4: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UP group default qlen 1000 link/can promiscuity 0 minmtu 0 maxmtu 0 can <LOOPBACK,BERR-REPORTING,FD> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0 bitrate 1000000 sample-point 0.800 tq 25 prop-seg 15 phase-seg1 16 phase-seg2 8 sjw 6 mcp251xfd: tseg1 2..256 tseg2 1..128 sjw 1..128 brp 1..256 brp-inc 1 dbitrate 4000000 dsample-point 0.800 dtq 25 dprop-seg 3 dphase-seg1 4 dphase-seg2 2 dsjw 2 mcp251xfd: dtseg1 1..32 dtseg2 1..16 dsjw 1..16 dbrp 1..256 dbrp-inc 1 clock 40000000 re-started bus-errors arbit-lost error-warn error-pass bus-off 0 0 0 0 0 0 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 RX: bytes packets errors dropped missed mcast 3097429872 2517466658 0 0 0 0 TX: bytes packets errors dropped carrier collsns 3696197680 1258733264 0 0 0 0 root@raspberrypi:~# ip -d -s a s can1 5: can1: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UP group default qlen 1000 link/can promiscuity 0 minmtu 0 maxmtu 0 can <LOOPBACK,BERR-REPORTING,FD> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0 bitrate 1000000 sample-point 0.800 tq 25 prop-seg 15 phase-seg1 16 phase-seg2 8 sjw 6 mcp251xfd: tseg1 2..256 tseg2 1..128 sjw 1..128 brp 1..256 brp-inc 1 dbitrate 4000000 dsample-point 0.800 dtq 25 dprop-seg 3 dphase-seg1 4 dphase-seg2 2 dsjw 2 mcp251xfd: dtseg1 1..32 dtseg2 1..16 dsjw 1..16 dbrp 1..256 dbrp-inc 1 clock 40000000 re-started bus-errors arbit-lost error-warn error-pass bus-off 0 112256 0 1 3 1 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 RX: bytes packets errors dropped missed mcast 2590649888 773910314 3 0 0 0 TX: bytes packets errors dropped carrier collsns 3442619696 386944310 112254 12 0 0 This is the SPI setup for the dual CAN PI: dtparam=spi=on dtoverlay=spi6-1cs # mcp251xfd DTS for RPI4 default CAN on SPI6.0 dtoverlay=mcp251xfd-spi6-0,interrupt_pin=25,oscillator=40000000 # mcp251xfd DTS for RPI4 CAN extension on SPI0.0 dtoverlay=mcp251xfd,spi0-0,interrupt=16,oscillator=40000000 Or did you mean something else with "script"? I'll try the register dump when I suffer the next error ;-) Mfg Stefan -----Ursprüngliche Nachricht----- Von: Thomas.Kopp@microchip.com <Thomas.Kopp@microchip.com> Gesendet: Donnerstag, 22. Dezember 2022 10:07 An: Stefan Althöfer <Stefan.Althoefer@janztec.com> Cc: linux-can@vger.kernel.org Betreff: RE: mcp251xfd: Bad message receiption Hi Stefan, > I have reduced my test case to a simple single thread self-receipt test: > * TX two messages > * Wait for RX and send out a new message on every receipt > * TX for messages in total > > Refer to the attached PDF for some error cases. Last send frames are > at the top of the logs. You can see that wrong messages appear in the > RX queue, which have been successfully transmitted in previous test > loop. The data that is actually sent out is correct however (checked > with an external logger for some cases). Do I read the pdf correctly (based on the /var/log stuff) that you have two MCP2518FD connected to a Pi4B and both of them are running in internal/external loopback mode no interaction between them and the SPIs are separate? What are your CAN interface settings? Would it be possible to share the script? > I see infrequent mcp251xfd CRC read errors. I think those are due to > the 2518 SPI errata. However they don't occur at the time when the > wrong messages are received (refer to the PDF). Correct, this shouldn't be related to your problem. > - Any suggestion how I can step further in fixing this issue. One thing would be to dump the RAM i.e. the content of the fifos itself to see whether the device actually has the incorrect frames. Marc wrote a tool to dump registers and RAM via debugfs: https://github.com/linux-can/can-utils/blob/master/mcp251xfd/mcp251xfd-dump.c For this debugfs needs to be enabled and mounted (e.g. $mount -t debugfs none /sys/kernel/debug) Now the registers can be dumped like this: cat /sys/kernel/debug/regmap/spi0.0-crc/registers So I'd suggest to abort the script after the first error occurred and then dump registers/ram to find the RX fifo in question and check the content. Best Regards, Thomas ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: AW: mcp251xfd: Bad message receiption 2022-12-22 10:30 ` AW: " Stefan Althöfer @ 2022-12-22 13:02 ` Marc Kleine-Budde 2022-12-23 8:18 ` AW: " Stefan Althöfer 0 siblings, 1 reply; 23+ messages in thread From: Marc Kleine-Budde @ 2022-12-22 13:02 UTC (permalink / raw) To: Stefan Althöfer; +Cc: Thomas.Kopp, linux-can [-- Attachment #1: Type: text/plain, Size: 3692 bytes --] On 22.12.2022 10:30:22, Stefan Althöfer wrote: > Hi Thomas, > > > Do I read the pdf correctly (based on the /var/log stuff) that you have two MCP2518FD connected to a Pi4B and > > both of them are running in internal/external loopback mode no interaction between them and the SPIs are separate? > > Yes. For the loopback test the CANs are separate. Errors also occur when sending messages between > the controllers, but I think that is more difficult to analyze. > > root@raspberrypi:~# ip -d -s a s can0 > 4: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UP group default qlen 1000 > link/can promiscuity 0 minmtu 0 maxmtu 0 > can <LOOPBACK,BERR-REPORTING,FD> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0 > bitrate 1000000 sample-point 0.800 > tq 25 prop-seg 15 phase-seg1 16 phase-seg2 8 sjw 6 > mcp251xfd: tseg1 2..256 tseg2 1..128 sjw 1..128 brp 1..256 brp-inc 1 > dbitrate 4000000 dsample-point 0.800 > dtq 25 dprop-seg 3 dphase-seg1 4 dphase-seg2 2 dsjw 2 > mcp251xfd: dtseg1 1..32 dtseg2 1..16 dsjw 1..16 dbrp 1..256 dbrp-inc 1 > clock 40000000 > re-started bus-errors arbit-lost error-warn error-pass bus-off > 0 0 0 0 0 0 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 > RX: bytes packets errors dropped missed mcast > 3097429872 2517466658 0 0 0 0 > TX: bytes packets errors dropped carrier collsns > 3696197680 1258733264 0 0 0 0 > root@raspberrypi:~# ip -d -s a s can1 > 5: can1: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UP group default qlen 1000 > link/can promiscuity 0 minmtu 0 maxmtu 0 > can <LOOPBACK,BERR-REPORTING,FD> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0 > bitrate 1000000 sample-point 0.800 > tq 25 prop-seg 15 phase-seg1 16 phase-seg2 8 sjw 6 > mcp251xfd: tseg1 2..256 tseg2 1..128 sjw 1..128 brp 1..256 brp-inc 1 > dbitrate 4000000 dsample-point 0.800 > dtq 25 dprop-seg 3 dphase-seg1 4 dphase-seg2 2 dsjw 2 > mcp251xfd: dtseg1 1..32 dtseg2 1..16 dsjw 1..16 dbrp 1..256 dbrp-inc 1 > clock 40000000 > re-started bus-errors arbit-lost error-warn error-pass bus-off > 0 112256 0 1 3 1 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 ^^^^^^ ^ ^ ^ > RX: bytes packets errors dropped missed mcasta > 2590649888 773910314 3 0 0 0 > TX: bytes packets errors dropped carrier collsns > 3442619696 386944310 112254 12 0 0 Why do you have so many errors on can1? The device event went into bus-off. > This is the SPI setup for the dual CAN PI: > > dtparam=spi=on > dtoverlay=spi6-1cs > # mcp251xfd DTS for RPI4 default CAN on SPI6.0 > dtoverlay=mcp251xfd-spi6-0,interrupt_pin=25,oscillator=40000000 > # mcp251xfd DTS for RPI4 CAN extension on SPI0.0 > dtoverlay=mcp251xfd,spi0-0,interrupt=16,oscillator=40000000 > > Or did you mean something else with "script"? Can you share your "scselftest" program? > I'll try the register dump when I suffer the next error ;-) Marc -- Pengutronix e.K. | Marc Kleine-Budde | Embedded Linux | https://www.pengutronix.de | Vertretung West/Dortmund | Phone: +49-231-2826-924 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 | [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* AW: AW: mcp251xfd: Bad message receiption 2022-12-22 13:02 ` Marc Kleine-Budde @ 2022-12-23 8:18 ` Stefan Althöfer 2022-12-23 19:25 ` Marc Kleine-Budde 2022-12-24 15:57 ` Marc Kleine-Budde 0 siblings, 2 replies; 23+ messages in thread From: Stefan Althöfer @ 2022-12-23 8:18 UTC (permalink / raw) To: Marc Kleine-Budde; +Cc: Thomas.Kopp, linux-can [-- Attachment #1: Type: text/plain, Size: 18302 bytes --] > Why do you have so many errors on can1? The device event went into bus-off. The device was not freshly started, and I just intended to share the setup. No errors when the problem appeared this night. > Can you share your "scselftest" program? Sure, attached as extracted from the build system for my test tools. Hope this is not the bad guy. Included my ancient copy of libsocketcan. > Do you get the same error if the test is only running on a single mcp2518fd? I'm pretty sure I had cases where I started the test on both, one failed sooner and one failed later. However, currently the second test is still running (after 9h). > Do you get the same error if only a single mcp2518fd is attached to the raspi? Only tested this with the CM4 > Can you share your kernel, including the modifications? Yes if it turns out to be necessary, but I just cloned it from here: git clone --depth=1 --branch <branch> https://github.com/raspberrypi/linux Following these instructions: https://www.raspberrypi.com/documentation/computers/linux_kernel.html#kernel-headers Now this nights error log: Loop 4429 2022-12-23T03:07:45 last+4s total:4h299m + ./sctestself -b -n 4 -l 999 -t 2 -v cmperr,logmsg -F refilldata,leastdots,allowintloopb,stoponerror -d 8 can1 CAN selftest can1 .......idx 2 tx:2A5#00 02 6A 8F 6C 81 C3 B0 rx:2A5#00 00 85 9A EF 54 34 5D idx 3 tx:2A5#00 03 4F 8F 85 59 8E DA rx:2A5#00 00 85 9A EF 54 34 5D .c (700) fail 0: RX (004/003) 2A5#00 00 85 9A EF 54 34 5D 0: RX (004/002) 2A5#00 00 85 9A EF 54 34 5D 0: TX (003/002) 2A5#00 03 4F 8F 85 59 8E DA 0: RX (003/001) 2A5#00 01 7C 30 5C 7A 2D 1C 0: TX (002/001) 2A5#00 02 6A 8F 6C 81 C3 B0 0: RX (002/000) 2A5#00 00 AC AF 70 9B 32 03 0: TX (001/000) 2A5#00 01 7C 30 5C 7A 2D 1C 0: TX (000/000) 2A5#00 00 AC AF 70 9B 32 03 0: RX (004/003) 2A5#00 03 65 32 F8 E7 E6 F5 0: RX (004/002) 2A5#00 02 CC D5 09 FE 55 99 0: TX (003/002) 2A5#00 03 65 32 F8 E7 E6 F5 0: RX (003/001) 2A5#00 01 62 4F 17 34 44 CE 0: TX (002/001) 2A5#00 02 CC D5 09 FE 55 99 0: RX (002/000) 2A5#00 00 FE 58 FE FD C0 EF 0: TX (001/000) 2A5#00 01 62 4F 17 34 44 CE 0: TX (000/000) 2A5#00 00 FE 58 FE FD C0 EF 0: RX (004/003) 2A5#00 03 C9 7B EF 83 93 E0 0: RX (004/002) 2A5#00 02 DD 5E 82 B4 FD C4 0: TX (003/002) 2A5#00 03 C9 7B EF 83 93 E0 0: RX (003/001) 2A5#00 01 90 3B 32 80 90 67 0: TX (002/001) 2A5#00 02 DD 5E 82 B4 FD C4 0: RX (002/000) 2A5#00 00 8F 28 D2 F5 B6 98 0: TX (001/000) 2A5#00 01 90 3B 32 80 90 67 0: TX (000/000) 2A5#00 00 8F 28 D2 F5 B6 98 0: RX (004/003) 2A5#00 03 C5 1E FE 68 F1 65 0: RX (004/002) 2A5#00 02 7E 2B BA 18 F1 7B 0: TX (003/002) 2A5#00 03 C5 1E FE 68 F1 65 0: RX (003/001) 2A5#00 01 CE 1B D7 9E 42 15 0: TX (002/001) 2A5#00 02 7E 2B BA 18 F1 7B 0: RX (002/000) 2A5#00 00 85 9A EF 54 34 5D 0: TX (001/000) 2A5#00 01 CE 1B D7 9E 42 15 0: TX (000/000) 2A5#00 00 85 9A EF 54 34 5D 0: RX (004/003) 2A5#00 03 37 6D 66 8E C6 9B 0: RX (004/002) 2A5#00 02 2D 83 A3 D3 66 27 0: TX (003/002) 2A5#00 03 37 6D 66 8E C6 9B 0: RX (003/001) 2A5#00 01 E8 A5 9A C6 C1 AC 0: TX (002/001) 2A5#00 02 2D 83 A3 D3 66 27 0: RX (002/000) 2A5#00 00 E6 7A D0 27 3E DF 0: TX (001/000) 2A5#00 01 E8 A5 9A C6 C1 AC 0: TX (000/000) 2A5#00 00 E6 7A D0 27 3E DF 0: RX (004/003) 2A5#00 03 D4 54 CF 9A 6D 7A 0: RX (004/002) 2A5#00 02 06 3F 57 59 35 F6 0: TX (003/002) 2A5#00 03 D4 54 CF 9A 6D 7A 0: RX (003/001) 2A5#00 01 C1 F6 A6 E3 84 64 0: TX (002/001) 2A5#00 02 06 3F 57 59 35 F6 Stopped. Ctrl-C to exit. And the register dump 000: 05ab0760 004: 001e0705 008: 00060101 00c: 00020700 010: dcd75c83 014: 00010000 018: 40400040 01c: bf1a0004 020: 00000000 024: 00000000 028: 00000000 02c: 00000000 030: 00000000 034: 00000000 038: 00000000 03c: 00000af4 040: 03000029 044: 00000000 048: 00000000 04c: 00000000 050: 00600080 054: 00000005 058: 00000030 05c: ef000029 060: 00000400 064: 00000160 068: e3000029 06c: 00000000 070: 000004f0 074: e3600090 078: 00000007 07c: 00000620 080: 00600000 084: 00000000 088: 00000740 08c: 00600000 090: 00000000 094: 00000750 098: 00600000 09c: 00000000 0a0: 00000760 0a4: 00600000 0a8: 00000000 0ac: 00000770 0b0: 00600000 0b4: 00000000 0b8: 00000780 0bc: 00600000 0c0: 00000000 0c4: 00000790 0c8: 00600000 0cc: 00000000 0d0: 000007a0 0d4: 00600000 0d8: 00000000 0dc: 000007b0 0e0: 00600000 0e4: 00000000 0e8: 000007c0 0ec: 00600000 0f0: 00000000 0f4: 000007d0 0f8: 00600000 0fc: 00000000 100: 000007e0 104: 00600000 108: 00000000 10c: 000007f0 110: 00600000 114: 00000000 118: 00000800 11c: 00600000 120: 00000000 124: 00000810 128: 00600000 12c: 00000000 130: 00000820 134: 00600000 138: 00000000 13c: 00000830 140: 00600000 144: 00000000 148: 00000840 14c: 00600000 150: 00000000 154: 00000850 158: 00600000 15c: 00000000 160: 00000860 164: 00600000 168: 00000000 16c: 00000870 170: 00600000 174: 00000000 178: 00000880 17c: 00600000 180: 00000000 184: 00000890 188: 00600000 18c: 00000000 190: 000008a0 194: 00600000 198: 00000000 19c: 000008b0 1a0: 00600000 1a4: 00000000 1a8: 000008c0 1ac: 00600000 1b0: 00000000 1b4: 000008d0 1b8: 00600000 1bc: 00000000 1c0: 000008e0 1c4: 00600000 1c8: 00000000 1cc: 000008f0 1d0: 00008281 1d4: 00000000 1d8: 00000000 1dc: 00000000 1e0: 00000000 1e4: 00000000 1e8: 00000000 1ec: 00000000 1f0: 00000000 1f4: 00000000 1f8: 00000000 1fc: 00000000 200: 00000000 204: 00000000 208: 00000000 20c: 00000000 210: 00000000 214: 00000000 218: 00000000 21c: 00000000 220: 00000000 224: 00000000 228: 00000000 22c: 00000000 230: 00000000 234: 00000000 238: 00000000 23c: 00000000 240: 00000000 244: 00000000 248: 00000000 24c: 00000000 250: 00000000 254: 00000000 258: 00000000 25c: 00000000 260: 00000000 264: 00000000 268: 00000000 26c: 00000000 270: 00000000 274: 00000000 278: 00000000 27c: 00000000 280: 00000000 284: 00000000 288: 00000000 28c: 00000000 290: 00000000 294: 00000000 298: 00000000 29c: 00000000 2a0: 00000000 2a4: 00000000 2a8: 00000000 2ac: 00000000 2b0: 00000000 2b4: 00000000 2b8: 00000000 2bc: 00000000 2c0: 00000000 2c4: 00000000 2c8: 00000000 2cc: 00000000 2d0: 00000000 2d4: 00000000 2d8: 00000000 2dc: 00000000 2e0: 00000000 2e4: 00000000 2e8: 00000000 2ec: 00000000 400: 000002a5 404: 0015e0c8 408: 018add13 40c: 000002a5 410: 0015e2c8 414: 018ae58d 418: 000002a5 41c: 0015e4c8 420: 018b2543 424: 000002a5 428: 0015e6c8 42c: 018b2efd 430: 000002a5 434: 000000c8 438: 018add13 43c: afac0000 440: 03329b70 444: 00000000 448: 00000000 44c: 00000000 450: 00000000 454: 00000000 458: 00000000 45c: 00000000 460: 00000000 464: 00000000 468: 00000000 46c: 00000000 470: 00000000 474: 00000000 478: 00000000 47c: 000002a5 480: 000000c8 484: 018ae58d 488: 307c0100 48c: 1c2d7a5c 490: 00000000 494: 00000000 498: 00000000 49c: 00000000 4a0: 00000000 4a4: 00000000 4a8: 00000000 4ac: 00000000 4b0: 00000000 4b4: 00000000 4b8: 00000000 4bc: 00000000 4c0: 00000000 4c4: 00000000 4c8: 000002a5 4cc: 000000c8 4d0: 018b2543 4d4: 8f6a0200 4d8: b0c3816c 4dc: 00000000 4e0: 00000000 4e4: 00000000 4e8: 00000000 4ec: 00000000 4f0: 00000000 4f4: 00000000 4f8: 00000000 4fc: 00000000 500: 00000000 504: 00000000 508: 00000000 50c: 00000000 510: 00000000 514: 000002a5 518: 000000c8 51c: 018b2efd 520: 8f4f0300 524: da8e5985 528: 00000000 52c: 00000000 530: 00000000 534: 00000000 538: 00000000 53c: 00000000 540: 00000000 544: 00000000 548: 00000000 54c: 00000000 550: 00000000 554: 00000000 558: 00000000 55c: 00000000 560: 000002a5 564: 000000c8 568: 01890d0d 56c: 9a850000 570: 5d3454ef 574: 00000000 578: 00000000 57c: 00000000 580: 00000000 584: 00000000 588: 00000000 58c: 00000000 590: 00000000 594: 00000000 598: 00000000 59c: 00000000 5a0: 00000000 5a4: 00000000 5a8: 00000000 5ac: 000002a5 5b0: 000000c8 5b4: 0189157d 5b8: 1bce0100 5bc: 15429ed7 5c0: 00000000 5c4: 00000000 5c8: 00000000 5cc: 00000000 5d0: 00000000 5d4: 00000000 5d8: 00000000 5dc: 00000000 5e0: 00000000 5e4: 00000000 5e8: 00000000 5ec: 00000000 5f0: 00000000 5f4: 00000000 5f8: 000002a5 5fc: 000000c8 600: 0189562d 604: 2b7e0200 608: 7bf118ba 60c: 00000000 610: 00000000 614: 00000000 618: 00000000 61c: 00000000 620: 00000000 624: 00000000 628: 00000000 62c: 00000000 630: 00000000 634: 00000000 638: 00000000 63c: 00000000 640: 00000000 644: 000002a5 648: 000000c8 64c: 01895f97 650: 1ec50300 654: 65f168fe 658: 00000000 65c: 00000000 660: 00000000 664: 00000000 668: 00000000 66c: 00000000 670: 00000000 674: 00000000 678: 00000000 67c: 00000000 680: 00000000 684: 00000000 688: 00000000 68c: 00000000 690: 000002a5 694: 000000c8 698: 0189a529 69c: 288f0000 6a0: 98b6f5d2 6a4: 00000000 6a8: 00000000 6ac: 00000000 6b0: 00000000 6b4: 00000000 6b8: 00000000 6bc: 00000000 6c0: 00000000 6c4: 00000000 6c8: 00000000 6cc: 00000000 6d0: 00000000 6d4: 00000000 6d8: 00000000 6dc: 000002a5 6e0: 000000c8 6e4: 0189ae61 6e8: 3b900100 6ec: 67908032 6f0: 00000000 6f4: 00000000 6f8: 00000000 6fc: 00000000 700: 00000000 704: 00000000 708: 00000000 70c: 00000000 710: 00000000 714: 00000000 718: 00000000 71c: 00000000 720: 00000000 724: 00000000 728: 000002a5 72c: 000000c8 730: 0189f1d7 734: 5edd0200 738: c4fdb482 73c: 00000000 740: 00000000 744: 00000000 748: 00000000 74c: 00000000 750: 00000000 754: 00000000 758: 00000000 75c: 00000000 760: 00000000 764: 00000000 768: 00000000 76c: 00000000 770: 00000000 774: 000002a5 778: 000000c8 77c: 0189fb73 780: 7bc90300 784: e09383ef 788: 00000000 78c: 00000000 790: 00000000 794: 00000000 798: 00000000 79c: 00000000 7a0: 00000000 7a4: 00000000 7a8: 00000000 7ac: 00000000 7b0: 00000000 7b4: 00000000 7b8: 00000000 7bc: 00000000 7c0: 000002a5 7c4: 000000c8 7c8: 018a40dd 7cc: 58fe0000 7d0: efc0fdfe 7d4: 00000000 7d8: 00000000 7dc: 00000000 7e0: 00000000 7e4: 00000000 7e8: 00000000 7ec: 00000000 7f0: 00000000 7f4: 00000000 7f8: 00000000 7fc: 00000000 800: 00000000 804: 00000000 808: 00000000 80c: 000002a5 810: 000000c8 814: 018a4bcd 818: 4f620100 81c: ce443417 820: 00000000 824: 00000000 828: 00000000 82c: 00000000 830: 00000000 834: 00000000 838: 00000000 83c: 00000000 840: 00000000 844: 00000000 848: 00000000 84c: 00000000 850: 00000000 854: 00000000 858: 000002a5 85c: 000000c8 860: 018a8bb5 864: d5cc0200 868: 9955fe09 86c: 00000000 870: 00000000 874: 00000000 878: 00000000 87c: 00000000 880: 00000000 884: 00000000 888: 00000000 88c: 00000000 890: 00000000 894: 00000000 898: 00000000 89c: 00000000 8a0: 00000000 8a4: 000002a5 8a8: 000000c8 8ac: 018a9547 8b0: 32650300 8b4: f5e6e7f8 8b8: 00000000 8bc: 00000000 8c0: 00000000 8c4: 00000000 8c8: 00000000 8cc: 00000000 8d0: 00000000 8d4: 00000000 8d8: 00000000 8dc: 00000000 8e0: 00000000 8e4: 00000000 8e8: 00000000 8ec: 00000000 8f0: 00000000 8f4: 00000000 8f8: 00000000 8fc: 00000000 900: 00000000 904: 00000000 908: 00000000 90c: 00000000 910: 00000000 914: 00000000 918: 00000000 91c: 00000000 920: 00000000 924: 00000000 928: 00000000 92c: 00000000 930: 00000000 934: 00000000 938: 00000000 93c: 00000000 940: 00000000 944: 00000000 948: 00000000 94c: 00000000 950: 00000000 954: 00000000 958: 00000000 95c: 00000000 960: 00000000 964: 00000000 968: 00000000 96c: 00000000 970: 00000000 974: 00000000 978: 00000000 97c: 00000000 980: 00000000 984: 00000000 988: 00000000 98c: 00000000 990: 00000000 994: 00000000 998: 00000000 99c: 00000000 9a0: 00000000 9a4: 00000000 9a8: 00000000 9ac: 00000000 9b0: 00000000 9b4: 00000000 9b8: 00000000 9bc: 00000000 9c0: 00000000 9c4: 00000000 9c8: 00000000 9cc: 00000000 9d0: 00000000 9d4: 00000000 9d8: 00000000 9dc: 00000000 9e0: 00000000 9e4: 00000000 9e8: 00000000 9ec: 00000000 9f0: 00000000 9f4: 00000000 9f8: 00000000 9fc: 00000000 a00: 00000000 a04: 00000000 a08: 00000000 a0c: 00000000 a10: 00000000 a14: 00000000 a18: 00000000 a1c: 00000000 a20: 000002a5 a24: 0015e0c8 a28: afac0000 a2c: 03329b70 a30: 00000000 a34: 00000000 a38: 00000000 a3c: 00000000 a40: 00000000 a44: 00000000 a48: 00000000 a4c: 00000000 a50: 00000000 a54: 00000000 a58: 00000000 a5c: 00000000 a60: 00000000 a64: 00000000 a68: 000002a5 a6c: 0015e2c8 a70: 307c0100 a74: 1c2d7a5c a78: 00000000 a7c: 00000000 a80: 00000000 a84: 00000000 a88: 00000000 a8c: 00000000 a90: 00000000 a94: 00000000 a98: 00000000 a9c: 00000000 aa0: 00000000 aa4: 00000000 aa8: 00000000 aac: 00000000 ab0: 000002a5 ab4: 0015e4c8 ab8: 8f6a0200 abc: b0c3816c ac0: 00000000 ac4: 00000000 ac8: 00000000 acc: 00000000 ad0: 00000000 ad4: 00000000 ad8: 00000000 adc: 00000000 ae0: 00000000 ae4: 00000000 ae8: 00000000 aec: 00000000 af0: 00000000 af4: 00000000 af8: 000002a5 afc: 0015e6c8 b00: 8f4f0300 b04: da8e5985 b08: 00000000 b0c: 00000000 b10: 00000000 b14: 00000000 b18: 00000000 b1c: 00000000 b20: 00000000 b24: 00000000 b28: 00000000 b2c: 00000000 b30: 00000000 b34: 00000000 b38: 00000000 b3c: 00000000 b40: 00000000 b44: 00000000 b48: 00000000 b4c: 00000000 b50: 00000000 b54: 00000000 b58: 00000000 b5c: 00000000 b60: 00000000 b64: 00000000 b68: 00000000 b6c: 00000000 b70: 00000000 b74: 00000000 b78: 00000000 b7c: 00000000 b80: 00000000 b84: 00000000 b88: 00000000 b8c: 00000000 b90: 00000000 b94: 00000000 b98: 00000000 b9c: 00000000 ba0: 00000000 ba4: 00000000 ba8: 00000000 bac: 00000000 bb0: 00000000 bb4: 00000000 bb8: 00000000 bbc: 00000000 bc0: 00000000 bc4: 00000000 bc8: 00000000 bcc: 00000000 bd0: 00000000 bd4: 00000000 bd8: 00000000 bdc: 00000000 be0: 00000000 be4: 00000000 be8: 00000000 bec: 00000000 bf0: 00000000 bf4: 00000000 bf8: 00000000 bfc: 00000000 e00: 00000468 e04: 03030003 e08: 030082b6 e0c: 00000007 e10: 00000000 e14: 00000000 mcp251xfd-dump did not work for me, but It looks as if the correct data is in the rx fifo. Mfg Stefan -----Ursprüngliche Nachricht----- Von: Marc Kleine-Budde <mkl@pengutronix.de> Gesendet: Donnerstag, 22. Dezember 2022 14:02 An: Stefan Althöfer <Stefan.Althoefer@janztec.com> Cc: Thomas.Kopp@microchip.com; linux-can@vger.kernel.org Betreff: Re: AW: mcp251xfd: Bad message receiption On 22.12.2022 10:30:22, Stefan Althöfer wrote: > Hi Thomas, > > > Do I read the pdf correctly (based on the /var/log stuff) that you > > have two MCP2518FD connected to a Pi4B and both of them are running in internal/external loopback mode no interaction between them and the SPIs are separate? > > Yes. For the loopback test the CANs are separate. Errors also occur > when sending messages between the controllers, but I think that is more difficult to analyze. > > root@raspberrypi:~# ip -d -s a s can0 > 4: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UP group default qlen 1000 > link/can promiscuity 0 minmtu 0 maxmtu 0 > can <LOOPBACK,BERR-REPORTING,FD> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0 > bitrate 1000000 sample-point 0.800 > tq 25 prop-seg 15 phase-seg1 16 phase-seg2 8 sjw 6 > mcp251xfd: tseg1 2..256 tseg2 1..128 sjw 1..128 brp 1..256 brp-inc 1 > dbitrate 4000000 dsample-point 0.800 > dtq 25 dprop-seg 3 dphase-seg1 4 dphase-seg2 2 dsjw 2 > mcp251xfd: dtseg1 1..32 dtseg2 1..16 dsjw 1..16 dbrp 1..256 dbrp-inc 1 > clock 40000000 > re-started bus-errors arbit-lost error-warn error-pass bus-off > 0 0 0 0 0 0 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 > RX: bytes packets errors dropped missed mcast > 3097429872 2517466658 0 0 0 0 > TX: bytes packets errors dropped carrier collsns > 3696197680 1258733264 0 0 0 0 > root@raspberrypi:~# ip -d -s a s can1 > 5: can1: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UP group default qlen 1000 > link/can promiscuity 0 minmtu 0 maxmtu 0 > can <LOOPBACK,BERR-REPORTING,FD> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0 > bitrate 1000000 sample-point 0.800 > tq 25 prop-seg 15 phase-seg1 16 phase-seg2 8 sjw 6 > mcp251xfd: tseg1 2..256 tseg2 1..128 sjw 1..128 brp 1..256 brp-inc 1 > dbitrate 4000000 dsample-point 0.800 > dtq 25 dprop-seg 3 dphase-seg1 4 dphase-seg2 2 dsjw 2 > mcp251xfd: dtseg1 1..32 dtseg2 1..16 dsjw 1..16 dbrp 1..256 dbrp-inc 1 > clock 40000000 > re-started bus-errors arbit-lost error-warn error-pass bus-off > 0 112256 0 1 3 1 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 ^^^^^^ ^ ^ ^ > RX: bytes packets errors dropped missed mcasta > 2590649888 773910314 3 0 0 0 > TX: bytes packets errors dropped carrier collsns > 3442619696 386944310 112254 12 0 0 Why do you have so many errors on can1? The device event went into bus-off. > This is the SPI setup for the dual CAN PI: > > dtparam=spi=on > dtoverlay=spi6-1cs > # mcp251xfd DTS for RPI4 default CAN on SPI6.0 > dtoverlay=mcp251xfd-spi6-0,interrupt_pin=25,oscillator=40000000 > # mcp251xfd DTS for RPI4 CAN extension on SPI0.0 > dtoverlay=mcp251xfd,spi0-0,interrupt=16,oscillator=40000000 > > Or did you mean something else with "script"? Can you share your "scselftest" program? > I'll try the register dump when I suffer the next error ;-) Marc -- Pengutronix e.K. | Marc Kleine-Budde | Embedded Linux | https://www.pengutronix.de | Vertretung West/Dortmund | Phone: +49-231-2826-924 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 | [-- Attachment #2: sctestself.tgz --] [-- Type: application/x-compressed, Size: 25711 bytes --] [-- Attachment #3: tc.pl --] [-- Type: application/octet-stream, Size: 1085 bytes --] #!/usr/bin/perl -w use strict; my $can=$ARGV[0] // die "arg0 must be candev"; my $port=$ARGV[1]; my $extlogger="192.168.5.167"; sub print_run_or_die { my $cmdref = $_[0]; my $ret; print "+ ", join(" ", @$cmdref) , "\n"; $ret = system @$cmdref; if ( $ret != 0 ){ if( $ret < 0 ){ print STDERR "tc.pl: failed to execute $cmdref->[0]: $!\n"; system("ls","-l",$cmdref->[0]); system("ls","-lH",$cmdref->[0]); system("size",$cmdref->[0]); exit 255; } elsif( $ret & 0xff ){ fprintf STDERR "tc.pl killed (%d)", ($ret & 0x7f); exit 254; } exit ($ret >> 8); } } my @std_args = ( "./sctestself", "-b", "-n", "4" , "-l", "999", "-t", "2", "-v", "cmperr,logmsg", "-F", "refilldata,leastdots,allowintloopb,stoponerror", ); if( defined($port) ){ push @std_args, "-X", "$extlogger:$port"; } print_run_or_die([@std_args, "-d", "8", $can]); print_run_or_die([@std_args, "-d", "16", $can]); print_run_or_die([@std_args, "-d", "24", $can]); print_run_or_die([@std_args, "-d", "32", $can]); ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: AW: AW: mcp251xfd: Bad message receiption 2022-12-23 8:18 ` AW: " Stefan Althöfer @ 2022-12-23 19:25 ` Marc Kleine-Budde 2022-12-24 15:57 ` Marc Kleine-Budde 1 sibling, 0 replies; 23+ messages in thread From: Marc Kleine-Budde @ 2022-12-23 19:25 UTC (permalink / raw) To: Stefan Althöfer; +Cc: Thomas.Kopp, linux-can [-- Attachment #1: Type: text/plain, Size: 4716 bytes --] On 23.12.2022 08:18:17, Stefan Althöfer wrote: > > Why do you have so many errors on can1? The device event went into bus-off. > > The device was not freshly started, and I just intended to share the > setup. No errors when the problem appeared this night. Ok > > Can you share your "scselftest" program? > > Sure, attached as extracted from the build system for my test tools. > Hope this is not the bad guy. Included my ancient copy of > libsocketcan. > > Do you get the same error if the test is only running on a single > > mcp2518fd? > > I'm pretty sure I had cases where I started the test on both, one > failed sooner and one failed later. However, currently the second test > is still running (after 9h). I want to know if there's an influence between the 2 interfaces. > > Do you get the same error if only a single mcp2518fd is attached to the raspi? > > Only tested this with the CM4 Would be interesting if you only attache 1 mcp2518fd to the board and then re-run the test. > > Can you share your kernel, including the modifications? > > Yes if it turns out to be necessary, but I just cloned it from here: > git clone --depth=1 --branch <branch> > https://github.com/raspberrypi/linux You said you made some modifications to the kernel, also it would be good to use the _extact_ version you using to reproduce the error. > Following these instructions: https://www.raspberrypi.com/documentation/computers/linux_kernel.html#kernel-headers > Now this nights error log: Thanks for the text, much easier to handle. :) > Loop 4429 2022-12-23T03:07:45 last+4s total:4h299m > + ./sctestself -b -n 4 -l 999 -t 2 -v cmperr,logmsg -F refilldata,leastdots,allowintloopb,stoponerror -d 8 can1 > CAN selftest can1 .......idx 2 tx:2A5#00 02 6A 8F 6C 81 C3 B0 > rx:2A5#00 00 85 9A EF 54 34 5D > idx 3 tx:2A5#00 03 4F 8F 85 59 8E DA > rx:2A5#00 00 85 9A EF 54 34 5D > .c (700) fail > 0: RX (004/003) 2A5#00 00 85 9A EF 54 34 5D > 0: RX (004/002) 2A5#00 00 85 9A EF 54 34 5D 2 bad RX here > 0: TX (003/002) 2A5#00 03 4F 8F 85 59 8E DA > 0: RX (003/001) 2A5#00 01 7C 30 5C 7A 2D 1C > 0: TX (002/001) 2A5#00 02 6A 8F 6C 81 C3 B0 > 0: RX (002/000) 2A5#00 00 AC AF 70 9B 32 03 > 0: TX (001/000) 2A5#00 01 7C 30 5C 7A 2D 1C > 0: TX (000/000) 2A5#00 00 AC AF 70 9B 32 03 > 0: RX (004/003) 2A5#00 03 65 32 F8 E7 E6 F5 > 0: RX (004/002) 2A5#00 02 CC D5 09 FE 55 99 > 0: TX (003/002) 2A5#00 03 65 32 F8 E7 E6 F5 > 0: RX (003/001) 2A5#00 01 62 4F 17 34 44 CE > 0: TX (002/001) 2A5#00 02 CC D5 09 FE 55 99 > 0: RX (002/000) 2A5#00 00 FE 58 FE FD C0 EF > 0: TX (001/000) 2A5#00 01 62 4F 17 34 44 CE > 0: TX (000/000) 2A5#00 00 FE 58 FE FD C0 EF > 0: RX (004/003) 2A5#00 03 C9 7B EF 83 93 E0 > 0: RX (004/002) 2A5#00 02 DD 5E 82 B4 FD C4 > 0: TX (003/002) 2A5#00 03 C9 7B EF 83 93 E0 > 0: RX (003/001) 2A5#00 01 90 3B 32 80 90 67 > 0: TX (002/001) 2A5#00 02 DD 5E 82 B4 FD C4 > 0: RX (002/000) 2A5#00 00 8F 28 D2 F5 B6 98 > 0: TX (001/000) 2A5#00 01 90 3B 32 80 90 67 > 0: TX (000/000) 2A5#00 00 8F 28 D2 F5 B6 98 > 0: RX (004/003) 2A5#00 03 C5 1E FE 68 F1 65 > 0: RX (004/002) 2A5#00 02 7E 2B BA 18 F1 7B > 0: TX (003/002) 2A5#00 03 C5 1E FE 68 F1 65 > 0: RX (003/001) 2A5#00 01 CE 1B D7 9E 42 15 > 0: TX (002/001) 2A5#00 02 7E 2B BA 18 F1 7B > 0: RX (002/000) 2A5#00 00 85 9A EF 54 34 5D properly received > 0: TX (001/000) 2A5#00 01 CE 1B D7 9E 42 15 > 0: TX (000/000) 2A5#00 00 85 9A EF 54 34 5D send > 0: RX (004/003) 2A5#00 03 37 6D 66 8E C6 9B > 0: RX (004/002) 2A5#00 02 2D 83 A3 D3 66 27 > 0: TX (003/002) 2A5#00 03 37 6D 66 8E C6 9B > 0: RX (003/001) 2A5#00 01 E8 A5 9A C6 C1 AC > 0: TX (002/001) 2A5#00 02 2D 83 A3 D3 66 27 > 0: RX (002/000) 2A5#00 00 E6 7A D0 27 3E DF > 0: TX (001/000) 2A5#00 01 E8 A5 9A C6 C1 AC > 0: TX (000/000) 2A5#00 00 E6 7A D0 27 3E DF > 0: RX (004/003) 2A5#00 03 D4 54 CF 9A 6D 7A > 0: RX (004/002) 2A5#00 02 06 3F 57 59 35 F6 > 0: TX (003/002) 2A5#00 03 D4 54 CF 9A 6D 7A > 0: RX (003/001) 2A5#00 01 C1 F6 A6 E3 84 64 > 0: TX (002/001) 2A5#00 02 06 3F 57 59 35 F6 > Stopped. Ctrl-C to exit. > > And the register dump [...] > mcp251xfd-dump did not work for me, but It looks as if the correct data is in the rx fifo. With v5.18 the RAM layout changed, the mcp251xfd-dump tool was not adopted. yet. Marc -- Pengutronix e.K. | Marc Kleine-Budde | Embedded Linux | https://www.pengutronix.de | Vertretung West/Dortmund | Phone: +49-231-2826-924 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 | [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: AW: AW: mcp251xfd: Bad message receiption 2022-12-23 8:18 ` AW: " Stefan Althöfer 2022-12-23 19:25 ` Marc Kleine-Budde @ 2022-12-24 15:57 ` Marc Kleine-Budde 2022-12-30 22:51 ` AW: " Stefan Althöfer 1 sibling, 1 reply; 23+ messages in thread From: Marc Kleine-Budde @ 2022-12-24 15:57 UTC (permalink / raw) To: Stefan Althöfer; +Cc: Thomas.Kopp, linux-can I've looked at the regmap dump and added the offset in the RX and TX FIFOs. tl;dr: Everything, but the last 2 messages, look consistent The last 2 RX messages have not been received by the hardware. On the other hand the hardware has received the correct messages, but they were not received by the test application. On 23.12.2022 08:18:17, Stefan Althöfer wrote: > Loop 4429 2022-12-23T03:07:45 last+4s total:4h299m > + ./sctestself -b -n 4 -l 999 -t 2 -v cmperr,logmsg -F refilldata,leastdots,allowintloopb,stoponerror -d 8 can1 > CAN selftest can1 .......idx 2 tx:2A5#00 02 6A 8F 6C 81 C3 B0 > rx:2A5#00 00 85 9A EF 54 34 5D > idx 3 tx:2A5#00 03 4F 8F 85 59 8E DA > rx:2A5#00 00 85 9A EF 54 34 5D > .c (700) fail > 0: RX (004/003) 2A5#00 00 85 9A EF 54 34 5D ??? Not RX'ed by hardware > 0: RX (004/002) 2A5#00 00 85 9A EF 54 34 5D ??? Not RX'ed by hardware > 0: TX (003/002) 2A5#00 03 4F 8F 85 59 8E DA TX 0x03 -> RX 0x03 !!! Not RX'ed by application In the hardware the message is in the TX FIFO at offset 0x03 and in RX 0x03. > 0: RX (003/001) 2A5#00 01 7C 30 5C 7A 2D 1C RX 0x01 > 0: TX (002/001) 2A5#00 02 6A 8F 6C 81 C3 B0 TX 0x02 -> RX 0x02 !!! Not RX'ed by application In the hardware the message is in the TX FIFO at offset 0x02 and in RX 0x02. > 0: RX (002/000) 2A5#00 00 AC AF 70 9B 32 03 RX 0x00 > 0: TX (001/000) 2A5#00 01 7C 30 5C 7A 2D 1C TX 0x01 -> RX 0x01 In the hardware the message is in the TX FIFO at offset 0x01 and in RX 0x01. > 0: TX (000/000) 2A5#00 00 AC AF 70 9B 32 03 TX 0x00 -> RX 0x00 In the hardware the message is in the TX FIFO at offset 0x00 and in RX 0x00. > 0: RX (004/003) 2A5#00 03 65 32 F8 E7 E6 F5 RX 0x0f > 0: RX (004/002) 2A5#00 02 CC D5 09 FE 55 99 RX 0x0e > 0: TX (003/002) 2A5#00 03 65 32 F8 E7 E6 F5 > 0: RX (003/001) 2A5#00 01 62 4F 17 34 44 CE RX 0x0d > 0: TX (002/001) 2A5#00 02 CC D5 09 FE 55 99 > 0: RX (002/000) 2A5#00 00 FE 58 FE FD C0 EF RX 0x0c > 0: TX (001/000) 2A5#00 01 62 4F 17 34 44 CE > 0: TX (000/000) 2A5#00 00 FE 58 FE FD C0 EF > 0: RX (004/003) 2A5#00 03 C9 7B EF 83 93 E0 RX 0x0b > 0: RX (004/002) 2A5#00 02 DD 5E 82 B4 FD C4 RX 0x0a > 0: TX (003/002) 2A5#00 03 C9 7B EF 83 93 E0 > 0: RX (003/001) 2A5#00 01 90 3B 32 80 90 67 RX 0x09 > 0: TX (002/001) 2A5#00 02 DD 5E 82 B4 FD C4 > 0: RX (002/000) 2A5#00 00 8F 28 D2 F5 B6 98 RX 0x08 > 0: TX (001/000) 2A5#00 01 90 3B 32 80 90 67 > 0: TX (000/000) 2A5#00 00 8F 28 D2 F5 B6 98 > 0: RX (004/003) 2A5#00 03 C5 1E FE 68 F1 65 RX 0x07 > 0: RX (004/002) 2A5#00 02 7E 2B BA 18 F1 7B RX 0x06 > 0: TX (003/002) 2A5#00 03 C5 1E FE 68 F1 65 > 0: RX (003/001) 2A5#00 01 CE 1B D7 9E 42 15 RX 0x05 > 0: TX (002/001) 2A5#00 02 7E 2B BA 18 F1 7B > 0: RX (002/000) 2A5#00 00 85 9A EF 54 34 5D RX 0x04 > 0: TX (001/000) 2A5#00 01 CE 1B D7 9E 42 15 > 0: TX (000/000) 2A5#00 00 85 9A EF 54 34 5D Can you run this in parallel to the test. Abort with Ctrl+c after the test fails and send me the log file. The last 256 lines should be enough. | candump any,0:0,#FFFFFFFF -l The candump writes to current dir by default. To avoid side effects change to a tmpfs, if you have enough RAM :). Also send the regmap dump and of course the error message of the test tool. regards, Marc -- Pengutronix e.K. | Marc Kleine-Budde | Embedded Linux | https://www.pengutronix.de | Vertretung West/Dortmund | Phone: +49-231-2826-924 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 | ^ permalink raw reply [flat|nested] 23+ messages in thread
* AW: AW: AW: mcp251xfd: Bad message receiption 2022-12-24 15:57 ` Marc Kleine-Budde @ 2022-12-30 22:51 ` Stefan Althöfer 2023-01-01 10:43 ` Stefan Althöfer 0 siblings, 1 reply; 23+ messages in thread From: Stefan Althöfer @ 2022-12-30 22:51 UTC (permalink / raw) To: Marc Kleine-Budde; +Cc: Thomas.Kopp, linux-can [-- Attachment #1: Type: text/plain, Size: 5427 bytes --] > Would be interesting if you only attache 1 mcp2518fd to the board and then re-run the test. The CM4 is the "raspberry pi 4 _compute_ module". So same "hardware" as the standard PI4 but with only one interface attached (on a professional PCB). Error also happens there. > You said you made some modifications to the kernel, also it would be good to use the _extact_ version you using to reproduce the error. Does this answer this question?: root@raspberrypi:~/linux-6.0# git remote -v origin https://github.com/raspberrypi/linux (fetch) origin https://github.com/raspberrypi/linux (push) root@raspberrypi:~/linux-6.0# git status On branch rpi-6.0.y Your branch is up to date with 'origin/rpi-6.0.y'. Changes not staged for commit: (use "git add <file>..." to update what will be committed) (use "git restore <file>..." to discard changes in working directory) modified: drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c no changes added to commit (use "git add" and/or "git commit -a") root@raspberrypi:~/linux-6.0# git diff diff --git a/drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c b/drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c index 68df6d464..5eab9dd86 100644 --- a/drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c +++ b/drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c @@ -648,7 +648,7 @@ static u8 mcp251xfd_get_normal_mode(const struct mcp251xfd_priv *priv) u8 mode; if (priv->can.ctrlmode & CAN_CTRLMODE_LOOPBACK) - mode = MCP251XFD_REG_CON_MODE_INT_LOOPBACK; + mode = MCP251XFD_REG_CON_MODE_EXT_LOOPBACK; else if (priv->can.ctrlmode & CAN_CTRLMODE_LISTENONLY) mode = MCP251XFD_REG_CON_MODE_LISTENONLY; else if (priv->can.ctrlmode & CAN_CTRLMODE_FD) Regarding my test-program: I used to cross compile it. I moved it to the PI and compiled it natively with the Makefile.standalone that I send to you. Surprisingly I found only 1 of my 4 instances failed after 72 hours. When I added -g -O2 flags, the natively build programs failed 3 out of 4 within 26 hours (similar to the cross-build binaries: gcc-6.4 vs gcc-10.2). > Can you run this in parallel to the test. Abort with Ctrl+c after the test fails and send me the log file. The last 256 lines should be enough. Sure, but logging for 24h will be to much for my disk space. I found no utility to handle this, so I quick-hacked rbuflog https://gist.github.com/stefanalt/dd4e68490d1a4e2a343b0beaa1b0d230. The result was so bizarre, that I finally also added at normal log which I continuously shrinked by running the following command in parallel: candump -D can0,0:0,#FFFFFFFF | tee candump0_tee | rbuflog -n 200 -d 10000 candump0 loopit -d 10 bash -c 'if [ $( du -k candump0_tee | cut -f1 ) -gt 7000 ] ; then cat /dev/null > candump0_tee ; fi' After the fail I saved all logs/registers and sent a test message through the interface (cansend can0 555#55555555). After that I re-saved everything and attached it as zip. The funny part is marked below. Some messages appear 4 times, where all other message appear only twice. I have seen this happening several times. can0 2A5 [16] 00 02 67 A6 C8 CE 0D 37 F9 63 56 62 F6 2B E4 02 can0 2A5 [16] 00 02 67 A6 C8 CE 0D 37 F9 63 56 62 F6 2B E4 02 can0 2A5 [16] 00 03 C5 50 54 BE E1 16 7A 3F 70 B8 EE 5A 09 67 can0 2A5 [16] 00 03 C5 50 54 BE E1 16 7A 3F 70 B8 EE 5A 09 67 can0 2A5 [16] 00 00 C1 CA 4C 28 70 15 F6 7E 4C EF E1 A2 52 D8 ** can0 2A5 [16] 00 00 C1 CA 4C 28 70 15 F6 7E 4C EF E1 A2 52 D8 ** can0 2A5 [16] 00 00 C1 CA 4C 28 70 15 F6 7E 4C EF E1 A2 52 D8 ** can0 2A5 [16] 00 00 C1 CA 4C 28 70 15 F6 7E 4C EF E1 A2 52 D8 ** can0 2A5 [16] 00 01 CD 36 DA 92 86 2E 50 67 44 CB A6 B4 83 94 ** can0 2A5 [16] 00 01 CD 36 DA 92 86 2E 50 67 44 CB A6 B4 83 94 ** can0 2A5 [16] 00 01 CD 36 DA 92 86 2E 50 67 44 CB A6 B4 83 94 ** can0 2A5 [16] 00 01 CD 36 DA 92 86 2E 50 67 44 CB A6 B4 83 94 ** can0 2A5 [16] 00 02 0F 8D FC D0 57 48 F9 C8 5D EF 46 A9 DF 27 can0 2A5 [16] 00 02 0F 8D FC D0 57 48 F9 C8 5D EF 46 A9 DF 27 can0 2A5 [16] 00 03 4B 31 FF 18 67 D9 AA ED 07 FB 55 4B C6 FB can0 2A5 [16] 00 03 4B 31 FF 18 67 D9 AA ED 07 FB 55 4B C6 FB can0 555 [4] 55 55 55 55 can0 555 [4] 55 55 55 55 Beside of that, the messages match with what my application has sent (rx-ed messages removed from the following lines, see the zip for that) 0: TX72 (003/002) 2A5#00 03 4B 31 FF 18 67 D9 AA ED 07 FB 55 4B C6 FB 0: TX72 (002/001) 2A5#00 02 0F 8D FC D0 57 48 F9 C8 5D EF 46 A9 DF 27 0: TX72 (001/000) 2A5#00 01 CD 36 DA 92 86 2E 50 67 44 CB A6 B4 83 94 0: TX72 (000/000) 2A5#00 00 C1 CA 4C 28 70 15 F6 7E 4C EF E1 A2 52 D8 And one more observation: Once or twice I have seen my application fail with not returning MTU size in the socketcan read: + ./sctestself -b -n 4 -l 999 -t 2 -v cmperr,logmsg -F refilldata,leastdots,allowintloopb,stoponerror -d 16 can1 CAN selftest can1 . ERROR: recvfrom: ret=16, errno=0 Is there any obvious explanation for this? I tried to add more output for this case, but it has not happed until then. --Stefan [-- Attachment #2: 20221230T1945_can0.tgz --] [-- Type: application/x-compressed, Size: 6905 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* AW: AW: AW: mcp251xfd: Bad message receiption 2022-12-30 22:51 ` AW: " Stefan Althöfer @ 2023-01-01 10:43 ` Stefan Althöfer 2023-01-04 15:55 ` Stefan Althöfer 0 siblings, 1 reply; 23+ messages in thread From: Stefan Althöfer @ 2023-01-01 10:43 UTC (permalink / raw) To: Marc Kleine-Budde; +Cc: Thomas.Kopp, linux-can [-- Attachment #1: Type: text/plain, Size: 2341 bytes --] Update: During a recent test-run I had both of the two CAN interfaces of my RPI4 Model B fail. First can0 and then some hours later can1. Furthermore I attached another dump after fail. The last one was for the case where my application receives 4 wrong messages (2 of them equal). This case is where my application receives 2 wrong messages only (both being equal). The candump has a different pattern: CAN selftest can1 ......idx 2 tx:2A5#00 02 01 08 9A 5D 15 4B rx:2A5#00 00 D1 7D 9D 18 91 50 idx 3 tx:2A5#00 03 AC 4D 55 BF 2F 56 rx:2A5#00 00 D1 7D 9D 18 91 50 c (584) fail 0: RX72 (004/003) 2A5#00 00 D1 7D 9D 18 91 50 0: RX72 (004/002) 2A5#00 00 D1 7D 9D 18 91 50 0: TX72 (003/002) 2A5#00 03 AC 4D 55 BF 2F 56 0: RX72 (003/001) 2A5#00 01 B0 ED 22 AE 94 9F 0: TX72 (002/001) 2A5#00 02 01 08 9A 5D 15 4B 0: RX72 (002/000) 2A5#00 00 57 3D 14 2B 3A BA 0: TX72 (001/000) 2A5#00 01 B0 ED 22 AE 94 9F 0: TX72 (000/000) 2A5#00 00 57 3D 14 2B 3A BA can1 2A5 [08] 00 03 1C 48 53 3D E1 9B can1 2A5 [08] 00 03 1C 48 53 3D E1 9B can1 2A5 [08] 00 00 57 3D 14 2B 3A BA can1 2A5 [08] 00 00 57 3D 14 2B 3A BA can1 2A5 [08] 00 01 B0 ED 22 AE 94 9F can1 2A5 [08] 00 01 B0 ED 22 AE 94 9F can1 2A5 [08] 00 02 01 08 9A 5D 15 4B ** can1 2A5 [08] 00 02 01 08 9A 5D 15 4B ** can1 2A5 [08] 00 02 01 08 9A 5D 15 4B ** can1 2A5 [08] 00 02 01 08 9A 5D 15 4B ** can1 2A5 [08] 00 03 AC 4D 55 BF 2F 56 ** can1 2A5 [08] 00 03 AC 4D 55 BF 2F 56 ** can1 2A5 [08] 00 03 AC 4D 55 BF 2F 56 ** can1 2A5 [08] 00 03 AC 4D 55 BF 2F 56 ** I've seen other failures that followed those two pattern. Unfortunately, I removed my external logger, so it's unsure what actually happens on the bus. > And one more observation: Once or twice I have seen my application fail with not returning MTU size in the socketcan read: > ERROR: recvfrom: ret=16, errno=0 > Is there any obvious explanation for this? I tried to add more output for this case, but it has not happed until then. Stupid me. Obviously this is a message in old CAN_MTU size. Is this still used for errors even if the MTU is set to CANFD_MTU? This did not show up again, so I wasn't yet able to look into the data payload. -- Stefan [-- Attachment #2: 20221231T0158_can1.tgz --] [-- Type: application/x-compressed, Size: 5992 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* AW: AW: AW: mcp251xfd: Bad message receiption 2023-01-01 10:43 ` Stefan Althöfer @ 2023-01-04 15:55 ` Stefan Althöfer 2023-01-09 7:10 ` AW: AW: AW: mcp251xfd: Bad message receiption (1/2) Stefan Althöfer 2023-01-09 7:10 ` AW: AW: AW: mcp251xfd: Bad message receiption (2/2) Stefan Althöfer 0 siblings, 2 replies; 23+ messages in thread From: Stefan Althöfer @ 2023-01-04 15:55 UTC (permalink / raw) To: Marc Kleine-Budde; +Cc: Thomas.Kopp, linux-can [-- Attachment #1: Type: text/plain, Size: 3127 bytes --] I could again capture a case where some (false) 16 byte MTU frames were received - attached. Loop 24435 2023-01-04T14:25:57 last+4s total:28h36m ..... + ./sctestself -b -n 4 -l 999 -t 2 -v cmperr,logmsg -F refilldata,leastdots,allowintloopb,stoponerror -X 192.168.5.221:4001 -d 24 can0 CAN selftest can0 Wrong-RX16(rxcnt=0):000# Wrong-RX16(rxcnt=1):000# Wrong-RX16(rxcnt=2):000# Wrong-RX16(rxcnt=3):000# idx 0 tx:2A5#00 00 DB 83 C7 C6 D2 22 D0 F1 81 25 26 15 EF 1F 1C 6F 91 BC 00 DC EF 74 rx:000# idx 1 tx:2A5#00 01 B2 FA 3F 38 59 F9 13 11 4D EE 94 14 B4 67 36 84 58 B7 A9 7F CC 99 rx:000# idx 2 tx:2A5#00 02 9E E9 08 2F A5 08 0C 95 7C BE 8F BB F7 E8 B5 0A F9 02 F8 8E 16 AC rx:000# idx 3 tx:2A5#00 03 F5 4D 30 4D 04 DA CC D1 73 6B BA 7B 9A 5F 84 A6 F4 00 65 83 BC 5C rx:000# .c (0) fail 0: RX16 (004/003) 000# 0: RX16 (004/002) 000# 0: TX72 (003/002) 2A5#00 03 F5 4D 30 4D 04 DA CC D1 73 6B BA 7B 9A 5F 84 A6 F4 00 65 83 BC 5C 0: RX16 (003/001) 000# 0: TX72 (002/001) 2A5#00 02 9E E9 08 2F A5 08 0C 95 7C BE 8F BB F7 E8 B5 0A F9 02 F8 8E 16 AC 0: RX16 (002/000) 000# 0: TX72 (001/000) 2A5#00 01 B2 FA 3F 38 59 F9 13 11 4D EE 94 14 B4 67 36 84 58 B7 A9 7F CC 99 0: TX72 (000/000) 2A5#00 00 DB 83 C7 C6 D2 22 D0 F1 81 25 26 15 EF 1F 1C 6F 91 BC 00 DC EF 74 External logger data: (54:40.690.506 +0.089) RX B - 2A5 [24] 00 03 F5 4D 30 4D 04 DA CC D1 73 6B BA 7B 9A 5F 84 A6 F4 00 65 83 BC 5C (54:40.690.417 +3.307) RX B - 2A5 [24] 00 02 9E E9 08 2F A5 08 0C 95 7C BE 8F BB F7 E8 B5 0A F9 02 F8 8E 16 AC **** suspicious delay (54:40.687.110 +0.088) RX B - 2A5 [24] 00 01 B2 FA 3F 38 59 F9 13 11 4D EE 94 14 B4 67 36 84 58 B7 A9 7F CC 99 (54:40.687.022 +26.681) RX B - 2A5 [24] 00 00 DB 83 C7 C6 D2 22 D0 F1 81 25 26 15 EF 1F 1C 6F 91 BC 00 DC EF 74 **** plausible delay (54:40.660.341 +0.071) RX B - 2A5 [16] 00 03 95 89 EA 22 98 62 36 77 C2 F9 69 99 0D 32 (54:40.660.270 +0.417) RX B - 2A5 [16] 00 02 CF 52 6C 29 E2 86 73 10 3E 93 D4 72 30 05 (54:40.659.853 +0.071) RX B - 2A5 [16] 00 01 8D 0E 77 13 DF 60 C3 F1 D7 14 C9 8B CC 04 (54:40.659.782 +0.460) RX B - 2A5 [16] 00 00 5A 90 19 49 2D B8 20 C4 34 9D 31 23 59 E5 (54:40.659.322 +0.070) RX B - 2A5 [16] 00 03 86 D1 E9 AF 12 76 51 D7 B4 B8 3B 43 85 68 (54:40.659.252 +0.420) RX B - 2A5 [16] 00 02 EF 9D 9F D6 97 7B 7D 6C EF BB B4 6A B5 92 My sctestself received four 000# frames and quits. There are more such frames in the candump. None of them were received by the external logger and neither in my own send-log. It seems this happens when the sctestself does it's first test loop - which I also have seen in a second case. The 26.681 ms delay seems to be normal as the program is started by the outer test loop. The +3.307 ms delay however is not normally seen when sctestself does it's first loop, not even if I drop caches after each sctestself run. Can sctestself do something wrong to cause it? -- Stefan [-- Attachment #2: 20230104T1425_can0_rx16.tgz --] [-- Type: application/x-compressed, Size: 6235 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* AW: AW: AW: mcp251xfd: Bad message receiption (1/2) 2023-01-04 15:55 ` Stefan Althöfer @ 2023-01-09 7:10 ` Stefan Althöfer 2023-01-10 12:50 ` Thomas.Kopp 2023-01-09 7:10 ` AW: AW: AW: mcp251xfd: Bad message receiption (2/2) Stefan Althöfer 1 sibling, 1 reply; 23+ messages in thread From: Stefan Althöfer @ 2023-01-09 7:10 UTC (permalink / raw) To: Marc Kleine-Budde; +Cc: Thomas.Kopp, linux-can [-- Attachment #1: Type: text/plain, Size: 717 bytes --] Hope this doesn't start to get annoying! I captured two more cases for the received-wrong-and-duplicate case. Attached. Also I couldn’t resist to create a PDF with some things highlighted ;-) Perhaps I should mention that for the PDF I: * reversed the last-on-top output of sctestself * removed the TX log lines of sctestself * removed every send line of the candump log For the last case, I attached debugger after the fail an tried to read from the socket until read returned -1: (gdb) p malloc(72) (gdb) p read(sockfd,$1,72) (gdb) print fprint_canframe_mtu(stdout, $1, "\n", 4, 72) And surprise surprise, there were more CAN frames in store. Refer to the log/pdf. -- Stefan [-- Attachment #2: 20230106T1822.tgz --] [-- Type: application/x-compressed, Size: 7765 bytes --] [-- Attachment #3: 20230107T1352.tgz --] [-- Type: application/x-compressed, Size: 10957 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* RE: AW: AW: mcp251xfd: Bad message receiption (1/2) 2023-01-09 7:10 ` AW: AW: AW: mcp251xfd: Bad message receiption (1/2) Stefan Althöfer @ 2023-01-10 12:50 ` Thomas.Kopp 2023-01-10 14:53 ` Marc Kleine-Budde 0 siblings, 1 reply; 23+ messages in thread From: Thomas.Kopp @ 2023-01-10 12:50 UTC (permalink / raw) To: Stefan.Althoefer, mkl; +Cc: linux-can > Hope this doesn't start to get annoying! Absolutely not, having the data and you debugging this is helpful. > For the last case, I attached debugger after the fail an tried to read from the > socket until read returned -1: > (gdb) p malloc(72) > (gdb) p read(sockfd,$1,72) > (gdb) print fprint_canframe_mtu(stdout, $1, "\n", 4, 72) And surprise > surprise, there were more CAN frames in store. Refer to the log/pdf. I was finally able to reproduce the issue and capture a full SPI/CAN log when it happens. At this point it looks like the tail pointer of the FIFO in the MCP2518FD gets corrupted and the driver does what it's supposed to do and reads out lots of frames... So yeah, I see all the frames being read via SPI multiple times which matches what you found with your gdb approach. It's interesting to see that this wasn't noticed before - I have reproduced this issue also with cangen (when manually comparing TX/RX counters) I'll try to find the rootcause and what can be done to mitigate this. Best Regards, Thomas ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: RE: AW: AW: mcp251xfd: Bad message receiption (1/2) 2023-01-10 12:50 ` Thomas.Kopp @ 2023-01-10 14:53 ` Marc Kleine-Budde 2023-01-10 17:08 ` Marc Kleine-Budde 0 siblings, 1 reply; 23+ messages in thread From: Marc Kleine-Budde @ 2023-01-10 14:53 UTC (permalink / raw) To: Thomas.Kopp; +Cc: Stefan.Althoefer, linux-can [-- Attachment #1: Type: text/plain, Size: 1261 bytes --] On 10.01.2023 12:50:18, Thomas.Kopp@microchip.com wrote: > > For the last case, I attached debugger after the fail an tried to read from the > > socket until read returned -1: > > (gdb) p malloc(72) > > (gdb) p read(sockfd,$1,72) > > (gdb) print fprint_canframe_mtu(stdout, $1, "\n", 4, 72) And surprise > > surprise, there were more CAN frames in store. Refer to the log/pdf. > > I was finally able to reproduce the issue and capture a full SPI/CAN > log when it happens. At this point it looks like the tail pointer of > the FIFO in the MCP2518FD gets corrupted and the driver does what it's > supposed to do and reads out lots of frames... You mean the head pointer? > So yeah, I see all the > frames being read via SPI multiple times which matches what you found > with your gdb approach. > It's interesting to see that this wasn't noticed before - I have > reproduced this issue also with cangen (when manually comparing TX/RX > counters) Marc -- Pengutronix e.K. | Marc Kleine-Budde | Embedded Linux | https://www.pengutronix.de | Vertretung West/Dortmund | Phone: +49-231-2826-924 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 | [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: RE: AW: AW: mcp251xfd: Bad message receiption (1/2) 2023-01-10 14:53 ` Marc Kleine-Budde @ 2023-01-10 17:08 ` Marc Kleine-Budde 2023-01-10 18:39 ` Marc Kleine-Budde 0 siblings, 1 reply; 23+ messages in thread From: Marc Kleine-Budde @ 2023-01-10 17:08 UTC (permalink / raw) To: Thomas.Kopp; +Cc: Stefan.Althoefer, linux-can [-- Attachment #1: Type: text/plain, Size: 6866 bytes --] On 10.01.2023 15:53:22, Marc Kleine-Budde wrote: > On 10.01.2023 12:50:18, Thomas.Kopp@microchip.com wrote: > > > For the last case, I attached debugger after the fail an tried to read from the > > > socket until read returned -1: > > > (gdb) p malloc(72) > > > (gdb) p read(sockfd,$1,72) > > > (gdb) print fprint_canframe_mtu(stdout, $1, "\n", 4, 72) And surprise > > > surprise, there were more CAN frames in store. Refer to the log/pdf. > > > > I was finally able to reproduce the issue and capture a full SPI/CAN > > log when it happens. At this point it looks like the tail pointer of > > the FIFO in the MCP2518FD gets corrupted and the driver does what it's > > supposed to do and reads out lots of frames... > > You mean the head pointer? I've reproduced the issue here, too, on a single-core imx6, net-next/master from end of last year. I added some tracing to the driver and recoded the following events. Let me explain how to read this. The "chip" variants are read from the chip via SPI. The "ring" variants are the cached values in the driver. chip-tail chip-delta (head-tail) | chip-head | | | | ring-tail ring-head ring-delta | | | | | | (head-tail) | do_rxif_analyze: ct=0x0b ch=0x0c cd=0x01 rt=0x00000c9b rh=0x00000c9c rd=0x01 | TFERFFIF Receive FIFO Full Interrupt Flag | TFHRFHIF Receive FIFO Half Full Interrupt Flag | TFNRFNIF x Receive FIFO Not Empty Interrupt Flag ^^^^^^^^ These are the decoded lowest 3 bits of the FIFOSTA register. The different runs of the rxif handler (mcp251xfd_handle_rxif()) are separated by a ("----") line. Now the analysis: | -------------------------------------------------------------------------------- | do_rxif_analyze: ct=0x09 ch=0x0a cd=0x01 rt=0x00000c99 rh=0x00000c9a rd=0x01 | TFERFFIF Receive FIFO Full Interrupt Flag | TFHRFHIF Receive FIFO Half Full Interrupt Flag | TFNRFNIF x Receive FIFO Not Empty Interrupt Flag | -------------------------------------------------------------------------------- | do_rxif_analyze: ct=0x0a ch=0x0b cd=0x01 rt=0x00000c9a rh=0x00000c9b rd=0x01 | TFERFFIF Receive FIFO Full Interrupt Flag | TFHRFHIF Receive FIFO Half Full Interrupt Flag | TFNRFNIF x Receive FIFO Not Empty Interrupt Flag | -------------------------------------------------------------------------------- | do_rxif_analyze: ct=0x0b ch=0x0c cd=0x01 rt=0x00000c9b rh=0x00000c9c rd=0x01 | TFERFFIF Receive FIFO Full Interrupt Flag | TFHRFHIF Receive FIFO Half Full Interrupt Flag | TFNRFNIF x Receive FIFO Not Empty Interrupt Flag So far everything looks good. The chip delta ("cd"), i.e. the CAN frames waiting to be read is 1. The ring delta is 1, too. Meaning 1 CAN frame has been read. | -------------------------------------------------------------------------------- | do_rxif_analyze: ct=0x0c ch=0x0f cd=0x03 rt=0x00000c9c rh=0x00000c9f rd=0x03 | TFERFFIF Receive FIFO Full Interrupt Flag | TFHRFHIF Receive FIFO Half Full Interrupt Flag | TFNRFNIF x Receive FIFO Not Empty Interrupt Flag Now 3 CAN frame are waiting to be read. Thomas told me that he sees 3 CAN frames waiting too, before things go downhill. | -------------------------------------------------------------------------------- | do_rxif_analyze: ct=0x0f ch=0x0e cd=0x0f rt=0x00000c9f rh=0x00000cae rd=0x0f | TFERFFIF Receive FIFO Full Interrupt Flag | TFHRFHIF x Receive FIFO Half Full Interrupt Flag | TFNRFNIF x Receive FIFO Not Empty Interrupt Flag This is strange: chip head "jumped" to 0x0e, this means there are 0x0f CAN frames waiting. However the FIFO flags (FIFO Half Full set, FIFO Full not set) are consistent with 0x0f waiting CAN frames, the FIFO size if 0x10. Note: The chip tail "ct=0x0f" is consistent, as the "old" chip head was 0x0f. | -------------------------------------------------------------------------------- | do_rxif_analyze: ct=0x0e ch=0x0e cd=0x10 rt=0x00000cae rh=0x00000cbe rd=0x10 | TFERFFIF x Receive FIFO Full Interrupt Flag | TFHRFHIF x Receive FIFO Half Full Interrupt Flag | TFNRFNIF x Receive FIFO Not Empty Interrupt Flag Chip tail is consistent (0x0e) as it equals the old chip head. chip head is also 0x0e, this means either an empty or a full FIFO, the flags indicate a full FIFO. The following events seem to be OK: | -------------------------------------------------------------------------------- | do_rxif_analyze: ct=0x0e ch=0x0f cd=0x01 rt=0x00000cbe rh=0x00000cbf rd=0x01 | TFERFFIF Receive FIFO Full Interrupt Flag | TFHRFHIF Receive FIFO Half Full Interrupt Flag | TFNRFNIF x Receive FIFO Not Empty Interrupt Flag | -------------------------------------------------------------------------------- | do_rxif_analyze: ct=0x0f ch=0x00 cd=0x01 rt=0x00000cbf rh=0x00000cc0 rd=0x01 | TFERFFIF Receive FIFO Full Interrupt Flag | TFHRFHIF Receive FIFO Half Full Interrupt Flag | TFNRFNIF x Receive FIFO Not Empty Interrupt Flag | -------------------------------------------------------------------------------- Final note on the number of send and received CAN frames. The ring head (in the event above) shows 0x0cc0 received CAN frames. The sequence number in the register dump shows 0x000c9f, which means a total of 0x000ca0 CAN frames have been send. | TX-0 Object: 0x03 (0xaf8) | id = 0x000002a5 | flags = 0x00193e08 | SEQ_MCP2517FD = 0x00001f Sequence (MCP2517) | SEQ_MCP2518FD = 0x000c9f Sequence (MCP2518) | data = 00 03 69 df c5 ee 62 77 The BDIAG1 register gives a total of 0x0ca0 error free messages, looks consistent, too Thomas is this RX'ed or TX'ed error free messages? | BDIAG1: bdiag1(0x03c)=0x00000ca0 [...] | EFMSGCNT = 0x0ca0 Error Free Message Counter regards, Marc -- Pengutronix e.K. | Marc Kleine-Budde | Embedded Linux | https://www.pengutronix.de | Vertretung West/Dortmund | Phone: +49-231-2826-924 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 | [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: RE: AW: AW: mcp251xfd: Bad message receiption (1/2) 2023-01-10 17:08 ` Marc Kleine-Budde @ 2023-01-10 18:39 ` Marc Kleine-Budde 2023-01-10 21:40 ` Thomas.Kopp 0 siblings, 1 reply; 23+ messages in thread From: Marc Kleine-Budde @ 2023-01-10 18:39 UTC (permalink / raw) To: Thomas.Kopp; +Cc: Stefan.Althoefer, linux-can [-- Attachment #1: Type: text/plain, Size: 4295 bytes --] Another failure, it looks similar: | do_rxif_analyze: ct=0x0a ch=0x0b cd=0x01 rt=0x000000fa rh=0x000000fb rd=0x01 | TFERFFIF Receive FIFO Full Interrupt Flag | TFHRFHIF Receive FIFO Half Full Interrupt Flag | TFNRFNIF x Receive FIFO Not Empty Interrupt Flag | -------------------------------------------------------------------------------- | do_rxif_analyze: ct=0x0b ch=0x0c cd=0x01 rt=0x000000fb rh=0x000000fc rd=0x01 | TFERFFIF Receive FIFO Full Interrupt Flag | TFHRFHIF Receive FIFO Half Full Interrupt Flag | TFNRFNIF x Receive FIFO Not Empty Interrupt Flag | -------------------------------------------------------------------------------- | do_rxif_analyze: ct=0x0c ch=0x0f cd=0x03 rt=0x000000fc rh=0x000000ff rd=0x03 | TFERFFIF Receive FIFO Full Interrupt Flag | TFHRFHIF Receive FIFO Half Full Interrupt Flag | TFNRFNIF x Receive FIFO Not Empty Interrupt Flag | -------------------------------------------------------------------------------- | do_rxif_analyze: ct=0x0f ch=0x0e cd=0x0f rt=0x000000ff rh=0x0000010e rd=0x0f | TFERFFIF Receive FIFO Full Interrupt Flag | TFHRFHIF x Receive FIFO Half Full Interrupt Flag | TFNRFNIF x Receive FIFO Not Empty Interrupt Flag | -------------------------------------------------------------------------------- | do_rxif_analyze: ct=0x0e ch=0x0e cd=0x10 rt=0x0000010e rh=0x0000011e rd=0x10 | TFERFFIF x Receive FIFO Full Interrupt Flag | TFHRFHIF x Receive FIFO Half Full Interrupt Flag | TFNRFNIF x Receive FIFO Not Empty Interrupt Flag | -------------------------------------------------------------------------------- | do_rxif_analyze: ct=0x0e ch=0x0f cd=0x01 rt=0x0000011e rh=0x0000011f rd=0x01 | TFERFFIF Receive FIFO Full Interrupt Flag | TFHRFHIF Receive FIFO Half Full Interrupt Flag | TFNRFNIF x Receive FIFO Not Empty Interrupt Flag | -------------------------------------------------------------------------------- | do_rxif_analyze: ct=0x0f ch=0x00 cd=0x01 rt=0x0000011f rh=0x00000120 rd=0x01 | TFERFFIF Receive FIFO Full Interrupt Flag | TFHRFHIF Receive FIFO Half Full Interrupt Flag | TFNRFNIF x Receive FIFO Not Empty Interrupt Flag | BDIAG1: bdiag1(0x03c)=0x00000100 | DLCMM DLC Mismatch | ESI ESI flag of a received CAN FD message was set | DCRCERR Data CRC Error | DSTUFERR Data Bit Stuffing Error | DFORMERR Data Format Error | DBIT1ERR Data BIT1 Error | DBIT0ERR Data BIT0 Error | TXBOERR Device went to bus-off (and auto-recovered) | NCRCERR CRC Error | NSTUFERR Bit Stuffing Error | NFORMERR Format Error | NACKERR Transmitted message was not acknowledged | NBIT1ERR Bit1 Error | NBIT0ERR Bit0 Error | EFMSGCNT = 0x0100 Error Free Message Counter | TX-0 Object: 0x03 (0xaf8) | id = 0x000002a5 | flags = 0x0001fe08 | SEQ_MCP2517FD = 0x00007f Sequence (MCP2517) | SEQ_MCP2518FD = 0x0000ff Sequence (MCP2518) | data = 00 03 72 e4 78 54 d6 ed The correct message counter is 0x100, the wrong one 0x120. That's 2x FIFO size. I'd like to know when the FIFO head is wrong for the first time, one that results in a data transfer where "old" FIFO contents is read. I haven't dumped any data yet. Marc -- Pengutronix e.K. | Marc Kleine-Budde | Embedded Linux | https://www.pengutronix.de | Vertretung West/Dortmund | Phone: +49-231-2826-924 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 | [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* RE: RE: AW: AW: mcp251xfd: Bad message receiption (1/2) 2023-01-10 18:39 ` Marc Kleine-Budde @ 2023-01-10 21:40 ` Thomas.Kopp 2023-01-10 21:50 ` Marc Kleine-Budde 0 siblings, 1 reply; 23+ messages in thread From: Thomas.Kopp @ 2023-01-10 21:40 UTC (permalink / raw) To: mkl; +Cc: Stefan.Althoefer, linux-can About Head not tail: Whoops, yes correct. > The correct message counter is 0x100, the wrong one 0x120. That's 2x > FIFO size. I'd like to know when the FIFO head is wrong for the first > time, one that results in a data transfer where "old" FIFO contents is > read. I haven't dumped any data yet. Not sure what you mean. Do you want the message counter at that point in time? Thomas ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: RE: RE: AW: AW: mcp251xfd: Bad message receiption (1/2) 2023-01-10 21:40 ` Thomas.Kopp @ 2023-01-10 21:50 ` Marc Kleine-Budde 2023-01-10 22:37 ` Marc Kleine-Budde 0 siblings, 1 reply; 23+ messages in thread From: Marc Kleine-Budde @ 2023-01-10 21:50 UTC (permalink / raw) To: Thomas.Kopp; +Cc: Stefan.Althoefer, linux-can [-- Attachment #1: Type: text/plain, Size: 3388 bytes --] On 10.01.2023 21:40:16, Thomas.Kopp@microchip.com wrote: > > The correct message counter is 0x100, the wrong one 0x120. That's 2x > > FIFO size. I'd like to know when the FIFO head is wrong for the first > > time, one that results in a data transfer where "old" FIFO contents is > > read. I haven't dumped any data yet. I got a chip-delta == 4 error. > Not sure what you mean. Do you want the message counter at that point > in time? I want to know the chip tail where the driver reads the first time old data. For example in this trace: | do_rxif_analyze: ct=0x07 ch=0x08 cd=0x01 rt=0x00000427 rh=0x00000428 rd=0x01 | TFERFFIF Receive FIFO Full Interrupt Flag | TFHRFHIF Receive FIFO Half Full Interrupt Flag | TFNRFNIF x Receive FIFO Not Empty Interrupt Flag | -------------------------------------------------------------------------------- | do_rxif_analyze: ct=0x08 ch=0x09 cd=0x01 rt=0x00000428 rh=0x00000429 rd=0x01 | TFERFFIF Receive FIFO Full Interrupt Flag | TFHRFHIF Receive FIFO Half Full Interrupt Flag | TFNRFNIF x Receive FIFO Not Empty Interrupt Flag | -------------------------------------------------------------------------------- | do_rxif_analyze: ct=0x09 ch=0x0a cd=0x01 rt=0x00000429 rh=0x0000042a rd=0x01 | TFERFFIF Receive FIFO Full Interrupt Flag | TFHRFHIF Receive FIFO Half Full Interrupt Flag | TFNRFNIF x Receive FIFO Not Empty Interrupt Flag chip delta is 1, so probably OK. | -------------------------------------------------------------------------------- | do_rxif_analyze: ct=0x0a ch=0x0e cd=0x04 rt=0x0000042a rh=0x0000042e rd=0x04 | TFERFFIF Receive FIFO Full Interrupt Flag | TFHRFHIF Receive FIFO Half Full Interrupt Flag | TFNRFNIF x Receive FIFO Not Empty Interrupt Flag chip delta is 4. Are all RX objects (0xa, 0xb, 0xc and 0xd) valid data? | -------------------------------------------------------------------------------- | do_rxif_analyze: ct=0x0e ch=0x0c cd=0x0e rt=0x0000042e rh=0x0000043c rd=0x0e | TFERFFIF Receive FIFO Full Interrupt Flag | TFHRFHIF x Receive FIFO Half Full Interrupt Flag | TFNRFNIF x Receive FIFO Not Empty Interrupt Flag Are the read RX objects are all valid data? | -------------------------------------------------------------------------------- | do_rxif_analyze: ct=0x0c ch=0x0c cd=0x10 rt=0x0000043c rh=0x0000044c rd=0x10 | TFERFFIF x Receive FIFO Full Interrupt Flag | TFHRFHIF x Receive FIFO Half Full Interrupt Flag | TFNRFNIF x Receive FIFO Not Empty Interrupt Flag We read the whole FIFO, is everything valid data? Probably not. Which RX FIFO object is the first "bad" one? | -------------------------------------------------------------------------------- Marc -- Pengutronix e.K. | Marc Kleine-Budde | Embedded Linux | https://www.pengutronix.de | Vertretung West/Dortmund | Phone: +49-231-2826-924 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 | [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: RE: RE: AW: AW: mcp251xfd: Bad message receiption (1/2) 2023-01-10 21:50 ` Marc Kleine-Budde @ 2023-01-10 22:37 ` Marc Kleine-Budde 2023-01-10 22:43 ` Marc Kleine-Budde 0 siblings, 1 reply; 23+ messages in thread From: Marc Kleine-Budde @ 2023-01-10 22:37 UTC (permalink / raw) To: Thomas.Kopp; +Cc: Stefan.Althoefer, linux-can [-- Attachment #1: Type: text/plain, Size: 787 bytes --] On 10.01.2023 22:50:33, Marc Kleine-Budde wrote: > On 10.01.2023 21:40:16, Thomas.Kopp@microchip.com wrote: > > > The correct message counter is 0x100, the wrong one 0x120. That's 2x > > > FIFO size. I'd like to know when the FIFO head is wrong for the first > > > time, one that results in a data transfer where "old" FIFO contents is > > > read. I haven't dumped any data yet. > > I got a chip-delta == 4 error. I have a proof of concept workaround implemented and I'll let it run over night. Marc -- Pengutronix e.K. | Marc Kleine-Budde | Embedded Linux | https://www.pengutronix.de | Vertretung West/Dortmund | Phone: +49-231-2826-924 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 | [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: RE: RE: AW: AW: mcp251xfd: Bad message receiption (1/2) 2023-01-10 22:37 ` Marc Kleine-Budde @ 2023-01-10 22:43 ` Marc Kleine-Budde 2023-01-11 8:17 ` Marc Kleine-Budde 0 siblings, 1 reply; 23+ messages in thread From: Marc Kleine-Budde @ 2023-01-10 22:43 UTC (permalink / raw) To: Thomas.Kopp; +Cc: Stefan.Althoefer, linux-can [-- Attachment #1: Type: text/plain, Size: 959 bytes --] On 10.01.2023 23:37:44, Marc Kleine-Budde wrote: > On 10.01.2023 22:50:33, Marc Kleine-Budde wrote: > > On 10.01.2023 21:40:16, Thomas.Kopp@microchip.com wrote: > > > > The correct message counter is 0x100, the wrong one 0x120. That's 2x > > > > FIFO size. I'd like to know when the FIFO head is wrong for the first > > > > time, one that results in a data transfer where "old" FIFO contents is > > > > read. I haven't dumped any data yet. > > > > I got a chip-delta == 4 error. > > I have a proof of concept workaround implemented and I'll let it run > over night. \o/ The workaround triggered and Stefan's test program continued without problems \o/ g'night, Marc -- Pengutronix e.K. | Marc Kleine-Budde | Embedded Linux | https://www.pengutronix.de | Vertretung West/Dortmund | Phone: +49-231-2826-924 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 | [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: RE: RE: AW: AW: mcp251xfd: Bad message receiption (1/2) 2023-01-10 22:43 ` Marc Kleine-Budde @ 2023-01-11 8:17 ` Marc Kleine-Budde 2023-01-11 11:54 ` Marc Kleine-Budde 0 siblings, 1 reply; 23+ messages in thread From: Marc Kleine-Budde @ 2023-01-11 8:17 UTC (permalink / raw) To: Thomas.Kopp; +Cc: Stefan.Althoefer, linux-can [-- Attachment #1: Type: text/plain, Size: 1098 bytes --] On 10.01.2023 23:43:02, Marc Kleine-Budde wrote: > On 10.01.2023 23:37:44, Marc Kleine-Budde wrote: > > On 10.01.2023 22:50:33, Marc Kleine-Budde wrote: > > > On 10.01.2023 21:40:16, Thomas.Kopp@microchip.com wrote: > > > > > The correct message counter is 0x100, the wrong one 0x120. That's 2x > > > > > FIFO size. I'd like to know when the FIFO head is wrong for the first > > > > > time, one that results in a data transfer where "old" FIFO contents is > > > > > read. I haven't dumped any data yet. > > > > > > I got a chip-delta == 4 error. > > > > I have a proof of concept workaround implemented and I'll let it run > > over night. > > \o/ The workaround triggered and Stefan's test program continued without > problems \o/ Still running. The workaround triggered more than 45x this night. Marc -- Pengutronix e.K. | Marc Kleine-Budde | Embedded Linux | https://www.pengutronix.de | Vertretung West/Dortmund | Phone: +49-231-2826-924 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 | [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: RE: RE: AW: AW: mcp251xfd: Bad message receiption (1/2) 2023-01-11 8:17 ` Marc Kleine-Budde @ 2023-01-11 11:54 ` Marc Kleine-Budde 0 siblings, 0 replies; 23+ messages in thread From: Marc Kleine-Budde @ 2023-01-11 11:54 UTC (permalink / raw) To: Thomas.Kopp; +Cc: Stefan.Althoefer, linux-can [-- Attachment #1: Type: text/plain, Size: 1950 bytes --] On 11.01.2023 09:17:25, Marc Kleine-Budde wrote: > On 10.01.2023 23:43:02, Marc Kleine-Budde wrote: > > On 10.01.2023 23:37:44, Marc Kleine-Budde wrote: > > > On 10.01.2023 22:50:33, Marc Kleine-Budde wrote: > > > > On 10.01.2023 21:40:16, Thomas.Kopp@microchip.com wrote: > > > > > > The correct message counter is 0x100, the wrong one 0x120. That's 2x > > > > > > FIFO size. I'd like to know when the FIFO head is wrong for the first > > > > > > time, one that results in a data transfer where "old" FIFO contents is > > > > > > read. I haven't dumped any data yet. > > > > > > > > I got a chip-delta == 4 error. > > > > > > I have a proof of concept workaround implemented and I'll let it run > > > over night. > > > > \o/ The workaround triggered and Stefan's test program continued without > > problems \o/ > > Still running. The workaround triggered more than 45x this night. I think the same problem occurs with the TEF, too. It's easier to detect, as the TEF has sequence numbers. The driver already implements a workaround. It limits the TEF head to TX head [1] and it refuses to handle TEF objects with wrong sequence numbers [2] and UINCs only the correctly handled ones. However it doesn't roll back the "wrong" internal TEF head. The problem occurs (on Linux) only during high TX load situations, so eventually there will the enough finished TX frames and the internal TEF head will be correct again. regards, Marc [1] https://elixir.bootlin.com/linux/v6.1.4/source/drivers/net/can/spi/mcp251xfd/mcp251xfd-tef.c#L141 [2] https://elixir.bootlin.com/linux/v6.1.4/source/drivers/net/can/spi/mcp251xfd/mcp251xfd-tef.c#L106 -- Pengutronix e.K. | Marc Kleine-Budde | Embedded Linux | https://www.pengutronix.de | Vertretung West/Dortmund | Phone: +49-231-2826-924 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 | [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* AW: AW: AW: mcp251xfd: Bad message receiption (2/2) 2023-01-04 15:55 ` Stefan Althöfer 2023-01-09 7:10 ` AW: AW: AW: mcp251xfd: Bad message receiption (1/2) Stefan Althöfer @ 2023-01-09 7:10 ` Stefan Althöfer 1 sibling, 0 replies; 23+ messages in thread From: Stefan Althöfer @ 2023-01-09 7:10 UTC (permalink / raw) To: Marc Kleine-Budde; +Cc: Thomas.Kopp, linux-can [-- Attachment #1: Type: text/plain, Size: 34 bytes --] Remaining Attachment --Stefan [-- Attachment #2: can_mcp2518fd_fail.zip --] [-- Type: application/x-zip-compressed, Size: 65740 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: mcp251xfd: Bad message receiption 2022-12-22 7:44 mcp251xfd: Bad message receiption Stefan Althöfer 2022-12-22 9:07 ` Thomas.Kopp @ 2022-12-22 14:51 ` Marc Kleine-Budde 1 sibling, 0 replies; 23+ messages in thread From: Marc Kleine-Budde @ 2022-12-22 14:51 UTC (permalink / raw) To: Stefan Althöfer; +Cc: linux-can [-- Attachment #1: Type: text/plain, Size: 3138 bytes --] On 22.12.2022 07:44:59, Stefan Althöfer wrote: > I'm using the MCP2518FD CAN controller with Rasperry PI 4 and are > suffering from wrong data reception. > > My hardware is a Model 4 B with MCP2581 Click board(s) attached with a > prototype (hand-made) interface PCB, as well as baseboards for a CM4 > with a single MCP2518 on it. The error occurs on both of them. Do you get the same error if the test is only running on a single mcp2518fd? Do you get the same error if only a single mcp2518fd is attached to the raspi? > I have reduced my test case to a simple single thread self-receipt test: > * TX two messages > * Wait for RX and send out a new message on every receipt > * TX for messages in total Can you share the config/scripts/... to set up the interfaces? Can you share the test code? > Refer to the attached PDF for some error cases. The text in the PDF is rendered down to paths only. So not select- nor searchable, please send plain text files for further logs. > Last send frames are at the top of the logs. You can see that wrong > messages appear in the RX queue, which have been successfully > transmitted in previous test loop. The data that is actually sent out > is correct however (checked with an external logger for some cases). Looking at the first page, there are 2 identical lines marked in red. The next page show 4 red lines. So under certain load situations you see that old messages are received twice. If I remember correctly I heard something similar from someone else before. It occurred with a modified driver. For various reasons that has never debugged. I think I've fixed the root cause of the problem during the v5.18 development cycle. > The error appeared very often (every 1..10 minutes) when I was using > kernel raspberrypi 5.15.32-v7l+ and reduced to once in several hours > when I switched to raspberrypi 6.0.12-v7l+. BTW I modified the > loopback to external loopback To check for correct transmission. No > other changes to the driver as taken from the Raspbian repository. Can you share your kernel, including the modifications? > I see infrequent mcp251xfd CRC read errors. I think those are due to > the 2518 SPI errata. However they don't occur at the time when the > wrong messages are received (refer to the PDF). > > Questions: > - Is this a known problem? No - it should not happen on any mainline kernel. > - What did change from 5.15.32 to 6.0.12 to reduce the error rate? > Maybe this is just because of a change in execution timing? Quite a lot: Next to probably unrelated changes (e.g. PLL support), the workarounds for CRC errors were updated, in v5.18 multiple RX FIFOs are used and IRQ coalescing support has been added. > - Any suggestion how I can step further in fixing this issue. regards, Marc -- Pengutronix e.K. | Marc Kleine-Budde | Embedded Linux | https://www.pengutronix.de | Vertretung West/Dortmund | Phone: +49-231-2826-924 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 | [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2023-01-11 11:59 UTC | newest] Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2022-12-22 7:44 mcp251xfd: Bad message receiption Stefan Althöfer 2022-12-22 9:07 ` Thomas.Kopp 2022-12-22 10:30 ` AW: " Stefan Althöfer 2022-12-22 13:02 ` Marc Kleine-Budde 2022-12-23 8:18 ` AW: " Stefan Althöfer 2022-12-23 19:25 ` Marc Kleine-Budde 2022-12-24 15:57 ` Marc Kleine-Budde 2022-12-30 22:51 ` AW: " Stefan Althöfer 2023-01-01 10:43 ` Stefan Althöfer 2023-01-04 15:55 ` Stefan Althöfer 2023-01-09 7:10 ` AW: AW: AW: mcp251xfd: Bad message receiption (1/2) Stefan Althöfer 2023-01-10 12:50 ` Thomas.Kopp 2023-01-10 14:53 ` Marc Kleine-Budde 2023-01-10 17:08 ` Marc Kleine-Budde 2023-01-10 18:39 ` Marc Kleine-Budde 2023-01-10 21:40 ` Thomas.Kopp 2023-01-10 21:50 ` Marc Kleine-Budde 2023-01-10 22:37 ` Marc Kleine-Budde 2023-01-10 22:43 ` Marc Kleine-Budde 2023-01-11 8:17 ` Marc Kleine-Budde 2023-01-11 11:54 ` Marc Kleine-Budde 2023-01-09 7:10 ` AW: AW: AW: mcp251xfd: Bad message receiption (2/2) Stefan Althöfer 2022-12-22 14:51 ` mcp251xfd: Bad message receiption Marc Kleine-Budde
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).