Hi!

I'm seeing a weird problem, and I'd like some help with further things to try
in order to track down what's going on. I have bisected the issue to

f9aa460672c9 ("driver core: Refactor fw_devlink feature")

The symptoms are that I get (seemingly) random memory corruption when
processing large amounts of data (compared to the system size). I have two
known reproducers, but I'm sure there are more if I keep digging.

One is to do this:

$ dd if=/dev/urandom of=testfile bs=1024 count=40000
40000+0 records in
40000+0 records out
40960000 bytes (41 MB, 39 MiB) copied, 19.7759 s, 2.1 MB/s
$ for i in 1 2 3 4; do cat testfile | sha256sum; done
d8c85f816e08baa5ad27050bf0413e11a09f325fb0a8843b7b2b45b9333ab542  -
f223c1cbb6dbecb02d1741e7991dc98cd8d5b40ffee05bb32dc2c15eb73d6b1f  -
d6f3e7f3d325c67e83a6104934dd8a7c891ebfd9a2cf59633dbe97fb2cbb9c81  -
cf8ada47e7e2fee299314440b225ba83fca3cef1f6286adc160a5d4f207caccd  -

It is harder to tickle the problem if I redirect the testfile into sha256sum
without involving cat, or if I give the file as an argument to sha256sum.

I can also get things to behave better by getting rid of a bunch of USB
interrupts with the following:

$ echo 100 > /sys/bus/usb-serial/devices/ttyUSB0/latency_timer
$ echo 100 > /sys/bus/usb-serial/devices/ttyUSB1/latency_timer
$ echo 100 > /sys/bus/usb-serial/devices/ttyUSB2/latency_timer
$ echo 100 > /sys/bus/usb-serial/devices/ttyUSB3/latency_timer

With the lower interrupt pressure I get this:

$ for i in 1 2 3 4; do cat testfile | sha256sum; done
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d  -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d  -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d  -
4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d  -

Nice. However, I need the latency to be lower than the default 16 ms; 3 ms
could perhaps work in theory, but preferably 1 ms, so the above 100 ms is far
off. The initial hash run was with the latency set to 1 ms, which makes it
easy to trigger the issue. The latency_timer setting belongs to this driver:

drivers/usb/serial/ftdi_sio.c

And besides, that does not help with the other reproducer, namely copying
that same random testfile with scp to a working system...

$ scp testfile peda@xyzzy:testfile1
testfile                                     100%   39MB   2.0MB/s   00:19
$ scp testfile peda@xyzzy:testfile2
testfile                                     100%   39MB   2.1MB/s   00:18
$ scp testfile peda@xyzzy:testfile3
testfile                                     100%   39MB   2.1MB/s   00:18
$ scp testfile peda@xyzzy:testfile4
testfile                                     100%   39MB   2.1MB/s   00:19

...and then running sha256sum on the xyzzy host instead:

$ sha256sum testfile?
39dc3a7d05483ae7a2c64c5ed2e8e6108287bf4ddf124a2f0c1a9d0221f9ac66  testfile1
9597ef542e7cce879872a027d9ec591feb5fc766aeaec47d58eff6e8c6ab3206  testfile2
c6104a700b1d6f13eb1de84b5a91a1846a3e1576e052d51a664d2e2711a3869d  testfile3
60b9c240cb331bad530c3c1d766f50d53a24e01831bfc04e48f329b738521310  testfile4
$ sha256sum testfile?
39dc3a7d05483ae7a2c64c5ed2e8e6108287bf4ddf124a2f0c1a9d0221f9ac66  testfile1
9597ef542e7cce879872a027d9ec591feb5fc766aeaec47d58eff6e8c6ab3206  testfile2
c6104a700b1d6f13eb1de84b5a91a1846a3e1576e052d51a664d2e2711a3869d  testfile3
60b9c240cb331bad530c3c1d766f50d53a24e01831bfc04e48f329b738521310  testfile4

Same output every time. Of course, xyzzy is a working system...
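For reference, the .hex files used in the next step are just plain hexdump
dumps of the four copies, created along the lines of:

$ for f in testfile?; do hexdump -C "$f" > "$f.hex"; done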
Converting these files to hex (hexdump -C) and diffing yields this:

$ diff -u0 testfile1.hex testfile2.hex
--- testfile1.hex 2022-03-02 23:56:38.273149516 +0100
+++ testfile2.hex 2022-03-03 00:00:57.912747033 +0100
@@ -8658,2 +8658,2 @@
-00021d10 08 2a dd c6 c8 0f 0d e2 4c 1e 46 21 f9 89 a2 54 |.*......L.F!...T|
-00021d20 23 8c 4f f1 46 f1 61 05 ee f2 d2 ee 56 79 4f 28 |#.O.F.a.....VyO(|
+00021d10 7b c8 d2 0b f4 ca 5f ba 61 b3 93 04 59 8f ed bf |{....._.a...Y...|
+00021d20 2a f8 fb 0c ad 0e 23 2a 3e cf d3 10 02 ef 04 b9 |*.....#*>.......|
@@ -20592,2 +20592,2 @@
-000506f0 1f 6c ca 6b a6 2a 39 a6 1f bd b0 67 5b 22 1a dd |.l.k.*9....g["..|
-00050700 8b 6d 86 7c 87 37 ee a8 46 4d e5 79 0e 3e 96 e6 |.m.|.7..FM.y.>..|
+000506f0 ad e6 d5 65 e6 dc c1 a3 e2 ba c9 e2 61 39 5f 5f |...e........a9__|
+00050700 bf eb 8e 5c 08 f1 f2 89 3c 57 c5 07 b9 f4 91 fc |...\....<W......|
+00708da0 d3 b4 96 d6 40 8d 79 67 69 68 fd 10 b4 15 82 e6 |....@.ygih......|
+00708db0 5f f4 10 92 ae 39 9d 92 42 88 44 3b be 35 38 33 |_....9..B.D;.583|
@@ -902788,2 +902788,2 @@
-00dc6830 f2 41 23 1b ec 54 d5 fe f0 33 51 f7 d2 fc bf bd |.A#..T...3Q.....|
-00dc6840 e5 1f 58 df 24 2f e3 dc 65 87 b2 27 12 86 d1 9a |..X.$/..e..'....|
+00dc6830 44 82 94 b5 c9 26 08 42 bd 89 e1 96 41 66 8a b5 |D....&.B....Af..|
+00dc6840 a5 34 46 5e fd 1b c1 73 86 33 24 fd 4d e1 e1 68 |.4F^...s.3$.M..h|
@@ -931900,2 +931900,2 @@
-00e383b0 ee 64 c5 6f 38 44 5b 31 41 e1 2c 64 49 d5 f8 ad |.d.o8D[1A.,dI...|
-00e383c0 fb 85 52 4f 00 1f 80 7a f3 de ee 8e db ac d5 bb |..RO...z........|
+00e383b0 4b 4d 29 a1 0a 99 8f f7 32 71 8c de 23 ca a0 f1 |KM).....2q..#...|
+00e383c0 e2 af e3 c4 a0 95 d3 1c ed 58 c4 c5 30 da 56 b9 |.........X..0.V.|
@@ -1170109,2 +1170109,2 @@
-011dabc0 6a 7c 0c 3c 86 1a b6 48 50 d7 98 68 0c 01 e3 1c |j|.<...HP..h....|
-011dabd0 a3 a8 b0 f2 62 21 86 b9 d1 52 9d 74 9e 26 42 51 |....b!...R.t.&BQ|
+011dabc0 5b 1a 9e 23 ae 58 42 68 83 58 df d6 c1 57 6b b0 |[..#.XBh.X...Wk.|
+011dabd0 ec d5 50 8b 76 5e 96 b4 49 21 f7 e4 b7 8f a3 45 |..P.v^..I!.....E|
@@ -1880164,2 +1880164,2 @@
-01cb0630 1c 74 74 16 75 b4 de f7 ce 4b 5e 4d 97 d6 36 d4 |.tt.u....K^M..6.|
-01cb0640 44 d9 fd 69 c5 d0 f0 a6 c6 44 26 53 7f 91 f3 62 |D..i.....D&S...b|
+01cb0630 73 bc 40 ce f8 9d 99 91 1b 14 8b a8 52 2a 7b 39 |s.@.........R*{9|
+01cb0640 6b ff f5 c5 02 b9 ab c2 c2 08 5e e7 3a 5e 69 c4 |k.........^.:^i.|

Grepping (some of the above) for duplicates yields this:

$ egrep "0 (08 2a dd|23 8c 4f|7b c8 d2|2a f8 fb)" testfile1.hex
00020d40 7b c8 d2 0b f4 ca 5f ba 61 b3 93 04 59 8f ed bf |{....._.a...Y...|
00020d50 2a f8 fb 0c ad 0e 23 2a 3e cf d3 10 02 ef 04 b9 |*.....#*>.......|
00021d10 08 2a dd c6 c8 0f 0d e2 4c 1e 46 21 f9 89 a2 54 |.*......L.F!...T|
00021d20 23 8c 4f f1 46 f1 61 05 ee f2 d2 ee 56 79 4f 28 |#.O.F.a.....VyO(|
$ egrep "0 (08 2a dd|23 8c 4f|7b c8 d2|2a f8 fb)" testfile2.hex
00020d40 7b c8 d2 0b f4 ca 5f ba 61 b3 93 04 59 8f ed bf |{....._.a...Y...|
00020d50 2a f8 fb 0c ad 0e 23 2a 3e cf d3 10 02 ef 04 b9 |*.....#*>.......|
00021d10 7b c8 d2 0b f4 ca 5f ba 61 b3 93 04 59 8f ed bf |{....._.a...Y...|*
00021d20 2a f8 fb 0c ad 0e 23 2a 3e cf d3 10 02 ef 04 b9 |*.....#*>.......|*
$ egrep "0 (1f 6c ca|8b 6d 86|ad e6 d5|bf eb 8e)" testfile1.hex
0004f6f0 1f 6c ca 6b a6 2a 39 a6 1f bd b0 67 5b 22 1a dd |.l.k.*9....g["..|
0004f700 8b 6d 86 7c 87 37 ee a8 46 4d e5 79 0e 3e 96 e6 |.m.|.7..FM.y.>..|
000506f0 1f 6c ca 6b a6 2a 39 a6 1f bd b0 67 5b 22 1a dd |.l.k.*9....g["..|*
00050700 8b 6d 86 7c 87 37 ee a8 46 4d e5 79 0e 3e 96 e6 |.m.|.7..FM.y.>..|*
$ egrep "0 (1f 6c ca|8b 6d 86|ad e6 d5|bf eb 8e)" testfile2.hex
0004f6f0 1f 6c ca 6b a6 2a 39 a6 1f bd b0 67 5b 22 1a dd |.l.k.*9....g["..|
0004f700 8b 6d 86 7c 87 37 ee a8 46 4d e5 79 0e 3e 96 e6 |.m.|.7..FM.y.>..|
000506f0 ad e6 d5 65 e6 dc c1 a3 e2 ba c9 e2 61 39 5f 5f |...e........a9__|
00050700 bf eb 8e 5c 08 f1 f2 89 3c 57 c5 07 b9 f4 91 fc |...\....<W......|
$ egrep "0 (0d 49 c3|94 4d 5b|d3 b4 96|5f f4 10 92)" testfile2.hex
00707dd0 d3 b4 96 d6 40 8d 79 67 69 68 fd 10 b4 15 82 e6 |....@.ygih......|
00707de0 5f f4 10 92 ae 39 9d 92 42 88 44 3b be 35 38 33 |_....9..B.D;.583|
00708da0 d3 b4 96 d6 40 8d 79 67 69 68 fd 10 b4 15 82 e6 |....@.ygih......|*
00708db0 5f f4 10 92 ae 39 9d 92 42 88 44 3b be 35 38 33 |_....9..B.D;.583|*

I.e. testfile1 is (probably) corrupted at 000506f0..70f, while testfile2 is
(probably) corrupted at 00021d10..2f and 00708da0..bf (corrupted lines marked
with hand-made asterisks above).

If I keep grepping like this, the pattern is similar both within these files
and within testfile3 and testfile4, i.e. corruptions in 32-byte blocks at
(seemingly) random positions in the files. The corruption is always
16-byte-aligned, and the bad data seems to be a copy from exactly one page up
(i.e. one page earlier) in the file.
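For instance, the bad block at 000506f0 in testfile1 can be cross-checked
byte for byte against the block exactly one page (0x1000 bytes) earlier with
something along these lines (a quick bash sketch, nothing more):

$ off=$((0x000506f0))
$ cmp <(dd if=testfile1 bs=1 skip=$((off - 0x1000)) count=32 2>/dev/null) \
      <(dd if=testfile1 bs=1 skip=$off count=32 2>/dev/null) \
  && echo "0x506f0 repeats the 32 bytes from one page earlier"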
As stated above, I have bisected the issue to patch

f9aa460672c9 ("driver core: Refactor fw_devlink feature")

which was added between v5.10-rc3 and v5.10-rc4. Every kernel I have tried
with that patch applied has exhibited the issue, and I have had no trouble
like this with any kernel without that patch. Apart from a whole bunch of
kernels prior to v5.10-rc3, that includes some later kernels with the patch
reverted (along with the dependent followup 2d09e6eb4a6f). The latest I have
tried is 5.11.22. Those two patches do not revert cleanly on 5.12 (and
later), so I have not tried anything beyond 5.11 with the patch reverted.

I fail to understand how that patch might cause this issue. I have compared
boot messages before and after the patch, and there is no (significant)
difference. Everything seems to happen in the same order with the same
result. But that comparison is of course limited to what is actually logged.

In some random attempt I tried to disable the D-cache bit, and that makes
everything very slow, but it also (seemingly) fixes the issue. That may of
course simply be due to the vastly different timings, though.

Some background: We have a "Linea" CPU module, with a design based on the
Atmel (now Microchip) SAMA5D31 evaluation board. This CPU module is used on
e.g. our TSE-850, for which there is a device tree in
arch/arm/boot/dts/at91-tse850-3.dts. It has a NAND flash for the rootfs and
64 MB of RAM. The 40 MB random testfile is thus big enough to cause page
cache churn.

We have used this module in thousands of delivered units (however, not that
many TSE-850) and have never observed anything like this before. But that has
been with older kernels; 4.13 and 4.15 were what we were on until this recent
activity.

We are now developing a new product (preliminary device tree included), and
the trusty old CPU module was used again, with a fresh new kernel built for
it. I then started to notice this issue and have tried to include as much
relevant data as possible. If you need more data or would like me to test
something, please ask. I'm stumped.

Cheers,
Peter