From: Juhyung Park <qkrwngud825@gmail.com>
To: Gao Xiang <xiang@kernel.org>, linux-erofs@lists.ozlabs.org
Cc: linux-f2fs-devel@lists.sourceforge.net,
	linux-crypto@vger.kernel.org,
	 Yann Collet <yann.collet.73@gmail.com>
Subject: Weird EROFS data corruption
Date: Mon, 4 Dec 2023 01:22:23 +0900	[thread overview]
Message-ID: <CAD14+f2AVKf8Fa2OO1aAUdDNTDsVzzR6ctU_oJSmTyd6zSYR2Q@mail.gmail.com> (raw)

(Cc'ing f2fs and crypto as I've noticed something similar with f2fs a
while ago, which may mean that this is not specific to EROFS:
https://lore.kernel.org/all/CAD14+f2nBZtLfLC6CwNjgCOuRRRjwzttp3D3iK4Of+1EEjK+cw@mail.gmail.com/
)

Hi.

I'm encountering a very weird EROFS data corruption.

I noticed that when I build an EROFS image for AOSP development, the
device would occasionally fail to boot from a particular build.
After inspecting the log, I noticed that a file had been corrupted.

After adding a hash check during the build flow, I noticed that EROFS
would randomly read data wrong.

I now have a reliable method of reproducing the issue, but here's the
funny/weird part: it only happens on my laptop (i7-1185G7). It does
not happen on my 128-core build-farm machine (Threadripper 3990X).

I first suspected a hardware issue, but:
a. The laptop had its motherboard replaced recently (due to a failing
physical Type-C port).
b. The laptop passes memory test (memtest86).
c. This happens on all kernel versions from v5.4 to the latest v6.6
including my personal custom builds and Canonical's official Ubuntu
kernels.
d. This happens on different host SSDs and file-system combinations.
e. This only happens with LZ4. LZ4HC doesn't trigger the issue.
f. This only happens when the image is mounted natively by the kernel.
Mounting via FUSE with erofsfuse is fine.

This is how I'm reproducing the issue:

# mkfs.erofs -zlz4 -T0 --ignore-mtime tmp.img /mnt/lib64/
mkfs.erofs 1.7
Build completed.
------
Filesystem UUID: 3a7e1f90-5450-40f9-92a2-945bacdb51c3
Filesystem total blocks: 53075 (of 4096-byte blocks)
Filesystem total inodes: 973
Filesystem total metadata blocks: 73
Filesystem total deduplicated bytes (of source files): 0
# mount tmp.img /mnt
# for i in {1..30}; do echo 3 > /proc/sys/vm/drop_caches; \
    find /mnt -type f -exec xxh64sum {} + | sort -k2 | xxh64sum -; done
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
293a8e7de2a53019  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
293a8e7de2a53019  stdin
293a8e7de2a53019  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin

As you can see, I usually get 0b40f1abfbb6e9a8, but sometimes 293a8e7de2a53019.
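To make the pass/fail check above easier to script, the repeated-hash loop can be sketched as a small Python helper. This is hypothetical and not part of my original flow: it uses SHA-256 from the standard library instead of xxh64 so it runs anywhere, and it omits the drop_caches step between rounds, which requires root and is what actually exposes the bug.

```python
import hashlib


def hash_file(path: str) -> str:
    """Return the hex digest of a file's contents (SHA-256 here for
    portability; the report uses xxh64, but any stable hash works)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def check_stability(path: str, rounds: int = 30) -> set:
    """Hash the same file repeatedly and collect the distinct digests.

    On a healthy filesystem this set has exactly one element; more than
    one element means reads of the same file are not reproducible.
    Note: to reproduce the EROFS issue, the page cache must be dropped
    between rounds (echo 3 > /proc/sys/vm/drop_caches, as root); that
    step is deliberately omitted in this portable sketch.
    """
    return {hash_file(path) for _ in range(rounds)}
```

With cache-dropping added (as root), a result set larger than one element corresponds to the mixed hashes seen in the loop output above.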

This is what I see when I manually inspect the failing file:

# echo 3 > /proc/sys/vm/drop_caches; xxh64sum
/mnt/vendor.qti.hardware.mwqemadapter@1.0.so
dc96f35f015a0e5d  /mnt/vendor.qti.hardware.mwqemadapter@1.0.so
# xxd < /mnt/vendor.qti.hardware.mwqemadapter@1.0.so > /tmp/1
[ several more attempts until I get a different hash... ]
# echo 3 > /proc/sys/vm/drop_caches; xxh64sum
/mnt/vendor.qti.hardware.mwqemadapter@1.0.so
1cfe5d69c28fff6c  /mnt/vendor.qti.hardware.mwqemadapter@1.0.so
# xxd < /mnt/vendor.qti.hardware.mwqemadapter@1.0.so > /tmp/2
# diff /tmp/[12]
3741c3741
< 0000e9c0: f40e 0000 b46b 0000 ac5c 0000 140e 0000  .....k...\......
---
> 0000e9c0: 445a 0000 e40d 0000 ac5c 0000 140e 0000  DZ.......\......
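The corruption is confined to a single 16-byte line at offset 0xe9c0. To pinpoint exactly where two reads of the same file diverge without going through xxd and diff, a trivial comparison helper (hypothetical, for illustration only) would do:

```python
def first_difference(a: bytes, b: bytes):
    """Return the offset of the first byte where the two buffers differ,
    or None if they are identical (same bytes and same length)."""
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] != b[i]:
            return i
    # Identical up to the shorter length; differ only if lengths differ.
    return None if len(a) == len(b) else n


# The two 16-byte runs from the diff at offset 0000e9c0:
good = bytes.fromhex("f40e0000b46b0000ac5c0000140e0000")
bad = bytes.fromhex("445a0000e40d0000ac5c0000140e0000")
assert first_difference(good, bad) == 0  # they diverge at the line's first byte
```

Run over the full dumps, this would report the absolute offset of the first corrupted byte directly.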

This could still very well be a hardware issue on my end, but I
strongly suspect something is wrong in the kernel code that happens to
trigger only on my hardware configuration.

I've uploaded the generated image here:
https://arter97.com/.erofs/
but I'm not sure it'll be reproducible on other machines.

I've also tried updating the LZ4 code in lib/lz4 to the latest v1.9.4
release and the latest dev trunk (4032c8c787e6). I managed to get it
building with the Linux kernel, but the corruption still happens.

Let me know if there's anything I can do to help narrow down the culprit.

Thanks,
