From: Juhyung Park <qkrwngud825@gmail.com>
To: Gao Xiang <xiang@kernel.org>, linux-erofs@lists.ozlabs.org
Cc: linux-f2fs-devel@lists.sourceforge.net,
	linux-crypto@vger.kernel.org,
	 Yann Collet <yann.collet.73@gmail.com>
Subject: Weird EROFS data corruption
Date: Mon, 4 Dec 2023 01:22:23 +0900	[thread overview]
Message-ID: <CAD14+f2AVKf8Fa2OO1aAUdDNTDsVzzR6ctU_oJSmTyd6zSYR2Q@mail.gmail.com> (raw)

(Cc'ing f2fs and crypto as I've noticed something similar with f2fs a
while ago, which may mean that this is not specific to EROFS:
https://lore.kernel.org/all/CAD14+f2nBZtLfLC6CwNjgCOuRRRjwzttp3D3iK4Of+1EEjK+cw@mail.gmail.com/
)

Hi.

I'm encountering a very weird EROFS data corruption.

I noticed that when I build an EROFS image for AOSP development, the
device would occasionally fail to boot from a particular build.
After inspecting the log, I noticed that a file had been corrupted.

After adding a hash check during the build flow, I noticed that EROFS
would randomly read data wrong.

I now have a reliable method of reproducing the issue, but here's the
funny/weird part: it only happens on my laptop (i7-1185G7). It does
not happen on my 128-core build-farm machine (Threadripper 3990X).

I first suspected a hardware issue, but:
a. The laptop had its motherboard replaced recently (due to a failing
physical Type-C port).
b. The laptop passes memory test (memtest86).
c. This happens on all kernel versions from v5.4 to the latest v6.6
including my personal custom builds and Canonical's official Ubuntu
kernels.
d. This happens on different host SSDs and file-system combinations.
e. This only happens with LZ4. LZ4HC doesn't trigger the issue.
f. This only happens when the image is mounted natively by the kernel.
Mounting via FUSE with erofsfuse is fine.

This is how I'm reproducing the issue:

# mkfs.erofs -zlz4 -T0 --ignore-mtime tmp.img /mnt/lib64/
mkfs.erofs 1.7
Build completed.
------
Filesystem UUID: 3a7e1f90-5450-40f9-92a2-945bacdb51c3
Filesystem total blocks: 53075 (of 4096-byte blocks)
Filesystem total inodes: 973
Filesystem total metadata blocks: 73
Filesystem total deduplicated bytes (of source files): 0
# mount tmp.img /mnt
# for i in {1..30}; do echo 3 > /proc/sys/vm/drop_caches; \
    find /mnt -type f -exec xxh64sum {} + | sort -k2 | xxh64sum -; done
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
293a8e7de2a53019  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
293a8e7de2a53019  stdin
293a8e7de2a53019  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin
0b40f1abfbb6e9a8  stdin

As you can see, I usually get 0b40f1abfbb6e9a8, but sometimes 293a8e7de2a53019.
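To make the pass/fail check above easier to script, the repeated-hash loop can be sketched as a small Python helper. This is hypothetical and not part of my original flow: it uses SHA-256 from the standard library instead of xxh64 so it runs anywhere, and it omits the drop_caches step between rounds, which requires root and is what actually exposes the bug.

```python
import hashlib


def hash_file(path: str) -> str:
    """Return the hex digest of a file's contents (SHA-256 here for
    portability; the report uses xxh64, but any stable hash works)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def check_stability(path: str, rounds: int = 30) -> set:
    """Hash the same file repeatedly and collect the distinct digests.

    On a healthy filesystem this set has exactly one element; more than
    one element means reads of the same file are not reproducible.
    Note: to reproduce the EROFS issue, the page cache must be dropped
    between rounds (echo 3 > /proc/sys/vm/drop_caches, as root); that
    step is deliberately omitted in this portable sketch.
    """
    return {hash_file(path) for _ in range(rounds)}
```

With cache-dropping added (as root), a result set larger than one element corresponds to the mixed hashes seen in the loop output above.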

This is what I see when I manually inspect the failing file:

# echo 3 > /proc/sys/vm/drop_caches; xxh64sum
/mnt/vendor.qti.hardware.mwqemadapter@1.0.so
dc96f35f015a0e5d  /mnt/vendor.qti.hardware.mwqemadapter@1.0.so
# xxd < /mnt/vendor.qti.hardware.mwqemadapter@1.0.so > /tmp/1
[ several more attempts until I get a different hash... ]
# echo 3 > /proc/sys/vm/drop_caches; xxh64sum
/mnt/vendor.qti.hardware.mwqemadapter@1.0.so
1cfe5d69c28fff6c  /mnt/vendor.qti.hardware.mwqemadapter@1.0.so
# xxd < /mnt/vendor.qti.hardware.mwqemadapter@1.0.so > /tmp/2
# diff /tmp/[12]
3741c3741
< 0000e9c0: f40e 0000 b46b 0000 ac5c 0000 140e 0000  .....k...\......
---
> 0000e9c0: 445a 0000 e40d 0000 ac5c 0000 140e 0000  DZ.......\......
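The corruption is confined to a single 16-byte line at offset 0xe9c0. To pinpoint exactly where two reads of the same file diverge without going through xxd and diff, a trivial comparison helper (hypothetical, for illustration only) would do:

```python
def first_difference(a: bytes, b: bytes):
    """Return the offset of the first byte where the two buffers differ,
    or None if they are identical (same bytes and same length)."""
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] != b[i]:
            return i
    # Identical up to the shorter length; differ only if lengths differ.
    return None if len(a) == len(b) else n


# The two 16-byte runs from the diff at offset 0000e9c0:
good = bytes.fromhex("f40e0000b46b0000ac5c0000140e0000")
bad = bytes.fromhex("445a0000e40d0000ac5c0000140e0000")
assert first_difference(good, bad) == 0  # they diverge at the line's first byte
```

Run over the full dumps, this would report the absolute offset of the first corrupted byte directly.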

This could still very well be a hardware issue on my end, but I
strongly suspect something is wrong in the kernel code that happens to
trigger only on my hardware configuration.

I've uploaded the generated image here:
https://arter97.com/.erofs/
but I'm not sure it'll be reproducible on other machines.

I've also tried updating the LZ4 code in lib/lz4 to the latest v1.9.4
release and the latest dev trunk (4032c8c787e6). I managed to get it
building with the Linux kernel, but the corruption still happens.

Let me know if there's anything I can do to help narrow down the culprit.

Thanks,
