linux-arm-kernel.lists.infradead.org archive mirror
* aarch64: ext4 metadata integrity regression in kernels >= 5.5 ?
@ 2020-07-12  9:22 Russell King - ARM Linux admin
  2020-07-12 10:07 ` Russell King - ARM Linux admin
  0 siblings, 1 reply; 3+ messages in thread
From: Russell King - ARM Linux admin @ 2020-07-12  9:22 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

Some will know that over the last six months, I've been seeing
problems on the LX2160A rev 1 with corrupted checksums on an EXT4
filesystem on NVMe.  I'm not certain exactly which kernels are
affected, but 5.1 seems to be fine, while 5.5 onwards (possibly 5.4,
maybe earlier) seem to be affected.

The symptom is that the kernel will run for some random amount of
time (between a few days and a few months) and then EXT4 will
complain with "iget: checksum invalid" on the root filesystem either
during a logrotate or a mandb rebuild.

Upon investigation with debugfs and hexdump, it appeared that a single
EXT4 inode in one sector contained an invalid 32-bit checksum.  EXT4
splits the 32-bit checksum into two 16-bit halves and stores them in
separate locations in the inode; consequently, any read or update of
the checksum requires two separate reads or writes.
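
To make the split concrete, here is a minimal Python sketch of how the
two halves would be reassembled from a raw on-disk inode.  The offsets
(0x7C for the low half, 0x80 for i_extra_isize, 0x82 for the high half)
are taken from the documented layout of struct ext4_inode and are an
assumption not stated in this thread; the function names and the
synthetic buffer are purely illustrative.

```python
import struct

GOOD_OLD_INODE_SIZE = 128   # fixed part of every ext2/3/4 inode
OFF_CHECKSUM_LO = 0x7C      # osd2.linux2.l_i_checksum_lo (little-endian u16)
OFF_EXTRA_ISIZE = 0x80      # i_extra_isize (little-endian u16)
OFF_CHECKSUM_HI = 0x82      # i_checksum_hi (little-endian u16)


def inode_checksum(raw: bytes) -> int:
    """Reassemble the stored inode checksum from its two 16-bit halves."""
    (lo,) = struct.unpack_from("<H", raw, OFF_CHECKSUM_LO)
    csum = lo
    if len(raw) > GOOD_OLD_INODE_SIZE:
        (extra,) = struct.unpack_from("<H", raw, OFF_EXTRA_ISIZE)
        # The high half only exists if the extra area is large enough
        # to contain i_checksum_hi (bytes 0x82..0x83).
        if extra >= 4:
            (hi,) = struct.unpack_from("<H", raw, OFF_CHECKSUM_HI)
            csum |= hi << 16
    return csum


def make_inode(csum: int, size: int = 256, extra: int = 32) -> bytes:
    """Build a synthetic, otherwise zeroed inode carrying `csum` (demo only)."""
    buf = bytearray(size)
    struct.pack_into("<H", buf, OFF_CHECKSUM_LO, csum & 0xFFFF)
    if size > GOOD_OLD_INODE_SIZE:
        struct.pack_into("<H", buf, OFF_EXTRA_ISIZE, extra)
        struct.pack_into("<H", buf, OFF_CHECKSUM_HI, csum >> 16)
    return bytes(buf)


if __name__ == "__main__":
    print(hex(inode_checksum(make_inode(0x1234ABCD))))
```

Because the halves are separated by other fields, a reader that runs
between the two stores of an update could see one new half and one old
half.  Whether anything like that happens here is exactly what is
unknown.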

The problem initially seemed to correlate with powering the platform
down, and it was suggested that the NVMe was at fault.  However, a
recent case disproved that theory: the problem went away after using
"hdparm -f" on the drive, which points at the in-memory copy rather
than the data on the drive.  e2fsck then found no errors on the
filesystem, and I could remount the filesystem read/write.  "hdparm -f"
syncs the device and flushes the kernel's buffer cache for it, which
hdparm also does when you use "hdparm -t" to measure disk performance.

My next question was whether it was being caused by PCIe ordering
issues.  I've since upgraded the machine to a LX2160A rev 2, which has
yet to show any symptoms of this.

However, the reason for this email is a troubling development with this
problem:

[7478798.720368] EXT4-fs error (device mmcblk0p1): ext4_lookup:1707: inode #157096: comm mandb: iget: checksum invalid
[7478798.729925] Aborting journal on device mmcblk0p1-8.
[7478798.734070] EXT4-fs (mmcblk0p1): Remounting filesystem read-only
[7478798.734589] EXT4-fs error (device mmcblk0p1): ext4_journal_check_start:84: Detected aborted journal

Running "e2fsck -n" on the system without having done anything gives:

Inode 13755 passes checks, but checksum does not match inode.  Fix? no
Inode 157096 passes checks, but checksum does not match inode.  Fix? no

amongst other errors, which are expected for a filesystem that is
normally "in-use".  Using "hdparm -f" does not make these errors go
away.

The offending inodes found by e2fsck correspond to:
  157096  /usr/share/man/nl/man1/apt-transport-mirror.1.gz
  13755   /lib/firmware/rtl_bt/rtl8723a_fw.bin

However, just like all the other instances, these would not have changed
recently except for atime updates.

There are a couple of important differences here:
- It is an Armada 8040 system - Clearfog GT-8K running a 5.6 kernel,
  rather than the LX2160A.
- Its rootfs is on eMMC, not NVMe.

That seems to rule out the NVMe being a cause of the problem, and any
PCIe issues of the LX2160A rev 1.

Another data point is that I'm also running an Armada 8040 system as a
VM host, which has over a year uptime, so is on an older kernel (5.1).
This uses EXT4 for its rootfs as well, but is on SATA SSD, and has not
shown any issues.  The VMs it runs are a later kernel (5.6) also with
EXT4, and have yet to display any symptoms.

The similarities are that the failing systems run the same or a
similar kernel binary (I've been running the same kernel config on
both), and both are Cortex-A72, though slightly different revisions.

So, it's starting to feel like an aarch64 problem, potentially a
locking or ordering issue.  Due to how rare this issue is,
investigating it is likely very difficult.  However, it seems to be
very real, as the symptoms have now been observed on two rather
different aarch64 platforms.

Due to the amount of time required to test, it is very difficult to do
any kind of bisection, or to test alternative kernels - it would take
months of runtime for a single test.

I'm chucking this out there so that if anyone else is seeing this
behaviour, they can shout and maybe confirm what I'm seeing.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


* Re: aarch64: ext4 metadata integrity regression in kernels >= 5.5 ?
  2020-07-12  9:22 aarch64: ext4 metadata integrity regression in kernels >= 5.5 ? Russell King - ARM Linux admin
@ 2020-07-12 10:07 ` Russell King - ARM Linux admin
  2020-10-27 19:51   ` Russell King - ARM Linux admin
  0 siblings, 1 reply; 3+ messages in thread
From: Russell King - ARM Linux admin @ 2020-07-12 10:07 UTC (permalink / raw)
  To: linux-arm-kernel

A bit more information:

Inode 157096 is /usr/share/man/nl/man1/apt-transport-mirror.1.gz:

--- bad
+++ fixed
 debugfs:  stat <157096>
 Inode: 157096   Type: regular    Mode:  0644   Flags: 0x80000
 Generation: 3717235945    Version: 0x00000000:00000001
 User:     0   Group:     0   Project:     0   Size: 3811
 File ACL: 0
 Links: 1   Blockcount: 8
 Fragment:  Address: 0    Number: 0    Size: 0
  ctime: 0x5ebcd62f:ba34bf1c -- Thu May 14 06:25:03 2020
  atime: 0x5ebcd63b:a2906fa0 -- Thu May 14 06:25:15 2020
  mtime: 0x5eba730a:00000000 -- Tue May 12 10:57:30 2020
 crtime: 0x5ebcd62f:a25cccf4 -- Thu May 14 06:25:03 2020
 Size of extra inode fields: 32
-Inode checksum: 0x13fd5c3c
+Inode checksum: 0x600eba80
 EXTENTS:
 (0):1173965

Note that mandb is set to run daily, so one must assume that the
inode checksum was fine the previous day.  Note that the file itself
is fine - it passes gzip's integrity checks, and the contents are
correct:

# zcat /usr/share/man/nl/man1/apt-transport-mirror.1.gz >/dev/null

For the other inode, 13755, /lib/firmware/rtl_bt/rtl8723a_fw.bin:

--- bad
+++ fixed
 debugfs:  stat <13755>
 Inode: 13755   Type: regular    Mode:  0644   Flags: 0x80000
 Generation: 2326028864    Version: 0x00000000:00000001
 User:     0   Group:     0   Project:     0   Size: 24548
 File ACL: 0
 Links: 1   Blockcount: 48
 Fragment:  Address: 0    Number: 0    Size: 0
  ctime: 0x5e88ffc5:b9a541e4 -- Sat Apr  4 22:44:37 2020
  atime: 0x5e88ffc4:00000000 -- Sat Apr  4 22:44:36 2020
  mtime: 0x5d5f3bb0:00000000 -- Fri Aug 23 02:04:48 2019
 crtime: 0x5e88ffc5:51b03564 -- Sat Apr  4 22:44:37 2020
 Size of extra inode fields: 32
-Inode checksum: 0x4d9c9f81
+Inode checksum: 0x487c2bf3
 EXTENTS:
 (0-5):835670-835675

In both cases, the times suggest that no change has been made to
either inode recently.
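
For anyone wanting to check the raw values rather than the formatted
dates, here is a small Python sketch that decodes the "seconds:extra"
hex pairs printed by debugfs.  It assumes the usual ext4 encoding of
the extra field (low two bits extend the seconds, upper 30 bits are
nanoseconds) and treats the seconds as unsigned, which is fine for
dates after 1970; the function name is made up for illustration.

```python
from datetime import datetime, timezone


def decode_ext4_time(field: str):
    """Decode a debugfs 'seconds:extra' hex pair into (UTC datetime, ns)."""
    sec_hex, extra_hex = field.split(":")
    seconds = int(sec_hex, 16)      # int() accepts the leading "0x"
    extra = int(extra_hex, 16)
    epoch_bits = extra & 0x3        # assumed: extends the 32-bit seconds
    nanoseconds = extra >> 2        # assumed: upper 30 bits
    seconds += epoch_bits << 32
    return datetime.fromtimestamp(seconds, tz=timezone.utc), nanoseconds


# Values copied from the two debugfs dumps above.
STAMPS = {
    "157096 ctime": "0x5ebcd62f:ba34bf1c",
    "157096 atime": "0x5ebcd63b:a2906fa0",
    "157096 mtime": "0x5eba730a:00000000",
    "13755 ctime": "0x5e88ffc5:b9a541e4",
    "13755 atime": "0x5e88ffc4:00000000",
    "13755 mtime": "0x5d5f3bb0:00000000",
}

if __name__ == "__main__":
    for name, field in STAMPS.items():
        when, ns = decode_ext4_time(field)
        print(f"{name}: {when.isoformat()} +{ns}ns")
```

The newest of these is the atime on inode 157096, which decodes to
2020-05-14 05:25:15 UTC.  The dates debugfs printed are one hour ahead
of that, so they are presumably in local time.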

It would have been great to know the state of these inodes prior to the
checksum not matching, but alas, time travel has yet to be invented!
Maybe if/when it happens again on the Armada 8040, I'll have an ext4fs
image to compare against - and hopefully identify exactly what has
changed.
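
Until such an image exists, one thing the four checksum values above do
allow is a check of whether each bad value shares a 16-bit half with
its fixed counterpart, as a half-completed update of the split field
might.  This is only a sketch in Python with made-up helper names; the
values are copied from the debugfs diffs.

```python
# (bad, fixed) checksum pairs from the debugfs diffs, keyed by inode number.
PAIRS = {
    157096: (0x13FD5C3C, 0x600EBA80),
    13755: (0x4D9C9F81, 0x487C2BF3),
}


def halves(csum: int):
    """Split a 32-bit checksum into its (high, low) 16-bit halves."""
    return csum >> 16, csum & 0xFFFF


def differing_halves(bad: int, fixed: int):
    """Return which halves ("hi", "lo") differ between two checksums."""
    bad_hi, bad_lo = halves(bad)
    fixed_hi, fixed_lo = halves(fixed)
    out = []
    if bad_hi != fixed_hi:
        out.append("hi")
    if bad_lo != fixed_lo:
        out.append("lo")
    return out


if __name__ == "__main__":
    for ino, (bad, fixed) in PAIRS.items():
        print(ino, differing_halves(bad, fixed))
```

For both inodes, both halves differ.  So the stored value is not simply
one stale half paired with one half of the value e2fsck computes for
the current contents.  That alone does not say where the mismatch came
from: the stored checksum could be valid for an earlier state of the
inode, or the inode body could have changed without the checksum being
updated.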


* Re: aarch64: ext4 metadata integrity regression in kernels >= 5.5 ?
  2020-07-12 10:07 ` Russell King - ARM Linux admin
@ 2020-10-27 19:51   ` Russell King - ARM Linux admin
  0 siblings, 0 replies; 3+ messages in thread
From: Russell King - ARM Linux admin @ 2020-10-27 19:51 UTC (permalink / raw)
  To: linux-arm-kernel

The problems persisted until I added some additional debug to the ext4
code; since then, the Armada 8040 system has been up for 58 days
without incident.  This suggests that it is a subtle timing bug,
which is going to be nigh on impossible to debug.  Unfortunately, it
means that I just can't trust recent aarch64 kernels not to corrupt
my filesystems, and I certainly can't trust them to run any of my
critical systems.
