All of lore.kernel.org
 help / color / mirror / Atom feed
* btrfs crash on armv7
@ 2020-11-25 15:28 Joe Hermaszewski
  2020-11-26  6:15 ` Qu Wenruo
  0 siblings, 1 reply; 14+ messages in thread
From: Joe Hermaszewski @ 2020-11-25 15:28 UTC (permalink / raw)
  To: linux-btrfs

Hi,

I have a arm32 machine with four drives with a btrfs fs spanning then in RAID1.
The filesystem has started behaving badly recently and I'm writing to:

- Solicit advice on how best to get the system back to a stable state
- Report a potential bug

## What happened:

A couple of days ago I could no longer ssh into it, and on the serial
connection there were heaps of messages (and new ones appearing with great
frequency) along the lines of: `parent transid verify failed on blah... wanted
x got y`.

Although I don't have a record of the precise messages I do remember that there
was a difference of `15` between x and y.

I power-cycled system and started a scrub after it rebooted, this was
interrupted quite promptly by several more errors in btrfs, and the disk
remounted RO.

Every now and then in the kernel log I get messages like:

`parent transid verify failed on blah... wanted x got y`

## Important info

The dev stats are all zero.

Here are the outputs of some btrfs commands, dmesg and the kernel log from the
previous two boots: https://gist.github.com/b1beab134403c5047e2efbceb98985f9

The "cut here" portion of the kernel log is as follows

```
[  409.158097] ------------[ cut here ]------------
[  409.158205] WARNING: CPU: 1 PID: 217 at fs/btrfs/disk-io.c:531
btree_csum_one_bio+0x208/0x248 [btrfs]
[  409.158208] Modules linked in: cfg80211 rfkill 8021q ip6table_nat
iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6
nf_defrag_ipv4 ip6t_rpfilter ipt_rpfilter ip6table_raw iptable_raw
xt_pkttype nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG xt_tcpudp
ftdi_sio usbserial phy_generic uio_pdrv_genirq uio ip6table_filter
ip6_tables iptable_filter sch_fq_codel loop tun tap macvlan bridge stp
llc lm75 ip_tables x_tables autofs4 dm_mod dax btrfs libcrc32c xor
raid6_pq
[  409.158258] CPU: 1 PID: 217 Comm: btrfs-transacti Not tainted 5.4.77 #1-NixOS
[  409.158260] Hardware name: Marvell Armada 380/385 (Device Tree)
[  409.158261] Backtrace:
[  409.158272] [<c010f698>] (dump_backtrace) from [<c010f938>]
(show_stack+0x20/0x24)
[  409.158277]  r7:00000213 r6:600f0013 r5:00000000 r4:c0f8c044
[  409.158283] [<c010f918>] (show_stack) from [<c0a1b388>]
(dump_stack+0x98/0xac)
[  409.158288] [<c0a1b2f0>] (dump_stack) from [<c012a998>] (__warn+0xe0/0x108)
[  409.158292]  r7:00000213 r6:bf058ec8 r5:00000009 r4:bf120990
[  409.158296] [<c012a8b8>] (__warn) from [<c012ad24>]
(warn_slowpath_fmt+0x74/0xc4)
[  409.158300]  r7:00000213 r6:bf120990 r5:00000000 r4:e2392000
[  409.158358] [<c012acb4>] (warn_slowpath_fmt) from [<bf058ec8>]
(btree_csum_one_bio+0x208/0x248 [btrfs])
[  409.158363]  r9:e277abe0 r8:00000001 r7:e2392000 r6:ea3d17f0
r5:00000000 r4:eefd2d3c
[  409.158465] [<bf058cc0>] (btree_csum_one_bio [btrfs]) from
[<bf059ef4>] (btree_submit_bio_hook+0xe8/0x100 [btrfs])
[  409.158470]  r10:e32ce170 r9:ecc45fc0 r8:ecc45f70 r7:ec82b000
r6:00000000 r5:ea3d17f0
[  409.158472]  r4:bf059e0c
[  409.158575] [<bf059e0c>] (btree_submit_bio_hook [btrfs]) from
[<bf08b11c>] (submit_one_bio+0x44/0x5c [btrfs])
[  409.158578]  r7:ef36c048 r6:e2393cac r5:00000000 r4:bf059e0c
[  409.158683] [<bf08b0d8>] (submit_one_bio [btrfs]) from [<bf0965d4>]
(btree_write_cache_pages+0x380/0x408 [btrfs])
[  409.158686]  r5:00000000 r4:00000000
[  409.158788] [<bf096254>] (btree_write_cache_pages [btrfs]) from
[<bf059028>] (btree_writepages+0x7c/0x84 [btrfs])
[  409.158793]  r10:00000001 r9:4fd00000 r8:c0280c94 r7:e2392000
r6:e2393d80 r5:ecc45f70
[  409.158794]  r4:e2393d80
[  409.158850] [<bf058fac>] (btree_writepages [btrfs]) from
[<c0284748>] (do_writepages+0x58/0xf4)
[  409.158852]  r5:ecc45f70 r4:ecc45e68
[  409.158860] [<c02846f0>] (do_writepages) from [<c0278c30>]
(__filemap_fdatawrite_range+0xf8/0x130)
[  409.158864]  r8:ecc45f70 r7:00001000 r6:4fd0bfff r5:e2392000 r4:ecc45e68
[  409.158869] [<c0278b38>] (__filemap_fdatawrite_range) from
[<c0278db8>] (filemap_fdatawrite_range+0x2c/0x34)
[  409.158874]  r10:ecc45f70 r9:00001000 r8:4fd0bfff r7:e2393e4c
r6:c73bc628 r5:00001000
[  409.158875]  r4:4fd0bfff
[  409.158929] [<c0278d8c>] (filemap_fdatawrite_range) from
[<bf0604b4>] (btrfs_write_marked_extents+0x9c/0x1b0 [btrfs])
[  409.158931]  r5:00000001 r4:00000000
[  409.159033] [<bf060418>] (btrfs_write_marked_extents [btrfs]) from
[<bf060660>] (btrfs_write_and_wait_transaction+0x54/0xa4 [btrfs])
[  409.159038]  r10:e2392000 r9:ec82b010 r8:ec82b000 r7:c73bc628
r6:ec82b000 r5:e2392000
[  409.159040]  r4:c8b81ca8
[  409.159141] [<bf06060c>] (btrfs_write_and_wait_transaction [btrfs])
from [<bf062398>] (btrfs_commit_transaction+0x75c/0xc94 [btrfs])
[  409.159145]  r8:ec82b418 r7:c73bc600 r6:ec82b000 r5:c8b81ca8 r4:00000000
[  409.159248] [<bf061c3c>] (btrfs_commit_transaction [btrfs]) from
[<bf05cd08>] (transaction_kthread+0x19c/0x1e0 [btrfs])
[  409.159253]  r10:ec82b28c r9:00000000 r8:001aaafa r7:00000064
r6:ec82b414 r5:00000bb8
[  409.159254]  r4:ec82b000
[  409.159309] [<bf05cb6c>] (transaction_kthread [btrfs]) from
[<c014fabc>] (kthread+0x170/0x174)
[  409.159313]  r10:eca87bfc r9:bf05cb6c r8:ed619000 r7:e2392000
r6:00000000 r5:ed5ee700
[  409.159315]  r4:ed5ee1c0
[  409.159320] [<c014f94c>] (kthread) from [<c01010e8>]
(ret_from_fork+0x14/0x2c)
[  409.159322] Exception stack(0xe2393fb0 to 0xe2393ff8)
[  409.159326] 3fa0:                                     00000000
00000000 00000000 00000000
[  409.159331] 3fc0: 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000
[  409.159334] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[  409.159338]  r10:00000000 r9:00000000 r8:00000000 r7:00000000
r6:00000000 r5:c014f94c
[  409.159340]  r4:ed5ee700
[  409.159342] ---[ end trace eea59ced12fa7859 ]---
[  409.165084] BTRFS: error (device sda1) in
btrfs_commit_transaction:2279: errno=-5 IO failure (Error while
writing out transaction)
[  409.176920] BTRFS info (device sda1): forced readonly
[  409.176947] BTRFS warning (device sda1): Skipping commit of aborted
transaction.
[  409.176952] BTRFS: error (device sda1) in cleanup_transaction:1832:
errno=-5 IO failure
[  409.185049] BTRFS info (device sda1): delayed_refs has NO entry
[  409.310199] BTRFS info (device sda1): scrub: not finished on devid
3 with status: -125
[  409.664880] BTRFS info (device sda1): scrub: not finished on devid
4 with status: -125
[  410.106791] BTRFS info (device sda1): scrub: not finished on devid
1 with status: -125
[  411.268585] BTRFS warning (device sda1): failed setting block group ro: -30
[  411.268594] BTRFS info (device sda1): scrub: not finished on devid
2 with status: -30
[  411.268605] BTRFS info (device sda1): delayed_refs has NO entry
```

Information requested here
(https://btrfs.wiki.kernel.org/index.php/Btrfs_mailing_list):

```
 $ uname -a
Linux thanos 5.4.77 #1-NixOS SMP Tue Nov 10 20:13:20 UTC 2020 armv7l GNU/Linux

 $ btrfs --version
btrfs-progs v5.7

 $ sudo btrfs fi show
Label: none  uuid: b8f4ad49-29c8-4d19-a886-cef9c487f124
        Total devices 4 FS bytes used 10.26TiB
        devid    1 size 3.64TiB used 2.40TiB path /dev/sda1
        devid    2 size 3.64TiB used 2.40TiB path /dev/sdc1
        devid    3 size 9.09TiB used 7.86TiB path /dev/sdd1
        devid    4 size 9.09TiB used 7.86TiB path /dev/sdb1

Label: none  uuid: d02a3067-0a23-4c1f-96ac-80dbc26622f2
        Total devices 1 FS bytes used 116.35MiB
        devid    1 size 399.82MiB used 224.00MiB path /dev/sda2

 $ sudo btrfs fi df /
Data, RAID1: total=10.25TiB, used=10.24TiB
System, RAID1: total=64.00MiB, used=1.45MiB
Metadata, RAID1: total=18.00GiB, used=17.19GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
```

Thanks to demfloro and multicore on #btrfs for prompting this email.

Best wishes,
Joe

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: btrfs crash on armv7
  2020-11-25 15:28 btrfs crash on armv7 Joe Hermaszewski
@ 2020-11-26  6:15 ` Qu Wenruo
  2020-11-26  6:26   ` Qu Wenruo
  2020-11-26 10:53   ` Joe Hermaszewski
  0 siblings, 2 replies; 14+ messages in thread
From: Qu Wenruo @ 2020-11-26  6:15 UTC (permalink / raw)
  To: Joe Hermaszewski, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 9296 bytes --]



On 2020/11/25 下午11:28, Joe Hermaszewski wrote:
> Hi,
> 
> I have a arm32 machine with four drives with a btrfs fs spanning then in RAID1.
> The filesystem has started behaving badly recently and I'm writing to:
> 
> - Solicit advice on how best to get the system back to a stable state
> - Report a potential bug
> 
> ## What happened:
> 
> A couple of days ago I could no longer ssh into it, and on the serial
> connection there were heaps of messages (and new ones appearing with great
> frequency) along the lines of: `parent transid verify failed on blah... wanted
> x got y`.
> 
> Although I don't have a record of the precise messages I do remember that there
> was a difference of `15` between x and y.

This normally can be a sign of unreliable HDD, which lies on FLUSH,
killing metadata COW.

But, your btrfs check doesn't report the same problem, thus I'm confused.

Would you please try to run a "btrfs check --readonly /dev/sda1" with
the fs unmounted?

And, would you provide the full dmesg of that mismatch?
The reason for the exact number is, I'm suspecting hardware memory
corruption.

> 
> I power-cycled system and started a scrub after it rebooted, this was
> interrupted quite promptly by several more errors in btrfs, and the disk
> remounted RO.
> 
> Every now and then in the kernel log I get messages like:
> 
> `parent transid verify failed on blah... wanted x got y`

Not showing up in the gist dmesg though.

> 
> ## Important info
> 
> The dev stats are all zero.
> 
> Here are the outputs of some btrfs commands, dmesg and the kernel log from the
> previous two boots: https://gist.github.com/b1beab134403c5047e2efbceb98985f9
> 
> The "cut here" portion of the kernel log is as follows
> 
> ```
> [  409.158097] ------------[ cut here ]------------
> [  409.158205] WARNING: CPU: 1 PID: 217 at fs/btrfs/disk-io.c:531
> btree_csum_one_bio+0x208/0x248 [btrfs]

The line number shows, the tree bytenr doesn't match with the one in memory.
This is really too rare to be seen, especially when we have no other
error reported from btrfs (at least in the gist)

Since there is no other problems showing up in the gist, it means it
could be a bit flip, considering the magic generation gap you mention is
15, I'm more suspicious about memory bit flip.

If you can provide the full parent transid mismatch error message, it
may help to determine the possibility of hardware memory corruption.

Thanks,
Qu

> [  409.158208] Modules linked in: cfg80211 rfkill 8021q ip6table_nat
> iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6
> nf_defrag_ipv4 ip6t_rpfilter ipt_rpfilter ip6table_raw iptable_raw
> xt_pkttype nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG xt_tcpudp
> ftdi_sio usbserial phy_generic uio_pdrv_genirq uio ip6table_filter
> ip6_tables iptable_filter sch_fq_codel loop tun tap macvlan bridge stp
> llc lm75 ip_tables x_tables autofs4 dm_mod dax btrfs libcrc32c xor
> raid6_pq
> [  409.158258] CPU: 1 PID: 217 Comm: btrfs-transacti Not tainted 5.4.77 #1-NixOS
> [  409.158260] Hardware name: Marvell Armada 380/385 (Device Tree)
> [  409.158261] Backtrace:
> [  409.158272] [<c010f698>] (dump_backtrace) from [<c010f938>]
> (show_stack+0x20/0x24)
> [  409.158277]  r7:00000213 r6:600f0013 r5:00000000 r4:c0f8c044
> [  409.158283] [<c010f918>] (show_stack) from [<c0a1b388>]
> (dump_stack+0x98/0xac)
> [  409.158288] [<c0a1b2f0>] (dump_stack) from [<c012a998>] (__warn+0xe0/0x108)
> [  409.158292]  r7:00000213 r6:bf058ec8 r5:00000009 r4:bf120990
> [  409.158296] [<c012a8b8>] (__warn) from [<c012ad24>]
> (warn_slowpath_fmt+0x74/0xc4)
> [  409.158300]  r7:00000213 r6:bf120990 r5:00000000 r4:e2392000
> [  409.158358] [<c012acb4>] (warn_slowpath_fmt) from [<bf058ec8>]
> (btree_csum_one_bio+0x208/0x248 [btrfs])
> [  409.158363]  r9:e277abe0 r8:00000001 r7:e2392000 r6:ea3d17f0
> r5:00000000 r4:eefd2d3c
> [  409.158465] [<bf058cc0>] (btree_csum_one_bio [btrfs]) from
> [<bf059ef4>] (btree_submit_bio_hook+0xe8/0x100 [btrfs])
> [  409.158470]  r10:e32ce170 r9:ecc45fc0 r8:ecc45f70 r7:ec82b000
> r6:00000000 r5:ea3d17f0
> [  409.158472]  r4:bf059e0c
> [  409.158575] [<bf059e0c>] (btree_submit_bio_hook [btrfs]) from
> [<bf08b11c>] (submit_one_bio+0x44/0x5c [btrfs])
> [  409.158578]  r7:ef36c048 r6:e2393cac r5:00000000 r4:bf059e0c
> [  409.158683] [<bf08b0d8>] (submit_one_bio [btrfs]) from [<bf0965d4>]
> (btree_write_cache_pages+0x380/0x408 [btrfs])
> [  409.158686]  r5:00000000 r4:00000000
> [  409.158788] [<bf096254>] (btree_write_cache_pages [btrfs]) from
> [<bf059028>] (btree_writepages+0x7c/0x84 [btrfs])
> [  409.158793]  r10:00000001 r9:4fd00000 r8:c0280c94 r7:e2392000
> r6:e2393d80 r5:ecc45f70
> [  409.158794]  r4:e2393d80
> [  409.158850] [<bf058fac>] (btree_writepages [btrfs]) from
> [<c0284748>] (do_writepages+0x58/0xf4)
> [  409.158852]  r5:ecc45f70 r4:ecc45e68
> [  409.158860] [<c02846f0>] (do_writepages) from [<c0278c30>]
> (__filemap_fdatawrite_range+0xf8/0x130)
> [  409.158864]  r8:ecc45f70 r7:00001000 r6:4fd0bfff r5:e2392000 r4:ecc45e68
> [  409.158869] [<c0278b38>] (__filemap_fdatawrite_range) from
> [<c0278db8>] (filemap_fdatawrite_range+0x2c/0x34)
> [  409.158874]  r10:ecc45f70 r9:00001000 r8:4fd0bfff r7:e2393e4c
> r6:c73bc628 r5:00001000
> [  409.158875]  r4:4fd0bfff
> [  409.158929] [<c0278d8c>] (filemap_fdatawrite_range) from
> [<bf0604b4>] (btrfs_write_marked_extents+0x9c/0x1b0 [btrfs])
> [  409.158931]  r5:00000001 r4:00000000
> [  409.159033] [<bf060418>] (btrfs_write_marked_extents [btrfs]) from
> [<bf060660>] (btrfs_write_and_wait_transaction+0x54/0xa4 [btrfs])
> [  409.159038]  r10:e2392000 r9:ec82b010 r8:ec82b000 r7:c73bc628
> r6:ec82b000 r5:e2392000
> [  409.159040]  r4:c8b81ca8
> [  409.159141] [<bf06060c>] (btrfs_write_and_wait_transaction [btrfs])
> from [<bf062398>] (btrfs_commit_transaction+0x75c/0xc94 [btrfs])
> [  409.159145]  r8:ec82b418 r7:c73bc600 r6:ec82b000 r5:c8b81ca8 r4:00000000
> [  409.159248] [<bf061c3c>] (btrfs_commit_transaction [btrfs]) from
> [<bf05cd08>] (transaction_kthread+0x19c/0x1e0 [btrfs])
> [  409.159253]  r10:ec82b28c r9:00000000 r8:001aaafa r7:00000064
> r6:ec82b414 r5:00000bb8
> [  409.159254]  r4:ec82b000
> [  409.159309] [<bf05cb6c>] (transaction_kthread [btrfs]) from
> [<c014fabc>] (kthread+0x170/0x174)
> [  409.159313]  r10:eca87bfc r9:bf05cb6c r8:ed619000 r7:e2392000
> r6:00000000 r5:ed5ee700
> [  409.159315]  r4:ed5ee1c0
> [  409.159320] [<c014f94c>] (kthread) from [<c01010e8>]
> (ret_from_fork+0x14/0x2c)
> [  409.159322] Exception stack(0xe2393fb0 to 0xe2393ff8)
> [  409.159326] 3fa0:                                     00000000
> 00000000 00000000 00000000
> [  409.159331] 3fc0: 00000000 00000000 00000000 00000000 00000000
> 00000000 00000000 00000000
> [  409.159334] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
> [  409.159338]  r10:00000000 r9:00000000 r8:00000000 r7:00000000
> r6:00000000 r5:c014f94c
> [  409.159340]  r4:ed5ee700
> [  409.159342] ---[ end trace eea59ced12fa7859 ]---
> [  409.165084] BTRFS: error (device sda1) in
> btrfs_commit_transaction:2279: errno=-5 IO failure (Error while
> writing out transaction)
> [  409.176920] BTRFS info (device sda1): forced readonly
> [  409.176947] BTRFS warning (device sda1): Skipping commit of aborted
> transaction.
> [  409.176952] BTRFS: error (device sda1) in cleanup_transaction:1832:
> errno=-5 IO failure
> [  409.185049] BTRFS info (device sda1): delayed_refs has NO entry
> [  409.310199] BTRFS info (device sda1): scrub: not finished on devid
> 3 with status: -125
> [  409.664880] BTRFS info (device sda1): scrub: not finished on devid
> 4 with status: -125
> [  410.106791] BTRFS info (device sda1): scrub: not finished on devid
> 1 with status: -125
> [  411.268585] BTRFS warning (device sda1): failed setting block group ro: -30
> [  411.268594] BTRFS info (device sda1): scrub: not finished on devid
> 2 with status: -30
> [  411.268605] BTRFS info (device sda1): delayed_refs has NO entry
> ```
> 
> Information requested here
> (https://btrfs.wiki.kernel.org/index.php/Btrfs_mailing_list):
> 
> ```
>  $ uname -a
> Linux thanos 5.4.77 #1-NixOS SMP Tue Nov 10 20:13:20 UTC 2020 armv7l GNU/Linux
> 
>  $ btrfs --version
> btrfs-progs v5.7
> 
>  $ sudo btrfs fi show
> Label: none  uuid: b8f4ad49-29c8-4d19-a886-cef9c487f124
>         Total devices 4 FS bytes used 10.26TiB
>         devid    1 size 3.64TiB used 2.40TiB path /dev/sda1
>         devid    2 size 3.64TiB used 2.40TiB path /dev/sdc1
>         devid    3 size 9.09TiB used 7.86TiB path /dev/sdd1
>         devid    4 size 9.09TiB used 7.86TiB path /dev/sdb1
> 
> Label: none  uuid: d02a3067-0a23-4c1f-96ac-80dbc26622f2
>         Total devices 1 FS bytes used 116.35MiB
>         devid    1 size 399.82MiB used 224.00MiB path /dev/sda2
> 
>  $ sudo btrfs fi df /
> Data, RAID1: total=10.25TiB, used=10.24TiB
> System, RAID1: total=64.00MiB, used=1.45MiB
> Metadata, RAID1: total=18.00GiB, used=17.19GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> ```
> 
> Thanks to demfloro and multicore on #btrfs for prompting this email.
> 
> Best wishes,
> Joe
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: btrfs crash on armv7
  2020-11-26  6:15 ` Qu Wenruo
@ 2020-11-26  6:26   ` Qu Wenruo
  2020-11-26 10:53   ` Joe Hermaszewski
  1 sibling, 0 replies; 14+ messages in thread
From: Qu Wenruo @ 2020-11-26  6:26 UTC (permalink / raw)
  To: Joe Hermaszewski, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 2893 bytes --]



On 2020/11/26 下午2:15, Qu Wenruo wrote:
> 
> 
> On 2020/11/25 下午11:28, Joe Hermaszewski wrote:
>> Hi,
>>
>> I have a arm32 machine with four drives with a btrfs fs spanning then in RAID1.
>> The filesystem has started behaving badly recently and I'm writing to:
>>
>> - Solicit advice on how best to get the system back to a stable state
>> - Report a potential bug
>>
>> ## What happened:
>>
>> A couple of days ago I could no longer ssh into it, and on the serial
>> connection there were heaps of messages (and new ones appearing with great
>> frequency) along the lines of: `parent transid verify failed on blah... wanted
>> x got y`.
>>
>> Although I don't have a record of the precise messages I do remember that there
>> was a difference of `15` between x and y.
> 
> This normally can be a sign of unreliable HDD, which lies on FLUSH,
> killing metadata COW.
> 
> But, your btrfs check doesn't report the same problem, thus I'm confused.
> 
> Would you please try to run a "btrfs check --readonly /dev/sda1" with
> the fs unmounted?
> 
> And, would you provide the full dmesg of that mismatch?
> The reason for the exact number is, I'm suspecting hardware memory
> corruption.
> 
>>
>> I power-cycled system and started a scrub after it rebooted, this was
>> interrupted quite promptly by several more errors in btrfs, and the disk
>> remounted RO.
>>
>> Every now and then in the kernel log I get messages like:
>>
>> `parent transid verify failed on blah... wanted x got y`
> 
> Not showing up in the gist dmesg though.
> 
>>
>> ## Important info
>>
>> The dev stats are all zero.
>>
>> Here are the outputs of some btrfs commands, dmesg and the kernel log from the
>> previous two boots: https://gist.github.com/b1beab134403c5047e2efbceb98985f9
>>
>> The "cut here" portion of the kernel log is as follows
>>
>> ```
>> [  409.158097] ------------[ cut here ]------------
>> [  409.158205] WARNING: CPU: 1 PID: 217 at fs/btrfs/disk-io.c:531
>> btree_csum_one_bio+0x208/0x248 [btrfs]
> 
> The line number shows, the tree bytenr doesn't match with the one in memory.
> This is really too rare to be seen, especially when we have no other
> error reported from btrfs (at least in the gist)
> 
> Since there is no other problems showing up in the gist, it means it
> could be a bit flip, considering the magic generation gap you mention is
> 15, I'm more suspicious about memory bit flip.
> 
> If you can provide the full parent transid mismatch error message, it
> may help to determine the possibility of hardware memory corruption.

Also if you can access the device, especially sda, please also try to
attach it to a known working system, and check if "btrfs check" reports
any error. (That file hole error can be ignored, as it won't cause any
problem).

Thanks,
Qu

> 
> Thanks,
> Qu



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: btrfs crash on armv7
  2020-11-26  6:15 ` Qu Wenruo
  2020-11-26  6:26   ` Qu Wenruo
@ 2020-11-26 10:53   ` Joe Hermaszewski
  2020-11-26 11:05     ` Qu Wenruo
  1 sibling, 1 reply; 14+ messages in thread
From: Joe Hermaszewski @ 2020-11-26 10:53 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

Hi Qu,

This is the output of btrfs check with the device unmounted:

```
[root@nixos:~]# btrfs check --readonly /dev/sda1
Opening filesystem to check...
Checking filesystem on /dev/sda1
UUID: b8f4ad49-29c8-4d19-a886-cef9c487f124
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
[4/7] checking fs roots
root 294 inode 24665 errors 100, file extent discount
Found file extent holes:
        start: 3709534208, len: 163840
root 294 inode 406548 errors 100, file extent discount
Found file extent holes:
        start: 0, len: 99450880
ERROR: errors found in fs roots
found 11279251902464 bytes used, error(s) found
total csum bytes: 10935906504
total tree bytes: 18455740416
total fs tree bytes: 5798936576
total extent tree bytes: 532250624
btree space waste bytes: 2308849707
file data blocks allocated: 17243418611712
 referenced 14210256932864
```

Looks the same as before...

I'm afraid I don't have the precise details of the transid error
saved. I think I meant to write earlier that the message that comes up
now and then is the "errno=-5 IO failure (Error while writing out
transaction)". I've not seen the transid mismatch one since the first
incident, sorry for that slip-up.

I'll try and attach to a known good system, will I have to attach the
whole array because I don't think I have that many enclosures for the
drives.

I'll also run a memtest on this box.

Thank you for the helpful response.

Best,
Joe

On Thu, Nov 26, 2020 at 2:15 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2020/11/25 下午11:28, Joe Hermaszewski wrote:
> > Hi,
> >
> > I have a arm32 machine with four drives with a btrfs fs spanning then in RAID1.
> > The filesystem has started behaving badly recently and I'm writing to:
> >
> > - Solicit advice on how best to get the system back to a stable state
> > - Report a potential bug
> >
> > ## What happened:
> >
> > A couple of days ago I could no longer ssh into it, and on the serial
> > connection there were heaps of messages (and new ones appearing with great
> > frequency) along the lines of: `parent transid verify failed on blah... wanted
> > x got y`.
> >
> > Although I don't have a record of the precise messages I do remember that there
> > was a difference of `15` between x and y.
>
> This normally can be a sign of unreliable HDD, which lies on FLUSH,
> killing metadata COW.
>
> But, your btrfs check doesn't report the same problem, thus I'm confused.
>
> Would you please try to run a "btrfs check --readonly /dev/sda1" with
> the fs unmounted?
>
> And, would you provide the full dmesg of that mismatch?
> The reason for the exact number is, I'm suspecting hardware memory
> corruption.
>
> >
> > I power-cycled system and started a scrub after it rebooted, this was
> > interrupted quite promptly by several more errors in btrfs, and the disk
> > remounted RO.
> >
> > Every now and then in the kernel log I get messages like:
> >
> > `parent transid verify failed on blah... wanted x got y`
>
> Not showing up in the gist dmesg though.
>
> >
> > ## Important info
> >
> > The dev stats are all zero.
> >
> > Here are the outputs of some btrfs commands, dmesg and the kernel log from the
> > previous two boots: https://gist.github.com/b1beab134403c5047e2efbceb98985f9
> >
> > The "cut here" portion of the kernel log is as follows
> >
> > ```
> > [  409.158097] ------------[ cut here ]------------
> > [  409.158205] WARNING: CPU: 1 PID: 217 at fs/btrfs/disk-io.c:531
> > btree_csum_one_bio+0x208/0x248 [btrfs]
>
> The line number shows, the tree bytenr doesn't match with the one in memory.
> This is really too rare to be seen, especially when we have no other
> error reported from btrfs (at least in the gist)
>
> Since there is no other problems showing up in the gist, it means it
> could be a bit flip, considering the magic generation gap you mention is
> 15, I'm more suspicious about memory bit flip.
>
> If you can provide the full parent transid mismatch error message, it
> may help to determine the possibility of hardware memory corruption.
>
> Thanks,
> Qu
>
> > [  409.158208] Modules linked in: cfg80211 rfkill 8021q ip6table_nat
> > iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6
> > nf_defrag_ipv4 ip6t_rpfilter ipt_rpfilter ip6table_raw iptable_raw
> > xt_pkttype nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG xt_tcpudp
> > ftdi_sio usbserial phy_generic uio_pdrv_genirq uio ip6table_filter
> > ip6_tables iptable_filter sch_fq_codel loop tun tap macvlan bridge stp
> > llc lm75 ip_tables x_tables autofs4 dm_mod dax btrfs libcrc32c xor
> > raid6_pq
> > [  409.158258] CPU: 1 PID: 217 Comm: btrfs-transacti Not tainted 5.4.77 #1-NixOS
> > [  409.158260] Hardware name: Marvell Armada 380/385 (Device Tree)
> > [  409.158261] Backtrace:
> > [  409.158272] [<c010f698>] (dump_backtrace) from [<c010f938>]
> > (show_stack+0x20/0x24)
> > [  409.158277]  r7:00000213 r6:600f0013 r5:00000000 r4:c0f8c044
> > [  409.158283] [<c010f918>] (show_stack) from [<c0a1b388>]
> > (dump_stack+0x98/0xac)
> > [  409.158288] [<c0a1b2f0>] (dump_stack) from [<c012a998>] (__warn+0xe0/0x108)
> > [  409.158292]  r7:00000213 r6:bf058ec8 r5:00000009 r4:bf120990
> > [  409.158296] [<c012a8b8>] (__warn) from [<c012ad24>]
> > (warn_slowpath_fmt+0x74/0xc4)
> > [  409.158300]  r7:00000213 r6:bf120990 r5:00000000 r4:e2392000
> > [  409.158358] [<c012acb4>] (warn_slowpath_fmt) from [<bf058ec8>]
> > (btree_csum_one_bio+0x208/0x248 [btrfs])
> > [  409.158363]  r9:e277abe0 r8:00000001 r7:e2392000 r6:ea3d17f0
> > r5:00000000 r4:eefd2d3c
> > [  409.158465] [<bf058cc0>] (btree_csum_one_bio [btrfs]) from
> > [<bf059ef4>] (btree_submit_bio_hook+0xe8/0x100 [btrfs])
> > [  409.158470]  r10:e32ce170 r9:ecc45fc0 r8:ecc45f70 r7:ec82b000
> > r6:00000000 r5:ea3d17f0
> > [  409.158472]  r4:bf059e0c
> > [  409.158575] [<bf059e0c>] (btree_submit_bio_hook [btrfs]) from
> > [<bf08b11c>] (submit_one_bio+0x44/0x5c [btrfs])
> > [  409.158578]  r7:ef36c048 r6:e2393cac r5:00000000 r4:bf059e0c
> > [  409.158683] [<bf08b0d8>] (submit_one_bio [btrfs]) from [<bf0965d4>]
> > (btree_write_cache_pages+0x380/0x408 [btrfs])
> > [  409.158686]  r5:00000000 r4:00000000
> > [  409.158788] [<bf096254>] (btree_write_cache_pages [btrfs]) from
> > [<bf059028>] (btree_writepages+0x7c/0x84 [btrfs])
> > [  409.158793]  r10:00000001 r9:4fd00000 r8:c0280c94 r7:e2392000
> > r6:e2393d80 r5:ecc45f70
> > [  409.158794]  r4:e2393d80
> > [  409.158850] [<bf058fac>] (btree_writepages [btrfs]) from
> > [<c0284748>] (do_writepages+0x58/0xf4)
> > [  409.158852]  r5:ecc45f70 r4:ecc45e68
> > [  409.158860] [<c02846f0>] (do_writepages) from [<c0278c30>]
> > (__filemap_fdatawrite_range+0xf8/0x130)
> > [  409.158864]  r8:ecc45f70 r7:00001000 r6:4fd0bfff r5:e2392000 r4:ecc45e68
> > [  409.158869] [<c0278b38>] (__filemap_fdatawrite_range) from
> > [<c0278db8>] (filemap_fdatawrite_range+0x2c/0x34)
> > [  409.158874]  r10:ecc45f70 r9:00001000 r8:4fd0bfff r7:e2393e4c
> > r6:c73bc628 r5:00001000
> > [  409.158875]  r4:4fd0bfff
> > [  409.158929] [<c0278d8c>] (filemap_fdatawrite_range) from
> > [<bf0604b4>] (btrfs_write_marked_extents+0x9c/0x1b0 [btrfs])
> > [  409.158931]  r5:00000001 r4:00000000
> > [  409.159033] [<bf060418>] (btrfs_write_marked_extents [btrfs]) from
> > [<bf060660>] (btrfs_write_and_wait_transaction+0x54/0xa4 [btrfs])
> > [  409.159038]  r10:e2392000 r9:ec82b010 r8:ec82b000 r7:c73bc628
> > r6:ec82b000 r5:e2392000
> > [  409.159040]  r4:c8b81ca8
> > [  409.159141] [<bf06060c>] (btrfs_write_and_wait_transaction [btrfs])
> > from [<bf062398>] (btrfs_commit_transaction+0x75c/0xc94 [btrfs])
> > [  409.159145]  r8:ec82b418 r7:c73bc600 r6:ec82b000 r5:c8b81ca8 r4:00000000
> > [  409.159248] [<bf061c3c>] (btrfs_commit_transaction [btrfs]) from
> > [<bf05cd08>] (transaction_kthread+0x19c/0x1e0 [btrfs])
> > [  409.159253]  r10:ec82b28c r9:00000000 r8:001aaafa r7:00000064
> > r6:ec82b414 r5:00000bb8
> > [  409.159254]  r4:ec82b000
> > [  409.159309] [<bf05cb6c>] (transaction_kthread [btrfs]) from
> > [<c014fabc>] (kthread+0x170/0x174)
> > [  409.159313]  r10:eca87bfc r9:bf05cb6c r8:ed619000 r7:e2392000
> > r6:00000000 r5:ed5ee700
> > [  409.159315]  r4:ed5ee1c0
> > [  409.159320] [<c014f94c>] (kthread) from [<c01010e8>]
> > (ret_from_fork+0x14/0x2c)
> > [  409.159322] Exception stack(0xe2393fb0 to 0xe2393ff8)
> > [  409.159326] 3fa0:                                     00000000
> > 00000000 00000000 00000000
> > [  409.159331] 3fc0: 00000000 00000000 00000000 00000000 00000000
> > 00000000 00000000 00000000
> > [  409.159334] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
> > [  409.159338]  r10:00000000 r9:00000000 r8:00000000 r7:00000000
> > r6:00000000 r5:c014f94c
> > [  409.159340]  r4:ed5ee700
> > [  409.159342] ---[ end trace eea59ced12fa7859 ]---
> > [  409.165084] BTRFS: error (device sda1) in
> > btrfs_commit_transaction:2279: errno=-5 IO failure (Error while
> > writing out transaction)
> > [  409.176920] BTRFS info (device sda1): forced readonly
> > [  409.176947] BTRFS warning (device sda1): Skipping commit of aborted
> > transaction.
> > [  409.176952] BTRFS: error (device sda1) in cleanup_transaction:1832:
> > errno=-5 IO failure
> > [  409.185049] BTRFS info (device sda1): delayed_refs has NO entry
> > [  409.310199] BTRFS info (device sda1): scrub: not finished on devid
> > 3 with status: -125
> > [  409.664880] BTRFS info (device sda1): scrub: not finished on devid
> > 4 with status: -125
> > [  410.106791] BTRFS info (device sda1): scrub: not finished on devid
> > 1 with status: -125
> > [  411.268585] BTRFS warning (device sda1): failed setting block group ro: -30
> > [  411.268594] BTRFS info (device sda1): scrub: not finished on devid
> > 2 with status: -30
> > [  411.268605] BTRFS info (device sda1): delayed_refs has NO entry
> > ```
> >
> > Information requested here
> > (https://btrfs.wiki.kernel.org/index.php/Btrfs_mailing_list):
> >
> > ```
> >  $ uname -a
> > Linux thanos 5.4.77 #1-NixOS SMP Tue Nov 10 20:13:20 UTC 2020 armv7l GNU/Linux
> >
> >  $ btrfs --version
> > btrfs-progs v5.7
> >
> >  $ sudo btrfs fi show
> > Label: none  uuid: b8f4ad49-29c8-4d19-a886-cef9c487f124
> >         Total devices 4 FS bytes used 10.26TiB
> >         devid    1 size 3.64TiB used 2.40TiB path /dev/sda1
> >         devid    2 size 3.64TiB used 2.40TiB path /dev/sdc1
> >         devid    3 size 9.09TiB used 7.86TiB path /dev/sdd1
> >         devid    4 size 9.09TiB used 7.86TiB path /dev/sdb1
> >
> > Label: none  uuid: d02a3067-0a23-4c1f-96ac-80dbc26622f2
> >         Total devices 1 FS bytes used 116.35MiB
> >         devid    1 size 399.82MiB used 224.00MiB path /dev/sda2
> >
> >  $ sudo btrfs fi df /
> > Data, RAID1: total=10.25TiB, used=10.24TiB
> > System, RAID1: total=64.00MiB, used=1.45MiB
> > Metadata, RAID1: total=18.00GiB, used=17.19GiB
> > GlobalReserve, single: total=512.00MiB, used=0.00B
> > ```
> >
> > Thanks to demfloro and multicore on #btrfs for prompting this email.
> >
> > Best wishes,
> > Joe
> >
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: btrfs crash on armv7
  2020-11-26 10:53   ` Joe Hermaszewski
@ 2020-11-26 11:05     ` Qu Wenruo
  2020-11-27 15:15       ` Joe Hermaszewski
  0 siblings, 1 reply; 14+ messages in thread
From: Qu Wenruo @ 2020-11-26 11:05 UTC (permalink / raw)
  To: Joe Hermaszewski; +Cc: linux-btrfs


[-- Attachment #1.1.1: Type: text/plain, Size: 12181 bytes --]



On 2020/11/26 下午6:53, Joe Hermaszewski wrote:
> Hi Qu,
> 
> This is the output of btrfs check with the device unmounted:
> 
> ```
> [root@nixos:~]# btrfs check --readonly /dev/sda1
> Opening filesystem to check...
> Checking filesystem on /dev/sda1
> UUID: b8f4ad49-29c8-4d19-a886-cef9c487f124
> [1/7] checking root items
> [2/7] checking extents
> [3/7] checking free space cache
> [4/7] checking fs roots
> root 294 inode 24665 errors 100, file extent discount
> Found file extent holes:
>         start: 3709534208, len: 163840
> root 294 inode 406548 errors 100, file extent discount
> Found file extent holes:
>         start: 0, len: 99450880
> ERROR: errors found in fs roots
> found 11279251902464 bytes used, error(s) found
> total csum bytes: 10935906504
> total tree bytes: 18455740416
> total fs tree bytes: 5798936576
> total extent tree bytes: 532250624
> btree space waste bytes: 2308849707
> file data blocks allocated: 17243418611712
>  referenced 14210256932864
> ```
> 
> Looks the same as before...

Then it means, at least your metadata is pretty safe. Nothing really
needs to be worried (so far).

This also means, the problem only happens at runtime, not something on-disk.

If you are OK to re-compile the kernel, and the warning message is
always the same at line number 531 of disk-io.c, then I recommend the
attached debug diff

The diff will output the found_start (from eb) and page_start (from page).
If it's memory bitflip, we should have a pretty signature diff at the
output.
If not memory bitflip, then we really have a bug to dig.


> 
> I'm afraid I don't have the precise details of the transid error
> saved. I think I meant to write earlier that the message that comes up
> now and then is the "errno=-5 IO failure (Error while writing out
> transaction)". I've not seen the transid mismatch one since the first
> incident, sorry for that slip-up.
> 
> I'll try and attach to a known good system, will I have to attach the
> whole array because I don't think I have that many enclosures for the
> drives.

You only need to mount the fs you're seeing the problem.
But if that fs is on multiple devices, then I guess you need to mount
them all.

Thanks,
Qu
> 
> I'll also run a memtest on this box.
> 
> Thank you for the helpful response.
> 
> Best,
> Joe
> 
> On Thu, Nov 26, 2020 at 2:15 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2020/11/25 下午11:28, Joe Hermaszewski wrote:
>>> Hi,
>>>
>>> I have a arm32 machine with four drives with a btrfs fs spanning then in RAID1.
>>> The filesystem has started behaving badly recently and I'm writing to:
>>>
>>> - Solicit advice on how best to get the system back to a stable state
>>> - Report a potential bug
>>>
>>> ## What happened:
>>>
>>> A couple of days ago I could no longer ssh into it, and on the serial
>>> connection there were heaps of messages (and new ones appearing with great
>>> frequency) along the lines of: `parent transid verify failed on blah... wanted
>>> x got y`.
>>>
>>> Although I don't have a record of the precise messages I do remember that there
>>> was a difference of `15` between x and y.
>>
>> This normally can be a sign of unreliable HDD, which lies on FLUSH,
>> killing metadata COW.
>>
>> But, your btrfs check doesn't report the same problem, thus I'm confused.
>>
>> Would you please try to run a "btrfs check --readonly /dev/sda1" with
>> the fs unmounted?
>>
>> And, would you provide the full dmesg of that mismatch?
>> The reason for the exact number is, I'm suspecting hardware memory
>> corruption.
>>
>>>
>>> I power-cycled system and started a scrub after it rebooted, this was
>>> interrupted quite promptly by several more errors in btrfs, and the disk
>>> remounted RO.
>>>
>>> Every now and then in the kernel log I get messages like:
>>>
>>> `parent transid verify failed on blah... wanted x got y`
>>
>> Not showing up in the gist dmesg though.
>>
>>>
>>> ## Important info
>>>
>>> The dev stats are all zero.
>>>
>>> Here are the outputs of some btrfs commands, dmesg and the kernel log from the
>>> previous two boots: https://gist.github.com/b1beab134403c5047e2efbceb98985f9
>>>
>>> The "cut here" portion of the kernel log is as follows
>>>
>>> ```
>>> [  409.158097] ------------[ cut here ]------------
>>> [  409.158205] WARNING: CPU: 1 PID: 217 at fs/btrfs/disk-io.c:531
>>> btree_csum_one_bio+0x208/0x248 [btrfs]
>>
>> The line number shows, the tree bytenr doesn't match with the one in memory.
>> This is really too rare to be seen, especially when we have no other
>> error reported from btrfs (at least in the gist)
>>
>> Since there is no other problems showing up in the gist, it means it
>> could be a bit flip, considering the magic generation gap you mention is
>> 15, I'm more suspicious about memory bit flip.
>>
>> If you can provide the full parent transid mismatch error message, it
>> may help to determine the possibility of hardware memory corruption.
>>
>> Thanks,
>> Qu
>>
>>> [  409.158208] Modules linked in: cfg80211 rfkill 8021q ip6table_nat
>>> iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6
>>> nf_defrag_ipv4 ip6t_rpfilter ipt_rpfilter ip6table_raw iptable_raw
>>> xt_pkttype nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG xt_tcpudp
>>> ftdi_sio usbserial phy_generic uio_pdrv_genirq uio ip6table_filter
>>> ip6_tables iptable_filter sch_fq_codel loop tun tap macvlan bridge stp
>>> llc lm75 ip_tables x_tables autofs4 dm_mod dax btrfs libcrc32c xor
>>> raid6_pq
>>> [  409.158258] CPU: 1 PID: 217 Comm: btrfs-transacti Not tainted 5.4.77 #1-NixOS
>>> [  409.158260] Hardware name: Marvell Armada 380/385 (Device Tree)
>>> [  409.158261] Backtrace:
>>> [  409.158272] [<c010f698>] (dump_backtrace) from [<c010f938>]
>>> (show_stack+0x20/0x24)
>>> [  409.158277]  r7:00000213 r6:600f0013 r5:00000000 r4:c0f8c044
>>> [  409.158283] [<c010f918>] (show_stack) from [<c0a1b388>]
>>> (dump_stack+0x98/0xac)
>>> [  409.158288] [<c0a1b2f0>] (dump_stack) from [<c012a998>] (__warn+0xe0/0x108)
>>> [  409.158292]  r7:00000213 r6:bf058ec8 r5:00000009 r4:bf120990
>>> [  409.158296] [<c012a8b8>] (__warn) from [<c012ad24>]
>>> (warn_slowpath_fmt+0x74/0xc4)
>>> [  409.158300]  r7:00000213 r6:bf120990 r5:00000000 r4:e2392000
>>> [  409.158358] [<c012acb4>] (warn_slowpath_fmt) from [<bf058ec8>]
>>> (btree_csum_one_bio+0x208/0x248 [btrfs])
>>> [  409.158363]  r9:e277abe0 r8:00000001 r7:e2392000 r6:ea3d17f0
>>> r5:00000000 r4:eefd2d3c
>>> [  409.158465] [<bf058cc0>] (btree_csum_one_bio [btrfs]) from
>>> [<bf059ef4>] (btree_submit_bio_hook+0xe8/0x100 [btrfs])
>>> [  409.158470]  r10:e32ce170 r9:ecc45fc0 r8:ecc45f70 r7:ec82b000
>>> r6:00000000 r5:ea3d17f0
>>> [  409.158472]  r4:bf059e0c
>>> [  409.158575] [<bf059e0c>] (btree_submit_bio_hook [btrfs]) from
>>> [<bf08b11c>] (submit_one_bio+0x44/0x5c [btrfs])
>>> [  409.158578]  r7:ef36c048 r6:e2393cac r5:00000000 r4:bf059e0c
>>> [  409.158683] [<bf08b0d8>] (submit_one_bio [btrfs]) from [<bf0965d4>]
>>> (btree_write_cache_pages+0x380/0x408 [btrfs])
>>> [  409.158686]  r5:00000000 r4:00000000
>>> [  409.158788] [<bf096254>] (btree_write_cache_pages [btrfs]) from
>>> [<bf059028>] (btree_writepages+0x7c/0x84 [btrfs])
>>> [  409.158793]  r10:00000001 r9:4fd00000 r8:c0280c94 r7:e2392000
>>> r6:e2393d80 r5:ecc45f70
>>> [  409.158794]  r4:e2393d80
>>> [  409.158850] [<bf058fac>] (btree_writepages [btrfs]) from
>>> [<c0284748>] (do_writepages+0x58/0xf4)
>>> [  409.158852]  r5:ecc45f70 r4:ecc45e68
>>> [  409.158860] [<c02846f0>] (do_writepages) from [<c0278c30>]
>>> (__filemap_fdatawrite_range+0xf8/0x130)
>>> [  409.158864]  r8:ecc45f70 r7:00001000 r6:4fd0bfff r5:e2392000 r4:ecc45e68
>>> [  409.158869] [<c0278b38>] (__filemap_fdatawrite_range) from
>>> [<c0278db8>] (filemap_fdatawrite_range+0x2c/0x34)
>>> [  409.158874]  r10:ecc45f70 r9:00001000 r8:4fd0bfff r7:e2393e4c
>>> r6:c73bc628 r5:00001000
>>> [  409.158875]  r4:4fd0bfff
>>> [  409.158929] [<c0278d8c>] (filemap_fdatawrite_range) from
>>> [<bf0604b4>] (btrfs_write_marked_extents+0x9c/0x1b0 [btrfs])
>>> [  409.158931]  r5:00000001 r4:00000000
>>> [  409.159033] [<bf060418>] (btrfs_write_marked_extents [btrfs]) from
>>> [<bf060660>] (btrfs_write_and_wait_transaction+0x54/0xa4 [btrfs])
>>> [  409.159038]  r10:e2392000 r9:ec82b010 r8:ec82b000 r7:c73bc628
>>> r6:ec82b000 r5:e2392000
>>> [  409.159040]  r4:c8b81ca8
>>> [  409.159141] [<bf06060c>] (btrfs_write_and_wait_transaction [btrfs])
>>> from [<bf062398>] (btrfs_commit_transaction+0x75c/0xc94 [btrfs])
>>> [  409.159145]  r8:ec82b418 r7:c73bc600 r6:ec82b000 r5:c8b81ca8 r4:00000000
>>> [  409.159248] [<bf061c3c>] (btrfs_commit_transaction [btrfs]) from
>>> [<bf05cd08>] (transaction_kthread+0x19c/0x1e0 [btrfs])
>>> [  409.159253]  r10:ec82b28c r9:00000000 r8:001aaafa r7:00000064
>>> r6:ec82b414 r5:00000bb8
>>> [  409.159254]  r4:ec82b000
>>> [  409.159309] [<bf05cb6c>] (transaction_kthread [btrfs]) from
>>> [<c014fabc>] (kthread+0x170/0x174)
>>> [  409.159313]  r10:eca87bfc r9:bf05cb6c r8:ed619000 r7:e2392000
>>> r6:00000000 r5:ed5ee700
>>> [  409.159315]  r4:ed5ee1c0
>>> [  409.159320] [<c014f94c>] (kthread) from [<c01010e8>]
>>> (ret_from_fork+0x14/0x2c)
>>> [  409.159322] Exception stack(0xe2393fb0 to 0xe2393ff8)
>>> [  409.159326] 3fa0:                                     00000000
>>> 00000000 00000000 00000000
>>> [  409.159331] 3fc0: 00000000 00000000 00000000 00000000 00000000
>>> 00000000 00000000 00000000
>>> [  409.159334] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
>>> [  409.159338]  r10:00000000 r9:00000000 r8:00000000 r7:00000000
>>> r6:00000000 r5:c014f94c
>>> [  409.159340]  r4:ed5ee700
>>> [  409.159342] ---[ end trace eea59ced12fa7859 ]---
>>> [  409.165084] BTRFS: error (device sda1) in
>>> btrfs_commit_transaction:2279: errno=-5 IO failure (Error while
>>> writing out transaction)
>>> [  409.176920] BTRFS info (device sda1): forced readonly
>>> [  409.176947] BTRFS warning (device sda1): Skipping commit of aborted
>>> transaction.
>>> [  409.176952] BTRFS: error (device sda1) in cleanup_transaction:1832:
>>> errno=-5 IO failure
>>> [  409.185049] BTRFS info (device sda1): delayed_refs has NO entry
>>> [  409.310199] BTRFS info (device sda1): scrub: not finished on devid
>>> 3 with status: -125
>>> [  409.664880] BTRFS info (device sda1): scrub: not finished on devid
>>> 4 with status: -125
>>> [  410.106791] BTRFS info (device sda1): scrub: not finished on devid
>>> 1 with status: -125
>>> [  411.268585] BTRFS warning (device sda1): failed setting block group ro: -30
>>> [  411.268594] BTRFS info (device sda1): scrub: not finished on devid
>>> 2 with status: -30
>>> [  411.268605] BTRFS info (device sda1): delayed_refs has NO entry
>>> ```
>>>
>>> Information requested here
>>> (https://btrfs.wiki.kernel.org/index.php/Btrfs_mailing_list):
>>>
>>> ```
>>>  $ uname -a
>>> Linux thanos 5.4.77 #1-NixOS SMP Tue Nov 10 20:13:20 UTC 2020 armv7l GNU/Linux
>>>
>>>  $ btrfs --version
>>> btrfs-progs v5.7
>>>
>>>  $ sudo btrfs fi show
>>> Label: none  uuid: b8f4ad49-29c8-4d19-a886-cef9c487f124
>>>         Total devices 4 FS bytes used 10.26TiB
>>>         devid    1 size 3.64TiB used 2.40TiB path /dev/sda1
>>>         devid    2 size 3.64TiB used 2.40TiB path /dev/sdc1
>>>         devid    3 size 9.09TiB used 7.86TiB path /dev/sdd1
>>>         devid    4 size 9.09TiB used 7.86TiB path /dev/sdb1
>>>
>>> Label: none  uuid: d02a3067-0a23-4c1f-96ac-80dbc26622f2
>>>         Total devices 1 FS bytes used 116.35MiB
>>>         devid    1 size 399.82MiB used 224.00MiB path /dev/sda2
>>>
>>>  $ sudo btrfs fi df /
>>> Data, RAID1: total=10.25TiB, used=10.24TiB
>>> System, RAID1: total=64.00MiB, used=1.45MiB
>>> Metadata, RAID1: total=18.00GiB, used=17.19GiB
>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>> ```
>>>
>>> Thanks to demfloro and multicore on #btrfs for prompting this email.
>>>
>>> Best wishes,
>>> Joe
>>>
>>

[-- Attachment #1.1.2: debug.diff --]
[-- Type: text/x-patch, Size: 650 bytes --]

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index cd65ef7c7c3f..ebf36f616ebf 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -528,8 +528,12 @@ static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page)
 	 * Please do not consolidate these warnings into a single if.
 	 * It is useful to know what went wrong.
 	 */
-	if (WARN_ON(found_start != start))
+	if (WARN_ON(found_start != start)) {
+		btrfs_warn(fs_info,
+		"page_start and eb_start mismatch, eb_start=%llu page_start=%llu",
+			found_start, start);
 		return -EUCLEAN;
+	}
 	if (WARN_ON(!PageUptodate(page)))
 		return -EUCLEAN;
 

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: btrfs crash on armv7
  2020-11-26 11:05     ` Qu Wenruo
@ 2020-11-27 15:15       ` Joe Hermaszewski
  2020-11-28  0:45         ` Qu Wenruo
  0 siblings, 1 reply; 14+ messages in thread
From: Joe Hermaszewski @ 2020-11-27 15:15 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

Hi Qu,

Thanks for the patch. I recompiled the kernel ran the scrub and your
patch worked as expected, here is the log:

```
[  337.365239] BTRFS info (device sda1): scrub: started on devid 2
[  337.366283] BTRFS info (device sda1): scrub: started on devid 1
[  337.402822] BTRFS info (device sda1): scrub: started on devid 3
[  337.411944] BTRFS info (device sda1): scrub: started on devid 4
[  471.997496] ------------[ cut here ]------------
[  471.997614] WARNING: CPU: 0 PID: 218 at fs/btrfs/disk-io.c:531
btree_csum_one_bio+0x22c/0x278 [btrfs]
[  471.997616] Modules linked in: cfg80211 rfkill 8021q ip6table_nat
iptable_nat nf_nat ftdi_sio phy_generic usbserial xt_conntrack
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_rpfilter ipt_rpfilter
ip6table_raw uio_pdrv_genirq iptable_raw uio xt_pkttype nf_log_ipv6
nf_log_ipv4 nf_log_common xt_LOG xt_tcpudp ip6table_filter ip6_tables
iptable_filter sch_fq_codel loop tun tap macvlan bridge stp llc lm75
ip_tables x_tables autofs4 dm_mod dax btrfs libcrc32c xor raid6_pq
[  471.997666] CPU: 0 PID: 218 Comm: btrfs-transacti Not tainted 5.4.78 #1-NixOS
[  471.997668] Hardware name: Marvell Armada 380/385 (Device Tree)
[  471.997669] Backtrace:
[  471.997680] [<c010f698>] (dump_backtrace) from [<c010f938>]
(show_stack+0x20/0x24)
[  471.997684]  r7:00000213 r6:600c0013 r5:00000000 r4:c0f8c044
[  471.997691] [<c010f918>] (show_stack) from [<c0a1b848>]
(dump_stack+0x98/0xac)
[  471.997696] [<c0a1b7b0>] (dump_stack) from [<c012a998>] (__warn+0xe0/0x108)
[  471.997700]  r7:00000213 r6:bf058f4c r5:00000009 r4:bf120990
[  471.997704] [<c012a8b8>] (__warn) from [<c012ad24>]
(warn_slowpath_fmt+0x74/0xc4)
[  471.997707]  r7:00000213 r6:bf120990 r5:00000000 r4:e1822000
[  471.997765] [<c012acb4>] (warn_slowpath_fmt) from [<bf058f4c>]
(btree_csum_one_bio+0x22c/0x278 [btrfs])
[  471.997770]  r9:00000001 r8:c10decb8 r7:e1822000 r6:ed66b000
r5:00001000 r4:4fd00000
[  471.997872] [<bf058d20>] (btree_csum_one_bio [btrfs]) from
[<bf059f84>] (btree_submit_bio_hook+0xe8/0x100 [btrfs])
[  471.997877]  r10:e2197b48 r9:ec845fc0 r8:ec845f70 r7:ed66b000
r6:00000000 r5:e2b74af0
[  471.997879]  r4:bf059e9c
[  471.997981] [<bf059e9c>] (btree_submit_bio_hook [btrfs]) from
[<bf08b1ac>] (submit_one_bio+0x44/0x5c [btrfs])
[  471.997985]  r7:ef3724d4 r6:e1823cac r5:00000000 r4:bf059e9c
[  471.998087] [<bf08b168>] (submit_one_bio [btrfs]) from [<bf096664>]
(btree_write_cache_pages+0x380/0x408 [btrfs])
[  471.998090]  r5:00000000 r4:00000000
[  471.998193] [<bf0962e4>] (btree_write_cache_pages [btrfs]) from
[<bf0590b8>] (btree_writepages+0x7c/0x84 [btrfs])
[  471.998197]  r10:00000001 r9:4fd00000 r8:c0280d14 r7:e1822000
r6:e1823d80 r5:ec845f70
[  471.998199]  r4:e1823d80
[  471.998255] [<bf05903c>] (btree_writepages [btrfs]) from
[<c02847c8>] (do_writepages+0x58/0xf4)
[  471.998258]  r5:ec845f70 r4:ec845e68
[  471.998266] [<c0284770>] (do_writepages) from [<c0278cb0>]
(__filemap_fdatawrite_range+0xf8/0x130)
[  471.998270]  r8:ec845f70 r7:00001000 r6:4fd0bfff r5:e1822000 r4:ec845e68
[  471.998275] [<c0278bb8>] (__filemap_fdatawrite_range) from
[<c0278e38>] (filemap_fdatawrite_range+0x2c/0x34)
[  471.998279]  r10:ec845f70 r9:00001000 r8:4fd0bfff r7:e1823e4c
r6:c1955e28 r5:00001000
[  471.998281]  r4:4fd0bfff
[  471.998335] [<c0278e0c>] (filemap_fdatawrite_range) from
[<bf060544>] (btrfs_write_marked_extents+0x9c/0x1b0 [btrfs])
[  471.998338]  r5:00000001 r4:00000000
[  471.998441] [<bf0604a8>] (btrfs_write_marked_extents [btrfs]) from
[<bf0606f0>] (btrfs_write_and_wait_transaction+0x54/0xa4 [btrfs])
[  471.998446]  r10:e1822000 r9:ed66b010 r8:ed66b000 r7:c1955e28
r6:ed66b000 r5:e1822000
[  471.998447]  r4:eddc4f30
[  471.998551] [<bf06069c>] (btrfs_write_and_wait_transaction [btrfs])
from [<bf062428>] (btrfs_commit_transaction+0x75c/0xc94 [btrfs])
[  471.998555]  r8:ed66b418 r7:c1955e00 r6:ed66b000 r5:eddc4f30 r4:00000000
[  471.998658] [<bf061ccc>] (btrfs_commit_transaction [btrfs]) from
[<bf05cd98>] (transaction_kthread+0x19c/0x1e0 [btrfs])
[  471.998663]  r10:ed66b28c r9:00000000 r8:001aab57 r7:00000064
r6:ed66b414 r5:00000bb8
[  471.998664]  r4:ed66b000
[  471.998719] [<bf05cbfc>] (transaction_kthread [btrfs]) from
[<c014facc>] (kthread+0x170/0x174)
[  471.998724]  r10:ed765bfc r9:bf05cbfc r8:ed56a000 r7:e1822000
r6:00000000 r5:ed5c8600
[  471.998726]  r4:ed5c8740
[  471.998731] [<c014f95c>] (kthread) from [<c01010e8>]
(ret_from_fork+0x14/0x2c)
[  471.998733] Exception stack(0xe1823fb0 to 0xe1823ff8)
[  471.998737] 3fa0:                                     00000000
00000000 00000000 00000000
[  471.998741] 3fc0: 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000
[  471.998745] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[  471.998749]  r10:00000000 r9:00000000 r8:00000000 r7:00000000
r6:00000000 r5:c014f95c
[  471.998751]  r4:ed5c8600
[  471.998820] ---[ end trace 6a9fb9f547659a4d ]---
[  471.998855] BTRFS warning (device sda1): page_start and eb_start
mismatch, eb_start=17593525075968 page_start=1339031552
[  472.015514] BTRFS: error (device sda1) in
btrfs_commit_transaction:2279: errno=-5 IO failure (Error while
writing out transaction)
[  472.027321] BTRFS info (device sda1): forced readonly
[  472.027326] BTRFS warning (device sda1): Skipping commit of aborted
transaction.
[  472.027330] BTRFS: error (device sda1) in cleanup_transaction:1832:
errno=-5 IO failure
[  472.035378] BTRFS info (device sda1): delayed_refs has NO entry
[  472.039102] BTRFS warning (device sda1): failed setting block group ro: -30
[  472.039107] BTRFS info (device sda1): scrub: not finished on devid
3 with status: -30
[  472.039114] BTRFS info (device sda1): scrub: not finished on devid
4 with status: -30
[  472.039227] BTRFS warning (device sda1): failed setting block group ro: -30
[  472.039230] BTRFS warning (device sda1): failed setting block group ro: -30
[  472.039237] BTRFS info (device sda1): scrub: not finished on devid
2 with status: -30
[  472.039243] BTRFS info (device sda1): scrub: not finished on devid
1 with status: -30
```

To rob you of the suspense:

printf "0x%016x\n" (17593525075968 - 1339031552)
0x0000100000000000

Very surprising considering that this machine has ECC ram!

What would you recommend as a fix?

Best,
Joe

On Thu, Nov 26, 2020 at 7:05 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2020/11/26 下午6:53, Joe Hermaszewski wrote:
> > Hi Qu,
> >
> > This is the output of btrfs check with the device unmounted:
> >
> > ```
> > [root@nixos:~]# btrfs check --readonly /dev/sda1
> > Opening filesystem to check...
> > Checking filesystem on /dev/sda1
> > UUID: b8f4ad49-29c8-4d19-a886-cef9c487f124
> > [1/7] checking root items
> > [2/7] checking extents
> > [3/7] checking free space cache
> > [4/7] checking fs roots
> > root 294 inode 24665 errors 100, file extent discount
> > Found file extent holes:
> >         start: 3709534208, len: 163840
> > root 294 inode 406548 errors 100, file extent discount
> > Found file extent holes:
> >         start: 0, len: 99450880
> > ERROR: errors found in fs roots
> > found 11279251902464 bytes used, error(s) found
> > total csum bytes: 10935906504
> > total tree bytes: 18455740416
> > total fs tree bytes: 5798936576
> > total extent tree bytes: 532250624
> > btree space waste bytes: 2308849707
> > file data blocks allocated: 17243418611712
> >  referenced 14210256932864
> > ```
> >
> > Looks the same as before...
>
> Then it means, at least your metadata is pretty safe. Nothing really
> needs to be worried (so far).
>
> This also means, the problem only happens at runtime, not something on-disk.
>
> If you are OK to re-compile the kernel, and the warning message is
> always the same at line number 531 of disk-io.c, then I recommend the
> attached debug diff
>
> The diff will output the found_start (from eb) and page_start (from page).
> If it's memory bitflip, we should have a pretty signature diff at the
> output.
> If not memory bitflip, then we really have a bug to dig.
>
>
> >
> > I'm afraid I don't have the precise details of the transid error
> > saved. I think I meant to write earlier that the message that comes up
> > now and then is the "errno=-5 IO failure (Error while writing out
> > transaction)". I've not seen the transid mismatch one since the first
> > incident, sorry for that slip-up.
> >
> > I'll try and attach to a known good system, will I have to attach the
> > whole array because I don't think I have that many enclosures for the
> > drives.
>
> You only need to mount the fs you're seeing the problem.
> But if that fs is on multiple devices, then I guess you need to mount
> them all.
>
> Thanks,
> Qu
> >
> > I'll also run a memtest on this box.
> >
> > Thank you for the helpful response.
> >
> > Best,
> > Joe
> >
> > On Thu, Nov 26, 2020 at 2:15 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>
> >>
> >>
> >> On 2020/11/25 下午11:28, Joe Hermaszewski wrote:
> >>> Hi,
> >>>
> >>> I have a arm32 machine with four drives with a btrfs fs spanning then in RAID1.
> >>> The filesystem has started behaving badly recently and I'm writing to:
> >>>
> >>> - Solicit advice on how best to get the system back to a stable state
> >>> - Report a potential bug
> >>>
> >>> ## What happened:
> >>>
> >>> A couple of days ago I could no longer ssh into it, and on the serial
> >>> connection there were heaps of messages (and new ones appearing with great
> >>> frequency) along the lines of: `parent transid verify failed on blah... wanted
> >>> x got y`.
> >>>
> >>> Although I don't have a record of the precise messages I do remember that there
> >>> was a difference of `15` between x and y.
> >>
> >> This normally can be a sign of unreliable HDD, which lies on FLUSH,
> >> killing metadata COW.
> >>
> >> But, your btrfs check doesn't report the same problem, thus I'm confused.
> >>
> >> Would you please try to run a "btrfs check --readonly /dev/sda1" with
> >> the fs unmounted?
> >>
> >> And, would you provide the full dmesg of that mismatch?
> >> The reason for the exact number is, I'm suspecting hardware memory
> >> corruption.
> >>
> >>>
> >>> I power-cycled system and started a scrub after it rebooted, this was
> >>> interrupted quite promptly by several more errors in btrfs, and the disk
> >>> remounted RO.
> >>>
> >>> Every now and then in the kernel log I get messages like:
> >>>
> >>> `parent transid verify failed on blah... wanted x got y`
> >>
> >> Not showing up in the gist dmesg though.
> >>
> >>>
> >>> ## Important info
> >>>
> >>> The dev stats are all zero.
> >>>
> >>> Here are the outputs of some btrfs commands, dmesg and the kernel log from the
> >>> previous two boots: https://gist.github.com/b1beab134403c5047e2efbceb98985f9
> >>>
> >>> The "cut here" portion of the kernel log is as follows
> >>>
> >>> ```
> >>> [  409.158097] ------------[ cut here ]------------
> >>> [  409.158205] WARNING: CPU: 1 PID: 217 at fs/btrfs/disk-io.c:531
> >>> btree_csum_one_bio+0x208/0x248 [btrfs]
> >>
> >> The line number shows, the tree bytenr doesn't match with the one in memory.
> >> This is really too rare to be seen, especially when we have no other
> >> error reported from btrfs (at least in the gist)
> >>
> >> Since there is no other problems showing up in the gist, it means it
> >> could be a bit flip, considering the magic generation gap you mention is
> >> 15, I'm more suspicious about memory bit flip.
> >>
> >> If you can provide the full parent transid mismatch error message, it
> >> may help to determine the possibility of hardware memory corruption.
> >>
> >> Thanks,
> >> Qu
> >>
> >>> [  409.158208] Modules linked in: cfg80211 rfkill 8021q ip6table_nat
> >>> iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6
> >>> nf_defrag_ipv4 ip6t_rpfilter ipt_rpfilter ip6table_raw iptable_raw
> >>> xt_pkttype nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG xt_tcpudp
> >>> ftdi_sio usbserial phy_generic uio_pdrv_genirq uio ip6table_filter
> >>> ip6_tables iptable_filter sch_fq_codel loop tun tap macvlan bridge stp
> >>> llc lm75 ip_tables x_tables autofs4 dm_mod dax btrfs libcrc32c xor
> >>> raid6_pq
> >>> [  409.158258] CPU: 1 PID: 217 Comm: btrfs-transacti Not tainted 5.4.77 #1-NixOS
> >>> [  409.158260] Hardware name: Marvell Armada 380/385 (Device Tree)
> >>> [  409.158261] Backtrace:
> >>> [  409.158272] [<c010f698>] (dump_backtrace) from [<c010f938>]
> >>> (show_stack+0x20/0x24)
> >>> [  409.158277]  r7:00000213 r6:600f0013 r5:00000000 r4:c0f8c044
> >>> [  409.158283] [<c010f918>] (show_stack) from [<c0a1b388>]
> >>> (dump_stack+0x98/0xac)
> >>> [  409.158288] [<c0a1b2f0>] (dump_stack) from [<c012a998>] (__warn+0xe0/0x108)
> >>> [  409.158292]  r7:00000213 r6:bf058ec8 r5:00000009 r4:bf120990
> >>> [  409.158296] [<c012a8b8>] (__warn) from [<c012ad24>]
> >>> (warn_slowpath_fmt+0x74/0xc4)
> >>> [  409.158300]  r7:00000213 r6:bf120990 r5:00000000 r4:e2392000
> >>> [  409.158358] [<c012acb4>] (warn_slowpath_fmt) from [<bf058ec8>]
> >>> (btree_csum_one_bio+0x208/0x248 [btrfs])
> >>> [  409.158363]  r9:e277abe0 r8:00000001 r7:e2392000 r6:ea3d17f0
> >>> r5:00000000 r4:eefd2d3c
> >>> [  409.158465] [<bf058cc0>] (btree_csum_one_bio [btrfs]) from
> >>> [<bf059ef4>] (btree_submit_bio_hook+0xe8/0x100 [btrfs])
> >>> [  409.158470]  r10:e32ce170 r9:ecc45fc0 r8:ecc45f70 r7:ec82b000
> >>> r6:00000000 r5:ea3d17f0
> >>> [  409.158472]  r4:bf059e0c
> >>> [  409.158575] [<bf059e0c>] (btree_submit_bio_hook [btrfs]) from
> >>> [<bf08b11c>] (submit_one_bio+0x44/0x5c [btrfs])
> >>> [  409.158578]  r7:ef36c048 r6:e2393cac r5:00000000 r4:bf059e0c
> >>> [  409.158683] [<bf08b0d8>] (submit_one_bio [btrfs]) from [<bf0965d4>]
> >>> (btree_write_cache_pages+0x380/0x408 [btrfs])
> >>> [  409.158686]  r5:00000000 r4:00000000
> >>> [  409.158788] [<bf096254>] (btree_write_cache_pages [btrfs]) from
> >>> [<bf059028>] (btree_writepages+0x7c/0x84 [btrfs])
> >>> [  409.158793]  r10:00000001 r9:4fd00000 r8:c0280c94 r7:e2392000
> >>> r6:e2393d80 r5:ecc45f70
> >>> [  409.158794]  r4:e2393d80
> >>> [  409.158850] [<bf058fac>] (btree_writepages [btrfs]) from
> >>> [<c0284748>] (do_writepages+0x58/0xf4)
> >>> [  409.158852]  r5:ecc45f70 r4:ecc45e68
> >>> [  409.158860] [<c02846f0>] (do_writepages) from [<c0278c30>]
> >>> (__filemap_fdatawrite_range+0xf8/0x130)
> >>> [  409.158864]  r8:ecc45f70 r7:00001000 r6:4fd0bfff r5:e2392000 r4:ecc45e68
> >>> [  409.158869] [<c0278b38>] (__filemap_fdatawrite_range) from
> >>> [<c0278db8>] (filemap_fdatawrite_range+0x2c/0x34)
> >>> [  409.158874]  r10:ecc45f70 r9:00001000 r8:4fd0bfff r7:e2393e4c
> >>> r6:c73bc628 r5:00001000
> >>> [  409.158875]  r4:4fd0bfff
> >>> [  409.158929] [<c0278d8c>] (filemap_fdatawrite_range) from
> >>> [<bf0604b4>] (btrfs_write_marked_extents+0x9c/0x1b0 [btrfs])
> >>> [  409.158931]  r5:00000001 r4:00000000
> >>> [  409.159033] [<bf060418>] (btrfs_write_marked_extents [btrfs]) from
> >>> [<bf060660>] (btrfs_write_and_wait_transaction+0x54/0xa4 [btrfs])
> >>> [  409.159038]  r10:e2392000 r9:ec82b010 r8:ec82b000 r7:c73bc628
> >>> r6:ec82b000 r5:e2392000
> >>> [  409.159040]  r4:c8b81ca8
> >>> [  409.159141] [<bf06060c>] (btrfs_write_and_wait_transaction [btrfs])
> >>> from [<bf062398>] (btrfs_commit_transaction+0x75c/0xc94 [btrfs])
> >>> [  409.159145]  r8:ec82b418 r7:c73bc600 r6:ec82b000 r5:c8b81ca8 r4:00000000
> >>> [  409.159248] [<bf061c3c>] (btrfs_commit_transaction [btrfs]) from
> >>> [<bf05cd08>] (transaction_kthread+0x19c/0x1e0 [btrfs])
> >>> [  409.159253]  r10:ec82b28c r9:00000000 r8:001aaafa r7:00000064
> >>> r6:ec82b414 r5:00000bb8
> >>> [  409.159254]  r4:ec82b000
> >>> [  409.159309] [<bf05cb6c>] (transaction_kthread [btrfs]) from
> >>> [<c014fabc>] (kthread+0x170/0x174)
> >>> [  409.159313]  r10:eca87bfc r9:bf05cb6c r8:ed619000 r7:e2392000
> >>> r6:00000000 r5:ed5ee700
> >>> [  409.159315]  r4:ed5ee1c0
> >>> [  409.159320] [<c014f94c>] (kthread) from [<c01010e8>]
> >>> (ret_from_fork+0x14/0x2c)
> >>> [  409.159322] Exception stack(0xe2393fb0 to 0xe2393ff8)
> >>> [  409.159326] 3fa0:                                     00000000
> >>> 00000000 00000000 00000000
> >>> [  409.159331] 3fc0: 00000000 00000000 00000000 00000000 00000000
> >>> 00000000 00000000 00000000
> >>> [  409.159334] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
> >>> [  409.159338]  r10:00000000 r9:00000000 r8:00000000 r7:00000000
> >>> r6:00000000 r5:c014f94c
> >>> [  409.159340]  r4:ed5ee700
> >>> [  409.159342] ---[ end trace eea59ced12fa7859 ]---
> >>> [  409.165084] BTRFS: error (device sda1) in
> >>> btrfs_commit_transaction:2279: errno=-5 IO failure (Error while
> >>> writing out transaction)
> >>> [  409.176920] BTRFS info (device sda1): forced readonly
> >>> [  409.176947] BTRFS warning (device sda1): Skipping commit of aborted
> >>> transaction.
> >>> [  409.176952] BTRFS: error (device sda1) in cleanup_transaction:1832:
> >>> errno=-5 IO failure
> >>> [  409.185049] BTRFS info (device sda1): delayed_refs has NO entry
> >>> [  409.310199] BTRFS info (device sda1): scrub: not finished on devid
> >>> 3 with status: -125
> >>> [  409.664880] BTRFS info (device sda1): scrub: not finished on devid
> >>> 4 with status: -125
> >>> [  410.106791] BTRFS info (device sda1): scrub: not finished on devid
> >>> 1 with status: -125
> >>> [  411.268585] BTRFS warning (device sda1): failed setting block group ro: -30
> >>> [  411.268594] BTRFS info (device sda1): scrub: not finished on devid
> >>> 2 with status: -30
> >>> [  411.268605] BTRFS info (device sda1): delayed_refs has NO entry
> >>> ```
> >>>
> >>> Information requested here
> >>> (https://btrfs.wiki.kernel.org/index.php/Btrfs_mailing_list):
> >>>
> >>> ```
> >>>  $ uname -a
> >>> Linux thanos 5.4.77 #1-NixOS SMP Tue Nov 10 20:13:20 UTC 2020 armv7l GNU/Linux
> >>>
> >>>  $ btrfs --version
> >>> btrfs-progs v5.7
> >>>
> >>>  $ sudo btrfs fi show
> >>> Label: none  uuid: b8f4ad49-29c8-4d19-a886-cef9c487f124
> >>>         Total devices 4 FS bytes used 10.26TiB
> >>>         devid    1 size 3.64TiB used 2.40TiB path /dev/sda1
> >>>         devid    2 size 3.64TiB used 2.40TiB path /dev/sdc1
> >>>         devid    3 size 9.09TiB used 7.86TiB path /dev/sdd1
> >>>         devid    4 size 9.09TiB used 7.86TiB path /dev/sdb1
> >>>
> >>> Label: none  uuid: d02a3067-0a23-4c1f-96ac-80dbc26622f2
> >>>         Total devices 1 FS bytes used 116.35MiB
> >>>         devid    1 size 399.82MiB used 224.00MiB path /dev/sda2
> >>>
> >>>  $ sudo btrfs fi df /
> >>> Data, RAID1: total=10.25TiB, used=10.24TiB
> >>> System, RAID1: total=64.00MiB, used=1.45MiB
> >>> Metadata, RAID1: total=18.00GiB, used=17.19GiB
> >>> GlobalReserve, single: total=512.00MiB, used=0.00B
> >>> ```
> >>>
> >>> Thanks to demfloro and multicore on #btrfs for prompting this email.
> >>>
> >>> Best wishes,
> >>> Joe
> >>>
> >>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: btrfs crash on armv7
  2020-11-27 15:15       ` Joe Hermaszewski
@ 2020-11-28  0:45         ` Qu Wenruo
  2020-12-19 10:35           ` Joe Hermaszewski
  0 siblings, 1 reply; 14+ messages in thread
From: Qu Wenruo @ 2020-11-28  0:45 UTC (permalink / raw)
  To: Joe Hermaszewski; +Cc: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 19817 bytes --]



On 2020/11/27 下午11:15, Joe Hermaszewski wrote:
> Hi Qu,
> 
> Thanks for the patch. I recompiled the kernel ran the scrub and your
> patch worked as expected, here is the log:
> 
> ```
> [  337.365239] BTRFS info (device sda1): scrub: started on devid 2
> [  337.366283] BTRFS info (device sda1): scrub: started on devid 1
> [  337.402822] BTRFS info (device sda1): scrub: started on devid 3
> [  337.411944] BTRFS info (device sda1): scrub: started on devid 4
> [  471.997496] ------------[ cut here ]------------
> [  471.997614] WARNING: CPU: 0 PID: 218 at fs/btrfs/disk-io.c:531
> btree_csum_one_bio+0x22c/0x278 [btrfs]
> [  471.997616] Modules linked in: cfg80211 rfkill 8021q ip6table_nat
> iptable_nat nf_nat ftdi_sio phy_generic usbserial xt_conntrack
> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_rpfilter ipt_rpfilter
> ip6table_raw uio_pdrv_genirq iptable_raw uio xt_pkttype nf_log_ipv6
> nf_log_ipv4 nf_log_common xt_LOG xt_tcpudp ip6table_filter ip6_tables
> iptable_filter sch_fq_codel loop tun tap macvlan bridge stp llc lm75
> ip_tables x_tables autofs4 dm_mod dax btrfs libcrc32c xor raid6_pq
> [  471.997666] CPU: 0 PID: 218 Comm: btrfs-transacti Not tainted 5.4.78 #1-NixOS
> [  471.997668] Hardware name: Marvell Armada 380/385 (Device Tree)
> [  471.997669] Backtrace:
> [  471.997680] [<c010f698>] (dump_backtrace) from [<c010f938>]
> (show_stack+0x20/0x24)
> [  471.997684]  r7:00000213 r6:600c0013 r5:00000000 r4:c0f8c044
> [  471.997691] [<c010f918>] (show_stack) from [<c0a1b848>]
> (dump_stack+0x98/0xac)
> [  471.997696] [<c0a1b7b0>] (dump_stack) from [<c012a998>] (__warn+0xe0/0x108)
> [  471.997700]  r7:00000213 r6:bf058f4c r5:00000009 r4:bf120990
> [  471.997704] [<c012a8b8>] (__warn) from [<c012ad24>]
> (warn_slowpath_fmt+0x74/0xc4)
> [  471.997707]  r7:00000213 r6:bf120990 r5:00000000 r4:e1822000
> [  471.997765] [<c012acb4>] (warn_slowpath_fmt) from [<bf058f4c>]
> (btree_csum_one_bio+0x22c/0x278 [btrfs])
> [  471.997770]  r9:00000001 r8:c10decb8 r7:e1822000 r6:ed66b000
> r5:00001000 r4:4fd00000
> [  471.997872] [<bf058d20>] (btree_csum_one_bio [btrfs]) from
> [<bf059f84>] (btree_submit_bio_hook+0xe8/0x100 [btrfs])
> [  471.997877]  r10:e2197b48 r9:ec845fc0 r8:ec845f70 r7:ed66b000
> r6:00000000 r5:e2b74af0
> [  471.997879]  r4:bf059e9c
> [  471.997981] [<bf059e9c>] (btree_submit_bio_hook [btrfs]) from
> [<bf08b1ac>] (submit_one_bio+0x44/0x5c [btrfs])
> [  471.997985]  r7:ef3724d4 r6:e1823cac r5:00000000 r4:bf059e9c
> [  471.998087] [<bf08b168>] (submit_one_bio [btrfs]) from [<bf096664>]
> (btree_write_cache_pages+0x380/0x408 [btrfs])
> [  471.998090]  r5:00000000 r4:00000000
> [  471.998193] [<bf0962e4>] (btree_write_cache_pages [btrfs]) from
> [<bf0590b8>] (btree_writepages+0x7c/0x84 [btrfs])
> [  471.998197]  r10:00000001 r9:4fd00000 r8:c0280d14 r7:e1822000
> r6:e1823d80 r5:ec845f70
> [  471.998199]  r4:e1823d80
> [  471.998255] [<bf05903c>] (btree_writepages [btrfs]) from
> [<c02847c8>] (do_writepages+0x58/0xf4)
> [  471.998258]  r5:ec845f70 r4:ec845e68
> [  471.998266] [<c0284770>] (do_writepages) from [<c0278cb0>]
> (__filemap_fdatawrite_range+0xf8/0x130)
> [  471.998270]  r8:ec845f70 r7:00001000 r6:4fd0bfff r5:e1822000 r4:ec845e68
> [  471.998275] [<c0278bb8>] (__filemap_fdatawrite_range) from
> [<c0278e38>] (filemap_fdatawrite_range+0x2c/0x34)
> [  471.998279]  r10:ec845f70 r9:00001000 r8:4fd0bfff r7:e1823e4c
> r6:c1955e28 r5:00001000
> [  471.998281]  r4:4fd0bfff
> [  471.998335] [<c0278e0c>] (filemap_fdatawrite_range) from
> [<bf060544>] (btrfs_write_marked_extents+0x9c/0x1b0 [btrfs])
> [  471.998338]  r5:00000001 r4:00000000
> [  471.998441] [<bf0604a8>] (btrfs_write_marked_extents [btrfs]) from
> [<bf0606f0>] (btrfs_write_and_wait_transaction+0x54/0xa4 [btrfs])
> [  471.998446]  r10:e1822000 r9:ed66b010 r8:ed66b000 r7:c1955e28
> r6:ed66b000 r5:e1822000
> [  471.998447]  r4:eddc4f30
> [  471.998551] [<bf06069c>] (btrfs_write_and_wait_transaction [btrfs])
> from [<bf062428>] (btrfs_commit_transaction+0x75c/0xc94 [btrfs])
> [  471.998555]  r8:ed66b418 r7:c1955e00 r6:ed66b000 r5:eddc4f30 r4:00000000
> [  471.998658] [<bf061ccc>] (btrfs_commit_transaction [btrfs]) from
> [<bf05cd98>] (transaction_kthread+0x19c/0x1e0 [btrfs])
> [  471.998663]  r10:ed66b28c r9:00000000 r8:001aab57 r7:00000064
> r6:ed66b414 r5:00000bb8
> [  471.998664]  r4:ed66b000
> [  471.998719] [<bf05cbfc>] (transaction_kthread [btrfs]) from
> [<c014facc>] (kthread+0x170/0x174)
> [  471.998724]  r10:ed765bfc r9:bf05cbfc r8:ed56a000 r7:e1822000
> r6:00000000 r5:ed5c8600
> [  471.998726]  r4:ed5c8740
> [  471.998731] [<c014f95c>] (kthread) from [<c01010e8>]
> (ret_from_fork+0x14/0x2c)
> [  471.998733] Exception stack(0xe1823fb0 to 0xe1823ff8)
> [  471.998737] 3fa0:                                     00000000
> 00000000 00000000 00000000
> [  471.998741] 3fc0: 00000000 00000000 00000000 00000000 00000000
> 00000000 00000000 00000000
> [  471.998745] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
> [  471.998749]  r10:00000000 r9:00000000 r8:00000000 r7:00000000
> r6:00000000 r5:c014f95c
> [  471.998751]  r4:ed5c8600
> [  471.998820] ---[ end trace 6a9fb9f547659a4d ]---
> [  471.998855] BTRFS warning (device sda1): page_start and eb_start
> mismatch, eb_start=17593525075968 page_start=1339031552
> [  472.015514] BTRFS: error (device sda1) in
> btrfs_commit_transaction:2279: errno=-5 IO failure (Error while
> writing out transaction)
> [  472.027321] BTRFS info (device sda1): forced readonly
> [  472.027326] BTRFS warning (device sda1): Skipping commit of aborted
> transaction.
> [  472.027330] BTRFS: error (device sda1) in cleanup_transaction:1832:
> errno=-5 IO failure
> [  472.035378] BTRFS info (device sda1): delayed_refs has NO entry
> [  472.039102] BTRFS warning (device sda1): failed setting block group ro: -30
> [  472.039107] BTRFS info (device sda1): scrub: not finished on devid
> 3 with status: -30
> [  472.039114] BTRFS info (device sda1): scrub: not finished on devid
> 4 with status: -30
> [  472.039227] BTRFS warning (device sda1): failed setting block group ro: -30
> [  472.039230] BTRFS warning (device sda1): failed setting block group ro: -30
> [  472.039237] BTRFS info (device sda1): scrub: not finished on devid
> 2 with status: -30
> [  472.039243] BTRFS info (device sda1): scrub: not finished on devid
> 1 with status: -30
> ```
> 
> To rob you of the suspense:
> 
> printf "0x%016x\n" (17593525075968 - 1339031552)
> 0x0000100000000000
> 
> Very surprising considering that this machine has ECC ram!

ARMv7 with ECC, and we still had such bitflip!?

I guess everything is possible in 2020.

> 
> What would you recommend as a fix?

It's a hardware problem, I really don't have any recommendation at all.

I guess you may want to request an RMA?

Thanks,
Qu

> 
> Best,
> Joe
> 
> On Thu, Nov 26, 2020 at 7:05 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2020/11/26 下午6:53, Joe Hermaszewski wrote:
>>> Hi Qu,
>>>
>>> This is the output of btrfs check with the device unmounted:
>>>
>>> ```
>>> [root@nixos:~]# btrfs check --readonly /dev/sda1
>>> Opening filesystem to check...
>>> Checking filesystem on /dev/sda1
>>> UUID: b8f4ad49-29c8-4d19-a886-cef9c487f124
>>> [1/7] checking root items
>>> [2/7] checking extents
>>> [3/7] checking free space cache
>>> [4/7] checking fs roots
>>> root 294 inode 24665 errors 100, file extent discount
>>> Found file extent holes:
>>>         start: 3709534208, len: 163840
>>> root 294 inode 406548 errors 100, file extent discount
>>> Found file extent holes:
>>>         start: 0, len: 99450880
>>> ERROR: errors found in fs roots
>>> found 11279251902464 bytes used, error(s) found
>>> total csum bytes: 10935906504
>>> total tree bytes: 18455740416
>>> total fs tree bytes: 5798936576
>>> total extent tree bytes: 532250624
>>> btree space waste bytes: 2308849707
>>> file data blocks allocated: 17243418611712
>>>  referenced 14210256932864
>>> ```
>>>
>>> Looks the same as before...
>>
>> Then it means, at least your metadata is pretty safe. Nothing really
>> needs to be worried (so far).
>>
>> This also means, the problem only happens at runtime, not something on-disk.
>>
>> If you are OK to re-compile the kernel, and the warning message is
>> always the same at line number 531 of disk-io.c, then I recommend the
>> attached debug diff
>>
>> The diff will output the found_start (from eb) and page_start (from page).
>> If it's memory bitflip, we should have a pretty signature diff at the
>> output.
>> If not memory bitflip, then we really have a bug to dig.
>>
>>
>>>
>>> I'm afraid I don't have the precise details of the transid error
>>> saved. I think I meant to write earlier that the message that comes up
>>> now and then is the "errno=-5 IO failure (Error while writing out
>>> transaction)". I've not seen the transid mismatch one since the first
>>> incident, sorry for that slip-up.
>>>
>>> I'll try and attach to a known good system, will I have to attach the
>>> whole array because I don't think I have that many enclosures for the
>>> drives.
>>
>> You only need to mount the fs you're seeing the problem.
>> But if that fs is on multiple devices, then I guess you need to mount
>> them all.
>>
>> Thanks,
>> Qu
>>>
>>> I'll also run a memtest on this box.
>>>
>>> Thank you for the helpful response.
>>>
>>> Best,
>>> Joe
>>>
>>> On Thu, Nov 26, 2020 at 2:15 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2020/11/25 下午11:28, Joe Hermaszewski wrote:
>>>>> Hi,
>>>>>
>>>>> I have a arm32 machine with four drives with a btrfs fs spanning then in RAID1.
>>>>> The filesystem has started behaving badly recently and I'm writing to:
>>>>>
>>>>> - Solicit advice on how best to get the system back to a stable state
>>>>> - Report a potential bug
>>>>>
>>>>> ## What happened:
>>>>>
>>>>> A couple of days ago I could no longer ssh into it, and on the serial
>>>>> connection there were heaps of messages (and new ones appearing with great
>>>>> frequency) along the lines of: `parent transid verify failed on blah... wanted
>>>>> x got y`.
>>>>>
>>>>> Although I don't have a record of the precise messages I do remember that there
>>>>> was a difference of `15` between x and y.
>>>>
>>>> This normally can be a sign of unreliable HDD, which lies on FLUSH,
>>>> killing metadata COW.
>>>>
>>>> But, your btrfs check doesn't report the same problem, thus I'm confused.
>>>>
>>>> Would you please try to run a "btrfs check --readonly /dev/sda1" with
>>>> the fs unmounted?
>>>>
>>>> And, would you provide the full dmesg of that mismatch?
>>>> The reason for the exact number is, I'm suspecting hardware memory
>>>> corruption.
>>>>
>>>>>
>>>>> I power-cycled system and started a scrub after it rebooted, this was
>>>>> interrupted quite promptly by several more errors in btrfs, and the disk
>>>>> remounted RO.
>>>>>
>>>>> Every now and then in the kernel log I get messages like:
>>>>>
>>>>> `parent transid verify failed on blah... wanted x got y`
>>>>
>>>> Not showing up in the gist dmesg though.
>>>>
>>>>>
>>>>> ## Important info
>>>>>
>>>>> The dev stats are all zero.
>>>>>
>>>>> Here are the outputs of some btrfs commands, dmesg and the kernel log from the
>>>>> previous two boots: https://gist.github.com/b1beab134403c5047e2efbceb98985f9
>>>>>
>>>>> The "cut here" portion of the kernel log is as follows
>>>>>
>>>>> ```
>>>>> [  409.158097] ------------[ cut here ]------------
>>>>> [  409.158205] WARNING: CPU: 1 PID: 217 at fs/btrfs/disk-io.c:531
>>>>> btree_csum_one_bio+0x208/0x248 [btrfs]
>>>>
>>>> The line number shows, the tree bytenr doesn't match with the one in memory.
>>>> This is really too rare to be seen, especially when we have no other
>>>> error reported from btrfs (at least in the gist)
>>>>
>>>> Since there is no other problems showing up in the gist, it means it
>>>> could be a bit flip, considering the magic generation gap you mention is
>>>> 15, I'm more suspicious about memory bit flip.
>>>>
>>>> If you can provide the full parent transid mismatch error message, it
>>>> may help to determine the possibility of hardware memory corruption.
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>> [  409.158208] Modules linked in: cfg80211 rfkill 8021q ip6table_nat
>>>>> iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6
>>>>> nf_defrag_ipv4 ip6t_rpfilter ipt_rpfilter ip6table_raw iptable_raw
>>>>> xt_pkttype nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG xt_tcpudp
>>>>> ftdi_sio usbserial phy_generic uio_pdrv_genirq uio ip6table_filter
>>>>> ip6_tables iptable_filter sch_fq_codel loop tun tap macvlan bridge stp
>>>>> llc lm75 ip_tables x_tables autofs4 dm_mod dax btrfs libcrc32c xor
>>>>> raid6_pq
>>>>> [  409.158258] CPU: 1 PID: 217 Comm: btrfs-transacti Not tainted 5.4.77 #1-NixOS
>>>>> [  409.158260] Hardware name: Marvell Armada 380/385 (Device Tree)
>>>>> [  409.158261] Backtrace:
>>>>> [  409.158272] [<c010f698>] (dump_backtrace) from [<c010f938>]
>>>>> (show_stack+0x20/0x24)
>>>>> [  409.158277]  r7:00000213 r6:600f0013 r5:00000000 r4:c0f8c044
>>>>> [  409.158283] [<c010f918>] (show_stack) from [<c0a1b388>]
>>>>> (dump_stack+0x98/0xac)
>>>>> [  409.158288] [<c0a1b2f0>] (dump_stack) from [<c012a998>] (__warn+0xe0/0x108)
>>>>> [  409.158292]  r7:00000213 r6:bf058ec8 r5:00000009 r4:bf120990
>>>>> [  409.158296] [<c012a8b8>] (__warn) from [<c012ad24>]
>>>>> (warn_slowpath_fmt+0x74/0xc4)
>>>>> [  409.158300]  r7:00000213 r6:bf120990 r5:00000000 r4:e2392000
>>>>> [  409.158358] [<c012acb4>] (warn_slowpath_fmt) from [<bf058ec8>]
>>>>> (btree_csum_one_bio+0x208/0x248 [btrfs])
>>>>> [  409.158363]  r9:e277abe0 r8:00000001 r7:e2392000 r6:ea3d17f0
>>>>> r5:00000000 r4:eefd2d3c
>>>>> [  409.158465] [<bf058cc0>] (btree_csum_one_bio [btrfs]) from
>>>>> [<bf059ef4>] (btree_submit_bio_hook+0xe8/0x100 [btrfs])
>>>>> [  409.158470]  r10:e32ce170 r9:ecc45fc0 r8:ecc45f70 r7:ec82b000
>>>>> r6:00000000 r5:ea3d17f0
>>>>> [  409.158472]  r4:bf059e0c
>>>>> [  409.158575] [<bf059e0c>] (btree_submit_bio_hook [btrfs]) from
>>>>> [<bf08b11c>] (submit_one_bio+0x44/0x5c [btrfs])
>>>>> [  409.158578]  r7:ef36c048 r6:e2393cac r5:00000000 r4:bf059e0c
>>>>> [  409.158683] [<bf08b0d8>] (submit_one_bio [btrfs]) from [<bf0965d4>]
>>>>> (btree_write_cache_pages+0x380/0x408 [btrfs])
>>>>> [  409.158686]  r5:00000000 r4:00000000
>>>>> [  409.158788] [<bf096254>] (btree_write_cache_pages [btrfs]) from
>>>>> [<bf059028>] (btree_writepages+0x7c/0x84 [btrfs])
>>>>> [  409.158793]  r10:00000001 r9:4fd00000 r8:c0280c94 r7:e2392000
>>>>> r6:e2393d80 r5:ecc45f70
>>>>> [  409.158794]  r4:e2393d80
>>>>> [  409.158850] [<bf058fac>] (btree_writepages [btrfs]) from
>>>>> [<c0284748>] (do_writepages+0x58/0xf4)
>>>>> [  409.158852]  r5:ecc45f70 r4:ecc45e68
>>>>> [  409.158860] [<c02846f0>] (do_writepages) from [<c0278c30>]
>>>>> (__filemap_fdatawrite_range+0xf8/0x130)
>>>>> [  409.158864]  r8:ecc45f70 r7:00001000 r6:4fd0bfff r5:e2392000 r4:ecc45e68
>>>>> [  409.158869] [<c0278b38>] (__filemap_fdatawrite_range) from
>>>>> [<c0278db8>] (filemap_fdatawrite_range+0x2c/0x34)
>>>>> [  409.158874]  r10:ecc45f70 r9:00001000 r8:4fd0bfff r7:e2393e4c
>>>>> r6:c73bc628 r5:00001000
>>>>> [  409.158875]  r4:4fd0bfff
>>>>> [  409.158929] [<c0278d8c>] (filemap_fdatawrite_range) from
>>>>> [<bf0604b4>] (btrfs_write_marked_extents+0x9c/0x1b0 [btrfs])
>>>>> [  409.158931]  r5:00000001 r4:00000000
>>>>> [  409.159033] [<bf060418>] (btrfs_write_marked_extents [btrfs]) from
>>>>> [<bf060660>] (btrfs_write_and_wait_transaction+0x54/0xa4 [btrfs])
>>>>> [  409.159038]  r10:e2392000 r9:ec82b010 r8:ec82b000 r7:c73bc628
>>>>> r6:ec82b000 r5:e2392000
>>>>> [  409.159040]  r4:c8b81ca8
>>>>> [  409.159141] [<bf06060c>] (btrfs_write_and_wait_transaction [btrfs])
>>>>> from [<bf062398>] (btrfs_commit_transaction+0x75c/0xc94 [btrfs])
>>>>> [  409.159145]  r8:ec82b418 r7:c73bc600 r6:ec82b000 r5:c8b81ca8 r4:00000000
>>>>> [  409.159248] [<bf061c3c>] (btrfs_commit_transaction [btrfs]) from
>>>>> [<bf05cd08>] (transaction_kthread+0x19c/0x1e0 [btrfs])
>>>>> [  409.159253]  r10:ec82b28c r9:00000000 r8:001aaafa r7:00000064
>>>>> r6:ec82b414 r5:00000bb8
>>>>> [  409.159254]  r4:ec82b000
>>>>> [  409.159309] [<bf05cb6c>] (transaction_kthread [btrfs]) from
>>>>> [<c014fabc>] (kthread+0x170/0x174)
>>>>> [  409.159313]  r10:eca87bfc r9:bf05cb6c r8:ed619000 r7:e2392000
>>>>> r6:00000000 r5:ed5ee700
>>>>> [  409.159315]  r4:ed5ee1c0
>>>>> [  409.159320] [<c014f94c>] (kthread) from [<c01010e8>]
>>>>> (ret_from_fork+0x14/0x2c)
>>>>> [  409.159322] Exception stack(0xe2393fb0 to 0xe2393ff8)
>>>>> [  409.159326] 3fa0:                                     00000000
>>>>> 00000000 00000000 00000000
>>>>> [  409.159331] 3fc0: 00000000 00000000 00000000 00000000 00000000
>>>>> 00000000 00000000 00000000
>>>>> [  409.159334] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
>>>>> [  409.159338]  r10:00000000 r9:00000000 r8:00000000 r7:00000000
>>>>> r6:00000000 r5:c014f94c
>>>>> [  409.159340]  r4:ed5ee700
>>>>> [  409.159342] ---[ end trace eea59ced12fa7859 ]---
>>>>> [  409.165084] BTRFS: error (device sda1) in
>>>>> btrfs_commit_transaction:2279: errno=-5 IO failure (Error while
>>>>> writing out transaction)
>>>>> [  409.176920] BTRFS info (device sda1): forced readonly
>>>>> [  409.176947] BTRFS warning (device sda1): Skipping commit of aborted
>>>>> transaction.
>>>>> [  409.176952] BTRFS: error (device sda1) in cleanup_transaction:1832:
>>>>> errno=-5 IO failure
>>>>> [  409.185049] BTRFS info (device sda1): delayed_refs has NO entry
>>>>> [  409.310199] BTRFS info (device sda1): scrub: not finished on devid
>>>>> 3 with status: -125
>>>>> [  409.664880] BTRFS info (device sda1): scrub: not finished on devid
>>>>> 4 with status: -125
>>>>> [  410.106791] BTRFS info (device sda1): scrub: not finished on devid
>>>>> 1 with status: -125
>>>>> [  411.268585] BTRFS warning (device sda1): failed setting block group ro: -30
>>>>> [  411.268594] BTRFS info (device sda1): scrub: not finished on devid
>>>>> 2 with status: -30
>>>>> [  411.268605] BTRFS info (device sda1): delayed_refs has NO entry
>>>>> ```
>>>>>
>>>>> Information requested here
>>>>> (https://btrfs.wiki.kernel.org/index.php/Btrfs_mailing_list):
>>>>>
>>>>> ```
>>>>>  $ uname -a
>>>>> Linux thanos 5.4.77 #1-NixOS SMP Tue Nov 10 20:13:20 UTC 2020 armv7l GNU/Linux
>>>>>
>>>>>  $ btrfs --version
>>>>> btrfs-progs v5.7
>>>>>
>>>>>  $ sudo btrfs fi show
>>>>> Label: none  uuid: b8f4ad49-29c8-4d19-a886-cef9c487f124
>>>>>         Total devices 4 FS bytes used 10.26TiB
>>>>>         devid    1 size 3.64TiB used 2.40TiB path /dev/sda1
>>>>>         devid    2 size 3.64TiB used 2.40TiB path /dev/sdc1
>>>>>         devid    3 size 9.09TiB used 7.86TiB path /dev/sdd1
>>>>>         devid    4 size 9.09TiB used 7.86TiB path /dev/sdb1
>>>>>
>>>>> Label: none  uuid: d02a3067-0a23-4c1f-96ac-80dbc26622f2
>>>>>         Total devices 1 FS bytes used 116.35MiB
>>>>>         devid    1 size 399.82MiB used 224.00MiB path /dev/sda2
>>>>>
>>>>>  $ sudo btrfs fi df /
>>>>> Data, RAID1: total=10.25TiB, used=10.24TiB
>>>>> System, RAID1: total=64.00MiB, used=1.45MiB
>>>>> Metadata, RAID1: total=18.00GiB, used=17.19GiB
>>>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>>> ```
>>>>>
>>>>> Thanks to demfloro and multicore on #btrfs for prompting this email.
>>>>>
>>>>> Best wishes,
>>>>> Joe
>>>>>
>>>>


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: btrfs crash on armv7
  2020-11-28  0:45         ` Qu Wenruo
@ 2020-12-19 10:35           ` Joe Hermaszewski
  2020-12-20  0:28             ` Qu Wenruo
  0 siblings, 1 reply; 14+ messages in thread
From: Joe Hermaszewski @ 2020-12-19 10:35 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

Ok, so I managed to get hold of a 64bit machine on which to run btrfs
check. `btrfs check` returns exactly the same output as the armv7 box
(no serious problems) which I suppose is good. `btrfs scrub` also
finds no problems. Boring Logs below.

What I don't quite understand is how the scrub problem on armv7l is so
reliable when it's not persisted on the disks, is the same physical
memory location being used for this breaking value, or is it perhaps a
specific pattern of data on the bus which causes this?

If it's the former, how easy would it be to find this broken location
and blacklist it? If it's the latter then I guess there's no hope but
to try replacing the psu/machine.

The machine survived a couple of days of memtester (on about 95% of
the RAM) and `7z b` with no problems, *shrug*

Best wishes and thanks for the generous help so far.
Joe

btrfs scrub aarch64:
```
[j@nixos:~]$ sudo btrfs scrub status -d /mnt
UUID:             b8f4ad49-29c8-4d19-a886-cef9c487f124
scrub device /dev/sda1 (id 1) history
Scrub started:    Fri Dec 18 14:24:30 2020
Status:           finished
Duration:         7:36:31
Total to scrub:   2.40TiB
Rate:             91.95MiB/s
Error summary:    no errors found
scrub device /dev/sdb1 (id 2) history
Scrub started:    Fri Dec 18 14:24:30 2020
Status:           finished
Duration:         7:12:51
Total to scrub:   2.40TiB
Rate:             96.90MiB/s
Error summary:    no errors found
scrub device /dev/sdd1 (id 3) history
Scrub started:    Fri Dec 18 14:24:30 2020
Status:           finished
Duration:         19:47:01
Total to scrub:   7.86TiB
Rate:             115.70MiB/s
Error summary:    no errors found
scrub device /dev/sdc1 (id 4) history
Scrub started:    Fri Dec 18 14:24:30 2020
Status:           finished
Duration:         19:46:38
Total to scrub:   7.86TiB
Rate:             115.74MiB/s
Error summary:    no errors found
```

btrfs check aarch64:
```
[nixos@nixos:/]$ sudo btrfs check --readonly /dev/sda1
Opening filesystem to check...
Checking filesystem on /dev/sda1
UUID: b8f4ad49-29c8-4d19-a886-cef9c487f124
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
[4/7] checking fs roots
root 294 inode 24665 errors 100, file extent discount
Found file extent holes:
        start: 3709534208, len: 163840
root 294 inode 406548 errors 100, file extent discount
Found file extent holes:
        start: 98729984, len: 720896
ERROR: errors found in fs roots
found 11280701063168 bytes used, error(s) found
total csum bytes: 10937464120
total tree bytes: 18538053632
total fs tree bytes: 5877579776
total extent tree bytes: 534052864
btree space waste bytes: 2316660292
file data blocks allocated: 17244587220992
 referenced 14211684794368%
```


On Sat, Nov 28, 2020 at 8:46 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2020/11/27 下午11:15, Joe Hermaszewski wrote:
> > Hi Qu,
> >
> > Thanks for the patch. I recompiled the kernel ran the scrub and your
> > patch worked as expected, here is the log:
> >
> > ```
> > [  337.365239] BTRFS info (device sda1): scrub: started on devid 2
> > [  337.366283] BTRFS info (device sda1): scrub: started on devid 1
> > [  337.402822] BTRFS info (device sda1): scrub: started on devid 3
> > [  337.411944] BTRFS info (device sda1): scrub: started on devid 4
> > [  471.997496] ------------[ cut here ]------------
> > [  471.997614] WARNING: CPU: 0 PID: 218 at fs/btrfs/disk-io.c:531
> > btree_csum_one_bio+0x22c/0x278 [btrfs]
> > [  471.997616] Modules linked in: cfg80211 rfkill 8021q ip6table_nat
> > iptable_nat nf_nat ftdi_sio phy_generic usbserial xt_conntrack
> > nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_rpfilter ipt_rpfilter
> > ip6table_raw uio_pdrv_genirq iptable_raw uio xt_pkttype nf_log_ipv6
> > nf_log_ipv4 nf_log_common xt_LOG xt_tcpudp ip6table_filter ip6_tables
> > iptable_filter sch_fq_codel loop tun tap macvlan bridge stp llc lm75
> > ip_tables x_tables autofs4 dm_mod dax btrfs libcrc32c xor raid6_pq
> > [  471.997666] CPU: 0 PID: 218 Comm: btrfs-transacti Not tainted 5.4.78 #1-NixOS
> > [  471.997668] Hardware name: Marvell Armada 380/385 (Device Tree)
> > [  471.997669] Backtrace:
> > [  471.997680] [<c010f698>] (dump_backtrace) from [<c010f938>]
> > (show_stack+0x20/0x24)
> > [  471.997684]  r7:00000213 r6:600c0013 r5:00000000 r4:c0f8c044
> > [  471.997691] [<c010f918>] (show_stack) from [<c0a1b848>]
> > (dump_stack+0x98/0xac)
> > [  471.997696] [<c0a1b7b0>] (dump_stack) from [<c012a998>] (__warn+0xe0/0x108)
> > [  471.997700]  r7:00000213 r6:bf058f4c r5:00000009 r4:bf120990
> > [  471.997704] [<c012a8b8>] (__warn) from [<c012ad24>]
> > (warn_slowpath_fmt+0x74/0xc4)
> > [  471.997707]  r7:00000213 r6:bf120990 r5:00000000 r4:e1822000
> > [  471.997765] [<c012acb4>] (warn_slowpath_fmt) from [<bf058f4c>]
> > (btree_csum_one_bio+0x22c/0x278 [btrfs])
> > [  471.997770]  r9:00000001 r8:c10decb8 r7:e1822000 r6:ed66b000
> > r5:00001000 r4:4fd00000
> > [  471.997872] [<bf058d20>] (btree_csum_one_bio [btrfs]) from
> > [<bf059f84>] (btree_submit_bio_hook+0xe8/0x100 [btrfs])
> > [  471.997877]  r10:e2197b48 r9:ec845fc0 r8:ec845f70 r7:ed66b000
> > r6:00000000 r5:e2b74af0
> > [  471.997879]  r4:bf059e9c
> > [  471.997981] [<bf059e9c>] (btree_submit_bio_hook [btrfs]) from
> > [<bf08b1ac>] (submit_one_bio+0x44/0x5c [btrfs])
> > [  471.997985]  r7:ef3724d4 r6:e1823cac r5:00000000 r4:bf059e9c
> > [  471.998087] [<bf08b168>] (submit_one_bio [btrfs]) from [<bf096664>]
> > (btree_write_cache_pages+0x380/0x408 [btrfs])
> > [  471.998090]  r5:00000000 r4:00000000
> > [  471.998193] [<bf0962e4>] (btree_write_cache_pages [btrfs]) from
> > [<bf0590b8>] (btree_writepages+0x7c/0x84 [btrfs])
> > [  471.998197]  r10:00000001 r9:4fd00000 r8:c0280d14 r7:e1822000
> > r6:e1823d80 r5:ec845f70
> > [  471.998199]  r4:e1823d80
> > [  471.998255] [<bf05903c>] (btree_writepages [btrfs]) from
> > [<c02847c8>] (do_writepages+0x58/0xf4)
> > [  471.998258]  r5:ec845f70 r4:ec845e68
> > [  471.998266] [<c0284770>] (do_writepages) from [<c0278cb0>]
> > (__filemap_fdatawrite_range+0xf8/0x130)
> > [  471.998270]  r8:ec845f70 r7:00001000 r6:4fd0bfff r5:e1822000 r4:ec845e68
> > [  471.998275] [<c0278bb8>] (__filemap_fdatawrite_range) from
> > [<c0278e38>] (filemap_fdatawrite_range+0x2c/0x34)
> > [  471.998279]  r10:ec845f70 r9:00001000 r8:4fd0bfff r7:e1823e4c
> > r6:c1955e28 r5:00001000
> > [  471.998281]  r4:4fd0bfff
> > [  471.998335] [<c0278e0c>] (filemap_fdatawrite_range) from
> > [<bf060544>] (btrfs_write_marked_extents+0x9c/0x1b0 [btrfs])
> > [  471.998338]  r5:00000001 r4:00000000
> > [  471.998441] [<bf0604a8>] (btrfs_write_marked_extents [btrfs]) from
> > [<bf0606f0>] (btrfs_write_and_wait_transaction+0x54/0xa4 [btrfs])
> > [  471.998446]  r10:e1822000 r9:ed66b010 r8:ed66b000 r7:c1955e28
> > r6:ed66b000 r5:e1822000
> > [  471.998447]  r4:eddc4f30
> > [  471.998551] [<bf06069c>] (btrfs_write_and_wait_transaction [btrfs])
> > from [<bf062428>] (btrfs_commit_transaction+0x75c/0xc94 [btrfs])
> > [  471.998555]  r8:ed66b418 r7:c1955e00 r6:ed66b000 r5:eddc4f30 r4:00000000
> > [  471.998658] [<bf061ccc>] (btrfs_commit_transaction [btrfs]) from
> > [<bf05cd98>] (transaction_kthread+0x19c/0x1e0 [btrfs])
> > [  471.998663]  r10:ed66b28c r9:00000000 r8:001aab57 r7:00000064
> > r6:ed66b414 r5:00000bb8
> > [  471.998664]  r4:ed66b000
> > [  471.998719] [<bf05cbfc>] (transaction_kthread [btrfs]) from
> > [<c014facc>] (kthread+0x170/0x174)
> > [  471.998724]  r10:ed765bfc r9:bf05cbfc r8:ed56a000 r7:e1822000
> > r6:00000000 r5:ed5c8600
> > [  471.998726]  r4:ed5c8740
> > [  471.998731] [<c014f95c>] (kthread) from [<c01010e8>]
> > (ret_from_fork+0x14/0x2c)
> > [  471.998733] Exception stack(0xe1823fb0 to 0xe1823ff8)
> > [  471.998737] 3fa0:                                     00000000
> > 00000000 00000000 00000000
> > [  471.998741] 3fc0: 00000000 00000000 00000000 00000000 00000000
> > 00000000 00000000 00000000
> > [  471.998745] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
> > [  471.998749]  r10:00000000 r9:00000000 r8:00000000 r7:00000000
> > r6:00000000 r5:c014f95c
> > [  471.998751]  r4:ed5c8600
> > [  471.998820] ---[ end trace 6a9fb9f547659a4d ]---
> > [  471.998855] BTRFS warning (device sda1): page_start and eb_start
> > mismatch, eb_start=17593525075968 page_start=1339031552
> > [  472.015514] BTRFS: error (device sda1) in
> > btrfs_commit_transaction:2279: errno=-5 IO failure (Error while
> > writing out transaction)
> > [  472.027321] BTRFS info (device sda1): forced readonly
> > [  472.027326] BTRFS warning (device sda1): Skipping commit of aborted
> > transaction.
> > [  472.027330] BTRFS: error (device sda1) in cleanup_transaction:1832:
> > errno=-5 IO failure
> > [  472.035378] BTRFS info (device sda1): delayed_refs has NO entry
> > [  472.039102] BTRFS warning (device sda1): failed setting block group ro: -30
> > [  472.039107] BTRFS info (device sda1): scrub: not finished on devid
> > 3 with status: -30
> > [  472.039114] BTRFS info (device sda1): scrub: not finished on devid
> > 4 with status: -30
> > [  472.039227] BTRFS warning (device sda1): failed setting block group ro: -30
> > [  472.039230] BTRFS warning (device sda1): failed setting block group ro: -30
> > [  472.039237] BTRFS info (device sda1): scrub: not finished on devid
> > 2 with status: -30
> > [  472.039243] BTRFS info (device sda1): scrub: not finished on devid
> > 1 with status: -30
> > ```
> >
> > To rob you of the suspense:
> >
> > printf "0x%016x\n" (17593525075968 - 1339031552)
> > 0x0000100000000000
> >
> > Very surprising considering that this machine has ECC ram!
>
> ARMv7 with ECC, and we still had such bitflip!?
>
> I guess everything is possible in 2020.
>
> >
> > What would you recommend as a fix?
>
> It's a hardware problem, I really don't have any recommendation at all.
>
> I guess you may want to request an RMA?
>
> Thanks,
> Qu
>
> >
> > Best,
> > Joe
> >
> > On Thu, Nov 26, 2020 at 7:05 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>
> >>
> >>
> >> On 2020/11/26 下午6:53, Joe Hermaszewski wrote:
> >>> Hi Qu,
> >>>
> >>> This is the output of btrfs check with the device unmounted:
> >>>
> >>> ```
> >>> [root@nixos:~]# btrfs check --readonly /dev/sda1
> >>> Opening filesystem to check...
> >>> Checking filesystem on /dev/sda1
> >>> UUID: b8f4ad49-29c8-4d19-a886-cef9c487f124
> >>> [1/7] checking root items
> >>> [2/7] checking extents
> >>> [3/7] checking free space cache
> >>> [4/7] checking fs roots
> >>> root 294 inode 24665 errors 100, file extent discount
> >>> Found file extent holes:
> >>>         start: 3709534208, len: 163840
> >>> root 294 inode 406548 errors 100, file extent discount
> >>> Found file extent holes:
> >>>         start: 0, len: 99450880
> >>> ERROR: errors found in fs roots
> >>> found 11279251902464 bytes used, error(s) found
> >>> total csum bytes: 10935906504
> >>> total tree bytes: 18455740416
> >>> total fs tree bytes: 5798936576
> >>> total extent tree bytes: 532250624
> >>> btree space waste bytes: 2308849707
> >>> file data blocks allocated: 17243418611712
> >>>  referenced 14210256932864
> >>> ```
> >>>
> >>> Looks the same as before...
> >>
> >> Then it means, at least your metadata is pretty safe. Nothing really
> >> needs to be worried (so far).
> >>
> >> This also means, the problem only happens at runtime, not something on-disk.
> >>
> >> If you are OK to re-compile the kernel, and the warning message is
> >> always the same at line number 531 of disk-io.c, then I recommend the
> >> attached debug diff
> >>
> >> The diff will output the found_start (from eb) and page_start (from page).
> >> If it's memory bitflip, we should have a pretty signature diff at the
> >> output.
> >> If not memory bitflip, then we really have a bug to dig.
> >>
> >>
> >>>
> >>> I'm afraid I don't have the precise details of the transid error
> >>> saved. I think I meant to write earlier that the message that comes up
> >>> now and then is the "errno=-5 IO failure (Error while writing out
> >>> transaction)". I've not seen the transid mismatch one since the first
> >>> incident, sorry for that slip-up.
> >>>
> >>> I'll try and attach to a known good system, will I have to attach the
> >>> whole array because I don't think I have that many enclosures for the
> >>> drives.
> >>
> >> You only need to mount the fs you're seeing the problem.
> >> But if that fs is on multiple devices, then I guess you need to mount
> >> them all.
> >>
> >> Thanks,
> >> Qu
> >>>
> >>> I'll also run a memtest on this box.
> >>>
> >>> Thank you for the helpful response.
> >>>
> >>> Best,
> >>> Joe
> >>>
> >>> On Thu, Nov 26, 2020 at 2:15 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 2020/11/25 下午11:28, Joe Hermaszewski wrote:
> >>>>> Hi,
> >>>>>
> >>>>> I have a arm32 machine with four drives with a btrfs fs spanning then in RAID1.
> >>>>> The filesystem has started behaving badly recently and I'm writing to:
> >>>>>
> >>>>> - Solicit advice on how best to get the system back to a stable state
> >>>>> - Report a potential bug
> >>>>>
> >>>>> ## What happened:
> >>>>>
> >>>>> A couple of days ago I could no longer ssh into it, and on the serial
> >>>>> connection there were heaps of messages (and new ones appearing with great
> >>>>> frequency) along the lines of: `parent transid verify failed on blah... wanted
> >>>>> x got y`.
> >>>>>
> >>>>> Although I don't have a record of the precise messages I do remember that there
> >>>>> was a difference of `15` between x and y.
> >>>>
> >>>> This normally can be a sign of unreliable HDD, which lies on FLUSH,
> >>>> killing metadata COW.
> >>>>
> >>>> But, your btrfs check doesn't report the same problem, thus I'm confused.
> >>>>
> >>>> Would you please try to run a "btrfs check --readonly /dev/sda1" with
> >>>> the fs unmounted?
> >>>>
> >>>> And, would you provide the full dmesg of that mismatch?
> >>>> The reason for the exact number is, I'm suspecting hardware memory
> >>>> corruption.
> >>>>
> >>>>>
> >>>>> I power-cycled system and started a scrub after it rebooted, this was
> >>>>> interrupted quite promptly by several more errors in btrfs, and the disk
> >>>>> remounted RO.
> >>>>>
> >>>>> Every now and then in the kernel log I get messages like:
> >>>>>
> >>>>> `parent transid verify failed on blah... wanted x got y`
> >>>>
> >>>> Not showing up in the gist dmesg though.
> >>>>
> >>>>>
> >>>>> ## Important info
> >>>>>
> >>>>> The dev stats are all zero.
> >>>>>
> >>>>> Here are the outputs of some btrfs commands, dmesg and the kernel log from the
> >>>>> previous two boots: https://gist.github.com/b1beab134403c5047e2efbceb98985f9
> >>>>>
> >>>>> The "cut here" portion of the kernel log is as follows
> >>>>>
> >>>>> ```
> >>>>> [  409.158097] ------------[ cut here ]------------
> >>>>> [  409.158205] WARNING: CPU: 1 PID: 217 at fs/btrfs/disk-io.c:531
> >>>>> btree_csum_one_bio+0x208/0x248 [btrfs]
> >>>>
> >>>> The line number shows, the tree bytenr doesn't match with the one in memory.
> >>>> This is really too rare to be seen, especially when we have no other
> >>>> error reported from btrfs (at least in the gist)
> >>>>
> >>>> Since there is no other problems showing up in the gist, it means it
> >>>> could be a bit flip, considering the magic generation gap you mention is
> >>>> 15, I'm more suspicious about memory bit flip.
> >>>>
> >>>> If you can provide the full parent transid mismatch error message, it
> >>>> may help to determine the possibility of hardware memory corruption.
> >>>>
> >>>> Thanks,
> >>>> Qu
> >>>>
> >>>>> [  409.158208] Modules linked in: cfg80211 rfkill 8021q ip6table_nat
> >>>>> iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6
> >>>>> nf_defrag_ipv4 ip6t_rpfilter ipt_rpfilter ip6table_raw iptable_raw
> >>>>> xt_pkttype nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG xt_tcpudp
> >>>>> ftdi_sio usbserial phy_generic uio_pdrv_genirq uio ip6table_filter
> >>>>> ip6_tables iptable_filter sch_fq_codel loop tun tap macvlan bridge stp
> >>>>> llc lm75 ip_tables x_tables autofs4 dm_mod dax btrfs libcrc32c xor
> >>>>> raid6_pq
> >>>>> [  409.158258] CPU: 1 PID: 217 Comm: btrfs-transacti Not tainted 5.4.77 #1-NixOS
> >>>>> [  409.158260] Hardware name: Marvell Armada 380/385 (Device Tree)
> >>>>> [  409.158261] Backtrace:
> >>>>> [  409.158272] [<c010f698>] (dump_backtrace) from [<c010f938>]
> >>>>> (show_stack+0x20/0x24)
> >>>>> [  409.158277]  r7:00000213 r6:600f0013 r5:00000000 r4:c0f8c044
> >>>>> [  409.158283] [<c010f918>] (show_stack) from [<c0a1b388>]
> >>>>> (dump_stack+0x98/0xac)
> >>>>> [  409.158288] [<c0a1b2f0>] (dump_stack) from [<c012a998>] (__warn+0xe0/0x108)
> >>>>> [  409.158292]  r7:00000213 r6:bf058ec8 r5:00000009 r4:bf120990
> >>>>> [  409.158296] [<c012a8b8>] (__warn) from [<c012ad24>]
> >>>>> (warn_slowpath_fmt+0x74/0xc4)
> >>>>> [  409.158300]  r7:00000213 r6:bf120990 r5:00000000 r4:e2392000
> >>>>> [  409.158358] [<c012acb4>] (warn_slowpath_fmt) from [<bf058ec8>]
> >>>>> (btree_csum_one_bio+0x208/0x248 [btrfs])
> >>>>> [  409.158363]  r9:e277abe0 r8:00000001 r7:e2392000 r6:ea3d17f0
> >>>>> r5:00000000 r4:eefd2d3c
> >>>>> [  409.158465] [<bf058cc0>] (btree_csum_one_bio [btrfs]) from
> >>>>> [<bf059ef4>] (btree_submit_bio_hook+0xe8/0x100 [btrfs])
> >>>>> [  409.158470]  r10:e32ce170 r9:ecc45fc0 r8:ecc45f70 r7:ec82b000
> >>>>> r6:00000000 r5:ea3d17f0
> >>>>> [  409.158472]  r4:bf059e0c
> >>>>> [  409.158575] [<bf059e0c>] (btree_submit_bio_hook [btrfs]) from
> >>>>> [<bf08b11c>] (submit_one_bio+0x44/0x5c [btrfs])
> >>>>> [  409.158578]  r7:ef36c048 r6:e2393cac r5:00000000 r4:bf059e0c
> >>>>> [  409.158683] [<bf08b0d8>] (submit_one_bio [btrfs]) from [<bf0965d4>]
> >>>>> (btree_write_cache_pages+0x380/0x408 [btrfs])
> >>>>> [  409.158686]  r5:00000000 r4:00000000
> >>>>> [  409.158788] [<bf096254>] (btree_write_cache_pages [btrfs]) from
> >>>>> [<bf059028>] (btree_writepages+0x7c/0x84 [btrfs])
> >>>>> [  409.158793]  r10:00000001 r9:4fd00000 r8:c0280c94 r7:e2392000
> >>>>> r6:e2393d80 r5:ecc45f70
> >>>>> [  409.158794]  r4:e2393d80
> >>>>> [  409.158850] [<bf058fac>] (btree_writepages [btrfs]) from
> >>>>> [<c0284748>] (do_writepages+0x58/0xf4)
> >>>>> [  409.158852]  r5:ecc45f70 r4:ecc45e68
> >>>>> [  409.158860] [<c02846f0>] (do_writepages) from [<c0278c30>]
> >>>>> (__filemap_fdatawrite_range+0xf8/0x130)
> >>>>> [  409.158864]  r8:ecc45f70 r7:00001000 r6:4fd0bfff r5:e2392000 r4:ecc45e68
> >>>>> [  409.158869] [<c0278b38>] (__filemap_fdatawrite_range) from
> >>>>> [<c0278db8>] (filemap_fdatawrite_range+0x2c/0x34)
> >>>>> [  409.158874]  r10:ecc45f70 r9:00001000 r8:4fd0bfff r7:e2393e4c
> >>>>> r6:c73bc628 r5:00001000
> >>>>> [  409.158875]  r4:4fd0bfff
> >>>>> [  409.158929] [<c0278d8c>] (filemap_fdatawrite_range) from
> >>>>> [<bf0604b4>] (btrfs_write_marked_extents+0x9c/0x1b0 [btrfs])
> >>>>> [  409.158931]  r5:00000001 r4:00000000
> >>>>> [  409.159033] [<bf060418>] (btrfs_write_marked_extents [btrfs]) from
> >>>>> [<bf060660>] (btrfs_write_and_wait_transaction+0x54/0xa4 [btrfs])
> >>>>> [  409.159038]  r10:e2392000 r9:ec82b010 r8:ec82b000 r7:c73bc628
> >>>>> r6:ec82b000 r5:e2392000
> >>>>> [  409.159040]  r4:c8b81ca8
> >>>>> [  409.159141] [<bf06060c>] (btrfs_write_and_wait_transaction [btrfs])
> >>>>> from [<bf062398>] (btrfs_commit_transaction+0x75c/0xc94 [btrfs])
> >>>>> [  409.159145]  r8:ec82b418 r7:c73bc600 r6:ec82b000 r5:c8b81ca8 r4:00000000
> >>>>> [  409.159248] [<bf061c3c>] (btrfs_commit_transaction [btrfs]) from
> >>>>> [<bf05cd08>] (transaction_kthread+0x19c/0x1e0 [btrfs])
> >>>>> [  409.159253]  r10:ec82b28c r9:00000000 r8:001aaafa r7:00000064
> >>>>> r6:ec82b414 r5:00000bb8
> >>>>> [  409.159254]  r4:ec82b000
> >>>>> [  409.159309] [<bf05cb6c>] (transaction_kthread [btrfs]) from
> >>>>> [<c014fabc>] (kthread+0x170/0x174)
> >>>>> [  409.159313]  r10:eca87bfc r9:bf05cb6c r8:ed619000 r7:e2392000
> >>>>> r6:00000000 r5:ed5ee700
> >>>>> [  409.159315]  r4:ed5ee1c0
> >>>>> [  409.159320] [<c014f94c>] (kthread) from [<c01010e8>]
> >>>>> (ret_from_fork+0x14/0x2c)
> >>>>> [  409.159322] Exception stack(0xe2393fb0 to 0xe2393ff8)
> >>>>> [  409.159326] 3fa0:                                     00000000
> >>>>> 00000000 00000000 00000000
> >>>>> [  409.159331] 3fc0: 00000000 00000000 00000000 00000000 00000000
> >>>>> 00000000 00000000 00000000
> >>>>> [  409.159334] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
> >>>>> [  409.159338]  r10:00000000 r9:00000000 r8:00000000 r7:00000000
> >>>>> r6:00000000 r5:c014f94c
> >>>>> [  409.159340]  r4:ed5ee700
> >>>>> [  409.159342] ---[ end trace eea59ced12fa7859 ]---
> >>>>> [  409.165084] BTRFS: error (device sda1) in
> >>>>> btrfs_commit_transaction:2279: errno=-5 IO failure (Error while
> >>>>> writing out transaction)
> >>>>> [  409.176920] BTRFS info (device sda1): forced readonly
> >>>>> [  409.176947] BTRFS warning (device sda1): Skipping commit of aborted
> >>>>> transaction.
> >>>>> [  409.176952] BTRFS: error (device sda1) in cleanup_transaction:1832:
> >>>>> errno=-5 IO failure
> >>>>> [  409.185049] BTRFS info (device sda1): delayed_refs has NO entry
> >>>>> [  409.310199] BTRFS info (device sda1): scrub: not finished on devid
> >>>>> 3 with status: -125
> >>>>> [  409.664880] BTRFS info (device sda1): scrub: not finished on devid
> >>>>> 4 with status: -125
> >>>>> [  410.106791] BTRFS info (device sda1): scrub: not finished on devid
> >>>>> 1 with status: -125
> >>>>> [  411.268585] BTRFS warning (device sda1): failed setting block group ro: -30
> >>>>> [  411.268594] BTRFS info (device sda1): scrub: not finished on devid
> >>>>> 2 with status: -30
> >>>>> [  411.268605] BTRFS info (device sda1): delayed_refs has NO entry
> >>>>> ```
> >>>>>
> >>>>> Information requested here
> >>>>> (https://btrfs.wiki.kernel.org/index.php/Btrfs_mailing_list):
> >>>>>
> >>>>> ```
> >>>>>  $ uname -a
> >>>>> Linux thanos 5.4.77 #1-NixOS SMP Tue Nov 10 20:13:20 UTC 2020 armv7l GNU/Linux
> >>>>>
> >>>>>  $ btrfs --version
> >>>>> btrfs-progs v5.7
> >>>>>
> >>>>>  $ sudo btrfs fi show
> >>>>> Label: none  uuid: b8f4ad49-29c8-4d19-a886-cef9c487f124
> >>>>>         Total devices 4 FS bytes used 10.26TiB
> >>>>>         devid    1 size 3.64TiB used 2.40TiB path /dev/sda1
> >>>>>         devid    2 size 3.64TiB used 2.40TiB path /dev/sdc1
> >>>>>         devid    3 size 9.09TiB used 7.86TiB path /dev/sdd1
> >>>>>         devid    4 size 9.09TiB used 7.86TiB path /dev/sdb1
> >>>>>
> >>>>> Label: none  uuid: d02a3067-0a23-4c1f-96ac-80dbc26622f2
> >>>>>         Total devices 1 FS bytes used 116.35MiB
> >>>>>         devid    1 size 399.82MiB used 224.00MiB path /dev/sda2
> >>>>>
> >>>>>  $ sudo btrfs fi df /
> >>>>> Data, RAID1: total=10.25TiB, used=10.24TiB
> >>>>> System, RAID1: total=64.00MiB, used=1.45MiB
> >>>>> Metadata, RAID1: total=18.00GiB, used=17.19GiB
> >>>>> GlobalReserve, single: total=512.00MiB, used=0.00B
> >>>>> ```
> >>>>>
> >>>>> Thanks to demfloro and multicore on #btrfs for prompting this email.
> >>>>>
> >>>>> Best wishes,
> >>>>> Joe
> >>>>>
> >>>>
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: btrfs crash on armv7
  2020-12-19 10:35           ` Joe Hermaszewski
@ 2020-12-20  0:28             ` Qu Wenruo
  2021-04-08  8:16               ` Joe Hermaszewski
  0 siblings, 1 reply; 14+ messages in thread
From: Qu Wenruo @ 2020-12-20  0:28 UTC (permalink / raw)
  To: Joe Hermaszewski; +Cc: linux-btrfs



On 2020/12/19 下午6:35, Joe Hermaszewski wrote:
> Ok, so I managed to get hold of a 64bit machine on which to run btrfs
> check. `btrfs check` returns exactly the same output as the armv7 box
> (no serious problems) which I suppose is good. `btrfs scrub` also
> finds no problems. Boring Logs below.
>
> What I don't quite understand is how the scrub problem on armv7l is so
> reliable when it's not persisted on the disks, is the same physical
> memory location being used for this breaking value, or is it perhaps a
> specific pattern of data on the bus which causes this?

If it's so reliably reproducible, then I guess that would be the case,
either a specific memory range has the problem, or a specific pattern on
the bus causing the problem.

>
> If it's the former, how easy would it be to find this broken location
> and blacklist it? If it's the latter then I guess there's no hope but
> to try replacing the psu/machine.
>
> The machine survived a couple of days of memtester (on about 95% of
> the RAM) and `7z b` with no problems, *shrug*

The memtester should rule out the former case, the latter case may be
resolved by newer kernel if it's not a hardware problem but a software one.

Thanks,
Qu

>
> Best wishes and thanks for the generous help so far.
> Joe
>
> btrfs scrub aarch64:
> ```
> [j@nixos:~]$ sudo btrfs scrub status -d /mnt
> UUID:             b8f4ad49-29c8-4d19-a886-cef9c487f124
> scrub device /dev/sda1 (id 1) history
> Scrub started:    Fri Dec 18 14:24:30 2020
> Status:           finished
> Duration:         7:36:31
> Total to scrub:   2.40TiB
> Rate:             91.95MiB/s
> Error summary:    no errors found
> scrub device /dev/sdb1 (id 2) history
> Scrub started:    Fri Dec 18 14:24:30 2020
> Status:           finished
> Duration:         7:12:51
> Total to scrub:   2.40TiB
> Rate:             96.90MiB/s
> Error summary:    no errors found
> scrub device /dev/sdd1 (id 3) history
> Scrub started:    Fri Dec 18 14:24:30 2020
> Status:           finished
> Duration:         19:47:01
> Total to scrub:   7.86TiB
> Rate:             115.70MiB/s
> Error summary:    no errors found
> scrub device /dev/sdc1 (id 4) history
> Scrub started:    Fri Dec 18 14:24:30 2020
> Status:           finished
> Duration:         19:46:38
> Total to scrub:   7.86TiB
> Rate:             115.74MiB/s
> Error summary:    no errors found
> ```
>
> btrfs check aarch64:
> ```
> [nixos@nixos:/]$ sudo btrfs check --readonly /dev/sda1
> Opening filesystem to check...
> Checking filesystem on /dev/sda1
> UUID: b8f4ad49-29c8-4d19-a886-cef9c487f124
> [1/7] checking root items
> [2/7] checking extents
> [3/7] checking free space cache
> [4/7] checking fs roots
> root 294 inode 24665 errors 100, file extent discount
> Found file extent holes:
>          start: 3709534208, len: 163840
> root 294 inode 406548 errors 100, file extent discount
> Found file extent holes:
>          start: 98729984, len: 720896
> ERROR: errors found in fs roots
> found 11280701063168 bytes used, error(s) found
> total csum bytes: 10937464120
> total tree bytes: 18538053632
> total fs tree bytes: 5877579776
> total extent tree bytes: 534052864
> btree space waste bytes: 2316660292
> file data blocks allocated: 17244587220992
>   referenced 14211684794368%
> ```
>
>
> On Sat, Nov 28, 2020 at 8:46 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2020/11/27 下午11:15, Joe Hermaszewski wrote:
>>> Hi Qu,
>>>
>>> Thanks for the patch. I recompiled the kernel ran the scrub and your
>>> patch worked as expected, here is the log:
>>>
>>> ```
>>> [  337.365239] BTRFS info (device sda1): scrub: started on devid 2
>>> [  337.366283] BTRFS info (device sda1): scrub: started on devid 1
>>> [  337.402822] BTRFS info (device sda1): scrub: started on devid 3
>>> [  337.411944] BTRFS info (device sda1): scrub: started on devid 4
>>> [  471.997496] ------------[ cut here ]------------
>>> [  471.997614] WARNING: CPU: 0 PID: 218 at fs/btrfs/disk-io.c:531
>>> btree_csum_one_bio+0x22c/0x278 [btrfs]
>>> [  471.997616] Modules linked in: cfg80211 rfkill 8021q ip6table_nat
>>> iptable_nat nf_nat ftdi_sio phy_generic usbserial xt_conntrack
>>> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_rpfilter ipt_rpfilter
>>> ip6table_raw uio_pdrv_genirq iptable_raw uio xt_pkttype nf_log_ipv6
>>> nf_log_ipv4 nf_log_common xt_LOG xt_tcpudp ip6table_filter ip6_tables
>>> iptable_filter sch_fq_codel loop tun tap macvlan bridge stp llc lm75
>>> ip_tables x_tables autofs4 dm_mod dax btrfs libcrc32c xor raid6_pq
>>> [  471.997666] CPU: 0 PID: 218 Comm: btrfs-transacti Not tainted 5.4.78 #1-NixOS
>>> [  471.997668] Hardware name: Marvell Armada 380/385 (Device Tree)
>>> [  471.997669] Backtrace:
>>> [  471.997680] [<c010f698>] (dump_backtrace) from [<c010f938>]
>>> (show_stack+0x20/0x24)
>>> [  471.997684]  r7:00000213 r6:600c0013 r5:00000000 r4:c0f8c044
>>> [  471.997691] [<c010f918>] (show_stack) from [<c0a1b848>]
>>> (dump_stack+0x98/0xac)
>>> [  471.997696] [<c0a1b7b0>] (dump_stack) from [<c012a998>] (__warn+0xe0/0x108)
>>> [  471.997700]  r7:00000213 r6:bf058f4c r5:00000009 r4:bf120990
>>> [  471.997704] [<c012a8b8>] (__warn) from [<c012ad24>]
>>> (warn_slowpath_fmt+0x74/0xc4)
>>> [  471.997707]  r7:00000213 r6:bf120990 r5:00000000 r4:e1822000
>>> [  471.997765] [<c012acb4>] (warn_slowpath_fmt) from [<bf058f4c>]
>>> (btree_csum_one_bio+0x22c/0x278 [btrfs])
>>> [  471.997770]  r9:00000001 r8:c10decb8 r7:e1822000 r6:ed66b000
>>> r5:00001000 r4:4fd00000
>>> [  471.997872] [<bf058d20>] (btree_csum_one_bio [btrfs]) from
>>> [<bf059f84>] (btree_submit_bio_hook+0xe8/0x100 [btrfs])
>>> [  471.997877]  r10:e2197b48 r9:ec845fc0 r8:ec845f70 r7:ed66b000
>>> r6:00000000 r5:e2b74af0
>>> [  471.997879]  r4:bf059e9c
>>> [  471.997981] [<bf059e9c>] (btree_submit_bio_hook [btrfs]) from
>>> [<bf08b1ac>] (submit_one_bio+0x44/0x5c [btrfs])
>>> [  471.997985]  r7:ef3724d4 r6:e1823cac r5:00000000 r4:bf059e9c
>>> [  471.998087] [<bf08b168>] (submit_one_bio [btrfs]) from [<bf096664>]
>>> (btree_write_cache_pages+0x380/0x408 [btrfs])
>>> [  471.998090]  r5:00000000 r4:00000000
>>> [  471.998193] [<bf0962e4>] (btree_write_cache_pages [btrfs]) from
>>> [<bf0590b8>] (btree_writepages+0x7c/0x84 [btrfs])
>>> [  471.998197]  r10:00000001 r9:4fd00000 r8:c0280d14 r7:e1822000
>>> r6:e1823d80 r5:ec845f70
>>> [  471.998199]  r4:e1823d80
>>> [  471.998255] [<bf05903c>] (btree_writepages [btrfs]) from
>>> [<c02847c8>] (do_writepages+0x58/0xf4)
>>> [  471.998258]  r5:ec845f70 r4:ec845e68
>>> [  471.998266] [<c0284770>] (do_writepages) from [<c0278cb0>]
>>> (__filemap_fdatawrite_range+0xf8/0x130)
>>> [  471.998270]  r8:ec845f70 r7:00001000 r6:4fd0bfff r5:e1822000 r4:ec845e68
>>> [  471.998275] [<c0278bb8>] (__filemap_fdatawrite_range) from
>>> [<c0278e38>] (filemap_fdatawrite_range+0x2c/0x34)
>>> [  471.998279]  r10:ec845f70 r9:00001000 r8:4fd0bfff r7:e1823e4c
>>> r6:c1955e28 r5:00001000
>>> [  471.998281]  r4:4fd0bfff
>>> [  471.998335] [<c0278e0c>] (filemap_fdatawrite_range) from
>>> [<bf060544>] (btrfs_write_marked_extents+0x9c/0x1b0 [btrfs])
>>> [  471.998338]  r5:00000001 r4:00000000
>>> [  471.998441] [<bf0604a8>] (btrfs_write_marked_extents [btrfs]) from
>>> [<bf0606f0>] (btrfs_write_and_wait_transaction+0x54/0xa4 [btrfs])
>>> [  471.998446]  r10:e1822000 r9:ed66b010 r8:ed66b000 r7:c1955e28
>>> r6:ed66b000 r5:e1822000
>>> [  471.998447]  r4:eddc4f30
>>> [  471.998551] [<bf06069c>] (btrfs_write_and_wait_transaction [btrfs])
>>> from [<bf062428>] (btrfs_commit_transaction+0x75c/0xc94 [btrfs])
>>> [  471.998555]  r8:ed66b418 r7:c1955e00 r6:ed66b000 r5:eddc4f30 r4:00000000
>>> [  471.998658] [<bf061ccc>] (btrfs_commit_transaction [btrfs]) from
>>> [<bf05cd98>] (transaction_kthread+0x19c/0x1e0 [btrfs])
>>> [  471.998663]  r10:ed66b28c r9:00000000 r8:001aab57 r7:00000064
>>> r6:ed66b414 r5:00000bb8
>>> [  471.998664]  r4:ed66b000
>>> [  471.998719] [<bf05cbfc>] (transaction_kthread [btrfs]) from
>>> [<c014facc>] (kthread+0x170/0x174)
>>> [  471.998724]  r10:ed765bfc r9:bf05cbfc r8:ed56a000 r7:e1822000
>>> r6:00000000 r5:ed5c8600
>>> [  471.998726]  r4:ed5c8740
>>> [  471.998731] [<c014f95c>] (kthread) from [<c01010e8>]
>>> (ret_from_fork+0x14/0x2c)
>>> [  471.998733] Exception stack(0xe1823fb0 to 0xe1823ff8)
>>> [  471.998737] 3fa0:                                     00000000
>>> 00000000 00000000 00000000
>>> [  471.998741] 3fc0: 00000000 00000000 00000000 00000000 00000000
>>> 00000000 00000000 00000000
>>> [  471.998745] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
>>> [  471.998749]  r10:00000000 r9:00000000 r8:00000000 r7:00000000
>>> r6:00000000 r5:c014f95c
>>> [  471.998751]  r4:ed5c8600
>>> [  471.998820] ---[ end trace 6a9fb9f547659a4d ]---
>>> [  471.998855] BTRFS warning (device sda1): page_start and eb_start
>>> mismatch, eb_start=17593525075968 page_start=1339031552
>>> [  472.015514] BTRFS: error (device sda1) in
>>> btrfs_commit_transaction:2279: errno=-5 IO failure (Error while
>>> writing out transaction)
>>> [  472.027321] BTRFS info (device sda1): forced readonly
>>> [  472.027326] BTRFS warning (device sda1): Skipping commit of aborted
>>> transaction.
>>> [  472.027330] BTRFS: error (device sda1) in cleanup_transaction:1832:
>>> errno=-5 IO failure
>>> [  472.035378] BTRFS info (device sda1): delayed_refs has NO entry
>>> [  472.039102] BTRFS warning (device sda1): failed setting block group ro: -30
>>> [  472.039107] BTRFS info (device sda1): scrub: not finished on devid
>>> 3 with status: -30
>>> [  472.039114] BTRFS info (device sda1): scrub: not finished on devid
>>> 4 with status: -30
>>> [  472.039227] BTRFS warning (device sda1): failed setting block group ro: -30
>>> [  472.039230] BTRFS warning (device sda1): failed setting block group ro: -30
>>> [  472.039237] BTRFS info (device sda1): scrub: not finished on devid
>>> 2 with status: -30
>>> [  472.039243] BTRFS info (device sda1): scrub: not finished on devid
>>> 1 with status: -30
>>> ```
>>>
>>> To rob you of the suspense:
>>>
>>> printf "0x%016x\n" (17593525075968 - 1339031552)
>>> 0x0000100000000000
>>>
>>> Very surprising considering that this machine has ECC ram!
>>
>> ARMv7 with ECC, and we still had such bitflip!?
>>
>> I guess everything is possible in 2020.
>>
>>>
>>> What would you recommend as a fix?
>>
>> It's a hardware problem, I really don't have any recommendation at all.
>>
>> I guess you may want to request an RMA?
>>
>> Thanks,
>> Qu
>>
>>>
>>> Best,
>>> Joe
>>>
>>> On Thu, Nov 26, 2020 at 7:05 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2020/11/26 下午6:53, Joe Hermaszewski wrote:
>>>>> Hi Qu,
>>>>>
>>>>> This is the output of btrfs check with the device unmounted:
>>>>>
>>>>> ```
>>>>> [root@nixos:~]# btrfs check --readonly /dev/sda1
>>>>> Opening filesystem to check...
>>>>> Checking filesystem on /dev/sda1
>>>>> UUID: b8f4ad49-29c8-4d19-a886-cef9c487f124
>>>>> [1/7] checking root items
>>>>> [2/7] checking extents
>>>>> [3/7] checking free space cache
>>>>> [4/7] checking fs roots
>>>>> root 294 inode 24665 errors 100, file extent discount
>>>>> Found file extent holes:
>>>>>          start: 3709534208, len: 163840
>>>>> root 294 inode 406548 errors 100, file extent discount
>>>>> Found file extent holes:
>>>>>          start: 0, len: 99450880
>>>>> ERROR: errors found in fs roots
>>>>> found 11279251902464 bytes used, error(s) found
>>>>> total csum bytes: 10935906504
>>>>> total tree bytes: 18455740416
>>>>> total fs tree bytes: 5798936576
>>>>> total extent tree bytes: 532250624
>>>>> btree space waste bytes: 2308849707
>>>>> file data blocks allocated: 17243418611712
>>>>>   referenced 14210256932864
>>>>> ```
>>>>>
>>>>> Looks the same as before...
>>>>
>>>> Then it means, at least your metadata is pretty safe. Nothing really
>>>> needs to be worried (so far).
>>>>
>>>> This also means, the problem only happens at runtime, not something on-disk.
>>>>
>>>> If you are OK to re-compile the kernel, and the warning message is
>>>> always the same at line number 531 of disk-io.c, then I recommend the
>>>> attached debug diff
>>>>
>>>> The diff will output the found_start (from eb) and page_start (from page).
>>>> If it's memory bitflip, we should have a pretty signature diff at the
>>>> output.
>>>> If not memory bitflip, then we really have a bug to dig.
>>>>
>>>>
>>>>>
>>>>> I'm afraid I don't have the precise details of the transid error
>>>>> saved. I think I meant to write earlier that the message that comes up
>>>>> now and then is the "errno=-5 IO failure (Error while writing out
>>>>> transaction)". I've not seen the transid mismatch one since the first
>>>>> incident, sorry for that slip-up.
>>>>>
>>>>> I'll try and attach to a known good system, will I have to attach the
>>>>> whole array because I don't think I have that many enclosures for the
>>>>> drives.
>>>>
>>>> You only need to mount the fs you're seeing the problem.
>>>> But if that fs is on multiple devices, then I guess you need to mount
>>>> them all.
>>>>
>>>> Thanks,
>>>> Qu
>>>>>
>>>>> I'll also run a memtest on this box.
>>>>>
>>>>> Thank you for the helpful response.
>>>>>
>>>>> Best,
>>>>> Joe
>>>>>
>>>>> On Thu, Nov 26, 2020 at 2:15 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2020/11/25 下午11:28, Joe Hermaszewski wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have a arm32 machine with four drives with a btrfs fs spanning then in RAID1.
>>>>>>> The filesystem has started behaving badly recently and I'm writing to:
>>>>>>>
>>>>>>> - Solicit advice on how best to get the system back to a stable state
>>>>>>> - Report a potential bug
>>>>>>>
>>>>>>> ## What happened:
>>>>>>>
>>>>>>> A couple of days ago I could no longer ssh into it, and on the serial
>>>>>>> connection there were heaps of messages (and new ones appearing with great
>>>>>>> frequency) along the lines of: `parent transid verify failed on blah... wanted
>>>>>>> x got y`.
>>>>>>>
>>>>>>> Although I don't have a record of the precise messages I do remember that there
>>>>>>> was a difference of `15` between x and y.
>>>>>>
>>>>>> This normally can be a sign of unreliable HDD, which lies on FLUSH,
>>>>>> killing metadata COW.
>>>>>>
>>>>>> But, your btrfs check doesn't report the same problem, thus I'm confused.
>>>>>>
>>>>>> Would you please try to run a "btrfs check --readonly /dev/sda1" with
>>>>>> the fs unmounted?
>>>>>>
>>>>>> And, would you provide the full dmesg of that mismatch?
>>>>>> The reason for the exact number is, I'm suspecting hardware memory
>>>>>> corruption.
>>>>>>
>>>>>>>
>>>>>>> I power-cycled system and started a scrub after it rebooted, this was
>>>>>>> interrupted quite promptly by several more errors in btrfs, and the disk
>>>>>>> remounted RO.
>>>>>>>
>>>>>>> Every now and then in the kernel log I get messages like:
>>>>>>>
>>>>>>> `parent transid verify failed on blah... wanted x got y`
>>>>>>
>>>>>> Not showing up in the gist dmesg though.
>>>>>>
>>>>>>>
>>>>>>> ## Important info
>>>>>>>
>>>>>>> The dev stats are all zero.
>>>>>>>
>>>>>>> Here are the outputs of some btrfs commands, dmesg and the kernel log from the
>>>>>>> previous two boots: https://gist.github.com/b1beab134403c5047e2efbceb98985f9
>>>>>>>
>>>>>>> The "cut here" portion of the kernel log is as follows
>>>>>>>
>>>>>>> ```
>>>>>>> [  409.158097] ------------[ cut here ]------------
>>>>>>> [  409.158205] WARNING: CPU: 1 PID: 217 at fs/btrfs/disk-io.c:531
>>>>>>> btree_csum_one_bio+0x208/0x248 [btrfs]
>>>>>>
>>>>>> The line number shows, the tree bytenr doesn't match with the one in memory.
>>>>>> This is really too rare to be seen, especially when we have no other
>>>>>> error reported from btrfs (at least in the gist)
>>>>>>
>>>>>> Since there is no other problems showing up in the gist, it means it
>>>>>> could be a bit flip, considering the magic generation gap you mention is
>>>>>> 15, I'm more suspicious about memory bit flip.
>>>>>>
>>>>>> If you can provide the full parent transid mismatch error message, it
>>>>>> may help to determine the possibility of hardware memory corruption.
>>>>>>
>>>>>> Thanks,
>>>>>> Qu
>>>>>>
>>>>>>> [  409.158208] Modules linked in: cfg80211 rfkill 8021q ip6table_nat
>>>>>>> iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6
>>>>>>> nf_defrag_ipv4 ip6t_rpfilter ipt_rpfilter ip6table_raw iptable_raw
>>>>>>> xt_pkttype nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG xt_tcpudp
>>>>>>> ftdi_sio usbserial phy_generic uio_pdrv_genirq uio ip6table_filter
>>>>>>> ip6_tables iptable_filter sch_fq_codel loop tun tap macvlan bridge stp
>>>>>>> llc lm75 ip_tables x_tables autofs4 dm_mod dax btrfs libcrc32c xor
>>>>>>> raid6_pq
>>>>>>> [  409.158258] CPU: 1 PID: 217 Comm: btrfs-transacti Not tainted 5.4.77 #1-NixOS
>>>>>>> [  409.158260] Hardware name: Marvell Armada 380/385 (Device Tree)
>>>>>>> [  409.158261] Backtrace:
>>>>>>> [  409.158272] [<c010f698>] (dump_backtrace) from [<c010f938>]
>>>>>>> (show_stack+0x20/0x24)
>>>>>>> [  409.158277]  r7:00000213 r6:600f0013 r5:00000000 r4:c0f8c044
>>>>>>> [  409.158283] [<c010f918>] (show_stack) from [<c0a1b388>]
>>>>>>> (dump_stack+0x98/0xac)
>>>>>>> [  409.158288] [<c0a1b2f0>] (dump_stack) from [<c012a998>] (__warn+0xe0/0x108)
>>>>>>> [  409.158292]  r7:00000213 r6:bf058ec8 r5:00000009 r4:bf120990
>>>>>>> [  409.158296] [<c012a8b8>] (__warn) from [<c012ad24>]
>>>>>>> (warn_slowpath_fmt+0x74/0xc4)
>>>>>>> [  409.158300]  r7:00000213 r6:bf120990 r5:00000000 r4:e2392000
>>>>>>> [  409.158358] [<c012acb4>] (warn_slowpath_fmt) from [<bf058ec8>]
>>>>>>> (btree_csum_one_bio+0x208/0x248 [btrfs])
>>>>>>> [  409.158363]  r9:e277abe0 r8:00000001 r7:e2392000 r6:ea3d17f0
>>>>>>> r5:00000000 r4:eefd2d3c
>>>>>>> [  409.158465] [<bf058cc0>] (btree_csum_one_bio [btrfs]) from
>>>>>>> [<bf059ef4>] (btree_submit_bio_hook+0xe8/0x100 [btrfs])
>>>>>>> [  409.158470]  r10:e32ce170 r9:ecc45fc0 r8:ecc45f70 r7:ec82b000
>>>>>>> r6:00000000 r5:ea3d17f0
>>>>>>> [  409.158472]  r4:bf059e0c
>>>>>>> [  409.158575] [<bf059e0c>] (btree_submit_bio_hook [btrfs]) from
>>>>>>> [<bf08b11c>] (submit_one_bio+0x44/0x5c [btrfs])
>>>>>>> [  409.158578]  r7:ef36c048 r6:e2393cac r5:00000000 r4:bf059e0c
>>>>>>> [  409.158683] [<bf08b0d8>] (submit_one_bio [btrfs]) from [<bf0965d4>]
>>>>>>> (btree_write_cache_pages+0x380/0x408 [btrfs])
>>>>>>> [  409.158686]  r5:00000000 r4:00000000
>>>>>>> [  409.158788] [<bf096254>] (btree_write_cache_pages [btrfs]) from
>>>>>>> [<bf059028>] (btree_writepages+0x7c/0x84 [btrfs])
>>>>>>> [  409.158793]  r10:00000001 r9:4fd00000 r8:c0280c94 r7:e2392000
>>>>>>> r6:e2393d80 r5:ecc45f70
>>>>>>> [  409.158794]  r4:e2393d80
>>>>>>> [  409.158850] [<bf058fac>] (btree_writepages [btrfs]) from
>>>>>>> [<c0284748>] (do_writepages+0x58/0xf4)
>>>>>>> [  409.158852]  r5:ecc45f70 r4:ecc45e68
>>>>>>> [  409.158860] [<c02846f0>] (do_writepages) from [<c0278c30>]
>>>>>>> (__filemap_fdatawrite_range+0xf8/0x130)
>>>>>>> [  409.158864]  r8:ecc45f70 r7:00001000 r6:4fd0bfff r5:e2392000 r4:ecc45e68
>>>>>>> [  409.158869] [<c0278b38>] (__filemap_fdatawrite_range) from
>>>>>>> [<c0278db8>] (filemap_fdatawrite_range+0x2c/0x34)
>>>>>>> [  409.158874]  r10:ecc45f70 r9:00001000 r8:4fd0bfff r7:e2393e4c
>>>>>>> r6:c73bc628 r5:00001000
>>>>>>> [  409.158875]  r4:4fd0bfff
>>>>>>> [  409.158929] [<c0278d8c>] (filemap_fdatawrite_range) from
>>>>>>> [<bf0604b4>] (btrfs_write_marked_extents+0x9c/0x1b0 [btrfs])
>>>>>>> [  409.158931]  r5:00000001 r4:00000000
>>>>>>> [  409.159033] [<bf060418>] (btrfs_write_marked_extents [btrfs]) from
>>>>>>> [<bf060660>] (btrfs_write_and_wait_transaction+0x54/0xa4 [btrfs])
>>>>>>> [  409.159038]  r10:e2392000 r9:ec82b010 r8:ec82b000 r7:c73bc628
>>>>>>> r6:ec82b000 r5:e2392000
>>>>>>> [  409.159040]  r4:c8b81ca8
>>>>>>> [  409.159141] [<bf06060c>] (btrfs_write_and_wait_transaction [btrfs])
>>>>>>> from [<bf062398>] (btrfs_commit_transaction+0x75c/0xc94 [btrfs])
>>>>>>> [  409.159145]  r8:ec82b418 r7:c73bc600 r6:ec82b000 r5:c8b81ca8 r4:00000000
>>>>>>> [  409.159248] [<bf061c3c>] (btrfs_commit_transaction [btrfs]) from
>>>>>>> [<bf05cd08>] (transaction_kthread+0x19c/0x1e0 [btrfs])
>>>>>>> [  409.159253]  r10:ec82b28c r9:00000000 r8:001aaafa r7:00000064
>>>>>>> r6:ec82b414 r5:00000bb8
>>>>>>> [  409.159254]  r4:ec82b000
>>>>>>> [  409.159309] [<bf05cb6c>] (transaction_kthread [btrfs]) from
>>>>>>> [<c014fabc>] (kthread+0x170/0x174)
>>>>>>> [  409.159313]  r10:eca87bfc r9:bf05cb6c r8:ed619000 r7:e2392000
>>>>>>> r6:00000000 r5:ed5ee700
>>>>>>> [  409.159315]  r4:ed5ee1c0
>>>>>>> [  409.159320] [<c014f94c>] (kthread) from [<c01010e8>]
>>>>>>> (ret_from_fork+0x14/0x2c)
>>>>>>> [  409.159322] Exception stack(0xe2393fb0 to 0xe2393ff8)
>>>>>>> [  409.159326] 3fa0:                                     00000000
>>>>>>> 00000000 00000000 00000000
>>>>>>> [  409.159331] 3fc0: 00000000 00000000 00000000 00000000 00000000
>>>>>>> 00000000 00000000 00000000
>>>>>>> [  409.159334] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
>>>>>>> [  409.159338]  r10:00000000 r9:00000000 r8:00000000 r7:00000000
>>>>>>> r6:00000000 r5:c014f94c
>>>>>>> [  409.159340]  r4:ed5ee700
>>>>>>> [  409.159342] ---[ end trace eea59ced12fa7859 ]---
>>>>>>> [  409.165084] BTRFS: error (device sda1) in
>>>>>>> btrfs_commit_transaction:2279: errno=-5 IO failure (Error while
>>>>>>> writing out transaction)
>>>>>>> [  409.176920] BTRFS info (device sda1): forced readonly
>>>>>>> [  409.176947] BTRFS warning (device sda1): Skipping commit of aborted
>>>>>>> transaction.
>>>>>>> [  409.176952] BTRFS: error (device sda1) in cleanup_transaction:1832:
>>>>>>> errno=-5 IO failure
>>>>>>> [  409.185049] BTRFS info (device sda1): delayed_refs has NO entry
>>>>>>> [  409.310199] BTRFS info (device sda1): scrub: not finished on devid
>>>>>>> 3 with status: -125
>>>>>>> [  409.664880] BTRFS info (device sda1): scrub: not finished on devid
>>>>>>> 4 with status: -125
>>>>>>> [  410.106791] BTRFS info (device sda1): scrub: not finished on devid
>>>>>>> 1 with status: -125
>>>>>>> [  411.268585] BTRFS warning (device sda1): failed setting block group ro: -30
>>>>>>> [  411.268594] BTRFS info (device sda1): scrub: not finished on devid
>>>>>>> 2 with status: -30
>>>>>>> [  411.268605] BTRFS info (device sda1): delayed_refs has NO entry
>>>>>>> ```
>>>>>>>
>>>>>>> Information requested here
>>>>>>> (https://btrfs.wiki.kernel.org/index.php/Btrfs_mailing_list):
>>>>>>>
>>>>>>> ```
>>>>>>>   $ uname -a
>>>>>>> Linux thanos 5.4.77 #1-NixOS SMP Tue Nov 10 20:13:20 UTC 2020 armv7l GNU/Linux
>>>>>>>
>>>>>>>   $ btrfs --version
>>>>>>> btrfs-progs v5.7
>>>>>>>
>>>>>>>   $ sudo btrfs fi show
>>>>>>> Label: none  uuid: b8f4ad49-29c8-4d19-a886-cef9c487f124
>>>>>>>          Total devices 4 FS bytes used 10.26TiB
>>>>>>>          devid    1 size 3.64TiB used 2.40TiB path /dev/sda1
>>>>>>>          devid    2 size 3.64TiB used 2.40TiB path /dev/sdc1
>>>>>>>          devid    3 size 9.09TiB used 7.86TiB path /dev/sdd1
>>>>>>>          devid    4 size 9.09TiB used 7.86TiB path /dev/sdb1
>>>>>>>
>>>>>>> Label: none  uuid: d02a3067-0a23-4c1f-96ac-80dbc26622f2
>>>>>>>          Total devices 1 FS bytes used 116.35MiB
>>>>>>>          devid    1 size 399.82MiB used 224.00MiB path /dev/sda2
>>>>>>>
>>>>>>>   $ sudo btrfs fi df /
>>>>>>> Data, RAID1: total=10.25TiB, used=10.24TiB
>>>>>>> System, RAID1: total=64.00MiB, used=1.45MiB
>>>>>>> Metadata, RAID1: total=18.00GiB, used=17.19GiB
>>>>>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>>>>> ```
>>>>>>>
>>>>>>> Thanks to demfloro and multicore on #btrfs for prompting this email.
>>>>>>>
>>>>>>> Best wishes,
>>>>>>> Joe
>>>>>>>
>>>>>>
>>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: btrfs crash on armv7
  2020-12-20  0:28             ` Qu Wenruo
@ 2021-04-08  8:16               ` Joe Hermaszewski
  2021-04-08  8:38                 ` Qu Wenruo
  0 siblings, 1 reply; 14+ messages in thread
From: Joe Hermaszewski @ 2021-04-08  8:16 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

It took a while but I managed to get hold of another one of these
arm32 boards. Very disappointingly this exact "bitflip" is still
present (log enclosed).

To summarise, as it's been a while:

- When running scrub, a "page_start" and "eb_start" mismatch is
detected (off by a single bit).
- `btrfs check` reports no significant errors on aarch64 or arm32.
- `btrfs scrub` completes successfully on aarch64!
- Now, I can confirm that `btrfs scrub` fails in the same manner on
two arm32 machines.

Not really sure where to go from here. The only things I can think of are:

- Bug in `btrfs scrub` on arm32 (seems unlikely given the single-bit
error here).
- A bitflip on the original machine wrote bad data to the disk, and
somehow only the arm32 implementation catches this...

I can give a newer kernel a try (I'm on 5.4.80 at the moment) I suppose

Best,
Joe

```
$ dmesg
[  258.791966] BTRFS info (device sda1): scrub: started on devid 4
[  258.792030] BTRFS info (device sda1): scrub: started on devid 1

 the magic generation gap you mention is
15, I'm more suspicious about me

[  258.792062] BTRFS info (device sda1): scrub: started on devid 2
[  411.849669] ------------[ cut here ]------------
[  411.849786] WARNING: CPU: 0 PID: 218 at fs/btrfs/disk-io.c:531
btree_csum_one_bio+0x22c/0x278 [btrfs]
[  411.849789] Modules linked in: cfg80211 rfkill 8021q ip6table_nat
iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6
nf_defrag_ipv4 ip6t_rpfilter ipt_rpfilter ip6table_raw iptable_raw
xt_pkttype nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG xt_tcpudp
ip6table_filter ip6_tables iptable_filter phy_generic uio_pdrv_genirq
uio sch_fq_codel loop tun tap macvlan bridge stp llc lm75 ip_tables
x_tables autofs4 dm_mod dax btrfs libcrc32c xor raid6_pq
[  411.849836] CPU: 0 PID: 218 Comm: btrfs-transacti Not tainted 5.4.80 #1-NixOS
[  411.849838] Hardware name: Marvell Armada 380/385 (Device Tree)
[  411.849840] Backtrace:
[  411.849850] [<c010f698>] (dump_backtrace) from [<c010f938>]
(show_stack+0x20/0x24)
[  411.849855]  r7:00000213 r6:600f0013 r5:00000000 r4:c0f8c0c4
[  411.849861] [<c010f918>] (show_stack) from [<c0a1bb28>]
(dump_stack+0x98/0xac)
[  411.849866] [<c0a1ba90>] (dump_stack) from [<c012a998>] (__warn+0xe0/0x108)
[  411.849870]  r7:00000213 r6:bf058f4c r5:00000009 r4:bf120990
[  411.849875] [<c012a8b8>] (__warn) from [<c012ad24>]
(warn_slowpath_fmt+0x74/0xc4)
[  411.849878]  r7:00000213 r6:bf120990 r5:00000000 r4:e1c1c000
[  411.849936] [<c012acb4>] (warn_slowpath_fmt) from [<bf058f4c>]
(btree_csum_one_bio+0x22c/0x278 [btrfs])
[  411.849941]  r9:00000001 r8:e6afec78 r7:e1c1c000 r6:ed7f7000
r5:00001000 r4:8fd00000
[  411.850044] [<bf058d20>] (btree_csum_one_bio [btrfs]) from
[<bf059f84>] (btree_submit_bio_hook+0xe8/0x100 [btrfs])
[  411.850049]  r10:e6afe4c0 r9:ecc57fc0 r8:ecc57f70 r7:ed7f7000
r6:00000000 r5:c1660370
[  411.850051]  r4:bf059e9c
[  411.850154] [<bf059e9c>] (btree_submit_bio_hook [btrfs]) from
[<bf08b1ac>] (submit_one_bio+0x44/0x5c [btrfs])
[  411.850158]  r7:ef2d58ec r6:e1c1dcac r5:00000000 r4:bf059e9c
[  411.850260] [<bf08b168>] (submit_one_bio [btrfs]) from [<bf096664>]
(btree_write_cache_pages+0x380/0x408 [btrfs])
[  411.850263]  r5:00000000 r4:00000000
[  411.850366] [<bf0962e4>] (btree_write_cache_pages [btrfs]) from
[<bf0590b8>] (btree_writepages+0x7c/0x84 [btrfs])
[  411.850371]  r10:00000001 r9:8fd00000 r8:c0280bcc r7:e1c1c000
r6:e1c1dd80 r5:ecc57f70
[  411.850373]  r4:e1c1dd80
[  411.850428] [<bf05903c>] (btree_writepages [btrfs]) from
[<c0284680>] (do_writepages+0x58/0xf4)
[  411.850431]  r5:ecc57f70 r4:ecc57e68
[  411.850439] [<c0284628>] (do_writepages) from [<c0278b68>]
(__filemap_fdatawrite_range+0xf8/0x130)
[  411.850442]  r8:ecc57f70 r7:00001000 r6:8fd0bfff r5:e1c1c000 r4:ecc57e68
[  411.850448] [<c0278a70>] (__filemap_fdatawrite_range) from
[<c0278cf0>] (filemap_fdatawrite_range+0x2c/0x34)
[  411.850452]  r10:ecc57f70 r9:00001000 r8:8fd0bfff r7:e1c1de4c
r6:e0202428 r5:00001000
[  411.850454]  r4:8fd0bfff
[  411.850509] [<c0278cc4>] (filemap_fdatawrite_range) from
[<bf060544>] (btrfs_write_marked_extents+0x9c/0x1b0 [btrfs])
[  411.850512]  r5:00000001 r4:00000000
[  411.850615] [<bf0604a8>] (btrfs_write_marked_extents [btrfs]) from
[<bf0606f0>] (btrfs_write_and_wait_transaction+0x54/0xa4 [btrfs])
[  411.850620]  r10:e1c1c000 r9:ed7f7010 r8:ed7f7000 r7:e0202428
r6:ed7f7000 r5:e1c1c000
[  411.850621]  r4:e56433f0
[  411.850724] [<bf06069c>] (btrfs_write_and_wait_transaction [btrfs])
from [<bf062428>] (btrfs_commit_transaction+0x75c/0xc94 [btrfs])
[  411.850728]  r8:ed7f7418 r7:e0202400 r6:ed7f7000 r5:e56433f0 r4:00000000
[  411.850831] [<bf061ccc>] (btrfs_commit_transaction [btrfs]) from
[<bf05cd98>] (transaction_kthread+0x19c/0x1e0 [btrfs])
[  411.850836]  r10:ed7f728c r9:00000000 r8:001abf6a r7:00000064
r6:ed7f7414 r5:00000bb8
[  411.850838]  r4:ed7f7000
[  411.850893] [<bf05cbfc>] (transaction_kthread [btrfs]) from
[<c014fab4>] (kthread+0x170/0x174)
[  411.850898]  r10:ecbddbfc r9:bf05cbfc r8:ed669800 r7:e1c1c000
r6:00000000 r5:ed560d40
[  411.850899]  r4:ed54eb00
[  411.850904] [<c014f944>] (kthread) from [<c01010e8>]
(ret_from_fork+0x14/0x2c)
[  411.850907] Exception stack(0xe1c1dfb0 to 0xe1c1dff8)
[  411.850911] dfa0:                                     00000000
00000000 00000000 00000000
[  411.850915] dfc0: 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000
[  411.850919] dfe0: 00000000 00000000 00000000 00000000 00000013 00000000
[  411.850923]  r10:00000000 r9:00000000 r8:00000000 r7:00000000
r6:00000000 r5:c014f944
[  411.850925]  r4:ed560d40
[  411.850976] ---[ end trace 1bbf4c394bd7fb26 ]---
[  411.850986] BTRFS warning (device sda1): page_start and eb_start
mismatch, eb_start=17594598817792 page_start=2412773376
[  411.857914] BTRFS: error (device sda1) in
btrfs_commit_transaction:2279: errno=-5 IO failure (Error while
writing out transaction)
[  411.869764] BTRFS info (device sda1): forced readonly
[  411.869772] BTRFS warning (device sda1): Skipping commit of aborted
transaction.
[  411.869778] BTRFS: error (device sda1) in cleanup_transaction:1832:
errno=-5 IO failure
[  411.877843] BTRFS info (device sda1): delayed_refs has NO entry
[  411.878076] BTRFS warning (device sda1): Skipping commit of aborted
transaction.
[  411.882940] BTRFS warning (device sda1): failed setting block group ro: -30
[  411.882956] BTRFS info (device sda1): scrub: not finished on devid
3 with status: -30
[  411.883030] BTRFS warning (device sda1): failed setting block group ro: -30
[  411.883039] BTRFS info (device sda1): scrub: not finished on devid
4 with status: -30
[  411.883104] BTRFS warning (device sda1): failed setting block group ro: -30
[  411.883116] BTRFS info (device sda1): scrub: not finished on devid
1 with status: -30
[  411.883191] BTRFS warning (device sda1): failed setting block group ro: -30
[  411.883200] BTRFS info (device sda1): scrub: not finished on devid
2 with status: -30
[  411.953186] BTRFS info (device sda1): delayed_refs has NO entry
[  576.361548] BTRFS critical (device sda1): unable to find logical
2412789760 length 4096
[  576.369639] BTRFS critical (device sda1): unable to find logical
2412789760 length 4096
[  576.378477] BTRFS critical (device sda1): unable to find logical
2412789760 length 4096
[  576.386936] BTRFS critical (device sda1): unable to find logical
2412789760 length 4096
[  576.395093] BTRFS critical (device sda1): unable to find logical
2412789760 length 4096
[  576.396830] BTRFS error (device sda1): parent transid verify failed
on 12756293910528 wanted 1752939 found 1752922
[  576.403153] BTRFS critical (device sda1): unable to find logical
2412789760 length 4096
[  576.450091] BTRFS error (device sda1): parent transid verify failed
on 12756293910528 wanted 1752939 found 1752922
[  576.462821] BTRFS error (device sda1): parent transid verify failed
on 12756293910528 wanted 1752939 found 1752922
[  576.474987] BTRFS error (device sda1): parent transid verify failed
on 12756293910528 wanted 1752939 found 1752922
[  576.486882] BTRFS error (device sda1): parent transid verify failed
on 12756293910528 wanted 1752939 found 1752922
[  576.497905] BTRFS error (device sda1): parent transid verify failed
on 12756293910528 wanted 1752939 found 1752922
[  576.824673] BTRFS error (device sda1): parent transid verify failed
on 12756293910528 wanted 1752939 found 1752922
[  576.838414] BTRFS error (device sda1): parent transid verify failed
on 12756293910528 wanted 1752939 found 1752922
[  577.324362] BTRFS error (device sda1): parent transid verify failed
on 12756293910528 wanted 1752939 found 1752922
[  577.337736] BTRFS error (device sda1): parent transid verify failed
on 12756293910528 wanted 1752939 found 1752922
[  578.462271] BTRFS critical (device sda1): unable to find logical
2412789760 length 4096
[  578.470347] BTRFS critical (device sda1): unable to find logical
2412789760 length 4096
[  578.479476] BTRFS critical (device sda1): unable to find logical
2412789760 length 4096
[  578.488099] BTRFS critical (device sda1): unable to find logical
2412789760 length 4096
[  578.496793] BTRFS critical (device sda1): unable to find logical
2412789760 length 4096
[  578.505280] BTRFS critical (device sda1): unable to find logical
2412789760 length 4096
[  580.539117] BTRFS critical (device sda1): unable to find logical
2412789760 length 4096
[  580.547207] BTRFS critical (device sda1): unable to find logical
2412789760 length 4096
[  580.556371] BTRFS critical (device sda1): unable to find logical
2412789760 length 4096
[  580.564922] BTRFS critical (device sda1): unable to find logical
2412789760 length 4096
[  580.573356] BTRFS critical (device sda1): unable to find logical
2412789760 length 4096
[  580.581573] BTRFS critical (device sda1): unable to find logical
2412789760 length 4096
[  581.427776] verify_parent_transid: 28 callbacks suppressed
... many more of these
```

On Sun, Dec 20, 2020 at 8:30 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2020/12/19 下午6:35, Joe Hermaszewski wrote:
> > Ok, so I managed to get hold of a 64bit machine on which to run btrfs
> > check. `btrfs check` returns exactly the same output as the armv7 box
> > (no serious problems) which I suppose is good. `btrfs scrub` also
> > finds no problems. Boring Logs below.
> >
> > What I don't quite understand is how the scrub problem on armv7l is so
> > reliable when it's not persisted on the disks, is the same physical
> > memory location being used for this breaking value, or is it perhaps a
> > specific pattern of data on the bus which causes this?
>
> If it's so reliably reproducible, then I guess that would be the case,
> either a specific memory range has the problem, or a specific pattern on
> the bus causing the problem.
>
> >
> > If it's the former, how easy would it be to find this broken location
> > and blacklist it? If it's the latter then I guess there's no hope but
> > to try replacing the psu/machine.
> >
> > The machine survived a couple of days of memtester (on about 95% of
> > the RAM) and `7z b` with no problems, *shrug*
>
> The memtester should rule out the former case, the latter case may be
> resolved by newer kernel if it's not a hardware problem but a software one.
>
> Thanks,
> Qu
>
> >
> > Best wishes and thanks for the generous help so far.
> > Joe
> >
> > btrfs scrub aarch64:
> > ```
> > [j@nixos:~]$ sudo btrfs scrub status -d /mnt
> > UUID:             b8f4ad49-29c8-4d19-a886-cef9c487f124
> > scrub device /dev/sda1 (id 1) history
> > Scrub started:    Fri Dec 18 14:24:30 2020
> > Status:           finished
> > Duration:         7:36:31
> > Total to scrub:   2.40TiB
> > Rate:             91.95MiB/s
> > Error summary:    no errors found
> > scrub device /dev/sdb1 (id 2) history
> > Scrub started:    Fri Dec 18 14:24:30 2020
> > Status:           finished
> > Duration:         7:12:51
> > Total to scrub:   2.40TiB
> > Rate:             96.90MiB/s
> > Error summary:    no errors found
> > scrub device /dev/sdd1 (id 3) history
> > Scrub started:    Fri Dec 18 14:24:30 2020
> > Status:           finished
> > Duration:         19:47:01
> > Total to scrub:   7.86TiB
> > Rate:             115.70MiB/s
> > Error summary:    no errors found
> > scrub device /dev/sdc1 (id 4) history
> > Scrub started:    Fri Dec 18 14:24:30 2020
> > Status:           finished
> > Duration:         19:46:38
> > Total to scrub:   7.86TiB
> > Rate:             115.74MiB/s
> > Error summary:    no errors found
> > ```
> >
> > btrfs check aarch64:
> > ```
> > [nixos@nixos:/]$ sudo btrfs check --readonly /dev/sda1
> > Opening filesystem to check...
> > Checking filesystem on /dev/sda1
> > UUID: b8f4ad49-29c8-4d19-a886-cef9c487f124
> > [1/7] checking root items
> > [2/7] checking extents
> > [3/7] checking free space cache
> > [4/7] checking fs roots
> > root 294 inode 24665 errors 100, file extent discount
> > Found file extent holes:
> >          start: 3709534208, len: 163840
> > root 294 inode 406548 errors 100, file extent discount
> > Found file extent holes:
> >          start: 98729984, len: 720896
> > ERROR: errors found in fs roots
> > found 11280701063168 bytes used, error(s) found
> > total csum bytes: 10937464120
> > total tree bytes: 18538053632
> > total fs tree bytes: 5877579776
> > total extent tree bytes: 534052864
> > btree space waste bytes: 2316660292
> > file data blocks allocated: 17244587220992
> >   referenced 14211684794368%
> > ```
> >
> >
> > On Sat, Nov 28, 2020 at 8:46 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>
> >>
> >>
> >> On 2020/11/27 下午11:15, Joe Hermaszewski wrote:
> >>> Hi Qu,
> >>>
> >>> Thanks for the patch. I recompiled the kernel ran the scrub and your
> >>> patch worked as expected, here is the log:
> >>>
> >>> ```
> >>> [  337.365239] BTRFS info (device sda1): scrub: started on devid 2
> >>> [  337.366283] BTRFS info (device sda1): scrub: started on devid 1
> >>> [  337.402822] BTRFS info (device sda1): scrub: started on devid 3
> >>> [  337.411944] BTRFS info (device sda1): scrub: started on devid 4
> >>> [  471.997496] ------------[ cut here ]------------
> >>> [  471.997614] WARNING: CPU: 0 PID: 218 at fs/btrfs/disk-io.c:531
> >>> btree_csum_one_bio+0x22c/0x278 [btrfs]
> >>> [  471.997616] Modules linked in: cfg80211 rfkill 8021q ip6table_nat
> >>> iptable_nat nf_nat ftdi_sio phy_generic usbserial xt_conntrack
> >>> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_rpfilter ipt_rpfilter
> >>> ip6table_raw uio_pdrv_genirq iptable_raw uio xt_pkttype nf_log_ipv6
> >>> nf_log_ipv4 nf_log_common xt_LOG xt_tcpudp ip6table_filter ip6_tables
> >>> iptable_filter sch_fq_codel loop tun tap macvlan bridge stp llc lm75
> >>> ip_tables x_tables autofs4 dm_mod dax btrfs libcrc32c xor raid6_pq
> >>> [  471.997666] CPU: 0 PID: 218 Comm: btrfs-transacti Not tainted 5.4.78 #1-NixOS
> >>> [  471.997668] Hardware name: Marvell Armada 380/385 (Device Tree)
> >>> [  471.997669] Backtrace:
> >>> [  471.997680] [<c010f698>] (dump_backtrace) from [<c010f938>]
> >>> (show_stack+0x20/0x24)
> >>> [  471.997684]  r7:00000213 r6:600c0013 r5:00000000 r4:c0f8c044
> >>> [  471.997691] [<c010f918>] (show_stack) from [<c0a1b848>]
> >>> (dump_stack+0x98/0xac)
> >>> [  471.997696] [<c0a1b7b0>] (dump_stack) from [<c012a998>] (__warn+0xe0/0x108)
> >>> [  471.997700]  r7:00000213 r6:bf058f4c r5:00000009 r4:bf120990
> >>> [  471.997704] [<c012a8b8>] (__warn) from [<c012ad24>]
> >>> (warn_slowpath_fmt+0x74/0xc4)
> >>> [  471.997707]  r7:00000213 r6:bf120990 r5:00000000 r4:e1822000
> >>> [  471.997765] [<c012acb4>] (warn_slowpath_fmt) from [<bf058f4c>]
> >>> (btree_csum_one_bio+0x22c/0x278 [btrfs])
> >>> [  471.997770]  r9:00000001 r8:c10decb8 r7:e1822000 r6:ed66b000
> >>> r5:00001000 r4:4fd00000
> >>> [  471.997872] [<bf058d20>] (btree_csum_one_bio [btrfs]) from
> >>> [<bf059f84>] (btree_submit_bio_hook+0xe8/0x100 [btrfs])
> >>> [  471.997877]  r10:e2197b48 r9:ec845fc0 r8:ec845f70 r7:ed66b000
> >>> r6:00000000 r5:e2b74af0
> >>> [  471.997879]  r4:bf059e9c
> >>> [  471.997981] [<bf059e9c>] (btree_submit_bio_hook [btrfs]) from
> >>> [<bf08b1ac>] (submit_one_bio+0x44/0x5c [btrfs])
> >>> [  471.997985]  r7:ef3724d4 r6:e1823cac r5:00000000 r4:bf059e9c
> >>> [  471.998087] [<bf08b168>] (submit_one_bio [btrfs]) from [<bf096664>]
> >>> (btree_write_cache_pages+0x380/0x408 [btrfs])
> >>> [  471.998090]  r5:00000000 r4:00000000
> >>> [  471.998193] [<bf0962e4>] (btree_write_cache_pages [btrfs]) from
> >>> [<bf0590b8>] (btree_writepages+0x7c/0x84 [btrfs])
> >>> [  471.998197]  r10:00000001 r9:4fd00000 r8:c0280d14 r7:e1822000
> >>> r6:e1823d80 r5:ec845f70
> >>> [  471.998199]  r4:e1823d80
> >>> [  471.998255] [<bf05903c>] (btree_writepages [btrfs]) from
> >>> [<c02847c8>] (do_writepages+0x58/0xf4)
> >>> [  471.998258]  r5:ec845f70 r4:ec845e68
> >>> [  471.998266] [<c0284770>] (do_writepages) from [<c0278cb0>]
> >>> (__filemap_fdatawrite_range+0xf8/0x130)
> >>> [  471.998270]  r8:ec845f70 r7:00001000 r6:4fd0bfff r5:e1822000 r4:ec845e68
> >>> [  471.998275] [<c0278bb8>] (__filemap_fdatawrite_range) from
> >>> [<c0278e38>] (filemap_fdatawrite_range+0x2c/0x34)
> >>> [  471.998279]  r10:ec845f70 r9:00001000 r8:4fd0bfff r7:e1823e4c
> >>> r6:c1955e28 r5:00001000
> >>> [  471.998281]  r4:4fd0bfff
> >>> [  471.998335] [<c0278e0c>] (filemap_fdatawrite_range) from
> >>> [<bf060544>] (btrfs_write_marked_extents+0x9c/0x1b0 [btrfs])
> >>> [  471.998338]  r5:00000001 r4:00000000
> >>> [  471.998441] [<bf0604a8>] (btrfs_write_marked_extents [btrfs]) from
> >>> [<bf0606f0>] (btrfs_write_and_wait_transaction+0x54/0xa4 [btrfs])
> >>> [  471.998446]  r10:e1822000 r9:ed66b010 r8:ed66b000 r7:c1955e28
> >>> r6:ed66b000 r5:e1822000
> >>> [  471.998447]  r4:eddc4f30
> >>> [  471.998551] [<bf06069c>] (btrfs_write_and_wait_transaction [btrfs])
> >>> from [<bf062428>] (btrfs_commit_transaction+0x75c/0xc94 [btrfs])
> >>> [  471.998555]  r8:ed66b418 r7:c1955e00 r6:ed66b000 r5:eddc4f30 r4:00000000
> >>> [  471.998658] [<bf061ccc>] (btrfs_commit_transaction [btrfs]) from
> >>> [<bf05cd98>] (transaction_kthread+0x19c/0x1e0 [btrfs])
> >>> [  471.998663]  r10:ed66b28c r9:00000000 r8:001aab57 r7:00000064
> >>> r6:ed66b414 r5:00000bb8
> >>> [  471.998664]  r4:ed66b000
> >>> [  471.998719] [<bf05cbfc>] (transaction_kthread [btrfs]) from
> >>> [<c014facc>] (kthread+0x170/0x174)
> >>> [  471.998724]  r10:ed765bfc r9:bf05cbfc r8:ed56a000 r7:e1822000
> >>> r6:00000000 r5:ed5c8600
> >>> [  471.998726]  r4:ed5c8740
> >>> [  471.998731] [<c014f95c>] (kthread) from [<c01010e8>]
> >>> (ret_from_fork+0x14/0x2c)
> >>> [  471.998733] Exception stack(0xe1823fb0 to 0xe1823ff8)
> >>> [  471.998737] 3fa0:                                     00000000
> >>> 00000000 00000000 00000000
> >>> [  471.998741] 3fc0: 00000000 00000000 00000000 00000000 00000000
> >>> 00000000 00000000 00000000
> >>> [  471.998745] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
> >>> [  471.998749]  r10:00000000 r9:00000000 r8:00000000 r7:00000000
> >>> r6:00000000 r5:c014f95c
> >>> [  471.998751]  r4:ed5c8600
> >>> [  471.998820] ---[ end trace 6a9fb9f547659a4d ]---
> >>> [  471.998855] BTRFS warning (device sda1): page_start and eb_start
> >>> mismatch, eb_start=17593525075968 page_start=1339031552
> >>> [  472.015514] BTRFS: error (device sda1) in
> >>> btrfs_commit_transaction:2279: errno=-5 IO failure (Error while
> >>> writing out transaction)
> >>> [  472.027321] BTRFS info (device sda1): forced readonly
> >>> [  472.027326] BTRFS warning (device sda1): Skipping commit of aborted
> >>> transaction.
> >>> [  472.027330] BTRFS: error (device sda1) in cleanup_transaction:1832:
> >>> errno=-5 IO failure
> >>> [  472.035378] BTRFS info (device sda1): delayed_refs has NO entry
> >>> [  472.039102] BTRFS warning (device sda1): failed setting block group ro: -30
> >>> [  472.039107] BTRFS info (device sda1): scrub: not finished on devid
> >>> 3 with status: -30
> >>> [  472.039114] BTRFS info (device sda1): scrub: not finished on devid
> >>> 4 with status: -30
> >>> [  472.039227] BTRFS warning (device sda1): failed setting block group ro: -30
> >>> [  472.039230] BTRFS warning (device sda1): failed setting block group ro: -30
> >>> [  472.039237] BTRFS info (device sda1): scrub: not finished on devid
> >>> 2 with status: -30
> >>> [  472.039243] BTRFS info (device sda1): scrub: not finished on devid
> >>> 1 with status: -30
> >>> ```
> >>>
> >>> To rob you of the suspense:
> >>>
> >>> printf "0x%016x\n" (17593525075968 - 1339031552)
> >>> 0x0000100000000000
> >>>
> >>> Very surprising considering that this machine has ECC ram!
> >>
> >> ARMv7 with ECC, and we still had such bitflip!?
> >>
> >> I guess everything is possible in 2020.
> >>
> >>>
> >>> What would you recommend as a fix?
> >>
> >> It's a hardware problem, I really don't have any recommendation at all.
> >>
> >> I guess you may want to request an RMA?
> >>
> >> Thanks,
> >> Qu
> >>
> >>>
> >>> Best,
> >>> Joe
> >>>
> >>> On Thu, Nov 26, 2020 at 7:05 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 2020/11/26 下午6:53, Joe Hermaszewski wrote:
> >>>>> Hi Qu,
> >>>>>
> >>>>> This is the output of btrfs check with the device unmounted:
> >>>>>
> >>>>> ```
> >>>>> [root@nixos:~]# btrfs check --readonly /dev/sda1
> >>>>> Opening filesystem to check...
> >>>>> Checking filesystem on /dev/sda1
> >>>>> UUID: b8f4ad49-29c8-4d19-a886-cef9c487f124
> >>>>> [1/7] checking root items
> >>>>> [2/7] checking extents
> >>>>> [3/7] checking free space cache
> >>>>> [4/7] checking fs roots
> >>>>> root 294 inode 24665 errors 100, file extent discount
> >>>>> Found file extent holes:
> >>>>>          start: 3709534208, len: 163840
> >>>>> root 294 inode 406548 errors 100, file extent discount
> >>>>> Found file extent holes:
> >>>>>          start: 0, len: 99450880
> >>>>> ERROR: errors found in fs roots
> >>>>> found 11279251902464 bytes used, error(s) found
> >>>>> total csum bytes: 10935906504
> >>>>> total tree bytes: 18455740416
> >>>>> total fs tree bytes: 5798936576
> >>>>> total extent tree bytes: 532250624
> >>>>> btree space waste bytes: 2308849707
> >>>>> file data blocks allocated: 17243418611712
> >>>>>   referenced 14210256932864
> >>>>> ```
> >>>>>
> >>>>> Looks the same as before...
> >>>>
> >>>> Then it means, at least your metadata is pretty safe. Nothing really
> >>>> needs to be worried (so far).
> >>>>
> >>>> This also means, the problem only happens at runtime, not something on-disk.
> >>>>
> >>>> If you are OK to re-compile the kernel, and the warning message is
> >>>> always the same at line number 531 of disk-io.c, then I recommend the
> >>>> attached debug diff
> >>>>
> >>>> The diff will output the found_start (from eb) and page_start (from page).
> >>>> If it's memory bitflip, we should have a pretty signature diff at the
> >>>> output.
> >>>> If not memory bitflip, then we really have a bug to dig.
> >>>>
> >>>>
> >>>>>
> >>>>> I'm afraid I don't have the precise details of the transid error
> >>>>> saved. I think I meant to write earlier that the message that comes up
> >>>>> now and then is the "errno=-5 IO failure (Error while writing out
> >>>>> transaction)". I've not seen the transid mismatch one since the first
> >>>>> incident, sorry for that slip-up.
> >>>>>
> >>>>> I'll try and attach to a known good system, will I have to attach the
> >>>>> whole array because I don't think I have that many enclosures for the
> >>>>> drives.
> >>>>
> >>>> You only need to mount the fs you're seeing the problem.
> >>>> But if that fs is on multiple devices, then I guess you need to mount
> >>>> them all.
> >>>>
> >>>> Thanks,
> >>>> Qu
> >>>>>
> >>>>> I'll also run a memtest on this box.
> >>>>>
> >>>>> Thank you for the helpful response.
> >>>>>
> >>>>> Best,
> >>>>> Joe
> >>>>>
> >>>>> On Thu, Nov 26, 2020 at 2:15 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 2020/11/25 下午11:28, Joe Hermaszewski wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> I have a arm32 machine with four drives with a btrfs fs spanning then in RAID1.
> >>>>>>> The filesystem has started behaving badly recently and I'm writing to:
> >>>>>>>
> >>>>>>> - Solicit advice on how best to get the system back to a stable state
> >>>>>>> - Report a potential bug
> >>>>>>>
> >>>>>>> ## What happened:
> >>>>>>>
> >>>>>>> A couple of days ago I could no longer ssh into it, and on the serial
> >>>>>>> connection there were heaps of messages (and new ones appearing with great
> >>>>>>> frequency) along the lines of: `parent transid verify failed on blah... wanted
> >>>>>>> x got y`.
> >>>>>>>
> >>>>>>> Although I don't have a record of the precise messages I do remember that there
> >>>>>>> was a difference of `15` between x and y.
> >>>>>>
> >>>>>> This normally can be a sign of unreliable HDD, which lies on FLUSH,
> >>>>>> killing metadata COW.
> >>>>>>
> >>>>>> But, your btrfs check doesn't report the same problem, thus I'm confused.
> >>>>>>
> >>>>>> Would you please try to run a "btrfs check --readonly /dev/sda1" with
> >>>>>> the fs unmounted?
> >>>>>>
> >>>>>> And, would you provide the full dmesg of that mismatch?
> >>>>>> The reason for the exact number is, I'm suspecting hardware memory
> >>>>>> corruption.
> >>>>>>
> >>>>>>>
> >>>>>>> I power-cycled system and started a scrub after it rebooted, this was
> >>>>>>> interrupted quite promptly by several more errors in btrfs, and the disk
> >>>>>>> remounted RO.
> >>>>>>>
> >>>>>>> Every now and then in the kernel log I get messages like:
> >>>>>>>
> >>>>>>> `parent transid verify failed on blah... wanted x got y`
> >>>>>>
> >>>>>> Not showing up in the gist dmesg though.
> >>>>>>
> >>>>>>>
> >>>>>>> ## Important info
> >>>>>>>
> >>>>>>> The dev stats are all zero.
> >>>>>>>
> >>>>>>> Here are the outputs of some btrfs commands, dmesg and the kernel log from the
> >>>>>>> previous two boots: https://gist.github.com/b1beab134403c5047e2efbceb98985f9
> >>>>>>>
> >>>>>>> The "cut here" portion of the kernel log is as follows
> >>>>>>>
> >>>>>>> ```
> >>>>>>> [  409.158097] ------------[ cut here ]------------
> >>>>>>> [  409.158205] WARNING: CPU: 1 PID: 217 at fs/btrfs/disk-io.c:531
> >>>>>>> btree_csum_one_bio+0x208/0x248 [btrfs]
> >>>>>>
> >>>>>> The line number shows, the tree bytenr doesn't match with the one in memory.
> >>>>>> This is really too rare to be seen, especially when we have no other
> >>>>>> error reported from btrfs (at least in the gist)
> >>>>>>
> >>>>>> Since there is no other problems showing up in the gist, it means it
> >>>>>> could be a bit flip, considering the magic generation gap you mention is
> >>>>>> 15, I'm more suspicious about memory bit flip.
> >>>>>>
> >>>>>> If you can provide the full parent transid mismatch error message, it
> >>>>>> may help to determine the possibility of hardware memory corruption.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Qu
> >>>>>>
> >>>>>>> [  409.158208] Modules linked in: cfg80211 rfkill 8021q ip6table_nat
> >>>>>>> iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6
> >>>>>>> nf_defrag_ipv4 ip6t_rpfilter ipt_rpfilter ip6table_raw iptable_raw
> >>>>>>> xt_pkttype nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG xt_tcpudp
> >>>>>>> ftdi_sio usbserial phy_generic uio_pdrv_genirq uio ip6table_filter
> >>>>>>> ip6_tables iptable_filter sch_fq_codel loop tun tap macvlan bridge stp
> >>>>>>> llc lm75 ip_tables x_tables autofs4 dm_mod dax btrfs libcrc32c xor
> >>>>>>> raid6_pq
> >>>>>>> [  409.158258] CPU: 1 PID: 217 Comm: btrfs-transacti Not tainted 5.4.77 #1-NixOS
> >>>>>>> [  409.158260] Hardware name: Marvell Armada 380/385 (Device Tree)
> >>>>>>> [  409.158261] Backtrace:
> >>>>>>> [  409.158272] [<c010f698>] (dump_backtrace) from [<c010f938>]
> >>>>>>> (show_stack+0x20/0x24)
> >>>>>>> [  409.158277]  r7:00000213 r6:600f0013 r5:00000000 r4:c0f8c044
> >>>>>>> [  409.158283] [<c010f918>] (show_stack) from [<c0a1b388>]
> >>>>>>> (dump_stack+0x98/0xac)
> >>>>>>> [  409.158288] [<c0a1b2f0>] (dump_stack) from [<c012a998>] (__warn+0xe0/0x108)
> >>>>>>> [  409.158292]  r7:00000213 r6:bf058ec8 r5:00000009 r4:bf120990
> >>>>>>> [  409.158296] [<c012a8b8>] (__warn) from [<c012ad24>]
> >>>>>>> (warn_slowpath_fmt+0x74/0xc4)
> >>>>>>> [  409.158300]  r7:00000213 r6:bf120990 r5:00000000 r4:e2392000
> >>>>>>> [  409.158358] [<c012acb4>] (warn_slowpath_fmt) from [<bf058ec8>]
> >>>>>>> (btree_csum_one_bio+0x208/0x248 [btrfs])
> >>>>>>> [  409.158363]  r9:e277abe0 r8:00000001 r7:e2392000 r6:ea3d17f0
> >>>>>>> r5:00000000 r4:eefd2d3c
> >>>>>>> [  409.158465] [<bf058cc0>] (btree_csum_one_bio [btrfs]) from
> >>>>>>> [<bf059ef4>] (btree_submit_bio_hook+0xe8/0x100 [btrfs])
> >>>>>>> [  409.158470]  r10:e32ce170 r9:ecc45fc0 r8:ecc45f70 r7:ec82b000
> >>>>>>> r6:00000000 r5:ea3d17f0
> >>>>>>> [  409.158472]  r4:bf059e0c
> >>>>>>> [  409.158575] [<bf059e0c>] (btree_submit_bio_hook [btrfs]) from
> >>>>>>> [<bf08b11c>] (submit_one_bio+0x44/0x5c [btrfs])
> >>>>>>> [  409.158578]  r7:ef36c048 r6:e2393cac r5:00000000 r4:bf059e0c
> >>>>>>> [  409.158683] [<bf08b0d8>] (submit_one_bio [btrfs]) from [<bf0965d4>]
> >>>>>>> (btree_write_cache_pages+0x380/0x408 [btrfs])
> >>>>>>> [  409.158686]  r5:00000000 r4:00000000
> >>>>>>> [  409.158788] [<bf096254>] (btree_write_cache_pages [btrfs]) from
> >>>>>>> [<bf059028>] (btree_writepages+0x7c/0x84 [btrfs])
> >>>>>>> [  409.158793]  r10:00000001 r9:4fd00000 r8:c0280c94 r7:e2392000
> >>>>>>> r6:e2393d80 r5:ecc45f70
> >>>>>>> [  409.158794]  r4:e2393d80
> >>>>>>> [  409.158850] [<bf058fac>] (btree_writepages [btrfs]) from
> >>>>>>> [<c0284748>] (do_writepages+0x58/0xf4)
> >>>>>>> [  409.158852]  r5:ecc45f70 r4:ecc45e68
> >>>>>>> [  409.158860] [<c02846f0>] (do_writepages) from [<c0278c30>]
> >>>>>>> (__filemap_fdatawrite_range+0xf8/0x130)
> >>>>>>> [  409.158864]  r8:ecc45f70 r7:00001000 r6:4fd0bfff r5:e2392000 r4:ecc45e68
> >>>>>>> [  409.158869] [<c0278b38>] (__filemap_fdatawrite_range) from
> >>>>>>> [<c0278db8>] (filemap_fdatawrite_range+0x2c/0x34)
> >>>>>>> [  409.158874]  r10:ecc45f70 r9:00001000 r8:4fd0bfff r7:e2393e4c
> >>>>>>> r6:c73bc628 r5:00001000
> >>>>>>> [  409.158875]  r4:4fd0bfff
> >>>>>>> [  409.158929] [<c0278d8c>] (filemap_fdatawrite_range) from
> >>>>>>> [<bf0604b4>] (btrfs_write_marked_extents+0x9c/0x1b0 [btrfs])
> >>>>>>> [  409.158931]  r5:00000001 r4:00000000
> >>>>>>> [  409.159033] [<bf060418>] (btrfs_write_marked_extents [btrfs]) from
> >>>>>>> [<bf060660>] (btrfs_write_and_wait_transaction+0x54/0xa4 [btrfs])
> >>>>>>> [  409.159038]  r10:e2392000 r9:ec82b010 r8:ec82b000 r7:c73bc628
> >>>>>>> r6:ec82b000 r5:e2392000
> >>>>>>> [  409.159040]  r4:c8b81ca8
> >>>>>>> [  409.159141] [<bf06060c>] (btrfs_write_and_wait_transaction [btrfs])
> >>>>>>> from [<bf062398>] (btrfs_commit_transaction+0x75c/0xc94 [btrfs])
> >>>>>>> [  409.159145]  r8:ec82b418 r7:c73bc600 r6:ec82b000 r5:c8b81ca8 r4:00000000
> >>>>>>> [  409.159248] [<bf061c3c>] (btrfs_commit_transaction [btrfs]) from
> >>>>>>> [<bf05cd08>] (transaction_kthread+0x19c/0x1e0 [btrfs])
> >>>>>>> [  409.159253]  r10:ec82b28c r9:00000000 r8:001aaafa r7:00000064
> >>>>>>> r6:ec82b414 r5:00000bb8
> >>>>>>> [  409.159254]  r4:ec82b000
> >>>>>>> [  409.159309] [<bf05cb6c>] (transaction_kthread [btrfs]) from
> >>>>>>> [<c014fabc>] (kthread+0x170/0x174)
> >>>>>>> [  409.159313]  r10:eca87bfc r9:bf05cb6c r8:ed619000 r7:e2392000
> >>>>>>> r6:00000000 r5:ed5ee700
> >>>>>>> [  409.159315]  r4:ed5ee1c0
> >>>>>>> [  409.159320] [<c014f94c>] (kthread) from [<c01010e8>]
> >>>>>>> (ret_from_fork+0x14/0x2c)
> >>>>>>> [  409.159322] Exception stack(0xe2393fb0 to 0xe2393ff8)
> >>>>>>> [  409.159326] 3fa0:                                     00000000
> >>>>>>> 00000000 00000000 00000000
> >>>>>>> [  409.159331] 3fc0: 00000000 00000000 00000000 00000000 00000000
> >>>>>>> 00000000 00000000 00000000
> >>>>>>> [  409.159334] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
> >>>>>>> [  409.159338]  r10:00000000 r9:00000000 r8:00000000 r7:00000000
> >>>>>>> r6:00000000 r5:c014f94c
> >>>>>>> [  409.159340]  r4:ed5ee700
> >>>>>>> [  409.159342] ---[ end trace eea59ced12fa7859 ]---
> >>>>>>> [  409.165084] BTRFS: error (device sda1) in
> >>>>>>> btrfs_commit_transaction:2279: errno=-5 IO failure (Error while
> >>>>>>> writing out transaction)
> >>>>>>> [  409.176920] BTRFS info (device sda1): forced readonly
> >>>>>>> [  409.176947] BTRFS warning (device sda1): Skipping commit of aborted
> >>>>>>> transaction.
> >>>>>>> [  409.176952] BTRFS: error (device sda1) in cleanup_transaction:1832:
> >>>>>>> errno=-5 IO failure
> >>>>>>> [  409.185049] BTRFS info (device sda1): delayed_refs has NO entry
> >>>>>>> [  409.310199] BTRFS info (device sda1): scrub: not finished on devid
> >>>>>>> 3 with status: -125
> >>>>>>> [  409.664880] BTRFS info (device sda1): scrub: not finished on devid
> >>>>>>> 4 with status: -125
> >>>>>>> [  410.106791] BTRFS info (device sda1): scrub: not finished on devid
> >>>>>>> 1 with status: -125
> >>>>>>> [  411.268585] BTRFS warning (device sda1): failed setting block group ro: -30
> >>>>>>> [  411.268594] BTRFS info (device sda1): scrub: not finished on devid
> >>>>>>> 2 with status: -30
> >>>>>>> [  411.268605] BTRFS info (device sda1): delayed_refs has NO entry
> >>>>>>> ```
> >>>>>>>
> >>>>>>> Information requested here
> >>>>>>> (https://btrfs.wiki.kernel.org/index.php/Btrfs_mailing_list):
> >>>>>>>
> >>>>>>> ```
> >>>>>>>   $ uname -a
> >>>>>>> Linux thanos 5.4.77 #1-NixOS SMP Tue Nov 10 20:13:20 UTC 2020 armv7l GNU/Linux
> >>>>>>>
> >>>>>>>   $ btrfs --version
> >>>>>>> btrfs-progs v5.7
> >>>>>>>
> >>>>>>>   $ sudo btrfs fi show
> >>>>>>> Label: none  uuid: b8f4ad49-29c8-4d19-a886-cef9c487f124
> >>>>>>>          Total devices 4 FS bytes used 10.26TiB
> >>>>>>>          devid    1 size 3.64TiB used 2.40TiB path /dev/sda1
> >>>>>>>          devid    2 size 3.64TiB used 2.40TiB path /dev/sdc1
> >>>>>>>          devid    3 size 9.09TiB used 7.86TiB path /dev/sdd1
> >>>>>>>          devid    4 size 9.09TiB used 7.86TiB path /dev/sdb1
> >>>>>>>
> >>>>>>> Label: none  uuid: d02a3067-0a23-4c1f-96ac-80dbc26622f2
> >>>>>>>          Total devices 1 FS bytes used 116.35MiB
> >>>>>>>          devid    1 size 399.82MiB used 224.00MiB path /dev/sda2
> >>>>>>>
> >>>>>>>   $ sudo btrfs fi df /
> >>>>>>> Data, RAID1: total=10.25TiB, used=10.24TiB
> >>>>>>> System, RAID1: total=64.00MiB, used=1.45MiB
> >>>>>>> Metadata, RAID1: total=18.00GiB, used=17.19GiB
> >>>>>>> GlobalReserve, single: total=512.00MiB, used=0.00B
> >>>>>>> ```
> >>>>>>>
> >>>>>>> Thanks to demfloro and multicore on #btrfs for prompting this email.
> >>>>>>>
> >>>>>>> Best wishes,
> >>>>>>> Joe
> >>>>>>>
> >>>>>>
> >>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: btrfs crash on armv7
  2021-04-08  8:16               ` Joe Hermaszewski
@ 2021-04-08  8:38                 ` Qu Wenruo
       [not found]                   ` <CA+4cVr8sxGT1Zz+1tz+0OqBCukFgn7d_ZZq31bXASc426YbJ7A@mail.gmail.com>
  2021-04-08 11:15                   ` riteshh
  0 siblings, 2 replies; 14+ messages in thread
From: Qu Wenruo @ 2021-04-08  8:38 UTC (permalink / raw)
  To: Joe Hermaszewski; +Cc: linux-btrfs



On 2021/4/8 下午4:16, Joe Hermaszewski wrote:
> It took a while but I managed to get hold of another one of these
> arm32 boards. Very disappointingly this exact "bitflip" is still
> present (log enclosed).

Yeah, we got to the conclusion it's not bitflip, but completely 32bit
limit on armv7.

For ARMv7, it's a 32bit system, where unsigned long is only 32bit.

This means, things like page->index is only 32bit long, and for 4K page
size, it also means all filesystems (not only btrfs) can only utilize at
most 16T bytes.

But there is pitfall for btrfs, btrfs uses its internal address space
for its meatadata, and the address space is U64.

And furthermore, for btrfs it can have metadata at bytenr way larger
than the total device size.
This is possible because btrfs maps part of its address space to real
disks, thus it can have bytenr way larger than device size.

But this brings to a problem, 32bit Linux can only handle 16T, but in
your case, some of your metadata is already beyond 16T in btrfs address
space.

Then a lot of things are going to be wrong.

I have submitted a patch to do extra check, at least allowing user to
know this is the limit of 32bit:
https://patchwork.kernel.org/project/linux-btrfs/patch/20210225011814.24009-1-wqu@suse.com/

Unfortunately, this will not help existing fs though.

Thanks,
Qu
>
> To summarise, as it's been a while:
>
> - When running scrub, a "page_start" and "eb_start" mismatch is
> detected (off by a single bit).
> - `btrfs check` reports no significant errors on aarch64 or arm32.
> - `btrfs scrub` completes successfully on aarch64!
> - Now, I can confirm that `btrfs scrub` fails in the same manner on
> two arm32 machines.
>
> Not really sure where to go from here. The only things I can think of are:
>
> - Bug in `btrfs scrub` on arm32 (seems unlikely given the single-bit
> error here).
> - A bitflip on the original machine wrote bad data to the disk, and
> somehow only the arm32 implementation catches this...
>
> I can give a newer kernel a try (I'm on 5.4.80 at the moment) I suppose
>
> Best,
> Joe
>
> ```
> $ dmesg
> [  258.791966] BTRFS info (device sda1): scrub: started on devid 4
> [  258.792030] BTRFS info (device sda1): scrub: started on devid 1
>
>   the magic generation gap you mention is
> 15, I'm more suspicious about me
>
> [  258.792062] BTRFS info (device sda1): scrub: started on devid 2
> [  411.849669] ------------[ cut here ]------------
> [  411.849786] WARNING: CPU: 0 PID: 218 at fs/btrfs/disk-io.c:531
> btree_csum_one_bio+0x22c/0x278 [btrfs]
> [  411.849789] Modules linked in: cfg80211 rfkill 8021q ip6table_nat
> iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6
> nf_defrag_ipv4 ip6t_rpfilter ipt_rpfilter ip6table_raw iptable_raw
> xt_pkttype nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG xt_tcpudp
> ip6table_filter ip6_tables iptable_filter phy_generic uio_pdrv_genirq
> uio sch_fq_codel loop tun tap macvlan bridge stp llc lm75 ip_tables
> x_tables autofs4 dm_mod dax btrfs libcrc32c xor raid6_pq
> [  411.849836] CPU: 0 PID: 218 Comm: btrfs-transacti Not tainted 5.4.80 #1-NixOS
> [  411.849838] Hardware name: Marvell Armada 380/385 (Device Tree)
> [  411.849840] Backtrace:
> [  411.849850] [<c010f698>] (dump_backtrace) from [<c010f938>]
> (show_stack+0x20/0x24)
> [  411.849855]  r7:00000213 r6:600f0013 r5:00000000 r4:c0f8c0c4
> [  411.849861] [<c010f918>] (show_stack) from [<c0a1bb28>]
> (dump_stack+0x98/0xac)
> [  411.849866] [<c0a1ba90>] (dump_stack) from [<c012a998>] (__warn+0xe0/0x108)
> [  411.849870]  r7:00000213 r6:bf058f4c r5:00000009 r4:bf120990
> [  411.849875] [<c012a8b8>] (__warn) from [<c012ad24>]
> (warn_slowpath_fmt+0x74/0xc4)
> [  411.849878]  r7:00000213 r6:bf120990 r5:00000000 r4:e1c1c000
> [  411.849936] [<c012acb4>] (warn_slowpath_fmt) from [<bf058f4c>]
> (btree_csum_one_bio+0x22c/0x278 [btrfs])
> [  411.849941]  r9:00000001 r8:e6afec78 r7:e1c1c000 r6:ed7f7000
> r5:00001000 r4:8fd00000
> [  411.850044] [<bf058d20>] (btree_csum_one_bio [btrfs]) from
> [<bf059f84>] (btree_submit_bio_hook+0xe8/0x100 [btrfs])
> [  411.850049]  r10:e6afe4c0 r9:ecc57fc0 r8:ecc57f70 r7:ed7f7000
> r6:00000000 r5:c1660370
> [  411.850051]  r4:bf059e9c
> [  411.850154] [<bf059e9c>] (btree_submit_bio_hook [btrfs]) from
> [<bf08b1ac>] (submit_one_bio+0x44/0x5c [btrfs])
> [  411.850158]  r7:ef2d58ec r6:e1c1dcac r5:00000000 r4:bf059e9c
> [  411.850260] [<bf08b168>] (submit_one_bio [btrfs]) from [<bf096664>]
> (btree_write_cache_pages+0x380/0x408 [btrfs])
> [  411.850263]  r5:00000000 r4:00000000
> [  411.850366] [<bf0962e4>] (btree_write_cache_pages [btrfs]) from
> [<bf0590b8>] (btree_writepages+0x7c/0x84 [btrfs])
> [  411.850371]  r10:00000001 r9:8fd00000 r8:c0280bcc r7:e1c1c000
> r6:e1c1dd80 r5:ecc57f70
> [  411.850373]  r4:e1c1dd80
> [  411.850428] [<bf05903c>] (btree_writepages [btrfs]) from
> [<c0284680>] (do_writepages+0x58/0xf4)
> [  411.850431]  r5:ecc57f70 r4:ecc57e68
> [  411.850439] [<c0284628>] (do_writepages) from [<c0278b68>]
> (__filemap_fdatawrite_range+0xf8/0x130)
> [  411.850442]  r8:ecc57f70 r7:00001000 r6:8fd0bfff r5:e1c1c000 r4:ecc57e68
> [  411.850448] [<c0278a70>] (__filemap_fdatawrite_range) from
> [<c0278cf0>] (filemap_fdatawrite_range+0x2c/0x34)
> [  411.850452]  r10:ecc57f70 r9:00001000 r8:8fd0bfff r7:e1c1de4c
> r6:e0202428 r5:00001000
> [  411.850454]  r4:8fd0bfff
> [  411.850509] [<c0278cc4>] (filemap_fdatawrite_range) from
> [<bf060544>] (btrfs_write_marked_extents+0x9c/0x1b0 [btrfs])
> [  411.850512]  r5:00000001 r4:00000000
> [  411.850615] [<bf0604a8>] (btrfs_write_marked_extents [btrfs]) from
> [<bf0606f0>] (btrfs_write_and_wait_transaction+0x54/0xa4 [btrfs])
> [  411.850620]  r10:e1c1c000 r9:ed7f7010 r8:ed7f7000 r7:e0202428
> r6:ed7f7000 r5:e1c1c000
> [  411.850621]  r4:e56433f0
> [  411.850724] [<bf06069c>] (btrfs_write_and_wait_transaction [btrfs])
> from [<bf062428>] (btrfs_commit_transaction+0x75c/0xc94 [btrfs])
> [  411.850728]  r8:ed7f7418 r7:e0202400 r6:ed7f7000 r5:e56433f0 r4:00000000
> [  411.850831] [<bf061ccc>] (btrfs_commit_transaction [btrfs]) from
> [<bf05cd98>] (transaction_kthread+0x19c/0x1e0 [btrfs])
> [  411.850836]  r10:ed7f728c r9:00000000 r8:001abf6a r7:00000064
> r6:ed7f7414 r5:00000bb8
> [  411.850838]  r4:ed7f7000
> [  411.850893] [<bf05cbfc>] (transaction_kthread [btrfs]) from
> [<c014fab4>] (kthread+0x170/0x174)
> [  411.850898]  r10:ecbddbfc r9:bf05cbfc r8:ed669800 r7:e1c1c000
> r6:00000000 r5:ed560d40
> [  411.850899]  r4:ed54eb00
> [  411.850904] [<c014f944>] (kthread) from [<c01010e8>]
> (ret_from_fork+0x14/0x2c)
> [  411.850907] Exception stack(0xe1c1dfb0 to 0xe1c1dff8)
> [  411.850911] dfa0:                                     00000000
> 00000000 00000000 00000000
> [  411.850915] dfc0: 00000000 00000000 00000000 00000000 00000000
> 00000000 00000000 00000000
> [  411.850919] dfe0: 00000000 00000000 00000000 00000000 00000013 00000000
> [  411.850923]  r10:00000000 r9:00000000 r8:00000000 r7:00000000
> r6:00000000 r5:c014f944
> [  411.850925]  r4:ed560d40
> [  411.850976] ---[ end trace 1bbf4c394bd7fb26 ]---
> [  411.850986] BTRFS warning (device sda1): page_start and eb_start
> mismatch, eb_start=17594598817792 page_start=2412773376
> [  411.857914] BTRFS: error (device sda1) in
> btrfs_commit_transaction:2279: errno=-5 IO failure (Error while
> writing out transaction)
> [  411.869764] BTRFS info (device sda1): forced readonly
> [  411.869772] BTRFS warning (device sda1): Skipping commit of aborted
> transaction.
> [  411.869778] BTRFS: error (device sda1) in cleanup_transaction:1832:
> errno=-5 IO failure
> [  411.877843] BTRFS info (device sda1): delayed_refs has NO entry
> [  411.878076] BTRFS warning (device sda1): Skipping commit of aborted
> transaction.
> [  411.882940] BTRFS warning (device sda1): failed setting block group ro: -30
> [  411.882956] BTRFS info (device sda1): scrub: not finished on devid
> 3 with status: -30
> [  411.883030] BTRFS warning (device sda1): failed setting block group ro: -30
> [  411.883039] BTRFS info (device sda1): scrub: not finished on devid
> 4 with status: -30
> [  411.883104] BTRFS warning (device sda1): failed setting block group ro: -30
> [  411.883116] BTRFS info (device sda1): scrub: not finished on devid
> 1 with status: -30
> [  411.883191] BTRFS warning (device sda1): failed setting block group ro: -30
> [  411.883200] BTRFS info (device sda1): scrub: not finished on devid
> 2 with status: -30
> [  411.953186] BTRFS info (device sda1): delayed_refs has NO entry
> [  576.361548] BTRFS critical (device sda1): unable to find logical
> 2412789760 length 4096
> [  576.369639] BTRFS critical (device sda1): unable to find logical
> 2412789760 length 4096
> [  576.378477] BTRFS critical (device sda1): unable to find logical
> 2412789760 length 4096
> [  576.386936] BTRFS critical (device sda1): unable to find logical
> 2412789760 length 4096
> [  576.395093] BTRFS critical (device sda1): unable to find logical
> 2412789760 length 4096
> [  576.396830] BTRFS error (device sda1): parent transid verify failed
> on 12756293910528 wanted 1752939 found 1752922
> [  576.403153] BTRFS critical (device sda1): unable to find logical
> 2412789760 length 4096
> [  576.450091] BTRFS error (device sda1): parent transid verify failed
> on 12756293910528 wanted 1752939 found 1752922
> [  576.462821] BTRFS error (device sda1): parent transid verify failed
> on 12756293910528 wanted 1752939 found 1752922
> [  576.474987] BTRFS error (device sda1): parent transid verify failed
> on 12756293910528 wanted 1752939 found 1752922
> [  576.486882] BTRFS error (device sda1): parent transid verify failed
> on 12756293910528 wanted 1752939 found 1752922
> [  576.497905] BTRFS error (device sda1): parent transid verify failed
> on 12756293910528 wanted 1752939 found 1752922
> [  576.824673] BTRFS error (device sda1): parent transid verify failed
> on 12756293910528 wanted 1752939 found 1752922
> [  576.838414] BTRFS error (device sda1): parent transid verify failed
> on 12756293910528 wanted 1752939 found 1752922
> [  577.324362] BTRFS error (device sda1): parent transid verify failed
> on 12756293910528 wanted 1752939 found 1752922
> [  577.337736] BTRFS error (device sda1): parent transid verify failed
> on 12756293910528 wanted 1752939 found 1752922
> [  578.462271] BTRFS critical (device sda1): unable to find logical
> 2412789760 length 4096
> [  578.470347] BTRFS critical (device sda1): unable to find logical
> 2412789760 length 4096
> [  578.479476] BTRFS critical (device sda1): unable to find logical
> 2412789760 length 4096
> [  578.488099] BTRFS critical (device sda1): unable to find logical
> 2412789760 length 4096
> [  578.496793] BTRFS critical (device sda1): unable to find logical
> 2412789760 length 4096
> [  578.505280] BTRFS critical (device sda1): unable to find logical
> 2412789760 length 4096
> [  580.539117] BTRFS critical (device sda1): unable to find logical
> 2412789760 length 4096
> [  580.547207] BTRFS critical (device sda1): unable to find logical
> 2412789760 length 4096
> [  580.556371] BTRFS critical (device sda1): unable to find logical
> 2412789760 length 4096
> [  580.564922] BTRFS critical (device sda1): unable to find logical
> 2412789760 length 4096
> [  580.573356] BTRFS critical (device sda1): unable to find logical
> 2412789760 length 4096
> [  580.581573] BTRFS critical (device sda1): unable to find logical
> 2412789760 length 4096
> [  581.427776] verify_parent_transid: 28 callbacks suppressed
> ... many more of these
> ```
>
> On Sun, Dec 20, 2020 at 8:30 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2020/12/19 下午6:35, Joe Hermaszewski wrote:
>>> Ok, so I managed to get hold of a 64bit machine on which to run btrfs
>>> check. `btrfs check` returns exactly the same output as the armv7 box
>>> (no serious problems) which I suppose is good. `btrfs scrub` also
>>> finds no problems. Boring Logs below.
>>>
>>> What I don't quite understand is how the scrub problem on armv7l is so
>>> reliable when it's not persisted on the disks, is the same physical
>>> memory location being used for this breaking value, or is it perhaps a
>>> specific pattern of data on the bus which causes this?
>>
>> If it's so reliably reproducible, then I guess that would be the case,
>> either a specific memory range has the problem, or a specific pattern on
>> the bus causing the problem.
>>
>>>
>>> If it's the former, how easy would it be to find this broken location
>>> and blacklist it? If it's the latter then I guess there's no hope but
>>> to try replacing the psu/machine.
>>>
>>> The machine survived a couple of days of memtester (on about 95% of
>>> the RAM) and `7z b` with no problems, *shrug*
>>
>> The memtester should rule out the former case, the latter case may be
>> resolved by newer kernel if it's not a hardware problem but a software one.
>>
>> Thanks,
>> Qu
>>
>>>
>>> Best wishes and thanks for the generous help so far.
>>> Joe
>>>
>>> btrfs scrub aarch64:
>>> ```
>>> [j@nixos:~]$ sudo btrfs scrub status -d /mnt
>>> UUID:             b8f4ad49-29c8-4d19-a886-cef9c487f124
>>> scrub device /dev/sda1 (id 1) history
>>> Scrub started:    Fri Dec 18 14:24:30 2020
>>> Status:           finished
>>> Duration:         7:36:31
>>> Total to scrub:   2.40TiB
>>> Rate:             91.95MiB/s
>>> Error summary:    no errors found
>>> scrub device /dev/sdb1 (id 2) history
>>> Scrub started:    Fri Dec 18 14:24:30 2020
>>> Status:           finished
>>> Duration:         7:12:51
>>> Total to scrub:   2.40TiB
>>> Rate:             96.90MiB/s
>>> Error summary:    no errors found
>>> scrub device /dev/sdd1 (id 3) history
>>> Scrub started:    Fri Dec 18 14:24:30 2020
>>> Status:           finished
>>> Duration:         19:47:01
>>> Total to scrub:   7.86TiB
>>> Rate:             115.70MiB/s
>>> Error summary:    no errors found
>>> scrub device /dev/sdc1 (id 4) history
>>> Scrub started:    Fri Dec 18 14:24:30 2020
>>> Status:           finished
>>> Duration:         19:46:38
>>> Total to scrub:   7.86TiB
>>> Rate:             115.74MiB/s
>>> Error summary:    no errors found
>>> ```
>>>
>>> btrfs check aarch64:
>>> ```
>>> [nixos@nixos:/]$ sudo btrfs check --readonly /dev/sda1
>>> Opening filesystem to check...
>>> Checking filesystem on /dev/sda1
>>> UUID: b8f4ad49-29c8-4d19-a886-cef9c487f124
>>> [1/7] checking root items
>>> [2/7] checking extents
>>> [3/7] checking free space cache
>>> [4/7] checking fs roots
>>> root 294 inode 24665 errors 100, file extent discount
>>> Found file extent holes:
>>>           start: 3709534208, len: 163840
>>> root 294 inode 406548 errors 100, file extent discount
>>> Found file extent holes:
>>>           start: 98729984, len: 720896
>>> ERROR: errors found in fs roots
>>> found 11280701063168 bytes used, error(s) found
>>> total csum bytes: 10937464120
>>> total tree bytes: 18538053632
>>> total fs tree bytes: 5877579776
>>> total extent tree bytes: 534052864
>>> btree space waste bytes: 2316660292
>>> file data blocks allocated: 17244587220992
>>>    referenced 14211684794368%
>>> ```
>>>
>>>
>>> On Sat, Nov 28, 2020 at 8:46 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2020/11/27 下午11:15, Joe Hermaszewski wrote:
>>>>> Hi Qu,
>>>>>
>>>>> Thanks for the patch. I recompiled the kernel ran the scrub and your
>>>>> patch worked as expected, here is the log:
>>>>>
>>>>> ```
>>>>> [  337.365239] BTRFS info (device sda1): scrub: started on devid 2
>>>>> [  337.366283] BTRFS info (device sda1): scrub: started on devid 1
>>>>> [  337.402822] BTRFS info (device sda1): scrub: started on devid 3
>>>>> [  337.411944] BTRFS info (device sda1): scrub: started on devid 4
>>>>> [  471.997496] ------------[ cut here ]------------
>>>>> [  471.997614] WARNING: CPU: 0 PID: 218 at fs/btrfs/disk-io.c:531
>>>>> btree_csum_one_bio+0x22c/0x278 [btrfs]
>>>>> [  471.997616] Modules linked in: cfg80211 rfkill 8021q ip6table_nat
>>>>> iptable_nat nf_nat ftdi_sio phy_generic usbserial xt_conntrack
>>>>> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_rpfilter ipt_rpfilter
>>>>> ip6table_raw uio_pdrv_genirq iptable_raw uio xt_pkttype nf_log_ipv6
>>>>> nf_log_ipv4 nf_log_common xt_LOG xt_tcpudp ip6table_filter ip6_tables
>>>>> iptable_filter sch_fq_codel loop tun tap macvlan bridge stp llc lm75
>>>>> ip_tables x_tables autofs4 dm_mod dax btrfs libcrc32c xor raid6_pq
>>>>> [  471.997666] CPU: 0 PID: 218 Comm: btrfs-transacti Not tainted 5.4.78 #1-NixOS
>>>>> [  471.997668] Hardware name: Marvell Armada 380/385 (Device Tree)
>>>>> [  471.997669] Backtrace:
>>>>> [  471.997680] [<c010f698>] (dump_backtrace) from [<c010f938>]
>>>>> (show_stack+0x20/0x24)
>>>>> [  471.997684]  r7:00000213 r6:600c0013 r5:00000000 r4:c0f8c044
>>>>> [  471.997691] [<c010f918>] (show_stack) from [<c0a1b848>]
>>>>> (dump_stack+0x98/0xac)
>>>>> [  471.997696] [<c0a1b7b0>] (dump_stack) from [<c012a998>] (__warn+0xe0/0x108)
>>>>> [  471.997700]  r7:00000213 r6:bf058f4c r5:00000009 r4:bf120990
>>>>> [  471.997704] [<c012a8b8>] (__warn) from [<c012ad24>]
>>>>> (warn_slowpath_fmt+0x74/0xc4)
>>>>> [  471.997707]  r7:00000213 r6:bf120990 r5:00000000 r4:e1822000
>>>>> [  471.997765] [<c012acb4>] (warn_slowpath_fmt) from [<bf058f4c>]
>>>>> (btree_csum_one_bio+0x22c/0x278 [btrfs])
>>>>> [  471.997770]  r9:00000001 r8:c10decb8 r7:e1822000 r6:ed66b000
>>>>> r5:00001000 r4:4fd00000
>>>>> [  471.997872] [<bf058d20>] (btree_csum_one_bio [btrfs]) from
>>>>> [<bf059f84>] (btree_submit_bio_hook+0xe8/0x100 [btrfs])
>>>>> [  471.997877]  r10:e2197b48 r9:ec845fc0 r8:ec845f70 r7:ed66b000
>>>>> r6:00000000 r5:e2b74af0
>>>>> [  471.997879]  r4:bf059e9c
>>>>> [  471.997981] [<bf059e9c>] (btree_submit_bio_hook [btrfs]) from
>>>>> [<bf08b1ac>] (submit_one_bio+0x44/0x5c [btrfs])
>>>>> [  471.997985]  r7:ef3724d4 r6:e1823cac r5:00000000 r4:bf059e9c
>>>>> [  471.998087] [<bf08b168>] (submit_one_bio [btrfs]) from [<bf096664>]
>>>>> (btree_write_cache_pages+0x380/0x408 [btrfs])
>>>>> [  471.998090]  r5:00000000 r4:00000000
>>>>> [  471.998193] [<bf0962e4>] (btree_write_cache_pages [btrfs]) from
>>>>> [<bf0590b8>] (btree_writepages+0x7c/0x84 [btrfs])
>>>>> [  471.998197]  r10:00000001 r9:4fd00000 r8:c0280d14 r7:e1822000
>>>>> r6:e1823d80 r5:ec845f70
>>>>> [  471.998199]  r4:e1823d80
>>>>> [  471.998255] [<bf05903c>] (btree_writepages [btrfs]) from
>>>>> [<c02847c8>] (do_writepages+0x58/0xf4)
>>>>> [  471.998258]  r5:ec845f70 r4:ec845e68
>>>>> [  471.998266] [<c0284770>] (do_writepages) from [<c0278cb0>]
>>>>> (__filemap_fdatawrite_range+0xf8/0x130)
>>>>> [  471.998270]  r8:ec845f70 r7:00001000 r6:4fd0bfff r5:e1822000 r4:ec845e68
>>>>> [  471.998275] [<c0278bb8>] (__filemap_fdatawrite_range) from
>>>>> [<c0278e38>] (filemap_fdatawrite_range+0x2c/0x34)
>>>>> [  471.998279]  r10:ec845f70 r9:00001000 r8:4fd0bfff r7:e1823e4c
>>>>> r6:c1955e28 r5:00001000
>>>>> [  471.998281]  r4:4fd0bfff
>>>>> [  471.998335] [<c0278e0c>] (filemap_fdatawrite_range) from
>>>>> [<bf060544>] (btrfs_write_marked_extents+0x9c/0x1b0 [btrfs])
>>>>> [  471.998338]  r5:00000001 r4:00000000
>>>>> [  471.998441] [<bf0604a8>] (btrfs_write_marked_extents [btrfs]) from
>>>>> [<bf0606f0>] (btrfs_write_and_wait_transaction+0x54/0xa4 [btrfs])
>>>>> [  471.998446]  r10:e1822000 r9:ed66b010 r8:ed66b000 r7:c1955e28
>>>>> r6:ed66b000 r5:e1822000
>>>>> [  471.998447]  r4:eddc4f30
>>>>> [  471.998551] [<bf06069c>] (btrfs_write_and_wait_transaction [btrfs])
>>>>> from [<bf062428>] (btrfs_commit_transaction+0x75c/0xc94 [btrfs])
>>>>> [  471.998555]  r8:ed66b418 r7:c1955e00 r6:ed66b000 r5:eddc4f30 r4:00000000
>>>>> [  471.998658] [<bf061ccc>] (btrfs_commit_transaction [btrfs]) from
>>>>> [<bf05cd98>] (transaction_kthread+0x19c/0x1e0 [btrfs])
>>>>> [  471.998663]  r10:ed66b28c r9:00000000 r8:001aab57 r7:00000064
>>>>> r6:ed66b414 r5:00000bb8
>>>>> [  471.998664]  r4:ed66b000
>>>>> [  471.998719] [<bf05cbfc>] (transaction_kthread [btrfs]) from
>>>>> [<c014facc>] (kthread+0x170/0x174)
>>>>> [  471.998724]  r10:ed765bfc r9:bf05cbfc r8:ed56a000 r7:e1822000
>>>>> r6:00000000 r5:ed5c8600
>>>>> [  471.998726]  r4:ed5c8740
>>>>> [  471.998731] [<c014f95c>] (kthread) from [<c01010e8>]
>>>>> (ret_from_fork+0x14/0x2c)
>>>>> [  471.998733] Exception stack(0xe1823fb0 to 0xe1823ff8)
>>>>> [  471.998737] 3fa0:                                     00000000
>>>>> 00000000 00000000 00000000
>>>>> [  471.998741] 3fc0: 00000000 00000000 00000000 00000000 00000000
>>>>> 00000000 00000000 00000000
>>>>> [  471.998745] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
>>>>> [  471.998749]  r10:00000000 r9:00000000 r8:00000000 r7:00000000
>>>>> r6:00000000 r5:c014f95c
>>>>> [  471.998751]  r4:ed5c8600
>>>>> [  471.998820] ---[ end trace 6a9fb9f547659a4d ]---
>>>>> [  471.998855] BTRFS warning (device sda1): page_start and eb_start
>>>>> mismatch, eb_start=17593525075968 page_start=1339031552
>>>>> [  472.015514] BTRFS: error (device sda1) in
>>>>> btrfs_commit_transaction:2279: errno=-5 IO failure (Error while
>>>>> writing out transaction)
>>>>> [  472.027321] BTRFS info (device sda1): forced readonly
>>>>> [  472.027326] BTRFS warning (device sda1): Skipping commit of aborted
>>>>> transaction.
>>>>> [  472.027330] BTRFS: error (device sda1) in cleanup_transaction:1832:
>>>>> errno=-5 IO failure
>>>>> [  472.035378] BTRFS info (device sda1): delayed_refs has NO entry
>>>>> [  472.039102] BTRFS warning (device sda1): failed setting block group ro: -30
>>>>> [  472.039107] BTRFS info (device sda1): scrub: not finished on devid
>>>>> 3 with status: -30
>>>>> [  472.039114] BTRFS info (device sda1): scrub: not finished on devid
>>>>> 4 with status: -30
>>>>> [  472.039227] BTRFS warning (device sda1): failed setting block group ro: -30
>>>>> [  472.039230] BTRFS warning (device sda1): failed setting block group ro: -30
>>>>> [  472.039237] BTRFS info (device sda1): scrub: not finished on devid
>>>>> 2 with status: -30
>>>>> [  472.039243] BTRFS info (device sda1): scrub: not finished on devid
>>>>> 1 with status: -30
>>>>> ```
>>>>>
>>>>> To rob you of the suspense:
>>>>>
>>>>> printf "0x%016x\n" (17593525075968 - 1339031552)
>>>>> 0x0000100000000000
>>>>>
>>>>> Very surprising considering that this machine has ECC ram!
>>>>
>>>> ARMv7 with ECC, and we still had such bitflip!?
>>>>
>>>> I guess everything is possible in 2020.
>>>>
>>>>>
>>>>> What would you recommend as a fix?
>>>>
>>>> It's a hardware problem, I really don't have any recommendation at all.
>>>>
>>>> I guess you may want to request an RMA?
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>>
>>>>> Best,
>>>>> Joe
>>>>>
>>>>> On Thu, Nov 26, 2020 at 7:05 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2020/11/26 下午6:53, Joe Hermaszewski wrote:
>>>>>>> Hi Qu,
>>>>>>>
>>>>>>> This is the output of btrfs check with the device unmounted:
>>>>>>>
>>>>>>> ```
>>>>>>> [root@nixos:~]# btrfs check --readonly /dev/sda1
>>>>>>> Opening filesystem to check...
>>>>>>> Checking filesystem on /dev/sda1
>>>>>>> UUID: b8f4ad49-29c8-4d19-a886-cef9c487f124
>>>>>>> [1/7] checking root items
>>>>>>> [2/7] checking extents
>>>>>>> [3/7] checking free space cache
>>>>>>> [4/7] checking fs roots
>>>>>>> root 294 inode 24665 errors 100, file extent discount
>>>>>>> Found file extent holes:
>>>>>>>           start: 3709534208, len: 163840
>>>>>>> root 294 inode 406548 errors 100, file extent discount
>>>>>>> Found file extent holes:
>>>>>>>           start: 0, len: 99450880
>>>>>>> ERROR: errors found in fs roots
>>>>>>> found 11279251902464 bytes used, error(s) found
>>>>>>> total csum bytes: 10935906504
>>>>>>> total tree bytes: 18455740416
>>>>>>> total fs tree bytes: 5798936576
>>>>>>> total extent tree bytes: 532250624
>>>>>>> btree space waste bytes: 2308849707
>>>>>>> file data blocks allocated: 17243418611712
>>>>>>>    referenced 14210256932864
>>>>>>> ```
>>>>>>>
>>>>>>> Looks the same as before...
>>>>>>
>>>>>> Then it means, at least your metadata is pretty safe. Nothing really
>>>>>> needs to be worried (so far).
>>>>>>
>>>>>> This also means, the problem only happens at runtime, not something on-disk.
>>>>>>
>>>>>> If you are OK to re-compile the kernel, and the warning message is
>>>>>> always the same at line number 531 of disk-io.c, then I recommend the
>>>>>> attached debug diff
>>>>>>
>>>>>> The diff will output the found_start (from eb) and page_start (from page).
>>>>>> If it's memory bitflip, we should have a pretty signature diff at the
>>>>>> output.
>>>>>> If not memory bitflip, then we really have a bug to dig.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> I'm afraid I don't have the precise details of the transid error
>>>>>>> saved. I think I meant to write earlier that the message that comes up
>>>>>>> now and then is the "errno=-5 IO failure (Error while writing out
>>>>>>> transaction)". I've not seen the transid mismatch one since the first
>>>>>>> incident, sorry for that slip-up.
>>>>>>>
>>>>>>> I'll try and attach to a known good system, will I have to attach the
>>>>>>> whole array because I don't think I have that many enclosures for the
>>>>>>> drives.
>>>>>>
>>>>>> You only need to mount the fs you're seeing the problem.
>>>>>> But if that fs is on multiple devices, then I guess you need to mount
>>>>>> them all.
>>>>>>
>>>>>> Thanks,
>>>>>> Qu
>>>>>>>
>>>>>>> I'll also run a memtest on this box.
>>>>>>>
>>>>>>> Thank you for the helpful response.
>>>>>>>
>>>>>>> Best,
>>>>>>> Joe
>>>>>>>
>>>>>>> On Thu, Nov 26, 2020 at 2:15 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2020/11/25 下午11:28, Joe Hermaszewski wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I have a arm32 machine with four drives with a btrfs fs spanning then in RAID1.
>>>>>>>>> The filesystem has started behaving badly recently and I'm writing to:
>>>>>>>>>
>>>>>>>>> - Solicit advice on how best to get the system back to a stable state
>>>>>>>>> - Report a potential bug
>>>>>>>>>
>>>>>>>>> ## What happened:
>>>>>>>>>
>>>>>>>>> A couple of days ago I could no longer ssh into it, and on the serial
>>>>>>>>> connection there were heaps of messages (and new ones appearing with great
>>>>>>>>> frequency) along the lines of: `parent transid verify failed on blah... wanted
>>>>>>>>> x got y`.
>>>>>>>>>
>>>>>>>>> Although I don't have a record of the precise messages I do remember that there
>>>>>>>>> was a difference of `15` between x and y.
>>>>>>>>
>>>>>>>> This normally can be a sign of unreliable HDD, which lies on FLUSH,
>>>>>>>> killing metadata COW.
>>>>>>>>
>>>>>>>> But, your btrfs check doesn't report the same problem, thus I'm confused.
>>>>>>>>
>>>>>>>> Would you please try to run a "btrfs check --readonly /dev/sda1" with
>>>>>>>> the fs unmounted?
>>>>>>>>
>>>>>>>> And, would you provide the full dmesg of that mismatch?
>>>>>>>> The reason for the exact number is, I'm suspecting hardware memory
>>>>>>>> corruption.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> I power-cycled system and started a scrub after it rebooted, this was
>>>>>>>>> interrupted quite promptly by several more errors in btrfs, and the disk
>>>>>>>>> remounted RO.
>>>>>>>>>
>>>>>>>>> Every now and then in the kernel log I get messages like:
>>>>>>>>>
>>>>>>>>> `parent transid verify failed on blah... wanted x got y`
>>>>>>>>
>>>>>>>> Not showing up in the gist dmesg though.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> ## Important info
>>>>>>>>>
>>>>>>>>> The dev stats are all zero.
>>>>>>>>>
>>>>>>>>> Here are the outputs of some btrfs commands, dmesg and the kernel log from the
>>>>>>>>> previous two boots: https://gist.github.com/b1beab134403c5047e2efbceb98985f9
>>>>>>>>>
>>>>>>>>> The "cut here" portion of the kernel log is as follows
>>>>>>>>>
>>>>>>>>> ```
>>>>>>>>> [  409.158097] ------------[ cut here ]------------
>>>>>>>>> [  409.158205] WARNING: CPU: 1 PID: 217 at fs/btrfs/disk-io.c:531
>>>>>>>>> btree_csum_one_bio+0x208/0x248 [btrfs]
>>>>>>>>
>>>>>>>> The line number shows, the tree bytenr doesn't match with the one in memory.
>>>>>>>> This is really too rare to be seen, especially when we have no other
>>>>>>>> error reported from btrfs (at least in the gist)
>>>>>>>>
>>>>>>>> Since there is no other problems showing up in the gist, it means it
>>>>>>>> could be a bit flip, considering the magic generation gap you mention is
>>>>>>>> 15, I'm more suspicious about memory bit flip.
>>>>>>>>
>>>>>>>> If you can provide the full parent transid mismatch error message, it
>>>>>>>> may help to determine the possibility of hardware memory corruption.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Qu
>>>>>>>>
>>>>>>>>> [  409.158208] Modules linked in: cfg80211 rfkill 8021q ip6table_nat
>>>>>>>>> iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6
>>>>>>>>> nf_defrag_ipv4 ip6t_rpfilter ipt_rpfilter ip6table_raw iptable_raw
>>>>>>>>> xt_pkttype nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG xt_tcpudp
>>>>>>>>> ftdi_sio usbserial phy_generic uio_pdrv_genirq uio ip6table_filter
>>>>>>>>> ip6_tables iptable_filter sch_fq_codel loop tun tap macvlan bridge stp
>>>>>>>>> llc lm75 ip_tables x_tables autofs4 dm_mod dax btrfs libcrc32c xor
>>>>>>>>> raid6_pq
>>>>>>>>> [  409.158258] CPU: 1 PID: 217 Comm: btrfs-transacti Not tainted 5.4.77 #1-NixOS
>>>>>>>>> [  409.158260] Hardware name: Marvell Armada 380/385 (Device Tree)
>>>>>>>>> [  409.158261] Backtrace:
>>>>>>>>> [  409.158272] [<c010f698>] (dump_backtrace) from [<c010f938>]
>>>>>>>>> (show_stack+0x20/0x24)
>>>>>>>>> [  409.158277]  r7:00000213 r6:600f0013 r5:00000000 r4:c0f8c044
>>>>>>>>> [  409.158283] [<c010f918>] (show_stack) from [<c0a1b388>]
>>>>>>>>> (dump_stack+0x98/0xac)
>>>>>>>>> [  409.158288] [<c0a1b2f0>] (dump_stack) from [<c012a998>] (__warn+0xe0/0x108)
>>>>>>>>> [  409.158292]  r7:00000213 r6:bf058ec8 r5:00000009 r4:bf120990
>>>>>>>>> [  409.158296] [<c012a8b8>] (__warn) from [<c012ad24>]
>>>>>>>>> (warn_slowpath_fmt+0x74/0xc4)
>>>>>>>>> [  409.158300]  r7:00000213 r6:bf120990 r5:00000000 r4:e2392000
>>>>>>>>> [  409.158358] [<c012acb4>] (warn_slowpath_fmt) from [<bf058ec8>]
>>>>>>>>> (btree_csum_one_bio+0x208/0x248 [btrfs])
>>>>>>>>> [  409.158363]  r9:e277abe0 r8:00000001 r7:e2392000 r6:ea3d17f0
>>>>>>>>> r5:00000000 r4:eefd2d3c
>>>>>>>>> [  409.158465] [<bf058cc0>] (btree_csum_one_bio [btrfs]) from
>>>>>>>>> [<bf059ef4>] (btree_submit_bio_hook+0xe8/0x100 [btrfs])
>>>>>>>>> [  409.158470]  r10:e32ce170 r9:ecc45fc0 r8:ecc45f70 r7:ec82b000
>>>>>>>>> r6:00000000 r5:ea3d17f0
>>>>>>>>> [  409.158472]  r4:bf059e0c
>>>>>>>>> [  409.158575] [<bf059e0c>] (btree_submit_bio_hook [btrfs]) from
>>>>>>>>> [<bf08b11c>] (submit_one_bio+0x44/0x5c [btrfs])
>>>>>>>>> [  409.158578]  r7:ef36c048 r6:e2393cac r5:00000000 r4:bf059e0c
>>>>>>>>> [  409.158683] [<bf08b0d8>] (submit_one_bio [btrfs]) from [<bf0965d4>]
>>>>>>>>> (btree_write_cache_pages+0x380/0x408 [btrfs])
>>>>>>>>> [  409.158686]  r5:00000000 r4:00000000
>>>>>>>>> [  409.158788] [<bf096254>] (btree_write_cache_pages [btrfs]) from
>>>>>>>>> [<bf059028>] (btree_writepages+0x7c/0x84 [btrfs])
>>>>>>>>> [  409.158793]  r10:00000001 r9:4fd00000 r8:c0280c94 r7:e2392000
>>>>>>>>> r6:e2393d80 r5:ecc45f70
>>>>>>>>> [  409.158794]  r4:e2393d80
>>>>>>>>> [  409.158850] [<bf058fac>] (btree_writepages [btrfs]) from
>>>>>>>>> [<c0284748>] (do_writepages+0x58/0xf4)
>>>>>>>>> [  409.158852]  r5:ecc45f70 r4:ecc45e68
>>>>>>>>> [  409.158860] [<c02846f0>] (do_writepages) from [<c0278c30>]
>>>>>>>>> (__filemap_fdatawrite_range+0xf8/0x130)
>>>>>>>>> [  409.158864]  r8:ecc45f70 r7:00001000 r6:4fd0bfff r5:e2392000 r4:ecc45e68
>>>>>>>>> [  409.158869] [<c0278b38>] (__filemap_fdatawrite_range) from
>>>>>>>>> [<c0278db8>] (filemap_fdatawrite_range+0x2c/0x34)
>>>>>>>>> [  409.158874]  r10:ecc45f70 r9:00001000 r8:4fd0bfff r7:e2393e4c
>>>>>>>>> r6:c73bc628 r5:00001000
>>>>>>>>> [  409.158875]  r4:4fd0bfff
>>>>>>>>> [  409.158929] [<c0278d8c>] (filemap_fdatawrite_range) from
>>>>>>>>> [<bf0604b4>] (btrfs_write_marked_extents+0x9c/0x1b0 [btrfs])
>>>>>>>>> [  409.158931]  r5:00000001 r4:00000000
>>>>>>>>> [  409.159033] [<bf060418>] (btrfs_write_marked_extents [btrfs]) from
>>>>>>>>> [<bf060660>] (btrfs_write_and_wait_transaction+0x54/0xa4 [btrfs])
>>>>>>>>> [  409.159038]  r10:e2392000 r9:ec82b010 r8:ec82b000 r7:c73bc628
>>>>>>>>> r6:ec82b000 r5:e2392000
>>>>>>>>> [  409.159040]  r4:c8b81ca8
>>>>>>>>> [  409.159141] [<bf06060c>] (btrfs_write_and_wait_transaction [btrfs])
>>>>>>>>> from [<bf062398>] (btrfs_commit_transaction+0x75c/0xc94 [btrfs])
>>>>>>>>> [  409.159145]  r8:ec82b418 r7:c73bc600 r6:ec82b000 r5:c8b81ca8 r4:00000000
>>>>>>>>> [  409.159248] [<bf061c3c>] (btrfs_commit_transaction [btrfs]) from
>>>>>>>>> [<bf05cd08>] (transaction_kthread+0x19c/0x1e0 [btrfs])
>>>>>>>>> [  409.159253]  r10:ec82b28c r9:00000000 r8:001aaafa r7:00000064
>>>>>>>>> r6:ec82b414 r5:00000bb8
>>>>>>>>> [  409.159254]  r4:ec82b000
>>>>>>>>> [  409.159309] [<bf05cb6c>] (transaction_kthread [btrfs]) from
>>>>>>>>> [<c014fabc>] (kthread+0x170/0x174)
>>>>>>>>> [  409.159313]  r10:eca87bfc r9:bf05cb6c r8:ed619000 r7:e2392000
>>>>>>>>> r6:00000000 r5:ed5ee700
>>>>>>>>> [  409.159315]  r4:ed5ee1c0
>>>>>>>>> [  409.159320] [<c014f94c>] (kthread) from [<c01010e8>]
>>>>>>>>> (ret_from_fork+0x14/0x2c)
>>>>>>>>> [  409.159322] Exception stack(0xe2393fb0 to 0xe2393ff8)
>>>>>>>>> [  409.159326] 3fa0:                                     00000000
>>>>>>>>> 00000000 00000000 00000000
>>>>>>>>> [  409.159331] 3fc0: 00000000 00000000 00000000 00000000 00000000
>>>>>>>>> 00000000 00000000 00000000
>>>>>>>>> [  409.159334] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
>>>>>>>>> [  409.159338]  r10:00000000 r9:00000000 r8:00000000 r7:00000000
>>>>>>>>> r6:00000000 r5:c014f94c
>>>>>>>>> [  409.159340]  r4:ed5ee700
>>>>>>>>> [  409.159342] ---[ end trace eea59ced12fa7859 ]---
>>>>>>>>> [  409.165084] BTRFS: error (device sda1) in
>>>>>>>>> btrfs_commit_transaction:2279: errno=-5 IO failure (Error while
>>>>>>>>> writing out transaction)
>>>>>>>>> [  409.176920] BTRFS info (device sda1): forced readonly
>>>>>>>>> [  409.176947] BTRFS warning (device sda1): Skipping commit of aborted
>>>>>>>>> transaction.
>>>>>>>>> [  409.176952] BTRFS: error (device sda1) in cleanup_transaction:1832:
>>>>>>>>> errno=-5 IO failure
>>>>>>>>> [  409.185049] BTRFS info (device sda1): delayed_refs has NO entry
>>>>>>>>> [  409.310199] BTRFS info (device sda1): scrub: not finished on devid
>>>>>>>>> 3 with status: -125
>>>>>>>>> [  409.664880] BTRFS info (device sda1): scrub: not finished on devid
>>>>>>>>> 4 with status: -125
>>>>>>>>> [  410.106791] BTRFS info (device sda1): scrub: not finished on devid
>>>>>>>>> 1 with status: -125
>>>>>>>>> [  411.268585] BTRFS warning (device sda1): failed setting block group ro: -30
>>>>>>>>> [  411.268594] BTRFS info (device sda1): scrub: not finished on devid
>>>>>>>>> 2 with status: -30
>>>>>>>>> [  411.268605] BTRFS info (device sda1): delayed_refs has NO entry
>>>>>>>>> ```
>>>>>>>>>
>>>>>>>>> Information requested here
>>>>>>>>> (https://btrfs.wiki.kernel.org/index.php/Btrfs_mailing_list):
>>>>>>>>>
>>>>>>>>> ```
>>>>>>>>>    $ uname -a
>>>>>>>>> Linux thanos 5.4.77 #1-NixOS SMP Tue Nov 10 20:13:20 UTC 2020 armv7l GNU/Linux
>>>>>>>>>
>>>>>>>>>    $ btrfs --version
>>>>>>>>> btrfs-progs v5.7
>>>>>>>>>
>>>>>>>>>    $ sudo btrfs fi show
>>>>>>>>> Label: none  uuid: b8f4ad49-29c8-4d19-a886-cef9c487f124
>>>>>>>>>           Total devices 4 FS bytes used 10.26TiB
>>>>>>>>>           devid    1 size 3.64TiB used 2.40TiB path /dev/sda1
>>>>>>>>>           devid    2 size 3.64TiB used 2.40TiB path /dev/sdc1
>>>>>>>>>           devid    3 size 9.09TiB used 7.86TiB path /dev/sdd1
>>>>>>>>>           devid    4 size 9.09TiB used 7.86TiB path /dev/sdb1
>>>>>>>>>
>>>>>>>>> Label: none  uuid: d02a3067-0a23-4c1f-96ac-80dbc26622f2
>>>>>>>>>           Total devices 1 FS bytes used 116.35MiB
>>>>>>>>>           devid    1 size 399.82MiB used 224.00MiB path /dev/sda2
>>>>>>>>>
>>>>>>>>>    $ sudo btrfs fi df /
>>>>>>>>> Data, RAID1: total=10.25TiB, used=10.24TiB
>>>>>>>>> System, RAID1: total=64.00MiB, used=1.45MiB
>>>>>>>>> Metadata, RAID1: total=18.00GiB, used=17.19GiB
>>>>>>>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>>>>>>> ```
>>>>>>>>>
>>>>>>>>> Thanks to demfloro and multicore on #btrfs for prompting this email.
>>>>>>>>>
>>>>>>>>> Best wishes,
>>>>>>>>> Joe
>>>>>>>>>
>>>>>>>>
>>>>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: btrfs crash on armv7
       [not found]                       ` <CA+4cVr9Zgscn=L0a6CXrCaWK12mne8EpdW0eEe+PPuhQG2fmxQ@mail.gmail.com>
@ 2021-04-08 10:22                         ` Qu Wenruo
  0 siblings, 0 replies; 14+ messages in thread
From: Qu Wenruo @ 2021-04-08 10:22 UTC (permalink / raw)
  To: Joe Hermaszewski, linux-btrfs



On 2021/4/8 下午6:11, Joe Hermaszewski wrote:
> Thanks for explaining so patiently, I have a couple more questions if
> you have the time:
>
> With the patch I assume that this FS will just refuse to mount on
> arm32,

Yes.

If the fs has a chunk beyond 16T, it will be definitely be rejected.
For your case, since you already have such metadata, meaning there is
definitely at least one chunk at or beyond that boundary, thus
immediately rejection will be triggered.

> and in general such a large FS can't be used reliably there.

Not "reliably", but completely "unusable".

As btrfs can't even read some metadata, it would be a miracle to mount
the fs while avoiding all metadata read beyond 16T.

> (If I'm wrong about this, is it possible to use a 64 bit machine to
> move the offending metadata back below 16TB? (I feel that this may be
> a gross misunderstanding!))

Moving is possible (balance is here for the work).

But balance can only move data to larger bytenr, meaning it will just
make the case worse.

The requirement for balance to move data/metadata to larger bytenr is,
btrfs assumes chunk with larger bytenr are created newer.
Thus things like balance itself may use that feature to make sure if it
has balanced the full fs.

Thus in btrfs, the bytenr of new chunks are increamental, no way to
create a new chunk with lower bytenr.

>
> I'm not sure I understood this part of your reply:
>
>> Btrfs balance will also go forward larger bytenr, thus it's unrelated to size at all.
>
> I just realised that the last couple of messages didn't go to the
> list, I'm happy for you to reply to the list if you feel that this
> conversation would be beneficial there!

Sorry my bad...

Thanks,
Qu

>
> On Thu, Apr 8, 2021 at 5:41 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2021/4/8 下午5:36, Joe Hermaszewski wrote:
>>> Thanks for the quick reply! In this case what's the best course of
>>> action for me?
>>>
>>> - Can this array be mounted and recovered without problems on a 64 bit
>>> machine?
>>
>> There should be no problem as long as btrfs check reports no error.
>>
>> It's purely runtime assumption failed to get met, thus no real damage to
>> your fs.
>>
>>> - Does this 16TB limit refer to the total size of the devices, or the
>>> apparent size after RAID? (I assume the former as this array is only
>>> 14TB after raid I think).
>>
>> The limit is to any address space, not only some file/device size.
>>
>> For btrfs even you only have 1 disk with 1T size, you can still get
>> metadata beyond 16T.
>>
>> Btrfs balance will also go forward larger bytenr, thus it's unrelated to
>> size at all.
>>
>> Thanks,
>> Qu
>>
>>>
>>> Thanks for the patch, I'll take a look at building my system with it in
>>> future.
>>>
>>> Best wishes,
>>> Joe
>>>
>>>
>>> On Thu, Apr 8, 2021, 4:38 PM Qu Wenruo <quwenruo.btrfs@gmx.com
>>> <mailto:quwenruo.btrfs@gmx.com>> wrote:
>>>
>>>
>>>
>>>      On 2021/4/8 下午4:16, Joe Hermaszewski wrote:
>>>       > It took a while but I managed to get hold of another one of these
>>>       > arm32 boards. Very disappointingly this exact "bitflip" is still
>>>       > present (log enclosed).
>>>
>>>      Yeah, we got to the conclusion it's not bitflip, but completely 32bit
>>>      limit on armv7.
>>>
>>>      For ARMv7, it's a 32bit system, where unsigned long is only 32bit.
>>>
>>>      This means, things like page->index is only 32bit long, and for 4K page
>>>      size, it also means all filesystems (not only btrfs) can only utilize at
>>>      most 16T bytes.
>>>
>>>      But there is pitfall for btrfs, btrfs uses its internal address space
>>>      for its meatadata, and the address space is U64.
>>>
>>>      And furthermore, for btrfs it can have metadata at bytenr way larger
>>>      than the total device size.
>>>      This is possible because btrfs maps part of its address space to real
>>>      disks, thus it can have bytenr way larger than device size.
>>>
>>>      But this brings to a problem, 32bit Linux can only handle 16T, but in
>>>      your case, some of your metadata is already beyond 16T in btrfs address
>>>      space.
>>>
>>>      Then a lot of things are going to be wrong.
>>>
>>>      I have submitted a patch to do extra check, at least allowing user to
>>>      know this is the limit of 32bit:
>>>      https://patchwork.kernel.org/project/linux-btrfs/patch/20210225011814.24009-1-wqu@suse.com/
>>>      <https://patchwork.kernel.org/project/linux-btrfs/patch/20210225011814.24009-1-wqu@suse.com/>
>>>
>>>      Unfortunately, this will not help existing fs though.
>>>
>>>      Thanks,
>>>      Qu
>>>       >
>>>       > To summarise, as it's been a while:
>>>       >
>>>       > - When running scrub, a "page_start" and "eb_start" mismatch is
>>>       > detected (off by a single bit).
>>>       > - `btrfs check` reports no significant errors on aarch64 or arm32.
>>>       > - `btrfs scrub` completes successfully on aarch64!
>>>       > - Now, I can confirm that `btrfs scrub` fails in the same manner on
>>>       > two arm32 machines.
>>>       >
>>>       > Not really sure where to go from here. The only things I can
>>>      think of are:
>>>       >
>>>       > - Bug in `btrfs scrub` on arm32 (seems unlikely given the single-bit
>>>       > error here).
>>>       > - A bitflip on the original machine wrote bad data to the disk, and
>>>       > somehow only the arm32 implementation catches this...
>>>       >
>>>       > I can give a newer kernel a try (I'm on 5.4.80 at the moment) I
>>>      suppose
>>>       >
>>>       > Best,
>>>       > Joe
>>>       >
>>>       > ```
>>>       > $ dmesg
>>>       > [  258.791966] BTRFS info (device sda1): scrub: started on devid 4
>>>       > [  258.792030] BTRFS info (device sda1): scrub: started on devid 1
>>>       >
>>>       >   the magic generation gap you mention is
>>>       > 15, I'm more suspicious about me
>>>       >
>>>       > [  258.792062] BTRFS info (device sda1): scrub: started on devid 2
>>>       > [  411.849669] ------------[ cut here ]------------
>>>       > [  411.849786] WARNING: CPU: 0 PID: 218 at fs/btrfs/disk-io.c:531
>>>       > btree_csum_one_bio+0x22c/0x278 [btrfs]
>>>       > [  411.849789] Modules linked in: cfg80211 rfkill 8021q ip6table_nat
>>>       > iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6
>>>       > nf_defrag_ipv4 ip6t_rpfilter ipt_rpfilter ip6table_raw iptable_raw
>>>       > xt_pkttype nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG xt_tcpudp
>>>       > ip6table_filter ip6_tables iptable_filter phy_generic uio_pdrv_genirq
>>>       > uio sch_fq_codel loop tun tap macvlan bridge stp llc lm75 ip_tables
>>>       > x_tables autofs4 dm_mod dax btrfs libcrc32c xor raid6_pq
>>>       > [  411.849836] CPU: 0 PID: 218 Comm: btrfs-transacti Not tainted
>>>      5.4.80 #1-NixOS
>>>       > [  411.849838] Hardware name: Marvell Armada 380/385 (Device Tree)
>>>       > [  411.849840] Backtrace:
>>>       > [  411.849850] [<c010f698>] (dump_backtrace) from [<c010f938>]
>>>       > (show_stack+0x20/0x24)
>>>       > [  411.849855]  r7:00000213 r6:600f0013 r5:00000000 r4:c0f8c0c4
>>>       > [  411.849861] [<c010f918>] (show_stack) from [<c0a1bb28>]
>>>       > (dump_stack+0x98/0xac)
>>>       > [  411.849866] [<c0a1ba90>] (dump_stack) from [<c012a998>]
>>>      (__warn+0xe0/0x108)
>>>       > [  411.849870]  r7:00000213 r6:bf058f4c r5:00000009 r4:bf120990
>>>       > [  411.849875] [<c012a8b8>] (__warn) from [<c012ad24>]
>>>       > (warn_slowpath_fmt+0x74/0xc4)
>>>       > [  411.849878]  r7:00000213 r6:bf120990 r5:00000000 r4:e1c1c000
>>>       > [  411.849936] [<c012acb4>] (warn_slowpath_fmt) from [<bf058f4c>]
>>>       > (btree_csum_one_bio+0x22c/0x278 [btrfs])
>>>       > [  411.849941]  r9:00000001 r8:e6afec78 r7:e1c1c000 r6:ed7f7000
>>>       > r5:00001000 r4:8fd00000
>>>       > [  411.850044] [<bf058d20>] (btree_csum_one_bio [btrfs]) from
>>>       > [<bf059f84>] (btree_submit_bio_hook+0xe8/0x100 [btrfs])
>>>       > [  411.850049]  r10:e6afe4c0 r9:ecc57fc0 r8:ecc57f70 r7:ed7f7000
>>>       > r6:00000000 r5:c1660370
>>>       > [  411.850051]  r4:bf059e9c
>>>       > [  411.850154] [<bf059e9c>] (btree_submit_bio_hook [btrfs]) from
>>>       > [<bf08b1ac>] (submit_one_bio+0x44/0x5c [btrfs])
>>>       > [  411.850158]  r7:ef2d58ec r6:e1c1dcac r5:00000000 r4:bf059e9c
>>>       > [  411.850260] [<bf08b168>] (submit_one_bio [btrfs]) from
>>>      [<bf096664>]
>>>       > (btree_write_cache_pages+0x380/0x408 [btrfs])
>>>       > [  411.850263]  r5:00000000 r4:00000000
>>>       > [  411.850366] [<bf0962e4>] (btree_write_cache_pages [btrfs]) from
>>>       > [<bf0590b8>] (btree_writepages+0x7c/0x84 [btrfs])
>>>       > [  411.850371]  r10:00000001 r9:8fd00000 r8:c0280bcc r7:e1c1c000
>>>       > r6:e1c1dd80 r5:ecc57f70
>>>       > [  411.850373]  r4:e1c1dd80
>>>       > [  411.850428] [<bf05903c>] (btree_writepages [btrfs]) from
>>>       > [<c0284680>] (do_writepages+0x58/0xf4)
>>>       > [  411.850431]  r5:ecc57f70 r4:ecc57e68
>>>       > [  411.850439] [<c0284628>] (do_writepages) from [<c0278b68>]
>>>       > (__filemap_fdatawrite_range+0xf8/0x130)
>>>       > [  411.850442]  r8:ecc57f70 r7:00001000 r6:8fd0bfff r5:e1c1c000
>>>      r4:ecc57e68
>>>       > [  411.850448] [<c0278a70>] (__filemap_fdatawrite_range) from
>>>       > [<c0278cf0>] (filemap_fdatawrite_range+0x2c/0x34)
>>>       > [  411.850452]  r10:ecc57f70 r9:00001000 r8:8fd0bfff r7:e1c1de4c
>>>       > r6:e0202428 r5:00001000
>>>       > [  411.850454]  r4:8fd0bfff
>>>       > [  411.850509] [<c0278cc4>] (filemap_fdatawrite_range) from
>>>       > [<bf060544>] (btrfs_write_marked_extents+0x9c/0x1b0 [btrfs])
>>>       > [  411.850512]  r5:00000001 r4:00000000
>>>       > [  411.850615] [<bf0604a8>] (btrfs_write_marked_extents [btrfs]) from
>>>       > [<bf0606f0>] (btrfs_write_and_wait_transaction+0x54/0xa4 [btrfs])
>>>       > [  411.850620]  r10:e1c1c000 r9:ed7f7010 r8:ed7f7000 r7:e0202428
>>>       > r6:ed7f7000 r5:e1c1c000
>>>       > [  411.850621]  r4:e56433f0
>>>       > [  411.850724] [<bf06069c>] (btrfs_write_and_wait_transaction
>>>      [btrfs])
>>>       > from [<bf062428>] (btrfs_commit_transaction+0x75c/0xc94 [btrfs])
>>>       > [  411.850728]  r8:ed7f7418 r7:e0202400 r6:ed7f7000 r5:e56433f0
>>>      r4:00000000
>>>       > [  411.850831] [<bf061ccc>] (btrfs_commit_transaction [btrfs]) from
>>>       > [<bf05cd98>] (transaction_kthread+0x19c/0x1e0 [btrfs])
>>>       > [  411.850836]  r10:ed7f728c r9:00000000 r8:001abf6a r7:00000064
>>>       > r6:ed7f7414 r5:00000bb8
>>>       > [  411.850838]  r4:ed7f7000
>>>       > [  411.850893] [<bf05cbfc>] (transaction_kthread [btrfs]) from
>>>       > [<c014fab4>] (kthread+0x170/0x174)
>>>       > [  411.850898]  r10:ecbddbfc r9:bf05cbfc r8:ed669800 r7:e1c1c000
>>>       > r6:00000000 r5:ed560d40
>>>       > [  411.850899]  r4:ed54eb00
>>>       > [  411.850904] [<c014f944>] (kthread) from [<c01010e8>]
>>>       > (ret_from_fork+0x14/0x2c)
>>>       > [  411.850907] Exception stack(0xe1c1dfb0 to 0xe1c1dff8)
>>>       > [  411.850911] dfa0:                                     00000000
>>>       > 00000000 00000000 00000000
>>>       > [  411.850915] dfc0: 00000000 00000000 00000000 00000000 00000000
>>>       > 00000000 00000000 00000000
>>>       > [  411.850919] dfe0: 00000000 00000000 00000000 00000000 00000013
>>>      00000000
>>>       > [  411.850923]  r10:00000000 r9:00000000 r8:00000000 r7:00000000
>>>       > r6:00000000 r5:c014f944
>>>       > [  411.850925]  r4:ed560d40
>>>       > [  411.850976] ---[ end trace 1bbf4c394bd7fb26 ]---
>>>       > [  411.850986] BTRFS warning (device sda1): page_start and eb_start
>>>       > mismatch, eb_start=17594598817792 page_start=2412773376
>>>       > [  411.857914] BTRFS: error (device sda1) in
>>>       > btrfs_commit_transaction:2279: errno=-5 IO failure (Error while
>>>       > writing out transaction)
>>>       > [  411.869764] BTRFS info (device sda1): forced readonly
>>>       > [  411.869772] BTRFS warning (device sda1): Skipping commit of
>>>      aborted
>>>       > transaction.
>>>       > [  411.869778] BTRFS: error (device sda1) in
>>>      cleanup_transaction:1832:
>>>       > errno=-5 IO failure
>>>       > [  411.877843] BTRFS info (device sda1): delayed_refs has NO entry
>>>       > [  411.878076] BTRFS warning (device sda1): Skipping commit of
>>>      aborted
>>>       > transaction.
>>>       > [  411.882940] BTRFS warning (device sda1): failed setting block
>>>      group ro: -30
>>>       > [  411.882956] BTRFS info (device sda1): scrub: not finished on devid
>>>       > 3 with status: -30
>>>       > [  411.883030] BTRFS warning (device sda1): failed setting block
>>>      group ro: -30
>>>       > [  411.883039] BTRFS info (device sda1): scrub: not finished on devid
>>>       > 4 with status: -30
>>>       > [  411.883104] BTRFS warning (device sda1): failed setting block
>>>      group ro: -30
>>>       > [  411.883116] BTRFS info (device sda1): scrub: not finished on devid
>>>       > 1 with status: -30
>>>       > [  411.883191] BTRFS warning (device sda1): failed setting block
>>>      group ro: -30
>>>       > [  411.883200] BTRFS info (device sda1): scrub: not finished on devid
>>>       > 2 with status: -30
>>>       > [  411.953186] BTRFS info (device sda1): delayed_refs has NO entry
>>>       > [  576.361548] BTRFS critical (device sda1): unable to find logical
>>>       > 2412789760 length 4096
>>>       > [  576.369639] BTRFS critical (device sda1): unable to find logical
>>>       > 2412789760 length 4096
>>>       > [  576.378477] BTRFS critical (device sda1): unable to find logical
>>>       > 2412789760 length 4096
>>>       > [  576.386936] BTRFS critical (device sda1): unable to find logical
>>>       > 2412789760 length 4096
>>>       > [  576.395093] BTRFS critical (device sda1): unable to find logical
>>>       > 2412789760 length 4096
>>>       > [  576.396830] BTRFS error (device sda1): parent transid verify
>>>      failed
>>>       > on 12756293910528 wanted 1752939 found 1752922
>>>       > [  576.403153] BTRFS critical (device sda1): unable to find logical
>>>       > 2412789760 length 4096
>>>       > [  576.450091] BTRFS error (device sda1): parent transid verify
>>>      failed
>>>       > on 12756293910528 wanted 1752939 found 1752922
>>>       > [  576.462821] BTRFS error (device sda1): parent transid verify
>>>      failed
>>>       > on 12756293910528 wanted 1752939 found 1752922
>>>       > [  576.474987] BTRFS error (device sda1): parent transid verify
>>>      failed
>>>       > on 12756293910528 wanted 1752939 found 1752922
>>>       > [  576.486882] BTRFS error (device sda1): parent transid verify
>>>      failed
>>>       > on 12756293910528 wanted 1752939 found 1752922
>>>       > [  576.497905] BTRFS error (device sda1): parent transid verify
>>>      failed
>>>       > on 12756293910528 wanted 1752939 found 1752922
>>>       > [  576.824673] BTRFS error (device sda1): parent transid verify
>>>      failed
>>>       > on 12756293910528 wanted 1752939 found 1752922
>>>       > [  576.838414] BTRFS error (device sda1): parent transid verify
>>>      failed
>>>       > on 12756293910528 wanted 1752939 found 1752922
>>>       > [  577.324362] BTRFS error (device sda1): parent transid verify
>>>      failed
>>>       > on 12756293910528 wanted 1752939 found 1752922
>>>       > [  577.337736] BTRFS error (device sda1): parent transid verify
>>>      failed
>>>       > on 12756293910528 wanted 1752939 found 1752922
>>>       > [  578.462271] BTRFS critical (device sda1): unable to find logical
>>>       > 2412789760 length 4096
>>>       > [  578.470347] BTRFS critical (device sda1): unable to find logical
>>>       > 2412789760 length 4096
>>>       > [  578.479476] BTRFS critical (device sda1): unable to find logical
>>>       > 2412789760 length 4096
>>>       > [  578.488099] BTRFS critical (device sda1): unable to find logical
>>>       > 2412789760 length 4096
>>>       > [  578.496793] BTRFS critical (device sda1): unable to find logical
>>>       > 2412789760 length 4096
>>>       > [  578.505280] BTRFS critical (device sda1): unable to find logical
>>>       > 2412789760 length 4096
>>>       > [  580.539117] BTRFS critical (device sda1): unable to find logical
>>>       > 2412789760 length 4096
>>>       > [  580.547207] BTRFS critical (device sda1): unable to find logical
>>>       > 2412789760 length 4096
>>>       > [  580.556371] BTRFS critical (device sda1): unable to find logical
>>>       > 2412789760 length 4096
>>>       > [  580.564922] BTRFS critical (device sda1): unable to find logical
>>>       > 2412789760 length 4096
>>>       > [  580.573356] BTRFS critical (device sda1): unable to find logical
>>>       > 2412789760 length 4096
>>>       > [  580.581573] BTRFS critical (device sda1): unable to find logical
>>>       > 2412789760 length 4096
>>>       > [  581.427776] verify_parent_transid: 28 callbacks suppressed
>>>       > ... many more of these
>>>       > ```
>>>       >
>>>       > On Sun, Dec 20, 2020 at 8:30 AM Qu Wenruo <quwenruo.btrfs@gmx.com
>>>      <mailto:quwenruo.btrfs@gmx.com>> wrote:
>>>       >>
>>>       >>
>>>       >>
>>>       >> On 2020/12/19 下午6:35, Joe Hermaszewski wrote:
>>>       >>> Ok, so I managed to get hold of a 64bit machine on which to run
>>>      btrfs
>>>       >>> check. `btrfs check` returns exactly the same output as the
>>>      armv7 box
>>>       >>> (no serious problems) which I suppose is good. `btrfs scrub` also
>>>       >>> finds no problems. Boring Logs below.
>>>       >>>
>>>       >>> What I don't quite understand is how the scrub problem on
>>>      armv7l is so
>>>       >>> reliable when it's not persisted on the disks, is the same physical
>>>       >>> memory location being used for this breaking value, or is it
>>>      perhaps a
>>>       >>> specific pattern of data on the bus which causes this?
>>>       >>
>>>       >> If it's so reliably reproducible, then I guess that would be the
>>>      case,
>>>       >> either a specific memory range has the problem, or a specific
>>>      pattern on
>>>       >> the bus causing the problem.
>>>       >>
>>>       >>>
>>>       >>> If it's the former, how easy would it be to find this broken
>>>      location
>>>       >>> and blacklist it? If it's the latter then I guess there's no
>>>      hope but
>>>       >>> to try replacing the psu/machine.
>>>       >>>
>>>       >>> The machine survived a couple of days of memtester (on about 95% of
>>>       >>> the RAM) and `7z b` with no problems, *shrug*
>>>       >>
>>>       >> The memtester should rule out the former case, the latter case
>>>      may be
>>>       >> resolved by newer kernel if it's not a hardware problem but a
>>>      software one.
>>>       >>
>>>       >> Thanks,
>>>       >> Qu
>>>       >>
>>>       >>>
>>>       >>> Best wishes and thanks for the generous help so far.
>>>       >>> Joe
>>>       >>>
>>>       >>> btrfs scrub aarch64:
>>>       >>> ```
>>>       >>> [j@nixos:~]$ sudo btrfs scrub status -d /mnt
>>>       >>> UUID:             b8f4ad49-29c8-4d19-a886-cef9c487f124
>>>       >>> scrub device /dev/sda1 (id 1) history
>>>       >>> Scrub started:    Fri Dec 18 14:24:30 2020
>>>       >>> Status:           finished
>>>       >>> Duration:         7:36:31
>>>       >>> Total to scrub:   2.40TiB
>>>       >>> Rate:             91.95MiB/s
>>>       >>> Error summary:    no errors found
>>>       >>> scrub device /dev/sdb1 (id 2) history
>>>       >>> Scrub started:    Fri Dec 18 14:24:30 2020
>>>       >>> Status:           finished
>>>       >>> Duration:         7:12:51
>>>       >>> Total to scrub:   2.40TiB
>>>       >>> Rate:             96.90MiB/s
>>>       >>> Error summary:    no errors found
>>>       >>> scrub device /dev/sdd1 (id 3) history
>>>       >>> Scrub started:    Fri Dec 18 14:24:30 2020
>>>       >>> Status:           finished
>>>       >>> Duration:         19:47:01
>>>       >>> Total to scrub:   7.86TiB
>>>       >>> Rate:             115.70MiB/s
>>>       >>> Error summary:    no errors found
>>>       >>> scrub device /dev/sdc1 (id 4) history
>>>       >>> Scrub started:    Fri Dec 18 14:24:30 2020
>>>       >>> Status:           finished
>>>       >>> Duration:         19:46:38
>>>       >>> Total to scrub:   7.86TiB
>>>       >>> Rate:             115.74MiB/s
>>>       >>> Error summary:    no errors found
>>>       >>> ```
>>>       >>>
>>>       >>> btrfs check aarch64:
>>>       >>> ```
>>>       >>> [nixos@nixos:/]$ sudo btrfs check --readonly /dev/sda1
>>>       >>> Opening filesystem to check...
>>>       >>> Checking filesystem on /dev/sda1
>>>       >>> UUID: b8f4ad49-29c8-4d19-a886-cef9c487f124
>>>       >>> [1/7] checking root items
>>>       >>> [2/7] checking extents
>>>       >>> [3/7] checking free space cache
>>>       >>> [4/7] checking fs roots
>>>       >>> root 294 inode 24665 errors 100, file extent discount
>>>       >>> Found file extent holes:
>>>       >>>           start: 3709534208, len: 163840
>>>       >>> root 294 inode 406548 errors 100, file extent discount
>>>       >>> Found file extent holes:
>>>       >>>           start: 98729984, len: 720896
>>>       >>> ERROR: errors found in fs roots
>>>       >>> found 11280701063168 bytes used, error(s) found
>>>       >>> total csum bytes: 10937464120
>>>       >>> total tree bytes: 18538053632
>>>       >>> total fs tree bytes: 5877579776
>>>       >>> total extent tree bytes: 534052864
>>>       >>> btree space waste bytes: 2316660292
>>>       >>> file data blocks allocated: 17244587220992
>>>       >>>    referenced 14211684794368%
>>>       >>> ```
>>>       >>>
>>>       >>>
>>>       >>> On Sat, Nov 28, 2020 at 8:46 AM Qu Wenruo
>>>      <quwenruo.btrfs@gmx.com <mailto:quwenruo.btrfs@gmx.com>> wrote:
>>>       >>>>
>>>       >>>>
>>>       >>>>
>>>       >>>> On 2020/11/27 下午11:15, Joe Hermaszewski wrote:
>>>       >>>>> Hi Qu,
>>>       >>>>>
>>>       >>>>> Thanks for the patch. I recompiled the kernel ran the scrub
>>>      and your
>>>       >>>>> patch worked as expected, here is the log:
>>>       >>>>>
>>>       >>>>> ```
>>>       >>>>> [  337.365239] BTRFS info (device sda1): scrub: started on
>>>      devid 2
>>>       >>>>> [  337.366283] BTRFS info (device sda1): scrub: started on
>>>      devid 1
>>>       >>>>> [  337.402822] BTRFS info (device sda1): scrub: started on
>>>      devid 3
>>>       >>>>> [  337.411944] BTRFS info (device sda1): scrub: started on
>>>      devid 4
>>>       >>>>> [  471.997496] ------------[ cut here ]------------
>>>       >>>>> [  471.997614] WARNING: CPU: 0 PID: 218 at fs/btrfs/disk-io.c:531
>>>       >>>>> btree_csum_one_bio+0x22c/0x278 [btrfs]
>>>       >>>>> [  471.997616] Modules linked in: cfg80211 rfkill 8021q
>>>      ip6table_nat
>>>       >>>>> iptable_nat nf_nat ftdi_sio phy_generic usbserial xt_conntrack
>>>       >>>>> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_rpfilter
>>>      ipt_rpfilter
>>>       >>>>> ip6table_raw uio_pdrv_genirq iptable_raw uio xt_pkttype
>>>      nf_log_ipv6
>>>       >>>>> nf_log_ipv4 nf_log_common xt_LOG xt_tcpudp ip6table_filter
>>>      ip6_tables
>>>       >>>>> iptable_filter sch_fq_codel loop tun tap macvlan bridge stp
>>>      llc lm75
>>>       >>>>> ip_tables x_tables autofs4 dm_mod dax btrfs libcrc32c xor
>>>      raid6_pq
>>>       >>>>> [  471.997666] CPU: 0 PID: 218 Comm: btrfs-transacti Not
>>>      tainted 5.4.78 #1-NixOS
>>>       >>>>> [  471.997668] Hardware name: Marvell Armada 380/385 (Device
>>>      Tree)
>>>       >>>>> [  471.997669] Backtrace:
>>>       >>>>> [  471.997680] [<c010f698>] (dump_backtrace) from [<c010f938>]
>>>       >>>>> (show_stack+0x20/0x24)
>>>       >>>>> [  471.997684]  r7:00000213 r6:600c0013 r5:00000000 r4:c0f8c044
>>>       >>>>> [  471.997691] [<c010f918>] (show_stack) from [<c0a1b848>]
>>>       >>>>> (dump_stack+0x98/0xac)
>>>       >>>>> [  471.997696] [<c0a1b7b0>] (dump_stack) from [<c012a998>]
>>>      (__warn+0xe0/0x108)
>>>       >>>>> [  471.997700]  r7:00000213 r6:bf058f4c r5:00000009 r4:bf120990
>>>       >>>>> [  471.997704] [<c012a8b8>] (__warn) from [<c012ad24>]
>>>       >>>>> (warn_slowpath_fmt+0x74/0xc4)
>>>       >>>>> [  471.997707]  r7:00000213 r6:bf120990 r5:00000000 r4:e1822000
>>>       >>>>> [  471.997765] [<c012acb4>] (warn_slowpath_fmt) from [<bf058f4c>]
>>>       >>>>> (btree_csum_one_bio+0x22c/0x278 [btrfs])
>>>       >>>>> [  471.997770]  r9:00000001 r8:c10decb8 r7:e1822000 r6:ed66b000
>>>       >>>>> r5:00001000 r4:4fd00000
>>>       >>>>> [  471.997872] [<bf058d20>] (btree_csum_one_bio [btrfs]) from
>>>       >>>>> [<bf059f84>] (btree_submit_bio_hook+0xe8/0x100 [btrfs])
>>>       >>>>> [  471.997877]  r10:e2197b48 r9:ec845fc0 r8:ec845f70 r7:ed66b000
>>>       >>>>> r6:00000000 r5:e2b74af0
>>>       >>>>> [  471.997879]  r4:bf059e9c
>>>       >>>>> [  471.997981] [<bf059e9c>] (btree_submit_bio_hook [btrfs]) from
>>>       >>>>> [<bf08b1ac>] (submit_one_bio+0x44/0x5c [btrfs])
>>>       >>>>> [  471.997985]  r7:ef3724d4 r6:e1823cac r5:00000000 r4:bf059e9c
>>>       >>>>> [  471.998087] [<bf08b168>] (submit_one_bio [btrfs]) from
>>>      [<bf096664>]
>>>       >>>>> (btree_write_cache_pages+0x380/0x408 [btrfs])
>>>       >>>>> [  471.998090]  r5:00000000 r4:00000000
>>>       >>>>> [  471.998193] [<bf0962e4>] (btree_write_cache_pages [btrfs])
>>>      from
>>>       >>>>> [<bf0590b8>] (btree_writepages+0x7c/0x84 [btrfs])
>>>       >>>>> [  471.998197]  r10:00000001 r9:4fd00000 r8:c0280d14 r7:e1822000
>>>       >>>>> r6:e1823d80 r5:ec845f70
>>>       >>>>> [  471.998199]  r4:e1823d80
>>>       >>>>> [  471.998255] [<bf05903c>] (btree_writepages [btrfs]) from
>>>       >>>>> [<c02847c8>] (do_writepages+0x58/0xf4)
>>>       >>>>> [  471.998258]  r5:ec845f70 r4:ec845e68
>>>       >>>>> [  471.998266] [<c0284770>] (do_writepages) from [<c0278cb0>]
>>>       >>>>> (__filemap_fdatawrite_range+0xf8/0x130)
>>>       >>>>> [  471.998270]  r8:ec845f70 r7:00001000 r6:4fd0bfff
>>>      r5:e1822000 r4:ec845e68
>>>       >>>>> [  471.998275] [<c0278bb8>] (__filemap_fdatawrite_range) from
>>>       >>>>> [<c0278e38>] (filemap_fdatawrite_range+0x2c/0x34)
>>>       >>>>> [  471.998279]  r10:ec845f70 r9:00001000 r8:4fd0bfff r7:e1823e4c
>>>       >>>>> r6:c1955e28 r5:00001000
>>>       >>>>> [  471.998281]  r4:4fd0bfff
>>>       >>>>> [  471.998335] [<c0278e0c>] (filemap_fdatawrite_range) from
>>>       >>>>> [<bf060544>] (btrfs_write_marked_extents+0x9c/0x1b0 [btrfs])
>>>       >>>>> [  471.998338]  r5:00000001 r4:00000000
>>>       >>>>> [  471.998441] [<bf0604a8>] (btrfs_write_marked_extents
>>>      [btrfs]) from
>>>       >>>>> [<bf0606f0>] (btrfs_write_and_wait_transaction+0x54/0xa4 [btrfs])
>>>       >>>>> [  471.998446]  r10:e1822000 r9:ed66b010 r8:ed66b000 r7:c1955e28
>>>       >>>>> r6:ed66b000 r5:e1822000
>>>       >>>>> [  471.998447]  r4:eddc4f30
>>>       >>>>> [  471.998551] [<bf06069c>] (btrfs_write_and_wait_transaction
>>>      [btrfs])
>>>       >>>>> from [<bf062428>] (btrfs_commit_transaction+0x75c/0xc94 [btrfs])
>>>       >>>>> [  471.998555]  r8:ed66b418 r7:c1955e00 r6:ed66b000
>>>      r5:eddc4f30 r4:00000000
>>>       >>>>> [  471.998658] [<bf061ccc>] (btrfs_commit_transaction
>>>      [btrfs]) from
>>>       >>>>> [<bf05cd98>] (transaction_kthread+0x19c/0x1e0 [btrfs])
>>>       >>>>> [  471.998663]  r10:ed66b28c r9:00000000 r8:001aab57 r7:00000064
>>>       >>>>> r6:ed66b414 r5:00000bb8
>>>       >>>>> [  471.998664]  r4:ed66b000
>>>       >>>>> [  471.998719] [<bf05cbfc>] (transaction_kthread [btrfs]) from
>>>       >>>>> [<c014facc>] (kthread+0x170/0x174)
>>>       >>>>> [  471.998724]  r10:ed765bfc r9:bf05cbfc r8:ed56a000 r7:e1822000
>>>       >>>>> r6:00000000 r5:ed5c8600
>>>       >>>>> [  471.998726]  r4:ed5c8740
>>>       >>>>> [  471.998731] [<c014f95c>] (kthread) from [<c01010e8>]
>>>       >>>>> (ret_from_fork+0x14/0x2c)
>>>       >>>>> [  471.998733] Exception stack(0xe1823fb0 to 0xe1823ff8)
>>>       >>>>> [  471.998737] 3fa0:                                     00000000
>>>       >>>>> 00000000 00000000 00000000
>>>       >>>>> [  471.998741] 3fc0: 00000000 00000000 00000000 00000000 00000000
>>>       >>>>> 00000000 00000000 00000000
>>>       >>>>> [  471.998745] 3fe0: 00000000 00000000 00000000 00000000
>>>      00000013 00000000
>>>       >>>>> [  471.998749]  r10:00000000 r9:00000000 r8:00000000 r7:00000000
>>>       >>>>> r6:00000000 r5:c014f95c
>>>       >>>>> [  471.998751]  r4:ed5c8600
>>>       >>>>> [  471.998820] ---[ end trace 6a9fb9f547659a4d ]---
>>>       >>>>> [  471.998855] BTRFS warning (device sda1): page_start and
>>>      eb_start
>>>       >>>>> mismatch, eb_start=17593525075968 page_start=1339031552
>>>       >>>>> [  472.015514] BTRFS: error (device sda1) in
>>>       >>>>> btrfs_commit_transaction:2279: errno=-5 IO failure (Error while
>>>       >>>>> writing out transaction)
>>>       >>>>> [  472.027321] BTRFS info (device sda1): forced readonly
>>>       >>>>> [  472.027326] BTRFS warning (device sda1): Skipping commit
>>>      of aborted
>>>       >>>>> transaction.
>>>       >>>>> [  472.027330] BTRFS: error (device sda1) in
>>>      cleanup_transaction:1832:
>>>       >>>>> errno=-5 IO failure
>>>       >>>>> [  472.035378] BTRFS info (device sda1): delayed_refs has NO
>>>      entry
>>>       >>>>> [  472.039102] BTRFS warning (device sda1): failed setting
>>>      block group ro: -30
>>>       >>>>> [  472.039107] BTRFS info (device sda1): scrub: not finished
>>>      on devid
>>>       >>>>> 3 with status: -30
>>>       >>>>> [  472.039114] BTRFS info (device sda1): scrub: not finished
>>>      on devid
>>>       >>>>> 4 with status: -30
>>>       >>>>> [  472.039227] BTRFS warning (device sda1): failed setting
>>>      block group ro: -30
>>>       >>>>> [  472.039230] BTRFS warning (device sda1): failed setting
>>>      block group ro: -30
>>>       >>>>> [  472.039237] BTRFS info (device sda1): scrub: not finished
>>>      on devid
>>>       >>>>> 2 with status: -30
>>>       >>>>> [  472.039243] BTRFS info (device sda1): scrub: not finished
>>>      on devid
>>>       >>>>> 1 with status: -30
>>>       >>>>> ```
>>>       >>>>>
>>>       >>>>> To rob you of the suspense:
>>>       >>>>>
>>>       >>>>> printf "0x%016x\n" (17593525075968 - 1339031552)
>>>       >>>>> 0x0000100000000000
>>>       >>>>>
>>>       >>>>> Very surprising considering that this machine has ECC ram!
>>>       >>>>
>>>       >>>> ARMv7 with ECC, and we still had such bitflip!?
>>>       >>>>
>>>       >>>> I guess everything is possible in 2020.
>>>       >>>>
>>>       >>>>>
>>>       >>>>> What would you recommend as a fix?
>>>       >>>>
>>>       >>>> It's a hardware problem, I really don't have any
>>>      recommendation at all.
>>>       >>>>
>>>       >>>> I guess you may want to request an RMA?
>>>       >>>>
>>>       >>>> Thanks,
>>>       >>>> Qu
>>>       >>>>
>>>       >>>>>
>>>       >>>>> Best,
>>>       >>>>> Joe
>>>       >>>>>
>>>       >>>>> On Thu, Nov 26, 2020 at 7:05 PM Qu Wenruo
>>>      <quwenruo.btrfs@gmx.com <mailto:quwenruo.btrfs@gmx.com>> wrote:
>>>       >>>>>>
>>>       >>>>>>
>>>       >>>>>>
>>>       >>>>>> On 2020/11/26 下午6:53, Joe Hermaszewski wrote:
>>>       >>>>>>> Hi Qu,
>>>       >>>>>>>
>>>       >>>>>>> This is the output of btrfs check with the device unmounted:
>>>       >>>>>>>
>>>       >>>>>>> ```
>>>       >>>>>>> [root@nixos:~]# btrfs check --readonly /dev/sda1
>>>       >>>>>>> Opening filesystem to check...
>>>       >>>>>>> Checking filesystem on /dev/sda1
>>>       >>>>>>> UUID: b8f4ad49-29c8-4d19-a886-cef9c487f124
>>>       >>>>>>> [1/7] checking root items
>>>       >>>>>>> [2/7] checking extents
>>>       >>>>>>> [3/7] checking free space cache
>>>       >>>>>>> [4/7] checking fs roots
>>>       >>>>>>> root 294 inode 24665 errors 100, file extent discount
>>>       >>>>>>> Found file extent holes:
>>>       >>>>>>>           start: 3709534208, len: 163840
>>>       >>>>>>> root 294 inode 406548 errors 100, file extent discount
>>>       >>>>>>> Found file extent holes:
>>>       >>>>>>>           start: 0, len: 99450880
>>>       >>>>>>> ERROR: errors found in fs roots
>>>       >>>>>>> found 11279251902464 bytes used, error(s) found
>>>       >>>>>>> total csum bytes: 10935906504
>>>       >>>>>>> total tree bytes: 18455740416
>>>       >>>>>>> total fs tree bytes: 5798936576
>>>       >>>>>>> total extent tree bytes: 532250624
>>>       >>>>>>> btree space waste bytes: 2308849707
>>>       >>>>>>> file data blocks allocated: 17243418611712
>>>       >>>>>>>    referenced 14210256932864
>>>       >>>>>>> ```
>>>       >>>>>>>
>>>       >>>>>>> Looks the same as before...
>>>       >>>>>>
>>>       >>>>>> Then it means, at least your metadata is pretty safe.
>>>      Nothing really
>>>       >>>>>> needs to be worried (so far).
>>>       >>>>>>
>>>       >>>>>> This also means, the problem only happens at runtime, not
>>>      something on-disk.
>>>       >>>>>>
>>>       >>>>>> If you are OK to re-compile the kernel, and the warning
>>>      message is
>>>       >>>>>> always the same at line number 531 of disk-io.c, then I
>>>      recommend the
>>>       >>>>>> attached debug diff
>>>       >>>>>>
>>>       >>>>>> The diff will output the found_start (from eb) and
>>>      page_start (from page).
>>>       >>>>>> If it's memory bitflip, we should have a pretty signature
>>>      diff at the
>>>       >>>>>> output.
>>>       >>>>>> If not memory bitflip, then we really have a bug to dig.
>>>       >>>>>>
>>>       >>>>>>
>>>       >>>>>>>
>>>       >>>>>>> I'm afraid I don't have the precise details of the transid
>>>      error
>>>       >>>>>>> saved. I think I meant to write earlier that the message
>>>      that comes up
>>>       >>>>>>> now and then is the "errno=-5 IO failure (Error while
>>>      writing out
>>>       >>>>>>> transaction)". I've not seen the transid mismatch one since
>>>      the first
>>>       >>>>>>> incident, sorry for that slip-up.
>>>       >>>>>>>
>>>       >>>>>>> I'll try and attach to a known good system, will I have to
>>>      attach the
>>>       >>>>>>> whole array because I don't think I have that many
>>>      enclosures for the
>>>       >>>>>>> drives.
>>>       >>>>>>
>>>       >>>>>> You only need to mount the fs you're seeing the problem.
>>>       >>>>>> But if that fs is on multiple devices, then I guess you need
>>>      to mount
>>>       >>>>>> them all.
>>>       >>>>>>
>>>       >>>>>> Thanks,
>>>       >>>>>> Qu
>>>       >>>>>>>
>>>       >>>>>>> I'll also run a memtest on this box.
>>>       >>>>>>>
>>>       >>>>>>> Thank you for the helpful response.
>>>       >>>>>>>
>>>       >>>>>>> Best,
>>>       >>>>>>> Joe
>>>       >>>>>>>
>>>       >>>>>>> On Thu, Nov 26, 2020 at 2:15 PM Qu Wenruo
>>>      <quwenruo.btrfs@gmx.com <mailto:quwenruo.btrfs@gmx.com>> wrote:
>>>       >>>>>>>>
>>>       >>>>>>>>
>>>       >>>>>>>>
>>>       >>>>>>>> On 2020/11/25 下午11:28, Joe Hermaszewski wrote:
>>>       >>>>>>>>> Hi,
>>>       >>>>>>>>>
>>>       >>>>>>>>> I have a arm32 machine with four drives with a btrfs fs
>>>      spanning then in RAID1.
>>>       >>>>>>>>> The filesystem has started behaving badly recently and
>>>      I'm writing to:
>>>       >>>>>>>>>
>>>       >>>>>>>>> - Solicit advice on how best to get the system back to a
>>>      stable state
>>>       >>>>>>>>> - Report a potential bug
>>>       >>>>>>>>>
>>>       >>>>>>>>> ## What happened:
>>>       >>>>>>>>>
>>>       >>>>>>>>> A couple of days ago I could no longer ssh into it, and
>>>      on the serial
>>>       >>>>>>>>> connection there were heaps of messages (and new ones
>>>      appearing with great
>>>       >>>>>>>>> frequency) along the lines of: `parent transid verify
>>>      failed on blah... wanted
>>>       >>>>>>>>> x got y`.
>>>       >>>>>>>>>
>>>       >>>>>>>>> Although I don't have a record of the precise messages I
>>>      do remember that there
>>>       >>>>>>>>> was a difference of `15` between x and y.
>>>       >>>>>>>>
>>>       >>>>>>>> This normally can be a sign of unreliable HDD, which lies
>>>      on FLUSH,
>>>       >>>>>>>> killing metadata COW.
>>>       >>>>>>>>
>>>       >>>>>>>> But, your btrfs check doesn't report the same problem,
>>>      thus I'm confused.
>>>       >>>>>>>>
>>>       >>>>>>>> Would you please try to run a "btrfs check --readonly
>>>      /dev/sda1" with
>>>       >>>>>>>> the fs unmounted?
>>>       >>>>>>>>
>>>       >>>>>>>> And, would you provide the full dmesg of that mismatch?
>>>       >>>>>>>> The reason for the exact number is, I'm suspecting
>>>      hardware memory
>>>       >>>>>>>> corruption.
>>>       >>>>>>>>
>>>       >>>>>>>>>
>>>       >>>>>>>>> I power-cycled system and started a scrub after it
>>>      rebooted, this was
>>>       >>>>>>>>> interrupted quite promptly by several more errors in
>>>      btrfs, and the disk
>>>       >>>>>>>>> remounted RO.
>>>       >>>>>>>>>
>>>       >>>>>>>>> Every now and then in the kernel log I get messages like:
>>>       >>>>>>>>>
>>>       >>>>>>>>> `parent transid verify failed on blah... wanted x got y`
>>>       >>>>>>>>
>>>       >>>>>>>> Not showing up in the gist dmesg though.
>>>       >>>>>>>>
>>>       >>>>>>>>>
>>>       >>>>>>>>> ## Important info
>>>       >>>>>>>>>
>>>       >>>>>>>>> The dev stats are all zero.
>>>       >>>>>>>>>
>>>       >>>>>>>>> Here are the outputs of some btrfs commands, dmesg and
>>>      the kernel log from the
>>>       >>>>>>>>> previous two boots:
>>>      https://gist.github.com/b1beab134403c5047e2efbceb98985f9
>>>      <https://gist.github.com/b1beab134403c5047e2efbceb98985f9>
>>>       >>>>>>>>>
>>>       >>>>>>>>> The "cut here" portion of the kernel log is as follows
>>>       >>>>>>>>>
>>>       >>>>>>>>> ```
>>>       >>>>>>>>> [  409.158097] ------------[ cut here ]------------
>>>       >>>>>>>>> [  409.158205] WARNING: CPU: 1 PID: 217 at
>>>      fs/btrfs/disk-io.c:531
>>>       >>>>>>>>> btree_csum_one_bio+0x208/0x248 [btrfs]
>>>       >>>>>>>>
>>>       >>>>>>>> The line number shows, the tree bytenr doesn't match with
>>>      the one in memory.
>>>       >>>>>>>> This is really too rare to be seen, especially when we
>>>      have no other
>>>       >>>>>>>> error reported from btrfs (at least in the gist)
>>>       >>>>>>>>
>>>       >>>>>>>> Since there is no other problems showing up in the gist,
>>>      it means it
>>>       >>>>>>>> could be a bit flip, considering the magic generation gap
>>>      you mention is
>>>       >>>>>>>> 15, I'm more suspicious about memory bit flip.
>>>       >>>>>>>>
>>>       >>>>>>>> If you can provide the full parent transid mismatch error
>>>      message, it
>>>       >>>>>>>> may help to determine the possibility of hardware memory
>>>      corruption.
>>>       >>>>>>>>
>>>       >>>>>>>> Thanks,
>>>       >>>>>>>> Qu
>>>       >>>>>>>>
>>>       >>>>>>>>> [  409.158208] Modules linked in: cfg80211 rfkill 8021q
>>>      ip6table_nat
>>>       >>>>>>>>> iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6
>>>       >>>>>>>>> nf_defrag_ipv4 ip6t_rpfilter ipt_rpfilter ip6table_raw
>>>      iptable_raw
>>>       >>>>>>>>> xt_pkttype nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG
>>>      xt_tcpudp
>>>       >>>>>>>>> ftdi_sio usbserial phy_generic uio_pdrv_genirq uio
>>>      ip6table_filter
>>>       >>>>>>>>> ip6_tables iptable_filter sch_fq_codel loop tun tap
>>>      macvlan bridge stp
>>>       >>>>>>>>> llc lm75 ip_tables x_tables autofs4 dm_mod dax btrfs
>>>      libcrc32c xor
>>>       >>>>>>>>> raid6_pq
>>>       >>>>>>>>> [  409.158258] CPU: 1 PID: 217 Comm: btrfs-transacti Not
>>>      tainted 5.4.77 #1-NixOS
>>>       >>>>>>>>> [  409.158260] Hardware name: Marvell Armada 380/385
>>>      (Device Tree)
>>>       >>>>>>>>> [  409.158261] Backtrace:
>>>       >>>>>>>>> [  409.158272] [<c010f698>] (dump_backtrace) from
>>>      [<c010f938>]
>>>       >>>>>>>>> (show_stack+0x20/0x24)
>>>       >>>>>>>>> [  409.158277]  r7:00000213 r6:600f0013 r5:00000000
>>>      r4:c0f8c044
>>>       >>>>>>>>> [  409.158283] [<c010f918>] (show_stack) from [<c0a1b388>]
>>>       >>>>>>>>> (dump_stack+0x98/0xac)
>>>       >>>>>>>>> [  409.158288] [<c0a1b2f0>] (dump_stack) from
>>>      [<c012a998>] (__warn+0xe0/0x108)
>>>       >>>>>>>>> [  409.158292]  r7:00000213 r6:bf058ec8 r5:00000009
>>>      r4:bf120990
>>>       >>>>>>>>> [  409.158296] [<c012a8b8>] (__warn) from [<c012ad24>]
>>>       >>>>>>>>> (warn_slowpath_fmt+0x74/0xc4)
>>>       >>>>>>>>> [  409.158300]  r7:00000213 r6:bf120990 r5:00000000
>>>      r4:e2392000
>>>       >>>>>>>>> [  409.158358] [<c012acb4>] (warn_slowpath_fmt) from
>>>      [<bf058ec8>]
>>>       >>>>>>>>> (btree_csum_one_bio+0x208/0x248 [btrfs])
>>>       >>>>>>>>> [  409.158363]  r9:e277abe0 r8:00000001 r7:e2392000
>>>      r6:ea3d17f0
>>>       >>>>>>>>> r5:00000000 r4:eefd2d3c
>>>       >>>>>>>>> [  409.158465] [<bf058cc0>] (btree_csum_one_bio [btrfs]) from
>>>       >>>>>>>>> [<bf059ef4>] (btree_submit_bio_hook+0xe8/0x100 [btrfs])
>>>       >>>>>>>>> [  409.158470]  r10:e32ce170 r9:ecc45fc0 r8:ecc45f70
>>>      r7:ec82b000
>>>       >>>>>>>>> r6:00000000 r5:ea3d17f0
>>>       >>>>>>>>> [  409.158472]  r4:bf059e0c
>>>       >>>>>>>>> [  409.158575] [<bf059e0c>] (btree_submit_bio_hook
>>>      [btrfs]) from
>>>       >>>>>>>>> [<bf08b11c>] (submit_one_bio+0x44/0x5c [btrfs])
>>>       >>>>>>>>> [  409.158578]  r7:ef36c048 r6:e2393cac r5:00000000
>>>      r4:bf059e0c
>>>       >>>>>>>>> [  409.158683] [<bf08b0d8>] (submit_one_bio [btrfs]) from
>>>      [<bf0965d4>]
>>>       >>>>>>>>> (btree_write_cache_pages+0x380/0x408 [btrfs])
>>>       >>>>>>>>> [  409.158686]  r5:00000000 r4:00000000
>>>       >>>>>>>>> [  409.158788] [<bf096254>] (btree_write_cache_pages
>>>      [btrfs]) from
>>>       >>>>>>>>> [<bf059028>] (btree_writepages+0x7c/0x84 [btrfs])
>>>       >>>>>>>>> [  409.158793]  r10:00000001 r9:4fd00000 r8:c0280c94
>>>      r7:e2392000
>>>       >>>>>>>>> r6:e2393d80 r5:ecc45f70
>>>       >>>>>>>>> [  409.158794]  r4:e2393d80
>>>       >>>>>>>>> [  409.158850] [<bf058fac>] (btree_writepages [btrfs]) from
>>>       >>>>>>>>> [<c0284748>] (do_writepages+0x58/0xf4)
>>>       >>>>>>>>> [  409.158852]  r5:ecc45f70 r4:ecc45e68
>>>       >>>>>>>>> [  409.158860] [<c02846f0>] (do_writepages) from [<c0278c30>]
>>>       >>>>>>>>> (__filemap_fdatawrite_range+0xf8/0x130)
>>>       >>>>>>>>> [  409.158864]  r8:ecc45f70 r7:00001000 r6:4fd0bfff
>>>      r5:e2392000 r4:ecc45e68
>>>       >>>>>>>>> [  409.158869] [<c0278b38>] (__filemap_fdatawrite_range) from
>>>       >>>>>>>>> [<c0278db8>] (filemap_fdatawrite_range+0x2c/0x34)
>>>       >>>>>>>>> [  409.158874]  r10:ecc45f70 r9:00001000 r8:4fd0bfff
>>>      r7:e2393e4c
>>>       >>>>>>>>> r6:c73bc628 r5:00001000
>>>       >>>>>>>>> [  409.158875]  r4:4fd0bfff
>>>       >>>>>>>>> [  409.158929] [<c0278d8c>] (filemap_fdatawrite_range) from
>>>       >>>>>>>>> [<bf0604b4>] (btrfs_write_marked_extents+0x9c/0x1b0 [btrfs])
>>>       >>>>>>>>> [  409.158931]  r5:00000001 r4:00000000
>>>       >>>>>>>>> [  409.159033] [<bf060418>] (btrfs_write_marked_extents
>>>      [btrfs]) from
>>>       >>>>>>>>> [<bf060660>] (btrfs_write_and_wait_transaction+0x54/0xa4
>>>      [btrfs])
>>>       >>>>>>>>> [  409.159038]  r10:e2392000 r9:ec82b010 r8:ec82b000
>>>      r7:c73bc628
>>>       >>>>>>>>> r6:ec82b000 r5:e2392000
>>>       >>>>>>>>> [  409.159040]  r4:c8b81ca8
>>>       >>>>>>>>> [  409.159141] [<bf06060c>]
>>>      (btrfs_write_and_wait_transaction [btrfs])
>>>       >>>>>>>>> from [<bf062398>] (btrfs_commit_transaction+0x75c/0xc94
>>>      [btrfs])
>>>       >>>>>>>>> [  409.159145]  r8:ec82b418 r7:c73bc600 r6:ec82b000
>>>      r5:c8b81ca8 r4:00000000
>>>       >>>>>>>>> [  409.159248] [<bf061c3c>] (btrfs_commit_transaction
>>>      [btrfs]) from
>>>       >>>>>>>>> [<bf05cd08>] (transaction_kthread+0x19c/0x1e0 [btrfs])
>>>       >>>>>>>>> [  409.159253]  r10:ec82b28c r9:00000000 r8:001aaafa
>>>      r7:00000064
>>>       >>>>>>>>> r6:ec82b414 r5:00000bb8
>>>       >>>>>>>>> [  409.159254]  r4:ec82b000
>>>       >>>>>>>>> [  409.159309] [<bf05cb6c>] (transaction_kthread [btrfs])
>>>      from
>>>       >>>>>>>>> [<c014fabc>] (kthread+0x170/0x174)
>>>       >>>>>>>>> [  409.159313]  r10:eca87bfc r9:bf05cb6c r8:ed619000
>>>      r7:e2392000
>>>       >>>>>>>>> r6:00000000 r5:ed5ee700
>>>       >>>>>>>>> [  409.159315]  r4:ed5ee1c0
>>>       >>>>>>>>> [  409.159320] [<c014f94c>] (kthread) from [<c01010e8>]
>>>       >>>>>>>>> (ret_from_fork+0x14/0x2c)
>>>       >>>>>>>>> [  409.159322] Exception stack(0xe2393fb0 to 0xe2393ff8)
>>>       >>>>>>>>> [  409.159326] 3fa0:
>>>        00000000
>>>       >>>>>>>>> 00000000 00000000 00000000
>>>       >>>>>>>>> [  409.159331] 3fc0: 00000000 00000000 00000000 00000000
>>>      00000000
>>>       >>>>>>>>> 00000000 00000000 00000000
>>>       >>>>>>>>> [  409.159334] 3fe0: 00000000 00000000 00000000 00000000
>>>      00000013 00000000
>>>       >>>>>>>>> [  409.159338]  r10:00000000 r9:00000000 r8:00000000
>>>      r7:00000000
>>>       >>>>>>>>> r6:00000000 r5:c014f94c
>>>       >>>>>>>>> [  409.159340]  r4:ed5ee700
>>>       >>>>>>>>> [  409.159342] ---[ end trace eea59ced12fa7859 ]---
>>>       >>>>>>>>> [  409.165084] BTRFS: error (device sda1) in
>>>       >>>>>>>>> btrfs_commit_transaction:2279: errno=-5 IO failure (Error
>>>      while
>>>       >>>>>>>>> writing out transaction)
>>>       >>>>>>>>> [  409.176920] BTRFS info (device sda1): forced readonly
>>>       >>>>>>>>> [  409.176947] BTRFS warning (device sda1): Skipping
>>>      commit of aborted
>>>       >>>>>>>>> transaction.
>>>       >>>>>>>>> [  409.176952] BTRFS: error (device sda1) in
>>>      cleanup_transaction:1832:
>>>       >>>>>>>>> errno=-5 IO failure
>>>       >>>>>>>>> [  409.185049] BTRFS info (device sda1): delayed_refs has
>>>      NO entry
>>>       >>>>>>>>> [  409.310199] BTRFS info (device sda1): scrub: not
>>>      finished on devid
>>>       >>>>>>>>> 3 with status: -125
>>>       >>>>>>>>> [  409.664880] BTRFS info (device sda1): scrub: not
>>>      finished on devid
>>>       >>>>>>>>> 4 with status: -125
>>>       >>>>>>>>> [  410.106791] BTRFS info (device sda1): scrub: not
>>>      finished on devid
>>>       >>>>>>>>> 1 with status: -125
>>>       >>>>>>>>> [  411.268585] BTRFS warning (device sda1): failed
>>>      setting block group ro: -30
>>>       >>>>>>>>> [  411.268594] BTRFS info (device sda1): scrub: not
>>>      finished on devid
>>>       >>>>>>>>> 2 with status: -30
>>>       >>>>>>>>> [  411.268605] BTRFS info (device sda1): delayed_refs has
>>>      NO entry
>>>       >>>>>>>>> ```
>>>       >>>>>>>>>
>>>       >>>>>>>>> Information requested here
>>>       >>>>>>>>>
>>>      (https://btrfs.wiki.kernel.org/index.php/Btrfs_mailing_list
>>>      <https://btrfs.wiki.kernel.org/index.php/Btrfs_mailing_list>):
>>>       >>>>>>>>>
>>>       >>>>>>>>> ```
>>>       >>>>>>>>>    $ uname -a
>>>       >>>>>>>>> Linux thanos 5.4.77 #1-NixOS SMP Tue Nov 10 20:13:20 UTC
>>>      2020 armv7l GNU/Linux
>>>       >>>>>>>>>
>>>       >>>>>>>>>    $ btrfs --version
>>>       >>>>>>>>> btrfs-progs v5.7
>>>       >>>>>>>>>
>>>       >>>>>>>>>    $ sudo btrfs fi show
>>>       >>>>>>>>> Label: none  uuid: b8f4ad49-29c8-4d19-a886-cef9c487f124
>>>       >>>>>>>>>           Total devices 4 FS bytes used 10.26TiB
>>>       >>>>>>>>>           devid    1 size 3.64TiB used 2.40TiB path /dev/sda1
>>>       >>>>>>>>>           devid    2 size 3.64TiB used 2.40TiB path /dev/sdc1
>>>       >>>>>>>>>           devid    3 size 9.09TiB used 7.86TiB path /dev/sdd1
>>>       >>>>>>>>>           devid    4 size 9.09TiB used 7.86TiB path /dev/sdb1
>>>       >>>>>>>>>
>>>       >>>>>>>>> Label: none  uuid: d02a3067-0a23-4c1f-96ac-80dbc26622f2
>>>       >>>>>>>>>           Total devices 1 FS bytes used 116.35MiB
>>>       >>>>>>>>>           devid    1 size 399.82MiB used 224.00MiB path
>>>      /dev/sda2
>>>       >>>>>>>>>
>>>       >>>>>>>>>    $ sudo btrfs fi df /
>>>       >>>>>>>>> Data, RAID1: total=10.25TiB, used=10.24TiB
>>>       >>>>>>>>> System, RAID1: total=64.00MiB, used=1.45MiB
>>>       >>>>>>>>> Metadata, RAID1: total=18.00GiB, used=17.19GiB
>>>       >>>>>>>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>       >>>>>>>>> ```
>>>       >>>>>>>>>
>>>       >>>>>>>>> Thanks to demfloro and multicore on #btrfs for prompting
>>>      this email.
>>>       >>>>>>>>>
>>>       >>>>>>>>> Best wishes,
>>>       >>>>>>>>> Joe
>>>       >>>>>>>>>
>>>       >>>>>>>>
>>>       >>>>
>>>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: btrfs crash on armv7
  2021-04-08  8:38                 ` Qu Wenruo
       [not found]                   ` <CA+4cVr8sxGT1Zz+1tz+0OqBCukFgn7d_ZZq31bXASc426YbJ7A@mail.gmail.com>
@ 2021-04-08 11:15                   ` riteshh
  2021-04-08 11:29                     ` Qu Wenruo
  1 sibling, 1 reply; 14+ messages in thread
From: riteshh @ 2021-04-08 11:15 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Joe Hermaszewski, linux-btrfs


Please excuse my silly queries here.

On 21/04/08 04:38PM, Qu Wenruo wrote:
>
>
> On 2021/4/8 下午4:16, Joe Hermaszewski wrote:
> > It took a while but I managed to get hold of another one of these
> > arm32 boards. Very disappointingly this exact "bitflip" is still
> > present (log enclosed).
>
> Yeah, we got to the conclusion it's not bitflip, but completely 32bit
> limit on armv7.
>
> For ARMv7, it's a 32bit system, where unsigned long is only 32bit.
>
> This means, things like page->index is only 32bit long, and for 4K page
> size, it also means all filesystems (not only btrfs) can only utilize at
> most 16T bytes.
>
> But there is pitfall for btrfs, btrfs uses its internal address space
> for its meatadata, and the address space is U64.

Can you pls point me to the code you are referring here?
So IIUC, you mean since page->index can hold a value which can be upto 32bit in
size so the maximum FS address range which can be accessed is 16T.
This should be true in general for any FS no?

>
> And furthermore, for btrfs it can have metadata at bytenr way larger
> than the total device size.

Is this because of multi-device support?

> This is possible because btrfs maps part of its address space to real
> disks, thus it can have bytenr way larger than device size.

Please a code pointing to that will help me understand this better.
Thanks.

>
> But this brings to a problem, 32bit Linux can only handle 16T, but in
> your case, some of your metadata is already beyond 16T in btrfs address
> space.

Sorry I am not much aware of the history. Was this disk mkfs on 64-bit system
and then connected to a 32bit board?

This also brings me to check with you about other filesystems.
See the capacity section from below wiki[1]. Depending upon the host OS
limitation on the max size of the filesystem may vary right?

[1] https://en.wikipedia.org/wiki/XFS

-ritesh

>
> Then a lot of things are going to be wrong.
>
> I have submitted a patch to do extra check, at least allowing user to
> know this is the limit of 32bit:
> https://patchwork.kernel.org/project/linux-btrfs/patch/20210225011814.24009-1-wqu@suse.com/
>
> Unfortunately, this will not help existing fs though.
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: btrfs crash on armv7
  2021-04-08 11:15                   ` riteshh
@ 2021-04-08 11:29                     ` Qu Wenruo
  0 siblings, 0 replies; 14+ messages in thread
From: Qu Wenruo @ 2021-04-08 11:29 UTC (permalink / raw)
  To: riteshh; +Cc: Joe Hermaszewski, linux-btrfs



On 2021/4/8 下午7:15, riteshh wrote:
>
> Please excuse my silly queries here.
>
> On 21/04/08 04:38PM, Qu Wenruo wrote:
>>
>>
>> On 2021/4/8 下午4:16, Joe Hermaszewski wrote:
>>> It took a while but I managed to get hold of another one of these
>>> arm32 boards. Very disappointingly this exact "bitflip" is still
>>> present (log enclosed).
>>
>> Yeah, we got to the conclusion it's not bitflip, but completely 32bit
>> limit on armv7.
>>
>> For ARMv7, it's a 32bit system, where unsigned long is only 32bit.
>>
>> This means, things like page->index is only 32bit long, and for 4K page
>> size, it also means all filesystems (not only btrfs) can only utilize at
>> most 16T bytes.
>>
>> But there is pitfall for btrfs, btrfs uses its internal address space
>> for its meatadata, and the address space is U64.
>
> Can you pls point me to the code you are referring here?
> So IIUC, you mean since page->index can hold a value which can be upto 32bit in
> size so the maximum FS address range which can be accessed is 16T.
> This should be true in general for any FS no?


The code is in definition of "struct page",from "include/linux/mm_types.h"

Yes, for all fs.

But no other fs has another internal address space, unlike btrfs.

Btrfs uses its internal space to implement multi-device support.

>
>>
>> And furthermore, for btrfs it can have metadata at bytenr way larger
>> than the total device size.
>
> Is this because of multi-device support?

Yes.

>
>> This is possible because btrfs maps part of its address space to real
>> disks, thus it can have bytenr way larger than device size.
>
> Please a code pointing to that will help me understand this better.
> Thanks.

You need to understand btrfs chunk tree first.

Each btrfs chunk item is a mapping from btrfs logical address to each
real device.

The easiest way to understand it is not code, but "btrfs ins dump-tree
-t chunk <device>" to experience it by yourself.

>
>>
>> But this brings to a problem, 32bit Linux can only handle 16T, but in
>> your case, some of your metadata is already beyond 16T in btrfs address
>> space.
>
> Sorry I am not much aware of the history. Was this disk mkfs on 64-bit system
> and then connected to a 32bit board?

Possible.

But there are other cases to go beyond that limit, especially with balance.

So I'm not confident enough to say what's the exact event to make the fs
cross the line.

>
> This also brings me to check with you about other filesystems.
> See the capacity section from below wiki[1]. Depending upon the host OS
> limitation on the max size of the filesystem may vary right?
>
> [1] https://en.wikipedia.org/wiki/XFS

The last time Dave Chineer said, for xfs larger than 16T, 32bit kernel
will just refuse to mount.

Thanks,
Qu

>
> -ritesh
>
>>
>> Then a lot of things are going to be wrong.
>>
>> I have submitted a patch to do extra check, at least allowing user to
>> know this is the limit of 32bit:
>> https://patchwork.kernel.org/project/linux-btrfs/patch/20210225011814.24009-1-wqu@suse.com/
>>
>> Unfortunately, this will not help existing fs though.
>>

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2021-04-08 11:29 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-25 15:28 btrfs crash on armv7 Joe Hermaszewski
2020-11-26  6:15 ` Qu Wenruo
2020-11-26  6:26   ` Qu Wenruo
2020-11-26 10:53   ` Joe Hermaszewski
2020-11-26 11:05     ` Qu Wenruo
2020-11-27 15:15       ` Joe Hermaszewski
2020-11-28  0:45         ` Qu Wenruo
2020-12-19 10:35           ` Joe Hermaszewski
2020-12-20  0:28             ` Qu Wenruo
2021-04-08  8:16               ` Joe Hermaszewski
2021-04-08  8:38                 ` Qu Wenruo
     [not found]                   ` <CA+4cVr8sxGT1Zz+1tz+0OqBCukFgn7d_ZZq31bXASc426YbJ7A@mail.gmail.com>
     [not found]                     ` <1ae47f73-f39e-bb71-d0b2-02999a703a4b@gmx.com>
     [not found]                       ` <CA+4cVr9Zgscn=L0a6CXrCaWK12mne8EpdW0eEe+PPuhQG2fmxQ@mail.gmail.com>
2021-04-08 10:22                         ` Qu Wenruo
2021-04-08 11:15                   ` riteshh
2021-04-08 11:29                     ` Qu Wenruo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.