All of lore.kernel.org
 help / color / mirror / Atom feed
* Interrupted and resumed scrubs seem to have caused filesystem to go readonly (EFBIG error)
@ 2020-01-01 23:35 Graham Cobb
  2020-01-02  1:26 ` Qu Wenruo
  0 siblings, 1 reply; 5+ messages in thread
From: Graham Cobb @ 2020-01-01 23:35 UTC (permalink / raw)
  To: linux-btrfs

I have a problem on one BTRFS filesystem. It is not a critical
filesystem (it is used for backups) and I have not yet tried even
unmounting and remounting, let alone a "btrfs check".

The problem seems to be that after several iterations of running 'btrfs
scrub' for 30 minutes, then pausing for a while, then resuming the
scrub, I got a transaction aborted with an EFBIG error and a warning in
the kernel log. The fs went readonly, and transid verify errors are now
reported. The original log extract is available at
http://www.cobb.uk.net/kern.log.bug-010120 but I have pasted the key
part below.

The kernel is a Debian Testing kernel:
Linux black 5.3.0-2-amd64 #1 SMP Debian 5.3.9-3 (2019-11-19) x86_64
GNU/Linux

I run this same script monthly, and I have not seen this problem before,
so I cannot be certain it is caused by the scrub. I have not yet tried
to reproduce it, or to investigate the filesystem (check, etc).

Does anyone recognise this as a known/fixed problem? If not, is there
any particular further information I could gather before or during my
attempt to either recover the filesystem or just rebuild it?

Here is the log (starting with the 7th resumed scrub):


Jan  1 06:41:45 black kernel: [1930660.938782] BTRFS info (device sdc3):
scrub: started on devid 1
Jan  1 06:41:45 black kernel: [1930660.939195] BTRFS info (device sdc3):
scrub: started on devid 4
Jan  1 06:41:45 black kernel: [1930661.475557] ------------[ cut here
]------------
Jan  1 06:41:45 black kernel: [1930661.475562] BTRFS: Transaction
aborted (error -27)
Jan  1 06:41:45 black kernel: [1930661.475667] WARNING: CPU: 0 PID:
771075 at fs/btrfs/extent-tree.c:8247 btrfs_create_pending_block_
groups+0x1db/0x230 [btrfs]
Jan  1 06:41:45 black kernel: [1930661.475669] Modules linked in: fuse
nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache bnep nf_t
ables snd_hrtimer snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq
snd_seq_device cpufreq_userspace cpufreq_powersave cpufreq_cons
ervative nfnetlink_queue nfnetlink_log nfnetlink bluetooth drbg
ansi_cprng ecdh_generic ecc binfmt_misc hid_generic usbhid hid it87 h
wmon_vid radeon edac_mce_amd kvm_amd eeepc_wmi ccp asus_wmi rng_core
evdev sparse_keymap kvm snd_hda_codec_realtek rfkill irqbypass s
nd_hda_codec_generic ttm video wmi_bmof ledtrig_audio pcspkr
snd_hda_codec_hdmi drm_kms_helper fam15h_power k10temp snd_hda_intel snd
_hda_codec snd_hda_core snd_hwdep sp5100_tco drm snd_pcm_oss
snd_mixer_oss watchdog snd_pcm snd_timer sg snd soundcore i2c_algo_bit b
utton acpi_cpufreq eeprom i2c_nforce2 firewire_sbp2 firewire_core
crc_itu_t psmouse nfsd parport_pc ppdev auth_rpcgss lp nfs_acl parp
ort lockd grace sunrpc ip_tables x_tables autofs4 btrfs xor
zstd_decompress zstd_compress raid6_pq libcrc32c
Jan  1 06:41:45 black kernel: [1930661.475710]  ext4 crc16 mbcache jbd2
crc32c_generic sr_mod cdrom uas usb_storage sd_mod dm_crypt d
m_mod crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel
ohci_pci aesni_intel ahci libahci xhci_pci aes_x86_64 xhci_hcd c
rypto_simd libata ehci_pci ohci_hcd ehci_hcd cryptd glue_helper scsi_mod
usbcore r8169 i2c_piix4 realtek libphy usb_common wmi
Jan  1 06:41:45 black kernel: [1930661.475737] CPU: 0 PID: 771075 Comm:
btrfs Not tainted 5.3.0-2-amd64 #1 Debian 5.3.9-3
Jan  1 06:41:45 black kernel: [1930661.475739] Hardware name: To be
filled by O.E.M. To be filled by O.E.M./M5A97, BIOS 0705 08/22/20
11
Jan  1 06:41:45 black kernel: [1930661.475767] RIP:
0010:btrfs_create_pending_block_groups+0x1db/0x230 [btrfs]
Jan  1 06:41:45 black kernel: [1930661.475770] Code: e9 26 ff ff ff 48
8b 45 50 f0 48 0f ba a8 38 17 00 00 02 72 17 41 83 fc fb 74 2d
 44 89 e6 48 c7 c7 50 2e 7a c0 e8 23 9d 19 e3 <0f> 0b 44 89 e1 ba 37 20
00 00 48 c7 c6 20 80 79 c0 48 89 ef e8 73
Jan  1 06:41:45 black kernel: [1930661.475772] RSP:
0018:ffff9c69804cfb00 EFLAGS: 00010286
Jan  1 06:41:45 black kernel: [1930661.475775] RAX: 0000000000000000
RBX: ffff909444e7a520 RCX: 0000000000000006
Jan  1 06:41:45 black kernel: [1930661.475777] RDX: 0000000000000007
RSI: 0000000000000096 RDI: ffff90957aa17680
Jan  1 06:41:45 black kernel: [1930661.475779] RBP: ffff90946c745d68
R08: 0000000000010ec1 R09: 0000000000000007
Jan  1 06:41:45 black kernel: [1930661.475781] R10: 0000000000000000
R11: 0000000000000001 R12: 00000000ffffffe5
Jan  1 06:41:45 black kernel: [1930661.475783] R13: ffff90946c745dc0
R14: ffff909575d2e000 R15: ffff909574444000
Jan  1 06:41:45 black kernel: [1930661.475786] FS:
00007ff2eb4c7700(0000) GS:ffff90957aa00000(0000) knlGS:0000000000000000
Jan  1 06:41:45 black kernel: [1930661.475788] CS:  0010 DS: 0000 ES:
0000 CR0: 0000000080050033
Jan  1 06:41:45 black kernel: [1930661.475790] CR2: 00005634edab7008
CR3: 00000000bd0f2000 CR4: 00000000000406f0
Jan  1 06:41:45 black kernel: [1930661.475792] Call Trace:
Jan  1 06:41:45 black kernel: [1930661.475826]
__btrfs_end_transaction+0x3f/0x1b0 [btrfs]
Jan  1 06:41:45 black kernel: [1930661.475855]
btrfs_inc_block_group_ro+0x10e/0x150 [btrfs]
Jan  1 06:41:45 black kernel: [1930661.475891]
scrub_enumerate_chunks+0x162/0x560 [btrfs]
Jan  1 06:41:45 black kernel: [1930661.475900]  ?
remove_wait_queue+0x20/0x60
Jan  1 06:41:45 black kernel: [1930661.475936]
btrfs_scrub_dev+0x26b/0x590 [btrfs]
Jan  1 06:41:45 black kernel: [1930661.475942]  ? _cond_resched+0x15/0x30
Jan  1 06:41:45 black kernel: [1930661.475946]  ?
__kmalloc_track_caller+0x16e/0x260
Jan  1 06:41:45 black kernel: [1930661.475980]  ?
btrfs_ioctl+0x82f/0x2e10 [btrfs]
Jan  1 06:41:45 black kernel: [1930661.475984]  ?
__check_object_size+0x136/0x147
Jan  1 06:41:45 black kernel: [1930661.476019]  btrfs_ioctl+0x87a/0x2e10
[btrfs]
Jan  1 06:41:45 black kernel: [1930661.476024]  ?
tomoyo_path_number_perm+0x66/0x1d0
Jan  1 06:41:45 black kernel: [1930661.476030]  ? do_vfs_ioctl+0x40e/0x670
Jan  1 06:41:45 black kernel: [1930661.476033]  do_vfs_ioctl+0x40e/0x670
Jan  1 06:41:45 black kernel: [1930661.476036]  ?
create_task_io_context+0x95/0x100
Jan  1 06:41:45 black kernel: [1930661.476040]  ksys_ioctl+0x5e/0x90
Jan  1 06:41:45 black kernel: [1930661.476044]  __x64_sys_ioctl+0x16/0x20
Jan  1 06:41:45 black kernel: [1930661.476048]  do_syscall_64+0x53/0x140
Jan  1 06:41:45 black kernel: [1930661.476052]
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jan  1 06:41:45 black kernel: [1930661.476055] RIP: 0033:0x7ff2eb5b95b7
Jan  1 06:41:45 black kernel: [1930661.476058] Code: 00 00 90 48 8b 05
d9 78 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84
00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b
0d a9 78 0c 00 f7 d8 64 89 01 48
Jan  1 06:41:45 black kernel: [1930661.476061] RSP:
002b:00007ff2eb4c6d38 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Jan  1 06:41:45 black kernel: [1930661.476064] RAX: ffffffffffffffda
RBX: 000055eeaf2e94b0 RCX: 00007ff2eb5b95b7
Jan  1 06:41:45 black kernel: [1930661.476066] RDX: 000055eeaf2e94b0
RSI: 00000000c400941b RDI: 0000000000000003
Jan  1 06:41:45 black kernel: [1930661.476067] RBP: 0000000000000000
R08: 00007ff2eb4c7700 R09: 0000000000000000
Jan  1 06:41:45 black kernel: [1930661.476069] R10: 00007ff2eb4c7700
R11: 0000000000000246 R12: 00007ffc4cfa511e
Jan  1 06:41:45 black kernel: [1930661.476071] R13: 00007ffc4cfa511f
R14: 00007ff2eb4c7700 R15: 00007ff2eb4c6e40
Jan  1 06:41:45 black kernel: [1930661.476075] ---[ end trace
6429c1bf293fecb8 ]---
Jan  1 06:41:45 black kernel: [1930661.476079] BTRFS: error (device
sdc3) in btrfs_create_pending_block_groups:8247: errno=-27 unknown
Jan  1 06:41:45 black kernel: [1930661.476082] BTRFS info (device sdc3):
forced readonly
Jan  1 06:41:45 black kernel: [1930661.489816] BTRFS warning (device
sdc3): failed setting block group ro: -30
Jan  1 06:41:45 black kernel: [1930661.489821] BTRFS info (device sdc3):
scrub: not finished on devid 1 with status: -30
Jan  1 06:41:52 black kernel: [1930668.052295] BTRFS warning (device
sdc3): failed setting block group ro: -30
Jan  1 06:41:52 black kernel: [1930668.052301] BTRFS info (device sdc3):
scrub: not finished on devid 4 with status: -30
Jan  1 06:51:56 black kernel: [1931271.801468] BTRFS error (device
sdc3): parent transid verify failed on 16216583520256 wanted 301800
found 301756
Jan  1 06:51:56 black kernel: [1931271.822215] BTRFS error (device
sdc3): parent transid verify failed on 16216583520256 wanted 301800
found 301756
Jan  1 06:51:57 black kernel: [1931273.492798] BTRFS error (device
sdc3): parent transid verify failed on 16216583520256 wanted 301800
found 301756
Jan  1 06:51:57 black kernel: [1931273.493041] BTRFS error (device
sdc3): parent transid verify failed on 16216583520256 wanted 301800
found 301756

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Interrupted and resumed scrubs seem to have caused filesystem to go readonly (EFBIG error)
  2020-01-01 23:35 Interrupted and resumed scrubs seem to have caused filesystem to go readonly (EFBIG error) Graham Cobb
@ 2020-01-02  1:26 ` Qu Wenruo
  2020-01-02 12:07   ` Graham Cobb
  0 siblings, 1 reply; 5+ messages in thread
From: Qu Wenruo @ 2020-01-02  1:26 UTC (permalink / raw)
  To: Graham Cobb, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 9497 bytes --]



On 2020/1/2 上午7:35, Graham Cobb wrote:
> I have a problem on one BTRFS filesystem. It is not a critical
> filesystem (it is used for backups) and I have not yet tried even
> unmounting and remounting, let alone a "btrfs check".
> 
> The problem seems to be that after several iterations of running 'btrfs
> scrub' for 30 minutes, then pausing for a while, then resuming the
> scrub, I got a transaction aborted with an EFBIG error and a warning in
> the kernel log. The fs went readonly, and transid verify errors are now
> reported. The original log extract is available at
> http://www.cobb.uk.net/kern.log.bug-010120 but I have pasted the key
> part below.

EFBIG in btrfs is very rare, and can only be caused by too many system
chunks.

The most common reason is the chunk pre-alllocation for scrub, which
also matches your situation.

There is already a fix for it, and will land in v5.5 kernel.
It looks like we should backport it.

Thanks,
Qu


> 
> The kernel is a Debian Testing kernel:
> Linux black 5.3.0-2-amd64 #1 SMP Debian 5.3.9-3 (2019-11-19) x86_64
> GNU/Linux
> 
> I run this same script monthly, and I have not seen this problem before,
> so I cannot be certain it is caused by the scrub. I have not yet tried
> to reproduce it, or to investigate the filesystem (check, etc).
> 
> Does anyone recognise this as a known/fixed problem? If not, is there
> any particular further information I could gather before or during my
> attempt to either recover the filesystem or just rebuild it?
> 
> Here is the log (starting with the 7th resumed scrub):
> 
> 
> Jan  1 06:41:45 black kernel: [1930660.938782] BTRFS info (device sdc3):
> scrub: started on devid 1
> Jan  1 06:41:45 black kernel: [1930660.939195] BTRFS info (device sdc3):
> scrub: started on devid 4
> Jan  1 06:41:45 black kernel: [1930661.475557] ------------[ cut here
> ]------------
> Jan  1 06:41:45 black kernel: [1930661.475562] BTRFS: Transaction
> aborted (error -27)
> Jan  1 06:41:45 black kernel: [1930661.475667] WARNING: CPU: 0 PID:
> 771075 at fs/btrfs/extent-tree.c:8247 btrfs_create_pending_block_
> groups+0x1db/0x230 [btrfs]
> Jan  1 06:41:45 black kernel: [1930661.475669] Modules linked in: fuse
> nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache bnep nf_t
> ables snd_hrtimer snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq
> snd_seq_device cpufreq_userspace cpufreq_powersave cpufreq_cons
> ervative nfnetlink_queue nfnetlink_log nfnetlink bluetooth drbg
> ansi_cprng ecdh_generic ecc binfmt_misc hid_generic usbhid hid it87 h
> wmon_vid radeon edac_mce_amd kvm_amd eeepc_wmi ccp asus_wmi rng_core
> evdev sparse_keymap kvm snd_hda_codec_realtek rfkill irqbypass s
> nd_hda_codec_generic ttm video wmi_bmof ledtrig_audio pcspkr
> snd_hda_codec_hdmi drm_kms_helper fam15h_power k10temp snd_hda_intel snd
> _hda_codec snd_hda_core snd_hwdep sp5100_tco drm snd_pcm_oss
> snd_mixer_oss watchdog snd_pcm snd_timer sg snd soundcore i2c_algo_bit b
> utton acpi_cpufreq eeprom i2c_nforce2 firewire_sbp2 firewire_core
> crc_itu_t psmouse nfsd parport_pc ppdev auth_rpcgss lp nfs_acl parp
> ort lockd grace sunrpc ip_tables x_tables autofs4 btrfs xor
> zstd_decompress zstd_compress raid6_pq libcrc32c
> Jan  1 06:41:45 black kernel: [1930661.475710]  ext4 crc16 mbcache jbd2
> crc32c_generic sr_mod cdrom uas usb_storage sd_mod dm_crypt d
> m_mod crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel
> ohci_pci aesni_intel ahci libahci xhci_pci aes_x86_64 xhci_hcd c
> rypto_simd libata ehci_pci ohci_hcd ehci_hcd cryptd glue_helper scsi_mod
> usbcore r8169 i2c_piix4 realtek libphy usb_common wmi
> Jan  1 06:41:45 black kernel: [1930661.475737] CPU: 0 PID: 771075 Comm:
> btrfs Not tainted 5.3.0-2-amd64 #1 Debian 5.3.9-3
> Jan  1 06:41:45 black kernel: [1930661.475739] Hardware name: To be
> filled by O.E.M. To be filled by O.E.M./M5A97, BIOS 0705 08/22/20
> 11
> Jan  1 06:41:45 black kernel: [1930661.475767] RIP:
> 0010:btrfs_create_pending_block_groups+0x1db/0x230 [btrfs]
> Jan  1 06:41:45 black kernel: [1930661.475770] Code: e9 26 ff ff ff 48
> 8b 45 50 f0 48 0f ba a8 38 17 00 00 02 72 17 41 83 fc fb 74 2d
>  44 89 e6 48 c7 c7 50 2e 7a c0 e8 23 9d 19 e3 <0f> 0b 44 89 e1 ba 37 20
> 00 00 48 c7 c6 20 80 79 c0 48 89 ef e8 73
> Jan  1 06:41:45 black kernel: [1930661.475772] RSP:
> 0018:ffff9c69804cfb00 EFLAGS: 00010286
> Jan  1 06:41:45 black kernel: [1930661.475775] RAX: 0000000000000000
> RBX: ffff909444e7a520 RCX: 0000000000000006
> Jan  1 06:41:45 black kernel: [1930661.475777] RDX: 0000000000000007
> RSI: 0000000000000096 RDI: ffff90957aa17680
> Jan  1 06:41:45 black kernel: [1930661.475779] RBP: ffff90946c745d68
> R08: 0000000000010ec1 R09: 0000000000000007
> Jan  1 06:41:45 black kernel: [1930661.475781] R10: 0000000000000000
> R11: 0000000000000001 R12: 00000000ffffffe5
> Jan  1 06:41:45 black kernel: [1930661.475783] R13: ffff90946c745dc0
> R14: ffff909575d2e000 R15: ffff909574444000
> Jan  1 06:41:45 black kernel: [1930661.475786] FS:
> 00007ff2eb4c7700(0000) GS:ffff90957aa00000(0000) knlGS:0000000000000000
> Jan  1 06:41:45 black kernel: [1930661.475788] CS:  0010 DS: 0000 ES:
> 0000 CR0: 0000000080050033
> Jan  1 06:41:45 black kernel: [1930661.475790] CR2: 00005634edab7008
> CR3: 00000000bd0f2000 CR4: 00000000000406f0
> Jan  1 06:41:45 black kernel: [1930661.475792] Call Trace:
> Jan  1 06:41:45 black kernel: [1930661.475826]
> __btrfs_end_transaction+0x3f/0x1b0 [btrfs]
> Jan  1 06:41:45 black kernel: [1930661.475855]
> btrfs_inc_block_group_ro+0x10e/0x150 [btrfs]
> Jan  1 06:41:45 black kernel: [1930661.475891]
> scrub_enumerate_chunks+0x162/0x560 [btrfs]
> Jan  1 06:41:45 black kernel: [1930661.475900]  ?
> remove_wait_queue+0x20/0x60
> Jan  1 06:41:45 black kernel: [1930661.475936]
> btrfs_scrub_dev+0x26b/0x590 [btrfs]
> Jan  1 06:41:45 black kernel: [1930661.475942]  ? _cond_resched+0x15/0x30
> Jan  1 06:41:45 black kernel: [1930661.475946]  ?
> __kmalloc_track_caller+0x16e/0x260
> Jan  1 06:41:45 black kernel: [1930661.475980]  ?
> btrfs_ioctl+0x82f/0x2e10 [btrfs]
> Jan  1 06:41:45 black kernel: [1930661.475984]  ?
> __check_object_size+0x136/0x147
> Jan  1 06:41:45 black kernel: [1930661.476019]  btrfs_ioctl+0x87a/0x2e10
> [btrfs]
> Jan  1 06:41:45 black kernel: [1930661.476024]  ?
> tomoyo_path_number_perm+0x66/0x1d0
> Jan  1 06:41:45 black kernel: [1930661.476030]  ? do_vfs_ioctl+0x40e/0x670
> Jan  1 06:41:45 black kernel: [1930661.476033]  do_vfs_ioctl+0x40e/0x670
> Jan  1 06:41:45 black kernel: [1930661.476036]  ?
> create_task_io_context+0x95/0x100
> Jan  1 06:41:45 black kernel: [1930661.476040]  ksys_ioctl+0x5e/0x90
> Jan  1 06:41:45 black kernel: [1930661.476044]  __x64_sys_ioctl+0x16/0x20
> Jan  1 06:41:45 black kernel: [1930661.476048]  do_syscall_64+0x53/0x140
> Jan  1 06:41:45 black kernel: [1930661.476052]
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> Jan  1 06:41:45 black kernel: [1930661.476055] RIP: 0033:0x7ff2eb5b95b7
> Jan  1 06:41:45 black kernel: [1930661.476058] Code: 00 00 90 48 8b 05
> d9 78 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84
> 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b
> 0d a9 78 0c 00 f7 d8 64 89 01 48
> Jan  1 06:41:45 black kernel: [1930661.476061] RSP:
> 002b:00007ff2eb4c6d38 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> Jan  1 06:41:45 black kernel: [1930661.476064] RAX: ffffffffffffffda
> RBX: 000055eeaf2e94b0 RCX: 00007ff2eb5b95b7
> Jan  1 06:41:45 black kernel: [1930661.476066] RDX: 000055eeaf2e94b0
> RSI: 00000000c400941b RDI: 0000000000000003
> Jan  1 06:41:45 black kernel: [1930661.476067] RBP: 0000000000000000
> R08: 00007ff2eb4c7700 R09: 0000000000000000
> Jan  1 06:41:45 black kernel: [1930661.476069] R10: 00007ff2eb4c7700
> R11: 0000000000000246 R12: 00007ffc4cfa511e
> Jan  1 06:41:45 black kernel: [1930661.476071] R13: 00007ffc4cfa511f
> R14: 00007ff2eb4c7700 R15: 00007ff2eb4c6e40
> Jan  1 06:41:45 black kernel: [1930661.476075] ---[ end trace
> 6429c1bf293fecb8 ]---
> Jan  1 06:41:45 black kernel: [1930661.476079] BTRFS: error (device
> sdc3) in btrfs_create_pending_block_groups:8247: errno=-27 unknown
> Jan  1 06:41:45 black kernel: [1930661.476082] BTRFS info (device sdc3):
> forced readonly
> Jan  1 06:41:45 black kernel: [1930661.489816] BTRFS warning (device
> sdc3): failed setting block group ro: -30
> Jan  1 06:41:45 black kernel: [1930661.489821] BTRFS info (device sdc3):
> scrub: not finished on devid 1 with status: -30
> Jan  1 06:41:52 black kernel: [1930668.052295] BTRFS warning (device
> sdc3): failed setting block group ro: -30
> Jan  1 06:41:52 black kernel: [1930668.052301] BTRFS info (device sdc3):
> scrub: not finished on devid 4 with status: -30
> Jan  1 06:51:56 black kernel: [1931271.801468] BTRFS error (device
> sdc3): parent transid verify failed on 16216583520256 wanted 301800
> found 301756
> Jan  1 06:51:56 black kernel: [1931271.822215] BTRFS error (device
> sdc3): parent transid verify failed on 16216583520256 wanted 301800
> found 301756
> Jan  1 06:51:57 black kernel: [1931273.492798] BTRFS error (device
> sdc3): parent transid verify failed on 16216583520256 wanted 301800
> found 301756
> Jan  1 06:51:57 black kernel: [1931273.493041] BTRFS error (device
> sdc3): parent transid verify failed on 16216583520256 wanted 301800
> found 301756
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 520 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Interrupted and resumed scrubs seem to have caused filesystem to go readonly (EFBIG error)
  2020-01-02  1:26 ` Qu Wenruo
@ 2020-01-02 12:07   ` Graham Cobb
  2020-01-02 12:34     ` Qu Wenruo
  0 siblings, 1 reply; 5+ messages in thread
From: Graham Cobb @ 2020-01-02 12:07 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 02/01/2020 01:26, Qu Wenruo wrote:
> 
> 
> On 2020/1/2 上午7:35, Graham Cobb wrote:
>> I have a problem on one BTRFS filesystem. It is not a critical
>> filesystem (it is used for backups) and I have not yet tried even
>> unmounting and remounting, let alone a "btrfs check".
>>
>> The problem seems to be that after several iterations of running 'btrfs
>> scrub' for 30 minutes, then pausing for a while, then resuming the
>> scrub, I got a transaction aborted with an EFBIG error and a warning in
>> the kernel log. The fs went readonly, and transid verify errors are now
>> reported. The original log extract is available at
>> http://www.cobb.uk.net/kern.log.bug-010120 but I have pasted the key
>> part below.
> 
> EFBIG in btrfs is very rare, and can only be caused by too many system
> chunks.
> 
> The most common reason is the chunk pre-alllocation for scrub, which
> also matches your situation.
> 
> There is already a fix for it, and will land in v5.5 kernel.
> It looks like we should backport it.

Thanks Qu. I will wait for that kernel, and maybe stop my monthly scrubs
(although my several other btrfs filesystems did not have a problem this
month fortunately).

I am getting transid errors:

>> Jan  1 06:51:56 black kernel: [1931271.801468] BTRFS error (device
>> sdc3): parent transid verify failed on 16216583520256 wanted 301800
>> found 301756

I presume 301800 is the transaction which failed and caused the fs to go
readonly. I don't suppose it is likely I could revert the whole fs to
the state of the last successful transaction is there?

It is not a big problem: the fs only contains backup snapshots (not my
only backups!) although it would be nice to recover the historical
snapshots if I could (I used them to research a bug I reported to debian
just the other day!).

Regards
Graham

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Interrupted and resumed scrubs seem to have caused filesystem to go readonly (EFBIG error)
  2020-01-02 12:07   ` Graham Cobb
@ 2020-01-02 12:34     ` Qu Wenruo
  2020-01-04 10:46       ` Graham Cobb
  0 siblings, 1 reply; 5+ messages in thread
From: Qu Wenruo @ 2020-01-02 12:34 UTC (permalink / raw)
  To: Graham Cobb, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 2899 bytes --]



On 2020/1/2 下午8:07, Graham Cobb wrote:
> On 02/01/2020 01:26, Qu Wenruo wrote:
>>
>>
>> On 2020/1/2 上午7:35, Graham Cobb wrote:
>>> I have a problem on one BTRFS filesystem. It is not a critical
>>> filesystem (it is used for backups) and I have not yet tried even
>>> unmounting and remounting, let alone a "btrfs check".
>>>
>>> The problem seems to be that after several iterations of running 'btrfs
>>> scrub' for 30 minutes, then pausing for a while, then resuming the
>>> scrub, I got a transaction aborted with an EFBIG error and a warning in
>>> the kernel log. The fs went readonly, and transid verify errors are now
>>> reported. The original log extract is available at
>>> http://www.cobb.uk.net/kern.log.bug-010120 but I have pasted the key
>>> part below.
>>
>> EFBIG in btrfs is very rare, and can only be caused by too many system
>> chunks.
>>
>> The most common reason is the chunk pre-alllocation for scrub, which
>> also matches your situation.
>>
>> There is already a fix for it, and will land in v5.5 kernel.
>> It looks like we should backport it.
> 
> Thanks Qu. I will wait for that kernel, and maybe stop my monthly scrubs
> (although my several other btrfs filesystems did not have a problem this
> month fortunately).

And the problem will normally not impact the fs, as newly created empty
system chunks will be soon cleaned up.

> 
> I am getting transid errors:

This is not a good news. And in fact it's normally a deadly problem.

> 
>>> Jan  1 06:51:56 black kernel: [1931271.801468] BTRFS error (device
>>> sdc3): parent transid verify failed on 16216583520256 wanted 301800
>>> found 301756
> 
> I presume 301800 is the transaction which failed and caused the fs to go
> readonly. I don't suppose it is likely I could revert the whole fs to
> the state of the last successful transaction is there?

This means some tree blocks doesn't reach disk.
It can be deadly, or just a side effect caused by the transaction abort.

> 
> It is not a big problem: the fs only contains backup snapshots (not my
> only backups!) although it would be nice to recover the historical
> snapshots if I could (I used them to research a bug I reported to debian
> just the other day!).

I'm afraid this depends on where the corruption is.

If it's just caused by that EFBIG error, and btrfs check reports no
error, then it's just temporary problem caused by transaction abort.


If it's in extent tree, it only affects mount or certain write
operations, but if you can mount the fs, it should be OK to read out the
whole fs.

If it's in csum tree, it will affect certain data read, other than
mostly OK.

If it's in subvolume trees, some directories/files can't be accessed.

So, please run a btrfs check on the unmounted fs to verify what's the case.

Thanks,
Qu

> 
> Regards
> Graham
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 520 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Interrupted and resumed scrubs seem to have caused filesystem to go readonly (EFBIG error)
  2020-01-02 12:34     ` Qu Wenruo
@ 2020-01-04 10:46       ` Graham Cobb
  0 siblings, 0 replies; 5+ messages in thread
From: Graham Cobb @ 2020-01-04 10:46 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 02/01/2020 12:34, Qu Wenruo wrote:
> 
> 
> On 2020/1/2 下午8:07, Graham Cobb wrote:
>> On 02/01/2020 01:26, Qu Wenruo wrote:
>>>
>>>
>>> On 2020/1/2 上午7:35, Graham Cobb wrote:
>>>> I have a problem on one BTRFS filesystem. It is not a critical
>>>> filesystem (it is used for backups) and I have not yet tried even
>>>> unmounting and remounting, let alone a "btrfs check".
>>>>
>>>> The problem seems to be that after several iterations of running 'btrfs
>>>> scrub' for 30 minutes, then pausing for a while, then resuming the
>>>> scrub, I got a transaction aborted with an EFBIG error and a warning in
>>>> the kernel log. The fs went readonly, and transid verify errors are now
>>>> reported. The original log extract is available at
>>>> http://www.cobb.uk.net/kern.log.bug-010120 but I have pasted the key
>>>> part below.
>>>
>>> EFBIG in btrfs is very rare, and can only be caused by too many system
>>> chunks.
>>>
>>> The most common reason is the chunk pre-alllocation for scrub, which
>>> also matches your situation.
>>>
>>> There is already a fix for it, and will land in v5.5 kernel.
>>> It looks like we should backport it.
>>
>> Thanks Qu. I will wait for that kernel, and maybe stop my monthly scrubs
>> (although my several other btrfs filesystems did not have a problem this
>> month fortunately).
> 
> And the problem will normally not impact the fs, as newly created empty
> system chunks will be soon cleaned up.
> 
>>
>> I am getting transid errors:
> 
> This is not a good news. And in fact it's normally a deadly problem.

In fact, this was not a real problem: the errors were because the
filesystem was still mounted from the original error and had gone ro so
I guess the in-memory state was different from the on-disk state.  Doh!

A simple umount and mount worked fine, although I then did a btrfs check
which also worked fine:

black:~# btrfs check --readonly -p /dev/sdc3
Opening filesystem to check...
Checking filesystem on /dev/sdc3
UUID: 4d1ba5af-8b89-4cb5-96c6-55d1f028a202
[1/7] checking root items                      (0:06:27 elapsed,
25179174 items checked)
[2/7] checking extents                         (6:34:26 elapsed, 2419791
items checked)
cache and super generation don't match, space cache will be invalidated
[3/7] checking free space tree                 (0:00:00 elapsed)
[4/7] checking fs roots                        (25:44:17 elapsed,
1497725 items checked)
[5/7] checking csums (without verifying data)  (0:54:36 elapsed, 4812627
items checked)
[6/7] checking root refs                       (0:00:00 elapsed, 1067
items checked)
[7/7] checking quota groups skipped (not enabled on this FS)
found 11946687545430 bytes used, no error found
total csum bytes: 11626743024
total tree bytes: 39628275712
total fs tree bytes: 24636817408
total extent tree bytes: 2363850752
btree space waste bytes: 5422658757
file data blocks allocated: 29159815589888
 referenced 16100593688576

Thanks again for the help, and for the design which prevented fs
corruption in this case.

I would encourage you to consider backporting the fix for the original
EFBIG problem, as you suggested above.

Graham

> 
>>
>>>> Jan  1 06:51:56 black kernel: [1931271.801468] BTRFS error (device
>>>> sdc3): parent transid verify failed on 16216583520256 wanted 301800
>>>> found 301756
>>
>> I presume 301800 is the transaction which failed and caused the fs to go
>> readonly. I don't suppose it is likely I could revert the whole fs to
>> the state of the last successful transaction is there?
> 
> This means some tree blocks doesn't reach disk.
> It can be deadly, or just a side effect caused by the transaction abort.
> 
>>
>> It is not a big problem: the fs only contains backup snapshots (not my
>> only backups!) although it would be nice to recover the historical
>> snapshots if I could (I used them to research a bug I reported to debian
>> just the other day!).
> 
> I'm afraid this depends on where the corruption is.
> 
> If it's just caused by that EFBIG error, and btrfs check reports no
> error, then it's just temporary problem caused by transaction abort.
> 
> 
> If it's in extent tree, it only affects mount or certain write
> operations, but if you can mount the fs, it should be OK to read out the
> whole fs.
> 
> If it's in csum tree, it will affect certain data read, other than
> mostly OK.
> 
> If it's in subvolume trees, some directories/files can't be accessed.
> 
> So, please run a btrfs check on the unmounted fs to verify what's the case.
> 
> Thanks,
> Qu
> 
>>
>> Regards
>> Graham
>>
> 


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2020-01-04 10:51 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-01 23:35 Interrupted and resumed scrubs seem to have caused filesystem to go readonly (EFBIG error) Graham Cobb
2020-01-02  1:26 ` Qu Wenruo
2020-01-02 12:07   ` Graham Cobb
2020-01-02 12:34     ` Qu Wenruo
2020-01-04 10:46       ` Graham Cobb

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.