* BUG at mount time on v4.8.10
@ 2017-01-06  4:36 Petr Janecek
  2017-01-06  8:04 ` Duncan
  2017-01-11 13:07 ` BUG at mount time on v4.8.10 (and 4.9.2) Petr Janecek
  0 siblings, 2 replies; 3+ messages in thread
From: Petr Janecek @ 2017-01-06  4:36 UTC
  To: linux-btrfs

Hello,
      I just got a BUG on mount of a raid10 fs. /dev/sde was added to
the fs recently and balance has been started. After reboot (balance
still running), the fs can not be mounted any more.

# btrfs fi sh
Label: 'BTR0'  uuid: 0ec83db3-4574-4e40-8d57-ebbe9fe246e1
	Total devices 5 FS bytes used 5.45TiB
	devid    1 size 2.73TiB used 2.64TiB path /dev/sdk
	devid    2 size 2.73TiB used 2.64TiB path /dev/sdj
	devid    3 size 2.73TiB used 2.64TiB path /dev/sda
	devid    4 size 2.73TiB used 2.64TiB path /dev/sdb
	devid    5 size 2.73TiB used 356.03GiB path /dev/sde


[ 1291.115237] BTRFS info (device sdb): disk space caching is enabled
[ 1291.121456] BTRFS info (device sdb): has skinny extents
[ 1380.872569] BUG: unable to handle kernel paging request at fffffffffffffd60
[ 1380.879592] IP: [<ffffffffc045cf6f>] qgroup_fix_relocated_data_extents+0x1f/0x2a0 [btrfs]
[ 1380.887822] PGD 2fb807067 PUD 2fb809067 PMD 0 
[ 1380.892349] Oops: 0000 [#1] SMP
[ 1380.895497] Modules linked in: nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc xfs libcrc32c crc32c_generic x86_pkg_temp_thermal coretemp kvm_intel iTCO_wdt kvm iTCO_vendor_support irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 ablk_helper cryptd lrw gf128mul glue_helper i2c_i801 pcspkr i2c_smbus loop ipmi_watchdog raid10 acpi_cpufreq mei_me tpm_tis tpm_tis_core mei md_mod evdev battery video tpm acpi_power_meter ie31200_edac button shpchp edac_core processor ipmi_si ipmi_poweroff ipmi_devintf ipmi_msghandler autofs4 btrfs xor raid6_pq sg hid_generic usbhid hid sd_mod xhci_pci igb ahci xhci_hcd i2c_algo_bit i2c_core libahci dca mpt3sas raid_class libata ptp scsi_transport_sas usbcore crc32c_intel pps_core usb_common scsi_mod fan thermal
[ 1380.966236] CPU: 0 PID: 1214 Comm: mount Not tainted 4.8.10 #9
[ 1380.972075] Hardware name: Supermicro Super Server/X11SSL-CF, BIOS 1.0a 01/29/2016
[ 1380.979652] task: ffff9587d6274100 task.stack: ffff9587c5654000
[ 1380.985579] RIP: 0010:[<ffffffffc045cf6f>]  [<ffffffffc045cf6f>] qgroup_fix_relocated_data_extents+0x1f/0x2a0 [btrfs]
[ 1380.996263] RSP: 0018:ffff9587c5657a00  EFLAGS: 00010246
[ 1381.001601] RAX: 0000000000000000 RBX: ffff9587caab1b60 RCX: 0000000000000000
[ 1381.008763] RDX: ffff9587cda38270 RSI: ffff9587c17c8000 RDI: ffff9587cda381e0
[ 1381.015922] RBP: ffff9587c5658000 R08: 0000000000000000 R09: ffff9587cda381e0
[ 1381.023082] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9587c5657a98
[ 1381.030241] R13: 0000000000000000 R14: ffff9587c17c8000 R15: 00000000cda381e0
[ 1381.037407] FS:  00007fad760e9840(0000) GS:ffff9587f7800000(0000) knlGS:0000000000000000
[ 1381.045540] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1381.051316] CR2: fffffffffffffd60 CR3: 0000000845623000 CR4: 00000000003406f0
[ 1381.058478] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1381.065638] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1381.072796] Stack:
[ 1381.074831]  0000000000000000 ffff9587cda381e0 0000000000000801 ffff9587cdf07000
[ 1381.082360]  ffff9587cda381e0 0000000000000801 ffffffffc040a834 ffff9587d6274100
[ 1381.089890]  0000000000000000 0000000000000246 ffff9587caab1b60 ffff9587d1170000
[ 1381.097423] Call Trace:
[ 1381.099908]  [<ffffffffc040a834>] ? start_transaction+0x94/0x4c0 [btrfs]
[ 1381.106658]  [<ffffffffc0460b03>] ? btrfs_recover_relocation+0x3b3/0x440 [btrfs]
[ 1381.114114]  [<ffffffffc0407384>] ? open_ctree+0x2164/0x2620 [btrfs]
[ 1381.120513]  [<ffffffffc03db44a>] ? btrfs_mount+0xcfa/0xe30 [btrfs]
[ 1381.126810]  [<ffffffffa6149ec7>] ? pcpu_alloc_area+0x2a7/0x3d0
[ 1381.132755]  [<ffffffffa64e942f>] ? __mutex_unlock_slowpath+0x9f/0x130
[ 1381.139309]  [<ffffffffa614ab13>] ? pcpu_alloc+0x323/0x620
[ 1381.144822]  [<ffffffffa6197691>] ? mount_fs+0x31/0x160
[ 1381.150081]  [<ffffffffa60921c7>] ? __init_waitqueue_head+0x17/0x30
[ 1381.156375]  [<ffffffffa61b201d>] ? vfs_kern_mount+0x5d/0x110
[ 1381.162150]  [<ffffffffc03da8f4>] ? btrfs_mount+0x1a4/0xe30 [btrfs]
[ 1381.168447]  [<ffffffffa614ab13>] ? pcpu_alloc+0x323/0x620
[ 1381.173956]  [<ffffffffa6197691>] ? mount_fs+0x31/0x160
[ 1381.179209]  [<ffffffffa60921c7>] ? __init_waitqueue_head+0x17/0x30
[ 1381.185503]  [<ffffffffa61b201d>] ? vfs_kern_mount+0x5d/0x110
[ 1381.191275]  [<ffffffffa61b4a54>] ? do_mount+0x1b4/0xc30
[ 1381.196613]  [<ffffffffa6146698>] ? memdup_user+0x38/0x60
[ 1381.202040]  [<ffffffffa61b57af>] ? SyS_mount+0x7f/0xc0
[ 1381.207292]  [<ffffffffa6001b68>] ? do_syscall_64+0x68/0x220
[ 1381.212977]  [<ffffffffa64eb9bc>] ? entry_SYSCALL64_slow_path+0x25/0x25
[ 1381.219617] Code: 00 00 5b 5d c3 0f 0b 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 53 48 83 ec 50 48 8b 46 08 4c 8b 6e 10 48 8b a8 f0 01 00 00 31 c0 <4d> 8b a5 60 fd ff ff f6 85 f0 10 00 00 01 74 09 80 be d8 05 00 
[ 1381.240041] RIP  [<ffffffffc045cf6f>] qgroup_fix_relocated_data_extents+0x1f/0x2a0 [btrfs]
[ 1381.248385]  RSP <ffff9587c5657a00>
[ 1381.251897] CR2: fffffffffffffd60
[ 1381.256043] ---[ end trace 97955e79aea6d1af ]---



Regards

Petr


* Re: BUG at mount time on v4.8.10
  2017-01-06  4:36 BUG at mount time on v4.8.10 Petr Janecek
@ 2017-01-06  8:04 ` Duncan
  2017-01-11 13:07 ` BUG at mount time on v4.8.10 (and 4.9.2) Petr Janecek
  1 sibling, 0 replies; 3+ messages in thread
From: Duncan @ 2017-01-06  8:04 UTC
  To: linux-btrfs

Petr Janecek posted on Fri, 06 Jan 2017 05:36:01 +0100 as excerpted:

> I just got a BUG on mount of a raid10 fs. /dev/sde was added to
> the fs recently and balance has been started. After reboot (balance
> still running), the fs can not be mounted any more.

Try the skip_balance mount option (as described on the wiki or in the 
btrfs(5) manpage).

If it's a balance-related bug, that should avoid triggering it, although 
obviously the (meta?)data that triggered it is still there.  But it 
should at least let you mount the filesystem and freshen your backups 
before you try potentially risky repair options.

Additionally, once mounted, you can run btrfs balance cancel to kill the 
balance more permanently, so it doesn't try to resume on the next mount 
if skip_balance isn't given again.

> # btrfs fi sh
> Label: 'BTR0'  uuid: 0ec83db3-4574-4e40-8d57-ebbe9fe246e1
> 	Total devices 5 FS bytes used 5.45TiB
> 	devid    1 size 2.73TiB used 2.64TiB path /dev/sdk
> 	devid    2 size 2.73TiB used 2.64TiB path /dev/sdj
> 	devid    3 size 2.73TiB used 2.64TiB path /dev/sda
> 	devid    4 size 2.73TiB used 2.64TiB path /dev/sdb
> 	devid    5 size 2.73TiB used 356.03GiB path /dev/sde

I'm just a user and list regular, not a dev, so dumps such as the below 
don't mean much to me.  Often, about the only thing useful I can pick out 
of them is the kernel version (which matches what you provided in the 
subject, 4.8.10), but in this case, there's something additional...

> [ 1380.872569] BUG: unable to handle kernel paging request at
> fffffffffffffd60
> [ 1380.879592] IP: [<ffffffffc045cf6f>]
> qgroup_fix_relocated_data_extents+0x1f/0x2a0 [btrfs]

qgroup?  You're using btrfs quotas?  

As the wiki suggests, btrfs quota code isn't particularly stable yet, and 
has been the source of numerous bugs.  It remains under intense bug-
squashing focus, and in general, my recommendation remains don't use it 
unless you're specifically working with the devs on finding and fixing 
those quota-related bugs.

Basically, your quota use-case falls into one of two categories.  Either 
you don't need the functionality and are best served by turning it off, 
since by doing so you'll avoid the bugs it brings with it, or you really 
do need the quota functionality, and are best served by running a 
filesystem where the quota code is mature and well tested -- where it 
actually works, including corner-cases, the way it's supposed to work.

Thus, assuming mount with skip_balance works, I'd first cancel the 
balance.  Then I'd remount read-only to prevent further damage while I 
freshened my backups, just in case.

(If you don't have the resources to do backups and aren't running with 
data you can afford to lose, btrfs isn't the filesystem for you; choose a 
filesystem that's more mature and stable.  And... strongly consider doing 
backups even on fully stable filesystems, because as any sysadmin worth 
the label will tell you, the value of your data is defined by the number 
of backups you consider it worth having.  With no backups, that value is 
throw-away, whatever claims are made to the contrary.)

Then after the backups are freshened, remount writable again, and if you 
don't need quotas, disable the quota functionality.

Then with your backups freshened and quota functionality turned off, try 
the balance again.  With luck the problem was limited to the quota code 
and with that off the balance will go just fine.  Many have reported that 
it goes faster as well: balance runs quite a section of quota code that 
doesn't scale as well as one might hope, and all of it is skipped when 
quotas are off.

Of course if you actually need those quotas for your use-case, that won't 
work so well.  But as I suggested above, if you actually need quotas, 
you're best served by a filesystem where they're stable and work as 
intended.  In that case, after your backups are freshened you'll probably 
be doing a mkfs to some other filesystem, instead of turning quotas off 
and retrying the balance on the existing filesystem.
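
As a rough sketch of the order of operations suggested above (a dry run 
that only prints commands, not something tested against this filesystem), 
the steps could be laid out as below.  The mount point /mnt/btr0 and the 
function name recovery_plan are made up for illustration; /dev/sdb is the 
device named in the log.  The backup step is deliberately left as a comment.

```python
import shlex


def recovery_plan(device, mountpoint, need_quotas=False):
    """Return the suggested steps as argv lists. Nothing is executed here."""
    plan = [
        # Mount without resuming the interrupted balance.
        ["mount", "-o", "skip_balance", device, mountpoint],
        # Cancel the balance so it does not resume on a later mount.
        ["btrfs", "balance", "cancel", mountpoint],
        # Go read-only while the backups are freshened (backup step not shown).
        ["mount", "-o", "remount,ro", mountpoint],
    ]
    if not need_quotas:
        plan += [
            ["mount", "-o", "remount,rw", mountpoint],
            ["btrfs", "quota", "disable", mountpoint],
            ["btrfs", "balance", "start", mountpoint],
        ]
    return plan


if __name__ == "__main__":
    # Hypothetical mount point; adjust to the real one before using any of this.
    for argv in recovery_plan("/dev/sdb", "/mnt/btr0"):
        print(shlex.join(argv))
```

With need_quotas=True the plan stops after the read-only remount, which 
matches the "back up, then mkfs to something else" case.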

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: BUG at mount time on v4.8.10 (and 4.9.2)
  2017-01-06  4:36 BUG at mount time on v4.8.10 Petr Janecek
  2017-01-06  8:04 ` Duncan
@ 2017-01-11 13:07 ` Petr Janecek
  1 sibling, 0 replies; 3+ messages in thread
From: Petr Janecek @ 2017-01-11 13:07 UTC
  To: linux-btrfs

Hello,

> > I just got a BUG on mount of a raid10 fs. /dev/sde was added to
> > the fs recently and balance has been started. After reboot (balance
> > still running), the fs can not be mounted any more.
> 
> 
> Try the skip_balance mount option (as described on the wiki or in the 
> btrfs (5) manpage).

  Unfortunately, I get almost the same trace with "-o skip_balance".
The same thing happens on 4.9.2. Mounting "ro,skip_balance" works, so I can
recover the data. But balance was supposed to be working, and running a
balance after adding a device to an almost full fs is imho exactly what one
would do.
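
For scale, "almost full" can be put into numbers from the `btrfs fi sh`
output in the first mail. This is only a small sketch: the parsing regex
and the names FI_SHOW, to_gib and device_usage are illustrative, and the
sizes are the rounded values btrfs printed, so the percentages are
approximate.

```python
import re

# Device lines copied from the `btrfs fi sh` output earlier in the thread.
FI_SHOW = """\
devid    1 size 2.73TiB used 2.64TiB path /dev/sdk
devid    2 size 2.73TiB used 2.64TiB path /dev/sdj
devid    3 size 2.73TiB used 2.64TiB path /dev/sda
devid    4 size 2.73TiB used 2.64TiB path /dev/sdb
devid    5 size 2.73TiB used 356.03GiB path /dev/sde
"""

GIB_PER_UNIT = {"GiB": 1.0, "TiB": 1024.0}


def to_gib(text):
    """Convert a size such as '2.73TiB' or '356.03GiB' to GiB."""
    match = re.fullmatch(r"([\d.]+)(GiB|TiB)", text)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return float(match.group(1)) * GIB_PER_UNIT[match.group(2)]


def device_usage(report):
    """Map each device path to the fraction of its size that is allocated."""
    usage = {}
    pattern = r"devid\s+\d+\s+size\s+(\S+)\s+used\s+(\S+)\s+path\s+(\S+)"
    for line in report.splitlines():
        match = re.search(pattern, line)
        if match:
            size, used, path = match.groups()
            usage[path] = to_gib(used) / to_gib(size)
    return usage


if __name__ == "__main__":
    for path, fraction in device_usage(FI_SHOW).items():
        print(f"{path}: {fraction:.0%} allocated")
```

The four original devices come out at about 97% allocated and the newly
added /dev/sde at about 13%.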

[...]
> I'm just a user and list regular, not a dev, so dumps such as the below 
> don't mean much to me.  Often, about the only thing useful I can pick out 
> of them is the kernel version (which matches what you provided in the 
> subject, 4.8.10), but in this case, there's something additional...
> 
> > [ 1380.872569] BUG: unable to handle kernel paging request at
> > fffffffffffffd60
> > [ 1380.879592] IP: [<ffffffffc045cf6f>]
> > qgroup_fix_relocated_data_extents+0x1f/0x2a0 [btrfs]
> 
> qgroup?  You're using btrfs quotas?  

  No, I confirmed that after the successful readonly mount.
From btrfs_recover_relocation(), qgroup_fix_relocated_data_extents() is
called unconditionally, unless one of the earlier error paths is taken.
Maybe that is the problem? The log from 4.9.2 is below.
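
A side note on the fault address, as a sketch only: read as a signed
64-bit value, fffffffffffffd60 is a small negative number, which is what a
negative offset applied to a zero (NULL) base pointer would give. Both
register dumps do contain a zeroed register (R13 on 4.8.10, R15 on 4.9.2);
that this register is the base of the faulting load is my own reading of
the Code: bytes and has not been confirmed by a developer. The helper name
as_signed64 is made up for illustration.

```python
def as_signed64(address):
    """Interpret a 64-bit address as a two's-complement signed integer."""
    address &= (1 << 64) - 1
    if address >= (1 << 63):
        return address - (1 << 64)
    return address


# CR2 value reported in both the 4.8.10 and the 4.9.2 oops.
FAULT_ADDRESS = 0xFFFFFFFFFFFFFD60

if __name__ == "__main__":
    print(hex(as_signed64(FAULT_ADDRESS)))  # prints -0x2a0
```

So the access was 0x2a0 (672) bytes below address zero, which fits a
structure pointer derived from a NULL member pointer better than it fits
random corruption.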


Regards

Petr



[  135.044216] BUG: unable to handle kernel paging request at fffffffffffffd60
[  135.051531] IP: [<ffffffffc06ef0ef>] qgroup_fix_relocated_data_extents+0x1f/0x2b0 [btrfs]
[  135.059953] PGD 1d5809067
[  135.062597] PUD 1d580b067 PMD 0
[  135.066153]
[  135.067858] Oops: 0000 [#1] SMP
[  135.071109] Modules linked in: nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc xfs libcrc32c crc32c_generic x86_pkg_temp_thermal coretemp kvm_intel kvm irqbypass crc32_pclmul iTCO_wdt ghash_clmulni_intel iTCO_vendor_support aesni_intel aes_x86_64 ablk_helper cryptd lrw loop mei_me i2c_i801 gf128mul ipmi_watchdog glue_helper pcspkr i2c_smbus mei raid10 acpi_cpufreq tpm_tis ie31200_edac tpm_tis_core md_mod evdev tpm battery video shpchp acpi_power_meter edac_core button processor ipmi_si ipmi_poweroff ipmi_devintf ipmi_msghandler autofs4 btrfs xor raid6_pq sg sd_mod hid_generic usbhid hid xhci_pci ahci igb xhci_hcd libahci i2c_algo_bit i2c_core mpt3sas dca raid_class usbcore libata scsi_transport_sas ptp crc32c_intel pps_core usb_common scsi_mod fan thermal
[  135.149994] CPU: 5 PID: 1180 Comm: mount Not tainted 4.9.2 #11
[  135.155932] Hardware name: Supermicro Super Server/X11SSL-CF, BIOS 1.0a 01/29/2016
[  135.163606] task: ffff97b615f271c0 task.stack: ffffb90f03db4000
[  135.169653] RIP: 0010:[<ffffffffc06ef0ef>]  [<ffffffffc06ef0ef>] qgroup_fix_relocated_data_extents+0x1f/0x2b0 [btrfs]
[  135.180548] RSP: 0018:ffffb90f03db7a00  EFLAGS: 00010246
[  135.185981] RAX: 0000000000000000 RBX: ffff97b6162be000 RCX: 0000000000007ce5
[  135.193239] RDX: ffff97b60c07aa90 RSI: ffff97b60703e000 RDI: ffff97b60c07aa00
[  135.200501] RBP: ffff97b60ae46000 R08: ffff97b6162be000 R09: ffff97b60c07aa00
[  135.207764] R10: 0000000000000000 R11: 0000000000000001 R12: ffffb90f03db7a98
[  135.215027] R13: ffff97b60c07aa00 R14: ffff97b60703e000 R15: 0000000000000000
[  135.222292] FS:  00007f748468d840(0000) GS:ffff97b637940000(0000) knlGS:0000000000000000
[  135.230525] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  135.236399] CR2: fffffffffffffd60 CR3: 000000085651b000 CR4: 00000000003406e0
[  135.243654] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  135.250909] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  135.258171] Stack:
[  135.260304]  0000000000000000 ffff97b60c07aa00 0000000000000801 ffff97b60ae80000
[  135.268309]  ffff97b60c07aa00 0000000000000801 ffffffffc069ca08 ffff97b615f271c0
[  135.276317]  0000000000000000 0000000000000246 ffff97b614ca72a0 ffff97b60ae46000
[  135.284335] Call Trace:
[  135.286917]  [<ffffffffc069ca08>] ? start_transaction+0x98/0x4a0 [btrfs]
[  135.293762]  [<ffffffffc06f2cc3>] ? btrfs_recover_relocation+0x3b3/0x440 [btrfs]
[  135.301321]  [<ffffffffc06995aa>] ? open_ctree+0x214a/0x2600 [btrfs]
[  135.307817]  [<ffffffffc066d41b>] ? btrfs_mount+0xd0b/0xe40 [btrfs]
[  135.314215]  [<ffffffff8314c7d7>] ? pcpu_alloc_area+0x2a7/0x3d0
[  135.320265]  [<ffffffff834f373f>] ? __mutex_unlock_slowpath+0x9f/0x130
[  135.326921]  [<ffffffff8314d423>] ? pcpu_alloc+0x323/0x620
[  135.332535]  [<ffffffff83199fb1>] ? mount_fs+0x31/0x160
[  135.337884]  [<ffffffff83094047>] ? __init_waitqueue_head+0x17/0x30
[  135.344281]  [<ffffffff831b4c9d>] ? vfs_kern_mount+0x5d/0x110
[  135.350153]  [<ffffffffc066c8b4>] ? btrfs_mount+0x1a4/0xe40 [btrfs]
[  135.356545]  [<ffffffff8314d423>] ? pcpu_alloc+0x323/0x620
[  135.362151]  [<ffffffff83199fb1>] ? mount_fs+0x31/0x160
[  135.367499]  [<ffffffff83094047>] ? __init_waitqueue_head+0x17/0x30
[  135.373886]  [<ffffffff831b4c9d>] ? vfs_kern_mount+0x5d/0x110
[  135.379755]  [<ffffffff831b7814>] ? do_mount+0x1b4/0xc30
[  135.385190]  [<ffffffff831b857f>] ? SyS_mount+0x7f/0xc0
[  135.390535]  [<ffffffff83001b5a>] ? do_syscall_64+0x6a/0x240
[  135.396320]  [<ffffffff834f6246>] ? entry_SYSCALL64_slow_path+0x25/0x25
[  135.403062] Code: 00 00 5b 5d c3 0f 0b 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 53 48 83 ec 50 48 8b 46 08 4c 8b 7e 10 48 8b 98 f0 01 00 00 31 c0 <49> 8b af 60 fd ff ff 48 8b 53 20 83 e2 40 74 09 80 be d8 05 00 
[  135.429926] RIP  [<ffffffffc06ef0ef>] qgroup_fix_relocated_data_extents+0x1f/0x2b0 [btrfs]
[  135.438474]  RSP <ffffb90f03db7a00>
[  135.442082] CR2: fffffffffffffd60
[  135.445526] ---[ end trace 603287b5bf87e6dd ]---

