* Filesystem forced to readonly after use
@ 2016-09-13 19:20 Cesar Strauss
  2016-09-13 19:39 ` Chris Murphy
  2016-09-13 19:49 ` Austin S. Hemmelgarn
  0 siblings, 2 replies; 8+ messages in thread
From: Cesar Strauss @ 2016-09-13 19:20 UTC
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 1198 bytes --]

Hello,

I have a BTRFS filesystem that is reverting to read-only after a few 
moments of use. There is a stack trace visible in the kernel log, which 
is attached.

Here is my system information:

# uname -a

Linux rescue 4.7.2-1-ARCH #1 SMP PREEMPT Sat Aug 20 23:02:56 CEST 2016 
x86_64 GNU/Linux

# btrfs --version

btrfs-progs v4.7

# btrfs fi show

Label: 'linux'  uuid: 79862c20-d0b0-4ffa-a9af-e3a40868a243
         Total devices 1 FS bytes used 284.60GiB
         devid    1 size 300.03GiB used 300.03GiB path /dev/sdb5

# btrfs fi df /mnt

Data, single: total=278.00GiB, used=274.68GiB
System, DUP: total=8.00MiB, used=64.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=11.00GiB, used=9.92GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=0.00B

As soon as the problem started, I saw that the Metadata, DUP was 
completely used. It became a little better (like above) after a scrub.
I can easily recover disk space by removing old snapshots, if needed.

The dmesg output is attached.

Before making further recovery attempts, or even restoring from backup, 
I would like to ask for the best option to proceed.

Thanks,

Cesar


[-- Attachment #2: dmesg.log --]
[-- Type: text/x-log, Size: 3571 bytes --]

[20048.035688] BTRFS info (device sdb5): disk space caching is enabled
[20190.871802] BTRFS error (device sdb5): parent transid verify failed on 160420773888 wanted 181826 found 181573
[20190.882573] BTRFS error (device sdb5): parent transid verify failed on 160420773888 wanted 181826 found 181573
[20190.882607] ------------[ cut here ]------------
[20190.882642] WARNING: CPU: 3 PID: 5026 at fs/btrfs/extent-tree.c:2963 btrfs_run_delayed_refs+0x28c/0x2c0 [btrfs]
[20190.882645] BTRFS: Transaction aborted (error -5)
[20190.882648] Modules linked in: hid_generic usbhid hid btrfs xor raid6_pq sr_mod cdrom intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass joydev mousedev crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul dell_wmi dell_laptop amdkfd amd_iommu_v2 glue_helper sparse_keymap dell_smbios uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core videodev radeon media ums_realtek ablk_helper ttm cryptd snd_hda_codec_hdmi arc4 snd_hda_codec_realtek snd_hda_codec_generic dcdbas dell_smm_hwmon iwldvm iTCO_wdt mac80211 iTCO_vendor_support xhci_pci xhci_hcd r8169 snd_hda_intel snd_hda_codec mii iwlwifi evdev i915 input_leds led_class btusb btrtl btbcm btintel bluetooth intel_cstate intel_rapl_perf cfg80211 psmouse pcspkr
[20190.882725]  snd_hda_core mac_hid rfkill snd_hwdep thermal wmi snd_pcm drm_kms_helper snd_timer drm snd soundcore intel_gtt shpchp syscopyarea ahci sysfillrect sysimgblt fb_sys_fops i2c_algo_bit libahci fjes libata button ac battery mei_me video mei i2c_i801 lpc_ich dell_smo8800 tpm_tis tpm sch_fq_codel ip_tables x_tables ext4 crc16 jbd2 mbcache sd_mod uas usb_storage scsi_mod serio_raw atkbd libps2 ehci_pci ehci_hcd usbcore usb_common i8042 serio
[20190.882782] CPU: 3 PID: 5026 Comm: kworker/u16:2 Tainted: G        W       4.7.2-1-ARCH #1
[20190.882785] Hardware name: Dell Inc.          Dell System Vostro 3450/0GG0VM, BIOS A05 05/24/2011
[20190.882814] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
[20190.882818]  0000000000000286 00000000860a9f71 ffff88015420fc90 ffffffff812eb132
[20190.882824]  ffff88015420fce0 0000000000000000 ffff88015420fcd0 ffffffff8107a3ab
[20190.882828]  00000b938efd7800 ffffffffa0c0be15 ffff880085917688 000000000000034c
[20190.882834] Call Trace:
[20190.882842]  [<ffffffff812eb132>] dump_stack+0x63/0x81
[20190.882847]  [<ffffffff8107a3ab>] __warn+0xcb/0xf0
[20190.882852]  [<ffffffff8107a42f>] warn_slowpath_fmt+0x5f/0x80
[20190.882875]  [<ffffffffa0b6dc5c>] btrfs_run_delayed_refs+0x28c/0x2c0 [btrfs]
[20190.882895]  [<ffffffffa0b6dd24>] delayed_ref_async_start+0x94/0xb0 [btrfs]
[20190.882920]  [<ffffffffa0bb8437>] btrfs_scrubparity_helper+0x77/0x350 [btrfs]
[20190.882943]  [<ffffffffa0bb874e>] btrfs_extent_refs_helper+0xe/0x10 [btrfs]
[20190.882948]  [<ffffffff81093615>] process_one_work+0x1e5/0x480
[20190.882953]  [<ffffffff810938f8>] worker_thread+0x48/0x4e0
[20190.882958]  [<ffffffff810938b0>] ? process_one_work+0x480/0x480
[20190.882962]  [<ffffffff810938b0>] ? process_one_work+0x480/0x480
[20190.882968]  [<ffffffff81099598>] kthread+0xd8/0xf0
[20190.882975]  [<ffffffff815de9bf>] ret_from_fork+0x1f/0x40
[20190.882981]  [<ffffffff810994c0>] ? kthread_worker_fn+0x170/0x170
[20190.882985] ---[ end trace 99d6d7ec847d19d4 ]---
[20190.882990] BTRFS: error (device sdb5) in btrfs_run_delayed_refs:2963: errno=-5 IO failure
[20190.882994] BTRFS info (device sdb5): forced readonly
[20295.373706] BTRFS error (device sdb5): cleaner transaction attach returned -30


* Re: Filesystem forced to readonly after use
  2016-09-13 19:20 Filesystem forced to readonly after use Cesar Strauss
@ 2016-09-13 19:39 ` Chris Murphy
  2016-09-13 20:50   ` Cesar Strauss
  2016-09-13 19:49 ` Austin S. Hemmelgarn
  1 sibling, 1 reply; 8+ messages in thread
From: Chris Murphy @ 2016-09-13 19:39 UTC
  To: Cesar Strauss; +Cc: Btrfs BTRFS

I just wouldn't use btrfs check --repair with this version of progs; go
back to v4.6.1 or upgrade to 4.7.2.  You could do an offline check (no
repair) and see if that reveals anything useful for developers. But I
can't tell what's going on from the call trace.



-- 
Chris Murphy


* Re: Filesystem forced to readonly after use
  2016-09-13 19:20 Filesystem forced to readonly after use Cesar Strauss
  2016-09-13 19:39 ` Chris Murphy
@ 2016-09-13 19:49 ` Austin S. Hemmelgarn
  2016-09-13 20:22   ` Chris Murphy
  2016-09-13 20:39   ` Cesar Strauss
  1 sibling, 2 replies; 8+ messages in thread
From: Austin S. Hemmelgarn @ 2016-09-13 19:49 UTC
  To: Cesar Strauss, linux-btrfs

On 2016-09-13 15:20, Cesar Strauss wrote:
> Hello,
>
> I have a BTRFS filesystem that is reverting to read-only after a few
> moments of use. There is a stack trace visible in the kernel log, which
> is attached.
>
> Here is my system information:
>
> # uname -a
>
> Linux rescue 4.7.2-1-ARCH #1 SMP PREEMPT Sat Aug 20 23:02:56 CEST 2016
> x86_64 GNU/Linux
>
> # btrfs --version
>
> btrfs-progs v4.7
It's always good to see people who are staying up-to-date on the kernel 
and userspace :)
>
> # btrfs fi show
>
> Label: 'linux'  uuid: 79862c20-d0b0-4ffa-a9af-e3a40868a243
>         Total devices 1 FS bytes used 284.60GiB
>         devid    1 size 300.03GiB used 300.03GiB path /dev/sdb5
Given this, you're running with the whole device fully allocated by the 
chunk allocator.  This is not a good state to be in for any extended 
period of time on a filesystem which is being written to and modified.
>
> # btrfs fi df /mnt
>
> Data, single: total=278.00GiB, used=274.68GiB
> System, DUP: total=8.00MiB, used=64.00KiB
> System, single: total=4.00MiB, used=0.00B
> Metadata, DUP: total=11.00GiB, used=9.92GiB
> Metadata, single: total=8.00MiB, used=0.00B
> GlobalReserve, single: total=512.00MiB, used=0.00B
But you appear to have a reasonable amount of slack space within the 
chunks themselves.
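
The slack can be read straight off the numbers posted above: `btrfs fi show` reports size 300.03GiB and used 300.03GiB, so nothing is left unallocated, while `btrfs fi df` still shows room inside the existing chunks. A minimal Python sketch of that arithmetic follows; the function names are made up for illustration, and the parser assumes the output format shown in this thread.

```python
import re

# Output of `btrfs fi df /mnt` exactly as posted in this thread.
FI_DF = """\
Data, single: total=278.00GiB, used=274.68GiB
System, DUP: total=8.00MiB, used=64.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=11.00GiB, used=9.92GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=0.00B
"""

UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}

# One line per block-group type and profile, e.g. "Metadata, DUP: total=..., used=...".
LINE = re.compile(
    r"^(?P<kind>\w+), (?P<profile>\w+): "
    r"total=(?P<total>[\d.]+)(?P<tu>[KMGT]?i?B), "
    r"used=(?P<used>[\d.]+)(?P<uu>[KMGT]?i?B)$"
)


def to_bytes(number, unit):
    """Convert a value such as ('9.92', 'GiB') to bytes."""
    return float(number) * UNITS[unit]


def chunk_slack(text):
    """Return {(kind, profile): free bytes inside already-allocated chunks}."""
    slack = {}
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            total = to_bytes(m["total"], m["tu"])
            used = to_bytes(m["used"], m["uu"])
            slack[(m["kind"], m["profile"])] = total - used
    return slack


if __name__ == "__main__":
    for (kind, profile), free in chunk_slack(FI_DF).items():
        print(f"{kind:13s} {profile:6s} {free / 2**30:8.2f} GiB free in chunks")
```

For the figures in this thread that works out to roughly 3.32GiB free in data chunks and 1.08GiB free in the DUP metadata chunks.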
>
> As soon as the problem started, I saw that the Metadata, DUP was
> completely used. It become a little better (like above) after a scrub.
> I can easily recover disk space by removing old snapshots, if needed.
>
> The dmesg output is attached.
>
> Before making further recovery attempts, or even restoring from backup,
> I would like to ask for the best option to proceed.
I'd be kind of curious to see the results from btrfs check run without 
repair, but I doubt that will help narrow things down any further.

As of right now, the absolute first thing I'd do is check your logs to 
see if you can find any indication of errors from the disk itself.  I 
don't think it's likely, but it's worth checking.
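
One rough way to do that check is to filter the kernel log for messages that come from the block, SCSI, or ATA layers rather than from BTRFS. The Python sketch below is only an illustration: the pattern list is an assumption about what such messages usually contain, not an exhaustive catalogue, and the last sample line is hypothetical.

```python
import re

# Substrings that kernel logs commonly contain when the block, SCSI, or ATA
# layer reports trouble. This list is an illustrative assumption and is not
# exhaustive.
DISK_ERROR = re.compile(
    r"I/O error|Medium Error|failed command|exception Emask"
)


def disk_error_lines(log_text):
    """Return the log lines that look like errors reported by the disk itself."""
    return [line for line in log_text.splitlines() if DISK_ERROR.search(line)]


# Two lines from the dmesg attachment in this thread, plus one HYPOTHETICAL
# line (the last one) showing what a block-layer error could look like.
SAMPLE = """\
[20190.871802] BTRFS error (device sdb5): parent transid verify failed on 160420773888 wanted 181826 found 181573
[20190.882990] BTRFS: error (device sdb5) in btrfs_run_delayed_refs:2963: errno=-5 IO failure
[99999.000000] blk_update_request: I/O error, dev sdb, sector 123456
"""

if __name__ == "__main__":
    for line in disk_error_lines(SAMPLE):
        print(line)
```

Neither of the two BTRFS lines taken from the attached log matches; only the hypothetical block-layer line does. To run this against a live system, the text could come from the output of `dmesg` or from the saved system logs.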

The couple of lines just before the crash in the attached kernel log 
would indicate to me that some of the metadata is corrupted.  There are 
two likely possibilities for how that happened:
1. Running with no extra space for new chunks to be allocated is not a 
common use case, so it's not well tested, and it wouldn't surprise me if 
some accounting falls apart in that situation.
2. You might have bad RAM or a bad PSU.  This is the second thing you 
should check after checking to see if the disk is OK, as either will 
likely cause any repair attempts to make things worse.  RAM is pretty 
easy to check, but for a PSU you need a proper testing device.  You can 
get such a device on Amazon or similar sites for about 25USD, and it's 
generally worth having around for troubleshooting.

Assuming your disk and RAM are good, the next thing to do would be to try 
and get the filesystem into a more usable state.  The best option for 
this is to expand the filesystem if possible.  Given that you're running 
right near capacity, I'd suggest at least 16G of extra space if 
possible.  If that isn't a viable solution for you, the other option is 
to delete some of the oldest snapshots (Ideally enough that you have at 
least a few GB of extra space in the data chunks and a few hundred MB in 
the metadata chunks), then add a 4-8GB device to the FS temporarily (a 
ramdisk or flash drive works well for this), and run a full balance.  If 
you're lucky, this will fix any metadata that's messed up, and the 
system should be usable.  If not, it shouldn't make things any worse, 
and you probably want to look at btrfs restore to copy out the data to a 
new filesystem (ideally a bigger one).
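
The second option can be written out as a dry-run plan. The Python sketch below only prints the commands and runs nothing. The device `/dev/sdX` and the snapshot path are hypothetical placeholders, and the final `btrfs device delete` step is an assumption implied by adding the device "temporarily".

```python
import shlex


def recovery_plan(mountpoint, temp_device, old_snapshots):
    """Return the delete / add / balance / remove sequence as argv lists.

    Nothing is executed here; the caller decides whether to run the commands.
    """
    # 1. Free space inside the chunks by deleting the oldest snapshots.
    plan = [["btrfs", "subvolume", "delete", snap] for snap in old_snapshots]
    # 2. Give the chunk allocator some unallocated space to work with.
    plan.append(["btrfs", "device", "add", temp_device, mountpoint])
    # 3. Full balance (no filters), rewriting every chunk.
    plan.append(["btrfs", "balance", "start", mountpoint])
    # 4. Assumed last step: migrate data off the temporary device again.
    plan.append(["btrfs", "device", "delete", temp_device, mountpoint])
    return plan


if __name__ == "__main__":
    # /dev/sdX and the snapshot path are placeholders, not values from this thread.
    for argv in recovery_plan("/mnt", "/dev/sdX", ["/mnt/snapshots/old-1"]):
        print(shlex.join(argv))
```

Keeping the plan as data makes it easy to review each command before running anything against a filesystem that is already damaged.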


* Re: Filesystem forced to readonly after use
  2016-09-13 19:49 ` Austin S. Hemmelgarn
@ 2016-09-13 20:22   ` Chris Murphy
  2016-09-13 20:39   ` Cesar Strauss
  1 sibling, 0 replies; 8+ messages in thread
From: Chris Murphy @ 2016-09-13 20:22 UTC
  To: Austin S. Hemmelgarn; +Cc: Cesar Strauss, Btrfs BTRFS

On Tue, Sep 13, 2016 at 1:49 PM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-09-13 15:20, Cesar Strauss wrote:

>>
>> btrfs-progs v4.7
>
> It's always good to see people who are staying up-to-date on the kernel and
> userspace :)

Yes, although it and 4.7.1 are marked as do not use.

https://btrfs.wiki.kernel.org/index.php/Changelog#btrfs-progs-4.7.2_.28Sep_2016.29


-- 
Chris Murphy


* Re: Filesystem forced to readonly after use
  2016-09-13 19:49 ` Austin S. Hemmelgarn
  2016-09-13 20:22   ` Chris Murphy
@ 2016-09-13 20:39   ` Cesar Strauss
  2016-09-13 21:06     ` Chris Murphy
  2016-09-14 11:29     ` Austin S. Hemmelgarn
  1 sibling, 2 replies; 8+ messages in thread
From: Cesar Strauss @ 2016-09-13 20:39 UTC
  To: Austin S. Hemmelgarn, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2495 bytes --]

On 13-09-2016 16:49, Austin S. Hemmelgarn wrote:
> I'd be kind of curious to see the results from btrfs check run without
> repair, but I doubt that will help narrow things down any further.

Attached.

>
> As of right now, the absolute first thing I'd do is check your logs to
> see if you can find any indication of errors from the disk itself.  I
> don't think it's likely, but it's worth checking.

Will do.

> The couple of lines just before the crash in the attached kernel log
> would indicate to me that some of the metadata is corrupted.  There are
> two likely possibilities for how that happened:
> 1. Running with no extra space for new chunks to be allocated is not a
> common use case, so it's not well tested, and it wouldn't surprise me if
> some accounting falls apart in that situation.

Indeed. I periodically remove old snapshots and check for disk space, 
but I guess I ran a bit too near the limit this time.

> 2. You might have bad RAM or a bad PSU.  This is the second thing you
> should check after checking to see if the disk is OK, as either will
> likely cause any repair attempts to make things worse.  RAM is pretty
> easy to check, but for a PSU you need a proper testing device.  You can
> get such a device on Amazon or similar sites for about 25USD, and it's
> generally worth having around for troubleshooting.

Understood.

This notebook has occasional failures when resuming from hibernation. I 
suppose, from the point of view of the filesystem, this corresponds to 
an unclean reboot.

>
> Assuming your disk and RAM are good, the next thing to do would be to try
> and get the filesystem into a more usable state.  The best option for
> this is to expand the filesystem if possible.  Given that you're running
> right near capacity, I'd suggest at least 16G of extra space if
> possible.  If that isn't a viable solution for you, the other option is
> to delete some of the oldest snapshots (Ideally enough that you have at
> least a few GB of extra space in the data chunks and a few hundred MB in
> the metadata chunks), then add a 4-8GB device to the FS temporarily (a
> ramdisk or flash drive works well for this), and run a full balance.  If
> you're lucky, this will fix any metadata that's messed up, and the
> system should be usable.  If not, it shouldn't make things any worse,
> and you probably want to look at btrfs restore to copy out the data to a
> new filesystem (ideally a bigger one).

I will try this next.

Thanks for the help!

Cesar


[-- Attachment #2: btrfs.check --]
[-- Type: text/plain, Size: 2647 bytes --]

checking extents
parent transid verify failed on 160420773888 wanted 181826 found 181573
parent transid verify failed on 160420773888 wanted 181826 found 181573
parent transid verify failed on 160420773888 wanted 181826 found 181573
parent transid verify failed on 160420773888 wanted 181826 found 181573
Ignoring transid failure
leaf parent key incorrect 160420773888
parent transid verify failed on 160420773888 wanted 181826 found 181573
Ignoring transid failure
leaf parent key incorrect 160420773888
parent transid verify failed on 160420773888 wanted 181826 found 181573
Ignoring transid failure
leaf parent key incorrect 160420773888
parent transid verify failed on 160420773888 wanted 181826 found 181573
Ignoring transid failure
leaf parent key incorrect 160420773888
parent transid verify failed on 160420773888 wanted 181826 found 181573
Ignoring transid failure
leaf parent key incorrect 160420773888
parent transid verify failed on 160420773888 wanted 181826 found 181573
Ignoring transid failure
leaf parent key incorrect 160420773888
parent transid verify failed on 160420773888 wanted 181826 found 181573
Ignoring transid failure
leaf parent key incorrect 160420773888
parent transid verify failed on 160420773888 wanted 181826 found 181573
Ignoring transid failure
leaf parent key incorrect 160420773888
parent transid verify failed on 160420773888 wanted 181826 found 181573
Ignoring transid failure
leaf parent key incorrect 160420773888
parent transid verify failed on 160418889728 wanted 181826 found 181572
parent transid verify failed on 160418889728 wanted 181826 found 181572
parent transid verify failed on 160418889728 wanted 181826 found 181572
parent transid verify failed on 160418889728 wanted 181826 found 181572
parent transid verify failed on 160420741120 wanted 181826 found 181573
parent transid verify failed on 160420741120 wanted 181826 found 181573
parent transid verify failed on 160420741120 wanted 181826 found 181573
parent transid verify failed on 160420741120 wanted 181826 found 181573
Ignoring transid failure
leaf parent key incorrect 160420741120
bad block 160420741120
Errors found in extent allocation tree or chunk allocation
parent transid verify failed on 160420773888 wanted 181826 found 181573
Ignoring transid failure
parent transid verify failed on 160418889728 wanted 181826 found 181572
parent transid verify failed on 160418889728 wanted 181826 found 181572
parent transid verify failed on 160420741120 wanted 181826 found 181573
Ignoring transid failure
Error: could not find btree root extent for root 1183
Checking filesystem on /dev/sdb5
UUID: 79862c20-d0b0-4ffa-a9af-e3a40868a243


* Re: Filesystem forced to readonly after use
  2016-09-13 19:39 ` Chris Murphy
@ 2016-09-13 20:50   ` Cesar Strauss
  0 siblings, 0 replies; 8+ messages in thread
From: Cesar Strauss @ 2016-09-13 20:50 UTC
  To: Chris Murphy; +Cc: Btrfs BTRFS

On 13-09-2016 16:39, Chris Murphy wrote:
> I just wouldn't use btrfs check --repair with this version of progs; go
> back to v4.6.1 or upgrade to 4.7.2.

Thanks for the tip. I upgraded to 4.7.2.

Cesar



* Re: Filesystem forced to readonly after use
  2016-09-13 20:39   ` Cesar Strauss
@ 2016-09-13 21:06     ` Chris Murphy
  2016-09-14 11:29     ` Austin S. Hemmelgarn
  1 sibling, 0 replies; 8+ messages in thread
From: Chris Murphy @ 2016-09-13 21:06 UTC
  To: Cesar Strauss; +Cc: Austin S. Hemmelgarn, Btrfs BTRFS

From the fsck...

bad block 160420741120

I can't tell though if that's a bad Btrfs leaf/node where both dup
copies are bad; or if it's a bad sector.
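
To see which blocks the check output complains about, and how far behind each one is, the transid lines can be tallied. The Python sketch below uses four lines copied from the btrfs check attachment; the function name is made up for illustration. A "found" generation lower than "wanted" means the block on disk comes from an older transaction than its parent expects.

```python
import re

TRANSID = re.compile(
    r"parent transid verify failed on (\d+) wanted (\d+) found (\d+)"
)


def summarize_transid_failures(text):
    """Map block address -> (wanted, found, number of times reported)."""
    summary = {}
    for m in TRANSID.finditer(text):
        block, wanted, found = (int(g) for g in m.groups())
        # Keep a running count per block; wanted/found are taken from the latest line.
        _, _, count = summary.get(block, (wanted, found, 0))
        summary[block] = (wanted, found, count + 1)
    return summary


# Lines copied from the btrfs check output attached earlier in this thread.
SAMPLE = """\
parent transid verify failed on 160420773888 wanted 181826 found 181573
parent transid verify failed on 160420773888 wanted 181826 found 181573
parent transid verify failed on 160418889728 wanted 181826 found 181572
parent transid verify failed on 160420741120 wanted 181826 found 181573
"""

if __name__ == "__main__":
    for block, (wanted, found, count) in summarize_transid_failures(SAMPLE).items():
        print(f"block {block}: {wanted - found} generations behind, seen {count}x")
```

In the attached output only three distinct blocks are reported, all wanting generation 181826 and found at 181572 or 181573.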

I'd mount it ro, and take a backup of anything you care about before
proceeding further.

smartctl -x might reveal if there are problems the drive itself is aware of.


* Re: Filesystem forced to readonly after use
  2016-09-13 20:39   ` Cesar Strauss
  2016-09-13 21:06     ` Chris Murphy
@ 2016-09-14 11:29     ` Austin S. Hemmelgarn
  1 sibling, 0 replies; 8+ messages in thread
From: Austin S. Hemmelgarn @ 2016-09-14 11:29 UTC
  To: Cesar Strauss, linux-btrfs

On 2016-09-13 16:39, Cesar Strauss wrote:
> On 13-09-2016 16:49, Austin S. Hemmelgarn wrote:
>> I'd be kind of curious to see the results from btrfs check run without
>> repair, but I doubt that will help narrow things down any further.
>
> Attached.
>
>>
>> As of right now, the absolute first thing I'd do is check your logs to
>> see if you can find any indication of errors from the disk itself.  I
>> don't think it's likely, but it's worth checking.
>
> Will do.
>
>> The couple of lines just before the crash in the attached kernel log
>> would indicate to me that some of the metadata is corrupted.  There are
>> two likely possibilities for how that happened:
>> 1. Running with no extra space for new chunks to be allocated is not a
>> common use case, so it's not well tested, and it wouldn't surprise me if
>> some accounting falls apart in that situation.
>
> Indeed. I periodically remove old snapshots and check for disk space,
> but I guess I ran a bit too near the limit this time.
In theory, BTRFS _should_ work in such a situation.  In practice, you 
get all kinds of odd behaviors.  In your case, you still have a 
reasonable amount of free space in both data and metadata chunks, so it 
isn't quite as bad as it could be (trying to get a FS working again when 
you have zero space in any chunks is a serious pain).
>
>> 2. You might have bad RAM or a bad PSU.  This is the second thing you
>> should check after checking to see if the disk is OK, as either will
>> likely cause any repair attempts to make things worse.  RAM is pretty
>> easy to check, but for a PSU you need a proper testing device.  You can
>> get such a device on Amazon or similar sites for about 25USD, and it's
>> generally worth having around for troubleshooting.
>
> Understood.
>
> This notebook has occasional failures when resuming from hibernation. I
> suppose, from the point of view of the filesystem, this corresponds to
> an unclean reboot.
Yeah, although it's generally not quite as bad as an unclean reboot 
(default configurations on almost all Linux distros call sync just 
before the actual power off, so you don't have to worry about stuff in 
the write cache being lost).  That said, it can also be worse than an 
unclean reboot depending on when the crash happens.

This brings up a good point that I forgot, though: repeated unclean 
shutdowns (or failed resumes) can cause stuff like this to happen.  I 
don't often think about it since I rarely have issues with power loss or 
hard crashes (and I don't use hibernation), so it's not something I 
often remember to mention when helping people with filesystem issues.
>
>>
>> Assuming your disk and RAM are good, the next thing to do would be to try
>> and get the filesystem into a more usable state.  The best option for
>> this is to expand the filesystem if possible.  Given that you're running
>> right near capacity, I'd suggest at least 16G of extra space if
>> possible.  If that isn't a viable solution for you, the other option is
>> to delete some of the oldest snapshots (Ideally enough that you have at
>> least a few GB of extra space in the data chunks and a few hundred MB in
>> the metadata chunks), then add a 4-8GB device to the FS temporarily (a
>> ramdisk or flash drive works well for this), and run a full balance.  If
>> you're lucky, this will fix any metadata that's messed up, and the
>> system should be usable.  If not, it shouldn't make things any worse,
>> and you probably want to look at btrfs restore to copy out the data to a
>> new filesystem (ideally a bigger one).
>
> I will try this next.
Like Chris mentioned, you probably want to use a different version of 
btrfs-progs.  I hadn't seen that that version was marked to not be used, 
otherwise I would have said something in my first reply.

