* btrfs mounts RO, kernel oops on RW
@ 2017-05-28  2:46 Bill Williamson
  2017-05-28  5:56 ` Duncan
  2017-05-28 22:25 ` Chris Murphy
  0 siblings, 2 replies; 6+ messages in thread
From: Bill Williamson @ 2017-05-28  2:46 UTC (permalink / raw)
  To: linux-btrfs

Version details:
btrfs-progs v4.9.1
Linux bigserver 4.10.0-22-generic #24-Ubuntu SMP Mon May 22 17:43:20
UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Array Details:
root@bigserver:~# btrfs fi df /mnt/storage
Data, RAID1: total=12.48TiB, used=12.25TiB
System, RAID1: total=32.00MiB, used=2.11MiB
Metadata, RAID1: total=14.00GiB, used=13.31GiB
GlobalReserve, single: total=512.00MiB, used=0.00


root@bigserver:~# btrfs fi show /mnt/storage
Label: none  uuid: c792d033-b0a6-44a0-bd37-9825de7eeb8b
        Total devices 10 FS bytes used 12.27TiB
        devid    1 size 2.73TiB used 2.71TiB path /dev/sde
        devid    2 size 3.64TiB used 3.62TiB path /dev/sdh
        devid    5 size 1.82TiB used 1.80TiB path /dev/sdg
        devid    6 size 1.82TiB used 1.80TiB path /dev/sdc
        devid    8 size 1.36TiB used 1.35TiB path /dev/sdb
        devid    9 size 3.64TiB used 3.62TiB path /dev/sdf
        devid   12 size 1.82TiB used 1.80TiB path /dev/sdd
        devid   13 size 4.55TiB used 4.53TiB path /dev/sdk
        devid   14 size 3.64TiB used 3.62TiB path /dev/sdi
        devid   15 size 3.64TiB used 134.00GiB path /dev/sdj


Issue:
I can mount my btrfs readonly (recovery option not necessary).
Attempting to mount it readwrite results in a kernel NULL pointer
dereference.

Background:
I have a home server with a bunch of disks running btrfs raid 1.  When
it starts to fill up I add another disk and re-balance.
I added a new 4TB disk and began the re-balance.  After a while I
needed to shut down the server, and did so gracefully with a shutdown
-h now.  Upon rebooting the array wouldn't mount, so I put "noauto"
into fstab to allow a graceful bootup and diagnose from there.

At first I got the "failed to read log tree" error, so I ran
btrfs-zero-log.  It walked back 3-4 transactions but now seems okay.
After that fix:
- btrfs check shows no errors.
- mounting the filesystem RO works great; I can read files.
- mounting the filesystem RW results in a huge kernel exception and a
hang, centered on can_overcommit and
btrfs_async_reclaim_metadata_space
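
Roughly, the sequence of commands I went through (device and mountpoint
names as on my system; any member device works for mount):

```shell
# Discard the unreplayable log tree (walked back the last few transactions)
btrfs-zero-log /dev/sdb

# Offline check comes back clean
btrfs check /dev/sdb

# Read-only mount works fine
mount -o ro /dev/sdb /mnt/storage

# Read-write mount triggers the oops pasted below
mount /dev/sdb /mnt/storage
```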

My "you're screwed, it's dead" backup plan is to build another server
and buy 2x8TB drives, and then copy the data I care about over, but
I'd much rather save myself the trouble and $$$ and repair the array
if possible.

I'd also like to learn what happened so that I can avoid it in the future.

Thanks,

Bill Williamson


Full exception as it shows up in syslog (I've trimmed the big datetime
stamp to the left of each entry).

BTRFS info (device sdb): disk space caching is enabled
BUG: unable to handle kernel NULL pointer dereference at 00000000000001f0
IP: can_overcommit+0x28/0x110 [btrfs]
PGD 20dd42067
PUD 20abdc067
PMD 0

Oops: 0000 [#1] SMP
Modules linked in: binfmt_misc ppdev edac_mce_amd edac_core kvm
irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc
aesni_intel aes_x86_64 crypto_simd glue_helper cryptd serio_raw joydev
input_leds k10temp fam15h_power i2c_piix4 snd_hda_codec_via
snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_hda_codec
snd_hda_core snd_hwdep snd_pcm snd_timer parport_pc snd soundcore
asus_atk0110 shpchp mac_hid lp parport ip_tables x_tables autofs4
btrfs xor raid6_pq pata_acpi hid_microsoft usbhid hid amdkfd
amd_iommu_v2 radeon i2c_algo_bit ttm drm_kms_helper syscopyarea
sysfillrect psmouse sysimgblt fb_sys_fops mvsas drm pata_atiixp ahci
r8169 libsas libahci mii scsi_transport_sas wmi fjes
CPU: 7 PID: 2779 Comm: kworker/u16:3 Not tainted 4.10.0-22-generic #24-Ubuntu
Hardware name: System manufacturer System Product Name/M5A78L-M/USB3,
BIOS 1503    11/14/2012
Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
task: ffff88b6c7855a00 task.stack: ffffb66443524000
RIP: 0010:can_overcommit+0x28/0x110 [btrfs]
RSP: 0018:ffffb66443527dc0 EFLAGS: 00010286
RAX: 0000000001000000 RBX: ffff88b785f4d400 RCX: 0000000000000002
RDX: 0000000000800000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffb66443527df8 R08: 0000000000000008 R09: ffff88b785f4d4a8
R10: ffffffff8fd92d40 R11: 0000000000001ee0 R12: ffff88b785f4d4b8
R13: 0000000000800000 R14: 0000000000000000 R15: ffff88b785f4d400
FS:  0000000000000000(0000) GS:ffff88b79edc0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000001f0 CR3: 0000000213c57000 CR4: 00000000000406e0
Call Trace:
? __switch_to+0x23c/0x520
btrfs_async_reclaim_metadata_space+0x378/0x440 [btrfs]
process_one_work+0x1fc/0x4b0
worker_thread+0x4b/0x500
kthread+0x109/0x140
? process_one_work+0x4b0/0x4b0
? kthread_create_on_node+0x60/0x60
ret_from_fork+0x2c/0x40
Code: 0f 1f 00 0f 1f 44 00 00 f6 46 58 01 0f 85 f0 00 00 00 55 48 89
e5 41 57 41 56 41 55 41 54 49 89 d5 53 48 89 f3 31 f6 48 83 ec 10 <4c>
8b b7 f0 01 00 00 89 4d cc 49 3b 7e 38 4d 8d be 48 01 00 00
RIP: can_overcommit+0x28/0x110 [btrfs] RSP: ffffb66443527dc0
CR2: 00000000000001f0
---[ end trace 2bcd6443af9da872 ]---

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: btrfs mounts RO, kernel oops on RW
  2017-05-28  2:46 btrfs mounts RO, kernel oops on RW Bill Williamson
@ 2017-05-28  5:56 ` Duncan
  2017-05-28  7:27   ` Bill Williamson
  2017-05-28 22:25 ` Chris Murphy
  1 sibling, 1 reply; 6+ messages in thread
From: Duncan @ 2017-05-28  5:56 UTC (permalink / raw)
  To: linux-btrfs

Bill Williamson posted on Sun, 28 May 2017 12:46:00 +1000 as excerpted:

> Version details:
> btrfs-progs v4.9.1
> Linux bigserver 4.10.0-22-generic #24-Ubuntu SMP Mon May 22
> 17:43:20 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
> 
> Array Details:
> root@bigserver:~# btrfs fi df /mnt/storage
> Data, RAID1: total=12.48TiB, used=12.25TiB
> System, RAID1: total=32.00MiB, used=2.11MiB
> Metadata, RAID1: total=14.00GiB, used=13.31GiB
> GlobalReserve, single: total=512.00MiB, used=0.00
> 
> 
> root@bigserver:~# btrfs fi show /mnt/storage Label: none  uuid:
> c792d033-b0a6-44a0-bd37-9825de7eeb8b
>         Total devices 10 FS bytes used 12.27TiB
>         devid    1 size 2.73TiB used 2.71TiB path /dev/sde
>         devid    2 size 3.64TiB used 3.62TiB path /dev/sdh
>         devid    5 size 1.82TiB used 1.80TiB path /dev/sdg
>         devid    6 size 1.82TiB used 1.80TiB path /dev/sdc
>         devid    8 size 1.36TiB used 1.35TiB path /dev/sdb
>         devid    9 size 3.64TiB used 3.62TiB path /dev/sdf
>         devid   12 size 1.82TiB used 1.80TiB path /dev/sdd
>         devid   13 size 4.55TiB used 4.53TiB path /dev/sdk
>         devid   14 size 3.64TiB used 3.62TiB path /dev/sdi
>         devid   15 size 3.64TiB used 134.00GiB path /dev/sdj

Only one device with free space of any size.  That can be an issue for 
raid1, which needs two devices with free space for it to be worth 
anything.  But you were working on that and it doesn't seem to be your 
current issue...


> Issue:
> I can mount my btrfs readonly (recovery option not necessary).
> Attempting to mount it readwrite results in a kernel null pointer
> exception.
> 
> Background:
> I have a home server with a bunch of disks running btrfs raid 1.  When
> it starts to fill up I add another disk and re-balance.
> I added a new 4TB disk and began the re-balance.  After a while I needed
> to shut down the server, and did so gracefully with a shutdown -h now. 
> Upon rebooting the array wouldn't mount, so I put "noauto"
> into fstab to allow a graceful bootup and diagnose from there.

So far, so good.

> At first I got the failed to read log tree error, so I ran
> btrfs-zero-log.  It walked back 3-4 transactions but now seems okay.
> 
> After that fix:
> - btrfs check shows no errors.
> - mounting the filesystem RO works great, I can read files.
> - mounting the filesystem RW results in a huge kernel exception and a
> hang, centering around can_overcommit and
> btrfs_async_reclaim_metadata_space

Try using the skip_balance mount option.  See the btrfs (5) manpage (you 
must specify the 5, or you'll get the section 8 general btrfs command 
manpage).

If that works, you can resume or cancel the balance once the filesystem 
is mounted writable.
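
Something along these lines, assuming a member device /dev/sdX and your
/mnt/storage mountpoint:

```shell
# Mount read-write without resuming the interrupted balance
mount -o skip_balance /dev/sdX /mnt/storage

# The balance is left paused; once mounted, either resume it...
btrfs balance resume /mnt/storage
# ...or cancel it entirely
btrfs balance cancel /mnt/storage
```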

But the filesystem is clearly not healthy, and that won't make it 
healthy, just eliminate the current heart-attack trigger.  I'd observe 
the sysadmin's rules of backups below before trying anything else, 
including the skip_balance mount option.


> My "you're screwed, it's dead" backup plan is to build another server
> and buy 2x8TB drives, and then copy the data I care about over, but I'd
> much rather save myself the trouble and $$$ and repair the array if
> possible.

The sysadmin's first rule of backups:  The value of your data is defined 
by the number and currency of your backups: No backups, you are defining 
your data as of only trivial value, worth less than the time/trouble/
resources necessary to make those backups.  (In)Actions speak louder than 
words, so the definition holds regardless of any after-the-fact protests 
to the contrary.

Put differently, if you don't /already/ have backups, then by definition, 
you /don't/ care about any data on those drives and need not bother 
copying it over as that would be as much of a hassle as making the backup 
in the first place and you've already demonstrated you don't value the 
data enough to do that.

Put yet differently, if the potential loss of that data has changed your 
mind about its value, better make that backup **NOW**, preferably before 
any further attempts to mount writable, with or without skip_balance, 
while you have the chance and before further inaction tempts fate by 
continuing to define the data as throw-away value.  Next time you might 
not get that chance!

(The second rule of backups is that a would-be backup isn't a backup 
until you've tested it restorable/usable.  Until then, it's only a would-
be backup, as the backup simply isn't complete until it has been tested.)


After that, assuming skip_balance works, I'd try a scrub.  Given that 
both data and metadata are raid1, that should ensure everything matches 
its checksum and eliminate any wrote-one-mirror-crashed-while-writing-the-
other type errors.  Of course if the filesystem is corrupted enough, 
it might crash at that point if it can't fix things, but at 
least here, I've found scrub pretty reliable at fixing problems such as 
bad shutdowns.
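
That is, something like:

```shell
# Kick off a background scrub; with raid1 it can repair a bad copy
# from the good mirror when checksums mismatch
btrfs scrub start /mnt/storage

# Watch progress and the corrected/uncorrectable error counts
btrfs scrub status /mnt/storage
```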

If scrub finds errors and they're all correctable, you're likely healthy 
again, but it might be worth running a read-only btrfs check to be sure.  
Same if scrub finds some uncorrectable errors. If the check reports 
errors, post them here and see what the experts say (I'm not a dev, just 
another user, and that sort of thing is normally beyond me), before 
actually trying to fix them.


Meanwhile, turning the topic a bit, toward your suggested 8 TB drives.  
Be aware that many of those are archive-targeted drives and aren't 
designed for normal use.  Linux (generally, not just btrfs) originally 
had problems with them but they've been fixed for a few kernel cycles 
now.  However, unless you really /are/ going to use them for archiving, 
that is, write once and shelve them, btrfs (or any other COW-based 
filesystem) isn't going to be your best choice on them, as COW is a 
worst case for the technology they use.  A more conventional 
filesystem should work better, altho ordinary usage performance still 
won't be great: they're not /designed/ for that sort of usage, but 
rather for mostly write-once archiving, or alternatively for a write, 
save, whole-drive (firmware command level) secure-erase, reuse cycle.

So if you're going for the really large drives, do be aware of that and 
buy archive-usage or otherwise based on what you actually plan to do with 
the drives.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: btrfs mounts RO, kernel oops on RW
  2017-05-28  5:56 ` Duncan
@ 2017-05-28  7:27   ` Bill Williamson
  2017-05-28 20:51     ` Duncan
  2017-05-29  8:39     ` Marat Khalili
  0 siblings, 2 replies; 6+ messages in thread
From: Bill Williamson @ 2017-05-28  7:27 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs

On 28 May 2017 at 15:56, Duncan <1i5t5.duncan@cox.net> wrote:
> Bill Williamson posted on Sun, 28 May 2017 12:46:00 +1000 as excerpted:
>
>> At first I got the failed to read log tree error, so I ran
>> btrfs-zero-log.  It walked back 3-4 transactions but now seems okay.
>>
>> After that fix:
>> - btrfs check shows no errors.
>> - mounting the filesystem RO works great, I can read files.
>> - mounting the filesystem RW results in a huge kernel exception and a
>> hang, centering around can_overcommit and
>> btrfs_async_reclaim_metadata_space
>
> Try using the skip_balance mount option.  See the btrfs (5) manpage (you
> must specify the 5, or you'll get the section 8 general btrfs command
> manpage).
>
> If that works, you can resume or cancel the balance once the filesystem
> is mounted writable.

Thanks for the tip, but no joy :(

Exactly the same kernel oops.

> The sysadmin's first rule of backups:  The value of your data is defined
> by the number and currency of your backups: No backups, you are defining
> your data as of only trivial value, worth less than the time/trouble/
> resources necessary to make those backups.  (In)Actions speak louder than
> words, so the definition holds regardless of any after-the-fact protests
> to the contrary.

Thanks also for the backup reinforcement here, as it made me double
check everything I have in place.  The important data (photos etc) is
backed up in a few different places and will be easy to restore.  The
only thing that wasn't in those backups is our minecraft world saves,
so they are now backed up too :)

The rest of the data is the typical "family" stuff, as in a rip of
every blu-ray/DVD we own, a lot of TV shows, etc.  So it's all
replaceable stuff, though it will annoy the kids that it's gone for a
short while.  I.e. not worth spending $$$ on a like-for-like backup
regime.


> (The second rule of backups is that a would-be backup isn't a backup
> until you've tested it restorable/usable.  Until then, it's only a would-
> be backup, as the backup simply isn't complete until it has been tested.)

Also a good reminder to check that my photos will restore from
elsewhere.... and they do.

> Meanwhile, turning the topic a bit, toward your suggested 8 TB drives.
> Be aware that many of those are archive-targeted drives and aren't
> designed for normal use.

I didn't consider that, as "most" of my data is semi-archival, but it
still gets moved around a bit.  I was indeed considering the 8TB
archive hard drives you were thinking of, so thanks for the warning!

I'm not asking for a specific endorsement, but should I be considering
something like the Seagate IronWolf or WD Red drives?

Thanks again for the reply

Bill


* Re: btrfs mounts RO, kernel oops on RW
  2017-05-28  7:27   ` Bill Williamson
@ 2017-05-28 20:51     ` Duncan
  2017-05-29  8:39     ` Marat Khalili
  1 sibling, 0 replies; 6+ messages in thread
From: Duncan @ 2017-05-28 20:51 UTC (permalink / raw)
  To: linux-btrfs

Bill Williamson posted on Sun, 28 May 2017 17:27:47 +1000 as excerpted:

> I'm not asking for a specific endorsement, but should I be considering
> something like the seagate ironwolf or WD red drives?

There's a (well, at least one) guy here that knows much more about that 
than I do, and FWIW, my own usage is rather lower on the scale, sub-TB 
and generally SSD.  I just saw the trouble with the 8TB archive drives 
hit the list and while as I said the worst of it is now fixed, know they 
still have problems with btrfs due to its write pattern, so thought I'd 
warn you, just in case.  

Basically, the way these work is that to get their extreme storage 
density, other than for a relatively small fast-write area (perhaps a 
hundred gig or a half TB, I'm not sure), they overlap sectors in 
"zones" (shingled magnetic recording, SMR), shingled like tiles on a 
roof, so writing or rewriting just a single sector means (re)writing 
the entire zone.  So they'll typically write to the fast-write area 
during the initial write, then, when the drive isn't busy satisfying 
incoming requests, rewrite that data into one of these zones.

So as long as you're doing relatively slow archiving writes and 
there's time in between to do the zone rewrites (think security cam 
footage with motion-sensitivity so it doesn't actually write if the image 
is static), these things work great, and are a great deal for the money 
due to their generally lower per-TB cost.

But they'll do a lot of background rewriting from the fast write area 
when the disk is otherwise idle, and if you shut down in the middle of a 
rewrite, I believe it has to start that zone rewrite over.

And if the writes come in too fast or too steady and overwhelm that fast 
write area, things **DRAMATICALLY** slow down, and can appear to lock up 
the system at times because the zone rewrite is in progress and it can't 
just drop it to satisfy a read request unless it wants to scrap the zone 
rewrite and start it over later, which of course intensifies the problem 
even more if the writes are already coming in faster than the drive can 
rewrite the zones.

So these work best in large always-on installations with relatively slow 
archive-write patterns such that very little is rewritten where the drive 
has lots of otherwise inactive powered-on time to do its zone rewrites, 
uninterrupted by incoming requests or power-downs.  (I have a feeling 
Amazon Glacier may be a major if not their primary customer...)

Within that envelope, they tend to be very good value for the money, 
which makes them very tempting in general, but a lot of people simply 
aren't aware of the serious limits of the targeted usage envelope, and 
due to that low cost, want to use them for other things for which they're 
a very poor match.

And the btrfs write pattern, along with the fact that it's still 
stabilizing, not as stable and mature as for instance ext* or xfs, makes 
btrfs a very poor choice for use on these drives, unless the use-case 
really /does/ fall squarely within their target envelope.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: btrfs mounts RO, kernel oops on RW
  2017-05-28  2:46 btrfs mounts RO, kernel oops on RW Bill Williamson
  2017-05-28  5:56 ` Duncan
@ 2017-05-28 22:25 ` Chris Murphy
  1 sibling, 0 replies; 6+ messages in thread
From: Chris Murphy @ 2017-05-28 22:25 UTC (permalink / raw)
  To: Bill Williamson; +Cc: Btrfs BTRFS

On Sat, May 27, 2017 at 8:46 PM, Bill Williamson <bill@bbqninja.com> wrote:
> btrfs_async_reclaim_metadata_space

I only found this:
https://www.spinics.net/lists/linux-btrfs/msg62382.html

I can't tell if it's related though.


I'd run btrfs check and btrfs check --mode=lowmem, using btrfs-progs
4.10.2 or 4.11 (without --repair), and post the results as
attachments. They might reveal what issues there are with the file
system.
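
Something like this, against the unmounted filesystem (any one member
device is enough; redirecting the output makes it easy to attach):

```shell
btrfs check /dev/sdb               > check.txt 2>&1
btrfs check --mode=lowmem /dev/sdb > check-lowmem.txt 2>&1
```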

>4.10.0-22-generic #24-Ubuntu

I don't know what that translates into upstream-wise. 4.10.18 is the
last of that series. I would try 4.11.3, since it's the most current.
And if that doesn't work then I'd try 4.9.30 or something relatively
recent in that series, just because it's a long-term series and
shouldn't have any major new features causing this if it's a
regression.


-- 
Chris Murphy


* Re: btrfs mounts RO, kernel oops on RW
  2017-05-28  7:27   ` Bill Williamson
  2017-05-28 20:51     ` Duncan
@ 2017-05-29  8:39     ` Marat Khalili
  1 sibling, 0 replies; 6+ messages in thread
From: Marat Khalili @ 2017-05-29  8:39 UTC (permalink / raw)
  Cc: linux-btrfs

> I'm not asking for a specific endorsement, but should I be considering
> something like the seagate ironwolf or WD red drives?

You need two qualitative features from an HDD for RAID usage:
1) being failfast (TLER in WD speak), and
2) being designed to tolerate vibrations from other disks in the box.

Therefore you need _at least_ a WD Red or the alternative from Seagate. 
Paying more can only bring you quantitative benefits AFAIK. Just don't 
put desktop drives in RAID.
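
You can query the first point with smartmontools (the drive name here
is just an example):

```shell
# Report SCT Error Recovery Control; desktop drives typically show it
# disabled or unsupported
smartctl -l scterc /dev/sda

# Where supported, cap read/write recovery at 7.0 seconds for RAID use
# (values are in tenths of a second)
smartctl -l scterc,70,70 /dev/sda
```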

(Sorry for being off-topic, but after some long recent discussions I 
don't feel as guilty. :) )

--

With Best Regards,
Marat Khalili



end of thread, other threads:[~2017-05-29  8:39 UTC | newest]

Thread overview: 6+ messages
2017-05-28  2:46 btrfs mounts RO, kernel oops on RW Bill Williamson
2017-05-28  5:56 ` Duncan
2017-05-28  7:27   ` Bill Williamson
2017-05-28 20:51     ` Duncan
2017-05-29  8:39     ` Marat Khalili
2017-05-28 22:25 ` Chris Murphy
