From: Chris Murphy <lists@colorremedies.com>
To: Nick Bowler <nbowler@draconx.ca>
Cc: Chris Murphy <lists@colorremedies.com>,
	Btrfs BTRFS <linux-btrfs@vger.kernel.org>,
	Filipe Manana <fdmanana@kernel.org>
Subject: Re: Btrfs filesystem trashed after OOM scenario
Date: Tue, 24 Sep 2019 23:55:08 -0600	[thread overview]
Message-ID: <CAJCQCtRGm4vD3a6xqa8mihutYgFxfYOJDtA31KD-Ctu5Hi+kKA@mail.gmail.com> (raw)
In-Reply-To: <CADyTPEw=g7y+DroBt+CO-=8T3=8kO5Muj6Ts3LrkwDtKx2=zcQ@mail.gmail.com>

On Tue, Sep 24, 2019 at 10:25 PM Nick Bowler <nbowler@draconx.ca> wrote:
>
> On Tue, Sep 24, 2019, 18:34 Chris Murphy, <lists@colorremedies.com> wrote:
> > On Tue, Sep 24, 2019 at 4:04 PM Nick Bowler <nbowler@draconx.ca> wrote:
> > > - Running Linux 5.2.14, I pushed this system to OOM; the oom killer
> > > ran and killed some userspace tasks.  At this point many of the
> > > remaining tasks were stuck in uninterruptible sleeps.  Not really
> > > worried, I turned the machine off and on again to just get everything
> > > back to normal.  But I guess now that everything had gone horribly
> > > wrong already at this point...
> >
> > Yeah the kernel oomkiller is pretty much only about kernel
> > preservation, not user space preservation.
>
> Indeed I am not bothered at all by needing to turn it off and on again
> in this situation.  But filesystems being completely trashed is
> another matter...

Yep, I agree. Maybe Filipe will chime in on whether you hit this bug
or some other issue.

> > So if you're willing to blow shit up again, you can try to reproduce
> > with one of those.
>
> Well I could try but it sounds like this might be hard to reproduce...

If you're using 5.2.15+ you won't hit the fixed bug. But if there's
some other cause you might still hit that, and it's better to find out
under controlled test conditions than at some unexpected time.

> > I was also doing oomkiller blow shit up tests a few weeks ago with
> > these same problem kernels and never hit this bug, or any others. I
> > also had to do a LOT of force power offs because the system just
> > became totally wedged in and I had no way of estimating how long it
> > would be for recovery so after 30 minutes I hit the power button. Many
> > times. Zero corruptions. That's with a single Samsung 840 EVO in a
> > laptop relegated to such testing.
>
> Just a thought... the system was alive but I was able to briefly
> inspect the situation and notice that tasks were blocked and
> unkillable... until my shell hung too and then I was hosed.  But I
> didn't hit the power button but rather rebooted with sysrq+e, sysrq+u,
> sysrq+b.  Not sure if that makes a difference.

Dunno.

Basically what I've discovered is that you want to avoid depending on
the oomkiller; it's just not suitable for maintaining user space
interactivity at all. I've used this:
https://github.com/rfjakob/earlyoom

It monitors both swap and memory use, and will trigger an OOM kill
much sooner than the kernel's oomkiller. System responsiveness still
takes a hit, so I can't call it a good user experience, but recovery
is faster. And in my (admittedly limited) testing it consistently
kills the largest, most offending program, whereas the kernel
oomkiller might kill some small unrelated daemon, freeing just enough
memory to keep the kernel happy for a long time while user space stays
totally blocked.
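The decision logic such a userspace monitor uses can be sketched
roughly like this. This is a simplified illustration, not earlyoom's
actual code; the parsing helper, function names, and the 10% cutoffs
are all my own assumptions:

```python
# Simplified sketch of an early-OOM check: parse /proc/meminfo-style
# data and decide whether it is time to kill the largest process,
# before the kernel's own oomkiller would act. On a real system you
# would read open("/proc/meminfo").read() in a loop instead of the
# sample string below. Thresholds are illustrative only.

def parse_meminfo(text):
    """Parse 'Key:   value kB' lines into a dict of integer kB values."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info

def should_kill(info, mem_min_percent=10, swap_min_percent=10):
    """Trigger when BOTH available memory and free swap fall below
    their thresholds -- checking the two together avoids firing while
    there is still swap headroom."""
    mem_pct = 100 * info["MemAvailable"] / info["MemTotal"]
    if info.get("SwapTotal", 0) > 0:
        swap_pct = 100 * info["SwapFree"] / info["SwapTotal"]
    else:
        swap_pct = 0  # no swap configured: only memory matters
    return mem_pct <= mem_min_percent and swap_pct <= swap_min_percent

sample = """MemTotal:       8000000 kB
MemAvailable:    400000 kB
SwapTotal:      2000000 kB
SwapFree:        100000 kB"""

info = parse_meminfo(sample)
print(should_kill(info))  # 5% memory and 5% swap left -> True
```

The real tool then picks the victim itself (largest RSS, adjusted by
oom_score_adj) and sends SIGTERM before escalating to SIGKILL, rather
than waiting for the kernel to choose.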


>
> > Might be a different bug. Not sure. But also, this is with
> >
> > > [  347.551595] CPU: 3 PID: 1143 Comm: mount Not tainted 4.19.34-1-lts #1
> >
> > So I don't know how an older kernel will report on the problem caused
> > by the 5.2 bug.
>
> This is the kernel from systemrescuecd.  I can try taking a disk image
> and mounting on another machine with a newer linux version.

Try btrfs check --readonly and report back the results. I suggest
btrfs-progs 5.0 or higher, 5.2.2 if you can muster it. That might help
clarify whether you hit the 5.2 regression bug. But btrfs check can't
fix it if that's what you hit, so it's 'btrfs restore' to scrape out
what you can, and then create a new file system.
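In concrete terms the recovery path looks something like the below.
These are illustrative commands only: /dev/sdX and /mnt/scratch are
placeholders for your actual device and a scratch directory on a
*different*, healthy filesystem, and you should double-check the
device name before running anything destructive:

```shell
# Read-only check first -- this does not modify the filesystem.
btrfs check --readonly /dev/sdX

# If the metadata is too damaged to mount or repair, scrape files out
# with btrfs restore into a scratch location on another filesystem.
btrfs restore -v /dev/sdX /mnt/scratch/

# Only after confirming the salvaged data is intact: recreate the
# filesystem (destroys everything on /dev/sdX) and copy data back.
mkfs.btrfs -f /dev/sdX
```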


-- 
Chris Murphy

Thread overview: 5+ messages
2019-09-24 22:03 Btrfs filesystem trashed after OOM scenario Nick Bowler
2019-09-24 22:34 ` Chris Murphy
2019-09-25  4:25   ` Nick Bowler
2019-09-25  5:55     ` Chris Murphy [this message]
2019-09-26 11:26     ` Austin S. Hemmelgarn
