Re: BUG_ON() in workingset_node_shadows_dec() triggers

From: Kees Cook <keescook@chromium.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Willy Tarreau <w@1wt.eu>,
	Paul Gortmaker <paul.gortmaker@windriver.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Antonio SJ Musumeci <trapexit@spawn.link>,
	Miklos Szeredi <miklos@szeredi.hu>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	stable <stable@vger.kernel.org>
Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers
Date: Thu, 6 Oct 2016 15:07:03 -0700	[thread overview]
Message-ID: <CAGXu5jKwPjOikfTqr1YX-5wFAdpW=XsMsjuERkjHP=mzokLNCw@mail.gmail.com> (raw)
In-Reply-To: <CA+55aFwmH=VL+8COPBUeiTzG8U-UD57pcZr1iG4BGyP99feTgA@mail.gmail.com>

On Wed, Oct 5, 2016 at 3:29 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Oct 5, 2016 at 3:17 PM, Kees Cook <keescook@chromium.org> wrote:
>>
>> With my more paranoid desires, I would prefer to keep "stop kernel
>> execution with the state set up by this process", not just "make the
>> process never return to user-space".
>
> Quite honestly, I think the answer to that is: "No. Not by default".
> So with some kind of kernel command line option, yes, kind of like
> "reboot_on_oops" (or whatever it is - I've never used it ;)
>
>>> And *if* we make BUG() actually do something sane (non-trapping), we
>>> can easily make it be generic, not arch-specific. In fact, I'd
>>> implement it by just adding a "handle_bug()" in kernel/panic.c...
>>
>> Yeah, I'm not sure what the right next step would be. Do we need a new
>> set of functions between WARN and BUG? Or maybe extract the
>> process-killing logic on a per-arch level and make it a specific API
>> so that it can be explicitly called as part of error-handling? Hmm
>
> So the process-killing logic actually used to historically just be
> "call do_exit()". In fact, that's what most architectures still do in
> their error paths. And it's what a lot of people who just want to kill
> the current code do.
>
> So calling "do_exit()" is actually perfectly fine. It's just that
> calling do_exit() from BUG_ON() is a major pain, because of the
> asynchronous nature of BUG_ON(). But if you are in a regular system
> call and don't hold any locks, do_exit() is still fine.

Well, that's the problem I repeatedly ran into: locks. And actually,
more than that. Even in seccomp when trying to do a clean death of a
process (without the paranoid requirement that kernel execution for
that syscall stop), I've run into problems with do_exit() -- in that
case it was bad ptrace state assumptions during signal delivery. I
originally used do_exit with the hardened usercopy stuff and that
would fall all over itself under lkdtm testing, since locks were held.
The "cleanest" way to handle it seemed to be the lock-busting logic
already built into BUG, so I moved to that.

> In fact, all that x86 really does differently from do_exit() in the
> fault path is to reset the stack pointer first, so that you don't get
> stack smashers when you have recursive faults (which used to be one
> really nasty failure case, not just with BUG_ON(), but with any kernel
> oops in general). So on x86, the crash code actually calls a function
> called "rewind_stack_do_exit()" instead.
>
> But the name gives it away: it's the exact same thing.
>
> So you can actually do a generic BUG_ON() (even with the current
> semantics) pretty much today by just having a config option that the
> architecture can set to specify whether you should just call
> "do_exit()" or "rewind_stack_do_exit()" to do that final killing
> action.
>
> There's a few other possible gotcha's (the code is hard to follow
> because the normal implementation uses a trapping instruction and
> hides the BUG() information in the text, so you get the whole fault
> path), but on the whole I think it should be fairly straightforward do
> just get rid of all the arch code, and replace it with a generic
> function that can then decide internally whether it wants to just
> warn, whether it wants to SIGKILL, or whether it wants to do the
> traditional thing and just force do_exit(). Or do new things like
> reboot or just halt.
>
> But it really would be very nice to never have do_exit() have to worry
> about odd callers. We've had a *lot* of trouble over the years with
> deadlocks on critical locks in do_exit(), for example.

Right, so I see a number of "features" that are mixed together between
do_exit, WARN, BUG, and panic, and it seems code wants various
combinations of them:

- report details about an unexpected condition occurring (BUG,
actually, does NOT do this: it doesn't take any kind of format
argument, but arch-specific internal traps DO, before calling the rest
of the internals that BUG is wired up to.)
- dump CPU and stack state
- kill current
- kill current's entire thread group
- stop kernel execution from continuing
- halt the system
- reboot the system

By far the most problematic is "stop kernel execution from
continuing", but that's currently the behavior that BUG depends on, so
replacing BUG with anything needs to either fix the surrounding logic
to fail sanely or we have the keep the feature.

I remain convinced, though, that "stop this thread of kernel
execution" is not the same as "reboot the system".

-Kees

-- 
Kees Cook
Nexus Security