Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep

From: Andy Lutomirski <luto@amacapital.net>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andi Kleen <andi@firstfloor.org>, Borislav Petkov <bp@alien8.de>,
	"the arch/x86 maintainers" <x86@kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Oleg Nesterov <oleg@redhat.com>, Tony Luck <tony.luck@intel.com>
Subject: Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep
Date: Mon, 23 May 2016 19:09:41 -0700	[thread overview]
Message-ID: <CALCETrWwoaR5uyKigpayB-jJuDuMxeqqcudFtA=k+UqxDHVD9g@mail.gmail.com> (raw)
In-Reply-To: <CA+55aFysHFxi3+gM69rpj445QQ352ZRucjw23ZeTBj-azp6pqg@mail.gmail.com>

On Mon, May 23, 2016 at 6:48 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, May 23, 2016 at 6:23 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>
>>> Or we could just let ksoftirqd do its thing and stop raising
>>> HARDIRQ_COUNT.  We could add a new preempt count field just for IST
>>> (yuck).  We could try to hijack a different preempt count field
>>> (NMI?).  But I kind of like the idea of just reinstating the original
>>> patch of explicitly checking that we're on a safe stack in schedule
>>> and __might_sleep, since that is the actual condition we care about.
>>
>> Ping?  I can still trigger this fairly easily on 4.6.
>
> .. I haven't seen a patch from you, last I saw that was kind of what I expected.
>
> That said, I still despise your patch. Why can't you just fix
> "in_interrupt()" and be done with it. The original patch was like 50
> lines of changes for somethinig that feels like it should be a
> one-liner.
>
> And no, we don't add idiotic new config symbols for things like "I
> have this one-liner trivial arch helper". What we do is to just test
> for such a helper with "#ifdef" (and if it's a inline function we do
> #define xyz xyz" so that the #ifdef works).
>
> So the original patch in this thread is still off the table,
> especially since there was absolutely no explanation for why it should
> be such a crazy complicated thing.
>
> What exactly is it you are nervous about scheduling in NMI's? I agree
> that that would be disastrous, but it's not supposed to actually
> happen.

It's not the NMIs I'm worried about -- they do crazy stuff, but it's
self-contained crazy stuff, and NMIs can *never* schedule, so they're
unlikely to have issues here.  It's the things that are a bit more
ambiguous.  For example, MCE handlers sometimes do schedule, and we
allow it if they use the right helpers and check the right conditions.
The original patch lets us print a big fat warning if they screw it
up.

I can't modify in_interrupt for the same reason that the current code
is broken: if in_interrupt() returns true, that's a promise that we'll
call invoke_softirq via irq_exit in a timely manner.  Doing this from
a weird ultra-atomic context that interrupted kernel code that had
IF=0 would be bad.  IOW, the whole in_interrupt() mechanism was
designed until the entirely reasonably assumption that interrupts only
happen when interrupts are on.  These special interrupt-like things
(NMI, MCE, debug) can happen asynchronously with interrupts *off*, and
the result is a mess.

What about this silly fix?  (Pardon the probable whitespace damage.)

commit b3e89c652bdf6a7d5a23b094ae921f193e62c534
Author: Andy Lutomirski <luto@kernel.org>
Date:   Mon May 23 19:07:21 2016 -0700

    x86/traps: Don't for in_interrupt() to return true in IST handlers

    Forcing in_interrupt() to return true if we're not in a bona fide
    interrupt confuses the softirq code.

    Cc: stable@vger.kernel.org
    Fixes: 959274753857 ("x86, traps: Track entry into and exit from
IST context")
    Signed-off-by: Andy Lutomirski <luto@kernel.org>

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index d1590486204a..9c1d948d5d8b 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -96,6 +96,19 @@ static inline void cond_local_irq_disable(struct
pt_regs *regs)
         local_irq_disable();
 }

+/*
+ * We want to cause in_atomic() to return true while in an IST handler
+ * so that attempts to schedule will warn.  We also want to detect buggy
+ * code that does preempt_enable(); preempt_disable() in an IST handler,
+ * so we want in_atomic to still return true if that happens.
+ *
+ * We cannot add use HARDIRQ_OFFSET or otherwise cause in_interrupt() to
+ * return true: the softirq code assumes that in_interrupt() only
+ * returns true if we will soon execute softirqs, and we can't do that
+ * if an IST entry interrupts kernel code with interrupts disabled.
+ */
+#define IST_OFFSET (3 * PREEMPT_OFFSET)
+
 void ist_enter(struct pt_regs *regs)
 {
     if (user_mode(regs)) {
@@ -116,7 +129,7 @@ void ist_enter(struct pt_regs *regs)
      * on x86_64 and entered from user mode, in which case we're
      * still atomic unless ist_begin_non_atomic is called.
      */
-    preempt_count_add(HARDIRQ_OFFSET);
+    preempt_count_add(IST_OFFSET);

     /* This code is a bit fragile.  Test it. */
     RCU_LOCKDEP_WARN(!rcu_is_watching(), "ist_enter didn't work");
@@ -124,7 +137,7 @@ void ist_enter(struct pt_regs *regs)

 void ist_exit(struct pt_regs *regs)
 {
-    preempt_count_sub(HARDIRQ_OFFSET);
+    preempt_count_sub(IST_OFFSET);

     if (!user_mode(regs))
         rcu_nmi_exit();