From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1423076AbcBQRxK (ORCPT <rfc822;w@1wt.eu>);
	Wed, 17 Feb 2016 12:53:10 -0500
Received: from mail-ob0-f169.google.com ([209.85.214.169]:36270 "EHLO
	mail-ob0-f169.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1423027AbcBQRxD (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Wed, 17 Feb 2016 12:53:03 -0500
MIME-Version: 1.0
In-Reply-To: <20160217092911.GA2023@pd.tnic>
References: <20160211192741.GG5565@pd.tnic> <CALCETrVSChX9BFyLn2mf4q3a0WxyNgFzcb4A4nFfAQnbfO02Pg@mail.gmail.com>
 <CALCETrVU8RvcDAUPfwoW9FVvgyn3z-5R86+4-mXtubpTd4YiKg@mail.gmail.com>
 <20160212170010.GE4099@pd.tnic> <20160215191422.GB32716@pd.tnic>
 <CALCETrW3r89hbU3u2xG2u+ZDrk-WzwmicL9ZL-L1GVGZfH-sdQ@mail.gmail.com>
 <20160217081646.GA32354@gmail.com> <20160217092911.GA2023@pd.tnic>
From: Andy Lutomirski <luto@amacapital.net>
Date: Wed, 17 Feb 2016 09:52:41 -0800
Message-ID: <CALCETrXC2OJLW=1jSfjMtH023PeqM2_C3_nerZ0co9qoQ6+J2Q@mail.gmail.com>
Subject: Re: WARNING: CPU: 0 PID: 3031 at ./arch/x86/include/asm/fpu/internal.h:530
 fpu__restore+0x90/0x130()
To: Borislav Petkov <bp@alien8.de>
Cc: X86 ML <x86@kernel.org>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        Ingo Molnar <mingo@kernel.org>
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Feb 17, 2016 1:29 AM, "Borislav Petkov" <bp@alien8.de> wrote:
>
> On Wed, Feb 17, 2016 at 09:16:46AM +0100, Ingo Molnar wrote:
> > So I'm wondering why this started triggering only now. Is this a pre-existing bug
> > that somehow got triggered via:
> >
> >   58122bf1d856 x86/fpu: Default eagerfpu=on on all CPUs
> >
> > ?
>
> Well, that's an interesting question. See, the thing is, I triggered
> this only *once* by accident and I haven't seen it ever since.
>
> The "reliable" "reproducer" I used to debug this was Andy's suggestion
> to stick a schedule() in __fpu__restore_sig().
>
> So the answer to that question is not easy.
>
> BUT(!), regardless, the bug still needs to be fixed because my tracing
> here
>
> https://lkml.kernel.org/r/20160215191422.GB32716@pd.tnic
>
> showed that getting preempted after setting
>
>         fpu->fpstate_active = 1;
>
> leads to the WARN. Because - and please doublecheck me on that - when
> we're in __switch_to() and the task which already has ->fpstate_active
> set and it is the next task to which we're going to switch to, when it
> enters switch_fpu_prepare(), it does:
>
>         fpu.preload = static_cpu_has(X86_FEATURE_FPU) &&
>                       new_fpu->fpstate_active &&
>                       ^^^^^^^^^^^^^^^^^^^^^^^
>
> so that fpu.preload is set now.
>
> A bit later in that same function:
>
>                 /* Don't change CR0.TS if we just switch! */
>                 if (fpu.preload) {
>                         new_fpu->counter++;
>                         __fpregs_activate(new_fpu);
>                         ^^^^^^^^^^^^^^^^^
>
> ->fpregs_active gets set here and when the task returns to
> __fpu__restore_sig(), fpu__restore() sets it again, leading to the WARN.
>
> Mind you, this happens on 32-bit only because there we sigreturn with
> irqs enabled. Look at the call trace.

Not quite.  IRQs are on in the 64-bit case, too, but the problematic
code doesn't run because the 64-bit sigreturn case doesn't need to
fiddle with the fpu layout.

--Andy