All of lore.kernel.org
 help / color / mirror / Atom feed
* issue with x86 FPU state after suspend to ram
@ 2012-11-30 18:52 Vincent Palatin
  2012-11-30 18:52 ` [PATCH] x86, fpu: avoid FPU lazy restore after suspend Vincent Palatin
  0 siblings, 1 reply; 13+ messages in thread
From: Vincent Palatin @ 2012-11-30 18:52 UTC (permalink / raw)
  To: Ingo Molnar, H. Peter Anvin, linux-kernel, Linus Torvalds
  Cc: Thomas Gleixner, x86, Peter Zijlstra, Jarkko Sakkinen,
	Duncan Laurie, Olof Johansson

Hi,

On a 4-core Ivybridge platform, when doing a lot of suspend-to-ram/resume
cycles, we were observing processes randomly killed by a SIGFPE.
When dumping the FPU registers state on the SIGFPE (usually a floating stack
underflow/overflow on a floating point arithmetic operation), the FPU registers
looks empty or at least corrupted which was more or less impossible with
respect to the disassembled floating point code.

After doing more tracing, in the faulty case, the process seems to be keeping
FPU ownership over a secondary CPU unplug/re-plug triggered by the suspend.
Then it's doing a lazy restore of its FPU context (ie just using the current
FPU hardware registers as he is the owner) instead of writing them back
to the hardware from the version previously saved in the task context,
despite the fact the whole FPU hardware state has been lost.

Just invalidating the "fpu_owner_task" when disabling a secondary CPU seems
to solve my issue (it's already reset for the primary CPU).

By the way, when FPU the lazy restore patch was discussed back in february,
Ingo commented (in http://permalink.gmane.org/gmane.linux.kernel/1255423) :
"
I guess the CPU hotplug case deserves a comment in the code: CPU
hotplug + replug of the same (but meanwhile reset) CPU is safe
because fpu_owner_task[cpu] gets reset to NULL.
"
That contradicts my previous observation, so maybe I have totally overlooked
something in this mechanism.
Can you comment ?

I'm still putting my patch proposal in this thread.
The issue seems to exist since 3.4 after the FPU lazy restore was actually
implemented by commit 7e16838d "i387: support lazy restore of FPU state".
But the issue is mainly visible on 3.4 and 3.6 since on tip of tree, it is
hidden by the eager fpu implementation for platforms with xsave support,
but it still happens with eagerfpu=off.

To apply this change to 3.4, "this_cpu_write" needs to be replaced by
percpu_write.

--
Vincent


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] x86, fpu: avoid FPU lazy restore after suspend
  2012-11-30 18:52 issue with x86 FPU state after suspend to ram Vincent Palatin
@ 2012-11-30 18:52 ` Vincent Palatin
  2012-11-30 18:57   ` H. Peter Anvin
                     ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Vincent Palatin @ 2012-11-30 18:52 UTC (permalink / raw)
  To: Ingo Molnar, H. Peter Anvin, linux-kernel, Linus Torvalds
  Cc: Thomas Gleixner, x86, Peter Zijlstra, Jarkko Sakkinen,
	Duncan Laurie, Olof Johansson, Vincent Palatin

When a cpu enters S3 state, the FPU state is lost.
After resuming for S3, if we try to lazy restore the FPU for a process running
on the same CPU, this will result in a corrupted FPU context.

We can just invalidate the "fpu_owner_task", so nobody will try to
lazy restore a state which no longer exists in the hardware.

Tested with a 64-bit kernel on a 4-core Ivybridge CPU with eagerfpu=off,
by doing thousands of suspend/resume cycles with 4 processes doing FPU
operations running. Without the patch, a process is killed after a
few hundreds cycles by a SIGFPE.

The issue seems to exist since 3.4 (after the FPU lazy restore was actually implemented),
to apply the change to 3.4, "this_cpu_write" needs to be replaced by percpu_write.

Cc: Duncan Laurie <dlaurie@chromium.org>
Cc: Olof Johansson <olofj@chromium.org>
Cc: <stable@kernel.org> [v3.4+] # for 3.4 need to replace this_cpu_write by percpu_write
Signed-off-by: Vincent Palatin <vpalatin@chromium.org>
---
 arch/x86/kernel/smpboot.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index c80a33b..7610c58 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -68,6 +68,8 @@
 #include <asm/mwait.h>
 #include <asm/apic.h>
 #include <asm/io_apic.h>
+#include <asm/i387.h>
+#include <asm/fpu-internal.h>
 #include <asm/setup.h>
 #include <asm/uv/uv.h>
 #include <linux/mc146818rtc.h>
@@ -1230,6 +1232,9 @@ int native_cpu_disable(void)
 	clear_local_APIC();
 
 	cpu_disable_common();
+
+	/* the FPU context will be lost, nobody owns it */
+	this_cpu_write(fpu_owner_task, NULL);
 	return 0;
 }
 
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] x86, fpu: avoid FPU lazy restore after suspend
  2012-11-30 18:52 ` [PATCH] x86, fpu: avoid FPU lazy restore after suspend Vincent Palatin
@ 2012-11-30 18:57   ` H. Peter Anvin
  2012-11-30 19:25   ` Linus Torvalds
  2012-11-30 19:26   ` tip-bot for Vincent Palatin
  2 siblings, 0 replies; 13+ messages in thread
From: H. Peter Anvin @ 2012-11-30 18:57 UTC (permalink / raw)
  To: Vincent Palatin
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Thomas Gleixner, x86,
	Peter Zijlstra, Jarkko Sakkinen, Duncan Laurie, Olof Johansson

On 11/30/2012 10:52 AM, Vincent Palatin wrote:
> When a cpu enters S3 state, the FPU state is lost.
> After resuming for S3, if we try to lazy restore the FPU for a process running
> on the same CPU, this will result in a corrupted FPU context.
> 
> We can just invalidate the "fpu_owner_task", so nobody will try to
> lazy restore a state which no longer exists in the hardware.
> 
> Tested with a 64-bit kernel on a 4-core Ivybridge CPU with eagerfpu=off,
> by doing thousands of suspend/resume cycles with 4 processes doing FPU
> operations running. Without the patch, a process is killed after a
> few hundreds cycles by a SIGFPE.
> 
> The issue seems to exist since 3.4 (after the FPU lazy restore was actually implemented),
> to apply the change to 3.4, "this_cpu_write" needs to be replaced by percpu_write.
> 
> Cc: Duncan Laurie <dlaurie@chromium.org>
> Cc: Olof Johansson <olofj@chromium.org>
> Cc: <stable@kernel.org> [v3.4+] # for 3.4 need to replace this_cpu_write by percpu_write
> Signed-off-by: Vincent Palatin <vpalatin@chromium.org>

Ouch!  Thank you for catching this!

	-hpa



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] x86, fpu: avoid FPU lazy restore after suspend
  2012-11-30 18:52 ` [PATCH] x86, fpu: avoid FPU lazy restore after suspend Vincent Palatin
  2012-11-30 18:57   ` H. Peter Anvin
@ 2012-11-30 19:25   ` Linus Torvalds
  2012-11-30 19:38     ` H. Peter Anvin
  2012-11-30 19:26   ` tip-bot for Vincent Palatin
  2 siblings, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2012-11-30 19:25 UTC (permalink / raw)
  To: Vincent Palatin
  Cc: Ingo Molnar, H. Peter Anvin, Linux Kernel Mailing List,
	Thomas Gleixner, the arch/x86 maintainers, Peter Zijlstra,
	Jarkko Sakkinen, Duncan Laurie, Olof Johansson

On Fri, Nov 30, 2012 at 10:52 AM, Vincent Palatin <vpalatin@chromium.org> wrote:
> When a cpu enters S3 state, the FPU state is lost.
> After resuming for S3, if we try to lazy restore the FPU for a process running
> on the same CPU, this will result in a corrupted FPU context.

Good catch, and I think the patch is technically correct, but:

> We can just invalidate the "fpu_owner_task", so nobody will try to
> lazy restore a state which no longer exists in the hardware.

I think this works well, but is a tiny bit subtle.

And what I _don't_ necessarily think is the right thing to do is to
only comment on it in the commit message, because it means that later
generations will not necessarily ever notice. And it does result in a
new combination that I don't think we've had before: if the current
task is the owner, we now have "tsk->thread.fpu.has_fpu  = 1" but with
"fpu_owner_task" being NULL.

Which gets us the semantics we want (we will save the current CPU info
when switching away, but we will not restore when switching back), but
my gut feel is that we really want to comment on that exact thing. And
possibly even make a helper function for this in <sm/fpu-internal.h>,
something like

    /*
     * Must be run with preemption disabled: this clears the fpu_owner_task,
     * on this CPU.
     *
     * This will disable any lazy FPU state restore of the current FPU state,
     * but if the current thread owns the FPU, it will still be saved by.
     */
    static inline void __cpu_disable_lazy_restore(void)
    {
        this_cpu_write(fpu_owner_task, NULL);
    }

and in fact I think the right place to do this *might* be in
"native_cpu_die()" instead, at which point it would actually be
something like

    per_cpu(fpu_owner_task, cpu) = NULL;

*after* the CPU is dead, so that nothing ever can actually see the
state where a process is still running on the CPU and might possibly
use the FPU.

I dunno. I think doing it after really killing the CPU (ie in the
native_cpu_die() function) might be easier to think about, but I don't
really hate your patch either (it does make me go "ok, we need to
guarantee no scheduling or FP use after" - which is probably true, but
it's still some non-local thing). Either way, a comment about it and
abstracting whatever the invalidation sequence is in fpu-internal.h
sounds like a good idea.

Hmm?

                  Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [tip:x86/urgent] x86, fpu: Avoid FPU lazy restore after suspend
  2012-11-30 18:52 ` [PATCH] x86, fpu: avoid FPU lazy restore after suspend Vincent Palatin
  2012-11-30 18:57   ` H. Peter Anvin
  2012-11-30 19:25   ` Linus Torvalds
@ 2012-11-30 19:26   ` tip-bot for Vincent Palatin
  2 siblings, 0 replies; 13+ messages in thread
From: tip-bot for Vincent Palatin @ 2012-11-30 19:26 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, olofj, dlaurie, vpalatin, tglx, hpa

Commit-ID:  c9370d1039848a813a63c3fd44b6e4833a78246a
Gitweb:     http://git.kernel.org/tip/c9370d1039848a813a63c3fd44b6e4833a78246a
Author:     Vincent Palatin <vpalatin@chromium.org>
AuthorDate: Fri, 30 Nov 2012 10:52:03 -0800
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Fri, 30 Nov 2012 10:58:14 -0800

x86, fpu: Avoid FPU lazy restore after suspend

When a cpu enters S3 state, the FPU state is lost.
After resuming for S3, if we try to lazy restore the FPU for a process running
on the same CPU, this will result in a corrupted FPU context.

We can just invalidate the "fpu_owner_task", so nobody will try to
lazy restore a state which no longer exists in the hardware.

Tested with a 64-bit kernel on a 4-core Ivybridge CPU with eagerfpu=off,
by doing thousands of suspend/resume cycles with 4 processes doing FPU
operations running. Without the patch, a process is killed after a
few hundreds cycles by a SIGFPE.

The issue seems to exist since 3.4 (after the FPU lazy restore was actually implemented),
to apply the change to 3.4, "this_cpu_write" needs to be replaced by percpu_write.

Cc: Duncan Laurie <dlaurie@chromium.org>
Cc: Olof Johansson <olofj@chromium.org>
Cc: <stable@kernel.org> [v3.4+] # for 3.4 need to replace this_cpu_write by percpu_write
Signed-off-by: Vincent Palatin <vpalatin@chromium.org>
Link: http://lkml.kernel.org/r/1354301523-5252-2-git-send-email-vpalatin@chromium.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 arch/x86/kernel/smpboot.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index c80a33b..7610c58 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -68,6 +68,8 @@
 #include <asm/mwait.h>
 #include <asm/apic.h>
 #include <asm/io_apic.h>
+#include <asm/i387.h>
+#include <asm/fpu-internal.h>
 #include <asm/setup.h>
 #include <asm/uv/uv.h>
 #include <linux/mc146818rtc.h>
@@ -1230,6 +1232,9 @@ int native_cpu_disable(void)
 	clear_local_APIC();
 
 	cpu_disable_common();
+
+	/* the FPU context will be lost, nobody owns it */
+	this_cpu_write(fpu_owner_task, NULL);
 	return 0;
 }
 

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] x86, fpu: avoid FPU lazy restore after suspend
  2012-11-30 19:25   ` Linus Torvalds
@ 2012-11-30 19:38     ` H. Peter Anvin
  2012-11-30 19:41       ` Linus Torvalds
  0 siblings, 1 reply; 13+ messages in thread
From: H. Peter Anvin @ 2012-11-30 19:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Vincent Palatin, Ingo Molnar, Linux Kernel Mailing List,
	Thomas Gleixner, the arch/x86 maintainers, Peter Zijlstra,
	Jarkko Sakkinen, Duncan Laurie, Olof Johansson

On 11/30/2012 11:25 AM, Linus Torvalds wrote:
> 
> and in fact I think the right place to do this *might* be in
> "native_cpu_die()" instead, at which point it would actually be
> something like
> 
>     per_cpu(fpu_owner_task, cpu) = NULL;
> 
> *after* the CPU is dead, so that nothing ever can actually see the
> state where a process is still running on the CPU and might possibly
> use the FPU.
> 
> I dunno. I think doing it after really killing the CPU (ie in the
> native_cpu_die() function) might be easier to think about, but I don't
> really hate your patch either (it does make me go "ok, we need to
> guarantee no scheduling or FP use after" - which is probably true, but
> it's still some non-local thing). Either way, a comment about it and
> abstracting whatever the invalidation sequence is in fpu-internal.h
> sounds like a good idea.
> 

Hmm... from my point of view it would almost seem saner to do this on
the way *up*... as part of CPU (re-)initialization.  After all, the
"nothing is currently running on this CPU" is part of the initial state
of the CPU, regardless of if we have ever been online before or not.

	-hpa


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] x86, fpu: avoid FPU lazy restore after suspend
  2012-11-30 19:38     ` H. Peter Anvin
@ 2012-11-30 19:41       ` Linus Torvalds
  2012-11-30 19:51         ` H. Peter Anvin
  2012-11-30 19:52         ` [PATCH v2] " Vincent Palatin
  0 siblings, 2 replies; 13+ messages in thread
From: Linus Torvalds @ 2012-11-30 19:41 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Vincent Palatin, Ingo Molnar, Linux Kernel Mailing List,
	Thomas Gleixner, the arch/x86 maintainers, Peter Zijlstra,
	Jarkko Sakkinen, Duncan Laurie, Olof Johansson

On Fri, Nov 30, 2012 at 11:38 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 11/30/2012 11:25 AM, Linus Torvalds wrote:
>>
>> and in fact I think the right place to do this *might* be in
>> "native_cpu_die()" instead, at which point it would actually be
>> something like
>>
>>     per_cpu(fpu_owner_task, cpu) = NULL;
>>
>> *after* the CPU is dead, so that nothing ever can actually see the
>> state where a process is still running on the CPU and might possibly
>> use the FPU.
>>
>> I dunno. I think doing it after really killing the CPU (ie in the
>> native_cpu_die() function) might be easier to think about, but I don't
>> really hate your patch either (it does make me go "ok, we need to
>> guarantee no scheduling or FP use after" - which is probably true, but
>> it's still some non-local thing). Either way, a comment about it and
>> abstracting whatever the invalidation sequence is in fpu-internal.h
>> sounds like a good idea.
>>
>
> Hmm... from my point of view it would almost seem saner to do this on
> the way *up*... as part of CPU (re-)initialization.  After all, the
> "nothing is currently running on this CPU" is part of the initial state
> of the CPU, regardless of if we have ever been online before or not.

That works for me too, and has the similar advantage of being easier
to think about than "the cpu is actually still being used, and we're
playing tricks with the FPU state cache" approach.

Regardless, much credit to Vincent (and presumably others on the
chromium team) for finding and debugging this.

             Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] x86, fpu: avoid FPU lazy restore after suspend
  2012-11-30 19:41       ` Linus Torvalds
@ 2012-11-30 19:51         ` H. Peter Anvin
       [not found]           ` <CAP_ceTxmMhQeDi=x9HmYke85hKMg3_YhbXSnfDC12rOcocQJpA@mail.gmail.com>
  2012-11-30 19:52         ` [PATCH v2] " Vincent Palatin
  1 sibling, 1 reply; 13+ messages in thread
From: H. Peter Anvin @ 2012-11-30 19:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Vincent Palatin, Ingo Molnar, Linux Kernel Mailing List,
	Thomas Gleixner, the arch/x86 maintainers, Peter Zijlstra,
	Jarkko Sakkinen, Duncan Laurie, Olof Johansson

On 11/30/2012 11:41 AM, Linus Torvalds wrote:
>>
>> Hmm... from my point of view it would almost seem saner to do this on
>> the way *up*... as part of CPU (re-)initialization.  After all, the
>> "nothing is currently running on this CPU" is part of the initial state
>> of the CPU, regardless of if we have ever been online before or not.
> 
> That works for me too, and has the similar advantage of being easier
> to think about than "the cpu is actually still being used, and we're
> playing tricks with the FPU state cache" approach.
> 
> Regardless, much credit to Vincent (and presumably others on the
> chromium team) for finding and debugging this.
> 

OK, how do you want to handle this, given that 3.7 is presumably
imminent?  We can apply Vincent's patch now, and then restructure it
later, or we can go for the gold now, but replicating the testing
Vincent has done will probably take well into next week.

	-hpa



^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v2] x86, fpu: avoid FPU lazy restore after suspend
  2012-11-30 19:41       ` Linus Torvalds
  2012-11-30 19:51         ` H. Peter Anvin
@ 2012-11-30 19:52         ` Vincent Palatin
  2012-11-30 20:15           ` [PATCH v3] " Vincent Palatin
  1 sibling, 1 reply; 13+ messages in thread
From: Vincent Palatin @ 2012-11-30 19:52 UTC (permalink / raw)
  To: Ingo Molnar, H. Peter Anvin, linux-kernel, Linus Torvalds
  Cc: Thomas Gleixner, x86, Peter Zijlstra, Jarkko Sakkinen,
	Vincent Palatin, Duncan Laurie, Olof Johansson

When a cpu enters S3 state, the FPU state is lost.
After resuming for S3, if we try to lazy restore the FPU for a process running
on the same CPU, this will result in a corrupted FPU context.

Ensure that "fpu_owner_task" is properly invalided when (re-)initializing a CPU,
so nobody will try to lazy restore a state which doesn't exist in the hardware.

Tested with a 64-bit kernel on a 4-core Ivybridge CPU with eagerfpu=off,
by doing thousands of suspend/resume cycles with 4 processes doing FPU
operations running. Without the patch, a process is killed after a
few hundreds cycles by a SIGFPE.

The issue seems to exist since 3.4 (after the FPU lazy restore was actually implemented),
to apply the change to 3.4, "this_cpu_write" needs to be replaced by percpu_write.

Cc: Duncan Laurie <dlaurie@chromium.org>
Cc: Olof Johansson <olofj@chromium.org>
Cc: <stable@kernel.org> [v3.4+] # for 3.4 need to replace this_cpu_write by percpu_write
Signed-off-by: Vincent Palatin <vpalatin@chromium.org>
---
 arch/x86/include/asm/fpu-internal.h |   15 +++++++++------
 arch/x86/kernel/smpboot.c           |    5 +++++
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h
index 831dbb9..41ab26e 100644
--- a/arch/x86/include/asm/fpu-internal.h
+++ b/arch/x86/include/asm/fpu-internal.h
@@ -399,14 +399,17 @@ static inline void drop_init_fpu(struct task_struct *tsk)
 typedef struct { int preload; } fpu_switch_t;
 
 /*
- * FIXME! We could do a totally lazy restore, but we need to
- * add a per-cpu "this was the task that last touched the FPU
- * on this CPU" variable, and the task needs to have a "I last
- * touched the FPU on this CPU" and check them.
+ * Must be run with preemption disabled: this clears the fpu_owner_task,
+ * on this CPU.
  *
- * We don't do that yet, so "fpu_lazy_restore()" always returns
- * false, but some day..
+ * This will disable any lazy FPU state restore of the current FPU state,
+ * but if the current thread owns the FPU, it will still be saved by.
  */
+static inline void __cpu_disable_lazy_restore(unsigned int cpu)
+{
+	per_cpu(fpu_owner_task, cpu) = NULL;
+}
+
 static inline int fpu_lazy_restore(struct task_struct *new, unsigned int cpu)
 {
 	return new == this_cpu_read_stable(fpu_owner_task) &&
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index c80a33b..f3e2ec8 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -68,6 +68,8 @@
 #include <asm/mwait.h>
 #include <asm/apic.h>
 #include <asm/io_apic.h>
+#include <asm/i387.h>
+#include <asm/fpu-internal.h>
 #include <asm/setup.h>
 #include <asm/uv/uv.h>
 #include <linux/mc146818rtc.h>
@@ -818,6 +820,9 @@ int __cpuinit native_cpu_up(unsigned int cpu, struct task_struct *tidle)
 
 	per_cpu(cpu_state, cpu) = CPU_UP_PREPARE;
 
+	/* the FPU context is blank, nobody can own it */
+	__cpu_disable_lazy_restore(cpu);
+
 	err = do_boot_cpu(apicid, cpu, tidle);
 	if (err) {
 		pr_debug("do_boot_cpu failed %d\n", err);
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] x86, fpu: avoid FPU lazy restore after suspend
       [not found]           ` <CAP_ceTxmMhQeDi=x9HmYke85hKMg3_YhbXSnfDC12rOcocQJpA@mail.gmail.com>
@ 2012-11-30 19:55             ` H. Peter Anvin
  2012-11-30 21:45               ` Vincent Palatin
  0 siblings, 1 reply; 13+ messages in thread
From: H. Peter Anvin @ 2012-11-30 19:55 UTC (permalink / raw)
  To: Vincent Palatin
  Cc: Linus Torvalds, Ingo Molnar, Linux Kernel Mailing List,
	Thomas Gleixner, the arch/x86 maintainers, Peter Zijlstra,
	Jarkko Sakkinen, Duncan Laurie, Olof Johansson

On 11/30/2012 11:54 AM, Vincent Palatin wrote:
>>
> I have done a patch v2 according to your suggestions.
> I will run the testing on it now.
> I probably need at least 2 to 3 hours to validate it.
> 

That would be super.  Let me know and I'll queue it up and send a pull
request with this and a few more urgent things to Linus.

	-hpa



^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v3] x86, fpu: avoid FPU lazy restore after suspend
  2012-11-30 19:52         ` [PATCH v2] " Vincent Palatin
@ 2012-11-30 20:15           ` Vincent Palatin
  2012-11-30 22:10             ` [tip:x86/urgent] x86, fpu: Avoid " tip-bot for Vincent Palatin
  0 siblings, 1 reply; 13+ messages in thread
From: Vincent Palatin @ 2012-11-30 20:15 UTC (permalink / raw)
  To: Ingo Molnar, H. Peter Anvin, linux-kernel, Linus Torvalds
  Cc: Thomas Gleixner, x86, Peter Zijlstra, Jarkko Sakkinen,
	Vincent Palatin, Duncan Laurie, Olof Johansson

When a cpu enters S3 state, the FPU state is lost.
After resuming for S3, if we try to lazy restore the FPU for a process running
on the same CPU, this will result in a corrupted FPU context.

Ensure that "fpu_owner_task" is properly invalided when (re-)initializing a CPU,
so nobody will try to lazy restore a state which doesn't exist in the hardware.

Tested with a 64-bit kernel on a 4-core Ivybridge CPU with eagerfpu=off,
by doing thousands of suspend/resume cycles with 4 processes doing FPU
operations running. Without the patch, a process is killed after a
few hundreds cycles by a SIGFPE.

Cc: Duncan Laurie <dlaurie@chromium.org>
Cc: Olof Johansson <olofj@chromium.org>
Cc: <stable@kernel.org> [v3.4+] # for 3.4 need to replace this_cpu_write by percpu_write
Signed-off-by: Vincent Palatin <vpalatin@chromium.org>
---
Hi,

The patch updated according the HPA and Linus comments.
I'm still re-running the testing on v3.

Change in v3:
- remove misleading comment about 3.4 in the description.

Change in v2:
- add an helper function and comment in fpu-internal.h as described by Linus
- do the cleaning in the native_cpu_up function as suggested by HPA

Vincent

 arch/x86/include/asm/fpu-internal.h |   15 +++++++++------
 arch/x86/kernel/smpboot.c           |    5 +++++
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h
index 831dbb9..41ab26e 100644
--- a/arch/x86/include/asm/fpu-internal.h
+++ b/arch/x86/include/asm/fpu-internal.h
@@ -399,14 +399,17 @@ static inline void drop_init_fpu(struct task_struct *tsk)
 typedef struct { int preload; } fpu_switch_t;
 
 /*
- * FIXME! We could do a totally lazy restore, but we need to
- * add a per-cpu "this was the task that last touched the FPU
- * on this CPU" variable, and the task needs to have a "I last
- * touched the FPU on this CPU" and check them.
+ * Must be run with preemption disabled: this clears the fpu_owner_task,
+ * on this CPU.
  *
- * We don't do that yet, so "fpu_lazy_restore()" always returns
- * false, but some day..
+ * This will disable any lazy FPU state restore of the current FPU state,
+ * but if the current thread owns the FPU, it will still be saved by.
  */
+static inline void __cpu_disable_lazy_restore(unsigned int cpu)
+{
+	per_cpu(fpu_owner_task, cpu) = NULL;
+}
+
 static inline int fpu_lazy_restore(struct task_struct *new, unsigned int cpu)
 {
 	return new == this_cpu_read_stable(fpu_owner_task) &&
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index c80a33b..f3e2ec8 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -68,6 +68,8 @@
 #include <asm/mwait.h>
 #include <asm/apic.h>
 #include <asm/io_apic.h>
+#include <asm/i387.h>
+#include <asm/fpu-internal.h>
 #include <asm/setup.h>
 #include <asm/uv/uv.h>
 #include <linux/mc146818rtc.h>
@@ -818,6 +820,9 @@ int __cpuinit native_cpu_up(unsigned int cpu, struct task_struct *tidle)
 
 	per_cpu(cpu_state, cpu) = CPU_UP_PREPARE;
 
+	/* the FPU context is blank, nobody can own it */
+	__cpu_disable_lazy_restore(cpu);
+
 	err = do_boot_cpu(apicid, cpu, tidle);
 	if (err) {
 		pr_debug("do_boot_cpu failed %d\n", err);
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] x86, fpu: avoid FPU lazy restore after suspend
  2012-11-30 19:55             ` H. Peter Anvin
@ 2012-11-30 21:45               ` Vincent Palatin
  0 siblings, 0 replies; 13+ messages in thread
From: Vincent Palatin @ 2012-11-30 21:45 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Ingo Molnar, Linux Kernel Mailing List,
	Thomas Gleixner, the arch/x86 maintainers, Peter Zijlstra,
	Jarkko Sakkinen, Duncan Laurie, Olof Johansson

On Fri, Nov 30, 2012 at 11:55 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> On 11/30/2012 11:54 AM, Vincent Palatin wrote:
> >>
> > I have done a patch v2 according to your suggestions.
> > I will run the testing on it now.
> > I probably need at least 2 to 3 hours to validate it.
> >
>
> That would be super.  Let me know and I'll queue it up and send a pull
> request with this and a few more urgent things to Linus.


I have done 1000+ cycles so far with patch v3 (on 4-core Ivybridge and
no eagerfpu),
and did not hit my issue.
I let the testing going on,
but wrt the issue after suspend, this fixes it with very high probability
(ie I have never done that many cycles without hitting the issue).

-- 
Vincent

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [tip:x86/urgent] x86, fpu: Avoid FPU lazy restore after suspend
  2012-11-30 20:15           ` [PATCH v3] " Vincent Palatin
@ 2012-11-30 22:10             ` tip-bot for Vincent Palatin
  0 siblings, 0 replies; 13+ messages in thread
From: tip-bot for Vincent Palatin @ 2012-11-30 22:10 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, olofj, dlaurie, vpalatin, tglx, hpa

Commit-ID:  644c154186386bb1fa6446bc5e037b9ed098db46
Gitweb:     http://git.kernel.org/tip/644c154186386bb1fa6446bc5e037b9ed098db46
Author:     Vincent Palatin <vpalatin@chromium.org>
AuthorDate: Fri, 30 Nov 2012 12:15:32 -0800
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Fri, 30 Nov 2012 13:48:05 -0800

x86, fpu: Avoid FPU lazy restore after suspend

When a cpu enters S3 state, the FPU state is lost.
After resuming for S3, if we try to lazy restore the FPU for a process running
on the same CPU, this will result in a corrupted FPU context.

Ensure that "fpu_owner_task" is properly invalided when (re-)initializing a CPU,
so nobody will try to lazy restore a state which doesn't exist in the hardware.

Tested with a 64-bit kernel on a 4-core Ivybridge CPU with eagerfpu=off,
by doing thousands of suspend/resume cycles with 4 processes doing FPU
operations running. Without the patch, a process is killed after a
few hundreds cycles by a SIGFPE.

Cc: Duncan Laurie <dlaurie@chromium.org>
Cc: Olof Johansson <olofj@chromium.org>
Cc: <stable@kernel.org> v3.4+ # for 3.4 need to replace this_cpu_write by percpu_write
Signed-off-by: Vincent Palatin <vpalatin@chromium.org>
Link: http://lkml.kernel.org/r/1354306532-1014-1-git-send-email-vpalatin@chromium.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 arch/x86/include/asm/fpu-internal.h | 15 +++++++++------
 arch/x86/kernel/smpboot.c           |  5 +++++
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h
index 831dbb9..41ab26e 100644
--- a/arch/x86/include/asm/fpu-internal.h
+++ b/arch/x86/include/asm/fpu-internal.h
@@ -399,14 +399,17 @@ static inline void drop_init_fpu(struct task_struct *tsk)
 typedef struct { int preload; } fpu_switch_t;
 
 /*
- * FIXME! We could do a totally lazy restore, but we need to
- * add a per-cpu "this was the task that last touched the FPU
- * on this CPU" variable, and the task needs to have a "I last
- * touched the FPU on this CPU" and check them.
+ * Must be run with preemption disabled: this clears the fpu_owner_task,
+ * on this CPU.
  *
- * We don't do that yet, so "fpu_lazy_restore()" always returns
- * false, but some day..
+ * This will disable any lazy FPU state restore of the current FPU state,
+ * but if the current thread owns the FPU, it will still be saved by.
  */
+static inline void __cpu_disable_lazy_restore(unsigned int cpu)
+{
+	per_cpu(fpu_owner_task, cpu) = NULL;
+}
+
 static inline int fpu_lazy_restore(struct task_struct *new, unsigned int cpu)
 {
 	return new == this_cpu_read_stable(fpu_owner_task) &&
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index c80a33b..f3e2ec8 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -68,6 +68,8 @@
 #include <asm/mwait.h>
 #include <asm/apic.h>
 #include <asm/io_apic.h>
+#include <asm/i387.h>
+#include <asm/fpu-internal.h>
 #include <asm/setup.h>
 #include <asm/uv/uv.h>
 #include <linux/mc146818rtc.h>
@@ -818,6 +820,9 @@ int __cpuinit native_cpu_up(unsigned int cpu, struct task_struct *tidle)
 
 	per_cpu(cpu_state, cpu) = CPU_UP_PREPARE;
 
+	/* the FPU context is blank, nobody can own it */
+	__cpu_disable_lazy_restore(cpu);
+
 	err = do_boot_cpu(apicid, cpu, tidle);
 	if (err) {
 		pr_debug("do_boot_cpu failed %d\n", err);

^ permalink raw reply related	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2012-11-30 22:11 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-11-30 18:52 issue with x86 FPU state after suspend to ram Vincent Palatin
2012-11-30 18:52 ` [PATCH] x86, fpu: avoid FPU lazy restore after suspend Vincent Palatin
2012-11-30 18:57   ` H. Peter Anvin
2012-11-30 19:25   ` Linus Torvalds
2012-11-30 19:38     ` H. Peter Anvin
2012-11-30 19:41       ` Linus Torvalds
2012-11-30 19:51         ` H. Peter Anvin
     [not found]           ` <CAP_ceTxmMhQeDi=x9HmYke85hKMg3_YhbXSnfDC12rOcocQJpA@mail.gmail.com>
2012-11-30 19:55             ` H. Peter Anvin
2012-11-30 21:45               ` Vincent Palatin
2012-11-30 19:52         ` [PATCH v2] " Vincent Palatin
2012-11-30 20:15           ` [PATCH v3] " Vincent Palatin
2012-11-30 22:10             ` [tip:x86/urgent] x86, fpu: Avoid " tip-bot for Vincent Palatin
2012-11-30 19:26   ` tip-bot for Vincent Palatin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.