linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] x86: serialize LVTT and TSC_DEADLINE write
@ 2015-07-31 22:11 Shaohua Li
  2015-08-01 10:10 ` Thomas Gleixner
  0 siblings, 1 reply; 11+ messages in thread
From: Shaohua Li @ 2015-07-31 22:11 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Kernel-team, Suresh Siddha, Thomas Gleixner, H. Peter Anvin,
	Ingo Molnar, stable, v3.7+

We saw a strange issue with local APIC timer. Some random CPU doesn't
receive any local APIC timer interrupt, which causes different issues.
The cpu uses TSC-Deadline mode for local APIC timer and APIC is in xAPIC
mode. When this happens, manually writing TSC_DEADLINE MSR can trigger
interrupt again and the system goes normal.

Currently we only see this issue in E5-2660 v2 and E5-2680 v2 CPU.
Compiler version seems mattering too, it's quite easy to reproduce the
issue with v4.7 gcc.

Since the local APIC timer interrupt number is 0, we either lose the
first interrupt or TSC_DEADLINE MSR isn't set correctly. After some
debugging, we believe it's the serialize issue described in Intel SDM.
In xAPIC mode, write to APIC LVTT and write to TSC_DEADLINE isn't
serialized. Debug shows read TSC_DEADLINE MSR followed the very first
MSR write returns 0 in the buggy cpu.

The patch uses the algorithm Intel SDM described. The issue only happens
in xAPIC mode, but it doesn't bother to check the APIC mode I guess.
Without this patch, we see the issue after ~5 reboots. With it, we
don't see it after 24hr reboot test.

Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org v3.7+
Signed-off-by: Shaohua Li <shli@fb.com>
---
 arch/x86/kernel/apic/apic.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index dcb5285..b7890b3 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -336,6 +336,22 @@ static void __setup_APIC_LVTT(unsigned int clocks, int oneshot, int irqen)
 	apic_write(APIC_LVTT, lvtt_value);
 
 	if (lvtt_value & APIC_LVT_TIMER_TSCDEADLINE) {
+		u64 msr;
+
+		/*
+		 * See Intel SDM: TSC-Deadline Mode chapter. In xAPIC mode,
+		 * writing APIC LVTT and TSC_DEADLINE MSR isn't serialized.
+		 * This uses the algorithm described in Intel SDM to serialize
+		 * the two writes
+		 * */
+		while (1) {
+			wrmsrl(MSR_IA32_TSC_DEADLINE, -1L);
+			rdmsrl(MSR_IA32_TSC_DEADLINE, msr);
+			if (msr)
+				break;
+		}
+		wrmsrl(MSR_IA32_TSC_DEADLINE, 0);
+
 		printk_once(KERN_DEBUG "TSC deadline timer enabled\n");
 		return;
 	}
-- 
1.8.5.6


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] x86: serialize LVTT and TSC_DEADLINE write
  2015-07-31 22:11 [PATCH] x86: serialize LVTT and TSC_DEADLINE write Shaohua Li
@ 2015-08-01 10:10 ` Thomas Gleixner
  2015-08-02 15:49   ` Shaohua Li
  0 siblings, 1 reply; 11+ messages in thread
From: Thomas Gleixner @ 2015-08-01 10:10 UTC (permalink / raw)
  To: Shaohua Li
  Cc: x86, linux-kernel, Kernel-team, Suresh Siddha, H. Peter Anvin,
	Ingo Molnar, stable, v3.7+

On Fri, 31 Jul 2015, Shaohua Li wrote:
> @@ -336,6 +336,22 @@ static void __setup_APIC_LVTT(unsigned int clocks, int oneshot, int irqen)
>  	apic_write(APIC_LVTT, lvtt_value);
>  
>  	if (lvtt_value & APIC_LVT_TIMER_TSCDEADLINE) {
> +		u64 msr;
> +
> +		/*
> +		 * See Intel SDM: TSC-Deadline Mode chapter. In xAPIC mode,
> +		 * writing APIC LVTT and TSC_DEADLINE MSR isn't serialized.
> +		 * This uses the algorithm described in Intel SDM to serialize
> +		 * the two writes
> +		 * */
> +		while (1) {
> +			wrmsrl(MSR_IA32_TSC_DEADLINE, -1L);
> +			rdmsrl(MSR_IA32_TSC_DEADLINE, msr);
> +			if (msr)
> +				break;
> +		}
> +		wrmsrl(MSR_IA32_TSC_DEADLINE, 0);


I think this is exceptionally silly. A proper fence after the
apic_write() should have the same effect.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] x86: serialize LVTT and TSC_DEADLINE write
  2015-08-01 10:10 ` Thomas Gleixner
@ 2015-08-02 15:49   ` Shaohua Li
  2015-08-02 19:41     ` Thomas Gleixner
  0 siblings, 1 reply; 11+ messages in thread
From: Shaohua Li @ 2015-08-02 15:49 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: x86, linux-kernel, Kernel-team, Suresh Siddha, H. Peter Anvin,
	Ingo Molnar, stable

On Sat, Aug 01, 2015 at 12:10:41PM +0200, Thomas Gleixner wrote:
> On Fri, 31 Jul 2015, Shaohua Li wrote:
> > @@ -336,6 +336,22 @@ static void __setup_APIC_LVTT(unsigned int clocks, int oneshot, int irqen)
> >  	apic_write(APIC_LVTT, lvtt_value);
> >  
> >  	if (lvtt_value & APIC_LVT_TIMER_TSCDEADLINE) {
> > +		u64 msr;
> > +
> > +		/*
> > +		 * See Intel SDM: TSC-Deadline Mode chapter. In xAPIC mode,
> > +		 * writing APIC LVTT and TSC_DEADLINE MSR isn't serialized.
> > +		 * This uses the algorithm described in Intel SDM to serialize
> > +		 * the two writes
> > +		 * */
> > +		while (1) {
> > +			wrmsrl(MSR_IA32_TSC_DEADLINE, -1L);
> > +			rdmsrl(MSR_IA32_TSC_DEADLINE, msr);
> > +			if (msr)
> > +				break;
> > +		}
> > +		wrmsrl(MSR_IA32_TSC_DEADLINE, 0);
> 
> 
> I think this is exceptionally silly. A proper fence after the
> apic_write() should have the same effect.

Not sure what happens in the hardware, I could have a try of fence, but
I'd prefer using the algorithm Intel described. This is not a fast path,
the loop will exit immediately regardless the issue occurs anyway.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] x86: serialize LVTT and TSC_DEADLINE write
  2015-08-02 15:49   ` Shaohua Li
@ 2015-08-02 19:41     ` Thomas Gleixner
  2015-08-03 23:58       ` Shaohua Li
  0 siblings, 1 reply; 11+ messages in thread
From: Thomas Gleixner @ 2015-08-02 19:41 UTC (permalink / raw)
  To: Shaohua Li
  Cc: x86, linux-kernel, Kernel-team, Suresh Siddha, H. Peter Anvin,
	Ingo Molnar, stable

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1920 bytes --]

On Sun, 2 Aug 2015, Shaohua Li wrote:

> On Sat, Aug 01, 2015 at 12:10:41PM +0200, Thomas Gleixner wrote:
> > On Fri, 31 Jul 2015, Shaohua Li wrote:
> > > @@ -336,6 +336,22 @@ static void __setup_APIC_LVTT(unsigned int clocks, int oneshot, int irqen)
> > >  	apic_write(APIC_LVTT, lvtt_value);
> > >  
> > >  	if (lvtt_value & APIC_LVT_TIMER_TSCDEADLINE) {
> > > +		u64 msr;
> > > +
> > > +		/*
> > > +		 * See Intel SDM: TSC-Deadline Mode chapter. In xAPIC mode,
> > > +		 * writing APIC LVTT and TSC_DEADLINE MSR isn't serialized.
> > > +		 * This uses the algorithm described in Intel SDM to serialize
> > > +		 * the two writes
> > > +		 * */
> > > +		while (1) {
> > > +			wrmsrl(MSR_IA32_TSC_DEADLINE, -1L);
> > > +			rdmsrl(MSR_IA32_TSC_DEADLINE, msr);
> > > +			if (msr)
> > > +				break;
> > > +		}
> > > +		wrmsrl(MSR_IA32_TSC_DEADLINE, 0);
> > 
> > 
> > I think this is exceptionally silly. A proper fence after the
> > apic_write() should have the same effect.
> 
> Not sure what happens in the hardware, I could have a try of fence, but
> I'd prefer using the algorithm Intel described. This is not a fast path,

s/algorithm/voodoo/

> the loop will exit immediately regardless the issue occurs anyway.

Well, the SDM also says:

 "To allow for efficient access to the APIC registers in x2APIC mode,
  the serializing semantics of WRMSR are relaxed when writing to the
  APIC registers. Thus, system software should not use “WRMSR to APIC
  registers in x2APIC mode” as a serializing instruction. Read and write
  accesses to the APIC registers will occur in program order. A WRMSR to
  an APIC register may complete before all preceding stores are globally
  visible; software can prevent this by inserting a serializing
  instruction, an SFENCE, or an MFENCE before the WRMSR."

And that's what happens here. The write to the LVT has not yet hit the
APIC, so the WRMSR has no effect.

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] x86: serialize LVTT and TSC_DEADLINE write
  2015-08-02 19:41     ` Thomas Gleixner
@ 2015-08-03 23:58       ` Shaohua Li
  2015-08-05  8:44         ` Ingo Molnar
  0 siblings, 1 reply; 11+ messages in thread
From: Shaohua Li @ 2015-08-03 23:58 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: x86, linux-kernel, Kernel-team, Suresh Siddha, H. Peter Anvin,
	Ingo Molnar, stable, ak, lenb, fenghua.yu

On Sun, Aug 02, 2015 at 09:41:08PM +0200, Thomas Gleixner wrote:
> On Sun, 2 Aug 2015, Shaohua Li wrote:
> 
> > On Sat, Aug 01, 2015 at 12:10:41PM +0200, Thomas Gleixner wrote:
> > > On Fri, 31 Jul 2015, Shaohua Li wrote:
> > > > @@ -336,6 +336,22 @@ static void __setup_APIC_LVTT(unsigned int clocks, int oneshot, int irqen)
> > > >  	apic_write(APIC_LVTT, lvtt_value);
> > > >  
> > > >  	if (lvtt_value & APIC_LVT_TIMER_TSCDEADLINE) {
> > > > +		u64 msr;
> > > > +
> > > > +		/*
> > > > +		 * See Intel SDM: TSC-Deadline Mode chapter. In xAPIC mode,
> > > > +		 * writing APIC LVTT and TSC_DEADLINE MSR isn't serialized.
> > > > +		 * This uses the algorithm described in Intel SDM to serialize
> > > > +		 * the two writes
> > > > +		 * */
> > > > +		while (1) {
> > > > +			wrmsrl(MSR_IA32_TSC_DEADLINE, -1L);
> > > > +			rdmsrl(MSR_IA32_TSC_DEADLINE, msr);
> > > > +			if (msr)
> > > > +				break;
> > > > +		}
> > > > +		wrmsrl(MSR_IA32_TSC_DEADLINE, 0);
> > > 
> > > 
> > > I think this is exceptionally silly. A proper fence after the
> > > apic_write() should have the same effect.
> > 
> > Not sure what happens in the hardware, I could have a try of fence, but
> > I'd prefer using the algorithm Intel described. This is not a fast path,
> 
> s/algorithm/voodoo/
> 
> > the loop will exit immediately regardless the issue occurs anyway.
> 
> Well, the SDM also says:
> 
>  "To allow for efficient access to the APIC registers in x2APIC mode,
>   the serializing semantics of WRMSR are relaxed when writing to the
>   APIC registers. Thus, system software should not use “WRMSR to APIC
>   registers in x2APIC mode” as a serializing instruction. Read and write
>   accesses to the APIC registers will occur in program order. A WRMSR to
>   an APIC register may complete before all preceding stores are globally
>   visible; software can prevent this by inserting a serializing
>   instruction, an SFENCE, or an MFENCE before the WRMSR."
> 
> And that's what happens here. The write to the LVT has not yet hit the
> APIC, so the WRMSR has no effect.

What you quoted is for x2APIC, I didn't see similar description for
xAPIC.

Tested mfence here, it does work. But I'm not convinced it's the right
thing. the xAPIC access is memory mapped IO, mfence is nothing related
to it. Anyway, cc-ed more intel people, hope they can share some
insights.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] x86: serialize LVTT and TSC_DEADLINE write
  2015-08-03 23:58       ` Shaohua Li
@ 2015-08-05  8:44         ` Ingo Molnar
  2015-08-05 16:25           ` Shaohua Li
  0 siblings, 1 reply; 11+ messages in thread
From: Ingo Molnar @ 2015-08-05  8:44 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Thomas Gleixner, x86, linux-kernel, Kernel-team, Suresh Siddha,
	H. Peter Anvin, stable, ak, lenb, fenghua.yu


* Shaohua Li <shli@fb.com> wrote:

> On Sun, Aug 02, 2015 at 09:41:08PM +0200, Thomas Gleixner wrote:
> > On Sun, 2 Aug 2015, Shaohua Li wrote:
> > 
> > > On Sat, Aug 01, 2015 at 12:10:41PM +0200, Thomas Gleixner wrote:
> > > > On Fri, 31 Jul 2015, Shaohua Li wrote:
> > > > > @@ -336,6 +336,22 @@ static void __setup_APIC_LVTT(unsigned int clocks, int oneshot, int irqen)
> > > > >  	apic_write(APIC_LVTT, lvtt_value);
> > > > >  
> > > > >  	if (lvtt_value & APIC_LVT_TIMER_TSCDEADLINE) {
> > > > > +		u64 msr;
> > > > > +
> > > > > +		/*
> > > > > +		 * See Intel SDM: TSC-Deadline Mode chapter. In xAPIC mode,
> > > > > +		 * writing APIC LVTT and TSC_DEADLINE MSR isn't serialized.
> > > > > +		 * This uses the algorithm described in Intel SDM to serialize
> > > > > +		 * the two writes
> > > > > +		 * */
> > > > > +		while (1) {
> > > > > +			wrmsrl(MSR_IA32_TSC_DEADLINE, -1L);
> > > > > +			rdmsrl(MSR_IA32_TSC_DEADLINE, msr);
> > > > > +			if (msr)
> > > > > +				break;
> > > > > +		}
> > > > > +		wrmsrl(MSR_IA32_TSC_DEADLINE, 0);
> > > > 
> > > > 
> > > > I think this is exceptionally silly. A proper fence after the
> > > > apic_write() should have the same effect.
> > > 
> > > Not sure what happens in the hardware, I could have a try of fence, but
> > > I'd prefer using the algorithm Intel described. This is not a fast path,
> > 
> > s/algorithm/voodoo/
> > 
> > > the loop will exit immediately regardless the issue occurs anyway.
> > 
> > Well, the SDM also says:
> > 
> >  "To allow for efficient access to the APIC registers in x2APIC mode,
> >   the serializing semantics of WRMSR are relaxed when writing to the
> >   APIC registers. Thus, system software should not use “WRMSR to APIC
> >   registers in x2APIC mode” as a serializing instruction. Read and write
> >   accesses to the APIC registers will occur in program order. A WRMSR to
> >   an APIC register may complete before all preceding stores are globally
> >   visible; software can prevent this by inserting a serializing
> >   instruction, an SFENCE, or an MFENCE before the WRMSR."
> > 
> > And that's what happens here. The write to the LVT has not yet hit the
> > APIC, so the WRMSR has no effect.
> 
> What you quoted is for x2APIC, I didn't see similar description for
> xAPIC.
> 
> Tested mfence here, it does work. But I'm not convinced it's the right thing. 
> the xAPIC access is memory mapped IO, mfence is nothing related to it. [...]

Huh?

Can you cite the SDM that says that MFENCE will have no effect on instructions 
that deal with accesses that are not to system RAM (i.e. are memory mapped IO)?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] x86: serialize LVTT and TSC_DEADLINE write
  2015-08-05  8:44         ` Ingo Molnar
@ 2015-08-05 16:25           ` Shaohua Li
  2015-09-09  3:39             ` Andi Kleen
  0 siblings, 1 reply; 11+ messages in thread
From: Shaohua Li @ 2015-08-05 16:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, x86, linux-kernel, Kernel-team, Suresh Siddha,
	H. Peter Anvin, stable, ak, lenb, fenghua.yu

On Wed, Aug 05, 2015 at 10:44:24AM +0200, Ingo Molnar wrote:
> 
> * Shaohua Li <shli@fb.com> wrote:
> 
> > On Sun, Aug 02, 2015 at 09:41:08PM +0200, Thomas Gleixner wrote:
> > > On Sun, 2 Aug 2015, Shaohua Li wrote:
> > > 
> > > > On Sat, Aug 01, 2015 at 12:10:41PM +0200, Thomas Gleixner wrote:
> > > > > On Fri, 31 Jul 2015, Shaohua Li wrote:
> > > > > > @@ -336,6 +336,22 @@ static void __setup_APIC_LVTT(unsigned int clocks, int oneshot, int irqen)
> > > > > >  	apic_write(APIC_LVTT, lvtt_value);
> > > > > >  
> > > > > >  	if (lvtt_value & APIC_LVT_TIMER_TSCDEADLINE) {
> > > > > > +		u64 msr;
> > > > > > +
> > > > > > +		/*
> > > > > > +		 * See Intel SDM: TSC-Deadline Mode chapter. In xAPIC mode,
> > > > > > +		 * writing APIC LVTT and TSC_DEADLINE MSR isn't serialized.
> > > > > > +		 * This uses the algorithm described in Intel SDM to serialize
> > > > > > +		 * the two writes
> > > > > > +		 * */
> > > > > > +		while (1) {
> > > > > > +			wrmsrl(MSR_IA32_TSC_DEADLINE, -1L);
> > > > > > +			rdmsrl(MSR_IA32_TSC_DEADLINE, msr);
> > > > > > +			if (msr)
> > > > > > +				break;
> > > > > > +		}
> > > > > > +		wrmsrl(MSR_IA32_TSC_DEADLINE, 0);
> > > > > 
> > > > > 
> > > > > I think this is exceptionally silly. A proper fence after the
> > > > > apic_write() should have the same effect.
> > > > 
> > > > Not sure what happens in the hardware, I could have a try of fence, but
> > > > I'd prefer using the algorithm Intel described. This is not a fast path,
> > > 
> > > s/algorithm/voodoo/
> > > 
> > > > the loop will exit immediately regardless the issue occurs anyway.
> > > 
> > > Well, the SDM also says:
> > > 
> > >  "To allow for efficient access to the APIC registers in x2APIC mode,
> > >   the serializing semantics of WRMSR are relaxed when writing to the
> > >   APIC registers. Thus, system software should not use “WRMSR to APIC
> > >   registers in x2APIC mode” as a serializing instruction. Read and write
> > >   accesses to the APIC registers will occur in program order. A WRMSR to
> > >   an APIC register may complete before all preceding stores are globally
> > >   visible; software can prevent this by inserting a serializing
> > >   instruction, an SFENCE, or an MFENCE before the WRMSR."
> > > 
> > > And that's what happens here. The write to the LVT has not yet hit the
> > > APIC, so the WRMSR has no effect.
> > 
> > What you quoted is for x2APIC, I didn't see similar description for
> > xAPIC.
> > 
> > Tested mfence here, it does work. But I'm not convinced it's the right thing. 
> > the xAPIC access is memory mapped IO, mfence is nothing related to it. [...]
> 
> Huh?
> 
> Can you cite the SDM that says that MFENCE will have no effect on instructions 
> that deal with accesses that are not to system RAM (i.e. are memory mapped IO)?

Hmm, I didn't mean mfence can't serialize the instructions. For a true
IO, a serialization can't guarantee device finishes the IO, we generally
read some safe IO registers to wait IO finish. I completely don't know
if this case fits here though.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] x86: serialize LVTT and TSC_DEADLINE write
  2015-08-05 16:25           ` Shaohua Li
@ 2015-09-09  3:39             ` Andi Kleen
  2015-09-09  4:13               ` Shaohua Li
  2015-09-09  7:35               ` [PATCH] x86: serialize LVTT and TSC_DEADLINE write Thomas Gleixner
  0 siblings, 2 replies; 11+ messages in thread
From: Andi Kleen @ 2015-09-09  3:39 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Ingo Molnar, Thomas Gleixner, x86, linux-kernel, Kernel-team,
	Suresh Siddha, H. Peter Anvin, stable, lenb, fenghua.yu

> Hmm, I didn't mean mfence can't serialize the instructions. For a true
> IO, a serialization can't guarantee device finishes the IO, we generally
> read some safe IO registers to wait IO finish. I completely don't know
> if this case fits here though.

Sorry for the late answer. We (Intel) analyzed this case in detail and
can confirm that the following sequence

1.   Memory-mapped write to LVT Timer Register, setting bits 18:17 to 10b.
23. MFENCE.
4.   WRMSR to the IA32_TSC_DEADLINE MSR the desired deadline.

has the same effect as the loop algorithm described in the SDM on all Intel
CPUs. So it's fine to use MFENCE here.

-Andi

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] x86: serialize LVTT and TSC_DEADLINE write
  2015-09-09  3:39             ` Andi Kleen
@ 2015-09-09  4:13               ` Shaohua Li
  2015-09-14 20:06                 ` [tip:x86/urgent] x86/apic: Serialize LVTT and TSC_DEADLINE writes tip-bot for Shaohua Li
  2015-09-09  7:35               ` [PATCH] x86: serialize LVTT and TSC_DEADLINE write Thomas Gleixner
  1 sibling, 1 reply; 11+ messages in thread
From: Shaohua Li @ 2015-09-09  4:13 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, Thomas Gleixner, x86, linux-kernel, Kernel-team,
	Suresh Siddha, H. Peter Anvin, stable, lenb, fenghua.yu

On Tue, Sep 08, 2015 at 08:39:37PM -0700, Andi Kleen wrote:
> > Hmm, I didn't mean mfence can't serialize the instructions. For a true
> > IO, a serialization can't guarantee device finishes the IO, we generally
> > read some safe IO registers to wait IO finish. I completely don't know
> > if this case fits here though.
> 
> Sorry for the late answer. We (Intel) analyzed this case in detail and
> can confirm that the following sequence
> 
> 1.   Memory-mapped write to LVT Timer Register, setting bits 18:17 to 10b.
> 23. MFENCE.
> 4.   WRMSR to the IA32_TSC_DEADLINE MSR the desired deadline.
> 
> has the same effect as the loop algorithm described in the SDM on all Intel
> CPUs. So it's fine to use MFENCE here.

Thank you very much, Andi!

Here is the updated patch.


>From b5cdd3c14b0589e5d5a72d607814d9e808d05765 Mon Sep 17 00:00:00 2001
Message-Id: <b5cdd3c14b0589e5d5a72d607814d9e808d05765.1441771848.git.shli@fb.com>
From: Shaohua Li <shli@fb.com>
Date: Thu, 30 Jul 2015 16:24:43 -0700
Subject: [PATCH] x86: serialize LVTT and TSC_DEADLINE write

We saw a strange issue with local APIC timer. Some random CPU doesn't
receive any local APIC timer interrupt, which causes different issues.
The cpu uses TSC-Deadline mode for local APIC timer and APIC is in xAPIC
mode. When this happens, manually writing TSC_DEADLINE MSR can trigger
interrupt again and the system goes normal.

Currently we only see this issue in E5-2660 v2 and E5-2680 v2 CPU.
Compiler version seems mattering too, it's quite easy to reproduce the
issue with v4.7 gcc.

Since the local APIC timer interrupt number is 0, we either lose the
first interrupt or TSC_DEADLINE MSR isn't set correctly. After some
debugging, we believe it's the serialize issue described in Intel SDM.
In xAPIC mode, write to APIC LVTT and write to TSC_DEADLINE isn't
serialized. Debug shows read TSC_DEADLINE MSR followed the very first
MSR write returns 0 in the buggy cpu.

Thanks Andi Kleen confirms a 'mfence' can do the serialization here.
'mfence' instruction should be supported by CPU with TSC_DEADLINE, so we
don't need to check the instruction availability.

Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: stable@vger.kernel.org v3.7+
Signed-off-by: Shaohua Li <shli@fb.com>
---
 arch/x86/kernel/apic/apic.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 3ca3e46..e3fd4ab 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -336,6 +336,13 @@ static void __setup_APIC_LVTT(unsigned int clocks, int oneshot, int irqen)
 	apic_write(APIC_LVTT, lvtt_value);
 
 	if (lvtt_value & APIC_LVT_TIMER_TSCDEADLINE) {
+		/*
+		 * See Intel SDM: TSC-Deadline Mode chapter. In xAPIC mode,
+		 * writing APIC LVTT and TSC_DEADLINE MSR isn't serialized.
+		 * According to Intel, mfence can do the serialization here.
+		 * */
+		asm volatile("mfence" : : : "memory");
+
 		printk_once(KERN_DEBUG "TSC deadline timer enabled\n");
 		return;
 	}
-- 
1.8.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] x86: serialize LVTT and TSC_DEADLINE write
  2015-09-09  3:39             ` Andi Kleen
  2015-09-09  4:13               ` Shaohua Li
@ 2015-09-09  7:35               ` Thomas Gleixner
  1 sibling, 0 replies; 11+ messages in thread
From: Thomas Gleixner @ 2015-09-09  7:35 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Shaohua Li, Ingo Molnar, x86, linux-kernel, Kernel-team,
	Suresh Siddha, H. Peter Anvin, stable, lenb, fenghua.yu

Andi,

On Tue, 8 Sep 2015, Andi Kleen wrote:

> > Hmm, I didn't mean mfence can't serialize the instructions. For a true
> > IO, a serialization can't guarantee device finishes the IO, we generally
> > read some safe IO registers to wait IO finish. I completely don't know
> > if this case fits here though.
> 
> Sorry for the late answer. We (Intel) analyzed this case in detail and
> can confirm that the following sequence
> 
> 1.   Memory-mapped write to LVT Timer Register, setting bits 18:17 to 10b.
> 23. MFENCE.
> 4.   WRMSR to the IA32_TSC_DEADLINE MSR the desired deadline.
> 
> has the same effect as the loop algorithm described in the SDM on all Intel
> CPUs. So it's fine to use MFENCE here.

Thanks for the confirmation. It would have been really surprising if
that wouldn't work.

Thanks,

	tglx



^ permalink raw reply	[flat|nested] 11+ messages in thread

* [tip:x86/urgent] x86/apic: Serialize LVTT and TSC_DEADLINE writes
  2015-09-09  4:13               ` Shaohua Li
@ 2015-09-14 20:06                 ` tip-bot for Shaohua Li
  0 siblings, 0 replies; 11+ messages in thread
From: tip-bot for Shaohua Li @ 2015-09-14 20:06 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, Kernel-team, mingo, fenghua.yu, tglx, ak, lenb, linux-kernel, shli

Commit-ID:  5d7c631d926b59aa16f3c56eaeb83f1036c81dc7
Gitweb:     http://git.kernel.org/tip/5d7c631d926b59aa16f3c56eaeb83f1036c81dc7
Author:     Shaohua Li <shli@fb.com>
AuthorDate: Thu, 30 Jul 2015 16:24:43 -0700
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Mon, 14 Sep 2015 18:29:59 +0200

x86/apic: Serialize LVTT and TSC_DEADLINE writes

The APIC LVTT register is MMIO mapped but the TSC_DEADLINE register is an
MSR. The write to the TSC_DEADLINE MSR is not serializing, so it's not
guaranteed that the write to LVTT has reached the APIC before the
TSC_DEADLINE MSR is written. In such a case the write to the MSR is
ignored and as a consequence the local timer interrupt never fires.

The SDM decribes this issue for xAPIC and x2APIC modes. The
serialization methods recommended by the SDM differ.

xAPIC:
 "1. Memory-mapped write to LVT Timer Register, setting bits 18:17 to 10b.
  2. WRMSR to the IA32_TSC_DEADLINE MSR a value much larger than current time-stamp counter.
  3. If RDMSR of the IA32_TSC_DEADLINE MSR returns zero, go to step 2.
  4. WRMSR to the IA32_TSC_DEADLINE MSR the desired deadline."

x2APIC:
 "To allow for efficient access to the APIC registers in x2APIC mode,
  the serializing semantics of WRMSR are relaxed when writing to the
  APIC registers. Thus, system software should not use 'WRMSR to APIC
  registers in x2APIC mode' as a serializing instruction. Read and write
  accesses to the APIC registers will occur in program order. A WRMSR to
  an APIC register may complete before all preceding stores are globally
  visible; software can prevent this by inserting a serializing
  instruction, an SFENCE, or an MFENCE before the WRMSR."

The xAPIC method is to just wait for the memory mapped write to hit
the LVTT by checking whether the MSR write has reached the hardware.
There is no reason why a proper MFENCE after the memory mapped write would
not do the same. Andi Kleen confirmed that MFENCE is sufficient for the
xAPIC case as well.

Issue MFENCE before writing to the TSC_DEADLINE MSR. This can be done
unconditionally as all CPUs which have TSC_DEADLINE also have MFENCE
support.

[ tglx: Massaged the changelog ]

Signed-off-by: Shaohua Li <shli@fb.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: <Kernel-team@fb.com>
Cc: <lenb@kernel.org>
Cc: <fenghua.yu@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: stable@vger.kernel.org #v3.7+
Link: http://lkml.kernel.org/r/20150909041352.GA2059853@devbig257.prn2.facebook.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/apic/apic.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 3ca3e46..24e94ce 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -336,6 +336,13 @@ static void __setup_APIC_LVTT(unsigned int clocks, int oneshot, int irqen)
 	apic_write(APIC_LVTT, lvtt_value);
 
 	if (lvtt_value & APIC_LVT_TIMER_TSCDEADLINE) {
+		/*
+		 * See Intel SDM: TSC-Deadline Mode chapter. In xAPIC mode,
+		 * writing to the APIC LVTT and TSC_DEADLINE MSR isn't serialized.
+		 * According to Intel, MFENCE can do the serialization here.
+		 */
+		asm volatile("mfence" : : : "memory");
+
 		printk_once(KERN_DEBUG "TSC deadline timer enabled\n");
 		return;
 	}

^ permalink raw reply related	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2015-09-14 20:07 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-07-31 22:11 [PATCH] x86: serialize LVTT and TSC_DEADLINE write Shaohua Li
2015-08-01 10:10 ` Thomas Gleixner
2015-08-02 15:49   ` Shaohua Li
2015-08-02 19:41     ` Thomas Gleixner
2015-08-03 23:58       ` Shaohua Li
2015-08-05  8:44         ` Ingo Molnar
2015-08-05 16:25           ` Shaohua Li
2015-09-09  3:39             ` Andi Kleen
2015-09-09  4:13               ` Shaohua Li
2015-09-14 20:06                 ` [tip:x86/urgent] x86/apic: Serialize LVTT and TSC_DEADLINE writes tip-bot for Shaohua Li
2015-09-09  7:35               ` [PATCH] x86: serialize LVTT and TSC_DEADLINE write Thomas Gleixner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).