Regression: unable to boot after commit bd9240a18edf ("x86/apic: Add TSC_DEADLINE quirk due to errata") - Surface Pro 4 SKL
diff mbox series

Message ID 1511834933.2498.14.camel@intel.com
State New, archived
Headers show
Series
  • Regression: unable to boot after commit bd9240a18edf ("x86/apic: Add TSC_DEADLINE quirk due to errata") - Surface Pro 4 SKL
Related show

Commit Message

Zhang, Rui Nov. 28, 2017, 2:08 a.m. UTC
Hi, All,

My Surface Pro 4 is unable to boot after 4.12. The symptom is that
kernel freezes during boot, and the last message in the screen is
loading the initrd image. And I have bisected it to this commit

commit bd9240a18edfbfa72e957fc2ba831cf1f13ea073 (refs/bisect/bad)
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed May 31 17:52:03 2017 +0200
Commit:     Thomas Gleixner <tglx@linutronix.de>
CommitDate: Sun Jun 4 21:55:53 2017 +0200

    x86/apic: Add TSC_DEADLINE quirk due to errata
    
    Due to errata it is possible for the TSC_DEADLINE timer to
misbehave
    after using TSC_ADJUST. A microcode update is available to fix this
    situation.
    
    Avoid using the TSC_DEADLINE timer if it is affected by this issue
and
    report the required microcode version.
    
    [ tglx: Renamed function to apic_check_deadline_errata() ]
    
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: kevin.b.stanton@intel.com
    Link: http://lkml.kernel.org/r/20170531155306.050849877@infradead.o
rg
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Currently, I'm using v4.14 kernel with the following workaround on top.

---
 arch/x86/kernel/apic/apic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-- 
2.7.4

I'm reading the related code but have not figured out the root cause yet.
Anyone suggestions?

thanks,
rui

Comments

Peter Zijlstra Nov. 28, 2017, 8:14 a.m. UTC | #1
On Tue, Nov 28, 2017 at 10:08:53AM +0800, Zhang Rui wrote:
> Hi, All,
> 
> My Surface Pro 4 is unable to boot after 4.12. The symptom is that
> kernel freezes during boot, and the last message in the screen is
> loading the initrd image. And I have bisected it to this commit

> +//	DEADLINE_MODEL_MATCH_REV ( INTEL_FAM6_SKYLAKE_MOBILE,	0xb2),

And what microcode version do you run? Have you installed the latest
microcode package and updated your initrd to include it? My skylake is
running 0xba.
Zhang, Rui Nov. 28, 2017, 8:22 a.m. UTC | #2
On Tue, 2017-11-28 at 09:14 +0100, Peter Zijlstra wrote:
> On Tue, Nov 28, 2017 at 10:08:53AM +0800, Zhang Rui wrote:
> > 
> > Hi, All,
> > 
> > My Surface Pro 4 is unable to boot after 4.12. The symptom is that
> > kernel freezes during boot, and the last message in the screen is
> > loading the initrd image. And I have bisected it to this commit
> > 
> > +//	DEADLINE_MODEL_MATCH_REV ( INTEL_FAM6_SKYLAKE_MOBILE,	
> > 0xb2),
> And what microcode version do you run? Have you installed the latest
> microcode package and updated your initrd to include it? My skylake
> is
> running 0xba.

No, I didn't upgrade my microcode.

$ cat /proc/cpuinfo
...
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 78
model name	: Intel(R) Core(TM) i7-6650U CPU @ 2.20GHz
stepping	: 3
microcode	: 0x9e
...

I suppose the problem should be gone if I upgrade the microcode.
But the real problem to me is that the system FREEZES after kernel
upgrade.

thanks,
rui
Thomas Gleixner Nov. 28, 2017, 9:34 a.m. UTC | #3
On Tue, 28 Nov 2017, Zhang Rui wrote:

> On Tue, 2017-11-28 at 09:14 +0100, Peter Zijlstra wrote:
> > On Tue, Nov 28, 2017 at 10:08:53AM +0800, Zhang Rui wrote:
> > > 
> > > Hi, All,
> > > 
> > > My Surface Pro 4 is unable to boot after 4.12. The symptom is that
> > > kernel freezes during boot, and the last message in the screen is
> > > loading the initrd image. And I have bisected it to this commit
> > > 
> > > +//	DEADLINE_MODEL_MATCH_REV ( INTEL_FAM6_SKYLAKE_MOBILE,	
> > > 0xb2),
> > And what microcode version do you run? Have you installed the latest
> > microcode package and updated your initrd to include it? My skylake
> > is
> > running 0xba.
> 
> No, I didn't upgrade my microcode.
> 
> $ cat /proc/cpuinfo
> ...
> processor	: 0
> vendor_id	: GenuineIntel
> cpu family	: 6
> model		: 78
> model name	: Intel(R) Core(TM) i7-6650U CPU @ 2.20GHz
> stepping	: 3
> microcode	: 0x9e
> ...
> 
> I suppose the problem should be gone if I upgrade the microcode.
> But the real problem to me is that the system FREEZES after kernel
> upgrade.

Confused. So the match disables the deadline timer due to 0x9e < 0xb2. And
not having deadline timer makes the boot fail despite the fact that
deadline timer is borked with microcode < 0xb2.

Can you verify by adding 'lapic=notscdeadline' to the command line of a
'working' kernel? That should cause a freeze as well then.

Thanks,

	tglx
Peter Zijlstra Nov. 28, 2017, 9:35 a.m. UTC | #4
On Tue, Nov 28, 2017 at 04:22:15PM +0800, Zhang Rui wrote:
> On Tue, 2017-11-28 at 09:14 +0100, Peter Zijlstra wrote:
> > On Tue, Nov 28, 2017 at 10:08:53AM +0800, Zhang Rui wrote:
> > > 
> > > Hi, All,
> > > 
> > > My Surface Pro 4 is unable to boot after 4.12. The symptom is that
> > > kernel freezes during boot, and the last message in the screen is
> > > loading the initrd image. And I have bisected it to this commit
> > > 
> > > +//	DEADLINE_MODEL_MATCH_REV ( INTEL_FAM6_SKYLAKE_MOBILE,	
> > > 0xb2),
> > And what microcode version do you run? Have you installed the latest
> > microcode package and updated your initrd to include it? My skylake
> > is
> > running 0xba.
> 
> No, I didn't upgrade my microcode.
> 
> $ cat /proc/cpuinfo
> ...
> processor	: 0
> vendor_id	: GenuineIntel
> cpu family	: 6
> model		: 78
> model name	: Intel(R) Core(TM) i7-6650U CPU @ 2.20GHz
> stepping	: 3
> microcode	: 0x9e
> ...
> 
> I suppose the problem should be gone if I upgrade the microcode.
> But the real problem to me is that the system FREEZES after kernel
> upgrade.

Yes, that is weird and unexpected. The patch basically kills
TSC_DEADLINE usage on older microcode. This makes the kernel then use
the old LAPIC timer mode.

Does your machine boot with "notscdeadline" ? Which _should_ be the
same.
Zhang, Rui Nov. 28, 2017, 10:59 a.m. UTC | #5
On Tue, 2017-11-28 at 10:34 +0100, Thomas Gleixner wrote:
> On Tue, 28 Nov 2017, Zhang Rui wrote:
> 
> > 
> > On Tue, 2017-11-28 at 09:14 +0100, Peter Zijlstra wrote:
> > > 
> > > On Tue, Nov 28, 2017 at 10:08:53AM +0800, Zhang Rui wrote:
> > > > 
> > > > 
> > > > Hi, All,
> > > > 
> > > > My Surface Pro 4 is unable to boot after 4.12. The symptom is
> > > > that
> > > > kernel freezes during boot, and the last message in the screen
> > > > is
> > > > loading the initrd image. And I have bisected it to this commit
> > > > 
> > > > +//	DEADLINE_MODEL_MATCH_REV (
> > > > INTEL_FAM6_SKYLAKE_MOBILE,	
> > > > 0xb2),
> > > And what microcode version do you run? Have you installed the
> > > latest
> > > microcode package and updated your initrd to include it? My
> > > skylake
> > > is
> > > running 0xba.
> > No, I didn't upgrade my microcode.
> > 
> > $ cat /proc/cpuinfo
> > ...
> > processor	: 0
> > vendor_id	: GenuineIntel
> > cpu family	: 6
> > model		: 78
> > model name	: Intel(R) Core(TM) i7-6650U CPU @ 2.20GHz
> > stepping	: 3
> > microcode	: 0x9e
> > ...
> > 
> > I suppose the problem should be gone if I upgrade the microcode.
> > But the real problem to me is that the system FREEZES after kernel
> > upgrade.
> Confused. So the match disables the deadline timer due to 0x9e <
> 0xb2. And
> not having deadline timer makes the boot fail despite the fact that
> deadline timer is borked with microcode < 0xb2.
> 
> Can you verify by adding 'lapic=notscdeadline' to the command line of
> a
> 'working' kernel? That should cause a freeze as well then.
> 
yes. Tried 4.4 distro and 4.12 vanilla kernel, kernel always freezes
with boot option "notscdeadline"/"lapic=notscdeadline".

thanks,
rui

> Thanks,
> 
> 	tglx
Peter Zijlstra Nov. 28, 2017, 12:36 p.m. UTC | #6
On Tue, Nov 28, 2017 at 06:59:01PM +0800, Zhang Rui wrote:
> > > > > My Surface Pro 4 is unable to boot after 4.12. The symptom is

> yes. Tried 4.4 distro and 4.12 vanilla kernel, kernel always freezes
> with boot option "notscdeadline"/"lapic=notscdeadline".

Then for some mysterious reason, your Surface thing has a borked LAPIC.

No idea how or why..
Zhang, Rui Nov. 29, 2017, 2:44 p.m. UTC | #7
On Tue, 2017-11-28 at 13:36 +0100, Peter Zijlstra wrote:
> On Tue, Nov 28, 2017 at 06:59:01PM +0800, Zhang Rui wrote:
> > 
> > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > > My Surface Pro 4 is unable to boot after 4.12. The symptom
> > > > > > is
> > 
> > yes. Tried 4.4 distro and 4.12 vanilla kernel, kernel always
> > freezes
> > with boot option "notscdeadline"/"lapic=notscdeadline".
> Then for some mysterious reason, your Surface thing has a borked
> LAPIC.
> 
I agree. But however, on this platform, forcing a cpu microcode upgrade
after upgrading kernel does not seem like a good solution to users.
Thus I still like to dig into the problem further to see if we can
workaround this "regression" in software.
As the kernel freezes at a very early stage, do you have any advice on
how to narrow down the problem?

thanks,
rui
Peter Zijlstra Dec. 18, 2017, 8:28 p.m. UTC | #8
Hi, can you see if this makes you Surface boot?

I tested it on my IVB by making has_legacy_pic() return unconditional
true.

[    0.024000] tsc: Unable to calibrate against PIT
[    0.025000] tsc: using HPET reference calibration
[    0.026000] tsc: Detected 2792.451 MHz processor

---


diff --git a/arch/x86/include/asm/i8259.h b/arch/x86/include/asm/i8259.h
index c8376b40e882..e2cfc4b52ee4 100644
--- a/arch/x86/include/asm/i8259.h
+++ b/arch/x86/include/asm/i8259.h
@@ -69,6 +69,11 @@ struct legacy_pic {
 extern struct legacy_pic *legacy_pic;
 extern struct legacy_pic null_legacy_pic;
 
+static inline bool has_legacy_pic(void)
+{
+	return legacy_pic == &null_legacy_pic;
+}
+
 static inline int nr_legacy_irqs(void)
 {
 	return legacy_pic->nr_legacy_irqs;
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 8ea117f8142e..2afc623b2280 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -25,6 +25,7 @@
 #include <asm/geode.h>
 #include <asm/apic.h>
 #include <asm/intel-family.h>
+#include <asm/i8259.h>
 
 unsigned int __read_mostly cpu_khz;	/* TSC clocks / usec, not used here */
 EXPORT_SYMBOL(cpu_khz);
@@ -363,6 +364,15 @@ static unsigned long pit_calibrate_tsc(u32 latch, unsigned long ms, int loopmin)
 	unsigned long tscmin, tscmax;
 	int pitcnt;
 
+	if (!has_legacy_pic()) {
+		udelay(10 * USEC_PER_MSEC);
+		udelay(10 * USEC_PER_MSEC);
+		udelay(10 * USEC_PER_MSEC);
+		udelay(10 * USEC_PER_MSEC);
+		udelay(10 * USEC_PER_MSEC);
+		return ULONG_MAX;
+	}
+
 	/* Set the Gate high, disable speaker */
 	outb((inb(0x61) & ~0x02) | 0x01, 0x61);
 
@@ -487,6 +497,9 @@ static unsigned long quick_pit_calibrate(void)
 	u64 tsc, delta;
 	unsigned long d1, d2;
 
+	if (!has_legacy_pic())
+		return 0;
+
 	/* Set the Gate high, disable speaker */
 	outb((inb(0x61) & ~0x02) | 0x01, 0x61);
Zhang, Rui Dec. 19, 2017, 10:48 a.m. UTC | #9
On Mon, 2017-12-18 at 21:28 +0100, Peter Zijlstra wrote:
> Hi, can you see if this makes you Surface boot?
> 
No, it does not boot.

> I tested it on my IVB by making has_legacy_pic() return unconditional
> true.
> 
> [    0.024000] tsc: Unable to calibrate against PIT
> [    0.025000] tsc: using HPET reference calibration
> [    0.026000] tsc: Detected 2792.451 MHz processor
> 
> ---
> 
> 
> diff --git a/arch/x86/include/asm/i8259.h
> b/arch/x86/include/asm/i8259.h
> index c8376b40e882..e2cfc4b52ee4 100644
> --- a/arch/x86/include/asm/i8259.h
> +++ b/arch/x86/include/asm/i8259.h
> @@ -69,6 +69,11 @@ struct legacy_pic {
>  extern struct legacy_pic *legacy_pic;
>  extern struct legacy_pic null_legacy_pic;
>  
> +static inline bool has_legacy_pic(void)
> +{
> +	return legacy_pic == &null_legacy_pic;
> +}
> +
shouldn't this be
	return legacy_pic == &default_legacy_pic;
?

thanks,
rui
>  static inline int nr_legacy_irqs(void)
>  {
>  	return legacy_pic->nr_legacy_irqs;
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index 8ea117f8142e..2afc623b2280 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -25,6 +25,7 @@
>  #include <asm/geode.h>
>  #include <asm/apic.h>
>  #include <asm/intel-family.h>
> +#include <asm/i8259.h>
>  
>  unsigned int __read_mostly cpu_khz;	/* TSC clocks / usec, not
> used here */
>  EXPORT_SYMBOL(cpu_khz);
> @@ -363,6 +364,15 @@ static unsigned long pit_calibrate_tsc(u32
> latch, unsigned long ms, int loopmin)
>  	unsigned long tscmin, tscmax;
>  	int pitcnt;
>  
> +	if (!has_legacy_pic()) {
> +		udelay(10 * USEC_PER_MSEC);
> +		udelay(10 * USEC_PER_MSEC);
> +		udelay(10 * USEC_PER_MSEC);
> +		udelay(10 * USEC_PER_MSEC);
> +		udelay(10 * USEC_PER_MSEC);
> +		return ULONG_MAX;
> +	}
> +
>  	/* Set the Gate high, disable speaker */
>  	outb((inb(0x61) & ~0x02) | 0x01, 0x61);
>  
> @@ -487,6 +497,9 @@ static unsigned long quick_pit_calibrate(void)
>  	u64 tsc, delta;
>  	unsigned long d1, d2;
>  
> +	if (!has_legacy_pic())
> +		return 0;
> +
>  	/* Set the Gate high, disable speaker */
>  	outb((inb(0x61) & ~0x02) | 0x01, 0x61);
>
Peter Zijlstra Dec. 19, 2017, 1:15 p.m. UTC | #10
On Tue, Dec 19, 2017 at 06:48:24PM +0800, Zhang Rui wrote:
> On Mon, 2017-12-18 at 21:28 +0100, Peter Zijlstra wrote:
> > Hi, can you see if this makes you Surface boot?
> > 
> No, it does not boot.

Bah, staring at the lapic calibrate now, that is a bit of a mess..

> > I tested it on my IVB by making has_legacy_pic() return unconditional
> > true.
> > 
> > [    0.024000] tsc: Unable to calibrate against PIT
> > [    0.025000] tsc: using HPET reference calibration
> > [    0.026000] tsc: Detected 2792.451 MHz processor
> > 
> > ---
> > 
> > 
> > diff --git a/arch/x86/include/asm/i8259.h
> > b/arch/x86/include/asm/i8259.h
> > index c8376b40e882..e2cfc4b52ee4 100644
> > --- a/arch/x86/include/asm/i8259.h
> > +++ b/arch/x86/include/asm/i8259.h
> > @@ -69,6 +69,11 @@ struct legacy_pic {
> >  extern struct legacy_pic *legacy_pic;
> >  extern struct legacy_pic null_legacy_pic;
> >  
> > +static inline bool has_legacy_pic(void)
> > +{
> > +	return legacy_pic == &null_legacy_pic;
> > +}
> > +
> shouldn't this be
> 	return legacy_pic == &default_legacy_pic;
> ?

!= &null, but yes, I mess that up.
Peter Zijlstra Dec. 19, 2017, 3:23 p.m. UTC | #11
On Tue, Dec 19, 2017 at 06:48:24PM +0800, Zhang Rui wrote:
> On Mon, 2017-12-18 at 21:28 +0100, Peter Zijlstra wrote:
> > Hi, can you see if this makes you Surface boot?
> > 
> No, it does not boot.

So I'm confused on the lapic calibration.

That stuff uses global_clock_event, which is initially the i8253 (PIT),
but because !PIC this thing won't be there either on your platform.

Then we initialize I/O APIC, and your machine has:

[    0.000000] IOAPIC[0]: apic_id 2, version 32, address 0xfec00000, GSI 0-119
[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
[    0.000000] ACPI: IRQ0 used by override.
[    0.000000] ACPI: IRQ9 used by override.

So your ACPI table has an override for IRQ2 and routes it to IRQ0.

Then we initialize HPET, and we _always_ do hpet_enable_legacy_int(),
which sets the LegacyRouting bit. The HPET document says:

  If the ENABLE_CNF bit and the LEG_RT_CNF bit are both set, then the
  interrupts will be routed as follows:

    Timer 0 will be routed to IRQ0 in Non-APIC or IRQ2 in the I/O APIC
    Timer 1 will be routed to IRQ8 in Non-APIC or IRQ8 in the I/O APIC
    Timer 2-n will be routed as per the routing in the timer n config registers.

  If the LegacyReplacement Route bit is set, the individual routing bits
  for timers 0 and 1 (APIC or FSB) will have no impact.

And then we set global_clock_event to &hpet_clockevent.

At this point that _SHOULD_ work afaict, even without actual PIC
present.

Sometime after that we call into calibrate_APIC_clock() -- because
!TSC_DEADLINE -- and this is where you get stuck, because
global_clock_event is not in fact delivering interrupts.

Thomas may have more clue, we'll have to wait for him to have a
time-slot available.
Zhang, Rui Dec. 19, 2017, 3:23 p.m. UTC | #12
On Tue, 2017-12-19 at 14:15 +0100, Peter Zijlstra wrote:
> On Tue, Dec 19, 2017 at 06:48:24PM +0800, Zhang Rui wrote:
> > 
> > On Mon, 2017-12-18 at 21:28 +0100, Peter Zijlstra wrote:
> > > 
> > > Hi, can you see if this makes you Surface boot?
> > > 
> > No, it does not boot.
> Bah, staring at the lapic calibrate now, that is a bit of a mess..
> 
> > 
> > > 
> > > I tested it on my IVB by making has_legacy_pic() return
> > > unconditional
> > > true.
> > > 
> > > [    0.024000] tsc: Unable to calibrate against PIT
> > > [    0.025000] tsc: using HPET reference calibration
> > > [    0.026000] tsc: Detected 2792.451 MHz processor
> > > 
> > > ---
> > > 
> > > 
> > > diff --git a/arch/x86/include/asm/i8259.h
> > > b/arch/x86/include/asm/i8259.h
> > > index c8376b40e882..e2cfc4b52ee4 100644
> > > --- a/arch/x86/include/asm/i8259.h
> > > +++ b/arch/x86/include/asm/i8259.h
> > > @@ -69,6 +69,11 @@ struct legacy_pic {
> > >  extern struct legacy_pic *legacy_pic;
> > >  extern struct legacy_pic null_legacy_pic;
> > >  
> > > +static inline bool has_legacy_pic(void)
> > > +{
> > > +	return legacy_pic == &null_legacy_pic;
> > > +}
> > > +
> > shouldn't this be
> > 	return legacy_pic == &default_legacy_pic;
> > ?
> != &null, but yes, I mess that up.

I see. I have changed that and the platform can not boot neither.

thanks,
rui
Peter Zijlstra Dec. 19, 2017, 3:31 p.m. UTC | #13
On Tue, Dec 19, 2017 at 04:23:07PM +0100, Peter Zijlstra wrote:
> On Tue, Dec 19, 2017 at 06:48:24PM +0800, Zhang Rui wrote:
> > On Mon, 2017-12-18 at 21:28 +0100, Peter Zijlstra wrote:
> > > Hi, can you see if this makes you Surface boot?
> > > 
> > No, it does not boot.
> 
> So I'm confused on the lapic calibration.
> 
> That stuff uses global_clock_event, which is initially the i8253 (PIT),
> but because !PIC this thing won't be there either on your platform.
> 
> Then we initialize I/O APIC, and your machine has:
> 
> [    0.000000] IOAPIC[0]: apic_id 2, version 32, address 0xfec00000, GSI 0-119
> [    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
> [    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
> [    0.000000] ACPI: IRQ0 used by override.
> [    0.000000] ACPI: IRQ9 used by override.
> 
> So your ACPI table has an override for IRQ2 and routes it to IRQ0.
> 
> Then we initialize HPET, and we _always_ do hpet_enable_legacy_int(),
> which sets the LegacyRouting bit. The HPET document says:
> 
>   If the ENABLE_CNF bit and the LEG_RT_CNF bit are both set, then the
>   interrupts will be routed as follows:
> 
>     Timer 0 will be routed to IRQ0 in Non-APIC or IRQ2 in the I/O APIC
>     Timer 1 will be routed to IRQ8 in Non-APIC or IRQ8 in the I/O APIC
>     Timer 2-n will be routed as per the routing in the timer n config registers.
> 
>   If the LegacyReplacement Route bit is set, the individual routing bits
>   for timers 0 and 1 (APIC or FSB) will have no impact.
> 
> And then we set global_clock_event to &hpet_clockevent.
> 
> At this point that _SHOULD_ work afaict, even without actual PIC
> present.
> 
> Sometime after that we call into calibrate_APIC_clock() -- because
> !TSC_DEADLINE -- and this is where you get stuck, because
> global_clock_event is not in fact delivering interrupts.
> 
> Thomas may have more clue, we'll have to wait for him to have a
> time-slot available.

What does your /proc/interrupts look like (on a tsc-deadline boot) ?

0: is the HPET, LOC: is the lapic/tsc-deadline

On my SKL desktop I get 21 PIT/HPET ticks on CPU0 before lapic takes over.
Zhang, Rui Dec. 19, 2017, 3:42 p.m. UTC | #14
On Tue, 2017-12-19 at 16:23 +0100, Peter Zijlstra wrote:
> On Tue, Dec 19, 2017 at 06:48:24PM +0800, Zhang Rui wrote:
> > 
> > On Mon, 2017-12-18 at 21:28 +0100, Peter Zijlstra wrote:
> > > 
> > > Hi, can you see if this makes you Surface boot?
> > > 
> > No, it does not boot.
> So I'm confused on the lapic calibration.
> 
> That stuff uses global_clock_event, which is initially the i8253
> (PIT),
> but because !PIC this thing won't be there either on your platform.
> 
> Then we initialize I/O APIC, and your machine has:
> 
> [    0.000000] IOAPIC[0]: apic_id 2, version 32, address 0xfec00000,
> GSI 0-119
> [    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl
> dfl)
> [    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high
> level)
> [    0.000000] ACPI: IRQ0 used by override.
> [    0.000000] ACPI: IRQ9 used by override.
> 
> So your ACPI table has an override for IRQ2 and routes it to IRQ0.
> 
> Then we initialize HPET, and we _always_ do hpet_enable_legacy_int(),
> which sets the LegacyRouting bit.

Right.

>  The HPET document says:
> 
>   If the ENABLE_CNF bit and the LEG_RT_CNF bit are both set, then the
>   interrupts will be routed as follows:
> 
>     Timer 0 will be routed to IRQ0 in Non-APIC or IRQ2 in the I/O
> APIC

But AFAICS, the HPET emulated timer interrupts goes to IRQ0 on all the
machines I have tested, but on this MS Surface Pro 4, there is no irq 0
row in /proc/interrupts.

$ cat /proc/interrupts 
            CPU0       CPU1       CPU2       CPU3       
   8:          0          1          0          0  IR-IO-APIC    8-
edge      rtc0
   9:        476        144       4573        132  IR-IO-APIC    9-
fasteoi   acpi
  14:          0          0          0          0  IR-IO-APIC   14-
edge      INT344B:00
  16:        646       4971      63574       3973  IR-IO-APIC   16-
fasteoi   idma64.0, MRVL_PCIE, i2c_designware.0
  17:          0          0          0          0  IR-IO-APIC   17-
fasteoi   idma64.1, i2c_designware.1
  18:          0          0          0          0  IR-IO-APIC   18-
fasteoi   idma64.2, i2c_designware.2
  19:          0          0          0          0  IR-IO-APIC   19-
fasteoi   idma64.3, i2c_designware.3

Maybe I need to check if IRQ0 is overridden on other platforms, if no,
registering to IRQ2 for HPET Timer 0 could help in this case?

>     Timer 1 will be routed to IRQ8 in Non-APIC or IRQ8 in the I/O
> APIC
>     Timer 2-n will be routed as per the routing in the timer n config
> registers.
> 
>   If the LegacyReplacement Route bit is set, the individual routing
> bits
>   for timers 0 and 1 (APIC or FSB) will have no impact.
> 
> And then we set global_clock_event to &hpet_clockevent.

Yes, I can confirm this.

> 
> At this point that _SHOULD_ work afaict, even without actual PIC
> present.

so IOAPIC is ready when we calibrating Lapic timer?
> 
> Sometime after that we call into calibrate_APIC_clock() -- because
> !TSC_DEADLINE -- and this is where you get stuck, because
> global_clock_event is not in fact delivering interrupts.

right.
> 
> Thomas may have more clue, we'll have to wait for him to have a
> time-slot available.

okay.

thanks,
rui
Zhang, Rui Dec. 19, 2017, 3:43 p.m. UTC | #15
On Tue, 2017-12-19 at 16:31 +0100, Peter Zijlstra wrote:
> On Tue, Dec 19, 2017 at 04:23:07PM +0100, Peter Zijlstra wrote:
> > 
> > On Tue, Dec 19, 2017 at 06:48:24PM +0800, Zhang Rui wrote:
> > > 
> > > On Mon, 2017-12-18 at 21:28 +0100, Peter Zijlstra wrote:
> > > > 
> > > > Hi, can you see if this makes you Surface boot?
> > > > 
> > > No, it does not boot.
> > So I'm confused on the lapic calibration.
> > 
> > That stuff uses global_clock_event, which is initially the i8253
> > (PIT),
> > but because !PIC this thing won't be there either on your platform.
> > 
> > Then we initialize I/O APIC, and your machine has:
> > 
> > [    0.000000] IOAPIC[0]: apic_id 2, version 32, address
> > 0xfec00000, GSI 0-119
> > [    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl
> > dfl)
> > [    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high
> > level)
> > [    0.000000] ACPI: IRQ0 used by override.
> > [    0.000000] ACPI: IRQ9 used by override.
> > 
> > So your ACPI table has an override for IRQ2 and routes it to IRQ0.
> > 
> > Then we initialize HPET, and we _always_ do
> > hpet_enable_legacy_int(),
> > which sets the LegacyRouting bit. The HPET document says:
> > 
> >   If the ENABLE_CNF bit and the LEG_RT_CNF bit are both set, then
> > the
> >   interrupts will be routed as follows:
> > 
> >     Timer 0 will be routed to IRQ0 in Non-APIC or IRQ2 in the I/O
> > APIC
> >     Timer 1 will be routed to IRQ8 in Non-APIC or IRQ8 in the I/O
> > APIC
> >     Timer 2-n will be routed as per the routing in the timer n
> > config registers.
> > 
> >   If the LegacyReplacement Route bit is set, the individual routing
> > bits
> >   for timers 0 and 1 (APIC or FSB) will have no impact.
> > 
> > And then we set global_clock_event to &hpet_clockevent.
> > 
> > At this point that _SHOULD_ work afaict, even without actual PIC
> > present.
> > 
> > Sometime after that we call into calibrate_APIC_clock() -- because
> > !TSC_DEADLINE -- and this is where you get stuck, because
> > global_clock_event is not in fact delivering interrupts.
> > 
> > Thomas may have more clue, we'll have to wait for him to have a
> > time-slot available.
> What does your /proc/interrupts look like (on a tsc-deadline boot) ?
> 
> 0: is the HPET, LOC: is the lapic/tsc-deadline
> 
> On my SKL desktop I get 21 PIT/HPET ticks on CPU0 before lapic takes
> over.
No irq0.
 LOC:     220071     210079     184892     176494   Local timer
interrupts

thanks,
rui
Peter Zijlstra Dec. 19, 2017, 4:01 p.m. UTC | #16
On Tue, Dec 19, 2017 at 11:42:41PM +0800, Zhang Rui wrote:
> On Tue, 2017-12-19 at 16:23 +0100, Peter Zijlstra wrote:

> > [    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
> > [    0.000000] ACPI: IRQ0 used by override.
> > 
> > So your ACPI table has an override for IRQ2 and routes it to IRQ0.

^ this

> >  The HPET document says:
> > 
> >   If the ENABLE_CNF bit and the LEG_RT_CNF bit are both set, then the
> >   interrupts will be routed as follows:
> > 
> >     Timer 0 will be routed to IRQ0 in Non-APIC or IRQ2 in the I/O APIC
> 
> But AFAICS, the HPET emulated timer interrupts goes to IRQ0

Right, so see that ACPI override, that routes I/O APIC IRQ2 to IRQ0, or
it _should_.

Clearly something is messed up here.. but I've no idea what. That whole
IRQ routing stuff is confusing.
Peter Zijlstra Dec. 19, 2017, 5:23 p.m. UTC | #17
On Tue, Dec 19, 2017 at 05:01:55PM +0100, Peter Zijlstra wrote:
> On Tue, Dec 19, 2017 at 11:42:41PM +0800, Zhang Rui wrote:
> > On Tue, 2017-12-19 at 16:23 +0100, Peter Zijlstra wrote:
> 
> > > [    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
> > > [    0.000000] ACPI: IRQ0 used by override.
> > > 
> > > So your ACPI table has an override for IRQ2 and routes it to IRQ0.
> 
> ^ this
> 
> > >  The HPET document says:
> > > 
> > >   If the ENABLE_CNF bit and the LEG_RT_CNF bit are both set, then the
> > >   interrupts will be routed as follows:
> > > 
> > >     Timer 0 will be routed to IRQ0 in Non-APIC or IRQ2 in the I/O APIC
> > 
> > But AFAICS, the HPET emulated timer interrupts goes to IRQ0
> 
> Right, so see that ACPI override, that routes I/O APIC IRQ2 to IRQ0, or
> it _should_.
> 
> Clearly something is messed up here.. but I've no idea what. That whole
> IRQ routing stuff is confusing.

Does this help?

diff --git a/arch/x86/kernel/time.c b/arch/x86/kernel/time.c
index 749d189f8cd4..45675072771c 100644
--- a/arch/x86/kernel/time.c
+++ b/arch/x86/kernel/time.c
@@ -69,8 +69,6 @@ static struct irqaction irq0  = {
 
 static void __init setup_default_timer_irq(void)
 {
-	if (!nr_legacy_irqs())
-		return;
 	setup_irq(0, &irq0);
 }
Zhang, Rui Dec. 20, 2017, 2:08 p.m. UTC | #18
On Tue, 2017-12-19 at 18:23 +0100, Peter Zijlstra wrote:
> On Tue, Dec 19, 2017 at 05:01:55PM +0100, Peter Zijlstra wrote:
> > 
> > On Tue, Dec 19, 2017 at 11:42:41PM +0800, Zhang Rui wrote:
> > > 
> > > On Tue, 2017-12-19 at 16:23 +0100, Peter Zijlstra wrote:
> > > 
> > > > 
> > > > [    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2
> > > > dfl dfl)
> > > > [    0.000000] ACPI: IRQ0 used by override.
> > > > 
> > > > So your ACPI table has an override for IRQ2 and routes it to
> > > > IRQ0.
> > ^ this
> > 
> > > 
> > > > 
> > > >  The HPET document says:
> > > > 
> > > >   If the ENABLE_CNF bit and the LEG_RT_CNF bit are both set,
> > > > then the
> > > >   interrupts will be routed as follows:
> > > > 
> > > >     Timer 0 will be routed to IRQ0 in Non-APIC or IRQ2 in the
> > > > I/O APIC
> > > But AFAICS, the HPET emulated timer interrupts goes to IRQ0
> > Right, so see that ACPI override, that routes I/O APIC IRQ2 to
> > IRQ0, or
> > it _should_.
> > 
> > Clearly something is messed up here.. but I've no idea what. That
> > whole
> > IRQ routing stuff is confusing.
> Does this help?
> 
No.

thanks,
rui
> diff --git a/arch/x86/kernel/time.c b/arch/x86/kernel/time.c
> index 749d189f8cd4..45675072771c 100644
> --- a/arch/x86/kernel/time.c
> +++ b/arch/x86/kernel/time.c
> @@ -69,8 +69,6 @@ static struct irqaction irq0  = {
>  
>  static void __init setup_default_timer_irq(void)
>  {
> -	if (!nr_legacy_irqs())
> -		return;
>  	setup_irq(0, &irq0);
>  }
>
Peter Zijlstra Dec. 20, 2017, 2:48 p.m. UTC | #19
On Wed, Dec 20, 2017 at 10:08:21PM +0800, Zhang Rui wrote:

> > Does this help?
> > 
> No.

Bah!, does this at least get you a IRQ0 line in /proc/interrupts?

> > diff --git a/arch/x86/kernel/time.c b/arch/x86/kernel/time.c
> > index 749d189f8cd4..45675072771c 100644
> > --- a/arch/x86/kernel/time.c
> > +++ b/arch/x86/kernel/time.c
> > @@ -69,8 +69,6 @@ static struct irqaction irq0  = {
> >  
> >  static void __init setup_default_timer_irq(void)
> >  {
> > -	if (!nr_legacy_irqs())
> > -		return;
> >  	setup_irq(0, &irq0);
> >  }
> >

Patch
diff mbox series

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index ff89177..cd419d4 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -596,7 +596,7 @@  static const struct x86_cpu_id deadline_match[] = {
 	DEADLINE_MODEL_MATCH_REV ( INTEL_FAM6_BROADWELL_CORE,	0x25),
 	DEADLINE_MODEL_MATCH_REV ( INTEL_FAM6_BROADWELL_GT3E,	0x17),
 
-	DEADLINE_MODEL_MATCH_REV ( INTEL_FAM6_SKYLAKE_MOBILE,	0xb2),
+//	DEADLINE_MODEL_MATCH_REV ( INTEL_FAM6_SKYLAKE_MOBILE,	0xb2),
 	DEADLINE_MODEL_MATCH_REV ( INTEL_FAM6_SKYLAKE_DESKTOP,	0xb2),
 
 	DEADLINE_MODEL_MATCH_REV ( INTEL_FAM6_KABYLAKE_MOBILE,	0x52),