* random insta-reboots on AMD Phenom II
@ 2017-09-30  2:05 Adam Borowski
  2017-09-30 11:11 ` Borislav Petkov
  0 siblings, 1 reply; 16+ messages in thread
From: Adam Borowski @ 2017-09-30  2:05 UTC (permalink / raw)
  To: linux-kernel

Hi!
I'm afraid I see random instant reboots on current -rc, approximately
once per day, only under CPU load.  There's nothing on serial/etc -- just
an immediate reboot.  4.13 works perfectly; last kernel I've tried is
v4.14-rc2-165-g770b782f555d.  gcc 7.2.0-7 (Debian).

CPU is AMD Phenom II X6 1055T (family 10h).

Sometimes it dies within a few minutes of load, sometimes all is fine for a
couple of days.  This randomness makes bisecting not really an option.

Any hints how to debug this?


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ We domesticated dogs 36000 years ago; together we chased
⣾⠁⢰⠒⠀⣿⡁ animals, hung out and licked or scratched our private parts.
⢿⡄⠘⠷⠚⠋⠀ Cats domesticated us 9500 years ago, and immediately we got
⠈⠳⣄⠀⠀⠀⠀ agriculture, towns then cities.     -- whitroth on /.


* Re: random insta-reboots on AMD Phenom II
  2017-09-30  2:05 random insta-reboots on AMD Phenom II Adam Borowski
@ 2017-09-30 11:11 ` Borislav Petkov
  2017-09-30 11:29   ` Adam Borowski
  0 siblings, 1 reply; 16+ messages in thread
From: Borislav Petkov @ 2017-09-30 11:11 UTC (permalink / raw)
  To: Adam Borowski; +Cc: linux-kernel

On Sat, Sep 30, 2017 at 04:05:16AM +0200, Adam Borowski wrote:
> Any hints how to debug this?

Do

rdmsr -a 0xc0010015

as root and paste it here.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: random insta-reboots on AMD Phenom II
  2017-09-30 11:11 ` Borislav Petkov
@ 2017-09-30 11:29   ` Adam Borowski
  2017-09-30 11:53     ` Borislav Petkov
  0 siblings, 1 reply; 16+ messages in thread
From: Adam Borowski @ 2017-09-30 11:29 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: linux-kernel

On Sat, Sep 30, 2017 at 01:11:37PM +0200, Borislav Petkov wrote:
> On Sat, Sep 30, 2017 at 04:05:16AM +0200, Adam Borowski wrote:
> > Any hints how to debug this?
> 
> Do
> rdmsr -a 0xc0010015
> as root and paste it here.

1000010
1000010
1000010
1000010
1000010
1000010

on both 4.13.4 and 4.14-rc2+.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ We domesticated dogs 36000 years ago; together we chased
⣾⠁⢰⠒⠀⣿⡁ animals, hung out and licked or scratched our private parts.
⢿⡄⠘⠷⠚⠋⠀ Cats domesticated us 9500 years ago, and immediately we got
⠈⠳⣄⠀⠀⠀⠀ agriculture, towns then cities.     -- whitroth on /.


* Re: random insta-reboots on AMD Phenom II
  2017-09-30 11:29   ` Adam Borowski
@ 2017-09-30 11:53     ` Borislav Petkov
  2017-09-30 12:47       ` Markus Trippelsdorf
                         ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Borislav Petkov @ 2017-09-30 11:53 UTC (permalink / raw)
  To: Adam Borowski; +Cc: linux-kernel, Andy Lutomirski, x86-ml

On Sat, Sep 30, 2017 at 01:29:03PM +0200, Adam Borowski wrote:
> On Sat, Sep 30, 2017 at 01:11:37PM +0200, Borislav Petkov wrote:
> > On Sat, Sep 30, 2017 at 04:05:16AM +0200, Adam Borowski wrote:
> > > Any hints how to debug this?
> > 
> > Do
> > rdmsr -a 0xc0010015
> > as root and paste it here.
> 
> 1000010
> 1000010
> 1000010
> 1000010
> 1000010
> 1000010
> 
> on both 4.13.4 and 4.14-rc2+.

Boot into -rc2+ and do as root:

# wrmsr -a 0xc0010015 0x1000018

If the issue gets fixed then Mr. Luto better revert the new lazy TLB
flushing fun'n'games for 4.14 before it is too late and that kernel
releases b0rked.

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: random insta-reboots on AMD Phenom II
  2017-09-30 11:53     ` Borislav Petkov
@ 2017-09-30 12:47       ` Markus Trippelsdorf
  2017-09-30 14:20         ` Brian Gerst
  2017-09-30 15:50         ` Borislav Petkov
  2017-09-30 15:11       ` Andy Lutomirski
  2017-10-01 13:07       ` Adam Borowski
  2 siblings, 2 replies; 16+ messages in thread
From: Markus Trippelsdorf @ 2017-09-30 12:47 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Adam Borowski, linux-kernel, Andy Lutomirski, x86-ml

On 2017.09.30 at 13:53 +0200, Borislav Petkov wrote:
> On Sat, Sep 30, 2017 at 01:29:03PM +0200, Adam Borowski wrote:
> > On Sat, Sep 30, 2017 at 01:11:37PM +0200, Borislav Petkov wrote:
> > > On Sat, Sep 30, 2017 at 04:05:16AM +0200, Adam Borowski wrote:
> > > > Any hints how to debug this?
> > > 
> > > Do
> > > rdmsr -a 0xc0010015
> > > as root and paste it here.
> > 
> > 1000010
> > 1000010
> > 1000010
> > 1000010
> > 1000010
> > 1000010
> > 
> > on both 4.13.4 and 4.14-rc2+.
> 
> Boot into -rc2+ and do as root:
> 
> # wrmsr -a 0xc0010015 0x1000018
> 
> If the issue gets fixed then Mr. Luto better revert the new lazy TLB
> flushing fun'n'games for 4.14 before it is too late and that kernel
> releases b0rked.

The issue does get fixed by setting TlbCacheDis to 1. I have been
running it for the last few weeks without any problems. 
Performance is not affected at all. So it might be easier to just set
the bit for older AMD processors as a boot quirk.
Changing the TLB code so late might not be a good idea...
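
For concreteness, the value written above differs from the one read back
only in bit 3, the bit called TlbCacheDis here.  Below is a minimal sketch
of that arithmetic in Python.  The bit position is inferred purely from the
two values quoted in this thread (0x1000010 and 0x1000018), and the helper
names are hypothetical; they come from neither msr-tools nor the kernel.

```python
# Bit position inferred from the thread: 0x1000010 -> 0x1000018 sets bit 3.
TLB_CACHE_DIS = 1 << 3


def set_bits(value: int) -> list:
    """Return the positions of all set bits in a 64-bit MSR value."""
    return [bit for bit in range(64) if (value >> bit) & 1]


def with_tlb_cache_dis(value: int) -> int:
    """Return the value with bit 3 set, leaving every other bit untouched."""
    return value | TLB_CACHE_DIS


before = 0x1000010                  # as printed by rdmsr, which omits the 0x
after = with_tlb_cache_dis(before)
print(set_bits(before))             # bits already set before the write
print(hex(after))                   # value to hand to wrmsr
```

OR-ing the bit in, rather than hard-coding the full value, keeps whatever
other bits the firmware has already set in the register.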

-- 
Markus


* Re: random insta-reboots on AMD Phenom II
  2017-09-30 12:47       ` Markus Trippelsdorf
@ 2017-09-30 14:20         ` Brian Gerst
  2017-09-30 15:21           ` Markus Trippelsdorf
  2017-09-30 15:50         ` Borislav Petkov
  1 sibling, 1 reply; 16+ messages in thread
From: Brian Gerst @ 2017-09-30 14:20 UTC (permalink / raw)
  To: Markus Trippelsdorf
  Cc: Borislav Petkov, Adam Borowski, Linux Kernel Mailing List,
	Andy Lutomirski, x86-ml

On Sat, Sep 30, 2017 at 8:47 AM, Markus Trippelsdorf
<markus@trippelsdorf.de> wrote:
> On 2017.09.30 at 13:53 +0200, Borislav Petkov wrote:
>> On Sat, Sep 30, 2017 at 01:29:03PM +0200, Adam Borowski wrote:
>> > On Sat, Sep 30, 2017 at 01:11:37PM +0200, Borislav Petkov wrote:
>> > > On Sat, Sep 30, 2017 at 04:05:16AM +0200, Adam Borowski wrote:
>> > > > Any hints how to debug this?
>> > >
>> > > Do
>> > > rdmsr -a 0xc0010015
>> > > as root and paste it here.
>> >
>> > 1000010
>> > 1000010
>> > 1000010
>> > 1000010
>> > 1000010
>> > 1000010
>> >
>> > on both 4.13.4 and 4.14-rc2+.
>>
>> Boot into -rc2+ and do as root:
>>
>> # wrmsr -a 0xc0010015 0x1000018
>>
>> If the issue gets fixed then Mr. Luto better revert the new lazy TLB
>> flushing fun'n'games for 4.14 before it is too late and that kernel
>> releases b0rked.
>
> The issue does get fixed by setting TlbCacheDis to 1. I have been
> running it for the last few weeks without any problems.
> Performance is not affected at all. So it might be easier to just set
> the bit for older AMD processors as a boot quirk.
> Changing the TLB code so late might not be a good idea...

Looking at the AMD K10 revision guide
(http://support.amd.com/TechDocs/41322_10h_Rev_Gd.pdf), erratum #298,
which this setting works around, should only apply to revisions DR-BA
and DR-B2, which include the original Phenom, but not Phenom II.  The
Phenom II X6 is revision PH-E0, which does not have this erratum.

--
Brian Gerst


* Re: random insta-reboots on AMD Phenom II
  2017-09-30 11:53     ` Borislav Petkov
  2017-09-30 12:47       ` Markus Trippelsdorf
@ 2017-09-30 15:11       ` Andy Lutomirski
  2017-09-30 15:48         ` Borislav Petkov
  2017-10-01 13:07       ` Adam Borowski
  2 siblings, 1 reply; 16+ messages in thread
From: Andy Lutomirski @ 2017-09-30 15:11 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Adam Borowski, linux-kernel, Andy Lutomirski, x86-ml



> On Sep 30, 2017, at 4:53 AM, Borislav Petkov <bp@alien8.de> wrote:
> 
>> On Sat, Sep 30, 2017 at 01:29:03PM +0200, Adam Borowski wrote:
>>> On Sat, Sep 30, 2017 at 01:11:37PM +0200, Borislav Petkov wrote:
>>>> On Sat, Sep 30, 2017 at 04:05:16AM +0200, Adam Borowski wrote:
>>>> Any hints how to debug this?
>>> 
>>> Do
>>> rdmsr -a 0xc0010015
>>> as root and paste it here.
>> 
>> 1000010
>> 1000010
>> 1000010
>> 1000010
>> 1000010
>> 1000010
>> 
>> on both 4.13.4 and 4.14-rc2+.
> 
> Boot into -rc2+ and do as root:
> 
> # wrmsr -a 0xc0010015 0x1000018
> 
> If the issue gets fixed then Mr. Luto better revert the new lazy TLB
> flushing fun'n'games for 4.14 before it is too late and that kernel
> releases b0rked.

Yeah, working on it.  It's not a straightforward revert.

> 
> Thx.
> 
> -- 
> Regards/Gruss,
>    Boris.
> 
> Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: random insta-reboots on AMD Phenom II
  2017-09-30 14:20         ` Brian Gerst
@ 2017-09-30 15:21           ` Markus Trippelsdorf
  0 siblings, 0 replies; 16+ messages in thread
From: Markus Trippelsdorf @ 2017-09-30 15:21 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Borislav Petkov, Adam Borowski, Linux Kernel Mailing List,
	Andy Lutomirski, x86-ml

On 2017.09.30 at 10:20 -0400, Brian Gerst wrote:
> On Sat, Sep 30, 2017 at 8:47 AM, Markus Trippelsdorf
> <markus@trippelsdorf.de> wrote:
> > On 2017.09.30 at 13:53 +0200, Borislav Petkov wrote:
> >> On Sat, Sep 30, 2017 at 01:29:03PM +0200, Adam Borowski wrote:
> >> > On Sat, Sep 30, 2017 at 01:11:37PM +0200, Borislav Petkov wrote:
> >> > > On Sat, Sep 30, 2017 at 04:05:16AM +0200, Adam Borowski wrote:
> >> > > > Any hints how to debug this?
> >> > >
> >> > > Do
> >> > > rdmsr -a 0xc0010015
> >> > > as root and paste it here.
> >> >
> >> > 1000010
> >> > 1000010
> >> > 1000010
> >> > 1000010
> >> > 1000010
> >> > 1000010
> >> >
> >> > on both 4.13.4 and 4.14-rc2+.
> >>
> >> Boot into -rc2+ and do as root:
> >>
> >> # wrmsr -a 0xc0010015 0x1000018
> >>
> >> If the issue gets fixed then Mr. Luto better revert the new lazy TLB
> >> flushing fun'n'games for 4.14 before it is too late and that kernel
> >> releases b0rked.
> >
> > The issue does get fixed by setting TlbCacheDis to 1. I have been
> > running it for the last few weeks without any problems.
> > Performance is not affected at all. So it might be easier to just set
> > the bit for older AMD processors as a boot quirk.
> > Changing the TLB code so late might not be a good idea...
> 
> Looking at the AMD K10 revision guide
> (http://support.amd.com/TechDocs/41322_10h_Rev_Gd.pdf), erratum #298,
> which this setting works around, should only apply to revisions DR-BA
> and DR-B2, which include the original Phenom, but not Phenom II.  The
> Phenom II X6 is revision PH-E0, which does not have this erratum.

It has nothing to do with erratum #298. The new lazy TLB code causes
MCEs, because the page tables may now contain garbage.
See the long "Current mainline git (24e700e291d52bd2) hangs when
building e.g. perf" LKML thread.
-- 
Markus


* Re: random insta-reboots on AMD Phenom II
  2017-09-30 15:11       ` Andy Lutomirski
@ 2017-09-30 15:48         ` Borislav Petkov
  0 siblings, 0 replies; 16+ messages in thread
From: Borislav Petkov @ 2017-09-30 15:48 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Adam Borowski, linux-kernel, Andy Lutomirski, x86-ml

On Sat, Sep 30, 2017 at 08:11:51AM -0700, Andy Lutomirski wrote:
> Yeah, working on it.  It's not a straightforward revert.

Thanks. At least you have testers :-)

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: random insta-reboots on AMD Phenom II
  2017-09-30 12:47       ` Markus Trippelsdorf
  2017-09-30 14:20         ` Brian Gerst
@ 2017-09-30 15:50         ` Borislav Petkov
  2017-09-30 16:04           ` Andy Lutomirski
  2017-10-06 18:49           ` Johannes Hirte
  1 sibling, 2 replies; 16+ messages in thread
From: Borislav Petkov @ 2017-09-30 15:50 UTC (permalink / raw)
  To: Markus Trippelsdorf; +Cc: Adam Borowski, linux-kernel, Andy Lutomirski, x86-ml

On Sat, Sep 30, 2017 at 02:47:11PM +0200, Markus Trippelsdorf wrote:
> Changing the TLB code so late might not be a good idea...

The new lazy code is too risky to keep as we don't know what else will
break. The conservative and thus safe thing to do is to revert to the
old behavior for old machines.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: random insta-reboots on AMD Phenom II
  2017-09-30 15:50         ` Borislav Petkov
@ 2017-09-30 16:04           ` Andy Lutomirski
  2017-10-06 18:49           ` Johannes Hirte
  1 sibling, 0 replies; 16+ messages in thread
From: Andy Lutomirski @ 2017-09-30 16:04 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Markus Trippelsdorf, Adam Borowski, linux-kernel,
	Andy Lutomirski, x86-ml

On Sat, Sep 30, 2017 at 8:50 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Sat, Sep 30, 2017 at 02:47:11PM +0200, Markus Trippelsdorf wrote:
>> Changing the TLB code so late might not be a good idea...
>
> The new lazy code is too risky to keep as we don't know what else will
> break. The conservative and thus safe thing to do is to revert to the
> old behavior for old machines.

Agreed.

The only problem is that the code has changed so much on top of the
problematic commit that just reverting it won't work.

>
> --
> Regards/Gruss,
>     Boris.
>
> Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: random insta-reboots on AMD Phenom II
  2017-09-30 11:53     ` Borislav Petkov
  2017-09-30 12:47       ` Markus Trippelsdorf
  2017-09-30 15:11       ` Andy Lutomirski
@ 2017-10-01 13:07       ` Adam Borowski
  2 siblings, 0 replies; 16+ messages in thread
From: Adam Borowski @ 2017-10-01 13:07 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: linux-kernel, Andy Lutomirski, x86-ml

On Sat, Sep 30, 2017 at 01:53:02PM +0200, Borislav Petkov wrote:
> On Sat, Sep 30, 2017 at 01:29:03PM +0200, Adam Borowski wrote:
> > On Sat, Sep 30, 2017 at 01:11:37PM +0200, Borislav Petkov wrote:
> > > On Sat, Sep 30, 2017 at 04:05:16AM +0200, Adam Borowski wrote:

> Boot into -rc2+ and do as root:
> 
> # wrmsr -a 0xc0010015 0x1000018

Seems to help, thus it's indeed this issue.  I failed to mention that "once
per day" meant a day of regular use, of which heavy loads were only a tiny
fraction.  I've applied this register setting, then kept the machine busy
(mostly with randconfig kernel builds) for a day with no explosions yet, so
there's a good chance the problem would have triggered by now if it were
still present.

> If the issue gets fixed then Mr. Luto better revert the new lazy TLB
> flushing fun'n'games for 4.14 before it is too late and that kernel
> releases b0rked.

I have no clue about these matters so I'll leave it to you guys.  But, as
the other report I see gives different effects (frequent segfaults vs rare
insta-reboots), you do want to include my machine in testing.

Thanks for the workaround!

Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ We domesticated dogs 36000 years ago; together we chased
⣾⠁⢰⠒⠀⣿⡁ animals, hung out and licked or scratched our private parts.
⢿⡄⠘⠷⠚⠋⠀ Cats domesticated us 9500 years ago, and immediately we got
⠈⠳⣄⠀⠀⠀⠀ agriculture, towns then cities.     -- whitroth on /.


* Re: random insta-reboots on AMD Phenom II
  2017-09-30 15:50         ` Borislav Petkov
  2017-09-30 16:04           ` Andy Lutomirski
@ 2017-10-06 18:49           ` Johannes Hirte
  2017-10-06 18:53             ` Borislav Petkov
  1 sibling, 1 reply; 16+ messages in thread
From: Johannes Hirte @ 2017-10-06 18:49 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Markus Trippelsdorf, Adam Borowski, linux-kernel,
	Andy Lutomirski, x86-ml

On 2017 Sep 30, Borislav Petkov wrote:
> On Sat, Sep 30, 2017 at 02:47:11PM +0200, Markus Trippelsdorf wrote:
> > Changing the TLB code so late might not be a good idea...
> 
> The new lazy code is too risky to keep as we don't know what else will
> break. The conservative and thus safe thing to do is to revert to the
> old behavior for old machines.
>

I see the same behaviour on Carrizo. Is Excavator an old machine too?

--
Regards,
  Johannes


* Re: random insta-reboots on AMD Phenom II
  2017-10-06 18:49           ` Johannes Hirte
@ 2017-10-06 18:53             ` Borislav Petkov
  2017-10-06 19:02               ` Johannes Hirte
  0 siblings, 1 reply; 16+ messages in thread
From: Borislav Petkov @ 2017-10-06 18:53 UTC (permalink / raw)
  To: Johannes Hirte
  Cc: Markus Trippelsdorf, Adam Borowski, linux-kernel,
	Andy Lutomirski, x86-ml

On Fri, Oct 06, 2017 at 08:49:33PM +0200, Johannes Hirte wrote:
> I see the same behaviour on Carrizo. Is Excavator an old machine too?

Do

# rdmsr -a 0xc0010015

as root and paste it here.

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: random insta-reboots on AMD Phenom II
  2017-10-06 18:53             ` Borislav Petkov
@ 2017-10-06 19:02               ` Johannes Hirte
  2017-10-06 19:24                 ` Borislav Petkov
  0 siblings, 1 reply; 16+ messages in thread
From: Johannes Hirte @ 2017-10-06 19:02 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Markus Trippelsdorf, Adam Borowski, linux-kernel,
	Andy Lutomirski, x86-ml

On 2017 Oct 06, Borislav Petkov wrote:
> On Fri, Oct 06, 2017 at 08:49:33PM +0200, Johannes Hirte wrote:
> > I see the same behaviour on Carrizo. Is Excavator an old machine too?
> 
> Do
> 
> # rdmsr -a 0xc0010015
> 
> as root and paste it here.
> 
> Thx.

19001011
19001011
19001011
19001011

--
Regards,
  Johannes


* Re: random insta-reboots on AMD Phenom II
  2017-10-06 19:02               ` Johannes Hirte
@ 2017-10-06 19:24                 ` Borislav Petkov
  0 siblings, 0 replies; 16+ messages in thread
From: Borislav Petkov @ 2017-10-06 19:24 UTC (permalink / raw)
  To: Johannes Hirte
  Cc: Markus Trippelsdorf, Adam Borowski, linux-kernel,
	Andy Lutomirski, x86-ml

On Fri, Oct 06, 2017 at 09:02:09PM +0200, Johannes Hirte wrote:
> 19001011
> 19001011
> 19001011
> 19001011

After you boot, do

wrmsr -a 0xc0010015 0x19001019

as root.

It should fix it temporarily, i.e. until the next boot, while we get it
fixed properly upstream.
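
The value above is the rdmsr output with bit 3 set.  As a rough sketch,
the wrmsr command can be derived from pasted "rdmsr -a" output as shown
below, assuming (as in this thread) one bare hex value per CPU and the
same value on every CPU.  The function name is hypothetical, and the
script only prints a command; it does not touch any MSR itself.

```python
MSR_ADDR = 0xC0010015   # register address used throughout this thread
TLB_CACHE_DIS = 1 << 3  # bit inferred from 0x19001011 -> 0x19001019


def wrmsr_command(rdmsr_output: str) -> str:
    """Build the wrmsr invocation from pasted 'rdmsr -a' output."""
    # rdmsr prints one hex value per CPU, without a 0x prefix.
    values = {int(line, 16) for line in rdmsr_output.split()}
    if len(values) != 1:
        raise ValueError("expected exactly one distinct value across CPUs")
    new_value = values.pop() | TLB_CACHE_DIS
    return f"wrmsr -a {MSR_ADDR:#x} {new_value:#x}"


print(wrmsr_command("19001011\n19001011\n19001011\n19001011"))
```

Refusing to continue when the CPUs disagree is deliberate: "wrmsr -a"
writes the same value to every CPU, so it would otherwise overwrite
per-CPU differences.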

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


