amd-gfx.lists.freedesktop.org archive mirror
* Possibility of RX570 responsible for spontaneous reboots (MCE) with Ryzen 3700x?
@ 2020-03-08 13:06 Clemens Eisserer
  2020-03-08 18:45 ` Bridgman, John
  0 siblings, 1 reply; 10+ messages in thread
From: Clemens Eisserer @ 2020-03-08 13:06 UTC (permalink / raw)
  To: amd-gfx

Hi there,

Right after Ryzen3xxx was available I built a new system consisting of:
- Asrock Phantom Gaming 4 X570 (latest BIOS 2.3)
- Ryzen 3700x (not overclocked)
- MSI RX570 4GB
- Larger CPU cooler, high quality PSU, etc...

The system runs stable with Windows-10 (no reboot BSOD in months) and
runs memtest86 (single/multicore) as well as various load-tests for
hours without errors. However running Linux I get a spontaneous reboot
every now and then (2-3x a week), with always the same machine check
exception logged:

[    0.105003] .... node  #0, CPUs:        #1  #2
[    0.107022] mce: [Hardware Error]: Machine check events logged
[    0.107023] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5:
bea0000000000108
[    0.107092] mce: [Hardware Error]: TSC 0 ADDR 7f80a0c0181a MISC
d012000100000000 SYND 4d000000 IPID 500b000000000
[    0.107167] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME
1580717835 SOCKET 0 APIC 4 microcode 8701013

I've tried a lot of different CPU-related things, like disabling C6,
disabling MWAIT use for task switching, etc without success.
I tried two times to contact AMD support only asking them to please
decode the MCE hex value - but as soon as they read over the term
"linux" they basically abort any communication. And to be honest, I had
the impression that they did not actually know what an MCE is in the
first place.

Luckily I found a decoder on github which prints:
Bank: Execution Unit (EX)
Error: Watchdog Timeout error (WDT 0x0)
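
For readers who want to check the raw value by hand, the sketch below
unpacks the flag bits of the Bank 5 status word from the log above. It
assumes the architectural x86 MCA_STATUS layout (VAL in bit 63 down to
PCC in bit 57) and AMD's extended error code field in bits 21:16. The
name decode_mca_status is made up for this illustration and is not
taken from any decoder mentioned in this thread.

```python
# Flag positions per the architectural MCA_STATUS layout (an assumption
# of this sketch; consult the processor manual for bank-specific fields).
MCA_STATUS_FLAGS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # reporting of this error was enabled
    "MISCV": 59,  # MISC register contents are valid
    "ADDRV": 58,  # ADDR register contents are valid
    "PCC": 57,    # processor context may be corrupt
}


def decode_mca_status(status):
    """Split a 64-bit MCA status word into flags and error-code fields."""
    fields = {name: (status >> bit) & 1 for name, bit in MCA_STATUS_FLAGS.items()}
    fields["ext_error_code"] = (status >> 16) & 0x3F  # bits 21:16
    fields["error_code"] = status & 0xFFFF            # bits 15:0
    return fields


if __name__ == "__main__":
    # Status value copied from the "Bank 5" line of the log above.
    for name, value in decode_mca_status(0xBEA0000000000108).items():
        print(f"{name:15} {value:#x}")
```

For this value the valid, uncorrected and context-corrupt flags are all
set, the extended error code is 0 and the error code is 0x108. An
uncorrected error with corrupt context would be consistent with the
machine resetting instead of recovering.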

I was rather hopeless until I found the following reddit thread:
https://www.reddit.com/r/archlinux/comments/e33nyg/hard_reboots_with_ryzen_3600x/

The users there claim to experience exactly the same problem (even
with the same MCE code logged) but were using R600-based graphics
cards - one of them is even using the same mainboard. When he swapped his
R600 card for a new RX5700 the problems vanished.

I don't have the luxury to simply try another GPU (my RX5700 is the
only one properly driving my 4k@60Hz panel), however the whole
observation makes me wonder. How can a GPU be responsible for
low-level errors such as the machine check exception in the execution
units like the one mentioned above?
Could DMA transfers gone bad be the culprit?
Are there any "safe mode" options available I could try regarding
amdgpu (I tried disabling low-power states but this didn't help and
only made my GPU fans spin up)?

Any help is highly appreciated.

Thanks, Clemens
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


* Re: Possibility of RX570 responsible for spontaneous reboots (MCE) with Ryzen 3700x?
  2020-03-08 13:06 Possibility of RX570 responsible for spontaneous reboots (MCE) with Ryzen 3700x? Clemens Eisserer
@ 2020-03-08 18:45 ` Bridgman, John
  2020-03-08 19:10   ` Bridgman, John
  0 siblings, 1 reply; 10+ messages in thread
From: Bridgman, John @ 2020-03-08 18:45 UTC (permalink / raw)
  To: Clemens Eisserer, amd-gfx



[AMD Official Use Only - Internal Distribution Only]

The decoded MCE info doesn't look right... if the last bit is a zero I believe that means the watchdog timer is not enabled.

That said, I'm not sure how the decoder you found works, but it seems like a bit more information would be required than what you passed in. Can you point me to the program you used ?

Thanks,
John


* Re: Possibility of RX570 responsible for spontaneous reboots (MCE) with Ryzen 3700x?
  2020-03-08 18:45 ` Bridgman, John
@ 2020-03-08 19:10   ` Bridgman, John
  2020-03-08 19:14     ` Bridgman, John
  2020-03-09  6:30     ` Clemens Eisserer
  0 siblings, 2 replies; 10+ messages in thread
From: Bridgman, John @ 2020-03-08 19:10 UTC (permalink / raw)
  To: Clemens Eisserer, amd-gfx



OK, that's a bit strange... I found mcelog and MCE-Ryzen-Decoder as options for decoding.

In the MCE-Ryzen-Decoder documentation the example is exactly the error you are seeing, with the same output, so I'm guessing that is what you are using:

https://github.com/DimitriFourny/MCE-Ryzen-Decoder

On the other hand I found a report on the AMD forums where the same error is decoded by mcelog as a generic error in a memory transaction, which seems to make more sense.

https://community.amd.com/thread/216084
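
The "generic error in a memory transaction" reading can be reproduced
from the low 16 bits of the status word (0x0108 in the log earlier in
this thread). The sketch below assumes the standard x86 compound error
code patterns (TLB: 0000 0000 0001 TTLL, memory hierarchy:
0000 0001 RRRR TTLL, bus: 0000 1PPT RRRR IILL). The function name
classify_error_code is hypothetical, and this is not how mcelog itself
is implemented.

```python
# Field encodings from the x86 compound MCA error code format
# (assumed here; bank-specific meanings are in the processor manual).
TRANSACTION_TYPE = {0: "instruction", 1: "data", 2: "generic"}
CACHE_LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "generic"}
REQUEST = {
    0: "generic error", 1: "generic read", 2: "generic write",
    3: "data read", 4: "data write", 5: "instruction fetch",
    6: "prefetch", 7: "evict", 8: "snoop",
}


def classify_error_code(code):
    """Describe a 16-bit MCA error code using the compound code patterns."""
    if code & 0xF800 == 0x0800:      # 0000 1PPT RRRR IILL
        return "bus/interconnect error"
    if code & 0xFF00 == 0x0100:      # 0000 0001 RRRR TTLL
        request = REQUEST.get((code >> 4) & 0xF, "reserved")
        ttype = TRANSACTION_TYPE.get((code >> 2) & 0x3, "reserved")
        level = CACHE_LEVEL[code & 0x3]
        return f"memory hierarchy error: {request}, type {ttype}, level {level}"
    if code & 0xFFF0 == 0x0010:      # 0000 0000 0001 TTLL
        return "TLB error"
    return "unclassified"


if __name__ == "__main__":
    print(classify_error_code(0x0108))
```

Under these assumptions 0x0108 falls into the memory hierarchy class
with all sub-fields at their generic values, while a bus error would
have bit 11 set. The two decoders may therefore simply be looking at
different fields of the same register.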

For something as simple as the GPU bus interface not responding to an access by the CPU I think you would get a different error (bus error) but not 100% sure about that.

My first thought would be to see if your motherboard BIOS has an option to force PCIe gen3 instead of gen4 and see if that makes a difference. There are some amdgpu module parameters related to PCIe as well, but I'm not sure which ones to recommend.
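
To see which link speed the card has actually negotiated before and
after changing the BIOS setting, the PCI sysfs attributes can be read
directly. This is a minimal sketch, assuming a Linux kernel that
exposes current_link_speed and max_link_speed under
/sys/bus/pci/devices; the function names are made up for this example.

```python
from pathlib import Path


def read_link_speeds(device_dir):
    """Return (current, maximum) link speed strings for one PCI device."""
    device = Path(device_dir)
    current = (device / "current_link_speed").read_text().strip()
    maximum = (device / "max_link_speed").read_text().strip()
    return current, maximum


def list_gpu_links(sysfs_root="/sys/bus/pci/devices"):
    """Map PCI address -> (current, max) speed for display-class devices."""
    root = Path(sysfs_root)
    results = {}
    if not root.is_dir():
        return results
    for device in sorted(root.iterdir()):
        class_file = device / "class"
        if not class_file.is_file():
            continue
        # PCI base class 0x03 covers display controllers.
        if not class_file.read_text().strip().startswith("0x03"):
            continue
        try:
            results[device.name] = read_link_speeds(device)
        except OSError:
            continue  # device does not expose link speed attributes
    return results


if __name__ == "__main__":
    for address, (current, maximum) in list_gpu_links().items():
        print(f"{address}: current {current}, max {maximum}")
```

Note that a GPU may lower its link speed while idle, so the current
value is best read while the card is under load.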


* Re: Possibility of RX570 responsible for spontaneous reboots (MCE) with Ryzen 3700x?
  2020-03-08 19:10   ` Bridgman, John
@ 2020-03-08 19:14     ` Bridgman, John
  2020-03-09  6:30     ` Clemens Eisserer
  1 sibling, 0 replies; 10+ messages in thread
From: Bridgman, John @ 2020-03-08 19:14 UTC (permalink / raw)
  To: Clemens Eisserer, amd-gfx



[AMD Public Use]

Fixing the security tag...


* Re: Possibility of RX570 responsible for spontaneous reboots (MCE) with Ryzen 3700x?
  2020-03-08 19:10   ` Bridgman, John
  2020-03-08 19:14     ` Bridgman, John
@ 2020-03-09  6:30     ` Clemens Eisserer
  2020-03-09  6:58       ` Bridgman, John
  1 sibling, 1 reply; 10+ messages in thread
From: Clemens Eisserer @ 2020-03-09  6:30 UTC (permalink / raw)
  To: Bridgman, John, amd-gfx

Hi John,

Thanks a lot for taking the time to look at this, even if it doesn't
seem to be GPU related at first.

> OK, that's a bit strange... I found mce log and MCE-Ryzen-Decoder as options for decoding.
Sorry for omitting that information - indeed I was using
MCE-Ryzen-Decoder, thanks for pointing to mcelog.
The mcelog output definitely makes more sense, I'll try to
experiment a bit with RAM.

Thanks also for the link to the forum, seems of all the affected users,
no one reported success in that thread.

> For something as simple as the GPU bus interface not responding to an access
> by the CPU I think you would get a different error (bus error) but not 100% sure about that.
>
> My first thought would be to see if your mobo BIOS has an option to force PCIE
> gen3 instead of 4 and see if that makes a difference. There are some amdgpu module parms
> related to PCIE as well but I'm not sure which ones to recommend.

I'll give it a try and have a look at the PCIe options - but as far as
I know the RX570 (Polaris) should stay at PCIe gen3 anyway.
Disabling IOMMU didn't help as far as I recall.

Thanks & best regards, Clemens

* Re: Possibility of RX570 responsible for spontaneous reboots (MCE) with Ryzen 3700x?
  2020-03-09  6:30     ` Clemens Eisserer
@ 2020-03-09  6:58       ` Bridgman, John
  2020-03-21 10:08         ` Clemens Eisserer
  0 siblings, 1 reply; 10+ messages in thread
From: Bridgman, John @ 2020-03-09  6:58 UTC (permalink / raw)
  To: Clemens Eisserer, amd-gfx



[AMD Official Use Only - Internal Distribution Only]

>I know RX570 (polaris) should stay at PCI3 as far as I know.

Yep... thought I remembered you mentioning having a 5700XT though... is that in a different system ?


* Re: Possibility of RX570 responsible for spontaneous reboots (MCE) with Ryzen 3700x?
  2020-03-09  6:58       ` Bridgman, John
@ 2020-03-21 10:08         ` Clemens Eisserer
  0 siblings, 0 replies; 10+ messages in thread
From: Clemens Eisserer @ 2020-03-21 10:08 UTC (permalink / raw)
  To: Bridgman, John, amd-gfx

Hi John,

> >I know RX570 (polaris) should stay at PCI3 as far as I know.
>
> Yep... thought I remembered you mentioning having a 5700XT though... is that in a different system ?

I am using an RX570; the guy from reddit changed from an R600 to a 5700XT
and it seems that solved his reboot problems.

As the system is rock solid with windows-10 and others seem to
experience similar behaviour I've decided to file a bug at the
kernel's bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=206903

Thanks for your suggestions & best regards, Clemens



* Re: Possibility of RX570 responsible for spontaneous reboots (MCE) with Ryzen 3700x?
       [not found]       ` <CAE6bd32OwckMGcKLZvX5StOFqOFNx9nUGdNAJvbSzgUHkjma3g@mail.gmail.com>
@ 2020-04-10  6:55         ` Clemens Eisserer
  0 siblings, 0 replies; 10+ messages in thread
From: Clemens Eisserer @ 2020-04-10  6:55 UTC (permalink / raw)
  To: someguy108, amd-gfx; +Cc: Bridgman, John

Hi Someguy,

I've been running with accelmethod=none and llvmpipe for opengl now
for over a week (more or less using only the display engine of my
rx570) and haven't experienced a single MCE during that period.
However, statistically, it will take 1-2 additional weeks to be sure
this is not a coincidence.
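
As a rough sanity check of that estimate: if the reboots are modelled
as a Poisson process at the 2-3 per week rate reported at the start of
the thread (an assumption; the real events may well cluster with
workload), the chance of seeing no reboot purely by luck shrinks
quickly with each clean week. The helper name prob_no_events is
hypothetical.

```python
import math


def prob_no_events(rate_per_week, weeks):
    """Probability of zero events in `weeks` for a Poisson process."""
    return math.exp(-rate_per_week * weeks)


if __name__ == "__main__":
    # Rates taken from the "2-3x a week" figure in the original report.
    for rate in (2.0, 3.0):
        for weeks in (1, 2, 3):
            p = prob_no_events(rate, weeks)
            print(f"rate {rate}/week, {weeks} clean week(s): {p:.4%}")
```

Under that model a single clean week still happens by chance roughly 5
to 14 percent of the time, whereas three clean weeks in a row would
happen well under 1 percent of the time, which fits waiting another
1-2 weeks before drawing a conclusion.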

This seems to confirm your observations (as well as the reports in the
reddit thread) that the GPU plays at least some role when it
comes to the MCE-caused reboots.

@John: Could it be the 3rd gen ryzen has some issues which are covered
by windows driver workarounds?
Are CPU errata data sheets publicly available?

Best regards, Clemens

PS: While it does seem to solve the reboot issue, it causes degraded
desktop performance as well as tearing when playing videos.
So I still hope I don't have to live with this forever.

PS2: AMD support responded after some delay, stating the MCE actually is
caused by the watchdog timer.

On Sun, Apr 5, 2020 at 04:41, someguy108 <someguy108@gmail.com> wrote:
>
> How has it been going since you limited hardware acceleration? Any improvement?
>
> On Thu, Apr 2, 2020, 11:13 PM Clemens Eisserer <linuxhippy@gmail.com> wrote:
>>
>> Hi,
>>
>> I also had the impression this issue is triggered by power-state changes.
>> First I suspected CPU power state transitions, but now more and more
>> reports pop up mentioning that exchanging the GPU solved the issue (or in
>> your case it started when switching to an AMD GPU).
>> However I've already tried to limit power states using the
>> /sys/ interface as well as the custom feature masks suggested on
>> the kernel bug report.
>>
>> Your theory regarding firefox and compositing might be right, so this
>> is what I did:
>> - Switch from Glamor to "none" acceleration for Xorg (glamor
>> translates 2d drawing commands to OpenGL)
>> - Switch Firefox from WebRender to Software-Rendering
>> - Disable KDE composition manager
>>
>> Only time will tell...
>>
>> Best regards, Clemens
>>
>>
>> On Fri, Apr 3, 2020 at 03:39, someguy108 <someguy108@gmail.com> wrote:
>> >
>> > Hi Clemens, I responded to that bug report with my findings! I do
>> > wonder, since yours is happening at the desktop, do you have compositing
>> > enabled in your window manager? If my theory (based only on my
>> > limited understanding and observations) about it revolving around
>> > power states is correct, having Firefox play a video with
>> > hardware acceleration enabled plus desktop compositing could be causing
>> > back-and-forth swings between power states, which could be the culprit
>> > behind the hangs.
>> >
>> > I also formed some of my theory from the long debacle of AMD's Windows
>> > driver quality, as these hangs sound awfully similar to what Windows
>> > users have been enduring for almost all of 2019 and some parts of
>> > 2020. As noted, I had TDRs in Windows while alt-tabbing with
>> > games until I disabled hardware acceleration in Google Chrome. And for
>> > good measure I disabled most animations and compositing effects in the
>> > Windows UI (no more shadows under windows and such), as I like to play
>> > most of my games in a window. Though I do know some are Navi
>> > specific.
>> >
>> > On Thu, Apr 2, 2020 at 7:12 AM Clemens Eisserer <linuxhippy@gmail.com> wrote:
>> > >
>> > > Hi Someguy,
>> > >
>> > > Your findings sound very familiar. My machine is also rock-solid
>> > > running Windows 10. Most of the MCEs happened for me in low-load
>> > > situations, with Firefox playing YouTube in the background.
>> > > At first I didn't care that much, but now that I have experienced
>> > > corrupted Firefox profiles and lost spreadsheet work it is starting
>> > > to get annoying.
>> > >
>> > > I've filed a kernel bug regarding this issue:
>> > > https://bugzilla.kernel.org/show_bug.cgi?id=206903
>> > > I would appreciate if you could report your findings there to give the
>> > > issue more data / weight.
>> > >
>> > > Thanks, Clemens

* Re: Possibility of RX570 responsible for spontaneous reboots (MCE) with Ryzen 3700x?
  2020-04-02 11:11 someguy108
@ 2020-04-02 14:12 ` Clemens Eisserer
       [not found]   ` <CAE6bd33O_ivJa9kQHAnp-_GZDagena==FZgfYs=_nYJc6m56CQ@mail.gmail.com>
  0 siblings, 1 reply; 10+ messages in thread
From: Clemens Eisserer @ 2020-04-02 14:12 UTC (permalink / raw)
  To: someguy108, amd-gfx

Hi Someguy,

Your findings sound very familiar. My machine is also rock-solid
running Windows 10. Most of the MCEs happened for me in low-load
situations, with Firefox playing YouTube in the background.
At first I didn't care that much, but now that I have experienced
corrupted Firefox profiles and lost spreadsheet work it is starting
to get annoying.

I've filed a kernel bug regarding this issue:
https://bugzilla.kernel.org/show_bug.cgi?id=206903
I would appreciate if you could report your findings there to give the
issue more data / weight.

Thanks, Clemens

* Re: Possibility of RX570 responsible for spontaneous reboots (MCE) with Ryzen 3700x?
@ 2020-04-02 11:11 someguy108
  2020-04-02 14:12 ` Clemens Eisserer
  0 siblings, 1 reply; 10+ messages in thread
From: someguy108 @ 2020-04-02 11:11 UTC (permalink / raw)
  To: amd-gfx


Hello! I saw Clemens Eisserer's email regarding MCE errors with his RX 570
and 3700X, and I'd like to add my own case to the list of spontaneous MCE
reboots.
This is my configuration:
- Ryzen 3900X + Noctua D15
- MSI X570 Unify (latest AGESA as of writing)
- DDR4 3200 MHz 32 GB kit
- Sapphire Pulse 5700 XT
- Corsair RMX 850 Watt
- Arch Linux with kernel 5.5.13
- Mesa 20.0.3
- Early KMS enabled

I've had this system up and running since November 2019, initially with
an Nvidia 1060 and Windows 10. Everything was running smoothly. About a
month ago I purchased my 5700 XT and switched back over to Linux, as I had
planned from the start. Since returning I've experienced multiple
spontaneous MCE reboots. All of them happened while I was playing one
particular game, Warcraft 3 Reforged. The MCE event is the following:

kernel: mce: [Hardware Error]: Machine check events logged
kernel: mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 5: bea0000000000108
kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffad66d6fe MISC d012000100000000 SYND 4d000000 IPID 500b000000000
kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1585120217 SOCKET 0 APIC 2 microcode 8701013
kernel: #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15
kernel: mce: [Hardware Error]: Machine check events logged
kernel: mce: [Hardware Error]: CPU 15: Machine Check: 0 Bank 5: bea0000000000108
kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffc1196eb6 MISC d012000100000000 SYND 4d000000 IPID 500b000000000
kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1585120217 SOCKET 0 APIC 9 microcode 8701013
kernel: #16 #17 #18 #19 #20 #21 #22 #23

Initially I figured it could be the RAM, so I ran the usual memory tests,
which showed no problems. I also tested at standard JEDEC settings and
still eventually received an MCE during Warcraft 3 Reforged. After
consulting a few friends I tried a different power supply, to no avail. I
then bit the bullet and bought a brand new 3900X. I cleared CMOS both
before and after installing the new 3900X. All CPU values are on auto, with
no PBO or manual overclocking; the only thing not left at stock is the RAM.
Yesterday, after owning the new 3900X for three days, I had an MCE while
playing Warcraft 3 Reforged. I have tested other games (World of Warcraft,
The Outer Worlds, Stellaris, and Counter-Strike: Global Offensive), but
none of them caused an MCE or any crashes or freezes.

As with Clemens, the decoder he used (MCE-Ryzen-Decoder, from GitHub)
reports the MCE as follows:
Bank: Execution Unit (EX)
Error: Watchdog Timeout error (WDT 0x0)
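
For reference, the status value from the log can also be split into its
fields by hand. The following is a minimal Python sketch, not the
MCE-Ryzen-Decoder itself. It assumes the architectural x86 MCA_STATUS flag
positions (bits 63 to 57), the AMD-specific TCC and SyndV positions (bits
55 and 53), and an extended error code in bits 21:16. The function name
decode_status is made up for illustration, and the reading of "bank 5,
extended code 0" as an EX watchdog timeout is taken from the decoder
output above rather than derived by this code.

```python
# Sketch: split an MCA_STATUS value into flag bits and error codes.
# Bit positions are assumptions stated in the text above.
FLAG_BITS = {
    "Val": 63,       # register contains a valid error
    "Overflow": 62,  # an earlier error was overwritten
    "UC": 61,        # uncorrected error
    "En": 60,        # error reporting was enabled
    "MiscV": 59,     # MISC register is valid
    "AddrV": 58,     # ADDR register is valid
    "PCC": 57,       # processor context corrupt
    "TCC": 55,       # task context corrupt (AMD-specific, assumed)
    "SyndV": 53,     # SYND register is valid (AMD-specific, assumed)
}


def decode_status(status):
    """Return the set flags and the two error-code fields of a status value."""
    flags = [name for name, bit in FLAG_BITS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "error_code_ext": (status >> 16) & 0x3F,  # bits 21:16
        "error_code": status & 0xFFFF,            # bits 15:0
    }


# Status value reported for Bank 5 in the log above.
result = decode_status(0xBEA0000000000108)
print(result["flags"])
print(hex(result["error_code_ext"]), hex(result["error_code"]))
```

Under these assumptions the value is flagged as valid, uncorrected and
context-corrupting, with extended error code 0, which matches the
"WDT 0x0" in the decoder output.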

One thing to note is that I haven't seen it during desktop usage, only in
Warcraft 3. I have desktop compositing disabled in both Xfce and KDE, and
always have; I used and tested both, and received MCEs in sessions under
each. I have noticed a pattern in the MCE crashes with Warcraft 3: they
always happen during a transition where GPU load drops off or ramps up,
that is, when exiting a match to return to the lobby, or when loading a map
at the moment it switches from the loading screen to the match itself. The
entire screen quickly turns black, everything hard-locks, and after about a
minute the machine reboots on its own. It hasn't happened yet in the middle
of a match, or while sitting in the lobby or at the main menu screen. It
has consistently been during a transition.

My theory is that this could be a GPU hang while switching from one power
state to another: the hanging GPU causes the CPU to stall, and thus an MCE.
A GPU hang would also explain the sudden solid black screen, since all
output stops. But I'm really just guessing here from my own observations
and my limited understanding. A possible reason why this triggers in
Warcraft 3 is that the other games have few moments where power states
switch heavily: The Outer Worlds, World of Warcraft, Stellaris, and
Counter-Strike: Global Offensive all keep a constant high load on the GPU,
and the match sessions are long.

For what it's worth, I've had no major issues in Windows 10. The only
quirks were, initially, a few TDRs (which recovered) when alt-tabbing out
of most games with Google Chrome running. Disabling hardware acceleration
in Chrome fixed those TDRs.

While searching, which is how I found this mailing list report, I came
across quite a few reports of people receiving MCEs that aren't the typical
first-generation Ryzen MCE reports from 2017. Those were fixed by disabling
C-states, by RAM changes, and by changing the power supply current setting
from low to typical. The ones from within the past year all appear to have
an AMD GPU in common. I did notice a few with Intel CPUs as well, paired
with an AMD GPU.

Any feedback would be greatly appreciated.

end of thread, other threads:[~2020-04-10  6:55 UTC | newest]

Thread overview: 10+ messages
2020-03-08 13:06 Possibility of RX570 responsible for spontaneous reboots (MCE) with Ryzen 3700x? Clemens Eisserer
2020-03-08 18:45 ` Bridgman, John
2020-03-08 19:10   ` Bridgman, John
2020-03-08 19:14     ` Bridgman, John
2020-03-09  6:30     ` Clemens Eisserer
2020-03-09  6:58       ` Bridgman, John
2020-03-21 10:08         ` Clemens Eisserer
2020-04-02 11:11 someguy108
2020-04-02 14:12 ` Clemens Eisserer
     [not found]   ` <CAE6bd33O_ivJa9kQHAnp-_GZDagena==FZgfYs=_nYJc6m56CQ@mail.gmail.com>
     [not found]     ` <CAFvQSYRnfAKCU0MHqwstHqgA4YTWvfCQ1SrsURbee=kEuP7a2g@mail.gmail.com>
     [not found]       ` <CAE6bd32OwckMGcKLZvX5StOFqOFNx9nUGdNAJvbSzgUHkjma3g@mail.gmail.com>
2020-04-10  6:55         ` Clemens Eisserer
