dri-devel.lists.freedesktop.org archive mirror
* [Bug 206475] New: amdgpu under load drop signal to monitor until hard reset
@ 2020-02-09 20:36 bugzilla-daemon
  2020-02-10 13:20 ` [Bug 206475] " bugzilla-daemon
                   ` (21 more replies)
  0 siblings, 22 replies; 23+ messages in thread
From: bugzilla-daemon @ 2020-02-09 20:36 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=206475

            Bug ID: 206475
           Summary: amdgpu under load drop signal to monitor until hard
                    reset
           Product: Drivers
           Version: 2.5
    Kernel Version: 5.5.2
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: Video(DRI - non Intel)
          Assignee: drivers_video-dri@kernel-bugs.osdl.org
          Reporter: rodomar705@protonmail.com
        Regression: No

Created attachment 287265
  --> https://bugzilla.kernel.org/attachment.cgi?id=287265&action=edit
dmesg for the amdgpu hardware freeze

While gaming, the monitor randomly goes blank. The only errors in the system
logs are:

kernel: amdgpu: [powerplay] last message was failed ret is 65535
kernel: amdgpu: [powerplay] failed to send message 200 ret is 65535 
kernel: amdgpu: [powerplay] last message was failed ret is 65535
kernel: amdgpu: [powerplay] failed to send message 282 ret is 65535 
kernel: amdgpu: [powerplay] last message was failed ret is 65535
kernel: amdgpu: [powerplay] failed to send message 201 ret is 65535

with the occasional
kernel: [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR*
[CRTC:47:crtc-0] flip_done timed out
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled
seq=5275264, emitted seq=5275266
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process
Hand of Fate 2. pid 682062 thread Hand of Fa:cs0 pid 682064
kernel: amdgpu 0000:06:00.0: GPU reset begin!

over and over again. If I reset the system, there is no video output until the
system is fully powered off.
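
To see which SMU messages fail most often in a long log, the lines quoted above can be tallied with a short script. This is only a minimal sketch: the helper name `count_failed_messages` is hypothetical, and the pattern assumes the single-line form shown in this report (some kernels split the message across two lines, which this sketch does not handle).

```python
import re
from collections import Counter

# Matches the single-line form quoted in the report, e.g.
# "kernel: amdgpu: [powerplay] failed to send message 200 ret is 65535"
FAILED_SEND_RE = re.compile(
    r"\[powerplay\]\s+failed to send message (\d+) ret is (\d+)"
)

def count_failed_messages(lines):
    """Count failed powerplay sends per message ID (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = FAILED_SEND_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts

# Sample lines copied from the report above.
SAMPLE_LOG = [
    "kernel: amdgpu: [powerplay] last message was failed ret is 65535",
    "kernel: amdgpu: [powerplay] failed to send message 200 ret is 65535 ",
    "kernel: amdgpu: [powerplay] last message was failed ret is 65535",
    "kernel: amdgpu: [powerplay] failed to send message 282 ret is 65535 ",
    "kernel: amdgpu: [powerplay] last message was failed ret is 65535",
    "kernel: amdgpu: [powerplay] failed to send message 201 ret is 65535",
]

for msg_id, n in sorted(count_failed_messages(SAMPLE_LOG).items()):
    print(f"message {msg_id}: {n} failed send(s)")
```

In practice the lines would come from the attached dmesg file rather than from an inline list.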

B450 chipset + Ryzen 5 2600 + Radeon RX580 GPU

Full log is attached to this post.

Can anyone at AMD give me some pointers to what the problem is?

Thanks,

Marco.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


* [Bug 206475] amdgpu under load drop signal to monitor until hard reset
  2020-02-09 20:36 [Bug 206475] New: amdgpu under load drop signal to monitor until hard reset bugzilla-daemon
@ 2020-02-10 13:20 ` bugzilla-daemon
  2020-02-10 13:21 ` bugzilla-daemon
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2020-02-10 13:20 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=206475

--- Comment #1 from Marco (rodomar705@protonmail.com) ---
Just tested under the stock 5.5.2 kernel (apart from the ZFS module) and the
same problem shows up. Log attached.


* [Bug 206475] amdgpu under load drop signal to monitor until hard reset
  2020-02-09 20:36 [Bug 206475] New: amdgpu under load drop signal to monitor until hard reset bugzilla-daemon
  2020-02-10 13:20 ` [Bug 206475] " bugzilla-daemon
@ 2020-02-10 13:21 ` bugzilla-daemon
  2020-02-10 16:39 ` bugzilla-daemon
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2020-02-10 13:21 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=206475

--- Comment #2 from Marco (rodomar705@protonmail.com) ---
Created attachment 287275
  --> https://bugzilla.kernel.org/attachment.cgi?id=287275&action=edit
amdgpu crash with stock kernel


* [Bug 206475] amdgpu under load drop signal to monitor until hard reset
  2020-02-09 20:36 [Bug 206475] New: amdgpu under load drop signal to monitor until hard reset bugzilla-daemon
  2020-02-10 13:20 ` [Bug 206475] " bugzilla-daemon
  2020-02-10 13:21 ` bugzilla-daemon
@ 2020-02-10 16:39 ` bugzilla-daemon
  2020-02-10 16:40 ` bugzilla-daemon
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2020-02-10 16:39 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=206475

--- Comment #3 from Marco (rodomar705@protonmail.com) ---
Same thing with linux-amd-drm-next, dmesg attached. Any pointers to the cause?


* [Bug 206475] amdgpu under load drop signal to monitor until hard reset
  2020-02-09 20:36 [Bug 206475] New: amdgpu under load drop signal to monitor until hard reset bugzilla-daemon
                   ` (2 preceding siblings ...)
  2020-02-10 16:39 ` bugzilla-daemon
@ 2020-02-10 16:40 ` bugzilla-daemon
  2020-02-10 19:33 ` bugzilla-daemon
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2020-02-10 16:40 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=206475

--- Comment #4 from Marco (rodomar705@protonmail.com) ---
Created attachment 287277
  --> https://bugzilla.kernel.org/attachment.cgi?id=287277&action=edit
dmesg for amd-drm-next


* [Bug 206475] amdgpu under load drop signal to monitor until hard reset
  2020-02-09 20:36 [Bug 206475] New: amdgpu under load drop signal to monitor until hard reset bugzilla-daemon
                   ` (3 preceding siblings ...)
  2020-02-10 16:40 ` bugzilla-daemon
@ 2020-02-10 19:33 ` bugzilla-daemon
  2020-02-17 13:23 ` bugzilla-daemon
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2020-02-10 19:33 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=206475

Marco (rodomar705@protonmail.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |OBSOLETE

--- Comment #5 from Marco (rodomar705@protonmail.com) ---
It seems that the problem was insufficient cooling, since the same happened on
a Windows VM.


* [Bug 206475] amdgpu under load drop signal to monitor until hard reset
  2020-02-09 20:36 [Bug 206475] New: amdgpu under load drop signal to monitor until hard reset bugzilla-daemon
                   ` (4 preceding siblings ...)
  2020-02-10 19:33 ` bugzilla-daemon
@ 2020-02-17 13:23 ` bugzilla-daemon
  2020-02-21 21:13 ` bugzilla-daemon
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2020-02-17 13:23 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=206475

Marco (rodomar705@protonmail.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|OBSOLETE                    |---

--- Comment #6 from Marco (rodomar705@protonmail.com) ---
(In reply to Marco from comment #5)
> It seems that the problem was insufficient cooling, since the same happened
> on a Windows VM.

Actually, I was wrong: I tested FurMark with two different driver sets on
Windows 10 bare metal, with no crashes for an hour (FurMark in the VM lasted 30
seconds).

This is a firmware/software problem. Please fix it.


* [Bug 206475] amdgpu under load drop signal to monitor until hard reset
  2020-02-09 20:36 [Bug 206475] New: amdgpu under load drop signal to monitor until hard reset bugzilla-daemon
                   ` (5 preceding siblings ...)
  2020-02-17 13:23 ` bugzilla-daemon
@ 2020-02-21 21:13 ` bugzilla-daemon
  2020-02-24 13:50 ` bugzilla-daemon
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2020-02-21 21:13 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=206475

Marco (rodomar705@protonmail.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|---                         |OBSOLETE

--- Comment #7 from Marco (rodomar705@protonmail.com) ---
Found the root of the issue: somehow ZFS was able to trigger a hard lock in
amdgpu, always in the same way. After removing it and switching to XFS, the
problem is gone.


* [Bug 206475] amdgpu under load drop signal to monitor until hard reset
  2020-02-09 20:36 [Bug 206475] New: amdgpu under load drop signal to monitor until hard reset bugzilla-daemon
                   ` (6 preceding siblings ...)
  2020-02-21 21:13 ` bugzilla-daemon
@ 2020-02-24 13:50 ` bugzilla-daemon
  2020-02-24 13:52 ` bugzilla-daemon
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2020-02-24 13:50 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=206475

Marco (rodomar705@protonmail.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|OBSOLETE                    |---

--- Comment #8 from Marco (rodomar705@protonmail.com) ---
And it's back. Far less often, but it is still there. However, this time I've
got a kernel warning with a backtrace:

feb 24 14:31:13 *** kernel: ------------[ cut here ]------------
feb 24 14:31:13 *** kernel: WARNING: CPU: 3 PID: 24149 at
drivers/gpu/drm/amd/amdgpu/../display/dc/dce/dce_link_encoder.c:1099
dce110_link_encoder_disable_output+0x12a/0x140 [amdgpu]
feb 24 14:31:13 *** kernel: Modules linked in: rfcomm fuse xt_CHECKSUM
xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle
ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6
nf_defrag_ipv4 ebtable_filter ebtables ip6table_filter ip6_tables
iptable_filter tun bridge stp llc cmac algif_hash algif_skcipher af_alg sr_mod
cdrom bnep hwmon_vid xfs nls_iso8859_1 nls_cp437 vfat fat btrfs edac_mce_amd
kvm_amd kvm blake2b_generic xor btusb btrtl btbcm btintel bluetooth
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel igb joydev ecdh_generic
aesni_intel eeepc_wmi asus_wmi crypto_simd battery cryptd sparse_keymap
mousedev input_leds ecc glue_helper raid6_pq ccp rfkill wmi_bmof pcspkr k10temp
dca libcrc32c i2c_piix4 rng_core evdev pinctrl_amd mac_hid gpio_amdpt
acpi_cpufreq vboxnetflt(OE) vboxnetadp(OE) vboxdrv(OE) virtio_mmio virtio_input
virtio_pci virtio_balloon usbip_host snd_hda_codec_realtek usbip_core
snd_hda_codec_generic uinput i2c_dev ledtrig_audio sg
feb 24 14:31:13 *** kernel:  snd_hda_codec_hdmi vhba(OE) crypto_user
snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core snd_hwdep snd_pcm
snd_timer snd soundcore ip_tables x_tables ext4 crc32c_generic crc16 mbcache
jbd2 sd_mod hid_generic usbhid hid ahci libahci libata crc32c_intel xhci_pci
xhci_hcd scsi_mod nouveau mxm_wmi wmi amdgpu gpu_sched i2c_algo_bit ttm
drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm agpgart
vfio_pci irqbypass vfio_virqfd vfio_iommu_type1 vfio
feb 24 14:31:13 *** kernel: CPU: 3 PID: 24149 Comm: kworker/3:2 Tainted: G     
     OE     5.5.5-arch1-1 #1
feb 24 14:31:13 *** kernel: Hardware name: System manufacturer System Product
Name/ROG STRIX B450-F GAMING, BIOS 3003 12/09/2019
feb 24 14:31:13 *** kernel: Workqueue: events drm_sched_job_timedout
[gpu_sched]
feb 24 14:31:13 *** kernel: RIP:
0010:dce110_link_encoder_disable_output+0x12a/0x140 [amdgpu]
feb 24 14:31:13 *** kernel: Code: 44 24 38 65 48 33 04 25 28 00 00 00 75 20 48
83 c4 40 5b 5d 41 5c c3 48 c7 c6 40 05 76 c0 48 c7 c7 f0 b1 7d c0 e8 76 c3 d1
ff <0f> 0b eb d0 e8 7d 12 e7 df 66 66 2e 0f 1f 84 00 00 00 00 00 66 90
feb 24 14:31:13 *** kernel: RSP: 0018:ffffb06641417630 EFLAGS: 00010246
feb 24 14:31:13 *** kernel: RAX: 0000000000000000 RBX: ffff9790645be420 RCX:
0000000000000000
feb 24 14:31:13 *** kernel: RDX: 0000000000000000 RSI: 0000000000000082 RDI:
00000000ffffffff
feb 24 14:31:13 *** kernel: RBP: 0000000000000002 R08: 00000000000005ba R09:
0000000000000093
feb 24 14:31:13 *** kernel: R10: ffffb06641417480 R11: ffffb06641417485 R12:
ffffb06641417634
feb 24 14:31:13 *** kernel: R13: ffff979064fe6800 R14: ffff978f4f9201b8 R15:
ffff97906ba1ee00
feb 24 14:31:13 *** kernel: FS:  0000000000000000(0000)
GS:ffff97906e8c0000(0000) knlGS:0000000000000000
feb 24 14:31:13 *** kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
feb 24 14:31:13 *** kernel: CR2: 00007effa1509000 CR3: 0000000350d7a000 CR4:
00000000003406e0
feb 24 14:31:13 *** kernel: Call Trace:
feb 24 14:31:13 *** kernel:  core_link_disable_stream+0x10e/0x3d0 [amdgpu]
feb 24 14:31:13 *** kernel:  ? smu7_send_msg_to_smc.cold+0x20/0x25 [amdgpu]
feb 24 14:31:13 *** kernel:  dce110_reset_hw_ctx_wrap+0xc3/0x260 [amdgpu]
feb 24 14:31:13 *** kernel:  dce110_apply_ctx_to_hw+0x51/0x5d0 [amdgpu]
feb 24 14:31:13 *** kernel:  ? pp_dpm_dispatch_tasks+0x45/0x60 [amdgpu]
feb 24 14:31:13 *** kernel:  ? amdgpu_pm_compute_clocks+0xcd/0x600 [amdgpu]
feb 24 14:31:13 *** kernel:  ? dm_pp_apply_display_requirements+0x1a8/0x1c0
[amdgpu]
feb 24 14:31:13 *** kernel:  dc_commit_state+0x2b9/0x5e0 [amdgpu]
feb 24 14:31:13 *** kernel:  amdgpu_dm_atomic_commit_tail+0x398/0x20f0 [amdgpu]
feb 24 14:31:13 *** kernel:  ? number+0x337/0x380
feb 24 14:31:13 *** kernel:  ? vsnprintf+0x3aa/0x4f0
feb 24 14:31:13 *** kernel:  ? sprintf+0x5e/0x80
feb 24 14:31:13 *** kernel:  ? irq_work_queue+0x35/0x50
feb 24 14:31:13 *** kernel:  ? wake_up_klogd+0x4f/0x70
feb 24 14:31:13 *** kernel:  commit_tail+0x94/0x130 [drm_kms_helper]
feb 24 14:31:13 *** kernel:  drm_atomic_helper_commit+0x113/0x140
[drm_kms_helper]
feb 24 14:31:13 *** kernel:  drm_atomic_helper_disable_all+0x175/0x190
[drm_kms_helper]
feb 24 14:31:13 *** kernel:  drm_atomic_helper_suspend+0x73/0x120
[drm_kms_helper]
feb 24 14:31:13 *** kernel:  dm_suspend+0x1c/0x60 [amdgpu]
feb 24 14:31:13 *** kernel:  amdgpu_device_ip_suspend_phase1+0x81/0xe0 [amdgpu]
feb 24 14:31:13 *** kernel:  amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
feb 24 14:31:13 *** kernel:  amdgpu_device_pre_asic_reset+0x191/0x1a4 [amdgpu]
feb 24 14:31:13 *** kernel:  amdgpu_device_gpu_recover+0x2ee/0xa13 [amdgpu]
feb 24 14:31:13 *** kernel:  amdgpu_job_timedout+0x103/0x130 [amdgpu]
feb 24 14:31:13 *** kernel:  drm_sched_job_timedout+0x3e/0x90 [gpu_sched]
feb 24 14:31:13 *** kernel:  process_one_work+0x1e1/0x3d0
feb 24 14:31:13 *** kernel:  worker_thread+0x4a/0x3d0
feb 24 14:31:13 *** kernel:  kthread+0xfb/0x130
feb 24 14:31:13 *** kernel:  ? process_one_work+0x3d0/0x3d0
feb 24 14:31:13 *** kernel:  ? kthread_park+0x90/0x90
feb 24 14:31:13 *** kernel:  ret_from_fork+0x22/0x40
feb 24 14:31:13 *** kernel: ---[ end trace 3e7589981fe74b17 ]---

Complete log attached below.
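
The backtrace above is easier to read once the speculative frames (those prefixed with `?`) are filtered out. The following is a minimal sketch, assuming each frame sits on one unwrapped line (the mail above wraps some of them); the helper name `reliable_frames` is hypothetical.

```python
import re

# One frame per line: optional "? " marker, symbol+offset/size, optional [module].
FRAME_RE = re.compile(
    r"kernel:\s+(?P<guess>\?\s+)?(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)

def reliable_frames(lines):
    """Return (function, module) pairs, skipping '?' frames (hypothetical helper)."""
    frames = []
    for line in lines:
        match = FRAME_RE.search(line)
        if match and not match.group("guess"):
            frames.append((match.group("func"), match.group("module")))
    return frames

# A few frames copied from the trace above, unwrapped.
SAMPLE_TRACE = [
    "feb 24 14:31:13 *** kernel:  core_link_disable_stream+0x10e/0x3d0 [amdgpu]",
    "feb 24 14:31:13 *** kernel:  ? smu7_send_msg_to_smc.cold+0x20/0x25 [amdgpu]",
    "feb 24 14:31:13 *** kernel:  amdgpu_device_gpu_recover+0x2ee/0xa13 [amdgpu]",
    "feb 24 14:31:13 *** kernel:  amdgpu_job_timedout+0x103/0x130 [amdgpu]",
    "feb 24 14:31:13 *** kernel:  drm_sched_job_timedout+0x3e/0x90 [gpu_sched]",
    "feb 24 14:31:13 *** kernel:  process_one_work+0x1e1/0x3d0",
]

# Kernel traces list the innermost call first; reverse to read outermost first.
for func, module in reversed(reliable_frames(SAMPLE_TRACE)):
    print(f"{func} [{module or 'core kernel'}]")
```

Read this way, the sample shows the warning being raised on the recovery path that starts from the scheduler's job timeout, not during normal rendering.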


* [Bug 206475] amdgpu under load drop signal to monitor until hard reset
  2020-02-09 20:36 [Bug 206475] New: amdgpu under load drop signal to monitor until hard reset bugzilla-daemon
                   ` (7 preceding siblings ...)
  2020-02-24 13:50 ` bugzilla-daemon
@ 2020-02-24 13:52 ` bugzilla-daemon
  2020-05-22 12:55 ` bugzilla-daemon
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2020-02-24 13:52 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=206475

--- Comment #9 from Marco (rodomar705@protonmail.com) ---
Created attachment 287575
  --> https://bugzilla.kernel.org/attachment.cgi?id=287575&action=edit
Latest log with a warning.


* [Bug 206475] amdgpu under load drop signal to monitor until hard reset
  2020-02-09 20:36 [Bug 206475] New: amdgpu under load drop signal to monitor until hard reset bugzilla-daemon
                   ` (8 preceding siblings ...)
  2020-02-24 13:52 ` bugzilla-daemon
@ 2020-05-22 12:55 ` bugzilla-daemon
  2020-05-23 14:40 ` bugzilla-daemon
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2020-05-22 12:55 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=206475

andrewammerlaan@riseup.net (andrewammerlaan@riseup.net) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |andrewammerlaan@riseup.net

--- Comment #10 from andrewammerlaan@riseup.net (andrewammerlaan@riseup.net) ---
Created attachment 289235
  --> https://bugzilla.kernel.org/attachment.cgi?id=289235&action=edit
syslog

I think I ran into this issue as well. It has happened twice. Both times it
happened 10 to 20 minutes *after* playing Minecraft, while I was in a
full-screen video meeting. Everything keeps working except that the screen goes
black; I could finish the meeting, but without seeing anything.

Only the monitors connected to my RX 590 go black. The one connected to the
iGPU just freezes, and after a while the cursor becomes usable again on that
monitor, though all applications remain frozen and switching to a TTY does not
work. REISUB'ing the machine makes it boot on the iGPU. It needs to be
completely powered off and on again to boot from the amdgpu card.

It looks like it does a graphics reset (why though?):
[15554.332021] amdgpu 0000:01:00.0: GPU reset begin!

And from that point onwards everything goes wrong:
[15554.332296] amdgpu: [powerplay] 
[15554.332296]  last message was failed ret is 65535
[15554.332297] amdgpu: [powerplay] 
[15554.332297]  failed to send message 261 ret is 65535 
[15554.332297] amdgpu: [powerplay] 
[15554.332297]  last message was failed ret is 65535

This is kernel 5.6.14
xorg-1.20.8
mesa-20.1.0_rc3
xf86-video-amdgpu-19.1.0

Full log is attached.


* [Bug 206475] amdgpu under load drop signal to monitor until hard reset
  2020-02-09 20:36 [Bug 206475] New: amdgpu under load drop signal to monitor until hard reset bugzilla-daemon
                   ` (9 preceding siblings ...)
  2020-05-22 12:55 ` bugzilla-daemon
@ 2020-05-23 14:40 ` bugzilla-daemon
  2020-05-23 16:44 ` bugzilla-daemon
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2020-05-23 14:40 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=206475

--- Comment #11 from Andrew Ammerlaan (andrewammerlaan@riseup.net) ---
Created attachment 289245
  --> https://bugzilla.kernel.org/attachment.cgi?id=289245&action=edit
messages

Happened again today, while playing GTA V. Same problems appear in the log
(attached). 

I think the title of this bug should be changed; there is more going on here
than just dropping the signal to the monitor, because the monitors connected to
the iGPU freeze as well (no signal drop, just a freeze).

It would be great if someone could give me some pointers as to where I could
find more useful logs. /var/log/messages doesn't seem to be very informative.
It just says that a GPU reset began and that it failed to send some messages
afterwards. Or do I maybe need to set some boot parameters or kernel configs to
get more verbose logs?
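
One commonly used knob for more verbose DRM logging is the `drm.debug` kernel parameter, which takes a bitmask of message categories. The sketch below only builds that bitmask; the bit assignments are an assumption taken from the kernel's parameter documentation and should be checked against the kernel version in use, and the helper name `drm_debug_param` is hypothetical.

```python
# Bit assignments as documented for drm.debug (assumed here; verify for your kernel).
DRM_DEBUG_BITS = {
    "CORE": 0x01,
    "DRIVER": 0x02,
    "KMS": 0x04,
    "PRIME": 0x08,
    "ATOMIC": 0x10,
    "VBL": 0x20,
}

def drm_debug_param(categories):
    """Build a 'drm.debug=0x..' boot parameter string (hypothetical helper)."""
    mask = 0
    for name in categories:
        mask |= DRM_DEBUG_BITS[name]
    return f"drm.debug={mask:#x}"

# Driver and modesetting messages only, to keep the log volume manageable.
print(drm_debug_param(["DRIVER", "KMS"]))
```

Enabling every category produces a very large amount of output, so starting with the driver and KMS bits is usually a reasonable first step.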


* [Bug 206475] amdgpu under load drop signal to monitor until hard reset
  2020-02-09 20:36 [Bug 206475] New: amdgpu under load drop signal to monitor until hard reset bugzilla-daemon
                   ` (10 preceding siblings ...)
  2020-05-23 14:40 ` bugzilla-daemon
@ 2020-05-23 16:44 ` bugzilla-daemon
  2020-06-16 15:48 ` bugzilla-daemon
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2020-05-23 16:44 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=206475

--- Comment #12 from Andrew Ammerlaan (andrewammerlaan@riseup.net) ---
Created attachment 289247
  --> https://bugzilla.kernel.org/attachment.cgi?id=289247&action=edit
messages (reset successful this time)

And again, twice on the same day :(

But this time:
amdgpu 0000:01:00.0: GPU reset begin!
amdgpu 0000:01:00.0: GPU BACO reset
amdgpu 0000:01:00.0: GPU reset succeeded, trying to resume

This time the reset succeeded, however after restarting X, I got stuck on the
KDE login splash screen. The log (attached) shows some segfaults.

It seems to me that there are two issues here.

1) The GPU is (often) not successfully recovered after a reset, and if it is
recovered successfully segfaults follow in radeonsi_dri.so

2) It goes into a reset in the first place, for no apparent reason

I guess this bug report is mostly about the second issue: why does it go into a
reset? How do I debug this?

It would be great if we could get this fixed, as it is getting rather annoying.
(This is a brand-new GPU and it is not overheating, so what is wrong?)


* [Bug 206475] amdgpu under load drop signal to monitor until hard reset
  2020-02-09 20:36 [Bug 206475] New: amdgpu under load drop signal to monitor until hard reset bugzilla-daemon
                   ` (11 preceding siblings ...)
  2020-05-23 16:44 ` bugzilla-daemon
@ 2020-06-16 15:48 ` bugzilla-daemon
  2020-06-16 16:39 ` bugzilla-daemon
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2020-06-16 15:48 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=206475

--- Comment #13 from Marco (rodomar705@protonmail.com) ---
The only way I got it "fixed" (it's more of a workaround, but it is working) was
to slightly drop the clocks (my GPU has a default max boost clock of 1430 MHz;
I have dropped it to 1340 MHz) and voltages (from 1150 mV to 1040 mV at peak
clocks; both depend on your specific silicon, these are just my values).

Now, provided the system POSTs after the downclock (sometimes lowering
clocks/voltages triggers a black screen at boot when the values are applied by
systemd; I'm not sure whether that is the same issue) and I can reach the login
screen, the system works perfectly fine under load. I have never had a crash
since.

I do not know whether this is specific to my silicon, since, after this issue
started happening to my card, the same applies under Windows. Repasting didn't
help.
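
For scale, the reductions quoted above are fairly small in relative terms. A quick calculation using only the figures from this comment (the helper name `percent_reduction` is hypothetical):

```python
def percent_reduction(stock, tuned):
    """Relative reduction from the stock value to the tuned value, in percent."""
    return (stock - tuned) / stock * 100.0

clock_cut = percent_reduction(1430, 1340)    # MHz, values from the comment above
voltage_cut = percent_reduction(1150, 1040)  # mV, values from the comment above

print(f"clock: {clock_cut:.1f}% lower, voltage: {voltage_cut:.1f}% lower")
```

That works out to roughly a 6% lower peak clock and a 10% lower peak voltage.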


* [Bug 206475] amdgpu under load drop signal to monitor until hard reset
  2020-02-09 20:36 [Bug 206475] New: amdgpu under load drop signal to monitor until hard reset bugzilla-daemon
                   ` (12 preceding siblings ...)
  2020-06-16 15:48 ` bugzilla-daemon
@ 2020-06-16 16:39 ` bugzilla-daemon
  2020-06-24 20:33 ` bugzilla-daemon
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2020-06-16 16:39 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=206475

--- Comment #14 from Andrew Ammerlaan (andrewammerlaan@riseup.net) ---
I sort of worked around this too.

I changed two things:

1) The iGPU is now the primary GPU, and I use DRI_PRIME=1 to offload to the AMD
GPU. This has reduced the number of things that are rendered on the AMD card.
This didn't actually fix anything, but it did remove the need for a hard
reboot when the AMD GPU does a reset. Now, when the GPU resets, only the
applications that are rendered on the AMD card stop working; the desktop and
everything else stay functional.

2) I added three fans to my PC. The card's thermal sensor never reported
reaching the critical temperature (it went up to 82 Celsius max; critical is
91 Celsius), but there definitely does seem to be a correlation between high
temperatures and the occurrence of the resets. And more fans is always better
anyway.

I still experienced some resets after switching the primary GPU to the iGPU,
but only if I really pushed it to its limits. I haven't had a single reset
since I added the fans. (Though admittedly I haven't run a decent stress test
yet, so it is still too early to conclude that the problem is completely gone.)

Since under-clocking the card worked for you, and adding fans seems to work for
me, I have a hunch that even though the thermal sensor doesn't report
problematic temperatures, some parts of the card actually do reach problematic
temperatures nonetheless, which might cause issues leading to a reset.
I'm not sure where the sensor is physically located, but considering that the
card is quite large, it doesn't seem that far-fetched to me that there could be
quite a large difference in temperature between two points on the card.

Perhaps this card could benefit from a second thermal sensor or earlier and/or
more aggressive thermal throttling.


* [Bug 206475] amdgpu under load drop signal to monitor until hard reset
  2020-02-09 20:36 [Bug 206475] New: amdgpu under load drop signal to monitor until hard reset bugzilla-daemon
                   ` (13 preceding siblings ...)
  2020-06-16 16:39 ` bugzilla-daemon
@ 2020-06-24 20:33 ` bugzilla-daemon
  2020-06-24 20:41 ` bugzilla-daemon
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2020-06-24 20:33 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=206475

--- Comment #15 from Andrew Ammerlaan (andrewammerlaan@riseup.net) ---
So today it was *really* hot, and I had this issue occur a couple of times.
(The solution with the extra fans was nice and all, but not enough to prevent
it entirely)

However, now that the iGPU is default, I can still see the system monitor that
I usually run on the other monitor when this issue occurs. Every single time
the thermal sensor of the GPU would show a ridiculous value (e.g. 511 degrees
Celsius).

Now, this could explain why the GPU does a reset. If the thermal sensor would
all of a sudden return a value of e.g. 511, then of course the GPU will shut
itself down. 

As it is clearly impossible for the temperature of the GPU to jump from
somewhere between 80 and 90 to over 500 within a couple of milliseconds, I
conclude that there is something wrong, either physically with the thermal
sensor, or with the way the firmware/driver handles the temperature reporting
from the sensor. Also, if the GPU had actually reached a temperature of 511, it
would be broken now, as the melting temperature of tin is about 230 degrees
Celsius.

I happen to work with thermometers quite a lot, and I have seen temperature
readings do things like this. Usually the cause is either a broken or shorted
sensor (which is unlikely in this case, because it works normally most of the
time), or a wrong/incomplete calibration curve. (Usually thermal sensors are
only calibrated within the range they are expected to operate in, but the high
limit of this calibration curve might be too low.)

Anyway, either the GPU reset is caused by the incorrect temperature readings,
or the incorrect temperature readings are caused by the GPU reset (which is
also possible I guess). In any case, it would be great if AMD could look into
this soon. Because clearly something is wrong.
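
The plausibility argument above can be expressed as a simple check on consecutive readings. This is only an illustrative sketch: the helper name `implausible_jumps`, the 20-degree threshold and the sample values other than 511 are assumptions, not measurements from this report.

```python
def implausible_jumps(samples, max_step):
    """Return indices whose reading differs from the previous one by more than max_step."""
    flagged = []
    for i in range(1, len(samples)):
        if abs(samples[i] - samples[i - 1]) > max_step:
            flagged.append(i)
    return flagged

# Hypothetical consecutive readings in degrees Celsius; only 511 is from the report.
readings = [84, 86, 511, 85]

# Flags both the jump up to 511 and the jump back down.
print(implausible_jumps(readings, max_step=20))
```

A monitoring tool could use a check like this to mark a reading as invalid instead of displaying it as a real temperature.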


* [Bug 206475] amdgpu under load drop signal to monitor until hard reset
  2020-02-09 20:36 [Bug 206475] New: amdgpu under load drop signal to monitor until hard reset bugzilla-daemon
                   ` (14 preceding siblings ...)
  2020-06-24 20:33 ` bugzilla-daemon
@ 2020-06-24 20:41 ` bugzilla-daemon
  2020-06-25  9:58 ` bugzilla-daemon
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2020-06-24 20:41 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=206475

Alex Deucher (alexdeucher@gmail.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |alexdeucher@gmail.com

--- Comment #16 from Alex Deucher (alexdeucher@gmail.com) ---
(In reply to Andrew Ammerlaan from comment #15)
> However, now that the iGPU is default, I can still see the system monitor
> that I usually run on the other monitor when this issue occurs. Every single
> time the thermal sensor of the GPU would show a ridiculous value (e.g. 511
> degrees Celsius).

When the GPU is in reset all reads to the MMIO BAR return 1s so you are just
getting all ones until the reset succeeds.  511 is just all ones.  This patch
will fix that issue:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9271dfd9e0f79e2969dcbe28568bce0fdc4f8f73
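
As an illustration of the "511 is just all ones" point, here is a minimal
Python sketch. It assumes, hypothetically, that the temperature is taken from
a 9-bit field of a 32-bit register; the field width and position are inferred
only from 511 = 2^9 - 1 and are not taken from the driver source.

```python
# Hypothetical register layout: a 9-bit temperature field starting at bit 0.
# The width is inferred from 511 == 2**9 - 1; it is NOT taken from the driver.
TEMP_FIELD_SHIFT = 0
TEMP_FIELD_WIDTH = 9

# What a 32-bit MMIO read returns while the GPU is in reset (all bits set).
ALL_ONES_32 = 0xFFFFFFFF


def extract_field(raw: int, shift: int, width: int) -> int:
    """Return the `width`-bit field that starts at bit `shift` of `raw`."""
    mask = (1 << width) - 1
    return (raw >> shift) & mask


def looks_like_dead_read(raw: int) -> bool:
    """An all-ones value suggests the device did not answer the read."""
    return raw == ALL_ONES_32


temp = extract_field(ALL_ONES_32, TEMP_FIELD_SHIFT, TEMP_FIELD_WIDTH)
print(temp)  # 511
```

Any field read this way saturates at its maximum value during reset, which is
why the reading says nothing about the real temperature.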


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug 206475] amdgpu under load drop signal to monitor until hard reset
  2020-02-09 20:36 [Bug 206475] New: amdgpu under load drop signal to monitor until hard reset bugzilla-daemon
                   ` (15 preceding siblings ...)
  2020-06-24 20:41 ` bugzilla-daemon
@ 2020-06-25  9:58 ` bugzilla-daemon
  2020-09-15 18:31 ` bugzilla-daemon
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2020-06-25  9:58 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=206475

--- Comment #17 from Andrew Ammerlaan (andrewammerlaan@riseup.net) ---
(In reply to Alex Deucher from comment #16)
> When the GPU is in reset all reads to the MMIO BAR return 1s so you are just
> getting all ones until the reset succeeds.  511 is just all ones.  This
> patch will fix that issue:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=9271dfd9e0f79e2969dcbe28568bce0fdc4f8f73

Well, there goes my hypothesis of the broken thermal sensor xD.

I did discover yesterday that the fan of my GPU spins relatively slowly under
high load. When the GPU reached ~80 degrees Celsius, the fan didn't even spin
at half the maximum RPM! I used the pwmconfig script and the fancontrol service
from lm_sensors to force the fan to go to the maximum RPM just before reaching
80 degrees Celsius. It's very noisy, *but* the GPU stays well below 70 degrees
Celsius now, even under heavy load. As this issue seems to occur only when the
GPU is hotter than ~75 degrees Celsius, I'm hoping that this will help in
preventing the problem.
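
The kind of fan curve described above can be sketched roughly as follows. This
is only an illustration of a linear temperature-to-PWM mapping that reaches
full speed just before 80 degrees Celsius; the function name and all threshold
values are hypothetical and are not the settings actually used here.

```python
def fan_pwm(temp_c: float,
            min_temp: float = 50.0,   # hypothetical: below this, run at min_pwm
            max_temp: float = 78.0,   # hypothetical: full speed just before 80 C
            min_pwm: int = 80,        # hypothetical lowest duty cycle (0-255 scale)
            max_pwm: int = 255) -> int:
    """Linear fan curve: min_pwm at or below min_temp, max_pwm at or above max_temp."""
    if temp_c <= min_temp:
        return min_pwm
    if temp_c >= max_temp:
        return max_pwm
    # Interpolate linearly between the two endpoints.
    fraction = (temp_c - min_temp) / (max_temp - min_temp)
    return round(min_pwm + fraction * (max_pwm - min_pwm))


for t in (45, 60, 70, 78, 85):
    print(t, fan_pwm(t))
```

Moving max_temp lower trades noise for thermal headroom, which is the
trade-off described in the paragraph above.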

I'm still confused as to why this is at all necessary. The critical temperature
is 91, so why do I encounter these issues at ~80?


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug 206475] amdgpu under load drop signal to monitor until hard reset
  2020-02-09 20:36 [Bug 206475] New: amdgpu under load drop signal to monitor until hard reset bugzilla-daemon
                   ` (16 preceding siblings ...)
  2020-06-25  9:58 ` bugzilla-daemon
@ 2020-09-15 18:31 ` bugzilla-daemon
  2020-09-16  7:52 ` bugzilla-daemon
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2020-09-15 18:31 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=206475

--- Comment #18 from Marco (rodomar705@protonmail.com) ---
As of 5.8.7, I've tried reverting to stock clocks, and I've had no black screen
issue under load, even after long game sessions.

It does *seem* to be fixed, at least for me.

I don't know how much code is shared between the Linux open source driver and
the Windows closed source driver, but I wonder if it was some bug that carried
over to the Windows driver too (or even if it was firmware related).

I'll keep testing to see if it happens again, but I haven't seen any error logs
mentioning amdgpu in the kernel dmesg output.
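
For anyone wanting to check their own logs the same way, a minimal Python
sketch follows. The patterns are based on the error messages quoted in the
original report; the function name is made up for this example, and the USB
line in the sample is an invented unrelated message included only to show that
it is filtered out.

```python
import re

# Patterns based on the amdgpu errors quoted in the original bug report.
AMDGPU_ERROR = re.compile(
    r"amdgpu.*(failed to send message|last message was failed|ring \w+ timeout)"
    r"|flip_done timed out"
)


def amdgpu_errors(log_text: str) -> list[str]:
    """Return the log lines that match one of the known amdgpu error patterns."""
    return [line for line in log_text.splitlines() if AMDGPU_ERROR.search(line)]


sample = """\
kernel: amdgpu: [powerplay] failed to send message 200 ret is 65535
kernel: usb 1-2: new high-speed USB device number 5 using xhci_hcd
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=5275264, emitted seq=5275266
"""

print(len(amdgpu_errors(sample)))  # 2
```

In practice the text passed in would be the saved output of dmesg or
journalctl -k rather than a hard-coded sample.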


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug 206475] amdgpu under load drop signal to monitor until hard reset
  2020-02-09 20:36 [Bug 206475] New: amdgpu under load drop signal to monitor until hard reset bugzilla-daemon
                   ` (17 preceding siblings ...)
  2020-09-15 18:31 ` bugzilla-daemon
@ 2020-09-16  7:52 ` bugzilla-daemon
  2021-03-22  9:36 ` bugzilla-daemon
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2020-09-16  7:52 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=206475

--- Comment #19 from Andrew Ammerlaan (andrewammerlaan@riseup.net) ---
I'm on 5.8.8 at the moment and I haven't had this happen in a long time. I've
had some other freezes but I'm not sure they're GPU/graphics related. So I too
think this is fixed.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug 206475] amdgpu under load drop signal to monitor until hard reset
  2020-02-09 20:36 [Bug 206475] New: amdgpu under load drop signal to monitor until hard reset bugzilla-daemon
                   ` (18 preceding siblings ...)
  2020-09-16  7:52 ` bugzilla-daemon
@ 2021-03-22  9:36 ` bugzilla-daemon
  2022-01-06 17:58 ` bugzilla-daemon
  2022-01-06 23:44 ` bugzilla-daemon
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2021-03-22  9:36 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=206475

Marco (rodomar705@protonmail.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|---                         |ANSWERED

--- Comment #20 from Marco (rodomar705@protonmail.com) ---
I finally figured out where the problem was, and completely fixed it. It was
hardware. The heatsink was not making full contact with a section of the
MOSFETs that feed power to the core of the card. Under full load they were
thermally tripping from overheating and completely stalling the card to avoid
damage to themselves. The problem was that this card wasn't reporting their
temperatures to software, even if the actual VRM controller was monitoring them
(or perhaps it was shutting down only when the MOSFETs triggered a bare signal
asserting the thermal runaway condition). This was hell to debug and fix, as
always with hardware problems, but after a stress test on both Windows and
Linux at full clocks, the issue is not present anymore.

I'll keep my optimized clocks for lower temperatures and less fan noise, but
for me the issue wasn't software.

-- 
You may reply to this email to add a comment.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug 206475] amdgpu under load drop signal to monitor until hard reset
  2020-02-09 20:36 [Bug 206475] New: amdgpu under load drop signal to monitor until hard reset bugzilla-daemon
                   ` (19 preceding siblings ...)
  2021-03-22  9:36 ` bugzilla-daemon
@ 2022-01-06 17:58 ` bugzilla-daemon
  2022-01-06 23:44 ` bugzilla-daemon
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2022-01-06 17:58 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=206475

rendsvig@gmail.com (rendsvig@gmail.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rendsvig@gmail.com

--- Comment #21 from rendsvig@gmail.com (rendsvig@gmail.com) ---
I have recently started facing this issue with two new games I've started
playing, though I've had zero issues with gaming until now.

I'm on kernel 5.15.8, and have tried with the Pop!_OS 21.10 default Mesa
drivers (21.2.2) and the more recent 21.3.3 drivers, too.

Logging with lm-sensors every 2 seconds while playing, I see little change in
temperature before the crash, but growing power consumption, getting close to
the cap of 183W (the five measurements before the crash were 183, then 177,
180, 182, and 181).
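
A logging loop of this kind can be sketched in Python as below. This is not
the setup used above (which relied on lm-sensors); it reads the hwmon sysfs
files directly. The file names temp1_input and power1_average, the example
directory, and the helper names are assumptions, and the files available vary
by kernel version and GPU.

```python
import time
from pathlib import Path


def read_sample(hwmon_dir: str) -> dict:
    """Read one temperature/power sample from an hwmon directory.

    hwmon reports temperature in millidegrees Celsius and power in microwatts.
    The file names used here are assumptions and may differ per GPU/kernel.
    """
    base = Path(hwmon_dir)
    temp_c = int((base / "temp1_input").read_text()) / 1000
    power_w = int((base / "power1_average").read_text()) / 1_000_000
    return {"temp_c": temp_c, "power_w": power_w}


def near_cap(power_w: float, cap_w: float, margin_w: float = 5.0) -> bool:
    """True when the power draw is within margin_w watts of the cap."""
    return cap_w - power_w <= margin_w


def log_samples(hwmon_dir: str, cap_w: float, count: int, interval_s: float = 2.0) -> list:
    """Collect `count` samples, one every `interval_s` seconds, printing each."""
    samples = []
    for _ in range(count):
        sample = read_sample(hwmon_dir)
        sample["near_cap"] = near_cap(sample["power_w"], cap_w)
        print(sample)
        samples.append(sample)
        time.sleep(interval_s)
    return samples
```

A call might look like log_samples("/sys/class/drm/card0/device/hwmon/hwmon0",
cap_w=183, count=30), where the path is only an example and has to be adjusted
to the actual system.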

Marco, can you explain how to tell if it's the same hardware issue you faced?


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Bug 206475] amdgpu under load drop signal to monitor until hard reset
  2020-02-09 20:36 [Bug 206475] New: amdgpu under load drop signal to monitor until hard reset bugzilla-daemon
                   ` (20 preceding siblings ...)
  2022-01-06 17:58 ` bugzilla-daemon
@ 2022-01-06 23:44 ` bugzilla-daemon
  21 siblings, 0 replies; 23+ messages in thread
From: bugzilla-daemon @ 2022-01-06 23:44 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=206475

--- Comment #22 from rendsvig@gmail.com (rendsvig@gmail.com) ---
I resolved my issue by disabling p-state 7 when gaming; cf. this comment:
https://www.reddit.com/r/linux_gaming/comments/gbqe0e/comment/fp8r35a
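
For readers who cannot open the link, one way to do this through amdgpu's
sysfs files is sketched below in Python. This is not necessarily what the
linked comment does. It assumes that "p-state 7" means sclk level 7 in
pp_dpm_sclk; the helper names, the example device path, and the clock values
in the sample are hypothetical. Writing these files requires root.

```python
from pathlib import Path


def levels_without(pp_dpm_sclk_text: str, disabled: set[int]) -> str:
    """Build the space-separated level list to write back to pp_dpm_sclk."""
    levels = []
    for line in pp_dpm_sclk_text.splitlines():
        # Lines look like "7: 1340Mhz *"; the index is the part before the colon.
        index, sep, _ = line.partition(":")
        if sep and index.strip().isdigit():
            levels.append(int(index))
    return " ".join(str(i) for i in levels if i not in disabled)


def apply_mask(device_dir: str, disabled: set[int]) -> str:
    """Switch to manual DPM control and enable only the remaining sclk levels."""
    dev = Path(device_dir)
    mask = levels_without((dev / "pp_dpm_sclk").read_text(), disabled)
    (dev / "power_dpm_force_performance_level").write_text("manual\n")
    (dev / "pp_dpm_sclk").write_text(mask + "\n")
    return mask


# Sample file contents with made-up clock values, for illustration only.
sample = "\n".join(f"{i}: {300 + 150 * i}Mhz" for i in range(8)) + " *\n"
print(levels_without(sample, {7}))  # 0 1 2 3 4 5 6
```

On a real system the call would be something like
apply_mask("/sys/class/drm/card0/device", {7}); writing "auto" to
power_dpm_force_performance_level restores the default behaviour.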


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2022-01-06 23:44 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-09 20:36 [Bug 206475] New: amdgpu under load drop signal to monitor until hard reset bugzilla-daemon
2020-02-10 13:20 ` [Bug 206475] " bugzilla-daemon
2020-02-10 13:21 ` bugzilla-daemon
2020-02-10 16:39 ` bugzilla-daemon
2020-02-10 16:40 ` bugzilla-daemon
2020-02-10 19:33 ` bugzilla-daemon
2020-02-17 13:23 ` bugzilla-daemon
2020-02-21 21:13 ` bugzilla-daemon
2020-02-24 13:50 ` bugzilla-daemon
2020-02-24 13:52 ` bugzilla-daemon
2020-05-22 12:55 ` bugzilla-daemon
2020-05-23 14:40 ` bugzilla-daemon
2020-05-23 16:44 ` bugzilla-daemon
2020-06-16 15:48 ` bugzilla-daemon
2020-06-16 16:39 ` bugzilla-daemon
2020-06-24 20:33 ` bugzilla-daemon
2020-06-24 20:41 ` bugzilla-daemon
2020-06-25  9:58 ` bugzilla-daemon
2020-09-15 18:31 ` bugzilla-daemon
2020-09-16  7:52 ` bugzilla-daemon
2021-03-22  9:36 ` bugzilla-daemon
2022-01-06 17:58 ` bugzilla-daemon
2022-01-06 23:44 ` bugzilla-daemon
