How to get useful information other than "the whole system locks up"?


* How to get useful information other than "the whole system locks up"?
@ 2019-04-03  0:20 Braiam
       [not found] ` <CAG=7Bt8UFiqqfwODc-e+2tN18x3LMYz9AoBmq9ukWsY-b4TaqA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Braiam @ 2019-04-03  0:20 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Hi,

I have a Sapphire Technology Hawaii XT (R9 290X) using the amdgpu driver
with kernel 5.1.0-rc3.
The issue also happens with the current 4.19.0 from Debian testing,
4.20-trunk, 5.0.0-trunk, and rc2 and rc3.

It usually happens while I'm reproducing (playing) video, but I haven't
figured out a way to trigger it reliably. It happened once with no video
playing. I'm aware that amdgpu support for this card is experimental,
but the radeon driver doesn't seem capable of direct rendering on this
card and falls back to llvmpipe.

I had an SSH server installed in case I could log in while it crashes,
and the only relevant line I found was:

[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled
seq=399919, emitted seq=399921

But searching for that turned up several bug reports which seem to have
been fixed, and whose context and symptoms are too different from mine.
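
For anyone triaging a similar report, the timeout line quoted above can be
parsed to see how many fences were still outstanding on the ring when the
job timed out. This is only a minimal sketch: the regular expression is
written against the message text as it appears in this thread, and the
names TIMEOUT_RE and parse_ring_timeouts are made up for illustration.

```python
import re

# Matches the "ring <name> timeout, signaled seq=N, emitted seq=M" part of
# the message as quoted in this thread (assumed format, not taken from the
# driver source).
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\w+)\s+timeout, signaled\s+seq=(?P<signaled>\d+), "
    r"emitted\s+seq=(?P<emitted>\d+)"
)


def parse_ring_timeouts(log_text):
    """Return a list of (ring, signaled, emitted, pending) tuples."""
    # Mail clients wrap long log lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    results = []
    for m in TIMEOUT_RE.finditer(flat):
        signaled = int(m.group("signaled"))
        emitted = int(m.group("emitted"))
        # pending = fences emitted to the ring but not yet signaled
        results.append((m.group("ring"), signaled, emitted, emitted - signaled))
    return results


sample = """ring gfx timeout, signaled
seq=399919, emitted seq=399921"""
print(parse_ring_timeouts(sample))  # [('gfx', 399919, 399921, 2)]
```

A non-zero pending count only says that work was still queued on that ring
when the timeout fired; it does not identify the cause.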

I have tried forcing the amdgpu Xorg driver (I was using radeon), with
the same results.

-- 
Braiam
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


* Re: How to get useful information other than "the whole system locks up"?
       [not found] ` <CAG=7Bt8UFiqqfwODc-e+2tN18x3LMYz9AoBmq9ukWsY-b4TaqA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2019-04-03 18:01   ` Alex Deucher
       [not found]     ` <CADnq5_NsVLJ1tYyQ_e1QPrLg7M6mf1nuR18BOsA+3YJp-LCONQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Alex Deucher @ 2019-04-03 18:01 UTC (permalink / raw)
  To: Braiam; +Cc: amd-gfx list

On Wed, Apr 3, 2019 at 2:58 AM Braiam <braiamp@gmail.com> wrote:
>
> Hi,
>
> I have a Sapphire Technology Hawaii XT (R9 290X) using amdgpu driver
> with kernel 5.1.0-rc3.
> The issue happens with current 4.19.0 debian testing, 4.20-trunk,
> 5.0.0-trunk and rc2 and 3.
>
> It usually happens when I'm reproducing video, but I haven't figured
> out a way to reproduce it. It
> happened once without reproducing. I'm aware that the support is
> experimental, but radeon
> driver doesn't seems capable of direct rendering on this card dropping
> to llvmepipe.

Radeon should work out of the box.  Maybe something is messed up with
your install?

>
> I had a ssh server installed in case I could log in while it crashes,
> and the only relevant
> line I found was:
>
> drm:amdgpu job timeout [amdgpu]] **ERROR** ring gfx timeout, signaled
> seq=399919, emitted seq=399921
>
> But that turned several bug reports which seems to have been fixed and
> the context and symptoms are too different to mine.
>

You appear to be experiencing a GPU lockup.  Unfortunately, there can
be many things that cause it, so it really helps to have a good
reproducer case.  You might try a newer version of mesa or llvm.  What
does your "reproducing video" workflow use?  Which apps and APIs are
involved?

Alex


* Re: How to get useful information other than "the whole system locks up"?
       [not found]     ` <CADnq5_NsVLJ1tYyQ_e1QPrLg7M6mf1nuR18BOsA+3YJp-LCONQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2019-04-03 22:38       ` Braiam
       [not found]         ` <CAG=7Bt_ScazVsdOBsszBnHW81Zr_78diU=XvTff9yTeDgGeo8Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Braiam @ 2019-04-03 22:38 UTC (permalink / raw)
  To: Alex Deucher; +Cc: amd-gfx list

On Wed, Apr 3, 2019 at 2:02 PM Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Wed, Apr 3, 2019 at 2:58 AM Braiam <braiamp@gmail.com> wrote:
> >
> > Hi,
> >
> > I have a Sapphire Technology Hawaii XT (R9 290X) using amdgpu driver
> > with kernel 5.1.0-rc3.
> > The issue happens with current 4.19.0 debian testing, 4.20-trunk,
> > 5.0.0-trunk and rc2 and 3.
> >
> > It usually happens when I'm reproducing video, but I haven't figured
> > out a way to reproduce it. It
> > happened once without reproducing. I'm aware that the support is
> > experimental, but radeon
> > driver doesn't seems capable of direct rendering on this card dropping
> > to llvmepipe.
>
> Radeon should work out of the box.  Maybe something is messed up with
> your install?

Doubtful, since I used a 2200G without issues. The only change was
switching to amdgpu because games were particularly slow.

>
> >
> > I had a ssh server installed in case I could log in while it crashes,
> > and the only relevant
> > line I found was:
> >
> > drm:amdgpu job timeout [amdgpu]] **ERROR** ring gfx timeout, signaled
> > seq=399919, emitted seq=399921
> >
> > But that turned several bug reports which seems to have been fixed and
> > the context and symptoms are too different to mine.
> >
>
> You appear to be experiencing a GPU lockup.  Unfortunately, there can
> be many things that cause it, so it really helps to have a good
> reproducer case.  You might try a newer version of mesa or llvm.  What
> does your "reproducing video" work flow use?  What apps, APIs are
> involved?

By "reproducing video" I mean watching videos from either Youtube/Netflix
with Firefox, or with mpv for local files. But note that I've experienced
at least one lockup without any video on screen (I don't remember if
there was something paused in the background).

Using mesa 18.3.4, and according to glxinfo DRM 3.30.0, LLVM 7.0.1, with
the vulkan, vdpau and va drivers on the same version. I gave up and
removed any and every instance of the rocm package, since the module
couldn't compile on the kernel.

Another detail about the lockup: the screen doesn't freeze, but no
signal is emitted, so my monitor goes to sleep.

Upgraded to mesa 19.0.1 and llvm 8.0.0, although I didn't see anything
relevant in the changelogs that could affect my experience.


--
Braiam

* Re: How to get useful information other than "the whole system locks up"?
       [not found]         ` <CAG=7Bt_ScazVsdOBsszBnHW81Zr_78diU=XvTff9yTeDgGeo8Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2019-04-06 19:25           ` Braiam
       [not found]             ` <CAG=7Bt8VEgsiqaMDYtU31ut1pU+h6_rEaqnfh2tyR7SuEpUQ8g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Braiam @ 2019-04-06 19:25 UTC (permalink / raw)
  To: Alex Deucher; +Cc: amd-gfx list

Managed to extract more information from one of the lockups:

[53798.406792] [drm:amdgpu_ctx_mgr_entity_fini [amdgpu]] *ERROR* ctx
00000000865ddb9b is still alive
[53798.406862] [drm:amdgpu_ctx_mgr_fini [amdgpu]] *ERROR* ctx
00000000865ddb9b is still alive

Apparently the messages above are unrelated, since there's a jump in
the timestamps.

[135442.607022] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0
timeout, signaled seq=834761, emitted seq=834763
[135442.607111] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process
information: process  pid 0 thread  pid 0
[135442.607117] amdgpu 0000:01:00.0: GPU reset begin!
[135442.617041] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx
timeout, signaled seq=11762398, emitted seq=11762400
[135442.617133] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process
information: process Xorg pid 737 thread Xorg:cs0 pid 768
[135442.617139] amdgpu 0000:01:00.0: GPU reset begin!

Why does the GPU try to reset twice?

[135443.319047] amdgpu: [powerplay]
                 failed to send message 5e ret is 0
[135443.806941] amdgpu: [powerplay]
                 failed to send message 145 ret is 0
[135444.294821] amdgpu: [powerplay]
                 failed to send message 146 ret is 0
[135444.782666] amdgpu: [powerplay]
                 failed to send message 148 ret is 0
[135445.270683] amdgpu: [powerplay]
                 failed to send message 145 ret is 0
[135445.759529] amdgpu: [powerplay]
                 failed to send message 146 ret is 0
[135445.839244] [drm] REG_WAIT timeout 10us * 3500 tries -
dce_mi_free_dmif line:636

Here, it starts a trace. Should I report a bug even if I'm unable to
reproduce the conditions that led to it?

[135445.839340] WARNING: CPU: 1 PID: 24740 at
drivers/gpu/drm/amd/amdgpu/../display/dc/dc_helper.c:277
generic_reg_wait.cold.3+0x2a/0x31 [amdgpu]
[135445.839341] Modules linked in: uinput nft_chain_route_ipv4
xt_CHECKSUM nft_chain_nat ipt_MASQUERADE nf_nat xt_conntrack
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4
nft_counter xt_tcpudp nft_compat tun bridge stp llc nf_tables
nfnetlink cpufreq_userspace cpufreq_powersave cpufreq_conservative
edac_mce_amd kvm_amd ccp rng_core kvm irqbypass crct10dif_pclmul
crc32_pclmul ghash_clmulni_intel binfmt_misc nls_ascii nls_cp437 vfat
fat efi_pstore aesni_intel eeepc_wmi asus_wmi aes_x86_64 sparse_keymap
crypto_simd rfkill cryptd snd_hda_codec_realtek glue_helper
snd_hda_codec_generic pcspkr wmi_bmof joydev efivars k10temp
ledtrig_audio snd_hda_codec_hdmi sp5100_tco amdgpu snd_hda_intel chash
snd_hda_codec gpu_sched snd_hda_core ttm snd_hwdep snd_pcm sg
snd_timer drm_kms_helper snd soundcore drm evdev button pcc_cpufreq
acpi_cpufreq parport_pc ppdev lp parport efivarfs ip_tables x_tables
autofs4 btrfs xor zstd_decompress zstd_compress raid6_pq libcrc32c
crc32c_generic sd_mod
[135445.839368]  hid_generic usbhid hid ahci libahci igb xhci_pci
crc32c_intel i2c_algo_bit dca libata xhci_hcd i2c_piix4 scsi_mod
usbcore wmi video gpio_amdpt gpio_generic
[135445.839377] CPU: 1 PID: 24740 Comm: kworker/1:0 Not tainted 5.1.0-rc3 #1
[135445.839378] Hardware name: System manufacturer System Product
Name/ROG STRIX B350-F GAMING, BIOS 4207 12/07/2018
[135445.839383] Workqueue: events drm_sched_job_timedout [gpu_sched]
[135445.839451] RIP: 0010:generic_reg_wait.cold.3+0x2a/0x31 [amdgpu]
[135445.839453] Code: 44 8b 44 24 68 48 8b 4c 24 60 48 c7 c7 a8 a2 8b
c0 8b 54 24 58 8b 74 24 04 e8 3a 88 8c e4 41 83 7c 24 20 01 0f 84 51
8d fe ff <0f> 0b e9 4a 8d fe ff e8 84 61 ed ff 48 c7 c7 00 80 95 c0 e8
a8 0f
[135445.839454] RSP: 0018:ffffb8d7099777b0 EFLAGS: 00010297
[135445.839455] RAX: 0000000000000044 RBX: 0000000000000dad RCX:
0000000000000000
[135445.839456] RDX: 0000000000000000 RSI: ffff8db08ea56638 RDI:
ffff8db08ea56638
[135445.839457] RBP: 0000000000000328 R08: 00000000000003fd R09:
0000000000000004
[135445.839458] R10: 0000000000000000 R11: 0000000000000001 R12:
ffff8db085256600
[135445.839459] R13: ffff8db086c4f000 R14: 0000000000000002 R15:
0000000000000010
[135445.839460] FS:  0000000000000000(0000) GS:ffff8db08ea40000(0000)
knlGS:0000000000000000
[135445.839461] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[135445.839462] CR2: 00007fe3feab6000 CR3: 00000001c0df4000 CR4:
00000000003406e0
[135445.839463] Call Trace:
[135445.839539]  dce_mi_free_dmif+0xff/0x1e0 [amdgpu]
[135445.839607]  dce110_reset_hw_ctx_wrap+0x149/0x1e0 [amdgpu]
[135445.839674]  dce110_apply_ctx_to_hw+0x4b/0x490 [amdgpu]
[135445.839728]  ? amdgpu_pm_compute_clocks+0xb9/0x5d0 [amdgpu]
[135445.839800]  ? dm_pp_apply_display_requirements+0x196/0x1a0 [amdgpu]
[135445.839864]  dc_commit_state+0x2ae/0x550 [amdgpu]
[135445.839936]  ? dm_determine_update_type_for_commit+0x5a/0x410 [amdgpu]
[135445.840006]  amdgpu_dm_atomic_commit_tail+0x1fb/0x1930 [amdgpu]
[135445.840011]  ? kmalloc_order+0x14/0x30
[135445.840085]  ? dm_determine_update_type_for_commit+0x5a/0x410 [amdgpu]
[135445.840088]  ? _cond_resched+0x15/0x30
[135445.840090]  ? _cond_resched+0x15/0x30
[135445.840092]  ? __ww_mutex_lock.isra.9+0x3a/0x710
[135445.840101]  ? drm_atomic_helper_setup_commit+0x380/0x480 [drm_kms_helper]
[135445.840104]  ? __wake_up_common_lock+0x89/0xc0
[135445.840105]  ? _cond_resched+0x15/0x30
[135445.840107]  ? wait_for_completion_timeout+0x3b/0x1a0
[135445.840181]  ? amdgpu_dm_atomic_commit_tail+0x1930/0x1930 [amdgpu]
[135445.840254]  ? dm_plane_helper_prepare_fb+0x257/0x360 [amdgpu]
[135445.840263]  ? commit_tail+0x3d/0x70 [drm_kms_helper]
[135445.840271]  commit_tail+0x3d/0x70 [drm_kms_helper]
[135445.840279]  drm_atomic_helper_commit+0xb4/0x120 [drm_kms_helper]
[135445.840285]  drm_atomic_helper_disable_all+0x14d/0x160 [drm_kms_helper]
[135445.840292]  drm_atomic_helper_suspend+0xc4/0xf0 [drm_kms_helper]
[135445.840356]  dm_suspend+0x1c/0x60 [amdgpu]
[135445.840402]  amdgpu_device_ip_suspend_phase1+0x9a/0xd0 [amdgpu]
[135445.840448]  amdgpu_device_ip_suspend+0x1b/0x60 [amdgpu]
[135445.840508]  amdgpu_device_pre_asic_reset+0x1ef/0x204 [amdgpu]
[135445.840567]  amdgpu_device_gpu_recover+0x7b/0x743 [amdgpu]
[135445.840627]  amdgpu_job_timedout+0xfc/0x120 [amdgpu]
[135445.840631]  drm_sched_job_timedout+0x39/0x60 [gpu_sched]
[135445.840634]  process_one_work+0x1a7/0x3b0
[135445.840636]  worker_thread+0x30/0x390
[135445.840638]  ? create_worker+0x1a0/0x1a0
[135445.840639]  kthread+0x112/0x130
[135445.840640]  ? __kthread_parkme+0x70/0x70
[135445.840642]  ret_from_fork+0x22/0x40
[135445.840645] ---[ end trace 660b6e2adb0f7476 ]---


Later, powerplay seems to fail a bunch of times, with message IDs
ranging from 54 to 190 (hex):

[135446.364392] amdgpu: [powerplay]
                 failed to send message 148 ret is 0
[135446.887086] amdgpu: [powerplay]
                 failed to send message 145 ret is 0
[135447.409654] amdgpu: [powerplay]
                 failed to send message 146 ret is 0
[135447.933340] amdgpu: [powerplay]
                 failed to send message 133 ret is 0
[135447.933342] amdgpu: [powerplay] VI should always have 2 performance levels
[135448.456153] amdgpu: [powerplay]
                 failed to send message 5e ret is 0
[135448.978944] amdgpu: [powerplay]
                 failed to send message 145 ret is 0
[135449.501800] amdgpu: [powerplay]
                 failed to send message 146 ret is 0
[135450.024313] amdgpu: [powerplay]
                 failed to send message 148 ret is 0
[135450.546863] amdgpu: [powerplay]
                 failed to send message 145 ret is 0
[135451.069831] amdgpu: [powerplay]
                 failed to send message 146 ret is 0
[135451.592819] amdgpu: [powerplay]
                 failed to send message 16a ret is 0
[135452.115338] amdgpu: [powerplay]
                 failed to send message 186 ret is 0
[135452.637920] amdgpu: [powerplay]
                 failed to send message 54 ret is 0
[135453.160464] amdgpu: [powerplay]
                 failed to send message 13d ret is 0
[135453.682876] amdgpu: [powerplay]
                 failed to send message 14f ret is 0
[135454.205768] amdgpu: [powerplay]
                 failed to send message 151 ret is 0
[135454.728890] amdgpu: [powerplay]
                 failed to send message 135 ret is 0
[135455.252057] amdgpu: [powerplay]
                 failed to send message 190 ret is 0
[135455.774989] amdgpu: [powerplay]
                 failed to send message 63 ret is 0
[135456.297489] amdgpu: [powerplay]
                 failed to send message 84 ret is 0
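
Each powerplay entry above is split across two lines, so a line-by-line
grep for the timestamp misses the message ID. A minimal sketch of one way
to tally the failed message IDs from a saved log follows; it assumes the
"failed to send message <hex id> ret is <n>" wording shown above, and the
name count_failed_messages is hypothetical.

```python
import re
from collections import Counter

# Assumed wording, copied from the log excerpt in this thread.
FAILED_MSG_RE = re.compile(r"failed to send message ([0-9a-f]+) ret is \d+")


def count_failed_messages(log_text):
    """Count failed powerplay messages by ID, ignoring line wrapping."""
    # Join the split entries by collapsing every run of whitespace.
    flat = " ".join(log_text.split())
    return Counter(m.group(1) for m in FAILED_MSG_RE.finditer(flat))


sample = """[135446.364392] amdgpu: [powerplay]
                 failed to send message 148 ret is 0
[135446.887086] amdgpu: [powerplay]
                 failed to send message 145 ret is 0
[135450.024313] amdgpu: [powerplay]
                 failed to send message 148 ret is 0"""

counts = count_failed_messages(sample)
for msg_id, n in counts.most_common():
    print(msg_id, n)
```

The tally only shows which message IDs fail most often; mapping an ID to
its meaning would require the driver headers, which are not quoted here.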


There seems to be a newline somewhere, since powerplay messages aren't
on the same line as the timestamp. After that, the gpu tries to
recover, but fails miserably:

[135456.299369] amdgpu 0000:01:00.0: GPU pci config reset
[135456.306844] amdgpu 0000:01:00.0: GPU reset succeeded, trying to resume
[135456.306964] [drm] PCIE gen 3 link speeds already enabled
[135456.310683] [drm] PCIE GART of 1024M enabled (table at 0x000000F400300000).
[135456.310768] [drm:amdgpu_device_gpu_recover [amdgpu]] *ERROR* VRAM is lost!
[135456.590097] amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper
[amdgpu]] *ERROR* ring gfx test failed (-110)
[135456.590145] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR*
resume of IP block <gfx_v7_0> failed -110
[135456.590169] amdgpu 0000:01:00.0: GPU reset(2) failed

This line gets repeated about 10 times:

[135456.590183] [drm] Skip scheduling IBs!

And the kernel tries to reset the gpu again:

[135456.590362] amdgpu 0000:01:00.0: GPU reset end with ret = -110
[135466.655925] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0
timeout, signaled seq=834763, emitted seq=834763
[135466.656025] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process
information: process  pid 0 thread  pid 0
[135466.656032] amdgpu 0000:01:00.0: GPU reset begin!
[135578.700184] INFO: task kworker/0:2:24837 blocked for more than 120 seconds.
[135578.700189]       Tainted: G        W         5.1.0-rc3 #1
[135578.700191] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[135578.700194] kworker/0:2     D    0 24837      2 0x80000000
[135578.700206] Workqueue: events drm_sched_job_timedout [gpu_sched]
[135578.700208] Call Trace:
[135578.700217]  ? __schedule+0x2d4/0x870
[135578.700220]  ? __switch_to_asm+0x34/0x70
[135578.700223]  ? __switch_to_asm+0x40/0x70
[135578.700225]  schedule+0x28/0x70
[135578.700228]  schedule_timeout+0x26d/0x390
[135578.700230]  ? __schedule+0x2dc/0x870
[135578.700234]  dma_fence_default_wait+0x238/0x2a0
[135578.700236]  ? dma_fence_release+0x90/0x90
[135578.700239]  dma_fence_wait_timeout+0xdd/0x100
[135578.700243]  drm_sched_stop+0xf2/0x130 [gpu_sched]
[135578.700340]  amdgpu_device_pre_asic_reset+0x3f/0x204 [amdgpu]
[135578.700428]  amdgpu_device_gpu_recover+0x7b/0x743 [amdgpu]
[135578.700514]  amdgpu_job_timedout+0xfc/0x120 [amdgpu]
[135578.700571]  drm_sched_job_timedout+0x39/0x60 [gpu_sched]
[135578.700576]  process_one_work+0x1a7/0x3b0
[135578.700578]  worker_thread+0x30/0x390
[135578.700581]  ? create_worker+0x1a0/0x1a0
[135578.700583]  kthread+0x112/0x130
[135578.700585]  ? __kthread_parkme+0x70/0x70
[135578.700587]  ret_from_fork+0x22/0x40
[135578.700594] INFO: task kworker/0:1:29946 blocked for more than 120 seconds.
[135578.700596]       Tainted: G        W         5.1.0-rc3 #1
[135578.700598] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[135578.700599] kworker/0:1     D    0 29946      2 0x80000000
[135578.700605] Workqueue: events drm_sched_job_finish [gpu_sched]
[135578.700606] Call Trace:
[135578.700609]  ? __schedule+0x2d4/0x870
[135578.700611]  schedule+0x28/0x70
[135578.700613]  schedule_timeout+0x26d/0x390
[135578.700616]  wait_for_completion+0x11f/0x190
[135578.700619]  ? wake_up_q+0x80/0x80
[135578.700622]  __flush_work+0x130/0x1f0
[135578.700624]  ? worker_attach_to_pool+0x90/0x90
[135578.700626]  ? get_work_pool+0x40/0x40
[135578.700629]  __cancel_work_timer+0x10a/0x190
[135578.700632]  ? __switch_to+0x15b/0x460
[135578.700634]  ? __switch_to_asm+0x34/0x70
[135578.700636]  ? __switch_to_asm+0x40/0x70
[135578.700639]  drm_sched_job_finish+0x25/0x60 [gpu_sched]
[135578.700642]  process_one_work+0x1a7/0x3b0
[135578.700644]  worker_thread+0x30/0x390
[135578.700647]  ? create_worker+0x1a0/0x1a0
[135578.700648]  kthread+0x112/0x130
[135578.700650]  ? __kthread_parkme+0x70/0x70
[135578.700652]  ret_from_fork+0x22/0x40
[135699.405742] INFO: task kworker/0:2:24837 blocked for more than 241 seconds.
[135699.405748]       Tainted: G        W         5.1.0-rc3 #1
[135699.405750] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[135699.405752] kworker/0:2     D    0 24837      2 0x80000000
[135699.405764] Workqueue: events drm_sched_job_timedout [gpu_sched]
[135699.405765] Call Trace:
[135699.405773]  ? __schedule+0x2d4/0x870
[135699.405776]  ? __switch_to_asm+0x34/0x70
[135699.405779]  ? __switch_to_asm+0x40/0x70
[135699.405782]  schedule+0x28/0x70
[135699.405785]  schedule_timeout+0x26d/0x390
[135699.405787]  ? __schedule+0x2dc/0x870
[135699.405791]  dma_fence_default_wait+0x238/0x2a0
[135699.405794]  ? dma_fence_release+0x90/0x90
[135699.405796]  dma_fence_wait_timeout+0xdd/0x100
[135699.405800]  drm_sched_stop+0xf2/0x130 [gpu_sched]
[135699.405900]  amdgpu_device_pre_asic_reset+0x3f/0x204 [amdgpu]
[135699.405991]  amdgpu_device_gpu_recover+0x7b/0x743 [amdgpu]
[135699.406081]  amdgpu_job_timedout+0xfc/0x120 [amdgpu]
[135699.406086]  drm_sched_job_timedout+0x39/0x60 [gpu_sched]
[135699.406089]  process_one_work+0x1a7/0x3b0
[135699.406092]  worker_thread+0x30/0x390
[135699.406095]  ? create_worker+0x1a0/0x1a0
[135699.406096]  kthread+0x112/0x130
[135699.406098]  ? __kthread_parkme+0x70/0x70
[135699.406101]  ret_from_fork+0x22/0x40
[135699.406108] INFO: task kworker/0:1:29946 blocked for more than 241 seconds.
[135699.406110]       Tainted: G        W         5.1.0-rc3 #1
[135699.406111] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[135699.406113] kworker/0:1     D    0 29946      2 0x80000000
[135699.406118] Workqueue: events drm_sched_job_finish [gpu_sched]
[135699.406119] Call Trace:
[135699.406122]  ? __schedule+0x2d4/0x870
[135699.406125]  schedule+0x28/0x70
[135699.406127]  schedule_timeout+0x26d/0x390
[135699.406130]  wait_for_completion+0x11f/0x190
[135699.406133]  ? wake_up_q+0x80/0x80
[135699.406136]  __flush_work+0x130/0x1f0
[135699.406138]  ? worker_attach_to_pool+0x90/0x90
[135699.406141]  ? get_work_pool+0x40/0x40
[135699.406143]  __cancel_work_timer+0x10a/0x190
[135699.406145]  ? __switch_to+0x15b/0x460
[135699.406148]  ? __switch_to_asm+0x34/0x70
[135699.406150]  ? __switch_to_asm+0x40/0x70
[135699.406153]  drm_sched_job_finish+0x25/0x60 [gpu_sched]
[135699.406156]  process_one_work+0x1a7/0x3b0
[135699.406158]  worker_thread+0x30/0x390
[135699.406161]  ? create_worker+0x1a0/0x1a0
[135699.406162]  kthread+0x112/0x130
[135699.406164]  ? __kthread_parkme+0x70/0x70
[135699.406167]  ret_from_fork+0x22/0x40
[135699.406170] INFO: task kworker/2:0:29950 blocked for more than 120 seconds.
[135699.406171]       Tainted: G        W         5.1.0-rc3 #1
[135699.406173] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[135699.406174] kworker/2:0     D    0 29950      2 0x80000000
[135699.406179] Workqueue: events drm_sched_job_timedout [gpu_sched]
[135699.406180] Call Trace:
[135699.406183]  ? __schedule+0x2d4/0x870
[135699.406185]  schedule+0x28/0x70
[135699.406187]  schedule_preempt_disabled+0xa/0x10
[135699.406190]  __mutex_lock.isra.8+0x2d0/0x4a0
[135699.406277]  amdgpu_device_lock_adev+0x15/0x37 [amdgpu]
[135699.406364]  amdgpu_device_gpu_recover+0x69/0x743 [amdgpu]
[135699.406453]  amdgpu_job_timedout+0xfc/0x120 [amdgpu]
[135699.406457]  drm_sched_job_timedout+0x39/0x60 [gpu_sched]
[135699.406460]  process_one_work+0x1a7/0x3b0
[135699.406463]  worker_thread+0x30/0x390
[135699.406465]  ? create_worker+0x1a0/0x1a0
[135699.406467]  kthread+0x112/0x130
[135699.406468]  ? __kthread_parkme+0x70/0x70
[135699.406471]  ret_from_fork+0x22/0x40
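
The hung-task reports above repeat for the same workers with growing
durations. A small sketch, assuming the "INFO: task <name>:<pid> blocked
for more than <n> seconds" wording shown above, that keeps only the
longest reported time per task (HUNG_RE and longest_block_per_task are
illustrative names, not part of any tool):

```python
import re

# Assumed wording, copied from the hung-task lines in this thread.
HUNG_RE = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) blocked for more than "
    r"(?P<secs>\d+) seconds"
)


def longest_block_per_task(log_text):
    """Map (task name, pid) to the longest reported blocked time in seconds."""
    flat = " ".join(log_text.split())  # tolerate wrapped lines
    longest = {}
    for m in HUNG_RE.finditer(flat):
        key = (m.group("name"), int(m.group("pid")))
        longest[key] = max(longest.get(key, 0), int(m.group("secs")))
    return longest


sample = """[135578.700184] INFO: task kworker/0:2:24837 blocked for more than 120 seconds.
[135578.700594] INFO: task kworker/0:1:29946 blocked for more than 120 seconds.
[135699.405742] INFO: task kworker/0:2:24837 blocked for more than 241 seconds."""

for (name, pid), secs in longest_block_per_task(sample).items():
    print(name, pid, secs)
```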

At this point I decided that the system wasn't going to recover, so I
tried to soft-reboot it, but it only ended up locked and I had to hard
reboot.

I've found this bug [1] that seems related: the user was also watching
videos. I will try removing vaapi and check if that improves anything.

[1]: https://bugs.freedesktop.org/show_bug.cgi?id=106547

-- 
Braiam

* Re: How to get useful information other than "the whole system locks up"?
       [not found]             ` <CAG=7Bt8VEgsiqaMDYtU31ut1pU+h6_rEaqnfh2tyR7SuEpUQ8g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2019-04-10 13:33               ` Braiam
       [not found]                 ` <CAG=7Bt9NM+9umAZBb74h8dFLevqTk_xesy0TKdLUFMuyDnw_TQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Braiam @ 2019-04-10 13:33 UTC (permalink / raw)
  To: amd-gfx list

On Sat, Apr 6, 2019 at 3:25 PM Braiam <braiamp@gmail.com> wrote:

>
> At this point I decided that the system isn't going to recover, so I
> tried to soft-reboot it, but it only ended locked and had to hard
> reboot.
>
> I've found this bug[1] that seems related, user watching videos. I
> will try removing vaapi and check if that improves anything.
>
> [1]: https://bugs.freedesktop.org/show_bug.cgi?id=106547
>

Removing libva didn't help. Lockups still occur frequently.

A new symptom is that video doesn't need to be visible on the display
for the lockup to occur. -rc4 didn't improve the situation either.
Lockups are still instantaneous; I haven't had another one where the
GPU doesn't drag the whole system down with it.

-- 
Braiam

* Re: How to get useful information other than "the whole system locks up"?
       [not found]                 ` <CAG=7Bt9NM+9umAZBb74h8dFLevqTk_xesy0TKdLUFMuyDnw_TQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2019-05-20  0:08                   ` Braiam
  0 siblings, 0 replies; 6+ messages in thread
From: Braiam @ 2019-05-20  0:08 UTC (permalink / raw)
  To: amd-gfx list

Decided to simply contact the manufacturer and ask them for the latest
vBIOS. Though it didn't crash in the days just before changing the BIOS,
it hasn't crashed in the days since either. Plus points since OpenCL
also seems to be working fine now, rather than making the card crash too
(or refusing to work altogether). It seems that one of the things to
take into account for crashing cards is whether the BIOS has been
modified.

-- 
Braiam
