From: Felix Kuehling <felix.kuehling@amd.com>
To: Vasant Hegde <vasant.hegde@amd.com>,
	Matt Fagnani <matt.fagnani@bell.net>,
	"iommu@lists.linux.dev" <iommu@lists.linux.dev>,
	Alex Deucher <alexander.deucher@amd.com>
Cc: Thorsten Leemhuis <regressions@leemhuis.info>,
	Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Subject: Re: amdgpu failed to resume with AMD IOMMU enabled and 6.2.2-301 and 6.3.0-0.rc0.20230227gitf3a2439f20d9.9.fc39 and later resulting in a black screen
Date: Mon, 13 Mar 2023 19:02:54 -0400
Message-ID: <6c16f004-f20a-26dc-0f3e-abe0b683d764@amd.com>
In-Reply-To: <9b688cbe-ec48-17a7-0e40-5734d58e102d@amd.com>

On 2023-03-13 at 00:21, Vasant Hegde wrote:
> Hi Matt,
>
> + Suravee, Felix.
>
> Thanks for reporting this issue.
>
> On 3/12/2023 12:27 AM, Matt Fagnani wrote:
>> I booted a Fedora 38 KDE Plasma installation with the 6.2.2-301 kernel on an hp
>> laptop with an AMD A10-9620P CPU, an integrated Radeon R5 GPU, and an AMD IOMMU
>> enabled. I selected Sleep in either the Application Launcher menu in Plasma
>> 5.27.2 on Wayland or sddm on Wayland. The system went to sleep. I moved the
>> mouse to wake the system. The screen remained black, but the LEDs on the side of
>> the laptop flickered indicating drive activity and the fan resumed making noise.
>> I pressed sysrq+alt+s,u,b to do an emergency sync, remount read-only, and
>> reboot. The system rebooted. The journal indicated that amdgpu failed to resume
>> with errors including "amdgpu: amdgpu_device_ip_resume failed (-6).", which
>> started after the kernel failed to resume the AMD IOMMU.
> Looking at the code path, I guess what's happening is:
>    - During system boot, `amd_iommu_init_device()` returns an error to the GPU
> driver because it failed to enable PASID for the GPU.
>    - With my previous fixes, the IOMMU puts the device back into the default
> domain properly.
>    - The system continues to work with the IOMMU default domain (without the
> PASID/PRI feature for the GPU).
>    - System suspend/resume.
>    - It looks like in the resume path, amdgpu_device_ip_resume() again calls
> amd_iommu_init_device() and the IOMMU returns an error for the same reason (it
> couldn't enable PASID).
>    - It looks like the AMD GPU then tried to reset and failed.
>
> IMO this needs to be fixed in the GPU driver (either handle the error path or fix
> the original PASID enable issue using PCI quirks or something).

I agree. We're not handling errors returned by kgd2kfd_device_init()
correctly, which causes problems later on when we try to resume from
suspend. I'll prepare a patch.
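
To illustrate the failure mode described above, here is a toy user-space model in Python. It is only a sketch of the idea, not the patch and not real driver code: `ModelGpu`, `honor_init_error`, `kfd_ready` and the method names are all hypothetical, and the assumption that remembering the init failure is enough to make resume succeed is mine.

```python
ENXIO = 6  # Linux errno value; the log reports -6


class ModelGpu:
    """Toy model of the boot/resume sequence; no real amdgpu/KFD symbols."""

    def __init__(self, pasid_works, honor_init_error):
        self.pasid_works = pasid_works            # can the IOMMU enable PASID?
        self.honor_init_error = honor_init_error  # does probe remember a failure?
        self.kfd_ready = False

    def _iommu_init(self):
        # Stand-in for the IOMMU setup call: fails if PASID cannot be enabled.
        return 0 if self.pasid_works else -ENXIO

    def probe(self):
        err = self._iommu_init()
        # With honor_init_error=False the failure is dropped and the compute
        # side is still treated as initialized.
        self.kfd_ready = (err == 0) or not self.honor_init_error
        return 0  # graphics comes up either way in this model

    def resume(self):
        if not self.kfd_ready:
            return 0  # nothing was initialized, so nothing to resume
        return self._iommu_init()


broken = ModelGpu(pasid_works=False, honor_init_error=False)
broken.probe()
print(broken.resume())   # -6: resume retries the IOMMU setup and fails again

handled = ModelGpu(pasid_works=False, honor_init_error=True)
handled.probe()
print(handled.resume())  # 0: resume skips the part that never initialized
```

In the first case resume fails with the same error as at boot, which matches the sequence Vasant outlined; in the second, the device resumes without the PASID-dependent part.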

Regards,
   Felix


>
>
> -Vasant
>
>
>
>> Mar 09 20:27:55 kernel: kfd kfd: amdgpu: Failed to resume IOMMU for device
>> 1002:9874
>> Mar 09 20:27:55 kernel: amdgpu 0000:00:01.0: amdgpu: amdgpu_device_ip_resume
>> failed (-6).
>> Mar 09 20:27:55 kernel: amdgpu 0000:00:01.0: PM: dpm_run_callback():
>> pci_pm_resume+0x0/0xe0 returns -6
>> Mar 09 20:27:55 kernel: amdgpu 0000:00:01.0: PM: failed to resume async: error -6
>> Mar 09 20:27:55 kernel: sd 0:0:0:0: [sda] Starting disk
>> Mar 09 20:27:55 kernel: usb 2-1.4: reset full-speed USB device number 4 using
>> ehci-pci
>> Mar 09 20:27:55 kernel: usb 2-1.3: reset full-speed USB device number 3 using
>> ehci-pci
>> Mar 09 20:27:55 kernel: psmouse serio1: synaptics: queried max coordinates: x
>> [..5648], y [..4826]
>> Mar 09 20:27:55 kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
>> Mar 09 20:27:55 kernel: psmouse serio1: synaptics: queried min coordinates: x
>> [1292..], y [1026..]
>> Mar 09 20:27:55 kernel: ata1.00: configured for UDMA/133
>> Mar 09 20:27:55 kernel: PM: resume devices took 2.703 seconds
>> Mar 09 20:27:55 kernel: OOM killer enabled.
>> Mar 09 20:27:55 kernel: Restarting tasks ... done.
>> Mar 09 20:27:55 kernel: random: crng reseeded on system resumption
>> Mar 09 20:27:55 kernel: thermal thermal_zone2: failed to read out thermal zone
>> (-61)
>> Mar 09 20:27:55 kernel: Bluetooth: hci0: Legacy ROM 2.x revision 5.0 build 25
>> week 20 2015
>> Mar 09 20:27:55 kernel: Bluetooth: hci0: Intel Bluetooth firmware file:
>> intel/ibt-hw-37.8.10-fw-22.50.19.14.f.bseq
>> Mar 09 20:27:55 kernel: PM: suspend exit
>> Mar 09 20:27:55 kernel: Generic FE-GE Realtek PHY r8169-0-100:00: attached PHY
>> driver (mii_bus:phy_addr=r8169-0-100:00, irq=MAC)
>> Mar 09 20:27:55 kernel: r8169 0000:01:00.0 enp1s0: Link is Down
>> Mar 09 20:27:56 kernel: Bluetooth: hci0: Intel BT fw patch 0x43 completed &
>> activated
>> Mar 09 20:28:00 kernel: r8169 0000:01:00.0 enp1s0: Link is Up - 1Gbps/Full -
>> flow control off
>> Mar 09 20:28:00 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): enp1s0: link becomes ready
>> Mar 09 20:28:01 kernel: r8169 0000:01:00.0 enp1s0: Link is Down
>> Mar 09 20:28:02 kernel: filter_IN_drop_DROP: IN=enp1s0 OUT= MAC=
>> SRC=fe80:0000:0000:0000:265c:5b24:c7aa:102b
>> DST=ff02:0000:0000:0000:0000:0000:0000:00fb LEN=185 TC=0 HOPLIMIT=255
>> FLOWLBL=110208 PROTO=UDP SPT=5353 DPT=5353 LEN=145
>> Mar 09 20:28:04 kernel: filter_IN_drop_DROP: IN=enp1s0 OUT= MAC=
>> SRC=fe80:0000:0000:0000:265c:5b24:c7aa:102b
>> DST=ff02:0000:0000:0000:0000:0000:0000:00fb LEN=185 TC=0 HOPLIMIT=255
>> FLOWLBL=110208 PROTO=UDP SPT=5353 DPT=5353 LEN=145
>> Mar 09 20:28:05 kernel: r8169 0000:01:00.0 enp1s0: Link is Up - 1Gbps/Full -
>> flow control off
>> Mar 09 20:28:06 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0
>> timeout, signaled seq=49904, emitted seq=49906
>> Mar 09 20:28:06 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process
>> information: process  pid 0 thread  pid 0
>> Mar 09 20:28:06 kernel: amdgpu 0000:00:01.0: amdgpu: GPU reset begin!
>> Mar 09 20:28:07 kernel: amdgpu 0000:00:01.0: [drm:amdgpu_ib_ring_tests [amdgpu]]
>> *ERROR* IB test failed on gfx (-110).
>> Mar 09 20:28:07 kernel: amdgpu 0000:00:01.0: amdgpu: ib ring test failed (-110).
>> Mar 09 20:28:07 kernel: amdgpu 0000:00:01.0: [drm:amdgpu_ring_test_helper
>> [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
>> Mar 09 20:28:07 kernel: [drm:gfx_v8_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
>> Mar 09 20:28:07 kernel: amdgpu: cp is busy, skip halt cp
>> Mar 09 20:28:07 kernel: amdgpu: rlc is busy, skip halt rlc
>> Mar 09 20:28:07 kernel: amdgpu 0000:00:01.0: amdgpu: GPU reset succeeded, trying
>> to resume
>> Mar 09 20:28:07 kernel: kfd kfd: amdgpu: Failed to resume IOMMU for device
>> 1002:9874
>> Mar 09 20:28:07 kernel: amdgpu 0000:00:01.0: amdgpu: GPU reset(1) failed
>> Mar 09 20:28:07 kernel: kfd kfd: amdgpu: Allocated 3969056 bytes on gart
>> Mar 09 20:28:07 kernel: amdgpu: sdma_bitmap: f
>> Mar 09 20:28:07 kernel: kfd kfd: amdgpu: Failed to resume IOMMU for device
>> 1002:9874
>> Mar 09 20:28:07 kernel: kfd kfd: amdgpu: device 1002:9874 NOT added due to errors
>> Mar 09 20:28:07 kernel: amdgpu 0000:00:01.0: amdgpu: GPU reset end with ret = -6
>> Mar 09 20:28:07 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* GPU Recovery
>> Failed: -6
>> Mar 09 20:28:10 kernel: filter_IN_drop_DROP: IN=enp1s0 OUT= MAC=
>> SRC=192.168.2.10 DST=224.0.0.251 LEN=234 TOS=0x00 PREC=0x00 TTL=255 ID=40777 DF
>> PROTO=UDP SPT=5353 DPT=5353 LEN=214
>> Mar 09 20:28:10 kernel: filter_IN_drop_DROP: IN=enp1s0 OUT= MAC=
>> SRC=192.168.2.10 DST=224.0.0.251 LEN=234 TOS=0x00 PREC=0x00 TTL=255 ID=40988 DF
>> PROTO=UDP SPT=5353 DPT=5353 LEN=214
>> Mar 09 20:28:10 kernel: filter_IN_drop_DROP: IN=enp1s0 OUT= MAC=
>> SRC=192.168.2.10 DST=224.0.0.251 LEN=234 TOS=0x00 PREC=0x00 TTL=255 ID=41207 DF
>> PROTO=UDP SPT=5353 DPT=5353 LEN=214
>> Mar 09 20:28:11 kernel: filter_IN_drop_DROP: IN=enp1s0 OUT= MAC=
>> SRC=192.168.2.10 DST=224.0.0.251 LEN=216 TOS=0x00 PREC=0x00 TTL=255 ID=41247 DF
>> PROTO=UDP SPT=5353 DPT=5353 LEN=196
>> Mar 09 20:28:12 kernel: filter_IN_drop_DROP: IN=enp1s0 OUT= MAC=
>> SRC=192.168.2.10 DST=224.0.0.251 LEN=216 TOS=0x00 PREC=0x00 TTL=255 ID=41784 DF
>> PROTO=UDP SPT=5353 DPT=5353 LEN=196
>> Mar 09 20:28:14 kernel: filter_IN_drop_DROP: IN=enp1s0 OUT= MAC=
>> SRC=192.168.2.10 DST=224.0.0.251 LEN=216 TOS=0x00 PREC=0x00 TTL=255 ID=42530 DF
>> PROTO=UDP SPT=5353 DPT=5353 LEN=196
>> Mar 09 20:28:18 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0
>> timeout, signaled seq=49906, emitted seq=49908
>> Mar 09 20:28:18 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process
>> information: process  pid 0 thread  pid 0
>> Mar 09 20:28:18 kernel: amdgpu 0000:00:01.0: amdgpu: GPU reset begin!
>> Mar 09 20:28:18 kernel: amdgpu 0000:00:01.0: amdgpu: IP block:gfx_v8_0 is hung!
>> Mar 09 20:28:18 kernel: amdgpu 0000:00:01.0: amdgpu: soft reset failed, will
>> fallback to full reset!
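
For reading the log above: the negative numbers are Linux errno values. A small Python helper along these lines can pull them out of journal lines and name them. The table covers only the three codes that appear in this log, and `decode_errors` and the regular expression are my own illustration, not an existing tool.

```python
import re

# Linux errno values; only the codes seen in the log above.
ERRNO_NAMES = {
    6: ("ENXIO", "No such device or address"),
    61: ("ENODATA", "No data available"),
    110: ("ETIMEDOUT", "Connection timed out"),
}

# Matches forms such as "(-6)", "returns -6", "error -6", "ret = -6", "Failed: -6".
_CODE = re.compile(r"(?:\(|returns |error |ret = |Failed: )-(\d+)\)?")


def decode_errors(line):
    """Return (code, name) pairs for negative errno values found in a log line."""
    found = []
    for match in _CODE.finditer(line):
        code = int(match.group(1))
        name = ERRNO_NAMES.get(code, ("unknown", ""))[0]
        found.append((-code, name))
    return found


print(decode_errors("amdgpu: amdgpu_device_ip_resume failed (-6)."))
print(decode_errors("*ERROR* IB test failed on gfx (-110)."))
```

So the resume failures are -ENXIO, while the later ring and IB test failures are timeouts (-ETIMEDOUT).
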
>>
>> This problem happened each of a few times with the 6.2.2-301 kernel which
>> contained patches which fixed the black screen problem when amdgpu started
>> during boot with all previous 6.2 branch kernels on this system as reported at
>> https://gitlab.freedesktop.org/drm/amd/-/issues/2319 The problem also happened
>> with 6.2.3. I booted with amd_iommu=off on the kernel command line which was a
>> workaround for that previous problem, and the failure to resume didn't happen
>> when I put the system to sleep 5 times. The AMD IOMMU is likely involved in this
>> problem. I reported this problem at
>> https://gitlab.freedesktop.org/drm/amd/-/issues/2454
>> https://bugzilla.redhat.com/show_bug.cgi?id=2177111 and
>> https://bugzilla.kernel.org/show_bug.cgi?id=217170 Alex Deucher wrote "Might be
>> the same root cause as #2319 (closed).
>> https://gitlab.freedesktop.org/drm/amd/-/issues/2319 The fix for that may not
>> have covered suspend." at
>> https://gitlab.freedesktop.org/drm/amd/-/issues/2454#note_1814352
>>
>> This problem didn't happen with 6.1.15 or earlier. Bisecting this problem might
>> be problematic because previous 6.2 kernels had the black screen problem on boot
>> with the default kernel command line parameters, and the failure to resume
>> didn't happen with amd_iommu=off. I'm attaching the kernel log for a boot when I
>> clicked Sleep in sddm, tried to resume the system, and the problem happened.
>>
>> The Fedora Rawhide build
>> kernel-6.3.0-0.rc1.20230309git6a98c9cae232.18.fc39.x86_64 has this resume
>> problem. kernel-6.3.0-0.rc0.20230227gitf3a2439f20d9.9.fc39.x86_64 is the first
>> Rawhide kernel without the black screen during boot problem
>> https://gitlab.freedesktop.org/drm/amd/-/issues/2319 and it has this failure to
>> resume problem. The previous build
>> kernel-6.3.0-0.rc0.20230223gita5c95ca18a98.4.fc39.x86_64 had the black screen
>> during boot, so I'm unsure how to test such kernels for this resume problem
>> since it's necessary to use amdgpu and have the IOMMU enabled for it to happen.
>>
>> 6.3.0-0.rc0.20230227gitf3a2439f20d9.9.fc39 and later had a warning while
>> suspending involving amdgpu which wasn't shown with 6.2.2.
>>
>> Mar 10 02:21:24 kernel: ------------[ cut here ]------------
>> Mar 10 02:21:24 kernel: WARNING: CPU: 2 PID: 1393 at kernel/workqueue.c:3167
>> __flush_work.isra.0+0x270/0x280
>> Mar 10 02:21:24 kernel: Modules linked in: snd_seq_dummy snd_hrtimer
>> nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4
>> nf_reject_ipv6 nft_reject nf_log_syslog nft_log nft_ct nft_chain_nat nf_nat
>> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc
>> iwlmvm mac80211 uvcvideo edac_mce_amd libarc4 kvm_amd btusb btrtl snd_ctl_led
>> uvc iwlwifi btbcm snd_hda_codec_realtek ccp btintel videobuf2_vmalloc
>> videobuf2_memops snd_hda_codec_generic btmtk videobuf2_v4l2 snd_hda_codec_hdmi
>> ledtrig_audio videobuf2_common hp_wmi snd_hda_intel kvm snd_intel_dspcfg
>> bluetooth sparse_keymap platform_profile snd_intel_sdw_acpi irqbypass cfg80211
>> snd_hda_codec videodev vfat wmi_bmof fat mc pcspkr snd_hda_core snd_hwdep
>> i2c_piix4 rfkill fam15h_power k10temp snd_seq snd_seq_device snd_pcm snd_timer
>> snd soundcore i2c_scmi wireless_hotkey acpi_cpufreq joydev loop zram amdgpu
>> hid_logitech_hidpp crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni
>> polyval_generic i2c_algo_bit drm_ttm_helper ttm iommu_v2
>> Mar 10 02:21:24 kernel:  ghash_clmulni_intel drm_buddy r8169 sha512_ssse3
>> wdat_wdt gpu_sched sp5100_tco drm_display_helper cec video wmi hid_multitouch
>> hid_logitech_dj serio_raw scsi_dh_rdac scsi_dh_emc scsi_dh_alua fuse dm_multipath
>> Mar 10 02:21:24 kernel: CPU: 2 PID: 1393 Comm: kworker/u8:10 Not tainted
>> 6.3.0-0.rc0.20230227gitf3a2439f20d9.9.fc39.x86_64 #1
>> Mar 10 02:21:24 kernel: Hardware name: HP HP Laptop 15-bw0xx/8332, BIOS F.52
>> 12/03/2019
>> Mar 10 02:21:24 kernel: Workqueue: events_unbound async_run_entry_fn
>> Mar 10 02:21:24 kernel: RIP: 0010:__flush_work.isra.0+0x270/0x280
>> Mar 10 02:21:24 kernel: Code: 8b 04 25 80 22 03 00 48 89 44 24 40 48 8b 73 30 8b
>> 4b 28 e9 e3 fe ff ff 40 30 f6 4c 8b 3e e9 21 fe ff ff 0f 0b e9 3a ff ff ff <0f>
>> 0b e9 33 ff ff ff e8 04 d2 e3 00 0f 1f 40 00 90 90 90 90 90 90
>> Mar 10 02:21:24 kernel: RSP: 0018:ffff98a4c3de7ca8 EFLAGS: 00010246
>> Mar 10 02:21:24 kernel: RAX: 0000000000000000 RBX: ffff8d3350680340 RCX:
>> 0000000000000000
>> Mar 10 02:21:24 kernel: RDX: 0000000000000001 RSI: 0000000000000001 RDI:
>> ffff98a4c3de7cf0
>> Mar 10 02:21:24 kernel: RBP: ffff8d3350680340 R08: 745e72736d647564 R09:
>> ffff8d3386ae3c74
>> Mar 10 02:21:24 kernel: R10: 000000000000000f R11: fefefefefefefeff R12:
>> 0000000000000001
>> Mar 10 02:21:24 kernel: R13: ffff98a4c3de7ca8 R14: 0000000000000001 R15:
>> ffff8d33789e4f28
>> Mar 10 02:21:24 kernel: FS:  0000000000000000(0000) GS:ffff8d3437500000(0000)
>> knlGS:0000000000000000
>> Mar 10 02:21:24 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> Mar 10 02:21:24 kernel: CR2: 0000562f5c082158 CR3: 00000001459ca000 CR4:
>> 00000000001506e0
>> Mar 10 02:21:24 kernel: Call Trace:
>> Mar 10 02:21:24 kernel:  <TASK>
>> Mar 10 02:21:24 kernel:  __cancel_work_timer+0xff/0x190
>> Mar 10 02:21:24 kernel:  ? wait_for_completion+0x37/0x160
>> Mar 10 02:21:24 kernel:  ? preempt_count_add+0x6a/0xa0
>> Mar 10 02:21:24 kernel:  drm_kms_helper_poll_disable+0x1e/0x40
>> Mar 10 02:21:24 kernel:  amdgpu_device_suspend+0x9e/0x180 [amdgpu]
>> Mar 10 02:21:24 kernel:  pci_pm_suspend+0x7b/0x170
>> Mar 10 02:21:24 kernel:  ? __pfx_pci_pm_suspend+0x10/0x10
>> Mar 10 02:21:24 kernel:  dpm_run_callback+0x8c/0x1e0
>> Mar 10 02:21:24 kernel:  __device_suspend+0x10a/0x560
>> Mar 10 02:21:24 kernel:  async_suspend+0x1a/0x70
>> Mar 10 02:21:24 kernel:  async_run_entry_fn+0x30/0x130
>> Mar 10 02:21:24 kernel:  process_one_work+0x1c7/0x3d0
>> Mar 10 02:21:24 kernel:  worker_thread+0x4d/0x380
>> Mar 10 02:21:24 kernel:  ? __pfx_worker_thread+0x10/0x10
>> Mar 10 02:21:24 kernel:  kthread+0xe9/0x110
>> Mar 10 02:21:24 kernel:  ? __pfx_kthread+0x10/0x10
>> Mar 10 02:21:24 kernel:  ret_from_fork+0x2c/0x50
>> Mar 10 02:21:24 kernel:  </TASK>
>> Mar 10 02:21:24 kernel: ---[ end trace 0000000000000000 ]---
>>
>> Bert Karwatzki wrote "The suspend warning is addressed in issue #2411."
>> https://gitlab.freedesktop.org/drm/amd/-/issues/2411 at
>> https://gitlab.freedesktop.org/drm/amd/-/issues/2454#note_1816958 I don't know
>> if this warning is related to the resume problem.
>>
>> Hardware description:
>> CPU: AMD A10-9620P
>> GPU: integrated AMD Radeon R5
>> 00:01.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI]
>> Wani [Radeon R5/R6/R7 Graphics] [1002:9874] (rev ca)
>> System Memory: 8 GB
>> Display(s): internal Elan touchscreen
>> Type of Display Connection: eDP
>>
>> System information:
>> Distro name and Version: Fedora 38
>> Kernel version: 6.2.2-301.fc38 to 6.2.3,
>> 6.3.0-0.rc0.20230227gitf3a2439f20d9.9.fc39 to
>> 6.3.0-0.rc1.20230309git6a98c9cae232.18.fc39
>> Custom kernel: N/A
>> AMD official driver version: N/A
>>
>> How to reproduce the issue:
>> 1. Boot a Fedora 38 KDE Plasma installation with 6.2.2-301.fc38 or
>> 6.2.3-300.fc38 updated to 2023-3-10 with updates-testing enabled on a laptop
>> with an AMD A10-9620P CPU, an integrated Radeon R5 GPU, and an AMD IOMMU enabled
>> 2. Select Virtual Keyboard at the bottom left of sddm if the Sleep, Restart,
>> Shut down buttons don't appear
>> 3. Select Sleep in sddm
>> 4. Resume the system by moving the mouse or pressing a key


Thread overview: 7+ messages
2023-03-11 18:57 amdgpu failed to resume with AMD IOMMU enabled and 6.2.2-301 and 6.3.0-0.rc0.20230227gitf3a2439f20d9.9.fc39 and later resulting in a black screen Matt Fagnani
2023-03-13  4:21 ` Vasant Hegde
2023-03-13 23:02   ` Felix Kuehling [this message]
2023-03-15  6:40     ` Matt Fagnani
2023-03-15 16:14       ` Felix Kuehling
2023-03-15 17:07         ` Matt Fagnani
2023-03-13 10:30 ` Linux regression tracking #adding (Thorsten Leemhuis)
