Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD

From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
To: Shuotao Xu <shuotaoxu@microsoft.com>
Cc: "Mukul.Joshi@amd.com" <Mukul.Joshi@amd.com>,
	"Kuehling, Felix" <Felix.Kuehling@amd.com>,
	Peng Cheng <pengc@microsoft.com>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
	Lei Qu <Lei.Qu@microsoft.com>, Ran Shu <Ran.Shu@microsoft.com>,
	Ziyue Yang <Ziyue.Yang@microsoft.com>
Subject: Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
Date: Wed, 27 Apr 2022 12:04:03 -0400
Message-ID: <66bf32d5-1636-ecdd-8a49-24c6254079bf@amd.com>
In-Reply-To: <FF40C1DB-326C-45F5-9B59-14C39E17359D@microsoft.com>

[-- Attachment #1: Type: text/plain, Size: 15087 bytes --]

On 2022-04-27 05:20, Shuotao Xu wrote:

> Hi Andrey,
>
> Sorry that I did not have time to work on this for a few days.
>
> I just tried the sysfs crash fix on Radeon VII and it seems that it 
> worked. It did not pass the last hotplug test, but my version has 4 
> tests instead of 3 in your case.

That's because the 4th one is only enabled when there are 2 cards in the 
system - to test DRI_PRIME export. This time I tested with only one card.

>
>
> Suite: Hotunplug Tests
>   Test: Unplug card and rescan the bus to plug it back 
> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
> passed
>   Test: Same as first test but with command submission 
> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
> passed
>   Test: Unplug with exported bo .../usr/local/share/libdrm/amdgpu.ids: 
> No such file or directory
> passed
>   Test: Unplug with exported fence 
> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
> amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)

On the kernel side, the IOCTL returning this is drm_getclient. Maybe 
take a look at why it can't find the client? I didn't have such an issue, 
as far as I remember, when testing.

> FAILED
>     1. ../tests/amdgpu/hotunplug_tests.c:368  - CU_ASSERT_EQUAL(r,0)
>     2. ../tests/amdgpu/hotunplug_tests.c:411  - 
> CU_ASSERT_EQUAL(amdgpu_cs_import_syncobj(device2, shared_fd, 
> &sync_obj_handle2),0)
>     3. ../tests/amdgpu/hotunplug_tests.c:423  - 
> CU_ASSERT_EQUAL(amdgpu_cs_syncobj_wait(device2, &sync_obj_handle2, 1, 
> 100000000, 0, NULL),0)
>     4. ../tests/amdgpu/hotunplug_tests.c:425  - 
> CU_ASSERT_EQUAL(amdgpu_cs_destroy_syncobj(device2, sync_obj_handle2),0)
>
> Run Summary:    Type  Total    Ran Passed Failed Inactive
>               suites     14      1    n/a      0      0
>                tests     71      4      3      1      0
>              asserts     39     39     35      4    n/a
>
> Elapsed time =   17.321 seconds
>
> For kfd compute, there is a problem which I did not see on MI100 
> after I killed the hung application following hot plug-out. I was using 
> the rocm5.0.2 driver for the MI100 card, and I am not sure if it is a 
> regression from the newer driver.
> After pkill, one child of the user process would be stuck in zombie 
> state (Z), understandably because of the bug, and any ROCm application 
> started after plug-back would be in uninterruptible sleep (D) 
> because it would not return from its syscall to kfd.
>
> The drm tests for amdgpu, however, would run just fine after 
> plug-back, even with the dangling kfd state.

It is not clear to me when the crash below happens. Is it related to what 
you describe above?

>
> I don’t know if there is a quick fix for it. I was thinking of adding 
> drm_dev_enter/drm_dev_exit to amdgpu_device_rreg.

Try adding a drm_dev_enter/drm_dev_exit pair at the highest level that 
attempts to access HW - in this case it's amdgpu_amdkfd_set_compute_idle. 
We always try to avoid calling any HW functions after the backing device 
is gone.
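
A minimal userspace sketch of that guard pattern, for illustration only. 
drm_dev_enter() and drm_dev_exit() below are simplified stand-ins that 
keep the kernel signatures but only model an "unplugged" flag (the real 
ones also take an SRCU read lock). fake_switch_power_profile() and 
set_compute_idle_guarded() are hypothetical names, not driver functions; 
the latter shows where the pair would go in 
amdgpu_amdkfd_set_compute_idle.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-ins for the kernel structs; only what the sketch needs. */
struct drm_device {
	bool unplugged;		/* set once the device has been hot-unplugged */
};

struct amdgpu_device {
	struct drm_device ddev;
	int hw_accesses;	/* counts simulated register/SMC accesses */
};

/*
 * Simplified stand-in for the kernel's drm_dev_enter(): returns false
 * once the device is unplugged, so the caller must skip the HW path.
 */
static bool drm_dev_enter(struct drm_device *dev, int *idx)
{
	if (dev->unplugged)
		return false;
	*idx = 0;
	return true;
}

/* Simplified stand-in for drm_dev_exit(); nothing to release here. */
static void drm_dev_exit(int idx)
{
	(void)idx;
}

/* Hypothetical placeholder for the power-profile path that touches HW. */
static void fake_switch_power_profile(struct amdgpu_device *adev, bool en)
{
	(void)en;
	adev->hw_accesses++;
}

/*
 * Hypothetical guarded variant of the top-level entry point: bail out
 * before any HW access if the backing device is already gone.
 */
static void set_compute_idle_guarded(struct amdgpu_device *adev, bool idle)
{
	int idx;

	if (!drm_dev_enter(&adev->ddev, &idx))
		return;

	fake_switch_power_profile(adev, !idle);

	drm_dev_exit(idx);
}
```

Placing the check at the top of the call chain means everything below it 
(the power-profile switch, the SMC message, the register read) is skipped 
in one place, rather than guarding each low-level accessor separately.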

> Also, my attempt to fix the hotplug issue for kfd applications has 
> been going on for a long time.
> I don’t know 1) whether I will be able to get to MI100 (fixing Radeon 
> VII would mean something, but MI100 is more important for us); 2) in 
> what direction the patch for this issue will move forward.

I will go to the office tomorrow to pick up an MI-100. Time and priorities 
permitting, I will then try to test it and fix any bugs so that it 
passes all hot plug libdrm tests at the tip of public 
amd-staging-drm-next - https://gitlab.freedesktop.org/agd5f/linux. After 
that you can continue working on ROCm enabling on top of that.

For now I suggest you move on with Radeon 7 as your development 
ASIC and use the fix I mentioned above.

Andrey

>
> Regards,
> Shuotao
>
> [  +0.001645] BUG: unable to handle page fault for address: 
> 0000000000058a68
> [  +0.001298] #PF: supervisor read access in kernel mode
> [  +0.001252] #PF: error_code(0x0000) - not-present page
> [  +0.001248] PGD 8000000115806067 P4D 8000000115806067 PUD 109b2d067 
> PMD 0
> [  +0.001270] Oops: 0000 [#1] PREEMPT SMP PTI
> [  +0.001256] CPU: 5 PID: 13818 Comm: tf_cnn_benchmar Tainted: G       
>  W   E     5.16.0+ #3
> [  +0.001290] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 
> 1.5.4 [FPGA Test BIOS] 10/002/2015
> [  +0.001309] RIP: 0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]
> [  +0.001562] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f 3f 75 ae 0f 1f 44 
> 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75 0d 4c 03 a3 a0 09 
> 00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01 00 4c 89 f7 e8 a2 4c 2e ca 85
> [  +0.002751] RSP: 0018:ffffb58fac313928 EFLAGS: 00010202
> [  +0.001388] RAX: ffffffffc09a4270 RBX: ffff8b0c9c840000 RCX: 
> 00000000ffffffff
> [  +0.001402] RDX: 0000000000000000 RSI: 000000000001629a RDI: 
> ffff8b0c9c840000
> [  +0.001418] RBP: ffffb58fac313948 R08: 0000000000000021 R09: 
> 0000000000000001
> [  +0.001421] R10: ffffb58fac313b30 R11: ffffffff8c065b00 R12: 
> 0000000000058a68
> [  +0.001400] R13: 000000000001629a R14: 0000000000000000 R15: 
> 000000000001629a
> [  +0.001397] FS:  0000000000000000(0000) GS:ffff8b4b7fa80000(0000) 
> knlGS:0000000000000000
> [  +0.001411] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  +0.001405] CR2: 0000000000058a68 CR3: 000000010a2c8001 CR4: 
> 00000000001706e0
> [  +0.001422] Call Trace:
> [  +0.001407]  <TASK>
> [  +0.001391]  amdgpu_device_rreg+0x17/0x20 [amdgpu]
> [  +0.001614]  amdgpu_cgs_read_register+0x14/0x20 [amdgpu]
> [  +0.001735]  phm_wait_for_register_unequal.part.1+0x58/0x90 [amdgpu]
> [  +0.001790]  phm_wait_for_register_unequal+0x1a/0x30 [amdgpu]
> [  +0.001800]  vega20_wait_for_response+0x28/0x80 [amdgpu]
> [  +0.001757]  vega20_send_msg_to_smc_with_parameter+0x21/0x110 [amdgpu]
> [  +0.001838]  smum_send_msg_to_smc_with_parameter+0xcd/0x100 [amdgpu]
> [  +0.001829]  ? kvfree+0x1e/0x30
> [  +0.001462]  vega20_set_power_profile_mode+0x58/0x330 [amdgpu]
> [  +0.001868]  ? kvfree+0x1e/0x30
> [  +0.001462]  ? ttm_bo_release+0x261/0x370 [ttm]
> [  +0.001467]  pp_dpm_switch_power_profile+0xc2/0x170 [amdgpu]
> [  +0.001863]  amdgpu_dpm_switch_power_profile+0x6b/0x90 [amdgpu]
> [  +0.001866]  amdgpu_amdkfd_set_compute_idle+0x1a/0x20 [amdgpu]
> [  +0.001784]  kfd_dec_compute_active+0x2c/0x50 [amdgpu]
> [  +0.001744]  process_termination_cpsch+0x2f9/0x3a0 [amdgpu]
> [  +0.001728]  kfd_process_dequeue_from_all_devices+0x49/0x70 [amdgpu]
> [  +0.001730]  kfd_process_notifier_release+0x91/0xe0 [amdgpu]
> [  +0.001718]  __mmu_notifier_release+0x77/0x1f0
> [  +0.001411]  exit_mmap+0x1b5/0x200
> [  +0.001396]  ? __switch_to+0x12d/0x3e0
> [  +0.001388]  ? __switch_to_asm+0x36/0x70
> [  +0.001372]  ? preempt_count_add+0x74/0xc0
> [  +0.001364]  mmput+0x57/0x110
> [  +0.001349]  do_exit+0x33d/0xc20
> [  +0.001337]  ? _raw_spin_unlock+0x1a/0x30
> [  +0.001346]  do_group_exit+0x43/0xa0
> [  +0.001341]  get_signal+0x131/0x920
> [  +0.001295]  arch_do_signal_or_restart+0xb1/0x870
> [  +0.001303]  ? do_futex+0x125/0x190
> [  +0.001285]  exit_to_user_mode_prepare+0xb1/0x1c0
> [  +0.001282]  syscall_exit_to_user_mode+0x2a/0x40
> [  +0.001264]  do_syscall_64+0x46/0xb0
> [  +0.001236]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [  +0.001219] RIP: 0033:0x7f6aff1d2ad3
> [  +0.001177] Code: Unable to access opcode bytes at RIP 0x7f6aff1d2aa9.
> [  +0.001166] RSP: 002b:00007f6ab2029d20 EFLAGS: 00000246 ORIG_RAX: 
> 00000000000000ca
> [  +0.001170] RAX: fffffffffffffe00 RBX: 0000000004f542b0 RCX: 
> 00007f6aff1d2ad3
> [  +0.001168] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 
> 0000000004f542d8
> [  +0.001162] RBP: 0000000004f542d4 R08: 0000000000000000 R09: 
> 0000000000000000
> [  +0.001152] R10: 0000000000000000 R11: 0000000000000246 R12: 
> 0000000004f542d8
> [  +0.001176] R13: 0000000000000000 R14: 0000000004f54288 R15: 
> 0000000000000000
> [  +0.001152]  </TASK>
> [  +0.001113] Modules linked in: veth amdgpu(E) nf_conntrack_netlink 
> nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM 
> iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack 
> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 
> xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter 
> ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload esp4 
> xfrm_algo intel_rapl_msr intel_rapl_common sb_edac 
> x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_hdmi snd_hda_intel 
> ipmi_ssif snd_intel_dspcfg coretemp snd_hda_codec kvm_intel 
> snd_hda_core snd_hwdep snd_pcm snd_timer snd kvm soundcore irqbypass 
> ftdi_sio usbserial input_leds iTCO_wdt iTCO_vendor_support joydev 
> mei_me rapl lpc_ich intel_cstate mei ipmi_si ipmi_devintf 
> ipmi_msghandler mac_hid acpi_power_meter sch_fq_codel ib_iser rdma_cm 
> iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi 
> scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic 
> zstd_compress raid10 raid456
> [  +0.000102]  async_raid6_recov async_memcpy async_pq async_xor 
> async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear iommu_v2 
> gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper drm_kms_helper 
> syscopyarea sysfillrect sysimgblt fb_sys_fops crct10dif_pclmul 
> hid_generic crc32_pclmul ghash_clmulni_intel usbhid uas aesni_intel 
> crypto_simd igb ahci hid drm usb_storage cryptd libahci dca 
> megaraid_sas i2c_algo_bit wmi [last unloaded: amdgpu]
> [  +0.016626] CR2: 0000000000058a68
> [  +0.001550] ---[ end trace ff90849fe0a8b3b4 ]---
> [  +0.024953] RIP: 0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]
> [  +0.001814] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f 3f 75 ae 0f 1f 44 
> 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75 0d 4c 03 a3 a0 09 
> 00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01 00 4c 89 f7 e8 a2 4c 2e ca 85
> [  +0.003255] RSP: 0018:ffffb58fac313928 EFLAGS: 00010202
> [  +0.001641] RAX: ffffffffc09a4270 RBX: ffff8b0c9c840000 RCX: 
> 00000000ffffffff
> [  +0.001656] RDX: 0000000000000000 RSI: 000000000001629a RDI: 
> ffff8b0c9c840000
> [  +0.001681] RBP: ffffb58fac313948 R08: 0000000000000021 R09: 
> 0000000000000001
> [  +0.001662] R10: ffffb58fac313b30 R11: ffffffff8c065b00 R12: 
> 0000000000058a68
> [  +0.001650] R13: 000000000001629a R14: 0000000000000000 R15: 
> 000000000001629a
> [  +0.001648] FS:  0000000000000000(0000) GS:ffff8b4b7fa80000(0000) 
> knlGS:0000000000000000
> [  +0.001668] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  +0.001673] CR2: 0000000000058a68 CR3: 000000010a2c8001 CR4: 
> 00000000001706e0
> [  +0.001740] Fixing recursive fault but reboot is needed!
>
>
>> On Apr 21, 2022, at 2:41 AM, Andrey Grodzovsky 
>> <andrey.grodzovsky@amd.com> wrote:
>>
>> I retested the hot plug tests at the commit I mentioned below - looks 
>> ok. My ASIC is Navi 10; I also tested using Vega 10 and older Polaris 
>> ASICs (whatever I had at home at the time). It's possible there are 
>> extra issues in ASICs like yours which I didn't cover during tests.
>>
>> andrey@andrey-test:~/drm$ sudo ./build/tests/amdgpu/amdgpu_test -s 13
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>
>>
>> The ASIC NOT support UVD, suite disabled
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>
>>
>> The ASIC NOT support VCE, suite disabled
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>
>>
>> The ASIC NOT support UVD ENC, suite disabled.
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>
>>
>> Don't support TMZ (trust memory zone), security suite disabled
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> Peer device is not opened or has ASIC not supported by the suite, 
>> skip all Peer to Peer tests.
>>
>>
>>      CUnit - A unit testing framework for C - Version 2.1-3
>> http://cunit.sourceforge.net/
>>
>>
>> Suite: Hotunplug Tests
>>   Test: Unplug card and rescan the bus to plug it back 
>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> passed
>>   Test: Same as first test but with command submission 
>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> passed
>>   Test: Unplug with exported bo 
>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> passed
>>
>> Run Summary:    Type  Total    Ran Passed Failed Inactive
>>               suites     14      1    n/a      0      0
>>                tests     71      3      3      0      1
>>              asserts     21     21     21      0    n/a
>>
>> Elapsed time =    9.195 seconds
>>
>>
>> Andrey
>>
>> On 2022-04-20 11:44, Andrey Grodzovsky wrote:
>>>
>>> The only one I see on Radeon 7 is the same sysfs crash we already 
>>> fixed, so you can use the same fix. The MI 200 issue I haven't seen 
>>> yet, but I also haven't tested MI200, so I never saw it before. I need 
>>> to test when I get the time.
>>>
>>> So try that fix with Radeon 7 again to see if you pass the tests 
>>> (the warnings should all be minor issues).
>>>
>>> Andrey
>>>
>>>
>>> On 2022-04-20 05:24, Shuotao Xu wrote:
>>>>>
>>>>> That's a problem. The latest working baseline I tested and confirmed 
>>>>> passing hotplug tests is this branch and commit 
>>>>> https://gitlab.freedesktop.org/agd5f/linux/-/commit/86e12a53b73135806e101142e72f3f1c0e6fa8e6 
>>>>> which is amd-staging-drm-next. 5.14 was the branch where we 
>>>>> upstreamed the hotplug code, but it had a lot of regressions over 
>>>>> time due to new changes (that's why I added the hotplug test, to try 
>>>>> and catch them early). It would be best to run this branch on MI-100 
>>>>> so we have a clean baseline, and only after confirming that this 
>>>>> particular branch at this commit passes the libdrm tests start 
>>>>> adding the KFD specific addons. Another option, if you can't work 
>>>>> with MI-100 and this branch, is to try a different ASIC that does 
>>>>> work with this branch (if possible).
>>>>>
>>>>> Andrey
>>>>>
>>>> OK, I tried both this commit and the HEAD of amd-staging-drm-next on 
>>>> two GPUs (MI100 and Radeon VII); both did not pass the hot plug-out 
>>>> libdrm test. I might be able to gain access to MI200, but I suspect 
>>>> it would work.
>>>>
>>>> I copied the complete dmesgs as follows. I highlighted the OOPSES 
>>>> for you.
>>>>
>>>> Radeon VII:
>

[-- Attachment #2: Type: text/html, Size: 28514 bytes --]