On 2022-04-27 05:20, Shuotao Xu wrote:
> Hi Andrey,
>
> Sorry that I did not have time to work on this for a few days.
>
> I just tried the sysfs crash fix on Radeon VII and it seems that it
> worked. It did not pass the last hotplug test, but my version has 4
> tests instead of 3 in your case.

That's because the 4th one is only enabled when there are 2 cards in the system - to test DRI_PRIME export. I tested this time with only one card.

> Suite: Hotunplug Tests
>   Test: Unplug card and rescan the bus to plug it back
> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
> passed
>   Test: Same as first test but with command submission
> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
> passed
>   Test: Unplug with exported bo
> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
> passed
>   Test: Unplug with exported fence
> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
> amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)

On the kernel side the IOCTL returning this is drm_getclient - maybe take a look at why it can't find the client? I didn't have such an issue as far as I remember when testing.

> FAILED
>     1. ../tests/amdgpu/hotunplug_tests.c:368  - CU_ASSERT_EQUAL(r,0)
>     2. ../tests/amdgpu/hotunplug_tests.c:411  -
> CU_ASSERT_EQUAL(amdgpu_cs_import_syncobj(device2, shared_fd,
> &sync_obj_handle2),0)
>     3. ../tests/amdgpu/hotunplug_tests.c:423  -
> CU_ASSERT_EQUAL(amdgpu_cs_syncobj_wait(device2, &sync_obj_handle2, 1,
> 100000000, 0, NULL),0)
>     4.
> ../tests/amdgpu/hotunplug_tests.c:425  -
> CU_ASSERT_EQUAL(amdgpu_cs_destroy_syncobj(device2, sync_obj_handle2),0)
>
> Run Summary:    Type  Total    Ran Passed Failed Inactive
>               suites     14      1    n/a      0      0
>                tests     71      4      3      1      0
>              asserts     39     39     35      4    n/a
>
> Elapsed time =   17.321 seconds
>
> For kfd compute, there is some problem which I did not see on MI100
> after I killed the hung application after hot plug-out. I was using
> the rocm5.0.2 driver for the MI100 card, and am not sure if it is a
> regression from the newer driver.
> After pkill, one of the children of the user process would be stuck in
> zombie mode (Z), understandably because of the bug, and future rocm
> applications after plug-back would be in uninterruptible sleep mode (D)
> because they would not return from the syscall to kfd.
>
> Although the drm tests for amdgpu would run just fine without issues
> after plug-back with the dangling kfd state.

I am not clear on when the crash below happens. Is it related to what you describe above?

> I don’t know if there is a quick fix to it. I was thinking of adding
> drm_enter/drm_exit to amdgpu_device_rreg.

Try adding a drm_dev_enter/exit pair at the highest level of attempting to access HW - in this case it's amdgpu_amdkfd_set_compute_idle. We always try to avoid accessing any HW functions after the backing device is gone.

> Also this has been a long-standing attempt of mine to fix the hotplug
> issue for kfd applications.
> I don’t know 1) if I would be able to get to MI100 (fixing Radeon VII
> would mean something but MI100 is more important for us); 2) what
> direction the patch for this issue will take moving forward.
I will go to the office tomorrow to pick up the MI-100. With time and priorities permitting, I will then try to test it and fix any bugs such that it passes all hot plug libdrm tests at the tip of the public amd-staging-drm-next - https://gitlab.freedesktop.org/agd5f/linux. After that you can try to continue working on ROCm enabling on top of that. For now I suggest you move on with the Radeon 7 as your development ASIC and use the fix I mentioned above.

Andrey

> Regards,
> Shuotao
>
> [  +0.001645] BUG: unable to handle page fault for address: 0000000000058a68
> [  +0.001298] #PF: supervisor read access in kernel mode
> [  +0.001252] #PF: error_code(0x0000) - not-present page
> [  +0.001248] PGD 8000000115806067 P4D 8000000115806067 PUD 109b2d067 PMD 0
> [  +0.001270] Oops: 0000 [#1] PREEMPT SMP PTI
> [  +0.001256] CPU: 5 PID: 13818 Comm: tf_cnn_benchmar Tainted: G        W   E     5.16.0+ #3
> [  +0.001290] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
> [  +0.001309] RIP: 0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]
> [  +0.001562] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f 3f 75 ae 0f 1f 44 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75 0d 4c 03 a3 a0 09 00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01 00 4c 89 f7 e8 a2 4c 2e ca 85
> [  +0.002751] RSP: 0018:ffffb58fac313928 EFLAGS: 00010202
> [  +0.001388] RAX: ffffffffc09a4270 RBX: ffff8b0c9c840000 RCX: 00000000ffffffff
> [  +0.001402] RDX: 0000000000000000 RSI: 000000000001629a RDI: ffff8b0c9c840000
> [  +0.001418] RBP: ffffb58fac313948 R08: 0000000000000021 R09: 0000000000000001
> [  +0.001421] R10: ffffb58fac313b30 R11: ffffffff8c065b00 R12: 0000000000058a68
> [  +0.001400] R13: 000000000001629a R14: 0000000000000000 R15: 000000000001629a
> [  +0.001397] FS:  0000000000000000(0000) GS:ffff8b4b7fa80000(0000) knlGS:0000000000000000
> [  +0.001411] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  +0.001405] CR2: 0000000000058a68 CR3: 000000010a2c8001 CR4: 00000000001706e0
> [  +0.001422] Call Trace:
> [  +0.001407]
> [  +0.001391]  amdgpu_device_rreg+0x17/0x20 [amdgpu]
> [  +0.001614]  amdgpu_cgs_read_register+0x14/0x20 [amdgpu]
> [  +0.001735]  phm_wait_for_register_unequal.part.1+0x58/0x90 [amdgpu]
> [  +0.001790]  phm_wait_for_register_unequal+0x1a/0x30 [amdgpu]
> [  +0.001800]  vega20_wait_for_response+0x28/0x80 [amdgpu]
> [  +0.001757]  vega20_send_msg_to_smc_with_parameter+0x21/0x110 [amdgpu]
> [  +0.001838]  smum_send_msg_to_smc_with_parameter+0xcd/0x100 [amdgpu]
> [  +0.001829]  ? kvfree+0x1e/0x30
> [  +0.001462]  vega20_set_power_profile_mode+0x58/0x330 [amdgpu]
> [  +0.001868]  ? kvfree+0x1e/0x30
> [  +0.001462]  ? ttm_bo_release+0x261/0x370 [ttm]
> [  +0.001467]  pp_dpm_switch_power_profile+0xc2/0x170 [amdgpu]
> [  +0.001863]  amdgpu_dpm_switch_power_profile+0x6b/0x90 [amdgpu]
> [  +0.001866]  amdgpu_amdkfd_set_compute_idle+0x1a/0x20 [amdgpu]
> [  +0.001784]  kfd_dec_compute_active+0x2c/0x50 [amdgpu]
> [  +0.001744]  process_termination_cpsch+0x2f9/0x3a0 [amdgpu]
> [  +0.001728]  kfd_process_dequeue_from_all_devices+0x49/0x70 [amdgpu]
> [  +0.001730]  kfd_process_notifier_release+0x91/0xe0 [amdgpu]
> [  +0.001718]  __mmu_notifier_release+0x77/0x1f0
> [  +0.001411]  exit_mmap+0x1b5/0x200
> [  +0.001396]  ? __switch_to+0x12d/0x3e0
> [  +0.001388]  ? __switch_to_asm+0x36/0x70
> [  +0.001372]  ? preempt_count_add+0x74/0xc0
> [  +0.001364]  mmput+0x57/0x110
> [  +0.001349]  do_exit+0x33d/0xc20
> [  +0.001337]  ? _raw_spin_unlock+0x1a/0x30
> [  +0.001346]  do_group_exit+0x43/0xa0
> [  +0.001341]  get_signal+0x131/0x920
> [  +0.001295]  arch_do_signal_or_restart+0xb1/0x870
> [  +0.001303]  ? do_futex+0x125/0x190
> [  +0.001285]  exit_to_user_mode_prepare+0xb1/0x1c0
> [  +0.001282]  syscall_exit_to_user_mode+0x2a/0x40
> [  +0.001264]  do_syscall_64+0x46/0xb0
> [  +0.001236]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [  +0.001219] RIP: 0033:0x7f6aff1d2ad3
> [  +0.001177] Code: Unable to access opcode bytes at RIP 0x7f6aff1d2aa9.
> [  +0.001166] RSP: 002b:00007f6ab2029d20 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
> [  +0.001170] RAX: fffffffffffffe00 RBX: 0000000004f542b0 RCX: 00007f6aff1d2ad3
> [  +0.001168] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 0000000004f542d8
> [  +0.001162] RBP: 0000000004f542d4 R08: 0000000000000000 R09: 0000000000000000
> [  +0.001152] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000004f542d8
> [  +0.001176] R13: 0000000000000000 R14: 0000000004f54288 R15: 0000000000000000
> [  +0.001152]
> [  +0.001113] Modules linked in: veth amdgpu(E) nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload esp4 xfrm_algo intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_hdmi snd_hda_intel ipmi_ssif snd_intel_dspcfg coretemp snd_hda_codec kvm_intel snd_hda_core snd_hwdep snd_pcm snd_timer snd kvm soundcore irqbypass ftdi_sio usbserial input_leds iTCO_wdt iTCO_vendor_support joydev mei_me rapl lpc_ich intel_cstate mei ipmi_si ipmi_devintf ipmi_msghandler mac_hid acpi_power_meter sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456
> [  +0.000102]  async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crct10dif_pclmul hid_generic crc32_pclmul ghash_clmulni_intel usbhid uas aesni_intel crypto_simd igb ahci hid drm usb_storage cryptd libahci dca megaraid_sas i2c_algo_bit wmi [last unloaded: amdgpu]
> [  +0.016626] CR2: 0000000000058a68
> [  +0.001550] ---[ end trace ff90849fe0a8b3b4 ]---
> [  +0.024953] RIP: 0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]
> [  +0.001814] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f 3f 75 ae 0f 1f 44 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75 0d 4c 03 a3 a0 09 00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01 00 4c 89 f7 e8 a2 4c 2e ca 85
> [  +0.003255] RSP: 0018:ffffb58fac313928 EFLAGS: 00010202
> [  +0.001641] RAX: ffffffffc09a4270 RBX: ffff8b0c9c840000 RCX: 00000000ffffffff
> [  +0.001656] RDX: 0000000000000000 RSI: 000000000001629a RDI: ffff8b0c9c840000
> [  +0.001681] RBP: ffffb58fac313948 R08: 0000000000000021 R09: 0000000000000001
> [  +0.001662] R10: ffffb58fac313b30 R11: ffffffff8c065b00 R12: 0000000000058a68
> [  +0.001650] R13: 000000000001629a R14: 0000000000000000 R15: 000000000001629a
> [  +0.001648] FS:  0000000000000000(0000) GS:ffff8b4b7fa80000(0000) knlGS:0000000000000000
> [  +0.001668] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  +0.001673] CR2: 0000000000058a68 CR3: 000000010a2c8001 CR4: 00000000001706e0
> [  +0.001740] Fixing recursive fault but reboot is needed!
>
>> On Apr 21, 2022, at 2:41 AM, Andrey Grodzovsky wrote:
>>
>> I retested hot plug tests at the commit I mentioned below - looks
>> ok. My ASIC is Navi 10; I also tested using Vega 10 and older Polaris
>> ASICs (whatever I had at home at the time). It's possible there are
>> extra issues in ASICs like yours which I didn't cover during tests.
>>
>> andrey@andrey-test:~/drm$ sudo ./build/tests/amdgpu/amdgpu_test -s 13
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>
>> The ASIC NOT support UVD, suite disabled
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>
>> The ASIC NOT support VCE, suite disabled
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>
>> The ASIC NOT support UVD ENC, suite disabled.
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>
>> Don't support TMZ (trust memory zone), security suite disabled
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> Peer device is not opened or has ASIC not supported by the suite,
>> skip all Peer to Peer tests.
>>
>>      CUnit - A unit testing framework for C - Version 2.1-3
>>      http://cunit.sourceforge.net/
>>
>> Suite: Hotunplug Tests
>>   Test: Unplug card and rescan the bus to plug it back
>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> passed
>>   Test: Same as first test but with command submission
>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> passed
>>   Test: Unplug with exported bo
>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> passed
>>
>> Run Summary:    Type  Total    Ran Passed Failed Inactive
>>               suites     14      1    n/a      0        0
>>                tests     71      3      3      0        1
>>              asserts     21     21     21      0      n/a
>>
>> Elapsed time =    9.195 seconds
>>
>> Andrey
>>
>> On 2022-04-20 11:44, Andrey Grodzovsky wrote:
>>>
>>> The only one on Radeon 7 I see is the same sysfs crash we already
>>> fixed, so you can use the same fix. The MI 200 issue I haven't seen
>>> yet, but I also haven't tested MI200, so I never saw it before. Need
>>> to test when I get the time.
>>>
>>> So try that fix with Radeon 7 again to see if you pass the tests
>>> (the warnings should all be minor issues).
>>>
>>> Andrey
>>>
>>> On 2022-04-20 05:24, Shuotao Xu wrote:
>>>>>
>>>>> That's a problem. The latest working baseline I tested and
>>>>> confirmed passing hotplug tests is this branch and commit
>>>>> https://gitlab.freedesktop.org/agd5f/linux/-/commit/86e12a53b73135806e101142e72f3f1c0e6fa8e6
>>>>> which is amd-staging-drm-next. 5.14 was the branch we upstreamed
>>>>> the hotplug code on, but it had a lot of regressions over time due
>>>>> to new changes (that's why I added the hotplug test to try and
>>>>> catch them early).
>>>>> It would be best to run this branch on MI-100 so we have a clean
>>>>> baseline, and only after confirming that this particular branch at
>>>>> this commit passes the libdrm tests should you start adding the
>>>>> KFD specific addons. Another option if you can't work with MI-100
>>>>> and this branch is to try a different ASIC that does work with
>>>>> this branch (if possible).
>>>>>
>>>>> Andrey
>>>>>
>>>> OK, I tried both this commit and the HEAD of amd-staging-drm-next
>>>> on two GPUs (MI100 and Radeon VII); both did not pass the hot
>>>> plug-out libdrm test. I might be able to gain access to MI200, but
>>>> I suspect it would not work.
>>>>
>>>> I copied the complete dmesgs as follows. I highlighted the OOPSES
>>>> for you.
>>>>
>>>> Radeon VII:
>