From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
To: Shuotao Xu <shuotaoxu@microsoft.com>
Cc: "Mukul.Joshi@amd.com" <Mukul.Joshi@amd.com>,
	"Kuehling, Felix" <Felix.Kuehling@amd.com>,
	Peng Cheng <pengc@microsoft.com>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
	Lei Qu <Lei.Qu@microsoft.com>, Ran Shu <Ran.Shu@microsoft.com>,
	Ziyue Yang <Ziyue.Yang@microsoft.com>
Subject: Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
Date: Wed, 27 Apr 2022 12:04:03 -0400
Message-ID: <66bf32d5-1636-ecdd-8a49-24c6254079bf@amd.com>
In-Reply-To: <FF40C1DB-326C-45F5-9B59-14C39E17359D@microsoft.com>

On 2022-04-27 05:20, Shuotao Xu wrote:

> Hi Andrey,
>
> Sorry that I did not have time to work on this for a few days.
>
> I just tried the sysfs crash fix on Radeon VII and it seems that it 
> worked. It did not pass the last hotplug test, but my version has 4 
> tests instead of 3 in your case.


That's because the 4th one is only enabled when there are 2 cards in the 
system - to test DRI_PRIME export. I tested this time with only one card.

>
>
> Suite: Hotunplug Tests
>   Test: Unplug card and rescan the bus to plug it back 
> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
> passed
>   Test: Same as first test but with command submission 
> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
> passed
>   Test: Unplug with exported bo .../usr/local/share/libdrm/amdgpu.ids: 
> No such file or directory
> passed
>   Test: Unplug with exported fence 
> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
> amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)


On the kernel side, the IOCTL returning this is drm_getclient - maybe 
take a look at why it can't find the client? I didn't have such an issue, 
as far as I remember, when testing.


> FAILED
>     1. ../tests/amdgpu/hotunplug_tests.c:368  - CU_ASSERT_EQUAL(r,0)
>     2. ../tests/amdgpu/hotunplug_tests.c:411  - 
> CU_ASSERT_EQUAL(amdgpu_cs_import_syncobj(device2, shared_fd, 
> &sync_obj_handle2),0)
>     3. ../tests/amdgpu/hotunplug_tests.c:423  - 
> CU_ASSERT_EQUAL(amdgpu_cs_syncobj_wait(device2, &sync_obj_handle2, 1, 
> 100000000, 0, NULL),0)
>     4. ../tests/amdgpu/hotunplug_tests.c:425  - 
> CU_ASSERT_EQUAL(amdgpu_cs_destroy_syncobj(device2, sync_obj_handle2),0)
>
> Run Summary:    Type  Total    Ran Passed Failed Inactive
>               suites     14      1    n/a      0      0
>                tests     71      4      3      1      0
>              asserts     39     39     35      4    n/a
>
> Elapsed time =   17.321 seconds
>
> For kfd compute, there is some problem which I did not see in MI100 
> after I killed the hung application after hot plugout. I was using 
> rocm5.0.2 driver for MI100 card, and not sure if it is a regression 
> from the newer driver.
> After pkill, one of the children of the user process would be stuck in 
> zombie state (Z), understandably because of the bug, and future ROCm 
> applications after plug-back would be in uninterruptible sleep (D) 
> because they would not return from the syscall to kfd.
>
> Although drm test for amdgpu would run just fine without issues after 
> plug-back with dangling kfd state.


It is not clear to me when the crash below happens. Is it related to what 
you describe above?


>
> I don’t know if there is a quick fix to it. I was thinking of adding 
> drm_dev_enter/drm_dev_exit to amdgpu_device_rreg.


Try adding a drm_dev_enter/drm_dev_exit pair at the highest level that 
attempts to access HW - in this case it's amdgpu_amdkfd_set_compute_idle. 
We always try to avoid calling any HW-accessing functions after the 
backing device is gone.
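
As a rough illustration of that guard pattern (a minimal userspace sketch, 
not the actual amdgpu change): in the kernel the pair is 
drm_dev_enter(dev, &idx) / drm_dev_exit(idx), and drm_dev_enter returns 
false once the device has been unplugged. All mock_* names below are 
hypothetical stand-ins so the sketch builds outside the kernel; the only 
point is that the check sits at the top-level entry point, so nothing 
below it reads registers after the device is gone.

```c
/* Userspace model of the drm_dev_enter()/drm_dev_exit() guard.
 * Every mock_* name is a hypothetical stand-in, not kernel code. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct mock_device {
    bool unplugged;            /* set once the device is hot-unplugged */
    volatile uint32_t *rmmio;  /* register mapping; NULL after unplug */
    unsigned int hw_accesses;  /* register reads that reached "HW" */
};

/* Stand-in for drm_dev_enter(): refuses entry once the device is gone. */
static bool mock_dev_enter(struct mock_device *dev, int *idx)
{
    if (dev->unplugged)
        return false;
    *idx = 0;  /* the kernel hands back an SRCU read-side index here */
    return true;
}

/* Stand-in for drm_dev_exit(): closes the section opened above. */
static void mock_dev_exit(int idx)
{
    (void)idx;
}

/* Stand-in for the low-level register read at the bottom of the trace. */
static uint32_t mock_rreg(struct mock_device *dev, uint32_t reg)
{
    dev->hw_accesses++;
    return dev->rmmio[reg];  /* would fault if rmmio were NULL */
}

/* Stand-in for the power profile switch, which ends in register reads. */
static void mock_switch_power_profile(struct mock_device *dev, bool enable)
{
    (void)enable;
    (void)mock_rreg(dev, 0);
}

/* Guard placed at the highest level that leads to HW access. */
static void mock_set_compute_idle(struct mock_device *dev, bool idle)
{
    int idx;

    if (!mock_dev_enter(dev, &idx))
        return;  /* device is gone: skip the whole HW path */

    mock_switch_power_profile(dev, !idle);
    mock_dev_exit(idx);
}

/* Self-check: one register read while present, none after unplug. */
static void mock_demo(void)
{
    uint32_t regs[1] = { 0 };
    struct mock_device dev = {
        .unplugged = false, .rmmio = regs, .hw_accesses = 0,
    };

    mock_set_compute_idle(&dev, true);
    assert(dev.hw_accesses == 1);

    dev.unplugged = true;  /* simulate hot-unplug */
    dev.rmmio = NULL;
    mock_set_compute_idle(&dev, true);
    assert(dev.hw_accesses == 1);  /* guard kept us away from rmmio */
}
```

Calling mock_demo() from a main() exercises both cases. Guarding at the 
top of the chain rather than inside the register read means the 
intermediate steps in the trace (sending the SMC message, polling for the 
response) never start against a device that is already gone.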


> Also this has been a long time in my attempt to fix hotplug issue for 
> kfd application.
> I don’t know 1) if I would be able to get to MI100 (fixing Radeon VII 
> would mean something but MI100 is more important for us); 2) what 
> direction the patch for this issue will take going forward.


I will go to the office tomorrow to pick up an MI-100. Time and priorities 
permitting, I will then try to test it and fix any bugs so that it 
passes all hot plug libdrm tests at the tip of public 
amd-staging-drm-next - https://gitlab.freedesktop.org/agd5f/linux. After 
that you can continue working on ROCm enabling on top of that.

For now I suggest you move on with Radeon 7 as your development 
ASIC and use the fix I mentioned above.

Andrey


>
> Regards,
> Shuotao
>
> [  +0.001645] BUG: unable to handle page fault for address: 
> 0000000000058a68
> [  +0.001298] #PF: supervisor read access in kernel mode
> [  +0.001252] #PF: error_code(0x0000) - not-present page
> [  +0.001248] PGD 8000000115806067 P4D 8000000115806067 PUD 109b2d067 
> PMD 0
> [  +0.001270] Oops: 0000 [#1] PREEMPT SMP PTI
> [  +0.001256] CPU: 5 PID: 13818 Comm: tf_cnn_benchmar Tainted: G       
>  W   E     5.16.0+ #3
> [  +0.001290] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 
> 1.5.4 [FPGA Test BIOS] 10/002/2015
> [  +0.001309] RIP: 0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]
> [  +0.001562] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f 3f 75 ae 0f 1f 44 
> 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75 0d 4c 03 a3 a0 09 
> 00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01 00 4c 89 f7 e8 a2 4c 2e ca 85
> [  +0.002751] RSP: 0018:ffffb58fac313928 EFLAGS: 00010202
> [  +0.001388] RAX: ffffffffc09a4270 RBX: ffff8b0c9c840000 RCX: 
> 00000000ffffffff
> [  +0.001402] RDX: 0000000000000000 RSI: 000000000001629a RDI: 
> ffff8b0c9c840000
> [  +0.001418] RBP: ffffb58fac313948 R08: 0000000000000021 R09: 
> 0000000000000001
> [  +0.001421] R10: ffffb58fac313b30 R11: ffffffff8c065b00 R12: 
> 0000000000058a68
> [  +0.001400] R13: 000000000001629a R14: 0000000000000000 R15: 
> 000000000001629a
> [  +0.001397] FS:  0000000000000000(0000) GS:ffff8b4b7fa80000(0000) 
> knlGS:0000000000000000
> [  +0.001411] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  +0.001405] CR2: 0000000000058a68 CR3: 000000010a2c8001 CR4: 
> 00000000001706e0
> [  +0.001422] Call Trace:
> [  +0.001407]  <TASK>
> [  +0.001391]  amdgpu_device_rreg+0x17/0x20 [amdgpu]
> [  +0.001614]  amdgpu_cgs_read_register+0x14/0x20 [amdgpu]
> [  +0.001735]  phm_wait_for_register_unequal.part.1+0x58/0x90 [amdgpu]
> [  +0.001790]  phm_wait_for_register_unequal+0x1a/0x30 [amdgpu]
> [  +0.001800]  vega20_wait_for_response+0x28/0x80 [amdgpu]
> [  +0.001757]  vega20_send_msg_to_smc_with_parameter+0x21/0x110 [amdgpu]
> [  +0.001838]  smum_send_msg_to_smc_with_parameter+0xcd/0x100 [amdgpu]
> [  +0.001829]  ? kvfree+0x1e/0x30
> [  +0.001462]  vega20_set_power_profile_mode+0x58/0x330 [amdgpu]
> [  +0.001868]  ? kvfree+0x1e/0x30
> [  +0.001462]  ? ttm_bo_release+0x261/0x370 [ttm]
> [  +0.001467]  pp_dpm_switch_power_profile+0xc2/0x170 [amdgpu]
> [  +0.001863]  amdgpu_dpm_switch_power_profile+0x6b/0x90 [amdgpu]
> [  +0.001866]  amdgpu_amdkfd_set_compute_idle+0x1a/0x20 [amdgpu]
> [  +0.001784]  kfd_dec_compute_active+0x2c/0x50 [amdgpu]
> [  +0.001744]  process_termination_cpsch+0x2f9/0x3a0 [amdgpu]
> [  +0.001728]  kfd_process_dequeue_from_all_devices+0x49/0x70 [amdgpu]
> [  +0.001730]  kfd_process_notifier_release+0x91/0xe0 [amdgpu]
> [  +0.001718]  __mmu_notifier_release+0x77/0x1f0
> [  +0.001411]  exit_mmap+0x1b5/0x200
> [  +0.001396]  ? __switch_to+0x12d/0x3e0
> [  +0.001388]  ? __switch_to_asm+0x36/0x70
> [  +0.001372]  ? preempt_count_add+0x74/0xc0
> [  +0.001364]  mmput+0x57/0x110
> [  +0.001349]  do_exit+0x33d/0xc20
> [  +0.001337]  ? _raw_spin_unlock+0x1a/0x30
> [  +0.001346]  do_group_exit+0x43/0xa0
> [  +0.001341]  get_signal+0x131/0x920
> [  +0.001295]  arch_do_signal_or_restart+0xb1/0x870
> [  +0.001303]  ? do_futex+0x125/0x190
> [  +0.001285]  exit_to_user_mode_prepare+0xb1/0x1c0
> [  +0.001282]  syscall_exit_to_user_mode+0x2a/0x40
> [  +0.001264]  do_syscall_64+0x46/0xb0
> [  +0.001236]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [  +0.001219] RIP: 0033:0x7f6aff1d2ad3
> [  +0.001177] Code: Unable to access opcode bytes at RIP 0x7f6aff1d2aa9.
> [  +0.001166] RSP: 002b:00007f6ab2029d20 EFLAGS: 00000246 ORIG_RAX: 
> 00000000000000ca
> [  +0.001170] RAX: fffffffffffffe00 RBX: 0000000004f542b0 RCX: 
> 00007f6aff1d2ad3
> [  +0.001168] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 
> 0000000004f542d8
> [  +0.001162] RBP: 0000000004f542d4 R08: 0000000000000000 R09: 
> 0000000000000000
> [  +0.001152] R10: 0000000000000000 R11: 0000000000000246 R12: 
> 0000000004f542d8
> [  +0.001176] R13: 0000000000000000 R14: 0000000004f54288 R15: 
> 0000000000000000
> [  +0.001152]  </TASK>
> [  +0.001113] Modules linked in: veth amdgpu(E) nf_conntrack_netlink 
> nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM 
> iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack 
> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 
> xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter 
> ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload esp4 
> xfrm_algo intel_rapl_msr intel_rapl_common sb_edac 
> x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_hdmi snd_hda_intel 
> ipmi_ssif snd_intel_dspcfg coretemp snd_hda_codec kvm_intel 
> snd_hda_core snd_hwdep snd_pcm snd_timer snd kvm soundcore irqbypass 
> ftdi_sio usbserial input_leds iTCO_wdt iTCO_vendor_support joydev 
> mei_me rapl lpc_ich intel_cstate mei ipmi_si ipmi_devintf 
> ipmi_msghandler mac_hid acpi_power_meter sch_fq_codel ib_iser rdma_cm 
> iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi 
> scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic 
> zstd_compress raid10 raid456
> [  +0.000102]  async_raid6_recov async_memcpy async_pq async_xor 
> async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear iommu_v2 
> gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper drm_kms_helper 
> syscopyarea sysfillrect sysimgblt fb_sys_fops crct10dif_pclmul 
> hid_generic crc32_pclmul ghash_clmulni_intel usbhid uas aesni_intel 
> crypto_simd igb ahci hid drm usb_storage cryptd libahci dca 
> megaraid_sas i2c_algo_bit wmi [last unloaded: amdgpu]
> [  +0.016626] CR2: 0000000000058a68
> [  +0.001550] ---[ end trace ff90849fe0a8b3b4 ]---
> [  +0.024953] RIP: 0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]
> [  +0.001814] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f 3f 75 ae 0f 1f 44 
> 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75 0d 4c 03 a3 a0 09 
> 00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01 00 4c 89 f7 e8 a2 4c 2e ca 85
> [  +0.003255] RSP: 0018:ffffb58fac313928 EFLAGS: 00010202
> [  +0.001641] RAX: ffffffffc09a4270 RBX: ffff8b0c9c840000 RCX: 
> 00000000ffffffff
> [  +0.001656] RDX: 0000000000000000 RSI: 000000000001629a RDI: 
> ffff8b0c9c840000
> [  +0.001681] RBP: ffffb58fac313948 R08: 0000000000000021 R09: 
> 0000000000000001
> [  +0.001662] R10: ffffb58fac313b30 R11: ffffffff8c065b00 R12: 
> 0000000000058a68
> [  +0.001650] R13: 000000000001629a R14: 0000000000000000 R15: 
> 000000000001629a
> [  +0.001648] FS:  0000000000000000(0000) GS:ffff8b4b7fa80000(0000) 
> knlGS:0000000000000000
> [  +0.001668] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  +0.001673] CR2: 0000000000058a68 CR3: 000000010a2c8001 CR4: 
> 00000000001706e0
> [  +0.001740] Fixing recursive fault but reboot is needed!
>
>
>> On Apr 21, 2022, at 2:41 AM, Andrey Grodzovsky 
>> <andrey.grodzovsky@amd.com> wrote:
>>
>> I retested hot plug tests at the commit I mentioned below - looks 
>> ok, my ASIC is Navi 10, I also tested using Vega 10 and older Polaris 
>> ASICs (whatever I had at home at the time). It's possible there are 
>> extra issues in ASICs like yours which I didn't cover during tests.
>>
>> andrey@andrey-test:~/drm$ sudo ./build/tests/amdgpu/amdgpu_test -s 13
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>
>>
>> The ASIC NOT support UVD, suite disabled
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>
>>
>> The ASIC NOT support VCE, suite disabled
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>
>>
>> The ASIC NOT support UVD ENC, suite disabled.
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>>
>>
>> Don't support TMZ (trust memory zone), security suite disabled
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> /usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> Peer device is not opened or has ASIC not supported by the suite, 
>> skip all Peer to Peer tests.
>>
>>
>>      CUnit - A unit testing framework for C - Version 2.1-3
>> http://cunit.sourceforge.net/
>>
>>
>> Suite: Hotunplug Tests
>>   Test: Unplug card and rescan the bus to plug it back 
>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> passed
>>   Test: Same as first test but with command submission 
>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> passed
>>   Test: Unplug with exported bo 
>> .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
>> passed
>>
>> Run Summary:    Type  Total    Ran Passed Failed Inactive
>>               suites     14      1    n/a      0        0
>>                tests     71      3      3      0        1
>>              asserts     21     21     21      0      n/a
>>
>> Elapsed time =    9.195 seconds
>>
>>
>> Andrey
>>
>> On 2022-04-20 11:44, Andrey Grodzovsky wrote:
>>>
>>> The only one in Radeon 7 I see is the same sysfs crash we already 
>>> fixed so you can use the same fix. The MI 200 issue i haven't seen 
>>> yet but I also haven't tested MI200 so never saw it before. Need to 
>>> test when i get the time.
>>>
>>> So try that fix with Radeon 7 again to see if you pass the tests 
>>> (the warnings should all be minor issues).
>>>
>>> Andrey
>>>
>>>
>>> On 2022-04-20 05:24, Shuotao Xu wrote:
>>>>>
>>>>> That's a problem. The latest working baseline I tested and confirmed 
>>>>> passing hotplug tests is this branch and commit 
>>>>> https://gitlab.freedesktop.org/agd5f/linux/-/commit/86e12a53b73135806e101142e72f3f1c0e6fa8e6 
>>>>> which is amd-staging-drm-next. 5.14 was the branch where we upstreamed 
>>>>> the hotplug code, but it had a lot of regressions over time due to 
>>>>> new changes (that's why I added the hotplug test to try and catch 
>>>>> them early). It would be best to run this branch on MI-100 so we 
>>>>> have a clean baseline, and only after confirming that this particular 
>>>>> branch at this commit passes the libdrm tests, start 
>>>>> adding the KFD-specific add-ons. Another option, if you can't work 
>>>>> with MI-100 and this branch, is to try a different ASIC that does 
>>>>> work with this branch (if possible).
>>>>>
>>>>> Andrey
>>>>>
>>>> OK I tried both this commit and the HEAD of amd-staging-drm-next on 
>>>> two GPUs (MI100 and Radeon VII); both did not pass the hotplugout 
>>>> libdrm test. I might be able to gain access to MI200, but I suspect it 
>>>> would work.
>>>>
>>>> I copied the complete dmesgs as follows. I highlighted the OOPSES 
>>>> for you.
>>>>
>>>> Radeon VII:
>
