On 2022-04-14 10:00, Shuotao Xu wrote:
>
>
>> On Apr 14, 2022, at 1:31 AM, Andrey Grodzovsky
>> <andrey.grodzovsky@amd.com> wrote:
>>
>>
>> On 2022-04-13 12:03, Shuotao Xu wrote:
>>>
>>>
>>>> On Apr 11, 2022, at 11:52 PM, Andrey Grodzovsky
>>>> <andrey.grodzovsky@amd.com> wrote:
>>>>
>>>> On 2022-04-08 21:28, Shuotao Xu wrote:
>>>>>
>>>>>> On Apr 8, 2022, at 11:28 PM, Andrey Grodzovsky
>>>>>> <andrey.grodzovsky@amd.com> wrote:
>>>>>>
>>>>>> On 2022-04-08 04:45, Shuotao Xu wrote:
>>>>>>> Adding PCIe Hotplug Support for AMDKFD: support for hot-plug of
>>>>>>> GPU devices can open doors for many advanced applications in the
>>>>>>> data center in the next few years, such as GPU resource
>>>>>>> disaggregation. The current AMDKFD does not support hot-plug out
>>>>>>> because of the following reasons:
>>>>>>>
>>>>>>> 1. During PCIe removal, decrement the KFD lock which was
>>>>>>> incremented at the beginning of hw fini; otherwise kfd_open later
>>>>>>> is going to fail.
>>>>>> I assumed you read my comment last time; still you take the same
>>>>>> approach. More details below.
>>>>> Aha, I like your fix :) I was not familiar with the drm APIs so I
>>>>> only half understood your comment last time.
>>>>>
>>>>> BTW, I tried hot-plugging out a GPU while a ROCm application was
>>>>> still running. From dmesg, the application was still trying to
>>>>> access the removed kfd device, and was met with some errors.
>>>>
>>>>
>>>> The application is supposed to keep running; it holds the drm_device
>>>> reference as long as it has an open FD to the device, and the final
>>>> cleanup will come only after the app dies, thus releasing the FD and
>>>> the last drm_device reference.
>>>>
>>>>> The application would hang and not exit in this case.
>>>>
>>>
>>> Actually I tried kill -7 $pid, and the process exits. The dmesg has
>>> some warnings though.
>>>
>>> [  711.769977] WARNING: CPU: 23 PID: 344 at .../amdgpu-rocm5.0.2/src/amd/amdgpu/amdgpu_object.c:1336 amdgpu_bo_release_notify+0x150/0x160 [amdgpu]
>>> [  711.770528] Modules linked in: amdgpu(OE) amdttm(OE) amd_sched(OE) amdkcl(OE) iommu_v2 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay binfmt_misc intel_rapl_msr i40iw intel_rapl_common skx_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel rpcrdma kvm sunrpc ipmi_ssif ib_umad ib_ipoib rdma_ucm irqbypass rapl joydev acpi_ipmi input_leds intel_cstate ipmi_si ipmi_devintf mei_me mei intel_pch_thermal ipmi_msghandler ioatdma mac_hid lpc_ich dca acpi_power_meter acpi_pad sch_fq_codel ib_iser rdma_cm iw_cm ib_cm iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi pci_stub ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor
>>> [  711.779359]  raid6_pq libcrc32c raid1 raid0 multipath linear mlx5_ib ib_uverbs ib_core ast drm_vram_helper i2c_algo_bit drm_ttm_helper ttm drm_kms_helper syscopyarea crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sysfillrect uas hid_generic sysimgblt aesni_intel mlx5_core fb_sys_fops crypto_simd usbhid cryptd drm i40e pci_hyperv_intf usb_storage glue_helper mlxfw hid ahci libahci wmi
>>> [  711.779752] CPU: 23 PID: 344 Comm: kworker/23:1 Tainted: G        W  OE     5.11.0+ #1
>>> [  711.779755] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018
>>> [  711.779756] Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu]
>>> [  711.779955] RIP: 0010:amdgpu_bo_release_notify+0x150/0x160 [amdgpu]
>>> [  711.780141] Code: e8 b5 af 34 f4 e9 1f ff ff ff 48 39 c2 74 07 0f 0b e9 69 ff ff ff 4c 89 e7 e8 3c b4 16 00 e9 5c ff ff ff e8 a2 ce fd f3 eb cf <0f> 0b eb cb e8 d7 02 34 f4 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55
>>> [  711.780143] RSP: 0018:ffffa8100dd67c30 EFLAGS: 00010282
>>> [  711.780145] RAX: 00000000ffffffea RBX: ffff89980e792058 RCX: 0000000000000000
>>> [  711.780147] RDX: 0000000000000000 RSI: ffff89a8f9ad8870 RDI: ffff89a8f9ad8870
>>> [  711.780148] RBP: ffffa8100dd67c50 R08: 0000000000000000 R09: fffffffffff99b18
>>> [  711.780149] R10: ffffa8100dd67bd0 R11: ffffa8100dd67908 R12: ffff89980e792000
>>> [  711.780151] R13: ffff89980e792058 R14: ffff89980e7921bc R15: dead000000000100
>>> [  711.780152] FS:  0000000000000000(0000) GS:ffff89a8f9ac0000(0000) knlGS:0000000000000000
>>> [  711.780154] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [  711.780156] CR2: 00007ffddac6f71f CR3: 00000030bb80a003 CR4: 00000000007706e0
>>> [  711.780157] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [  711.780159] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [  711.780160] PKRU: 55555554
>>> [  711.780161] Call Trace:
>>> [  711.780163]  ttm_bo_release+0x2ae/0x320 [amdttm]
>>> [  711.780169]  amdttm_bo_put+0x30/0x40 [amdttm]
>>> [  711.780357]  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
>>> [  711.780543]  amdgpu_gem_object_free+0x8b/0x160 [amdgpu]
>>> [  711.781119]  drm_gem_object_free+0x1d/0x30 [drm]
>>> [  711.781489]  amdgpu_amdkfd_gpuvm_free_memory_of_gpu+0x34a/0x380 [amdgpu]
>>> [  711.782044]  kfd_process_device_free_bos+0xe0/0x130 [amdgpu]
>>> [  711.782611]  kfd_process_wq_release+0x286/0x380 [amdgpu]
>>> [  711.783172]  process_one_work+0x236/0x420
>>> [  711.783543]  worker_thread+0x34/0x400
>>> [  711.783911]  ? process_one_work+0x420/0x420
>>> [  711.784279]  kthread+0x126/0x140
>>> [  711.784653]  ? kthread_park+0x90/0x90
>>> [  711.785018]  ret_from_fork+0x22/0x30
>>> [  711.785387] ---[ end trace d8f50f6594817c84 ]---
>>> [  711.798716] [drm] amdgpu: ttm finalized
>>
>>
>> So it means the process was stuck in some wait_event_killable (maybe
>> here in drm_sched_entity_flush) - you can try
>> 'cat /proc/$process_pid/stack' before you kill it to see where it was
>> stuck so we can go from there.
>>
>>
>>>
>>>>
>>>> For graphics apps what I usually see is a crash because of a sigsegv
>>>> when the app tries to access an unmapped MMIO region on the device.
>>>> I haven't tested the compute stack and so there might be something I
>>>> haven't covered. A hang could mean, for example, waiting on a fence
>>>> which is not being signaled - please provide the full dmesg from
>>>> this case.
>>>>
>>>>>
>>>>> Do you have any good suggestions on how to fix it down the line?
>>>>> (HIP runtime/libhsakmt or driver)
>>>>>
>>>>> [64036.631333] amdgpu: amdgpu_vm_bo_update failed
>>>>> [64036.631702] amdgpu: validate_invalid_user_pages: update PTE failed
>>>>> [64036.640754] amdgpu: amdgpu_vm_bo_update failed
>>>>> [64036.641120] amdgpu: validate_invalid_user_pages: update PTE failed
>>>>> [64036.650394] amdgpu: amdgpu_vm_bo_update failed
>>>>> [64036.650765] amdgpu: validate_invalid_user_pages: update PTE failed
>>>>
>>>
>>> The full dmesg is just the repetition of those two messages:
>>> [186885.764079] amdgpu 0000:43:00.0: amdgpu: amdgpu: finishing device.
>>> [186885.766916] [drm] free PSP TMR buffer
>>> [186893.868173] amdgpu: amdgpu_vm_bo_update failed
>>> [186893.868235] amdgpu: validate_invalid_user_pages: update PTE failed
>>> [186893.876154] amdgpu: amdgpu_vm_bo_update failed
>>> [186893.876190] amdgpu: validate_invalid_user_pages: update PTE failed
>>> [186893.884150] amdgpu: amdgpu_vm_bo_update failed
>>> [186893.884185] amdgpu: validate_invalid_user_pages: update PTE failed
>>>
>>>>
>>>> This probably just means trying to update PTEs after the physical
>>>> device is gone - we usually avoid this by first trying to do all HW
>>>> shutdowns early, before PCI remove completes, but when that's really
>>>> tricky, by protecting HW access sections with a drm_dev_enter/exit
>>>> scope [a sketch of this pattern appears at the end of this thread].
>>>>
>>>> For this particular error it would be best to flush
>>>> info->restore_userptr_work before the end of amdgpu_pci_remove
>>>> (rejecting new process creation and calling
>>>> cancel_delayed_work_sync(&process_info->restore_userptr_work) for
>>>> all running processes) somewhere in amdgpu_pci_remove.
>>>>
>>> I tried something like *kfd_process_ref_release* which I think does
>>> what you described, but it did not work.
>>
>>
>> I don't see how kfd_process_ref_release is the same as what I
>> mentioned above; what I meant is calling the code above within
>> kgd2kfd_suspend (where you tried to call kfd_kill_all_user_processes
>> below).
>>
> Yes, you are right. It was not called by it.
>>
>>
>>>
>>> Instead I tried to kill the process from the kernel, but the amdgpu
>>> could **only** be hot-plugged back in successfully if there was no
>>> ROCm kernel running when it was plugged out.
>>> If not, amdgpu_probe will just hang later. (Maybe because amdgpu was
>>> plugged out while in the running state, it leaves a bad HW state
>>> which causes probe to hang.)
>>
>>
>> We usually do an asic_reset during probe to reset all HW state (check
>> if amdgpu_device_init->amdgpu_asic_reset is running when you plug
>> back).
>>
> OK
>>
>>
>>>
>>> I don't know if this is a viable solution worth pursuing, but I
>>> attached the diff anyway.
>>>
>>> Another solution could be to let the compute-stack user mode detect a
>>> topology change via a generation_count change, and abort gracefully
>>> there.
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> index 4e7d9cb09a69..79b4c9b84cd0 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> @@ -697,12 +697,15 @@ void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
>>>                 return;
>>>
>>>         /* for runtime suspend, skip locking kfd */
>>> -       if (!run_pm) {
>>> +       if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
>>>                 /* For first KFD device suspend all the KFD processes */
>>>                 if (atomic_inc_return(&kfd_locked) == 1)
>>>                         kfd_suspend_all_processes(force);
>>>         }
>>>
>>> +       if (drm_dev_is_unplugged(kfd->ddev))
>>> +               kfd_kill_all_user_processes();
>>> +
>>>         kfd->dqm->ops.stop(kfd->dqm);
>>>         kfd_iommu_suspend(kfd);
>>>  }
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>> index 55c9e1922714..84cbcd857856 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>> @@ -1065,6 +1065,7 @@ void kfd_unref_process(struct kfd_process *p);
>>>  int kfd_process_evict_queues(struct kfd_process *p, bool force);
>>>  int kfd_process_restore_queues(struct kfd_process *p);
>>>  void kfd_suspend_all_processes(bool force);
>>> +void kfd_kill_all_user_processes(void);
>>>  /*
>>>   * kfd_resume_all_processes:
>>>   *     bool sync: If kfd_resume_all_processes() should wait for the
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>> index 6cdc855abb6d..fb0c753b682c 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>> @@ -2206,6 +2206,24 @@ void kfd_suspend_all_processes(bool force)
>>>         srcu_read_unlock(&kfd_processes_srcu, idx);
>>>  }
>>>
>>> +
>>> +void kfd_kill_all_user_processes(void)
>>> +{
>>> +       struct kfd_process *p;
>>> +       struct amdkfd_process_info *p_info;
>>> +       unsigned int temp;
>>> +       int idx = srcu_read_lock(&kfd_processes_srcu);
>>> +
>>> +       pr_info("Killing all processes\n");
>>> +       hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
>>> +               p_info = p->kgd_process_info;
>>> +               pr_info("Killing processes, pid = %d",
>>> +                       pid_nr(p_info->pid));
>>> +               kill_pid(p_info->pid, SIGBUS, 1);
>>
>>
>> From looking into kill_pid I see it only sends a signal but doesn't
>> wait for completion; it would make sense to wait for completion here.
>> In any case I would actually try to put here
>>
> I have made a version which does that with some atomic counters.
> Please read later in the diff.
>>
>>
>> hash_for_each_rcu(p_info)
>>     cancel_delayed_work_sync(&p_info->restore_userptr_work)
>>
>> instead; at least that's what I meant in the previous mail.
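[A minimal, untested sketch of the flush loop Andrey suggests above,
expanding his two pseudocode lines. The helper name is hypothetical;
the iteration over kfd_processes_table and the p->kgd_process_info /
restore_userptr_work members follow the diffs quoted in this thread.]

/* Hypothetical helper for kfd_process.c: flush the userptr-restore
 * delayed work of every live KFD process so no PTE updates race with
 * PCI remove. Assumes new process creation is already being rejected.
 */
static void kfd_flush_all_restore_userptr_work(void)
{
        struct kfd_process *p;
        struct amdkfd_process_info *p_info;
        unsigned int temp;
        int idx = srcu_read_lock(&kfd_processes_srcu);

        hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
                p_info = p->kgd_process_info;
                /* Sleeps until any queued or running restore work item
                 * finishes; sleeping is legal inside an SRCU read-side
                 * section. */
                cancel_delayed_work_sync(&p_info->restore_userptr_work);
        }

        srcu_read_unlock(&kfd_processes_srcu, idx);
}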
> I actually tried that earlier, and it did not work. The application
> still keeps running, and you have to send a kill to the user process.
>
> I have made the following version. It waits for processes to terminate
> synchronously after sending SIGBUS. After that it does the real work
> of amdgpu_pci_remove.
> However, it hangs at amdgpu_device_ip_fini_early when it is trying to
> deinit ip_block 6
> (https://gitlab.freedesktop.org/agd5f/linux/-/blob/amd-staging-drm-next/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L2818).
> I assume that there is still some in-flight DMA, and the fini of this
> ip block therefore hangs?
>
> The following is an excerpt of the dmesg; please excuse my own pr_info
> lines, but I hope you get my point about where it hangs.
>
> [  392.344735] amdgpu: all processes has been fully released
> [  392.346557] amdgpu: amdgpu_acpi_fini done
> [  392.346568] amdgpu 0000:b3:00.0: amdgpu: amdgpu: finishing device.
> [  392.349238] amdgpu: amdgpu_device_ip_fini_early enter ip_blocks = 9
> [  392.349248] amdgpu: Free mem_obj = 000000007bf54275, range_start = 14, range_end = 14
> [  392.350299] amdgpu: Free mem_obj = 00000000a85bc878, range_start = 12, range_end = 12
> [  392.350304] amdgpu: Free mem_obj = 00000000b8019e32, range_start = 13, range_end = 13
> [  392.350308] amdgpu: Free mem_obj = 000000002d296168, range_start = 4, range_end = 11
> [  392.350313] amdgpu: Free mem_obj = 000000001fc4f934, range_start = 0, range_end = 3
> [  392.350322] amdgpu: amdgpu_amdkfd_suspend(adev, false) done
> [  392.350672] amdgpu: hw_fini of IP block[8] done 0
> [  392.350679] amdgpu: hw_fini of IP block[7] done 0

I just remembered that the idea to actively kill and wait for running
user processes during unplug was rejected as a bad idea in the first
iteration of the unplug work I did (I don't remember why now, I need to
look), so this is a no-go.

Our policy is to let zombie processes (zombie in the sense that the
underlying device is gone) live as long as they want (as long as you
are able to terminate them - which you can, so that's OK), while the
system should finish PCI remove gracefully and be able to hot-plug the
device back. Hence, I suggest dropping this direction of forcing all
user processes to be killed: confirm you have a graceful shutdown and
removal of the device from the PCI topology, and then concentrate on
why it hangs when you plug it back.

First, confirm whether an ASIC reset happens on the next init. Second,
please confirm whether the timing of manually killing the user process
has an impact on whether you get a hang on the next plug back (does it
make a difference whether you kill before or after plugging back?).
Andrey

>
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index 8fa9b86ac9d2..c0b27f722281 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> @@ -188,6 +188,12 @@ void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
>         kgd2kfd_interrupt(adev->kfd.dev, ih_ring_entry);
>  }
> +void amdgpu_amdkfd_kill_all_processes(struct amdgpu_device *adev)
> +{
> +       if (adev->kfd.dev)
> +               kgd2kfd_kill_all_user_processes(adev->kfd.dev);
> +}
> +
>  void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm)
>  {
>         if (adev->kfd.dev)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> index 27c74fcec455..f4e485d60442 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> @@ -141,6 +141,7 @@ struct amdkfd_process_info {
>  int amdgpu_amdkfd_init(void);
>  void amdgpu_amdkfd_fini(void);
> +void amdgpu_amdkfd_kill_all_processes(struct amdgpu_device *adev);
>  void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
>  int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev);
>  int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm, bool sync);
> @@ -405,6 +406,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>                          const struct kgd2kfd_shared_resources *gpu_resources);
>  void kgd2kfd_device_exit(struct kfd_dev *kfd);
>  void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force);
> +void kgd2kfd_kill_all_user_processes(struct kfd_dev *kfd);
>  int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
>  int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm, bool sync);
>  int kgd2kfd_pre_reset(struct kfd_dev *kfd);
> @@ -443,6 +445,9 @@ static inline void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
>  {
>  }
> +static inline void kgd2kfd_kill_all_user_processes(struct kfd_dev *kfd)
> +{
> +}
> +
>  static int __maybe_unused kgd2kfd_resume_iommu(struct kfd_dev *kfd)
>  {
>         return 0;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 3d5fc0751829..af6fe5080cfa 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -2101,6 +2101,9 @@ amdgpu_pci_remove(struct pci_dev *pdev)
>  {
>         struct drm_device *dev = pci_get_drvdata(pdev);
> +       /* kill all kfd processes before drm_dev_unplug */
> +       amdgpu_amdkfd_kill_all_processes(drm_to_adev(dev));
> +
>  #ifdef HAVE_DRM_DEV_UNPLUG
>         drm_dev_unplug(dev);
>  #else
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index 5504a18b5a45..480c23bef5e2 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -691,6 +691,12 @@ bool kfd_is_locked(void)
>         return  (atomic_read(&kfd_locked) > 0);
>  }
> +inline void kgd2kfd_kill_all_user_processes(struct kfd_dev *dev)
> +{
> +       kfd_kill_all_user_processes();
> +}
> +
> +
>  void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm, bool force)
>  {
>         if (!kfd->init_complete)
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> index 55c9e1922714..a35a2cb5bb9f 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> @@ -1064,6 +1064,7 @@ static inline struct kfd_process_device *kfd_process_device_from_gpuidx(
>  void kfd_unref_process(struct kfd_process *p);
>  int kfd_process_evict_queues(struct kfd_process *p, bool force);
>  int kfd_process_restore_queues(struct kfd_process *p);
> +void kfd_kill_all_user_processes(void);
>  void kfd_suspend_all_processes(bool force);
>  /*
>   * kfd_resume_all_processes:
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> index 6cdc855abb6d..17e769e6951d 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> @@ -46,6 +46,9 @@ struct mm_struct;
>  #include "kfd_trace.h"
>  #include "kfd_debug.h"
> +static atomic_t kfd_process_locked = ATOMIC_INIT(0);
> +static atomic_t kfd_inflight_kills = ATOMIC_INIT(0);
> +
>  /*
>   * List of struct kfd_process (field kfd_process).
>   * Unique/indexed by mm_struct*
> @@ -802,6 +805,9 @@ struct kfd_process *kfd_create_process(struct task_struct *thread)
>         struct kfd_process *process;
>         int ret;
> +       if (atomic_read(&kfd_process_locked) > 0)
> +               return ERR_PTR(-EINVAL);
> +
>         if (!(thread->mm && mmget_not_zero(thread->mm)))
>                 return ERR_PTR(-EINVAL);
> @@ -1126,6 +1132,10 @@ static void kfd_process_wq_release(struct work_struct *work)
>         put_task_struct(p->lead_thread);
>         kfree(p);
> +
> +       if (atomic_read(&kfd_process_locked) > 0) {
> +               atomic_dec(&kfd_inflight_kills);
> +       }
>  }
>  static void kfd_process_ref_release(struct kref *ref)
> @@ -2186,6 +2196,35 @@ static void restore_process_worker(struct work_struct *work)
>                 pr_err("Failed to restore queues of pasid 0x%x\n", p->pasid);
>  }
> +void kfd_kill_all_user_processes(void)
> +{
> +       struct kfd_process *p;
> +       /* struct amdkfd_process_info *p_info; */
> +       unsigned int temp;
> +       int idx;
> +       atomic_inc(&kfd_process_locked);
> +
> +       idx = srcu_read_lock(&kfd_processes_srcu);
> +       pr_info("Killing all processes\n");
> +       hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
> +               dev_warn(kfd_device,
> +                        "Sending SIGBUS to process %d (pasid 0x%x)",
> +                        p->lead_thread->pid, p->pasid);
> +               send_sig(SIGBUS, p->lead_thread, 0);
> +               atomic_inc(&kfd_inflight_kills);
> +       }
> +       srcu_read_unlock(&kfd_processes_srcu, idx);
> +
> +       while (atomic_read(&kfd_inflight_kills) > 0) {
> +               dev_warn(kfd_device,
> +                        "kfd_processes_table is not empty, going to sleep for 10ms\n");
> +               msleep(10);
> +       }
> +
> +       atomic_dec(&kfd_process_locked);
> +       pr_info("all processes has been fully released");
> +}
> +
>  void kfd_suspend_all_processes(bool force)
>  {
>         struct kfd_process *p;
>
>
>
> Regards,
> Shuotao
>
>> Andrey
>>
>>> +       }
>>> +       srcu_read_unlock(&kfd_processes_srcu, idx);
>>> +}
>>> +
>>> +
>>>  int kfd_resume_all_processes(bool sync)
>>>  {
>>>         struct kfd_process *p;
>>>
>>>
>>>> Andrey
>>>>
>>>>
>>>>>
>>>>> Really appreciate your help!
>>>>>
>>>>> Best,
>>>>> Shuotao
>>>>>
>>>>>>> 2. Remove redundant p2p/io links in sysfs when a device is
>>>>>>> hot-plugged out.
>>>>>>>
>>>>>>> 3. A new kfd node_id is not properly assigned after a new device
>>>>>>> is added once a GPU has been hot-plugged out of a system.
>>>>>>> libhsakmt will find this anomaly (i.e. node_from != in iolinks)
>>>>>>> when taking a topology_snapshot, and thus returns a fault to the
>>>>>>> ROCm stack.
>>>>>>>
>>>>>>> -- This patch fixes issue 1; another patch by Mukul fixes issues
>>>>>>> 2 and 3.
>>>>>>> -- Tested on a 4-GPU MI100 GPU node with kernel 5.13.0-kfd;
>>>>>>> kernel 5.16.0-kfd is unstable out of the box for MI100.
>>>>>>> ---
>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c |  5 +++++
>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  7 +++++++
>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  1 +
>>>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 13 +++++++++++++
>>>>>>>  4 files changed, 26 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>>>> index c18c4be1e4ac..d50011bdb5c4 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>>>> @@ -213,6 +213,11 @@ int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm)
>>>>>>>         return r;
>>>>>>>  }
>>>>>>>
>>>>>>> +int amdgpu_amdkfd_resume_processes(void)
>>>>>>> +{
>>>>>>> +       return kgd2kfd_resume_processes();
>>>>>>> +}
>>>>>>> +
>>>>>>>  int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev)
>>>>>>>  {
>>>>>>>         int r = 0;
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>>>> index f8b9f27adcf5..803306e011c3 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>>>>>>> @@ -140,6 +140,7 @@ void amdgpu_amdkfd_fini(void);
>>>>>>>  void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
>>>>>>>  int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev);
>>>>>>>  int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
>>>>>>> +int amdgpu_amdkfd_resume_processes(void);
>>>>>>>  void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
>>>>>>>                         const void *ih_ring_entry);
>>>>>>>  void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
>>>>>>> @@ -347,6 +348,7 @@ void kgd2kfd_device_exit(struct kfd_dev *kfd);
>>>>>>>  void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
>>>>>>>  int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
>>>>>>>  int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
>>>>>>> +int kgd2kfd_resume_processes(void);
>>>>>>>  int kgd2kfd_pre_reset(struct kfd_dev *kfd);
>>>>>>>  int kgd2kfd_post_reset(struct kfd_dev *kfd);
>>>>>>>  void kgd2kfd_interrupt(struct kfd_dev *kfd, const void *ih_ring_entry);
>>>>>>> @@ -393,6 +395,11 @@ static inline int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
>>>>>>>         return 0;
>>>>>>>  }
>>>>>>>
>>>>>>> +static inline int kgd2kfd_resume_processes(void)
>>>>>>> +{
>>>>>>> +       return 0;
>>>>>>> +}
>>>>>>> +
>>>>>>>  static inline int kgd2kfd_pre_reset(struct kfd_dev *kfd)
>>>>>>>  {
>>>>>>>         return 0;
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> index fa4a9f13c922..5827b65b7489 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> @@ -4004,6 +4004,7 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
>>>>>>>         if (drm_dev_is_unplugged(adev_to_drm(adev)))
>>>>>>>                 amdgpu_device_unmap_mmio(adev);
>>>>>>>
>>>>>>> +       amdgpu_amdkfd_resume_processes();
>>>>>>>  }
>>>>>>>
>>>>>>>  void amdgpu_device_fini_sw(struct amdgpu_device *adev)
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>> index 62aa6c9d5123..ef05aae9255e 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>> @@ -714,6 +714,19 @@ int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm)
>>>>>>>         return ret;
>>>>>>>  }
>>>>>>>
>>>>>>> +/* for non-runtime resume only */
>>>>>>> +int kgd2kfd_resume_processes(void)
>>>>>>> +{
>>>>>>> +       int count;
>>>>>>> +
>>>>>>> +       count = atomic_dec_return(&kfd_locked);
>>>>>>> +       WARN_ONCE(count < 0, "KFD suspend / resume ref. error");
>>>>>>> +       if (count == 0)
>>>>>>> +               return kfd_resume_all_processes();
>>>>>>> +
>>>>>>> +       return 0;
>>>>>>> +}
>>>>>>
>>>>>> It doesn't make sense to me to just increment kfd_locked in
>>>>>> kgd2kfd_suspend only to decrement it again a few functions down
>>>>>> the road.
>>>>>>
>>>>>> I suggest this instead - you only increment if not during PCI
>>>>>> remove:
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>> index 1c2cf3a33c1f..7754f77248a4 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>> @@ -952,11 +952,12 @@ bool kfd_is_locked(void)
>>>>>>
>>>>>>  void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>>>>>>  {
>>>>>> +
>>>>>>         if (!kfd->init_complete)
>>>>>>                 return;
>>>>>>
>>>>>>         /* for runtime suspend, skip locking kfd */
>>>>>> -       if (!run_pm) {
>>>>>> +       if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
>>>>>>                 /* For first KFD device suspend all the KFD processes */
>>>>>>                 if (atomic_inc_return(&kfd_locked) == 1)
>>>>>>                         kfd_suspend_all_processes();
>>>>>>
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>>
>>>>>>> +
>>>>>>>  int kgd2kfd_resume_iommu(struct kfd_dev *kfd)
>>>>>>>  {
>>>>>>>         int err = 0;
>>>
>
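[For reference, a minimal sketch of the drm_dev_enter()/drm_dev_exit()
guard mentioned earlier in this thread for protecting HW access
sections after unplug. drm_dev_enter(), drm_dev_exit() and
adev_to_drm() are the real APIs; the function itself and the register
write are made up for illustration.]

/* Hypothetical example of guarding a HW access path. */
static int example_guarded_wreg(struct amdgpu_device *adev, u32 reg, u32 val)
{
        int idx;

        /* drm_dev_enter() returns false once drm_dev_unplug() has been
         * called, so the MMIO write below is skipped for a device that
         * is already gone. */
        if (!drm_dev_enter(adev_to_drm(adev), &idx))
                return -ENODEV;

        WREG32(reg, val);

        drm_dev_exit(idx);
        return 0;
}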