All of lore.kernel.org
 help / color / mirror / Atom feed
From: shaoyunl <shaoyun.liu@amd.com>
To: Felix Kuehling <felix.kuehling@amd.com>, amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 4/4] drm/amdkfd: Avoid hanging hardware in stop_cpsch
Date: Fri, 20 Dec 2019 17:09:58 -0500	[thread overview]
Message-ID: <5f7208ac-fcec-8f69-44bd-aa288d79482c@amd.com> (raw)
In-Reply-To: <1d54b45e-8e03-bb04-e49a-2d5385547e2a@amd.com>

I agree this patch is a big improvement , I think we need this patch so 
SRIOV can put the  amdkfd_pre_reset in right place as bare metal mode . 
The further improvement can be in separate change .

This serial is reviewed by shaoyun.liu < shaoyun.liu@amd.com>


Regards

shaoyun.liu


On 2019-12-20 2:46 p.m., Felix Kuehling wrote:
> On 2019-12-20 14:31, shaoyunl wrote:
>> Can we use the  dqm_lock when we try to get the dqm->is_hw_hang and 
>> dqm->is_resetting inside function kq_uninitialize ?
>
> Spreading the DQM lock around is probably not a good idea. Then I'd 
> rather do more refactoring to move hqd_load and hqd_destroy out of the 
> kq_init/kq_uninit functions.
>
>
>>
>> I think more closer we check the status  to hqd_destroy it will be  
>> more accurate . It does look better with this logic if the status are 
>> changed after dqm unmap_queue call and  before we call hqd_destroy .
>>
>> Another comment in line.
>>
>> Regards
>>
>> shaoyun.liu
>>
>>
>>
>>
>> On 2019-12-20 11:33 a.m., Felix Kuehling wrote:
>>> dqm->is_hws_hang is protected by the DQM lock. kq_uninitialize runs 
>>> outside that lock protection. Therefore I opted to pass in the 
>>> hanging flag as a parameter. It also keeps the logic that decides 
>>> all of that inside the device queue manager, which I think is cleaner.
>>>
>>> I was trying to clean this up further by moving the 
>>> pm_init/pm_uninit out of the start_cpsch/stop_cpsch sequence, but 
>>> gave up on that idea when I found out that I can't create the kernel 
>>> queue in the DQM initialize function because dev->dqm isn't 
>>> initialized at that time yet.
>>>
>>> Regards,
>>>   Felix
>>>
>>> On 2019-12-20 10:56, shaoyunl wrote:
>>>> Looks like patch 2 is not related to this serial , but anyway .
>>>>
>>>> Patch 1,2,3 are reviewed by shaoyunl <shaoyun.liu@amd.com>
>>>>
>>>> For patch 4 ,  is it possible we directly check dqm->is_hws_hang || 
>>>> dqm->is_resetting  inside function kq_uninitialize.  so we don't 
>>>> need other interface change .
>>>>
>>>> I think even Inside that kq_uninitialize function , we still can 
>>>> get dqm as  kq->dev->dqm .
>>>>
>>>>
>>>> shaoyun.liu
>>>>
>>>>
>>>> On 2019-12-20 3:30 a.m., Felix Kuehling wrote:
>>>>> Don't use the HWS if it's known to be hanging. In a reset also
>>>>> don't try to destroy the HIQ because that may hang on SRIOV if the
>>>>> KIQ is unresponsive.
>>>>>
>>>>> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
>>>>> ---
>>>>>   .../gpu/drm/amd/amdkfd/kfd_device_queue_manager.c    | 12 
>>>>> ++++++++----
>>>>>   drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c        | 8 ++++----
>>>>>   drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c      | 4 ++--
>>>>>   drivers/gpu/drm/amd/amdkfd/kfd_priv.h                | 4 ++--
>>>>>   .../gpu/drm/amd/amdkfd/kfd_process_queue_manager.c   | 2 +-
>>>>>   5 files changed, 17 insertions(+), 13 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>>>> index a7e9ec1b3ce3..d7eb6ac37f62 100644
>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>>>> @@ -946,7 +946,7 @@ static int start_nocpsch(struct 
>>>>> device_queue_manager *dqm)
>>>>>   static int stop_nocpsch(struct device_queue_manager *dqm)
>>>>>   {
>>>>>       if (dqm->dev->device_info->asic_family == CHIP_HAWAII)
>>>>> -        pm_uninit(&dqm->packets);
>>>>> +        pm_uninit(&dqm->packets, false);
>>>>>       dqm->sched_running = false;
>>>>>         return 0;
>>>>> @@ -1114,20 +1114,24 @@ static int start_cpsch(struct 
>>>>> device_queue_manager *dqm)
>>>>>       return 0;
>>>>>   fail_allocate_vidmem:
>>>>>   fail_set_sched_resources:
>>>>> -    pm_uninit(&dqm->packets);
>>>>> +    pm_uninit(&dqm->packets, false);
>>>>>   fail_packet_manager_init:
>>>>>       return retval;
>>>>>   }
>>>>>     static int stop_cpsch(struct device_queue_manager *dqm)
>>>>>   {
>>>>> +    bool hanging;
>>>>> +kq_uninitialize(
>>>>>
>>>>>       dqm_lock(dqm);
>>>>> -    unmap_queues_cpsch(dqm, KFD_UNMAP_QUEUES_FILTER_ALL_QUEUES, 0);
>>>>> +    if (!dqm->is_hws_hang)
>> [shaoyunl]  should we check is_resetting here as well . so we ignore 
>> the  unmap call even HWS still not  detect the hang but we know we 
>> currently in resetting  precedure
>
> GPU reset can be done when the HWS is not hanging. In that case 
> unmapping queues is perfectly safe. In the worst case it'll time out 
> and dqm->is_hws_hang will be set as a result. I'm planning to add more 
> checks later so that we can optionally wait in unmap_queues until a 
> reset is done. We'll need that to do preemptions reliably while a GPU 
> reset is in progress. So I need to either unmap the queues or be sure 
> that HWS is hanging.
>
> With yours and Oak's comments I realize, this is far from complete and 
> more work is needed. But I still think this is an improvement.
>
> Regards,
>   Felix
>
>
>>>>> +        unmap_queues_cpsch(dqm, 
>>>>> KFD_UNMAP_QUEUES_FILTER_ALL_QUEUES, 0);
>>>>> +    hanging = dqm->is_hws_hang || dqm->is_resetting;
>>>>>       dqm->sched_running = false;
>>>>>       dqm_unlock(dqm);
>>>>>         kfd_gtt_sa_free(dqm->dev, dqm->fence_mem);
>>>>> -    pm_uninit(&dqm->packets);
>>>>> +    pm_uninit(&dqm->packets, hanging);
>>>>>         return 0;
>>>>>   }
>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c 
>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c
>>>>> index 2d56dc534459..bae706462f96 100644
>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c
>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c
>>>>> @@ -195,9 +195,9 @@ static bool kq_initialize(struct kernel_queue 
>>>>> *kq, struct kfd_dev *dev,
>>>>>   }
>>>>>     /* Uninitialize a kernel queue and free all its memory usages. */
>>>>> -static void kq_uninitialize(struct kernel_queue *kq)
>>>>> +static void kq_uninitialize(struct kernel_queue *kq, bool hanging)
>>>>>   {
>>>>> -    if (kq->queue->properties.type == KFD_QUEUE_TYPE_HIQ)
>>>>> +    if (kq->queue->properties.type == KFD_QUEUE_TYPE_HIQ && 
>>>>> !hanging)
>>>>>           kq->mqd_mgr->destroy_mqd(kq->mqd_mgr,
>>>>>                       kq->queue->mqd,
>>>>>                       KFD_PREEMPT_TYPE_WAVEFRONT_RESET,
>>>>> @@ -337,9 +337,9 @@ struct kernel_queue *kernel_queue_init(struct 
>>>>> kfd_dev *dev,
>>>>>       return NULL;
>>>>>   }
>>>>>   -void kernel_queue_uninit(struct kernel_queue *kq)
>>>>> +void kernel_queue_uninit(struct kernel_queue *kq, bool hanging)
>>>>>   {
>>>>> -    kq_uninitialize(kq);
>>>>> +    kq_uninitialize(kq, hanging);
>>>>>       kfree(kq);
>>>>>   }
>>>>>   diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c 
>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c
>>>>> index 6cabed06ef5d..dc406e6dee23 100644
>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c
>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c
>>>>> @@ -264,10 +264,10 @@ int pm_init(struct packet_manager *pm, 
>>>>> struct device_queue_manager *dqm)
>>>>>       return 0;
>>>>>   }
>>>>>   -void pm_uninit(struct packet_manager *pm)
>>>>> +void pm_uninit(struct packet_manager *pm, bool hanging)
>>>>>   {
>>>>>       mutex_destroy(&pm->lock);
>>>>> -    kernel_queue_uninit(pm->priv_queue);
>>>>> +    kernel_queue_uninit(pm->priv_queue, hanging);
>>>>>   }
>>>>>     int pm_send_set_resources(struct packet_manager *pm,
>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h I
>>>>> index 087e96838997..8ac680dc90f1 100644
>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>>>> @@ -883,7 +883,7 @@ struct device_queue_manager 
>>>>> *device_queue_manager_init(struct kfd_dev *dev);
>>>>>   void device_queue_manager_uninit(struct device_queue_manager *dqm);
>>>>>   struct kernel_queue *kernel_queue_init(struct kfd_dev *dev,
>>>>>                       enum kfd_queue_type type);
>>>>> -void kernel_queue_uninit(struct kernel_queue *kq);
>>>>> +void kernel_queue_uninit(struct kernel_queue *kq, bool hanging);
>>>>>   int kfd_process_vm_fault(struct device_queue_manager *dqm, 
>>>>> unsigned int pasid);
>>>>>     /* Process Queue Manager */
>>>>> @@ -974,7 +974,7 @@ extern const struct packet_manager_funcs 
>>>>> kfd_vi_pm_funcs;
>>>>>   extern const struct packet_manager_funcs kfd_v9_pm_funcs;
>>>>>     int pm_init(struct packet_manager *pm, struct 
>>>>> device_queue_manager *dqm);
>>>>> -void pm_uninit(struct packet_manager *pm);
>>>>> +void pm_uninit(struct packet_manager *pm, bool hanging);
>>>>>   int pm_send_set_resources(struct packet_manager *pm,
>>>>>                   struct scheduling_resources *res);
>>>>>   int pm_send_runlist(struct packet_manager *pm, struct list_head 
>>>>> *dqm_queues);
>>>>> diff --git 
>>>>> a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c 
>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
>>>>> index d3eacf72e8db..8fa856e6a03f 100644
>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
>>>>> @@ -374,7 +374,7 @@ int pqm_destroy_queue(struct 
>>>>> process_queue_manager *pqm, unsigned int qid)
>>>>>           /* destroy kernel queue (DIQ) */
>>>>>           dqm = pqn->kq->dev->dqm;
>>>>>           dqm->ops.destroy_kernel_queue(dqm, pqn->kq, &pdd->qpd);
>>>>> -        kernel_queue_uninit(pqn->kq);
>>>>> +        kernel_queue_uninit(pqn->kq, false);
>>>>>       }
>>>>>         if (pqn->q) {
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx@lists.freedesktop.org
>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Cfelix.kuehling%40amd.com%7C6df6fef2bf6a4208704e08d785652f27%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637124542109969638&amp;sdata=V0mHmgjUSP%2BauYL3r6PGU7aqFTQz8NkKMcuA5vXSkUQ%3D&amp;reserved=0 
>>>>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

  reply	other threads:[~2019-12-20 22:10 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-12-20  8:29 [PATCH 1/4] drm/amdkfd: Fix permissions of hang_hws Felix Kuehling
2019-12-20  8:29 ` [PATCH 2/4] drm/amdkfd: Remove unused variable Felix Kuehling
2020-01-02 13:40   ` Russell, Kent
2019-12-20  8:30 ` [PATCH 3/4] drm/amdkfd: Improve HWS hang detection and handling Felix Kuehling
2019-12-20 17:18   ` Zeng, Oak
2019-12-20 17:49     ` Felix Kuehling
2019-12-20  8:30 ` [PATCH 4/4] drm/amdkfd: Avoid hanging hardware in stop_cpsch Felix Kuehling
2019-12-20 10:03   ` Deng, Emily
2019-12-20 15:56   ` shaoyunl
2019-12-20 16:33     ` Felix Kuehling
2019-12-20 19:31       ` shaoyunl
2019-12-20 19:46         ` Felix Kuehling
2019-12-20 22:09           ` shaoyunl [this message]
2019-12-20 17:22   ` Zeng, Oak
2019-12-20 17:27     ` Felix Kuehling
2019-12-20 17:29       ` Zeng, Oak
2020-01-02 13:39 ` [PATCH 1/4] drm/amdkfd: Fix permissions of hang_hws Russell, Kent

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5f7208ac-fcec-8f69-44bd-aa288d79482c@amd.com \
    --to=shaoyun.liu@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=felix.kuehling@amd.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.