dri-devel.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
From: "Lazar, Lijo" <lijo.lazar@amd.com>
To: "Christian König" <christian.koenig@amd.com>,
	"Andrey Grodzovsky" <andrey.grodzovsky@amd.com>,
	dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org
Cc: horace.chen@amd.com, Monk.Liu@amd.com
Subject: Re: [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs
Date: Wed, 5 Jan 2022 18:56:40 +0530	[thread overview]
Message-ID: <6f64e1d2-eec5-04da-2497-9ee2e7dfcf12@amd.com> (raw)
In-Reply-To: <6be71531-40f1-fcad-f54c-03f6e14064f9@amd.com>



On 1/5/2022 6:45 PM, Christian König wrote:
> Am 05.01.22 um 14:11 schrieb Lazar, Lijo:
>> On 1/5/2022 6:01 PM, Christian König wrote:
>>> Am 05.01.22 um 10:54 schrieb Lazar, Lijo:
>>>> On 12/23/2021 3:35 AM, Andrey Grodzovsky wrote:
>>>>> Use reset domain wq also for non TDR gpu recovery trigers
>>>>> such as sysfs and RAS. We must serialize all possible
>>>>> GPU recoveries to gurantee no concurrency there.
>>>>> For TDR call the original recovery function directly since
>>>>> it's already executed from within the wq. For others just
>>>>> use a wrapper to qeueue work and wait on it to finish.
>>>>>
>>>>> v2: Rename to amdgpu_recover_work_struct
>>>>>
>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>> ---
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  2 ++
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 33 
>>>>> +++++++++++++++++++++-
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +-
>>>>>   3 files changed, 35 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> index b5ff76aae7e0..8e96b9a14452 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> @@ -1296,6 +1296,8 @@ bool amdgpu_device_has_job_running(struct 
>>>>> amdgpu_device *adev);
>>>>>   bool amdgpu_device_should_recover_gpu(struct amdgpu_device *adev);
>>>>>   int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>                     struct amdgpu_job* job);
>>>>> +int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
>>>>> +                  struct amdgpu_job *job);
>>>>>   void amdgpu_device_pci_config_reset(struct amdgpu_device *adev);
>>>>>   int amdgpu_device_pci_reset(struct amdgpu_device *adev);
>>>>>   bool amdgpu_device_need_post(struct amdgpu_device *adev);
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> index 7c063fd37389..258ec3c0b2af 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -4979,7 +4979,7 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>>>    * Returns 0 for success or an error on failure.
>>>>>    */
>>>>>   -int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>> +int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
>>>>>                     struct amdgpu_job *job)
>>>>>   {
>>>>>       struct list_head device_list, *device_list_handle = NULL;
>>>>> @@ -5237,6 +5237,37 @@ int amdgpu_device_gpu_recover(struct 
>>>>> amdgpu_device *adev,
>>>>>       return r;
>>>>>   }
>>>>>   +struct amdgpu_recover_work_struct {
>>>>> +    struct work_struct base;
>>>>> +    struct amdgpu_device *adev;
>>>>> +    struct amdgpu_job *job;
>>>>> +    int ret;
>>>>> +};
>>>>> +
>>>>> +static void amdgpu_device_queue_gpu_recover_work(struct 
>>>>> work_struct *work)
>>>>> +{
>>>>> +    struct amdgpu_recover_work_struct *recover_work = 
>>>>> container_of(work, struct amdgpu_recover_work_struct, base);
>>>>> +
>>>>> +    recover_work->ret = 
>>>>> amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
>>>>> +}
>>>>> +/*
>>>>> + * Serialize gpu recover into reset domain single threaded wq
>>>>> + */
>>>>> +int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>> +                    struct amdgpu_job *job)
>>>>> +{
>>>>> +    struct amdgpu_recover_work_struct work = {.adev = adev, .job = 
>>>>> job};
>>>>> +
>>>>> +    INIT_WORK(&work.base, amdgpu_device_queue_gpu_recover_work);
>>>>> +
>>>>> +    if (!queue_work(adev->reset_domain.wq, &work.base))
>>>>> +        return -EAGAIN;
>>>>> +
>>>>
>>>> The decision to schedule a reset is made at this point. Subsequent 
>>>> accesses to hardware may not be reliable. So should the flag 
>>>> in_reset be set here itself rather than waiting for the work to 
>>>> start execution?
>>>
>>> No, when we race and lose the VM is completely lost and probably 
>>> restarted by the hypervisor.
>>>
>>> And when we race and win we properly set the flag before signaling 
>>> the hypervisor that it can continue with the reset.
>>>
>>
>> I was talking about baremetal case. When this was synchronous, 
>> in_reset flag is set as one of the first things and amdgpu_in_reset is 
>> checked to prevent further hardware accesses. This design only changes 
>> the recover part and doesn't change the hardware perspective. 
> 
>> Potential accesses from other processes need to be blocked as soon as 
>> we determine a reset is required.
> 
> That's an incorrect assumption.
> 
> Accessing the hardware is perfectly ok as long as the reset hasn't 
> started yet. In other words even when the hardware is locked up you can 
> still happily read/write registers or access the VRAM BAR.
> 

Not sure if that is 100% correct like a recovery triggered by RAS error 
(depends on the access done).

Thanks,
Lijo

> Only when the hardware is currently performing a reset, then we can't 
> touch it or there might be unfortunate consequences (usually complete 
> system lockup).
> 
> Regards,
> Christian.
> 
>> Are we expecting the work to be immediately executed and set the flags?
>>
>> Thanks,
>> Lijo
>>
>>>> Also, what about having the reset_active or in_reset flag in the 
>>>> reset_domain itself?
>>>
>>> Of hand that sounds like a good idea.
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> Thanks,
>>>> Lijo
>>>>
>>>>> +    flush_work(&work.base);
>>>>> +
>>>>> +    return work.ret;
>>>>> +}
>>>>> +
>>>>>   /**
>>>>>    * amdgpu_device_get_pcie_info - fence pcie info about the PCIE slot
>>>>>    *
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> index bfc47bea23db..38c9fd7b7ad4 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> @@ -63,7 +63,7 @@ static enum drm_gpu_sched_stat 
>>>>> amdgpu_job_timedout(struct drm_sched_job *s_job)
>>>>>             ti.process_name, ti.tgid, ti.task_name, ti.pid);
>>>>>         if (amdgpu_device_should_recover_gpu(ring->adev)) {
>>>>> -        amdgpu_device_gpu_recover(ring->adev, job);
>>>>> +        amdgpu_device_gpu_recover_imp(ring->adev, job);
>>>>>       } else {
>>>>>           drm_sched_suspend_timeout(&ring->sched);
>>>>>           if (amdgpu_sriov_vf(adev))
>>>>>
>>>
> 

  reply	other threads:[~2022-01-05 13:26 UTC|newest]

Thread overview: 55+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-12-22 22:04 [RFC v2 0/8] Define and use reset domain for GPU recovery in amdgpu Andrey Grodzovsky
2021-12-22 22:04 ` [RFC v2 1/8] drm/amdgpu: Introduce reset domain Andrey Grodzovsky
2021-12-22 22:05 ` [RFC v2 2/8] drm/amdgpu: Move scheduler init to after XGMI is ready Andrey Grodzovsky
2021-12-23  8:39   ` Christian König
2021-12-22 22:05 ` [RFC v2 3/8] drm/amdgpu: Fix crash on modprobe Andrey Grodzovsky
2021-12-23  8:40   ` Christian König
2021-12-22 22:05 ` [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs Andrey Grodzovsky
2021-12-23  8:41   ` Christian König
2022-01-05  9:54   ` Lazar, Lijo
2022-01-05 12:31     ` Christian König
2022-01-05 13:11       ` Lazar, Lijo
2022-01-05 13:15         ` Christian König
2022-01-05 13:26           ` Lazar, Lijo [this message]
2022-01-05 13:41             ` Christian König
2022-01-05 18:11       ` Andrey Grodzovsky
2022-01-17 19:14         ` Andrey Grodzovsky
2022-01-17 19:17           ` Christian König
2022-01-17 19:21             ` Andrey Grodzovsky
2022-01-26 15:52               ` Andrey Grodzovsky
2022-01-28 16:57                 ` Grodzovsky, Andrey
2022-02-07  2:41                   ` JingWen Chen
2022-02-07  3:08                     ` Grodzovsky, Andrey
2021-12-22 22:13 ` [RFC v2 5/8] drm/amd/virt: For SRIOV send GPU reset directly to TDR queue Andrey Grodzovsky
2021-12-22 22:13   ` [RFC v2 6/8] drm/amdgpu: Drop hive->in_reset Andrey Grodzovsky
2021-12-22 22:13   ` [RFC v2 7/8] drm/amdgpu: Drop concurrent GPU reset protection for device Andrey Grodzovsky
2021-12-22 22:14   ` [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV Andrey Grodzovsky
2021-12-23  8:42     ` Christian König
2021-12-23 10:14       ` Liu, Monk
2021-12-24  8:58         ` Deng, Emily
2021-12-24  9:57           ` JingWen Chen
2021-12-30 18:45             ` Andrey Grodzovsky
2022-01-03 10:17               ` Christian König
2022-01-04  9:07                 ` JingWen Chen
2022-01-04 10:18                   ` Christian König
2022-01-04 10:49                     ` Liu, Monk
2022-01-04 11:36                       ` Christian König
2022-01-04 16:56                         ` Andrey Grodzovsky
2022-01-05  7:34                           ` JingWen Chen
2022-01-05  7:59                             ` Christian König
2022-01-05 18:24                               ` Andrey Grodzovsky
2022-01-06  4:59                                 ` JingWen Chen
2022-01-06  5:18                                   ` JingWen Chen
2022-01-06  9:13                                     ` Christian König
2022-01-06 19:13                                     ` Andrey Grodzovsky
2022-01-07  3:57                                       ` JingWen Chen
2022-01-07  5:46                                         ` JingWen Chen
2022-01-07 16:02                                           ` Andrey Grodzovsky
2022-01-12  6:28                                             ` JingWen Chen
2022-01-04 17:13                         ` Liu, Shaoyun
2022-01-04 20:54                           ` Andrey Grodzovsky
2022-01-05  0:01                             ` Liu, Shaoyun
2022-01-05  7:25                         ` JingWen Chen
2021-12-30 18:39           ` Andrey Grodzovsky
2021-12-23 18:07     ` Liu, Shaoyun
2021-12-23 18:29   ` [RFC v3 5/8] drm/amd/virt: For SRIOV send GPU reset directly to TDR queue Andrey Grodzovsky

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=6f64e1d2-eec5-04da-2497-9ee2e7dfcf12@amd.com \
    --to=lijo.lazar@amd.com \
    --cc=Monk.Liu@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=andrey.grodzovsky@amd.com \
    --cc=christian.koenig@amd.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=horace.chen@amd.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).