From: "Lazar, Lijo" <lijo.lazar@amd.com>
To: Andrey Grodzovsky <andrey.grodzovsky@amd.com>,
dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org
Cc: horace.chen@amd.com, christian.koenig@amd.com, Monk.Liu@amd.com
Subject: Re: [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs
Date: Wed, 5 Jan 2022 15:24:40 +0530 [thread overview]
Message-ID: <639bd7c3-e946-65eb-afae-dd619f6429d6@amd.com> (raw)
In-Reply-To: <20211222220506.789133-5-andrey.grodzovsky@amd.com>
On 12/23/2021 3:35 AM, Andrey Grodzovsky wrote:
> Use reset domain wq also for non TDR gpu recovery trigers
> such as sysfs and RAS. We must serialize all possible
> GPU recoveries to gurantee no concurrency there.
> For TDR call the original recovery function directly since
> it's already executed from within the wq. For others just
> use a wrapper to qeueue work and wait on it to finish.
>
> v2: Rename to amdgpu_recover_work_struct
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 ++
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 33 +++++++++++++++++++++-
> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
> 3 files changed, 35 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index b5ff76aae7e0..8e96b9a14452 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1296,6 +1296,8 @@ bool amdgpu_device_has_job_running(struct amdgpu_device *adev);
> bool amdgpu_device_should_recover_gpu(struct amdgpu_device *adev);
> int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> struct amdgpu_job* job);
> +int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
> + struct amdgpu_job *job);
> void amdgpu_device_pci_config_reset(struct amdgpu_device *adev);
> int amdgpu_device_pci_reset(struct amdgpu_device *adev);
> bool amdgpu_device_need_post(struct amdgpu_device *adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 7c063fd37389..258ec3c0b2af 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4979,7 +4979,7 @@ static void amdgpu_device_recheck_guilty_jobs(
> * Returns 0 for success or an error on failure.
> */
>
> -int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> +int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
> struct amdgpu_job *job)
> {
> struct list_head device_list, *device_list_handle = NULL;
> @@ -5237,6 +5237,37 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> return r;
> }
>
> +struct amdgpu_recover_work_struct {
> + struct work_struct base;
> + struct amdgpu_device *adev;
> + struct amdgpu_job *job;
> + int ret;
> +};
> +
> +static void amdgpu_device_queue_gpu_recover_work(struct work_struct *work)
> +{
> + struct amdgpu_recover_work_struct *recover_work = container_of(work, struct amdgpu_recover_work_struct, base);
> +
> + recover_work->ret = amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
> +}
> +/*
> + * Serialize gpu recover into reset domain single threaded wq
> + */
> +int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> + struct amdgpu_job *job)
> +{
> + struct amdgpu_recover_work_struct work = {.adev = adev, .job = job};
> +
> + INIT_WORK(&work.base, amdgpu_device_queue_gpu_recover_work);
> +
> + if (!queue_work(adev->reset_domain.wq, &work.base))
> + return -EAGAIN;
> +
The decision to schedule a reset is made at this point. Subsequent
accesses to hardware may not be reliable. So should the flag in_reset be
set here itself rather than waiting for the work to start execution?
Also, what about having the reset_active or in_reset flag in the
reset_domain itself?
Thanks,
Lijo
> + flush_work(&work.base);
> +
> + return work.ret;
> +}
> +
> /**
> * amdgpu_device_get_pcie_info - fence pcie info about the PCIE slot
> *
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index bfc47bea23db..38c9fd7b7ad4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -63,7 +63,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> ti.process_name, ti.tgid, ti.task_name, ti.pid);
>
> if (amdgpu_device_should_recover_gpu(ring->adev)) {
> - amdgpu_device_gpu_recover(ring->adev, job);
> + amdgpu_device_gpu_recover_imp(ring->adev, job);
> } else {
> drm_sched_suspend_timeout(&ring->sched);
> if (amdgpu_sriov_vf(adev))
>
next prev parent reply other threads:[~2022-01-05 9:55 UTC|newest]
Thread overview: 55+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-12-22 22:04 [RFC v2 0/8] Define and use reset domain for GPU recovery in amdgpu Andrey Grodzovsky
2021-12-22 22:04 ` [RFC v2 1/8] drm/amdgpu: Introduce reset domain Andrey Grodzovsky
2021-12-22 22:05 ` [RFC v2 2/8] drm/amdgpu: Move scheduler init to after XGMI is ready Andrey Grodzovsky
2021-12-23 8:39 ` Christian König
2021-12-22 22:05 ` [RFC v2 3/8] drm/amdgpu: Fix crash on modprobe Andrey Grodzovsky
2021-12-23 8:40 ` Christian König
2021-12-22 22:05 ` [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs Andrey Grodzovsky
2021-12-23 8:41 ` Christian König
2022-01-05 9:54 ` Lazar, Lijo [this message]
2022-01-05 12:31 ` Christian König
2022-01-05 13:11 ` Lazar, Lijo
2022-01-05 13:15 ` Christian König
2022-01-05 13:26 ` Lazar, Lijo
2022-01-05 13:41 ` Christian König
2022-01-05 18:11 ` Andrey Grodzovsky
2022-01-17 19:14 ` Andrey Grodzovsky
2022-01-17 19:17 ` Christian König
2022-01-17 19:21 ` Andrey Grodzovsky
2022-01-26 15:52 ` Andrey Grodzovsky
2022-01-28 16:57 ` Grodzovsky, Andrey
2022-02-07 2:41 ` JingWen Chen
2022-02-07 3:08 ` Grodzovsky, Andrey
2021-12-22 22:13 ` [RFC v2 5/8] drm/amd/virt: For SRIOV send GPU reset directly to TDR queue Andrey Grodzovsky
2021-12-22 22:13 ` [RFC v2 6/8] drm/amdgpu: Drop hive->in_reset Andrey Grodzovsky
2021-12-22 22:13 ` [RFC v2 7/8] drm/amdgpu: Drop concurrent GPU reset protection for device Andrey Grodzovsky
2021-12-22 22:14 ` [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV Andrey Grodzovsky
2021-12-23 8:42 ` Christian König
2021-12-23 10:14 ` Liu, Monk
2021-12-24 8:58 ` Deng, Emily
2021-12-24 9:57 ` JingWen Chen
2021-12-30 18:45 ` Andrey Grodzovsky
2022-01-03 10:17 ` Christian König
2022-01-04 9:07 ` JingWen Chen
2022-01-04 10:18 ` Christian König
2022-01-04 10:49 ` Liu, Monk
2022-01-04 11:36 ` Christian König
2022-01-04 16:56 ` Andrey Grodzovsky
2022-01-05 7:34 ` JingWen Chen
2022-01-05 7:59 ` Christian König
2022-01-05 18:24 ` Andrey Grodzovsky
2022-01-06 4:59 ` JingWen Chen
2022-01-06 5:18 ` JingWen Chen
2022-01-06 9:13 ` Christian König
2022-01-06 19:13 ` Andrey Grodzovsky
2022-01-07 3:57 ` JingWen Chen
2022-01-07 5:46 ` JingWen Chen
2022-01-07 16:02 ` Andrey Grodzovsky
2022-01-12 6:28 ` JingWen Chen
2022-01-04 17:13 ` Liu, Shaoyun
2022-01-04 20:54 ` Andrey Grodzovsky
2022-01-05 0:01 ` Liu, Shaoyun
2022-01-05 7:25 ` JingWen Chen
2021-12-30 18:39 ` Andrey Grodzovsky
2021-12-23 18:07 ` Liu, Shaoyun
2021-12-23 18:29 ` [RFC v3 5/8] drm/amd/virt: For SRIOV send GPU reset directly to TDR queue Andrey Grodzovsky
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=639bd7c3-e946-65eb-afae-dd619f6429d6@amd.com \
--to=lijo.lazar@amd.com \
--cc=Monk.Liu@amd.com \
--cc=amd-gfx@lists.freedesktop.org \
--cc=andrey.grodzovsky@amd.com \
--cc=christian.koenig@amd.com \
--cc=dri-devel@lists.freedesktop.org \
--cc=horace.chen@amd.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).