All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
To: Jack Zhang <Jack.Zhang1@amd.com>,
	dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org,
	Christian.Koenig@amd.com, Monk.Liu@amd.com, Emily.Deng@amd.com
Subject: Re: [PATCH v2] drm/scheduler re-insert Bailing job to avoid memleak
Date: Fri, 12 Mar 2021 10:22:01 -0500	[thread overview]
Message-ID: <01a690d2-6b09-8324-863f-babf2782c550@amd.com> (raw)
In-Reply-To: <20210312065025.1094934-1-Jack.Zhang1@amd.com>



On 2021-03-12 1:50 a.m., Jack Zhang wrote:
> re-insert Bailing jobs to avoid memory leak.

Usually we put a v2:"Blha blha blha" here to explain
what was modified in v2

Also - since you make changes to another driver you should
add their maintainer and mailing list probably
(use ./scripts/get_maintainer.pl) for this

> 
> Signed-off-by: Jack Zhang <Jack.Zhang1@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 8 ++++++--
>   drivers/gpu/drm/panfrost/panfrost_job.c    | 2 +-
>   drivers/gpu/drm/scheduler/sched_main.c     | 8 +++++++-
>   include/drm/gpu_scheduler.h                | 1 +
>   5 files changed, 18 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 79b9cc73763f..86463b0f936e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4815,8 +4815,10 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   					job ? job->base.id : -1);
>   
>   		/* even we skipped this reset, still need to set the job to guilty */
> -		if (job)
> +		if (job) {
>   			drm_sched_increase_karma(&job->base);
> +			r = DRM_GPU_SCHED_STAT_BAILING;
> +		}
>   		goto skip_recovery;
>   	}
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 759b34799221..41390bdacd9e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -34,6 +34,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>   	struct amdgpu_job *job = to_amdgpu_job(s_job);
>   	struct amdgpu_task_info ti;
>   	struct amdgpu_device *adev = ring->adev;
> +	int ret;
>   
>   	memset(&ti, 0, sizeof(struct amdgpu_task_info));
>   
> @@ -52,8 +53,11 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>   		  ti.process_name, ti.tgid, ti.task_name, ti.pid);
>   
>   	if (amdgpu_device_should_recover_gpu(ring->adev)) {
> -		amdgpu_device_gpu_recover(ring->adev, job);
> -		return DRM_GPU_SCHED_STAT_NOMINAL;
> +		ret = amdgpu_device_gpu_recover(ring->adev, job);
> +		if (ret == DRM_GPU_SCHED_STAT_BAILING)
> +			return DRM_GPU_SCHED_STAT_BAILING;
> +		else
> +			return DRM_GPU_SCHED_STAT_NOMINAL;
>   	} else {
>   		drm_sched_suspend_timeout(&ring->sched);
>   		if (amdgpu_sriov_vf(adev))
> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
> index 6003cfeb1322..c372f4a38736 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> @@ -456,7 +456,7 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
>   
>   	/* Scheduler is already stopped, nothing to do. */
>   	if (!panfrost_scheduler_stop(&pfdev->js->queue[js], sched_job))
> -		return DRM_GPU_SCHED_STAT_NOMINAL;
> +		return DRM_GPU_SCHED_STAT_BAILING;

Note that there is another early termination in panfrost
at 
https://elixir.bootlin.com/linux/v5.11.1/source/drivers/gpu/drm/panfrost/panfrost_job.c#L445
So probably should also add there.

>   
>   	/* Schedule a reset if there's no reset in progress. */
>   	if (!atomic_xchg(&pfdev->reset.pending, 1))
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 92d8de24d0a1..a44f621fb5c4 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>   {
>   	struct drm_gpu_scheduler *sched;
>   	struct drm_sched_job *job;
> +	int ret;
>   
>   	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>   
> @@ -331,8 +332,13 @@ static void drm_sched_job_timedout(struct work_struct *work)
>   		list_del_init(&job->list);
>   		spin_unlock(&sched->job_list_lock);
>   
> -		job->sched->ops->timedout_job(job);
> +		ret = job->sched->ops->timedout_job(job);
>   
> +		if (ret == DRM_GPU_SCHED_STAT_BAILING) {
> +			spin_lock(&sched->job_list_lock);
> +			list_add(&job->node, &sched->ring_mirror_list);
> +			spin_unlock(&sched->job_list_lock);
> +		}

Just reiterating my comment from v1 here since u missed it -
Problem here that since you already dropped the reset locks you are
racing here now against other recovery threads as they process the same
mirror list, and yet,I think this solution makes things better then
they are now with the leak but still, it's only temporary band-aid until
the full solution to be implemented as described earlier by Christian.
Probably then worth mentioning here with a comment this it's a temporary
fix and that races are possible.

Andrey

>   		/*
>   		 * Guilty job did complete and hence needs to be manually removed
>   		 * See drm_sched_stop doc.
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 4ea8606d91fe..8093ac2427ef 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -210,6 +210,7 @@ enum drm_gpu_sched_stat {
>   	DRM_GPU_SCHED_STAT_NONE, /* Reserve 0 */
>   	DRM_GPU_SCHED_STAT_NOMINAL,
>   	DRM_GPU_SCHED_STAT_ENODEV,
> +	DRM_GPU_SCHED_STAT_BAILING,
>   };
>   
>   /**
> 
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

WARNING: multiple messages have this Message-ID (diff)
From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
To: Jack Zhang <Jack.Zhang1@amd.com>,
	dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org,
	Christian.Koenig@amd.com, Monk.Liu@amd.com, Emily.Deng@amd.com
Subject: Re: [PATCH v2] drm/scheduler re-insert Bailing job to avoid memleak
Date: Fri, 12 Mar 2021 10:22:01 -0500	[thread overview]
Message-ID: <01a690d2-6b09-8324-863f-babf2782c550@amd.com> (raw)
In-Reply-To: <20210312065025.1094934-1-Jack.Zhang1@amd.com>



On 2021-03-12 1:50 a.m., Jack Zhang wrote:
> re-insert Bailing jobs to avoid memory leak.

Usually we put a v2:"Blha blha blha" here to explain
what was modified in v2

Also - since you make changes to another driver you should
add their maintainer and mailing list probably
(use ./scripts/get_maintainer.pl) for this

> 
> Signed-off-by: Jack Zhang <Jack.Zhang1@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 8 ++++++--
>   drivers/gpu/drm/panfrost/panfrost_job.c    | 2 +-
>   drivers/gpu/drm/scheduler/sched_main.c     | 8 +++++++-
>   include/drm/gpu_scheduler.h                | 1 +
>   5 files changed, 18 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 79b9cc73763f..86463b0f936e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4815,8 +4815,10 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   					job ? job->base.id : -1);
>   
>   		/* even we skipped this reset, still need to set the job to guilty */
> -		if (job)
> +		if (job) {
>   			drm_sched_increase_karma(&job->base);
> +			r = DRM_GPU_SCHED_STAT_BAILING;
> +		}
>   		goto skip_recovery;
>   	}
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 759b34799221..41390bdacd9e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -34,6 +34,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>   	struct amdgpu_job *job = to_amdgpu_job(s_job);
>   	struct amdgpu_task_info ti;
>   	struct amdgpu_device *adev = ring->adev;
> +	int ret;
>   
>   	memset(&ti, 0, sizeof(struct amdgpu_task_info));
>   
> @@ -52,8 +53,11 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>   		  ti.process_name, ti.tgid, ti.task_name, ti.pid);
>   
>   	if (amdgpu_device_should_recover_gpu(ring->adev)) {
> -		amdgpu_device_gpu_recover(ring->adev, job);
> -		return DRM_GPU_SCHED_STAT_NOMINAL;
> +		ret = amdgpu_device_gpu_recover(ring->adev, job);
> +		if (ret == DRM_GPU_SCHED_STAT_BAILING)
> +			return DRM_GPU_SCHED_STAT_BAILING;
> +		else
> +			return DRM_GPU_SCHED_STAT_NOMINAL;
>   	} else {
>   		drm_sched_suspend_timeout(&ring->sched);
>   		if (amdgpu_sriov_vf(adev))
> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
> index 6003cfeb1322..c372f4a38736 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> @@ -456,7 +456,7 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
>   
>   	/* Scheduler is already stopped, nothing to do. */
>   	if (!panfrost_scheduler_stop(&pfdev->js->queue[js], sched_job))
> -		return DRM_GPU_SCHED_STAT_NOMINAL;
> +		return DRM_GPU_SCHED_STAT_BAILING;

Note that there is another early termination in panfrost
at 
https://elixir.bootlin.com/linux/v5.11.1/source/drivers/gpu/drm/panfrost/panfrost_job.c#L445
So probably should also add there.

>   
>   	/* Schedule a reset if there's no reset in progress. */
>   	if (!atomic_xchg(&pfdev->reset.pending, 1))
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 92d8de24d0a1..a44f621fb5c4 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>   {
>   	struct drm_gpu_scheduler *sched;
>   	struct drm_sched_job *job;
> +	int ret;
>   
>   	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>   
> @@ -331,8 +332,13 @@ static void drm_sched_job_timedout(struct work_struct *work)
>   		list_del_init(&job->list);
>   		spin_unlock(&sched->job_list_lock);
>   
> -		job->sched->ops->timedout_job(job);
> +		ret = job->sched->ops->timedout_job(job);
>   
> +		if (ret == DRM_GPU_SCHED_STAT_BAILING) {
> +			spin_lock(&sched->job_list_lock);
> +			list_add(&job->node, &sched->ring_mirror_list);
> +			spin_unlock(&sched->job_list_lock);
> +		}

Just reiterating my comment from v1 here since u missed it -
Problem here that since you already dropped the reset locks you are
racing here now against other recovery threads as they process the same
mirror list, and yet,I think this solution makes things better then
they are now with the leak but still, it's only temporary band-aid until
the full solution to be implemented as described earlier by Christian.
Probably then worth mentioning here with a comment this it's a temporary
fix and that races are possible.

Andrey

>   		/*
>   		 * Guilty job did complete and hence needs to be manually removed
>   		 * See drm_sched_stop doc.
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 4ea8606d91fe..8093ac2427ef 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -210,6 +210,7 @@ enum drm_gpu_sched_stat {
>   	DRM_GPU_SCHED_STAT_NONE, /* Reserve 0 */
>   	DRM_GPU_SCHED_STAT_NOMINAL,
>   	DRM_GPU_SCHED_STAT_ENODEV,
> +	DRM_GPU_SCHED_STAT_BAILING,
>   };
>   
>   /**
> 
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

  reply	other threads:[~2021-03-12 15:22 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-12  6:50 [PATCH v2] drm/scheduler re-insert Bailing job to avoid memleak Jack Zhang
2021-03-12  6:50 ` Jack Zhang
2021-03-12 15:22 ` Andrey Grodzovsky [this message]
2021-03-12 15:22   ` Andrey Grodzovsky

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=01a690d2-6b09-8324-863f-babf2782c550@amd.com \
    --to=andrey.grodzovsky@amd.com \
    --cc=Christian.Koenig@amd.com \
    --cc=Emily.Deng@amd.com \
    --cc=Jack.Zhang1@amd.com \
    --cc=Monk.Liu@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=dri-devel@lists.freedesktop.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.