From: Steven Price <steven.price@arm.com>
To: "Zhang, Jack (Jian)" <Jack.Zhang1@amd.com>,
	"dri-devel@lists.freedesktop.org"
	<dri-devel@lists.freedesktop.org>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
	"Koenig, Christian" <Christian.Koenig@amd.com>,
	"Grodzovsky, Andrey" <Andrey.Grodzovsky@amd.com>,
	"Liu, Monk" <Monk.Liu@amd.com>,
	"Deng, Emily" <Emily.Deng@amd.com>, Rob Herring <robh@kernel.org>,
	Tomeu Vizoso <tomeu.vizoso@collabora.com>
Subject: Re: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak
Date: Mon, 22 Mar 2021 15:29:00 +0000
Message-ID: <bd11b7f4-41a8-fd29-bc94-656c7c83c552@arm.com>
In-Reply-To: <DM5PR1201MB020453AA9A2A5C5173AF4D84BB6C9@DM5PR1201MB0204.namprd12.prod.outlook.com>

On 15/03/2021 05:23, Zhang, Jack (Jian) wrote:
> [AMD Public Use]
> 
> Hi, Rob/Tomeu/Steven,
> 
> Would you please help to review this patch for panfrost driver?
> 
> Thanks,
> Jack Zhang
> 
> -----Original Message-----
> From: Jack Zhang <Jack.Zhang1@amd.com>
> Sent: Monday, March 15, 2021 1:21 PM
> To: dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Liu, Monk <Monk.Liu@amd.com>; Deng, Emily <Emily.Deng@amd.com>
> Cc: Zhang, Jack (Jian) <Jack.Zhang1@amd.com>
> Subject: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak
> 
> re-insert Bailing jobs to avoid memory leak.
> 
> V2: move re-insert step to drm/scheduler logic
> V3: add panfrost's return value for bailing jobs
> in case it hits the memleak issue.

This commit message could do with some work - as it stands it's really
hard to decipher what actual problem you're solving.

> 
> Signed-off-by: Jack Zhang <Jack.Zhang1@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 8 ++++++--
>   drivers/gpu/drm/panfrost/panfrost_job.c    | 4 ++--
>   drivers/gpu/drm/scheduler/sched_main.c     | 8 +++++++-
>   include/drm/gpu_scheduler.h                | 1 +
>   5 files changed, 19 insertions(+), 6 deletions(-)
> 
[...]
> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
> index 6003cfeb1322..e2cb4f32dae1 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> @@ -444,7 +444,7 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
>   	 * spurious. Bail out.
>   	 */
>   	if (dma_fence_is_signaled(job->done_fence))
> -		return DRM_GPU_SCHED_STAT_NOMINAL;
> +		return DRM_GPU_SCHED_STAT_BAILING;
>   
>   	dev_err(pfdev->dev, "gpu sched timeout, js=%d, config=0x%x, status=0x%x, head=0x%x, tail=0x%x, sched_job=%p",
>   		js,
> @@ -456,7 +456,7 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
>   
>   	/* Scheduler is already stopped, nothing to do. */
>   	if (!panfrost_scheduler_stop(&pfdev->js->queue[js], sched_job))
> -		return DRM_GPU_SCHED_STAT_NOMINAL;
> +		return DRM_GPU_SCHED_STAT_BAILING;
>   
>   	/* Schedule a reset if there's no reset in progress. */
>   	if (!atomic_xchg(&pfdev->reset.pending, 1))

This looks correct to me - in these two cases drm_sched_stop() is never
called on the sched_job, so as the code stands the job is leaked.
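
To spell out where the leak comes from (this is just the flow from the
scheduler hunk below, simplified):

	spin_lock(&sched->job_list_lock);
	list_del_init(&job->list);	/* job now off the pending list */
	spin_unlock(&sched->job_list_lock);

	job->sched->ops->timedout_job(job);
	/*
	 * If the callback returned early without drm_sched_stop()
	 * re-queuing or freeing the job, the job is now unreachable
	 * and is never freed.
	 */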

> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 92d8de24d0a1..a44f621fb5c4 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>   {
>   	struct drm_gpu_scheduler *sched;
>   	struct drm_sched_job *job;
> +	int ret;
>   
>   	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>   
> @@ -331,8 +332,13 @@ static void drm_sched_job_timedout(struct work_struct *work)
>   		list_del_init(&job->list);
>   		spin_unlock(&sched->job_list_lock);
>   
> -		job->sched->ops->timedout_job(job);
> +		ret = job->sched->ops->timedout_job(job);
>   
> +		if (ret == DRM_GPU_SCHED_STAT_BAILING) {
> +			spin_lock(&sched->job_list_lock);
> +			list_add(&job->node, &sched->ring_mirror_list);
> +			spin_unlock(&sched->job_list_lock);
> +		}

I think we could really do with a comment somewhere explaining what
"bailing" means in this context - rough sketch after the list below.
For Panfrost there are two cases:

  * The GPU job actually finished while the timeout code was running 
(done_fence is signalled).

  * The GPU is already in the process of being reset (Panfrost has
multiple queues, so most likely a bad job in another queue).
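
Something along these lines on the enum in gpu_scheduler.h would do -
just a rough sketch of the wording, adjust to taste:

	/*
	 * Rough sketch only:
	 *
	 * DRM_GPU_SCHED_STAT_BAILING: the timeout handler bailed out
	 * without stopping the scheduler, either because the job
	 * completed while the timeout was being handled or because a
	 * reset is already in progress. The scheduler must then take
	 * the job back onto the pending list to avoid leaking it.
	 */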

I'm also not convinced that (for Panfrost) it makes sense to add the
jobs back to the list. For the first case above the job could clearly
just be freed (it's complete). The second case is more interesting, and
Panfrost currently doesn't handle it well. In theory the driver could
try to rescue the job ('soft stop' in Mali terminology) so that it
could be resubmitted; Panfrost doesn't currently support that, so
attempting to resubmit the job is almost certainly going to fail.
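
If the scheduler is going to distinguish the two cases, I'd expect
something more like the below in drm_sched_job_timedout(). Completely
untested, and it assumes the scheduler's finished fence has already
signalled whenever the driver's done fence has (which is racy), so
treat it as a sketch of the idea rather than working code:

	ret = job->sched->ops->timedout_job(job);

	if (ret == DRM_GPU_SCHED_STAT_BAILING) {
		if (dma_fence_is_signaled(&job->s_fence->finished)) {
			/* Job completed while we were timing out: free it. */
			job->sched->ops->free_job(job);
		} else {
			/*
			 * Another reset is in flight; re-insert the job
			 * so that path cleans it up.
			 */
			spin_lock(&sched->job_list_lock);
			list_add(&job->node, &sched->ring_mirror_list);
			spin_unlock(&sched->job_list_lock);
		}
	}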

It's on my TODO list to look at improving Panfrost in this regard, but 
sadly still quite far down.

Steve

>   		/*
>   		 * Guilty job did complete and hence needs to be manually removed
>   		 * See drm_sched_stop doc.
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 4ea8606d91fe..8093ac2427ef 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -210,6 +210,7 @@ enum drm_gpu_sched_stat {
>   	DRM_GPU_SCHED_STAT_NONE, /* Reserve 0 */
>   	DRM_GPU_SCHED_STAT_NOMINAL,
>   	DRM_GPU_SCHED_STAT_ENODEV,
> +	DRM_GPU_SCHED_STAT_BAILING,
>   };
>   
>   /**
> 
