All of lore.kernel.org
 help / color / mirror / Atom feed
From: Neil Armstrong <narmstrong@baylibre.com>
To: "Steven Price" <steven.price@arm.com>,
	daniel@ffwll.ch, airlied@linux.ie,
	"Christian König" <christian.koenig@amd.com>
Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>,
	linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org,
	"open list:ARM/Amlogic Meson..."
	<linux-amlogic@lists.infradead.org>,
	Erico Nunes <nunes.erico@gmail.com>,
	Rob Herring <robh@kernel.org>
Subject: Re: drm_sched with panfrost crash on T820
Date: Fri, 27 Sep 2019 13:27:23 +0200	[thread overview]
Message-ID: <1eec2f1b-0467-cd4d-aa22-23c70388ac0f@baylibre.com> (raw)
In-Reply-To: <26ae2a4d-8df1-e8db-3060-41638ed63e2a@arm.com>

Hi Steven,

Thanks for your prompt reaction !

On 27/09/2019 12:48, Steven Price wrote:
> On 27/09/2019 10:55, Steven Price wrote:
> [...]
>> One obvious issue with the DRM scheduler is that there is a call to
>> cancel_delayed_work() in drm_sched_stop() which to me looks like it
>> should be cancel_delayed_work_sync() to ensure that the timeout handling
>> has completed.
>>
>> However in the above scenario a _sync() variety would then cause a
>> deadlock (one thread has pfdev->reset_lock and is waiting for the other
>> thread which is trying to take the lock).
>>
>> So we need to update Panfrost so that it can coordinate the reset
>> between schedulers. Can you try something like the following (untested):
> 
> And actually testing it I of course discover it doesn't quite work. We
> do need the cancel_delayed_work_sync() in the DRM scheduler (when
> stopping a different scheduler) and we need to avoid holding the
> reset_lock during the drm_sched_stop() call to prevent deadlocking with
> another thread handling a timeout.

Yep the first patch wasn't fixing the issue all the time.

> 
> Can you give the following patch a spin? I don't have a great
> reproduction case, so it would be good to get some confidence it fixes
> the problem.

Running it right now.

Thanks,
Neil

> 
> ----8<----
> From 521a286789260197ae94f698932ebf369efc45ad Mon Sep 17 00:00:00 2001
> From: Steven Price <steven.price@arm.com>
> Date: Fri, 27 Sep 2019 11:42:40 +0100
> Subject: [PATCH] drm/panfrost: Handle resetting on timeout better
> 
> Panfrost uses multiple schedulers (one for each slot, so 2 in reality),
> and on a timeout has to stop all the schedulers to safely perform a
> reset. However more than one scheduler can trigger a timeout at the same
> time. This race condition results in jobs being freed while they are
> still in use.
> 
> Modify drm_sched_stop() to call cancel_delayed_work_sync() when stopping
> a different scheduler to the one belonging to the passed in job.
> panfrost_job_timedout() is also modified to only allow one thread at a
> time to handle the reset. Any subsequent threads simply return assuming
> that the first thread will handle the situation.
> 
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
>  drivers/gpu/drm/panfrost/panfrost_device.h |  2 ++
>  drivers/gpu/drm/panfrost/panfrost_job.c    | 11 ++++++++++-
>  drivers/gpu/drm/scheduler/sched_main.c     |  5 ++++-
>  3 files changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h b/drivers/gpu/drm/panfrost/panfrost_device.h
> index f503c566e99f..6441c7fba6c4 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_device.h
> +++ b/drivers/gpu/drm/panfrost/panfrost_device.h
> @@ -99,6 +99,8 @@ struct panfrost_device {
>  		unsigned long cur_volt;
>  		struct panfrost_devfreq_slot slot[NUM_JOB_SLOTS];
>  	} devfreq;
> +
> +	bool is_resetting;
>  };
>  
>  struct panfrost_mmu {
> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
> index 05c85f45a0de..1b2019e08b43 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> @@ -388,13 +388,21 @@ static void panfrost_job_timedout(struct drm_sched_job *sched_job)
>  
>  	mutex_lock(&pfdev->reset_lock);
>  
> +	if (pfdev->is_resetting) {
> +		mutex_unlock(&pfdev->reset_lock);
> +		return;
> +	}
> +	pfdev->is_resetting = true;
> +
> +	mutex_unlock(&pfdev->reset_lock);
> +
>  	for (i = 0; i < NUM_JOB_SLOTS; i++)
>  		drm_sched_stop(&pfdev->js->queue[i].sched, sched_job);
>  
>  	if (sched_job)
>  		drm_sched_increase_karma(sched_job);
>  
> -	/* panfrost_core_dump(pfdev); */
> +	mutex_lock(&pfdev->reset_lock);
>  
>  	panfrost_devfreq_record_transition(pfdev, js);
>  	panfrost_device_reset(pfdev);
> @@ -406,6 +414,7 @@ static void panfrost_job_timedout(struct drm_sched_job *sched_job)
>  	for (i = 0; i < NUM_JOB_SLOTS; i++)
>  		drm_sched_start(&pfdev->js->queue[i].sched, true);
>  
> +	pfdev->is_resetting = false;
>  	mutex_unlock(&pfdev->reset_lock);
>  }
>  
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 148468447ba9..bc6d1862ec8a 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -415,7 +415,10 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>  	 * this TDR finished and before the newly restarted jobs had a
>  	 * chance to complete.
>  	 */
> -	cancel_delayed_work(&sched->work_tdr);
> +	if (bad->sched != sched)
> +		cancel_delayed_work_sync(&sched->work_tdr);
> +	else
> +		cancel_delayed_work(&sched->work_tdr);
>  }
>  
>  EXPORT_SYMBOL(drm_sched_stop);
> 


WARNING: multiple messages have this Message-ID (diff)
From: Neil Armstrong <narmstrong@baylibre.com>
To: "Steven Price" <steven.price@arm.com>,
	daniel@ffwll.ch, airlied@linux.ie,
	"Christian König" <christian.koenig@amd.com>
Cc: Rob Herring <robh@kernel.org>,
	Tomeu Vizoso <tomeu.vizoso@collabora.com>,
	linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org,
	"open list:ARM/Amlogic Meson..."
	<linux-amlogic@lists.infradead.org>,
	Erico Nunes <nunes.erico@gmail.com>
Subject: Re: drm_sched with panfrost crash on T820
Date: Fri, 27 Sep 2019 13:27:23 +0200	[thread overview]
Message-ID: <1eec2f1b-0467-cd4d-aa22-23c70388ac0f@baylibre.com> (raw)
In-Reply-To: <26ae2a4d-8df1-e8db-3060-41638ed63e2a@arm.com>

Hi Steven,

Thanks for your prompt reaction !

On 27/09/2019 12:48, Steven Price wrote:
> On 27/09/2019 10:55, Steven Price wrote:
> [...]
>> One obvious issue with the DRM scheduler is that there is a call to
>> cancel_delayed_work() in drm_sched_stop() which to me looks like it
>> should be cancel_delayed_work_sync() to ensure that the timeout handling
>> has completed.
>>
>> However in the above scenario a _sync() variety would then cause a
>> deadlock (one thread has pfdev->reset_lock and is waiting for the other
>> thread which is trying to take the lock).
>>
>> So we need to update Panfrost so that it can coordinate the reset
>> between schedulers. Can you try something like the following (untested):
> 
> And actually testing it I of course discover it doesn't quite work. We
> do need the cancel_delayed_work_sync() in the DRM scheduler (when
> stopping a different scheduler) and we need to avoid holding the
> reset_lock during the drm_sched_stop() call to prevent deadlocking with
> another thread handling a timeout.

Yep the first patch wasn't fixing the issue all the time.

> 
> Can you give the following patch a spin? I don't have a great
> reproduction case, so it would be good to get some confidence it fixes
> the problem.

Running it right now.

Thanks,
Neil

> 
> ----8<----
> From 521a286789260197ae94f698932ebf369efc45ad Mon Sep 17 00:00:00 2001
> From: Steven Price <steven.price@arm.com>
> Date: Fri, 27 Sep 2019 11:42:40 +0100
> Subject: [PATCH] drm/panfrost: Handle resetting on timeout better
> 
> Panfrost uses multiple schedulers (one for each slot, so 2 in reality),
> and on a timeout has to stop all the schedulers to safely perform a
> reset. However more than one scheduler can trigger a timeout at the same
> time. This race condition results in jobs being freed while they are
> still in use.
> 
> Modify drm_sched_stop() to call cancel_delayed_work_sync() when stopping
> a different scheduler to the one belonging to the passed in job.
> panfrost_job_timedout() is also modified to only allow one thread at a
> time to handle the reset. Any subsequent threads simply return assuming
> that the first thread will handle the situation.
> 
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
>  drivers/gpu/drm/panfrost/panfrost_device.h |  2 ++
>  drivers/gpu/drm/panfrost/panfrost_job.c    | 11 ++++++++++-
>  drivers/gpu/drm/scheduler/sched_main.c     |  5 ++++-
>  3 files changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h b/drivers/gpu/drm/panfrost/panfrost_device.h
> index f503c566e99f..6441c7fba6c4 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_device.h
> +++ b/drivers/gpu/drm/panfrost/panfrost_device.h
> @@ -99,6 +99,8 @@ struct panfrost_device {
>  		unsigned long cur_volt;
>  		struct panfrost_devfreq_slot slot[NUM_JOB_SLOTS];
>  	} devfreq;
> +
> +	bool is_resetting;
>  };
>  
>  struct panfrost_mmu {
> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
> index 05c85f45a0de..1b2019e08b43 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> @@ -388,13 +388,21 @@ static void panfrost_job_timedout(struct drm_sched_job *sched_job)
>  
>  	mutex_lock(&pfdev->reset_lock);
>  
> +	if (pfdev->is_resetting) {
> +		mutex_unlock(&pfdev->reset_lock);
> +		return;
> +	}
> +	pfdev->is_resetting = true;
> +
> +	mutex_unlock(&pfdev->reset_lock);
> +
>  	for (i = 0; i < NUM_JOB_SLOTS; i++)
>  		drm_sched_stop(&pfdev->js->queue[i].sched, sched_job);
>  
>  	if (sched_job)
>  		drm_sched_increase_karma(sched_job);
>  
> -	/* panfrost_core_dump(pfdev); */
> +	mutex_lock(&pfdev->reset_lock);
>  
>  	panfrost_devfreq_record_transition(pfdev, js);
>  	panfrost_device_reset(pfdev);
> @@ -406,6 +414,7 @@ static void panfrost_job_timedout(struct drm_sched_job *sched_job)
>  	for (i = 0; i < NUM_JOB_SLOTS; i++)
>  		drm_sched_start(&pfdev->js->queue[i].sched, true);
>  
> +	pfdev->is_resetting = false;
>  	mutex_unlock(&pfdev->reset_lock);
>  }
>  
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 148468447ba9..bc6d1862ec8a 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -415,7 +415,10 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>  	 * this TDR finished and before the newly restarted jobs had a
>  	 * chance to complete.
>  	 */
> -	cancel_delayed_work(&sched->work_tdr);
> +	if (bad->sched != sched)
> +		cancel_delayed_work_sync(&sched->work_tdr);
> +	else
> +		cancel_delayed_work(&sched->work_tdr);
>  }
>  
>  EXPORT_SYMBOL(drm_sched_stop);
> 


_______________________________________________
linux-amlogic mailing list
linux-amlogic@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-amlogic

  reply	other threads:[~2019-09-27 11:27 UTC|newest]

Thread overview: 55+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-27  8:12 drm_sched with panfrost crash on T820 Neil Armstrong
2019-09-27  8:12 ` Neil Armstrong
2019-09-27  9:55 ` Steven Price
2019-09-27  9:55   ` Steven Price
2019-09-27 10:48   ` Steven Price
2019-09-27 10:48     ` Steven Price
2019-09-27 10:48     ` Steven Price
2019-09-27 11:27     ` Neil Armstrong [this message]
2019-09-27 11:27       ` Neil Armstrong
2019-09-27 11:48       ` Neil Armstrong
2019-09-27 11:48         ` Neil Armstrong
2019-09-27 15:00         ` Steven Price
2019-09-27 15:00           ` Steven Price
2019-09-27 15:00           ` Steven Price
2019-09-27 15:20           ` Neil Armstrong
2019-09-27 15:20             ` Neil Armstrong
2019-09-30 13:18           ` Neil Armstrong
2019-09-30 13:18             ` Neil Armstrong
2019-09-30 13:18             ` Neil Armstrong
2019-09-27 20:55 ` Grodzovsky, Andrey
2019-09-27 20:55   ` Grodzovsky, Andrey
2019-09-27 20:55   ` Grodzovsky, Andrey
2019-09-30  9:17   ` Neil Armstrong
2019-09-30  9:17     ` Neil Armstrong
2019-09-30  9:17     ` Neil Armstrong
2019-10-02 16:53     ` Grodzovsky, Andrey
2019-10-02 16:53       ` Grodzovsky, Andrey
2019-10-03  8:36       ` Neil Armstrong
2019-10-03  8:36         ` Neil Armstrong
2019-09-30 14:52   ` Hillf Danton
2019-09-30 14:52     ` Hillf Danton
2019-10-02 14:40     ` Grodzovsky, Andrey
2019-10-02 14:40       ` Grodzovsky, Andrey
2019-10-02 14:40       ` Grodzovsky, Andrey
2019-10-02 14:44       ` Neil Armstrong
2019-10-02 14:44         ` Neil Armstrong
2019-10-02 14:44         ` Neil Armstrong
2019-10-03  8:34       ` Neil Armstrong
2019-10-03  8:34         ` Neil Armstrong
2019-10-03  8:34         ` Neil Armstrong
2019-10-04 14:53         ` Grodzovsky, Andrey
2019-10-04 14:53           ` Grodzovsky, Andrey
2019-10-04 15:03           ` Neil Armstrong
2019-10-04 15:03             ` Neil Armstrong
2019-10-04 15:03             ` Neil Armstrong
2019-10-04 15:27             ` Steven Price
2019-10-04 15:27               ` Steven Price
2019-10-04 15:34               ` Koenig, Christian
2019-10-04 15:34                 ` Koenig, Christian
2019-10-04 15:34                 ` Koenig, Christian
2019-10-04 16:02                 ` Steven Price
2019-10-04 16:02                   ` Steven Price
2019-10-04 16:33 Koenig, Christian
2019-10-07 12:47 ` Steven Price
2019-10-07 12:47   ` Steven Price

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1eec2f1b-0467-cd4d-aa22-23c70388ac0f@baylibre.com \
    --to=narmstrong@baylibre.com \
    --cc=airlied@linux.ie \
    --cc=christian.koenig@amd.com \
    --cc=daniel@ffwll.ch \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=linux-amlogic@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=nunes.erico@gmail.com \
    --cc=robh@kernel.org \
    --cc=steven.price@arm.com \
    --cc=tomeu.vizoso@collabora.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.