From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Cc: Emily.Deng@amd.com, amd-gfx@lists.freedesktop.org,
	dri-devel@lists.freedesktop.org, Christian.Koenig@amd.com
Subject: [PATCH v2] drm/scheduler: Avoid accessing freed bad job.
Date: Mon, 18 Nov 2019 12:52:25 -0500	[thread overview]
Message-ID: <1574099545-20430-1-git-send-email-andrey.grodzovsky@amd.com> (raw)

Problem:
Due to a race between drm_sched_cleanup_jobs in the sched thread and
drm_sched_job_timedout in the timeout work, there is a possibility that
the bad job was already freed while still being accessed from the
timeout thread.

Fix:
Instead of just peeking at the bad job in the mirror list, remove it
from the list under the lock and put it back later, once we are
guaranteed that no race with the main sched thread is possible, which
is after the thread is parked.
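
As an illustration of the locking pattern only - a minimal standalone C
sketch using pthreads and a hypothetical singly linked job list standing
in for the real drm_sched_job / ring_mirror_list types. The timeout side
removes the head under the lock instead of peeking at it, so the cleanup
side can no longer reach and free it:

#include <pthread.h>
#include <stdlib.h>

struct job {
	struct job *next;
	int finished;
};

static pthread_mutex_t job_list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct job *mirror_list;

/*
 * Timeout path: take the first job off the list while holding the
 * lock. Once removed it is unreachable from the cleanup path, so
 * dereferencing it after the unlock is safe.
 */
static struct job *timeout_take_first(void)
{
	struct job *job;

	pthread_mutex_lock(&job_list_lock);
	job = mirror_list;
	if (job)
		mirror_list = job->next;	/* remove, don't just peek */
	pthread_mutex_unlock(&job_list_lock);
	return job;
}

/* Reinsert at the head - it was the oldest entry we extracted. */
static void reinsert_job(struct job *job)
{
	pthread_mutex_lock(&job_list_lock);
	job->next = mirror_list;
	mirror_list = job;
	pthread_mutex_unlock(&job_list_lock);
}

/* Cleanup path: frees finished jobs, always under the same lock. */
static void cleanup_jobs(void)
{
	for (;;) {
		struct job *job;

		pthread_mutex_lock(&job_list_lock);
		job = mirror_list;
		if (!job || !job->finished) {
			pthread_mutex_unlock(&job_list_lock);
			break;
		}
		mirror_list = job->next;
		pthread_mutex_unlock(&job_list_lock);
		free(job);
	}
}

int main(void)
{
	struct job *bad = calloc(1, sizeof(*bad));

	reinsert_job(bad);		/* seed the list with one job */
	bad = timeout_take_first();	/* timeout handler: remove under lock */
	cleanup_jobs();			/* cannot free 'bad' - it's off the list */
	reinsert_job(bad);		/* put it back once "parked" */
	bad->finished = 1;
	cleanup_jobs();			/* now freed through the normal path */
	return 0;
}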

v2: Lock around processing ring_mirror_list in drm_sched_cleanup_jobs.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 44 +++++++++++++++++++++++++++++-----
 1 file changed, 38 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 80ddbdf..b05b210 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -287,10 +287,24 @@ static void drm_sched_job_timedout(struct work_struct *work)
 	unsigned long flags;
 
 	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
+
+	/*
+	 * Protects against concurrent deletion in drm_sched_cleanup_jobs that
+	 * is already in progress.
+	 */
+	spin_lock_irqsave(&sched->job_list_lock, flags);
 	job = list_first_entry_or_null(&sched->ring_mirror_list,
 				       struct drm_sched_job, node);
 
 	if (job) {
+		/*
+		 * Remove the bad job so it cannot be freed by the already in-progress
+		 * drm_sched_cleanup_jobs. It will be reinserted after sched->thread
+		 * is parked, at which point it's safe.
+		 */
+		list_del_init(&job->node);
+		spin_unlock_irqrestore(&sched->job_list_lock, flags);
+
 		job->sched->ops->timedout_job(job);
 
 		/*
@@ -302,6 +316,8 @@ static void drm_sched_job_timedout(struct work_struct *work)
 			sched->free_guilty = false;
 		}
 	}
+	else
+		spin_unlock_irqrestore(&sched->job_list_lock, flags);
 
 	spin_lock_irqsave(&sched->job_list_lock, flags);
 	drm_sched_start_timeout(sched);
@@ -373,6 +389,19 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
 	kthread_park(sched->thread);
 
 	/*
+	 * Reinsert the bad job here - now it's safe as drm_sched_cleanup_jobs
+	 * cannot race against us and release the bad job at this point - we parked
+	 * (waited for) any in progress (earlier) cleanups and any later ones will
+	 * bail out due to sched->thread being parked.
+	 */
+	if (bad && bad->sched == sched)
+		/*
+		 * Add at the head of the queue to reflect it was the earliest
+		 * job extracted.
+		 */
+		list_add(&bad->node, &sched->ring_mirror_list);
+
+	/*
 	 * Iterate the job list from later to earlier one and either deactivate
 	 * their HW callbacks or remove them from the mirror list if they already
 	 * signaled.
@@ -656,16 +685,19 @@ static void drm_sched_cleanup_jobs(struct drm_gpu_scheduler *sched)
 	    __kthread_should_park(sched->thread))
 		return;
 
-
-	while (!list_empty(&sched->ring_mirror_list)) {
+	/* See drm_sched_job_timedout for why the locking is here */
+	while (true) {
 		struct drm_sched_job *job;
 
-		job = list_first_entry(&sched->ring_mirror_list,
-				       struct drm_sched_job, node);
-		if (!dma_fence_is_signaled(&job->s_fence->finished))
+		spin_lock_irqsave(&sched->job_list_lock, flags);
+		job = list_first_entry_or_null(&sched->ring_mirror_list,
+					       struct drm_sched_job, node);
+
+		if (!job || !dma_fence_is_signaled(&job->s_fence->finished)) {
+			spin_unlock_irqrestore(&sched->job_list_lock, flags);
 			break;
+		}
 
-		spin_lock_irqsave(&sched->job_list_lock, flags);
 		/* remove job from ring_mirror_list */
 		list_del_init(&job->node);
 		spin_unlock_irqrestore(&sched->job_list_lock, flags);
-- 
2.7.4
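
A note on why the reinsertion in drm_sched_stop is safe: kthread_park()
returns only once the scheduler thread has actually parked, so any
drm_sched_cleanup_jobs pass that was already running has completed, and
later passes bail out early on __kthread_should_park(). A rough sketch of
that park-as-barrier pattern, with a hypothetical pthread-based worker
standing in for the real kthread API:

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t park_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t park_cond = PTHREAD_COND_INITIALIZER;
static bool should_park, parked, stop_worker;

/* Worker loop: checks the park flag between work passes, so a pass
 * never straddles a parked window. */
static void *worker(void *arg)
{
	(void)arg;
	for (;;) {
		pthread_mutex_lock(&park_lock);
		while (should_park) {
			parked = true;
			pthread_cond_broadcast(&park_cond);	/* wake the parker */
			pthread_cond_wait(&park_cond, &park_lock);
		}
		parked = false;
		if (stop_worker) {
			pthread_mutex_unlock(&park_lock);
			break;
		}
		pthread_mutex_unlock(&park_lock);
		/* ... one cleanup/run pass ... */
	}
	return NULL;
}

/* Returns only after the worker is quiescent - the barrier after
 * which shared state (e.g. the extracted bad job) can be touched. */
static void park_worker(void)
{
	pthread_mutex_lock(&park_lock);
	should_park = true;
	while (!parked)
		pthread_cond_wait(&park_cond, &park_lock);
	pthread_mutex_unlock(&park_lock);
}

static void unpark_worker(void)
{
	pthread_mutex_lock(&park_lock);
	should_park = false;
	pthread_cond_broadcast(&park_cond);
	pthread_mutex_unlock(&park_lock);
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, worker, NULL);
	park_worker();			/* like kthread_park(sched->thread) */
	/* safe: the worker cannot run concurrently with this section */
	pthread_mutex_lock(&park_lock);
	stop_worker = true;		/* for the demo, end the loop */
	pthread_mutex_unlock(&park_lock);
	unpark_worker();		/* like kthread_unpark(sched->thread) */
	pthread_join(t, NULL);
	return 0;
}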

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Thread overview: 9+ messages

2019-11-18 17:52 Andrey Grodzovsky [this message]
2019-11-18 20:04 ` Christian König
2019-11-19  8:48 ` Deng, Emily
