* [PATCH 1/5] drm/amd/sched: rename amd_sched_entity_pop_job @ 2017-09-28 14:55 Nicolai Hähnle [not found] ` <20170928145530.12844-1-nhaehnle-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 0 siblings, 1 reply; 36+ messages in thread From: Nicolai Hähnle @ 2017-09-28 14:55 UTC (permalink / raw) To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Nicolai Hähnle From: Nicolai Hähnle <nicolai.haehnle@amd.com> The function does not actually remove the job from the FIFO, so "peek" describes it better. Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com> --- drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c index 97c94f9683fa..742d724cd720 100644 --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c @@ -301,21 +301,21 @@ static bool amd_sched_entity_add_dependency_cb(struct amd_sched_entity *entity) if (!dma_fence_add_callback(entity->dependency, &entity->cb, amd_sched_entity_wakeup)) return true; dma_fence_put(entity->dependency); return false; } static struct amd_sched_job * -amd_sched_entity_pop_job(struct amd_sched_entity *entity) +amd_sched_entity_peek_job(struct amd_sched_entity *entity) { struct amd_gpu_scheduler *sched = entity->sched; struct amd_sched_job *sched_job; if (!kfifo_out_peek(&entity->job_queue, &sched_job, sizeof(sched_job))) return NULL; while ((entity->dependency = sched->ops->dependency(sched_job))) if (amd_sched_entity_add_dependency_cb(entity)) return NULL; @@ -593,21 +593,21 @@ static int amd_sched_main(void *param) struct dma_fence *fence; wait_event_interruptible(sched->wake_up_worker, (!amd_sched_blocked(sched) && (entity = amd_sched_select_entity(sched))) || kthread_should_stop()); if (!entity) continue; - sched_job = amd_sched_entity_pop_job(entity); + sched_job = amd_sched_entity_peek_job(entity); if (!sched_job) continue; s_fence = 
sched_job->s_fence; atomic_inc(&sched->hw_rq_count); amd_sched_job_begin(sched_job); fence = sched->ops->run_job(sched_job); amd_sched_fence_scheduled(s_fence); -- 2.11.0 _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply related [flat|nested] 36+ messages in thread
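The kfifo semantics behind the rename can be sketched with a minimal user-space ring buffer. kfifo itself is kernel-only, so `fifo_out_peek`/`fifo_out` below are hypothetical stand-ins for `kfifo_out_peek()`/`kfifo_out()`, not the kernel API:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal FIFO of pointers, analogous in spirit to the kernel's kfifo. */
#define FIFO_CAP 8

struct fifo {
	void *buf[FIFO_CAP];
	size_t head, tail; /* pop from head, push at tail; both grow monotonically */
};

static int fifo_in(struct fifo *f, void *elem)
{
	if (f->tail - f->head == FIFO_CAP)
		return 0; /* full */
	f->buf[f->tail++ % FIFO_CAP] = elem;
	return 1;
}

/* "peek": copy the head element out but leave it queued. */
static int fifo_out_peek(struct fifo *f, void **elem)
{
	if (f->head == f->tail)
		return 0; /* empty */
	*elem = f->buf[f->head % FIFO_CAP];
	return 1;
}

/* "pop": copy the head element out and dequeue it. */
static int fifo_out(struct fifo *f, void **elem)
{
	if (!fifo_out_peek(f, elem))
		return 0;
	f->head++;
	return 1;
}
```

amd_sched_entity_peek_job() corresponds to the peek variant: the job stays in entity->job_queue and is only dequeued later, which is why "pop" was a misleading name for it.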
* [PATCH 2/5] drm/amd/sched: fix an outdated comment [not found] ` <20170928145530.12844-1-nhaehnle-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> @ 2017-09-28 14:55 ` Nicolai Hähnle 2017-09-28 14:55 ` [PATCH 3/5] drm/amd/sched: move adding finish callback to amd_sched_job_begin Nicolai Hähnle ` (2 subsequent siblings) 3 siblings, 0 replies; 36+ messages in thread From: Nicolai Hähnle @ 2017-09-28 14:55 UTC (permalink / raw) To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Nicolai Hähnle From: Nicolai Hähnle <nicolai.haehnle@amd.com> Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com> --- drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c index 742d724cd720..6e899c593b7e 100644 --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c @@ -347,22 +347,21 @@ static bool amd_sched_entity_in(struct amd_sched_job *sched_job) /* first job wakes up scheduler */ if (first) { /* Add the entity to the run queue */ amd_sched_rq_add_entity(entity->rq, entity); amd_sched_wakeup(sched); } return added; } -/* job_finish is called after hw fence signaled, and - * the job had already been deleted from ring_mirror_list +/* job_finish is called after hw fence signaled */ static void amd_sched_job_finish(struct work_struct *work) { struct amd_sched_job *s_job = container_of(work, struct amd_sched_job, finish_work); struct amd_gpu_scheduler *sched = s_job->sched; /* remove job from ring_mirror_list */ spin_lock(&sched->job_list_lock); list_del_init(&s_job->node); -- 2.11.0 _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 3/5] drm/amd/sched: move adding finish callback to amd_sched_job_begin [not found] ` <20170928145530.12844-1-nhaehnle-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2017-09-28 14:55 ` [PATCH 2/5] drm/amd/sched: fix an outdated comment Nicolai Hähnle @ 2017-09-28 14:55 ` Nicolai Hähnle 2017-09-28 14:55 ` [PATCH 4/5] drm/amd/sched: NULL out the s_fence field after run_job Nicolai Hähnle 2017-09-28 14:55 ` [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini Nicolai Hähnle 3 siblings, 0 replies; 36+ messages in thread From: Nicolai Hähnle @ 2017-09-28 14:55 UTC (permalink / raw) To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Nicolai Hähnle From: Nicolai Hähnle <nicolai.haehnle@amd.com> The finish callback is responsible for removing the job from the ring mirror list, among other things. It makes sense to add it as callback in the place where the job is added to the ring mirror list. Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com> Acked-by: Christian König <christian.koenig@amd.com> --- drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c index 6e899c593b7e..e793312e351c 100644 --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c @@ -388,20 +388,23 @@ static void amd_sched_job_finish_cb(struct dma_fence *f, { struct amd_sched_job *job = container_of(cb, struct amd_sched_job, finish_cb); schedule_work(&job->finish_work); } static void amd_sched_job_begin(struct amd_sched_job *s_job) { struct amd_gpu_scheduler *sched = s_job->sched; + dma_fence_add_callback(&s_job->s_fence->finished, &s_job->finish_cb, + amd_sched_job_finish_cb); + spin_lock(&sched->job_list_lock); list_add_tail(&s_job->node, &sched->ring_mirror_list); if (sched->timeout != MAX_SCHEDULE_TIMEOUT && list_first_entry_or_null(&sched->ring_mirror_list, 
struct amd_sched_job, node) == s_job) schedule_delayed_work(&s_job->work_tdr, sched->timeout); spin_unlock(&sched->job_list_lock); } static void amd_sched_job_timedout(struct work_struct *work) @@ -480,22 +483,20 @@ void amd_sched_job_recovery(struct amd_gpu_scheduler *sched) * * @sched_job The pointer to job required to submit * * Returns 0 for success, negative error code otherwise. */ void amd_sched_entity_push_job(struct amd_sched_job *sched_job) { struct amd_sched_entity *entity = sched_job->s_entity; trace_amd_sched_job(sched_job); - dma_fence_add_callback(&sched_job->s_fence->finished, &sched_job->finish_cb, - amd_sched_job_finish_cb); wait_event(entity->sched->job_scheduled, amd_sched_entity_in(sched_job)); } /* init a sched_job with basic field */ int amd_sched_job_init(struct amd_sched_job *job, struct amd_gpu_scheduler *sched, struct amd_sched_entity *entity, void *owner) { -- 2.11.0 _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply related [flat|nested] 36+ messages in thread
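The rationale for the move is that amd_sched_job_begin() adds the job to the ring mirror list, and the finish callback is what removes it again, so registering the callback at the same place keeps the add/remove pair together. A user-space sketch of that pairing (the struct and names are illustrative, not the scheduler's actual types):

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: a job joins a "mirror list" when it begins, and its finish
 * handler removes it again.  Registering the handler inside begin() keeps
 * the add and the matching remove at the same level, as the patch does. */
struct job {
	struct job *prev, *next;         /* node on the mirror list */
	void (*on_finish)(struct job *); /* registered when the job begins */
};

static struct job mirror = { &mirror, &mirror, NULL }; /* circular list head */

static void job_finish(struct job *j)
{
	/* remove from the mirror list (what amd_sched_job_finish models) */
	j->prev->next = j->next;
	j->next->prev = j->prev;
	j->next = j->prev = j;
}

static void job_begin(struct job *j)
{
	j->on_finish = job_finish; /* callback registered here...          */
	j->next = &mirror;         /* ...right next to the list insertion  */
	j->prev = mirror.prev;     /* it will later undo                   */
	mirror.prev->next = j;
	mirror.prev = j;
}
```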
* [PATCH 4/5] drm/amd/sched: NULL out the s_fence field after run_job [not found] ` <20170928145530.12844-1-nhaehnle-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2017-09-28 14:55 ` [PATCH 2/5] drm/amd/sched: fix an outdated comment Nicolai Hähnle 2017-09-28 14:55 ` [PATCH 3/5] drm/amd/sched: move adding finish callback to amd_sched_job_begin Nicolai Hähnle @ 2017-09-28 14:55 ` Nicolai Hähnle [not found] ` <20170928145530.12844-4-nhaehnle-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2017-09-28 14:55 ` [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini Nicolai Hähnle 3 siblings, 1 reply; 36+ messages in thread From: Nicolai Hähnle @ 2017-09-28 14:55 UTC (permalink / raw) To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Nicolai Hähnle From: Nicolai Hähnle <nicolai.haehnle@amd.com> amd_sched_process_job drops the fence reference, so NULL out the s_fence field before adding it as a callback to guard against accidentally using s_fence after it may have be freed. Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com> Acked-by: Christian König <christian.koenig@amd.com> --- drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c index e793312e351c..54eb77cffd9b 100644 --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c @@ -604,20 +604,23 @@ static int amd_sched_main(void *param) if (!sched_job) continue; s_fence = sched_job->s_fence; atomic_inc(&sched->hw_rq_count); amd_sched_job_begin(sched_job); fence = sched->ops->run_job(sched_job); amd_sched_fence_scheduled(s_fence); + + sched_job->s_fence = NULL; + if (fence) { s_fence->parent = dma_fence_get(fence); r = dma_fence_add_callback(fence, &s_fence->cb, amd_sched_process_job); if (r == -ENOENT) amd_sched_process_job(fence, &s_fence->cb); else if (r) DRM_ERROR("fence add callback failed (%d)\n", r); 
dma_fence_put(fence); -- 2.11.0
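The ownership rule this patch enforces can be modeled in user space: once the fence reference has been handed to the completion path, the job's pointer to it must no longer be trusted, so clearing the field turns any stale access into an obvious NULL dereference instead of a silent use-after-free. A sketch with hypothetical names (process_job stands in for amd_sched_process_job, which consumes the reference):

```c
#include <assert.h>
#include <stdlib.h>

struct fence {
	int refcount;
};

struct job {
	struct fence *s_fence;
};

static void fence_put(struct fence *f)
{
	if (--f->refcount == 0)
		free(f);
}

/* Models amd_sched_process_job: consumes (drops) the fence reference. */
static void process_job(struct fence *f)
{
	fence_put(f);
}

/* Models the relevant part of amd_sched_main: take a local copy of the
 * fence pointer, NULL the field, and only then hand the reference off. */
static void run_one_job(struct job *job)
{
	struct fence *s_fence = job->s_fence;

	job->s_fence = NULL;  /* the reference now travels via s_fence only */
	process_job(s_fence); /* ownership transferred; may free the fence  */
	assert(job->s_fence == NULL); /* stale field can no longer be used  */
}
```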
* Re: [PATCH 4/5] drm/amd/sched: NULL out the s_fence field after run_job [not found] ` <20170928145530.12844-4-nhaehnle-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> @ 2017-09-28 18:39 ` Andres Rodriguez [not found] ` <7064b408-60db-2817-0ae7-af6b2c56580b-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 0 siblings, 1 reply; 36+ messages in thread From: Andres Rodriguez @ 2017-09-28 18:39 UTC (permalink / raw) To: Nicolai Hähnle, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW Cc: Andres Rodriguez, Nicolai Hähnle On 2017-09-28 10:55 AM, Nicolai Hähnle wrote: > From: Nicolai Hähnle <nicolai.haehnle@amd.com> > > amd_sched_process_job drops the fence reference, so NULL out the s_fence > field before adding it as a callback to guard against accidentally using > s_fence after it may have be freed. > > Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com> > Acked-by: Christian König <christian.koenig@amd.com> > --- > drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c > index e793312e351c..54eb77cffd9b 100644 > --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c > +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c > @@ -604,20 +604,23 @@ static int amd_sched_main(void *param) > if (!sched_job) > continue; > > s_fence = sched_job->s_fence; > > atomic_inc(&sched->hw_rq_count); > amd_sched_job_begin(sched_job); > > fence = sched->ops->run_job(sched_job); > amd_sched_fence_scheduled(s_fence); > + > + sched_job->s_fence = NULL; Minor optional nitpick here. Could this be moved somewhere closer to where the fence reference is actually dropped? Alternatively, could a comment be added to specify which function call results in the reference ownership transfer? 
Whether a change is made or not, this series is Reviewed-by: Andres Rodriguez <andresx7@gmail.com> Currently running piglit to check if this fixes the occasional soft hangs I was getting where all tests complete except one. > + > if (fence) { > s_fence->parent = dma_fence_get(fence); > r = dma_fence_add_callback(fence, &s_fence->cb, > amd_sched_process_job); > if (r == -ENOENT) > amd_sched_process_job(fence, &s_fence->cb); > else if (r) > DRM_ERROR("fence add callback failed (%d)\n", > r); > dma_fence_put(fence); > _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 4/5] drm/amd/sched: NULL out the s_fence field after run_job [not found] ` <7064b408-60db-2817-0ae7-af6b2c56580b-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> @ 2017-09-28 19:04 ` Nicolai Hähnle 0 siblings, 0 replies; 36+ messages in thread From: Nicolai Hähnle @ 2017-09-28 19:04 UTC (permalink / raw) To: Andres Rodriguez, Nicolai Hähnle, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW On 28.09.2017 20:39, Andres Rodriguez wrote: > > > On 2017-09-28 10:55 AM, Nicolai Hähnle wrote: >> From: Nicolai Hähnle <nicolai.haehnle@amd.com> >> >> amd_sched_process_job drops the fence reference, so NULL out the s_fence >> field before adding it as a callback to guard against accidentally using >> s_fence after it may have be freed. >> >> Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com> >> Acked-by: Christian König <christian.koenig@amd.com> >> --- >> drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 3 +++ >> 1 file changed, 3 insertions(+) >> >> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >> index e793312e351c..54eb77cffd9b 100644 >> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >> @@ -604,20 +604,23 @@ static int amd_sched_main(void *param) >> if (!sched_job) >> continue; >> s_fence = sched_job->s_fence; >> atomic_inc(&sched->hw_rq_count); >> amd_sched_job_begin(sched_job); >> fence = sched->ops->run_job(sched_job); >> amd_sched_fence_scheduled(s_fence); >> + >> + sched_job->s_fence = NULL; > > Minor optional nitpick here. Could this be moved somewhere closer to > where the fence reference is actually dropped? Alternatively, could a > comment be added to specify which function call results in the reference > ownership transfer? Sure, I can add a comment. (It's amd_sched_process_job, which is called directly or indirectly in all the branches of the following if-statement.) 
> Whether a change is made or not, this series is > Reviewed-by: Andres Rodriguez <andresx7@gmail.com> Thanks. > Currently running piglit to check if this fixes the occasional soft > hangs I was getting where all tests complete except one. You may be running into this Mesa issue: https://patchwork.freedesktop.org/patch/179535/ Cheers, Nicolai > >> + >> if (fence) { >> s_fence->parent = dma_fence_get(fence); >> r = dma_fence_add_callback(fence, &s_fence->cb, >> amd_sched_process_job); >> if (r == -ENOENT) >> amd_sched_process_job(fence, &s_fence->cb); >> else if (r) >> DRM_ERROR("fence add callback failed (%d)\n", >> r); >> dma_fence_put(fence); >> _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini [not found] ` <20170928145530.12844-1-nhaehnle-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> ` (2 preceding siblings ...) 2017-09-28 14:55 ` [PATCH 4/5] drm/amd/sched: NULL out the s_fence field after run_job Nicolai Hähnle @ 2017-09-28 14:55 ` Nicolai Hähnle [not found] ` <20170928145530.12844-5-nhaehnle-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 3 siblings, 1 reply; 36+ messages in thread From: Nicolai Hähnle @ 2017-09-28 14:55 UTC (permalink / raw) To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Nicolai Hähnle From: Nicolai Hähnle <nicolai.haehnle@amd.com> Highly concurrent Piglit runs can trigger a race condition where a pending SDMA job on a buffer object is never executed because the corresponding process is killed (perhaps due to a crash). Since the job's fences were never signaled, the buffer object was effectively leaked. Worse, the buffer was stuck wherever it happened to be at the time, possibly in VRAM. 
The symptom was user space processes stuck in interruptible waits with kernel stacks like: [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250 [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0 [<ffffffffbc5e82d2>] reservation_object_wait_timeout_rcu+0x1c2/0x300 [<ffffffffc03ce56f>] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 [ttm] [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm] [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm] [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm] [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm] [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 [amdgpu] [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu] [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 [amdgpu] [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 [amdgpu] [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm] [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu] [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0 [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90 [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad [<ffffffffffffffff>] 0xffffffffffffffff Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com> Acked-by: Christian König <christian.koenig@amd.com> --- drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c index 54eb77cffd9b..32a99e980d78 100644 --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c @@ -220,22 +220,27 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched, amd_sched_entity_is_idle(entity)); amd_sched_rq_remove_entity(rq, entity); if (r) { struct amd_sched_job *job; /* Park the kernel for a moment to make sure it isn't processing * our enity. 
*/ kthread_park(sched->thread); kthread_unpark(sched->thread); - while (kfifo_out(&entity->job_queue, &job, sizeof(job))) + while (kfifo_out(&entity->job_queue, &job, sizeof(job))) { + struct amd_sched_fence *s_fence = job->s_fence; + amd_sched_fence_scheduled(s_fence); + amd_sched_fence_finished(s_fence); + dma_fence_put(&s_fence->finished); sched->ops->free_job(job); + } } kfifo_free(&entity->job_queue); } static void amd_sched_entity_wakeup(struct dma_fence *f, struct dma_fence_cb *cb) { struct amd_sched_entity *entity = container_of(cb, struct amd_sched_entity, cb); entity->dependency = NULL; -- 2.11.0 _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply related [flat|nested] 36+ messages in thread
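The core of the fix is the drain loop: every job still queued at teardown must have both its "scheduled" and "finished" fences signaled before being freed, otherwise anything waiting on those fences (such as the TTM eviction path in the stack trace above) blocks forever. A simplified user-space model of that loop — the types and helpers are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stddef.h>

struct fence { int scheduled; int finished; };
struct job   { struct fence *s_fence; };

static void fence_scheduled(struct fence *f) { f->scheduled = 1; }
static void fence_finished(struct fence *f)  { f->finished = 1; }

/* Drain a job queue on entity teardown: signal both fence stages for
 * every remaining job so waiters can make progress, then "free" it. */
static void entity_drain(struct job **queue, size_t n, int *freed)
{
	for (size_t i = 0; i < n; i++) {
		struct fence *s_fence = queue[i]->s_fence;

		fence_scheduled(s_fence); /* hardware will never run it...   */
		fence_finished(s_fence);  /* ...so complete it artificially  */
		(*freed)++;               /* stands in for ops->free_job()   */
	}
}
```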
* Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini [not found] ` <20170928145530.12844-5-nhaehnle-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> @ 2017-09-28 15:01 ` Christian König [not found] ` <3032bef3-4829-8cae-199a-11353b38c49a-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2017-09-28 18:30 ` Marek Olšák ` (2 subsequent siblings) 3 siblings, 1 reply; 36+ messages in thread From: Christian König @ 2017-09-28 15:01 UTC (permalink / raw) To: Nicolai Hähnle, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW Cc: Nicolai Hähnle Am 28.09.2017 um 16:55 schrieb Nicolai Hähnle: > From: Nicolai Hähnle <nicolai.haehnle@amd.com> > > Highly concurrent Piglit runs can trigger a race condition where a pending > SDMA job on a buffer object is never executed because the corresponding > process is killed (perhaps due to a crash). Since the job's fences were > never signaled, the buffer object was effectively leaked. Worse, the > buffer was stuck wherever it happened to be at the time, possibly in VRAM. 
> > The symptom was user space processes stuck in interruptible waits with > kernel stacks like: > > [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250 > [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0 > [<ffffffffbc5e82d2>] reservation_object_wait_timeout_rcu+0x1c2/0x300 > [<ffffffffc03ce56f>] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 [ttm] > [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm] > [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm] > [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm] > [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm] > [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 [amdgpu] > [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu] > [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 [amdgpu] > [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 [amdgpu] > [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm] > [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu] > [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0 > [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90 > [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad > [<ffffffffffffffff>] 0xffffffffffffffff > > Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com> > Acked-by: Christian König <christian.koenig@amd.com> > --- > drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 7 ++++++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c > index 54eb77cffd9b..32a99e980d78 100644 > --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c > +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c > @@ -220,22 +220,27 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched, > amd_sched_entity_is_idle(entity)); > amd_sched_rq_remove_entity(rq, entity); > if (r) { > struct amd_sched_job *job; > > /* Park the kernel for a moment to make sure it isn't processing > * our enity. 
> */ > kthread_park(sched->thread); > kthread_unpark(sched->thread); > - while (kfifo_out(&entity->job_queue, &job, sizeof(job))) > + while (kfifo_out(&entity->job_queue, &job, sizeof(job))) { > + struct amd_sched_fence *s_fence = job->s_fence; > + amd_sched_fence_scheduled(s_fence); It would be really nice to have an error code set on s_fence->finished before it is signaled, use dma_fence_set_error() for this. Additional to that it would be nice to note in the subject line that this is a rather important bug fix. With that fixed the whole series is Reviewed-by: Christian König <christian.koenig@amd.com>. Regards, Christian. > + amd_sched_fence_finished(s_fence); > + dma_fence_put(&s_fence->finished); > sched->ops->free_job(job); > + } > > } > kfifo_free(&entity->job_queue); > } > > static void amd_sched_entity_wakeup(struct dma_fence *f, struct dma_fence_cb *cb) > { > struct amd_sched_entity *entity = > container_of(cb, struct amd_sched_entity, cb); > entity->dependency = NULL; _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 36+ messages in thread
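The dma_fence_set_error() suggestion relies on an ordering contract: an error may be recorded only while the fence is still unsignaled, so that any waiter that observes the signaled state also observes the error. A user-space sketch of that contract (not the actual dma_fence implementation):

```c
#include <assert.h>
#include <errno.h>

struct fence {
	int error;    /* 0, or a negative errno recorded before signaling */
	int signaled;
};

/* Mirrors the dma_fence_set_error() contract: the error may only be set
 * while the fence is unsignaled, so whoever sees `signaled` also sees
 * the error that was recorded for it. */
static int fence_set_error(struct fence *f, int error)
{
	if (f->signaled)
		return -1; /* too late: contract violation */
	f->error = error;
	return 0;
}

static void fence_signal(struct fence *f)
{
	f->signaled = 1;
}
```

In the drain loop above, the error (e.g. -ECANCELED) would be set on s_fence->finished before amd_sched_fence_finished() signals it, letting user space distinguish "completed" from "aborted because the entity was torn down".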
* Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini [not found] ` <3032bef3-4829-8cae-199a-11353b38c49a-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> @ 2017-10-02 16:00 ` Tom St Denis 2017-10-09 6:42 ` Liu, Monk 1 sibling, 0 replies; 36+ messages in thread From: Tom St Denis @ 2017-10-02 16:00 UTC (permalink / raw) To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW Hi Nicolai, Ping? :-) Tom On 28/09/17 11:01 AM, Christian König wrote: > Am 28.09.2017 um 16:55 schrieb Nicolai Hähnle: >> From: Nicolai Hähnle <nicolai.haehnle@amd.com> >> >> Highly concurrent Piglit runs can trigger a race condition where a >> pending >> SDMA job on a buffer object is never executed because the corresponding >> process is killed (perhaps due to a crash). Since the job's fences were >> never signaled, the buffer object was effectively leaked. Worse, the >> buffer was stuck wherever it happened to be at the time, possibly in >> VRAM. >> >> The symptom was user space processes stuck in interruptible waits with >> kernel stacks like: >> >> [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250 >> [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0 >> [<ffffffffbc5e82d2>] reservation_object_wait_timeout_rcu+0x1c2/0x300 >> [<ffffffffc03ce56f>] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 [ttm] >> [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm] >> [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm] >> [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm] >> [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm] >> [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 >> [amdgpu] >> [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu] >> [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 [amdgpu] >> [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 [amdgpu] >> [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm] >> [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu] >> [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0 >> 
[<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90 >> [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad >> [<ffffffffffffffff>] 0xffffffffffffffff >> >> Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com> >> Acked-by: Christian König <christian.koenig@amd.com> >> --- >> drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 7 ++++++- >> 1 file changed, 6 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >> index 54eb77cffd9b..32a99e980d78 100644 >> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >> @@ -220,22 +220,27 @@ void amd_sched_entity_fini(struct >> amd_gpu_scheduler *sched, >> amd_sched_entity_is_idle(entity)); >> amd_sched_rq_remove_entity(rq, entity); >> if (r) { >> struct amd_sched_job *job; >> /* Park the kernel for a moment to make sure it isn't >> processing >> * our enity. >> */ >> kthread_park(sched->thread); >> kthread_unpark(sched->thread); >> - while (kfifo_out(&entity->job_queue, &job, sizeof(job))) >> + while (kfifo_out(&entity->job_queue, &job, sizeof(job))) { >> + struct amd_sched_fence *s_fence = job->s_fence; >> + amd_sched_fence_scheduled(s_fence); > > It would be really nice to have an error code set on s_fence->finished > before it is signaled, use dma_fence_set_error() for this. > > Additional to that it would be nice to note in the subject line that > this is a rather important bug fix. > > With that fixed the whole series is Reviewed-by: Christian König > <christian.koenig@amd.com>. > > Regards, > Christian. 
> >> + amd_sched_fence_finished(s_fence); >> + dma_fence_put(&s_fence->finished); >> sched->ops->free_job(job); >> + } >> } >> kfifo_free(&entity->job_queue); >> } >> static void amd_sched_entity_wakeup(struct dma_fence *f, struct >> dma_fence_cb *cb) >> { >> struct amd_sched_entity *entity = >> container_of(cb, struct amd_sched_entity, cb); >> entity->dependency = NULL; > > > _______________________________________________ > amd-gfx mailing list > amd-gfx@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/amd-gfx _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 36+ messages in thread
* RE: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini [not found] ` <3032bef3-4829-8cae-199a-11353b38c49a-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2017-10-02 16:00 ` Tom St Denis @ 2017-10-09 6:42 ` Liu, Monk [not found] ` <BLUPR12MB044904A26E01C265C49042E484740-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org> 1 sibling, 1 reply; 36+ messages in thread From: Liu, Monk @ 2017-10-09 6:42 UTC (permalink / raw) To: Koenig, Christian, Nicolai Hähnle, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW Cc: Haehnle, Nicolai Christian > It would be really nice to have an error code set on s_fence->finished before it is signaled, use dma_fence_set_error() for this. For gpu reset patches (already submitted to pub) I would make kernel return -ENODEV if the waiting fence (in cs_wait or wait_fences IOCTL) founded as error, that way UMD would run into robust extension path and considering the GPU hang occurred, Don't know if this is expected for the case of normal process being killed or crashed like Nicolai hit ... since there is no gpu hang hit BR Monk -----Original Message----- From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf Of Christian König Sent: September 28, 2017 23:01 To: Nicolai Hähnle <nhaehnle@gmail.com>; amd-gfx@lists.freedesktop.org Cc: Haehnle, Nicolai <Nicolai.Haehnle@amd.com> Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini Am 28.09.2017 um 16:55 schrieb Nicolai Hähnle: > From: Nicolai Hähnle <nicolai.haehnle@amd.com> > > Highly concurrent Piglit runs can trigger a race condition where a > pending SDMA job on a buffer object is never executed because the > corresponding process is killed (perhaps due to a crash). Since the > job's fences were never signaled, the buffer object was effectively > leaked. Worse, the buffer was stuck wherever it happened to be at the time, possibly in VRAM. 
> > The symptom was user space processes stuck in interruptible waits with > kernel stacks like: > > [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250 > [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0 > [<ffffffffbc5e82d2>] reservation_object_wait_timeout_rcu+0x1c2/0x300 > [<ffffffffc03ce56f>] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 [ttm] > [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm] > [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm] > [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm] > [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm] > [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 [amdgpu] > [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu] > [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 [amdgpu] > [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 [amdgpu] > [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm] > [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu] > [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0 > [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90 > [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad > [<ffffffffffffffff>] 0xffffffffffffffff > > Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com> > Acked-by: Christian König <christian.koenig@amd.com> > --- > drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 7 ++++++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c > b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c > index 54eb77cffd9b..32a99e980d78 100644 > --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c > +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c > @@ -220,22 +220,27 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched, > amd_sched_entity_is_idle(entity)); > amd_sched_rq_remove_entity(rq, entity); > if (r) { > struct amd_sched_job *job; > > /* Park the kernel for a moment to make sure it isn't processing > * our enity. 
> */ > kthread_park(sched->thread); > kthread_unpark(sched->thread); > - while (kfifo_out(&entity->job_queue, &job, sizeof(job))) > + while (kfifo_out(&entity->job_queue, &job, sizeof(job))) { > + struct amd_sched_fence *s_fence = job->s_fence; > + amd_sched_fence_scheduled(s_fence); It would be really nice to have an error code set on s_fence->finished before it is signaled, use dma_fence_set_error() for this. Additional to that it would be nice to note in the subject line that this is a rather important bug fix. With that fixed the whole series is Reviewed-by: Christian König <christian.koenig@amd.com>. Regards, Christian. > + amd_sched_fence_finished(s_fence); > + dma_fence_put(&s_fence->finished); > sched->ops->free_job(job); > + } > > } > kfifo_free(&entity->job_queue); > } > > static void amd_sched_entity_wakeup(struct dma_fence *f, struct dma_fence_cb *cb) > { > struct amd_sched_entity *entity = > container_of(cb, struct amd_sched_entity, cb); > entity->dependency = NULL; _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini [not found] ` <BLUPR12MB044904A26E01C265C49042E484740-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org> @ 2017-10-09 8:02 ` Christian König [not found] ` <11f21e54-16b8-68e4-c63e-d791ef8bbffa-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 0 siblings, 1 reply; 36+ messages in thread From: Christian König @ 2017-10-09 8:02 UTC (permalink / raw) To: Liu, Monk, Koenig, Christian, Nicolai Hähnle, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW Cc: Haehnle, Nicolai > For gpu reset patches (already submitted to pub) I would make kernel return -ENODEV if the waiting fence (in cs_wait or wait_fences IOCTL) founded as error, that way UMD would run into robust extension path and considering the GPU hang occurred, Well that is only closed source behavior which is completely irrelevant for upstream development. As far as I know we haven't pushed the change to return -ENODEV upstream. Regards, Christian. Am 09.10.2017 um 08:42 schrieb Liu, Monk: > Christian > >> It would be really nice to have an error code set on s_fence->finished before it is signaled, use dma_fence_set_error() for this. > For gpu reset patches (already submitted to pub) I would make kernel return -ENODEV if the waiting fence (in cs_wait or wait_fences IOCTL) founded as error, that way UMD would run into robust extension path and considering the GPU hang occurred, > > Don't know if this is expected for the case of normal process being killed or crashed like Nicolai hit ... 
since there is no gpu hang hit > > > BR Monk > > > > > -----Original Message----- > From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf Of Christian K?nig > Sent: 2017年9月28日 23:01 > To: Nicolai Hähnle <nhaehnle@gmail.com>; amd-gfx@lists.freedesktop.org > Cc: Haehnle, Nicolai <Nicolai.Haehnle@amd.com> > Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini > > Am 28.09.2017 um 16:55 schrieb Nicolai Hähnle: >> From: Nicolai Hähnle <nicolai.haehnle@amd.com> >> >> Highly concurrent Piglit runs can trigger a race condition where a >> pending SDMA job on a buffer object is never executed because the >> corresponding process is killed (perhaps due to a crash). Since the >> job's fences were never signaled, the buffer object was effectively >> leaked. Worse, the buffer was stuck wherever it happened to be at the time, possibly in VRAM. >> >> The symptom was user space processes stuck in interruptible waits with >> kernel stacks like: >> >> [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250 >> [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0 >> [<ffffffffbc5e82d2>] reservation_object_wait_timeout_rcu+0x1c2/0x300 >> [<ffffffffc03ce56f>] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 [ttm] >> [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm] >> [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm] >> [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm] >> [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm] >> [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 [amdgpu] >> [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu] >> [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 [amdgpu] >> [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 [amdgpu] >> [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm] >> [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu] >> [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0 >> [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90 >> 
[<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad >> [<ffffffffffffffff>] 0xffffffffffffffff >> >> Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com> >> Acked-by: Christian König <christian.koenig@amd.com> >> --- >> drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 7 ++++++- >> 1 file changed, 6 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >> index 54eb77cffd9b..32a99e980d78 100644 >> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >> @@ -220,22 +220,27 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched, >> amd_sched_entity_is_idle(entity)); >> amd_sched_rq_remove_entity(rq, entity); >> if (r) { >> struct amd_sched_job *job; >> >> /* Park the kernel for a moment to make sure it isn't processing >> * our enity. >> */ >> kthread_park(sched->thread); >> kthread_unpark(sched->thread); >> - while (kfifo_out(&entity->job_queue, &job, sizeof(job))) >> + while (kfifo_out(&entity->job_queue, &job, sizeof(job))) { >> + struct amd_sched_fence *s_fence = job->s_fence; >> + amd_sched_fence_scheduled(s_fence); > It would be really nice to have an error code set on s_fence->finished before it is signaled, use dma_fence_set_error() for this. > > Additional to that it would be nice to note in the subject line that this is a rather important bug fix. > > With that fixed the whole series is Reviewed-by: Christian König <christian.koenig@amd.com>. > > Regards, > Christian. 
> >> + amd_sched_fence_finished(s_fence); >> + dma_fence_put(&s_fence->finished); >> sched->ops->free_job(job); >> + } >> >> } >> kfifo_free(&entity->job_queue); >> } >> >> static void amd_sched_entity_wakeup(struct dma_fence *f, struct dma_fence_cb *cb) >> { >> struct amd_sched_entity *entity = >> container_of(cb, struct amd_sched_entity, cb); >> entity->dependency = NULL; > > _______________________________________________ > amd-gfx mailing list > amd-gfx@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/amd-gfx > _______________________________________________ > amd-gfx mailing list > amd-gfx@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/amd-gfx _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini [not found] ` <11f21e54-16b8-68e4-c63e-d791ef8bbffa-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> @ 2017-10-09 10:14 ` Nicolai Hähnle [not found] ` <d0f66c04-fbcd-09a2-6e4c-9de9ca7a93ff-5C7GfCeVMHo@public.gmane.org> 0 siblings, 1 reply; 36+ messages in thread From: Nicolai Hähnle @ 2017-10-09 10:14 UTC (permalink / raw) To: christian.koenig-5C7GfCeVMHo, Liu, Monk, Nicolai Hähnle, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW On 09.10.2017 10:02, Christian König wrote: >> For gpu reset patches (already submitted to pub) I would make kernel >> return -ENODEV if the waiting fence (in cs_wait or wait_fences IOCTL) >> founded as error, that way UMD would run into robust extension path >> and considering the GPU hang occurred, > Well that is only closed source behavior which is completely irrelevant > for upstream development. > > As far as I know we haven't pushed the change to return -ENODEV upstream. FWIW, radeonsi currently expects -ECANCELED on CS submissions and treats those as context lost. Perhaps we could use the same error on fences? That makes more sense to me than -ENODEV. Cheers, Nicolai > > Regards, > Christian. > > Am 09.10.2017 um 08:42 schrieb Liu, Monk: >> Christian >> >>> It would be really nice to have an error code set on >>> s_fence->finished before it is signaled, use dma_fence_set_error() >>> for this. >> For gpu reset patches (already submitted to pub) I would make kernel >> return -ENODEV if the waiting fence (in cs_wait or wait_fences IOCTL) >> founded as error, that way UMD would run into robust extension path >> and considering the GPU hang occurred, >> >> Don't know if this is expected for the case of normal process being >> killed or crashed like Nicolai hit ... 
since there is no gpu hang hit >> >> >> BR Monk >> >> >> >> >> -----Original Message----- >> From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf >> Of Christian K?nig >> Sent: 2017年9月28日 23:01 >> To: Nicolai Hähnle <nhaehnle@gmail.com>; amd-gfx@lists.freedesktop.org >> Cc: Haehnle, Nicolai <Nicolai.Haehnle@amd.com> >> Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining >> fences in amd_sched_entity_fini >> >> Am 28.09.2017 um 16:55 schrieb Nicolai Hähnle: >>> From: Nicolai Hähnle <nicolai.haehnle@amd.com> >>> >>> Highly concurrent Piglit runs can trigger a race condition where a >>> pending SDMA job on a buffer object is never executed because the >>> corresponding process is killed (perhaps due to a crash). Since the >>> job's fences were never signaled, the buffer object was effectively >>> leaked. Worse, the buffer was stuck wherever it happened to be at the >>> time, possibly in VRAM. >>> >>> The symptom was user space processes stuck in interruptible waits with >>> kernel stacks like: >>> >>> [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250 >>> [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0 >>> [<ffffffffbc5e82d2>] >>> reservation_object_wait_timeout_rcu+0x1c2/0x300 >>> [<ffffffffc03ce56f>] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 >>> [ttm] >>> [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm] >>> [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm] >>> [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm] >>> [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm] >>> [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 >>> [amdgpu] >>> [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu] >>> [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 [amdgpu] >>> [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 [amdgpu] >>> [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm] >>> [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu] >>> [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0 
>>> [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90 >>> [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad >>> [<ffffffffffffffff>] 0xffffffffffffffff >>> >>> Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com> >>> Acked-by: Christian König <christian.koenig@amd.com> >>> --- >>> drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 7 ++++++- >>> 1 file changed, 6 insertions(+), 1 deletion(-) >>> >>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>> index 54eb77cffd9b..32a99e980d78 100644 >>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>> @@ -220,22 +220,27 @@ void amd_sched_entity_fini(struct >>> amd_gpu_scheduler *sched, >>> amd_sched_entity_is_idle(entity)); >>> amd_sched_rq_remove_entity(rq, entity); >>> if (r) { >>> struct amd_sched_job *job; >>> /* Park the kernel for a moment to make sure it isn't >>> processing >>> * our enity. >>> */ >>> kthread_park(sched->thread); >>> kthread_unpark(sched->thread); >>> - while (kfifo_out(&entity->job_queue, &job, sizeof(job))) >>> + while (kfifo_out(&entity->job_queue, &job, sizeof(job))) { >>> + struct amd_sched_fence *s_fence = job->s_fence; >>> + amd_sched_fence_scheduled(s_fence); >> It would be really nice to have an error code set on s_fence->finished >> before it is signaled, use dma_fence_set_error() for this. >> >> Additional to that it would be nice to note in the subject line that >> this is a rather important bug fix. >> >> With that fixed the whole series is Reviewed-by: Christian König >> <christian.koenig@amd.com>. >> >> Regards, >> Christian. 
>> >>> + amd_sched_fence_finished(s_fence); >>> + dma_fence_put(&s_fence->finished); >>> sched->ops->free_job(job); >>> + } >>> } >>> kfifo_free(&entity->job_queue); >>> } >>> static void amd_sched_entity_wakeup(struct dma_fence *f, struct >>> dma_fence_cb *cb) >>> { >>> struct amd_sched_entity *entity = >>> container_of(cb, struct amd_sched_entity, cb); >>> entity->dependency = NULL; >> >> _______________________________________________ >> amd-gfx mailing list >> amd-gfx@lists.freedesktop.org >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >> _______________________________________________ >> amd-gfx mailing list >> amd-gfx@lists.freedesktop.org >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx > > _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 36+ messages in thread
* RE: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini [not found] ` <d0f66c04-fbcd-09a2-6e4c-9de9ca7a93ff-5C7GfCeVMHo@public.gmane.org> @ 2017-10-09 10:35 ` Liu, Monk [not found] ` <BLUPR12MB044925932C8D956F93CAF93E84740-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org> 0 siblings, 1 reply; 36+ messages in thread From: Liu, Monk @ 2017-10-09 10:35 UTC (permalink / raw) To: Haehnle, Nicolai, Koenig, Christian, Nicolai Hähnle, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Olsak, Marek Cc: Li, Bingley Hi Christian You reject some of my patches that returns -ENODEV, with the cause that MESA doesn't do the handling on -ENODEV But if Nicolai can confirm that MESA do have a handling on -ECANCELED, then we need to overall align our error code, on detail below IOCTL can return error code: Amdgpu_cs_ioctl Amdgpu_cs_wait_ioctl Amdgpu_cs_wait_fences_ioctl Amdgpu_info_ioctl My patches is: return -ENODEV on cs_ioctl if the context is detected guilty, also return -ENODEV on cs_wait|cs_wait_fences if the fence is signaled but with error -ETIME, also return -ENODEV on info_ioctl so UMD can query if gpu reset happened after the process created (because for strict mode we block process instead of context) according to Nicolai: amdgpu_cs_ioctl *can* return -ECANCELED, but to be frankly speaking, kernel part doesn't have any place with "-ECANCELED" so this solution on MESA side doesn't align with *current* amdgpu driver, which only return 0 on success or -EINVALID on other error but definitely no "-ECANCELED" error code, so if we talking about community rules we shouldn't let MESA handle -ECANCELED , we should have a unified error code + Marek BR Monk -----Original Message----- From: Haehnle, Nicolai Sent: 2017年10月9日 18:14 To: Koenig, Christian <Christian.Koenig@amd.com>; Liu, Monk <Monk.Liu@amd.com>; Nicolai Hähnle <nhaehnle@gmail.com>; amd-gfx@lists.freedesktop.org Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free 
remaining fences in amd_sched_entity_fini On 09.10.2017 10:02, Christian König wrote: >> For gpu reset patches (already submitted to pub) I would make kernel >> return -ENODEV if the waiting fence (in cs_wait or wait_fences IOCTL) >> founded as error, that way UMD would run into robust extension path >> and considering the GPU hang occurred, > Well that is only closed source behavior which is completely > irrelevant for upstream development. > > As far as I know we haven't pushed the change to return -ENODEV upstream. FWIW, radeonsi currently expects -ECANCELED on CS submissions and treats those as context lost. Perhaps we could use the same error on fences? That makes more sense to me than -ENODEV. Cheers, Nicolai > > Regards, > Christian. > > Am 09.10.2017 um 08:42 schrieb Liu, Monk: >> Christian >> >>> It would be really nice to have an error code set on >>> s_fence->finished before it is signaled, use dma_fence_set_error() >>> for this. >> For gpu reset patches (already submitted to pub) I would make kernel >> return -ENODEV if the waiting fence (in cs_wait or wait_fences IOCTL) >> founded as error, that way UMD would run into robust extension path >> and considering the GPU hang occurred, >> >> Don't know if this is expected for the case of normal process being >> killed or crashed like Nicolai hit ... 
since there is no gpu hang hit >> >> >> BR Monk >> >> >> >> >> -----Original Message----- >> From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On >> Behalf Of Christian K?nig >> Sent: 2017年9月28日 23:01 >> To: Nicolai Hähnle <nhaehnle@gmail.com>; >> amd-gfx@lists.freedesktop.org >> Cc: Haehnle, Nicolai <Nicolai.Haehnle@amd.com> >> Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining >> fences in amd_sched_entity_fini >> >> Am 28.09.2017 um 16:55 schrieb Nicolai Hähnle: >>> From: Nicolai Hähnle <nicolai.haehnle@amd.com> >>> >>> Highly concurrent Piglit runs can trigger a race condition where a >>> pending SDMA job on a buffer object is never executed because the >>> corresponding process is killed (perhaps due to a crash). Since the >>> job's fences were never signaled, the buffer object was effectively >>> leaked. Worse, the buffer was stuck wherever it happened to be at >>> the time, possibly in VRAM. >>> >>> The symptom was user space processes stuck in interruptible waits >>> with kernel stacks like: >>> >>> [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250 >>> [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0 >>> [<ffffffffbc5e82d2>] >>> reservation_object_wait_timeout_rcu+0x1c2/0x300 >>> [<ffffffffc03ce56f>] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 >>> [ttm] >>> [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm] >>> [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm] >>> [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm] >>> [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm] >>> [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 >>> [amdgpu] >>> [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu] >>> [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 >>> [amdgpu] >>> [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 >>> [amdgpu] >>> [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm] >>> [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu] >>> [<ffffffffbc23db33>] 
do_vfs_ioctl+0xa3/0x5f0 >>> [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90 >>> [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad >>> [<ffffffffffffffff>] 0xffffffffffffffff >>> >>> Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com> >>> Acked-by: Christian König <christian.koenig@amd.com> >>> --- >>> drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 7 ++++++- >>> 1 file changed, 6 insertions(+), 1 deletion(-) >>> >>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>> index 54eb77cffd9b..32a99e980d78 100644 >>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>> @@ -220,22 +220,27 @@ void amd_sched_entity_fini(struct >>> amd_gpu_scheduler *sched, >>> amd_sched_entity_is_idle(entity)); >>> amd_sched_rq_remove_entity(rq, entity); >>> if (r) { >>> struct amd_sched_job *job; >>> /* Park the kernel for a moment to make sure it isn't >>> processing >>> * our enity. >>> */ >>> kthread_park(sched->thread); >>> kthread_unpark(sched->thread); >>> - while (kfifo_out(&entity->job_queue, &job, sizeof(job))) >>> + while (kfifo_out(&entity->job_queue, &job, sizeof(job))) { >>> + struct amd_sched_fence *s_fence = job->s_fence; >>> + amd_sched_fence_scheduled(s_fence); >> It would be really nice to have an error code set on >> s_fence->finished before it is signaled, use dma_fence_set_error() for this. >> >> Additional to that it would be nice to note in the subject line that >> this is a rather important bug fix. >> >> With that fixed the whole series is Reviewed-by: Christian König >> <christian.koenig@amd.com>. >> >> Regards, >> Christian. 
>> >>> + amd_sched_fence_finished(s_fence); >>> + dma_fence_put(&s_fence->finished); >>> sched->ops->free_job(job); >>> + } >>> } >>> kfifo_free(&entity->job_queue); >>> } >>> static void amd_sched_entity_wakeup(struct dma_fence *f, struct >>> dma_fence_cb *cb) >>> { >>> struct amd_sched_entity *entity = >>> container_of(cb, struct amd_sched_entity, cb); >>> entity->dependency = NULL; >> >> _______________________________________________ >> amd-gfx mailing list >> amd-gfx@lists.freedesktop.org >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >> _______________________________________________ >> amd-gfx mailing list >> amd-gfx@lists.freedesktop.org >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx > > _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 36+ messages in thread
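Monk's proposal above amounts to a small decision rule on the submission path. The sketch below is a self-contained toy (the names toy_ctx and toy_cs_submit are assumptions, not the actual amdgpu ioctl code) modeling the behavior he describes: once a context is marked guilty after a hang, further CS submissions fail with -ENODEV.

```c
#include <errno.h>

/* Toy context state, an assumption standing in for what Monk's
 * (at this point out-of-tree) GPU-reset patches track per context. */
struct toy_ctx {
	int guilty;      /* set when this context caused a GPU hang */
	int reset_count; /* bumped on every GPU reset, queryable via info ioctl */
};

/* Models the proposal: cs_ioctl rejects a guilty context with -ENODEV;
 * the wait ioctls would likewise turn a fence error after a hang into
 * -ENODEV. Whether -ENODEV or -ECANCELED is the right code is exactly
 * what this thread is debating. */
static int toy_cs_submit(struct toy_ctx *ctx)
{
	if (ctx->guilty)
		return -ENODEV;
	return 0; /* submission accepted */
}
```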
* Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini [not found] ` <BLUPR12MB044925932C8D956F93CAF93E84740-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org> @ 2017-10-09 10:49 ` Nicolai Hähnle [not found] ` <7e338e23-540c-4e2e-982f-f0eb623c75b1-5C7GfCeVMHo@public.gmane.org> 0 siblings, 1 reply; 36+ messages in thread From: Nicolai Hähnle @ 2017-10-09 10:49 UTC (permalink / raw) To: Liu, Monk, Koenig, Christian, Nicolai Hähnle, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Olsak, Marek Cc: Li, Bingley Hi Monk, Yes, you're right, we're only using ECANCELED internally. But as a consequence, Mesa would already handle a kernel error of ECANCELED on context loss correctly :) Cheers, Nicolai On 09.10.2017 12:35, Liu, Monk wrote: > Hi Christian > > You reject some of my patches that returns -ENODEV, with the cause that MESA doesn't do the handling on -ENODEV > > But if Nicolai can confirm that MESA do have a handling on -ECANCELED, then we need to overall align our error code, on detail below IOCTL can return error code: > > Amdgpu_cs_ioctl > Amdgpu_cs_wait_ioctl > Amdgpu_cs_wait_fences_ioctl > Amdgpu_info_ioctl > > > My patches is: > return -ENODEV on cs_ioctl if the context is detected guilty, > also return -ENODEV on cs_wait|cs_wait_fences if the fence is signaled but with error -ETIME, > also return -ENODEV on info_ioctl so UMD can query if gpu reset happened after the process created (because for strict mode we block process instead of context) > > > according to Nicolai: > > amdgpu_cs_ioctl *can* return -ECANCELED, but to be frankly speaking, kernel part doesn't have any place with "-ECANCELED" so this solution on MESA side doesn't align with *current* amdgpu driver, > which only return 0 on success or -EINVALID on other error but definitely no "-ECANCELED" error code, > > so if we talking about community rules we shouldn't let MESA handle -ECANCELED , we should have a unified error code > > + Marek > > BR Monk 
> > > > > > -----Original Message----- > From: Haehnle, Nicolai > Sent: 2017年10月9日 18:14 > To: Koenig, Christian <Christian.Koenig@amd.com>; Liu, Monk <Monk.Liu@amd.com>; Nicolai Hähnle <nhaehnle@gmail.com>; amd-gfx@lists.freedesktop.org > Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini > > On 09.10.2017 10:02, Christian König wrote: >>> For gpu reset patches (already submitted to pub) I would make kernel >>> return -ENODEV if the waiting fence (in cs_wait or wait_fences IOCTL) >>> founded as error, that way UMD would run into robust extension path >>> and considering the GPU hang occurred, >> Well that is only closed source behavior which is completely >> irrelevant for upstream development. >> >> As far as I know we haven't pushed the change to return -ENODEV upstream. > > FWIW, radeonsi currently expects -ECANCELED on CS submissions and treats those as context lost. Perhaps we could use the same error on fences? > That makes more sense to me than -ENODEV. > > Cheers, > Nicolai > >> >> Regards, >> Christian. >> >> Am 09.10.2017 um 08:42 schrieb Liu, Monk: >>> Christian >>> >>>> It would be really nice to have an error code set on >>>> s_fence->finished before it is signaled, use dma_fence_set_error() >>>> for this. >>> For gpu reset patches (already submitted to pub) I would make kernel >>> return -ENODEV if the waiting fence (in cs_wait or wait_fences IOCTL) >>> founded as error, that way UMD would run into robust extension path >>> and considering the GPU hang occurred, >>> >>> Don't know if this is expected for the case of normal process being >>> killed or crashed like Nicolai hit ... 
since there is no gpu hang hit >>> >>> >>> BR Monk >>> >>> >>> >>> >>> -----Original Message----- >>> From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On >>> Behalf Of Christian K?nig >>> Sent: 2017年9月28日 23:01 >>> To: Nicolai Hähnle <nhaehnle@gmail.com>; >>> amd-gfx@lists.freedesktop.org >>> Cc: Haehnle, Nicolai <Nicolai.Haehnle@amd.com> >>> Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining >>> fences in amd_sched_entity_fini >>> >>> Am 28.09.2017 um 16:55 schrieb Nicolai Hähnle: >>>> From: Nicolai Hähnle <nicolai.haehnle@amd.com> >>>> >>>> Highly concurrent Piglit runs can trigger a race condition where a >>>> pending SDMA job on a buffer object is never executed because the >>>> corresponding process is killed (perhaps due to a crash). Since the >>>> job's fences were never signaled, the buffer object was effectively >>>> leaked. Worse, the buffer was stuck wherever it happened to be at >>>> the time, possibly in VRAM. >>>> >>>> The symptom was user space processes stuck in interruptible waits >>>> with kernel stacks like: >>>> >>>> [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250 >>>> [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0 >>>> [<ffffffffbc5e82d2>] >>>> reservation_object_wait_timeout_rcu+0x1c2/0x300 >>>> [<ffffffffc03ce56f>] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 >>>> [ttm] >>>> [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm] >>>> [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm] >>>> [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm] >>>> [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm] >>>> [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 >>>> [amdgpu] >>>> [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu] >>>> [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 >>>> [amdgpu] >>>> [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 >>>> [amdgpu] >>>> [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm] >>>> [<ffffffffc041904f>] 
amdgpu_drm_ioctl+0x4f/0x90 [amdgpu] >>>> [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0 >>>> [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90 >>>> [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad >>>> [<ffffffffffffffff>] 0xffffffffffffffff >>>> >>>> Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com> >>>> Acked-by: Christian König <christian.koenig@amd.com> >>>> --- >>>> drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 7 ++++++- >>>> 1 file changed, 6 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>> index 54eb77cffd9b..32a99e980d78 100644 >>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>> @@ -220,22 +220,27 @@ void amd_sched_entity_fini(struct >>>> amd_gpu_scheduler *sched, >>>> amd_sched_entity_is_idle(entity)); >>>> amd_sched_rq_remove_entity(rq, entity); >>>> if (r) { >>>> struct amd_sched_job *job; >>>> /* Park the kernel for a moment to make sure it isn't >>>> processing >>>> * our enity. >>>> */ >>>> kthread_park(sched->thread); >>>> kthread_unpark(sched->thread); >>>> - while (kfifo_out(&entity->job_queue, &job, sizeof(job))) >>>> + while (kfifo_out(&entity->job_queue, &job, sizeof(job))) { >>>> + struct amd_sched_fence *s_fence = job->s_fence; >>>> + amd_sched_fence_scheduled(s_fence); >>> It would be really nice to have an error code set on >>> s_fence->finished before it is signaled, use dma_fence_set_error() for this. >>> >>> Additional to that it would be nice to note in the subject line that >>> this is a rather important bug fix. >>> >>> With that fixed the whole series is Reviewed-by: Christian König >>> <christian.koenig@amd.com>. >>> >>> Regards, >>> Christian. 
>>> >>>> + amd_sched_fence_finished(s_fence); >>>> + dma_fence_put(&s_fence->finished); >>>> sched->ops->free_job(job); >>>> + } >>>> } >>>> kfifo_free(&entity->job_queue); >>>> } >>>> static void amd_sched_entity_wakeup(struct dma_fence *f, struct >>>> dma_fence_cb *cb) >>>> { >>>> struct amd_sched_entity *entity = >>>> container_of(cb, struct amd_sched_entity, cb); >>>> entity->dependency = NULL; >>> >>> _______________________________________________ >>> amd-gfx mailing list >>> amd-gfx@lists.freedesktop.org >>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >>> _______________________________________________ >>> amd-gfx mailing list >>> amd-gfx@lists.freedesktop.org >>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >> >> > _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 36+ messages in thread
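On the user-space side, Nicolai's point is that radeonsi already maps -ECANCELED from a CS submission to its lost-context (robustness) path. A minimal sketch of that classification, as a hypothetical helper rather than actual Mesa code:

```c
#include <errno.h>

enum toy_cs_status {
	TOY_CS_OK,
	TOY_CS_CONTEXT_LOST, /* robustness path: report loss, recreate context */
	TOY_CS_OTHER_ERROR,
};

/* Hypothetical classifier modeling the behavior described in the
 * thread: -ECANCELED on a CS submission is treated as context loss,
 * any other nonzero return as an ordinary submission error. */
static enum toy_cs_status toy_classify_cs_result(int ret)
{
	if (ret == 0)
		return TOY_CS_OK;
	if (ret == -ECANCELED)
		return TOY_CS_CONTEXT_LOST;
	return TOY_CS_OTHER_ERROR;
}
```

If the kernel signaled canceled jobs' fences with -ECANCELED as well, this one path would cover both GPU resets and the process-killed case from the patch, which is the alignment the thread converges toward.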
* Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini [not found] ` <7e338e23-540c-4e2e-982f-f0eb623c75b1-5C7GfCeVMHo@public.gmane.org> @ 2017-10-09 10:59 ` Christian König [not found] ` <760c1434-0739-81ff-82c3-a5210c5575d3-5C7GfCeVMHo@public.gmane.org> 0 siblings, 1 reply; 36+ messages in thread From: Christian König @ 2017-10-09 10:59 UTC (permalink / raw) To: Nicolai Hähnle, Liu, Monk, Nicolai Hähnle, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Olsak, Marek Cc: Li, Bingley Hi Nicolai & Monk, well that sounds like exactly what I wanted to hear. When Mesa already handles ECANCELED gracefully as return code for a lost context then I think we can move forward with that. Monk, would that be ok with you? Nicolai, how hard would it be to handle ENODEV as failure for all currently existing contexts? Monk, would it be ok with you when we return ENODEV only for existing context when VRAM is lost and/or we have a strict mode GPU reset? E.g. newly created contexts would continue work as they should. Regards, Christian. Am 09.10.2017 um 12:49 schrieb Nicolai Hähnle: > Hi Monk, > > Yes, you're right, we're only using ECANCELED internally. 
But as a > consequence, Mesa would already handle a kernel error of ECANCELED on > context loss correctly :) > > Cheers, > Nicolai > > On 09.10.2017 12:35, Liu, Monk wrote: >> Hi Christian >> >> You reject some of my patches that returns -ENODEV, with the cause >> that MESA doesn't do the handling on -ENODEV >> >> But if Nicolai can confirm that MESA do have a handling on >> -ECANCELED, then we need to overall align our error code, on detail >> below IOCTL can return error code: >> >> Amdgpu_cs_ioctl >> Amdgpu_cs_wait_ioctl >> Amdgpu_cs_wait_fences_ioctl >> Amdgpu_info_ioctl >> >> >> My patches is: >> return -ENODEV on cs_ioctl if the context is detected guilty, >> also return -ENODEV on cs_wait|cs_wait_fences if the fence is >> signaled but with error -ETIME, >> also return -ENODEV on info_ioctl so UMD can query if gpu reset >> happened after the process created (because for strict mode we block >> process instead of context) >> >> >> according to Nicolai: >> >> amdgpu_cs_ioctl *can* return -ECANCELED, but to be frankly speaking, >> kernel part doesn't have any place with "-ECANCELED" so this solution >> on MESA side doesn't align with *current* amdgpu driver, >> which only return 0 on success or -EINVALID on other error but >> definitely no "-ECANCELED" error code, >> >> so if we talking about community rules we shouldn't let MESA handle >> -ECANCELED , we should have a unified error code >> >> + Marek >> >> BR Monk >> >> >> >> >> -----Original Message----- >> From: Haehnle, Nicolai >> Sent: 2017年10月9日 18:14 >> To: Koenig, Christian <Christian.Koenig@amd.com>; Liu, Monk >> <Monk.Liu@amd.com>; Nicolai Hähnle <nhaehnle@gmail.com>; >> amd-gfx@lists.freedesktop.org >> Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining >> fences in amd_sched_entity_fini >> >> On 09.10.2017 10:02, Christian König wrote: >>>> For gpu reset patches (already submitted to pub) I would make kernel >>>> return -ENODEV if the waiting fence (in cs_wait or wait_fences IOCTL) >>>> 
founded as error, that way UMD would run into robust extension path >>>> and considering the GPU hang occurred, >>> Well that is only closed source behavior which is completely >>> irrelevant for upstream development. >>> >>> As far as I know we haven't pushed the change to return -ENODEV >>> upstream. >> >> FWIW, radeonsi currently expects -ECANCELED on CS submissions and >> treats those as context lost. Perhaps we could use the same error on >> fences? >> That makes more sense to me than -ENODEV. >> >> Cheers, >> Nicolai >> >>> >>> Regards, >>> Christian. >>> >>> Am 09.10.2017 um 08:42 schrieb Liu, Monk: >>>> Christian >>>> >>>>> It would be really nice to have an error code set on >>>>> s_fence->finished before it is signaled, use dma_fence_set_error() >>>>> for this. >>>> For gpu reset patches (already submitted to pub) I would make kernel >>>> return -ENODEV if the waiting fence (in cs_wait or wait_fences IOCTL) >>>> founded as error, that way UMD would run into robust extension path >>>> and considering the GPU hang occurred, >>>> >>>> Don't know if this is expected for the case of normal process being >>>> killed or crashed like Nicolai hit ... since there is no gpu hang hit >>>> >>>> >>>> BR Monk >>>> >>>> >>>> >>>> >>>> -----Original Message----- >>>> From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On >>>> Behalf Of Christian K?nig >>>> Sent: 2017年9月28日 23:01 >>>> To: Nicolai Hähnle <nhaehnle@gmail.com>; >>>> amd-gfx@lists.freedesktop.org >>>> Cc: Haehnle, Nicolai <Nicolai.Haehnle@amd.com> >>>> Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining >>>> fences in amd_sched_entity_fini >>>> >>>> Am 28.09.2017 um 16:55 schrieb Nicolai Hähnle: >>>>> From: Nicolai Hähnle <nicolai.haehnle@amd.com> >>>>> >>>>> Highly concurrent Piglit runs can trigger a race condition where a >>>>> pending SDMA job on a buffer object is never executed because the >>>>> corresponding process is killed (perhaps due to a crash). 
Since the >>>>> job's fences were never signaled, the buffer object was effectively >>>>> leaked. Worse, the buffer was stuck wherever it happened to be at >>>>> the time, possibly in VRAM. >>>>> >>>>> The symptom was user space processes stuck in interruptible waits >>>>> with kernel stacks like: >>>>> >>>>> [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250 >>>>> [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0 >>>>> [<ffffffffbc5e82d2>] >>>>> reservation_object_wait_timeout_rcu+0x1c2/0x300 >>>>> [<ffffffffc03ce56f>] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 >>>>> [ttm] >>>>> [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm] >>>>> [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm] >>>>> [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm] >>>>> [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm] >>>>> [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 >>>>> [amdgpu] >>>>> [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu] >>>>> [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 >>>>> [amdgpu] >>>>> [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 >>>>> [amdgpu] >>>>> [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm] >>>>> [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu] >>>>> [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0 >>>>> [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90 >>>>> [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad >>>>> [<ffffffffffffffff>] 0xffffffffffffffff >>>>> >>>>> Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com> >>>>> Acked-by: Christian König <christian.koenig@amd.com> >>>>> --- >>>>> drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 7 ++++++- >>>>> 1 file changed, 6 insertions(+), 1 deletion(-) >>>>> >>>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>> index 54eb77cffd9b..32a99e980d78 100644 >>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>> +++ 
b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>> @@ -220,22 +220,27 @@ void amd_sched_entity_fini(struct >>>>> amd_gpu_scheduler *sched, >>>>> amd_sched_entity_is_idle(entity)); >>>>> amd_sched_rq_remove_entity(rq, entity); >>>>> if (r) { >>>>> struct amd_sched_job *job; >>>>> /* Park the kernel for a moment to make sure it isn't >>>>> processing >>>>> * our enity. >>>>> */ >>>>> kthread_park(sched->thread); >>>>> kthread_unpark(sched->thread); >>>>> - while (kfifo_out(&entity->job_queue, &job, sizeof(job))) >>>>> + while (kfifo_out(&entity->job_queue, &job, sizeof(job))) { >>>>> + struct amd_sched_fence *s_fence = job->s_fence; >>>>> + amd_sched_fence_scheduled(s_fence); >>>> It would be really nice to have an error code set on >>>> s_fence->finished before it is signaled, use dma_fence_set_error() >>>> for this. >>>> >>>> Additional to that it would be nice to note in the subject line that >>>> this is a rather important bug fix. >>>> >>>> With that fixed the whole series is Reviewed-by: Christian König >>>> <christian.koenig@amd.com>. >>>> >>>> Regards, >>>> Christian. 
>>>> >>>>> + amd_sched_fence_finished(s_fence); >>>>> + dma_fence_put(&s_fence->finished); >>>>> sched->ops->free_job(job); >>>>> + } >>>>> } >>>>> kfifo_free(&entity->job_queue); >>>>> } >>>>> static void amd_sched_entity_wakeup(struct dma_fence *f, struct >>>>> dma_fence_cb *cb) >>>>> { >>>>> struct amd_sched_entity *entity = >>>>> container_of(cb, struct amd_sched_entity, cb); >>>>> entity->dependency = NULL; >>>> >>>> _______________________________________________ >>>> amd-gfx mailing list >>>> amd-gfx@lists.freedesktop.org >>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >>>> _______________________________________________ >>>> amd-gfx mailing list >>>> amd-gfx@lists.freedesktop.org >>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >>> >>> >> > _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 36+ messages in thread
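[Editorial sketch] Christian's review comment above — set an error on s_fence->finished *before* signaling it when draining unprocessed jobs — can be illustrated with a self-contained mock. The types and helpers below are stand-ins for the real dma_fence/scheduler code (which needs kernel headers), but the ordering constraint they encode is the real one: dma_fence_set_error() is only valid on a fence that has not yet signaled.

```c
#include <assert.h>
#include <errno.h>

/* Mock stand-ins for struct dma_fence and its helpers; names mirror the
 * patch under review but this is not the real kernel API. */
struct mock_fence {
	int error;    /* dma_fence_set_error() stores a negative errno here */
	int signaled;
};

static void mock_fence_set_error(struct mock_fence *f, int error)
{
	/* Like dma_fence_set_error(): must happen before the fence signals,
	 * so that waiters observe the error together with the signal. */
	assert(!f->signaled);
	f->error = error;
}

static void mock_fence_signal(struct mock_fence *f)
{
	f->signaled = 1;
}

/* One iteration of the kfifo drain loop in amd_sched_entity_fini(), with
 * the requested change applied: flag the job's finished fence as canceled
 * first, then signal it, so stuck waiters wake up seeing an error rather
 * than a successful completion. */
static void drain_job(struct mock_fence *finished)
{
	mock_fence_set_error(finished, -ECANCELED);
	mock_fence_signal(finished);
}
```

With this ordering, a process blocked in a fence wait (as in the TTM stack trace above) is released, and user space can distinguish "job was thrown away" from "job ran successfully".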
[parent not found: <760c1434-0739-81ff-82c3-a5210c5575d3-5C7GfCeVMHo@public.gmane.org>]
* Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini [not found] ` <760c1434-0739-81ff-82c3-a5210c5575d3-5C7GfCeVMHo@public.gmane.org> @ 2017-10-09 11:04 ` Nicolai Hähnle [not found] ` <de5e2c7c-b6cd-1c24-4d8e-7ae3cdfad0bd-5C7GfCeVMHo@public.gmane.org> 0 siblings, 1 reply; 36+ messages in thread From: Nicolai Hähnle @ 2017-10-09 11:04 UTC (permalink / raw) To: Christian König, Liu, Monk, Nicolai Hähnle, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Olsak, Marek Cc: Li, Bingley On 09.10.2017 12:59, Christian König wrote: > Nicolai, how hard would it be to handle ENODEV as failure for all > currently existing contexts? Impossible? "All currently existing contexts" is not a well-defined concept when multiple drivers co-exist in the same process. And what would be the purpose of this? If it's to support VRAM loss, having a per-context VRAM loss counter would enable each context to signal ECANCELED separately. Cheers, Nicolai > > Monk, would it be ok with you when we return ENODEV only for existing > context when VRAM is lost and/or we have a strict mode GPU reset? E.g. > newly created contexts would continue work as they should. > > Regards, > Christian. > > Am 09.10.2017 um 12:49 schrieb Nicolai Hähnle: >> Hi Monk, >> >> Yes, you're right, we're only using ECANCELED internally. 
But as a >> consequence, Mesa would already handle a kernel error of ECANCELED on >> context loss correctly :) >> >> Cheers, >> Nicolai >> >> On 09.10.2017 12:35, Liu, Monk wrote: >>> Hi Christian >>> >>> You reject some of my patches that returns -ENODEV, with the cause >>> that MESA doesn't do the handling on -ENODEV >>> >>> But if Nicolai can confirm that MESA do have a handling on >>> -ECANCELED, then we need to overall align our error code, on detail >>> below IOCTL can return error code: >>> >>> Amdgpu_cs_ioctl >>> Amdgpu_cs_wait_ioctl >>> Amdgpu_cs_wait_fences_ioctl >>> Amdgpu_info_ioctl >>> >>> >>> My patches is: >>> return -ENODEV on cs_ioctl if the context is detected guilty, >>> also return -ENODEV on cs_wait|cs_wait_fences if the fence is >>> signaled but with error -ETIME, >>> also return -ENODEV on info_ioctl so UMD can query if gpu reset >>> happened after the process created (because for strict mode we block >>> process instead of context) >>> >>> >>> according to Nicolai: >>> >>> amdgpu_cs_ioctl *can* return -ECANCELED, but to be frankly speaking, >>> kernel part doesn't have any place with "-ECANCELED" so this solution >>> on MESA side doesn't align with *current* amdgpu driver, >>> which only return 0 on success or -EINVALID on other error but >>> definitely no "-ECANCELED" error code, >>> >>> so if we talking about community rules we shouldn't let MESA handle >>> -ECANCELED , we should have a unified error code >>> >>> + Marek >>> >>> BR Monk >>> >>> >>> >>> >>> -----Original Message----- >>> From: Haehnle, Nicolai >>> Sent: 2017年10月9日 18:14 >>> To: Koenig, Christian <Christian.Koenig@amd.com>; Liu, Monk >>> <Monk.Liu@amd.com>; Nicolai Hähnle <nhaehnle@gmail.com>; >>> amd-gfx@lists.freedesktop.org >>> Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining >>> fences in amd_sched_entity_fini >>> >>> On 09.10.2017 10:02, Christian König wrote: >>>>> For gpu reset patches (already submitted to pub) I would make kernel >>>>> return 
-ENODEV if the waiting fence (in cs_wait or wait_fences IOCTL) >>>>> founded as error, that way UMD would run into robust extension path >>>>> and considering the GPU hang occurred, >>>> Well that is only closed source behavior which is completely >>>> irrelevant for upstream development. >>>> >>>> As far as I know we haven't pushed the change to return -ENODEV >>>> upstream. >>> >>> FWIW, radeonsi currently expects -ECANCELED on CS submissions and >>> treats those as context lost. Perhaps we could use the same error on >>> fences? >>> That makes more sense to me than -ENODEV. >>> >>> Cheers, >>> Nicolai >>> >>>> >>>> Regards, >>>> Christian. >>>> >>>> Am 09.10.2017 um 08:42 schrieb Liu, Monk: >>>>> Christian >>>>> >>>>>> It would be really nice to have an error code set on >>>>>> s_fence->finished before it is signaled, use dma_fence_set_error() >>>>>> for this. >>>>> For gpu reset patches (already submitted to pub) I would make kernel >>>>> return -ENODEV if the waiting fence (in cs_wait or wait_fences IOCTL) >>>>> founded as error, that way UMD would run into robust extension path >>>>> and considering the GPU hang occurred, >>>>> >>>>> Don't know if this is expected for the case of normal process being >>>>> killed or crashed like Nicolai hit ... 
since there is no gpu hang hit >>>>> >>>>> >>>>> BR Monk >>>>> >>>>> >>>>> >>>>> >>>>> -----Original Message----- >>>>> From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On >>>>> Behalf Of Christian K?nig >>>>> Sent: 2017年9月28日 23:01 >>>>> To: Nicolai Hähnle <nhaehnle@gmail.com>; >>>>> amd-gfx@lists.freedesktop.org >>>>> Cc: Haehnle, Nicolai <Nicolai.Haehnle@amd.com> >>>>> Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining >>>>> fences in amd_sched_entity_fini >>>>> >>>>> Am 28.09.2017 um 16:55 schrieb Nicolai Hähnle: >>>>>> From: Nicolai Hähnle <nicolai.haehnle@amd.com> >>>>>> >>>>>> Highly concurrent Piglit runs can trigger a race condition where a >>>>>> pending SDMA job on a buffer object is never executed because the >>>>>> corresponding process is killed (perhaps due to a crash). Since the >>>>>> job's fences were never signaled, the buffer object was effectively >>>>>> leaked. Worse, the buffer was stuck wherever it happened to be at >>>>>> the time, possibly in VRAM. 
>>>>>> >>>>>> The symptom was user space processes stuck in interruptible waits >>>>>> with kernel stacks like: >>>>>> >>>>>> [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250 >>>>>> [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0 >>>>>> [<ffffffffbc5e82d2>] >>>>>> reservation_object_wait_timeout_rcu+0x1c2/0x300 >>>>>> [<ffffffffc03ce56f>] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 >>>>>> [ttm] >>>>>> [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm] >>>>>> [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm] >>>>>> [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm] >>>>>> [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm] >>>>>> [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 >>>>>> [amdgpu] >>>>>> [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu] >>>>>> [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 >>>>>> [amdgpu] >>>>>> [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 >>>>>> [amdgpu] >>>>>> [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm] >>>>>> [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu] >>>>>> [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0 >>>>>> [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90 >>>>>> [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad >>>>>> [<ffffffffffffffff>] 0xffffffffffffffff >>>>>> >>>>>> Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com> >>>>>> Acked-by: Christian König <christian.koenig@amd.com> >>>>>> --- >>>>>> drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 7 ++++++- >>>>>> 1 file changed, 6 insertions(+), 1 deletion(-) >>>>>> >>>>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>>> index 54eb77cffd9b..32a99e980d78 100644 >>>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>>> @@ -220,22 +220,27 @@ void amd_sched_entity_fini(struct >>>>>> amd_gpu_scheduler *sched, >>>>>> amd_sched_entity_is_idle(entity)); 
>>>>>> amd_sched_rq_remove_entity(rq, entity); >>>>>> if (r) { >>>>>> struct amd_sched_job *job; >>>>>> /* Park the kernel for a moment to make sure it isn't >>>>>> processing >>>>>> * our enity. >>>>>> */ >>>>>> kthread_park(sched->thread); >>>>>> kthread_unpark(sched->thread); >>>>>> - while (kfifo_out(&entity->job_queue, &job, sizeof(job))) >>>>>> + while (kfifo_out(&entity->job_queue, &job, sizeof(job))) { >>>>>> + struct amd_sched_fence *s_fence = job->s_fence; >>>>>> + amd_sched_fence_scheduled(s_fence); >>>>> It would be really nice to have an error code set on >>>>> s_fence->finished before it is signaled, use dma_fence_set_error() >>>>> for this. >>>>> >>>>> Additional to that it would be nice to note in the subject line that >>>>> this is a rather important bug fix. >>>>> >>>>> With that fixed the whole series is Reviewed-by: Christian König >>>>> <christian.koenig@amd.com>. >>>>> >>>>> Regards, >>>>> Christian. >>>>> >>>>>> + amd_sched_fence_finished(s_fence); >>>>>> + dma_fence_put(&s_fence->finished); >>>>>> sched->ops->free_job(job); >>>>>> + } >>>>>> } >>>>>> kfifo_free(&entity->job_queue); >>>>>> } >>>>>> static void amd_sched_entity_wakeup(struct dma_fence *f, struct >>>>>> dma_fence_cb *cb) >>>>>> { >>>>>> struct amd_sched_entity *entity = >>>>>> container_of(cb, struct amd_sched_entity, cb); >>>>>> entity->dependency = NULL; >>>>> >>>>> _______________________________________________ >>>>> amd-gfx mailing list >>>>> amd-gfx@lists.freedesktop.org >>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >>>>> _______________________________________________ >>>>> amd-gfx mailing list >>>>> amd-gfx@lists.freedesktop.org >>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >>>> >>>> >>> >> > _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 36+ messages in thread
[parent not found: <de5e2c7c-b6cd-1c24-4d8e-7ae3cdfad0bd-5C7GfCeVMHo@public.gmane.org>]
* Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini [not found] ` <de5e2c7c-b6cd-1c24-4d8e-7ae3cdfad0bd-5C7GfCeVMHo@public.gmane.org> @ 2017-10-09 11:12 ` Christian König [not found] ` <9619ebd2-f218-7568-3b24-0a9d2b008a6a-5C7GfCeVMHo@public.gmane.org> 0 siblings, 1 reply; 36+ messages in thread From: Christian König @ 2017-10-09 11:12 UTC (permalink / raw) To: Nicolai Hähnle, Liu, Monk, Nicolai Hähnle, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Olsak, Marek Cc: Li, Bingley > >> Nicolai, how hard would it be to handle ENODEV as failure for all >> currently existing contexts? > > Impossible? "All currently existing contexts" is not a well-defined > concept when multiple drivers co-exist in the same process. Ok, let me refine the question: I assume there are resources "shared" between contexts like binary shader code for example which needs to be reuploaded when VRAM is lost. How hard would it be to handle that correctly? > And what would be the purpose of this? If it's to support VRAM loss, > having a per-context VRAM loss counter would enable each context to > signal ECANCELED separately. I thought of that on top of the -ENODEV handling. In other words when we see -ENODEV we call an IOCTL to let the kernel know we noticed that something is wrong and then reinit all shared resources in userspace. All existing context will still see -ECANCELED when we drop their command submission, but new contexts would at least not cause a new lockup immediately because their shader binaries are corrupted. Regards, Christian. Am 09.10.2017 um 13:04 schrieb Nicolai Hähnle: > On 09.10.2017 12:59, Christian König wrote: >> Nicolai, how hard would it be to handle ENODEV as failure for all >> currently existing contexts? > > Impossible? "All currently existing contexts" is not a well-defined > concept when multiple drivers co-exist in the same process. > > And what would be the purpose of this? 
If it's to support VRAM loss, > having a per-context VRAM loss counter would enable each context to > signal ECANCELED separately. > > Cheers, > Nicolai > > >> >> Monk, would it be ok with you when we return ENODEV only for existing >> context when VRAM is lost and/or we have a strict mode GPU reset? >> E.g. newly created contexts would continue work as they should. >> >> Regards, >> Christian. >> >> Am 09.10.2017 um 12:49 schrieb Nicolai Hähnle: >>> Hi Monk, >>> >>> Yes, you're right, we're only using ECANCELED internally. But as a >>> consequence, Mesa would already handle a kernel error of ECANCELED >>> on context loss correctly :) >>> >>> Cheers, >>> Nicolai >>> >>> On 09.10.2017 12:35, Liu, Monk wrote: >>>> Hi Christian >>>> >>>> You reject some of my patches that returns -ENODEV, with the cause >>>> that MESA doesn't do the handling on -ENODEV >>>> >>>> But if Nicolai can confirm that MESA do have a handling on >>>> -ECANCELED, then we need to overall align our error code, on detail >>>> below IOCTL can return error code: >>>> >>>> Amdgpu_cs_ioctl >>>> Amdgpu_cs_wait_ioctl >>>> Amdgpu_cs_wait_fences_ioctl >>>> Amdgpu_info_ioctl >>>> >>>> >>>> My patches is: >>>> return -ENODEV on cs_ioctl if the context is detected guilty, >>>> also return -ENODEV on cs_wait|cs_wait_fences if the fence is >>>> signaled but with error -ETIME, >>>> also return -ENODEV on info_ioctl so UMD can query if gpu reset >>>> happened after the process created (because for strict mode we >>>> block process instead of context) >>>> >>>> >>>> according to Nicolai: >>>> >>>> amdgpu_cs_ioctl *can* return -ECANCELED, but to be frankly >>>> speaking, kernel part doesn't have any place with "-ECANCELED" so >>>> this solution on MESA side doesn't align with *current* amdgpu driver, >>>> which only return 0 on success or -EINVALID on other error but >>>> definitely no "-ECANCELED" error code, >>>> >>>> so if we talking about community rules we shouldn't let MESA handle >>>> -ECANCELED , we should 
have a unified error code >>>> >>>> + Marek >>>> >>>> BR Monk >>>> >>>> >>>> >>>> >>>> -----Original Message----- >>>> From: Haehnle, Nicolai >>>> Sent: 2017年10月9日 18:14 >>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Liu, Monk >>>> <Monk.Liu@amd.com>; Nicolai Hähnle <nhaehnle@gmail.com>; >>>> amd-gfx@lists.freedesktop.org >>>> Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining >>>> fences in amd_sched_entity_fini >>>> >>>> On 09.10.2017 10:02, Christian König wrote: >>>>>> For gpu reset patches (already submitted to pub) I would make kernel >>>>>> return -ENODEV if the waiting fence (in cs_wait or wait_fences >>>>>> IOCTL) >>>>>> founded as error, that way UMD would run into robust extension path >>>>>> and considering the GPU hang occurred, >>>>> Well that is only closed source behavior which is completely >>>>> irrelevant for upstream development. >>>>> >>>>> As far as I know we haven't pushed the change to return -ENODEV >>>>> upstream. >>>> >>>> FWIW, radeonsi currently expects -ECANCELED on CS submissions and >>>> treats those as context lost. Perhaps we could use the same error >>>> on fences? >>>> That makes more sense to me than -ENODEV. >>>> >>>> Cheers, >>>> Nicolai >>>> >>>>> >>>>> Regards, >>>>> Christian. >>>>> >>>>> Am 09.10.2017 um 08:42 schrieb Liu, Monk: >>>>>> Christian >>>>>> >>>>>>> It would be really nice to have an error code set on >>>>>>> s_fence->finished before it is signaled, use dma_fence_set_error() >>>>>>> for this. >>>>>> For gpu reset patches (already submitted to pub) I would make kernel >>>>>> return -ENODEV if the waiting fence (in cs_wait or wait_fences >>>>>> IOCTL) >>>>>> founded as error, that way UMD would run into robust extension path >>>>>> and considering the GPU hang occurred, >>>>>> >>>>>> Don't know if this is expected for the case of normal process being >>>>>> killed or crashed like Nicolai hit ... 
since there is no gpu hang >>>>>> hit >>>>>> >>>>>> >>>>>> BR Monk >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> -----Original Message----- >>>>>> From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On >>>>>> Behalf Of Christian K?nig >>>>>> Sent: 2017年9月28日 23:01 >>>>>> To: Nicolai Hähnle <nhaehnle@gmail.com>; >>>>>> amd-gfx@lists.freedesktop.org >>>>>> Cc: Haehnle, Nicolai <Nicolai.Haehnle@amd.com> >>>>>> Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining >>>>>> fences in amd_sched_entity_fini >>>>>> >>>>>> Am 28.09.2017 um 16:55 schrieb Nicolai Hähnle: >>>>>>> From: Nicolai Hähnle <nicolai.haehnle@amd.com> >>>>>>> >>>>>>> Highly concurrent Piglit runs can trigger a race condition where a >>>>>>> pending SDMA job on a buffer object is never executed because the >>>>>>> corresponding process is killed (perhaps due to a crash). Since the >>>>>>> job's fences were never signaled, the buffer object was effectively >>>>>>> leaked. Worse, the buffer was stuck wherever it happened to be at >>>>>>> the time, possibly in VRAM. 
>>>>>>> >>>>>>> The symptom was user space processes stuck in interruptible waits >>>>>>> with kernel stacks like: >>>>>>> >>>>>>> [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250 >>>>>>> [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0 >>>>>>> [<ffffffffbc5e82d2>] >>>>>>> reservation_object_wait_timeout_rcu+0x1c2/0x300 >>>>>>> [<ffffffffc03ce56f>] >>>>>>> ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 >>>>>>> [ttm] >>>>>>> [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm] >>>>>>> [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm] >>>>>>> [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm] >>>>>>> [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm] >>>>>>> [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 >>>>>>> [amdgpu] >>>>>>> [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu] >>>>>>> [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 >>>>>>> [amdgpu] >>>>>>> [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 >>>>>>> [amdgpu] >>>>>>> [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm] >>>>>>> [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu] >>>>>>> [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0 >>>>>>> [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90 >>>>>>> [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad >>>>>>> [<ffffffffffffffff>] 0xffffffffffffffff >>>>>>> >>>>>>> Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com> >>>>>>> Acked-by: Christian König <christian.koenig@amd.com> >>>>>>> --- >>>>>>> drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 7 ++++++- >>>>>>> 1 file changed, 6 insertions(+), 1 deletion(-) >>>>>>> >>>>>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>>>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>>>> index 54eb77cffd9b..32a99e980d78 100644 >>>>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>>>> @@ -220,22 +220,27 @@ void amd_sched_entity_fini(struct >>>>>>> amd_gpu_scheduler 
*sched, >>>>>>> amd_sched_entity_is_idle(entity)); >>>>>>> amd_sched_rq_remove_entity(rq, entity); >>>>>>> if (r) { >>>>>>> struct amd_sched_job *job; >>>>>>> /* Park the kernel for a moment to make sure it isn't >>>>>>> processing >>>>>>> * our enity. >>>>>>> */ >>>>>>> kthread_park(sched->thread); >>>>>>> kthread_unpark(sched->thread); >>>>>>> - while (kfifo_out(&entity->job_queue, &job, sizeof(job))) >>>>>>> + while (kfifo_out(&entity->job_queue, &job, sizeof(job))) { >>>>>>> + struct amd_sched_fence *s_fence = job->s_fence; >>>>>>> + amd_sched_fence_scheduled(s_fence); >>>>>> It would be really nice to have an error code set on >>>>>> s_fence->finished before it is signaled, use >>>>>> dma_fence_set_error() for this. >>>>>> >>>>>> Additional to that it would be nice to note in the subject line that >>>>>> this is a rather important bug fix. >>>>>> >>>>>> With that fixed the whole series is Reviewed-by: Christian König >>>>>> <christian.koenig@amd.com>. >>>>>> >>>>>> Regards, >>>>>> Christian. 
>>>>>> >>>>>>> + amd_sched_fence_finished(s_fence); >>>>>>> + dma_fence_put(&s_fence->finished); >>>>>>> sched->ops->free_job(job); >>>>>>> + } >>>>>>> } >>>>>>> kfifo_free(&entity->job_queue); >>>>>>> } >>>>>>> static void amd_sched_entity_wakeup(struct dma_fence *f, struct >>>>>>> dma_fence_cb *cb) >>>>>>> { >>>>>>> struct amd_sched_entity *entity = >>>>>>> container_of(cb, struct amd_sched_entity, cb); >>>>>>> entity->dependency = NULL; >>>>>> >>>>>> _______________________________________________ >>>>>> amd-gfx mailing list >>>>>> amd-gfx@lists.freedesktop.org >>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >>>>>> _______________________________________________ >>>>>> amd-gfx mailing list >>>>>> amd-gfx@lists.freedesktop.org >>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >>>>> >>>>> >>>> >>> >> > _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 36+ messages in thread
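[Editorial sketch] The scheme Christian floats above — existing contexts fail after VRAM loss while newly created contexts keep working — amounts to a per-device loss counter that each context snapshots at creation time. The sketch below is hypothetical (none of these names exist in the real amdgpu driver) and elides all locking, but shows the comparison that would gate command submission:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical per-device / per-context state for the "VRAM loss counter"
 * idea under discussion; illustrative names only. */
struct dev_state { unsigned long vram_lost_counter; };
struct ctx_state { unsigned long vram_lost_snapshot; };

/* A GPU reset that lost VRAM contents bumps the device-wide counter. */
static void gpu_reset_vram_lost(struct dev_state *dev)
{
	dev->vram_lost_counter++;
}

/* New contexts start in sync with the device, so they are unaffected by
 * any loss event that happened before their creation. */
static void ctx_create(struct dev_state *dev, struct ctx_state *ctx)
{
	ctx->vram_lost_snapshot = dev->vram_lost_counter;
}

/* Command submission: a context created before the loss is refused with
 * -ECANCELED (per the thread's preference over -ENODEV); a context
 * created after the reset submits normally. */
static int cs_submit(struct dev_state *dev, struct ctx_state *ctx)
{
	if (ctx->vram_lost_snapshot != dev->vram_lost_counter)
		return -ECANCELED;
	return 0;
}
```

The counter would also be exposed through a query ioctl so user space can resynchronize shared resources (shader binaries, etc.) after noticing the first failure.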
[parent not found: <9619ebd2-f218-7568-3b24-0a9d2b008a6a-5C7GfCeVMHo@public.gmane.org>]
* Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini [not found] ` <9619ebd2-f218-7568-3b24-0a9d2b008a6a-5C7GfCeVMHo@public.gmane.org> @ 2017-10-09 11:27 ` Nicolai Hähnle [not found] ` <de68c0ca-f36e-3adb-2c42-83a5176f07d8-5C7GfCeVMHo@public.gmane.org> 0 siblings, 1 reply; 36+ messages in thread From: Nicolai Hähnle @ 2017-10-09 11:27 UTC (permalink / raw) To: Christian König, Liu, Monk, Nicolai Hähnle, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Olsak, Marek Cc: Li, Bingley On 09.10.2017 13:12, Christian König wrote: >> >>> Nicolai, how hard would it be to handle ENODEV as failure for all >>> currently existing contexts? >> >> Impossible? "All currently existing contexts" is not a well-defined >> concept when multiple drivers co-exist in the same process. > > Ok, let me refine the question: I assume there are resources "shared" > between contexts like binary shader code for example which needs to be > reuploaded when VRAM is lost. > > How hard would it be to handle that correctly? Okay, that makes more sense :) With the current interface it's still pretty difficult, but if we could get a new per-device query ioctl which returns a "VRAM loss counter", it would be reasonably straight-forward. >> And what would be the purpose of this? If it's to support VRAM loss, >> having a per-context VRAM loss counter would enable each context to >> signal ECANCELED separately. > > I thought of that on top of the -ENODEV handling. > > In other words when we see -ENODEV we call an IOCTL to let the kernel > know we noticed that something is wrong and then reinit all shared > resources in userspace. > > All existing context will still see -ECANCELED when we drop their > command submission, but new contexts would at least not cause a new > lockup immediately because their shader binaries are corrupted. I don't think we need -ENODEV for this. We just need -ECANCELED to be returned when a submission is rejected due to reset (hang or VRAM loss). 
Mesa would keep a shadow of the VRAM loss counter in pipe_screen and pipe_context, and query the kernel's counter when it encounters -ECANCELED. Each context would then know to drop the CS it's built up so far and restart based on comparing the VRAM loss counter of pipe_screen to that of pipe_context, and similarly we could keep a copy of the VRAM loss counter for important buffer objects like shader binaries, descriptors, etc. This seems more robust to me than relying only on an ENODEV. We'd most likely keep some kind of VRAM loss counter in Mesa *anyway* (we don't maintain a list of all shaders, for example, and we can't nuke important per-context state across threads), and synthesizing such a counter from ENODEVs is not particularly robust (what if multiple ENODEVs occur for the same loss event?). BTW, I still don't like ENODEV. It seems more like the kind of error code you'd return with hot-pluggable GPUs where the device can physically disappear... Cheers, Nicolai > > Regards, > Christian. > > Am 09.10.2017 um 13:04 schrieb Nicolai Hähnle: >> On 09.10.2017 12:59, Christian König wrote: >>> Nicolai, how hard would it be to handle ENODEV as failure for all >>> currently existing contexts? >> >> Impossible? "All currently existing contexts" is not a well-defined >> concept when multiple drivers co-exist in the same process. >> >> And what would be the purpose of this? If it's to support VRAM loss, >> having a per-context VRAM loss counter would enable each context to >> signal ECANCELED separately. >> >> Cheers, >> Nicolai >> >> >>> >>> Monk, would it be ok with you when we return ENODEV only for existing >>> context when VRAM is lost and/or we have a strict mode GPU reset? >>> E.g. newly created contexts would continue work as they should. >>> >>> Regards, >>> Christian. >>> >>> Am 09.10.2017 um 12:49 schrieb Nicolai Hähnle: >>>> Hi Monk, >>>> >>>> Yes, you're right, we're only using ECANCELED internally. 
But as a >>>> consequence, Mesa would already handle a kernel error of ECANCELED >>>> on context loss correctly :) >>>> >>>> Cheers, >>>> Nicolai >>>> >>>> On 09.10.2017 12:35, Liu, Monk wrote: >>>>> Hi Christian >>>>> >>>>> You reject some of my patches that returns -ENODEV, with the cause >>>>> that MESA doesn't do the handling on -ENODEV >>>>> >>>>> But if Nicolai can confirm that MESA do have a handling on >>>>> -ECANCELED, then we need to overall align our error code, on detail >>>>> below IOCTL can return error code: >>>>> >>>>> Amdgpu_cs_ioctl >>>>> Amdgpu_cs_wait_ioctl >>>>> Amdgpu_cs_wait_fences_ioctl >>>>> Amdgpu_info_ioctl >>>>> >>>>> >>>>> My patches is: >>>>> return -ENODEV on cs_ioctl if the context is detected guilty, >>>>> also return -ENODEV on cs_wait|cs_wait_fences if the fence is >>>>> signaled but with error -ETIME, >>>>> also return -ENODEV on info_ioctl so UMD can query if gpu reset >>>>> happened after the process created (because for strict mode we >>>>> block process instead of context) >>>>> >>>>> >>>>> according to Nicolai: >>>>> >>>>> amdgpu_cs_ioctl *can* return -ECANCELED, but to be frankly >>>>> speaking, kernel part doesn't have any place with "-ECANCELED" so >>>>> this solution on MESA side doesn't align with *current* amdgpu driver, >>>>> which only return 0 on success or -EINVALID on other error but >>>>> definitely no "-ECANCELED" error code, >>>>> >>>>> so if we talking about community rules we shouldn't let MESA handle >>>>> -ECANCELED , we should have a unified error code >>>>> >>>>> + Marek >>>>> >>>>> BR Monk >>>>> >>>>> >>>>> >>>>> >>>>> -----Original Message----- >>>>> From: Haehnle, Nicolai >>>>> Sent: 2017年10月9日 18:14 >>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Liu, Monk >>>>> <Monk.Liu@amd.com>; Nicolai Hähnle <nhaehnle@gmail.com>; >>>>> amd-gfx@lists.freedesktop.org >>>>> Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining >>>>> fences in amd_sched_entity_fini >>>>> >>>>> On 09.10.2017 10:02, 
Christian König wrote: >>>>>>> For gpu reset patches (already submitted to pub) I would make kernel >>>>>>> return -ENODEV if the waiting fence (in cs_wait or wait_fences >>>>>>> IOCTL) >>>>>>> founded as error, that way UMD would run into robust extension path >>>>>>> and considering the GPU hang occurred, >>>>>> Well that is only closed source behavior which is completely >>>>>> irrelevant for upstream development. >>>>>> >>>>>> As far as I know we haven't pushed the change to return -ENODEV >>>>>> upstream. >>>>> >>>>> FWIW, radeonsi currently expects -ECANCELED on CS submissions and >>>>> treats those as context lost. Perhaps we could use the same error >>>>> on fences? >>>>> That makes more sense to me than -ENODEV. >>>>> >>>>> Cheers, >>>>> Nicolai >>>>> >>>>>> >>>>>> Regards, >>>>>> Christian. >>>>>> >>>>>> Am 09.10.2017 um 08:42 schrieb Liu, Monk: >>>>>>> Christian >>>>>>> >>>>>>>> It would be really nice to have an error code set on >>>>>>>> s_fence->finished before it is signaled, use dma_fence_set_error() >>>>>>>> for this. >>>>>>> For gpu reset patches (already submitted to pub) I would make kernel >>>>>>> return -ENODEV if the waiting fence (in cs_wait or wait_fences >>>>>>> IOCTL) >>>>>>> founded as error, that way UMD would run into robust extension path >>>>>>> and considering the GPU hang occurred, >>>>>>> >>>>>>> Don't know if this is expected for the case of normal process being >>>>>>> killed or crashed like Nicolai hit ... 
since there is no gpu hang >>>>>>> hit >>>>>>> >>>>>>> >>>>>>> BR Monk >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> -----Original Message----- >>>>>>> From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On >>>>>>> Behalf Of Christian K?nig >>>>>>> Sent: 2017年9月28日 23:01 >>>>>>> To: Nicolai Hähnle <nhaehnle@gmail.com>; >>>>>>> amd-gfx@lists.freedesktop.org >>>>>>> Cc: Haehnle, Nicolai <Nicolai.Haehnle@amd.com> >>>>>>> Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining >>>>>>> fences in amd_sched_entity_fini >>>>>>> >>>>>>> Am 28.09.2017 um 16:55 schrieb Nicolai Hähnle: >>>>>>>> From: Nicolai Hähnle <nicolai.haehnle@amd.com> >>>>>>>> >>>>>>>> Highly concurrent Piglit runs can trigger a race condition where a >>>>>>>> pending SDMA job on a buffer object is never executed because the >>>>>>>> corresponding process is killed (perhaps due to a crash). Since the >>>>>>>> job's fences were never signaled, the buffer object was effectively >>>>>>>> leaked. Worse, the buffer was stuck wherever it happened to be at >>>>>>>> the time, possibly in VRAM. 
>>>>>>>> >>>>>>>> The symptom was user space processes stuck in interruptible waits >>>>>>>> with kernel stacks like: >>>>>>>> >>>>>>>> [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250 >>>>>>>> [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0 >>>>>>>> [<ffffffffbc5e82d2>] >>>>>>>> reservation_object_wait_timeout_rcu+0x1c2/0x300 >>>>>>>> [<ffffffffc03ce56f>] >>>>>>>> ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 >>>>>>>> [ttm] >>>>>>>> [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm] >>>>>>>> [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm] >>>>>>>> [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm] >>>>>>>> [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm] >>>>>>>> [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 >>>>>>>> [amdgpu] >>>>>>>> [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu] >>>>>>>> [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 >>>>>>>> [amdgpu] >>>>>>>> [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 >>>>>>>> [amdgpu] >>>>>>>> [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm] >>>>>>>> [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu] >>>>>>>> [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0 >>>>>>>> [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90 >>>>>>>> [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad >>>>>>>> [<ffffffffffffffff>] 0xffffffffffffffff >>>>>>>> >>>>>>>> Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com> >>>>>>>> Acked-by: Christian König <christian.koenig@amd.com> >>>>>>>> --- >>>>>>>> drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 7 ++++++- >>>>>>>> 1 file changed, 6 insertions(+), 1 deletion(-) >>>>>>>> >>>>>>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>>>>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>>>>> index 54eb77cffd9b..32a99e980d78 100644 >>>>>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>>>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>>>>> @@ -220,22 +220,27 @@ void 
amd_sched_entity_fini(struct >>>>>>>> amd_gpu_scheduler *sched, >>>>>>>> amd_sched_entity_is_idle(entity)); >>>>>>>> amd_sched_rq_remove_entity(rq, entity); >>>>>>>> if (r) { >>>>>>>> struct amd_sched_job *job; >>>>>>>> /* Park the kernel for a moment to make sure it isn't >>>>>>>> processing >>>>>>>> * our enity. >>>>>>>> */ >>>>>>>> kthread_park(sched->thread); >>>>>>>> kthread_unpark(sched->thread); >>>>>>>> - while (kfifo_out(&entity->job_queue, &job, sizeof(job))) >>>>>>>> + while (kfifo_out(&entity->job_queue, &job, sizeof(job))) { >>>>>>>> + struct amd_sched_fence *s_fence = job->s_fence; >>>>>>>> + amd_sched_fence_scheduled(s_fence); >>>>>>> It would be really nice to have an error code set on >>>>>>> s_fence->finished before it is signaled, use >>>>>>> dma_fence_set_error() for this. >>>>>>> >>>>>>> Additional to that it would be nice to note in the subject line that >>>>>>> this is a rather important bug fix. >>>>>>> >>>>>>> With that fixed the whole series is Reviewed-by: Christian König >>>>>>> <christian.koenig@amd.com>. >>>>>>> >>>>>>> Regards, >>>>>>> Christian. 
>>>>>>> >>>>>>>> + amd_sched_fence_finished(s_fence); >>>>>>>> + dma_fence_put(&s_fence->finished); >>>>>>>> sched->ops->free_job(job); >>>>>>>> + } >>>>>>>> } >>>>>>>> kfifo_free(&entity->job_queue); >>>>>>>> } >>>>>>>> static void amd_sched_entity_wakeup(struct dma_fence *f, struct >>>>>>>> dma_fence_cb *cb) >>>>>>>> { >>>>>>>> struct amd_sched_entity *entity = >>>>>>>> container_of(cb, struct amd_sched_entity, cb); >>>>>>>> entity->dependency = NULL; >>>>>>> >>>>>>> _______________________________________________ >>>>>>> amd-gfx mailing list >>>>>>> amd-gfx@lists.freedesktop.org >>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >>>>>>> _______________________________________________ >>>>>>> amd-gfx mailing list >>>>>>> amd-gfx@lists.freedesktop.org >>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >>>>>> >>>>>> >>>>> >>>> >>> >> > _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 36+ messages in thread
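[Editorial sketch] Christian's review comment above — attach an error to s_fence->finished with dma_fence_set_error() before signaling it in the entity_fini drain loop — can be modeled outside the kernel with mock types. This is only an illustrative sketch under stated assumptions: `mock_fence`, `mock_fence_set_error`, and `mock_entity_drain` are invented stand-ins, not the real dma_fence / amd_sched API, and -ECANCELED is one candidate error code from the thread, not a settled choice.

```c
#include <errno.h>   /* ECANCELED */
#include <stddef.h>  /* size_t */

/* Minimal stand-in for a scheduler fence (illustrative only). */
struct mock_fence {
	int signaled; /* 1 once the fence has been signaled */
	int error;    /* error attached before signaling; 0 = success */
};

/* Models dma_fence_set_error(): only meaningful before signaling. */
static void mock_fence_set_error(struct mock_fence *f, int error)
{
	if (!f->signaled)
		f->error = error;
}

static void mock_fence_signal(struct mock_fence *f)
{
	f->signaled = 1;
}

/* Models the drain loop in amd_sched_entity_fini() with Christian's
 * suggestion applied: every job still queued when the entity dies gets
 * its fence marked with an error and signaled, so waiters wake up and
 * buffer objects are released instead of leaking. */
static void mock_entity_drain(struct mock_fence **queue, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		mock_fence_set_error(queue[i], -ECANCELED);
		mock_fence_signal(queue[i]);
	}
}
```

The point of setting the error first is ordering: consumers that observe the signaled state must already see the error recorded, which is why the real dma_fence_set_error() must be called before the fence is signaled.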
* Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini [not found] ` <de68c0ca-f36e-3adb-2c42-83a5176f07d8-5C7GfCeVMHo@public.gmane.org> @ 2017-10-09 12:33 ` Christian König [not found] ` <2f113fd3-ab4a-58b8-31d8-dc0a23751513-5C7GfCeVMHo@public.gmane.org> 0 siblings, 1 reply; 36+ messages in thread From: Christian König @ 2017-10-09 12:33 UTC (permalink / raw) To: Nicolai Hähnle, Liu, Monk, Nicolai Hähnle, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Olsak, Marek Cc: Li, Bingley Am 09.10.2017 um 13:27 schrieb Nicolai Hähnle: > On 09.10.2017 13:12, Christian König wrote: >>> >>>> Nicolai, how hard would it be to handle ENODEV as failure for all >>>> currently existing contexts? >>> >>> Impossible? "All currently existing contexts" is not a well-defined >>> concept when multiple drivers co-exist in the same process. >> >> Ok, let me refine the question: I assume there are resources "shared" >> between contexts like binary shader code for example which needs to >> be reuploaded when VRAM is lost. >> >> How hard would it be to handle that correctly? > > Okay, that makes more sense :) > > With the current interface it's still pretty difficult, but if we > could get a new per-device query ioctl which returns a "VRAM loss > counter", it would be reasonably straight-forward. The problem with the VRAM lost counter is that this isn't save either. E.g. you could have an application which uploads shaders, a GPU reset happens and VRAM is lost and then the application creates a new context and makes submission with broken shader binaries. So I would still vote for a separate IOCTL to reset the VRAM lost state which is called *before" user space starts to reupload shader/descriptors etc... This way you also catch the case when another reset happens while you reupload things. > >>> And what would be the purpose of this? 
If it's to support VRAM loss, >>> having a per-context VRAM loss counter would enable each context to >>> signal ECANCELED separately. >> >> I thought of that on top of the -ENODEV handling. >> >> In other words when we see -ENODEV we call an IOCTL to let the kernel >> know we noticed that something is wrong and then reinit all shared >> resources in userspace. >> >> All existing context will still see -ECANCELED when we drop their >> command submission, but new contexts would at least not cause a new >> lockup immediately because their shader binaries are corrupted. > > I don't think we need -ENODEV for this. We just need -ECANCELED to be > returned when a submission is rejected due to reset (hang or VRAM loss). > > Mesa would keep a shadow of the VRAM loss counter in pipe_screen and > pipe_context, and query the kernel's counter when it encounters > -ECANCELED. Each context would then know to drop the CS it's built up > so far and restart based on comparing the VRAM loss counter of > pipe_screen to that of pipe_context, and similarly we could keep a > copy of the VRAM loss counter for important buffer objects like shader > binaries, descriptors, etc. > > This seems more robust to me than relying only on an ENODEV. We'd most > likely keep some kind of VRAM loss counter in Mesa *anyway* (we don't > maintain a list of all shaders, for example, and we can't nuke > important per-context across threads), and synthesizing such a counter > from ENODEVs is not particularly robust (what if multiple ENODEVs > occur for the same loss event?). > > BTW, I still don't like ENODEV. It seems more like the kind of error > code you'd return with hot-pluggable GPUs where the device can > physically disappear... Yeah, ECANCELED sounds like a better alternative. But I think we should still somehow note the fatality of loosing VRAM to userspace. How about ENODATA or EBADFD? Regards, Christian. > > Cheers, > Nicolai > > >> >> Regards, >> Christian. 
>> >> Am 09.10.2017 um 13:04 schrieb Nicolai Hähnle: >>> On 09.10.2017 12:59, Christian König wrote: >>>> Nicolai, how hard would it be to handle ENODEV as failure for all >>>> currently existing contexts? >>> >>> Impossible? "All currently existing contexts" is not a well-defined >>> concept when multiple drivers co-exist in the same process. >>> >>> And what would be the purpose of this? If it's to support VRAM loss, >>> having a per-context VRAM loss counter would enable each context to >>> signal ECANCELED separately. >>> >>> Cheers, >>> Nicolai >>> >>> >>>> >>>> Monk, would it be ok with you when we return ENODEV only for >>>> existing context when VRAM is lost and/or we have a strict mode GPU >>>> reset? E.g. newly created contexts would continue work as they should. >>>> >>>> Regards, >>>> Christian. >>>> >>>> Am 09.10.2017 um 12:49 schrieb Nicolai Hähnle: >>>>> Hi Monk, >>>>> >>>>> Yes, you're right, we're only using ECANCELED internally. But as a >>>>> consequence, Mesa would already handle a kernel error of ECANCELED >>>>> on context loss correctly :) >>>>> >>>>> Cheers, >>>>> Nicolai >>>>> >>>>> On 09.10.2017 12:35, Liu, Monk wrote: >>>>>> Hi Christian >>>>>> >>>>>> You reject some of my patches that returns -ENODEV, with the >>>>>> cause that MESA doesn't do the handling on -ENODEV >>>>>> >>>>>> But if Nicolai can confirm that MESA do have a handling on >>>>>> -ECANCELED, then we need to overall align our error code, on >>>>>> detail below IOCTL can return error code: >>>>>> >>>>>> Amdgpu_cs_ioctl >>>>>> Amdgpu_cs_wait_ioctl >>>>>> Amdgpu_cs_wait_fences_ioctl >>>>>> Amdgpu_info_ioctl >>>>>> >>>>>> >>>>>> My patches is: >>>>>> return -ENODEV on cs_ioctl if the context is detected guilty, >>>>>> also return -ENODEV on cs_wait|cs_wait_fences if the fence is >>>>>> signaled but with error -ETIME, >>>>>> also return -ENODEV on info_ioctl so UMD can query if gpu reset >>>>>> happened after the process created (because for strict mode we >>>>>> block process 
instead of context) >>>>>> >>>>>> >>>>>> according to Nicolai: >>>>>> >>>>>> amdgpu_cs_ioctl *can* return -ECANCELED, but to be frankly >>>>>> speaking, kernel part doesn't have any place with "-ECANCELED" so >>>>>> this solution on MESA side doesn't align with *current* amdgpu >>>>>> driver, >>>>>> which only return 0 on success or -EINVALID on other error but >>>>>> definitely no "-ECANCELED" error code, >>>>>> >>>>>> so if we talking about community rules we shouldn't let MESA >>>>>> handle -ECANCELED , we should have a unified error code >>>>>> >>>>>> + Marek >>>>>> >>>>>> BR Monk >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> -----Original Message----- >>>>>> From: Haehnle, Nicolai >>>>>> Sent: 2017年10月9日 18:14 >>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Liu, Monk >>>>>> <Monk.Liu@amd.com>; Nicolai Hähnle <nhaehnle@gmail.com>; >>>>>> amd-gfx@lists.freedesktop.org >>>>>> Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining >>>>>> fences in amd_sched_entity_fini >>>>>> >>>>>> On 09.10.2017 10:02, Christian König wrote: >>>>>>>> For gpu reset patches (already submitted to pub) I would make >>>>>>>> kernel >>>>>>>> return -ENODEV if the waiting fence (in cs_wait or wait_fences >>>>>>>> IOCTL) >>>>>>>> founded as error, that way UMD would run into robust extension >>>>>>>> path >>>>>>>> and considering the GPU hang occurred, >>>>>>> Well that is only closed source behavior which is completely >>>>>>> irrelevant for upstream development. >>>>>>> >>>>>>> As far as I know we haven't pushed the change to return -ENODEV >>>>>>> upstream. >>>>>> >>>>>> FWIW, radeonsi currently expects -ECANCELED on CS submissions and >>>>>> treats those as context lost. Perhaps we could use the same error >>>>>> on fences? >>>>>> That makes more sense to me than -ENODEV. >>>>>> >>>>>> Cheers, >>>>>> Nicolai >>>>>> >>>>>>> >>>>>>> Regards, >>>>>>> Christian. 
>>>>>>> >>>>>>> Am 09.10.2017 um 08:42 schrieb Liu, Monk: >>>>>>>> Christian >>>>>>>> >>>>>>>>> It would be really nice to have an error code set on >>>>>>>>> s_fence->finished before it is signaled, use >>>>>>>>> dma_fence_set_error() >>>>>>>>> for this. >>>>>>>> For gpu reset patches (already submitted to pub) I would make >>>>>>>> kernel >>>>>>>> return -ENODEV if the waiting fence (in cs_wait or wait_fences >>>>>>>> IOCTL) >>>>>>>> founded as error, that way UMD would run into robust extension >>>>>>>> path >>>>>>>> and considering the GPU hang occurred, >>>>>>>> >>>>>>>> Don't know if this is expected for the case of normal process >>>>>>>> being >>>>>>>> killed or crashed like Nicolai hit ... since there is no gpu >>>>>>>> hang hit >>>>>>>> >>>>>>>> >>>>>>>> BR Monk >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -----Original Message----- >>>>>>>> From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On >>>>>>>> Behalf Of Christian K?nig >>>>>>>> Sent: 2017年9月28日 23:01 >>>>>>>> To: Nicolai Hähnle <nhaehnle@gmail.com>; >>>>>>>> amd-gfx@lists.freedesktop.org >>>>>>>> Cc: Haehnle, Nicolai <Nicolai.Haehnle@amd.com> >>>>>>>> Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining >>>>>>>> fences in amd_sched_entity_fini >>>>>>>> >>>>>>>> Am 28.09.2017 um 16:55 schrieb Nicolai Hähnle: >>>>>>>>> From: Nicolai Hähnle <nicolai.haehnle@amd.com> >>>>>>>>> >>>>>>>>> Highly concurrent Piglit runs can trigger a race condition >>>>>>>>> where a >>>>>>>>> pending SDMA job on a buffer object is never executed because the >>>>>>>>> corresponding process is killed (perhaps due to a crash). >>>>>>>>> Since the >>>>>>>>> job's fences were never signaled, the buffer object was >>>>>>>>> effectively >>>>>>>>> leaked. Worse, the buffer was stuck wherever it happened to be at >>>>>>>>> the time, possibly in VRAM. 
>>>>>>>>> >>>>>>>>> The symptom was user space processes stuck in interruptible waits >>>>>>>>> with kernel stacks like: >>>>>>>>> >>>>>>>>> [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250 >>>>>>>>> [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0 >>>>>>>>> [<ffffffffbc5e82d2>] >>>>>>>>> reservation_object_wait_timeout_rcu+0x1c2/0x300 >>>>>>>>> [<ffffffffc03ce56f>] >>>>>>>>> ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 >>>>>>>>> [ttm] >>>>>>>>> [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm] >>>>>>>>> [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm] >>>>>>>>> [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm] >>>>>>>>> [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 >>>>>>>>> [ttm] >>>>>>>>> [<ffffffffc042f523>] >>>>>>>>> amdgpu_bo_create_restricted+0x1f3/0x470 >>>>>>>>> [amdgpu] >>>>>>>>> [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu] >>>>>>>>> [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 >>>>>>>>> [amdgpu] >>>>>>>>> [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 >>>>>>>>> [amdgpu] >>>>>>>>> [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm] >>>>>>>>> [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu] >>>>>>>>> [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0 >>>>>>>>> [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90 >>>>>>>>> [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad >>>>>>>>> [<ffffffffffffffff>] 0xffffffffffffffff >>>>>>>>> >>>>>>>>> Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com> >>>>>>>>> Acked-by: Christian König <christian.koenig@amd.com> >>>>>>>>> --- >>>>>>>>> drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 7 ++++++- >>>>>>>>> 1 file changed, 6 insertions(+), 1 deletion(-) >>>>>>>>> >>>>>>>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>>>>>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>>>>>> index 54eb77cffd9b..32a99e980d78 100644 >>>>>>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>>>>>> +++ 
b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>>>>>> @@ -220,22 +220,27 @@ void amd_sched_entity_fini(struct >>>>>>>>> amd_gpu_scheduler *sched, >>>>>>>>> amd_sched_entity_is_idle(entity)); >>>>>>>>> amd_sched_rq_remove_entity(rq, entity); >>>>>>>>> if (r) { >>>>>>>>> struct amd_sched_job *job; >>>>>>>>> /* Park the kernel for a moment to make sure it isn't >>>>>>>>> processing >>>>>>>>> * our enity. >>>>>>>>> */ >>>>>>>>> kthread_park(sched->thread); >>>>>>>>> kthread_unpark(sched->thread); >>>>>>>>> - while (kfifo_out(&entity->job_queue, &job, sizeof(job))) >>>>>>>>> + while (kfifo_out(&entity->job_queue, &job, >>>>>>>>> sizeof(job))) { >>>>>>>>> + struct amd_sched_fence *s_fence = job->s_fence; >>>>>>>>> + amd_sched_fence_scheduled(s_fence); >>>>>>>> It would be really nice to have an error code set on >>>>>>>> s_fence->finished before it is signaled, use >>>>>>>> dma_fence_set_error() for this. >>>>>>>> >>>>>>>> Additional to that it would be nice to note in the subject line >>>>>>>> that >>>>>>>> this is a rather important bug fix. >>>>>>>> >>>>>>>> With that fixed the whole series is Reviewed-by: Christian König >>>>>>>> <christian.koenig@amd.com>. >>>>>>>> >>>>>>>> Regards, >>>>>>>> Christian. 
>>>>>>>> >>>>>>>>> + amd_sched_fence_finished(s_fence); >>>>>>>>> + dma_fence_put(&s_fence->finished); >>>>>>>>> sched->ops->free_job(job); >>>>>>>>> + } >>>>>>>>> } >>>>>>>>> kfifo_free(&entity->job_queue); >>>>>>>>> } >>>>>>>>> static void amd_sched_entity_wakeup(struct dma_fence *f, >>>>>>>>> struct >>>>>>>>> dma_fence_cb *cb) >>>>>>>>> { >>>>>>>>> struct amd_sched_entity *entity = >>>>>>>>> container_of(cb, struct amd_sched_entity, cb); >>>>>>>>> entity->dependency = NULL; >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> amd-gfx mailing list >>>>>>>> amd-gfx@lists.freedesktop.org >>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >>>>>>>> _______________________________________________ >>>>>>>> amd-gfx mailing list >>>>>>>> amd-gfx@lists.freedesktop.org >>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> > _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 36+ messages in thread
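[Editorial sketch] The per-object VRAM-loss-counter scheme Nicolai proposes in this message (keep a shadow of the kernel's counter per screen/context/object and re-upload on mismatch) reduces to a small, testable pattern. All names here (`mock_device`, `cached_object`, `object_ensure_uploaded`) are invented for illustration; they are not Mesa or amdgpu identifiers, and the query interface itself is still under discussion in the thread.

```c
#include <stdint.h>

/* Monotonic counter the kernel would bump on every VRAM-loss event
 * (illustrative model of the proposed per-device query, not a real ioctl). */
struct mock_device {
	uint64_t vram_lost_counter;
};

/* A cached GPU object (e.g. a shader binary) remembers which counter
 * value was current when its contents were last uploaded. */
struct cached_object {
	uint64_t uploaded_at;
	int upload_count; /* number of (re)uploads, for testing */
};

static void object_upload(struct cached_object *obj,
			  const struct mock_device *dev)
{
	obj->uploaded_at = dev->vram_lost_counter;
	obj->upload_count++;
}

/* Before use: re-upload only if VRAM was lost since the last upload. */
static void object_ensure_uploaded(struct cached_object *obj,
				   const struct mock_device *dev)
{
	if (obj->upload_count == 0 ||
	    obj->uploaded_at != dev->vram_lost_counter)
		object_upload(obj, dev);
}
```

This mirrors the comparison Nicolai describes between si_screen/pipe_context shadow counters and the kernel's counter: the check is cheap on the fast path and naturally coalesces multiple loss notifications into a single re-upload.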
* Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini [not found] ` <2f113fd3-ab4a-58b8-31d8-dc0a23751513-5C7GfCeVMHo@public.gmane.org> @ 2017-10-09 12:58 ` Nicolai Hähnle [not found] ` <1a79e19c-a654-f5c7-84d9-ce4cce76243f-5C7GfCeVMHo@public.gmane.org> 2017-10-10 4:00 ` Liu, Monk 1 sibling, 1 reply; 36+ messages in thread From: Nicolai Hähnle @ 2017-10-09 12:58 UTC (permalink / raw) To: Christian König, Liu, Monk, Nicolai Hähnle, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Olsak, Marek Cc: Li, Bingley On 09.10.2017 14:33, Christian König wrote: > Am 09.10.2017 um 13:27 schrieb Nicolai Hähnle: >> On 09.10.2017 13:12, Christian König wrote: >>>> >>>>> Nicolai, how hard would it be to handle ENODEV as failure for all >>>>> currently existing contexts? >>>> >>>> Impossible? "All currently existing contexts" is not a well-defined >>>> concept when multiple drivers co-exist in the same process. >>> >>> Ok, let me refine the question: I assume there are resources "shared" >>> between contexts like binary shader code for example which needs to >>> be reuploaded when VRAM is lost. >>> >>> How hard would it be to handle that correctly? >> >> Okay, that makes more sense :) >> >> With the current interface it's still pretty difficult, but if we >> could get a new per-device query ioctl which returns a "VRAM loss >> counter", it would be reasonably straight-forward. > > The problem with the VRAM lost counter is that this isn't save either. > E.g. you could have an application which uploads shaders, a GPU reset > happens and VRAM is lost and then the application creates a new context > and makes submission with broken shader binaries. Hmm. Here's how I imagined we'd be using a VRAM lost counter: int si_shader_binary_upload(...) { ... shader->bo_vram_lost_counter = sscreen->vram_lost_counter; shader->bo = pipe_buffer_create(...); ptr = sscreen->b.ws->buffer_map(shader->bo->buf, ...); ... copies ... 
sscreen->b.ws->buffer_unmap(shader->bo->buf); } int si_shader_select(...) { ... r = si_shader_select_with_key(ctx->sscreen, state, ...); if (r) return r; if (state->current->bo_vram_lost_counter != ctx->sscreen->vram_lost_counter) { ... re-upload sequence ... } } (Not shown: logic that compares ctx->vram_lost_counter with sscreen->vram_lost_counter and forces a re-validation of all state including shaders.) That should cover this scenario, shouldn't it? Oh... I see one problem. But it should be easy to fix: when creating a new amdgpu context, Mesa needs to query the vram lost counter. So then the sequence of events would be either: - VRAM lost counter starts at 0 - Mesa uploads a shader binary - Unrelated GPU reset happens, kernel increments VRAM lost counter to 1 - Mesa creates a new amdgpu context, queries the VRAM lost counter --> 1 - si_screen::vram_lost_counter is updated to 1 - Draw happens on the new context --> si_shader_select will catch the VRAM loss Or: - VRAM lost counter starts at 0 - Mesa uploads a shader binary - Mesa creates a new amdgpu context, VRAM lost counter still 0 - Unrelated GPU reset happens, kernel increments VRAM lost counter to 1 - Draw happens on the new context and proceeds normally ... - Mesa flushes the CS, and the kernel will return an error code because the device VRAM lost counter is different from the amdgpu context VRAM lost counter > So I would still vote for a separate IOCTL to reset the VRAM lost state > which is called *before" user space starts to reupload > shader/descriptors etc... The question is: is that separate IOCTL per-context or per-fd? If it's per-fd, then it's not compatible with multiple drivers. If it's per-context, then I don't see how it helps. Perhaps you could explain? > This way you also catch the case when another reset happens while you > re-upload things. My assumption would be that the re-upload happens *after* the new amdgpu context is created. 
Then the repeat reset should be caught by the kernel when we try to submit a CS on the new context (this is assuming that the kernel context's vram lost counter is initialized properly when the context is created): - Mesa prepares upload, sets shader->bo_vram_lost_counter to 0 - Mesa uploads a shader binary - While doing this, a GPU reset happens[0], kernel increments device VRAM lost counter to 1 - Draw happens with the new shader, Mesa proceeds normally ... - On flush / CS submit, the kernel detects the VRAM lost state and returns an error to Mesa [0] Out of curiosity: What happens on the CPU side if the PCI / full ASIC reset method is used? Is there a time window where we could get a SEGV? [snip] >> BTW, I still don't like ENODEV. It seems more like the kind of error >> code you'd return with hot-pluggable GPUs where the device can >> physically disappear... > > Yeah, ECANCELED sounds like a better alternative. But I think we should > still somehow note the fatality of loosing VRAM to userspace. > > How about ENODATA or EBADFD? According to the manpage, EBADFD is "File descriptor in bad state.". Sounds fitting :) Cheers, Nicolai > > Regards, > Christian. > >> >> Cheers, >> Nicolai >> >> >>> >>> Regards, >>> Christian. >>> >>> Am 09.10.2017 um 13:04 schrieb Nicolai Hähnle: >>>> On 09.10.2017 12:59, Christian König wrote: >>>>> Nicolai, how hard would it be to handle ENODEV as failure for all >>>>> currently existing contexts? >>>> >>>> Impossible? "All currently existing contexts" is not a well-defined >>>> concept when multiple drivers co-exist in the same process. >>>> >>>> And what would be the purpose of this? If it's to support VRAM loss, >>>> having a per-context VRAM loss counter would enable each context to >>>> signal ECANCELED separately. >>>> >>>> Cheers, >>>> Nicolai >>>> >>>> >>>>> >>>>> Monk, would it be ok with you when we return ENODEV only for >>>>> existing context when VRAM is lost and/or we have a strict mode GPU >>>>> reset? E.g. 
newly created contexts would continue work as they should. >>>>> >>>>> Regards, >>>>> Christian. >>>>> >>>>> Am 09.10.2017 um 12:49 schrieb Nicolai Hähnle: >>>>>> Hi Monk, >>>>>> >>>>>> Yes, you're right, we're only using ECANCELED internally. But as a >>>>>> consequence, Mesa would already handle a kernel error of ECANCELED >>>>>> on context loss correctly :) >>>>>> >>>>>> Cheers, >>>>>> Nicolai >>>>>> >>>>>> On 09.10.2017 12:35, Liu, Monk wrote: >>>>>>> Hi Christian >>>>>>> >>>>>>> You reject some of my patches that returns -ENODEV, with the >>>>>>> cause that MESA doesn't do the handling on -ENODEV >>>>>>> >>>>>>> But if Nicolai can confirm that MESA do have a handling on >>>>>>> -ECANCELED, then we need to overall align our error code, on >>>>>>> detail below IOCTL can return error code: >>>>>>> >>>>>>> Amdgpu_cs_ioctl >>>>>>> Amdgpu_cs_wait_ioctl >>>>>>> Amdgpu_cs_wait_fences_ioctl >>>>>>> Amdgpu_info_ioctl >>>>>>> >>>>>>> >>>>>>> My patches is: >>>>>>> return -ENODEV on cs_ioctl if the context is detected guilty, >>>>>>> also return -ENODEV on cs_wait|cs_wait_fences if the fence is >>>>>>> signaled but with error -ETIME, >>>>>>> also return -ENODEV on info_ioctl so UMD can query if gpu reset >>>>>>> happened after the process created (because for strict mode we >>>>>>> block process instead of context) >>>>>>> >>>>>>> >>>>>>> according to Nicolai: >>>>>>> >>>>>>> amdgpu_cs_ioctl *can* return -ECANCELED, but to be frankly >>>>>>> speaking, kernel part doesn't have any place with "-ECANCELED" so >>>>>>> this solution on MESA side doesn't align with *current* amdgpu >>>>>>> driver, >>>>>>> which only return 0 on success or -EINVALID on other error but >>>>>>> definitely no "-ECANCELED" error code, >>>>>>> >>>>>>> so if we talking about community rules we shouldn't let MESA >>>>>>> handle -ECANCELED , we should have a unified error code >>>>>>> >>>>>>> + Marek >>>>>>> >>>>>>> BR Monk >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> -----Original Message----- >>>>>>> From: 
Haehnle, Nicolai >>>>>>> Sent: 2017年10月9日 18:14 >>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Liu, Monk >>>>>>> <Monk.Liu@amd.com>; Nicolai Hähnle <nhaehnle@gmail.com>; >>>>>>> amd-gfx@lists.freedesktop.org >>>>>>> Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining >>>>>>> fences in amd_sched_entity_fini >>>>>>> >>>>>>> On 09.10.2017 10:02, Christian König wrote: >>>>>>>>> For gpu reset patches (already submitted to pub) I would make >>>>>>>>> kernel >>>>>>>>> return -ENODEV if the waiting fence (in cs_wait or wait_fences >>>>>>>>> IOCTL) >>>>>>>>> founded as error, that way UMD would run into robust extension >>>>>>>>> path >>>>>>>>> and considering the GPU hang occurred, >>>>>>>> Well that is only closed source behavior which is completely >>>>>>>> irrelevant for upstream development. >>>>>>>> >>>>>>>> As far as I know we haven't pushed the change to return -ENODEV >>>>>>>> upstream. >>>>>>> >>>>>>> FWIW, radeonsi currently expects -ECANCELED on CS submissions and >>>>>>> treats those as context lost. Perhaps we could use the same error >>>>>>> on fences? >>>>>>> That makes more sense to me than -ENODEV. >>>>>>> >>>>>>> Cheers, >>>>>>> Nicolai >>>>>>> >>>>>>>> >>>>>>>> Regards, >>>>>>>> Christian. >>>>>>>> >>>>>>>> Am 09.10.2017 um 08:42 schrieb Liu, Monk: >>>>>>>>> Christian >>>>>>>>> >>>>>>>>>> It would be really nice to have an error code set on >>>>>>>>>> s_fence->finished before it is signaled, use >>>>>>>>>> dma_fence_set_error() >>>>>>>>>> for this. >>>>>>>>> For gpu reset patches (already submitted to pub) I would make >>>>>>>>> kernel >>>>>>>>> return -ENODEV if the waiting fence (in cs_wait or wait_fences >>>>>>>>> IOCTL) >>>>>>>>> founded as error, that way UMD would run into robust extension >>>>>>>>> path >>>>>>>>> and considering the GPU hang occurred, >>>>>>>>> >>>>>>>>> Don't know if this is expected for the case of normal process >>>>>>>>> being >>>>>>>>> killed or crashed like Nicolai hit ... 
since there is no gpu >>>>>>>>> hang hit >>>>>>>>> >>>>>>>>> >>>>>>>>> BR Monk >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> -----Original Message----- >>>>>>>>> From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On >>>>>>>>> Behalf Of Christian K?nig >>>>>>>>> Sent: 2017年9月28日 23:01 >>>>>>>>> To: Nicolai Hähnle <nhaehnle@gmail.com>; >>>>>>>>> amd-gfx@lists.freedesktop.org >>>>>>>>> Cc: Haehnle, Nicolai <Nicolai.Haehnle@amd.com> >>>>>>>>> Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining >>>>>>>>> fences in amd_sched_entity_fini >>>>>>>>> >>>>>>>>> Am 28.09.2017 um 16:55 schrieb Nicolai Hähnle: >>>>>>>>>> From: Nicolai Hähnle <nicolai.haehnle@amd.com> >>>>>>>>>> >>>>>>>>>> Highly concurrent Piglit runs can trigger a race condition >>>>>>>>>> where a >>>>>>>>>> pending SDMA job on a buffer object is never executed because the >>>>>>>>>> corresponding process is killed (perhaps due to a crash). >>>>>>>>>> Since the >>>>>>>>>> job's fences were never signaled, the buffer object was >>>>>>>>>> effectively >>>>>>>>>> leaked. Worse, the buffer was stuck wherever it happened to be at >>>>>>>>>> the time, possibly in VRAM. 
>>>>>>>>>> >>>>>>>>>> The symptom was user space processes stuck in interruptible waits >>>>>>>>>> with kernel stacks like: >>>>>>>>>> >>>>>>>>>> [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250 >>>>>>>>>> [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0 >>>>>>>>>> [<ffffffffbc5e82d2>] >>>>>>>>>> reservation_object_wait_timeout_rcu+0x1c2/0x300 >>>>>>>>>> [<ffffffffc03ce56f>] >>>>>>>>>> ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 >>>>>>>>>> [ttm] >>>>>>>>>> [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm] >>>>>>>>>> [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm] >>>>>>>>>> [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm] >>>>>>>>>> [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 >>>>>>>>>> [ttm] >>>>>>>>>> [<ffffffffc042f523>] >>>>>>>>>> amdgpu_bo_create_restricted+0x1f3/0x470 >>>>>>>>>> [amdgpu] >>>>>>>>>> [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu] >>>>>>>>>> [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 >>>>>>>>>> [amdgpu] >>>>>>>>>> [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 >>>>>>>>>> [amdgpu] >>>>>>>>>> [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm] >>>>>>>>>> [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu] >>>>>>>>>> [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0 >>>>>>>>>> [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90 >>>>>>>>>> [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad >>>>>>>>>> [<ffffffffffffffff>] 0xffffffffffffffff >>>>>>>>>> >>>>>>>>>> Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com> >>>>>>>>>> Acked-by: Christian König <christian.koenig@amd.com> >>>>>>>>>> --- >>>>>>>>>> drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 7 ++++++- >>>>>>>>>> 1 file changed, 6 insertions(+), 1 deletion(-) >>>>>>>>>> >>>>>>>>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>>>>>>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>>>>>>> index 54eb77cffd9b..32a99e980d78 100644 >>>>>>>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>>>>>>> +++ 
b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>>>>>>> @@ -220,22 +220,27 @@ void amd_sched_entity_fini(struct >>>>>>>>>> amd_gpu_scheduler *sched, >>>>>>>>>> amd_sched_entity_is_idle(entity)); >>>>>>>>>> amd_sched_rq_remove_entity(rq, entity); >>>>>>>>>> if (r) { >>>>>>>>>> struct amd_sched_job *job; >>>>>>>>>> /* Park the kernel for a moment to make sure it isn't >>>>>>>>>> processing >>>>>>>>>> * our enity. >>>>>>>>>> */ >>>>>>>>>> kthread_park(sched->thread); >>>>>>>>>> kthread_unpark(sched->thread); >>>>>>>>>> - while (kfifo_out(&entity->job_queue, &job, sizeof(job))) >>>>>>>>>> + while (kfifo_out(&entity->job_queue, &job, >>>>>>>>>> sizeof(job))) { >>>>>>>>>> + struct amd_sched_fence *s_fence = job->s_fence; >>>>>>>>>> + amd_sched_fence_scheduled(s_fence); >>>>>>>>> It would be really nice to have an error code set on >>>>>>>>> s_fence->finished before it is signaled, use >>>>>>>>> dma_fence_set_error() for this. >>>>>>>>> >>>>>>>>> Additional to that it would be nice to note in the subject line >>>>>>>>> that >>>>>>>>> this is a rather important bug fix. >>>>>>>>> >>>>>>>>> With that fixed the whole series is Reviewed-by: Christian König >>>>>>>>> <christian.koenig@amd.com>. >>>>>>>>> >>>>>>>>> Regards, >>>>>>>>> Christian. 
>>>>>>>>> >>>>>>>>>> + amd_sched_fence_finished(s_fence); >>>>>>>>>> + dma_fence_put(&s_fence->finished); >>>>>>>>>> sched->ops->free_job(job); >>>>>>>>>> + } >>>>>>>>>> } >>>>>>>>>> kfifo_free(&entity->job_queue); >>>>>>>>>> } >>>>>>>>>> static void amd_sched_entity_wakeup(struct dma_fence *f, >>>>>>>>>> struct >>>>>>>>>> dma_fence_cb *cb) >>>>>>>>>> { >>>>>>>>>> struct amd_sched_entity *entity = >>>>>>>>>> container_of(cb, struct amd_sched_entity, cb); >>>>>>>>>> entity->dependency = NULL; >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> amd-gfx mailing list >>>>>>>>> amd-gfx@lists.freedesktop.org >>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >>>>>>>>> _______________________________________________ >>>>>>>>> amd-gfx mailing list >>>>>>>>> amd-gfx@lists.freedesktop.org >>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> > _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 36+ messages in thread
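Taken together, the drain loop in the patch above plus the suggested dma_fence_set_error() call amount to the following pattern. This is a minimal userspace model with stand-in types — the real code uses kfifo and dma_fence, and all names below are illustrative only, not the kernel API:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for amd_sched_fence and amd_sched_job, so the cleanup
 * logic can be exercised outside the kernel. */
struct model_fence {
    bool signaled;
    int error;                  /* set via the dma_fence_set_error() analogue */
};

struct model_job {
    struct model_fence *s_fence;
};

#define QUEUE_CAP 8

/* A tiny ring buffer standing in for the entity's kfifo job_queue. */
struct model_entity {
    struct model_job *queue[QUEUE_CAP];
    size_t head, count;
};

static bool queue_push(struct model_entity *e, struct model_job *job)
{
    if (e->count == QUEUE_CAP)
        return false;
    e->queue[(e->head + e->count) % QUEUE_CAP] = job;
    e->count++;
    return true;
}

static bool queue_pop(struct model_entity *e, struct model_job **out)
{
    if (e->count == 0)
        return false;
    *out = e->queue[e->head];
    e->head = (e->head + 1) % QUEUE_CAP;
    e->count--;
    return true;
}

/* The fini-time drain: every job still queued gets its fence marked
 * canceled and then signaled, so any waiter wakes up instead of
 * blocking forever on a fence that will never be scheduled. */
static void entity_fini_drain(struct model_entity *e)
{
    struct model_job *job;

    while (queue_pop(e, &job)) {
        struct model_fence *f = job->s_fence;

        f->error = -ECANCELED;  /* dma_fence_set_error(&finished, -ECANCELED) */
        f->signaled = true;     /* amd_sched_fence_scheduled() + _finished() */
        /* sched->ops->free_job(job) would run here in the real driver */
    }
}
```

The -ECANCELED choice follows the error-code discussion later in this thread; the patch as posted only signals and frees without setting an error.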
[parent not found: <1a79e19c-a654-f5c7-84d9-ce4cce76243f-5C7GfCeVMHo@public.gmane.org>]
* Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini [not found] ` <1a79e19c-a654-f5c7-84d9-ce4cce76243f-5C7GfCeVMHo@public.gmane.org> @ 2017-10-09 13:57 ` Olsak, Marek [not found] ` <CY1PR12MB0885AF7148CD8ECE929E96D2F9740-1s8aH8ViOEfCYw/MNJAFQgdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org> 0 siblings, 1 reply; 36+ messages in thread From: Olsak, Marek @ 2017-10-09 13:57 UTC (permalink / raw) To: Haehnle, Nicolai, Koenig, Christian, Liu, Monk, Nicolai Hähnle, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW Cc: Li, Bingley [-- Attachment #1.1: Type: text/plain, Size: 16286 bytes --] Mesa does not handle -ECANCELED. It only returns -ECANCELED from the Mesa winsys layer if the CS ioctl wasn't called (because the context is already lost and so the winsys doesn't submit further CS ioctls). When the CS ioctl fails for the first time, the kernel error is returned and the context is marked as "lost". The next command submission is automatically dropped by the winsys and it returns -ECANCELED. Marek ________________________________ From: Haehnle, Nicolai Sent: Monday, October 9, 2017 2:58:02 PM To: Koenig, Christian; Liu, Monk; Nicolai Hähnle; amd-gfx@lists.freedesktop.org; Olsak, Marek Cc: Li, Bingley Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini On 09.10.2017 14:33, Christian König wrote: > Am 09.10.2017 um 13:27 schrieb Nicolai Hähnle: >> On 09.10.2017 13:12, Christian König wrote: >>>> >>>>> Nicolai, how hard would it be to handle ENODEV as failure for all >>>>> currently existing contexts? >>>> >>>> Impossible? "All currently existing contexts" is not a well-defined >>>> concept when multiple drivers co-exist in the same process. >>> >>> Ok, let me refine the question: I assume there are resources "shared" >>> between contexts like binary shader code for example which needs to >>> be reuploaded when VRAM is lost. >>> >>> How hard would it be to handle that correctly? 
>> Okay, that makes more sense :)
>>
>> With the current interface it's still pretty difficult, but if we
>> could get a new per-device query ioctl which returns a "VRAM loss
>> counter", it would be reasonably straight-forward.
>
> The problem with the VRAM lost counter is that this isn't safe either.
> E.g. you could have an application which uploads shaders, a GPU reset
> happens and VRAM is lost and then the application creates a new context
> and makes submission with broken shader binaries.

Hmm. Here's how I imagined we'd be using a VRAM lost counter:

int si_shader_binary_upload(...)
{
   ...
   shader->bo_vram_lost_counter = sscreen->vram_lost_counter;
   shader->bo = pipe_buffer_create(...);
   ptr = sscreen->b.ws->buffer_map(shader->bo->buf, ...);
   ... copies ...
   sscreen->b.ws->buffer_unmap(shader->bo->buf);
}

int si_shader_select(...)
{
   ...
   r = si_shader_select_with_key(ctx->sscreen, state, ...);
   if (r) return r;

   if (state->current->bo_vram_lost_counter !=
       ctx->sscreen->vram_lost_counter) {
      ... re-upload sequence ...
   }
}

(Not shown: logic that compares ctx->vram_lost_counter with sscreen->vram_lost_counter and forces a re-validation of all state including shaders.)

That should cover this scenario, shouldn't it?

Oh... I see one problem. But it should be easy to fix: when creating a new amdgpu context, Mesa needs to query the vram lost counter.
So then the sequence of events would be either:

- VRAM lost counter starts at 0
- Mesa uploads a shader binary
- Unrelated GPU reset happens, kernel increments VRAM lost counter to 1
- Mesa creates a new amdgpu context, queries the VRAM lost counter --> 1
- si_screen::vram_lost_counter is updated to 1
- Draw happens on the new context --> si_shader_select will catch the VRAM loss

Or:

- VRAM lost counter starts at 0
- Mesa uploads a shader binary
- Mesa creates a new amdgpu context, VRAM lost counter still 0
- Unrelated GPU reset happens, kernel increments VRAM lost counter to 1
- Draw happens on the new context and proceeds normally
...
- Mesa flushes the CS, and the kernel will return an error code because the device VRAM lost counter is different from the amdgpu context VRAM lost counter

> So I would still vote for a separate IOCTL to reset the VRAM lost state
> which is called *before* user space starts to reupload
> shader/descriptors etc...

The question is: is that separate IOCTL per-context or per-fd? If it's per-fd, then it's not compatible with multiple drivers. If it's per-context, then I don't see how it helps. Perhaps you could explain?

> This way you also catch the case when another reset happens while you
> re-upload things.

My assumption would be that the re-upload happens *after* the new amdgpu context is created. Then the repeat reset should be caught by the kernel when we try to submit a CS on the new context (this is assuming that the kernel context's vram lost counter is initialized properly when the context is created):

- Mesa prepares upload, sets shader->bo_vram_lost_counter to 0
- Mesa uploads a shader binary
- While doing this, a GPU reset happens[0], kernel increments device VRAM lost counter to 1
- Draw happens with the new shader, Mesa proceeds normally
...
- On flush / CS submit, the kernel detects the VRAM lost state and returns an error to Mesa

[0] Out of curiosity: What happens on the CPU side if the PCI / full ASIC reset method is used? Is there a time window where we could get a SEGV?

[snip]
>> BTW, I still don't like ENODEV. It seems more like the kind of error
>> code you'd return with hot-pluggable GPUs where the device can
>> physically disappear...
>
> Yeah, ECANCELED sounds like a better alternative. But I think we should
> still somehow note the fatality of losing VRAM to userspace.
>
> How about ENODATA or EBADFD?

According to the manpage, EBADFD is "File descriptor in bad state.". Sounds fitting :)

Cheers,
Nicolai

> Regards,
> Christian.
>
>> Cheers,
>> Nicolai
>>
>>> Regards,
>>> Christian.
>>>
>>> Am 09.10.2017 um 13:04 schrieb Nicolai Hähnle:
>>>> On 09.10.2017 12:59, Christian König wrote:
>>>>> Nicolai, how hard would it be to handle ENODEV as failure for all
>>>>> currently existing contexts?
>>>>
>>>> Impossible? "All currently existing contexts" is not a well-defined
>>>> concept when multiple drivers co-exist in the same process.
>>>>
>>>> And what would be the purpose of this? If it's to support VRAM loss,
>>>> having a per-context VRAM loss counter would enable each context to
>>>> signal ECANCELED separately.
>>>>
>>>> Cheers,
>>>> Nicolai
>>>>
>>>>> Monk, would it be ok with you when we return ENODEV only for
>>>>> existing context when VRAM is lost and/or we have a strict mode GPU
>>>>> reset? E.g. newly created contexts would continue work as they should.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>> Am 09.10.2017 um 12:49 schrieb Nicolai Hähnle:
>>>>>> Hi Monk,
>>>>>>
>>>>>> Yes, you're right, we're only using ECANCELED internally.
But as a >>>>>> consequence, Mesa would already handle a kernel error of ECANCELED >>>>>> on context loss correctly :) >>>>>> >>>>>> Cheers, >>>>>> Nicolai >>>>>> >>>>>> On 09.10.2017 12:35, Liu, Monk wrote: >>>>>>> Hi Christian >>>>>>> >>>>>>> You reject some of my patches that returns -ENODEV, with the >>>>>>> cause that MESA doesn't do the handling on -ENODEV >>>>>>> >>>>>>> But if Nicolai can confirm that MESA do have a handling on >>>>>>> -ECANCELED, then we need to overall align our error code, on >>>>>>> detail below IOCTL can return error code: >>>>>>> >>>>>>> Amdgpu_cs_ioctl >>>>>>> Amdgpu_cs_wait_ioctl >>>>>>> Amdgpu_cs_wait_fences_ioctl >>>>>>> Amdgpu_info_ioctl >>>>>>> >>>>>>> >>>>>>> My patches is: >>>>>>> return -ENODEV on cs_ioctl if the context is detected guilty, >>>>>>> also return -ENODEV on cs_wait|cs_wait_fences if the fence is >>>>>>> signaled but with error -ETIME, >>>>>>> also return -ENODEV on info_ioctl so UMD can query if gpu reset >>>>>>> happened after the process created (because for strict mode we >>>>>>> block process instead of context) >>>>>>> >>>>>>> >>>>>>> according to Nicolai: >>>>>>> >>>>>>> amdgpu_cs_ioctl *can* return -ECANCELED, but to be frankly >>>>>>> speaking, kernel part doesn't have any place with "-ECANCELED" so >>>>>>> this solution on MESA side doesn't align with *current* amdgpu >>>>>>> driver, >>>>>>> which only return 0 on success or -EINVALID on other error but >>>>>>> definitely no "-ECANCELED" error code, >>>>>>> >>>>>>> so if we talking about community rules we shouldn't let MESA >>>>>>> handle -ECANCELED , we should have a unified error code >>>>>>> >>>>>>> + Marek >>>>>>> >>>>>>> BR Monk >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> -----Original Message----- >>>>>>> From: Haehnle, Nicolai >>>>>>> Sent: 2017年10月9日 18:14 >>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Liu, Monk >>>>>>> <Monk.Liu@amd.com>; Nicolai Hähnle <nhaehnle@gmail.com>; >>>>>>> amd-gfx@lists.freedesktop.org >>>>>>> Subject: Re: 
[PATCH 5/5] drm/amd/sched: signal and free remaining >>>>>>> fences in amd_sched_entity_fini >>>>>>> >>>>>>> On 09.10.2017 10:02, Christian König wrote: >>>>>>>>> For gpu reset patches (already submitted to pub) I would make >>>>>>>>> kernel >>>>>>>>> return -ENODEV if the waiting fence (in cs_wait or wait_fences >>>>>>>>> IOCTL) >>>>>>>>> founded as error, that way UMD would run into robust extension >>>>>>>>> path >>>>>>>>> and considering the GPU hang occurred, >>>>>>>> Well that is only closed source behavior which is completely >>>>>>>> irrelevant for upstream development. >>>>>>>> >>>>>>>> As far as I know we haven't pushed the change to return -ENODEV >>>>>>>> upstream. >>>>>>> >>>>>>> FWIW, radeonsi currently expects -ECANCELED on CS submissions and >>>>>>> treats those as context lost. Perhaps we could use the same error >>>>>>> on fences? >>>>>>> That makes more sense to me than -ENODEV. >>>>>>> >>>>>>> Cheers, >>>>>>> Nicolai >>>>>>> >>>>>>>> >>>>>>>> Regards, >>>>>>>> Christian. >>>>>>>> >>>>>>>> Am 09.10.2017 um 08:42 schrieb Liu, Monk: >>>>>>>>> Christian >>>>>>>>> >>>>>>>>>> It would be really nice to have an error code set on >>>>>>>>>> s_fence->finished before it is signaled, use >>>>>>>>>> dma_fence_set_error() >>>>>>>>>> for this. >>>>>>>>> For gpu reset patches (already submitted to pub) I would make >>>>>>>>> kernel >>>>>>>>> return -ENODEV if the waiting fence (in cs_wait or wait_fences >>>>>>>>> IOCTL) >>>>>>>>> founded as error, that way UMD would run into robust extension >>>>>>>>> path >>>>>>>>> and considering the GPU hang occurred, >>>>>>>>> >>>>>>>>> Don't know if this is expected for the case of normal process >>>>>>>>> being >>>>>>>>> killed or crashed like Nicolai hit ... 
since there is no gpu >>>>>>>>> hang hit >>>>>>>>> >>>>>>>>> >>>>>>>>> BR Monk >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> -----Original Message----- >>>>>>>>> From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On >>>>>>>>> Behalf Of Christian König >>>>>>>>> Sent: 2017年9月28日 23:01 >>>>>>>>> To: Nicolai Hähnle <nhaehnle@gmail.com>; >>>>>>>>> amd-gfx@lists.freedesktop.org >>>>>>>>> Cc: Haehnle, Nicolai <Nicolai.Haehnle@amd.com> >>>>>>>>> Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining >>>>>>>>> fences in amd_sched_entity_fini >>>>>>>>> >>>>>>>>> Am 28.09.2017 um 16:55 schrieb Nicolai Hähnle: >>>>>>>>>> From: Nicolai Hähnle <nicolai.haehnle@amd.com> >>>>>>>>>> >>>>>>>>>> Highly concurrent Piglit runs can trigger a race condition >>>>>>>>>> where a >>>>>>>>>> pending SDMA job on a buffer object is never executed because the >>>>>>>>>> corresponding process is killed (perhaps due to a crash). >>>>>>>>>> Since the >>>>>>>>>> job's fences were never signaled, the buffer object was >>>>>>>>>> effectively >>>>>>>>>> leaked. Worse, the buffer was stuck wherever it happened to be at >>>>>>>>>> the time, possibly in VRAM.
>>>>>>>>>> >>>>>>>>>> The symptom was user space processes stuck in interruptible waits >>>>>>>>>> with kernel stacks like: >>>>>>>>>> >>>>>>>>>> [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250 >>>>>>>>>> [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0 >>>>>>>>>> [<ffffffffbc5e82d2>] >>>>>>>>>> reservation_object_wait_timeout_rcu+0x1c2/0x300 >>>>>>>>>> [<ffffffffc03ce56f>] >>>>>>>>>> ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 >>>>>>>>>> [ttm] >>>>>>>>>> [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm] >>>>>>>>>> [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm] >>>>>>>>>> [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm] >>>>>>>>>> [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 >>>>>>>>>> [ttm] >>>>>>>>>> [<ffffffffc042f523>] >>>>>>>>>> amdgpu_bo_create_restricted+0x1f3/0x470 >>>>>>>>>> [amdgpu] >>>>>>>>>> [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu] >>>>>>>>>> [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 >>>>>>>>>> [amdgpu] >>>>>>>>>> [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 >>>>>>>>>> [amdgpu] >>>>>>>>>> [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm] >>>>>>>>>> [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu] >>>>>>>>>> [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0 >>>>>>>>>> [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90 >>>>>>>>>> [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad >>>>>>>>>> [<ffffffffffffffff>] 0xffffffffffffffff >>>>>>>>>> >>>>>>>>>> Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com> >>>>>>>>>> Acked-by: Christian König <christian.koenig@amd.com> >>>>>>>>>> --- >>>>>>>>>> drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 7 ++++++- >>>>>>>>>> 1 file changed, 6 insertions(+), 1 deletion(-) >>>>>>>>>> >>>>>>>>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>>>>>>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>>>>>>> index 54eb77cffd9b..32a99e980d78 100644 >>>>>>>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>>>>>>> +++ 
b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>>>>>>> @@ -220,22 +220,27 @@ void amd_sched_entity_fini(struct >>>>>>>>>> amd_gpu_scheduler *sched, >>>>>>>>>> amd_sched_entity_is_idle(entity)); >>>>>>>>>> amd_sched_rq_remove_entity(rq, entity); >>>>>>>>>> if (r) { >>>>>>>>>> struct amd_sched_job *job; >>>>>>>>>> /* Park the kernel for a moment to make sure it isn't >>>>>>>>>> processing >>>>>>>>>> * our enity. >>>>>>>>>> */ >>>>>>>>>> kthread_park(sched->thread); >>>>>>>>>> kthread_unpark(sched->thread); >>>>>>>>>> - while (kfifo_out(&entity->job_queue, &job, sizeof(job))) >>>>>>>>>> + while (kfifo_out(&entity->job_queue, &job, >>>>>>>>>> sizeof(job))) { >>>>>>>>>> + struct amd_sched_fence *s_fence = job->s_fence; >>>>>>>>>> + amd_sched_fence_scheduled(s_fence); >>>>>>>>> It would be really nice to have an error code set on >>>>>>>>> s_fence->finished before it is signaled, use >>>>>>>>> dma_fence_set_error() for this. >>>>>>>>> >>>>>>>>> Additional to that it would be nice to note in the subject line >>>>>>>>> that >>>>>>>>> this is a rather important bug fix. >>>>>>>>> >>>>>>>>> With that fixed the whole series is Reviewed-by: Christian König >>>>>>>>> <christian.koenig@amd.com>. >>>>>>>>> >>>>>>>>> Regards, >>>>>>>>> Christian. 
>>>>>>>>> >>>>>>>>>> + amd_sched_fence_finished(s_fence); >>>>>>>>>> + dma_fence_put(&s_fence->finished); >>>>>>>>>> sched->ops->free_job(job); >>>>>>>>>> + } >>>>>>>>>> } >>>>>>>>>> kfifo_free(&entity->job_queue); >>>>>>>>>> } >>>>>>>>>> static void amd_sched_entity_wakeup(struct dma_fence *f, >>>>>>>>>> struct >>>>>>>>>> dma_fence_cb *cb) >>>>>>>>>> { >>>>>>>>>> struct amd_sched_entity *entity = >>>>>>>>>> container_of(cb, struct amd_sched_entity, cb); >>>>>>>>>> entity->dependency = NULL; >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> amd-gfx mailing list >>>>>>>>> amd-gfx@lists.freedesktop.org >>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >>>>>>>>> _______________________________________________ >>>>>>>>> amd-gfx mailing list >>>>>>>>> amd-gfx@lists.freedesktop.org >>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> > [-- Attachment #1.2: Type: text/html, Size: 28268 bytes --] [-- Attachment #2: Type: text/plain, Size: 154 bytes --] _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 36+ messages in thread
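The winsys behaviour Marek describes above — the first failing submission returns the kernel's error and latches a "context lost" flag, and later submissions are dropped with -ECANCELED before the CS ioctl is even called — can be modeled in a few lines. All names here are hypothetical, not Mesa's actual structures:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the radeonsi winsys submission path. */
struct ws_context {
    bool lost;                      /* latched on the first failed submission */
};

/* kernel_status stands in for the CS ioctl's return value; once the
 * context is lost, the ioctl is not called at all. */
static int ws_cs_submit(struct ws_context *ctx, int kernel_status)
{
    if (ctx->lost)
        return -ECANCELED;          /* dropped without reaching the ioctl */

    if (kernel_status != 0) {
        ctx->lost = true;           /* mark the context lost */
        return kernel_status;       /* surface the kernel error exactly once */
    }
    return 0;
}
```

Whether the kernel reports the loss as -ENODEV, -ECANCELED, or something else is precisely what the rest of this thread debates; the model above is agnostic to that choice.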
[parent not found: <CY1PR12MB0885AF7148CD8ECE929E96D2F9740-1s8aH8ViOEfCYw/MNJAFQgdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>]
* Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini [not found] ` <CY1PR12MB0885AF7148CD8ECE929E96D2F9740-1s8aH8ViOEfCYw/MNJAFQgdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org> @ 2017-10-09 14:01 ` Nicolai Hähnle 0 siblings, 0 replies; 36+ messages in thread From: Nicolai Hähnle @ 2017-10-09 14:01 UTC (permalink / raw) To: Olsak, Marek, Koenig, Christian, Liu, Monk, Nicolai Hähnle, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW Cc: Li, Bingley It depends on what you mean by "handle". If amdgpu_cs_submit_raw were to return ECANCELED, the correct error message would be printed. We don't do any of the "trying to continue" business because back when we last discussed that we said that it wasn't such a great idea, and to be honest, it really isn't a great idea for normal applications. For the X server / compositor it could be valuable though. Cheers, Nicolai On 09.10.2017 15:57, Olsak, Marek wrote: > Mesa does not handle -ECANCELED. It only returns -ECANCELED from the > Mesa winsys layer if the CS ioctl wasn't called (because the context is > already lost and so the winsys doesn't submit further CS ioctls). > > > When the CS ioctl fails for the first time, the kernel error is returned > and the context is marked as "lost". > > The next command submission is automatically dropped by the winsys and > it returns -ECANCELED. 
> > > Marek > > ------------------------------------------------------------------------ > *From:* Haehnle, Nicolai > *Sent:* Monday, October 9, 2017 2:58:02 PM > *To:* Koenig, Christian; Liu, Monk; Nicolai Hähnle; > amd-gfx@lists.freedesktop.org; Olsak, Marek > *Cc:* Li, Bingley > *Subject:* Re: [PATCH 5/5] drm/amd/sched: signal and free remaining > fences in amd_sched_entity_fini > On 09.10.2017 14:33, Christian König wrote: >> Am 09.10.2017 um 13:27 schrieb Nicolai Hähnle: >>> On 09.10.2017 13:12, Christian König wrote: >>>>> >>>>>> Nicolai, how hard would it be to handle ENODEV as failure for all >>>>>> currently existing contexts? >>>>> >>>>> Impossible? "All currently existing contexts" is not a well-defined >>>>> concept when multiple drivers co-exist in the same process. >>>> >>>> Ok, let me refine the question: I assume there are resources "shared" >>>> between contexts like binary shader code for example which needs to >>>> be reuploaded when VRAM is lost. >>>> >>>> How hard would it be to handle that correctly? >>> >>> Okay, that makes more sense :) >>> >>> With the current interface it's still pretty difficult, but if we >>> could get a new per-device query ioctl which returns a "VRAM loss >>> counter", it would be reasonably straight-forward. >> >> The problem with the VRAM lost counter is that this isn't save either. >> E.g. you could have an application which uploads shaders, a GPU reset >> happens and VRAM is lost and then the application creates a new context >> and makes submission with broken shader binaries. > > Hmm. Here's how I imagined we'd be using a VRAM lost counter: > > int si_shader_binary_upload(...) > { > ... > shader->bo_vram_lost_counter = sscreen->vram_lost_counter; > shader->bo = pipe_buffer_create(...); > ptr = sscreen->b.ws->buffer_map(shader->bo->buf, ...); > ... copies ... > sscreen->b.ws->buffer_unmap(shader->bo->buf); > } > > int si_shader_select(...) > { > ... 
> r = si_shader_select_with_key(ctx->sscreen, state, ...); > if (r) return r; > > if (state->current->bo_vram_lost_counter != > ctx->sscreen->vram_lost_counter) { > ... re-upload sequence ... > } > } > > (Not shown: logic that compares ctx->vram_lost_counter with > sscreen->vram_lost_counter and forces a re-validation of all state > including shaders.) > > That should cover this scenario, shouldn't it? > > Oh... I see one problem. But it should be easy to fix: when creating a > new amdgpu context, Mesa needs to query the vram lost counter. So then > the sequence of events would be either: > > - VRAM lost counter starts at 0 > - Mesa uploads a shader binary > - Unrelated GPU reset happens, kernel increments VRAM lost counter to 1 > - Mesa creates a new amdgpu context, queries the VRAM lost counter --> 1 > - si_screen::vram_lost_counter is updated to 1 > - Draw happens on the new context --> si_shader_select will catch the > VRAM loss > > Or: > > - VRAM lost counter starts at 0 > - Mesa uploads a shader binary > - Mesa creates a new amdgpu context, VRAM lost counter still 0 > - Unrelated GPU reset happens, kernel increments VRAM lost counter to 1 > - Draw happens on the new context and proceeds normally > ... > - Mesa flushes the CS, and the kernel will return an error code because > the device VRAM lost counter is different from the amdgpu context VRAM > lost counter > > >> So I would still vote for a separate IOCTL to reset the VRAM lost state >> which is called *before" user space starts to reupload >> shader/descriptors etc... > > The question is: is that separate IOCTL per-context or per-fd? If it's > per-fd, then it's not compatible with multiple drivers. If it's > per-context, then I don't see how it helps. Perhaps you could explain? > > > > This way you also catch the case when another reset happens while you > > re-upload things. > > My assumption would be that the re-upload happens *after* the new amdgpu > context is created. 
Then the repeat reset should be caught by the kernel > when we try to submit a CS on the new context (this is assuming that the > kernel context's vram lost counter is initialized properly when the > context is created): > > - Mesa prepares upload, sets shader->bo_vram_lost_counter to 0 > - Mesa uploads a shader binary > - While doing this, a GPU reset happens[0], kernel increments device > VRAM lost counter to 1 > - Draw happens with the new shader, Mesa proceeds normally > ... > - On flush / CS submit, the kernel detects the VRAM lost state and > returns an error to Mesa > > [0] Out of curiosity: What happens on the CPU side if the PCI / full > ASIC reset method is used? Is there a time window where we could get a SEGV? > > > [snip] >>> BTW, I still don't like ENODEV. It seems more like the kind of error >>> code you'd return with hot-pluggable GPUs where the device can >>> physically disappear... >> >> Yeah, ECANCELED sounds like a better alternative. But I think we should >> still somehow note the fatality of loosing VRAM to userspace. >> >> How about ENODATA or EBADFD? > > According to the manpage, EBADFD is "File descriptor in bad state.". > Sounds fitting :) > > Cheers, > Nicolai > > >> >> Regards, >> Christian. >> >>> >>> Cheers, >>> Nicolai >>> >>> >>>> >>>> Regards, >>>> Christian. >>>> >>>> Am 09.10.2017 um 13:04 schrieb Nicolai Hähnle: >>>>> On 09.10.2017 12:59, Christian König wrote: >>>>>> Nicolai, how hard would it be to handle ENODEV as failure for all >>>>>> currently existing contexts? >>>>> >>>>> Impossible? "All currently existing contexts" is not a well-defined >>>>> concept when multiple drivers co-exist in the same process. >>>>> >>>>> And what would be the purpose of this? If it's to support VRAM loss, >>>>> having a per-context VRAM loss counter would enable each context to >>>>> signal ECANCELED separately. 
>>>>> >>>>> Cheers, >>>>> Nicolai >>>>> >>>>> >>>>>> >>>>>> Monk, would it be ok with you when we return ENODEV only for >>>>>> existing context when VRAM is lost and/or we have a strict mode GPU >>>>>> reset? E.g. newly created contexts would continue work as they should. >>>>>> >>>>>> Regards, >>>>>> Christian. >>>>>> >>>>>> Am 09.10.2017 um 12:49 schrieb Nicolai Hähnle: >>>>>>> Hi Monk, >>>>>>> >>>>>>> Yes, you're right, we're only using ECANCELED internally. But as a >>>>>>> consequence, Mesa would already handle a kernel error of ECANCELED >>>>>>> on context loss correctly :) >>>>>>> >>>>>>> Cheers, >>>>>>> Nicolai >>>>>>> >>>>>>> On 09.10.2017 12:35, Liu, Monk wrote: >>>>>>>> Hi Christian >>>>>>>> >>>>>>>> You reject some of my patches that returns -ENODEV, with the >>>>>>>> cause that MESA doesn't do the handling on -ENODEV >>>>>>>> >>>>>>>> But if Nicolai can confirm that MESA do have a handling on >>>>>>>> -ECANCELED, then we need to overall align our error code, on >>>>>>>> detail below IOCTL can return error code: >>>>>>>> >>>>>>>> Amdgpu_cs_ioctl >>>>>>>> Amdgpu_cs_wait_ioctl >>>>>>>> Amdgpu_cs_wait_fences_ioctl >>>>>>>> Amdgpu_info_ioctl >>>>>>>> >>>>>>>> >>>>>>>> My patches is: >>>>>>>> return -ENODEV on cs_ioctl if the context is detected guilty, >>>>>>>> also return -ENODEV on cs_wait|cs_wait_fences if the fence is >>>>>>>> signaled but with error -ETIME, >>>>>>>> also return -ENODEV on info_ioctl so UMD can query if gpu reset >>>>>>>> happened after the process created (because for strict mode we >>>>>>>> block process instead of context) >>>>>>>> >>>>>>>> >>>>>>>> according to Nicolai: >>>>>>>> >>>>>>>> amdgpu_cs_ioctl *can* return -ECANCELED, but to be frankly >>>>>>>> speaking, kernel part doesn't have any place with "-ECANCELED" so >>>>>>>> this solution on MESA side doesn't align with *current* amdgpu >>>>>>>> driver, >>>>>>>> which only return 0 on success or -EINVALID on other error but >>>>>>>> definitely no "-ECANCELED" error code, 
>>>>>>>> >>>>>>>> so if we talking about community rules we shouldn't let MESA >>>>>>>> handle -ECANCELED , we should have a unified error code >>>>>>>> >>>>>>>> + Marek >>>>>>>> >>>>>>>> BR Monk >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -----Original Message----- >>>>>>>> From: Haehnle, Nicolai >>>>>>>> Sent: 2017年10月9日 18:14 >>>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Liu, Monk >>>>>>>> <Monk.Liu@amd.com>; Nicolai Hähnle <nhaehnle@gmail.com>; >>>>>>>> amd-gfx@lists.freedesktop.org >>>>>>>> Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining >>>>>>>> fences in amd_sched_entity_fini >>>>>>>> >>>>>>>> On 09.10.2017 10:02, Christian König wrote: >>>>>>>>>> For gpu reset patches (already submitted to pub) I would make >>>>>>>>>> kernel >>>>>>>>>> return -ENODEV if the waiting fence (in cs_wait or wait_fences >>>>>>>>>> IOCTL) >>>>>>>>>> founded as error, that way UMD would run into robust extension >>>>>>>>>> path >>>>>>>>>> and considering the GPU hang occurred, >>>>>>>>> Well that is only closed source behavior which is completely >>>>>>>>> irrelevant for upstream development. >>>>>>>>> >>>>>>>>> As far as I know we haven't pushed the change to return -ENODEV >>>>>>>>> upstream. >>>>>>>> >>>>>>>> FWIW, radeonsi currently expects -ECANCELED on CS submissions and >>>>>>>> treats those as context lost. Perhaps we could use the same error >>>>>>>> on fences? >>>>>>>> That makes more sense to me than -ENODEV. >>>>>>>> >>>>>>>> Cheers, >>>>>>>> Nicolai >>>>>>>> >>>>>>>>> >>>>>>>>> Regards, >>>>>>>>> Christian. >>>>>>>>> >>>>>>>>> Am 09.10.2017 um 08:42 schrieb Liu, Monk: >>>>>>>>>> Christian >>>>>>>>>> >>>>>>>>>>> It would be really nice to have an error code set on >>>>>>>>>>> s_fence->finished before it is signaled, use >>>>>>>>>>> dma_fence_set_error() >>>>>>>>>>> for this. 
>>>>>>>>>> For gpu reset patches (already submitted to pub) I would make >>>>>>>>>> kernel >>>>>>>>>> return -ENODEV if the waiting fence (in cs_wait or wait_fences >>>>>>>>>> IOCTL) >>>>>>>>>> founded as error, that way UMD would run into robust extension >>>>>>>>>> path >>>>>>>>>> and considering the GPU hang occurred, >>>>>>>>>> >>>>>>>>>> Don't know if this is expected for the case of normal process >>>>>>>>>> being >>>>>>>>>> killed or crashed like Nicolai hit ... since there is no gpu >>>>>>>>>> hang hit >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> BR Monk >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> -----Original Message----- >>>>>>>>>> From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On >>>>>>>>>> Behalf Of Christian König >>>>>>>>>> Sent: 2017年9月28日 23:01 >>>>>>>>>> To: Nicolai Hähnle <nhaehnle@gmail.com>; >>>>>>>>>> amd-gfx@lists.freedesktop.org >>>>>>>>>> Cc: Haehnle, Nicolai <Nicolai.Haehnle@amd.com> >>>>>>>>>> Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining >>>>>>>>>> fences in amd_sched_entity_fini >>>>>>>>>> >>>>>>>>>> Am 28.09.2017 um 16:55 schrieb Nicolai Hähnle: >>>>>>>>>>> From: Nicolai Hähnle <nicolai.haehnle@amd.com> >>>>>>>>>>> >>>>>>>>>>> Highly concurrent Piglit runs can trigger a race condition >>>>>>>>>>> where a >>>>>>>>>>> pending SDMA job on a buffer object is never executed because the >>>>>>>>>>> corresponding process is killed (perhaps due to a crash). >>>>>>>>>>> Since the >>>>>>>>>>> job's fences were never signaled, the buffer object was >>>>>>>>>>> effectively >>>>>>>>>>> leaked. Worse, the buffer was stuck wherever it happened to be at >>>>>>>>>>> the time, possibly in VRAM.
>>>>>>>>>>> >>>>>>>>>>> The symptom was user space processes stuck in interruptible waits >>>>>>>>>>> with kernel stacks like: >>>>>>>>>>> >>>>>>>>>>> [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250 >>>>>>>>>>> [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0 >>>>>>>>>>> [<ffffffffbc5e82d2>] >>>>>>>>>>> reservation_object_wait_timeout_rcu+0x1c2/0x300 >>>>>>>>>>> [<ffffffffc03ce56f>] >>>>>>>>>>> ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 >>>>>>>>>>> [ttm] >>>>>>>>>>> [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm] >>>>>>>>>>> [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm] >>>>>>>>>>> [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm] >>>>>>>>>>> [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 >>>>>>>>>>> [ttm] >>>>>>>>>>> [<ffffffffc042f523>] >>>>>>>>>>> amdgpu_bo_create_restricted+0x1f3/0x470 >>>>>>>>>>> [amdgpu] >>>>>>>>>>> [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu] >>>>>>>>>>> [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 >>>>>>>>>>> [amdgpu] >>>>>>>>>>> [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 >>>>>>>>>>> [amdgpu] >>>>>>>>>>> [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm] >>>>>>>>>>> [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu] >>>>>>>>>>> [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0 >>>>>>>>>>> [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90 >>>>>>>>>>> [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad >>>>>>>>>>> [<ffffffffffffffff>] 0xffffffffffffffff >>>>>>>>>>> >>>>>>>>>>> Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com> >>>>>>>>>>> Acked-by: Christian König <christian.koenig@amd.com> >>>>>>>>>>> --- >>>>>>>>>>> drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 7 ++++++- >>>>>>>>>>> 1 file changed, 6 insertions(+), 1 deletion(-) >>>>>>>>>>> >>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>>>>>>>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>>>>>>>> index 54eb77cffd9b..32a99e980d78 100644 >>>>>>>>>>> --- 
a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>>>>>>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>>>>>>>> @@ -220,22 +220,27 @@ void amd_sched_entity_fini(struct >>>>>>>>>>> amd_gpu_scheduler *sched, >>>>>>>>>>> amd_sched_entity_is_idle(entity)); >>>>>>>>>>> amd_sched_rq_remove_entity(rq, entity); >>>>>>>>>>> if (r) { >>>>>>>>>>> struct amd_sched_job *job; >>>>>>>>>>> /* Park the kernel for a moment to make sure it isn't >>>>>>>>>>> processing >>>>>>>>>>> * our enity. >>>>>>>>>>> */ >>>>>>>>>>> kthread_park(sched->thread); >>>>>>>>>>> kthread_unpark(sched->thread); >>>>>>>>>>> - while (kfifo_out(&entity->job_queue, &job, sizeof(job))) >>>>>>>>>>> + while (kfifo_out(&entity->job_queue, &job, >>>>>>>>>>> sizeof(job))) { >>>>>>>>>>> + struct amd_sched_fence *s_fence = job->s_fence; >>>>>>>>>>> + amd_sched_fence_scheduled(s_fence); >>>>>>>>>> It would be really nice to have an error code set on >>>>>>>>>> s_fence->finished before it is signaled, use >>>>>>>>>> dma_fence_set_error() for this. >>>>>>>>>> >>>>>>>>>> Additional to that it would be nice to note in the subject line >>>>>>>>>> that >>>>>>>>>> this is a rather important bug fix. >>>>>>>>>> >>>>>>>>>> With that fixed the whole series is Reviewed-by: Christian König >>>>>>>>>> <christian.koenig@amd.com>. >>>>>>>>>> >>>>>>>>>> Regards, >>>>>>>>>> Christian. 
>>>>>>>>>> >>>>>>>>>>> + amd_sched_fence_finished(s_fence); >>>>>>>>>>> + dma_fence_put(&s_fence->finished); >>>>>>>>>>> sched->ops->free_job(job); >>>>>>>>>>> + } >>>>>>>>>>> } >>>>>>>>>>> kfifo_free(&entity->job_queue); >>>>>>>>>>> } >>>>>>>>>>> static void amd_sched_entity_wakeup(struct dma_fence *f, >>>>>>>>>>> struct >>>>>>>>>>> dma_fence_cb *cb) >>>>>>>>>>> { >>>>>>>>>>> struct amd_sched_entity *entity = >>>>>>>>>>> container_of(cb, struct amd_sched_entity, cb); >>>>>>>>>>> entity->dependency = NULL; >>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>>>> amd-gfx mailing list >>>>>>>>>> amd-gfx@lists.freedesktop.org >>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >>>>>>>>>> _______________________________________________ >>>>>>>>>> amd-gfx mailing list >>>>>>>>>> amd-gfx@lists.freedesktop.org >>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> > _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 36+ messages in thread
* RE: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini [not found] ` <2f113fd3-ab4a-58b8-31d8-dc0a23751513-5C7GfCeVMHo@public.gmane.org> 2017-10-09 12:58 ` Nicolai Hähnle @ 2017-10-10 4:00 ` Liu, Monk 1 sibling, 0 replies; 36+ messages in thread From: Liu, Monk @ 2017-10-10 4:00 UTC (permalink / raw) To: Koenig, Christian, Haehnle, Nicolai, Nicolai Hähnle, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Olsak, Marek Cc: Li, Bingley What about below plan? 1, ECANCELED is used for kernel to return an error upon guilty context submitting jobs, In order to prevent guilty context from submitting anymore ... or prevent all contexts from submitting if VRAM lost happens 2, ENODEV is used for kernel to return error in cs_wait/fences_wait IOCTL, so UMD can be notified that the fence they are waiting on will never accomplished but just fake signaled due to GPU hang If you guys are picky on ENODEV, maybe ETIME/EBADFD ?? 3, ENODEV is used for kernel to return error in amdgpu_info_ioctl so UMD can call info_ioctl to acknowledge that the GPU is not work anymore, and after GPU reset done, info_ioctl can work as expected Use info_ioctl there is because it doesn't need context as input parameter, because for VK UMD sometimes it want to query GPU healthy but no context on hand, only with FD/DEVICE BR Monk -----Original Message----- From: Koenig, Christian Sent: 2017年10月9日 20:34 To: Haehnle, Nicolai <Nicolai.Haehnle@amd.com>; Liu, Monk <Monk.Liu@amd.com>; Nicolai Hähnle <nhaehnle@gmail.com>; amd-gfx@lists.freedesktop.org; Olsak, Marek <Marek.Olsak@amd.com> Cc: Li, Bingley <Bingley.Li@amd.com> Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini Am 09.10.2017 um 13:27 schrieb Nicolai Hähnle: > On 09.10.2017 13:12, Christian König wrote: >>> >>>> Nicolai, how hard would it be to handle ENODEV as failure for all >>>> currently existing contexts? >>> >>> Impossible? 
"All currently existing contexts" is not a well-defined >>> concept when multiple drivers co-exist in the same process. >> >> Ok, let me refine the question: I assume there are resources "shared" >> between contexts like binary shader code for example which needs to >> be reuploaded when VRAM is lost. >> >> How hard would it be to handle that correctly? > > Okay, that makes more sense :) > > With the current interface it's still pretty difficult, but if we > could get a new per-device query ioctl which returns a "VRAM loss > counter", it would be reasonably straight-forward. The problem with the VRAM lost counter is that this isn't save either. E.g. you could have an application which uploads shaders, a GPU reset happens and VRAM is lost and then the application creates a new context and makes submission with broken shader binaries. So I would still vote for a separate IOCTL to reset the VRAM lost state which is called *before" user space starts to reupload shader/descriptors etc... This way you also catch the case when another reset happens while you reupload things. > >>> And what would be the purpose of this? If it's to support VRAM loss, >>> having a per-context VRAM loss counter would enable each context to >>> signal ECANCELED separately. >> >> I thought of that on top of the -ENODEV handling. >> >> In other words when we see -ENODEV we call an IOCTL to let the kernel >> know we noticed that something is wrong and then reinit all shared >> resources in userspace. >> >> All existing context will still see -ECANCELED when we drop their >> command submission, but new contexts would at least not cause a new >> lockup immediately because their shader binaries are corrupted. > > I don't think we need -ENODEV for this. We just need -ECANCELED to be > returned when a submission is rejected due to reset (hang or VRAM loss). 
> > Mesa would keep a shadow of the VRAM loss counter in pipe_screen and > pipe_context, and query the kernel's counter when it encounters > -ECANCELED. Each context would then know to drop the CS it's built up > so far and restart based on comparing the VRAM loss counter of > pipe_screen to that of pipe_context, and similarly we could keep a > copy of the VRAM loss counter for important buffer objects like shader > binaries, descriptors, etc. > > This seems more robust to me than relying only on an ENODEV. We'd most > likely keep some kind of VRAM loss counter in Mesa *anyway* (we don't > maintain a list of all shaders, for example, and we can't nuke > important per-context across threads), and synthesizing such a counter > from ENODEVs is not particularly robust (what if multiple ENODEVs > occur for the same loss event?). > > BTW, I still don't like ENODEV. It seems more like the kind of error > code you'd return with hot-pluggable GPUs where the device can > physically disappear... Yeah, ECANCELED sounds like a better alternative. But I think we should still somehow note the fatality of loosing VRAM to userspace. How about ENODATA or EBADFD? Regards, Christian. > > Cheers, > Nicolai > > >> >> Regards, >> Christian. >> >> Am 09.10.2017 um 13:04 schrieb Nicolai Hähnle: >>> On 09.10.2017 12:59, Christian König wrote: >>>> Nicolai, how hard would it be to handle ENODEV as failure for all >>>> currently existing contexts? >>> >>> Impossible? "All currently existing contexts" is not a well-defined >>> concept when multiple drivers co-exist in the same process. >>> >>> And what would be the purpose of this? If it's to support VRAM loss, >>> having a per-context VRAM loss counter would enable each context to >>> signal ECANCELED separately. >>> >>> Cheers, >>> Nicolai >>> >>> >>>> >>>> Monk, would it be ok with you when we return ENODEV only for >>>> existing context when VRAM is lost and/or we have a strict mode GPU >>>> reset? E.g. 
newly created contexts would continue work as they should. >>>> >>>> Regards, >>>> Christian. >>>> >>>> Am 09.10.2017 um 12:49 schrieb Nicolai Hähnle: >>>>> Hi Monk, >>>>> >>>>> Yes, you're right, we're only using ECANCELED internally. But as a >>>>> consequence, Mesa would already handle a kernel error of ECANCELED >>>>> on context loss correctly :) >>>>> >>>>> Cheers, >>>>> Nicolai >>>>> >>>>> On 09.10.2017 12:35, Liu, Monk wrote: >>>>>> Hi Christian >>>>>> >>>>>> You reject some of my patches that returns -ENODEV, with the >>>>>> cause that MESA doesn't do the handling on -ENODEV >>>>>> >>>>>> But if Nicolai can confirm that MESA do have a handling on >>>>>> -ECANCELED, then we need to overall align our error code, on >>>>>> detail below IOCTL can return error code: >>>>>> >>>>>> Amdgpu_cs_ioctl >>>>>> Amdgpu_cs_wait_ioctl >>>>>> Amdgpu_cs_wait_fences_ioctl >>>>>> Amdgpu_info_ioctl >>>>>> >>>>>> >>>>>> My patches is: >>>>>> return -ENODEV on cs_ioctl if the context is detected guilty, >>>>>> also return -ENODEV on cs_wait|cs_wait_fences if the fence is >>>>>> signaled but with error -ETIME, also return -ENODEV on info_ioctl >>>>>> so UMD can query if gpu reset happened after the process created >>>>>> (because for strict mode we block process instead of context) >>>>>> >>>>>> >>>>>> according to Nicolai: >>>>>> >>>>>> amdgpu_cs_ioctl *can* return -ECANCELED, but to be frankly >>>>>> speaking, kernel part doesn't have any place with "-ECANCELED" so >>>>>> this solution on MESA side doesn't align with *current* amdgpu >>>>>> driver, which only return 0 on success or -EINVALID on other >>>>>> error but definitely no "-ECANCELED" error code, >>>>>> >>>>>> so if we talking about community rules we shouldn't let MESA >>>>>> handle -ECANCELED , we should have a unified error code >>>>>> >>>>>> + Marek >>>>>> >>>>>> BR Monk >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> -----Original Message----- >>>>>> From: Haehnle, Nicolai >>>>>> Sent: 2017年10月9日 18:14 >>>>>> To: Koenig, 
Christian <Christian.Koenig@amd.com>; Liu, Monk >>>>>> <Monk.Liu@amd.com>; Nicolai Hähnle <nhaehnle@gmail.com>; >>>>>> amd-gfx@lists.freedesktop.org >>>>>> Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining >>>>>> fences in amd_sched_entity_fini >>>>>> >>>>>> On 09.10.2017 10:02, Christian König wrote: >>>>>>>> For gpu reset patches (already submitted to pub) I would make >>>>>>>> kernel return -ENODEV if the waiting fence (in cs_wait or >>>>>>>> wait_fences >>>>>>>> IOCTL) >>>>>>>> founded as error, that way UMD would run into robust extension >>>>>>>> path and considering the GPU hang occurred, >>>>>>> Well that is only closed source behavior which is completely >>>>>>> irrelevant for upstream development. >>>>>>> >>>>>>> As far as I know we haven't pushed the change to return -ENODEV >>>>>>> upstream. >>>>>> >>>>>> FWIW, radeonsi currently expects -ECANCELED on CS submissions and >>>>>> treats those as context lost. Perhaps we could use the same error >>>>>> on fences? >>>>>> That makes more sense to me than -ENODEV. >>>>>> >>>>>> Cheers, >>>>>> Nicolai >>>>>> >>>>>>> >>>>>>> Regards, >>>>>>> Christian. >>>>>>> >>>>>>> Am 09.10.2017 um 08:42 schrieb Liu, Monk: >>>>>>>> Christian >>>>>>>> >>>>>>>>> It would be really nice to have an error code set on >>>>>>>>> s_fence->finished before it is signaled, use >>>>>>>>> dma_fence_set_error() >>>>>>>>> for this. >>>>>>>> For gpu reset patches (already submitted to pub) I would make >>>>>>>> kernel return -ENODEV if the waiting fence (in cs_wait or >>>>>>>> wait_fences >>>>>>>> IOCTL) >>>>>>>> founded as error, that way UMD would run into robust extension >>>>>>>> path and considering the GPU hang occurred, >>>>>>>> >>>>>>>> Don't know if this is expected for the case of normal process >>>>>>>> being killed or crashed like Nicolai hit ... 
since there is no >>>>>>>> gpu hang hit >>>>>>>> >>>>>>>> >>>>>>>> BR Monk >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -----Original Message----- >>>>>>>> From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On >>>>>>>> Behalf Of Christian K?nig >>>>>>>> Sent: 2017年9月28日 23:01 >>>>>>>> To: Nicolai Hähnle <nhaehnle@gmail.com>; >>>>>>>> amd-gfx@lists.freedesktop.org >>>>>>>> Cc: Haehnle, Nicolai <Nicolai.Haehnle@amd.com> >>>>>>>> Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining >>>>>>>> fences in amd_sched_entity_fini >>>>>>>> >>>>>>>> Am 28.09.2017 um 16:55 schrieb Nicolai Hähnle: >>>>>>>>> From: Nicolai Hähnle <nicolai.haehnle@amd.com> >>>>>>>>> >>>>>>>>> Highly concurrent Piglit runs can trigger a race condition >>>>>>>>> where a >>>>>>>>> pending SDMA job on a buffer object is never executed because the >>>>>>>>> corresponding process is killed (perhaps due to a crash). >>>>>>>>> Since the >>>>>>>>> job's fences were never signaled, the buffer object was >>>>>>>>> effectively >>>>>>>>> leaked. Worse, the buffer was stuck wherever it happened to be at >>>>>>>>> the time, possibly in VRAM. 
>>>>>>>>> >>>>>>>>> The symptom was user space processes stuck in interruptible waits >>>>>>>>> with kernel stacks like: >>>>>>>>> >>>>>>>>> [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250 >>>>>>>>> [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0 >>>>>>>>> [<ffffffffbc5e82d2>] >>>>>>>>> reservation_object_wait_timeout_rcu+0x1c2/0x300 >>>>>>>>> [<ffffffffc03ce56f>] >>>>>>>>> ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 >>>>>>>>> [ttm] >>>>>>>>> [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm] >>>>>>>>> [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm] >>>>>>>>> [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm] >>>>>>>>> [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 >>>>>>>>> [ttm] >>>>>>>>> [<ffffffffc042f523>] >>>>>>>>> amdgpu_bo_create_restricted+0x1f3/0x470 >>>>>>>>> [amdgpu] >>>>>>>>> [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu] >>>>>>>>> [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 >>>>>>>>> [amdgpu] >>>>>>>>> [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 >>>>>>>>> [amdgpu] >>>>>>>>> [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm] >>>>>>>>> [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu] >>>>>>>>> [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0 >>>>>>>>> [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90 >>>>>>>>> [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad >>>>>>>>> [<ffffffffffffffff>] 0xffffffffffffffff >>>>>>>>> >>>>>>>>> Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com> >>>>>>>>> Acked-by: Christian König <christian.koenig@amd.com> >>>>>>>>> --- >>>>>>>>> drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 7 ++++++- >>>>>>>>> 1 file changed, 6 insertions(+), 1 deletion(-) >>>>>>>>> >>>>>>>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>>>>>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>>>>>> index 54eb77cffd9b..32a99e980d78 100644 >>>>>>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>>>>>> +++ 
b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c >>>>>>>>> @@ -220,22 +220,27 @@ void amd_sched_entity_fini(struct >>>>>>>>> amd_gpu_scheduler *sched, >>>>>>>>> amd_sched_entity_is_idle(entity)); >>>>>>>>> amd_sched_rq_remove_entity(rq, entity); >>>>>>>>> if (r) { >>>>>>>>> struct amd_sched_job *job; >>>>>>>>> /* Park the kernel for a moment to make sure it isn't >>>>>>>>> processing >>>>>>>>> * our enity. >>>>>>>>> */ >>>>>>>>> kthread_park(sched->thread); >>>>>>>>> kthread_unpark(sched->thread); >>>>>>>>> - while (kfifo_out(&entity->job_queue, &job, sizeof(job))) >>>>>>>>> + while (kfifo_out(&entity->job_queue, &job, >>>>>>>>> sizeof(job))) { >>>>>>>>> + struct amd_sched_fence *s_fence = job->s_fence; >>>>>>>>> + amd_sched_fence_scheduled(s_fence); >>>>>>>> It would be really nice to have an error code set on >>>>>>>> s_fence->finished before it is signaled, use >>>>>>>> dma_fence_set_error() for this. >>>>>>>> >>>>>>>> Additional to that it would be nice to note in the subject line >>>>>>>> that >>>>>>>> this is a rather important bug fix. >>>>>>>> >>>>>>>> With that fixed the whole series is Reviewed-by: Christian König >>>>>>>> <christian.koenig@amd.com>. >>>>>>>> >>>>>>>> Regards, >>>>>>>> Christian. 
>>>>>>>> >>>>>>>>> + amd_sched_fence_finished(s_fence); >>>>>>>>> + dma_fence_put(&s_fence->finished); >>>>>>>>> sched->ops->free_job(job); >>>>>>>>> + } >>>>>>>>> } >>>>>>>>> kfifo_free(&entity->job_queue); >>>>>>>>> } >>>>>>>>> static void amd_sched_entity_wakeup(struct dma_fence *f, >>>>>>>>> struct >>>>>>>>> dma_fence_cb *cb) >>>>>>>>> { >>>>>>>>> struct amd_sched_entity *entity = >>>>>>>>> container_of(cb, struct amd_sched_entity, cb); >>>>>>>>> entity->dependency = NULL; >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> amd-gfx mailing list >>>>>>>> amd-gfx@lists.freedesktop.org >>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >>>>>>>> _______________________________________________ >>>>>>>> amd-gfx mailing list >>>>>>>> amd-gfx@lists.freedesktop.org >>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> > _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini [not found] ` <20170928145530.12844-5-nhaehnle-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2017-09-28 15:01 ` Christian König @ 2017-09-28 18:30 ` Marek Olšák 2017-09-29 2:17 ` Chunming Zhou 2017-10-11 16:30 ` Michel Dänzer 3 siblings, 0 replies; 36+ messages in thread From: Marek Olšák @ 2017-09-28 18:30 UTC (permalink / raw) To: Nicolai Hähnle; +Cc: Nicolai Hähnle, amd-gfx mailing list Thanks for this series. I can finally finish piglit on VI. Marek On Thu, Sep 28, 2017 at 4:55 PM, Nicolai Hähnle <nhaehnle@gmail.com> wrote: > From: Nicolai Hähnle <nicolai.haehnle@amd.com> > > Highly concurrent Piglit runs can trigger a race condition where a pending > SDMA job on a buffer object is never executed because the corresponding > process is killed (perhaps due to a crash). Since the job's fences were > never signaled, the buffer object was effectively leaked. Worse, the > buffer was stuck wherever it happened to be at the time, possibly in VRAM. 
> > The symptom was user space processes stuck in interruptible waits with > kernel stacks like: > > [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250 > [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0 > [<ffffffffbc5e82d2>] reservation_object_wait_timeout_rcu+0x1c2/0x300 > [<ffffffffc03ce56f>] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 [ttm] > [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm] > [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm] > [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm] > [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm] > [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 [amdgpu] > [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu] > [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 [amdgpu] > [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 [amdgpu] > [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm] > [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu] > [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0 > [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90 > [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad > [<ffffffffffffffff>] 0xffffffffffffffff > > Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com> > Acked-by: Christian König <christian.koenig@amd.com> > --- > drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 7 ++++++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c > index 54eb77cffd9b..32a99e980d78 100644 > --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c > +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c > @@ -220,22 +220,27 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched, > amd_sched_entity_is_idle(entity)); > amd_sched_rq_remove_entity(rq, entity); > if (r) { > struct amd_sched_job *job; > > /* Park the kernel for a moment to make sure it isn't processing > * our enity. 
> */ > kthread_park(sched->thread); > kthread_unpark(sched->thread); > - while (kfifo_out(&entity->job_queue, &job, sizeof(job))) > + while (kfifo_out(&entity->job_queue, &job, sizeof(job))) { > + struct amd_sched_fence *s_fence = job->s_fence; > + amd_sched_fence_scheduled(s_fence); > + amd_sched_fence_finished(s_fence); > + dma_fence_put(&s_fence->finished); > sched->ops->free_job(job); > + } > > } > kfifo_free(&entity->job_queue); > } > > static void amd_sched_entity_wakeup(struct dma_fence *f, struct dma_fence_cb *cb) > { > struct amd_sched_entity *entity = > container_of(cb, struct amd_sched_entity, cb); > entity->dependency = NULL; > -- > 2.11.0 > > _______________________________________________ > amd-gfx mailing list > amd-gfx@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/amd-gfx _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini [not found] ` <20170928145530.12844-5-nhaehnle-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2017-09-28 15:01 ` Christian König 2017-09-28 18:30 ` Marek Olšák @ 2017-09-29 2:17 ` Chunming Zhou 2017-10-11 16:30 ` Michel Dänzer 3 siblings, 0 replies; 36+ messages in thread From: Chunming Zhou @ 2017-09-29 2:17 UTC (permalink / raw) To: Nicolai Hähnle, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW Cc: Nicolai Hähnle On 2017年09月28日 22:55, Nicolai Hähnle wrote: > From: Nicolai Hähnle <nicolai.haehnle@amd.com> > > Highly concurrent Piglit runs can trigger a race condition where a pending > SDMA job on a buffer object is never executed because the corresponding > process is killed (perhaps due to a crash). Since the job's fences were > never signaled, the buffer object was effectively leaked. Worse, the > buffer was stuck wherever it happened to be at the time, possibly in VRAM. Indeed good catch. Cheers, David Zhou > > The symptom was user space processes stuck in interruptible waits with > kernel stacks like: > > [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250 > [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0 > [<ffffffffbc5e82d2>] reservation_object_wait_timeout_rcu+0x1c2/0x300 > [<ffffffffc03ce56f>] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 [ttm] > [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm] > [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm] > [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm] > [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm] > [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 [amdgpu] > [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu] > [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 [amdgpu] > [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 [amdgpu] > [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm] > [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu] > [<ffffffffbc23db33>] 
do_vfs_ioctl+0xa3/0x5f0 > [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90 > [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad > [<ffffffffffffffff>] 0xffffffffffffffff > > Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com> > Acked-by: Christian König <christian.koenig@amd.com> > --- > drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 7 ++++++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c > index 54eb77cffd9b..32a99e980d78 100644 > --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c > +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c > @@ -220,22 +220,27 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched, > amd_sched_entity_is_idle(entity)); > amd_sched_rq_remove_entity(rq, entity); > if (r) { > struct amd_sched_job *job; > > /* Park the kernel for a moment to make sure it isn't processing > * our enity. > */ > kthread_park(sched->thread); > kthread_unpark(sched->thread); > - while (kfifo_out(&entity->job_queue, &job, sizeof(job))) > + while (kfifo_out(&entity->job_queue, &job, sizeof(job))) { > + struct amd_sched_fence *s_fence = job->s_fence; > + amd_sched_fence_scheduled(s_fence); > + amd_sched_fence_finished(s_fence); > + dma_fence_put(&s_fence->finished); > sched->ops->free_job(job); > + } > > } > kfifo_free(&entity->job_queue); > } > > static void amd_sched_entity_wakeup(struct dma_fence *f, struct dma_fence_cb *cb) > { > struct amd_sched_entity *entity = > container_of(cb, struct amd_sched_entity, cb); > entity->dependency = NULL; _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini [not found] ` <20170928145530.12844-5-nhaehnle-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> ` (2 preceding siblings ...) 2017-09-29 2:17 ` Chunming Zhou @ 2017-10-11 16:30 ` Michel Dänzer [not found] ` <7cb63e4c-9b65-b9b9-14dc-26368ca7126a-otUistvHUpPR7s880joybQ@public.gmane.org> 3 siblings, 1 reply; 36+ messages in thread From: Michel Dänzer @ 2017-10-11 16:30 UTC (permalink / raw) To: Nicolai Hähnle Cc: Alex Deucher, Christian König, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW On 28/09/17 04:55 PM, Nicolai Hähnle wrote: > From: Nicolai Hähnle <nicolai.haehnle@amd.com> > > Highly concurrent Piglit runs can trigger a race condition where a pending > SDMA job on a buffer object is never executed because the corresponding > process is killed (perhaps due to a crash). Since the job's fences were > never signaled, the buffer object was effectively leaked. Worse, the > buffer was stuck wherever it happened to be at the time, possibly in VRAM. 
> > The symptom was user space processes stuck in interruptible waits with > kernel stacks like: > > [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250 > [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0 > [<ffffffffbc5e82d2>] reservation_object_wait_timeout_rcu+0x1c2/0x300 > [<ffffffffc03ce56f>] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 [ttm] > [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm] > [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm] > [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm] > [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm] > [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 [amdgpu] > [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu] > [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 [amdgpu] > [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 [amdgpu] > [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm] > [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu] > [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0 > [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90 > [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad > [<ffffffffffffffff>] 0xffffffffffffffff > > Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com> > Acked-by: Christian König <christian.koenig@amd.com> Since Christian's commit which introduced the problem (6af0883ed977 "drm/amdgpu: discard commands of killed processes") is in 4.14, we need a solution for that. Should we backport Nicolai's five commits fixing the problem, or revert 6af0883ed977? While looking into this, I noticed that the following commits by Christian in 4.14 each also cause hangs for me when running the piglit gpu profile on Tonga: 457e0fee04b0 "drm/amdgpu: remove the GART copy hack" 1d00402b4da2 "drm/amdgpu: fix amdgpu_ttm_bind" Are there fixes for these that can be backported to 4.14, or do they need to be reverted there? 
-- Earthling Michel Dänzer | http://www.amd.com Libre software enthusiast | Mesa and X developer _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini [not found] ` <7cb63e4c-9b65-b9b9-14dc-26368ca7126a-otUistvHUpPR7s880joybQ@public.gmane.org> @ 2017-10-12 8:05 ` Christian König [not found] ` <c67d1bd8-81a0-4133-c3df-dd2a1b1a8c11-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 0 siblings, 1 reply; 36+ messages in thread From: Christian König @ 2017-10-12 8:05 UTC (permalink / raw) To: Michel Dänzer, Nicolai Hähnle Cc: Alex Deucher, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW Am 11.10.2017 um 18:30 schrieb Michel Dänzer: > On 28/09/17 04:55 PM, Nicolai Hähnle wrote: >> From: Nicolai Hähnle <nicolai.haehnle@amd.com> >> >> Highly concurrent Piglit runs can trigger a race condition where a pending >> SDMA job on a buffer object is never executed because the corresponding >> process is killed (perhaps due to a crash). Since the job's fences were >> never signaled, the buffer object was effectively leaked. Worse, the >> buffer was stuck wherever it happened to be at the time, possibly in VRAM. 
>> >> The symptom was user space processes stuck in interruptible waits with >> kernel stacks like: >> >> [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250 >> [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0 >> [<ffffffffbc5e82d2>] reservation_object_wait_timeout_rcu+0x1c2/0x300 >> [<ffffffffc03ce56f>] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 [ttm] >> [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm] >> [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm] >> [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm] >> [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm] >> [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 [amdgpu] >> [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu] >> [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 [amdgpu] >> [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 [amdgpu] >> [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm] >> [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu] >> [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0 >> [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90 >> [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad >> [<ffffffffffffffff>] 0xffffffffffffffff >> >> Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com> >> Acked-by: Christian König <christian.koenig@amd.com> > Since Christian's commit which introduced the problem (6af0883ed977 > "drm/amdgpu: discard commands of killed processes") is in 4.14, we need > a solution for that. Should we backport Nicolai's five commits fixing > the problem, or revert 6af0883ed977? > > > While looking into this, I noticed that the following commits by > Christian in 4.14 each also cause hangs for me when running the piglit > gpu profile on Tonga: > > 457e0fee04b0 "drm/amdgpu: remove the GART copy hack" > 1d00402b4da2 "drm/amdgpu: fix amdgpu_ttm_bind" > > Are there fixes for these that can be backported to 4.14, or do they > need to be reverted there? Well I'm not aware that any of those two can cause problems. 
For "drm/amdgpu: remove the GART copy hack" I also don't have the slightest idea how that could be an issue. It just removes an unused code path. Is amd-staging-drm-next stable for you? Thanks, Christian. _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini [not found] ` <c67d1bd8-81a0-4133-c3df-dd2a1b1a8c11-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> @ 2017-10-12 11:00 ` Michel Dänzer [not found] ` <51ec8d88-32eb-ef4a-b34b-d2fd8e23281e-otUistvHUpPR7s880joybQ@public.gmane.org> 0 siblings, 1 reply; 36+ messages in thread From: Michel Dänzer @ 2017-10-12 11:00 UTC (permalink / raw) To: christian.koenig-5C7GfCeVMHo, Nicolai Hähnle Cc: Alex Deucher, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW On 12/10/17 10:05 AM, Christian König wrote: > Am 11.10.2017 um 18:30 schrieb Michel Dänzer: >> On 28/09/17 04:55 PM, Nicolai Hähnle wrote: >>> From: Nicolai Hähnle <nicolai.haehnle@amd.com> >>> >>> Highly concurrent Piglit runs can trigger a race condition where a >>> pending >>> SDMA job on a buffer object is never executed because the corresponding >>> process is killed (perhaps due to a crash). Since the job's fences were >>> never signaled, the buffer object was effectively leaked. Worse, the >>> buffer was stuck wherever it happened to be at the time, possibly in >>> VRAM. 
>>> >>> The symptom was user space processes stuck in interruptible waits with >>> kernel stacks like: >>> >>> [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250 >>> [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0 >>> [<ffffffffbc5e82d2>] >>> reservation_object_wait_timeout_rcu+0x1c2/0x300 >>> [<ffffffffc03ce56f>] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 >>> [ttm] >>> [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm] >>> [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm] >>> [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm] >>> [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm] >>> [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 >>> [amdgpu] >>> [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu] >>> [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 [amdgpu] >>> [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 [amdgpu] >>> [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm] >>> [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu] >>> [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0 >>> [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90 >>> [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad >>> [<ffffffffffffffff>] 0xffffffffffffffff >>> >>> Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com> >>> Acked-by: Christian König <christian.koenig@amd.com> >> Since Christian's commit which introduced the problem (6af0883ed977 >> "drm/amdgpu: discard commands of killed processes") is in 4.14, we need >> a solution for that. Should we backport Nicolai's five commits fixing >> the problem, or revert 6af0883ed977? BTW, any preference for this Christian or Nicolai? 
>> While looking into this, I noticed that the following commits by >> Christian in 4.14 each also cause hangs for me when running the piglit >> gpu profile on Tonga: >> >> 457e0fee04b0 "drm/amdgpu: remove the GART copy hack" >> 1d00402b4da2 "drm/amdgpu: fix amdgpu_ttm_bind" >> >> Are there fixes for these that can be backported to 4.14, or do they >> need to be reverted there? > Well I'm not aware that any of those two can cause problems. > > For "drm/amdgpu: remove the GART copy hack" I also don't have the > slightest idea how that could be an issue. It just removes an unused > code path. I also thought it was weird, and indeed I can no longer reproduce a hang with only 457e0fee04b0; but I still can with only 1d00402b4da2. I guess one of my bisections went wrong and incorrectly identified 457e0fee04b0 instead of 1d00402b4da2. > Is amd-staging-drm-next stable for you? It seemed stable before the changes you pushed this morning. :) As of cfb6dee86711 "drm/ttm: add transparent huge page support for cached allocations v2", I get a flood of [TTM] Erroneous page count. Leaking pages. in dmesg while running piglit, and it eventually hangs[0]. Anyway, unless anyone knows which commits from amd-staging-drm-next are needed to make 1d00402b4da2 stable in 4.14, the safe course of action seems to be reverting it (and ac7afe6b3cf3, which depends on it)?
[0] I also got this, but I don't know yet if it's related: BUG: unable to handle kernel NULL pointer dereference at 0000000000000220 IP: amdgpu_vm_bo_invalidate+0x88/0x210 [amdgpu] PGD 0 P4D 0 Oops: 0000 [#1] SMP Modules linked in: cpufreq_powersave cpufreq_userspace cpufreq_conservative amdkfd(O) edac_mce_amd kvm amdgpu(O) irqbypass crct10dif_pclmul crc32_pclmul chash snd_hda_codec_realtek ghash_clmulni_intel snd_hda_codec_generic snd_hda_codec_hdmi pcbc binfmt_misc ttm(O) efi_pstore snd_hda_intel drm_kms_helper(O) snd_hda_codec nls_ascii drm(O) snd_hda_core nls_cp437 i2c_algo_bit aesni_intel snd_hwdep fb_sys_fops aes_x86_64 crypto_simd vfat syscopyarea glue_helper sysfillrect snd_pcm fat sysimgblt sp5100_tco wmi_bmof ppdev r8169 snd_timer cryptd pcspkr efivars mfd_core mii ccp i2c_piix4 snd soundcore rng_core sg wmi parport_pc parport i2c_designware_platform i2c_designware_core button acpi_cpufreq tcp_bbr sch_fq sunrpc nct6775 hwmon_vid efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 fscrypto raid10 raid1 raid0 multipath linear md_mod dm_mod sd_mod evdev hid_generic usbhid hid crc32c_intel ahci libahci xhci_pci libata xhci_hcd scsi_mod usbcore shpchp gpio_amdpt gpio_generic CPU: 13 PID: 1075 Comm: max-texture-siz Tainted: G W O 4.13.0-rc5+ #28 Hardware name: Micro-Star International Co., Ltd. 
MS-7A34/B350 TOMAHAWK (MS-7A34), BIOS 1.80 09/13/2017 task: ffff9d2982c75a00 task.stack: ffffb2744e9bc000 RIP: 0010:amdgpu_vm_bo_invalidate+0x88/0x210 [amdgpu] RSP: 0018:ffffb2744e9bf6e8 EFLAGS: 00010202 RAX: 0000000000000000 RBX: ffff9d2848642820 RCX: ffff9d28c77fdae0 RDX: 0000000000000001 RSI: ffff9d28c77fd800 RDI: ffff9d288f286008 RBP: ffffb2744e9bf728 R08: 000000ffffffffff R09: 0000000000000000 R10: 0000000000000078 R11: ffff9d298ba170a0 R12: ffff9d28c77fd800 R13: 0000000000000001 R14: ffff9d288f286000 R15: ffff9d2848642800 FS: 00007f809fc5c300(0000) GS:ffff9d298e940000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000220 CR3: 000000030e05a000 CR4: 00000000003406e0 Call Trace: amdgpu_bo_move_notify+0x42/0xd0 [amdgpu] ttm_bo_unmap_virtual_locked+0x298/0xac0 [ttm] ? ttm_bo_mem_space+0x391/0x580 [ttm] ttm_bo_unmap_virtual_locked+0x737/0xac0 [ttm] ttm_bo_unmap_virtual_locked+0xa6f/0xac0 [ttm] ttm_bo_mem_space+0x306/0x580 [ttm] ttm_bo_validate+0xd4/0x150 [ttm] ttm_bo_init_reserved+0x22e/0x440 [ttm] amdgpu_ttm_placement_from_domain+0x33c/0x580 [amdgpu] ? amdgpu_fill_buffer+0x300/0x420 [amdgpu] amdgpu_bo_create+0x50/0x2b0 [amdgpu] amdgpu_gem_object_create+0x9f/0x110 [amdgpu] amdgpu_gem_create_ioctl+0x12f/0x270 [amdgpu] ? amdgpu_gem_object_close+0x210/0x210 [amdgpu] drm_ioctl_kernel+0x5d/0xf0 [drm] drm_ioctl+0x32a/0x630 [drm] ? amdgpu_gem_object_close+0x210/0x210 [amdgpu] ? lru_cache_add_active_or_unevictable+0x36/0xb0 ? __handle_mm_fault+0x90d/0xff0 amdgpu_drm_ioctl+0x4f/0x1c20 [amdgpu] do_vfs_ioctl+0xa5/0x600 ? handle_mm_fault+0xd8/0x230 ? 
__do_page_fault+0x267/0x4c0 SyS_ioctl+0x79/0x90 entry_SYSCALL_64_fastpath+0x1e/0xa9 RIP: 0033:0x7f809c8f3dc7 RSP: 002b:00007ffcc8c485f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007f809cbaab00 RCX: 00007f809c8f3dc7 RDX: 00007ffcc8c48640 RSI: 00000000c0206440 RDI: 0000000000000006 RBP: 0000000040000010 R08: 00007f809cbaabe8 R09: 0000000000000060 R10: 0000000000000004 R11: 0000000000000246 R12: 0000000040001000 R13: 00007f809cbaab58 R14: 0000000000001000 R15: 00007f809cbaab00 Code: 49 8b 47 10 48 39 45 d0 4c 8d 78 f0 0f 84 87 00 00 00 4d 8b 37 45 84 ed 41 c6 47 30 01 49 8d 5f 20 49 8d 7e 08 74 19 49 8b 46 58 <48> 8b 80 20 02 00 00 49 39 84 24 20 02 00 00 0f 84 ab 00 00 00 RIP: amdgpu_vm_bo_invalidate+0x88/0x210 [amdgpu] RSP: ffffb2744e9bf6e8 CR2: 0000000000000220 -- Earthling Michel Dänzer | http://www.amd.com Libre software enthusiast | Mesa and X developer _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 36+ messages in thread
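[Editorial aside: bisections like the one discussed above can be made reproducible by scripting them with `git bisect run`. The following is a minimal, self-contained sketch in a throwaway repository; the commits and the "failure" predicate are invented for the demo and stand in for an actual piglit run on a kernel tree.]

```shell
# Throwaway repo: 8 commits, where commit 6 "introduces" the failure.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.org
git config user.name Demo
for i in 1 2 3 4 5 6 7 8; do
        echo "$i" > state.txt
        git add state.txt
        git commit -qm "commit $i"
done
# HEAD (commit 8) is known bad, HEAD~7 (commit 1) is known good.
git bisect start HEAD HEAD~7 >/dev/null
# The test command is a stand-in for a real reproduction run:
# exit 0 means the commit is good, non-zero means it is bad.
git bisect run sh -c 'test "$(cat state.txt)" -lt 6' >/dev/null
# After the run, refs/bisect/bad names the first bad commit.
git show -s --format=%s refs/bisect/bad   # prints "commit 6"
```

Because the verdict at each step comes from a script rather than a by-hand judgment, an intermittent failure can at least be retried consistently (e.g. by looping the reproducer inside the test command).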
* Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini [not found] ` <51ec8d88-32eb-ef4a-b34b-d2fd8e23281e-otUistvHUpPR7s880joybQ@public.gmane.org> @ 2017-10-12 11:44 ` Christian König [not found] ` <4c750ed5-98be-eafa-e684-940ecb2787f0-5C7GfCeVMHo@public.gmane.org> 2017-10-12 16:49 ` Michel Dänzer 1 sibling, 1 reply; 36+ messages in thread From: Christian König @ 2017-10-12 11:44 UTC (permalink / raw) To: Michel Dänzer, Nicolai Hähnle Cc: Alex Deucher, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW Am 12.10.2017 um 13:00 schrieb Michel Dänzer: > On 12/10/17 10:05 AM, Christian König wrote: > [SNIP] >> Is amd-staging-drm-next stable for you? > It seemed stable before the changes you pushed this morning. :) As of > cfb6dee86711 "drm/ttm: add transparent huge page support for cached > allocations v2", I get a flood of > > [TTM] Erroneous page count. Leaking pages. > > in dmesg while running piglit, and it eventually hangs[0]. Great, going to take a look. > Anyway, unless anyone knows which commits from amd-staging-drm-next are > needed to make 1d00402b4da2 stable in 4.14, the safe course of action > seems to be reverting it (and ac7afe6b3cf3, which depends on it)? The amdgpu_ttm_bind change should be fixed by "70a9c6b drm/amdgpu: fix placement flags in amdgpu_ttm_bind". But I've assumed they went both into 4.14. Thanks for the bugreport, Christian. 
> > > [0] I also got this, but I don't know yet if it's related: > > BUG: unable to handle kernel NULL pointer dereference at 0000000000000220 > IP: amdgpu_vm_bo_invalidate+0x88/0x210 [amdgpu] > PGD 0 > P4D 0 > > Oops: 0000 [#1] SMP > Modules linked in: cpufreq_powersave cpufreq_userspace cpufreq_conservative amdkfd(O) edac_mce_amd kvm amdgpu(O) irqbypass crct10dif_pclmul crc32_pclmul chash snd_hda_codec_realtek ghash_clmulni_intel snd_hda_codec_generic snd_hda_codec_hdmi pcbc binfmt_misc ttm(O) efi_pstore snd_hda_intel drm_kms_helper(O) snd_hda_codec nls_ascii drm(O) snd_hda_core nls_cp437 i2c_algo_bit aesni_intel snd_hwdep fb_sys_fops aes_x86_64 crypto_simd vfat syscopyarea glue_helper sysfillrect snd_pcm fat sysimgblt sp5100_tco wmi_bmof ppdev r8169 snd_timer cryptd pcspkr efivars mfd_core mii ccp i2c_piix4 snd soundcore rng_core sg wmi parport_pc parport i2c_designware_platform i2c_designware_core button acpi_cpufreq tcp_bbr sch_fq sunrpc nct6775 hwmon_vid efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache > jbd2 fscrypto raid10 raid1 raid0 multipath linear md_mod dm_mod sd_mod evdev hid_generic usbhid hid crc32c_intel ahci libahci xhci_pci libata xhci_hcd scsi_mod usbcore shpchp gpio_amdpt gpio_generic > CPU: 13 PID: 1075 Comm: max-texture-siz Tainted: G W O 4.13.0-rc5+ #28 > Hardware name: Micro-Star International Co., Ltd. 
MS-7A34/B350 TOMAHAWK (MS-7A34), BIOS 1.80 09/13/2017 > task: ffff9d2982c75a00 task.stack: ffffb2744e9bc000 > RIP: 0010:amdgpu_vm_bo_invalidate+0x88/0x210 [amdgpu] > RSP: 0018:ffffb2744e9bf6e8 EFLAGS: 00010202 > RAX: 0000000000000000 RBX: ffff9d2848642820 RCX: ffff9d28c77fdae0 > RDX: 0000000000000001 RSI: ffff9d28c77fd800 RDI: ffff9d288f286008 > RBP: ffffb2744e9bf728 R08: 000000ffffffffff R09: 0000000000000000 > R10: 0000000000000078 R11: ffff9d298ba170a0 R12: ffff9d28c77fd800 > R13: 0000000000000001 R14: ffff9d288f286000 R15: ffff9d2848642800 > FS: 00007f809fc5c300(0000) GS:ffff9d298e940000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 0000000000000220 CR3: 000000030e05a000 CR4: 00000000003406e0 > Call Trace: > amdgpu_bo_move_notify+0x42/0xd0 [amdgpu] > ttm_bo_unmap_virtual_locked+0x298/0xac0 [ttm] > ? ttm_bo_mem_space+0x391/0x580 [ttm] > ttm_bo_unmap_virtual_locked+0x737/0xac0 [ttm] > ttm_bo_unmap_virtual_locked+0xa6f/0xac0 [ttm] > ttm_bo_mem_space+0x306/0x580 [ttm] > ttm_bo_validate+0xd4/0x150 [ttm] > ttm_bo_init_reserved+0x22e/0x440 [ttm] > amdgpu_ttm_placement_from_domain+0x33c/0x580 [amdgpu] > ? amdgpu_fill_buffer+0x300/0x420 [amdgpu] > amdgpu_bo_create+0x50/0x2b0 [amdgpu] > amdgpu_gem_object_create+0x9f/0x110 [amdgpu] > amdgpu_gem_create_ioctl+0x12f/0x270 [amdgpu] > ? amdgpu_gem_object_close+0x210/0x210 [amdgpu] > drm_ioctl_kernel+0x5d/0xf0 [drm] > drm_ioctl+0x32a/0x630 [drm] > ? amdgpu_gem_object_close+0x210/0x210 [amdgpu] > ? lru_cache_add_active_or_unevictable+0x36/0xb0 > ? __handle_mm_fault+0x90d/0xff0 > amdgpu_drm_ioctl+0x4f/0x1c20 [amdgpu] > do_vfs_ioctl+0xa5/0x600 > ? handle_mm_fault+0xd8/0x230 > ? 
__do_page_fault+0x267/0x4c0 > SyS_ioctl+0x79/0x90 > entry_SYSCALL_64_fastpath+0x1e/0xa9 > RIP: 0033:0x7f809c8f3dc7 > RSP: 002b:00007ffcc8c485f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 > RAX: ffffffffffffffda RBX: 00007f809cbaab00 RCX: 00007f809c8f3dc7 > RDX: 00007ffcc8c48640 RSI: 00000000c0206440 RDI: 0000000000000006 > RBP: 0000000040000010 R08: 00007f809cbaabe8 R09: 0000000000000060 > R10: 0000000000000004 R11: 0000000000000246 R12: 0000000040001000 > R13: 00007f809cbaab58 R14: 0000000000001000 R15: 00007f809cbaab00 > Code: 49 8b 47 10 48 39 45 d0 4c 8d 78 f0 0f 84 87 00 00 00 4d 8b 37 45 84 ed 41 c6 47 30 01 49 8d 5f 20 49 8d 7e 08 74 19 49 8b 46 58 <48> 8b 80 20 02 00 00 49 39 84 24 20 02 00 00 0f 84 ab 00 00 00 > RIP: amdgpu_vm_bo_invalidate+0x88/0x210 [amdgpu] RSP: ffffb2744e9bf6e8 > CR2: 0000000000000220 > > _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini [not found] ` <4c750ed5-98be-eafa-e684-940ecb2787f0-5C7GfCeVMHo@public.gmane.org> @ 2017-10-12 13:42 ` Michel Dänzer [not found] ` <bc0e87da-a632-07ce-6934-86aee099b916-otUistvHUpPR7s880joybQ@public.gmane.org> 0 siblings, 1 reply; 36+ messages in thread From: Michel Dänzer @ 2017-10-12 13:42 UTC (permalink / raw) To: Christian König, Nicolai Hähnle Cc: Alex Deucher, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW On 12/10/17 01:44 PM, Christian König wrote: > Am 12.10.2017 um 13:00 schrieb Michel Dänzer: > >> Anyway, unless anyone knows which commits from amd-staging-drm-next are >> needed to make 1d00402b4da2 stable in 4.14, the safe course of action >> seems to be reverting it (and ac7afe6b3cf3, which depends on it)? > > The amdgpu_ttm_bind change should be fixed by "70a9c6b drm/amdgpu: fix > placement flags in amdgpu_ttm_bind". Indeed, that fixes it for me. > But I've assumed they went both into 4.14. Unfortunately, it looks like only 1d00402b4da2 made it into 4.14. Alex, please send a fixes pull for 4.14 with a backport of 70a9c6b. For the other issue, do we want to backport Nicolai's commits 6b37d03280a4..318d85de9c20 or revert 6af0883ed977? Christian, can you check that there are no other fixes missing from 4.14? BTW, this raises an issue: Since we push both fixes and new development work to the same internal branch, sometimes it isn't clear which changes should go upstream via -fixes or -next. Any ideas for mitigating the risk of missing an important fix? -- Earthling Michel Dänzer | http://www.amd.com Libre software enthusiast | Mesa and X developer _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini [not found] ` <bc0e87da-a632-07ce-6934-86aee099b916-otUistvHUpPR7s880joybQ@public.gmane.org> @ 2017-10-12 13:50 ` Christian König [not found] ` <609e2516-d783-597c-d771-21dc89091043-5C7GfCeVMHo@public.gmane.org> 0 siblings, 1 reply; 36+ messages in thread From: Christian König @ 2017-10-12 13:50 UTC (permalink / raw) To: Michel Dänzer, Nicolai Hähnle Cc: Alex Deucher, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW Am 12.10.2017 um 15:42 schrieb Michel Dänzer: > On 12/10/17 01:44 PM, Christian König wrote: >> Am 12.10.2017 um 13:00 schrieb Michel Dänzer: >> >>> Anyway, unless anyone knows which commits from amd-staging-drm-next are >>> needed to make 1d00402b4da2 stable in 4.14, the safe course of action >>> seems to be reverting it (and ac7afe6b3cf3, which depends on it)? >> The amdgpu_ttm_bind change should be fixed by "70a9c6b drm/amdgpu: fix >> placement flags in amdgpu_ttm_bind". > Indeed, that fixes it for me. > >> But I've assumed they went both into 4.14. > Unfortunately, it looks like only 1d00402b4da2 made it into 4.14. Alex, > please send a fixes pull for 4.14 with a backport of 70a9c6b. > > For the other issue, do we want to backport Nicolai's commits > 6b37d03280a4..318d85de9c20 or revert 6af0883ed977? > > Christian, can you check that there are no other fixes missing from 4.14? Not that I know of, and manually checking is nearly impossible since I don't know what Alex pushed to 4.14 and what not. Manually checking what went in and what not would take quite some time. > BTW, this raises an issue: Since we push both fixes and new development > work to the same internal branch, sometimes it isn't clear which changes > should go upstream via -fixes or -next. Any ideas for mitigating the > risk of missing an important fix? Well we would need to drop that amd-staging-* model and revert back to something amd-drm-next-*/amd-drm-fixes-* based. 
But that is not something we were able to sell to our internal teams so far. Christian.
* Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini [not found] ` <609e2516-d783-597c-d771-21dc89091043-5C7GfCeVMHo@public.gmane.org> @ 2017-10-12 14:04 ` Michel Dänzer 0 siblings, 0 replies; 36+ messages in thread From: Michel Dänzer @ 2017-10-12 14:04 UTC (permalink / raw) To: Christian König, Nicolai Hähnle Cc: Alex Deucher, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW On 12/10/17 03:50 PM, Christian König wrote: > Am 12.10.2017 um 15:42 schrieb Michel Dänzer: >> On 12/10/17 01:44 PM, Christian König wrote: >>> Am 12.10.2017 um 13:00 schrieb Michel Dänzer: >>> >>>> Anyway, unless anyone knows which commits from amd-staging-drm-next are >>>> needed to make 1d00402b4da2 stable in 4.14, the safe course of action >>>> seems to be reverting it (and ac7afe6b3cf3, which depends on it)? >>> The amdgpu_ttm_bind change should be fixed by "70a9c6b drm/amdgpu: fix >>> placement flags in amdgpu_ttm_bind". >> Indeed, that fixes it for me. >> >>> But I've assumed they went both into 4.14. >> Unfortunately, it looks like only 1d00402b4da2 made it into 4.14. Alex, >> please send a fixes pull for 4.14 with a backport of 70a9c6b. >> >> For the other issue, do we want to backport Nicolai's commits >> 6b37d03280a4..318d85de9c20 or revert 6af0883ed977? >> >> Christian, can you check that there are no other fixes missing from 4.14? > > Not that I know of, and manually checking is nearly impossible since I > don't know what Alex pushed to 4.14 and what not. > > Manually checking what went in and what not would take quite some time. Looking at commits which are in drm-next-4.15 / amd-staging-drm-next but not in Linus' tree should be a good start. 
Something like: gitk v4.14-rc4..origin/amd-staging-drm-next drivers/gpu/drm/{ttm,amd} (in an internal tree) gitk v4.14-rc4..<remote for Alex's tree>/drm-next-4.15 drivers/gpu/drm/{ttm,amd} -- Earthling Michel Dänzer | http://www.amd.com Libre software enthusiast | Mesa and X developer
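[Editorial aside: the gitk comparisons above can also be done non-interactively. `git log --cherry-pick --right-only` over a symmetric range lists commits that exist on one branch but have no patch-equivalent commit on the other, which is exactly the "which fixes never made it into 4.14" question. The sketch below builds a throwaway repository to show the mechanics; all branch names and commit subjects are invented, and on a real kernel tree the range would be something like `v4.14-rc4...origin/amd-staging-drm-next`.]

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.org
git config user.name Demo
echo base > base.txt && git add base.txt && git commit -qm 'common base'
# Development branch gets a fix and a feature.
git checkout -qb devel
echo fix > fix.txt && git add fix.txt && git commit -qm 'fix placement flags'
fixrev=$(git rev-parse HEAD)
echo feat > feat.txt && git add feat.txt && git commit -qm 'new feature work'
# The release branch got the fix as a cherry-pick, but not the feature.
git checkout -qb release "$fixrev~1"
git cherry-pick "$fixrev" >/dev/null
# --cherry-pick drops patch-identical commits from the listing, so only
# commits genuinely missing from the release branch remain.
missing=$(git log --format=%s --cherry-pick --right-only release...devel)
echo "$missing"   # prints "new feature work"
```

Restricting the comparison with a pathspec (e.g. `-- drivers/gpu/drm/ttm drivers/gpu/drm/amd`) keeps the output to the subsystems of interest, like the gitk invocations above.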
* Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini [not found] ` <51ec8d88-32eb-ef4a-b34b-d2fd8e23281e-otUistvHUpPR7s880joybQ@public.gmane.org> 2017-10-12 11:44 ` Christian König @ 2017-10-12 16:49 ` Michel Dänzer [not found] ` <6b509b43-a6e9-175b-7d64-87e38c5ea4e2-otUistvHUpPR7s880joybQ@public.gmane.org> 1 sibling, 1 reply; 36+ messages in thread From: Michel Dänzer @ 2017-10-12 16:49 UTC (permalink / raw) To: christian.koenig-5C7GfCeVMHo, Nicolai Hähnle Cc: Alex Deucher, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW On 12/10/17 01:00 PM, Michel Dänzer wrote: > > [0] I also got this, but I don't know yet if it's related: No, that seems to be a separate issue; I can still reproduce it with the huge page related changes reverted. Unfortunately, it doesn't seem to happen reliably on every piglit run. Even before your changes this morning, there's another hang which doesn't happen every time, without any corresponding dmesg output. Lots of "fun" in amd-staging-drm-next... 
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000220 > IP: amdgpu_vm_bo_invalidate+0x88/0x210 [amdgpu] > PGD 0 > P4D 0 > > Oops: 0000 [#1] SMP > Modules linked in: cpufreq_powersave cpufreq_userspace cpufreq_conservative amdkfd(O) edac_mce_amd kvm amdgpu(O) irqbypass crct10dif_pclmul crc32_pclmul chash snd_hda_codec_realtek ghash_clmulni_intel snd_hda_codec_generic snd_hda_codec_hdmi pcbc binfmt_misc ttm(O) efi_pstore snd_hda_intel drm_kms_helper(O) snd_hda_codec nls_ascii drm(O) snd_hda_core nls_cp437 i2c_algo_bit aesni_intel snd_hwdep fb_sys_fops aes_x86_64 crypto_simd vfat syscopyarea glue_helper sysfillrect snd_pcm fat sysimgblt sp5100_tco wmi_bmof ppdev r8169 snd_timer cryptd pcspkr efivars mfd_core mii ccp i2c_piix4 snd soundcore rng_core sg wmi parport_pc parport i2c_designware_platform i2c_designware_core button acpi_cpufreq tcp_bbr sch_fq sunrpc nct6775 hwmon_vid efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache > jbd2 fscrypto raid10 raid1 raid0 multipath linear md_mod dm_mod sd_mod evdev hid_generic usbhid hid crc32c_intel ahci libahci xhci_pci libata xhci_hcd scsi_mod usbcore shpchp gpio_amdpt gpio_generic > CPU: 13 PID: 1075 Comm: max-texture-siz Tainted: G W O 4.13.0-rc5+ #28 > Hardware name: Micro-Star International Co., Ltd. 
MS-7A34/B350 TOMAHAWK (MS-7A34), BIOS 1.80 09/13/2017 > task: ffff9d2982c75a00 task.stack: ffffb2744e9bc000 > RIP: 0010:amdgpu_vm_bo_invalidate+0x88/0x210 [amdgpu] > RSP: 0018:ffffb2744e9bf6e8 EFLAGS: 00010202 > RAX: 0000000000000000 RBX: ffff9d2848642820 RCX: ffff9d28c77fdae0 > RDX: 0000000000000001 RSI: ffff9d28c77fd800 RDI: ffff9d288f286008 > RBP: ffffb2744e9bf728 R08: 000000ffffffffff R09: 0000000000000000 > R10: 0000000000000078 R11: ffff9d298ba170a0 R12: ffff9d28c77fd800 > R13: 0000000000000001 R14: ffff9d288f286000 R15: ffff9d2848642800 > FS: 00007f809fc5c300(0000) GS:ffff9d298e940000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 0000000000000220 CR3: 000000030e05a000 CR4: 00000000003406e0 > Call Trace: > amdgpu_bo_move_notify+0x42/0xd0 [amdgpu] > ttm_bo_unmap_virtual_locked+0x298/0xac0 [ttm] > ? ttm_bo_mem_space+0x391/0x580 [ttm] > ttm_bo_unmap_virtual_locked+0x737/0xac0 [ttm] > ttm_bo_unmap_virtual_locked+0xa6f/0xac0 [ttm] > ttm_bo_mem_space+0x306/0x580 [ttm] > ttm_bo_validate+0xd4/0x150 [ttm] > ttm_bo_init_reserved+0x22e/0x440 [ttm] > amdgpu_ttm_placement_from_domain+0x33c/0x580 [amdgpu] > ? amdgpu_fill_buffer+0x300/0x420 [amdgpu] > amdgpu_bo_create+0x50/0x2b0 [amdgpu] > amdgpu_gem_object_create+0x9f/0x110 [amdgpu] > amdgpu_gem_create_ioctl+0x12f/0x270 [amdgpu] > ? amdgpu_gem_object_close+0x210/0x210 [amdgpu] > drm_ioctl_kernel+0x5d/0xf0 [drm] > drm_ioctl+0x32a/0x630 [drm] > ? amdgpu_gem_object_close+0x210/0x210 [amdgpu] > ? lru_cache_add_active_or_unevictable+0x36/0xb0 > ? __handle_mm_fault+0x90d/0xff0 > amdgpu_drm_ioctl+0x4f/0x1c20 [amdgpu] > do_vfs_ioctl+0xa5/0x600 > ? handle_mm_fault+0xd8/0x230 > ? 
__do_page_fault+0x267/0x4c0 > SyS_ioctl+0x79/0x90 > entry_SYSCALL_64_fastpath+0x1e/0xa9 > RIP: 0033:0x7f809c8f3dc7 > RSP: 002b:00007ffcc8c485f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 > RAX: ffffffffffffffda RBX: 00007f809cbaab00 RCX: 00007f809c8f3dc7 > RDX: 00007ffcc8c48640 RSI: 00000000c0206440 RDI: 0000000000000006 > RBP: 0000000040000010 R08: 00007f809cbaabe8 R09: 0000000000000060 > R10: 0000000000000004 R11: 0000000000000246 R12: 0000000040001000 > R13: 00007f809cbaab58 R14: 0000000000001000 R15: 00007f809cbaab00 > Code: 49 8b 47 10 48 39 45 d0 4c 8d 78 f0 0f 84 87 00 00 00 4d 8b 37 45 84 ed 41 c6 47 30 01 49 8d 5f 20 49 8d 7e 08 74 19 49 8b 46 58 <48> 8b 80 20 02 00 00 49 39 84 24 20 02 00 00 0f 84 ab 00 00 00 > RIP: amdgpu_vm_bo_invalidate+0x88/0x210 [amdgpu] RSP: ffffb2744e9bf6e8 > CR2: 0000000000000220 > > -- Earthling Michel Dänzer | http://www.amd.com Libre software enthusiast | Mesa and X developer _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini [not found] ` <6b509b43-a6e9-175b-7d64-87e38c5ea4e2-otUistvHUpPR7s880joybQ@public.gmane.org> @ 2017-10-12 17:11 ` Christian König [not found] ` <fcb5f430-5912-0feb-a586-eaf710433d8d-5C7GfCeVMHo@public.gmane.org> 0 siblings, 1 reply; 36+ messages in thread From: Christian König @ 2017-10-12 17:11 UTC (permalink / raw) To: Michel Dänzer, Nicolai Hähnle Cc: Alex Deucher, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW Am 12.10.2017 um 18:49 schrieb Michel Dänzer: > On 12/10/17 01:00 PM, Michel Dänzer wrote: >> [0] I also got this, but I don't know yet if it's related: > No, that seems to be a separate issue; I can still reproduce it with the > huge page related changes reverted. Unfortunately, it doesn't seem to > happen reliably on every piglit run. Can you enable KASAN in your kernel, and please look up at which line number amdgpu_vm_bo_invalidate+0x88 is. > Even before your changes this morning, there's another hang which > doesn't happen every time, without any corresponding dmesg output. > > Lots of "fun" in amd-staging-drm-next... Yeah, way too much stuff on my TODO list and not enough time/resources for extensive testing :( Thanks for the reports, Christian. 
> > >> BUG: unable to handle kernel NULL pointer dereference at 0000000000000220 >> IP: amdgpu_vm_bo_invalidate+0x88/0x210 [amdgpu] >> PGD 0 >> P4D 0 >> >> Oops: 0000 [#1] SMP >> Modules linked in: cpufreq_powersave cpufreq_userspace cpufreq_conservative amdkfd(O) edac_mce_amd kvm amdgpu(O) irqbypass crct10dif_pclmul crc32_pclmul chash snd_hda_codec_realtek ghash_clmulni_intel snd_hda_codec_generic snd_hda_codec_hdmi pcbc binfmt_misc ttm(O) efi_pstore snd_hda_intel drm_kms_helper(O) snd_hda_codec nls_ascii drm(O) snd_hda_core nls_cp437 i2c_algo_bit aesni_intel snd_hwdep fb_sys_fops aes_x86_64 crypto_simd vfat syscopyarea glue_helper sysfillrect snd_pcm fat sysimgblt sp5100_tco wmi_bmof ppdev r8169 snd_timer cryptd pcspkr efivars mfd_core mii ccp i2c_piix4 snd soundcore rng_core sg wmi parport_pc parport i2c_designware_platform i2c_designware_core button acpi_cpufreq tcp_bbr sch_fq sunrpc nct6775 hwmon_vid efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache >> jbd2 fscrypto raid10 raid1 raid0 multipath linear md_mod dm_mod sd_mod evdev hid_generic usbhid hid crc32c_intel ahci libahci xhci_pci libata xhci_hcd scsi_mod usbcore shpchp gpio_amdpt gpio_generic >> CPU: 13 PID: 1075 Comm: max-texture-siz Tainted: G W O 4.13.0-rc5+ #28 >> Hardware name: Micro-Star International Co., Ltd. 
MS-7A34/B350 TOMAHAWK (MS-7A34), BIOS 1.80 09/13/2017 >> task: ffff9d2982c75a00 task.stack: ffffb2744e9bc000 >> RIP: 0010:amdgpu_vm_bo_invalidate+0x88/0x210 [amdgpu] >> RSP: 0018:ffffb2744e9bf6e8 EFLAGS: 00010202 >> RAX: 0000000000000000 RBX: ffff9d2848642820 RCX: ffff9d28c77fdae0 >> RDX: 0000000000000001 RSI: ffff9d28c77fd800 RDI: ffff9d288f286008 >> RBP: ffffb2744e9bf728 R08: 000000ffffffffff R09: 0000000000000000 >> R10: 0000000000000078 R11: ffff9d298ba170a0 R12: ffff9d28c77fd800 >> R13: 0000000000000001 R14: ffff9d288f286000 R15: ffff9d2848642800 >> FS: 00007f809fc5c300(0000) GS:ffff9d298e940000(0000) knlGS:0000000000000000 >> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> CR2: 0000000000000220 CR3: 000000030e05a000 CR4: 00000000003406e0 >> Call Trace: >> amdgpu_bo_move_notify+0x42/0xd0 [amdgpu] >> ttm_bo_unmap_virtual_locked+0x298/0xac0 [ttm] >> ? ttm_bo_mem_space+0x391/0x580 [ttm] >> ttm_bo_unmap_virtual_locked+0x737/0xac0 [ttm] >> ttm_bo_unmap_virtual_locked+0xa6f/0xac0 [ttm] >> ttm_bo_mem_space+0x306/0x580 [ttm] >> ttm_bo_validate+0xd4/0x150 [ttm] >> ttm_bo_init_reserved+0x22e/0x440 [ttm] >> amdgpu_ttm_placement_from_domain+0x33c/0x580 [amdgpu] >> ? amdgpu_fill_buffer+0x300/0x420 [amdgpu] >> amdgpu_bo_create+0x50/0x2b0 [amdgpu] >> amdgpu_gem_object_create+0x9f/0x110 [amdgpu] >> amdgpu_gem_create_ioctl+0x12f/0x270 [amdgpu] >> ? amdgpu_gem_object_close+0x210/0x210 [amdgpu] >> drm_ioctl_kernel+0x5d/0xf0 [drm] >> drm_ioctl+0x32a/0x630 [drm] >> ? amdgpu_gem_object_close+0x210/0x210 [amdgpu] >> ? lru_cache_add_active_or_unevictable+0x36/0xb0 >> ? __handle_mm_fault+0x90d/0xff0 >> amdgpu_drm_ioctl+0x4f/0x1c20 [amdgpu] >> do_vfs_ioctl+0xa5/0x600 >> ? handle_mm_fault+0xd8/0x230 >> ? 
__do_page_fault+0x267/0x4c0 >> SyS_ioctl+0x79/0x90 >> entry_SYSCALL_64_fastpath+0x1e/0xa9 >> RIP: 0033:0x7f809c8f3dc7 >> RSP: 002b:00007ffcc8c485f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 >> RAX: ffffffffffffffda RBX: 00007f809cbaab00 RCX: 00007f809c8f3dc7 >> RDX: 00007ffcc8c48640 RSI: 00000000c0206440 RDI: 0000000000000006 >> RBP: 0000000040000010 R08: 00007f809cbaabe8 R09: 0000000000000060 >> R10: 0000000000000004 R11: 0000000000000246 R12: 0000000040001000 >> R13: 00007f809cbaab58 R14: 0000000000001000 R15: 00007f809cbaab00 >> Code: 49 8b 47 10 48 39 45 d0 4c 8d 78 f0 0f 84 87 00 00 00 4d 8b 37 45 84 ed 41 c6 47 30 01 49 8d 5f 20 49 8d 7e 08 74 19 49 8b 46 58 <48> 8b 80 20 02 00 00 49 39 84 24 20 02 00 00 0f 84 ab 00 00 00 >> RIP: amdgpu_vm_bo_invalidate+0x88/0x210 [amdgpu] RSP: ffffb2744e9bf6e8 >> CR2: 0000000000000220 >> >> > _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini [not found] ` <fcb5f430-5912-0feb-a586-eaf710433d8d-5C7GfCeVMHo@public.gmane.org> @ 2017-10-13 14:34 ` Michel Dänzer [not found] ` <8ab106b9-363b-4fb2-6f1a-727a5e0e7bc5-otUistvHUpPR7s880joybQ@public.gmane.org> 0 siblings, 1 reply; 36+ messages in thread From: Michel Dänzer @ 2017-10-13 14:34 UTC (permalink / raw) To: Christian König, Nicolai Hähnle Cc: Alex Deucher, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW [-- Attachment #1: Type: text/plain, Size: 1370 bytes --] On 12/10/17 07:11 PM, Christian König wrote: > Am 12.10.2017 um 18:49 schrieb Michel Dänzer: >> On 12/10/17 01:00 PM, Michel Dänzer wrote: >>> [0] I also got this, but I don't know yet if it's related: >> No, that seems to be a separate issue; I can still reproduce it with the >> huge page related changes reverted. Unfortunately, it doesn't seem to >> happen reliably on every piglit run. > > Can you enable KASAN in your kernel, KASAN caught something else at the beginning of piglit, see the attached dmesg excerpt. Not sure it's related though. amdgpu_job_free_cb+0x13d/0x160 decodes to: amd_sched_get_job_priority at .../drivers/gpu/drm/amd/amdgpu/../scheduler/gpu_scheduler.h:182 static inline enum amd_sched_priority amd_sched_get_job_priority(struct amd_sched_job *job) { return (job->s_entity->rq - job->sched->sched_rq); <=== } (inlined by) amdgpu_job_free_cb at .../drivers/gpu/drm/amd/amdgpu/amdgpu_job.c:107 amdgpu_ring_priority_put(job->ring, amd_sched_get_job_priority(s_job)); > and please look up at which line number amdgpu_vm_bo_invalidate+0x88 > is. Looks like it's this line: if (evicted && bo->tbo.resv == vm->root.base.bo->tbo.resv) { Maybe vm or vm->root.base.bo is NULL? 
-- 
Earthling Michel Dänzer               |               http://www.amd.com
Libre software enthusiast             |             Mesa and X developer

[-- Attachment #2: kasan-piglit.txt --]
[-- Type: text/plain, Size: 14880 bytes --]

[   89.594368] ==================================================================
[   89.594440] BUG: KASAN: use-after-free in amdgpu_job_free_cb+0x13d/0x160 [amdgpu]
[   89.594444] Read of size 8 at addr ffff880367cc22c0 by task kworker/8:1/142
[   89.594449] CPU: 8 PID: 142 Comm: kworker/8:1 Tainted: G W 4.13.0-rc5+ #29
[   89.594451] Hardware name: Micro-Star International Co., Ltd. MS-7A34/B350 TOMAHAWK (MS-7A34), BIOS 1.80 09/13/2017
[   89.594516] Workqueue: events amd_sched_job_finish [amdgpu]
[   89.594517] Call Trace:
[   89.594522]  dump_stack+0xb8/0x152
[   89.594524]  ? dma_virt_map_sg+0x1fe/0x1fe
[   89.594527]  ? show_regs_print_info+0x62/0x62
[   89.594531]  print_address_description+0x6f/0x280
[   89.594533]  kasan_report+0x27a/0x370
[   89.594596]  ? amdgpu_job_free_cb+0x13d/0x160 [amdgpu]
[   89.594599]  __asan_report_load8_noabort+0x19/0x20
[   89.594662]  amdgpu_job_free_cb+0x13d/0x160 [amdgpu]
[   89.594726]  amd_sched_job_finish+0x36e/0x630 [amdgpu]
[   89.594790]  ? trace_event_raw_event_amd_sched_process_job+0x180/0x180 [amdgpu]
[   89.594792]  ? pick_next_task_fair+0x435/0x15c0
[   89.594795]  ? pwq_dec_nr_in_flight+0x1c2/0x4d0
[   89.594797]  ? cpu_load_update_active+0x330/0x330
[   89.594800]  ? __switch_to+0x685/0xda0
[   89.594801]  ? load_balance+0x3490/0x3490
[   89.594803]  process_one_work+0x8a5/0x1a30
[   89.594805]  ? wq_worker_sleeping+0x86/0x310
[   89.594808]  ? create_worker+0x590/0x590
[   89.594810]  ? __schedule+0x83b/0x1c80
[   89.594813]  ? schedule+0x10e/0x450
[   89.594815]  ? __schedule+0x1c80/0x1c80
[   89.594817]  ? alloc_worker+0x360/0x360
[   89.594819]  ? update_stack_state+0x402/0x780
[   89.594820]  ? update_stack_state+0x402/0x780
[   89.594822]  ? tsc_resume+0x10/0x10
[   89.594824]  worker_thread+0x21f/0x1920
[   89.594825]  ? sched_clock+0x9/0x10
[   89.594826]  ? sched_clock+0x9/0x10
[   89.594828]  ? sched_clock_local+0x43/0x130
[   89.594831]  ? process_one_work+0x1a30/0x1a30
[   89.594832]  ? pick_next_task_fair+0xcd3/0x15c0
[   89.594833]  ? cpu_load_update_active+0x330/0x330
[   89.594835]  ? __switch_to+0x685/0xda0
[   89.594836]  ? load_balance+0x3490/0x3490
[   89.594838]  ? compat_start_thread+0x80/0x80
[   89.594839]  ? sched_clock+0x9/0x10
[   89.594840]  ? sched_clock_local+0x43/0x130
[   89.594843]  ? set_rq_online.part.79+0x130/0x130
[   89.594844]  ? put_prev_entity+0x4e/0x370
[   89.594846]  ? __schedule+0x83b/0x1c80
[   89.594847]  ? kasan_kmalloc+0xad/0xe0
[   89.594849]  ? kmem_cache_alloc_trace+0xe9/0x1f0
[   89.594851]  ? firmware_map_remove+0x80/0x80
[   89.594852]  ? migrate_swap_stop+0x660/0x660
[   89.594855]  ? __schedule+0x1c80/0x1c80
[   89.594856]  ? default_wake_function+0x35/0x50
[   89.594858]  ? __wake_up_common+0xb9/0x150
[   89.594859]  ? print_dl_stats+0x80/0x80
[   89.594861]  kthread+0x310/0x3d0
[   89.594863]  ? process_one_work+0x1a30/0x1a30
[   89.594864]  ? kthread_create_on_node+0xc0/0xc0
[   89.594866]  ret_from_fork+0x25/0x30
[   89.594869] Allocated by task 1701:
[   89.594872]  save_stack_trace+0x1b/0x20
[   89.594873]  save_stack+0x43/0xd0
[   89.594874]  kasan_kmalloc+0xad/0xe0
[   89.594875]  kmem_cache_alloc_trace+0xe9/0x1f0
[   89.594913]  amdgpu_driver_open_kms+0xec/0x3f0 [amdgpu]
[   89.594925]  drm_open+0x7ea/0x13a0 [drm]
[   89.594936]  drm_stub_open+0x2a7/0x420 [drm]
[   89.594939]  chrdev_open+0x24d/0x6f0
[   89.594940]  do_dentry_open+0x5b1/0xd30
[   89.594941]  vfs_open+0xf1/0x260
[   89.594942]  path_openat+0x130a/0x5240
[   89.594944]  do_filp_open+0x23e/0x3c0
[   89.594945]  do_sys_open+0x47a/0x800
[   89.594946]  SyS_open+0x1e/0x20
[   89.594947]  entry_SYSCALL_64_fastpath+0x1e/0xa9
[   89.594949] Freed by task 1737:
[   89.594951]  save_stack_trace+0x1b/0x20
[   89.594952]  save_stack+0x43/0xd0
[   89.594953]  kasan_slab_free+0x72/0xc0
[   89.594954]  kfree+0x94/0x1a0
[   89.594992]  amdgpu_driver_postclose_kms+0x495/0x830 [amdgpu]
[   89.595002]  drm_release+0x9bf/0x1350 [drm]
[   89.595004]  __fput+0x306/0x900
[   89.595005]  ____fput+0xe/0x10
[   89.595006]  task_work_run+0x14d/0x230
[   89.595008]  exit_to_usermode_loop+0x1f5/0x230
[   89.595009]  syscall_return_slowpath+0x1d8/0x240
[   89.595011]  entry_SYSCALL_64_fastpath+0xa7/0xa9
[   89.595013] The buggy address belongs to the object at ffff880367cc2200 which belongs to the cache kmalloc-2048 of size 2048
[   89.595015] The buggy address is located 192 bytes inside of 2048-byte region [ffff880367cc2200, ffff880367cc2a00)
[   89.595017] The buggy address belongs to the page:
[   89.595019] page:ffffea000d9f3000 count:1 mapcount:0 mapping: (null) index:0x0 compound_mapcount: 0
[   89.595022] flags: 0x17fffc000008100(slab|head)
[   89.595026] raw: 017fffc000008100 0000000000000000 0000000000000000 00000001800f000f
[   89.595028] raw: dead000000000100 dead000000000200 ffff88038e00ea00 0000000000000000
[   89.595029] page dumped because: kasan: bad access detected
[   89.595031] Memory state around the buggy address:
[   89.595032]  ffff880367cc2180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[   89.595034]  ffff880367cc2200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   89.595036] >ffff880367cc2280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   89.595037]                                            ^
[   89.595038]  ffff880367cc2300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   89.595040]  ffff880367cc2380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   89.595041] ==================================================================
[   89.595042] Disabling lock debugging due to kernel taint
[  239.309507] [drm:amdgpu_gem_object_create [amdgpu]] *ERROR* Failed to allocate GEM object (4294967296, 2, 4096, -12)
[  239.340689] [drm:amdgpu_gem_object_create [amdgpu]] *ERROR* Failed to allocate GEM object (4294967296, 2, 4096, -12)
[  323.869197] [drm:amdgpu_gem_object_create [amdgpu]] *ERROR* Failed to allocate GEM object (2147483648, 2, 4096, -12)
[  323.869377] [drm:amdgpu_gem_object_create [amdgpu]] *ERROR* Failed to allocate GEM object (2147483648, 2, 4096, -12)
[  349.580299] kasan: CONFIG_KASAN_INLINE enabled
[  349.580305] kasan: GPF could be caused by NULL-ptr deref or user memory access
[  349.580311] general protection fault: 0000 [#1] SMP KASAN
[  349.580313] Modules linked in: snd_hda_codec_realtek snd_hda_codec_generic cpufreq_powersave cpufreq_userspace cpufreq_conservative binfmt_misc nls_ascii nls_cp437 vfat fat edac_mce_amd kvm amdkfd irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel amdgpu pcbc snd_hda_codec_hdmi efi_pstore chash ttm snd_hda_intel snd_hda_codec drm_kms_helper snd_hda_core snd_hwdep wmi_bmof drm snd_pcm aesni_intel i2c_algo_bit snd_timer aes_x86_64 crypto_simd glue_helper sp5100_tco snd fb_sys_fops ccp syscopyarea r8169 ppdev sysfillrect cryptd pcspkr efivars sysimgblt i2c_piix4 mii sg rng_core mfd_core soundcore wmi parport_pc parport i2c_designware_platform i2c_designware_core button acpi_cpufreq tcp_bbr sch_fq nct6775 hwmon_vid sunrpc efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 fscrypto raid10
[  349.580366]  raid1 raid0 multipath linear md_mod dm_mod sd_mod evdev hid_generic usbhid hid ahci crc32c_intel libahci xhci_pci libata xhci_hcd usbcore scsi_mod shpchp gpio_amdpt gpio_generic
[  349.580385] CPU: 5 PID: 529 Comm: max-texture-siz Tainted: G B W 4.13.0-rc5+ #29
[  349.580388] Hardware name: Micro-Star International Co., Ltd. MS-7A34/B350 TOMAHAWK (MS-7A34), BIOS 1.80 09/13/2017
[  349.580392] task: ffff88036d258000 task.stack: ffff880274260000
[  349.580466] RIP: 0010:amdgpu_vm_bo_invalidate+0x277/0xd60 [amdgpu]
[  349.580469] RSP: 0018:ffff880274266810 EFLAGS: 00010216
[  349.580472] RAX: 0000000000000000 RBX: ffff8801d9ac2d00 RCX: 0000000000000000
[  349.580474] RDX: ffff880389c74408 RSI: ffff880389c73da8 RDI: 0000000000000220
[  349.580476] RBP: ffff880274266aa8 R08: 0000000000000010 R09: 0000000000000044
[  349.580479] R10: ffff880274266d88 R11: 0000000000000010 R12: dffffc0000000000
[  349.580481] R13: ffff8801d9ac2d20 R14: ffff880259586600 R15: ffff880389c74400
[  349.580484] FS:  00007ff180510300(0000) GS:ffff88038e540000(0000) knlGS:0000000000000000
[  349.580487] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  349.580489] CR2: 00007fe4e75c1000 CR3: 000000037f1bc000 CR4: 00000000003406e0
[  349.580491] Call Trace:
[  349.580502]  ? ttm_mem_global_alloc_zone.constprop.4+0x1c6/0x290 [ttm]
[  349.580509]  ? ttm_bo_init_reserved+0x9b6/0x1300 [ttm]
[  349.580570]  ? amdgpu_bo_do_create+0x594/0x1420 [amdgpu]
[  349.580634]  ? amdgpu_vm_bo_rmv+0xd60/0xd60 [amdgpu]
[  349.580641]  ? ttm_mem_global_alloc_page+0x9e/0xf0 [ttm]
[  349.580649]  ? ttm_pool_populate+0x46e/0xd10 [ttm]
[  349.580655]  ? kasan_kmalloc+0xad/0xe0
[  349.580664]  ? ttm_pool_unpopulate+0x290/0x290 [ttm]
[  349.580669]  ? kvmalloc_node+0x75/0x80
[  349.580676]  ? ttm_dma_tt_init+0x285/0x620 [ttm]
[  349.580679]  ? kasan_unpoison_shadow+0x35/0x50
[  349.580686]  ? ttm_tt_init+0x300/0x300 [ttm]
[  349.580691]  ? nommu_map_page+0x5c/0xa0
[  349.580754]  ? amdgpu_ttm_backend_bind+0x10f/0xc70 [amdgpu]
[  349.580817]  ? amdgpu_ttm_tt_create+0x132/0x2c0 [amdgpu]
[  349.580879]  amdgpu_bo_move_notify+0x10f/0x350 [amdgpu]
[  349.580942]  ? amdgpu_bo_get_metadata+0x200/0x200 [amdgpu]
[  349.580950]  ttm_bo_handle_move_mem+0x782/0x22f0 [ttm]
[  349.580955]  ? reservation_object_reserve_shared+0x167/0x200
[  349.580962]  ? ttm_bo_add_move_fence.isra.17+0x29/0x160 [ttm]
[  349.580969]  ? ttm_bo_mem_space+0x518/0xdf0 [ttm]
[  349.580974]  ? free_one_page+0x1560/0x1560
[  349.580983]  ttm_bo_evict+0x3ff/0xf70 [ttm]
[  349.580990]  ? ttm_bo_release_list+0x910/0x910 [ttm]
[  349.580998]  ? ttm_bo_handle_move_mem+0x22f0/0x22f0 [ttm]
[  349.581005]  ? ttm_bo_add_to_lru+0x46c/0x870 [ttm]
[  349.581009]  ? kasan_kmalloc_large+0x9c/0xd0
[  349.581017]  ? ttm_mem_global_alloc_zone.constprop.4+0x1c6/0x290 [ttm]
[  349.581020]  ? kmem_cache_alloc+0xb7/0x1c0
[  349.581081]  ? amdgpu_ttm_bo_eviction_valuable+0x199/0x2b0 [amdgpu]
[  349.581143]  ? amdgpu_vram_mgr_new+0x4e4/0x780 [amdgpu]
[  349.581152]  ttm_mem_evict_first+0x312/0x4a0 [ttm]
[  349.581215]  ? amdgpu_vram_mgr_new+0x4e4/0x780 [amdgpu]
[  349.581222]  ? ttm_bo_evict+0xf70/0xf70 [ttm]
[  349.581230]  ttm_bo_mem_space+0x84e/0xdf0 [ttm]
[  349.581233]  ? __shmem_file_setup+0x250/0x520
[  349.581241]  ttm_bo_validate+0x322/0x580 [ttm]
[  349.581244]  ? unwind_dump+0x4e0/0x4e0
[  349.581251]  ? ttm_bo_evict_mm+0xa0/0xa0 [ttm]
[  349.581255]  ? cpufreq_default_governor+0x20/0x20
[  349.581277]  ? drm_add_edid_modes+0x44a0/0x67e0 [drm]
[  349.581286]  ttm_bo_init_reserved+0x9b6/0x1300 [ttm]
[  349.581294]  ? ttm_bo_validate+0x580/0x580 [ttm]
[  349.581298]  ? dentry_path_raw+0x10/0x10
[  349.581302]  ? proc_nr_files+0x30/0x30
[  349.581305]  ? shmem_get_inode+0x668/0x8f0
[  349.581308]  ? shmem_fh_to_dentry+0x160/0x160
[  349.581313]  ? entry_SYSCALL_64_fastpath+0x1e/0xa9
[  349.581318]  ? _copy_to_user+0x90/0x90
[  349.581321]  ? alloc_file+0x16d/0x440
[  349.581324]  ? __shmem_file_setup+0x2e0/0x520
[  349.581327]  ? shmem_fill_super+0xa10/0xa10
[  349.581344]  ? drm_gem_private_object_init+0x189/0x300 [drm]
[  349.581347]  ? kasan_kmalloc+0xad/0xe0
[  349.581412]  amdgpu_bo_do_create+0x594/0x1420 [amdgpu]
[  349.581486]  ? amdgpu_fill_buffer+0xb80/0xb80 [amdgpu]
[  349.581489]  ? update_stack_state+0x402/0x780
[  349.581559]  ? amdgpu_ttm_placement_from_domain+0x8d0/0x8d0 [amdgpu]
[  349.581565]  ? show_initstate+0xb0/0xb0
[  349.581570]  ? bpf_prog_alloc+0x320/0x320
[  349.581574]  ? unwind_next_frame.part.5+0x1bb/0xc90
[  349.581582]  ? __free_insn_slot+0x6a0/0x6a0
[  349.581585]  ? unwind_dump+0x4e0/0x4e0
[  349.581590]  ? rb_erase+0x3540/0x3540
[  349.581595]  ? __mem_cgroup_threshold+0x7b0/0x7b0
[  349.581599]  ? memory_max_write+0x420/0x420
[  349.581605]  ? __kernel_text_address+0xbf/0xf0
[  349.581608]  ? unwind_get_return_address+0x66/0xb0
[  349.581612]  ? __save_stack_trace+0x92/0x100
[  349.581685]  amdgpu_bo_create+0xba/0xa00 [amdgpu]
[  349.581759]  ? amdgpu_bo_do_create+0x1420/0x1420 [amdgpu]
[  349.581763]  ? mem_cgroup_uncharge_swap+0xc0/0xc0
[  349.581767]  ? kmem_cache_alloc+0xb7/0x1c0
[  349.581771]  ? __anon_vma_prepare+0x37a/0x650
[  349.581775]  ? __handle_mm_fault+0x31ac/0x5070
[  349.581778]  ? handle_mm_fault+0x292/0x800
[  349.581782]  ? __do_page_fault+0x412/0xa00
[  349.581785]  ? do_page_fault+0x22/0x30
[  349.581788]  ? page_fault+0x28/0x30
[  349.581792]  ? memcg_oom_wake_function+0x6a0/0x6a0
[  349.581866]  amdgpu_gem_object_create+0x11f/0x240 [amdgpu]
[  349.581942]  ? amdgpu_gem_object_free+0x1d0/0x1d0 [amdgpu]
[  349.581945]  ? __alloc_pages_nodemask+0x3d8/0xe50
[  349.582021]  amdgpu_gem_create_ioctl+0x3bb/0xc10 [amdgpu]
[  349.582097]  ? amdgpu_gem_object_close+0x790/0x790 [amdgpu]
[  349.582101]  ? page_add_new_anon_rmap+0x203/0x3d0
[  349.582105]  ? __check_object_size+0x22e/0x560
[  349.582180]  ? amdgpu_gem_object_close+0x790/0x790 [amdgpu]
[  349.582201]  drm_ioctl_kernel+0x1ce/0x330 [drm]
[  349.582221]  ? drm_ioctl_permit+0x2c0/0x2c0 [drm]
[  349.582225]  ? kasan_check_write+0x14/0x20
[  349.582246]  drm_ioctl+0x79a/0xc30 [drm]
[  349.582321]  ? amdgpu_gem_object_close+0x790/0x790 [amdgpu]
[  349.582342]  ? drm_getstats+0x20/0x20 [drm]
[  349.582347]  ? do_mmap+0x641/0x10f0
[  349.582419]  amdgpu_drm_ioctl+0xd8/0x1b0 [amdgpu]
[  349.582424]  do_vfs_ioctl+0x197/0x1490
[  349.582428]  ? vm_mmap_pgoff+0x1fe/0x2c0
[  349.582431]  ? ioctl_preallocate+0x2c0/0x2c0
[  349.582435]  ? __fget_light+0x2be/0x410
[  349.582438]  ? iterate_fd+0x2e0/0x2e0
[  349.582441]  ? handle_mm_fault+0x292/0x800
[  349.582445]  ? __handle_mm_fault+0x5070/0x5070
[  349.582449]  ? __do_page_fault+0x43a/0xa00
[  349.582453]  SyS_ioctl+0x79/0x90
[  349.582457]  entry_SYSCALL_64_fastpath+0x1e/0xa9
[  349.582461] RIP: 0033:0x7ff17d1a7dc7
[  349.582463] RSP: 002b:00007ffe7a4edbf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  349.582467] RAX: ffffffffffffffda RBX: 00007ff17d45eb00 RCX: 00007ff17d1a7dc7
[  349.582470] RDX: 00007ffe7a4edc40 RSI: 00000000c0206440 RDI: 0000000000000006
[  349.582472] RBP: 0000000040000010 R08: 00007ff17d45ebc8 R09: 0000000000000060
[  349.582475] R10: 0000000000000004 R11: 0000000000000246 R12: 0000000040001000
[  349.582477] R13: 00007ff17d45eb58 R14: 0000000000001000 R15: 00007ff17d45eb00
[  349.582480] Code: 49 8b b6 20 02 00 00 48 89 f8 48 c1 e8 03 42 80 3c 20 00 0f 85 f7 05 00 00 49 8b 47 58 48 8d b8 20 02 00 00 49 89 f9 49 c1 e9 03 <43> 80 3c 21 00 0f 85 e8 08 00 00 48 3b b0 20 02 00 00 0f 84 d9
[  349.582596] RIP: amdgpu_vm_bo_invalidate+0x277/0xd60 [amdgpu] RSP: ffff880274266810
[  349.582600] ---[ end trace cc9c171d2cdc0539 ]---
[  360.319061] amdgpu 0000:23:00.0: Disabling VM faults because of PRT request!
* Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini
  2017-10-13 15:20 ` Christian König

From: Christian König @ 2017-10-13 15:20 UTC (permalink / raw)
To: Michel Dänzer, Nicolai Hähnle
Cc: Alex Deucher, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 13.10.2017 at 16:34, Michel Dänzer wrote:
> On 12/10/17 07:11 PM, Christian König wrote:
>> On 12.10.2017 at 18:49, Michel Dänzer wrote:
>>> On 12/10/17 01:00 PM, Michel Dänzer wrote:
>>>> [0] I also got this, but I don't know yet if it's related:
>>> No, that seems to be a separate issue; I can still reproduce it with the
>>> huge page related changes reverted. Unfortunately, it doesn't seem to
>>> happen reliably on every piglit run.
>> Can you enable KASAN in your kernel,
> KASAN caught something else at the beginning of piglit, see the attached
> dmesg excerpt. Not sure it's related though.
>
> amdgpu_job_free_cb+0x13d/0x160 decodes to:
>
> amd_sched_get_job_priority at .../drivers/gpu/drm/amd/amdgpu/../scheduler/gpu_scheduler.h:182
>
> static inline enum amd_sched_priority
> amd_sched_get_job_priority(struct amd_sched_job *job)
> {
>         return (job->s_entity->rq - job->sched->sched_rq); <===
> }
>
> (inlined by) amdgpu_job_free_cb at .../drivers/gpu/drm/amd/amdgpu/amdgpu_job.c:107
>
> amdgpu_ring_priority_put(job->ring, amd_sched_get_job_priority(s_job));

Sounds a lot like the code Andres added is buggy somehow. Going to take a look as well.

>> and please look up at which line number amdgpu_vm_bo_invalidate+0x88
>> is.
> Looks like it's this line:
>
> if (evicted && bo->tbo.resv == vm->root.base.bo->tbo.resv) {
>
> Maybe vm or vm->root.base.bo is NULL?

Ah, of course! We need to reserve the page directory root when we release it, or otherwise we can run into a race with somebody else trying to evict it.

Going to send a patch in a minute,
Christian.
end of thread, other threads:[~2017-10-13 15:20 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-28 14:55 [PATCH 1/5] drm/amd/sched: rename amd_sched_entity_pop_job Nicolai Hähnle
     [not found] ` <20170928145530.12844-1-nhaehnle-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2017-09-28 14:55 ` [PATCH 2/5] drm/amd/sched: fix an outdated comment Nicolai Hähnle
  2017-09-28 14:55 ` [PATCH 3/5] drm/amd/sched: move adding finish callback to amd_sched_job_begin Nicolai Hähnle
  2017-09-28 14:55 ` [PATCH 4/5] drm/amd/sched: NULL out the s_fence field after run_job Nicolai Hähnle
     [not found] ` <20170928145530.12844-4-nhaehnle-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
    2017-09-28 18:39 ` Andres Rodriguez
     [not found] ` <7064b408-60db-2817-0ae7-af6b2c56580b-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
      2017-09-28 19:04 ` Nicolai Hähnle
  2017-09-28 14:55 ` [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini Nicolai Hähnle
     [not found] ` <20170928145530.12844-5-nhaehnle-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
    2017-09-28 15:01 ` Christian König
     [not found] ` <3032bef3-4829-8cae-199a-11353b38c49a-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
      2017-10-02 16:00 ` Tom St Denis
      2017-10-09  6:42 ` Liu, Monk
     [not found] ` <BLUPR12MB044904A26E01C265C49042E484740-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
        2017-10-09  8:02 ` Christian König
     [not found] ` <11f21e54-16b8-68e4-c63e-d791ef8bbffa-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
          2017-10-09 10:14 ` Nicolai Hähnle
     [not found] ` <d0f66c04-fbcd-09a2-6e4c-9de9ca7a93ff-5C7GfCeVMHo@public.gmane.org>
            2017-10-09 10:35 ` Liu, Monk
     [not found] ` <BLUPR12MB044925932C8D956F93CAF93E84740-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
              2017-10-09 10:49 ` Nicolai Hähnle
     [not found] ` <7e338e23-540c-4e2e-982f-f0eb623c75b1-5C7GfCeVMHo@public.gmane.org>
                2017-10-09 10:59 ` Christian König
     [not found] ` <760c1434-0739-81ff-82c3-a5210c5575d3-5C7GfCeVMHo@public.gmane.org>
                  2017-10-09 11:04 ` Nicolai Hähnle
     [not found] ` <de5e2c7c-b6cd-1c24-4d8e-7ae3cdfad0bd-5C7GfCeVMHo@public.gmane.org>
                    2017-10-09 11:12 ` Christian König
     [not found] ` <9619ebd2-f218-7568-3b24-0a9d2b008a6a-5C7GfCeVMHo@public.gmane.org>
                      2017-10-09 11:27 ` Nicolai Hähnle
     [not found] ` <de68c0ca-f36e-3adb-2c42-83a5176f07d8-5C7GfCeVMHo@public.gmane.org>
                        2017-10-09 12:33 ` Christian König
     [not found] ` <2f113fd3-ab4a-58b8-31d8-dc0a23751513-5C7GfCeVMHo@public.gmane.org>
                          2017-10-09 12:58 ` Nicolai Hähnle
     [not found] ` <1a79e19c-a654-f5c7-84d9-ce4cce76243f-5C7GfCeVMHo@public.gmane.org>
                            2017-10-09 13:57 ` Olsak, Marek
     [not found] ` <CY1PR12MB0885AF7148CD8ECE929E96D2F9740-1s8aH8ViOEfCYw/MNJAFQgdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
                              2017-10-09 14:01 ` Nicolai Hähnle
                              2017-10-10  4:00 ` Liu, Monk
    2017-09-28 18:30 ` Marek Olšák
    2017-09-29  2:17 ` Chunming Zhou
    2017-10-11 16:30 ` Michel Dänzer
     [not found] ` <7cb63e4c-9b65-b9b9-14dc-26368ca7126a-otUistvHUpPR7s880joybQ@public.gmane.org>
      2017-10-12  8:05 ` Christian König
     [not found] ` <c67d1bd8-81a0-4133-c3df-dd2a1b1a8c11-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
        2017-10-12 11:00 ` Michel Dänzer
     [not found] ` <51ec8d88-32eb-ef4a-b34b-d2fd8e23281e-otUistvHUpPR7s880joybQ@public.gmane.org>
          2017-10-12 11:44 ` Christian König
     [not found] ` <4c750ed5-98be-eafa-e684-940ecb2787f0-5C7GfCeVMHo@public.gmane.org>
            2017-10-12 13:42 ` Michel Dänzer
     [not found] ` <bc0e87da-a632-07ce-6934-86aee099b916-otUistvHUpPR7s880joybQ@public.gmane.org>
              2017-10-12 13:50 ` Christian König
     [not found] ` <609e2516-d783-597c-d771-21dc89091043-5C7GfCeVMHo@public.gmane.org>
                2017-10-12 14:04 ` Michel Dänzer
                2017-10-12 16:49 ` Michel Dänzer
     [not found] ` <6b509b43-a6e9-175b-7d64-87e38c5ea4e2-otUistvHUpPR7s880joybQ@public.gmane.org>
                  2017-10-12 17:11 ` Christian König
     [not found] ` <fcb5f430-5912-0feb-a586-eaf710433d8d-5C7GfCeVMHo@public.gmane.org>
                    2017-10-13 14:34 ` Michel Dänzer
     [not found] ` <8ab106b9-363b-4fb2-6f1a-727a5e0e7bc5-otUistvHUpPR7s880joybQ@public.gmane.org>
                      2017-10-13 15:20 ` Christian König