From: "Nicolai Hähnle" <nicolai.haehnle-5C7GfCeVMHo@public.gmane.org>
To: christian.koenig-5C7GfCeVMHo@public.gmane.org, "Liu,
	Monk" <Monk.Liu-5C7GfCeVMHo@public.gmane.org>,
	"Nicolai Hähnle"
	<nhaehnle-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>,
	"amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org"
	<amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>
Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini
Date: Mon, 9 Oct 2017 12:14:29 +0200
Message-ID: <d0f66c04-fbcd-09a2-6e4c-9de9ca7a93ff@amd.com>
In-Reply-To: <11f21e54-16b8-68e4-c63e-d791ef8bbffa-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

On 09.10.2017 10:02, Christian König wrote:
>> For the GPU reset patches (already submitted to pub), I would make the
>> kernel return -ENODEV if the fence being waited on (in the cs_wait or
>> wait_fences IOCTL) is found to carry an error. That way the UMD would take
>> the robust extension path and consider that a GPU hang occurred.
> Well that is only closed source behavior which is completely irrelevant 
> for upstream development.
> 
> As far as I know we haven't pushed the change to return -ENODEV upstream.

FWIW, radeonsi currently expects -ECANCELED on CS submissions and treats 
those as context lost. Perhaps we could use the same error on fences? 
That makes more sense to me than -ENODEV.
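
As a rough sketch of the user-space side, assuming fence waits were to
report the same -ECANCELED that CS submission does today: the enum, struct,
and function names below are made up for illustration and are not radeonsi
or libdrm code, and treating -ECANCELED as the only "context lost" error is
likewise an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-context state; not an actual radeonsi structure. */
enum ctx_status {
    CTX_OK,
    CTX_LOST,
};

struct toy_ctx {
    enum ctx_status status;
};

/*
 * Record the result of a CS submission or a fence wait.
 * 'ret' is 0 on success or a negative errno value.
 * Returns true if the context is still usable afterwards.
 */
static bool toy_ctx_note_result(struct toy_ctx *ctx, int ret)
{
    assert(ctx != NULL);

    /* Assumption: -ECANCELED is the one error that means "context lost". */
    if (ret == -ECANCELED)
        ctx->status = CTX_LOST;

    /* Once lost, a context stays lost, even if later calls succeed. */
    return ctx->status == CTX_OK;
}
```

With a single error code for both paths, the submission and wait code can
share one check like this instead of special-casing fences.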

Cheers,
Nicolai

> 
> Regards,
> Christian.
> 
> On 09.10.2017 08:42, Liu, Monk wrote:
>> Christian
>>
>>> It would be really nice to have an error code set on 
>>> s_fence->finished before it is signaled, use dma_fence_set_error() 
>>> for this.
>> For the GPU reset patches (already submitted to pub), I would make the
>> kernel return -ENODEV if the fence being waited on (in the cs_wait or
>> wait_fences IOCTL) is found to carry an error. That way the UMD would take
>> the robust extension path and consider that a GPU hang occurred.
>>
>> I don't know whether this is expected when a normal process is killed
>> or crashes, as in the case Nicolai hit, since no GPU hang occurred there.
>>
>>
>> BR Monk
>>
>>
>>
>>
>> -----Original Message-----
>> From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf 
>> Of Christian König
>> Sent: September 28, 2017 23:01
>> To: Nicolai Hähnle <nhaehnle@gmail.com>; amd-gfx@lists.freedesktop.org
>> Cc: Haehnle, Nicolai <Nicolai.Haehnle@amd.com>
>> Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining 
>> fences in amd_sched_entity_fini
>>
>> On 28.09.2017 16:55, Nicolai Hähnle wrote:
>>> From: Nicolai Hähnle <nicolai.haehnle@amd.com>
>>>
>>> Highly concurrent Piglit runs can trigger a race condition where a
>>> pending SDMA job on a buffer object is never executed because the
>>> corresponding process is killed (perhaps due to a crash). Since the
>>> job's fences were never signaled, the buffer object was effectively
>>> leaked. Worse, the buffer was stuck wherever it happened to be at the 
>>> time, possibly in VRAM.
>>>
>>> The symptom was user space processes stuck in interruptible waits with
>>> kernel stacks like:
>>>
>>>       [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250
>>>       [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0
>>>       [<ffffffffbc5e82d2>] reservation_object_wait_timeout_rcu+0x1c2/0x300
>>>       [<ffffffffc03ce56f>] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 [ttm]
>>>       [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm]
>>>       [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm]
>>>       [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm]
>>>       [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm]
>>>       [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 [amdgpu]
>>>       [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu]
>>>       [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 [amdgpu]
>>>       [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 [amdgpu]
>>>       [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm]
>>>       [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
>>>       [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0
>>>       [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90
>>>       [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad
>>>       [<ffffffffffffffff>] 0xffffffffffffffff
>>>
>>> Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
>>> Acked-by: Christian König <christian.koenig@amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 7 ++++++-
>>>    1 file changed, 6 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>> index 54eb77cffd9b..32a99e980d78 100644
>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>> @@ -220,22 +220,27 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched,
>>>                        amd_sched_entity_is_idle(entity));
>>>        amd_sched_rq_remove_entity(rq, entity);
>>>        if (r) {
>>>            struct amd_sched_job *job;
>>>            /* Park the kernel for a moment to make sure it isn't processing
>>>             * our enity.
>>>             */
>>>            kthread_park(sched->thread);
>>>            kthread_unpark(sched->thread);
>>> -        while (kfifo_out(&entity->job_queue, &job, sizeof(job)))
>>> +        while (kfifo_out(&entity->job_queue, &job, sizeof(job))) {
>>> +            struct amd_sched_fence *s_fence = job->s_fence;
>>> +            amd_sched_fence_scheduled(s_fence);
>> It would be really nice to have an error code set on s_fence->finished 
>> before it is signaled, use dma_fence_set_error() for this.
>>
>> In addition, it would be nice to note in the subject line that
>> this is a rather important bug fix.
>>
>> With that fixed the whole series is Reviewed-by: Christian König 
>> <christian.koenig@amd.com>.
>>
>> Regards,
>> Christian.
>>
>>> +            amd_sched_fence_finished(s_fence);
>>> +            dma_fence_put(&s_fence->finished);
>>>                sched->ops->free_job(job);
>>> +        }
>>>        }
>>>        kfifo_free(&entity->job_queue);
>>>    }
>>>    static void amd_sched_entity_wakeup(struct dma_fence *f, struct dma_fence_cb *cb)
>>>    {
>>>        struct amd_sched_entity *entity =
>>>            container_of(cb, struct amd_sched_entity, cb);
>>>        entity->dependency = NULL;
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> 
> 
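
For reference, the review suggestion quoted above (set an error on
s_fence->finished with dma_fence_set_error() before signaling it) can be
modeled in plain user-space C. This is a toy model, not kernel code:
struct toy_fence and the toy_* helpers are invented here, the status
convention (0 pending, 1 signaled without error, negative errno on error)
is an assumption of the sketch, and the error value is left to the caller
since the thread has not settled between -ENODEV and -ECANCELED.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a fence; NOT the kernel's struct dma_fence. */
struct toy_fence {
    bool signaled;
    int error;      /* 0, or a negative errno set before signaling */
};

/* The error has to be recorded while the fence is still unsignaled. */
static void toy_fence_set_error(struct toy_fence *f, int error)
{
    assert(!f->signaled);
    assert(error < 0);
    f->error = error;
}

static void toy_fence_signal(struct toy_fence *f)
{
    f->signaled = true;
}

/* What a waiter sees: 0 pending, 1 signaled OK, negative errno on error. */
static int toy_fence_status(const struct toy_fence *f)
{
    if (!f->signaled)
        return 0;
    return f->error ? f->error : 1;
}

/*
 * Drain fences of jobs that will never run: mark each pending fence as
 * failed, then signal it so that waiters wake up instead of blocking
 * forever. Already-signaled fences are left alone.
 * Returns the number of fences signaled here.
 */
static size_t toy_drain(struct toy_fence *fences, size_t n, int error)
{
    size_t i, count = 0;

    for (i = 0; i < n; i++) {
        if (fences[i].signaled)
            continue;
        toy_fence_set_error(&fences[i], error);
        toy_fence_signal(&fences[i]);
        count++;
    }
    return count;
}
```

The ordering is the point: a waiter that wakes on the signal must already
be able to read the error, so the error is stored first.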


