From: Luben Tuikov <luben.tuikov@amd.com>
To: "Andrey Grodzovsky" <Andrey.Grodzovsky@amd.com>,
	"Christian König" <christian.koenig@amd.com>,
	"Alex Deucher" <alexdeucher@gmail.com>,
	"Lucas Stach" <l.stach@pengutronix.de>
Cc: Emily Deng <Emily.Deng@amd.com>,
	amd-gfx list <amd-gfx@lists.freedesktop.org>,
	Maling list - DRI developers <dri-devel@lists.freedesktop.org>,
	steven.price@arm.com
Subject: Re: [PATCH v4] drm/scheduler: Avoid accessing freed bad job.
Date: Tue, 11 Feb 2020 19:53:50 -0500
Message-ID: <dbb2e7f0-85b7-f9e0-7875-144a8cca4993@amd.com>
In-Reply-To: <d4ba6ed9-604e-5e88-f427-679639dcf8e9@amd.com>

On 2020-02-11 4:27 p.m., Andrey Grodzovsky wrote:
> 
> On 2/11/20 10:55 AM, Andrey Grodzovsky wrote:
>> On 2/10/20 4:50 PM, Luben Tuikov wrote:
>>> Hi Lucas,
>>>
>>> Thank you for bringing awareness of this issue, publicly.
>>>
>>> As soon as this patch showed up back in November of 2019,
>>> I objected to it, privately.
>>
>>
>> I didn't actually find this objection in my mail

Yes, I didn't send it to you.

>>> I suggested instead using a _list_ to store the "state" of
>>> all jobs of the same state. Then, at any time (timeout interrupt
>>> or whatever), we can atomically (IRQ spinlock) move the
>>> timed-out/bad job to the timed-out/cleanup/bad job list, wake
>>> someone up to deal with that list asynchronously, and return
>>> from the interrupt/etc. immediately.
>>
>>
>> Sounds like a good idea to me. I think two lists are enough for us: 
>> a timeout list for jobs scheduled to HW but not yet completed 
>> (completion fence not signaled), and a cleanup list for those that 
>> did complete. This should give an alternative solution to the race 
>> condition this patch was addressing, without causing the breakage 
>> Lucas reported. If no one objects, I think I can try to implement it.
>>
>> Andrey
> 
> 
> Thinking about it more, I realize Luben is right about also having a 
> bad job list, as this is needed for normal job completion (via the 
> fence callback from amdgpu_fence_process), where you need to decide 
> whether or not to move the job from the timeout list to the cleanup 
> list. If it's already on the bad job list - meaning it's being 
> processed by the GPU recovery code - you don't touch it; otherwise 
> you move it to the cleanup list, where it will eventually be freed 
> by an invocation of drm_sched_get_cleanup_job.

Yep...

Perhaps fewer lists than "timeout", "bad" and "cleanup" could be had.
I'd also rename the "bad" list to a "recovery" list, as that is what
would be done to commands on that list.

"Timeout" is a status "timed-out", so perhaps just set the timeout
flag and move it to a "done" list. (Note that the command can still
complete asynchronously while on that list and while it has status
"timed-out'.)

The idea is that it
1) avoids contention and races when more than one context
   can update the job at the same time, and
2) makes it easy to process all jobs of a certain state and/or
   move them around, etc.
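
Something like the minimal sketch below is what I have in mind. It is
untested, and all the names in it (the structs, the lists, the move
helper) are made up purely for illustration; it is not the actual
drm_sched or amdgpu API, just the general shape of per-state lists
protected by one IRQ spinlock, with a handler woken to process the
"done" list:

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/wait.h>

enum sketch_job_status {
	SKETCH_JOB_PENDING,	/* submitted to HW, not finished */
	SKETCH_JOB_TIMEDOUT,	/* timed out, may still complete */
	SKETCH_JOB_DONE,	/* finished, ready to be freed */
};

struct sketch_job {
	struct list_head link;		/* on exactly one list below */
	enum sketch_job_status status;
};

struct sketch_sched {
	spinlock_t lock;		/* protects the lists below */
	struct list_head pending;	/* scheduled to HW */
	struct list_head recovery;	/* owned by GPU reset code */
	struct list_head done;		/* to be freed by the handler */
	wait_queue_head_t done_wq;	/* wakes the "done" handler */
};

/*
 * Callable from any context (timeout timer, fence callback, recovery
 * work): atomically reclassify the job, record why, wake the handler
 * if needed, and return immediately.
 */
static void sketch_job_move(struct sketch_sched *sched,
			    struct sketch_job *job,
			    struct list_head *to,
			    enum sketch_job_status status)
{
	unsigned long flags;

	spin_lock_irqsave(&sched->lock, flags);
	job->status = status;
	list_move_tail(&job->link, to);
	spin_unlock_irqrestore(&sched->lock, flags);

	if (to == &sched->done)
		wake_up(&sched->done_wq);
}

On timeout one would call sketch_job_move(sched, job, &sched->done,
SKETCH_JOB_TIMEDOUT) (or move it to the recovery list if a reset is
actually needed), and a late fence callback that finds the job already
off the pending list simply backs off.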

Let's discuss it and come up with a plan. :-)

Regards,
Luben




> 
> Andrey
> 
> 
>>
>>
>>>
>>> Then in due time, if any more interrupts or whatnot take place,
>>> the job will either be on the timeout list or not. If it is,
>>> then the instigator backs off, as someone else (the list handler)
>>> is awake (or will be) and handling it (obviously a state variable
>>> may be kept as well).
>>>
>>> This draws somewhat from my days with iSCSI, SCSI and SAS, 15 years
>>> ago, where a device can complete a job (task) at any time regardless
>>> of what the SCSI layer "thinks" the task's state is: timed-out,
>>> aborted, whatever. It is a very simple and elegant solution which
>>> generalizes well.
>>>
>>> Regards,
>>> Luben
>>>
>>> On 2020-02-10 11:55 a.m., Andrey Grodzovsky wrote:
>>>> Lucas - ping on my question. I also attached this temporary 
>>>> solution for etnaviv to clarify my point. If that is acceptable 
>>>> for now, at least I can do the same for v3d, where it requires a 
>>>> few more code changes.
>>>>
>>>> Andrey
>>>>
>>>> On 2/6/20 10:49 AM, Andrey Grodzovsky wrote:
>>>>>> Well a revert would break our driver.
>>>>>>
>>>>>> The real solution is that somebody needs to sit down, gather ALL 
>>>>>> the requirements and then come up with a solution which is clean 
>>>>>> and works for everyone.
>>>>>>
>>>>>> Christian.
>>>>>
>>>>> I can take this on, as indeed our general design here becomes 
>>>>> more and more entangled as GPU reset scenarios grow in complexity 
>>>>> (at least in the AMD driver). Currently I am on a high-priority 
>>>>> internal task which should take me a week or two to finish, and 
>>>>> after that I can get to it.
>>>>>
>>>>> Regarding a temporary solution - I looked into the v3d and etnaviv 
>>>>> use cases, and we in AMD actually face the same scenario, where we 
>>>>> decide to skip the HW reset if the guilty job did finish by the 
>>>>> time we are processing the timeout (see amdgpu_device_gpu_recover 
>>>>> and the skip_hw_reset goto) - the difference is that we always 
>>>>> call drm_sched_stop/start irrespective of whether we are actually 
>>>>> going to HW reset or not (same as for extending the timeout). I 
>>>>> wonder if something like this can be done for v3d and etnaviv too?
>>>>>
>>>>> Andrey

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
