From: "Christian König" <christian.koenig@amd.com>
To: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Jack Zhang <Jack.Zhang1@amd.com>,
	Intel Graphics Development <intel-gfx@lists.freedesktop.org>,
	DRI Development <dri-devel@lists.freedesktop.org>,
	Steven Price <steven.price@arm.com>,
	Luben Tuikov <luben.tuikov@amd.com>,
	Boris Brezillon <boris.brezillon@collabora.com>,
	Daniel Vetter <daniel.vetter@intel.com>
Subject: Re: [PATCH v3 01/20] drm/sched: entity->rq selection cannot fail
Date: Fri, 9 Jul 2021 09:23:09 +0200	[thread overview]
Message-ID: <871a4619-8a17-134f-9d9c-40a522473946@amd.com> (raw)
In-Reply-To: <CAKMK7uFuqXdbvqDCerXHW5kiT=LUZEoyrjFMgHjkUQdS1eidDw@mail.gmail.com>

On 09.07.21 09:14, Daniel Vetter wrote:
> On Fri, Jul 9, 2021 at 8:53 AM Christian König <christian.koenig@amd.com> wrote:
>> On 08.07.21 19:37, Daniel Vetter wrote:
>>> If it does, someone managed to set up a sched_entity without
>>> schedulers, which is just a driver bug.
>> NAK, it is perfectly valid for rq selection to fail.
> Isn't there a better way to explain stuff to someone who's new to the
> code and is trying to improve it with docs than to NAK it with
> incomplete explanations?

Well, as far as I understand it, a NAK means that the author has missed 
something important and needs to iterate again.

It's just there to say that we absolutely can't merge the patch, or 
something will break.

>> See drm_sched_pick_best():
>>
>>                   if (!sched->ready) {
>>                           DRM_WARN("scheduler %s is not ready, skipping",
>>                                    sched->name);
>>                           continue;
>>                   }
>>
>> This can happen when a device reset fails for some engine.
> Well yeah, I didn't expect amdgpu to just change this directly, so I
> didn't find it. Getting an ENOENT on a hw failure instead of an EIO
> is somewhat interesting semantics, I guess. Also, what happens to the
> jobs which raced against the scheduler not being ready? I'm not
> seeing any checks for ready in the main scheduler logic, so this at
> least looks somewhat accidental as a side effect. Also, no other
> driver than amdgpu communicates that a reset failed back to drm/sched
> like this. They seem to just not report it, and I guess a timeout on
> the next request will get us into an endless reset loop?

Correct. The key point is that there aren't any jobs which are 
currently scheduled.

When the ready flag is changed, the scheduler is paused, i.e. the main 
thread is not running any more.
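
For illustration, a minimal driver-side sketch of that flow 
(drm_sched_stop() and drm_sched_start() are the existing drm/sched 
entry points; "ring", "bad_job" and my_engine_reset() are made-up 
placeholders for the driver's own state and reset path):

	/* Park the scheduler thread before touching the engine, so
	 * that no new jobs get pushed to the hardware meanwhile. */
	drm_sched_stop(&ring->sched, bad_job);

	if (my_engine_reset(ring)) {
		/* Reset failed: clear ready so that
		 * drm_sched_pick_best() skips this scheduler. */
		ring->sched.ready = false;
	} else {
		/* Reset succeeded: restart the scheduler thread. */
		drm_sched_start(&ring->sched, true);
	}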

I'm pretty sure that all of this is horribly racy, but nobody has 
really looked into the design from a higher level, as far as I know.

Christian.



> -Daniel
>
>
>> Regards,
>> Christian.
>>
>>> We BUG_ON() here because in the next patch drm_sched_job_init() will
>>> be split up, with drm_sched_job_arm() never failing. And that's the
>>> part where the rq selection will end up in.
>>>
>>> Note that if having an empty sched_list set on an entity is indeed a
>>> valid use-case, we can keep that check in job_init even after the split
>>> into job_init/arm.
>>>
>>> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
>>> Cc: "Christian König" <christian.koenig@amd.com>
>>> Cc: Luben Tuikov <luben.tuikov@amd.com>
>>> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
>>> Cc: Steven Price <steven.price@arm.com>
>>> Cc: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>> Cc: Boris Brezillon <boris.brezillon@collabora.com>
>>> Cc: Jack Zhang <Jack.Zhang1@amd.com>
>>> ---
>>>    drivers/gpu/drm/scheduler/sched_entity.c | 2 +-
>>>    drivers/gpu/drm/scheduler/sched_main.c   | 3 +--
>>>    2 files changed, 2 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>>> index 79554aa4dbb1..6fc116ee7302 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>> @@ -45,7 +45,7 @@
>>>     * @guilty: atomic_t set to 1 when a job on this queue
>>>     *          is found to be guilty causing a timeout
>>>     *
>>> - * Note: the sched_list should have at least one element to schedule
>>> + * Note: the sched_list must have at least one element to schedule
>>>     *       the entity
>>>     *
>>>     * Returns 0 on success or a negative error code on failure.
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 33c414d55fab..01dd47154181 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -586,8 +586,7 @@ int drm_sched_job_init(struct drm_sched_job *job,
>>>        struct drm_gpu_scheduler *sched;
>>>
>>>        drm_sched_entity_select_rq(entity);
>>> -     if (!entity->rq)
>>> -             return -ENOENT;
>>> +     BUG_ON(!entity->rq);
>>>
>>>        sched = entity->rq->sched;
>>>
>
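
For context, a rough sketch of the submission flow this series is 
heading towards, with drm_sched_job_init() still able to fail but 
everything from drm_sched_job_arm() on infallible (names are taken 
from the patches in this thread; the single-parameter push_job follows 
patch 05/20, and error handling plus driver specifics are elided):

	err = drm_sched_job_init(&job->base, entity, owner);
	if (err)
		return err;	/* setup may still fail here */

	/* ... collect dependencies and fill in the job payload ... */

	/* Point of no return: arming must not fail, which is why the
	 * rq selection that moves in here cannot be allowed to fail. */
	drm_sched_job_arm(&job->base);
	drm_sched_entity_push_job(&job->base);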


Thread overview: 84+ messages
2021-07-08 17:37 [PATCH v3 00/20] drm/sched dependency tracking and dma-resv fixes Daniel Vetter
2021-07-08 17:37 ` [PATCH v3 01/20] drm/sched: entity->rq selection cannot fail Daniel Vetter
2021-07-09  6:53   ` Christian König
2021-07-09  7:14     ` Daniel Vetter
2021-07-09  7:23       ` Christian König [this message]
2021-07-09  8:00         ` Daniel Vetter
2021-07-09  8:11           ` Christian König
2021-07-08 17:37 ` [PATCH v3 02/20] drm/sched: Split drm_sched_job_init Daniel Vetter
2021-07-08 17:37 ` [PATCH v3 03/20] drm/sched: Barriers are needed for entity->last_scheduled Daniel Vetter
2021-07-08 18:56   ` Andrey Grodzovsky
2021-07-08 19:53     ` Daniel Vetter
2021-07-08 21:54   ` [PATCH] " Daniel Vetter
2021-07-09  6:57     ` Christian König
2021-07-09  7:40       ` Daniel Vetter
2021-07-08 17:37 ` [PATCH v3 04/20] drm/sched: Add dependency tracking Daniel Vetter
2021-07-08 17:37 ` [PATCH v3 05/20] drm/sched: drop entity parameter from drm_sched_push_job Daniel Vetter
2021-07-08 17:37 ` [PATCH v3 06/20] drm/sched: improve docs around drm_sched_entity Daniel Vetter
2021-07-08 17:37 ` [PATCH v3 07/20] drm/panfrost: use scheduler dependency tracking Daniel Vetter
2021-07-12  9:19   ` Steven Price
2021-07-08 17:37 ` [PATCH v3 08/20] drm/lima: " Daniel Vetter
2021-07-08 17:37 ` [PATCH v3 09/20] drm/v3d: Move drm_sched_job_init to v3d_job_init Daniel Vetter
2021-07-08 17:37 ` [PATCH v3 10/20] drm/v3d: Use scheduler dependency handling Daniel Vetter
2021-07-08 17:37 ` [PATCH v3 11/20] drm/etnaviv: " Daniel Vetter
2021-07-08 17:37 ` [PATCH v3 12/20] drm/gem: Delete gem array fencing helpers Daniel Vetter
2021-07-08 17:37 ` [PATCH v3 13/20] drm/sched: Don't store self-dependencies Daniel Vetter
2021-07-08 17:37 ` [PATCH v3 14/20] drm/sched: Check locking in drm_sched_job_await_implicit Daniel Vetter
2021-07-08 17:37 ` [PATCH v3 15/20] drm/msm: Don't break exclusive fence ordering Daniel Vetter
2021-07-08 17:37 ` [PATCH v3 16/20] drm/msm: always wait for the exclusive fence Daniel Vetter
2021-07-09  8:48   ` Christian König
2021-07-09  9:15     ` Daniel Vetter
2021-07-08 17:37 ` [PATCH v3 17/20] drm/etnaviv: Don't break exclusive fence ordering Daniel Vetter
2021-07-08 17:37 ` [PATCH v3 18/20] drm/i915: delete exclude argument from i915_sw_fence_await_reservation Daniel Vetter
2021-07-08 17:37 ` [PATCH v3 19/20] drm/i915: Don't break exclusive fence ordering Daniel Vetter
2021-07-08 17:37 ` [PATCH v3 20/20] dma-resv: Give the docs a do-over Daniel Vetter
2021-07-09  0:03 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for drm/sched dependency tracking and dma-resv fixes (rev2) Patchwork
2021-07-09  0:29 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2021-07-09 15:27 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
