Re: [PATCH 10/11] drm/scheduler: Don't store self-dependencies

From: Daniel Vetter <daniel@ffwll.ch>
To: "Christian König" <christian.koenig@amd.com>
Cc: Jack Zhang <Jack.Zhang1@amd.com>,
	DRI Development <dri-devel@lists.freedesktop.org>,
	Luben Tuikov <luben.tuikov@amd.com>,
	Alex Deucher <alexander.deucher@amd.com>,
	Daniel Vetter <daniel.vetter@intel.com>
Subject: Re: [PATCH 10/11] drm/scheduler: Don't store self-dependencies
Date: Thu, 24 Jun 2021 20:21:15 +0200	[thread overview]
Message-ID: <CAKMK7uE21AVpBQywKhQOVZUhG=D=CDaGwWAEDTpjhUoSzTdbrw@mail.gmail.com> (raw)
In-Reply-To: <fe537d78-0737-f1b1-a740-66efa93e94a9@amd.com>

On Thu, Jun 24, 2021 at 7:56 PM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 24.06.21 um 19:43 schrieb Daniel Vetter:
> > On Thu, Jun 24, 2021 at 7:38 PM Christian König
> > <christian.koenig@amd.com> wrote:
> >> Am 24.06.21 um 19:29 schrieb Daniel Vetter:
> >>> On Thu, Jun 24, 2021 at 07:03:10PM +0200, Christian König wrote:
> >>>> Am 24.06.21 um 16:00 schrieb Daniel Vetter:
> >>>>> This is essentially part of drm_sched_dependency_optimized(), which
> >>>>> only amdgpu seems to make use of. Use it a bit more.
> >>>>>
> >>>>> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> >>>>> Cc: "Christian König" <christian.koenig@amd.com>
> >>>>> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> >>>>> Cc: Luben Tuikov <luben.tuikov@amd.com>
> >>>>> Cc: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> >>>>> Cc: Alex Deucher <alexander.deucher@amd.com>
> >>>>> Cc: Jack Zhang <Jack.Zhang1@amd.com>
> >>>>> ---
> >>>>>     drivers/gpu/drm/scheduler/sched_main.c | 7 +++++++
> >>>>>     1 file changed, 7 insertions(+)
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> >>>>> index 370c336d383f..c31d7cf7df74 100644
> >>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
> >>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> >>>>> @@ -649,6 +649,13 @@ int drm_sched_job_await_fence(struct drm_sched_job *job,
> >>>>>      if (!fence)
> >>>>>              return 0;
> >>>>> +   /* if it's a fence from us it's guaranteed to be earlier */
> >>>>> +   if (fence->context == job->entity->fence_context ||
> >>>>> +       fence->context == job->entity->fence_context + 1) {
> >>>>> +           dma_fence_put(fence);
> >>>>> +           return 0;
> >>>>> +   }
> >>>>> +
> >>>> Well NAK. That would break Vulkan.
> > I'm assuming your reply means the NAK is retracted and was just the
> > usual "this doesn't perfectly fit for amdgpu" reflex?
>
> Well rather "NAK, you haven't considered that special handling in amdgpu
> and if you really want to unify this you need that as well."
>
> >
> >>>> The problem is that Vulkan can insert dependencies between jobs which run on
> >>>> the same queue.
> >>>>
> >>>> So we need to track those as well and if the previous job for the same
> >>>> queue/scheduler is not yet finished a pipeline synchronization needs to be
> >>>> inserted.
> >>>>
> >>>> That's one of the reasons we wasn't able to unify the dependency handling
> >>>> yet.
> >>> That sounds like an extremely amdgpu specific constraint?
> >> Yeah, that's totally hardware specific.
> >>
> >> It's just that I don't know how else we could track that without having
> >> the same separation as in amdgpu between implicit and explicit fences.
> >> And as far as I understand it that's exactly what you want to avoid.
> >>
> >> As I said this turned out to be really awkward.
> >>
> >>> You're also the
> >>> only one who keeps track of whether the previous job we've scheduled has
> >>> finished already (I guess they can get pipelined and you don't flush by
> >>> default), so you insert fences.
> >> Yes, exactly that.
> >>
> >>> I guess we can add a await_fence_no_dedup or so for amdgpu, but I'm not
> >>> sure why we have to inflict this design constraint on all other drivers?
> >>> At least I'm not seeing anything in lima, panfrost, v3d or entaviv that
> >>> would break with this, and i915 will also be perfectly fine.
> >>>
> >>> Also note: I'm not using this for amdgpu, exactly because there's a few
> >>> funny things going on.
> >> Yeah, exactly the reason why we never unified this.
> > Yeah there's clear limits to this, because you also can't use the
> > await_implicit helper, because you have to keep filtering for owner or
> > the current amdgpu uapi goes horribly slow. I think the benefit would
> > be just that we could share the datastructure and the book-keeping,
> > but aside from that you'd need your own integration in amdgpu.
>
> Yeah, but that is trivial. The _add_dependency() function (or however we
> want to call it) needs to be exported anyway for adding fences from
> syncfile and syncobj.
>
> Or do you also want to unify the handling for those?

I guess we could add some convenience wrapper that pulls in a
sync_file or sync_objc automatically. But there's not that much code
involved there, and it's also not tricky. Also drivers might need to
add dependencies for whatever anyway. The await_implicit is a bit
different, because that defines how implicit sync is supposed to work.

I guess the bikeshed then boils down to which one is the simple
await_fence() function. The one that filters for same timeline, or the
one that doesnt. I'd make the non-filtering one the special case so
that amdgpu sticks out a bit more - out of 6 drivers with schedulers
(i915 included) it seems to be the special one.

> > One idea I just had was whether we could use the tag bits xarray has
> > for the amdgpu purposed. Like we could do a
> > drm_sched_job_await_fence_tagged, where you supply additional
> > information (like the "this might be relevant for the vm_flush" and
> > things like that). Afaiui xarray tags are very fast to enumerate on if
> > you're looking for specific tags, but I might be wrong. Ideally this
> > would avoid the need for the duplicated amdgpu_job->sched.
>
> That could work.
>
> Essentially we just need the information from the scheduler which is the
> last fence which was dependency optimized.
>
> In other words when you push jobs like those to the same scheduler
>
> J1
> J2 -> depends on J1.
> J3 -> depends on whatever, but not j2
>
> The hardware needs to insert a flush between J2 and J1, but not between
> j3 and j2.
>
> This makes roughly 19% performance difference for some OpenGL games and
> incorrect rendering for Vulkan if you mess it up either way or the other.

Yeah that's massive. On i915 "too many pipeline stalls" even within
batches is a lot less, so we never bothered with this at all.
-Daniel

>
> Regards,
> Christian.
>
>
> >
> > Cheers, Daniel
> >
> >
> >> Regards,
> >> Christian.
> >>
> >>> Finally: You _really_ need explicit dependency handling for vulkan in your
> >>> uapi, instead of the kernel second-guessing what userspace might be doing.
> >>> That's really not how vulkan is designed to work :-)
> >>> Cheers, Daniel
> >>>
> >>>
> >>>> Christian.
> >>>>
> >>>>>      /* Deduplicate if we already depend on a fence from the same context.
> >>>>>       * This lets the size of the array of deps scale with the number of
> >>>>>       * engines involved, rather than the number of BOs.
> >
>

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch