From: Daniel Vetter <daniel@ffwll.ch>
To: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Intel-GFX@Lists.FreeDesktop.Org
Subject: Re: [RFC 3/4] drm/i915: Interrupt driven fences
Date: Fri, 27 Mar 2015 09:24:18 +0100
Message-ID: <20150327082418.GD23521@phenom.ffwll.local>
In-Reply-To: <5514417D.7020206@virtuousgeek.org>

On Thu, Mar 26, 2015 at 10:27:25AM -0700, Jesse Barnes wrote:
> On 03/26/2015 06:22 AM, Daniel Vetter wrote:
> > On Mon, Mar 23, 2015 at 12:13:56PM +0000, John Harrison wrote:
> >> On 23/03/2015 09:22, Daniel Vetter wrote:
> >>> On Fri, Mar 20, 2015 at 09:11:35PM +0000, Chris Wilson wrote:
> >>>> On Fri, Mar 20, 2015 at 05:48:36PM +0000, John.C.Harrison@Intel.com wrote:
> >>>>> From: John Harrison <John.C.Harrison@Intel.com>
> >>>>>
> >>>>> The intended usage model for struct fence is that the signalled status should be
> >>>>> set on demand rather than polled. That is, there should not be a need for a
> >>>>> 'signaled' function to be called every time the status is queried. Instead,
> >>>>> 'something' should be done to enable a signal callback from the hardware which
> >>>>> will update the state directly. In the case of requests, this is the seqno
> >>>>> update interrupt. The idea is that this callback will only be enabled on demand
> >>>>> when something actually tries to wait on the fence.
> >>>>>
> >>>>> This change removes the polling test and replaces it with the callback scheme.
> >>>>> To avoid race conditions where signals can be sent before anyone is waiting for
> >>>>> them, it does not implement the callback on demand feature. When the GPU
> >>>>> scheduler arrives, it will need to know about the completion of every single
> >>>>> request anyway. So it is far simpler to not put in complex and messy anti-race
> >>>>> code in the first place given that it will not be needed in the future.
> >>>>>
> >>>>> Instead, each fence is added to a 'please poke me' list at the start of
> >>>>> i915_add_request(). This happens before the commands to generate the seqno
> >>>>> interrupt are added to the ring and is thus guaranteed to be race free. The
> >>>>> interrupt handler then scans through the 'poke me' list when a new seqno pops
> >>>>> out and signals any matching fence/request. The fence is then removed from the
> >>>>> list so the entire request stack does not need to be scanned every time.
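To make the description above concrete, here is a minimal sketch of the
'please poke me' list. The struct, the signal_list/signal_lock fields and
the helper names are made up for illustration; only fence_signal() and the
list/spinlock primitives are the stock kernel APIs:

struct i915_fence_wait {
	struct list_head link;
	struct fence *fence;
	u32 seqno;
};

/* Called from i915_add_request(), before the seqno write is emitted to
 * the ring, so the interrupt cannot race with the list insertion. */
static void i915_fence_track(struct intel_engine_cs *engine,
			     struct i915_fence_wait *wait)
{
	unsigned long flags;

	spin_lock_irqsave(&engine->signal_lock, flags);
	list_add_tail(&wait->link, &engine->signal_list);
	spin_unlock_irqrestore(&engine->signal_lock, flags);
}

/* Called from the user interrupt handler when a new seqno pops out:
 * signal and drop every fence the hardware has passed, so the full
 * request list never needs to be rescanned. */
static void i915_fence_notify(struct intel_engine_cs *engine, u32 seqno)
{
	struct i915_fence_wait *wait, *next;

	spin_lock(&engine->signal_lock);
	list_for_each_entry_safe(wait, next, &engine->signal_list, link) {
		if ((s32)(seqno - wait->seqno) < 0)
			continue;	/* not completed yet */
		list_del_init(&wait->link);
		fence_signal(wait->fence);
	}
	spin_unlock(&engine->signal_lock);
}
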
> >>>> No. Please let's not go back to the bad old days of generating an interrupt
> >>>> per batch, and doing a lot more work inside the interrupt handler.
> >>> Yeah, enable_signalling should be the place where we grab the interrupt
> >>> reference. Also, we shouldn't call this unconditionally; that pretty
> >>> much defeats the point of the fastpath optimization.
> >>>
> >>> Another complication is missed interrupts. If we detect those and someone
> >>> calls enable_signalling then we need to fire up a timer to wake up once
> >>> per jiffy and save stuck fences. To avoid duplication with the threaded
> >>> wait code we could remove the fallback wakeups from there and just rely on
> >>> that timer everywhere.
> >>> -Daniel
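As a rough sketch of that on-demand scheme (grab the irq reference in
enable_signaling, plus a once-per-jiffy fallback timer for missed
interrupts); everything named i915_request_fence, engine_*, irq_waiters,
fence_timer and signal_link below is hypothetical:

static bool i915_fence_enable_signaling(struct fence *f)
{
	struct i915_request_fence *rf = to_i915_request_fence(f);
	struct intel_engine_cs *engine = rf->engine;
	unsigned long flags;

	if ((s32)(engine_last_seqno(engine) - rf->seqno) >= 0)
		return false;	/* already completed, nothing to signal */

	/* First waiter: grab the irq reference and unmask the user
	 * interrupt; dropped again once the last waiter is signalled. */
	if (atomic_inc_return(&engine->irq_waiters) == 1)
		engine_enable_user_interrupt(engine);

	spin_lock_irqsave(&engine->signal_lock, flags);
	list_add_tail(&rf->signal_link, &engine->signal_list);
	spin_unlock_irqrestore(&engine->signal_lock, flags);

	/* Fallback for missed interrupts: recheck once per jiffy. */
	mod_timer(&engine->fence_timer, jiffies + 1);

	return true;
}
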
> >>
> >> As has been discussed many times in many forums, the scheduler requires
> >> notification of each batch buffer's completion. It needs to know so that it
> >> can submit new work, keep dependencies of outstanding work up to date, etc.
> >>
> >> Android is similar. With the native sync API, Android wants to be signaled
> >> about the completion of everything. Every single batch buffer submission
> >> comes with a request for a sync point that will be poked when that buffer
> >> completes. The kernel has no way of knowing which buffers are actually going
> >> to be waited on. There is no driver call anymore. User land simply waits on
> >> a file descriptor.
> >>
> >> I don't see how we can get away without generating an interrupt per batch.
> > 
> > I've explained this a bit offline in a meeting, but here's finally the
> > mail version for the record. The reason we want to enable interrupts only
> > when needed is that interrupts don't scale. Looking around, high-throughput
> > peripherals all try to avoid interrupts like the plague: netdev has
> > netpoll, block devices just gained the same because of ridiculously fast
> > ssds connected to pcie. And there's lots of people talking about insanely
> > tightly coupled gpu compute workloads (maybe not yet on intel gpus, but
> > it'll come).
> > 
> > Now I fully agree that unfortunately the execlist hw design isn't awesome
> > and there's no way around receiving and processing an interrupt per batch.
> > But the hw folks are working on fixing these overheads again (or at least
> > attempting to with the GuC; I haven't seen the new numbers yet) and old hw
> > without the scheduler works perfectly fine with interrupts mostly
> > disabled. So just because we currently have a suboptimal hw design is imo
> > not a good reason to throw all the on-demand interrupt enabling and
> > handling overboard. I fully expect that we'll need it again. And I think
> > it's easier to keep it working than to first kick it out and then rebuild
> > it again.
> > 
> > That's in a nutshell why I think we should keep all that machinery, even
> > though it won't be terribly useful for execlist (with or without the
> > scheduler).
> 
> What is our interrupt frequency these days anyway, for an interrupt per
> batch completion, for a somewhat real set of workloads?  There's
> probably more to shave off of our interrupt handling overhead, which
> ought to help universally, but especially with execlists and sync point
> usages.  I think Chris was looking at that a while back and removed some
> MMIO and such and got the overhead down, but I don't know where we stand
> today...

I guess you're referring to the pile of patches to reorder the reads/writes
for subordinate irq sources so that they only happen when they need to?
I.e. read only when we have a bit indicating so (unfortunately not
available for all of them) and write only if there's something to clear.

On a quick scan those patches all landed.
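
For reference, the pattern those patches follow, heavily simplified
(register and bit names are indicative only, not the actual code):

static irqreturn_t gen8_irq_handler_sketch(int irq, void *arg)
{
	struct drm_i915_private *dev_priv = arg;
	u32 master_ctl, iir;
	irqreturn_t ret = IRQ_NONE;

	master_ctl = I915_READ(GEN8_MASTER_IRQ);
	if (!master_ctl)
		return IRQ_NONE;

	/* Mask everything while we process this interrupt. */
	I915_WRITE(GEN8_MASTER_IRQ, 0);

	/* Read the subordinate IIR only when the master register says
	 * that source fired ... */
	if (master_ctl & GEN8_GT_RCS_IRQ) {
		iir = I915_READ(GEN8_GT_IIR(0));
		/* ... and write it back only when there is actually
		 * something to clear. */
		if (iir) {
			I915_WRITE(GEN8_GT_IIR(0), iir);
			ret = IRQ_HANDLED;
			/* notify_ring() / fence signalling goes here */
		}
	}

	I915_WRITE(GEN8_MASTER_IRQ, GEN8_MASTER_IRQ_CONTROL);
	return ret;
}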

The other bit is making the mmio debug stuff faster. That one hasn't yet
converged on a version which reduces the overhead without destroying the
usefulness of the debug functionality itself - unclaimed mmio has helped a
lot in chasing down runtime pm and power domain bugs in our driver. So I
really want to keep it around in some form by default, if at all possible.
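
For context, the per-access check that makes this path expensive looks
roughly like the following (names follow the i915 ones, but treat the
sketch as illustrative rather than as the actual implementation):

static void unclaimed_reg_debug_sketch(struct drm_i915_private *dev_priv,
				       const char *when)
{
	/* FPGA_DBG latches accesses to unclaimed (powered-down or
	 * non-existent) mmio ranges; doing this extra read around every
	 * register write is where the overhead comes from. */
	if (__raw_i915_read32(dev_priv, FPGA_DBG) & FPGA_DBG_RM_NOCLAIM) {
		DRM_DEBUG("Unclaimed register access detected %s\n", when);
		__raw_i915_write32(dev_priv, FPGA_DBG, FPGA_DBG_RM_NOCLAIM);
	}
}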

Maybe check out Chris' latest patch and see whether you have a good idea?
I've run out of ideas on them a bit.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

