Re: [RFC] drm/i915/bdw+: Do not emit user interrupts when not needed

From: Chris Wilson <chris@chris-wilson.co.uk>
To: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Intel-gfx@lists.freedesktop.org
Subject: Re: [RFC] drm/i915/bdw+: Do not emit user interrupts when not needed
Date: Fri, 18 Dec 2015 14:29:22 +0000
Message-ID: <20151218142922.GA3302@nuc-i3427.alporthouse.com>
In-Reply-To: <56740F6A.2080907@linux.intel.com>

On Fri, Dec 18, 2015 at 01:51:38PM +0000, Tvrtko Ursulin wrote:
> 
> On 18/12/15 12:28, Chris Wilson wrote:
> >On Fri, Dec 18, 2015 at 11:59:41AM +0000, Tvrtko Ursulin wrote:
> >>From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>
> >>We can rely on the context complete interrupt to wake up the waiters,
> >>except in the case where requests are merged into a single ELSP
> >>submission. In this case we inject MI_USER_INTERRUPTS in the
> >>ring buffer to ensure prompt wake-ups.
> >>
> >>On, for example, the GLBenchmark Egypt off-screen test, this
> >>optimization halves the number of interrupts generated per second
> >>and reduces context switches by a factor of five to six.
> >
> >I half like it. Are the interrupts a limiting factor in this case though?
> >This should be ~100 waits/second with ~1000 batches/second, right? What
> >is the delay between request completion and client wakeup - difficult to
> >measure after you remove the user interrupt though! But I estimate it
> >should be on the order of just a few GPU cycles.
> 
> Neither of the two benchmarks I ran (trex onscreen and egypt
> offscreen) show any framerate improvements.

Expected, but nice to have confirmed. I expect you would have to get into
the synchronous workloads before the latency had any impact (1 us on a
1 ms wait will be lost in the noise, but 1 us on a 2 us wait would
quickly build up).
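
To put rough numbers on that: an extra 1 us of wakeup latency is 0.1% of a
1 ms wait but 50% of a 2 us wait. A small sketch of the arithmetic follows;
wakeup_overhead_pct() is a hypothetical helper written for illustration
only, and the 1 us figure is just the example value used above, not a
measurement.

```c
#include <assert.h>

/*
 * Extra wakeup latency expressed as a percentage of the wait it is
 * added to. Hypothetical helper, not part of i915 or igt.
 */
static double wakeup_overhead_pct(double latency_us, double wait_us)
{
	assert(wait_us > 0.0);
	return 100.0 * latency_us / wait_us;
}
```

For example, wakeup_overhead_pct(1.0, 1000.0) covers the 1 ms case and
wakeup_overhead_pct(1.0, 2.0) the 2 us case.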

> The only thing I did manage to measure is that CPU energy usage goes
> down with the optimisation. Roughly 8-10%, courtesy of RAPL script
> someone posted here.

That's actually v.impressive - are you confident in the result? :)

> Benchmarking is generally very hard so it is a pity we don't have a
> farm similar to CI which does it all in a repeatable and solid
> manner.

Indeed. And to have stable, reliable benchmarking. Martin Peres has been
working on getting confidence in the benchmark farm. The caveats tend to
be that we have to run at low frequencies to avoid throttling (CPU/GPU),
run on fixed CPUs etc etc. To get reliable metrics we have to throw out
some interesting and complex variability of the real world. :|

> >>@@ -433,6 +440,12 @@ static void execlists_context_unqueue(struct intel_engine_cs *ring)
> >>  			cursor->elsp_submitted = req0->elsp_submitted;
> >>  			list_move_tail(&req0->execlist_link,
> >>  				       &ring->execlist_retired_req_list);
> >>+			/*
> >>+			 * When merging requests make sure there is still
> >>+			 * something after each batch buffer to wake up waiters.
> >>+			 */
> >>+			if (cursor != req0)
> >>+				execlists_emit_user_interrupt(req0);
> >
> >You may have already missed this instruction as you patch it, and keep
> >doing so as long as the context is resubmitted. I think to be safe, you
> >need to patch cursor as well. You could then MI_NOOP out the
> >MI_USER_INTERRUPT on the terminal request.
> 
> I don't at the moment see how it could miss it? We don't do preemption,
> but granted I don't understand this code fully.

The GPU is currently executing the context that was on port[1]. It may
already have executed all the instructions up to and including the
instruction being patched (from the last ELSP write). Ok, that's a
window of just one missed interrupt on the resubmitted request - not as
big a deal as I first thought.

> But patching it out definitely looks safer. And I don't even have to
> unbreak GuC in that case. So I'll try that approach.

It still has the same hole on the resubmitted request though, but it may
be simpler.

> >An interesting igt experiment I think would be:
> >
> >thread A, keep queuing batches with just a single MI_STORE_DWORD_IMM *addr
> >thread B, waits on batch from A, reads *addr (asynchronously), measures
> >latency (actual value - expected(batch))
> >
> >Run for 10s, report min/max/median latency.
> >
> >Repeat for more threads/contexts and more waiters. Ah, that may be the
> >demonstration for the thundering herd I've been looking for!
> 
> Hm I'll think about it.

I'm working on this as I have been trying to get something to measure
the thundering herd issue (I have one benchmark that measures how much
CPU time we steal to handle interrupts, but it is horrible).
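
The reporting step of the experiment quoted above (run for 10s, report
min/max/median latency) could be sketched roughly as below. This is only an
illustration, not igt code: struct latency_summary and summarize_latency()
are hypothetical names, and the samples are assumed to be per-wait
latencies in nanoseconds that the waiter thread has already collected.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical result type for one measurement run. */
struct latency_summary {
	uint64_t min_ns;
	uint64_t max_ns;
	uint64_t median_ns;
};

/* qsort() comparator for uint64_t that avoids overflow on subtraction. */
static int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a;
	uint64_t y = *(const uint64_t *)b;

	return (x > y) - (x < y);
}

/* Sorts the samples in place, then reports min, max and median. */
static struct latency_summary summarize_latency(uint64_t *samples, size_t n)
{
	struct latency_summary s;

	assert(n > 0);
	qsort(samples, n, sizeof(*samples), cmp_u64);

	s.min_ns = samples[0];
	s.max_ns = samples[n - 1];
	if (n & 1) {
		s.median_ns = samples[n / 2];
	} else {
		/* Even count: midpoint of the two middle samples. */
		uint64_t lo = samples[n / 2 - 1];
		uint64_t hi = samples[n / 2];

		s.median_ns = lo + (hi - lo) / 2;
	}
	return s;
}
```

Median is used rather than mean so that a handful of very late wakeups do
not dominate the headline number; the max still exposes them.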

> Wrt your second reply, that is an interesting question.
> 
> All I can tell is that empirically it looks like interrupts do arrive
> split, otherwise there would be no reduction in interrupt numbers. But
> why they are split I don't know.
> 
> I'll try adding some counters to get a feel for how often that
> happens in various scenarios.

It may just be that on the CPU timescale a few GPU instructions is
enough for us to process the first interrupt and go back to sleep. But
it has to be shorter than the context switch to the bottom-half, I would
guess.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre