From: Chris Wilson <chris@chris-wilson.co.uk>
To: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: intel-gfx@lists.freedesktop.org
Subject: Re: [PATCH 17/21] drm/i915: Convert trace-irq to the breadcrumb waiter
Date: Wed, 8 Jun 2016 14:47:49 +0100	[thread overview]
Message-ID: <20160608134749.GF32344@nuc-i3427.alporthouse.com> (raw)
In-Reply-To: <5758132B.4050409@linux.intel.com>

On Wed, Jun 08, 2016 at 01:44:27PM +0100, Tvrtko Ursulin wrote:
> 
> On 08/06/16 13:34, Chris Wilson wrote:
> >On Wed, Jun 08, 2016 at 12:47:28PM +0100, Tvrtko Ursulin wrote:
> >>
> >>On 08/06/16 12:24, Chris Wilson wrote:
> >>>On Wed, Jun 08, 2016 at 11:16:13AM +0100, Tvrtko Ursulin wrote:
> >>>>
> >>>>And why such long delays? It looks pretty lightweight to me.
> >>>>
> >>>>>>One alternative could perhaps be to add a waiter->wake_up vfunc and
> >>>>>>signalers could then potentially use a tasklet?
> >>>>>
> >>>>>Hmm, I did find that in order to reduce execlists latency, I had to
> >>>>>drive the tasklet processing from the signaler.
> >>>>
> >>>>What do you mean? The existing execlists tasklet? How would that work?
> >>>
> >>>Due to how dma-fence signals, the softirq is never kicked
> >>>(spin_lock_irq doesn't handle local_bh_enable()) and so we would only
> >>>submit a new task via execlists on a reschedule. That latency added
> >>>about 30% (30s on bsw) to gem_exec_parallel.
> >>
> >>I don't follow. User interrupts are separate from context complete
> >>which drives the submission. How do fences interfere with the
> >>latter?
> >
> >The biggest user benchmark (ala sysmark) regression we have for
> >execlists is the latency in submitting the first request to hardware via
> >elsp (or at least the hw responding to and executing that batch,
> >the per-bb and per-ctx w/a are not free either). If we incur extra
> >latency in the driver in even adding the request to the queue for an
> >idle GPU, that is easily felt by userspace.
> 
> I still don't see how fences tie into that. But it is not so
> important since it was all along the lines of "do we really need a
> thread".

I was just mentioning in passing an issue I noticed when mixing fences
and tasklets! It boils down to spin_unlock_irq() not doing
local_bh_enable(), so trying to schedule a tasklet from inside a
fence callback incurs more latency than you would expect. Entirely
unrelated except for the signaling, fencing and their uses ;)

> >>>>>>>+int intel_engine_enable_signaling(struct drm_i915_gem_request *request)
> >>>>>>>+{
> >>>>>>>+	struct intel_engine_cs *engine = request->engine;
> >>>>>>>+	struct intel_breadcrumbs *b = &engine->breadcrumbs;
> >>>>>>>+	struct rb_node *parent, **p;
> >>>>>>>+	struct signal *signal;
> >>>>>>>+	bool first, wakeup;
> >>>>>>>+
> >>>>>>>+	if (unlikely(IS_ERR(b->signaler)))
> >>>>>>>+		return PTR_ERR(b->signaler);
> >>>>>>
> >>>>>>I don't see that there is a fallback if kthread creation failed. It
> >>>>>>should just fail in intel_engine_init_breadcrumbs if that happens.
> >>>>>
> >>>>>Because it is not fatal to using the GPU, just one optional function.
> >>>>
> >>>>But we never expect it to fail and it is not even dependent on
> >>>>anything user controllable. Just a random error which would cause
> >>>>user experience to degrade. If thread creation failed it means
> >>>>system is in such a poor shape I would just fail the driver init.
> >>>
> >>>A minimally functional system is better than nothing at all.
> >>>GEM is not required for driver loading, interrupt driven dma-fences less
> >>>so.
> >>
> >>If you are so hot for that, how about vfuncing enable_signaling in
> >>that case? Because I find the "have we created our kthread at driver
> >>init time successfully" question for every fence a bit too much.
> >
> >read + conditional that pulls in the cacheline we want? You can place
> >the test after the spinlock if you want to avoid the cost, I suppose.
> >Or we just mark the GPU as wedged.
> 
> What I meant was to pass in different fence_ops at fence_init time
> depending on whether or not the signaler thread was created. If the
> driver is wanted to be functional in that case, and
> fence->enable_signaling needs to keep returning errors, it sounds
> like a much more elegant solution than repeating the check on
> every fence->enable_signaling call.

Actually, looking at it, the code was broken for !thread as there is
no automatic fallback to polling by dma-fence. So the choice is between
doing that ourselves for an impossible failure case or just marking the
GPU as wedged on init.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

