From: Chris Wilson <chris@chris-wilson.co.uk>
To: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: intel-gfx@lists.freedesktop.org
Subject: Re: [PATCH 08/15] drm/i915: Slaughter the thundering i915_wait_request herd
Date: Mon, 30 Nov 2015 14:18:29 +0000	[thread overview]
Message-ID: <20151130141829.GC14582@nuc-i3427.alporthouse.com> (raw)
In-Reply-To: <565C4FE0.8000403@linux.intel.com>

On Mon, Nov 30, 2015 at 01:32:16PM +0000, Tvrtko Ursulin wrote:
> 
> On 30/11/15 12:30, Chris Wilson wrote:
> >On Mon, Nov 30, 2015 at 12:05:50PM +0000, Tvrtko Ursulin wrote:
> >>>+	struct intel_breadcrumbs {
> >>>+		spinlock_t lock; /* protects the per-engine request list */
> >>
> >>s/list/tree/
> >
> >list. We use the tree as a list!
> 
> Maybe, but it confuses the reader a bit, who then goes looking for
> some list.
> 
> Related, when I was playing with this I was piggy-backing on the
> existing per-engine request list.

A separate list, with separate locking, that is only used when waiting:
that's the appeal for me.
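
To illustrate the tree-as-a-list point: insertion keeps the tree sorted
by seqno, so the in-order walk the waker does is exactly a list
traversal. A rough sketch only (the wait_node name is illustrative, not
from the patch; caller holds b->lock):

static void insert_waiter(struct intel_breadcrumbs_engine *be,
                          struct drm_i915_gem_request *request)
{
        struct rb_node **p = &be->requests.rb_node, *parent = NULL;

        /* Keep the tree sorted in seqno (i.e. retirement) order */
        while (*p) {
                struct drm_i915_gem_request *it;

                parent = *p;
                it = rb_entry(parent, struct drm_i915_gem_request, wait_node);
                if (i915_seqno_passed(request->seqno, it->seqno))
                        p = &parent->rb_right;
                else
                        p = &parent->rb_left;
        }
        rb_link_node(&request->wait_node, parent, p);
        rb_insert_color(&request->wait_node, &be->requests);

        /* The oldest outstanding seqno is the next to be completed */
        if (be->first_waiter == NULL ||
            i915_seqno_passed(be->first_waiter->seqno, request->seqno))
                be->first_waiter = request;
}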
 
> If they are seqno-ordered there, and I think they are, then the
> waker thread just traverses it in order until hitting the first not
> yet completed one, waking up every req->wait_queue as it goes.
>
> In this case it doesn't matter that the waiters arrive in
> non-deterministic order, so we don't need a (new) tree.
> 
> But the downside is that the thread needs struct_mutex. And that the
> retirement from that list is so lazy...

And we can't even take struct_mutex!

> 
> >>>+		struct task_struct *task; /* bh for all user interrupts */
> >>>+		struct intel_breadcrumbs_engine {
> >>>+			struct rb_root requests; /* sorted by retirement */
> >>>+			struct drm_i915_gem_request *first;
> >>
> >>/* First request in the tree to be completed next by the hw. */
> >>
> >>?
> >
> >/* first */ ?
> 
> Funny? :) First what? Why make the reader reverse engineer it?

struct drm_i915_gem_request *first_waiter; ?

Please don't make me add /* ^^^ */
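
i.e. roughly, with the comments you asked for spelled out:

struct intel_breadcrumbs {
        spinlock_t lock; /* protects the per-engine request tree */
        struct task_struct *task; /* bh for all user interrupts */
        struct intel_breadcrumbs_engine {
                struct rb_root requests; /* sorted by seqno (retirement order) */
                /* first request in the tree, i.e. next to be completed by hw */
                struct drm_i915_gem_request *first_waiter;
        } engine[I915_NUM_RINGS];
};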

> >>For a later extension:
> >>
> >>How would you view, some time in the future, adding a bool return to
> >>ring->put_irq() which would say to the thread that it failed to
> >>disable interrupts?
> >>
> >>In this case the thread would exit and would need to be restarted
> >>when the next waiter queues up. Possibly extending the
> >>idle->put_irq->exit durations as well then.
> >
> >Honestly, I am not keen on the idea that the low-level put_irq can fail
> >(it makes cleanup much more fraught). I don't see what improvement you
> >intend here; maybe clarifying that would help explain what you mean.
> 
> Don't know, maybe reflecting the fact that it wasn't the last put_irq
> call, so let the caller handle it if they want. We can leave this
> discussion for the future.

Hmm. The only other use is the trace_irq. It would be nice to eliminate
that and the ring->irq_count.
> 
> >>>@@ -2986,16 +2981,16 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
> >>>  			if (ring_idle(ring, seqno)) {
> >>>  				ring->hangcheck.action = HANGCHECK_IDLE;
> >>>
> >>>-				if (waitqueue_active(&ring->irq_queue)) {
> >>>+				if (READ_ONCE(dev_priv->breadcrumbs.engine[ring->id].first)) {
> >>
> >>READ_ONCE is again strange, but other than that I don't know the
> >>hangcheck code well enough to really evaluate it.
> >>
> >>Perhaps this also means this line needs a comment describing what
> >>condition/state the check is actually approximating.
> >
> >All we are doing is asking if anyone is waiting on the GPU, because the
> >GPU has finished all of its work. If so, the IRQ handling is suspect
> >(modulo the pre-existing race condition clearly earmarked by READ_ONCE).
> 
> Cool, /* Check if someone is waiting on the GPU */ then would make me happy.

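+/* Check if someone is waiting on the GPU */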
+static bool any_waiters_on_engine(struct intel_engine_cs *engine)
+{
+       return READ_ONCE(engine->i915->breadcrumbs.engine[engine->id].first_waiter);
+}
+
 static bool any_waiters(struct drm_i915_private *dev_priv)
 {
        struct intel_engine_cs *ring;
        int i;
 
        for_each_ring(ring, dev_priv, i)
-               if (ring->irq_refcount)
+               if (any_waiters_on_engine(ring))
                        return true;
 
        return false;
 }

> >>>+	while (!kthread_should_stop()) {
> >>>+		unsigned reset_counter = i915_reset_counter(&i915->gpu_error);
> >>>+		unsigned long timeout_jiffies;
> >>>+		bool idling = false;
> >>>+
> >>>+		/* On every tick, walk the seqno-ordered list of requests
> >>>+		 * and for each retired request wakeup its clients. If we
> >>>+		 * find an unfinished request, go back to sleep.
> >>>+		 */
> >>
> >>s/tick/loop/ ?
> >s/tick/iteration/ I'm sticking with tick
> 
> Tick makes me think of a scheduler tick, so to me it is the worst of
> the three. Iteration sounds really good. Whether that will be a
> tick/jiffie/or something is decided a bit lower in the code.

Hmm, that was the connection I liked with tick.
 
> >>And if we don't find an unfinished request, we still go back to sleep. :)
> >
> >Ok.
> >
> >>>+		set_current_state(TASK_INTERRUPTIBLE);
> >>>+
> >>>+		/* Note carefully that we do not hold a reference for the
> >>>+		 * requests on this list. Therefore we only inspect them
> >>>+		 * whilst holding the spinlock to ensure that they are not
> >>>+		 * freed in the meantime, and the client must remove the
> >>>+		 * request from the list if it is interrupted (before it
> >>>+		 * itself releases its reference).
> >>>+		 */
> >>>+		spin_lock(&b->lock);
> >>
> >>This lock is more per-engine than global in nature, unless you
> >>think it is more efficient to do fewer lock atomics vs the potential
> >>overlap of waiter activity between engines?
> >
> >Indeed. I decided not to care about making it per-engine.
> >
> >>Would need a new lock for req->irq_count, per request. So
> >>req->lock/req->irq_count and be->lock for the tree operations.
> >
> >Why? A request is tied to an individual engine, therefore we can use
> >that breadcrumb_engine.lock for the request.
> 
> Just so 2nd+ waiters would not touch the per-engine lock.

Ah. I am not expecting that much contention and even fewer multiple
waiters for a request.
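
For clarity, the per-iteration walk I have in mind under b->lock is
roughly this (a sketch only: the wait_node/wait_queue names are
illustrative, and get_seqno is the existing lazy-coherency hook):

        spin_lock(&b->lock);
        for (i = 0; i < I915_NUM_RINGS; i++) {
                struct intel_engine_cs *engine = &i915->ring[i];
                struct intel_breadcrumbs_engine *be = &b->engine[i];
                struct drm_i915_gem_request *request = be->first_waiter;
                u32 seqno = engine->get_seqno(engine, false);

                /* Wake the clients of every completed request, stopping
                 * at the first request the hw has yet to finish. The
                 * requests are only inspected whilst we hold b->lock;
                 * clients unlink themselves if interrupted, before
                 * dropping their reference.
                 */
                while (request && i915_seqno_passed(seqno, request->seqno)) {
                        struct rb_node *next = rb_next(&request->wait_node);

                        rb_erase(&request->wait_node, &be->requests);
                        wake_up_all(&request->wait_queue);

                        request = next ? rb_entry(next,
                                                  struct drm_i915_gem_request,
                                                  wait_node) : NULL;
                }
                be->first_waiter = request;
        }
        spin_unlock(&b->lock);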

> >>>+		for (i = 0; i < I915_NUM_RINGS; i++) {
> >>>+			struct intel_engine_cs *engine = &i915->ring[i];
> >>>+			struct intel_breadcrumbs_engine *be = &b->engine[i];
> >>>+			struct drm_i915_gem_request *request = be->first;
> >>>+
> >>>+			if (request == NULL) {
> >>>+				if ((irq_get & (1 << i))) {
> >>>+					if (irq_enabled & (1 << i)) {
> >>>+						engine->irq_put(engine);
> >>>+						irq_enabled &= ~ (1 << i);
> >>>+					}
> >>>+					intel_runtime_pm_put(i915);
> >>>+					irq_get &= ~(1 << i);
> >>>+				}
> >>>+				continue;
> >>>+			}
> >>>+
> >>>+			if ((irq_get & (1 << i)) == 0) {
> >>>+				intel_runtime_pm_get(i915);
> >>>+				irq_enabled |= __irq_enable(engine) << i;
> >>>+				irq_get |= 1 << i;
> >>>+			}
> >>
> >>irq_get and irq_enabled are creating a little bit of a headache for
> >>me, and that probably means a comment is in order to explain
> >>what they are for and how they work.
> >>
> >>As it is, I don't see the need for irq_get to persist across iterations?
> >
> >Because getting the irq is quite slow, we add a jiffie shadow.
> 
> But it is not a shadow in the sense of the same bitmask from the
> previous iteration. The bits are different depending on special
> cases in __irq_enable.
> 
> So it needs some comments I think.

Didn't you get that from idling and its comments?

First, how about engines_active and irqs_enabled?

...
                        /* When enabling waiting upon an engine, we need
                         * to unmask the user interrupt, and before we
                         * do so we need to hold a wakelock for our
                         * hardware access (and the interrupts thereafter).
                         */
                        if ((engines_active & (1 << i)) == 0) {
...
                        if (request == NULL) {
                                /* No more waiters, mask the user interrupt
                                 * (if we were able to enable it!) and
                                 * release the wakelock.
                                 */
                                if (engines_active & (1 << i)) {


With a bit of work, I can then get the irq and rpm references out of the
spinlock.
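
Roughly along these lines (a sketch only: engines_active/irqs_enabled as
proposed above, __irq_enable() returning whether the interrupt was
actually unmasked, as in the patch):

        for (i = 0; i < I915_NUM_RINGS; i++) {
                struct intel_engine_cs *engine = &i915->ring[i];
                bool waiters = READ_ONCE(b->engine[i].first_waiter) != NULL;

                if (waiters && (engines_active & (1 << i)) == 0) {
                        /* When enabling waiting upon an engine, we need
                         * to unmask the user interrupt, and before we do
                         * so we need to hold a wakelock for our hardware
                         * access (and the interrupts thereafter).
                         */
                        intel_runtime_pm_get(i915);
                        if (__irq_enable(engine))
                                irqs_enabled |= 1 << i;
                        engines_active |= 1 << i;
                } else if (!waiters && (engines_active & (1 << i))) {
                        /* No more waiters, mask the user interrupt (if we
                         * were able to enable it!) and release the wakelock.
                         */
                        if (irqs_enabled & (1 << i)) {
                                engine->irq_put(engine);
                                irqs_enabled &= ~(1 << i);
                        }
                        intel_runtime_pm_put(i915);
                        engines_active &= ~(1 << i);
                }
        }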
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
