From: Chris Wilson <chris@chris-wilson.co.uk>
To: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: intel-gfx@lists.freedesktop.org
Subject: Re: [PATCH v7] drm/i915: Slaughter the thundering i915_wait_request herd
Date: Tue, 8 Dec 2015 10:44:15 +0000	[thread overview]
Message-ID: <20151208104415.GC26418@nuc-i3427.alporthouse.com> (raw)
In-Reply-To: <5665A0EC.7040909@linux.intel.com>

On Mon, Dec 07, 2015 at 03:08:28PM +0000, Tvrtko Ursulin wrote:
> 
> Hi,
> 
> On 03/12/15 16:22, Chris Wilson wrote:
> >One particularly stressful scenario consists of many independent tasks
> >all competing for GPU time and waiting upon the results (e.g. realtime
> >transcoding of many, many streams). One bottleneck in particular is that
> >each client waits on its own results, but every client is woken up after
> >every batchbuffer - hence the thunder of hooves as then every client must
> >do its heavyweight dance to read a coherent seqno to see if it is the
> >lucky one.
> >
> >Ideally, we only want one client to wake up after the interrupt and
> >check its request for completion. Since the requests must retire in
> >order, we can select the first client on the oldest request to be woken.
> >Once that client has completed his wait, we can then wake up the
> >next client and so on.
> >
> >v2: Convert from a kworker per engine into a dedicated kthread for the
> >bottom-half.
> >v3: Rename request members and tweak comments.
> >v4: Use a per-engine spinlock in the breadcrumbs bottom-half.
> >v5: Fix race in locklessly checking waiter status and kicking the task on
> >adding a new waiter.
> >v6: Fix deciding when to force the timer to hide missing interrupts.
> >v7: Move the bottom-half from the kthread to the first client process.
> 
> I like the idea a lot so I'll make some comments even before you
> post v9 or later. :)
> 
> >Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com>
> >Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com>
> >Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> >Cc: Dave Gordon <david.s.gordon@intel.com>
> >---
> >  drivers/gpu/drm/i915/Makefile            |   1 +
> >  drivers/gpu/drm/i915/i915_debugfs.c      |   2 +-
> >  drivers/gpu/drm/i915/i915_drv.h          |   3 +-
> >  drivers/gpu/drm/i915/i915_gem.c          | 108 ++++++-----------
> >  drivers/gpu/drm/i915/i915_gpu_error.c    |   2 +-
> >  drivers/gpu/drm/i915/i915_irq.c          |  17 +--
> >  drivers/gpu/drm/i915/intel_breadcrumbs.c | 197 +++++++++++++++++++++++++++++++
> >  drivers/gpu/drm/i915/intel_lrc.c         |   5 +-
> >  drivers/gpu/drm/i915/intel_ringbuffer.c  |   5 +-
> >  drivers/gpu/drm/i915/intel_ringbuffer.h  |  47 +++++++-
> >  10 files changed, 297 insertions(+), 90 deletions(-)
> >  create mode 100644 drivers/gpu/drm/i915/intel_breadcrumbs.c
> >
> >diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
> >index 0851de07bd13..d3b9d3618719 100644
> >--- a/drivers/gpu/drm/i915/Makefile
> >+++ b/drivers/gpu/drm/i915/Makefile
> >@@ -35,6 +35,7 @@ i915-y += i915_cmd_parser.o \
> >  	  i915_gem_userptr.o \
> >  	  i915_gpu_error.o \
> >  	  i915_trace_points.o \
> >+	  intel_breadcrumbs.o \
> >  	  intel_lrc.o \
> >  	  intel_mocs.o \
> >  	  intel_ringbuffer.o \
> >diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
> >index 0583eda642d7..68fbe4d1f91d 100644
> >--- a/drivers/gpu/drm/i915/i915_debugfs.c
> >+++ b/drivers/gpu/drm/i915/i915_debugfs.c
> >@@ -2323,7 +2323,7 @@ static int count_irq_waiters(struct drm_i915_private *i915)
> >  	int i;
> >
> >  	for_each_ring(ring, i915, i)
> >-		count += ring->irq_refcount;
> >+		count += intel_engine_has_waiter(ring);
> 
> Don't care how many any longer? Okay in debugfs, but also in error capture?

It was nice that we were able to mark the requests with waiters for
free. We could add that, but it is no longer a side-effect of this
patch. Here I was tempted to do an intel_engine_count_breadcrumbs(), but
that doesn't match how the information is used by RPS so adding it to
debugfs for RPS is misleading.

> >-	/* Optimistic spin for the next jiffie before touching IRQs */
> >-	ret = __i915_spin_request(req, state);
> >-	if (ret == 0)
> >-		goto out;
> >+	if (INTEL_INFO(req->i915)->gen >= 6)
> >+		gen6_rps_boost(req->i915, rps, req->emitted_jiffies);
> >
> >-	if (!irq_test_in_progress && WARN_ON(!ring->irq_get(ring))) {
> >-		ret = -ENODEV;
> >-		goto out;
> >+	/* Optimistic spin for the next jiffie before touching IRQs */
> 
> Exactly the opposite! :)

Bah. It's close; it succinctly captures the intent, if not the
implementation! s/jiffie/~jiffie/!

> >@@ -996,12 +996,11 @@ static void ironlake_rps_change_irq_handler(struct drm_device *dev)
> >
> >  static void notify_ring(struct intel_engine_cs *ring)
> >  {
> >-	if (!intel_ring_initialized(ring))
> >+	if (ring->i915 == NULL)
> 
> Glanced something about this in some other thread, maybe best to
> keep it consolidated in one patch?

This can die now. It was justifiable when I was using ring->i915
directly here, but now it is just (mostly) pointless churn.

> >+static bool irq_get(struct intel_engine_cs *engine)
> >+{
> >+	if (test_bit(engine->id, &engine->i915->gpu_error.test_irq_rings))
> >+		return false;
> 
> I don't understand why knowledge of this debugfs interface needs to
> be sprinkled around?

It's not though. This is the only use - the interface is to inject
faults into the waiter that ensure we cannot wake up after an
interrupt, and so force the hangcheck to intervene. Are you thinking of
its counterpart missed_irq, which we use to track when we need to force
the fake interrupt?

> I mean, I am not even 100% sure what the
> debugfs interface is for. But it looks to be about software masking
> user interrupts on a ring? Why not just mask the interrupt delivery
> based on this mask at the source?

It is to simulate a broken wait, as opposed to masking interrupts.

> >+bool intel_breadcrumbs_add_waiter(struct drm_i915_gem_request *request,
> >+				  struct intel_breadcrumb *wait)
> >+{
> >+	struct intel_breadcrumbs *b = &request->ring->breadcrumbs;
> >+	struct rb_node **p, *parent;
> >+	bool first = false;
> >+
> >+	wait->task = current;
> >+	wait->seqno = request->seqno;
> >+
> >+	spin_lock(&b->lock);
> >+
> >+	/* First? Unmask the user interrupt */
> >+	if (b->first_waiter == NULL)
> >+		queue_delayed_work(system_highpri_wq, &b->irq, 0);
> 
> Who enables interrupts if the worker gets scheduled and completes
> before settig b->first_waiter below?

In this version, it is serialised by the spinlock. In the next version,
the ordering is explicit (no more delayed worker).
 
> >+	/* Insert the request into the retirement ordered list
> >+	 * of waiters by walking the rbtree. If we are the oldest
> >+	 * seqno in the tree (the first to be retired), then
> >+	 * set ourselves as the bottom-half.
> >+	 */
> >+	first = true;
> >+	parent = NULL;
> >+	p = &b->requests.rb_node;
> >+	while (*p) {
> >+		parent = *p;
> >+		if (i915_seqno_passed(wait->seqno, to_wait(parent)->seqno)) {
> >+			p = &parent->rb_right;
> >+			first = false;
> >+		} else
> >+			p = &parent->rb_left;
> >+	}
> >+	rb_link_node(&wait->node, parent, p);
> >+	rb_insert_color(&wait->node, &b->requests);
> 
> WARN_ON(!first && !b->first_waiter) maybe?

We will get a GPF on b->first_waiter soon enough :)

> 
> >+	if (first)
> >+		b->first_waiter = wait->task;

BUG_ON(b->first_waiter == NULL);

would be clearer I guess.

> >+
> >+	spin_unlock(&b->lock);
> >+
> >+	return first;
> >+}
> >+
> >+void intel_breadcrumbs_remove_waiter(struct drm_i915_gem_request *request,
> >+				     struct intel_breadcrumb *wait)
> >+{
> >+	struct intel_breadcrumbs *b = &request->ring->breadcrumbs;
> >+
> >+	spin_lock(&b->lock);
> >+
> >+	if (b->first_waiter == wait->task) {
> >+		struct intel_breadcrumb *next;
> >+		struct task_struct *task;
> >+
> >+		/* We are the current bottom-half. Find the next candidate,
> >+		 * the first waiter in the queue on the remaining oldest
> >+		 * request. As multiple seqnos may complete in the time it
> >+		 * takes us to wake up and find the next waiter, we have to
> >+		 * wake up that waiter for it to perform its own coherent
> >+		 * completion check.
> >+		 */
> 
> Equally, why wouldn't we wake up all waiters for which the requests
> have been completed?

Because we no longer track the requests to be completed, having moved to
a chain of waiting processes instead of a chain of requests. I could
insert a waitqueue into the intel_breadcrumb and that would indeed
necessitate locking in the irq handler (and irq locks everywhere :(.
 
> Would be a cheap check here and it would save a cascading growing
> latency as one task wakes up the next one.

Well, it can't be here since we may remove_waiter after a signal
(incomplete wait). So this part has to walk the chain of processes. Ugh,
and have to move the waitqueue from one waiter to the next...

> Even more benefit for multiple waiters on the same request.

Yes. That's what I liked about the previous version with the independent
waitqueue. However, latency of a single waiter is a compelling argument.

My head is still full of cold, maybe embedding a waitqueue into the
breadcrumb is an improvement - I am not sure yet.

> >+	/* Rather than have every client wait upon all user interrupts,
> >+	 * with the herd waking after every interrupt and each doing the
> >+	 * heavyweight seqno dance, we delegate the task to the first client.
> >+	 * After the interrupt, we wake up one client, who does the heavyweight
> >+	 * coherent seqno read and either goes back to sleep (if incomplete),
> >+	 * or wakes up the next client in the queue and so forth.
> >+	 *
> >+	 * With respect to walking the entire list of waiters in a single
> 
> s/With respect to/In contrast to/ or "Compared to"?

"Compared to", thanks.
 
> >+	 * dedicated bottom-half, we reduce the latency of the first waiter
> >+	 * by avoiding a context switch, but incur additional coherent seqno
> >+	 * reads when following the chain of request breadcrumbs.
> >+	 */
> >+	struct intel_breadcrumbs {
> >+		spinlock_t lock; /* protects the lists of requests */
> >+		struct rb_root requests; /* sorted by retirement */
> >+		struct task_struct *first_waiter; /* bh for user interrupts */
> >+		struct delayed_work irq; /* used to enable or fake interrupts */
> >+		bool irq_enabled : 1;
> >+		bool rpm_wakelock : 1;
> 
> Are bitfields worth it? There aren't that many engine so maybe it
> loses more in terms of instructions generated.

I'm always optimistic that gcc gets its act together. Not that I don't
have patches converting from bitfields to flags when gcc hurts. Here,
the potential extra instruction is going to be dwarfed by the irq spinlocks
and worse along these paths.

> >+static inline void intel_engine_wakeup(struct intel_engine_cs *engine)
> >+{
> >+	struct task_struct *task = READ_ONCE(engine->breadcrumbs.first_waiter);
> >+	if (task)
> >+		wake_up_process(task);
> 
> What guarantees task is a valid task at this point?

Not an awful lot! Indirectly, I can point to the fact that the task struct
cannot be freed whilst we are in an irq (thanks to RCU), but other than
that it is simply that there is a much longer path between clearing the
first_waiter and freeing the task_struct than doing a wake_up_process on
the running task.
 
> I know it is so extremely unlikely, but it can theoretically sample
> the first_waiter which then exits and disappears before the
> wake_up_process call.

I can leave a comment: "I told you so -- Tvrtko" :)
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

