From: Daniel Vetter <daniel@ffwll.ch>
To: John.C.Harrison@Intel.com
Cc: Intel-GFX@Lists.FreeDesktop.Org
Subject: Re: [RFC 00/39] GPU scheduler for i915 driver
Date: Tue, 21 Jul 2015 15:33:42 +0200	[thread overview]
Message-ID: <20150721133342.GK16722@phenom.ffwll.local> (raw)
In-Reply-To: <1437143628-6329-1-git-send-email-John.C.Harrison@Intel.com>

On Fri, Jul 17, 2015 at 03:33:09PM +0100, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
> 
> Implemented a batch buffer submission scheduler for the i915 DRM driver.
> 
> The general theory of operation is that when batch buffers are submitted to the
> driver, the execbuffer() code assigns a unique seqno value and then packages up
> all the information required to execute the batch buffer at a later time. This
> package is given over to the scheduler which adds it to an internal node list.
> The scheduler also scans the list of objects associated with the batch buffer
> and compares them against the objects already in use by other buffers in the
> node list. If matches are found then the new batch buffer node is marked as
> being dependent upon the matching node. The same is done for the context object.
> The scheduler also bumps up the priority of such matching nodes on the grounds
> that the more dependencies a given batch buffer has the more important it is
> likely to be.
> 
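To make the object scan concrete, here is a minimal userspace sketch of the dependency/priority logic described above (all names are hypothetical, not the actual i915 structures):

```c
#include <stddef.h>

#define MAX_DEPS 8
#define MAX_OBJS 8

/* Hypothetical queued-batch node; not the real i915 layout. */
struct sched_node {
	int priority;
	const void *objs[MAX_OBJS];	/* objects the batch references */
	int num_objs;
	struct sched_node *deps[MAX_DEPS];
	int num_deps;
};

/* Mark 'node' as dependent on any queued node touching the same object,
 * and bump that node's priority: having many dependants suggests it is
 * important. */
static void scan_dependencies(struct sched_node *node,
			      struct sched_node **queue, int qlen)
{
	for (int i = 0; i < qlen; i++) {
		for (int j = 0; j < node->num_objs; j++) {
			for (int k = 0; k < queue[i]->num_objs; k++) {
				if (node->objs[j] != queue[i]->objs[k])
					continue;
				node->deps[node->num_deps++] = queue[i];
				queue[i]->priority++;
				goto next;
			}
		}
next:		;
	}
}
```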
> The scheduler aims to have a given (tuneable) number of batch buffers in flight
> on the hardware at any given time. If fewer than this are currently executing
> when a new node is queued, then the node is passed straight through to the
> submit function. Otherwise it is simply added to the queue and the driver
> returns to userspace.
> 
> As each batch buffer completes, it raises an interrupt which wakes up the
> scheduler. Note that it is possible for multiple buffers to complete before the
> IRQ handler gets to run. Further, the seqno values of the individual buffers are
> not necessarily incrementing, as the scheduler may have re-ordered their
> submission. However, the scheduler keeps the list of executing buffers in order
> of hardware submission. Thus it can scan through the list until a matching seqno
> is found and then mark all in flight nodes from that point on as completed.
> 
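A sketch of that prefix-marking scan, under the assumption (stated above) that the in-flight list is kept in hardware-submission order even when seqno values are not monotonic (names hypothetical):

```c
/* In-flight seqnos are kept in hardware-submission order.  The completed
 * seqno need not be the numerically largest: everything submitted before
 * it must also be done, so mark the whole prefix up to the match. */
static int mark_completed(const int *seqnos, int *completed, int n,
			  int done_seqno)
{
	int idx = -1;

	for (int i = 0; i < n; i++) {
		if (seqnos[i] == done_seqno) {
			idx = i;
			break;
		}
	}
	for (int i = 0; i <= idx; i++)
		completed[i] = 1;
	return idx + 1;	/* number of nodes now known complete */
}
```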
> A deferred work queue is also poked by the interrupt handler. When this wakes up
> it can do more involved processing such as actually removing completed nodes
> from the queue and freeing up the resources associated with them (internal
> memory allocations, DRM object references, context reference, etc.). The work
> handler also checks the in flight count and calls the submission code if a new
> slot has appeared.
> 
> When the scheduler's submit code is called, it scans the queued node list for
> the highest priority node that has no unmet dependencies. Note that the
> dependency calculation is complex as it must take inter-ring dependencies and
> potential preemptions into account. Note also that in the future this will be
> extended to include external dependencies such as the Android Native Sync file
> descriptors and/or the Linux dma-buf synchronisation scheme.
> 
> If a suitable node is found then it is sent to execbuff_final() for submission
> to the hardware. The in flight count is then re-checked and a new node popped
> from the list if appropriate.
> 
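The core of that selection step can be sketched as follows, with the dependency bookkeeping reduced to a counter of unmet dependencies (a simplification; the cover letter notes the real calculation must also consider inter-ring dependencies and preemption):

```c
/* Hypothetical queued node: a priority plus a count of unmet
 * dependencies, decremented as the nodes it depends on complete. */
struct qnode {
	int priority;
	int unmet_deps;
};

/* Return the index of the highest-priority node that is ready to run,
 * or -1 if everything is still blocked. */
static int pick_next(const struct qnode *q, int n)
{
	int best = -1;

	for (int i = 0; i < n; i++) {
		if (q[i].unmet_deps > 0)
			continue;	/* blocked, skip */
		if (best < 0 || q[i].priority > q[best].priority)
			best = i;
	}
	return best;
}
```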
> The scheduler also allows high priority batch buffers (e.g. from a desktop
> compositor) to jump ahead of whatever is already running if the underlying
> hardware supports pre-emption. In this situation, any work that was pre-empted
> is returned to the queued list ready to be resubmitted when no more high
> priority work is outstanding.
> 
> [Patches against drm-intel-nightly tree fetched 15/07/2015 with struct fence
> conversion patches applied]

A few high-level comments:

seqno handling
--------------

So you do rework seqno handling to materialize them fairly late, but
there's still (at least seemingly, not sure whether that all falls out
again) a lot of code to handle seqno values jumping around.

My goal with all the request/fence abstraction was that core code can stop
caring about where/how the seqno is stored, which would allow us to switch
to a per-timeline seqno scheme, i.e. per ctx/engine, stored in a private
per-ctx location.

Originally it made sense to keep seqno values as-is since they were used
all over the driver, but we fixed that with the seqno2request conversion
and with the request preallocation patches. What's still missing that
prevents us from fully abstracting seqnos (how they're allocated and
where they're stored) into a fence-internal implementation detail?

parallel data structures
------------------------

Currently your scheduler builds up a parallel set of data structures
starting from dev_priv->scheduler. That probably makes sense for an
external patch series, but for upstreaming I don't like parallel data
structures for a few reasons:

- It makes it harder to refactor code since there are more hoops to jump
  through when going from one hierarchy to the other. E.g. when you
  walk a list on one side there might not be a perfectly corresponding
  list order on the other side.

- It's harder to spot duplicated logic (like the dependency calculation
  code).

- It's harder on reviewers since behavior for the new stuff needs to be
  inferred from the code, while extending existing stuff can build on
  working assumptions. Yeah I'm just lazy ;-)

- Especially in GEM code it's not great having to kmalloc shadowing
  structures deep down in the call chain.

Anyway I think the global stuff should just be an embedded struct in
dev_priv (we don't bother with kmalloc in general). The per-engine arrays
should be split and moved into intel_engine_cs. Anything ctx specific
(haven't spotted much) belongs into ctx and the queue_entry should be
moved into requests. Yes there's not a 1:1 request:QE mapping, but we have
that already with execlists where we can coalesce requests together if
they're just tail updates to the same ctx on the same engine.
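Roughly the layout I mean, sketched with placeholder members (none of these are the real structures, the point is only where the state lives):

```c
/* Global scheduler state: embedded in dev_priv, no separate kmalloc. */
struct i915_scheduler {
	int enabled;		/* global tunables, stats, ... */
};

/* Per-engine state: split out of the global arrays. */
struct i915_engine_sched {
	int queued;		/* per-engine queue depth */
	int flying;		/* per-engine in-flight count */
};

struct drm_i915_private_sketch {
	struct i915_scheduler scheduler;	/* embedded struct */
};

struct intel_engine_cs_sketch {
	struct i915_engine_sched sched;		/* lives with the engine */
};
```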

legacy ringbuffer scheduler
---------------------------

The really nice thing with execlist is that each ctx has its own ring and
we don't need to delay emitting ringbuffer commands at all for the
scheduler, just the tail update and execlist submission. That means
scheduler with execlist is really just an internal detail like the guc
submission and won't need any big changes to the core gem code.

Legacy scheduler otoh means we need to make sure no ringbuffer commands
get emitted, which means some kind of transactional ring access thing or
something similar like Chris suggested. That's a lot more than just the
guarantees in the add_request rework, and I don't think that effort is
really worth it.

I've made some more comments about a bunch of the details, this is just
the overall design comments.

Cheers, Daniel
> 
> Dave Gordon (1):
>   drm/i915: Updating assorted register and status page definitions
> 
> John Harrison (38):
>   drm/i915: Add total count to context status debugfs output
>   drm/i915: Explicit power enable during deferred context initialisation
>   drm/i915: Prelude to splitting i915_gem_do_execbuffer in two
>   drm/i915: Split i915_dem_do_execbuffer() in half
>   drm/i915: Re-instate request->uniq because it is extremely useful
>   drm/i915: Start of GPU scheduler
>   drm/i915: Prepare retire_requests to handle out-of-order seqnos
>   drm/i915: Added scheduler hook into i915_gem_complete_requests_ring()
>   drm/i915: Disable hardware semaphores when GPU scheduler is enabled
>   drm/i915: Force MMIO flips when scheduler enabled
>   drm/i915: Added scheduler hook when closing DRM file handles
>   drm/i915: Added deferred work handler for scheduler
>   drm/i915: Redirect execbuffer_final() via scheduler
>   drm/i915: Keep the reserved space mechanism happy
>   drm/i915: Added tracking/locking of batch buffer objects
>   drm/i915: Hook scheduler node clean up into retire requests
>   drm/i915: Added scheduler interrupt handler hook
>   drm/i915: Added scheduler support to __wait_request() calls
>   drm/i915: Added scheduler support to page fault handler
>   drm/i915: Added scheduler flush calls to ring throttle and idle functions
>   drm/i915: Add scheduler hook to GPU reset
>   drm/i915: Added a module parameter for allowing scheduler overrides
>   drm/i915: Support for 'unflushed' ring idle
>   drm/i915: Defer seqno allocation until actual hardware submission time
>   drm/i915: Added immediate submission override to scheduler
>   drm/i915: Add sync wait support to scheduler
>   drm/i915: Connecting execbuff fences to scheduler
>   drm/i915: Added trace points to scheduler
>   drm/i915: Added scheduler queue throttling by DRM file handle
>   drm/i915: Added debugfs interface to scheduler tuning parameters
>   drm/i915: Added debug state dump facilities to scheduler
>   drm/i915: Add early exit to execbuff_final() if insufficient ring space
>   drm/i915: Added scheduler statistic reporting to debugfs
>   drm/i915: Added seqno values to scheduler status dump
>   drm/i915: Add scheduler support functions for TDR
>   drm/i915: GPU priority bumping to prevent starvation
>   drm/i915: Enable GPU scheduler by default
>   drm/i915: Allow scheduler to manage inter-ring object synchronisation
> 
>  drivers/gpu/drm/i915/Makefile              |    1 +
>  drivers/gpu/drm/i915/i915_debugfs.c        |  221 ++++
>  drivers/gpu/drm/i915/i915_dma.c            |    6 +
>  drivers/gpu/drm/i915/i915_drv.c            |    9 +
>  drivers/gpu/drm/i915/i915_drv.h            |   42 +-
>  drivers/gpu/drm/i915/i915_gem.c            |  191 ++-
>  drivers/gpu/drm/i915/i915_gem_execbuffer.c |  338 ++++--
>  drivers/gpu/drm/i915/i915_irq.c            |    3 +
>  drivers/gpu/drm/i915/i915_params.c         |    4 +
>  drivers/gpu/drm/i915/i915_reg.h            |   30 +-
>  drivers/gpu/drm/i915/i915_scheduler.c      | 1762 ++++++++++++++++++++++++++++
>  drivers/gpu/drm/i915/i915_scheduler.h      |  178 +++
>  drivers/gpu/drm/i915/i915_trace.h          |  233 +++-
>  drivers/gpu/drm/i915/intel_display.c       |    6 +-
>  drivers/gpu/drm/i915/intel_lrc.c           |  163 ++-
>  drivers/gpu/drm/i915/intel_lrc.h           |    1 +
>  drivers/gpu/drm/i915/intel_ringbuffer.c    |   47 +-
>  drivers/gpu/drm/i915/intel_ringbuffer.h    |   43 +-
>  18 files changed, 3110 insertions(+), 168 deletions(-)
>  create mode 100644 drivers/gpu/drm/i915/i915_scheduler.c
>  create mode 100644 drivers/gpu/drm/i915/i915_scheduler.h
> 
> -- 
> 1.9.1
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch