From: Chris Wilson <chris@chris-wilson.co.uk>
To: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>,
	intel-gfx@lists.freedesktop.org
Subject: Re: [PATCH 08/10] drm/i915: Cancel non-persistent contexts on close
Date: Fri, 11 Oct 2019 15:22:17 +0100
Message-ID: <157080373793.31572.12385908510774881252@skylake-alporthouse-com>
In-Reply-To: <7bfc079e-e76b-9e43-de61-a00ab6b97b72@linux.intel.com>

Quoting Tvrtko Ursulin (2019-10-11 14:55:00)
> 
> On 10/10/2019 08:14, Chris Wilson wrote:
> > Normally, we rely on our hangcheck to prevent persistent batches from
> > hogging the GPU. However, if the user disables hangcheck, this mechanism
> > breaks down. Despite our insistence that this is unsafe, the users are
> > equally insistent that they want to use endless batches and will disable
> > the hangcheck mechanism. We are looking at perhaps replacing hangcheck
> > with a softer mechanism that sends a pulse down the engine to check if
> > it is well. We can use the same preemptive pulse to flush an active
> > persistent context off the GPU upon context close, preventing resources
> > being lost and unkillable requests remaining on the GPU after process
> > termination. To avoid changing the ABI and accidentally breaking
> > existing userspace, we make the persistence of a context explicit and
> > enable it by default (matching current ABI). Userspace can opt out of
> > persistent mode (forcing requests to be cancelled when the context is
> > closed by process termination or explicitly) by a context parameter. To
> > facilitate existing use-cases of disabling hangcheck, if the modparam is
> > disabled (i915.enable_hangcheck=0), we disable persistence mode by
> > default.  (Note, one of the outcomes for supporting endless mode will be
> > the removal of hangchecking, at which point opting into persistent mode
> > will be mandatory, or maybe the default, perhaps controlled by cgroups.)
> > 
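
(For reference, the userspace opt-out is just a context setparam; a
minimal sketch, assuming the I915_CONTEXT_PARAM_PERSISTENCE value this
patch adds to the uapi header:

	#include <stdbool.h>
	#include <stdint.h>
	#include <xf86drm.h>
	#include <drm/i915_drm.h>

	/* Opt a context out of persistence so that its outstanding
	 * requests are cancelled as soon as the context is closed. */
	static int context_set_persistence(int i915, uint32_t ctx_id, bool state)
	{
		struct drm_i915_gem_context_param p = {
			.ctx_id = ctx_id,
			.param = I915_CONTEXT_PARAM_PERSISTENCE,
			.value = state,
		};

		return drmIoctl(i915, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &p);
	}

e.g. context_set_persistence(i915, ctx, false) before closing the
context.)
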
> > v2: Check for hangchecking at context termination, so that we are not
> > left with undying contexts from a crafty user.
> > 
> > Testcase: igt/gem_ctx_persistence
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > Cc: Michał Winiarski <michal.winiarski@intel.com>
> > Cc: Jon Bloomfield <jon.bloomfield@intel.com>
> > Reviewed-by: Jon Bloomfield <jon.bloomfield@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gem/i915_gem_context.c   | 132 ++++++++++++++++++
> >   drivers/gpu/drm/i915/gem/i915_gem_context.h   |  15 ++
> >   .../gpu/drm/i915/gem/i915_gem_context_types.h |   1 +
> >   .../gpu/drm/i915/gem/selftests/mock_context.c |   2 +
> >   include/uapi/drm/i915_drm.h                   |  15 ++
> >   5 files changed, 165 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > index 5d8221c7ba83..46e5b3b53288 100644
> > --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > @@ -70,6 +70,7 @@
> >   #include <drm/i915_drm.h>
> >   
> >   #include "gt/intel_lrc_reg.h"
> > +#include "gt/intel_engine_heartbeat.h"
> >   #include "gt/intel_engine_user.h"
> >   
> >   #include "i915_gem_context.h"
> > @@ -269,6 +270,78 @@ void i915_gem_context_release(struct kref *ref)
> >               schedule_work(&gc->free_work);
> >   }
> >   
> > +static inline struct i915_gem_engines *
> > +__context_engines_static(struct i915_gem_context *ctx)
> > +{
> > +     return rcu_dereference_protected(ctx->engines, true);
> > +}
> > +
> > +static void kill_context(struct i915_gem_context *ctx)
> > +{
> > +     intel_engine_mask_t tmp, active, reset;
> > +     struct intel_gt *gt = &ctx->i915->gt;
> > +     struct i915_gem_engines_iter it;
> > +     struct intel_engine_cs *engine;
> > +     struct intel_context *ce;
> > +
> > +     /*
> > +      * If we are already banned, it was due to a guilty request causing
> > +      * a reset and the entire context being evicted from the GPU.
> > +      */
> > +     if (i915_gem_context_is_banned(ctx))
> > +             return;
> > +
> > +     i915_gem_context_set_banned(ctx);
> > +
> > +     /*
> > +      * Map the user's engine back to the actual engines; one virtual
> > +      * engine will be mapped to multiple engines, and using ctx->engine[]
> > +      * the same engine may have multiple instances in the user's map.
> > +      * However, we only care about pending requests, so only include
> > +      * engines on which there are incomplete requests.
> > +      */
> > +     active = 0;
> > +     for_each_gem_engine(ce, __context_engines_static(ctx), it) {
> > +             struct dma_fence *fence;
> > +
> > +             if (!ce->timeline)
> > +                     continue;
> > +
> > +             fence = i915_active_fence_get(&ce->timeline->last_request);
> > +             if (!fence)
> > +                     continue;
> > +
> > +             engine = to_request(fence)->engine;
> > +             if (HAS_EXECLISTS(gt->i915))
> > +                     engine = intel_context_inflight(ce);
> 
> Okay preemption implies execlists, was confused for a moment.
> 
> When can engine be NULL here?

The engine is not paused, and an interrupt can cause a schedule-out as
we gather up the state, so intel_context_inflight() may return NULL in
that window.

> 
> > +             if (engine)
> > +                     active |= engine->mask;
> > +
> > +             dma_fence_put(fence);
> > +     }
> > +
> > +     /*
> > +      * Send a "high priority pulse" down the engine to cause the
> > +      * current request to be momentarily preempted. (If it fails to
> > +      * be preempted, it will be reset). As we have marked our context
> > +      * as banned, any incomplete request, including any running, will
> > +      * be skipped following the preemption.
> > +      */
> > +     reset = 0;
> > +     for_each_engine_masked(engine, gt->i915, active, tmp)
> > +             if (intel_engine_pulse(engine))
> > +                     reset |= engine->mask;
> 
> What if we were able to send a pulse, but the hog cannot be preempted 
> and hangcheck is obviously disabled - who will do the reset?

Hmm, the idea is that forced-preemption causes the reset.
(See igt/gem_ctx_persistence/hostile)

However, if we give the sysadmin the means to disable force-preemption,
we just gave them another shovel to dig a hole with.

A last resort would be another timer here to ensure the context was
terminated.
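
Something like arming a delayed work from kill_context() that, after a
grace period, escalates to a full reset of any engine still running the
banned context. Rough sketch only; apart from intel_gt_handle_error()
these names are illustrative:

	/* sketch: would live in i915_gem_context.c next to kill_context() */
	struct kill_context_work { /* hypothetical */
		struct delayed_work work;
		struct intel_gt *gt;
		intel_engine_mask_t engines;
	};

	static void kill_context_timeout(struct work_struct *wrk)
	{
		struct kill_context_work *w =
			container_of(wrk, typeof(*w), work.work);

		/* A real version would first re-check that the banned
		 * context is still inflight; if so, the pulse was ignored
		 * and we escalate to resetting the engines involved. */
		intel_gt_handle_error(w->gt, w->engines, I915_ERROR_CAPTURE,
				      "banned context refused to be preempted");
		kfree(w);
	}
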
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
