From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>, intel-gfx@lists.freedesktop.org
Subject: Re: [PATCH 12/15] drm/i915/gem: Cancel non-persistent contexts on close
Date: Mon, 14 Oct 2019 17:06:42 +0100
Message-ID: <f2d91987-29ef-1fee-2a87-a80c26158013@linux.intel.com>
In-Reply-To: <157106007436.18859.12181352843885393767@skylake-alporthouse-com>


On 14/10/2019 14:34, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2019-10-14 14:10:30)
>>
>> On 14/10/2019 13:21, Chris Wilson wrote:
>>> Quoting Tvrtko Ursulin (2019-10-14 13:11:46)
>>>>
>>>> On 14/10/2019 10:07, Chris Wilson wrote:
>>>>> Normally, we rely on our hangcheck to prevent persistent batches from
>>>>> hogging the GPU. However, if the user disables hangcheck, this mechanism
>>>>> breaks down. Despite our insistence that this is unsafe, the users are
>>>>> equally insistent that they want to use endless batches and will disable
>>>>> the hangcheck mechanism. We are looking at perhaps replacing hangcheck
>>>>> with a softer mechanism, that sends a pulse down the engine to check if
>>>>> it is well. We can use the same preemptive pulse to flush an active
>>>>> persistent context off the GPU upon context close, preventing resources
>>>>> being lost and unkillable requests remaining on the GPU after process
>>>>> termination. To avoid changing the ABI and accidentally breaking
>>>>> existing userspace, we make the persistence of a context explicit and
>>>>> enable it by default (matching current ABI). Userspace can opt out of
>>>>> persistent mode (forcing requests to be cancelled when the context is
>>>>> closed by process termination or explicitly) by a context parameter. To
>>>>> facilitate existing use-cases of disabling hangcheck, if the modparam is
>>>>> disabled (i915.enable_hangcheck=0), we disable persistence mode by
>>>>> default.  (Note, one of the outcomes for supporting endless mode will be
>>>>> the removal of hangchecking, at which point opting into persistent mode
>>>>> will be mandatory, or maybe the default perhaps controlled by cgroups.)
>>>>>
>>>>> v2: Check for hangchecking at context termination, so that we are not
>>>>> left with undying contexts from a crafty user.
>>>>> v3: Force context termination even if forced-preemption is disabled.
>>>>>
>>>>> Testcase: igt/gem_ctx_persistence
>>>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>>>> Cc: Michał Winiarski <michal.winiarski@intel.com>
>>>>> Cc: Jon Bloomfield <jon.bloomfield@intel.com>
>>>>> Reviewed-by: Jon Bloomfield <jon.bloomfield@intel.com>
>>>>> ---
>>>>>     drivers/gpu/drm/i915/gem/i915_gem_context.c   | 182 ++++++++++++++++++
>>>>>     drivers/gpu/drm/i915/gem/i915_gem_context.h   |  15 ++
>>>>>     .../gpu/drm/i915/gem/i915_gem_context_types.h |   1 +
>>>>>     .../gpu/drm/i915/gem/selftests/mock_context.c |   2 +
>>>>>     include/uapi/drm/i915_drm.h                   |  15 ++
>>>>>     5 files changed, 215 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
>>>>> index 5d8221c7ba83..70b72456e2c4 100644
>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
>>>>> @@ -70,6 +70,7 @@
>>>>>     #include <drm/i915_drm.h>
>>>>>     
>>>>>     #include "gt/intel_lrc_reg.h"
>>>>> +#include "gt/intel_engine_heartbeat.h"
>>>>>     #include "gt/intel_engine_user.h"
>>>>>     
>>>>>     #include "i915_gem_context.h"
>>>>> @@ -269,6 +270,128 @@ void i915_gem_context_release(struct kref *ref)
>>>>>                 schedule_work(&gc->free_work);
>>>>>     }
>>>>>     
>>>>> +static inline struct i915_gem_engines *
>>>>> +__context_engines_static(const struct i915_gem_context *ctx)
>>>>> +{
>>>>> +     return rcu_dereference_protected(ctx->engines, true);
>>>>> +}
>>>>> +
>>>>> +static bool __reset_engine(struct intel_engine_cs *engine)
>>>>> +{
>>>>> +     struct intel_gt *gt = engine->gt;
>>>>> +     bool success = false;
>>>>> +
>>>>> +     if (!intel_has_reset_engine(gt))
>>>>> +             return false;
>>>>> +
>>>>> +     if (!test_and_set_bit(I915_RESET_ENGINE + engine->id,
>>>>> +                           &gt->reset.flags)) {
>>>>> +             success = intel_engine_reset(engine, NULL) == 0;
>>>>> +             clear_and_wake_up_bit(I915_RESET_ENGINE + engine->id,
>>>>> +                                   &gt->reset.flags);
>>>>> +     }
>>>>> +
>>>>> +     return success;
>>>>> +}
>>>>> +
>>>>> +static void __reset_context(struct i915_gem_context *ctx,
>>>>> +                         struct intel_engine_cs *engine)
>>>>> +{
>>>>> +     intel_gt_handle_error(engine->gt, engine->mask, 0,
>>>>> +                           "context closure in %s", ctx->name);
>>>>> +}
>>>>> +
>>>>> +static bool __cancel_engine(struct intel_engine_cs *engine)
>>>>> +{
>>>>> +     /*
>>>>> +      * Send a "high priority pulse" down the engine to cause the
>>>>> +      * current request to be momentarily preempted. (If it fails to
>>>>> +      * be preempted, it will be reset). As we have marked our context
>>>>> +      * as banned, any incomplete request, including any running, will
>>>>> +      * be skipped following the preemption.
>>>>> +      */
>>>>> +     if (CONFIG_DRM_I915_PREEMPT_TIMEOUT && !intel_engine_pulse(engine))
>>>>> +             return true;
>>>>
>>>> Maybe I lost the train of thought here.. But why not even try with the
>>>> pulse even if forced preemption is not compiled in? There is a chance it
>>>> may preempt normally, no?
>>>
>>> If there is no reset-on-preemption failure and no hangchecking, there is
>>> no reset and we are left with the denial-of-service that we are seeking
>>> to close.
>>
>> Because there is no mechanism to send a pulse, see if it managed to
>> preempt, and if it did not, come back later and reset?
> 
> What are you going to preempt with? The mechanism you describe is what
> the pulse + forced-preempt is meant to be handling. (I was going to use
> a 2 birds with one stone allegory for the various features all pulling
> together, but it's more like a flock with a grenade.)

I meant try to preempt with an idle pulse first. If it doesn't let go, then
go for a reset. It has a chance of working even without forced
preemption, no?
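
A user-space sketch of the "pulse first, reset only if it doesn't let go"
ordering may make the question concrete. Everything in it is hypothetical:
mock_engine, mock_pulse(), mock_reset() and cancel_engine() are stand-ins
invented for illustration, not i915 code. It also assumes the pulse can
report synchronously whether the request yielded, which is the very
mechanism under discussion above.

```c
#include <stdbool.h>

/* Hypothetical stand-in for an engine; NOT struct intel_engine_cs. */
struct mock_engine {
	bool preemptible;      /* does the running batch yield to a pulse? */
	bool has_engine_reset; /* is a per-engine reset available? */
	int pulses_sent;
	int resets_done;
};

enum cancel_result { CANCEL_PREEMPTED, CANCEL_RESET, CANCEL_FAILED };

/*
 * Assumption: the pulse tells us straight away whether the request let
 * go. A real driver would need a timeout or callback to learn this.
 */
static bool mock_pulse(struct mock_engine *e)
{
	e->pulses_sent++;
	return e->preemptible;
}

/* Hypothetical per-engine reset; fails if the feature is unavailable. */
static bool mock_reset(struct mock_engine *e)
{
	if (!e->has_engine_reset)
		return false;
	e->resets_done++;
	return true;
}

/* Try the gentle option first and escalate only when it fails. */
static enum cancel_result cancel_engine(struct mock_engine *e)
{
	if (mock_pulse(e))
		return CANCEL_PREEMPTED;
	if (mock_reset(e))
		return CANCEL_RESET;
	return CANCEL_FAILED;
}
```

In this toy model a preemptible batch never triggers a reset, while a
non-preemptible one costs one wasted pulse before the reset.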

>>>> Hm, or from the other angle, why bother with preemption and not just
>>>> reset? What is the value in letting the closed context complete if at
>>>> the same time, if it is preemptable, we will cancel all outstanding work
>>>> anyway?
>>>
>>> The reset is the elephant gun; it is likely to cause collateral damage.
>>> So we try with a bit of finesse first.
>>
>> How so? Isn't our per-engine reset supposed to be fast and reliable? But
>> yes, I have no complaints of trying preemption first, just trying to
>> connect all the dots.
> 
> Fast and reliable! Even if it were, we still have the challenge of ensuring
> we reset the right context. But in terms of being fast and reliable, we
> have actually talked about using this type of preempt mechanism to avoid
> taking a risk with the reset! :)

:) Ok, makes sense. I did not look in detail into how reset victims are
picked. So the only question is whether we want to try sending an idle
pulse first in any case.

Regards,

Tvrtko