All of lore.kernel.org
 help / color / mirror / Atom feed
From: Michel Thierry <michel.thierry@intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>,
	intel-gfx@lists.freedesktop.org,
	Mika Kuoppala <mika.kuoppala@linux.intel.com>
Subject: Re: [PATCH v6 05/20] drm/i915/tdr: Add support for per engine reset recovery
Date: Thu, 20 Apr 2017 17:17:05 -0700	[thread overview]
Message-ID: <58373b7c-0f4c-8bf9-116b-776998bc5e2b@intel.com> (raw)
In-Reply-To: <20170419104926.GJ9029@nuc-i3427.alporthouse.com>

On 19/04/17 03:49, Chris Wilson wrote:
> On Tue, Apr 18, 2017 at 01:23:20PM -0700, Michel Thierry wrote:
>> From: Arun Siluvery <arun.siluvery@linux.intel.com>
>>
>> This change implements support for per-engine reset as an initial, less
>> intrusive hang recovery option to be attempted before falling back to the
>> legacy full GPU reset recovery mode if necessary. This is only supported
>> from Gen8 onwards.
>>
>> Hangchecker determines which engines are hung and invokes error handler to
>> recover from it. Error handler schedules recovery for each of those engines
>> that are hung. The recovery procedure is as follows,
>>  - identifies the request that caused the hang and it is dropped
>>  - force engine to idle: this is done by issuing a reset request
>>  - reset and re-init engine
>>  - restart submissions to the engine
>>
>> If engine reset fails then we fall back to heavy weight full gpu reset
>> which resets all engines and reinitiazes complete state of HW and SW.
>>
>> v2: Rebase.
>> v3: s/*engine_reset*/*reset_engine*/; freeze engine and irqs before
>> calling i915_gem_reset_engine (Chris).
>> v4: Rebase, modify i915_gem_reset_prepare to use a ring mask and
>> reuse the function for reset_engine.
>> v5: intel_reset_engine_start/cancel instead of request/unrequest_reset.
>> v6: Clean up reset_engine function to not require mutex, i.e. no need to call
>> revoke/restore_fences and _retire_requests (Chris).
>>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
>> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
>> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
>> ---
>>  drivers/gpu/drm/i915/i915_drv.c         | 76 ++++++++++++++++++++++++--
>>  drivers/gpu/drm/i915/i915_drv.h         | 12 +++-
>>  drivers/gpu/drm/i915/i915_gem.c         | 97 +++++++++++++++++++--------------
>>  drivers/gpu/drm/i915/i915_gem_request.c |  2 +-
>>  drivers/gpu/drm/i915/intel_uncore.c     | 20 +++++++
>>  5 files changed, 158 insertions(+), 49 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
>> index e03d0643dbd6..634893cd93b3 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.c
>> +++ b/drivers/gpu/drm/i915/i915_drv.c
>> @@ -1810,7 +1810,7 @@ void i915_reset(struct drm_i915_private *dev_priv)
>>
>>  	pr_notice("drm/i915: Resetting chip after gpu hang\n");
>>  	disable_irq(dev_priv->drm.irq);
>> -	ret = i915_gem_reset_prepare(dev_priv);
>> +	ret = i915_gem_reset_prepare(dev_priv, ALL_ENGINES);
>>  	if (ret) {
>>  		DRM_ERROR("GPU recovery failed\n");
>>  		intel_gpu_reset(dev_priv, ALL_ENGINES);
>> @@ -1852,7 +1852,7 @@ void i915_reset(struct drm_i915_private *dev_priv)
>>  	i915_queue_hangcheck(dev_priv);
>>
>>  finish:
>> -	i915_gem_reset_finish(dev_priv);
>> +	i915_gem_reset_finish(dev_priv, ALL_ENGINES);
>>  	enable_irq(dev_priv->drm.irq);
>>
>>  wakeup:
>> @@ -1871,11 +1871,79 @@ void i915_reset(struct drm_i915_private *dev_priv)
>>   *
>>   * Reset a specific GPU engine. Useful if a hang is detected.
>>   * Returns zero on successful reset or otherwise an error code.
>> + *
>> + * Procedure is:
>> + *  - identifies the request that caused the hang and it is dropped
>> + *  - force engine to idle: this is done by issuing a reset request
>> + *  - reset engine
>> + *  - restart submissions to the engine
>>   */
>>  int i915_reset_engine(struct intel_engine_cs *engine)
>>  {
>> -	/* FIXME: replace me with engine reset sequence */
>> -	return -ENODEV;
>> +	int ret;
>> +	struct drm_i915_private *dev_priv = engine->i915;
>> +	struct i915_gpu_error *error = &dev_priv->gpu_error;
>> +
>> +	GEM_BUG_ON(!test_bit(I915_RESET_BACKOFF, &error->flags));
>> +
>> +	DRM_DEBUG_DRIVER("resetting %s\n", engine->name);
>> +
>> +	/*
>> +	 * We need to first idle the engine by issuing a reset request,
>> +	 * then perform soft reset and re-initialize hw state, for all of
>> +	 * this GT power need to be awake so ensure it does throughout the
>> +	 * process
>> +	 */
>> +	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
>
> Hmm, what path did we take to get here without taking rpm? I thought I
> had fixed the last offender.
>

Too many rebases... As you say, this is no longer needed after 
1604a86d08053 "drm/i915: Extend rpm wakelock during i915_handle_error()"

>> +	disable_irq(dev_priv->drm.irq);
>
> I am 99% certain that we don't need to disable_irq() now for per-engine
> reset... I'd keep it in the global reset as simple paranoia.
>

100% correct.

>> +	ret = i915_gem_reset_prepare_engine(engine);
>> +	if (ret) {
>> +		DRM_ERROR("Previous reset failed - promote to full reset\n");
>> +		goto error;
>> +	}
>> +
>> +	/*
>> +	 * the request that caused the hang is stuck on elsp, identify the
>> +	 * active request and drop it, adjust head to skip the offending
>> +	 * request to resume executing remaining requests in the queue.
>> +	 */
>
> Hmm. Interesting. This relies on i915_gem_retire_requests() (i.e.
> struct_mutex) to skip replaying innocent requests, but here we should be
> asserting that we do have the hung request.
>
> i.e.
> request = i915_gem_find_active_request(engine);
> if (!request)
> 	goto skip.
>
> Bonus points for tying that into i915_gem_reset_prepare_engine() so that
> we only seach for the active_request once.
>

What about something like this?

int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
...
         if (engine_stalled(engine)) {
                 request = i915_gem_find_active_request(engine);
+               if (!request)
+                       err = -ECANCELED;
+
                 if (request && request->fence.error == -EIO)
                         err = -EIO; /* Previous reset failed! */
         }

and

int i915_reset_engine(struct intel_engine_cs *engine)
...
         DRM_DEBUG_DRIVER("resetting %s\n", engine->name);

         ret = i915_gem_reset_prepare_engine(engine);
-       if (ret) {
+       if (ret == -ECANCELED) {
+               DRM_INFO("no active request found, skip reset\n");
+               goto skip;
+       } else  if (ret) {
                 DRM_ERROR("Previous reset failed - promote to full 
reset\n");
                 goto out;
...

+skip:
+       i915_gem_reset_finish_engine(engine);
+       goto out;
...

I'll still have to keep the request (if found) and pass it to 
i915_gem_reset_engine, to avoid the extra find_active_request.

>> +	i915_gem_reset_engine(engine);
>
> Where does skip_request() get called?
>

Sorry, I didn't get this, skip_request happens under this path:
i915_gem_reset_engine
--> i915_gem_reset_request
-----> if (guilty)

>> +
>> +	/* forcing engine to idle */
>> +	ret = intel_reset_engine_start(engine);
>> +	if (ret) {
>> +		DRM_ERROR("Failed to disable %s\n", engine->name);
>> +		goto error;
>> +	}
>> +
>> +	/* finally, reset engine */
>> +	ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
>> +	if (ret) {
>> +		DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret);
>> +		intel_reset_engine_cancel(engine);
>> +		goto error;
>> +	}
>> +
>> +	/* be sure the request reset bit gets cleared */
>> +	intel_reset_engine_cancel(engine);
>> +
>> +	i915_gem_reset_finish_engine(engine);
>> +
>> +	/* replay remaining requests in the queue */
>> +	ret = engine->init_hw(engine);
>> +	if (ret)
>> +		goto error;
>> +
>> +wakeup:
>> +	enable_irq(dev_priv->drm.irq);
>> +	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
>> +	wake_up_bit(&error->flags, I915_RESET_HANDOFF);
>
> No handoff here anymore.
>

All these are leftovers from the previous version.

>> +	return ret;
>> +
>> +error:
>> +	/* use full gpu reset to recover on error */
>> +	goto wakeup;
>>  }
>
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

  parent reply	other threads:[~2017-04-21  0:17 UTC|newest]

Thread overview: 50+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-04-18 20:23 [PATCH v6 00/20] Gen8+ engine-reset Michel Thierry
2017-04-18 20:23 ` [PATCH v6 01/20] drm/i915: Fix stale comment about I915_RESET_IN_PROGRESS flag Michel Thierry
2017-04-18 20:23 ` [PATCH v6 02/20] drm/i915: Rename gen8_(un)request_engine_reset to gen8_reset_engine_start/cancel Michel Thierry
2017-04-27  8:13   ` Chris Wilson
2017-04-18 20:23 ` [PATCH v6 03/20] drm/i915: Update i915.reset to handle engine resets Michel Thierry
2017-04-18 20:23 ` [PATCH v6 04/20] drm/i915/tdr: Modify error handler for per engine hang recovery Michel Thierry
2017-04-18 21:40   ` Chris Wilson
2017-04-18 22:01     ` Michel Thierry
2017-04-18 23:13       ` Chris Wilson
2017-04-18 20:23 ` [PATCH v6 05/20] drm/i915/tdr: Add support for per engine reset recovery Michel Thierry
2017-04-19 10:49   ` Chris Wilson
2017-04-19 13:49     ` Chris Wilson
2017-04-21  0:17     ` Michel Thierry [this message]
2017-04-24 21:22       ` Michel Thierry
2017-04-25  9:42         ` Chris Wilson
2017-04-18 20:23 ` [PATCH v6 06/20] drm/i915: Skip reset request if there is one already Michel Thierry
2017-04-18 20:23 ` [PATCH v6 07/20] drm/i915/tdr: Add engine reset count to error state Michel Thierry
2017-04-18 20:23 ` [PATCH v6 08/20] drm/i915/tdr: Export per-engine reset count info to debugfs Michel Thierry
2017-04-18 20:23 ` [PATCH v6 09/20] drm/i915/tdr: Enable Engine reset and recovery support Michel Thierry
2017-04-18 20:23 ` [PATCH v6 10/20] drm/i915: Add engine reset count in get-reset-stats ioctl Michel Thierry
2017-04-18 20:23 ` [PATCH v6 11/20] drm/i915/selftests: reset engine self tests Michel Thierry
2017-04-18 20:23 ` [PATCH v6 12/20] drm/i915/guc: fix mmio whitelist mmio_start offset and add reminder Michel Thierry
2017-04-18 20:23 ` [PATCH v6 13/20] drm/i915/guc: Provide register list to be saved/restored during engine reset Michel Thierry
2017-04-19  0:26   ` Daniele Ceraolo Spurio
2017-04-19  0:44     ` Michel Thierry
2017-04-18 20:23 ` [PATCH v6 14/20] drm/i915/guc: Add support for reset engine using GuC commands Michel Thierry
2017-04-19 10:27   ` Chris Wilson
2017-04-19 23:22     ` Michel Thierry
2017-04-20  9:05       ` Chris Wilson
2017-04-18 20:23 ` [PATCH v6 15/20] drm/i915: Watchdog timeout: Pass GuC shared data structure during param load Michel Thierry
2017-04-18 21:18   ` Daniele Ceraolo Spurio
2017-04-18 20:23 ` [PATCH v6 16/20] drm/i915: Watchdog timeout: IRQ handler for gen8+ Michel Thierry
2017-04-19 10:20   ` Chris Wilson
2017-04-19 17:11     ` Michel Thierry
2017-04-19 17:51       ` Chris Wilson
2017-04-19 18:13         ` Michel Thierry
2017-04-18 20:23 ` [PATCH v6 17/20] drm/i915: Watchdog timeout: Ringbuffer command emission " Michel Thierry
2017-04-18 21:20   ` Chris Wilson
2017-04-18 21:36     ` Michel Thierry
2017-04-18 23:06       ` Chris Wilson
2017-04-18 23:11         ` Michel Thierry
2017-04-18 20:23 ` [PATCH v6 18/20] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout Michel Thierry
2017-04-19 16:56   ` Jeff McGee
2017-04-19 17:07   ` Daniele Ceraolo Spurio
2017-04-20  1:09   ` Michel Thierry
2017-04-20  8:52     ` Chris Wilson
2017-04-20 17:19       ` Michel Thierry
2017-04-18 20:23 ` [PATCH v6 19/20] drm/i915: Watchdog timeout: Include threshold value in error state Michel Thierry
2017-04-18 20:23 ` [PATCH v6 20/20] drm/i915: Watchdog timeout: Export media reset count from GuC to debugfs Michel Thierry
2017-04-18 20:44 ` ✓ Fi.CI.BAT: success for Gen8+ engine-reset (rev2) Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=58373b7c-0f4c-8bf9-116b-776998bc5e2b@intel.com \
    --to=michel.thierry@intel.com \
    --cc=chris@chris-wilson.co.uk \
    --cc=intel-gfx@lists.freedesktop.org \
    --cc=mika.kuoppala@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.