All of lore.kernel.org
 help / color / mirror / Atom feed
From: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>, intel-gfx@lists.freedesktop.org
Cc: "Sundaresan, Sujaritha" <sujaritha.sundaresan@intel.com>
Subject: Re: [PATCH 24/46] drm/i915: Do a synchronous switch-to-kernel-context on idling
Date: Thu, 21 Feb 2019 16:29:07 -0800	[thread overview]
Message-ID: <c9f2250e-ee5a-2cab-8826-62f35e33da42@intel.com> (raw)
In-Reply-To: <155079154116.4937.17569276439061642701@skylake-alporthouse-com>



On 2/21/19 3:25 PM, Chris Wilson wrote:
> Quoting Daniele Ceraolo Spurio (2019-02-21 22:53:41)
>>
>>
>> On 2/21/19 1:42 PM, Chris Wilson wrote:
>>> Quoting Daniele Ceraolo Spurio (2019-02-21 21:31:45)
>>>>
>>>>
>>>> On 2/21/19 1:17 PM, Chris Wilson wrote:
>>>>> Quoting Daniele Ceraolo Spurio (2019-02-21 19:48:01)
>>>>>>
>>>>>> <snip>
>>>>>>
>>>>>>> @@ -4481,19 +4471,7 @@ int i915_gem_suspend(struct drm_i915_private *i915)
>>>>>>>           * state. Fortunately, the kernel_context is disposable and we do
>>>>>>>           * not rely on its state.
>>>>>>>           */
>>>>>>> -     if (!i915_terminally_wedged(&i915->gpu_error)) {
>>>>>>> -             ret = i915_gem_switch_to_kernel_context(i915);
>>>>>>> -             if (ret)
>>>>>>> -                     goto err_unlock;
>>>>>>> -
>>>>>>> -             ret = i915_gem_wait_for_idle(i915,
>>>>>>> -                                          I915_WAIT_INTERRUPTIBLE |
>>>>>>> -                                          I915_WAIT_LOCKED |
>>>>>>> -                                          I915_WAIT_FOR_IDLE_BOOST,
>>>>>>> -                                          HZ / 5);
>>>>>>> -             if (ret == -EINTR)
>>>>>>> -                     goto err_unlock;
>>>>>>> -
>>>>>>> +     if (!switch_to_kernel_context_sync(i915)) { >                   /* Forcibly cancel outstanding work and leave the gpu quiet. */
>>>>>>>                  i915_gem_set_wedged(i915);
>>>>>>>          }
>>>>>>
>>>>>> GuC-related question: what's your expectation here in regards to GuC
>>>>>> status? The current i915 flow expect either uc_reset_prepare() or
>>>>>> uc_suspend() to be called to clean up the guc status, but we're calling
>>>>>> neither of them here if the switch is successful. Do you expect the
>>>>>> resume code to always blank out the GuC status before a reload?
>>>>>
>>>>> (A few patches later on I propose that we always just do a reset+wedge
>>>>> on suspend in lieu of hangcheck.)
>>>>>
>>>>> On resume, we have to bring the HW up from scratch and do another reset
>>>>> in the process. Some platforms have been known to survive the trips to
>>>>> PCI_D3 (someone is lying!) and so we _have_ to do a reset to be sure we
>>>>> clear the HW state. I expect we would need to force a reset on resume
>>>>> even for the guc, to be sure we cover all cases such as kexec.
>>>>> -Chris
>>>>>
>>>> More than about the HW state, my question here was about the SW
>>>> tracking. At which point do we go and stop guc communication and mark
>>>> guc as not loaded/accessible? e.g. we need to disable and re-enable CT
>>>> buffers before GuC is reset/suspended to make sure the shared memory
>>>> area is cleaned correctly (we currently avoid memsetting all of it on
>>>> reload since it is quite big). Also, communication with GuC is going to
>>>> increase going forward, so we'll need to make sure we accurately track
>>>> its state and do all the relevant cleanups.
>>>
>>> Across suspend/resume, we issue a couple of resets and scrub/sanitize our
>>> state tracking. By the time we load the fw again, both the fw and our
>>> state should be starting from scratch.
>>>
>>> That all seems unavoidable, so I am not understanding the essence of
>>> your question.
>>> -Chris
>>>
>>
>> We're not doing the state scrubbing for guc in all paths at the moment.
>> There is logic in gem_suspend_late(), but that doesn't seem to be called
>> on all paths; e.g. it isn't when we run
>> igt@gem_exec_suspend@basic-s4-devices
> 
> Yup, the dummy hibernate code throws a few surprises, and why
> i915_gem_sanitize is so fiddly to get right between that and
> gem_eio/suspend.
> 
>> and that's why Suja's patch moved
>> the disabling of communication from uc_sanitize to uc_suspend.
> 
> That should also help as previously it tried to talk to the guc after we
> reset it.

But only helps if we do call uc_suspend ;). I'm wondering if it ends up 
being better to call it from both places.

> 
>> The guc
>> resume code also doesn't currently clean everything as some of the
>> structures (including stuff we allocate for guc usage) are carried over.
>> We can either add something more in the cleanup path or go and rework
>> the resume to blank everything (which would be time consuming since
>> there is tens of MBs involved), but before putting down any code one way
>> or another I wanted to understand what the expectation is.
> 
> I may be naive, but my expectations is that we just have to reset the
> comm ringbuffer pointers. We shouldn't need to hand the guc pristine
> pages, it will zero on allocate when its needs to, surely? We do have to
> rebuild the set of clients everytime we load the guc, so that can't be
> the issue (as that has to be done on resume, device reset etc today),
> although that should only have to be the pinned clients?

GuC doesn't clean up some of the state stored in the memory we allocate 
for its use. In the specific example of the CT buffers, the registration 
is not automatically cleaned by GuC, it is only cleaned when the 
disable_communication H2G is issued or if we just memset the guc memory. 
This is to allow re-use of the same buffers across resets without having 
to issue an H2G to re-enable them. Similar approach is taken for other 
info (e.g. lrc info required gen11+), again to allow the host to 
seamlessly restart after a reset or suspend/resume. We always need to 
recreate the clients because the doorbells are a HW state and thus they 
can get reset with the guc; the firmware also saves db status in the 
WOPCM rather then in the shared memory for speed, so that does get 
cleaned on reload.

In the current gen11 guc code (which hopefully will hit the ML soon) we 
assumed that uc_suspend would be called on all suspend paths to make 
sure the state in the shared structures was clean, but if it doesn't 
then we'll have to do some tweaks to cope. BTW, we need to add 
uc_reset_prepare() to __i915_gem_set_wedged as well.

Daniele

> 
> We have to restart the comm channels on loading the guc, so what's
> changing?
> 
> On suspend, hit the device reset & kill guc. On resume, load the guc fw,
> restart comm. After fiddling about making sure we are in the right
> callpaths, the intent is that resume just looks like a fresh module
> load (so we only have to reason about init sequence [nearly] once).
> -Chris
> 
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

  reply	other threads:[~2019-02-22  0:29 UTC|newest]

Thread overview: 97+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-02-06 13:03 The road to load balancing Chris Wilson
2019-02-06 13:03 ` [PATCH 01/46] drm/i915: Hack and slash, throttle execbuffer hogs Chris Wilson
2019-02-06 13:03 ` [PATCH 02/46] drm/i915: Revoke mmaps and prevent access to fence registers across reset Chris Wilson
2019-02-06 15:56   ` Mika Kuoppala
2019-02-06 16:08     ` Chris Wilson
2019-02-06 16:18       ` Chris Wilson
2019-02-26 19:53   ` Rodrigo Vivi
2019-02-26 20:27     ` Chris Wilson
2019-02-06 13:03 ` [PATCH 03/46] drm/i915: Force the GPU reset upon wedging Chris Wilson
2019-02-06 13:03 ` [PATCH 04/46] drm/i915: Uninterruptibly drain the timelines on unwedging Chris Wilson
2019-02-06 13:03 ` [PATCH 05/46] drm/i915: Wait for old resets before applying debugfs/i915_wedged Chris Wilson
2019-02-06 13:03 ` [PATCH 06/46] drm/i915: Serialise resets with wedging Chris Wilson
2019-02-06 13:03 ` [PATCH 07/46] drm/i915: Don't claim an unstarted request was guilty Chris Wilson
2019-02-06 13:03 ` [PATCH 08/46] drm/i915/execlists: Suppress mere WAIT preemption Chris Wilson
2019-02-11 11:19   ` Tvrtko Ursulin
2019-02-19 10:22   ` Matthew Auld
2019-02-19 10:34     ` Chris Wilson
2019-02-06 13:03 ` [PATCH 09/46] drm/i915/execlists: Suppress redundant preemption Chris Wilson
2019-02-06 13:03 ` [PATCH 10/46] drm/i915: Make request allocation caches global Chris Wilson
2019-02-11 11:43   ` Tvrtko Ursulin
2019-02-11 12:40     ` Chris Wilson
2019-02-11 17:02       ` Tvrtko Ursulin
2019-02-12 11:51         ` Chris Wilson
2019-02-06 13:03 ` [PATCH 11/46] drm/i915: Keep timeline HWSP allocated until idle across the system Chris Wilson
2019-02-06 13:03 ` [PATCH 12/46] drm/i915/execlists: Refactor out can_merge_rq() Chris Wilson
2019-02-06 13:03 ` [PATCH 13/46] drm/i915: Compute the global scheduler caps Chris Wilson
2019-02-11 12:24   ` Tvrtko Ursulin
2019-02-11 12:33     ` Chris Wilson
2019-02-06 13:03 ` [PATCH 14/46] drm/i915: Use HW semaphores for inter-engine synchronisation on gen8+ Chris Wilson
2019-02-06 13:03 ` [PATCH 15/46] drm/i915: Prioritise non-busywait semaphore workloads Chris Wilson
2019-02-06 13:03 ` [PATCH 16/46] drm/i915: Show support for accurate sw PMU busyness tracking Chris Wilson
2019-02-06 13:03 ` [PATCH 17/46] drm/i915: Apply rps waitboosting for dma_fence_wait_timeout() Chris Wilson
2019-02-11 18:06   ` Tvrtko Ursulin
2019-02-06 13:03 ` [PATCH 18/46] drm/i915: Replace global_seqno with a hangcheck heartbeat seqno Chris Wilson
2019-02-11 12:40   ` Tvrtko Ursulin
2019-02-11 12:44     ` Chris Wilson
2019-02-11 16:56       ` Tvrtko Ursulin
2019-02-12 13:36         ` Chris Wilson
2019-02-06 13:03 ` [PATCH 19/46] drm/i915/pmu: Always sample an active ringbuffer Chris Wilson
2019-02-11 18:18   ` Tvrtko Ursulin
2019-02-12 13:40     ` Chris Wilson
2019-02-06 13:03 ` [PATCH 20/46] drm/i915: Remove access to global seqno in the HWSP Chris Wilson
2019-02-11 18:22   ` Tvrtko Ursulin
2019-02-06 13:03 ` [PATCH 21/46] drm/i915: Remove i915_request.global_seqno Chris Wilson
2019-02-11 18:44   ` Tvrtko Ursulin
2019-02-12 13:45     ` Chris Wilson
2019-02-06 13:03 ` [PATCH 22/46] drm/i915: Force GPU idle on suspend Chris Wilson
2019-02-06 13:03 ` [PATCH 23/46] drm/i915/selftests: Improve switch-to-kernel-context checking Chris Wilson
2019-02-06 13:03 ` [PATCH 24/46] drm/i915: Do a synchronous switch-to-kernel-context on idling Chris Wilson
2019-02-21 19:48   ` Daniele Ceraolo Spurio
2019-02-21 21:17     ` Chris Wilson
2019-02-21 21:31       ` Daniele Ceraolo Spurio
2019-02-21 21:42         ` Chris Wilson
2019-02-21 22:53           ` Daniele Ceraolo Spurio
2019-02-21 23:25             ` Chris Wilson
2019-02-22  0:29               ` Daniele Ceraolo Spurio [this message]
2019-02-06 13:03 ` [PATCH 25/46] drm/i915: Store the BIT(engine->id) as the engine's mask Chris Wilson
2019-02-11 18:51   ` Tvrtko Ursulin
2019-02-12 13:51     ` Chris Wilson
2019-02-06 13:03 ` [PATCH 26/46] drm/i915: Refactor common code to load initial power context Chris Wilson
2019-02-06 13:03 ` [PATCH 27/46] drm/i915: Reduce presumption of request ordering for barriers Chris Wilson
2019-02-06 13:03 ` [PATCH 28/46] drm/i915: Remove has-kernel-context Chris Wilson
2019-02-06 13:03 ` [PATCH 29/46] drm/i915: Introduce the i915_user_extension_method Chris Wilson
2019-02-11 19:00   ` Tvrtko Ursulin
2019-02-12 13:56     ` Chris Wilson
2019-02-06 13:03 ` [PATCH 30/46] drm/i915: Track active engines within a context Chris Wilson
2019-02-11 19:11   ` Tvrtko Ursulin
2019-02-12 13:59     ` Chris Wilson
2019-02-06 13:03 ` [PATCH 31/46] drm/i915: Introduce a context barrier callback Chris Wilson
2019-02-06 13:03 ` [PATCH 32/46] drm/i915: Create/destroy VM (ppGTT) for use with contexts Chris Wilson
2019-02-12 11:18   ` Tvrtko Ursulin
2019-02-12 14:11     ` Chris Wilson
2019-02-06 13:03 ` [PATCH 33/46] drm/i915: Extend CONTEXT_CREATE to set parameters upon construction Chris Wilson
2019-02-12 13:43   ` Tvrtko Ursulin
2019-02-06 13:03 ` [PATCH 34/46] drm/i915: Allow contexts to share a single timeline across all engines Chris Wilson
2019-02-06 13:03 ` [PATCH 35/46] drm/i915: Fix I915_EXEC_RING_MASK Chris Wilson
2019-02-06 13:03 ` [PATCH 36/46] drm/i915: Remove last traces of exec-id (GEM_BUSY) Chris Wilson
2019-02-06 13:03 ` [PATCH 37/46] drm/i915: Re-arrange execbuf so context is known before engine Chris Wilson
2019-02-06 13:03 ` [PATCH 38/46] drm/i915: Allow a context to define its set of engines Chris Wilson
2019-02-25 10:41   ` Tvrtko Ursulin
2019-02-25 10:47     ` Chris Wilson
2019-02-06 13:03 ` [PATCH 39/46] drm/i915: Extend I915_CONTEXT_PARAM_SSEU to support local ctx->engine[] Chris Wilson
2019-02-06 13:03 ` [PATCH 40/46] drm/i915: Pass around the intel_context Chris Wilson
2019-02-06 13:03 ` [PATCH 41/46] drm/i915: Split struct intel_context definition to its own header Chris Wilson
2019-02-06 13:03 ` [PATCH 42/46] drm/i915: Move over to intel_context_lookup() Chris Wilson
2019-02-06 14:27   ` [PATCH] " Chris Wilson
2019-02-06 13:03 ` [PATCH 43/46] drm/i915: Load balancing across a virtual engine Chris Wilson
2019-02-06 13:03 ` [PATCH 44/46] drm/i915: Extend execution fence to support a callback Chris Wilson
2019-02-06 13:03 ` [PATCH 45/46] drm/i915/execlists: Virtual engine bonding Chris Wilson
2019-02-06 13:03 ` [PATCH 46/46] drm/i915: Allow specification of parallel execbuf Chris Wilson
2019-02-06 13:52 ` ✗ Fi.CI.CHECKPATCH: warning for series starting with [01/46] drm/i915: Hack and slash, throttle execbuffer hogs Patchwork
2019-02-06 14:09 ` ✗ Fi.CI.BAT: failure " Patchwork
2019-02-06 14:11 ` ✗ Fi.CI.SPARSE: warning " Patchwork
2019-02-06 14:37 ` ✗ Fi.CI.CHECKPATCH: warning for series starting with [01/46] drm/i915: Hack and slash, throttle execbuffer hogs (rev2) Patchwork
2019-02-06 14:55 ` ✗ Fi.CI.SPARSE: " Patchwork
2019-02-06 14:56 ` ✓ Fi.CI.BAT: success " Patchwork
2019-02-06 16:18 ` ✗ Fi.CI.IGT: failure " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=c9f2250e-ee5a-2cab-8826-62f35e33da42@intel.com \
    --to=daniele.ceraolospurio@intel.com \
    --cc=chris@chris-wilson.co.uk \
    --cc=intel-gfx@lists.freedesktop.org \
    --cc=sujaritha.sundaresan@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.