From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>,
	intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org
Cc: maarten.lankhorst@linux.intel.com, matthew.auld@intel.com,
	Matthew Brost <matthew.brost@intel.com>,
	John Harrison <John.C.Harrison@Intel.com>
Subject: Re: [Intel-gfx] [PATCH v6 3/9] drm/i915/gt: Increase suspend timeout
Date: Thu, 23 Sep 2021 15:33:39 +0100	[thread overview]
Message-ID: <a3b8aa87-1276-7dd7-611b-b2aaf758860a@linux.intel.com> (raw)
In-Reply-To: <199e2c25-8133-360e-4b85-18485522c2be@linux.intel.com>


On 23/09/2021 14:19, Thomas Hellström wrote:
> 
> On 9/23/21 2:59 PM, Tvrtko Ursulin wrote:
>>
>> On 23/09/2021 12:47, Thomas Hellström wrote:
>>> Hi, Tvrtko,
>>>
>>> On 9/23/21 12:13 PM, Tvrtko Ursulin wrote:
>>>>
>>>> On 22/09/2021 07:25, Thomas Hellström wrote:
>>>>> With GuC submission on DG1, the execution of the requests times out
>>>>> for the gem_exec_suspend igt test case after executing around 800-900
>>>>> of 1000 submitted requests.
>>>>>
>>>>> Given the time we allow elsewhere for fences to signal (in the 
>>>>> order of
>>>>> seconds), increase the timeout before we mark the gt wedged and 
>>>>> proceed.
>>>>
>>>> I suspect it is not about requests not retiring in time but about 
>>>> the intel_guc_wait_for_idle part of intel_gt_wait_for_idle. Although 
>>>> I don't know which G2H message the code is waiting for at suspend 
>>>> time, so perhaps something to run past the GuC experts.
>>>
>>> So what's happening here is that the test submits 1000 requests, 
>>> each writing a value to an object, and then that object content is 
>>> checked after resume. With GuC it turns out that only 800-900 or so 
>>> values are actually written before we time out, and the test 
>>> (basic-S3) fails, but not on every run.
>>
>> Yes, and that did not make sense to me. It is a single context even, so 
>> I could not come up with an explanation for why GuC would be slower.
>>
>> Unless it somehow manages to not even update the ring tail in time and 
>> requests are still only stuck in the software queue? Perhaps you can 
>> see that from context tail and head when it happens.
>>
>>> This is a bit interesting in itself, because I never saw the hang-S3 
>>> test fail, which from what I can tell basically is an identical test 
>>> but with a spinner submitted after the 1000th request. Could be that 
>>> the suspend backup code ends up waiting for something before we end 
>>> up in intel_gt_wait_for_idle, giving more requests time to execute.
>>
>> No idea, I don't know the suspend paths that well. For instance before 
>> looking at the code I thought we would preempt what's executing and 
>> not wait for everything that has been submitted to finish. :)
>>
>>>> Anyway, if that turns out to be correct then perhaps it would be 
>>>> better to split the two timeouts (say, if the required GuC timeout is 
>>>> fundamentally independent) so it's clear who needs how much time. 
>>>> Adding Matt and John to comment.
>>>
>>> You mean we have separate timeouts depending on whether we're using 
>>> GuC or execlists submission?
>>
>> No, I don't know yet. First I think we need to figure out what exactly 
>> is happening.
> 
> Well then TBH I will need to file a separate Jira about that. There 
> might be various things going on here, like switching between the 
> migrate context for eviction of unrelated LMEM buffers and the context 
> used by gem_exec_suspend. The gem_exec_suspend failures are blocking 
> DG1 BAT so it's pretty urgent to get this series merged. If you insist 
> I can leave this patch out for now, but I'd rather commit it as is and 
> file a Jira instead.

I see now how you have i915_gem_suspend() in between two lmem_suspend() 
calls in this series. So the first call has the potential of creating a 
lot of requests, and you think that is what interferes? Sounds plausible, 
but it implies GuC timeslicing is less efficient, if I follow?
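
Just so we are talking about the same thing, the ordering I have in mind 
is roughly this (hand-written sketch of how I read the series, function 
names taken from the discussion above, not verbatim from the patches):

	lmem_suspend(i915);     /* first call - may queue a lot of copy requests */
	i915_gem_suspend(i915); /* ends up in intel_gt_wait_for_idle() with the suspend timeout */
	lmem_suspend(i915);     /* second call */

If that reading is right, then the amount of work the idle wait has to 
absorb depends on how much LMEM the first call decided to back up.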

IMO it is okay to leave this for follow-up work but, strictly speaking, 
unless I am missing something, the approach of bumping the timeout does 
not sound valid if the copying is done asynchronously.

Because the timeout is then mandated not only as a function of GPU 
activity (let's say user controlled), but also of the amount of 
unpinned/idle buffers which happen to be lying around (which is more 
i915 controlled, or mixed at least).
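
To put a completely made-up number on it, purely as an illustration:

	4 GiB of idle LMEM to back up / ~2 GiB/s effective copy rate ≈ 2 s

just for the backup blits, before any user submitted work is counted, 
and that scales linearly with however much LMEM happens to be resident.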

So the question is: with enough data to copy, any timeout could be too 
low, and then how long do we want to wait before failing suspend? Whether 
that is an argument for a separate timeout specifically addressing the 
suspend path, I am not sure. Perhaps there is no choice but to simply 
wait until the buffers are swapped out, since otherwise nothing will 
work.
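
If we end up having to just wait, then on the suspend path that would be 
more like (thinking out loud, not a concrete proposal):

	/* Wait without an arbitrary cap instead of guessing a number. */
	intel_gt_wait_for_idle(gt, MAX_SCHEDULE_TIMEOUT);

with the obvious downside that a stuck request could then block suspend 
until something else, like the heartbeat/reset, intervenes.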

Regards,

Tvrtko

Thread overview: 19+ messages
2021-09-22  6:25 [PATCH v6 0/9] drm/i915: Suspend / resume backup- and restore of LMEM Thomas Hellström
2021-09-22  6:25 ` [PATCH v6 1/9] drm/i915/ttm: Implement a function to copy the contents of two TTM-based objects Thomas Hellström
2021-09-22  6:25 ` [PATCH v6 2/9] drm/i915/gem: Implement a function to process all gem objects of a region Thomas Hellström
2021-09-22  6:25 ` [PATCH v6 3/9] drm/i915/gt: Increase suspend timeout Thomas Hellström
2021-09-23  9:18   ` Matthew Auld
2021-09-23 10:13   ` [Intel-gfx] " Tvrtko Ursulin
2021-09-23 11:47     ` Thomas Hellström
2021-09-23 12:59       ` Tvrtko Ursulin
2021-09-23 13:19         ` Thomas Hellström
2021-09-23 14:33           ` Tvrtko Ursulin [this message]
2021-09-23 15:43             ` Thomas Hellström
2021-09-22  6:25 ` [PATCH v6 4/9] drm/i915 Implement LMEM backup and restore for suspend / resume Thomas Hellström
2021-09-22  6:25 ` [PATCH v6 5/9] drm/i915/gt: Register the migrate contexts with their engines Thomas Hellström
2021-09-22  6:25 ` [PATCH v6 6/9] drm/i915: Don't back up pinned LMEM context images and rings during suspend Thomas Hellström
2021-09-22  6:25 ` [PATCH v6 7/9] drm/i915: Reduce the number of objects subject to memcpy recover Thomas Hellström
2021-09-23  9:44   ` Matthew Auld
2021-09-23  9:58     ` Thomas Hellström
2021-09-22  6:25 ` [PATCH v6 8/9] HAX: component: do not leave master devres group open after bind Thomas Hellström
2021-09-22  6:25 ` [PATCH v6 9/9] HAX: drm/i915/gem: Fix the __i915_gem_is_lmem() function Thomas Hellström
