dri-devel.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
From: John Harrison <john.c.harrison@intel.com>
To: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>,
	Jani Nikula <jani.nikula@linux.intel.com>,
	<Intel-GFX@Lists.FreeDesktop.Org>
Cc: DRI-Devel@Lists.FreeDesktop.Org
Subject: Re: [Intel-gfx] [PATCH] drm/i915: Don't wait forever in drop_caches
Date: Wed, 2 Nov 2022 18:33:05 -0700	[thread overview]
Message-ID: <5e22de43-d75c-fc21-9ae7-f27d116c5688@intel.com> (raw)
In-Reply-To: <c710a428-50f6-6181-3f93-4d7667a9ac3f@linux.intel.com>

On 11/2/2022 07:20, Tvrtko Ursulin wrote:
> On 02/11/2022 12:12, Jani Nikula wrote:
>> On Tue, 01 Nov 2022, John.C.Harrison@Intel.com wrote:
>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>
>>> At the end of each test, IGT does a drop caches call via sysfs with
>>
>> sysfs?
Sorry, that was meant to say debugfs. I've also been working on some 
sysfs IGT issues and evidently got my wires crossed!

>>
>>> special flags set. One of the possible paths waits for idle with an
>>> infinite timeout. That causes problems for debugging issues when CI
>>> catches a "can't go idle" test failure. Best case, the CI system times
>>> out (after 90s), attempts a bunch of state dump actions and then
>>> reboots the system to recover it. Worst case, the CI system can't do
>>> anything at all and then times out (after 1000s) and simply reboots.
>>> Sometimes a serial port log of dmesg might be available, sometimes not.
>>>
>>> So rather than making life hard for ourselves, change the timeout to
>>> be 10s rather than infinite. Also, trigger the standard
>>> wedge/reset/recover sequence so that testing can continue with a
>>> working system (if possible).
>>>
>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>> ---
>>>   drivers/gpu/drm/i915/i915_debugfs.c | 7 ++++++-
>>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c 
>>> b/drivers/gpu/drm/i915/i915_debugfs.c
>>> index ae987e92251dd..9d916fbbfc27c 100644
>>> --- a/drivers/gpu/drm/i915/i915_debugfs.c
>>> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
>>> @@ -641,6 +641,9 @@ DEFINE_SIMPLE_ATTRIBUTE(i915_perf_noa_delay_fops,
>>>             DROP_RESET_ACTIVE | \
>>>             DROP_RESET_SEQNO | \
>>>             DROP_RCU)
>>> +
>>> +#define DROP_IDLE_TIMEOUT    (HZ * 10)
>>
>> I915_IDLE_ENGINES_TIMEOUT is defined in i915_drv.h. It's also only used
>> here.
>
> So move here, dropping i915 prefix, next to the newly proposed one?
Sure, can do that.

>
>> I915_GEM_IDLE_TIMEOUT is defined in i915_gem.h. It's only used in
>> gt/intel_gt.c.
>
> Move there and rename to GT_IDLE_TIMEOUT?
>
>> I915_GT_SUSPEND_IDLE_TIMEOUT is defined and used only in intel_gt_pm.c.
>
> No action needed, maybe drop i915 prefix if wanted.
>
These two are totally unrelated and in code not being touched by this 
change. I would rather not conflate changing random other things with 
fixing this specific issue.

>> I915_IDLE_ENGINES_TIMEOUT is in ms, the rest are in jiffies.
>
> Add _MS suffix if wanted.
>
>> My head spins.
>
> I follow and raise that the newly proposed DROP_IDLE_TIMEOUT applies 
> to DROP_ACTIVE and not only DROP_IDLE.
My original intention for the name was that is the 'drop caches timeout 
for intel_gt_wait_for_idle'. Which is quite the mouthful and hence 
abbreviated to DROP_IDLE_TIMEOUT. But yes, I realised later that name 
can be conflated with the DROP_IDLE flag. Will rename.


>
> Things get refactored, code moves around, bits get left behind, who 
> knows. No reason to get too worked up. :) As long as people are taking 
> a wider view when touching the code base, and are not afraid to send 
> cleanups, things should be good.
On the other hand, if every patch gets blocked in code review because 
someone points out some completely unrelated piece of code could be a 
bit better then nothing ever gets fixed. If you spot something that you 
think should be improved, isn't the general idea that you should post a 
patch yourself to improve it?


>
> For the actual functional change at hand - it would be nice if code 
> paths in question could handle SIGINT and then we could punt the 
> decision on how long someone wants to wait purely to userspace. But 
> it's probably hard and it's only debugfs so whatever.
>
The code paths in question will already abort on a signal won't they? 
Both intel_gt_wait_for_idle() and intel_guc_wait_for_pending_msg(), 
which is where the uc_wait_for_idle eventually ends up, have an 
'if(signal_pending) return -EINTR;' check. Beyond that, it sounds like 
what you are asking for is a change in the IGT libraries and/or CI 
framework to start sending signals after some specific timeout. That 
seems like a significantly more complex change (in terms of the number 
of entities affected and number of groups involved) and unnecessary.

> Whether or not 10s is enough CI will hopefully tell us. I'd probably 
> err on the side of safety and make it longer, but at most half from 
> the test runner timeout.
This is supposed to be test clean up. This is not about how long a 
particular test takes to complete but about how long it takes to declare 
the system broken after the test has already finished. I would argue 
that even 10s is massively longer than required.

>
> I am not convinced that wedging is correct though. Conceptually could 
> be just that the timeout is too short. What does wedging really give 
> us, on top of limiting the wait, when latter AFAIU is the key factor 
> which would prevent the need to reboot the machine?
>
It gives us a system that knows what state it is in. If we can't idle 
the GT then something is very badly wrong. Wedging indicates that. It 
also ensure that a full GT reset will be attempted before the next test 
is run. Helping to prevent a failure on test X from propagating into 
failures of unrelated tests X+1, X+2, ... And if the GT reset does not 
work because the system is really that badly broken then future tests 
will not run rather than report erroneous failures.

This is not about getting a more stable system for end users by sweeping 
issues under the carpet and pretending all is well. End users don't run 
IGTs or explicitly call dodgy debugfs entry points. The sole motivation 
here is to get more accurate results from CI. That is, correctly 
identifying which test has hit a problem, getting valid debug analysis 
for that test (logs and such) and allowing further testing to complete 
correctly in the case where the system can be recovered.

John.

> Regards,
>
> Tvrtko


  reply	other threads:[~2022-11-03  1:33 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-11-01 23:50 [PATCH] drm/i915: Don't wait forever in drop_caches John.C.Harrison
2022-11-02 12:12 ` Jani Nikula
2022-11-02 14:20   ` [Intel-gfx] " Tvrtko Ursulin
2022-11-03  1:33     ` John Harrison [this message]
2022-11-03  9:18       ` Tvrtko Ursulin
2022-11-03  9:38         ` Tvrtko Ursulin
2022-11-03 19:16           ` John Harrison
2022-11-04 10:01             ` Tvrtko Ursulin
2022-11-04 17:45               ` John Harrison
2022-11-07 14:09                 ` Tvrtko Ursulin
2022-11-07 19:45                   ` John Harrison
2022-11-08  9:08                     ` Tvrtko Ursulin
2022-11-08 19:37                       ` John Harrison
2022-11-09 11:35                         ` Tvrtko Ursulin
2022-11-10  6:20                           ` John Harrison
2022-11-03 19:37         ` John Harrison
2022-11-03 10:45       ` Jani Nikula
2022-11-03 19:39         ` John Harrison

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5e22de43-d75c-fc21-9ae7-f27d116c5688@intel.com \
    --to=john.c.harrison@intel.com \
    --cc=DRI-Devel@Lists.FreeDesktop.Org \
    --cc=Intel-GFX@Lists.FreeDesktop.Org \
    --cc=jani.nikula@linux.intel.com \
    --cc=tvrtko.ursulin@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).