From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: "Teres Alexis, Alan Previn" <alan.previn.teres.alexis@intel.com>,
	"Brost, Matthew" <matthew.brost@intel.com>
Cc: "intel-gfx@lists.freedesktop.org" <intel-gfx@lists.freedesktop.org>
Subject: Re: [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.
Date: Tue, 4 Jan 2022 13:56:39 +0000	[thread overview]
Message-ID: <8257f42f-7bbd-c033-28f1-f43f21cc81af@linux.intel.com> (raw)
In-Reply-To: <baaf6bcf51c487817392142913f31655138b6596.camel@intel.com>


On 24/12/2021 13:34, Teres Alexis, Alan Previn wrote:
> 
> On Fri, 2021-12-24 at 12:09 +0000, Tvrtko Ursulin wrote:
>> Hi,
>>
>> Somehow I stumbled on this while browsing through the mailing list.
>>
>> On 23/12/2021 18:54, Teres Alexis, Alan Previn wrote:
>>> Revisiting below hunk of patch-7 comment, as per offline discussion with Matt,
>>> there is little benefit to even making that guc-id lookup because:
>>>
>>> 1. the delay between the context reset notification (when the vmas are copied)
>>>      and when we verify we had received a guc err capture dump may be
>>>      large enough, and is not bounded, so the guc-id may have already been re-assigned.
>>>
>>> 2. I was really looking for some kind of unique context handle to print out that could
>>>      be correlated (by user inspecting the dump) back to a unique app or process or
>>>      context-id, but I can't find such a param in struct intel_context.
>>>
>>> As part of further reviewing the end to end flows and possible error scenarios, there
>>> also may potentially be a mismatch between "which context was reset by guc at time-n"
>>> vs "which context's vma buffers is being printed out at time-n+x" if
>>> we are experiencing back-to-back resets and the user dumped the debugfs x-time later.
>>
>> What does this all actually mean, because it sounds rather alarming,
>> that it just won't be possible to know which context, belonging to which
>> process, was reset? And because of guc_id potentially re-assigned even
>> the captured VMAs may not be the correct ones?
>>
>>
> 
> The flow of events is as below:
> 
> 1. guc sends notification that an error capture was done and is ready to be retrieved.
> 	- at this point we copy the guc error captured dump into an interim store
> 	  (larger buffer that can hold multiple captures).
> 2. guc sends notification that a context was reset (after the prior notification)
> 	- this triggers a call to i915_gpu_coredump with the corresponding engine-mask
>            from the context that was reset
> 	- i915_gpu_coredump proceeds to gather entire gpu state including driver state,
>            global gpu state, engine state, context vmas and also engine registers. For the
>            engine registers now call into the guc_capture code which merely needs to verify
> 	  that GuC had already done a step 1 and we have data ready to be parsed.

What about the time between the actual reset and receiving the context 
reset notification? The latter will contain intel_context->guc_id - can 
that be re-assigned or "retired" in between the two and so cause problems 
for matching the correct (or any) vmas?

Regards,

Tvrtko

> 
> (time elapses)
> 
> 3. end user triggers the sysfs to dump the error state and all prior information is
>     printed out in proper format.
> 
> 
> Between 2 and 3:
>     - Looking at existing framework (established by execlist-capture codes),
>          I believe we only hold on to the first error state capture and drop any
>          subsequent context reset captures occurring before #3 (i.e. before the end user
>          triggers the debugfs)
>     - However, in that same space, guc can send us more and more error-capture logs
>           as long as we have space for them in the buffer.
> 
> So the issue was that in my original patch, for every next capture-snapshot we found in
> the guc-error-capture output buffer, I would find the matching engine and print out all
> the VMA data (that was successfully captured in #2). However, I should only do that
> for the first dump, since that would correlate exactly with the existing execlist
> code behavior. So this fix is actually pretty straightforward for getting the right matching
> VMA.
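The fix described above could be sketched as something like the following (illustrative only, with hypothetical names; not the actual i915 code): when walking the multiple capture snapshots found in the GuC error-capture output buffer, registers are printed for every snapshot, but VMA data only for the first, since only that one corresponds to the VMAs i915_gpu_coredump saved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy snapshot: just enough state to show the control flow. */
struct capture_snapshot {
	int guc_ctx_id;
};

static void print_registers(const struct capture_snapshot *snap)
{
	printf("registers for guc_ctx_id %d\n", snap->guc_ctx_id);
}

static void print_vma_data(const struct capture_snapshot *snap)
{
	printf("vma data for guc_ctx_id %d\n", snap->guc_ctx_id);
}

/* Returns the number of snapshots whose VMA data was printed: at most 1,
 * matching the "only-print-the-first-vma" behavior described above. */
static int dump_all_snapshots(const struct capture_snapshot *snaps, int n)
{
	bool vma_printed = false;
	int vma_count = 0;

	for (int i = 0; i < n; i++) {
		print_registers(&snaps[i]);
		if (!vma_printed) { /* first snapshot only */
			print_vma_data(&snaps[i]);
			vma_printed = true;
			vma_count++;
		}
	}
	return vma_count;
}
```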
> 
> WRT my statement about getting the "context-to-process" lookup, I was initially hoping that
> I could get that information "on my own" (within the guc-err-capture module), but it would be
> a stretch (in terms of inter-component information access). More importantly, it's totally
> unnecessary since the existing execlist code already did that in Step 2. That code remains intact
> with guc-error-capture.
> 
> One open item I plan to test before the final rev is with shared engines like CCS and RCS, where I want to
> trigger cascading hangs + resets in quick succession just to see how the overall flow behaves.
> 
> I will attach a guc-error-capture based gpu error dump output, as per the review comment from Matthew
> on the last rev.
> 
> ..alan
>>
> 
> 
>> Regards,
>>
>> Tvrtko
>>
>>> (Recap: First, guc notifies capture event, second, guc notifies context reset during
>>> which we trigger i915_gpu_coredump. In this second step, the vma's are dumped and we
>>> verify that the guc capture happened but don't parse the guc-err-capture-logs yet.
>>> Third step is when the user triggers the debugfs dump, which is when we parse the error
>>> capture logs.)
>>>
>>> As a fix, what we can do in the guc_error_capture report out is to ensure that
>>> we don't re-print the previously dumped vmas if we end up finding multiple
>>> guc-error-capture dumps since the i915_gpu_coredump would have only captured the vma's
>>> for the very first context that was reset. And with guc-submission, that would always
>>> correlate to the "next-yet-to-be-parsed" guc-err-capture dump (since the guc-error-capture
>>> logs are large enough to hold data for multiple dumps).
>>>
>>> The changes (removal of the below hunk and adding of "only-print-the-first-vma") are trivial,
>>> but I felt they warranted a good explanation. Apologies for the inbox noise.
>>>
>>> ...alan
>>>
>>> On Tue, 2021-12-07 at 22:32 -0800, Alan Previn Teres Alexis wrote:
>>>> Thanks again for the detailed review here.
>>>> Will fix all the rest on next rev.
>>>> One special response for this one:
>>>>
>>>>
>>>> On Tue, 2021-12-07 at 16:22 -0800, Matthew Brost wrote:
>>>>> On Mon, Nov 22, 2021 at 03:04:02PM -0800, Alan Previn wrote:
>>>>>> +			if (datatype == GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE) {
>>>>>> +				GCAP_PRINT_GUC_INST_INFO(i915, ebuf, data);
>>>>>> +				eng_inst = FIELD_GET(GUC_CAPTURE_DATAHDR_SRC_INSTANCE, data.info);
>>>>>> +				eng = guc_lookup_engine(guc, engineclass, eng_inst);
>>>>>> +				if (eng) {
>>>>>> +					GCAP_PRINT_INTEL_ENG_INFO(i915, ebuf, eng);
>>>>>> +				} else {
>>>>>> +					PRINT(&i915->drm, ebuf, "    i915-Eng-Lookup Fail!\n");
>>>>>> +				}
>>>>>> +				ce = guc_context_lookup(guc, data.guc_ctx_id);
>>>>>
>>>>> You are going to need to reference count the 'ce' here. See
>>>>> intel_guc_context_reset_process_msg for an example.
>>>>>
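The refcounting pattern Matthew refers to can be sketched in isolation as below. This is a toy model, not the real intel_context API: the idea is that the lookup takes a reference before returning the context, and the caller drops it when done, so the context cannot be released while the error-capture code is still using it.

```c
#include <assert.h>
#include <stddef.h>

/* Toy context with an explicit refcount; the real driver uses the
 * kernel's reference-counting primitives instead. */
struct toy_context {
	int guc_id;
	int refcount;
};

/* Look up a context by guc_id and pin it before handing it back, so it
 * stays valid for the caller even if it is being torn down elsewhere. */
static struct toy_context *toy_context_lookup_get(struct toy_context *ctxs,
						  int n, int guc_id)
{
	for (int i = 0; i < n; i++) {
		if (ctxs[i].guc_id == guc_id) {
			ctxs[i].refcount++; /* reference taken for caller */
			return &ctxs[i];
		}
	}
	return NULL;
}

/* Caller is done with the context: drop the reference again. */
static void toy_context_put(struct toy_context *ce)
{
	ce->refcount--;
}
```

A lookup without the get/put pairing is exactly the bug being flagged: the returned pointer could outlive the object it points at.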
>>>>
>>>> Oh crap - I missed this one - which you had explicitly mentioned offline when I was doing the
>>>> development. Sorry about that, I just totally missed it in my todo-notes.
>>>>
>>>> ...alan
> 

