Message-ID: <8257f42f-7bbd-c033-28f1-f43f21cc81af@linux.intel.com>
Date: Tue, 4 Jan 2022 13:56:39 +0000
From: Tvrtko Ursulin
Organization: Intel Corporation UK Plc
To: "Teres Alexis, Alan Previn", "Brost, Matthew"
Cc: "intel-gfx@lists.freedesktop.org"
Subject: Re: [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.
References: <20211122230402.2023576-1-alan.previn.teres.alexis@intel.com>
 <20211122230402.2023576-8-alan.previn.teres.alexis@intel.com>
 <20211208002215.GA17960@jons-linux-dev-box>

On 24/12/2021 13:34, Teres Alexis, Alan Previn wrote:
>
> On Fri, 2021-12-24 at 12:09 +0000, Tvrtko Ursulin wrote:
>> Hi,
>>
>> Somehow I stumbled on this while browsing through the mailing list.
>>
>> On 23/12/2021 18:54, Teres Alexis, Alan Previn wrote:
>>> Revisiting the below hunk of the patch-7 comment: as per an offline discussion with Matt,
>>> there is little benefit to even making that guc-id lookup because:
>>>
>>> 1. The delay between the context reset notification (when the vmas are copied
>>> and when we verify we had received a guc err capture dump) may be
>>> large enough, and is not bounded, such that the guc-id may have already been re-assigned.
>>>
>>> 2. I was really looking for some kind of unique context handle to print out that could
>>> be correlated (by a user inspecting the dump) back to a unique app or process or
>>> context-id, but can't find such a param in struct intel_context.
>>>
>>> As part of further reviewing the end-to-end flows and possible error scenarios, there
>>> may also potentially be a mismatch between "which context was reset by guc at time-n"
>>> vs "which context's vma buffers are being printed out at time-n+x" if
>>> we are experiencing back-to-back resets and the user dumped the debugfs x-time later.
>>
>> What does this all actually mean? It sounds rather alarming -
>> that it just won't be possible to know which context, belonging to which
>> process, was reset? And that, because the guc_id is potentially re-assigned, even
>> the captured VMAs may not be the correct ones?
>>
>
> The flow of events is as below:
>
> 1. guc sends notification that an error capture was done and is ready to be taken.
>    - at this point we copy the guc error capture dump into an interim store
>      (a larger buffer that can hold multiple captures).
> 2. guc sends notification that a context was reset (after the prior).
>    - this triggers a call to i915_gpu_coredump with the corresponding engine-mask
>      from the context that was reset.
>    - i915_gpu_coredump proceeds to gather the entire gpu state including driver state,
>      global gpu state, engine state, context vmas and also engine registers. For the
>      engine registers we now call into the guc_capture code, which merely needs to verify
>      that GuC had already done step 1 and that we have data ready to be parsed.

What about the time between the actual reset and receiving the context reset
notification? The latter will contain intel_context->guc_id - can that be
re-assigned or "retired" in between the two and so cause problems for matching
the correct (or any) vmas?

Regards,

Tvrtko

>
> (time elapses)
>
> 3. end user triggers the sysfs to dump the error state and all prior information is
>    printed out in the proper format.
>
> Between 2 and 3:
> - Looking at the existing framework (established by the execlist-capture code),
>   I believe we only hold on to the first error state capture and drop any
>   subsequent context reset captures occurring before #3 (i.e. before the end user
>   triggers the debugfs).
> - However, in that same window, guc can keep sending us more error-capture logs
>   as long as we have space for them in the buffer.
>
> So the issue was that in my original patch, for every next capture-snapshot we find in
> the guc-error-capture output buffer, I would find the matching engine and print out all
> the VMA data (that was successfully captured in #2). However, I should only do that
> for the first dump, since that would correlate exactly with the existing execlist
> code behavior. So this fix is actually pretty straightforward to get the right matching
> VMA.
>
> WRT my statement about the "context-to-process" lookup, I was initially hoping that
> I could get that information "on my own" (within the guc-err-capture module), but it would be
> a stretch (in terms of inter-component information access). More importantly, it is totally
> unnecessary since the existing execlist code already did that in Step 2. That code remains intact
> with guc-error-capture.
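
As an aside, the "only print the VMA data for the first capture" behaviour described above could be sketched roughly as below. This is purely an illustration under assumed names: struct guc_capture_node, print_capture_regs() and print_capture_vmas() are hypothetical stand-ins, not the actual guc_capture or i915_gpu_coredump code.

#include <linux/list.h>

/* Hypothetical per-snapshot node held in the interim capture store. */
struct guc_capture_node {
	struct list_head link;
	/* ... parsed register data, guc_ctx_id, engine class/instance ... */
};

/*
 * Walk the interim store of GuC error-capture snapshots and print the
 * engine registers for every snapshot, but emit the context VMA buffers
 * only once, for the oldest snapshot, since i915_gpu_coredump will have
 * captured the VMAs only for the first context that was reset.
 */
static void guc_capture_print_all(struct drm_i915_error_state_buf *ebuf,
				  struct list_head *captures)
{
	struct guc_capture_node *node;
	bool vmas_printed = false;

	list_for_each_entry(node, captures, link) {
		print_capture_regs(ebuf, node);		/* hypothetical helper */
		if (!vmas_printed) {
			print_capture_vmas(ebuf, node);	/* hypothetical helper */
			vmas_printed = true;
		}
	}
}
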
>
> One open I plan to test before the final rev is with shared engines like CCS and RCS, where I want to
> trigger cascading hangs + resets in quick succession just to see how the overall flow behaves.
>
> I will attach a guc-error-capture based gpu error dump output as per the review comment from Matthew
> on the last rev.
>
> ..alan
>
>> Regards,
>>
>> Tvrtko
>>
>>> (Recap: First, guc notifies capture event, second, guc notifies context reset during
>>> which we trigger i915_gpu_coredump. In this second step, the vma's are dumped and we
>>> verify that the guc capture happened but don't parse the guc-err-capture-logs yet.
>>> The third step is when the user triggers the debugfs dump, which is when we parse the error
>>> capture logs.)
>>>
>>> As a fix, what we can do in the guc_error_capture report out is to ensure that
>>> we don't re-print the previously dumped vmas if we end up finding multiple
>>> guc-error-capture dumps, since i915_gpu_coredump would have only captured the vma's
>>> for the very first context that was reset. And with guc-submission, that would always
>>> correlate to the "next-yet-to-be-parsed" guc-err-capture dump (since the guc-error-capture
>>> logs are large enough to hold data for multiple dumps).
>>>
>>> The changes (removal of the below hunk and adding of only-print-the-first-vma) are trivial,
>>> but I felt it warranted a good explanation. Apologies for the inbox noise.
>>>
>>> ...alan
>>>
>>> On Tue, 2021-12-07 at 22:32 -0800, Alan Previn Teres Alexis wrote:
>>>> Thanks again for the detailed review here.
>>>> Will fix all the rest on the next rev.
>>>> One special response for this one:
>>>>
>>>>
>>>> On Tue, 2021-12-07 at 16:22 -0800, Matthew Brost wrote:
>>>>> On Mon, Nov 22, 2021 at 03:04:02PM -0800, Alan Previn wrote:
>>>>>> +	if (datatype == GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE) {
>>>>>> +		GCAP_PRINT_GUC_INST_INFO(i915, ebuf, data);
>>>>>> +		eng_inst = FIELD_GET(GUC_CAPTURE_DATAHDR_SRC_INSTANCE, data.info);
>>>>>> +		eng = guc_lookup_engine(guc, engineclass, eng_inst);
>>>>>> +		if (eng) {
>>>>>> +			GCAP_PRINT_INTEL_ENG_INFO(i915, ebuf, eng);
>>>>>> +		} else {
>>>>>> +			PRINT(&i915->drm, ebuf, " i915-Eng-Lookup Fail!\n");
>>>>>> +		}
>>>>>> +		ce = guc_context_lookup(guc, data.guc_ctx_id);
>>>>>
>>>>> You are going to need to reference count the 'ce' here. See
>>>>> intel_guc_context_reset_process_msg for an example.
>>>>>
>>>>
>>>> Oh crap - I missed this one - which you had explicitly mentioned offline when I was doing the
>>>> development. Sorry about that, I just totally missed it from my todo-notes.
>>>>
>>>> ...alan
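
For reference, a minimal sketch of the reference counting Matthew asks for in the hunk quoted above: take a reference on the looked-up intel_context before printing from it and drop it afterwards, along the lines of intel_guc_context_reset_process_msg(). This is only an illustration of the shape of the fix, not the actual change; in particular, the lookup and get may also need to happen under whatever locking protects the guc context table, as the reset-notification handler does.

	ce = guc_context_lookup(guc, data.guc_ctx_id);
	if (ce) {
		/*
		 * Hold a reference so the context cannot go away while its
		 * details are printed into ebuf, mirroring what
		 * intel_guc_context_reset_process_msg() does for its lookup.
		 */
		intel_context_get(ce);

		/* ... print ce details into ebuf here ... */

		intel_context_put(ce);
	}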