From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [Intel-gfx] [PATCH] drm/i915/guc: Log engine resets
Date: Thu, 23 Dec 2021 10:23:01 +0000
From: Tvrtko Ursulin
Organization: Intel Corporation UK Plc
To: John Harrison, Matthew Brost
Cc: Intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org

On 22/12/2021 21:58, John Harrison wrote:
> On 12/22/2021 08:21, Tvrtko Ursulin wrote:
>> On 21/12/2021 22:14, John Harrison wrote:
>>> On 12/21/2021 05:37, Tvrtko Ursulin wrote:
>>>> On 20/12/2021 18:34, John Harrison wrote:
>>>>> On 12/20/2021 07:00, Tvrtko Ursulin wrote:
>>>>>> On 17/12/2021 16:22, Matthew Brost wrote:
>>>>>>> On Fri, Dec 17, 2021 at 12:15:53PM +0000, Tvrtko Ursulin wrote:
>>>>>>>>
>>>>>>>> On 14/12/2021 15:07, Tvrtko Ursulin wrote:
>>>>>>>>> From: Tvrtko Ursulin
>>>>>>>>>
>>>>>>>>> Log engine resets done by the GuC firmware in a similar way to
>>>>>>>>> how it is done by the execlists backend.
>>>>>>>>>
>>>>>>>>> This way we have a notion of where the hangs are before the GuC
>>>>>>>>> gains support for proper error capture.
>>>>>>>>
>>>>>>>> Ping - any interest in logging this info?
>>>>>>>>
>>>>>>>> All we currently get is a non-descriptive "[drm] GPU HANG: ecode
>>>>>>>> 12:0:00000000".
>>>>>>>>
>>>>>>>
>>>>>>> Yeah, this could be helpful. One suggestion below.
>>>>>>>
>>>>>>>> Also, will GuC be reporting the reason for the engine reset at
>>>>>>>> any point?
>>>>>>>>
>>>>>>>
>>>>>>> We are working on the error state capture; presumably the
>>>>>>> registers will give a clue what caused the hang.
>>>>>>>
>>>>>>> As for the GuC providing a reason, that isn't defined in the
>>>>>>> interface, but it is a decent idea to provide a hint in G2H what
>>>>>>> the issue was. Let me run that by the i915 GuC developers / GuC
>>>>>>> firmware team and see what they think.
>>>>>>>
>>>>> The GuC does not do any hang analysis. So as far as GuC is
>>>>> concerned, the reason is pretty much always going to be pre-emption
>>>>> timeout. There are a few ways the pre-emption itself could be
>>>>> triggered but basically, if GuC resets an active context then it is
>>>>> because it did not pre-empt quickly enough when requested.
>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Tvrtko
>>>>>>>>
>>>>>>>>> Signed-off-by: Tvrtko Ursulin
>>>>>>>>> Cc: Matthew Brost
>>>>>>>>> Cc: John Harrison
>>>>>>>>> ---
>>>>>>>>>    drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 12 +++++++++++-
>>>>>>>>>    1 file changed, 11 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>>>> index 97311119da6f..51512123dc1a 100644
>>>>>>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>>>> @@ -11,6 +11,7 @@
>>>>>>>>>    #include "gt/intel_context.h"
>>>>>>>>>    #include "gt/intel_engine_pm.h"
>>>>>>>>>    #include "gt/intel_engine_heartbeat.h"
>>>>>>>>> +#include "gt/intel_engine_user.h"
>>>>>>>>>    #include "gt/intel_gpu_commands.h"
>>>>>>>>>    #include "gt/intel_gt.h"
>>>>>>>>>    #include "gt/intel_gt_clock_utils.h"
>>>>>>>>> @@ -3934,9 +3935,18 @@ static void capture_error_state(struct intel_guc *guc,
>>>>>>>>>    {
>>>>>>>>>        struct intel_gt *gt = guc_to_gt(guc);
>>>>>>>>>        struct drm_i915_private *i915 = gt->i915;
>>>>>>>>> -    struct intel_engine_cs *engine = __context_to_physical_engine(ce);
>>>>>>>>> +    struct intel_engine_cs *engine = ce->engine;
>>>>>>>>>        intel_wakeref_t wakeref;
>>>>>>>>>
>>>>>>>>> +    if (intel_engine_is_virtual(engine)) {
>>>>>>>>> +        drm_notice(&i915->drm, "%s class, engines 0x%x; GuC engine reset\n",
>>>>>>>>> +                   intel_engine_class_repr(engine->class),
>>>>>>>>> +                   engine->mask);
>>>>>>>>> +        engine = guc_virtual_get_sibling(engine, 0);
>>>>>>>>> +    } else {
>>>>>>>>> +        drm_notice(&i915->drm, "%s GuC engine reset\n", engine->name);
>>>>>>>
>>>>>>> Probably include the guc_id of the context too then?
>>>>>>
>>>>>> Is the guc id stable and useful on its own - who would be the user?
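
(If we do add the guc_id, I imagine something along the lines of the
below - an untested sketch on top of this patch, assuming the context's
GuC id is reachable as ce->guc_id.id at this point:

	if (intel_engine_is_virtual(engine)) {
		/* Virtual engine: we only know the class and sibling mask. */
		drm_notice(&i915->drm,
			   "%s class, engines 0x%x; GuC engine reset (guc_id %u)\n",
			   intel_engine_class_repr(engine->class),
			   engine->mask, ce->guc_id.id);
		engine = guc_virtual_get_sibling(engine, 0);
	} else {
		/* Physical engine: the name identifies it exactly. */
		drm_notice(&i915->drm, "%s GuC engine reset (guc_id %u)\n",
			   engine->name, ce->guc_id.id);
	}

That would give kernel developers an id to correlate against the GuC
log without changing what end users see.)
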
>>>>> The GuC id is the only thing that matters when trying to correlate
>>>>> KMD activity with a GuC log. So while it might not be of any use or
>>>>> interest to an end user, it is extremely important and useful to a
>>>>> kernel developer attempting to debug an issue. And that includes
>>>>> bug reports from end users that are hard to repro, given that the
>>>>> standard error capture will include the GuC log.
>>>>
>>>> On the topic of the GuC log - is there a tool in IGT (or will there
>>>> be) which will parse the bit saved in the error capture, or how is
>>>> that supposed to be used?
>>> Nope.
>>>
>>> However, Alan is currently working on supporting the GuC error
>>> capture mechanism. Prior to sending the reset notification to the
>>> KMD, the GuC will save a whole bunch of register state to a memory
>>> buffer and send a notification to the KMD that this is available.
>>> When we then get the actual reset notification, we need to match the
>>> two together and include a parsed, human readable version of the
>>> GuC's capture state buffer in the sysfs error log output.
>>>
>>> The GuC log should not be involved in this process. And note that any
>>> register dumps in the GuC log are limited in scope and only enabled
>>> at higher verbosity levels. Whereas the official state capture is
>>> based on a register list provided by the KMD and is available
>>> irrespective of debug CONFIG settings, verbosity levels, etc.
>>
>> Hm, why should the GuC log not be involved now? I thought earlier you
>> said:
>>
>> """
>> And that includes bug reports from end users that are hard to repro,
>> given that the standard error capture will include the GuC log.
>> """
>>
>> Hence I thought there would be a tool in IGT which would parse the
>> part saved inside the error capture.
> Different things.
>
> The GuC log is not involved in capturing hardware register state and
> reporting that as part of the sysfs error capture that users can read
> out. The GuC needs to do the state capture for us if it is doing the
> reset, but it is provided via a dedicated state capture API. There
> should be no requirement to set GuC log sizes/verbosity levels, or to
> decode and parse the GuC log, just to get the engine register state at
> the time of the hang.
>
> On the other hand, the GuC log is useful for debugging certain issues
> and it is included automatically in the sysfs error capture along with
> all the other hardware and software state that we save.
>
> There is intended to be a publicly released tool to decode the GuC log
> into a human readable format. So end users will be able to see what
> engine scheduling decisions were taken prior to the hang, for example.
> Unfortunately, that is not yet ready for release for a number of
> reasons. I don't know whether that would be released as part of the
> IGT suite or in some other manner.

Okay, it would be handy if it were part of IGT, maybe even with
intel_error_decode being able to use it to show everything interesting
to kernel developers in one go. Or at least the log parsing tool being
able to consume a raw error capture.

>>>>> Also, note that GuC really resets contexts rather than engines.
>>>>> What it reports back to i915 on a reset is simply the GuC id of the
>>>>> context. It is up to i915 to work back from that to determine
>>>>> engine instances/classes if required. And in the case of a virtual
>>>>> context, it is impossible to extract the actual instance number. So
>>>>> your above print about resetting all instances within the virtual
>>>>> engine mask is incorrect/misleading. The reset would have been
>>>>> applied to one and only one of those engines. If you really need to
>>>>> know exactly which engine was poked, you need to look inside the
>>>>> GuC log.
>>>>
>>>> I think I understood that part. :) It wasn't my intent to imply in
>>>> the message that multiple engines have been reset, but, in the case
>>>> of veng, to log the class and mask and the fact there was an engine
>>>> reset (singular). A clearer message can probably be written.
>>>>
>>>>> However, the follow-up point is to ask why you need to report the
>>>>> exact class/instance? The end user doesn't care about which
>>>>> specific engine got reset. They only care that their context was
>>>>> reset. Even a KMD developer doesn't really care unless the concern
>>>>> is about a hardware bug rather than a software bug.
>>>>
>>>> I was simply aligning both backends to log as similar information
>>>> as possible. The information is there, it is just not logged.
>>>>
>>>> Concerning the wider topic, my thinking is that the end user is
>>>> mainly interested in knowing whether any engine resets are happening
>>>> (to tie in with the experience of UI/video glitching or whatever).
>>>> Going for deeper analysis than that is probably beyond the scope of
>>>> the kernel log and is really error capture territory.
>>> I would still say that the important information is which context
>>> was killed, not which engine. Sure, knowing the engine is better
>>> than nothing, but if we can report something more useful then why
>>> not?
>>
>> Make it so. :)
>>
>>>>> My view is that the current message is indeed woefully
>>>>> uninformative. However, it is more important to be reporting
>>>>> context identification than engine instances. So sure, add the
>>>>> engine instance description but also add something specific to the
>>>>> ce as well. Ideally (for me) the GuC id, and maybe something else
>>>>> that uniquely identifies the context in KMD land for when not
>>>>> using GuC?
>>>>
>>>> Not sure we need to go that far at this level, but even if we do it
>>>> could be a follow-up to add new data to both backends. I am not sure
>>>> yet that I care enough to drive this. My patch was simply a reaction
>>>> to noticing there is zero information currently logged, while
>>>> debugging some DG2 hangs.
>>> In terms of just reporting that a reset occurred, we already have
>>> the 'GPU HANG: ecode 12:1:fbffffff, in testfw_app [8177]' message.
>>> The ecode is a somewhat bizarre value but it does act as a
>>> 'something went wrong, your system is not happy' type message. Going
>>> beyond that, I think context identification is the next most useful
>>> thing to add.
>>>
>>> But if you aren't even getting the 'GPU HANG' message then it sounds
>>> like something is broken with what we already have. So we should fix
>>> that as a first priority. If that message isn't appearing then it
>>> means there was no error capture, so adding extra info to the
>>> capture won't help!
>>
>> The issue I have is that the "GPU HANG ecode" messages are always
>> "all zeros". I thought that was because GuC error capture was not
>> there, but maybe it's something else.
> Hmm. All zeros including the platform and engine class part or just
> the instdone part?

Class and instdone - all we were seeing was "[drm] GPU HANG: ecode
12:0:00000000".
Even on the CI run for this patch I see in the logs:

<5>[  157.243472] i915 0000:00:02.0: [drm] rcs0 GuC engine reset
<6>[  157.254568] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:0:00000000

So there seem to be circumstances where the GPU HANG line somehow fails
to figure out the engine class.

> The instdone value is engine register state and will have been cleared
> before i915 can read it if the reset was handled by GuC. That comes
> under the heading of state we need the new error capture API for.
> However, the context should be correctly identified, as should the
> platform/engine class.
>
> Currently, the recommended w/a is to set i915.reset=1 when debugging a
> hang issue. That will disable GuC based resets and fall back to old
> school i915 based (but full GT, not engine) resets. And that means
> that i915 will be able to read the engine state prior to the reset
> happening and thus produce a valid error capture / GPU HANG message.

Ah.. that's the piece of the puzzle I was missing. Perhaps it should
even be the default until error capture works.

Regards,

Tvrtko
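
P.S. For anyone else chasing hangs with GuC submission in the meantime,
the workaround John describes is just the usual module parameter, so
either on the kernel command line:

	i915.reset=1

or persistently via modprobe configuration (file name illustrative):

	# /etc/modprobe.d/i915-debug.conf
	options i915 reset=1

followed by a reboot, so that resets are handled by i915 (full GT
reset) and the error capture contains valid engine register state.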