Subject: Re: [Intel-gfx] [PATCH v6 3/9]
 drm/i915/gt: Increase suspend timeout
To: Tvrtko Ursulin, intel-gfx@lists.freedesktop.org,
 dri-devel@lists.freedesktop.org
Cc: maarten.lankhorst@linux.intel.com, matthew.auld@intel.com,
 Matthew Brost, John Harrison
References: <20210922062527.865433-1-thomas.hellstrom@linux.intel.com>
 <20210922062527.865433-4-thomas.hellstrom@linux.intel.com>
 <0f1050c9-b9fe-b587-2aac-cceae4032638@linux.intel.com>
 <061617be-9bf4-7853-a34d-7501f6b3179f@linux.intel.com>
From: Thomas Hellström
Message-ID: <199e2c25-8133-360e-4b85-18485522c2be@linux.intel.com>
Date: Thu, 23 Sep 2021 15:19:37 +0200
In-Reply-To: <061617be-9bf4-7853-a34d-7501f6b3179f@linux.intel.com>

On 9/23/21 2:59 PM, Tvrtko Ursulin wrote:
>
> On 23/09/2021 12:47, Thomas Hellström wrote:
>> Hi, Tvrtko,
>>
>> On 9/23/21 12:13 PM, Tvrtko Ursulin wrote:
>>>
>>> On 22/09/2021 07:25, Thomas Hellström wrote:
>>>> With GuC submission on DG1, the execution of the requests times out
>>>> for the gem_exec_suspend igt test case after executing around 800-900
>>>> of 1000 submitted requests.
>>>>
>>>> Given the time we allow elsewhere for fences to signal (in the order
>>>> of seconds), increase the timeout before we mark the gt wedged and
>>>> proceed.
>>>
>>> I suspect it is not about requests not retiring in time but about
>>> the intel_guc_wait_for_idle part of intel_gt_wait_for_idle. Although
>>> I don't know which G2H message is the code waiting for at suspend
>>> time so perhaps something to run past the GuC experts.
>>
>> So what's happening here is that the tests submits 1000 requests,
>> each writing a value to an object, and then that object content is
>> checked after resume. With GuC it turns out that only 800-900 or so
>> values are actually written before we time out, and the test
>> (basic-S3) fails, but not on every run.
>
> Yes and that did not make sense to me. It is a single context even so
> I did not come up with an explanation why would GuC be slower.
>
> Unless it somehow manages to not even update the ring tail in time and
> requests are still only stuck in the software queue? Perhaps you can
> see that from context tail and head when it happens.
>
>> This is a bit interesting in itself, because I never saw the hang-S3
>> test fail, which from what I can tell basically is an identical test
>> but with a spinner submitted after the 1000th request. Could be that
>> the suspend backup code ends up waiting for something before we end
>> up in intel_gt_wait_for_idle, giving more requests time to execute.
>
> No idea, I don't know the suspend paths that well. For instance before
> looking at the code I thought we would preempt what's executing and
> not wait for everything that has been submitted to finish. :)
>
>>> Anyway, if that turns out to be correct then perhaps it would be
>>> better to split the two timeouts (like if required GuC timeout is
>>> perhaps fundamentally independent) so it's clear who needs how much
>>> time. Adding Matt and John to comment.
>>
>> You mean we have separate timeouts depending on whether we're using
>> GuC or execlists submission?
>
> No, I don't know yet. First I think we need to figure out what exactly
> is happening.

Well then TBH I will need to file a separate Jira about that.
There might be various things going on here, like switching between the
migrate context for eviction of unrelated LMEM buffers and the context
used by gem_exec_suspend.

The gem_exec_suspend failures are blocking DG1 BAT, so it's pretty urgent
to get this series merged. If you insist I can leave this patch out for
now, but I'd rather commit it as is and file a Jira instead.

/Thomas