From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 0FACAC4332F for ; Thu, 3 Nov 2022 09:38:33 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 9D7ED10E5C7; Thu, 3 Nov 2022 09:38:29 +0000 (UTC) Received: from mga17.intel.com (mga17.intel.com [192.55.52.151]) by gabe.freedesktop.org (Postfix) with ESMTPS id 761EF10E5C7; Thu, 3 Nov 2022 09:38:23 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1667468303; x=1699004303; h=message-id:date:mime-version:subject:from:to:cc: references:in-reply-to:content-transfer-encoding; bh=7VGgWd6x608+WUHZSAoKi7Zyp85tQeycWq+CFUuL7nM=; b=NDdWQNEvIIucbp128Yf31+x10MA0r2zh5gkVusj1GxAy5af0ack7GD8D 4OVRlprYrrrH2atx6c6EyDxq2xGHsEgQHL+3SlD3WQKIpcYX/cIz2ruzn xnj0yzV4hCtb4qGOm3A4tt59b3BU+MiIvvKTUJ1fsPAGVwpwhrUPLu545 lCpw0ttlBl2t0acrkHXHe3uME37O24SzuiAKZNwUiSWotPy8kq3PEE/cA I34SbGFOdfpDWsNQozpk/l6HkCJPvVgBzqdUtFLCJ8VxSbJRgrjm54/LT sYANJDMXsv0yJ+YabsfyGcjnn5tpKql2obn13ny4fOo9b1LBnqDpgLnZ7 Q==; X-IronPort-AV: E=McAfee;i="6500,9779,10519"; a="290024684" X-IronPort-AV: E=Sophos;i="5.95,235,1661842800"; d="scan'208";a="290024684" Received: from fmsmga004.fm.intel.com ([10.253.24.48]) by fmsmga107.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Nov 2022 02:38:23 -0700 X-IronPort-AV: E=McAfee;i="6500,9779,10519"; a="703606605" X-IronPort-AV: E=Sophos;i="5.95,235,1661842800"; d="scan'208";a="703606605" Received: from khogan-mobl1.ger.corp.intel.com (HELO [10.213.226.17]) ([10.213.226.17]) by fmsmga004-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Nov 2022 02:38:16 -0700 Message-ID: <2b2eb780-08f7-c7df-0397-a7f732da272d@linux.intel.com> Date: Thu, 3 Nov 2022 09:38:14 +0000 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.3.3 Subject: Re: [Intel-gfx] [PATCH] drm/i915: Don't wait forever in drop_caches Content-Language: en-US From: Tvrtko Ursulin To: John Harrison , Jani Nikula , Intel-GFX@Lists.FreeDesktop.Org References: <20221101235053.1650364-1-John.C.Harrison@Intel.com> <87k04d7dyn.fsf@intel.com> <5e22de43-d75c-fc21-9ae7-f27d116c5688@intel.com> <1855f0f2-8a1c-7bf4-76c0-76a4354ea8e8@linux.intel.com> Organization: Intel Corporation UK Plc In-Reply-To: <1855f0f2-8a1c-7bf4-76c0-76a4354ea8e8@linux.intel.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: DRI-Devel@Lists.FreeDesktop.Org Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" On 03/11/2022 09:18, Tvrtko Ursulin wrote: > > On 03/11/2022 01:33, John Harrison wrote: >> On 11/2/2022 07:20, Tvrtko Ursulin wrote: >>> On 02/11/2022 12:12, Jani Nikula wrote: >>>> On Tue, 01 Nov 2022, John.C.Harrison@Intel.com wrote: >>>>> From: John Harrison >>>>> >>>>> At the end of each test, IGT does a drop caches call via sysfs with >>>> >>>> sysfs? >> Sorry, that was meant to say debugfs. I've also been working on some >> sysfs IGT issues and evidently got my wires crossed! >> >>>> >>>>> special flags set. One of the possible paths waits for idle with an >>>>> infinite timeout. That causes problems for debugging issues when CI >>>>> catches a "can't go idle" test failure. Best case, the CI system times >>>>> out (after 90s), attempts a bunch of state dump actions and then >>>>> reboots the system to recover it. Worst case, the CI system can't do >>>>> anything at all and then times out (after 1000s) and simply reboots. >>>>> Sometimes a serial port log of dmesg might be available, sometimes >>>>> not. >>>>> >>>>> So rather than making life hard for ourselves, change the timeout to >>>>> be 10s rather than infinite. Also, trigger the standard >>>>> wedge/reset/recover sequence so that testing can continue with a >>>>> working system (if possible). >>>>> >>>>> Signed-off-by: John Harrison >>>>> --- >>>>>   drivers/gpu/drm/i915/i915_debugfs.c | 7 ++++++- >>>>>   1 file changed, 6 insertions(+), 1 deletion(-) >>>>> >>>>> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c >>>>> b/drivers/gpu/drm/i915/i915_debugfs.c >>>>> index ae987e92251dd..9d916fbbfc27c 100644 >>>>> --- a/drivers/gpu/drm/i915/i915_debugfs.c >>>>> +++ b/drivers/gpu/drm/i915/i915_debugfs.c >>>>> @@ -641,6 +641,9 @@ DEFINE_SIMPLE_ATTRIBUTE(i915_perf_noa_delay_fops, >>>>>             DROP_RESET_ACTIVE | \ >>>>>             DROP_RESET_SEQNO | \ >>>>>             DROP_RCU) >>>>> + >>>>> +#define DROP_IDLE_TIMEOUT    (HZ * 10) >>>> >>>> I915_IDLE_ENGINES_TIMEOUT is defined in i915_drv.h. It's also only used >>>> here. >>> >>> So move here, dropping i915 prefix, next to the newly proposed one? >> Sure, can do that. >> >>> >>>> I915_GEM_IDLE_TIMEOUT is defined in i915_gem.h. It's only used in >>>> gt/intel_gt.c. >>> >>> Move there and rename to GT_IDLE_TIMEOUT? >>> >>>> I915_GT_SUSPEND_IDLE_TIMEOUT is defined and used only in intel_gt_pm.c. >>> >>> No action needed, maybe drop i915 prefix if wanted. >>> >> These two are totally unrelated and in code not being touched by this >> change. I would rather not conflate changing random other things with >> fixing this specific issue. >> >>>> I915_IDLE_ENGINES_TIMEOUT is in ms, the rest are in jiffies. >>> >>> Add _MS suffix if wanted. >>> >>>> My head spins. >>> >>> I follow and raise that the newly proposed DROP_IDLE_TIMEOUT applies >>> to DROP_ACTIVE and not only DROP_IDLE. >> My original intention for the name was that is the 'drop caches >> timeout for intel_gt_wait_for_idle'. Which is quite the mouthful and >> hence abbreviated to DROP_IDLE_TIMEOUT. But yes, I realised later that >> name can be conflated with the DROP_IDLE flag. Will rename. >> >> >>> >>> Things get refactored, code moves around, bits get left behind, who >>> knows. No reason to get too worked up. :) As long as people are >>> taking a wider view when touching the code base, and are not afraid >>> to send cleanups, things should be good. >> On the other hand, if every patch gets blocked in code review because >> someone points out some completely unrelated piece of code could be a >> bit better then nothing ever gets fixed. If you spot something that >> you think should be improved, isn't the general idea that you should >> post a patch yourself to improve it? > > There's two maintainers per branch and an order of magnitude or two more > developers so it'd be nice if cleanups would just be incoming on > self-initiative basis. ;) > >>> For the actual functional change at hand - it would be nice if code >>> paths in question could handle SIGINT and then we could punt the >>> decision on how long someone wants to wait purely to userspace. But >>> it's probably hard and it's only debugfs so whatever. >>> >> The code paths in question will already abort on a signal won't they? >> Both intel_gt_wait_for_idle() and intel_guc_wait_for_pending_msg(), >> which is where the uc_wait_for_idle eventually ends up, have an >> 'if(signal_pending) return -EINTR;' check. Beyond that, it sounds like >> what you are asking for is a change in the IGT libraries and/or CI >> framework to start sending signals after some specific timeout. That >> seems like a significantly more complex change (in terms of the number >> of entities affected and number of groups involved) and unnecessary. > > If you say so, I haven't looked at them all. But if the code path in > question already aborts on signals then I am not sure what is the patch > fixing? I assumed you are trying to avoid the write stuck in D forever, > which then prevents driver unload and everything, requiring the test > runner to eventually reboot. If you say SIGINT works then you can > already recover from userspace, no? > >>> Whether or not 10s is enough CI will hopefully tell us. I'd probably >>> err on the side of safety and make it longer, but at most half from >>> the test runner timeout. >> This is supposed to be test clean up. This is not about how long a >> particular test takes to complete but about how long it takes to >> declare the system broken after the test has already finished. I would >> argue that even 10s is massively longer than required. >> >>> >>> I am not convinced that wedging is correct though. Conceptually could >>> be just that the timeout is too short. What does wedging really give >>> us, on top of limiting the wait, when latter AFAIU is the key factor >>> which would prevent the need to reboot the machine? >>> >> It gives us a system that knows what state it is in. If we can't idle >> the GT then something is very badly wrong. Wedging indicates that. It >> also ensure that a full GT reset will be attempted before the next >> test is run. Helping to prevent a failure on test X from propagating >> into failures of unrelated tests X+1, X+2, ... And if the GT reset >> does not work because the system is really that badly broken then >> future tests will not run rather than report erroneous failures. >> >> This is not about getting a more stable system for end users by >> sweeping issues under the carpet and pretending all is well. End users >> don't run IGTs or explicitly call dodgy debugfs entry points. The sole >> motivation here is to get more accurate results from CI. That is, >> correctly identifying which test has hit a problem, getting valid >> debug analysis for that test (logs and such) and allowing further >> testing to complete correctly in the case where the system can be >> recovered. > > I don't really oppose shortening of the timeout in principle, just want > a clear statement if this is something IGT / test runner could already > do or not. It can apply a timeout, it can also send SIGINT, and it could > even trigger a reset from outside. Sure it is debugfs hacks so general > "kernel should not implement policy" need not be strictly followed, but > lets have it clear what are the options. One conceptual problem with applying this policy is that the code is: if (val & (DROP_IDLE | DROP_ACTIVE)) { ret = intel_gt_wait_for_idle(gt, MAX_SCHEDULE_TIMEOUT); if (ret) return ret; } if (val & DROP_IDLE) { ret = intel_gt_pm_wait_for_idle(gt); if (ret) return ret; } So if someone passes in DROP_IDLE and then why would only the first branch have a short timeout and wedge. Yeah some bug happens to be there at the moment, but put a bug in a different place and you hang on the second branch and then need another patch. Versus perhaps making it all respect SIGINT and handle from outside. Regards, Tvrtko