From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 26C86C433EF for ; Tue, 7 Jun 2022 22:29:36 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id C989910E0A6; Tue, 7 Jun 2022 22:29:34 +0000 (UTC) Received: from mga11.intel.com (mga11.intel.com [192.55.52.93]) by gabe.freedesktop.org (Postfix) with ESMTPS id 8422E10E0A6; Tue, 7 Jun 2022 22:29:33 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1654640973; x=1686176973; h=date:message-id:from:to:cc:subject:in-reply-to: references:mime-version; bh=Wjr0Wz6iwc88HCtMZg+CGs0mPmkrXQ3vQ1GaSI4Hhr0=; b=D226VTi7Mr5242geykNwL5Z7t1bn4BFKBiCPTMuZ57GPr4LebHxYRzhw 1qlt9Hr2IhNXrNq4U2Ov+tS90GJdQTkl03VaWfFVBBBsmLgvOtNqBBRW3 lYSvoNet2VMRIaFVPKWIFfZETBqr1PUE8jA2vgShyIcBP2g+0hqPh4F+g KU/4JDx3v8+3MXtRTQrn3CHSj3uIpohEV6ypIBgaOASYoPzHsyREvNhka njKzd3MbeJhFYBG1qiRn7g21oscmMTD2Fkgx1rFZ256ykxWQhEo4vv0O4 q3WkKQ/2xMPgRPWDBWkWAEg+mHZ1K/zx8IdP56vBRQOp94xfSjtCFp2I3 Q==; X-IronPort-AV: E=McAfee;i="6400,9594,10371"; a="274291280" X-IronPort-AV: E=Sophos;i="5.91,284,1647327600"; d="scan'208";a="274291280" Received: from fmsmga004.fm.intel.com ([10.253.24.48]) by fmsmga102.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 07 Jun 2022 15:29:32 -0700 X-IronPort-AV: E=Sophos;i="5.91,284,1647327600"; d="scan'208";a="648274960" Received: from adixit-mobl1.amr.corp.intel.com (HELO adixit-arch.intel.com) ([10.212.186.67]) by fmsmga004-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 07 Jun 2022 15:29:32 -0700 Date: Tue, 07 Jun 2022 15:29:32 -0700 Message-ID: 
<87y1y8jeer.wl-ashutosh.dixit@intel.com> From: "Dixit, Ashutosh" To: Vinay Belgaumkar Subject: Re: [PATCH] drm/i915/guc/slpc: Use non-blocking H2G for waitboost In-Reply-To: <20220515060506.22084-1-vinay.belgaumkar@intel.com> References: <20220515060506.22084-1-vinay.belgaumkar@intel.com> User-Agent: Wanderlust/2.15.9 (Almost Unreal) SEMI-EPG/1.14.7 (Harue) FLIM-LB/1.14.9 (=?ISO-8859-4?Q?Goj=F2?=) APEL-LB/10.8 EasyPG/1.0.0 Emacs/28.1 (x86_64-pc-linux-gnu) MULE/6.0 (HANACHIRUSATO) MIME-Version: 1.0 (generated by SEMI-EPG 1.14.7 - "Harue") Content-Type: text/plain; charset=US-ASCII X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: intel-gfx@lists.freedesktop.org, Daniele Ceraolo Spurio , John Harrison , dri-devel@lists.freedesktop.org, Michal Wajdeczko Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel"

On Sat, 14 May 2022 23:05:06 -0700, Vinay Belgaumkar wrote:
>
> SLPC min/max frequency updates require H2G calls. We are seeing
> timeouts when GuC channel is backed up and it is unable to respond
> in a timely fashion causing warnings and affecting CI.
>
> This is seen when waitboosting happens during a stress test.
> this patch updates the waitboost path to use a non-blocking
> H2G call instead, which returns as soon as the message is
> successfully transmitted.

Overall I think this patch is trying to paper over problems in the
blocking H2G CT interface (specifically the 1 second timeout in
wait_for_ct_request_update()). So I think we should address that problem
in the interface directly rather than having each client (SLPC and any
future client) work around the problem. Following points:

1. This patch seems to assume that it is 'ok' to ignore the return code
   from FW for a waitboost request (arguing waitboost is best effort so
   it's ok to 'fire and forget'). But the return code is still useful
   e.g. in cases where we see performance issues and want to go back and
   investigate if FW rejected any waitboost requests.

2. We are already seeing that a 1 second timeout is not sufficient. So
   why not simply increase that timeout?

3. In fact if we are saying that the CT interface is a "reliable"
   interface (implying no message loss), to ensure reliability that
   timeout should not simply be increased, it should be made "infinite"
   (in quotes).

4. Maybe it would have been best to not have a "blocking" H2G interface
   at all (with the wait in wait_for_ct_request_update()). Just have an
   asynchronous interface (which mirrors the actual interface between FW
   and i915) in which clients register callbacks which are invoked when
   FW responds. If this is too big a change we can probably continue
   with the current blocking interface after increasing the timeout as
   mentioned above.

5. Finally, the waitboost request is just the most likely to get stuck
   at the back of a full CT queue since it happens during normal
   operation. Actually any request, say one initiated from sysfs, can
   also get similarly stuck at the back of a full queue. So any solution
   should also address that situation (where the return code is needed
   and similarly for a future client of the "blocking"
   (REQUEST/RESPONSE) interface).

Thanks.
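To make point 4 a little more concrete, here is a rough userspace sketch of what a callback-based request/response interface could look like. This is not i915 code: ct_channel, ct_send_async(), ct_handle_response(), boost_done() and boost_stats are all hypothetical names made up for illustration, and locking, real message payloads and fence wraparound are left out. The only point is that the sender returns immediately while the FW status still reaches the client, so a rejected waitboost can be counted rather than lost.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: none of these names exist in i915. */
#define MAX_PENDING 8

/* Completion callback, invoked with the status the firmware returned. */
typedef void (*h2g_done_fn)(uint32_t fence, int status, void *data);

struct pending_req {
	int in_use;
	uint32_t fence;
	h2g_done_fn done;
	void *data;
};

struct ct_channel {
	struct pending_req pending[MAX_PENDING];
	uint32_t next_fence;
};

/*
 * Record a request and return at once, without waiting for a response.
 * Returns the fence assigned to the request, or -1 if no slot is free.
 */
static int64_t ct_send_async(struct ct_channel *ct, h2g_done_fn done,
			     void *data)
{
	assert(done != NULL);

	for (size_t i = 0; i < MAX_PENDING; i++) {
		struct pending_req *req = &ct->pending[i];

		if (!req->in_use) {
			req->in_use = 1;
			req->fence = ct->next_fence++;
			req->done = done;
			req->data = data;
			return req->fence;
		}
	}
	return -1;
}

/*
 * Called when a response arrives: find the matching request, release
 * its slot and hand the status to the client's callback.
 * Returns 0 if a request matched the fence, -1 otherwise.
 */
static int ct_handle_response(struct ct_channel *ct, uint32_t fence,
			      int status)
{
	for (size_t i = 0; i < MAX_PENDING; i++) {
		struct pending_req *req = &ct->pending[i];

		if (req->in_use && req->fence == fence) {
			req->in_use = 0;
			req->done(fence, status, req->data);
			return 0;
		}
	}
	return -1;
}

/* Example client: a waitboost-style caller that only counts outcomes. */
struct boost_stats {
	unsigned int ok;
	unsigned int rejected;
};

static void boost_done(uint32_t fence, int status, void *data)
{
	struct boost_stats *stats = data;

	(void)fence;
	if (status == 0)
		stats->ok++;
	else
		stats->rejected++;
}
```

A caller would do ct_send_async(&ct, boost_done, &stats) and carry on; whenever the response handler later runs ct_handle_response() for that fence, the counters are updated. A blocking wrapper for clients such as sysfs could presumably be layered on top by waiting on a completion that the callback signals.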
-- Ashutosh