From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.3 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,NICE_REPLY_A,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED, USER_AGENT_SANE_1 autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0969AC2B9F8 for ; Tue, 25 May 2021 14:15:27 +0000 (UTC) Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 9D743613F1 for ; Tue, 25 May 2021 14:15:26 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 9D743613F1 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=intel.com Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=dri-devel-bounces@lists.freedesktop.org Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 2B6496E125; Tue, 25 May 2021 14:15:26 +0000 (UTC) Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by gabe.freedesktop.org (Postfix) with ESMTPS id EACEB6E125; Tue, 25 May 2021 14:15:24 +0000 (UTC) IronPort-SDR: rXXVzFAjErKM60FjXXOzZsU+p6OO/TVPMio3VSM/BJyIwnbFTYfNTFlsLVWMVM53ux0O/2mox2 icOFSYmyRaFQ== X-IronPort-AV: E=McAfee;i="6200,9189,9995"; a="202239478" X-IronPort-AV: E=Sophos;i="5.82,328,1613462400"; d="scan'208";a="202239478" Received: from fmsmga008.fm.intel.com ([10.253.24.58]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 25 May 2021 07:15:21 -0700 IronPort-SDR: EQtN2udL+0JprowicJKcJGL7eQi1DN7PcbV2R95UHgLiTWa912jRsWNqaP1zzLp0lg4oi4H3gO LQGDJnwB3PNg== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.82,328,1613462400"; d="scan'208";a="443412867" Received: from irvmail001.ir.intel.com ([10.43.11.63]) by fmsmga008.fm.intel.com with ESMTP; 25 May 2021 07:15:17 -0700 Received: from [10.249.148.145] (mwajdecz-MOBL.ger.corp.intel.com [10.249.148.145]) by irvmail001.ir.intel.com (8.14.3/8.13.6/MailSET/Hub) with ESMTP id 14PEFF6Z013901; Tue, 25 May 2021 15:15:16 +0100 Subject: Re: [RFC PATCH 37/97] drm/i915/guc: Add stall timer to non blocking CTB send function To: Matthew Brost References: <20210506191451.77768-1-matthew.brost@intel.com> <20210506191451.77768-38-matthew.brost@intel.com> <7e7ca64d-e5d7-7fd8-b446-7167341c0c92@intel.com> <20210524183518.GA12293@sdutt-i7> From: Michal Wajdeczko Message-ID: <2b0430a4-d43e-ad52-890c-340631febd6c@intel.com> Date: Tue, 25 May 2021 16:15:15 +0200 User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Thunderbird/78.10.2 MIME-Version: 1.0 In-Reply-To: <20210524183518.GA12293@sdutt-i7> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: tvrtko.ursulin@intel.com, intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org, jason.ekstrand@intel.com, daniele.ceraolospurio@intel.com, jon.bloomfield@intel.com, daniel.vetter@intel.com, john.c.harrison@intel.com Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" On 24.05.2021 20:35, Matthew Brost wrote: > On Mon, May 24, 2021 at 02:58:12PM +0200, Michal Wajdeczko wrote: >> >> >> On 06.05.2021 21:13, Matthew Brost wrote: >>> Implement a stall timer which fails H2G CTBs once a period of time >>> with no forward progress is reached to prevent deadlock. >>> >>> Also update to ct_write to return -EDEADLK rather than -EPIPE on a >>> corrupted descriptor. >> >> broken descriptor is really separate issue compared to no progress from >> GuC side, I would really like to keep old error code >> > > I know you do as you have brought it up several times. Again to the rest > of the stack these two things mean the exact same thing. but I guess 'the rest of the stack' is only interested if returned error is EBUSY or not, as all other errors are treated in the same way, thus no need change existing error codes > >> note that broken CTB descriptor is unrecoverable error, while on other >> hand, in theory, we could recover from temporary non-moving CTB >> > > Yea but we don't, in both cases we disable submission which in turn > triggers a full GPU reset. is this current limitation or long term design? I would rather expect that decision to trigger full GPU is done on solid foundations, like definite lost communication with the GuC or missed heartbeats, not that we temporarily push CTB to the limit or if do we want to treat CTB processing as kind of hw health checkout too, what if heartbeat timeout and CTB stall time do not match ? > >>> >>> Signed-off-by: John Harrison >>> Signed-off-by: Daniele Ceraolo Spurio >>> Signed-off-by: Matthew Brost >>> --- >>> drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c | 48 +++++++++++++++++++++-- >>> 1 file changed, 45 insertions(+), 3 deletions(-) >>> >>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c >>> index af7314d45a78..4eab319d61be 100644 >>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c >>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c >>> @@ -69,6 +69,8 @@ static inline struct drm_device *ct_to_drm(struct intel_guc_ct *ct) >>> #define CTB_H2G_BUFFER_SIZE (SZ_4K) >>> #define CTB_G2H_BUFFER_SIZE (SZ_4K) >>> >>> +#define MAX_US_STALL_CTB 1000000 >> >> nit: maybe we should make it a CONFIG value ? >> > > Sure. > >>> + >>> struct ct_request { >>> struct list_head link; >>> u32 fence; >>> @@ -315,6 +317,7 @@ int intel_guc_ct_enable(struct intel_guc_ct *ct) >>> >>> ct->requests.last_fence = 1; >>> ct->enabled = true; >>> + ct->stall_time = KTIME_MAX; >>> >>> return 0; >>> >>> @@ -378,7 +381,7 @@ static int ct_write(struct intel_guc_ct *ct, >>> unsigned int i; >>> >>> if (unlikely(ctb->broken)) >>> - return -EPIPE; >>> + return -EDEADLK; >>> >>> if (unlikely(desc->status)) >>> goto corrupted; >>> @@ -449,7 +452,7 @@ static int ct_write(struct intel_guc_ct *ct, >>> CT_ERROR(ct, "Corrupted descriptor head=%u tail=%u status=%#x\n", >>> desc->head, desc->tail, desc->status); >>> ctb->broken = true; >>> - return -EPIPE; >>> + return -EDEADLK; >>> } >>> >>> /** >>> @@ -494,6 +497,17 @@ static int wait_for_ct_request_update(struct ct_request *req, u32 *status) >>> return err; >>> } >>> >>> +static inline bool ct_deadlocked(struct intel_guc_ct *ct) >>> +{ >>> + bool ret = ktime_us_delta(ktime_get(), ct->stall_time) > >>> + MAX_US_STALL_CTB; >>> + >>> + if (unlikely(ret)) >>> + CT_ERROR(ct, "CT deadlocked\n"); >>> + >>> + return ret; >>> +} >>> + >>> static inline bool ctb_has_room(struct intel_guc_ct_buffer *ctb, u32 len_dw) >>> { >>> struct guc_ct_buffer_desc *desc = ctb->desc; >>> @@ -505,6 +519,26 @@ static inline bool ctb_has_room(struct intel_guc_ct_buffer *ctb, u32 len_dw) >>> return space >= len_dw; >>> } >>> >>> +static int has_room_nb(struct intel_guc_ct *ct, u32 len_dw) >>> +{ >>> + struct intel_guc_ct_buffer *ctb = &ct->ctbs.send; >>> + >>> + lockdep_assert_held(&ct->ctbs.send.lock); >>> + >>> + if (unlikely(!ctb_has_room(ctb, len_dw))) { >>> + if (ct->stall_time == KTIME_MAX) >>> + ct->stall_time = ktime_get(); >>> + >>> + if (unlikely(ct_deadlocked(ct))) >>> + return -EDEADLK; >>> + else >>> + return -EBUSY; >>> + } >>> + >>> + ct->stall_time = KTIME_MAX; >>> + return 0; >>> +} >>> + >>> static int ct_send_nb(struct intel_guc_ct *ct, >>> const u32 *action, >>> u32 len, >>> @@ -517,7 +551,7 @@ static int ct_send_nb(struct intel_guc_ct *ct, >>> >>> spin_lock_irqsave(&ctb->lock, spin_flags); >>> >>> - ret = ctb_has_room(ctb, len + 1); >>> + ret = has_room_nb(ct, len + 1); >>> if (unlikely(ret)) >>> goto out; >>> >>> @@ -561,11 +595,19 @@ static int ct_send(struct intel_guc_ct *ct, >>> retry: >>> spin_lock_irqsave(&ct->ctbs.send.lock, flags); >>> if (unlikely(!ctb_has_room(ctb, len + 1))) { >>> + if (ct->stall_time == KTIME_MAX) >>> + ct->stall_time = ktime_get(); >>> spin_unlock_irqrestore(&ct->ctbs.send.lock, flags); >>> + >>> + if (unlikely(ct_deadlocked(ct))) >>> + return -EDEADLK; >>> + >> >> likely, instead of duplicating code, you can reuse has_room_nb here >> > > In this patch yes, in the following patch no as this check changes > between non-blockig and blocking once we introduce G2H credits. I'd > rather just leave it as is than churning on the patches. > > Matt > >>> cond_resched(); >>> goto retry; >>> } >>> >>> + ct->stall_time = KTIME_MAX; >>> + >>> fence = ct_get_next_fence(ct); >>> request.fence = fence; >>> request.status = 0; >>>