From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-12.1 required=3.0 tests=BAYES_00, DKIM_ADSP_CUSTOM_MED,DKIM_INVALID,DKIM_SIGNED,FREEMAIL_FORGED_FROMDOMAIN, FREEMAIL_FROM,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER, INCLUDES_PATCH,MAILING_LIST_MULTI,NICE_REPLY_A,SPF_HELO_NONE,SPF_PASS, URIBL_BLOCKED,USER_AGENT_SANE_1 autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9766BC433ED for ; Thu, 13 May 2021 08:00:37 +0000 (UTC) Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 3914A61439 for ; Thu, 13 May 2021 08:00:37 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 3914A61439 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=gmail.com Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=amd-gfx-bounces@lists.freedesktop.org Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id C45AB6E856; Thu, 13 May 2021 08:00:36 +0000 (UTC) Received: from mail-ej1-x62c.google.com (mail-ej1-x62c.google.com [IPv6:2a00:1450:4864:20::62c]) by gabe.freedesktop.org (Postfix) with ESMTPS id C0B106E856 for ; Thu, 13 May 2021 08:00:35 +0000 (UTC) Received: by mail-ej1-x62c.google.com with SMTP id n2so38657526ejy.7 for ; Thu, 13 May 2021 01:00:35 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=subject:to:cc:references:from:message-id:date:user-agent :mime-version:in-reply-to:content-transfer-encoding:content-language; bh=hdsKgEnKATes6zLBVznKXPaRNmnucaEto+lM/WzMoDA=; b=A//eZWkjPCnwQoUYtpT5y7i6WB4Cgko1zpWP0w5Mt9Mmu+bOK+04jGQPkSI2IRgQtr 6YYs9UPVQyNoolJ9XO0M8ABIu9eqZSJ+si3ZMGCvvtFA8uVi1e0D+bohUp5XZLd9iY2N 3nmAX+36NsSyQntPc4/xcGacVTmrYFOYaQhYT6iA9wL0+QgM0Q3zrTRhv74Sk2kVwgl5 0s+FJKQQOeTmM0zMAlrmEZL8tCryjnD5TfQkoOeB1ars7jb8pDxh5dDPSNLF+w8HjpVZ DGsbXNEbykADrKqFkGyboFOENcRjkYVodTrGHX50vUqSdj6EytDN/eDx2LSSnuffJI8W NJAw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:to:cc:references:from:message-id:date :user-agent:mime-version:in-reply-to:content-transfer-encoding :content-language; bh=hdsKgEnKATes6zLBVznKXPaRNmnucaEto+lM/WzMoDA=; b=qki914RQ2CQZ8+eLnDj6YHfRRAa7DVr7oxpgjeW14luJV6/jG31kBU2DQdd2z7xSA6 8+HyArH1O9tgCEvxhC6Cxz0WzrC2fJ0YpbFd8mg0hIfzKAS7vtY1ZkK9q6mXJQ16jtHR jaqstP6Spi1rPbfAQG6qbu1CCpmzXIy+2D+qfjAa9AvvhIdC+YDEkDfxI3kwj02RNuT3 GPgE7hgLwiEBp3aJoi9WWTipGBSsi+ZoiabbxZGtrT1AS7eH1f1/3bexBsWOEWgoOebS 2f4S8fQNaVOqiO9rChuuDncy0qcq3gL1oPZ5R+pQ1Skf3PZtUxw6ioe4uJeQRpDEAi3I MjFQ== X-Gm-Message-State: AOAM533mZ/Bo+yDT4Y51o+nUBjerSHhQIux0L2k7z2UXRMBMIb81bK5K pcLzoJBmRp77QUW0UhWNuuA80hgVjKs= X-Google-Smtp-Source: ABdhPJw0S6Kk6saayA1pQXlZ/uh0G3aOl+xtSG+rKb4+3isbgzmX8qauPfpRJI2XFltgIHeqdZ5sUQ== X-Received: by 2002:a17:906:b2c1:: with SMTP id cf1mr32637191ejb.544.1620892834407; Thu, 13 May 2021 01:00:34 -0700 (PDT) Received: from ?IPv6:2a02:908:1252:fb60:1e3f:7b0a:2eb9:c6dc? ([2a02:908:1252:fb60:1e3f:7b0a:2eb9:c6dc]) by smtp.gmail.com with ESMTPSA id f30sm1431631ejl.62.2021.05.13.01.00.33 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Thu, 13 May 2021 01:00:34 -0700 (PDT) Subject: Re: [PATCH 2/2] drm/amdgpu: Poll of RAS errors asynchronously To: Luben Tuikov , amd-gfx@lists.freedesktop.org References: <20210512170302.64951-1-luben.tuikov@amd.com> <20210512170302.64951-2-luben.tuikov@amd.com> From: =?UTF-8?Q?Christian_K=c3=b6nig?= Message-ID: <83e2151b-02aa-a055-2d37-4f74ea025f3d@gmail.com> Date: Thu, 13 May 2021 10:00:33 +0200 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.8.1 MIME-Version: 1.0 In-Reply-To: <20210512170302.64951-2-luben.tuikov@amd.com> Content-Language: en-US X-BeenThere: amd-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Discussion list for AMD gfx List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Alexander Deucher , John Clements , Hawking Zhang Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset="us-ascii"; Format="flowed" Errors-To: amd-gfx-bounces@lists.freedesktop.org Sender: "amd-gfx" Am 12.05.21 um 19:03 schrieb Luben Tuikov: > When using Vega 20 with RAS support and RAS is > enabled, the system interactivity is extremely > slow, to the point of being unusable. After > debugging, it was determined that this is due to > the polling loop performed for > AMDGPU_CTX_OP_QUERY_STATE2 under > amdgpu_ctx_ioctl(), which seems to be executed on > every ioctl from X/Mesa. > > The latter seems to call amdgpu_ctx_query2() which > calls amdgpu_ras_query_error_count() twice, once > for each of ue_count and ce_count. This is > unnecessarily wasteful since > amdgpu_ras_query_error_count() calculates both, > but with the current interface it returns one or > the other, depending on its Boolean input, when it > can in fact return both, in a single invocation, > if it had a better interface. > > Further down, the query_ras_error_count() callback > is called, which could be quite a large polling > loop, and very time consuming. For instance, > gfx_v9_0_query_ras_error_count() is at least > O(n^3). A similar situation is seen with > umc_v6_1_query_ras_error_count(). > > This commit implements asynchronous RAS error > count polling to that of the ioctl. A kernel > thread polls the RAS error counters once in a > while. The ioctl reads the values > asynchronously. The poll frequency is a module > parameter, with range [500, 10000] milliseconds, > default 3000. > > v2: Enable setting the polling interval to 0, > which disables the thread. Please drop the module parameter, we already have way to many module parameters and that doesn't add much value. Then I would really prefer to implement this as a delayed work item instead. If you call schedule_delayed_work() from the amdgpu_ctx_query2() the work item will only be started every X jiffies when the function is actually used. Regards, Christian. > > Cc: Alexander Deucher > Cc: John Clements > Cc: Hawking Zhang > Signed-off-by: Luben Tuikov > Reviewed-by: John Clements > --- > drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 + > drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 36 +++++++------- > drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 10 ++++ > drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 66 ++++++++++++++++++++++--- > drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 8 +-- > 5 files changed, 93 insertions(+), 28 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > index 3147c1c935c8..a269f778194f 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > @@ -197,6 +197,7 @@ extern struct amdgpu_mgpu_info mgpu_info; > extern int amdgpu_ras_enable; > extern uint amdgpu_ras_mask; > extern int amdgpu_bad_page_threshold; > +extern uint amdgpu_ras_thread_poll_ms; > extern struct amdgpu_watchdog_timer amdgpu_watchdog_timer; > extern int amdgpu_async_gfx_ring; > extern int amdgpu_mcbp; > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c > index d481a33f4eaf..558e887e99b6 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c > @@ -332,12 +332,12 @@ static int amdgpu_ctx_query(struct amdgpu_device *adev, > } > > static int amdgpu_ctx_query2(struct amdgpu_device *adev, > - struct amdgpu_fpriv *fpriv, uint32_t id, > - union drm_amdgpu_ctx_out *out) > + struct amdgpu_fpriv *fpriv, uint32_t id, > + union drm_amdgpu_ctx_out *out) > { > + struct amdgpu_ras *con = amdgpu_ras_get_context(adev); > struct amdgpu_ctx *ctx; > struct amdgpu_ctx_mgr *mgr; > - unsigned long ras_counter; > > if (!fpriv) > return -EINVAL; > @@ -362,20 +362,22 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev, > if (atomic_read(&ctx->guilty)) > out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_GUILTY; > > - /*query ue count*/ > - /* ras_counter = amdgpu_ras_query_error_count(adev, false); */ > - /* /\*ras counter is monotonic increasing*\/ */ > - /* if (ras_counter != ctx->ras_counter_ue) { */ > - /* out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE; */ > - /* ctx->ras_counter_ue = ras_counter; */ > - /* } */ > - > - /* /\*query ce count*\/ */ > - /* ras_counter = amdgpu_ras_query_error_count(adev, true); */ > - /* if (ras_counter != ctx->ras_counter_ce) { */ > - /* out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE; */ > - /* ctx->ras_counter_ce = ras_counter; */ > - /* } */ > + if (con) { > + int ce_count, ue_count; > + > + ce_count = atomic_read(&con->ras_ce_count); > + ue_count = atomic_read(&con->ras_ue_count); > + > + if (ce_count != ctx->ras_counter_ce) { > + out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_CE; > + ctx->ras_counter_ce = ce_count; > + } > + > + if (ue_count != ctx->ras_counter_ue) { > + out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RAS_UE; > + ctx->ras_counter_ue = ue_count; > + } > + } > > mutex_unlock(&mgr->lock); > return 0; > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > index 877469d288f8..641c374b8525 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > @@ -183,6 +183,7 @@ struct amdgpu_mgpu_info mgpu_info = { > int amdgpu_ras_enable = -1; > uint amdgpu_ras_mask = 0xffffffff; > int amdgpu_bad_page_threshold = -1; > +uint amdgpu_ras_thread_poll_ms = 3000; > struct amdgpu_watchdog_timer amdgpu_watchdog_timer = { > .timeout_fatal_disable = false, > .period = 0x0, /* default to 0x0 (timeout disable) */ > @@ -534,6 +535,15 @@ module_param_named(emu_mode, amdgpu_emu_mode, int, 0444); > MODULE_PARM_DESC(ras_enable, "Enable RAS features on the GPU (0 = disable, 1 = enable, -1 = auto (default))"); > module_param_named(ras_enable, amdgpu_ras_enable, int, 0444); > > +/** > + * DOC: ras_thread_poll (uint) > + * Number of milliseconds between RAS poll for errors. > + * Valid range 0 and [500,10000]. Set to 0 to disable the thread. > + * Default: 3000. > + */ > +MODULE_PARM_DESC(ras_thread_poll, "Number of milliseconds between RAS poll for errors. Valid range 0 and [500,10000]. Set to 0 to disable the thread. Default: 3000."); > +module_param_named(ras_thread_poll, amdgpu_ras_thread_poll_ms, uint, 0444); > + > /** > * DOC: ras_mask (uint) > * Mask of RAS features to enable (default 0xffffffff), only valid when ras_enable == 1 > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > index b1c57a5b6e89..30bec289e9ad 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > @@ -1043,15 +1043,17 @@ int amdgpu_ras_error_inject(struct amdgpu_device *adev, > } > > /* get the total error counts on all IPs */ > -unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev, > - bool is_ce) > +static void amdgpu_ras_query_error_count(struct amdgpu_device *adev) > { > struct amdgpu_ras *con = amdgpu_ras_get_context(adev); > struct ras_manager *obj; > - struct ras_err_data data = {0, 0}; > + int ce_count, ue_count; > > if (!adev->ras_enabled || !con) > - return 0; > + return; > + > + ce_count = 0; > + ue_count = 0; > > list_for_each_entry(obj, &con->head, node) { > struct ras_query_if info = { > @@ -1059,13 +1061,14 @@ unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev, > }; > > if (amdgpu_ras_query_error_status(adev, &info)) > - return 0; > + return; > > - data.ce_count += info.ce_count; > - data.ue_count += info.ue_count; > + ce_count += info.ce_count; > + ue_count += info.ue_count; > } > > - return is_ce ? data.ce_count : data.ue_count; > + atomic_set(&con->ras_ce_count, ce_count); > + atomic_set(&con->ras_ue_count, ue_count); > } > /* query/inject/cure end */ > > @@ -2109,6 +2112,49 @@ static void amdgpu_ras_check_supported(struct amdgpu_device *adev) > adev->ras_hw_enabled & amdgpu_ras_mask; > } > > +static int amdgpu_ras_thread(void *data) > +{ > + struct amdgpu_device *adev = data; > + struct amdgpu_ras *con = amdgpu_ras_get_context(adev); > + > + con->ras_thread_poll_ms = amdgpu_ras_thread_poll_ms; > + if (con->ras_thread_poll_ms == 0) { > + atomic_set(&con->ras_ce_count, 0); > + atomic_set(&con->ras_ue_count, 0); > + return 0; > + } else if (con->ras_thread_poll_ms < 500 || > + con->ras_thread_poll_ms > 10000) { > + con->ras_thread_poll_ms = 3000; > + } > + > + while (1) { > + if (kthread_should_stop()) > + break; > + if (kthread_should_park()) > + kthread_parkme(); > + > + amdgpu_ras_query_error_count(adev); > + msleep_interruptible(con->ras_thread_poll_ms); > + } > + > + return 0; > +} > + > +static int amdgpu_ras_thread_start(struct amdgpu_device *adev) > +{ > + struct amdgpu_ras *con = amdgpu_ras_get_context(adev); > + struct task_struct *kt; > + > + kt = kthread_run(amdgpu_ras_thread, adev, > + "amdras:%s", pci_name(adev->pdev)); > + if (IS_ERR(kt)) > + return PTR_ERR(kt); > + > + con->ras_thread = kt; > + > + return 0; > +} > + > int amdgpu_ras_init(struct amdgpu_device *adev) > { > struct amdgpu_ras *con = amdgpu_ras_get_context(adev); > @@ -2182,6 +2228,10 @@ int amdgpu_ras_init(struct amdgpu_device *adev) > goto release_con; > } > > + r = amdgpu_ras_thread_start(adev); > + if (r) > + goto release_con; > + > dev_info(adev->dev, "RAS INFO: ras initialized successfully, " > "hardware ability[%x] ras_mask[%x]\n", > adev->ras_hw_enabled, adev->ras_enabled); > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h > index 201fbdee1d09..fb9e4c7ab054 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h > @@ -340,6 +340,11 @@ struct amdgpu_ras { > > /* disable ras error count harvest in recovery */ > bool disable_ras_err_cnt_harvest; > + > + struct task_struct *ras_thread; > + unsigned int ras_thread_poll_ms; > + atomic_t ras_ue_count; > + atomic_t ras_ce_count; > }; > > struct ras_fs_data { > @@ -485,9 +490,6 @@ int amdgpu_ras_request_reset_on_boot(struct amdgpu_device *adev, > void amdgpu_ras_resume(struct amdgpu_device *adev); > void amdgpu_ras_suspend(struct amdgpu_device *adev); > > -unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev, > - bool is_ce); > - > /* error handling functions */ > int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev, > struct eeprom_table_record *bps, int pages); _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx