From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
References: <20200918020110.2063155-1-sashal@kernel.org> <20200918020110.2063155-265-sashal@kernel.org>
From: Alex Deucher
Date: Fri, 18 Sep 2020 09:57:37 -0400
Subject: Re: [PATCH AUTOSEL 5.4 265/330] drm/amd/powerplay: try to do a graceful shutdown on SW CTF
To: "Quan, Evan"
Cc: Sasha Levin, "linux-kernel@vger.kernel.org", "stable@vger.kernel.org", "Deucher, Alexander", "dri-devel@lists.freedesktop.org"
Content-Type: text/plain; charset="UTF-8"
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Sep 18, 2020 at 3:17 AM Quan, Evan wrote:
>
> [AMD Official Use Only - Internal Distribution Only]
>
> Hi @Sasha Levin @Deucher, Alexander,
>
> The following changes need to be applied also.
> Otherwise, you may see unexpected shutdown on stress gpu loading on Vega10.
>
> drm/amd/pm: avoid false alarm due to confusing softwareshutdowntemp setting
> drm/amd/pm: correct the thermal alert temperature limit settings
> drm/amd/pm: correct Vega20 swctf limit setting
> drm/amd/pm: correct Vega12 swctf limit setting
> drm/amd/pm: correct Vega10 swctf limit setting

I would suggest we just drop this patch for kernels prior to 5.8
(where it was introduced).

Alex

>
> BR
> Evan
> -----Original Message-----
> From: Sasha Levin
> Sent: Friday, September 18, 2020 10:00 AM
> To: linux-kernel@vger.kernel.org; stable@vger.kernel.org
> Cc: Quan, Evan; Deucher, Alexander; Sasha Levin; dri-devel@lists.freedesktop.org
> Subject: [PATCH AUTOSEL 5.4 265/330] drm/amd/powerplay: try to do a graceful shutdown on SW CTF
>
> From: Evan Quan
>
> [ Upstream commit 9495220577416632675959caf122e968469ffd16 ]
>
> Normally this(SW CTF) should not happen. And by doing graceful shutdown we can prevent further damage.
>
> Signed-off-by: Evan Quan
> Reviewed-by: Alex Deucher
> Signed-off-by: Alex Deucher
> Signed-off-by: Sasha Levin
> ---
> .../gpu/drm/amd/powerplay/hwmgr/smu_helper.c | 21 +++++++++++++++----
> drivers/gpu/drm/amd/powerplay/smu_v11_0.c | 7 +++++++
> 2 files changed, 24 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/powerplay/hwmgr/smu_helper.c b/drivers/gpu/drm/amd/powerplay/hwmgr/smu_helper.c
> index d09690fca4520..414added3d02c 100644
> --- a/drivers/gpu/drm/amd/powerplay/hwmgr/smu_helper.c
> +++ b/drivers/gpu/drm/amd/powerplay/hwmgr/smu_helper.c
> @@ -22,6 +22,7 @@
> */
>
> #include
> +#include
>
> #include "hwmgr.h"
> #include "pp_debug.h"
> @@ -593,12 +594,18 @@ int phm_irq_process(struct amdgpu_device *adev,
> uint32_t src_id = entry->src_id;
>
> if (client_id == AMDGPU_IRQ_CLIENTID_LEGACY) {
> -if (src_id == VISLANDS30_IV_SRCID_CG_TSS_THERMAL_LOW_TO_HIGH)
> +if (src_id == VISLANDS30_IV_SRCID_CG_TSS_THERMAL_LOW_TO_HIGH) {
> pr_warn("GPU over temperature range detected on PCIe %d:%d.%d!\n",
> PCI_BUS_NUM(adev->pdev->devfn),
> PCI_SLOT(adev->pdev->devfn),
> PCI_FUNC(adev->pdev->devfn));
> -else if (src_id == VISLANDS30_IV_SRCID_CG_TSS_THERMAL_HIGH_TO_LOW)
> +/*
> + * SW CTF just occurred.
> + * Try to do a graceful shutdown to prevent further damage.
> + */
> +dev_emerg(adev->dev, "System is going to shutdown due to SW CTF!\n");
> +orderly_poweroff(true);
> +} else if (src_id == VISLANDS30_IV_SRCID_CG_TSS_THERMAL_HIGH_TO_LOW)
> pr_warn("GPU under temperature range detected on PCIe %d:%d.%d!\n",
> PCI_BUS_NUM(adev->pdev->devfn),
> PCI_SLOT(adev->pdev->devfn),
> @@ -609,12 +616,18 @@ int phm_irq_process(struct amdgpu_device *adev,
> PCI_SLOT(adev->pdev->devfn),
> PCI_FUNC(adev->pdev->devfn));
> } else if (client_id == SOC15_IH_CLIENTID_THM) {
> -if (src_id == 0)
> +if (src_id == 0) {
> pr_warn("GPU over temperature range detected on PCIe %d:%d.%d!\n",
> PCI_BUS_NUM(adev->pdev->devfn),
> PCI_SLOT(adev->pdev->devfn),
> PCI_FUNC(adev->pdev->devfn));
> -else
> +/*
> + * SW CTF just occurred.
> + * Try to do a graceful shutdown to prevent further damage.
> + */
> +dev_emerg(adev->dev, "System is going to shutdown due to SW CTF!\n");
> +orderly_poweroff(true);
> +} else
> pr_warn("GPU under temperature range detected on PCIe %d:%d.%d!\n",
> PCI_BUS_NUM(adev->pdev->devfn),
> PCI_SLOT(adev->pdev->devfn),
> diff --git a/drivers/gpu/drm/amd/powerplay/smu_v11_0.c b/drivers/gpu/drm/amd/powerplay/smu_v11_0.c
> index c4d8c52c6b9ca..6c4405622c9bb 100644
> --- a/drivers/gpu/drm/amd/powerplay/smu_v11_0.c
> +++ b/drivers/gpu/drm/amd/powerplay/smu_v11_0.c
> @@ -23,6 +23,7 @@
> #include
> #include
> #include
> +#include
>
> #include "pp_debug.h"
> #include "amdgpu.h"
> @@ -1538,6 +1539,12 @@ static int smu_v11_0_irq_process(struct amdgpu_device *adev,
> PCI_BUS_NUM(adev->pdev->devfn),
> PCI_SLOT(adev->pdev->devfn),
> PCI_FUNC(adev->pdev->devfn));
> +/*
> + * SW CTF just occurred.
> + * Try to do a graceful shutdown to prevent further damage.
> + */
> +dev_emerg(adev->dev, "System is going to shutdown due to SW CTF!\n");
> +orderly_poweroff(true);
> break;
> case THM_11_0__SRCID__THM_DIG_THERM_H2L:
> pr_warn("GPU under temperature range detected on PCIe %d:%d.%d!\n",
> --
> 2.25.1
>
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel