From: "Ma, Le" <Le.Ma-5C7GfCeVMHo@public.gmane.org>
To: "Zhang, Hawking" <Hawking.Zhang-5C7GfCeVMHo@public.gmane.org>,
	"amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org"
	<amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>
Cc: "Deucher,
	Alexander" <Alexander.Deucher-5C7GfCeVMHo@public.gmane.org>,
	"Zhou1, Tao" <Tao.Zhou1-5C7GfCeVMHo@public.gmane.org>,
	"Li, Dennis" <Dennis.Li-5C7GfCeVMHo@public.gmane.org>,
	"Chen, Guchun" <Guchun.Chen-5C7GfCeVMHo@public.gmane.org>
Subject: RE: [PATCH 06/10] drm/amdgpu: add condition to enable baco for xgmi/ras case
Date: Wed, 27 Nov 2019 12:35:57 +0000	[thread overview]
Message-ID: <MN2PR12MB4285E37DF7D44270D5CAD0E5F6440@MN2PR12MB4285.namprd12.prod.outlook.com> (raw)
In-Reply-To: <DM5PR12MB141825CB772FEEF1FD013EDBFC440-2J9CzHegvk81aAVlcVN8UQdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>



I agree that we should drop the amdgpu_ras_enable=2 condition. My only concern is that, besides fatal_error, BACO on XGMI may produce another failure: an atombios_init timeout (I am not sure whether PSP mode1 reset causes this as well).



Assuming there is no amdgpu_ras_enable=2 check, my understanding of the use cases when PMFW > 40.52 is:

  1.  sGPU without RAS:
     *   new: baco
     *   old: baco
  2.  sGPU with RAS:
     *   new: baco
     *   old: psp mode1 chain reset and legacy fatal_error handling
  3.  XGMI with RAS:
     *   new: baco
     *   old: psp mode1 chain reset and legacy fatal_error handling
  4.  XGMI without RAS:
     *   new: baco
     *   old: psp mode1 chain reset



That is to say, all use cases take the BACO path when PMFW > 40.52.
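
The old/new comparison above can be written down as two small lookup functions. This is only an illustrative sketch, not driver code: the enum and function names are hypothetical, and the "new" column holds only under the stated assumptions (PMFW > 40.52 and no amdgpu_ras_enable=2 check).

```c
#include <stdbool.h>

/* Hypothetical labels for the reset paths named in the list above. */
enum reset_path {
	RESET_BACO,
	RESET_PSP_MODE1_CHAIN,
	RESET_PSP_MODE1_CHAIN_AND_LEGACY_FATAL_ERROR,
};

/* "old" column: behaviour before the change, per configuration. */
static enum reset_path old_reset_path(bool xgmi, bool ras)
{
	if (ras)
		return RESET_PSP_MODE1_CHAIN_AND_LEGACY_FATAL_ERROR;
	return xgmi ? RESET_PSP_MODE1_CHAIN : RESET_BACO;
}

/*
 * "new" column: valid only when PMFW > 40.52 and the
 * amdgpu_ras_enable=2 check is dropped. Every configuration
 * ends up on BACO, so the arguments do not influence the result.
 */
static enum reset_path new_reset_path(bool xgmi, bool ras)
{
	(void)xgmi;
	(void)ras;
	return RESET_BACO;
}
```

Only the sGPU-without-RAS row is unchanged between the two columns; the other three rows move from a PSP mode1 based path to BACO.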



Regards,

Ma Le



-----Original Message-----
From: Zhang, Hawking <Hawking.Zhang-5C7GfCeVMHo@public.gmane.org>
Sent: Wednesday, November 27, 2019 7:28 PM
To: Ma, Le <Le.Ma-5C7GfCeVMHo@public.gmane.org>; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
Cc: Chen, Guchun <Guchun.Chen-5C7GfCeVMHo@public.gmane.org>; Zhou1, Tao <Tao.Zhou1-5C7GfCeVMHo@public.gmane.org>; Li, Dennis <Dennis.Li-5C7GfCeVMHo@public.gmane.org>; Deucher, Alexander <Alexander.Deucher-5C7GfCeVMHo@public.gmane.org>; Ma, Le <Le.Ma-5C7GfCeVMHo@public.gmane.org>
Subject: RE: [PATCH 06/10] drm/amdgpu: add condition to enable baco for xgmi/ras case



[AMD Public Use]



After thinking about it a bit, I think we can rely on the PMFW version alone to decide between RAS recovery and legacy fatal_error handling on platforms that support RAS. Using amdgpu_ras_enable as a temporary solution does not seem necessary. Even if BACO RAS recovery is not stable, the result is the same as with legacy fatal_error handling: the user has to reboot the node manually.



So the new soc reset use cases are:

  * XGMI (without RAS): use PSP mode1 based chain reset
  * RAS enabled (with PMFW 40.52 and onwards): use BACO based RAS recovery
  * RAS enabled (with PMFW prior to 40.52): use legacy fatal_error handling
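
A minimal sketch of the three cases above as one decision function. The enum, macro, and function names are hypothetical. The sketch assumes that PMFW 40.52 corresponds to the constant 0x283400 used in the patch, and it takes "40.52 and onwards" literally as >=, whereas the patch as posted only allows BACO for versions strictly above 0x283400. The sGPU-without-RAS case is not listed above, so it is left unspecified here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical labels for the three cases in the list above. */
enum soc_reset_case {
	SOC_RESET_PSP_MODE1_CHAIN,
	SOC_RESET_BACO_RAS_RECOVERY,
	SOC_RESET_LEGACY_FATAL_ERROR,
	SOC_RESET_UNSPECIFIED,	/* case not covered by the list */
};

/* Assumption: PMFW 40.52 is the 0x283400 constant from the patch. */
#define PMFW_40_52 0x283400u

static enum soc_reset_case pick_soc_reset(bool xgmi, bool ras_enabled,
					  uint32_t pmfw_version)
{
	/* RAS-enabled platforms are decided by the PMFW version alone. */
	if (ras_enabled)
		return pmfw_version >= PMFW_40_52 ?
			SOC_RESET_BACO_RAS_RECOVERY :
			SOC_RESET_LEGACY_FATAL_ERROR;

	/* XGMI without RAS keeps the PSP mode1 based chain reset. */
	if (xgmi)
		return SOC_RESET_PSP_MODE1_CHAIN;

	/* sGPU without RAS is not mentioned in the list above. */
	return SOC_RESET_UNSPECIFIED;
}
```

Note that no module parameter appears in the function: the point of the proposal is that amdgpu_ras_enable is not needed to pick the path.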

Anything else?



Regards,

Hawking

-----Original Message-----
From: Le Ma <le.ma-5C7GfCeVMHo@public.gmane.org>
Sent: November 27, 2019 17:15
To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
Cc: Zhang, Hawking <Hawking.Zhang-5C7GfCeVMHo@public.gmane.org>; Chen, Guchun <Guchun.Chen-5C7GfCeVMHo@public.gmane.org>; Zhou1, Tao <Tao.Zhou1-5C7GfCeVMHo@public.gmane.org>; Li, Dennis <Dennis.Li-5C7GfCeVMHo@public.gmane.org>; Deucher, Alexander <Alexander.Deucher-5C7GfCeVMHo@public.gmane.org>; Ma, Le <Le.Ma-5C7GfCeVMHo@public.gmane.org>
Subject: [PATCH 06/10] drm/amdgpu: add condition to enable baco for xgmi/ras case



Avoid changing the default reset behavior for production cards by checking that amdgpu_ras_enable equals 2. Also, only sufficiently new SMU ucode supports BACO for the XGMI/RAS case.



Change-Id: I07c3e6862be03e068745c73db8ea71f428ecba6b
Signed-off-by: Le Ma <le.ma-5C7GfCeVMHo@public.gmane.org>
---
drivers/gpu/drm/amd/amdgpu/soc15.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
index 951327f..6202333 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
@@ -577,7 +577,9 @@ soc15_asic_reset_method(struct amdgpu_device *adev)
                                   struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev, 0);
                                   struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
-                                   if (hive || (ras && ras->supported))
+                                  if ((hive || (ras && ras->supported)) &&
+                                      (amdgpu_ras_enable != 2 ||
+                                      adev->pm.fw_version <= 0x283400))
                                               baco_reset = false;
                       }
                       break;
--
2.7.4
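
For reference, the condition added by the hunk above can be isolated into a standalone predicate. This is a sketch only: the function and macro names are hypothetical, and plain booleans stand in for the `hive` and `ras->supported` checks. The two decode helpers assume a byte-wise version layout (major in bits 23:16, minor in bits 15:8). That layout is not stated in the patch, but it is consistent with 0x283400 reading as 40.52.

```c
#include <stdbool.h>
#include <stdint.h>

/* Constant taken from the hunk above; BACO needs a strictly newer PMFW. */
#define PMFW_BACO_THRESHOLD 0x283400u

/*
 * Returns true when the patched condition would set baco_reset = false.
 * in_hive and ras_supported stand in for "hive" and "ras && ras->supported".
 */
static bool baco_disabled_by_patch(bool in_hive, bool ras_supported,
				   int ras_enable, uint32_t fw_version)
{
	return (in_hive || ras_supported) &&
	       (ras_enable != 2 || fw_version <= PMFW_BACO_THRESHOLD);
}

/* Assumed layout: major version in bits 23:16. */
static unsigned int pmfw_major(uint32_t fw_version)
{
	return (fw_version >> 16) & 0xffu;
}

/* Assumed layout: minor version in bits 15:8. */
static unsigned int pmfw_minor(uint32_t fw_version)
{
	return (fw_version >> 8) & 0xffu;
}
```

With this condition, an XGMI or RAS-capable device keeps BACO only when amdgpu_ras_enable is 2 and the firmware version is above the threshold. Devices that are neither in a hive nor RAS-capable are not affected by the check.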


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
