All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] drm/amdgpu: disable cstate properly for driver reloading scenario
@ 2023-03-02  6:58 Evan Quan
  2023-03-02  8:27 ` Lazar, Lijo
  0 siblings, 1 reply; 9+ messages in thread
From: Evan Quan @ 2023-03-02  6:58 UTC (permalink / raw)
  To: amd-gfx; +Cc: alexander.deucher, Evan Quan

Gpu reset might be needed during driver reloading. To guard that(gpu
reset) work, df cstate needs to be disabled properly.

Signed-off-by: Evan Quan <evan.quan@amd.com>
Change-Id: I5c074c265c0b08a67b6934ae1ad9aa3fed245461
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 51bbeaa1f311..3c854461ef32 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2816,6 +2816,15 @@ static int amdgpu_device_ip_fini_early(struct amdgpu_device *adev)
 	amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
 	amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
 
+	/*
+	 * Get df cstate disabled properly on driver unloading.
+	 * Since on the succeeding driver reloading, gpu reset might
+	 * be required. And cstate disabled is a prerequisite for
+	 * that(gpu reset).
+	 */
+	if (amdgpu_dpm_set_df_cstate(adev, DF_CSTATE_DISALLOW))
+		dev_warn(adev->dev, "Failed to disallow df cstate");
+
 	amdgpu_amdkfd_suspend(adev, false);
 
 	/* Workaroud for ASICs need to disable SMC first */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] drm/amdgpu: disable cstate properly for driver reloading scenario
  2023-03-02  6:58 [PATCH] drm/amdgpu: disable cstate properly for driver reloading scenario Evan Quan
@ 2023-03-02  8:27 ` Lazar, Lijo
  2023-03-02  9:13   ` Quan, Evan
  0 siblings, 1 reply; 9+ messages in thread
From: Lazar, Lijo @ 2023-03-02  8:27 UTC (permalink / raw)
  To: Evan Quan, amd-gfx; +Cc: alexander.deucher



On 3/2/2023 12:28 PM, Evan Quan wrote:
> Gpu reset might be needed during driver reloading. To guard that(gpu
> reset) work, df cstate needs to be disabled properly.
> 
> Signed-off-by: Evan Quan <evan.quan@amd.com>
> Change-Id: I5c074c265c0b08a67b6934ae1ad9aa3fed245461
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++++
>   1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 51bbeaa1f311..3c854461ef32 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2816,6 +2816,15 @@ static int amdgpu_device_ip_fini_early(struct amdgpu_device *adev)
>   	amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
>   	amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
>   
> +	/*
> +	 * Get df cstate disabled properly on driver unloading.
> +	 * Since on the succeeding driver reloading, gpu reset might
> +	 * be required. And cstate disabled is a prerequisite for
> +	 * that(gpu reset).
> +	 */
> +	if (amdgpu_dpm_set_df_cstate(adev, DF_CSTATE_DISALLOW))
> +		dev_warn(adev->dev, "Failed to disallow df cstate");
> +

This looks more like a firmware bug. Driver sends the Unload message to 
FW. In that case FW should disable all features including C-state.

Thanks,
Lijo

>   	amdgpu_amdkfd_suspend(adev, false);
>   
>   	/* Workaroud for ASICs need to disable SMC first */

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [PATCH] drm/amdgpu: disable cstate properly for driver reloading scenario
  2023-03-02  8:27 ` Lazar, Lijo
@ 2023-03-02  9:13   ` Quan, Evan
  2023-03-02  9:20     ` Lazar, Lijo
  0 siblings, 1 reply; 9+ messages in thread
From: Quan, Evan @ 2023-03-02  9:13 UTC (permalink / raw)
  To: Lazar, Lijo, amd-gfx; +Cc: Deucher, Alexander

[AMD Official Use Only - General]



> -----Original Message-----
> From: Lazar, Lijo <Lijo.Lazar@amd.com>
> Sent: Thursday, March 2, 2023 4:28 PM
> To: Quan, Evan <Evan.Quan@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>
> Subject: Re: [PATCH] drm/amdgpu: disable cstate properly for driver
> reloading scenario
> 
> 
> 
> On 3/2/2023 12:28 PM, Evan Quan wrote:
> > Gpu reset might be needed during driver reloading. To guard that(gpu
> > reset) work, df cstate needs to be disabled properly.
> >
> > Signed-off-by: Evan Quan <evan.quan@amd.com>
> > Change-Id: I5c074c265c0b08a67b6934ae1ad9aa3fed245461
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++++
> >   1 file changed, 9 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 51bbeaa1f311..3c854461ef32 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -2816,6 +2816,15 @@ static int amdgpu_device_ip_fini_early(struct
> amdgpu_device *adev)
> >   	amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
> >   	amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
> >
> > +	/*
> > +	 * Get df cstate disabled properly on driver unloading.
> > +	 * Since on the succeeding driver reloading, gpu reset might
> > +	 * be required. And cstate disabled is a prerequisite for
> > +	 * that(gpu reset).
> > +	 */
> > +	if (amdgpu_dpm_set_df_cstate(adev, DF_CSTATE_DISALLOW))
> > +		dev_warn(adev->dev, "Failed to disallow df cstate");
> > +
> 
> This looks more like a firmware bug. Driver sends the Unload message to FW.
> In that case FW should disable all features including C-state.
Driver does not send the Unload message. We want PMFM alive and ready for handling possible gpu reset on reloading.

BR
Evan
> 
> Thanks,
> Lijo
> 
> >   	amdgpu_amdkfd_suspend(adev, false);
> >
> >   	/* Workaroud for ASICs need to disable SMC first */

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] drm/amdgpu: disable cstate properly for driver reloading scenario
  2023-03-02  9:13   ` Quan, Evan
@ 2023-03-02  9:20     ` Lazar, Lijo
  2023-03-02  9:30       ` Quan, Evan
  0 siblings, 1 reply; 9+ messages in thread
From: Lazar, Lijo @ 2023-03-02  9:20 UTC (permalink / raw)
  To: Quan, Evan, amd-gfx; +Cc: Deucher, Alexander



On 3/2/2023 2:43 PM, Quan, Evan wrote:
> [AMD Official Use Only - General]
> 
> 
> 
>> -----Original Message-----
>> From: Lazar, Lijo <Lijo.Lazar@amd.com>
>> Sent: Thursday, March 2, 2023 4:28 PM
>> To: Quan, Evan <Evan.Quan@amd.com>; amd-gfx@lists.freedesktop.org
>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>
>> Subject: Re: [PATCH] drm/amdgpu: disable cstate properly for driver
>> reloading scenario
>>
>>
>>
>> On 3/2/2023 12:28 PM, Evan Quan wrote:
>>> Gpu reset might be needed during driver reloading. To guard that(gpu
>>> reset) work, df cstate needs to be disabled properly.
>>>
>>> Signed-off-by: Evan Quan <evan.quan@amd.com>
>>> Change-Id: I5c074c265c0b08a67b6934ae1ad9aa3fed245461
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++++
>>>    1 file changed, 9 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 51bbeaa1f311..3c854461ef32 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -2816,6 +2816,15 @@ static int amdgpu_device_ip_fini_early(struct
>> amdgpu_device *adev)
>>>    	amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
>>>    	amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
>>>
>>> +	/*
>>> +	 * Get df cstate disabled properly on driver unloading.
>>> +	 * Since on the succeeding driver reloading, gpu reset might
>>> +	 * be required. And cstate disabled is a prerequisite for
>>> +	 * that(gpu reset).
>>> +	 */
>>> +	if (amdgpu_dpm_set_df_cstate(adev, DF_CSTATE_DISALLOW))
>>> +		dev_warn(adev->dev, "Failed to disallow df cstate");
>>> +
>>
>> This looks more like a firmware bug. Driver sends the Unload message to FW.
>> In that case FW should disable all features including C-state.
> Driver does not send the Unload message. We want PMFM alive and ready for handling possible gpu reset on reloading.
> 

Actually, soc21_need_reset_on_init code itself has a bug. PSP won't get 
unloaded by default on ring destruction. Even if PSP stops, it could 
just keep the heartbeat value as non-zero (just that it won't increment).

Probably, that needs to be fixed first rather than keeping PMFW alive 
for a reset.

Thanks,
Lijo

> BR
> Evan
>>
>> Thanks,
>> Lijo
>>
>>>    	amdgpu_amdkfd_suspend(adev, false);
>>>
>>>    	/* Workaroud for ASICs need to disable SMC first */

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [PATCH] drm/amdgpu: disable cstate properly for driver reloading scenario
  2023-03-02  9:20     ` Lazar, Lijo
@ 2023-03-02  9:30       ` Quan, Evan
  2023-03-02 15:26         ` Deucher, Alexander
  0 siblings, 1 reply; 9+ messages in thread
From: Quan, Evan @ 2023-03-02  9:30 UTC (permalink / raw)
  To: Lazar, Lijo, amd-gfx, Deucher, Alexander

[AMD Official Use Only - General]



> -----Original Message-----
> From: Lazar, Lijo <Lijo.Lazar@amd.com>
> Sent: Thursday, March 2, 2023 5:21 PM
> To: Quan, Evan <Evan.Quan@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>
> Subject: Re: [PATCH] drm/amdgpu: disable cstate properly for driver
> reloading scenario
> 
> 
> 
> On 3/2/2023 2:43 PM, Quan, Evan wrote:
> > [AMD Official Use Only - General]
> >
> >
> >
> >> -----Original Message-----
> >> From: Lazar, Lijo <Lijo.Lazar@amd.com>
> >> Sent: Thursday, March 2, 2023 4:28 PM
> >> To: Quan, Evan <Evan.Quan@amd.com>; amd-gfx@lists.freedesktop.org
> >> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>
> >> Subject: Re: [PATCH] drm/amdgpu: disable cstate properly for driver
> >> reloading scenario
> >>
> >>
> >>
> >> On 3/2/2023 12:28 PM, Evan Quan wrote:
> >>> Gpu reset might be needed during driver reloading. To guard that(gpu
> >>> reset) work, df cstate needs to be disabled properly.
> >>>
> >>> Signed-off-by: Evan Quan <evan.quan@amd.com>
> >>> Change-Id: I5c074c265c0b08a67b6934ae1ad9aa3fed245461
> >>> ---
> >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++++
> >>>    1 file changed, 9 insertions(+)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>> index 51bbeaa1f311..3c854461ef32 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>> @@ -2816,6 +2816,15 @@ static int amdgpu_device_ip_fini_early(struct
> >> amdgpu_device *adev)
> >>>    	amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
> >>>    	amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
> >>>
> >>> +	/*
> >>> +	 * Get df cstate disabled properly on driver unloading.
> >>> +	 * Since on the succeeding driver reloading, gpu reset might
> >>> +	 * be required. And cstate disabled is a prerequisite for
> >>> +	 * that(gpu reset).
> >>> +	 */
> >>> +	if (amdgpu_dpm_set_df_cstate(adev, DF_CSTATE_DISALLOW))
> >>> +		dev_warn(adev->dev, "Failed to disallow df cstate");
> >>> +
> >>
> >> This looks more like a firmware bug. Driver sends the Unload message to
> FW.
> >> In that case FW should disable all features including C-state.
> > Driver does not send the Unload message. We want PMFM alive and ready
> for handling possible gpu reset on reloading.
> >
> 
> Actually, soc21_need_reset_on_init code itself has a bug. PSP won't get
> unloaded by default on ring destruction. Even if PSP stops, it could just keep
> the heartbeat value as non-zero (just that it won't increment).
> 
> Probably, that needs to be fixed first rather than keeping PMFW alive for a
> reset.
As I remembered, the change(asic reset during reloading) seemed introduced to address some sriov issues.
@Deucher, Alexander might share more backgrounds about this.
To be honest, I'm not a fan of this(perform asic reset during reloading).

Evan
> 
> Thanks,
> Lijo
> 
> > BR
> > Evan
> >>
> >> Thanks,
> >> Lijo
> >>
> >>>    	amdgpu_amdkfd_suspend(adev, false);
> >>>
> >>>    	/* Workaroud for ASICs need to disable SMC first */

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [PATCH] drm/amdgpu: disable cstate properly for driver reloading scenario
  2023-03-02  9:30       ` Quan, Evan
@ 2023-03-02 15:26         ` Deucher, Alexander
  2023-03-02 15:35           ` Lazar, Lijo
  0 siblings, 1 reply; 9+ messages in thread
From: Deucher, Alexander @ 2023-03-02 15:26 UTC (permalink / raw)
  To: Quan, Evan, Lazar, Lijo, amd-gfx

[AMD Official Use Only - General]

> -----Original Message-----
> From: Quan, Evan <Evan.Quan@amd.com>
> Sent: Thursday, March 2, 2023 4:31 AM
> To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org;
> Deucher, Alexander <Alexander.Deucher@amd.com>
> Subject: RE: [PATCH] drm/amdgpu: disable cstate properly for driver
> reloading scenario
> 
> [AMD Official Use Only - General]
> 
> 
> 
> > -----Original Message-----
> > From: Lazar, Lijo <Lijo.Lazar@amd.com>
> > Sent: Thursday, March 2, 2023 5:21 PM
> > To: Quan, Evan <Evan.Quan@amd.com>; amd-gfx@lists.freedesktop.org
> > Cc: Deucher, Alexander <Alexander.Deucher@amd.com>
> > Subject: Re: [PATCH] drm/amdgpu: disable cstate properly for driver
> > reloading scenario
> >
> >
> >
> > On 3/2/2023 2:43 PM, Quan, Evan wrote:
> > > [AMD Official Use Only - General]
> > >
> > >
> > >
> > >> -----Original Message-----
> > >> From: Lazar, Lijo <Lijo.Lazar@amd.com>
> > >> Sent: Thursday, March 2, 2023 4:28 PM
> > >> To: Quan, Evan <Evan.Quan@amd.com>; amd-
> gfx@lists.freedesktop.org
> > >> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>
> > >> Subject: Re: [PATCH] drm/amdgpu: disable cstate properly for driver
> > >> reloading scenario
> > >>
> > >>
> > >>
> > >> On 3/2/2023 12:28 PM, Evan Quan wrote:
> > >>> Gpu reset might be needed during driver reloading. To guard
> > >>> that(gpu
> > >>> reset) work, df cstate needs to be disabled properly.
> > >>>
> > >>> Signed-off-by: Evan Quan <evan.quan@amd.com>
> > >>> Change-Id: I5c074c265c0b08a67b6934ae1ad9aa3fed245461
> > >>> ---
> > >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++++
> > >>>    1 file changed, 9 insertions(+)
> > >>>
> > >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > >>> index 51bbeaa1f311..3c854461ef32 100644
> > >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > >>> @@ -2816,6 +2816,15 @@ static int
> > >>> amdgpu_device_ip_fini_early(struct
> > >> amdgpu_device *adev)
> > >>>    	amdgpu_device_set_pg_state(adev,
> AMD_PG_STATE_UNGATE);
> > >>>    	amdgpu_device_set_cg_state(adev,
> AMD_CG_STATE_UNGATE);
> > >>>
> > >>> +	/*
> > >>> +	 * Get df cstate disabled properly on driver unloading.
> > >>> +	 * Since on the succeeding driver reloading, gpu reset might
> > >>> +	 * be required. And cstate disabled is a prerequisite for
> > >>> +	 * that(gpu reset).
> > >>> +	 */
> > >>> +	if (amdgpu_dpm_set_df_cstate(adev, DF_CSTATE_DISALLOW))
> > >>> +		dev_warn(adev->dev, "Failed to disallow df cstate");
> > >>> +
> > >>
> > >> This looks more like a firmware bug. Driver sends the Unload
> > >> message to
> > FW.
> > >> In that case FW should disable all features including C-state.
> > > Driver does not send the Unload message. We want PMFM alive and
> > > ready
> > for handling possible gpu reset on reloading.
> > >
> >
> > Actually, soc21_need_reset_on_init code itself has a bug. PSP won't
> > get unloaded by default on ring destruction. Even if PSP stops, it
> > could just keep the heartbeat value as non-zero (just that it won't
> increment).
> >
> > Probably, that needs to be fixed first rather than keeping PMFW alive
> > for a reset.
> As I remembered, the change(asic reset during reloading) seemed
> introduced to address some sriov issues.
> @Deucher, Alexander might share more backgrounds about this.
> To be honest, I'm not a fan of this(perform asic reset during reloading).

I'm open to doing it a better way.  We did it for two reasons:
1. often times the device was left in a weird state after the driver unload/VM killed. Etc.  We needed a way to put the device into a known good state so the driver could re-initialize it.  Plus, IIRC, on some of the older ASICS, once the SMU or PSP firmware was loaded, there was no way to reload it without a reset so you needed one anyway.  This is largely why we have to reset for S4 as well.
2. Some large servers didn't power off PCI devices on reboots to save time.  This left the devices with whatever state they had before the system was rebooted which led to driver initialization problems on subsequent boots because the device was in an unknown state.

If there is a better way to handle these situations, I'm all for it.

Alex


> 
> Evan
> >
> > Thanks,
> > Lijo
> >
> > > BR
> > > Evan
> > >>
> > >> Thanks,
> > >> Lijo
> > >>
> > >>>    	amdgpu_amdkfd_suspend(adev, false);
> > >>>
> > >>>    	/* Workaroud for ASICs need to disable SMC first */

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] drm/amdgpu: disable cstate properly for driver reloading scenario
  2023-03-02 15:26         ` Deucher, Alexander
@ 2023-03-02 15:35           ` Lazar, Lijo
  2023-03-06  5:42             ` Lazar, Lijo
  0 siblings, 1 reply; 9+ messages in thread
From: Lazar, Lijo @ 2023-03-02 15:35 UTC (permalink / raw)
  To: Deucher, Alexander, Quan, Evan, amd-gfx



On 3/2/2023 8:56 PM, Deucher, Alexander wrote:
> [AMD Official Use Only - General]
> 
>> -----Original Message-----
>> From: Quan, Evan <Evan.Quan@amd.com>
>> Sent: Thursday, March 2, 2023 4:31 AM
>> To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org;
>> Deucher, Alexander <Alexander.Deucher@amd.com>
>> Subject: RE: [PATCH] drm/amdgpu: disable cstate properly for driver
>> reloading scenario
>>
>> [AMD Official Use Only - General]
>>
>>
>>
>>> -----Original Message-----
>>> From: Lazar, Lijo <Lijo.Lazar@amd.com>
>>> Sent: Thursday, March 2, 2023 5:21 PM
>>> To: Quan, Evan <Evan.Quan@amd.com>; amd-gfx@lists.freedesktop.org
>>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>
>>> Subject: Re: [PATCH] drm/amdgpu: disable cstate properly for driver
>>> reloading scenario
>>>
>>>
>>>
>>> On 3/2/2023 2:43 PM, Quan, Evan wrote:
>>>> [AMD Official Use Only - General]
>>>>
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Lazar, Lijo <Lijo.Lazar@amd.com>
>>>>> Sent: Thursday, March 2, 2023 4:28 PM
>>>>> To: Quan, Evan <Evan.Quan@amd.com>; amd-
>> gfx@lists.freedesktop.org
>>>>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>
>>>>> Subject: Re: [PATCH] drm/amdgpu: disable cstate properly for driver
>>>>> reloading scenario
>>>>>
>>>>>
>>>>>
>>>>> On 3/2/2023 12:28 PM, Evan Quan wrote:
>>>>>> Gpu reset might be needed during driver reloading. To guard
>>>>>> that(gpu
>>>>>> reset) work, df cstate needs to be disabled properly.
>>>>>>
>>>>>> Signed-off-by: Evan Quan <evan.quan@amd.com>
>>>>>> Change-Id: I5c074c265c0b08a67b6934ae1ad9aa3fed245461
>>>>>> ---
>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++++
>>>>>>     1 file changed, 9 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> index 51bbeaa1f311..3c854461ef32 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> @@ -2816,6 +2816,15 @@ static int
>>>>>> amdgpu_device_ip_fini_early(struct
>>>>> amdgpu_device *adev)
>>>>>>     	amdgpu_device_set_pg_state(adev,
>> AMD_PG_STATE_UNGATE);
>>>>>>     	amdgpu_device_set_cg_state(adev,
>> AMD_CG_STATE_UNGATE);
>>>>>>
>>>>>> +	/*
>>>>>> +	 * Get df cstate disabled properly on driver unloading.
>>>>>> +	 * Since on the succeeding driver reloading, gpu reset might
>>>>>> +	 * be required. And cstate disabled is a prerequisite for
>>>>>> +	 * that(gpu reset).
>>>>>> +	 */
>>>>>> +	if (amdgpu_dpm_set_df_cstate(adev, DF_CSTATE_DISALLOW))
>>>>>> +		dev_warn(adev->dev, "Failed to disallow df cstate");
>>>>>> +
>>>>>
>>>>> This looks more like a firmware bug. Driver sends the Unload
>>>>> message to
>>> FW.
>>>>> In that case FW should disable all features including C-state.
>>>> Driver does not send the Unload message. We want PMFM alive and
>>>> ready
>>> for handling possible gpu reset on reloading.
>>>>
>>>
>>> Actually, soc21_need_reset_on_init code itself has a bug. PSP won't
>>> get unloaded by default on ring destruction. Even if PSP stops, it
>>> could just keep the heartbeat value as non-zero (just that it won't
>> increment).
>>>
>>> Probably, that needs to be fixed first rather than keeping PMFW alive
>>> for a reset.
>> As I remembered, the change(asic reset during reloading) seemed
>> introduced to address some sriov issues.
>> @Deucher, Alexander might share more backgrounds about this.
>> To be honest, I'm not a fan of this(perform asic reset during reloading).
> 
> I'm open to doing it a better way.  We did it for two reasons:
> 1. often times the device was left in a weird state after the driver unload/VM killed. Etc.  We needed a way to put the device into a known good state so the driver could re-initialize it.  Plus, IIRC, on some of the older ASICS, once the SMU or PSP firmware was loaded, there was no way to reload it without a reset so you needed one anyway.  This is largely why we have to reset for S4 as well.
> 2. Some large servers didn't power off PCI devices on reboots to save time.  This left the devices with whatever state they had before the system was rebooted which led to driver initialization problems on subsequent boots because the device was in an unknown state.
> 
> If there is a better way to handle these situations, I'm all for it.
> 

Are those cases valid still? We have this for GFX9 - reset done only for 
pass through. And some of the GFX9 ASICs are used in large servers.

         /* Just return false for soc15 GPUs.  Reset does not seem to
          * be necessary.
          */
         if (!amdgpu_passthrough(adev))
                 return false;

Thanks,
Lijo

> Alex
> 
> 
>>
>> Evan
>>>
>>> Thanks,
>>> Lijo
>>>
>>>> BR
>>>> Evan
>>>>>
>>>>> Thanks,
>>>>> Lijo
>>>>>
>>>>>>     	amdgpu_amdkfd_suspend(adev, false);
>>>>>>
>>>>>>     	/* Workaroud for ASICs need to disable SMC first */

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] drm/amdgpu: disable cstate properly for driver reloading scenario
  2023-03-02 15:35           ` Lazar, Lijo
@ 2023-03-06  5:42             ` Lazar, Lijo
  2023-03-06 17:57               ` Deucher, Alexander
  0 siblings, 1 reply; 9+ messages in thread
From: Lazar, Lijo @ 2023-03-06  5:42 UTC (permalink / raw)
  To: Deucher, Alexander, Quan, Evan, amd-gfx



On 3/2/2023 9:05 PM, Lazar, Lijo wrote:
> 
> 
> On 3/2/2023 8:56 PM, Deucher, Alexander wrote:
>> [AMD Official Use Only - General]
>>
>>> -----Original Message-----
>>> From: Quan, Evan <Evan.Quan@amd.com>
>>> Sent: Thursday, March 2, 2023 4:31 AM
>>> To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org;
>>> Deucher, Alexander <Alexander.Deucher@amd.com>
>>> Subject: RE: [PATCH] drm/amdgpu: disable cstate properly for driver
>>> reloading scenario
>>>
>>> [AMD Official Use Only - General]
>>>
>>>
>>>
>>>> -----Original Message-----
>>>> From: Lazar, Lijo <Lijo.Lazar@amd.com>
>>>> Sent: Thursday, March 2, 2023 5:21 PM
>>>> To: Quan, Evan <Evan.Quan@amd.com>; amd-gfx@lists.freedesktop.org
>>>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>
>>>> Subject: Re: [PATCH] drm/amdgpu: disable cstate properly for driver
>>>> reloading scenario
>>>>
>>>>
>>>>
>>>> On 3/2/2023 2:43 PM, Quan, Evan wrote:
>>>>> [AMD Official Use Only - General]
>>>>>
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Lazar, Lijo <Lijo.Lazar@amd.com>
>>>>>> Sent: Thursday, March 2, 2023 4:28 PM
>>>>>> To: Quan, Evan <Evan.Quan@amd.com>; amd-
>>> gfx@lists.freedesktop.org
>>>>>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>
>>>>>> Subject: Re: [PATCH] drm/amdgpu: disable cstate properly for driver
>>>>>> reloading scenario
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 3/2/2023 12:28 PM, Evan Quan wrote:
>>>>>>> Gpu reset might be needed during driver reloading. To guard
>>>>>>> that(gpu
>>>>>>> reset) work, df cstate needs to be disabled properly.
>>>>>>>
>>>>>>> Signed-off-by: Evan Quan <evan.quan@amd.com>
>>>>>>> Change-Id: I5c074c265c0b08a67b6934ae1ad9aa3fed245461
>>>>>>> ---
>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++++
>>>>>>>     1 file changed, 9 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> index 51bbeaa1f311..3c854461ef32 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> @@ -2816,6 +2816,15 @@ static int
>>>>>>> amdgpu_device_ip_fini_early(struct
>>>>>> amdgpu_device *adev)
>>>>>>>         amdgpu_device_set_pg_state(adev,
>>> AMD_PG_STATE_UNGATE);
>>>>>>>         amdgpu_device_set_cg_state(adev,
>>> AMD_CG_STATE_UNGATE);
>>>>>>>
>>>>>>> +    /*
>>>>>>> +     * Get df cstate disabled properly on driver unloading.
>>>>>>> +     * Since on the succeeding driver reloading, gpu reset might
>>>>>>> +     * be required. And cstate disabled is a prerequisite for
>>>>>>> +     * that(gpu reset).
>>>>>>> +     */
>>>>>>> +    if (amdgpu_dpm_set_df_cstate(adev, DF_CSTATE_DISALLOW))
>>>>>>> +        dev_warn(adev->dev, "Failed to disallow df cstate");
>>>>>>> +
>>>>>>
>>>>>> This looks more like a firmware bug. Driver sends the Unload
>>>>>> message to
>>>> FW.
>>>>>> In that case FW should disable all features including C-state.
>>>>> Driver does not send the Unload message. We want PMFM alive and
>>>>> ready
>>>> for handling possible gpu reset on reloading.
>>>>>
>>>>
>>>> Actually, soc21_need_reset_on_init code itself has a bug. PSP won't
>>>> get unloaded by default on ring destruction. Even if PSP stops, it
>>>> could just keep the heartbeat value as non-zero (just that it won't
>>> increment).
>>>>
>>>> Probably, that needs to be fixed first rather than keeping PMFW alive
>>>> for a reset.
>>> As I remembered, the change(asic reset during reloading) seemed
>>> introduced to address some sriov issues.
>>> @Deucher, Alexander might share more backgrounds about this.
>>> To be honest, I'm not a fan of this(perform asic reset during 
>>> reloading).
>>
>> I'm open to doing it a better way.  We did it for two reasons:
>> 1. often times the device was left in a weird state after the driver 
>> unload/VM killed. Etc.  We needed a way to put the device into a known 
>> good state so the driver could re-initialize it.  Plus, IIRC, on some 
>> of the older ASICS, once the SMU or PSP firmware was loaded, there was 
>> no way to reload it without a reset so you needed one anyway.  This is 
>> largely why we have to reset for S4 as well.
>> 2. Some large servers didn't power off PCI devices on reboots to save 
>> time.  This left the devices with whatever state they had before the 
>> system was rebooted which led to driver initialization problems on 
>> subsequent boots because the device was in an unknown state.
>>
>> If there is a better way to handle these situations, I'm all for it.

Hi Alex,

There is a part of FW running to handle generic reset requests on SOC21 
SOCs. Can we remove the reset-on-reload for soc21 SOCs, unload PMFW as 
usual and see how it goes?

Thanks,
Lijo

>>
> 
> Are those cases valid still? We have this for GFX9 - reset done only for 
> pass through. And some of the GFX9 ASICs are used in large servers.
> 
>          /* Just return false for soc15 GPUs.  Reset does not seem to
>           * be necessary.
>           */
>          if (!amdgpu_passthrough(adev))
>                  return false;
> 
> Thanks,
> Lijo
> 
>> Alex
>>
>>
>>>
>>> Evan
>>>>
>>>> Thanks,
>>>> Lijo
>>>>
>>>>> BR
>>>>> Evan
>>>>>>
>>>>>> Thanks,
>>>>>> Lijo
>>>>>>
>>>>>>>         amdgpu_amdkfd_suspend(adev, false);
>>>>>>>
>>>>>>>         /* Workaroud for ASICs need to disable SMC first */

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [PATCH] drm/amdgpu: disable cstate properly for driver reloading scenario
  2023-03-06  5:42             ` Lazar, Lijo
@ 2023-03-06 17:57               ` Deucher, Alexander
  0 siblings, 0 replies; 9+ messages in thread
From: Deucher, Alexander @ 2023-03-06 17:57 UTC (permalink / raw)
  To: Lazar, Lijo, Quan, Evan, amd-gfx

[AMD Official Use Only - General]

> -----Original Message-----
> From: Lazar, Lijo <Lijo.Lazar@amd.com>
> Sent: Monday, March 6, 2023 12:42 AM
> To: Deucher, Alexander <Alexander.Deucher@amd.com>; Quan, Evan
> <Evan.Quan@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu: disable cstate properly for driver
> reloading scenario
> 
> 
> 
> On 3/2/2023 9:05 PM, Lazar, Lijo wrote:
> >
> >
> > On 3/2/2023 8:56 PM, Deucher, Alexander wrote:
> >> [AMD Official Use Only - General]
> >>
> >>> -----Original Message-----
> >>> From: Quan, Evan <Evan.Quan@amd.com>
> >>> Sent: Thursday, March 2, 2023 4:31 AM
> >>> To: Lazar, Lijo <Lijo.Lazar@amd.com>; amd-gfx@lists.freedesktop.org;
> >>> Deucher, Alexander <Alexander.Deucher@amd.com>
> >>> Subject: RE: [PATCH] drm/amdgpu: disable cstate properly for driver
> >>> reloading scenario
> >>>
> >>> [AMD Official Use Only - General]
> >>>
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Lazar, Lijo <Lijo.Lazar@amd.com>
> >>>> Sent: Thursday, March 2, 2023 5:21 PM
> >>>> To: Quan, Evan <Evan.Quan@amd.com>; amd-
> gfx@lists.freedesktop.org
> >>>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>
> >>>> Subject: Re: [PATCH] drm/amdgpu: disable cstate properly for driver
> >>>> reloading scenario
> >>>>
> >>>>
> >>>>
> >>>> On 3/2/2023 2:43 PM, Quan, Evan wrote:
> >>>>> [AMD Official Use Only - General]
> >>>>>
> >>>>>
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Lazar, Lijo <Lijo.Lazar@amd.com>
> >>>>>> Sent: Thursday, March 2, 2023 4:28 PM
> >>>>>> To: Quan, Evan <Evan.Quan@amd.com>; amd-
> >>> gfx@lists.freedesktop.org
> >>>>>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>
> >>>>>> Subject: Re: [PATCH] drm/amdgpu: disable cstate properly for
> >>>>>> driver reloading scenario
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 3/2/2023 12:28 PM, Evan Quan wrote:
> >>>>>>> Gpu reset might be needed during driver reloading. To guard
> >>>>>>> that(gpu
> >>>>>>> reset) work, df cstate needs to be disabled properly.
> >>>>>>>
> >>>>>>> Signed-off-by: Evan Quan <evan.quan@amd.com>
> >>>>>>> Change-Id: I5c074c265c0b08a67b6934ae1ad9aa3fed245461
> >>>>>>> ---
> >>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9
> +++++++++
> >>>>>>>     1 file changed, 9 insertions(+)
> >>>>>>>
> >>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>>>>> index 51bbeaa1f311..3c854461ef32 100644
> >>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>>>>> @@ -2816,6 +2816,15 @@ static int
> >>>>>>> amdgpu_device_ip_fini_early(struct
> >>>>>> amdgpu_device *adev)
> >>>>>>>         amdgpu_device_set_pg_state(adev,
> >>> AMD_PG_STATE_UNGATE);
> >>>>>>>         amdgpu_device_set_cg_state(adev,
> >>> AMD_CG_STATE_UNGATE);
> >>>>>>>
> >>>>>>> +    /*
> >>>>>>> +     * Get df cstate disabled properly on driver unloading.
> >>>>>>> +     * Since on the succeeding driver reloading, gpu reset
> >>>>>>> +might
> >>>>>>> +     * be required. And cstate disabled is a prerequisite for
> >>>>>>> +     * that(gpu reset).
> >>>>>>> +     */
> >>>>>>> +    if (amdgpu_dpm_set_df_cstate(adev, DF_CSTATE_DISALLOW))
> >>>>>>> +        dev_warn(adev->dev, "Failed to disallow df cstate");
> >>>>>>> +
> >>>>>>
> >>>>>> This looks more like a firmware bug. Driver sends the Unload
> >>>>>> message to
> >>>> FW.
> >>>>>> In that case FW should disable all features including C-state.
> >>>>> Driver does not send the Unload message. We want PMFM alive and
> >>>>> ready
> >>>> for handling possible gpu reset on reloading.
> >>>>>
> >>>>
> >>>> Actually, soc21_need_reset_on_init code itself has a bug. PSP won't
> >>>> get unloaded by default on ring destruction. Even if PSP stops, it
> >>>> could just keep the heartbeat value as non-zero (just that it won't
> >>> increment).
> >>>>
> >>>> Probably, that needs to be fixed first rather than keeping PMFW
> >>>> alive for a reset.
> >>> As I remembered, the change(asic reset during reloading) seemed
> >>> introduced to address some sriov issues.
> >>> @Deucher, Alexander might share more backgrounds about this.
> >>> To be honest, I'm not a fan of this(perform asic reset during
> >>> reloading).
> >>
> >> I'm open to doing it a better way.  We did it for two reasons:
> >> 1. often times the device was left in a weird state after the driver
> >> unload/VM killed. Etc.  We needed a way to put the device into a
> >> known good state so the driver could re-initialize it.  Plus, IIRC,
> >> on some of the older ASICS, once the SMU or PSP firmware was loaded,
> >> there was no way to reload it without a reset so you needed one
> >> anyway.  This is largely why we have to reset for S4 as well.
> >> 2. Some large servers didn't power off PCI devices on reboots to save
> >> time.  This left the devices with whatever state they had before the
> >> system was rebooted which led to driver initialization problems on
> >> subsequent boots because the device was in an unknown state.
> >>
> >> If there is a better way to handle these situations, I'm all for it.
> 
> Hi Alex,
> 
> There is a part of FW running to handle generic reset requests on SOC21
> SOCs. Can we remove the reset-on-reload for soc21 SOCs, unload PMFW as
> usual and see how it goes?

Sure.

Alex

> 
> Thanks,
> Lijo
> 
> >>
> >
> > Are those cases valid still? We have this for GFX9 - reset done only
> > for pass through. And some of the GFX9 ASICs are used in large servers.
> >
> >          /* Just return false for soc15 GPUs.  Reset does not seem to
> >           * be necessary.
> >           */
> >          if (!amdgpu_passthrough(adev))
> >                  return false;
> >
> > Thanks,
> > Lijo
> >
> >> Alex
> >>
> >>
> >>>
> >>> Evan
> >>>>
> >>>> Thanks,
> >>>> Lijo
> >>>>
> >>>>> BR
> >>>>> Evan
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Lijo
> >>>>>>
> >>>>>>>         amdgpu_amdkfd_suspend(adev, false);
> >>>>>>>
> >>>>>>>         /* Workaroud for ASICs need to disable SMC first */

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2023-03-06 17:57 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-02  6:58 [PATCH] drm/amdgpu: disable cstate properly for driver reloading scenario Evan Quan
2023-03-02  8:27 ` Lazar, Lijo
2023-03-02  9:13   ` Quan, Evan
2023-03-02  9:20     ` Lazar, Lijo
2023-03-02  9:30       ` Quan, Evan
2023-03-02 15:26         ` Deucher, Alexander
2023-03-02 15:35           ` Lazar, Lijo
2023-03-06  5:42             ` Lazar, Lijo
2023-03-06 17:57               ` Deucher, Alexander

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.