linux-kernel.vger.kernel.org archive mirror
* Linux 6.1-rc1 drm/amdgpu regression
@ 2022-10-19 20:00 Shuah Khan
  2022-10-19 20:27 ` Deucher, Alexander
  0 siblings, 1 reply; 5+ messages in thread
From: Shuah Khan @ 2022-10-19 20:00 UTC (permalink / raw)
  To: Alexander Deucher; +Cc: Linus Torvalds, Shuah Khan, linux-kernel

Hi Alex,

On Linux 6.1-rc1, I am seeing the same problem that I sent reverts
for on 5.10.147. This is on my laptop with an AMD Ryzen 7 PRO 5850U
with Radeon Graphics.

commit e3163bc8ffdfdb405e10530b140135b2ee487f89
Author: Alex Deucher <alexander.deucher@amd.com>
Date:   Fri Sep 9 11:53:27 2022 -0400

     drm/amdgpu: move nbio sdma_doorbell_range() into sdma code for vega

I see that the following has been reverted in Linux 6.1-rc1

commit 66f99628eb24409cb8feb5061f78283c8b65f820
Author: Hamza Mahfooz <hamza.mahfooz@amd.com>
Date:   Tue Sep 6 15:01:49 2022 -0400

     drm/amdgpu: use dirty framebuffer helper

However, I still see the following messages filling dmesg, and the
system is unusable. For now I have switched back to Linux 6.0, as this
is my primary system.

[drm] Fence fallback timer expired on ring sdma0
[drm] Fence fallback timer expired on ring gfx
[drm] Fence fallback timer expired on ring sdma0
[drm] Fence fallback timer expired on ring gfx
[drm] Fence fallback timer expired on ring sdma0
[drm] Fence fallback timer expired on ring sdma0
[drm] Fence fallback timer expired on ring sdma0
[drm] Fence fallback timer expired on ring gfx

Please let me know if I should send a revert for this for mainline
as well.

thanks,
-- Shuah


* RE: Linux 6.1-rc1 drm/amdgpu regression
  2022-10-19 20:00 Linux 6.1-rc1 drm/amdgpu regression Shuah Khan
@ 2022-10-19 20:27 ` Deucher, Alexander
  2022-10-19 20:59   ` Shuah Khan
  0 siblings, 1 reply; 5+ messages in thread
From: Deucher, Alexander @ 2022-10-19 20:27 UTC (permalink / raw)
  To: Shuah Khan; +Cc: Linus Torvalds, linux-kernel

[AMD Official Use Only - General]

> -----Original Message-----
> From: Shuah Khan <skhan@linuxfoundation.org>
> Sent: Wednesday, October 19, 2022 4:00 PM
> To: Deucher, Alexander <Alexander.Deucher@amd.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>; Shuah Khan
> <skhan@linuxfoundation.org>; linux-kernel@vger.kernel.org
> Subject: Linux 6.1-rc1 drm/amdgpu regression
> 
> Hi Alex,
> 
> I am seeing the same problem I sent reverts for on 5.10.147 on Linux 6.1-rc1
> on my laptop with AMD Ryzen 7 PRO 5850U with Radeon Graphics.
> 
> commit e3163bc8ffdfdb405e10530b140135b2ee487f89
> Author: Alex Deucher <alexander.deucher@amd.com>
> Date:   Fri Sep 9 11:53:27 2022 -0400
> 
>      drm/amdgpu: move nbio sdma_doorbell_range() into sdma code for vega
> 
> I see that the following has been reverted in Linux 6.1-rc1
> 
> commit 66f99628eb24409cb8feb5061f78283c8b65f820
> Author: Hamza Mahfooz <hamza.mahfooz@amd.com>
> Date:   Tue Sep 6 15:01:49 2022 -0400
> 
>      drm/amdgpu: use dirty framebuffer helper
> 
> However I still see the following filling dmesg and system is unusable.
> For now I switched back to Linux 6.0 as this is my primary system.
> 
> [drm] Fence fallback timer expired on ring sdma0
> [drm] Fence fallback timer expired on ring gfx
> [drm] Fence fallback timer expired on ring sdma0
> [drm] Fence fallback timer expired on ring gfx
> [drm] Fence fallback timer expired on ring sdma0
> [drm] Fence fallback timer expired on ring sdma0
> [drm] Fence fallback timer expired on ring sdma0
> [drm] Fence fallback timer expired on ring gfx
> 
> Please let me know if I should send revert for this for the mainline as well.
> 

Can you file a bug report (https://gitlab.freedesktop.org/drm/amd/-/issues) and attach your dmesg output?  I'd like to try and repro the issue if I can and provide some patches to test.  I'd like to avoid reverting the patch as that will break the driver for users using vega dGPUs.  If we revert this patch we'll need to revert the following patches as well to avoid a broken driver for a bunch of AMD GPUs:
dc1d85cb790f2091eea074cee24a704b2d6c4a06
e3163bc8ffdfdb405e10530b140135b2ee487f89
a8671493d2074950553da3cf07d1be43185ef6c6
8795e182b02dc87e343c79e73af6b8b7f9c5e635

Thanks,

Alex


* Re: Linux 6.1-rc1 drm/amdgpu regression
  2022-10-19 20:27 ` Deucher, Alexander
@ 2022-10-19 20:59   ` Shuah Khan
  2022-10-19 21:24     ` Deucher, Alexander
  0 siblings, 1 reply; 5+ messages in thread
From: Shuah Khan @ 2022-10-19 20:59 UTC (permalink / raw)
  To: Deucher, Alexander; +Cc: Linus Torvalds, linux-kernel, Shuah Khan

On 10/19/22 14:27, Deucher, Alexander wrote:
> [AMD Official Use Only - General]
> 
>> -----Original Message-----
>> From: Shuah Khan <skhan@linuxfoundation.org>
>> Sent: Wednesday, October 19, 2022 4:00 PM
>> To: Deucher, Alexander <Alexander.Deucher@amd.com>
>> Cc: Linus Torvalds <torvalds@linux-foundation.org>; Shuah Khan
>> <skhan@linuxfoundation.org>; linux-kernel@vger.kernel.org
>> Subject: Linux 6.1-rc1 drm/amdgpu regression
>>
>> Hi Alex,
>>
>> I am seeing the same problem I sent reverts for on 5.10.147 on Linux 6.1-rc1
>> on my laptop with AMD Ryzen 7 PRO 5850U with Radeon Graphics.
>>
>> commit e3163bc8ffdfdb405e10530b140135b2ee487f89
>> Author: Alex Deucher <alexander.deucher@amd.com>
>> Date:   Fri Sep 9 11:53:27 2022 -0400
>>
>>       drm/amdgpu: move nbio sdma_doorbell_range() into sdma code for vega
>>
>> I see that the following has been reverted in Linux 6.1-rc1
>>
>> commit 66f99628eb24409cb8feb5061f78283c8b65f820
>> Author: Hamza Mahfooz <hamza.mahfooz@amd.com>
>> Date:   Tue Sep 6 15:01:49 2022 -0400
>>
>>       drm/amdgpu: use dirty framebuffer helper
>>
>> However I still see the following filling dmesg and system is unusable.
>> For now I switched back to Linux 6.0 as this is my primary system.
>>
>> [drm] Fence fallback timer expired on ring sdma0
>> [drm] Fence fallback timer expired on ring gfx
>> [drm] Fence fallback timer expired on ring sdma0
>> [drm] Fence fallback timer expired on ring gfx
>> [drm] Fence fallback timer expired on ring sdma0
>> [drm] Fence fallback timer expired on ring sdma0
>> [drm] Fence fallback timer expired on ring sdma0
>> [drm] Fence fallback timer expired on ring gfx
>>
>> Please let me know if I should send revert for this for the mainline as well.
>>
> 
> Can you file a bug report (https://gitlab.freedesktop.org/drm/amd/-/issues) and attach your dmesg output?  I'd like to try and repro the issue if I can and provide some patches to test.  I'd like to avoid reverting the patch as that will break the driver for users using vega dGPUs.

Makes sense. I will file the bug and attach dmesg. Since this is my
primary system, there will be some delay in getting this info to you
and in testing any patches you provide.

thanks,
-- Shuah


* RE: Linux 6.1-rc1 drm/amdgpu regression
  2022-10-19 20:59   ` Shuah Khan
@ 2022-10-19 21:24     ` Deucher, Alexander
  2022-10-20  1:16       ` Shuah Khan
  0 siblings, 1 reply; 5+ messages in thread
From: Deucher, Alexander @ 2022-10-19 21:24 UTC (permalink / raw)
  To: Shuah Khan; +Cc: Linus Torvalds, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3247 bytes --]

[Public]

> -----Original Message-----
> From: Shuah Khan <skhan@linuxfoundation.org>
> Sent: Wednesday, October 19, 2022 5:00 PM
> To: Deucher, Alexander <Alexander.Deucher@amd.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>; linux-
> kernel@vger.kernel.org; Shuah Khan <skhan@linuxfoundation.org>
> Subject: Re: Linux 6.1-rc1 drm/amdgpu regression
> 
> On 10/19/22 14:27, Deucher, Alexander wrote:
> > [AMD Official Use Only - General]
> >
> >> -----Original Message-----
> >> From: Shuah Khan <skhan@linuxfoundation.org>
> >> Sent: Wednesday, October 19, 2022 4:00 PM
> >> To: Deucher, Alexander <Alexander.Deucher@amd.com>
> >> Cc: Linus Torvalds <torvalds@linux-foundation.org>; Shuah Khan
> >> <skhan@linuxfoundation.org>; linux-kernel@vger.kernel.org
> >> Subject: Linux 6.1-rc1 drm/amdgpu regression
> >>
> >> Hi Alex,
> >>
> >> I am seeing the same problem I sent reverts for on 5.10.147 on Linux
> >> 6.1-rc1 on my laptop with AMD Ryzen 7 PRO 5850U with Radeon Graphics.
> >>
> >> commit e3163bc8ffdfdb405e10530b140135b2ee487f89
> >> Author: Alex Deucher <alexander.deucher@amd.com>
> >> Date:   Fri Sep 9 11:53:27 2022 -0400
> >>
> >>       drm/amdgpu: move nbio sdma_doorbell_range() into sdma code for
> >> vega
> >>
> >> I see that the following has been reverted in Linux 6.1-rc1
> >>
> >> commit 66f99628eb24409cb8feb5061f78283c8b65f820
> >> Author: Hamza Mahfooz <hamza.mahfooz@amd.com>
> >> Date:   Tue Sep 6 15:01:49 2022 -0400
> >>
> >>       drm/amdgpu: use dirty framebuffer helper
> >>
> >> However I still see the following filling dmesg and system is unusable.
> >> For now I switched back to Linux 6.0 as this is my primary system.
> >>
> >> [drm] Fence fallback timer expired on ring sdma0
> >> [drm] Fence fallback timer expired on ring gfx
> >> [drm] Fence fallback timer expired on ring sdma0
> >> [drm] Fence fallback timer expired on ring gfx
> >> [drm] Fence fallback timer expired on ring sdma0
> >> [drm] Fence fallback timer expired on ring sdma0
> >> [drm] Fence fallback timer expired on ring sdma0
> >> [drm] Fence fallback timer expired on ring gfx
> >>
> >> Please let me know if I should send revert for this for the mainline as well.
> >>
> >
> > Can you file a bug report
> > (https://gitlab.freedesktop.org/drm/amd/-/issues) and attach your dmesg
> > output?  I'd like to try and repro the issue if I can and provide some
> > patches to test.  I'd like to avoid reverting the patch as that will break
> > the driver for users using vega dGPUs.
> 
> Makes sense. I will file the bug and attach dmesg. Since this is my primary
> system, there will be some delay in getting this info. to you and testing any
> patches you provide for testing.
> 

Actually I think I see what's wrong.  Can you try the attached patch?

Alex

[-- Attachment #2: 0001-drm-amdgpu-fix-sdma-doorbell-init-ordering-on-APUs.patch --]
[-- Type: application/octet-stream, Size: 3328 bytes --]

From 62fda3a8cbc93d50974bb320c0e95e2b6308f4b9 Mon Sep 17 00:00:00 2001
From: Alex Deucher <alexander.deucher@amd.com>
Date: Wed, 19 Oct 2022 16:57:42 -0400
Subject: [PATCH] drm/amdgpu: fix sdma doorbell init ordering on APUs

Commit 8795e182b02d ("PCI/portdrv: Don't disable AER reporting in get_port_device_capability()")
uncovered a bug in amdgpu that required a reordering of the driver
init sequence to avoid accessing a special register on the GPU
before it was properly set up, leading to a PCI AER error.  This
reordering uncovered a different hw programming ordering dependency
in some APUs where the SDMA doorbells need to be programmed before
the GFX doorbells. To fix this, move the SDMA doorbell programming
back into the soc15 common code, but use the actual doorbell range
values directly rather than the values stored in the ring structure
since those will not be initialized at this point.

This is a partial revert, but with the doorbell assignment
fixed so the proper doorbell index is set before it's used.

Fixes: e3163bc8ffdfdb ("drm/amdgpu: move nbio sdma_doorbell_range() into sdma code for vega")
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: skhan@linuxfoundation.org
---
 drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c |  5 -----
 drivers/gpu/drm/amd/amdgpu/soc15.c     | 21 +++++++++++++++++++++
 2 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
index 298fa11702e7..1122bd4eae98 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
@@ -1417,11 +1417,6 @@ static int sdma_v4_0_start(struct amdgpu_device *adev)
 		WREG32_SDMA(i, mmSDMA0_CNTL, temp);
 
 		if (!amdgpu_sriov_vf(adev)) {
-			ring = &adev->sdma.instance[i].ring;
-			adev->nbio.funcs->sdma_doorbell_range(adev, i,
-				ring->use_doorbell, ring->doorbell_index,
-				adev->doorbell_index.sdma_doorbell_range);
-
 			/* unhalt engine */
 			temp = RREG32_SDMA(i, mmSDMA0_F32_CNTL);
 			temp = REG_SET_FIELD(temp, SDMA0_F32_CNTL, HALT, 0);
diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
index 183024d7c184..e3b2b6b4f1a6 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
@@ -1211,6 +1211,20 @@ static int soc15_common_sw_fini(void *handle)
 	return 0;
 }
 
+static void soc15_sdma_doorbell_range_init(struct amdgpu_device *adev)
+{
+	int i;
+
+	/* sdma doorbell range is programmed by hypervisor */
+	if (!amdgpu_sriov_vf(adev)) {
+		for (i = 0; i < adev->sdma.num_instances; i++) {
+			adev->nbio.funcs->sdma_doorbell_range(adev, i,
+				true, adev->doorbell_index.sdma_engine[i] << 1,
+				adev->doorbell_index.sdma_doorbell_range);
+		}
+	}
+}
+
 static int soc15_common_hw_init(void *handle)
 {
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
@@ -1230,6 +1244,13 @@ static int soc15_common_hw_init(void *handle)
 
 	/* enable the doorbell aperture */
 	soc15_enable_doorbell_aperture(adev, true);
+	/* HW doorbell routing policy: doorbell writing not
+	 * in SDMA/IH/MM/ACV range will be routed to CP. So
+	 * we need to init SDMA doorbell range prior
+	 * to CP ip block init and ring test.  IH already
+	 * happens before CP.
+	 */
+	soc15_sdma_doorbell_range_init(adev);
 
 	return 0;
 }
-- 
2.37.3



* Re: Linux 6.1-rc1 drm/amdgpu regression
  2022-10-19 21:24     ` Deucher, Alexander
@ 2022-10-20  1:16       ` Shuah Khan
  0 siblings, 0 replies; 5+ messages in thread
From: Shuah Khan @ 2022-10-20  1:16 UTC (permalink / raw)
  To: Deucher, Alexander; +Cc: Linus Torvalds, linux-kernel, Shuah Khan

On 10/19/22 15:24, Deucher, Alexander wrote:
> [Public]
> 
>> -----Original Message-----
>> From: Shuah Khan <skhan@linuxfoundation.org>
>> Sent: Wednesday, October 19, 2022 5:00 PM
>> To: Deucher, Alexander <Alexander.Deucher@amd.com>
>> Cc: Linus Torvalds <torvalds@linux-foundation.org>; linux-
>> kernel@vger.kernel.org; Shuah Khan <skhan@linuxfoundation.org>
>> Subject: Re: Linux 6.1-rc1 drm/amdgpu regression
>>
>> On 10/19/22 14:27, Deucher, Alexander wrote:
>>> [AMD Official Use Only - General]
>>>
>>>> -----Original Message-----
>>>> From: Shuah Khan <skhan@linuxfoundation.org>
>>>> Sent: Wednesday, October 19, 2022 4:00 PM
>>>> To: Deucher, Alexander <Alexander.Deucher@amd.com>
>>>> Cc: Linus Torvalds <torvalds@linux-foundation.org>; Shuah Khan
>>>> <skhan@linuxfoundation.org>; linux-kernel@vger.kernel.org
>>>> Subject: Linux 6.1-rc1 drm/amdgpu regression
>>>>
>>>> Hi Alex,
>>>>
>>>> I am seeing the same problem I sent reverts for on 5.10.147 on Linux
>>>> 6.1-rc1 on my laptop with AMD Ryzen 7 PRO 5850U with Radeon Graphics.
>>>>
>>>> commit e3163bc8ffdfdb405e10530b140135b2ee487f89
>>>> Author: Alex Deucher <alexander.deucher@amd.com>
>>>> Date:   Fri Sep 9 11:53:27 2022 -0400
>>>>
>>>>        drm/amdgpu: move nbio sdma_doorbell_range() into sdma code for
>>>> vega
>>>>
>>>> I see that the following has been reverted in Linux 6.1-rc1
>>>>
>>>> commit 66f99628eb24409cb8feb5061f78283c8b65f820
>>>> Author: Hamza Mahfooz <hamza.mahfooz@amd.com>
>>>> Date:   Tue Sep 6 15:01:49 2022 -0400
>>>>
>>>>        drm/amdgpu: use dirty framebuffer helper
>>>>
>>>> However I still see the following filling dmesg and system is unusable.
>>>> For now I switched back to Linux 6.0 as this is my primary system.
>>>>
>>>> [drm] Fence fallback timer expired on ring sdma0
>>>> [drm] Fence fallback timer expired on ring gfx
>>>> [drm] Fence fallback timer expired on ring sdma0
>>>> [drm] Fence fallback timer expired on ring gfx
>>>> [drm] Fence fallback timer expired on ring sdma0
>>>> [drm] Fence fallback timer expired on ring sdma0
>>>> [drm] Fence fallback timer expired on ring sdma0
>>>> [drm] Fence fallback timer expired on ring gfx
>>>>
>>>> Please let me know if I should send revert for this for the mainline as well.
>>>>
>>>
>>> Can you file a bug report
>>> (https://gitlab.freedesktop.org/drm/amd/-/issues) and attach your dmesg
>>> output?  I'd like to try and repro the issue if I can and provide some
>>> patches to test.  I'd like to avoid reverting the patch as that will break
>>> the driver for users using vega dGPUs.
>>
>> Makes sense. I will file the bug and attach dmesg. Since this is my primary
>> system, there will be some delay in getting this info. to you and testing any
>> patches you provide for testing.
>>
> 
> Actually I think I see what's wrong.  Can you try the attached patch?
> 

This patch worked. I get a clean boot without any warnings or timer expiry
messages from drm/amdgpu.

thanks,
-- Shuah


