* amdgpu: [powerplay] failed to send message 148 ret is 0
@ 2018-10-24 21:49 Mikulas Patocka
From: Mikulas Patocka @ 2018-10-24 21:49 UTC
  To: amd-gfx@lists.freedesktop.org; +Cc: Alex Deucher

Hi

I have a Sapphire Pulse RX 570 ITX graphics card.

On Linux, I get errors "amdgpu: [powerplay] failed to send message 148 ret 
is 0" and the system is stuck for several seconds when they happen. The 
card works, except for these errors and occasional delays.

Do you have an idea what could cause these errors or how to debug them?

There's nothing to bisect because all the kernels that I tried (back to 
4.9) show these errors. I've also tried a kernel from branch 
"origin/amd-staging-drm-next" from amdgpu git, but it has even more of 
these errors than 4.18.16.

I tried newer firmware from 
git://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git, 
but it didn't help.

Some users suggest that a BIOS upgrade may help with this, but there's no 
BIOS for this card on the Sapphire website.

Mikulas


[    9.371716] [drm] amdgpu kernel modesetting enabled.
[    9.372068] [drm] initializing kernel modesetting (POLARIS10 0x1002:0x67DF 0x1DA2:0xE343 0xEF).
[    9.372126] [drm] register mmio base: 0xFF5C0000
[    9.372158] [drm] register mmio size: 262144
[    9.372194] [drm] probing mlw for device 1166:132 = 3026c81
[    9.372228] [drm] add ip block number 0 <vi_common>
[    9.372260] [drm] add ip block number 1 <gmc_v8_0>
[    9.372292] [drm] add ip block number 2 <tonga_ih>
[    9.372324] [drm] add ip block number 3 <powerplay>
[    9.372356] [drm] add ip block number 4 <dm>
[    9.372387] [drm] add ip block number 5 <gfx_v8_0>
[    9.372419] [drm] add ip block number 6 <sdma_v3_0>
[    9.372452] [drm] add ip block number 7 <uvd_v6_0>
[    9.372483] [drm] add ip block number 8 <vce_v3_0>
[    9.372530] [drm] UVD is enabled in VM mode
[    9.372561] [drm] UVD ENC is enabled in VM mode
[    9.372594] [drm] VCE enabled in VM mode
[    9.372807] amdgpu 0000:07:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[    9.373681] ATOM BIOS: 113-D00034-L01
[    9.373751] [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
[    9.373848] amdgpu 0000:07:00.0: VRAM: 4096M 0x000000F400000000 - 0x000000F4FFFFFFFF (4096M used)
[    9.373894] amdgpu 0000:07:00.0: GTT: 256M 0x0000000000000000 - 0x000000000FFFFFFF
[    9.373941] [drm] Detected VRAM RAM=4096M, BAR=256M
[    9.373974] [drm] RAM width 256bits GDDR5
[    9.374090] [TTM] Zone  kernel: Available graphics memory: 66051588 kiB
[    9.374124] [TTM] Zone   dma32: Available graphics memory: 2097152 kiB
[    9.374158] [TTM] Initializing pool allocator
[    9.374193] [TTM] Initializing DMA pool allocator
[    9.374258] [drm] amdgpu: 4096M of VRAM memory ready
[    9.374291] [drm] amdgpu: 4096M of GTT memory ready.
[    9.374331] [drm] GART: num cpu pages 65536, num gpu pages 65536
[    9.374419] [drm] PCIE GART of 256M enabled (table at 0x000000F400900000).
[    9.374616] [drm] Chained IB support enabled!
[    9.376667] [drm] Found UVD firmware Version: 1.130 Family ID: 16
[    9.379218] [drm] Found VCE firmware Version: 53.26 Binary ID: 3
[    9.433581] [drm] DM_PPLIB: values for Engine clock
[    9.433618] [drm] DM_PPLIB:   30000
[    9.433649] [drm] DM_PPLIB:   58800
[    9.433679] [drm] DM_PPLIB:   95200
[    9.433710] [drm] DM_PPLIB:   104100
[    9.433740] [drm] DM_PPLIB:   110600
[    9.433771] [drm] DM_PPLIB:   116800
[    9.433801] [drm] DM_PPLIB:   120900
[    9.433831] [drm] DM_PPLIB:   124400
[    9.433862] [drm] DM_PPLIB: Validation clocks:
[    9.433894] [drm] DM_PPLIB:    engine_max_clock: 124400
[    9.433926] [drm] DM_PPLIB:    memory_max_clock: 150000
[    9.433958] [drm] DM_PPLIB:    level           : 8
[    9.433990] [drm] DM_PPLIB: values for Memory clock
[    9.434026] [drm] DM_PPLIB:   30000
[    9.434056] [drm] DM_PPLIB:   100000
[    9.434087] [drm] DM_PPLIB:   150000
[    9.434117] [drm] DM_PPLIB: Validation clocks:
[    9.434148] [drm] DM_PPLIB:    engine_max_clock: 124400
[    9.434180] [drm] DM_PPLIB:    memory_max_clock: 150000
[    9.434212] [drm] DM_PPLIB:    level           : 8
[    9.434662] [drm] Display Core initialized with v3.1.44!
[    9.447631] [drm] SADs count is: -2, don't need to read it
[    9.447676] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[    9.447710] [drm] Driver supports precise vblank timestamp query.
[    9.471140] random: crng init done
[    9.471202] random: 7 urandom warning(s) missed due to ratelimiting
[    9.496908] [drm] UVD and UVD ENC initialized successfully.
[    9.607867] [drm] VCE initialized successfully.
[    9.609791] [drm] fb mappable at 0xC0E28000
[    9.609825] [drm] vram apper at 0xC0000000
[    9.609856] [drm] size 8294400
[    9.609887] [drm] fb depth is 24
[    9.609917] [drm]    pitch is 7680
[    9.610027] fbcon: amdgpudrmfb (fb0) is primary device
[    9.650493] Console: switching to colour frame buffer device 240x67
[    9.667224] amdgpu 0000:07:00.0: fb0: amdgpudrmfb frame buffer device
[   10.083684] amdgpu: [powerplay]
                failed to send message 148 ret is 0
[   10.904841] amdgpu: [powerplay]
                last message was failed ret is 0
[   11.315428] amdgpu: [powerplay]
                failed to send message 145 ret is 0
[   12.137120] amdgpu: [powerplay]
                last message was failed ret is 0
[   12.552819] amdgpu: [powerplay]
                failed to send message 146 ret is 0
[   12.552936] [drm] Initialized amdgpu 3.26.0 20150101 for 0000:07:00.0 on minor 0
[   12.595174] [drm] radeon kernel modesetting enabled.
[   12.979816] amdgpu: [powerplay]
                last message was failed ret is 0
[   13.398906] amdgpu: [powerplay]
                failed to send message 155 ret is 0
[   13.837011] amdgpu: [powerplay]
                last message was failed ret is 0
[   14.275253] amdgpu: [powerplay]
                failed to send message 15b ret is 0
[   15.526324] amdgpu: [powerplay]
                last message was failed ret is 0
[   15.959661] amdgpu: [powerplay]
                failed to send message 154 ret is 0
[   17.375198] amdgpu: [powerplay]
                last message was failed ret is 0
[   17.815031] amdgpu: [powerplay]
                failed to send message 15a ret is 0
[   18.681585] amdgpu: [powerplay]
                last message was failed ret is 0
[   18.682979] amdgpu: [powerplay]
                last message was failed ret is 0
[   19.115690] amdgpu: [powerplay]
                failed to send message 282 ret is 0
[   19.122945] amdgpu: [powerplay]
                failed to send message 260 ret is 0
[   19.997725] amdgpu: [powerplay]
                last message was failed ret is 0
[   20.430187] amdgpu: [powerplay]
                failed to send message 260 ret is 0
[   21.297194] amdgpu: [powerplay]
                last message was failed ret is 0
[   21.727703] amdgpu: [powerplay]
                failed to send message 260 ret is 0
[   22.552654] amdgpu: [powerplay]
                last message was failed ret is 0
[   22.962737] amdgpu: [powerplay]
                failed to send message 260 ret is 0
[   23.782713] amdgpu: [powerplay]
                last message was failed ret is 0
[   24.192816] amdgpu: [powerplay]
                failed to send message 260 ret is 0
[   25.012821] amdgpu: [powerplay]
                last message was failed ret is 0
[   25.422874] amdgpu: [powerplay]
                failed to send message 260 ret is 0
[   26.242996] amdgpu: [powerplay]
                last message was failed ret is 0
[   26.653060] amdgpu: [powerplay]
                failed to send message 260 ret is 0
[   27.473027] amdgpu: [powerplay]
                last message was failed ret is 0
[   27.883050] amdgpu: [powerplay]
                failed to send message 260 ret is 0
[   28.703097] amdgpu: [powerplay]
                last message was failed ret is 0
[   29.113103] amdgpu: [powerplay]
                failed to send message 260 ret is 0
[   29.933043] amdgpu: [powerplay]
                last message was failed ret is 0
[   30.343163] amdgpu: [powerplay]
                failed to send message 260 ret is 0
[   31.163294] amdgpu: [powerplay]
                last message was failed ret is 0
[   31.573409] amdgpu: [powerplay]
                failed to send message 260 ret is 0
[   31.989566] amdgpu: [powerplay]
                last message was failed ret is 0
[   32.399701] amdgpu: [powerplay]
                failed to send message 155 ret is 0
[   32.841386] amdgpu: [powerplay]
                last message was failed ret is 0
[   33.282989] amdgpu: [powerplay]
                failed to send message 15b ret is 0

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


* Re: amdgpu: [powerplay] failed to send message 148 ret is 0
@ 2018-10-25 20:37 Mikulas Patocka
From: Mikulas Patocka @ 2018-10-25 20:37 UTC
  To: amd-gfx@lists.freedesktop.org; +Cc: Alex Deucher



On Wed, 24 Oct 2018, Mikulas Patocka wrote:

> Hi
> 
> I have a Sapphire Pulse RX 570 ITX graphics card.
> 
> On Linux, I get errors "amdgpu: [powerplay] failed to send message 148 ret 
> is 0" and the system is stuck for several seconds when they happen. The 
> card works, except for these errors and occasional delays.

I've found that PP_PCIE_DPM_MASK causes these errors. If I turn this bit 
off in amdgpu.ppfeaturemask, the errors go away (and turning it off also 
fixes hibernation problems).

Should it be turned off automatically in response to these errors?

Mikulas

* Re: amdgpu: [powerplay] failed to send message 148 ret is 0
@ 2018-10-29 19:23 Alex Deucher
From: Alex Deucher @ 2018-10-29 19:23 UTC
  To: mpatocka@redhat.com; +Cc: Deucher, Alexander, amd-gfx list

On Thu, Oct 25, 2018 at 4:46 PM Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
>
> On Wed, 24 Oct 2018, Mikulas Patocka wrote:
>
> > Hi
> >
> > I have a Sapphire Pulse RX 570 ITX graphics card.
> >
> > On Linux, I get errors "amdgpu: [powerplay] failed to send message 148 ret
> > is 0" and the system is stuck for several seconds when they happen. The
> > card works, except for these errors and occasional delays.
>
> I've found that PP_PCIE_DPM_MASK causes these errors. If I turn this bit
> off in amdgpu.ppfeaturemask, the errors go away (and turning it off also
> fixes hibernation problems).
>
> Should it be turned off automatically in response to these errors?

What platform are you running on?  Are you running in a VM?  The
driver reads PCI config space on the bridge to determine the PCIe
gen and lane caps of the platform, and from those which clocks and
lanes are valid.  See amdgpu_device_get_pcie_info().  It would be good
to figure out why this is not working on your platform.

Alex

* Re: amdgpu: [powerplay] failed to send message 148 ret is 0
@ 2018-10-30 16:32 Mikulas Patocka
From: Mikulas Patocka @ 2018-10-30 16:32 UTC
  To: Alex Deucher; +Cc: Deucher, Alexander, amd-gfx list



On Mon, 29 Oct 2018, Alex Deucher wrote:

> On Thu, Oct 25, 2018 at 4:46 PM Mikulas Patocka <mpatocka@redhat.com> wrote:
> >
> >
> >
> > On Wed, 24 Oct 2018, Mikulas Patocka wrote:
> >
> > > Hi
> > >
> > > I have a Sapphire Pulse RX 570 ITX graphics card.
> > >
> > > On Linux, I get errors "amdgpu: [powerplay] failed to send message 148 ret
> > > is 0" and the system is stuck for several seconds when they happen. The
> > > card works, except for these errors and occasional delays.
> >
> > I've found that PP_PCIE_DPM_MASK causes these errors. If I turn this bit
> > off in amdgpu.ppfeaturemask, the errors go away (and turning it off also
> > fixes hibernation problems).
> >
> > Should it be turned off automatically in response to these errors?
> 
> What platform are you running on?  Are you running in a VM?  The
> driver reads PCI config space on the bridge to determine the PCIe
> gen and lane caps of the platform, and from those which clocks and
> lanes are valid.  See amdgpu_device_get_pcie_info().  It would be good
> to figure out why this is not working on your platform.
> 
> Alex

It's not a VM. It's an old dual Socket F motherboard with an HT2000 
north bridge and an HT1000 south bridge. It has two PCIe 1.0 x8 slots.

I've found the bug: pcie_get_speed_cap tests the lnkcap variable 
against values that are not bit masks, so the PCIe port is incorrectly 
reported as 8 GT/s capable. When I fix these tests, the errors are gone.

Mikulas

* Re: amdgpu: [powerplay] failed to send message 148 ret is 0
@ 2018-10-30 18:41 Alex Deucher
From: Alex Deucher @ 2018-10-30 18:41 UTC
  To: mpatocka@redhat.com; +Cc: Deucher, Alexander, amd-gfx list

Nice work.  Thanks for tracking this down!

Alex
On Tue, Oct 30, 2018 at 12:32 PM Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
>
> On Mon, 29 Oct 2018, Alex Deucher wrote:
>
> > On Thu, Oct 25, 2018 at 4:46 PM Mikulas Patocka <mpatocka@redhat.com> wrote:
> > >
> > >
> > >
> > > On Wed, 24 Oct 2018, Mikulas Patocka wrote:
> > >
> > > > Hi
> > > >
> > > > I have a Sapphire Pulse RX 570 ITX graphics card.
> > > >
> > > > On Linux, I get errors "amdgpu: [powerplay] failed to send message 148 ret
> > > > is 0" and the system is stuck for several seconds when they happen. The
> > > > card works, except for these errors and occasional delays.
> > >
> > > I've found that PP_PCIE_DPM_MASK causes these errors. If I turn this bit
> > > off in amdgpu.ppfeaturemask, the errors go away (and turning it off also
> > > fixes hibernation problems).
> > >
> > > Should it be turned off automatically in response to these errors?
> >
> > What platform are you running on?  Are you running in a VM?  The
> > driver reads PCI config space on the bridge to determine the PCIe
> > gen and lane caps of the platform, and from those which clocks and
> > lanes are valid.  See amdgpu_device_get_pcie_info().  It would be good
> > to figure out why this is not working on your platform.
> >
> > Alex
>
> It's not a VM. It's an old dual Socket F motherboard with an HT2000
> north bridge and an HT1000 south bridge. It has two PCIe 1.0 x8 slots.
>
> I've found the bug: pcie_get_speed_cap tests the lnkcap variable
> against values that are not bit masks, so the PCIe port is incorrectly
> reported as 8 GT/s capable. When I fix these tests, the errors are gone.
>
> Mikulas
