All of lore.kernel.org
 help / color / mirror / Atom feed
* 4.20.0-rc3 nouveau/Quadro P2000 Mobile: runpm causing ACPI errors, lockups
@ 2018-11-27  3:53 Michael S. Tsirkin
       [not found] ` <20181126221005-mutt-send-email-mst-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Michael S. Tsirkin @ 2018-11-27  3:53 UTC (permalink / raw)
  To: nouveau; +Cc: Dave Airlie, Mika Westerberg, Ben Skeggs, dri-devel

So a new thinkpad:
01:00.0 VGA compatible controller: NVIDIA Corporation GP107GLM [Quadro P2000 Mobile] (rev a1)

Hangs whenever I try to poke at the card. It starts happily enough with

[    3.971515] ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20181003/nsarguments-66)
[    3.971553] ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20181003/nsarguments-66)
[    3.971721] pci 0000:01:00.0: optimus capabilities: enabled, status dynamic power, hda bios codec supported
[    3.971726] VGA switcheroo: detected Optimus DSM method \_SB_.PCI0.PEG0.PEGP handle
[    3.971727] nouveau: detected PR support, will not use DSM
[    3.971745] nouveau 0000:01:00.0: enabling device (0006 -> 0007)
[    3.971923] nouveau 0000:01:00.0: NVIDIA GP107 (137000a1)
[    4.009875] PM: Image not found (code -22)
[    4.135752] nouveau 0000:01:00.0: DRM: VRAM: 4096 MiB
[    4.135753] nouveau 0000:01:00.0: DRM: GART: 536870912 MiB
[    4.135754] nouveau 0000:01:00.0: DRM: BIT table 'A' not found
[    4.135755] nouveau 0000:01:00.0: DRM: BIT table 'L' not found
[    4.135756] nouveau 0000:01:00.0: DRM: TMDS table version 2.0
[    4.135756] nouveau 0000:01:00.0: DRM: DCB version 4.1
[    4.135757] nouveau 0000:01:00.0: DRM: DCB outp 00: 02800f76 04600020
[    4.135758] nouveau 0000:01:00.0: DRM: DCB outp 01: 02011f62 00020010
[    4.135759] nouveau 0000:01:00.0: DRM: DCB outp 02: 01022f46 04600010
[    4.135760] nouveau 0000:01:00.0: DRM: DCB outp 03: 01033f56 04600020
[    4.135761] nouveau 0000:01:00.0: DRM: DCB conn 00: 00020047
[    4.135761] nouveau 0000:01:00.0: DRM: DCB conn 01: 00010161
[    4.135762] nouveau 0000:01:00.0: DRM: DCB conn 02: 00001246
[    4.135763] nouveau 0000:01:00.0: DRM: DCB conn 03: 00002346
[    4.508355] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[    4.508355] [drm] Driver supports precise vblank timestamp query.
[    4.509812] [drm] Cannot find any crtc or sizes
[    4.510144] [drm] Initialized nouveau 1.3.1 20120801 for 0000:01:00.0 on minor 2


Although that type mismatch is a bit worrying. And I'm not sure what
prints PM: Image not found.

But after a short while it gets pretty busy:


[   52.917009] No Local Variables are initialized for Method [NVPO]
[   52.917011] No Arguments are initialized for method [NVPO]
[   52.917012] ACPI Error: Method parse/execution failed \_SB.PCI0.PEG0.PEGP.NVPO, AE_AML_LOOP_TIMEOUT (20181003/psparse-516)
[   52.917063] ACPI Error: Method parse/execution failed \_SB.PCI0.PGON, AE_AML_LOOP_TIMEOUT (20181003/psparse-516)
[   52.917084] ACPI Error: Method parse/execution failed \_SB.PCI0.PEG0.PG00._ON, AE_AML_LOOP_TIMEOUT (20181003/psparse-516)
[   52.917108] acpi device:00: Failed to change power state to D0
[   52.969287] video LNXVIDEO:00: Cannot transition to power state D0 for parent in (unknown)
[   52.969289] pci_raw_set_power_state: 2 callbacks suppressed
[   52.969291] nouveau 0000:01:00.0: Refused to change power state, currently in D3
[   53.029514] video LNXVIDEO:00: Cannot transition to power state D0 for parent in (unknown)
[   53.041027] nouveau 0000:01:00.0: Refused to change power state, currently in D3
[   53.041035] video LNXVIDEO:00: Cannot transition to power state D0 for parent in (unknown)
[   53.053008] nouveau 0000:01:00.0: Refused to change power state, currently in D3


And then kernel proceeds to throw up errors at random places, e.g.

[   67.021892] cfg80211: failed to load regulatory.db
[   67.021895] cfg80211: failed to load regulatory.db
[   67.021897] cfg80211: failed to load regulatory.db
[   67.021900] cfg80211: failed to load regulatory.db
[   67.021927] cfg80211: failed to load regulatory.db
[   67.021928] cfg80211: failed to load regulatory.db
[   67.021932] cfg80211: failed to load regulatory.db
[   67.021934] cfg80211: failed to load regulatory.db
[   67.024463] cfg80211: failed to load regulatory.db
[   99.980625] iwlwifi 0000:00:14.3: Error sending STATISTICS_CMD: time out after 2000ms.

followed by soft lockups and sometimes hard lockups in places
like attempts to walk skb lists.

Adding runpm=0 does away with this issue.

The specific test was with noaccel=1 - it does not seem to change
things for me.

I poked at the ACPI method NVPO and yes it does actually
seem to execute a while loop waiting for some register
to become 0. Which I guess never happens? Because card
is in a low power state and so reads return ffffffff maybe?

X isn't happy even with runpm=0 but that might be a different
issue - I thought runpm=0 might be an easier place to start debugging
things given there are logs of the failure.

Using 4.20.0-rc3 right now.

Userspace bits are from fedora 29:
xorg-x11-drv-nouveau-1.0.15-6.fc29.x86_64

firmware is pretty recent:
linux-firmware-20181008-88.gitc6b6265d.fc29.noarch

More hints for debugging would be appreciated.
If anyone wants me to play with a different kernel tree,
let me know.

-- 
MST
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: 4.20.0-rc3 nouveau/Quadro P2000 Mobile: runpm causing ACPI errors, lockups
       [not found] ` <20181126221005-mutt-send-email-mst-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2018-11-27  9:36   ` Mika Westerberg
       [not found]     ` <20181127093650.GP2296-3PARRvDOhMZrdx17CPfAsdBPR1lH4CV8@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Mika Westerberg @ 2018-11-27  9:36 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	linux-acpi-u79uwXL29TY76Z2rM5mHXA, Ben Skeggs,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Dave Airlie

+linux-acpi

Hi Michael,

On Mon, Nov 26, 2018 at 10:53:26PM -0500, Michael S. Tsirkin wrote:
> So a new thinkpad:
> 01:00.0 VGA compatible controller: NVIDIA Corporation GP107GLM [Quadro P2000 Mobile] (rev a1)
> 
> Hangs whenever I try to poke at the card. It starts happily enough with
> 
> [    3.971515] ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20181003/nsarguments-66)
> [    3.971553] ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20181003/nsarguments-66)
> [    3.971721] pci 0000:01:00.0: optimus capabilities: enabled, status dynamic power, hda bios codec supported
> [    3.971726] VGA switcheroo: detected Optimus DSM method \_SB_.PCI0.PEG0.PEGP handle
> [    3.971727] nouveau: detected PR support, will not use DSM
> [    3.971745] nouveau 0000:01:00.0: enabling device (0006 -> 0007)
> [    3.971923] nouveau 0000:01:00.0: NVIDIA GP107 (137000a1)
> [    4.009875] PM: Image not found (code -22)
> [    4.135752] nouveau 0000:01:00.0: DRM: VRAM: 4096 MiB
> [    4.135753] nouveau 0000:01:00.0: DRM: GART: 536870912 MiB
> [    4.135754] nouveau 0000:01:00.0: DRM: BIT table 'A' not found
> [    4.135755] nouveau 0000:01:00.0: DRM: BIT table 'L' not found
> [    4.135756] nouveau 0000:01:00.0: DRM: TMDS table version 2.0
> [    4.135756] nouveau 0000:01:00.0: DRM: DCB version 4.1
> [    4.135757] nouveau 0000:01:00.0: DRM: DCB outp 00: 02800f76 04600020
> [    4.135758] nouveau 0000:01:00.0: DRM: DCB outp 01: 02011f62 00020010
> [    4.135759] nouveau 0000:01:00.0: DRM: DCB outp 02: 01022f46 04600010
> [    4.135760] nouveau 0000:01:00.0: DRM: DCB outp 03: 01033f56 04600020
> [    4.135761] nouveau 0000:01:00.0: DRM: DCB conn 00: 00020047
> [    4.135761] nouveau 0000:01:00.0: DRM: DCB conn 01: 00010161
> [    4.135762] nouveau 0000:01:00.0: DRM: DCB conn 02: 00001246
> [    4.135763] nouveau 0000:01:00.0: DRM: DCB conn 03: 00002346
> [    4.508355] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
> [    4.508355] [drm] Driver supports precise vblank timestamp query.
> [    4.509812] [drm] Cannot find any crtc or sizes
> [    4.510144] [drm] Initialized nouveau 1.3.1 20120801 for 0000:01:00.0 on minor 2
> 
> 
> Although that type mismatch is a bit worrying. And I'm not sure what
> prints PM: Image not found.

That is fine, it just says it cannot find a hibernation image from swap
device. I guess you have resume=... in the kernel command line.

> But after a short while it gets pretty busy:
> 
> 
> [   52.917009] No Local Variables are initialized for Method [NVPO]
> [   52.917011] No Arguments are initialized for method [NVPO]
> [   52.917012] ACPI Error: Method parse/execution failed \_SB.PCI0.PEG0.PEGP.NVPO, AE_AML_LOOP_TIMEOUT (20181003/psparse-516)
> [   52.917063] ACPI Error: Method parse/execution failed \_SB.PCI0.PGON, AE_AML_LOOP_TIMEOUT (20181003/psparse-516)
> [   52.917084] ACPI Error: Method parse/execution failed \_SB.PCI0.PEG0.PG00._ON, AE_AML_LOOP_TIMEOUT (20181003/psparse-516)
> [   52.917108] acpi device:00: Failed to change power state to D0

Here it seems to fail to turn on the power resource for the device.

> [   52.969287] video LNXVIDEO:00: Cannot transition to power state D0 for parent in (unknown)
> [   52.969289] pci_raw_set_power_state: 2 callbacks suppressed
> [   52.969291] nouveau 0000:01:00.0: Refused to change power state, currently in D3
> [   53.029514] video LNXVIDEO:00: Cannot transition to power state D0 for parent in (unknown)
> [   53.041027] nouveau 0000:01:00.0: Refused to change power state, currently in D3
> [   53.041035] video LNXVIDEO:00: Cannot transition to power state D0 for parent in (unknown)
> [   53.053008] nouveau 0000:01:00.0: Refused to change power state, currently in D3
> 
> 
> And then kernel proceeds to throw up errors at random places, e.g.
> 
> [   67.021892] cfg80211: failed to load regulatory.db
> [   67.021895] cfg80211: failed to load regulatory.db
> [   67.021897] cfg80211: failed to load regulatory.db
> [   67.021900] cfg80211: failed to load regulatory.db
> [   67.021927] cfg80211: failed to load regulatory.db
> [   67.021928] cfg80211: failed to load regulatory.db
> [   67.021932] cfg80211: failed to load regulatory.db
> [   67.021934] cfg80211: failed to load regulatory.db
> [   67.024463] cfg80211: failed to load regulatory.db
> [   99.980625] iwlwifi 0000:00:14.3: Error sending STATISTICS_CMD: time out after 2000ms.
> 
> followed by soft lockups and sometimes hard lockups in places
> like attempts to walk skb lists.
> 
> Adding runpm=0 does away with this issue.
> 
> The specific test was with noaccel=1 - it does not seem to change
> things for me.
> 
> I poked at the ACPI method NVPO and yes it does actually
> seem to execute a while loop waiting for some register
> to become 0. Which I guess never happens? Because card
> is in a low power state and so reads return ffffffff maybe?

Yes, it could be the case.

> X isn't happy even with runpm=0 but that might be a different
> issue - I thought runpm=0 might be an easier place to start debugging
> things given there are logs of the failure.
> 
> Using 4.20.0-rc3 right now.
> 
> Userspace bits are from fedora 29:
> xorg-x11-drv-nouveau-1.0.15-6.fc29.x86_64
> 
> firmware is pretty recent:
> linux-firmware-20181008-88.gitc6b6265d.fc29.noarch
> 
> More hints for debugging would be appreciated.
> If anyone wants me to play with a different kernel tree,
> let me know.

Can you share full dmesg and acpidump of the system? I would like to
check the power resource methods.
_______________________________________________
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: 4.20.0-rc3 nouveau/Quadro P2000 Mobile: runpm causing ACPI errors, lockups
       [not found]     ` <20181127093650.GP2296-3PARRvDOhMZrdx17CPfAsdBPR1lH4CV8@public.gmane.org>
@ 2018-11-28  2:49       ` Michael S. Tsirkin
       [not found]         ` <20181127212550-mutt-send-email-mst-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Michael S. Tsirkin @ 2018-11-28  2:49 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Rafael J. Wysocki,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	linux-acpi-u79uwXL29TY76Z2rM5mHXA, Ben Skeggs, Dave Airlie,
	Len Brown

On Tue, Nov 27, 2018 at 11:36:50AM +0200, Mika Westerberg wrote:
> +linux-acpi
> 
> Hi Michael,
> 
> On Mon, Nov 26, 2018 at 10:53:26PM -0500, Michael S. Tsirkin wrote:
> > So a new thinkpad:
> > 01:00.0 VGA compatible controller: NVIDIA Corporation GP107GLM [Quadro P2000 Mobile] (rev a1)
> > 
> > Hangs whenever I try to poke at the card. It starts happily enough with
> > 
> > [    3.971515] ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20181003/nsarguments-66)
> > [    3.971553] ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20181003/nsarguments-66)
> > [    3.971721] pci 0000:01:00.0: optimus capabilities: enabled, status dynamic power, hda bios codec supported
> > [    3.971726] VGA switcheroo: detected Optimus DSM method \_SB_.PCI0.PEG0.PEGP handle
> > [    3.971727] nouveau: detected PR support, will not use DSM

BTW this is also a bit strange. It says it will not use DSM - but did
it maybe use DSM previously? The ACPI Warning seems to indicate that
it did ...

And just to complete the picture here's the _DSM from ACPI:

        Method (_DSM, 4, Serialized)  // _DSM: Device-Specific Method
        {
            If ((Arg0 == ToUUID ("3e5b41c6-eb1d-4260-9d15-c71fbadae414")))
            {
                Switch (ToInteger (Arg2))
                {
                    Case (Zero)
                    {
                        If ((Arg1 == One))
                        {
                            Return (0x0001E7FF)
                        }
                    }
                    Case (One)
                    {
                        If ((Arg1 == One))
                        {
                            If (((S0ID == One) && (OSYS < 0x07DF)))
                            {
                                If (((DerefOf (Arg3 [Zero]) & 0xFF) == One))
                                {
                                    \GUAM (One)
                                }

                                Local0 = (DerefOf (Arg3 [One]) & 0xFF)
                                If ((Local0 == Zero))
                                {
                                    \GUAM (Zero)
                                }
                            }

                            If ((DerefOf (Arg3 [Zero]) == Zero))
                            {
                                Local0 = CLID /* \_SB_.PCI0.GFX0.CLID */
                                If ((0x80000000 & Local0))
                                {
                                    CLID &= 0x0F
                                    GLID (CLID)
                                }
                            }

                            Return (One)
                        }
                    }
                    Case (0x02)
                    {
                        If ((Arg1 == One))
                        {
                            Return (One)
                        }
                    }
                    Case (0x03)
                    {
                        If ((Arg1 == One))
                        {
                            Return (One)
                        }
                    }
                    Case (0x04)
                    {
                        If ((Arg1 == One))
                        {
                            Return (One)
                        }
                    }
                    Case (0x05)
                    {
                        If ((Arg1 == One))
                        {
                            Return (One)
                        }
                    }
                    Case (0x06)
                    {
                        If ((Arg1 == One))
                        {
                            Return (One)
                        }
                    }
                    Case (0x07)
                    {
                        If ((Arg1 == One))
                        {
                            IBTT = (DerefOf (Arg3 [Zero]) & 0xFF)
                            Return (One)
                        }
                    }
                    Case (0x08)
                    {
                        If ((Arg1 == One))
                        {
                            IPSC = (DerefOf (Arg3 [Zero]) & 0xFF)
                            If ((DerefOf (Arg3 [One]) & 0xFF))
                            {
                                IPAT = (DerefOf (Arg3 [One]) & 0xFF)
                                IPAT--
                            }

                            IBIA = ((DerefOf (Arg3 [0x02]) >> 0x04) & 0x07)
                            Return (One)
                        }
                    }
                    Case (0x09)
                    {
                        If ((Arg1 == One))
                        {
                            Return (One)
                        }
                    }
                    Case (0x0A)
                    {
                        If ((Arg1 == One))
                        {
                            Local0 = (LIDS << 0x08)
                            Local0 += 0x0100
                            Return (Local0)
                        }
                    }
                    Case (0x0D)
                    {
                        If ((Arg1 == One))
                        {
                            Local0 = ((DerefOf (Arg3 [0x03]) << 0x18) | (DerefOf (
                                Arg3 [0x02]) << 0x10))
                            Local0 &= 0xEFFF0000
                            Local0 &= (DerefOf (DBTB [IBTT]) << 0x10)
                            Local0 |= IBTT
                            Return (Local0)
                        }
                    }
                    Case (0x0E)
                    {
                        If ((Arg1 == One))
                        {
                            Local0 = IPSC /* \IPSC */
                            Local0 |= (IPAT << 0x08)
                            Local0 += 0x0100
                            Local0 |= (LIDS << 0x10)
                            Local0 += 0x00010000
                            Local0 |= (IBIA << 0x14)
                            Return (Local0)
                        }
                    }
                    Case (0x0F)
                    {
                        If ((Arg1 == One))
                        {
                            Local0 = GIVD /* \_SB_.PCI0.GFX0.GIVD */
                            Local0 ^= One
                            Local0 |= (GMFN << One)
                            Local0 |= 0x1800
                            Local0 |= (IDMS << 0x11)
                            Local0 |= (DerefOf (DerefOf (CDCT [HVCO]) [CDVL]) << 
                                0x15)
                            Return (Local0)
                        }
                    }
                    Case (0x10)
                    {
                        If ((Arg1 == One))
                        {
                            Name (KSVP, Package (0x02)
                            {
                                0x80000000, 
                                0x8000
                            })
                            KSVP [Zero] = KSV0 /* \KSV0 */
                            KSVP [One] = KSV1 /* \KSV1 */
                            Return (KSVP) /* \_SB_.PCI0.GFX0._DSM.KSVP */
                        }
                    }

                }
            }

            Return (Buffer (One)
            {
                 0x00                                             // .
            })
        }


I am not sure what makes it think that Arg3 is a buffer and
not a package: IIRC Index (decoded C-style here as []) can apply
to either.

Am I maybe misunderstanding the warning?


> > [    3.971745] nouveau 0000:01:00.0: enabling device (0006 -> 0007)
> > [    3.971923] nouveau 0000:01:00.0: NVIDIA GP107 (137000a1)
> > [    4.009875] PM: Image not found (code -22)
> > [    4.135752] nouveau 0000:01:00.0: DRM: VRAM: 4096 MiB
> > [    4.135753] nouveau 0000:01:00.0: DRM: GART: 536870912 MiB
> > [    4.135754] nouveau 0000:01:00.0: DRM: BIT table 'A' not found
> > [    4.135755] nouveau 0000:01:00.0: DRM: BIT table 'L' not found
> > [    4.135756] nouveau 0000:01:00.0: DRM: TMDS table version 2.0
> > [    4.135756] nouveau 0000:01:00.0: DRM: DCB version 4.1
> > [    4.135757] nouveau 0000:01:00.0: DRM: DCB outp 00: 02800f76 04600020
> > [    4.135758] nouveau 0000:01:00.0: DRM: DCB outp 01: 02011f62 00020010
> > [    4.135759] nouveau 0000:01:00.0: DRM: DCB outp 02: 01022f46 04600010
> > [    4.135760] nouveau 0000:01:00.0: DRM: DCB outp 03: 01033f56 04600020
> > [    4.135761] nouveau 0000:01:00.0: DRM: DCB conn 00: 00020047
> > [    4.135761] nouveau 0000:01:00.0: DRM: DCB conn 01: 00010161
> > [    4.135762] nouveau 0000:01:00.0: DRM: DCB conn 02: 00001246
> > [    4.135763] nouveau 0000:01:00.0: DRM: DCB conn 03: 00002346
> > [    4.508355] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
> > [    4.508355] [drm] Driver supports precise vblank timestamp query.
> > [    4.509812] [drm] Cannot find any crtc or sizes
> > [    4.510144] [drm] Initialized nouveau 1.3.1 20120801 for 0000:01:00.0 on minor 2
> > 
> > 
> > Although that type mismatch is a bit worrying. And I'm not sure what
> > prints PM: Image not found.
> 
> That is fine, it just says it cannot find a hibernation image from swap
> device. I guess you have resume=... in the kernel command line.
> 
> > But after a short while it gets pretty busy:
> > 
> > 
> > [   52.917009] No Local Variables are initialized for Method [NVPO]
> > [   52.917011] No Arguments are initialized for method [NVPO]
> > [   52.917012] ACPI Error: Method parse/execution failed \_SB.PCI0.PEG0.PEGP.NVPO, AE_AML_LOOP_TIMEOUT (20181003/psparse-516)
> > [   52.917063] ACPI Error: Method parse/execution failed \_SB.PCI0.PGON, AE_AML_LOOP_TIMEOUT (20181003/psparse-516)
> > [   52.917084] ACPI Error: Method parse/execution failed \_SB.PCI0.PEG0.PG00._ON, AE_AML_LOOP_TIMEOUT (20181003/psparse-516)
> > [   52.917108] acpi device:00: Failed to change power state to D0
> 
> Here it seems to fail to turn on the power resource for the device.
>
> > [   52.969287] video LNXVIDEO:00: Cannot transition to power state D0 for parent in (unknown)
> > [   52.969289] pci_raw_set_power_state: 2 callbacks suppressed
> > [   52.969291] nouveau 0000:01:00.0: Refused to change power state, currently in D3
> > [   53.029514] video LNXVIDEO:00: Cannot transition to power state D0 for parent in (unknown)
> > [   53.041027] nouveau 0000:01:00.0: Refused to change power state, currently in D3
> > [   53.041035] video LNXVIDEO:00: Cannot transition to power state D0 for parent in (unknown)
> > [   53.053008] nouveau 0000:01:00.0: Refused to change power state, currently in D3
> > 
> > 
> > And then kernel proceeds to throw up errors at random places, e.g.
> > 
> > [   67.021892] cfg80211: failed to load regulatory.db
> > [   67.021895] cfg80211: failed to load regulatory.db
> > [   67.021897] cfg80211: failed to load regulatory.db
> > [   67.021900] cfg80211: failed to load regulatory.db
> > [   67.021927] cfg80211: failed to load regulatory.db
> > [   67.021928] cfg80211: failed to load regulatory.db
> > [   67.021932] cfg80211: failed to load regulatory.db
> > [   67.021934] cfg80211: failed to load regulatory.db
> > [   67.024463] cfg80211: failed to load regulatory.db
> > [   99.980625] iwlwifi 0000:00:14.3: Error sending STATISTICS_CMD: time out after 2000ms.

BTW this one might indicate that somehow iwlwifi got powered
down too somehow. It's at

00:14.3 Network controller: Intel Corporation Wireless-AC 9560 [Jefferson Peak] (rev 10)

so really shouldn't be affected, but go figure. If driver really is getting
all-ones from the device, it just might try to poke at a wrong b:d.f by mistake
maybe ...

> > followed by soft lockups and sometimes hard lockups in places
> > like attempts to walk skb lists.
> > 
> > Adding runpm=0 does away with this issue.
> > 
> > The specific test was with noaccel=1 - it does not seem to change
> > things for me.
> > 
> > I poked at the ACPI method NVPO and yes it does actually
> > seem to execute a while loop waiting for some register
> > to become 0. Which I guess never happens? Because card
> > is in a low power state and so reads return ffffffff maybe?
> 
> Yes, it could be the case.
> 
> > X isn't happy even with runpm=0 but that might be a different
> > issue - I thought runpm=0 might be an easier place to start debugging
> > things given there are logs of the failure.
> > 
> > Using 4.20.0-rc3 right now.
> > 
> > Userspace bits are from fedora 29:
> > xorg-x11-drv-nouveau-1.0.15-6.fc29.x86_64
> > 
> > firmware is pretty recent:
> > linux-firmware-20181008-88.gitc6b6265d.fc29.noarch
> > 
> > More hints for debugging would be appreciated.
> > If anyone wants me to play with a different kernel tree,
> > let me know.
> 
> Can you share full dmesg and acpidump of the system? I would like to
> check the power resource methods.

Pls find it as an attachment here:
https://bugs.freedesktop.org/show_bug.cgi?id=108873

If you like I can send it to you directly too -
spamming the list with it isn't helpful I guess?

-- 
MST
_______________________________________________
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: 4.20.0-rc3 nouveau/Quadro P2000 Mobile: runpm causing ACPI errors, lockups
       [not found]         ` <20181127212550-mutt-send-email-mst-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2018-11-28 11:08           ` Mika Westerberg
       [not found]             ` <20181128110857.GW2296-3PARRvDOhMZrdx17CPfAsdBPR1lH4CV8@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Mika Westerberg @ 2018-11-28 11:08 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Rafael J. Wysocki,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	linux-acpi-u79uwXL29TY76Z2rM5mHXA, Ben Skeggs, Dave Airlie,
	Len Brown

On Tue, Nov 27, 2018 at 09:49:44PM -0500, Michael S. Tsirkin wrote:
> On Tue, Nov 27, 2018 at 11:36:50AM +0200, Mika Westerberg wrote:
> > +linux-acpi
> > 
> > Hi Michael,
> > 
> > On Mon, Nov 26, 2018 at 10:53:26PM -0500, Michael S. Tsirkin wrote:
> > > So a new thinkpad:
> > > 01:00.0 VGA compatible controller: NVIDIA Corporation GP107GLM [Quadro P2000 Mobile] (rev a1)
> > > 
> > > Hangs whenever I try to poke at the card. It starts happily enough with
> > > 
> > > [    3.971515] ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20181003/nsarguments-66)
> > > [    3.971553] ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20181003/nsarguments-66)
> > > [    3.971721] pci 0000:01:00.0: optimus capabilities: enabled, status dynamic power, hda bios codec supported
> > > [    3.971726] VGA switcheroo: detected Optimus DSM method \_SB_.PCI0.PEG0.PEGP handle
> > > [    3.971727] nouveau: detected PR support, will not use DSM
> 
> BTW this is also a bit strange. It says it will not use DSM - but did
> it maybe use DSM previously? The ACPI Warning seems to indicate that
> it did ...
> 
> And just to complete the picture here's the _DSM from ACPI:
> 
>         Method (_DSM, 4, Serialized)  // _DSM: Device-Specific Method
>         {

...

>         }
> 
> 
> I am not sure what makes it think that Arg3 is a buffer and
> not a package: IIRC Index (decoded C-style here as []) can apply
> to either.
> 
> Am I maybe misunderstanding the warning?

It looks like coming from the nouveau driver (assuming I'm reading it right).

drivers/gpu/drm/nouveau/nouveau_acpi.c::nouveau_optimus_dsm()

        union acpi_object argv4 = {
                .buffer.type = ACPI_TYPE_BUFFER,
                .buffer.length = 4,
                .buffer.pointer = args_buff
        };

        ...

        obj = acpi_evaluate_dsm_typed(handle, &nouveau_op_dsm_muid, 0x00000100,
                                      func, &argv4, ACPI_TYPE_BUFFER);


It passes ACPI_TYPE_BUFFER but ACPI spec _DSM expects package.

> > > [    3.971745] nouveau 0000:01:00.0: enabling device (0006 -> 0007)
> > > [    3.971923] nouveau 0000:01:00.0: NVIDIA GP107 (137000a1)
> > > [    4.009875] PM: Image not found (code -22)
> > > [    4.135752] nouveau 0000:01:00.0: DRM: VRAM: 4096 MiB
> > > [    4.135753] nouveau 0000:01:00.0: DRM: GART: 536870912 MiB
> > > [    4.135754] nouveau 0000:01:00.0: DRM: BIT table 'A' not found
> > > [    4.135755] nouveau 0000:01:00.0: DRM: BIT table 'L' not found
> > > [    4.135756] nouveau 0000:01:00.0: DRM: TMDS table version 2.0
> > > [    4.135756] nouveau 0000:01:00.0: DRM: DCB version 4.1
> > > [    4.135757] nouveau 0000:01:00.0: DRM: DCB outp 00: 02800f76 04600020
> > > [    4.135758] nouveau 0000:01:00.0: DRM: DCB outp 01: 02011f62 00020010
> > > [    4.135759] nouveau 0000:01:00.0: DRM: DCB outp 02: 01022f46 04600010
> > > [    4.135760] nouveau 0000:01:00.0: DRM: DCB outp 03: 01033f56 04600020
> > > [    4.135761] nouveau 0000:01:00.0: DRM: DCB conn 00: 00020047
> > > [    4.135761] nouveau 0000:01:00.0: DRM: DCB conn 01: 00010161
> > > [    4.135762] nouveau 0000:01:00.0: DRM: DCB conn 02: 00001246
> > > [    4.135763] nouveau 0000:01:00.0: DRM: DCB conn 03: 00002346
> > > [    4.508355] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
> > > [    4.508355] [drm] Driver supports precise vblank timestamp query.
> > > [    4.509812] [drm] Cannot find any crtc or sizes
> > > [    4.510144] [drm] Initialized nouveau 1.3.1 20120801 for 0000:01:00.0 on minor 2
> > > 
> > > 
> > > Although that type mismatch is a bit worrying. And I'm not sure what
> > > prints PM: Image not found.
> > 
> > That is fine, it just says it cannot find a hibernation image from swap
> > device. I guess you have resume=... in the kernel command line.
> > 
> > > But after a short while it gets pretty busy:
> > > 
> > > 
> > > [   52.917009] No Local Variables are initialized for Method [NVPO]
> > > [   52.917011] No Arguments are initialized for method [NVPO]
> > > [   52.917012] ACPI Error: Method parse/execution failed \_SB.PCI0.PEG0.PEGP.NVPO, AE_AML_LOOP_TIMEOUT (20181003/psparse-516)
> > > [   52.917063] ACPI Error: Method parse/execution failed \_SB.PCI0.PGON, AE_AML_LOOP_TIMEOUT (20181003/psparse-516)
> > > [   52.917084] ACPI Error: Method parse/execution failed \_SB.PCI0.PEG0.PG00._ON, AE_AML_LOOP_TIMEOUT (20181003/psparse-516)
> > > [   52.917108] acpi device:00: Failed to change power state to D0
> > 
> > Here it seems to fail to turn on the power resource for the device.
> >
> > > [   52.969287] video LNXVIDEO:00: Cannot transition to power state D0 for parent in (unknown)
> > > [   52.969289] pci_raw_set_power_state: 2 callbacks suppressed
> > > [   52.969291] nouveau 0000:01:00.0: Refused to change power state, currently in D3
> > > [   53.029514] video LNXVIDEO:00: Cannot transition to power state D0 for parent in (unknown)
> > > [   53.041027] nouveau 0000:01:00.0: Refused to change power state, currently in D3
> > > [   53.041035] video LNXVIDEO:00: Cannot transition to power state D0 for parent in (unknown)
> > > [   53.053008] nouveau 0000:01:00.0: Refused to change power state, currently in D3
> > > 
> > > 
> > > And then kernel proceeds to throw up errors at random places, e.g.
> > > 
> > > [   67.021892] cfg80211: failed to load regulatory.db
> > > [   67.021895] cfg80211: failed to load regulatory.db
> > > [   67.021897] cfg80211: failed to load regulatory.db
> > > [   67.021900] cfg80211: failed to load regulatory.db
> > > [   67.021927] cfg80211: failed to load regulatory.db
> > > [   67.021928] cfg80211: failed to load regulatory.db
> > > [   67.021932] cfg80211: failed to load regulatory.db
> > > [   67.021934] cfg80211: failed to load regulatory.db
> > > [   67.024463] cfg80211: failed to load regulatory.db
> > > [   99.980625] iwlwifi 0000:00:14.3: Error sending STATISTICS_CMD: time out after 2000ms.
> 
> BTW this one might indicate that somehow iwlwifi got powered
> down too somehow. It's at
> 
> 00:14.3 Network controller: Intel Corporation Wireless-AC 9560 [Jefferson Peak] (rev 10)
> 
> so really shouldn't be affected, but go figure. If driver really is getting
> all-ones from the device, it just might try to poke at a wrong b:d.f by mistake
> maybe ...

Or it the power resource is shared by wifi as well.

> > > followed by soft lockups and sometimes hard lockups in places
> > > like attempts to walk skb lists.
> > > 
> > > Adding runpm=0 does away with this issue.
> > > 
> > > The specific test was with noaccel=1 - it does not seem to change
> > > things for me.
> > > 
> > > I poked at the ACPI method NVPO and yes it does actually
> > > seem to execute a while loop waiting for some register
> > > to become 0. Which I guess never happens? Because card
> > > is in a low power state and so reads return ffffffff maybe?
> > 
> > Yes, it could be the case.
> > 
> > > X isn't happy even with runpm=0 but that might be a different
> > > issue - I thought runpm=0 might be an easier place to start debugging
> > > things given there are logs of the failure.
> > > 
> > > Using 4.20.0-rc3 right now.
> > > 
> > > Userspace bits are from fedora 29:
> > > xorg-x11-drv-nouveau-1.0.15-6.fc29.x86_64
> > > 
> > > firmware is pretty recent:
> > > linux-firmware-20181008-88.gitc6b6265d.fc29.noarch
> > > 
> > > More hints for debugging would be appreciated.
> > > If anyone wants me to play with a different kernel tree,
> > > let me know.
> > 
> > Can you share full dmesg and acpidump of the system? I would like to
> > check the power resource methods.
> 
> Pls find it as an attachment here:
> https://bugs.freedesktop.org/show_bug.cgi?id=108873
> 
> If you like I can send it to you directly too -
> spamming the list with it isn't helpful I guess?

No need to send, I can read it from the bugzilla just fine. Can you attach
acpidump there as well?
_______________________________________________
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: 4.20.0-rc3 nouveau/Quadro P2000 Mobile: runpm causing ACPI errors, lockups
       [not found]             ` <20181128110857.GW2296-3PARRvDOhMZrdx17CPfAsdBPR1lH4CV8@public.gmane.org>
@ 2018-11-28 15:09               ` Michael S. Tsirkin
       [not found]                 ` <20181128093527-mutt-send-email-mst-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Michael S. Tsirkin @ 2018-11-28 15:09 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Rafael J. Wysocki,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	linux-acpi-u79uwXL29TY76Z2rM5mHXA, Ben Skeggs, Dave Airlie,
	Len Brown

On Wed, Nov 28, 2018 at 01:08:57PM +0200, Mika Westerberg wrote:
> On Tue, Nov 27, 2018 at 09:49:44PM -0500, Michael S. Tsirkin wrote:
> > On Tue, Nov 27, 2018 at 11:36:50AM +0200, Mika Westerberg wrote:
> > > +linux-acpi
> > > 
> > > Hi Michael,
> > > 
> > > On Mon, Nov 26, 2018 at 10:53:26PM -0500, Michael S. Tsirkin wrote:
> > > > So a new thinkpad:
> > > > 01:00.0 VGA compatible controller: NVIDIA Corporation GP107GLM [Quadro P2000 Mobile] (rev a1)
> > > > 
> > > > Hangs whenever I try to poke at the card. It starts happily enough with
> > > > 
> > > > [    3.971515] ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20181003/nsarguments-66)
> > > > [    3.971553] ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20181003/nsarguments-66)
> > > > [    3.971721] pci 0000:01:00.0: optimus capabilities: enabled, status dynamic power, hda bios codec supported
> > > > [    3.971726] VGA switcheroo: detected Optimus DSM method \_SB_.PCI0.PEG0.PEGP handle
> > > > [    3.971727] nouveau: detected PR support, will not use DSM
> > 
> > BTW this is also a bit strange. It says it will not use DSM - but did
> > it maybe use DSM previously? The ACPI Warning seems to indicate that
> > it did ...
> > 
> > And just to complete the picture here's the _DSM from ACPI:
> > 
> >         Method (_DSM, 4, Serialized)  // _DSM: Device-Specific Method
> >         {
> 
> ...
> 
> >         }
> > 
> > 
> > I am not sure what makes it think that Arg3 is a buffer and
> > not a package: IIRC Index (decoded C-style here as []) can apply
> > to either.
> > 
> > Am I maybe misunderstanding the warning?
> 
> It looks like coming from the nouveau driver (assuming I'm reading it right).
> 
> drivers/gpu/drm/nouveau/nouveau_acpi.c::nouveau_optimus_dsm()
> 
>         union acpi_object argv4 = {
>                 .buffer.type = ACPI_TYPE_BUFFER,
>                 .buffer.length = 4,
>                 .buffer.pointer = args_buff
>         };
> 
>         ...
> 
>         obj = acpi_evaluate_dsm_typed(handle, &nouveau_op_dsm_muid, 0x00000100,
>                                       func, &argv4, ACPI_TYPE_BUFFER);
> 
> 
> It passes ACPI_TYPE_BUFFER but ACPI spec _DSM expects package.


Yea all this is weird, in particular I wonder why does everyone
using dsm insists on saying Arg4
when they actually mean Arg3. ACPI numbers arguments from 0.

So it's a bit ugly, and maybe worth fixing but unlikely to be
an actual issue simply because we end up not using DSM in the end.

Poking at the probing code in nouveau_pr3_present, I started to wonder:
should I try to hack it to disable d3cold and pr3 and see what
happens?



> > > > [    3.971745] nouveau 0000:01:00.0: enabling device (0006 -> 0007)
> > > > [    3.971923] nouveau 0000:01:00.0: NVIDIA GP107 (137000a1)
> > > > [    4.009875] PM: Image not found (code -22)
> > > > [    4.135752] nouveau 0000:01:00.0: DRM: VRAM: 4096 MiB
> > > > [    4.135753] nouveau 0000:01:00.0: DRM: GART: 536870912 MiB
> > > > [    4.135754] nouveau 0000:01:00.0: DRM: BIT table 'A' not found
> > > > [    4.135755] nouveau 0000:01:00.0: DRM: BIT table 'L' not found
> > > > [    4.135756] nouveau 0000:01:00.0: DRM: TMDS table version 2.0
> > > > [    4.135756] nouveau 0000:01:00.0: DRM: DCB version 4.1
> > > > [    4.135757] nouveau 0000:01:00.0: DRM: DCB outp 00: 02800f76 04600020
> > > > [    4.135758] nouveau 0000:01:00.0: DRM: DCB outp 01: 02011f62 00020010
> > > > [    4.135759] nouveau 0000:01:00.0: DRM: DCB outp 02: 01022f46 04600010
> > > > [    4.135760] nouveau 0000:01:00.0: DRM: DCB outp 03: 01033f56 04600020
> > > > [    4.135761] nouveau 0000:01:00.0: DRM: DCB conn 00: 00020047
> > > > [    4.135761] nouveau 0000:01:00.0: DRM: DCB conn 01: 00010161
> > > > [    4.135762] nouveau 0000:01:00.0: DRM: DCB conn 02: 00001246
> > > > [    4.135763] nouveau 0000:01:00.0: DRM: DCB conn 03: 00002346
> > > > [    4.508355] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
> > > > [    4.508355] [drm] Driver supports precise vblank timestamp query.
> > > > [    4.509812] [drm] Cannot find any crtc or sizes
> > > > [    4.510144] [drm] Initialized nouveau 1.3.1 20120801 for 0000:01:00.0 on minor 2
> > > > 
> > > > 
> > > > Although that type mismatch is a bit worrying. And I'm not sure what
> > > > prints PM: Image not found.
> > > 
> > > That is fine, it just says it cannot find a hibernation image from swap
> > > device. I guess you have resume=... in the kernel command line.
> > > 
> > > > But after a short while it gets pretty busy:
> > > > 
> > > > 
> > > > [   52.917009] No Local Variables are initialized for Method [NVPO]
> > > > [   52.917011] No Arguments are initialized for method [NVPO]
> > > > [   52.917012] ACPI Error: Method parse/execution failed \_SB.PCI0.PEG0.PEGP.NVPO, AE_AML_LOOP_TIMEOUT (20181003/psparse-516)
> > > > [   52.917063] ACPI Error: Method parse/execution failed \_SB.PCI0.PGON, AE_AML_LOOP_TIMEOUT (20181003/psparse-516)
> > > > [   52.917084] ACPI Error: Method parse/execution failed \_SB.PCI0.PEG0.PG00._ON, AE_AML_LOOP_TIMEOUT (20181003/psparse-516)
> > > > [   52.917108] acpi device:00: Failed to change power state to D0
> > > 
> > > Here it seems to fail to turn on the power resource for the device.
> > >
> > > > [   52.969287] video LNXVIDEO:00: Cannot transition to power state D0 for parent in (unknown)
> > > > [   52.969289] pci_raw_set_power_state: 2 callbacks suppressed
> > > > [   52.969291] nouveau 0000:01:00.0: Refused to change power state, currently in D3
> > > > [   53.029514] video LNXVIDEO:00: Cannot transition to power state D0 for parent in (unknown)
> > > > [   53.041027] nouveau 0000:01:00.0: Refused to change power state, currently in D3
> > > > [   53.041035] video LNXVIDEO:00: Cannot transition to power state D0 for parent in (unknown)
> > > > [   53.053008] nouveau 0000:01:00.0: Refused to change power state, currently in D3
> > > > 
> > > > 
> > > > And then kernel proceeds to throw up errors at random places, e.g.
> > > > 
> > > > [   67.021892] cfg80211: failed to load regulatory.db
> > > > [   67.021895] cfg80211: failed to load regulatory.db
> > > > [   67.021897] cfg80211: failed to load regulatory.db
> > > > [   67.021900] cfg80211: failed to load regulatory.db
> > > > [   67.021927] cfg80211: failed to load regulatory.db
> > > > [   67.021928] cfg80211: failed to load regulatory.db
> > > > [   67.021932] cfg80211: failed to load regulatory.db
> > > > [   67.021934] cfg80211: failed to load regulatory.db
> > > > [   67.024463] cfg80211: failed to load regulatory.db
> > > > [   99.980625] iwlwifi 0000:00:14.3: Error sending STATISTICS_CMD: time out after 2000ms.
> > 
> > BTW this one might indicate that somehow iwlwifi got powered
> > down too somehow. It's at
> > 
> > 00:14.3 Network controller: Intel Corporation Wireless-AC 9560 [Jefferson Peak] (rev 10)
> > 
> > so really shouldn't be affected, but go figure. If driver really is getting
> > all-ones from the device, it just might try to poke at a wrong b:d.f by mistake
> > maybe ...
> 
> Or it the power resource is shared by wifi as well.

Is there a way to find out through e.g. sysfs?

> > > > followed by soft lockups and sometimes hard lockups in places
> > > > like attempts to walk skb lists.
> > > > 
> > > > Adding runpm=0 does away with this issue.
> > > > 
> > > > The specific test was with noaccel=1 - it does not seem to change
> > > > things for me.
> > > > 
> > > > I poked at the ACPI method NVPO and yes it does actually
> > > > seem to execute a while loop waiting for some register
> > > > to become 0. Which I guess never happens? Because card
> > > > is in a low power state and so reads return ffffffff maybe?
> > > 
> > > Yes, it could be the case.
> > > 
> > > > X isn't happy even with runpm=0 but that might be a different
> > > > issue - I thought runpm=0 might be an easier place to start debugging
> > > > things given there are logs of the failure.
> > > > 
> > > > Using 4.20.0-rc3 right now.
> > > > 
> > > > Userspace bits are from fedora 29:
> > > > xorg-x11-drv-nouveau-1.0.15-6.fc29.x86_64
> > > > 
> > > > firmware is pretty recent:
> > > > linux-firmware-20181008-88.gitc6b6265d.fc29.noarch
> > > > 
> > > > More hints for debugging would be appreciated.
> > > > If anyone wants me to play with a different kernel tree,
> > > > let me know.
> > > 
> > > Can you share full dmesg and acpidump of the system? I would like to
> > > check the power resource methods.
> > 
> > Pls find it as an attachment here:
> > https://bugs.freedesktop.org/show_bug.cgi?id=108873
> > 
> > If you like I can send it to you directly too -
> > spamming the list with it isn't helpful I guess?
> 
> No need to send, I can read it from the bugzilla just fine. Can you attach
> acpidump there as well?

Done. lspci -x too just in case.

-- 
MST
_______________________________________________
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: 4.20.0-rc3 nouveau/Quadro P2000 Mobile: runpm causing ACPI errors, lockups
       [not found]                 ` <20181128093527-mutt-send-email-mst-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2018-11-28 15:55                   ` Mika Westerberg
       [not found]                     ` <20181128155544.GD2296-3PARRvDOhMZrdx17CPfAsdBPR1lH4CV8@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Mika Westerberg @ 2018-11-28 15:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Rafael J. Wysocki,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	linux-acpi-u79uwXL29TY76Z2rM5mHXA, Ben Skeggs, Dave Airlie,
	Len Brown

On Wed, Nov 28, 2018 at 10:09:22AM -0500, Michael S. Tsirkin wrote:
> Yea all this is weird, in particular I wonder why does everyone
> using dsm insists on saying Arg4
> when they actually mean Arg3. ACPI numbers arguments from 0.
> 
> So it's a bit ugly, and maybe worth fixing but unlikely to be
> an actual issue simply because we end up not using DSM in the end.

I agree.

> Poking at the probing code in nouveau_pr3_present, I started to wonder:
> should I try to hack it to disable d3cold and pr3 and see what
> happens?

I guess it is worth a try. You can do it from sysfs for the graphics
PCI device there is an attribute d3cold_allowed that controls this.

[snip]

> > > 00:14.3 Network controller: Intel Corporation Wireless-AC 9560 [Jefferson Peak] (rev 10)
> > > 
> > > so really shouldn't be affected, but go figure. If driver really is getting
> > > all-ones from the device, it just might try to poke at a wrong b:d.f by mistake
> > > maybe ...
> > 
> > Or it the power resource is shared by wifi as well.
> 
> Is there a way to find out through e.g. sysfs?

It is not shared, I checked from the acpidump you provided. Possibly the
infinite loop in AML when executing NVPO method have some effect on
this.

[snip]

> > No need to send, I can read it from the bugzilla just fine. Can you attach
> > acpidump there as well?
> 
> Done. lspci -x too just in case.

Looking at the dmesg:

[   52.917009] No Local Variables are initialized for Method [NVPO]
[   52.917011] No Arguments are initialized for method [NVPO]
[   52.917012] ACPI Error: Method parse/execution failed \_SB.PCI0.PEG0.PEGP.NVPO, AE_AML_LOOP_TIMEOUT (20181003/psparse-516)
[   52.917063] ACPI Error: Method parse/execution failed \_SB.PCI0.PGON, AE_AML_LOOP_TIMEOUT (20181003/psparse-516)
[   52.917084] ACPI Error: Method parse/execution failed \_SB.PCI0.PEG0.PG00._ON, AE_AML_LOOP_TIMEOUT (20181003/psparse-516)

So what happens here is that Linux turns off power resource
\_SB.PCI0.PEG0.PG00 by calling its _OFF method (happens when the root
port is runtime suspended). This ends up calling \_SB.PCI0.PGON which
calls \_SB.PCI0.PEG0.PEGP.NVPO.

The last method looks like this:

       Method (NVPO, 0, NotSerialized)
        {
            While ((\_SB.PCI0.P0LS < 0x03))
            {
                Sleep (One)
            }

So basically it polls P0LS register infinitely if the returned value is
less than 3. I suspect this is the issue and it then makes the other
like wifi to fail to execute its methods.

P0LS comes from this operation region:

        OperationRegion (OPG0, SystemMemory, (XBAS + 0x8000), 0x1000)
        Field (OPG0, AnyAcc, NoLock, Preserve)
        {
            ...
            Offset (0x216),
            P0LS,   4,

This is some host bridge register but not sure which because XBAS value
cannot be determined from the acpidump.
_______________________________________________
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: 4.20.0-rc3 nouveau/Quadro P2000 Mobile: runpm causing ACPI errors, lockups
       [not found]                     ` <20181128155544.GD2296-3PARRvDOhMZrdx17CPfAsdBPR1lH4CV8@public.gmane.org>
@ 2018-11-28 20:30                       ` Michael S. Tsirkin
       [not found]                         ` <20181128111312-mutt-send-email-mst-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Michael S. Tsirkin @ 2018-11-28 20:30 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Rafael J. Wysocki,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	linux-acpi-u79uwXL29TY76Z2rM5mHXA, Ben Skeggs, Dave Airlie,
	Len Brown

On Wed, Nov 28, 2018 at 05:55:44PM +0200, Mika Westerberg wrote:
> On Wed, Nov 28, 2018 at 10:09:22AM -0500, Michael S. Tsirkin wrote:
> > Yea all this is weird, in particular I wonder why does everyone
> > using dsm insists on saying Arg4
> > when they actually mean Arg3. ACPI numbers arguments from 0.
> > 
> > So it's a bit ugly, and maybe worth fixing but unlikely to be
> > an actual issue simply because we end up not using DSM in the end.
> 
> I agree.
> 
> > Poking at the probing code in nouveau_pr3_present, I started to wonder:
> > should I try to hack it to disable d3cold and pr3 and see what
> > happens?
> 
> I guess it is worth a try. You can do it from sysfs for the graphics
> PCI device there is an attribute d3cold_allowed that controls this.
> 
> [snip]

But probably too late by time nouveau is up at boot?

> > > > 00:14.3 Network controller: Intel Corporation Wireless-AC 9560 [Jefferson Peak] (rev 10)
> > > > 
> > > > so really shouldn't be affected, but go figure. If driver really is getting
> > > > all-ones from the device, it just might try to poke at a wrong b:d.f by mistake
> > > > maybe ...
> > > 
> > > Or it the power resource is shared by wifi as well.
> > 
> > Is there a way to find out through e.g. sysfs?
> 
> It is not shared, I checked from the acpidump you provided. Possibly the
> infinite loop in AML when executing NVPO method have some effect on
> this.
> 
> [snip]
> 
> > > No need to send, I can read it from the bugzilla just fine. Can you attach
> > > acpidump there as well?
> > 
> > Done. lspci -x too just in case.
> 
> Looking at the dmesg:
> 
> [   52.917009] No Local Variables are initialized for Method [NVPO]
> [   52.917011] No Arguments are initialized for method [NVPO]
> [   52.917012] ACPI Error: Method parse/execution failed \_SB.PCI0.PEG0.PEGP.NVPO, AE_AML_LOOP_TIMEOUT (20181003/psparse-516)
> [   52.917063] ACPI Error: Method parse/execution failed \_SB.PCI0.PGON, AE_AML_LOOP_TIMEOUT (20181003/psparse-516)
> [   52.917084] ACPI Error: Method parse/execution failed \_SB.PCI0.PEG0.PG00._ON, AE_AML_LOOP_TIMEOUT (20181003/psparse-516)
> 
> So what happens here is that Linux turns off power resource
> \_SB.PCI0.PEG0.PG00 by calling its _OFF method (happens when the root
> port is runtime suspended). This ends up calling \_SB.PCI0.PGON which
> calls \_SB.PCI0.PEG0.PEGP.NVPO.
> 
> The last method looks like this:
> 
>        Method (NVPO, 0, NotSerialized)
>         {
>             While ((\_SB.PCI0.P0LS < 0x03))
>             {
>                 Sleep (One)
>             }
> 
> So basically it polls P0LS register infinitely if the returned value is
> less than 3. I suspect this is the issue and it then makes the other
> like wifi to fail to execute its methods.
> 
> P0LS comes from this operation region:
> 
>         OperationRegion (OPG0, SystemMemory, (XBAS + 0x8000), 0x1000)
>         Field (OPG0, AnyAcc, NoLock, Preserve)
>         {
>             ...
>             Offset (0x216),
>             P0LS,   4,
> 
> This is some host bridge register but not sure which because XBAS value
> cannot be determined from the acpidump.

Oh I think XBAS is in SSDT4:

    OperationRegion (SANV, SystemMemory, 0x4FBF7018, 0x01F4)
    Field (SANV, AnyAcc, Lock, Preserve)
    {
        ASLB,   32, 
        IMON,   8, 
        IGDS,   8, 
        IBTT,   8, 
        IPAT,   8, 
        IPSC,   8, 
        IBIA,   8, 
        ISSC,   8, 
        IDMS,   8, 
        IF1E,   8, 
        HVCO,   8, 
        GSMI,   8, 
        PAVP,   8, 
        CADL,   8, 
        CSTE,   16, 
        NSTE,   16, 
        NDID,   8, 
        DID1,   32, 
        DID2,   32, 
        DID3,   32, 
        DID4,   32, 
        DID5,   32, 
        DID6,   32, 
        DID7,   32, 
        DID8,   32, 
        DID9,   32, 
        DIDA,   32, 
        DIDB,   32, 
        DIDC,   32, 
        DIDD,   32, 
        DIDE,   32, 
        DIDF,   32, 
        DIDX,   32, 
        NXD1,   32, 
        NXD2,   32, 
        NXD3,   32, 
        NXD4,   32, 
        NXD5,   32, 
        NXD6,   32, 
        NXD7,   32, 
        NXD8,   32, 
        NXDX,   32, 
        LIDS,   8, 
        KSV0,   32, 
        KSV1,   8, 
        BRTL,   8, 
        ALSE,   8, 
        ALAF,   8, 
        LLOW,   8, 
        LHIH,   8, 
        ALFP,   8, 
        IPTP,   8, 
        EDPV,   8, 
        SGMD,   8, 
        SGFL,   8, 
        SGGP,   8, 
        HRE0,   8, 
        HRG0,   32, 
        HRA0,   8, 
        PWE0,   8, 
        PWG0,   32, 
        PWA0,   8, 
        P1GP,   8, 
        HRE1,   8, 
        HRG1,   32, 
        HRA1,   8, 
        PWE1,   8, 
        PWG1,   32, 
        PWA1,   8, 
        P2GP,   8, 
        HRE2,   8, 
        HRG2,   32, 
        HRA2,   8, 
        PWE2,   8, 
        PWG2,   32, 
        PWA2,   8, 
        DLPW,   16, 
        DLHR,   16, 
        EECP,   8, 
        XBAS,   32, 
        GBAS,   16, 
        NVGA,   32, 
        NVHA,   32, 
        AMDA,   32, 
        LTRX,   8, 
        OBFX,   8, 
        LTRY,   8, 
        OBFY,   8, 
        LTRZ,   8, 
        OBFZ,   8, 
        LTRW,   8, 
        OBFA,   8, 
        SMSL,   16, 
        SNSL,   16, 
        P0UB,   8, 
        P1UB,   8, 
        P2UB,   8, 
        P3UB,   8, 
        PCSL,   8, 
        PBGE,   8, 
        M64B,   64, 
        M64L,   64, 
        CPEX,   32, 
        EEC1,   8, 
        EEC2,   8, 
        SBN0,   8, 
        SBN1,   8, 
        SBN2,   8, 
        M32B,   32, 
        M32L,   32, 
        P0WK,   32, 
        P1WK,   32, 
        P2WK,   32, 
        VTDS,   8, 
        VTB1,   32, 
        VTB2,   32, 
        VTB3,   32, 
        VE1V,   16, 
        VE2V,   16, 
        SBN3,   8, 
        P3GP,   8, 
        HRE3,   8, 
        HRG3,   32, 
        HRA3,   8, 
        PWE3,   8, 
        PWG3,   32, 
        PWA3,   8, 
        P3WK,   32, 
        EEC3,   8, 
        RPIN,   8, 
        RPBA,   32, 
        Offset (0x1F4)
    }

If my math is correct, this is offset 1456 bits, ie 0xb6
bytes, and so 0x4fbf70ce

XBAS + 0x8000 is 0x4fbff0ce ?

cat /proc/iomem shows that this is
4ee5f000-4fca0fff : ACPI Non-volatile Storage

-- 
MST
_______________________________________________
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: 4.20.0-rc3 nouveau/Quadro P2000 Mobile: runpm causing ACPI errors, lockups
       [not found]                         ` <20181128111312-mutt-send-email-mst-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2018-11-28 23:21                           ` Karol Herbst
       [not found]                             ` <CACO55tuzDgTb2g_ebhm7+A=UjEEDoM09-b5X_7fkznkFjJGiag-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Karol Herbst @ 2018-11-28 23:21 UTC (permalink / raw)
  To: mst-H+wXaHxf7aLQT0dZR+AlfA
  Cc: nouveau, rjw-LthD3rsA81gm4RdzfppkhA, dri-devel,
	linux-acpi-u79uwXL29TY76Z2rM5mHXA, Ben Skeggs, David Airlie,
	mika.westerberg-VuQAYsv1563Yd54FQh9/CA,
	lenb-DgEjT+Ai2ygdnm+yROfE0A

this was already debugged and there is no point in searching inside
the Firmware. It's not a firmware bug or anything.

The proper fix is to do something inside Nouveau so that we don't
upset the device and being able to runtime resume it again.

The initial thing we do inside Nouveau to cause those issues is to run
that so called "DEVINIT" script inside the vbios to initialize the
GPU, problem is, it changes something on the PCIe configuration so
that the GPU isn't able to runtime resume anymore. I am in contact
with Nvidia about that issue and hopefully we get the proper answers.
When I was digging into that myself I was able to make the situation
more stable by setting the PCIE link speed to the boot defaults, but
that was still pretty unstable.

Anyway, because the binary driver fails here as well (through
bumblebee and so on) there isn't much of reverse engineering we can do
besides guessing and trying it on literally every hardware until it
works.

We also have an upstream bug for this issue:
https://bugzilla.kernel.org/show_bug.cgi?id=156341
On Wed, Nov 28, 2018 at 9:30 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Wed, Nov 28, 2018 at 05:55:44PM +0200, Mika Westerberg wrote:
> > On Wed, Nov 28, 2018 at 10:09:22AM -0500, Michael S. Tsirkin wrote:
> > > Yea all this is weird, in particular I wonder why does everyone
> > > using dsm insists on saying Arg4
> > > when they actually mean Arg3. ACPI numbers arguments from 0.
> > >
> > > So it's a bit ugly, and maybe worth fixing but unlikely to be
> > > an actual issue simply because we end up not using DSM in the end.
> >
> > I agree.
> >
> > > Poking at the probing code in nouveau_pr3_present, I started to wonder:
> > > should I try to hack it to disable d3cold and pr3 and see what
> > > happens?
> >
> > I guess it is worth a try. You can do it from sysfs for the graphics
> > PCI device there is an attribute d3cold_allowed that controls this.
> >
> > [snip]
>
> But probably too late by time nouveau is up at boot?
>
> > > > > 00:14.3 Network controller: Intel Corporation Wireless-AC 9560 [Jefferson Peak] (rev 10)
> > > > >
> > > > > so really shouldn't be affected, but go figure. If driver really is getting
> > > > > all-ones from the device, it just might try to poke at a wrong b:d.f by mistake
> > > > > maybe ...
> > > >
> > > > Or it the power resource is shared by wifi as well.
> > >
> > > Is there a way to find out through e.g. sysfs?
> >
> > It is not shared, I checked from the acpidump you provided. Possibly the
> > infinite loop in AML when executing NVPO method have some effect on
> > this.
> >
> > [snip]
> >
> > > > No need to send, I can read it from the bugzilla just fine. Can you attach
> > > > acpidump there as well?
> > >
> > > Done. lspci -x too just in case.
> >
> > Looking at the dmesg:
> >
> > [   52.917009] No Local Variables are initialized for Method [NVPO]
> > [   52.917011] No Arguments are initialized for method [NVPO]
> > [   52.917012] ACPI Error: Method parse/execution failed \_SB.PCI0.PEG0.PEGP.NVPO, AE_AML_LOOP_TIMEOUT (20181003/psparse-516)
> > [   52.917063] ACPI Error: Method parse/execution failed \_SB.PCI0.PGON, AE_AML_LOOP_TIMEOUT (20181003/psparse-516)
> > [   52.917084] ACPI Error: Method parse/execution failed \_SB.PCI0.PEG0.PG00._ON, AE_AML_LOOP_TIMEOUT (20181003/psparse-516)
> >
> > So what happens here is that Linux turns off power resource
> > \_SB.PCI0.PEG0.PG00 by calling its _OFF method (happens when the root
> > port is runtime suspended). This ends up calling \_SB.PCI0.PGON which
> > calls \_SB.PCI0.PEG0.PEGP.NVPO.
> >
> > The last method looks like this:
> >
> >        Method (NVPO, 0, NotSerialized)
> >         {
> >             While ((\_SB.PCI0.P0LS < 0x03))
> >             {
> >                 Sleep (One)
> >             }
> >
> > So basically it polls P0LS register infinitely if the returned value is
> > less than 3. I suspect this is the issue and it then makes the other
> > like wifi to fail to execute its methods.
> >
> > P0LS comes from this operation region:
> >
> >         OperationRegion (OPG0, SystemMemory, (XBAS + 0x8000), 0x1000)
> >         Field (OPG0, AnyAcc, NoLock, Preserve)
> >         {
> >             ...
> >             Offset (0x216),
> >             P0LS,   4,
> >
> > This is some host bridge register but not sure which because XBAS value
> > cannot be determined from the acpidump.
>
> Oh I think XBAS is in SSDT4:
>
>     OperationRegion (SANV, SystemMemory, 0x4FBF7018, 0x01F4)
>     Field (SANV, AnyAcc, Lock, Preserve)
>     {
>         ASLB,   32,
>         IMON,   8,
>         IGDS,   8,
>         IBTT,   8,
>         IPAT,   8,
>         IPSC,   8,
>         IBIA,   8,
>         ISSC,   8,
>         IDMS,   8,
>         IF1E,   8,
>         HVCO,   8,
>         GSMI,   8,
>         PAVP,   8,
>         CADL,   8,
>         CSTE,   16,
>         NSTE,   16,
>         NDID,   8,
>         DID1,   32,
>         DID2,   32,
>         DID3,   32,
>         DID4,   32,
>         DID5,   32,
>         DID6,   32,
>         DID7,   32,
>         DID8,   32,
>         DID9,   32,
>         DIDA,   32,
>         DIDB,   32,
>         DIDC,   32,
>         DIDD,   32,
>         DIDE,   32,
>         DIDF,   32,
>         DIDX,   32,
>         NXD1,   32,
>         NXD2,   32,
>         NXD3,   32,
>         NXD4,   32,
>         NXD5,   32,
>         NXD6,   32,
>         NXD7,   32,
>         NXD8,   32,
>         NXDX,   32,
>         LIDS,   8,
>         KSV0,   32,
>         KSV1,   8,
>         BRTL,   8,
>         ALSE,   8,
>         ALAF,   8,
>         LLOW,   8,
>         LHIH,   8,
>         ALFP,   8,
>         IPTP,   8,
>         EDPV,   8,
>         SGMD,   8,
>         SGFL,   8,
>         SGGP,   8,
>         HRE0,   8,
>         HRG0,   32,
>         HRA0,   8,
>         PWE0,   8,
>         PWG0,   32,
>         PWA0,   8,
>         P1GP,   8,
>         HRE1,   8,
>         HRG1,   32,
>         HRA1,   8,
>         PWE1,   8,
>         PWG1,   32,
>         PWA1,   8,
>         P2GP,   8,
>         HRE2,   8,
>         HRG2,   32,
>         HRA2,   8,
>         PWE2,   8,
>         PWG2,   32,
>         PWA2,   8,
>         DLPW,   16,
>         DLHR,   16,
>         EECP,   8,
>         XBAS,   32,
>         GBAS,   16,
>         NVGA,   32,
>         NVHA,   32,
>         AMDA,   32,
>         LTRX,   8,
>         OBFX,   8,
>         LTRY,   8,
>         OBFY,   8,
>         LTRZ,   8,
>         OBFZ,   8,
>         LTRW,   8,
>         OBFA,   8,
>         SMSL,   16,
>         SNSL,   16,
>         P0UB,   8,
>         P1UB,   8,
>         P2UB,   8,
>         P3UB,   8,
>         PCSL,   8,
>         PBGE,   8,
>         M64B,   64,
>         M64L,   64,
>         CPEX,   32,
>         EEC1,   8,
>         EEC2,   8,
>         SBN0,   8,
>         SBN1,   8,
>         SBN2,   8,
>         M32B,   32,
>         M32L,   32,
>         P0WK,   32,
>         P1WK,   32,
>         P2WK,   32,
>         VTDS,   8,
>         VTB1,   32,
>         VTB2,   32,
>         VTB3,   32,
>         VE1V,   16,
>         VE2V,   16,
>         SBN3,   8,
>         P3GP,   8,
>         HRE3,   8,
>         HRG3,   32,
>         HRA3,   8,
>         PWE3,   8,
>         PWG3,   32,
>         PWA3,   8,
>         P3WK,   32,
>         EEC3,   8,
>         RPIN,   8,
>         RPBA,   32,
>         Offset (0x1F4)
>     }
>
> If my math is correct, this is offset 1456 bits, ie 0xb6
> bytes, and so 0x4fbf70ce
>
> XBAS + 0x8000 is 0x4fbff0ce ?
>
> cat /proc/iomem shows that this is
> 4ee5f000-4fca0fff : ACPI Non-volatile Storage
>
> --
> MST
> _______________________________________________
> Nouveau mailing list
> Nouveau@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/nouveau
_______________________________________________
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: 4.20.0-rc3 nouveau/Quadro P2000 Mobile: runpm causing ACPI errors, lockups
       [not found]                             ` <CACO55tuzDgTb2g_ebhm7+A=UjEEDoM09-b5X_7fkznkFjJGiag-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-11-29  1:29                               ` Michael S. Tsirkin
  2018-11-29 10:53                                 ` [Nouveau] " Karol Herbst
  0 siblings, 1 reply; 12+ messages in thread
From: Michael S. Tsirkin @ 2018-11-29  1:29 UTC (permalink / raw)
  To: Karol Herbst
  Cc: nouveau, rjw-LthD3rsA81gm4RdzfppkhA, dri-devel,
	linux-acpi-u79uwXL29TY76Z2rM5mHXA, Ben Skeggs, David Airlie,
	mika.westerberg-VuQAYsv1563Yd54FQh9/CA,
	lenb-DgEjT+Ai2ygdnm+yROfE0A

On Thu, Nov 29, 2018 at 12:21:31AM +0100, Karol Herbst wrote:
> this was already debugged and there is no point in searching inside
> the Firmware. It's not a firmware bug or anything.
> 
> The proper fix is to do something inside Nouveau so that we don't
> upset the device and being able to runtime resume it again.
> 
> The initial thing we do inside Nouveau to cause those issues is to run
> that so called "DEVINIT" script inside the vbios to initialize the
> GPU, problem is, it changes something on the PCIe configuration so
> that the GPU isn't able to runtime resume anymore. I am in contact
> with Nvidia about that issue and hopefully we get the proper answers.
> When I was digging into that myself I was able to make the situation
> more stable by setting the PCIE link speed to the boot defaults, but
> that was still pretty unstable.
> 
> Anyway, because the binary driver fails here as well (through
> bumblebee and so on) there isn't much of reverse engineering we can do
> besides guessing and trying it on literally every hardware until it
> works.
> 
> We also have an upstream bug for this issue:
> https://bugzilla.kernel.org/show_bug.cgi?id=156341

If you like I can probably dump the pcie registers on card
and/or the pcie port under windows. The card works there :)
Let me know.

-- 
MST
_______________________________________________
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Nouveau] 4.20.0-rc3 nouveau/Quadro P2000 Mobile: runpm causing ACPI errors, lockups
  2018-11-29  1:29                               ` Michael S. Tsirkin
@ 2018-11-29 10:53                                 ` Karol Herbst
       [not found]                                   ` <CACO55tutwu+46dmhy7RuqRHSBorundx3ZcgDZCqSwvxN+JU4-g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Karol Herbst @ 2018-11-29 10:53 UTC (permalink / raw)
  To: mst
  Cc: nouveau, rjw, dri-devel, linux-acpi, Ben Skeggs, David Airlie,
	mika.westerberg, lenb

On Thu, Nov 29, 2018 at 2:29 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Thu, Nov 29, 2018 at 12:21:31AM +0100, Karol Herbst wrote:
> > this was already debugged and there is no point in searching inside
> > the Firmware. It's not a firmware bug or anything.
> >
> > The proper fix is to do something inside Nouveau so that we don't
> > upset the device and being able to runtime resume it again.
> >
> > The initial thing we do inside Nouveau to cause those issues is to run
> > that so called "DEVINIT" script inside the vbios to initialize the
> > GPU, problem is, it changes something on the PCIe configuration so
> > that the GPU isn't able to runtime resume anymore. I am in contact
> > with Nvidia about that issue and hopefully we get the proper answers.
> > When I was digging into that myself I was able to make the situation
> > more stable by setting the PCIE link speed to the boot defaults, but
> > that was still pretty unstable.
> >
> > Anyway, because the binary driver fails here as well (through
> > bumblebee and so on) there isn't much of reverse engineering we can do
> > besides guessing and trying it on literally every hardware until it
> > works.
> >
> > We also have an upstream bug for this issue:
> > https://bugzilla.kernel.org/show_bug.cgi?id=156341
>
> If you like I can probably dump the pcie registers on card
> and/or the pcie port under windows. The card works there :)
> Let me know.
>
> --
> MST

the problem is, we would need to know the registers right before
suspending the GPU. If someone would be able to trace all PCIe
register read and writes for the entire suspending/resume process,
that would be very helpful.
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: 4.20.0-rc3 nouveau/Quadro P2000 Mobile: runpm causing ACPI errors, lockups
       [not found]                                   ` <CACO55tutwu+46dmhy7RuqRHSBorundx3ZcgDZCqSwvxN+JU4-g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-11-29 17:12                                     ` Michael S. Tsirkin
       [not found]                                       ` <20181129115629-mutt-send-email-mst-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Michael S. Tsirkin @ 2018-11-29 17:12 UTC (permalink / raw)
  To: Karol Herbst
  Cc: nouveau, rjw-LthD3rsA81gm4RdzfppkhA, dri-devel,
	linux-acpi-u79uwXL29TY76Z2rM5mHXA, Ben Skeggs, David Airlie,
	mika.westerberg-VuQAYsv1563Yd54FQh9/CA,
	lenb-DgEjT+Ai2ygdnm+yROfE0A

On Thu, Nov 29, 2018 at 11:53:53AM +0100, Karol Herbst wrote:
> On Thu, Nov 29, 2018 at 2:29 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Thu, Nov 29, 2018 at 12:21:31AM +0100, Karol Herbst wrote:
> > > this was already debugged and there is no point in searching inside
> > > the Firmware. It's not a firmware bug or anything.
> > >
> > > The proper fix is to do something inside Nouveau so that we don't
> > > upset the device and being able to runtime resume it again.
> > >
> > > The initial thing we do inside Nouveau to cause those issues is to run
> > > that so called "DEVINIT" script inside the vbios to initialize the
> > > GPU, problem is, it changes something on the PCIe configuration so
> > > that the GPU isn't able to runtime resume anymore. I am in contact
> > > with Nvidia about that issue and hopefully we get the proper answers.
> > > When I was digging into that myself I was able to make the situation
> > > more stable by setting the PCIE link speed to the boot defaults, but
> > > that was still pretty unstable.
> > >
> > > Anyway, because the binary driver fails here as well (through
> > > bumblebee and so on) there isn't much of reverse engineering we can do
> > > besides guessing and trying it on literally every hardware until it
> > > works.
> > >
> > > We also have an upstream bug for this issue:
> > > https://bugzilla.kernel.org/show_bug.cgi?id=156341
> >
> > If you like I can probably dump the pcie registers on card
> > and/or the pcie port under windows. The card works there :)
> > Let me know.
> >
> > --
> > MST
> 
> the problem is, we would need to know the registers right before
> suspending the GPU. If someone would be able to trace all PCIe
> register read and writes for the entire suspending/resume process,
> that would be very helpful.


Well I can pass the card to a VM, and trace it on the hypervisor, that
isn't a problem.  A tricky thing is the ACPI tables, would need to
somehow know which ones are relevant to pass them to guest ... ideas on
that?

-- 
MST
_______________________________________________
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: 4.20.0-rc3 nouveau/Quadro P2000 Mobile: runpm causing ACPI errors, lockups
       [not found]                                       ` <20181129115629-mutt-send-email-mst-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2018-11-29 17:26                                         ` Karol Herbst
  0 siblings, 0 replies; 12+ messages in thread
From: Karol Herbst @ 2018-11-29 17:26 UTC (permalink / raw)
  To: mst-H+wXaHxf7aLQT0dZR+AlfA
  Cc: nouveau, rjw-LthD3rsA81gm4RdzfppkhA, dri-devel,
	linux-acpi-u79uwXL29TY76Z2rM5mHXA, Ben Skeggs, David Airlie,
	mika.westerberg-VuQAYsv1563Yd54FQh9/CA,
	lenb-DgEjT+Ai2ygdnm+yROfE0A

yeah... I don't think that's gonna work that nicely. Anyway, you
probably need to allow the VM to access all of ACPI in order to
runtime suspend the GPU and on newer laptops the vbios is retrieved
via ACPI as well. It would be probably better if there is a way to
trace all that on a machine running windows directly.
On Thu, Nov 29, 2018 at 6:12 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Thu, Nov 29, 2018 at 11:53:53AM +0100, Karol Herbst wrote:
> > On Thu, Nov 29, 2018 at 2:29 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Thu, Nov 29, 2018 at 12:21:31AM +0100, Karol Herbst wrote:
> > > > this was already debugged and there is no point in searching inside
> > > > the Firmware. It's not a firmware bug or anything.
> > > >
> > > > The proper fix is to do something inside Nouveau so that we don't
> > > > upset the device and being able to runtime resume it again.
> > > >
> > > > The initial thing we do inside Nouveau to cause those issues is to run
> > > > that so called "DEVINIT" script inside the vbios to initialize the
> > > > GPU, problem is, it changes something on the PCIe configuration so
> > > > that the GPU isn't able to runtime resume anymore. I am in contact
> > > > with Nvidia about that issue and hopefully we get the proper answers.
> > > > When I was digging into that myself I was able to make the situation
> > > > more stable by setting the PCIE link speed to the boot defaults, but
> > > > that was still pretty unstable.
> > > >
> > > > Anyway, because the binary driver fails here as well (through
> > > > bumblebee and so on) there isn't much of reverse engineering we can do
> > > > besides guessing and trying it on literally every hardware until it
> > > > works.
> > > >
> > > > We also have an upstream bug for this issue:
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=156341
> > >
> > > If you like I can probably dump the pcie registers on card
> > > and/or the pcie port under windows. The card works there :)
> > > Let me know.
> > >
> > > --
> > > MST
> >
> > the problem is, we would need to know the registers right before
> > suspending the GPU. If someone would be able to trace all PCIe
> > register read and writes for the entire suspending/resume process,
> > that would be very helpful.
>
>
> Well I can pass the card to a VM, and trace it on the hypervisor, that
> isn't a problem.  A tricky thing is the ACPI tables, would need to
> somehow know which ones are relevant to pass them to guest ... ideas on
> that?
>
> --
> MST
_______________________________________________
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2018-11-29 17:26 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-27  3:53 4.20.0-rc3 nouveau/Quadro P2000 Mobile: runpm causing ACPI errors, lockups Michael S. Tsirkin
     [not found] ` <20181126221005-mutt-send-email-mst-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2018-11-27  9:36   ` Mika Westerberg
     [not found]     ` <20181127093650.GP2296-3PARRvDOhMZrdx17CPfAsdBPR1lH4CV8@public.gmane.org>
2018-11-28  2:49       ` Michael S. Tsirkin
     [not found]         ` <20181127212550-mutt-send-email-mst-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2018-11-28 11:08           ` Mika Westerberg
     [not found]             ` <20181128110857.GW2296-3PARRvDOhMZrdx17CPfAsdBPR1lH4CV8@public.gmane.org>
2018-11-28 15:09               ` Michael S. Tsirkin
     [not found]                 ` <20181128093527-mutt-send-email-mst-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2018-11-28 15:55                   ` Mika Westerberg
     [not found]                     ` <20181128155544.GD2296-3PARRvDOhMZrdx17CPfAsdBPR1lH4CV8@public.gmane.org>
2018-11-28 20:30                       ` Michael S. Tsirkin
     [not found]                         ` <20181128111312-mutt-send-email-mst-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2018-11-28 23:21                           ` Karol Herbst
     [not found]                             ` <CACO55tuzDgTb2g_ebhm7+A=UjEEDoM09-b5X_7fkznkFjJGiag-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-11-29  1:29                               ` Michael S. Tsirkin
2018-11-29 10:53                                 ` [Nouveau] " Karol Herbst
     [not found]                                   ` <CACO55tutwu+46dmhy7RuqRHSBorundx3ZcgDZCqSwvxN+JU4-g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-11-29 17:12                                     ` Michael S. Tsirkin
     [not found]                                       ` <20181129115629-mutt-send-email-mst-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2018-11-29 17:26                                         ` Karol Herbst

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.