From: Peter Wu <peter@lekensteyn.nl>
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: linux-pci@vger.kernel.org,
	Roland Singer <roland.singer@desertbit.com>,
	linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org,
	linux-acpi@vger.kernel.org
Subject: Re: Kernel Freeze with American Megatrends BIOS
Date: Tue, 30 Aug 2016 21:53:37 +0200
Message-ID: <20160830195337.GA18805@al>
In-Reply-To: <20160829160210.GA24451@localhost>

On Mon, Aug 29, 2016 at 11:02:10AM -0500, Bjorn Helgaas wrote:
> [+cc linux-acpi, linux-kernel, dri-devel]
> 
> Hi Roland,
> 
> I have no idea how to debug this problem.  Are you seeing something
> that suggests it may be a PCI problem?

Yes, I suspect there is an ACPI and/or PCI problem, possibly
device-specific. Steps to reproduce on the affected machines:

 1. Load nouveau.
 2. Wait for it to runtime suspend.
 3. Invoke 'lspci'; this resumes the Nvidia PCI device via nouveau.
 4. lspci never returns; a few moments later an AML_INFINITE_LOOP is
    reported.
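
Step 2 can be checked without triggering the resume by reading the
runtime PM state from sysfs. The following is a minimal sketch; the
device address 0000:01:00.0 used in the note below is an assumption
(the GPU typically sits behind the 00:01.0 root port), and the helper
names are made up for illustration:

```python
from pathlib import Path


def runtime_status(bdf: str, sysfs_root: str = "/sys") -> str:
    """Return the runtime PM status of a PCI device.

    bdf is the full PCI address, e.g. "0000:01:00.0" (assumed here).
    The kernel reports values such as "active", "suspended",
    "suspending", "resuming", "error" or "unsupported".
    """
    path = (Path(sysfs_root) / "bus" / "pci" / "devices" / bdf
            / "power" / "runtime_status")
    return path.read_text().strip()


def is_runtime_suspended(bdf: str, sysfs_root: str = "/sys") -> bool:
    """True once the device has finished runtime suspending."""
    return runtime_status(bdf, sysfs_root) == "suspended"
```

Calling is_runtime_suspended("0000:01:00.0") in a loop until it returns
True should be enough to know when it is "safe" to run lspci and hit
the hang. As far as I know, reading this attribute does not itself
resume the device.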

If you use the external bbswitch module, the effect is the same. I have
been trying to debug this for some time on nouveau with no luck. The
PCI/PM D3cold patches from Mika make no difference.

Runtime resume via nouveau triggers some ACPI methods (I'll assume the
Windows 8-style PR method and take the Clevo P651 as example):

    \_SB.PCI0.PEG0.PG00._ON () ->
        \_SB.PCI0.PGON (0)

Then:

    Method (PGON, 1, Serialized) {
        PION = Arg0     // note: 0 for PG00
        // ...
        If ((OSYS != 0x07DF)) { /* Not Windows 2015 (Windows 10), see below */ }
        Else {
            LKEN (PION)
        }
        // this is the infinite loop: it tries to bring the PCIe link to
        // full speed, but fails to do so.
        While ((\_SB.PCI0.PEG0.LNKS < 0x07)) {
            Local0 = 0x20
            While (Local0) {
                If ((\_SB.PCI0.PEG0.LNKS < 0x07)) {
                    Stall (0x64)
                    Local0--
                } Else { Break }
            }
            If ((Local0 == Zero)) {
                \_SB.PCI0.PEG0.RTLK = One
                Stall (0x64)
            }
        }
        // ...
    }
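
To make the control flow of that wait loop easier to follow, here is a
rough Python model of it. This is only a sketch of my reading of the
AML above: read_lnks stands in for a read of \_SB.PCI0.PEG0.LNKS, and
the max_outer bound is hypothetical (the firmware has no such limit,
which is exactly the problem):

```python
def pgon_wait(read_lnks, max_outer=None):
    """Model the PGON link wait and count RTLK retrain requests.

    read_lnks: callable returning the current LNKS value.
    max_outer: hypothetical bound on outer iterations; None mirrors
               the firmware, which loops until LNKS >= 0x07.
    """
    retrains = 0
    outer = 0
    while read_lnks() < 0x07:
        if max_outer is not None and outer >= max_outer:
            raise TimeoutError("link never reached LNKS >= 0x07")
        outer += 1
        local0 = 0x20
        while local0:
            if read_lnks() < 0x07:
                # Stall (0x64): busy-wait 100 microseconds.
                local0 -= 1
            else:
                break
        if local0 == 0:
            # RTLK = One: request link retraining, then Stall (0x64).
            retrains += 1
    return retrains
```

With a link that stays down, pgon_wait(lambda: 0) never returns, just
like the AML. Each outer pass polls up to 32 times with a 100 us stall,
so roughly 3.3 ms per pass including the stall after the retrain
request.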

Without any workaround, this piece of code is invoked:

    Method (LKEN, 1, NotSerialized) {
        Local3 = (CPEX & 0x0F)  // CPEX at 0x5ff9be7f and has value 000506e3
        If ((Local3 == Zero)) {
            /* Similar to below, but with Q0L0 -> P0L0 (register 0xBC bit 6) */
        } ElseIf ((Local3 != Zero)) {
            If ((Arg0 == Zero)) {
                /* Enter L0 Activate state.
                 * (LKDS tries to enter L2, deep-energy-saving state.) */
                Q0L0 = One      // register 0x249 bit 0; \_SB.PCI0.OPG0.Q0L0 00:01.0
                Sleep (0x10)
                Local0 = Zero
                While (Q0L0) {
                    If ((Local0 > 0x04)) { Break }
                    Sleep (0x10)
                    Local0++
                }
            } Else { /* other cases, but we are only interested in PGON(0) */ }
        }
    }
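
Unlike the LNKS loop in PGON, this poll is bounded. A small Python
model of the Arg0 == Zero branch (a sketch only; read_q0l0 is a
stand-in for reading the Q0L0 bit) gives the worst-case time spent
sleeping:

```python
def lken_poll(read_q0l0):
    """Model the Q0L0 poll in LKEN for Arg0 == Zero.

    Returns (slept_ms, cleared): the total time spent in Sleep () and
    whether Q0L0 was observed clear before the loop gave up.
    """
    slept_ms = 0x10          # Sleep (0x10) right after Q0L0 = One
    local0 = 0
    cleared = True
    while read_q0l0():
        if local0 > 0x04:
            cleared = False  # give up, bit is still set
            break
        slept_ms += 0x10     # Sleep (0x10) inside the loop
        local0 += 1
    return slept_ms, cleared
```

If the bit never clears, lken_poll(lambda: 1) returns (96, False), so
LKEN gives up after about 96 ms and PGON then proceeds to the LNKS
loop regardless.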

The acpi_osi="!Windows 2015" workaround will invoke this instead:

    If ((OSYS != 0x07DF)) {
        If ((PION == Zero)) {
            P0AP = Zero  /* PGOF writes 3 */
            P0RM = Zero  /* PGOF writes 1 */
        }
        If ((PBGE != Zero)) { /* Observed to be false (PBGE == 0) */
            If (SBDL (PION)) {
                PUAB (PION)
                CBDL = GUBC (PION)
                MBDL = GMXB (PION)
                If ((CBDL > MBDL)) {
                    CBDL = MBDL /* \_SB_.PCI0.MBDL */
                }
                PDUB (PION, CBDL)
            }
        }
        If ((PION == Zero)) {
            P0LD = Zero     /* Link Disable = 0, PGOF sets 1 instead. */
            P0TR = One      /* Train? (PGOF does not set this). */
            TCNT = Zero
            While ((TCNT < LDLY)) { /* LDLY = 300 */
                If ((P0VC == Zero)) {
                    /* VC Negotiation Pending 0 means VC negotiation is complete. */
                    Break
                }
                Sleep (0x10)
                TCNT += 0x10 /* At most 19 iterations, sleeping for 304ms. */
            }
        }
    }
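
The iteration count in the last comment can be double-checked with a
short Python model of the TCNT loop (again just a sketch; read_p0vc is
a stand-in for reading P0VC, and LDLY = 300 is the value noted above):

```python
def vc_wait(read_p0vc, ldly=300, step=0x10):
    """Model the TCNT/LDLY wait for VC negotiation.

    Returns (iterations, tcnt): how many times Sleep (0x10) ran and the
    final TCNT value, which equals the total milliseconds slept.
    """
    tcnt = 0
    iterations = 0
    while tcnt < ldly:
        if read_p0vc() == 0:
            break            # VC negotiation complete
        tcnt += step         # Sleep (0x10); TCNT += 0x10
        iterations += 1
    return iterations, tcnt
```

With P0VC stuck at 1, vc_wait(lambda: 1) returns (19, 304), which
matches the 19 iterations and 304 ms above. The important difference
from the Windows 2015 path is that this loop always terminates.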

The comments above are my own interpretation based on the acpidumps I
extracted from the machine. These notes and ACPI tables can be found at
https://github.com/Lekensteyn/acpi-stuff/blob/master/Clevo-P651RA/notes.txt
https://github.com/Lekensteyn/acpi-stuff/tree/master/dsl/Clevo_P651RA

Other affected devices have similar code; the differences are small:
 - No check for LNKS (avoids the infinite loop, but device is still off)
 - Instead of a check for != "Windows 2015", they check for == "Windows
   2009" or even for == "Windows 2009" || "Windows 2013" (Dell Inspiron
   7559).
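
These variants matter for which acpi_osi override has any effect. A
sketch of the three checks follows. It assumes OSYS simply holds the
year of the newest _OSI string the OS accepted (0x07DF = 2015 as
above, so 2009 and 2013 for the others); that encoding and the
function names are assumptions on my side, not something taken from
every DSDT:

```python
WIN2009, WIN2013, WIN2015 = 0x07D9, 0x07DD, 0x07DF  # 2009, 2013, 2015


def old_path_clevo(osys):
    """Clevo P651RA style: any OSYS other than Windows 2015."""
    return osys != WIN2015


def old_path_eq_2009(osys):
    """Variant that requires exactly Windows 2009."""
    return osys == WIN2009


def old_path_dell_7559(osys):
    """Dell Inspiron 7559 style: Windows 2009 or Windows 2013."""
    return osys in (WIN2009, WIN2013)
```

Under these assumptions, an OSYS of 2013 (what I would expect after
acpi_osi="!Windows 2015") selects the working path on the Clevo and
Dell variants, but not on firmware that checks for == "Windows 2009".
That would be consistent with some machines needing
acpi_osi=! acpi_osi="Windows 2009" instead.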

The tested kernels (with bbswitch or nouveau) were Linux 4.4.0, 4.6,
4.7 (nouveau + PCI/PM + nouveau PR patches). The PCIe device is
something from the GTX 9xxM family in all cases.

I have a bunch of PCI config dumps from Windows and Linux, but there is
nothing extraordinary. I also did an ACPI trace via a Checked/Debug build
of Windows, but it just confirms that the ACPI method we use for the
Nvidia device is the correct one.

Let me know if you need more information; I would be glad to provide it.

Kind regards,
Peter

> On Tue, Aug 23, 2016 at 11:23:45AM +0200, Roland Singer wrote:
> > Hi,
> > 
> > hope somebody can help me fix this kernel problem which affects the following machines:
> > 
> > - Clevo P651RA (i7-6700HQ/GTX 965M, part of the P6xxRx family which are also affected)
> > - MSI GE62 Apache Pro (i7-6700HQ/GTX 960M)
> > - Gigabyte P35V5 (i7-6700HQ/GTX 970M)
> > - Razer Blade 14" (2016) (i7-6700HQ/GTX 970M) (BIOS 5.11, 04/07/2016)
> > 
> > 
> > The kernel freezes if the graphical user session (Xorg & Wayland) is
> > started with a switched off discrete GPU card (NVIDIA).
> > If the discrete GPU is switched off after the graphical session start,
> > then everything works as expected, until the graphical session is restarted.
> > 
> > This problem seems to be linked to specific BIOS settings. If the computer
> > is started with the following command line:
> > 
> > acpi_osi=! acpi_osi="Windows 2009"
> > 
> > then the kernel freeze does not occur anymore. However this required a special
> > ACPI DSDT firmware patch for the Razer Blade 2016 laptop:
> > 
> > https://github.com/m4ng0squ4sh/razer_blade_14_2016_acpi_dsdt
> > 
> > I strongly recommend to fix this in the kernel and I am ready to help and solve
> > this problem with some help.
> > 
> > Here is a link to the GitHub issue with further information:
> > 
> > https://github.com/Bumblebee-Project/Bumblebee/issues/764#issuecomment-241212595
> > 
> > Here is some more detailed information:
> > 
> > https://github.com/Lekensteyn/acpi-stuff/blob/master/Clevo-P651RA/notes.txt
> > 
> > Hope somebody can help.