* Kernel Freeze with American Megatrends BIOS
@ 2016-08-23  9:23 Roland Singer
  2016-08-29  7:56 ` Roland Singer
  2016-08-29 16:02 ` Bjorn Helgaas
  0 siblings, 2 replies; 39+ messages in thread
From: Roland Singer @ 2016-08-23  9:23 UTC (permalink / raw)
  To: linux-pci

Hi,

I hope somebody can help me fix this kernel problem, which affects the following machines:

- Clevo P651RA (i7-6700HQ/GTX 965M; the rest of the P6xxRx family is also affected)
- MSI GE62 Apache Pro (i7-6700HQ/GTX 960M)
- Gigabyte P35V5 (i7-6700HQ/GTX 970M)
- Razer Blade 14" (2016) (i7-6700HQ/GTX 970M) (BIOS 5.11, 04/07/2016)


The kernel freezes if the graphical user session (Xorg or Wayland) is
started while the discrete GPU (NVIDIA) is switched off.
If the discrete GPU is switched off after the graphical session has started,
everything works as expected until the graphical session is restarted.

This problem seems to be linked to specific BIOS settings. If the computer
is booted with the following kernel command line parameters:

acpi_osi=! acpi_osi="Windows 2009"

then the kernel freeze no longer occurs. However, this workaround required a special
ACPI DSDT firmware patch for the Razer Blade 2016 laptop:

https://github.com/m4ng0squ4sh/razer_blade_14_2016_acpi_dsdt

I strongly recommend fixing this in the kernel, and I am ready to help solve
this problem with some guidance.

Here is a link to the GitHub issue with further information:

https://github.com/Bumblebee-Project/Bumblebee/issues/764#issuecomment-241212595

Here is some more detailed information:

https://github.com/Lekensteyn/acpi-stuff/blob/master/Clevo-P651RA/notes.txt

I hope somebody can help.

Thanks!

Roland Singer

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Kernel Freeze with American Megatrends BIOS
  2016-08-23  9:23 Kernel Freeze with American Megatrends BIOS Roland Singer
@ 2016-08-29  7:56 ` Roland Singer
  2016-08-29 16:02 ` Bjorn Helgaas
  1 sibling, 0 replies; 39+ messages in thread
From: Roland Singer @ 2016-08-29  7:56 UTC (permalink / raw)
  To: linux-pci

Here is the updated list of affected machines:

- Clevo P651RA (i7-6700HQ/GTX 965M; the rest of the P6xxRx family is also affected)
- MSI GE62 Apache Pro (i7-6700HQ/GTX 960M)
- Gigabyte P35V5 (i7-6700HQ/GTX 970M)
- Razer Blade 14" (2016) (i7-6700HQ/GTX 970M) (BIOS 5.11, 04/07/2016)
- Dell Inspiron 7559 (i7-6700HQ/GTX 960M) (BIOS 1.1.3, 11/05/2015)



* Re: Kernel Freeze with American Megatrends BIOS
  2016-08-23  9:23 Kernel Freeze with American Megatrends BIOS Roland Singer
  2016-08-29  7:56 ` Roland Singer
@ 2016-08-29 16:02 ` Bjorn Helgaas
  2016-08-29 18:46   ` Roland Singer
  2016-08-30 19:53     ` Peter Wu
  1 sibling, 2 replies; 39+ messages in thread
From: Bjorn Helgaas @ 2016-08-29 16:02 UTC (permalink / raw)
  To: Roland Singer; +Cc: linux-pci, linux-kernel, linux-acpi, dri-devel

[+cc linux-acpi, linux-kernel, dri-devel]

Hi Roland,

I have no idea how to debug this problem.  Are you seeing something
that suggests it may be a PCI problem?


* Re: Kernel Freeze with American Megatrends BIOS
  2016-08-29 16:02 ` Bjorn Helgaas
@ 2016-08-29 18:46   ` Roland Singer
  2016-08-29 19:07     ` Bjorn Helgaas
  2016-08-30 19:53     ` Peter Wu
  1 sibling, 1 reply; 39+ messages in thread
From: Roland Singer @ 2016-08-29 18:46 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci, linux-kernel, linux-acpi, dri-devel

Hi Bjorn,

I am using the bbswitch kernel module to switch the GPU off and on and
to obtain the GPU power state.
Obtaining the GPU state immediately after starting the graphical user
session freezes the system.

Something in the following code triggers the freeze:

---
// Returns 1 if the card is disabled, 0 if enabled
static int is_card_disabled(void) {
    u32 cfg_word;
    // read first config word which contains Vendor and Device ID. If all bits
    // are enabled, the device is assumed to be off
    pci_read_config_dword(dis_dev, 0, &cfg_word);
    // if one of the bits is not enabled (the card is enabled), the inverted
    // result will be non-zero and hence logical not will make it 0 ("false")
    return !~cfg_word;
}

static int bbswitch_proc_show(struct seq_file *seqfp, void *p) {
    // show the card state. Example output: 0000:01:00:00 ON
    dis_dev_get();
    seq_printf(seqfp, "%s %s\n", dev_name(&dis_dev->dev),
             is_card_disabled() ? "OFF" : "ON");
    dis_dev_put();
    return 0;
}
---

Either dis_dev_get or pci_read_config_dword is the trigger.
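
The `!~cfg_word` test in is_card_disabled() is compact enough to be easy to misread. As a minimal userspace sketch (not part of bbswitch; the function names are made up for illustration, and the device ID in the usage note is only an example value), the check can be restated like this. It assumes, as the module's comment does, that a config read from a powered-off device comes back as all ones:

```c
#include <assert.h>
#include <stdint.h>

/* Userspace restatement of the check in is_card_disabled().
 * Assumption: a config read that targets a powered-off device
 * returns all ones, so 0xFFFFFFFF is treated as "off". */
int looks_disabled(uint32_t cfg_word)
{
    /* ~cfg_word is zero only when every bit of cfg_word is set,
     * so the logical not yields 1 exactly for 0xFFFFFFFF. */
    return !~cfg_word;
}

/* Equivalent, more explicit spelling of the same test. */
int looks_disabled_explicit(uint32_t cfg_word)
{
    return cfg_word == UINT32_MAX;
}
```

For example, looks_disabled(0xFFFFFFFF) yields 1, while a word such as 0x139b10de (NVIDIA vendor ID 0x10de in the low 16 bits, an illustrative device ID in the high 16 bits) yields 0.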

Link to the bbswitch module source code:
https://github.com/Bumblebee-Project/bbswitch/blob/master/bbswitch.c#L333



* Re: Kernel Freeze with American Megatrends BIOS
  2016-08-29 18:46   ` Roland Singer
@ 2016-08-29 19:07     ` Bjorn Helgaas
  2016-08-29 19:55       ` Roland Singer
  0 siblings, 1 reply; 39+ messages in thread
From: Bjorn Helgaas @ 2016-08-29 19:07 UTC (permalink / raw)
  To: Roland Singer; +Cc: linux-pci, linux-kernel, linux-acpi, dri-devel

On Mon, Aug 29, 2016 at 08:46:17PM +0200, Roland Singer wrote:
> Hi Bjorn,
> 
> I am using the bbswitch kernel module to switch off/on the GPU and
> to obtain the GPU power state.
> Obtaining the GPU state immediately after starting the graphical user
> session freezes the system.
> 
> This code triggers something, which is responsible for the freeze.
> 
> ---
> // Returns 1 if the card is disabled, 0 if enabled
> static int is_card_disabled(void) {
>     u32 cfg_word;
>     // read first config word which contains Vendor and Device ID. If all bits
>     // are enabled, the device is assumed to be off
>     pci_read_config_dword(dis_dev, 0, &cfg_word);
>     // if one of the bits is not enabled (the card is enabled), the inverted
>     // result will be non-zero and hence logical not will make it 0 ("false")
>     return !~cfg_word;
> }
> 
> static int bbswitch_proc_show(struct seq_file *seqfp, void *p) {
>     // show the card state. Example output: 0000:01:00:00 ON
>     dis_dev_get();
>     seq_printf(seqfp, "%s %s\n", dev_name(&dis_dev->dev),
>              is_card_disabled() ? "OFF" : "ON");
>     dis_dev_put();
>     return 0;
> }
> ---
> 
> Either dis_dev_get or pci_read_config_dword is the trigger.

What happens if you remove the call to is_card_disabled()?  Does the
system still freeze if you only do the dis_dev_get()/dis_dev_put()?


* Re: Kernel Freeze with American Megatrends BIOS
  2016-08-29 19:07     ` Bjorn Helgaas
@ 2016-08-29 19:55       ` Roland Singer
  2016-08-29 23:54         ` Bjorn Helgaas
  0 siblings, 1 reply; 39+ messages in thread
From: Roland Singer @ 2016-08-29 19:55 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci, linux-kernel, linux-acpi, dri-devel

I just tried it, and the system didn't freeze immediately. However, it does
freeze after some time (a few minutes of normal work).

It seems to be pci_read_config_dword. Where exactly is this defined?



* Re: Kernel Freeze with American Megatrends BIOS
  2016-08-29 19:55       ` Roland Singer
@ 2016-08-29 23:54         ` Bjorn Helgaas
  2016-08-30 10:08           ` Roland Singer
  0 siblings, 1 reply; 39+ messages in thread
From: Bjorn Helgaas @ 2016-08-29 23:54 UTC (permalink / raw)
  To: Roland Singer; +Cc: linux-pci, linux-kernel, linux-acpi, dri-devel

On Mon, Aug 29, 2016 at 09:55:56PM +0200, Roland Singer wrote:
> Just tried it and the system didn't freeze. However it will freeze
> after some time (few minutes while working).
> 
> Seams to be pci_read_config_dword. Where is this exactly defined?

pci_read_config_dword() is defined in include/linux/pci.h.  It calls
pci_bus_read_config_dword() which is defined by the PCI_OP_READ() macro
in drivers/pci/access.c.

If I understand correctly, this:

  dis_dev_get();
  pci_read_config_dword(dis_dev, 0, &cfg_word);
  dis_dev_put();

causes an immediate system hang, but if you only do this:

  dis_dev_get();
  dis_dev_put();

the system hangs a few minutes later.  Right?
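
One possible way to exercise the config read without bbswitch in the picture (an assumption, not something proposed in this thread) is to read the device's config space from userspace through its sysfs `config` file. The helper below is a minimal sketch; the function name is hypothetical, and the device address 0000:01:00.0 mentioned in the usage note is only a guess at where the discrete GPU sits. The sysfs path may also resume the device or otherwise behave differently from the in-kernel call, so a result from it would be a hint, not proof.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Read the 32-bit value at byte offset `pos` from a PCI config-space
 * file such as /sys/bus/pci/devices/<address>/config.
 * Returns 0 on success, -1 on any I/O error or short read. */
int read_config_dword(const char *path, long pos, uint32_t *value)
{
    unsigned char b[4];
    FILE *f;
    int ok;

    assert(path != NULL && value != NULL);
    f = fopen(path, "rb");
    if (!f)
        return -1;
    ok = fseek(f, pos, SEEK_SET) == 0 &&
         fread(b, 1, sizeof b, f) == sizeof b;
    fclose(f);
    if (!ok)
        return -1;
    /* PCI config space is little-endian: byte 0 is least significant. */
    *value = (uint32_t)b[0] | (uint32_t)b[1] << 8 |
             (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24;
    return 0;
}
```

Calling read_config_dword("/sys/bus/pci/devices/0000:01:00.0/config", 0, &v) right after the graphical session starts would show whether a plain Vendor/Device ID read is enough to reproduce the hang. On an affected machine this may freeze the system, so it should only be tried with nothing unsaved.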


* Re: Kernel Freeze with American Megatrends BIOS
  2016-08-29 23:54         ` Bjorn Helgaas
@ 2016-08-30 10:08           ` Roland Singer
  2016-08-30 13:06             ` Bjorn Helgaas
  0 siblings, 1 reply; 39+ messages in thread
From: Roland Singer @ 2016-08-30 10:08 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci, linux-kernel, linux-acpi, dri-devel

Thanks for pointing it out.

Yes, that's right. The system hangs at an unpredictable point a few minutes
later, because certain actions in the graphical user session trigger
the freeze.

I had a look at the PCI_OP_READ macro that generates the function behind
pci_read_config_dword:

  #define PCI_OP_READ(size, type, len) \
  int pci_bus_read_config_##size \
	(struct pci_bus *bus, unsigned int devfn, int pos, type *value)	\
  {									\
	int res;							\
	unsigned long flags;						\
	u32 data = 0;							\
	if (PCI_##size##_BAD) return PCIBIOS_BAD_REGISTER_NUMBER;	\
	raw_spin_lock_irqsave(&pci_lock, flags);			\
	res = bus->ops->read(bus, devfn, pos, len, &data);		\
	*value = (type)data;						\
	raw_spin_unlock_irqrestore(&pci_lock, flags);		\
	return res;							\
  }

I guess that bus->ops->read(...) might be the trigger.
Any hints on how to continue debugging?
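
To make the macro easier to follow, here is a rough userspace sketch of what PCI_OP_READ(dword, u32, 4) expands to. The struct definitions are cut-down stand-ins for the kernel's, the `powered` field and mock_read() are invented for illustration, and the pci_lock spinlock is only marked by comments, so this shows the control flow and nothing about the real hardware access:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define PCIBIOS_SUCCESSFUL          0x00
#define PCIBIOS_BAD_REGISTER_NUMBER 0x87

struct pci_bus;

/* Cut-down stand-in for the kernel's struct pci_ops: only the read hook. */
struct pci_ops {
    int (*read)(struct pci_bus *bus, unsigned int devfn, int pos,
                int len, uint32_t *val);
};

/* Cut-down stand-in for struct pci_bus; `powered` exists only in this mock. */
struct pci_bus {
    struct pci_ops *ops;
    int powered;
};

/* Hypothetical accessor: a powered-off device answers with all ones. */
int mock_read(struct pci_bus *bus, unsigned int devfn, int pos,
              int len, uint32_t *val)
{
    (void)devfn; (void)pos; (void)len;
    *val = bus->powered ? 0x139b10deu : 0xFFFFFFFFu;
    return PCIBIOS_SUCCESSFUL;
}

/* Shape of the function generated by PCI_OP_READ(dword, u32, 4). */
int pci_bus_read_config_dword(struct pci_bus *bus, unsigned int devfn,
                              int pos, uint32_t *value)
{
    int res;
    uint32_t data = 0;

    assert(bus != NULL && value != NULL);
    if (pos & 3)   /* PCI_dword_BAD: offset must be 4-byte aligned */
        return PCIBIOS_BAD_REGISTER_NUMBER;
    /* kernel: raw_spin_lock_irqsave(&pci_lock, flags) */
    res = bus->ops->read(bus, devfn, pos, 4, &data);
    *value = data;
    /* kernel: raw_spin_unlock_irqrestore(&pci_lock, flags) */
    return res;
}
```

The point of the sketch is that the generated function itself only validates the offset and serializes access; everything hardware-specific happens behind the bus->ops->read pointer, which is supplied by the platform's config-space accessor.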

Cheers,
Roland


* Re: Kernel Freeze with American Megatrends BIOS
  2016-08-30 10:08           ` Roland Singer
@ 2016-08-30 13:06             ` Bjorn Helgaas
  2016-08-30 14:08                 ` Emil Velikov
  0 siblings, 1 reply; 39+ messages in thread
From: Bjorn Helgaas @ 2016-08-30 13:06 UTC (permalink / raw)
  To: Roland Singer; +Cc: linux-pci, linux-kernel, linux-acpi, dri-devel

On Tue, Aug 30, 2016 at 12:08:57PM +0200, Roland Singer wrote:
> Thanks for pointing it out.
> 
> Yeah that's right. The system will hang randomly a few minutes later,
> because some certain actions in the graphical user session will trigger
> the freeze.
> 
> I had a look at the function body of pci_read_config_dword:
> 
>   #define PCI_OP_READ(size, type, len) \
>   int pci_bus_read_config_##size \
> 	(struct pci_bus *bus, unsigned int devfn, int pos, type *value)	\
>   {									\
> 	int res;							\
> 	unsigned long flags;						\
> 	u32 data = 0;							\
> 	if (PCI_##size##_BAD) return PCIBIOS_BAD_REGISTER_NUMBER;	\
> 	raw_spin_lock_irqsave(&pci_lock, flags);			\
> 	res = bus->ops->read(bus, devfn, pos, len, &data);		\
> 	*value = (type)data;						\
> 	raw_spin_unlock_irqrestore(&pci_lock, flags);		\
> 	return res;							\
>   }
> 
> I guess, that bus->ops->read(...) might be the trigger.
> Any hints how to continue debugging?

It's not likely that the problem is in the bus->ops->read() path.  That
is used by every device driver, so a problem there would cause more
serious problems than what you're seeing.

My guess would be some problem in the video driver or the bbswitch
thing.

> >>>>>> - Gigabyte P35V5 (i7-6700HQ/GTX 970M)
> >>>>>> - Razer Blade 14" (2016) (i7-6700HQ/GTX 970M) (BIOS 5.11, 04/07/2016)
> >>>>>>
> >>>>>>
> >>>>>> The kernel freezes if the graphical user session (Xorg & Wayland) is
> >>>>>> started with a switched off discrete GPU card (NVIDIA).
> >>>>>> If the discrete GPU is switched off after the graphical session start,
> >>>>>> then everything works as expected, until the graphical session is restarted.
> >>>>>>
> >>>>>> This problem seems to be linked to specific BIOS settings. If the computer
> >>>>>> is started with the following command line:
> >>>>>>
> >>>>>> acpi_osi=! acpi_osi="Windows 2009"
> >>>>>>
> >>>>>> then the kernel freeze does not occur anymore. However this required a special
> >>>>>> ACPI DSDT firmware patch for the Razer Blade 2016 laptop:
> >>>>>>
> >>>>>> https://github.com/m4ng0squ4sh/razer_blade_14_2016_acpi_dsdt
> >>>>>>
> >>>>>> I strongly recommend fixing this in the kernel, and I am ready to help solve
> >>>>>> this problem with some help.
> >>>>>>
> >>>>>> Here is a link to the GitHub issue with further information:
> >>>>>>
> >>>>>> https://github.com/Bumblebee-Project/Bumblebee/issues/764#issuecomment-241212595
> >>>>>>
> >>>>>> Here is some more detailed information:
> >>>>>>
> >>>>>> https://github.com/Lekensteyn/acpi-stuff/blob/master/Clevo-P651RA/notes.txt
> >>>>>>
> >>>>>> Hope somebody can help.
> >>>>
> >>>> --
> >>>> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> >>>> the body of a message to majordomo@vger.kernel.org
> >>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>
> >>
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Kernel Freeze with American Megatrends BIOS
  2016-08-30 13:06             ` Bjorn Helgaas
@ 2016-08-30 14:08                 ` Emil Velikov
  0 siblings, 0 replies; 39+ messages in thread
From: Emil Velikov @ 2016-08-30 14:08 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, Roland Singer, Linux-Kernel@Vger. Kernel. Org,
	ML dri-devel, linux-acpi

On 30 August 2016 at 14:06, Bjorn Helgaas <helgaas@kernel.org> wrote:
> On Tue, Aug 30, 2016 at 12:08:57PM +0200, Roland Singer wrote:
>> Thanks for pointing it out.
>>
>> Yeah that's right. The system will hang randomly a few minutes later,
>> because certain actions in the graphical user session will trigger
>> the freeze.
>>
>> I had a look at the function body of pci_read_config_dword:
>>
>>   #define PCI_OP_READ(size, type, len) \
>>   int pci_bus_read_config_##size \
>>       (struct pci_bus *bus, unsigned int devfn, int pos, type *value) \
>>   {                                                                   \
>>       int res;                                                        \
>>       unsigned long flags;                                            \
>>       u32 data = 0;                                                   \
>>       if (PCI_##size##_BAD) return PCIBIOS_BAD_REGISTER_NUMBER;       \
>>       raw_spin_lock_irqsave(&pci_lock, flags);                        \
>>       res = bus->ops->read(bus, devfn, pos, len, &data);              \
>>       *value = (type)data;                                            \
>>       raw_spin_unlock_irqrestore(&pci_lock, flags);           \
>>       return res;                                                     \
>>   }
>>
>> I guess, that bus->ops->read(...) might be the trigger.
>> Any hints how to continue debugging?
>
> It's not likely that the problem is in the bus->ops->read() path.  That
> is used by every device driver, so a problem there would cause more
> serious problems than what you're seeing.
>
> My guess would be some problem in the video driver or the bbswitch
> thing.
>
FWIW I'm inclined to call it a bbswitch bug. It can (and does when
needed) power off the dedicated GPU.

Depending on the platform different methods are used:

Sometimes the GPU driver will get 0xffffffff (or similar) when trying
to read from the device mmio space. While one can say that the driver
should account for this, IMHO it's a bad idea to have two drivers
controlling the same hardware, let alone without any coordination
between them.

IIRC in some cases the device can disappear from the PCI bus (not 100%
sure about this one), in which case a simple read can lead to a wide range
of fireworks.
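
To make the all-ones case concrete: a read that no device answers
typically comes back as 0xffffffff, and that is the condition the
`!~cfg_word` test in bbswitch's is_card_disabled() checks for. Below is a
minimal userspace sketch of that check. The helper names
pci_read_looks_dead() and describe_cfg_word() are made up for
illustration and are not part of bbswitch or the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Value a PCI read typically returns when no device responds. */
#define PCI_NO_RESPONSE UINT32_C(0xffffffff)

/*
 * Hypothetical helper: true if a 32-bit value read from a device has
 * every bit set. Same result as bbswitch's `!~cfg_word`.
 */
static bool pci_read_looks_dead(uint32_t value)
{
    return value == PCI_NO_RESPONSE;
}

/*
 * Hypothetical helper: format the first config dword the way a status
 * file might. Vendor ID is in the low 16 bits, device ID in the high 16.
 */
static void describe_cfg_word(uint32_t cfg_word, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);

    if (pci_read_looks_dead(cfg_word))
        snprintf(buf, len, "OFF");
    else
        snprintf(buf, len, "ON vendor=%04x device=%04x",
                 (unsigned)(cfg_word & 0xffff),
                 (unsigned)(cfg_word >> 16));
}
```

With the example value 0x139b10de (NVIDIA's vendor ID 0x10de in the low
half), describe_cfg_word() writes "ON vendor=10de device=139b"; with
0xffffffff it writes "OFF". The sketch only classifies a value that has
already been read; it says nothing about whether issuing the read is safe.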

Disclaimer: it's been a while since I've looked into bbswitch so
things might have changed/improved.

Regards,
Emil
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Kernel Freeze with American Megatrends BIOS
  2016-08-30 14:08                 ` Emil Velikov
  (?)
@ 2016-08-30 15:25                 ` Roland Singer
  2016-08-30 15:44                   ` Ilia Mirkin
  2016-08-30 15:48                   ` Emil Velikov
  -1 siblings, 2 replies; 39+ messages in thread
From: Roland Singer @ 2016-08-30 15:25 UTC (permalink / raw)
  To: Emil Velikov, Bjorn Helgaas
  Cc: linux-pci, Linux-Kernel@Vger. Kernel. Org, ML dri-devel, linux-acpi

I tried these scenarios:

1. Booted the system without the bbswitch module. The nouveau module
   was loaded and is responsible for the power management of the GPU.
   The graphical session freezes after some minutes...

2. Booted the system without bbswitch and with nouveau blacklisted.
   Manually loaded bbswitch to switch off the discrete GPU.
   Same freeze after a while or by explicitly obtaining the GPU state.

Is it possible to switch off the discrete card without bbswitch?
If so, I could test this without nouveau and bbswitch
at all. If the system still hangs, then neither the video driver nor
bbswitch is to blame.

Am 30.08.2016 um 16:08 schrieb Emil Velikov:
> On 30 August 2016 at 14:06, Bjorn Helgaas <helgaas@kernel.org> wrote:
>> On Tue, Aug 30, 2016 at 12:08:57PM +0200, Roland Singer wrote:
>>> Thanks for pointing it out.
>>>
>>> Yeah that's right. The system will hang randomly a few minutes later,
>>> because certain actions in the graphical user session will trigger
>>> the freeze.
>>>
>>> I had a look at the function body of pci_read_config_dword:
>>>
>>>   #define PCI_OP_READ(size, type, len) \
>>>   int pci_bus_read_config_##size \
>>>       (struct pci_bus *bus, unsigned int devfn, int pos, type *value) \
>>>   {                                                                   \
>>>       int res;                                                        \
>>>       unsigned long flags;                                            \
>>>       u32 data = 0;                                                   \
>>>       if (PCI_##size##_BAD) return PCIBIOS_BAD_REGISTER_NUMBER;       \
>>>       raw_spin_lock_irqsave(&pci_lock, flags);                        \
>>>       res = bus->ops->read(bus, devfn, pos, len, &data);              \
>>>       *value = (type)data;                                            \
>>>       raw_spin_unlock_irqrestore(&pci_lock, flags);           \
>>>       return res;                                                     \
>>>   }
>>>
>>> I guess, that bus->ops->read(...) might be the trigger.
>>> Any hints how to continue debugging?
>>
>> It's not likely that the problem is in the bus->ops->read() path.  That
>> is used by every device driver, so a problem there would cause more
>> serious problems than what you're seeing.
>>
>> My guess would be some problem in the video driver or the bbswitch
>> thing.
>>
> FWIW I'm inclined to call it a bbswitch bug. It can (and does when
> needed) power off the dedicated GPU.
> 
> Depending on the platform different methods are used:
> 
> Sometimes the GPU driver will get 0xffffffff (or similar) when trying
> to read from the device mmio space. While one can say that the driver
> > should account for this, IMHO it's a bad idea to have two drivers
> controlling the same hardware, let alone without any coordination
> between them.
> 
> IIRC in some cases the device can disappear from the PCI bus (not 100%
> sure this one). In which case a simple read can lead to a wide range
> of fireworks.
> 
> Disclaimer: it's been a while since I've looked into bbswitch so
> things might have changed/improved.
> 
> Regards,
> Emil
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Kernel Freeze with American Megatrends BIOS
  2016-08-30 15:25                 ` Roland Singer
@ 2016-08-30 15:44                   ` Ilia Mirkin
  2016-08-30 15:48                     ` Ilia Mirkin
  2016-08-30 15:48                   ` Emil Velikov
  1 sibling, 1 reply; 39+ messages in thread
From: Ilia Mirkin @ 2016-08-30 15:44 UTC (permalink / raw)
  To: Roland Singer
  Cc: Emil Velikov, Bjorn Helgaas, Linux PCI,
	Linux-Kernel@Vger. Kernel. Org, ML dri-devel, linux-acpi

On Tue, Aug 30, 2016 at 11:25 AM, Roland Singer
<roland.singer@desertbit.com> wrote:
> I tried these scenarios:
>
> 1. Booted the system without the bbswitch module. The nouveau module
>    was loaded and is responsible for the power management of the GPU.
>    The graphical session freezes after some minutes...
>
> 2. Booted the system without bbswitch and with nouveau blacklisted.
>    Manually loaded bbswitch to switch off the discrete GPU.
>    Same freeze after a while or by explicitly obtaining the GPU state.
>
> Is there a possibility to switch off the discrete card without bbswitch?
> If this is possible, then I could test this without nouveau and bbswitch
> at all. If the system hangs, then it is not the video driver nor bbswitch.

You can use acpi_call (a random search points to
https://github.com/mkottman/acpi_call, but I don't know if that's the
"official" version) - need to find the right method to call, but
that's basically all it takes to acpi-suspend a gpu.
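
For reference, a rough sketch of driving acpi_call from C, assuming the
module is loaded and exposes /proc/acpi/call as its README describes:
the method path is written to that file and the result is read back from
the same file. The function name acpi_call_invoke() is made up for
illustration, and the method path in the usage note is only a
placeholder. The real path has to be looked up in the machine's ACPI
tables.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write an ACPI method path to the acpi_call file
 * and read back the first line of the reply.
 * Returns 0 on success, -1 if the file cannot be opened or written.
 */
int acpi_call_invoke(const char *proc_path, const char *method,
                     char *result, size_t result_len)
{
    FILE *f;

    assert(proc_path != NULL && method != NULL && result != NULL);
    if (result_len == 0)
        return -1;
    result[0] = '\0';

    /* Step 1: hand the method path to the module. */
    f = fopen(proc_path, "w");
    if (!f)
        return -1;
    if (fputs(method, f) == EOF) {
        fclose(f);
        return -1;
    }
    if (fclose(f) != 0)
        return -1;

    /* Step 2: read the reply back from the same file. */
    f = fopen(proc_path, "r");
    if (!f)
        return -1;
    if (!fgets(result, (int)result_len, f))
        result[0] = '\0';
    fclose(f);

    /* Strip a trailing newline, if any. */
    result[strcspn(result, "\n")] = '\0';
    return 0;
}
```

A call would look like
acpi_call_invoke("/proc/acpi/call", "\\_SB.PCI0.PEG0.PEGP._OFF", buf, sizeof buf)
run as root, where the method path is a placeholder and will differ per
machine.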

Separately, there was a recent fix to ... something, including but not
limited to nouveau, involving hangs on gpu suspend on newer laptops. I
don't think it's upstream yet. Look for patches from Lukas Wunner.

  -ilia

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Kernel Freeze with American Megatrends BIOS
  2016-08-30 15:44                   ` Ilia Mirkin
@ 2016-08-30 15:48                     ` Ilia Mirkin
  0 siblings, 0 replies; 39+ messages in thread
From: Ilia Mirkin @ 2016-08-30 15:48 UTC (permalink / raw)
  To: Roland Singer
  Cc: Emil Velikov, Bjorn Helgaas, Linux PCI,
	Linux-Kernel@Vger. Kernel. Org, ML dri-devel, linux-acpi

On Tue, Aug 30, 2016 at 11:44 AM, Ilia Mirkin <imirkin@alum.mit.edu> wrote:
> On Tue, Aug 30, 2016 at 11:25 AM, Roland Singer
> <roland.singer@desertbit.com> wrote:
>> I tried these scenarios:
>>
>> 1. Booted the system without the bbswitch module. The nouveau module
>>    was loaded and is responsible for the power management of the GPU.
>>    The graphical session freezes after some minutes...
>>
>> 2. Booted the system without bbswitch and with nouveau blacklisted.
>>    Manually loaded bbswitch to switch off the discrete GPU.
>>    Same freeze after a while or by explicitly obtaining the GPU state.
>>
>> Is there a possibility to switch off the discrete card without bbswitch?
>> If this is possible, then I could test this without nouveau and bbswitch
>> at all. If the system hangs, then it is not the video driver nor bbswitch.
>
> You can use acpi_call (a random search points to
> https://github.com/mkottman/acpi_call, but I don't know if that's the
> "official" version) - need to find the right method to call, but
> that's basically all it takes to acpi-suspend a gpu.
>
> Separately, there was a recent fix to ... something, including but not
> limited to nouveau, involving hangs on gpu suspend on newer laptops. I
> don't think it's upstream yet. Look for patches from Lukas Wunner.

Er oops. Looks like I misremembered. Patches are from Peter Wu, and at
least one of them is in v4.8-rc1:

commit 692a17dcc2922a91c6bcf11b3321503a3377b1b1
Author: Peter Wu <peter@lekensteyn.nl>
Date:   Fri Jul 15 15:12:18 2016 +0200

    drm/nouveau/acpi: fix lockup with PCIe runtime PM

along with a number of other related patches. It's not clear which
kernel you were trying this with... can you give v4.8-rcN a shot?

  -ilia

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Kernel Freeze with American Megatrends BIOS
  2016-08-30 15:25                 ` Roland Singer
  2016-08-30 15:44                   ` Ilia Mirkin
@ 2016-08-30 15:48                   ` Emil Velikov
  2016-08-30 17:37                     ` Roland Singer
  1 sibling, 1 reply; 39+ messages in thread
From: Emil Velikov @ 2016-08-30 15:48 UTC (permalink / raw)
  To: Roland Singer
  Cc: Bjorn Helgaas, linux-pci, Linux-Kernel@Vger. Kernel. Org,
	ML dri-devel, linux-acpi

On 30 August 2016 at 16:25, Roland Singer <roland.singer@desertbit.com> wrote:
> I tried these scenarios:
>
> 1. Booted the system without the bbswitch module. The nouveau module
>    was loaded and is responsible for the power management of the GPU.
>    The graphical session freezes after some minutes...
>
> 2. Booted the system without bbswitch and with nouveau blacklisted.
>    Manually loaded bbswitch to switch off the discrete GPU.
>    Same freeze after a while or by explicitly obtaining the GPU state.
>
> Is there a possibility to switch off the discrete card without bbswitch?
> If this is possible, then I could test this without nouveau and bbswitch
> at all. If the system hangs, then it is not the video driver nor bbswitch.
>
As Ilia mentioned, acpi_call should do it. You can also check the
nouveau/bbswitch code to see which ones they use in your case and
bash it manually. It might be that the 'wrong one' gets used, thus
things go horribly wrong.

Regards,
Emil

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Kernel Freeze with American Megatrends BIOS
  2016-08-30 15:48                   ` Emil Velikov
@ 2016-08-30 17:37                     ` Roland Singer
  2016-08-30 17:43                         ` Ilia Mirkin
  2016-08-30 18:09                         ` Emil Velikov
  0 siblings, 2 replies; 39+ messages in thread
From: Roland Singer @ 2016-08-30 17:37 UTC (permalink / raw)
  To: Emil Velikov
  Cc: Bjorn Helgaas, linux-pci, Linux-Kernel@Vger. Kernel. Org,
	ML dri-devel, linux-acpi, imirkin

I am running 4.7.2, but I also just tried the 4.8.0-rc4 mainline kernel.
The result is the same. There is no difference whether bbswitch or acpi_call
is used. However, I noticed the following:

1. The nouveau driver is broken in both kernel versions and is responsible
   for the freezes while gathering power state information with bbswitch.
   Sometimes while shutting the system down, everything except the LCD
   screen is switched off. This only happens with nouveau.
   I noticed the following error log messages:

   kernel: nouveau 0000:01:00.0: fb: 6144 MiB GDDR5
   kernel: nouveau 0000:01:00.0: priv: HUB0: 10ecc0 ffffffff (1e40822c)
   kernel: nouveau 0000:01:00.0: DRM: VRAM: 6144 MiB
   kernel: nouveau 0000:01:00.0: DRM: GART: 1048576 MiB
   kernel: nouveau 0000:01:00.0: DRM: Pointer to TMDS table invalid
   kernel: nouveau 0000:01:00.0: DRM: DCB version 4.1
   kernel: nouveau 0000:01:00.0: DRM: Pointer to flat panel table invalid

2. -> Boot with nouveau module loaded
   -> switch off the discrete GPU with bbswitch or acpi_call
   -> start X11
   -> obtaining power state with bbswitch freezes the system
   -> or working with the system for some minutes freezes the system

3. -> Boot with nouveau module blacklisted
   -> switch off the discrete GPU
   -> start X11
   -> system immediately freezes

4. -> Boot with nouveau module blacklisted
   -> switch off the discrete GPU
   -> start Wayland
   -> system runs - Note: I tried this for a couple of days with 4.6 and 4.7 mainline
                          and the system froze randomly after some time.
                          However, I have to test whether this is still present with 4.7.2
                          and 4.8 mainline. Right now it seems to be fine.
   -> running Xwayland (does not depend on the GPU power state) kills performance!
      the system freezes for several seconds...
      So working with Wayland is also no solution.

My conclusion:

1. Nouveau has a couple of problems with GTX 9** M Nvidia GPUs.
   I would love to help here.

2. X11 is just broken and is not capable of starting the graphical session
   if the nvidia GPU is not handled by any video driver (kernel module).
   Even forcing X11 to ignore the discrete GPU doesn't help.

   Setting the command line arguments to:

     acpi_osi=! acpi_osi="Windows 2009"

   fixes the issues with X11 but other things break...
   What the hell is going on?! :/

Am 30.08.2016 um 17:48 schrieb Emil Velikov:
> On 30 August 2016 at 16:25, Roland Singer <roland.singer@desertbit.com> wrote:
>> I tried these scenarios:
>>
>> 1. Booted the system without the bbswitch module. The nouveau module
>>    was loaded and is responsible for the power management of the GPU.
>>    The graphical session freezes after some minutes...
>>
>> 2. Booted the system without bbswitch and with nouveau blacklisted.
>>    Manually loaded bbswitch to switch off the discrete GPU.
>>    Same freeze after a while or by explicitly obtaining the GPU state.
>>
>> Is there a possibility to switch off the discrete card without bbswitch?
>> If this is possible, then I could test this without nouveau and bbswitch
>> at all. If the system hangs, then it is not the video driver nor bbswitch.
>>
> As Ilia mentioned acpi_call should do it. You can also check with the
> nouveau/bbswitch code to see which ones they use in your case and
> bash it manually. It might be that the 'wrong one' gets used thus
> things going horribly wrong.
> 
> Regards,
> Emil
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Kernel Freeze with American Megatrends BIOS
  2016-08-30 17:37                     ` Roland Singer
@ 2016-08-30 17:43                         ` Ilia Mirkin
  2016-08-30 18:09                         ` Emil Velikov
  1 sibling, 0 replies; 39+ messages in thread
From: Ilia Mirkin @ 2016-08-30 17:43 UTC (permalink / raw)
  To: Roland Singer
  Cc: Linux PCI, Emil Velikov, Linux-Kernel@Vger. Kernel. Org,
	ML dri-devel, linux-acpi, Bjorn Helgaas

On Tue, Aug 30, 2016 at 1:37 PM, Roland Singer
<roland.singer@desertbit.com> wrote:
> My conclusion:
>
> 1. Nouveau has a couple of problems with GTX 9** M Nvidia GPUs.
>    I would love to help here.

nouveau + bbswitch will always end in tears. You're going behind the
driver's back and messing around with state it believes it is
managing. What if you just use nouveau and let it auto-power-off the
GPU like it's designed to, with v4.8-rc?

  -ilia

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Kernel Freeze with American Megatrends BIOS
  2016-08-30 17:43                         ` Ilia Mirkin
  (?)
@ 2016-08-30 18:02                         ` Roland Singer
  2016-08-30 18:13                             ` Ilia Mirkin
  -1 siblings, 1 reply; 39+ messages in thread
From: Roland Singer @ 2016-08-30 18:02 UTC (permalink / raw)
  To: Ilia Mirkin
  Cc: Emil Velikov, Bjorn Helgaas, Linux PCI,
	Linux-Kernel@Vger. Kernel. Org, ML dri-devel, linux-acpi

I configured bbswitch to not set any states automatically...
So it's possible to obtain and verify the GPU power state.

However I removed the bbswitch module and booted with nouveau.

Kernel 4.7.2: nouveau switches the discrete GPU off.
              I can't trigger the freeze, because bbswitch is missing.
              I'll work with the system and see if it will freeze.

Kernel 4.8-rc4: nouveau does not care about the power state and
                the discrete GPU is never switched off. I would notice
                if it were, because the second cooling fan would stop...
                Same log messages as sent before.
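
Without bbswitch, one way to look at the state without issuing a config
space read might be the runtime PM attribute in sysfs. A minimal sketch,
assuming the GPU sits at 0000:01:00.0 (the address in the nouveau log
lines) and that the kernel exposes power/runtime_status for it. Both
helper names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build the sysfs path of the runtime PM status
 * attribute for a PCI device given as domain:bus:dev.fn.
 * Returns 0 on success, -1 if the buffer is too small.
 */
int pci_runtime_status_path(const char *bdf, char *out, size_t out_len)
{
    int n;

    assert(bdf != NULL && out != NULL);
    n = snprintf(out, out_len,
                 "/sys/bus/pci/devices/%s/power/runtime_status", bdf);
    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}

/*
 * Hypothetical helper: read the first line of a small text file into
 * buf, without the trailing newline.
 * Returns 0 on success, -1 if the file cannot be opened.
 */
int read_first_line(const char *path, char *buf, size_t buf_len)
{
    FILE *f;

    assert(path != NULL && buf != NULL);
    if (buf_len == 0)
        return -1;

    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)buf_len, f))
        buf[0] = '\0';
    fclose(f);

    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}
```

The attribute holds strings such as "active" or "suspended". It only
reflects what the kernel's runtime PM core believes, so it would not
notice a power-off done behind the driver's back with bbswitch or
acpi_call.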


Am 30.08.2016 um 19:43 schrieb Ilia Mirkin:
> On Tue, Aug 30, 2016 at 1:37 PM, Roland Singer
> <roland.singer@desertbit.com> wrote:
>> My conclusion:
>>
>> 1. Nouveau has a couple of problems with GTX 9** M Nvidia GPUs.
>>    I would love to help here.
> 
> nouveau + bbswitch will always end in tears. You're going behind the
> driver's back and messing around with state it believes it is
> managing. What if you just use nouveau and let it auto-power-off the
> GPU like it's designed to, with v4.8-rc?
> 
>   -ilia
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Kernel Freeze with American Megatrends BIOS
  2016-08-30 17:37                     ` Roland Singer
@ 2016-08-30 18:09                         ` Emil Velikov
  2016-08-30 18:09                         ` Emil Velikov
  1 sibling, 0 replies; 39+ messages in thread
From: Emil Velikov @ 2016-08-30 18:09 UTC (permalink / raw)
  To: Roland Singer
  Cc: linux-pci, Linux-Kernel@Vger. Kernel. Org, ML dri-devel,
	linux-acpi, Bjorn Helgaas

On 30 August 2016 at 18:37, Roland Singer <roland.singer@desertbit.com> wrote:
> I am running 4.7.2, but I also just tried the 4.8.0-rc4 mainline kernel.
> The result is the same. There is no difference whether bbswitch or acpi_call
> is used. However, I noticed the following:
>
> 1. The nouveau driver is broken in both kernel versions and is responsible
>    for the freezes while gathering power state information with bbswitch.
>    Sometimes while shutting the system down, everything except the LCD
>    screen is switched off. This only happens with nouveau.
>    I noticed the following error log messages:
>
I second Ilia here. Using bbswitch in conjunction with any driver (be
that nouveau or the proprietary one) is a bad idea.

>    kernel: nouveau 0000:01:00.0: fb: 6144 MiB GDDR5
>    kernel: nouveau 0000:01:00.0: priv: HUB0: 10ecc0 ffffffff (1e40822c)
>    kernel: nouveau 0000:01:00.0: DRM: VRAM: 6144 MiB
>    kernel: nouveau 0000:01:00.0: DRM: GART: 1048576 MiB
>    kernel: nouveau 0000:01:00.0: DRM: Pointer to TMDS table invalid
>    kernel: nouveau 0000:01:00.0: DRM: DCB version 4.1
>    kernel: nouveau 0000:01:00.0: DRM: Pointer to flat panel table invalid
>
> 2. -> Boot with nouveau module loaded
>    -> switch off the discrete GPU with bbswitch or acpi_call
>    -> start X11
>    -> obtaining power state with bbswitch freezes the system
>    -> or working with the system for some minutes freezes the system
>
(If Ilia's suggestion does not help) Confirm whether the freeze occurs
when (or because) the GPU is powered on or off.

> 3. -> Boot with nouveau module blacklisted
>    -> switch off the discrete GPU
>    -> start X11
>    -> system immediately freezes
>
It's perfectly possible that the discrete GPU is set as the boot one and X
gets angry since there's no driver/way to bring it up.

> 4. -> Boot with nouveau module blacklisted
>    -> switch off the discrete GPU
>    -> start Wayland
>    -> system runs - Note: I tried this for a couple of days with 4.6 and 4.7 mainline
>                           and the system froze randomly after some time.
>                           However, I have to test whether this is still present with 4.7.2
>                           and 4.8 mainline. Right now it seems to be fine.
>    -> running Xwayland (does not depend on the GPU power state) kills performance!
>       the system freezes for several seconds...
>       So working with Wayland is also no solution.
>
> My conclusion:
>
> 1. Nouveau has a couple of problems with GTX 9** M Nvidia GPUs.
>    I would love to help here.
>
> 2. X11 is just broken and is not capable of starting the graphical session
>    if the nvidia GPU is not handled by any video driver (kernel module).
>    Even forcing X11 to ignore the discrete GPU doesn't help.
>
Out of curiosity: how did you force X to ignore the device ?

>    Setting the command line arguments to:
>
>      acpi_osi=! acpi_osi="Windows 2009"
>
>    fixes the issues with X11 but other things break...
>    What the hell is going on?! :/
>
You can check if it's the boot_vga assumption with

Check wh

You're a victim of the Windows specific fun (quirks?) in

> Am 30.08.2016 um 17:48 schrieb Emil Velikov:
>> On 30 August 2016 at 16:25, Roland Singer <roland.singer@desertbit.com> wrote:
>>> I tried these scenarios:
>>>
>>> 1. Booted the system without the bbswitch module. The nouveau module
>>>    was loaded and is responsible for the power management of the GPU.
>>>    The graphical session freezes after some minutes...
>>>
>>> 2. Booted the system without bbswitch and with nouveau blacklisted.
>>>    Manually loaded bbswitch to switch off the discrete GPU.
>>>    Same freeze after a while or by explicitly obtaining the GPU state.
>>>
>>> Is there a possibility to switch off the discrete card without bbswitch?
>>> If this is possible, then I could test this without nouveau and bbswitch
>>> at all. If the system hangs, then it is not the video driver nor bbswitch.
>>>
>> As Ilia mentioned acpi_call should do it. You can also check with the
>> nouveau/bbswitch code to see which ones they use in your case and
>> bash it manually. It might be that the 'wrong one' gets used thus
>> things going horribly wrong.
>>
>> Regards,
>> Emil
>>
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Kernel Freeze with American Megatrends BIOS
  2016-08-30 18:09                         ` Emil Velikov
@ 2016-08-30 18:10                           ` Emil Velikov
  -1 siblings, 0 replies; 39+ messages in thread
From: Emil Velikov @ 2016-08-30 18:10 UTC (permalink / raw)
  To: Roland Singer
  Cc: linux-pci, Linux-Kernel@Vger. Kernel. Org, ML dri-devel,
	linux-acpi, Bjorn Helgaas

On 30 August 2016 at 19:09, Emil Velikov <emil.l.velikov@gmail.com> wrote:
> On 30 August 2016 at 18:37, Roland Singer <roland.singer@desertbit.com> wrote:
>> I am running 4.7.2, but I also just tried the 4.8.0-rc4 mainline kernel.
>> The result is the same. It makes no difference whether bbswitch or
>> acpi_call is used. However, I noticed the following:
>>
>> 1. The nouveau driver is broken in both kernel versions and is responsible
>>    for the freezes while gathering power state information with bbswitch.
>>    Sometimes while shutting the system down, everything except the LCD
>>    screen is switched off. This only happens with nouveau.
>>    I noticed the following error log messages:
>>
> I second Ilia here. Using bbswitch in conjunction with any driver (be
> that nouveau or the proprietary one) is a bad idea.
>
>>    kernel: nouveau 0000:01:00.0: fb: 6144 MiB GDDR5
>>    kernel: nouveau 0000:01:00.0: priv: HUB0: 10ecc0 ffffffff (1e40822c)
>>    kernel: nouveau 0000:01:00.0: DRM: VRAM: 6144 MiB
>>    kernel: nouveau 0000:01:00.0: DRM: GART: 1048576 MiB
>>    kernel: nouveau 0000:01:00.0: DRM: Pointer to TMDS table invalid
>>    kernel: nouveau 0000:01:00.0: DRM: DCB version 4.1
>>    kernel: nouveau 0000:01:00.0: DRM: Pointer to flat panel table invalid
>>
>> 2. -> Boot with nouveau module loaded
>>    -> switch off the discrete GPU with bbswitch or acpi_call
>>    -> start X11
>>    -> obtaining power state with bbswitch freezes the system
>>    -> or working with the system for some minutes freezes the system
>>
> (If Ilia's suggestions do not help) Confirm whether the freeze is due
> to/as the GPU is powered on or off.
>
>> 3. -> Boot with nouveau module blacklisted
>>    -> switch off the discrete GPU
>>    -> start X11
>>    -> system immediately freezes
>>
> It's perfectly possible that the discrete GPU is set as the boot one and X
> gets angry since there's no driver/way to bring it up.
>
>> 4. -> Boot with nouveau module blacklisted
>>    -> switch off the discrete GPU
>>    -> start Wayland
>>    -> system runs - Note: I tried this for a couple of days with 4.6 and 4.7 mainline
>>                           and the system froze randomly after some time.
>>                           However, I have to test if this is still present with 4.7.2
>>                           and 4.8 mainline. Right now it seems to be fine.
>>    -> running Xwayland (does not depend on the GPU power state) kills performance!
>>       the system freezes for several seconds...
>>       So working with Wayland is also no solution.
>>
>> My conclusion:
>>
>> 1. Nouveau has a couple of problems with GTX 9** M Nvidia GPUs.
>>    I would love to help here.
>>
>> 2. X11 is just broken and is not capable of starting the graphical session
>>    if the Nvidia GPU is not handled by any video driver (kernel module).
>>    Even forcing X11 to ignore the discrete GPU doesn't help.
>>
> Out of curiosity: how did you force X to ignore the device ?
>
>>    Setting the command line arguments to:
>>
>>      acpi_osi=! acpi_osi="Windows 2009"
>>
>>    fixes the issues with X11 but other things break...
>>    What the hell is going on?! :/
>>
> You can check if it's the boot_vga assumption with
>
[Sorry about that] ...
cat /sys/class/drm/card*/device/{boot_vga,vendor}

If the output changes then my assumption holds true.
-Emil

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Kernel Freeze with American Megatrends BIOS
  2016-08-30 18:02                         ` Roland Singer
@ 2016-08-30 18:13                             ` Ilia Mirkin
  0 siblings, 0 replies; 39+ messages in thread
From: Ilia Mirkin @ 2016-08-30 18:13 UTC (permalink / raw)
  To: Roland Singer
  Cc: Linux PCI, Emil Velikov, Linux-Kernel@Vger. Kernel. Org,
	ML dri-devel, linux-acpi, Bjorn Helgaas, Peter Wu

On Tue, Aug 30, 2016 at 2:02 PM, Roland Singer
<roland.singer@desertbit.com> wrote:
> I configured bbswitch to not set any states automatically...
> So it's possible to obtain and verify the GPU power state.
>
> However I removed the bbswitch module and booted with nouveau.
>
> Kernel 4.7.2: nouveau switches the discrete GPU off.
>               I can't trigger the freeze, because bbswitch is missing.
>               I'll work with the system and see if it will freeze.
>
> Kernel 4.8-rc4: nouveau does not care about the power state and
>                 the discrete GPU is never switched off. I will notice
>                 this, because the second cooling FAN will stop...
>                 Same log messages as sent before.

That's surprising. I believe there's an issue with the new logic when
there's an HDMI audio subdevice. However that only appears if there's
a cable plugged in, at least in the systems Peter tested. You should
be able to see whether it's there or not with 'lspci'.

You can check for sure by looking in the vgaswitcheroo state. It
should say DynOff when it's powered off.
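
As a rough illustration of that check, here is a minimal Python sketch that parses the vgaswitcheroo state. It assumes the usual debugfs file /sys/kernel/debug/vgaswitcheroo/switch (root and a mounted debugfs are needed) and a line layout of index:client:active:power:pci-address as I remember it from the kernel source. The function names are made up for this example and the sample lines are illustrative, not captured from Roland's machine.

```python
def parse_switch(text):
    """Parse vgaswitcheroo lines such as '1:DIS: :DynOff:0000:01:00.0'."""
    clients = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # The PCI address contains ':' itself, so split at most 4 times.
        index, client, active, power, pci = line.split(":", 4)
        clients.append({
            "index": int(index),
            "client": client,          # "IGD" or "DIS" (optionally "-Audio")
            "active": active == "+",   # '+' marks the active client
            "power": power,            # e.g. "Pwr", "Off", "DynPwr", "DynOff"
            "pci": pci.strip(),
        })
    return clients


def discrete_gpu_is_off(clients):
    """True if a discrete client reports a power state ending in 'Off'."""
    return any(c["client"] == "DIS" and c["power"].endswith("Off")
               for c in clients)


# Illustrative sample; on a real system read the debugfs file instead:
#   open("/sys/kernel/debug/vgaswitcheroo/switch").read()
sample = "0:IGD:+:Pwr:0000:00:02.0\n1:DIS: :DynOff:0000:01:00.0\n"
print(discrete_gpu_is_off(parse_switch(sample)))
```

Reading this file only reports the state that vga_switcheroo has recorded, so it should not need to touch the GPU itself.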

Either way, I think using bbswitch + nouveau isn't supported by
anyone, so if you want to use it that way, you're on your own. (You
may want to load nouveau with runpm=0 so that nouveau doesn't try to
manage the GPU suspend stuff.)

  -ilia

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Kernel Freeze with American Megatrends BIOS
  2016-08-30 18:13                             ` Ilia Mirkin
  (?)
@ 2016-08-30 19:21                             ` Peter Wu
  2016-08-31 11:12                               ` Roland Singer
  -1 siblings, 1 reply; 39+ messages in thread
From: Peter Wu @ 2016-08-30 19:21 UTC (permalink / raw)
  To: Ilia Mirkin
  Cc: Roland Singer, Emil Velikov, Bjorn Helgaas, Linux PCI,
	Linux-Kernel@Vger. Kernel. Org, ML dri-devel, linux-acpi

On Tue, Aug 30, 2016 at 02:13:46PM -0400, Ilia Mirkin wrote:
> On Tue, Aug 30, 2016 at 2:02 PM, Roland Singer
> <roland.singer@desertbit.com> wrote:
> > I configured bbswitch to not set any states automatically...
> > So it's possible to obtain and verify the GPU power state.
> >
> > However I removed the bbswitch module and booted with nouveau.
> >
> > Kernel 4.7.2: nouveau switches the discrete GPU off.
> >               I can't trigger the freeze, because bbswitch is missing.
> >               I'll work with the system and see if it will freeze.
> >
> > Kernel 4.8-rc4: nouveau does not care about the power state and
> >                 the discrete GPU is never switched off. I will notice
> >                 this, because the second cooling FAN will stop...
> >                 Same log messages as sent before.
> 
> That's surprising. I believe there's an issue with the new logic when
> there's an HDMI audio subdevice. However that only appears if there's
> a cable plugged in, at least in the systems Peter tested. You should
> be able to see whether it's there or not with 'lspci'.

I doubt that the audio device is responsible here; that issue should only
show up after following very specific steps (runtime suspend/resume (PCI or
ACPI magic), remove PCI device, rescan bus).

> You can check for sure by looking in the vgaswitcheroo state. It
> should say DynOff when it's powered off.
> 
> Either way, I think using bbswitch + nouveau isn't supported by
> anyone, so if you want to use it that way, you're on your own. (You
> may want to load nouveau with runpm=0 so that nouveau doesn't try to
> manage the GPU suspend stuff.)

I understood that Roland's intent is to check the power state, not to use
the suspend functionality of bbswitch. If you load bbswitch without
module options and do not write to /proc/bbswitch, then it allows you to
read out the actual status (you could also just use lspci -H1 for that,
though).
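
To illustrate what "reading out the status" from raw config space can look like, here is a small Python sketch. It assumes the common PCI behaviour that a device which is powered down (or absent) reads back as all ones, so the vendor ID shows up as 0xffff. The helper names and the device ID in the example are made up. Note that reading the config file through sysfs goes through the kernel and may itself resume the device, which is presumably why direct access (lspci -H1) is the safer probe here.

```python
import struct


def decode_pci_id(config_bytes):
    """Return (vendor_id, device_id) from the first 4 config-space bytes."""
    vendor, device = struct.unpack_from("<HH", config_bytes, 0)
    return vendor, device


def device_responds(config_bytes):
    """A vendor ID of 0xffff usually means nothing answered the read."""
    vendor, _ = decode_pci_id(config_bytes)
    return vendor != 0xFFFF


# 0x10de is the NVIDIA vendor ID; the device ID 0x1234 is a placeholder.
powered_on = bytes([0xDE, 0x10, 0x34, 0x12])
powered_off = b"\xff\xff\xff\xff"
print(device_responds(powered_on), device_responds(powered_off))
```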
-- 
Kind regards,
Peter Wu
https://lekensteyn.nl

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Kernel Freeze with American Megatrends BIOS
  2016-08-29 16:02 ` Bjorn Helgaas
@ 2016-08-30 19:53     ` Peter Wu
  2016-08-30 19:53     ` Peter Wu
  1 sibling, 0 replies; 39+ messages in thread
From: Peter Wu @ 2016-08-30 19:53 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, Roland Singer, linux-kernel, dri-devel, linux-acpi

On Mon, Aug 29, 2016 at 11:02:10AM -0500, Bjorn Helgaas wrote:
> [+cc linux-acpi, linux-kernel, dri-devel]
> 
> Hi Roland,
> 
> I have no idea how to debug this problem.  Are you seeing something
> that suggests it may be a PCI problem?

Yes, I suspect there is an ACPI and/or PCI problem, possibly
device-specific. Steps to reproduce on the affected machines:

 1. Load nouveau.
 2. Wait for it to runtime suspend.
 3. Invoke 'lspci'; this resumes the Nvidia PCI device via nouveau.
 4. lspci never returns; a few moments later an AML_INFINITE_LOOP is
    reported.
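
The "wait for runtime suspend" step can be scripted. Below is a minimal Python sketch that polls the runtime PM status; it assumes the standard sysfs attribute power/runtime_status (which reports values such as "active" and "suspended"), and the helper names are invented for this example. Reading that attribute should not wake the device, unlike the lspci call in the following step.

```python
import time
from pathlib import Path


def sysfs_runtime_status(pci_addr):
    """Return a reader for the runtime PM status of one PCI device."""
    path = Path("/sys/bus/pci/devices") / pci_addr / "power" / "runtime_status"
    return lambda: path.read_text().strip()


def wait_for_runtime_suspend(read_status, timeout_s=30.0, poll_s=0.5):
    """Poll until the device reports 'suspended'; False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        if read_status() == "suspended":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

On an affected machine this would be used as wait_for_runtime_suspend(sysfs_runtime_status("0000:01:00.0")), with the address taken from the nouveau log lines; once it returns True, running lspci is the step that triggers the hang.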

If you use the external bbswitch module, the effect is the same. I have
been trying to debug this for some time on nouveau with no luck. The
PCI/PM D3cold patches from Mika make no difference.

Runtime resume via nouveau triggers some ACPI methods (I'll assume the
Windows 8-style PR method and take the Clevo P651 as example):

    \_SB.PCI0.PEG0.PG00._ON () ->
        \_SB.PCI0.PGON (0)

Then:

    Method (PGON, 1, Serialized) {
        PION = Arg0     // note: 0 for PG00
        // ...
        If ((OSYS != 0x07DF)) { /* Not Windows 2015 (Windows 10), see below */ }
        Else {
            LKEN (PION)
        }
        // this is the infinite loop: it tries to bring the PCIe link to
        // full speed, but fails to do so.
        While ((\_SB.PCI0.PEG0.LNKS < 0x07)) {
            Local0 = 0x20
            While (Local0) {
                If ((\_SB.PCI0.PEG0.LNKS < 0x07)) {
                    Stall (0x64)
                    Local0--
                } Else { Break }
            }
            If ((Local0 == Zero)) {
                \_SB.PCI0.PEG0.RTLK = One
                Stall (0x64)
            }
        }
        // ...
    }
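
To make the failure mode of that loop concrete, here is a small Python model of it. This is only a simulation under my reading of the ASL above: read_lnks stands in for the LNKS field, each inner poll corresponds to one Stall (0x64), and the max_retrains cap is something I added so the model can terminate (the firmware has no such cap, which is exactly the problem).

```python
def pgon_link_wait(read_lnks, max_retrains):
    """Model the PGON wait loop; returns (link_up, retrains, polls)."""
    retrains = 0
    polls = 0
    while read_lnks() < 0x07:
        budget = 0x20                  # Local0 = 0x20
        while budget:
            polls += 1
            if read_lnks() < 0x07:
                budget -= 1            # Stall (0x64), then Local0--
            else:
                break
        if budget == 0:
            if retrains >= max_retrains:
                # Not in the firmware: bail out instead of looping forever.
                return False, retrains, polls
            retrains += 1              # RTLK = One (request retraining)
    return True, retrains, polls


# A link that never reaches 0x07 only stops because of the added cap.
print(pgon_link_wait(lambda: 0, max_retrains=3))
# A link that is already up leaves the loop immediately.
print(pgon_link_wait(lambda: 7, max_retrains=3))
```

With a link stuck below 0x07, every outer pass burns 32 polls and requests another retrain, so without the cap the method never returns.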

Without any workaround, this piece of code is invoked:

    Method (LKEN, 1, NotSerialized) {
        Local3 = (CPEX & 0x0F)  // CPEX at 0x5ff9be7f and has value 000506e3
        If ((Local3 == Zero)) {
            /* Similar to below, but with Q0L0 -> P0L0 (register 0xBC bit 6) */
        } ElseIf ((Local3 != Zero)) {
            If ((Arg0 == Zero)) {
                /* Enter L0 Activate state.
                 * (LKDS tries to enter L2, deep-energy-saving state.) */
                Q0L0 = One      // register 0x249 bit 0; \_SB.PCI0.OPG0.Q0L0 00:01.0
                Sleep (0x10)
                Local0 = Zero
                While (Q0L0) {
                    If ((Local0 > 0x04)) { Break }
                    Sleep (0x10)
                    Local0++
                }
            } else { /* other cases, but we are only interested in PGON(0) */ }
        }
    }

The acpi_osi="!Windows 2015" workaround will invoke this instead:

    If ((OSYS != 0x07DF)) {
        If ((PION == Zero)) {
            P0AP = Zero  /* PGOF writes 3 */
            P0RM = Zero  /* PGOF writes 1 */
        }
        If ((PBGE != Zero)) { /* Observed to be false (PBGE == 0) */
            If (SBDL (PION)) {
                PUAB (PION)
                CBDL = GUBC (PION)
                MBDL = GMXB (PION)
                If ((CBDL > MBDL)) {
                    CBDL = MBDL /* \_SB_.PCI0.MBDL */
                }
                PDUB (PION, CBDL)
            }
        }
        If ((PION == Zero)) {
            P0LD = Zero     /* Link Disable = 0, PGOF sets 1 instead. */
            P0TR = One      /* Train? (PGOF does not set this). */
            TCNT = Zero
            While ((TCNT < LDLY)) { /* LDLY = 300 */
                If ((P0VC == Zero)) {
                    /* VC Negotiation Pending 0 means VC negotiation is complete. */
                    Break
                }
                Sleep (0x10)
                TCNT += 0x10 /* At most 19 iterations, sleeping for 304ms. */
            }
        }
    }
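
Unlike the LNKS loop in PGON, both this TCNT loop and the Q0L0 loop in LKEN are bounded. A quick Python check of the worst-case waiting time, assuming the values quoted above (LDLY = 300, Sleep (0x10) = 16 ms, break once Local0 > 4); the function names are just for this sketch:

```python
def tcnt_worst_case(limit_ms=300, step_ms=0x10):
    """Iterations and total sleep of the While (TCNT < LDLY) loop."""
    elapsed = 0
    iterations = 0
    while elapsed < limit_ms:
        elapsed += step_ms     # Sleep (0x10); TCNT += 0x10
        iterations += 1
    return iterations, elapsed


def lken_worst_case(step_ms=0x10, max_count=0x04):
    """Total sleep in LKEN if Q0L0 never clears."""
    total = step_ms            # the Sleep (0x10) before the loop
    local0 = 0
    while True:
        if local0 > max_count:
            break
        total += step_ms       # Sleep (0x10) inside the loop
        local0 += 1
    return total


print(tcnt_worst_case())   # (19, 304)
print(lken_worst_case())   # 96
```

So the legacy path gives up after roughly 304 ms and LKEN after roughly 96 ms, whereas the LNKS loop has no upper bound at all.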

The comments above are my own interpretation based on the acpidumps I
extracted from the machine. These notes and ACPI tables can be found at
https://github.com/Lekensteyn/acpi-stuff/blob/master/Clevo-P651RA/notes.txt
https://github.com/Lekensteyn/acpi-stuff/tree/master/dsl/Clevo_P651RA

Other affected devices have similar code, differences are small:
 - No check for LNKS (avoids the infinite loop, but device is still off)
 - Instead of a check for != "Windows 2015", they check for == "Windows
   2009" or even for == "Windows 2009" || "Windows 2013" (Dell Inspiron
   7559).
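
For illustration, here is a Python sketch of how the acpi_osi= settings could select the code path. It assumes the common DSDT pattern where OSYS is set by a chain of _OSI checks with the last match winning, and that the values are the year in hex. Only 0x07DF for "Windows 2015" is confirmed by the Clevo tables above; the other values, the default and the function names are assumptions.

```python
# (OSI string, OSYS value) in the order a typical DSDT would test them.
OSI_TO_OSYS = [
    ("Windows 2009", 0x07D9),
    ("Windows 2012", 0x07DC),
    ("Windows 2013", 0x07DD),
    ("Windows 2015", 0x07DF),
]


def compute_osys(supported):
    """Last matching _OSI string wins; 0x07D0 is an assumed default."""
    osys = 0x07D0
    for name, value in OSI_TO_OSYS:
        if name in supported:
            osys = value
    return osys


def pgon_path(osys):
    """Clevo P651 variant: LKEN only runs when OSYS says Windows 2015."""
    return "LKEN" if osys == 0x07DF else "legacy"


everything = {name for name, _ in OSI_TO_OSYS}
# Default kernel behaviour: all strings are reported as supported.
print(pgon_path(compute_osys(everything)))
# acpi_osi="!Windows 2015"
print(pgon_path(compute_osys(everything - {"Windows 2015"})))
# acpi_osi=! acpi_osi="Windows 2009"
print(hex(compute_osys({"Windows 2009"})))
```

On machines whose firmware tests for == "Windows 2009" instead, the same idea applies with the comparison in pgon_path adjusted accordingly.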

The tested kernels (with bbswitch or nouveau) were Linux 4.4.0, 4.6,
4.7 (nouveau + PCI/PM + nouveau PR patches). The PCIe device is
something from the GTX 9xxM family in all cases.

I have a bunch of PCI config dumps from Windows and Linux, but there is
nothing extraordinary. I also did an ACPI trace via a Checked/Debug build
of Windows, but it just confirms that the ACPI method we use for the
Nvidia device is the correct one.

Let me know if you need more information, I would be glad to provide it.

Kind regards,
Peter

> On Tue, Aug 23, 2016 at 11:23:45AM +0200, Roland Singer wrote:
> > Hi,
> > 
> > hope somebody can help me fix this kernel problem which affects the following machines:
> > 
> > - Clevo P651RA (i7-6700HQ/GTX 965M, part of the P6xxRx family which are also affected)
> > - MSI GE62 Apache Pro (i7-6700HQ/GTX 960M)
> > - Gigabyte P35V5 (i7-6700HQ/GTX 970M)
> > - Razer Blade 14" (2016) (i7-6700HQ/GTX 970M) (BIOS 5.11, 04/07/2016)
> > 
> > 
> > The kernel freezes if the graphical user session (Xorg & Wayland) is
> > started with a switched off discrete GPU card (NVIDIA).
> > If the discrete GPU is switched off after the graphical session start,
> > then everything works as expected, until the graphical session is restarted.
> > 
> > This problem seems to be linked to specific BIOS settings. If the computer
> > is started with the following command line:
> > 
> > acpi_osi=! acpi_osi="Windows 2009"
> > 
> > then the kernel freeze does not occur anymore. However this required a special
> > ACPI DSDT firmware patch for the Razer Blade 2016 laptop:
> > 
> > https://github.com/m4ng0squ4sh/razer_blade_14_2016_acpi_dsdt
> > 
> > I strongly recommend to fix this in the kernel and I am ready to help and solve
> > this problem with some help.
> > 
> > Here is a link to the GitHub issue with further information:
> > 
> > https://github.com/Bumblebee-Project/Bumblebee/issues/764#issuecomment-241212595
> > 
> > Here is some more detailed information:
> > 
> > https://github.com/Lekensteyn/acpi-stuff/blob/master/Clevo-P651RA/notes.txt
> > 
> > Hope somebody can help.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Kernel Freeze with American Megatrends BIOS
  2016-08-30 18:10                           ` Emil Velikov
  (?)
@ 2016-08-31 10:51                           ` Roland Singer
  -1 siblings, 0 replies; 39+ messages in thread
From: Roland Singer @ 2016-08-31 10:51 UTC (permalink / raw)
  To: Emil Velikov
  Cc: Bjorn Helgaas, linux-pci, Linux-Kernel@Vger. Kernel. Org,
	ML dri-devel, linux-acpi, Ilia Mirkin, peter

Am 30.08.2016 um 20:09 schrieb Emil Velikov:
> I second Ilia here. Using bbswitch in conjunction with any driver (be
> that nouveau or the proprietary one) is a bad idea.
>

I removed bbswitch from my system and will use vgaswitcheroo to check
the GPU power state from now on.
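
As a rough sketch of how that check could be scripted: the snippet below
parses the vga_switcheroo debugfs file. It assumes debugfs is mounted at
/sys/kernel/debug and that each line has the usual
"index:name:active:power:pci-address" layout; the function names are made up
for illustration.

```python
from pathlib import Path

# Assumed location; needs root and a mounted debugfs.
SWITCH_FILE = Path("/sys/kernel/debug/vgaswitcheroo/switch")


def parse_switch(text):
    """Parse lines such as '1:DIS: :DynOff:0000:01:00.0' into dicts."""
    clients = []
    for line in text.splitlines():
        parts = line.split(":", 4)
        if len(parts) != 5 or not parts[0].isdigit():
            continue  # skip anything that does not match the expected layout
        index, name, active, power, pci = parts
        clients.append({
            "index": int(index),
            "name": name,             # e.g. IGD or DIS
            "active": active == "+",  # '+' marks the active GPU
            "power": power,           # e.g. Pwr, Off, DynPwr, DynOff
            "pci": pci,
        })
    return clients


def discrete_gpu_is_off(text):
    """True if the DIS entry reports Off or DynOff."""
    return any(c["name"] == "DIS" and c["power"] in ("Off", "DynOff")
               for c in parse_switch(text))


if __name__ == "__main__":
    try:
        print(parse_switch(SWITCH_FILE.read_text()))
    except OSError as exc:
        print(f"cannot read {SWITCH_FILE}: {exc}")
```

Reading this file should not wake the GPU, which makes it a gentler probe
than a plain lspci.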

> (If Ilia's suggestions do not help) Confirm if the freeze is due
> to/as the GPU is powered on or off.
>

Yeah, the freeze is caused by the switched-off GPU.
I waited for the nouveau driver to switch it off before starting
the graphical user interface...

> Out of curiosity: how did you force X to ignore the device ?
>

I tried to tell X11 to ignore the device with the following
configuration:

  Section "Device"
      Identifier  "Nvidia"
      VendorName  "NVIDIA Corporation"
      Option      "Ignore" "true"
  EndSection

> You can check if it's the boot_vga assumption with
> cat /sys/class/drm/card*/device/{boot_vga,vendor}
> If the output changes then my assumption holds true.

Output did not change:

  1
  0x8086

0x8086 is the vendor ID of Intel. So that's ok...
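
For comparing the values before and after a change, the same check can be
scripted. This is only a sketch: it assumes the usual /sys/class/drm layout,
and read_drm_cards is a made-up helper name.

```python
from pathlib import Path


def read_drm_cards(drm_root="/sys/class/drm"):
    """Return {card name: (boot_vga, vendor)} for every cardN entry found."""
    result = {}
    for card in sorted(Path(drm_root).glob("card[0-9]*")):
        if "-" in card.name:
            continue  # skip connector entries such as card0-eDP-1
        device = card / "device"
        try:
            boot_vga = (device / "boot_vga").read_text().strip()
            vendor = (device / "vendor").read_text().strip()
        except OSError:
            continue  # attribute missing or unreadable
        result[card.name] = (boot_vga, vendor)
    return result


if __name__ == "__main__":
    for name, (boot_vga, vendor) in read_drm_cards().items():
        print(f"{name}: boot_vga={boot_vga} vendor={vendor}")
```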

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Kernel Freeze with American Megatrends BIOS
  2016-08-30 18:13                             ` Ilia Mirkin
  (?)
  (?)
@ 2016-08-31 11:11                             ` Roland Singer
  -1 siblings, 0 replies; 39+ messages in thread
From: Roland Singer @ 2016-08-31 11:11 UTC (permalink / raw)
  To: Ilia Mirkin
  Cc: Emil Velikov, Bjorn Helgaas, Linux PCI,
	Linux-Kernel@Vger. Kernel. Org, ML dri-devel, linux-acpi,
	Peter Wu

Am 30.08.2016 um 20:13 schrieb Ilia Mirkin:
> On Tue, Aug 30, 2016 at 2:02 PM, Roland Singer
> <roland.singer@desertbit.com> wrote:
>> I configured bbswitch to not set any states automatically...
>> So it's possible to obtain and verify the GPU power state.
>>
>> However I removed the bbswitch module and booted with nouveau.
>>
>> Kernel 4.7.2: nouveau switches the discrete GPU off.
>>               I can't trigger the freeze, because bbswitch is missing.
>>               I'll work with the system and see if it will freeze.
>>
>> Kernel 4.8-rc4: nouveau does not care about the power state and
>>                 the discrete GPU is never switched off. I will notice
>>                 this, because the second cooling FAN will stop...
>>                 Same log messages as sent before.
> 
> That's surprising. I believe there's an issue with the new logic when
> there's an HDMI audio subdevice. However that only appears if there's
> a cable plugged in, at least in the systems Peter tested. You should
> be able to see whether it's there or not with 'lspci'.
> 
> You can check for sure by looking in the vgaswitcheroo state. It
> should say DynOff when it's powered off.
> 
> Either way, I think using bbswitch + nouveau isn't supported by
> anyone, so if you want to use it that way, you're on your own. (You
> may want to load nouveau with runpm=0 so that nouveau doesn't try to
> manage the GPU suspend stuff.)
> 
>   -ilia
> 

Kernel 4.8-rc4:

While running lspci, the following kernel log messages were printed on the TTY:

  nouveau: 0000:01:00:0: priv: HUB0: 6013d4 0000573f (1f408200)
  nouveau: 0000:01:00:0: priv: HUB0: 10ecc0 ffffffff (1940822c)

This is my output of lspci:

00:00.0 Host bridge: Intel Corporation Skylake Host Bridge/DRAM Registers (rev 07)
00:01.0 PCI bridge: Intel Corporation Skylake PCIe Controller (x16) (rev 07)
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 530 (rev 06)
00:08.0 System peripheral: Intel Corporation Skylake Gaussian Mixture Model
00:14.0 USB controller: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller (rev 31)
00:14.2 Signal processing controller: Intel Corporation Sunrise Point-H Thermal subsystem (rev 31)
00:15.0 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO I2C Controller #0 (rev 31)
00:15.1 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO I2C Controller #1 (rev 31)
00:16.0 Communication controller: Intel Corporation Sunrise Point-H CSME HECI #1 (rev 31)
00:1c.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #1 (rev f1)
00:1c.5 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #6 (rev f1)
00:1d.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #9 (rev f1)
00:1d.4 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #13 (rev f1)
00:1e.0 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO UART #0 (rev 31)
00:1f.0 ISA bridge: Intel Corporation Sunrise Point-H LPC Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation Sunrise Point-H PMC (rev 31)
00:1f.3 Audio device: Intel Corporation Sunrise Point-H HD Audio (rev 31)
00:1f.4 SMBus: Intel Corporation Sunrise Point-H SMBus (rev 31)
01:00.0 3D controller: NVIDIA Corporation GM204M [GeForce GTX 970M] (rev a1)
3b:00.0 Network controller: Qualcomm Atheros QCA6174 802.11ac Wireless Network Adapter (rev 32)
3d:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller (rev 01)


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Kernel Freeze with American Megatrends BIOS
  2016-08-30 19:21                             ` Peter Wu
@ 2016-08-31 11:12                               ` Roland Singer
  0 siblings, 0 replies; 39+ messages in thread
From: Roland Singer @ 2016-08-31 11:12 UTC (permalink / raw)
  To: Peter Wu, Ilia Mirkin
  Cc: Emil Velikov, Bjorn Helgaas, Linux PCI,
	Linux-Kernel@Vger. Kernel. Org, ML dri-devel, linux-acpi

Am 30.08.2016 um 21:21 schrieb Peter Wu:
> On Tue, Aug 30, 2016 at 02:13:46PM -0400, Ilia Mirkin wrote:
>> On Tue, Aug 30, 2016 at 2:02 PM, Roland Singer
>> <roland.singer@desertbit.com> wrote:
>>> I configured bbswitch to not set any states automatically...
>>> So it's possible to obtain and verify the GPU power state.
>>>
>>> However I removed the bbswitch module and booted with nouveau.
>>>
>>> Kernel 4.7.2: nouveau switches the discrete GPU off.
>>>               I can't trigger the freeze, because bbswitch is missing.
>>>               I'll work with the system and see if it will freeze.
>>>
>>> Kernel 4.8-rc4: nouveau does not care about the power state and
>>>                 the discrete GPU is never switched off. I will notice
>>>                 this, because the second cooling FAN will stop...
>>>                 Same log messages as sent before.
>>
>> That's surprising. I believe there's an issue with the new logic when
>> there's an HDMI audio subdevice. However that only appears if there's
>> a cable plugged in, at least in the systems Peter tested. You should
>> be able to see whether it's there or not with 'lspci'.
> 
> I doubt that the audio device is responsible here, that should only show
> up after following very specific steps (runtime suspend/resume (PCI or
> ACPI magic), remove PCI device, rescan bus).
> 
>> You can check for sure by looking in the vgaswitcheroo state. It
>> should say DynOff when it's powered off.
>>
>> Either way, I think using bbswitch + nouveau isn't supported by
>> anyone, so if you want to use it that way, you're on your own. (You
>> may want to load nouveau with runpm=0 so that nouveau doesn't try to
>> manage the GPU suspend stuff.)
> 
> I understood that Roland's intent is to check the power state, not to use
> the suspend functionality of bbswitch. If you load bbswitch without
> module options and do not write to /proc/acpi/bbswitch, then it allows you to
> read out the actual status (you could also just use lspci -H1 for that
> though).
> 

lspci -H1 works perfectly. Thanks.
I just tried to verify the output with lspci -H1. I unloaded the nouveau
module and modprobe froze with:

  $ modprobe -r nouveau
    nouveau 0000:01:00.0: pci: failed to adjust lnkctl speed
    nouveau 0000:01:00.0: fb: init failed. -22
    nouveau 0000:01:00.0: init failed with -22
    nouveau: DRM:00000000:00000000: init failed with -22
    nouveau: DRM:00000000:00000000: init failed with -22

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Kernel Freeze with American Megatrends BIOS
  2016-08-30 19:53     ` Peter Wu
  (?)
@ 2016-08-31 11:27     ` Roland Singer
  2016-08-31 11:46         ` Peter Wu
  -1 siblings, 1 reply; 39+ messages in thread
From: Roland Singer @ 2016-08-31 11:27 UTC (permalink / raw)
  To: Peter Wu, Bjorn Helgaas
  Cc: linux-pci, linux-kernel, linux-acpi, dri-devel, emil.l.velikov,
	imirkin@alum.mit.edu >> Ilia Mirkin

Am 30.08.2016 um 21:53 schrieb Peter Wu:
> On Mon, Aug 29, 2016 at 11:02:10AM -0500, Bjorn Helgaas wrote:
>> [+cc linux-acpi, linux-kernel, dri-devel]
>>
>> Hi Roland,
>>
>> I have no idea how to debug this problem.  Are you seeing something
>> that suggests it may be a PCI problem?
> 
> Yes, I suspect there is an ACPI and/or PCI problem, possibly
> device-specific. Steps to reproduce on the affected machines:
> 
>  1. Load nouveau.
>  2. Wait for it to runtime suspend.
>  3. Invoke 'lspci'; this resumes the Nvidia PCI device via nouveau.
>  4. lspci never returns; a few moments later an AML_INFINITE_LOOP is
>     reported.
> 

I can confirm this. Same result on my machine.

Here is a link to my ACPI tables:
https://bugs.launchpad.net/lpbugreporter/+bug/752542/+attachment/4722651/+files/Razer-Blade.tar.gz

The specific source for the NVIDIA card can be found in the ssdt5.dsl file.


    Method (PGON, 1, Serialized)
    {
        /* ... */

        GPPR (PION, One)
        If ((OSYS == 0x07D9))  /* Is Windows 2009 - In my case, setting to Windows 2009 only works! */
        {
            If ((PION == Zero))
            {
                P0AP = Zero
                P0RM = Zero
            }
            ElseIf ((PION == One))
            {
                P1AP = Zero
                P1RM = Zero
            }
            ElseIf ((PION == 0x02))
            {
                P2AP = Zero
                P2RM = Zero
            }

            If ((PBGE != Zero))
            {
                If (SBDL (PION))
                {
                    PUAB (PION)
                    CBDL = GUBC (PION)
                    MBDL = GMXB (PION)
                    If ((CBDL > MBDL))
                    {
                        CBDL = MBDL /* \_SB_.PCI0.MBDL */
                    }

                    PDUB (PION, CBDL)
                }
            }

            If ((PION == Zero))
            {
                P0LD = Zero
                P0TR = One
                TCNT = Zero
                While ((TCNT < LDLY))
                {
                    If ((P0VC == Zero))
                    {
                        Break
                    }

                    Sleep (0x10)
                    TCNT += 0x10
                }
            }
            ElseIf ((PION == One))
            {
                P1LD = Zero
                P1TR = One
                TCNT = Zero
                While ((TCNT < LDLY))
                {
                    If ((P1VC == Zero))
                    {
                        Break
                    }

                    Sleep (0x10)
                    TCNT += 0x10
                }
            }
            ElseIf ((PION == 0x02))
            {
                P2LD = Zero
                P2TR = One
                TCNT = Zero
                While ((TCNT < LDLY))
                {
                    If ((P2VC == Zero))
                    {
                        Break
                    }

                    Sleep (0x10)
                    TCNT += 0x10
                }
            }
        }
        Else
        {
            LKEN (PION)
        }

        /* ... */
        
        Return (Zero)
    }



If not set to Windows 2009, then this is triggered:


    Method (LKEN, 1, NotSerialized)
    {
        Local3 = (CPEX & 0x0F)
        If ((Local3 == Zero))
        {
            If ((Arg0 == Zero))
            {
                P0L0 = One
                Sleep (0x10)
                Local0 = Zero
                While (P0L0)
                {
                    If ((Local0 > 0x04))
                    {
                        Break
                    }

                    Sleep (0x10)
                    Local0++
                }
            }
            ElseIf ((Arg0 == One))
            {
                P1L0 = One
                Sleep (0x10)
                Local0 = Zero
                While (P0L0)
                {
                    If ((Local0 > 0x04))
                    {
                        Break
                    }

                    Sleep (0x10)
                    Local0++
                }
            }
            ElseIf ((Arg0 == 0x02))
            {
                P2L0 = One
                Sleep (0x10)
                Local0 = Zero
                While (P0L0)
                {
                    If ((Local0 > 0x04))
                    {
                        Break
                    }

                    Sleep (0x10)
                    Local0++
                }
            }
        }
        ElseIf ((Local3 != Zero))
        {
            If ((Arg0 == Zero))
            {
                Q0L0 = One
                Sleep (0x10)
                Local0 = Zero
                While (Q0L0)
                {
                    If ((Local0 > 0x04))
                    {
                        Break
                    }

                    Sleep (0x10)
                    Local0++
                }
            }
            ElseIf ((Arg0 == One))
            {
                Q1L0 = One
                Sleep (0x10)
                Local0 = Zero
                While (Q1L0)
                {
                    If ((Local0 > 0x04))
                    {
                        Break
                    }

                    Sleep (0x10)
                    Local0++
                }
            }
            ElseIf ((Arg0 == 0x02))
            {
                Q2L0 = One
                Sleep (0x10)
                Local0 = Zero
                While (Q2L0)
                {
                    If ((Local0 > 0x04))
                    {
                        Break
                    }

                    Sleep (0x10)
                    Local0++
                }
            }
        }
    }
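
To make the timing of one LKEN branch easier to reason about, here is a small
model of the polling pattern in Python. It is only a sketch of the listing
above: read_bit and sleep_ms are hypothetical callables standing in for the
register read (for example Q0L0) and the AML Sleep() operator, and the
True/False return value is my addition, since the AML method itself does not
report whether the bit cleared.

```python
def poll_link_bit(read_bit, sleep_ms, limit=0x04, interval_ms=0x10):
    """Model one LKEN branch after the bit has been set to One.

    Returns True if the bit read back as zero, False if the loop gave up.
    """
    sleep_ms(interval_ms)      # Sleep (0x10) right after setting the bit
    polls = 0                  # Local0 = Zero
    while read_bit():          # While (Q0L0)
        if polls > limit:      # If ((Local0 > 0x04)) { Break }
            return False
        sleep_ms(interval_ms)  # Sleep (0x10)
        polls += 1             # Local0++
    return True


if __name__ == "__main__":
    waited = []
    cleared = poll_link_bit(lambda: 1, waited.append)  # bit stuck at 1
    print(cleared, sum(waited), "ms")
```

With a bit that never clears, the model sleeps six times (one initial wait
plus five polls), so the branch gives up after roughly 96 ms and the method
carries on as if nothing had gone wrong.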


Is it possible to override the specific ACPI table functions (SSDT) in the DSDT?
This way I could try to debug to find some more information...

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Kernel Freeze with American Megatrends BIOS
  2016-08-31 11:27     ` Roland Singer
@ 2016-08-31 11:46         ` Peter Wu
  0 siblings, 0 replies; 39+ messages in thread
From: Peter Wu @ 2016-08-31 11:46 UTC (permalink / raw)
  To: Roland Singer
  Cc: linux-pci, emil.l.velikov, linux-kernel, dri-devel, linux-acpi,
	Bjorn Helgaas

On Wed, Aug 31, 2016 at 01:27:36PM +0200, Roland Singer wrote:
> Am 30.08.2016 um 21:53 schrieb Peter Wu:
> > On Mon, Aug 29, 2016 at 11:02:10AM -0500, Bjorn Helgaas wrote:
> >> [+cc linux-acpi, linux-kernel, dri-devel]
> >>
> >> Hi Roland,
> >>
> >> I have no idea how to debug this problem.  Are you seeing something
> >> that suggests it may be a PCI problem?
> > 
> > Yes, I suspect there is an ACPI and/or PCI problem, possibly
> > device-specific. Steps to reproduce on the affected machines:
> > 
> >  1. Load nouveau.
> >  2. Wait for it to runtime suspend.
> >  3. Invoke 'lspci'; this resumes the Nvidia PCI device via nouveau.
> >  4. lspci never returns; a few moments later an AML_INFINITE_LOOP is
> >     reported.
> > 
> 
> I can confirm this. Same result on my machine.
> 
> Here is a link to my ACPI tables:
> https://bugs.launchpad.net/lpbugreporter/+bug/752542/+attachment/4722651/+files/Razer-Blade.tar.gz
> 
> The specific source for the NVIDIA card can be found in the ssdt5.dsl file.
> 
> 
>     Method (PGON, 1, Serialized)
>     {
>         /* ... */
> 
>         GPPR (PION, One)
>         If ((OSYS == 0x07D9))  /* Is Windows 2009 - In my case, setting to Windows 2009 only works! */
>         {
[..]
>         }
>         Else
>         {
>             LKEN (PION)
>         }
> 
>         /* ... */
>         
>         Return (Zero)
>     }
> 
> 
> 
> If not set to Windows 2009, then this is triggered:
> 
> 
>     Method (LKEN, 1, NotSerialized)
>     {
[..]
>     }

Yep, this is the same code. I stripped out irrelevant parts from the
previous mail for brevity.

> Is it possible to override the specific ACPI table functions (SSDT) in the DSDT?
> This way I could try to debug to find some more information...

See Documentation/acpi/initrd_table_override.txt and note that it is
important that the tables are really located at /kernel/firmware/acpi/
in your initrd (which must be the first, even before any possible
microcode updates).

What are you trying to do? For ACPI method tracing, see
Documentation/acpi/method-tracing.txt
-- 
Kind regards,
Peter Wu
https://lekensteyn.nl

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Kernel Freeze with American Megatrends BIOS
  2016-08-31 11:46         ` Peter Wu
  (?)
@ 2016-08-31 12:21         ` Roland Singer
  2016-08-31 12:34           ` Peter Wu
  -1 siblings, 1 reply; 39+ messages in thread
From: Roland Singer @ 2016-08-31 12:21 UTC (permalink / raw)
  To: Peter Wu
  Cc: Bjorn Helgaas, linux-pci, linux-kernel, linux-acpi, dri-devel,
	emil.l.velikov, imirkin@alum.mit.edu >> Ilia Mirkin

Am 31.08.2016 um 13:46 schrieb Peter Wu:
> On Wed, Aug 31, 2016 at 01:27:36PM +0200, Roland Singer wrote:
>> Am 30.08.2016 um 21:53 schrieb Peter Wu:
>>> On Mon, Aug 29, 2016 at 11:02:10AM -0500, Bjorn Helgaas wrote:
>>>> [+cc linux-acpi, linux-kernel, dri-devel]
>>>>
>>>> Hi Roland,
>>>>
>>>> I have no idea how to debug this problem.  Are you seeing something
>>>> that suggests it may be a PCI problem?
>>>
>>> Yes, I suspect there is an ACPI and/or PCI problem, possibly
>>> device-specific. Steps to reproduce on the affected machines:
>>>
>>>  1. Load nouveau.
>>>  2. Wait for it to runtime suspend.
>>>  3. Invoke 'lspci'; this resumes the Nvidia PCI device via nouveau.
>>>  4. lspci never returns; a few moments later an AML_INFINITE_LOOP is
>>>     reported.
>>>
>>
>> I can confirm this. Same result on my machine.
>>
>> Here is a link to my ACPI tables:
>> https://bugs.launchpad.net/lpbugreporter/+bug/752542/+attachment/4722651/+files/Razer-Blade.tar.gz
>>
>> The specific source for the NVIDIA card can be found in the ssdt5.dsl file.
>>
>>
>>     Method (PGON, 1, Serialized)
>>     {
>>         /* ... */
>>
>>         GPPR (PION, One)
>>         If ((OSYS == 0x07D9))  /* Is Windows 2009 - In my case, setting to Windows 2009 only works! */
>>         {
> [..]
>>         }
>>         Else
>>         {
>>             LKEN (PION)
>>         }
>>
>>         /* ... */
>>         
>>         Return (Zero)
>>     }
>>
>>
>>
>> If not set to Windows 2009, then this is triggered:
>>
>>
>>     Method (LKEN, 1, NotSerialized)
>>     {
> [..]
>>     }
> 
> Yep, this is the same code. I stripped out irrelevant parts from the
> previous mail for brevity.
> 
>> Is it possible to override the specific ACPI table functions (SSDT) in the DSDT?
>> This way I could try to debug to find some more information...
> 
> See Documentation/acpi/initrd_table_override.txt and note that it is
> important that the tables are really located at /kernel/firmware/acpi/
> in your initrd (which must be the first, even before any possible
> microcode updates).
> 
> What are you trying to do? For ACPI method tracing, see
> Documentation/acpi/method-tracing.txt
> 

Oh, you're right.

Thanks. Right now I am overriding the DSDT, but I am not able to override
the SSDT, because I would have to fix and compile all the SSDT files. There
are too many compile errors... I wanted to find the exact line which
is responsible for the hiccup.

>>> Yes, I suspect there is an ACPI and/or PCI problem, possibly
>>> device-specific. Steps to reproduce on the affected machines:
>>>
>>>  1. Load nouveau.
>>>  2. Wait for it to runtime suspend.
>>>  3. Invoke 'lspci'; this resumes the Nvidia PCI device via nouveau.
>>>  4. lspci never returns; a few moments later an AML_INFINITE_LOOP is
>>>     reported.

I noticed the following:

1. Blacklist nouveau
2. Boot to GDM login manager (Wayland)
3. Switch to TTY with CTRL+ALT+FN2
4. Load bbswitch
5. Switch off GPU
6. run lspci -> no freeze
7. Switch to GDM
8. Login to a Wayland session (X11 won't work)
9. run lspci in a GUI terminal -> system freezes

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Kernel Freeze with American Megatrends BIOS
  2016-08-31 12:21         ` Roland Singer
@ 2016-08-31 12:34           ` Peter Wu
  2016-08-31 13:13             ` Roland Singer
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Wu @ 2016-08-31 12:34 UTC (permalink / raw)
  To: Roland Singer
  Cc: Bjorn Helgaas, linux-pci, linux-kernel, linux-acpi, dri-devel,
	emil.l.velikov, imirkin@alum.mit.edu >> Ilia Mirkin

On Wed, Aug 31, 2016 at 02:21:31PM +0200, Roland Singer wrote:
> Am 31.08.2016 um 13:46 schrieb Peter Wu:
> > On Wed, Aug 31, 2016 at 01:27:36PM +0200, Roland Singer wrote:
> >> Am 30.08.2016 um 21:53 schrieb Peter Wu:
> >>> On Mon, Aug 29, 2016 at 11:02:10AM -0500, Bjorn Helgaas wrote:
> >>>> [+cc linux-acpi, linux-kernel, dri-devel]
> >>>>
> >>>> Hi Roland,
> >>>>
> >>>> I have no idea how to debug this problem.  Are you seeing something
> >>>> that suggests it may be a PCI problem?
> >>>
> >>> Yes, I suspect there is an ACPI and/or PCI problem, possibly
> >>> device-specific. Steps to reproduce on the affected machines:
> >>>
> >>>  1. Load nouveau.
> >>>  2. Wait for it to runtime suspend.
> >>>  3. Invoke 'lspci'; this resumes the Nvidia PCI device via nouveau.
> >>>  4. lspci never returns; a few moments later an AML_INFINITE_LOOP is
> >>>     reported.
> >>>
> >>
> >> I can confirm this. Same result on my machine.
> >>
> >> Here is a link to my ACPI tables:
> >> https://bugs.launchpad.net/lpbugreporter/+bug/752542/+attachment/4722651/+files/Razer-Blade.tar.gz
> >>
> >> The specific source for the NVIDIA card can be found in the ssdt5.dsl file.
> >>
> >>
> >>     Method (PGON, 1, Serialized)
> >>     {
> >>         /* ... */
> >>
> >>         GPPR (PION, One)
> >>         If ((OSYS == 0x07D9))  /* Is Windows 2009 - In my case, setting to Windows 2009 only works! */
> >>         {
> > [..]
> >>         }
> >>         Else
> >>         {
> >>             LKEN (PION)
> >>         }
> >>
> >>         /* ... */
> >>         
> >>         Return (Zero)
> >>     }
> >>
> >>
> >>
> >> If not set to Windows 2009, then this is triggered:
> >>
> >>
> >>     Method (LKEN, 1, NotSerialized)
> >>     {
> > [..]
> >>     }
> > 
> > Yep, this is the same code. I stripped out irrelevant parts from the
> > previous mail for brevity.
> > 
> >> Is it possible to override the specific ACPI table functions (SSDT) in the DSDT?
> >> This way I could try to debug to find some more information...
> > 
> > See Documentation/acpi/initrd_table_override.txt and note that it is
> > important that the tables are really located at /kernel/firmware/acpi/
> > in your initrd (which must be the first, even before any possible
> > microcode updates).
> > 
> > What are you trying to do? For ACPI method tracing, see
> > Documentation/acpi/method-tracing.txt
> > 
> 
> Oh, you're right.
> 
> Thanks. Right now I am overriding the DSDT, but I am not able to override
> the SSDT, because I have to fix and compile all the SSDT files. There
> are too many compile errors... Wanted to find the exact line which
> is responsible for the hiccup.

Have you disassembled with externs included? That is,

    iasl -e *.dat -d ssdtX.dat

If you are sure that the remaining errors are harmless, you can use the
'-f' option to ignore errors. You can also use the `-ve` option to
suppress warnings and remarks so you can focus on the errors.

If you look at my notes.txt, you will see that _OFF always executes the
same code. PGON differs. When the problem occurs, "Q0L0" somehow always
reads back as non-zero and LNKS < 7.

> >>> Yes, I suspect there is an ACPI and/or PCI problem, possibly
> >>> device-specific. Steps to reproduce on the affected machines:
> >>>
> >>>  1. Load nouveau.
> >>>  2. Wait for it to runtime suspend.
> >>>  3. Invoke 'lspci'; this resumes the Nvidia PCI device via nouveau.
> >>>  4. lspci never returns; a few moments later an AML_INFINITE_LOOP is
> >>>     reported.
> 
> I noticed the following:
> 
> 1. Blacklist nouveau
> 2. Boot to GDM login manager (Wayland)
> 3. Switch to TTY with CTRL+ALT+FN2
> 4. Load bbswitch
> 5. Switch off GPU
> 6. run lspci -> no freeze
> 7. Switch to GDM
> 8. Login to a Wayland session (X11 won't work)
> 9. run lspci in a GUI terminal -> system freezes

Is nouveau somehow loaded anyway? All those extra components (X11,
Wayland, etc.) are unnecessary to reproduce the core problem. It occurs
whenever the device is being resumed (either via DSM/_PS0 or via power
resource PG00._ON).
-- 
Kind regards,
Peter Wu
https://lekensteyn.nl

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Kernel Freeze with American Megatrends BIOS
  2016-08-31 12:34           ` Peter Wu
@ 2016-08-31 13:13             ` Roland Singer
  2016-08-31 20:06               ` Roland Singer
  0 siblings, 1 reply; 39+ messages in thread
From: Roland Singer @ 2016-08-31 13:13 UTC (permalink / raw)
  To: Peter Wu
  Cc: Bjorn Helgaas, linux-pci, linux-kernel, linux-acpi, dri-devel,
	emil.l.velikov, imirkin@alum.mit.edu >> Ilia Mirkin

>>
>> Thanks. Right now I am overriding the DSDT, but I am not able to override
>> the SSDT, because I have to fix and compile all the SSDT files. There
>> are too many compile errors... Wanted to find the exact line which
>> is responsible for the hiccup.
> 
> Have you disassembled with externs included? That is,
> 
>     iasl -e *.dat -d ssdtX.dat
> 
> If you are sure that the remaining errors are harmless, you can use the
> '-f' option to ignore errors. You can also use the `-ve` option to
> suppress warnings and remarks so you can focus on the errors.
> 

Thanks, I'll try that.


> If you look at my notes.txt, you will see that _OFF always executes the
> same code. PGON differs. When the problem occurs, "Q0L0" somehow always
> reads back as non-zero and LNKS < 7.
> 

Oh you're Lekensteyn ^^

I don't have LNKS and no while loop after calling LKEN ?!


>>
>> I noticed the following:
>>
>> 1. Blacklist nouveau
>> 2. Boot to GDM login manager (Wayland)
>> 3. Switch to TTY with CTRL+ALT+FN2
>> 4. Load bbswitch
>> 5. Switch off GPU
>> 6. run lspci -> no freeze
>> 7. Switch to GDM
>> 8. Login to a Wayland session (X11 won't work)
>> 9. run lspci in a GUI terminal -> system freezes
> 
> Is nouveau somehow loaded anyway? All those extra components (X11,
> Wayland, etc.) are unnecessary to reproduce the core problem. It occurs
> whenever the device is being resumed (either via DSM/_PS0 or via power
> resource PG00._ON).
> 

Sorry, that was nonsense. The steps to reproduce the problem are still valid.
I didn't wait long enough for it to power down...

But here is what's interesting:

1. Blacklist nouveau
2. Load bbswitch
3. Power off GPU with bbswitch
4. Power on GPU with bbswitch
5. Run lspci
6. Power off GPU with bbswitch
7. Run lspci -> freeze

So setting the GPU power state with bbswitch works as expected.
Powering it on is also fine. I did this a couple of times.
But powering it off and letting lspci power it on ends in a race.

It might be that lspci not only powers the GPU on, but also triggers
another PCI action which causes the race condition.
Does this have something to do with your quote about the retrain bit?
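
For anyone repeating these steps, the bbswitch interface can be wrapped in a
few lines. This is a sketch that assumes bbswitch exposes its status at
/proc/acpi/bbswitch as a line like "0000:01:00.0 OFF"; the helper names are
made up, and writing to the file needs root.

```python
from pathlib import Path

# Assumed path of the bbswitch control file.
BBSWITCH = Path("/proc/acpi/bbswitch")


def parse_bbswitch(text):
    """Split a status line such as '0000:01:00.0 OFF' into (pci_address, is_on)."""
    pci, state = text.split()
    if state not in ("ON", "OFF"):
        raise ValueError(f"unexpected bbswitch state: {state!r}")
    return pci, state == "ON"


def request_state(on, path=BBSWITCH):
    """Ask bbswitch to power the GPU on or off by writing ON or OFF."""
    Path(path).write_text("ON\n" if on else "OFF\n")


if __name__ == "__main__":
    try:
        print(parse_bbswitch(BBSWITCH.read_text()))
    except OSError as exc:
        print(f"cannot read {BBSWITCH}: {exc}")
```

On an affected machine, calling request_state(False) followed by lspci is
exactly the sequence that froze my system, so only try it with nothing
important open.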

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Kernel Freeze with American Megatrends BIOS
  2016-08-31 13:13             ` Roland Singer
@ 2016-08-31 20:06               ` Roland Singer
  2016-08-31 20:16                 ` Roland Singer
  0 siblings, 1 reply; 39+ messages in thread
From: Roland Singer @ 2016-08-31 20:06 UTC (permalink / raw)
  To: Peter Wu
  Cc: Bjorn Helgaas, linux-pci, linux-kernel, linux-acpi, dri-devel,
	emil.l.velikov, imirkin@alum.mit.edu >> Ilia Mirkin

Here is Peter Wu's reply, which was not sent to the mailing list, because
I had to resend my e-mail to him due to a failure...


-------- Forwarded Message --------
Subject: Re: Fwd: Re: Kernel Freeze with American Megatrends BIOS
Date: Wed, 31 Aug 2016 18:08:53 +0200
From: Peter Wu <peter@lekensteyn.nl>
To: Roland Singer <roland.singer@desertbit.com>

On Wed, Aug 31, 2016 at 05:56:18PM +0200, Roland Singer wrote:

> > If you look at my notes.txt, you will see that _OFF always executes the
> > same code. PGON differs. When the problem occurs, "Q0L0" somehow always
> > reads back as non-zero and LNKS < 7.
> > 
> 
> Oh you're Lekensteyn ^^

Yes, that's me :) I wrote bbswitch, did the Optimus and PR3 ACPI support
in nouveau, so I am fairly certain about what happens behind the scenes.

> I don't have LNKS and no while loop after calling LKEN ?!

Yes that is what I said in
https://www.spinics.net/lists/linux-pci/msg53694.html:

"Other affected devices have similar code, differences are small:
No check for LNKS (avoids the infinite loop, but device is still off)"

> >>
> >> I noticed the following:
> >>
> >> 1. Blacklist nouveau
> >> 2. Boot to GDM login manager (Wayland)
> >> 3. Switch to TTY with CTRL+ALT+FN2
> >> 4. Load bbswitch
> >> 5. Switch off GPU
> >> 6. run lspci -> no freeze
> >> 7. Switch to GDM
> >> 8. Login to a Wayland session (X11 won't work)
> >> 9. run lspci in a GUI terminal -> system freezes
> > 
> > Is nouveau somehow loaded anyway? All those extra components (X11,
> > Wayland, etc.) are unnecessary to reproduce the core problem. It occurs
> > whenever the device is being resumed (either via DSM/_PS0 or via power
> > resource PG00._ON).
> > 
> 
> Sorry, that was nonsense. The steps to reproduce the problem are still valid.
> I didn't wait long enough for it to power down...
> 
> But what's interesting:
> 
> 1. Blacklist nouveau
> 2. Load bbswitch
> 3. Power off GPU with bbswitch
> 4. Power on GPU with bbswitch
> 5. Run lspci
> 6. Power off GPU with bbswitch
> 7. Run lspci -> freeze
> 
> So setting the GPU power state with bbswitch works as expected.
> Powering it on is also fine. I did this a couple of times.
> But powering it off and then letting lspci power it on ends in a race.

In some cases I also found that it does not always happen on the first try,
but with nouveau it always seems to happen.

> It might be that lspci not only powers the GPU on, but also triggers
> another PCI action which causes the race condition.
> Does this have something to do with your quote about the retrain bit?

That is an interesting hypothesis. Even if you invoke `lspci -s01:00.0`
for example, it will always probe for all devices. So maybe interaction
with its parent device (PCI root port 00:02.0) causes issues.

However I also tested without lspci before, and the problem still
exists. You can trigger runtime resume via (as root):

    echo on > /sys/bus/pci/devices/0000:01:00.0/power/control

Set it to "auto" to make it sleep again.
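
The same toggle can be wrapped in a small helper. This is a minimal sketch,
assuming the usual sysfs layout in which PCI devices appear under
/sys/bus/pci/devices/<domain:bus:dev.fn>; the PCI_SYSFS variable and the
function name are hypothetical and exist only so the helper can be tried
against a scratch directory first.

```shell
#!/bin/sh
# Sketch only. PCI_SYSFS defaults to the standard PCI device directory;
# override it to test the helper without touching real hardware.
PCI_SYSFS="${PCI_SYSFS:-/sys/bus/pci/devices}"

# set_runtime_pm <BDF> on|auto
#   "on"   forces the device to stay runtime-active (triggers a resume),
#   "auto" lets the kernel runtime-suspend it again.
set_runtime_pm() {
    case "$2" in
        on|auto) ;;
        *) echo "usage: set_runtime_pm <BDF> on|auto" >&2; return 2 ;;
    esac
    ctl="$PCI_SYSFS/$1/power/control"
    # Fail loudly instead of silently writing to a path that does not exist.
    [ -w "$ctl" ] || { echo "no writable $ctl" >&2; return 1; }
    printf '%s\n' "$2" > "$ctl"
}
```

For example, `set_runtime_pm 0000:01:00.0 on` as root, then
`set_runtime_pm 0000:01:00.0 auto` to allow suspend again. The explicit
existence check makes a mistyped device address show up as an error.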
-- 
Kind regards,
Peter Wu
https://lekensteyn.nl

* Re: Kernel Freeze with American Megatrends BIOS
  2016-08-31 20:06               ` Roland Singer
@ 2016-08-31 20:16                 ` Roland Singer
  0 siblings, 0 replies; 39+ messages in thread
From: Roland Singer @ 2016-08-31 20:16 UTC (permalink / raw)
  To: Peter Wu
  Cc: Bjorn Helgaas, linux-pci, linux-kernel, linux-acpi, dri-devel,
	emil.l.velikov, Ilia Mirkin <imirkin@alum.mit.edu>

On 08/31/16 22:06, Roland Singer wrote:
> Here is Peter Wu's reply, which was not sent to the mailing list because
> I had to resend my e-mail to him due to a failure...
> 
> 
> -------- Forwarded Message --------
> Subject: Re: Fwd: Re: Kernel Freeze with American Megatrends BIOS
> Date: Wed, 31 Aug 2016 18:08:53 +0200
> From: Peter Wu <peter@lekensteyn.nl>
> To: Roland Singer <roland.singer@desertbit.com>
> 
> On Wed, Aug 31, 2016 at 05:56:18PM +0200, Roland Singer wrote:
> 
>>> If you look at my notes.txt, you will see that _OFF always executes the
>>> same code. PGON differs. When the problem occurs, "Q0L0" somehow always
>>> reads back as non-zero and LNKS < 7.
>>>
>>
>> Oh you're Lekensteyn ^^
> 
> Yes, that's me :) I wrote bbswitch and did the Optimus and PR3 ACPI support
> in nouveau, so I am fairly certain about what happens behind the scenes.
> 

Awesome! Thanks for all your efforts! Great work :)


>> I don't have LNKS and no while loop after calling LKEN ?!
> 
> Yes that is what I said in
> https://www.spinics.net/lists/linux-pci/msg53694.html:
> 
> "Other affected devices have similar code, differences are small:
> No check for LNKS (avoids the infinite loop, but device is still off)"
>

Ah ok, missed that.


>> It might be that lspci not only powers the GPU on, but also triggers
>> another PCI action which causes the race condition.
>> Does this have something to do with your quote about the retrain bit?
> 
> That is an interesting hypothesis. Even if you invoke `lspci -s01:00.0`
> for example, it will always probe for all devices. So maybe interaction
> with its parent device (PCI root port 00:02.0) causes issues.
> 
> However I also tested without lspci before, and the problem still
> exists. You can trigger runtime resume via (as root):
> 
>     echo on > /sys/bus/pci/devices/0000:01:00.0/power/control
> 
> Set it to "auto" to make it sleep again.
> 

Just tried it over and over again. I don't have any problems switching the GPU power state
with bbswitch. So, switching the GPU on is just fine. There must be something else, which
does not cooperate well while switching it on (lspci)...

I can confirm that `lspci -s01:00.0` also freezes the system.

Trying to trigger runtime resume with `/sys/bus/pci/devices/0000:01:00.0/power/control`
did not work for me. The GPU just stayed off.
Any hints on how to get more information?
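
One possible starting point is to record the runtime PM state of the device
before and after each step. The sketch below assumes the standard
power/control, power/runtime_status, power/runtime_active_time and
power/runtime_suspended_time sysfs attributes; the PCI_SYSFS variable and the
function name are hypothetical.

```shell
#!/bin/sh
# Sketch only. PCI_SYSFS defaults to the standard PCI device directory.
PCI_SYSFS="${PCI_SYSFS:-/sys/bus/pci/devices}"

# pm_snapshot <BDF> : print the runtime PM attributes of one PCI device,
# one "name=value" pair per line; missing attributes are reported as such.
pm_snapshot() {
    dev="$PCI_SYSFS/$1"
    for attr in control runtime_status runtime_active_time runtime_suspended_time; do
        if [ -r "$dev/power/$attr" ]; then
            printf '%s=%s\n' "$attr" "$(cat "$dev/power/$attr")"
        else
            printf '%s=<unavailable>\n' "$attr"
        fi
    done
}
```

Running `pm_snapshot 0000:01:00.0` right after powering the GPU off, and
again just before the lspci call, would show whether the kernel believes the
device is suspended at the moment of the freeze.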
