regressions.lists.linux.dev archive mirror
* [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM
@ 2022-01-17  2:12 James D. Turner
  2022-01-17  8:09 ` Greg KH
  2022-01-17  9:03 ` Thorsten Leemhuis
  0 siblings, 2 replies; 30+ messages in thread
From: James D. Turner @ 2022-01-17  2:12 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, regressions, linux-kernel

Hi,

With newer kernels, starting with the v5.14 series, when using a MS
Windows 10 guest VM with PCI passthrough of an AMD Radeon Pro WX 3200
discrete GPU, the passed-through GPU will not run above 501 MHz, even
when it is under 100% load and well below the temperature limit. As a
result, GPU-intensive software (such as video games) runs unusably
slowly in the VM.

In contrast, with older kernels, the passed-through GPU runs at up to
1295 MHz (the correct hardware limit), so GPU-intensive software runs at
a reasonable speed in the VM.

I've confirmed that the issue exists with the following kernel versions:

- v5.16
- v5.14
- v5.14-rc1

The issue does not exist with the following kernels:

- v5.13
- various packaged (non-vanilla) 5.10.* Arch Linux `linux-lts` kernels

So, the issue was introduced between v5.13 and v5.14-rc1. I'm willing to
bisect the commit history to narrow it down further, if that would be
helpful.

The configuration details and test results are provided below. In
summary, for the kernels with this issue, the GPU core stays at a
constant 0.8 V, the GPU core clock ranges from 214 MHz to 501 MHz, and
the GPU memory stays at a constant 625 MHz, in the VM. For the correctly
working kernels, the GPU core ranges from 0.85 V to 1.0 V, the GPU core
clock ranges from 214 MHz to 1295 MHz, and the GPU memory stays at 1500
MHz, in the VM.

Please let me know if additional information would be helpful.

Regards,
James Turner

# Configuration Details

Hardware:

- Dell Precision 7540 laptop
- CPU: Intel Core i7-9750H (x86-64)
- Discrete GPU: AMD Radeon Pro WX 3200
- The internal display is connected to the integrated GPU, and external
  displays are connected to the discrete GPU.

Software:

- KVM host: Arch Linux
  - self-built vanilla kernel (built using Arch Linux `PKGBUILD`
    modified to use vanilla kernel sources from git.kernel.org)
  - libvirt 1:7.10.0-2
  - qemu 6.2.0-2

- KVM guest: Windows 10
  - GPU driver: Radeon Pro Software Version 21.Q3 (Note that I also
    experienced this issue with the 20.Q4 driver, using packaged
    (non-vanilla) Arch Linux kernels on the host, before updating to the
    21.Q3 driver.)

Kernel config:

- For v5.13, v5.14-rc1, and v5.14, I used
  https://github.com/archlinux/svntogit-packages/blob/89c24952adbfa645d9e1a6f12c572929f7e4e3c7/trunk/config
  (The build script ran `make olddefconfig` on that config file.)

- For v5.16, I used
  https://github.com/archlinux/svntogit-packages/blob/94f84e1ad8a530e54aa34cadbaa76e8dcc439d10/trunk/config
  (The build script ran `make olddefconfig` on that config file.)

I set up the VM with PCI passthrough according to the instructions at
https://wiki.archlinux.org/title/PCI_passthrough_via_OVMF

I'm passing through the following PCI devices to the VM, as listed by
`lspci -D -nn`:

  0000:01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Lexa XT [Radeon PRO WX 3200] [1002:6981]
  0000:01:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X] [1002:aae0]

The host kernel command line includes the following relevant options:

  intel_iommu=on vfio-pci.ids=1002:6981,1002:aae0

to enable IOMMU and bind the `vfio-pci` driver to the PCI devices.
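
For completeness, the binding and the IOMMU grouping can be
double-checked on the host with something like the following (a rough
sketch, assuming the standard sysfs layout):

  # confirm which driver is bound to each function of the GPU
  lspci -nnk -s 01:00

  # list the devices sharing the GPU's IOMMU group
  ls /sys/bus/pci/devices/0000:01:00.0/iommu_group/devices/

In my case both functions report `Kernel driver in use: vfio-pci`, as
shown in the test results below.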

My `/etc/mkinitcpio.conf` includes the following line:

  MODULES=(vfio_pci vfio vfio_iommu_type1 vfio_virqfd i915 amdgpu)

to load `vfio-pci` before the graphics drivers. (Note that removing
`i915 amdgpu` has no effect on this issue.)
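
An alternative that should achieve the same ordering, which I haven't
tested on this setup, would be a modprobe.d soft dependency instead of
the initramfs MODULES list, along these lines (filename arbitrary):

  # /etc/modprobe.d/vfio.conf
  softdep i915 pre: vfio-pci
  softdep amdgpu pre: vfio-pci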

I'm using libvirt to manage the VM. The relevant portions of the XML
file are:

  <hostdev mode="subsystem" type="pci" managed="yes">
    <source>
      <address domain="0x0000" bus="0x01" slot="0x00" function="0x0"/>
    </source>
    <address type="pci" domain="0x0000" bus="0x06" slot="0x00" function="0x0"/>
  </hostdev>
  <hostdev mode="subsystem" type="pci" managed="yes">
    <source>
      <address domain="0x0000" bus="0x01" slot="0x00" function="0x1"/>
    </source>
    <address type="pci" domain="0x0000" bus="0x07" slot="0x00" function="0x0"/>
  </hostdev>
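
To double-check that libvirt really presents the devices this way, the
active definition can be dumped with virsh, for example (the domain name
`win10` here is just a placeholder for my VM's name):

  virsh dumpxml win10 | grep -A 4 '<hostdev'

which shows the two `<hostdev>` entries above.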

# Test Results

For testing, I used the following procedure:

1. Boot the host machine and log in.

2. Run the following commands to gather information. For all the tests,
   the output was identical.

   - `cat /proc/sys/kernel/tainted` printed:

     0

   - `hostnamectl | grep "Operating System"` printed:

     Operating System: Arch Linux

   - `lspci -nnk -d 1002:6981` printed

     01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Lexa XT [Radeon PRO WX 3200] [1002:6981]
     	Subsystem: Dell Device [1028:0926]
     	Kernel driver in use: vfio-pci
     	Kernel modules: amdgpu

   - `lspci -nnk -d 1002:aae0` printed

     01:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X] [1002:aae0]
     	Subsystem: Dell Device [1028:0926]
     	Kernel driver in use: vfio-pci
     	Kernel modules: snd_hda_intel

   - `sudo dmesg | grep -i vfio` printed the kernel command line and the
     following messages:

     VFIO - User Level meta-driver version: 0.3
     vfio-pci 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
     vfio_pci: add [1002:6981[ffffffff:ffffffff]] class 0x000000/00000000
     vfio_pci: add [1002:aae0[ffffffff:ffffffff]] class 0x000000/00000000
     vfio-pci 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none

3. Start the Windows VM using libvirt and log in. Record sensor
   information.

4. Run a graphically-intensive video game to put the GPU under load.
   Record sensor information.

5. Stop the game. Record sensor information.

6. Shut down the VM. Save the output of `sudo dmesg`.

I compared the `sudo dmesg` output for v5.13 and v5.14-rc1 and didn't
see any relevant differences.

Note that the issue occurs only within the guest VM. When I'm not using
a VM (after removing `vfio-pci.ids=1002:6981,1002:aae0` from the kernel
command line so that the PCI devices are bound to their normal `amdgpu`
and `snd_hda_intel` drivers instead of the `vfio-pci` driver), the GPU
operates correctly on the host.
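
For reference, when `amdgpu` owns the card on the host, the clock levels
can be read straight from sysfs, for example (a rough sketch; the cardN
index is machine-specific):

  cat /sys/class/drm/card0/device/pp_dpm_sclk   # core clock levels, active one marked with *
  cat /sys/class/drm/card0/device/pp_dpm_mclk   # memory clock levels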

## Linux v5.16 (issue present)

$ cat /proc/version
Linux version 5.16.0-1 (linux@archlinux) (gcc (GCC) 11.1.0, GNU ld (GNU Binutils) 2.36.1) #1 SMP PREEMPT Sun, 16 Jan 2022 01:51:08 +0000

Before running the game:

- GPU core: 214.0 MHz, 0.800 V, 0.0% load, 53.0 degC
- GPU memory: 625.0 MHz

While running the game:

- GPU core: 501.0 MHz, 0.800 V, 100.0% load, 54.0 degC
- GPU memory: 625.0 MHz

After stopping the game:

- GPU core: 214.0 MHz, 0.800 V, 0.0% load, 51.0 degC
- GPU memory: 625.0 MHz

## Linux v5.14 (issue present)

$ cat /proc/version
Linux version 5.14.0-1 (linux@archlinux) (gcc (GCC) 11.1.0, GNU ld (GNU Binutils) 2.36.1) #1 SMP PREEMPT Sun, 16 Jan 2022 03:19:35 +0000

Before running the game:

- GPU core: 214.0 MHz, 0.800 V, 0.0% load, 50.0 degC
- GPU memory: 625.0 MHz

While running the game:

- GPU core: 501.0 MHz, 0.800 V, 100.0% load, 54.0 degC
- GPU memory: 625.0 MHz

After stopping the game:

- GPU core: 214.0 MHz, 0.800 V, 0.0% load, 49.0 degC
- GPU memory: 625.0 MHz

## Linux v5.14-rc1 (issue present)

$ cat /proc/version
Linux version 5.14.0-rc1-1 (linux@archlinux) (gcc (GCC) 11.1.0, GNU ld (GNU Binutils) 2.36.1) #1 SMP PREEMPT Sun, 16 Jan 2022 18:31:35 +0000

Before running the game:

- GPU core: 214.0 MHz, 0.800 V, 0.0% load, 50.0 degC
- GPU memory: 625.0 MHz

While running the game:

- GPU core: 501.0 MHz, 0.800 V, 100.0% load, 54.0 degC
- GPU memory: 625.0 MHz

After stopping the game:

- GPU core: 214.0 MHz, 0.800 V, 0.0% load, 49.0 degC
- GPU memory: 625.0 MHz

## Linux v5.13 (works correctly, issue not present)

$ cat /proc/version
Linux version 5.13.0-1 (linux@archlinux) (gcc (GCC) 11.1.0, GNU ld (GNU Binutils) 2.36.1) #1 SMP PREEMPT Sun, 16 Jan 2022 02:39:18 +0000

Before running the game:

- GPU core: 214.0 MHz, 0.850 V, 0.0% load, 55.0 degC
- GPU memory: 1500.0 MHz

While running the game:

- GPU core: 1295.0 MHz, 1.000 V, 100.0% load, 67.0 degC
- GPU memory: 1500.0 MHz

After stopping the game:

- GPU core: 214.0 MHz, 0.850 V, 0.0% load, 52.0 degC
- GPU memory: 1500.0 MHz

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM
  2022-01-17  2:12 [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM James D. Turner
@ 2022-01-17  8:09 ` Greg KH
  2022-01-17  9:03 ` Thorsten Leemhuis
  1 sibling, 0 replies; 30+ messages in thread
From: Greg KH @ 2022-01-17  8:09 UTC (permalink / raw)
  To: James D. Turner; +Cc: Alex Williamson, kvm, regressions, linux-kernel

On Sun, Jan 16, 2022 at 09:12:21PM -0500, James D. Turner wrote:
> So, the issue was introduced between v5.13 and v5.14-rc1. I'm willing to
> bisect the commit history to narrow it down further, if that would be
> helpful.

Bisection would be great as that is a very large range of commits there
from many months ago, so people might not remember what could have
caused this issue.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM
  2022-01-17  2:12 [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM James D. Turner
  2022-01-17  8:09 ` Greg KH
@ 2022-01-17  9:03 ` Thorsten Leemhuis
  2022-01-18  3:14   ` James Turner
  1 sibling, 1 reply; 30+ messages in thread
From: Thorsten Leemhuis @ 2022-01-17  9:03 UTC (permalink / raw)
  To: James D. Turner, Alex Williamson; +Cc: kvm, regressions, linux-kernel

[TLDR: I'm adding this regression to regzbot, the Linux kernel
regression tracking bot; most of the text you find below is compiled
from a few template paragraphs some of you might have seen already.]

Hi, this is your Linux kernel regression tracker speaking.


On 17.01.22 03:12, James D. Turner wrote:
> 
> With newer kernels, starting with the v5.14 series, when using a MS
> Windows 10 guest VM with PCI passthrough of an AMD Radeon Pro WX 3200
> discrete GPU, the passed-through GPU will not run above 501 MHz, even
> when it is under 100% load and well below the temperature limit. As a
> result, GPU-intensive software (such as video games) runs unusably
> slowly in the VM.

Thanks for the report. Greg already asked for a bisection, which would
help a lot here.
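
In case it helps, the rough workflow for that (assuming a local clone of
Linus' mainline tree) is:

  git bisect start
  git bisect bad v5.14-rc1
  git bisect good v5.13
  # build, boot, and test the revision git checks out for you,
  # then mark it and repeat until git names the first bad commit:
  git bisect good    # or: git bisect bad
  # the full log can then be retrieved with:
  git bisect log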

To be sure this issue doesn't fall through the cracks unnoticed, I'm
adding it to regzbot, my Linux kernel regression tracking bot:

#regzbot ^introduced v5.13..v5.14-rc1
#regzbot ignore-activity

Reminder: when fixing the issue, please add a 'Link:' tag with the URL
to the report (the parent of this mail) using the kernel.org redirector,
as explained in 'Documentation/process/submitting-patches.rst'. Regzbot
then will automatically mark the regression as resolved once the fix
lands in the appropriate tree. For more details about regzbot see footer.

I'm sending this to everyone that got the initial report, to make all of
you aware of the tracking. I also hope that messages like this motivate
people to get at least the regression mailing list, and ideally regzbot
as well, involved right away when dealing with regressions, as messages
like this then wouldn't be needed.

Don't worry, I'll send further messages wrt this regression just to
the lists (with a tag in the subject so people can filter them away), as
long as they are intended just for regzbot. With a bit of luck no such
messages will be needed anyway.

Ciao, Thorsten (wearing his 'Linux kernel regression tracker' hat)

P.S.: As a Linux kernel regression tracker I'm getting a lot of reports
on my table. I can only look briefly into most of them. Unfortunately
therefore I sometimes will get things wrong or miss something important.
I hope that's not the case here; if you think it is, don't hesitate to
tell me about it in a public reply, that's in everyone's interest.

BTW, I have no personal interest in this issue, which is tracked using
regzbot, my Linux kernel regression tracking bot
(https://linux-regtracking.leemhuis.info/regzbot/). I'm only posting
this mail to get things rolling again and hence don't need to be CC on
all further activities wrt this regression.

> In contrast, with older kernels, the passed-through GPU runs at up to
> 1295 MHz (the correct hardware limit), so GPU-intensive software runs at
> a reasonable speed in the VM.
> 
> I've confirmed that the issue exists with the following kernel versions:
> 
> - v5.16
> - v5.14
> - v5.14-rc1
> 
> The issue does not exist with the following kernels:
> 
> - v5.13
> - various packaged (non-vanilla) 5.10.* Arch Linux `linux-lts` kernels
> 
> So, the issue was introduced between v5.13 and v5.14-rc1. I'm willing to
> bisect the commit history to narrow it down further, if that would be
> helpful.
> 
> The configuration details and test results are provided below. In
> summary, for the kernels with this issue, the GPU core stays at a
> constant 0.8 V, the GPU core clock ranges from 214 MHz to 501 MHz, and
> the GPU memory stays at a constant 625 MHz, in the VM. For the correctly
> working kernels, the GPU core ranges from 0.85 V to 1.0 V, the GPU core
> clock ranges from 214 MHz to 1295 MHz, and the GPU memory stays at 1500
> MHz, in the VM.
> 
> Please let me know if additional information would be helpful.
> 
> Regards,
> James Turner
> 
> # Configuration Details
> 
> Hardware:
> 
> - Dell Precision 7540 laptop
> - CPU: Intel Core i7-9750H (x86-64)
> - Discrete GPU: AMD Radeon Pro WX 3200
> - The internal display is connected to the integrated GPU, and external
>   displays are connected to the discrete GPU.
> 
> Software:
> 
> - KVM host: Arch Linux
>   - self-built vanilla kernel (built using Arch Linux `PKGBUILD`
>     modified to use vanilla kernel sources from git.kernel.org)
>   - libvirt 1:7.10.0-2
>   - qemu 6.2.0-2
> 
> - KVM guest: Windows 10
>   - GPU driver: Radeon Pro Software Version 21.Q3 (Note that I also
>     experienced this issue with the 20.Q4 driver, using packaged
>     (non-vanilla) Arch Linux kernels on the host, before updating to the
>     21.Q3 driver.)
> 
> Kernel config:
> 
> - For v5.13, v5.14-rc1, and v5.14, I used
>   https://github.com/archlinux/svntogit-packages/blob/89c24952adbfa645d9e1a6f12c572929f7e4e3c7/trunk/config
>   (The build script ran `make olddefconfig` on that config file.)
> 
> - For v5.16, I used
>   https://github.com/archlinux/svntogit-packages/blob/94f84e1ad8a530e54aa34cadbaa76e8dcc439d10/trunk/config
>   (The build script ran `make olddefconfig` on that config file.)
> 
> I set up the VM with PCI passthrough according to the instructions at
> https://wiki.archlinux.org/title/PCI_passthrough_via_OVMF
> 
> I'm passing through the following PCI devices to the VM, as listed by
> `lspci -D -nn`:
> 
>   0000:01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Lexa XT [Radeon PRO WX 3200] [1002:6981]
>   0000:01:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X] [1002:aae0]
> 
> The host kernel command line includes the following relevant options:
> 
>   intel_iommu=on vfio-pci.ids=1002:6981,1002:aae0
> 
> to enable IOMMU and bind the `vfio-pci` driver to the PCI devices.
> 
> My `/etc/mkinitcpio.conf` includes the following line:
> 
>   MODULES=(vfio_pci vfio vfio_iommu_type1 vfio_virqfd i915 amdgpu)
> 
> to load `vfio-pci` before the graphics drivers. (Note that removing
> `i915 amdgpu` has no effect on this issue.)
> 
> I'm using libvirt to manage the VM. The relevant portions of the XML
> file are:
> 
>   <hostdev mode="subsystem" type="pci" managed="yes">
>     <source>
>       <address domain="0x0000" bus="0x01" slot="0x00" function="0x0"/>
>     </source>
>     <address type="pci" domain="0x0000" bus="0x06" slot="0x00" function="0x0"/>
>   </hostdev>
>   <hostdev mode="subsystem" type="pci" managed="yes">
>     <source>
>       <address domain="0x0000" bus="0x01" slot="0x00" function="0x1"/>
>     </source>
>     <address type="pci" domain="0x0000" bus="0x07" slot="0x00" function="0x0"/>
>   </hostdev>
> 
> # Test Results
> 
> For testing, I used the following procedure:
> 
> 1. Boot the host machine and log in.
> 
> 2. Run the following commands to gather information. For all the tests,
>    the output was identical.
> 
>    - `cat /proc/sys/kernel/tainted` printed:
> 
>      0
> 
>    - `hostnamectl | grep "Operating System"` printed:
> 
>      Operating System: Arch Linux
> 
>    - `lspci -nnk -d 1002:6981` printed
> 
>      01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Lexa XT [Radeon PRO WX 3200] [1002:6981]
>      	Subsystem: Dell Device [1028:0926]
>      	Kernel driver in use: vfio-pci
>      	Kernel modules: amdgpu
> 
>    - `lspci -nnk -d 1002:aae0` printed
> 
>      01:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X] [1002:aae0]
>      	Subsystem: Dell Device [1028:0926]
>      	Kernel driver in use: vfio-pci
>      	Kernel modules: snd_hda_intel
> 
>    - `sudo dmesg | grep -i vfio` printed the kernel command line and the
>      following messages:
> 
>      VFIO - User Level meta-driver version: 0.3
>      vfio-pci 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
>      vfio_pci: add [1002:6981[ffffffff:ffffffff]] class 0x000000/00000000
>      vfio_pci: add [1002:aae0[ffffffff:ffffffff]] class 0x000000/00000000
>      vfio-pci 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
> 
> 3. Start the Windows VM using libvirt and log in. Record sensor
>    information.
> 
> 4. Run a graphically-intensive video game to put the GPU under load.
>    Record sensor information.
> 
> 5. Stop the game. Record sensor information.
> 
> 6. Shut down the VM. Save the output of `sudo dmesg`.
> 
> I compared the `sudo dmesg` output for v5.13 and v5.14-rc1 and didn't
> see any relevant differences.
> 
> Note that the issue occurs only within the guest VM. When I'm not using
> a VM (after removing `vfio-pci.ids=1002:6981,1002:aae0` from the kernel
> command line so that the PCI devices are bound to their normal `amdgpu`
> and `snd_hda_intel` drivers instead of the `vfio-pci` driver), the GPU
> operates correctly on the host.
> 
> ## Linux v5.16 (issue present)
> 
> $ cat /proc/version
> Linux version 5.16.0-1 (linux@archlinux) (gcc (GCC) 11.1.0, GNU ld (GNU Binutils) 2.36.1) #1 SMP PREEMPT Sun, 16 Jan 2022 01:51:08 +0000
> 
> Before running the game:
> 
> - GPU core: 214.0 MHz, 0.800 V, 0.0% load, 53.0 degC
> - GPU memory: 625.0 MHz
> 
> While running the game:
> 
> - GPU core: 501.0 MHz, 0.800 V, 100.0% load, 54.0 degC
> - GPU memory: 625.0 MHz
> 
> After stopping the game:
> 
> - GPU core: 214.0 MHz, 0.800 V, 0.0% load, 51.0 degC
> - GPU memory: 625.0 MHz
> 
> ## Linux v5.14 (issue present)
> 
> $ cat /proc/version
> Linux version 5.14.0-1 (linux@archlinux) (gcc (GCC) 11.1.0, GNU ld (GNU Binutils) 2.36.1) #1 SMP PREEMPT Sun, 16 Jan 2022 03:19:35 +0000
> 
> Before running the game:
> 
> - GPU core: 214.0 MHz, 0.800 V, 0.0% load, 50.0 degC
> - GPU memory: 625.0 MHz
> 
> While running the game:
> 
> - GPU core: 501.0 MHz, 0.800 V, 100.0% load, 54.0 degC
> - GPU memory: 625.0 MHz
> 
> After stopping the game:
> 
> - GPU core: 214.0 MHz, 0.800 V, 0.0% load, 49.0 degC
> - GPU memory: 625.0 MHz
> 
> ## Linux v5.14-rc1 (issue present)
> 
> $ cat /proc/version
> Linux version 5.14.0-rc1-1 (linux@archlinux) (gcc (GCC) 11.1.0, GNU ld (GNU Binutils) 2.36.1) #1 SMP PREEMPT Sun, 16 Jan 2022 18:31:35 +0000
> 
> Before running the game:
> 
> - GPU core: 214.0 MHz, 0.800 V, 0.0% load, 50.0 degC
> - GPU memory: 625.0 MHz
> 
> While running the game:
> 
> - GPU core: 501.0 MHz, 0.800 V, 100.0% load, 54.0 degC
> - GPU memory: 625.0 MHz
> 
> After stopping the game:
> 
> - GPU core: 214.0 MHz, 0.800 V, 0.0% load, 49.0 degC
> - GPU memory: 625.0 MHz
> 
> ## Linux v5.13 (works correctly, issue not present)
> 
> $ cat /proc/version
> Linux version 5.13.0-1 (linux@archlinux) (gcc (GCC) 11.1.0, GNU ld (GNU Binutils) 2.36.1) #1 SMP PREEMPT Sun, 16 Jan 2022 02:39:18 +0000
> 
> Before running the game:
> 
> - GPU core: 214.0 MHz, 0.850 V, 0.0% load, 55.0 degC
> - GPU memory: 1500.0 MHz
> 
> While running the game:
> 
> - GPU core: 1295.0 MHz, 1.000 V, 100.0% load, 67.0 degC
> - GPU memory: 1500.0 MHz
> 
> After stopping the game:
> 
> - GPU core: 214.0 MHz, 0.850 V, 0.0% load, 52.0 degC
> - GPU memory: 1500.0 MHz
> 
> 
---
Additional information about regzbot:

If you want to know more about regzbot, check out its web-interface, the
getting start guide, and/or the references documentation:

https://linux-regtracking.leemhuis.info/regzbot/
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md

The last two documents will explain how you can interact with regzbot
yourself if you want to.

Hint for reporters: when reporting a regression, it's in your interest
to tell #regzbot about it in the report, as that will ensure the
regression gets on the radar of regzbot and the regression tracker, who
will then make sure the report won't fall through the cracks unnoticed.

Hint for developers: you normally don't need to care about regzbot once
it's involved. Fix the issue as you normally would, just remember to
include a 'Link:' tag to the report in the commit message, as explained
in Documentation/process/submitting-patches.rst
That aspect was recently made more explicit in commit 1f57bd42b77c:
https://git.kernel.org/linus/1f57bd42b77c

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM
  2022-01-17  9:03 ` Thorsten Leemhuis
@ 2022-01-18  3:14   ` James Turner
  2022-01-21  2:13     ` James Turner
  0 siblings, 1 reply; 30+ messages in thread
From: James Turner @ 2022-01-18  3:14 UTC (permalink / raw)
  To: Thorsten Leemhuis; +Cc: Alex Williamson, kvm, regressions, linux-kernel

I finished about half of the bisection process today. The log so far is
below. I'll follow up again once I've narrowed it down to a single
commit.

git bisect start
# bad: [e73f0f0ee7541171d89f2e2491130c7771ba58d3] Linux 5.14-rc1
git bisect bad e73f0f0ee7541171d89f2e2491130c7771ba58d3
# good: [62fb9874f5da54fdb243003b386128037319b219] Linux 5.13
git bisect good 62fb9874f5da54fdb243003b386128037319b219
# bad: [e058a84bfddc42ba356a2316f2cf1141974625c9] Merge tag 'drm-next-2021-07-01' of git://anongit.freedesktop.org/drm/drm
git bisect bad e058a84bfddc42ba356a2316f2cf1141974625c9
# good: [a6eaf3850cb171c328a8b0db6d3c79286a1eba9d] Merge tag 'sched-urgent-2021-06-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good a6eaf3850cb171c328a8b0db6d3c79286a1eba9d
# good: [007b312c6f294770de01fbc0643610145012d244] Merge tag 'mac80211-next-for-net-next-2021-06-25' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
git bisect good 007b312c6f294770de01fbc0643610145012d244
# bad: [18703923a66aecf6f7ded0e16d22eb412ddae72f] drm/amdgpu: Fix incorrect register offsets for Sienna Cichlid
git bisect bad 18703923a66aecf6f7ded0e16d22eb412ddae72f
# good: [c99c4d0ca57c978dcc2a2f41ab8449684ea154cc] Merge tag 'amd-drm-next-5.14-2021-05-19' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
git bisect good c99c4d0ca57c978dcc2a2f41ab8449684ea154cc
# good: [43ed3c6c786d996a264fcde68dbb36df6f03b965] Merge tag 'drm-misc-next-2021-06-01' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
git bisect good 43ed3c6c786d996a264fcde68dbb36df6f03b965
# bad: [050cd3d616d96c3a04f4877842a391c0a4fdcc7a] drm/amd/display: Add support for SURFACE_PIXEL_FORMAT_GRPH_ABGR16161616.
git bisect bad 050cd3d616d96c3a04f4877842a391c0a4fdcc7a

James

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM
  2022-01-18  3:14   ` James Turner
@ 2022-01-21  2:13     ` James Turner
  2022-01-21  6:22       ` Thorsten Leemhuis
  0 siblings, 1 reply; 30+ messages in thread
From: James Turner @ 2022-01-21  2:13 UTC (permalink / raw)
  To: Thorsten Leemhuis
  Cc: Greg KH, Alex Williamson, kvm, regressions, linux-kernel

Hi all,

I finished the bisection (log below). The issue was introduced in
f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)").

Would any additional information be helpful?

git bisect start
# bad: [e73f0f0ee7541171d89f2e2491130c7771ba58d3] Linux 5.14-rc1
git bisect bad e73f0f0ee7541171d89f2e2491130c7771ba58d3
# good: [62fb9874f5da54fdb243003b386128037319b219] Linux 5.13
git bisect good 62fb9874f5da54fdb243003b386128037319b219
# bad: [e058a84bfddc42ba356a2316f2cf1141974625c9] Merge tag 'drm-next-2021-07-01' of git://anongit.freedesktop.org/drm/drm
git bisect bad e058a84bfddc42ba356a2316f2cf1141974625c9
# good: [a6eaf3850cb171c328a8b0db6d3c79286a1eba9d] Merge tag 'sched-urgent-2021-06-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good a6eaf3850cb171c328a8b0db6d3c79286a1eba9d
# good: [007b312c6f294770de01fbc0643610145012d244] Merge tag 'mac80211-next-for-net-next-2021-06-25' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
git bisect good 007b312c6f294770de01fbc0643610145012d244
# bad: [18703923a66aecf6f7ded0e16d22eb412ddae72f] drm/amdgpu: Fix incorrect register offsets for Sienna Cichlid
git bisect bad 18703923a66aecf6f7ded0e16d22eb412ddae72f
# good: [c99c4d0ca57c978dcc2a2f41ab8449684ea154cc] Merge tag 'amd-drm-next-5.14-2021-05-19' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
git bisect good c99c4d0ca57c978dcc2a2f41ab8449684ea154cc
# good: [43ed3c6c786d996a264fcde68dbb36df6f03b965] Merge tag 'drm-misc-next-2021-06-01' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
git bisect good 43ed3c6c786d996a264fcde68dbb36df6f03b965
# bad: [050cd3d616d96c3a04f4877842a391c0a4fdcc7a] drm/amd/display: Add support for SURFACE_PIXEL_FORMAT_GRPH_ABGR16161616.
git bisect bad 050cd3d616d96c3a04f4877842a391c0a4fdcc7a
# good: [f43ae2d1806c2b8a0934cb4acddd3cf3750d10f8] drm/amdgpu: Fix inconsistent indenting
git bisect good f43ae2d1806c2b8a0934cb4acddd3cf3750d10f8
# good: [6566cae7aef30da8833f1fa0eb854baf33b96676] drm/amd/display: fix odm scaling
git bisect good 6566cae7aef30da8833f1fa0eb854baf33b96676
# good: [5ac1dd89df549648b67f4d5e3a01b2d653914c55] drm/amd/display/dc/dce/dmub_outbox: Convert over to kernel-doc
git bisect good 5ac1dd89df549648b67f4d5e3a01b2d653914c55
# good: [a76eb7d30f700e5bdecc72d88d2226d137b11f74] drm/amd/display/dc/dce110/dce110_hw_sequencer: Include header containing our prototypes
git bisect good a76eb7d30f700e5bdecc72d88d2226d137b11f74
# good: [dd1d82c04e111b5a864638ede8965db2fe6d8653] drm/amdgpu/swsmu/aldebaran: fix check in is_dpm_running
git bisect good dd1d82c04e111b5a864638ede8965db2fe6d8653
# bad: [f9b7f3703ff97768a8dfabd42bdb107681f1da22] drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)
git bisect bad f9b7f3703ff97768a8dfabd42bdb107681f1da22
# good: [f1688bd69ec4b07eda1657ff953daebce7cfabf6] drm/amd/amdgpu:save psp ring wptr to avoid attack
git bisect good f1688bd69ec4b07eda1657ff953daebce7cfabf6
# first bad commit: [f9b7f3703ff97768a8dfabd42bdb107681f1da22] drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)

James

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM
  2022-01-21  2:13     ` James Turner
@ 2022-01-21  6:22       ` Thorsten Leemhuis
  2022-01-21 16:45         ` Alex Deucher
  0 siblings, 1 reply; 30+ messages in thread
From: Thorsten Leemhuis @ 2022-01-21  6:22 UTC (permalink / raw)
  To: James Turner, Alex Deucher, Lijo Lazar
  Cc: Greg KH, Alex Williamson, kvm, regressions, linux-kernel,
	Christian König, Pan, Xinhui, amd-gfx

Hi, this is your Linux kernel regression tracker speaking.

On 21.01.22 03:13, James Turner wrote:
> 
> I finished the bisection (log below). The issue was introduced in
> f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)").

FWIW, that was:

> drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)
> They are global ACPI methods, so maybe the structures
> global in the driver. This simplified a number of things
> in the handling of these methods.
> 
> v2: reset the handle if verify interface fails (Lijo)
> v3: fix compilation when ACPI is not defined.
> 
> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

In that case we need to get those two and the maintainers for the driver
involved by addressing them with this mail. And to make it easy for them
here is a link and a quote from the original report:

https://lore.kernel.org/all/87ee57c8fu.fsf@turner.link/

```
> Hi,
> 
> With newer kernels, starting with the v5.14 series, when using a MS
> Windows 10 guest VM with PCI passthrough of an AMD Radeon Pro WX 3200
> discrete GPU, the passed-through GPU will not run above 501 MHz, even
> when it is under 100% load and well below the temperature limit. As a
> result, GPU-intensive software (such as video games) runs unusably
> slowly in the VM.
> 
> In contrast, with older kernels, the passed-through GPU runs at up to
> 1295 MHz (the correct hardware limit), so GPU-intensive software runs at
> a reasonable speed in the VM.
> 
> I've confirmed that the issue exists with the following kernel versions:
> 
> - v5.16
> - v5.14
> - v5.14-rc1
> 
> The issue does not exist with the following kernels:
> 
> - v5.13
> - various packaged (non-vanilla) 5.10.* Arch Linux `linux-lts` kernels
> 
> So, the issue was introduced between v5.13 and v5.14-rc1. I'm willing to
> bisect the commit history to narrow it down further, if that would be
> helpful.
> 
> The configuration details and test results are provided below. In
> summary, for the kernels with this issue, the GPU core stays at a
> constant 0.8 V, the GPU core clock ranges from 214 MHz to 501 MHz, and
> the GPU memory stays at a constant 625 MHz, in the VM. For the correctly
> working kernels, the GPU core ranges from 0.85 V to 1.0 V, the GPU core
> clock ranges from 214 MHz to 1295 MHz, and the GPU memory stays at 1500
> MHz, in the VM.
> 
> Please let me know if additional information would be helpful.
> 
> Regards,
> James Turner
> 
> # Configuration Details
> 
> Hardware:
> 
> - Dell Precision 7540 laptop
> - CPU: Intel Core i7-9750H (x86-64)
> - Discrete GPU: AMD Radeon Pro WX 3200
> - The internal display is connected to the integrated GPU, and external
>   displays are connected to the discrete GPU.
> 
> Software:
> 
> - KVM host: Arch Linux
>   - self-built vanilla kernel (built using Arch Linux `PKGBUILD`
>     modified to use vanilla kernel sources from git.kernel.org)
>   - libvirt 1:7.10.0-2
>   - qemu 6.2.0-2
> 
> - KVM guest: Windows 10
>   - GPU driver: Radeon Pro Software Version 21.Q3 (Note that I also
>     experienced this issue with the 20.Q4 driver, using packaged
>     (non-vanilla) Arch Linux kernels on the host, before updating to the
>     21.Q3 driver.)
> 
> Kernel config:
> 
> - For v5.13, v5.14-rc1, and v5.14, I used
>   https://github.com/archlinux/svntogit-packages/blob/89c24952adbfa645d9e1a6f12c572929f7e4e3c7/trunk/config
>   (The build script ran `make olddefconfig` on that config file.)
> 
> - For v5.16, I used
>   https://github.com/archlinux/svntogit-packages/blob/94f84e1ad8a530e54aa34cadbaa76e8dcc439d10/trunk/config
>   (The build script ran `make olddefconfig` on that config file.)
> 
> I set up the VM with PCI passthrough according to the instructions at
> https://wiki.archlinux.org/title/PCI_passthrough_via_OVMF
> 
> I'm passing through the following PCI devices to the VM, as listed by
> `lspci -D -nn`:
> 
>   0000:01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Lexa XT [Radeon PRO WX 3200] [1002:6981]
>   0000:01:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X] [1002:aae0]
> 
> The host kernel command line includes the following relevant options:
> 
>   intel_iommu=on vfio-pci.ids=1002:6981,1002:aae0
> 
> to enable IOMMU and bind the `vfio-pci` driver to the PCI devices.
> 
> My `/etc/mkinitcpio.conf` includes the following line:
> 
>   MODULES=(vfio_pci vfio vfio_iommu_type1 vfio_virqfd i915 amdgpu)
> 
> to load `vfio-pci` before the graphics drivers. (Note that removing
> `i915 amdgpu` has no effect on this issue.)
> 
> I'm using libvirt to manage the VM. The relevant portions of the XML
> file are:
> 
>   <hostdev mode="subsystem" type="pci" managed="yes">
>     <source>
>       <address domain="0x0000" bus="0x01" slot="0x00" function="0x0"/>
>     </source>
>     <address type="pci" domain="0x0000" bus="0x06" slot="0x00" function="0x0"/>
>   </hostdev>
>   <hostdev mode="subsystem" type="pci" managed="yes">
>     <source>
>       <address domain="0x0000" bus="0x01" slot="0x00" function="0x1"/>
>     </source>
>     <address type="pci" domain="0x0000" bus="0x07" slot="0x00" function="0x0"/>
>   </hostdev>
> 
> # Test Results
> 
> For testing, I used the following procedure:
> 
> 1. Boot the host machine and log in.
> 
> 2. Run the following commands to gather information. For all the tests,
>    the output was identical.
> 
>    - `cat /proc/sys/kernel/tainted` printed:
> 
>      0
> 
>    - `hostnamectl | grep "Operating System"` printed:
> 
>      Operating System: Arch Linux
> 
>    - `lspci -nnk -d 1002:6981` printed
> 
>      01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Lexa XT [Radeon PRO WX 3200] [1002:6981]
>      	Subsystem: Dell Device [1028:0926]
>      	Kernel driver in use: vfio-pci
>      	Kernel modules: amdgpu
> 
>    - `lspci -nnk -d 1002:aae0` printed
> 
>      01:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X] [1002:aae0]
>      	Subsystem: Dell Device [1028:0926]
>      	Kernel driver in use: vfio-pci
>      	Kernel modules: snd_hda_intel
> 
>    - `sudo dmesg | grep -i vfio` printed the kernel command line and the
>      following messages:
> 
>      VFIO - User Level meta-driver version: 0.3
>      vfio-pci 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
>      vfio_pci: add [1002:6981[ffffffff:ffffffff]] class 0x000000/00000000
>      vfio_pci: add [1002:aae0[ffffffff:ffffffff]] class 0x000000/00000000
>      vfio-pci 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
> 
> 3. Start the Windows VM using libvirt and log in. Record sensor
>    information.
> 
> 4. Run a graphically-intensive video game to put the GPU under load.
>    Record sensor information.
> 
> 5. Stop the game. Record sensor information.
> 
> 6. Shut down the VM. Save the output of `sudo dmesg`.
> 
> I compared the `sudo dmesg` output for v5.13 and v5.14-rc1 and didn't
> see any relevant differences.
> 
> Note that the issue occurs only within the guest VM. When I'm not using
> a VM (after removing `vfio-pci.ids=1002:6981,1002:aae0` from the kernel
> command line so that the PCI devices are bound to their normal `amdgpu`
> and `snd_hda_intel` drivers instead of the `vfio-pci` driver), the GPU
> operates correctly on the host.
> 
> ## Linux v5.16 (issue present)
> 
> $ cat /proc/version
> Linux version 5.16.0-1 (linux@archlinux) (gcc (GCC) 11.1.0, GNU ld (GNU Binutils) 2.36.1) #1 SMP PREEMPT Sun, 16 Jan 2022 01:51:08 +0000
> 
> Before running the game:
> 
> - GPU core: 214.0 MHz, 0.800 V, 0.0% load, 53.0 degC
> - GPU memory: 625.0 MHz
> 
> While running the game:
> 
> - GPU core: 501.0 MHz, 0.800 V, 100.0% load, 54.0 degC
> - GPU memory: 625.0 MHz
> 
> After stopping the game:
> 
> - GPU core: 214.0 MHz, 0.800 V, 0.0% load, 51.0 degC
> - GPU memory: 625.0 MHz
> 
> ## Linux v5.14 (issue present)
> 
> $ cat /proc/version
> Linux version 5.14.0-1 (linux@archlinux) (gcc (GCC) 11.1.0, GNU ld (GNU Binutils) 2.36.1) #1 SMP PREEMPT Sun, 16 Jan 2022 03:19:35 +0000
> 
> Before running the game:
> 
> - GPU core: 214.0 MHz, 0.800 V, 0.0% load, 50.0 degC
> - GPU memory: 625.0 MHz
> 
> While running the game:
> 
> - GPU core: 501.0 MHz, 0.800 V, 100.0% load, 54.0 degC
> - GPU memory: 625.0 MHz
> 
> After stopping the game:
> 
> - GPU core: 214.0 MHz, 0.800 V, 0.0% load, 49.0 degC
> - GPU memory: 625.0 MHz
> 
> ## Linux v5.14-rc1 (issue present)
> 
> $ cat /proc/version
> Linux version 5.14.0-rc1-1 (linux@archlinux) (gcc (GCC) 11.1.0, GNU ld (GNU Binutils) 2.36.1) #1 SMP PREEMPT Sun, 16 Jan 2022 18:31:35 +0000
> 
> Before running the game:
> 
> - GPU core: 214.0 MHz, 0.800 V, 0.0% load, 50.0 degC
> - GPU memory: 625.0 MHz
> 
> While running the game:
> 
> - GPU core: 501.0 MHz, 0.800 V, 100.0% load, 54.0 degC
> - GPU memory: 625.0 MHz
> 
> After stopping the game:
> 
> - GPU core: 214.0 MHz, 0.800 V, 0.0% load, 49.0 degC
> - GPU memory: 625.0 MHz
> 
> ## Linux v5.13 (works correctly, issue not present)
> 
> $ cat /proc/version
> Linux version 5.13.0-1 (linux@archlinux) (gcc (GCC) 11.1.0, GNU ld (GNU Binutils) 2.36.1) #1 SMP PREEMPT Sun, 16 Jan 2022 02:39:18 +0000
> 
> Before running the game:
> 
> - GPU core: 214.0 MHz, 0.850 V, 0.0% load, 55.0 degC
> - GPU memory: 1500.0 MHz
> 
> While running the game:
> 
> - GPU core: 1295.0 MHz, 1.000 V, 100.0% load, 67.0 degC
> - GPU memory: 1500.0 MHz
> 
> After stopping the game:
> 
> - GPU core: 214.0 MHz, 0.850 V, 0.0% load, 52.0 degC
> - GPU memory: 1500.0 MHz

```

Ciao, Thorsten (wearing his 'Linux kernel regression tracker' hat)

P.S.: As a Linux kernel regression tracker I'm getting a lot of reports
on my table. I can only look briefly into most of them. Unfortunately
therefore I sometimes will get things wrong or miss something important.
I hope that's not the case here; if you think it is, don't hesitate to
tell me about it in a public reply, that's in everyone's interest.

BTW, I have no personal interest in this issue, which is tracked using
regzbot, my Linux kernel regression tracking bot
(https://linux-regtracking.leemhuis.info/regzbot/). I'm only posting
this mail to get things rolling again and hence don't need to be CC on
all further activities wrt this regression.

#regzbot introduced f9b7f3703ff9
#regzbot title drm: amdgpu: Too-low frequency limit for AMD GPU
PCI-passed-through to Windows VM


> Would any additional information be helpful?
> 
> git bisect start
> # bad: [e73f0f0ee7541171d89f2e2491130c7771ba58d3] Linux 5.14-rc1
> git bisect bad e73f0f0ee7541171d89f2e2491130c7771ba58d3
> # good: [62fb9874f5da54fdb243003b386128037319b219] Linux 5.13
> git bisect good 62fb9874f5da54fdb243003b386128037319b219
> # bad: [e058a84bfddc42ba356a2316f2cf1141974625c9] Merge tag 'drm-next-2021-07-01' of git://anongit.freedesktop.org/drm/drm
> git bisect bad e058a84bfddc42ba356a2316f2cf1141974625c9
> # good: [a6eaf3850cb171c328a8b0db6d3c79286a1eba9d] Merge tag 'sched-urgent-2021-06-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> git bisect good a6eaf3850cb171c328a8b0db6d3c79286a1eba9d
> # good: [007b312c6f294770de01fbc0643610145012d244] Merge tag 'mac80211-next-for-net-next-2021-06-25' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
> git bisect good 007b312c6f294770de01fbc0643610145012d244
> # bad: [18703923a66aecf6f7ded0e16d22eb412ddae72f] drm/amdgpu: Fix incorrect register offsets for Sienna Cichlid
> git bisect bad 18703923a66aecf6f7ded0e16d22eb412ddae72f
> # good: [c99c4d0ca57c978dcc2a2f41ab8449684ea154cc] Merge tag 'amd-drm-next-5.14-2021-05-19' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
> git bisect good c99c4d0ca57c978dcc2a2f41ab8449684ea154cc
> # good: [43ed3c6c786d996a264fcde68dbb36df6f03b965] Merge tag 'drm-misc-next-2021-06-01' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
> git bisect good 43ed3c6c786d996a264fcde68dbb36df6f03b965
> # bad: [050cd3d616d96c3a04f4877842a391c0a4fdcc7a] drm/amd/display: Add support for SURFACE_PIXEL_FORMAT_GRPH_ABGR16161616.
> git bisect bad 050cd3d616d96c3a04f4877842a391c0a4fdcc7a
> # good: [f43ae2d1806c2b8a0934cb4acddd3cf3750d10f8] drm/amdgpu: Fix inconsistent indenting
> git bisect good f43ae2d1806c2b8a0934cb4acddd3cf3750d10f8
> # good: [6566cae7aef30da8833f1fa0eb854baf33b96676] drm/amd/display: fix odm scaling
> git bisect good 6566cae7aef30da8833f1fa0eb854baf33b96676
> # good: [5ac1dd89df549648b67f4d5e3a01b2d653914c55] drm/amd/display/dc/dce/dmub_outbox: Convert over to kernel-doc
> git bisect good 5ac1dd89df549648b67f4d5e3a01b2d653914c55
> # good: [a76eb7d30f700e5bdecc72d88d2226d137b11f74] drm/amd/display/dc/dce110/dce110_hw_sequencer: Include header containing our prototypes
> git bisect good a76eb7d30f700e5bdecc72d88d2226d137b11f74
> # good: [dd1d82c04e111b5a864638ede8965db2fe6d8653] drm/amdgpu/swsmu/aldebaran: fix check in is_dpm_running
> git bisect good dd1d82c04e111b5a864638ede8965db2fe6d8653
> # bad: [f9b7f3703ff97768a8dfabd42bdb107681f1da22] drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)
> git bisect bad f9b7f3703ff97768a8dfabd42bdb107681f1da22
> # good: [f1688bd69ec4b07eda1657ff953daebce7cfabf6] drm/amd/amdgpu:save psp ring wptr to avoid attack
> git bisect good f1688bd69ec4b07eda1657ff953daebce7cfabf6
> # first bad commit: [f9b7f3703ff97768a8dfabd42bdb107681f1da22] drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)
> 
> James
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM
  2022-01-21  6:22       ` Thorsten Leemhuis
@ 2022-01-21 16:45         ` Alex Deucher
  2022-01-22  0:51           ` James Turner
  0 siblings, 1 reply; 30+ messages in thread
From: Alex Deucher @ 2022-01-21 16:45 UTC (permalink / raw)
  To: Thorsten Leemhuis
  Cc: James Turner, Alex Deucher, Lijo Lazar, regressions, kvm,
	Greg KH, Pan, Xinhui, LKML, amd-gfx, Alex Williamson,
	Christian König

On Fri, Jan 21, 2022 at 3:35 AM Thorsten Leemhuis
<regressions@leemhuis.info> wrote:
>
> Hi, this is your Linux kernel regression tracker speaking.
>
> On 21.01.22 03:13, James Turner wrote:
> >
> > I finished the bisection (log below). The issue was introduced in
> > f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)").
>
> FWIW, that was:
>
> > drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)
> > They are global ACPI methods, so maybe the structures
> > global in the driver. This simplified a number of things
> > in the handling of these methods.
> >
> > v2: reset the handle if verify interface fails (Lijo)
> > v3: fix compilation when ACPI is not defined.
> >
> > Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
> > Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
>
> In that case we need to get those two and the maintainers for the driver
> involved by addressing them with this mail. And to make it easy for them
> here is a link and a quote from the original report:
>
> https://lore.kernel.org/all/87ee57c8fu.fsf@turner.link/

Are you ever loading the amdgpu driver in your tests?  If not, I don't
see how this patch would affect anything as the driver code would
never have executed.  It would appear not based on your example.
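
A quick way to check on the host would be something like:

  lsmod | grep amdgpu
  sudo dmesg | grep -i amdgpu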

Alex

>
> ```
> > Hi,
> >
> > With newer kernels, starting with the v5.14 series, when using a MS
> > Windows 10 guest VM with PCI passthrough of an AMD Radeon Pro WX 3200
> > discrete GPU, the passed-through GPU will not run above 501 MHz, even
> > when it is under 100% load and well below the temperature limit. As a
> > result, GPU-intensive software (such as video games) runs unusably
> > slowly in the VM.
> >
> > In contrast, with older kernels, the passed-through GPU runs at up to
> > 1295 MHz (the correct hardware limit), so GPU-intensive software runs at
> > a reasonable speed in the VM.
> >
> > I've confirmed that the issue exists with the following kernel versions:
> >
> > - v5.16
> > - v5.14
> > - v5.14-rc1
> >
> > The issue does not exist with the following kernels:
> >
> > - v5.13
> > - various packaged (non-vanilla) 5.10.* Arch Linux `linux-lts` kernels
> >
> > So, the issue was introduced between v5.13 and v5.14-rc1. I'm willing to
> > bisect the commit history to narrow it down further, if that would be
> > helpful.
> >
> > The configuration details and test results are provided below. In
> > summary, for the kernels with this issue, the GPU core stays at a
> > constant 0.8 V, the GPU core clock ranges from 214 MHz to 501 MHz, and
> > the GPU memory stays at a constant 625 MHz, in the VM. For the correctly
> > working kernels, the GPU core ranges from 0.85 V to 1.0 V, the GPU core
> > clock ranges from 214 MHz to 1295 MHz, and the GPU memory stays at 1500
> > MHz, in the VM.
> >
> > Please let me know if additional information would be helpful.
> >
> > Regards,
> > James Turner
> >
> > # Configuration Details
> >
> > Hardware:
> >
> > - Dell Precision 7540 laptop
> > - CPU: Intel Core i7-9750H (x86-64)
> > - Discrete GPU: AMD Radeon Pro WX 3200
> > - The internal display is connected to the integrated GPU, and external
> >   displays are connected to the discrete GPU.
> >
> > Software:
> >
> > - KVM host: Arch Linux
> >   - self-built vanilla kernel (built using Arch Linux `PKGBUILD`
> >     modified to use vanilla kernel sources from git.kernel.org)
> >   - libvirt 1:7.10.0-2
> >   - qemu 6.2.0-2
> >
> > - KVM guest: Windows 10
> >   - GPU driver: Radeon Pro Software Version 21.Q3 (Note that I also
> >     experienced this issue with the 20.Q4 driver, using packaged
> >     (non-vanilla) Arch Linux kernels on the host, before updating to the
> >     21.Q3 driver.)
> >
> > Kernel config:
> >
> > - For v5.13, v5.14-rc1, and v5.14, I used
> >   https://github.com/archlinux/svntogit-packages/blob/89c24952adbfa645d9e1a6f12c572929f7e4e3c7/trunk/config
> >   (The build script ran `make olddefconfig` on that config file.)
> >
> > - For v5.16, I used
> >   https://github.com/archlinux/svntogit-packages/blob/94f84e1ad8a530e54aa34cadbaa76e8dcc439d10/trunk/config
> >   (The build script ran `make olddefconfig` on that config file.)
> >
> > I set up the VM with PCI passthrough according to the instructions at
> > https://wiki.archlinux.org/title/PCI_passthrough_via_OVMF
> >
> > I'm passing through the following PCI devices to the VM, as listed by
> > `lspci -D -nn`:
> >
> >   0000:01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Lexa XT [Radeon PRO WX 3200] [1002:6981]
> >   0000:01:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X] [1002:aae0]
> >
> > The host kernel command line includes the following relevant options:
> >
> >   intel_iommu=on vfio-pci.ids=1002:6981,1002:aae0
> >
> > to enable IOMMU and bind the `vfio-pci` driver to the PCI devices.
> >
> > My `/etc/mkinitcpio.conf` includes the following line:
> >
> >   MODULES=(vfio_pci vfio vfio_iommu_type1 vfio_virqfd i915 amdgpu)
> >
> > to load `vfio-pci` before the graphics drivers. (Note that removing
> > `i915 amdgpu` has no effect on this issue.)
> >
> > I'm using libvirt to manage the VM. The relevant portions of the XML
> > file are:
> >
> >   <hostdev mode="subsystem" type="pci" managed="yes">
> >     <source>
> >       <address domain="0x0000" bus="0x01" slot="0x00" function="0x0"/>
> >     </source>
> >     <address type="pci" domain="0x0000" bus="0x06" slot="0x00" function="0x0"/>
> >   </hostdev>
> >   <hostdev mode="subsystem" type="pci" managed="yes">
> >     <source>
> >       <address domain="0x0000" bus="0x01" slot="0x00" function="0x1"/>
> >     </source>
> >     <address type="pci" domain="0x0000" bus="0x07" slot="0x00" function="0x0"/>
> >   </hostdev>
> >
> > # Test Results
> >
> > For testing, I used the following procedure:
> >
> > 1. Boot the host machine and log in.
> >
> > 2. Run the following commands to gather information. For all the tests,
> >    the output was identical.
> >
> >    - `cat /proc/sys/kernel/tainted` printed:
> >
> >      0
> >
> >    - `hostnamectl | grep "Operating System"` printed:
> >
> >      Operating System: Arch Linux
> >
> >    - `lspci -nnk -d 1002:6981` printed
> >
> >      01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Lexa XT [Radeon PRO WX 3200] [1002:6981]
> >       Subsystem: Dell Device [1028:0926]
> >       Kernel driver in use: vfio-pci
> >       Kernel modules: amdgpu
> >
> >    - `lspci -nnk -d 1002:aae0` printed
> >
> >      01:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X] [1002:aae0]
> >       Subsystem: Dell Device [1028:0926]
> >       Kernel driver in use: vfio-pci
> >       Kernel modules: snd_hda_intel
> >
> >    - `sudo dmesg | grep -i vfio` printed the kernel command line and the
> >      following messages:
> >
> >      VFIO - User Level meta-driver version: 0.3
> >      vfio-pci 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
> >      vfio_pci: add [1002:6981[ffffffff:ffffffff]] class 0x000000/00000000
> >      vfio_pci: add [1002:aae0[ffffffff:ffffffff]] class 0x000000/00000000
> >      vfio-pci 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
> >
> > 3. Start the Windows VM using libvirt and log in. Record sensor
> >    information.
> >
> > 4. Run a graphically-intensive video game to put the GPU under load.
> >    Record sensor information.
> >
> > 5. Stop the game. Record sensor information.
> >
> > 6. Shut down the VM. Save the output of `sudo dmesg`.
> >
> > I compared the `sudo dmesg` output for v5.13 and v5.14-rc1 and didn't
> > see any relevant differences.
> >
> > Note that the issue occurs only within the guest VM. When I'm not using
> > a VM (after removing `vfio-pci.ids=1002:6981,1002:aae0` from the kernel
> > command line so that the PCI devices are bound to their normal `amdgpu`
> > and `snd_hda_intel` drivers instead of the `vfio-pci` driver), the GPU
> > operates correctly on the host.
> >
> > ## Linux v5.16 (issue present)
> >
> > $ cat /proc/version
> > Linux version 5.16.0-1 (linux@archlinux) (gcc (GCC) 11.1.0, GNU ld (GNU Binutils) 2.36.1) #1 SMP PREEMPT Sun, 16 Jan 2022 01:51:08 +0000
> >
> > Before running the game:
> >
> > - GPU core: 214.0 MHz, 0.800 V, 0.0% load, 53.0 degC
> > - GPU memory: 625.0 MHz
> >
> > While running the game:
> >
> > - GPU core: 501.0 MHz, 0.800 V, 100.0% load, 54.0 degC
> > - GPU memory: 625.0 MHz
> >
> > After stopping the game:
> >
> > - GPU core: 214.0 MHz, 0.800 V, 0.0% load, 51.0 degC
> > - GPU memory: 625.0 MHz
> >
> > ## Linux v5.14 (issue present)
> >
> > $ cat /proc/version
> > Linux version 5.14.0-1 (linux@archlinux) (gcc (GCC) 11.1.0, GNU ld (GNU Binutils) 2.36.1) #1 SMP PREEMPT Sun, 16 Jan 2022 03:19:35 +0000
> >
> > Before running the game:
> >
> > - GPU core: 214.0 MHz, 0.800 V, 0.0% load, 50.0 degC
> > - GPU memory: 625.0 MHz
> >
> > While running the game:
> >
> > - GPU core: 501.0 MHz, 0.800 V, 100.0% load, 54.0 degC
> > - GPU memory: 625.0 MHz
> >
> > After stopping the game:
> >
> > - GPU core: 214.0 MHz, 0.800 V, 0.0% load, 49.0 degC
> > - GPU memory: 625.0 MHz
> >
> > ## Linux v5.14-rc1 (issue present)
> >
> > $ cat /proc/version
> > Linux version 5.14.0-rc1-1 (linux@archlinux) (gcc (GCC) 11.1.0, GNU ld (GNU Binutils) 2.36.1) #1 SMP PREEMPT Sun, 16 Jan 2022 18:31:35 +0000
> >
> > Before running the game:
> >
> > - GPU core: 214.0 MHz, 0.800 V, 0.0% load, 50.0 degC
> > - GPU memory: 625.0 MHz
> >
> > While running the game:
> >
> > - GPU core: 501.0 MHz, 0.800 V, 100.0% load, 54.0 degC
> > - GPU memory: 625.0 MHz
> >
> > After stopping the game:
> >
> > - GPU core: 214.0 MHz, 0.800 V, 0.0% load, 49.0 degC
> > - GPU memory: 625.0 MHz
> >
> > ## Linux v5.13 (works correctly, issue not present)
> >
> > $ cat /proc/version
> > Linux version 5.13.0-1 (linux@archlinux) (gcc (GCC) 11.1.0, GNU ld (GNU Binutils) 2.36.1) #1 SMP PREEMPT Sun, 16 Jan 2022 02:39:18 +0000
> >
> > Before running the game:
> >
> > - GPU core: 214.0 MHz, 0.850 V, 0.0% load, 55.0 degC
> > - GPU memory: 1500.0 MHz
> >
> > While running the game:
> >
> > - GPU core: 1295.0 MHz, 1.000 V, 100.0% load, 67.0 degC
> > - GPU memory: 1500.0 MHz
> >
> > After stopping the game:
> >
> > - GPU core: 214.0 MHz, 0.850 V, 0.0% load, 52.0 degC
> > - GPU memory: 1500.0 MHz
>
> ```
>
> Ciao, Thorsten (wearing his 'Linux kernel regression tracker' hat)
>
> P.S.: As a Linux kernel regression tracker I'm getting a lot of reports
> on my table. I can only look briefly into most of them. Unfortunately
> therefore I sometimes will get things wrong or miss something important.
> I hope that's not the case here; if you think it is, don't hesitate to
> tell me about it in a public reply, that's in everyone's interest.
>
> BTW, I have no personal interest in this issue, which is tracked using
> regzbot, my Linux kernel regression tracking bot
> (https://linux-regtracking.leemhuis.info/regzbot/). I'm only posting
> this mail to get things rolling again and hence don't need to be CC on
> all further activities wrt this regression.
>
> #regzbot introduced f9b7f3703ff9
> #regzbot title drm: amdgpu: Too-low frequency limit for AMD GPU
> PCI-passed-through to Windows VM
>
>
> > Would any additional information be helpful?
> >
> > git bisect start
> > # bad: [e73f0f0ee7541171d89f2e2491130c7771ba58d3] Linux 5.14-rc1
> > git bisect bad e73f0f0ee7541171d89f2e2491130c7771ba58d3
> > # good: [62fb9874f5da54fdb243003b386128037319b219] Linux 5.13
> > git bisect good 62fb9874f5da54fdb243003b386128037319b219
> > # bad: [e058a84bfddc42ba356a2316f2cf1141974625c9] Merge tag 'drm-next-2021-07-01' of git://anongit.freedesktop.org/drm/drm
> > git bisect bad e058a84bfddc42ba356a2316f2cf1141974625c9
> > # good: [a6eaf3850cb171c328a8b0db6d3c79286a1eba9d] Merge tag 'sched-urgent-2021-06-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> > git bisect good a6eaf3850cb171c328a8b0db6d3c79286a1eba9d
> > # good: [007b312c6f294770de01fbc0643610145012d244] Merge tag 'mac80211-next-for-net-next-2021-06-25' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
> > git bisect good 007b312c6f294770de01fbc0643610145012d244
> > # bad: [18703923a66aecf6f7ded0e16d22eb412ddae72f] drm/amdgpu: Fix incorrect register offsets for Sienna Cichlid
> > git bisect bad 18703923a66aecf6f7ded0e16d22eb412ddae72f
> > # good: [c99c4d0ca57c978dcc2a2f41ab8449684ea154cc] Merge tag 'amd-drm-next-5.14-2021-05-19' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
> > git bisect good c99c4d0ca57c978dcc2a2f41ab8449684ea154cc
> > # good: [43ed3c6c786d996a264fcde68dbb36df6f03b965] Merge tag 'drm-misc-next-2021-06-01' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
> > git bisect good 43ed3c6c786d996a264fcde68dbb36df6f03b965
> > # bad: [050cd3d616d96c3a04f4877842a391c0a4fdcc7a] drm/amd/display: Add support for SURFACE_PIXEL_FORMAT_GRPH_ABGR16161616.
> > git bisect bad 050cd3d616d96c3a04f4877842a391c0a4fdcc7a
> > # good: [f43ae2d1806c2b8a0934cb4acddd3cf3750d10f8] drm/amdgpu: Fix inconsistent indenting
> > git bisect good f43ae2d1806c2b8a0934cb4acddd3cf3750d10f8
> > # good: [6566cae7aef30da8833f1fa0eb854baf33b96676] drm/amd/display: fix odm scaling
> > git bisect good 6566cae7aef30da8833f1fa0eb854baf33b96676
> > # good: [5ac1dd89df549648b67f4d5e3a01b2d653914c55] drm/amd/display/dc/dce/dmub_outbox: Convert over to kernel-doc
> > git bisect good 5ac1dd89df549648b67f4d5e3a01b2d653914c55
> > # good: [a76eb7d30f700e5bdecc72d88d2226d137b11f74] drm/amd/display/dc/dce110/dce110_hw_sequencer: Include header containing our prototypes
> > git bisect good a76eb7d30f700e5bdecc72d88d2226d137b11f74
> > # good: [dd1d82c04e111b5a864638ede8965db2fe6d8653] drm/amdgpu/swsmu/aldebaran: fix check in is_dpm_running
> > git bisect good dd1d82c04e111b5a864638ede8965db2fe6d8653
> > # bad: [f9b7f3703ff97768a8dfabd42bdb107681f1da22] drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)
> > git bisect bad f9b7f3703ff97768a8dfabd42bdb107681f1da22
> > # good: [f1688bd69ec4b07eda1657ff953daebce7cfabf6] drm/amd/amdgpu:save psp ring wptr to avoid attack
> > git bisect good f1688bd69ec4b07eda1657ff953daebce7cfabf6
> > # first bad commit: [f9b7f3703ff97768a8dfabd42bdb107681f1da22] drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)
> >
> > James
> >

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM
  2022-01-21 16:45         ` Alex Deucher
@ 2022-01-22  0:51           ` James Turner
  2022-01-22  5:52             ` Lazar, Lijo
  0 siblings, 1 reply; 30+ messages in thread
From: James Turner @ 2022-01-22  0:51 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Thorsten Leemhuis, Alex Deucher, Lijo Lazar, regressions, kvm,
	Greg KH, Pan, Xinhui, LKML, amd-gfx, Alex Williamson,
	Christian König

> Are you ever loading the amdgpu driver in your tests?

Yes, although I'm binding the `vfio-pci` driver to the AMD GPU's PCI
devices via the kernel command line. (See my initial email.) My
understanding is that `vfio-pci` is supposed to keep other drivers, such
as `amdgpu`, from interacting with the GPU, although that's clearly not
what's happening.
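
For reference, the binding is done with a host kernel command line option
roughly of this form (using the vendor:device IDs shown by `lspci`; the
exact options are listed in my initial email):

  vfio-pci.ids=1002:6981,1002:aae0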

I've been testing with `amdgpu` included in the `MODULES` list in
`/etc/mkinitcpio.conf` (which Arch Linux uses to generate the
initramfs). However, I ran some more tests today (results below), this
time without `i915` or `amdgpu` in the `MODULES` list. The `amdgpu`
kernel module still gets loaded. (I think udev loads it automatically?)
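
For reference, the relevant `/etc/mkinitcpio.conf` line looked roughly
like this (I'm only showing the GPU-related entries), and I regenerated
the initramfs with `mkinitcpio -P` after each change:

  # before: load both GPU drivers early from the initramfs
  MODULES=(i915 amdgpu)
  # for the tests below:
  MODULES=()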

Your comment gave me the idea to blacklist the `amdgpu` kernel module.
That does serve as a workaround on my machine – it fixes the behavior
for f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)")
and for the current Arch Linux prebuilt kernel (5.16.2-arch1-1). That's
an acceptable workaround for my machine only because the separate GPU
used by the host is an Intel integrated GPU. That workaround wouldn't
work well for someone with two AMD GPUs.


# New test results

The following tests are set up the same way as in my initial email,
with the following exceptions:

- I've updated libvirt to 1:8.0.0-1.

- I've removed `i915` and `amdgpu` from the `MODULES` list in
  `/etc/mkinitcpio.conf`.

For all three of these tests, `lspci` said the following:

% lspci -nnk -d 1002:6981
01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Lexa XT [Radeon PRO WX 3200] [1002:6981]
	Subsystem: Dell Device [1028:0926]
	Kernel driver in use: vfio-pci
	Kernel modules: amdgpu

% lspci -nnk -d 1002:aae0
01:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X] [1002:aae0]
	Subsystem: Dell Device [1028:0926]
	Kernel driver in use: vfio-pci
	Kernel modules: snd_hda_intel


## Version f1688bd69ec4 ("drm/amd/amdgpu:save psp ring wptr to avoid attack")

This is the commit immediately preceding the one which introduced the issue.

% sudo dmesg | grep -i amdgpu
[   15.840160] [drm] amdgpu kernel modesetting enabled.
[   15.840884] amdgpu: CRAT table not found
[   15.840885] amdgpu: Virtual CRAT table created for CPU
[   15.840893] amdgpu: Topology: Add CPU node

% lsmod | grep amdgpu
amdgpu               7450624  0
gpu_sched              49152  1 amdgpu
drm_ttm_helper         16384  1 amdgpu
ttm                    77824  2 amdgpu,drm_ttm_helper
i2c_algo_bit           16384  2 amdgpu,i915
drm_kms_helper        303104  2 amdgpu,i915
drm                   581632  11 gpu_sched,drm_kms_helper,amdgpu,drm_ttm_helper,i915,ttm

The passed-through GPU worked properly in the VM.


## Version f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)")

This is the commit which introduced the issue.

% sudo dmesg | grep -i amdgpu
[   15.319023] [drm] amdgpu kernel modesetting enabled.
[   15.329468] amdgpu: CRAT table not found
[   15.329470] amdgpu: Virtual CRAT table created for CPU
[   15.329482] amdgpu: Topology: Add CPU node

% lsmod | grep amdgpu
amdgpu               7450624  0
gpu_sched              49152  1 amdgpu
drm_ttm_helper         16384  1 amdgpu
ttm                    77824  2 amdgpu,drm_ttm_helper
i2c_algo_bit           16384  2 amdgpu,i915
drm_kms_helper        303104  2 amdgpu,i915
drm                   581632  11 gpu_sched,drm_kms_helper,amdgpu,drm_ttm_helper,i915,ttm

The passed-through GPU did not run above 501 MHz in the VM.


## Blacklisted `amdgpu`, version f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)")

For this test, I added `module_blacklist=amdgpu` to the kernel command
line to blacklist the `amdgpu` module.

% sudo dmesg | grep -i amdgpu
[   14.591576] Module amdgpu is blacklisted

% lsmod | grep amdgpu

The passed-through GPU worked properly in the VM.


James

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM
  2022-01-22  0:51           ` James Turner
@ 2022-01-22  5:52             ` Lazar, Lijo
  2022-01-22 21:11               ` James Turner
  0 siblings, 1 reply; 30+ messages in thread
From: Lazar, Lijo @ 2022-01-22  5:52 UTC (permalink / raw)
  To: James Turner, Alex Deucher
  Cc: Thorsten Leemhuis, Deucher, Alexander, regressions, kvm, Greg KH,
	Pan, Xinhui, LKML, amd-gfx, Alex Williamson, Koenig, Christian

[AMD Official Use Only]

Hi James,

Could you provide the pp_dpm_* values in sysfs with and without the patch? Also, could you try forcing PCIE to gen3 (through pp_dpm_pcie) if it's not in gen3 when the issue happens?

For details on pp_dpm_*, please check https://dri.freedesktop.org/docs/drm/gpu/amdgpu.html
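
For example, something along these lines (a rough sketch, assuming the
dGPU is card0 and gen3 is level 1 in pp_dpm_pcie; adjust the path and
index to match your system):

	# list the available DPM levels (the asterisk marks the current one)
	for f in /sys/class/drm/card0/device/pp_dpm_*; do echo "$f"; cat "$f"; done
	# switch DPM level selection to manual, then restrict pp_dpm_pcie to the gen3 level
	echo manual > /sys/class/drm/card0/device/power_dpm_force_performance_level
	echo 1 > /sys/class/drm/card0/device/pp_dpm_pcie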

Thanks,
Lijo

-----Original Message-----
From: James Turner <linuxkernel.foss@dmarc-none.turner.link> 
Sent: Saturday, January 22, 2022 6:21 AM
To: Alex Deucher <alexdeucher@gmail.com>
Cc: Thorsten Leemhuis <regressions@leemhuis.info>; Deucher, Alexander <Alexander.Deucher@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; regressions@lists.linux.dev; kvm@vger.kernel.org; Greg KH <gregkh@linuxfoundation.org>; Pan, Xinhui <Xinhui.Pan@amd.com>; LKML <linux-kernel@vger.kernel.org>; amd-gfx@lists.freedesktop.org; Alex Williamson <alex.williamson@redhat.com>; Koenig, Christian <Christian.Koenig@amd.com>
Subject: Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM

> Are you ever loading the amdgpu driver in your tests?

Yes, although I'm binding the `vfio-pci` driver to the AMD GPU's PCI devices via the kernel command line. (See my initial email.) My understanding is that `vfio-pci` is supposed to keep other drivers, such as `amdgpu`, from interacting with the GPU, although that's clearly not what's happening.

I've been testing with `amdgpu` included in the `MODULES` list in `/etc/mkinitcpio.conf` (which Arch Linux uses to generate the initramfs). However, I ran some more tests today (results below), this time without `i915` or `amdgpu` in the `MODULES` list. The `amdgpu` kernel module still gets loaded. (I think udev loads it automatically?)

Your comment gave me the idea to blacklist the `amdgpu` kernel module.
That does serve as a workaround on my machine – it fixes the behavior for f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)") and for the current Arch Linux prebuilt kernel (5.16.2-arch1-1). That's an acceptable workaround for my machine only because the separate GPU used by the host is an Intel integrated GPU. That workaround wouldn't work well for someone with two AMD GPUs.


# New test results

The following tests are set up the same way as in my initial email, with the following exceptions:

- I've updated libvirt to 1:8.0.0-1.

- I've removed `i915` and `amdgpu` from the `MODULES` list in
  `/etc/mkinitcpio.conf`.

For all three of these tests, `lspci` said the following:

% lspci -nnk -d 1002:6981
01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Lexa XT [Radeon PRO WX 3200] [1002:6981]
	Subsystem: Dell Device [1028:0926]
	Kernel driver in use: vfio-pci
	Kernel modules: amdgpu

% lspci -nnk -d 1002:aae0
01:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X] [1002:aae0]
	Subsystem: Dell Device [1028:0926]
	Kernel driver in use: vfio-pci
	Kernel modules: snd_hda_intel


## Version f1688bd69ec4 ("drm/amd/amdgpu:save psp ring wptr to avoid attack")

This is the commit immediately preceding the one which introduced the issue.

% sudo dmesg | grep -i amdgpu
[   15.840160] [drm] amdgpu kernel modesetting enabled.
[   15.840884] amdgpu: CRAT table not found
[   15.840885] amdgpu: Virtual CRAT table created for CPU
[   15.840893] amdgpu: Topology: Add CPU node

% lsmod | grep amdgpu
amdgpu               7450624  0
gpu_sched              49152  1 amdgpu
drm_ttm_helper         16384  1 amdgpu
ttm                    77824  2 amdgpu,drm_ttm_helper
i2c_algo_bit           16384  2 amdgpu,i915
drm_kms_helper        303104  2 amdgpu,i915
drm                   581632  11 gpu_sched,drm_kms_helper,amdgpu,drm_ttm_helper,i915,ttm

The passed-through GPU worked properly in the VM.


## Version f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)")

This is the commit which introduced the issue.

% sudo dmesg | grep -i amdgpu
[   15.319023] [drm] amdgpu kernel modesetting enabled.
[   15.329468] amdgpu: CRAT table not found
[   15.329470] amdgpu: Virtual CRAT table created for CPU
[   15.329482] amdgpu: Topology: Add CPU node

% lsmod | grep amdgpu
amdgpu               7450624  0
gpu_sched              49152  1 amdgpu
drm_ttm_helper         16384  1 amdgpu
ttm                    77824  2 amdgpu,drm_ttm_helper
i2c_algo_bit           16384  2 amdgpu,i915
drm_kms_helper        303104  2 amdgpu,i915
drm                   581632  11 gpu_sched,drm_kms_helper,amdgpu,drm_ttm_helper,i915,ttm

The passed-through GPU did not run above 501 MHz in the VM.


## Blacklisted `amdgpu`, version f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)")

For this test, I added `module_blacklist=amdgpu` to the kernel command line to blacklist the `amdgpu` module.

% sudo dmesg | grep -i amdgpu
[   14.591576] Module amdgpu is blacklisted

% lsmod | grep amdgpu

The passed-through GPU worked properly in the VM.


James

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM
  2022-01-22  5:52             ` Lazar, Lijo
@ 2022-01-22 21:11               ` James Turner
  2022-01-24 14:21                 ` Lazar, Lijo
  2022-01-24 17:04                 ` Alex Deucher
  0 siblings, 2 replies; 30+ messages in thread
From: James Turner @ 2022-01-22 21:11 UTC (permalink / raw)
  To: Lazar, Lijo
  Cc: Alex Deucher, Thorsten Leemhuis, Deucher, Alexander, regressions,
	kvm, Greg KH, Pan, Xinhui, LKML, amd-gfx, Alex Williamson,
	Koenig, Christian

Hi Lijo,

> Could you provide the pp_dpm_* values in sysfs with and without the
> patch? Also, could you try forcing PCIE to gen3 (through pp_dpm_pcie)
> if it's not in gen3 when the issue happens?

AFAICT, I can't access those values while the AMD GPU PCI devices are
bound to `vfio-pci`. However, I can at least access the link speed and
width elsewhere in sysfs. So, I gathered what information I could for
two different cases:

- With the PCI devices bound to `vfio-pci`. With this configuration, I
  can start the VM, but the `pp_dpm_*` values are not available since
  the devices are bound to `vfio-pci` instead of `amdgpu`.

- Without the PCI devices bound to `vfio-pci` (i.e. after removing the
  `vfio-pci.ids=...` kernel command line argument). With this
  configuration, I can access the `pp_dpm_*` values, since the PCI
  devices are bound to `amdgpu`. However, I cannot use the VM. If I try
  to start the VM, the display (both the external monitors attached to
  the AMD GPU and the built-in laptop display attached to the Intel
  iGPU) completely freezes.

The output shown below was identical for both the good commit:
f1688bd69ec4 ("drm/amd/amdgpu:save psp ring wptr to avoid attack")
and the commit which introduced the issue:
f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)")

Note that the PCI link speed increased to 8.0 GT/s when the GPU was
under heavy load for both versions, but the clock speeds of the GPU were
different under load. (For the good commit, it was 1295 MHz; for the bad
commit, it was 501 MHz.)


# With the PCI devices bound to `vfio-pci`

## Before starting the VM

% ls /sys/module/amdgpu/drivers/pci:amdgpu
module  bind  new_id  remove_id  uevent  unbind

% find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
/sys/bus/pci/devices/0000:01:00.0/current_link_width
8
/sys/bus/pci/devices/0000:01:00.0/current_link_speed
8.0 GT/s PCIe

## While running the VM, before placing the AMD GPU under heavy load

% find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
/sys/bus/pci/devices/0000:01:00.0/current_link_width
8
/sys/bus/pci/devices/0000:01:00.0/current_link_speed
2.5 GT/s PCIe

## While running the VM, with the AMD GPU under heavy load

% find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
/sys/bus/pci/devices/0000:01:00.0/current_link_width
8
/sys/bus/pci/devices/0000:01:00.0/current_link_speed
8.0 GT/s PCIe

## While running the VM, after stopping the heavy load on the AMD GPU

% find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
/sys/bus/pci/devices/0000:01:00.0/current_link_width
8
/sys/bus/pci/devices/0000:01:00.0/current_link_speed
2.5 GT/s PCIe

## After stopping the VM

% find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
/sys/bus/pci/devices/0000:01:00.0/current_link_width
8
/sys/bus/pci/devices/0000:01:00.0/current_link_speed
2.5 GT/s PCIe


# Without the PCI devices bound to `vfio-pci`

% ls /sys/module/amdgpu/drivers/pci:amdgpu
0000:01:00.0  module  bind  new_id  remove_id  uevent  unbind

% for f in /sys/module/amdgpu/drivers/pci:amdgpu/*/pp_dpm_*; do echo "$f"; cat "$f"; echo; done
/sys/module/amdgpu/drivers/pci:amdgpu/0000:01:00.0/pp_dpm_mclk
0: 300Mhz
1: 625Mhz
2: 1500Mhz *

/sys/module/amdgpu/drivers/pci:amdgpu/0000:01:00.0/pp_dpm_pcie
0: 2.5GT/s, x8
1: 8.0GT/s, x16 *

/sys/module/amdgpu/drivers/pci:amdgpu/0000:01:00.0/pp_dpm_sclk
0: 214Mhz
1: 501Mhz
2: 850Mhz
3: 1034Mhz
4: 1144Mhz
5: 1228Mhz
6: 1275Mhz
7: 1295Mhz *

% find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
/sys/bus/pci/devices/0000:01:00.0/current_link_width
8
/sys/bus/pci/devices/0000:01:00.0/current_link_speed
8.0 GT/s PCIe


James

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM
  2022-01-22 21:11               ` James Turner
@ 2022-01-24 14:21                 ` Lazar, Lijo
  2022-01-24 23:58                   ` James Turner
  2022-01-24 17:04                 ` Alex Deucher
  1 sibling, 1 reply; 30+ messages in thread
From: Lazar, Lijo @ 2022-01-24 14:21 UTC (permalink / raw)
  To: James Turner
  Cc: Alex Deucher, Thorsten Leemhuis, Deucher, Alexander, regressions,
	kvm, Greg KH, Pan, Xinhui, LKML, amd-gfx, Alex Williamson,
	Koenig, Christian

[Public]

I'm not able to see how this patch would affect gfx/mem DPM alone. Unless Alex has other ideas, would you be able to enable drm debug messages and share the log?

	Enabling verbose debug messages is done through the drm.debug parameter, each category being enabled by a bit:

	drm.debug=0x1 will enable CORE messages
	drm.debug=0x2 will enable DRIVER messages
	drm.debug=0x3 will enable CORE and DRIVER messages
	...
	drm.debug=0x1ff will enable all messages
	An interesting feature is that it's possible to enable verbose logging at run-time by echoing the debug value in its sysfs node:

	# echo 0xf > /sys/module/drm/parameters/debug

Thanks,
Lijo

-----Original Message-----
From: James Turner <linuxkernel.foss@dmarc-none.turner.link> 
Sent: Sunday, January 23, 2022 2:41 AM
To: Lazar, Lijo <Lijo.Lazar@amd.com>
Cc: Alex Deucher <alexdeucher@gmail.com>; Thorsten Leemhuis <regressions@leemhuis.info>; Deucher, Alexander <Alexander.Deucher@amd.com>; regressions@lists.linux.dev; kvm@vger.kernel.org; Greg KH <gregkh@linuxfoundation.org>; Pan, Xinhui <Xinhui.Pan@amd.com>; LKML <linux-kernel@vger.kernel.org>; amd-gfx@lists.freedesktop.org; Alex Williamson <alex.williamson@redhat.com>; Koenig, Christian <Christian.Koenig@amd.com>
Subject: Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM

Hi Lijo,

> Could you provide the pp_dpm_* values in sysfs with and without the 
> patch? Also, could you try forcing PCIE to gen3 (through pp_dpm_pcie) 
> if it's not in gen3 when the issue happens?

AFAICT, I can't access those values while the AMD GPU PCI devices are bound to `vfio-pci`. However, I can at least access the link speed and width elsewhere in sysfs. So, I gathered what information I could for two different cases:

- With the PCI devices bound to `vfio-pci`. With this configuration, I
  can start the VM, but the `pp_dpm_*` values are not available since
  the devices are bound to `vfio-pci` instead of `amdgpu`.

- Without the PCI devices bound to `vfio-pci` (i.e. after removing the
  `vfio-pci.ids=...` kernel command line argument). With this
  configuration, I can access the `pp_dpm_*` values, since the PCI
  devices are bound to `amdgpu`. However, I cannot use the VM. If I try
  to start the VM, the display (both the external monitors attached to
  the AMD GPU and the built-in laptop display attached to the Intel
  iGPU) completely freezes.

The output shown below was identical for both the good commit:
f1688bd69ec4 ("drm/amd/amdgpu:save psp ring wptr to avoid attack") and the commit which introduced the issue:
f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)")

Note that the PCI link speed increased to 8.0 GT/s when the GPU was under heavy load for both versions, but the clock speeds of the GPU were different under load. (For the good commit, it was 1295 MHz; for the bad commit, it was 501 MHz.)


# With the PCI devices bound to `vfio-pci`

## Before starting the VM

% ls /sys/module/amdgpu/drivers/pci:amdgpu
module  bind  new_id  remove_id  uevent  unbind

% find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
/sys/bus/pci/devices/0000:01:00.0/current_link_width
8
/sys/bus/pci/devices/0000:01:00.0/current_link_speed
8.0 GT/s PCIe

## While running the VM, before placing the AMD GPU under heavy load

% find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
/sys/bus/pci/devices/0000:01:00.0/current_link_width
8
/sys/bus/pci/devices/0000:01:00.0/current_link_speed
2.5 GT/s PCIe

## While running the VM, with the AMD GPU under heavy load

% find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
/sys/bus/pci/devices/0000:01:00.0/current_link_width
8
/sys/bus/pci/devices/0000:01:00.0/current_link_speed
8.0 GT/s PCIe

## While running the VM, after stopping the heavy load on the AMD GPU

% find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
/sys/bus/pci/devices/0000:01:00.0/current_link_width
8
/sys/bus/pci/devices/0000:01:00.0/current_link_speed
2.5 GT/s PCIe

## After stopping the VM

% find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
/sys/bus/pci/devices/0000:01:00.0/current_link_width
8
/sys/bus/pci/devices/0000:01:00.0/current_link_speed
2.5 GT/s PCIe


# Without the PCI devices bound to `vfio-pci`

% ls /sys/module/amdgpu/drivers/pci:amdgpu
0000:01:00.0  module  bind  new_id  remove_id  uevent  unbind

% for f in /sys/module/amdgpu/drivers/pci:amdgpu/*/pp_dpm_*; do echo "$f"; cat "$f"; echo; done
/sys/module/amdgpu/drivers/pci:amdgpu/0000:01:00.0/pp_dpm_mclk
0: 300Mhz
1: 625Mhz
2: 1500Mhz *

/sys/module/amdgpu/drivers/pci:amdgpu/0000:01:00.0/pp_dpm_pcie
0: 2.5GT/s, x8
1: 8.0GT/s, x16 *

/sys/module/amdgpu/drivers/pci:amdgpu/0000:01:00.0/pp_dpm_sclk
0: 214Mhz
1: 501Mhz
2: 850Mhz
3: 1034Mhz
4: 1144Mhz
5: 1228Mhz
6: 1275Mhz
7: 1295Mhz *

% find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
/sys/bus/pci/devices/0000:01:00.0/current_link_width
8
/sys/bus/pci/devices/0000:01:00.0/current_link_speed
8.0 GT/s PCIe


James

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM
  2022-01-22 21:11               ` James Turner
  2022-01-24 14:21                 ` Lazar, Lijo
@ 2022-01-24 17:04                 ` Alex Deucher
  2022-01-24 17:30                   ` Alex Williamson
  1 sibling, 1 reply; 30+ messages in thread
From: Alex Deucher @ 2022-01-24 17:04 UTC (permalink / raw)
  To: James Turner
  Cc: Lazar, Lijo, Thorsten Leemhuis, Deucher, Alexander, regressions,
	kvm, Greg KH, Pan, Xinhui, LKML, amd-gfx, Alex Williamson,
	Koenig, Christian

On Sat, Jan 22, 2022 at 4:38 PM James Turner
<linuxkernel.foss@dmarc-none.turner.link> wrote:
>
> Hi Lijo,
>
> > Could you provide the pp_dpm_* values in sysfs with and without the
> > patch? Also, could you try forcing PCIE to gen3 (through pp_dpm_pcie)
> > if it's not in gen3 when the issue happens?
>
> AFAICT, I can't access those values while the AMD GPU PCI devices are
> bound to `vfio-pci`. However, I can at least access the link speed and
> width elsewhere in sysfs. So, I gathered what information I could for
> two different cases:
>
> - With the PCI devices bound to `vfio-pci`. With this configuration, I
>   can start the VM, but the `pp_dpm_*` values are not available since
>   the devices are bound to `vfio-pci` instead of `amdgpu`.
>
> - Without the PCI devices bound to `vfio-pci` (i.e. after removing the
>   `vfio-pci.ids=...` kernel command line argument). With this
>   configuration, I can access the `pp_dpm_*` values, since the PCI
>   devices are bound to `amdgpu`. However, I cannot use the VM. If I try
>   to start the VM, the display (both the external monitors attached to
>   the AMD GPU and the built-in laptop display attached to the Intel
>   iGPU) completely freezes.
>
> The output shown below was identical for both the good commit:
> f1688bd69ec4 ("drm/amd/amdgpu:save psp ring wptr to avoid attack")
> and the commit which introduced the issue:
> f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)")
>
> Note that the PCI link speed increased to 8.0 GT/s when the GPU was
> under heavy load for both versions, but the clock speeds of the GPU were
> different under load. (For the good commit, it was 1295 MHz; for the bad
> commit, it was 501 MHz.)
>

Are the ATIF and ATCS ACPI methods available in the guest VM?  They
are required for this platform to work correctly from a power
standpoint.  One thing that f9b7f3703ff9 did was to get those ACPI
methods executed on certain platforms where they had not been
previously due to a bug in the original implementation.  If the
windows driver doesn't interact with them, it could cause performance
issues.  It may have worked by accident before because the ACPI
interfaces may not have been called, leading the windows driver to
believe this was a standalone dGPU rather than one integrated into a
power/thermal limited platform.

Alex


>
> # With the PCI devices bound to `vfio-pci`
>
> ## Before starting the VM
>
> % ls /sys/module/amdgpu/drivers/pci:amdgpu
> module  bind  new_id  remove_id  uevent  unbind
>
> % find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
> /sys/bus/pci/devices/0000:01:00.0/current_link_width
> 8
> /sys/bus/pci/devices/0000:01:00.0/current_link_speed
> 8.0 GT/s PCIe
>
> ## While running the VM, before placing the AMD GPU under heavy load
>
> % find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
> /sys/bus/pci/devices/0000:01:00.0/current_link_width
> 8
> /sys/bus/pci/devices/0000:01:00.0/current_link_speed
> 2.5 GT/s PCIe
>
> ## While running the VM, with the AMD GPU under heavy load
>
> % find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
> /sys/bus/pci/devices/0000:01:00.0/current_link_width
> 8
> /sys/bus/pci/devices/0000:01:00.0/current_link_speed
> 8.0 GT/s PCIe
>
> ## While running the VM, after stopping the heavy load on the AMD GPU
>
> % find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
> /sys/bus/pci/devices/0000:01:00.0/current_link_width
> 8
> /sys/bus/pci/devices/0000:01:00.0/current_link_speed
> 2.5 GT/s PCIe
>
> ## After stopping the VM
>
> % find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
> /sys/bus/pci/devices/0000:01:00.0/current_link_width
> 8
> /sys/bus/pci/devices/0000:01:00.0/current_link_speed
> 2.5 GT/s PCIe
>
>
> # Without the PCI devices bound to `vfio-pci`
>
> % ls /sys/module/amdgpu/drivers/pci:amdgpu
> 0000:01:00.0  module  bind  new_id  remove_id  uevent  unbind
>
> % for f in /sys/module/amdgpu/drivers/pci:amdgpu/*/pp_dpm_*; do echo "$f"; cat "$f"; echo; done
> /sys/module/amdgpu/drivers/pci:amdgpu/0000:01:00.0/pp_dpm_mclk
> 0: 300Mhz
> 1: 625Mhz
> 2: 1500Mhz *
>
> /sys/module/amdgpu/drivers/pci:amdgpu/0000:01:00.0/pp_dpm_pcie
> 0: 2.5GT/s, x8
> 1: 8.0GT/s, x16 *
>
> /sys/module/amdgpu/drivers/pci:amdgpu/0000:01:00.0/pp_dpm_sclk
> 0: 214Mhz
> 1: 501Mhz
> 2: 850Mhz
> 3: 1034Mhz
> 4: 1144Mhz
> 5: 1228Mhz
> 6: 1275Mhz
> 7: 1295Mhz *
>
> % find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
> /sys/bus/pci/devices/0000:01:00.0/current_link_width
> 8
> /sys/bus/pci/devices/0000:01:00.0/current_link_speed
> 8.0 GT/s PCIe
>
>
> James

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM
  2022-01-24 17:04                 ` Alex Deucher
@ 2022-01-24 17:30                   ` Alex Williamson
  0 siblings, 0 replies; 30+ messages in thread
From: Alex Williamson @ 2022-01-24 17:30 UTC (permalink / raw)
  To: Alex Deucher
  Cc: James Turner, Lazar, Lijo, Thorsten Leemhuis, Deucher, Alexander,
	regressions, kvm, Greg KH, Pan, Xinhui, LKML, amd-gfx, Koenig,
	Christian

On Mon, 24 Jan 2022 12:04:18 -0500
Alex Deucher <alexdeucher@gmail.com> wrote:

> On Sat, Jan 22, 2022 at 4:38 PM James Turner
> <linuxkernel.foss@dmarc-none.turner.link> wrote:
> >
> > Hi Lijo,
> >  
> > > Could you provide the pp_dpm_* values in sysfs with and without the
> > > patch? Also, could you try forcing PCIE to gen3 (through pp_dpm_pcie)
> > > if it's not in gen3 when the issue happens?  
> >
> > AFAICT, I can't access those values while the AMD GPU PCI devices are
> > bound to `vfio-pci`. However, I can at least access the link speed and
> > width elsewhere in sysfs. So, I gathered what information I could for
> > two different cases:
> >
> > - With the PCI devices bound to `vfio-pci`. With this configuration, I
> >   can start the VM, but the `pp_dpm_*` values are not available since
> >   the devices are bound to `vfio-pci` instead of `amdgpu`.
> >
> > - Without the PCI devices bound to `vfio-pci` (i.e. after removing the
> >   `vfio-pci.ids=...` kernel command line argument). With this
> >   configuration, I can access the `pp_dpm_*` values, since the PCI
> >   devices are bound to `amdgpu`. However, I cannot use the VM. If I try
> >   to start the VM, the display (both the external monitors attached to
> >   the AMD GPU and the built-in laptop display attached to the Intel
> >   iGPU) completely freezes.
> >
> > The output shown below was identical for both the good commit:
> > f1688bd69ec4 ("drm/amd/amdgpu:save psp ring wptr to avoid attack")
> > and the commit which introduced the issue:
> > f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)")
> >
> > Note that the PCI link speed increased to 8.0 GT/s when the GPU was
> > under heavy load for both versions, but the clock speeds of the GPU were
> > different under load. (For the good commit, it was 1295 MHz; for the bad
> > commit, it was 501 MHz.)
> >  
> 
> Are the ATIF and ATCS ACPI methods available in the guest VM?  They
> are required for this platform to work correctly from a power
> standpoint.  One thing that f9b7f3703ff9 did was to get those ACPI
> methods executed on certain platforms where they had not been
> previously due to a bug in the original implementation.  If the
> windows driver doesn't interact with them, it could cause performance
> issues.  It may have worked by accident before because the ACPI
> interfaces may not have been called, leading the windows driver to
> believe this was a standalone dGPU rather than one integrated into a
> power/thermal limited platform.

None of the host ACPI interfaces are available to or accessible by the
guest when assigning a PCI device.  Likewise the guest does not have
access to the parent downstream ports of the PCIe link.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM
  2022-01-24 14:21                 ` Lazar, Lijo
@ 2022-01-24 23:58                   ` James Turner
  2022-01-25 13:33                     ` Lazar, Lijo
  0 siblings, 1 reply; 30+ messages in thread
From: James Turner @ 2022-01-24 23:58 UTC (permalink / raw)
  To: Lazar, Lijo
  Cc: Alex Deucher, Thorsten Leemhuis, Deucher, Alexander, regressions,
	kvm, Greg KH, Pan, Xinhui, LKML, amd-gfx, Alex Williamson,
	Koenig, Christian

Hi Lijo,

> Not able to relate to how it affects gfx/mem DPM alone. Unless Alex
> has other ideas, would you be able to enable drm debug messages and
> share the log?

Sure, I'm happy to provide drm debug messages. Enabling everything
(0x1ff) generates *a lot* of log messages, though. Is there a smaller
subset that would be useful? Fwiw, I don't see much in the full drm logs
about the AMD GPU anyway; it's mostly about the Intel GPU.

All the messages in the system log containing "01:00" or "1002:6981" are
identical between the two versions.

I've posted below the only places in the logs which contain "amd". The
commit with the issue (f9b7f3703ff9) has a few drm log messages from
amdgpu which are not present in the logs for f1688bd69ec4.


# f1688bd69ec4 ("drm/amd/amdgpu:save psp ring wptr to avoid attack")

[drm] amdgpu kernel modesetting enabled.
vga_switcheroo: detected switching method \_SB_.PCI0.GFX0.ATPX handle
ATPX version 1, functions 0x00000033
amdgpu: CRAT table not found
amdgpu: Virtual CRAT table created for CPU
amdgpu: Topology: Add CPU node


# f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)")

[drm] amdgpu kernel modesetting enabled.
vga_switcheroo: detected switching method \_SB_.PCI0.GFX0.ATPX handle
ATPX version 1, functions 0x00000033
[drm:amdgpu_atif_pci_probe_handle.isra.0 [amdgpu]] Found ATIF handle \_SB_.PCI0.GFX0.ATIF
[drm:amdgpu_atif_pci_probe_handle.isra.0 [amdgpu]] ATIF version 1
[drm:amdgpu_acpi_detect [amdgpu]] SYSTEM_PARAMS: mask = 0x6, flags = 0x7
[drm:amdgpu_acpi_detect [amdgpu]] Notification enabled, command code = 0xd9
amdgpu: CRAT table not found
amdgpu: Virtual CRAT table created for CPU
amdgpu: Topology: Add CPU node


Other things I'm willing to try if they'd be useful:

- I could update to the 21.Q4 Radeon Pro driver in the Windows VM. (The
  21.Q3 driver is currently installed.)

- I could set up a Linux guest VM with PCI passthrough to compare to the
  Windows VM and obtain more debugging information.

- I could build a kernel with a patch applied, e.g. to disable some of
  the changes in f9b7f3703ff9.

James

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM
  2022-01-24 23:58                   ` James Turner
@ 2022-01-25 13:33                     ` Lazar, Lijo
  2022-01-30  0:25                       ` Jim Turner
  0 siblings, 1 reply; 30+ messages in thread
From: Lazar, Lijo @ 2022-01-25 13:33 UTC (permalink / raw)
  To: James Turner
  Cc: Alex Deucher, Thorsten Leemhuis, Deucher, Alexander, regressions,
	kvm, Greg KH, Pan, Xinhui, LKML, amd-gfx, Alex Williamson,
	Koenig, Christian



On 1/25/2022 5:28 AM, James Turner wrote:
> Hi Lijo,
> 
>> Not able to relate to how it affects gfx/mem DPM alone. Unless Alex
>> has other ideas, would you be able to enable drm debug messages and
>> share the log?
> 
> Sure, I'm happy to provide drm debug messages. Enabling everything
> (0x1ff) generates *a lot* of log messages, though. Is there a smaller
> subset that would be useful? Fwiw, I don't see much in the full drm logs
> about the AMD GPU anyway; it's mostly about the Intel GPU.
> 
> All the messages in the system log containing "01:00" or "1002:6981" are
> identical between the two versions.
> 
> I've posted below the only places in the logs which contain "amd". The
> commit with the issue (f9b7f3703ff9) has a few drm log messages from
> amdgpu which are not present in the logs for f1688bd69ec4.
> 
> 
> # f1688bd69ec4 ("drm/amd/amdgpu:save psp ring wptr to avoid attack")
> 
> [drm] amdgpu kernel modesetting enabled.
> vga_switcheroo: detected switching method \_SB_.PCI0.GFX0.ATPX handle
> ATPX version 1, functions 0x00000033
> amdgpu: CRAT table not found
> amdgpu: Virtual CRAT table created for CPU
> amdgpu: Topology: Add CPU node
> 
> 
> # f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)")
> 
> [drm] amdgpu kernel modesetting enabled.
> vga_switcheroo: detected switching method \_SB_.PCI0.GFX0.ATPX handle
> ATPX version 1, functions 0x00000033
> [drm:amdgpu_atif_pci_probe_handle.isra.0 [amdgpu]] Found ATIF handle \_SB_.PCI0.GFX0.ATIF
> [drm:amdgpu_atif_pci_probe_handle.isra.0 [amdgpu]] ATIF version 1
> [drm:amdgpu_acpi_detect [amdgpu]] SYSTEM_PARAMS: mask = 0x6, flags = 0x7
> [drm:amdgpu_acpi_detect [amdgpu]] Notification enabled, command code = 0xd9
> amdgpu: CRAT table not found
> amdgpu: Virtual CRAT table created for CPU
> amdgpu: Topology: Add CPU node
> 
> 

Hi James,

Specifically, I was looking for any events happening at these two places 
because of the patch-

https://elixir.bootlin.com/linux/v5.16/source/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c#L411

https://elixir.bootlin.com/linux/v5.16/source/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c#L653

The patch specifically affects these two. If, on or before starting the 
VM, these two functions are invoked on your system as a result of the 
patch, we could navigate from there and check what the side effect is.

Thanks,
Lijo

> Other things I'm willing to try if they'd be useful:
> 
> - I could update to the 21.Q4 Radeon Pro driver in the Windows VM. (The
>    21.Q3 driver is currently installed.)
> 
> - I could set up a Linux guest VM with PCI passthrough to compare to the
>    Windows VM and obtain more debugging information.
> 
> - I could build a kernel with a patch applied, e.g. to disable some of
>    the changes in f9b7f3703ff9.
> 
> James
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM
  2022-01-25 13:33                     ` Lazar, Lijo
@ 2022-01-30  0:25                       ` Jim Turner
  2022-02-15 14:56                         ` Thorsten Leemhuis
  0 siblings, 1 reply; 30+ messages in thread
From: Jim Turner @ 2022-01-30  0:25 UTC (permalink / raw)
  To: Lazar, Lijo
  Cc: Alex Deucher, Thorsten Leemhuis, Deucher, Alexander, regressions,
	kvm, Greg KH, Pan, Xinhui, LKML, amd-gfx, Alex Williamson,
	Koenig, Christian

Hi Lijo,

> Specifically, I was looking for any events happening at these two
> places because of the patch-
>
> https://elixir.bootlin.com/linux/v5.16/source/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c#L411
>
> https://elixir.bootlin.com/linux/v5.16/source/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c#L653

I searched the logs generated with all drm debug messages enabled
(drm.debug=0x1ff) for "device_class", "ATCS", "atcs", "ATIF", and
"atif", for both f1688bd69ec4 and f9b7f3703ff9. Other than the few lines
mentioning ATIF from my previous email, there weren't any matches.
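
Concretely, the search was roughly of this form, run against the
captured log for each kernel (the exact tool and log file don't matter):

  sudo dmesg | grep -iE 'device_class|atcs|atif'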

Since "device_class" didn't appear in the logs, we know that
`amdgpu_atif_handler` was not called for either version.

I also patched f9b7f3703ff9 to add the line

  DRM_DEBUG_DRIVER("Entered amdgpu_acpi_pcie_performance_request");

at the top (below the variable declarations) of
`amdgpu_acpi_pcie_performance_request`, and then tested again with all
drm debug messages enabled (0x1ff). That debug message didn't show up.

So, `amdgpu_acpi_pcie_performance_request` was not called either, at
least with f9b7f3703ff9. (I didn't try adding this patch to
f1688bd69ec4.)

Would anything else be helpful?

James

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM
  2022-01-30  0:25                       ` Jim Turner
@ 2022-02-15 14:56                         ` Thorsten Leemhuis
  2022-02-15 15:11                           ` Alex Deucher
  0 siblings, 1 reply; 30+ messages in thread
From: Thorsten Leemhuis @ 2022-02-15 14:56 UTC (permalink / raw)
  To: Jim Turner, Lazar, Lijo
  Cc: Alex Deucher, Deucher, Alexander, regressions, kvm, Greg KH, Pan,
	Xinhui, LKML, amd-gfx, Alex Williamson, Koenig, Christian

Top-posting for once, to make this easily accessible to everyone.

Nothing happened here for two weeks now afaics. Was the discussion moved
elsewhere or did it fall through the cracks?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I'm getting a lot of
reports on my table. I can only look briefly into most of them and lack
knowledge about most of the areas they concern. I thus unfortunately
will sometimes get things wrong or miss something important. I hope
that's not the case here; if you think it is, don't hesitate to tell me
in a public reply, it's in everyone's interest to set the public record
straight.

On 30.01.22 01:25, Jim Turner wrote:
> Hi Lijo,
> 
>> Specifically, I was looking for any events happening at these two
>> places because of the patch-
>>
>> https://elixir.bootlin.com/linux/v5.16/source/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c#L411
>>
>> https://elixir.bootlin.com/linux/v5.16/source/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c#L653
> 
> I searched the logs generated with all drm debug messages enabled
> (drm.debug=0x1ff) for "device_class", "ATCS", "atcs", "ATIF", and
> "atif", for both f1688bd69ec4 and f9b7f3703ff9. Other than the few lines
> mentioning ATIF from my previous email, there weren't any matches.
> 
> Since "device_class" didn't appear in the logs, we know that
> `amdgpu_atif_handler` was not called for either version.
> 
> I also patched f9b7f3703ff9 to add the line
> 
>   DRM_DEBUG_DRIVER("Entered amdgpu_acpi_pcie_performance_request");
> 
> at the top (below the variable declarations) of
> `amdgpu_acpi_pcie_performance_request`, and then tested again with all
> drm debug messages enabled (0x1ff). That debug message didn't show up.
> 
> So, `amdgpu_acpi_pcie_performance_request` was not called either, at
> least with f9b7f3703ff9. (I didn't try adding this patch to
> f1688bd69ec4.)
> 
> Would anything else be helpful?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM
  2022-02-15 14:56                         ` Thorsten Leemhuis
@ 2022-02-15 15:11                           ` Alex Deucher
  2022-02-16  0:25                             ` James D. Turner
  0 siblings, 1 reply; 30+ messages in thread
From: Alex Deucher @ 2022-02-15 15:11 UTC (permalink / raw)
  To: Thorsten Leemhuis
  Cc: Jim Turner, Lazar, Lijo, Deucher, Alexander, regressions, kvm,
	Greg KH, Pan, Xinhui, LKML, amd-gfx, Alex Williamson, Koenig,
	Christian

On Tue, Feb 15, 2022 at 9:56 AM Thorsten Leemhuis
<regressions@leemhuis.info> wrote:
>
> Top-posting for once, to make this easily accessible to everyone.
>
> Nothing happened here for two weeks now afaics. Was the discussion moved
> elsewhere or did it fall through the cracks?
>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>
> P.S.: As the Linux kernel's regression tracker I'm getting a lot of
> reports on my table. I can only look briefly into most of them and lack
> knowledge about most of the areas they concern. I thus unfortunately
> will sometimes get things wrong or miss something important. I hope
> that's not the case here; if you think it is, don't hesitate to tell me
> in a public reply, it's in everyone's interest to set the public record
> straight.
>
> On 30.01.22 01:25, Jim Turner wrote:
> > Hi Lijo,
> >
> >> Specifically, I was looking for any events happening at these two
> >> places because of the patch-
> >>
> >> https://elixir.bootlin.com/linux/v5.16/source/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c#L411
> >>
> >> https://elixir.bootlin.com/linux/v5.16/source/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c#L653
> >
> > I searched the logs generated with all drm debug messages enabled
> > (drm.debug=0x1ff) for "device_class", "ATCS", "atcs", "ATIF", and
> > "atif", for both f1688bd69ec4 and f9b7f3703ff9. Other than the few lines
> > mentioning ATIF from my previous email, there weren't any matches.
> >
> > Since "device_class" didn't appear in the logs, we know that
> > `amdgpu_atif_handler` was not called for either version.
> >
> > I also patched f9b7f3703ff9 to add the line
> >
> >   DRM_DEBUG_DRIVER("Entered amdgpu_acpi_pcie_performance_request");
> >
> > at the top (below the variable declarations) of
> > `amdgpu_acpi_pcie_performance_request`, and then tested again with all
> > drm debug messages enabled (0x1ff). That debug message didn't show up.
> >
> > So, `amdgpu_acpi_pcie_performance_request` was not called either, at
> > least with f9b7f3703ff9. (I didn't try adding this patch to
> > f1688bd69ec4.)
> >
> > Would anything else be helpful?

I guess just querying the ATIF method does something that negatively
influences the windows driver in the guest.  Perhaps the platform
thinks the driver has been loaded since the method has been called so
it enables certain behaviors that require ATIF interaction that never
happen because the ACPI methods are not available in the guest.  I
don't really have a good workaround other than blacklisting the driver
since on bare metal the driver needs to use this interface for
platform interactions.
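
(For the passthrough use case, that host-side blacklisting boils down to
something like

  module_blacklist=amdgpu

on the host kernel command line, as James did above, or an equivalent
modprobe.d blacklist entry.)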

Alex

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM
  2022-02-15 15:11                           ` Alex Deucher
@ 2022-02-16  0:25                             ` James D. Turner
  2022-02-16 16:37                               ` Alex Deucher
  0 siblings, 1 reply; 30+ messages in thread
From: James D. Turner @ 2022-02-16  0:25 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Thorsten Leemhuis, Lazar, Lijo, Deucher, Alexander, regressions,
	kvm, Greg KH, Pan, Xinhui, LKML, amd-gfx, Alex Williamson,
	Koenig, Christian

Hi Alex,

> I guess just querying the ATIF method does something that negatively
> influences the windows driver in the guest. Perhaps the platform
> thinks the driver has been loaded since the method has been called so
> it enables certain behaviors that require ATIF interaction that never
> happen because the ACPI methods are not available in the guest.

Do you mean the `amdgpu_atif_pci_probe_handle` function? If it would be
helpful, I could try disabling that function and testing again.

> I don't really have a good workaround other than blacklisting the
> driver since on bare metal the driver needs to use this interface for
> platform interactions.

I'm not familiar with ATIF, but should `amdgpu_atif_pci_probe_handle`
really be called for PCI devices which are bound to vfio-pci? I'd expect
amdgpu to ignore such devices.

As I understand it, starting with
f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)"),
the `amdgpu_acpi_detect` function loops over all PCI devices in the
`PCI_CLASS_DISPLAY_VGA` and `PCI_CLASS_DISPLAY_OTHER` classes to find
the ATIF and ATCS handles. Maybe skipping over any PCI devices bound to
vfio-pci would fix the issue? On a related note, shouldn't it also skip
over any PCI devices with non-AMD vendor IDs?

Regards,
James

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM
  2022-02-16  0:25                             ` James D. Turner
@ 2022-02-16 16:37                               ` Alex Deucher
  2022-03-06 15:48                                 ` Thorsten Leemhuis
  0 siblings, 1 reply; 30+ messages in thread
From: Alex Deucher @ 2022-02-16 16:37 UTC (permalink / raw)
  To: James D. Turner
  Cc: Thorsten Leemhuis, Lazar, Lijo, Deucher, Alexander, regressions,
	kvm, Greg KH, Pan, Xinhui, LKML, amd-gfx, Alex Williamson,
	Koenig, Christian

On Tue, Feb 15, 2022 at 9:35 PM James D. Turner
<linuxkernel.foss@dmarc-none.turner.link> wrote:
>
> Hi Alex,
>
> > I guess just querying the ATIF method does something that negatively
> > influences the windows driver in the guest. Perhaps the platform
> > thinks the driver has been loaded since the method has been called so
> > it enables certain behaviors that require ATIF interaction that never
> > happen because the ACPI methods are not available in the guest.
>
> Do you mean the `amdgpu_atif_pci_probe_handle` function? If it would be
> helpful, I could try disabling that function and testing again.

Correct.

>
> > I don't really have a good workaround other than blacklisting the
> > driver since on bare metal the driver needs to use this interface for
> > platform interactions.
>
> I'm not familiar with ATIF, but should `amdgpu_atif_pci_probe_handle`
> really be called for PCI devices which are bound to vfio-pci? I'd expect
> amdgpu to ignore such devices.
>
> As I understand it, starting with
> f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)"),
> the `amdgpu_acpi_detect` function loops over all PCI devices in the
> `PCI_CLASS_DISPLAY_VGA` and `PCI_CLASS_DISPLAY_OTHER` classes to find
> the ATIF and ATCS handles. Maybe skipping over any PCI devices bound to
> vfio-pci would fix the issue? On a related note, shouldn't it also skip
> over any PCI devices with non-AMD vendor IDs?

The ACPI methods are global.  There's only one instance of each per
system, and they are relevant to all GPUs on the platform.  That's why
they are a global resource in the driver.  They can be hung off of the
dGPU or APU ACPI namespace, depending on the platform, which is why we
check all of the display devices.  Skipping them would prevent them
from being available if you later bound the amdgpu driver to the GPU
device(s), I think.

Alex

>
> Regards,
> James

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM
  2022-02-16 16:37                               ` Alex Deucher
@ 2022-03-06 15:48                                 ` Thorsten Leemhuis
  2022-03-07  2:12                                   ` James Turner
  0 siblings, 1 reply; 30+ messages in thread
From: Thorsten Leemhuis @ 2022-03-06 15:48 UTC (permalink / raw)
  To: Alex Deucher, James D. Turner
  Cc: Lazar, Lijo, Deucher, Alexander, regressions, kvm, Greg KH, Pan,
	Xinhui, LKML, amd-gfx, Alex Williamson, Koenig, Christian

Hi, this is your Linux kernel regression tracker again. Top-posting once
more, to make this easily accessible to everyone.

What's the status of this? It looks stuck, or did the discussion
continue somewhere else? James, it sounded like you wanted to test
something, did you give it a try? Or is there some reason why I should
stop tracking this regression?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I'm getting a lot of
reports on my table. I can only look briefly into most of them and lack
knowledge about most of the areas they concern. I thus unfortunately
will sometimes get things wrong or miss something important. I hope
that's not the case here; if you think it is, don't hesitate to tell me
in a public reply, it's in everyone's interest to set the public record
straight.

#regzbot poke

On 16.02.22 17:37, Alex Deucher wrote:
> On Tue, Feb 15, 2022 at 9:35 PM James D. Turner
> <linuxkernel.foss@dmarc-none.turner.link> wrote:
>>
>> Hi Alex,
>>
>>> I guess just querying the ATIF method does something that negatively
>>> influences the windows driver in the guest. Perhaps the platform
>>> thinks the driver has been loaded since the method has been called so
>>> it enables certain behaviors that require ATIF interaction that never
>>> happen because the ACPI methods are not available in the guest.
>>
>> Do you mean the `amdgpu_atif_pci_probe_handle` function? If it would be
>> helpful, I could try disabling that function and testing again.
> 
> Correct.
> 
>>
>>> I don't really have a good workaround other than blacklisting the
>>> driver since on bare metal the driver needs to use this interface for
>>> platform interactions.
>>
>> I'm not familiar with ATIF, but should `amdgpu_atif_pci_probe_handle`
>> really be called for PCI devices which are bound to vfio-pci? I'd expect
>> amdgpu to ignore such devices.
>>
>> As I understand it, starting with
>> f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)"),
>> the `amdgpu_acpi_detect` function loops over all PCI devices in the
>> `PCI_CLASS_DISPLAY_VGA` and `PCI_CLASS_DISPLAY_OTHER` classes to find
>> the ATIF and ATCS handles. Maybe skipping over any PCI devices bound to
>> vfio-pci would fix the issue? On a related note, shouldn't it also skip
>> over any PCI devices with non-AMD vendor IDs?
> 
> The ACPI methods are global.  There's only one instance of each per
> system, and they are relevant to all GPUs on the platform.  That's why
> they are a global resource in the driver.  They can be hung off of the
> dGPU or APU ACPI namespace, depending on the platform, which is why we
> check all of the display devices.  Skipping them would prevent them
> from being available if you later bound the amdgpu driver to the GPU
> device(s), I think.
> 
> Alex
> 
>>
>> Regards,
>> James
> 
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM
  2022-03-06 15:48                                 ` Thorsten Leemhuis
@ 2022-03-07  2:12                                   ` James Turner
  2022-03-13 18:33                                     ` James Turner
  0 siblings, 1 reply; 30+ messages in thread
From: James Turner @ 2022-03-07  2:12 UTC (permalink / raw)
  To: Thorsten Leemhuis
  Cc: Alex Deucher, Lazar, Lijo, Deucher, Alexander, regressions, kvm,
	Greg KH, Pan, Xinhui, LKML, amd-gfx, Alex Williamson, Koenig,
	Christian

Hi Thorsten,

My understanding at this point is that the root problem is probably not
in the Linux kernel but rather something else (e.g. the machine firmware
or AMD Windows driver) and that the change in
f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)")
simply exposed the underlying problem.

This week, I'll double-check that this is the case by disabling the
`amdgpu_atif_pci_probe_handle` function and testing again. I'll post the
results here.

James

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM
  2022-03-07  2:12                                   ` James Turner
@ 2022-03-13 18:33                                     ` James Turner
  2022-03-17 12:54                                       ` Thorsten Leemhuis
  0 siblings, 1 reply; 30+ messages in thread
From: James Turner @ 2022-03-13 18:33 UTC (permalink / raw)
  To: Thorsten Leemhuis
  Cc: Alex Deucher, Lazar, Lijo, Deucher, Alexander, regressions, kvm,
	Greg KH, Pan, Xinhui, LKML, amd-gfx, Alex Williamson, Koenig,
	Christian

Hi all,

I've confirmed that changing the `amdgpu_atif_pci_probe_handle` function
to do nothing does make the GPU work properly in the VM. I started with
f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)")
and changed the function implementation to:

static bool amdgpu_atif_pci_probe_handle(struct pci_dev *pdev)
{
	DRM_DEBUG_DRIVER("Entered amdgpu_atif_pci_probe_handle");
	return false;
}

With that change, the GPU works properly in the VM.

I'm not sure where to go from here. This issue isn't much of a concern
for me anymore, since blacklisting `amdgpu` works for my machine. At
this point, my understanding is that the root problem needs to be fixed
in AMD's Windows GPU driver or Dell's firmware, not the Linux kernel. If
any of the AMD developers on this thread would like to forward it to the
AMD Windows driver team, I'd be happy to work with AMD to fix the issue
properly.

I've added a mention of this issue and workaround to the [Arch Wiki][1]
to make it more discoverable. If anyone has a better place to document
this, please let me know.

Thank you all for your help on this.

[1]: https://wiki.archlinux.org/title/PCI_passthrough_via_OVMF#Too-low_frequency_limit_for_AMD_GPU_passed-through_to_virtual_machine

James

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM
  2022-03-13 18:33                                     ` James Turner
@ 2022-03-17 12:54                                       ` Thorsten Leemhuis
  2022-03-18  5:43                                         ` Paul Menzel
  0 siblings, 1 reply; 30+ messages in thread
From: Thorsten Leemhuis @ 2022-03-17 12:54 UTC (permalink / raw)
  To: James Turner
  Cc: Alex Deucher, Lazar, Lijo, Deucher, Alexander, regressions, kvm,
	Greg KH, Pan, Xinhui, LKML, amd-gfx, Alex Williamson, Koenig,
	Christian

On 13.03.22 19:33, James Turner wrote:
>
>> My understanding at this point is that the root problem is probably
>> not in the Linux kernel but rather something else (e.g. the machine
>> firmware or AMD Windows driver) and that the change in f9b7f3703ff9
>> ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)") simply
>> exposed the underlying problem.

FWIW: that in the end is irrelevant when it comes to the Linux kernel's
'no regressions' rule. For details see:

https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/Documentation/admin-guide/reporting-regressions.rst
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/Documentation/process/handling-regressions.rst

That being said: sometimes for the greater good it's better to not
insist on that. And I guess that might be the case here.

> I'm not sure where to go from here. This issue isn't much of a concern
> for me anymore, since blacklisting `amdgpu` works for my machine. At
> this point, my understanding is that the root problem needs to be fixed
> in AMD's Windows GPU driver or Dell's firmware, not the Linux kernel. If
> any of the AMD developers on this thread would like to forward it to the
> AMD Windows driver team, I'd be happy to work with AMD to fix the issue
> properly.

In that case I'll drop it from the list of regressions, unless what I
wrote above makes you change your mind.

#regzbot invalid: firmware issue exposed by kernel change, user seems to
be happy with a workaround

Thx everyone who participated in handling this.

Ciao, Thorsten


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM
  2022-03-17 12:54                                       ` Thorsten Leemhuis
@ 2022-03-18  5:43                                         ` Paul Menzel
  2022-03-18  7:01                                           ` Thorsten Leemhuis
  0 siblings, 1 reply; 30+ messages in thread
From: Paul Menzel @ 2022-03-18  5:43 UTC (permalink / raw)
  To: Thorsten Leemhuis, James Turner
  Cc: Xinhui Pan, regressions, kvm, Greg KH, Lijo Lazar, LKML, amd-gfx,
	Alexander Deucher, Alex Williamson, Alex Deucher,
	Christian König

Dear Thorsten, dear James,


Am 17.03.22 um 13:54 schrieb Thorsten Leemhuis:
> On 13.03.22 19:33, James Turner wrote:
>>
>>> My understanding at this point is that the root problem is probably
>>> not in the Linux kernel but rather something else (e.g. the machine
>>> firmware or AMD Windows driver) and that the change in f9b7f3703ff9
>>> ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)") simply
>>> exposed the underlying problem.
> 
> FWIW: that in the end is irrelevant when it comes to the Linux kernel's
> 'no regressions' rule. For details see:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/Documentation/admin-guide/reporting-regressions.rst
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/Documentation/process/handling-regressions.rst
> 
> That being said: sometimes for the greater good it's better to not
> insist on that. And I guess that might be the case here.

But who decides that? Running stuff in a virtual machine is not that 
uncommon.

Should the commit be reverted, and re-added with a more elaborate commit 
message documenting the downsides?

Could the user be notified somehow? Can PCI passthrough and a loaded 
amdgpu driver be detected, so Linux warns about this?

Also, should this be documented in the code?

>> I'm not sure where to go from here. This issue isn't much of a concern
>> for me anymore, since blacklisting `amdgpu` works for my machine. At
>> this point, my understanding is that the root problem needs to be fixed
>> in AMD's Windows GPU driver or Dell's firmware, not the Linux kernel. If
>> any of the AMD developers on this thread would like to forward it to the
>> AMD Windows driver team, I'd be happy to work with AMD to fix the issue
>> properly.

(Thorsten, your mailer mangled the quote somehow – I reformatted it –, 
which is too bad, as this message is shown when clicking on the link 
*marked invalid* in the regzbot Web page [1]. (The link is a very nice 
feature.)

> In that case I'll drop it from the list of regressions, unless what I
> wrote above makes you change your mind.
> 
> #regzbot invalid: firmware issue exposed by kernel change, user seems to
> be happy with a workaround
> 
> Thx everyone who participated in handling this.

Should the regression issue be re-opened until the questions above are 
answered, and a more user friendly solution is found?


Kind regards,

Paul


[1]: https://linux-regtracking.leemhuis.info/regzbot/resolved/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM
  2022-03-18  5:43                                         ` Paul Menzel
@ 2022-03-18  7:01                                           ` Thorsten Leemhuis
  2022-03-18 14:46                                             ` Alex Williamson
  0 siblings, 1 reply; 30+ messages in thread
From: Thorsten Leemhuis @ 2022-03-18  7:01 UTC (permalink / raw)
  To: Paul Menzel, James Turner
  Cc: Xinhui Pan, regressions, kvm, Greg KH, Lijo Lazar, LKML, amd-gfx,
	Alexander Deucher, Alex Williamson, Alex Deucher,
	Christian König

On 18.03.22 06:43, Paul Menzel wrote:
>
> Am 17.03.22 um 13:54 schrieb Thorsten Leemhuis:
>> On 13.03.22 19:33, James Turner wrote:
>>>
>>>> My understanding at this point is that the root problem is probably
>>>> not in the Linux kernel but rather something else (e.g. the machine
>>>> firmware or AMD Windows driver) and that the change in f9b7f3703ff9
>>>> ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)") simply
>>>> exposed the underlying problem.
>>
>> FWIW: that in the end is irrelevant when it comes to the Linux kernel's
>> 'no regressions' rule. For details see:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/Documentation/admin-guide/reporting-regressions.rst
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/Documentation/process/handling-regressions.rst
>>
>>
>> That being said: sometimes for the greater good it's better to not
>> insist on that. And I guess that might be the case here.
> 
> But who decides that?

In the end afaics: Linus. But he can't watch each and every discussion,
so it partly falls down to people discussing a regression, as they can
always decide to get him involved in case they are unhappy with how a
regression is handled. That obviously includes me in this case. I simply
use my best judgement in such situations. I'm still undecided if that
path is appropriate here, that's why I wrote above to see what James
would say, as he afaics was the only one that reported this regression.

> Running stuff in a virtual machine is not that uncommon.

No, it's about passing through a GPU to a VM, which is a lot less common
-- and afaics an area where blacklisting GPUs on the host to pass them
through is not uncommon (a quick internet search confirmed that, but I
might be wrong there).

> Should the commit be reverted, and re-added with a more elaborate commit
> message documenting the downsides?
> 
> Could the user be notified somehow? Can PCI passthrough and a loaded
> amdgpu driver be detected, so Linux warns about this?
>
> Also, should this be documented in the code?
>
>>> I'm not sure where to go from here. This issue isn't much of a concern
>>> for me anymore, since blacklisting `amdgpu` works for my machine. At
>>> this point, my understanding is that the root problem needs to be fixed
>>> in AMD's Windows GPU driver or Dell's firmware, not the Linux kernel. If
>>> any of the AMD developers on this thread would like to forward it to the
>>> AMD Windows driver team, I'd be happy to work with AMD to fix the issue
>>> properly.
> 
> (Thorsten, your mailer mangled the quote somehow 

Kinda, but IIRC it was more me doing something stupid with my mailer.
Sorry about that.

> – I reformatted it –,

thx!

> which is too bad, as this message is shown when clicking on the link
> *marked invalid* in the regzbot Web page [1]. (The link is a very nice
> feature.)
> 
>> In that case I'll drop it from the list of regressions, unless what I
>> wrote above makes you change your mind.
>>
>> #regzbot invalid: firmware issue exposed by kernel change, user seems to
>> be happy with a workaround
>>
>> Thx everyone who participated in handling this.
> 
> Should the regression issue be re-opened until the questions above are
> answered, and a more user friendly solution is found?

For now I'll just continue to watch this discussion and see what
happens.

> [1]: https://linux-regtracking.leemhuis.info/regzbot/resolved/

Ciao, Thorsten

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM
  2022-03-18  7:01                                           ` Thorsten Leemhuis
@ 2022-03-18 14:46                                             ` Alex Williamson
  2022-03-18 15:06                                               ` Alex Deucher
  0 siblings, 1 reply; 30+ messages in thread
From: Alex Williamson @ 2022-03-18 14:46 UTC (permalink / raw)
  To: Thorsten Leemhuis
  Cc: Paul Menzel, James Turner, Xinhui Pan, regressions, kvm, Greg KH,
	Lijo Lazar, LKML, amd-gfx, Alexander Deucher, Alex Deucher,
	Christian König

On Fri, 18 Mar 2022 08:01:31 +0100
Thorsten Leemhuis <regressions@leemhuis.info> wrote:

> On 18.03.22 06:43, Paul Menzel wrote:
> >
> > Am 17.03.22 um 13:54 schrieb Thorsten Leemhuis:  
> >> On 13.03.22 19:33, James Turner wrote:  
> >>>  
> >>>> My understanding at this point is that the root problem is probably
> >>>> not in the Linux kernel but rather something else (e.g. the machine
> >>>> firmware or AMD Windows driver) and that the change in f9b7f3703ff9
> >>>> ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)") simply
> >>>> exposed the underlying problem.  
> >>
> >> FWIW: that in the end is irrelevant when it comes to the Linux kernel's
> >> 'no regressions' rule. For details see:
> >>
> >> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/Documentation/admin-guide/reporting-regressions.rst
> >>
> >> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/Documentation/process/handling-regressions.rst
> >>
> >>
> >> That being said: sometimes for the greater good it's better to not
> >> insist on that. And I guess that might be the case here.  
> > 
> > But who decides that?  
> 
> In the end afaics: Linus. But he can't watch each and every discussion,
> so it partly falls down to people discussing a regression, as they can
> always decide to get him involved in case they are unhappy with how a
> regression is handled. That obviously includes me in this case. I simply
> use my best judgement in such situations. I'm still undecided if that
> path is appropriate here, that's why I wrote above to see what James
> would say, as he afaics was the only one that reported this regression.
> 
> > Running stuff in a virtual machine is not that uncommon.  
> 
> No, it's about passing through a GPU to a VM, which is a lot less common
> -- and afaics an area where blacklisting GPUs on the host to pass them
> through is not uncommon (a quick internet search confirmed that, but I
> might be wrong there).

Right, interference from host drivers and pre-boot environments is
always a concern with GPU assignment in particular.  AMD GPUs have a
long history of poor behavior relative to things like PCI secondary bus
resets which we use to try to get devices to clean, reusable states for
assignment.  Here a device is being bound to a host driver that
initiates some sort of power control, unbound from that driver and
exposed to new drivers far beyond the scope of the kernel's regression
policy.  Perhaps it's possible to undo such power control when
unbinding the device, but it's not necessarily a given that such a
thing is possible for this device without a cold reset.

IMO, it's not fair to restrict the kernel from such advancements.  If
the use case is within a VM, don't bind host drivers.  It's difficult
to make promises when dynamically switching between host and userspace
drivers for devices that don't have functional reset mechanisms.
Thanks,

Alex


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM
  2022-03-18 14:46                                             ` Alex Williamson
@ 2022-03-18 15:06                                               ` Alex Deucher
  2022-03-18 15:25                                                 ` Alex Williamson
  0 siblings, 1 reply; 30+ messages in thread
From: Alex Deucher @ 2022-03-18 15:06 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Thorsten Leemhuis, Paul Menzel, James Turner, Xinhui Pan,
	regressions, kvm, Greg KH, Lijo Lazar, LKML, amd-gfx list,
	Alexander Deucher, Christian König

On Fri, Mar 18, 2022 at 10:46 AM Alex Williamson
<alex.williamson@redhat.com> wrote:
>
> On Fri, 18 Mar 2022 08:01:31 +0100
> Thorsten Leemhuis <regressions@leemhuis.info> wrote:
>
> > On 18.03.22 06:43, Paul Menzel wrote:
> > >
> > > Am 17.03.22 um 13:54 schrieb Thorsten Leemhuis:
> > >> On 13.03.22 19:33, James Turner wrote:
> > >>>
> > >>>> My understanding at this point is that the root problem is probably
> > >>>> not in the Linux kernel but rather something else (e.g. the machine
> > >>>> firmware or AMD Windows driver) and that the change in f9b7f3703ff9
> > >>>> ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)") simply
> > >>>> exposed the underlying problem.
> > >>
> > >> FWIW: that in the end is irrelevant when it comes to the Linux kernel's
> > >> 'no regressions' rule. For details see:
> > >>
> > >> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/Documentation/admin-guide/reporting-regressions.rst
> > >>
> > >> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/Documentation/process/handling-regressions.rst
> > >>
> > >>
> > >> That being said: sometimes for the greater good it's better to not
> > >> insist on that. And I guess that might be the case here.
> > >
> > > But who decides that?
> >
> > In the end afaics: Linus. But he can't watch each and every discussion,
> > so it partly falls down to people discussing a regression, as they can
> > always decide to get him involved in case they are unhappy with how a
> > regression is handled. That obviously includes me in this case. I simply
> > use my best judgement in such situations. I'm still undecided if that
> > path is appropriate here, that's why I wrote above to see what James
> > would say, as he afaics was the only one that reported this regression.
> >
> > > Running stuff in a virtual machine is not that uncommon.
> >
> > No, it's about passing through a GPU to a VM, which is a lot less common
> > -- and afaics an area where blacklisting GPUs on the host to pass them
> > through is not uncommon (a quick internet search confirmed that, but I
> > might be wrong there).
>
> Right, interference from host drivers and pre-boot environments is
> always a concern with GPU assignment in particular.  AMD GPUs have a
> long history of poor behavior relative to things like PCI secondary bus
> resets which we use to try to get devices to clean, reusable states for
> assignment.  Here a device is being bound to a host driver that
> initiates some sort of power control, unbound from that driver and
> exposed to new drivers far beyond the scope of the kernel's regression
> policy.  Perhaps it's possible to undo such power control when
> unbinding the device, but it's not necessarily a given that such a
> thing is possible for this device without a cold reset.
>
> IMO, it's not fair to restrict the kernel from such advancements.  If
> the use case is within a VM, don't bind host drivers.  It's difficult
> to make promises when dynamically switching between host and userspace
> drivers for devices that don't have functional reset mechanisms.
> Thanks,

Additionally, operating the isolated device in a VM in a constrained
environment like a laptop may have other adverse side effects.  The
driver in the guest would ideally know that this is a laptop and needs
to properly interact with ACPI to handle power management on the
device.  If that is not the case, the driver in the guest may end up
running the device out of spec with what the platform supports.  It's
also likely to break suspend and resume, especially on systems which
use S0ix since the firmware will generally only turn off certain power
rails if all of the devices on the rails have been put into the proper
state.  That state may vary depending on the platform requirements.

Alex

>
> Alex
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM
  2022-03-18 15:06                                               ` Alex Deucher
@ 2022-03-18 15:25                                                 ` Alex Williamson
  2022-03-21  1:26                                                   ` James Turner
  0 siblings, 1 reply; 30+ messages in thread
From: Alex Williamson @ 2022-03-18 15:25 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Thorsten Leemhuis, Paul Menzel, James Turner, Xinhui Pan,
	regressions, kvm, Greg KH, Lijo Lazar, LKML, amd-gfx list,
	Alexander Deucher, Christian König

On Fri, 18 Mar 2022 11:06:00 -0400
Alex Deucher <alexdeucher@gmail.com> wrote:

> On Fri, Mar 18, 2022 at 10:46 AM Alex Williamson
> <alex.williamson@redhat.com> wrote:
> >
> > On Fri, 18 Mar 2022 08:01:31 +0100
> > Thorsten Leemhuis <regressions@leemhuis.info> wrote:
> >  
> > > On 18.03.22 06:43, Paul Menzel wrote:  
> > > >
> > > > Am 17.03.22 um 13:54 schrieb Thorsten Leemhuis:  
> > > >> On 13.03.22 19:33, James Turner wrote:  
> > > >>>  
> > > >>>> My understanding at this point is that the root problem is probably
> > > >>>> not in the Linux kernel but rather something else (e.g. the machine
> > > >>>> firmware or AMD Windows driver) and that the change in f9b7f3703ff9
> > > >>>> ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)") simply
> > > >>>> exposed the underlying problem.  
> > > >>
> > > >> FWIW: that in the end is irrelevant when it comes to the Linux kernel's
> > > >> 'no regressions' rule. For details see:
> > > >>
> > > >> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/Documentation/admin-guide/reporting-regressions.rst
> > > >>
> > > >> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/Documentation/process/handling-regressions.rst
> > > >>
> > > >>
> > > >> That being said: sometimes for the greater good it's better to not
> > > >> insist on that. And I guess that might be the case here.  
> > > >
> > > > But who decides that?  
> > >
> > > In the end afaics: Linus. But he can't watch each and every discussion,
> > > so it partly falls down to people discussing a regression, as they can
> > > always decide to get him involved in case they are unhappy with how a
> > > regression is handled. That obviously includes me in this case. I simply
> > > use my best judgement in such situations. I'm still undecided if that
> > > path is appropriate here, that's why I wrote above to see what James
> > > would say, as he afaics was the only one that reported this regression.
> > >  
> > > > Running stuff in a virtual machine is not that uncommon.  
> > >
> > > No, it's about passing through a GPU to a VM, which is a lot less common
> > > -- and afaics an area where blacklisting GPUs on the host to pass them
> > > through is not uncommon (a quick internet search confirmed that, but I
> > > might be wrong there).  
> >
> > Right, interference from host drivers and pre-boot environments is
> > always a concern with GPU assignment in particular.  AMD GPUs have a
> > long history of poor behavior relative to things like PCI secondary bus
> > resets which we use to try to get devices to clean, reusable states for
> > assignment.  Here a device is being bound to a host driver that
> > initiates some sort of power control, unbound from that driver and
> > exposed to new drivers far beyond the scope of the kernel's regression
> > policy.  Perhaps it's possible to undo such power control when
> > unbinding the device, but it's not necessarily a given that such a
> > thing is possible for this device without a cold reset.
> >
> > IMO, it's not fair to restrict the kernel from such advancements.  If
> > the use case is within a VM, don't bind host drivers.  It's difficult
> > to make promises when dynamically switching between host and userspace
> > drivers for devices that don't have functional reset mechanisms.
> > Thanks,  
> 
> Additionally, operating the isolated device in a VM in a constrained
> environment like a laptop may have other adverse side effects.  The
> driver in the guest would ideally know that this is a laptop and needs
> to properly interact with ACPI to handle power management on the
> device.  If that is not the case, the driver in the guest may end up
> running the device out of spec with what the platform supports.  It's
> also likely to break suspend and resume, especially on systems which
> use S0ix since the firmware will generally only turn off certain power
> rails if all of the devices on the rails have been put into the proper
> state.  That state may vary depending on the platform requirements.

Good point, devices with platform dependencies to manage thermal
budgets, etc. should be considered "use at your own risk" relative to
device assignment currently.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM
  2022-03-18 15:25                                                 ` Alex Williamson
@ 2022-03-21  1:26                                                   ` James Turner
  0 siblings, 0 replies; 30+ messages in thread
From: James Turner @ 2022-03-21  1:26 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alex Deucher, Thorsten Leemhuis, Paul Menzel, Xinhui Pan,
	regressions, kvm, Greg KH, Lijo Lazar, LKML, amd-gfx list,
	Alexander Deucher, Christian König

>>> Right, interference from host drivers and pre-boot environments is
>>> always a concern with GPU assignment in particular. AMD GPUs have a
>>> long history of poor behavior relative to things like PCI secondary
>>> bus resets which we use to try to get devices to clean, reusable
>>> states for assignment. Here a device is being bound to a host driver
>>> that initiates some sort of power control, unbound from that driver
>>> and exposed to new drivers far beyond the scope of the kernel's
>>> regression policy. Perhaps it's possible to undo such power control
>>> when unbinding the device, but it's not necessarily a given that
>>> such a thing is possible for this device without a cold reset.
>>>
>>> IMO, it's not fair to restrict the kernel from such advancements. If
>>> the use case is within a VM, don't bind host drivers. It's difficult
>>> to make promises when dynamically switching between host and
>>> userspace drivers for devices that don't have functional reset
>>> mechanisms.

To clarify, the GPU is never bound to the `amdgpu` driver on the host.
I'm binding it to `vfio-pci` on the host at boot, specifically to avoid
issues with dynamic rebinding. To do this, I'm passing
`vfio-pci.ids=1002:6981,1002:aae0` on the kernel command line, and I've
confirmed that this option is working:

% lspci -nnk -d 1002:6981
01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Lexa XT [Radeon PRO WX 3200] [1002:6981]
	Subsystem: Dell Device [1028:0926]
	Kernel driver in use: vfio-pci
	Kernel modules: amdgpu

% lspci -nnk -d 1002:aae0
01:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X] [1002:aae0]
	Subsystem: Dell Device [1028:0926]
	Kernel driver in use: vfio-pci
	Kernel modules: snd_hda_intel
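
For reference, the host-side setup boils down to roughly the following
(a sketch, not my exact files; the GRUB and modprobe.d paths are just
the usual locations, and the softdep lines are one common way to make
sure vfio-pci claims the devices before the normal drivers):

  # /etc/default/grub -- existing options elided
  GRUB_CMDLINE_LINUX_DEFAULT="... vfio-pci.ids=1002:6981,1002:aae0"

  # /etc/modprobe.d/vfio.conf -- ensure vfio-pci loads first
  softdep amdgpu pre: vfio-pci
  softdep snd_hda_intel pre: vfio-pci

  # After regenerating grub.cfg and rebooting, `lspci -nnk` should show
  # "Kernel driver in use: vfio-pci" for both functions, as above.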

Starting with
f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)")
this is insufficient for the GPU to work properly in the VM, since the
`amdgpu` module is calling global ACPI methods which affect the GPU even
though it's not bound to the `amdgpu` driver.

>> Additionally, operating the isolated device in a VM in a constrained
>> environment like a laptop may have other adverse side effects.  The
>> driver in the guest would ideally know that this is a laptop and needs
>> to properly interact with ACPI to handle power management on the
>> device.  If that is not the case, the driver in the guest may end up
>> running the device out of spec with what the platform supports.  It's
>> also likely to break suspend and resume, especially on systems which
>> use S0ix since the firmware will generally only turn off certain power
>> rails if all of the devices on the rails have been put into the proper
>> state.  That state may vary depending on the platform requirements.

Fwiw, the guest Windows AMD driver can tell that it's a mobile GPU, and
as a result, the driver GUI locks various performance parameters to the
defaults. The cooling system and power supply seem to work without
issues. As the load on the GPU increases, the fan speed increases. The
GPU stays below the critical temperature with plenty of margin, even at
100% load. The voltage reported by the GPU adjusts with the load, and I
haven't experienced any glitches which would suggest that the GPU is not
getting enough power or something. I haven't tried suspend/resume.

What are the differences between a laptop and desktop, aside from the
size of the cooling system? Could the issue reported here affect desktop
systems, too?

As far as what to do for this issue: Personally, I don't mind
blacklisting `amdgpu` on my machine. My primary concerns are:

1. Other users may experience this issue and have trouble figuring out
   what's happening, or they may not even realize that they're
   experiencing significantly-lower-than-expected performance.

2. It's possible that this issue affects some machines which use an AMD
   GPU for the host and a second AMD GPU for the guest. For those
   machines, blacklisting `amdgpu` would not be an option, since that
   would disable the AMD GPU for the host.

I've tried to help with concern 1 by mentioning this issue on the Arch
Linux Wiki [1]. Another thing that would help is to print a warning
message to the kernel ring buffer when an AMD GPU is bound to `vfio-pci`
and the `amdgpu` module is loaded. (It would say something like,
"Although the <GPU_NAME> device is bound to `vfio-pci`, loading the
`amdgpu` module may still affect it via ACPI. Consider blacklisting
`amdgpu` if the GPU does not behave as expected.")
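
Until something like that exists in the kernel, a rough host-side check
along these lines could at least surface the situation (just a sketch,
not part of any existing tool; it assumes the usual lsmod and
`lspci -Dnnk` output formats):

  #!/bin/sh
  # Warn if any AMD GPU is bound to vfio-pci while amdgpu is loaded, since
  # amdgpu's global ACPI calls may still affect the passed-through device.
  if lsmod | grep -q '^amdgpu '; then
      lspci -Dnnk -d 1002: | awk '
          /^[0-9a-f]/ { dev = $1 }
          /Kernel driver in use: vfio-pci/ {
              print "WARNING: " dev " is bound to vfio-pci, but amdgpu is loaded;"
              print "         consider blacklisting amdgpu if the GPU misbehaves."
          }'
  fi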

I'm not sure if there's any way to address concern 2, aside from fixing
the firmware / Windows AMD driver.

I thought of one more thing I could test -- I could try a Linux guest
instead of a Windows guest to determine if the issue is due to the
firmware or the guest Windows AMD driver. Would that be helpful?

[1]: https://wiki.archlinux.org/title/PCI_passthrough_via_OVMF#Too-low_frequency_limit_for_AMD_GPU_passed-through_to_virtual_machine

James

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2022-03-21  2:01 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-17  2:12 [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM James D. Turner
2022-01-17  8:09 ` Greg KH
2022-01-17  9:03 ` Thorsten Leemhuis
2022-01-18  3:14   ` James Turner
2022-01-21  2:13     ` James Turner
2022-01-21  6:22       ` Thorsten Leemhuis
2022-01-21 16:45         ` Alex Deucher
2022-01-22  0:51           ` James Turner
2022-01-22  5:52             ` Lazar, Lijo
2022-01-22 21:11               ` James Turner
2022-01-24 14:21                 ` Lazar, Lijo
2022-01-24 23:58                   ` James Turner
2022-01-25 13:33                     ` Lazar, Lijo
2022-01-30  0:25                       ` Jim Turner
2022-02-15 14:56                         ` Thorsten Leemhuis
2022-02-15 15:11                           ` Alex Deucher
2022-02-16  0:25                             ` James D. Turner
2022-02-16 16:37                               ` Alex Deucher
2022-03-06 15:48                                 ` Thorsten Leemhuis
2022-03-07  2:12                                   ` James Turner
2022-03-13 18:33                                     ` James Turner
2022-03-17 12:54                                       ` Thorsten Leemhuis
2022-03-18  5:43                                         ` Paul Menzel
2022-03-18  7:01                                           ` Thorsten Leemhuis
2022-03-18 14:46                                             ` Alex Williamson
2022-03-18 15:06                                               ` Alex Deucher
2022-03-18 15:25                                                 ` Alex Williamson
2022-03-21  1:26                                                   ` James Turner
2022-01-24 17:04                 ` Alex Deucher
2022-01-24 17:30                   ` Alex Williamson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).