From: Thorsten Leemhuis <regressions@leemhuis.info>
To: Daniel Vetter <daniel.vetter@intel.com>
Cc: "Sergey V." <truesmb@gmail.com>,
	DRI Development <dri-devel@lists.freedesktop.org>,
	LKML <linux-kernel@vger.kernel.org>,
	"regressions@lists.linux.dev" <regressions@lists.linux.dev>
Subject: [regression] Bug 216475 - fbcon crashes during single gpu passthough reattachment to host
Date: Mon, 19 Sep 2022 11:10:30 +0200
Message-ID: <342fadc8-d902-3ada-fd61-67312d0da352@leemhuis.info>

Hi, this is your Linux kernel regression tracker speaking.

I noticed a regression report in bugzilla.kernel.org. As many (most?)
kernel developers don't keep an eye on it, I decided to forward it by
mail. Quoting from https://bugzilla.kernel.org/show_bug.cgi?id=216475 :

> Created attachment 301792 [details]
> My dmesg right after VM shutdown
> 
> Hello, since the 5.19 kernel many VFIO users have had problems reattaching the GPU from the guest to the host. It worked well previously (5.18.16 for me).
> 
> More complaints about the issue:
> https://www.reddit.com/r/VFIO/comments/wp85ve/linux_519_kernel_single_gpu_passthough_black/
> 
> My PC Spec:
>   CPU: Ryzen 5950X
>   RAM: 128GB
>   GPU: NVIDIA RTX 3080
>   OS: Arch Linux
> 
> How to reproduce:
>   1. You need a properly configured VM with working GPU passthrough (too complicated to explain here)
>   2. When the VM starts, it detaches the GPU from the host via 'start.sh' (see below)
>   3. The VM starts properly, Windows loads properly
>   4. Shut down the VM regularly; the GPU should be reattached by 'revert.sh' (see below)
> Actual results (5.19.*):
>   5. Windows shuts down, but the GPU is not reattached to the host; there is only a black screen and the monitors turn off (no signal)
>   5.1 dmesg contains an error message - dmesg.txt in attachments
>     WARNING: CPU: 30 PID: 12528 at drivers/video/fbdev/core/fbcon.c:999 fbcon_init+0x5ce/0x670
>     ...
>     BUG: kernel NULL pointer dereference, address: 0000000000000330
> Expected Result (5.18.* and previous):
>   5. Windows shuts down and the GPU is successfully reattached to the host
> 
> I have tried to bisect git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git with v5.18.16 as good and v5.19.2 as bad
> (this was my first bisect, so I may have done something wrong).
> 
> At some point during the bisect my Linux stopped booting, and I tried marking those commits as bad.
> The commit below might not be the real cause of the problem.
> 
> Commit which I found by bisect:
> 
> commit 3647d6d3dbdafc55f8c4ca8225966963252abe7b (refs/bisect/bad)
> Author: Daniel Vetter <daniel.vetter@ffwll.ch>
> Date:   Tue Apr 5 23:03:33 2022 +0200
> 
>     fbcon: Move more code into fbcon_release
> 
>     con2fb_release_oldinfo() has a bunch more kfree() calls than
>     fbcon_exit(), but since kfree() on NULL is harmless doing that in both
>     places should be ok. This is also a bit more symmetric now again with
>     fbcon_open also allocating the fbcon_ops structure.
> 
>     Acked-by: Sam Ravnborg <sam@ravnborg.org>
>     Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
>     Cc: Daniel Vetter <daniel@ffwll.ch>
>     Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
>     Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>     Cc: Du Cheng <ducheng2@gmail.com>
>     Cc: Claudio Suarez <cssk@net-c.es>
>     Link: https://patchwork.freedesktop.org/patch/msgid/20220405210335.3434130-16-daniel.vetter@ffwll.ch
> 
> 
> start.sh
> ========
> #!/bin/bash
> set -x
> 
> systemctl stop display-manager.service
> while systemctl is-active --quiet "display-manager.service" ; do
>     sleep 1
> done
> 
> killall gdm-x-session
> killall -u bormor
> 
> echo 0 > /sys/class/vtconsole/vtcon0/bind
> echo 0 > /sys/class/vtconsole/vtcon1/bind
> 
> # Unbind EFI-Framebuffer
> echo efi-framebuffer.0 > /sys/bus/platform/drivers/efi-framebuffer/unbind
> 
> # Avoid a Race condition by waiting 2 seconds. This can be calibrated to be shorter or longer if required for your system
> sleep 2
> 
> # Unload all Nvidia drivers
> modprobe -r nvidia_drm
> modprobe -r nvidia_modeset
> modprobe -r nvidia_uvm
> modprobe -r nvidia
> modprobe -r nouveau
> 
> # Unbind the GPU from display driver
> virsh nodedev-detach pci_0000_09_00_0
> virsh nodedev-detach pci_0000_09_00_1
> 
> # Load VFIO Kernel Module  
> modprobe vfio-pci
> 
> 
> revert.sh
> ========
> #!/bin/bash
> set -x
> 
> # Unload VFIO-PCI Kernel Driver
> modprobe -r vfio-pci
> modprobe -r vfio_iommu_type1
> modprobe -r vfio
> 
> virsh nodedev-reattach pci_0000_09_00_1
> virsh nodedev-reattach pci_0000_09_00_0
> 
> echo 1 > /sys/class/vtconsole/vtcon0/bind
> echo 1 > /sys/class/vtconsole/vtcon1/bind
> 
> nvidia-xconfig --query-gpu-info > /dev/null 2>&1
> echo "efi-framebuffer.0" > /sys/bus/platform/drivers/efi-framebuffer/bind
> 
> modprobe nvidia_drm
> modprobe nvidia_modeset
> modprobe nvidia_uvm
> modprobe nvidia
> modprobe nouveau
> 
> 
> systemctl start display-manager.service

See the ticket for more details.
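
The two log lines quoted above are enough to tell this failure apart from
other black-screen problems after a VM shutdown. A minimal sketch of such a
check follows. The function name has_fbcon_crash is made up for illustration,
and the patterns are taken only from the dmesg excerpt in the report, so
other kernel versions may word the messages differently:

```shell
#!/bin/sh
# Hypothetical helper (not part of the bug report): read kernel log text on
# stdin and succeed only if both signatures from the report are present.
has_fbcon_crash() {
    # Buffer stdin so it can be searched twice.
    _log=$(cat)
    printf '%s\n' "$_log" | grep -q 'WARNING: .*fbcon_init' &&
        printf '%s\n' "$_log" | grep -q 'BUG: kernel NULL pointer dereference'
}
```

Possible use right after revert.sh has run (reading the kernel log may
require root): dmesg | has_fbcon_crash && echo "looks like bug 216475"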

BTW, let me use this mail to also add the report to the list of tracked
regressions to ensure it doesn't fall through the cracks:

#regzbot introduced: 3647d6d3dbdafc55f8c4ca8225966963252abe7b
https://bugzilla.kernel.org/show_bug.cgi?id=216475
#regzbot ignore-activity
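
The report suggests that commits which failed to boot were marked as bad
during the bisect, so the commit named above may not be the actual culprit.
git bisect offers a "skip" subcommand for commits that cannot be tested at
all. A rough sketch of how a repeated bisect could map each test boot to a
verdict is shown below. The helper bisect_verdict and its outcome names are
hypothetical; only the git bisect subcommands in the comments are real:

```shell
#!/bin/sh
# Hypothetical helper: translate the outcome of one test boot into the
# git bisect subcommand to use for the commit currently checked out.
bisect_verdict() {
    case "$1" in
        reattach-ok) echo good ;;  # GPU returned to the host as on 5.18.16
        fbcon-crash) echo bad ;;   # fbcon_init WARNING plus NULL dereference
        no-boot)     echo skip ;;  # unrelated breakage, commit is untestable
        *) echo "unknown outcome: $1" >&2; return 2 ;;
    esac
}

# Manual bisect session (not executed here), run inside the linux-stable tree:
#   git bisect start
#   git bisect bad v5.19.2
#   git bisect good v5.18.16
#   ... build, install, boot, start and stop the VM ...
#   git bisect "$(bisect_verdict no-boot)"
```

Skipped commits can leave git bisect with a range of candidates instead of a
single commit, which is still more reliable than a result built on wrongly
marked "bad" commits.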

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply; it's in everyone's interest to set the public record straight.
