regressions.lists.linux.dev archive mirror
* Re: Bug#1054514: linux-image-6.1.0-13-amd64: Debian VM with qxl graphics freezes frequently
       [not found] <alpine.DEB.2.20.2310242308150.28457@mail.home>
@ 2023-10-24 21:09 ` Salvatore Bonaccorso
  2023-10-24 21:39   ` Timo Lindfors
  2023-10-24 23:55   ` Bug#1054514: linux-image-6.1.0-13-amd64: Debian VM with qxl graphics freezes frequently Bagas Sanjaya
  0 siblings, 2 replies; 15+ messages in thread
From: Salvatore Bonaccorso @ 2023-10-24 21:09 UTC
  To: Timo Lindfors, 1054514
  Cc: Gerd Hoffmann, Thomas Zimmermann, Dave Airlie, Daniel Vetter,
	virtualization, spice-devel, dri-devel, linux-kernel,
	regressions

Hi Timo,

On Tue, Oct 24, 2023 at 11:14:32PM +0300, Timo Lindfors wrote:
> Package: src:linux
> Version: 6.1.55-1
> Severity: normal
> 
> Steps to reproduce:
> 1) Install Debian 12 as a virtual machine using virt-manager, choose qxl
>    graphics card. You only need basic installation without wayland or X.
> 2) Log in from the console and save the following to reproduce.bash:
> 
> #!/bin/bash
> 
> chvt 3
> for j in $(seq 80); do
>     echo "$(date) starting round $j"
>     if [ "$(journalctl --boot | grep "failed to allocate VRAM BO")" != "" ]; then
>         echo "bug was reproduced after $j tries"
>         exit 1
>     fi
>     for i in $(seq 100); do
>         dmesg > /dev/tty3
>     done
> done
> 
> echo "bug could not be reproduced"
> exit 0
> 
> 
> 3) Run chmod a+x reproduce.bash
> 4) Run ./reproduce.bash and wait for up to 20 minutes.
> 
> Expected results:
> 4) The system prints a steady flow of text without kernel error messages
> 
> Actual messages:
> 4) At some point the text stops flowing and the script prints "bug was
>    reproduced". If you run "journalctl --boot" you see
> 
> kernel: [TTM] Buffer eviction failed
> kernel: qxl 0000:00:02.0: object_init failed for (3149824, 0x00000001)
> kernel: [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate VRAM BO
> 
> 
> 
> More info:
> 1) The bug does not occur if I downgrade the kernel to
>    linux-image-5.10.0-26-amd64_5.10.197-1_amd64.deb from Debian 11.
> 2) I used the following test_linux.bash to bisect this issue against
>    upstream source:
> 
> #!/bin/bash
> set -x
> 
> gitversion="$(git describe HEAD|sed 's@^v@@')"
> 
> git checkout drivers/gpu/drm/ttm/ttm_bo.c include/drm/ttm/ttm_bo_api.h
> git show bec771b5e0901f4b0bc861bcb58056de5151ae3a | patch -p1
> # Build
> cp ~/kernel.config .config
> # cp /boot/config-$(uname -r) .config
> # scripts/config --enable LOCALVERSION_AUTO
> # scripts/config --disable DEBUG_INFO
> # scripts/config --disable SYSTEM_TRUSTED_KEYRING
> # scripts/config --set-str SYSTEM_TRUSTED_KEYS ''
> # scripts/config --disable STACKPROTECTOR_STRONG
> make olddefconfig
> # make localmodconfig
> make -j$(nproc --all) bindeb-pkg
> rc="$?"
> if [ "$rc" != "0" ]; then
>     exit 125
> fi
> git checkout drivers/gpu/drm/ttm/ttm_bo.c include/drm/ttm/ttm_bo_api.h
> 
> package="$(ls --sort=time ../linux-image-*_amd64.deb|head -n1)"
> version=$(echo $package | cut -d_ -f1|cut -d- -f3-)
> 
> if [ "$gitversion" != "$version" ]; then
>     echo "Build produced version $gitversion but got $version, ignoring"
>     #exit 255
> fi
> 
> # Deploy
> scp $package target:a.deb
> ssh target sudo apt install ./a.deb
> ssh target rm -f a.deb
> ssh target ./grub_set_default_version.bash $version
> ssh target sudo shutdown -r now
> sleep 40
> 
> detected_version=$(ssh target uname -r)
> if [ "$detected_version" != "$version" ]; then
>     echo "Booted to $detected_version but expected $version"
>     exit 255
> fi
> 
> # Test
> exec ssh target sudo ./reproduce.bash
> 
> 
> Bisect printed the following log:
> 
> git bisect start
> # bad: [ed29c2691188cf7ea2a46d40b891836c2bd1a4f5] drm/i915: Fix userptr so we do not have to worry about obj->mm.lock, v7.
> git bisect bad ed29c2691188cf7ea2a46d40b891836c2bd1a4f5
> # bad: [762949bb1da78941b25e63f7e952af037eee15a9] drm: fix drm_mode_create_blob comment
> git bisect bad 762949bb1da78941b25e63f7e952af037eee15a9
> # bad: [e40f97ef12772f8eb04b6a155baa1e0e2e8f3ecc] drm/gma500: Drop DRM_GMA600 config option
> git bisect bad e40f97ef12772f8eb04b6a155baa1e0e2e8f3ecc
> # bad: [5a838e5d5825c85556011478abde708251cc0776] drm/qxl: simplify qxl_fence_wait
> git bisect bad 5a838e5d5825c85556011478abde708251cc0776
> # bad: [d2b6f8a179194de0ffc4886ffc2c4358d86047b8] Merge tag 'xfs-5.13-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
> git bisect bad d2b6f8a179194de0ffc4886ffc2c4358d86047b8
> # bad: [68a32ba14177d4a21c4a9a941cf1d7aea86d436f] Merge tag 'drm-next-2021-04-28' of git://anongit.freedesktop.org/drm/drm
> git bisect bad 68a32ba14177d4a21c4a9a941cf1d7aea86d436f
> # bad: [0698b13403788a646073fcd9b2294f2dce0ce429] drm/amdgpu: skip PP_MP1_STATE_UNLOAD on aldebaran
> git bisect bad 0698b13403788a646073fcd9b2294f2dce0ce429
> # bad: [e1a5e6a8c48bf99ea374fb3e535661cfe226bca4] drm/doc: Add RFC section
> git bisect bad e1a5e6a8c48bf99ea374fb3e535661cfe226bca4
> # bad: [ed29c2691188cf7ea2a46d40b891836c2bd1a4f5] drm/i915: Fix userptr so we do not have to worry about obj->mm.lock, v7.
> git bisect bad ed29c2691188cf7ea2a46d40b891836c2bd1a4f5
> # bad: [2c8ab3339e398bbbcb0980933e266b93bedaae52] drm/i915: Pin timeline map after first timeline pin, v4.
> git bisect bad 2c8ab3339e398bbbcb0980933e266b93bedaae52
> # bad: [2eb8e1a69d9f8cc9c0a75e327f854957224ba421] drm/i915/gem: Drop relocation support on all new hardware (v6)
> git bisect bad 2eb8e1a69d9f8cc9c0a75e327f854957224ba421
> # bad: [b5b6f6a610127b17f20c0ca03dd27beee4ddc2b2] drm/i915/gem: Drop legacy execbuffer support (v2)
> git bisect bad b5b6f6a610127b17f20c0ca03dd27beee4ddc2b2
> # bad: [06debd6e1b28029e6e77c41e59a162868f377897] Merge tag 'drm-intel-next-2021-03-16' of git://anongit.freedesktop.org/drm/drm-intel into drm-next
> git bisect bad 06debd6e1b28029e6e77c41e59a162868f377897
> # good: [e19eede54240d64b4baf9b0df4dfb8191f7ae48b] Merge branch 'dmi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging
> git bisect good e19eede54240d64b4baf9b0df4dfb8191f7ae48b
> # good: [1e28eed17697bcf343c6743f0028cc3b5dd88bf0] Linux 5.12-rc3
> git bisect good 1e28eed17697bcf343c6743f0028cc3b5dd88bf0
> # bad: [6af70eb3b40edfc8bdf2373cdc2bcf9d5a20c8c7] drm/atmel-hlcdc: Rename custom plane state variable
> git bisect bad 6af70eb3b40edfc8bdf2373cdc2bcf9d5a20c8c7
> # good: [4ca77c513537700d3fae69030879f781dde1904c] drm/qxl: release shadow on shutdown
> git bisect good 4ca77c513537700d3fae69030879f781dde1904c
> # bad: [4a11bd1e88af130f50a72e0f54391c1c7d268e03] drm/ast: Add constants for VGACRCB register bits
> git bisect bad 4a11bd1e88af130f50a72e0f54391c1c7d268e03
> # bad: [5c209d8056b9763ce544ecd7dadb3782cdaf96ed] drm/gma500: psb_spank() doesn't need it's own file
> git bisect bad 5c209d8056b9763ce544ecd7dadb3782cdaf96ed
> # bad: [db0c6bd2c0c0dada8927cd46a7c34c316a3a6c04] drm/gem: Export drm_gem_vmap() and drm_gem_vunmap()
> git bisect bad db0c6bd2c0c0dada8927cd46a7c34c316a3a6c04
> # bad: [f4a84e165e6d58606097dd07b5b78767a94b870c] drm/qxl: allocate dumb buffers in ram
> git bisect bad f4a84e165e6d58606097dd07b5b78767a94b870c
> # good: [a7709b9b89a67f3ead2d188b1d0c261059b1f291] drm/qxl: handle shadow in primary destroy
> git bisect good a7709b9b89a67f3ead2d188b1d0c261059b1f291
> # bad: [5a838e5d5825c85556011478abde708251cc0776] drm/qxl: simplify qxl_fence_wait
> git bisect bad 5a838e5d5825c85556011478abde708251cc0776
> # good: [5f6c871fe919999774e8535ea611a6f84ee43ee4] drm/qxl: properly free qxl releases
> git bisect good 5f6c871fe919999774e8535ea611a6f84ee43ee4
> # first bad commit: [5a838e5d5825c85556011478abde708251cc0776] drm/qxl: simplify qxl_fence_wait
> 
> I took a look at
> 
> commit 5a838e5d5825c85556011478abde708251cc0776 (refs/bisect/bad)
> Author: Gerd Hoffmann <kraxel@redhat.com>
> Date:   Thu Feb 4 15:57:10 2021 +0100
> 
>     drm/qxl: simplify qxl_fence_wait
> 
>     Now that we have the new release_event wait queue we can just
>     use that in qxl_fence_wait() and simplify the code a lot.
> 
>     Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
>     Acked-by: Thomas Zimmermann <tzimmermann@suse.de>
>     Link: http://patchwork.freedesktop.org/patch/msgid/20210204145712.1531203-10-kraxel@redhat.com
> 
> 
> and noticed that the bug does not occur if I boot 6.1 kernel with this patch
> reverted (see attached file).

Thanks for the excellently constructed report! I think it's best to
forward this directly to upstream including the people for the
bisected commit to get some idea.

Can you reproduce the issue with 6.5.8-1 in unstable as well?

If not, are you able to isolate an upstream fix which should be
backported to the 6.1.y series as well?

Regards,
Salvatore


* Re: Bug#1054514: linux-image-6.1.0-13-amd64: Debian VM with qxl graphics freezes frequently
  2023-10-24 21:09 ` Bug#1054514: linux-image-6.1.0-13-amd64: Debian VM with qxl graphics freezes frequently Salvatore Bonaccorso
@ 2023-10-24 21:39   ` Timo Lindfors
  2023-12-06  9:56     ` Linux regression tracking (Thorsten Leemhuis)
  2023-10-24 23:55   ` Bug#1054514: linux-image-6.1.0-13-amd64: Debian VM with qxl graphics freezes frequently Bagas Sanjaya
  1 sibling, 1 reply; 15+ messages in thread
From: Timo Lindfors @ 2023-10-24 21:39 UTC
  To: Salvatore Bonaccorso
  Cc: Timo Lindfors, 1054514, Gerd Hoffmann, Thomas Zimmermann,
	Dave Airlie, Daniel Vetter, virtualization, spice-devel,
	dri-devel, linux-kernel, regressions

Hi,

On Tue, 24 Oct 2023, Salvatore Bonaccorso wrote:
> Thanks for the excellently constructed report! I think it's best to
> forward this directly to upstream including the people for the
> bisected commit to get some idea.

Thanks for the quick reply!

> Can you reproduce the issue with 6.5.8-1 in unstable as well?

Unfortunately yes:

ansible@target:~$ uname -r
6.5.0-3-amd64
ansible@target:~$ time sudo ./reproduce.bash
Wed 25 Oct 2023 12:27:00 AM EEST starting round 1
Wed 25 Oct 2023 12:27:24 AM EEST starting round 2
Wed 25 Oct 2023 12:27:48 AM EEST starting round 3
bug was reproduced after 3 tries

real    0m48.838s
user    0m1.115s
sys     0m45.530s

I also tested upstream tag v6.6-rc6:

...
+ detected_version=6.6.0-rc6
+ '[' 6.6.0-rc6 '!=' 6.6.0-rc6 ']'
+ exec ssh target sudo ./reproduce.bash
Wed 25 Oct 2023 12:37:16 AM EEST starting round 1
Wed 25 Oct 2023 12:37:42 AM EEST starting round 2
Wed 25 Oct 2023 12:38:10 AM EEST starting round 3
Wed 25 Oct 2023 12:38:36 AM EEST starting round 4
Wed 25 Oct 2023 12:39:01 AM EEST starting round 5
Wed 25 Oct 2023 12:39:27 AM EEST starting round 6
bug was reproduced after 6 tries
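
The trace above ends in `exec ssh target sudo ./reproduce.bash`, so
test_linux.bash exits with the reproducer's status. Together with the
125 and 255 exits in that script, this fits the exit-code convention of
`git bisect run`. A small sketch of that convention, assuming the script
is driven that way (the function name bisect_verdict is made up for
illustration and is not part of git):

```shell
#!/bin/bash
# Hypothetical helper, for illustration only: maps the exit status of a test
# script to the verdict that `git bisect run` would derive from it.
bisect_verdict() {
    local rc="$1"
    if (( rc == 0 )); then
        echo good     # the commit under test works
    elif (( rc == 125 )); then
        echo skip     # the commit cannot be tested (e.g. it does not build)
    elif (( rc >= 1 && rc <= 127 )); then
        echo bad      # the commit under test shows the bug
    else
        echo abort    # any other status stops the bisection
    fi
}
```

So a run that reproduces the bug (exit 1) marks the commit bad, a failed
build (exit 125) skips it, and a wrong kernel after reboot (exit 255)
stops the bisection.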


For completeness, here is also the grub_set_default_version.bash script 
that I had to write to automate this (maybe these could be in debian 
wiki?):

#!/bin/bash
set -x

version="$1"

idx=$(expr $(grep "menuentry " /boot/grub/grub.cfg | sed 1d |grep -n "'Debian GNU/Linux, with Linux $version'"|cut -d: -f1) - 1)
exec sudo grub-set-default "1>$idx"



-Timo



* Re: Bug#1054514: linux-image-6.1.0-13-amd64: Debian VM with qxl graphics freezes frequently
  2023-10-24 21:09 ` Bug#1054514: linux-image-6.1.0-13-amd64: Debian VM with qxl graphics freezes frequently Salvatore Bonaccorso
  2023-10-24 21:39   ` Timo Lindfors
@ 2023-10-24 23:55   ` Bagas Sanjaya
  1 sibling, 0 replies; 15+ messages in thread
From: Bagas Sanjaya @ 2023-10-24 23:55 UTC
  To: Salvatore Bonaccorso, Timo Lindfors, 1054514
  Cc: Thomas Zimmermann, Linux Kernel Mailing List,
	Linux DRI Development, Linux Virtualization, Gerd Hoffmann,
	SPICE Mailing List, Dave Airlie, Linux Regressions

On Tue, Oct 24, 2023 at 11:09:10PM +0200, Salvatore Bonaccorso wrote:
> Hi Timo,
> 
> On Tue, Oct 24, 2023 at 11:14:32PM +0300, Timo Lindfors wrote:
> > Package: src:linux
> > Version: 6.1.55-1
> > Severity: normal
> > 
> > Steps to reproduce:
> > 1) Install Debian 12 as a virtual machine using virt-manager, choose qxl
> >    graphics card. You only need basic installation without wayland or X.
> > 2) Log in from the console and save the following to reproduce.bash:
> > 
> > #!/bin/bash
> > 
> > chvt 3
> > for j in $(seq 80); do
> >     echo "$(date) starting round $j"
> >     if [ "$(journalctl --boot | grep "failed to allocate VRAM BO")" != "" ]; then
> >         echo "bug was reproduced after $j tries"
> >         exit 1
> >     fi
> >     for i in $(seq 100); do
> >         dmesg > /dev/tty3
> >     done
> > done
> > 
> > echo "bug could not be reproduced"
> > exit 0
> > 
> > 
> > 3) Run chmod a+x reproduce.bash
> > 4) Run ./reproduce.bash and wait for up to 20 minutes.
> > 
> > Expected results:
> > 4) The system prints a steady flow of text without kernel error messages
> > 
> > Actual messages:
> > 4) At some point the text stops flowing and the script prints "bug was
> >    reproduced". If you run "journalctl --boot" you see
> > 
> > kernel: [TTM] Buffer eviction failed
> > kernel: qxl 0000:00:02.0: object_init failed for (3149824, 0x00000001)
> > kernel: [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate VRAM BO
> > 
> > 
> > 
> > More info:
> > 1) The bug does not occur if I downgrade the kernel to
> >    linux-image-5.10.0-26-amd64_5.10.197-1_amd64.deb from Debian 11.
> > 2) I used the following test_linux.bash to bisect this issue against
> >    upstream source:
> > 
> > #!/bin/bash
> > set -x
> > 
> > gitversion="$(git describe HEAD|sed 's@^v@@')"
> > 
> > git checkout drivers/gpu/drm/ttm/ttm_bo.c include/drm/ttm/ttm_bo_api.h
> > git show bec771b5e0901f4b0bc861bcb58056de5151ae3a | patch -p1
> > # Build
> > cp ~/kernel.config .config
> > # cp /boot/config-$(uname -r) .config
> > # scripts/config --enable LOCALVERSION_AUTO
> > # scripts/config --disable DEBUG_INFO
> > # scripts/config --disable SYSTEM_TRUSTED_KEYRING
> > # scripts/config --set-str SYSTEM_TRUSTED_KEYS ''
> > # scripts/config --disable STACKPROTECTOR_STRONG
> > make olddefconfig
> > # make localmodconfig
> > make -j$(nproc --all) bindeb-pkg
> > rc="$?"
> > if [ "$rc" != "0" ]; then
> >     exit 125
> > fi
> > git checkout drivers/gpu/drm/ttm/ttm_bo.c include/drm/ttm/ttm_bo_api.h
> > 
> > package="$(ls --sort=time ../linux-image-*_amd64.deb|head -n1)"
> > version=$(echo $package | cut -d_ -f1|cut -d- -f3-)
> > 
> > if [ "$gitversion" != "$version" ]; then
> >     echo "Build produced version $gitversion but got $version, ignoring"
> >     #exit 255
> > fi
> > 
> > # Deploy
> > scp $package target:a.deb
> > ssh target sudo apt install ./a.deb
> > ssh target rm -f a.deb
> > ssh target ./grub_set_default_version.bash $version
> > ssh target sudo shutdown -r now
> > sleep 40
> > 
> > detected_version=$(ssh target uname -r)
> > if [ "$detected_version" != "$version" ]; then
> >     echo "Booted to $detected_version but expected $version"
> >     exit 255
> > fi
> > 
> > # Test
> > exec ssh target sudo ./reproduce.bash
> > 
> > 
> > Bisect printed the following log:
> > 
> > git bisect start
> > # bad: [ed29c2691188cf7ea2a46d40b891836c2bd1a4f5] drm/i915: Fix userptr so we do not have to worry about obj->mm.lock, v7.
> > git bisect bad ed29c2691188cf7ea2a46d40b891836c2bd1a4f5
> > # bad: [762949bb1da78941b25e63f7e952af037eee15a9] drm: fix drm_mode_create_blob comment
> > git bisect bad 762949bb1da78941b25e63f7e952af037eee15a9
> > # bad: [e40f97ef12772f8eb04b6a155baa1e0e2e8f3ecc] drm/gma500: Drop DRM_GMA600 config option
> > git bisect bad e40f97ef12772f8eb04b6a155baa1e0e2e8f3ecc
> > # bad: [5a838e5d5825c85556011478abde708251cc0776] drm/qxl: simplify qxl_fence_wait
> > git bisect bad 5a838e5d5825c85556011478abde708251cc0776
> > # bad: [d2b6f8a179194de0ffc4886ffc2c4358d86047b8] Merge tag 'xfs-5.13-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
> > git bisect bad d2b6f8a179194de0ffc4886ffc2c4358d86047b8
> > # bad: [68a32ba14177d4a21c4a9a941cf1d7aea86d436f] Merge tag 'drm-next-2021-04-28' of git://anongit.freedesktop.org/drm/drm
> > git bisect bad 68a32ba14177d4a21c4a9a941cf1d7aea86d436f
> > # bad: [0698b13403788a646073fcd9b2294f2dce0ce429] drm/amdgpu: skip PP_MP1_STATE_UNLOAD on aldebaran
> > git bisect bad 0698b13403788a646073fcd9b2294f2dce0ce429
> > # bad: [e1a5e6a8c48bf99ea374fb3e535661cfe226bca4] drm/doc: Add RFC section
> > git bisect bad e1a5e6a8c48bf99ea374fb3e535661cfe226bca4
> > # bad: [ed29c2691188cf7ea2a46d40b891836c2bd1a4f5] drm/i915: Fix userptr so we do not have to worry about obj->mm.lock, v7.
> > git bisect bad ed29c2691188cf7ea2a46d40b891836c2bd1a4f5
> > # bad: [2c8ab3339e398bbbcb0980933e266b93bedaae52] drm/i915: Pin timeline map after first timeline pin, v4.
> > git bisect bad 2c8ab3339e398bbbcb0980933e266b93bedaae52
> > # bad: [2eb8e1a69d9f8cc9c0a75e327f854957224ba421] drm/i915/gem: Drop relocation support on all new hardware (v6)
> > git bisect bad 2eb8e1a69d9f8cc9c0a75e327f854957224ba421
> > # bad: [b5b6f6a610127b17f20c0ca03dd27beee4ddc2b2] drm/i915/gem: Drop legacy execbuffer support (v2)
> > git bisect bad b5b6f6a610127b17f20c0ca03dd27beee4ddc2b2
> > # bad: [06debd6e1b28029e6e77c41e59a162868f377897] Merge tag 'drm-intel-next-2021-03-16' of git://anongit.freedesktop.org/drm/drm-intel into drm-next
> > git bisect bad 06debd6e1b28029e6e77c41e59a162868f377897
> > # good: [e19eede54240d64b4baf9b0df4dfb8191f7ae48b] Merge branch 'dmi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging
> > git bisect good e19eede54240d64b4baf9b0df4dfb8191f7ae48b
> > # good: [1e28eed17697bcf343c6743f0028cc3b5dd88bf0] Linux 5.12-rc3
> > git bisect good 1e28eed17697bcf343c6743f0028cc3b5dd88bf0
> > # bad: [6af70eb3b40edfc8bdf2373cdc2bcf9d5a20c8c7] drm/atmel-hlcdc: Rename custom plane state variable
> > git bisect bad 6af70eb3b40edfc8bdf2373cdc2bcf9d5a20c8c7
> > # good: [4ca77c513537700d3fae69030879f781dde1904c] drm/qxl: release shadow on shutdown
> > git bisect good 4ca77c513537700d3fae69030879f781dde1904c
> > # bad: [4a11bd1e88af130f50a72e0f54391c1c7d268e03] drm/ast: Add constants for VGACRCB register bits
> > git bisect bad 4a11bd1e88af130f50a72e0f54391c1c7d268e03
> > # bad: [5c209d8056b9763ce544ecd7dadb3782cdaf96ed] drm/gma500: psb_spank() doesn't need it's own file
> > git bisect bad 5c209d8056b9763ce544ecd7dadb3782cdaf96ed
> > # bad: [db0c6bd2c0c0dada8927cd46a7c34c316a3a6c04] drm/gem: Export drm_gem_vmap() and drm_gem_vunmap()
> > git bisect bad db0c6bd2c0c0dada8927cd46a7c34c316a3a6c04
> > # bad: [f4a84e165e6d58606097dd07b5b78767a94b870c] drm/qxl: allocate dumb buffers in ram
> > git bisect bad f4a84e165e6d58606097dd07b5b78767a94b870c
> > # good: [a7709b9b89a67f3ead2d188b1d0c261059b1f291] drm/qxl: handle shadow in primary destroy
> > git bisect good a7709b9b89a67f3ead2d188b1d0c261059b1f291
> > # bad: [5a838e5d5825c85556011478abde708251cc0776] drm/qxl: simplify qxl_fence_wait
> > git bisect bad 5a838e5d5825c85556011478abde708251cc0776
> > # good: [5f6c871fe919999774e8535ea611a6f84ee43ee4] drm/qxl: properly free qxl releases
> > git bisect good 5f6c871fe919999774e8535ea611a6f84ee43ee4
> > # first bad commit: [5a838e5d5825c85556011478abde708251cc0776] drm/qxl: simplify qxl_fence_wait
> > 
> > I took a look at
> > 
> > commit 5a838e5d5825c85556011478abde708251cc0776 (refs/bisect/bad)
> > Author: Gerd Hoffmann <kraxel@redhat.com>
> > Date:   Thu Feb 4 15:57:10 2021 +0100
> > 
> >     drm/qxl: simplify qxl_fence_wait
> > 
> >     Now that we have the new release_event wait queue we can just
> >     use that in qxl_fence_wait() and simplify the code a lot.
> > 
> >     Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
> >     Acked-by: Thomas Zimmermann <tzimmermann@suse.de>
> >     Link: http://patchwork.freedesktop.org/patch/msgid/20210204145712.1531203-10-kraxel@redhat.com
> > 
> > 
> > and noticed that the bug does not occur if I boot 6.1 kernel with this patch
> > reverted (see attached file).
> 
> Thanks for the excellently constructed report! I think it's best to
> forward this directly to upstream including the people for the
> bisected commit to get some idea.
> 
> Can you reproduce the issue with 6.5.8-1 in unstable as well?
> 
> If not, are you able to isolate an upstream fix which should be
> backported to the 6.1.y series as well?
> 

Thanks for the regression report. I'm adding it to regzbot:

#regzbot ^introduced: 5a838e5d5825c8
#regzbot title: simplifying qxl_fence_wait() makes VRAM BO allocation fail
#regzbot from: Timo Lindfors <timo.lindfors@iki.fi>

-- 
An old man doll... just what I always wanted! - Clara



* Re: Bug#1054514: linux-image-6.1.0-13-amd64: Debian VM with qxl graphics freezes frequently
  2023-10-24 21:39   ` Timo Lindfors
@ 2023-12-06  9:56     ` Linux regression tracking (Thorsten Leemhuis)
  2023-12-06 10:45       ` Bug#1054514: Info received (Bug#1054514: linux-image-6.1.0-13-amd64: Debian VM with qxl graphics freezes frequently) Debian Bug Tracking System
  2024-03-08  1:08       ` [PATCH 0/1] drm/qxl: fixes qxl_fence_wait Alex Constantino
  0 siblings, 2 replies; 15+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2023-12-06  9:56 UTC
  To: Gerd Hoffmann
  Cc: 1054514, Dave Airlie, Daniel Vetter, virtualization, spice-devel,
	dri-devel, linux-kernel, regressions, Timo Lindfors,
	Salvatore Bonaccorso, Thomas Zimmermann

Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
for once, to make this easily accessible to everyone.

Gerd, it seems this regression[1] fell through the cracks. Could you
please take a look? Or is there a good reason why this can't be
addressed? Or was it dealt with and I just missed it?

[1] apparently caused by 5a838e5d5825c8 ("drm/qxl: simplify
qxl_fence_wait") [v5.13-rc1] from Gerd; for details see
https://lore.kernel.org/regressions/ZTgydqRlK6WX_b29@eldamar.lan/

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot poke

On 24.10.23 23:39, Timo Lindfors wrote:
> Hi,
> 
> On Tue, 24 Oct 2023, Salvatore Bonaccorso wrote:
>> Thanks for the excellently constructed report! I think it's best to
>> forward this directly to upstream including the people for the
>> bisected commit to get some idea.
> 
> Thanks for the quick reply!
> 
>> Can you reproduce the issue with 6.5.8-1 in unstable as well?
> 
> Unfortunately yes:
> 
> ansible@target:~$ uname -r
> 6.5.0-3-amd64
> ansible@target:~$ time sudo ./reproduce.bash
> Wed 25 Oct 2023 12:27:00 AM EEST starting round 1
> Wed 25 Oct 2023 12:27:24 AM EEST starting round 2
> Wed 25 Oct 2023 12:27:48 AM EEST starting round 3
> bug was reproduced after 3 tries
> 
> real    0m48.838s
> user    0m1.115s
> sys     0m45.530s
> 
> I also tested upstream tag v6.6-rc6:
> 
> ...
> + detected_version=6.6.0-rc6
> + '[' 6.6.0-rc6 '!=' 6.6.0-rc6 ']'
> + exec ssh target sudo ./reproduce.bash
> Wed 25 Oct 2023 12:37:16 AM EEST starting round 1
> Wed 25 Oct 2023 12:37:42 AM EEST starting round 2
> Wed 25 Oct 2023 12:38:10 AM EEST starting round 3
> Wed 25 Oct 2023 12:38:36 AM EEST starting round 4
> Wed 25 Oct 2023 12:39:01 AM EEST starting round 5
> Wed 25 Oct 2023 12:39:27 AM EEST starting round 6
> bug was reproduced after 6 tries
> 
> 
> For completeness, here is also the grub_set_default_version.bash script
> that I had to write to automate this (maybe these could be in debian
> wiki?):
> 
> #!/bin/bash
> set -x
> 
> version="$1"
> 
> idx=$(expr $(grep "menuentry " /boot/grub/grub.cfg | sed 1d |grep -n "'Debian GNU/Linux, with Linux $version'"|cut -d: -f1) - 1)
> exec sudo grub-set-default "1>$idx"
> 
> 
> 
> -Timo
> 
> 
> 


* Bug#1054514: Info received (Bug#1054514: linux-image-6.1.0-13-amd64: Debian VM with qxl graphics freezes frequently)
  2023-12-06  9:56     ` Linux regression tracking (Thorsten Leemhuis)
@ 2023-12-06 10:45       ` Debian Bug Tracking System
  2024-03-08  1:08       ` [PATCH 0/1] drm/qxl: fixes qxl_fence_wait Alex Constantino
  1 sibling, 0 replies; 15+ messages in thread
From: Debian Bug Tracking System @ 2023-12-06 10:45 UTC
  To: Linux regressions mailing list

Thank you for the additional information you have supplied regarding
this Bug report.

This is an automatically generated reply to let you know your message
has been received.

Your message is being forwarded to the package maintainers and other
interested parties for their attention; they will reply in due course.

Your message has been sent to the package maintainer(s):
 Debian Kernel Team <debian-kernel@lists.debian.org>

If you wish to submit further information on this problem, please
send it to 1054514@bugs.debian.org.

Please do not send mail to owner@bugs.debian.org unless you wish
to report a problem with the Bug-tracking system.

-- 
1054514: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1054514
Debian Bug Tracking System
Contact owner@bugs.debian.org with problems


* [PATCH 0/1] drm/qxl: fixes qxl_fence_wait
  2023-12-06  9:56     ` Linux regression tracking (Thorsten Leemhuis)
  2023-12-06 10:45       ` Bug#1054514: Info received (Bug#1054514: linux-image-6.1.0-13-amd64: Debian VM with qxl graphics freezes frequently) Debian Bug Tracking System
@ 2024-03-08  1:08       ` Alex Constantino
  2024-03-08  1:08         ` [PATCH 1/1] " Alex Constantino
  1 sibling, 1 reply; 15+ messages in thread
From: Alex Constantino @ 2024-03-08  1:08 UTC
  To: regressions
  Cc: 1054514, airlied, carnil, daniel, dri-devel, kraxel,
	linux-kernel, regressions, spice-devel, timo.lindfors,
	tzimmermann, virtualization, Alex Constantino

Hi,
As initially reported by Timo, the QXL driver will crash given enough
workload:
https://lore.kernel.org/regressions/fb0fda6a-3750-4e1b-893f-97a3e402b9af@leemhuis.info/
I initially came across this problem when migrating Debian VMs from Bullseye
to Bookworm. The bug strikes at unpredictable moments but reliably shows up
sooner or later, even just when using neovim with plugins or playing a video.
The resulting error would then cascade and make Xorg crash too.

The error log from dmesg would have `[TTM] Buffer eviction failed` followed
by either a `failed to allocate VRAM BO` or `failed to allocate GEM object`.
And the error log from Xorg would have `qxl(0): error doing QXL_ALLOC`
followed by a backtrace and segmentation fault.
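
A check for these kernel messages can be scripted. The following is a
minimal sketch: the function name is made up for illustration, and the
message strings are the two allocation failures quoted above.

```shell
#!/bin/bash
# Hypothetical helper, for illustration only: reads kernel log text on stdin
# and succeeds (exit 0) if one of the QXL allocation failures is present.
qxl_alloc_failure_seen() {
    # -F matches the strings literally, -q suppresses output,
    # each -e adds one alternative to look for.
    grep -q -F \
        -e 'failed to allocate VRAM BO' \
        -e 'failed to allocate GEM object'
}
```

For example: `journalctl --boot --dmesg | qxl_alloc_failure_seen && echo "bug reproduced"`.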

I can confirm the problem still exists in latest kernel versions:
https://gitlab.freedesktop.org/drm/kernel @ c6d6a82d8a9f
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git @ 1870cdc0e8de

When I was investigating this issue I ended up creating a script which
triggers the issue in just a couple of minutes when executed under uxterm.
YMMV depending on your system: for example, with urxvt the crashes were
not as consistent, likely because it is more efficient and makes fewer
video memory allocations.
For me this is the fastest way to trigger the bug. The script follows:
```
#!/bin/bash
print_gradient_with_awk() {
    local arg="$1"
    if [[ -n $arg ]]; then
        arg=" ($arg)"
    fi
    awk -v arg="$arg" 'BEGIN{
        s="/\\/\\/\\/\\/\\"; s=s s s s s s s s;
        for (colnum = 0; colnum<77; colnum++) {
            r = 255-(colnum*255/76);
            g = (colnum*510/76);
            b = (colnum*255/76);
            if (g>255) g = 510-g;
            printf "\033[48;2;%d;%d;%dm", r,g,b;
            printf "\033[38;2;%d;%d;%dm", 255-r,255-g,255-b;
            printf "%s\033[0m", substr(s,colnum+1,1);
        }
        printf "%s\n", arg;
    }'
}
for i in {1..10000}; do
    print_gradient_with_awk "$i"
done
```

Timo initially reported:
commit 5f6c871fe919 ("drm/qxl: properly free qxl releases") as working fine
commit 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait") introducing the bug

The bug occurs whenever a timeout is reached in wait_event_timeout.
To fix this issue I updated the code to include busy-wait logic, which
is how the last working version operated. That fixes this bug while still
keeping the code simple (which I suspect was the motivation for the
5a838e5d5825 commit in the first place), as opposed to just reverting to
the last working version at 5f6c871fe919.
I chose HZ as the scaling factor for the loop because it is also used by
ttm_bo_wait_ctx, which is one of the indirect callers of qxl_fence_wait,
the other being ttm_bo_delayed_delete.
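
As a rough illustration of the arithmetic only (not the kernel code
itself), the split of the timeout into HZ-sized waits can be sketched in
shell. The function name split_timeout and the sample values are made up
for this example; timeouts are in jiffies and HZ is passed in explicitly.

```shell
#!/bin/bash
# Hypothetical helper, for illustration only: mirrors the arithmetic used in
# the patch. Prints "<iterations> <timeout_fraction>" for a timeout in jiffies.
split_timeout() {
    local timeout="$1" hz="$2"
    local iterations=1
    local timeout_fraction="$timeout"

    # Timeouts longer than one second (HZ jiffies) are cut into HZ-sized waits.
    if (( timeout_fraction > hz )); then
        iterations=$(( timeout_fraction / hz ))   # integer division
        timeout_fraction="$hz"
    fi
    echo "$iterations $timeout_fraction"
}
```

With HZ=250, `split_timeout 1500 250` prints `6 250` and
`split_timeout 100 250` prints `1 100`. Because of the integer division,
a timeout that is not a multiple of HZ is rounded down to whole seconds:
`split_timeout 620 250` prints `2 250`.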

To confirm the problem no longer manifests I have:
- executed my own test case pasted above
- executed Timo's test case pasted below
- played a video stream in mplayer for 3h (no audio stream because
  apparently pulseaudio and/or alsa have memory leaks that make the
  system run out of memory)

For quick reference here is Timo's script:
```
#!/bin/bash
chvt 3
for j in $(seq 80); do
    echo "$(date) starting round $j"
    if [ "$(journalctl --boot | grep "failed to allocate VRAM BO")" != "" ]; then
        echo "bug was reproduced after $j tries"
        exit 1
    fi
    for i in $(seq 100); do
        dmesg > /dev/tty3
    done
done
echo "bug could not be reproduced"
exit 0
```

From what I could find online, users affected by this problem tend to
simply move from QXL to VirtIO, which is why this bug has stayed hidden
for over 3 years now.
This issue was initially reported by Timo 4 months ago but the discussion
seems to have stalled.
It would be great if this could be addressed rather than falling through
the cracks.

Thank you for your time.


---

Alex Constantino (1):
  drm/qxl: fixes qxl_fence_wait

 drivers/gpu/drm/qxl/qxl_release.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)


base-commit: 1870cdc0e8dee32e3c221704a2977898ba4c10e8
-- 
2.39.2



* [PATCH 1/1] drm/qxl: fixes qxl_fence_wait
  2024-03-08  1:08       ` [PATCH 0/1] drm/qxl: fixes qxl_fence_wait Alex Constantino
@ 2024-03-08  1:08         ` Alex Constantino
  2024-03-08  8:58           ` Thorsten Leemhuis
  2024-03-20 15:25           ` Linux regression tracking (Thorsten Leemhuis)
  0 siblings, 2 replies; 15+ messages in thread
From: Alex Constantino @ 2024-03-08  1:08 UTC
  To: regressions
  Cc: 1054514, airlied, carnil, daniel, dri-devel, kraxel,
	linux-kernel, regressions, spice-devel, timo.lindfors,
	tzimmermann, virtualization, Alex Constantino

Fix OOM scenario by doing multiple notifications to the OOM handler through
a busy wait logic.
Changes from commit 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait") would
result in a '[TTM] Buffer eviction failed' exception whenever it reached a
timeout.

Fixes: 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait")
Link: https://lore.kernel.org/regressions/fb0fda6a-3750-4e1b-893f-97a3e402b9af@leemhuis.info
Reported-by: Timo Lindfors <timo.lindfors@iki.fi>
Closes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1054514
Signed-off-by: Alex Constantino <dreaming.about.electric.sheep@gmail.com>
---
 drivers/gpu/drm/qxl/qxl_release.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/qxl/qxl_release.c b/drivers/gpu/drm/qxl/qxl_release.c
index 368d26da0d6a..51c22e7f9647 100644
--- a/drivers/gpu/drm/qxl/qxl_release.c
+++ b/drivers/gpu/drm/qxl/qxl_release.c
@@ -20,8 +20,6 @@
  * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
  */
 
-#include <linux/delay.h>
-
 #include <trace/events/dma_fence.h>
 
 #include "qxl_drv.h"
@@ -59,14 +57,24 @@ static long qxl_fence_wait(struct dma_fence *fence, bool intr,
 {
 	struct qxl_device *qdev;
 	unsigned long cur, end = jiffies + timeout;
+	signed long iterations = 1;
+	signed long timeout_fraction = timeout;
 
 	qdev = container_of(fence->lock, struct qxl_device, release_lock);
 
-	if (!wait_event_timeout(qdev->release_event,
+	// using HZ as a factor since it is used in ttm_bo_wait_ctx too
+	if (timeout_fraction > HZ) {
+		iterations = timeout_fraction / HZ;
+		timeout_fraction = HZ;
+	}
+	for (int i = 0; i < iterations; i++) {
+		if (wait_event_timeout(
+				qdev->release_event,
 				(dma_fence_is_signaled(fence) ||
-				 (qxl_io_notify_oom(qdev), 0)),
-				timeout))
-		return 0;
+					(qxl_io_notify_oom(qdev), 0)),
+				timeout_fraction))
+			break;
+	}
 
 	cur = jiffies;
 	if (time_after(cur, end))
-- 
2.39.2
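
The slicing arithmetic in the hunk above can be tried out in userspace. This is only a sketch of the computation, not the driver code: SKETCH_HZ, struct wait_plan, plan_waits() and planned_total() are hypothetical names, and SKETCH_HZ is an assumed stand-in for the kernel's HZ, which depends on the build configuration.

```c
#include <assert.h>

/* Hypothetical stand-in for the kernel's HZ (ticks per second). */
#define SKETCH_HZ 250L

/* Hypothetical helper type: how the patch splits one timeout into slices. */
struct wait_plan {
	long iterations;	/* number of wait_event_timeout() calls */
	long slice;		/* timeout handed to each call, in ticks */
};

/* Mirrors the arithmetic the patch adds before its for loop. */
static struct wait_plan plan_waits(long timeout)
{
	struct wait_plan p = { .iterations = 1, .slice = timeout };

	assert(timeout >= 0);
	if (p.slice > SKETCH_HZ) {
		p.iterations = p.slice / SKETCH_HZ;
		p.slice = SKETCH_HZ;
	}
	return p;
}

/* Upper bound on the ticks the loop can wait in total. */
static long planned_total(struct wait_plan p)
{
	return p.iterations * p.slice;
}
```

With these assumed values, a timeout of 1300 ticks becomes 5 slices of 250 ticks, so the integer division drops the 50-tick remainder. Each slice re-enters wait_event_timeout(), which is what gives the condition, and the qxl_io_notify_oom() call inside it, another chance to run.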


* Re: [PATCH 1/1] drm/qxl: fixes qxl_fence_wait
  2024-03-08  1:08         ` [PATCH 1/1] " Alex Constantino
@ 2024-03-08  8:58           ` Thorsten Leemhuis
  2024-03-20 15:25           ` Linux regression tracking (Thorsten Leemhuis)
  1 sibling, 0 replies; 15+ messages in thread
From: Thorsten Leemhuis @ 2024-03-08  8:58 UTC (permalink / raw)
  To: Alex Constantino
  Cc: 1054514, airlied, carnil, daniel, dri-devel, kraxel,
	linux-kernel, regressions, spice-devel, timo.lindfors,
	tzimmermann, virtualization

On 08.03.24 02:08, Alex Constantino wrote:
> Fix OOM scenario by doing multiple notifications to the OOM handler through
> a busy wait logic.
> Changes from commit 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait") would
> result in a '[TTM] Buffer eviction failed' exception whenever it reached a
> timeout.

Thx for working on this.

> Fixes: 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait")
> Link: https://lore.kernel.org/regressions/fb0fda6a-3750-4e1b-893f-97a3e402b9af@leemhuis.info

Nitpicking: that ideally should be pointing to
https://lore.kernel.org/regressions/ZTgydqRlK6WX_b29@eldamar.lan/ , as
that is the report and not just a reply to prod things.

Ciao, Thorsten

* Re: [PATCH 1/1] drm/qxl: fixes qxl_fence_wait
  2024-03-08  1:08         ` [PATCH 1/1] " Alex Constantino
  2024-03-08  8:58           ` Thorsten Leemhuis
@ 2024-03-20 15:25           ` Linux regression tracking (Thorsten Leemhuis)
  2024-03-20 15:27             ` Bug#1054514: Info received ([PATCH 1/1] drm/qxl: fixes qxl_fence_wait) Debian Bug Tracking System
  2024-03-27 13:27             ` [PATCH 1/1] drm/qxl: fixes qxl_fence_wait Maxime Ripard
  1 sibling, 2 replies; 15+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2024-03-20 15:25 UTC (permalink / raw)
  To: Alex Constantino
  Cc: 1054514, airlied, carnil, daniel, dri-devel, kraxel,
	linux-kernel, regressions, spice-devel, timo.lindfors,
	tzimmermann, virtualization, Maarten Lankhorst, Maxime Ripard

On 08.03.24 02:08, Alex Constantino wrote:
> Fix OOM scenario by doing multiple notifications to the OOM handler through
> a busy wait logic.
> Changes from commit 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait") would
> result in a '[TTM] Buffer eviction failed' exception whenever it reached a
> timeout.
> 
> Fixes: 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait")
> Link: https://lore.kernel.org/regressions/fb0fda6a-3750-4e1b-893f-97a3e402b9af@leemhuis.info
> Reported-by: Timo Lindfors <timo.lindfors@iki.fi>
> Closes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1054514
> Signed-off-by: Alex Constantino <dreaming.about.electric.sheep@gmail.com>
> ---
>  drivers/gpu/drm/qxl/qxl_release.c | 20 ++++++++++++++------
>  1 file changed, 14 insertions(+), 6 deletions(-)

Hey Dave and Gerd as well as Thomas, Maarten and Maxime (the latter two
I just added to the CC), it seems to me this regression fix did not
make any progress since it was posted. Did I miss something, is it just
"we are busy with the merge window", or is there some other reason?
Just wondering, I just saw someone on a Fedora IRC channel complaining
about the regression, that's why I'm asking. Would be really good to
finally get this resolved...

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot poke

> diff --git a/drivers/gpu/drm/qxl/qxl_release.c b/drivers/gpu/drm/qxl/qxl_release.c
> index 368d26da0d6a..51c22e7f9647 100644
> --- a/drivers/gpu/drm/qxl/qxl_release.c
> +++ b/drivers/gpu/drm/qxl/qxl_release.c
> @@ -20,8 +20,6 @@
>   * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
>   */
>  
> -#include <linux/delay.h>
> -
>  #include <trace/events/dma_fence.h>
>  
>  #include "qxl_drv.h"
> @@ -59,14 +57,24 @@ static long qxl_fence_wait(struct dma_fence *fence, bool intr,
>  {
>  	struct qxl_device *qdev;
>  	unsigned long cur, end = jiffies + timeout;
> +	signed long iterations = 1;
> +	signed long timeout_fraction = timeout;
>  
>  	qdev = container_of(fence->lock, struct qxl_device, release_lock);
>  
> -	if (!wait_event_timeout(qdev->release_event,
> +	// using HZ as a factor since it is used in ttm_bo_wait_ctx too
> +	if (timeout_fraction > HZ) {
> +		iterations = timeout_fraction / HZ;
> +		timeout_fraction = HZ;
> +	}
> +	for (int i = 0; i < iterations; i++) {
> +		if (wait_event_timeout(
> +				qdev->release_event,
>  				(dma_fence_is_signaled(fence) ||
> -				 (qxl_io_notify_oom(qdev), 0)),
> -				timeout))
> -		return 0;
> +					(qxl_io_notify_oom(qdev), 0)),
> +				timeout_fraction))
> +			break;
> +	}
>  
>  	cur = jiffies;
>  	if (time_after(cur, end))

* Bug#1054514: Info received ([PATCH 1/1] drm/qxl: fixes qxl_fence_wait)
  2024-03-20 15:25           ` Linux regression tracking (Thorsten Leemhuis)
@ 2024-03-20 15:27             ` Debian Bug Tracking System
  2024-03-27 13:27             ` [PATCH 1/1] drm/qxl: fixes qxl_fence_wait Maxime Ripard
  1 sibling, 0 replies; 15+ messages in thread
From: Debian Bug Tracking System @ 2024-03-20 15:27 UTC (permalink / raw)
  To: Linux regressions mailing list

Thank you for the additional information you have supplied regarding
this Bug report.

This is an automatically generated reply to let you know your message
has been received.

Your message is being forwarded to the package maintainers and other
interested parties for their attention; they will reply in due course.

Your message has been sent to the package maintainer(s):
 Debian Kernel Team <debian-kernel@lists.debian.org>

If you wish to submit further information on this problem, please
send it to 1054514@bugs.debian.org.

Please do not send mail to owner@bugs.debian.org unless you wish
to report a problem with the Bug-tracking system.

-- 
1054514: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1054514
Debian Bug Tracking System
Contact owner@bugs.debian.org with problems

* Re: [PATCH 1/1] drm/qxl: fixes qxl_fence_wait
  2024-03-20 15:25           ` Linux regression tracking (Thorsten Leemhuis)
  2024-03-20 15:27             ` Bug#1054514: Info received ([PATCH 1/1] drm/qxl: fixes qxl_fence_wait) Debian Bug Tracking System
@ 2024-03-27 13:27             ` Maxime Ripard
  2024-04-04 18:14               ` [PATCH v2 0/1] Revert "drm/qxl: simplify qxl_fence_wait" Alex Constantino
  1 sibling, 1 reply; 15+ messages in thread
From: Maxime Ripard @ 2024-03-27 13:27 UTC (permalink / raw)
  To: Linux regressions mailing list
  Cc: Alex Constantino, 1054514, airlied, carnil, daniel, dri-devel,
	kraxel, linux-kernel, spice-devel, timo.lindfors, tzimmermann,
	virtualization, Maarten Lankhorst

Hi,

On Wed, Mar 20, 2024 at 04:25:48PM +0100, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 08.03.24 02:08, Alex Constantino wrote:
> > Fix OOM scenario by doing multiple notifications to the OOM handler through
> > a busy wait logic.
> > Changes from commit 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait") would
> > result in a '[TTM] Buffer eviction failed' exception whenever it reached a
> > timeout.
> > 
> > Fixes: 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait")
> > Link: https://lore.kernel.org/regressions/fb0fda6a-3750-4e1b-893f-97a3e402b9af@leemhuis.info
> > Reported-by: Timo Lindfors <timo.lindfors@iki.fi>
> > Closes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1054514
> > Signed-off-by: Alex Constantino <dreaming.about.electric.sheep@gmail.com>
> > ---
> >  drivers/gpu/drm/qxl/qxl_release.c | 20 ++++++++++++++------
> >  1 file changed, 14 insertions(+), 6 deletions(-)
> 
> Hey Dave and Gerd as well as Thomas, Maarten and Maxime (the latter two
> I just added to the CC), it seems to me this regression fix did not
> make any progress since it was posted. Did I miss something, is it just
> "we are busy with the merge window", or is there some other reason?
> Just wondering, I just saw someone on a Fedora IRC channel complaining
> about the regression, that's why I'm asking. Would be really good to
> finally get this resolved...

I've ping'd Gerd last week about it, but he couldn't remember the
details of why that patch was warranted in the first place.

If it works, I'd prefer to revert the original patch that we know used
to work instead of coming up with some less proven logic, which seems to
be quite different to what it used to be.

Alex, could you try reverting 5a838e5d5825c85556011478abde708251cc0776
and letting us know the result?

Thanks!
Maxime

* [PATCH v2 0/1] Revert "drm/qxl: simplify qxl_fence_wait"
  2024-03-27 13:27             ` [PATCH 1/1] drm/qxl: fixes qxl_fence_wait Maxime Ripard
@ 2024-04-04 18:14               ` Alex Constantino
  2024-04-04 18:14                 ` [PATCH v2 1/1] " Alex Constantino
  0 siblings, 1 reply; 15+ messages in thread
From: Alex Constantino @ 2024-04-04 18:14 UTC (permalink / raw)
  To: mripard
  Cc: 1054514, airlied, carnil, daniel, dreaming.about.electric.sheep,
	dri-devel, kraxel, linux-kernel, maarten.lankhorst, regressions,
	spice-devel, timo.lindfors, tzimmermann, virtualization

Changes since v1:
- replace new code logic in v1 with past code version by reverting
  commit 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait")
- add missing code dependency from
  commit d72277b6c37d ("dma-buf: nuke DMA_FENCE_TRACE macros v2")

---

Hi,

To clarify, the reason for my original patch, as explained in more detail
in my previous email, was that it fixed the issue while keeping the code
simpler (which was the original reason for the commit being reverted here).
But I perfectly understand opting for previously battle-tested code. Makes
sense.

As requested, I've reverted commit 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait")
and then ran both Timo's and my test cases, plus 1h of video playback.
I was unable to reproduce the bug with any of those cases. So the revert
seems to fix the bug.
Please note that, as stated in the commit message, because of a dependency
on DMA_FENCE_WARN this patch also restores the relevant code deleted
by commit d72277b6c37d ("dma-buf: nuke DMA_FENCE_TRACE macros v2").

A couple of things I've observed from dmesg:
- (1) it always triggers a single warning at boot, this is issued by
  `WARN_ON(list_empty(&release->bos));` @ qxl_release_free @ qxl_release.c
  Maybe better for this to be addressed separately from this patch?
- (2) there are quite a few `failed to wait on release xx after spincount
  301` messages as printed by the patch v2 code when the test case shell
  scripts are being executed.
- (3) there can be a single error message `[drm:qxl_release_from_id_locked
  [qxl]] *ERROR* failed to find id in release_idr`
- (4) occasional error messages about `[drm:drm_atomic_helper_commit_planes
  [drm_kms_helper]] *ERROR* head 9 wrong:`.

Issue (1) relates to this patch v2 and also happened with kernel from
base-commit 1870cdc0e8de (March 1st).
Issue (2) also relates to this patch v2 but only happens with kernel from
base-commit a6bd6c933339 (March 30th).
Both (3) and (4) are unrelated to this patch as they can occur
independently of it and I'm guessing these may be related to the recent
changes discussed in
https://lore.kernel.org/dri-devel/38d38331-3848-4995-b78e-a87ecae722d5@linux.intel.com/T/#u


For reference here is the output of (1):
```
[   20.779514] ------------[ cut here ]------------
[   20.779525] workqueue: WQ_MEM_RECLAIM ttm:ttm_bo_delayed_delete [ttm] is flushing !WQ_MEM_RECLAIM events:qxl_gc_work [qxl]
[   20.779666] WARNING: CPU: 1 PID: 601 at kernel/workqueue.c:3692 check_flush_dependency+0xfa/0x110
[   20.779683] Modules linked in: nfsv3 nfs_acl nfs lockd grace intel_rapl_msr intel_rapl_common intel_pmc_core intel_vsec pmt_telemetry pmt_class kvm_intel rfkill kvm snd_hda_codec_generic crct10dif_pclmul crct10dif_common crc32_pclmul ghash_clmulni_intel snd_hda_intel snd_intel_dspcfg sha512_ssse3 sha512_generic snd_hda_codec sha256_ssse3 snd_hwdep sha1_ssse3 snd_hda_core sunrpc binfmt_misc snd_pcm aesni_intel qxl drm_ttm_helper ttm crypto_simd snd_timer cryptd rapl snd virtio_balloon virtio_console drm_kms_helper pcspkr soundcore button evdev joydev serio_raw drm loop fuse efi_pstore dm_mod configfs qemu_fw_cfg virtio_rng autofs4 ext4 crc32c_generic crc16 mbcache jbd2 virtio_net ata_generic net_failover virtio_blk failover uhci_hcd ata_piix ehci_hcd libata scsi_mod usbcore crc32c_intel i2c_piix4 virtio_pci virtio psmouse virtio_pci_legacy_dev virtio_pci_modern_dev virtio_ring floppy scsi_common usb_common
[   20.779825] CPU: 1 PID: 601 Comm: kworker/u13:1 Not tainted 6.9.0-rc1-next-20240328-amd64-00001-g756220c4615c #81
[   20.779833] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[   20.779837] Workqueue: ttm ttm_bo_delayed_delete [ttm]
[   20.779862] RIP: 0010:check_flush_dependency+0xfa/0x110
[   20.779869] Code: ff ff 49 8b 55 18 48 8d 8b c0 00 00 00 49 89 e8 48 81 c6 c0 00 00 00 48 c7 c7 c0 16 44 8d c6 05 e7 75 b3 01 01 e8 86 97 fd ff <0f> 0b e9 21 ff ff ff 80 3d d5 75 b3 01 00 75 96 e9 4d ff ff ff 90
[   20.779875] RSP: 0000:ffffb59600dd7cc8 EFLAGS: 00010082
[   20.779880] RAX: 0000000000000000 RBX: ffff9af88104ee00 RCX: 0000000000000027
[   20.779902] RDX: ffff9af8fdd21708 RSI: 0000000000000001 RDI: ffff9af8fdd21700
[   20.779906] RBP: ffffffffc0882570 R08: 0000000000000000 R09: 0000000000000003
[   20.779910] R10: ffffb59600dd7b58 R11: ffffffff8dcc83e8 R12: ffff9af894498000
[   20.779914] R13: ffff9af89558d780 R14: ffffb59600dd7cf8 R15: 0000000000000001
[   20.779918] FS:  0000000000000000(0000) GS:ffff9af8fdd00000(0000) knlGS:0000000000000000
[   20.779924] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   20.779928] CR2: 00005574b0bd4148 CR3: 000000001fb40002 CR4: 0000000000370ef0
[   20.779994] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   20.779999] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   20.780003] Call Trace:
[   20.780135]  <TASK>
[   20.780144]  ? __warn+0x7c/0x120
[   20.780153]  ? check_flush_dependency+0xfa/0x110
[   20.780161]  ? report_bug+0x160/0x190
[   20.780169]  ? prb_read_valid+0x17/0x20
[   20.780179]  ? handle_bug+0x41/0x70
[   20.780186]  ? exc_invalid_op+0x13/0x60
[   20.780193]  ? asm_exc_invalid_op+0x16/0x20
[   20.780201]  ? __pfx_qxl_gc_work+0x10/0x10 [qxl]
[   20.780221]  ? check_flush_dependency+0xfa/0x110
[   20.780228]  ? check_flush_dependency+0xfa/0x110
[   20.780234]  __flush_work+0xce/0x2c0
[   20.780244]  qxl_queue_garbage_collect+0x7f/0x90 [qxl]
[   20.780268]  qxl_fence_wait+0xa0/0x190 [qxl]
[   20.780287]  dma_fence_wait_timeout+0x5e/0x130
[   20.780313]  dma_resv_wait_timeout+0x7b/0xe0
[   20.780327]  ttm_bo_delayed_delete+0x26/0x80 [ttm]
[   20.780359]  process_one_work+0x184/0x3a0
[   20.780370]  worker_thread+0x273/0x390
[   20.780379]  ? __pfx_worker_thread+0x10/0x10
[   20.780388]  kthread+0xcb/0x100
[   20.780396]  ? __pfx_kthread+0x10/0x10
[   20.780404]  ret_from_fork+0x2d/0x50
[   20.780416]  ? __pfx_kthread+0x10/0x10
[   20.780421]  ret_from_fork_asm+0x1a/0x30
[   20.780435]  </TASK>
[   20.780437] ---[ end trace 0000000000000000 ]---
```
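
The warning text above names the rule being violated: work running on a WQ_MEM_RECLAIM workqueue (ttm) flushes work on a workqueue without that flag (events). A minimal userspace sketch of that rule follows, based only on the message wording. The kernel's real check in kernel/workqueue.c has further conditions that are not modelled here, and SK_WQ_MEM_RECLAIM, struct sk_wq and flush_would_warn() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bit standing in for WQ_MEM_RECLAIM. */
#define SK_WQ_MEM_RECLAIM (1u << 0)

/* Hypothetical, heavily reduced model of a workqueue. */
struct sk_wq {
	const char *name;
	unsigned int flags;
};

/*
 * Returns true when a worker belonging to current_wq flushing work on
 * target would match the situation in the warning: the flusher is
 * needed for memory reclaim, the flushed queue offers no such guarantee.
 * current_wq may be NULL when the caller is not a workqueue worker.
 */
static bool flush_would_warn(const struct sk_wq *current_wq,
			     const struct sk_wq *target)
{
	assert(target != NULL);

	if (target->flags & SK_WQ_MEM_RECLAIM)
		return false;

	return current_wq != NULL &&
	       (current_wq->flags & SK_WQ_MEM_RECLAIM) != 0;
}
```

In the trace, ttm_bo_delayed_delete plays the role of the flusher and qxl_gc_work, queued on the events workqueue, is the flushed work, which corresponds to the first case returning true.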


TLDR: this patch fixes the instability issues. But there may be warnings
in dmesg. Errors in dmesg were observed too but they are unrelated to this
patch.


Thank you for your time.

---

Alex Constantino (1):
  Revert "drm/qxl: simplify qxl_fence_wait"

 drivers/gpu/drm/qxl/qxl_release.c | 50 +++++++++++++++++++++++++++----
 include/linux/dma-fence.h         |  7 +++++
 2 files changed, 52 insertions(+), 5 deletions(-)


base-commit: a6bd6c9333397f5a0e2667d4d82fef8c970108f2
-- 
2.39.2


* [PATCH v2 1/1] Revert "drm/qxl: simplify qxl_fence_wait"
  2024-04-04 18:14               ` [PATCH v2 0/1] Revert "drm/qxl: simplify qxl_fence_wait" Alex Constantino
@ 2024-04-04 18:14                 ` Alex Constantino
  2024-04-05  4:37                   ` Greg KH
  2024-04-05 13:13                   ` (subset) " Maxime Ripard
  0 siblings, 2 replies; 15+ messages in thread
From: Alex Constantino @ 2024-04-04 18:14 UTC (permalink / raw)
  To: mripard
  Cc: 1054514, airlied, carnil, daniel, dreaming.about.electric.sheep,
	dri-devel, kraxel, linux-kernel, maarten.lankhorst, regressions,
	spice-devel, timo.lindfors, tzimmermann, virtualization

This reverts commit 5a838e5d5825c85556011478abde708251cc0776.

Changes from commit 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait") would
result in a '[TTM] Buffer eviction failed' exception whenever it reached a
timeout.
Due to a dependency to DMA_FENCE_WARN this also restores some code deleted
by commit d72277b6c37d ("dma-buf: nuke DMA_FENCE_TRACE macros v2").

Fixes: 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait")
Link: https://lore.kernel.org/regressions/ZTgydqRlK6WX_b29@eldamar.lan/
Reported-by: Timo Lindfors <timo.lindfors@iki.fi>
Closes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1054514
Signed-off-by: Alex Constantino <dreaming.about.electric.sheep@gmail.com>
---
 drivers/gpu/drm/qxl/qxl_release.c | 50 +++++++++++++++++++++++++++----
 include/linux/dma-fence.h         |  7 +++++
 2 files changed, 52 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/qxl/qxl_release.c b/drivers/gpu/drm/qxl/qxl_release.c
index 368d26da0d6a..9febc8b73f09 100644
--- a/drivers/gpu/drm/qxl/qxl_release.c
+++ b/drivers/gpu/drm/qxl/qxl_release.c
@@ -58,16 +58,56 @@ static long qxl_fence_wait(struct dma_fence *fence, bool intr,
 			   signed long timeout)
 {
 	struct qxl_device *qdev;
+	struct qxl_release *release;
+	int count = 0, sc = 0;
+	bool have_drawable_releases;
 	unsigned long cur, end = jiffies + timeout;
 
 	qdev = container_of(fence->lock, struct qxl_device, release_lock);
+	release = container_of(fence, struct qxl_release, base);
+	have_drawable_releases = release->type == QXL_RELEASE_DRAWABLE;
 
-	if (!wait_event_timeout(qdev->release_event,
-				(dma_fence_is_signaled(fence) ||
-				 (qxl_io_notify_oom(qdev), 0)),
-				timeout))
-		return 0;
+retry:
+	sc++;
+
+	if (dma_fence_is_signaled(fence))
+		goto signaled;
+
+	qxl_io_notify_oom(qdev);
+
+	for (count = 0; count < 11; count++) {
+		if (!qxl_queue_garbage_collect(qdev, true))
+			break;
+
+		if (dma_fence_is_signaled(fence))
+			goto signaled;
+	}
+
+	if (dma_fence_is_signaled(fence))
+		goto signaled;
+
+	if (have_drawable_releases || sc < 4) {
+		if (sc > 2)
+			/* back off */
+			usleep_range(500, 1000);
+
+		if (time_after(jiffies, end))
+			return 0;
+
+		if (have_drawable_releases && sc > 300) {
+			DMA_FENCE_WARN(fence,
+				       "failed to wait on release %llu after spincount %d\n",
+				       fence->context & ~0xf0000000, sc);
+			goto signaled;
+		}
+		goto retry;
+	}
+	/*
+	 * yeah, original sync_obj_wait gave up after 3 spins when
+	 * have_drawable_releases is not set.
+	 */
 
+signaled:
 	cur = jiffies;
 	if (time_after(cur, end))
 		return 0;
diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
index e06bad467f55..c3f9bb6602ba 100644
--- a/include/linux/dma-fence.h
+++ b/include/linux/dma-fence.h
@@ -682,4 +682,11 @@ static inline bool dma_fence_is_container(struct dma_fence *fence)
 	return dma_fence_is_array(fence) || dma_fence_is_chain(fence);
 }
 
+#define DMA_FENCE_WARN(f, fmt, args...) \
+	do {								\
+		struct dma_fence *__ff = (f);				\
+		pr_warn("f %llu#%llu: " fmt, __ff->context, __ff->seqno,\
+			 ##args);					\
+	} while (0)
+
 #endif /* __LINUX_DMA_FENCE_H */
-- 
2.39.2


* Re: [PATCH v2 1/1] Revert "drm/qxl: simplify qxl_fence_wait"
  2024-04-04 18:14                 ` [PATCH v2 1/1] " Alex Constantino
@ 2024-04-05  4:37                   ` Greg KH
  2024-04-05 13:13                   ` (subset) " Maxime Ripard
  1 sibling, 0 replies; 15+ messages in thread
From: Greg KH @ 2024-04-05  4:37 UTC (permalink / raw)
  To: Alex Constantino
  Cc: mripard, 1054514, airlied, carnil, daniel, dri-devel, kraxel,
	linux-kernel, maarten.lankhorst, regressions, spice-devel,
	timo.lindfors, tzimmermann, virtualization

On Thu, Apr 04, 2024 at 07:14:48PM +0100, Alex Constantino wrote:
> This reverts commit 5a838e5d5825c85556011478abde708251cc0776.
> 
> Changes from commit 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait") would
> result in a '[TTM] Buffer eviction failed' exception whenever it reached a
> timeout.
> Due to a dependency to DMA_FENCE_WARN this also restores some code deleted
> by commit d72277b6c37d ("dma-buf: nuke DMA_FENCE_TRACE macros v2").
> 
> Fixes: 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait")
> Link: https://lore.kernel.org/regressions/ZTgydqRlK6WX_b29@eldamar.lan/
> Reported-by: Timo Lindfors <timo.lindfors@iki.fi>
> Closes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1054514
> Signed-off-by: Alex Constantino <dreaming.about.electric.sheep@gmail.com>
> ---
>  drivers/gpu/drm/qxl/qxl_release.c | 50 +++++++++++++++++++++++++++----
>  include/linux/dma-fence.h         |  7 +++++
>  2 files changed, 52 insertions(+), 5 deletions(-)

Hi,

This is the friendly patch-bot of Greg Kroah-Hartman.  You have sent him
a patch that has triggered this response.  He used to manually respond
to these common problems, but in order to save his sanity (he kept
writing the same thing over and over, yet to different people), I was
created.  Hopefully you will not take offence and will fix the problem
in your patch and resubmit it so that it can be accepted into the Linux
kernel tree.

You are receiving this message because of the following common error(s)
as indicated below:

- You have marked a patch with a "Fixes:" tag for a commit that is in an
  older released kernel, yet you do not have a cc: stable line in the
  signed-off-by area at all, which means that the patch will not be
  applied to any older kernel releases.  To properly fix this, please
  follow the documented rules in the
  Documentation/process/stable-kernel-rules.rst file for how to resolve
  this.

If you wish to discuss this problem further, or you have questions about
how to resolve this issue, please feel free to respond to this email and
Greg will reply once he has dug out from the pending patches received
from other developers.

thanks,

greg k-h's patch email bot

* Re: (subset) [PATCH v2 1/1] Revert "drm/qxl: simplify qxl_fence_wait"
  2024-04-04 18:14                 ` [PATCH v2 1/1] " Alex Constantino
  2024-04-05  4:37                   ` Greg KH
@ 2024-04-05 13:13                   ` Maxime Ripard
  1 sibling, 0 replies; 15+ messages in thread
From: Maxime Ripard @ 2024-04-05 13:13 UTC (permalink / raw)
  To: Alex Constantino
  Cc: 1054514, airlied, carnil, daniel, dri-devel, kraxel,
	linux-kernel, maarten.lankhorst, regressions, spice-devel,
	timo.lindfors, tzimmermann, virtualization

On Thu, 04 Apr 2024 19:14:48 +0100, Alex Constantino wrote:
> This reverts commit 5a838e5d5825c85556011478abde708251cc0776.
> 
> Changes from commit 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait") would
> result in a '[TTM] Buffer eviction failed' exception whenever it reached a
> timeout.
> Due to a dependency to DMA_FENCE_WARN this also restores some code deleted
> by commit d72277b6c37d ("dma-buf: nuke DMA_FENCE_TRACE macros v2").
> 
> [...]

Applied to misc/kernel.git (drm-misc-fixes).

Thanks!
Maxime


Thread overview: 15+ messages
     [not found] <alpine.DEB.2.20.2310242308150.28457@mail.home>
2023-10-24 21:09 ` Bug#1054514: linux-image-6.1.0-13-amd64: Debian VM with qxl graphics freezes frequently Salvatore Bonaccorso
2023-10-24 21:39   ` Timo Lindfors
2023-12-06  9:56     ` Linux regression tracking (Thorsten Leemhuis)
2023-12-06 10:45       ` Bug#1054514: Info received (Bug#1054514: linux-image-6.1.0-13-amd64: Debian VM with qxl graphics freezes frequently) Debian Bug Tracking System
2024-03-08  1:08       ` [PATCH 0/1] drm/qxl: fixes qxl_fence_wait Alex Constantino
2024-03-08  1:08         ` [PATCH 1/1] " Alex Constantino
2024-03-08  8:58           ` Thorsten Leemhuis
2024-03-20 15:25           ` Linux regression tracking (Thorsten Leemhuis)
2024-03-20 15:27             ` Bug#1054514: Info received ([PATCH 1/1] drm/qxl: fixes qxl_fence_wait) Debian Bug Tracking System
2024-03-27 13:27             ` [PATCH 1/1] drm/qxl: fixes qxl_fence_wait Maxime Ripard
2024-04-04 18:14               ` [PATCH v2 0/1] Revert "drm/qxl: simplify qxl_fence_wait" Alex Constantino
2024-04-04 18:14                 ` [PATCH v2 1/1] " Alex Constantino
2024-04-05  4:37                   ` Greg KH
2024-04-05 13:13                   ` (subset) " Maxime Ripard
2023-10-24 23:55   ` Bug#1054514: linux-image-6.1.0-13-amd64: Debian VM with qxl graphics freezes frequently Bagas Sanjaya
