[Nouveau] Advice about debugging nouveau driver suspend issue (init_on_alloc=1 and init_on

All of lore.kernel.org
 help / color / mirror / Atom feed

* [Nouveau] Advice about debugging nouveau driver suspend issue (init_on_alloc=1 and init_on_free=1)
@ 2021-10-13 17:31 Computer Enthusiastic
  2021-10-13 17:43 ` Ilia Mirkin
  0 siblings, 1 reply; 13+ messages in thread
From: Computer Enthusiastic @ 2021-10-13 17:31 UTC (permalink / raw)
  To: nouveau

Hello,

Greeting to all list members.

I hope this mailing list is the right place to ask about it, otherwise
I apologize in advance and, please, direct me to the right mailing
list.

I would like to receive advice on how to debug (or help to debug) the
noveau kernel module suspend issue reported in [1][2][3]. The issue is
constantly reproducible and it consists in a  screen corruption with
system lockup on resume after suspend to ram or suspend to disk. It
seems to affect NV50 nvidia graphic cards at least from kernel 5.10.0
(probably from 5.2 or 5.3) up to 5.14.10 (I have not tested 5.15 in rc
state, actually).

Thanks in advance.

[1] https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/-/issues/547
[2] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=989705
[3] https://bugzilla.kernel.org/show_bug.cgi?id=213617

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Nouveau] Advice about debugging nouveau driver suspend issue (init_on_alloc=1 and init_on_free=1)
  2021-10-13 17:31 [Nouveau] Advice about debugging nouveau driver suspend issue (init_on_alloc=1 and init_on_free=1) Computer Enthusiastic
@ 2021-10-13 17:43 ` Ilia Mirkin
  2021-10-14 16:51   ` Computer Enthusiastic
  0 siblings, 1 reply; 13+ messages in thread
From: Ilia Mirkin @ 2021-10-13 17:43 UTC (permalink / raw)
  To: Computer Enthusiastic; +Cc: nouveau

The most straightforward thing, if you can reproduce at will, would be
to do a bisect to figure out which change this happened with. Once we
know which change caused the problem, it will hopefully provide with
more ability to sort out where we're going wrong. You can also play
with things like KASAN, although it's not clear to me that they would
necessarily catch this problem.

Cheers,

  -ilia

On Wed, Oct 13, 2021 at 1:31 PM Computer Enthusiastic
<computer.enthusiastic@gmail.com> wrote:
>
> Hello,
>
> Greeting to all list members.
>
> I hope this mailing list is the right place to ask about it, otherwise
> I apologize in advance and, please, direct me to the right mailing
> list.
>
> I would like to receive advice on how to debug (or help to debug) the
> noveau kernel module suspend issue reported in [1][2][3]. The issue is
> constantly reproducible and it consists in a  screen corruption with
> system lockup on resume after suspend to ram or suspend to disk. It
> seems to affect NV50 nvidia graphic cards at least from kernel 5.10.0
> (probably from 5.2 or 5.3) up to 5.14.10 (I have not tested 5.15 in rc
> state, actually).
>
> Thanks in advance.
>
> [1] https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/-/issues/547
> [2] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=989705
> [3] https://bugzilla.kernel.org/show_bug.cgi?id=213617

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Nouveau] Advice about debugging nouveau driver suspend issue (init_on_alloc=1 and init_on_free=1)
  2021-10-13 17:43 ` Ilia Mirkin
@ 2021-10-14 16:51   ` Computer Enthusiastic
  2021-10-14 18:07     ` Ilia Mirkin
  0 siblings, 1 reply; 13+ messages in thread
From: Computer Enthusiastic @ 2021-10-14 16:51 UTC (permalink / raw)
  To: nouveau

Hello,

Thanks for the answer.

On Wed, Oct 13, 2021 at 01:43:56PM -0400, Ilia Mirkin wrote:
> The most straightforward thing, if you can reproduce at will, would be
> to do a bisect to figure out which change this happened with. Once we
> know which change caused the problem, it will hopefully provide with
> more ability to sort out where we're going wrong. You can also play
> with things like KASAN, although it's not clear to me that they would
> necessarily catch this problem.

The patch is "mm: security: introduce init_on_alloc=1 and init_on_free=1 
boot options" (commit 6471384af2a6530696fc0203bafe4de41a23c9ef) [1]. 

It was introduced in kernel version 5.2:
$ git describe --tags 6471384af2a6530696fc0203bafe4de41a23c9ef
v5.2-5754-g6471384af2a6

This patch does not specifically change the source code of the nouveau 
module. I suppose it interacts with the memory management of nouveau 
kernel module during kernel suspend to ram and resume from ram (logs 
recorded during malfunction are available in [2]).

I activated the kasan support in the kernel (recompiling it, for example 
with kernel version 5.14.10), but I wasn't able to collect a kasan 
report/dump triggered by the issue.

A hypotesis could be that the issue is caused by an extra latency in 
buffer eviction during suspend to ram (due to memory init), that causes 
the nouveau kernel not responding to suspend, causing a "forced" suspend 
that leaves the nouveau module in an inconsistent state when the system 
is resumed. I'm not a kernel expert, therefore I don't know how to test 
this hypotesis.

Many other patches has been written over time about init_on_alloc and 
init_on_free kernel features, but the issue with nouveau driver is still 
there, up to now.

For the previous reasons, it don't know how to investigate it.

Thanks for any suggestions.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v5.14.12&id=6471384af2a6530696fc0203bafe4de41a23c9ef
[2] https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/uploads/4294703a356e4c00814f581a42988999/suspend-errors-5.14.9.log.zip

> On Wed, Oct 13, 2021 at 1:31 PM Computer Enthusiastic
> <computer.enthusiastic@gmail.com> wrote:
> >
> > Hello,
> >
> > Greeting to all list members.
> >
> > I hope this mailing list is the right place to ask about it, otherwise
> > I apologize in advance and, please, direct me to the right mailing
> > list.
> >
> > I would like to receive advice on how to debug (or help to debug) the
> > noveau kernel module suspend issue reported in [1][2][3]. The issue is
> > constantly reproducible and it consists in a  screen corruption with
> > system lockup on resume after suspend to ram or suspend to disk. It
> > seems to affect NV50 nvidia graphic cards at least from kernel 5.10.0
> > (probably from 5.2 or 5.3) up to 5.14.10 (I have not tested 5.15 in rc
> > state, actually).
> >
> > Thanks in advance.
> >
> > [1] https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/-/issues/547
> > [2] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=989705
> > [3] https://bugzilla.kernel.org/show_bug.cgi?id=213617

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Nouveau] Advice about debugging nouveau driver suspend issue (init_on_alloc=1 and init_on_free=1)
  2021-10-14 16:51   ` Computer Enthusiastic
@ 2021-10-14 18:07     ` Ilia Mirkin
  2022-01-24 23:19       ` computer enthusiastic
  0 siblings, 1 reply; 13+ messages in thread
From: Ilia Mirkin @ 2021-10-14 18:07 UTC (permalink / raw)
  To: Computer Enthusiastic; +Cc: nouveau

On Thu, Oct 14, 2021 at 12:51 PM Computer Enthusiastic
<computer.enthusiastic@gmail.com> wrote:
>
> Hello,
>
> Thanks for the answer.
>
> On Wed, Oct 13, 2021 at 01:43:56PM -0400, Ilia Mirkin wrote:
> > The most straightforward thing, if you can reproduce at will, would be
> > to do a bisect to figure out which change this happened with. Once we
> > know which change caused the problem, it will hopefully provide with
> > more ability to sort out where we're going wrong. You can also play
> > with things like KASAN, although it's not clear to me that they would
> > necessarily catch this problem.
>
> The patch is "mm: security: introduce init_on_alloc=1 and init_on_free=1
> boot options" (commit 6471384af2a6530696fc0203bafe4de41a23c9ef) [1].
>
> It was introduced in kernel version 5.2:
> $ git describe --tags 6471384af2a6530696fc0203bafe4de41a23c9ef
> v5.2-5754-g6471384af2a6

Uff... so you've verified that a kernel at that commit +
init_on_alloc=1 and init_on_free=1 (or the CONFIG_* equivalents) will
cause the problems?

That means the problem has "always been there" (or at least we have no
great way of knowing when it might have been introduced).

I think there's some sort of SLUB debug options that help find
use-after-frees and such. What if you disable the init_on_*, enable
SLUB, and boot with like "slub_debug=FZP"?

  -ilia

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Nouveau] Advice about debugging nouveau driver suspend issue (init_on_alloc=1 and init_on_free=1)
  2021-10-14 18:07     ` Ilia Mirkin
@ 2022-01-24 23:19       ` computer enthusiastic
  2022-08-10 10:41         ` Computer Enthusiastic
  0 siblings, 1 reply; 13+ messages in thread
From: computer enthusiastic @ 2022-01-24 23:19 UTC (permalink / raw)
  To: nouveau

Hello,

On Thu, Oct 14, 2021 at 02:07:33PM -0400, Ilia Mirkin wrote:
> On Thu, Oct 14, 2021 at 12:51 PM Computer Enthusiastic
> <computer.enthusiastic@gmail.com> wrote:
> > [..]
> >
> > The patch is "mm: security: introduce init_on_alloc=1 and init_on_free=1
> > boot options" (commit 6471384af2a6530696fc0203bafe4de41a23c9ef) [1].
> >
> > It was introduced in kernel version 5.2:
> > $ git describe --tags 6471384af2a6530696fc0203bafe4de41a23c9ef
> > v5.2-5754-g6471384af2a6
> 
> Uff... so you've verified that a kernel at that commit +
> init_on_alloc=1 and init_on_free=1 (or the CONFIG_* equivalents) will
> cause the problems?
> 
> That means the problem has "always been there" (or at least we have no
> great way of knowing when it might have been introduced).
> 
> I think there's some sort of SLUB debug options that help find
> use-after-frees and such. What if you disable the init_on_*, enable
> SLUB, and boot with like "slub_debug=FZP"?
> 
>   -ilia

Hello,

First of all, thank you for your answer and advice: it has been a very useful 
starting point. I apologize for the delay in replying, but I've been analysing 
the issue for a long time, trying to acquire more clues. 

I hope it can be useful in addressing the issue with suspend to ram and 
suspend to disk (with 'nouveau' kernel module) when kernel is compiled with 
CONFIG_INIT_ON_ALLOC_DEFAULT_ON and CONFIG_INIT_ON_FREE_DEFAULT_ON (see 
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=989705#5, https://
gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/-/issues/547, https://
bugzilla.kernel.org/show_bug.cgi?id=213617). 

The issue immediately appears during suspension to RAM, if the graphical 
console is active (e.g., CTRL-ALT-F7): the screen image is always garbled with 
the same pattern, regardless of the kernel version (see here https://
gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/uploads/
44ffbbef135538b42a39a2d4dd288624/
After_resume_linux-5.4.60_and_linux-5.7.17.jpg for an example).

The issue does not immediately appears if suspend to ram is started (e.g., 
systemctl suspend) from a text virtual terminal (even if after a much longer 
wait) and it is possible to complete resume: in this case, the system crashes 
(the same way reported in previous paragraph) as soon as the graphical console 
(e.g., with ALT-F7) is activated by the user, anyways.

Below I report the analysis I've done so far: I apologize for its length, but 
I would like to give as much information as possible. 

I followed your advice about slub debug. After some tests, I realised that the 
activation of "slub poisoning" (e.g. kernel option 'slub_debug=FPZ') has the 
same effect of "init_on_alloc=0 init_on_free=0". In fact, it seems that "slub 
poisoning" takes precedence over activation of init_on_alloc and init_on_free. 

This is an excerpt of kernel log when kernel is booted with slubdebug option 
activated (suspend to RAM succeeds):

8<------8<------8<------8<------8<------8<------8<------8<

gen 24 17:37:22 debian kernel: Linux version 5.10.87-dbg-timeout 
(([.,removed..])) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils 
for Debian) 2.35.2) #2 SMP Sat Dec 18 09:02:06 CET 2021
gen 24 17:37:22 debian kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.10.87-
dbg-timeout root=UUID=d4c01dd6-eca5-4311-9ea2-6511cd217dc7 ro 
no_console_suspend loglevel=6 nouveau.debug=info drm.debug=2 
trace_clock=global slub_debug=FPZ nokaslr crashkernel=384M-:128M
[..]
gen 24 17:37:22 debian kernel: mem auto-init: SLAB_POISON will take precedence 
over init_on_alloc/init_on_free
gen 24 17:37:22 debian kernel: Dentry cache hash table entries: 524288 (order: 
10, 4194304 bytes, linear)
gen 24 17:37:22 debian kernel: Inode-cache hash table entries: 262144 (order: 
9, 2097152 bytes, linear)
gen 24 17:37:22 debian kernel: mem auto-init: stack:off, heap alloc:on, heap 
free:off
gen 24 17:37:22 debian kernel: Memory: 205516K/3140468K available (14344K 
kernel code, 4648K rwdata, 4480K rodata, 2488K init, 21384K bss, 282616K 
reserved, 0K cma-reserved)
gen 24 17:37:22 debian kernel: random: get_random_u64 called from 
__kmem_cache_create+0x2c/0x580 with crng_init=0
gen 24 17:37:22 debian kernel: SLUB: HWalign=64, Order=0-3, MinObjects=0, 
CPUs=2, Nodes=1
gen 24 17:37:22 debian kernel: Kernel/User page tables isolation: enabled
gen 24 17:37:22 debian kernel: ftrace: allocating 37278 entries in 146 pages
gen 24 17:37:22 debian kernel: ftrace: allocated 146 pages with 3 groups
gen 24 17:37:22 debian kernel: Running RCU self tests
gen 24 17:37:22 debian kernel: rcu: Hierarchical RCU implementation.
gen 24 17:37:22 debian kernel: rcu:         RCU lockdep checking is enabled.
gen 24 17:37:22 debian kernel: rcu:         RCU restricting CPUs from 
NR_CPUS=8192 to nr_cpu_ids=2.
gen 24 17:37:22 debian kernel:         Rude variant of Tasks RCU enabled.
gen 24 17:37:22 debian kernel:         Tracing variant of Tasks RCU enabled.
gen 24 17:37:22 debian kernel: rcu: RCU calculated value of scheduler-
enlistment delay is 25 jiffies.
gen 24 17:37:22 debian kernel: rcu: Adjusting geometry for rcu_fanout_leaf=16, 
nr_cpu_ids=2
gen 24 17:37:22 debian kernel: NR_IRQS: 524544, nr_irqs: 440, preallocated 
irqs: 16
gen 24 17:37:22 debian kernel: Console: colour VGA+ 80x25
gen 24 17:37:22 debian kernel: printk: console [tty0] enabled
gen 24 17:37:22 debian kernel: Lock dependency validator: Copyright (c) 2006 
Red Hat, Inc., Ingo Molnar
gen 24 17:37:22 debian kernel: ... MAX_LOCKDEP_SUBCLASSES:  8
gen 24 17:37:22 debian kernel: ... MAX_LOCK_DEPTH:          48
gen 24 17:37:22 debian kernel: ... MAX_LOCKDEP_KEYS:        8192
gen 24 17:37:22 debian kernel: ... CLASSHASH_SIZE:          4096
gen 24 17:37:22 debian kernel: ... MAX_LOCKDEP_ENTRIES:     32768
gen 24 17:37:22 debian kernel: ... MAX_LOCKDEP_CHAINS:      65536
gen 24 17:37:22 debian kernel: ... CHAINHASH_SIZE:          32768
gen 24 17:37:22 debian kernel:  memory used by lock dependency info: 6365 kB
gen 24 17:37:22 debian kernel:  memory used for stack traces: 4224 kB
gen 24 17:37:22 debian kernel:  per task-struct memory footprint: 1920 bytes
[..]
8<------8<------8<------8<------8<------8<------8<------8<

Furthermore, the activation of the "sanity checks" and "red zoning" in the 
kernel slub ('F' and 'Z' options in 'slub_debug=FPG') does not make appear 
warning or error messages in system logs when resume from suspension to RAM 
fails. This make me guess that probably the issue is not caused by a kernel 
slub corruption or a memory leak. 

These are the relevant error messages in system logs during suspension to RAM 
(when resume from STR fails):

8<------8<------8<------8<------8<------8<------8<------8<
[..]
  gen 24 18:42:27 debian systemd-sleep[2228]: Suspending system...
  gen 24 18:42:27 debian kernel: PM: suspend entry (deep)
  gen 24 18:42:27 debian kernel: Filesystems sync: 0.045 seconds
  gen 24 18:45:30 debian kernel: Freezing user space processes ... (elapsed 
0.026 seconds) done.
  gen 24 18:45:30 debian kernel: OOM killer disabled.
  gen 24 18:45:30 debian kernel: Freezing remaining freezable tasks ... 
(elapsed 0.003 seconds) done.
  gen 24 18:45:30 debian kernel: sd 0:0:0:0: [sda] Synchronizing SCSI cache
  gen 24 18:45:30 debian kernel: sd 0:0:0:0: [sda] Stopping disk
  gen 24 18:45:30 debian kernel: nouveau 0000:01:00.0: DRM: suspending 
console...
  gen 24 18:45:30 debian kernel: nouveau 0000:01:00.0: DRM: suspending 
display...
  gen 24 18:45:30 debian kernel: nouveau 0000:01:00.0: DRM: evicting 
buffers...
>>gen 24 18:45:30 debian kernel: [TTM] Buffer eviction failed 
  gen 24 18:45:30 debian kernel: nouveau 0000:01:00.0: DRM: waiting for kernel 
channels to go idle...
  gen 24 18:45:30 debian kernel: nouveau 0000:01:00.0: DRM: suspending 
fence...
  gen 24 18:45:30 debian kernel: nouveau 0000:01:00.0: DRM: suspending object 
tree...
  gen 24 18:45:30 debian kernel: ACPI: EC: interrupt blocked
  gen 24 18:45:30 debian kernel: ACPI: Preparing to enter system sleep state 
S3
  gen 24 18:45:30 debian kernel: ACPI: EC: event blocked
  gen 24 18:45:30 debian kernel: ACPI: EC: EC stopped
  gen 24 18:45:30 debian kernel: PM: Saving platform NVS memory
  gen 24 18:45:30 debian kernel: Disabling non-boot CPUs ...
  gen 24 18:45:30 debian kernel: smpboot: CPU 1 is now offline
  gen 24 18:45:30 debian kernel: ACPI: Low-level resume complete
  gen 24 18:45:30 debian kernel: ACPI: EC: EC started
  gen 24 18:45:30 debian kernel: PM: Restoring platform NVS memory
  gen 24 18:45:30 debian kernel: Enabling non-boot CPUs ...
  gen 24 18:45:30 debian kernel: x86: Booting SMP configuration:
  gen 24 18:45:30 debian kernel: smpboot: Booting Node 0 Processor 1 APIC 0x1
  gen 24 18:45:30 debian kernel: CPU1 is up
  gen 24 18:45:30 debian kernel: ACPI: Waking up from system sleep state S3
  gen 24 18:45:30 debian kernel: ACPI: EC: interrupt unblocked
  gen 24 18:45:30 debian kernel: nouveau 0000:01:00.0: DRM: resuming object 
tree...
  gen 24 18:45:30 debian kernel: ACPI: EC: event unblocked
[..]
8<------8<------8<------8<------8<------8<------8<------8<

As you can see, there is a *"[TTM] Buffer eviction failed"* error message from 
the kernel during suspension to RAM. 

The "Buffer eviction failed" error message always appears (with my hardware: 
Intel Core 2 Duo, nvidia GEForce 9600M GT) during suspension to ram after that 
the kernel is booted with init_on_alloc activated (or without SLUB_debug='FPZ' 
option); it is *constantly replicable* from different kernel versions, for 
example from 5.10 (e.g., in Debian Bullseye 11.0, 11.1, 11.2 and in vanilla 
versions up to 5.16, the latter tested few days ago).  It probably started 
with kernel version 5.8.

The following tests and logs are from a custom kernel built from vanilla 
version 5.10.87 with debug options enabled; I used (as base) the kernel's 
.config from the 
latest Debian kernel from release 11.0 codenamed "Bullseye" (linux-
image-5.10.0-10-amd64, kernel 5.10.84) and then activated further kernel debug 
option. The user space installation is from a Debian 11.2.

8<------8<------8<------8<------8<------8<------8<------8<

$ uname -a
Linux debian 5.10.87-dbg-timeout #2 SMP Sat Dec 18 09:02:06 CET 2021 x86_64 
GNU/Linux

$ lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 11 (bullseye)
Release:        11
Codename:       bullseye

$ cat /etc/debian_version 
11.2

# inxi -mMG
Machine:   Type: Laptop System: Acer product: Aspire 5930 v: 0100 serial: 
<superuser required> 
           Mobo: Acer model: Eiger v: Rev serial: <superuser required> BIOS: 
Phoenix v: 1.21 date: 08/22/2008 
Memory:    RAM: total: 2.77 GiB used: 2.27 GiB (82.1%) 
           RAM Report: permissions: Unable to run dmidecode. Root privileges 
required. 
Graphics:  Device-1: NVIDIA G96CM [GeForce 9600M GT] driver: nouveau v: kernel 
           Device-2: Suyin Acer HD Crystal Eye webcam type: USB driver: 
uvcvideo 
           Display: x11 server: X.Org 1.20.11 driver: loaded: modesetting 
unloaded: fbdev,vesa resolution: 1280x800~60Hz 
           OpenGL: renderer: NV96 v: 3.3 Mesa 20.3.5 
Machine:   Type: Laptop System: Acer product: Aspire 5930 v: 0100 serial: 
[..removed..] 
           Mobo: Acer model: Eiger v: Rev serial: [..removed..]  BIOS: Phoenix 
v: 1.21 date: 08/22/2008 
Memory:    RAM: total: 2.79 GiB used: 1.37 GiB (49.3%) 
           Array-1: capacity: 4 GiB slots: 2 EC: None 
           Device-1: M1 size: 2 GiB speed: 667 MT/s 
           Device-2: M2 size: 2 GiB speed: 667 MT/s 
Graphics:  Device-1: NVIDIA G96CM [GeForce 9600M GT] driver: nouveau v: kernel 
           Device-2: Suyin Acer HD Crystal Eye webcam type: USB driver: 
uvcvideo 
           Display: server: X.org 1.20.11 driver: loaded: modesetting 
unloaded: fbdev,vesa tty: 178x46 
           OpenGL: renderer: NV96 v: 3.3 Mesa 20.3.5 

8<------8<------8<------8<------8<------8<------8<------8<

I traced the kernel function nouveau_do_suspend() [1] as a starting point in 
the analysis of nouveau kernel module for "Suspend To Ram" ("STR", from now 
on) . 

I recorded the graph of kernel functions called by nouveau_do_suspend() after 
starting STR and up to the beginning of resume (I used the ftrace kernel 
feature [2] [3]). 

The ftrace recording were done in two conditions:

    1) kernel started *with* parameter debug_slub=FPZ (in this case, STR 
*works*); note: the kernel parameter debug_slub=FPZ [4] take precedence over 
init_on_alloc kernel parameter (which is therefore disabled in this case) [5]. 

    2) kernel started *without* debug_slub=FPZ (in this case, STR *fails* 
after resume); note: as previously mentioned, init_on_alloc and init_on_free 
are enabled in Debian kernel by default starting from release 11.0 codenamed 
"Bullseye" (CONFIG_INIT_ON_ALLOC_DEFAULT_ON and 
CONFIG_INIT_ON_FREE_DEFAULT_ON).
    
The analysis of time spent by involved kernel functions when STR fails, 
pointed me to function named ttm_resource_manager_force_list_clean() [6]. 

I generated three files for each recording (using ftrace with graph plugin) 
for each of the two aforementioned conditions:

    a) a graph of functions call from ftrace kernel buffer in /sys/kernel/
debug/tracing/trace as text log file

    b) a binary raw ftrace recording (with .dat extension, for subsequent 
analyzes)

    c) a profile of functions traced (with overall time spent by kernel in 
each function) as text log file 
    
Then, I compared the profile of called function when STR fails to the one when 
STR works (see the attached script join_traces.sh). 

I generated a comparison file that I reworked using a 
spreadsheet (functions calls are sorted by total time in descending order). 

The most time consuming functions called by 
ttm_resource_manager_force_list_clean() when STR *fails* are (in descending 
order):

NUM.CALLS   TOTAL TIME     AVG TIME   FUNCTION (note: STR *fails*, usec)
=========  ===========   ===========  =======================================
(1)        11436084627   11436084627  ttm_resource_manager_force_list_clean() 
(10)       11435990310    1143599031  ttm_mem_evict_first()
(10)       11435753724    1143575372  ttm_bo_evict()
(10)       11433592700    1143359270  ttm_bo_handle_move_mem()
(10)       11381408793    1138140879  nouveau_bo_move()
(6)        10379433308    1729905551  ttm_bo_move_memcpy()
(2256)     10367683828       4595604  memcpy_fromio()
(10)        1001119248     100111924  nouveau_bo_move_m2mf.constprop.0()
(10)         995441053      99544105  ttm_bo_move_accel_cleanup()
(38)         994614055      26174054  ttm_bo_wait()
(38)         994403882      26168523  >> dma_resv_wait_timeout_rcu() <<
(6)          993363983     165560663  dma_fence_wait_timeout()
(6)          993285402     165547567  dma_fence_default_wait()
(2596)       137206543         52853  __sysvec_apic_timer_interrupt()
(2596)       129927214         50049  hrtimer_interrupt()
(2595)        91271707         35172  __hrtimer_run_queues()
(2594)        74417961         28688  tick_sched_timer()
(15)          47315477       3154365  ttm_tt_populate()
(10)          47287068       4728706  nouveau_ttm_tt_populate()
(10)          47268974       4726897  ttm_populate_and_map_pages()
(10)          47068753       4706875  ttm_pool_populate()
(12)          30015421       2501285  ttm_page_pool_get_pages()
(4)           29847966       7461991  ttm_alloc_new_pages.isra.0()
(6)           25147308       4191218  ttm_set_pages_caching()
(30758)       17361172           564  rcu_read_lock_sched_held()
(2283)        16058655          7034  ttm_mem_global_alloc_page()
(2283)        14317071          6271  ttm_mem_global_alloc_zone()


The time consumed by the same previous funcions in the same order, but when 
STR *succeeds* is:

NUM.CALLS   TOTAL TIME     AVG TIME   FUNCTION (note: STR *fails*, usec)
=========  ===========   ===========  =======================================
(1)          501338026     501338026  ttm_resource_manager_force_list_clean() 
(119)        500532532       4206155  ttm_mem_evict_first()
(119)        497645267       4181893  ttm_bo_evict()
(119)        479173966       4026671  ttm_bo_handle_move_mem()
(119)        186045609       1563408  nouveau_bo_move()
                                      ttm_bo_move_memcpy()
                                      memcpy_fromio()
(119)        183246904       1539889  nouveau_bo_move_m2mf.constprop.0()
(119)        122513397       1029524  ttm_bo_move_accel_cleanup()
(311)         82506104        265292  ttm_bo_wait()
(311)         80733662        259593  >> dma_resv_wait_timeout_rcu() <<
(80)          75522674        944033  dma_fence_wait_timeout()
(80)          74936515        936706  dma_fence_default_wait()
(34)           1565271         46037  __sysvec_apic_timer_interrupt()
(17)            810651         47685  hrtimer_interrupt()
(13)            474341         36487  __hrtimer_run_queues()
(10)            302783         30278  tick_sched_timer()
(119)        240333865       2019612  ttm_tt_populate()
(119)        240037719       2017123  nouveau_ttm_tt_populate()
(119)        239856293       2015599  ttm_populate_and_map_pages()
(119)        224154269       1883649  ttm_pool_populate()
(144)         25454062        176764  ttm_page_pool_get_pages()
(20)          22598464       1129923  ttm_alloc_new_pages.isra.0()
(4)           13932500       3483125  ttm_set_pages_caching()
(7141)         7724863          1081  rcu_read_lock_sched_held()
(26455)      187335946          7081  ttm_mem_global_alloc_page()
(26455)      168008932          6350  ttm_mem_global_alloc_zone()


As you can see, when STR succeeds (because init_on_alloc disabled) compared 
with when it fails (because init_on_allloc is enabled):

a) the count of function calls is much higher for ttm* and nouveau* functions 
and the total time spent by kernel for ttm_resource_manager_force_list_clean() 
and subsequent calls is lower (total time: 501338026 vs 11436084627): this let 
me suppose that code flows and it is not stuck somewhere (when init_on_alloc 
is disabled); this is confirmed in subsequent point c)

b) the ttm_bo_move_memcpy() and memcpy_fromio() are not executed: this let me 
suppose that the sequence of funcions call is somewhere different (after a 
common starting point): this is confirmed by subsequent analysis of kernel 
function's trace

c) the number of call for dma_resv_wait_timeout_rcu() is much lower (10 times) 
and the average duration of the call is much lower (259593 usec vs 26168523 
usec): this makes me suppose that the kernel is stuck in accessing to shared 
DMA resources (GPU's memory ? deadlock ?) when suspension to RAM fails; 


This is the kernel graph trace starting from 
ttm_resource_manager_force_list_clean (when STR succeeds or fails):

8<------8<------8<------8<------8<------8<------8<------8<
# kprobe-perf -s 'p:ttm_resource_manager_force_list_clean'

Tracing kprobe ttm_resource_manager_force_list_clean. Ctrl-C to end.
    kworker/u4:9-4234    [001] ....  1317.593113: 
ttm_resource_manager_force_list_clean: 
(ttm_resource_manager_force_list_clean+0x0/0x1d0 [ttm])
    kworker/u4:9-4234    [001] ....  1317.593126: <stack trace>
 => ttm_resource_manager_force_list_clean
 => nouveau_do_suspend
 => nouveau_pmops_suspend
 => pci_pm_suspend
 => dpm_run_callback
 => __device_suspend
 => async_suspend
 => async_run_entry_fn
 => process_one_work
 => worker_thread
 => kthread
 => ret_from_fork
8<------8<------8<------8<------8<------8<------8<------8<

Then, I traced (using ftrace [2]) some specific kernel functions and probed 
the return values for some of them.

These are the functions called when suspend to RAM fails (with reference to 
source code at the URL https://elixir.bootlin.com/linux/v5.10.87/source):

8<------8<------8<------8<------8<------8<------8<------8<
nouveau_pmops_suspend [nouveau] -> nouveau_do_suspend
https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/nouveau/
nouveau_drm.c#L945
https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/nouveau/
nouveau_drm.c#L841
|
+-> nouveau_do_suspend -> ttm_bo_evict_mm
  https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/nouveau/
nouveau_drm.c#L860
  https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/
ttm_bo.c#L1308
  |
  +-> ttm_bo_evict_mm -> ttm_manager_type 
  https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/
ttm_bo.c#L1310
  https://elixir.bootlin.com/linux/v5.10.87/source/include/drm/ttm/
ttm_bo_driver.h#L355
  |
  +-> ttm_bo_evict_mm -> ttm_resource_manager_force_list_clean 
  https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/
ttm_bo.c#L1322
  https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/
ttm_resource.c#L86
    |
    +-> ttm_resource_manager_force_list_clean -> ttm_mem_evict_first 
    https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/
ttm_resource.c#L107
    https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/
ttm_bo.c#L725
           ==> note: ttm_resource_manager_force_list_clean:104  <==
           ==>       here starts a loop for buffer eviction     <==
    https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/
ttm_resource.c#L104
      |
      +-> ttm_mem_evict_first -> ttm_bo_evict 
      https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/
ttm_bo.c#L789
      https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/
ttm_bo.c#L598
        |
        +-> ttm_bo_evict -> ttm_bo_handle_move_mem
        https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/
ttm_bo.c#L598
        https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/
ttm_bo.c#L235
            |
            +-> ttm_bo_handle_move_mem -> ttm_mem_io_free+0x0/0x40
            https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/
ttm/ttm_bo.c#L244
            https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/
ttm/ttm_bo.c#L1471
            |
            +-> ttm_bo_handle_move_mem -> 
tm_tt_set_placement_caching+0x0/0x110
            https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/
ttm/ttm_bo.c#L258
            https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/
ttm/ttm_tt.c#L195
            |
            +-> ttm_bo_handle_move_mem -> ttm_tt_populate+0x0/0x80
            https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/
ttm/ttm_bo.c#L265
            https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/
ttm/ttm_tt.c#L422
            |
            |-> ttm_bo_handle_move_mem -> nouveau_bo_move_ntfy 
            https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/
ttm/ttm_bo.c#L280
            https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/
nouveau/nouveau_bo.c#L1004
            |
            +-> ttm_bo_handle_move_mem -> nouveau_bo_move
            https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/
ttm/ttm_bo.c#L286
            https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/
nouveau/nouveau_bo.c#L1053
              |
              +-> nouveau_bo_move -> ttm_bo_wait
              https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/
drm/nouveau/nouveau_bo.c#L1063
              https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/
drm/ttm/ttm_bo.c#L1480
              |
              +-> nouveau_bo_move -> nouveau_bo_move_m2mf.constprop
              | https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/
drm/nouveau/nouveau_bo.c#L1093
              | https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/
drm/nouveau/nouveau_bo.c#L798
              | |
              | +-> nouveau_bo_move_m2mf.constprop -> 
ttm_bo_move_accel_cleanup
              | https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/
drm/nouveau/nouveau_bo.c#L824
              | https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/
drm/ttm/ttm_bo_util.c#L606
              | | |
              | | |   ** HARDWARE ASSISTED COPY **
              | | +-> ttm_bo_move_accel_cleanup -> dma_resv_add_excl_fence
              | | |   https://elixir.bootlin.com/linux/v5.10.87/source/
drivers/gpu/drm/ttm/ttm_bo_util.c#L617
              | | |   https://elixir.bootlin.com/linux/v5.10.87/source/
drivers/gpu/drm/ttm/ttm_bo_util.c#L617
              | | | 
HERE FLOW ==> | | +-> ttm_bo_move_accel_cleanup -> ttm_bo_wait_free_node
DIVERGES  ==> | | https://elixir.bootlin.com/linux/v5.10.87/source/drivers/
gpu/drm/ttm/ttm_bo_util.c#L623
WITH INIT ==> | | https://elixir.bootlin.com/linux/v5.10.87/source/drivers/
gpu/drm/ttm/ttm_bo_util.c#L529
ON ALLOC IS=> | | |  |
ACTIVATED ==> | | |  +-> ttm_bo_wait_free_node -> ttm_bo_wait
              | | |  https://elixir.bootlin.com/linux/v5.10.87/source/drivers/
gpu/drm/ttm/ttm_bo_util.c#L533
              | | |    |   
              | | |    +-> ttm_bo_wait -> dma_resv_wait_timeout_rcu 
              | | |    https://elixir.bootlin.com/linux/v5.10.87/source/
drivers/gpu/drm/ttm/ttm_bo.c#L1492
              | | |    https://elixir.bootlin.com/linux/v5.10.87/source/
drivers/dma-buf/dma-resv.c#L532
              | | |    |
              | | |    +-> dma_resv_wait_timeout_rcu-> dma_fence_wait_timeout
              | | |    https://elixir.bootlin.com/linux/v5.10.87/source/
drivers/dma-buf/dma-resv.c#L594
              | | |      |
              | | |      +-> dma_fence_wait_timeout -> dma_fence_default_wait
              | | |      https://elixir.bootlin.com/linux/v5.10.87/source/
drivers/dma-buf/dma-fence.c#L425
              | | |      https://elixir.bootlin.com/linux/v5.10.87/source/
drivers/dma-buf/dma-fence.c#L674
              | | | 
              | | |      ** FAILING ACCELERATED COPY (TIMEOUT IN WAITING RCU) 
**
              | | | 
              | | |
              | +-+<- dma_fence_default_wait -> ttm_bo_move_accel_cleanup
              | |     https://elixir.bootlin.com/linux/v5.10.87/source/
drivers/gpu/drm/ttm/ttm_bo_util.c#L626
              | |     return arg1=0xfffffff0 (-EBUSY)
              | |    
              +-+<- ttm_bo_move_accel_cleanup -> 
nouveau_bo_move_m2mf.constprop
              |     https://elixir.bootlin.com/linux/v5.10.87/source/drivers/
gpu/drm/nouveau/nouveau_bo.c#L824
              |     return arg1=0xfffffff0 (-EBUSY)
              |
              +<-- nouveau_bo_move_m2mf.constprop -> nouveau_bo_move 
              |     https://elixir.bootlin.com/linux/v5.10.87/source/drivers/
gpu/drm/nouveau/nouveau_bo.c#L1093
              |     https://elixir.bootlin.com/linux/v5.10.87/source/drivers/
gpu/drm/nouveau/nouveau_bo.c#L1097)
              |     return arg1=0xfffffff0 (-EBUSY)
              |
              |   ** FALL BACK TO SOFTWARE COPY **
              +-> nouveau_bo_move -> ttm_bo_wait
              https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/
drm/nouveau/nouveau_bo.c#L1101
              |
              +-> nouveau_bo_move -> ttm_bo_move_memcpy
              https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/
drm/nouveau/nouveau_bo.c#L1103
              https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/
drm/ttm/ttm_bo_util.c#L217
                |
                +-> ttm_bo_move_memcpy -> ttm_bo_wait
                https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/
drm/ttm/ttm_bo_util.c#L234
                |
                +-> ttm_bo_move_memcpy -> ttm_resource_ioremap 
                https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/
drm/ttm/ttm_bo_util.c#L238
                |
                +-> ttm_bo_move_memcpy -> ttm_resource_ioremap 
                https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/
drm/ttm/ttm_bo_util.c#L241
                |
                +-> ttm_bo_move_memcpy -> ttm_tt_populate
                https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/
drm/ttm/ttm_bo_util.c#L265
                |
                +-> HERE START A LOOP
                +-> https://elixir.bootlin.com/linux/v5.10.87/source/drivers/
gpu/drm/ttm/ttm_bo_util.c#L279
                | |
                | +-> ttm_bo_move_memcpy -> ttm_copy_io_ttm_page 
                | https://elixir.bootlin.com/linux/v5.10.87/source/drivers/
gpu/drm/ttm/ttm_bo_util.c#L289
                | https://elixir.bootlin.com/linux/v5.10.87/source/drivers/
gpu/drm/ttm/ttm_bo_util.c#L173
                |   |
                |   +-> memcpy_fromio [REPEATED IN LOOP]
                |   https://elixir.bootlin.com/linux/v5.10.87/source/drivers/
gpu/drm/ttm/ttm_bo_util.c#L289
                |    
                |     ***********************************************
                |     ** SOFTWARE COPY FAILS AFTER MANY ITERATIONS **
                |     **        OF memcpy_fromio()                 **
                |     ***********************************************
                |        
                |<--- (nouveau_bo_move+0x346/0x740 [nouveau] <- 
ttm_bo_move_memcpy) arg1=0xffffffea
                |
            +--<+-nouveau_bo_move -> ttm_bo_handle_move_mem 
            |     https://elixir.bootlin.com/linux/v5.10.87/source/drivers/
gpu/drm/nouveau/nouveau_bo.c#L1065
            |     https://elixir.bootlin.com/linux/v5.10.87/source/drivers/
gpu/drm/ttm/ttm_bo.c#L287
            |     return arg1=0xfffffff0 (-EBUSY)
            |    
    +--<----+<--ttm_bo_handle_move_mem -> ttm_bo_evict 
    |           https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/
drm/ttm/ttm_bo.c#L304
    |           https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/
drm/ttm/ttm_bo.c#L637
    |           return arg1=0xfffffff0 (-EBUSY)
    |            
    |                       ************************************
    |                       ** Buffer eviction failed MESSAGE **
    |                       ************************************
    |
    +-> ttm_resource_manager_force_list_clean -> ttm_mem_evict_first
    |   https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/
ttm_resource.c#L86
    |  
    +-> ttm_mem_evict_first -> tm_bo_evict_mm
    |   https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/
ttm_bo.c#L790
    |
    +-> tm_bo_evict_mm -> nouveau_do_suspend
        https://elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/
ttm_bo.c#L1308 
8<------8<------8<------8<------8<------8<------8<------8<
                        
The function ttm_bo_wait_free_node() is where the flow diverges between 
suspend to ram success or failure. 

When the kernel feature CONFIG_INIT_ON_ALLOC_DEFAULT_ON is active:
    a) the function dma_resv_wait_timeout_rcu() times out in ttm_bo_wait()
    b) then, the function ttm_bo_wait() return 0xfffffff0 (corresponding to -
EBUSY) to its callers
    
The ftrace log reports (excerpt):

8<------8<------8<------8<------8<------8<------8<------8<

  #                         _-----=> irqs-off
  #                        / _----=> need-resched
  #                        | / _---=> hardirq/softirq
  #                        || / _--=> preempt-depth
  #                        ||| /     delay
  #     TASK-PID     CPU#  ||||   TIMESTAMP  FUNCTION
  #        | |         |   ||||      |         |
  kworker/u4:11-2068 [000] d..2 210.090141: probe4: 
(ttm_bo_move_accel_cleanup+0x1ba/0x510 [ttm] <- ttm_bo_wait) arg1=0xfffffff0 
  kworker/u4:11-2068 [000] d..2 210.090144: probe3: 
(nouveau_bo_move_m2mf.constprop.0+0x1a5/0x200 [nouveau] <- 
ttm_bo_move_accel_cleanup) arg1=0xfffffff0 
  kworker/u4:11-2068 [000] d..2 210.090147: probe1: (nouveau_bo_move+0x39c/
0x740 [nouveau] <- nouveau_bo_move_m2mf.constprop.0) arg1=0xfffffff0 

8<------8<------8<------8<------8<------8<------8<------8<

This is the code for function ttm_bo_wait() in drivers/gpu/drm/ttm/ttm_bo.c:

   [..]
   1480 int ttm_bo_wait(struct ttm_buffer_object *bo,
   1481                 bool interruptible, bool no_wait)
   1482 {
   1483         long timeout = 15 * HZ;
   1484 
   1485         if (no_wait) {
   1486                 if (dma_resv_test_signaled_rcu(bo->base.resv, true))
   1487                         return 0;
   1488                 else
   1489                         return -EBUSY;
   1490         }
   1491 
>> 1492         timeout = dma_resv_wait_timeout_rcu(bo->base.resv, true,
   1493                                                       interruptible, 
timeout);
   1494         if (timeout < 0)
   1495                 return timeout;
   1496 
>> 1497         if (timeout == 0)
>> 1498                 return -EBUSY;
   1499 
   1500         dma_resv_add_excl_fence(bo->base.resv, NULL);
   1501         return 0;
   1502 }
   1503 EXPORT_SYMBOL(ttm_bo_wait);
   [..]

   
The -EBUSY value is returned from ttm_bo_wait() to ttm_bo_wait_free_node(), 
then to ttm_bo_move_accel_cleanup(), then to nouveau_bo_move_m2mf.constprop(), 
then to nouveau_bo_move() at line 1093 (see code below) and this activates in 
nouveau_bo_move() the fall back procedure from "hardware assisted copy" to 
"software copy" at line 1103 in /drivers/gpu/drm/nouveau/nouveau_bo.c (see 
code below): 

   [..]
   1053 nouveau_bo_move(struct ttm_buffer_object *bo, bool evict,
   1054                 struct ttm_operation_ctx *ctx,
   1055                 struct ttm_resource *new_reg)
   1056 {
   1057         struct nouveau_drm *drm = nouveau_bdev(bo->bdev);
   1058         struct nouveau_bo *nvbo = nouveau_bo(bo);
   1059         struct ttm_resource *old_reg = &bo->mem;
   1060         struct nouveau_drm_tile *new_tile = NULL;
   1061         int ret = 0;
   1062 
   1063         ret = ttm_bo_wait(bo, ctx->interruptible, ctx->no_wait_gpu);
   1064         if (ret)
   1065                 return ret;
   1066 
   1067         if (nvbo->pin_refcnt)
   1068                 NV_WARN(drm, "Moving pinned object %p!\n", nvbo);
   1069 
   1070         if (drm->client.device.info.family < NV_DEVICE_INFO_V0_TESLA) 
{
   1071                 ret = nouveau_bo_vm_bind(bo, new_reg, &new_tile);
   1072                 if (ret)
   1073                         return ret;
   1074         }
   1075 
   1076         /* Fake bo copy. */
   1077         if (old_reg->mem_type == TTM_PL_SYSTEM && !bo->ttm) {
   1078                 ttm_bo_move_null(bo, new_reg);
   1079                 goto out;
   1080         }
   1081 
>> 1082         /* Hardware assisted copy. */
   1083         if (drm->ttm.move) {
   1084                 if (new_reg->mem_type == TTM_PL_SYSTEM)
   1085                         ret = nouveau_bo_move_flipd(bo, evict,
   1086                                                     ctx-
>interruptible,
   1087                                                     ctx->no_wait_gpu, 
new_reg);
   1088                 else if (old_reg->mem_type == TTM_PL_SYSTEM)
   1089                         ret = nouveau_bo_move_flips(bo, evict,
   1090                                                     ctx-
>interruptible,
   1091                                                     ctx->no_wait_gpu, 
new_reg);
   1092                 else
>> 1093                         ret = nouveau_bo_move_m2mf(bo, evict,
>> 1094                                                    ctx->interruptible,
>> 1095                                                    ctx->no_wait_gpu, 
new_reg);
   1096                 if (!ret)
   1097                         goto out;
   1098         }
   1099 
>> 1100         /* Fallback to software copy. */
>> 1101         ret = ttm_bo_wait(bo, ctx->interruptible, ctx->no_wait_gpu);
>> 1102         if (ret == 0)
>> 1103                 ret = ttm_bo_move_memcpy(bo, ctx, new_reg);
   1104 
   1105 out:
   1106         if (drm->client.device.info.family < NV_DEVICE_INFO_V0_TESLA) 
{
   1107                 if (ret)
   1108                         nouveau_bo_vm_cleanup(bo, NULL, &new_tile);
   1109                 else
   1110                         nouveau_bo_vm_cleanup(bo, new_tile, &nvbo-
>tile);
   1111         }
   1112 
>> 1113         return ret;
   1114 }
   [..]

The loop inside the function ttm_resource_manager_force_list_clean() (at line 
104 in drivers/gpu/drm/ttm/resource.c) makes some iterations, then it stops 
when the function ttm_bo_handle_move_mem() fails and ttm_mem_evict_first() 
return an error value:

    [..]
     78  * ttm_resource_manager_force_list_clean
     79  *
     80  * @bdev - device to use
     81  * @man - manager to use
     82  *
     83  * Force all the objects out of a memory manager until clean.
     84  * Part of memory manager cleanup sequence.
     85  */
     86 int ttm_resource_manager_force_list_clean(struct ttm_bo_device *bdev,
     87                                           struct ttm_resource_manager 
*man)
     88 {
     89         struct ttm_operation_ctx ctx = {
     90                 .interruptible = false,
     91                 .no_wait_gpu = false,
     92                 .flags = TTM_OPT_FLAG_FORCE_ALLOC
     93         };
     94         struct ttm_bo_global *glob = &ttm_bo_glob;
     95         struct dma_fence *fence;
     96         int ret;
     97         unsigned i;
     98 
     99         /*
    100          * Can't use standard list traversal since we're unlocking.
    101          */
    102 
    103         spin_lock(&glob->lru_lock);
 >> 104         for (i = 0; i < TTM_MAX_BO_PRIORITY; ++i) {
    105                 while (!list_empty(&man->lru[i])) {
    106                         spin_unlock(&glob->lru_lock);
 >> 107                         ret = ttm_mem_evict_first(bdev, man, NULL, 
&ctx,
 >> 108                                                   NULL);
 >> 109                         if (ret)
 >> 110                                 return ret;
    111                         spin_lock(&glob->lru_lock);
    112                 }
    113         }
    114         spin_unlock(&glob->lru_lock);
    115 
    116         spin_lock(&man->move_lock);
    117         fence = dma_fence_get(man->move);
    118         spin_unlock(&man->move_lock);
    119 
    120         if (fence) {
    121                 ret = dma_fence_wait(fence, false);
    122                 dma_fence_put(fence);
    123                 if (ret)
    124                         return ret;
    125         }
    126 
    127         return 0;
    128 }
    129 EXPORT_SYMBOL(ttm_resource_manager_force_list_clean);
    [..]

The "Buffer eviction failed" error message is returned by function 
ttm_bo_evict() at line 637 in /drivers/gpu/drm/ttm/ttm_bo.c when 
ttm_bo_handle_move_mem() fails:

    [..]
    598 static int ttm_bo_evict(struct ttm_buffer_object *bo,
    599                         struct ttm_operation_ctx *ctx)
    600 {
    601         struct ttm_bo_device *bdev = bo->bdev;
    602         struct ttm_resource evict_mem;
    603         struct ttm_placement placement;
    604         int ret = 0;
    605 
    606         dma_resv_assert_held(bo->base.resv);
    607 
    608         placement.num_placement = 0;
    609         placement.num_busy_placement = 0;
    610         bdev->driver->evict_flags(bo, &placement);
    611 
    612         if (!placement.num_placement && !placement.num_busy_placement) 
{
    613                 ttm_bo_wait(bo, false, false);
    614 
    615                 ttm_bo_cleanup_memtype_use(bo);
    616                 return ttm_tt_create(bo, false);
    617         }
    618 
    619         evict_mem = bo->mem;
    620         evict_mem.mm_node = NULL;
    621         evict_mem.bus.offset = 0;
    622         evict_mem.bus.addr = NULL;
    623 
    624         ret = ttm_bo_mem_space(bo, &placement, &evict_mem, ctx);
    625         if (ret) {
    626                 if (ret != -ERESTARTSYS) {
    627                         pr_err("Failed to find memory space for buffer 
0x%p eviction\n",
    628                                bo);
    629                         ttm_bo_mem_space_debug(bo, &placement);
    630                 }
    631                 goto out;
    632         }
    633 
>>  634         ret = ttm_bo_handle_move_mem(bo, &evict_mem, true, ctx);
>>  635         if (unlikely(ret)) {
>>  636                 if (ret != -ERESTARTSYS)
>>  637                         pr_err("Buffer eviction failed\n");
>>  638                 ttm_resource_free(bo, &evict_mem);
>>  639         }
    640 out:
    641         return ret;
    642 }
    [..]

So, this is what I've found so far tracing the kernel.

The activation of the debug features [14][15] didn't show useful error or 
warning messages in system logs about memory leaks or kernel oops/warnings 
related to failure of GPU buffers eviction.

The only interesting message (so far) appears in kernel 5.16, when, during 
suspension to RAM, a warning about a "possible circular locking dependency 
detected" is shown (see attachment linux-5.16-lock-circular-dump.log.txt). But 
this warning appears regardless the computer is activated with or without 
activation of init_on_alloc and init_on_free kernel features: it appears both 
in success or in failure of suspension to RAM.

Now, just a small recap. 

The issue is triggered by the activation of init_on_alloc and init_on_free 
kernel feature (that are activated by default in the lasted stable Debian 
release, for example).

I suspect that, during suspend to RAM, the issue is caused by:

    1) failure of buffers eviction for the nvidia GPU (the copy of video 
memory and GPU's pipelines to RAM, I suppose) 

    2) then, the kernel suspends the computer anyway, leaving the internal 
data structures of the nouveau GPU module (and/or of the DRM module) in an 
inconsistent state

    3) then, if the graphical console is active/activated (e.g. ALT-F7), the 
nouveau driver instantly fails (probably due to the inconsistent state of its 
internal data structures)

    4) then, the kernel hangs the computer with multiple repeated errors, 
making the system unusable.

The issue is constantly replicable, therefore I can carry out further tests, 
if you wish to suggest them.

I'm still looking for clues, comments and advices to understand why buffer 
eviction fails: I would like to find (or help someone else to find) a stable 
fix for this issue.

The detailed logs are available on request (I cannot post more than 100 KB on 
nouveau mailing list).

Thanks for any advice or comment.

8<------8<------8<------8<------8<------8<------8<------8<

References:
    
[1]  function nouveau_do_suspend(), see https://elixir.bootlin.com/linux/
v5.10.87/source/drivers/gpu/drm/nouveau/
nouveau_drm.c#L841===========================================================
[2]  https://www.kernel.org/doc/html/v5.10/trace/ftrace.html ; https://
manpages.debian.org/bullseye/trace-cmd/trace-cmd.1.en.html
[3]  https://jvns.ca/blog/2017/03/19/getting-started-with-ftrace/
[4]  https://www.kernel.org/doc/html/v5.10/vm/slub.html
[5]  https://elixir.bootlin.com/linux/v5.10.50/source/mm/page_alloc.c#L191
[6]  function ttm_resource_manager_force_list_clean(): https://
elixir.bootlin.com/linux/v5.10.87/source/drivers/gpu/drm/ttm/
ttm_resource.c#L86
[7]  function ttm_bo_move_accel_cleanup(), see https://elixir.bootlin.com/
linux/v5.10.50/source/drivers/gpu/drm/ttm/ttm_bo_util.c#L606
[8]  function ttm_bo_wait, see https://elixir.bootlin.com/linux/v5.10.50/
source/drivers/gpu/drm/ttm/ttm_bo_util.c#L529
[9]  function dma_resv_wait_timeout_rcu, see https://elixir.bootlin.com/linux/
v5.10.50/source/drivers/dma-buf/dma-resv.c#L532
[10] function dma_fence_default_wait, see https://elixir.bootlin.com/linux/
v5.10.50/source/drivers/dma-buf/dma-fence.c#L674
[11] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=989705#5
[12] https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/-/issues/
547
[13] https://bugzilla.kernel.org/show_bug.cgi?id=213617
[14] https://www.kernel.org/doc/html/v4.10/dev-tools/kmemleak.html
[15] https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt




^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Nouveau] Advice about debugging nouveau driver suspend issue (init_on_alloc=1 and init_on_free=1)
  2022-01-24 23:19       ` computer enthusiastic
@ 2022-08-10 10:41         ` Computer Enthusiastic
  2022-08-10 11:59           ` Karol Herbst
  0 siblings, 1 reply; 13+ messages in thread
From: Computer Enthusiastic @ 2022-08-10 10:41 UTC (permalink / raw)
  To: nouveau

Hello,

The "supend to ram" issue (system freeze when resuming after suspend
to ram or disk hibernation) is still affecting the nouveau kernel
module from linux version 5 and above with nvidia GeForce 9600M GT and
other nvidia graphic card with the same chipset [1][2][3].

I've been successfully using the "work in progress" patch by Karol
Herbst [4] to solve the "supend to ram" issue since it was proposed
for an apparent different issue ("[GT216][Linux 5.16.4] Long stalls in
dma_fence_default_wait") [5] whose root cause is probably the same.

I send this message to bring to your attention the aforementioned
patch [4] (below in the message as text) to evaluate it for  the next
steps for possible inclusion  in the upstream kernel version.

Thanks.

[1] https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/-/issues/547
[2] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=989705#5
[3] https://bugzilla.kernel.org/show_bug.cgi?id=213617
[4] https://gitlab.freedesktop.org/drm/nouveau/-/issues/156#note_1385770
[5] https://gitlab.freedesktop.org/drm/nouveau/-/issues/156

----- %= ----- %= ----- %=

From 70271cb0aa30e4523d39c3942e84b16fe18338f5 Mon Sep 17 00:00:00 2001
From: Karol Herbst <kherbst@redhat.com>
Date: Mon, 16 May 2022 17:40:20 +0200
Subject: [PATCH] nouveau WIP

---
 drivers/gpu/drm/nouveau/nouveau_bo.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c
b/drivers/gpu/drm/nouveau/nouveau_bo.c
index 05076e530e7d..b6343741eda6 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.c
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
@@ -820,6 +820,7 @@ nouveau_bo_move_m2mf(struct ttm_buffer_object *bo,
int evict,
  if (ret == 0) {
  ret = nouveau_fence_new(chan, false, &fence);
  if (ret == 0) {
+ nouveau_fence_wait(fence, false, false);
  ret = ttm_bo_move_accel_cleanup(bo,
  &fence->base,
  evict, false,
-- 
2.35.3

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [Nouveau] Advice about debugging nouveau driver suspend issue (init_on_alloc=1 and init_on_free=1)
  2022-08-10 10:41         ` Computer Enthusiastic
@ 2022-08-10 11:59           ` Karol Herbst
  2022-08-10 21:35             ` Lyude Paul
  0 siblings, 1 reply; 13+ messages in thread
From: Karol Herbst @ 2022-08-10 11:59 UTC (permalink / raw)
  To: Computer Enthusiastic; +Cc: nouveau, Ben Skeggs

On Wed, Aug 10, 2022 at 12:42 PM Computer Enthusiastic
<computer.enthusiastic@gmail.com> wrote:
>
> Hello,
>
> The "supend to ram" issue (system freeze when resuming after suspend
> to ram or disk hibernation) is still affecting the nouveau kernel
> module from linux version 5 and above with nvidia GeForce 9600M GT and
> other nvidia graphic card with the same chipset [1][2][3].
>
> I've been successfully using the "work in progress" patch by Karol
> Herbst [4] to solve the "supend to ram" issue since it was proposed
> for an apparent different issue ("[GT216][Linux 5.16.4] Long stalls in
> dma_fence_default_wait") [5] whose root cause is probably the same.
>
> I send this message to bring to your attention the aforementioned
> patch [4] (below in the message as text) to evaluate it for  the next
> steps for possible inclusion  in the upstream kernel version.
>

I think it might be better to CC Ben and Lyude, who _might_ have a
good idea on what's the proper patch, but maybe the patch is fine as
it is?

> Thanks.
>
> [1] https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/-/issues/547
> [2] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=989705#5
> [3] https://bugzilla.kernel.org/show_bug.cgi?id=213617
> [4] https://gitlab.freedesktop.org/drm/nouveau/-/issues/156#note_1385770
> [5] https://gitlab.freedesktop.org/drm/nouveau/-/issues/156
>
> ----- %= ----- %= ----- %=
>
> From 70271cb0aa30e4523d39c3942e84b16fe18338f5 Mon Sep 17 00:00:00 2001
> From: Karol Herbst <kherbst@redhat.com>
> Date: Mon, 16 May 2022 17:40:20 +0200
> Subject: [PATCH] nouveau WIP
>
> ---
>  drivers/gpu/drm/nouveau/nouveau_bo.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c
> b/drivers/gpu/drm/nouveau/nouveau_bo.c
> index 05076e530e7d..b6343741eda6 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
> @@ -820,6 +820,7 @@ nouveau_bo_move_m2mf(struct ttm_buffer_object *bo,
> int evict,
>   if (ret == 0) {
>   ret = nouveau_fence_new(chan, false, &fence);
>   if (ret == 0) {
> + nouveau_fence_wait(fence, false, false);
>   ret = ttm_bo_move_accel_cleanup(bo,
>   &fence->base,
>   evict, false,
> --
> 2.35.3
>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Nouveau] Advice about debugging nouveau driver suspend issue (init_on_alloc=1 and init_on_free=1)
  2022-08-10 11:59           ` Karol Herbst
@ 2022-08-10 21:35             ` Lyude Paul
  2022-08-10 22:08               ` Karol Herbst
  0 siblings, 1 reply; 13+ messages in thread
From: Lyude Paul @ 2022-08-10 21:35 UTC (permalink / raw)
  To: Karol Herbst, Computer Enthusiastic; +Cc: nouveau, Ben Skeggs

On Wed, 2022-08-10 at 13:59 +0200, Karol Herbst wrote:
> On Wed, Aug 10, 2022 at 12:42 PM Computer Enthusiastic
> <computer.enthusiastic@gmail.com> wrote:
> > 
> > Hello,
> > 
> > The "supend to ram" issue (system freeze when resuming after suspend
> > to ram or disk hibernation) is still affecting the nouveau kernel
> > module from linux version 5 and above with nvidia GeForce 9600M GT and
> > other nvidia graphic card with the same chipset [1][2][3].
> > 
> > I've been successfully using the "work in progress" patch by Karol
> > Herbst [4] to solve the "supend to ram" issue since it was proposed
> > for an apparent different issue ("[GT216][Linux 5.16.4] Long stalls in
> > dma_fence_default_wait") [5] whose root cause is probably the same.
> > 
> > I send this message to bring to your attention the aforementioned
> > patch [4] (below in the message as text) to evaluate it for  the next
> > steps for possible inclusion  in the upstream kernel version.
> > 
> 
> I think it might be better to CC Ben and Lyude, who _might_ have a
> good idea on what's the proper patch, but maybe the patch is fine as
> it is?

When the GPU starts going into suspend, TTM migrates all of the active bos in
vram over to system ram so they aren't lost. Based on what's described here
and the fact that a fence wait seems to fix it, it sounds like to me that TTM
may still be migrating bos (buffer objects) from the GPU over to system RAM
when the suspend process starts. Of course this causes those operations to get
interrupted and lose access to VRAM unexpectedly, which likely causes the
world to go boom.

So, I'm not 100% sure but I think this patch is potentially overkill since it
looks like it would fix this issue by just making all m2mf bo moves
synchronous. Thisis something we probably(?) don't want to do since if my
theory is correct, the only time we should need to do a fence wait is before
suspending the GPU in order to ensure that all m2mf operations have been
completed before continuing suspend. I think Ben would likely need to weigh
in, or you might be able to figure out the answer to this just by seeing how
we handle this with the newer method of vram blitting - the CE (copy engine).
Let me know if you need any more input, or are able to come up with a patch
that more or less does what I just described.

> 
> > Thanks.
> > 
> > [1] https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/-/issues/547
> > [2] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=989705#5
> > [3] https://bugzilla.kernel.org/show_bug.cgi?id=213617
> > [4] https://gitlab.freedesktop.org/drm/nouveau/-/issues/156#note_1385770
> > [5] https://gitlab.freedesktop.org/drm/nouveau/-/issues/156
> > 
> > ----- %= ----- %= ----- %=
> > 
> > From 70271cb0aa30e4523d39c3942e84b16fe18338f5 Mon Sep 17 00:00:00 2001
> > From: Karol Herbst <kherbst@redhat.com>
> > Date: Mon, 16 May 2022 17:40:20 +0200
> > Subject: [PATCH] nouveau WIP
> > 
> > ---
> >  drivers/gpu/drm/nouveau/nouveau_bo.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c
> > b/drivers/gpu/drm/nouveau/nouveau_bo.c
> > index 05076e530e7d..b6343741eda6 100644
> > --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
> > +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
> > @@ -820,6 +820,7 @@ nouveau_bo_move_m2mf(struct ttm_buffer_object *bo,
> > int evict,
> >   if (ret == 0) {
> >   ret = nouveau_fence_new(chan, false, &fence);
> >   if (ret == 0) {
> > + nouveau_fence_wait(fence, false, false);
> >   ret = ttm_bo_move_accel_cleanup(bo,
> >   &fence->base,
> >   evict, false,
> > --
> > 2.35.3
> > 
> 

-- 
Cheers,
 Lyude Paul (she/her)
 Software Engineer at Red Hat


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Nouveau] Advice about debugging nouveau driver suspend issue (init_on_alloc=1 and init_on_free=1)
  2022-08-10 21:35             ` Lyude Paul
@ 2022-08-10 22:08               ` Karol Herbst
  2022-08-11 21:19                 ` Lyude Paul
  0 siblings, 1 reply; 13+ messages in thread
From: Karol Herbst @ 2022-08-10 22:08 UTC (permalink / raw)
  To: Lyude Paul; +Cc: Ben Skeggs, nouveau

On Wed, Aug 10, 2022 at 11:35 PM Lyude Paul <lyude@redhat.com> wrote:
>
> On Wed, 2022-08-10 at 13:59 +0200, Karol Herbst wrote:
> > On Wed, Aug 10, 2022 at 12:42 PM Computer Enthusiastic
> > <computer.enthusiastic@gmail.com> wrote:
> > >
> > > Hello,
> > >
> > > The "supend to ram" issue (system freeze when resuming after suspend
> > > to ram or disk hibernation) is still affecting the nouveau kernel
> > > module from linux version 5 and above with nvidia GeForce 9600M GT and
> > > other nvidia graphic card with the same chipset [1][2][3].
> > >
> > > I've been successfully using the "work in progress" patch by Karol
> > > Herbst [4] to solve the "supend to ram" issue since it was proposed
> > > for an apparent different issue ("[GT216][Linux 5.16.4] Long stalls in
> > > dma_fence_default_wait") [5] whose root cause is probably the same.
> > >
> > > I send this message to bring to your attention the aforementioned
> > > patch [4] (below in the message as text) to evaluate it for  the next
> > > steps for possible inclusion  in the upstream kernel version.
> > >
> >
> > I think it might be better to CC Ben and Lyude, who _might_ have a
> > good idea on what's the proper patch, but maybe the patch is fine as
> > it is?
>
> When the GPU starts going into suspend, TTM migrates all of the active bos in
> vram over to system ram so they aren't lost. Based on what's described here
> and the fact that a fence wait seems to fix it, it sounds like to me that TTM
> may still be migrating bos (buffer objects) from the GPU over to system RAM
> when the suspend process starts. Of course this causes those operations to get
> interrupted and lose access to VRAM unexpectedly, which likely causes the
> world to go boom.
>

the thing is, without this patch, the "hw path" fails with a timeout
of... 30 seconds?, the code falls back to sw and it takes up to 2-3
minutes to finish up. So even if this solution is overkill, it fixes a
real issue. I suspect _something_ is bonkers with the fencing we do
here and we don't see that the stuff is actually finished or whatever,
or maybe we never emit the fence in the first place.. I didn't dig
deep enough.

With this, it's really quick and everything.

> So, I'm not 100% sure but I think this patch is potentially overkill since it
> looks like it would fix this issue by just making all m2mf bo moves
> synchronous. Thisis something we probably(?) don't want to do since if my
> theory is correct, the only time we should need to do a fence wait is before
> suspending the GPU in order to ensure that all m2mf operations have been
> completed before continuing suspend. I think Ben would likely need to weigh
> in, or you might be able to figure out the answer to this just by seeing how
> we handle this with the newer method of vram blitting - the CE (copy engine).
> Let me know if you need any more input, or are able to come up with a patch
> that more or less does what I just described.
>
> >
> > > Thanks.
> > >
> > > [1] https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/-/issues/547
> > > [2] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=989705#5
> > > [3] https://bugzilla.kernel.org/show_bug.cgi?id=213617
> > > [4] https://gitlab.freedesktop.org/drm/nouveau/-/issues/156#note_1385770
> > > [5] https://gitlab.freedesktop.org/drm/nouveau/-/issues/156
> > >
> > > ----- %= ----- %= ----- %=
> > >
> > > From 70271cb0aa30e4523d39c3942e84b16fe18338f5 Mon Sep 17 00:00:00 2001
> > > From: Karol Herbst <kherbst@redhat.com>
> > > Date: Mon, 16 May 2022 17:40:20 +0200
> > > Subject: [PATCH] nouveau WIP
> > >
> > > ---
> > >  drivers/gpu/drm/nouveau/nouveau_bo.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > >
> > > diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c
> > > b/drivers/gpu/drm/nouveau/nouveau_bo.c
> > > index 05076e530e7d..b6343741eda6 100644
> > > --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
> > > +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
> > > @@ -820,6 +820,7 @@ nouveau_bo_move_m2mf(struct ttm_buffer_object *bo,
> > > int evict,
> > >   if (ret == 0) {
> > >   ret = nouveau_fence_new(chan, false, &fence);
> > >   if (ret == 0) {
> > > + nouveau_fence_wait(fence, false, false);
> > >   ret = ttm_bo_move_accel_cleanup(bo,
> > >   &fence->base,
> > >   evict, false,
> > > --
> > > 2.35.3
> > >
> >
>
> --
> Cheers,
>  Lyude Paul (she/her)
>  Software Engineer at Red Hat
>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Nouveau] Advice about debugging nouveau driver suspend issue (init_on_alloc=1 and init_on_free=1)
  2022-08-10 22:08               ` Karol Herbst
@ 2022-08-11 21:19                 ` Lyude Paul
  2022-08-11 23:07                   ` Karol Herbst
  0 siblings, 1 reply; 13+ messages in thread
From: Lyude Paul @ 2022-08-11 21:19 UTC (permalink / raw)
  To: Karol Herbst; +Cc: Ben Skeggs, nouveau

On Thu, 2022-08-11 at 00:08 +0200, Karol Herbst wrote:
> the thing is, without this patch, the "hw path" fails with a timeout
> of... 30 seconds?, the code falls back to sw and it takes up to 2-3
> minutes to finish up. So even if this solution is overkill, it fixes a
> real issue. I suspect _something_ is bonkers with the fencing we do
> here and we don't see that the stuff is actually finished or whatever,
> or maybe we never emit the fence in the first place.. I didn't dig
> deep enough.
> 
> With this, it's really quick and everything.

I guess with a todo comment this is probably fine then, since this would only
really be affecting GPUs using m2mf anyway

-- 
Cheers,
 Lyude Paul (she/her)
 Software Engineer at Red Hat


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Nouveau] Advice about debugging nouveau driver suspend issue (init_on_alloc=1 and init_on_free=1)
  2022-08-11 21:19                 ` Lyude Paul
@ 2022-08-11 23:07                   ` Karol Herbst
  2022-08-18 20:39                     ` Lyude Paul
  0 siblings, 1 reply; 13+ messages in thread
From: Karol Herbst @ 2022-08-11 23:07 UTC (permalink / raw)
  To: Lyude Paul; +Cc: Ben Skeggs, nouveau

On Thu, Aug 11, 2022 at 11:20 PM Lyude Paul <lyude@redhat.com> wrote:
>
> On Thu, 2022-08-11 at 00:08 +0200, Karol Herbst wrote:
> > the thing is, without this patch, the "hw path" fails with a timeout
> > of... 30 seconds?, the code falls back to sw and it takes up to 2-3
> > minutes to finish up. So even if this solution is overkill, it fixes a
> > real issue. I suspect _something_ is bonkers with the fencing we do
> > here and we don't see that the stuff is actually finished or whatever,
> > or maybe we never emit the fence in the first place.. I didn't dig
> > deep enough.
> >
> > With this, it's really quick and everything.
>
> I guess with a todo comment this is probably fine then, since this would only
> really be affecting GPUs using m2mf anyway
>

I was mostly wondering if this is safe, but it probably is and it does
fix real user issues. I don't have a better solution at the moment
anyway.

> --
> Cheers,
>  Lyude Paul (she/her)
>  Software Engineer at Red Hat
>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Nouveau] Advice about debugging nouveau driver suspend issue (init_on_alloc=1 and init_on_free=1)
  2022-08-11 23:07                   ` Karol Herbst
@ 2022-08-18 20:39                     ` Lyude Paul
  2022-08-18 21:17                       ` Karol Herbst
  0 siblings, 1 reply; 13+ messages in thread
From: Lyude Paul @ 2022-08-18 20:39 UTC (permalink / raw)
  To: Karol Herbst; +Cc: Ben Skeggs, nouveau

So then - if we can get this patch re-submitted with the /* TODO */ added,
I'll be happy to push this

On Fri, 2022-08-12 at 01:07 +0200, Karol Herbst wrote:
> On Thu, Aug 11, 2022 at 11:20 PM Lyude Paul <lyude@redhat.com> wrote:
> > 
> > On Thu, 2022-08-11 at 00:08 +0200, Karol Herbst wrote:
> > > the thing is, without this patch, the "hw path" fails with a timeout
> > > of... 30 seconds?, the code falls back to sw and it takes up to 2-3
> > > minutes to finish up. So even if this solution is overkill, it fixes a
> > > real issue. I suspect _something_ is bonkers with the fencing we do
> > > here and we don't see that the stuff is actually finished or whatever,
> > > or maybe we never emit the fence in the first place.. I didn't dig
> > > deep enough.
> > > 
> > > With this, it's really quick and everything.
> > 
> > I guess with a todo comment this is probably fine then, since this would only
> > really be affecting GPUs using m2mf anyway
> > 
> 
> I was mostly wondering if this is safe, but it probably is and it does
> fix real user issues. I don't have a better solution at the moment
> anyway.
> 
> > --
> > Cheers,
> >  Lyude Paul (she/her)
> >  Software Engineer at Red Hat
> > 
> 

-- 
Cheers,
 Lyude Paul (she/her)
 Software Engineer at Red Hat


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Nouveau] Advice about debugging nouveau driver suspend issue (init_on_alloc=1 and init_on_free=1)
  2022-08-18 20:39                     ` Lyude Paul
@ 2022-08-18 21:17                       ` Karol Herbst
  0 siblings, 0 replies; 13+ messages in thread
From: Karol Herbst @ 2022-08-18 21:17 UTC (permalink / raw)
  To: Lyude Paul; +Cc: Ben Skeggs, nouveau

On Thu, Aug 18, 2022 at 10:39 PM Lyude Paul <lyude@redhat.com> wrote:
>
> So then - if we can get this patch re-submitted with the /* TODO */ added,
> I'll be happy to push this
>

yeah, I'll send something out tomorrow then.

> On Fri, 2022-08-12 at 01:07 +0200, Karol Herbst wrote:
> > On Thu, Aug 11, 2022 at 11:20 PM Lyude Paul <lyude@redhat.com> wrote:
> > >
> > > On Thu, 2022-08-11 at 00:08 +0200, Karol Herbst wrote:
> > > > the thing is, without this patch, the "hw path" fails with a timeout
> > > > of... 30 seconds?, the code falls back to sw and it takes up to 2-3
> > > > minutes to finish up. So even if this solution is overkill, it fixes a
> > > > real issue. I suspect _something_ is bonkers with the fencing we do
> > > > here and we don't see that the stuff is actually finished or whatever,
> > > > or maybe we never emit the fence in the first place.. I didn't dig
> > > > deep enough.
> > > >
> > > > With this, it's really quick and everything.
> > >
> > > I guess with a todo comment this is probably fine then, since this would only
> > > really be affecting GPUs using m2mf anyway
> > >
> >
> > I was mostly wondering if this is safe, but it probably is and it does
> > fix real user issues. I don't have a better solution at the moment
> > anyway.
> >
> > > --
> > > Cheers,
> > >  Lyude Paul (she/her)
> > >  Software Engineer at Red Hat
> > >
> >
>
> --
> Cheers,
>  Lyude Paul (she/her)
>  Software Engineer at Red Hat
>


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2022-08-18 21:18 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-13 17:31 [Nouveau] Advice about debugging nouveau driver suspend issue (init_on_alloc=1 and init_on_free=1) Computer Enthusiastic
2021-10-13 17:43 ` Ilia Mirkin
2021-10-14 16:51   ` Computer Enthusiastic
2021-10-14 18:07     ` Ilia Mirkin
2022-01-24 23:19       ` computer enthusiastic
2022-08-10 10:41         ` Computer Enthusiastic
2022-08-10 11:59           ` Karol Herbst
2022-08-10 21:35             ` Lyude Paul
2022-08-10 22:08               ` Karol Herbst
2022-08-11 21:19                 ` Lyude Paul
2022-08-11 23:07                   ` Karol Herbst
2022-08-18 20:39                     ` Lyude Paul
2022-08-18 21:17                       ` Karol Herbst

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.