nouveau.lists.freedesktop.org archive mirror
* Issues with trying to boot falcons from sgt memory + Possible firmware SG_DEBUG fix?
@ 2024-04-18 20:27 Lyude Paul
  2024-04-18 22:14 ` David Airlie
  2024-04-19 13:52 ` Ben Skeggs
  0 siblings, 2 replies; 4+ messages in thread
From: Lyude Paul @ 2024-04-18 20:27 UTC (permalink / raw)
  To: Danilo Krummrich, Dave Airlie, Timur Tabi, Ben Skeggs; +Cc: nouveau

So - first some context here for Ben and anyone else who hasn't been
following. A little while ago I got a Slimbook Executive 16 with an
Nvidia RTX 4060 in it, and I've unfortunately been running into a kind
of annoying issue. Currently this laptop only has 16 gigs of ram, and
as it turns out - this can easily lead the system to having pretty
heavy memory fragmentation once it starts swapping pages out.

Normally this wouldn't matter, but I unfortunately discovered that when
we're runtime suspending the GPU in Nouveau - we actually appear to
allocate some of the memory we use for migrating using
dma_alloc_coherent. This starts to fail on my system once memory
fragmentation goes up like so:

  kworker/18:0: page allocation failure: order:7, mode:0xcc0(GFP_KERNEL),
  nodemask=(null),cpuset=/,mems_allowed=0
  CPU: 18 PID: 287012 Comm: kworker/18:0 Not tainted
  6.8.4-200.ChopperV1.fc39.x86_64 #1
  Hardware name: SLIMBOOK Executive/Executive, BIOS N.1.10GRU06 02/02/2024
  Workqueue: pm pm_runtime_work
  Call Trace:
   <TASK>
   dump_stack_lvl+0x47/0x60
   warn_alloc+0x165/0x1e0
   ? __alloc_pages_direct_compact+0x1ad/0x2b0
   __alloc_pages_slowpath.constprop.0+0xd7d/0xde0
   __alloc_pages+0x32d/0x350
   __dma_direct_alloc_pages.isra.0+0x16a/0x2b0
   dma_direct_alloc+0x70/0x280
   nvkm_gsp_radix3_sg+0x5e/0x130 [nouveau]
   r535_gsp_fini+0x1d4/0x350 [nouveau]
   nvkm_subdev_fini+0x67/0x150 [nouveau]
   nvkm_device_fini+0x95/0x1e0 [nouveau]
   nvkm_udevice_fini+0x53/0x70 [nouveau]
   nvkm_object_fini+0xb9/0x240 [nouveau]
   nvkm_object_fini+0x75/0x240 [nouveau]
   nouveau_do_suspend+0xf5/0x280 [nouveau]
   nouveau_pmops_runtime_suspend+0x3e/0xb0 [nouveau]
   pci_pm_runtime_suspend+0x67/0x1e0
   ? __pfx_pci_pm_runtime_suspend+0x10/0x10
   __rpm_callback+0x41/0x170
   ? __pfx_pci_pm_runtime_suspend+0x10/0x10
   rpm_callback+0x5d/0x70
   ? __pfx_pci_pm_runtime_suspend+0x10/0x10
   rpm_suspend+0x120/0x6a0
   pm_runtime_work+0x98/0xb0
   process_one_work+0x171/0x340
   worker_thread+0x27b/0x3a0
   ? __pfx_worker_thread+0x10/0x10
   kthread+0xe5/0x120
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x31/0x50
   ? __pfx_kthread+0x10/0x10
   ret_from_fork_asm+0x1b/0x30

  nouveau 0000:01:00.0: gsp: suspend failed, -12
  nouveau: DRM-master:00000000:00000080: suspend failed with -12
  nouveau 0000:01:00.0: can't suspend (nouveau_pmops_runtime_suspend
  [nouveau] returned -12)
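
For scale: "order:7" in the failure above is a request for 2^7 = 128
physically contiguous pages, and -12 is -ENOMEM. A minimal userspace
sketch of that arithmetic, assuming 4 KiB base pages (the x86_64
default); the helper names are hypothetical and not part of the kernel:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB base pages (PAGE_SHIFT == 12), as on x86_64. */
#define SKETCH_PAGE_SHIFT 12u

/* Number of pages in an allocation of the given order: 2^order. */
static uint64_t order_to_pages(unsigned int order)
{
	assert(order < 32); /* keep the shift well defined */
	return UINT64_C(1) << order;
}

/* Size in bytes of the physically contiguous block being requested. */
static uint64_t order_to_bytes(unsigned int order)
{
	return order_to_pages(order) << SKETCH_PAGE_SHIFT;
}
```

Under that assumption, order 7 works out to 128 pages, or 512 KiB, in
one contiguous chunk, which is the kind of request that gets hard to
satisfy once memory is fragmented.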

Keep in mind, I don't dive into memory management related stuff like
this very often! But I'd very much like to know how to help out
anywhere around the driver, including outside of my usual domains, so
I've been trying to write up a patch for this. The original suggestion
for a fix that Dave Airlie had given me was (unless I misunderstood,
which isn't unlikely) to try to see if we could get nvkm_gsp_mem_ctor()
to start allocating memory with vmalloc() and map that onto the GPU
using the SG helpers instead. So - I gave a shot at writing up a patch
for doing that:

https://gitlab.freedesktop.org/lyudess/linux/-/commit/b5a41ac2bd948979815d262d8d20b4f3333f9c26

As you can probably guess - the patch does not really seem to work, and
I've been trying to figure out why. There's already a couple of issues
I'm aware of: the most glaring one being that as Timur pointed out, a
lot of GSP hardware expects contiguous memory allocations - but
according to them the allocation that's specifically failing should be
small enough that it'd be allocated in a contiguous page anyway:

   [    9.429884] Lyude:r535_gsp_init:2186: (mbox1) == 0
   [    9.429898] Lyude:r535_gsp_init:2186: (mbox0) == dbdfe000
   [    9.491300] ------------[ cut here ]------------
   [    9.491308] WARNING: CPU: 5 PID: 921 at drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c:1713 r535_gsp_init+0x75e/0x7c0 [nouveau]
   [    9.491533] Modules linked in: nouveau(+) rfkill binfmt_misc vfat fat snd_hda_codec_realtek snd_hda_codec_generic snd_hda_scodec_component snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep wmi_bmof ppdev snd_hda_core drm_ttm_helper intel_rapl_msr snd_seq ttm snd_seq_device snd_pcm video gpu_sched snd_timer i2c_algo_bit drm_gpuvm drm_exec intel_rapl_common mxm_wmi rapl snd drm_display_helper acpi_cpufreq soundcore k10temp i2c_piix4 parport_pc wmi parport gpio_amdpt gpio_generic loop dm_multipath nfnetlink zram crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 sha256_ssse3 r8169 realtek sha1_ssse3 ccp w83627hf_wdt scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables fuse
   [    9.491670] CPU: 5 PID: 921 Comm: (udev-worker) Not tainted 6.9.0-rc3Lyude-Test+ #22
   [    9.491681] Hardware name: MSI MS-7A39/A320M GAMING PRO (MS-7A39), BIOS 1.I0 01/22/2019
   [    9.491690] RIP: 0010:r535_gsp_init+0x75e/0x7c0 [nouveau]
   [    9.491885] Code: 8b 83 10 0d 00 00 48 89 ef 41 bf e4 ff ff ff 48 8b 40 18 48 8b 80 48 0f 00 00 48 8b 40 28 e8 b9 5e 89 ee 0f 0b e9 73 f9 ff ff <0f> 0b 41 bf fb ff ff ff e9 5a f9 ff ff 41 89 ef 0f 0b e9 5c f9 ff
   [    9.491905] RSP: 0018:ffffb271c175f748 EFLAGS: 00010246
   [    9.491914] RAX: 0000000000000000 RBX: ffffa098e192f000 RCX: ffffa098ca2768c8
   [    9.491922] RDX: ffffa098e191d400 RSI: ffffb271cc110080 RDI: ffffb271cc111388
   [    9.491930] RBP: 00000000dbdfe000 R08: 0000000000000003 R09: 0000000000000000
   [    9.491938] R10: 0000000000000000 R11: ffffa098ca276828 R12: ffffa098e192f008
   [    9.491946] R13: 000000022b906452 R14: ffffa098e192f008 R15: 0000000000000000
   [    9.491956] FS:  00007f4de98cc980(0000) GS:ffffa099c4a80000(0000) knlGS:0000000000000000
   [    9.491966] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   [    9.491974] CR2: 00007f7bd8d18ea0 CR3: 0000000104e58000 CR4: 00000000003506f0
   [    9.491989] Call Trace:
   [    9.491996]  <TASK>
   [    9.492002]  ? __warn+0x80/0x120
   [    9.492012]  ? r535_gsp_init+0x75e/0x7c0 [nouveau]
   [    9.492200]  ? report_bug+0x164/0x190
   [    9.492211]  ? handle_bug+0x3c/0x80
   [    9.492218]  ? exc_invalid_op+0x17/0x70
   [    9.492227]  ? asm_exc_invalid_op+0x1a/0x20
   [    9.492241]  ? r535_gsp_init+0x75e/0x7c0 [nouveau]
   [    9.492429]  ? r535_gsp_init+0x18e/0x7c0 [nouveau]
   [    9.492616]  ? srso_return_thunk+0x5/0x5f
   [    9.492626]  nvkm_subdev_init_+0x48/0x130 [nouveau]
   [    9.492802]  ? srso_return_thunk+0x5/0x5f
   [    9.492810]  nvkm_subdev_init+0x44/0x90 [nouveau]
   [    9.492988]  nvkm_device_init+0x166/0x2e0 [nouveau]
   [    9.493189]  nvkm_udevice_init+0x47/0x70 [nouveau]
   [    9.493391]  nvkm_object_init+0x41/0x1c0 [nouveau]
   [    9.493567]  nvkm_ioctl_new+0x16a/0x290 [nouveau]
   [    9.493740]  ? __pfx_nvkm_client_child_new+0x10/0x10 [nouveau]
   [    9.493912]  ? __pfx_nvkm_udevice_new+0x10/0x10 [nouveau]
   [    9.494121]  nvkm_ioctl+0x10e/0x250 [nouveau]
   [    9.494288]  nvif_object_ctor+0x112/0x190 [nouveau]
   [    9.494456]  nvif_device_ctor+0x23/0x60 [nouveau]
   [    9.494625]  nouveau_cli_init+0x164/0x5d0 [nouveau]
   [    9.494820]  nouveau_drm_device_init+0x97/0xe00 [nouveau]
   [    9.495022]  ? srso_return_thunk+0x5/0x5f
   [    9.495030]  ? pci_bus_read_config_word+0x4d/0x90
   [    9.495039]  ? srso_return_thunk+0x5/0x5f
   [    9.495047]  ? pci_update_current_state+0x72/0xb0
   [    9.495059]  nouveau_drm_probe+0x12c/0x280 [nouveau]
   [    9.495245]  ? srso_return_thunk+0x5/0x5f
   [    9.495254]  local_pci_probe+0x45/0xa0
   [    9.495263]  pci_device_probe+0xc7/0x240
   [    9.495272]  really_probe+0xd6/0x390
   [    9.495282]  ? __pfx___driver_attach+0x10/0x10
   [    9.495290]  __driver_probe_device+0x78/0x150
   [    9.495301]  driver_probe_device+0x1f/0x90
   [    9.495308]  __driver_attach+0xd2/0x1c0
   [    9.495316]  bus_for_each_dev+0x88/0xd0
   [    9.495325]  bus_add_driver+0x116/0x220
   [    9.495334]  driver_register+0x59/0x100
   [    9.495342]  ? __pfx_nouveau_drm_init+0x10/0x10 [nouveau]
   [    9.495512]  do_one_initcall+0x5b/0x320
   [    9.495524]  do_init_module+0x60/0x240
   [    9.495536]  init_module_from_file+0x86/0xc0
   [    9.495550]  idempotent_init_module+0x120/0x2b0
   [    9.495562]  __x64_sys_finit_module+0x5e/0xb0
   [    9.495571]  do_syscall_64+0x88/0x170
   [    9.495581]  ? srso_return_thunk+0x5/0x5f
   [    9.495589]  ? syscall_exit_to_user_mode_prepare+0x15d/0x190
   [    9.495600]  ? srso_return_thunk+0x5/0x5f
   [    9.495607]  ? syscall_exit_to_user_mode+0x60/0x210
   [    9.495615]  ? srso_return_thunk+0x5/0x5f
   [    9.495622]  ? do_syscall_64+0x95/0x170
   [    9.495630]  ? srso_return_thunk+0x5/0x5f
   [    9.495636]  ? do_syscall_64+0x95/0x170
   [    9.495644]  ? srso_return_thunk+0x5/0x5f
   [    9.495653]  entry_SYSCALL_64_after_hwframe+0x71/0x79
   [    9.495663] RIP: 0033:0x7f4de9b2919d
   [    9.495680] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4b cc 0c 00 f7 d8 64 89 01 48
   [    9.495697] RSP: 002b:00007ffc56bfe468 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
   [    9.495707] RAX: ffffffffffffffda RBX: 00005644a0432350 RCX: 00007f4de9b2919d
   [    9.495717] RDX: 0000000000000000 RSI: 00005644a042ef30 RDI: 0000000000000031
   [    9.495726] RBP: 00007ffc56bfe520 R08: 00007f4de9bf6b20 R09: 00007ffc56bfe4b0
   [    9.495734] R10: 00005644a04346a0 R11: 0000000000000246 R12: 00005644a042ef30
   [    9.495742] R13: 0000000000020000 R14: 00005644a0432d10 R15: 00005644a0434660
   [    9.495754]  </TASK>
   [    9.495759] ---[ end trace 0000000000000000 ]---
   [    9.495778] ------------[ cut here ]------------
   [    9.495784] WARNING: CPU: 5 PID: 921 at drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c:2187 r535_gsp_init+0xc5/0x7c0 [nouveau]
   [    9.495981] Modules linked in: nouveau(+) rfkill binfmt_misc vfat fat snd_hda_codec_realtek snd_hda_codec_generic snd_hda_scodec_component snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep wmi_bmof ppdev snd_hda_core drm_ttm_helper intel_rapl_msr snd_seq ttm snd_seq_device snd_pcm video gpu_sched snd_timer i2c_algo_bit drm_gpuvm drm_exec intel_rapl_common mxm_wmi rapl snd drm_display_helper acpi_cpufreq soundcore k10temp i2c_piix4 parport_pc wmi parport gpio_amdpt gpio_generic loop dm_multipath nfnetlink zram crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 sha256_ssse3 r8169 realtek sha1_ssse3 ccp w83627hf_wdt scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables fuse
   [    9.496112] CPU: 5 PID: 921 Comm: (udev-worker) Tainted: G        W          6.9.0-rc3Lyude-Test+ #22
   [    9.496123] Hardware name: MSI MS-7A39/A320M GAMING PRO (MS-7A39), BIOS 1.I0 01/22/2019
   [    9.496132] RIP: 0010:r535_gsp_init+0xc5/0x7c0 [nouveau]
   [    9.496317] Code: 24 18 4c 8d 63 08 89 6c 24 14 4c 89 e6 6a 00 4c 8d 44 24 20 48 8d 4c 24 1c e8 b7 c3 fa ff 5f 41 89 c7 85 c0 0f 84 97 00 00 00 <0f> 0b 48 83 bb 20 0a 00 00 00 75 37 48 8b 44 24 20 65 48 2b 04 25
   [    9.496333] RSP: 0018:ffffb271c175f748 EFLAGS: 00010246
   [    9.496341] RAX: 0000000000000000 RBX: ffffa098e192f000 RCX: ffffa098ca2768c8
   [    9.496351] RDX: ffffa098e191d400 RSI: ffffb271cc110080 RDI: ffffb271cc111388
   [    9.496360] RBP: 00000000dbdfe000 R08: 0000000000000003 R09: 0000000000000000
   [    9.496368] R10: 0000000000000000 R11: ffffa098ca276828 R12: ffffa098e192f008
   [    9.496375] R13: 000000022b906452 R14: ffffa098e192f008 R15: 00000000fffffffb
   [    9.496383] FS:  00007f4de98cc980(0000) GS:ffffa099c4a80000(0000) knlGS:0000000000000000
   [    9.496393] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   [    9.496400] CR2: 00007f7bd8d18ea0 CR3: 0000000104e58000 CR4: 00000000003506f0
   [    9.496410] Call Trace:
   [    9.496416]  <TASK>
   [    9.496422]  ? __warn+0x80/0x120
   [    9.496429]  ? r535_gsp_init+0xc5/0x7c0 [nouveau]
   [    9.496622]  ? report_bug+0x164/0x190
   [    9.496631]  ? handle_bug+0x3c/0x80
   [    9.496638]  ? exc_invalid_op+0x17/0x70
   [    9.496647]  ? asm_exc_invalid_op+0x1a/0x20
   [    9.496660]  ? r535_gsp_init+0xc5/0x7c0 [nouveau]
   [    9.496851]  ? r535_gsp_init+0x18e/0x7c0 [nouveau]
   [    9.497044]  ? srso_return_thunk+0x5/0x5f
   [    9.497055]  nvkm_subdev_init_+0x48/0x130 [nouveau]
   [    9.497227]  ? srso_return_thunk+0x5/0x5f
   [    9.497236]  nvkm_subdev_init+0x44/0x90 [nouveau]
   [    9.497405]  nvkm_device_init+0x166/0x2e0 [nouveau]
   [    9.497608]  nvkm_udevice_init+0x47/0x70 [nouveau]
   [    9.497808]  nvkm_object_init+0x41/0x1c0 [nouveau]
   [    9.497983]  nvkm_ioctl_new+0x16a/0x290 [nouveau]
   [    9.498154]  ? __pfx_nvkm_client_child_new+0x10/0x10 [nouveau]
   [    9.498326]  ? __pfx_nvkm_udevice_new+0x10/0x10 [nouveau]
   [    9.498531]  nvkm_ioctl+0x10e/0x250 [nouveau]
   [    9.498702]  nvif_object_ctor+0x112/0x190 [nouveau]
   [    9.498873]  nvif_device_ctor+0x23/0x60 [nouveau]
   [    9.499049]  nouveau_cli_init+0x164/0x5d0 [nouveau]
   [    9.499244]  nouveau_drm_device_init+0x97/0xe00 [nouveau]
   [    9.499430]  ? srso_return_thunk+0x5/0x5f
   [    9.499437]  ? pci_bus_read_config_word+0x4d/0x90
   [    9.499445]  ? srso_return_thunk+0x5/0x5f
   [    9.499452]  ? pci_update_current_state+0x72/0xb0
   [    9.499461]  nouveau_drm_probe+0x12c/0x280 [nouveau]
   [    9.499657]  ? srso_return_thunk+0x5/0x5f
   [    9.499666]  local_pci_probe+0x45/0xa0
   [    9.499674]  pci_device_probe+0xc7/0x240
   [    9.499683]  really_probe+0xd6/0x390
   [    9.499692]  ? __pfx___driver_attach+0x10/0x10
   [    9.499699]  __driver_probe_device+0x78/0x150
   [    9.499709]  driver_probe_device+0x1f/0x90
   [    9.499718]  __driver_attach+0xd2/0x1c0
   [    9.499726]  bus_for_each_dev+0x88/0xd0
   [    9.499735]  bus_add_driver+0x116/0x220
   [    9.499744]  driver_register+0x59/0x100
   [    9.499751]  ? __pfx_nouveau_drm_init+0x10/0x10 [nouveau]
   [    9.499915]  do_one_initcall+0x5b/0x320
   [    9.499926]  do_init_module+0x60/0x240
   [    9.499934]  init_module_from_file+0x86/0xc0
   [    9.499948]  idempotent_init_module+0x120/0x2b0
   [    9.499962]  __x64_sys_finit_module+0x5e/0xb0
   [    9.499971]  do_syscall_64+0x88/0x170
   [    9.499987]  ? srso_return_thunk+0x5/0x5f
   [    9.499996]  ? syscall_exit_to_user_mode_prepare+0x15d/0x190
   [    9.500004]  ? srso_return_thunk+0x5/0x5f
   [    9.500011]  ? syscall_exit_to_user_mode+0x60/0x210
   [    9.500019]  ? srso_return_thunk+0x5/0x5f
   [    9.500026]  ? do_syscall_64+0x95/0x170
   [    9.500034]  ? srso_return_thunk+0x5/0x5f
   [    9.500041]  ? do_syscall_64+0x95/0x170
   [    9.500050]  ? srso_return_thunk+0x5/0x5f
   [    9.500058]  entry_SYSCALL_64_after_hwframe+0x71/0x79
   [    9.500067] RIP: 0033:0x7f4de9b2919d
   [    9.500075] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4b cc 0c 00 f7 d8 64 89 01 48
   [    9.500091] RSP: 002b:00007ffc56bfe468 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
   [    9.500103] RAX: ffffffffffffffda RBX: 00005644a0432350 RCX: 00007f4de9b2919d
   [    9.500112] RDX: 0000000000000000 RSI: 00005644a042ef30 RDI: 0000000000000031
   [    9.500121] RBP: 00007ffc56bfe520 R08: 00007f4de9bf6b20 R09: 00007ffc56bfe4b0
   [    9.500128] R10: 00005644a04346a0 R11: 0000000000000246 R12: 00005644a042ef30
   [    9.500136] R13: 0000000000020000 R14: 00005644a0432d10 R15: 00005644a0434660
   [    9.500149]  </TASK>
   [    9.500154] ---[ end trace 0000000000000000 ]---
   [    9.500162] nouveau 0000:1f:00.0: gsp: init failed, -5
   [    9.500189] nouveau 0000:1f:00.0: init failed with -5
   [    9.500196] nouveau: DRM-master:00000000:00000080: init failed with -5
   [    9.500207] nouveau 0000:1f:00.0: DRM-master: Device allocation failed: -5
   [    9.502661] nouveau 0000:1f:00.0: probe with driver nouveau failed with error -5
   
   
Which brings me to the second part - Timur had me enable CONFIG_DEBUG_SG,
which quickly hit a different issue:

   [    8.992320] RIP: 0010:sg_init_one (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/./include/linux/scatterlist.h:187 (discriminator 1) /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/lib/scatterlist.c:143 (discriminator 1))
   [    8.992331] Code: 71 93 37 01 83 e1 03 f6 c3 03 75 20 a8 01 75 1e 48 09 cb 41 89 54 24 08 49 89 1c 24 41 89 6c 24 0c 5b 5d 41 5c e9 7b 94 7d 00 <0f> 0b 0f 0b 0f 0b 48 8b 05 5e ae 9f 01 eb b2 66 66 2e 0f 1f 84 00
   [    8.992428] Call Trace:
   [    8.992433]  <TASK>
   [    8.992439] ? die (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/kernel/dumpstack.c:421 /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/kernel/dumpstack.c:434 /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/kernel/dumpstack.c:447)
   [    8.992448] ? do_trap (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/kernel/traps.c:114 /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/kernel/traps.c:155)
   [    8.992455] ? sg_init_one (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/./include/linux/scatterlist.h:187 (discriminator 1) /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/lib/scatterlist.c:143 (discriminator 1))
   [    8.992464] ? do_error_trap (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/./arch/x86/include/asm/traps.h:58 /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/kernel/traps.c:176)
   [    8.992472] ? sg_init_one (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/./include/linux/scatterlist.h:187 (discriminator 1) /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/lib/scatterlist.c:143 (discriminator 1))
   [    8.992481] ? exc_invalid_op (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/kernel/traps.c:267)
   [    8.992489] ? sg_init_one (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/./include/linux/scatterlist.h:187 (discriminator 1) /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/lib/scatterlist.c:143 (discriminator 1))
   [    8.992496] ? asm_exc_invalid_op (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/./arch/x86/include/asm/idtentry.h:621)
   [    8.992509] ? sg_init_one (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/./include/linux/scatterlist.h:187 (discriminator 1) /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/lib/scatterlist.c:143 (discriminator 1))
   [    8.992518] nvkm_firmware_ctor (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nvkm/core/firmware.c:249) nouveau
   [    8.992722] nvkm_falcon_fw_ctor (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nvkm/falcon/fw.c:199) nouveau
   [    8.992898] ga102_gsp_booter_ctor (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/ga102.c:62) nouveau
   [    8.993095] r535_gsp_oneinit (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c:2309) nouveau
   [    8.993292] ? srso_return_thunk (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/lib/retpoline.S:224)
   [    8.993302] ? kmem_cache_alloc_lru (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/mm/slub.c:3748 (discriminator 2) /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/mm/slub.c:3827 (discriminator 2) /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/mm/slub.c:3864 (discriminator 2))
   [    8.993311] ? srso_return_thunk (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/lib/retpoline.S:224)
   [    8.993317] ? srso_return_thunk (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/lib/retpoline.S:224)
   [    8.993324] ? ktime_get (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/kernel/time/timekeeping.c:292 /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/kernel/time/timekeeping.c:388 /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/kernel/time/timekeeping.c:848)
   [    8.993334] nvkm_subdev_oneinit_ (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nvkm/core/subdev.c:113) nouveau
   [    8.993510] nvkm_subdev_init_ (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nvkm/core/subdev.c:139) nouveau
   [    8.993685] ? srso_return_thunk (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/lib/retpoline.S:224)
   [    8.993693] nvkm_subdev_init (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nvkm/core/subdev.c:170) nouveau
   [    8.993867] nvkm_device_init (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nvkm/engine/device/base.c:3023) nouveau
   [    8.994079] nvkm_udevice_init (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nvkm/engine/device/user.c:295) nouveau
   [    8.994281] nvkm_object_init (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nvkm/core/object.c:245) nouveau
   [    8.994457] nvkm_ioctl_new (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nvkm/core/ioctl.c:149) nouveau
   [    8.994630] ? __pfx_nvkm_client_child_new (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nvkm/core/client.c:125) nouveau
   [    8.994803] ? __pfx_nvkm_udevice_new (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nvkm/engine/device/user.c:386) nouveau
   [    8.995013] nvkm_ioctl (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nvkm/core/ioctl.c:354 /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nvkm/core/ioctl.c:376) nouveau
   [    8.995187] nvif_object_ctor (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nvif/object.c:298 (discriminator 1)) nouveau
   [    8.995356] nvif_device_ctor (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nvif/device.c:56) nouveau
   [    8.995524] nouveau_cli_init (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nouveau_drm.c:270) nouveau
   [    8.995721] nouveau_drm_device_init (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nouveau_drm.c:602) nouveau
   [    8.995915] ? srso_return_thunk (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/lib/retpoline.S:224)
   [    8.995923] ? pci_bus_read_config_word (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/pci/access.c:67 (discriminator 1))
   [    8.995932] ? srso_return_thunk (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/lib/retpoline.S:224)
   [    8.995939] ? pci_update_current_state (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/pci/pci.c:1195 /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/pci/pci.c:1187)
   [    8.995949] nouveau_drm_probe (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nouveau_drm.c:841) nouveau
   [    8.996145] ? srso_return_thunk (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/lib/retpoline.S:224)
   [    8.996154] local_pci_probe (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/pci/pci-driver.c:325)
   [    8.996163] pci_device_probe (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/pci/pci-driver.c:392 (discriminator 1) /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/pci/pci-driver.c:417 (discriminator 1) /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/pci/pci-driver.c:451 (discriminator 1))
   [    8.996174] really_probe (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/base/dd.c:578 /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/base/dd.c:656)
   [    8.996185] ? __pfx___driver_attach (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/base/dd.c:1155)
   [    8.996192] __driver_probe_device (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/base/dd.c:798)
   [    8.996201] driver_probe_device (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/base/dd.c:828)
   [    8.996209] __driver_attach (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/base/dd.c:1215)
   [    8.996217] bus_for_each_dev (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/base/bus.c:368)
   [    8.996228] bus_add_driver (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/base/bus.c:673)
   [    8.996238] driver_register (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/base/driver.c:246)
   [    8.996246] ? __pfx_nouveau_drm_init (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nvif/object.c:32) nouveau
   [    8.996415] do_one_initcall (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/init/main.c:1238)
   [    8.996428] do_init_module (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/kernel/module/main.c:2538)
   [    8.996437] init_module_from_file (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/kernel/module/main.c:3168)
   [    8.996450] idempotent_init_module (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/./include/linux/spinlock.h:351 /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/kernel/module/main.c:3131 /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/kernel/module/main.c:3185)
   [    8.996462] __x64_sys_finit_module (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/./include/linux/file.h:47 /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/kernel/module/main.c:3207 /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/kernel/module/main.c:3189 /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/kernel/module/main.c:3189)
   [    8.996473] do_syscall_64 (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/entry/common.c:52 (discriminator 1) /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/entry/common.c:83 (discriminator 1))
   [    8.996482] ? srso_return_thunk (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/lib/retpoline.S:224)
   [    8.996490] entry_SYSCALL_64_after_hwframe (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/entry/entry_64.S:129)
   [    8.996499] RIP: 0033:0x7fd12f52919d

I think Timur actually mentioned this bug to you previously, but in
hopes of getting something more useful out of SG_DEBUG I dug into this
problem a bit and ended up with what I believe is an actually correct
patch:

https://gitlab.freedesktop.org/lyudess/linux/-/commit/485f1fb62ddd4b42b60848eeb48206fef4376161

...unfortunately, fixing that issue on my system did not get SG_DEBUG
to give me any useful info.

Anyway - that brings me to ask 1: do you have any idea what might be
going on with the falcon boot issue I mentioned, or if I might just be
doing something wrong/silly with how I'm setting up memory in
nvkm_gsp_mem_ctor()?

And 2: if you have the time does that patch look correct? I'm happy to
submit it :)

Also 3: welcome back again :)


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Issues with trying to boot falcons from sgt memory + Possible firmware SG_DEBUG fix?
  2024-04-18 20:27 Issues with trying to boot falcons from sgt memory + Possible firmware SG_DEBUG fix? Lyude Paul
@ 2024-04-18 22:14 ` David Airlie
  2024-04-19 13:54   ` Ben Skeggs
  2024-04-19 13:52 ` Ben Skeggs
  1 sibling, 1 reply; 4+ messages in thread
From: David Airlie @ 2024-04-18 22:14 UTC (permalink / raw)
  To: Lyude Paul; +Cc: Danilo Krummrich, Timur Tabi, Ben Skeggs, nouveau

On Fri, Apr 19, 2024 at 6:27 AM Lyude Paul <lyude@redhat.com> wrote:
>
> So - first some context here for Ben and anyone else who hasn't been
> following. A little while ago I got a Slimbook Executive 16 with an
> Nvidia RTX 4060 in it, and I've unfortunately been running into a kind
> of annoying issue. Currently this laptop only has 16 gigs of ram, and
> as it turns out - this can easily lead the system to having pretty
> heavy memory fragmentation once it starts swapping pages out.
>
> Normally this wouldn't matter, but I unfortunately discovered that when
> we're runtime suspending the GPU in Nouveau - we actually appear to
> allocate some of the memory we use for migrating using
> dma_alloc_coherent. This starts to fail on my system once memory
> fragmentation goes up like so:
>
>   kworker/18:0: page allocation failure: order:7, mode:0xcc0(GFP_KERNEL),
>   nodemask=(null),cpuset=/,mems_allowed=0
>   CPU: 18 PID: 287012 Comm: kworker/18:0 Not tainted
>   6.8.4-200.ChopperV1.fc39.x86_64 #1
>   Hardware name: SLIMBOOK Executive/Executive, BIOS N.1.10GRU06 02/02/2024
>   Workqueue: pm pm_runtime_work
>   Call Trace:
>    <TASK>
>    dump_stack_lvl+0x47/0x60
>    warn_alloc+0x165/0x1e0
>    ? __alloc_pages_direct_compact+0x1ad/0x2b0
>    __alloc_pages_slowpath.constprop.0+0xd7d/0xde0
>    __alloc_pages+0x32d/0x350
>    __dma_direct_alloc_pages.isra.0+0x16a/0x2b0
>    dma_direct_alloc+0x70/0x280
>    nvkm_gsp_radix3_sg+0x5e/0x130 [nouveau]
>    r535_gsp_fini+0x1d4/0x350 [nouveau]
>    nvkm_subdev_fini+0x67/0x150 [nouveau]
>    nvkm_device_fini+0x95/0x1e0 [nouveau]
>    nvkm_udevice_fini+0x53/0x70 [nouveau]
>    nvkm_object_fini+0xb9/0x240 [nouveau]
>    nvkm_object_fini+0x75/0x240 [nouveau]
>    nouveau_do_suspend+0xf5/0x280 [nouveau]
>    nouveau_pmops_runtime_suspend+0x3e/0xb0 [nouveau]
>    pci_pm_runtime_suspend+0x67/0x1e0
>    ? __pfx_pci_pm_runtime_suspend+0x10/0x10
>    __rpm_callback+0x41/0x170
>    ? __pfx_pci_pm_runtime_suspend+0x10/0x10
>    rpm_callback+0x5d/0x70
>    ? __pfx_pci_pm_runtime_suspend+0x10/0x10
>    rpm_suspend+0x120/0x6a0
>    pm_runtime_work+0x98/0xb0
>    process_one_work+0x171/0x340
>    worker_thread+0x27b/0x3a0
>    ? __pfx_worker_thread+0x10/0x10
>    kthread+0xe5/0x120
>    ? __pfx_kthread+0x10/0x10
>    ret_from_fork+0x31/0x50
>    ? __pfx_kthread+0x10/0x10
>    ret_from_fork_asm+0x1b/0x30
>
>   nouveau 0000:01:00.0: gsp: suspend failed, -12
>   nouveau: DRM-master:00000000:00000080: suspend failed with -12
>   nouveau 0000:01:00.0: can't suspend (nouveau_pmops_runtime_suspend
>   [nouveau] returned -12)
>
> Keep in mind, I don't dive into memory management related stuff like
> this very often! But I'd very much like to know how to help out
> anywhere around the driver, including outside of my usual domains, so
> I've been trying to write up a patch for this. The original suggestion
> for a fix that Dave Airlie had given me was (unless I misunderstood,
> which isn't unlikely) to try to see if we could get nvkm_gsp_mem_ctor()
> to start allocating memory with vmalloc() and map that onto the GPU
> using the SG helpers instead. So - I gave a shot at writing up a patch
> for doing that:
>
> https://gitlab.freedesktop.org/lyudess/linux/-/commit/b5a41ac2bd948979815d262d8d20b4f3333f9c26
>
> As you can probably guess - the patch does not really seem to work, and
> I've been trying to figure out why. There's already a couple of issues
> I'm aware of: the most glaring one being that as Timur pointed out, a
> lot of GSP hardware expects contiguous memory allocations - but
> according to them the allocation that's specifically failing should be
> small enough that it'd be allocated in a contiguous page anyway:

nvkm_gsp_mem_ctor is used to do coherent allocations in a bunch of
places in the GSP code, and we can't use vmalloc for a lot of them. Many
of the allocations are small, multi-page, and long-lived, and the
hardware expects them to be non-scattered.

Now in this single case we have a large amount of data pointed to by a
radix3 page table.

The data is allocated with nvkm_gsp_sg, then we fail to allocate the
first level of page tables with the coherent allocation. However, I
don't think the first level of the page table needs to be allocated
with the coherent allocator; we should allocate it with nvkm_gsp_sg
instead.
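
To put rough numbers on that, here is a minimal userspace sketch, not
driver code. It assumes 4 KiB pages, 8-byte table entries, and that
each radix3 level holds one entry per page of the level below it,
rounded up to whole pages; that sizing rule is an assumption about
nvkm_gsp_radix3_sg, and the helper names are hypothetical. The
"order:7" in the trace is a request for 2^7 = 128 contiguous pages,
i.e. 512 KiB.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ull	/* assumed 4 KiB pages */

struct level_plan {
	uint64_t bytes;		/* size of this level's table */
	unsigned int order;	/* page order a contiguous alloc would need */
};

/* Smallest order with (page size << order) >= size, like get_order(). */
static unsigned int order_for_size(uint64_t size)
{
	unsigned int order = 0;

	assert(size > 0);
	while ((SKETCH_PAGE_SIZE << order) < size)
		order++;
	return order;
}

/* One 8-byte entry per page of 'size', rounded up to whole pages. */
static uint64_t table_bytes(uint64_t size)
{
	uint64_t bytes = (size / SKETCH_PAGE_SIZE) * sizeof(uint64_t);

	return (bytes + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/*
 * plan[2] is the table that points directly at the data pages,
 * plan[0] is the top of the tree (assumed layout).
 */
static void plan_levels(uint64_t data_size, struct level_plan plan[3])
{
	uint64_t size = data_size;

	for (int i = 2; i >= 0; i--) {
		plan[i].bytes = table_bytes(size);
		plan[i].order = order_for_size(plan[i].bytes);
		size = plan[i].bytes;
	}
}
```

With a hypothetical 200 MiB of data, plan[2] comes out at 409600 bytes,
which needs an order-7 contiguous allocation, while the two upper
levels fit in a single page each. Under these assumptions only the
table that points at the data pages is large enough to be hurt by
fragmentation.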

Dave.



* Re: Issues with trying to boot falcons from sgt memory + Possible firmware SG_DEBUG fix?
  2024-04-18 20:27 Issues with trying to boot falcons from sgt memory + Possible firmware SG_DEBUG fix? Lyude Paul
  2024-04-18 22:14 ` David Airlie
@ 2024-04-19 13:52 ` Ben Skeggs
  1 sibling, 0 replies; 4+ messages in thread
From: Ben Skeggs @ 2024-04-19 13:52 UTC (permalink / raw)
  To: Lyude Paul, Danilo Krummrich, Dave Airlie, Timur Tabi; +Cc: nouveau

On 19/4/24 06:27, Lyude Paul wrote:

> So - first some context here for Ben and anyone else who hasn't been
> following. A little while ago I got a Slimbook Executive 16 with a
> Nvidia RTX 4060 in it, and I've unfortunately been running into a kind
> of annoying issue. Currently this laptop only has 16 gigs of ram, and
> as it turns out - this can easily lead the system to having pretty
> heavy memory fragmentation once it starts swapping pages out.
>
> Normally this wouldn't matter, but I unfortunately discovered that when
> we're runtime suspending the GPU in Nouveau - we actually appear to
> allocate some of the memory we use for migrating using
> dma_alloc_coherent. This starts to fail on my system once memory
> fragmentation goes up like so:
>
>    kworker/18:0: page allocation failure: order:7, mode:0xcc0(GFP_KERNEL),
>    nodemask=(null),cpuset=/,mems_allowed=0
>    CPU: 18 PID: 287012 Comm: kworker/18:0 Not tainted
>    6.8.4-200.ChopperV1.fc39.x86_64 #1
>    Hardware name: SLIMBOOK Executive/Executive, BIOS N.1.10GRU06 02/02/2024
>    Workqueue: pm pm_runtime_work
>    Call Trace:
>     <TASK>
>     dump_stack_lvl+0x47/0x60
>     warn_alloc+0x165/0x1e0
>     ? __alloc_pages_direct_compact+0x1ad/0x2b0
>     __alloc_pages_slowpath.constprop.0+0xd7d/0xde0
>     __alloc_pages+0x32d/0x350
>     __dma_direct_alloc_pages.isra.0+0x16a/0x2b0
>     dma_direct_alloc+0x70/0x280
>     nvkm_gsp_radix3_sg+0x5e/0x130 [nouveau]
>     r535_gsp_fini+0x1d4/0x350 [nouveau]
>     nvkm_subdev_fini+0x67/0x150 [nouveau]
>     nvkm_device_fini+0x95/0x1e0 [nouveau]
>     nvkm_udevice_fini+0x53/0x70 [nouveau]
>     nvkm_object_fini+0xb9/0x240 [nouveau]
>     nvkm_object_fini+0x75/0x240 [nouveau]
>     nouveau_do_suspend+0xf5/0x280 [nouveau]
>     nouveau_pmops_runtime_suspend+0x3e/0xb0 [nouveau]
>     pci_pm_runtime_suspend+0x67/0x1e0
>     ? __pfx_pci_pm_runtime_suspend+0x10/0x10
>     __rpm_callback+0x41/0x170
>     ? __pfx_pci_pm_runtime_suspend+0x10/0x10
>     rpm_callback+0x5d/0x70
>     ? __pfx_pci_pm_runtime_suspend+0x10/0x10
>     rpm_suspend+0x120/0x6a0
>     pm_runtime_work+0x98/0xb0
>     process_one_work+0x171/0x340
>     worker_thread+0x27b/0x3a0
>     ? __pfx_worker_thread+0x10/0x10
>     kthread+0xe5/0x120
>     ? __pfx_kthread+0x10/0x10
>     ret_from_fork+0x31/0x50
>     ? __pfx_kthread+0x10/0x10
>     ret_from_fork_asm+0x1b/0x30
>
>    nouveau 0000:01:00.0: gsp: suspend failed, -12
>    nouveau: DRM-master:00000000:00000080: suspend failed with -12
>    nouveau 0000:01:00.0: can't suspend (nouveau_pmops_runtime_suspend
>    [nouveau] returned -12)
>
> Keep in mind, I don't dive into memory management related stuff like
> this very often! But I'd very much like to know how to help out
> anywhere around the driver, including outside of my usual domains, so
> I've been trying to write up a patch for this. The original suggestion
> for a fix that Dave Airlie had given me was (unless I misunderstood,
> which isn't unlikely) to try to see if we could get nvkm_gsp_mem_ctor()
> to start allocating memory with vmalloc() and map that onto the GPU
> using the SG helpers instead. So - I gave a shot at writing up a patch
> for doing that:
>
> https://gitlab.freedesktop.org/lyudess/linux/-/commit/b5a41ac2bd948979815d262d8d20b4f3333f9c26
>
> As you can probably guess - the patch does not really seem to work, and
> I've been trying to figure out why. There's already a couple of issues
> I'm aware of: the most glaring one being that as Timur pointed out, a
> lot of GSP hardware expects contiguous memory allocations - but
> according to them the allocation that's specifically failing should be
> small enough that it'd be allocated in a contiguous page anyway:
>
>     [    9.429884] Lyude:r535_gsp_init:2186: (mbox1) == 0
>     [    9.429898] Lyude:r535_gsp_init:2186: (mbox0) == dbdfe000
>     [    9.491300] ------------[ cut here ]------------
>     [    9.491308] WARNING: CPU: 5 PID: 921 at drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c:1713 r535_gsp_init+0x75e/0x7c0 [nouveau]
>     [    9.491533] Modules linked in: nouveau(+) rfkill binfmt_misc vfat fat snd_hda_codec_realtek snd_hda_codec_generic snd_hda_scodec_component snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep wmi_bmof ppdev snd_hda_core drm_ttm_helper intel_rapl_msr snd_seq ttm snd_seq_device snd_pcm video gpu_sched snd_timer i2c_algo_bit drm_gpuvm drm_exec intel_rapl_common mxm_wmi rapl snd drm_display_helper acpi_cpufreq soundcore k10temp i2c_piix4 parport_pc wmi parport gpio_amdpt gpio_generic loop dm_multipath nfnetlink zram crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 sha256_ssse3 r8169 realtek sha1_ssse3 ccp w83627hf_wdt scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables fuse
>     [    9.491670] CPU: 5 PID: 921 Comm: (udev-worker) Not tainted 6.9.0-rc3Lyude-Test+ #22
>     [    9.491681] Hardware name: MSI MS-7A39/A320M GAMING PRO (MS-7A39), BIOS 1.I0 01/22/2019
>     [    9.491690] RIP: 0010:r535_gsp_init+0x75e/0x7c0 [nouveau]
>     [    9.491885] Code: 8b 83 10 0d 00 00 48 89 ef 41 bf e4 ff ff ff 48 8b 40 18 48 8b 80 48 0f 00 00 48 8b 40 28 e8 b9 5e 89 ee 0f 0b e9 73 f9 ff ff <0f> 0b 41 bf fb ff ff ff e9 5a f9 ff ff 41 89 ef 0f 0b e9 5c f9 ff
>     [    9.491905] RSP: 0018:ffffb271c175f748 EFLAGS: 00010246
>     [    9.491914] RAX: 0000000000000000 RBX: ffffa098e192f000 RCX: ffffa098ca2768c8
>     [    9.491922] RDX: ffffa098e191d400 RSI: ffffb271cc110080 RDI: ffffb271cc111388
>     [    9.491930] RBP: 00000000dbdfe000 R08: 0000000000000003 R09: 0000000000000000
>     [    9.491938] R10: 0000000000000000 R11: ffffa098ca276828 R12: ffffa098e192f008
>     [    9.491946] R13: 000000022b906452 R14: ffffa098e192f008 R15: 0000000000000000
>     [    9.491956] FS:  00007f4de98cc980(0000) GS:ffffa099c4a80000(0000) knlGS:0000000000000000
>     [    9.491966] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>     [    9.491974] CR2: 00007f7bd8d18ea0 CR3: 0000000104e58000 CR4: 00000000003506f0
>     [    9.491989] Call Trace:
>     [    9.491996]  <TASK>
>     [    9.492002]  ? __warn+0x80/0x120
>     [    9.492012]  ? r535_gsp_init+0x75e/0x7c0 [nouveau]
>     [    9.492200]  ? report_bug+0x164/0x190
>     [    9.492211]  ? handle_bug+0x3c/0x80
>     [    9.492218]  ? exc_invalid_op+0x17/0x70
>     [    9.492227]  ? asm_exc_invalid_op+0x1a/0x20
>     [    9.492241]  ? r535_gsp_init+0x75e/0x7c0 [nouveau]
>     [    9.492429]  ? r535_gsp_init+0x18e/0x7c0 [nouveau]
>     [    9.492616]  ? srso_return_thunk+0x5/0x5f
>     [    9.492626]  nvkm_subdev_init_+0x48/0x130 [nouveau]
>     [    9.492802]  ? srso_return_thunk+0x5/0x5f
>     [    9.492810]  nvkm_subdev_init+0x44/0x90 [nouveau]
>     [    9.492988]  nvkm_device_init+0x166/0x2e0 [nouveau]
>     [    9.493189]  nvkm_udevice_init+0x47/0x70 [nouveau]
>     [    9.493391]  nvkm_object_init+0x41/0x1c0 [nouveau]
>     [    9.493567]  nvkm_ioctl_new+0x16a/0x290 [nouveau]
>     [    9.493740]  ? __pfx_nvkm_client_child_new+0x10/0x10 [nouveau]
>     [    9.493912]  ? __pfx_nvkm_udevice_new+0x10/0x10 [nouveau]
>     [    9.494121]  nvkm_ioctl+0x10e/0x250 [nouveau]
>     [    9.494288]  nvif_object_ctor+0x112/0x190 [nouveau]
>     [    9.494456]  nvif_device_ctor+0x23/0x60 [nouveau]
>     [    9.494625]  nouveau_cli_init+0x164/0x5d0 [nouveau]
>     [    9.494820]  nouveau_drm_device_init+0x97/0xe00 [nouveau]
>     [    9.495022]  ? srso_return_thunk+0x5/0x5f
>     [    9.495030]  ? pci_bus_read_config_word+0x4d/0x90
>     [    9.495039]  ? srso_return_thunk+0x5/0x5f
>     [    9.495047]  ? pci_update_current_state+0x72/0xb0
>     [    9.495059]  nouveau_drm_probe+0x12c/0x280 [nouveau]
>     [    9.495245]  ? srso_return_thunk+0x5/0x5f
>     [    9.495254]  local_pci_probe+0x45/0xa0
>     [    9.495263]  pci_device_probe+0xc7/0x240
>     [    9.495272]  really_probe+0xd6/0x390
>     [    9.495282]  ? __pfx___driver_attach+0x10/0x10
>     [    9.495290]  __driver_probe_device+0x78/0x150
>     [    9.495301]  driver_probe_device+0x1f/0x90
>     [    9.495308]  __driver_attach+0xd2/0x1c0
>     [    9.495316]  bus_for_each_dev+0x88/0xd0
>     [    9.495325]  bus_add_driver+0x116/0x220
>     [    9.495334]  driver_register+0x59/0x100
>     [    9.495342]  ? __pfx_nouveau_drm_init+0x10/0x10 [nouveau]
>     [    9.495512]  do_one_initcall+0x5b/0x320
>     [    9.495524]  do_init_module+0x60/0x240
>     [    9.495536]  init_module_from_file+0x86/0xc0
>     [    9.495550]  idempotent_init_module+0x120/0x2b0
>     [    9.495562]  __x64_sys_finit_module+0x5e/0xb0
>     [    9.495571]  do_syscall_64+0x88/0x170
>     [    9.495581]  ? srso_return_thunk+0x5/0x5f
>     [    9.495589]  ? syscall_exit_to_user_mode_prepare+0x15d/0x190
>     [    9.495600]  ? srso_return_thunk+0x5/0x5f
>     [    9.495607]  ? syscall_exit_to_user_mode+0x60/0x210
>     [    9.495615]  ? srso_return_thunk+0x5/0x5f
>     [    9.495622]  ? do_syscall_64+0x95/0x170
>     [    9.495630]  ? srso_return_thunk+0x5/0x5f
>     [    9.495636]  ? do_syscall_64+0x95/0x170
>     [    9.495644]  ? srso_return_thunk+0x5/0x5f
>     [    9.495653]  entry_SYSCALL_64_after_hwframe+0x71/0x79
>     [    9.495663] RIP: 0033:0x7f4de9b2919d
>     [    9.495680] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4b cc 0c 00 f7 d8 64 89 01 48
>     [    9.495697] RSP: 002b:00007ffc56bfe468 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
>     [    9.495707] RAX: ffffffffffffffda RBX: 00005644a0432350 RCX: 00007f4de9b2919d
>     [    9.495717] RDX: 0000000000000000 RSI: 00005644a042ef30 RDI: 0000000000000031
>     [    9.495726] RBP: 00007ffc56bfe520 R08: 00007f4de9bf6b20 R09: 00007ffc56bfe4b0
>     [    9.495734] R10: 00005644a04346a0 R11: 0000000000000246 R12: 00005644a042ef30
>     [    9.495742] R13: 0000000000020000 R14: 00005644a0432d10 R15: 00005644a0434660
>     [    9.495754]  </TASK>
>     [    9.495759] ---[ end trace 0000000000000000 ]---
>     [    9.495778] ------------[ cut here ]------------
>     [    9.495784] WARNING: CPU: 5 PID: 921 at drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c:2187 r535_gsp_init+0xc5/0x7c0 [nouveau]
>     [    9.495981] Modules linked in: nouveau(+) rfkill binfmt_misc vfat fat snd_hda_codec_realtek snd_hda_codec_generic snd_hda_scodec_component snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep wmi_bmof ppdev snd_hda_core drm_ttm_helper intel_rapl_msr snd_seq ttm snd_seq_device snd_pcm video gpu_sched snd_timer i2c_algo_bit drm_gpuvm drm_exec intel_rapl_common mxm_wmi rapl snd drm_display_helper acpi_cpufreq soundcore k10temp i2c_piix4 parport_pc wmi parport gpio_amdpt gpio_generic loop dm_multipath nfnetlink zram crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 sha256_ssse3 r8169 realtek sha1_ssse3 ccp w83627hf_wdt scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables fuse
>     [    9.496112] CPU: 5 PID: 921 Comm: (udev-worker) Tainted: G        W          6.9.0-rc3Lyude-Test+ #22
>     [    9.496123] Hardware name: MSI MS-7A39/A320M GAMING PRO (MS-7A39), BIOS 1.I0 01/22/2019
>     [    9.496132] RIP: 0010:r535_gsp_init+0xc5/0x7c0 [nouveau]
>     [    9.496317] Code: 24 18 4c 8d 63 08 89 6c 24 14 4c 89 e6 6a 00 4c 8d 44 24 20 48 8d 4c 24 1c e8 b7 c3 fa ff 5f 41 89 c7 85 c0 0f 84 97 00 00 00 <0f> 0b 48 83 bb 20 0a 00 00 00 75 37 48 8b 44 24 20 65 48 2b 04 25
>     [    9.496333] RSP: 0018:ffffb271c175f748 EFLAGS: 00010246
>     [    9.496341] RAX: 0000000000000000 RBX: ffffa098e192f000 RCX: ffffa098ca2768c8
>     [    9.496351] RDX: ffffa098e191d400 RSI: ffffb271cc110080 RDI: ffffb271cc111388
>     [    9.496360] RBP: 00000000dbdfe000 R08: 0000000000000003 R09: 0000000000000000
>     [    9.496368] R10: 0000000000000000 R11: ffffa098ca276828 R12: ffffa098e192f008
>     [    9.496375] R13: 000000022b906452 R14: ffffa098e192f008 R15: 00000000fffffffb
>     [    9.496383] FS:  00007f4de98cc980(0000) GS:ffffa099c4a80000(0000) knlGS:0000000000000000
>     [    9.496393] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>     [    9.496400] CR2: 00007f7bd8d18ea0 CR3: 0000000104e58000 CR4: 00000000003506f0
>     [    9.496410] Call Trace:
>     [    9.496416]  <TASK>
>     [    9.496422]  ? __warn+0x80/0x120
>     [    9.496429]  ? r535_gsp_init+0xc5/0x7c0 [nouveau]
>     [    9.496622]  ? report_bug+0x164/0x190
>     [    9.496631]  ? handle_bug+0x3c/0x80
>     [    9.496638]  ? exc_invalid_op+0x17/0x70
>     [    9.496647]  ? asm_exc_invalid_op+0x1a/0x20
>     [    9.496660]  ? r535_gsp_init+0xc5/0x7c0 [nouveau]
>     [    9.496851]  ? r535_gsp_init+0x18e/0x7c0 [nouveau]
>     [    9.497044]  ? srso_return_thunk+0x5/0x5f
>     [    9.497055]  nvkm_subdev_init_+0x48/0x130 [nouveau]
>     [    9.497227]  ? srso_return_thunk+0x5/0x5f
>     [    9.497236]  nvkm_subdev_init+0x44/0x90 [nouveau]
>     [    9.497405]  nvkm_device_init+0x166/0x2e0 [nouveau]
>     [    9.497608]  nvkm_udevice_init+0x47/0x70 [nouveau]
>     [    9.497808]  nvkm_object_init+0x41/0x1c0 [nouveau]
>     [    9.497983]  nvkm_ioctl_new+0x16a/0x290 [nouveau]
>     [    9.498154]  ? __pfx_nvkm_client_child_new+0x10/0x10 [nouveau]
>     [    9.498326]  ? __pfx_nvkm_udevice_new+0x10/0x10 [nouveau]
>     [    9.498531]  nvkm_ioctl+0x10e/0x250 [nouveau]
>     [    9.498702]  nvif_object_ctor+0x112/0x190 [nouveau]
>     [    9.498873]  nvif_device_ctor+0x23/0x60 [nouveau]
>     [    9.499049]  nouveau_cli_init+0x164/0x5d0 [nouveau]
>     [    9.499244]  nouveau_drm_device_init+0x97/0xe00 [nouveau]
>     [    9.499430]  ? srso_return_thunk+0x5/0x5f
>     [    9.499437]  ? pci_bus_read_config_word+0x4d/0x90
>     [    9.499445]  ? srso_return_thunk+0x5/0x5f
>     [    9.499452]  ? pci_update_current_state+0x72/0xb0
>     [    9.499461]  nouveau_drm_probe+0x12c/0x280 [nouveau]
>     [    9.499657]  ? srso_return_thunk+0x5/0x5f
>     [    9.499666]  local_pci_probe+0x45/0xa0
>     [    9.499674]  pci_device_probe+0xc7/0x240
>     [    9.499683]  really_probe+0xd6/0x390
>     [    9.499692]  ? __pfx___driver_attach+0x10/0x10
>     [    9.499699]  __driver_probe_device+0x78/0x150
>     [    9.499709]  driver_probe_device+0x1f/0x90
>     [    9.499718]  __driver_attach+0xd2/0x1c0
>     [    9.499726]  bus_for_each_dev+0x88/0xd0
>     [    9.499735]  bus_add_driver+0x116/0x220
>     [    9.499744]  driver_register+0x59/0x100
>     [    9.499751]  ? __pfx_nouveau_drm_init+0x10/0x10 [nouveau]
>     [    9.499915]  do_one_initcall+0x5b/0x320
>     [    9.499926]  do_init_module+0x60/0x240
>     [    9.499934]  init_module_from_file+0x86/0xc0
>     [    9.499948]  idempotent_init_module+0x120/0x2b0
>     [    9.499962]  __x64_sys_finit_module+0x5e/0xb0
>     [    9.499971]  do_syscall_64+0x88/0x170
>     [    9.499987]  ? srso_return_thunk+0x5/0x5f
>     [    9.499996]  ? syscall_exit_to_user_mode_prepare+0x15d/0x190
>     [    9.500004]  ? srso_return_thunk+0x5/0x5f
>     [    9.500011]  ? syscall_exit_to_user_mode+0x60/0x210
>     [    9.500019]  ? srso_return_thunk+0x5/0x5f
>     [    9.500026]  ? do_syscall_64+0x95/0x170
>     [    9.500034]  ? srso_return_thunk+0x5/0x5f
>     [    9.500041]  ? do_syscall_64+0x95/0x170
>     [    9.500050]  ? srso_return_thunk+0x5/0x5f
>     [    9.500058]  entry_SYSCALL_64_after_hwframe+0x71/0x79
>     [    9.500067] RIP: 0033:0x7f4de9b2919d
>     [    9.500075] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4b cc 0c 00 f7 d8 64 89 01 48
>     [    9.500091] RSP: 002b:00007ffc56bfe468 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
>     [    9.500103] RAX: ffffffffffffffda RBX: 00005644a0432350 RCX: 00007f4de9b2919d
>     [    9.500112] RDX: 0000000000000000 RSI: 00005644a042ef30 RDI: 0000000000000031
>     [    9.500121] RBP: 00007ffc56bfe520 R08: 00007f4de9bf6b20 R09: 00007ffc56bfe4b0
>     [    9.500128] R10: 00005644a04346a0 R11: 0000000000000246 R12: 00005644a042ef30
>     [    9.500136] R13: 0000000000020000 R14: 00005644a0432d10 R15: 00005644a0434660
>     [    9.500149]  </TASK>
>     [    9.500154] ---[ end trace 0000000000000000 ]---
>     [    9.500162] nouveau 0000:1f:00.0: gsp: init failed, -5
>     [    9.500189] nouveau 0000:1f:00.0: init failed with -5
>     [    9.500196] nouveau: DRM-master:00000000:00000080: init failed with -5
>     [    9.500207] nouveau 0000:1f:00.0: DRM-master: Device allocation failed: -5
>     [    9.502661] nouveau 0000:1f:00.0: probe with driver nouveau failed with error -5
>
>
> Which brings me to the second part - Timur had me enable CONFIG_SG_DEBUG, which quickly hit a different issue:
>
>     [    8.992320] RIP: 0010:sg_init_one (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/./include/linux/scatterlist.h:187 (discriminator 1) /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/lib/scatterlist.c:143 (discriminator 1))
>     [ 8.992331] Code: 71 93 37 01 83 e1 03 f6 c3 03 75 20 a8 01 75 1e 48 09 cb 41 89 54 24 08 49 89 1c 24 41 89 6c 24 0c 5b 5d 41 5c e9 7b 94 7d 00 <0f> 0b 0f 0b 0f 0b 48 8b 05 5e ae 9f 01 eb b2 66 66 2e 0f 1f 84 00
>     [    8.992428] Call Trace:
>     [    8.992433]  <TASK>
>     [    8.992439] ? die (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/kernel/dumpstack.c:421 /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/kernel/dumpstack.c:434 /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/kernel/dumpstack.c:447)
>     [    8.992448] ? do_trap (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/kernel/traps.c:114 /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/kernel/traps.c:155)
>     [    8.992455] ? sg_init_one (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/./include/linux/scatterlist.h:187 (discriminator 1) /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/lib/scatterlist.c:143 (discriminator 1))
>     [    8.992464] ? do_error_trap (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/./arch/x86/include/asm/traps.h:58 /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/kernel/traps.c:176)
>     [    8.992472] ? sg_init_one (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/./include/linux/scatterlist.h:187 (discriminator 1) /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/lib/scatterlist.c:143 (discriminator 1))
>     [    8.992481] ? exc_invalid_op (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/kernel/traps.c:267)
>     [    8.992489] ? sg_init_one (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/./include/linux/scatterlist.h:187 (discriminator 1) /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/lib/scatterlist.c:143 (discriminator 1))
>     [    8.992496] ? asm_exc_invalid_op (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/./arch/x86/include/asm/idtentry.h:621)
>     [    8.992509] ? sg_init_one (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/./include/linux/scatterlist.h:187 (discriminator 1) /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/lib/scatterlist.c:143 (discriminator 1))
>     [    8.992518] nvkm_firmware_ctor (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nvkm/core/firmware.c:249) nouveau
>     [    8.992722] nvkm_falcon_fw_ctor (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nvkm/falcon/fw.c:199) nouveau
>     [    8.992898] ga102_gsp_booter_ctor (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/ga102.c:62) nouveau
>     [    8.993095] r535_gsp_oneinit (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c:2309) nouveau
>     [    8.993292] ? srso_return_thunk (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/lib/retpoline.S:224)
>     [    8.993302] ? kmem_cache_alloc_lru (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/mm/slub.c:3748 (discriminator 2) /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/mm/slub.c:3827 (discriminator 2) /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/mm/slub.c:3864 (discriminator 2))
>     [    8.993311] ? srso_return_thunk (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/lib/retpoline.S:224)
>     [    8.993317] ? srso_return_thunk (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/lib/retpoline.S:224)
>     [    8.993324] ? ktime_get (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/kernel/time/timekeeping.c:292 /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/kernel/time/timekeeping.c:388 /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/kernel/time/timekeeping.c:848)
>     [    8.993334] nvkm_subdev_oneinit_ (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nvkm/core/subdev.c:113) nouveau
>     [    8.993510] nvkm_subdev_init_ (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nvkm/core/subdev.c:139) nouveau
>     [    8.993685] ? srso_return_thunk (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/lib/retpoline.S:224)
>     [    8.993693] nvkm_subdev_init (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nvkm/core/subdev.c:170) nouveau
>     [    8.993867] nvkm_device_init (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nvkm/engine/device/base.c:3023) nouveau
>     [    8.994079] nvkm_udevice_init (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nvkm/engine/device/user.c:295) nouveau
>     [    8.994281] nvkm_object_init (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nvkm/core/object.c:245) nouveau
>     [    8.994457] nvkm_ioctl_new (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nvkm/core/ioctl.c:149) nouveau
>     [    8.994630] ? __pfx_nvkm_client_child_new (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nvkm/core/client.c:125) nouveau
>     [    8.994803] ? __pfx_nvkm_udevice_new (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nvkm/engine/device/user.c:386) nouveau
>     [    8.995013] nvkm_ioctl (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nvkm/core/ioctl.c:354 /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nvkm/core/ioctl.c:376) nouveau
>     [    8.995187] nvif_object_ctor (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nvif/object.c:298 (discriminator 1)) nouveau
>     [    8.995356] nvif_device_ctor (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nvif/device.c:56) nouveau
>     [    8.995524] nouveau_cli_init (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nouveau_drm.c:270) nouveau
>     [    8.995721] nouveau_drm_device_init (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nouveau_drm.c:602) nouveau
>     [    8.995915] ? srso_return_thunk (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/lib/retpoline.S:224)
>     [    8.995923] ? pci_bus_read_config_word (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/pci/access.c:67 (discriminator 1))
>     [    8.995932] ? srso_return_thunk (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/lib/retpoline.S:224)
>     [    8.995939] ? pci_update_current_state (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/pci/pci.c:1195 /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/pci/pci.c:1187)
>     [    8.995949] nouveau_drm_probe (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nouveau_drm.c:841) nouveau
>     [    8.996145] ? srso_return_thunk (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/lib/retpoline.S:224)
>     [    8.996154] local_pci_probe (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/pci/pci-driver.c:325)
>     [    8.996163] pci_device_probe (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/pci/pci-driver.c:392 (discriminator 1) /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/pci/pci-driver.c:417 (discriminator 1) /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/pci/pci-driver.c:451 (discriminator 1))
>     [    8.996174] really_probe (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/base/dd.c:578 /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/base/dd.c:656)
>     [    8.996185] ? __pfx___driver_attach (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/base/dd.c:1155)
>     [    8.996192] __driver_probe_device (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/base/dd.c:798)
>     [    8.996201] driver_probe_device (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/base/dd.c:828)
>     [    8.996209] __driver_attach (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/base/dd.c:1215)
>     [    8.996217] bus_for_each_dev (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/base/bus.c:368)
>     [    8.996228] bus_add_driver (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/base/bus.c:673)
>     [    8.996238] driver_register (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/base/driver.c:246)
>     [    8.996246] ? __pfx_nouveau_drm_init (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/drivers/gpu/drm/nouveau/nvif/object.c:32) nouveau
>     [    8.996415] do_one_initcall (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/init/main.c:1238)
>     [    8.996428] do_init_module (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/kernel/module/main.c:2538)
>     [    8.996437] init_module_from_file (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/kernel/module/main.c:3168)
>     [    8.996450] idempotent_init_module (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/./include/linux/spinlock.h:351 /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/kernel/module/main.c:3131 /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/kernel/module/main.c:3185)
>     [    8.996462] __x64_sys_finit_module (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/./include/linux/file.h:47 /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/kernel/module/main.c:3207 /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/kernel/module/main.c:3189 /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/kernel/module/main.c:3189)
>     [    8.996473] do_syscall_64 (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/entry/common.c:52 (discriminator 1) /home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/entry/common.c:83 (discriminator 1))
>     [    8.996482] ? srso_return_thunk (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/lib/retpoline.S:224)
>     [    8.996490] entry_SYSCALL_64_after_hwframe (/home/lyudess/Projects/linux/worktrees/nouveau-aux-fixes/arch/x86/entry/entry_64.S:129)
>     [    8.996499] RIP: 0033:0x7fd12f52919d
>
> I think Timur actually mentioned this bug to you previously, but in
> hopes of getting something more useful out of SG_DEBUG I dug into this
> problem a bit and ended up with what I believe is an actually correct
> patch:
>
> https://gitlab.freedesktop.org/lyudess/linux/-/commit/485f1fb62ddd4b42b60848eeb48206fef4376161

I think this patch is fine, and does solve the issue for me here if I 
enable SG_DEBUG.

Ben.

>
> ...unfortunately, fixing that issue on my system did not get SG_DEBUG
> to give me any useful info.
>
> Anyway - that brings me to ask 1: do you have any idea what might be
> going on with the falcon boot issue I mentioned, or if I might just be
> doing something wrong/silly with how I'm setting up memory in
> nvkm_gsp_mem_ctor()?
>
> And 2: if you have the time does that patch look correct? I'm happy to
> submit it :)
>
> Also 3: welcome back again :)
>


* Re: Issues with trying to boot falcons from sgt memory + Possible firmware SG_DEBUG fix?
  2024-04-18 22:14 ` David Airlie
@ 2024-04-19 13:54   ` Ben Skeggs
  0 siblings, 0 replies; 4+ messages in thread
From: Ben Skeggs @ 2024-04-19 13:54 UTC (permalink / raw)
  To: David Airlie, Lyude Paul; +Cc: Danilo Krummrich, Timur Tabi, nouveau

On 19/4/24 08:14, David Airlie wrote:

> On Fri, Apr 19, 2024 at 6:27 AM Lyude Paul <lyude@redhat.com> wrote:
>> So - first some context here for Ben and anyone else who hasn't been
>> following. A little while ago I got a Slimbook Executive 16 with a
>> Nvidia RTX 4060 in it, and I've unfortunately been running into a kind
>> of annoying issue. Currently this laptop only has 16 gigs of ram, and
>> as it turns out - this can easily lead the system to having pretty
>> heavy memory fragmentation once it starts swapping pages out.
>>
>> Normally this wouldn't matter, but I unfortunately discovered that when
>> we're runtime suspending the GPU in Nouveau - we actually appear to
>> allocate some of the memory we use for migrating using
>> dma_alloc_coherent. This starts to fail on my system once memory
>> fragmentation goes up like so:
>>
>>    kworker/18:0: page allocation failure: order:7, mode:0xcc0(GFP_KERNEL),
>>    nodemask=(null),cpuset=/,mems_allowed=0
>>    CPU: 18 PID: 287012 Comm: kworker/18:0 Not tainted
>>    6.8.4-200.ChopperV1.fc39.x86_64 #1
>>    Hardware name: SLIMBOOK Executive/Executive, BIOS N.1.10GRU06 02/02/2024
>>    Workqueue: pm pm_runtime_work
>>    Call Trace:
>>     <TASK>
>>     dump_stack_lvl+0x47/0x60
>>     warn_alloc+0x165/0x1e0
>>     ? __alloc_pages_direct_compact+0x1ad/0x2b0
>>     __alloc_pages_slowpath.constprop.0+0xd7d/0xde0
>>     __alloc_pages+0x32d/0x350
>>     __dma_direct_alloc_pages.isra.0+0x16a/0x2b0
>>     dma_direct_alloc+0x70/0x280
>>     nvkm_gsp_radix3_sg+0x5e/0x130 [nouveau]
>>     r535_gsp_fini+0x1d4/0x350 [nouveau]
>>     nvkm_subdev_fini+0x67/0x150 [nouveau]
>>     nvkm_device_fini+0x95/0x1e0 [nouveau]
>>     nvkm_udevice_fini+0x53/0x70 [nouveau]
>>     nvkm_object_fini+0xb9/0x240 [nouveau]
>>     nvkm_object_fini+0x75/0x240 [nouveau]
>>     nouveau_do_suspend+0xf5/0x280 [nouveau]
>>     nouveau_pmops_runtime_suspend+0x3e/0xb0 [nouveau]
>>     pci_pm_runtime_suspend+0x67/0x1e0
>>     ? __pfx_pci_pm_runtime_suspend+0x10/0x10
>>     __rpm_callback+0x41/0x170
>>     ? __pfx_pci_pm_runtime_suspend+0x10/0x10
>>     rpm_callback+0x5d/0x70
>>     ? __pfx_pci_pm_runtime_suspend+0x10/0x10
>>     rpm_suspend+0x120/0x6a0
>>     pm_runtime_work+0x98/0xb0
>>     process_one_work+0x171/0x340
>>     worker_thread+0x27b/0x3a0
>>     ? __pfx_worker_thread+0x10/0x10
>>     kthread+0xe5/0x120
>>     ? __pfx_kthread+0x10/0x10
>>     ret_from_fork+0x31/0x50
>>     ? __pfx_kthread+0x10/0x10
>>     ret_from_fork_asm+0x1b/0x30
>>
>>    nouveau 0000:01:00.0: gsp: suspend failed, -12
>>    nouveau: DRM-master:00000000:00000080: suspend failed with -12
>>    nouveau 0000:01:00.0: can't suspend (nouveau_pmops_runtime_suspend
>>    [nouveau] returned -12)
>>
>> Keep in mind, I don't dive into memory management related stuff like
>> this very often! But I'd very much like to know how to help out
>> anywhere around the driver, including outside of my usual domains, so
>> I've been trying to write up a patch for this. The original suggestion
>> for a fix that Dave Airlie had given me was (unless I misunderstood,
>> which isn't unlikely) to try to see if we could get nvkm_gsp_mem_ctor()
>> to start allocating memory with vmalloc() and map that onto the GPU
>> using the SG helpers instead. So - I gave a shot at writing up a patch
>> for doing that:
>>
>> https://gitlab.freedesktop.org/lyudess/linux/-/commit/b5a41ac2bd948979815d262d8d20b4f3333f9c26
>>
>> As you can probably guess - the patch does not really seem to work, and
>> I've been trying to figure out why. There's already a couple of issues
>> I'm aware of: the most glaring one being that as Timur pointed out, a
>> lot of GSP hardware expects contiguous memory allocations - but
>> according to them the allocation that's specifically failing should be
>> small enough that it'd be allocated in a contiguous page anyway:
> nvkm_gsp_mem_ctor is used to do coherent allocations in a bunch of
> places in the gsp code, we can't use vmalloc for a lot of them. A lot
> of the allocations are small multi-page and hang around and the
> hardware expects allocations to be non-scattered.
>
> Now in this single case we have a large amount of data pointed to by a
> radix3 page table.
>
> The data is allocated with nvkm_gsp_sg, then we fail to allocate the
> first level of page tables with the coherent allocation. However I
> don't think the first level of the page table needs to be allocated
> with the coherent allocator, we should allocate it with nvkm_gsp_sg
> instead.

Yes, that seems sensible here.  Lyude, did you want me to take a look at 
making this change, or are you working on it already?

Ben.

>
> Dave.
>
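
A side note on the numbers in the trace quoted above: "order:7" means the allocator was asked for 2^7 physically contiguous pages in one block, which is the kind of request that fails once memory is fragmented. The sketch below is only back-of-the-envelope arithmetic, not the driver's code. It assumes 4 KiB pages, and the 8-byte entry size and "one entry per page of data" table layout are illustrative assumptions rather than details stated in this thread; the 256 MiB buffer is likewise a hypothetical figure.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, the x86_64 default


def order_bytes(order: int, page_size: int = PAGE_SIZE) -> int:
    """Size in bytes of one physically contiguous block of the given order."""
    return page_size << order


def min_order(nbytes: int, page_size: int = PAGE_SIZE) -> int:
    """Smallest order whose block can hold nbytes."""
    pages = -(-nbytes // page_size)  # ceiling division
    order = 0
    while (1 << order) < pages:
        order += 1
    return order


def table_bytes(data_bytes: int, entry_size: int = 8,
                page_size: int = PAGE_SIZE) -> int:
    """Bytes needed for a flat table holding one entry per page of data,
    rounded up to whole pages. The entry size is an assumption."""
    entries = -(-data_bytes // page_size)
    raw = entries * entry_size
    return -(-raw // page_size) * page_size


if __name__ == "__main__":
    print(order_bytes(7))             # bytes in one order-7 block
    hypothetical = 256 * 1024 * 1024  # hypothetical 256 MiB buffer
    print(min_order(table_bytes(hypothetical)))
```

Under these assumptions, a single contiguous table describing a few hundred MiB of data already needs an order-7 block, while the same table built from scattered pages only ever needs order-0 allocations.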

Thread overview: 4+ messages
2024-04-18 20:27 Issues with trying to boot falcons from sgt memory + Possible firmware SG_DEBUG fix? Lyude Paul
2024-04-18 22:14 ` David Airlie
2024-04-19 13:54   ` Ben Skeggs
2024-04-19 13:52 ` Ben Skeggs
