linux-kernel.vger.kernel.org archive mirror
* nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device
@ 2021-06-05 19:43 Ondrej Zary
  2021-06-05 21:22 ` [Nouveau] " Ilia Mirkin
  2021-06-05 21:34 ` Ondrej Zary
  0 siblings, 2 replies; 20+ messages in thread
From: Ondrej Zary @ 2021-06-05 19:43 UTC (permalink / raw)
  To: Ben Skeggs; +Cc: dri-devel, nouveau, linux-kernel

Hello,
I'm testing 5.13.0-rc4 and nouveau crashes with a NULL pointer dereference in nouveau_bo_sync_for_device.
Found various reports like this, but those were back in February, so that should be fixed now.

[   21.003216] BUG: kernel NULL pointer dereference, address: 00000000
[   21.003235] #PF: supervisor read access in kernel mode
[   21.003243] #PF: error_code(0x0000) - not-present page
[   21.003250] *pde = 00000000
[   21.003258] Oops: 0000 [#1] SMP
[   21.003268] CPU: 0 PID: 222 Comm: systemd-udevd Not tainted 5.13.0-rc4+ #327
[   21.003278] Hardware name:  /848P-ICH5, BIOS 6.00 PG 02/03/2005
[   21.003285] EIP: nouveau_bo_sync_for_device+0x9e/0xbf [nouveau]
[   21.003571] Code: 02 89 45 e8 01 d1 8b 19 89 5d ec bb 01 00 00 00 3b 5d e8 74 0d 89 d8 c1 e0 05 03 45 ec 39 04 99 74 1e 8b 46 10 89 d9 c1 e1 0c <8b> 14 10 8b 47 e0 8b 40 08 6a 01 e8 d5 03 55 df 01 5d f0 58 eb ae
[   21.003588] EAX: 00000000 EBX: 00000010 ECX: 00010000 EDX: 00000000
[   21.003597] ESI: c3e90280 EDI: c185a494 EBP: c2ed7c10 ESP: c2ed7bf8
[   21.003606] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210206
[   21.003615] CR0: 80050033 CR2: 00000000 CR3: 02ecb000 CR4: 00000690
[   21.003625] Call Trace:
[   21.003635]  nouveau_bo_validate+0x3f/0x48 [nouveau]
[   21.003911]  nouveau_bo_pin+0xf0/0x187 [nouveau]
[   21.004182]  nouveau_channel_prep+0xc0/0x269 [nouveau]
[   21.004454]  nouveau_channel_new+0x3c/0x5f5 [nouveau]
[   21.004725]  ? slab_free_freelist_hook+0x3b/0xa7
[   21.004740]  ? kfree+0x9e/0x11a
[   21.004749]  ? nvif_object_sclass_put+0xd/0x16 [nouveau]
[   21.004944]  nouveau_drm_device_init+0x2e2/0x646 [nouveau]
[   21.005186]  ? pci_enable_device_flags+0x23/0x97
[   21.005202]  nouveau_drm_probe+0xe5/0x182 [nouveau]
[   21.005443]  ? nouveau_drm_device_init+0x646/0x646 [nouveau]
[   21.005683]  pci_device_probe+0x89/0xe9
[   21.005696]  really_probe+0x127/0x2b9
[   21.005707]  driver_probe_device+0x62/0x89
[   21.005715]  device_driver_attach+0x2e/0x41
[   21.005724]  __driver_attach+0x83/0x8a
[   21.005732]  bus_for_each_dev+0x4c/0x66
[   21.005740]  driver_attach+0x14/0x16
[   21.005747]  ? device_driver_attach+0x41/0x41
[   21.005756]  bus_add_driver+0xc5/0x16c
[   21.005764]  driver_register+0x87/0xb9
[   21.005772]  __pci_register_driver+0x38/0x3b
[   21.005780]  ? 0xf0be4000
[   21.005787]  nouveau_drm_init+0x14c/0x1000 [nouveau]
[   21.005964]  do_one_initcall+0x5a/0x134
[   21.005975]  ? __vunmap+0x124/0x12d
[   21.005984]  ? __vunmap+0x124/0x12d
[   21.005992]  ? kmem_cache_alloc+0xa8/0xb6
[   21.006001]  ? do_init_module+0x17/0x1cf
[   21.006012]  do_init_module+0x46/0x1cf
[   21.006021]  load_module+0x1799/0x1bcb
[   21.006032]  __ia32_sys_finit_module+0x72/0x7a
[   21.006044]  do_int80_syscall_32+0x53/0x62
[   21.006054]  entry_INT80_32+0xf0/0xf0
[   21.006063] EIP: 0xb7f40092
[   21.006071] Code: 00 00 00 e9 90 ff ff ff ff a3 24 00 00 00 68 30 00 00 00 e9 80 ff ff ff ff a3 e8 ff ff ff 66 90 00 00 00 00 00 00 00 00 cd 80 <c3> 8d b4 26 00 00 00 00 8d b6 00 00 00 00 8b 1c 24 c3 8d b4 26 00
[   21.006086] EAX: ffffffda EBX: 00000010 ECX: b7e9bbdd EDX: 00000000
[   21.006095] ESI: 008f27d0 EDI: 008f9e10 EBP: 00000000 ESP: bfa140b8
[   21.006103] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00200296
[   21.006114] Modules linked in: nouveau(+) snd_intel8x0 snd_ac97_codec pcmcia wmi hwmon ac97_bus yenta_socket pcmcia_rsrc drm_ttm_helper snd_pcm ttm snd_timer pcmcia_core psmouse 8139cp snd sg soundcore serio_raw parport_pc intel_agp parport
[   21.006165] CR2: 0000000000000000
[   21.006201] ---[ end trace 02dc541683feafc6 ]---
[   21.006211] EIP: nouveau_bo_sync_for_device+0x9e/0xbf [nouveau]
[   21.006460] Code: 02 89 45 e8 01 d1 8b 19 89 5d ec bb 01 00 00 00 3b 5d e8 74 0d 89 d8 c1 e0 05 03 45 ec 39 04 99 74 1e 8b 46 10 89 d9 c1 e1 0c <8b> 14 10 8b 47 e0 8b 40 08 6a 01 e8 d5 03 55 df 01 5d f0 58 eb ae
[   21.006476] EAX: 00000000 EBX: 00000010 ECX: 00010000 EDX: 00000000
[   21.006485] ESI: c3e90280 EDI: c185a494 EBP: c2ed7c10 ESP: c2ed7bf8
[   21.006494] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210206
[   21.006503] CR0: 80050033 CR2: 00000000 CR3: 02ecb000 CR4: 00000690
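
For readers decoding the trace above: the "#PF: error_code(0x0000)" line packs the fault cause into individual bits, and the two "#PF:" lines are the kernel's own rendering of those bits. A minimal sketch of that decoding follows, assuming the standard x86 page-fault error code layout; the function name decode_pf_error_code is made up for this illustration and is not kernel code.

```python
def decode_pf_error_code(code):
    """Decode the low bits of an x86 page-fault error code.

    Illustrative helper only: the bit layout is the architectural one,
    the function itself is not taken from the kernel.
    """
    return {
        # bit 0: 0 = page not present, 1 = protection violation
        "cause": "protection violation" if code & 0x1 else "not-present page",
        # bit 1: 0 = read access, 1 = write access
        "access": "write" if code & 0x2 else "read",
        # bit 2: 0 = supervisor (kernel) mode, 1 = user mode
        "mode": "user" if code & 0x4 else "supervisor",
        # bit 3: a reserved bit was set in a paging structure
        "reserved_bit": bool(code & 0x8),
        # bit 4: the fault was caused by an instruction fetch
        "instruction_fetch": bool(code & 0x10),
    }


if __name__ == "__main__":
    # error_code(0x0000) from the oops: a kernel-mode read of an unmapped page
    print(decode_pf_error_code(0x0000))
```

With error code 0 this gives a supervisor-mode read of a not-present page, which agrees with the "#PF:" lines and with CR2 = 00000000 in the report.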


-- 
Ondrej Zary


* Re: [Nouveau] nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device
  2021-06-05 19:43 nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device Ondrej Zary
@ 2021-06-05 21:22 ` Ilia Mirkin
  2021-06-05 21:34 ` Ondrej Zary
  1 sibling, 0 replies; 20+ messages in thread
From: Ilia Mirkin @ 2021-06-05 21:22 UTC (permalink / raw)
  To: Ondrej Zary; +Cc: Ben Skeggs, nouveau, LKML, dri-devel

Another instance of a report like this here:
https://gitlab.freedesktop.org/drm/nouveau/-/issues/92

On Sat, Jun 5, 2021 at 3:53 PM Ondrej Zary <linux@zary.sk> wrote:
>
> Hello,
> I'm testing 5.13.0-rc4 and nouveau crashes with a NULL pointer dereference in nouveau_bo_sync_for_device.
> Found various reports like this, but those were back in February, so that should be fixed now.
>
> [   21.003216] BUG: kernel NULL pointer dereference, address: 00000000
> [   21.003235] #PF: supervisor read access in kernel mode
> [   21.003243] #PF: error_code(0x0000) - not-present page
> [   21.003250] *pde = 00000000
> [   21.003258] Oops: 0000 [#1] SMP
> [   21.003268] CPU: 0 PID: 222 Comm: systemd-udevd Not tainted 5.13.0-rc4+ #327
> [   21.003278] Hardware name:  /848P-ICH5, BIOS 6.00 PG 02/03/2005
> [   21.003285] EIP: nouveau_bo_sync_for_device+0x9e/0xbf [nouveau]
> [   21.003571] Code: 02 89 45 e8 01 d1 8b 19 89 5d ec bb 01 00 00 00 3b 5d e8 74 0d 89 d8 c1 e0 05 03 45 ec 39 04 99 74 1e 8b 46 10 89 d9 c1 e1 0c <8b> 14 10 8b 47 e0 8b 40 08 6a 01 e8 d5 03 55 df 01 5d f0 58 eb ae
> [   21.003588] EAX: 00000000 EBX: 00000010 ECX: 00010000 EDX: 00000000
> [   21.003597] ESI: c3e90280 EDI: c185a494 EBP: c2ed7c10 ESP: c2ed7bf8
> [   21.003606] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210206
> [   21.003615] CR0: 80050033 CR2: 00000000 CR3: 02ecb000 CR4: 00000690
> [   21.003625] Call Trace:
> [   21.003635]  nouveau_bo_validate+0x3f/0x48 [nouveau]
> [   21.003911]  nouveau_bo_pin+0xf0/0x187 [nouveau]
> [   21.004182]  nouveau_channel_prep+0xc0/0x269 [nouveau]
> [   21.004454]  nouveau_channel_new+0x3c/0x5f5 [nouveau]
> [   21.004725]  ? slab_free_freelist_hook+0x3b/0xa7
> [   21.004740]  ? kfree+0x9e/0x11a
> [   21.004749]  ? nvif_object_sclass_put+0xd/0x16 [nouveau]
> [   21.004944]  nouveau_drm_device_init+0x2e2/0x646 [nouveau]
> [   21.005186]  ? pci_enable_device_flags+0x23/0x97
> [   21.005202]  nouveau_drm_probe+0xe5/0x182 [nouveau]
> [   21.005443]  ? nouveau_drm_device_init+0x646/0x646 [nouveau]
> [   21.005683]  pci_device_probe+0x89/0xe9
> [   21.005696]  really_probe+0x127/0x2b9
> [   21.005707]  driver_probe_device+0x62/0x89
> [   21.005715]  device_driver_attach+0x2e/0x41
> [   21.005724]  __driver_attach+0x83/0x8a
> [   21.005732]  bus_for_each_dev+0x4c/0x66
> [   21.005740]  driver_attach+0x14/0x16
> [   21.005747]  ? device_driver_attach+0x41/0x41
> [   21.005756]  bus_add_driver+0xc5/0x16c
> [   21.005764]  driver_register+0x87/0xb9
> [   21.005772]  __pci_register_driver+0x38/0x3b
> [   21.005780]  ? 0xf0be4000
> [   21.005787]  nouveau_drm_init+0x14c/0x1000 [nouveau]
> [   21.005964]  do_one_initcall+0x5a/0x134
> [   21.005975]  ? __vunmap+0x124/0x12d
> [   21.005984]  ? __vunmap+0x124/0x12d
> [   21.005992]  ? kmem_cache_alloc+0xa8/0xb6
> [   21.006001]  ? do_init_module+0x17/0x1cf
> [   21.006012]  do_init_module+0x46/0x1cf
> [   21.006021]  load_module+0x1799/0x1bcb
> [   21.006032]  __ia32_sys_finit_module+0x72/0x7a
> [   21.006044]  do_int80_syscall_32+0x53/0x62
> [   21.006054]  entry_INT80_32+0xf0/0xf0
> [   21.006063] EIP: 0xb7f40092
> [   21.006071] Code: 00 00 00 e9 90 ff ff ff ff a3 24 00 00 00 68 30 00 00 00 e9 80 ff ff ff ff a3 e8 ff ff ff 66 90 00 00 00 00 00 00 00 00 cd 80 <c3> 8d b4 26 00 00 00 00 8d b6 00 00 00 00 8b 1c 24 c3 8d b4 26 00
> [   21.006086] EAX: ffffffda EBX: 00000010 ECX: b7e9bbdd EDX: 00000000
> [   21.006095] ESI: 008f27d0 EDI: 008f9e10 EBP: 00000000 ESP: bfa140b8
> [   21.006103] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00200296
> [   21.006114] Modules linked in: nouveau(+) snd_intel8x0 snd_ac97_codec pcmcia wmi hwmon ac97_bus yenta_socket pcmcia_rsrc drm_ttm_helper snd_pcm ttm snd_timer pcmcia_core psmouse 8139cp snd sg soundcore serio_raw parport_pc intel_agp parport
> [   21.006165] CR2: 0000000000000000
> [   21.006201] ---[ end trace 02dc541683feafc6 ]---
> [   21.006211] EIP: nouveau_bo_sync_for_device+0x9e/0xbf [nouveau]
> [   21.006460] Code: 02 89 45 e8 01 d1 8b 19 89 5d ec bb 01 00 00 00 3b 5d e8 74 0d 89 d8 c1 e0 05 03 45 ec 39 04 99 74 1e 8b 46 10 89 d9 c1 e1 0c <8b> 14 10 8b 47 e0 8b 40 08 6a 01 e8 d5 03 55 df 01 5d f0 58 eb ae
> [   21.006476] EAX: 00000000 EBX: 00000010 ECX: 00010000 EDX: 00000000
> [   21.006485] ESI: c3e90280 EDI: c185a494 EBP: c2ed7c10 ESP: c2ed7bf8
> [   21.006494] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210206
> [   21.006503] CR0: 80050033 CR2: 00000000 CR3: 02ecb000 CR4: 00000690
>
>
> --
> Ondrej Zary


* Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device
  2021-06-05 19:43 nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device Ondrej Zary
  2021-06-05 21:22 ` [Nouveau] " Ilia Mirkin
@ 2021-06-05 21:34 ` Ondrej Zary
  2021-06-06 21:16   ` Ondrej Zary
  1 sibling, 1 reply; 20+ messages in thread
From: Ondrej Zary @ 2021-06-05 21:34 UTC (permalink / raw)
  To: Ben Skeggs; +Cc: dri-devel, nouveau, linux-kernel

On Saturday 05 June 2021 21:43:52 Ondrej Zary wrote:
> Hello,
> I'm testing 5.13.0-rc4 and nouveau crashes with a NULL pointer dereference in nouveau_bo_sync_for_device.
> Found various reports like this, but those were back in February, so that should be fixed now.

So it is the same bug. Broken since 5.11. This revert fixes it in 5.11:
https://lists.freedesktop.org/archives/dri-devel/2021-February/298531.html

Added some debug printks to nouveau_bo_sync_for_device:
[   22.225048] ttm_dma=fc33b500
[   22.225066] ttm_dma->num_pages=18
[   22.225071] i=0 num_pages=16
[   22.225077] ttm_dma->dma_address=00000000
[   22.225094] BUG: kernel NULL pointer dereference, address: 00000000

So ttm->dma_address is NULL.
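
The debug output above (num_pages=18 for the buffer, then i=0 num_pages=16 for the first run) suggests a loop that merges contiguous pages and then syncs each run through ttm_dma->dma_address[i]. Below is a rough userspace model of such a loop, written in Python only to show why a NULL dma_address faults on the very first run. Everything in it (sync_ranges, the page-frame-number list, the explicit NULL check) is an assumption made for illustration, not the driver's code.

```python
PAGE_SIZE = 4096  # ECX=0x10000 in the oops equals 16 * PAGE_SIZE


def sync_ranges(pages, dma_address):
    """Model of a sync loop that merges contiguous pages into one call.

    pages       -- page frame numbers (hypothetical stand-in for ttm_dma->pages)
    dma_address -- one bus address per page, or None if never populated
    Returns a list of (dma_addr, length_in_bytes), one entry per merged run.
    """
    ranges = []
    i = 0
    while i < len(pages):
        num_pages = 1
        # Extend the run while the next page directly follows the previous one.
        while (i + num_pages < len(pages)
               and pages[i + num_pages] == pages[i] + num_pages):
            num_pages += 1
        if dma_address is None:
            # The oops shows the driver indexing the pointer here without
            # such a check; the model raises instead of crashing.
            raise ValueError(f"dma_address is NULL at i={i} num_pages={num_pages}")
        ranges.append((dma_address[i], num_pages * PAGE_SIZE))
        i += num_pages
    return ranges


if __name__ == "__main__":
    # 18 pages: one run of 16 contiguous pages plus two scattered ones.
    pages = list(range(100, 116)) + [200, 300]
    try:
        sync_ranges(pages, None)
    except ValueError as err:
        print(err)
```

In this model the failure is reported at i=0 with a 16-page run, the same point at which the printk output stops before the BUG line.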

-- 
Ondrej Zary


* Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device
  2021-06-05 21:34 ` Ondrej Zary
@ 2021-06-06 21:16   ` Ondrej Zary
  2021-06-07 20:58     ` Ondrej Zary
  0 siblings, 1 reply; 20+ messages in thread
From: Ondrej Zary @ 2021-06-06 21:16 UTC (permalink / raw)
  To: Ben Skeggs; +Cc: dri-devel, nouveau, linux-kernel

On Saturday 05 June 2021 23:34:23 Ondrej Zary wrote:
> On Saturday 05 June 2021 21:43:52 Ondrej Zary wrote:
> > Hello,
> > I'm testing 5.13.0-rc4 and nouveau crashes with a NULL pointer dereference in nouveau_bo_sync_for_device.
> > Found various reports like this, but those were back in February, so that should be fixed now.
> 
> So it is the same bug. Broken since 5.11. This revert fixes it in 5.11:
> https://lists.freedesktop.org/archives/dri-devel/2021-February/298531.html
> 
> Added some debug printks to nouveau_bo_sync_for_device:
> [   22.225048] ttm_dma=fc33b500
> [   22.225066] ttm_dma->num_pages=18
> [   22.225071] i=0 num_pages=16
> [   22.225077] ttm_dma->dma_address=00000000
> [   22.225094] BUG: kernel NULL pointer dereference, address: 00000000
> 
> So ttm->dma_address is NULL.
> 

Tested reverting f295c8cfec833c2707ff1512da10d65386dde7af again and it does not work...
Not sure what I did before.

Bisecting between 5.10 and 5.11 is impossible - I keep hitting a never-ending stream of bugs.
As always with nouveau...

-- 
Ondrej Zary


* Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device
  2021-06-06 21:16   ` Ondrej Zary
@ 2021-06-07 20:58     ` Ondrej Zary
  2021-06-08 18:47       ` Ondrej Zary
  0 siblings, 1 reply; 20+ messages in thread
From: Ondrej Zary @ 2021-06-07 20:58 UTC (permalink / raw)
  To: Ben Skeggs; +Cc: dri-devel, nouveau, linux-kernel

On Sunday 06 June 2021 23:16:03 Ondrej Zary wrote:
> On Saturday 05 June 2021 23:34:23 Ondrej Zary wrote:
> > On Saturday 05 June 2021 21:43:52 Ondrej Zary wrote:
> > > Hello,
> > > I'm testing 5.13.0-rc4 and nouveau crashes with a NULL pointer dereference in nouveau_bo_sync_for_device.
> > > Found various reports like this, but those were back in February, so that should be fixed now.
> > 
> > So it is the same bug. Broken since 5.11. This revert fixes it in 5.11:
> > https://lists.freedesktop.org/archives/dri-devel/2021-February/298531.html
> > 
> > Added some debug printks to nouveau_bo_sync_for_device:
> > [   22.225048] ttm_dma=fc33b500
> > [   22.225066] ttm_dma->num_pages=18
> > [   22.225071] i=0 num_pages=16
> > [   22.225077] ttm_dma->dma_address=00000000
> > [   22.225094] BUG: kernel NULL pointer dereference, address: 00000000
> > 
> > So ttm->dma_address is NULL.
> > 
> 
> Tested reverting f295c8cfec833c2707ff1512da10d65386dde7af again and it does not work...
> Not sure what I did before.
> 
> Bisecting between 5.10 and 5.11 is impossible - I keep hitting a never-ending stream of bugs.
> As always with nouveau...

e34b8feeaa4b65725b25f49c9b08a0f8707e8e86 seems to be the first bad commit
Going back one commit makes it crash in a different way:

[   55.444208] BUG: kernel NULL pointer dereference, address: 000001b0
[   55.444219] #PF: supervisor read access in kernel mode
[   55.444222] #PF: error_code(0x0000) - not-present page
[   55.444225] *pde = 00000000
[   55.444231] Oops: 0000 [#1] SMP
[   55.444237] CPU: 0 PID: 1740 Comm: Xorg Not tainted 5.9.0-rc5+ #361
[   55.444240] Hardware name:  /848P-ICH5, BIOS 6.00 PG 02/03/2005
[   55.444321] EIP: nouveau_bo_wr16+0x8/0x27 [nouveau]
[   55.444326] Code: 85 ff 74 0d 80 7d f3 00 74 07 80 a6 f4 01 00 00 fe 89 f0 e8 0c ef ff ff 8d 65 f4 89 f8 5b 5e 5f 5d c3 55 01 d2 89 e5 53 89 c3 <03> 93 b0 01 00 00 0f b7 c1 f6 83 b8 01 00 00 80 74 07 e8 40 49 69
[   55.444330] EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000000
[   55.444334] ESI: 00000020 EDI: e7a14400 EBP: e786fd98 ESP: e786fd94
[   55.444338] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210246
[   55.444341] CR0: 80050033 CR2: 000001b0 CR3: 27896000 CR4: 00000690
[   55.444344] Call Trace:
[   55.444395]  nv04_crtc_cursor_set+0x148/0x1d8 [nouveau]
[   55.444442]  ? ttm_bo_reserve.constprop.15+0x1c/0x1c [nouveau]
[   55.444451]  drm_mode_cursor_common+0x13b/0x1ad
[   55.444497]  ? ttm_bo_reserve.constprop.15+0x1c/0x1c [nouveau]
[   55.444504]  drm_mode_cursor_ioctl+0x2e/0x36
[   55.444509]  ? drm_mode_setplane+0x203/0x203
[   55.444514]  drm_ioctl_kernel+0x66/0x99
[   55.444518]  drm_ioctl+0x211/0x2d8
[   55.444522]  ? drm_mode_setplane+0x203/0x203
[   55.444529]  ? _cond_resched+0x1e/0x22
[   55.444533]  ? mutex_lock+0xb/0x24
[   55.444582]  ? nouveau_bo_add_io_reserve_lru+0x53/0x58 [nouveau]
[   55.444589]  ? rpm_resume.part.13+0x72/0x365
[   55.444594]  ? ktime_get_mono_fast_ns+0x5e/0xf2
[   55.444598]  ? __pm_runtime_resume+0x5b/0x63
[   55.444647]  nouveau_drm_ioctl+0x65/0x81 [nouveau]
[   55.444696]  ? nouveau_cli_work+0xc3/0xc3 [nouveau]
[   55.444702]  vfs_ioctl+0x1a/0x24
[   55.444706]  __ia32_sys_ioctl+0x583/0x59d
[   55.444711]  ? doublefault_shim+0x120/0x120
[   55.444717]  ? exit_to_user_mode_prepare+0x71/0xba
[   55.444721]  do_int80_syscall_32+0x2c/0x39
[   55.444725]  entry_INT80_32+0xf0/0xf0
[   55.444729] EIP: 0xb7fb2092
[   55.444733] Code: 00 00 00 e9 90 ff ff ff ff a3 24 00 00 00 68 30 00 00 00 e9 80 ff ff ff ff a3 e8 ff ff ff 66 90 00 00 00 00 00 00 00 00 cd 80 <c3> 8d b4 26 00 00 00 00 8d b6 00 00 00 00 8b 1c 24 c3 8d b4 26 00
[   55.444737] EAX: ffffffda EBX: 0000000e ECX: c01c64a3 EDX: bfe89750
[   55.444741] ESI: 02580b40 EDI: c01c64a3 EBP: 0000000e ESP: bfe89704
[   55.444744] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00200292
[   55.444748] Modules linked in: i2c_dev nouveau serial_cs snd_intel8x0 snd_ac97_codec wmi hwmon ttm ac97_bus 8139cp snd_pcm pcmcia snd_timer snd sg soundcore psmouse yenta_socket serio_raw pcmcia_rsrc pcmcia_core intel_agp parport_pc parport
[   55.444769] CR2: 00000000000001b0
[   55.444774] ---[ end trace e2b0d4c3c2e4e488 ]---
[   55.444827] EIP: nouveau_bo_wr16+0x8/0x27 [nouveau]
[   55.444831] Code: 85 ff 74 0d 80 7d f3 00 74 07 80 a6 f4 01 00 00 fe 89 f0 e8 0c ef ff ff 8d 65 f4 89 f8 5b 5e 5f 5d c3 55 01 d2 89 e5 53 89 c3 <03> 93 b0 01 00 00 0f b7 c1 f6 83 b8 01 00 00 80 74 07 e8 40 49 69
[   55.444835] EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000000
[   55.444838] ESI: 00000020 EDI: e7a14400 EBP: e786fd98 ESP: e786fd94
[   55.444842] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210246
[   55.444845] CR0: 80050033 CR2: 000001b0 CR3: 27896000 CR4: 00000690
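
A note on reading this second trace: in the "Code:" line the byte in angle brackets is the one EIP points at, so the faulting instruction starts with 03 93 b0 01 00 00. Read as x86 machine code, that is an add from memory at EBX plus a 32-bit displacement, and with EBX = 00000000 the address is just the displacement. That points to a field at offset 0x1b0 of a NULL structure pointer. The sketch below reproduces the arithmetic. Both helper names are invented for this example, and it handles only this one addressing form.

```python
def faulting_bytes(code_line):
    """Return the opcode bytes starting at the faulting instruction.

    In an oops "Code:" line the byte wrapped in <..> is where EIP points.
    """
    tokens = code_line.split()
    start = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    return [int(tok.strip("<>"), 16) for tok in tokens[start:]]


def disp32_address(insn, base_value):
    """Effective address for the 'base register + disp32' ModRM form.

    Only handles mod=10 without a SIB byte; other encodings are rejected.
    """
    modrm = insn[1]
    if modrm >> 6 != 0b10 or modrm & 0b111 == 0b100:
        raise ValueError("not a plain base+disp32 encoding")
    disp = int.from_bytes(bytes(insn[2:6]), "little")
    return (base_value + disp) & 0xFFFFFFFF


if __name__ == "__main__":
    # Excerpt of the Code: line from the oops above.
    code = "55 01 d2 89 e5 53 89 c3 <03> 93 b0 01 00 00 0f b7 c1"
    insn = faulting_bytes(code)
    ebx = 0x00000000  # EBX from the register dump
    print(hex(disp32_address(insn, ebx)))  # prints 0x1b0
```

The computed address equals CR2 (000001b0) from the register dump, so the NULL pointer here is the structure itself rather than a member of it, unlike the first crash.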


-- 
Ondrej Zary


* Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device
  2021-06-07 20:58     ` Ondrej Zary
@ 2021-06-08 18:47       ` Ondrej Zary
  2021-06-08 20:01         ` Ondrej Zary
  0 siblings, 1 reply; 20+ messages in thread
From: Ondrej Zary @ 2021-06-08 18:47 UTC (permalink / raw)
  To: Ben Skeggs; +Cc: dri-devel, nouveau, linux-kernel, Christian König

On Monday 07 June 2021 22:58:43 Ondrej Zary wrote:
> On Sunday 06 June 2021 23:16:03 Ondrej Zary wrote:
> > On Saturday 05 June 2021 23:34:23 Ondrej Zary wrote:
> > > On Saturday 05 June 2021 21:43:52 Ondrej Zary wrote:
> > > > Hello,
> > > > I'm testing 5.13.0-rc4 and nouveau crashes with a NULL pointer dereference in nouveau_bo_sync_for_device.
> > > > Found various reports like this, but those were back in February, so that should be fixed now.
> > > 
> > > So it is the same bug. Broken since 5.11. This revert fixes it in 5.11:
> > > https://lists.freedesktop.org/archives/dri-devel/2021-February/298531.html
> > > 
> > > Added some debug printks to nouveau_bo_sync_for_device:
> > > [   22.225048] ttm_dma=fc33b500
> > > [   22.225066] ttm_dma->num_pages=18
> > > [   22.225071] i=0 num_pages=16
> > > [   22.225077] ttm_dma->dma_address=00000000
> > > [   22.225094] BUG: kernel NULL pointer dereference, address: 00000000
> > > 
> > > So ttm->dma_address is NULL.
> > > 
> > 
> > Tested reverting f295c8cfec833c2707ff1512da10d65386dde7af again and it does not work...
> > Not sure what I did before.
> > 
> > Bisecting between 5.10 and 5.11 is impossible - I keep hitting a never-ending stream of bugs.
> > As always with nouveau...
> 
> e34b8feeaa4b65725b25f49c9b08a0f8707e8e86 seems to be the first bad commit
> Going back one commit makes it crash in a different way:
> 
> [   55.444208] BUG: kernel NULL pointer dereference, address: 000001b0
> [   55.444219] #PF: supervisor read access in kernel mode
> [   55.444222] #PF: error_code(0x0000) - not-present page
> [   55.444225] *pde = 00000000
> [   55.444231] Oops: 0000 [#1] SMP
> [   55.444237] CPU: 0 PID: 1740 Comm: Xorg Not tainted 5.9.0-rc5+ #361
> [   55.444240] Hardware name:  /848P-ICH5, BIOS 6.00 PG 02/03/2005
> [   55.444321] EIP: nouveau_bo_wr16+0x8/0x27 [nouveau]
> [   55.444326] Code: 85 ff 74 0d 80 7d f3 00 74 07 80 a6 f4 01 00 00 fe 89 f0 e8 0c ef ff ff 8d 65 f4 89 f8 5b 5e 5f 5d c3 55 01 d2 89 e5 53 89 c3 <03> 93 b0 01 00 00 0f b7 c1 f6 83 b8 01 00 00 80 74 07 e8 40 49 69
> [   55.444330] EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000000
> [   55.444334] ESI: 00000020 EDI: e7a14400 EBP: e786fd98 ESP: e786fd94
> [   55.444338] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210246
> [   55.444341] CR0: 80050033 CR2: 000001b0 CR3: 27896000 CR4: 00000690
> [   55.444344] Call Trace:
> [   55.444395]  nv04_crtc_cursor_set+0x148/0x1d8 [nouveau]
> [   55.444442]  ? ttm_bo_reserve.constprop.15+0x1c/0x1c [nouveau]
> [   55.444451]  drm_mode_cursor_common+0x13b/0x1ad
> [   55.444497]  ? ttm_bo_reserve.constprop.15+0x1c/0x1c [nouveau]
> [   55.444504]  drm_mode_cursor_ioctl+0x2e/0x36
> [   55.444509]  ? drm_mode_setplane+0x203/0x203
> [   55.444514]  drm_ioctl_kernel+0x66/0x99
> [   55.444518]  drm_ioctl+0x211/0x2d8
> [   55.444522]  ? drm_mode_setplane+0x203/0x203
> [   55.444529]  ? _cond_resched+0x1e/0x22
> [   55.444533]  ? mutex_lock+0xb/0x24
> [   55.444582]  ? nouveau_bo_add_io_reserve_lru+0x53/0x58 [nouveau]
> [   55.444589]  ? rpm_resume.part.13+0x72/0x365
> [   55.444594]  ? ktime_get_mono_fast_ns+0x5e/0xf2
> [   55.444598]  ? __pm_runtime_resume+0x5b/0x63
> [   55.444647]  nouveau_drm_ioctl+0x65/0x81 [nouveau]
> [   55.444696]  ? nouveau_cli_work+0xc3/0xc3 [nouveau]
> [   55.444702]  vfs_ioctl+0x1a/0x24
> [   55.444706]  __ia32_sys_ioctl+0x583/0x59d
> [   55.444711]  ? doublefault_shim+0x120/0x120
> [   55.444717]  ? exit_to_user_mode_prepare+0x71/0xba
> [   55.444721]  do_int80_syscall_32+0x2c/0x39
> [   55.444725]  entry_INT80_32+0xf0/0xf0
> [   55.444729] EIP: 0xb7fb2092
> [   55.444733] Code: 00 00 00 e9 90 ff ff ff ff a3 24 00 00 00 68 30 00 00 00 e9 80 ff ff ff ff a3 e8 ff ff ff 66 90 00 00 00 00 00 00 00 00 cd 80 <c3> 8d b4 26 00 00 00 00 8d b6 00 00 00 00 8b 1c 24 c3 8d b4 26 00
> [   55.444737] EAX: ffffffda EBX: 0000000e ECX: c01c64a3 EDX: bfe89750
> [   55.444741] ESI: 02580b40 EDI: c01c64a3 EBP: 0000000e ESP: bfe89704
> [   55.444744] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00200292
> [   55.444748] Modules linked in: i2c_dev nouveau serial_cs snd_intel8x0 snd_ac97_codec wmi hwmon ttm ac97_bus 8139cp snd_pcm pcmcia snd_timer snd sg soundcore psmouse yenta_socket serio_raw pcmcia_rsrc pcmcia_core intel_agp parport_pc parport
> [   55.444769] CR2: 00000000000001b0
> [   55.444774] ---[ end trace e2b0d4c3c2e4e488 ]---
> [   55.444827] EIP: nouveau_bo_wr16+0x8/0x27 [nouveau]
> [   55.444831] Code: 85 ff 74 0d 80 7d f3 00 74 07 80 a6 f4 01 00 00 fe 89 f0 e8 0c ef ff ff 8d 65 f4 89 f8 5b 5e 5f 5d c3 55 01 d2 89 e5 53 89 c3 <03> 93 b0 01 00 00 0f b7 c1 f6 83 b8 01 00 00 80 74 07 e8 40 49 69
> [   55.444835] EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000000
> [   55.444838] ESI: 00000020 EDI: e7a14400 EBP: e786fd98 ESP: e786fd94
> [   55.444842] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210246
> [   55.444845] CR0: 80050033 CR2: 000001b0 CR3: 27896000 CR4: 00000690

Bisected this crash:
# first bad commit: [141b15e59175aa174ca1f7596188bd15a7ca17ba] drm/nouveau: move io_reserve_lru handling into the driver v5

Adding Christian König to CC.


-- 
Ondrej Zary


* Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device
  2021-06-08 18:47       ` Ondrej Zary
@ 2021-06-08 20:01         ` Ondrej Zary
  2021-06-08 21:59           ` Ondrej Zary
  0 siblings, 1 reply; 20+ messages in thread
From: Ondrej Zary @ 2021-06-08 20:01 UTC (permalink / raw)
  To: Ben Skeggs; +Cc: dri-devel, nouveau, linux-kernel, Christian König

On Tuesday 08 June 2021 20:47:42 Ondrej Zary wrote:
> On Monday 07 June 2021 22:58:43 Ondrej Zary wrote:
> > On Sunday 06 June 2021 23:16:03 Ondrej Zary wrote:
> > > On Saturday 05 June 2021 23:34:23 Ondrej Zary wrote:
> > > > On Saturday 05 June 2021 21:43:52 Ondrej Zary wrote:
> > > > > Hello,
> > > > > I'm testing 5.13.0-rc4 and nouveau crashes with a NULL pointer dereference in nouveau_bo_sync_for_device.
> > > > > Found various reports like this, but those were back in February, so that should be fixed now.
> > > > 
> > > > So it is the same bug. Broken since 5.11. This revert fixes it in 5.11:
> > > > https://lists.freedesktop.org/archives/dri-devel/2021-February/298531.html
> > > > 
> > > > Added some debug printks to nouveau_bo_sync_for_device:
> > > > [   22.225048] ttm_dma=fc33b500
> > > > [   22.225066] ttm_dma->num_pages=18
> > > > [   22.225071] i=0 num_pages=16
> > > > [   22.225077] ttm_dma->dma_address=00000000
> > > > [   22.225094] BUG: kernel NULL pointer dereference, address: 00000000
> > > > 
> > > > So ttm->dma_address is NULL.
> > > > 
> > > 
> > > Tested reverting f295c8cfec833c2707ff1512da10d65386dde7af again and it does not work...
> > > Not sure what I did before.
> > > 
> > > Bisecting between 5.10 and 5.11 is impossible - I keep hitting a never-ending stream of bugs.
> > > As always with nouveau...
> > 
> > e34b8feeaa4b65725b25f49c9b08a0f8707e8e86 seems to be the first bad commit
> > Going back one commit makes it crash in a different way:
> > 
> > [   55.444208] BUG: kernel NULL pointer dereference, address: 000001b0
> > [   55.444219] #PF: supervisor read access in kernel mode
> > [   55.444222] #PF: error_code(0x0000) - not-present page
> > [   55.444225] *pde = 00000000
> > [   55.444231] Oops: 0000 [#1] SMP
> > [   55.444237] CPU: 0 PID: 1740 Comm: Xorg Not tainted 5.9.0-rc5+ #361
> > [   55.444240] Hardware name:  /848P-ICH5, BIOS 6.00 PG 02/03/2005
> > [   55.444321] EIP: nouveau_bo_wr16+0x8/0x27 [nouveau]
> > [   55.444326] Code: 85 ff 74 0d 80 7d f3 00 74 07 80 a6 f4 01 00 00 fe 89 f0 e8 0c ef ff ff 8d 65 f4 89 f8 5b 5e 5f 5d c3 55 01 d2 89 e5 53 89 c3 <03> 93 b0 01 00 00 0f b7 c1 f6 83 b8 01 00 00 80 74 07 e8 40 49 69
> > [   55.444330] EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000000
> > [   55.444334] ESI: 00000020 EDI: e7a14400 EBP: e786fd98 ESP: e786fd94
> > [   55.444338] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210246
> > [   55.444341] CR0: 80050033 CR2: 000001b0 CR3: 27896000 CR4: 00000690
> > [   55.444344] Call Trace:
> > [   55.444395]  nv04_crtc_cursor_set+0x148/0x1d8 [nouveau]
> > [   55.444442]  ? ttm_bo_reserve.constprop.15+0x1c/0x1c [nouveau]
> > [   55.444451]  drm_mode_cursor_common+0x13b/0x1ad
> > [   55.444497]  ? ttm_bo_reserve.constprop.15+0x1c/0x1c [nouveau]
> > [   55.444504]  drm_mode_cursor_ioctl+0x2e/0x36
> > [   55.444509]  ? drm_mode_setplane+0x203/0x203
> > [   55.444514]  drm_ioctl_kernel+0x66/0x99
> > [   55.444518]  drm_ioctl+0x211/0x2d8
> > [   55.444522]  ? drm_mode_setplane+0x203/0x203
> > [   55.444529]  ? _cond_resched+0x1e/0x22
> > [   55.444533]  ? mutex_lock+0xb/0x24
> > [   55.444582]  ? nouveau_bo_add_io_reserve_lru+0x53/0x58 [nouveau]
> > [   55.444589]  ? rpm_resume.part.13+0x72/0x365
> > [   55.444594]  ? ktime_get_mono_fast_ns+0x5e/0xf2
> > [   55.444598]  ? __pm_runtime_resume+0x5b/0x63
> > [   55.444647]  nouveau_drm_ioctl+0x65/0x81 [nouveau]
> > [   55.444696]  ? nouveau_cli_work+0xc3/0xc3 [nouveau]
> > [   55.444702]  vfs_ioctl+0x1a/0x24
> > [   55.444706]  __ia32_sys_ioctl+0x583/0x59d
> > [   55.444711]  ? doublefault_shim+0x120/0x120
> > [   55.444717]  ? exit_to_user_mode_prepare+0x71/0xba
> > [   55.444721]  do_int80_syscall_32+0x2c/0x39
> > [   55.444725]  entry_INT80_32+0xf0/0xf0
> > [   55.444729] EIP: 0xb7fb2092
> > [   55.444733] Code: 00 00 00 e9 90 ff ff ff ff a3 24 00 00 00 68 30 00 00 00 e9 80 ff ff ff ff a3 e8 ff ff ff 66 90 00 00 00 00 00 00 00 00 cd 80 <c3> 8d b4 26 00 00 00 00 8d b6 00 00 00 00 8b 1c 24 c3 8d b4 26 00
> > [   55.444737] EAX: ffffffda EBX: 0000000e ECX: c01c64a3 EDX: bfe89750
> > [   55.444741] ESI: 02580b40 EDI: c01c64a3 EBP: 0000000e ESP: bfe89704
> > [   55.444744] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00200292
> > [   55.444748] Modules linked in: i2c_dev nouveau serial_cs snd_intel8x0 snd_ac97_codec wmi hwmon ttm ac97_bus 8139cp snd_pcm pcmcia snd_timer snd sg soundcore psmouse yenta_socket serio_raw pcmcia_rsrc pcmcia_core intel_agp parport_pc parport
> > [   55.444769] CR2: 00000000000001b0
> > [   55.444774] ---[ end trace e2b0d4c3c2e4e488 ]---
> > [   55.444827] EIP: nouveau_bo_wr16+0x8/0x27 [nouveau]
> > [   55.444831] Code: 85 ff 74 0d 80 7d f3 00 74 07 80 a6 f4 01 00 00 fe 89 f0 e8 0c ef ff ff 8d 65 f4 89 f8 5b 5e 5f 5d c3 55 01 d2 89 e5 53 89 c3 <03> 93 b0 01 00 00 0f b7 c1 f6 83 b8 01 00 00 80 74 07 e8 40 49 69
> > [   55.444835] EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000000
> > [   55.444838] ESI: 00000020 EDI: e7a14400 EBP: e786fd98 ESP: e786fd94
> > [   55.444842] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210246
> > [   55.444845] CR0: 80050033 CR2: 000001b0 CR3: 27896000 CR4: 00000690
> 
> Bisected this crash:
> # first bad commit: [141b15e59175aa174ca1f7596188bd15a7ca17ba] drm/nouveau: move io_reserve_lru handling into the driver v5
> 
> Adding Christian König to CC.

Tracked it down to an uninitialized variable bug.
I see now that this was fixed by aea656b0d05ec5b8ed5beb2f94c4dd42ea834e9d.

-- 
Ondrej Zary


* Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device
  2021-06-08 20:01         ` Ondrej Zary
@ 2021-06-08 21:59           ` Ondrej Zary
  2021-06-09  6:43             ` Christian König
  0 siblings, 1 reply; 20+ messages in thread
From: Ondrej Zary @ 2021-06-08 21:59 UTC (permalink / raw)
  To: Ben Skeggs; +Cc: dri-devel, nouveau, linux-kernel, Christian König

On Tuesday 08 June 2021 22:01:56 Ondrej Zary wrote:
> On Tuesday 08 June 2021 20:47:42 Ondrej Zary wrote:
> > On Monday 07 June 2021 22:58:43 Ondrej Zary wrote:
> > > On Sunday 06 June 2021 23:16:03 Ondrej Zary wrote:
> > > > On Saturday 05 June 2021 23:34:23 Ondrej Zary wrote:
> > > > > On Saturday 05 June 2021 21:43:52 Ondrej Zary wrote:
> > > > > > Hello,
> > > > > > I'm testing 5.13.0-rc4 and nouveau crashes with a NULL pointer dereference in nouveau_bo_sync_for_device.
> > > > > > Found various reports like this, but those were back in February, so that should be fixed now.
> > > > > 
> > > > > So it is the same bug. Broken since 5.11. This revert fixes it in 5.11:
> > > > > https://lists.freedesktop.org/archives/dri-devel/2021-February/298531.html
> > > > > 
> > > > > Added some debug printks to nouveau_bo_sync_for_device:
> > > > > [   22.225048] ttm_dma=fc33b500
> > > > > [   22.225066] ttm_dma->num_pages=18
> > > > > [   22.225071] i=0 num_pages=16
> > > > > [   22.225077] ttm_dma->dma_address=00000000
> > > > > [   22.225094] BUG: kernel NULL pointer dereference, address: 00000000
> > > > > 
> > > > > So ttm->dma_address is NULL.
> > > > > 
> > > > 
> > > > Tested reverting f295c8cfec833c2707ff1512da10d65386dde7af again and it does not work...
> > > > Not sure what I did before.
> > > > 
> > > > Bisecting between 5.10 and 5.11 is impossible - I keep hitting a never-ending stream of bugs.
> > > > As always with nouveau...
> > > 
> > > e34b8feeaa4b65725b25f49c9b08a0f8707e8e86 seems to be the first bad commit
> > > Going back one commit makes it crash in a different way:
> > > 
> > > [   55.444208] BUG: kernel NULL pointer dereference, address: 000001b0
> > > [   55.444219] #PF: supervisor read access in kernel mode
> > > [   55.444222] #PF: error_code(0x0000) - not-present page
> > > [   55.444225] *pde = 00000000
> > > [   55.444231] Oops: 0000 [#1] SMP
> > > [   55.444237] CPU: 0 PID: 1740 Comm: Xorg Not tainted 5.9.0-rc5+ #361
> > > [   55.444240] Hardware name:  /848P-ICH5, BIOS 6.00 PG 02/03/2005
> > > [   55.444321] EIP: nouveau_bo_wr16+0x8/0x27 [nouveau]
> > > [   55.444326] Code: 85 ff 74 0d 80 7d f3 00 74 07 80 a6 f4 01 00 00 fe 89 f0 e8 0c ef ff ff 8d 65 f4 89 f8 5b 5e 5f 5d c3 55 01 d2 89 e5 53 89 c3 <03> 93 b0 01 00 00 0f b7 c1 f6 83 b8 01 00 00 80 74 07 e8 40 49 69
> > > [   55.444330] EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000000
> > > [   55.444334] ESI: 00000020 EDI: e7a14400 EBP: e786fd98 ESP: e786fd94
> > > [   55.444338] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210246
> > > [   55.444341] CR0: 80050033 CR2: 000001b0 CR3: 27896000 CR4: 00000690
> > > [   55.444344] Call Trace:
> > > [   55.444395]  nv04_crtc_cursor_set+0x148/0x1d8 [nouveau]
> > > [   55.444442]  ? ttm_bo_reserve.constprop.15+0x1c/0x1c [nouveau]
> > > [   55.444451]  drm_mode_cursor_common+0x13b/0x1ad
> > > [   55.444497]  ? ttm_bo_reserve.constprop.15+0x1c/0x1c [nouveau]
> > > [   55.444504]  drm_mode_cursor_ioctl+0x2e/0x36
> > > [   55.444509]  ? drm_mode_setplane+0x203/0x203
> > > [   55.444514]  drm_ioctl_kernel+0x66/0x99
> > > [   55.444518]  drm_ioctl+0x211/0x2d8
> > > [   55.444522]  ? drm_mode_setplane+0x203/0x203
> > > [   55.444529]  ? _cond_resched+0x1e/0x22
> > > [   55.444533]  ? mutex_lock+0xb/0x24
> > > [   55.444582]  ? nouveau_bo_add_io_reserve_lru+0x53/0x58 [nouveau]
> > > [   55.444589]  ? rpm_resume.part.13+0x72/0x365
> > > [   55.444594]  ? ktime_get_mono_fast_ns+0x5e/0xf2
> > > [   55.444598]  ? __pm_runtime_resume+0x5b/0x63
> > > [   55.444647]  nouveau_drm_ioctl+0x65/0x81 [nouveau]
> > > [   55.444696]  ? nouveau_cli_work+0xc3/0xc3 [nouveau]
> > > [   55.444702]  vfs_ioctl+0x1a/0x24
> > > [   55.444706]  __ia32_sys_ioctl+0x583/0x59d
> > > [   55.444711]  ? doublefault_shim+0x120/0x120
> > > [   55.444717]  ? exit_to_user_mode_prepare+0x71/0xba
> > > [   55.444721]  do_int80_syscall_32+0x2c/0x39
> > > [   55.444725]  entry_INT80_32+0xf0/0xf0
> > > [   55.444729] EIP: 0xb7fb2092
> > > [   55.444733] Code: 00 00 00 e9 90 ff ff ff ff a3 24 00 00 00 68 30 00 00 00 e9 80 ff ff ff ff a3 e8 ff ff ff 66 90 00 00 00 00 00 00 00 00 cd 80 <c3> 8d b4 26 00 00 00 00 8d b6 00 00 00 00 8b 1c 24 c3 8d b4 26 00
> > > [   55.444737] EAX: ffffffda EBX: 0000000e ECX: c01c64a3 EDX: bfe89750
> > > [   55.444741] ESI: 02580b40 EDI: c01c64a3 EBP: 0000000e ESP: bfe89704
> > > [   55.444744] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00200292
> > > [   55.444748] Modules linked in: i2c_dev nouveau serial_cs snd_intel8x0 snd_ac97_codec wmi hwmon ttm ac97_bus 8139cp snd_pcm pcmcia snd_timer snd sg soundcore psmouse yenta_socket serio_raw pcmcia_rsrc pcmcia_core intel_agp parport_pc parport
> > > [   55.444769] CR2: 00000000000001b0
> > > [   55.444774] ---[ end trace e2b0d4c3c2e4e488 ]---
> > > [   55.444827] EIP: nouveau_bo_wr16+0x8/0x27 [nouveau]
> > > [   55.444831] Code: 85 ff 74 0d 80 7d f3 00 74 07 80 a6 f4 01 00 00 fe 89 f0 e8 0c ef ff ff 8d 65 f4 89 f8 5b 5e 5f 5d c3 55 01 d2 89 e5 53 89 c3 <03> 93 b0 01 00 00 0f b7 c1 f6 83 b8 01 00 00 80 74 07 e8 40 49 69
> > > [   55.444835] EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000000
> > > [   55.444838] ESI: 00000020 EDI: e7a14400 EBP: e786fd98 ESP: e786fd94
> > > [   55.444842] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210246
> > > [   55.444845] CR0: 80050033 CR2: 000001b0 CR3: 27896000 CR4: 00000690
> > 
> > Bisected this crash:
> > # first bad commit: [141b15e59175aa174ca1f7596188bd15a7ca17ba] drm/nouveau: move io_reserve_lru handling into the driver v5
> > 
> > Adding Christian König to CC.
> 
> Tracked it down to an uninitialized variable bug.
> I see now that this was fixed by aea656b0d05ec5b8ed5beb2f94c4dd42ea834e9d.

So the first bad commit for the original bug is e34b8feeaa4b65725b25f49c9b08a0f8707e8e86
(as bisected before).
Going one commit back and fixing the uninitialized variable and endian bugs manually makes nouveau work.

-- 
Ondrej Zary

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device
  2021-06-08 21:59           ` Ondrej Zary
@ 2021-06-09  6:43             ` Christian König
  2021-06-09  6:57               ` Ondrej Zary
  0 siblings, 1 reply; 20+ messages in thread
From: Christian König @ 2021-06-09  6:43 UTC (permalink / raw)
  To: Ondrej Zary, Ben Skeggs; +Cc: dri-devel, nouveau, linux-kernel

Am 08.06.21 um 23:59 schrieb Ondrej Zary:
> On Tuesday 08 June 2021 22:01:56 Ondrej Zary wrote:
>> On Tuesday 08 June 2021 20:47:42 Ondrej Zary wrote:
>>> On Monday 07 June 2021 22:58:43 Ondrej Zary wrote:
>>>> On Sunday 06 June 2021 23:16:03 Ondrej Zary wrote:
>>>>> On Saturday 05 June 2021 23:34:23 Ondrej Zary wrote:
>>>>>> On Saturday 05 June 2021 21:43:52 Ondrej Zary wrote:
>>>>>>> Hello,
>>>>>>> I'm testing 5.13.0-rc4 and nouveau crashes with NULL pointer dereference in nouveau_bo_sync_for_device.
>>>>>>> Found various reports like this but that was back in February so that should be fixed now.
>>>>>> So it is the same bug. Broken since 5.11. This revert fixes it in 5.11:
>>>>>> https://lists.freedesktop.org/archives/dri-devel/2021-February/298531.html
>>>>>>
>>>>>> Added some debug printks to nouveau_bo_sync_for_device:
>>>>>> [   22.225048] ttm_dma=fc33b500
>>>>>> [   22.225066] ttm_dma->num_pages=18
>>>>>> [   22.225071] i=0 num_pages=16
>>>>>> [   22.225077] ttm_dma->dma_address=00000000
>>>>>> [   22.225094] BUG: kernel NULL pointer dereference, address: 00000000
>>>>>>
>>>>>> So ttm_dma->dma_address is NULL.
>>>>>>
>>>>> Tested reverting f295c8cfec833c2707ff1512da10d65386dde7af again and it does not work...
>>>>> Not sure what I did before.
>>>>>
>>>>> Bisecting between 5.10 and 5.11 is impossible - I keep hitting a never-ending stream of bugs.
>>>>> As always with nouveau...
>>>> e34b8feeaa4b65725b25f49c9b08a0f8707e8e86 seems to be the first bad commit
>>>> Going back one commit makes it crash in a different way:
>>>>
>>>> [   55.444208] BUG: kernel NULL pointer dereference, address: 000001b0
>>>> [   55.444219] #PF: supervisor read access in kernel mode
>>>> [   55.444222] #PF: error_code(0x0000) - not-present page
>>>> [   55.444225] *pde = 00000000
>>>> [   55.444231] Oops: 0000 [#1] SMP
>>>> [   55.444237] CPU: 0 PID: 1740 Comm: Xorg Not tainted 5.9.0-rc5+ #361
>>>> [   55.444240] Hardware name:  /848P-ICH5, BIOS 6.00 PG 02/03/2005
>>>> [   55.444321] EIP: nouveau_bo_wr16+0x8/0x27 [nouveau]
>>>> [   55.444326] Code: 85 ff 74 0d 80 7d f3 00 74 07 80 a6 f4 01 00 00 fe 89 f0 e8 0c ef ff ff 8d 65 f4 89 f8 5b 5e 5f 5d c3 55 01 d2 89 e5 53 89 c3 <03> 93 b0 01 00 00 0f b7 c1 f6 83 b8 01 00 00 80 74 07 e8 40 49 69
>>>> [   55.444330] EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000000
>>>> [   55.444334] ESI: 00000020 EDI: e7a14400 EBP: e786fd98 ESP: e786fd94
>>>> [   55.444338] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210246
>>>> [   55.444341] CR0: 80050033 CR2: 000001b0 CR3: 27896000 CR4: 00000690
>>>> [   55.444344] Call Trace:
>>>> [   55.444395]  nv04_crtc_cursor_set+0x148/0x1d8 [nouveau]
>>>> [   55.444442]  ? ttm_bo_reserve.constprop.15+0x1c/0x1c [nouveau]
>>>> [   55.444451]  drm_mode_cursor_common+0x13b/0x1ad
>>>> [   55.444497]  ? ttm_bo_reserve.constprop.15+0x1c/0x1c [nouveau]
>>>> [   55.444504]  drm_mode_cursor_ioctl+0x2e/0x36
>>>> [   55.444509]  ? drm_mode_setplane+0x203/0x203
>>>> [   55.444514]  drm_ioctl_kernel+0x66/0x99
>>>> [   55.444518]  drm_ioctl+0x211/0x2d8
>>>> [   55.444522]  ? drm_mode_setplane+0x203/0x203
>>>> [   55.444529]  ? _cond_resched+0x1e/0x22
>>>> [   55.444533]  ? mutex_lock+0xb/0x24
>>>> [   55.444582]  ? nouveau_bo_add_io_reserve_lru+0x53/0x58 [nouveau]
>>>> [   55.444589]  ? rpm_resume.part.13+0x72/0x365
>>>> [   55.444594]  ? ktime_get_mono_fast_ns+0x5e/0xf2
>>>> [   55.444598]  ? __pm_runtime_resume+0x5b/0x63
>>>> [   55.444647]  nouveau_drm_ioctl+0x65/0x81 [nouveau]
>>>> [   55.444696]  ? nouveau_cli_work+0xc3/0xc3 [nouveau]
>>>> [   55.444702]  vfs_ioctl+0x1a/0x24
>>>> [   55.444706]  __ia32_sys_ioctl+0x583/0x59d
>>>> [   55.444711]  ? doublefault_shim+0x120/0x120
>>>> [   55.444717]  ? exit_to_user_mode_prepare+0x71/0xba
>>>> [   55.444721]  do_int80_syscall_32+0x2c/0x39
>>>> [   55.444725]  entry_INT80_32+0xf0/0xf0
>>>> [   55.444729] EIP: 0xb7fb2092
>>>> [   55.444733] Code: 00 00 00 e9 90 ff ff ff ff a3 24 00 00 00 68 30 00 00 00 e9 80 ff ff ff ff a3 e8 ff ff ff 66 90 00 00 00 00 00 00 00 00 cd 80 <c3> 8d b4 26 00 00 00 00 8d b6 00 00 00 00 8b 1c 24 c3 8d b4 26 00
>>>> [   55.444737] EAX: ffffffda EBX: 0000000e ECX: c01c64a3 EDX: bfe89750
>>>> [   55.444741] ESI: 02580b40 EDI: c01c64a3 EBP: 0000000e ESP: bfe89704
>>>> [   55.444744] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00200292
>>>> [   55.444748] Modules linked in: i2c_dev nouveau serial_cs snd_intel8x0 snd_ac97_codec wmi hwmon ttm ac97_bus 8139cp snd_pcm pcmcia snd_timer snd sg soundcore psmouse yenta_socket serio_raw pcmcia_rsrc pcmcia_core intel_agp parport_pc parport
>>>> [   55.444769] CR2: 00000000000001b0
>>>> [   55.444774] ---[ end trace e2b0d4c3c2e4e488 ]---
>>>> [   55.444827] EIP: nouveau_bo_wr16+0x8/0x27 [nouveau]
>>>> [   55.444831] Code: 85 ff 74 0d 80 7d f3 00 74 07 80 a6 f4 01 00 00 fe 89 f0 e8 0c ef ff ff 8d 65 f4 89 f8 5b 5e 5f 5d c3 55 01 d2 89 e5 53 89 c3 <03> 93 b0 01 00 00 0f b7 c1 f6 83 b8 01 00 00 80 74 07 e8 40 49 69
>>>> [   55.444835] EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000000
>>>> [   55.444838] ESI: 00000020 EDI: e7a14400 EBP: e786fd98 ESP: e786fd94
>>>> [   55.444842] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210246
>>>> [   55.444845] CR0: 80050033 CR2: 000001b0 CR3: 27896000 CR4: 00000690
>>> Bisected this crash:
>>> # first bad commit: [141b15e59175aa174ca1f7596188bd15a7ca17ba] drm/nouveau: move io_reserve_lru handling into the driver v5
>>>
>>> Adding Christian König to CC.
>> Tracked it down to an uninitialized variable bug.
>> I see now that this was fixed by aea656b0d05ec5b8ed5beb2f94c4dd42ea834e9d.
> So the first bad commit for the original bug is e34b8feeaa4b65725b25f49c9b08a0f8707e8e86
> (as bisected before).
> Going one commit back and fixing the uninitialized variable and endian bugs manually makes nouveau work.

Thanks for the heads up. So the problem with my patch is already fixed, 
isn't it?

Regards,
Christian.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device
  2021-06-09  6:43             ` Christian König
@ 2021-06-09  6:57               ` Ondrej Zary
  2021-06-09  7:02                 ` Christian König
  0 siblings, 1 reply; 20+ messages in thread
From: Ondrej Zary @ 2021-06-09  6:57 UTC (permalink / raw)
  To: Christian König; +Cc: Ben Skeggs, dri-devel, nouveau, linux-kernel

On Wednesday 09 June 2021, Christian König wrote:
> Am 08.06.21 um 23:59 schrieb Ondrej Zary:
> > On Tuesday 08 June 2021 22:01:56 Ondrej Zary wrote:
> >> On Tuesday 08 June 2021 20:47:42 Ondrej Zary wrote:
> >>> On Monday 07 June 2021 22:58:43 Ondrej Zary wrote:
> >>>> On Sunday 06 June 2021 23:16:03 Ondrej Zary wrote:
> >>>>> On Saturday 05 June 2021 23:34:23 Ondrej Zary wrote:
> >>>>>> On Saturday 05 June 2021 21:43:52 Ondrej Zary wrote:
> >>>>>>> Hello,
> >>>>>>> I'm testing 5.13.0-rc4 and nouveau crashes with NULL pointer dereference in nouveau_bo_sync_for_device.
> >>>>>>> Found various reports like this but that was back in February so that should be fixed now.
> >>>>>> So it is the same bug. Broken since 5.11. This revert fixes it in 5.11:
> >>>>>> https://lists.freedesktop.org/archives/dri-devel/2021-February/298531.html
> >>>>>>
> >>>>>> Added some debug printks to nouveau_bo_sync_for_device:
> >>>>>> [   22.225048] ttm_dma=fc33b500
> >>>>>> [   22.225066] ttm_dma->num_pages=18
> >>>>>> [   22.225071] i=0 num_pages=16
> >>>>>> [   22.225077] ttm_dma->dma_address=00000000
> >>>>>> [   22.225094] BUG: kernel NULL pointer dereference, address: 00000000
> >>>>>>
> >>>>>> So ttm_dma->dma_address is NULL.
> >>>>>>
> >>>>> Tested reverting f295c8cfec833c2707ff1512da10d65386dde7af again and it does not work...
> >>>>> Not sure what I did before.
> >>>>>
> >>>>> Bisecting between 5.10 and 5.11 is impossible - I keep hitting a never-ending stream of bugs.
> >>>>> As always with nouveau...
> >>>> e34b8feeaa4b65725b25f49c9b08a0f8707e8e86 seems to be the first bad commit
> >>>> Going back one commit makes it crash in a different way:
> >>>>
> >>>> [   55.444208] BUG: kernel NULL pointer dereference, address: 000001b0
> >>>> [   55.444219] #PF: supervisor read access in kernel mode
> >>>> [   55.444222] #PF: error_code(0x0000) - not-present page
> >>>> [   55.444225] *pde = 00000000
> >>>> [   55.444231] Oops: 0000 [#1] SMP
> >>>> [   55.444237] CPU: 0 PID: 1740 Comm: Xorg Not tainted 5.9.0-rc5+ #361
> >>>> [   55.444240] Hardware name:  /848P-ICH5, BIOS 6.00 PG 02/03/2005
> >>>> [   55.444321] EIP: nouveau_bo_wr16+0x8/0x27 [nouveau]
> >>>> [   55.444326] Code: 85 ff 74 0d 80 7d f3 00 74 07 80 a6 f4 01 00 00 fe 89 f0 e8 0c ef ff ff 8d 65 f4 89 f8 5b 5e 5f 5d c3 55 01 d2 89 e5 53 89 c3 <03> 93 b0 01 00 00 0f b7 c1 f6 83 b8 01 00 00 80 74 07 e8 40 49 69
> >>>> [   55.444330] EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000000
> >>>> [   55.444334] ESI: 00000020 EDI: e7a14400 EBP: e786fd98 ESP: e786fd94
> >>>> [   55.444338] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210246
> >>>> [   55.444341] CR0: 80050033 CR2: 000001b0 CR3: 27896000 CR4: 00000690
> >>>> [   55.444344] Call Trace:
> >>>> [   55.444395]  nv04_crtc_cursor_set+0x148/0x1d8 [nouveau]
> >>>> [   55.444442]  ? ttm_bo_reserve.constprop.15+0x1c/0x1c [nouveau]
> >>>> [   55.444451]  drm_mode_cursor_common+0x13b/0x1ad
> >>>> [   55.444497]  ? ttm_bo_reserve.constprop.15+0x1c/0x1c [nouveau]
> >>>> [   55.444504]  drm_mode_cursor_ioctl+0x2e/0x36
> >>>> [   55.444509]  ? drm_mode_setplane+0x203/0x203
> >>>> [   55.444514]  drm_ioctl_kernel+0x66/0x99
> >>>> [   55.444518]  drm_ioctl+0x211/0x2d8
> >>>> [   55.444522]  ? drm_mode_setplane+0x203/0x203
> >>>> [   55.444529]  ? _cond_resched+0x1e/0x22
> >>>> [   55.444533]  ? mutex_lock+0xb/0x24
> >>>> [   55.444582]  ? nouveau_bo_add_io_reserve_lru+0x53/0x58 [nouveau]
> >>>> [   55.444589]  ? rpm_resume.part.13+0x72/0x365
> >>>> [   55.444594]  ? ktime_get_mono_fast_ns+0x5e/0xf2
> >>>> [   55.444598]  ? __pm_runtime_resume+0x5b/0x63
> >>>> [   55.444647]  nouveau_drm_ioctl+0x65/0x81 [nouveau]
> >>>> [   55.444696]  ? nouveau_cli_work+0xc3/0xc3 [nouveau]
> >>>> [   55.444702]  vfs_ioctl+0x1a/0x24
> >>>> [   55.444706]  __ia32_sys_ioctl+0x583/0x59d
> >>>> [   55.444711]  ? doublefault_shim+0x120/0x120
> >>>> [   55.444717]  ? exit_to_user_mode_prepare+0x71/0xba
> >>>> [   55.444721]  do_int80_syscall_32+0x2c/0x39
> >>>> [   55.444725]  entry_INT80_32+0xf0/0xf0
> >>>> [   55.444729] EIP: 0xb7fb2092
> >>>> [   55.444733] Code: 00 00 00 e9 90 ff ff ff ff a3 24 00 00 00 68 30 00 00 00 e9 80 ff ff ff ff a3 e8 ff ff ff 66 90 00 00 00 00 00 00 00 00 cd 80 <c3> 8d b4 26 00 00 00 00 8d b6 00 00 00 00 8b 1c 24 c3 8d b4 26 00
> >>>> [   55.444737] EAX: ffffffda EBX: 0000000e ECX: c01c64a3 EDX: bfe89750
> >>>> [   55.444741] ESI: 02580b40 EDI: c01c64a3 EBP: 0000000e ESP: bfe89704
> >>>> [   55.444744] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00200292
> >>>> [   55.444748] Modules linked in: i2c_dev nouveau serial_cs snd_intel8x0 snd_ac97_codec wmi hwmon ttm ac97_bus 8139cp snd_pcm pcmcia snd_timer snd sg soundcore psmouse yenta_socket serio_raw pcmcia_rsrc pcmcia_core intel_agp parport_pc parport
> >>>> [   55.444769] CR2: 00000000000001b0
> >>>> [   55.444774] ---[ end trace e2b0d4c3c2e4e488 ]---
> >>>> [   55.444827] EIP: nouveau_bo_wr16+0x8/0x27 [nouveau]
> >>>> [   55.444831] Code: 85 ff 74 0d 80 7d f3 00 74 07 80 a6 f4 01 00 00 fe 89 f0 e8 0c ef ff ff 8d 65 f4 89 f8 5b 5e 5f 5d c3 55 01 d2 89 e5 53 89 c3 <03> 93 b0 01 00 00 0f b7 c1 f6 83 b8 01 00 00 80 74 07 e8 40 49 69
> >>>> [   55.444835] EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000000
> >>>> [   55.444838] ESI: 00000020 EDI: e7a14400 EBP: e786fd98 ESP: e786fd94
> >>>> [   55.444842] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210246
> >>>> [   55.444845] CR0: 80050033 CR2: 000001b0 CR3: 27896000 CR4: 00000690
> >>> Bisected this crash:
> >>> # first bad commit: [141b15e59175aa174ca1f7596188bd15a7ca17ba] drm/nouveau: move io_reserve_lru handling into the driver v5
> >>>
> >>> Adding Christian König to CC.
> >> Tracked it down to an uninitialized variable bug.
> >> I see now that this was fixed by aea656b0d05ec5b8ed5beb2f94c4dd42ea834e9d.
> > So the first bad commit for the original bug is e34b8feeaa4b65725b25f49c9b08a0f8707e8e86
> > (as bisected before).
> > Going one commit back and fixing the uninitialized variable and endian bugs manually makes nouveau work.
> 
> Thanks for the heads up. So the problem with my patch is already fixed, 
> isn't it?

The NULL pointer dereference in nouveau_bo_wr16 introduced in
141b15e59175aa174ca1f7596188bd15a7ca17ba was fixed by
aea656b0d05ec5b8ed5beb2f94c4dd42ea834e9d.

That's the bug I hit when bisecting the original problem:
NULL pointer dereference in nouveau_bo_sync_for_device
It's caused by:
# first bad commit: [e34b8feeaa4b65725b25f49c9b08a0f8707e8e86] drm/ttm: merge ttm_dma_tt back into ttm_tt

-- 
Ondrej Zary

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device
  2021-06-09  6:57               ` Ondrej Zary
@ 2021-06-09  7:02                 ` Christian König
  2021-06-09  7:10                   ` Ondrej Zary
  0 siblings, 1 reply; 20+ messages in thread
From: Christian König @ 2021-06-09  7:02 UTC (permalink / raw)
  To: Ondrej Zary; +Cc: Ben Skeggs, dri-devel, nouveau, linux-kernel

Am 09.06.21 um 08:57 schrieb Ondrej Zary:
> [SNIP]
>> Thanks for the heads up. So the problem with my patch is already fixed,
>> isn't it?
> The NULL pointer dereference in nouveau_bo_wr16 introduced in
> 141b15e59175aa174ca1f7596188bd15a7ca17ba was fixed by
> aea656b0d05ec5b8ed5beb2f94c4dd42ea834e9d.
>
> That's the bug I hit when bisecting the original problem:
> NULL pointer dereference in nouveau_bo_sync_for_device
> It's caused by:
> # first bad commit: [e34b8feeaa4b65725b25f49c9b08a0f8707e8e86] drm/ttm: merge ttm_dma_tt back into ttm_tt

Good that I've asked :)

Ok, that's a bit strange. e34b8feeaa4b65725b25f49c9b08a0f8707e8e86 was
created mostly automatically.

Could you post the original backtrace of that NULL pointer deref once more?

Thanks,
Christian.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device
  2021-06-09  7:02                 ` Christian König
@ 2021-06-09  7:10                   ` Ondrej Zary
  2021-06-09  9:21                     ` Christian König
  0 siblings, 1 reply; 20+ messages in thread
From: Ondrej Zary @ 2021-06-09  7:10 UTC (permalink / raw)
  To: Christian König; +Cc: Ben Skeggs, dri-devel, nouveau, linux-kernel

On Wednesday 09 June 2021, Christian König wrote:
> Am 09.06.21 um 08:57 schrieb Ondrej Zary:
> > [SNIP]
> >> Thanks for the heads up. So the problem with my patch is already fixed,
> >> isn't it?
> > The NULL pointer dereference in nouveau_bo_wr16 introduced in
> > 141b15e59175aa174ca1f7596188bd15a7ca17ba was fixed by
> > aea656b0d05ec5b8ed5beb2f94c4dd42ea834e9d.
> >
> > That's the bug I hit when bisecting the original problem:
> > NULL pointer dereference in nouveau_bo_sync_for_device
> > It's caused by:
> > # first bad commit: [e34b8feeaa4b65725b25f49c9b08a0f8707e8e86] drm/ttm: merge ttm_dma_tt back into ttm_tt
> 
> Good that I've asked :)
> 
> > Ok, that's a bit strange. e34b8feeaa4b65725b25f49c9b08a0f8707e8e86 was
> > created mostly automatically.
> > 
> > Could you post the original backtrace of that NULL pointer deref once more?

The original backtrace is here: https://lkml.org/lkml/2021/6/5/350

-- 
Ondrej Zary

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device
  2021-06-09  7:10                   ` Ondrej Zary
@ 2021-06-09  9:21                     ` Christian König
  2021-06-09 20:00                       ` Ondrej Zary
  0 siblings, 1 reply; 20+ messages in thread
From: Christian König @ 2021-06-09  9:21 UTC (permalink / raw)
  To: Ondrej Zary; +Cc: Ben Skeggs, dri-devel, nouveau, linux-kernel

Am 09.06.21 um 09:10 schrieb Ondrej Zary:
> On Wednesday 09 June 2021, Christian König wrote:
>> Am 09.06.21 um 08:57 schrieb Ondrej Zary:
>>> [SNIP]
>>>> Thanks for the heads up. So the problem with my patch is already fixed,
>>>> isn't it?
>>> The NULL pointer dereference in nouveau_bo_wr16 introduced in
>>> 141b15e59175aa174ca1f7596188bd15a7ca17ba was fixed by
>>> aea656b0d05ec5b8ed5beb2f94c4dd42ea834e9d.
>>>
>>> That's the bug I hit when bisecting the original problem:
>>> NULL pointer dereference in nouveau_bo_sync_for_device
>>> It's caused by:
>>> # first bad commit: [e34b8feeaa4b65725b25f49c9b08a0f8707e8e86] drm/ttm: merge ttm_dma_tt back into ttm_tt
>> Good that I've asked :)
>>
>> Ok, that's a bit strange. e34b8feeaa4b65725b25f49c9b08a0f8707e8e86 was
>> created mostly automatically.
>>
>> Could you post the original backtrace of that NULL pointer deref once more?
> The original backtrace is here: https://lkml.org/lkml/2021/6/5/350

And the problem is that ttm_dma->dma_address is NULL, right? Mhm, I 
don't see how that can happen since nouveau is using ttm_sg_tt_init().

Apart from that what nouveau does here is rather questionable since you 
need a coherent architecture for most things anyway, but that's not what 
we are trying to fix here.

Can you try to narrow down if ttm_sg_tt_init is called before calling 
this function for the tt object in question?

Thanks,
Christian.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device
  2021-06-09  9:21                     ` Christian König
@ 2021-06-09 20:00                       ` Ondrej Zary
  2021-06-10  6:43                         ` Christian König
  0 siblings, 1 reply; 20+ messages in thread
From: Ondrej Zary @ 2021-06-09 20:00 UTC (permalink / raw)
  To: Christian König; +Cc: Ben Skeggs, dri-devel, nouveau, linux-kernel

On Wednesday 09 June 2021 11:21:05 Christian König wrote:
> Am 09.06.21 um 09:10 schrieb Ondrej Zary:
> > On Wednesday 09 June 2021, Christian König wrote:
> >> Am 09.06.21 um 08:57 schrieb Ondrej Zary:
> >>> [SNIP]
> >>>> Thanks for the heads up. So the problem with my patch is already fixed,
> >>>> isn't it?
> >>> The NULL pointer dereference in nouveau_bo_wr16 introduced in
> >>> 141b15e59175aa174ca1f7596188bd15a7ca17ba was fixed by
> >>> aea656b0d05ec5b8ed5beb2f94c4dd42ea834e9d.
> >>>
> >>> That's the bug I hit when bisecting the original problem:
> >>> NULL pointer dereference in nouveau_bo_sync_for_device
> >>> It's caused by:
> >>> # first bad commit: [e34b8feeaa4b65725b25f49c9b08a0f8707e8e86] drm/ttm: merge ttm_dma_tt back into ttm_tt
> >> Good that I've asked :)
> >>
> >> Ok, that's a bit strange. e34b8feeaa4b65725b25f49c9b08a0f8707e8e86 was
> >> created mostly automatically.
> >>
> >> Could you post the original backtrace of that NULL pointer deref once more?
> > The original backtrace is here: https://lkml.org/lkml/2021/6/5/350
> 
> And the problem is that ttm_dma->dma_address is NULL, right? Mhm, I 
> don't see how that can happen since nouveau is using ttm_sg_tt_init().
> 
> Apart from that what nouveau does here is rather questionable since you 
> need a coherent architecture for most things anyway, but that's not what 
> we are trying to fix here.
> 
> Can you try to narrow down if ttm_sg_tt_init is called before calling 
> this function for the tt object in question?

ttm_sg_tt_init is not called:
[   12.150124] nouveau 0000:01:00.0: DRM: VRAM: 31 MiB
[   12.150133] nouveau 0000:01:00.0: DRM: GART: 128 MiB
[   12.150143] nouveau 0000:01:00.0: DRM: BMP version 5.6
[   12.150151] nouveau 0000:01:00.0: DRM: No DCB data found in VBIOS
[   12.151362] ttm_tt_init
[   12.151370] ttm_tt_init_fields
[   12.151374] ttm_tt_alloc_page_directory
[   12.151615] BUG: kernel NULL pointer dereference, address: 00000000
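
A minimal userspace sketch of what the log above suggests, under the
assumption that the plain init path only sets up the page directory while
the sg/dma init path would also allocate the dma_address array. All model_*
names are hypothetical and are not the real TTM API; num_pages=18 is taken
from the earlier debug printk.

```c
/* Userspace model only: struct and function names are hypothetical,
 * they only mimic the shape of the TTM objects discussed in this thread. */
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct model_tt {
	void **pages;           /* page directory */
	uint64_t *dma_address;  /* per-page DMA addresses, NULL if never allocated */
	size_t num_pages;
};

/* Models the path seen in the log: only the page directory is allocated. */
static int model_tt_init(struct model_tt *tt, size_t num_pages)
{
	tt->num_pages = num_pages;
	tt->dma_address = NULL;
	tt->pages = calloc(num_pages, sizeof(*tt->pages));
	return tt->pages ? 0 : -1;
}

/* Models the path that was expected: the DMA address array is allocated too. */
static int model_sg_tt_init(struct model_tt *tt, size_t num_pages)
{
	if (model_tt_init(tt, num_pages))
		return -1;
	tt->dma_address = calloc(num_pages, sizeof(*tt->dma_address));
	if (!tt->dma_address) {
		free(tt->pages);
		tt->pages = NULL;
		return -1;
	}
	return 0;
}

/* Models a sync loop that reads dma_address[i] for every page.
 * Returns the number of pages visited, or -1 instead of
 * dereferencing a NULL dma_address array. */
static long model_sync_for_device(const struct model_tt *tt)
{
	uint64_t sum = 0;
	size_t i;

	if (!tt->dma_address)
		return -1;
	for (i = 0; i < tt->num_pages; i++)
		sum += tt->dma_address[i];
	(void)sum;
	return (long)tt->num_pages;
}

/* Releases whatever the init function allocated. */
static void model_tt_fini(struct model_tt *tt)
{
	free(tt->dma_address);
	free(tt->pages);
	tt->dma_address = NULL;
	tt->pages = NULL;
}
```

In this model an object set up with model_tt_init() makes the sync helper
return -1, which is the case where the unguarded loop in the driver would
fault at address 0. Whether a NULL check is the right fix for the real
driver is a separate question.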



-- 
Ondrej Zary

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device
  2021-06-09 20:00                       ` Ondrej Zary
@ 2021-06-10  6:43                         ` Christian König
  2021-06-10 17:50                           ` Ondrej Zary
  0 siblings, 1 reply; 20+ messages in thread
From: Christian König @ 2021-06-10  6:43 UTC (permalink / raw)
  To: Ondrej Zary; +Cc: Ben Skeggs, dri-devel, nouveau, linux-kernel



Am 09.06.21 um 22:00 schrieb Ondrej Zary:
> On Wednesday 09 June 2021 11:21:05 Christian König wrote:
>> Am 09.06.21 um 09:10 schrieb Ondrej Zary:
>>> On Wednesday 09 June 2021, Christian König wrote:
>>>> Am 09.06.21 um 08:57 schrieb Ondrej Zary:
>>>>> [SNIP]
>>>>>> Thanks for the heads up. So the problem with my patch is already fixed,
>>>>>> isn't it?
>>>>> The NULL pointer dereference in nouveau_bo_wr16 introduced in
>>>>> 141b15e59175aa174ca1f7596188bd15a7ca17ba was fixed by
>>>>> aea656b0d05ec5b8ed5beb2f94c4dd42ea834e9d.
>>>>>
>>>>> That's the bug I hit when bisecting the original problem:
>>>>> NULL pointer dereference in nouveau_bo_sync_for_device
>>>>> It's caused by:
>>>>> # first bad commit: [e34b8feeaa4b65725b25f49c9b08a0f8707e8e86] drm/ttm: merge ttm_dma_tt back into ttm_tt
>>>> Good that I've asked :)
>>>>
>>>> Ok, that's a bit strange. e34b8feeaa4b65725b25f49c9b08a0f8707e8e86 was
>>>> created mostly automatically.
>>>>
>>>> Could you post the original backtrace of that NULL pointer deref once more?
>>> The original backtrace is here: https://lkml.org/lkml/2021/6/5/350
>> And the problem is that ttm_dma->dma_address is NULL, right? Mhm, I
>> don't see how that can happen since nouveau is using ttm_sg_tt_init().
>>
>> Apart from that what nouveau does here is rather questionable since you
>> need a coherent architecture for most things anyway, but that's not what
>> we are trying to fix here.
>>
>> Can you try to narrow down if ttm_sg_tt_init is called before calling
>> this function for the tt object in question?
> ttm_sg_tt_init is not called:
> [   12.150124] nouveau 0000:01:00.0: DRM: VRAM: 31 MiB
> [   12.150133] nouveau 0000:01:00.0: DRM: GART: 128 MiB
> [   12.150143] nouveau 0000:01:00.0: DRM: BMP version 5.6
> [   12.150151] nouveau 0000:01:00.0: DRM: No DCB data found in VBIOS
> [   12.151362] ttm_tt_init
> [   12.151370] ttm_tt_init_fields
> [   12.151374] ttm_tt_alloc_page_directory
> [   12.151615] BUG: kernel NULL pointer dereference, address: 00000000

Please add dump_stack(); to ttm_tt_init() and report back with the 
backtrace.

I can't see how this is called from the nouveau code; the only possibility I
see is that it is called through the AGP code somehow.
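
For anyone who wants to try the same idea outside the kernel, here is a
rough userspace analogue of that instrumentation. It assumes glibc, whose
backtrace() and backtrace_symbols_fd() stand in for dump_stack(); the
model_* names are hypothetical, and in the kernel the change is just a
dump_stack(); call at the top of ttm_tt_init().

```c
/* Userspace analogue only (glibc): prints the current call chain to
 * stderr, roughly what dump_stack() does in the kernel log. */
#include <execinfo.h>
#include <unistd.h>

#define MODEL_MAX_FRAMES 32

/* Captures and prints the call chain; returns the number of frames. */
static int model_dump_stack(void)
{
	void *frames[MODEL_MAX_FRAMES];
	int n = backtrace(frames, MODEL_MAX_FRAMES);

	backtrace_symbols_fd(frames, n, STDERR_FILENO);
	return n;
}

/* Hypothetical stand-in for the init function being traced: the dump
 * goes first, so every caller shows up before any other work is done. */
static int model_tt_init_traced(void)
{
	return model_dump_stack();
}
```

Symbol names in the output depend on how the program is linked (with gcc,
-rdynamic usually helps); without that only raw addresses may be shown.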

Christian.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device
  2021-06-10  6:43                         ` Christian König
@ 2021-06-10 17:50                           ` Ondrej Zary
  2021-06-10 17:59                             ` Christian König
  0 siblings, 1 reply; 20+ messages in thread
From: Ondrej Zary @ 2021-06-10 17:50 UTC (permalink / raw)
  To: Christian König; +Cc: Ben Skeggs, dri-devel, nouveau, linux-kernel

On Thursday 10 June 2021 08:43:06 Christian König wrote:
> 
> Am 09.06.21 um 22:00 schrieb Ondrej Zary:
> > On Wednesday 09 June 2021 11:21:05 Christian König wrote:
> >> Am 09.06.21 um 09:10 schrieb Ondrej Zary:
> >>> On Wednesday 09 June 2021, Christian König wrote:
> >>>> Am 09.06.21 um 08:57 schrieb Ondrej Zary:
> >>>>> [SNIP]
> >>>>>> Thanks for the heads up. So the problem with my patch is already fixed,
> >>>>>> isn't it?
> >>>>> The NULL pointer dereference in nouveau_bo_wr16 introduced in
> >>>>> 141b15e59175aa174ca1f7596188bd15a7ca17ba was fixed by
> >>>>> aea656b0d05ec5b8ed5beb2f94c4dd42ea834e9d.
> >>>>>
> >>>>> That's the bug I hit when bisecting the original problem:
> >>>>> NULL pointer dereference in nouveau_bo_sync_for_device
> >>>>> It's caused by:
> >>>>> # first bad commit: [e34b8feeaa4b65725b25f49c9b08a0f8707e8e86] drm/ttm: merge ttm_dma_tt back into ttm_tt
> >>>> Good that I've asked :)
> >>>>
> >>>> Ok, that's a bit strange. e34b8feeaa4b65725b25f49c9b08a0f8707e8e86 was
> >>>> created mostly automatically.
> >>>>
> >>>> Could you post the original backtrace of that NULL pointer deref once more?
> >>> The original backtrace is here: https://lkml.org/lkml/2021/6/5/350
> >> And the problem is that ttm_dma->dma_address is NULL, right? Mhm, I
> >> don't see how that can happen since nouveau is using ttm_sg_tt_init().
> >>
> >> Apart from that what nouveau does here is rather questionable since you
> >> need a coherent architecture for most things anyway, but that's not what
> >> we are trying to fix here.
> >>
> >> Can you try to narrow down if ttm_sg_tt_init is called before calling
> >> this function for the tt object in question?
> > ttm_sg_tt_init is not called:
> > [   12.150124] nouveau 0000:01:00.0: DRM: VRAM: 31 MiB
> > [   12.150133] nouveau 0000:01:00.0: DRM: GART: 128 MiB
> > [   12.150143] nouveau 0000:01:00.0: DRM: BMP version 5.6
> > [   12.150151] nouveau 0000:01:00.0: DRM: No DCB data found in VBIOS
> > [   12.151362] ttm_tt_init
> > [   12.151370] ttm_tt_init_fields
> > [   12.151374] ttm_tt_alloc_page_directory
> > [   12.151615] BUG: kernel NULL pointer dereference, address: 00000000
> 
> Please add dump_stack(); to ttm_tt_init() and report back with the 
> backtrace.
> 
> I can't see how this is called from the nouveau code; the only possibility I
> see is that it is called through the AGP code somehow.

Yes, you're right:
[   13.192663] Call Trace:
[   13.192678]  dump_stack+0x54/0x68
[   13.192690]  ttm_tt_init+0x11/0x8a [ttm]
[   13.192699]  ttm_agp_tt_create+0x39/0x51 [ttm]
[   13.192840]  nouveau_ttm_tt_create+0x17/0x22 [nouveau]
[   13.192856]  ttm_tt_create+0x78/0x8c [ttm]
[   13.192864]  ttm_bo_handle_move_mem+0x7d/0xca [ttm]
[   13.192873]  ttm_bo_validate+0x92/0xc8 [ttm]
[   13.192883]  ttm_bo_init_reserved+0x216/0x243 [ttm]
[   13.192892]  ttm_bo_init+0x45/0x65 [ttm]
[   13.193018]  ? nouveau_bo_del_io_reserve_lru+0x48/0x48 [nouveau]
[   13.193150]  nouveau_bo_init+0x8c/0x94 [nouveau]
[   13.193273]  ? nouveau_bo_del_io_reserve_lru+0x48/0x48 [nouveau]
[   13.193407]  nouveau_bo_new+0x44/0x57 [nouveau]
[   13.193537]  nouveau_channel_prep+0xa3/0x269 [nouveau]
[   13.193665]  nouveau_channel_new+0x3c/0x5f7 [nouveau]
[   13.193679]  ? slab_free_freelist_hook+0x3b/0xa7
[   13.193686]  ? kfree+0x9e/0x11a
[   13.193781]  ? nvif_object_sclass_put+0xd/0x16 [nouveau]
[   13.193908]  nouveau_drm_device_init+0x2e2/0x646 [nouveau]
[   13.193924]  ? pci_enable_device_flags+0x1e/0xac
[   13.194052]  nouveau_drm_probe+0xeb/0x188 [nouveau]
[   13.194182]  ? nouveau_drm_device_init+0x646/0x646 [nouveau]
[   13.194195]  pci_device_probe+0x89/0xe9
[   13.194205]  really_probe+0x127/0x2a7
[   13.194212]  driver_probe_device+0x5b/0x87
[   13.194219]  device_driver_attach+0x2e/0x41
[   13.194226]  __driver_attach+0x7c/0x83
[   13.194232]  bus_for_each_dev+0x4c/0x66
[   13.194238]  driver_attach+0x14/0x16
[   13.194244]  ? device_driver_attach+0x41/0x41
[   13.194251]  bus_add_driver+0xc5/0x16c
[   13.194258]  driver_register+0x87/0xb9
[   13.194265]  __pci_register_driver+0x38/0x3b
[   13.194271]  ? 0xf0c0d000
[   13.194362]  nouveau_drm_init+0x14c/0x1000 [nouveau]

How is ttm_dma_tt->dma_address allocated? I cannot find any assignment
executed (in the working code):

$ git grep dma_address\ = drivers/gpu/
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c:       sg->sgl->dma_address = addr;
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c:                dma_address = &dma->dma_address[offset >> PAGE_SHIFT];
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c:                dma_address = (mm_node->start << PAGE_SHIFT) + offset;
drivers/gpu/drm/i915/gvt/scheduler.c:   sg->dma_address = addr;
drivers/gpu/drm/i915/i915_gpu_error.c:  sg->dma_address = it;
drivers/gpu/drm/ttm/ttm_tt.c:   ttm->dma_address = (void *) (ttm->ttm.pages + ttm->ttm.num_pages);
drivers/gpu/drm/ttm/ttm_tt.c:   ttm->dma_address = kvmalloc_array(ttm->ttm.num_pages,
drivers/gpu/drm/ttm/ttm_tt.c:   ttm_dma->dma_address = NULL;
drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c:             viter->dma_address = &__vmw_piter_phys_addr;
drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c:             viter->dma_address = &__vmw_piter_dma_addr;
drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c:             viter->dma_address = &__vmw_piter_sg_addr;

The 2 cases in ttm_tt.c are in ttm_dma_tt_alloc_page_directory() and
ttm_sg_tt_alloc_page_directory().
Confirmed by adding printk()s that they're NOT called.
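
For readers following along, here is a minimal userspace sketch of the
difference between the two kinds of page-directory helper. It is not the
TTM code: struct model_tt, model_dma_addr_t and both function names are
made up for illustration, and the only thing assumed from the kernel is
the "dma_address placed right behind pages" layout visible in the grep
output above. It shows why an object set up through the plain helper ends
up without a dma_address array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for dma_addr_t; pointer-sized to keep alignment simple. */
typedef uintptr_t model_dma_addr_t;

/* Hypothetical stand-in for the tt object; only the fields discussed here. */
struct model_tt {
	void **pages;
	model_dma_addr_t *dma_address;
	size_t num_pages;
};

/*
 * Models the plain helper: only the pages array is allocated.
 * dma_address is left as the caller initialized it (NULL in the tests).
 */
static int model_alloc_page_directory(struct model_tt *tt)
{
	tt->pages = calloc(tt->num_pages, sizeof(*tt->pages));
	return tt->pages ? 0 : -1;
}

/*
 * Models the DMA helper: one allocation holds both arrays, and
 * dma_address points just past the last pages entry.
 */
static int model_dma_alloc_page_directory(struct model_tt *tt)
{
	tt->pages = calloc(tt->num_pages,
			   sizeof(*tt->pages) + sizeof(*tt->dma_address));
	if (!tt->pages)
		return -1;
	tt->dma_address = (model_dma_addr_t *)(tt->pages + tt->num_pages);
	return 0;
}
```

With a zero-initialized struct model_tt, calling the plain helper leaves
dma_address NULL, which matches what the printk()s suggest for the AGP
path here.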


-- 
Ondrej Zary


* Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device
  2021-06-10 17:50                           ` Ondrej Zary
@ 2021-06-10 17:59                             ` Christian König
  2021-06-11 12:38                               ` Christian König
  0 siblings, 1 reply; 20+ messages in thread
From: Christian König @ 2021-06-10 17:59 UTC (permalink / raw)
  To: Ondrej Zary; +Cc: Ben Skeggs, dri-devel, nouveau, linux-kernel

Am 10.06.21 um 19:50 schrieb Ondrej Zary:
> On Thursday 10 June 2021 08:43:06 Christian König wrote:
>> Am 09.06.21 um 22:00 schrieb Ondrej Zary:
>>> On Wednesday 09 June 2021 11:21:05 Christian König wrote:
>>>> Am 09.06.21 um 09:10 schrieb Ondrej Zary:
>>>>> On Wednesday 09 June 2021, Christian König wrote:
>>>>>> Am 09.06.21 um 08:57 schrieb Ondrej Zary:
>>>>>>> [SNIP]
>>>>>>>> Thanks for the heads up. So the problem with my patch is already fixed,
>>>>>>>> isn't it?
>>>>>>> The NULL pointer dereference in nouveau_bo_wr16 introduced in
>>>>>>> 141b15e59175aa174ca1f7596188bd15a7ca17ba was fixed by
>>>>>>> aea656b0d05ec5b8ed5beb2f94c4dd42ea834e9d.
>>>>>>>
>>>>>>> That's the bug I hit when bisecting the original problem:
>>>>>>> NULL pointer dereference in nouveau_bo_sync_for_device
>>>>>>> It's caused by:
>>>>>>> # first bad commit: [e34b8feeaa4b65725b25f49c9b08a0f8707e8e86] drm/ttm: merge ttm_dma_tt back into ttm_tt
>>>>>> Good that I've asked :)
>>>>>>
>>>>>> Ok that's a bit strange. e34b8feeaa4b65725b25f49c9b08a0f8707e8e86 was
>>>>>> created mostly automated.
>>>>>>
>>>>>> Do you have the original backtrace of that NULL pointer deref once more?
>>>>> The original backtrace is here: https://lkml.org/lkml/2021/6/5/350
>>>> And the problem is that ttm_dma->dma_address is NULL, right? Mhm, I
>>>> don't see how that can happen since nouveau is using ttm_sg_tt_init().
>>>>
>>>> Apart from that what nouveau does here is rather questionable since you
>>>> need a coherent architecture for most things anyway, but that's not what
>>>> we are trying to fix here.
>>>>
>>>> Can you try to narrow down if ttm_sg_tt_init is called before calling
>>>> this function for the tt object in question?
>>> ttm_sg_tt_init is not called:
>>> [   12.150124] nouveau 0000:01:00.0: DRM: VRAM: 31 MiB
>>> [   12.150133] nouveau 0000:01:00.0: DRM: GART: 128 MiB
>>> [   12.150143] nouveau 0000:01:00.0: DRM: BMP version 5.6
>>> [   12.150151] nouveau 0000:01:00.0: DRM: No DCB data found in VBIOS
>>> [   12.151362] ttm_tt_init
>>> [   12.151370] ttm_tt_init_fields
>>> [   12.151374] ttm_tt_alloc_page_directory
>>> [   12.151615] BUG: kernel NULL pointer dereference, address: 00000000
>> Please add dump_stack(); to ttm_tt_init() and report back with the
>> backtrace.
>>
>> I can't see how this is called from the nouveau code, only possibility I
>> see is that it is maybe called through the AGP code somehow.
> Yes, you're right:
> [   13.192663] Call Trace:
> [   13.192678]  dump_stack+0x54/0x68
> [   13.192690]  ttm_tt_init+0x11/0x8a [ttm]
> [   13.192699]  ttm_agp_tt_create+0x39/0x51 [ttm]
> [   13.192840]  nouveau_ttm_tt_create+0x17/0x22 [nouveau]
> [   13.192856]  ttm_tt_create+0x78/0x8c [ttm]
> [   13.192864]  ttm_bo_handle_move_mem+0x7d/0xca [ttm]
> [   13.192873]  ttm_bo_validate+0x92/0xc8 [ttm]
> [   13.192883]  ttm_bo_init_reserved+0x216/0x243 [ttm]
> [   13.192892]  ttm_bo_init+0x45/0x65 [ttm]
> [   13.193018]  ? nouveau_bo_del_io_reserve_lru+0x48/0x48 [nouveau]
> [   13.193150]  nouveau_bo_init+0x8c/0x94 [nouveau]
> [   13.193273]  ? nouveau_bo_del_io_reserve_lru+0x48/0x48 [nouveau]
> [   13.193407]  nouveau_bo_new+0x44/0x57 [nouveau]
> [   13.193537]  nouveau_channel_prep+0xa3/0x269 [nouveau]
> [   13.193665]  nouveau_channel_new+0x3c/0x5f7 [nouveau]
> [   13.193679]  ? slab_free_freelist_hook+0x3b/0xa7
> [   13.193686]  ? kfree+0x9e/0x11a
> [   13.193781]  ? nvif_object_sclass_put+0xd/0x16 [nouveau]
> [   13.193908]  nouveau_drm_device_init+0x2e2/0x646 [nouveau]
> [   13.193924]  ? pci_enable_device_flags+0x1e/0xac
> [   13.194052]  nouveau_drm_probe+0xeb/0x188 [nouveau]
> [   13.194182]  ? nouveau_drm_device_init+0x646/0x646 [nouveau]
> [   13.194195]  pci_device_probe+0x89/0xe9
> [   13.194205]  really_probe+0x127/0x2a7
> [   13.194212]  driver_probe_device+0x5b/0x87
> [   13.194219]  device_driver_attach+0x2e/0x41
> [   13.194226]  __driver_attach+0x7c/0x83
> [   13.194232]  bus_for_each_dev+0x4c/0x66
> [   13.194238]  driver_attach+0x14/0x16
> [   13.194244]  ? device_driver_attach+0x41/0x41
> [   13.194251]  bus_add_driver+0xc5/0x16c
> [   13.194258]  driver_register+0x87/0xb9
> [   13.194265]  __pci_register_driver+0x38/0x3b
> [   13.194271]  ? 0xf0c0d000
> [   13.194362]  nouveau_drm_init+0x14c/0x1000 [nouveau]
>
> How is ttm_dma_tt->dma_address allocated?

Mhm, I need to double check how AGP is supposed to work.

Since barely anybody is using it these days it is something which breaks 
from time to time.

Thanks for the backtrace,
Christian.

>   I cannot find any assignment
> executed (in the working code):
>
> $ git grep dma_address\ = drivers/gpu/
> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c:       sg->sgl->dma_address = addr;
> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c:                dma_address = &dma->dma_address[offset >> PAGE_SHIFT];
> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c:                dma_address = (mm_node->start << PAGE_SHIFT) + offset;
> drivers/gpu/drm/i915/gvt/scheduler.c:   sg->dma_address = addr;
> drivers/gpu/drm/i915/i915_gpu_error.c:  sg->dma_address = it;
> drivers/gpu/drm/ttm/ttm_tt.c:   ttm->dma_address = (void *) (ttm->ttm.pages + ttm->ttm.num_pages);
> drivers/gpu/drm/ttm/ttm_tt.c:   ttm->dma_address = kvmalloc_array(ttm->ttm.num_pages,
> drivers/gpu/drm/ttm/ttm_tt.c:   ttm_dma->dma_address = NULL;
> drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c:             viter->dma_address = &__vmw_piter_phys_addr;
> drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c:             viter->dma_address = &__vmw_piter_dma_addr;
> drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c:             viter->dma_address = &__vmw_piter_sg_addr;
>
> The 2 cases in ttm_tt.c are in ttm_dma_tt_alloc_page_directory() and
> ttm_sg_tt_alloc_page_directory().
> Confirmed by adding printk()s that they're NOT called.
>
>



* Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device
  2021-06-10 17:59                             ` Christian König
@ 2021-06-11 12:38                               ` Christian König
  2021-06-11 18:23                                 ` Ondrej Zary
  0 siblings, 1 reply; 20+ messages in thread
From: Christian König @ 2021-06-11 12:38 UTC (permalink / raw)
  To: Ondrej Zary; +Cc: Ben Skeggs, dri-devel, nouveau, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 4084 bytes --]



Am 10.06.21 um 19:59 schrieb Christian König:
> Am 10.06.21 um 19:50 schrieb Ondrej Zary:
>> [SNIP]
>>> I can't see how this is called from the nouveau code, only 
>>> possibility I
>>> see is that it is maybe called through the AGP code somehow.
>> Yes, you're right:
>> [   13.192663] Call Trace:
>> [   13.192678]  dump_stack+0x54/0x68
>> [   13.192690]  ttm_tt_init+0x11/0x8a [ttm]
>> [   13.192699]  ttm_agp_tt_create+0x39/0x51 [ttm]
>> [   13.192840]  nouveau_ttm_tt_create+0x17/0x22 [nouveau]
>> [   13.192856]  ttm_tt_create+0x78/0x8c [ttm]
>> [   13.192864]  ttm_bo_handle_move_mem+0x7d/0xca [ttm]
>> [   13.192873]  ttm_bo_validate+0x92/0xc8 [ttm]
>> [   13.192883]  ttm_bo_init_reserved+0x216/0x243 [ttm]
>> [   13.192892]  ttm_bo_init+0x45/0x65 [ttm]
>> [   13.193018]  ? nouveau_bo_del_io_reserve_lru+0x48/0x48 [nouveau]
>> [   13.193150]  nouveau_bo_init+0x8c/0x94 [nouveau]
>> [   13.193273]  ? nouveau_bo_del_io_reserve_lru+0x48/0x48 [nouveau]
>> [   13.193407]  nouveau_bo_new+0x44/0x57 [nouveau]
>> [   13.193537]  nouveau_channel_prep+0xa3/0x269 [nouveau]
>> [   13.193665]  nouveau_channel_new+0x3c/0x5f7 [nouveau]
>> [   13.193679]  ? slab_free_freelist_hook+0x3b/0xa7
>> [   13.193686]  ? kfree+0x9e/0x11a
>> [   13.193781]  ? nvif_object_sclass_put+0xd/0x16 [nouveau]
>> [   13.193908]  nouveau_drm_device_init+0x2e2/0x646 [nouveau]
>> [   13.193924]  ? pci_enable_device_flags+0x1e/0xac
>> [   13.194052]  nouveau_drm_probe+0xeb/0x188 [nouveau]
>> [   13.194182]  ? nouveau_drm_device_init+0x646/0x646 [nouveau]
>> [   13.194195]  pci_device_probe+0x89/0xe9
>> [   13.194205]  really_probe+0x127/0x2a7
>> [   13.194212]  driver_probe_device+0x5b/0x87
>> [   13.194219]  device_driver_attach+0x2e/0x41
>> [   13.194226]  __driver_attach+0x7c/0x83
>> [   13.194232]  bus_for_each_dev+0x4c/0x66
>> [   13.194238]  driver_attach+0x14/0x16
>> [   13.194244]  ? device_driver_attach+0x41/0x41
>> [   13.194251]  bus_add_driver+0xc5/0x16c
>> [   13.194258]  driver_register+0x87/0xb9
>> [   13.194265]  __pci_register_driver+0x38/0x3b
>> [   13.194271]  ? 0xf0c0d000
>> [   13.194362]  nouveau_drm_init+0x14c/0x1000 [nouveau]
>>
>> How is ttm_dma_tt->dma_address allocated?
>
> Mhm, I need to double check how AGP is supposed to work.
>
> Since barely anybody is using it these days it is something which 
> breaks from time to time.

I have no idea how that ever worked in the first place since AGP isn't 
supposed to sync between CPU/GPU. Everything is coherent for that case.

Anyway here is a patch which adds a check to those functions if the 
dma_address array is allocated in the first place. Please test it.

Thanks,
Christian.

>
> Thanks for the backtrace,
> Christian.
>
>>   I cannot find any assignment
>> executed (in the working code):
>>
>> $ git grep dma_address\ = drivers/gpu/
>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c: 
>> sg->sgl->dma_address = addr;
>> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c: dma_address = 
>> &dma->dma_address[offset >> PAGE_SHIFT];
>> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c: dma_address = 
>> (mm_node->start << PAGE_SHIFT) + offset;
>> drivers/gpu/drm/i915/gvt/scheduler.c:   sg->dma_address = addr;
>> drivers/gpu/drm/i915/i915_gpu_error.c:  sg->dma_address = it;
>> drivers/gpu/drm/ttm/ttm_tt.c:   ttm->dma_address = (void *) 
>> (ttm->ttm.pages + ttm->ttm.num_pages);
>> drivers/gpu/drm/ttm/ttm_tt.c:   ttm->dma_address = 
>> kvmalloc_array(ttm->ttm.num_pages,
>> drivers/gpu/drm/ttm/ttm_tt.c:   ttm_dma->dma_address = NULL;
>> drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c: viter->dma_address = 
>> &__vmw_piter_phys_addr;
>> drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c: viter->dma_address = 
>> &__vmw_piter_dma_addr;
>> drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c: viter->dma_address = 
>> &__vmw_piter_sg_addr;
>>
>> The 2 cases in ttm_tt.c are in ttm_dma_tt_alloc_page_directory() and
>> ttm_sg_tt_alloc_page_directory().
>> Confirmed by adding printk()s that they're NOT called.
>>
>>
>


[-- Attachment #2: 0001-drm-nouveau-check-dma_address-array-for-CPU-GPU-sync.patch --]
[-- Type: text/x-patch, Size: 1362 bytes --]

From 5370102729c6ecb280712c40b92ff7b9f58c6e1e Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Christian=20K=C3=B6nig?= <christian.koenig@amd.com>
Date: Fri, 11 Jun 2021 14:34:50 +0200
Subject: [PATCH] drm/nouveau: check dma_address array for CPU/GPU sync
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

AGP for example doesn't have a dma_address array.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/nouveau/nouveau_bo.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
index 085023624fb0..1a52590f5303 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.c
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
@@ -551,7 +551,7 @@ nouveau_bo_sync_for_device(struct nouveau_bo *nvbo)
 	struct ttm_tt *ttm_dma = (struct ttm_tt *)nvbo->bo.ttm;
 	int i, j;
 
-	if (!ttm_dma)
+	if (!ttm_dma || !ttm_dma->dma_address)
 		return;
 	if (!ttm_dma->pages) {
 		NV_DEBUG(drm, "ttm_dma 0x%p: pages NULL\n", ttm_dma);
@@ -587,7 +587,7 @@ nouveau_bo_sync_for_cpu(struct nouveau_bo *nvbo)
 	struct ttm_tt *ttm_dma = (struct ttm_tt *)nvbo->bo.ttm;
 	int i, j;
 
-	if (!ttm_dma)
+	if (!ttm_dma || !ttm_dma->dma_address)
 		return;
 	if (!ttm_dma->pages) {
 		NV_DEBUG(drm, "ttm_dma 0x%p: pages NULL\n", ttm_dma);
-- 
2.25.1



* Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device
  2021-06-11 12:38                               ` Christian König
@ 2021-06-11 18:23                                 ` Ondrej Zary
  2021-06-14 11:07                                   ` Christian König
  0 siblings, 1 reply; 20+ messages in thread
From: Ondrej Zary @ 2021-06-11 18:23 UTC (permalink / raw)
  To: Christian König; +Cc: Ben Skeggs, dri-devel, nouveau, linux-kernel

On Friday 11 June 2021 14:38:18 Christian König wrote:
> 
> Am 10.06.21 um 19:59 schrieb Christian König:
> > Am 10.06.21 um 19:50 schrieb Ondrej Zary:
> >> [SNIP]
> >>> I can't see how this is called from the nouveau code, only 
> >>> possibility I
> >>> see is that it is maybe called through the AGP code somehow.
> >> Yes, you're right:
> >> [   13.192663] Call Trace:
> >> [   13.192678]  dump_stack+0x54/0x68
> >> [   13.192690]  ttm_tt_init+0x11/0x8a [ttm]
> >> [   13.192699]  ttm_agp_tt_create+0x39/0x51 [ttm]
> >> [   13.192840]  nouveau_ttm_tt_create+0x17/0x22 [nouveau]
> >> [   13.192856]  ttm_tt_create+0x78/0x8c [ttm]
> >> [   13.192864]  ttm_bo_handle_move_mem+0x7d/0xca [ttm]
> >> [   13.192873]  ttm_bo_validate+0x92/0xc8 [ttm]
> >> [   13.192883]  ttm_bo_init_reserved+0x216/0x243 [ttm]
> >> [   13.192892]  ttm_bo_init+0x45/0x65 [ttm]
> >> [   13.193018]  ? nouveau_bo_del_io_reserve_lru+0x48/0x48 [nouveau]
> >> [   13.193150]  nouveau_bo_init+0x8c/0x94 [nouveau]
> >> [   13.193273]  ? nouveau_bo_del_io_reserve_lru+0x48/0x48 [nouveau]
> >> [   13.193407]  nouveau_bo_new+0x44/0x57 [nouveau]
> >> [   13.193537]  nouveau_channel_prep+0xa3/0x269 [nouveau]
> >> [   13.193665]  nouveau_channel_new+0x3c/0x5f7 [nouveau]
> >> [   13.193679]  ? slab_free_freelist_hook+0x3b/0xa7
> >> [   13.193686]  ? kfree+0x9e/0x11a
> >> [   13.193781]  ? nvif_object_sclass_put+0xd/0x16 [nouveau]
> >> [   13.193908]  nouveau_drm_device_init+0x2e2/0x646 [nouveau]
> >> [   13.193924]  ? pci_enable_device_flags+0x1e/0xac
> >> [   13.194052]  nouveau_drm_probe+0xeb/0x188 [nouveau]
> >> [   13.194182]  ? nouveau_drm_device_init+0x646/0x646 [nouveau]
> >> [   13.194195]  pci_device_probe+0x89/0xe9
> >> [   13.194205]  really_probe+0x127/0x2a7
> >> [   13.194212]  driver_probe_device+0x5b/0x87
> >> [   13.194219]  device_driver_attach+0x2e/0x41
> >> [   13.194226]  __driver_attach+0x7c/0x83
> >> [   13.194232]  bus_for_each_dev+0x4c/0x66
> >> [   13.194238]  driver_attach+0x14/0x16
> >> [   13.194244]  ? device_driver_attach+0x41/0x41
> >> [   13.194251]  bus_add_driver+0xc5/0x16c
> >> [   13.194258]  driver_register+0x87/0xb9
> >> [   13.194265]  __pci_register_driver+0x38/0x3b
> >> [   13.194271]  ? 0xf0c0d000
> >> [   13.194362]  nouveau_drm_init+0x14c/0x1000 [nouveau]
> >>
> >> How is ttm_dma_tt->dma_address allocated?
> >
> > Mhm, I need to double check how AGP is supposed to work.
> >
> > Since barely anybody is using it these days it is something which 
> > breaks from time to time.
> 
> I have no idea how that ever worked in the first place since AGP isn't 
> supposed to sync between CPU/GPU. Everything is coherent for that case.
> 
> Anyway here is a patch which adds a check to those functions if the 
> dma_address array is allocated in the first place. Please test it.

Thanks, the patch fixes the problem and nouveau now works!
Should be applied to 5.12-stable too (5.11 is affected too but EOL).

It's weird that it worked before.
Looks like dma_address was used uninitialized - it contained some random
crap:
[   12.293304] nouveau_bo_sync_for_device: ttm_dma->dma_address=3e055971 ttm_dma->ttm.num_pages=18
[   12.293321] ttm_dma->dma_address[0]=0x0
[   12.293341] ttm_dma->dma_address[1]=0x0
[   12.293360] ttm_dma->dma_address[2]=0xee728980
[   12.293379] ttm_dma->dma_address[3]=0xed1cb120
[   12.293397] ttm_dma->dma_address[4]=0x12
[   12.293416] ttm_dma->dma_address[5]=0x0
[   12.293434] ttm_dma->dma_address[6]=0x1
[   12.293453] ttm_dma->dma_address[7]=0x0
[   12.293471] ttm_dma->dma_address[8]=0x10000
[   12.293490] ttm_dma->dma_address[9]=0x0
[   12.293510] ttm_dma->dma_address[10]=0x101
[   12.293528] ttm_dma->dma_address[11]=0xee7289ec
[   12.293546] ttm_dma->dma_address[12]=0xee7289ec
[   12.293564] ttm_dma->dma_address[13]=0x0
[   12.293581] ttm_dma->dma_address[14]=0x0
[   12.293599] ttm_dma->dma_address[15]=0x0
[   12.293616] ttm_dma->dma_address[16]=0x0
[   12.293634] ttm_dma->dma_address[17]=0x0
But it did not matter as dma_sync_single_for_device is a no-op here.
When dma_address is properly initialized to NULL, it crashes...
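
To illustrate the point, a simplified userspace model of the sync path
with the new check. This is only a sketch: struct model_tt,
model_dma_sync() and model_sync_for_device() are hypothetical names, the
real nouveau_bo_sync_for_device() also merges contiguous pages into one
sync call, and model_dma_sync() merely counts calls in place of
dma_sync_single_for_device().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the tt object; only the fields discussed here. */
struct model_tt {
	void **pages;
	uintptr_t *dma_address;
	size_t num_pages;
};

/* Counts calls; stands in for a sync primitive that does nothing else. */
static size_t model_sync_calls;

static void model_dma_sync(uintptr_t addr)
{
	(void)addr;
	model_sync_calls++;
}

/*
 * Returns the number of pages handed to the sync primitive.
 * Without the dma_address test, the loop below would read through a
 * NULL pointer for an object that never got a dma_address array.
 */
static size_t model_sync_for_device(const struct model_tt *tt)
{
	size_t i;

	if (!tt || !tt->dma_address)
		return 0;
	if (!tt->pages)
		return 0;

	for (i = 0; i < tt->num_pages; i++)
		model_dma_sync(tt->dma_address[i]);

	return tt->num_pages;
}
```

With an uninitialized (non-NULL) dma_address the loop simply passes
garbage to a sync call that ignores it, which would explain why the old
code appeared to work.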

> Thanks,
> Christian.
> 
> >
> > Thanks for the backtrace,
> > Christian.
> >
> >>   I cannot find any assignment
> >> executed (in the working code):
> >>
> >> $ git grep dma_address\ = drivers/gpu/
> >> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c: 
> >> sg->sgl->dma_address = addr;
> >> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c: dma_address = 
> >> &dma->dma_address[offset >> PAGE_SHIFT];
> >> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c: dma_address = 
> >> (mm_node->start << PAGE_SHIFT) + offset;
> >> drivers/gpu/drm/i915/gvt/scheduler.c:   sg->dma_address = addr;
> >> drivers/gpu/drm/i915/i915_gpu_error.c:  sg->dma_address = it;
> >> drivers/gpu/drm/ttm/ttm_tt.c:   ttm->dma_address = (void *) 
> >> (ttm->ttm.pages + ttm->ttm.num_pages);
> >> drivers/gpu/drm/ttm/ttm_tt.c:   ttm->dma_address = 
> >> kvmalloc_array(ttm->ttm.num_pages,
> >> drivers/gpu/drm/ttm/ttm_tt.c:   ttm_dma->dma_address = NULL;
> >> drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c: viter->dma_address = 
> >> &__vmw_piter_phys_addr;
> >> drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c: viter->dma_address = 
> >> &__vmw_piter_dma_addr;
> >> drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c: viter->dma_address = 
> >> &__vmw_piter_sg_addr;
> >>
> >> The 2 cases in ttm_tt.c are in ttm_dma_tt_alloc_page_directory() and
> >> ttm_sg_tt_alloc_page_directory().
> >> Confirmed by adding printk()s that they're NOT called.
> >>
> >>
> >
> 
> 


-- 
Ondrej Zary


* Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device
  2021-06-11 18:23                                 ` Ondrej Zary
@ 2021-06-14 11:07                                   ` Christian König
  0 siblings, 0 replies; 20+ messages in thread
From: Christian König @ 2021-06-14 11:07 UTC (permalink / raw)
  To: Ondrej Zary; +Cc: Ben Skeggs, dri-devel, nouveau, linux-kernel



Am 11.06.21 um 20:23 schrieb Ondrej Zary:
> On Friday 11 June 2021 14:38:18 Christian König wrote:
>> Am 10.06.21 um 19:59 schrieb Christian König:
>>> Am 10.06.21 um 19:50 schrieb Ondrej Zary:
>>>> [SNIP]
>>>>> I can't see how this is called from the nouveau code, only
>>>>> possibility I
>>>>> see is that it is maybe called through the AGP code somehow.
>>>> Yes, you're right:
>>>> [   13.192663] Call Trace:
>>>> [   13.192678]  dump_stack+0x54/0x68
>>>> [   13.192690]  ttm_tt_init+0x11/0x8a [ttm]
>>>> [   13.192699]  ttm_agp_tt_create+0x39/0x51 [ttm]
>>>> [   13.192840]  nouveau_ttm_tt_create+0x17/0x22 [nouveau]
>>>> [   13.192856]  ttm_tt_create+0x78/0x8c [ttm]
>>>> [   13.192864]  ttm_bo_handle_move_mem+0x7d/0xca [ttm]
>>>> [   13.192873]  ttm_bo_validate+0x92/0xc8 [ttm]
>>>> [   13.192883]  ttm_bo_init_reserved+0x216/0x243 [ttm]
>>>> [   13.192892]  ttm_bo_init+0x45/0x65 [ttm]
>>>> [   13.193018]  ? nouveau_bo_del_io_reserve_lru+0x48/0x48 [nouveau]
>>>> [   13.193150]  nouveau_bo_init+0x8c/0x94 [nouveau]
>>>> [   13.193273]  ? nouveau_bo_del_io_reserve_lru+0x48/0x48 [nouveau]
>>>> [   13.193407]  nouveau_bo_new+0x44/0x57 [nouveau]
>>>> [   13.193537]  nouveau_channel_prep+0xa3/0x269 [nouveau]
>>>> [   13.193665]  nouveau_channel_new+0x3c/0x5f7 [nouveau]
>>>> [   13.193679]  ? slab_free_freelist_hook+0x3b/0xa7
>>>> [   13.193686]  ? kfree+0x9e/0x11a
>>>> [   13.193781]  ? nvif_object_sclass_put+0xd/0x16 [nouveau]
>>>> [   13.193908]  nouveau_drm_device_init+0x2e2/0x646 [nouveau]
>>>> [   13.193924]  ? pci_enable_device_flags+0x1e/0xac
>>>> [   13.194052]  nouveau_drm_probe+0xeb/0x188 [nouveau]
>>>> [   13.194182]  ? nouveau_drm_device_init+0x646/0x646 [nouveau]
>>>> [   13.194195]  pci_device_probe+0x89/0xe9
>>>> [   13.194205]  really_probe+0x127/0x2a7
>>>> [   13.194212]  driver_probe_device+0x5b/0x87
>>>> [   13.194219]  device_driver_attach+0x2e/0x41
>>>> [   13.194226]  __driver_attach+0x7c/0x83
>>>> [   13.194232]  bus_for_each_dev+0x4c/0x66
>>>> [   13.194238]  driver_attach+0x14/0x16
>>>> [   13.194244]  ? device_driver_attach+0x41/0x41
>>>> [   13.194251]  bus_add_driver+0xc5/0x16c
>>>> [   13.194258]  driver_register+0x87/0xb9
>>>> [   13.194265]  __pci_register_driver+0x38/0x3b
>>>> [   13.194271]  ? 0xf0c0d000
>>>> [   13.194362]  nouveau_drm_init+0x14c/0x1000 [nouveau]
>>>>
>>>> How is ttm_dma_tt->dma_address allocated?
>>> Mhm, I need to double check how AGP is supposed to work.
>>>
>>> Since barely anybody is using it these days it is something which
>>> breaks from time to time.
>> I have no idea how that ever worked in the first place since AGP isn't
>> supposed to sync between CPU/GPU. Everything is coherent for that case.
>>
>> Anyway here is a patch which adds a check to those functions if the
>> dma_address array is allocated in the first place. Please test it.
> Thanks, the patch fixes the problem and nouveau now works!
> Should be applied to 5.12-stable too (5.11 is affected too but EOL).

I will just add a CC stable tag before pushing.

>
> It's weird that it worked before.
> Looks like dma_address was used uninitialized - it contained some random
> crap:
> [   12.293304] nouveau_bo_sync_for_device: ttm_dma->dma_address=3e055971 ttm_dma->ttm.num_pages=18
> [   12.293321] ttm_dma->dma_address[0]=0x0
> [   12.293341] ttm_dma->dma_address[1]=0x0
> [   12.293360] ttm_dma->dma_address[2]=0xee728980
> [   12.293379] ttm_dma->dma_address[3]=0xed1cb120
> [   12.293397] ttm_dma->dma_address[4]=0x12
> [   12.293416] ttm_dma->dma_address[5]=0x0
> [   12.293434] ttm_dma->dma_address[6]=0x1
> [   12.293453] ttm_dma->dma_address[7]=0x0
> [   12.293471] ttm_dma->dma_address[8]=0x10000
> [   12.293490] ttm_dma->dma_address[9]=0x0
> [   12.293510] ttm_dma->dma_address[10]=0x101
> [   12.293528] ttm_dma->dma_address[11]=0xee7289ec
> [   12.293546] ttm_dma->dma_address[12]=0xee7289ec
> [   12.293564] ttm_dma->dma_address[13]=0x0
> [   12.293581] ttm_dma->dma_address[14]=0x0
> [   12.293599] ttm_dma->dma_address[15]=0x0
> [   12.293616] ttm_dma->dma_address[16]=0x0
> [   12.293634] ttm_dma->dma_address[17]=0x0
> But it did not matter as dma_sync_single_for_device is a no-op here.
> When dma_address is properly initialized to NULL, it crashes...

Ok, that explains things, but it essentially means that this only worked by 
coincidence.

I just sent out the patch to Ben, the list and you once more. Please reply 
with an rb, acked-by and/or tested-by so that I can push it ASAP.

Thanks,
Christian.

>
>> Thanks,
>> Christian.
>>
>>> Thanks for the backtrace,
>>> Christian.
>>>
>>>>    I cannot find any assignment
>>>> executed (in the working code):
>>>>
>>>> $ git grep dma_address\ = drivers/gpu/
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c:
>>>> sg->sgl->dma_address = addr;
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c: dma_address =
>>>> &dma->dma_address[offset >> PAGE_SHIFT];
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c: dma_address =
>>>> (mm_node->start << PAGE_SHIFT) + offset;
>>>> drivers/gpu/drm/i915/gvt/scheduler.c:   sg->dma_address = addr;
>>>> drivers/gpu/drm/i915/i915_gpu_error.c:  sg->dma_address = it;
>>>> drivers/gpu/drm/ttm/ttm_tt.c:   ttm->dma_address = (void *)
>>>> (ttm->ttm.pages + ttm->ttm.num_pages);
>>>> drivers/gpu/drm/ttm/ttm_tt.c:   ttm->dma_address =
>>>> kvmalloc_array(ttm->ttm.num_pages,
>>>> drivers/gpu/drm/ttm/ttm_tt.c:   ttm_dma->dma_address = NULL;
>>>> drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c: viter->dma_address =
>>>> &__vmw_piter_phys_addr;
>>>> drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c: viter->dma_address =
>>>> &__vmw_piter_dma_addr;
>>>> drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c: viter->dma_address =
>>>> &__vmw_piter_sg_addr;
>>>>
>>>> The 2 cases in ttm_tt.c are in ttm_dma_tt_alloc_page_directory() and
>>>> ttm_sg_tt_alloc_page_directory().
>>>> Confirmed by adding printk()s that they're NOT called.
>>>>
>>>>
>>
>



Thread overview: 20+ messages
2021-06-05 19:43 nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device Ondrej Zary
2021-06-05 21:22 ` [Nouveau] " Ilia Mirkin
2021-06-05 21:34 ` Ondrej Zary
2021-06-06 21:16   ` Ondrej Zary
2021-06-07 20:58     ` Ondrej Zary
2021-06-08 18:47       ` Ondrej Zary
2021-06-08 20:01         ` Ondrej Zary
2021-06-08 21:59           ` Ondrej Zary
2021-06-09  6:43             ` Christian König
2021-06-09  6:57               ` Ondrej Zary
2021-06-09  7:02                 ` Christian König
2021-06-09  7:10                   ` Ondrej Zary
2021-06-09  9:21                     ` Christian König
2021-06-09 20:00                       ` Ondrej Zary
2021-06-10  6:43                         ` Christian König
2021-06-10 17:50                           ` Ondrej Zary
2021-06-10 17:59                             ` Christian König
2021-06-11 12:38                               ` Christian König
2021-06-11 18:23                                 ` Ondrej Zary
2021-06-14 11:07                                   ` Christian König
