All of lore.kernel.org
 help / color / mirror / Atom feed
* [next] Null pointer dereference in nouveau_vm_map_sg
@ 2012-01-15 21:31 Martin Nyhus
  2012-01-16 20:30 ` Jerome Glisse
  0 siblings, 1 reply; 23+ messages in thread
From: Martin Nyhus @ 2012-01-15 21:31 UTC (permalink / raw)
  To: Ben Skeggs; +Cc: David Airlie, dri-devel, linux-kernel

In some cases mem will be null in nouveau_vm_map_sg, resulting in a crash
at drivers/gpu/drm/nouveau/nouveau_vm.c:84. It seems to be easy enough to
reproduce, so I can test patches if needed.

	Martin



[  216.546584] BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
[  216.546613] IP: [<ffffffff814a87ec>] nouveau_vm_map_sg+0x2c/0x130
[  216.546631] PGD 5b155067 PUD 5ab71067 PMD 0 
[  216.546647] Oops: 0000 [#1] SMP 
[  216.546659] CPU 1 
[  216.546664] Modules linked in: tun iwl4965 iwlegacy mac80211 cfg80211 tg3 psmouse rtc_cmos evdev ehci_hcd uhci_hcd usbcore usb_common [last unloaded: scsi_wait_scan]
[  216.546721] 
[  216.546727] Pid: 3327, comm: Xorg Not tainted 3.2.0-next-20120113 #56 Dell Inc. XPS M1330                       /0PU073
[  216.546749] RIP: 0010:[<ffffffff814a87ec>]  [<ffffffff814a87ec>] nouveau_vm_map_sg+0x2c/0x130
[  216.546770] RSP: 0018:ffff88005b0c9858  EFLAGS: 00010246
[  216.546780] RAX: ffff88005bf84620 RBX: ffff88005ab08d20 RCX: 0000000000000000
[  216.546791] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
[  216.546802] RBP: ffff88005b0c98a8 R08: 0000000000000000 R09: 0000000000000000
[  216.546813] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000004000
[  216.546823] R13: ffff88005bf84dc8 R14: ffff88007838c000 R15: 0000000000000000
[  216.546835] FS:  00007f5f728a8880(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
[  216.546848] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  216.546857] CR2: 00000000000000d0 CR3: 000000006c1bb000 CR4: 00000000000006e0
[  216.546869] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  216.546880] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  216.546892] Process Xorg (pid: 3327, threadinfo ffff88005b0c8000, task ffff8800655da180)
[  216.546904] Stack:
[  216.546909]  ffff88005b0c9960 ffff880037180368 0000000000000000 0000000000000000
[  216.546930]  ffff88005b0c98d8 ffff88005bf84dc8 ffff88005b0c9960 ffff88007838c240
[  216.546949]  ffff88007838c000 0000000000000000 ffff88005b0c98d8 ffffffff81481bdf
[  216.546969] Call Trace:
[  216.546979]  [<ffffffff81481bdf>] nouveau_bo_move_ntfy+0x7f/0xb0
[  216.546991]  [<ffffffff81470614>] ttm_bo_handle_move_mem+0x204/0x3d0
[  216.547003]  [<ffffffff8147099d>] ttm_bo_evict+0x1bd/0x2a0
[  216.547015]  [<ffffffff81460de7>] ? drm_mm_kmalloc+0x37/0xd0
[  216.547027]  [<ffffffff81470bf1>] ttm_mem_evict_first+0x171/0x230
[  216.547039]  [<ffffffff814714ed>] ttm_bo_mem_space+0x30d/0x420
[  216.547056]  [<ffffffff814716e8>] ttm_bo_move_buffer+0xe8/0x160
[  216.547069]  [<ffffffff8108df2b>] ? __lock_release+0x6b/0xe0
[  216.547080]  [<ffffffff81460de7>] ? drm_mm_kmalloc+0x37/0xd0
[  216.547091]  [<ffffffff81471847>] ttm_bo_validate+0xe7/0xf0
[  216.547102]  [<ffffffff81471a24>] ttm_bo_init+0x1d4/0x2a0
[  216.547113]  [<ffffffff81482481>] ? nouveau_bo_new+0x51/0x1c0
[  216.547124]  [<ffffffff8148258c>] nouveau_bo_new+0x15c/0x1c0
[  216.547135]  [<ffffffff81481eb0>] ? nouveau_ttm_tt_create+0x80/0x80
[  216.547148]  [<ffffffff81338bba>] ? avc_has_perm_noaudit+0xfa/0x290
[  216.547160]  [<ffffffff81485cf3>] nouveau_gem_new+0x53/0x120
[  216.548008]  [<ffffffff8108df81>] ? __lock_release+0xc1/0xe0
[  216.548008]  [<ffffffff81112a97>] ? might_fault+0x57/0xb0
[  216.548008]  [<ffffffff81485e29>] nouveau_gem_ioctl_new+0x69/0x170
[  216.548008]  [<ffffffff81112a97>] ? might_fault+0x57/0xb0
[  216.548008]  [<ffffffff814553e4>] drm_ioctl+0x444/0x510
[  216.548008]  [<ffffffff81485dc0>] ? nouveau_gem_new+0x120/0x120
[  216.548008]  [<ffffffff81150b17>] do_vfs_ioctl+0x87/0x330
[  216.548008]  [<ffffffff8133b528>] ? selinux_file_ioctl+0x68/0x140
[  216.548008]  [<ffffffff81150e51>] sys_ioctl+0x91/0xa0
[  216.555939]  [<ffffffff817c1722>] system_call_fastpath+0x16/0x1b
[  216.555939] Code: 48 89 e5 41 57 49 89 cf 41 56 41 55 49 89 fd 41 54 49 89 d4 ba 01 00 00 00 53 41 89 d3 48 83 ec 28 48 8b 47 20 48 8b 5f 18 31 ff <4c> 8b b1 d0 00 00 00 0f b6 48 30 44 8b 48 34 8b 83 20 01 00 00 
[  216.555939] RIP  [<ffffffff814a87ec>] nouveau_vm_map_sg+0x2c/0x130
[  216.555939]  RSP <ffff88005b0c9858>
[  216.555939] CR2: 00000000000000d0
[  216.581301] ---[ end trace 0d910003d5fb1cd8 ]---

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [next] Null pointer dereference in nouveau_vm_map_sg
  2012-01-15 21:31 [next] Null pointer dereference in nouveau_vm_map_sg Martin Nyhus
@ 2012-01-16 20:30 ` Jerome Glisse
  2012-01-16 23:57   ` Martin Nyhus
  0 siblings, 1 reply; 23+ messages in thread
From: Jerome Glisse @ 2012-01-16 20:30 UTC (permalink / raw)
  To: Martin Nyhus; +Cc: Ben Skeggs, David Airlie, dri-devel, linux-kernel

On Sun, Jan 15, 2012 at 10:31:08PM +0100, Martin Nyhus wrote:
> In some cases mem will be null in nouveau_vm_map_sg, resulting in a crash
> at drivers/gpu/drm/nouveau/nouveau_vm.c:84. It seems to be easy enough to
> reproduce, so I can test patches if needed.
> 
> 	Martin
> 

How do you trigger this ?

Cheers,
Jerome

> 
> 
> [  216.546584] BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
> [  216.546613] IP: [<ffffffff814a87ec>] nouveau_vm_map_sg+0x2c/0x130
> [  216.546631] PGD 5b155067 PUD 5ab71067 PMD 0 
> [  216.546647] Oops: 0000 [#1] SMP 
> [  216.546659] CPU 1 
> [  216.546664] Modules linked in: tun iwl4965 iwlegacy mac80211 cfg80211 tg3 psmouse rtc_cmos evdev ehci_hcd uhci_hcd usbcore usb_common [last unloaded: scsi_wait_scan]
> [  216.546721] 
> [  216.546727] Pid: 3327, comm: Xorg Not tainted 3.2.0-next-20120113 #56 Dell Inc. XPS M1330                       /0PU073
> [  216.546749] RIP: 0010:[<ffffffff814a87ec>]  [<ffffffff814a87ec>] nouveau_vm_map_sg+0x2c/0x130
> [  216.546770] RSP: 0018:ffff88005b0c9858  EFLAGS: 00010246
> [  216.546780] RAX: ffff88005bf84620 RBX: ffff88005ab08d20 RCX: 0000000000000000
> [  216.546791] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
> [  216.546802] RBP: ffff88005b0c98a8 R08: 0000000000000000 R09: 0000000000000000
> [  216.546813] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000004000
> [  216.546823] R13: ffff88005bf84dc8 R14: ffff88007838c000 R15: 0000000000000000
> [  216.546835] FS:  00007f5f728a8880(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
> [  216.546848] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  216.546857] CR2: 00000000000000d0 CR3: 000000006c1bb000 CR4: 00000000000006e0
> [  216.546869] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  216.546880] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [  216.546892] Process Xorg (pid: 3327, threadinfo ffff88005b0c8000, task ffff8800655da180)
> [  216.546904] Stack:
> [  216.546909]  ffff88005b0c9960 ffff880037180368 0000000000000000 0000000000000000
> [  216.546930]  ffff88005b0c98d8 ffff88005bf84dc8 ffff88005b0c9960 ffff88007838c240
> [  216.546949]  ffff88007838c000 0000000000000000 ffff88005b0c98d8 ffffffff81481bdf
> [  216.546969] Call Trace:
> [  216.546979]  [<ffffffff81481bdf>] nouveau_bo_move_ntfy+0x7f/0xb0
> [  216.546991]  [<ffffffff81470614>] ttm_bo_handle_move_mem+0x204/0x3d0
> [  216.547003]  [<ffffffff8147099d>] ttm_bo_evict+0x1bd/0x2a0
> [  216.547015]  [<ffffffff81460de7>] ? drm_mm_kmalloc+0x37/0xd0
> [  216.547027]  [<ffffffff81470bf1>] ttm_mem_evict_first+0x171/0x230
> [  216.547039]  [<ffffffff814714ed>] ttm_bo_mem_space+0x30d/0x420
> [  216.547056]  [<ffffffff814716e8>] ttm_bo_move_buffer+0xe8/0x160
> [  216.547069]  [<ffffffff8108df2b>] ? __lock_release+0x6b/0xe0
> [  216.547080]  [<ffffffff81460de7>] ? drm_mm_kmalloc+0x37/0xd0
> [  216.547091]  [<ffffffff81471847>] ttm_bo_validate+0xe7/0xf0
> [  216.547102]  [<ffffffff81471a24>] ttm_bo_init+0x1d4/0x2a0
> [  216.547113]  [<ffffffff81482481>] ? nouveau_bo_new+0x51/0x1c0
> [  216.547124]  [<ffffffff8148258c>] nouveau_bo_new+0x15c/0x1c0
> [  216.547135]  [<ffffffff81481eb0>] ? nouveau_ttm_tt_create+0x80/0x80
> [  216.547148]  [<ffffffff81338bba>] ? avc_has_perm_noaudit+0xfa/0x290
> [  216.547160]  [<ffffffff81485cf3>] nouveau_gem_new+0x53/0x120
> [  216.548008]  [<ffffffff8108df81>] ? __lock_release+0xc1/0xe0
> [  216.548008]  [<ffffffff81112a97>] ? might_fault+0x57/0xb0
> [  216.548008]  [<ffffffff81485e29>] nouveau_gem_ioctl_new+0x69/0x170
> [  216.548008]  [<ffffffff81112a97>] ? might_fault+0x57/0xb0
> [  216.548008]  [<ffffffff814553e4>] drm_ioctl+0x444/0x510
> [  216.548008]  [<ffffffff81485dc0>] ? nouveau_gem_new+0x120/0x120
> [  216.548008]  [<ffffffff81150b17>] do_vfs_ioctl+0x87/0x330
> [  216.548008]  [<ffffffff8133b528>] ? selinux_file_ioctl+0x68/0x140
> [  216.548008]  [<ffffffff81150e51>] sys_ioctl+0x91/0xa0
> [  216.555939]  [<ffffffff817c1722>] system_call_fastpath+0x16/0x1b
> [  216.555939] Code: 48 89 e5 41 57 49 89 cf 41 56 41 55 49 89 fd 41 54 49 89 d4 ba 01 00 00 00 53 41 89 d3 48 83 ec 28 48 8b 47 20 48 8b 5f 18 31 ff <4c> 8b b1 d0 00 00 00 0f b6 48 30 44 8b 48 34 8b 83 20 01 00 00 
> [  216.555939] RIP  [<ffffffff814a87ec>] nouveau_vm_map_sg+0x2c/0x130
> [  216.555939]  RSP <ffff88005b0c9858>
> [  216.555939] CR2: 00000000000000d0
> [  216.581301] ---[ end trace 0d910003d5fb1cd8 ]---
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [next] Null pointer dereference in nouveau_vm_map_sg
  2012-01-16 20:30 ` Jerome Glisse
@ 2012-01-16 23:57   ` Martin Nyhus
  2012-01-22 18:33     ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 23+ messages in thread
From: Martin Nyhus @ 2012-01-16 23:57 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: Ben Skeggs, David Airlie, dri-devel, linux-kernel

On Monday 16. January 2012 21:30:59 Jerome Glisse wrote:
> On Sun, Jan 15, 2012 at 10:31:08PM +0100, Martin Nyhus wrote:
> > In some cases mem will be null in nouveau_vm_map_sg, resulting in a crash
> > at drivers/gpu/drm/nouveau/nouveau_vm.c:84. It seems to be easy enough to
> > reproduce, so I can test patches if needed.
> How do you trigger this ?

Opening 10-15 high-res pictures in Firefox triggers it every time. Doing the 
same using Gimp does not, and neither does Firefox and lots of small images 
(eg. Google image search).

	Martin

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [next] Null pointer dereference in nouveau_vm_map_sg
  2012-01-16 23:57   ` Martin Nyhus
@ 2012-01-22 18:33     ` Konrad Rzeszutek Wilk
  2012-01-24 22:33       ` Jerome Glisse
  0 siblings, 1 reply; 23+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-01-22 18:33 UTC (permalink / raw)
  To: Martin Nyhus; +Cc: Jerome Glisse, Ben Skeggs, dri-devel, linux-kernel

On Tue, Jan 17, 2012 at 12:57:50AM +0100, Martin Nyhus wrote:
> On Monday 16. January 2012 21:30:59 Jerome Glisse wrote:
> > On Sun, Jan 15, 2012 at 10:31:08PM +0100, Martin Nyhus wrote:
> > > In some cases mem will be null in nouveau_vm_map_sg, resulting in a crash
> > > at drivers/gpu/drm/nouveau/nouveau_vm.c:84. It seems to be easy enough to
> > > reproduce, so I can test patches if needed.
> > How do you trigger this ?
> 
> Opening 10-15 high-res pictures in Firefox triggers it every time. Doing the 
> same using Gimp does not, and neither does Firefox and lots of small images 
> (eg. Google image search).

I seem to be able to trigger this by using both Chrome and Firefox and 
seeing a YouTube video. I did at that time have a dual-head display, while
in the past to reproduce this I had only one monitor and it took a bit of
time before I hit it.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [next] Null pointer dereference in nouveau_vm_map_sg
  2012-01-22 18:33     ` Konrad Rzeszutek Wilk
@ 2012-01-24 22:33       ` Jerome Glisse
  2012-01-25  0:12         ` Martin Nyhus
  2012-01-25  5:34         ` [PATCH] drm/ttm: fix two regressions since move_notify changes Ben Skeggs
  0 siblings, 2 replies; 23+ messages in thread
From: Jerome Glisse @ 2012-01-24 22:33 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Martin Nyhus, Ben Skeggs, dri-devel, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1042 bytes --]

On Sun, Jan 22, 2012 at 01:33:16PM -0500, Konrad Rzeszutek Wilk wrote:
> On Tue, Jan 17, 2012 at 12:57:50AM +0100, Martin Nyhus wrote:
> > On Monday 16. January 2012 21:30:59 Jerome Glisse wrote:
> > > On Sun, Jan 15, 2012 at 10:31:08PM +0100, Martin Nyhus wrote:
> > > > In some cases mem will be null in nouveau_vm_map_sg, resulting in a crash
> > > > at drivers/gpu/drm/nouveau/nouveau_vm.c:84. It seems to be easy enough to
> > > > reproduce, so I can test patches if needed.
> > > How do you trigger this ?
> > 
> > Opening 10-15 high-res pictures in Firefox triggers it every time. Doing the 
> > same using Gimp does not, and neither does Firefox and lots of small images 
> > (eg. Google image search).
> 
> I seem to be able to trigger this by using both Chrome and Firefox and 
> seeing a YouTube video. I did at that time have a dual-head display, while
> in the past to reproduce this I had only one monitor and it took a bit of
> time before I hit it.

Can you please both test if attached patch fix it for you ?

Cheers,
Jerome

[-- Attachment #2: 0001-drm-nouveau-fix-move-notify-callback.patch --]
[-- Type: text/plain, Size: 1401 bytes --]

>From 67d4836e3511db2691c4ff2d3a23bf8c0e950edb Mon Sep 17 00:00:00 2001
From: John Doe <glisse@dhcp-189-215.bos.redhat.com>
Date: Tue, 24 Jan 2012 22:55:26 -0500
Subject: [PATCH] drm/nouveau: fix move notify callback

On vram buffer eviction the ttm_bo_move_accel_cleanup will the
mm_node field of struct ttm_mem_reg of new_mem placement to NULL.
As move notify call back is now call after ttm_bo_move_accel_cleanup
it was using NULL ptr for mm_node.

Signed-off-by: Jerome Glisse <jglisse@redhat.com>
---
 drivers/gpu/drm/nouveau/nouveau_bo.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
index 724b41a..3a9d978 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.c
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
@@ -814,13 +814,13 @@ nouveau_bo_move_ntfy(struct ttm_buffer_object *bo, struct ttm_mem_reg *new_mem)
 
 	list_for_each_entry(vma, &nvbo->vma_list, head) {
 		if (new_mem && new_mem->mem_type == TTM_PL_VRAM) {
-			nouveau_vm_map(vma, new_mem->mm_node);
+			nouveau_vm_map(vma, bo->mem.mm_node);
 		} else
 		if (new_mem && new_mem->mem_type == TTM_PL_TT &&
 		    nvbo->page_shift == vma->vm->spg_shift) {
 			nouveau_vm_map_sg(vma, 0, new_mem->
 					  num_pages << PAGE_SHIFT,
-					  new_mem->mm_node);
+					  bo->mem.mm_node);
 		} else {
 			nouveau_vm_unmap(vma);
 		}
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [next] Null pointer dereference in nouveau_vm_map_sg
  2012-01-24 22:33       ` Jerome Glisse
@ 2012-01-25  0:12         ` Martin Nyhus
  2012-01-25 16:54           ` Jerome Glisse
  2012-01-25  5:34         ` [PATCH] drm/ttm: fix two regressions since move_notify changes Ben Skeggs
  1 sibling, 1 reply; 23+ messages in thread
From: Martin Nyhus @ 2012-01-25  0:12 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: Konrad Rzeszutek Wilk, Ben Skeggs, dri-devel, linux-kernel

On Tue, 24 Jan 2012 17:33:19 -0500 Jerome Glisse <j.glisse@gmail.com>
wrote:
> Can you please both test if attached patch fix it for you ?

Thanks. It looks good too me, but it crashes a little later due to vma->node
being invalid:

Jan 25 00:54:21 callisto kernel: [  119.038357] [drm] nouveau_vm_unmap vma ffff880057502f50
Jan 25 00:54:21 callisto kernel: [  119.038360] [drm] nouveau_vm_unmap vma->node ffff8800576b87a8
Jan 25 00:54:21 callisto kernel: [  119.038363] [drm] nouveau_vm_unmap vma->node->length 58
Jan 25 00:54:21 callisto kernel: [  119.038477] [drm] nouveau_vm_unmap vma ffff8800577beab8
Jan 25 00:54:21 callisto kernel: [  119.038479] [drm] nouveau_vm_unmap vma->node ffff8800577bf880
Jan 25 00:54:21 callisto kernel: [  119.038482] [drm] nouveau_vm_unmap vma->node->length 1
Jan 25 00:54:21 callisto kernel: [  119.078025] [drm] nouveau_vm_unmap vma ffffffff8148df45
Jan 25 00:54:21 callisto kernel: [  119.078029] [drm] nouveau_vm_unmap vma->node 8b48084b8b480000
Jan 25 00:54:21 callisto kernel: [  119.078040] general protection fault: 0000 [#1] SMP 
Jan 25 00:54:21 callisto kernel: [  119.078133] CPU 0 
Jan 25 00:54:21 callisto kernel: [  119.078138] Modules linked in: tun iwl4965 iwlegacy mac80211 cfg80211 tg3 psmouse rtc_cmos evdev ehci_hcd uhci_hcd usbcore usb_common [last unloaded: scsi_wait_scan]
Jan 25 00:54:21 callisto kernel: [  119.078542] 
Jan 25 00:54:21 callisto kernel: [  119.078914] Pid: 3220, comm: Xorg Tainted: G        W    3.3.0-rc1-00076-g44d4826-dirty #75 Dell Inc. XPS M1330 /0PU073
Jan 25 00:54:21 callisto kernel: [  119.079331] RIP: 0010:[<ffffffff814b2f7f>]  [<ffffffff814b2f7f>] nouveau_vm_unmap+0x4f/0x80
Jan 25 00:54:21 callisto kernel: [  119.079778] RSP: 0018:ffff88005c167868  EFLAGS: 00010292
Jan 25 00:54:21 callisto kernel: [  119.080266] RAX: 8b48084b8b480000 RBX: ffffffff8148df45 RCX: 0000000000000006
Jan 25 00:54:21 callisto kernel: [  119.080712] RDX: 0000000000000000 RSI: ffffffff81868740 RDI: ffffffff81a6e040
Jan 25 00:54:21 callisto kernel: [  119.081218] RBP: ffff88005c167878 R08: 0000000000000001 R09: 0000000000000000
Jan 25 00:54:21 callisto kernel: [  119.081320] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
Jan 25 00:54:21 callisto kernel: [  119.081320] R13: ffff88006c309c80 R14: ffff88006c309a40 R15: ffff880037180590
Jan 25 00:54:21 callisto kernel: [  119.081320] FS:  00007f141232f880(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
Jan 25 00:54:21 callisto kernel: [  119.081320] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 25 00:54:21 callisto kernel: [  119.081320] CR2: 00007fb09c1de000 CR3: 000000005ce28000 CR4: 00000000000006f0
Jan 25 00:54:21 callisto kernel: [  119.081320] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 25 00:54:21 callisto kernel: [  119.081320] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jan 25 00:54:21 callisto kernel: [  119.081320] Process Xorg (pid: 3220, threadinfo ffff88005c166000, task ffff88005f502180)
Jan 25 00:54:21 callisto kernel: [  119.081320] Stack:
Jan 25 00:54:21 callisto kernel: [  119.081320]  ffff88005f502180 ffffffff8148df45 ffff88005c1678a8 ffffffff8148c0e8
Jan 25 00:54:21 callisto kernel: [  119.081320]  ffff88006c309a40 0000000000000002 ffff880037180b00 ffff880079ff5e68
Jan 25 00:54:21 callisto kernel: [  119.081320]  ffff88005c1678c8 ffffffff814792b1 ffff880079ff5e68 ffff88006c309a40
Jan 25 00:54:21 callisto kernel: [  119.081320] Call Trace:
Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff8148df45>] ? nouveau_bo_move+0xb5/0x270
Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff8148c0e8>] nouveau_bo_move_ntfy+0x38/0xc0
Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff814792b1>] ttm_bo_cleanup_memtype_use+0x21/0xa0
Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff8147a5b5>] ttm_bo_cleanup_refs_or_queue+0x165/0x190
Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff8147a675>] ttm_bo_release+0x95/0xd0
Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff8147a6ef>] ttm_bo_unref+0x3f/0x60
Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff8147cae3>] ttm_bo_move_accel_cleanup+0x213/0x240
Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff8148db28>] nouveau_bo_move_m2mf+0x148/0x1b0
Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff817bfd49>] ? mutex_unlock+0x9/0x10
Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff8148df45>] nouveau_bo_move+0xb5/0x270
Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff8147ab66>] ttm_bo_handle_move_mem+0x1e6/0x3d0
Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff8147bcba>] ttm_bo_move_buffer+0x14a/0x160
Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff8147bdb7>] ttm_bo_validate+0xe7/0xf0
Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff8148cbdd>] nouveau_bo_validate+0x1d/0x20
Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff8148f2a0>] validate_list+0xc0/0x360
Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff8148fafa>] nouveau_gem_pushbuf_validate+0x9a/0x210
Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff8149064d>] nouveau_gem_ioctl_pushbuf+0x1bd/0x8d0
Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff810960c1>] ? __lock_release+0xc1/0xe0
Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff8145f994>] drm_ioctl+0x444/0x510
Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff81490490>] ? nouveau_gem_ioctl_new+0x170/0x170
Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff81152147>] do_vfs_ioctl+0x87/0x330
Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff81344e78>] ? selinux_file_ioctl+0x68/0x140
Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff81152481>] sys_ioctl+0x91/0xa0
Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff817cade2>] system_call_fastpath+0x16/0x1b
Jan 25 00:54:21 callisto kernel: [  119.081320] Code: 48 8b 53 20 48 c7 c6 40 87 86 81 48 c7 c7 17 3a a5 81 31 c0 e8 05 77 2f 00 48 8b 43 20 48 c7 c6 40 87 86 81 48 c7 c7 40 e0 a6 81 <8b> 50 38 31 c0 e8 e9 76 2f 00 48 8b 43 20 48 89 df 31 f6 8b 50 
Jan 25 00:54:21 callisto kernel: [  119.081320] RIP  [<ffffffff814b2f7f>] nouveau_vm_unmap+0x4f/0x80
Jan 25 00:54:21 callisto kernel: [  119.081320]  RSP <ffff88005c167868>
Jan 25 00:54:21 callisto kernel: [  119.128824] ---[ end trace a7919e7f17c0a727 ]---

The taint is because of a failing self test (debug_objects_selftest) and the
-dirty and extra lines at the start of the log are from this patch:

diff --git a/drivers/gpu/drm/nouveau/nouveau_vm.c b/drivers/gpu/drm/nouveau/nouveau_vm.c
index 2bf6c03..2b788c3 100644
--- a/drivers/gpu/drm/nouveau/nouveau_vm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_vm.c
@@ -150,6 +150,9 @@ nouveau_vm_unmap_at(struct nouveau_vma *vma, u64 delta, u64 length)
 void
 nouveau_vm_unmap(struct nouveau_vma *vma)
 {
+	DRM_INFO("%s vma %p\n", __func__, vma);
+	DRM_INFO("%s vma->node %p\n", __func__, vma->node);
+	DRM_INFO("%s vma->node->length %u\n", __func__, vma->node->length);
 	nouveau_vm_unmap_at(vma, 0, (u64)vma->node->length << 12);
 }

To reproduce I do exactly the same as before, it just takes a little longer
before it crashes.

	Martin

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH] drm/ttm: fix two regressions since move_notify changes
  2012-01-24 22:33       ` Jerome Glisse
  2012-01-25  0:12         ` Martin Nyhus
@ 2012-01-25  5:34         ` Ben Skeggs
  2012-01-25  7:43           ` Thomas Hellstrom
                             ` (2 more replies)
  1 sibling, 3 replies; 23+ messages in thread
From: Ben Skeggs @ 2012-01-25  5:34 UTC (permalink / raw)
  To: dri-devel; +Cc: Ben Skeggs

From: Ben Skeggs <bskeggs@redhat.com>

Both changes in dc97b3409a790d2a21aac6e5cdb99558b5944119 cause serious
regressions in the nouveau driver.

move_notify() was originally able to presume that bo->mem is the old node,
and new_mem is the new node.  The above commit moves the call to
move_notify() to after move() has been done, which means that now, sometimes,
new_mem isn't the new node at all, bo->mem is, and new_mem points at a
stale, possibly-just-been-killed-by-move node.

This is clearly not a good situation.  This patch reverts this change, and
replaces it with a cleanup in the move() failure path instead.

The second issue is that the call to move_notify() from cleanup_memtype_use()
causes the TTM ghost objects to get passed into the driver.  This is clearly
bad as the driver knows nothing about these "fake" TTM BOs, and ends up
accessing uninitialised memory.

I worked around this in nouveau's move_notify() hook by ensuring the BO
destructor was nouveau's.  I don't particularly like this solution, and
would rather TTM never pass the driver these objects.  However, I don't
clearly understand the reason why we're calling move_notify() here anyway
and am happy to work around the problem in nouveau instead of breaking the
behaviour expected by other drivers.

Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Cc: Jerome Glisse <j.glisse@gmail.com>
---
 drivers/gpu/drm/nouveau/nouveau_bo.c |    4 ++++
 drivers/gpu/drm/ttm/ttm_bo.c         |   17 +++++++++++++----
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
index 724b41a..ec54364 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.c
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
@@ -812,6 +812,10 @@ nouveau_bo_move_ntfy(struct ttm_buffer_object *bo, struct ttm_mem_reg *new_mem)
 	struct nouveau_bo *nvbo = nouveau_bo(bo);
 	struct nouveau_vma *vma;
 
+	/* ttm can now (stupidly) pass the driver bos it didn't create... */
+	if (bo->destroy != nouveau_bo_del_ttm)
+		return;
+
 	list_for_each_entry(vma, &nvbo->vma_list, head) {
 		if (new_mem && new_mem->mem_type == TTM_PL_VRAM) {
 			nouveau_vm_map(vma, new_mem->mm_node);
diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 2f0eab6..7c3a57d 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -404,6 +404,9 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
 		}
 	}
 
+	if (bdev->driver->move_notify)
+		bdev->driver->move_notify(bo, mem);
+
 	if (!(old_man->flags & TTM_MEMTYPE_FLAG_FIXED) &&
 	    !(new_man->flags & TTM_MEMTYPE_FLAG_FIXED))
 		ret = ttm_bo_move_ttm(bo, evict, no_wait_reserve, no_wait_gpu, mem);
@@ -413,11 +416,17 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
 	else
 		ret = ttm_bo_move_memcpy(bo, evict, no_wait_reserve, no_wait_gpu, mem);
 
-	if (ret)
-		goto out_err;
+	if (ret) {
+		if (bdev->driver->move_notify) {
+			struct ttm_mem_reg tmp_mem = *mem;
+			*mem = bo->mem;
+			bo->mem = tmp_mem;
+			bdev->driver->move_notify(bo, mem);
+			bo->mem = *mem;
+		}
 
-	if (bdev->driver->move_notify)
-		bdev->driver->move_notify(bo, mem);
+		goto out_err;
+	}
 
 moved:
 	if (bo->evicted) {
-- 
1.7.7.5

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH] drm/ttm: fix two regressions since move_notify changes
  2012-01-25  5:34         ` [PATCH] drm/ttm: fix two regressions since move_notify changes Ben Skeggs
@ 2012-01-25  7:43           ` Thomas Hellstrom
  2012-01-25  8:05             ` Ben Skeggs
  2012-01-25  8:24           ` Dave Airlie
  2012-01-25 17:32           ` Thomas Hellstrom
  2 siblings, 1 reply; 23+ messages in thread
From: Thomas Hellstrom @ 2012-01-25  7:43 UTC (permalink / raw)
  To: Ben Skeggs; +Cc: Ben Skeggs, dri-devel

On 01/25/2012 06:34 AM, Ben Skeggs wrote:
> From: Ben Skeggs<bskeggs@redhat.com>
>
> Both changes in dc97b3409a790d2a21aac6e5cdb99558b5944119 cause serious
> regressions in the nouveau driver.
>
> move_notify() was originally able to presume that bo->mem is the old node,
> and new_mem is the new node.  The above commit moves the call to
> move_notify() to after move() has been done, which means that now, sometimes,
> new_mem isn't the new node at all, bo->mem is, and new_mem points at a
> stale, possibly-just-been-killed-by-move node.
>
> This is clearly not a good situation.  This patch reverts this change, and
> replaces it with a cleanup in the move() failure path instead.

While I have nothing against having move_notify() happening before the 
move, but
I still very much dislike  the failure path thing, since it puts a 
requirement on TTM to support
driver moves through move_notify(), which is done by Radeon.
I've kindly asked Jerome to change that, stating a number of reasons, 
but he refuses, and I'm
not prepared to let support for that mode of operation sneak in like this.


The second issue is that the call to move_notify() from cleanup_memtype_use()
causes the TTM ghost objects to get passed into the driver.  This is clearly
bad as the driver knows nothing about these "fake" TTM BOs, and ends up
accessing uninitialised memory.

I worked around this in nouveau's move_notify() hook by ensuring the BO
destructor was nouveau's.  I don't particularly like this solution, and
would rather TTM never pass the driver these objects.  However, I don't
clearly understand the reason why we're calling move_notify() here anyway
and am happy to work around the problem in nouveau instead of breaking the
behaviour expected by other drivers.


move_notify() here gives the driver a chance to unbind any GPU maps 
remaining on the BO
before it changes placement or is destroyed.
We've previously required the driver to support any type of BO in the 
driver hooks, I agree
with you that it would be desirable to make that go away.

Signed-off-by: Ben Skeggs<bskeggs@redhat.com>
Cc: Jerome Glisse<j.glisse@gmail.com>
---
  drivers/gpu/drm/nouveau/nouveau_bo.c |    4 ++++
  drivers/gpu/drm/ttm/ttm_bo.c         |   17 +++++++++++++----
  2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
index 724b41a..ec54364 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.c
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
@@ -812,6 +812,10 @@ nouveau_bo_move_ntfy(struct ttm_buffer_object *bo, struct ttm_mem_reg *new_mem)
  	struct nouveau_bo *nvbo = nouveau_bo(bo);
  	struct nouveau_vma *vma;

+	/* ttm can now (stupidly) pass the driver bos it didn't create... */
+	if (bo->destroy != nouveau_bo_del_ttm)
+		return;
+
  	list_for_each_entry(vma,&nvbo->vma_list, head) {
  		if (new_mem&&  new_mem->mem_type == TTM_PL_VRAM) {
  			nouveau_vm_map(vma, new_mem->mm_node);
diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 2f0eab6..7c3a57d 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -404,6 +404,9 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
  		}
  	}

+	if (bdev->driver->move_notify)
+		bdev->driver->move_notify(bo, mem);
+
  	if (!(old_man->flags&  TTM_MEMTYPE_FLAG_FIXED)&&
  	!(new_man->flags&  TTM_MEMTYPE_FLAG_FIXED))
  		ret = ttm_bo_move_ttm(bo, evict, no_wait_reserve, no_wait_gpu, mem);
@@ -413,11 +416,17 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
  	else
  		ret = ttm_bo_move_memcpy(bo, evict, no_wait_reserve, no_wait_gpu, mem);

-	if (ret)
-		goto out_err;
+	if (ret) {
+		if (bdev->driver->move_notify) {
+			struct ttm_mem_reg tmp_mem = *mem;
+			*mem = bo->mem;
+			bo->mem = tmp_mem;
+			bdev->driver->move_notify(bo, mem);
+			bo->mem = *mem;
+		}

-	if (bdev->driver->move_notify)
-		bdev->driver->move_notify(bo, mem);
+		goto out_err;
+	}

  moved:
  	if (bo->evicted) {

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH] drm/ttm: fix two regressions since move_notify changes
  2012-01-25  7:43           ` Thomas Hellstrom
@ 2012-01-25  8:05             ` Ben Skeggs
  2012-01-25  8:39               ` Thomas Hellstrom
  0 siblings, 1 reply; 23+ messages in thread
From: Ben Skeggs @ 2012-01-25  8:05 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: dri-devel

On Wed, 2012-01-25 at 08:43 +0100, Thomas Hellstrom wrote:
> On 01/25/2012 06:34 AM, Ben Skeggs wrote:
> > From: Ben Skeggs<bskeggs@redhat.com>
> >
> > Both changes in dc97b3409a790d2a21aac6e5cdb99558b5944119 cause serious
> > regressions in the nouveau driver.
> >
> > move_notify() was originally able to presume that bo->mem is the old node,
> > and new_mem is the new node.  The above commit moves the call to
> > move_notify() to after move() has been done, which means that now, sometimes,
> > new_mem isn't the new node at all, bo->mem is, and new_mem points at a
> > stale, possibly-just-been-killed-by-move node.
> >
> > This is clearly not a good situation.  This patch reverts this change, and
> > replaces it with a cleanup in the move() failure path instead.
> 
> While I have nothing against having move_notify() happening before the 
> move, but
> I still very much dislike  the failure path thing, since it puts a 
> requirement on TTM to support
> driver moves through move_notify(), which is done by Radeon.
> I've kindly asked Jerome to change that, stating a number of reasons, 
> but he refuses, and I'm
> not prepared to let support for that mode of operation sneak in like this.
What requirement is there?  All it's asking the driver is to do a
move_notify with old/new nodes swapped, which would effectively undo the
previous move_notify().

move_notify() *cannot* happen after the move for the reasons I mentioned
already, so the choice is apparently either to completely ignore
cleaning up if the subsequent move() fails (previous behaviour prior to
the patch which caused these regressions), or to make the change I
did...

> 
> 
> The second issue is that the call to move_notify() from cleanup_memtype_use()
> causes the TTM ghost objects to get passed into the driver.  This is clearly
> bad as the driver knows nothing about these "fake" TTM BOs, and ends up
> accessing uninitialised memory.
> 
> I worked around this in nouveau's move_notify() hook by ensuring the BO
> destructor was nouveau's.  I don't particularly like this solution, and
> would rather TTM never pass the driver these objects.  However, I don't
> clearly understand the reason why we're calling move_notify() here anyway
> and am happy to work around the problem in nouveau instead of breaking the
> behaviour expected by other drivers.
> 
> 
> move_notify() here gives the driver a chance to unbind any GPU maps 
> remaining on the BO
> before it changes placement or is destroyed.
> We've previously required the driver to support any type of BO in the 
> driver hooks, I agree
> with you that it would be desirable to make that go away.
Okay, that's fine I guess.  As the commit says, I'm happy to make
nouveau just ignore operations on BOs it doesn't "own".

I know *you* think of move_notify() as only being around to deal with
unmapping, but as I mentioned previously, it'd have never taken the new
node as a parameter if this is were the case.  I have zero issue with
nouveau using it to set up the GPU mappings currently.  I know you've
suggested alternatives previously, which may be possible down the track
if it's *really* necessary, but it seems like a bad idea to pursue this
for 3.3.

Thomas, what do you suggest to move forward with this?  Both of these
bugs are serious regressions that make nouveau unusable with the current
3.3-rc series.

Ben.

> 
> Signed-off-by: Ben Skeggs<bskeggs@redhat.com>
> Cc: Jerome Glisse<j.glisse@gmail.com>
> ---
>   drivers/gpu/drm/nouveau/nouveau_bo.c |    4 ++++
>   drivers/gpu/drm/ttm/ttm_bo.c         |   17 +++++++++++++----
>   2 files changed, 17 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
> index 724b41a..ec54364 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
> @@ -812,6 +812,10 @@ nouveau_bo_move_ntfy(struct ttm_buffer_object *bo, struct ttm_mem_reg *new_mem)
>   	struct nouveau_bo *nvbo = nouveau_bo(bo);
>   	struct nouveau_vma *vma;
> 
> +	/* ttm can now (stupidly) pass the driver bos it didn't create... */
> +	if (bo->destroy != nouveau_bo_del_ttm)
> +		return;
> +
>   	list_for_each_entry(vma,&nvbo->vma_list, head) {
>   		if (new_mem&&  new_mem->mem_type == TTM_PL_VRAM) {
>   			nouveau_vm_map(vma, new_mem->mm_node);
> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> index 2f0eab6..7c3a57d 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> @@ -404,6 +404,9 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
>   		}
>   	}
> 
> +	if (bdev->driver->move_notify)
> +		bdev->driver->move_notify(bo, mem);
> +
>   	if (!(old_man->flags&  TTM_MEMTYPE_FLAG_FIXED)&&
>   	!(new_man->flags&  TTM_MEMTYPE_FLAG_FIXED))
>   		ret = ttm_bo_move_ttm(bo, evict, no_wait_reserve, no_wait_gpu, mem);
> @@ -413,11 +416,17 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
>   	else
>   		ret = ttm_bo_move_memcpy(bo, evict, no_wait_reserve, no_wait_gpu, mem);
> 
> -	if (ret)
> -		goto out_err;
> +	if (ret) {
> +		if (bdev->driver->move_notify) {
> +			struct ttm_mem_reg tmp_mem = *mem;
> +			*mem = bo->mem;
> +			bo->mem = tmp_mem;
> +			bdev->driver->move_notify(bo, mem);
> +			bo->mem = *mem;
> +		}
> 
> -	if (bdev->driver->move_notify)
> -		bdev->driver->move_notify(bo, mem);
> +		goto out_err;
> +	}
> 
>   moved:
>   	if (bo->evicted) {
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] drm/ttm: fix two regressions since move_notify changes
  2012-01-25  5:34         ` [PATCH] drm/ttm: fix two regressions since move_notify changes Ben Skeggs
  2012-01-25  7:43           ` Thomas Hellstrom
@ 2012-01-25  8:24           ` Dave Airlie
  2012-01-25  8:38             ` Ben Skeggs
  2012-01-25 17:32           ` Thomas Hellstrom
  2 siblings, 1 reply; 23+ messages in thread
From: Dave Airlie @ 2012-01-25  8:24 UTC (permalink / raw)
  To: Ben Skeggs; +Cc: Ben Skeggs, dri-devel

On Wed, Jan 25, 2012 at 5:34 AM, Ben Skeggs <skeggsb@gmail.com> wrote:
> From: Ben Skeggs <bskeggs@redhat.com>
>
> Both changes in dc97b3409a790d2a21aac6e5cdb99558b5944119 cause serious
> regressions in the nouveau driver.
>
> move_notify() was originally able to presume that bo->mem is the old node,
> and new_mem is the new node.  The above commit moves the call to
> move_notify() to after move() has been done, which means that now, sometimes,
> new_mem isn't the new node at all, bo->mem is, and new_mem points at a
> stale, possibly-just-been-killed-by-move node.
>
> This is clearly not a good situation.  This patch reverts this change, and
> replaces it with a cleanup in the move() failure path instead.
>
> The second issue is that the call to move_notify() from cleanup_memtype_use()
> causes the TTM ghost objects to get passed into the driver.  This is clearly
> bad as the driver knows nothing about these "fake" TTM BOs, and ends up
> accessing uninitialised memory.

btw we've had this check in radeon for a long time.
radeon_ttm_bo_is_radeon_bo

not sure how you haven't gotten a ghost object in there up until now.

Dave.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] drm/ttm: fix two regressions since move_notify changes
  2012-01-25  8:24           ` Dave Airlie
@ 2012-01-25  8:38             ` Ben Skeggs
  0 siblings, 0 replies; 23+ messages in thread
From: Ben Skeggs @ 2012-01-25  8:38 UTC (permalink / raw)
  To: Dave Airlie; +Cc: dri-devel

On Wed, 2012-01-25 at 08:24 +0000, Dave Airlie wrote:
> On Wed, Jan 25, 2012 at 5:34 AM, Ben Skeggs <skeggsb@gmail.com> wrote:
> > From: Ben Skeggs <bskeggs@redhat.com>
> >
> > Both changes in dc97b3409a790d2a21aac6e5cdb99558b5944119 cause serious
> > regressions in the nouveau driver.
> >
> > move_notify() was originally able to presume that bo->mem is the old node,
> > and new_mem is the new node.  The above commit moves the call to
> > move_notify() to after move() has been done, which means that now, sometimes,
> > new_mem isn't the new node at all, bo->mem is, and new_mem points at a
> > stale, possibly-just-been-killed-by-move node.
> >
> > This is clearly not a good situation.  This patch reverts this change, and
> > replaces it with a cleanup in the move() failure path instead.
> >
> > The second issue is that the call to move_notify() from cleanup_memtype_use()
> > causes the TTM ghost objects to get passed into the driver.  This is clearly
> > bad as the driver knows nothing about these "fake" TTM BOs, and ends up
> > accessing uninitialised memory.
> 
> btw we've had this check in radeon for a long time.
> radeon_ttm_bo_is_radeon_bo
> 
> not sure how you haven't gotten a ghost object in there up until now.
I don't believe there was any possible way a ghost object could've ended
up in move_notify() until recently.

I did notice radeon had checks a long time ago, and always wondered why
it bothered when the ghost object is only ever hooked onto the delayed
delete list and not really touched by the driver again..

Ben.

> 
> Dave.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] drm/ttm: fix two regressions since move_notify changes
  2012-01-25  8:05             ` Ben Skeggs
@ 2012-01-25  8:39               ` Thomas Hellstrom
  2012-01-25  9:41                 ` Ben Skeggs
  0 siblings, 1 reply; 23+ messages in thread
From: Thomas Hellstrom @ 2012-01-25  8:39 UTC (permalink / raw)
  To: skeggsb; +Cc: dri-devel

On 01/25/2012 09:05 AM, Ben Skeggs wrote:
> On Wed, 2012-01-25 at 08:43 +0100, Thomas Hellstrom wrote:
>> On 01/25/2012 06:34 AM, Ben Skeggs wrote:
>>> From: Ben Skeggs<bskeggs@redhat.com>
>>>
>>> Both changes in dc97b3409a790d2a21aac6e5cdb99558b5944119 cause serious
>>> regressions in the nouveau driver.
>>>
>>> move_notify() was originally able to presume that bo->mem is the old node,
>>> and new_mem is the new node.  The above commit moves the call to
>>> move_notify() to after move() has been done, which means that now, sometimes,
>>> new_mem isn't the new node at all, bo->mem is, and new_mem points at a
>>> stale, possibly-just-been-killed-by-move node.
>>>
>>> This is clearly not a good situation.  This patch reverts this change, and
>>> replaces it with a cleanup in the move() failure path instead.
>> While I have nothing against having move_notify() happening before the
>> move, but
>> I still very much dislike  the failure path thing, since it puts a
>> requirement on TTM to support
>> driver moves through move_notify(), which is done by Radeon.
>> I've kindly asked Jerome to change that, stating a number of reasons,
>> but he refuses, and I'm
>> not prepared to let support for that mode of operation sneak in like this.
> What requirement is there?  All it's asking the driver is to do a
> move_notify with old/new nodes swapped, which would effectively undo the
> previous move_notify().
>
> move_notify() *cannot* happen after the move for the reasons I mentioned
> already, so the choice is apparently either to completely ignore
> cleaning up if the subsequent move() fails (previous behaviour prior to
> the patch which caused these regressions), or to make the change I
> did...

I agree. What I'm proposing is that we should go through with the first 
part of
your patch and ignore the cleanup.

>
>>
>> The second issue is that the call to move_notify() from cleanup_memtype_use()
>> causes the TTM ghost objects to get passed into the driver.  This is clearly
>> bad as the driver knows nothing about these "fake" TTM BOs, and ends up
>> accessing uninitialised memory.
>>
>> I worked around this in nouveau's move_notify() hook by ensuring the BO
>> destructor was nouveau's.  I don't particularly like this solution, and
>> would rather TTM never pass the driver these objects.  However, I don't
>> clearly understand the reason why we're calling move_notify() here anyway
>> and am happy to work around the problem in nouveau instead of breaking the
>> behaviour expected by other drivers.
>>
>>
>> move_notify() here gives the driver a chance to unbind any GPU maps
>> remaining on the BO
>> before it changes placement or is destroyed.
>> We've previously required the driver to support any type of BO in the
>> driver hooks, I agree
>> with you that it would be desirable to make that go away.
> Okay, that's fine I guess.  As the commit says, I'm happy to make
> nouveau just ignore operations on BOs it doesn't "own".
>
> I know *you* think of move_notify() as only being around to deal with
> unmapping, but as I mentioned previously, it'd have never taken the new
> node as a parameter if this is were the case.
That's not exactly true. After you pointed that out, I went back and 
check the
old vmwgfx use, and the new node was used to determine whether
an unbind was actually necessary, or whether it could leave the gpu map
untouched.

>   I have zero issue with
> nouveau using it to set up the GPU mappings currently.  I know you've
> suggested alternatives previously, which may be possible down the track
> if it's *really* necessary, but it seems like a bad idea to pursue this
> for 3.3.

My main concern is that we blindly and unnecessarily set up GPU bindings and
end up with unnecessary code in TTM, and furthermore that we communicate
that bad practice to future driver writers.

Thomas, what do you suggest to move forward with this? Both of these 
bugs are serious regressions that make nouveau unusable with the current 
3.3-rc series. Ben.

My number one choice would of course be to have the drivers set up their 
private GPU mappings after a
successful validate call, as I originally suggested and you agreed to.

If that's not possible (I realize it's late in the release series), I'll 
ack this patch if you and Jerome agree not to block
attempts to move in that direction for future kernel releases.

Thanks,
Thomas



>> Signed-off-by: Ben Skeggs<bskeggs@redhat.com>
>> Cc: Jerome Glisse<j.glisse@gmail.com>
>> ---
>>    drivers/gpu/drm/nouveau/nouveau_bo.c |    4 ++++
>>    drivers/gpu/drm/ttm/ttm_bo.c         |   17 +++++++++++++----
>>    2 files changed, 17 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
>> index 724b41a..ec54364 100644
>> --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
>> +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
>> @@ -812,6 +812,10 @@ nouveau_bo_move_ntfy(struct ttm_buffer_object *bo, struct ttm_mem_reg *new_mem)
>>    	struct nouveau_bo *nvbo = nouveau_bo(bo);
>>    	struct nouveau_vma *vma;
>>
>> +	/* ttm can now (stupidly) pass the driver bos it didn't create... */
>> +	if (bo->destroy != nouveau_bo_del_ttm)
>> +		return;
>> +
>>    	list_for_each_entry(vma,&nvbo->vma_list, head) {
>>    		if (new_mem&&   new_mem->mem_type == TTM_PL_VRAM) {
>>    			nouveau_vm_map(vma, new_mem->mm_node);
>> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
>> index 2f0eab6..7c3a57d 100644
>> --- a/drivers/gpu/drm/ttm/ttm_bo.c
>> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
>> @@ -404,6 +404,9 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
>>    		}
>>    	}
>>
>> +	if (bdev->driver->move_notify)
>> +		bdev->driver->move_notify(bo, mem);
>> +
>>    	if (!(old_man->flags&   TTM_MEMTYPE_FLAG_FIXED)&&
>>    	!(new_man->flags&   TTM_MEMTYPE_FLAG_FIXED))
>>    		ret = ttm_bo_move_ttm(bo, evict, no_wait_reserve, no_wait_gpu, mem);
>> @@ -413,11 +416,17 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
>>    	else
>>    		ret = ttm_bo_move_memcpy(bo, evict, no_wait_reserve, no_wait_gpu, mem);
>>
>> -	if (ret)
>> -		goto out_err;
>> +	if (ret) {
>> +		if (bdev->driver->move_notify) {
>> +			struct ttm_mem_reg tmp_mem = *mem;
>> +			*mem = bo->mem;
>> +			bo->mem = tmp_mem;
>> +			bdev->driver->move_notify(bo, mem);
>> +			bo->mem = *mem;
>> +		}
>>
>> -	if (bdev->driver->move_notify)
>> -		bdev->driver->move_notify(bo, mem);
>> +		goto out_err;
>> +	}
>>
>>    moved:
>>    	if (bo->evicted) {
>>
>>
>>
>>
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] drm/ttm: fix two regressions since move_notify changes
  2012-01-25  8:39               ` Thomas Hellstrom
@ 2012-01-25  9:41                 ` Ben Skeggs
  2012-01-25 14:33                   ` Thomas Hellstrom
  0 siblings, 1 reply; 23+ messages in thread
From: Ben Skeggs @ 2012-01-25  9:41 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: dri-devel

On Wed, 2012-01-25 at 09:39 +0100, Thomas Hellstrom wrote:
> On 01/25/2012 09:05 AM, Ben Skeggs wrote:
> > On Wed, 2012-01-25 at 08:43 +0100, Thomas Hellstrom wrote:
> >> On 01/25/2012 06:34 AM, Ben Skeggs wrote:
> >>> From: Ben Skeggs<bskeggs@redhat.com>
> >>>
> >>> Both changes in dc97b3409a790d2a21aac6e5cdb99558b5944119 cause serious
> >>> regressions in the nouveau driver.
> >>>
> >>> move_notify() was originally able to presume that bo->mem is the old node,
> >>> and new_mem is the new node.  The above commit moves the call to
> >>> move_notify() to after move() has been done, which means that now, sometimes,
> >>> new_mem isn't the new node at all, bo->mem is, and new_mem points at a
> >>> stale, possibly-just-been-killed-by-move node.
> >>>
> >>> This is clearly not a good situation.  This patch reverts this change, and
> >>> replaces it with a cleanup in the move() failure path instead.
> >> While I have nothing against having move_notify() happening before the
> >> move, but
> >> I still very much dislike  the failure path thing, since it puts a
> >> requirement on TTM to support
> >> driver moves through move_notify(), which is done by Radeon.
> >> I've kindly asked Jerome to change that, stating a number of reasons,
> >> but he refuses, and I'm
> >> not prepared to let support for that mode of operation sneak in like this.
> > What requirement is there?  All it's asking the driver is to do a
> > move_notify with old/new nodes swapped, which would effectively undo the
> > previous move_notify().
> >
> > move_notify() *cannot* happen after the move for the reasons I mentioned
> > already, so the choice is apparently either to completely ignore
> > cleaning up if the subsequent move() fails (previous behaviour prior to
> > the patch which caused these regressions), or to make the change I
> > did...
> 
> I agree. What I'm proposing is that we should go through with the first 
> part of
> your patch and ignore the cleanup.
> 
> >
> >>
> >> The second issue is that the call to move_notify() from cleanup_memtype_use()
> >> causes the TTM ghost objects to get passed into the driver.  This is clearly
> >> bad as the driver knows nothing about these "fake" TTM BOs, and ends up
> >> accessing uninitialised memory.
> >>
> >> I worked around this in nouveau's move_notify() hook by ensuring the BO
> >> destructor was nouveau's.  I don't particularly like this solution, and
> >> would rather TTM never pass the driver these objects.  However, I don't
> >> clearly understand the reason why we're calling move_notify() here anyway
> >> and am happy to work around the problem in nouveau instead of breaking the
> >> behaviour expected by other drivers.
> >>
> >>
> >> move_notify() here gives the driver a chance to unbind any GPU maps
> >> remaining on the BO
> >> before it changes placement or is destroyed.
> >> We've previously required the driver to support any type of BO in the
> >> driver hooks, I agree
> >> with you that it would be desirable to make that go away.
> > Okay, that's fine I guess.  As the commit says, I'm happy to make
> > nouveau just ignore operations on BOs it doesn't "own".
> >
> > I know *you* think of move_notify() as only being around to deal with
> > unmapping, but as I mentioned previously, it'd have never taken the new
> > node as a parameter if this is were the case.
> That's not exactly true. After you pointed that out, I went back and 
> check the
> old vmwgfx use, and the new node was used to determine whether
> an unbind was actually necessary, or whether it could leave the gpu map
> untouched.
Ok, but even so, prior to these changes there was no good reason nor
mention in any of the headers/code that drivers weren't supposed to
assume "new_mem" was anything useful..

> 
> >   I have zero issue with
> > nouveau using it to set up the GPU mappings currently.  I know you've
> > suggested alternatives previously, which may be possible down the track
> > if it's *really* necessary, but it seems like a bad idea to pursue this
> > for 3.3.
> 
> My main concern is that we blindly and unnecessarily set up GPU bindings and
> end up with unnecessary code in TTM, and furthermore that we communicate
> that bad practice to future driver writers.
This "unnecessary code" is like 5 lines of cleanup if something fails,
hardly anything to be jumping up and down about :)

> 
> Thomas, what do you suggest to move forward with this? Both of these 
> bugs are serious regressions that make nouveau unusable with the current 
> 3.3-rc series. Ben.
> 
> My number one choice would of course be to have the drivers set up their 
> private GPU mappings after a
> successful validate call, as I originally suggested and you agreed to.
> 
> If that's not possible (I realize it's late in the release series), I'll 
> ack this patch if you and Jerome agree not to block
> attempts to move in that direction for future kernel releases.
I can't say I'm entirely happy with the plan honestly.  To me, it still
seems more efficient to handle this when a move happens (comparatively
rare) and "map new backing storage into every vm that has a reference"
than to (on every single buffer of every single "exec" call) go "is this
buffer mapped into this channel's vm? yes, ok; no, lets go map it".

I'm not even sure how exactly I plan on storing this mapping efficiently
yet.. Scanning the BO's linked list of VMs it's mapped into for "if
(this_vma == chan->vma)" doesn't exactly sound performant.

Thanks,
Ben.

> 
> Thanks,
> Thomas
> 
> 
> 
> >> Signed-off-by: Ben Skeggs<bskeggs@redhat.com>
> >> Cc: Jerome Glisse<j.glisse@gmail.com>
> >> ---
> >>    drivers/gpu/drm/nouveau/nouveau_bo.c |    4 ++++
> >>    drivers/gpu/drm/ttm/ttm_bo.c         |   17 +++++++++++++----
> >>    2 files changed, 17 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
> >> index 724b41a..ec54364 100644
> >> --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
> >> +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
> >> @@ -812,6 +812,10 @@ nouveau_bo_move_ntfy(struct ttm_buffer_object *bo, struct ttm_mem_reg *new_mem)
> >>    	struct nouveau_bo *nvbo = nouveau_bo(bo);
> >>    	struct nouveau_vma *vma;
> >>
> >> +	/* ttm can now (stupidly) pass the driver bos it didn't create... */
> >> +	if (bo->destroy != nouveau_bo_del_ttm)
> >> +		return;
> >> +
> >>    	list_for_each_entry(vma,&nvbo->vma_list, head) {
> >>    		if (new_mem&&   new_mem->mem_type == TTM_PL_VRAM) {
> >>    			nouveau_vm_map(vma, new_mem->mm_node);
> >> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> >> index 2f0eab6..7c3a57d 100644
> >> --- a/drivers/gpu/drm/ttm/ttm_bo.c
> >> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> >> @@ -404,6 +404,9 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
> >>    		}
> >>    	}
> >>
> >> +	if (bdev->driver->move_notify)
> >> +		bdev->driver->move_notify(bo, mem);
> >> +
> >>    	if (!(old_man->flags&   TTM_MEMTYPE_FLAG_FIXED)&&
> >>    	!(new_man->flags&   TTM_MEMTYPE_FLAG_FIXED))
> >>    		ret = ttm_bo_move_ttm(bo, evict, no_wait_reserve, no_wait_gpu, mem);
> >> @@ -413,11 +416,17 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
> >>    	else
> >>    		ret = ttm_bo_move_memcpy(bo, evict, no_wait_reserve, no_wait_gpu, mem);
> >>
> >> -	if (ret)
> >> -		goto out_err;
> >> +	if (ret) {
> >> +		if (bdev->driver->move_notify) {
> >> +			struct ttm_mem_reg tmp_mem = *mem;
> >> +			*mem = bo->mem;
> >> +			bo->mem = tmp_mem;
> >> +			bdev->driver->move_notify(bo, mem);
> >> +			bo->mem = *mem;
> >> +		}
> >>
> >> -	if (bdev->driver->move_notify)
> >> -		bdev->driver->move_notify(bo, mem);
> >> +		goto out_err;
> >> +	}
> >>
> >>    moved:
> >>    	if (bo->evicted) {
> >>
> >>
> >>
> >>
> >
> 
> 
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] drm/ttm: fix two regressions since move_notify changes
  2012-01-25  9:41                 ` Ben Skeggs
@ 2012-01-25 14:33                   ` Thomas Hellstrom
  2012-01-25 15:21                     ` Ben Skeggs
  0 siblings, 1 reply; 23+ messages in thread
From: Thomas Hellstrom @ 2012-01-25 14:33 UTC (permalink / raw)
  To: skeggsb; +Cc: dri-devel

On 01/25/2012 10:41 AM, Ben Skeggs wrote:
>
> My main concern is that we blindly and unnecessarily set up GPU bindings and
> end up with unnecessary code in TTM, and furthermore that we communicate
> that bad practice to future driver writers.
> This "unnecessary code" is like 5 lines of cleanup if something fails,
> hardly anything to be jumping up and down about :)

It's just not TTM's business, unless the GPU maps are mappable by the 
CPU as well.
Also, What if the mapping setup in move_notify() fails?

>
>> Thomas, what do you suggest to move forward with this? Both of these
>> bugs are serious regressions that make nouveau unusable with the current
>> 3.3-rc series. Ben.
>>
>> My number one choice would of course be to have the drivers set up their
>> private GPU mappings after a
>> successful validate call, as I originally suggested and you agreed to.
>>
>> If that's not possible (I realize it's late in the release series), I'll
>> ack this patch if you and Jerome agree not to block
>> attempts to move in that direction for future kernel releases.
> I can't say I'm entirely happy with the plan honestly.  To me, it still
> seems more efficient to handle this when a move happens (comparatively
> rare) and "map new backing storage into every vm that has a reference"
> than to (on every single buffer of every single "exec" call) go "is this
> buffer mapped into this channel's vm? yes, ok; no, lets go map it".
>
> I'm not even sure how exactly I plan on storing this mapping efficiently
> yet.. Scanning the BO's linked list of VMs it's mapped into for "if
> (this_vma == chan->vma)" doesn't exactly sound performant.

As previously suggested, in the simplest case a bo could have a 'needs 
remap' flag
that is set on gpu map teardown on move_notify(), and when this flag is 
detected in validate,
go ahead and set up all needed maps and clear that flag.

This is the simplest case and more or less equivalent to the current 
solution, except
maps aren't set up unless needed by at least one channel and there is a 
clear way
to handle errors when GPU maps are set up.

A simple and straightforward fix that leaves the path open (if so 
desired) to
handle finer channel granularity.

Or am I missing something?

/Thomas






>
> Thanks,
> Ben.
>
>> Thanks,
>> Thomas
>>
>>
>>
>>>> Signed-off-by: Ben Skeggs<bskeggs@redhat.com>
>>>> Cc: Jerome Glisse<j.glisse@gmail.com>
>>>> ---
>>>>     drivers/gpu/drm/nouveau/nouveau_bo.c |    4 ++++
>>>>     drivers/gpu/drm/ttm/ttm_bo.c         |   17 +++++++++++++----
>>>>     2 files changed, 17 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
>>>> index 724b41a..ec54364 100644
>>>> --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
>>>> +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
>>>> @@ -812,6 +812,10 @@ nouveau_bo_move_ntfy(struct ttm_buffer_object *bo, struct ttm_mem_reg *new_mem)
>>>>     	struct nouveau_bo *nvbo = nouveau_bo(bo);
>>>>     	struct nouveau_vma *vma;
>>>>
>>>> +	/* ttm can now (stupidly) pass the driver bos it didn't create... */
>>>> +	if (bo->destroy != nouveau_bo_del_ttm)
>>>> +		return;
>>>> +
>>>>     	list_for_each_entry(vma,&nvbo->vma_list, head) {
>>>>     		if (new_mem&&    new_mem->mem_type == TTM_PL_VRAM) {
>>>>     			nouveau_vm_map(vma, new_mem->mm_node);
>>>> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
>>>> index 2f0eab6..7c3a57d 100644
>>>> --- a/drivers/gpu/drm/ttm/ttm_bo.c
>>>> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
>>>> @@ -404,6 +404,9 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
>>>>     		}
>>>>     	}
>>>>
>>>> +	if (bdev->driver->move_notify)
>>>> +		bdev->driver->move_notify(bo, mem);
>>>> +
>>>>     	if (!(old_man->flags&    TTM_MEMTYPE_FLAG_FIXED)&&
>>>>     	!(new_man->flags&    TTM_MEMTYPE_FLAG_FIXED))
>>>>     		ret = ttm_bo_move_ttm(bo, evict, no_wait_reserve, no_wait_gpu, mem);
>>>> @@ -413,11 +416,17 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
>>>>     	else
>>>>     		ret = ttm_bo_move_memcpy(bo, evict, no_wait_reserve, no_wait_gpu, mem);
>>>>
>>>> -	if (ret)
>>>> -		goto out_err;
>>>> +	if (ret) {
>>>> +		if (bdev->driver->move_notify) {
>>>> +			struct ttm_mem_reg tmp_mem = *mem;
>>>> +			*mem = bo->mem;
>>>> +			bo->mem = tmp_mem;
>>>> +			bdev->driver->move_notify(bo, mem);
>>>> +			bo->mem = *mem;
>>>> +		}
>>>>
>>>> -	if (bdev->driver->move_notify)
>>>> -		bdev->driver->move_notify(bo, mem);
>>>> +		goto out_err;
>>>> +	}
>>>>
>>>>     moved:
>>>>     	if (bo->evicted) {
>>>>
>>>>
>>>>
>>>>
>>
>>
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] drm/ttm: fix two regressions since move_notify changes
  2012-01-25 14:33                   ` Thomas Hellstrom
@ 2012-01-25 15:21                     ` Ben Skeggs
  2012-01-25 15:37                       ` Jerome Glisse
  0 siblings, 1 reply; 23+ messages in thread
From: Ben Skeggs @ 2012-01-25 15:21 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: dri-devel

On Wed, 2012-01-25 at 15:33 +0100, Thomas Hellstrom wrote:
> On 01/25/2012 10:41 AM, Ben Skeggs wrote:
> >
> > My main concern is that we blindly and unnecessarily set up GPU bindings and
> > end up with unnecessary code in TTM, and furthermore that we communicate
> > that bad practice to future driver writers.
> > This "unnecessary code" is like 5 lines of cleanup if something fails,
> > hardly anything to be jumping up and down about :)
> 
> It's just not TTM's business, unless the GPU maps are mappable by the 
> CPU as well.
> Also, What if the mapping setup in move_notify() fails?
It can't fail, and well, in nouveau's implementation it never will.
It's simply a "fill the ptes for all the vmas currently associated with
a bo".

And well, it's about as much TTM's business as VRAM aperture allocation
is..  I don't see the big deal, if you wan't to do it a different way in
your driver, there's nothing stopping you.  It's a lot of bother for
essentially zero effort in TTM..

> 
> >
> >> Thomas, what do you suggest to move forward with this? Both of these
> >> bugs are serious regressions that make nouveau unusable with the current
> >> 3.3-rc series. Ben.
> >>
> >> My number one choice would of course be to have the drivers set up their
> >> private GPU mappings after a
> >> successful validate call, as I originally suggested and you agreed to.
> >>
> >> If that's not possible (I realize it's late in the release series), I'll
> >> ack this patch if you and Jerome agree not to block
> >> attempts to move in that direction for future kernel releases.
> > I can't say I'm entirely happy with the plan honestly.  To me, it still
> > seems more efficient to handle this when a move happens (comparatively
> > rare) and "map new backing storage into every vm that has a reference"
> > than to (on every single buffer of every single "exec" call) go "is this
> > buffer mapped into this channel's vm? yes, ok; no, lets go map it".
> >
> > I'm not even sure how exactly I plan on storing this mapping efficiently
> > yet.. Scanning the BO's linked list of VMs it's mapped into for "if
> > (this_vma == chan->vma)" doesn't exactly sound performant.
> 
> As previously suggested, in the simplest case a bo could have a 'needs 
> remap' flag
> that is set on gpu map teardown on move_notify(), and when this flag is 
> detected in validate,
> go ahead and set up all needed maps and clear that flag.
> 
> This is the simplest case and more or less equivalent to the current 
> solution, except
> maps aren't set up unless needed by at least one channel and there is a 
> clear way
> to handle errors when GPU maps are set up.
Yes, right.  That can be done, and gives exactly the same functionality
as I *already* achieve with move_notify() but apparently have to change
just because you've decided nouveau/radeon are both doing the
WrongThing(tm).

Anyhow, I care more that 3.3 works than how it works.  So, whatever.  If
I must agree to this in order to get a regression fix in, then I guess I
really have no choice in the matter.

Ben.

> 
> A simple and straightforward fix that leaves the path open (if so 
> desired) to
> handle finer channel granularity.
> 
> Or am I missing something?
> 
> /Thomas
> 
> 
> 
> 
> 
> 
> >
> > Thanks,
> > Ben.
> >
> >> Thanks,
> >> Thomas
> >>
> >>
> >>
> >>>> Signed-off-by: Ben Skeggs<bskeggs@redhat.com>
> >>>> Cc: Jerome Glisse<j.glisse@gmail.com>
> >>>> ---
> >>>>     drivers/gpu/drm/nouveau/nouveau_bo.c |    4 ++++
> >>>>     drivers/gpu/drm/ttm/ttm_bo.c         |   17 +++++++++++++----
> >>>>     2 files changed, 17 insertions(+), 4 deletions(-)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
> >>>> index 724b41a..ec54364 100644
> >>>> --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
> >>>> +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
> >>>> @@ -812,6 +812,10 @@ nouveau_bo_move_ntfy(struct ttm_buffer_object *bo, struct ttm_mem_reg *new_mem)
> >>>>     	struct nouveau_bo *nvbo = nouveau_bo(bo);
> >>>>     	struct nouveau_vma *vma;
> >>>>
> >>>> +	/* ttm can now (stupidly) pass the driver bos it didn't create... */
> >>>> +	if (bo->destroy != nouveau_bo_del_ttm)
> >>>> +		return;
> >>>> +
> >>>>     	list_for_each_entry(vma,&nvbo->vma_list, head) {
> >>>>     		if (new_mem&&    new_mem->mem_type == TTM_PL_VRAM) {
> >>>>     			nouveau_vm_map(vma, new_mem->mm_node);
> >>>> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> >>>> index 2f0eab6..7c3a57d 100644
> >>>> --- a/drivers/gpu/drm/ttm/ttm_bo.c
> >>>> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> >>>> @@ -404,6 +404,9 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
> >>>>     		}
> >>>>     	}
> >>>>
> >>>> +	if (bdev->driver->move_notify)
> >>>> +		bdev->driver->move_notify(bo, mem);
> >>>> +
> >>>>     	if (!(old_man->flags&    TTM_MEMTYPE_FLAG_FIXED)&&
> >>>>     	!(new_man->flags&    TTM_MEMTYPE_FLAG_FIXED))
> >>>>     		ret = ttm_bo_move_ttm(bo, evict, no_wait_reserve, no_wait_gpu, mem);
> >>>> @@ -413,11 +416,17 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
> >>>>     	else
> >>>>     		ret = ttm_bo_move_memcpy(bo, evict, no_wait_reserve, no_wait_gpu, mem);
> >>>>
> >>>> -	if (ret)
> >>>> -		goto out_err;
> >>>> +	if (ret) {
> >>>> +		if (bdev->driver->move_notify) {
> >>>> +			struct ttm_mem_reg tmp_mem = *mem;
> >>>> +			*mem = bo->mem;
> >>>> +			bo->mem = tmp_mem;
> >>>> +			bdev->driver->move_notify(bo, mem);
> >>>> +			bo->mem = *mem;
> >>>> +		}
> >>>>
> >>>> -	if (bdev->driver->move_notify)
> >>>> -		bdev->driver->move_notify(bo, mem);
> >>>> +		goto out_err;
> >>>> +	}
> >>>>
> >>>>     moved:
> >>>>     	if (bo->evicted) {
> >>>>
> >>>>
> >>>>
> >>>>
> >>
> >>
> >
> 
> 
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] drm/ttm: fix two regressions since move_notify changes
  2012-01-25 15:21                     ` Ben Skeggs
@ 2012-01-25 15:37                       ` Jerome Glisse
  2012-01-25 17:15                         ` Thomas Hellstrom
  2012-01-25 17:19                         ` Thomas Hellstrom
  0 siblings, 2 replies; 23+ messages in thread
From: Jerome Glisse @ 2012-01-25 15:37 UTC (permalink / raw)
  To: skeggsb; +Cc: dri-devel

On Wed, Jan 25, 2012 at 10:21 AM, Ben Skeggs <skeggsb@gmail.com> wrote:
> On Wed, 2012-01-25 at 15:33 +0100, Thomas Hellstrom wrote:
>> On 01/25/2012 10:41 AM, Ben Skeggs wrote:
>> >
>> > My main concern is that we blindly and unnecessarily set up GPU bindings and
>> > end up with unnecessary code in TTM, and furthermore that we communicate
>> > that bad practice to future driver writers.
>> > This "unnecessary code" is like 5 lines of cleanup if something fails,
>> > hardly anything to be jumping up and down about :)
>>
>> It's just not TTM's business, unless the GPU maps are mappable by the
>> CPU as well.
>> Also, What if the mapping setup in move_notify() fails?
> It can't fail, and well, in nouveau's implementation it never will.
> It's simply a "fill the ptes for all the vmas currently associated with
> a bo".
>
> And well, it's about as much TTM's business as VRAM aperture allocation
> is..  I don't see the big deal, if you wan't to do it a different way in
> your driver, there's nothing stopping you.  It's a lot of bother for
> essentially zero effort in TTM..
>
>>
>> >
>> >> Thomas, what do you suggest to move forward with this? Both of these
>> >> bugs are serious regressions that make nouveau unusable with the current
>> >> 3.3-rc series. Ben.
>> >>
>> >> My number one choice would of course be to have the drivers set up their
>> >> private GPU mappings after a
>> >> successful validate call, as I originally suggested and you agreed to.
>> >>
>> >> If that's not possible (I realize it's late in the release series), I'll
>> >> ack this patch if you and Jerome agree not to block
>> >> attempts to move in that direction for future kernel releases.
>> > I can't say I'm entirely happy with the plan honestly.  To me, it still
>> > seems more efficient to handle this when a move happens (comparatively
>> > rare) and "map new backing storage into every vm that has a reference"
>> > than to (on every single buffer of every single "exec" call) go "is this
>> > buffer mapped into this channel's vm? yes, ok; no, lets go map it".
>> >
>> > I'm not even sure how exactly I plan on storing this mapping efficiently
>> > yet.. Scanning the BO's linked list of VMs it's mapped into for "if
>> > (this_vma == chan->vma)" doesn't exactly sound performant.
>>
>> As previously suggested, in the simplest case a bo could have a 'needs
>> remap' flag
>> that is set on gpu map teardown on move_notify(), and when this flag is
>> detected in validate,
>> go ahead and set up all needed maps and clear that flag.
>>
>> This is the simplest case and more or less equivalent to the current
>> solution, except
>> maps aren't set up unless needed by at least one channel and there is a
>> clear way
>> to handle errors when GPU maps are set up.
> Yes, right.  That can be done, and gives exactly the same functionality
> as I *already* achieve with move_notify() but apparently have to change
> just because you've decided nouveau/radeon are both doing the
> WrongThing(tm).
>
> Anyhow, I care more that 3.3 works than how it works.  So, whatever.  If
> I must agree to this in order to get a regression fix in, then I guess I
> really have no choice in the matter.
>
> Ben.
>
>>
>> A simple and straightforward fix that leaves the path open (if so
>> desired) to
>> handle finer channel granularity.
>>
>> Or am I missing something?
>>

I went over the code and Ben fix is ok with me, i need to test it a
bit on radeon side.

For long term solution why not just move most of the
ttm_bo_handle_move_mem to the driver. It would obsolete the
move_notify callback. move notify callback was introduced because in
some case the driver never knew directly that a bo moved. It's obvious
that driver need to know every time. So instead of having an ha-doc
function for that. Let just move the handle move stuff into the
driver. Yes there will be some code duplication but it will avoid
anykind of weird error path and driver will be able to perform what
ever make sense.

Cheers,
Jerome

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [next] Null pointer dereference in nouveau_vm_map_sg
  2012-01-25  0:12         ` Martin Nyhus
@ 2012-01-25 16:54           ` Jerome Glisse
  0 siblings, 0 replies; 23+ messages in thread
From: Jerome Glisse @ 2012-01-25 16:54 UTC (permalink / raw)
  To: Martin Nyhus; +Cc: Konrad Rzeszutek Wilk, Ben Skeggs, dri-devel, linux-kernel

On Tue, Jan 24, 2012 at 7:12 PM, Martin Nyhus <martin.nyhus@gmx.com> wrote:
> On Tue, 24 Jan 2012 17:33:19 -0500 Jerome Glisse <j.glisse@gmail.com>
> wrote:
>> Can you please both test if attached patch fix it for you ?
>
> Thanks. It looks good too me, but it crashes a little later due to vma->node
> being invalid:
>
> Jan 25 00:54:21 callisto kernel: [  119.038357] [drm] nouveau_vm_unmap vma ffff880057502f50
> Jan 25 00:54:21 callisto kernel: [  119.038360] [drm] nouveau_vm_unmap vma->node ffff8800576b87a8
> Jan 25 00:54:21 callisto kernel: [  119.038363] [drm] nouveau_vm_unmap vma->node->length 58
> Jan 25 00:54:21 callisto kernel: [  119.038477] [drm] nouveau_vm_unmap vma ffff8800577beab8
> Jan 25 00:54:21 callisto kernel: [  119.038479] [drm] nouveau_vm_unmap vma->node ffff8800577bf880
> Jan 25 00:54:21 callisto kernel: [  119.038482] [drm] nouveau_vm_unmap vma->node->length 1
> Jan 25 00:54:21 callisto kernel: [  119.078025] [drm] nouveau_vm_unmap vma ffffffff8148df45
> Jan 25 00:54:21 callisto kernel: [  119.078029] [drm] nouveau_vm_unmap vma->node 8b48084b8b480000
> Jan 25 00:54:21 callisto kernel: [  119.078040] general protection fault: 0000 [#1] SMP
> Jan 25 00:54:21 callisto kernel: [  119.078133] CPU 0
> Jan 25 00:54:21 callisto kernel: [  119.078138] Modules linked in: tun iwl4965 iwlegacy mac80211 cfg80211 tg3 psmouse rtc_cmos evdev ehci_hcd uhci_hcd usbcore usb_common [last unloaded: scsi_wait_scan]
> Jan 25 00:54:21 callisto kernel: [  119.078542]
> Jan 25 00:54:21 callisto kernel: [  119.078914] Pid: 3220, comm: Xorg Tainted: G        W    3.3.0-rc1-00076-g44d4826-dirty #75 Dell Inc. XPS M1330 /0PU073
> Jan 25 00:54:21 callisto kernel: [  119.079331] RIP: 0010:[<ffffffff814b2f7f>]  [<ffffffff814b2f7f>] nouveau_vm_unmap+0x4f/0x80
> Jan 25 00:54:21 callisto kernel: [  119.079778] RSP: 0018:ffff88005c167868  EFLAGS: 00010292
> Jan 25 00:54:21 callisto kernel: [  119.080266] RAX: 8b48084b8b480000 RBX: ffffffff8148df45 RCX: 0000000000000006
> Jan 25 00:54:21 callisto kernel: [  119.080712] RDX: 0000000000000000 RSI: ffffffff81868740 RDI: ffffffff81a6e040
> Jan 25 00:54:21 callisto kernel: [  119.081218] RBP: ffff88005c167878 R08: 0000000000000001 R09: 0000000000000000
> Jan 25 00:54:21 callisto kernel: [  119.081320] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
> Jan 25 00:54:21 callisto kernel: [  119.081320] R13: ffff88006c309c80 R14: ffff88006c309a40 R15: ffff880037180590
> Jan 25 00:54:21 callisto kernel: [  119.081320] FS:  00007f141232f880(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
> Jan 25 00:54:21 callisto kernel: [  119.081320] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Jan 25 00:54:21 callisto kernel: [  119.081320] CR2: 00007fb09c1de000 CR3: 000000005ce28000 CR4: 00000000000006f0
> Jan 25 00:54:21 callisto kernel: [  119.081320] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Jan 25 00:54:21 callisto kernel: [  119.081320] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Jan 25 00:54:21 callisto kernel: [  119.081320] Process Xorg (pid: 3220, threadinfo ffff88005c166000, task ffff88005f502180)
> Jan 25 00:54:21 callisto kernel: [  119.081320] Stack:
> Jan 25 00:54:21 callisto kernel: [  119.081320]  ffff88005f502180 ffffffff8148df45 ffff88005c1678a8 ffffffff8148c0e8
> Jan 25 00:54:21 callisto kernel: [  119.081320]  ffff88006c309a40 0000000000000002 ffff880037180b00 ffff880079ff5e68
> Jan 25 00:54:21 callisto kernel: [  119.081320]  ffff88005c1678c8 ffffffff814792b1 ffff880079ff5e68 ffff88006c309a40
> Jan 25 00:54:21 callisto kernel: [  119.081320] Call Trace:
> Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff8148df45>] ? nouveau_bo_move+0xb5/0x270
> Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff8148c0e8>] nouveau_bo_move_ntfy+0x38/0xc0
> Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff814792b1>] ttm_bo_cleanup_memtype_use+0x21/0xa0
> Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff8147a5b5>] ttm_bo_cleanup_refs_or_queue+0x165/0x190
> Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff8147a675>] ttm_bo_release+0x95/0xd0
> Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff8147a6ef>] ttm_bo_unref+0x3f/0x60
> Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff8147cae3>] ttm_bo_move_accel_cleanup+0x213/0x240
> Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff8148db28>] nouveau_bo_move_m2mf+0x148/0x1b0
> Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff817bfd49>] ? mutex_unlock+0x9/0x10
> Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff8148df45>] nouveau_bo_move+0xb5/0x270
> Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff8147ab66>] ttm_bo_handle_move_mem+0x1e6/0x3d0
> Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff8147bcba>] ttm_bo_move_buffer+0x14a/0x160
> Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff8147bdb7>] ttm_bo_validate+0xe7/0xf0
> Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff8148cbdd>] nouveau_bo_validate+0x1d/0x20
> Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff8148f2a0>] validate_list+0xc0/0x360
> Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff8148fafa>] nouveau_gem_pushbuf_validate+0x9a/0x210
> Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff8149064d>] nouveau_gem_ioctl_pushbuf+0x1bd/0x8d0
> Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff810960c1>] ? __lock_release+0xc1/0xe0
> Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff8145f994>] drm_ioctl+0x444/0x510
> Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff81490490>] ? nouveau_gem_ioctl_new+0x170/0x170
> Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff81152147>] do_vfs_ioctl+0x87/0x330
> Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff81344e78>] ? selinux_file_ioctl+0x68/0x140
> Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff81152481>] sys_ioctl+0x91/0xa0
> Jan 25 00:54:21 callisto kernel: [  119.081320]  [<ffffffff817cade2>] system_call_fastpath+0x16/0x1b
> Jan 25 00:54:21 callisto kernel: [  119.081320] Code: 48 8b 53 20 48 c7 c6 40 87 86 81 48 c7 c7 17 3a a5 81 31 c0 e8 05 77 2f 00 48 8b 43 20 48 c7 c6 40 87 86 81 48 c7 c7 40 e0 a6 81 <8b> 50 38 31 c0 e8 e9 76 2f 00 48 8b 43 20 48 89 df 31 f6 8b 50
> Jan 25 00:54:21 callisto kernel: [  119.081320] RIP  [<ffffffff814b2f7f>] nouveau_vm_unmap+0x4f/0x80
> Jan 25 00:54:21 callisto kernel: [  119.081320]  RSP <ffff88005c167868>
> Jan 25 00:54:21 callisto kernel: [  119.128824] ---[ end trace a7919e7f17c0a727 ]---
>
> The taint is because of a failing self test (debug_objects_selftest) and the
> -dirty and extra lines at the start of the log are from this patch:
>
> diff --git a/drivers/gpu/drm/nouveau/nouveau_vm.c b/drivers/gpu/drm/nouveau/nouveau_vm.c
> index 2bf6c03..2b788c3 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_vm.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_vm.c
> @@ -150,6 +150,9 @@ nouveau_vm_unmap_at(struct nouveau_vma *vma, u64 delta, u64 length)
>  void
>  nouveau_vm_unmap(struct nouveau_vma *vma)
>  {
> +       DRM_INFO("%s vma %p\n", __func__, vma);
> +       DRM_INFO("%s vma->node %p\n", __func__, vma->node);
> +       DRM_INFO("%s vma->node->length %u\n", __func__, vma->node->length);
>        nouveau_vm_unmap_at(vma, 0, (u64)vma->node->length << 12);
>  }
>
> To reproduce I do exactly the same as before, it just takes a little longer
> before it crashes.
>
>        Martin

Ben posted a proper patch on dri-devel.

Cheers,
Jerome

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] drm/ttm: fix two regressions since move_notify changes
  2012-01-25 15:37                       ` Jerome Glisse
@ 2012-01-25 17:15                         ` Thomas Hellstrom
  2012-01-25 17:19                         ` Thomas Hellstrom
  1 sibling, 0 replies; 23+ messages in thread
From: Thomas Hellstrom @ 2012-01-25 17:15 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: dri-devel

On 01/25/2012 04:37 PM, Jerome Glisse wrote:
> On Wed, Jan 25, 2012 at 10:21 AM, Ben Skeggs<skeggsb@gmail.com>  wrote:
>> On Wed, 2012-01-25 at 15:33 +0100, Thomas Hellstrom wrote:
>>> On 01/25/2012 10:41 AM, Ben Skeggs wrote:
>>>> My main concern is that we blindly and unnecessarily set up GPU bindings and
>>>> end up with unnecessary code in TTM, and furthermore that we communicate
>>>> that bad practice to future driver writers.
>>>> This "unnecessary code" is like 5 lines of cleanup if something fails,
>>>> hardly anything to be jumping up and down about :)
>>> It's just not TTM's business, unless the GPU maps are mappable by the
>>> CPU as well.
>>> Also, What if the mapping setup in move_notify() fails?
>> It can't fail, and well, in nouveau's implementation it never will.
>> It's simply a "fill the ptes for all the vmas currently associated with
>> a bo".
>>
>> And well, it's about as much TTM's business as VRAM aperture allocation
>> is..  I don't see the big deal, if you wan't to do it a different way in
>> your driver, there's nothing stopping you.  It's a lot of bother for
>> essentially zero effort in TTM..

I think you fail to see the point.

TTM was written for placing data facilitating for GPU maps, and handle 
the CPU map implications.
Sometimes the placement can be used as direct GPU maps (legacy hardware 
TT, VRAM), sometimes not.
To me it's natural any driver private GPU maps, (just like CPU maps) are 
taken down before move, and
setup again on demand after a move.

Any driver could theoretically add a hook to facilitate a special need, 
but we shouldn't do
that without thinking of the implications for other drivers. What if 
other drivers *can* fail
when setting up private GPU maps? What if they need to set up private 
GPU maps to system memory?
What if they started out by copying Nouveau / Radeon code. Suddenly we'd 
need a return value from
move_notify(). We'd need to handle double failure if move_notify() fails 
when move had already failed.
We'd need the same for swap_notify(). Should we add a swapin_notify() as 
well to set up the
system memory GPU maps once memory is swapped back? What should we do if 
swapin_notify() fails?
It would be an utter mess.

These things are what I need to think about if we require TTM to notify 
the driver when private
GPU maps potential need to be set up, and I think the simple answer is: 
let the driver set up the GPU maps
when it needs to.


>> Yes, right.  That can be done, and gives exactly the same functionality
>> as I *already* achieve with move_notify() but apparently have to change
>> just because you've decided nouveau/radeon are both doing the
>> WrongThing(tm).

Well the point *is* that it's more or less the exact same functionality, 
since you stated that an implementation
might be inefficient and difficult.

Also I don't ask you to change this. I ask you to *not block* a change 
if I want to clean this up to avoid having
to deal with the above mess at a later date.

>>
>> Anyhow, I care more that 3.3 works than how it works.  So, whatever.  If
>> I must agree to this in order to get a regression fix in, then I guess I
>> really have no choice in the matter.
>>
>> Ben.
>>
>>
Well, I'm in the same situation.
I have to accept this otherwise Radeon and Nouveau will regress, and 
there's no time to do it differently.

/Thomas

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] drm/ttm: fix two regressions since move_notify changes
  2012-01-25 15:37                       ` Jerome Glisse
  2012-01-25 17:15                         ` Thomas Hellstrom
@ 2012-01-25 17:19                         ` Thomas Hellstrom
  2012-01-25 18:12                           ` Dave Airlie
  1 sibling, 1 reply; 23+ messages in thread
From: Thomas Hellstrom @ 2012-01-25 17:19 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: dri-devel

On 01/25/2012 04:37 PM, Jerome Glisse wrote:
> On Wed, Jan 25, 2012 at 10:21 AM, Ben Skeggs<skeggsb@gmail.com>  wrote:
>> On Wed, 2012-01-25 at 15:33 +0100, Thomas Hellstrom wrote:
>>> On 01/25/2012 10:41 AM, Ben Skeggs wrote:
>>>> My main concern is that we blindly and unnecessarily set up GPU bindings and
>>>> end up with unnecessary code in TTM, and furthermore that we communicate
>>>> that bad practice to future driver writers.
>>>> This "unnecessary code" is like 5 lines of cleanup if something fails,
>>>> hardly anything to be jumping up and down about :)
>>> It's just not TTM's business, unless the GPU maps are mappable by the
>>> CPU as well.
>>> Also, What if the mapping setup in move_notify() fails?
>> It can't fail, and well, in nouveau's implementation it never will.
>> It's simply a "fill the ptes for all the vmas currently associated with
>> a bo".
>>
>> And well, it's about as much TTM's business as VRAM aperture allocation
>> is..  I don't see the big deal, if you wan't to do it a different way in
>> your driver, there's nothing stopping you.  It's a lot of bother for
>> essentially zero effort in TTM..
>>
>>>>> Thomas, what do you suggest to move forward with this? Both of these
>>>>> bugs are serious regressions that make nouveau unusable with the current
>>>>> 3.3-rc series. Ben.
>>>>>
>>>>> My number one choice would of course be to have the drivers set up their
>>>>> private GPU mappings after a
>>>>> successful validate call, as I originally suggested and you agreed to.
>>>>>
>>>>> If that's not possible (I realize it's late in the release series), I'll
>>>>> ack this patch if you and Jerome agree not to block
>>>>> attempts to move in that direction for future kernel releases.
>>>> I can't say I'm entirely happy with the plan honestly.  To me, it still
>>>> seems more efficient to handle this when a move happens (comparatively
>>>> rare) and "map new backing storage into every vm that has a reference"
>>>> than to (on every single buffer of every single "exec" call) go "is this
>>>> buffer mapped into this channel's vm? yes, ok; no, lets go map it".
>>>>
>>>> I'm not even sure how exactly I plan on storing this mapping efficiently
>>>> yet.. Scanning the BO's linked list of VMs it's mapped into for "if
>>>> (this_vma == chan->vma)" doesn't exactly sound performant.
>>> As previously suggested, in the simplest case a bo could have a 'needs
>>> remap' flag
>>> that is set on gpu map teardown on move_notify(), and when this flag is
>>> detected in validate,
>>> go ahead and set up all needed maps and clear that flag.
>>>
>>> This is the simplest case and more or less equivalent to the current
>>> solution, except
>>> maps aren't set up unless needed by at least one channel and there is a
>>> clear way
>>> to handle errors when GPU maps are set up.
>> Yes, right.  That can be done, and gives exactly the same functionality
>> as I *already* achieve with move_notify() but apparently have to change
>> just because you've decided nouveau/radeon are both doing the
>> WrongThing(tm).
>>
>> Anyhow, I care more that 3.3 works than how it works.  So, whatever.  If
>> I must agree to this in order to get a regression fix in, then I guess I
>> really have no choice in the matter.
>>
>> Ben.
>>
>>> A simple and straightforward fix that leaves the path open (if so
>>> desired) to
>>> handle finer channel granularity.
>>>
>>> Or am I missing something?
>>>
> I went over the code and Ben fix is ok with me, i need to test it a
> bit on radeon side.
>
> For long term solution why not just move most of the
> ttm_bo_handle_move_mem to the driver. It would obsolete the
> move_notify callback. move notify callback was introduced because in
> some case the driver never knew directly that a bo moved. It's obvious
> that driver need to know every time. So instead of having an ha-doc
> function for that. Let just move the handle move stuff into the
> driver. Yes there will be some code duplication but it will avoid
> anykind of weird error path and driver will be able to perform what
> ever make sense.

Yes, this is a solution that eliminates the need for TTM to support private
GPU map setup. Code duplication can largely be avoided if we
collect common code in a small utility function.

/Thomas

>
> Cheers,
> Jerome

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] drm/ttm: fix two regressions since move_notify changes
  2012-01-25  5:34         ` [PATCH] drm/ttm: fix two regressions since move_notify changes Ben Skeggs
  2012-01-25  7:43           ` Thomas Hellstrom
  2012-01-25  8:24           ` Dave Airlie
@ 2012-01-25 17:32           ` Thomas Hellstrom
  2 siblings, 0 replies; 23+ messages in thread
From: Thomas Hellstrom @ 2012-01-25 17:32 UTC (permalink / raw)
  To: Ben Skeggs; +Cc: Ben Skeggs, dri-devel

On 01/25/2012 06:34 AM, Ben Skeggs wrote:
> From: Ben Skeggs<bskeggs@redhat.com>
>
> Both changes in dc97b3409a790d2a21aac6e5cdb99558b5944119 cause serious
> regressions in the nouveau driver.
>
> move_notify() was originally able to presume that bo->mem is the old node,
> and new_mem is the new node.  The above commit moves the call to
> move_notify() to after move() has been done, which means that now, sometimes,
> new_mem isn't the new node at all, bo->mem is, and new_mem points at a
> stale, possibly-just-been-killed-by-move node.
>
> This is clearly not a good situation.  This patch reverts this change, and
> replaces it with a cleanup in the move() failure path instead.
>
> The second issue is that the call to move_notify() from cleanup_memtype_use()
> causes the TTM ghost objects to get passed into the driver.  This is clearly
> bad as the driver knows nothing about these "fake" TTM BOs, and ends up
> accessing uninitialised memory.
>
> I worked around this in nouveau's move_notify() hook by ensuring the BO
> destructor was nouveau's.  I don't particularly like this solution, and
> would rather TTM never pass the driver these objects.  However, I don't
> clearly understand the reason why we're calling move_notify() here anyway
> and am happy to work around the problem in nouveau instead of breaking the
> behaviour expected by other drivers.
>
> Signed-off-by: Ben Skeggs<bskeggs@redhat.com>
> Cc: Jerome Glisse<j.glisse@gmail.com>
As mentioned in the lengthy email discussion, I don't like the ttm change,
but since we don't have time for anything better,

Reviewed-by: Thomas Hellstrom <thellstrom@vmware.com>


> ---
>   drivers/gpu/drm/nouveau/nouveau_bo.c |    4 ++++
>   drivers/gpu/drm/ttm/ttm_bo.c         |   17 +++++++++++++----
>   2 files changed, 17 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
> index 724b41a..ec54364 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
> @@ -812,6 +812,10 @@ nouveau_bo_move_ntfy(struct ttm_buffer_object *bo, struct ttm_mem_reg *new_mem)
>   	struct nouveau_bo *nvbo = nouveau_bo(bo);
>   	struct nouveau_vma *vma;
>
> +	/* ttm can now (stupidly) pass the driver bos it didn't create... */
> +	if (bo->destroy != nouveau_bo_del_ttm)
> +		return;
> +
>   	list_for_each_entry(vma,&nvbo->vma_list, head) {
>   		if (new_mem&&  new_mem->mem_type == TTM_PL_VRAM) {
>   			nouveau_vm_map(vma, new_mem->mm_node);
> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> index 2f0eab6..7c3a57d 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> @@ -404,6 +404,9 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
>   		}
>   	}
>
> +	if (bdev->driver->move_notify)
> +		bdev->driver->move_notify(bo, mem);
> +
>   	if (!(old_man->flags&  TTM_MEMTYPE_FLAG_FIXED)&&
>   	!(new_man->flags&  TTM_MEMTYPE_FLAG_FIXED))
>   		ret = ttm_bo_move_ttm(bo, evict, no_wait_reserve, no_wait_gpu, mem);
> @@ -413,11 +416,17 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
>   	else
>   		ret = ttm_bo_move_memcpy(bo, evict, no_wait_reserve, no_wait_gpu, mem);
>
> -	if (ret)
> -		goto out_err;
> +	if (ret) {
> +		if (bdev->driver->move_notify) {
> +			struct ttm_mem_reg tmp_mem = *mem;
> +			*mem = bo->mem;
> +			bo->mem = tmp_mem;
> +			bdev->driver->move_notify(bo, mem);
> +			bo->mem = *mem;
> +		}
>
> -	if (bdev->driver->move_notify)
> -		bdev->driver->move_notify(bo, mem);
> +		goto out_err;
> +	}
>
>   moved:
>   	if (bo->evicted) {

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] drm/ttm: fix two regressions since move_notify changes
  2012-01-25 17:19                         ` Thomas Hellstrom
@ 2012-01-25 18:12                           ` Dave Airlie
  2012-01-25 18:21                             ` Thomas Hellstrom
  0 siblings, 1 reply; 23+ messages in thread
From: Dave Airlie @ 2012-01-25 18:12 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: dri-devel

On Wed, Jan 25, 2012 at 5:19 PM, Thomas Hellstrom <thomas@shipmail.org> wrote:
> On 01/25/2012 04:37 PM, Jerome Glisse wrote:
>>
>> On Wed, Jan 25, 2012 at 10:21 AM, Ben Skeggs<skeggsb@gmail.com>  wrote:
>>>
>>> On Wed, 2012-01-25 at 15:33 +0100, Thomas Hellstrom wrote:
>>>>
>>>> On 01/25/2012 10:41 AM, Ben Skeggs wrote:
>>>>>
>>>>> My main concern is that we blindly and unnecessarily set up GPU
>>>>> bindings and
>>>>> end up with unnecessary code in TTM, and furthermore that we
>>>>> communicate
>>>>> that bad practice to future driver writers.
>>>>> This "unnecessary code" is like 5 lines of cleanup if something fails,
>>>>> hardly anything to be jumping up and down about :)
>>>>
>>>> It's just not TTM's business, unless the GPU maps are mappable by the
>>>> CPU as well.
>>>> Also, What if the mapping setup in move_notify() fails?
>>>
>>> It can't fail, and well, in nouveau's implementation it never will.
>>> It's simply a "fill the ptes for all the vmas currently associated with
>>> a bo".
>>>
>>> And well, it's about as much TTM's business as VRAM aperture allocation
>>> is..  I don't see the big deal, if you wan't to do it a different way in
>>> your driver, there's nothing stopping you.  It's a lot of bother for
>>> essentially zero effort in TTM..
>>>
>>>>>> Thomas, what do you suggest to move forward with this? Both of these
>>>>>> bugs are serious regressions that make nouveau unusable with the
>>>>>> current
>>>>>> 3.3-rc series. Ben.
>>>>>>
>>>>>> My number one choice would of course be to have the drivers set up
>>>>>> their
>>>>>> private GPU mappings after a
>>>>>> successful validate call, as I originally suggested and you agreed to.
>>>>>>
>>>>>> If that's not possible (I realize it's late in the release series),
>>>>>> I'll
>>>>>> ack this patch if you and Jerome agree not to block
>>>>>> attempts to move in that direction for future kernel releases.
>>>>>
>>>>> I can't say I'm entirely happy with the plan honestly.  To me, it still
>>>>> seems more efficient to handle this when a move happens (comparatively
>>>>> rare) and "map new backing storage into every vm that has a reference"
>>>>> than to (on every single buffer of every single "exec" call) go "is
>>>>> this
>>>>> buffer mapped into this channel's vm? yes, ok; no, lets go map it".
>>>>>
>>>>> I'm not even sure how exactly I plan on storing this mapping
>>>>> efficiently
>>>>> yet.. Scanning the BO's linked list of VMs it's mapped into for "if
>>>>> (this_vma == chan->vma)" doesn't exactly sound performant.
>>>>
>>>> As previously suggested, in the simplest case a bo could have a 'needs
>>>> remap' flag
>>>> that is set on gpu map teardown on move_notify(), and when this flag is
>>>> detected in validate,
>>>> go ahead and set up all needed maps and clear that flag.
>>>>
>>>> This is the simplest case and more or less equivalent to the current
>>>> solution, except
>>>> maps aren't set up unless needed by at least one channel and there is a
>>>> clear way
>>>> to handle errors when GPU maps are set up.
>>>
>>> Yes, right.  That can be done, and gives exactly the same functionality
>>> as I *already* achieve with move_notify() but apparently have to change
>>> just because you've decided nouveau/radeon are both doing the
>>> WrongThing(tm).
>>>
>>> Anyhow, I care more that 3.3 works than how it works.  So, whatever.  If
>>> I must agree to this in order to get a regression fix in, then I guess I
>>> really have no choice in the matter.
>>>
>>> Ben.
>>>
>>>> A simple and straightforward fix that leaves the path open (if so
>>>> desired) to
>>>> handle finer channel granularity.
>>>>
>>>> Or am I missing something?
>>>>
>> I went over the code and Ben fix is ok with me, i need to test it a
>> bit on radeon side.
>>
>> For long term solution why not just move most of the
>> ttm_bo_handle_move_mem to the driver. It would obsolete the
>> move_notify callback. move notify callback was introduced because in
>> some case the driver never knew directly that a bo moved. It's obvious
>> that driver need to know every time. So instead of having an ha-doc
>> function for that. Let just move the handle move stuff into the
>> driver. Yes there will be some code duplication but it will avoid
>> anykind of weird error path and driver will be able to perform what
>> ever make sense.
>
>
> Yes, this is a solution that eliminates the need for TTM to support private
> GPU map setup. Code duplication can largely be avoided if we
> collect common code in a small utility function.

Please this, its reduces one more place TTM suffers from midlayer effects.

I'd much prefer the drivers be in control and call into help common
code instead of the driver calling in and then lots of callbacks out.

(and yes the drm suffers from this a lot as well).

Dave.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] drm/ttm: fix two regressions since move_notify changes
  2012-01-25 18:12                           ` Dave Airlie
@ 2012-01-25 18:21                             ` Thomas Hellstrom
  2012-01-25 18:51                               ` Jerome Glisse
  0 siblings, 1 reply; 23+ messages in thread
From: Thomas Hellstrom @ 2012-01-25 18:21 UTC (permalink / raw)
  To: Dave Airlie; +Cc: dri-devel

On 01/25/2012 07:12 PM, Dave Airlie wrote:
> On Wed, Jan 25, 2012 at 5:19 PM, Thomas Hellstrom<thomas@shipmail.org>  wrote:
>> On 01/25/2012 04:37 PM, Jerome Glisse wrote:
>>> On Wed, Jan 25, 2012 at 10:21 AM, Ben Skeggs<skeggsb@gmail.com>    wrote:
>>>> On Wed, 2012-01-25 at 15:33 +0100, Thomas Hellstrom wrote:
>>>>> On 01/25/2012 10:41 AM, Ben Skeggs wrote:
>>>>>> My main concern is that we blindly and unnecessarily set up GPU
>>>>>> bindings and
>>>>>> end up with unnecessary code in TTM, and furthermore that we
>>>>>> communicate
>>>>>> that bad practice to future driver writers.
>>>>>> This "unnecessary code" is like 5 lines of cleanup if something fails,
>>>>>> hardly anything to be jumping up and down about :)
>>>>> It's just not TTM's business, unless the GPU maps are mappable by the
>>>>> CPU as well.
>>>>> Also, What if the mapping setup in move_notify() fails?
>>>> It can't fail, and well, in nouveau's implementation it never will.
>>>> It's simply a "fill the ptes for all the vmas currently associated with
>>>> a bo".
>>>>
>>>> And well, it's about as much TTM's business as VRAM aperture allocation
>>>> is..  I don't see the big deal, if you wan't to do it a different way in
>>>> your driver, there's nothing stopping you.  It's a lot of bother for
>>>> essentially zero effort in TTM..
>>>>
>>>>>>> Thomas, what do you suggest to move forward with this? Both of these
>>>>>>> bugs are serious regressions that make nouveau unusable with the
>>>>>>> current
>>>>>>> 3.3-rc series. Ben.
>>>>>>>
>>>>>>> My number one choice would of course be to have the drivers set up
>>>>>>> their
>>>>>>> private GPU mappings after a
>>>>>>> successful validate call, as I originally suggested and you agreed to.
>>>>>>>
>>>>>>> If that's not possible (I realize it's late in the release series),
>>>>>>> I'll
>>>>>>> ack this patch if you and Jerome agree not to block
>>>>>>> attempts to move in that direction for future kernel releases.
>>>>>> I can't say I'm entirely happy with the plan honestly.  To me, it still
>>>>>> seems more efficient to handle this when a move happens (comparatively
>>>>>> rare) and "map new backing storage into every vm that has a reference"
>>>>>> than to (on every single buffer of every single "exec" call) go "is
>>>>>> this
>>>>>> buffer mapped into this channel's vm? yes, ok; no, lets go map it".
>>>>>>
>>>>>> I'm not even sure how exactly I plan on storing this mapping
>>>>>> efficiently
>>>>>> yet.. Scanning the BO's linked list of VMs it's mapped into for "if
>>>>>> (this_vma == chan->vma)" doesn't exactly sound performant.
>>>>> As previously suggested, in the simplest case a bo could have a 'needs
>>>>> remap' flag
>>>>> that is set on gpu map teardown on move_notify(), and when this flag is
>>>>> detected in validate,
>>>>> go ahead and set up all needed maps and clear that flag.
>>>>>
>>>>> This is the simplest case and more or less equivalent to the current
>>>>> solution, except
>>>>> maps aren't set up unless needed by at least one channel and there is a
>>>>> clear way
>>>>> to handle errors when GPU maps are set up.
>>>> Yes, right.  That can be done, and gives exactly the same functionality
>>>> as I *already* achieve with move_notify() but apparently have to change
>>>> just because you've decided nouveau/radeon are both doing the
>>>> WrongThing(tm).
>>>>
>>>> Anyhow, I care more that 3.3 works than how it works.  So, whatever.  If
>>>> I must agree to this in order to get a regression fix in, then I guess I
>>>> really have no choice in the matter.
>>>>
>>>> Ben.
>>>>
>>>>> A simple and straightforward fix that leaves the path open (if so
>>>>> desired) to
>>>>> handle finer channel granularity.
>>>>>
>>>>> Or am I missing something?
>>>>>
>>> I went over the code and Ben fix is ok with me, i need to test it a
>>> bit on radeon side.
>>>
>>> For long term solution why not just move most of the
>>> ttm_bo_handle_move_mem to the driver. It would obsolete the
>>> move_notify callback. move notify callback was introduced because in
>>> some case the driver never knew directly that a bo moved. It's obvious
>>> that driver need to know every time. So instead of having an ha-doc
>>> function for that. Let just move the handle move stuff into the
>>> driver. Yes there will be some code duplication but it will avoid
>>> anykind of weird error path and driver will be able to perform what
>>> ever make sense.
>>
>> Yes, this is a solution that eliminates the need for TTM to support private
>> GPU map setup. Code duplication can largely be avoided if we
>> collect common code in a small utility function.
> Please this, its reduces one more place TTM suffers from midlayer effects.
>
> I'd much prefer the drivers be in control and call into help common
> code instead of the driver calling in and then lots of callbacks out.
>
> (and yes the drm suffers from this a lot as well).
>
> Dave.
OK,
let's work in this direction.

/Thomas

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] drm/ttm: fix two regressions since move_notify changes
  2012-01-25 18:21                             ` Thomas Hellstrom
@ 2012-01-25 18:51                               ` Jerome Glisse
  0 siblings, 0 replies; 23+ messages in thread
From: Jerome Glisse @ 2012-01-25 18:51 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: dri-devel

On Wed, Jan 25, 2012 at 1:21 PM, Thomas Hellstrom <thomas@shipmail.org> wrote:
> On 01/25/2012 07:12 PM, Dave Airlie wrote:
>>
>> On Wed, Jan 25, 2012 at 5:19 PM, Thomas Hellstrom<thomas@shipmail.org>
>>  wrote:
>>>
>>> On 01/25/2012 04:37 PM, Jerome Glisse wrote:
>>>>
>>>> On Wed, Jan 25, 2012 at 10:21 AM, Ben Skeggs<skeggsb@gmail.com>
>>>>  wrote:
>>>>>
>>>>> On Wed, 2012-01-25 at 15:33 +0100, Thomas Hellstrom wrote:
>>>>>>
>>>>>> On 01/25/2012 10:41 AM, Ben Skeggs wrote:
>>>>>>>
>>>>>>> My main concern is that we blindly and unnecessarily set up GPU
>>>>>>> bindings and
>>>>>>> end up with unnecessary code in TTM, and furthermore that we
>>>>>>> communicate
>>>>>>> that bad practice to future driver writers.
>>>>>>> This "unnecessary code" is like 5 lines of cleanup if something
>>>>>>> fails,
>>>>>>> hardly anything to be jumping up and down about :)
>>>>>>
>>>>>> It's just not TTM's business, unless the GPU maps are mappable by the
>>>>>> CPU as well.
>>>>>> Also, What if the mapping setup in move_notify() fails?
>>>>>
>>>>> It can't fail, and well, in nouveau's implementation it never will.
>>>>> It's simply a "fill the ptes for all the vmas currently associated with
>>>>> a bo".
>>>>>
>>>>> And well, it's about as much TTM's business as VRAM aperture allocation
>>>>> is..  I don't see the big deal, if you wan't to do it a different way
>>>>> in
>>>>> your driver, there's nothing stopping you.  It's a lot of bother for
>>>>> essentially zero effort in TTM..
>>>>>
>>>>>>>> Thomas, what do you suggest to move forward with this? Both of these
>>>>>>>> bugs are serious regressions that make nouveau unusable with the
>>>>>>>> current
>>>>>>>> 3.3-rc series. Ben.
>>>>>>>>
>>>>>>>> My number one choice would of course be to have the drivers set up
>>>>>>>> their
>>>>>>>> private GPU mappings after a
>>>>>>>> successful validate call, as I originally suggested and you agreed
>>>>>>>> to.
>>>>>>>>
>>>>>>>> If that's not possible (I realize it's late in the release series),
>>>>>>>> I'll
>>>>>>>> ack this patch if you and Jerome agree not to block
>>>>>>>> attempts to move in that direction for future kernel releases.
>>>>>>>
>>>>>>> I can't say I'm entirely happy with the plan honestly.  To me, it
>>>>>>> still
>>>>>>> seems more efficient to handle this when a move happens
>>>>>>> (comparatively
>>>>>>> rare) and "map new backing storage into every vm that has a
>>>>>>> reference"
>>>>>>> than to (on every single buffer of every single "exec" call) go "is
>>>>>>> this
>>>>>>> buffer mapped into this channel's vm? yes, ok; no, lets go map it".
>>>>>>>
>>>>>>> I'm not even sure how exactly I plan on storing this mapping
>>>>>>> efficiently
>>>>>>> yet.. Scanning the BO's linked list of VMs it's mapped into for "if
>>>>>>> (this_vma == chan->vma)" doesn't exactly sound performant.
>>>>>>
>>>>>> As previously suggested, in the simplest case a bo could have a 'needs
>>>>>> remap' flag
>>>>>> that is set on gpu map teardown on move_notify(), and when this flag
>>>>>> is
>>>>>> detected in validate,
>>>>>> go ahead and set up all needed maps and clear that flag.
>>>>>>
>>>>>> This is the simplest case and more or less equivalent to the current
>>>>>> solution, except
>>>>>> maps aren't set up unless needed by at least one channel and there is
>>>>>> a
>>>>>> clear way
>>>>>> to handle errors when GPU maps are set up.
>>>>>
>>>>> Yes, right.  That can be done, and gives exactly the same functionality
>>>>> as I *already* achieve with move_notify() but apparently have to change
>>>>> just because you've decided nouveau/radeon are both doing the
>>>>> WrongThing(tm).
>>>>>
>>>>> Anyhow, I care more that 3.3 works than how it works.  So, whatever.
>>>>>  If
>>>>> I must agree to this in order to get a regression fix in, then I guess
>>>>> I
>>>>> really have no choice in the matter.
>>>>>
>>>>> Ben.
>>>>>
>>>>>> A simple and straightforward fix that leaves the path open (if so
>>>>>> desired) to
>>>>>> handle finer channel granularity.
>>>>>>
>>>>>> Or am I missing something?
>>>>>>
>>>> I went over the code and Ben fix is ok with me, i need to test it a
>>>> bit on radeon side.
>>>>
>>>> For long term solution why not just move most of the
>>>> ttm_bo_handle_move_mem to the driver. It would obsolete the
>>>> move_notify callback. move notify callback was introduced because in
>>>> some case the driver never knew directly that a bo moved. It's obvious
>>>> that driver need to know every time. So instead of having an ha-doc
>>>> function for that. Let just move the handle move stuff into the
>>>> driver. Yes there will be some code duplication but it will avoid
>>>> anykind of weird error path and driver will be able to perform what
>>>> ever make sense.
>>>
>>>
>>> Yes, this is a solution that eliminates the need for TTM to support
>>> private
>>> GPU map setup. Code duplication can largely be avoided if we
>>> collect common code in a small utility function.
>>
>> Please this, its reduces one more place TTM suffers from midlayer effects.
>>
>> I'd much prefer the drivers be in control and call into help common
>> code instead of the driver calling in and then lots of callbacks out.
>>
>> (and yes the drm suffers from this a lot as well).
>>
>> Dave.
>
> OK,
> let's work in this direction.
>
> /Thomas

I should have time next week to craft a patch for this.

Cheers,
Jerome

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2012-01-25 18:51 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-01-15 21:31 [next] Null pointer dereference in nouveau_vm_map_sg Martin Nyhus
2012-01-16 20:30 ` Jerome Glisse
2012-01-16 23:57   ` Martin Nyhus
2012-01-22 18:33     ` Konrad Rzeszutek Wilk
2012-01-24 22:33       ` Jerome Glisse
2012-01-25  0:12         ` Martin Nyhus
2012-01-25 16:54           ` Jerome Glisse
2012-01-25  5:34         ` [PATCH] drm/ttm: fix two regressions since move_notify changes Ben Skeggs
2012-01-25  7:43           ` Thomas Hellstrom
2012-01-25  8:05             ` Ben Skeggs
2012-01-25  8:39               ` Thomas Hellstrom
2012-01-25  9:41                 ` Ben Skeggs
2012-01-25 14:33                   ` Thomas Hellstrom
2012-01-25 15:21                     ` Ben Skeggs
2012-01-25 15:37                       ` Jerome Glisse
2012-01-25 17:15                         ` Thomas Hellstrom
2012-01-25 17:19                         ` Thomas Hellstrom
2012-01-25 18:12                           ` Dave Airlie
2012-01-25 18:21                             ` Thomas Hellstrom
2012-01-25 18:51                               ` Jerome Glisse
2012-01-25  8:24           ` Dave Airlie
2012-01-25  8:38             ` Ben Skeggs
2012-01-25 17:32           ` Thomas Hellstrom

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.