All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 21/22] dma-buf/reservation: shouldn't kfree staged when slot available
@ 2018-02-26  5:34 Monk Liu
       [not found] ` <1519623300-17305-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Monk Liu @ 2018-02-26  5:34 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

issue:
kernel oops or vmc page fault occured during vk_example/vk_cts
test.

root cause:
previously reservation object would kfree the staged when slot
checked available during reserve_shared(), which is incorrect
becasue this way reservation_object->fence will be a wild pointer
referenced by following reservation_object_add_shared_fence,
and lead to lot of abnormal cases depends on luck.

fix:
don't call kfree on staged in reserve_shared
and there won't be memleak introduced because reservation's
finish routine would kfree both staged and fence

Change-Id: If7c01f1b4be3d3d8a81efa90216841f79ab1fc1c
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/dma-buf/reservation.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c
index 314eb10..bc01e0d 100644
--- a/drivers/dma-buf/reservation.c
+++ b/drivers/dma-buf/reservation.c
@@ -74,12 +74,9 @@ int reservation_object_reserve_shared(struct reservation_object *obj)
 	old = reservation_object_get_list(obj);
 
 	if (old && old->shared_max) {
-		if (old->shared_count < old->shared_max) {
-			/* perform an in-place update */
-			kfree(obj->staged);
-			obj->staged = NULL;
+		if (old->shared_count < old->shared_max)
 			return 0;
-		} else
+		else
 			max = old->shared_max * 2;
 	} else
 		max = 4;
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 22/22] drm/amdgpu: fix reservation obj shared count bug
       [not found] ` <1519623300-17305-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2018-02-26  5:35   ` Monk Liu
       [not found]     ` <1519623300-17305-2-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2018-02-26  9:42   ` [PATCH 21/22] dma-buf/reservation: shouldn't kfree staged when slot available Christian König
  1 sibling, 1 reply; 18+ messages in thread
From: Monk Liu @ 2018-02-26  5:35 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

should call reservation_object_reserve_shared before
amdgpu_bo_fence(), otherwise there are odds kernel
hit BUG in reversation.c

bug:
[12622.076435] ------------[ cut here ]------------
[12622.076438] kernel BUG at drivers/dma-buf/reservation.c:233!
[12622.078046] invalid opcode: 0000 [#1] SMP KASAN
[12622.078345] Modules linked in: amdgpu(E) chash(E) gpu_sched(E) ttm(E) drm_kms_helper(E) drm(E) i2c_algo_bit(E) fb_sys_fops(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) amdkfd(E) amd_iommu_v2(E) snd_hda_codec_generic(E) snd_hda_intel(E) snd_hda_codec(E) snd_hda_core(E) snd_hwdep(E) snd_pcm(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) pcbc(E) snd_seq_midi(E) snd_seq_midi_event(E) snd_rawmidi(E) snd_seq(E) aesni_intel(E) snd_seq_device(E) aes_x86_64(E) crypto_simd(E) glue_helper(E) cryptd(E) snd_timer(E) serio_raw(E) snd(E) soundcore(E) i2c_piix4(E) mac_hid(E) binfmt_misc(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) parport_pc(E) ppdev(E) lp(E) parport(E) autofs4(E) 8139too(E) psmouse(E) floppy(E) 8139cp(E) mii(E) pata_acpi(E)
[12622.082664] CPU: 2 PID: 6075 Comm: ShaderImageLoad Tainted: G            E   4.13.0-feb15-2018-guest-12 #8
[12622.083278] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[12622.083880] task: ffff8801314f8000 task.stack: ffff8801e6df0000
[12622.084267] RIP: 0010:reservation_object_add_shared_fence+0x2bc/0x2c0
[12622.084680] RSP: 0018:ffff8801e6df7ae8 EFLAGS: 00010246
[12622.085016] RAX: 0000000000000004 RBX: ffff8801f2542a80 RCX: ffff8801f2542b58
[12622.085471] RDX: ffff880044791428 RSI: ffff8801e2657d20 RDI: ffff880044791428
[12622.085995] RBP: ffff8801e6df7b28 R08: 0000000000000000 R09: 0000000000000000
[12622.086451] R10: 000000009fe2fe07 R11: 0000000000000002 R12: ffff880044791200
[12622.086905] R13: ffff8801e1da6850 R14: 0000000000000000 R15: ffff8801e9303000
[12622.087359] FS:  00007f70109ff700(0000) GS:ffff8801f7500000(0000) knlGS:0000000000000000
[12622.087874] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12622.088242] CR2: 00007f6fe1a49000 CR3: 00000001e8377000 CR4: 00000000001406e0
[12622.088698] Call Trace:
[12622.088934]  amdgpu_bo_fence+0x25/0x30 [amdgpu]
[12622.089271]  amdgpu_vm_update_directories+0x3d7/0x450 [amdgpu]
[12622.089721]  ? amdgpu_vm_free_mapping.isra.20+0x30/0x30 [amdgpu]
[12622.090184]  amdgpu_gem_va_ioctl+0x409/0x4f0 [amdgpu]
[12622.090569]  ? amdgpu_gem_metadata_ioctl+0x1b0/0x1b0 [amdgpu]
[12622.090959]  drm_ioctl_kernel+0x6a/0xb0 [drm]
[12622.091247]  ? kasan_check_write+0x14/0x20
[12622.091517]  drm_ioctl+0x2ea/0x3b0 [drm]
[12622.091805]  ? amdgpu_gem_metadata_ioctl+0x1b0/0x1b0 [amdgpu]
[12622.092212]  amdgpu_drm_ioctl+0x4b/0x80 [amdgpu]
[12622.092519]  do_vfs_ioctl+0x96/0x5b0
[12622.092754]  ? kvm_sched_clock_read+0x1e/0x30
[12622.093036]  ? kvm_sched_clock_read+0x1e/0x30
[12622.093322]  ? sched_clock+0x9/0x10
[12622.093550]  SyS_ioctl+0x79/0x90
[12622.093812]  ? vtime_user_exit+0x29/0x70
[12622.094108]  do_syscall_64+0x6e/0x160
[12622.094350]  entry_SYSCALL64_slow_path+0x25/0x25
[12622.094651] RIP: 0033:0x7f7012ebff07
[12622.094884] RSP: 002b:00007f70109fdae8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
[12622.095369] RAX: ffffffffffffffda RBX: 00007f6fe100fe30 RCX: 00007f7012ebff07
[12622.095827] RDX: 00007f70109fdbb0 RSI: 00000000c0286448 RDI: 0000000000000004
[12622.096283] RBP: 00007f70109fdb20 R08: 0000000332000000 R09: 000000000000000e
[12622.096739] R10: 0000000002e2e628 R11: 0000000000000206 R12: 0000000002ebb8f0
[12622.097195] R13: 00007f6fe100fe30 R14: 0000000002fe2bf0 R15: 0000000000000000
[12622.097676] Code: 2b e6 ff ff 41 8b 4f 10 48 8b 75 c8 48 8b 55 d0 e9 06 ff ff ff 49 8d 44 cf 10 45 31 ed 48 89 70 08 41 83 47 10 01 e9 12 ff ff ff <0f> 0b 90 90 0f 1f 44 00 00 55 31 c0 48 81 7f 08 00 df a9 ac 48
[12622.098919] RIP: reservation_object_add_shared_fence+0x2bc/0x2c0 RSP: ffff8801e6df7ae8
[12622.099527] ---[ end trace 8235addd8e22e444 ]---

Change-Id: I1de84f60c75a0d78b0cad244b8e2bff02aced9fa
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 0b237e0..e6f159f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -989,6 +989,11 @@ int amdgpu_vm_update_directories(struct amdgpu_device *adev,
 		amdgpu_ring_pad_ib(ring, params.ib);
 		amdgpu_sync_resv(adev, &job->sync, root->tbo.resv,
 				 AMDGPU_FENCE_OWNER_VM, false);
+
+		r = reservation_object_reserve_shared(root->tbo.resv);
+		if (r)
+			goto error;
+
 		WARN_ON(params.ib->length_dw > ndw);
 		r = amdgpu_job_submit(job, ring, &vm->entity,
 				      AMDGPU_FENCE_OWNER_VM, &fence);
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 21/22] dma-buf/reservation: shouldn't kfree staged when slot available
       [not found] ` <1519623300-17305-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2018-02-26  5:35   ` [PATCH 22/22] drm/amdgpu: fix reservation obj shared count bug Monk Liu
@ 2018-02-26  9:42   ` Christian König
       [not found]     ` <afa20c31-f721-599e-7e4c-eecbf7bda70f-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  1 sibling, 1 reply; 18+ messages in thread
From: Christian König @ 2018-02-26  9:42 UTC (permalink / raw)
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Chris Wilson

Well first of all you need to send that to dri-devel and even lkml, so 
that Chris and others can take a look as well.

Second that patch doesn't looks correct to me, obj->staged should never 
be related to obj->fence.

Regards,
Christian.

Am 26.02.2018 um 06:34 schrieb Monk Liu:
> issue:
> kernel oops or vmc page fault occured during vk_example/vk_cts
> test.
>
> root cause:
> previously reservation object would kfree the staged when slot
> checked available during reserve_shared(), which is incorrect
> becasue this way reservation_object->fence will be a wild pointer
> referenced by following reservation_object_add_shared_fence,
> and lead to lot of abnormal cases depends on luck.
>
> fix:
> don't call kfree on staged in reserve_shared
> and there won't be memleak introduced because reservation's
> finish routine would kfree both staged and fence
>
> Change-Id: If7c01f1b4be3d3d8a81efa90216841f79ab1fc1c
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/dma-buf/reservation.c | 7 ++-----
>   1 file changed, 2 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c
> index 314eb10..bc01e0d 100644
> --- a/drivers/dma-buf/reservation.c
> +++ b/drivers/dma-buf/reservation.c
> @@ -74,12 +74,9 @@ int reservation_object_reserve_shared(struct reservation_object *obj)
>   	old = reservation_object_get_list(obj);
>   
>   	if (old && old->shared_max) {
> -		if (old->shared_count < old->shared_max) {
> -			/* perform an in-place update */
> -			kfree(obj->staged);
> -			obj->staged = NULL;
> +		if (old->shared_count < old->shared_max)
>   			return 0;
> -		} else
> +		else
>   			max = old->shared_max * 2;
>   	} else
>   		max = 4;

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared count bug
       [not found]     ` <1519623300-17305-2-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2018-02-26  9:47       ` Christian König
       [not found]         ` <f3854579-e1dc-1226-9807-3e67a1a9bb7c-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Christian König @ 2018-02-26  9:47 UTC (permalink / raw)
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

That fix is incorrect.

amdgpu_vm_get_pd_bo() should already reserves a shared slot for the fence:
>         entry->tv.shared = true;

So you must run into this issue because of something else.

Regards,
Christian.

Am 26.02.2018 um 06:35 schrieb Monk Liu:
> should call reservation_object_reserve_shared before
> amdgpu_bo_fence(), otherwise there are odds kernel
> hit BUG in reversation.c
>
> bug:
> [12622.076435] ------------[ cut here ]------------
> [12622.076438] kernel BUG at drivers/dma-buf/reservation.c:233!
> [12622.078046] invalid opcode: 0000 [#1] SMP KASAN
> [12622.078345] Modules linked in: amdgpu(E) chash(E) gpu_sched(E) ttm(E) drm_kms_helper(E) drm(E) i2c_algo_bit(E) fb_sys_fops(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) amdkfd(E) amd_iommu_v2(E) snd_hda_codec_generic(E) snd_hda_intel(E) snd_hda_codec(E) snd_hda_core(E) snd_hwdep(E) snd_pcm(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) pcbc(E) snd_seq_midi(E) snd_seq_midi_event(E) snd_rawmidi(E) snd_seq(E) aesni_intel(E) snd_seq_device(E) aes_x86_64(E) crypto_simd(E) glue_helper(E) cryptd(E) snd_timer(E) serio_raw(E) snd(E) soundcore(E) i2c_piix4(E) mac_hid(E) binfmt_misc(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) parport_pc(E) ppdev(E) lp(E) parport(E) autofs4(E) 8139too(E) psmouse(E) floppy(E) 8139cp(E) mii(E) pata_acpi(E)
> [12622.082664] CPU: 2 PID: 6075 Comm: ShaderImageLoad Tainted: G            E   4.13.0-feb15-2018-guest-12 #8
> [12622.083278] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
> [12622.083880] task: ffff8801314f8000 task.stack: ffff8801e6df0000
> [12622.084267] RIP: 0010:reservation_object_add_shared_fence+0x2bc/0x2c0
> [12622.084680] RSP: 0018:ffff8801e6df7ae8 EFLAGS: 00010246
> [12622.085016] RAX: 0000000000000004 RBX: ffff8801f2542a80 RCX: ffff8801f2542b58
> [12622.085471] RDX: ffff880044791428 RSI: ffff8801e2657d20 RDI: ffff880044791428
> [12622.085995] RBP: ffff8801e6df7b28 R08: 0000000000000000 R09: 0000000000000000
> [12622.086451] R10: 000000009fe2fe07 R11: 0000000000000002 R12: ffff880044791200
> [12622.086905] R13: ffff8801e1da6850 R14: 0000000000000000 R15: ffff8801e9303000
> [12622.087359] FS:  00007f70109ff700(0000) GS:ffff8801f7500000(0000) knlGS:0000000000000000
> [12622.087874] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [12622.088242] CR2: 00007f6fe1a49000 CR3: 00000001e8377000 CR4: 00000000001406e0
> [12622.088698] Call Trace:
> [12622.088934]  amdgpu_bo_fence+0x25/0x30 [amdgpu]
> [12622.089271]  amdgpu_vm_update_directories+0x3d7/0x450 [amdgpu]
> [12622.089721]  ? amdgpu_vm_free_mapping.isra.20+0x30/0x30 [amdgpu]
> [12622.090184]  amdgpu_gem_va_ioctl+0x409/0x4f0 [amdgpu]
> [12622.090569]  ? amdgpu_gem_metadata_ioctl+0x1b0/0x1b0 [amdgpu]
> [12622.090959]  drm_ioctl_kernel+0x6a/0xb0 [drm]
> [12622.091247]  ? kasan_check_write+0x14/0x20
> [12622.091517]  drm_ioctl+0x2ea/0x3b0 [drm]
> [12622.091805]  ? amdgpu_gem_metadata_ioctl+0x1b0/0x1b0 [amdgpu]
> [12622.092212]  amdgpu_drm_ioctl+0x4b/0x80 [amdgpu]
> [12622.092519]  do_vfs_ioctl+0x96/0x5b0
> [12622.092754]  ? kvm_sched_clock_read+0x1e/0x30
> [12622.093036]  ? kvm_sched_clock_read+0x1e/0x30
> [12622.093322]  ? sched_clock+0x9/0x10
> [12622.093550]  SyS_ioctl+0x79/0x90
> [12622.093812]  ? vtime_user_exit+0x29/0x70
> [12622.094108]  do_syscall_64+0x6e/0x160
> [12622.094350]  entry_SYSCALL64_slow_path+0x25/0x25
> [12622.094651] RIP: 0033:0x7f7012ebff07
> [12622.094884] RSP: 002b:00007f70109fdae8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
> [12622.095369] RAX: ffffffffffffffda RBX: 00007f6fe100fe30 RCX: 00007f7012ebff07
> [12622.095827] RDX: 00007f70109fdbb0 RSI: 00000000c0286448 RDI: 0000000000000004
> [12622.096283] RBP: 00007f70109fdb20 R08: 0000000332000000 R09: 000000000000000e
> [12622.096739] R10: 0000000002e2e628 R11: 0000000000000206 R12: 0000000002ebb8f0
> [12622.097195] R13: 00007f6fe100fe30 R14: 0000000002fe2bf0 R15: 0000000000000000
> [12622.097676] Code: 2b e6 ff ff 41 8b 4f 10 48 8b 75 c8 48 8b 55 d0 e9 06 ff ff ff 49 8d 44 cf 10 45 31 ed 48 89 70 08 41 83 47 10 01 e9 12 ff ff ff <0f> 0b 90 90 0f 1f 44 00 00 55 31 c0 48 81 7f 08 00 df a9 ac 48
> [12622.098919] RIP: reservation_object_add_shared_fence+0x2bc/0x2c0 RSP: ffff8801e6df7ae8
> [12622.099527] ---[ end trace 8235addd8e22e444 ]---
>
> Change-Id: I1de84f60c75a0d78b0cad244b8e2bff02aced9fa
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 5 +++++
>   1 file changed, 5 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 0b237e0..e6f159f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -989,6 +989,11 @@ int amdgpu_vm_update_directories(struct amdgpu_device *adev,
>   		amdgpu_ring_pad_ib(ring, params.ib);
>   		amdgpu_sync_resv(adev, &job->sync, root->tbo.resv,
>   				 AMDGPU_FENCE_OWNER_VM, false);
> +
> +		r = reservation_object_reserve_shared(root->tbo.resv);
> +		if (r)
> +			goto error;
> +
>   		WARN_ON(params.ib->length_dw > ndw);
>   		r = amdgpu_job_submit(job, ring, &vm->entity,
>   				      AMDGPU_FENCE_OWNER_VM, &fence);

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH 21/22] dma-buf/reservation: shouldn't kfree staged when slot available
       [not found]     ` <afa20c31-f721-599e-7e4c-eecbf7bda70f-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2018-02-26 10:13       ` Liu, Monk
  2018-02-26 10:19       ` Chris Wilson
  1 sibling, 0 replies; 18+ messages in thread
From: Liu, Monk @ 2018-02-26 10:13 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Chris Wilson

Hi Christian

we call reservation_object_reserve_shared() at very first time, and later we call
reservation_object_add_shared_fence(), which will call reservation_object_add_shared_replace() since
fobj is not NULL by this time.

in reservation_object_get_list(), @old is NULL, @fobj is not NULL, so @fence will init the @fobj->shared[]
and @fobj->shared_count set to 1, and later the @obj->fence will be set to @fobj  (through RCU_INIT_POINTER)
and this is the part obj->fence related with obj->staged...

At next time calls on reservation_object_reserve_shared(), since current logic will kfree @obj->staged, which
let above @obj->fence points to a wild pointer, and lead to weird bugs,

in fact we pass stress test on vulkan CTS and vk_example after this patch applied ... 

/Monk


-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com] 
Sent: 2018年2月26日 17:42
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org; Chris Wilson <chris@chris-wilson.co.uk>
Subject: Re: [PATCH 21/22] dma-buf/reservation: shouldn't kfree staged when slot available

Well first of all you need to send that to dri-devel and even lkml, so that Chris and others can take a look as well.

Second that patch doesn't looks correct to me, obj->staged should never be related to obj->fence.

Regards,
Christian.

Am 26.02.2018 um 06:34 schrieb Monk Liu:
> issue:
> kernel oops or vmc page fault occured during vk_example/vk_cts test.
>
> root cause:
> previously reservation object would kfree the staged when slot checked 
> available during reserve_shared(), which is incorrect becasue this way 
> reservation_object->fence will be a wild pointer referenced by 
> following reservation_object_add_shared_fence,
> and lead to lot of abnormal cases depends on luck.
>
> fix:
> don't call kfree on staged in reserve_shared and there won't be 
> memleak introduced because reservation's finish routine would kfree 
> both staged and fence
>
> Change-Id: If7c01f1b4be3d3d8a81efa90216841f79ab1fc1c
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/dma-buf/reservation.c | 7 ++-----
>   1 file changed, 2 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/dma-buf/reservation.c 
> b/drivers/dma-buf/reservation.c index 314eb10..bc01e0d 100644
> --- a/drivers/dma-buf/reservation.c
> +++ b/drivers/dma-buf/reservation.c
> @@ -74,12 +74,9 @@ int reservation_object_reserve_shared(struct reservation_object *obj)
>   	old = reservation_object_get_list(obj);
>   
>   	if (old && old->shared_max) {
> -		if (old->shared_count < old->shared_max) {
> -			/* perform an in-place update */
> -			kfree(obj->staged);
> -			obj->staged = NULL;
> +		if (old->shared_count < old->shared_max)
>   			return 0;
> -		} else
> +		else
>   			max = old->shared_max * 2;
>   	} else
>   		max = 4;

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 21/22] dma-buf/reservation: shouldn't kfree staged when slot available
       [not found]     ` <afa20c31-f721-599e-7e4c-eecbf7bda70f-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2018-02-26 10:13       ` Liu, Monk
@ 2018-02-26 10:19       ` Chris Wilson
       [not found]         ` <151964034205.2350.2554778746105246165-M6iVdVfohj6unts5RBS2dVaTQe2KTcn/@public.gmane.org>
  1 sibling, 1 reply; 18+ messages in thread
From: Chris Wilson @ 2018-02-26 10:19 UTC (permalink / raw)
  To: christian.koenig-5C7GfCeVMHo, Christian König, Monk Liu,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Quoting Christian König (2018-02-26 09:42:15)
> Well first of all you need to send that to dri-devel and even lkml, so 
> that Chris and others can take a look as well.
> 
> Second that patch doesn't looks correct to me, obj->staged should never 
> be related to obj->fence.

Concurred. I would suspect missing rcu protection around a reader to
cause the implied kernel oops. If you are fixing a bug, please do
include as much details about the bug as you can (stacktraces,
observations etc) so that other people who may have hit something
similar can see if this patch might fix their problem, or if they
recognise the problem as caused by something else.
-Chris
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH 21/22] dma-buf/reservation: shouldn't kfree staged when slot available
       [not found]         ` <151964034205.2350.2554778746105246165-M6iVdVfohj6unts5RBS2dVaTQe2KTcn/@public.gmane.org>
@ 2018-02-26 10:25           ` Liu, Monk
  0 siblings, 0 replies; 18+ messages in thread
From: Liu, Monk @ 2018-02-26 10:25 UTC (permalink / raw)
  To: Chris Wilson, Koenig, Christian, Christian König,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Sorry, this bug is hit after I fixed another bug ()
And the only thing I observe is we hit vmc page fault or sometime kernel oops, very random, so I finally ends up with 
Thinking to wild pointer issue and found this place looks incorrect

I don't have trace or something right now, but those weird issues gone after this patch applied in stress CTS test 

-----Original Message-----
From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf Of Chris Wilson
Sent: 2018年2月26日 18:19
To: Koenig, Christian <Christian.Koenig@amd.com>; Christian König <ckoenig.leichtzumerken@gmail.com>; Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 21/22] dma-buf/reservation: shouldn't kfree staged when slot available

Quoting Christian König (2018-02-26 09:42:15)
> Well first of all you need to send that to dri-devel and even lkml, so 
> that Chris and others can take a look as well.
> 
> Second that patch doesn't looks correct to me, obj->staged should 
> never be related to obj->fence.

Concurred. I would suspect missing rcu protection around a reader to cause the implied kernel oops. If you are fixing a bug, please do include as much details about the bug as you can (stacktraces, observations etc) so that other people who may have hit something similar can see if this patch might fix their problem, or if they recognise the problem as caused by something else.
-Chris
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH 22/22] drm/amdgpu: fix reservation obj shared count bug
       [not found]         ` <f3854579-e1dc-1226-9807-3e67a1a9bb7c-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2018-02-26 10:32           ` Liu, Monk
  2018-02-26 10:36           ` Liu, Monk
  1 sibling, 0 replies; 18+ messages in thread
From: Liu, Monk @ 2018-02-26 10:32 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Please check again

Every time you call amdgpu_bo_fence should first call " reservation_object_reserve_shared", and this patch
Fixed a bug by following this rule

The thing is when shared comes to 4 (assume max is 4) before you call amdgpu_bo_fence, you will hit BUG() at reservation.c's reservation_object_add_shared_fence()
This is a clear fix !



-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com] 
Sent: 2018年2月26日 17:48
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared count bug

That fix is incorrect.

amdgpu_vm_get_pd_bo() should already reserves a shared slot for the fence:
>         entry->tv.shared = true;

So you must run into this issue because of something else.

Regards,
Christian.

Am 26.02.2018 um 06:35 schrieb Monk Liu:
> should call reservation_object_reserve_shared before 
> amdgpu_bo_fence(), otherwise there are odds kernel hit BUG in 
> reversation.c
>
> bug:
> [12622.076435] ------------[ cut here ]------------ [12622.076438] 
> kernel BUG at drivers/dma-buf/reservation.c:233!
> [12622.078046] invalid opcode: 0000 [#1] SMP KASAN [12622.078345] 
> Modules linked in: amdgpu(E) chash(E) gpu_sched(E) ttm(E) drm_kms_helper(E) drm(E) i2c_algo_bit(E) fb_sys_fops(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) amdkfd(E) amd_iommu_v2(E) snd_hda_codec_generic(E) snd_hda_intel(E) snd_hda_codec(E) snd_hda_core(E) snd_hwdep(E) snd_pcm(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) pcbc(E) snd_seq_midi(E) snd_seq_midi_event(E) snd_rawmidi(E) snd_seq(E) aesni_intel(E) snd_seq_device(E) aes_x86_64(E) crypto_simd(E) glue_helper(E) cryptd(E) snd_timer(E) serio_raw(E) snd(E) soundcore(E) i2c_piix4(E) mac_hid(E) binfmt_misc(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) parport_pc(E) ppdev(E) lp(E) parport(E) autofs4(E) 8139too(E) psmouse(E) floppy(E) 8139cp(E) mii(E) pata_acpi(E)
> [12622.082664] CPU: 2 PID: 6075 Comm: ShaderImageLoad Tainted: G            E   4.13.0-feb15-2018-guest-12 #8
> [12622.083278] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 [12622.083880] task: 
> ffff8801314f8000 task.stack: ffff8801e6df0000 [12622.084267] RIP: 
> 0010:reservation_object_add_shared_fence+0x2bc/0x2c0
> [12622.084680] RSP: 0018:ffff8801e6df7ae8 EFLAGS: 00010246 
> [12622.085016] RAX: 0000000000000004 RBX: ffff8801f2542a80 RCX: 
> ffff8801f2542b58 [12622.085471] RDX: ffff880044791428 RSI: 
> ffff8801e2657d20 RDI: ffff880044791428 [12622.085995] RBP: 
> ffff8801e6df7b28 R08: 0000000000000000 R09: 0000000000000000 
> [12622.086451] R10: 000000009fe2fe07 R11: 0000000000000002 R12: 
> ffff880044791200 [12622.086905] R13: ffff8801e1da6850 R14: 
> 0000000000000000 R15: ffff8801e9303000 [12622.087359] FS:  
> 00007f70109ff700(0000) GS:ffff8801f7500000(0000) 
> knlGS:0000000000000000 [12622.087874] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [12622.088242] CR2: 00007f6fe1a49000 CR3: 00000001e8377000 CR4: 00000000001406e0 [12622.088698] Call Trace:
> [12622.088934]  amdgpu_bo_fence+0x25/0x30 [amdgpu] [12622.089271]  
> amdgpu_vm_update_directories+0x3d7/0x450 [amdgpu] [12622.089721]  ? 
> amdgpu_vm_free_mapping.isra.20+0x30/0x30 [amdgpu] [12622.090184]  
> amdgpu_gem_va_ioctl+0x409/0x4f0 [amdgpu] [12622.090569]  ? 
> amdgpu_gem_metadata_ioctl+0x1b0/0x1b0 [amdgpu] [12622.090959]  
> drm_ioctl_kernel+0x6a/0xb0 [drm] [12622.091247]  ? 
> kasan_check_write+0x14/0x20 [12622.091517]  drm_ioctl+0x2ea/0x3b0 
> [drm] [12622.091805]  ? amdgpu_gem_metadata_ioctl+0x1b0/0x1b0 [amdgpu] 
> [12622.092212]  amdgpu_drm_ioctl+0x4b/0x80 [amdgpu] [12622.092519]  
> do_vfs_ioctl+0x96/0x5b0 [12622.092754]  ? 
> kvm_sched_clock_read+0x1e/0x30 [12622.093036]  ? 
> kvm_sched_clock_read+0x1e/0x30 [12622.093322]  ? sched_clock+0x9/0x10 
> [12622.093550]  SyS_ioctl+0x79/0x90 [12622.093812]  ? 
> vtime_user_exit+0x29/0x70 [12622.094108]  do_syscall_64+0x6e/0x160 
> [12622.094350]  entry_SYSCALL64_slow_path+0x25/0x25
> [12622.094651] RIP: 0033:0x7f7012ebff07 [12622.094884] RSP: 
> 002b:00007f70109fdae8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010 
> [12622.095369] RAX: ffffffffffffffda RBX: 00007f6fe100fe30 RCX: 
> 00007f7012ebff07 [12622.095827] RDX: 00007f70109fdbb0 RSI: 
> 00000000c0286448 RDI: 0000000000000004 [12622.096283] RBP: 
> 00007f70109fdb20 R08: 0000000332000000 R09: 000000000000000e 
> [12622.096739] R10: 0000000002e2e628 R11: 0000000000000206 R12: 
> 0000000002ebb8f0 [12622.097195] R13: 00007f6fe100fe30 R14: 
> 0000000002fe2bf0 R15: 0000000000000000 [12622.097676] Code: 2b e6 ff 
> ff 41 8b 4f 10 48 8b 75 c8 48 8b 55 d0 e9 06 ff ff ff 49 8d 44 cf 10 
> 45 31 ed 48 89 70 08 41 83 47 10 01 e9 12 ff ff ff <0f> 0b 90 90 0f 1f 
> 44 00 00 55 31 c0 48 81 7f 08 00 df a9 ac 48 [12622.098919] RIP: 
> reservation_object_add_shared_fence+0x2bc/0x2c0 RSP: ffff8801e6df7ae8 
> [12622.099527] ---[ end trace 8235addd8e22e444 ]---
>
> Change-Id: I1de84f60c75a0d78b0cad244b8e2bff02aced9fa
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 5 +++++
>   1 file changed, 5 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 0b237e0..e6f159f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -989,6 +989,11 @@ int amdgpu_vm_update_directories(struct amdgpu_device *adev,
>   		amdgpu_ring_pad_ib(ring, params.ib);
>   		amdgpu_sync_resv(adev, &job->sync, root->tbo.resv,
>   				 AMDGPU_FENCE_OWNER_VM, false);
> +
> +		r = reservation_object_reserve_shared(root->tbo.resv);
> +		if (r)
> +			goto error;
> +
>   		WARN_ON(params.ib->length_dw > ndw);
>   		r = amdgpu_job_submit(job, ring, &vm->entity,
>   				      AMDGPU_FENCE_OWNER_VM, &fence);

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH 22/22] drm/amdgpu: fix reservation obj shared count bug
       [not found]         ` <f3854579-e1dc-1226-9807-3e67a1a9bb7c-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2018-02-26 10:32           ` Liu, Monk
@ 2018-02-26 10:36           ` Liu, Monk
       [not found]             ` <SN1PR12MB0462FDC9905B318AF45FCECC84C10-z7L1TMIYDg7VaWpRXmIMygdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  1 sibling, 1 reply; 18+ messages in thread
From: Liu, Monk @ 2018-02-26 10:36 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


amdgpu_vm_get_pd_bo() should already reserves a shared slot for the fence:
>         entry->tv.shared = true;

That's not related with the bug I fixed! The root cause is there is some sequence that 
When code path comes to amdgpu_bo_fence of vm_update_directories, no one
The shared_cnt is by coincidence already the max value of resv obj, so the BUG() in
Reservation.c file will hit, 

We must call reserve_shared before amdgpu_bo_fence....



-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com] 
Sent: 2018年2月26日 17:48
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared count bug

That fix is incorrect.

amdgpu_vm_get_pd_bo() should already reserves a shared slot for the fence:
>         entry->tv.shared = true;

So you must run into this issue because of something else.

Regards,
Christian.

Am 26.02.2018 um 06:35 schrieb Monk Liu:
> should call reservation_object_reserve_shared before 
> amdgpu_bo_fence(), otherwise there are odds kernel hit BUG in 
> reversation.c
>
> bug:
> [12622.076435] ------------[ cut here ]------------ [12622.076438] 
> kernel BUG at drivers/dma-buf/reservation.c:233!
> [12622.078046] invalid opcode: 0000 [#1] SMP KASAN [12622.078345] 
> Modules linked in: amdgpu(E) chash(E) gpu_sched(E) ttm(E) drm_kms_helper(E) drm(E) i2c_algo_bit(E) fb_sys_fops(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) amdkfd(E) amd_iommu_v2(E) snd_hda_codec_generic(E) snd_hda_intel(E) snd_hda_codec(E) snd_hda_core(E) snd_hwdep(E) snd_pcm(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) pcbc(E) snd_seq_midi(E) snd_seq_midi_event(E) snd_rawmidi(E) snd_seq(E) aesni_intel(E) snd_seq_device(E) aes_x86_64(E) crypto_simd(E) glue_helper(E) cryptd(E) snd_timer(E) serio_raw(E) snd(E) soundcore(E) i2c_piix4(E) mac_hid(E) binfmt_misc(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) parport_pc(E) ppdev(E) lp(E) parport(E) autofs4(E) 8139too(E) psmouse(E) floppy(E) 8139cp(E) mii(E) pata_acpi(E)
> [12622.082664] CPU: 2 PID: 6075 Comm: ShaderImageLoad Tainted: G            E   4.13.0-feb15-2018-guest-12 #8
> [12622.083278] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 [12622.083880] task: 
> ffff8801314f8000 task.stack: ffff8801e6df0000 [12622.084267] RIP: 
> 0010:reservation_object_add_shared_fence+0x2bc/0x2c0
> [12622.084680] RSP: 0018:ffff8801e6df7ae8 EFLAGS: 00010246 
> [12622.085016] RAX: 0000000000000004 RBX: ffff8801f2542a80 RCX: 
> ffff8801f2542b58 [12622.085471] RDX: ffff880044791428 RSI: 
> ffff8801e2657d20 RDI: ffff880044791428 [12622.085995] RBP: 
> ffff8801e6df7b28 R08: 0000000000000000 R09: 0000000000000000 
> [12622.086451] R10: 000000009fe2fe07 R11: 0000000000000002 R12: 
> ffff880044791200 [12622.086905] R13: ffff8801e1da6850 R14: 
> 0000000000000000 R15: ffff8801e9303000 [12622.087359] FS:  
> 00007f70109ff700(0000) GS:ffff8801f7500000(0000) 
> knlGS:0000000000000000 [12622.087874] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [12622.088242] CR2: 00007f6fe1a49000 CR3: 00000001e8377000 CR4: 00000000001406e0 [12622.088698] Call Trace:
> [12622.088934]  amdgpu_bo_fence+0x25/0x30 [amdgpu] [12622.089271]  
> amdgpu_vm_update_directories+0x3d7/0x450 [amdgpu] [12622.089721]  ? 
> amdgpu_vm_free_mapping.isra.20+0x30/0x30 [amdgpu] [12622.090184]  
> amdgpu_gem_va_ioctl+0x409/0x4f0 [amdgpu] [12622.090569]  ? 
> amdgpu_gem_metadata_ioctl+0x1b0/0x1b0 [amdgpu] [12622.090959]  
> drm_ioctl_kernel+0x6a/0xb0 [drm] [12622.091247]  ? 
> kasan_check_write+0x14/0x20 [12622.091517]  drm_ioctl+0x2ea/0x3b0 
> [drm] [12622.091805]  ? amdgpu_gem_metadata_ioctl+0x1b0/0x1b0 [amdgpu] 
> [12622.092212]  amdgpu_drm_ioctl+0x4b/0x80 [amdgpu] [12622.092519]  
> do_vfs_ioctl+0x96/0x5b0 [12622.092754]  ? 
> kvm_sched_clock_read+0x1e/0x30 [12622.093036]  ? 
> kvm_sched_clock_read+0x1e/0x30 [12622.093322]  ? sched_clock+0x9/0x10 
> [12622.093550]  SyS_ioctl+0x79/0x90 [12622.093812]  ? 
> vtime_user_exit+0x29/0x70 [12622.094108]  do_syscall_64+0x6e/0x160 
> [12622.094350]  entry_SYSCALL64_slow_path+0x25/0x25
> [12622.094651] RIP: 0033:0x7f7012ebff07 [12622.094884] RSP: 
> 002b:00007f70109fdae8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010 
> [12622.095369] RAX: ffffffffffffffda RBX: 00007f6fe100fe30 RCX: 
> 00007f7012ebff07 [12622.095827] RDX: 00007f70109fdbb0 RSI: 
> 00000000c0286448 RDI: 0000000000000004 [12622.096283] RBP: 
> 00007f70109fdb20 R08: 0000000332000000 R09: 000000000000000e 
> [12622.096739] R10: 0000000002e2e628 R11: 0000000000000206 R12: 
> 0000000002ebb8f0 [12622.097195] R13: 00007f6fe100fe30 R14: 
> 0000000002fe2bf0 R15: 0000000000000000 [12622.097676] Code: 2b e6 ff 
> ff 41 8b 4f 10 48 8b 75 c8 48 8b 55 d0 e9 06 ff ff ff 49 8d 44 cf 10 
> 45 31 ed 48 89 70 08 41 83 47 10 01 e9 12 ff ff ff <0f> 0b 90 90 0f 1f 
> 44 00 00 55 31 c0 48 81 7f 08 00 df a9 ac 48 [12622.098919] RIP: 
> reservation_object_add_shared_fence+0x2bc/0x2c0 RSP: ffff8801e6df7ae8 
> [12622.099527] ---[ end trace 8235addd8e22e444 ]---
>
> Change-Id: I1de84f60c75a0d78b0cad244b8e2bff02aced9fa
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 5 +++++
>   1 file changed, 5 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 0b237e0..e6f159f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -989,6 +989,11 @@ int amdgpu_vm_update_directories(struct amdgpu_device *adev,
>   		amdgpu_ring_pad_ib(ring, params.ib);
>   		amdgpu_sync_resv(adev, &job->sync, root->tbo.resv,
>   				 AMDGPU_FENCE_OWNER_VM, false);
> +
> +		r = reservation_object_reserve_shared(root->tbo.resv);
> +		if (r)
> +			goto error;
> +
>   		WARN_ON(params.ib->length_dw > ndw);
>   		r = amdgpu_job_submit(job, ring, &vm->entity,
>   				      AMDGPU_FENCE_OWNER_VM, &fence);

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared count bug
       [not found]             ` <SN1PR12MB0462FDC9905B318AF45FCECC84C10-z7L1TMIYDg7VaWpRXmIMygdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2018-02-26 10:44               ` Christian König
       [not found]                 ` <7b2c1408-8ffc-4b8a-94e4-45534dff99f8-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Christian König @ 2018-02-26 10:44 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Chris Wilson

> We must call reserve_shared before amdgpu_bo_fence....
Actually that's not correct. See reservation_object_add_shared_fence() 
should replace the fence when it has the same context as a previously 
added fence.

So we call reserve_shared only once and then call 
reservation_object_add_shared_fence() multiple times, always with the 
same context.

Writing this, Chris actually had patches which may have broken that 
assumption, but as far as I know those aren't merged yet.

Regards,
Christian.

Am 26.02.2018 um 11:36 schrieb Liu, Monk:
> amdgpu_vm_get_pd_bo() should already reserves a shared slot for the fence:
>>          entry->tv.shared = true;
> That's not related with the bug I fixed! The root cause is there is some sequence that
> When code path comes to amdgpu_bo_fence of vm_update_directories, no one
> The shared_cnt is by coincidence already the max value of resv obj, so the BUG() in
> Reservation.c file will hit,
>
> We must call reserve_shared before amdgpu_bo_fence....
>
>
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
> Sent: 2018年2月26日 17:48
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared count bug
>
> That fix is incorrect.
>
> amdgpu_vm_get_pd_bo() should already reserves a shared slot for the fence:
>>          entry->tv.shared = true;
> So you must run into this issue because of something else.
>
> Regards,
> Christian.
>
> Am 26.02.2018 um 06:35 schrieb Monk Liu:
>> should call reservation_object_reserve_shared before
>> amdgpu_bo_fence(), otherwise there are odds kernel hit BUG in
>> reversation.c
>>
>> bug:
>> [12622.076435] ------------[ cut here ]------------ [12622.076438]
>> kernel BUG at drivers/dma-buf/reservation.c:233!
>> [12622.078046] invalid opcode: 0000 [#1] SMP KASAN [12622.078345]
>> Modules linked in: amdgpu(E) chash(E) gpu_sched(E) ttm(E) drm_kms_helper(E) drm(E) i2c_algo_bit(E) fb_sys_fops(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) amdkfd(E) amd_iommu_v2(E) snd_hda_codec_generic(E) snd_hda_intel(E) snd_hda_codec(E) snd_hda_core(E) snd_hwdep(E) snd_pcm(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) pcbc(E) snd_seq_midi(E) snd_seq_midi_event(E) snd_rawmidi(E) snd_seq(E) aesni_intel(E) snd_seq_device(E) aes_x86_64(E) crypto_simd(E) glue_helper(E) cryptd(E) snd_timer(E) serio_raw(E) snd(E) soundcore(E) i2c_piix4(E) mac_hid(E) binfmt_misc(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) parport_pc(E) ppdev(E) lp(E) parport(E) autofs4(E) 8139too(E) psmouse(E) floppy(E) 8139cp(E) mii(E) pata_acpi(E)
>> [12622.082664] CPU: 2 PID: 6075 Comm: ShaderImageLoad Tainted: G            E   4.13.0-feb15-2018-guest-12 #8
>> [12622.083278] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>> BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 [12622.083880] task:
>> ffff8801314f8000 task.stack: ffff8801e6df0000 [12622.084267] RIP:
>> 0010:reservation_object_add_shared_fence+0x2bc/0x2c0
>> [12622.084680] RSP: 0018:ffff8801e6df7ae8 EFLAGS: 00010246
>> [12622.085016] RAX: 0000000000000004 RBX: ffff8801f2542a80 RCX:
>> ffff8801f2542b58 [12622.085471] RDX: ffff880044791428 RSI:
>> ffff8801e2657d20 RDI: ffff880044791428 [12622.085995] RBP:
>> ffff8801e6df7b28 R08: 0000000000000000 R09: 0000000000000000
>> [12622.086451] R10: 000000009fe2fe07 R11: 0000000000000002 R12:
>> ffff880044791200 [12622.086905] R13: ffff8801e1da6850 R14:
>> 0000000000000000 R15: ffff8801e9303000 [12622.087359] FS:
>> 00007f70109ff700(0000) GS:ffff8801f7500000(0000)
>> knlGS:0000000000000000 [12622.087874] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [12622.088242] CR2: 00007f6fe1a49000 CR3: 00000001e8377000 CR4: 00000000001406e0 [12622.088698] Call Trace:
>> [12622.088934]  amdgpu_bo_fence+0x25/0x30 [amdgpu] [12622.089271]
>> amdgpu_vm_update_directories+0x3d7/0x450 [amdgpu] [12622.089721]  ?
>> amdgpu_vm_free_mapping.isra.20+0x30/0x30 [amdgpu] [12622.090184]
>> amdgpu_gem_va_ioctl+0x409/0x4f0 [amdgpu] [12622.090569]  ?
>> amdgpu_gem_metadata_ioctl+0x1b0/0x1b0 [amdgpu] [12622.090959]
>> drm_ioctl_kernel+0x6a/0xb0 [drm] [12622.091247]  ?
>> kasan_check_write+0x14/0x20 [12622.091517]  drm_ioctl+0x2ea/0x3b0
>> [drm] [12622.091805]  ? amdgpu_gem_metadata_ioctl+0x1b0/0x1b0 [amdgpu]
>> [12622.092212]  amdgpu_drm_ioctl+0x4b/0x80 [amdgpu] [12622.092519]
>> do_vfs_ioctl+0x96/0x5b0 [12622.092754]  ?
>> kvm_sched_clock_read+0x1e/0x30 [12622.093036]  ?
>> kvm_sched_clock_read+0x1e/0x30 [12622.093322]  ? sched_clock+0x9/0x10
>> [12622.093550]  SyS_ioctl+0x79/0x90 [12622.093812]  ?
>> vtime_user_exit+0x29/0x70 [12622.094108]  do_syscall_64+0x6e/0x160
>> [12622.094350]  entry_SYSCALL64_slow_path+0x25/0x25
>> [12622.094651] RIP: 0033:0x7f7012ebff07 [12622.094884] RSP:
>> 002b:00007f70109fdae8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
>> [12622.095369] RAX: ffffffffffffffda RBX: 00007f6fe100fe30 RCX:
>> 00007f7012ebff07 [12622.095827] RDX: 00007f70109fdbb0 RSI:
>> 00000000c0286448 RDI: 0000000000000004 [12622.096283] RBP:
>> 00007f70109fdb20 R08: 0000000332000000 R09: 000000000000000e
>> [12622.096739] R10: 0000000002e2e628 R11: 0000000000000206 R12:
>> 0000000002ebb8f0 [12622.097195] R13: 00007f6fe100fe30 R14:
>> 0000000002fe2bf0 R15: 0000000000000000 [12622.097676] Code: 2b e6 ff
>> ff 41 8b 4f 10 48 8b 75 c8 48 8b 55 d0 e9 06 ff ff ff 49 8d 44 cf 10
>> 45 31 ed 48 89 70 08 41 83 47 10 01 e9 12 ff ff ff <0f> 0b 90 90 0f 1f
>> 44 00 00 55 31 c0 48 81 7f 08 00 df a9 ac 48 [12622.098919] RIP:
>> reservation_object_add_shared_fence+0x2bc/0x2c0 RSP: ffff8801e6df7ae8
>> [12622.099527] ---[ end trace 8235addd8e22e444 ]---
>>
>> Change-Id: I1de84f60c75a0d78b0cad244b8e2bff02aced9fa
>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 5 +++++
>>    1 file changed, 5 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> index 0b237e0..e6f159f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> @@ -989,6 +989,11 @@ int amdgpu_vm_update_directories(struct amdgpu_device *adev,
>>    		amdgpu_ring_pad_ib(ring, params.ib);
>>    		amdgpu_sync_resv(adev, &job->sync, root->tbo.resv,
>>    				 AMDGPU_FENCE_OWNER_VM, false);
>> +
>> +		r = reservation_object_reserve_shared(root->tbo.resv);
>> +		if (r)
>> +			goto error;
>> +
>>    		WARN_ON(params.ib->length_dw > ndw);
>>    		r = amdgpu_job_submit(job, ring, &vm->entity,
>>    				      AMDGPU_FENCE_OWNER_VM, &fence);

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared count bug
       [not found]                 ` <7b2c1408-8ffc-4b8a-94e4-45534dff99f8-5C7GfCeVMHo@public.gmane.org>
@ 2018-02-26 10:54                   ` Chris Wilson
  2018-02-26 10:57                   ` Liu, Monk
  1 sibling, 0 replies; 18+ messages in thread
From: Chris Wilson @ 2018-02-26 10:54 UTC (permalink / raw)
  To: Christian König, Liu, Monk,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Quoting Christian König (2018-02-26 10:44:40)
> > We must call reserve_shared before amdgpu_bo_fence....
> Actually that's not correct. See reservation_object_add_shared_fence() 
> should replace the fence when it has the same context as a previously 
> added fence.
> 
> So we call reserve_shared only once and then call 
> reservation_object_add_shared_fence() multiple times, always with the 
> same context.
> 
> Writing this, Chris actually had patches which may have broken that 
> assumption, but as far as I know those aren't merged yet.

We haven't made progress on those (sorry, they are on the back burner).

Calling reserve_shared() once followed by multiple calls to
add_shared_fence() for the same context wouldn't break per se, just as
you noted has a chance to store the same context twice in the array.
So if you relied on the behaviour that you see the context only once in
the array later, yeah that broke. (My expectation is serving the context
twice would be a no-op, and it should boil down to should we pay the
cost of scanning for every add_shared_fence, or should that cost only be
paid later on by handling a no op. Imo, it's the second option that is
rarer but I have a pretty heavy bias from looking more often at the
will-it-scale tests.)
-Chris
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH 22/22] drm/amdgpu: fix reservation obj shared count bug
       [not found]                 ` <7b2c1408-8ffc-4b8a-94e4-45534dff99f8-5C7GfCeVMHo@public.gmane.org>
  2018-02-26 10:54                   ` Chris Wilson
@ 2018-02-26 10:57                   ` Liu, Monk
       [not found]                     ` <SN1PR12MB0462022ABFBC835ACE0D34E484C10-z7L1TMIYDg7VaWpRXmIMygdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  1 sibling, 1 reply; 18+ messages in thread
From: Liu, Monk @ 2018-02-26 10:57 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Chris Wilson

> So we call reserve_shared only once and then call reservation_object_add_shared_fence() multiple times, always with the same context.
>should replace the fence when it has the same context as a previously added fence.

What if all those context are not equal ? and accumulated finally to 4 ? 
You can try VK_EXAMPLE test yourself, see if you can hit that BUG() in reservation.c file, 
I hit it and it is fixed by this patch, I don't think there is other reason.

-----Original Message-----
From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf Of Christian K?nig
Sent: 2018年2月26日 18:45
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org; Chris Wilson <chris@chris-wilson.co.uk>
Subject: Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared count bug

> We must call reserve_shared before amdgpu_bo_fence....
Actually that's not correct. See reservation_object_add_shared_fence()
should replace the fence when it has the same context as a previously added fence.

So we call reserve_shared only once and then call
reservation_object_add_shared_fence() multiple times, always with the same context.

Writing this, Chris actually had patches which may have broken that assumption, but as far as I know those aren't merged yet.

Regards,
Christian.

Am 26.02.2018 um 11:36 schrieb Liu, Monk:
> amdgpu_vm_get_pd_bo() should already reserves a shared slot for the fence:
>>          entry->tv.shared = true;
> That's not related with the bug I fixed! The root cause is there is 
> some sequence that When code path comes to amdgpu_bo_fence of 
> vm_update_directories, no one The shared_cnt is by coincidence already 
> the max value of resv obj, so the BUG() in Reservation.c file will 
> hit,
>
> We must call reserve_shared before amdgpu_bo_fence....
>
>
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
> Sent: 2018年2月26日 17:48
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared 
> count bug
>
> That fix is incorrect.
>
> amdgpu_vm_get_pd_bo() should already reserves a shared slot for the fence:
>>          entry->tv.shared = true;
> So you must run into this issue because of something else.
>
> Regards,
> Christian.
>
> Am 26.02.2018 um 06:35 schrieb Monk Liu:
>> should call reservation_object_reserve_shared before 
>> amdgpu_bo_fence(), otherwise there are odds kernel hit BUG in 
>> reversation.c
>>
>> bug:
>> [12622.076435] ------------[ cut here ]------------ [12622.076438] 
>> kernel BUG at drivers/dma-buf/reservation.c:233!
>> [12622.078046] invalid opcode: 0000 [#1] SMP KASAN [12622.078345] 
>> Modules linked in: amdgpu(E) chash(E) gpu_sched(E) ttm(E) drm_kms_helper(E) drm(E) i2c_algo_bit(E) fb_sys_fops(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) amdkfd(E) amd_iommu_v2(E) snd_hda_codec_generic(E) snd_hda_intel(E) snd_hda_codec(E) snd_hda_core(E) snd_hwdep(E) snd_pcm(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) pcbc(E) snd_seq_midi(E) snd_seq_midi_event(E) snd_rawmidi(E) snd_seq(E) aesni_intel(E) snd_seq_device(E) aes_x86_64(E) crypto_simd(E) glue_helper(E) cryptd(E) snd_timer(E) serio_raw(E) snd(E) soundcore(E) i2c_piix4(E) mac_hid(E) binfmt_misc(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) parport_pc(E) ppdev(E) lp(E) parport(E) autofs4(E) 8139too(E) psmouse(E) floppy(E) 8139cp(E) mii(E) pata_acpi(E)
>> [12622.082664] CPU: 2 PID: 6075 Comm: ShaderImageLoad Tainted: G            E   4.13.0-feb15-2018-guest-12 #8
>> [12622.083278] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
>> BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 [12622.083880] task:
>> ffff8801314f8000 task.stack: ffff8801e6df0000 [12622.084267] RIP:
>> 0010:reservation_object_add_shared_fence+0x2bc/0x2c0
>> [12622.084680] RSP: 0018:ffff8801e6df7ae8 EFLAGS: 00010246 
>> [12622.085016] RAX: 0000000000000004 RBX: ffff8801f2542a80 RCX:
>> ffff8801f2542b58 [12622.085471] RDX: ffff880044791428 RSI:
>> ffff8801e2657d20 RDI: ffff880044791428 [12622.085995] RBP:
>> ffff8801e6df7b28 R08: 0000000000000000 R09: 0000000000000000 
>> [12622.086451] R10: 000000009fe2fe07 R11: 0000000000000002 R12:
>> ffff880044791200 [12622.086905] R13: ffff8801e1da6850 R14:
>> 0000000000000000 R15: ffff8801e9303000 [12622.087359] FS:
>> 00007f70109ff700(0000) GS:ffff8801f7500000(0000)
>> knlGS:0000000000000000 [12622.087874] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [12622.088242] CR2: 00007f6fe1a49000 CR3: 00000001e8377000 CR4: 00000000001406e0 [12622.088698] Call Trace:
>> [12622.088934]  amdgpu_bo_fence+0x25/0x30 [amdgpu] [12622.089271]
>> amdgpu_vm_update_directories+0x3d7/0x450 [amdgpu] [12622.089721]  ?
>> amdgpu_vm_free_mapping.isra.20+0x30/0x30 [amdgpu] [12622.090184]
>> amdgpu_gem_va_ioctl+0x409/0x4f0 [amdgpu] [12622.090569]  ?
>> amdgpu_gem_metadata_ioctl+0x1b0/0x1b0 [amdgpu] [12622.090959]
>> drm_ioctl_kernel+0x6a/0xb0 [drm] [12622.091247]  ?
>> kasan_check_write+0x14/0x20 [12622.091517]  drm_ioctl+0x2ea/0x3b0 
>> [drm] [12622.091805]  ? amdgpu_gem_metadata_ioctl+0x1b0/0x1b0 
>> [amdgpu] [12622.092212]  amdgpu_drm_ioctl+0x4b/0x80 [amdgpu] 
>> [12622.092519]
>> do_vfs_ioctl+0x96/0x5b0 [12622.092754]  ?
>> kvm_sched_clock_read+0x1e/0x30 [12622.093036]  ?
>> kvm_sched_clock_read+0x1e/0x30 [12622.093322]  ? sched_clock+0x9/0x10 
>> [12622.093550]  SyS_ioctl+0x79/0x90 [12622.093812]  ?
>> vtime_user_exit+0x29/0x70 [12622.094108]  do_syscall_64+0x6e/0x160 
>> [12622.094350]  entry_SYSCALL64_slow_path+0x25/0x25
>> [12622.094651] RIP: 0033:0x7f7012ebff07 [12622.094884] RSP:
>> 002b:00007f70109fdae8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010 
>> [12622.095369] RAX: ffffffffffffffda RBX: 00007f6fe100fe30 RCX:
>> 00007f7012ebff07 [12622.095827] RDX: 00007f70109fdbb0 RSI:
>> 00000000c0286448 RDI: 0000000000000004 [12622.096283] RBP:
>> 00007f70109fdb20 R08: 0000000332000000 R09: 000000000000000e 
>> [12622.096739] R10: 0000000002e2e628 R11: 0000000000000206 R12:
>> 0000000002ebb8f0 [12622.097195] R13: 00007f6fe100fe30 R14:
>> 0000000002fe2bf0 R15: 0000000000000000 [12622.097676] Code: 2b e6 ff 
>> ff 41 8b 4f 10 48 8b 75 c8 48 8b 55 d0 e9 06 ff ff ff 49 8d 44 cf 10
>> 45 31 ed 48 89 70 08 41 83 47 10 01 e9 12 ff ff ff <0f> 0b 90 90 0f 
>> 1f
>> 44 00 00 55 31 c0 48 81 7f 08 00 df a9 ac 48 [12622.098919] RIP:
>> reservation_object_add_shared_fence+0x2bc/0x2c0 RSP: ffff8801e6df7ae8 
>> [12622.099527] ---[ end trace 8235addd8e22e444 ]---
>>
>> Change-Id: I1de84f60c75a0d78b0cad244b8e2bff02aced9fa
>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 5 +++++
>>    1 file changed, 5 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> index 0b237e0..e6f159f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> @@ -989,6 +989,11 @@ int amdgpu_vm_update_directories(struct amdgpu_device *adev,
>>    		amdgpu_ring_pad_ib(ring, params.ib);
>>    		amdgpu_sync_resv(adev, &job->sync, root->tbo.resv,
>>    				 AMDGPU_FENCE_OWNER_VM, false);
>> +
>> +		r = reservation_object_reserve_shared(root->tbo.resv);
>> +		if (r)
>> +			goto error;
>> +
>>    		WARN_ON(params.ib->length_dw > ndw);
>>    		r = amdgpu_job_submit(job, ring, &vm->entity,
>>    				      AMDGPU_FENCE_OWNER_VM, &fence);

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared count bug
       [not found]                     ` <SN1PR12MB0462022ABFBC835ACE0D34E484C10-z7L1TMIYDg7VaWpRXmIMygdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2018-02-26 11:18                       ` Christian König
       [not found]                         ` <0a041a5c-9d7f-e307-7730-275ba7b775c9-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Christian König @ 2018-02-26 11:18 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Chris Wilson

> What if all those context are not equal ? and accumulated finally to 4 ?
Well then there is something else broken and I actually think that this 
is the root cause here.

> You can try VK_EXAMPLE test yourself, see if you can hit that BUG() in reservation.c file,
Could you generate a simpler unit test to trigger the problem?


Thanks,
Christian.

Am 26.02.2018 um 11:57 schrieb Liu, Monk:
>> So we call reserve_shared only once and then call reservation_object_add_shared_fence() multiple times, always with the same context.
>> should replace the fence when it has the same context as a previously added fence.
> What if all those context are not equal ? and accumulated finally to 4 ?
> You can try VK_EXAMPLE test yourself, see if you can hit that BUG() in reservation.c file,
> I hit it and it is fixed by this patch, I don't think there is other reason.
>
> -----Original Message-----
> From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf Of Christian K?nig
> Sent: 2018年2月26日 18:45
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org; Chris Wilson <chris@chris-wilson.co.uk>
> Subject: Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared count bug
>
>> We must call reserve_shared before amdgpu_bo_fence....
> Actually that's not correct. See reservation_object_add_shared_fence()
> should replace the fence when it has the same context as a previously added fence.
>
> So we call reserve_shared only once and then call
> reservation_object_add_shared_fence() multiple times, always with the same context.
>
> Writing this, Chris actually had patches which may have broken that assumption, but as far as I know those aren't merged yet.
>
> Regards,
> Christian.
>
> Am 26.02.2018 um 11:36 schrieb Liu, Monk:
>> amdgpu_vm_get_pd_bo() should already reserves a shared slot for the fence:
>>>           entry->tv.shared = true;
>> That's not related with the bug I fixed! The root cause is there is
>> some sequence that When code path comes to amdgpu_bo_fence of
>> vm_update_directories, no one The shared_cnt is by coincidence already
>> the max value of resv obj, so the BUG() in Reservation.c file will
>> hit,
>>
>> We must call reserve_shared before amdgpu_bo_fence....
>>
>>
>>
>> -----Original Message-----
>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>> Sent: 2018年2月26日 17:48
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared
>> count bug
>>
>> That fix is incorrect.
>>
>> amdgpu_vm_get_pd_bo() should already reserves a shared slot for the fence:
>>>           entry->tv.shared = true;
>> So you must run into this issue because of something else.
>>
>> Regards,
>> Christian.
>>
>> Am 26.02.2018 um 06:35 schrieb Monk Liu:
>>> should call reservation_object_reserve_shared before
>>> amdgpu_bo_fence(), otherwise there are odds kernel hit BUG in
>>> reversation.c
>>>
>>> bug:
>>> [12622.076435] ------------[ cut here ]------------ [12622.076438]
>>> kernel BUG at drivers/dma-buf/reservation.c:233!
>>> [12622.078046] invalid opcode: 0000 [#1] SMP KASAN [12622.078345]
>>> Modules linked in: amdgpu(E) chash(E) gpu_sched(E) ttm(E) drm_kms_helper(E) drm(E) i2c_algo_bit(E) fb_sys_fops(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) amdkfd(E) amd_iommu_v2(E) snd_hda_codec_generic(E) snd_hda_intel(E) snd_hda_codec(E) snd_hda_core(E) snd_hwdep(E) snd_pcm(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) pcbc(E) snd_seq_midi(E) snd_seq_midi_event(E) snd_rawmidi(E) snd_seq(E) aesni_intel(E) snd_seq_device(E) aes_x86_64(E) crypto_simd(E) glue_helper(E) cryptd(E) snd_timer(E) serio_raw(E) snd(E) soundcore(E) i2c_piix4(E) mac_hid(E) binfmt_misc(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) parport_pc(E) ppdev(E) lp(E) parport(E) autofs4(E) 8139too(E) psmouse(E) floppy(E) 8139cp(E) mii(E) pata_acpi(E)
>>> [12622.082664] CPU: 2 PID: 6075 Comm: ShaderImageLoad Tainted: G            E   4.13.0-feb15-2018-guest-12 #8
>>> [12622.083278] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>>> BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 [12622.083880] task:
>>> ffff8801314f8000 task.stack: ffff8801e6df0000 [12622.084267] RIP:
>>> 0010:reservation_object_add_shared_fence+0x2bc/0x2c0
>>> [12622.084680] RSP: 0018:ffff8801e6df7ae8 EFLAGS: 00010246
>>> [12622.085016] RAX: 0000000000000004 RBX: ffff8801f2542a80 RCX:
>>> ffff8801f2542b58 [12622.085471] RDX: ffff880044791428 RSI:
>>> ffff8801e2657d20 RDI: ffff880044791428 [12622.085995] RBP:
>>> ffff8801e6df7b28 R08: 0000000000000000 R09: 0000000000000000
>>> [12622.086451] R10: 000000009fe2fe07 R11: 0000000000000002 R12:
>>> ffff880044791200 [12622.086905] R13: ffff8801e1da6850 R14:
>>> 0000000000000000 R15: ffff8801e9303000 [12622.087359] FS:
>>> 00007f70109ff700(0000) GS:ffff8801f7500000(0000)
>>> knlGS:0000000000000000 [12622.087874] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [12622.088242] CR2: 00007f6fe1a49000 CR3: 00000001e8377000 CR4: 00000000001406e0 [12622.088698] Call Trace:
>>> [12622.088934]  amdgpu_bo_fence+0x25/0x30 [amdgpu] [12622.089271]
>>> amdgpu_vm_update_directories+0x3d7/0x450 [amdgpu] [12622.089721]  ?
>>> amdgpu_vm_free_mapping.isra.20+0x30/0x30 [amdgpu] [12622.090184]
>>> amdgpu_gem_va_ioctl+0x409/0x4f0 [amdgpu] [12622.090569]  ?
>>> amdgpu_gem_metadata_ioctl+0x1b0/0x1b0 [amdgpu] [12622.090959]
>>> drm_ioctl_kernel+0x6a/0xb0 [drm] [12622.091247]  ?
>>> kasan_check_write+0x14/0x20 [12622.091517]  drm_ioctl+0x2ea/0x3b0
>>> [drm] [12622.091805]  ? amdgpu_gem_metadata_ioctl+0x1b0/0x1b0
>>> [amdgpu] [12622.092212]  amdgpu_drm_ioctl+0x4b/0x80 [amdgpu]
>>> [12622.092519]
>>> do_vfs_ioctl+0x96/0x5b0 [12622.092754]  ?
>>> kvm_sched_clock_read+0x1e/0x30 [12622.093036]  ?
>>> kvm_sched_clock_read+0x1e/0x30 [12622.093322]  ? sched_clock+0x9/0x10
>>> [12622.093550]  SyS_ioctl+0x79/0x90 [12622.093812]  ?
>>> vtime_user_exit+0x29/0x70 [12622.094108]  do_syscall_64+0x6e/0x160
>>> [12622.094350]  entry_SYSCALL64_slow_path+0x25/0x25
>>> [12622.094651] RIP: 0033:0x7f7012ebff07 [12622.094884] RSP:
>>> 002b:00007f70109fdae8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
>>> [12622.095369] RAX: ffffffffffffffda RBX: 00007f6fe100fe30 RCX:
>>> 00007f7012ebff07 [12622.095827] RDX: 00007f70109fdbb0 RSI:
>>> 00000000c0286448 RDI: 0000000000000004 [12622.096283] RBP:
>>> 00007f70109fdb20 R08: 0000000332000000 R09: 000000000000000e
>>> [12622.096739] R10: 0000000002e2e628 R11: 0000000000000206 R12:
>>> 0000000002ebb8f0 [12622.097195] R13: 00007f6fe100fe30 R14:
>>> 0000000002fe2bf0 R15: 0000000000000000 [12622.097676] Code: 2b e6 ff
>>> ff 41 8b 4f 10 48 8b 75 c8 48 8b 55 d0 e9 06 ff ff ff 49 8d 44 cf 10
>>> 45 31 ed 48 89 70 08 41 83 47 10 01 e9 12 ff ff ff <0f> 0b 90 90 0f
>>> 1f
>>> 44 00 00 55 31 c0 48 81 7f 08 00 df a9 ac 48 [12622.098919] RIP:
>>> reservation_object_add_shared_fence+0x2bc/0x2c0 RSP: ffff8801e6df7ae8
>>> [12622.099527] ---[ end trace 8235addd8e22e444 ]---
>>>
>>> Change-Id: I1de84f60c75a0d78b0cad244b8e2bff02aced9fa
>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>> ---
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 5 +++++
>>>     1 file changed, 5 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> index 0b237e0..e6f159f 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> @@ -989,6 +989,11 @@ int amdgpu_vm_update_directories(struct amdgpu_device *adev,
>>>     		amdgpu_ring_pad_ib(ring, params.ib);
>>>     		amdgpu_sync_resv(adev, &job->sync, root->tbo.resv,
>>>     				 AMDGPU_FENCE_OWNER_VM, false);
>>> +
>>> +		r = reservation_object_reserve_shared(root->tbo.resv);
>>> +		if (r)
>>> +			goto error;
>>> +
>>>     		WARN_ON(params.ib->length_dw > ndw);
>>>     		r = amdgpu_job_submit(job, ring, &vm->entity,
>>>     				      AMDGPU_FENCE_OWNER_VM, &fence);
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH 22/22] drm/amdgpu: fix reservation obj shared count bug
       [not found]                         ` <0a041a5c-9d7f-e307-7730-275ba7b775c9-5C7GfCeVMHo@public.gmane.org>
@ 2018-02-27  3:36                           ` Liu, Monk
       [not found]                             ` <BLUPR12MB0449672DB665AD8DB6E081BD84C00-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Liu, Monk @ 2018-02-27  3:36 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Chris Wilson

> Well then there is something else broken and I actually think that this is the root cause here.

What if this VM submits four different jobs on four different rings, e.g. gfx/compute1/compute2/vce
So the fences for those jobs are from different context, and now you get a mapping on this VM
So the shared count is now 4

I ran VK_example and it always can hit this reservation BUG(), just repeat running it 4 loops and 
The BUG() hit, but I did it under SRIOV case so the CPU&GPU performance is different with bare-metal

Anyway do you think above scenario exist ?


-----Original Message-----
From: Koenig, Christian 
Sent: 2018年2月26日 19:19
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org; Chris Wilson <chris@chris-wilson.co.uk>
Subject: Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared count bug

> What if all those context are not equal ? and accumulated finally to 4 ?
Well then there is something else broken and I actually think that this is the root cause here.

> You can try VK_EXAMPLE test yourself, see if you can hit that BUG() in 
> reservation.c file,
Could you generate a simpler unit test to trigger the problem?


Thanks,
Christian.

Am 26.02.2018 um 11:57 schrieb Liu, Monk:
>> So we call reserve_shared only once and then call reservation_object_add_shared_fence() multiple times, always with the same context.
>> should replace the fence when it has the same context as a previously added fence.
> What if all those context are not equal ? and accumulated finally to 4 ?
> You can try VK_EXAMPLE test yourself, see if you can hit that BUG() in 
> reservation.c file, I hit it and it is fixed by this patch, I don't think there is other reason.
>
> -----Original Message-----
> From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf 
> Of Christian K?nig
> Sent: 2018年2月26日 18:45
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org; Chris 
> Wilson <chris@chris-wilson.co.uk>
> Subject: Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared 
> count bug
>
>> We must call reserve_shared before amdgpu_bo_fence....
> Actually that's not correct. See reservation_object_add_shared_fence()
> should replace the fence when it has the same context as a previously added fence.
>
> So we call reserve_shared only once and then call
> reservation_object_add_shared_fence() multiple times, always with the same context.
>
> Writing this, Chris actually had patches which may have broken that assumption, but as far as I know those aren't merged yet.
>
> Regards,
> Christian.
>
> Am 26.02.2018 um 11:36 schrieb Liu, Monk:
>> amdgpu_vm_get_pd_bo() should already reserves a shared slot for the fence:
>>>           entry->tv.shared = true;
>> That's not related with the bug I fixed! The root cause is there is 
>> some sequence that When code path comes to amdgpu_bo_fence of 
>> vm_update_directories, no one The shared_cnt is by coincidence 
>> already the max value of resv obj, so the BUG() in Reservation.c file 
>> will hit,
>>
>> We must call reserve_shared before amdgpu_bo_fence....
>>
>>
>>
>> -----Original Message-----
>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>> Sent: 2018年2月26日 17:48
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared 
>> count bug
>>
>> That fix is incorrect.
>>
>> amdgpu_vm_get_pd_bo() should already reserves a shared slot for the fence:
>>>           entry->tv.shared = true;
>> So you must run into this issue because of something else.
>>
>> Regards,
>> Christian.
>>
>> Am 26.02.2018 um 06:35 schrieb Monk Liu:
>>> should call reservation_object_reserve_shared before 
>>> amdgpu_bo_fence(), otherwise there are odds kernel hit BUG in 
>>> reversation.c
>>>
>>> bug:
>>> [12622.076435] ------------[ cut here ]------------ [12622.076438] 
>>> kernel BUG at drivers/dma-buf/reservation.c:233!
>>> [12622.078046] invalid opcode: 0000 [#1] SMP KASAN [12622.078345] 
>>> Modules linked in: amdgpu(E) chash(E) gpu_sched(E) ttm(E) drm_kms_helper(E) drm(E) i2c_algo_bit(E) fb_sys_fops(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) amdkfd(E) amd_iommu_v2(E) snd_hda_codec_generic(E) snd_hda_intel(E) snd_hda_codec(E) snd_hda_core(E) snd_hwdep(E) snd_pcm(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) pcbc(E) snd_seq_midi(E) snd_seq_midi_event(E) snd_rawmidi(E) snd_seq(E) aesni_intel(E) snd_seq_device(E) aes_x86_64(E) crypto_simd(E) glue_helper(E) cryptd(E) snd_timer(E) serio_raw(E) snd(E) soundcore(E) i2c_piix4(E) mac_hid(E) binfmt_misc(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) parport_pc(E) ppdev(E) lp(E) parport(E) autofs4(E) 8139too(E) psmouse(E) floppy(E) 8139cp(E) mii(E) pata_acpi(E)
>>> [12622.082664] CPU: 2 PID: 6075 Comm: ShaderImageLoad Tainted: G            E   4.13.0-feb15-2018-guest-12 #8
>>> [12622.083278] Hardware name: QEMU Standard PC (i440FX + PIIX, 
>>> 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 [12622.083880] task:
>>> ffff8801314f8000 task.stack: ffff8801e6df0000 [12622.084267] RIP:
>>> 0010:reservation_object_add_shared_fence+0x2bc/0x2c0
>>> [12622.084680] RSP: 0018:ffff8801e6df7ae8 EFLAGS: 00010246 
>>> [12622.085016] RAX: 0000000000000004 RBX: ffff8801f2542a80 RCX:
>>> ffff8801f2542b58 [12622.085471] RDX: ffff880044791428 RSI:
>>> ffff8801e2657d20 RDI: ffff880044791428 [12622.085995] RBP:
>>> ffff8801e6df7b28 R08: 0000000000000000 R09: 0000000000000000 
>>> [12622.086451] R10: 000000009fe2fe07 R11: 0000000000000002 R12:
>>> ffff880044791200 [12622.086905] R13: ffff8801e1da6850 R14:
>>> 0000000000000000 R15: ffff8801e9303000 [12622.087359] FS:
>>> 00007f70109ff700(0000) GS:ffff8801f7500000(0000)
>>> knlGS:0000000000000000 [12622.087874] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [12622.088242] CR2: 00007f6fe1a49000 CR3: 00000001e8377000 CR4: 00000000001406e0 [12622.088698] Call Trace:
>>> [12622.088934]  amdgpu_bo_fence+0x25/0x30 [amdgpu] [12622.089271]
>>> amdgpu_vm_update_directories+0x3d7/0x450 [amdgpu] [12622.089721]  ?
>>> amdgpu_vm_free_mapping.isra.20+0x30/0x30 [amdgpu] [12622.090184]
>>> amdgpu_gem_va_ioctl+0x409/0x4f0 [amdgpu] [12622.090569]  ?
>>> amdgpu_gem_metadata_ioctl+0x1b0/0x1b0 [amdgpu] [12622.090959]
>>> drm_ioctl_kernel+0x6a/0xb0 [drm] [12622.091247]  ?
>>> kasan_check_write+0x14/0x20 [12622.091517]  drm_ioctl+0x2ea/0x3b0 
>>> [drm] [12622.091805]  ? amdgpu_gem_metadata_ioctl+0x1b0/0x1b0
>>> [amdgpu] [12622.092212]  amdgpu_drm_ioctl+0x4b/0x80 [amdgpu] 
>>> [12622.092519]
>>> do_vfs_ioctl+0x96/0x5b0 [12622.092754]  ?
>>> kvm_sched_clock_read+0x1e/0x30 [12622.093036]  ?
>>> kvm_sched_clock_read+0x1e/0x30 [12622.093322]  ? 
>>> sched_clock+0x9/0x10 [12622.093550]  SyS_ioctl+0x79/0x90 [12622.093812]  ?
>>> vtime_user_exit+0x29/0x70 [12622.094108]  do_syscall_64+0x6e/0x160 
>>> [12622.094350]  entry_SYSCALL64_slow_path+0x25/0x25
>>> [12622.094651] RIP: 0033:0x7f7012ebff07 [12622.094884] RSP:
>>> 002b:00007f70109fdae8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010 
>>> [12622.095369] RAX: ffffffffffffffda RBX: 00007f6fe100fe30 RCX:
>>> 00007f7012ebff07 [12622.095827] RDX: 00007f70109fdbb0 RSI:
>>> 00000000c0286448 RDI: 0000000000000004 [12622.096283] RBP:
>>> 00007f70109fdb20 R08: 0000000332000000 R09: 000000000000000e 
>>> [12622.096739] R10: 0000000002e2e628 R11: 0000000000000206 R12:
>>> 0000000002ebb8f0 [12622.097195] R13: 00007f6fe100fe30 R14:
>>> 0000000002fe2bf0 R15: 0000000000000000 [12622.097676] Code: 2b e6 ff 
>>> ff 41 8b 4f 10 48 8b 75 c8 48 8b 55 d0 e9 06 ff ff ff 49 8d 44 cf 10
>>> 45 31 ed 48 89 70 08 41 83 47 10 01 e9 12 ff ff ff <0f> 0b 90 90 0f 
>>> 1f
>>> 44 00 00 55 31 c0 48 81 7f 08 00 df a9 ac 48 [12622.098919] RIP:
>>> reservation_object_add_shared_fence+0x2bc/0x2c0 RSP: 
>>> ffff8801e6df7ae8 [12622.099527] ---[ end trace 8235addd8e22e444 ]---
>>>
>>> Change-Id: I1de84f60c75a0d78b0cad244b8e2bff02aced9fa
>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>> ---
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 5 +++++
>>>     1 file changed, 5 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> index 0b237e0..e6f159f 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> @@ -989,6 +989,11 @@ int amdgpu_vm_update_directories(struct amdgpu_device *adev,
>>>     		amdgpu_ring_pad_ib(ring, params.ib);
>>>     		amdgpu_sync_resv(adev, &job->sync, root->tbo.resv,
>>>     				 AMDGPU_FENCE_OWNER_VM, false);
>>> +
>>> +		r = reservation_object_reserve_shared(root->tbo.resv);
>>> +		if (r)
>>> +			goto error;
>>> +
>>>     		WARN_ON(params.ib->length_dw > ndw);
>>>     		r = amdgpu_job_submit(job, ring, &vm->entity,
>>>     				      AMDGPU_FENCE_OWNER_VM, &fence);
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared count bug
       [not found]                             ` <BLUPR12MB0449672DB665AD8DB6E081BD84C00-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2018-02-27  8:34                               ` Christian König
       [not found]                                 ` <b0d669c0-e6e8-5907-a33c-3b94890fd091-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Christian König @ 2018-02-27  8:34 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Chris Wilson

Am 27.02.2018 um 04:36 schrieb Liu, Monk:
>> Well then there is something else broken and I actually think that this is the root cause here.
> What if this VM submits four different jobs on four different rings, e.g. gfx/compute1/compute2/vce
> So the fences for those jobs are from different context, and now you get a mapping on this VM
> So the shared count is now 4

Yeah, but each VM submission reserves a shared slot before adding the 
fence. So even if you have 4 submissions we still have a guaranteed free 
slot.

> I ran VK_example and it always can hit this reservation BUG(), just repeat running it 4 loops and
> The BUG() hit, but I did it under SRIOV case so the CPU&GPU performance is different with bare-metal
>
> Anyway do you think above scenario exist ?

Well you somehow run into this problem, I just don't understand how.

It shouldn't matter if you run on bare-metal or SRIOV, the VM handling 
is the same as far as I know.

Going to give it a try today to create some unit tests exercising the 
problem.

Christian.

>
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: 2018年2月26日 19:19
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org; Chris Wilson <chris@chris-wilson.co.uk>
> Subject: Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared count bug
>
>> What if all those context are not equal ? and accumulated finally to 4 ?
> Well then there is something else broken and I actually think that this is the root cause here.
>
>> You can try VK_EXAMPLE test yourself, see if you can hit that BUG() in
>> reservation.c file,
> Could you generate a simpler unit test to trigger the problem?
>
>
> Thanks,
> Christian.
>
> Am 26.02.2018 um 11:57 schrieb Liu, Monk:
>>> So we call reserve_shared only once and then call reservation_object_add_shared_fence() multiple times, always with the same context.
>>> should replace the fence when it has the same context as a previously added fence.
>> What if all those context are not equal ? and accumulated finally to 4 ?
>> You can try VK_EXAMPLE test yourself, see if you can hit that BUG() in
>> reservation.c file, I hit it and it is fixed by this patch, I don't think there is other reason.
>>
>> -----Original Message-----
>> From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf
>> Of Christian K?nig
>> Sent: 2018年2月26日 18:45
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org; Chris
>> Wilson <chris@chris-wilson.co.uk>
>> Subject: Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared
>> count bug
>>
>>> We must call reserve_shared before amdgpu_bo_fence....
>> Actually that's not correct. See reservation_object_add_shared_fence()
>> should replace the fence when it has the same context as a previously added fence.
>>
>> So we call reserve_shared only once and then call
>> reservation_object_add_shared_fence() multiple times, always with the same context.
>>
>> Writing this, Chris actually had patches which may have broken that assumption, but as far as I know those aren't merged yet.
>>
>> Regards,
>> Christian.
>>
>> Am 26.02.2018 um 11:36 schrieb Liu, Monk:
>>> amdgpu_vm_get_pd_bo() should already reserves a shared slot for the fence:
>>>>            entry->tv.shared = true;
>>> That's not related with the bug I fixed! The root cause is there is
>>> some sequence that When code path comes to amdgpu_bo_fence of
>>> vm_update_directories, no one The shared_cnt is by coincidence
>>> already the max value of resv obj, so the BUG() in Reservation.c file
>>> will hit,
>>>
>>> We must call reserve_shared before amdgpu_bo_fence....
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>>> Sent: 2018年2月26日 17:48
>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>> Subject: Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared
>>> count bug
>>>
>>> That fix is incorrect.
>>>
>>> amdgpu_vm_get_pd_bo() should already reserves a shared slot for the fence:
>>>>            entry->tv.shared = true;
>>> So you must run into this issue because of something else.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 26.02.2018 um 06:35 schrieb Monk Liu:
>>>> should call reservation_object_reserve_shared before
>>>> amdgpu_bo_fence(), otherwise there are odds kernel hit BUG in
>>>> reversation.c
>>>>
>>>> bug:
>>>> [12622.076435] ------------[ cut here ]------------ [12622.076438]
>>>> kernel BUG at drivers/dma-buf/reservation.c:233!
>>>> [12622.078046] invalid opcode: 0000 [#1] SMP KASAN [12622.078345]
>>>> Modules linked in: amdgpu(E) chash(E) gpu_sched(E) ttm(E) drm_kms_helper(E) drm(E) i2c_algo_bit(E) fb_sys_fops(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) amdkfd(E) amd_iommu_v2(E) snd_hda_codec_generic(E) snd_hda_intel(E) snd_hda_codec(E) snd_hda_core(E) snd_hwdep(E) snd_pcm(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) pcbc(E) snd_seq_midi(E) snd_seq_midi_event(E) snd_rawmidi(E) snd_seq(E) aesni_intel(E) snd_seq_device(E) aes_x86_64(E) crypto_simd(E) glue_helper(E) cryptd(E) snd_timer(E) serio_raw(E) snd(E) soundcore(E) i2c_piix4(E) mac_hid(E) binfmt_misc(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) parport_pc(E) ppdev(E) lp(E) parport(E) autofs4(E) 8139too(E) psmouse(E) floppy(E) 8139cp(E) mii(E) pata_acpi(E)
>>>> [12622.082664] CPU: 2 PID: 6075 Comm: ShaderImageLoad Tainted: G            E   4.13.0-feb15-2018-guest-12 #8
>>>> [12622.083278] Hardware name: QEMU Standard PC (i440FX + PIIX,
>>>> 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 [12622.083880] task:
>>>> ffff8801314f8000 task.stack: ffff8801e6df0000 [12622.084267] RIP:
>>>> 0010:reservation_object_add_shared_fence+0x2bc/0x2c0
>>>> [12622.084680] RSP: 0018:ffff8801e6df7ae8 EFLAGS: 00010246
>>>> [12622.085016] RAX: 0000000000000004 RBX: ffff8801f2542a80 RCX:
>>>> ffff8801f2542b58 [12622.085471] RDX: ffff880044791428 RSI:
>>>> ffff8801e2657d20 RDI: ffff880044791428 [12622.085995] RBP:
>>>> ffff8801e6df7b28 R08: 0000000000000000 R09: 0000000000000000
>>>> [12622.086451] R10: 000000009fe2fe07 R11: 0000000000000002 R12:
>>>> ffff880044791200 [12622.086905] R13: ffff8801e1da6850 R14:
>>>> 0000000000000000 R15: ffff8801e9303000 [12622.087359] FS:
>>>> 00007f70109ff700(0000) GS:ffff8801f7500000(0000)
>>>> knlGS:0000000000000000 [12622.087874] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [12622.088242] CR2: 00007f6fe1a49000 CR3: 00000001e8377000 CR4: 00000000001406e0 [12622.088698] Call Trace:
>>>> [12622.088934]  amdgpu_bo_fence+0x25/0x30 [amdgpu] [12622.089271]
>>>> amdgpu_vm_update_directories+0x3d7/0x450 [amdgpu] [12622.089721]  ?
>>>> amdgpu_vm_free_mapping.isra.20+0x30/0x30 [amdgpu] [12622.090184]
>>>> amdgpu_gem_va_ioctl+0x409/0x4f0 [amdgpu] [12622.090569]  ?
>>>> amdgpu_gem_metadata_ioctl+0x1b0/0x1b0 [amdgpu] [12622.090959]
>>>> drm_ioctl_kernel+0x6a/0xb0 [drm] [12622.091247]  ?
>>>> kasan_check_write+0x14/0x20 [12622.091517]  drm_ioctl+0x2ea/0x3b0
>>>> [drm] [12622.091805]  ? amdgpu_gem_metadata_ioctl+0x1b0/0x1b0
>>>> [amdgpu] [12622.092212]  amdgpu_drm_ioctl+0x4b/0x80 [amdgpu]
>>>> [12622.092519]
>>>> do_vfs_ioctl+0x96/0x5b0 [12622.092754]  ?
>>>> kvm_sched_clock_read+0x1e/0x30 [12622.093036]  ?
>>>> kvm_sched_clock_read+0x1e/0x30 [12622.093322]  ?
>>>> sched_clock+0x9/0x10 [12622.093550]  SyS_ioctl+0x79/0x90 [12622.093812]  ?
>>>> vtime_user_exit+0x29/0x70 [12622.094108]  do_syscall_64+0x6e/0x160
>>>> [12622.094350]  entry_SYSCALL64_slow_path+0x25/0x25
>>>> [12622.094651] RIP: 0033:0x7f7012ebff07 [12622.094884] RSP:
>>>> 002b:00007f70109fdae8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
>>>> [12622.095369] RAX: ffffffffffffffda RBX: 00007f6fe100fe30 RCX:
>>>> 00007f7012ebff07 [12622.095827] RDX: 00007f70109fdbb0 RSI:
>>>> 00000000c0286448 RDI: 0000000000000004 [12622.096283] RBP:
>>>> 00007f70109fdb20 R08: 0000000332000000 R09: 000000000000000e
>>>> [12622.096739] R10: 0000000002e2e628 R11: 0000000000000206 R12:
>>>> 0000000002ebb8f0 [12622.097195] R13: 00007f6fe100fe30 R14:
>>>> 0000000002fe2bf0 R15: 0000000000000000 [12622.097676] Code: 2b e6 ff
>>>> ff 41 8b 4f 10 48 8b 75 c8 48 8b 55 d0 e9 06 ff ff ff 49 8d 44 cf 10
>>>> 45 31 ed 48 89 70 08 41 83 47 10 01 e9 12 ff ff ff <0f> 0b 90 90 0f
>>>> 1f
>>>> 44 00 00 55 31 c0 48 81 7f 08 00 df a9 ac 48 [12622.098919] RIP:
>>>> reservation_object_add_shared_fence+0x2bc/0x2c0 RSP:
>>>> ffff8801e6df7ae8 [12622.099527] ---[ end trace 8235addd8e22e444 ]---
>>>>
>>>> Change-Id: I1de84f60c75a0d78b0cad244b8e2bff02aced9fa
>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>> ---
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 5 +++++
>>>>      1 file changed, 5 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>> index 0b237e0..e6f159f 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>> @@ -989,6 +989,11 @@ int amdgpu_vm_update_directories(struct amdgpu_device *adev,
>>>>      		amdgpu_ring_pad_ib(ring, params.ib);
>>>>      		amdgpu_sync_resv(adev, &job->sync, root->tbo.resv,
>>>>      				 AMDGPU_FENCE_OWNER_VM, false);
>>>> +
>>>> +		r = reservation_object_reserve_shared(root->tbo.resv);
>>>> +		if (r)
>>>> +			goto error;
>>>> +
>>>>      		WARN_ON(params.ib->length_dw > ndw);
>>>>      		r = amdgpu_job_submit(job, ring, &vm->entity,
>>>>      				      AMDGPU_FENCE_OWNER_VM, &fence);
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH 22/22] drm/amdgpu: fix reservation obj shared count bug
       [not found]                                 ` <b0d669c0-e6e8-5907-a33c-3b94890fd091-5C7GfCeVMHo@public.gmane.org>
@ 2018-02-27  9:17                                   ` Liu, Monk
       [not found]                                     ` <BLUPR12MB0449422688494CE8CE97647884C00-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Liu, Monk @ 2018-02-27  9:17 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Chris Wilson

I know in theoretically this issue should be equal for sriov or not
But since the gpu performance under SRIOV is only one quarter compared with bare-metal (assume 4VF)
And the CPU's performance is also limited, so there are chance that the fence is signaled not as quick as bare-metal
And lead to the scenario that four different context fence are living in resv,

If you do it on bare-metal maybe you need some tricks, at least on my sriov platform I can duplicate this BUG()
With 4 repeating loop on VK_EXAMPLE every time 

/Monk

-----Original Message-----
From: Koenig, Christian 
Sent: 2018年2月27日 16:34
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org; Chris Wilson <chris@chris-wilson.co.uk>
Subject: Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared count bug

Am 27.02.2018 um 04:36 schrieb Liu, Monk:
>> Well then there is something else broken and I actually think that this is the root cause here.
> What if this VM submits four different jobs on four different rings, 
> e.g. gfx/compute1/compute2/vce So the fences for those jobs are from 
> different context, and now you get a mapping on this VM So the shared 
> count is now 4

Yeah, but each VM submission reserves a shared slot before adding the fence. So even if you have 4 submissions we still have a guaranteed free slot.

> I ran VK_example and it always can hit this reservation BUG(), just 
> repeat running it 4 loops and The BUG() hit, but I did it under SRIOV 
> case so the CPU&GPU performance is different with bare-metal
>
> Anyway do you think above scenario exist ?

Well you somehow run into this problem, I just don't understand how.

It shouldn't matter if you run on bare-metal or SRIOV, the VM handling is the same as far as I know.

Going to give it a try today to create some unit tests exercising the problem.

Christian.

>
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: 2018年2月26日 19:19
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org; Chris 
> Wilson <chris@chris-wilson.co.uk>
> Subject: Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared 
> count bug
>
>> What if all those context are not equal ? and accumulated finally to 4 ?
> Well then there is something else broken and I actually think that this is the root cause here.
>
>> You can try VK_EXAMPLE test yourself, see if you can hit that BUG() 
>> in reservation.c file,
> Could you generate a simpler unit test to trigger the problem?
>
>
> Thanks,
> Christian.
>
> Am 26.02.2018 um 11:57 schrieb Liu, Monk:
>>> So we call reserve_shared only once and then call reservation_object_add_shared_fence() multiple times, always with the same context.
>>> should replace the fence when it has the same context as a previously added fence.
>> What if all those context are not equal ? and accumulated finally to 4 ?
>> You can try VK_EXAMPLE test yourself, see if you can hit that BUG() 
>> in reservation.c file, I hit it and it is fixed by this patch, I don't think there is other reason.
>>
>> -----Original Message-----
>> From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On 
>> Behalf Of Christian K?nig
>> Sent: 2018年2月26日 18:45
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org; 
>> Chris Wilson <chris@chris-wilson.co.uk>
>> Subject: Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared 
>> count bug
>>
>>> We must call reserve_shared before amdgpu_bo_fence....
>> Actually that's not correct. See 
>> reservation_object_add_shared_fence()
>> should replace the fence when it has the same context as a previously added fence.
>>
>> So we call reserve_shared only once and then call
>> reservation_object_add_shared_fence() multiple times, always with the same context.
>>
>> Writing this, Chris actually had patches which may have broken that assumption, but as far as I know those aren't merged yet.
>>
>> Regards,
>> Christian.
>>
>> Am 26.02.2018 um 11:36 schrieb Liu, Monk:
>>> amdgpu_vm_get_pd_bo() should already reserves a shared slot for the fence:
>>>>            entry->tv.shared = true;
>>> That's not related with the bug I fixed! The root cause is there is 
>>> some sequence that When code path comes to amdgpu_bo_fence of 
>>> vm_update_directories, no one The shared_cnt is by coincidence 
>>> already the max value of resv obj, so the BUG() in Reservation.c 
>>> file will hit,
>>>
>>> We must call reserve_shared before amdgpu_bo_fence....
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>>> Sent: 2018年2月26日 17:48
>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>> Subject: Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared 
>>> count bug
>>>
>>> That fix is incorrect.
>>>
>>> amdgpu_vm_get_pd_bo() should already reserves a shared slot for the fence:
>>>>            entry->tv.shared = true;
>>> So you must run into this issue because of something else.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 26.02.2018 um 06:35 schrieb Monk Liu:
>>>> should call reservation_object_reserve_shared before 
>>>> amdgpu_bo_fence(), otherwise there are odds kernel hit BUG in 
>>>> reversation.c
>>>>
>>>> bug:
>>>> [12622.076435] ------------[ cut here ]------------ [12622.076438] 
>>>> kernel BUG at drivers/dma-buf/reservation.c:233!
>>>> [12622.078046] invalid opcode: 0000 [#1] SMP KASAN [12622.078345] 
>>>> Modules linked in: amdgpu(E) chash(E) gpu_sched(E) ttm(E) drm_kms_helper(E) drm(E) i2c_algo_bit(E) fb_sys_fops(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) amdkfd(E) amd_iommu_v2(E) snd_hda_codec_generic(E) snd_hda_intel(E) snd_hda_codec(E) snd_hda_core(E) snd_hwdep(E) snd_pcm(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) pcbc(E) snd_seq_midi(E) snd_seq_midi_event(E) snd_rawmidi(E) snd_seq(E) aesni_intel(E) snd_seq_device(E) aes_x86_64(E) crypto_simd(E) glue_helper(E) cryptd(E) snd_timer(E) serio_raw(E) snd(E) soundcore(E) i2c_piix4(E) mac_hid(E) binfmt_misc(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) parport_pc(E) ppdev(E) lp(E) parport(E) autofs4(E) 8139too(E) psmouse(E) floppy(E) 8139cp(E) mii(E) pata_acpi(E)
>>>> [12622.082664] CPU: 2 PID: 6075 Comm: ShaderImageLoad Tainted: G            E   4.13.0-feb15-2018-guest-12 #8
>>>> [12622.083278] Hardware name: QEMU Standard PC (i440FX + PIIX, 
>>>> 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 [12622.083880] task:
>>>> ffff8801314f8000 task.stack: ffff8801e6df0000 [12622.084267] RIP:
>>>> 0010:reservation_object_add_shared_fence+0x2bc/0x2c0
>>>> [12622.084680] RSP: 0018:ffff8801e6df7ae8 EFLAGS: 00010246 
>>>> [12622.085016] RAX: 0000000000000004 RBX: ffff8801f2542a80 RCX:
>>>> ffff8801f2542b58 [12622.085471] RDX: ffff880044791428 RSI:
>>>> ffff8801e2657d20 RDI: ffff880044791428 [12622.085995] RBP:
>>>> ffff8801e6df7b28 R08: 0000000000000000 R09: 0000000000000000 
>>>> [12622.086451] R10: 000000009fe2fe07 R11: 0000000000000002 R12:
>>>> ffff880044791200 [12622.086905] R13: ffff8801e1da6850 R14:
>>>> 0000000000000000 R15: ffff8801e9303000 [12622.087359] FS:
>>>> 00007f70109ff700(0000) GS:ffff8801f7500000(0000)
>>>> knlGS:0000000000000000 [12622.087874] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [12622.088242] CR2: 00007f6fe1a49000 CR3: 00000001e8377000 CR4: 00000000001406e0 [12622.088698] Call Trace:
>>>> [12622.088934]  amdgpu_bo_fence+0x25/0x30 [amdgpu] [12622.089271]
>>>> amdgpu_vm_update_directories+0x3d7/0x450 [amdgpu] [12622.089721]  ?
>>>> amdgpu_vm_free_mapping.isra.20+0x30/0x30 [amdgpu] [12622.090184]
>>>> amdgpu_gem_va_ioctl+0x409/0x4f0 [amdgpu] [12622.090569]  ?
>>>> amdgpu_gem_metadata_ioctl+0x1b0/0x1b0 [amdgpu] [12622.090959]
>>>> drm_ioctl_kernel+0x6a/0xb0 [drm] [12622.091247]  ?
>>>> kasan_check_write+0x14/0x20 [12622.091517]  drm_ioctl+0x2ea/0x3b0 
>>>> [drm] [12622.091805]  ? amdgpu_gem_metadata_ioctl+0x1b0/0x1b0
>>>> [amdgpu] [12622.092212]  amdgpu_drm_ioctl+0x4b/0x80 [amdgpu] 
>>>> [12622.092519]
>>>> do_vfs_ioctl+0x96/0x5b0 [12622.092754]  ?
>>>> kvm_sched_clock_read+0x1e/0x30 [12622.093036]  ?
>>>> kvm_sched_clock_read+0x1e/0x30 [12622.093322]  ?
>>>> sched_clock+0x9/0x10 [12622.093550]  SyS_ioctl+0x79/0x90 [12622.093812]  ?
>>>> vtime_user_exit+0x29/0x70 [12622.094108]  do_syscall_64+0x6e/0x160 
>>>> [12622.094350]  entry_SYSCALL64_slow_path+0x25/0x25
>>>> [12622.094651] RIP: 0033:0x7f7012ebff07 [12622.094884] RSP:
>>>> 002b:00007f70109fdae8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010 
>>>> [12622.095369] RAX: ffffffffffffffda RBX: 00007f6fe100fe30 RCX:
>>>> 00007f7012ebff07 [12622.095827] RDX: 00007f70109fdbb0 RSI:
>>>> 00000000c0286448 RDI: 0000000000000004 [12622.096283] RBP:
>>>> 00007f70109fdb20 R08: 0000000332000000 R09: 000000000000000e 
>>>> [12622.096739] R10: 0000000002e2e628 R11: 0000000000000206 R12:
>>>> 0000000002ebb8f0 [12622.097195] R13: 00007f6fe100fe30 R14:
>>>> 0000000002fe2bf0 R15: 0000000000000000 [12622.097676] Code: 2b e6 
>>>> ff ff 41 8b 4f 10 48 8b 75 c8 48 8b 55 d0 e9 06 ff ff ff 49 8d 44 
>>>> cf 10
>>>> 45 31 ed 48 89 70 08 41 83 47 10 01 e9 12 ff ff ff <0f> 0b 90 90 0f 
>>>> 1f
>>>> 44 00 00 55 31 c0 48 81 7f 08 00 df a9 ac 48 [12622.098919] RIP:
>>>> reservation_object_add_shared_fence+0x2bc/0x2c0 RSP:
>>>> ffff8801e6df7ae8 [12622.099527] ---[ end trace 8235addd8e22e444 
>>>> ]---
>>>>
>>>> Change-Id: I1de84f60c75a0d78b0cad244b8e2bff02aced9fa
>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>> ---
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 5 +++++
>>>>      1 file changed, 5 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>> index 0b237e0..e6f159f 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>> @@ -989,6 +989,11 @@ int amdgpu_vm_update_directories(struct amdgpu_device *adev,
>>>>      		amdgpu_ring_pad_ib(ring, params.ib);
>>>>      		amdgpu_sync_resv(adev, &job->sync, root->tbo.resv,
>>>>      				 AMDGPU_FENCE_OWNER_VM, false);
>>>> +
>>>> +		r = reservation_object_reserve_shared(root->tbo.resv);
>>>> +		if (r)
>>>> +			goto error;
>>>> +
>>>>      		WARN_ON(params.ib->length_dw > ndw);
>>>>      		r = amdgpu_job_submit(job, ring, &vm->entity,
>>>>      				      AMDGPU_FENCE_OWNER_VM, &fence);
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared count bug
       [not found]                                     ` <BLUPR12MB0449422688494CE8CE97647884C00-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2018-02-28 12:38                                       ` Christian König
       [not found]                                         ` <889bf3af-4fa5-def3-a3c3-410fdd0ccd3e-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Christian König @ 2018-02-28 12:38 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Chris Wilson

Ok, I've figured out what is going on here. The problem was that we 
tried to clear VM BOs with the standard background clear function.

Because of this you needed not only one, but two fence slots for newly 
allocated page directories/tables.

But I've already fixed this bug quite a while ago, make sure you have 
this patch in your branch:
> commit 13307f7e1d0c05a68f4ba19193cbd213573a8680
> Author: Christian König <christian.koenig@amd.com>
> Date:   Wed Jan 24 17:19:04 2018 +0100
>
>     drm/amdgpu: revert "drm/amdgpu: use AMDGPU_GEM_CREATE_VRAM_CLEARED 
> for VM PD/PTs" v2
>
>     Using the standard clear turned out to be to inflexible.
>
>     First of all it is executed on the system queue, together with buffer
>     moves instead on the per VM queue.
>
>     And second we need to fill in the page tables with more than just 
> zero.
>
>     We keep the new functionality of initializing the PDEs/PTEs with ATC
>     routing entries intact.
>
>     v2: update commit message.
>
>     Signed-off-by: Christian König <christian.koenig@amd.com>
>     Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

And also this follow up commit:
> commit 29e8357b4cbbfcee6d375f2d183b674b678923d7
> Author: Christian König <christian.koenig@amd.com>
> Date:   Sun Feb 4 19:36:52 2018 +0100
>
>     drm/amdgpu: sync the VM PD/PT before clearing it
>
>     Otherwise we might overwrite stuff which is still in use.
>
>     Signed-off-by: Christian König <christian.koenig@amd.com>
>     Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
>     Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Regards,
Christian.

Am 27.02.2018 um 10:17 schrieb Liu, Monk:
> I know in theoretically this issue should be equal for sriov or not
> But since the gpu performance under SRIOV is only one quarter compared with bare-metal (assume 4VF)
> And the CPU's performance is also limited, so there are chance that the fence is signaled not as quick as bare-metal
> And lead to the scenario that four different context fence are living in resv,
>
> If you do it on bare-metal maybe you need some tricks, at least on my sriov platform I can duplicate this BUG()
> With 4 repeating loop on VK_EXAMPLE every time
>
> /Monk
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: 2018年2月27日 16:34
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org; Chris Wilson <chris@chris-wilson.co.uk>
> Subject: Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared count bug
>
> Am 27.02.2018 um 04:36 schrieb Liu, Monk:
>>> Well then there is something else broken and I actually think that this is the root cause here.
>> What if this VM submits four different jobs on four different rings,
>> e.g. gfx/compute1/compute2/vce So the fences for those jobs are from
>> different context, and now you get a mapping on this VM So the shared
>> count is now 4
> Yeah, but each VM submission reserves a shared slot before adding the fence. So even if you have 4 submissions we still have a guaranteed free slot.
>
>> I ran VK_example and it always can hit this reservation BUG(), just
>> repeat running it 4 loops and The BUG() hit, but I did it under SRIOV
>> case so the CPU&GPU performance is different with bare-metal
>>
>> Anyway do you think above scenario exist ?
> Well you somehow run into this problem, I just don't understand how.
>
> It shouldn't matter if you run on bare-metal or SRIOV, the VM handling is the same as far as I know.
>
> Going to give it a try today to create some unit tests exercising the problem.
>
> Christian.
>
>>
>> -----Original Message-----
>> From: Koenig, Christian
>> Sent: 2018年2月26日 19:19
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org; Chris
>> Wilson <chris@chris-wilson.co.uk>
>> Subject: Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared
>> count bug
>>
>>> What if all those context are not equal ? and accumulated finally to 4 ?
>> Well then there is something else broken and I actually think that this is the root cause here.
>>
>>> You can try VK_EXAMPLE test yourself, see if you can hit that BUG()
>>> in reservation.c file,
>> Could you generate a simpler unit test to trigger the problem?
>>
>>
>> Thanks,
>> Christian.
>>
>> Am 26.02.2018 um 11:57 schrieb Liu, Monk:
>>>> So we call reserve_shared only once and then call reservation_object_add_shared_fence() multiple times, always with the same context.
>>>> should replace the fence when it has the same context as a previously added fence.
>>> What if all those context are not equal ? and accumulated finally to 4 ?
>>> You can try VK_EXAMPLE test yourself, see if you can hit that BUG()
>>> in reservation.c file, I hit it and it is fixed by this patch, I don't think there is other reason.
>>>
>>> -----Original Message-----
>>> From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On
>>> Behalf Of Christian K?nig
>>> Sent: 2018年2月26日 18:45
>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org;
>>> Chris Wilson <chris@chris-wilson.co.uk>
>>> Subject: Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared
>>> count bug
>>>
>>>> We must call reserve_shared before amdgpu_bo_fence....
>>> Actually that's not correct. See
>>> reservation_object_add_shared_fence()
>>> should replace the fence when it has the same context as a previously added fence.
>>>
>>> So we call reserve_shared only once and then call
>>> reservation_object_add_shared_fence() multiple times, always with the same context.
>>>
>>> Writing this, Chris actually had patches which may have broken that assumption, but as far as I know those aren't merged yet.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 26.02.2018 um 11:36 schrieb Liu, Monk:
>>>> amdgpu_vm_get_pd_bo() should already reserves a shared slot for the fence:
>>>>>             entry->tv.shared = true;
>>>> That's not related with the bug I fixed! The root cause is there is
>>>> some sequence that When code path comes to amdgpu_bo_fence of
>>>> vm_update_directories, no one The shared_cnt is by coincidence
>>>> already the max value of resv obj, so the BUG() in Reservation.c
>>>> file will hit,
>>>>
>>>> We must call reserve_shared before amdgpu_bo_fence....
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>>>> Sent: 2018年2月26日 17:48
>>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>>> Subject: Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared
>>>> count bug
>>>>
>>>> That fix is incorrect.
>>>>
>>>> amdgpu_vm_get_pd_bo() should already reserves a shared slot for the fence:
>>>>>             entry->tv.shared = true;
>>>> So you must run into this issue because of something else.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 26.02.2018 um 06:35 schrieb Monk Liu:
>>>>> should call reservation_object_reserve_shared before
>>>>> amdgpu_bo_fence(), otherwise there are odds kernel hit BUG in
>>>>> reversation.c
>>>>>
>>>>> bug:
>>>>> [12622.076435] ------------[ cut here ]------------ [12622.076438]
>>>>> kernel BUG at drivers/dma-buf/reservation.c:233!
>>>>> [12622.078046] invalid opcode: 0000 [#1] SMP KASAN [12622.078345]
>>>>> Modules linked in: amdgpu(E) chash(E) gpu_sched(E) ttm(E) drm_kms_helper(E) drm(E) i2c_algo_bit(E) fb_sys_fops(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) amdkfd(E) amd_iommu_v2(E) snd_hda_codec_generic(E) snd_hda_intel(E) snd_hda_codec(E) snd_hda_core(E) snd_hwdep(E) snd_pcm(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) pcbc(E) snd_seq_midi(E) snd_seq_midi_event(E) snd_rawmidi(E) snd_seq(E) aesni_intel(E) snd_seq_device(E) aes_x86_64(E) crypto_simd(E) glue_helper(E) cryptd(E) snd_timer(E) serio_raw(E) snd(E) soundcore(E) i2c_piix4(E) mac_hid(E) binfmt_misc(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) parport_pc(E) ppdev(E) lp(E) parport(E) autofs4(E) 8139too(E) psmouse(E) floppy(E) 8139cp(E) mii(E) pata_acpi(E)
>>>>> [12622.082664] CPU: 2 PID: 6075 Comm: ShaderImageLoad Tainted: G            E   4.13.0-feb15-2018-guest-12 #8
>>>>> [12622.083278] Hardware name: QEMU Standard PC (i440FX + PIIX,
>>>>> 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 [12622.083880] task:
>>>>> ffff8801314f8000 task.stack: ffff8801e6df0000 [12622.084267] RIP:
>>>>> 0010:reservation_object_add_shared_fence+0x2bc/0x2c0
>>>>> [12622.084680] RSP: 0018:ffff8801e6df7ae8 EFLAGS: 00010246
>>>>> [12622.085016] RAX: 0000000000000004 RBX: ffff8801f2542a80 RCX:
>>>>> ffff8801f2542b58 [12622.085471] RDX: ffff880044791428 RSI:
>>>>> ffff8801e2657d20 RDI: ffff880044791428 [12622.085995] RBP:
>>>>> ffff8801e6df7b28 R08: 0000000000000000 R09: 0000000000000000
>>>>> [12622.086451] R10: 000000009fe2fe07 R11: 0000000000000002 R12:
>>>>> ffff880044791200 [12622.086905] R13: ffff8801e1da6850 R14:
>>>>> 0000000000000000 R15: ffff8801e9303000 [12622.087359] FS:
>>>>> 00007f70109ff700(0000) GS:ffff8801f7500000(0000)
>>>>> knlGS:0000000000000000 [12622.087874] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [12622.088242] CR2: 00007f6fe1a49000 CR3: 00000001e8377000 CR4: 00000000001406e0 [12622.088698] Call Trace:
>>>>> [12622.088934]  amdgpu_bo_fence+0x25/0x30 [amdgpu] [12622.089271]
>>>>> amdgpu_vm_update_directories+0x3d7/0x450 [amdgpu] [12622.089721]  ?
>>>>> amdgpu_vm_free_mapping.isra.20+0x30/0x30 [amdgpu] [12622.090184]
>>>>> amdgpu_gem_va_ioctl+0x409/0x4f0 [amdgpu] [12622.090569]  ?
>>>>> amdgpu_gem_metadata_ioctl+0x1b0/0x1b0 [amdgpu] [12622.090959]
>>>>> drm_ioctl_kernel+0x6a/0xb0 [drm] [12622.091247]  ?
>>>>> kasan_check_write+0x14/0x20 [12622.091517]  drm_ioctl+0x2ea/0x3b0
>>>>> [drm] [12622.091805]  ? amdgpu_gem_metadata_ioctl+0x1b0/0x1b0
>>>>> [amdgpu] [12622.092212]  amdgpu_drm_ioctl+0x4b/0x80 [amdgpu]
>>>>> [12622.092519]
>>>>> do_vfs_ioctl+0x96/0x5b0 [12622.092754]  ?
>>>>> kvm_sched_clock_read+0x1e/0x30 [12622.093036]  ?
>>>>> kvm_sched_clock_read+0x1e/0x30 [12622.093322]  ?
>>>>> sched_clock+0x9/0x10 [12622.093550]  SyS_ioctl+0x79/0x90 [12622.093812]  ?
>>>>> vtime_user_exit+0x29/0x70 [12622.094108]  do_syscall_64+0x6e/0x160
>>>>> [12622.094350]  entry_SYSCALL64_slow_path+0x25/0x25
>>>>> [12622.094651] RIP: 0033:0x7f7012ebff07 [12622.094884] RSP:
>>>>> 002b:00007f70109fdae8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
>>>>> [12622.095369] RAX: ffffffffffffffda RBX: 00007f6fe100fe30 RCX:
>>>>> 00007f7012ebff07 [12622.095827] RDX: 00007f70109fdbb0 RSI:
>>>>> 00000000c0286448 RDI: 0000000000000004 [12622.096283] RBP:
>>>>> 00007f70109fdb20 R08: 0000000332000000 R09: 000000000000000e
>>>>> [12622.096739] R10: 0000000002e2e628 R11: 0000000000000206 R12:
>>>>> 0000000002ebb8f0 [12622.097195] R13: 00007f6fe100fe30 R14:
>>>>> 0000000002fe2bf0 R15: 0000000000000000 [12622.097676] Code: 2b e6
>>>>> ff ff 41 8b 4f 10 48 8b 75 c8 48 8b 55 d0 e9 06 ff ff ff 49 8d 44
>>>>> cf 10
>>>>> 45 31 ed 48 89 70 08 41 83 47 10 01 e9 12 ff ff ff <0f> 0b 90 90 0f
>>>>> 1f
>>>>> 44 00 00 55 31 c0 48 81 7f 08 00 df a9 ac 48 [12622.098919] RIP:
>>>>> reservation_object_add_shared_fence+0x2bc/0x2c0 RSP:
>>>>> ffff8801e6df7ae8 [12622.099527] ---[ end trace 8235addd8e22e444
>>>>> ]---
>>>>>
>>>>> Change-Id: I1de84f60c75a0d78b0cad244b8e2bff02aced9fa
>>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>>> ---
>>>>>       drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 5 +++++
>>>>>       1 file changed, 5 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> index 0b237e0..e6f159f 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> @@ -989,6 +989,11 @@ int amdgpu_vm_update_directories(struct amdgpu_device *adev,
>>>>>       		amdgpu_ring_pad_ib(ring, params.ib);
>>>>>       		amdgpu_sync_resv(adev, &job->sync, root->tbo.resv,
>>>>>       				 AMDGPU_FENCE_OWNER_VM, false);
>>>>> +
>>>>> +		r = reservation_object_reserve_shared(root->tbo.resv);
>>>>> +		if (r)
>>>>> +			goto error;
>>>>> +
>>>>>       		WARN_ON(params.ib->length_dw > ndw);
>>>>>       		r = amdgpu_job_submit(job, ring, &vm->entity,
>>>>>       				      AMDGPU_FENCE_OWNER_VM, &fence);
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH 22/22] drm/amdgpu: fix reservation obj shared count bug
       [not found]                                         ` <889bf3af-4fa5-def3-a3c3-410fdd0ccd3e-5C7GfCeVMHo@public.gmane.org>
@ 2018-02-28 13:01                                           ` Liu, Monk
  0 siblings, 0 replies; 18+ messages in thread
From: Liu, Monk @ 2018-02-28 13:01 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Chris Wilson

Hi Christian

I'll try patch it and give a try on my branch, yeah my branch is old and not have this patch

But I still feed odds, because by any reason I think it is safe that always call reserve_shared() before amdgpu_bo_fence() right ? 

The real problem troubles me is:
1) Under vulkan CTS test why driver hit crash (sometime kernel oops/panic and sometimes VMC page fault, looks really like wild fence pointer issue to me) after I inserted that reserve_shared() in vm_update_directories() ? 

2) More weird thing is after I removed "kfree(obj->staged)" those kernel panic/oops/vmc page faults just gone .... I know you are right about that kfree that it won't free the staged if no race issue crossed 


I still like the idea that add this reserve_shared() in vm_update_directories() and duplicate the kernel issue again, if it was kernel panic then I can paste you the crash log.

/Monk

-----Original Message-----
From: Koenig, Christian 
Sent: 2018年2月28日 20:39
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org; Chris Wilson <chris@chris-wilson.co.uk>
Subject: Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared count bug

Ok, I've figured out what is going on here. The problem was that we tried to clear VM BOs with the standard background clear function.

Because of this you needed not only one, but two fence slots for newly allocated page directories/tables.

But I've already fixed this bug quite a while ago, make sure you have this patch in your branch:
> commit 13307f7e1d0c05a68f4ba19193cbd213573a8680
> Author: Christian König <christian.koenig@amd.com>
> Date:   Wed Jan 24 17:19:04 2018 +0100
>
>     drm/amdgpu: revert "drm/amdgpu: use AMDGPU_GEM_CREATE_VRAM_CLEARED 
> for VM PD/PTs" v2
>
>     Using the standard clear turned out to be to inflexible.
>
>     First of all it is executed on the system queue, together with 
> buffer
>     moves instead on the per VM queue.
>
>     And second we need to fill in the page tables with more than just 
> zero.
>
>     We keep the new functionality of initializing the PDEs/PTEs with 
> ATC
>     routing entries intact.
>
>     v2: update commit message.
>
>     Signed-off-by: Christian König <christian.koenig@amd.com>
>     Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

And also this follow up commit:
> commit 29e8357b4cbbfcee6d375f2d183b674b678923d7
> Author: Christian König <christian.koenig@amd.com>
> Date:   Sun Feb 4 19:36:52 2018 +0100
>
>     drm/amdgpu: sync the VM PD/PT before clearing it
>
>     Otherwise we might overwrite stuff which is still in use.
>
>     Signed-off-by: Christian König <christian.koenig@amd.com>
>     Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
>     Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Regards,
Christian.

Am 27.02.2018 um 10:17 schrieb Liu, Monk:
> I know in theoretically this issue should be equal for sriov or not 
> But since the gpu performance under SRIOV is only one quarter compared 
> with bare-metal (assume 4VF) And the CPU's performance is also 
> limited, so there are chance that the fence is signaled not as quick 
> as bare-metal And lead to the scenario that four different context 
> fence are living in resv,
>
> If you do it on bare-metal maybe you need some tricks, at least on my 
> sriov platform I can duplicate this BUG() With 4 repeating loop on 
> VK_EXAMPLE every time
>
> /Monk
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: 2018年2月27日 16:34
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org; Chris 
> Wilson <chris@chris-wilson.co.uk>
> Subject: Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared 
> count bug
>
> Am 27.02.2018 um 04:36 schrieb Liu, Monk:
>>> Well then there is something else broken and I actually think that this is the root cause here.
>> What if this VM submits four different jobs on four different rings, 
>> e.g. gfx/compute1/compute2/vce So the fences for those jobs are from 
>> different context, and now you get a mapping on this VM So the shared 
>> count is now 4
> Yeah, but each VM submission reserves a shared slot before adding the fence. So even if you have 4 submissions we still have a guaranteed free slot.
>
>> I ran VK_example and it always can hit this reservation BUG(), just 
>> repeat running it 4 loops and The BUG() hit, but I did it under SRIOV 
>> case so the CPU&GPU performance is different with bare-metal
>>
>> Anyway do you think above scenario exist ?
> Well you somehow run into this problem, I just don't understand how.
>
> It shouldn't matter if you run on bare-metal or SRIOV, the VM handling is the same as far as I know.
>
> Going to give it a try today to create some unit tests exercising the problem.
>
> Christian.
>
>>
>> -----Original Message-----
>> From: Koenig, Christian
>> Sent: 2018年2月26日 19:19
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org; 
>> Chris Wilson <chris@chris-wilson.co.uk>
>> Subject: Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared 
>> count bug
>>
>>> What if all those context are not equal ? and accumulated finally to 4 ?
>> Well then there is something else broken and I actually think that this is the root cause here.
>>
>>> You can try VK_EXAMPLE test yourself, see if you can hit that BUG() 
>>> in reservation.c file,
>> Could you generate a simpler unit test to trigger the problem?
>>
>>
>> Thanks,
>> Christian.
>>
>> Am 26.02.2018 um 11:57 schrieb Liu, Monk:
>>>> So we call reserve_shared only once and then call reservation_object_add_shared_fence() multiple times, always with the same context.
>>>> should replace the fence when it has the same context as a previously added fence.
>>> What if all those context are not equal ? and accumulated finally to 4 ?
>>> You can try VK_EXAMPLE test yourself, see if you can hit that BUG() 
>>> in reservation.c file, I hit it and it is fixed by this patch, I don't think there is other reason.
>>>
>>> -----Original Message-----
>>> From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On 
>>> Behalf Of Christian K?nig
>>> Sent: 2018年2月26日 18:45
>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org; 
>>> Chris Wilson <chris@chris-wilson.co.uk>
>>> Subject: Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared 
>>> count bug
>>>
>>>> We must call reserve_shared before amdgpu_bo_fence....
>>> Actually that's not correct. See
>>> reservation_object_add_shared_fence()
>>> should replace the fence when it has the same context as a previously added fence.
>>>
>>> So we call reserve_shared only once and then call
>>> reservation_object_add_shared_fence() multiple times, always with the same context.
>>>
>>> Writing this, Chris actually had patches which may have broken that assumption, but as far as I know those aren't merged yet.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 26.02.2018 um 11:36 schrieb Liu, Monk:
>>>> amdgpu_vm_get_pd_bo() should already reserves a shared slot for the fence:
>>>>>             entry->tv.shared = true;
>>>> That's not related with the bug I fixed! The root cause is there is 
>>>> some sequence that When code path comes to amdgpu_bo_fence of 
>>>> vm_update_directories, no one The shared_cnt is by coincidence 
>>>> already the max value of resv obj, so the BUG() in Reservation.c 
>>>> file will hit,
>>>>
>>>> We must call reserve_shared before amdgpu_bo_fence....
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>>>> Sent: 2018年2月26日 17:48
>>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>>> Subject: Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared 
>>>> count bug
>>>>
>>>> That fix is incorrect.
>>>>
>>>> amdgpu_vm_get_pd_bo() should already reserves a shared slot for the fence:
>>>>>             entry->tv.shared = true;
>>>> So you must run into this issue because of something else.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 26.02.2018 um 06:35 schrieb Monk Liu:
>>>>> should call reservation_object_reserve_shared before 
>>>>> amdgpu_bo_fence(), otherwise there are odds kernel hit BUG in 
>>>>> reversation.c
>>>>>
>>>>> bug:
>>>>> [12622.076435] ------------[ cut here ]------------ [12622.076438] 
>>>>> kernel BUG at drivers/dma-buf/reservation.c:233!
>>>>> [12622.078046] invalid opcode: 0000 [#1] SMP KASAN [12622.078345] 
>>>>> Modules linked in: amdgpu(E) chash(E) gpu_sched(E) ttm(E) drm_kms_helper(E) drm(E) i2c_algo_bit(E) fb_sys_fops(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) amdkfd(E) amd_iommu_v2(E) snd_hda_codec_generic(E) snd_hda_intel(E) snd_hda_codec(E) snd_hda_core(E) snd_hwdep(E) snd_pcm(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) pcbc(E) snd_seq_midi(E) snd_seq_midi_event(E) snd_rawmidi(E) snd_seq(E) aesni_intel(E) snd_seq_device(E) aes_x86_64(E) crypto_simd(E) glue_helper(E) cryptd(E) snd_timer(E) serio_raw(E) snd(E) soundcore(E) i2c_piix4(E) mac_hid(E) binfmt_misc(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) parport_pc(E) ppdev(E) lp(E) parport(E) autofs4(E) 8139too(E) psmouse(E) floppy(E) 8139cp(E) mii(E) pata_acpi(E)
>>>>> [12622.082664] CPU: 2 PID: 6075 Comm: ShaderImageLoad Tainted: G            E   4.13.0-feb15-2018-guest-12 #8
>>>>> [12622.083278] Hardware name: QEMU Standard PC (i440FX + PIIX, 
>>>>> 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 [12622.083880] task:
>>>>> ffff8801314f8000 task.stack: ffff8801e6df0000 [12622.084267] RIP:
>>>>> 0010:reservation_object_add_shared_fence+0x2bc/0x2c0
>>>>> [12622.084680] RSP: 0018:ffff8801e6df7ae8 EFLAGS: 00010246 
>>>>> [12622.085016] RAX: 0000000000000004 RBX: ffff8801f2542a80 RCX:
>>>>> ffff8801f2542b58 [12622.085471] RDX: ffff880044791428 RSI:
>>>>> ffff8801e2657d20 RDI: ffff880044791428 [12622.085995] RBP:
>>>>> ffff8801e6df7b28 R08: 0000000000000000 R09: 0000000000000000 
>>>>> [12622.086451] R10: 000000009fe2fe07 R11: 0000000000000002 R12:
>>>>> ffff880044791200 [12622.086905] R13: ffff8801e1da6850 R14:
>>>>> 0000000000000000 R15: ffff8801e9303000 [12622.087359] FS:
>>>>> 00007f70109ff700(0000) GS:ffff8801f7500000(0000)
>>>>> knlGS:0000000000000000 [12622.087874] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [12622.088242] CR2: 00007f6fe1a49000 CR3: 00000001e8377000 CR4: 00000000001406e0 [12622.088698] Call Trace:
>>>>> [12622.088934]  amdgpu_bo_fence+0x25/0x30 [amdgpu] [12622.089271]
>>>>> amdgpu_vm_update_directories+0x3d7/0x450 [amdgpu] [12622.089721]  ?
>>>>> amdgpu_vm_free_mapping.isra.20+0x30/0x30 [amdgpu] [12622.090184]
>>>>> amdgpu_gem_va_ioctl+0x409/0x4f0 [amdgpu] [12622.090569]  ?
>>>>> amdgpu_gem_metadata_ioctl+0x1b0/0x1b0 [amdgpu] [12622.090959]
>>>>> drm_ioctl_kernel+0x6a/0xb0 [drm] [12622.091247]  ?
>>>>> kasan_check_write+0x14/0x20 [12622.091517]  drm_ioctl+0x2ea/0x3b0 
>>>>> [drm] [12622.091805]  ? amdgpu_gem_metadata_ioctl+0x1b0/0x1b0
>>>>> [amdgpu] [12622.092212]  amdgpu_drm_ioctl+0x4b/0x80 [amdgpu] 
>>>>> [12622.092519]
>>>>> do_vfs_ioctl+0x96/0x5b0 [12622.092754]  ?
>>>>> kvm_sched_clock_read+0x1e/0x30 [12622.093036]  ?
>>>>> kvm_sched_clock_read+0x1e/0x30 [12622.093322]  ?
>>>>> sched_clock+0x9/0x10 [12622.093550]  SyS_ioctl+0x79/0x90 [12622.093812]  ?
>>>>> vtime_user_exit+0x29/0x70 [12622.094108]  do_syscall_64+0x6e/0x160 
>>>>> [12622.094350]  entry_SYSCALL64_slow_path+0x25/0x25
>>>>> [12622.094651] RIP: 0033:0x7f7012ebff07 [12622.094884] RSP:
>>>>> 002b:00007f70109fdae8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010 
>>>>> [12622.095369] RAX: ffffffffffffffda RBX: 00007f6fe100fe30 RCX:
>>>>> 00007f7012ebff07 [12622.095827] RDX: 00007f70109fdbb0 RSI:
>>>>> 00000000c0286448 RDI: 0000000000000004 [12622.096283] RBP:
>>>>> 00007f70109fdb20 R08: 0000000332000000 R09: 000000000000000e 
>>>>> [12622.096739] R10: 0000000002e2e628 R11: 0000000000000206 R12:
>>>>> 0000000002ebb8f0 [12622.097195] R13: 00007f6fe100fe30 R14:
>>>>> 0000000002fe2bf0 R15: 0000000000000000 [12622.097676] Code: 2b e6 
>>>>> ff ff 41 8b 4f 10 48 8b 75 c8 48 8b 55 d0 e9 06 ff ff ff 49 8d 44 
>>>>> cf 10
>>>>> 45 31 ed 48 89 70 08 41 83 47 10 01 e9 12 ff ff ff <0f> 0b 90 90 
>>>>> 0f 1f
>>>>> 44 00 00 55 31 c0 48 81 7f 08 00 df a9 ac 48 [12622.098919] RIP:
>>>>> reservation_object_add_shared_fence+0x2bc/0x2c0 RSP:
>>>>> ffff8801e6df7ae8 [12622.099527] ---[ end trace 8235addd8e22e444
>>>>> ]---
>>>>>
>>>>> Change-Id: I1de84f60c75a0d78b0cad244b8e2bff02aced9fa
>>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>>> ---
>>>>>       drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 5 +++++
>>>>>       1 file changed, 5 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> index 0b237e0..e6f159f 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> @@ -989,6 +989,11 @@ int amdgpu_vm_update_directories(struct amdgpu_device *adev,
>>>>>       		amdgpu_ring_pad_ib(ring, params.ib);
>>>>>       		amdgpu_sync_resv(adev, &job->sync, root->tbo.resv,
>>>>>       				 AMDGPU_FENCE_OWNER_VM, false);
>>>>> +
>>>>> +		r = reservation_object_reserve_shared(root->tbo.resv);
>>>>> +		if (r)
>>>>> +			goto error;
>>>>> +
>>>>>       		WARN_ON(params.ib->length_dw > ndw);
>>>>>       		r = amdgpu_job_submit(job, ring, &vm->entity,
>>>>>       				      AMDGPU_FENCE_OWNER_VM, &fence);
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2018-02-28 13:01 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-26  5:34 [PATCH 21/22] dma-buf/reservation: shouldn't kfree staged when slot available Monk Liu
     [not found] ` <1519623300-17305-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2018-02-26  5:35   ` [PATCH 22/22] drm/amdgpu: fix reservation obj shared count bug Monk Liu
     [not found]     ` <1519623300-17305-2-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2018-02-26  9:47       ` Christian König
     [not found]         ` <f3854579-e1dc-1226-9807-3e67a1a9bb7c-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2018-02-26 10:32           ` Liu, Monk
2018-02-26 10:36           ` Liu, Monk
     [not found]             ` <SN1PR12MB0462FDC9905B318AF45FCECC84C10-z7L1TMIYDg7VaWpRXmIMygdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-02-26 10:44               ` Christian König
     [not found]                 ` <7b2c1408-8ffc-4b8a-94e4-45534dff99f8-5C7GfCeVMHo@public.gmane.org>
2018-02-26 10:54                   ` Chris Wilson
2018-02-26 10:57                   ` Liu, Monk
     [not found]                     ` <SN1PR12MB0462022ABFBC835ACE0D34E484C10-z7L1TMIYDg7VaWpRXmIMygdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-02-26 11:18                       ` Christian König
     [not found]                         ` <0a041a5c-9d7f-e307-7730-275ba7b775c9-5C7GfCeVMHo@public.gmane.org>
2018-02-27  3:36                           ` Liu, Monk
     [not found]                             ` <BLUPR12MB0449672DB665AD8DB6E081BD84C00-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-02-27  8:34                               ` Christian König
     [not found]                                 ` <b0d669c0-e6e8-5907-a33c-3b94890fd091-5C7GfCeVMHo@public.gmane.org>
2018-02-27  9:17                                   ` Liu, Monk
     [not found]                                     ` <BLUPR12MB0449422688494CE8CE97647884C00-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-02-28 12:38                                       ` Christian König
     [not found]                                         ` <889bf3af-4fa5-def3-a3c3-410fdd0ccd3e-5C7GfCeVMHo@public.gmane.org>
2018-02-28 13:01                                           ` Liu, Monk
2018-02-26  9:42   ` [PATCH 21/22] dma-buf/reservation: shouldn't kfree staged when slot available Christian König
     [not found]     ` <afa20c31-f721-599e-7e4c-eecbf7bda70f-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2018-02-26 10:13       ` Liu, Monk
2018-02-26 10:19       ` Chris Wilson
     [not found]         ` <151964034205.2350.2554778746105246165-M6iVdVfohj6unts5RBS2dVaTQe2KTcn/@public.gmane.org>
2018-02-26 10:25           ` Liu, Monk

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.