* [PATCH 00/11] add recovery entity
From: Chunming Zhou @ 2016-07-28 10:13 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Chunming Zhou

Every VM has its own recovery entity, which is used to recover its page table from its shadow.
A VM no longer needs to wait for the VM in front of it to complete.
Using all PTE rings also speeds up the recovery.

Every scheduler has its own recovery entity, which is used to save hw jobs and resubmit them from there; this resolves the conflict between the reset thread and the scheduler thread when running jobs.

The series also contains some fixes found while doing this improvement.
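
For reference, a rough sketch of the intended reset flow after this series
(condensed from patch 06; locking and error handling omitted, so treat it
as pseudocode rather than the exact code):

	/* park every scheduler and drop whatever is on the hw rings */
	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
		struct amdgpu_ring *ring = adev->rings[i];

		if (!ring)
			continue;
		kthread_park(ring->sched.thread);
		amd_sched_hw_job_reset(&ring->sched);
		amdgpu_ring_reset(ring);	/* wptr = rptr, hw ring is empty */
	}
	amdgpu_fence_driver_force_completion(adev);

	/* ... asic reset and resume ... */

	/* every vm recovers its page table through its own recover
	 * entity, spread over all pte rings */
	list_for_each_entry_safe(vm, tmp, &adev->vm_list, list)
		amdgpu_vm_recover_page_table_from_shadow(adev, vm);

	/* resubmit the saved hw jobs through each scheduler's recover
	 * entity, then let normal submission continue */
	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
		struct amdgpu_ring *ring = adev->rings[i];

		if (!ring)
			continue;
		amd_sched_job_recovery(&ring->sched);
		kthread_unpark(ring->sched.thread);
	}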

Chunming Zhou (11):
  drm/amdgpu: hw ring should be empty when gpu reset
  drm/amdgpu: specify entity to amdgpu_copy_buffer
  drm/amd: add recover run queue for scheduler
  drm/amdgpu: fix vm init error path
  drm/amdgpu: add vm recover entity
  drm/amdgpu: use all pte rings to recover page table
  drm/amd: add recover entity for every scheduler
  drm/amd: use scheduler to recover hw jobs
  drm/amd: hw job list should be exact
  drm/amd: reset jobs to recover entity
  drm/amdgpu: no need fence wait every time

 drivers/gpu/drm/amd/amdgpu/amdgpu.h           |   5 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_benchmark.c |   3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    |  35 +++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c      |  11 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_test.c      |   8 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |   5 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |  26 ++++--
 drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 129 +++++++++++++-------------
 drivers/gpu/drm/amd/scheduler/gpu_scheduler.h |   4 +-
 9 files changed, 134 insertions(+), 92 deletions(-)

-- 
1.9.1

* [PATCH 01/11] drm/amdgpu: hw ring should be empty when gpu reset
From: Chunming Zhou @ 2016-07-28 10:13 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Chunming Zhou

Change-Id: I08ca5a805f590cc7aad0e9ccd91bd5925bb216e2
Signed-off-by: Chunming Zhou <David1.Zhou@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c   | 11 +++++++++++
 3 files changed, 13 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 43beefb..ebd5565 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1247,6 +1247,7 @@ int amdgpu_ib_ring_tests(struct amdgpu_device *adev);
 int amdgpu_ring_alloc(struct amdgpu_ring *ring, unsigned ndw);
 void amdgpu_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count);
 void amdgpu_ring_generic_pad_ib(struct amdgpu_ring *ring, struct amdgpu_ib *ib);
+void amdgpu_ring_reset(struct amdgpu_ring *ring);
 void amdgpu_ring_commit(struct amdgpu_ring *ring);
 void amdgpu_ring_undo(struct amdgpu_ring *ring);
 int amdgpu_ring_init(struct amdgpu_device *adev, struct amdgpu_ring *ring,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 7e63ef9..1968251 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2102,6 +2102,7 @@ int amdgpu_gpu_reset(struct amdgpu_device *adev)
 			continue;
 		kthread_park(ring->sched.thread);
 		amd_sched_hw_job_reset(&ring->sched);
+		amdgpu_ring_reset(ring);
 	}
 	/* after all hw jobs are reset, hw fence is meaningless, so force_completion */
 	amdgpu_fence_driver_force_completion(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 9989e25..75e1da6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -110,6 +110,17 @@ void amdgpu_ring_generic_pad_ib(struct amdgpu_ring *ring, struct amdgpu_ib *ib)
 		ib->ptr[ib->length_dw++] = ring->nop;
 }
 
+void amdgpu_ring_reset(struct amdgpu_ring *ring)
+{
+       u32 rptr = amdgpu_ring_get_rptr(ring);
+
+       ring->wptr = rptr;
+       ring->wptr &= ring->ptr_mask;
+
+       mb();
+       amdgpu_ring_set_wptr(ring);
+}
+
 /**
  * amdgpu_ring_commit - tell the GPU to execute the new
  * commands on the ring buffer
-- 
1.9.1

* [PATCH 02/11] drm/amdgpu: specify entity to amdgpu_copy_buffer
From: Chunming Zhou @ 2016-07-28 10:13 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Chunming Zhou

Change-Id: Ib84621d8ab61bf2ca0719c6888cc403982127684
Signed-off-by: Chunming Zhou <David1.Zhou@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h           | 1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_benchmark.c | 3 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_test.c      | 8 ++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       | 5 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        | 2 +-
 5 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index ebd5565..9f7fae0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -431,6 +431,7 @@ struct amdgpu_mman {
 };
 
 int amdgpu_copy_buffer(struct amdgpu_ring *ring,
+		       struct amd_sched_entity *entity,
 		       uint64_t src_offset,
 		       uint64_t dst_offset,
 		       uint32_t byte_count,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_benchmark.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_benchmark.c
index 33e47a4..cab93c7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_benchmark.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_benchmark.c
@@ -39,7 +39,8 @@ static int amdgpu_benchmark_do_move(struct amdgpu_device *adev, unsigned size,
 	start_jiffies = jiffies;
 	for (i = 0; i < n; i++) {
 		struct amdgpu_ring *ring = adev->mman.buffer_funcs_ring;
-		r = amdgpu_copy_buffer(ring, saddr, daddr, size, NULL, &fence);
+		r = amdgpu_copy_buffer(ring, &adev->mman.entity,
+				       saddr, daddr, size, NULL, &fence);
 		if (r)
 			goto exit_do_move;
 		r = fence_wait(fence, false);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_test.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_test.c
index 05a53f4..bbaa1c1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_test.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_test.c
@@ -110,8 +110,8 @@ static void amdgpu_do_test_moves(struct amdgpu_device *adev)
 
 		amdgpu_bo_kunmap(gtt_obj[i]);
 
-		r = amdgpu_copy_buffer(ring, gtt_addr, vram_addr,
-				       size, NULL, &fence);
+		r = amdgpu_copy_buffer(ring, &adev->mman.entity, gtt_addr,
+				       vram_addr, size, NULL, &fence);
 
 		if (r) {
 			DRM_ERROR("Failed GTT->VRAM copy %d\n", i);
@@ -155,8 +155,8 @@ static void amdgpu_do_test_moves(struct amdgpu_device *adev)
 
 		amdgpu_bo_kunmap(vram_obj);
 
-		r = amdgpu_copy_buffer(ring, vram_addr, gtt_addr,
-				       size, NULL, &fence);
+		r = amdgpu_copy_buffer(ring, &adev->mman.entity, vram_addr,
+				       gtt_addr, size, NULL, &fence);
 
 		if (r) {
 			DRM_ERROR("Failed VRAM->GTT copy %d\n", i);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index b7742e6..757a71b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -283,7 +283,7 @@ static int amdgpu_move_blit(struct ttm_buffer_object *bo,
 
 	BUILD_BUG_ON((PAGE_SIZE % AMDGPU_GPU_PAGE_SIZE) != 0);
 
-	r = amdgpu_copy_buffer(ring, old_start, new_start,
+	r = amdgpu_copy_buffer(ring, &adev->mman.entity, old_start, new_start,
 			       new_mem->num_pages * PAGE_SIZE, /* bytes */
 			       bo->resv, &fence);
 	if (r)
@@ -1147,6 +1147,7 @@ int amdgpu_mmap(struct file *filp, struct vm_area_struct *vma)
 }
 
 int amdgpu_copy_buffer(struct amdgpu_ring *ring,
+		       struct amd_sched_entity *entity,
 		       uint64_t src_offset,
 		       uint64_t dst_offset,
 		       uint32_t byte_count,
@@ -1195,7 +1196,7 @@ int amdgpu_copy_buffer(struct amdgpu_ring *ring,
 
 	amdgpu_ring_pad_ib(ring, &job->ibs[0]);
 	WARN_ON(job->ibs[0].length_dw > num_dw);
-	r = amdgpu_job_submit(job, ring, &adev->mman.entity,
+	r = amdgpu_job_submit(job, ring, entity,
 			      AMDGPU_FENCE_OWNER_UNDEFINED, fence);
 	if (r)
 		goto error_free;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 0e3f116..11c1263 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -739,7 +739,7 @@ static int amdgpu_vm_recover_bo_from_shadow(struct amdgpu_device *adev,
 	if (r)
 		goto err3;
 
-	r = amdgpu_copy_buffer(ring, gtt_addr, vram_addr,
+	r = amdgpu_copy_buffer(ring, &adev->mman.entity, gtt_addr, vram_addr,
 			       amdgpu_bo_size(bo), resv, fence);
 	if (!r)
 		amdgpu_bo_fence(bo, *fence, true);
-- 
1.9.1

* [PATCH 03/11] drm/amd: add recover run queue for scheduler
From: Chunming Zhou @ 2016-07-28 10:13 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Chunming Zhou

Change-Id: I7171d1e3884aabe1263d8f7be18cadf2e98216a4
Signed-off-by: Chunming Zhou <David1.Zhou@amd.com>
---
 drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
index a1c0073..cd87bc7 100644
--- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
+++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
@@ -112,7 +112,8 @@ struct amd_sched_backend_ops {
 };
 
 enum amd_sched_priority {
-	AMD_SCHED_PRIORITY_KERNEL = 0,
+	AMD_SCHED_PRIORITY_RECOVER = 0,
+	AMD_SCHED_PRIORITY_KERNEL,
 	AMD_SCHED_PRIORITY_NORMAL,
 	AMD_SCHED_MAX_PRIORITY
 };
-- 
1.9.1

* [PATCH 04/11] drm/amdgpu: fix vm init error path
From: Chunming Zhou @ 2016-07-28 10:13 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Chunming Zhou

Change-Id: Ie3d5440dc0d2d3a61d8e785ab08b8b91eda223db
Signed-off-by: Chunming Zhou <David1.Zhou@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 11c1263..1d58577 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1682,7 +1682,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm)
 	r = amd_sched_entity_init(&ring->sched, &vm->entity,
 				  rq, amdgpu_sched_jobs);
 	if (r)
-		return r;
+		goto err;
 
 	vm->page_directory_fence = NULL;
 
@@ -1725,6 +1725,9 @@ error_free_page_directory:
 error_free_sched_entity:
 	amd_sched_entity_fini(&ring->sched, &vm->entity);
 
+err:
+	drm_free_large(vm->page_tables);
+
 	return r;
 }
 
-- 
1.9.1

* [PATCH 05/11] drm/amdgpu: add vm recover entity
From: Chunming Zhou @ 2016-07-28 10:13 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Chunming Zhou

Every VM uses its own recover entity to recover its page table from the shadow.

Change-Id: I93e37666cb3fb511311c96ff172b6e9ebd337547
Signed-off-by: Chunming Zhou <David1.Zhou@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h    |  3 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 21 ++++++++++++++-------
 2 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 9f7fae0..98f631a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -911,7 +911,8 @@ struct amdgpu_vm {
 
 	/* Scheduler entity for page table updates */
 	struct amd_sched_entity	entity;
-
+	struct amd_sched_entity	recover_entity;
+	struct amdgpu_ring      *ring;
 	/* client id */
 	u64                     client_id;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 1d58577..6d2a28a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -714,13 +714,13 @@ error_free:
 }
 
 static int amdgpu_vm_recover_bo_from_shadow(struct amdgpu_device *adev,
+					    struct amdgpu_vm *vm,
 					    struct amdgpu_bo *bo,
 					    struct amdgpu_bo *bo_shadow,
 					    struct reservation_object *resv,
 					    struct fence **fence)
 
 {
-	struct amdgpu_ring *ring = adev->mman.buffer_funcs_ring;
 	int r;
 	uint64_t vram_addr, gtt_addr;
 
@@ -739,8 +739,8 @@ static int amdgpu_vm_recover_bo_from_shadow(struct amdgpu_device *adev,
 	if (r)
 		goto err3;
 
-	r = amdgpu_copy_buffer(ring, &adev->mman.entity, gtt_addr, vram_addr,
-			       amdgpu_bo_size(bo), resv, fence);
+	r = amdgpu_copy_buffer(vm->ring, &vm->recover_entity, gtt_addr,
+			       vram_addr, amdgpu_bo_size(bo), resv, fence);
 	if (!r)
 		amdgpu_bo_fence(bo, *fence, true);
 
@@ -767,7 +767,7 @@ int amdgpu_vm_recover_page_table_from_shadow(struct amdgpu_device *adev,
 	if (unlikely(r != 0))
 		return r;
 
-	r = amdgpu_vm_recover_bo_from_shadow(adev, vm->page_directory,
+	r = amdgpu_vm_recover_bo_from_shadow(adev, vm, vm->page_directory,
 					     vm->page_directory->shadow,
 					     NULL, &fence);
 	if (r) {
@@ -784,7 +784,7 @@ int amdgpu_vm_recover_page_table_from_shadow(struct amdgpu_device *adev,
 
 		if (!bo || !bo_shadow)
 			continue;
-		r = amdgpu_vm_recover_bo_from_shadow(adev, bo, bo_shadow,
+		r = amdgpu_vm_recover_bo_from_shadow(adev, vm, bo, bo_shadow,
 						     NULL, &fence);
 		if (r) {
 			DRM_ERROR("recover page table failed!\n");
@@ -1678,12 +1678,17 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm)
 	ring_instance = atomic_inc_return(&adev->vm_manager.vm_pte_next_ring);
 	ring_instance %= adev->vm_manager.vm_pte_num_rings;
 	ring = adev->vm_manager.vm_pte_rings[ring_instance];
+	rq = &ring->sched.sched_rq[AMD_SCHED_PRIORITY_RECOVER];
+	r = amd_sched_entity_init(&ring->sched, &vm->recover_entity,
+				  rq, amdgpu_sched_jobs);
+	if (r)
+		goto err;
 	rq = &ring->sched.sched_rq[AMD_SCHED_PRIORITY_KERNEL];
 	r = amd_sched_entity_init(&ring->sched, &vm->entity,
 				  rq, amdgpu_sched_jobs);
 	if (r)
-		goto err;
-
+		goto err1;
+	vm->ring = ring;
 	vm->page_directory_fence = NULL;
 
 	r = amdgpu_bo_create(adev, pd_size, align, true,
@@ -1725,6 +1730,8 @@ error_free_page_directory:
 error_free_sched_entity:
 	amd_sched_entity_fini(&ring->sched, &vm->entity);
 
+err1:
+	amd_sched_entity_fini(&ring->sched, &vm->recover_entity);
 err:
 	drm_free_large(vm->page_tables);
 
-- 
1.9.1

* [PATCH 06/11] drm/amdgpu: use all pte rings to recover page table
From: Chunming Zhou @ 2016-07-28 10:13 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Chunming Zhou

Change-Id: Ic74508ec9de0bf1c027313ce9574e6cb8ea9bb1d
Signed-off-by: Chunming Zhou <David1.Zhou@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 34 ++++++++++++++++++++++--------
 1 file changed, 25 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 1968251..e91177a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2083,6 +2083,7 @@ int amdgpu_gpu_reset(struct amdgpu_device *adev)
 	int i, r;
 	int resched;
 	bool need_full_reset;
+	u32 unpark_bits;
 
 	if (!amdgpu_check_soft_reset(adev)) {
 		DRM_INFO("No hardware hang detected. Did some blocks stall?\n");
@@ -2104,6 +2105,7 @@ int amdgpu_gpu_reset(struct amdgpu_device *adev)
 		amd_sched_hw_job_reset(&ring->sched);
 		amdgpu_ring_reset(ring);
 	}
+	unpark_bits = 0;
 	/* after all hw jobs are reset, hw fence is meaningless, so force_completion */
 	amdgpu_fence_driver_force_completion(adev);
 	/* store modesetting */
@@ -2147,8 +2149,6 @@ retry:
 		amdgpu_atombios_scratch_regs_restore(adev);
 	}
 	if (!r) {
-		struct amdgpu_ring *buffer_ring = adev->mman.buffer_funcs_ring;
-
 		amdgpu_irq_gpu_reset_resume_helper(adev);
 		r = amdgpu_ib_ring_tests(adev);
 		if (r) {
@@ -2163,11 +2163,20 @@ retry:
 		 */
 		if (need_full_reset && !(adev->flags & AMD_IS_APU)) {
 			struct amdgpu_vm *vm, *tmp;
+			int i;
 
 			DRM_INFO("recover page table from shadow\n");
-			amd_sched_rq_block_entity(
-				&buffer_ring->sched.sched_rq[AMD_SCHED_PRIORITY_NORMAL], true);
-			kthread_unpark(buffer_ring->sched.thread);
+			for (i = 0; i < adev->vm_manager.vm_pte_num_rings; i++) {
+				struct amdgpu_ring *ring = adev->vm_manager.vm_pte_rings[i];
+
+				amd_sched_rq_block_entity(
+					&ring->sched.sched_rq[AMD_SCHED_PRIORITY_KERNEL], true);
+				amd_sched_rq_block_entity(
+					&ring->sched.sched_rq[AMD_SCHED_PRIORITY_NORMAL], true);
+				kthread_unpark(ring->sched.thread);
+				unpark_bits |= 1 << ring->idx;
+			}
+
 			spin_lock(&adev->vm_list_lock);
 			list_for_each_entry_safe(vm, tmp, &adev->vm_list, list) {
 				spin_unlock(&adev->vm_list_lock);
@@ -2175,8 +2184,15 @@ retry:
 				spin_lock(&adev->vm_list_lock);
 			}
 			spin_unlock(&adev->vm_list_lock);
-			amd_sched_rq_block_entity(
-				&buffer_ring->sched.sched_rq[AMD_SCHED_PRIORITY_NORMAL], false);
+
+			for (i = 0; i < adev->vm_manager.vm_pte_num_rings; i++) {
+				struct amdgpu_ring *ring = adev->vm_manager.vm_pte_rings[i];
+
+				amd_sched_rq_block_entity(
+					&ring->sched.sched_rq[AMD_SCHED_PRIORITY_KERNEL], false);
+				amd_sched_rq_block_entity(
+					&ring->sched.sched_rq[AMD_SCHED_PRIORITY_NORMAL], false);
+			}
 		}
 		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
 			struct amdgpu_ring *ring = adev->rings[i];
@@ -2184,9 +2200,9 @@ retry:
 				continue;
 
 			DRM_INFO("ring:%d recover jobs\n", ring->idx);
-			kthread_park(buffer_ring->sched.thread);
 			amd_sched_job_recovery(&ring->sched);
-			kthread_unpark(ring->sched.thread);
+			if (!((unpark_bits >> ring->idx) & 0x1))
+				kthread_unpark(ring->sched.thread);
 		}
 	} else {
 		dev_err(adev->dev, "asic resume failed (%d).\n", r);
-- 
1.9.1

* [PATCH 07/11] drm/amd: add recover entity for every scheduler
From: Chunming Zhou @ 2016-07-28 10:13 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Chunming Zhou

It will be used to recover hw jobs.

Change-Id: I5508f5ffa04909b480ddd669dfb297e5059eba04
Signed-off-by: Chunming Zhou <David1.Zhou@amd.com>
---
 drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 24 ++++++++++++++++++++----
 drivers/gpu/drm/amd/scheduler/gpu_scheduler.h |  1 +
 2 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
index a15fd88..36f5805 100644
--- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
+++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
@@ -635,7 +635,7 @@ int amd_sched_init(struct amd_gpu_scheduler *sched,
 		   const struct amd_sched_backend_ops *ops,
 		   unsigned hw_submission, long timeout, const char *name)
 {
-	int i;
+	int i, r;
 	sched->ops = ops;
 	sched->hw_submission_limit = hw_submission;
 	sched->name = name;
@@ -648,22 +648,37 @@ int amd_sched_init(struct amd_gpu_scheduler *sched,
 	INIT_LIST_HEAD(&sched->ring_mirror_list);
 	spin_lock_init(&sched->job_list_lock);
 	atomic_set(&sched->hw_rq_count, 0);
+	r = amd_sched_entity_init(sched, &sched->recover_entity,
+				  &sched->sched_rq[AMD_SCHED_PRIORITY_RECOVER],
+				  hw_submission);
+	if (r)
+		return r;
 	if (atomic_inc_return(&sched_fence_slab_ref) == 1) {
 		sched_fence_slab = kmem_cache_create(
 			"amd_sched_fence", sizeof(struct amd_sched_fence), 0,
 			SLAB_HWCACHE_ALIGN, NULL);
-		if (!sched_fence_slab)
-			return -ENOMEM;
+		if (!sched_fence_slab) {
+			r = -ENOMEM;
+			goto err1;
+		}
 	}
 
 	/* Each scheduler will run on a seperate kernel thread */
 	sched->thread = kthread_run(amd_sched_main, sched, sched->name);
 	if (IS_ERR(sched->thread)) {
 		DRM_ERROR("Failed to create scheduler for %s.\n", name);
-		return PTR_ERR(sched->thread);
+		r = PTR_ERR(sched->thread);
+		goto err2;
 	}
 
 	return 0;
+err2:
+	if (atomic_dec_and_test(&sched_fence_slab_ref))
+		kmem_cache_destroy(sched_fence_slab);
+
+err1:
+	amd_sched_entity_fini(sched, &sched->recover_entity);
+	return r;
 }
 
 /**
@@ -677,4 +692,5 @@ void amd_sched_fini(struct amd_gpu_scheduler *sched)
 		kthread_stop(sched->thread);
 	if (atomic_dec_and_test(&sched_fence_slab_ref))
 		kmem_cache_destroy(sched_fence_slab);
+	amd_sched_entity_fini(sched, &sched->recover_entity);
 }
diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
index cd87bc7..8245316 100644
--- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
+++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
@@ -133,6 +133,7 @@ struct amd_gpu_scheduler {
 	struct task_struct		*thread;
 	struct list_head	ring_mirror_list;
 	spinlock_t			job_list_lock;
+	struct amd_sched_entity         recover_entity;
 };
 
 int amd_sched_init(struct amd_gpu_scheduler *sched,
-- 
1.9.1

* [PATCH 08/11] drm/amd: use scheduler to recover hw jobs
From: Chunming Zhou @ 2016-07-28 10:13 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Chunming Zhou

The old way tried to recover hw jobs directly from the reset path, which
conflicts with the scheduler thread.

Change-Id: I9e45abd43ae280a675b0b0d88a820106dea2716c
Signed-off-by: Chunming Zhou <David1.Zhou@amd.com>
---
 drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 48 +++++++++------------------
 1 file changed, 16 insertions(+), 32 deletions(-)

diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
index 36f5805..9f4fa6e 100644
--- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
+++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
@@ -324,10 +324,12 @@ amd_sched_entity_pop_job(struct amd_sched_entity *entity)
  *
  * Returns true if we could submit the job.
  */
-static bool amd_sched_entity_in(struct amd_sched_job *sched_job)
+static bool amd_sched_entity_in_or_recover(struct amd_sched_job *sched_job,
+					   bool recover)
 {
 	struct amd_gpu_scheduler *sched = sched_job->sched;
-	struct amd_sched_entity *entity = sched_job->s_entity;
+	struct amd_sched_entity *entity = recover ? &sched->recover_entity :
+		sched_job->s_entity;
 	bool added, first = false;
 
 	spin_lock(&entity->queue_lock);
@@ -348,6 +350,15 @@ static bool amd_sched_entity_in(struct amd_sched_job *sched_job)
 	return added;
 }
 
+static void amd_sched_entity_push_job_recover(struct amd_sched_job *sched_job)
+{
+	struct amd_sched_entity *entity = sched_job->s_entity;
+
+	trace_amd_sched_job(sched_job);
+	wait_event(entity->sched->job_scheduled,
+		   amd_sched_entity_in_or_recover(sched_job, true));
+}
+
 /* job_finish is called after hw fence signaled, and
  * the job had already been deleted from ring_mirror_list
  */
@@ -426,39 +437,12 @@ void amd_sched_hw_job_reset(struct amd_gpu_scheduler *sched)
 void amd_sched_job_recovery(struct amd_gpu_scheduler *sched)
 {
 	struct amd_sched_job *s_job, *tmp;
-	int r;
 
 	spin_lock(&sched->job_list_lock);
-	s_job = list_first_entry_or_null(&sched->ring_mirror_list,
-					 struct amd_sched_job, node);
-	if (s_job)
-		schedule_delayed_work(&s_job->work_tdr, sched->timeout);
-
 	list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
-		struct amd_sched_fence *s_fence = s_job->s_fence;
-		struct fence *fence, *dependency;
-
+		list_del_init(&s_job->node);
 		spin_unlock(&sched->job_list_lock);
-		while ((dependency = sched->ops->dependency(s_job))) {
-		       fence_wait(dependency, false);
-		       fence_put(dependency);
-		}
-		fence = sched->ops->run_job(s_job);
-		atomic_inc(&sched->hw_rq_count);
-		if (fence) {
-			s_fence->parent = fence_get(fence);
-			r = fence_add_callback(fence, &s_fence->cb,
-					       amd_sched_process_job);
-			if (r == -ENOENT)
-				amd_sched_process_job(fence, &s_fence->cb);
-			else if (r)
-				DRM_ERROR("fence add callback failed (%d)\n",
-					  r);
-			fence_put(fence);
-		} else {
-			DRM_ERROR("Failed to run job!\n");
-			amd_sched_process_job(NULL, &s_fence->cb);
-		}
+		amd_sched_entity_push_job_recover(s_job);
 		spin_lock(&sched->job_list_lock);
 	}
 	spin_unlock(&sched->job_list_lock);
@@ -479,7 +463,7 @@ void amd_sched_entity_push_job(struct amd_sched_job *sched_job)
 	fence_add_callback(&sched_job->s_fence->finished, &sched_job->finish_cb,
 			   amd_sched_job_finish_cb);
 	wait_event(entity->sched->job_scheduled,
-		   amd_sched_entity_in(sched_job));
+		   amd_sched_entity_in_or_recover(sched_job, false));
 }
 
 /* init a sched_job with basic field */
-- 
1.9.1

* [PATCH 09/11] drm/amd: hw job list should be exact
From: Chunming Zhou @ 2016-07-28 10:13 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Chunming Zhou

The hw job list should be exact, so the job node should be deleted in the
irq handler instead of in the work thread.
Queuing the TDR for the next job should likewise happen immediately.

Change-Id: I6d2686d84be3e7077300df7181c2a284fbcda9eb
Signed-off-by: Chunming Zhou <David1.Zhou@amd.com>
---
 drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 38 +++++++++++++--------------
 1 file changed, 18 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
index 9f4fa6e..69a9d40 100644
--- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
+++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
@@ -367,34 +367,32 @@ static void amd_sched_job_finish(struct work_struct *work)
 	struct amd_sched_job *s_job = container_of(work, struct amd_sched_job,
 						   finish_work);
 	struct amd_gpu_scheduler *sched = s_job->sched;
-	unsigned long flags;
-
-	/* remove job from ring_mirror_list */
-	spin_lock_irqsave(&sched->job_list_lock, flags);
-	list_del_init(&s_job->node);
-	if (sched->timeout != MAX_SCHEDULE_TIMEOUT) {
-		struct amd_sched_job *next;
 
-		spin_unlock_irqrestore(&sched->job_list_lock, flags);
+	if (sched->timeout != MAX_SCHEDULE_TIMEOUT)
 		cancel_delayed_work_sync(&s_job->work_tdr);
-		spin_lock_irqsave(&sched->job_list_lock, flags);
-
-		/* queue TDR for next job */
-		next = list_first_entry_or_null(&sched->ring_mirror_list,
-						struct amd_sched_job, node);
 
-		if (next)
-			schedule_delayed_work(&next->work_tdr, sched->timeout);
-	}
-	spin_unlock_irqrestore(&sched->job_list_lock, flags);
 	sched->ops->free_job(s_job);
 }
 
 static void amd_sched_job_finish_cb(struct fence *f, struct fence_cb *cb)
 {
-	struct amd_sched_job *job = container_of(cb, struct amd_sched_job,
-						 finish_cb);
-	schedule_work(&job->finish_work);
+	struct amd_sched_job *s_job = container_of(cb, struct amd_sched_job,
+						   finish_cb);
+	struct amd_gpu_scheduler *sched = s_job->sched;
+	struct amd_sched_job *next;
+	unsigned long flags;
+
+	/* remove job from ring_mirror_list */
+	spin_lock_irqsave(&sched->job_list_lock, flags);
+	list_del_init(&s_job->node);
+	/* queue TDR for next job */
+	next = list_first_entry_or_null(&sched->ring_mirror_list,
+					struct amd_sched_job, node);
+	spin_unlock_irqrestore(&sched->job_list_lock, flags);
+	if (next)
+		schedule_delayed_work(&next->work_tdr, sched->timeout);
+
+	schedule_work(&s_job->finish_work);
 }
 
 static void amd_sched_job_begin(struct amd_sched_job *s_job)
-- 
1.9.1

* [PATCH 10/11] drm/amd: reset jobs to recover entity
From: Chunming Zhou @ 2016-07-28 10:13 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Chunming Zhou

Remove recover_entity from the recover_rq when resetting jobs, and add it
back when recovering them.

Change-Id: Ic2e5cb6ab79d2abc49374e1770299487e327efe9
Signed-off-by: Chunming Zhou <David1.Zhou@amd.com>
---
 drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
index 69a9d40..f832d0d 100644
--- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
+++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
@@ -417,9 +417,10 @@ static void amd_sched_job_timedout(struct work_struct *work)
 	job->sched->ops->timedout_job(job);
 }
 
+/* scheduler must be parked before job reset */
 void amd_sched_hw_job_reset(struct amd_gpu_scheduler *sched)
 {
-	struct amd_sched_job *s_job;
+	struct amd_sched_job *s_job, *tmp;
 
 	spin_lock(&sched->job_list_lock);
 	list_for_each_entry_reverse(s_job, &sched->ring_mirror_list, node) {
@@ -429,14 +430,6 @@ void amd_sched_hw_job_reset(struct amd_gpu_scheduler *sched)
 		}
 	}
 	atomic_set(&sched->hw_rq_count, 0);
-	spin_unlock(&sched->job_list_lock);
-}
-
-void amd_sched_job_recovery(struct amd_gpu_scheduler *sched)
-{
-	struct amd_sched_job *s_job, *tmp;
-
-	spin_lock(&sched->job_list_lock);
 	list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
 		list_del_init(&s_job->node);
 		spin_unlock(&sched->job_list_lock);
@@ -444,6 +437,14 @@ void amd_sched_job_recovery(struct amd_gpu_scheduler *sched)
 		spin_lock(&sched->job_list_lock);
 	}
 	spin_unlock(&sched->job_list_lock);
+	amd_sched_rq_remove_entity(&sched->sched_rq[AMD_SCHED_PRIORITY_RECOVER],
+				   &sched->recover_entity);
+}
+
+void amd_sched_job_recovery(struct amd_gpu_scheduler *sched)
+{
+	amd_sched_rq_add_entity(&sched->sched_rq[AMD_SCHED_PRIORITY_RECOVER],
+				&sched->recover_entity);
 }
 
 /**
-- 
1.9.1

* [PATCH 11/11] drm/amdgpu: no need fence wait every time
From: Chunming Zhou @ 2016-07-28 10:13 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Chunming Zhou

The recover entities already handle each job's dependencies, so the
explicit fence wait here is no longer needed.
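
For reference, the scheduler blocks a job on its dependencies before
running it; roughly this part of amd_sched_entity_pop_job() (paraphrased
from memory, not the exact code):

	while ((entity->dependency = sched->ops->dependency(sched_job)))
		if (amd_sched_entity_add_dependency_cb(entity))
			return NULL;	/* picked up again once the fence signals */

Recovered jobs now go through an entity as well, so they get the same
treatment.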

Change-Id: I70a8d0e2753741c4b54d9e01085d00dd708b5c80
Signed-off-by: Chunming Zhou <David1.Zhou@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 6d2a28a..b2790eb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -797,8 +797,6 @@ int amdgpu_vm_recover_page_table_from_shadow(struct amdgpu_device *adev,
 
 err:
 	amdgpu_bo_unreserve(vm->page_directory);
-	if (vm->recover_pt_fence)
-		r = fence_wait(vm->recover_pt_fence, false);
 
 	return r;
 }
-- 
1.9.1

* Re: [PATCH 04/11] drm/amdgpu: fix vm init error path
From: Edward O'Callaghan @ 2016-07-30  3:41 UTC (permalink / raw)
  To: Chunming Zhou, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


Hi,

On 07/28/2016 08:13 PM, Chunming Zhou wrote:
> Change-Id: Ie3d5440dc0d2d3a61d8e785ab08b8b91eda223db
> Signed-off-by: Chunming Zhou <David1.Zhou-5C7GfCeVMHo@public.gmane.org>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 11c1263..1d58577 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -1682,7 +1682,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm)
>  	r = amd_sched_entity_init(&ring->sched, &vm->entity,
>  				  rq, amdgpu_sched_jobs);
>  	if (r)

Hmm, while we are here I think we should be explicit that negative
return values indicate an error path, so:

-  	if (r)
+  	if (r < 0)

This then follows precisely the semantics documented for
'amdgpu_vm_init()' invocations.

Kind Regards,
Edward.

> -		return r;
> +		goto err;
>  
>  	vm->page_directory_fence = NULL;
>  
> @@ -1725,6 +1725,9 @@ error_free_page_directory:
>  error_free_sched_entity:
>  	amd_sched_entity_fini(&ring->sched, &vm->entity);
>  
> +err:
> +	drm_free_large(vm->page_tables);
> +
>  	return r;
>  }
>  
> 


* Re: [PATCH 04/11] drm/amdgpu: fix vm init error path
From: Edward O'Callaghan @ 2016-07-30  3:44 UTC (permalink / raw)
  To: Chunming Zhou, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


On 07/30/2016 01:41 PM, Edward O'Callaghan wrote:
> Hi,
> 
> On 07/28/2016 08:13 PM, Chunming Zhou wrote:
>> Change-Id: Ie3d5440dc0d2d3a61d8e785ab08b8b91eda223db
>> Signed-off-by: Chunming Zhou <David1.Zhou-5C7GfCeVMHo@public.gmane.org>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 5 ++++-
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> index 11c1263..1d58577 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> @@ -1682,7 +1682,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm)
>>  	r = amd_sched_entity_init(&ring->sched, &vm->entity,
>>  				  rq, amdgpu_sched_jobs);
>>  	if (r)
> 
> Hmm while we are here I think we should be explicit that non-zero,
> negative return values indicate an error path, so:
> 
> -  	if (r)
> +  	if (r < 0)
> 
> This then follows precisely the semantics documented for
> 'amdgpu_vm_init()' invocations.
Whoops! 'amd_sched_entity_init()' is what I meant to say!

> 
> Kind Regards,
> Edward.
> 
>> -		return r;
>> +		goto err;
>>  
>>  	vm->page_directory_fence = NULL;
>>  
>> @@ -1725,6 +1725,9 @@ error_free_page_directory:
>>  error_free_sched_entity:
>>  	amd_sched_entity_fini(&ring->sched, &vm->entity);
>>  
>> +err:
>> +	drm_free_large(vm->page_tables);
>> +
>>  	return r;
>>  }
>>  
>>
> 
> 
> 

* Re: [PATCH 09/11] drm/amd: hw job list should be exact
From: Edward O'Callaghan @ 2016-07-30  3:46 UTC (permalink / raw)
  To: Chunming Zhou, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


On 07/28/2016 08:13 PM, Chunming Zhou wrote:
> The hw job list should be exact, so the job node should be deleted in the
> irq handler instead of in the work thread.
> Queuing the TDR for the next job should likewise happen immediately.
> 
> Change-Id: I6d2686d84be3e7077300df7181c2a284fbcda9eb
Guessing this Gerrit/Jenkins CI change-id is usually dropped?

otherwise, Reviewed-by: Edward O'Callaghan <funfunctor-dczkZgxz+BNUPWh3PAxdjQ@public.gmane.org>

> Signed-off-by: Chunming Zhou <David1.Zhou-5C7GfCeVMHo@public.gmane.org>
> ---
>  drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 38 +++++++++++++--------------
>  1 file changed, 18 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> index 9f4fa6e..69a9d40 100644
> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> @@ -367,34 +367,32 @@ static void amd_sched_job_finish(struct work_struct *work)
>  	struct amd_sched_job *s_job = container_of(work, struct amd_sched_job,
>  						   finish_work);
>  	struct amd_gpu_scheduler *sched = s_job->sched;
> -	unsigned long flags;
> -
> -	/* remove job from ring_mirror_list */
> -	spin_lock_irqsave(&sched->job_list_lock, flags);
> -	list_del_init(&s_job->node);
> -	if (sched->timeout != MAX_SCHEDULE_TIMEOUT) {
> -		struct amd_sched_job *next;
>  
> -		spin_unlock_irqrestore(&sched->job_list_lock, flags);
> +	if (sched->timeout != MAX_SCHEDULE_TIMEOUT)
>  		cancel_delayed_work_sync(&s_job->work_tdr);
> -		spin_lock_irqsave(&sched->job_list_lock, flags);
> -
> -		/* queue TDR for next job */
> -		next = list_first_entry_or_null(&sched->ring_mirror_list,
> -						struct amd_sched_job, node);
>  
> -		if (next)
> -			schedule_delayed_work(&next->work_tdr, sched->timeout);
> -	}
> -	spin_unlock_irqrestore(&sched->job_list_lock, flags);
>  	sched->ops->free_job(s_job);
>  }
>  
>  static void amd_sched_job_finish_cb(struct fence *f, struct fence_cb *cb)
>  {
> -	struct amd_sched_job *job = container_of(cb, struct amd_sched_job,
> -						 finish_cb);
> -	schedule_work(&job->finish_work);
> +	struct amd_sched_job *s_job = container_of(cb, struct amd_sched_job,
> +						   finish_cb);
> +	struct amd_gpu_scheduler *sched = s_job->sched;
> +	struct amd_sched_job *next;
> +	unsigned long flags;
> +
> +	/* remove job from ring_mirror_list */
> +	spin_lock_irqsave(&sched->job_list_lock, flags);
> +	list_del_init(&s_job->node);
> +	/* queue TDR for next job */
> +	next = list_first_entry_or_null(&sched->ring_mirror_list,
> +					struct amd_sched_job, node);
> +	spin_unlock_irqrestore(&sched->job_list_lock, flags);
> +	if (next)
> +		schedule_delayed_work(&next->work_tdr, sched->timeout);
> +
> +	schedule_work(&s_job->finish_work);
>  }
>  
>  static void amd_sched_job_begin(struct amd_sched_job *s_job)
> 


* Re: [PATCH 05/11] drm/amdgpu: add vm recover entity
From: Edward O'Callaghan @ 2016-07-30  3:51 UTC (permalink / raw)
  To: Chunming Zhou, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


On 07/28/2016 08:13 PM, Chunming Zhou wrote:
> Every VM uses its own recover entity to recover its page table from the shadow.
> 
> Change-Id: I93e37666cb3fb511311c96ff172b6e9ebd337547
> Signed-off-by: Chunming Zhou <David1.Zhou-5C7GfCeVMHo@public.gmane.org>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h    |  3 ++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 21 ++++++++++++++-------
>  2 files changed, 16 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 9f7fae0..98f631a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -911,7 +911,8 @@ struct amdgpu_vm {
>  
>  	/* Scheduler entity for page table updates */
>  	struct amd_sched_entity	entity;
> -
> +	struct amd_sched_entity	recover_entity;
> +	struct amdgpu_ring      *ring;
>  	/* client id */
>  	u64                     client_id;
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 1d58577..6d2a28a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -714,13 +714,13 @@ error_free:
>  }
>  
>  static int amdgpu_vm_recover_bo_from_shadow(struct amdgpu_device *adev,
> +					    struct amdgpu_vm *vm,
>  					    struct amdgpu_bo *bo,
>  					    struct amdgpu_bo *bo_shadow,
>  					    struct reservation_object *resv,
>  					    struct fence **fence)
>  
>  {
> -	struct amdgpu_ring *ring = adev->mman.buffer_funcs_ring;
>  	int r;
>  	uint64_t vram_addr, gtt_addr;
>  
> @@ -739,8 +739,8 @@ static int amdgpu_vm_recover_bo_from_shadow(struct amdgpu_device *adev,
>  	if (r)
>  		goto err3;
>  
> -	r = amdgpu_copy_buffer(ring, &adev->mman.entity, gtt_addr, vram_addr,
> -			       amdgpu_bo_size(bo), resv, fence);
> +	r = amdgpu_copy_buffer(vm->ring, &vm->recover_entity, gtt_addr,
> +			       vram_addr, amdgpu_bo_size(bo), resv, fence);
>  	if (!r)
>  		amdgpu_bo_fence(bo, *fence, true);
>  
> @@ -767,7 +767,7 @@ int amdgpu_vm_recover_page_table_from_shadow(struct amdgpu_device *adev,
>  	if (unlikely(r != 0))
>  		return r;
>  
> -	r = amdgpu_vm_recover_bo_from_shadow(adev, vm->page_directory,
> +	r = amdgpu_vm_recover_bo_from_shadow(adev, vm, vm->page_directory,
>  					     vm->page_directory->shadow,
>  					     NULL, &fence);

This is slightly out of scope from your patchset's intention, however:

If we are passing 'vm' in now to 'amdgpu_vm_recover_bo_from_shadow()'
could we then perhaps do the dereferences to 'vm->page_directory' and
'vm->page_directory->shadow' in there?
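
Something like this for the page directory call site, say (a hypothetical
wrapper, untested; the page table loop below would still need the BOs
passed in, so a thin wrapper may be the better shape):

	static int amdgpu_vm_recover_pd_from_shadow(struct amdgpu_device *adev,
						    struct amdgpu_vm *vm,
						    struct fence **fence)
	{
		return amdgpu_vm_recover_bo_from_shadow(adev, vm,
							vm->page_directory,
							vm->page_directory->shadow,
							NULL, fence);
	}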

>  	if (r) {
> @@ -784,7 +784,7 @@ int amdgpu_vm_recover_page_table_from_shadow(struct amdgpu_device *adev,
>  
>  		if (!bo || !bo_shadow)
>  			continue;
> -		r = amdgpu_vm_recover_bo_from_shadow(adev, bo, bo_shadow,
> +		r = amdgpu_vm_recover_bo_from_shadow(adev, vm, bo, bo_shadow,
>  						     NULL, &fence);
>  		if (r) {
>  			DRM_ERROR("recover page table failed!\n");
> @@ -1678,12 +1678,17 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm)
>  	ring_instance = atomic_inc_return(&adev->vm_manager.vm_pte_next_ring);
>  	ring_instance %= adev->vm_manager.vm_pte_num_rings;
>  	ring = adev->vm_manager.vm_pte_rings[ring_instance];
> +	rq = &ring->sched.sched_rq[AMD_SCHED_PRIORITY_RECOVER];
> +	r = amd_sched_entity_init(&ring->sched, &vm->recover_entity,
> +				  rq, amdgpu_sched_jobs);
> +	if (r)
> +		goto err;
>  	rq = &ring->sched.sched_rq[AMD_SCHED_PRIORITY_KERNEL];
>  	r = amd_sched_entity_init(&ring->sched, &vm->entity,
>  				  rq, amdgpu_sched_jobs);
>  	if (r)
> -		goto err;
> -
> +		goto err1;
> +	vm->ring = ring;
>  	vm->page_directory_fence = NULL;
>  
>  	r = amdgpu_bo_create(adev, pd_size, align, true,
> @@ -1725,6 +1730,8 @@ error_free_page_directory:
>  error_free_sched_entity:
>  	amd_sched_entity_fini(&ring->sched, &vm->entity);
>  
> +err1:
> +	amd_sched_entity_fini(&ring->sched, &vm->recover_entity);
>  err:
>  	drm_free_large(vm->page_tables);
>  
> 


* Re: [PATCH 03/11] drm/amd: add recover run queue for scheduler
From: Edward O'Callaghan @ 2016-07-30  3:52 UTC (permalink / raw)
  To: Chunming Zhou, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


Reviewed-by: Edward O'Callaghan <funfunctor-dczkZgxz+BNUPWh3PAxdjQ@public.gmane.org>

On 07/28/2016 08:13 PM, Chunming Zhou wrote:
> Change-Id: I7171d1e3884aabe1263d8f7be18cadf2e98216a4
> Signed-off-by: Chunming Zhou <David1.Zhou-5C7GfCeVMHo@public.gmane.org>
> ---
>  drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> index a1c0073..cd87bc7 100644
> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> @@ -112,7 +112,8 @@ struct amd_sched_backend_ops {
>  };
>  
>  enum amd_sched_priority {
> -	AMD_SCHED_PRIORITY_KERNEL = 0,
> +	AMD_SCHED_PRIORITY_RECOVER = 0,
> +	AMD_SCHED_PRIORITY_KERNEL,
>  	AMD_SCHED_PRIORITY_NORMAL,
>  	AMD_SCHED_MAX_PRIORITY
>  };
> 


* Re: [PATCH 01/11] drm/amdgpu: hw ring should be empty when gpu reset
From: Edward O'Callaghan @ 2016-07-30  3:53 UTC (permalink / raw)
  To: Chunming Zhou, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


Reviewed-by: Edward O'Callaghan <funfunctor-dczkZgxz+BNUPWh3PAxdjQ@public.gmane.org>


On 07/28/2016 08:13 PM, Chunming Zhou wrote:
> Change-Id: I08ca5a805f590cc7aad0e9ccd91bd5925bb216e2
> Signed-off-by: Chunming Zhou <David1.Zhou-5C7GfCeVMHo@public.gmane.org>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c   | 11 +++++++++++
>  3 files changed, 13 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 43beefb..ebd5565 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1247,6 +1247,7 @@ int amdgpu_ib_ring_tests(struct amdgpu_device *adev);
>  int amdgpu_ring_alloc(struct amdgpu_ring *ring, unsigned ndw);
>  void amdgpu_ring_insert_nop(struct amdgpu_ring *ring, uint32_t count);
>  void amdgpu_ring_generic_pad_ib(struct amdgpu_ring *ring, struct amdgpu_ib *ib);
> +void amdgpu_ring_reset(struct amdgpu_ring *ring);
>  void amdgpu_ring_commit(struct amdgpu_ring *ring);
>  void amdgpu_ring_undo(struct amdgpu_ring *ring);
>  int amdgpu_ring_init(struct amdgpu_device *adev, struct amdgpu_ring *ring,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 7e63ef9..1968251 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2102,6 +2102,7 @@ int amdgpu_gpu_reset(struct amdgpu_device *adev)
>  			continue;
>  		kthread_park(ring->sched.thread);
>  		amd_sched_hw_job_reset(&ring->sched);
> +		amdgpu_ring_reset(ring);
>  	}
>  	/* after all hw jobs are reset, hw fence is meaningless, so force_completion */
>  	amdgpu_fence_driver_force_completion(adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> index 9989e25..75e1da6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> @@ -110,6 +110,17 @@ void amdgpu_ring_generic_pad_ib(struct amdgpu_ring *ring, struct amdgpu_ib *ib)
>  		ib->ptr[ib->length_dw++] = ring->nop;
>  }
>  
> +void amdgpu_ring_reset(struct amdgpu_ring *ring)
> +{
> +       u32 rptr = amdgpu_ring_get_rptr(ring);
> +
> +       ring->wptr = rptr;
> +       ring->wptr &= ring->ptr_mask;
> +
> +       mb();
> +       amdgpu_ring_set_wptr(ring);
> +}
> +
>  /**
>   * amdgpu_ring_commit - tell the GPU to execute the new
>   * commands on the ring buffer
> 


* Re: [PATCH 00/11] add recovery entity
From: zhoucm1 @ 2016-08-02  2:06 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

ping as well...

On 2016-07-28 18:13, Chunming Zhou wrote:
> Every vm has its own recovery entity, which is used to recover its page tables from their shadow.
> It doesn't need to wait for the preceding vm to complete.
> Using all pte rings also speeds up recovery.
>
> Every scheduler has its own recovery entity, which is used to save hw jobs and resubmit them; this solves the conflict between the reset thread and the scheduler thread when running jobs.
>
> Plus some fixes found while doing this improvement.
>
> Chunming Zhou (11):
>    drm/amdgpu: hw ring should be empty when gpu reset
>    drm/amdgpu: specify entity to amdgpu_copy_buffer
>    drm/amd: add recover run queue for scheduler
>    drm/amdgpu: fix vm init error path
>    drm/amdgpu: add vm recover entity
>    drm/amdgpu: use all pte rings to recover page table
>    drm/amd: add recover entity for every scheduler
>    drm/amd: use scheduler to recover hw jobs
>    drm/amd: hw job list should be exact
>    drm/amd: reset jobs to recover entity
>    drm/amdgpu: no need fence wait every time
>
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h           |   5 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_benchmark.c |   3 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    |  35 +++++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c      |  11 +++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_test.c      |   8 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |   5 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |  26 ++++--
>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 129 +++++++++++++-------------
>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.h |   4 +-
>   9 files changed, 134 insertions(+), 92 deletions(-)
>


* Re: [PATCH 00/11] add recovery entity
       [not found] ` <1469700828-25650-1-git-send-email-David1.Zhou-5C7GfCeVMHo@public.gmane.org>
                     ` (11 preceding siblings ...)
  2016-08-02  2:06   ` [PATCH 00/11] add recovery entity zhoucm1
@ 2016-08-03 13:43   ` Christian König
       [not found]     ` <233bde9b-f7ff-2697-1fd9-419d08f8f359-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
  12 siblings, 1 reply; 25+ messages in thread
From: Christian König @ 2016-08-03 13:43 UTC (permalink / raw)
  To: Chunming Zhou, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Well that is a clear NAK to this whole approach.

Submitting the recovery jobs to the scheduler is reentrant, because the
scheduler is the one that originally signaled the timeout to us.

Why not submit the recovery jobs to the hardware ring directly?
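"Directly" here means entering the IB on the ring from the reset path,
bypassing the scheduler entity entirely. A hedged sketch of what that
could look like, assuming the era's amdgpu_ib_schedule(ring, num_ibs,
ibs, job, &f) signature, which may have differed slightly at this exact
point in time:

    struct fence *f = NULL;
    int r;

    /* Enter the recovery IB straight onto the hardware ring; no
     * scheduler entity is involved, so no job timeout can fire
     * and re-trigger the reset underneath us. */
    r = amdgpu_ib_schedule(ring, 1, &ib, NULL, &f);
    if (!r)
            r = fence_wait(f, false);
    fence_put(f);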

Regards,
Christian.

On 2016-07-28 12:13, Chunming Zhou wrote:
> Every vm has its own recovery entity, which is used to recover its page tables from their shadow.
> It doesn't need to wait for the preceding vm to complete.
> Using all pte rings also speeds up recovery.
>
> Every scheduler has its own recovery entity, which is used to save hw jobs and resubmit them; this solves the conflict between the reset thread and the scheduler thread when running jobs.
>
> Plus some fixes found while doing this improvement.
>
> Chunming Zhou (11):
>    drm/amdgpu: hw ring should be empty when gpu reset
>    drm/amdgpu: specify entity to amdgpu_copy_buffer
>    drm/amd: add recover run queue for scheduler
>    drm/amdgpu: fix vm init error path
>    drm/amdgpu: add vm recover entity
>    drm/amdgpu: use all pte rings to recover page table
>    drm/amd: add recover entity for every scheduler
>    drm/amd: use scheduler to recover hw jobs
>    drm/amd: hw job list should be exact
>    drm/amd: reset jobs to recover entity
>    drm/amdgpu: no need fence wait every time
>
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h           |   5 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_benchmark.c |   3 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    |  35 +++++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c      |  11 +++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_test.c      |   8 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |   5 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |  26 ++++--
>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 129 +++++++++++++-------------
>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.h |   4 +-
>   9 files changed, 134 insertions(+), 92 deletions(-)
>


* Re: [PATCH 00/11] add recovery entity
       [not found]     ` <233bde9b-f7ff-2697-1fd9-419d08f8f359-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
@ 2016-08-04  3:10       ` zhoucm1
       [not found]         ` <57A2B217.2010100-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 25+ messages in thread
From: zhoucm1 @ 2016-08-04  3:10 UTC (permalink / raw)
  To: Christian König, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW



On 2016-08-03 21:43, Christian König wrote:
> Well that is a clear NAK to this whole approach.
>
> Submitting the recovery jobs to the scheduler is reentrant, because the
> scheduler is the one that originally signaled the timeout to us.
We have reset all the recovery jobs, right? Couldn't we treat those jobs
the same as the others?
>
> Why not submit the recovery jobs to the hardware ring directly?
Yeah, this is also what I did at the beginning.
The main reasons are:
0. a recovery job needs to wait at least for its own page table
recovery to complete.
1. direct submission uses run_job, which is used by the scheduler as
well, so it could introduce conflicts.
2. if all vm clients use one sdma engine, restoring is slow. If each vm
can use its own pte ring, we can use all the sdma engines for them.
3. if just one entity recovers all vm page tables, their recovery jobs
have a potential dependency: the later ones wait for the earlier ones.
If each vm has its own entity, there is no dependency between them.
4. if the recovery entity were based on the kernel run queue, the
recovery jobs could execute at the same time as the pt jobs, hence the
separate recover run queue.

That is why I introduce the recovery entity and recovery run queue; a
rough sketch of the per-vm wiring follows below.
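A rough illustration, not the literal patch: each vm gets an entity
parked on the new recover run queue. The sketch below assumes the
2016-era helper amd_sched_entity_init(sched, entity, rq, jobs) and a
hypothetical recover_entity member on amdgpu_vm; the actual series may
differ in the details:

    /* Bind a per-VM recovery entity to the recover run queue of
     * one of the PTE rings, so page table restores of different
     * VMs do not serialize behind one shared kernel entity.
     * "recover_entity" and the ring choice are assumptions. */
    struct amd_sched_rq *rq =
            &ring->sched.sched_rq[AMD_SCHED_PRIORITY_RECOVER];
    int r = amd_sched_entity_init(&ring->sched, &vm->recover_entity,
                                  rq, 64 /* job queue depth */);

    if (r)
            return r;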

Regards,
David Zhou
>
> Regards,
> Christian.
>
> On 2016-07-28 12:13, Chunming Zhou wrote:
>> Every vm has its own recovery entity, which is used to recover its
>> page tables from their shadow.
>> It doesn't need to wait for the preceding vm to complete.
>> Using all pte rings also speeds up recovery.
>>
>> Every scheduler has its own recovery entity, which is used to save
>> hw jobs and resubmit them; this solves the conflict between the
>> reset thread and the scheduler thread when running jobs.
>>
>> Plus some fixes found while doing this improvement.
>>
>> Chunming Zhou (11):
>>    drm/amdgpu: hw ring should be empty when gpu reset
>>    drm/amdgpu: specify entity to amdgpu_copy_buffer
>>    drm/amd: add recover run queue for scheduler
>>    drm/amdgpu: fix vm init error path
>>    drm/amdgpu: add vm recover entity
>>    drm/amdgpu: use all pte rings to recover page table
>>    drm/amd: add recover entity for every scheduler
>>    drm/amd: use scheduler to recover hw jobs
>>    drm/amd: hw job list should be exact
>>    drm/amd: reset jobs to recover entity
>>    drm/amdgpu: no need fence wait every time
>>
>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h           |   5 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_benchmark.c |   3 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    |  35 +++++--
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c      |  11 +++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_test.c      |   8 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |   5 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |  26 ++++--
>>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 129 
>> +++++++++++++-------------
>>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.h |   4 +-
>>   9 files changed, 134 insertions(+), 92 deletions(-)
>>
>


* Re: [PATCH 00/11] add recovery entity
       [not found]         ` <57A2B217.2010100-5C7GfCeVMHo@public.gmane.org>
@ 2016-08-04  8:39           ` Christian König
       [not found]             ` <a8ade743-309a-6982-7bad-a7c5648bd5e2-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
  0 siblings, 1 reply; 25+ messages in thread
From: Christian König @ 2016-08-04  8:39 UTC (permalink / raw)
  To: zhoucm1, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 2016-08-04 05:10, zhoucm1 wrote:
>
>
> On 2016-08-03 21:43, Christian König wrote:
>> Well that is a clear NAK to this whole approach.
>>
>> Submitting the recovery jobs to the scheduler is reentrant, because
>> the scheduler is the one that originally signaled the timeout to us.
> We have reset all the recovery jobs, right? Couldn't we treat those
> jobs the same as the others?

No, they aren't. For recovery jobs you don't want a timeout that
triggers another GPU reset while the first one is still under way.
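One way to picture that guard, as a purely hypothetical sketch (amdgpu
only grew an equivalent in-reset flag in later kernels; none of the
names below come from this series):

    /* Hypothetical reentry guard around the real reset entry
     * point. atomic_cmpxchg() returns the old value, so a second
     * caller sees 1 and bails out instead of nesting a reset. */
    static atomic_t in_reset = ATOMIC_INIT(0);

    int amdgpu_gpu_reset_guarded(struct amdgpu_device *adev)
    {
            int r;

            if (atomic_cmpxchg(&in_reset, 0, 1))
                    return -EBUSY; /* reset already in flight */
            r = amdgpu_gpu_reset(adev);
            atomic_set(&in_reset, 0);
            return r;
    }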

>
>>
>> Why not submit the recovery jobs to the hardware ring directly?
> Yeah, this is also what I did at the beginning.
> The main reasons are:
> 0. a recovery job needs to wait at least for its own page table
> recovery to complete.

Well, as noted in the other thread we need to recover the GART table 
with the CPU anyway.

> 1. direct submission uses run_job, which is used by the scheduler as
> well, so it could introduce conflicts.

The scheduler should be completely stopped during the GPU reset, so 
there shouldn't be any other processing.

> 2. if all vm clients use one sdma engine, restoring is slow. If each
> vm can use its own pte ring, we can use all the sdma engines for them.

A single SDMA engine should be able to max out the PCIe speed in one
direction, so there is no need to spread that across both engines. If
we really need both engines, we could simply handle that in the
recovery code as well.

>
> 3. if just one entity recovers all vm page tables, their recovery jobs
> have a potential dependency: the later ones wait for the earlier ones.
> If each vm has its own entity, there is no dependency between them.
> 4. if the recovery entity were based on the kernel run queue, the
> recovery jobs could execute at the same time as the pt jobs, hence the
> separate recover run queue.

Well, that's exactly the reason why I don't want to push those jobs
through the scheduler. The scheduler should be stopped during the GPU
reset so that nothing else happens with the hardware.

E.g. when other jobs run concurrently with the recovery jobs, you can
have all kinds of problems, like one SDMA engine doing a recovery while
the other one does a backup on the same BO, etc.
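That "stopped during the reset" model is the pattern patch 01 already
follows: park every scheduler kthread before touching the rings and
unpark afterwards. A condensed sketch of it (simplified; the ready-ring
checks and error handling are omitted):

    /* Park all scheduler threads so nothing new reaches the rings
     * while the reset and the recovery submissions run. */
    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
            struct amdgpu_ring *ring = adev->rings[i];

            if (!ring)
                    continue;
            kthread_park(ring->sched.thread);
    }

    /* ... perform the reset and direct-ring recovery here ... */

    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
            struct amdgpu_ring *ring = adev->rings[i];

            if (!ring)
                    continue;
            kthread_unpark(ring->sched.thread);
    }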

Regards,
Christian.

>
> That is why I introduce the recovery entity and recovery run queue.
>
> Regards,
> David Zhou
>>
>> Regards,
>> Christian.
>>
>> On 2016-07-28 12:13, Chunming Zhou wrote:
>>> Every vm has its own recovery entity, which is used to recover its
>>> page tables from their shadow.
>>> It doesn't need to wait for the preceding vm to complete.
>>> Using all pte rings also speeds up recovery.
>>>
>>> Every scheduler has its own recovery entity, which is used to save
>>> hw jobs and resubmit them; this solves the conflict between the
>>> reset thread and the scheduler thread when running jobs.
>>>
>>> Plus some fixes found while doing this improvement.
>>>
>>> Chunming Zhou (11):
>>>    drm/amdgpu: hw ring should be empty when gpu reset
>>>    drm/amdgpu: specify entity to amdgpu_copy_buffer
>>>    drm/amd: add recover run queue for scheduler
>>>    drm/amdgpu: fix vm init error path
>>>    drm/amdgpu: add vm recover entity
>>>    drm/amdgpu: use all pte rings to recover page table
>>>    drm/amd: add recover entity for every scheduler
>>>    drm/amd: use scheduler to recover hw jobs
>>>    drm/amd: hw job list should be exact
>>>    drm/amd: reset jobs to recover entity
>>>    drm/amdgpu: no need fence wait every time
>>>
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h           |   5 +-
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_benchmark.c |   3 +-
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    |  35 +++++--
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c      |  11 +++
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_test.c      |   8 +-
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |   5 +-
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |  26 ++++--
>>>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 129 
>>> +++++++++++++-------------
>>>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.h |   4 +-
>>>   9 files changed, 134 insertions(+), 92 deletions(-)
>>>
>>
>


* Re: [PATCH 00/11] add recovery entity
       [not found]             ` <a8ade743-309a-6982-7bad-a7c5648bd5e2-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
@ 2016-08-04  9:04               ` zhoucm1
  0 siblings, 0 replies; 25+ messages in thread
From: zhoucm1 @ 2016-08-04  9:04 UTC (permalink / raw)
  To: Christian König, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW



On 2016-08-04 16:39, Christian König wrote:
> On 2016-08-04 05:10, zhoucm1 wrote:
>>
>>
>> On 2016-08-03 21:43, Christian König wrote:
>>> Well that is a clear NAK to this whole approach.
>>>
>>> Submitting the recovery jobs to the scheduler is reentrant, because
>>> the scheduler is the one that originally signaled the timeout to us.
>> We have reset all the recovery jobs, right? Couldn't we treat those
>> jobs the same as the others?
>
> No, they aren't. For recovery jobs you don't want a timeout that
> triggers another GPU reset while the first one is still under way.
>
>>
>>>
>>> Why not submit the recovery jobs to the hardware ring directly?
>> Yeah, this is also what I did at the beginning.
>> The main reasons are:
>> 0. a recovery job needs to wait at least for its own page table
>> recovery to complete.
>
> Well, as noted in the other thread we need to recover the GART table 
> with the CPU anyway.
>
>> 1. direct submission uses run_job, which is used by the scheduler as
>> well, so it could introduce conflicts.
>
> The scheduler should be completely stopped during the GPU reset, so 
> there shouldn't be any other processing.
>
>> 2. if all vm clients use one sdma engine, restoring is slow. If each
>> vm can use its own pte ring, we can use all the sdma engines for
>> them.
>
> A single SDMA engine should be able to max out the PCIe speed in one
> direction, so there is no need to spread that across both engines. If
> we really need both engines, we could simply handle that in the
> recovery code as well.
>
>>
>> 3. if just one entity recovers all vm page tables, their recovery
>> jobs have a potential dependency: the later ones wait for the
>> earlier ones. If each vm has its own entity, there is no dependency
>> between them.
>> 4. if the recovery entity were based on the kernel run queue, the
>> recovery jobs could execute at the same time as the pt jobs, hence
>> the separate recover run queue.
>
> Well, that's exactly the reason why I don't want to push those jobs
> through the scheduler. The scheduler should be stopped during the GPU
> reset so that nothing else happens with the hardware.
>
> E.g. when other jobs run concurrently with the recovery jobs, you can
> have all kinds of problems, like one SDMA engine doing a recovery
> while the other one does a backup on the same BO, etc.
OK, I get your point. That means all recovery pt jobs on the pt
scheduler must be completed before directly submitting a recovery job,
which indeed simplifies many problems, especially the various kinds of
fence sync.

Regards,
David zhou
>
> Regards,
> Christian.
>
>>
>> That is why I introduce the recovery entity and recovery run queue.
>>
>> Regards,
>> David Zhou
>>>
>>> Regards,
>>> Christian.
>>>
>>> On 2016-07-28 12:13, Chunming Zhou wrote:
>>>> Every vm has its own recovery entity, which is used to recover its
>>>> page tables from their shadow.
>>>> It doesn't need to wait for the preceding vm to complete.
>>>> Using all pte rings also speeds up recovery.
>>>>
>>>> Every scheduler has its own recovery entity, which is used to save
>>>> hw jobs and resubmit them; this solves the conflict between the
>>>> reset thread and the scheduler thread when running jobs.
>>>>
>>>> Plus some fixes found while doing this improvement.
>>>>
>>>> Chunming Zhou (11):
>>>>    drm/amdgpu: hw ring should be empty when gpu reset
>>>>    drm/amdgpu: specify entity to amdgpu_copy_buffer
>>>>    drm/amd: add recover run queue for scheduler
>>>>    drm/amdgpu: fix vm init error path
>>>>    drm/amdgpu: add vm recover entity
>>>>    drm/amdgpu: use all pte rings to recover page table
>>>>    drm/amd: add recover entity for every scheduler
>>>>    drm/amd: use scheduler to recover hw jobs
>>>>    drm/amd: hw job list should be exact
>>>>    drm/amd: reset jobs to recover entity
>>>>    drm/amdgpu: no need fence wait every time
>>>>
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h           |   5 +-
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_benchmark.c |   3 +-
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    |  35 +++++--
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c      |  11 +++
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_test.c      |   8 +-
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |   5 +-
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |  26 ++++--
>>>>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 129 
>>>> +++++++++++++-------------
>>>>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.h |   4 +-
>>>>   9 files changed, 134 insertions(+), 92 deletions(-)
>>>>
>>>
>>
>



* [PATCH 03/11] drm/amd: add recover run queue for scheduler
       [not found] ` <1470124302-23615-1-git-send-email-David1.Zhou-5C7GfCeVMHo@public.gmane.org>
@ 2016-08-02  7:51   ` Chunming Zhou
  0 siblings, 0 replies; 25+ messages in thread
From: Chunming Zhou @ 2016-08-02  7:51 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Chunming Zhou

Change-Id: I7171d1e3884aabe1263d8f7be18cadf2e98216a4
Signed-off-by: Chunming Zhou <David1.Zhou@amd.com>
Reviewed-by: Edward O'Callaghan <funfunctor@folklore1984.net>
---
 drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
index a1c0073..cd87bc7 100644
--- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
+++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
@@ -112,7 +112,8 @@ struct amd_sched_backend_ops {
 };
 
 enum amd_sched_priority {
-	AMD_SCHED_PRIORITY_KERNEL = 0,
+	AMD_SCHED_PRIORITY_RECOVER = 0,
+	AMD_SCHED_PRIORITY_KERNEL,
 	AMD_SCHED_PRIORITY_NORMAL,
 	AMD_SCHED_MAX_PRIORITY
 };
-- 
1.9.1
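For reference, the effect of putting AMD_SCHED_PRIORITY_RECOVER at
index 0: the scheduler serves its run queues in ascending index order,
so recover jobs are picked before kernel and normal ones. A hedged
sketch of that selection loop, modeled on the 2016-era scheduler (the
real amd_sched_select_entity() may differ in detail):

    /* Lower enum value = served first: RECOVER preempts KERNEL,
     * which preempts NORMAL. */
    static struct amd_sched_entity *
    sched_select_entity(struct amd_gpu_scheduler *sched)
    {
            struct amd_sched_entity *entity = NULL;
            int i;

            for (i = 0; i < AMD_SCHED_MAX_PRIORITY; i++) {
                    entity = amd_sched_rq_select_entity(&sched->sched_rq[i]);
                    if (entity)
                            break;
            }
            return entity;
    }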


end of thread, other threads:[~2016-08-04  9:04 UTC | newest]

Thread overview: 25+ messages
-- links below jump to the message on this page --
2016-07-28 10:13 [PATCH 00/11] add recovery entity Chunming Zhou
     [not found] ` <1469700828-25650-1-git-send-email-David1.Zhou-5C7GfCeVMHo@public.gmane.org>
2016-07-28 10:13   ` [PATCH 01/11] drm/amdgpu: hw ring should be empty when gpu reset Chunming Zhou
     [not found]     ` <1469700828-25650-2-git-send-email-David1.Zhou-5C7GfCeVMHo@public.gmane.org>
2016-07-30  3:53       ` Edward O'Callaghan
2016-07-28 10:13   ` [PATCH 02/11] drm/amdgpu: specify entity to amdgpu_copy_buffer Chunming Zhou
2016-07-28 10:13   ` [PATCH 03/11] drm/amd: add recover run queue for scheduler Chunming Zhou
     [not found]     ` <1469700828-25650-4-git-send-email-David1.Zhou-5C7GfCeVMHo@public.gmane.org>
2016-07-30  3:52       ` Edward O'Callaghan
2016-07-28 10:13   ` [PATCH 04/11] drm/amdgpu: fix vm init error path Chunming Zhou
     [not found]     ` <1469700828-25650-5-git-send-email-David1.Zhou-5C7GfCeVMHo@public.gmane.org>
2016-07-30  3:41       ` Edward O'Callaghan
     [not found]         ` <7115f9e7-3afd-a693-3e23-1ad4acb3c700-dczkZgxz+BNUPWh3PAxdjQ@public.gmane.org>
2016-07-30  3:44           ` Edward O'Callaghan
2016-07-28 10:13   ` [PATCH 05/11] drm/amdgpu: add vm recover entity Chunming Zhou
     [not found]     ` <1469700828-25650-6-git-send-email-David1.Zhou-5C7GfCeVMHo@public.gmane.org>
2016-07-30  3:51       ` Edward O'Callaghan
2016-07-28 10:13   ` [PATCH 06/11] drm/amdgpu: use all pte rings to recover page table Chunming Zhou
2016-07-28 10:13   ` [PATCH 07/11] drm/amd: add recover entity for every scheduler Chunming Zhou
2016-07-28 10:13   ` [PATCH 08/11] drm/amd: use scheduler to recover hw jobs Chunming Zhou
2016-07-28 10:13   ` [PATCH 09/11] drm/amd: hw job list should be exact Chunming Zhou
     [not found]     ` <1469700828-25650-10-git-send-email-David1.Zhou-5C7GfCeVMHo@public.gmane.org>
2016-07-30  3:46       ` Edward O'Callaghan
2016-07-28 10:13   ` [PATCH 10/11] drm/amd: reset jobs to recover entity Chunming Zhou
2016-07-28 10:13   ` [PATCH 11/11] drm/amdgpu: no need fence wait every time Chunming Zhou
2016-08-02  2:06   ` [PATCH 00/11] add recovery entity zhoucm1
2016-08-03 13:43   ` Christian König
     [not found]     ` <233bde9b-f7ff-2697-1fd9-419d08f8f359-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
2016-08-04  3:10       ` zhoucm1
     [not found]         ` <57A2B217.2010100-5C7GfCeVMHo@public.gmane.org>
2016-08-04  8:39           ` Christian König
     [not found]             ` <a8ade743-309a-6982-7bad-a7c5648bd5e2-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
2016-08-04  9:04               ` zhoucm1
2016-08-02  7:51 [PATCH 00/11] add recovery entity and run queue Chunming Zhou
     [not found] ` <1470124302-23615-1-git-send-email-David1.Zhou-5C7GfCeVMHo@public.gmane.org>
2016-08-02  7:51   ` [PATCH 03/11] drm/amd: add recover run queue for scheduler Chunming Zhou
