* [PATCH 0/5] Rework amdgpu HW fence refcount and update scheduler parent fence refcount.
@ 2022-06-20 22:02 ` Andrey Grodzovsky
  0 siblings, 0 replies; 34+ messages in thread
From: Andrey Grodzovsky @ 2022-06-20 22:02 UTC (permalink / raw)
  To: dri-devel, amd-gfx; +Cc: jingwen.chen2, Christian.Koenig, monk.liu, yiqing.yao

Yiqing raised a problem of negative fence refcount for resubmitted jobs
in amdgpu and suggested a workaround in [1]. I took a look myself and
discovered some deeper problems in both amdgpu and the scheduler code.

Yiqing helped with testing the new code and also drew a detailed refcount and flow
tracing diagram for parent (HW) fence life cycle and refcount under various
cases for the proposed patchset at [2].

[1] - https://lore.kernel.org/all/731b7ff1-3cc9-e314-df2a-7c51b76d4db0@amd.com/t/#r00c728fcc069b1276642c325bfa9d82bf8fa21a3
[2] - https://drive.google.com/file/d/1yEoeW6OQC9WnwmzFW6NBLhFP_jD0xcHm/view?usp=sharing

Andrey Grodzovsky (5):
  drm/amdgpu: Fix possible refcount leak for release of
    external_hw_fence
  drm/amdgpu: Add put fence in amdgpu_fence_driver_clear_job_fences
  drm/amdgpu: Prevent race between late signaled fences and GPU reset.
  drm/sched: Partial revert of 'drm/sched: Keep s_fence->parent pointer'
  drm/amdgpu: Follow up change to previous drm scheduler change.

 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c |  2 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 27 ++++++++++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 37 ++++++++++++++++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 12 +++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  1 +
 drivers/gpu/drm/scheduler/sched_main.c     | 16 ++++++++--
 6 files changed, 78 insertions(+), 17 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 1/5] drm/amdgpu: Fix possible refcount leak for release of external_hw_fence
  2022-06-20 22:02 ` Andrey Grodzovsky
@ 2022-06-20 22:02   ` Andrey Grodzovsky
  -1 siblings, 0 replies; 34+ messages in thread
From: Andrey Grodzovsky @ 2022-06-20 22:02 UTC (permalink / raw)
  To: dri-devel, amd-gfx; +Cc: jingwen.chen2, Christian.Koenig, monk.liu, yiqing.yao

Problem:
In amdgpu_job_submit_direct the refcount should drop by 2,
but it drops only by 1:

amdgpu_ib_sched->emit -> refcount 1 from first fence init
dma_fence_get -> refcount 2
dma_fence_put -> refcount 1

Fix:
Add a put for external_hw_fence in amdgpu_job_free/free_cb.
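
The intended refcount flow for the external fence path then looks
roughly like this (illustrative sketch only, not the literal call
chain):

	/* amdgpu_ib_schedule() initialized the fence -> refcount 1 */
	job->external_hw_fence = dma_fence_get(fence);	/* refcount 2 */
	dma_fence_put(fence);				/* refcount 1, held by external_hw_fence */
	...
	/* later, in amdgpu_job_free/free_cb */
	dma_fence_put(job->external_hw_fence);		/* refcount 0, fence released */
	kfree(job);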

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 10aa073600d4..58568fdde2d0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -152,8 +152,10 @@ static void amdgpu_job_free_cb(struct drm_sched_job *s_job)
     /* only put the hw fence if has embedded fence */
 	if (job->hw_fence.ops != NULL)
 		dma_fence_put(&job->hw_fence);
-	else
+	else {
+		dma_fence_put(job->external_hw_fence);
 		kfree(job);
+	}
 }
 
 void amdgpu_job_free(struct amdgpu_job *job)
@@ -165,8 +167,10 @@ void amdgpu_job_free(struct amdgpu_job *job)
 	/* only put the hw fence if has embedded fence */
 	if (job->hw_fence.ops != NULL)
 		dma_fence_put(&job->hw_fence);
-	else
+	else {
+		dma_fence_put(job->external_hw_fence);
 		kfree(job);
+	}
 }
 
 int amdgpu_job_submit(struct amdgpu_job *job, struct drm_sched_entity *entity,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 2/5] drm/amdgpu: Add put fence in amdgpu_fence_driver_clear_job_fences
  2022-06-20 22:02 ` Andrey Grodzovsky
@ 2022-06-20 22:02   ` Andrey Grodzovsky
  -1 siblings, 0 replies; 34+ messages in thread
From: Andrey Grodzovsky @ 2022-06-20 22:02 UTC (permalink / raw)
  To: dri-devel, amd-gfx; +Cc: jingwen.chen2, Christian.Koenig, monk.liu, yiqing.yao

This function should drop the fence refcount when it extracts the
fence from the fence array, just as it's done in amdgpu_fence_process.
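
In sketch form, the invariant is that the fences array owns a
reference, so whoever clears an entry has to drop it (illustrative
only, not the exact amdgpu code):

	/* store: the ring's fence array takes its own reference */
	rcu_assign_pointer(*ptr, dma_fence_get(fence));

	/* clear: whoever removes the entry drops that reference */
	old = rcu_dereference_protected(*ptr, 1);
	RCU_INIT_POINTER(*ptr, NULL);
	dma_fence_put(old);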

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 957437a5558c..a9ae3beaa1d3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -595,8 +595,10 @@ void amdgpu_fence_driver_clear_job_fences(struct amdgpu_ring *ring)
 	for (i = 0; i <= ring->fence_drv.num_fences_mask; i++) {
 		ptr = &ring->fence_drv.fences[i];
 		old = rcu_dereference_protected(*ptr, 1);
-		if (old && old->ops == &amdgpu_job_fence_ops)
+		if (old && old->ops == &amdgpu_job_fence_ops) {
 			RCU_INIT_POINTER(*ptr, NULL);
+			dma_fence_put(old);
+		}
 	}
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 3/5] drm/amdgpu: Prevent race between late signaled fences and GPU reset.
  2022-06-20 22:02 ` Andrey Grodzovsky
@ 2022-06-20 22:03   ` Andrey Grodzovsky
  -1 siblings, 0 replies; 34+ messages in thread
From: Andrey Grodzovsky @ 2022-06-20 22:03 UTC (permalink / raw)
  To: dri-devel, amd-gfx; +Cc: jingwen.chen2, Christian.Koenig, monk.liu, yiqing.yao

Problem:
After we start handling timed-out jobs we assume their fences won't
signal, but we cannot be sure of that and sometimes they fire late. We
need to prevent concurrent access to the fence array by
amdgpu_fence_driver_clear_job_fences during GPU reset and by
amdgpu_fence_process from a late EOP interrupt.

Fix:
Before accessing the fence array during GPU reset, disable the EOP
interrupt and flush all pending interrupt handlers for the amdgpu
device's interrupt line.
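
In sketch form the quiescing around the fence array access looks
roughly like this, using the helper added below (per-ring loop
omitted):

	amdgpu_fence_driver_isr_toggle(adev, true);	/* put EOP irq refs + synchronize_irq() */

	/* safe now: no late amdgpu_fence_process can race with us here */
	amdgpu_fence_driver_force_completion(ring);

	amdgpu_fence_driver_isr_toggle(adev, false);	/* grab the EOP irq refs again */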

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 ++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 26 ++++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  1 +
 3 files changed, 31 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 2b92281dd0c1..c99541685804 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4605,6 +4605,8 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
 		amdgpu_virt_fini_data_exchange(adev);
 	}
 
+	amdgpu_fence_driver_isr_toggle(adev, true);
+
 	/* block all schedulers and reset given job's ring */
 	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
 		struct amdgpu_ring *ring = adev->rings[i];
@@ -4620,6 +4622,8 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
 		amdgpu_fence_driver_force_completion(ring);
 	}
 
+	amdgpu_fence_driver_isr_toggle(adev, false);
+
 	if (job && job->vm)
 		drm_sched_increase_karma(&job->base);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index a9ae3beaa1d3..d6d54ba4c185 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -532,6 +532,32 @@ void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev)
 	}
 }
 
+void amdgpu_fence_driver_isr_toggle(struct amdgpu_device *adev, bool stop)
+{
+	int i;
+
+	for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
+		struct amdgpu_ring *ring = adev->rings[i];
+
+		if (!ring || !ring->fence_drv.initialized || !ring->fence_drv.irq_src)
+			continue;
+
+		if (stop)
+			amdgpu_irq_put(adev, ring->fence_drv.irq_src,
+					       ring->fence_drv.irq_type);
+		else
+			amdgpu_irq_get(adev, ring->fence_drv.irq_src,
+					ring->fence_drv.irq_type);
+	}
+
+	/* TODO Only waits for irq handlers on other CPUs, maybe local_irq_save
+	 * local_irq_local_irq_restore are needed here for local interrupts ?
+	 *
+	 */
+	if (stop)
+		synchronize_irq(adev->irq.irq);
+}
+
 void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
 {
 	unsigned int i, j;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index 7d89a52091c0..82c178a9033a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -143,6 +143,7 @@ signed long amdgpu_fence_wait_polling(struct amdgpu_ring *ring,
 				      uint32_t wait_seq,
 				      signed long timeout);
 unsigned amdgpu_fence_count_emitted(struct amdgpu_ring *ring);
+void amdgpu_fence_driver_isr_toggle(struct amdgpu_device *adev, bool stop);
 
 /*
  * Rings.
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 4/5] drm/sched: Partial revert of 'drm/sched: Keep s_fence->parent pointer'
  2022-06-20 22:02 ` Andrey Grodzovsky
@ 2022-06-20 22:03   ` Andrey Grodzovsky
  -1 siblings, 0 replies; 34+ messages in thread
From: Andrey Grodzovsky @ 2022-06-20 22:03 UTC (permalink / raw)
  To: dri-devel, amd-gfx; +Cc: jingwen.chen2, Christian.Koenig, monk.liu, yiqing.yao

Problem:
This patch caused a negative refcount as described in [1]: in that case
the parent fence did not signal by the time of drm_sched_stop and hence
was kept in the pending list. The assumption was that such fences will
not signal, and so a reference was put to account for the
s_fence->parent refcount. But for amdgpu, which has an embedded HW
fence (always the same parent fence), drm_sched_fence_release_scheduled
was still always called and would drop the count for the parent fence
once more. For jobs that never signaled, this imbalance was masked by a
refcount bug in amdgpu_fence_driver_clear_job_fences that would not
drop the refcount on the fences removed from the fence driver's fences
array (against the previous insertion into the array by the get in
amdgpu_fence_emit).

Fix:
Revert this patch: by setting s_job->s_fence->parent to NULL as before,
prevent the extra refcount drop in amdgpu when
drm_sched_fence_release_scheduled is called on job release.

Also align the behaviour of drm_sched_resubmit_jobs_ext with that of
drm_sched_main when submitting jobs: take a refcount for the new parent
fence pointer and drop the refcount taken by the original kref_init of
the new HW fence (or the fake new HW fence in amdgpu - see next patch).
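
In sketch form both submission paths then handle the fence returned by
run_job the same way (see the hunks below):

	fence = sched->ops->run_job(s_job);
	if (!IS_ERR_OR_NULL(fence)) {
		/* the parent pointer takes its own reference ... */
		s_job->s_fence->parent = dma_fence_get(fence);
		/* ... and the reference from the fence's kref_init is dropped */
		dma_fence_put(fence);
	} else {
		s_job->s_fence->parent = NULL;
	}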

[1] - https://lore.kernel.org/all/731b7ff1-3cc9-e314-df2a-7c51b76d4db0@amd.com/t/#r00c728fcc069b1276642c325bfa9d82bf8fa21a3

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Tested-by: Yiqing Yao <yiqing.yao@amd.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index b81fceb0b8a2..b38394f5694f 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -419,6 +419,11 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
 		if (s_job->s_fence->parent &&
 		    dma_fence_remove_callback(s_job->s_fence->parent,
 					      &s_job->cb)) {
+			/* Revert drm/sched: Keep s_fence->parent pointer, no
+			 * need anymore for amdgpu and creates only troubles
+			 */
+			dma_fence_put(s_job->s_fence->parent);
+			s_job->s_fence->parent = NULL;
 			atomic_dec(&sched->hw_rq_count);
 		} else {
 			/*
@@ -548,7 +553,6 @@ void drm_sched_resubmit_jobs_ext(struct drm_gpu_scheduler *sched, int max)
 		if (found_guilty && s_job->s_fence->scheduled.context == guilty_context)
 			dma_fence_set_error(&s_fence->finished, -ECANCELED);
 
-		dma_fence_put(s_job->s_fence->parent);
 		fence = sched->ops->run_job(s_job);
 		i++;
 
@@ -558,7 +562,11 @@ void drm_sched_resubmit_jobs_ext(struct drm_gpu_scheduler *sched, int max)
 
 			s_job->s_fence->parent = NULL;
 		} else {
-			s_job->s_fence->parent = fence;
+
+			s_job->s_fence->parent = dma_fence_get(fence);
+
+			/* Drop for orignal kref_init */
+			dma_fence_put(fence);
 		}
 	}
 }
@@ -952,6 +960,9 @@ static int drm_sched_main(void *param)
 
 		if (!IS_ERR_OR_NULL(fence)) {
 			s_fence->parent = dma_fence_get(fence);
+			/* Drop for original kref_init of the fence */
+			dma_fence_put(fence);
+
 			r = dma_fence_add_callback(fence, &sched_job->cb,
 						   drm_sched_job_done_cb);
 			if (r == -ENOENT)
@@ -959,7 +970,6 @@ static int drm_sched_main(void *param)
 			else if (r)
 				DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n",
 					  r);
-			dma_fence_put(fence);
 		} else {
 			if (IS_ERR(fence))
 				dma_fence_set_error(&s_fence->finished, PTR_ERR(fence));
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 5/5] drm/amdgpu: Follow up change to previous drm scheduler change.
  2022-06-20 22:02 ` Andrey Grodzovsky
@ 2022-06-20 22:03   ` Andrey Grodzovsky
  -1 siblings, 0 replies; 34+ messages in thread
From: Andrey Grodzovsky @ 2022-06-20 22:03 UTC (permalink / raw)
  To: dri-devel, amd-gfx; +Cc: jingwen.chen2, Christian.Koenig, monk.liu, yiqing.yao

Align the refcount behaviour of the amdgpu_job embedded HW fence with
classic pointer-style HW fences by increasing the refcount each time
emit is called, so amdgpu code doesn't need workarounds using
amdgpu_job.job_run_counter to keep the HW fence refcount balanced.

Also, since the previous patch resumed setting s_fence->parent to NULL
in drm_sched_stop, switch to directly checking whether job->hw_fence is
signaled to short-circuit the reset if it has already signaled.
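
The emit side then hands out one reference per emit, roughly as follows
(see the amdgpu_fence_emit hunk below):

	if (job && job->job_run_counter) {
		/* resubmitted job: reuse the embedded fence, only reinit the seqno */
		fence->seqno = seq;
		dma_fence_get(fence);	/* reference handed out by this emit */
	} else if (job) {
		dma_fence_init(fence, &amdgpu_job_fence_ops, &ring->fence_drv.lock,
			       adev->fence_context + ring->idx, seq);
		dma_fence_get(fence);	/* balanced by the put in amdgpu_job_free/free_cb */
	}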

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Tested-by: Yiqing Yao <yiqing.yao@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c |  2 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 23 ++++++++++++++++------
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  7 ++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  4 ----
 4 files changed, 25 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 513c57f839d8..447bd92c4856 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -684,6 +684,8 @@ int amdgpu_amdkfd_submit_ib(struct amdgpu_device *adev,
 		goto err_ib_sched;
 	}
 
+	/* Drop the initial kref_init count (see drm_sched_main as example) */
+	dma_fence_put(f);
 	ret = dma_fence_wait(f, false);
 
 err_ib_sched:
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index c99541685804..f9718119834f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5009,16 +5009,28 @@ static void amdgpu_device_recheck_guilty_jobs(
 
 		/* clear job's guilty and depend the folowing step to decide the real one */
 		drm_sched_reset_karma(s_job);
-		/* for the real bad job, it will be resubmitted twice, adding a dma_fence_get
-		 * to make sure fence is balanced */
-		dma_fence_get(s_job->s_fence->parent);
 		drm_sched_resubmit_jobs_ext(&ring->sched, 1);
 
+		if (!s_job->s_fence->parent) {
+			DRM_WARN("Failed to get a HW fence for job!");
+			continue;
+		}
+
 		ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched.timeout);
 		if (ret == 0) { /* timeout */
 			DRM_ERROR("Found the real bad job! ring:%s, job_id:%llx\n",
 						ring->sched.name, s_job->id);
 
+
+			/* Clear this failed job from fence array */
+			amdgpu_fence_driver_clear_job_fences(ring);
+
+			/* Since the job won't signal and we go for
+			 * another resubmit drop this parent pointer
+			 */
+			dma_fence_put(s_job->s_fence->parent);
+			s_job->s_fence->parent = NULL;
+
 			/* set guilty */
 			drm_sched_increase_karma(s_job);
 retry:
@@ -5047,7 +5059,6 @@ static void amdgpu_device_recheck_guilty_jobs(
 
 		/* got the hw fence, signal finished fence */
 		atomic_dec(ring->sched.score);
-		dma_fence_put(s_job->s_fence->parent);
 		dma_fence_get(&s_job->s_fence->finished);
 		dma_fence_signal(&s_job->s_fence->finished);
 		dma_fence_put(&s_job->s_fence->finished);
@@ -5220,8 +5231,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	 *
 	 * job->base holds a reference to parent fence
 	 */
-	if (job && job->base.s_fence->parent &&
-	    dma_fence_is_signaled(job->base.s_fence->parent)) {
+	if (job && (job->hw_fence.ops != NULL) &&
+	    dma_fence_is_signaled(&job->hw_fence)) {
 		job_signaled = true;
 		dev_info(adev->dev, "Guilty job already signaled, skipping HW reset");
 		goto skip_hw_reset;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index d6d54ba4c185..9bd4e18212fc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -164,11 +164,16 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f, struct amd
 	if (job && job->job_run_counter) {
 		/* reinit seq for resubmitted jobs */
 		fence->seqno = seq;
+		/* TO be inline with external fence creation and other drivers */
+		dma_fence_get(fence);
 	} else {
-		if (job)
+		if (job) {
 			dma_fence_init(fence, &amdgpu_job_fence_ops,
 				       &ring->fence_drv.lock,
 				       adev->fence_context + ring->idx, seq);
+			/* Against remove in amdgpu_job_{free, free_cb} */
+			dma_fence_get(fence);
+		}
 		else
 			dma_fence_init(fence, &amdgpu_fence_ops,
 				       &ring->fence_drv.lock,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 58568fdde2d0..638e1d600258 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -267,10 +267,6 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
 			DRM_ERROR("Error scheduling IBs (%d)\n", r);
 	}
 
-	if (!job->job_run_counter)
-		dma_fence_get(fence);
-	else if (finished->error < 0)
-		dma_fence_put(&job->hw_fence);
 	job->job_run_counter++;
 	amdgpu_job_free_resources(job);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] drm/amdgpu: Fix possible refcount leak for release of external_hw_fence
  2022-06-20 22:02   ` Andrey Grodzovsky
  (?)
@ 2022-06-21  7:19   ` Christian König
  2022-06-21 19:34     ` Andrey Grodzovsky
  -1 siblings, 1 reply; 34+ messages in thread
From: Christian König @ 2022-06-21  7:19 UTC (permalink / raw)
  To: Andrey Grodzovsky, dri-devel, amd-gfx
  Cc: jingwen.chen2, Christian.Koenig, monk.liu, yiqing.yao

Am 21.06.22 um 00:02 schrieb Andrey Grodzovsky:
> Problem:
> In amdgpu_job_submit_direct - The refcount should drop by 2
> but it drops only by 1.
>
> amdgpu_ib_sched->emit -> refcount 1 from first fence init
> dma_fence_get -> refcount 2
> dma_fence_put -> refcount 1
>
> Fix:
> Add put for external_hw_fence in amdgpu_job_free/free_cb

Well what is the external_hw_fence good for in this construct?

>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 8 ++++++--
>   1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 10aa073600d4..58568fdde2d0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -152,8 +152,10 @@ static void amdgpu_job_free_cb(struct drm_sched_job *s_job)
>       /* only put the hw fence if has embedded fence */
>   	if (job->hw_fence.ops != NULL)
>   		dma_fence_put(&job->hw_fence);
> -	else
> +	else {

When one side of the if uses {} the other side should use {} as well, 
e.g. use } else { here.

Christian.

> +		dma_fence_put(job->external_hw_fence);
>   		kfree(job);
> +	}
>   }
>   
>   void amdgpu_job_free(struct amdgpu_job *job)
> @@ -165,8 +167,10 @@ void amdgpu_job_free(struct amdgpu_job *job)
>   	/* only put the hw fence if has embedded fence */
>   	if (job->hw_fence.ops != NULL)
>   		dma_fence_put(&job->hw_fence);
> -	else
> +	else {
> +		dma_fence_put(job->external_hw_fence);
>   		kfree(job);
> +	}
>   }
>   
>   int amdgpu_job_submit(struct amdgpu_job *job, struct drm_sched_entity *entity,


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/5] drm/amdgpu: Add put fence in amdgpu_fence_driver_clear_job_fences
  2022-06-20 22:02   ` Andrey Grodzovsky
  (?)
@ 2022-06-21  7:21   ` Christian König
  -1 siblings, 0 replies; 34+ messages in thread
From: Christian König @ 2022-06-21  7:21 UTC (permalink / raw)
  To: Andrey Grodzovsky, dri-devel, amd-gfx
  Cc: jingwen.chen2, Christian.Koenig, monk.liu, yiqing.yao

Am 21.06.22 um 00:02 schrieb Andrey Grodzovsky:
> This function should drop the fence refcount when it extracts the
> fence from the fence array, just as it's done in amdgpu_fence_process.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 4 +++-
>   1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 957437a5558c..a9ae3beaa1d3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -595,8 +595,10 @@ void amdgpu_fence_driver_clear_job_fences(struct amdgpu_ring *ring)
>   	for (i = 0; i <= ring->fence_drv.num_fences_mask; i++) {
>   		ptr = &ring->fence_drv.fences[i];
>   		old = rcu_dereference_protected(*ptr, 1);
> -		if (old && old->ops == &amdgpu_job_fence_ops)
> +		if (old && old->ops == &amdgpu_job_fence_ops) {
>   			RCU_INIT_POINTER(*ptr, NULL);
> +			dma_fence_put(old);
> +		}
>   	}
>   }
>   


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/5] drm/amdgpu: Prevent race between late signaled fences and GPU reset.
  2022-06-20 22:03   ` Andrey Grodzovsky
  (?)
@ 2022-06-21  7:25   ` Christian König
  2022-06-21 19:45     ` Andrey Grodzovsky
  -1 siblings, 1 reply; 34+ messages in thread
From: Christian König @ 2022-06-21  7:25 UTC (permalink / raw)
  To: Andrey Grodzovsky, dri-devel, amd-gfx
  Cc: jingwen.chen2, Christian.Koenig, monk.liu, yiqing.yao

Am 21.06.22 um 00:03 schrieb Andrey Grodzovsky:
> Problem:
> After we start handling timed out jobs we assume their fences won't be
> signaled but we cannot be sure and sometimes they fire late. We need
> to prevent concurrent accesses to fence array from
> amdgpu_fence_driver_clear_job_fences during GPU reset and amdgpu_fence_process
> from a late EOP interrupt.
>
> Fix:
> Before accessing fence array in GPU disable EOP interrupt and flush
> all pending interrupt handlers for amdgpu device's interrupt line.

>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 ++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 26 ++++++++++++++++++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  1 +
>   3 files changed, 31 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 2b92281dd0c1..c99541685804 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4605,6 +4605,8 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>   		amdgpu_virt_fini_data_exchange(adev);
>   	}
>   
> +	amdgpu_fence_driver_isr_toggle(adev, true);
> +
>   	/* block all schedulers and reset given job's ring */
>   	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>   		struct amdgpu_ring *ring = adev->rings[i];
> @@ -4620,6 +4622,8 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>   		amdgpu_fence_driver_force_completion(ring);
>   	}
>   
> +	amdgpu_fence_driver_isr_toggle(adev, false);
> +
>   	if (job && job->vm)
>   		drm_sched_increase_karma(&job->base);
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index a9ae3beaa1d3..d6d54ba4c185 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -532,6 +532,32 @@ void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev)
>   	}
>   }
>   
> +void amdgpu_fence_driver_isr_toggle(struct amdgpu_device *adev, bool stop)
> +{
> +	int i;
> +
> +	for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
> +		struct amdgpu_ring *ring = adev->rings[i];
> +
> +		if (!ring || !ring->fence_drv.initialized || !ring->fence_drv.irq_src)
> +			continue;
> +
> +		if (stop)
> +			amdgpu_irq_put(adev, ring->fence_drv.irq_src,
> +					       ring->fence_drv.irq_type);
> +		else
> +			amdgpu_irq_get(adev, ring->fence_drv.irq_src,
> +					ring->fence_drv.irq_type);

That won't work like this. This increments/decrements the reference 
count for the IRQ, but doesn't guarantee in any way that they are 
stopped/started.

> +	}
> +
> +	/* TODO Only waits for irq handlers on other CPUs, maybe local_irq_save
> +	 * local_irq_local_irq_restore are needed here for local interrupts ?
> +	 *
> +	 */

Well that comment made me smile. Think for a moment what the local CPU 
would be doing if an interrupt would run :)

Cheers,
Christian.

> +	if (stop)
> +		synchronize_irq(adev->irq.irq);
> +}
> +
>   void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
>   {
>   	unsigned int i, j;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> index 7d89a52091c0..82c178a9033a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> @@ -143,6 +143,7 @@ signed long amdgpu_fence_wait_polling(struct amdgpu_ring *ring,
>   				      uint32_t wait_seq,
>   				      signed long timeout);
>   unsigned amdgpu_fence_count_emitted(struct amdgpu_ring *ring);
> +void amdgpu_fence_driver_isr_toggle(struct amdgpu_device *adev, bool stop);
>   
>   /*
>    * Rings.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/5] drm/sched: Partial revert of 'drm/sched: Keep s_fence->parent pointer'
  2022-06-20 22:03   ` Andrey Grodzovsky
  (?)
@ 2022-06-21  7:26   ` Christian König
  -1 siblings, 0 replies; 34+ messages in thread
From: Christian König @ 2022-06-21  7:26 UTC (permalink / raw)
  To: Andrey Grodzovsky, dri-devel, amd-gfx
  Cc: jingwen.chen2, Christian.Koenig, monk.liu, yiqing.yao

Am 21.06.22 um 00:03 schrieb Andrey Grodzovsky:
> Problem:
> This patch caused negative refcount as described in [1] because
> for that case parent fence did not signal by the time of drm_sched_stop and hence
> kept in pending list the assumption was they will not signal and
> so fence was put to account for the s_fence->parent refcount but for
> amdgpu which has embedded HW fence (always same parent fence)
> drm_sched_fence_release_scheduled was always called and would
> still drop the count for parent fence once more. For jobs that
> never signaled this imbalance was masked by refcount bug in
> amdgpu_fence_driver_clear_job_fences that would not drop
> refcount on the fences that were removed from fence drive
> fences array (against prevois insertion into the array in
> get in amdgpu_fence_emit).
>
> Fix:
> Revert this patch and by setting s_job->s_fence->parent to NULL
> as before prevent the extra refcount drop in amdgpu when
> drm_sched_fence_release_scheduled is called on job release.
>
> Also - align behaviour in drm_sched_resubmit_jobs_ext with that of
> drm_sched_main when submitting jobs - take a refcount for the
> new parent fence pointer and drop refcount for original kref_init
> for new HW fence creation (or fake new HW fence in amdgpu - see next patch).
>
> [1] - https://lore.kernel.org/all/731b7ff1-3cc9-e314-df2a-7c51b76d4db0@amd.com/t/#r00c728fcc069b1276642c325bfa9d82bf8fa21a3
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> Tested-by: Yiqing Yao <yiqing.yao@amd.com>
> ---
>   drivers/gpu/drm/scheduler/sched_main.c | 16 +++++++++++++---
>   1 file changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index b81fceb0b8a2..b38394f5694f 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -419,6 +419,11 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>   		if (s_job->s_fence->parent &&
>   		    dma_fence_remove_callback(s_job->s_fence->parent,
>   					      &s_job->cb)) {
> +			/* Revert drm/sched: Keep s_fence->parent pointer, no
> +			 * need anymore for amdgpu and creates only troubles
> +			 */

Please no amdgpu specific comments in common code.

Apart from that looks like a step into the right direction to me.

Regards,
Christian.

> +			dma_fence_put(s_job->s_fence->parent);
> +			s_job->s_fence->parent = NULL;
>   			atomic_dec(&sched->hw_rq_count);
>   		} else {
>   			/*
> @@ -548,7 +553,6 @@ void drm_sched_resubmit_jobs_ext(struct drm_gpu_scheduler *sched, int max)
>   		if (found_guilty && s_job->s_fence->scheduled.context == guilty_context)
>   			dma_fence_set_error(&s_fence->finished, -ECANCELED);
>   
> -		dma_fence_put(s_job->s_fence->parent);
>   		fence = sched->ops->run_job(s_job);
>   		i++;
>   
> @@ -558,7 +562,11 @@ void drm_sched_resubmit_jobs_ext(struct drm_gpu_scheduler *sched, int max)
>   
>   			s_job->s_fence->parent = NULL;
>   		} else {
> -			s_job->s_fence->parent = fence;
> +
> +			s_job->s_fence->parent = dma_fence_get(fence);
> +
> +			/* Drop for orignal kref_init */
> +			dma_fence_put(fence);
>   		}
>   	}
>   }
> @@ -952,6 +960,9 @@ static int drm_sched_main(void *param)
>   
>   		if (!IS_ERR_OR_NULL(fence)) {
>   			s_fence->parent = dma_fence_get(fence);
> +			/* Drop for original kref_init of the fence */
> +			dma_fence_put(fence);
> +
>   			r = dma_fence_add_callback(fence, &sched_job->cb,
>   						   drm_sched_job_done_cb);
>   			if (r == -ENOENT)
> @@ -959,7 +970,6 @@ static int drm_sched_main(void *param)
>   			else if (r)
>   				DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n",
>   					  r);
> -			dma_fence_put(fence);
>   		} else {
>   			if (IS_ERR(fence))
>   				dma_fence_set_error(&s_fence->finished, PTR_ERR(fence));


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] drm/amdgpu: Follow up change to previous drm scheduler change.
  2022-06-20 22:03   ` Andrey Grodzovsky
  (?)
@ 2022-06-21  7:28   ` Christian König
  2022-06-21 20:00     ` Andrey Grodzovsky
  -1 siblings, 1 reply; 34+ messages in thread
From: Christian König @ 2022-06-21  7:28 UTC (permalink / raw)
  To: Andrey Grodzovsky, dri-devel, amd-gfx
  Cc: jingwen.chen2, Christian.Koenig, monk.liu, yiqing.yao

Am 21.06.22 um 00:03 schrieb Andrey Grodzovsky:
> Align refcount behaviour for amdgpu_job embedded HW fence with
> classic pointer style HW fences by increasing refcount each
> time emit is called so amdgpu code doesn't need to make workarounds
> using amdgpu_job.job_run_counter to keep the HW fence refcount balanced.

Could we now also remove job_run_counter?

Christian.

>
> Also since in the previous patch we resumed setting s_fence->parent to NULL
> in drm_sched_stop switch to directly checking if job->hw_fence is
> signaled to short circuit reset if already signed.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> Tested-by: Yiqing Yao <yiqing.yao@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c |  2 ++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 23 ++++++++++++++++------
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  7 ++++++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  4 ----
>   4 files changed, 25 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index 513c57f839d8..447bd92c4856 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> @@ -684,6 +684,8 @@ int amdgpu_amdkfd_submit_ib(struct amdgpu_device *adev,
>   		goto err_ib_sched;
>   	}
>   
> +	/* Drop the initial kref_init count (see drm_sched_main as example) */
> +	dma_fence_put(f);
>   	ret = dma_fence_wait(f, false);
>   
>   err_ib_sched:
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index c99541685804..f9718119834f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5009,16 +5009,28 @@ static void amdgpu_device_recheck_guilty_jobs(
>   
>   		/* clear job's guilty and depend the folowing step to decide the real one */
>   		drm_sched_reset_karma(s_job);
> -		/* for the real bad job, it will be resubmitted twice, adding a dma_fence_get
> -		 * to make sure fence is balanced */
> -		dma_fence_get(s_job->s_fence->parent);
>   		drm_sched_resubmit_jobs_ext(&ring->sched, 1);
>   
> +		if (!s_job->s_fence->parent) {
> +			DRM_WARN("Failed to get a HW fence for job!");
> +			continue;
> +		}
> +
>   		ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched.timeout);
>   		if (ret == 0) { /* timeout */
>   			DRM_ERROR("Found the real bad job! ring:%s, job_id:%llx\n",
>   						ring->sched.name, s_job->id);
>   
> +
> +			/* Clear this failed job from fence array */
> +			amdgpu_fence_driver_clear_job_fences(ring);
> +
> +			/* Since the job won't signal and we go for
> +			 * another resubmit drop this parent pointer
> +			 */
> +			dma_fence_put(s_job->s_fence->parent);
> +			s_job->s_fence->parent = NULL;
> +
>   			/* set guilty */
>   			drm_sched_increase_karma(s_job);
>   retry:
> @@ -5047,7 +5059,6 @@ static void amdgpu_device_recheck_guilty_jobs(
>   
>   		/* got the hw fence, signal finished fence */
>   		atomic_dec(ring->sched.score);
> -		dma_fence_put(s_job->s_fence->parent);
>   		dma_fence_get(&s_job->s_fence->finished);
>   		dma_fence_signal(&s_job->s_fence->finished);
>   		dma_fence_put(&s_job->s_fence->finished);
> @@ -5220,8 +5231,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   	 *
>   	 * job->base holds a reference to parent fence
>   	 */
> -	if (job && job->base.s_fence->parent &&
> -	    dma_fence_is_signaled(job->base.s_fence->parent)) {
> +	if (job && (job->hw_fence.ops != NULL) &&
> +	    dma_fence_is_signaled(&job->hw_fence)) {
>   		job_signaled = true;
>   		dev_info(adev->dev, "Guilty job already signaled, skipping HW reset");
>   		goto skip_hw_reset;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index d6d54ba4c185..9bd4e18212fc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -164,11 +164,16 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f, struct amd
>   	if (job && job->job_run_counter) {
>   		/* reinit seq for resubmitted jobs */
>   		fence->seqno = seq;
> +		/* TO be inline with external fence creation and other drivers */
> +		dma_fence_get(fence);
>   	} else {
> -		if (job)
> +		if (job) {
>   			dma_fence_init(fence, &amdgpu_job_fence_ops,
>   				       &ring->fence_drv.lock,
>   				       adev->fence_context + ring->idx, seq);
> +			/* Against remove in amdgpu_job_{free, free_cb} */
> +			dma_fence_get(fence);
> +		}
>   		else
>   			dma_fence_init(fence, &amdgpu_fence_ops,
>   				       &ring->fence_drv.lock,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 58568fdde2d0..638e1d600258 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -267,10 +267,6 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
>   			DRM_ERROR("Error scheduling IBs (%d)\n", r);
>   	}
>   
> -	if (!job->job_run_counter)
> -		dma_fence_get(fence);
> -	else if (finished->error < 0)
> -		dma_fence_put(&job->hw_fence);
>   	job->job_run_counter++;
>   	amdgpu_job_free_resources(job);
>   


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] drm/amdgpu: Fix possible refcount leak for release of external_hw_fence
  2022-06-21  7:19   ` Christian König
@ 2022-06-21 19:34     ` Andrey Grodzovsky
  2022-06-22  9:00       ` Christian König
  0 siblings, 1 reply; 34+ messages in thread
From: Andrey Grodzovsky @ 2022-06-21 19:34 UTC (permalink / raw)
  To: Christian König, dri-devel, amd-gfx
  Cc: jingwen.chen2, Christian.Koenig, monk.liu, yiqing.yao

On 2022-06-21 03:19, Christian König wrote:

> Am 21.06.22 um 00:02 schrieb Andrey Grodzovsky:
>> Problem:
>> In amdgpu_job_submit_direct - The refcount should drop by 2
>> but it drops only by 1.
>>
>> amdgpu_ib_sched->emit -> refcount 1 from first fence init
>> dma_fence_get -> refcount 2
>> dma_fence_put -> refcount 1
>>
>> Fix:
>> Add put for external_hw_fence in amdgpu_job_free/free_cb
>
> Well what is the external_hw_fence good for in this construct?


As far as I understand, for direct submissions you don't want to pass a
job pointer to ib_schedule, and so you can't use the embedded fence for
this case.

Andrey


>
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 8 ++++++--
>>   1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index 10aa073600d4..58568fdde2d0 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -152,8 +152,10 @@ static void amdgpu_job_free_cb(struct 
>> drm_sched_job *s_job)
>>       /* only put the hw fence if has embedded fence */
>>       if (job->hw_fence.ops != NULL)
>>           dma_fence_put(&job->hw_fence);
>> -    else
>> +    else {
>
> When one side of the if uses {} the other side should use {} as well, 
> e.g. use } else { here.
>
> Christian.
>
>> + dma_fence_put(job->external_hw_fence);
>>           kfree(job);
>> +    }
>>   }
>>     void amdgpu_job_free(struct amdgpu_job *job)
>> @@ -165,8 +167,10 @@ void amdgpu_job_free(struct amdgpu_job *job)
>>       /* only put the hw fence if has embedded fence */
>>       if (job->hw_fence.ops != NULL)
>>           dma_fence_put(&job->hw_fence);
>> -    else
>> +    else {
>> +        dma_fence_put(job->external_hw_fence);
>>           kfree(job);
>> +    }
>>   }
>>     int amdgpu_job_submit(struct amdgpu_job *job, struct 
>> drm_sched_entity *entity,
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/5] drm/amdgpu: Prevent race between late signaled fences and GPU reset.
  2022-06-21  7:25   ` Christian König
@ 2022-06-21 19:45     ` Andrey Grodzovsky
  2022-06-22  1:47       ` VURDIGERENATARAJ, CHANDAN
  2022-06-22 17:31       ` Andrey Grodzovsky
  0 siblings, 2 replies; 34+ messages in thread
From: Andrey Grodzovsky @ 2022-06-21 19:45 UTC (permalink / raw)
  To: Christian König, dri-devel, amd-gfx
  Cc: jingwen.chen2, Christian.Koenig, monk.liu, yiqing.yao


On 2022-06-21 03:25, Christian König wrote:
> Am 21.06.22 um 00:03 schrieb Andrey Grodzovsky:
>> Problem:
>> After we start handling timed out jobs we assume their fences won't be
>> signaled but we cannot be sure and sometimes they fire late. We need
>> to prevent concurrent accesses to fence array from
>> amdgpu_fence_driver_clear_job_fences during GPU reset and 
>> amdgpu_fence_process
>> from a late EOP interrupt.
>>
>> Fix:
>> Before accessing fence array in GPU disable EOP interrupt and flush
>> all pending interrupt handlers for amdgpu device's interrupt line.
>
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 ++++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 26 ++++++++++++++++++++++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  1 +
>>   3 files changed, 31 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 2b92281dd0c1..c99541685804 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4605,6 +4605,8 @@ int amdgpu_device_pre_asic_reset(struct 
>> amdgpu_device *adev,
>>           amdgpu_virt_fini_data_exchange(adev);
>>       }
>>   +    amdgpu_fence_driver_isr_toggle(adev, true);
>> +
>>       /* block all schedulers and reset given job's ring */
>>       for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>           struct amdgpu_ring *ring = adev->rings[i];
>> @@ -4620,6 +4622,8 @@ int amdgpu_device_pre_asic_reset(struct 
>> amdgpu_device *adev,
>>           amdgpu_fence_driver_force_completion(ring);
>>       }
>>   +    amdgpu_fence_driver_isr_toggle(adev, false);
>> +
>>       if (job && job->vm)
>>           drm_sched_increase_karma(&job->base);
>>   diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> index a9ae3beaa1d3..d6d54ba4c185 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> @@ -532,6 +532,32 @@ void amdgpu_fence_driver_hw_fini(struct 
>> amdgpu_device *adev)
>>       }
>>   }
>>   +void amdgpu_fence_driver_isr_toggle(struct amdgpu_device *adev, 
>> bool stop)
>> +{
>> +    int i;
>> +
>> +    for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
>> +        struct amdgpu_ring *ring = adev->rings[i];
>> +
>> +        if (!ring || !ring->fence_drv.initialized || 
>> !ring->fence_drv.irq_src)
>> +            continue;
>> +
>> +        if (stop)
>> +            amdgpu_irq_put(adev, ring->fence_drv.irq_src,
>> +                           ring->fence_drv.irq_type);
>> +        else
>> +            amdgpu_irq_get(adev, ring->fence_drv.irq_src,
>> +                    ring->fence_drv.irq_type);
>
> That won't work like this. This increments/decrements the reference 
> count for the IRQ, but doesn't guarantee in any way that they are 
> stopped/started.


I understand that, I just assumed that the fence driver is the only
holder of this interrupt source (e.g. regCP_INT_CNTL_RING0)?
I can disable the amdgpu interrupt line entirely using disable_irq - would
this be better?
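
Something like this is what I mean with disable_irq (untested sketch):

    void amdgpu_fence_driver_isr_toggle(struct amdgpu_device *adev, bool stop)
    {
        /* mask the whole amdgpu interrupt line instead of dropping the
         * per-ring EOP irq refcount; disable_irq() also waits for any
         * handler that is currently running on another CPU
         */
        if (stop)
            disable_irq(adev->irq.irq);
        else
            enable_irq(adev->irq.irq);
    }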


>
>
>> +    }
>> +
>> +    /* TODO Only waits for irq handlers on other CPUs, maybe 
>> local_irq_save
>> +     * local_irq_local_irq_restore are needed here for local 
>> interrupts ?
>> +     *
>> +     */
>
> Well that comment made me smile. Think for a moment what the local CPU 
> would be doing if an interrupt would run :)


No, I understand this of course, I am ok with being interrupted by an
interrupt handler at this point. What I am trying to do
is prevent amdgpu_fence_process from running concurrently with
amdgpu_fence_driver_clear_job_fences - that is what this
function is for - I disable and flush pending EOP ISR
handlers before the call to clear fences and re-enable them after.
I guess we could also introduce a spinlock to serialize them? Yiqing
reported seeing a race between them, so we have to do something.
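
Roughly what I have in mind for the spinlock option (untested sketch, the
lock name is made up):

    /* hypothetical lock shared by the EOP path and the reset path */
    static DEFINE_SPINLOCK(fence_clear_lock);

    /* in the EOP interrupt handler (hardirq context) */
    spin_lock(&fence_clear_lock);
    amdgpu_fence_process(ring);
    spin_unlock(&fence_clear_lock);

    /* in the reset path (process context) */
    unsigned long flags;

    spin_lock_irqsave(&fence_clear_lock, flags);
    amdgpu_fence_driver_clear_job_fences(ring);
    spin_unlock_irqrestore(&fence_clear_lock, flags);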

Andrey


>
> Cheers,
> Christian.
>
>> +    if (stop)
>> +        synchronize_irq(adev->irq.irq);
>> +}
>> +
>>   void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
>>   {
>>       unsigned int i, j;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>> index 7d89a52091c0..82c178a9033a 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>> @@ -143,6 +143,7 @@ signed long amdgpu_fence_wait_polling(struct 
>> amdgpu_ring *ring,
>>                         uint32_t wait_seq,
>>                         signed long timeout);
>>   unsigned amdgpu_fence_count_emitted(struct amdgpu_ring *ring);
>> +void amdgpu_fence_driver_isr_toggle(struct amdgpu_device *adev, bool 
>> stop);
>>     /*
>>    * Rings.
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] drm/amdgpu: Follow up change to previous drm scheduler change.
  2022-06-21  7:28   ` Christian König
@ 2022-06-21 20:00     ` Andrey Grodzovsky
  2022-06-22  7:17       ` Christian König
  0 siblings, 1 reply; 34+ messages in thread
From: Andrey Grodzovsky @ 2022-06-21 20:00 UTC (permalink / raw)
  To: Christian König, dri-devel, amd-gfx
  Cc: jingwen.chen2, Christian.Koenig, monk.liu, yiqing.yao


On 2022-06-21 03:28, Christian König wrote:
> Am 21.06.22 um 00:03 schrieb Andrey Grodzovsky:
>> Align refcount behaviour for amdgpu_job embedded HW fence with
>> classic pointer style HW fences by increasing refcount each
>> time emit is called so amdgpu code doesn't need to make workarounds
>> using amdgpu_job.job_run_counter to keep the HW fence refcount balanced.
>
> Could we now also remove job_run_counter?
>
> Christian.


I am afraid not, the job counter is needed since at no point can the
refcount on the
embedded fence drop to zero, because that would free the job itself
before
the end of its life cycle. We have to be able to differentiate in
amdgpu_fence_emit
between the first ever call, where we init the embedded fence's refcount
from scratch using kref_init,
and repeated calls, where the refcount is already > 0 and we just fake an
increase of the refcount to align
behavior with pointer-style fences in other drivers.

I guess we could assume that the embedded fence is all zeroes before the
first dma_fence_init, assuming the job itself
was allocated using kzalloc, and so you can look at dma_fence_ops == NULL
or maybe seqno == 0
as a hint whether this is the first call or not, but it's a risky
assumption in my
opinion.

Andrey


>
>>
>> Also since in the previous patch we resumed setting s_fence->parent 
>> to NULL
>> in drm_sched_stop switch to directly checking if job->hw_fence is
>> signaled to short circuit reset if already signed.
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> Tested-by: Yiqing Yao <yiqing.yao@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c |  2 ++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 23 ++++++++++++++++------
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  7 ++++++-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  4 ----
>>   4 files changed, 25 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>> index 513c57f839d8..447bd92c4856 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>> @@ -684,6 +684,8 @@ int amdgpu_amdkfd_submit_ib(struct amdgpu_device 
>> *adev,
>>           goto err_ib_sched;
>>       }
>>   +    /* Drop the initial kref_init count (see drm_sched_main as 
>> example) */
>> +    dma_fence_put(f);
>>       ret = dma_fence_wait(f, false);
>>     err_ib_sched:
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index c99541685804..f9718119834f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -5009,16 +5009,28 @@ static void amdgpu_device_recheck_guilty_jobs(
>>             /* clear job's guilty and depend the folowing step to 
>> decide the real one */
>>           drm_sched_reset_karma(s_job);
>> -        /* for the real bad job, it will be resubmitted twice, 
>> adding a dma_fence_get
>> -         * to make sure fence is balanced */
>> -        dma_fence_get(s_job->s_fence->parent);
>>           drm_sched_resubmit_jobs_ext(&ring->sched, 1);
>>   +        if (!s_job->s_fence->parent) {
>> +            DRM_WARN("Failed to get a HW fence for job!");
>> +            continue;
>> +        }
>> +
>>           ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, 
>> ring->sched.timeout);
>>           if (ret == 0) { /* timeout */
>>               DRM_ERROR("Found the real bad job! ring:%s, 
>> job_id:%llx\n",
>>                           ring->sched.name, s_job->id);
>>   +
>> +            /* Clear this failed job from fence array */
>> +            amdgpu_fence_driver_clear_job_fences(ring);
>> +
>> +            /* Since the job won't signal and we go for
>> +             * another resubmit drop this parent pointer
>> +             */
>> +            dma_fence_put(s_job->s_fence->parent);
>> +            s_job->s_fence->parent = NULL;
>> +
>>               /* set guilty */
>>               drm_sched_increase_karma(s_job);
>>   retry:
>> @@ -5047,7 +5059,6 @@ static void amdgpu_device_recheck_guilty_jobs(
>>             /* got the hw fence, signal finished fence */
>>           atomic_dec(ring->sched.score);
>> -        dma_fence_put(s_job->s_fence->parent);
>>           dma_fence_get(&s_job->s_fence->finished);
>>           dma_fence_signal(&s_job->s_fence->finished);
>>           dma_fence_put(&s_job->s_fence->finished);
>> @@ -5220,8 +5231,8 @@ int amdgpu_device_gpu_recover(struct 
>> amdgpu_device *adev,
>>        *
>>        * job->base holds a reference to parent fence
>>        */
>> -    if (job && job->base.s_fence->parent &&
>> -        dma_fence_is_signaled(job->base.s_fence->parent)) {
>> +    if (job && (job->hw_fence.ops != NULL) &&
>> +        dma_fence_is_signaled(&job->hw_fence)) {
>>           job_signaled = true;
>>           dev_info(adev->dev, "Guilty job already signaled, skipping 
>> HW reset");
>>           goto skip_hw_reset;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> index d6d54ba4c185..9bd4e18212fc 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> @@ -164,11 +164,16 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, 
>> struct dma_fence **f, struct amd
>>       if (job && job->job_run_counter) {
>>           /* reinit seq for resubmitted jobs */
>>           fence->seqno = seq;
>> +        /* TO be inline with external fence creation and other 
>> drivers */
>> +        dma_fence_get(fence);
>>       } else {
>> -        if (job)
>> +        if (job) {
>>               dma_fence_init(fence, &amdgpu_job_fence_ops,
>>                          &ring->fence_drv.lock,
>>                          adev->fence_context + ring->idx, seq);
>> +            /* Against remove in amdgpu_job_{free, free_cb} */
>> +            dma_fence_get(fence);
>> +        }
>>           else
>>               dma_fence_init(fence, &amdgpu_fence_ops,
>>                          &ring->fence_drv.lock,
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index 58568fdde2d0..638e1d600258 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -267,10 +267,6 @@ static struct dma_fence *amdgpu_job_run(struct 
>> drm_sched_job *sched_job)
>>               DRM_ERROR("Error scheduling IBs (%d)\n", r);
>>       }
>>   -    if (!job->job_run_counter)
>> -        dma_fence_get(fence);
>> -    else if (finished->error < 0)
>> -        dma_fence_put(&job->hw_fence);
>>       job->job_run_counter++;
>>       amdgpu_job_free_resources(job);
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH 3/5] drm/amdgpu: Prevent race between late signaled fences and GPU reset.
  2022-06-21 19:45     ` Andrey Grodzovsky
@ 2022-06-22  1:47       ` VURDIGERENATARAJ, CHANDAN
  2022-06-22  2:41         ` Andrey Grodzovsky
  2022-06-22 17:31       ` Andrey Grodzovsky
  1 sibling, 1 reply; 34+ messages in thread
From: VURDIGERENATARAJ, CHANDAN @ 2022-06-22  1:47 UTC (permalink / raw)
  To: Grodzovsky, Andrey, Christian König, dri-devel, amd-gfx
  Cc: Chen, JingWen (Wayne), Koenig, Christian, Liu, Monk, Yao,  Yiqing(James)

Hi,

Is this a preventive fix, or did you find errors/oops/hangs?
If you found errors/oops/hangs, can you please share the details?

BR,
Chandan V N


>On 2022-06-21 03:25, Christian König wrote:
>> Am 21.06.22 um 00:03 schrieb Andrey Grodzovsky:
>>> Problem:
>>> After we start handling timed out jobs we assume there fences won't 
>>> be signaled but we cannot be sure and sometimes they fire late. We 
>>> need to prevent concurrent accesses to fence array from 
>>> amdgpu_fence_driver_clear_job_fences during GPU reset and 
>>> amdgpu_fence_process from a late EOP interrupt.
>>>
>>> Fix:
>>> Before accessing fence array in GPU disable EOP interrupt and flush 
>>> all pending interrupt handlers for amdgpu device's interrupt line.
>>
>>>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 ++++
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 26 
>>> ++++++++++++++++++++++
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  1 +
>>>   3 files changed, 31 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 2b92281dd0c1..c99541685804 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -4605,6 +4605,8 @@ int amdgpu_device_pre_asic_reset(struct
>>> amdgpu_device *adev,
>>>           amdgpu_virt_fini_data_exchange(adev);
>>>       }
>>>   +    amdgpu_fence_driver_isr_toggle(adev, true);
>>> +
>>>       /* block all schedulers and reset given job's ring */
>>>       for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>           struct amdgpu_ring *ring = adev->rings[i]; @@ -4620,6 
>>> +4622,8 @@ int amdgpu_device_pre_asic_reset(struct
>>> amdgpu_device *adev,
>>>           amdgpu_fence_driver_force_completion(ring);
>>>       }
>>>   +    amdgpu_fence_driver_isr_toggle(adev, false);
>>> +
>>>       if (job && job->vm)
>>>           drm_sched_increase_karma(&job->base);
>>>   diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> index a9ae3beaa1d3..d6d54ba4c185 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> @@ -532,6 +532,32 @@ void amdgpu_fence_driver_hw_fini(struct
>>> amdgpu_device *adev)
>>>       }
>>>   }
>>>   +void amdgpu_fence_driver_isr_toggle(struct amdgpu_device *adev, 
>>> bool stop)
>>> +{
>>> +    int i;
>>> +
>>> +    for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
>>> +        struct amdgpu_ring *ring = adev->rings[i];
>>> +
>>> +        if (!ring || !ring->fence_drv.initialized ||
>>> !ring->fence_drv.irq_src)
>>> +            continue;
>>> +
>>> +        if (stop)
>>> +            amdgpu_irq_put(adev, ring->fence_drv.irq_src,
>>> +                           ring->fence_drv.irq_type);
>>> +        else
>>> +            amdgpu_irq_get(adev, ring->fence_drv.irq_src,
>>> +                    ring->fence_drv.irq_type);
>>
>> That won't work like this. This increments/decrements the reference 
>> count for the IRQ, but doesn't guarantee in any way that they are 
>> stopped/started.
>
>
>I understand that, i just assumed that the fence driver is the only holder of this interrupt source (e.g. regCP_INT_CNTL_RING0) ?
>I can disable amdgpu interrupt line totally using disable_irq - would this be better ?
>
>
>>
>>
>>> +    }
>>> +
>>> +    /* TODO Only waits for irq handlers on other CPUs, maybe
>>> local_irq_save
>>> +     * local_irq_local_irq_restore are needed here for local
>>> interrupts ?
>>> +     *
>>> +     */
>>
>> Well that comment made me smile. Think for a moment what the local CPU 
>> would be doing if an interrupt would run :)
>
>
>No, I understand this of course, I am ok to be interrupted by interrupt handler at this point, what i am trying to do is to prevent amdgpu_fence_process to run concurrently with amdgpu_fence_driver_clear_job_fences - that is what this function is trying to prevent - i disable and >flush pending EOP ISR handlers before the call to clear fences and re-enable after.
>I guess we can also introduce a spinlock to serialize them ? Yiqing reported seeing a race between them so we have to do something.
>
>Andrey
>
>
>>
>> Cheers,
>> Christian.
>>
>>> +    if (stop)
>>> +        synchronize_irq(adev->irq.irq); }
>>> +
>>>   void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
>>>   {
>>>       unsigned int i, j;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>> index 7d89a52091c0..82c178a9033a 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>> @@ -143,6 +143,7 @@ signed long amdgpu_fence_wait_polling(struct 
>>> amdgpu_ring *ring,
>>>                         uint32_t wait_seq,
>>>                         signed long timeout);
>>>   unsigned amdgpu_fence_count_emitted(struct amdgpu_ring *ring);
>>> +void amdgpu_fence_driver_isr_toggle(struct amdgpu_device *adev, bool
>>> stop);
>>>     /*
>>>    * Rings.
>>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/5] drm/amdgpu: Prevent race between late signaled fences and GPU reset.
  2022-06-22  1:47       ` VURDIGERENATARAJ, CHANDAN
@ 2022-06-22  2:41         ` Andrey Grodzovsky
  0 siblings, 0 replies; 34+ messages in thread
From: Andrey Grodzovsky @ 2022-06-22  2:41 UTC (permalink / raw)
  To: VURDIGERENATARAJ, CHANDAN, Christian König, dri-devel, amd-gfx
  Cc: Chen, JingWen (Wayne), Koenig, Christian, Liu, Monk, Yao, Yiqing(James)

You have a job in the pending list which is marked as not finished in
drm_sched_stop
(https://elixir.bootlin.com/linux/v5.16/source/drivers/gpu/drm/scheduler/sched_main.c#L420):
the s_fence signal cb is removed and the job is kept in the pending list.
Later you will try to manually clear the HW fence of this job here
(https://elixir.bootlin.com/linux/v5.16/source/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L4492),
but the EOP interrupt can fire for that fence exactly at this
moment and then you have concurrent access to the fence driver's fence array
from both amdgpu_fence_process and amdgpu_fence_driver_clear_job_fences,
which is not supposed to happen. Yiqing reported to me a race during the
debugging
we did of the original refcount bug, and it looked to me like this
scenario. It seems to me the EOP ISR handler should be prevented from
running during this time at least.
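
To make the interleaving explicit, what I think happens is roughly
(simplified):

    /*
     * CPU A (gpu recover)                       CPU B (late EOP interrupt)
     *
     * drm_sched_stop()
     *   removes the s_fence cb, job stays
     *   on the pending list
     * amdgpu_fence_driver_clear_job_fences()    amdgpu_fence_process()
     *   walks fence_drv.fences[]                  walks fence_drv.fences[]
     *
     *   -> concurrent, unsynchronized access to the same fence array
     */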

Andrey

On 2022-06-21 21:47, VURDIGERENATARAJ, CHANDAN wrote:
> Hi,
>
> Is this a preventive fix or you found errors/oops/hangs?
> If you had found errors/oops/hangs, can you please share the details?
>
> BR,
> Chandan V N
>
>
>> On 2022-06-21 03:25, Christian König wrote:
>>> Am 21.06.22 um 00:03 schrieb Andrey Grodzovsky:
>>>> Problem:
>>>> After we start handling timed out jobs we assume there fences won't
>>>> be signaled but we cannot be sure and sometimes they fire late. We
>>>> need to prevent concurrent accesses to fence array from
>>>> amdgpu_fence_driver_clear_job_fences during GPU reset and
>>>> amdgpu_fence_process from a late EOP interrupt.
>>>>
>>>> Fix:
>>>> Before accessing fence array in GPU disable EOP interrupt and flush
>>>> all pending interrupt handlers for amdgpu device's interrupt line.
>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 ++++
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 26
>>>> ++++++++++++++++++++++
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  1 +
>>>>    3 files changed, 31 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index 2b92281dd0c1..c99541685804 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -4605,6 +4605,8 @@ int amdgpu_device_pre_asic_reset(struct
>>>> amdgpu_device *adev,
>>>>            amdgpu_virt_fini_data_exchange(adev);
>>>>        }
>>>>    +    amdgpu_fence_driver_isr_toggle(adev, true);
>>>> +
>>>>        /* block all schedulers and reset given job's ring */
>>>>        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>            struct amdgpu_ring *ring = adev->rings[i]; @@ -4620,6
>>>> +4622,8 @@ int amdgpu_device_pre_asic_reset(struct
>>>> amdgpu_device *adev,
>>>>            amdgpu_fence_driver_force_completion(ring);
>>>>        }
>>>>    +    amdgpu_fence_driver_isr_toggle(adev, false);
>>>> +
>>>>        if (job && job->vm)
>>>>            drm_sched_increase_karma(&job->base);
>>>>    diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> index a9ae3beaa1d3..d6d54ba4c185 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> @@ -532,6 +532,32 @@ void amdgpu_fence_driver_hw_fini(struct
>>>> amdgpu_device *adev)
>>>>        }
>>>>    }
>>>>    +void amdgpu_fence_driver_isr_toggle(struct amdgpu_device *adev,
>>>> bool stop)
>>>> +{
>>>> +    int i;
>>>> +
>>>> +    for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
>>>> +        struct amdgpu_ring *ring = adev->rings[i];
>>>> +
>>>> +        if (!ring || !ring->fence_drv.initialized ||
>>>> !ring->fence_drv.irq_src)
>>>> +            continue;
>>>> +
>>>> +        if (stop)
>>>> +            amdgpu_irq_put(adev, ring->fence_drv.irq_src,
>>>> +                           ring->fence_drv.irq_type);
>>>> +        else
>>>> +            amdgpu_irq_get(adev, ring->fence_drv.irq_src,
>>>> +                    ring->fence_drv.irq_type);
>>> That won't work like this. This increments/decrements the reference
>>> count for the IRQ, but doesn't guarantee in any way that they are
>>> stopped/started.
>>
>> I understand that, i just assumed that the fence driver is the only holder of this interrupt source (e.g. regCP_INT_CNTL_RING0) ?
>> I can disable amdgpu interrupt line totally using disable_irq - would this be better ?
>>
>>
>>>
>>>> +    }
>>>> +
>>>> +    /* TODO Only waits for irq handlers on other CPUs, maybe
>>>> local_irq_save
>>>> +     * local_irq_local_irq_restore are needed here for local
>>>> interrupts ?
>>>> +     *
>>>> +     */
>>> Well that comment made me smile. Think for a moment what the local CPU
>>> would be doing if an interrupt would run :)
>>
>> No, I understand this of course, I am ok to be interrupted by interrupt handler at this point, what i am trying to do is to prevent amdgpu_fence_process to run concurrently with amdgpu_fence_driver_clear_job_fences - that is what this function is trying to prevent - i disable and >flush pending EOP ISR handlers before the call to clear fences and re-enable after.
>> I guess we can also introduce a spinlock to serialize them ? Yiqing reported seeing a race between them so we have to do something.
>>
>> Andrey
>>
>>
>>> Cheers,
>>> Christian.
>>>
>>>> +    if (stop)
>>>> +        synchronize_irq(adev->irq.irq); }
>>>> +
>>>>    void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
>>>>    {
>>>>        unsigned int i, j;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>> index 7d89a52091c0..82c178a9033a 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>> @@ -143,6 +143,7 @@ signed long amdgpu_fence_wait_polling(struct
>>>> amdgpu_ring *ring,
>>>>                          uint32_t wait_seq,
>>>>                          signed long timeout);
>>>>    unsigned amdgpu_fence_count_emitted(struct amdgpu_ring *ring);
>>>> +void amdgpu_fence_driver_isr_toggle(struct amdgpu_device *adev, bool
>>>> stop);
>>>>      /*
>>>>     * Rings.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] drm/amdgpu: Follow up change to previous drm scheduler change.
  2022-06-21 20:00     ` Andrey Grodzovsky
@ 2022-06-22  7:17       ` Christian König
  2022-06-22 17:19         ` Andrey Grodzovsky
  0 siblings, 1 reply; 34+ messages in thread
From: Christian König @ 2022-06-22  7:17 UTC (permalink / raw)
  To: Andrey Grodzovsky, Christian König, dri-devel, amd-gfx
  Cc: jingwen.chen2, monk.liu, yiqing.yao

Am 21.06.22 um 22:00 schrieb Andrey Grodzovsky:
>
> On 2022-06-21 03:28, Christian König wrote:
>> Am 21.06.22 um 00:03 schrieb Andrey Grodzovsky:
>>> Align refcount behaviour for amdgpu_job embedded HW fence with
>>> classic pointer style HW fences by increasing refcount each
>>> time emit is called so amdgpu code doesn't need to make workarounds
>>> using amdgpu_job.job_run_counter to keep the HW fence refcount 
>>> balanced.
>>
>> Could we now also remove job_run_counter?
>>
>> Christian.
>
>
> I am afraid not, job counter is needed since at all times the refcount 
> on the
> embedded fence cannot drop to zero because this will free the job 
> itself before
> the end of it's life cycle. We have to be able to differentiate in 
> amdgpu_fence_emit
> between first ever call where we init the embedded fence's refcount 
> from scratch using kref_init
> to repeating calls when refcount already > 0 and we just fake increase 
> the refcount to align
> behavior with pointer style fences in other drivers.

Well, what we should probably rather do is move the init out of emit instead.

The only downside I can see is that the sequence number isn't known at
initial init time and so needs to be zero or something like that.
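
Something like this is the rough idea (completely untested, and it assumes
the ring/context is already known when the job is created):

    /* once, at job creation time - the seqno isn't known yet, so use 0 */
    dma_fence_init(&job->hw_fence, &amdgpu_job_fence_ops,
                   &ring->fence_drv.lock,
                   adev->fence_context + ring->idx, 0);

    /* amdgpu_fence_emit() then only refreshes the seqno and takes the
     * extra reference, without needing job_run_counter to decide
     * whether to init
     */
    fence->seqno = seq;
    dma_fence_get(fence);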

Regards,
Christian.

>
> I guess we could assume that embedded fence is all zeroes before first 
> dma_fence_init  if assuming the job itself
> was allocated using kzalloc and so u can look at dma_fence_ops == NULL 
> or maybe seqno == 0
> as a hint if that the fist call or not but it's a risky assumption in 
> my opinion.
>
> Andrey
>
>
>>
>>>
>>> Also since in the previous patch we resumed setting s_fence->parent 
>>> to NULL
>>> in drm_sched_stop switch to directly checking if job->hw_fence is
>>> signaled to short circuit reset if already signed.
>>>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>> Tested-by: Yiqing Yao <yiqing.yao@amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c |  2 ++
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 23 
>>> ++++++++++++++++------
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  7 ++++++-
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  4 ----
>>>   4 files changed, 25 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> index 513c57f839d8..447bd92c4856 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>> @@ -684,6 +684,8 @@ int amdgpu_amdkfd_submit_ib(struct amdgpu_device 
>>> *adev,
>>>           goto err_ib_sched;
>>>       }
>>>   +    /* Drop the initial kref_init count (see drm_sched_main as 
>>> example) */
>>> +    dma_fence_put(f);
>>>       ret = dma_fence_wait(f, false);
>>>     err_ib_sched:
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index c99541685804..f9718119834f 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -5009,16 +5009,28 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>             /* clear job's guilty and depend the folowing step to 
>>> decide the real one */
>>>           drm_sched_reset_karma(s_job);
>>> -        /* for the real bad job, it will be resubmitted twice, 
>>> adding a dma_fence_get
>>> -         * to make sure fence is balanced */
>>> -        dma_fence_get(s_job->s_fence->parent);
>>>           drm_sched_resubmit_jobs_ext(&ring->sched, 1);
>>>   +        if (!s_job->s_fence->parent) {
>>> +            DRM_WARN("Failed to get a HW fence for job!");
>>> +            continue;
>>> +        }
>>> +
>>>           ret = dma_fence_wait_timeout(s_job->s_fence->parent, 
>>> false, ring->sched.timeout);
>>>           if (ret == 0) { /* timeout */
>>>               DRM_ERROR("Found the real bad job! ring:%s, 
>>> job_id:%llx\n",
>>>                           ring->sched.name, s_job->id);
>>>   +
>>> +            /* Clear this failed job from fence array */
>>> +            amdgpu_fence_driver_clear_job_fences(ring);
>>> +
>>> +            /* Since the job won't signal and we go for
>>> +             * another resubmit drop this parent pointer
>>> +             */
>>> +            dma_fence_put(s_job->s_fence->parent);
>>> +            s_job->s_fence->parent = NULL;
>>> +
>>>               /* set guilty */
>>>               drm_sched_increase_karma(s_job);
>>>   retry:
>>> @@ -5047,7 +5059,6 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>             /* got the hw fence, signal finished fence */
>>>           atomic_dec(ring->sched.score);
>>> -        dma_fence_put(s_job->s_fence->parent);
>>>           dma_fence_get(&s_job->s_fence->finished);
>>> dma_fence_signal(&s_job->s_fence->finished);
>>>           dma_fence_put(&s_job->s_fence->finished);
>>> @@ -5220,8 +5231,8 @@ int amdgpu_device_gpu_recover(struct 
>>> amdgpu_device *adev,
>>>        *
>>>        * job->base holds a reference to parent fence
>>>        */
>>> -    if (job && job->base.s_fence->parent &&
>>> - dma_fence_is_signaled(job->base.s_fence->parent)) {
>>> +    if (job && (job->hw_fence.ops != NULL) &&
>>> +        dma_fence_is_signaled(&job->hw_fence)) {
>>>           job_signaled = true;
>>>           dev_info(adev->dev, "Guilty job already signaled, skipping 
>>> HW reset");
>>>           goto skip_hw_reset;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> index d6d54ba4c185..9bd4e18212fc 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> @@ -164,11 +164,16 @@ int amdgpu_fence_emit(struct amdgpu_ring 
>>> *ring, struct dma_fence **f, struct amd
>>>       if (job && job->job_run_counter) {
>>>           /* reinit seq for resubmitted jobs */
>>>           fence->seqno = seq;
>>> +        /* TO be inline with external fence creation and other 
>>> drivers */
>>> +        dma_fence_get(fence);
>>>       } else {
>>> -        if (job)
>>> +        if (job) {
>>>               dma_fence_init(fence, &amdgpu_job_fence_ops,
>>>                          &ring->fence_drv.lock,
>>>                          adev->fence_context + ring->idx, seq);
>>> +            /* Against remove in amdgpu_job_{free, free_cb} */
>>> +            dma_fence_get(fence);
>>> +        }
>>>           else
>>>               dma_fence_init(fence, &amdgpu_fence_ops,
>>>                          &ring->fence_drv.lock,
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> index 58568fdde2d0..638e1d600258 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> @@ -267,10 +267,6 @@ static struct dma_fence *amdgpu_job_run(struct 
>>> drm_sched_job *sched_job)
>>>               DRM_ERROR("Error scheduling IBs (%d)\n", r);
>>>       }
>>>   -    if (!job->job_run_counter)
>>> -        dma_fence_get(fence);
>>> -    else if (finished->error < 0)
>>> -        dma_fence_put(&job->hw_fence);
>>>       job->job_run_counter++;
>>>       amdgpu_job_free_resources(job);
>>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] drm/amdgpu: Fix possible refcount leak for release of external_hw_fence
  2022-06-21 19:34     ` Andrey Grodzovsky
@ 2022-06-22  9:00       ` Christian König
  2022-06-22 15:01         ` Andrey Grodzovsky
  0 siblings, 1 reply; 34+ messages in thread
From: Christian König @ 2022-06-22  9:00 UTC (permalink / raw)
  To: Andrey Grodzovsky, dri-devel, amd-gfx
  Cc: jingwen.chen2, Christian.Koenig, monk.liu, yiqing.yao

Am 21.06.22 um 21:34 schrieb Andrey Grodzovsky:
> On 2022-06-21 03:19, Christian König wrote:
>
>> Am 21.06.22 um 00:02 schrieb Andrey Grodzovsky:
>>> Problem:
>>> In amdgpu_job_submit_direct - The refcount should drop by 2
>>> but it drops only by 1.
>>>
>>> amdgpu_ib_sched->emit -> refcount 1 from first fence init
>>> dma_fence_get -> refcount 2
>>> dme_fence_put -> refcount 1
>>>
>>> Fix:
>>> Add put for external_hw_fence in amdgpu_job_free/free_cb
>>
>> Well what is the external_hw_fence good for in this construct?
>
>
> As far as I understand for direct submissions you don't want to pass a 
> job
> pointer to ib_schedule and so u can't use the embedded fence for this 
> case.

Can you please look a bit deeper into this? We now have a couple of
fields in the job structure which have no obvious use.

I think we could pass a job structure to ib_schedule even for direct
submit now.

Regards,
Christian.

>
> Andrey
>
>
>>
>>>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 8 ++++++--
>>>   1 file changed, 6 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> index 10aa073600d4..58568fdde2d0 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> @@ -152,8 +152,10 @@ static void amdgpu_job_free_cb(struct 
>>> drm_sched_job *s_job)
>>>       /* only put the hw fence if has embedded fence */
>>>       if (job->hw_fence.ops != NULL)
>>>           dma_fence_put(&job->hw_fence);
>>> -    else
>>> +    else {
>>
>> When one side of the if uses {} the other side should use {} as well, 
>> e.g. use } else { here.
>>
>> Christian.
>>
>>> + dma_fence_put(job->external_hw_fence);
>>>           kfree(job);
>>> +    }
>>>   }
>>>     void amdgpu_job_free(struct amdgpu_job *job)
>>> @@ -165,8 +167,10 @@ void amdgpu_job_free(struct amdgpu_job *job)
>>>       /* only put the hw fence if has embedded fence */
>>>       if (job->hw_fence.ops != NULL)
>>>           dma_fence_put(&job->hw_fence);
>>> -    else
>>> +    else {
>>> +        dma_fence_put(job->external_hw_fence);
>>>           kfree(job);
>>> +    }
>>>   }
>>>     int amdgpu_job_submit(struct amdgpu_job *job, struct 
>>> drm_sched_entity *entity,
>>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] drm/amdgpu: Fix possible refcount leak for release of external_hw_fence
  2022-06-22  9:00       ` Christian König
@ 2022-06-22 15:01         ` Andrey Grodzovsky
  2022-06-22 15:04           ` Christian König
  0 siblings, 1 reply; 34+ messages in thread
From: Andrey Grodzovsky @ 2022-06-22 15:01 UTC (permalink / raw)
  To: Christian König, dri-devel, amd-gfx
  Cc: jingwen.chen2, Christian.Koenig, monk.liu, yiqing.yao


On 2022-06-22 05:00, Christian König wrote:
> Am 21.06.22 um 21:34 schrieb Andrey Grodzovsky:
>> On 2022-06-21 03:19, Christian König wrote:
>>
>>> Am 21.06.22 um 00:02 schrieb Andrey Grodzovsky:
>>>> Problem:
>>>> In amdgpu_job_submit_direct - The refcount should drop by 2
>>>> but it drops only by 1.
>>>>
>>>> amdgpu_ib_sched->emit -> refcount 1 from first fence init
>>>> dma_fence_get -> refcount 2
>>>> dme_fence_put -> refcount 1
>>>>
>>>> Fix:
>>>> Add put for external_hw_fence in amdgpu_job_free/free_cb
>>>
>>> Well what is the external_hw_fence good for in this construct?
>>
>>
>> As far as I understand for direct submissions you don't want to pass 
>> a job
>> pointer to ib_schedule and so u can't use the embedded fence for this 
>> case.
>
> Can you please look a bit deeper into this, we now have a couple of 
> fields in the job structure which have no obvious use.
>
> I think we could pass a job structure to ib_schedule even for direct 
> submit now.


Are you sure? I see that a lot of the logic in amdgpu_ib_schedule depends
on the presence of vm and fence_ctx, which are set when the job pointer
argument != NULL - might this have a negative impact on direct submit?

Andrey


>
> Regards,
> Christian.
>
>>
>> Andrey
>>
>>
>>>
>>>>
>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 8 ++++++--
>>>>   1 file changed, 6 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> index 10aa073600d4..58568fdde2d0 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> @@ -152,8 +152,10 @@ static void amdgpu_job_free_cb(struct 
>>>> drm_sched_job *s_job)
>>>>       /* only put the hw fence if has embedded fence */
>>>>       if (job->hw_fence.ops != NULL)
>>>>           dma_fence_put(&job->hw_fence);
>>>> -    else
>>>> +    else {
>>>
>>> When one side of the if uses {} the other side should use {} as 
>>> well, e.g. use } else { here.
>>>
>>> Christian.
>>>
>>>> + dma_fence_put(job->external_hw_fence);
>>>>           kfree(job);
>>>> +    }
>>>>   }
>>>>     void amdgpu_job_free(struct amdgpu_job *job)
>>>> @@ -165,8 +167,10 @@ void amdgpu_job_free(struct amdgpu_job *job)
>>>>       /* only put the hw fence if has embedded fence */
>>>>       if (job->hw_fence.ops != NULL)
>>>>           dma_fence_put(&job->hw_fence);
>>>> -    else
>>>> +    else {
>>>> +        dma_fence_put(job->external_hw_fence);
>>>>           kfree(job);
>>>> +    }
>>>>   }
>>>>     int amdgpu_job_submit(struct amdgpu_job *job, struct 
>>>> drm_sched_entity *entity,
>>>
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] drm/amdgpu: Fix possible refcount leak for release of external_hw_fence
  2022-06-22 15:01         ` Andrey Grodzovsky
@ 2022-06-22 15:04           ` Christian König
  2022-06-23 21:18             ` Andrey Grodzovsky
  0 siblings, 1 reply; 34+ messages in thread
From: Christian König @ 2022-06-22 15:04 UTC (permalink / raw)
  To: Andrey Grodzovsky, dri-devel, amd-gfx
  Cc: jingwen.chen2, Christian.Koenig, monk.liu, yiqing.yao

Am 22.06.22 um 17:01 schrieb Andrey Grodzovsky:
>
> On 2022-06-22 05:00, Christian König wrote:
>> Am 21.06.22 um 21:34 schrieb Andrey Grodzovsky:
>>> On 2022-06-21 03:19, Christian König wrote:
>>>
>>>> Am 21.06.22 um 00:02 schrieb Andrey Grodzovsky:
>>>>> Problem:
>>>>> In amdgpu_job_submit_direct - The refcount should drop by 2
>>>>> but it drops only by 1.
>>>>>
>>>>> amdgpu_ib_sched->emit -> refcount 1 from first fence init
>>>>> dma_fence_get -> refcount 2
>>>>> dme_fence_put -> refcount 1
>>>>>
>>>>> Fix:
>>>>> Add put for external_hw_fence in amdgpu_job_free/free_cb
>>>>
>>>> Well what is the external_hw_fence good for in this construct?
>>>
>>>
>>> As far as I understand for direct submissions you don't want to pass 
>>> a job
>>> pointer to ib_schedule and so u can't use the embedded fence for 
>>> this case.
>>
>> Can you please look a bit deeper into this, we now have a couple of 
>> fields in the job structure which have no obvious use.
>>
>> I think we could pass a job structure to ib_schedule even for direct 
>> submit now.
>
>
> Are you sure  ? I see a lot of activities in amdgpu_ib_schedule depend 
> on presence of  vm and fence_ctx which are set if the job pointer 
> argument != NULL, might this have a negative impact on direct submit ?

Not 100% sure, but we did tons of workarounds because we didn't have a
job pointer for direct submit.

But this was before we embedded the IBs at the end of the job.

It's quite likely that this should be possible now, it's just that 
somebody needs to double check.

Christian.

>
> Andrey
>
>
>>
>> Regards,
>> Christian.
>>
>>>
>>> Andrey
>>>
>>>
>>>>
>>>>>
>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>> ---
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 8 ++++++--
>>>>>   1 file changed, 6 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> index 10aa073600d4..58568fdde2d0 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> @@ -152,8 +152,10 @@ static void amdgpu_job_free_cb(struct 
>>>>> drm_sched_job *s_job)
>>>>>       /* only put the hw fence if has embedded fence */
>>>>>       if (job->hw_fence.ops != NULL)
>>>>>           dma_fence_put(&job->hw_fence);
>>>>> -    else
>>>>> +    else {
>>>>
>>>> When one side of the if uses {} the other side should use {} as 
>>>> well, e.g. use } else { here.
>>>>
>>>> Christian.
>>>>
>>>>> + dma_fence_put(job->external_hw_fence);
>>>>>           kfree(job);
>>>>> +    }
>>>>>   }
>>>>>     void amdgpu_job_free(struct amdgpu_job *job)
>>>>> @@ -165,8 +167,10 @@ void amdgpu_job_free(struct amdgpu_job *job)
>>>>>       /* only put the hw fence if has embedded fence */
>>>>>       if (job->hw_fence.ops != NULL)
>>>>>           dma_fence_put(&job->hw_fence);
>>>>> -    else
>>>>> +    else {
>>>>> +        dma_fence_put(job->external_hw_fence);
>>>>>           kfree(job);
>>>>> +    }
>>>>>   }
>>>>>     int amdgpu_job_submit(struct amdgpu_job *job, struct 
>>>>> drm_sched_entity *entity,
>>>>
>>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] drm/amdgpu: Follow up change to previous drm scheduler change.
  2022-06-22  7:17       ` Christian König
@ 2022-06-22 17:19         ` Andrey Grodzovsky
  2022-06-23  5:52           ` Christian König
  0 siblings, 1 reply; 34+ messages in thread
From: Andrey Grodzovsky @ 2022-06-22 17:19 UTC (permalink / raw)
  To: Christian König, Christian König, dri-devel, amd-gfx
  Cc: jingwen.chen2, monk.liu, yiqing.yao


On 2022-06-22 03:17, Christian König wrote:
> Am 21.06.22 um 22:00 schrieb Andrey Grodzovsky:
>>
>> On 2022-06-21 03:28, Christian König wrote:
>>> Am 21.06.22 um 00:03 schrieb Andrey Grodzovsky:
>>>> Align refcount behaviour for amdgpu_job embedded HW fence with
>>>> classic pointer style HW fences by increasing refcount each
>>>> time emit is called so amdgpu code doesn't need to make workarounds
>>>> using amdgpu_job.job_run_counter to keep the HW fence refcount 
>>>> balanced.
>>>
>>> Could we now also remove job_run_counter?
>>>
>>> Christian.
>>
>>
>> I am afraid not, job counter is needed since at all times the 
>> refcount on the
>> embedded fence cannot drop to zero because this will free the job 
>> itself before
>> the end of it's life cycle. We have to be able to differentiate in 
>> amdgpu_fence_emit
>> between first ever call where we init the embedded fence's refcount 
>> from scratch using kref_init
>> to repeating calls when refcount already > 0 and we just fake 
>> increase the refcount to align
>> behavior with pointer style fences in other drivers.
>
> Well what we should probably rather do is move the init out of emit 
> instead.
>
> The only down side I can see is that the sequence number isn't know on 
> initial init and so needs to be zero or something like that.
>
> Regards,
> Christian.


Not sure how this helps; the problem is not there but in amdgpu_job_run -
for an embedded fence and a resubmitted job in the pending list, amdgpu_job_run
will be called twice or even 3 times with the recheck-guilty-job sequence. I
am supposed to do dma_fence_init on the embedded HW fence only on the first
call, while on the second and
third I only update the sequence number and increase the refcount. How can I
differentiate between first and non-first calls without job_run_counter?
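
The only counter-free alternative I see is the risky kzalloc assumption
from before, i.e. something like (untested):

    /* relies on the job being kzalloc()ed so the embedded fence is
     * zeroed before the first emit - risky if that ever changes
     */
    if (!job->hw_fence.ops)
        /* first run - do the real init */
        dma_fence_init(fence, &amdgpu_job_fence_ops,
                       &ring->fence_drv.lock,
                       adev->fence_context + ring->idx, seq);
    else
        /* resubmit - just refresh the seqno */
        fence->seqno = seq;

    dma_fence_get(fence);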

Andrey


>
>>
>> I guess we could assume that embedded fence is all zeroes before 
>> first dma_fence_init  if assuming the job itself
>> was allocated using kzalloc and so u can look at dma_fence_ops == 
>> NULL or maybe seqno == 0
>> as a hint if that the fist call or not but it's a risky assumption in 
>> my opinion.
>>
>> Andrey
>>
>>
>>>
>>>>
>>>> Also since in the previous patch we resumed setting s_fence->parent 
>>>> to NULL
>>>> in drm_sched_stop switch to directly checking if job->hw_fence is
>>>> signaled to short circuit reset if already signed.
>>>>
>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>> Tested-by: Yiqing Yao <yiqing.yao@amd.com>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c |  2 ++
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 23 
>>>> ++++++++++++++++------
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  7 ++++++-
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  4 ----
>>>>   4 files changed, 25 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>> index 513c57f839d8..447bd92c4856 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>> @@ -684,6 +684,8 @@ int amdgpu_amdkfd_submit_ib(struct 
>>>> amdgpu_device *adev,
>>>>           goto err_ib_sched;
>>>>       }
>>>>   +    /* Drop the initial kref_init count (see drm_sched_main as 
>>>> example) */
>>>> +    dma_fence_put(f);
>>>>       ret = dma_fence_wait(f, false);
>>>>     err_ib_sched:
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index c99541685804..f9718119834f 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -5009,16 +5009,28 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>>             /* clear job's guilty and depend the folowing step to 
>>>> decide the real one */
>>>>           drm_sched_reset_karma(s_job);
>>>> -        /* for the real bad job, it will be resubmitted twice, 
>>>> adding a dma_fence_get
>>>> -         * to make sure fence is balanced */
>>>> -        dma_fence_get(s_job->s_fence->parent);
>>>>           drm_sched_resubmit_jobs_ext(&ring->sched, 1);
>>>>   +        if (!s_job->s_fence->parent) {
>>>> +            DRM_WARN("Failed to get a HW fence for job!");
>>>> +            continue;
>>>> +        }
>>>> +
>>>>           ret = dma_fence_wait_timeout(s_job->s_fence->parent, 
>>>> false, ring->sched.timeout);
>>>>           if (ret == 0) { /* timeout */
>>>>               DRM_ERROR("Found the real bad job! ring:%s, 
>>>> job_id:%llx\n",
>>>>                           ring->sched.name, s_job->id);
>>>>   +
>>>> +            /* Clear this failed job from fence array */
>>>> +            amdgpu_fence_driver_clear_job_fences(ring);
>>>> +
>>>> +            /* Since the job won't signal and we go for
>>>> +             * another resubmit drop this parent pointer
>>>> +             */
>>>> +            dma_fence_put(s_job->s_fence->parent);
>>>> +            s_job->s_fence->parent = NULL;
>>>> +
>>>>               /* set guilty */
>>>>               drm_sched_increase_karma(s_job);
>>>>   retry:
>>>> @@ -5047,7 +5059,6 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>>             /* got the hw fence, signal finished fence */
>>>>           atomic_dec(ring->sched.score);
>>>> -        dma_fence_put(s_job->s_fence->parent);
>>>> dma_fence_get(&s_job->s_fence->finished);
>>>> dma_fence_signal(&s_job->s_fence->finished);
>>>> dma_fence_put(&s_job->s_fence->finished);
>>>> @@ -5220,8 +5231,8 @@ int amdgpu_device_gpu_recover(struct 
>>>> amdgpu_device *adev,
>>>>        *
>>>>        * job->base holds a reference to parent fence
>>>>        */
>>>> -    if (job && job->base.s_fence->parent &&
>>>> - dma_fence_is_signaled(job->base.s_fence->parent)) {
>>>> +    if (job && (job->hw_fence.ops != NULL) &&
>>>> +        dma_fence_is_signaled(&job->hw_fence)) {
>>>>           job_signaled = true;
>>>>           dev_info(adev->dev, "Guilty job already signaled, 
>>>> skipping HW reset");
>>>>           goto skip_hw_reset;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> index d6d54ba4c185..9bd4e18212fc 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> @@ -164,11 +164,16 @@ int amdgpu_fence_emit(struct amdgpu_ring 
>>>> *ring, struct dma_fence **f, struct amd
>>>>       if (job && job->job_run_counter) {
>>>>           /* reinit seq for resubmitted jobs */
>>>>           fence->seqno = seq;
>>>> +        /* TO be inline with external fence creation and other 
>>>> drivers */
>>>> +        dma_fence_get(fence);
>>>>       } else {
>>>> -        if (job)
>>>> +        if (job) {
>>>>               dma_fence_init(fence, &amdgpu_job_fence_ops,
>>>>                          &ring->fence_drv.lock,
>>>>                          adev->fence_context + ring->idx, seq);
>>>> +            /* Against remove in amdgpu_job_{free, free_cb} */
>>>> +            dma_fence_get(fence);
>>>> +        }
>>>>           else
>>>>               dma_fence_init(fence, &amdgpu_fence_ops,
>>>>                          &ring->fence_drv.lock,
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> index 58568fdde2d0..638e1d600258 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> @@ -267,10 +267,6 @@ static struct dma_fence *amdgpu_job_run(struct 
>>>> drm_sched_job *sched_job)
>>>>               DRM_ERROR("Error scheduling IBs (%d)\n", r);
>>>>       }
>>>>   -    if (!job->job_run_counter)
>>>> -        dma_fence_get(fence);
>>>> -    else if (finished->error < 0)
>>>> -        dma_fence_put(&job->hw_fence);
>>>>       job->job_run_counter++;
>>>>       amdgpu_job_free_resources(job);
>>>
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/5] drm/amdgpu: Prevent race between late signaled fences and GPU reset.
  2022-06-21 19:45     ` Andrey Grodzovsky
  2022-06-22  1:47       ` VURDIGERENATARAJ, CHANDAN
@ 2022-06-22 17:31       ` Andrey Grodzovsky
  2022-06-23  5:57         ` Christian König
  1 sibling, 1 reply; 34+ messages in thread
From: Andrey Grodzovsky @ 2022-06-22 17:31 UTC (permalink / raw)
  To: Christian König, dri-devel, amd-gfx
  Cc: jingwen.chen2, Christian.Koenig, monk.liu, yiqing.yao

Just a ping

Andrey

On 2022-06-21 15:45, Andrey Grodzovsky wrote:
>
> On 2022-06-21 03:25, Christian König wrote:
>> Am 21.06.22 um 00:03 schrieb Andrey Grodzovsky:
>>> Problem:
>>> After we start handling timed out jobs we assume there fences won't be
>>> signaled but we cannot be sure and sometimes they fire late. We need
>>> to prevent concurrent accesses to fence array from
>>> amdgpu_fence_driver_clear_job_fences during GPU reset and 
>>> amdgpu_fence_process
>>> from a late EOP interrupt.
>>>
>>> Fix:
>>> Before accessing fence array in GPU disable EOP interrupt and flush
>>> all pending interrupt handlers for amdgpu device's interrupt line.
>>
>>>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 ++++
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 26 
>>> ++++++++++++++++++++++
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  1 +
>>>   3 files changed, 31 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 2b92281dd0c1..c99541685804 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -4605,6 +4605,8 @@ int amdgpu_device_pre_asic_reset(struct 
>>> amdgpu_device *adev,
>>>           amdgpu_virt_fini_data_exchange(adev);
>>>       }
>>>   +    amdgpu_fence_driver_isr_toggle(adev, true);
>>> +
>>>       /* block all schedulers and reset given job's ring */
>>>       for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>           struct amdgpu_ring *ring = adev->rings[i];
>>> @@ -4620,6 +4622,8 @@ int amdgpu_device_pre_asic_reset(struct 
>>> amdgpu_device *adev,
>>>           amdgpu_fence_driver_force_completion(ring);
>>>       }
>>>   +    amdgpu_fence_driver_isr_toggle(adev, false);
>>> +
>>>       if (job && job->vm)
>>>           drm_sched_increase_karma(&job->base);
>>>   diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> index a9ae3beaa1d3..d6d54ba4c185 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> @@ -532,6 +532,32 @@ void amdgpu_fence_driver_hw_fini(struct 
>>> amdgpu_device *adev)
>>>       }
>>>   }
>>>   +void amdgpu_fence_driver_isr_toggle(struct amdgpu_device *adev, 
>>> bool stop)
>>> +{
>>> +    int i;
>>> +
>>> +    for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
>>> +        struct amdgpu_ring *ring = adev->rings[i];
>>> +
>>> +        if (!ring || !ring->fence_drv.initialized || 
>>> !ring->fence_drv.irq_src)
>>> +            continue;
>>> +
>>> +        if (stop)
>>> +            amdgpu_irq_put(adev, ring->fence_drv.irq_src,
>>> +                           ring->fence_drv.irq_type);
>>> +        else
>>> +            amdgpu_irq_get(adev, ring->fence_drv.irq_src,
>>> +                    ring->fence_drv.irq_type);
>>
>> That won't work like this. This increments/decrements the reference 
>> count for the IRQ, but doesn't guarantee in any way that they are 
>> stopped/started.
>
>
> I understand that, i just assumed that the fence driver is the only 
> holder of this interrupt source (e.g. regCP_INT_CNTL_RING0) ?
> I can disable amdgpu interrupt line totally using disable_irq - would 
> this be better ?
>
>
>>
>>
>>> +    }
>>> +
>>> +    /* TODO Only waits for irq handlers on other CPUs, maybe 
>>> local_irq_save
>>> +     * local_irq_local_irq_restore are needed here for local 
>>> interrupts ?
>>> +     *
>>> +     */
>>
>> Well that comment made me smile. Think for a moment what the local 
>> CPU would be doing if an interrupt would run :)
>
>
> No, I understand this of course, I am ok to be interrupted by 
> interrupt handler at this point, what i am trying to do
> is to prevent amdgpu_fence_process to run concurrently with 
> amdgpu_fence_driver_clear_job_fences - that is what this
> function is trying to prevent - i disable and flush pending EOP ISR 
> handlers before the call to clear fences and re-enable after.
> I guess we can also introduce a spinlock to serialize them ? Yiqing 
> reported seeing a race between them so we have to do something.
>
> Andrey
>
>
>>
>> Cheers,
>> Christian.
>>
>>> +    if (stop)
>>> +        synchronize_irq(adev->irq.irq);
>>> +}
>>> +
>>>   void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
>>>   {
>>>       unsigned int i, j;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>> index 7d89a52091c0..82c178a9033a 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>> @@ -143,6 +143,7 @@ signed long amdgpu_fence_wait_polling(struct 
>>> amdgpu_ring *ring,
>>>                         uint32_t wait_seq,
>>>                         signed long timeout);
>>>   unsigned amdgpu_fence_count_emitted(struct amdgpu_ring *ring);
>>> +void amdgpu_fence_driver_isr_toggle(struct amdgpu_device *adev, 
>>> bool stop);
>>>     /*
>>>    * Rings.
>>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] drm/amdgpu: Follow up change to previous drm scheduler change.
  2022-06-22 17:19         ` Andrey Grodzovsky
@ 2022-06-23  5:52           ` Christian König
  2022-06-23 14:51             ` Andrey Grodzovsky
  0 siblings, 1 reply; 34+ messages in thread
From: Christian König @ 2022-06-23  5:52 UTC (permalink / raw)
  To: Andrey Grodzovsky, Christian König, dri-devel, amd-gfx
  Cc: jingwen.chen2, monk.liu, yiqing.yao

Am 22.06.22 um 19:19 schrieb Andrey Grodzovsky:
>
> On 2022-06-22 03:17, Christian König wrote:
>> Am 21.06.22 um 22:00 schrieb Andrey Grodzovsky:
>>>
>>> On 2022-06-21 03:28, Christian König wrote:
>>>> Am 21.06.22 um 00:03 schrieb Andrey Grodzovsky:
>>>>> Align refcount behaviour for amdgpu_job embedded HW fence with
>>>>> classic pointer style HW fences by increasing refcount each
>>>>> time emit is called so amdgpu code doesn't need to make workarounds
>>>>> using amdgpu_job.job_run_counter to keep the HW fence refcount 
>>>>> balanced.
>>>>
>>>> Could we now also remove job_run_counter?
>>>>
>>>> Christian.
>>>
>>>
>>> I am afraid not, the job counter is needed since the refcount on the
>>> embedded fence must never drop to zero, because that would free the job
>>> itself before
>>> the end of its life cycle. We have to be able to differentiate in
>>> amdgpu_fence_emit
>>> between the first ever call, where we init the embedded fence's refcount
>>> from scratch using kref_init,
>>> and repeat calls, where the refcount is already > 0 and we just
>>> artificially increase the refcount to align
>>> behavior with pointer style fences in other drivers.
>>
>> Well what we should probably rather do is move the init out of emit 
>> instead.
>>
>> The only down side I can see is that the sequence number isn't know 
>> on initial init and so needs to be zero or something like that.
>>
>> Regards,
>> Christian.
>
>
> Not sure how this helps, the problem is not there but in
> amdgpu_job_run: for the embedded fence of a resubmitted job in the pending list,
> amdgpu_job_run
> will be called twice or even 3 times with the recheck-guilty-job sequence.
> I am supposed to do dma_fence_init on the embedded HW fence only on the first
> call, while on the second and
> third I only update the sequence number and increase the refcount. How can I
> differentiate between first and non-first calls without job_run_counter?

Yeah, good point. We should really stop re-submitting jobs altogether in 
the kernel and move that whole functionality into userspace.

Christian.

>
> Andrey
>
>
>>
>>>
>>> I guess we could assume that the embedded fence is all zeroes before the
>>> first dma_fence_init, assuming the job itself
>>> was allocated using kzalloc, and so you could look at dma_fence_ops ==
>>> NULL or maybe seqno == 0
>>> as a hint for whether this is the first call or not, but that is a risky assumption
>>> in my opinion.
>>>
>>> Andrey
>>>
>>>
>>>>
>>>>>
>>>>> Also since in the previous patch we resumed setting 
>>>>> s_fence->parent to NULL
>>>>> in drm_sched_stop switch to directly checking if job->hw_fence is
>>>>> signaled to short circuit reset if already signed.
>>>>>
>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>> Tested-by: Yiqing Yao <yiqing.yao@amd.com>
>>>>> ---
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c |  2 ++
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 23 
>>>>> ++++++++++++++++------
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  7 ++++++-
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  4 ----
>>>>>   4 files changed, 25 insertions(+), 11 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>> index 513c57f839d8..447bd92c4856 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>> @@ -684,6 +684,8 @@ int amdgpu_amdkfd_submit_ib(struct 
>>>>> amdgpu_device *adev,
>>>>>           goto err_ib_sched;
>>>>>       }
>>>>>   +    /* Drop the initial kref_init count (see drm_sched_main as 
>>>>> example) */
>>>>> +    dma_fence_put(f);
>>>>>       ret = dma_fence_wait(f, false);
>>>>>     err_ib_sched:
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> index c99541685804..f9718119834f 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -5009,16 +5009,28 @@ static void 
>>>>> amdgpu_device_recheck_guilty_jobs(
>>>>>             /* clear job's guilty and depend the folowing step to 
>>>>> decide the real one */
>>>>>           drm_sched_reset_karma(s_job);
>>>>> -        /* for the real bad job, it will be resubmitted twice, 
>>>>> adding a dma_fence_get
>>>>> -         * to make sure fence is balanced */
>>>>> -        dma_fence_get(s_job->s_fence->parent);
>>>>>           drm_sched_resubmit_jobs_ext(&ring->sched, 1);
>>>>>   +        if (!s_job->s_fence->parent) {
>>>>> +            DRM_WARN("Failed to get a HW fence for job!");
>>>>> +            continue;
>>>>> +        }
>>>>> +
>>>>>           ret = dma_fence_wait_timeout(s_job->s_fence->parent, 
>>>>> false, ring->sched.timeout);
>>>>>           if (ret == 0) { /* timeout */
>>>>>               DRM_ERROR("Found the real bad job! ring:%s, 
>>>>> job_id:%llx\n",
>>>>>                           ring->sched.name, s_job->id);
>>>>>   +
>>>>> +            /* Clear this failed job from fence array */
>>>>> +            amdgpu_fence_driver_clear_job_fences(ring);
>>>>> +
>>>>> +            /* Since the job won't signal and we go for
>>>>> +             * another resubmit drop this parent pointer
>>>>> +             */
>>>>> +            dma_fence_put(s_job->s_fence->parent);
>>>>> +            s_job->s_fence->parent = NULL;
>>>>> +
>>>>>               /* set guilty */
>>>>>               drm_sched_increase_karma(s_job);
>>>>>   retry:
>>>>> @@ -5047,7 +5059,6 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>>>             /* got the hw fence, signal finished fence */
>>>>>           atomic_dec(ring->sched.score);
>>>>> -        dma_fence_put(s_job->s_fence->parent);
>>>>> dma_fence_get(&s_job->s_fence->finished);
>>>>> dma_fence_signal(&s_job->s_fence->finished);
>>>>> dma_fence_put(&s_job->s_fence->finished);
>>>>> @@ -5220,8 +5231,8 @@ int amdgpu_device_gpu_recover(struct 
>>>>> amdgpu_device *adev,
>>>>>        *
>>>>>        * job->base holds a reference to parent fence
>>>>>        */
>>>>> -    if (job && job->base.s_fence->parent &&
>>>>> - dma_fence_is_signaled(job->base.s_fence->parent)) {
>>>>> +    if (job && (job->hw_fence.ops != NULL) &&
>>>>> +        dma_fence_is_signaled(&job->hw_fence)) {
>>>>>           job_signaled = true;
>>>>>           dev_info(adev->dev, "Guilty job already signaled, 
>>>>> skipping HW reset");
>>>>>           goto skip_hw_reset;
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> index d6d54ba4c185..9bd4e18212fc 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> @@ -164,11 +164,16 @@ int amdgpu_fence_emit(struct amdgpu_ring 
>>>>> *ring, struct dma_fence **f, struct amd
>>>>>       if (job && job->job_run_counter) {
>>>>>           /* reinit seq for resubmitted jobs */
>>>>>           fence->seqno = seq;
>>>>> +        /* TO be inline with external fence creation and other 
>>>>> drivers */
>>>>> +        dma_fence_get(fence);
>>>>>       } else {
>>>>> -        if (job)
>>>>> +        if (job) {
>>>>>               dma_fence_init(fence, &amdgpu_job_fence_ops,
>>>>>                          &ring->fence_drv.lock,
>>>>>                          adev->fence_context + ring->idx, seq);
>>>>> +            /* Against remove in amdgpu_job_{free, free_cb} */
>>>>> +            dma_fence_get(fence);
>>>>> +        }
>>>>>           else
>>>>>               dma_fence_init(fence, &amdgpu_fence_ops,
>>>>>                          &ring->fence_drv.lock,
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> index 58568fdde2d0..638e1d600258 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> @@ -267,10 +267,6 @@ static struct dma_fence 
>>>>> *amdgpu_job_run(struct drm_sched_job *sched_job)
>>>>>               DRM_ERROR("Error scheduling IBs (%d)\n", r);
>>>>>       }
>>>>>   -    if (!job->job_run_counter)
>>>>> -        dma_fence_get(fence);
>>>>> -    else if (finished->error < 0)
>>>>> -        dma_fence_put(&job->hw_fence);
>>>>>       job->job_run_counter++;
>>>>>       amdgpu_job_free_resources(job);
>>>>
>>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/5] drm/amdgpu: Prevent race between late signaled fences and GPU reset.
  2022-06-22 17:31       ` Andrey Grodzovsky
@ 2022-06-23  5:57         ` Christian König
  0 siblings, 0 replies; 34+ messages in thread
From: Christian König @ 2022-06-23  5:57 UTC (permalink / raw)
  To: Andrey Grodzovsky, Christian König, dri-devel, amd-gfx
  Cc: jingwen.chen2, monk.liu, yiqing.yao

Am 22.06.22 um 19:31 schrieb Andrey Grodzovsky:
> Just a ping

You need to give me at least some time to look into this.

>
> Andrey
>
> On 2022-06-21 15:45, Andrey Grodzovsky wrote:
>>
>> On 2022-06-21 03:25, Christian König wrote:
>>> Am 21.06.22 um 00:03 schrieb Andrey Grodzovsky:
>>>> Problem:
>>>> After we start handling timed-out jobs we assume their fences won't be
>>>> signaled, but we cannot be sure and sometimes they fire late. We need
>>>> to prevent concurrent accesses to fence array from
>>>> amdgpu_fence_driver_clear_job_fences during GPU reset and 
>>>> amdgpu_fence_process
>>>> from a late EOP interrupt.
>>>>
>>>> Fix:
>>>> Before accessing the fence array during GPU reset, disable the EOP interrupt and flush
>>>> all pending interrupt handlers for the amdgpu device's interrupt line.
>>>
>>>>
>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 ++++
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 26 
>>>> ++++++++++++++++++++++
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  1 +
>>>>   3 files changed, 31 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index 2b92281dd0c1..c99541685804 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -4605,6 +4605,8 @@ int amdgpu_device_pre_asic_reset(struct 
>>>> amdgpu_device *adev,
>>>>           amdgpu_virt_fini_data_exchange(adev);
>>>>       }
>>>>   +    amdgpu_fence_driver_isr_toggle(adev, true);
>>>> +
>>>>       /* block all schedulers and reset given job's ring */
>>>>       for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>           struct amdgpu_ring *ring = adev->rings[i];
>>>> @@ -4620,6 +4622,8 @@ int amdgpu_device_pre_asic_reset(struct 
>>>> amdgpu_device *adev,
>>>>           amdgpu_fence_driver_force_completion(ring);
>>>>       }
>>>>   +    amdgpu_fence_driver_isr_toggle(adev, false);
>>>> +
>>>>       if (job && job->vm)
>>>>           drm_sched_increase_karma(&job->base);
>>>>   diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> index a9ae3beaa1d3..d6d54ba4c185 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> @@ -532,6 +532,32 @@ void amdgpu_fence_driver_hw_fini(struct 
>>>> amdgpu_device *adev)
>>>>       }
>>>>   }
>>>>   +void amdgpu_fence_driver_isr_toggle(struct amdgpu_device *adev, 
>>>> bool stop)
>>>> +{
>>>> +    int i;
>>>> +
>>>> +    for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
>>>> +        struct amdgpu_ring *ring = adev->rings[i];
>>>> +
>>>> +        if (!ring || !ring->fence_drv.initialized || 
>>>> !ring->fence_drv.irq_src)
>>>> +            continue;
>>>> +
>>>> +        if (stop)
>>>> +            amdgpu_irq_put(adev, ring->fence_drv.irq_src,
>>>> +                           ring->fence_drv.irq_type);
>>>> +        else
>>>> +            amdgpu_irq_get(adev, ring->fence_drv.irq_src,
>>>> +                    ring->fence_drv.irq_type);
>>>
>>> That won't work like this. This increments/decrements the reference 
>>> count for the IRQ, but doesn't guarantee in any way that they are 
>>> stopped/started.
>>
>>
>> I understand that, I just assumed that the fence driver is the only
>> holder of this interrupt source (e.g. regCP_INT_CNTL_RING0)?

I'm not 100% sure of that. The original idea was to enable/disable 
interrupt sources as they came along.

>> I can disable the amdgpu interrupt line entirely using disable_irq - would
>> this be better?

Yes, that's what I thought as well.
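
For reference, a minimal sketch of that alternative (illustrative only, not
the patch as submitted; it assumes adev->irq.irq is the single interrupt line
the fence EOP handlers come in on, as in the synchronize_irq() call in the
hunk below):

	/* Mask the whole amdgpu interrupt line around the fence array update.
	 * disable_irq() also waits for handlers already running on other CPUs
	 * to finish, so no separate synchronize_irq() is needed.
	 */
	disable_irq(adev->irq.irq);
	amdgpu_fence_driver_clear_job_fences(ring);	/* can no longer race with amdgpu_fence_process() */
	enable_irq(adev->irq.irq);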

>>
>>
>>>
>>>
>>>> +    }
>>>> +
>>>> +    /* TODO Only waits for irq handlers on other CPUs, maybe 
>>>> local_irq_save
>>>> +     * local_irq_local_irq_restore are needed here for local 
>>>> interrupts ?
>>>> +     *
>>>> +     */
>>>
>>> Well that comment made me smile. Think for a moment what the local 
>>> CPU would be doing if an interrupt would run :)
>>
>>
>> No, I understand this of course. I am OK with being interrupted by an
>> interrupt handler at this point; what I am trying to do
>> is prevent amdgpu_fence_process from running concurrently with
>> amdgpu_fence_driver_clear_job_fences - that is what this
>> function is there to prevent - I disable and flush pending EOP ISR
>> handlers before the call that clears the fences and re-enable them after.

That should be sufficient. If a local interrupt were running, the code
here wouldn't be executing anyway.
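
Condensed, the pattern the patch establishes around GPU reset is the classic
disable / flush / touch / re-enable sequence (a sketch of the intent, not the
literal hunk - the ring iteration and the unrelated reset steps are omitted):

	/* amdgpu_irq_put() on every ring's fence irq source + synchronize_irq():
	 * no EOP handler runs past this point
	 */
	amdgpu_fence_driver_isr_toggle(adev, true);

	/* fence array updates can no longer race with amdgpu_fence_process() */
	amdgpu_fence_driver_clear_job_fences(ring);
	amdgpu_fence_driver_force_completion(ring);

	/* re-arm the fence interrupts */
	amdgpu_fence_driver_isr_toggle(adev, false);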

>> I guess we could also introduce a spinlock to serialize them? Yiqing
>> reported seeing a race between them, so we have to do something.

A spinlock would be an absolute NAK because we have been working quite 
hard to remove them (for multiple reasons).

Christian.

>>
>> Andrey
>>
>>
>>>
>>> Cheers,
>>> Christian.
>>>
>>>> +    if (stop)
>>>> +        synchronize_irq(adev->irq.irq);
>>>> +}
>>>> +
>>>>   void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
>>>>   {
>>>>       unsigned int i, j;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>> index 7d89a52091c0..82c178a9033a 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>> @@ -143,6 +143,7 @@ signed long amdgpu_fence_wait_polling(struct 
>>>> amdgpu_ring *ring,
>>>>                         uint32_t wait_seq,
>>>>                         signed long timeout);
>>>>   unsigned amdgpu_fence_count_emitted(struct amdgpu_ring *ring);
>>>> +void amdgpu_fence_driver_isr_toggle(struct amdgpu_device *adev, 
>>>> bool stop);
>>>>     /*
>>>>    * Rings.
>>>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] drm/amdgpu: Follow up change to previous drm scheduler change.
  2022-06-23  5:52           ` Christian König
@ 2022-06-23 14:51             ` Andrey Grodzovsky
  2022-06-23 14:54               ` Christian König
  0 siblings, 1 reply; 34+ messages in thread
From: Andrey Grodzovsky @ 2022-06-23 14:51 UTC (permalink / raw)
  To: Christian König, Christian König, dri-devel, amd-gfx
  Cc: jingwen.chen2, monk.liu, yiqing.yao


On 2022-06-23 01:52, Christian König wrote:
> Am 22.06.22 um 19:19 schrieb Andrey Grodzovsky:
>>
>> On 2022-06-22 03:17, Christian König wrote:
>>> Am 21.06.22 um 22:00 schrieb Andrey Grodzovsky:
>>>>
>>>> On 2022-06-21 03:28, Christian König wrote:
>>>>> Am 21.06.22 um 00:03 schrieb Andrey Grodzovsky:
>>>>>> Align refcount behaviour for amdgpu_job embedded HW fence with
>>>>>> classic pointer style HW fences by increasing refcount each
>>>>>> time emit is called so amdgpu code doesn't need to make workarounds
>>>>>> using amdgpu_job.job_run_counter to keep the HW fence refcount 
>>>>>> balanced.
>>>>>
>>>>> Could we now also remove job_run_counter?
>>>>>
>>>>> Christian.
>>>>
>>>>
>>>> I am afraid not, the job counter is needed since the refcount on the
>>>> embedded fence must never drop to zero, because that would free the job
>>>> itself before
>>>> the end of its life cycle. We have to be able to differentiate in
>>>> amdgpu_fence_emit
>>>> between the first ever call, where we init the embedded fence's refcount
>>>> from scratch using kref_init,
>>>> and repeat calls, where the refcount is already > 0 and we just
>>>> artificially increase the refcount to align
>>>> behavior with pointer style fences in other drivers.
>>>
>>> Well what we should probably rather do is move the init out of emit 
>>> instead.
>>>
>>> The only down side I can see is that the sequence number isn't know 
>>> on initial init and so needs to be zero or something like that.
>>>
>>> Regards,
>>> Christian.
>>
>>
>> Not sure how this helps, the problem is not there but in
>> amdgpu_job_run: for the embedded fence of a resubmitted job in the pending list,
>> amdgpu_job_run
>> will be called twice or even 3 times with the recheck-guilty-job
>> sequence. I am supposed to do dma_fence_init on the embedded HW fence only
>> on the first call, while on the second and
>> third I only update the sequence number and increase the refcount. How can I
>> differentiate between first and non-first calls without
>> job_run_counter?
>
> Yeah, good point. We should really stop re-submitting jobs altogether 
> in the kernel and move that whole functionality into userspace.
>
> Christian.


So I guess we keep this for now and look at how to move the resubmit
functionality to user space as a separate task?
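
For reference, the distinction the counter gives us in amdgpu_fence_emit is
just the one below - a condensed restatement of the hunk quoted further down
in this mail, with the intended refcounts spelled out and the plain non-job
fence path left out; a sketch, not new code:

	if (job && job->job_run_counter) {
		/* resubmit: the embedded fence was already initialized on the
		 * first run and its refcount is still > 0, only refresh seqno
		 */
		fence->seqno = seq;
		dma_fence_get(fence);		/* one reference per emit, like other drivers */
	} else if (job) {
		/* first run: kref_init() inside dma_fence_init() -> refcount = 1 */
		dma_fence_init(fence, &amdgpu_job_fence_ops,
			       &ring->fence_drv.lock,
			       adev->fence_context + ring->idx, seq);
		dma_fence_get(fence);		/* against the put in amdgpu_job_free{,_cb} */
	}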

Andrey


>
>>
>> Andrey
>>
>>
>>>
>>>>
>>>> I guess we could assume that the embedded fence is all zeroes before the
>>>> first dma_fence_init, assuming the job itself
>>>> was allocated using kzalloc, and so you could look at dma_fence_ops ==
>>>> NULL or maybe seqno == 0
>>>> as a hint for whether this is the first call or not, but that is a risky assumption
>>>> in my opinion.
>>>>
>>>> Andrey
>>>>
>>>>
>>>>>
>>>>>>
>>>>>> Also since in the previous patch we resumed setting 
>>>>>> s_fence->parent to NULL
>>>>>> in drm_sched_stop switch to directly checking if job->hw_fence is
>>>>>> signaled to short circuit reset if already signed.
>>>>>>
>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>> Tested-by: Yiqing Yao <yiqing.yao@amd.com>
>>>>>> ---
>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c |  2 ++
>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 23 
>>>>>> ++++++++++++++++------
>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  7 ++++++-
>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  4 ----
>>>>>>   4 files changed, 25 insertions(+), 11 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>>> index 513c57f839d8..447bd92c4856 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>>> @@ -684,6 +684,8 @@ int amdgpu_amdkfd_submit_ib(struct 
>>>>>> amdgpu_device *adev,
>>>>>>           goto err_ib_sched;
>>>>>>       }
>>>>>>   +    /* Drop the initial kref_init count (see drm_sched_main as 
>>>>>> example) */
>>>>>> +    dma_fence_put(f);
>>>>>>       ret = dma_fence_wait(f, false);
>>>>>>     err_ib_sched:
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> index c99541685804..f9718119834f 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> @@ -5009,16 +5009,28 @@ static void 
>>>>>> amdgpu_device_recheck_guilty_jobs(
>>>>>>             /* clear job's guilty and depend the folowing step to 
>>>>>> decide the real one */
>>>>>>           drm_sched_reset_karma(s_job);
>>>>>> -        /* for the real bad job, it will be resubmitted twice, 
>>>>>> adding a dma_fence_get
>>>>>> -         * to make sure fence is balanced */
>>>>>> -        dma_fence_get(s_job->s_fence->parent);
>>>>>> drm_sched_resubmit_jobs_ext(&ring->sched, 1);
>>>>>>   +        if (!s_job->s_fence->parent) {
>>>>>> +            DRM_WARN("Failed to get a HW fence for job!");
>>>>>> +            continue;
>>>>>> +        }
>>>>>> +
>>>>>>           ret = dma_fence_wait_timeout(s_job->s_fence->parent, 
>>>>>> false, ring->sched.timeout);
>>>>>>           if (ret == 0) { /* timeout */
>>>>>>               DRM_ERROR("Found the real bad job! ring:%s, 
>>>>>> job_id:%llx\n",
>>>>>>                           ring->sched.name, s_job->id);
>>>>>>   +
>>>>>> +            /* Clear this failed job from fence array */
>>>>>> +            amdgpu_fence_driver_clear_job_fences(ring);
>>>>>> +
>>>>>> +            /* Since the job won't signal and we go for
>>>>>> +             * another resubmit drop this parent pointer
>>>>>> +             */
>>>>>> + dma_fence_put(s_job->s_fence->parent);
>>>>>> +            s_job->s_fence->parent = NULL;
>>>>>> +
>>>>>>               /* set guilty */
>>>>>>               drm_sched_increase_karma(s_job);
>>>>>>   retry:
>>>>>> @@ -5047,7 +5059,6 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>>>>             /* got the hw fence, signal finished fence */
>>>>>>           atomic_dec(ring->sched.score);
>>>>>> -        dma_fence_put(s_job->s_fence->parent);
>>>>>> dma_fence_get(&s_job->s_fence->finished);
>>>>>> dma_fence_signal(&s_job->s_fence->finished);
>>>>>> dma_fence_put(&s_job->s_fence->finished);
>>>>>> @@ -5220,8 +5231,8 @@ int amdgpu_device_gpu_recover(struct 
>>>>>> amdgpu_device *adev,
>>>>>>        *
>>>>>>        * job->base holds a reference to parent fence
>>>>>>        */
>>>>>> -    if (job && job->base.s_fence->parent &&
>>>>>> - dma_fence_is_signaled(job->base.s_fence->parent)) {
>>>>>> +    if (job && (job->hw_fence.ops != NULL) &&
>>>>>> +        dma_fence_is_signaled(&job->hw_fence)) {
>>>>>>           job_signaled = true;
>>>>>>           dev_info(adev->dev, "Guilty job already signaled, 
>>>>>> skipping HW reset");
>>>>>>           goto skip_hw_reset;
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>> index d6d54ba4c185..9bd4e18212fc 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>> @@ -164,11 +164,16 @@ int amdgpu_fence_emit(struct amdgpu_ring 
>>>>>> *ring, struct dma_fence **f, struct amd
>>>>>>       if (job && job->job_run_counter) {
>>>>>>           /* reinit seq for resubmitted jobs */
>>>>>>           fence->seqno = seq;
>>>>>> +        /* TO be inline with external fence creation and other 
>>>>>> drivers */
>>>>>> +        dma_fence_get(fence);
>>>>>>       } else {
>>>>>> -        if (job)
>>>>>> +        if (job) {
>>>>>>               dma_fence_init(fence, &amdgpu_job_fence_ops,
>>>>>>                          &ring->fence_drv.lock,
>>>>>>                          adev->fence_context + ring->idx, seq);
>>>>>> +            /* Against remove in amdgpu_job_{free, free_cb} */
>>>>>> +            dma_fence_get(fence);
>>>>>> +        }
>>>>>>           else
>>>>>>               dma_fence_init(fence, &amdgpu_fence_ops,
>>>>>>                          &ring->fence_drv.lock,
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>> index 58568fdde2d0..638e1d600258 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>> @@ -267,10 +267,6 @@ static struct dma_fence 
>>>>>> *amdgpu_job_run(struct drm_sched_job *sched_job)
>>>>>>               DRM_ERROR("Error scheduling IBs (%d)\n", r);
>>>>>>       }
>>>>>>   -    if (!job->job_run_counter)
>>>>>> -        dma_fence_get(fence);
>>>>>> -    else if (finished->error < 0)
>>>>>> -        dma_fence_put(&job->hw_fence);
>>>>>>       job->job_run_counter++;
>>>>>>       amdgpu_job_free_resources(job);
>>>>>
>>>
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/5] drm/amdgpu: Follow up change to previous drm scheduler change.
  2022-06-23 14:51             ` Andrey Grodzovsky
@ 2022-06-23 14:54               ` Christian König
  0 siblings, 0 replies; 34+ messages in thread
From: Christian König @ 2022-06-23 14:54 UTC (permalink / raw)
  To: Andrey Grodzovsky, Christian König, dri-devel, amd-gfx
  Cc: jingwen.chen2, monk.liu, yiqing.yao

Am 23.06.22 um 16:51 schrieb Andrey Grodzovsky:
>
> On 2022-06-23 01:52, Christian König wrote:
>> Am 22.06.22 um 19:19 schrieb Andrey Grodzovsky:
>>>
>>> On 2022-06-22 03:17, Christian König wrote:
>>>> Am 21.06.22 um 22:00 schrieb Andrey Grodzovsky:
>>>>>
>>>>> On 2022-06-21 03:28, Christian König wrote:
>>>>>> Am 21.06.22 um 00:03 schrieb Andrey Grodzovsky:
>>>>>>> Align refcount behaviour for amdgpu_job embedded HW fence with
>>>>>>> classic pointer style HW fences by increasing refcount each
>>>>>>> time emit is called so amdgpu code doesn't need to make workarounds
>>>>>>> using amdgpu_job.job_run_counter to keep the HW fence refcount 
>>>>>>> balanced.
>>>>>>
>>>>>> Could we now also remove job_run_counter?
>>>>>>
>>>>>> Christian.
>>>>>
>>>>>
>>>>> I am afraid not, the job counter is needed since the refcount on the
>>>>> embedded fence must never drop to zero, because that would free the job
>>>>> itself before
>>>>> the end of its life cycle. We have to be able to differentiate in
>>>>> amdgpu_fence_emit
>>>>> between the first ever call, where we init the embedded fence's
>>>>> refcount from scratch using kref_init,
>>>>> and repeat calls, where the refcount is already > 0 and we just
>>>>> artificially increase the refcount to align
>>>>> behavior with pointer style fences in other drivers.
>>>>
>>>> Well what we should probably rather do is move the init out of emit 
>>>> instead.
>>>>
>>>> The only down side I can see is that the sequence number isn't know 
>>>> on initial init and so needs to be zero or something like that.
>>>>
>>>> Regards,
>>>> Christian.
>>>
>>>
>>> Not sure how this helps, the problem is not there but in
>>> amdgpu_job_run: for the embedded fence of a resubmitted job in the pending list,
>>> amdgpu_job_run
>>> will be called twice or even 3 times with the recheck-guilty-job
>>> sequence. I am supposed to do dma_fence_init on the embedded HW fence
>>> only on the first call, while on the second and
>>> third I only update the sequence number and increase the refcount. How can I
>>> differentiate between first and non-first calls without
>>> job_run_counter?
>>
>> Yeah, good point. We should really stop re-submitting jobs altogether 
>> in the kernel and move that whole functionality into userspace.
>>
>> Christian.
>
>
> So I guess we keep this for now and look at how to move the resubmit
> functionality to user space as a separate task?

Well yes. I mean that's a rather large project on its own.

Christian.

>
> Andrey
>
>
>>
>>>
>>> Andrey
>>>
>>>
>>>>
>>>>>
>>>>> I guess we could assume that the embedded fence is all zeroes before the
>>>>> first dma_fence_init, assuming the job itself
>>>>> was allocated using kzalloc, and so you could look at dma_fence_ops ==
>>>>> NULL or maybe seqno == 0
>>>>> as a hint for whether this is the first call or not, but that is a risky assumption
>>>>> in my opinion.
>>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Also since in the previous patch we resumed setting 
>>>>>>> s_fence->parent to NULL
>>>>>>> in drm_sched_stop switch to directly checking if job->hw_fence is
>>>>>>> signaled to short circuit reset if already signed.
>>>>>>>
>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>> Tested-by: Yiqing Yao <yiqing.yao@amd.com>
>>>>>>> ---
>>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c |  2 ++
>>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 23 
>>>>>>> ++++++++++++++++------
>>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  7 ++++++-
>>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  4 ----
>>>>>>>   4 files changed, 25 insertions(+), 11 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>>>> index 513c57f839d8..447bd92c4856 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>>>>>>> @@ -684,6 +684,8 @@ int amdgpu_amdkfd_submit_ib(struct 
>>>>>>> amdgpu_device *adev,
>>>>>>>           goto err_ib_sched;
>>>>>>>       }
>>>>>>>   +    /* Drop the initial kref_init count (see drm_sched_main 
>>>>>>> as example) */
>>>>>>> +    dma_fence_put(f);
>>>>>>>       ret = dma_fence_wait(f, false);
>>>>>>>     err_ib_sched:
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> index c99541685804..f9718119834f 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> @@ -5009,16 +5009,28 @@ static void 
>>>>>>> amdgpu_device_recheck_guilty_jobs(
>>>>>>>             /* clear job's guilty and depend the folowing step 
>>>>>>> to decide the real one */
>>>>>>>           drm_sched_reset_karma(s_job);
>>>>>>> -        /* for the real bad job, it will be resubmitted twice, 
>>>>>>> adding a dma_fence_get
>>>>>>> -         * to make sure fence is balanced */
>>>>>>> -        dma_fence_get(s_job->s_fence->parent);
>>>>>>> drm_sched_resubmit_jobs_ext(&ring->sched, 1);
>>>>>>>   +        if (!s_job->s_fence->parent) {
>>>>>>> +            DRM_WARN("Failed to get a HW fence for job!");
>>>>>>> +            continue;
>>>>>>> +        }
>>>>>>> +
>>>>>>>           ret = dma_fence_wait_timeout(s_job->s_fence->parent, 
>>>>>>> false, ring->sched.timeout);
>>>>>>>           if (ret == 0) { /* timeout */
>>>>>>>               DRM_ERROR("Found the real bad job! ring:%s, 
>>>>>>> job_id:%llx\n",
>>>>>>>                           ring->sched.name, s_job->id);
>>>>>>>   +
>>>>>>> +            /* Clear this failed job from fence array */
>>>>>>> + amdgpu_fence_driver_clear_job_fences(ring);
>>>>>>> +
>>>>>>> +            /* Since the job won't signal and we go for
>>>>>>> +             * another resubmit drop this parent pointer
>>>>>>> +             */
>>>>>>> + dma_fence_put(s_job->s_fence->parent);
>>>>>>> +            s_job->s_fence->parent = NULL;
>>>>>>> +
>>>>>>>               /* set guilty */
>>>>>>>               drm_sched_increase_karma(s_job);
>>>>>>>   retry:
>>>>>>> @@ -5047,7 +5059,6 @@ static void 
>>>>>>> amdgpu_device_recheck_guilty_jobs(
>>>>>>>             /* got the hw fence, signal finished fence */
>>>>>>>           atomic_dec(ring->sched.score);
>>>>>>> -        dma_fence_put(s_job->s_fence->parent);
>>>>>>> dma_fence_get(&s_job->s_fence->finished);
>>>>>>> dma_fence_signal(&s_job->s_fence->finished);
>>>>>>> dma_fence_put(&s_job->s_fence->finished);
>>>>>>> @@ -5220,8 +5231,8 @@ int amdgpu_device_gpu_recover(struct 
>>>>>>> amdgpu_device *adev,
>>>>>>>        *
>>>>>>>        * job->base holds a reference to parent fence
>>>>>>>        */
>>>>>>> -    if (job && job->base.s_fence->parent &&
>>>>>>> - dma_fence_is_signaled(job->base.s_fence->parent)) {
>>>>>>> +    if (job && (job->hw_fence.ops != NULL) &&
>>>>>>> +        dma_fence_is_signaled(&job->hw_fence)) {
>>>>>>>           job_signaled = true;
>>>>>>>           dev_info(adev->dev, "Guilty job already signaled, 
>>>>>>> skipping HW reset");
>>>>>>>           goto skip_hw_reset;
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>> index d6d54ba4c185..9bd4e18212fc 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>> @@ -164,11 +164,16 @@ int amdgpu_fence_emit(struct amdgpu_ring 
>>>>>>> *ring, struct dma_fence **f, struct amd
>>>>>>>       if (job && job->job_run_counter) {
>>>>>>>           /* reinit seq for resubmitted jobs */
>>>>>>>           fence->seqno = seq;
>>>>>>> +        /* TO be inline with external fence creation and other 
>>>>>>> drivers */
>>>>>>> +        dma_fence_get(fence);
>>>>>>>       } else {
>>>>>>> -        if (job)
>>>>>>> +        if (job) {
>>>>>>>               dma_fence_init(fence, &amdgpu_job_fence_ops,
>>>>>>>                          &ring->fence_drv.lock,
>>>>>>>                          adev->fence_context + ring->idx, seq);
>>>>>>> +            /* Against remove in amdgpu_job_{free, free_cb} */
>>>>>>> +            dma_fence_get(fence);
>>>>>>> +        }
>>>>>>>           else
>>>>>>>               dma_fence_init(fence, &amdgpu_fence_ops,
>>>>>>>                          &ring->fence_drv.lock,
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>>> index 58568fdde2d0..638e1d600258 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>>> @@ -267,10 +267,6 @@ static struct dma_fence 
>>>>>>> *amdgpu_job_run(struct drm_sched_job *sched_job)
>>>>>>>               DRM_ERROR("Error scheduling IBs (%d)\n", r);
>>>>>>>       }
>>>>>>>   -    if (!job->job_run_counter)
>>>>>>> -        dma_fence_get(fence);
>>>>>>> -    else if (finished->error < 0)
>>>>>>> -        dma_fence_put(&job->hw_fence);
>>>>>>>       job->job_run_counter++;
>>>>>>>       amdgpu_job_free_resources(job);
>>>>>>
>>>>
>>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] drm/amdgpu: Fix possible refcount leak for release of external_hw_fence
  2022-06-22 15:04           ` Christian König
@ 2022-06-23 21:18             ` Andrey Grodzovsky
  2022-06-24  6:00               ` Christian König
  0 siblings, 1 reply; 34+ messages in thread
From: Andrey Grodzovsky @ 2022-06-23 21:18 UTC (permalink / raw)
  To: Christian König, dri-devel, amd-gfx
  Cc: jingwen.chen2, Christian.Koenig, monk.liu, yiqing.yao


On 2022-06-22 11:04, Christian König wrote:
> Am 22.06.22 um 17:01 schrieb Andrey Grodzovsky:
>>
>> On 2022-06-22 05:00, Christian König wrote:
>>> Am 21.06.22 um 21:34 schrieb Andrey Grodzovsky:
>>>> On 2022-06-21 03:19, Christian König wrote:
>>>>
>>>>> Am 21.06.22 um 00:02 schrieb Andrey Grodzovsky:
>>>>>> Problem:
>>>>>> In amdgpu_job_submit_direct - The refcount should drop by 2
>>>>>> but it drops only by 1.
>>>>>>
>>>>>> amdgpu_ib_sched->emit -> refcount 1 from first fence init
>>>>>> dma_fence_get -> refcount 2
>>>>>> dma_fence_put -> refcount 1
>>>>>>
>>>>>> Fix:
>>>>>> Add put for external_hw_fence in amdgpu_job_free/free_cb
>>>>>
>>>>> Well what is the external_hw_fence good for in this construct?
>>>>
>>>>
>>>> As far as I understand, for direct submissions you don't want to
>>>> pass a job
>>>> pointer to ib_schedule, and so you can't use the embedded fence in
>>>> this case.
>>>
>>> Can you please look a bit deeper into this, we now have a couple of 
>>> fields in the job structure which have no obvious use.
>>>
>>> I think we could pass a job structure to ib_schedule even for direct 
>>> submit now.
>>
>>
>> Are you sure? I see a lot of activity in amdgpu_ib_schedule that
>> depends on the presence of vm and fence_ctx, which are set when the job
>> pointer argument != NULL - might this have a negative impact on direct
>> submit?
>
> Not 100% sure, but we did tons of workarounds because we didn't have a
> job pointer for direct submit.
>
> But this was before we embedded the IBs at the end of the job.
>
> It's quite likely that this should be possible now, it's just that 
> somebody needs to double check.
>
> Christian.


Looking more, I see things like amdgpu_vm_flush and
amdgpu_ring_emit_cntxcntl/emit_frame_cntl that are conditioned on the job
argument - it doesn't look to me like any of this is relevant to direct submit?

I also noticed that direct submit passes the created fence back to its
caller while freeing the job immediately. Using the job-embedded fence here
would keep the job object hanging around in memory,
without any use, for as long as its fence is referenced - and the job
object is much larger than a single fence.
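
A rough illustration of why the embedded fence would pin the whole job
(sketch only; the helper name is made up, and the real release path in
amdgpu_fence.c defers the free through RCU - this just shows the
container_of relationship that ties the job's lifetime to its fence):

	static void job_fence_release(struct dma_fence *f)
	{
		struct amdgpu_job *job = container_of(f, struct amdgpu_job, hw_fence);

		/* The job can only be freed together with its embedded fence, so
		 * it stays allocated for as long as anyone still holds the fence.
		 */
		kfree(job);
	}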

Andrey


>
>>
>> Andrey
>>
>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> Andrey
>>>>
>>>>
>>>>>
>>>>>>
>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>> ---
>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 8 ++++++--
>>>>>>   1 file changed, 6 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>> index 10aa073600d4..58568fdde2d0 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>> @@ -152,8 +152,10 @@ static void amdgpu_job_free_cb(struct 
>>>>>> drm_sched_job *s_job)
>>>>>>       /* only put the hw fence if has embedded fence */
>>>>>>       if (job->hw_fence.ops != NULL)
>>>>>>           dma_fence_put(&job->hw_fence);
>>>>>> -    else
>>>>>> +    else {
>>>>>
>>>>> When one side of the if uses {} the other side should use {} as 
>>>>> well, e.g. use } else { here.
>>>>>
>>>>> Christian.
>>>>>
>>>>>> + dma_fence_put(job->external_hw_fence);
>>>>>>           kfree(job);
>>>>>> +    }
>>>>>>   }
>>>>>>     void amdgpu_job_free(struct amdgpu_job *job)
>>>>>> @@ -165,8 +167,10 @@ void amdgpu_job_free(struct amdgpu_job *job)
>>>>>>       /* only put the hw fence if has embedded fence */
>>>>>>       if (job->hw_fence.ops != NULL)
>>>>>>           dma_fence_put(&job->hw_fence);
>>>>>> -    else
>>>>>> +    else {
>>>>>> +        dma_fence_put(job->external_hw_fence);
>>>>>>           kfree(job);
>>>>>> +    }
>>>>>>   }
>>>>>>     int amdgpu_job_submit(struct amdgpu_job *job, struct 
>>>>>> drm_sched_entity *entity,
>>>>>
>>>
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] drm/amdgpu: Fix possible refcount leak for release of external_hw_fence
  2022-06-23 21:18             ` Andrey Grodzovsky
@ 2022-06-24  6:00               ` Christian König
  0 siblings, 0 replies; 34+ messages in thread
From: Christian König @ 2022-06-24  6:00 UTC (permalink / raw)
  To: Andrey Grodzovsky, Christian König, dri-devel, amd-gfx
  Cc: jingwen.chen2, monk.liu, yiqing.yao

Am 23.06.22 um 23:18 schrieb Andrey Grodzovsky:
>
> On 2022-06-22 11:04, Christian König wrote:
>> Am 22.06.22 um 17:01 schrieb Andrey Grodzovsky:
>>>
>>> On 2022-06-22 05:00, Christian König wrote:
>>>> Am 21.06.22 um 21:34 schrieb Andrey Grodzovsky:
>>>>> On 2022-06-21 03:19, Christian König wrote:
>>>>>
>>>>>> Am 21.06.22 um 00:02 schrieb Andrey Grodzovsky:
>>>>>>> Problem:
>>>>>>> In amdgpu_job_submit_direct - The refcount should drop by 2
>>>>>>> but it drops only by 1.
>>>>>>>
>>>>>>> amdgpu_ib_sched->emit -> refcount 1 from first fence init
>>>>>>> dma_fence_get -> refcount 2
>>>>>>> dma_fence_put -> refcount 1
>>>>>>>
>>>>>>> Fix:
>>>>>>> Add put for external_hw_fence in amdgpu_job_free/free_cb
>>>>>>
>>>>>> Well what is the external_hw_fence good for in this construct?
>>>>>
>>>>>
>>>>> As far as I understand, for direct submissions you don't want to
>>>>> pass a job
>>>>> pointer to ib_schedule, and so you can't use the embedded fence in
>>>>> this case.
>>>>
>>>> Can you please look a bit deeper into this, we now have a couple of 
>>>> fields in the job structure which have no obvious use.
>>>>
>>>> I think we could pass a job structure to ib_schedule even for 
>>>> direct submit now.
>>>
>>>
>>> Are you sure? I see a lot of activity in amdgpu_ib_schedule that
>>> depends on the presence of vm and fence_ctx, which are set when the job
>>> pointer argument != NULL - might this have a negative impact on
>>> direct submit?
>>
>> Not 100% sure, but we did tons of workarounds because we didn't have a
>> job pointer for direct submit.
>>
>> But this was before we embedded the IBs at the end of the job.
>>
>> It's quite likely that this should be possible now, it's just that 
>> somebody needs to double check.
>>
>> Christian.
>
>
> Looking more, I see things like amdgpu_vm_flush and
> amdgpu_ring_emit_cntxcntl/emit_frame_cntl that are conditioned on the job
> argument - it doesn't look to me like any of this is relevant to direct submit?

I think that those could and maybe even should be cleaned up.

>
> I also noticed that direct submit passes the created fence back to its
> caller while freeing the job immediately. Using the job-embedded fence here
> would keep the job object hanging around in memory,
> without any use, for as long as its fence is referenced - and the job
> object is much larger than a single fence.

Ah! Yes, you are right! That was the fundamental problem we ran into: 
When we submit the IB tests during GPU reset we can't allocate any memory!

Ok, that needs further investigation. Please go ahead with your plan to 
fix this first and then clean it up later on.

Regards,
Christian.

>
> Andrey
>
>
>>
>>>
>>> Andrey
>>>
>>>
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>> ---
>>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 8 ++++++--
>>>>>>>   1 file changed, 6 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>>> index 10aa073600d4..58568fdde2d0 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>>> @@ -152,8 +152,10 @@ static void amdgpu_job_free_cb(struct 
>>>>>>> drm_sched_job *s_job)
>>>>>>>       /* only put the hw fence if has embedded fence */
>>>>>>>       if (job->hw_fence.ops != NULL)
>>>>>>>           dma_fence_put(&job->hw_fence);
>>>>>>> -    else
>>>>>>> +    else {
>>>>>>
>>>>>> When one side of the if uses {} the other side should use {} as 
>>>>>> well, e.g. use } else { here.
>>>>>>
>>>>>> Christian.
>>>>>>
>>>>>>> + dma_fence_put(job->external_hw_fence);
>>>>>>>           kfree(job);
>>>>>>> +    }
>>>>>>>   }
>>>>>>>     void amdgpu_job_free(struct amdgpu_job *job)
>>>>>>> @@ -165,8 +167,10 @@ void amdgpu_job_free(struct amdgpu_job *job)
>>>>>>>       /* only put the hw fence if has embedded fence */
>>>>>>>       if (job->hw_fence.ops != NULL)
>>>>>>>           dma_fence_put(&job->hw_fence);
>>>>>>> -    else
>>>>>>> +    else {
>>>>>>> +        dma_fence_put(job->external_hw_fence);
>>>>>>>           kfree(job);
>>>>>>> +    }
>>>>>>>   }
>>>>>>>     int amdgpu_job_submit(struct amdgpu_job *job, struct 
>>>>>>> drm_sched_entity *entity,
>>>>>>
>>>>
>>


^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2022-06-24  6:00 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-20 22:02 [PATCH 0/5] Rework amdgpu HW fence refocunt and update scheduler parent fence refcount Andrey Grodzovsky
2022-06-20 22:02 ` Andrey Grodzovsky
2022-06-20 22:02 ` [PATCH 1/5] drm/amdgpu: Fix possible refcount leak for release of external_hw_fence Andrey Grodzovsky
2022-06-20 22:02   ` Andrey Grodzovsky
2022-06-21  7:19   ` Christian König
2022-06-21 19:34     ` Andrey Grodzovsky
2022-06-22  9:00       ` Christian König
2022-06-22 15:01         ` Andrey Grodzovsky
2022-06-22 15:04           ` Christian König
2022-06-23 21:18             ` Andrey Grodzovsky
2022-06-24  6:00               ` Christian König
2022-06-20 22:02 ` [PATCH 2/5] drm/amdgpu: Add put fence in amdgpu_fence_driver_clear_job_fences Andrey Grodzovsky
2022-06-20 22:02   ` Andrey Grodzovsky
2022-06-21  7:21   ` Christian König
2022-06-20 22:03 ` [PATCH 3/5] drm/amdgpu: Prevent race between late signaled fences and GPU reset Andrey Grodzovsky
2022-06-20 22:03   ` Andrey Grodzovsky
2022-06-21  7:25   ` Christian König
2022-06-21 19:45     ` Andrey Grodzovsky
2022-06-22  1:47       ` VURDIGERENATARAJ, CHANDAN
2022-06-22  2:41         ` Andrey Grodzovsky
2022-06-22 17:31       ` Andrey Grodzovsky
2022-06-23  5:57         ` Christian König
2022-06-20 22:03 ` [PATCH 4/5] drm/sched: Partial revert of 'drm/sched: Keep s_fence->parent pointer' Andrey Grodzovsky
2022-06-20 22:03   ` Andrey Grodzovsky
2022-06-21  7:26   ` Christian König
2022-06-20 22:03 ` [PATCH 5/5] drm/amdgpu: Follow up change to previous drm scheduler change Andrey Grodzovsky
2022-06-20 22:03   ` Andrey Grodzovsky
2022-06-21  7:28   ` Christian König
2022-06-21 20:00     ` Andrey Grodzovsky
2022-06-22  7:17       ` Christian König
2022-06-22 17:19         ` Andrey Grodzovsky
2022-06-23  5:52           ` Christian König
2022-06-23 14:51             ` Andrey Grodzovsky
2022-06-23 14:54               ` Christian König
