* [PATCH 00/12] *** SRIOV GPU RESET PATCHES ***
From: Monk Liu @ 2017-09-30  6:03 UTC
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

implement strict mode gpu reset, and some changes for loose mode reset

Monk Liu (12):
  drm/amdgpu/sriov:now must reinit psp
  drm/amdgpu/sriov:fix memory leak in psp_load_fw
  drm/amdgpu/sriov:use atomic type for sriov_reset
  drm/amdgpu/sriov:cleanup gpu reset mlock
  drm/amdgpu/sriov:accurate description for sriov_gpu_reset
  drm/amdgpu/sriov:handle more jobs hang in different ring case
  drm/amdgpu/sriov:implement strict gpu reset
  drm/amdgpu:explicitly call fence_process
  drm/amdgpu/sriov:return -ENODEV if gpu was reset
  drm/amdgpu/sriov:implement guilty ctx for loose reset
  drm/amdgpu/sriov:show error if ib test failed
  drm/amdgpu/sriov:no shadow buffer recovery

 drivers/gpu/drm/amd/amdgpu/amdgpu.h           |   5 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        |  46 ++++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       |   1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 135 +++++++++++++++++++-------
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |   4 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c     |   7 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       |  19 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c       |   7 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c       |  22 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c     |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c      |   2 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h      |   2 +-
 drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c         |   6 +-
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c         |   6 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c         |   4 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c         |   4 +-
 drivers/gpu/drm/amd/scheduler/gpu_scheduler.c |  82 ++++++++++++++++
 drivers/gpu/drm/amd/scheduler/gpu_scheduler.h |   3 +
 18 files changed, 284 insertions(+), 73 deletions(-)

-- 
2.7.4


* [PATCH 01/12] drm/amdgpu/sriov:now must reinit psp
From: Monk Liu @ 2017-09-30  6:03 UTC
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

otherwise the KIQ cannot work after a VF FLR

Change-Id: Icb18e794b5e4dccfd70c811c138c7102df874203
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index a86d856..08bc0cf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1880,6 +1880,7 @@ static int amdgpu_sriov_reinit_late(struct amdgpu_device *adev)
 
 	static enum amd_ip_block_type ip_order[] = {
 		AMD_IP_BLOCK_TYPE_SMC,
+		AMD_IP_BLOCK_TYPE_PSP,
 		AMD_IP_BLOCK_TYPE_DCE,
 		AMD_IP_BLOCK_TYPE_GFX,
 		AMD_IP_BLOCK_TYPE_SDMA,
-- 
2.7.4


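A note for readers outside the driver: amdgpu_sriov_reinit_late() walks the
ip_order[] table above and resumes each IP block in sequence, so a block
missing from the table is simply never brought back after the VF FLR, which
is why the KIQ (driven by firmware that PSP loads) stopped working. Below is
a minimal standalone C sketch of that table-driven resume; the block names
and the resume stub are illustrative stand-ins, not the real amdgpu hooks.

#include <stddef.h>
#include <stdio.h>

/* hypothetical stand-ins for the amd_ip_block_type values */
enum ip_block { IP_SMC, IP_PSP, IP_DCE, IP_GFX, IP_SDMA };

static const char * const ip_name[] = { "SMC", "PSP", "DCE", "GFX", "SDMA" };

static int resume_block(enum ip_block b)
{
	/* the real driver would call the block's resume/hw_init hook here;
	 * PSP has to come before GFX/SDMA because it loads their firmware */
	printf("resuming %s\n", ip_name[b]);
	return 0;
}

int main(void)
{
	/* mirrors ip_order[] after the patch: PSP inserted right after SMC */
	static const enum ip_block ip_order[] = {
		IP_SMC, IP_PSP, IP_DCE, IP_GFX, IP_SDMA,
	};
	size_t i;

	for (i = 0; i < sizeof(ip_order) / sizeof(ip_order[0]); i++)
		if (resume_block(ip_order[i]))
			return 1;
	return 0;
}
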
* [PATCH 02/12] drm/amdgpu/sriov:fix memory leak in psp_load_fw
From: Monk Liu @ 2017-09-30  6:03 UTC
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

when doing a gpu reset this routine shouldn't allocate any
resources, otherwise the SRIOV gpu reset will cause a
memory leak

Change-Id: I25da3a5b475196c75c7e639adc40751754625968
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 447d446..52daabc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -334,23 +334,26 @@ static int psp_load_fw(struct amdgpu_device *adev)
 	int ret;
 	struct psp_context *psp = &adev->psp;
 
+	if (amdgpu_sriov_vf(adev) && adev->in_sriov_reset != 0)
+		goto skip_memalloc;
+
 	psp->cmd = kzalloc(sizeof(struct psp_gfx_cmd_resp), GFP_KERNEL);
 	if (!psp->cmd)
 		return -ENOMEM;
 
 	ret = amdgpu_bo_create_kernel(adev, PSP_1_MEG, PSP_1_MEG,
-				      AMDGPU_GEM_DOMAIN_GTT,
-				      &psp->fw_pri_bo,
-				      &psp->fw_pri_mc_addr,
-				      &psp->fw_pri_buf);
+					AMDGPU_GEM_DOMAIN_GTT,
+					&psp->fw_pri_bo,
+					&psp->fw_pri_mc_addr,
+					&psp->fw_pri_buf);
 	if (ret)
 		goto failed;
 
 	ret = amdgpu_bo_create_kernel(adev, PSP_FENCE_BUFFER_SIZE, PAGE_SIZE,
-				      AMDGPU_GEM_DOMAIN_VRAM,
-				      &psp->fence_buf_bo,
-				      &psp->fence_buf_mc_addr,
-				      &psp->fence_buf);
+					AMDGPU_GEM_DOMAIN_VRAM,
+					&psp->fence_buf_bo,
+					&psp->fence_buf_mc_addr,
+					&psp->fence_buf);
 	if (ret)
 		goto failed_mem2;
 
@@ -375,6 +378,7 @@ static int psp_load_fw(struct amdgpu_device *adev)
 	if (ret)
 		goto failed_mem;
 
+skip_memalloc:
 	ret = psp_hw_start(psp);
 	if (ret)
 		goto failed_mem;
-- 
2.7.4


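The fix is an instance of the allocate-once, reuse-across-reinit pattern:
the buffers created at first load survive a VF reset, so the reset path has
to jump straight to hardware start instead of allocating fresh copies it can
never free. A hedged standalone C sketch of that control flow follows; the
names are illustrative and do not reflect the real psp_load_fw() internals.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct psp_like {
	void *fw_buf;	/* stands in for fw_pri_bo / fence_buf_bo */
	bool in_reset;	/* stands in for adev->in_sriov_reset */
};

static int hw_start(struct psp_like *p)
{
	printf("hw start, fw buffer at %p\n", p->fw_buf);
	return 0;
}

static int load_fw(struct psp_like *p)
{
	if (p->in_reset)
		goto skip_memalloc;	/* buffers exist, don't leak a second copy */

	p->fw_buf = malloc(4096);
	if (!p->fw_buf)
		return -1;

skip_memalloc:
	return hw_start(p);
}

int main(void)
{
	struct psp_like p = { 0 };

	load_fw(&p);		/* first load: allocates */
	p.in_reset = true;
	load_fw(&p);		/* reset path: reuses the same buffer */
	free(p.fw_buf);
	return 0;
}
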
* [PATCH 03/12] drm/amdgpu/sriov:use atomic type for sriov_reset
From: Monk Liu @ 2017-09-30  6:03 UTC
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

Change-Id: If33ebac11c93ce2753c30bbe0d51b594449e2e7f
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    | 4 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c  | 2 +-
 drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c      | 6 +++---
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      | 6 +++---
 6 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index a5b0b67..de11527 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1605,7 +1605,7 @@ struct amdgpu_device {
 
 	/* record last mm index being written through WREG32*/
 	unsigned long last_mm_index;
-	bool                            in_sriov_reset;
+	atomic_t      in_sriov_reset;
 };
 
 static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 08bc0cf..f507894 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2753,7 +2753,7 @@ int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job)
 
 	mutex_lock(&adev->virt.lock_reset);
 	atomic_inc(&adev->gpu_reset_counter);
-	adev->in_sriov_reset = true;
+	atomic_set(&adev->in_sriov_reset, 1);
 
 	/* block TTM */
 	resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
@@ -2864,7 +2864,7 @@ int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job)
 		dev_info(adev->dev, "GPU reset successed!\n");
 	}
 
-	adev->in_sriov_reset = false;
+	atomic_set(&adev->in_sriov_reset, 0);
 	mutex_unlock(&adev->virt.lock_reset);
 	return r;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 52daabc..36cd6d1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -264,7 +264,7 @@ static int psp_hw_start(struct psp_context *psp)
 	struct amdgpu_device *adev = psp->adev;
 	int ret;
 
-	if (!amdgpu_sriov_vf(adev) || !adev->in_sriov_reset) {
+	if (!amdgpu_sriov_vf(adev) || !atomic_read(&adev->in_sriov_reset)) {
 		ret = psp_bootloader_load_sysdrv(psp);
 		if (ret)
 			return ret;
@@ -334,7 +334,7 @@ static int psp_load_fw(struct amdgpu_device *adev)
 	int ret;
 	struct psp_context *psp = &adev->psp;
 
-	if (amdgpu_sriov_vf(adev) && adev->in_sriov_reset != 0)
+	if (amdgpu_sriov_vf(adev) && atomic_read(&adev->in_sriov_reset))
 		goto skip_memalloc;
 
 	psp->cmd = kzalloc(sizeof(struct psp_gfx_cmd_resp), GFP_KERNEL);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
index 6564902..80208a7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
@@ -370,7 +370,7 @@ int amdgpu_ucode_init_bo(struct amdgpu_device *adev)
 		return 0;
 	}
 
-	if (!amdgpu_sriov_vf(adev) || !adev->in_sriov_reset) {
+	if (!amdgpu_sriov_vf(adev) || !atomic_read(&adev->in_sriov_reset)) {
 		err = amdgpu_bo_create(adev, adev->firmware.fw_size, PAGE_SIZE, true,
 					amdgpu_sriov_vf(adev) ? AMDGPU_GEM_DOMAIN_VRAM : AMDGPU_GEM_DOMAIN_GTT,
 					AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS,
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
index dfc10b1..3905ee5 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
@@ -4812,7 +4812,7 @@ static int gfx_v8_0_kiq_init_queue(struct amdgpu_ring *ring)
 
 	gfx_v8_0_kiq_setting(ring);
 
-	if (adev->in_sriov_reset) { /* for GPU_RESET case */
+	if (atomic_read(&adev->in_sriov_reset)) { /* for GPU_RESET case */
 		/* reset MQD to a clean status */
 		if (adev->gfx.mec.mqd_backup[mqd_idx])
 			memcpy(mqd, adev->gfx.mec.mqd_backup[mqd_idx], sizeof(struct vi_mqd_allocation));
@@ -4849,7 +4849,7 @@ static int gfx_v8_0_kcq_init_queue(struct amdgpu_ring *ring)
 	struct vi_mqd *mqd = ring->mqd_ptr;
 	int mqd_idx = ring - &adev->gfx.compute_ring[0];
 
-	if (!adev->in_sriov_reset && !adev->gfx.in_suspend) {
+	if (!atomic_read(&adev->in_sriov_reset) && !adev->gfx.in_suspend) {
 		memset((void *)mqd, 0, sizeof(struct vi_mqd_allocation));
 		((struct vi_mqd_allocation *)mqd)->dynamic_cu_mask = 0xFFFFFFFF;
 		((struct vi_mqd_allocation *)mqd)->dynamic_rb_mask = 0xFFFFFFFF;
@@ -4861,7 +4861,7 @@ static int gfx_v8_0_kcq_init_queue(struct amdgpu_ring *ring)
 
 		if (adev->gfx.mec.mqd_backup[mqd_idx])
 			memcpy(adev->gfx.mec.mqd_backup[mqd_idx], mqd, sizeof(struct vi_mqd_allocation));
-	} else if (adev->in_sriov_reset) { /* for GPU_RESET case */
+	} else if (atomic_read(&adev->in_sriov_reset)) { /* for GPU_RESET case */
 		/* reset MQD to a clean status */
 		if (adev->gfx.mec.mqd_backup[mqd_idx])
 			memcpy(mqd, adev->gfx.mec.mqd_backup[mqd_idx], sizeof(struct vi_mqd_allocation));
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index deeaee14..7e44306 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -2721,7 +2721,7 @@ static int gfx_v9_0_kiq_init_queue(struct amdgpu_ring *ring)
 
 	gfx_v9_0_kiq_setting(ring);
 
-	if (adev->in_sriov_reset) { /* for GPU_RESET case */
+	if (atomic_read(&adev->in_sriov_reset)) { /* for GPU_RESET case */
 		/* reset MQD to a clean status */
 		if (adev->gfx.mec.mqd_backup[mqd_idx])
 			memcpy(mqd, adev->gfx.mec.mqd_backup[mqd_idx], sizeof(struct v9_mqd_allocation));
@@ -2759,7 +2759,7 @@ static int gfx_v9_0_kcq_init_queue(struct amdgpu_ring *ring)
 	struct v9_mqd *mqd = ring->mqd_ptr;
 	int mqd_idx = ring - &adev->gfx.compute_ring[0];
 
-	if (!adev->in_sriov_reset && !adev->gfx.in_suspend) {
+	if (!atomic_read(&adev->in_sriov_reset) && !adev->gfx.in_suspend) {
 		memset((void *)mqd, 0, sizeof(struct v9_mqd_allocation));
 		((struct v9_mqd_allocation *)mqd)->dynamic_cu_mask = 0xFFFFFFFF;
 		((struct v9_mqd_allocation *)mqd)->dynamic_rb_mask = 0xFFFFFFFF;
@@ -2771,7 +2771,7 @@ static int gfx_v9_0_kcq_init_queue(struct amdgpu_ring *ring)
 
 		if (adev->gfx.mec.mqd_backup[mqd_idx])
 			memcpy(adev->gfx.mec.mqd_backup[mqd_idx], mqd, sizeof(struct v9_mqd_allocation));
-	} else if (adev->in_sriov_reset) { /* for GPU_RESET case */
+	} else if (atomic_read(&adev->in_sriov_reset)) { /* for GPU_RESET case */
 		/* reset MQD to a clean status */
 		if (adev->gfx.mec.mqd_backup[mqd_idx])
 			memcpy(mqd, adev->gfx.mec.mqd_backup[mqd_idx], sizeof(struct v9_mqd_allocation));
-- 
2.7.4


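The bool-to-atomic_t conversion matters because in_sriov_reset is written by
the reset thread and read from init paths running on other threads with no
shared lock, and a plain bool gives no guarantees there. A minimal C11
sketch of the same read/set pattern, using userspace atomics as a stand-in
for the kernel atomic_t API:

#include <stdatomic.h>
#include <stdio.h>

static atomic_int in_sriov_reset;	/* was: bool in_sriov_reset */

static void reset_begin(void) { atomic_store(&in_sriov_reset, 1); }
static void reset_end(void)   { atomic_store(&in_sriov_reset, 0); }

static void psp_hw_start_like(void)
{
	/* the equivalent of atomic_read(&adev->in_sriov_reset) */
	if (atomic_load(&in_sriov_reset))
		printf("reset in flight: skip bootloader load\n");
	else
		printf("cold init: load bootloader\n");
}

int main(void)
{
	psp_hw_start_like();
	reset_begin();
	psp_hw_start_like();
	reset_end();
	return 0;
}
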
* [PATCH 04/12] drm/amdgpu/sriov:cleanup gpu reset mlock
From: Monk Liu @ 2017-09-30  6:03 UTC
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

this mutex lock was there to prevent multiple threads from
concurrently executing sriov_gpu_reset; now we use
atomic_add_unless to replace it.

Change-Id: Id07e364764252a631cb75b01c7b7ff8d173d6c95
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 ++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c   | 2 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   | 1 -
 3 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index f507894..56a9ebe 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2751,9 +2751,11 @@ int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job)
 	struct amdgpu_ring *ring;
 	struct dma_fence *fence = NULL, *next = NULL;
 
-	mutex_lock(&adev->virt.lock_reset);
+	/* other thread is already into the gpu reset so just quit */
+	if (!atomic_add_unless(&adev->in_sriov_reset, 1, 1))
+		return 0;
+
 	atomic_inc(&adev->gpu_reset_counter);
-	atomic_set(&adev->in_sriov_reset, 1);
 
 	/* block TTM */
 	resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
@@ -2865,7 +2867,6 @@ int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job)
 	}
 
 	atomic_set(&adev->in_sriov_reset, 0);
-	mutex_unlock(&adev->virt.lock_reset);
 	return r;
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index ab05121..64930ef 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -107,8 +107,6 @@ void amdgpu_virt_init_setting(struct amdgpu_device *adev)
 	adev->enable_virtual_display = true;
 	adev->cg_flags = 0;
 	adev->pg_flags = 0;
-
-	mutex_init(&adev->virt.lock_reset);
 }
 
 uint32_t amdgpu_virt_kiq_rreg(struct amdgpu_device *adev, uint32_t reg)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
index afcfb8b..a3cbd5a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
@@ -53,7 +53,6 @@ struct amdgpu_virt {
 	uint64_t			csa_vmid0_addr;
 	bool chained_ib_support;
 	uint32_t			reg_val_offs;
-	struct mutex                    lock_reset;
 	struct amdgpu_irq_src		ack_irq;
 	struct amdgpu_irq_src		rcv_irq;
 	struct work_struct		flr_work;
-- 
2.7.4


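The replacement works because atomic_add_unless(&v, 1, 1) is an atomic
compare-and-increment: it adds 1 only if the counter does not already hold
1, and reports whether it did, so exactly one thread wins entry to the
reset. A hedged userspace re-implementation with C11 compare-exchange, just
to show the semantics (the real helper lives in the kernel atomic headers):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* userspace model of the kernel's atomic_add_unless(v, a, u):
 * add @a to @v unless it already equals @u; true if it added */
static bool add_unless(atomic_int *v, int a, int u)
{
	int c = atomic_load(v);

	do {
		if (c == u)
			return false;
	} while (!atomic_compare_exchange_weak(v, &c, c + a));
	return true;
}

static atomic_int in_sriov_reset;

static int sriov_gpu_reset_like(void)
{
	/* only one thread may enter; everybody else bails out */
	if (!add_unless(&in_sriov_reset, 1, 1))
		return -1;	/* becomes -EAGAIN in patch 06 */

	puts("performing the reset");
	atomic_store(&in_sriov_reset, 0);
	return 0;
}

int main(void)
{
	sriov_gpu_reset_like();
	return 0;
}
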
* [PATCH 05/12] drm/amdgpu/sriov:accurate description for sriov_gpu_reset
From: Monk Liu @ 2017-09-30  6:03 UTC
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

Change-Id: I4120f146e68cb38db69418426740d409de86101a
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 56a9ebe..31a5608 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2741,6 +2741,9 @@ static int amdgpu_recover_vram_from_shadow(struct amdgpu_device *adev,
  *
  * Attempt the reset the GPU if it has hung (all asics).
  * for SRIOV case.
+ * If @job is NULL it will not drop any job; instead it just repeats those
+ * jobs.
+ *
  * Returns 0 for success or an error on failure.
  */
 int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job)
-- 
2.7.4


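The distinction being documented: the reset is called with the hung job when
a ring timeout fires, and with NULL when the hypervisor initiates it; only
in the former case is a job dropped, otherwise everything is resubmitted. A
hedged sketch of that branch with illustrative stand-ins, not the real
recovery code:

#include <stdio.h>

struct job {
	const char *name;
};

static void drop(const struct job *j)   { printf("dropping %s\n", j->name); }
static void replay(const struct job *j) { printf("replaying %s\n", j->name); }

/* models the @job semantics documented above */
static void recover_ring(const struct job *pending, int count,
			 const struct job *hung)
{
	int i;

	for (i = 0; i < count; i++) {
		/* with @job == NULL nothing is guilty: repeat every job */
		if (hung && &pending[i] == hung)
			drop(&pending[i]);
		else
			replay(&pending[i]);
	}
}

int main(void)
{
	struct job jobs[] = { { "job-a" }, { "job-b" }, { "job-c" } };

	recover_ring(jobs, 3, &jobs[1]);	/* ring timeout: drop the hung job */
	recover_ring(jobs, 3, NULL);		/* hypervisor FLR: replay them all */
	return 0;
}
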
* [PATCH 06/12] drm/amdgpu/sriov:handle more jobs hang in different ring case
From: Monk Liu @ 2017-09-30  6:03 UTC
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

quit first and try again later if gpu_reset is already running; this
way we can handle jobs hanging on different rings at the same time
without the resets crashing into each other

Change-Id: I0c6bc8d76959c5053e7523c41b2305032fc6b79a
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 15 ++++++++++++---
 2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 31a5608..9efbb33 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2754,9 +2754,9 @@ int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job)
 	struct amdgpu_ring *ring;
 	struct dma_fence *fence = NULL, *next = NULL;
 
-	/* other thread is already into the gpu reset so just quit */
+	/* other thread is already into the gpu reset so just quit and come later */
 	if (!atomic_add_unless(&adev->in_sriov_reset, 1, 1))
-		return 0;
+		return -EAGAIN;
 
 	atomic_inc(&adev->gpu_reset_counter);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 4510627..0db81a4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -37,10 +37,19 @@ static void amdgpu_job_timedout(struct amd_sched_job *s_job)
 		  atomic_read(&job->ring->fence_drv.last_seq),
 		  job->ring->fence_drv.sync_seq);
 
-	if (amdgpu_sriov_vf(job->adev))
-		amdgpu_sriov_gpu_reset(job->adev, job);
-	else
+	if (amdgpu_sriov_vf(job->adev)) {
+		int r;
+
+try_again:
+		r = amdgpu_sriov_gpu_reset(job->adev, job);
+		if (r == -EAGAIN) {
+			/* maybe two different schedulers both have hung jobs, try later */
+			schedule();
+			goto try_again;
+		}
+	} else {
 		amdgpu_gpu_reset(job->adev);
+	}
 }
 
 int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
-- 
2.7.4


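To see why the loser has to retry instead of returning, picture two
scheduler threads timing out at once: the winner claims in_sriov_reset, the
loser gets -EAGAIN, yields, and tries again so the hang on its own ring
still gets handled. A small pthread sketch of that hand-off, reusing the
atomic try-enter idea from patch 04; these are userspace stand-ins only:

#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_int in_reset;

/* models atomic_add_unless(&in_reset, 1, 1): true if we took the slot */
static bool try_enter(void)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&in_reset, &expected, 1);
}

static void *ring_timedout(void *arg)
{
	long ring = (long)arg;

try_again:
	if (!try_enter()) {	/* -EAGAIN: another ring is resetting now */
		sched_yield();	/* the kernel code calls schedule() here */
		goto try_again;
	}
	printf("ring %ld performs the gpu reset\n", ring);
	atomic_store(&in_reset, 0);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, ring_timedout, (void *)0L);
	pthread_create(&b, NULL, ring_timedout, (void *)1L);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}
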
* [PATCH 07/12] drm/amdgpu/sriov:implement strict gpu reset
From: Monk Liu @ 2017-09-30  6:03 UTC
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

changes:
1) implement strict mode sriov gpu reset
2) always call sriov_gpu_reset_strict if the hypervisor notifies us of an FLR
3) in strict reset mode, set an error on all fences
4) change the fence_wait/cs_wait functions to return -ENODEV if a fence
signaled with error == -ETIME

After a strict gpu reset we consider the VRAM lost, and assuming VRAM
is lost there is little point in recovering shadow BOs, because the
textures/resources/shaders cannot be recovered anyway (if they
resided in VRAM).

Change-Id: I50d9b8b5185ba92f137f07c9deeac19d740d753b
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h           |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 25 ++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 90 +++++++++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |  4 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       |  6 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h      |  1 +
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c         |  4 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c         |  4 +-
 drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 60 ++++++++++++++++++
 drivers/gpu/drm/amd/scheduler/gpu_scheduler.h |  2 +
 10 files changed, 188 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index de11527..de9c164 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -123,6 +123,7 @@ extern int amdgpu_cntl_sb_buf_per_se;
 extern int amdgpu_param_buf_per_se;
 extern int amdgpu_job_hang_limit;
 extern int amdgpu_lbpw;
+extern int amdgpu_sriov_reset_level;
 
 #ifdef CONFIG_DRM_AMDGPU_SI
 extern int amdgpu_si_support;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index c6a214f..9467cf6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -1262,6 +1262,7 @@ int amdgpu_cs_wait_ioctl(struct drm_device *dev, void *data,
 	struct amdgpu_ctx *ctx;
 	struct dma_fence *fence;
 	long r;
+	int fence_err = 0;
 
 	if (amdgpu_kms_vram_lost(adev, fpriv))
 		return -ENODEV;
@@ -1283,6 +1284,8 @@ int amdgpu_cs_wait_ioctl(struct drm_device *dev, void *data,
 		r = PTR_ERR(fence);
 	else if (fence) {
 		r = dma_fence_wait_timeout(fence, true, timeout);
+		/* if fence_err < 0 the gpu hung and this fence was signaled by the gpu reset */
+		fence_err = dma_fence_get_status(fence);
 		dma_fence_put(fence);
 	} else
 		r = 1;
@@ -1292,7 +1295,10 @@ int amdgpu_cs_wait_ioctl(struct drm_device *dev, void *data,
 		return r;
 
 	memset(wait, 0, sizeof(*wait));
-	wait->out.status = (r == 0);
+	wait->out.status = (fence_err < 0);
+
+	if (fence_err < 0)
+		return -ENODEV;
 
 	return 0;
 }
@@ -1346,6 +1352,7 @@ static int amdgpu_cs_wait_all_fences(struct amdgpu_device *adev,
 	uint32_t fence_count = wait->in.fence_count;
 	unsigned int i;
 	long r = 1;
+	int fence_err = 0;
 
 	for (i = 0; i < fence_count; i++) {
 		struct dma_fence *fence;
@@ -1358,16 +1365,20 @@ static int amdgpu_cs_wait_all_fences(struct amdgpu_device *adev,
 			continue;
 
 		r = dma_fence_wait_timeout(fence, true, timeout);
+		fence_err = dma_fence_get_status(fence);
 		dma_fence_put(fence);
 		if (r < 0)
 			return r;
 
-		if (r == 0)
+		if (r == 0 || fence_err < 0)
 			break;
 	}
 
 	memset(wait, 0, sizeof(*wait));
-	wait->out.status = (r > 0);
+	wait->out.status = (r > 0 && fence_err == 0);
+
+	if (fence_err < 0)
+		return -ENODEV;
 
 	return 0;
 }
@@ -1391,6 +1402,7 @@ static int amdgpu_cs_wait_any_fence(struct amdgpu_device *adev,
 	struct dma_fence **array;
 	unsigned int i;
 	long r;
+	int fence_err = 0;
 
 	/* Prepare the fence array */
 	array = kcalloc(fence_count, sizeof(struct dma_fence *), GFP_KERNEL);
@@ -1418,10 +1430,12 @@ static int amdgpu_cs_wait_any_fence(struct amdgpu_device *adev,
 				       &first);
 	if (r < 0)
 		goto err_free_fence_array;
+	else
+		fence_err = dma_fence_get_status(array[first]);
 
 out:
 	memset(wait, 0, sizeof(*wait));
-	wait->out.status = (r > 0);
+	wait->out.status = (r > 0 && fence_err == 0);
 	wait->out.first_signaled = first;
 	/* set return value 0 to indicate success */
 	r = 0;
@@ -1431,6 +1445,9 @@ static int amdgpu_cs_wait_any_fence(struct amdgpu_device *adev,
 		dma_fence_put(array[i]);
 	kfree(array);
 
+	if (fence_err < 0)
+		return -ENODEV;
+
 	return r;
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 9efbb33..122e2e1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2734,6 +2734,96 @@ static int amdgpu_recover_vram_from_shadow(struct amdgpu_device *adev,
 }
 
 /**
+ * amdgpu_sriov_gpu_reset_strict - reset the asic under strict mode
+ *
+ * @adev: amdgpu device pointer
+ * @job: the job that triggered the hang
+ *
+ * Attempt to reset the GPU if it has hung (all asics),
+ * for the SRIOV case.
+ * Returns 0 for success or an error on failure.
+ *
+ * This function will deny all processes/fences created before this reset
+ * and drop all jobs left unfinished by this reset.
+ *
+ * The application should take responsibility for re-opening the FD to
+ * re-create the VM page table and recover all resources as well
+ *
+ **/
+int amdgpu_sriov_gpu_reset_strict(struct amdgpu_device *adev, struct amdgpu_job *job)
+{
+	int i, r = 0;
+	int resched;
+	struct amdgpu_ring *ring;
+
+	/* other thread is already into the gpu reset so just quit and come later */
+	if (!atomic_add_unless(&adev->in_sriov_reset, 1, 1))
+		return -EAGAIN;
+
+	atomic_inc(&adev->gpu_reset_counter);
+
+	/* block TTM */
+	resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
+
+	/* fake-signal the jobs already scheduled */
+	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+		ring = adev->rings[i];
+
+		if (!ring || !ring->sched.thread)
+			continue;
+
+		kthread_park(ring->sched.thread);
+		amd_sched_set_sched_hang(&ring->sched);
+		amdgpu_fence_driver_force_completion_ring(ring);
+		amd_sched_set_queue_hang(&ring->sched);
+	}
+
+	/* request to take full control of GPU before re-initialization  */
+	if (job)
+		amdgpu_virt_reset_gpu(adev);
+	else
+		amdgpu_virt_request_full_gpu(adev, true);
+
+	/* Resume IP prior to SMC */
+	amdgpu_sriov_reinit_early(adev);
+
+	/* we need recover gart prior to run SMC/CP/SDMA resume */
+	amdgpu_ttm_recover_gart(adev);
+
+	/* now we are okay to resume SMC/CP/SDMA */
+	amdgpu_sriov_reinit_late(adev);
+
+	/* resume IRQ status */
+	amdgpu_irq_gpu_reset_resume_helper(adev);
+
+	if (amdgpu_ib_ring_tests(adev))
+		dev_err(adev->dev, "[GPU_RESET] ib ring test failed (%d).\n", r);
+
+	/* release full control of GPU after ib test */
+	amdgpu_virt_release_full_gpu(adev, true);
+
+	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+		ring = adev->rings[i];
+
+		if (!ring || !ring->sched.thread)
+			continue;
+
+		kthread_unpark(ring->sched.thread);
+	}
+
+	drm_helper_resume_force_mode(adev->ddev);
+
+	ttm_bo_unlock_delayed_workqueue(&adev->mman.bdev, resched);
+	if (r)
+		dev_info(adev->dev, "Strict mode GPU reset failed\n");
+	else
+		dev_info(adev->dev, "Strict mode GPU reset succeeded!\n");
+
+	atomic_set(&adev->in_sriov_reset, 0);
+	return 0;
+}
+
+/**
  * amdgpu_sriov_gpu_reset - reset the asic
  *
  * @adev: amdgpu device pointer
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 8f5211c..eee67dc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -123,6 +123,7 @@ int amdgpu_cntl_sb_buf_per_se = 0;
 int amdgpu_param_buf_per_se = 0;
 int amdgpu_job_hang_limit = 0;
 int amdgpu_lbpw = -1;
+int amdgpu_sriov_reset_level = 0;
 
 MODULE_PARM_DESC(vramlimit, "Restrict VRAM for testing, in megabytes");
 module_param_named(vramlimit, amdgpu_vram_limit, int, 0600);
@@ -269,6 +270,9 @@ module_param_named(job_hang_limit, amdgpu_job_hang_limit, int ,0444);
 MODULE_PARM_DESC(lbpw, "Load Balancing Per Watt (LBPW) support (1 = enable, 0 = disable, -1 = auto)");
 module_param_named(lbpw, amdgpu_lbpw, int, 0444);
 
+MODULE_PARM_DESC(sriov_reset_level, "gpu reset level (0 = loose, 1 = strict, other = disabled; default 0)");
+module_param_named(sriov_reset_level, amdgpu_sriov_reset_level, int, 0444);
+
 #ifdef CONFIG_DRM_AMDGPU_SI
 
 #if defined(CONFIG_DRM_RADEON) || defined(CONFIG_DRM_RADEON_MODULE)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 0db81a4..933823a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -41,7 +41,11 @@ static void amdgpu_job_timedout(struct amd_sched_job *s_job)
 		int r;
 
 try_again:
-		r = amdgpu_sriov_gpu_reset(job->adev, job);
+		if (amdgpu_sriov_reset_level == 1)
+			r = amdgpu_sriov_gpu_reset_strict(job->adev, job);
+		else
+			r = amdgpu_sriov_gpu_reset(job->adev, job);
+
 		if (r == -EAGAIN) {
 			/* maybe two different schedulers both have hung jobs, try later */
 			schedule();
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
index a3cbd5a..5664a10 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
@@ -100,5 +100,6 @@ int amdgpu_virt_reset_gpu(struct amdgpu_device *adev);
 int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job);
 int amdgpu_virt_alloc_mm_table(struct amdgpu_device *adev);
 void amdgpu_virt_free_mm_table(struct amdgpu_device *adev);
+int amdgpu_sriov_gpu_reset_strict(struct amdgpu_device *adev, struct amdgpu_job *job);
 
 #endif
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index 2812d88..00a9629 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -247,8 +247,8 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
 		return;
 	}
 
-	/* Trigger recovery due to world switch failure */
-	amdgpu_sriov_gpu_reset(adev, NULL);
+	/* use strict mode if FLR triggered from hypervisor */
+	amdgpu_sriov_gpu_reset_strict(adev, NULL);
 }
 
 static int xgpu_ai_set_mailbox_rcv_irq(struct amdgpu_device *adev,
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
index c25a831..c94b6e9 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
@@ -513,8 +513,8 @@ static void xgpu_vi_mailbox_flr_work(struct work_struct *work)
 		return;
 	}
 
-	/* Trigger recovery due to world switch failure */
-	amdgpu_sriov_gpu_reset(adev, NULL);
+	/* use strict mode if FLR triggered from hypervisor */
+	amdgpu_sriov_gpu_reset_strict(adev, NULL);
 }
 
 static int xgpu_vi_set_mailbox_rcv_irq(struct amdgpu_device *adev,
diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
index 97c94f9..12c3092 100644
--- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
+++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
@@ -430,6 +430,66 @@ void amd_sched_hw_job_reset(struct amd_gpu_scheduler *sched)
 	spin_unlock(&sched->job_list_lock);
 }
 
+/**
+ * amd_sched_set_sched_hang
+ * @sched: the scheduler whose pending jobs should be marked hung
+ *
+ * This routine sets all unfinished jobs pending in the sched to
+ * an error -ETIME status
+ *
+ **/
+void amd_sched_set_sched_hang(struct amd_gpu_scheduler *sched)
+{
+	struct amd_sched_job *s_job;
+
+	spin_lock(&sched->job_list_lock);
+	list_for_each_entry_reverse(s_job, &sched->ring_mirror_list, node)
+		dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
+
+	spin_unlock(&sched->job_list_lock);
+}
+
+/**
+ * amd_sched_set_queue_hang
+ * @sched: the scheduler whose queued jobs should be marked hung
+ *
+ * This routine sets all jobs in the KFIFO of @sched to an error
+ * -ETIME status and signals those jobs.
+ *
+ **/
+
+void amd_sched_set_queue_hang(struct amd_gpu_scheduler *sched)
+{
+	struct amd_sched_entity *entity, *tmp;
+	struct amd_sched_job *s_job;
+	struct amd_sched_rq *rq;
+	int i;
+
+	/* set HANG status on all jobs queued and fake signal them */
+	for (i = AMD_SCHED_PRIORITY_MIN; i < AMD_SCHED_PRIORITY_MAX; i++) {
+		rq = &sched->sched_rq[i];
+
+		spin_lock(&rq->lock);
+		list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
+			if (entity->dependency) {
+				dma_fence_remove_callback(entity->dependency, &entity->cb);
+				dma_fence_put(entity->dependency);
+				entity->dependency = NULL;
+			}
+
+			spin_lock(&entity->queue_lock);
+			while(kfifo_out(&entity->job_queue, &s_job, sizeof(s_job)) == sizeof(s_job)) {
+				dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
+				amd_sched_fence_scheduled(s_job->s_fence);
+				amd_sched_fence_finished(s_job->s_fence);
+			}
+			spin_unlock(&entity->queue_lock);
+		}
+		spin_unlock(&rq->lock);
+	}
+	wake_up(&sched->job_scheduled);
+}
+
 void amd_sched_job_kickout(struct amd_sched_job *s_job)
 {
 	struct amd_gpu_scheduler *sched = s_job->sched;
diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
index f9d8f28..f0242aa 100644
--- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
+++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
@@ -167,4 +167,6 @@ void amd_sched_job_recovery(struct amd_gpu_scheduler *sched);
 bool amd_sched_dependency_optimized(struct dma_fence* fence,
 				    struct amd_sched_entity *entity);
 void amd_sched_job_kickout(struct amd_sched_job *s_job);
+void amd_sched_set_queue_hang(struct amd_gpu_scheduler *sched);
+void amd_sched_set_sched_hang(struct amd_gpu_scheduler *sched);
 #endif
-- 
2.7.4


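The new userspace contract hinges on one distinction: a fence can signal
cleanly, or signal while carrying an error (-ETIME, set by the strict
reset), and the wait paths now surface the latter as -ENODEV. A hedged
standalone model of that status flow; these are plain C stand-ins rather
than the real dma_fence API, though dma_fence_get_status() has comparable
semantics:

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

struct fence {
	bool signaled;
	int error;	/* 0, or -ETIME once fake-signaled by the reset */
};

/* rough model of dma_fence_get_status(): 0 = pending, 1 = signaled OK,
 * negative = signaled while carrying an error */
static int fence_status(const struct fence *f)
{
	if (!f->signaled)
		return 0;
	return f->error ? f->error : 1;
}

/* what the reworked wait paths do once the fence wait returns */
static int wait_result(const struct fence *f)
{
	if (fence_status(f) < 0)
		return -ENODEV;	/* tell userspace to tear down and re-open */
	return 0;
}

int main(void)
{
	struct fence ok = { .signaled = true, .error = 0 };
	struct fence hung = { .signaled = true, .error = -ETIME };

	printf("healthy fence: %d\n", wait_result(&ok));
	printf("reset fence:   %d\n", wait_result(&hung));
	return 0;
}
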
* [PATCH 08/12] drm/amdgpu:explicitly call fence_process
From: Monk Liu @ 2017-09-30  6:03 UTC
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

this way there is no need to wait for the timer to trigger, which saves time

Change-Id: Ie96fd2fc1f6054ebc1e58c3d703471639371ee22
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 333bad7..13785d8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -543,8 +543,13 @@ void amdgpu_fence_driver_force_completion(struct amdgpu_device *adev)
 
 void amdgpu_fence_driver_force_completion_ring(struct amdgpu_ring *ring)
 {
-	if (ring)
+	if (ring) {
 		amdgpu_fence_write(ring, ring->fence_drv.sync_seq);
+		/* calling fence process manually gets it done quickly
+		 * instead of waiting for the timer to trigger
+		 */
+		amdgpu_fence_process(ring);
+	}
 }
 
 /*
-- 
2.7.4


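What the explicit call buys: completed fences are normally noticed when
amdgpu_fence_process() runs from an interrupt or a fallback timer, but after
force-writing the final seqno during reset no interrupt is coming, so
calling it directly publishes the completions immediately. A toy C model of
write-seqno-then-process-now, with illustrative names only:

#include <stdio.h>

struct ring {
	unsigned int hw_seq;	/* seqno last written by the fence driver */
	unsigned int last_seq;	/* seqno fence processing has caught up to */
	unsigned int sync_seq;	/* highest seqno ever emitted */
};

static void fence_process(struct ring *r)
{
	while (r->last_seq < r->hw_seq) {
		r->last_seq++;
		printf("signal fence %u\n", r->last_seq);
	}
}

/* models force_completion_ring() after the patch: write the final
 * seqno, then process at once rather than waiting for the timer */
static void force_completion(struct ring *r)
{
	r->hw_seq = r->sync_seq;
	fence_process(r);
}

int main(void)
{
	struct ring r = { .hw_seq = 2, .last_seq = 2, .sync_seq = 5 };

	force_completion(&r);
	return 0;
}
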
* [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu was reset
From: Monk Liu @ 2017-09-30  6:03 UTC
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

for SRIOV strict mode gpu reset:

In kms open we record the latest adev->gpu_reset_counter in fpriv,
and we return -ENODEV from cs_ioctl or info_ioctl if they find
fpriv->gpu_reset_counter != adev->gpu_reset_counter.

this way we prevent a potentially bad process/FD from submitting
cmds and notify userspace with -ENODEV.

userspace should close all BOs/ctxs and re-open the
dri FD to re-create the virtual memory system for this process

Change-Id: Ib4c179f28a3d0783837566f29de07fc14aa9b9a4
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h     | 1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 5 +++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 7 +++++++
 3 files changed, 13 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index de9c164..b40d4ba 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -772,6 +772,7 @@ struct amdgpu_fpriv {
 	struct idr		bo_list_handles;
 	struct amdgpu_ctx_mgr	ctx_mgr;
 	u32			vram_lost_counter;
+	int gpu_reset_counter;
 };
 
 /*
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 9467cf6..6a1515e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -1199,6 +1199,11 @@ int amdgpu_cs_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
 	if (amdgpu_kms_vram_lost(adev, fpriv))
 		return -ENODEV;
 
+	if (amdgpu_sriov_vf(adev) &&
+		amdgpu_sriov_reset_level == 1 &&
+		fpriv->gpu_reset_counter < atomic_read(&adev->gpu_reset_counter))
+		return -ENODEV;
+
 	parser.adev = adev;
 	parser.filp = filp;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index 282f45b..bd389cf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -285,6 +285,11 @@ static int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file
 	if (amdgpu_kms_vram_lost(adev, fpriv))
 		return -ENODEV;
 
+	if (amdgpu_sriov_vf(adev) &&
+		amdgpu_sriov_reset_level == 1 &&
+		fpriv->gpu_reset_counter < atomic_read(&adev->gpu_reset_counter))
+		return -ENODEV;
+
 	switch (info->query) {
 	case AMDGPU_INFO_ACCEL_WORKING:
 		ui32 = adev->accel_working;
@@ -824,6 +829,8 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv)
 		goto out_suspend;
 	}
 
+	fpriv->gpu_reset_counter = atomic_read(&adev->gpu_reset_counter);
+
 	r = amdgpu_vm_init(adev, &fpriv->vm,
 			   AMDGPU_VM_CONTEXT_GFX, 0);
 	if (r) {
-- 
2.7.4


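The mechanism is a plain generation counter: each FD snapshots
adev->gpu_reset_counter at open time, and any later ioctl whose snapshot is
older than the device's current value is refused with -ENODEV. A minimal
sketch of that check with hypothetical userspace stand-ins:

#include <errno.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int gpu_reset_counter;	/* device-global, bumped per reset */

struct fpriv {
	int gpu_reset_counter;		/* snapshot taken at open time */
};

static void open_kms(struct fpriv *f)
{
	f->gpu_reset_counter = atomic_load(&gpu_reset_counter);
}

static int cs_ioctl(const struct fpriv *f)
{
	/* this FD predates the last strict reset: its gpu state is gone */
	if (f->gpu_reset_counter < atomic_load(&gpu_reset_counter))
		return -ENODEV;
	return 0;
}

int main(void)
{
	struct fpriv f;

	open_kms(&f);
	printf("before reset: %d\n", cs_ioctl(&f));
	atomic_fetch_add(&gpu_reset_counter, 1);	/* a strict reset happens */
	printf("after reset:  %d\n", cs_ioctl(&f));
	return 0;
}
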
* [PATCH 10/12] drm/amdgpu/sriov:implement guilty ctx for loose reset
From: Monk Liu @ 2017-09-30  6:03 UTC
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

Change-Id: I7904f362aa0f578a5cbf5d40c7a242c2c6680a92
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h           |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 16 +++++++++-------
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       |  1 +
 drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 22 ++++++++++++++++++++++
 drivers/gpu/drm/amd/scheduler/gpu_scheduler.h |  1 +
 5 files changed, 34 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index b40d4ba..b63e602 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -737,6 +737,7 @@ struct amdgpu_ctx {
 	struct dma_fence	**fences;
 	struct amdgpu_ctx_ring	rings[AMDGPU_MAX_RINGS];
 	bool preamble_presented;
+	bool guilty;
 };
 
 struct amdgpu_ctx_mgr {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 6a1515e..f92962e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -79,16 +79,19 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data)
 	if (cs->in.num_chunks == 0)
 		return 0;
 
+	p->ctx = amdgpu_ctx_get(fpriv, cs->in.ctx_id);
+	if (!p->ctx)
+		return -EINVAL;
+
+	if (amdgpu_sriov_vf(p->adev) &&
+		amdgpu_sriov_reset_level == 0 &&
+		p->ctx->guilty)
+		return -ENODEV;
+
 	chunk_array = kmalloc_array(cs->in.num_chunks, sizeof(uint64_t), GFP_KERNEL);
 	if (!chunk_array)
 		return -ENOMEM;
 
-	p->ctx = amdgpu_ctx_get(fpriv, cs->in.ctx_id);
-	if (!p->ctx) {
-		ret = -EINVAL;
-		goto free_chunk;
-	}
-
 	/* get chunks */
 	chunk_array_user = u64_to_user_ptr(cs->in.chunks);
 	if (copy_from_user(chunk_array, chunk_array_user,
@@ -184,7 +187,6 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data)
 	p->nchunks = 0;
 put_ctx:
 	amdgpu_ctx_put(p->ctx);
-free_chunk:
 	kfree(chunk_array);
 
 	return ret;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index 75c933b..028e9f1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -60,6 +60,7 @@ static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
 					  rq, amdgpu_sched_jobs);
 		if (r)
 			goto failed;
+		ctx->rings[i].entity.guilty = &ctx->guilty;
 	}
 
 	r = amdgpu_queue_mgr_init(adev, &ctx->queue_mgr);
diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
index 12c3092..89b0573 100644
--- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
+++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
@@ -493,10 +493,32 @@ void amd_sched_set_queue_hang(struct amd_gpu_scheduler *sched)
 void amd_sched_job_kickout(struct amd_sched_job *s_job)
 {
 	struct amd_gpu_scheduler *sched = s_job->sched;
+	struct amd_sched_entity *entity, *tmp;
+	struct amd_sched_rq *rq;
+	int i;
+	bool found = false;
 
 	spin_lock(&sched->job_list_lock);
 	list_del_init(&s_job->node);
 	spin_unlock(&sched->job_list_lock);
+
+	dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
+
+	for (i = AMD_SCHED_PRIORITY_MIN; i < AMD_SCHED_PRIORITY_KERNEL; i++) {
+		rq = &sched->sched_rq[i];
+
+		spin_lock(&rq->lock);
+		list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
+			if (s_job->s_entity == entity && entity->guilty) {
+				*entity->guilty = true;
+				found = true;
+				break;
+			}
+		}
+		spin_unlock(&rq->lock);
+		if (found)
+			break;
+	}
 }
 
 void amd_sched_job_recovery(struct amd_gpu_scheduler *sched)
diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
index f0242aa..16c2244 100644
--- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
+++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
@@ -49,6 +49,7 @@ struct amd_sched_entity {
 
 	struct dma_fence		*dependency;
 	struct dma_fence_cb		cb;
+	bool *guilty; /* this points to ctx's guilty */
 };
 
 /**
-- 
2.7.4


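The guilty-context plumbing is a shared flag: each scheduler entity keeps a
pointer into its owning ctx's guilty bool, the kickout path sets it through
whichever entity owned the hung job, and the CS path then refuses
submissions from that ctx. A hedged standalone sketch of the pointer wiring,
with illustrative names:

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

struct entity {
	bool *guilty;		/* points at the owning ctx's flag */
};

struct ctx {
	bool guilty;
	struct entity ring_entity;
};

static void ctx_init(struct ctx *c)
{
	c->guilty = false;
	c->ring_entity.guilty = &c->guilty;	/* wire entity back to ctx */
}

/* timeout path: the hung job's entity marks its whole ctx guilty */
static void job_kickout(struct entity *e)
{
	if (e->guilty)
		*e->guilty = true;
}

/* loose-mode CS path: refuse further submissions from a guilty ctx */
static int cs_ioctl(const struct ctx *c)
{
	return c->guilty ? -ENODEV : 0;
}

int main(void)
{
	struct ctx c;

	ctx_init(&c);
	printf("before hang: %d\n", cs_ioctl(&c));
	job_kickout(&c.ring_entity);
	printf("after hang:  %d\n", cs_ioctl(&c));
	return 0;
}
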
* [PATCH 11/12] drm/amdgpu/sriov:show error if ib test failed
From: Monk Liu @ 2017-09-30  6:03 UTC
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

fix the incorrect ib test result message for loose mode gpu reset

Change-Id: Ic4e3b51e4ff77c5e08d268a4a5ca32e7c882367c
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 122e2e1..c3f10b5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2902,7 +2902,8 @@ int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job)
 
 	amdgpu_irq_gpu_reset_resume_helper(adev);
 
-	if (amdgpu_ib_ring_tests(adev))
+	r = amdgpu_ib_ring_tests(adev);
+	if (r)
 		dev_err(adev->dev, "[GPU_RESET] ib ring test failed (%d).\n", r);
 
 	/* release full control of GPU after ib test */
-- 
2.7.4


* [PATCH 12/12] drm/amdgpu/sriov:no shadow buffer recovery
From: Monk Liu @ 2017-09-30  6:03 UTC
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

1, we have an unresolved deadlock between shadow BO recovery
and ctx_do_release,

2, for loose mode gpu reset we always assume VRAM is not lost,
so there is no need to do that to begin with

Change-Id: I5259f9d943239bd1fa2e45eb446ef053299fbfb1
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 29 -----------------------------
 1 file changed, 29 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index c3f10b5..8ae7a2c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2840,9 +2840,7 @@ int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job)
 {
 	int i, j, r = 0;
 	int resched;
-	struct amdgpu_bo *bo, *tmp;
 	struct amdgpu_ring *ring;
-	struct dma_fence *fence = NULL, *next = NULL;
 
 	/* other thread is already into the gpu reset so just quit and come later */
 	if (!atomic_add_unless(&adev->in_sriov_reset, 1, 1))
@@ -2909,33 +2907,6 @@ int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job)
 	/* release full control of GPU after ib test */
 	amdgpu_virt_release_full_gpu(adev, true);
 
-	DRM_INFO("recover vram bo from shadow\n");
-
-	ring = adev->mman.buffer_funcs_ring;
-	mutex_lock(&adev->shadow_list_lock);
-	list_for_each_entry_safe(bo, tmp, &adev->shadow_list, shadow_list) {
-		next = NULL;
-		amdgpu_recover_vram_from_shadow(adev, ring, bo, &next);
-		if (fence) {
-			r = dma_fence_wait(fence, false);
-			if (r) {
-				WARN(r, "recovery from shadow isn't completed\n");
-				break;
-			}
-		}
-
-		dma_fence_put(fence);
-		fence = next;
-	}
-	mutex_unlock(&adev->shadow_list_lock);
-
-	if (fence) {
-		r = dma_fence_wait(fence, false);
-		if (r)
-			WARN(r, "recovery from shadow isn't completed\n");
-	}
-	dma_fence_put(fence);
-
 	for (i = j; i < j + AMDGPU_MAX_RINGS; ++i) {
 		ring = adev->rings[i % AMDGPU_MAX_RINGS];
 		if (!ring || !ring->sched.thread)
-- 
2.7.4


* Re: [PATCH 00/12] *** SRIOV GPU RESET PATCHES ***
From: Christian König @ 2017-10-01  9:31 UTC
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Patches #1-#5 are Reviewed-by: Christian König <christian.koenig@amd.com>.

Need to take a look at the rest when I'm back from vacation.

Regards,
Christian.

On 30.09.2017 at 08:03, Monk Liu wrote:
> implement strict mode gpu reset, and some changes for loose mode reset
>
> Monk Liu (12):
>    drm/amdgpu/sriov:now must reinit psp
>    drm/amdgpu/sriov:fix memory leak in psp_load_fw
>    drm/amdgpu/sriov:use atomic type for sriov_reset
>    drm/amdgpu/sriov:cleanup gpu reset mlock
>    drm/amdgpu/sriov:accurate description for sriov_gpu_reset
>    drm/amdgpu/sriov:handle more jobs hang in different ring case
>    drm/amdgpu/sriov:implement strict gpu reset
>    drm/amdgpu:explicitly call fence_process
>    drm/amdgpu/sriov:return -ENODEV if gpu was reset
>    drm/amdgpu/sriov:implement guilty ctx for loose reset
>    drm/amdgpu/sriov:show error if ib test failed
>    drm/amdgpu/sriov:no shadow buffer recovery
>
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h           |   5 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        |  46 ++++++---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       |   1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 135 +++++++++++++++++++-------
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |   4 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c     |   7 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       |  19 +++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c       |   7 ++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c       |  22 +++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c     |   2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c      |   2 -
>   drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h      |   2 +-
>   drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c         |   6 +-
>   drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c         |   6 +-
>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c         |   4 +-
>   drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c         |   4 +-
>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.c |  82 ++++++++++++++++
>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.h |   3 +
>   18 files changed, 284 insertions(+), 73 deletions(-)
>


* Re: [PATCH 12/12] drm/amdgpu/sriov:no shadow buffer recovery
From: Christian König @ 2017-10-01  9:32 UTC
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 30.09.2017 at 08:03, Monk Liu wrote:
> 1, we have an unresolved deadlock between shadow BO recovery
> and ctx_do_release,
>
> 2, for loose mode gpu reset we always assume VRAM is not lost,
> so there is no need to do that to begin with
>
> Change-Id: I5259f9d943239bd1fa2e45eb446ef053299fbfb1
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>

NAK, we must always recover the page tables, otherwise no process
would be able to proceed.

Christian.

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 29 -----------------------------
>   1 file changed, 29 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index c3f10b5..8ae7a2c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2840,9 +2840,7 @@ int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job)
>   {
>   	int i, j, r = 0;
>   	int resched;
> -	struct amdgpu_bo *bo, *tmp;
>   	struct amdgpu_ring *ring;
> -	struct dma_fence *fence = NULL, *next = NULL;
>   
>   	/* other thread is already into the gpu reset so just quit and come later */
>   	if (!atomic_add_unless(&adev->in_sriov_reset, 1, 1))
> @@ -2909,33 +2907,6 @@ int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job)
>   	/* release full control of GPU after ib test */
>   	amdgpu_virt_release_full_gpu(adev, true);
>   
> -	DRM_INFO("recover vram bo from shadow\n");
> -
> -	ring = adev->mman.buffer_funcs_ring;
> -	mutex_lock(&adev->shadow_list_lock);
> -	list_for_each_entry_safe(bo, tmp, &adev->shadow_list, shadow_list) {
> -		next = NULL;
> -		amdgpu_recover_vram_from_shadow(adev, ring, bo, &next);
> -		if (fence) {
> -			r = dma_fence_wait(fence, false);
> -			if (r) {
> -				WARN(r, "recovery from shadow isn't completed\n");
> -				break;
> -			}
> -		}
> -
> -		dma_fence_put(fence);
> -		fence = next;
> -	}
> -	mutex_unlock(&adev->shadow_list_lock);
> -
> -	if (fence) {
> -		r = dma_fence_wait(fence, false);
> -		if (r)
> -			WARN(r, "recovery from shadow isn't completed\n");
> -	}
> -	dma_fence_put(fence);
> -
>   	for (i = j; i < j + AMDGPU_MAX_RINGS; ++i) {
>   		ring = adev->rings[i % AMDGPU_MAX_RINGS];
>   		if (!ring || !ring->sched.thread)


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 12/12] drm/amdgpu/sriov:no shadow buffer recovery
       [not found]     ` <1506751432-21789-13-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2017-10-01  9:32       ` Christian König
@ 2017-10-01  9:36       ` Christian König
       [not found]         ` <e767c6f2-4050-c697-2075-c3d744e6b379-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  1 sibling, 1 reply; 49+ messages in thread
From: Christian König @ 2017-10-01  9:36 UTC (permalink / raw)
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 30.09.2017 at 08:03, Monk Liu wrote:
> 1, we have an unresolved deadlock between shadow bo recovery
> and ctx_do_release,
>
> 2, for loose mode gpu reset we always assume VRAM is not lost,
> so there is no need to do that from the beginning
>
> Change-Id: I5259f9d943239bd1fa2e45eb446ef053299fbfb1
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>

NAK, even when VRAM is lost we must restore the page tables, otherwise 
no process would be able to proceed.

Regards,
Christian.

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 29 -----------------------------
>   1 file changed, 29 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index c3f10b5..8ae7a2c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2840,9 +2840,7 @@ int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job)
>   {
>   	int i, j, r = 0;
>   	int resched;
> -	struct amdgpu_bo *bo, *tmp;
>   	struct amdgpu_ring *ring;
> -	struct dma_fence *fence = NULL, *next = NULL;
>   
>   	/* other thread is already into the gpu reset so just quit and come later */
>   	if (!atomic_add_unless(&adev->in_sriov_reset, 1, 1))
> @@ -2909,33 +2907,6 @@ int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job)
>   	/* release full control of GPU after ib test */
>   	amdgpu_virt_release_full_gpu(adev, true);
>   
> -	DRM_INFO("recover vram bo from shadow\n");
> -
> -	ring = adev->mman.buffer_funcs_ring;
> -	mutex_lock(&adev->shadow_list_lock);
> -	list_for_each_entry_safe(bo, tmp, &adev->shadow_list, shadow_list) {
> -		next = NULL;
> -		amdgpu_recover_vram_from_shadow(adev, ring, bo, &next);
> -		if (fence) {
> -			r = dma_fence_wait(fence, false);
> -			if (r) {
> -				WARN(r, "recovery from shadow isn't completed\n");
> -				break;
> -			}
> -		}
> -
> -		dma_fence_put(fence);
> -		fence = next;
> -	}
> -	mutex_unlock(&adev->shadow_list_lock);
> -
> -	if (fence) {
> -		r = dma_fence_wait(fence, false);
> -		if (r)
> -			WARN(r, "recovery from shadow isn't completed\n");
> -	}
> -	dma_fence_put(fence);
> -
>   	for (i = j; i < j + AMDGPU_MAX_RINGS; ++i) {
>   		ring = adev->rings[i % AMDGPU_MAX_RINGS];
>   		if (!ring || !ring->sched.thread)



* RE: [PATCH 12/12] drm/amdgpu/sriov:no shadow buffer recovery
       [not found]         ` <e767c6f2-4050-c697-2075-c3d744e6b379-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-10-04  9:41           ` Liu, Monk
       [not found]             ` <BLUPR12MB0449346A746E70A7BE88FEA084730-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 49+ messages in thread
From: Liu, Monk @ 2017-10-04  9:41 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


Why? The page tables reside in VRAM; there is no need to recover them if VRAM is not lost



Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10



From: Christian König<mailto:ckoenig.leichtzumerken@gmail.com>
Sent: 2017年10月1日 17:36
To: Liu, Monk<mailto:Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH 12/12] drm/amdgpu/sriov:no shadow buffer recovery



On 30.09.2017 at 08:03, Monk Liu wrote:
> 1, we have an unresolved deadlock between shadow bo recovery
> and ctx_do_release,
>
> 2, for loose mode gpu reset we always assume VRAM is not lost,
> so there is no need to do that from the beginning
>
> Change-Id: I5259f9d943239bd1fa2e45eb446ef053299fbfb1
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>

NAK, even when VRAM is lost we must restore the page tables, otherwise
no process would be able to proceed.

Regards,
Christian.

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 29 -----------------------------
>   1 file changed, 29 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index c3f10b5..8ae7a2c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2840,9 +2840,7 @@ int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job)
>   {
>        int i, j, r = 0;
>        int resched;
> -     struct amdgpu_bo *bo, *tmp;
>        struct amdgpu_ring *ring;
> -     struct dma_fence *fence = NULL, *next = NULL;
>
>        /* other thread is already into the gpu reset so just quit and come later */
>        if (!atomic_add_unless(&adev->in_sriov_reset, 1, 1))
> @@ -2909,33 +2907,6 @@ int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job)
>        /* release full control of GPU after ib test */
>        amdgpu_virt_release_full_gpu(adev, true);
>
> -     DRM_INFO("recover vram bo from shadow\n");
> -
> -     ring = adev->mman.buffer_funcs_ring;
> -     mutex_lock(&adev->shadow_list_lock);
> -     list_for_each_entry_safe(bo, tmp, &adev->shadow_list, shadow_list) {
> -             next = NULL;
> -             amdgpu_recover_vram_from_shadow(adev, ring, bo, &next);
> -             if (fence) {
> -                     r = dma_fence_wait(fence, false);
> -                     if (r) {
> -                             WARN(r, "recovery from shadow isn't completed\n");
> -                             break;
> -                     }
> -             }
> -
> -             dma_fence_put(fence);
> -             fence = next;
> -     }
> -     mutex_unlock(&adev->shadow_list_lock);
> -
> -     if (fence) {
> -             r = dma_fence_wait(fence, false);
> -             if (r)
> -                     WARN(r, "recovery from shadow isn't completed\n");
> -     }
> -     dma_fence_put(fence);
> -
>        for (i = j; i < j + AMDGPU_MAX_RINGS; ++i) {
>                ring = adev->rings[i % AMDGPU_MAX_RINGS];
>                if (!ring || !ring->sched.thread)




* Re: [PATCH 12/12] drm/amdgpu/sriov:no shadow buffer recovery
       [not found]             ` <BLUPR12MB0449346A746E70A7BE88FEA084730-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-10-04 10:56               ` Christian König
       [not found]                 ` <9b08e030-1a47-39ef-8010-64c51d4560e8-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 49+ messages in thread
From: Christian König @ 2017-10-04 10:56 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


Ah! Sorry, my fault.

I missed the "no" and thought you wanted to abandon all processing 
because VRAM is always lost.

Going to review the remaining patches today.

Christian.

On 04.10.2017 at 11:41, Liu, Monk wrote:
>
> Why? The page tables reside in VRAM; there is no need to recover them 
> if VRAM is not lost
>
> Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for 
> Windows 10
>
> *From: *Christian König <mailto:ckoenig.leichtzumerken-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> *Sent: *2017年10月1日17:36
> *To: *Liu, Monk <mailto:Monk.Liu-5C7GfCeVMHo@public.gmane.org>; 
> amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org <mailto:amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>
> *Subject: *Re: [PATCH 12/12] drm/amdgpu/sriov:no shadow buffer recovery
>
> On 30.09.2017 at 08:03, Monk Liu wrote:
> > 1, we have an unresolved deadlock between shadow bo recovery
> > and ctx_do_release,
> >
> > 2, for loose mode gpu reset we always assume VRAM is not lost,
> > so there is no need to do that from the beginning
> >
> > Change-Id: I5259f9d943239bd1fa2e45eb446ef053299fbfb1
> > Signed-off-by: Monk Liu <Monk.Liu-5C7GfCeVMHo@public.gmane.org>
>
> NAK, even when VRAM is lost we must restore the page tables, otherwise
> no process would be able to proceed.
>
> Regards,
> Christian.
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 29 
> -----------------------------
> >   1 file changed, 29 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index c3f10b5..8ae7a2c 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -2840,9 +2840,7 @@ int amdgpu_sriov_gpu_reset(struct 
> amdgpu_device *adev, struct amdgpu_job *job)
> >   {
> >        int i, j, r = 0;
> >        int resched;
> > -     struct amdgpu_bo *bo, *tmp;
> >        struct amdgpu_ring *ring;
> > -     struct dma_fence *fence = NULL, *next = NULL;
> >
> >        /* other thread is already into the gpu reset so just quit 
> and come later */
> >        if (!atomic_add_unless(&adev->in_sriov_reset, 1, 1))
> > @@ -2909,33 +2907,6 @@ int amdgpu_sriov_gpu_reset(struct 
> amdgpu_device *adev, struct amdgpu_job *job)
> >        /* release full control of GPU after ib test */
> >        amdgpu_virt_release_full_gpu(adev, true);
> >
> > -     DRM_INFO("recover vram bo from shadow\n");
> > -
> > -     ring = adev->mman.buffer_funcs_ring;
> > -     mutex_lock(&adev->shadow_list_lock);
> > -     list_for_each_entry_safe(bo, tmp, &adev->shadow_list, 
> shadow_list) {
> > -             next = NULL;
> > -             amdgpu_recover_vram_from_shadow(adev, ring, bo, &next);
> > -             if (fence) {
> > -                     r = dma_fence_wait(fence, false);
> > -                     if (r) {
> > -                             WARN(r, "recovery from shadow isn't 
> completed\n");
> > -                             break;
> > -                     }
> > -             }
> > -
> > -             dma_fence_put(fence);
> > -             fence = next;
> > -     }
> > -     mutex_unlock(&adev->shadow_list_lock);
> > -
> > -     if (fence) {
> > -             r = dma_fence_wait(fence, false);
> > -             if (r)
> > -                     WARN(r, "recovery from shadow isn't completed\n");
> > -     }
> > -     dma_fence_put(fence);
> > -
> >        for (i = j; i < j + AMDGPU_MAX_RINGS; ++i) {
> >                ring = adev->rings[i % AMDGPU_MAX_RINGS];
> >                if (!ring || !ring->sched.thread)
>
>



* RE: [PATCH 12/12] drm/amdgpu/sriov:no shadow buffer recovery
       [not found]                 ` <9b08e030-1a47-39ef-8010-64c51d4560e8-5C7GfCeVMHo@public.gmane.org>
@ 2017-10-09  4:12                   ` Liu, Monk
  0 siblings, 0 replies; 49+ messages in thread
From: Liu, Monk @ 2017-10-09  4:12 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


Any updates on the remaining patches?

From: Koenig, Christian
Sent: 2017年10月4日 18:56
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 12/12] drm/amdgpu/sriov:no shadow buffer recovery

Ah! Sorry, my fault.

I missed the "no" and thought you wanted to abandon all processing because VRAM is always lost.

Going to review the remaining patches today.

Christian.

On 04.10.2017 at 11:41, Liu, Monk wrote:

Why? The page tables reside in VRAM; there is no need to recover them if VRAM is not lost



Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10



From: Christian König<mailto:ckoenig.leichtzumerken@gmail.com>
Sent: 2017年10月1日 17:36
To: Liu, Monk<mailto:Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH 12/12] drm/amdgpu/sriov:no shadow buffer recovery


On 30.09.2017 at 08:03, Monk Liu wrote:
> 1, we have an unresolved deadlock between shadow bo recovery
> and ctx_do_release,
>
> 2, for loose mode gpu reset we always assume VRAM is not lost,
> so there is no need to do that from the beginning
>
> Change-Id: I5259f9d943239bd1fa2e45eb446ef053299fbfb1
> Signed-off-by: Monk Liu <Monk.Liu@amd.com><mailto:Monk.Liu@amd.com>

NAK, even when VRAM is lost we must restore the page tables, otherwise
no process would be able to proceed.

Regards,
Christian.

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 29 -----------------------------
>   1 file changed, 29 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index c3f10b5..8ae7a2c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2840,9 +2840,7 @@ int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job)
>   {
>        int i, j, r = 0;
>        int resched;
> -     struct amdgpu_bo *bo, *tmp;
>        struct amdgpu_ring *ring;
> -     struct dma_fence *fence = NULL, *next = NULL;
>
>        /* other thread is already into the gpu reset so just quit and come later */
>        if (!atomic_add_unless(&adev->in_sriov_reset, 1, 1))
> @@ -2909,33 +2907,6 @@ int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job)
>        /* release full control of GPU after ib test */
>        amdgpu_virt_release_full_gpu(adev, true);
>
> -     DRM_INFO("recover vram bo from shadow\n");
> -
> -     ring = adev->mman.buffer_funcs_ring;
> -     mutex_lock(&adev->shadow_list_lock);
> -     list_for_each_entry_safe(bo, tmp, &adev->shadow_list, shadow_list) {
> -             next = NULL;
> -             amdgpu_recover_vram_from_shadow(adev, ring, bo, &next);
> -             if (fence) {
> -                     r = dma_fence_wait(fence, false);
> -                     if (r) {
> -                             WARN(r, "recovery from shadow isn't completed\n");
> -                             break;
> -                     }
> -             }
> -
> -             dma_fence_put(fence);
> -             fence = next;
> -     }
> -     mutex_unlock(&adev->shadow_list_lock);
> -
> -     if (fence) {
> -             r = dma_fence_wait(fence, false);
> -             if (r)
> -                     WARN(r, "recovery from shadow isn't completed\n");
> -     }
> -     dma_fence_put(fence);
> -
>        for (i = j; i < j + AMDGPU_MAX_RINGS; ++i) {
>                ring = adev->rings[i % AMDGPU_MAX_RINGS];
>                if (!ring || !ring->sched.thread)





* Re: [PATCH 06/12] drm/amdgpu/sriov:handle more jobs hang in different ring case
       [not found]     ` <1506751432-21789-7-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2017-10-09  8:18       ` Christian König
  0 siblings, 0 replies; 49+ messages in thread
From: Christian König @ 2017-10-09  8:18 UTC (permalink / raw)
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 30.09.2017 at 08:03, Monk Liu wrote:
> quit first and retry later if gpu_reset is already running; this
> way we can handle jobs that hang on different rings at the same
> time without the resets crashing each other

Using schedule() like this is not good coding style; please use a lock or 
a completion event instead.
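
A minimal sketch of what that could look like (illustrative only; the
sriov_reset_lock mutex is a hypothetical field, not an existing amdgpu
member):

    /* serialize resets with a mutex instead of spinning on schedule():
     * the losing thread sleeps until the winner finishes, then skips
     * its own reset if the counter shows one already happened */
    static void amdgpu_job_timedout_sketch(struct amdgpu_job *job)
    {
            struct amdgpu_device *adev = job->adev;
            int before = atomic_read(&adev->gpu_reset_counter);

            mutex_lock(&adev->sriov_reset_lock);
            if (atomic_read(&adev->gpu_reset_counter) == before)
                    amdgpu_sriov_gpu_reset(adev, job);
            mutex_unlock(&adev->sriov_reset_lock);
    }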

Christian.

>
> Change-Id: I0c6bc8d76959c5053e7523c41b2305032fc6b79a
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 ++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 15 ++++++++++++---
>   2 files changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 31a5608..9efbb33 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2754,9 +2754,9 @@ int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job)
>   	struct amdgpu_ring *ring;
>   	struct dma_fence *fence = NULL, *next = NULL;
>   
> -	/* other thread is already into the gpu reset so just quit */
> +	/* other thread is already into the gpu reset so just quit and come later */
>   	if (!atomic_add_unless(&adev->in_sriov_reset, 1, 1))
> -		return 0;
> +		return -EAGAIN;
>   
>   	atomic_inc(&adev->gpu_reset_counter);
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 4510627..0db81a4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -37,10 +37,19 @@ static void amdgpu_job_timedout(struct amd_sched_job *s_job)
>   		  atomic_read(&job->ring->fence_drv.last_seq),
>   		  job->ring->fence_drv.sync_seq);
>   
> -	if (amdgpu_sriov_vf(job->adev))
> -		amdgpu_sriov_gpu_reset(job->adev, job);
> -	else
> +	if (amdgpu_sriov_vf(job->adev)) {
> +		int r;
> +
> +try_again:
> +		r = amdgpu_sriov_gpu_reset(job->adev, job);
> +		if (r == -EAGAIN) {
> +			/* maybe two different schedulers both have hung jobs, try later */
> +			schedule();
> +			goto try_again;
> +		}
> +	} else {
>   		amdgpu_gpu_reset(job->adev);
> +	}
>   }
>   
>   int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,



* Re: [PATCH 07/12] drm/amdgpu/sriov:implement strict gpu reset
       [not found]     ` <1506751432-21789-8-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2017-10-09  8:20       ` Christian König
       [not found]         ` <250ce10a-cca0-0193-b2ed-cc2f04e80d0c-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2017-10-09 10:58       ` Nicolai Hähnle
  1 sibling, 1 reply; 49+ messages in thread
From: Christian König @ 2017-10-09  8:20 UTC (permalink / raw)
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 30.09.2017 at 08:03, Monk Liu wrote:
> changes:
> 1) implement strict mode sriov gpu reset
> 2) always call sriov_gpu_reset_strict if the hypervisor notifies FLR
> 3) in strict reset mode, set an error on all fences.
> 4) change fence_wait/cs_wait functions to return -ENODEV if the fence
> signaled with error == -ETIME
>
> Since after a strict gpu reset we consider the VRAM lost,
> and since, assuming VRAM is lost, there is little point in recovering
> shadow BOs, because textures/resources/shaders cannot be
> recovered (if they resided in VRAM)

NAK, we shouldn't return an error code from the wait function; instead, 
handle the fence error code in amdgpu_ctx_query().

Regards,
Christian.

>
> Change-Id: I50d9b8b5185ba92f137f07c9deeac19d740d753b
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h           |  1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 25 ++++++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 90 +++++++++++++++++++++++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |  4 ++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       |  6 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h      |  1 +
>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c         |  4 +-
>   drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c         |  4 +-
>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 60 ++++++++++++++++++
>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.h |  2 +
>   10 files changed, 188 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index de11527..de9c164 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -123,6 +123,7 @@ extern int amdgpu_cntl_sb_buf_per_se;
>   extern int amdgpu_param_buf_per_se;
>   extern int amdgpu_job_hang_limit;
>   extern int amdgpu_lbpw;
> +extern int amdgpu_sriov_reset_level;
>   
>   #ifdef CONFIG_DRM_AMDGPU_SI
>   extern int amdgpu_si_support;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index c6a214f..9467cf6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -1262,6 +1262,7 @@ int amdgpu_cs_wait_ioctl(struct drm_device *dev, void *data,
>   	struct amdgpu_ctx *ctx;
>   	struct dma_fence *fence;
>   	long r;
> +	int fence_err = 0;
>   
>   	if (amdgpu_kms_vram_lost(adev, fpriv))
>   		return -ENODEV;
> @@ -1283,6 +1284,8 @@ int amdgpu_cs_wait_ioctl(struct drm_device *dev, void *data,
>   		r = PTR_ERR(fence);
>   	else if (fence) {
>   		r = dma_fence_wait_timeout(fence, true, timeout);
> +		/* if fence_err < 0, the gpu hung and this fence was signaled by gpu reset */
> +		fence_err = dma_fence_get_status(fence);
>   		dma_fence_put(fence);
>   	} else
>   		r = 1;
> @@ -1292,7 +1295,10 @@ int amdgpu_cs_wait_ioctl(struct drm_device *dev, void *data,
>   		return r;
>   
>   	memset(wait, 0, sizeof(*wait));
> -	wait->out.status = (r == 0);
> +	wait->out.status = (fence_err < 0);
> +
> +	if (fence_err < 0)
> +		return -ENODEV;
>   
>   	return 0;
>   }
> @@ -1346,6 +1352,7 @@ static int amdgpu_cs_wait_all_fences(struct amdgpu_device *adev,
>   	uint32_t fence_count = wait->in.fence_count;
>   	unsigned int i;
>   	long r = 1;
> +	int fence_err = 0;
>   
>   	for (i = 0; i < fence_count; i++) {
>   		struct dma_fence *fence;
> @@ -1358,16 +1365,20 @@ static int amdgpu_cs_wait_all_fences(struct amdgpu_device *adev,
>   			continue;
>   
>   		r = dma_fence_wait_timeout(fence, true, timeout);
> +		fence_err = dma_fence_get_status(fence);
>   		dma_fence_put(fence);
>   		if (r < 0)
>   			return r;
>   
> -		if (r == 0)
> +		if (r == 0 || fence_err < 0)
>   			break;
>   	}
>   
>   	memset(wait, 0, sizeof(*wait));
> -	wait->out.status = (r > 0);
> +	wait->out.status = (r > 0 && fence_err == 0);
> +
> +	if (fence_err < 0)
> +		return -ENODEV;
>   
>   	return 0;
>   }
> @@ -1391,6 +1402,7 @@ static int amdgpu_cs_wait_any_fence(struct amdgpu_device *adev,
>   	struct dma_fence **array;
>   	unsigned int i;
>   	long r;
> +	int fence_err = 0;
>   
>   	/* Prepare the fence array */
>   	array = kcalloc(fence_count, sizeof(struct dma_fence *), GFP_KERNEL);
> @@ -1418,10 +1430,12 @@ static int amdgpu_cs_wait_any_fence(struct amdgpu_device *adev,
>   				       &first);
>   	if (r < 0)
>   		goto err_free_fence_array;
> +	else
> +		fence_err = dma_fence_get_status(array[first]);
>   
>   out:
>   	memset(wait, 0, sizeof(*wait));
> -	wait->out.status = (r > 0);
> +	wait->out.status = (r > 0 && fence_err == 0);
>   	wait->out.first_signaled = first;
>   	/* set return value 0 to indicate success */
>   	r = 0;
> @@ -1431,6 +1445,9 @@ static int amdgpu_cs_wait_any_fence(struct amdgpu_device *adev,
>   		dma_fence_put(array[i]);
>   	kfree(array);
>   
> +	if (fence_err < 0)
> +		return -ENODEV;
> +
>   	return r;
>   }
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 9efbb33..122e2e1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2734,6 +2734,96 @@ static int amdgpu_recover_vram_from_shadow(struct amdgpu_device *adev,
>   }
>   
>   /**
> + * amdgpu_sriov_gpu_reset_strict - reset the asic under strict mode
> + *
> + * @adev: amdgpu device pointer
> + * @job: which job trigger hang
> + *
> + * Attempt to reset the GPU if it has hung (all asics),
> + * for the SRIOV case.
> + * Returns 0 for success or an error on failure.
> + *
> + * this function will deny all processes/fences created before this reset,
> + * and drop all unfinished jobs during this reset.
> + *
> + * The application should take responsibility for re-opening the FD to re-create
> + * the VM page table and recover all resources as well
> + *
> + **/
> +int amdgpu_sriov_gpu_reset_strict(struct amdgpu_device *adev, struct amdgpu_job *job)
> +{
> +	int i, r = 0;
> +	int resched;
> +	struct amdgpu_ring *ring;
> +
> +	/* other thread is already into the gpu reset so just quit and come later */
> +	if (!atomic_add_unless(&adev->in_sriov_reset, 1, 1))
> +		return -EAGAIN;
> +
> +	atomic_inc(&adev->gpu_reset_counter);
> +
> +	/* block TTM */
> +	resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
> +
> +	/* fake signal jobs already scheduled  */
> +	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> +		ring = adev->rings[i];
> +
> +		if (!ring || !ring->sched.thread)
> +			continue;
> +
> +		kthread_park(ring->sched.thread);
> +		amd_sched_set_sched_hang(&ring->sched);
> +		amdgpu_fence_driver_force_completion_ring(ring);
> +		amd_sched_set_queue_hang(&ring->sched);
> +	}
> +
> +	/* request to take full control of GPU before re-initialization  */
> +	if (job)
> +		amdgpu_virt_reset_gpu(adev);
> +	else
> +		amdgpu_virt_request_full_gpu(adev, true);
> +
> +	/* Resume IP prior to SMC */
> +	amdgpu_sriov_reinit_early(adev);
> +
> +	/* we need recover gart prior to run SMC/CP/SDMA resume */
> +	amdgpu_ttm_recover_gart(adev);
> +
> +	/* now we are okay to resume SMC/CP/SDMA */
> +	amdgpu_sriov_reinit_late(adev);
> +
> +	/* resume IRQ status */
> +	amdgpu_irq_gpu_reset_resume_helper(adev);
> +
> +	if (amdgpu_ib_ring_tests(adev))
> +		dev_err(adev->dev, "[GPU_RESET] ib ring test failed (%d).\n", r);
> +
> +	/* release full control of GPU after ib test */
> +	amdgpu_virt_release_full_gpu(adev, true);
> +
> +	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> +		ring = adev->rings[i];
> +
> +		if (!ring || !ring->sched.thread)
> +			continue;
> +
> +		kthread_unpark(ring->sched.thread);
> +	}
> +
> +	drm_helper_resume_force_mode(adev->ddev);
> +
> +	ttm_bo_unlock_delayed_workqueue(&adev->mman.bdev, resched);
> +	if (r)
> +		dev_info(adev->dev, "Strict mode GPU reset failed\n");
> +	else
> +		dev_info(adev->dev, "Strict mode GPU reset succeeded!\n");
> +
> +	atomic_set(&adev->in_sriov_reset, 0);
> +	return 0;
> +}
> +
> +/**
>    * amdgpu_sriov_gpu_reset - reset the asic
>    *
>    * @adev: amdgpu device pointer
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 8f5211c..eee67dc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -123,6 +123,7 @@ int amdgpu_cntl_sb_buf_per_se = 0;
>   int amdgpu_param_buf_per_se = 0;
>   int amdgpu_job_hang_limit = 0;
>   int amdgpu_lbpw = -1;
> +int amdgpu_sriov_reset_level = 0;
>   
>   MODULE_PARM_DESC(vramlimit, "Restrict VRAM for testing, in megabytes");
>   module_param_named(vramlimit, amdgpu_vram_limit, int, 0600);
> @@ -269,6 +270,9 @@ module_param_named(job_hang_limit, amdgpu_job_hang_limit, int ,0444);
>   MODULE_PARM_DESC(lbpw, "Load Balancing Per Watt (LBPW) support (1 = enable, 0 = disable, -1 = auto)");
>   module_param_named(lbpw, amdgpu_lbpw, int, 0444);
>   
> +MODULE_PARM_DESC(sriov_reset_level, "GPU reset level (0 = loose, 1 = strict, other = disabled; default 0)");
> +module_param_named(sriov_reset_level, amdgpu_sriov_reset_level, int, 0444);
> +
>   #ifdef CONFIG_DRM_AMDGPU_SI
>   
>   #if defined(CONFIG_DRM_RADEON) || defined(CONFIG_DRM_RADEON_MODULE)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 0db81a4..933823a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -41,7 +41,11 @@ static void amdgpu_job_timedout(struct amd_sched_job *s_job)
>   		int r;
>   
>   try_again:
> -		r = amdgpu_sriov_gpu_reset(job->adev, job);
> +		if (amdgpu_sriov_reset_level == 1)
> +			r = amdgpu_sriov_gpu_reset_strict(job->adev, job);
> +		else
> +			r = amdgpu_sriov_gpu_reset(job->adev, job);
> +
>   		if (r == -EAGAIN) {
>   			/* maybe two different schedulers both have hung jobs, try later */
>   			schedule();
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> index a3cbd5a..5664a10 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> @@ -100,5 +100,6 @@ int amdgpu_virt_reset_gpu(struct amdgpu_device *adev);
>   int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job);
>   int amdgpu_virt_alloc_mm_table(struct amdgpu_device *adev);
>   void amdgpu_virt_free_mm_table(struct amdgpu_device *adev);
> +int amdgpu_sriov_gpu_reset_strict(struct amdgpu_device *adev, struct amdgpu_job *job);
>   
>   #endif
> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> index 2812d88..00a9629 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> @@ -247,8 +247,8 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
>   		return;
>   	}
>   
> -	/* Trigger recovery due to world switch failure */
> -	amdgpu_sriov_gpu_reset(adev, NULL);
> +	/* use strict mode if FLR triggered from hypervisor */
> +	amdgpu_sriov_gpu_reset_strict(adev, NULL);
>   }
>   
>   static int xgpu_ai_set_mailbox_rcv_irq(struct amdgpu_device *adev,
> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
> index c25a831..c94b6e9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
> @@ -513,8 +513,8 @@ static void xgpu_vi_mailbox_flr_work(struct work_struct *work)
>   		return;
>   	}
>   
> -	/* Trigger recovery due to world switch failure */
> -	amdgpu_sriov_gpu_reset(adev, NULL);
> +	/* use strict mode if FLR triggered from hypervisor */
> +	amdgpu_sriov_gpu_reset_strict(adev, NULL);
>   }
>   
>   static int xgpu_vi_set_mailbox_rcv_irq(struct amdgpu_device *adev,
> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> index 97c94f9..12c3092 100644
> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> @@ -430,6 +430,66 @@ void amd_sched_hw_job_reset(struct amd_gpu_scheduler *sched)
>   	spin_unlock(&sched->job_list_lock);
>   }
>   
> +/**
> + * amd_sched_set_sched_hang
> + * @sched: the scheduler whose pending jobs should all be marked hung
> + *
> + * this routine sets all unfinished jobs pending in the sched to
> + * an error -ETIME status
> + *
> + **/
> +void amd_sched_set_sched_hang(struct amd_gpu_scheduler *sched)
> +{
> +	struct amd_sched_job *s_job;
> +
> +	spin_lock(&sched->job_list_lock);
> +	list_for_each_entry_reverse(s_job, &sched->ring_mirror_list, node)
> +		dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
> +
> +	spin_unlock(&sched->job_list_lock);
> +}
> +
> +/**
> + * amd_sched_set_queue_hang
> + * @sched: the scheduler whose queued jobs should all be marked hung
> + *
> + * this routine sets all jobs in the KFIFO of @sched to an error
> + * -ETIME status and signals those jobs.
> + *
> + **/
> +
> +void amd_sched_set_queue_hang(struct amd_gpu_scheduler *sched)
> +{
> +	struct amd_sched_entity *entity, *tmp;
> +	struct amd_sched_job *s_job;
> +	struct amd_sched_rq *rq;
> +	int i;
> +
> +	/* set HANG status on all jobs queued and fake signal them */
> +	for (i = AMD_SCHED_PRIORITY_MIN; i < AMD_SCHED_PRIORITY_MAX; i++) {
> +		rq = &sched->sched_rq[i];
> +
> +		spin_lock(&rq->lock);
> +		list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
> +			if (entity->dependency) {
> +				dma_fence_remove_callback(entity->dependency, &entity->cb);
> +				dma_fence_put(entity->dependency);
> +				entity->dependency = NULL;
> +			}
> +
> +			spin_lock(&entity->queue_lock);
> +			while(kfifo_out(&entity->job_queue, &s_job, sizeof(s_job)) == sizeof(s_job)) {
> +				dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
> +				amd_sched_fence_scheduled(s_job->s_fence);
> +				amd_sched_fence_finished(s_job->s_fence);
> +			}
> +			spin_unlock(&entity->queue_lock);
> +		}
> +		spin_unlock(&rq->lock);
> +	}
> +	wake_up(&sched->job_scheduled);
> +}
> +
>   void amd_sched_job_kickout(struct amd_sched_job *s_job)
>   {
>   	struct amd_gpu_scheduler *sched = s_job->sched;
> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> index f9d8f28..f0242aa 100644
> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> @@ -167,4 +167,6 @@ void amd_sched_job_recovery(struct amd_gpu_scheduler *sched);
>   bool amd_sched_dependency_optimized(struct dma_fence* fence,
>   				    struct amd_sched_entity *entity);
>   void amd_sched_job_kickout(struct amd_sched_job *s_job);
> +void amd_sched_set_queue_hang(struct amd_gpu_scheduler *sched);
> +void amd_sched_set_sched_hang(struct amd_gpu_scheduler *sched);
>   #endif



* Re: [PATCH 08/12] drm/amdgpu:explicitly call fence_process
       [not found]     ` <1506751432-21789-9-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2017-10-09  8:23       ` Christian König
       [not found]         ` <5cb1ae43-ec3a-2b0b-b78b-91cefd575672-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 49+ messages in thread
From: Christian König @ 2017-10-09  8:23 UTC (permalink / raw)
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 30.09.2017 at 08:03, Monk Liu wrote:
> this way there is no need to wait for the timer to trigger, which saves time

In principle a good idea, but please remove 
amdgpu_fence_driver_force_completion_ring() and use 
amdgpu_fence_driver_force_completion() instead.
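
Roughly like this (a sketch, assuming the all-rings helper keeps its
current loop over adev->rings):

    void amdgpu_fence_driver_force_completion(struct amdgpu_device *adev)
    {
            int i;

            for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
                    struct amdgpu_ring *ring = adev->rings[i];

                    if (!ring || !ring->fence_drv.initialized)
                            continue;

                    amdgpu_fence_write(ring, ring->fence_drv.sync_seq);
                    /* kick fence processing now instead of waiting
                     * for the fallback timer to fire */
                    amdgpu_fence_process(ring);
            }
    }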

Regards,
Christian.

>
> Change-Id: Ie96fd2fc1f6054ebc1e58c3d703471639371ee22
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 7 ++++++-
>   1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 333bad7..13785d8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -543,8 +543,13 @@ void amdgpu_fence_driver_force_completion(struct amdgpu_device *adev)
>   
>   void amdgpu_fence_driver_force_completion_ring(struct amdgpu_ring *ring)
>   {
> -	if (ring)
> +	if (ring) {
>   		amdgpu_fence_write(ring, ring->fence_drv.sync_seq);
> +		/* calling fence process manually gets it done quickly
> +		 * instead of waiting for the timer to trigger
> +		 */
> +		amdgpu_fence_process(ring);
> +	}
>   }
>   
>   /*



* Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted
       [not found]     ` <1506751432-21789-10-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2017-10-09  8:25       ` Christian König
       [not found]         ` <6e81d8b0-267a-1ea8-b228-93286fc6a954-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 49+ messages in thread
From: Christian König @ 2017-10-09  8:25 UTC (permalink / raw)
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 30.09.2017 at 08:03, Monk Liu wrote:
> for SRIOV strict mode gpu reset:
>
> In kms open we record the latest adev->gpu_reset_counter in fpriv;
> we return -ENODEV from cs_ioctl or info_ioctl if they find
> fpriv->gpu_reset_counter != adev->gpu_reset_counter.
>
> this way we prevent a potentially bad process/FD from submitting
> cmds and notify userspace with -ENODEV.
>
> userspace should close all BOs/ctxs and re-open the
> dri FD to re-create the virtual memory system for this process

The whole approach is a NAK from my side.

We need to enable userspace to continue, not force it into process 
termination to recover. Otherwise we could send a SIGTERM in the first 
place.

Regards,
Christian.

>
> Change-Id: Ib4c179f28a3d0783837566f29de07fc14aa9b9a4
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h     | 1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 5 +++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 7 +++++++
>   3 files changed, 13 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index de9c164..b40d4ba 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -772,6 +772,7 @@ struct amdgpu_fpriv {
>   	struct idr		bo_list_handles;
>   	struct amdgpu_ctx_mgr	ctx_mgr;
>   	u32			vram_lost_counter;
> +	int gpu_reset_counter;
>   };
>   
>   /*
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index 9467cf6..6a1515e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -1199,6 +1199,11 @@ int amdgpu_cs_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
>   	if (amdgpu_kms_vram_lost(adev, fpriv))
>   		return -ENODEV;
>   
> +	if (amdgpu_sriov_vf(adev) &&
> +		amdgpu_sriov_reset_level == 1 &&
> +		fpriv->gpu_reset_counter < atomic_read(&adev->gpu_reset_counter))
> +		return -ENODEV;
> +
>   	parser.adev = adev;
>   	parser.filp = filp;
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> index 282f45b..bd389cf 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> @@ -285,6 +285,11 @@ static int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file
>   	if (amdgpu_kms_vram_lost(adev, fpriv))
>   		return -ENODEV;
>   
> +	if (amdgpu_sriov_vf(adev) &&
> +		amdgpu_sriov_reset_level == 1 &&
> +		fpriv->gpu_reset_counter < atomic_read(&adev->gpu_reset_counter))
> +		return -ENODEV;
> +
>   	switch (info->query) {
>   	case AMDGPU_INFO_ACCEL_WORKING:
>   		ui32 = adev->accel_working;
> @@ -824,6 +829,8 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv)
>   		goto out_suspend;
>   	}
>   
> +	fpriv->gpu_reset_counter = atomic_read(&adev->gpu_reset_counter);
> +
>   	r = amdgpu_vm_init(adev, &fpriv->vm,
>   			   AMDGPU_VM_CONTEXT_GFX, 0);
>   	if (r) {



* Re: [PATCH 10/12] drm/amdgpu/sriov:implement guilty ctx for loose reset
       [not found]     ` <1506751432-21789-11-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2017-10-09  8:27       ` Christian König
       [not found]         ` <e4c96014-b4f4-e013-a966-9e2e03b9a62b-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 49+ messages in thread
From: Christian König @ 2017-10-09  8:27 UTC (permalink / raw)
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 30.09.2017 at 08:03, Monk Liu wrote:
> Change-Id: I7904f362aa0f578a5cbf5d40c7a242c2c6680a92
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>

NAK, whether a context is guilty of a GPU reset should be determined in 
amdgpu_ctx_query() by looking at the fences in the ring buffer,
not when the GPU reset itself occurs.
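
Something along these lines (a rough sketch; the helper name and exact
fence bookkeeping are illustrative, not code from this series):

    /* decide guilt lazily when userspace queries the context, by
     * scanning the context's stored fences for an error status,
     * instead of flagging the entity during the reset itself */
    static bool amdgpu_ctx_is_guilty_sketch(struct amdgpu_ctx *ctx)
    {
            unsigned ring, i;

            for (ring = 0; ring < AMDGPU_MAX_RINGS; ++ring) {
                    for (i = 0; i < amdgpu_sched_jobs; ++i) {
                            struct dma_fence *f = ctx->rings[ring].fences[i];

                            if (f && dma_fence_get_status(f) < 0)
                                    return true;
                    }
            }
            return false;
    }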

Regards,
Christian.

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h           |  1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 16 +++++++++-------
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       |  1 +
>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 22 ++++++++++++++++++++++
>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.h |  1 +
>   5 files changed, 34 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index b40d4ba..b63e602 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -737,6 +737,7 @@ struct amdgpu_ctx {
>   	struct dma_fence	**fences;
>   	struct amdgpu_ctx_ring	rings[AMDGPU_MAX_RINGS];
>   	bool preamble_presented;
> +	bool guilty;
>   };
>   
>   struct amdgpu_ctx_mgr {
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index 6a1515e..f92962e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -79,16 +79,19 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data)
>   	if (cs->in.num_chunks == 0)
>   		return 0;
>   
> +	p->ctx = amdgpu_ctx_get(fpriv, cs->in.ctx_id);
> +	if (!p->ctx)
> +		return -EINVAL;
> +
> +	if (amdgpu_sriov_vf(p->adev) &&
> +		amdgpu_sriov_reset_level == 0 &&
> +		p->ctx->guilty)
> +		return -ENODEV;
> +
>   	chunk_array = kmalloc_array(cs->in.num_chunks, sizeof(uint64_t), GFP_KERNEL);
>   	if (!chunk_array)
>   		return -ENOMEM;
>   
> -	p->ctx = amdgpu_ctx_get(fpriv, cs->in.ctx_id);
> -	if (!p->ctx) {
> -		ret = -EINVAL;
> -		goto free_chunk;
> -	}
> -
>   	/* get chunks */
>   	chunk_array_user = u64_to_user_ptr(cs->in.chunks);
>   	if (copy_from_user(chunk_array, chunk_array_user,
> @@ -184,7 +187,6 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data)
>   	p->nchunks = 0;
>   put_ctx:
>   	amdgpu_ctx_put(p->ctx);
> -free_chunk:
>   	kfree(chunk_array);
>   
>   	return ret;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> index 75c933b..028e9f1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> @@ -60,6 +60,7 @@ static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>   					  rq, amdgpu_sched_jobs);
>   		if (r)
>   			goto failed;
> +		ctx->rings[i].entity.guilty = &ctx->guilty;
>   	}
>   
>   	r = amdgpu_queue_mgr_init(adev, &ctx->queue_mgr);
> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> index 12c3092..89b0573 100644
> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> @@ -493,10 +493,32 @@ void amd_sched_set_queue_hang(struct amd_gpu_scheduler *sched)
>   void amd_sched_job_kickout(struct amd_sched_job *s_job)
>   {
>   	struct amd_gpu_scheduler *sched = s_job->sched;
> +	struct amd_sched_entity *entity, *tmp;
> +	struct amd_sched_rq *rq;
> +	int i;
> +	bool found;
>   
>   	spin_lock(&sched->job_list_lock);
>   	list_del_init(&s_job->node);
>   	spin_unlock(&sched->job_list_lock);
> +
> +	dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
> +
> +	for (i = AMD_SCHED_PRIORITY_MIN; i < AMD_SCHED_PRIORITY_KERNEL; i++) {
> +		rq = &sched->sched_rq[i];
> +
> +		spin_lock(&rq->lock);
> +		list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
> +			if (s_job->s_entity == entity && entity->guilty) {
> +				*entity->guilty = true;
> +				found = true;
> +				break;
> +			}
> +		}
> +		spin_unlock(&rq->lock);
> +		if (found)
> +			break;
> +	}
>   }
>   
>   void amd_sched_job_recovery(struct amd_gpu_scheduler *sched)
> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> index f0242aa..16c2244 100644
> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> @@ -49,6 +49,7 @@ struct amd_sched_entity {
>   
>   	struct dma_fence		*dependency;
>   	struct dma_fence_cb		cb;
> +	bool *guilty; /* this points to ctx's guilty */
>   };
>   
>   /**



* Re: [PATCH 11/12] drm/amdgpu/sriov:show error if ib test failed
       [not found]     ` <1506751432-21789-12-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2017-10-09  8:29       ` Christian König
  0 siblings, 0 replies; 49+ messages in thread
From: Christian König @ 2017-10-09  8:29 UTC (permalink / raw)
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 30.09.2017 at 08:03, Monk Liu wrote:
> fix the incorrect ib test result message in loose mode gpu reset
>
> Change-Id: Ic4e3b51e4ff77c5e08d268a4a5ca32e7c882367c
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 122e2e1..c3f10b5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2902,7 +2902,8 @@ int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job)
>   
>   	amdgpu_irq_gpu_reset_resume_helper(adev);
>   
> -	if (amdgpu_ib_ring_tests(adev))
> +	r = amdgpu_ib_ring_tests(adev);
> +	if (r)
>   		dev_err(adev->dev, "[GPU_RESET] ib ring test failed (%d).\n", r);
>   
>   	/* release full control of GPU after ib test */



* RE: [PATCH 07/12] drm/amdgpu/sriov:implement strict gpu reset
       [not found]         ` <250ce10a-cca0-0193-b2ed-cc2f04e80d0c-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-10-09  8:30           ` Liu, Monk
  0 siblings, 0 replies; 49+ messages in thread
From: Liu, Monk @ 2017-10-09  8:30 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Why shouldn't we return an error code? Any particular reason?

If we wake up a thread waiting on some fence and return 0, the UMD won't be aware that a gpu hang occurred just now ...

For this strict mode reset the policy is aligned with the Vulkan spec:
it defines that the VK runtime should return a VK_ERROR_DEVICE_LOST error to the app from those waiting functions, and the kernel should make the VK UMD aware of it, so the kernel must return an error to the UMD in the cs_wait IOCTL.

The UMD cannot invoke amdgpu_ctx_query for no reason; it should be notified of the error by some fence wait like cs_wait.

BR Monk
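
For illustration, the UMD-side handling argued for here could look
roughly like this (a hypothetical Vulkan-runtime wrapper, not code from
any actual UMD):

    /* translate a kernel -ENODEV from the cs wait ioctl into
     * VK_ERROR_DEVICE_LOST for the application */
    static VkResult wait_cs_sketch(int fd, union drm_amdgpu_wait_cs *args)
    {
            int r = drmCommandWriteRead(fd, DRM_AMDGPU_WAIT_CS,
                                        args, sizeof(*args));

            if (r < 0)      /* includes -ENODEV after a strict reset */
                    return VK_ERROR_DEVICE_LOST;
            return args->out.status ? VK_TIMEOUT : VK_SUCCESS;
    }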

-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com] 
Sent: 2017年10月9日 16:20
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 07/12] drm/amdgpu/sriov:implement strict gpu reset

On 30.09.2017 at 08:03, Monk Liu wrote:
> changes:
> 1) implement strict mode sriov gpu reset 2) always call
> sriov_gpu_reset_strict if the hypervisor notifies FLR 3) in strict reset
> mode, set an error on all fences.
> 4) change fence_wait/cs_wait functions to return -ENODEV if the fence
> signaled with error == -ETIME
>
> Since after a strict gpu reset we consider the VRAM lost, and since,
> assuming VRAM is lost, there is little point in recovering shadow BOs,
> because textures/resources/shaders cannot be recovered (if they
> resided in VRAM)

NAK, we shouldn't return an error code from the wait function; instead, handle the fence error code in amdgpu_ctx_query().

Regards,
Christian.

>
> Change-Id: I50d9b8b5185ba92f137f07c9deeac19d740d753b
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h           |  1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 25 ++++++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 90 +++++++++++++++++++++++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |  4 ++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       |  6 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h      |  1 +
>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c         |  4 +-
>   drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c         |  4 +-
>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 60 ++++++++++++++++++
>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.h |  2 +
>   10 files changed, 188 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index de11527..de9c164 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -123,6 +123,7 @@ extern int amdgpu_cntl_sb_buf_per_se;
>   extern int amdgpu_param_buf_per_se;
>   extern int amdgpu_job_hang_limit;
>   extern int amdgpu_lbpw;
> +extern int amdgpu_sriov_reset_level;
>   
>   #ifdef CONFIG_DRM_AMDGPU_SI
>   extern int amdgpu_si_support;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index c6a214f..9467cf6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -1262,6 +1262,7 @@ int amdgpu_cs_wait_ioctl(struct drm_device *dev, void *data,
>   	struct amdgpu_ctx *ctx;
>   	struct dma_fence *fence;
>   	long r;
> +	int fence_err = 0;
>   
>   	if (amdgpu_kms_vram_lost(adev, fpriv))
>   		return -ENODEV;
> @@ -1283,6 +1284,8 @@ int amdgpu_cs_wait_ioctl(struct drm_device *dev, void *data,
>   		r = PTR_ERR(fence);
>   	else if (fence) {
>   		r = dma_fence_wait_timeout(fence, true, timeout);
> +		/* if fence_err < 0, the gpu hung and this fence was signaled by gpu reset */
> +		fence_err = dma_fence_get_status(fence);
>   		dma_fence_put(fence);
>   	} else
>   		r = 1;
> @@ -1292,7 +1295,10 @@ int amdgpu_cs_wait_ioctl(struct drm_device *dev, void *data,
>   		return r;
>   
>   	memset(wait, 0, sizeof(*wait));
> -	wait->out.status = (r == 0);
> +	wait->out.status = (fence_err < 0);
> +
> +	if (fence_err < 0)
> +		return -ENODEV;
>   
>   	return 0;
>   }
> @@ -1346,6 +1352,7 @@ static int amdgpu_cs_wait_all_fences(struct amdgpu_device *adev,
>   	uint32_t fence_count = wait->in.fence_count;
>   	unsigned int i;
>   	long r = 1;
> +	int fence_err = 0;
>   
>   	for (i = 0; i < fence_count; i++) {
>   		struct dma_fence *fence;
> @@ -1358,16 +1365,20 @@ static int amdgpu_cs_wait_all_fences(struct amdgpu_device *adev,
>   			continue;
>   
>   		r = dma_fence_wait_timeout(fence, true, timeout);
> +		fence_err = dma_fence_get_status(fence);
>   		dma_fence_put(fence);
>   		if (r < 0)
>   			return r;
>   
> -		if (r == 0)
> +		if (r == 0 || fence_err < 0)
>   			break;
>   	}
>   
>   	memset(wait, 0, sizeof(*wait));
> -	wait->out.status = (r > 0);
> +	wait->out.status = (r > 0 && fence_err == 0);
> +
> +	if (fence_err < 0)
> +		return -ENODEV;
>   
>   	return 0;
>   }
> @@ -1391,6 +1402,7 @@ static int amdgpu_cs_wait_any_fence(struct amdgpu_device *adev,
>   	struct dma_fence **array;
>   	unsigned int i;
>   	long r;
> +	int fence_err = 0;
>   
>   	/* Prepare the fence array */
>   	array = kcalloc(fence_count, sizeof(struct dma_fence *), 
> GFP_KERNEL); @@ -1418,10 +1430,12 @@ static int amdgpu_cs_wait_any_fence(struct amdgpu_device *adev,
>   				       &first);
>   	if (r < 0)
>   		goto err_free_fence_array;
> +	else
> +		fence_err = dma_fence_get_status(array[first]);
>   
>   out:
>   	memset(wait, 0, sizeof(*wait));
> -	wait->out.status = (r > 0);
> +	wait->out.status = (r > 0 && fence_err == 0);
>   	wait->out.first_signaled = first;
>   	/* set return value 0 to indicate success */
>   	r = 0;
> @@ -1431,6 +1445,9 @@ static int amdgpu_cs_wait_any_fence(struct amdgpu_device *adev,
>   		dma_fence_put(array[i]);
>   	kfree(array);
>   
> +	if (fence_err < 0)
> +		return -ENODEV;
> +
>   	return r;
>   }
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 9efbb33..122e2e1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2734,6 +2734,96 @@ static int amdgpu_recover_vram_from_shadow(struct amdgpu_device *adev,
>   }
>   
>   /**
> + * amdgpu_sriov_gpu_reset_strict - reset the asic under strict mode
> + *
> + * @adev: amdgpu device pointer
> + * @job: which job trigger hang
> + *
> + * Attempt to reset the GPU if it has hung (all asics),
> + * for the SRIOV case.
> + * Returns 0 for success or an error on failure.
> + *
> + * this function will deny all processes/fences created before this reset,
> + * and drop all unfinished jobs during this reset.
> + *
> + * The application should take responsibility for re-opening the FD to re-create
> + * the VM page table and recover all resources as well
> + *
> + **/
> +int amdgpu_sriov_gpu_reset_strict(struct amdgpu_device *adev, struct 
> +amdgpu_job *job) {
> +	int i, r = 0;
> +	int resched;
> +	struct amdgpu_ring *ring;
> +
> +	/* other thread is already into the gpu reset so just quit and come later */
> +	if (!atomic_add_unless(&adev->in_sriov_reset, 1, 1))
> +		return -EAGAIN;
> +
> +	atomic_inc(&adev->gpu_reset_counter);
> +
> +	/* block TTM */
> +	resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
> +
> +	/* fake signal jobs already scheduled  */
> +	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> +		ring = adev->rings[i];
> +
> +		if (!ring || !ring->sched.thread)
> +			continue;
> +
> +		kthread_park(ring->sched.thread);
> +		amd_sched_set_sched_hang(&ring->sched);
> +		amdgpu_fence_driver_force_completion_ring(ring);
> +		amd_sched_set_queue_hang(&ring->sched);
> +	}
> +
> +	/* request to take full control of GPU before re-initialization  */
> +	if (job)
> +		amdgpu_virt_reset_gpu(adev);
> +	else
> +		amdgpu_virt_request_full_gpu(adev, true);
> +
> +	/* Resume IP prior to SMC */
> +	amdgpu_sriov_reinit_early(adev);
> +
> +	/* we need recover gart prior to run SMC/CP/SDMA resume */
> +	amdgpu_ttm_recover_gart(adev);
> +
> +	/* now we are okay to resume SMC/CP/SDMA */
> +	amdgpu_sriov_reinit_late(adev);
> +
> +	/* resume IRQ status */
> +	amdgpu_irq_gpu_reset_resume_helper(adev);
> +
> +	if (amdgpu_ib_ring_tests(adev))
> +		dev_err(adev->dev, "[GPU_RESET] ib ring test failed (%d).\n", r);
> +
> +	/* release full control of GPU after ib test */
> +	amdgpu_virt_release_full_gpu(adev, true);
> +
> +	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> +		ring = adev->rings[i];
> +
> +		if (!ring || !ring->sched.thread)
> +			continue;
> +
> +		kthread_unpark(ring->sched.thread);
> +	}
> +
> +	drm_helper_resume_force_mode(adev->ddev);
> +
> +	ttm_bo_unlock_delayed_workqueue(&adev->mman.bdev, resched);
> +	if (r)
> +		dev_info(adev->dev, "Strict mode GPU reset failed\n");
> +	else
> +		dev_info(adev->dev, "Strict mode GPU reset succeeded!\n");
> +
> +	atomic_set(&adev->in_sriov_reset, 0);
> +	return 0;
> +}
> +
> +/**
>    * amdgpu_sriov_gpu_reset - reset the asic
>    *
>    * @adev: amdgpu device pointer
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 8f5211c..eee67dc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -123,6 +123,7 @@ int amdgpu_cntl_sb_buf_per_se = 0;
>   int amdgpu_param_buf_per_se = 0;
>   int amdgpu_job_hang_limit = 0;
>   int amdgpu_lbpw = -1;
> +int amdgpu_sriov_reset_level = 0;
>   
>   MODULE_PARM_DESC(vramlimit, "Restrict VRAM for testing, in megabytes");
>   module_param_named(vramlimit, amdgpu_vram_limit, int, 0600); @@ 
> -269,6 +270,9 @@ module_param_named(job_hang_limit, amdgpu_job_hang_limit, int ,0444);
>   MODULE_PARM_DESC(lbpw, "Load Balancing Per Watt (LBPW) support (1 = enable, 0 = disable, -1 = auto)");
>   module_param_named(lbpw, amdgpu_lbpw, int, 0444);
>   
> +MODULE_PARM_DESC(sriov_reset_level, "GPU reset level (0 = loose, 1 = strict, other = disabled; default 0)");
> +module_param_named(sriov_reset_level, amdgpu_sriov_reset_level, int, 0444);
> +
>   #ifdef CONFIG_DRM_AMDGPU_SI
>   
>   #if defined(CONFIG_DRM_RADEON) || defined(CONFIG_DRM_RADEON_MODULE) 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 0db81a4..933823a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -41,7 +41,11 @@ static void amdgpu_job_timedout(struct amd_sched_job *s_job)
>   		int r;
>   
>   try_again:
> -		r = amdgpu_sriov_gpu_reset(job->adev, job);
> +		if (amdgpu_sriov_reset_level == 1)
> +			r = amdgpu_sriov_gpu_reset_strict(job->adev, job);
> +		else
> +			r = amdgpu_sriov_gpu_reset(job->adev, job);
> +
>   		if (r == -EAGAIN) {
>   			/* maybe two different schedulers both have a hung job, try later */
>   			schedule();
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> index a3cbd5a..5664a10 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> @@ -100,5 +100,6 @@ int amdgpu_virt_reset_gpu(struct amdgpu_device *adev);
>   int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job);
>   int amdgpu_virt_alloc_mm_table(struct amdgpu_device *adev);
>   void amdgpu_virt_free_mm_table(struct amdgpu_device *adev);
> +int amdgpu_sriov_gpu_reset_strict(struct amdgpu_device *adev, struct amdgpu_job *job);
>   
>   #endif
> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> index 2812d88..00a9629 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> @@ -247,8 +247,8 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
>   		return;
>   	}
>   
> -	/* Trigger recovery due to world switch failure */
> -	amdgpu_sriov_gpu_reset(adev, NULL);
> +	/* use strict mode if FLR triggered from hypervisor */
> +	amdgpu_sriov_gpu_reset_strict(adev, NULL);
>   }
>   
>   static int xgpu_ai_set_mailbox_rcv_irq(struct amdgpu_device *adev, 
> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
> index c25a831..c94b6e9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
> @@ -513,8 +513,8 @@ static void xgpu_vi_mailbox_flr_work(struct work_struct *work)
>   		return;
>   	}
>   
> -	/* Trigger recovery due to world switch failure */
> -	amdgpu_sriov_gpu_reset(adev, NULL);
> +	/* use strict mode if FLR triggered from hypervisor */
> +	amdgpu_sriov_gpu_reset_strict(adev, NULL);
>   }
>   
>   static int xgpu_vi_set_mailbox_rcv_irq(struct amdgpu_device *adev, 
> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> index 97c94f9..12c3092 100644
> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> @@ -430,6 +430,66 @@ void amd_sched_hw_job_reset(struct amd_gpu_scheduler *sched)
>   	spin_unlock(&sched->job_list_lock);
>   }
>   
> +/**
> + * amd_sched_set_sched_hang
> + * @sched: the scheduler whose pending jobs are to be marked hung
> + *
> + * this routine sets all unfinished jobs pending in the sched to
> + * an error -ETIME status
> + *
> + **/
> +void amd_sched_set_sched_hang(struct amd_gpu_scheduler *sched)
> +{
> +	struct amd_sched_job *s_job;
> +
> +	spin_lock(&sched->job_list_lock);
> +	list_for_each_entry_reverse(s_job, &sched->ring_mirror_list, node)
> +		dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
> +
> +	spin_unlock(&sched->job_list_lock);
> +}
> +
> +/**
> + * amd_sched_set_queue_hang
> + * @sched: the scheduler whose queued (kfifo) jobs are to be marked hung
> + *
> + * this routine sets all jobs in the KFIFO of @sched to an error
> + * -ETIME status and signals those jobs.
> + *
> + **/
> +void amd_sched_set_queue_hang(struct amd_gpu_scheduler *sched)
> +{
> +	struct amd_sched_entity *entity, *tmp;
> +	struct amd_sched_job *s_job;
> +	struct amd_sched_rq *rq;
> +	int i;
> +
> +	/* set HANG status on all jobs queued and fake signal them */
> +	for (i = AMD_SCHED_PRIORITY_MIN; i < AMD_SCHED_PRIORITY_MAX; i++) {
> +		rq = &sched->sched_rq[i];
> +
> +		spin_lock(&rq->lock);
> +		list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
> +			if (entity->dependency) {
> +				dma_fence_remove_callback(entity->dependency, &entity->cb);
> +				dma_fence_put(entity->dependency);
> +				entity->dependency = NULL;
> +			}
> +
> +			spin_lock(&entity->queue_lock);
> +			while (kfifo_out(&entity->job_queue, &s_job, sizeof(s_job)) == sizeof(s_job)) {
> +				dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
> +				amd_sched_fence_scheduled(s_job->s_fence);
> +				amd_sched_fence_finished(s_job->s_fence);
> +			}
> +			spin_unlock(&entity->queue_lock);
> +		}
> +		spin_unlock(&rq->lock);
> +	}
> +	wake_up(&sched->job_scheduled);
> +}
> +
>   void amd_sched_job_kickout(struct amd_sched_job *s_job)
>   {
>   	struct amd_gpu_scheduler *sched = s_job->sched;
> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> index f9d8f28..f0242aa 100644
> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> @@ -167,4 +167,6 @@ void amd_sched_job_recovery(struct amd_gpu_scheduler *sched);
>   bool amd_sched_dependency_optimized(struct dma_fence* fence,
>   				    struct amd_sched_entity *entity);
>   void amd_sched_job_kickout(struct amd_sched_job *s_job);
> +void amd_sched_set_queue_hang(struct amd_gpu_scheduler *sched);
> +void amd_sched_set_sched_hang(struct amd_gpu_scheduler *sched);
>   #endif
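
(To make the re-open contract from the kernel-doc above concrete, here is a
minimal userspace sketch. It only uses libdrm's public amdgpu API; the render
node path and the surrounding recovery policy are assumptions for
illustration, not part of this patch.)

#include <amdgpu.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

/* sketch: after a strict-mode reset every ioctl on the old FD returns
 * -ENODEV, so the process has to tear everything down and start over */
static int reopen_after_strict_reset(amdgpu_device_handle *dev, int *fd)
{
	uint32_t major, minor;

	/* the caller must already have dropped all BO/ctx references held
	 * on the old device handle (omitted here) */
	amdgpu_device_deinitialize(*dev);
	close(*fd);

	/* re-open the DRM node so the kernel re-creates the VM page table
	 * for this FD, then rebuild the device handle from scratch */
	*fd = open("/dev/dri/renderD128", O_RDWR); /* node path: example only */
	if (*fd < 0)
		return -errno;

	return amdgpu_device_initialize(*fd, &major, &minor, dev);
}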


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 08/12] drm/amdgpu:explicitly call fence_process
       [not found]         ` <5cb1ae43-ec3a-2b0b-b78b-91cefd575672-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-10-09  8:32           ` Liu, Monk
       [not found]             ` <BLUPR12MB04491DDBC8ACFE2FB43D0F0084740-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 49+ messages in thread
From: Liu, Monk @ 2017-10-09  8:32 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Why do that?

Outside this function there is already a for loop iterating over all rings, so force_completion_ring() is the right one to use.

BR Monk

-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com] 
Sent: 2017年10月9日 16:24
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 08/12] drm/amdgpu:explicitly call fence_process

Am 30.09.2017 um 08:03 schrieb Monk Liu:
> this way no need to wait timer triggered to save time

In principle a good idea, but please remove 
amdgpu_fence_driver_force_completion_ring() and use 
amdgpu_fence_driver_force_completion() instead.

Regards,
Christian.

>
> Change-Id: Ie96fd2fc1f6054ebc1e58c3d703471639371ee22
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 7 ++++++-
>   1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 333bad7..13785d8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -543,8 +543,13 @@ void amdgpu_fence_driver_force_completion(struct amdgpu_device *adev)
>   
>   void amdgpu_fence_driver_force_completion_ring(struct amdgpu_ring *ring)
>   {
> -	if (ring)
> +	if (ring) {
>   		amdgpu_fence_write(ring, ring->fence_drv.sync_seq);
> +		/* call fence process manually can get it done quickly
> +		 * instead of waiting for the timer triggered
> +		 */
> +		amdgpu_fence_process(ring);
> +	}
>   }
>   
>   /*


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted
       [not found]         ` <6e81d8b0-267a-1ea8-b228-93286fc6a954-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-10-09  8:35           ` Liu, Monk
       [not found]             ` <BLUPR12MB0449531313F50BE080F7746D84740-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 49+ messages in thread
From: Liu, Monk @ 2017-10-09  8:35 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Please be aware that this policy is what strict mode defines and what the customer wants.
Also please check the VK spec: it defines that after a GPU reset every VkInstance should close/release its resources/device/ctx and all buffers, and re-initialize the VkInstance after the reset.

So this whole approach is simply aligned with the spec, and to avoid influencing current MESA/OGL clients I put the whole approach into strict mode.
By default, strict mode is not selected.
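
(For reference, the device-lost contract on the application side looks
roughly like this; a minimal sketch using core Vulkan names, where the
tear-down helper and the re-created objects are app-specific assumptions:)

VkResult r = vkQueueSubmit(queue, 1, &submit_info, fence);
if (r == VK_ERROR_DEVICE_LOST) {
	/* spec-mandated recovery: release every device-level object ... */
	destroy_all_device_objects();   /* app-specific: pipelines, memory, ctx */
	vkDestroyDevice(device, NULL);
	/* ... then re-create the logical device (and, if the app chooses,
	 * the whole VkInstance) and restore all resources */
	r = vkCreateDevice(physical_device, &device_ci, NULL, &device);
}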


BR Monk

-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com] 
Sent: 2017年10月9日 16:26
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted

Am 30.09.2017 um 08:03 schrieb Monk Liu:
> for SRIOV strict mode gpu reset:
>
> In kms open we mark the latest adev->gpu_reset_counter in fpriv we 
> return -ENODEV in cs_ioctl or info_ioctl if they found
> fpriv->gpu_reset_counter != adev->gpu_reset_counter.
>
> this way we prevent a potential bad process/FD from submitting cmds 
> and notify userspace with -ENODEV.
>
> userspace should close all BO/ctx and re-open dri FD to re-create 
> virtual memory system for this process

The whole approach is a NAK from my side.

We need to enable userspace to continue, not force it into process termination to recover. Otherwise we could send a SIGTERM in the first place.

Regards,
Christian.

>
> Change-Id: Ib4c179f28a3d0783837566f29de07fc14aa9b9a4
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h     | 1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 5 +++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 7 +++++++
>   3 files changed, 13 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index de9c164..b40d4ba 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -772,6 +772,7 @@ struct amdgpu_fpriv {
>   	struct idr		bo_list_handles;
>   	struct amdgpu_ctx_mgr	ctx_mgr;
>   	u32			vram_lost_counter;
> +	int gpu_reset_counter;
>   };
>   
>   /*
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index 9467cf6..6a1515e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -1199,6 +1199,11 @@ int amdgpu_cs_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
>   	if (amdgpu_kms_vram_lost(adev, fpriv))
>   		return -ENODEV;
>   
> +	if (amdgpu_sriov_vf(adev) &&
> +		amdgpu_sriov_reset_level == 1 &&
> +		fpriv->gpu_reset_counter < atomic_read(&adev->gpu_reset_counter))
> +		return -ENODEV;
> +
>   	parser.adev = adev;
>   	parser.filp = filp;
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> index 282f45b..bd389cf 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> @@ -285,6 +285,11 @@ static int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file
>   	if (amdgpu_kms_vram_lost(adev, fpriv))
>   		return -ENODEV;
>   
> +	if (amdgpu_sriov_vf(adev) &&
> +		amdgpu_sriov_reset_level == 1 &&
> +		fpriv->gpu_reset_counter < atomic_read(&adev->gpu_reset_counter))
> +		return -ENODEV;
> +
>   	switch (info->query) {
>   	case AMDGPU_INFO_ACCEL_WORKING:
>   		ui32 = adev->accel_working;
> @@ -824,6 +829,8 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv)
>   		goto out_suspend;
>   	}
>   
> +	fpriv->gpu_reset_counter = atomic_read(&adev->gpu_reset_counter);
> +
>   	r = amdgpu_vm_init(adev, &fpriv->vm,
>   			   AMDGPU_VM_CONTEXT_GFX, 0);
>   	if (r) {


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 10/12] drm/amdgpu/sriov:implement guilty ctx for loose reset
       [not found]         ` <e4c96014-b4f4-e013-a966-9e2e03b9a62b-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-10-09  8:39           ` Liu, Monk
       [not found]             ` <BLUPR12MB0449C8E878F09AE59BA816E284740-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 49+ messages in thread
From: Liu, Monk @ 2017-10-09  8:39 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

How is the APP/UMD aware that a context is guilty or has triggered too many hang loops?

Why would the APP/UMD voluntarily call amdgpu_ctx_query() to check whether a gpu reset occurred or not?

Please be aware that for another CSP customer this "loose mode" is 100% welcome and wanted by them, and more important:

this approach won't cross the X server at all; only the guilty process/context is rejected upon its submission.


I don't agree that we should rely on ctx_query(); no one is responsible for calling it from time to time.



-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com] 
Sent: 2017年10月9日 16:28
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 10/12] drm/amdgpu/sriov:implement guilty ctx for loose reset

Am 30.09.2017 um 08:03 schrieb Monk Liu:
> Change-Id: I7904f362aa0f578a5cbf5d40c7a242c2c6680a92
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>

NAK, if a context is guilty of a GPU reset should be determined in
amdgpu_ctx_query() by looking at the fences in the ring buffer.

Not when the GPU reset itself occurs.

Regards,
Christian.

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h           |  1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 16 +++++++++-------
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       |  1 +
>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 22 ++++++++++++++++++++++
>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.h |  1 +
>   5 files changed, 34 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index b40d4ba..b63e602 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -737,6 +737,7 @@ struct amdgpu_ctx {
>   	struct dma_fence	**fences;
>   	struct amdgpu_ctx_ring	rings[AMDGPU_MAX_RINGS];
>   	bool preamble_presented;
> +	bool guilty;
>   };
>   
>   struct amdgpu_ctx_mgr {
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index 6a1515e..f92962e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -79,16 +79,19 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data)
>   	if (cs->in.num_chunks == 0)
>   		return 0;
>   
> +	p->ctx = amdgpu_ctx_get(fpriv, cs->in.ctx_id);
> +	if (!p->ctx)
> +		return -EINVAL;
> +
> +	if (amdgpu_sriov_vf(p->adev) &&
> +		amdgpu_sriov_reset_level == 0 &&
> +		p->ctx->guilty)
> +		return -ENODEV;
> +
>   	chunk_array = kmalloc_array(cs->in.num_chunks, sizeof(uint64_t), GFP_KERNEL);
>   	if (!chunk_array)
>   		return -ENOMEM;
>   
> -	p->ctx = amdgpu_ctx_get(fpriv, cs->in.ctx_id);
> -	if (!p->ctx) {
> -		ret = -EINVAL;
> -		goto free_chunk;
> -	}
> -
>   	/* get chunks */
>   	chunk_array_user = u64_to_user_ptr(cs->in.chunks);
>   	if (copy_from_user(chunk_array, chunk_array_user, @@ -184,7 +187,6 
> @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data)
>   	p->nchunks = 0;
>   put_ctx:
>   	amdgpu_ctx_put(p->ctx);
> -free_chunk:
>   	kfree(chunk_array);
>   
>   	return ret;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> index 75c933b..028e9f1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> @@ -60,6 +60,7 @@ static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>   					  rq, amdgpu_sched_jobs);
>   		if (r)
>   			goto failed;
> +		ctx->rings[i].entity.guilty = &ctx->guilty;
>   	}
>   
>   	r = amdgpu_queue_mgr_init(adev, &ctx->queue_mgr); diff --git 
> a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c 
> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> index 12c3092..89b0573 100644
> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> @@ -493,10 +493,32 @@ void amd_sched_set_queue_hang(struct amd_gpu_scheduler *sched)
>   void amd_sched_job_kickout(struct amd_sched_job *s_job)
>   {
>   	struct amd_gpu_scheduler *sched = s_job->sched;
> +	struct amd_sched_entity *entity, *tmp;
> +	struct amd_sched_rq *rq;
> +	int i;
> +	bool found;
>   
>   	spin_lock(&sched->job_list_lock);
>   	list_del_init(&s_job->node);
>   	spin_unlock(&sched->job_list_lock);
> +
> +	dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
> +
> +	for (i = AMD_SCHED_PRIORITY_MIN; i < AMD_SCHED_PRIORITY_KERNEL; i++) {
> +		rq = &sched->sched_rq[i];
> +
> +		spin_lock(&rq->lock);
> +		list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
> +			if (s_job->s_entity == entity && entity->guilty) {
> +				*entity->guilty = true;
> +				found = true;
> +				break;
> +			}
> +		}
> +		spin_unlock(&rq->lock);
> +		if (found)
> +			break;
> +	}
>   }
>   
>   void amd_sched_job_recovery(struct amd_gpu_scheduler *sched) diff 
> --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h 
> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> index f0242aa..16c2244 100644
> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> @@ -49,6 +49,7 @@ struct amd_sched_entity {
>   
>   	struct dma_fence		*dependency;
>   	struct dma_fence_cb		cb;
> +	bool *guilty; /* this points to ctx's guilty */
>   };
>   
>   /**


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 08/12] drm/amdgpu:explicitly call fence_process
       [not found]             ` <BLUPR12MB04491DDBC8ACFE2FB43D0F0084740-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-10-09  8:40               ` Christian König
       [not found]                 ` <62bb9496-b29f-0230-8fa4-0bad470c12c8-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 49+ messages in thread
From: Christian König @ 2017-10-09  8:40 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

We should avoid functionality duplication here.

Either change the caller of amdgpu_fence_driver_force_completion_ring() 
to use amdgpu_fence_driver_force_completion() or use 
amdgpu_fence_driver_force_completion_ring() in 
amdgpu_fence_driver_force_completion().

The later is probably easier to do.

Regards,
Christian.

Am 09.10.2017 um 10:32 schrieb Liu, Monk:
> Why do that ?
>
> In outside there is already a for loop to iterate over all rings so force_completion_ring() is the right one to use
>
> BR Monk
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
> Sent: 2017年10月9日 16:24
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 08/12] drm/amdgpu:explicitly call fence_process
>
> Am 30.09.2017 um 08:03 schrieb Monk Liu:
>> this way no need to wait timer triggered to save time
> In principle a good idea, but please remove
> amdgpu_fence_driver_force_completion_ring() and use
> amdgpu_fence_driver_force_completion() instead.
>
> Regards,
> Christian.
>
>> Change-Id: Ie96fd2fc1f6054ebc1e58c3d703471639371ee22
>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 7 ++++++-
>>    1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> index 333bad7..13785d8 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> @@ -543,8 +543,13 @@ void amdgpu_fence_driver_force_completion(struct amdgpu_device *adev)
>>    
>>    void amdgpu_fence_driver_force_completion_ring(struct amdgpu_ring *ring)
>>    {
>> -	if (ring)
>> +	if (ring) {
>>    		amdgpu_fence_write(ring, ring->fence_drv.sync_seq);
>> +		/* call fence process manually can get it done quickly
>> +		 * instead of waiting for the timer triggered
>> +		 */
>> +		amdgpu_fence_process(ring);
>> +	}
>>    }
>>    
>>    /*
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 08/12] drm/amdgpu:explicitly call fence_process
       [not found]                 ` <62bb9496-b29f-0230-8fa4-0bad470c12c8-5C7GfCeVMHo@public.gmane.org>
@ 2017-10-09  8:51                   ` Liu, Monk
       [not found]                     ` <BLUPR12MB0449E49C10230F350B9BD3B284740-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 49+ messages in thread
From: Liu, Monk @ 2017-10-09  8:51 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

I need the hw fence signaled before the following steps, and since I use a loop for that I cannot switch to force_completion() at all.

If you are picky about the duplication here, the good approach is removing the old force_completion() and using force_completion() in all gpu reset routines.

That's the clean way.




-----Original Message-----
From: Koenig, Christian 
Sent: 2017年10月9日 16:41
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 08/12] drm/amdgpu:explicitly call fence_process

We should avoid functionality duplication here.

Either change the caller of amdgpu_fence_driver_force_completion_ring()
to use amdgpu_fence_driver_force_completion() or use
amdgpu_fence_driver_force_completion_ring() in amdgpu_fence_driver_force_completion().

The later is probably easier to do.

Regards,
Christian.

Am 09.10.2017 um 10:32 schrieb Liu, Monk:
> Why do that ?
>
> In outside there is already a for loop to iterate over all rings so 
> force_completion_ring() is the right one to use
>
> BR Monk
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
> Sent: 2017年10月9日 16:24
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 08/12] drm/amdgpu:explicitly call fence_process
>
> Am 30.09.2017 um 08:03 schrieb Monk Liu:
>> this way no need to wait timer triggered to save time
> In principle a good idea, but please remove
> amdgpu_fence_driver_force_completion_ring() and use
> amdgpu_fence_driver_force_completion() instead.
>
> Regards,
> Christian.
>
>> Change-Id: Ie96fd2fc1f6054ebc1e58c3d703471639371ee22
>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 7 ++++++-
>>    1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> index 333bad7..13785d8 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> @@ -543,8 +543,13 @@ void amdgpu_fence_driver_force_completion(struct 
>> amdgpu_device *adev)
>>    
>>    void amdgpu_fence_driver_force_completion_ring(struct amdgpu_ring *ring)
>>    {
>> -	if (ring)
>> +	if (ring) {
>>    		amdgpu_fence_write(ring, ring->fence_drv.sync_seq);
>> +		/* call fence process manually can get it done quickly
>> +		 * instead of waiting for the timer triggered
>> +		 */
>> +		amdgpu_fence_process(ring);
>> +	}
>>    }
>>    
>>    /*
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 08/12] drm/amdgpu:explicitly call fence_process
       [not found]                     ` <BLUPR12MB0449E49C10230F350B9BD3B284740-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-10-09  8:52                       ` Liu, Monk
       [not found]                         ` <BLUPR12MB04495DD27084790E5B219D7384740-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 49+ messages in thread
From: Liu, Monk @ 2017-10-09  8:52 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Amends:

If you are picky about the duplication here, the good approach is removing the old force_completion() and using force_completion_ring() in all gpu reset routines.

-----Original Message-----
From: Liu, Monk 
Sent: 2017年10月9日 16:52
To: Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH 08/12] drm/amdgpu:explicitly call fence_process

I need the hw fence signaled before the following steps, and since I use a loop for that I cannot switch to force_completion() at all.

If you are picky about the duplication here, the good approach is removing the old force_completion() and using force_completion() in all gpu reset routines.

That's the clean way.




-----Original Message-----
From: Koenig, Christian 
Sent: 2017年10月9日 16:41
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 08/12] drm/amdgpu:explicitly call fence_process

We should avoid functionality duplication here.

Either change the caller of amdgpu_fence_driver_force_completion_ring()
to use amdgpu_fence_driver_force_completion() or use
amdgpu_fence_driver_force_completion_ring() in amdgpu_fence_driver_force_completion().

The later is probably easier to do.

Regards,
Christian.

Am 09.10.2017 um 10:32 schrieb Liu, Monk:
> Why do that ?
>
> In outside there is already a for loop to iterate over all rings so 
> force_completion_ring() is the right one to use
>
> BR Monk
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
> Sent: 2017年10月9日 16:24
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 08/12] drm/amdgpu:explicitly call fence_process
>
> Am 30.09.2017 um 08:03 schrieb Monk Liu:
>> this way no need to wait timer triggered to save time
> In principle a good idea, but please remove
> amdgpu_fence_driver_force_completion_ring() and use
> amdgpu_fence_driver_force_completion() instead.
>
> Regards,
> Christian.
>
>> Change-Id: Ie96fd2fc1f6054ebc1e58c3d703471639371ee22
>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 7 ++++++-
>>    1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> index 333bad7..13785d8 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> @@ -543,8 +543,13 @@ void amdgpu_fence_driver_force_completion(struct 
>> amdgpu_device *adev)
>>    
>>    void amdgpu_fence_driver_force_completion_ring(struct amdgpu_ring *ring)
>>    {
>> -	if (ring)
>> +	if (ring) {
>>    		amdgpu_fence_write(ring, ring->fence_drv.sync_seq);
>> +		/* call fence process manually can get it done quickly
>> +		 * instead of waiting for the timer triggered
>> +		 */
>> +		amdgpu_fence_process(ring);
>> +	}
>>    }
>>    
>>    /*
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted
       [not found]             ` <BLUPR12MB0449531313F50BE080F7746D84740-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-10-09  8:54               ` Christian König
  2017-10-09 11:01               ` Nicolai Hähnle
  1 sibling, 0 replies; 49+ messages in thread
From: Christian König @ 2017-10-09  8:54 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Haehnle,
	Nicolai, Olsak, Marek

I think the approach which is currently used by Vulkan actually doesn't 
sound correct to me and should be fixed from the very beginning.

The kernel driver should expose the hardware capabilities as cleanly as 
possible to userspace and *NOT* just try to fulfill any random userspace 
or customer requirements.

Otherwise we end up with only specialized code which fits just one 
requirement instead of a complete solution which works for everyone.

What we need at least is signaling that a problem occurred, blocking all 
command submission until userspace has reacted, canceling all submissions 
in flight, and being able to reset this state from the userspace side.
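
(To sketch what such a contract could look like from the userspace side: the
query below uses the existing AMDGPU_CTX_OP_QUERY_STATE op from amdgpu_drm.h,
while AMDGPU_CTX_OP_ACK_RESET is purely hypothetical, named here only to
illustrate the reset-the-state-from-userspace step:)

#include <xf86drm.h>
#include <amdgpu_drm.h>
#include <string.h>

#define AMDGPU_CTX_OP_ACK_RESET 5 /* hypothetical, not in amdgpu_drm.h */

extern void recreate_context_and_buffers(void); /* app-specific recovery */

static void check_and_ack_reset(int fd, uint32_t ctx_id)
{
	union drm_amdgpu_ctx args = { 0 };

	args.in.op = AMDGPU_CTX_OP_QUERY_STATE; /* existing ctx ioctl op */
	args.in.ctx_id = ctx_id;
	if (drmCommandWriteRead(fd, DRM_AMDGPU_CTX, &args, sizeof(args)))
		return;

	if (args.out.state.reset_status == AMDGPU_CTX_NO_RESET)
		return;

	recreate_context_and_buffers();

	/* hypothetical "ack" so the kernel unblocks command submission */
	memset(&args, 0, sizeof(args));
	args.in.op = AMDGPU_CTX_OP_ACK_RESET;
	args.in.ctx_id = ctx_id;
	drmCommandWriteRead(fd, DRM_AMDGPU_CTX, &args, sizeof(args));
}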

You implemented this by blocking everything and assuming that userspace 
would somehow magically react gracefully.

What needs to be done instead is to implement proper reset handling in 
Mesa and/or the DDX and while doing so we can talk about requirements 
for the kernel driver.

Adding Nicolai and Marek so we can come up with a plan for how to fix this 
in Mesa.

Until that is properly done I will block any attempt to push a half-baked 
solution upstream which only works for the closed source stack. I will 
also revert the existing changes because they seem to have caused a bunch 
of regressions in the GPU reset code (not that the code was good in the 
first place, but now it doesn't work any more at all).

Regards,
Christian.

Am 09.10.2017 um 10:35 schrieb Liu, Monk:
> Please be aware that this policy is what the strict mode defined and what customer want,
> And also please check VK spec, it defines that after GPU reset all vk INSTANCE should close/release its resource/device/ctx and all buffers, and call re-initvkinstance after gpu reset
>
> So this whole approach is what just aligned with the spec, and to not influence with current MESA/OGL client that's why I put the whole approach into the strict mode
> And by default strict mode is not selected
>
>
> BR Monk
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
> Sent: 2017年10月9日 16:26
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted
>
> Am 30.09.2017 um 08:03 schrieb Monk Liu:
>> for SRIOV strict mode gpu reset:
>>
>> In kms open we mark the latest adev->gpu_reset_counter in fpriv we
>> return -ENODEV in cs_ioctl or info_ioctl if they found
>> fpriv->gpu_reset_counter != adev->gpu_reset_counter.
>>
>> this way we prevent a potential bad process/FD from submitting cmds
>> and notify userspace with -ENODEV.
>>
>> userspace should close all BO/ctx and re-open dri FD to re-create
>> virtual memory system for this process
> The whole aproach is a NAK from my side.
>
> We need to enable userspace to continue, not force it into process termination to recover. Otherwise we could send a SIGTERM in the first place.
>
> Regards,
> Christian.
>
>> Change-Id: Ib4c179f28a3d0783837566f29de07fc14aa9b9a4
>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu.h     | 1 +
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 5 +++++
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 7 +++++++
>>    3 files changed, 13 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index de9c164..b40d4ba 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -772,6 +772,7 @@ struct amdgpu_fpriv {
>>    	struct idr		bo_list_handles;
>>    	struct amdgpu_ctx_mgr	ctx_mgr;
>>    	u32			vram_lost_counter;
>> +	int gpu_reset_counter;
>>    };
>>    
>>    /*
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> index 9467cf6..6a1515e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> @@ -1199,6 +1199,11 @@ int amdgpu_cs_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
>>    	if (amdgpu_kms_vram_lost(adev, fpriv))
>>    		return -ENODEV;
>>    
>> +	if (amdgpu_sriov_vf(adev) &&
>> +		amdgpu_sriov_reset_level == 1 &&
>> +		fpriv->gpu_reset_counter < atomic_read(&adev->gpu_reset_counter))
>> +		return -ENODEV;
>> +
>>    	parser.adev = adev;
>>    	parser.filp = filp;
>>    
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> index 282f45b..bd389cf 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> @@ -285,6 +285,11 @@ static int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file
>>    	if (amdgpu_kms_vram_lost(adev, fpriv))
>>    		return -ENODEV;
>>    
>> +	if (amdgpu_sriov_vf(adev) &&
>> +		amdgpu_sriov_reset_level == 1 &&
>> +		fpriv->gpu_reset_counter < atomic_read(&adev->gpu_reset_counter))
>> +		return -ENODEV;
>> +
>>    	switch (info->query) {
>>    	case AMDGPU_INFO_ACCEL_WORKING:
>>    		ui32 = adev->accel_working;
>> @@ -824,6 +829,8 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv)
>>    		goto out_suspend;
>>    	}
>>    
>> +	fpriv->gpu_reset_counter = atomic_read(&adev->gpu_reset_counter);
>> +
>>    	r = amdgpu_vm_init(adev, &fpriv->vm,
>>    			   AMDGPU_VM_CONTEXT_GFX, 0);
>>    	if (r) {
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 08/12] drm/amdgpu:explicitly call fence_process
       [not found]                         ` <BLUPR12MB04495DD27084790E5B219D7384740-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-10-09  8:58                           ` Christian König
  0 siblings, 0 replies; 49+ messages in thread
From: Christian König @ 2017-10-09  8:58 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Yeah, completely fine with me as well. We should just not duplicate that 
handling.

Taking a look at how amdgpu_fence_driver_force_completion() is used in 
amdgpu_fence_driver_fini() and amdgpu_fence_driver_suspend(), what you 
suggest actually looks like the sanest thing to me.
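
Concretely, the deduplication could look like this (a sketch of the
direction being agreed on here, not the final patch):

void amdgpu_fence_driver_force_completion_ring(struct amdgpu_ring *ring)
{
	if (ring) {
		amdgpu_fence_write(ring, ring->fence_drv.sync_seq);
		/* process manually instead of waiting for the timer */
		amdgpu_fence_process(ring);
	}
}

/* the device-wide variant becomes a plain loop over the per-ring helper */
void amdgpu_fence_driver_force_completion(struct amdgpu_device *adev)
{
	int i;

	for (i = 0; i < AMDGPU_MAX_RINGS; i++)
		amdgpu_fence_driver_force_completion_ring(adev->rings[i]);
}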

Regards,
Christian.

Am 09.10.2017 um 10:52 schrieb Liu, Monk:
> Amends:
>
> If you are picky on the duplication here, the good approach is removing the old force_completion(), and use force_completion_ring() in all gpu reset routines
>
> -----Original Message-----
> From: Liu, Monk
> Sent: 2017年10月9日 16:52
> To: Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: RE: [PATCH 08/12] drm/amdgpu:explicitly call fence_process
>
> I need the hw fence signaled before the following steps, and since I used a loop for that I cannot change to use force_completion() at all
>
> If you are picky on the duplication here, the good approach is removing the old force_completion(), and use force_completion() in all gpu reset routines
>
> That's clean way
>
>
>
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: 2017年10月9日 16:41
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 08/12] drm/amdgpu:explicitly call fence_process
>
> We should avoid functionality duplication here.
>
> Either change the caller of amdgpu_fence_driver_force_completion_ring()
> to use amdgpu_fence_driver_force_completion() or use
> amdgpu_fence_driver_force_completion_ring() in amdgpu_fence_driver_force_completion().
>
> The later is probably easier to do.
>
> Regards,
> Christian.
>
> Am 09.10.2017 um 10:32 schrieb Liu, Monk:
>> Why do that ?
>>
>> In outside there is already a for loop to iterate over all rings so
>> force_completion_ring() is the right one to use
>>
>> BR Monk
>>
>> -----Original Message-----
>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>> Sent: 2017年10月9日 16:24
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 08/12] drm/amdgpu:explicitly call fence_process
>>
>> Am 30.09.2017 um 08:03 schrieb Monk Liu:
>>> this way no need to wait timer triggered to save time
>> In principle a good idea, but please remove
>> amdgpu_fence_driver_force_completion_ring() and use
>> amdgpu_fence_driver_force_completion() instead.
>>
>> Regards,
>> Christian.
>>
>>> Change-Id: Ie96fd2fc1f6054ebc1e58c3d703471639371ee22
>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>> ---
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 7 ++++++-
>>>     1 file changed, 6 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> index 333bad7..13785d8 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> @@ -543,8 +543,13 @@ void amdgpu_fence_driver_force_completion(struct
>>> amdgpu_device *adev)
>>>     
>>>     void amdgpu_fence_driver_force_completion_ring(struct amdgpu_ring *ring)
>>>     {
>>> -	if (ring)
>>> +	if (ring) {
>>>     		amdgpu_fence_write(ring, ring->fence_drv.sync_seq);
>>> +		/* call fence process manually can get it done quickly
>>> +		 * instead of waiting for the timer triggered
>>> +		 */
>>> +		amdgpu_fence_process(ring);
>>> +	}
>>>     }
>>>     
>>>     /*


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 10/12] drm/amdgpu/sriov:implement guilty ctx for loose reset
       [not found]             ` <BLUPR12MB0449C8E878F09AE59BA816E284740-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-10-09  9:03               ` Christian König
       [not found]                 ` <d249cc75-29e3-713f-fc5a-2f26f555500b-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 49+ messages in thread
From: Christian König @ 2017-10-09  9:03 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Well I'm not saying that the app needs to repeatedly call 
amdgpu_ctx_query, but rather that we need a complete concept.

See, the upstream kernel driver is made for Mesa and not the closed 
source driver stack.

I can't 100% judge whether this approach would work with Mesa because we 
haven't implemented it there, but it strongly looks like that stuff 
won't work.

So I need a solution which works with Mesa and the open source 
components before we can push it upstream.

Regards,
Christian.

Am 09.10.2017 um 10:39 schrieb Liu, Monk:
> How APP/UMD aware that a context is guilty or triggered too much loops of hang ??
>
> Why APP/UMD voluntarily call amdgpu_ctx_query() to check whether gpu reset occurred or not ?
>
> Please be aware that for another CSP customer this "loose mode" is 100% welcome and wanted by they, and more important
>
> This approach won't cross X server at all, only the guilty process/context is rejected upon its submitting
>
>
> I don't agree that we should rely on ctx_query(), no one is responsible to call it from time to time
>
>
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
> Sent: 2017年10月9日 16:28
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 10/12] drm/amdgpu/sriov:implement guilty ctx for loose reset
>
> Am 30.09.2017 um 08:03 schrieb Monk Liu:
>> Change-Id: I7904f362aa0f578a5cbf5d40c7a242c2c6680a92
>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> NAK, if a context is guilty of a GPU reset should be determined in
> amdgpu_ctx_query() by looking at the fences in the ring buffer.
>
> Not when the GPU reset itself occurs.
>
> Regards,
> Christian.
>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu.h           |  1 +
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 16 +++++++++-------
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       |  1 +
>>    drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 22 ++++++++++++++++++++++
>>    drivers/gpu/drm/amd/scheduler/gpu_scheduler.h |  1 +
>>    5 files changed, 34 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index b40d4ba..b63e602 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -737,6 +737,7 @@ struct amdgpu_ctx {
>>    	struct dma_fence	**fences;
>>    	struct amdgpu_ctx_ring	rings[AMDGPU_MAX_RINGS];
>>    	bool preamble_presented;
>> +	bool guilty;
>>    };
>>    
>>    struct amdgpu_ctx_mgr {
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> index 6a1515e..f92962e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> @@ -79,16 +79,19 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data)
>>    	if (cs->in.num_chunks == 0)
>>    		return 0;
>>    
>> +	p->ctx = amdgpu_ctx_get(fpriv, cs->in.ctx_id);
>> +	if (!p->ctx)
>> +		return -EINVAL;
>> +
>> +	if (amdgpu_sriov_vf(p->adev) &&
>> +		amdgpu_sriov_reset_level == 0 &&
>> +		p->ctx->guilty)
>> +		return -ENODEV;
>> +
>>    	chunk_array = kmalloc_array(cs->in.num_chunks, sizeof(uint64_t), GFP_KERNEL);
>>    	if (!chunk_array)
>>    		return -ENOMEM;
>>    
>> -	p->ctx = amdgpu_ctx_get(fpriv, cs->in.ctx_id);
>> -	if (!p->ctx) {
>> -		ret = -EINVAL;
>> -		goto free_chunk;
>> -	}
>> -
>>    	/* get chunks */
>>    	chunk_array_user = u64_to_user_ptr(cs->in.chunks);
>>    	if (copy_from_user(chunk_array, chunk_array_user, @@ -184,7 +187,6
>> @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data)
>>    	p->nchunks = 0;
>>    put_ctx:
>>    	amdgpu_ctx_put(p->ctx);
>> -free_chunk:
>>    	kfree(chunk_array);
>>    
>>    	return ret;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> index 75c933b..028e9f1 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> @@ -60,6 +60,7 @@ static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>>    					  rq, amdgpu_sched_jobs);
>>    		if (r)
>>    			goto failed;
>> +		ctx->rings[i].entity.guilty = &ctx->guilty;
>>    	}
>>    
>>    	r = amdgpu_queue_mgr_init(adev, &ctx->queue_mgr); diff --git
>> a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>> index 12c3092..89b0573 100644
>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>> @@ -493,10 +493,32 @@ void amd_sched_set_queue_hang(struct amd_gpu_scheduler *sched)
>>    void amd_sched_job_kickout(struct amd_sched_job *s_job)
>>    {
>>    	struct amd_gpu_scheduler *sched = s_job->sched;
>> +	struct amd_sched_entity *entity, *tmp;
>> +	struct amd_sched_rq *rq;
>> +	int i;
>> +	bool found;
>>    
>>    	spin_lock(&sched->job_list_lock);
>>    	list_del_init(&s_job->node);
>>    	spin_unlock(&sched->job_list_lock);
>> +
>> +	dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
>> +
>> +	for (i = AMD_SCHED_PRIORITY_MIN; i < AMD_SCHED_PRIORITY_KERNEL; i++) {
>> +		rq = &sched->sched_rq[i];
>> +
>> +		spin_lock(&rq->lock);
>> +		list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
>> +			if (s_job->s_entity == entity && entity->guilty) {
>> +				*entity->guilty = true;
>> +				found = true;
>> +				break;
>> +			}
>> +		}
>> +		spin_unlock(&rq->lock);
>> +		if (found)
>> +			break;
>> +	}
>>    }
>>    
>>    void amd_sched_job_recovery(struct amd_gpu_scheduler *sched) diff
>> --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>> index f0242aa..16c2244 100644
>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>> @@ -49,6 +49,7 @@ struct amd_sched_entity {
>>    
>>    	struct dma_fence		*dependency;
>>    	struct dma_fence_cb		cb;
>> +	bool *guilty; /* this points to ctx's guilty */
>>    };
>>    
>>    /**
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 10/12] drm/amdgpu/sriov:implement guilty ctx for loose reset
       [not found]                 ` <d249cc75-29e3-713f-fc5a-2f26f555500b-5C7GfCeVMHo@public.gmane.org>
@ 2017-10-09  9:14                   ` Liu, Monk
       [not found]                     ` <BLUPR12MB04498EE183C86C2B93DDA85484740-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 49+ messages in thread
From: Liu, Monk @ 2017-10-09  9:14 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

I can assure you this loose mode is a 100% fit with current MESA.

Can you illustrate which points break MESA?

You can see that the whole logic is only interested in the guilty ctx, and only the guilty ctx would receive the -ENODEV error.

All innocent, regularly running MESA clients like the X server and compositor aren't even aware of a GPU reset at all; they just keep running.


BR  Monk

-----Original Message-----
From: Koenig, Christian 
Sent: 2017年10月9日 17:04
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 10/12] drm/amdgpu/sriov:implement guilty ctx for loose reset

Well I'm not saying that the app needs to repeatedly call amdgpu_ctx_query, but rather that we need a complete concept.

See, the upstream kernel driver is made for Mesa and not the closed source driver stack.

I can't 100% judge if this approach wouldn't work with Mesa because we haven't implemented it there, but it strongly looks like that stuff won't work.

So I need a solution which works with Mesa and the open source components before we can push it upstream.

Regards,
Christian.

Am 09.10.2017 um 10:39 schrieb Liu, Monk:
> How APP/UMD aware that a context is guilty or triggered too much loops of hang ??
>
> Why APP/UMD voluntarily call amdgpu_ctx_query() to check whether gpu reset occurred or not ?
>
> Please be aware that for another CSP customer this "loose mode" is 
> 100% welcome and wanted by they, and more important
>
> This approach won't cross X server at all, only the guilty 
> process/context is rejected upon its submitting
>
>
> I don't agree that we should rely on ctx_query(), no one is 
> responsible to call it from time to time
>
>
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
> Sent: 2017年10月9日 16:28
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 10/12] drm/amdgpu/sriov:implement guilty ctx for 
> loose reset
>
> Am 30.09.2017 um 08:03 schrieb Monk Liu:
>> Change-Id: I7904f362aa0f578a5cbf5d40c7a242c2c6680a92
>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> NAK, if a context is guilty of a GPU reset should be determined in
> amdgpu_ctx_query() by looking at the fences in the ring buffer.
>
> Not when the GPU reset itself occurs.
>
> Regards,
> Christian.
>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu.h           |  1 +
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 16 +++++++++-------
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       |  1 +
>>    drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 22 ++++++++++++++++++++++
>>    drivers/gpu/drm/amd/scheduler/gpu_scheduler.h |  1 +
>>    5 files changed, 34 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index b40d4ba..b63e602 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -737,6 +737,7 @@ struct amdgpu_ctx {
>>    	struct dma_fence	**fences;
>>    	struct amdgpu_ctx_ring	rings[AMDGPU_MAX_RINGS];
>>    	bool preamble_presented;
>> +	bool guilty;
>>    };
>>    
>>    struct amdgpu_ctx_mgr {
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> index 6a1515e..f92962e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> @@ -79,16 +79,19 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data)
>>    	if (cs->in.num_chunks == 0)
>>    		return 0;
>>    
>> +	p->ctx = amdgpu_ctx_get(fpriv, cs->in.ctx_id);
>> +	if (!p->ctx)
>> +		return -EINVAL;
>> +
>> +	if (amdgpu_sriov_vf(p->adev) &&
>> +		amdgpu_sriov_reset_level == 0 &&
>> +		p->ctx->guilty)
>> +		return -ENODEV;
>> +
>>    	chunk_array = kmalloc_array(cs->in.num_chunks, sizeof(uint64_t), GFP_KERNEL);
>>    	if (!chunk_array)
>>    		return -ENOMEM;
>>    
>> -	p->ctx = amdgpu_ctx_get(fpriv, cs->in.ctx_id);
>> -	if (!p->ctx) {
>> -		ret = -EINVAL;
>> -		goto free_chunk;
>> -	}
>> -
>>    	/* get chunks */
>>    	chunk_array_user = u64_to_user_ptr(cs->in.chunks);
>>    	if (copy_from_user(chunk_array, chunk_array_user, @@ -184,7 
>> +187,6 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data)
>>    	p->nchunks = 0;
>>    put_ctx:
>>    	amdgpu_ctx_put(p->ctx);
>> -free_chunk:
>>    	kfree(chunk_array);
>>    
>>    	return ret;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> index 75c933b..028e9f1 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> @@ -60,6 +60,7 @@ static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>>    					  rq, amdgpu_sched_jobs);
>>    		if (r)
>>    			goto failed;
>> +		ctx->rings[i].entity.guilty = &ctx->guilty;
>>    	}
>>    
>>    	r = amdgpu_queue_mgr_init(adev, &ctx->queue_mgr); diff --git 
>> a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>> index 12c3092..89b0573 100644
>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>> @@ -493,10 +493,32 @@ void amd_sched_set_queue_hang(struct amd_gpu_scheduler *sched)
>>    void amd_sched_job_kickout(struct amd_sched_job *s_job)
>>    {
>>    	struct amd_gpu_scheduler *sched = s_job->sched;
>> +	struct amd_sched_entity *entity, *tmp;
>> +	struct amd_sched_rq *rq;
>> +	int i;
>> +	bool found;
>>    
>>    	spin_lock(&sched->job_list_lock);
>>    	list_del_init(&s_job->node);
>>    	spin_unlock(&sched->job_list_lock);
>> +
>> +	dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
>> +
>> +	for (i = AMD_SCHED_PRIORITY_MIN; i < AMD_SCHED_PRIORITY_KERNEL; i++) {
>> +		rq = &sched->sched_rq[i];
>> +
>> +		spin_lock(&rq->lock);
>> +		list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
>> +			if (s_job->s_entity == entity && entity->guilty) {
>> +				*entity->guilty = true;
>> +				found = true;
>> +				break;
>> +			}
>> +		}
>> +		spin_unlock(&rq->lock);
>> +		if (found)
>> +			break;
>> +	}
>>    }
>>    
>>    void amd_sched_job_recovery(struct amd_gpu_scheduler *sched) diff 
>> --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>> index f0242aa..16c2244 100644
>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>> @@ -49,6 +49,7 @@ struct amd_sched_entity {
>>    
>>    	struct dma_fence		*dependency;
>>    	struct dma_fence_cb		cb;
>> +	bool *guilty; /* this points to ctx's guilty */
>>    };
>>    
>>    /**
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 10/12] drm/amdgpu/sriov:implement guilty ctx for loose reset
       [not found]                     ` <BLUPR12MB04498EE183C86C2B93DDA85484740-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-10-09  9:24                       ` Christian König
  0 siblings, 0 replies; 49+ messages in thread
From: Christian König @ 2017-10-09  9:24 UTC (permalink / raw)
  To: Liu, Monk, Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

> Can you illustrate which points break MESA?
It doesn't break Mesa, but there is no handling in Mesa for -ENODEV.

So I will block adding any kernel functionality which returns -ENODEV 
before Mesa gets proper handling for this.

We need to implement the feature in Mesa first; it is our primary user 
space client. Without handling there we can't submit anything upstream.
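
For reference, libdrm already wraps the query side of this: 
amdgpu_cs_query_reset_state() returns the AMDGPU_CTX_*_RESET status for a 
context. A minimal sketch of how a Mesa-style client could consume it (the 
retry policy is an illustration, not Mesa's actual code):

#include <amdgpu.h>
#include <amdgpu_drm.h>
#include <errno.h>

static int submit_with_guilt_check(amdgpu_device_handle dev,
				   amdgpu_context_handle *ctx,
				   struct amdgpu_cs_request *req)
{
	int r = amdgpu_cs_submit(*ctx, 0, req, 1);

	if (r == -ENODEV) {
		uint32_t state = 0, hangs = 0;

		/* ask the kernel whether this context was blamed */
		if (amdgpu_cs_query_reset_state(*ctx, &state, &hangs) == 0 &&
		    state == AMDGPU_CTX_GUILTY_RESET) {
			/* guilty: throw the context away and start fresh */
			amdgpu_cs_ctx_free(*ctx);
			r = amdgpu_cs_ctx_create(dev, ctx);
		}
	}
	return r;
}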

Regards,
Christian.

Am 09.10.2017 um 11:14 schrieb Liu, Monk:
> I can assure you this loose mode is 100% fit with current MESA,
>
> Can you illustrate which points breaks MESA ?
>
> You can see the whole logic only interested in the guilty ctx, and only the guilty ctx would receive the -ENODEV error
>
> All innocent/regular running MESA client like X server and compositor eve didn't aware of a gpu reset at all, they just keep running
>
>
> BR  Monk
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: 2017年10月9日 17:04
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 10/12] drm/amdgpu/sriov:implement guilty ctx for loose reset
>
> Well I'm not saying that the app needs to repeatedly call amdgpu_ctx_query, but rather that we need a complete concept.
>
> See, the upstream kernel driver is made for Mesa and not the closed source driver stack.
>
> I can't 100% judge if this approach wouldn't work with Mesa because we haven't implemented it there, but it strongly looks like that stuff won't work.
>
> So I need a solution which works with Mesa and the open source components before we can push it upstream.
>
> Regards,
> Christian.
>
> Am 09.10.2017 um 10:39 schrieb Liu, Monk:
>> How APP/UMD aware that a context is guilty or triggered too much loops of hang ??
>>
>> Why APP/UMD voluntarily call amdgpu_ctx_query() to check whether gpu reset occurred or not ?
>>
>> Please be aware that for another CSP customer this "loose mode" is
>> 100% welcome and wanted by they, and more important
>>
>> This approach won't cross X server at all, only the guilty
>> process/context is rejected upon its submitting
>>
>>
>> I don't agree that we should rely on ctx_query(), no one is
>> responsible to call it from time to time
>>
>>
>>
>> -----Original Message-----
>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>> Sent: 2017年10月9日 16:28
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 10/12] drm/amdgpu/sriov:implement guilty ctx for
>> loose reset
>>
>> Am 30.09.2017 um 08:03 schrieb Monk Liu:
>>> Change-Id: I7904f362aa0f578a5cbf5d40c7a242c2c6680a92
>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>> NAK, if a context is guilty of a GPU reset should be determined in
>> amdgpu_ctx_query() by looking at the fences in the ring buffer.
>>
>> Not when the GPU reset itself occurs.
>>
>> Regards,
>> Christian.
>>
>>> ---
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu.h           |  1 +
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 16 +++++++++-------
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       |  1 +
>>>     drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 22 ++++++++++++++++++++++
>>>     drivers/gpu/drm/amd/scheduler/gpu_scheduler.h |  1 +
>>>     5 files changed, 34 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> index b40d4ba..b63e602 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> @@ -737,6 +737,7 @@ struct amdgpu_ctx {
>>>     	struct dma_fence	**fences;
>>>     	struct amdgpu_ctx_ring	rings[AMDGPU_MAX_RINGS];
>>>     	bool preamble_presented;
>>> +	bool guilty;
>>>     };
>>>     
>>>     struct amdgpu_ctx_mgr {
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>> index 6a1515e..f92962e 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>> @@ -79,16 +79,19 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data)
>>>     	if (cs->in.num_chunks == 0)
>>>     		return 0;
>>>     
>>> +	p->ctx = amdgpu_ctx_get(fpriv, cs->in.ctx_id);
>>> +	if (!p->ctx)
>>> +		return -EINVAL;
>>> +
>>> +	if (amdgpu_sriov_vf(p->adev) &&
>>> +		amdgpu_sriov_reset_level == 0 &&
>>> +		p->ctx->guilty)
>>> +		return -ENODEV;
>>> +
>>>     	chunk_array = kmalloc_array(cs->in.num_chunks, sizeof(uint64_t), GFP_KERNEL);
>>>     	if (!chunk_array)
>>>     		return -ENOMEM;
>>>     
>>> -	p->ctx = amdgpu_ctx_get(fpriv, cs->in.ctx_id);
>>> -	if (!p->ctx) {
>>> -		ret = -EINVAL;
>>> -		goto free_chunk;
>>> -	}
>>> -
>>>     	/* get chunks */
>>>     	chunk_array_user = u64_to_user_ptr(cs->in.chunks);
>>>     	if (copy_from_user(chunk_array, chunk_array_user,
>>> @@ -184,7 +187,6 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data)
>>>     	p->nchunks = 0;
>>>     put_ctx:
>>>     	amdgpu_ctx_put(p->ctx);
>>> -free_chunk:
>>>     	kfree(chunk_array);
>>>     
>>>     	return ret;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> index 75c933b..028e9f1 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> @@ -60,6 +60,7 @@ static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>>>     					  rq, amdgpu_sched_jobs);
>>>     		if (r)
>>>     			goto failed;
>>> +		ctx->rings[i].entity.guilty = &ctx->guilty;
>>>     	}
>>>     
>>>     	r = amdgpu_queue_mgr_init(adev, &ctx->queue_mgr);
>>>
>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>> index 12c3092..89b0573 100644
>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>> @@ -493,10 +493,32 @@ void amd_sched_set_queue_hang(struct amd_gpu_scheduler *sched)
>>>     void amd_sched_job_kickout(struct amd_sched_job *s_job)
>>>     {
>>>     	struct amd_gpu_scheduler *sched = s_job->sched;
>>> +	struct amd_sched_entity *entity, *tmp;
>>> +	struct amd_sched_rq *rq;
>>> +	int i;
>>> +	bool found = false;
>>>     
>>>     	spin_lock(&sched->job_list_lock);
>>>     	list_del_init(&s_job->node);
>>>     	spin_unlock(&sched->job_list_lock);
>>> +
>>> +	dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
>>> +
>>> +	for (i = AMD_SCHED_PRIORITY_MIN; i < AMD_SCHED_PRIORITY_KERNEL; i++) {
>>> +		rq = &sched->sched_rq[i];
>>> +
>>> +		spin_lock(&rq->lock);
>>> +		list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
>>> +			if (s_job->s_entity == entity && entity->guilty) {
>>> +				*entity->guilty = true;
>>> +				found = true;
>>> +				break;
>>> +			}
>>> +		}
>>> +		spin_unlock(&rq->lock);
>>> +		if (found)
>>> +			break;
>>> +	}
>>>     }
>>>     
>>>     void amd_sched_job_recovery(struct amd_gpu_scheduler *sched)
>>>
>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>> index f0242aa..16c2244 100644
>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>> @@ -49,6 +49,7 @@ struct amd_sched_entity {
>>>     
>>>     	struct dma_fence		*dependency;
>>>     	struct dma_fence_cb		cb;
>>> +	bool *guilty; /* this points to ctx's guilty */
>>>     };
>>>     
>>>     /**
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 07/12] drm/amdgpu/sriov:implement strict gpu reset
       [not found]     ` <1506751432-21789-8-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2017-10-09  8:20       ` Christian König
@ 2017-10-09 10:58       ` Nicolai Hähnle
  1 sibling, 0 replies; 49+ messages in thread
From: Nicolai Hähnle @ 2017-10-09 10:58 UTC (permalink / raw)
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 30.09.2017 08:03, Monk Liu wrote:
> changes:
> 1) implement strict mode sriov gpu reset
> 2) always call sriov_gpu_reset_strict if the hypervisor notifies us of an FLR
> 3) in strict reset mode, set an error on all fences
> 4) change the fence_wait/cs_wait functions to return -ENODEV if the fence
> signaled with error == -ETIME
> 
> Since after a strict gpu reset we consider the VRAM lost,
> and since, assuming VRAM is lost, there is little point in recovering
> the shadow BOs, because all textures/resources/shaders cannot be
> recovered (if they reside in VRAM)
> 
> Change-Id: I50d9b8b5185ba92f137f07c9deeac19d740d753b
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
[snip]
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 9efbb33..122e2e1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2734,6 +2734,96 @@ static int amdgpu_recover_vram_from_shadow(struct amdgpu_device *adev,
>   }
>   
>   /**
> + * amdgpu_sriov_gpu_reset_strict - reset the asic under strict mode
> + *
> + * @adev: amdgpu device pointer
> + * @job: which job trigger hang
> + *
> + * Attempt to reset the GPU if it has hung (all asics),
> + * for the SRIOV case.
> + * Returns 0 for success or an error on failure.
> + *
> + * this function will deny all processes/fences created before this reset,
> + * and drop all jobs left unfinished by this reset.
> + *
> + * The application should take responsibility for re-opening the FD to
> + * re-create the VM page table and recover all resources as well

Total NAK to this. It is *completely* infeasible from the UMD side, 
because multiple drivers can simultaneously use the same FD.

The KMD should just drop all previously submitted jobs and let the UMD 
worry about whether it wants to re-use buffer objects or not.

The VM page table can then be rebuilt transparently based on whatever BO 
lists are used as new submissions are made after the reset.
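
Very roughly, something like this in the submission path would be enough
(a sketch only, with made-up names like vm_reset_counter and
rebuild_mapping; this is not the actual amdgpu code):

	static int revalidate_after_reset(struct amdgpu_device *adev,
					  struct amdgpu_fpriv *fpriv,
					  struct amdgpu_cs_parser *p)
	{
		struct amdgpu_bo_list_entry *e;
		int r;

		if (fpriv->vm_reset_counter ==
		    atomic_read(&adev->gpu_reset_counter))
			return 0;	/* page table is still valid */

		/* The BO list of this submission already tells us
		 * everything we need to re-create the PTEs. */
		list_for_each_entry(e, &p->validated, tv.head) {
			r = rebuild_mapping(adev, &fpriv->vm, e);	/* hypothetical */
			if (r)
				return r;
		}

		fpriv->vm_reset_counter = atomic_read(&adev->gpu_reset_counter);
		return 0;
	}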

Cheers,
Nicolai
-- 
Learn what the world is really like,
but never forget what it should be.
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted
       [not found]             ` <BLUPR12MB0449531313F50BE080F7746D84740-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  2017-10-09  8:54               ` Christian König
@ 2017-10-09 11:01               ` Nicolai Hähnle
       [not found]                 ` <71b411c8-21a6-fe9b-ed33-7928571a88da-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  1 sibling, 1 reply; 49+ messages in thread
From: Nicolai Hähnle @ 2017-10-09 11:01 UTC (permalink / raw)
  To: Liu, Monk, Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 09.10.2017 10:35, Liu, Monk wrote:
> Please be aware that this policy is what the strict mode defines and what the customer wants.
> And also please check the VK spec: it defines that after a GPU reset all vk instances should close/release their resources/device/ctx and all buffers, and re-init the vk instance after the gpu reset.

Sorry, but you simply cannot build a correct user-space implementation
of those specs on top of this.

It will break as soon as you have both OpenGL and Vulkan running in the 
same process (or heck, our Vulkan and radv :)), because both drivers 
will use the same fd.

Cheers,
Nicolai



> So this whole approach is simply aligned with the spec, and to not interfere with current MESA/OGL clients I put the whole approach into the strict mode.
> And by default strict mode is not selected.
> 
> 
> BR Monk
> 
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
> Sent: 2017年10月9日 16:26
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted
> 
> Am 30.09.2017 um 08:03 schrieb Monk Liu:
>> for SRIOV strict mode gpu reset:
>>
>> In kms open we mark the latest adev->gpu_reset_counter in fpriv; we
>> return -ENODEV in cs_ioctl or info_ioctl if they find
>> fpriv->gpu_reset_counter != adev->gpu_reset_counter.
>>
>> this way we prevent a potentially bad process/FD from submitting cmds
>> and notify userspace with -ENODEV.
>>
>> userspace should close all BOs/ctxs and re-open the dri FD to re-create
>> the virtual memory system for this process
> 
> The whole approach is a NAK from my side.
> 
> We need to enable userspace to continue, not force it into process termination to recover. Otherwise we could send a SIGTERM in the first place.
> 
> Regards,
> Christian.
> 
>>
>> Change-Id: Ib4c179f28a3d0783837566f29de07fc14aa9b9a4
>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu.h     | 1 +
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 5 +++++
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 7 +++++++
>>    3 files changed, 13 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index de9c164..b40d4ba 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -772,6 +772,7 @@ struct amdgpu_fpriv {
>>    	struct idr		bo_list_handles;
>>    	struct amdgpu_ctx_mgr	ctx_mgr;
>>    	u32			vram_lost_counter;
>> +	int gpu_reset_counter;
>>    };
>>    
>>    /*
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> index 9467cf6..6a1515e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> @@ -1199,6 +1199,11 @@ int amdgpu_cs_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
>>    	if (amdgpu_kms_vram_lost(adev, fpriv))
>>    		return -ENODEV;
>>    
>> +	if (amdgpu_sriov_vf(adev) &&
>> +		amdgpu_sriov_reset_level == 1 &&
>> +		fpriv->gpu_reset_counter < atomic_read(&adev->gpu_reset_counter))
>> +		return -ENODEV;
>> +
>>    	parser.adev = adev;
>>    	parser.filp = filp;
>>    
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> index 282f45b..bd389cf 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> @@ -285,6 +285,11 @@ static int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file
>>    	if (amdgpu_kms_vram_lost(adev, fpriv))
>>    		return -ENODEV;
>>    
>> +	if (amdgpu_sriov_vf(adev) &&
>> +		amdgpu_sriov_reset_level == 1 &&
>> +		fpriv->gpu_reset_counter < atomic_read(&adev->gpu_reset_counter))
>> +		return -ENODEV;
>> +
>>    	switch (info->query) {
>>    	case AMDGPU_INFO_ACCEL_WORKING:
>>    		ui32 = adev->accel_working;
>> @@ -824,6 +829,8 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv)
>>    		goto out_suspend;
>>    	}
>>    
>> +	fpriv->gpu_reset_counter = atomic_read(&adev->gpu_reset_counter);
>> +
>>    	r = amdgpu_vm_init(adev, &fpriv->vm,
>>    			   AMDGPU_VM_CONTEXT_GFX, 0);
>>    	if (r) {
> 
> 
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> 


-- 
Lerne, wie die Welt wirklich ist,
Aber vergiss niemals, wie sie sein sollte.
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted
       [not found]                 ` <71b411c8-21a6-fe9b-ed33-7928571a88da-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-10-10  4:26                   ` Liu, Monk
       [not found]                     ` <BLUPR12MB04492B28DF57EACE2149562D84750-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 49+ messages in thread
From: Liu, Monk @ 2017-10-10  4:26 UTC (permalink / raw)
  To: Nicolai Hähnle, Koenig, Christian,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

After VRAM lost happens, all clients, no matter radv/mesa/ogl, are useless.

Any driver using this FD should be denied by the KMD after VRAM lost, and the UMD can destroy/close this FD, re-open it and rebuild all resources.

That's the only option for the VRAM lost case.



-----Original Message-----
From: Nicolai Hähnle [mailto:nhaehnle@gmail.com] 
Sent: October 9, 2017 19:01
To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted

On 09.10.2017 10:35, Liu, Monk wrote:
> Please be aware that this policy is what the strict mode defined and 
> what customer want, And also please check VK spec, it defines that 
> after GPU reset all vk INSTANCE should close/release its 
> resource/device/ctx and all buffers, and call re-initvkinstance after 
> gpu reset

Sorry, but you simply cannot implement a correct user-space implementation of those specs on top of this.

It will break as soon as you have both OpenGL and Vulkan running in the same process (or heck, our Vulkan and radv :)), because both drivers will use the same fd.

Cheers,
Nicolai



> So this whole approach is what just aligned with the spec, and to not 
> influence with current MESA/OGL client that's why I put the whole 
> approach into the strict mode And by default strict mode is not 
> selected
> 
> 
> BR Monk
> 
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
> Sent: October 9, 2017 16:26
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu 
> reseted
> 
> On 30.09.2017 at 08:03, Monk Liu wrote:
>> for SRIOV strict mode gpu reset:
>>
>> In kms open we mark the latest adev->gpu_reset_counter in fpriv we 
>> return -ENODEV in cs_ioctl or info_ioctl if they found
>> fpriv->gpu_reset_counter != adev->gpu_reset_counter.
>>
>> this way we prevent a potential bad process/FD from submitting cmds 
>> and notify userspace with -ENODEV.
>>
>> userspace should close all BO/ctx and re-open dri FD to re-create 
>> virtual memory system for this process
> 
> The whole approach is a NAK from my side.
> 
> We need to enable userspace to continue, not force it into process termination to recover. Otherwise we could send a SIGTERM in the first place.
> 
> Regards,
> Christian.
> 
>>
>> Change-Id: Ib4c179f28a3d0783837566f29de07fc14aa9b9a4
>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>> [snip]
> 
> 
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> 


--
Learn what the world is really like,
but never forget what it should be.
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted
       [not found]                     ` <BLUPR12MB04492B28DF57EACE2149562D84750-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-10-10  6:58                       ` Christian König
       [not found]                         ` <85c67ae9-bfe2-390a-79d0-6e5872b9be62-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 49+ messages in thread
From: Christian König @ 2017-10-10  6:58 UTC (permalink / raw)
  To: Liu, Monk, Nicolai Hähnle,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Daenzer, Michel

As Nicolai explained, that approach simply won't work.

The fd is used by more than just the closed source Vulkan driver and I 
think even by some components not developed by AMD (common X code? 
Michel, please comment as well).

So closing it and reopening it to handle a GPU reset is simply not an 
option.

Regards,
Christian.

On 10.10.2017 at 06:26, Liu, Monk wrote:
> After VRAM lost happens, all clients no matter radv/mesa/ogl is useless,
>
> Any drivers uses this FD should be denied by KMD after VRAM lost, and UMD can destroy/close this FD and re-open it and rebuild all resources
>
> That's the only option for VRAM lost case
>
>
>
> -----Original Message-----
> From: Nicolai Hähnle [mailto:nhaehnle@gmail.com]
> Sent: October 9, 2017 19:01
> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted
>
> On 09.10.2017 10:35, Liu, Monk wrote:
>> Please be aware that this policy is what the strict mode defined and
>> what customer want, And also please check VK spec, it defines that
>> after GPU reset all vk INSTANCE should close/release its
>> resource/device/ctx and all buffers, and call re-initvkinstance after
>> gpu reset
> Sorry, but you simply cannot implement a correct user-space implementation of those specs on top of this.
>
> It will break as soon as you have both OpenGL and Vulkan running in the same process (or heck, our Vulkan and radv :)), because both drivers will use the same fd.
>
> Cheers,
> Nicolai
>
>
>
>> So this whole approach is what just aligned with the spec, and to not
>> influence with current MESA/OGL client that's why I put the whole
>> approach into the strict mode And by default strict mode is not
>> selected
>>
>>
>> BR Monk
>>
>> -----Original Message-----
>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>> Sent: October 9, 2017 16:26
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu
>> reseted
>>
>> On 30.09.2017 at 08:03, Monk Liu wrote:
>>> for SRIOV strict mode gpu reset:
>>>
>>> In kms open we mark the latest adev->gpu_reset_counter in fpriv we
>>> return -ENODEV in cs_ioctl or info_ioctl if they found
>>> fpriv->gpu_reset_counter != adev->gpu_reset_counter.
>>>
>>> this way we prevent a potential bad process/FD from submitting cmds
>>> and notify userspace with -ENODEV.
>>>
>>> userspace should close all BO/ctx and re-open dri FD to re-create
>>> virtual memory system for this process
>> The whole approach is a NAK from my side.
>>
>> We need to enable userspace to continue, not force it into process termination to recover. Otherwise we could send a SIGTERM in the first place.
>>
>> Regards,
>> Christian.
>>
>>> Change-Id: Ib4c179f28a3d0783837566f29de07fc14aa9b9a4
>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>> [snip]
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
>
> --
> Learn what the world is really like,
> but never forget what it should be.


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted
       [not found]                         ` <85c67ae9-bfe2-390a-79d0-6e5872b9be62-5C7GfCeVMHo@public.gmane.org>
@ 2017-10-10  7:12                           ` Liu, Monk
       [not found]                             ` <BLUPR12MB04497B5442F66C969861742584750-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  2017-10-10  7:19                           ` Liu, Monk
  2017-10-10  7:47                           ` Michel Dänzer
  2 siblings, 1 reply; 49+ messages in thread
From: Liu, Monk @ 2017-10-10  7:12 UTC (permalink / raw)
  To: Koenig, Christian, Nicolai Hähnle,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Daenzer, Michel

Then the question is how we treat recovery if VRAM is lost?

-----Original Message-----
From: Koenig, Christian 
Sent: October 10, 2017 14:59
To: Liu, Monk <Monk.Liu@amd.com>; Nicolai Hähnle <nhaehnle@gmail.com>; amd-gfx@lists.freedesktop.org; Daenzer, Michel <Michel.Daenzer@amd.com>
Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted

As Nicolai explained that approach simply won't work.

The fd is used by more than just the closed source Vulkan driver and I think even by some components not developed by AMD (common X code? 
Michel please comment as well).

So closing it and reopening it to handle a GPU reset is simply not an option.

Regards,
Christian.

On 10.10.2017 at 06:26, Liu, Monk wrote:
> After VRAM lost happens, all clients no matter radv/mesa/ogl is 
> useless,
>
> Any drivers uses this FD should be denied by KMD after VRAM lost, and 
> UMD can destroy/close this FD and re-open it and rebuild all resources
>
> That's the only option for VRAM lost case
>
>
>
> -----Original Message-----
> From: Nicolai Hähnle [mailto:nhaehnle@gmail.com]
> Sent: October 9, 2017 19:01
> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian 
> <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu 
> reseted
>
> On 09.10.2017 10:35, Liu, Monk wrote:
>> Please be aware that this policy is what the strict mode defined and 
>> what customer want, And also please check VK spec, it defines that 
>> after GPU reset all vk INSTANCE should close/release its 
>> resource/device/ctx and all buffers, and call re-initvkinstance after 
>> gpu reset
> Sorry, but you simply cannot implement a correct user-space implementation of those specs on top of this.
>
> It will break as soon as you have both OpenGL and Vulkan running in the same process (or heck, our Vulkan and radv :)), because both drivers will use the same fd.
>
> Cheers,
> Nicolai
>
>
>
>> So this whole approach is what just aligned with the spec, and to not 
>> influence with current MESA/OGL client that's why I put the whole 
>> approach into the strict mode And by default strict mode is not 
>> selected
>>
>>
>> BR Monk
>>
>> -----Original Message-----
>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>> Sent: October 9, 2017 16:26
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu 
>> reseted
>>
>> On 30.09.2017 at 08:03, Monk Liu wrote:
>>> for SRIOV strict mode gpu reset:
>>>
>>> In kms open we mark the latest adev->gpu_reset_counter in fpriv we 
>>> return -ENODEV in cs_ioctl or info_ioctl if they found
>>> fpriv->gpu_reset_counter != adev->gpu_reset_counter.
>>>
>>> this way we prevent a potential bad process/FD from submitting cmds 
>>> and notify userspace with -ENODEV.
>>>
>>> userspace should close all BO/ctx and re-open dri FD to re-create 
>>> virtual memory system for this process
>> The whole approach is a NAK from my side.
>>
>> We need to enable userspace to continue, not force it into process termination to recover. Otherwise we could send a SIGTERM in the first place.
>>
>> Regards,
>> Christian.
>>
>>> Change-Id: Ib4c179f28a3d0783837566f29de07fc14aa9b9a4
>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>> [snip]
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
>
> --
> Learn what the world is really like,
> but never forget what it should be.


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted
       [not found]                         ` <85c67ae9-bfe2-390a-79d0-6e5872b9be62-5C7GfCeVMHo@public.gmane.org>
  2017-10-10  7:12                           ` Liu, Monk
@ 2017-10-10  7:19                           ` Liu, Monk
  2017-10-10  7:47                           ` Michel Dänzer
  2 siblings, 0 replies; 49+ messages in thread
From: Liu, Monk @ 2017-10-10  7:19 UTC (permalink / raw)
  To: Koenig, Christian, Nicolai Hähnle,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Daenzer, Michel

Think of the worst case: after VRAM lost, all video memory is garbage, and in this situation every client that opened our dri FD
is useless. Can you illustrate how to handle such a scenario without re-opening the dri FD?

BR 

-----Original Message-----
From: Koenig, Christian 
Sent: October 10, 2017 14:59
To: Liu, Monk <Monk.Liu@amd.com>; Nicolai Hähnle <nhaehnle@gmail.com>; amd-gfx@lists.freedesktop.org; Daenzer, Michel <Michel.Daenzer@amd.com>
Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted

As Nicolai explained that approach simply won't work.

The fd is used by more than just the closed source Vulkan driver and I think even by some components not developed by AMD (common X code? 
Michel please comment as well).

So closing it and reopening it to handle a GPU reset is simply not an option.

Regards,
Christian.

On 10.10.2017 at 06:26, Liu, Monk wrote:
> After VRAM lost happens, all clients no matter radv/mesa/ogl is 
> useless,
>
> Any drivers uses this FD should be denied by KMD after VRAM lost, and 
> UMD can destroy/close this FD and re-open it and rebuild all resources
>
> That's the only option for VRAM lost case
>
>
>
> -----Original Message-----
> From: Nicolai Hähnle [mailto:nhaehnle@gmail.com]
> Sent: October 9, 2017 19:01
> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian 
> <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu 
> reseted
>
> On 09.10.2017 10:35, Liu, Monk wrote:
>> Please be aware that this policy is what the strict mode defined and 
>> what customer want, And also please check VK spec, it defines that 
>> after GPU reset all vk INSTANCE should close/release its 
>> resource/device/ctx and all buffers, and call re-initvkinstance after 
>> gpu reset
> Sorry, but you simply cannot implement a correct user-space implementation of those specs on top of this.
>
> It will break as soon as you have both OpenGL and Vulkan running in the same process (or heck, our Vulkan and radv :)), because both drivers will use the same fd.
>
> Cheers,
> Nicolai
>
>
>
>> So this whole approach is what just aligned with the spec, and to not 
>> influence with current MESA/OGL client that's why I put the whole 
>> approach into the strict mode And by default strict mode is not 
>> selected
>>
>>
>> BR Monk
>>
>> -----Original Message-----
>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>> Sent: October 9, 2017 16:26
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu 
>> reseted
>>
>> On 30.09.2017 at 08:03, Monk Liu wrote:
>>> for SRIOV strict mode gpu reset:
>>>
>>> In kms open we mark the latest adev->gpu_reset_counter in fpriv we 
>>> return -ENODEV in cs_ioctl or info_ioctl if they found
>>> fpriv->gpu_reset_counter != adev->gpu_reset_counter.
>>>
>>> this way we prevent a potential bad process/FD from submitting cmds 
>>> and notify userspace with -ENODEV.
>>>
>>> userspace should close all BO/ctx and re-open dri FD to re-create 
>>> virtual memory system for this process
>> The whole approach is a NAK from my side.
>>
>> We need to enable userspace to continue, not force it into process termination to recover. Otherwise we could send a SIGTERM in the first place.
>>
>> Regards,
>> Christian.
>>
>>> Change-Id: Ib4c179f28a3d0783837566f29de07fc14aa9b9a4
>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>> [snip]
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
>
> --
> Learn what the world is really like,
> but never forget what it should be.


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted
       [not found]                             ` <BLUPR12MB04497B5442F66C969861742584750-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-10-10  7:25                               ` Christian König
       [not found]                                 ` <f06b80fa-fc96-a93c-59b7-2460dba95e94-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 49+ messages in thread
From: Christian König @ 2017-10-10  7:25 UTC (permalink / raw)
  To: Liu, Monk, Nicolai Hähnle,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Daenzer, Michel

As Nicolai described before:

The kernel will reject all command submissions from contexts which were
created before the VRAM loss happened.
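
In the CS ioctl that is just a counter check, roughly like this (sketch
only; it assumes a per-context vram_lost_counter analogous to the
per-fpriv one in this series, the final code may differ):

	/* A context created before the last VRAM loss gets -ENODEV. */
	static int amdgpu_ctx_vram_lost_check(struct amdgpu_device *adev,
					      struct amdgpu_ctx *ctx)
	{
		if (ctx->vram_lost_counter !=
		    atomic_read(&adev->vram_lost_counter))
			return -ENODEV;

		return 0;
	}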

We expose the VRAM lost counter to userspace. When Mesa sees that a
command submission is rejected, it will query the VRAM lost counter and
declare all resources which were in VRAM at this moment as invalidated.
E.g. shader binaries, texture descriptors etc. will be re-uploaded the
next time they are used.

The application needs to recreate its GL context, just in the same way
as it would if we found this context guilty of causing a reset.

You should be able to handle this the same way in Vulkan, and I think we
can expose the GPU reset counter to userspace as well. This way you can
implement the strict mode in userspace and don't need to affect all
applications with it. In other words, the effect will be limited to the
Vulkan stack.
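
On the userspace side that boils down to something like the following
(sketch; it assumes a query id like AMDGPU_INFO_VRAM_LOST_COUNTER is
wired up, which is what this series would have to add):

	#include <stdbool.h>
	#include <stdint.h>
	#include <string.h>
	#include <xf86drm.h>
	#include <amdgpu_drm.h>

	/* Returns true when VRAM was lost since the last check, i.e. all
	 * cached GPU resources (shaders, descriptors, ...) must be
	 * re-uploaded before they are used again. */
	static bool vram_was_lost(int fd, uint32_t *cached)
	{
		struct drm_amdgpu_info request;
		uint32_t counter = 0;

		memset(&request, 0, sizeof(request));
		request.return_pointer = (uintptr_t)&counter;
		request.return_size = sizeof(counter);
		request.query = AMDGPU_INFO_VRAM_LOST_COUNTER;	/* assumed query id */

		if (drmCommandWrite(fd, DRM_AMDGPU_INFO, &request,
				    sizeof(request)))
			return false;	/* query not supported */

		if (counter == *cached)
			return false;

		*cached = counter;
		return true;
	}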

Regards,
Christian.

On 10.10.2017 at 09:12, Liu, Monk wrote:
> Then the question is how we treat recovery if VRAM lost ?
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: October 10, 2017 14:59
> To: Liu, Monk <Monk.Liu@amd.com>; Nicolai Hähnle <nhaehnle@gmail.com>; amd-gfx@lists.freedesktop.org; Daenzer, Michel <Michel.Daenzer@amd.com>
> Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted
>
> As Nicolai explained that approach simply won't work.
>
> The fd is used by more than just the closed source Vulkan driver and I think even by some components not developed by AMD (common X code?
> Michel please comment as well).
>
> So closing it and reopening it to handle a GPU reset is simply not an option.
>
> Regards,
> Christian.
>
> On 10.10.2017 at 06:26, Liu, Monk wrote:
>> After VRAM lost happens, all clients no matter radv/mesa/ogl is
>> useless,
>>
>> Any drivers uses this FD should be denied by KMD after VRAM lost, and
>> UMD can destroy/close this FD and re-open it and rebuild all resources
>>
>> That's the only option for VRAM lost case
>>
>>
>>
>> -----Original Message-----
>> From: Nicolai Hähnle [mailto:nhaehnle@gmail.com]
>> Sent: October 9, 2017 19:01
>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian
>> <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu
>> reseted
>>
>> On 09.10.2017 10:35, Liu, Monk wrote:
>>> Please be aware that this policy is what the strict mode defined and
>>> what customer want, And also please check VK spec, it defines that
>>> after GPU reset all vk INSTANCE should close/release its
>>> resource/device/ctx and all buffers, and call re-initvkinstance after
>>> gpu reset
>> Sorry, but you simply cannot implement a correct user-space implementation of those specs on top of this.
>>
>> It will break as soon as you have both OpenGL and Vulkan running in the same process (or heck, our Vulkan and radv :)), because both drivers will use the same fd.
>>
>> Cheers,
>> Nicolai
>>
>>
>>
>>> So this whole approach is what just aligned with the spec, and to not
>>> influence with current MESA/OGL client that's why I put the whole
>>> approach into the strict mode And by default strict mode is not
>>> selected
>>>
>>>
>>> BR Monk
>>>
>>> -----Original Message-----
>>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>>> Sent: October 9, 2017 16:26
>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>> Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu
>>> reseted
>>>
>>> On 30.09.2017 at 08:03, Monk Liu wrote:
>>>> for SRIOV strict mode gpu reset:
>>>>
>>>> In kms open we mark the latest adev->gpu_reset_counter in fpriv we
>>>> return -ENODEV in cs_ioctl or info_ioctl if they found
>>>> fpriv->gpu_reset_counter != adev->gpu_reset_counter.
>>>>
>>>> this way we prevent a potential bad process/FD from submitting cmds
>>>> and notify userspace with -ENODEV.
>>>>
>>>> userspace should close all BO/ctx and re-open dri FD to re-create
>>>> virtual memory system for this process
>>> The whole approach is a NAK from my side.
>>>
>>> We need to enable userspace to continue, not force it into process termination to recover. Otherwise we could send a SIGTERM in the first place.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> Change-Id: Ib4c179f28a3d0783837566f29de07fc14aa9b9a4
>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>> [snip]
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>
>> --
>> Learn what the world is really like,
>> but never forget what it should be.
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted
       [not found]                         ` <85c67ae9-bfe2-390a-79d0-6e5872b9be62-5C7GfCeVMHo@public.gmane.org>
  2017-10-10  7:12                           ` Liu, Monk
  2017-10-10  7:19                           ` Liu, Monk
@ 2017-10-10  7:47                           ` Michel Dänzer
       [not found]                             ` <0c91bb14-a874-9ee6-8756-2a31eb41d5b2-otUistvHUpPR7s880joybQ@public.gmane.org>
  2 siblings, 1 reply; 49+ messages in thread
From: Michel Dänzer @ 2017-10-10  7:47 UTC (permalink / raw)
  To: Christian König, Liu, Monk, Nicolai Hähnle
  Cc: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 10/10/17 08:58 AM, Christian König wrote:
> As Nicolai explained that approach simply won't work.
> 
> The fd is used by more than just the closed source Vulkan driver and I
> think even by some components not developed by AMD (common X code?
> Michel please comment as well).

The only thing that comes to mind possibly matching your description is
the generic modesetting Xorg driver, but that only calls KMS API ioctls.
Command submission and memory management ioctls should only be called by
hardware-specific userspace code.


-- 
Earthling Michel Dänzer               |               http://www.amd.com
Libre software enthusiast             |             Mesa and X developer
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted
       [not found]                             ` <0c91bb14-a874-9ee6-8756-2a31eb41d5b2-otUistvHUpPR7s880joybQ@public.gmane.org>
@ 2017-10-10  7:57                               ` Christian König
       [not found]                                 ` <36f5b680-c881-3b4f-0784-3cd624064004-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 49+ messages in thread
From: Christian König @ 2017-10-10  7:57 UTC (permalink / raw)
  To: Michel Dänzer, Christian König, Liu, Monk, Nicolai Hähnle
  Cc: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 10.10.2017 at 09:47, Michel Dänzer wrote:
> On 10/10/17 08:58 AM, Christian König wrote:
>> As Nicolai explained that approach simply won't work.
>>
>> The fd is used by more than just the closed source Vulkan driver and I
>> think even by some components not developed by AMD (common X code?
>> Michel please comment as well).
> The only thing that comes to mind possibly matching your description is
> the generic modesetting Xorg driver, but that only calls KMS API ioctls.
> Command submission and memory management ioctls should only be called by
> hardware specific userspace code.

Yeah, that was on my mind as well.

But does that use the same fd as the GL driver used for Glamor?

If yes, we can't just call close() on that fd.

Christian.

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted
       [not found]                                 ` <36f5b680-c881-3b4f-0784-3cd624064004-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-10-10  8:17                                   ` Michel Dänzer
  0 siblings, 0 replies; 49+ messages in thread
From: Michel Dänzer @ 2017-10-10  8:17 UTC (permalink / raw)
  To: christian.koenig-5C7GfCeVMHo, Liu, Monk, Nicolai Hähnle
  Cc: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 10/10/17 09:57 AM, Christian König wrote:
> Am 10.10.2017 um 09:47 schrieb Michel Dänzer:
>> On 10/10/17 08:58 AM, Christian König wrote:
>>> As Nicolai explained that approach simply won't work.
>>>
>>> The fd is used by more than just the closed source Vulkan driver and I
>>> think even by some components not developed by AMD (common X code?
>>> Michel please comment as well).
>> The only thing that comes to mind possibly matching your description is
>> the generic modesetting Xorg driver, but that only calls KMS API ioctls.
>> Command submission and memory management ioctls should only be called by
>> hardware specific userspace code.
> 
> Yeah, that was in my mind as well.
> 
> But does that use the same fd as the GL driver used for Glamor?

Yes, it does.


> If yes we can't just call close() on that fd.

True, that's a good point.


-- 
Earthling Michel Dänzer               |               http://www.amd.com
Libre software enthusiast             |             Mesa and X developer
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted
       [not found]                                 ` <f06b80fa-fc96-a93c-59b7-2460dba95e94-5C7GfCeVMHo@public.gmane.org>
@ 2017-10-10  8:21                                   ` Liu, Monk
       [not found]                                     ` <BLUPR12MB0449B68E81C778A9D07FB38584750-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 49+ messages in thread
From: Liu, Monk @ 2017-10-10  8:21 UTC (permalink / raw)
  To: Koenig, Christian, Nicolai Hähnle,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Daenzer, Michel

> As Nicolai described before:
>
> The kernel will reject all command submissions from contexts which were created before the VRAM loss happened.

ML: this is similar to what my strict mode reset does ☹, except that my logic checks whether the FD was opened before the gpu reset, while Nicolai checks whether the context was created before the VRAM loss.
But comparing the context creation timing with the VRAM lost timing is not accurate enough, like I said before:
even if a context was created after the VRAM loss, that doesn't mean you can allow this context to submit jobs, because some BO in the BO_LIST passed in with this context may have been modified before the GPU reset/VRAM loss.
So the safest way is comparing the timing of the FD opening.

> We expose the VRAM lost counter to userspace. When Mesa sees that a command submission is rejected, it will query the VRAM lost counter and declare all resources which were in VRAM at this moment as invalidated.
> E.g. shader binaries, texture descriptors etc. will be re-uploaded the next time they are used.
>
> The application needs to recreate its GL context, just in the same way as it would if we found this context guilty of causing a reset.


-----Original Message-----
From: Koenig, Christian 
Sent: October 10, 2017 15:26
To: Liu, Monk <Monk.Liu@amd.com>; Nicolai Hähnle <nhaehnle@gmail.com>; amd-gfx@lists.freedesktop.org; Daenzer, Michel <Michel.Daenzer@amd.com>
Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted

As Nicolai described before:

The kernel will reject all command submissions from contexts which were created before the VRAM loss happened.

We expose the VRAM lost counter to userspace. When Mesa sees that a command submission is rejected, it will query the VRAM lost counter and declare all resources which were in VRAM at this moment as invalidated.
E.g. shader binaries, texture descriptors etc. will be re-uploaded the next time they are used.

The application needs to recreate its GL context, just in the same way as it would if we found this context guilty of causing a reset.

You should be able to handle this the same way in Vulkan, and I think we can expose the GPU reset counter to userspace as well. This way you can implement the strict mode in userspace and don't need to affect all applications with it. In other words, the effect will be limited to the Vulkan stack.

Regards,
Christian.

On 10.10.2017 at 09:12, Liu, Monk wrote:
> Then the question is how we treat recovery if VRAM lost ?
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: October 10, 2017 14:59
> To: Liu, Monk <Monk.Liu@amd.com>; Nicolai Hähnle <nhaehnle@gmail.com>; 
> amd-gfx@lists.freedesktop.org; Daenzer, Michel 
> <Michel.Daenzer@amd.com>
> Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu 
> reseted
>
> As Nicolai explained that approach simply won't work.
>
> The fd is used by more than just the closed source Vulkan driver and I think even by some components not developed by AMD (common X code?
> Michel please comment as well).
>
> So closing it and reopening it to handle a GPU reset is simply not an option.
>
> Regards,
> Christian.
>
> On 10.10.2017 at 06:26, Liu, Monk wrote:
>> After VRAM lost happens, all clients no matter radv/mesa/ogl is 
>> useless,
>>
>> Any drivers uses this FD should be denied by KMD after VRAM lost, and 
>> UMD can destroy/close this FD and re-open it and rebuild all 
>> resources
>>
>> That's the only option for VRAM lost case
>>
>>
>>
>> -----Original Message-----
>> From: Nicolai Hähnle [mailto:nhaehnle@gmail.com]
>> Sent: October 9, 2017 19:01
>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian 
>> <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu 
>> reseted
>>
>> On 09.10.2017 10:35, Liu, Monk wrote:
>>> Please be aware that this policy is what the strict mode defined and 
>>> what customer want, And also please check VK spec, it defines that 
>>> after GPU reset all vk INSTANCE should close/release its 
>>> resource/device/ctx and all buffers, and call re-initvkinstance 
>>> after gpu reset
>> Sorry, but you simply cannot implement a correct user-space implementation of those specs on top of this.
>>
>> It will break as soon as you have both OpenGL and Vulkan running in the same process (or heck, our Vulkan and radv :)), because both drivers will use the same fd.
>>
>> Cheers,
>> Nicolai
>>
>>
>>
>>> So this whole approach is what just aligned with the spec, and to 
>>> not influence with current MESA/OGL client that's why I put the 
>>> whole approach into the strict mode And by default strict mode is 
>>> not selected
>>>
>>>
>>> BR Monk
>>>
>>> -----Original Message-----
>>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>>> Sent: October 9, 2017 16:26
>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>> Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu 
>>> reseted
>>>
>>> On 30.09.2017 at 08:03, Monk Liu wrote:
>>>> for SRIOV strict mode gpu reset:
>>>>
>>>> In kms open we mark the latest adev->gpu_reset_counter in fpriv we 
>>>> return -ENODEV in cs_ioctl or info_ioctl if they found
>>>> fpriv->gpu_reset_counter != adev->gpu_reset_counter.
>>>>
>>>> this way we prevent a potential bad process/FD from submitting cmds 
>>>> and notify userspace with -ENODEV.
>>>>
>>>> userspace should close all BO/ctx and re-open dri FD to re-create 
>>>> virtual memory system for this process
>>> The whole approach is a NAK from my side.
>>>
>>> We need to enable userspace to continue, not force it into process termination to recover. Otherwise we could send a SIGTERM in the first place.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> Change-Id: Ib4c179f28a3d0783837566f29de07fc14aa9b9a4
>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>> ---
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu.h     | 1 +
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 5 +++++
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 7 +++++++
>>>>      3 files changed, 13 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> index de9c164..b40d4ba 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> @@ -772,6 +772,7 @@ struct amdgpu_fpriv {
>>>>      	struct idr		bo_list_handles;
>>>>      	struct amdgpu_ctx_mgr	ctx_mgr;
>>>>      	u32			vram_lost_counter;
>>>> +	int gpu_reset_counter;
>>>>      };
>>>>      
>>>>      /*
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>> index 9467cf6..6a1515e 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>> @@ -1199,6 +1199,11 @@ int amdgpu_cs_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
>>>>      	if (amdgpu_kms_vram_lost(adev, fpriv))
>>>>      		return -ENODEV;
>>>>      
>>>> +	if (amdgpu_sriov_vf(adev) &&
>>>> +		amdgpu_sriov_reset_level == 1 &&
>>>> +		fpriv->gpu_reset_counter < atomic_read(&adev->gpu_reset_counter))
>>>> +		return -ENODEV;
>>>> +
>>>>      	parser.adev = adev;
>>>>      	parser.filp = filp;
>>>>      
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>> index 282f45b..bd389cf 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>> @@ -285,6 +285,11 @@ static int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file
>>>>      	if (amdgpu_kms_vram_lost(adev, fpriv))
>>>>      		return -ENODEV;
>>>>      
>>>> +	if (amdgpu_sriov_vf(adev) &&
>>>> +		amdgpu_sriov_reset_level == 1 &&
>>>> +		fpriv->gpu_reset_counter < atomic_read(&adev->gpu_reset_counter))
>>>> +		return -ENODEV;
>>>> +
>>>>      	switch (info->query) {
>>>>      	case AMDGPU_INFO_ACCEL_WORKING:
>>>>      		ui32 = adev->accel_working;
>>>> @@ -824,6 +829,8 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv)
>>>>      		goto out_suspend;
>>>>      	}
>>>>      
>>>> +	fpriv->gpu_reset_counter = atomic_read(&adev->gpu_reset_counter);
>>>> +
>>>>      	r = amdgpu_vm_init(adev, &fpriv->vm,
>>>>      			   AMDGPU_VM_CONTEXT_GFX, 0);
>>>>      	if (r) {
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>
>> --
>> Learn how the world really is,
>> but never forget how it ought to be.
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted
       [not found]                                     ` <BLUPR12MB0449B68E81C778A9D07FB38584750-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-10-10  8:57                                       ` Nicolai Hähnle
  0 siblings, 0 replies; 49+ messages in thread
From: Nicolai Hähnle @ 2017-10-10  8:57 UTC (permalink / raw)
  To: Liu, Monk, Koenig, Christian,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Daenzer, Michel

On 10.10.2017 10:21, Liu, Monk wrote:
>> As Nicolai described before:
> 
> The kernel will reject all command submissions from contexts which were created before the VRAM loss happened.
> 
> ML: this is similar to what my strict mode reset does ☹, except that my logic checks whether the FD was opened before the gpu reset, while Nicolai checks whether the context was created before the VRAM loss.
> But comparing context creation timing with VRAM loss timing is not accurate enough, like I said before:
> Even if a context was created after the VRAM loss, that doesn't mean you can allow this context to submit jobs, because some BOs in the BO_LIST passed in with this context may have been modified before the GPU reset/VRAM loss.
> So the safest way is comparing the timing of FD opening
> 
> We expose the VRAM lost counter to userspace. When Mesa sees that a command submission is rejected, it will query the VRAM lost counter and declare all resources which were in VRAM at that moment as invalidated.
> E.g. shader binaries, texture descriptors etc. will be reuploaded when they are used the next time.
> 
> The application needs to recreate its GL context, just as it would if we found this context guilty of causing a reset.

Yes, for most applications. But this is *entirely* unrelated to 
re-opening the FD.

With OpenGL robustness contexts, what happens is that

1. Driver & application detect "context lost"
2. Application destroys the OpenGL context
--> driver destroys the kernel context and all associated buffer objects
3. Application creates a new OpenGL context
--> driver creates a new kernel context and new buffer objects

In this sequence, the content of buffer objects created before the VRAM 
loss is irrelevant because they will all be destroyed, and new command 
submissions will only use "fresh" buffer objects.

(Or, in the case of Mesa, it's actually possible that we re-use buffer 
objects from before the VRAM loss due to the caching we do for 
performance, but the contents of those buffer objects will have been 
completely re-initialized.)
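
A minimal sketch of the three-step sequence above using the GL_ARB_robustness query; destroy_gl_context()/create_gl_context() stand in for application-specific teardown and re-init and are hypothetical names:

#include <GL/gl.h>
#include <GL/glext.h>

/* Resolved via glXGetProcAddress()/eglGetProcAddress() at startup. */
extern PFNGLGETGRAPHICSRESETSTATUSARBPROC glGetGraphicsResetStatusARB;

extern void destroy_gl_context(void);	/* hypothetical app teardown */
extern void create_gl_context(void);	/* hypothetical app re-init */

static void handle_possible_reset(void)
{
	/* 1. detect "context lost" (needs a robustness-enabled context) */
	if (glGetGraphicsResetStatusARB() == GL_NO_ERROR)
		return;

	/* 2. destroy the GL context; underneath, the driver destroys the
	 *    kernel context and all associated buffer objects */
	destroy_gl_context();

	/* 3. create a new GL context; the driver creates a new kernel
	 *    context and fresh buffer objects */
	create_gl_context();
}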

The FD simply isn't the right unit of granularity to track this.

Cheers,
Nicolai
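
For contrast with the per-fd check in the patch quoted below, a rough sketch of per-context tracking, assuming a vram_lost_counter snapshot is stored in amdgpu_ctx at creation time (field and helper names are assumptions, not the actual patch):

struct amdgpu_ctx {
	/* ... existing members ... */
	u32 vram_lost_counter;	/* snapshot taken at context creation */
};

static bool amdgpu_ctx_vram_lost(struct amdgpu_device *adev,
				 struct amdgpu_ctx *ctx)
{
	/* Only submissions from contexts created before the loss are
	 * rejected; contexts created afterwards keep working. */
	return ctx->vram_lost_counter !=
	       atomic_read(&adev->vram_lost_counter);
}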


> 
> 
> -----Original Message-----
> From: Koenig, Christian
> Sent: October 10, 2017 15:26
> To: Liu, Monk <Monk.Liu@amd.com>; Nicolai Hähnle <nhaehnle@gmail.com>; amd-gfx@lists.freedesktop.org; Daenzer, Michel <Michel.Daenzer@amd.com>
> Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted
> 
> As Nicolai described before:
> 
> The kernel will reject all command submissions from contexts which were created before the VRAM loss happened.
> 
> We expose the VRAM lost counter to userspace. When Mesa sees that a command submission is rejected, it will query the VRAM lost counter and declare all resources which were in VRAM at that moment as invalidated.
> E.g. shader binaries, texture descriptors etc. will be reuploaded when they are used the next time.
> 
> The application needs to recreate its GL context, just as it would if we found this context guilty of causing a reset.
> 
> You should be able to handle this the same way in Vulkan, and I think we can expose the GPU reset counter to userspace as well. This way you can implement strict mode in userspace and don't need to affect all applications with it. In other words, the effect will be limited to the Vulkan stack.
> 
> Regards,
> Christian.
> 
> On 10.10.2017 09:12, Liu, Monk wrote:
>> Then the question is how we handle recovery if VRAM is lost?
>>
>> -----Original Message-----
>> From: Koenig, Christian
>> Sent: October 10, 2017 14:59
>> To: Liu, Monk <Monk.Liu@amd.com>; Nicolai Hähnle <nhaehnle@gmail.com>;
>> amd-gfx@lists.freedesktop.org; Daenzer, Michel
>> <Michel.Daenzer@amd.com>
>> Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu
>> reseted
>>
>> As Nicolai explained, that approach simply won't work.
>>
>> The fd is used by more than just the closed-source Vulkan driver, and I think even by some components not developed by AMD (common X code?
>> Michel, please comment as well).
>>
>> So closing it and reopening it to handle a GPU reset is simply not an option.
>>
>> Regards,
>> Christian.
>>
>> On 10.10.2017 06:26, Liu, Monk wrote:
>>> After VRAM loss happens, all clients, no matter radv/Mesa/OGL, are
>>> useless.
>>>
>>> Any driver using this FD should be denied by the KMD after VRAM loss,
>>> and the UMD can destroy/close this FD, re-open it, and rebuild all
>>> resources.
>>>
>>> That's the only option for the VRAM lost case
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Nicolai Hähnle [mailto:nhaehnle@gmail.com]
>>> Sent: October 9, 2017 19:01
>>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian
>>> <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
>>> Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu
>>> reseted
>>>
>>> On 09.10.2017 10:35, Liu, Monk wrote:
>>>> Please be aware that this policy is what the strict mode defines and
>>>> what the customer wants. Also please check the VK spec: it defines
>>>> that after a GPU reset every VkInstance should close/release its
>>>> resources/device/ctx and all buffers, and re-init the VkInstance
>>>> after the gpu reset
>>> Sorry, but you simply cannot build a correct user-space implementation of those specs on top of this.
>>>
>>> It will break as soon as you have both OpenGL and Vulkan running in the same process (or heck, our Vulkan and radv :)), because both drivers will use the same fd.
>>>
>>> Cheers,
>>> Nicolai
>>>
>>>
>>>
>>>> So this whole approach is just aligned with the spec, and to avoid
>>>> influencing current Mesa/OGL clients I put the whole approach into
>>>> strict mode. And by default strict mode is not selected
>>>>
>>>>
>>>> BR Monk
>>>>
>>>> -----Original Message-----
>>>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>>>> Sent: October 9, 2017 16:26
>>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>>> Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu
>>>> reseted
>>>>
>>>> On 30.09.2017 08:03, Monk Liu wrote:
>>>>> for SRIOV strict mode gpu reset:
>>>>>
>>>>> In kms open we record the latest adev->gpu_reset_counter in fpriv;
>>>>> we return -ENODEV in cs_ioctl or info_ioctl if they find
>>>>> fpriv->gpu_reset_counter != adev->gpu_reset_counter.
>>>>>
>>>>> This way we prevent a potentially bad process/FD from submitting
>>>>> cmds and notify userspace with -ENODEV.
>>>>>
>>>>> Userspace should close all BOs/ctxs and re-open the dri FD to
>>>>> re-create the virtual memory system for this process
>>>> The whole approach is a NAK from my side.
>>>>
>>>> We need to enable userspace to continue, not force it into process termination to recover. Otherwise we could send a SIGTERM in the first place.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Change-Id: Ib4c179f28a3d0783837566f29de07fc14aa9b9a4
>>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>>> ---
>>>>>       drivers/gpu/drm/amd/amdgpu/amdgpu.h     | 1 +
>>>>>       drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 5 +++++
>>>>>       drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 7 +++++++
>>>>>       3 files changed, 13 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> index de9c164..b40d4ba 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> @@ -772,6 +772,7 @@ struct amdgpu_fpriv {
>>>>>       	struct idr		bo_list_handles;
>>>>>       	struct amdgpu_ctx_mgr	ctx_mgr;
>>>>>       	u32			vram_lost_counter;
>>>>> +	int gpu_reset_counter;
>>>>>       };
>>>>>       
>>>>>       /*
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>>> index 9467cf6..6a1515e 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>>> @@ -1199,6 +1199,11 @@ int amdgpu_cs_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
>>>>>       	if (amdgpu_kms_vram_lost(adev, fpriv))
>>>>>       		return -ENODEV;
>>>>>       
>>>>> +	if (amdgpu_sriov_vf(adev) &&
>>>>> +		amdgpu_sriov_reset_level == 1 &&
>>>>> +		fpriv->gpu_reset_counter < atomic_read(&adev->gpu_reset_counter))
>>>>> +		return -ENODEV;
>>>>> +
>>>>>       	parser.adev = adev;
>>>>>       	parser.filp = filp;
>>>>>       
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>>> index 282f45b..bd389cf 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>>> @@ -285,6 +285,11 @@ static int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file
>>>>>       	if (amdgpu_kms_vram_lost(adev, fpriv))
>>>>>       		return -ENODEV;
>>>>>       
>>>>> +	if (amdgpu_sriov_vf(adev) &&
>>>>> +		amdgpu_sriov_reset_level == 1 &&
>>>>> +		fpriv->gpu_reset_counter < atomic_read(&adev->gpu_reset_counter))
>>>>> +		return -ENODEV;
>>>>> +
>>>>>       	switch (info->query) {
>>>>>       	case AMDGPU_INFO_ACCEL_WORKING:
>>>>>       		ui32 = adev->accel_working;
>>>>> @@ -824,6 +829,8 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv)
>>>>>       		goto out_suspend;
>>>>>       	}
>>>>>       
>>>>> +	fpriv->gpu_reset_counter = atomic_read(&adev->gpu_reset_counter);
>>>>> +
>>>>>       	r = amdgpu_vm_init(adev, &fpriv->vm,
>>>>>       			   AMDGPU_VM_CONTEXT_GFX, 0);
>>>>>       	if (r) {
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx@lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>
>>> --
>>> Learn how the world really is,
>>> but never forget how it ought to be.
>>
> 


-- 
Learn how the world really is,
but never forget how it ought to be.
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread (newest message: 2017-10-10  8:57 UTC)

Thread overview: 49+ messages
2017-09-30  6:03 [PATCH 00/12] *** SRIOV GPU RESET PATCHES *** Monk Liu
     [not found] ` <1506751432-21789-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2017-09-30  6:03   ` [PATCH 01/12] drm/amdgpu/sriov:now must reinit psp Monk Liu
2017-09-30  6:03   ` [PATCH 02/12] drm/amdgpu/sriov:fix memory leak in psp_load_fw Monk Liu
2017-09-30  6:03   ` [PATCH 03/12] drm/amdgpu/sriov:use atomic type for sriov_reset Monk Liu
2017-09-30  6:03   ` [PATCH 04/12] drm/amdgpu/sriov:cleanup gpu rest mlock Monk Liu
2017-09-30  6:03   ` [PATCH 05/12] drm/amdgpu/sriov:accurate description for sriov_gpu_reset Monk Liu
2017-09-30  6:03   ` [PATCH 06/12] drm/amdgpu/sriov:handle more jobs hang in different ring case Monk Liu
     [not found]     ` <1506751432-21789-7-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2017-10-09  8:18       ` Christian König
2017-09-30  6:03   ` [PATCH 07/12] drm/amdgpu/sriov:implement strict gpu reset Monk Liu
     [not found]     ` <1506751432-21789-8-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2017-10-09  8:20       ` Christian König
     [not found]         ` <250ce10a-cca0-0193-b2ed-cc2f04e80d0c-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-10-09  8:30           ` Liu, Monk
2017-10-09 10:58       ` Nicolai Hähnle
2017-09-30  6:03   ` [PATCH 08/12] drm/amdgpu:explicitly call fence_process Monk Liu
     [not found]     ` <1506751432-21789-9-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2017-10-09  8:23       ` Christian König
     [not found]         ` <5cb1ae43-ec3a-2b0b-b78b-91cefd575672-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-10-09  8:32           ` Liu, Monk
     [not found]             ` <BLUPR12MB04491DDBC8ACFE2FB43D0F0084740-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-09  8:40               ` Christian König
     [not found]                 ` <62bb9496-b29f-0230-8fa4-0bad470c12c8-5C7GfCeVMHo@public.gmane.org>
2017-10-09  8:51                   ` Liu, Monk
     [not found]                     ` <BLUPR12MB0449E49C10230F350B9BD3B284740-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-09  8:52                       ` Liu, Monk
     [not found]                         ` <BLUPR12MB04495DD27084790E5B219D7384740-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-09  8:58                           ` Christian König
2017-09-30  6:03   ` [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted Monk Liu
     [not found]     ` <1506751432-21789-10-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2017-10-09  8:25       ` Christian König
     [not found]         ` <6e81d8b0-267a-1ea8-b228-93286fc6a954-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-10-09  8:35           ` Liu, Monk
     [not found]             ` <BLUPR12MB0449531313F50BE080F7746D84740-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-09  8:54               ` Christian König
2017-10-09 11:01               ` Nicolai Hähnle
     [not found]                 ` <71b411c8-21a6-fe9b-ed33-7928571a88da-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-10-10  4:26                   ` Liu, Monk
     [not found]                     ` <BLUPR12MB04492B28DF57EACE2149562D84750-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-10  6:58                       ` Christian König
     [not found]                         ` <85c67ae9-bfe2-390a-79d0-6e5872b9be62-5C7GfCeVMHo@public.gmane.org>
2017-10-10  7:12                           ` Liu, Monk
     [not found]                             ` <BLUPR12MB04497B5442F66C969861742584750-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-10  7:25                               ` Christian König
     [not found]                                 ` <f06b80fa-fc96-a93c-59b7-2460dba95e94-5C7GfCeVMHo@public.gmane.org>
2017-10-10  8:21                                   ` Liu, Monk
     [not found]                                     ` <BLUPR12MB0449B68E81C778A9D07FB38584750-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-10  8:57                                       ` Nicolai Hähnle
2017-10-10  7:19                           ` Liu, Monk
2017-10-10  7:47                           ` Michel Dänzer
     [not found]                             ` <0c91bb14-a874-9ee6-8756-2a31eb41d5b2-otUistvHUpPR7s880joybQ@public.gmane.org>
2017-10-10  7:57                               ` Christian König
     [not found]                                 ` <36f5b680-c881-3b4f-0784-3cd624064004-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-10-10  8:17                                   ` Michel Dänzer
2017-09-30  6:03   ` [PATCH 10/12] drm/amdgpu/sriov:implement guilty ctx for loose reset Monk Liu
     [not found]     ` <1506751432-21789-11-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2017-10-09  8:27       ` Christian König
     [not found]         ` <e4c96014-b4f4-e013-a966-9e2e03b9a62b-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-10-09  8:39           ` Liu, Monk
     [not found]             ` <BLUPR12MB0449C8E878F09AE59BA816E284740-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-09  9:03               ` Christian König
     [not found]                 ` <d249cc75-29e3-713f-fc5a-2f26f555500b-5C7GfCeVMHo@public.gmane.org>
2017-10-09  9:14                   ` Liu, Monk
     [not found]                     ` <BLUPR12MB04498EE183C86C2B93DDA85484740-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-09  9:24                       ` Christian König
2017-09-30  6:03   ` [PATCH 11/12] drm/amdgpu/sriov:show error if ib test failed Monk Liu
     [not found]     ` <1506751432-21789-12-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2017-10-09  8:29       ` Christian König
2017-09-30  6:03   ` [PATCH 12/12] drm/amdgpu/sriov:no shadow buffer recovery Monk Liu
     [not found]     ` <1506751432-21789-13-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2017-10-01  9:32       ` Christian König
2017-10-01  9:36       ` Christian König
     [not found]         ` <e767c6f2-4050-c697-2075-c3d744e6b379-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-10-04  9:41           ` Liu, Monk
     [not found]             ` <BLUPR12MB0449346A746E70A7BE88FEA084730-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-04 10:56               ` Christian König
     [not found]                 ` <9b08e030-1a47-39ef-8010-64c51d4560e8-5C7GfCeVMHo@public.gmane.org>
2017-10-09  4:12                   ` Liu, Monk
2017-10-01  9:31   ` [PATCH 00/12] *** SRIOV GPU RESET PATCHES *** Christian König
